The study cohort and clinical chemistry
A total of 101 individuals were recruited from the SCAPIS study [18], including 48 males and 53 females between 50 and 65 years of age (Fig. 1a). Among them, 92 (91%) individuals were of European descent, while a few were of South American or Asian origin. Extensive phenotype characterization of the subjects was conducted before the study to establish the inclusion and exclusion criteria for the definition of “healthy” subjects. The sample collection in combination with clinical chemistry analysis of 30 parameters and as well as anthropometric measurements was conducted every 3 months in the first year and at approximately a 6-month interval in the second year (Fig. 1b). The complete list of assessed clinical variables is available in Additional file 1: Table S1. Among the 101 subjects, 94 completed the full 2-year study including six visits.
Whole-genome sequencing
DNA from whole blood of each individual was isolated at the first visit and the whole genome was determined using next-generation sequencing. All 101 individuals passed the quality control. In total, 7.3 million variants were identified with a general genotyping rate of 99.93%. A MDS analysis was performed based on the genome-wide IBS pairwise distances of the total set of variants from the 101 individuals (Fig. 1c). Distinct subsets of individuals revealed the relationship of geographic origin of the parents.
Plasma protein profiling
The protein levels of plasma samples from the subjects were analyzed using PEA as described previously [24]. All samples were analyzed with eleven panels as outlined in Fig. 1b covering plasma proteins of interest for cardiovascular and neurological disease, inflammation, cancer, metabolism, organ damage, development, and cell regulation. Bridging reference samples were used for inter-plate normalization (Additional file 1: Fig. S1A), and the comparison of reference samples run on different plates showed a strong correlation among different replicates (Additional file 1: Fig. S1B). Reference sample normalization was conducted to reduce the batch effect (Additional file 1: Fig. S1C-D, see more details in the “Methods” section). Proteins run in multiple panels were also analyzed and found to correlate well with an average Pearson correlation between panels of 0.86 (Additional file 1: Fig. S2A), as exemplified by the interleukin-6 protein which was run in four different panels (Additional file 1: Fig. S2B). In total, the relative protein concentration levels of 794 unique protein targets for 90 subjects with six visits were generated. Among them, 80 proteins are found in the list of drug targets for FDA approved drugs [34] (Fig. 2a, Additional file 1: Table S2).
To assess the variability of protein concentration, we compared the IQR of the fold change of protein concentration levels from their median abundance level (Fig. 2a, Additional file 1: Table S2). The most variable protein in the study was kallikrein-related peptidase 12 (KLK12) which is involved in angiogenesis. Spondin 2 (SPON2), a cell adhesion protein that promotes adhesion and outgrowth of hippocampal embryonic neurons, on the other hand, was the most stable protein with a median fold change of 1 and IQR of 0.01. Extreme outliers were also observed, suggesting the discrepancy in protein concentration levels among individuals. The inter-individual variation (calculated as average coefficient of variation (CV)) and the intra-individual variation of each protein for each individual across the six visits were also determined (Table S2). Figure 2b shows that the majority of all proteins have higher variation between individuals rather than within individuals. Growth hormone 2 (GH2) and RAS p21 protein activator 1 (RASA1) are the most dispersed proteins on inter-individual level. The overview of the concentration levels across six visits for these two proteins is visualized in Fig. 2c and d respectively. The concentration of both proteins was relatively stable across the six visits for each individual, and distinct groups of individuals with elevated concentration levels can be identified based on the longitudinal protein concentration profiles.
Clustering analysis of the protein profiles
Unsupervised clustering analysis was performed based on the Pearson correlation of the global protein concentration profiles based on six samples for each of the 90 individuals. The hierarchical tree shows that the majority of samples from the same individual cluster together, indicating that the intra-individual variation is smaller than the inter-individual variation in normal healthy individuals (Fig. 2e, Additional file 1: Fig. S3). The comparison of the distribution of intra-individual and inter-individual correlations also demonstrates a similar conclusion with a median intra-individual correlation of 0.99 and median inter-individual correlation of 0.96 (Fig. 2f). The effect of the inter-individual variation, visits, and residuals for each of the 794 proteins was assessed using two-factor ANOVA, and the proportion of variance explanation is visualized as a ternary plot (Fig. 2g). The plot demonstrates that most variability can be observed between individuals (inter-individual) with relatively low contribution by the visits factor. Folate receptor 3 (FOLR3) shows the largest inter-individual differences with 99.4% variance explained by subjects, 0.1% by visits, and 0.5% by residuals.
A small number of individuals (n = 10) showed a higher variability between some of the visits, and these can be seen as outliers in the hierarchical tree (Fig. 2e and Additional file 1: Fig. S3), as one or more visits are not clustering with the others from the same individual. Pairwise comparisons of the protein levels across six visits of the 10 individuals were shown in Additional file 1: Fig. S4. Interestingly, one of the individuals (W0010) started a dietary change after visit two and thus lost weight between visit three (120.5 kg) and visit four (104.7 kg) (Additional file 1: Fig. S5A). For another individual (W0022), the clinical chemistry result reveals elevated C-reactive protein (CRP) levels (79 mg/L) at visit two due to an infection (Additional file 1: Fig. S5B). An analysis of the protein profiles of these two individuals will be described more in depth below.
Genome-wide association analysis of the blood protein profiles
To investigate the genetic influences on inter-individual differences in blood protein concentration, a genome-wide association analysis based on 7.3 million variants identified by whole-genome sequencing and 794 plasma protein profiles was performed. A total of 2936 associations reached a given statistical significance level (P < 6 × 10−11) (Additional file 1: Fig. S6). Among them, 144 significant associations between 107 proteins and 143 independent genetic variants (LD r2 < 0.1, conditional P < 0.01) were identified (Fig. 3a), including 67 cis-pQTL variants for 67 proteins and 77 trans-pQTL variants for 40 proteins (Fig. 3b). Among them, 74% of the pQTLs including the proxy of the pQTLs (LD r2 > 0.6) have not been reported before. All but 13 of the pQTLs replicated at nominal significance (P < 0.001) in previous studies (see more details in methods and Additional file 2: Table S4). Most of the cis-pQTLs and trans-pQTLs were found in intronic, intergenic, or other untranslated regions (Fig. 3c). The association between cis- or trans-pQTL with genomic regions was further examined by using Fisher’s exact test. We found that cis-pQTL variants were higher enriched in coding regions (P < 0.1) and untranslated regions (P < 0.01), while trans-pQTL variants were higher enriched in intergenic regions (P < 0.001). In addition, 45% (n = 30) of the cis-pQTLs also had an eQTL for the same protein in blood (Additional file 2: Table S5), suggesting that the genetic effect on plasma protein levels is mainly on transcription level. Sentinel pQTL variant was determined as the variant with lowest P value at each pQTL locus and visualized in Fig. 3d. The variants are relatively equally distributed between the chromosomes for both cis- and trans-pQTLs. To investigate the associations between pQTLs and human diseases, we also examined whether the sentinel variants or variants in LD r2 > 0.8 were identified in disease-GWAS studies. In total, 16 pQTLs were associated with 21 diseases (Additional file 2: Table S6). For example, rs6727306 was identified as an atopic dermatitis risk loci in a multi-ancestry GWAS study [35]. Here, we also show the association of rs6727306 between interleukin 18 receptor 1 (IL18R1) which contributes to IL18-induced cytokine production [36].
In Fig. 4, the three proteins with strongest associations between blood protein levels and genetic variants are analyzed in more depth. The genetic variants associated with the concentration levels of the FOLR3 protein (Fig. 4a) are all found at chromosome 11 (cytoband q13.4) in close proximity to the gene coding for FOLR3. The highest association is found for variant rs71891516, which is a stop gain variant in the coding region of FOLR3. FOLR3 is a secreted plasma protein [37] that can bind to folate and reduce folic acid derivatives and mediate delivery of 5-methyltetrahydrofolate to the interior of cells [38]. Interestingly, individuals that carry the variant thus will have a premature termination codon which signals the end of translation. This interruption causes the protein to be abnormally shortened. A more detailed analysis of the two chromosomes of the individuals reveals that the protein levels are high for both the homozygote and heterozygotes for the stop-gain variant (Fig. 4b). The longitudinal analysis during the six visits for the 90 individuals (Fig. 4c) demonstrates that the individual protein levels were remarkably stable during the 2-year period. The reason behind the difference in levels is not known at present, but it is tempting to speculate that the shorter version has longer blood half-life and thus yields higher concentration levels in blood. In this context, it is important to note that the truncated variant of FOLR3 might have an altered antibody binding, and therefore, the apparent change in concentration is instead due to altered epitope binding. This needs to be ruled out by more in-depth analysis using antibody-independent analysis.
For the protein platelet-derived growth factor receptor beta (PDGFRB), the genetic variants (Fig. 4d) are located to chromosome 5 (cytoband q32), which is the location of the protein-coding gene. The highest association is found for variant rs3816018, which has been previously reported in Garrigos et al. [39] and Benson et al. [40]. Interestingly, the chromosomal analysis shows that heterozygote individuals for the protein variant have intermediate levels of blood protein levels (Fig. 4e) compared to the homozygotes. Similarly, to FOLR3, most of the relative levels of the individuals were stable during the 2-year study period (Fig. 4f). For the protein meprin A subunit beta (MEP1B), the genetic variants (Fig. 4g) are located to chromosome 18 (cytoband q12.1), which again is the location of the protein-coding gene. The highest association is found for variant rs620982, located downstream of the MEP1B gene. Again, the heterozygote individuals have intermediate levels of the protein (Fig. 4h), and these levels are stable during the 2-year study period (Fig. 4i).
Integrative multivariate data analysis
To get a comprehensive quantification of the effects of genetic and non-genetic factors on the variation of protein concentration during the longitudinal study period, we established a linear-regression model for each protein that included all genome-wide significant variants, anthropometrics, the 30 clinical chemistry parameters, sex, and visit. In the analysis, the genetic variants were combined as “genetic component” and all the anthropometric and clinical chemistry variables were combined as “environmental component.” A summary of the analysis across all 794 analyzed plasma proteins (Fig. 5a) shows that the influence of genetics and environment on blood protein level variability varies considerably. Limited longitudinal effects were found in the variability of proteins with genetic associations with an average contribution of 2%, suggesting that the protein levels associated with genetics are relatively stable throughout the 2-year study period. Out of the 107 proteins with significant pQTL associations, 56 proteins have at least a 50% contribution from genetics (Fig. 5b). The FOLR3 protein is the most affected protein with 98% of the blood protein level variance explained by genetics. Membrane metalloendopeptidase (MME), which is involved in the destruction of opioid peptides by cleavage of a Gly-Phe bond [41], is an example of a protein with the concentration levels in blood strongly associated with both genetic and environmental components, mainly due to the liver marker GGT (Additional file 1: Fig. S7A). Another example is protein carbonic anhydrase 5A (CA5A), which is a liver enriched gene [36], with the concentration levels mainly affected from genetics (60%) but also from ALAT (7%) (Additional file 1: Fig.S7B). The results demonstrate the importance of determining the underlying genetic make-up when analyzing individual differences in blood protein levels.
One hundred eighty-six proteins have at least a 10% contribution from a certain environmental component to the variability of the blood concentration levels (Additional file 1: Table S3). Among them, 63 proteins are associated with kidney function, 33 proteins are associated with lipid profile, 32 proteins are associated with body composition, 21 proteins are associated with leukocytes, and 42 proteins are associated with other clinical parameters. The top 30 most significant proteins associated with environmental components and with no genetic component are highlighted in Fig. 5c. A CCA [42] was also performed to investigate the associations of protein profiles with anthropometric and clinical chemistry variables. Associations of all analyzed samples (n = 540), together with proteins and clinical or anthropometric variables, were presented in the triplot (Fig. 5d). The CCA (Fig. 5d) predicts the effects of the plasma protein data and clinical parameters on sample levels and highlights that LEP is highly positively correlated with body fat and negatively correlated with bone mass and muscle mass. As an example, N-terminal pro-brain natriuretic peptide (NT-proBNP) and natriuretic peptide (BNP) were highly correlated with the NTproBNP levels in clinical chemistry, consistent with linear regression analysis result in Fig. 5b. Sex differences can be also observed, for example with higher skeletal muscle mass and Hb levels in males and higher body fat mass and HDL levels in females. Glycoprotein hormones, alpha polypeptide (CGA) which is a placenta-enriched protein, showed the largest sex difference with high levels of concentration in female samples. Prokineticin 1 (PROK1), on the other hand, showed higher concentration levels in male samples. The majority of proteins with significant pQTL variants were as expected shown not significantly associated with clinical or anthropometric variables but are located in the center of the plot.
Changes due to environmental factors
To investigate the effect of life style changes and in particular weight changes on the proteome, we focused on the mixed effect modeling results for weight-related anthropometrics (weight, waist, BMI, and bioimpedance fat) and obtained a list of the top 50 most significant proteins. The resulting connections between proteins and weight-related parameters are visualized as a chord diagram plot (Fig. 6a), and the protein profiling data was used to perform hierarchical clustering of the 50 proteins based on their concentration levels across the six visits (Additional file 1: Fig. S8A). We assessed the changes in plasma protein profiles before and after weight loss, exemplified by the participant W0010 who showed a large weight loss between visit three (120.5 kg) and visit four (104.7 kg), but started a change in lifestyle already after visit two. The protein levels in each of the six visits are visualized for all proteins with positive (n = 37) (Fig. 6b) or negative (n = 13) (Additional file 1: Fig. S8B) correlations with weight-related anthropometrics, respectively, highlighting the large changes between visits three and four for many of these proteins. We also compared the ratio of the complete set of plasma protein profiles between visits two and four (Additional file 1: Fig. S8C) to highlight the most altered proteins for this individual, and here, we see that the growth hormone protein (GH) had the largest change over all.
Finally, to get a comprehensive mapping of the proteome changes during an infection, we focused the multivariate analysis on the plasma protein profiles and their relationship with the CRP (Fig. 6c). Based on linear mixed effect modeling results, the top 50 most highly associated proteins with CRP are visualized in Fig. 6c, and the circular dendrogram (Additional file 1: Fig. S8D) shows the relationship based on correlation of protein profiles between these mainly inflammatory and immunity-related proteins. An analysis of the same proteins in the individual with a serious infection at visit two shows an increase of a whole cascade of inflammatory-related proteins upon infection with the positively correlated proteins (n = 44) in Fig. 6d, with the largest change of many of the proteins in visit two. The small number of negatively correlated proteins (n = 6) is shown for the same individual in Additional file 1: Fig. S8E. The top driving proteins mainly include cytokines IL1RL1, IL1RN, IL27, IL12, IL6, and IL10; chemokines CCL3, CCL4, CCL7, CCL20, CXCL9, and CXCL10; also tumor necrosis factor TNFRSF6B, DLL1, and XCL1; a peptidase MMP12; and the growth factor TGFA. Additional file 1: Fig. S8F shows the log2-ratio between visit two and visit one for all proteins in the same individual, which clearly shows that IL17C, GCG, and REG1A have the largest increase in concentration and at the other end, ALDH3A1 decreased the most.