The Milieu Intérieur cohort
The 400 donors in this study were a subset of the 1000 healthy donors of the Milieu Intérieur cohort [27, 29,30,31,32] recruited at BioTrial (Rennes, France). The Milieur Intérieur cohort was approved by the Comité de Protection des Personnes – Ouest 6 (Committee for the protection of persons) on June 13, 2012 and by French Agence nationale de sécurité du médicament (ANSM) on June 22, 2012. The study is sponsored by Institut Pasteur (Pasteur ID-RCB Number: 2012-A00238-35) and was conducted as a single-center interventional study without an investigational product. The original protocol was registered under ClinicalTrials.gov (study # NCT01699893). The samples and data used in this study were formally established as the Milieu Interieur biocollection (NCT03905993), with approvals by the Comité de Protection des Personnes – Sud Méditerranée and the Commission nationale de l'informatique et des libertés (CNIL) on April 11, 2018. Donors included in this sub-study were stratified by sex and were between the ages of 30–39 (n = 200) or 60–69 (n = 200) years old. Participants were selected based on stringent inclusion and exclusion criteria, as detailed elsewhere [30]. To minimize the influence of population substructure on genomic analyses, the study was restricted to individuals of self-reported Metropolitan French origin for three generations (i.e., with parents and grand-parents born in continental France). Fasting whole blood samples were collected in EDTA tubes, and plasma was separated following high-speed centrifugation and stored at – 80 °C until analysis. Standard blood testing and complete hemogram was performed on fresh aliquots, while protein immunoassays were performed on frozen aliquots.
Quantification of plasma protein levels in 400 healthy individuals
The protein immunoassays and the blood tests were performed on samples taken the same day and analyzed at different times, on frozen and fresh aliquots, respectively. Blood chemical and major cell fractions were estimated through direct enumeration and standard blood panels. The concentrations of 297 plasma proteins of 400 individuals were quantified by Luminex multi-analyte immunoassays (Discovery Map v3.3 from Myriad RBM, Additional file 1: Table S1), as previously described [33]. Proteins measured included cytokines, chemokines, metabolic markers, hormones, growth factors, tissue remodeling proteins, angiogenesis markers, acute-phase reactants, cancer markers, kidney damage markers, and central nervous system (CNS) biomarkers. Protein levels were analyzed and compared with their respective lower limit of quantification (LLOQ). Among the 297 assayed proteins, 68 proteins were reported at a concentration lower than the LLOQ in at least 20% of the individuals and were filtered out. For the 229 proteins that were kept, reported concentrations lower than the LLOQ were considered as missing values (NAs), to prevent incorrect protein-environment or protein-genotype associations due to undetected or undetectable proteins. Next, for each protein, plasma-level distributions across individuals were tested for normality using the Shapiro-Wilk test on the raw and log-transformed values within the 2.5% and 97.5% percentiles. Shapiro-Wilk null hypothesis was not rejected (p-value ≤ 0.001, after multiple testing correction) for a total of 50 (22%) and 183 (80%) proteins, depending on whether raw or log-transformed values were used. These results suggested that the majority of raw protein plasma levels followed a log-normal distribution. Thus, the levels of all 229 proteins were log-transformed for downstream analyses.
Filters and tests for non-genetic variables
For each individual from the Milieu Intérieur cohort, an extensive electronic clinical record file was filled, gathering 754 lifestyle, environmental, and medical history variables as well as blood metabolite and enzyme levels from standard blood test and erythrocyte enumeration [30]. First, variables with names describing repetitive measurements over several visits after the first visit were filtered out. Second, redundant columns informative about the sex of the individual were removed. Third, mono-factorial variables, character variables, variables with missing values in 20% or more of the individuals, or varying in less than 10 individuals, variables correlated with other variables with a Spearman correlation coefficient of 1, and variables providing redundant information about the same phenotype were filtered out, for a final number of 254 variables. Then, for each of the 254 non-genetic factors, a univariate linear regression analysis was performed against the log-transformed expression levels of each of the 229 plasma proteins evaluated. Age and sex were systematically included as covariates in all such regressions, consistent with their pervasive influence shown in previous studies, as well as with their association with many of the non-genetic factors evaluated [30, 34]. Univariate linear regression was performed between each pair of proteins and non-genetic factors. In addition, to reduce the sensitivity of the linear models to outliers, the ten lowest and ten highest values of each protein were removed from the regression analysis. Significance was declared at p-value ≤ 0.05 after Bonferroni multiple testing correction accounting for the number of tests (n = 58,166). A total of 152 significant associations collectively involving 49 proteins and 20 non-genetic factors were found. In addition, seven major blood-cell fractions, i.e., leukocytes, lymphocytes, monocytes, neutrophils, eosinophils, basophils, and platelets, were assessed through hemogram on fresh aliquots, along the other blood chemicals and enzymes [30].
Genotyping and imputation
Each individual from the Milieu Intérieur cohort was genotyped by the HumanOmniExpress-24 BeadChip (Illumina), covering 719,665 SNPs. In total, 245,766 rare functional variants were also genotyped on a HumanExome-12 BeadChip (Illumina). After quality control, both datasets were merged, for a total of 723,341 SNPs, all mapped in GRCh37.p13 coordinates. Next, IMPUTE v.2 [35] was used to perform genotype imputation, on 1-Mb windows buffered by an additional 1 Mb. Before imputation, SNPs were phased using 500 conditioning haplotypes, 50 MCMC, 10 burn-in, and 10 pruning iterations. SNPs and allelic states were aligned to the imputation reference panel from the 1000 Genome Project Phase 1 v3 (2010/11/23). SNPs with dissimilar alleles (even after flipping) or ambiguous C/G or A/T alleles were filtered out. Imputation yielded a total of set of 37,895,612 SNPs. Removing SNPs with information metric ≤ 0.8, duplicated or monomorphic SNPs, and SNPs with missingness > 5% (SNPs with genotyping probability lower than 0.8 in an individual were considered as missing) reduced the set to 11,395,554 SNPs. Further removing non-SNP variants and filtering out variants with a MAF < 0.05 (with the --snps-only option of PLINK v1.9) in the 400 sampled individuals resulted in a final set of 5,201,092 SNPs. Following Patin et al [29], principal component analysis (PCA) of the OmniExpress array was performed on 261,827 independent SNPs with 36 reference populations from north and west Africa, Middle East, western Asia, and Europe (Human Genome Diversity Panel [36]) and the principal components (PCs) explaining more than 1 % of the total variance were kept to account for potential population stratification (PC1 = 5.42%, PC2 = 1.63%).
Genome-wide association testing of plasma proteins
To perform the pQTL mapping of plasma proteins, we chose to use a multivariate approach by incorporating, for each protein, the associated non-genetic variables as covariates, in addition to sex, age, the 7 major blood-cell fractions (leukocytes, lymphocytes, monocytes, neutrophils, eosinophils, basophils, and platelets), and the two first PCs of the genetic data. If a non-genetic variable was a redundant measure with the corresponding protein (i.e., measured by both Luminex and standard blood panel, e.g., CRP), it was not added as a covariable in the model. We used a first linear mixed model to correct the protein expression levels for their specific covariates and for kinship (mean identity by state, a.k.a. IBS = − 0.0025 ± 0.026 over each pair of individuals, with two pairs of individuals with IBS = 0.2 and IBS = 0.3, potentially indicating second-degree relatives), using per chromosome genetic relationship matrix (GRM) computed using GenAbel v1.8 [37] (leaving one chromosome out). In order to limit the correction for multiple testing while still accounting for both the number of tested SNPs and proteins, the analysis was performed separately for cis and trans QTLs, and the false discovery rate (FDR) was computed independently for cis-acting and trans-acting SNPs, following Quach et al. [38]. Cis-acting SNPs were defined as SNPs located at a maximum distance of 1 MB from the transcription start or end site of the corresponding gene, while all other SNPs are defined as trans-acting. For each protein and each kind of associations tested (cis or trans), the minimal raw p-value was reported. In addition, for each protein, 100 permutations were performed between all cis- or trans-SNPs, and the minimal p-value of each of these permutations was extracted. Next, proteins were ascendingly sorted based on their raw p-values. Then, the FDR was computed, for each protein, as the mean over the N = 100 permutations of the number of times its raw p-value is lower than the nth permutation from all proteins, divided by the rank of the corresponding protein. Protein-SNP pairs were considered as significant when the corresponding FDR was equal to or lower than 0.05, for a total of 78 cis- and 22 trans-pQTLs. To investigate the potential presence of secondary pQTLs, we performed the same analysis a second time, incorporating the genotype of the most significant SNP detected in the first round of analysis as an additional covariate. The FDR was computed independently on each analysis iteration, both considering the same number of proteins and permutations. The conditional analysis yielded 11 cis- and 1 trans-pQTLs. The significance thresholds corresponding to first-round and second-round pQTLs were respectively around 2.2e−05 and 1.5e−05 for cis and respectively around 9e−10 and 2e−10 for trans (significance thresholds were computed as the mean between the last significant and the first non-significant p-values). The genome-wide analysis yielded a total of 100 cis-pQTLs and 12 trans-pQTLs.
Contribution of non-genetic and genetic components in the variability of plasma proteins
The relative contribution of the various environmental and genetic variables was assessed using the correlation-adjusted marginal correlation score (CAR score [39]) from the care package in R. The CAR score is the shrunk estimator of the adjusted coefficient of determination (R2) of each independent variable in a linear model, which considers the marginal correlation between variables. The CAR score is determined for each independent variable within a model, representing their independent contribution to the total variability of the dependent variable. The sum of the CAR score attributed to each variable in a model is equal to the model adjusted R2. For each protein, the relative contribution of its significantly associated non-genetic variable was assessed at once. In case only a single variable was significantly associated with a protein, we used the adjusted coefficient of determination (R2) of the variable as its relative contribution in the variability of the corresponding protein levels. Then, the relative contribution of the identified pQTLs was assessed by computing their CAR score in protein-specific models incorporating age, sex, the protein-specific covariates, and the corresponding genotypes. The effect size of pQTLs was computed following a two-stage model similar to the GWAS. A first linear model was used to regress out the associated covariates (and previously identified SNP in the case of SNPs identified by the conditional analysis) from the log-transformed and non-transformed plasma protein levels, and a second linear model was used to regress the residuals of the first against the tested SNP. The Beta was extracted from this model and used as the SNP effect size.
Global impact of age and sex on plasma protein levels
In order to characterize the global impact of age and sex on plasma protein heterogeneity, while accounting for the collinearity of several plasma proteins, we performed a principal component analysis (PCA) on the expression levels of the 229 proteins across the 400 individuals. When considered independently, only PC1 and PC2 explained more than 5% of the total variability (Additional file 2: Figure S1).
Assessment of gene-gene interactions
The interactions between trans-pQTL associated proteins and candidate proteins coded by genes located within a 500-kb window centered around the associated SNP were assessed using STRING-db v11 [40] at https://string-db.org/. All protein-coding genes were queried at once through the “Multiple proteins” option and default settings (organism: “Homo sapiens”). Proteins were considered to interact when they were shown to be direct neighbors in protein-protein interaction networks, or when one of the protein was shown to be directly or indirectly involved in the regulation of the other.
Contribution of blood-cell fractions in protein level predictions
To quantify the relevance of blood-cell fractions in the prediction of plasma protein levels, we used a one-way ANOVA to compare, for each pQTL, the predictions coming from two models. A first linear model considered the genotype of the corresponding SNP, the previously defined protein-specific covariates, and age, sex, the two first PCs of the genetic data, and the 7 blood-cell fractions. A second linear model was evaluated by considering all variables used in the previous one, with the exception of the 7 blood-cell fractions. The models including pQTLs obtained from the conditional analysis additionally corrected for the SNP used for their identification. To assess the potential relevance of the different circulatory cell fractions in the different results obtained from the two pQTL mapping, the proteins were first corrected for age and sex through linear regression, and the resulting residuals regressed against each circulatory cell counts individually. Association p-values were corrected independently for each protein and are reported in Additional file 3: Table S2.
Genome-wide analysis of plasma proteins excluding blood-cell fractions
For the purpose of evaluating the contribution of blood-cell fractions in the pQTL assessment, the genome-wise association analysis was performed as previously detailed, while removing the 7 blood-cell fractions from the model (“Methods”). This approach identified 115 protein quantitative trait loci (pQTLs) collectively involving 94 proteins and 113 SNPs (FDR ≤ 0.05). Among them, 103 were defined as cis-pQTLs. In addition, 12 pQTLs were identified in trans, i.e., located further than 1 MB far from the gene boundaries, or located on another chromosome. In total, 73 proteins were associated with only one SNP, while 21 were associated with two independent SNPs. Among the 94 proteins with significant pQTLs, 83 proteins were associated exclusively with cis-pQTLs, 9 exclusively with trans-pQTLs, and 2 with both cis and trans-pQTLs. In comparison with the first analysis, 92 pQTLs were reproduced, 81 in cis, and 11 in trans. An association was considered as reproduced when the SNP, or a SNP in linkage disequilibrium with R2 ≥ 0.8, was significantly associated with the same protein at FDR ≤ 0.05. Twelve cis-pQTLs were no longer associating with the same SNPs or to SNPs in high LD (R2 ≥ 0.8) with it, but with other cis-SNPs, and 8 pQTLs (7 cis, 1 trans) were not reproduced. On the opposite, 10 additional cis- and 1 trans- pQTLs were obtained.
Co-localization of trans-pQTLs and trans-eQTLs
We assessed the co-localization between the trans-pQTLs identified in this work and previously identified trans-eQTLs associated with the same gene and reported in the GWAS Catalog V1.0.2 [41], in eQTLs from GTEx V8 or in QTLbase v1.2 [42] (http://mulinlab.org/qtlbase) through the VannoPortal [43], using a LD threshold of R2 ≥ 0.8 (computed in Europeans from the 1000 Genome Project) between each pQTL and all SNPs present in a 200-kb window centered on the pQTL.
Replication of previously identified plasma protein QTLs
We compared our significant SNP-protein pairs with four studies analyzing the genetic basis of plasma protein levels. Sun et al. [22] reported 1927 pQTLs, resulting from the analysis of 3622 plasma proteins in 3301 individuals; Suhre et al. [23] reported 539 pQTLs from the analysis of 1124 plasma proteins in 1000 individuals; Deming et al. [24] reported 56 pQTLs from the analysis of 146 plasma proteins in 818 individuals; and Zhong et al. [44] reported 144 pQTLs from the analysis of 107 plasma proteins in 101 individuals. The replication of the reported pQTLs was performed at the protein level rather than at the SNP level, due to the poor overlap in terms of sentinel SNPs reported in previous studies [22,23,24, 44] and our imputed set of 5,201,092 SNPs. A protein previously reported as cis-regulated (i.e., reported as significantly associated with a SNP annotated as “cis” by the authors) was considered replicated when it was significantly associated with a SNP located closer than 1 Mb around the gene extremities.
Overall, cis- and trans-pQTLs were labeled as novel when they were absent from the four reference studies used for replication, or when they—or SNPs in high LD with them (R2 ≥ 0.8, based on the reference European population from the 1000 Genome Project)—were absent from the GWAS catalog [41] and from QTLbase v1.2 [42] (http://mulinlab.org/qtlbase). The studies that previously identified the pQTLs reported in this work are referenced in Additional file 4: Table S3.
Protein and gene annotations
Proteins were classified as immune-related when they were either (i) annotated as adaptive immune proteins (Table S2 from Fischer & Rausell, 2016 [45]), or innate immune proteins (Table S1 from Fischer & Rausell, 2016 [45], and Table S1 from Deschamps et al. [46]) or (ii) produced in sufficient concentrations in substantial fractions of immune cells, as described in Rausell et al. [47]. The list of primary immunodeficiency genes was obtained from Table S3 from Fischer & Rausell [45]. Gene-disease annotations were obtained from OMIM (downloaded at https://www.omim.org/downloads/ on the 2019/06/10). Entries were parsed following Caron et al. [48]. Mendelian disease genes were selected for their level of supporting evidence equal to 3 and for not having a “somatic” flag, and monogenic Mendelian disease genes (MMDGs) were further selected for not being flagged as “complex”. The list of secreted proteins was downloaded from UniProt (https://www.uniprot.org/) on the 2020/02/03, using the keywords: locations:(location:"Secreted [SL-0243]" type:component) AND organism:"Homo sapiens (Human) [9606]". The list of FDA-approved targeted proteins was downloaded on the 2020/02/05 from http://mrmassaydb.proteincentre.com/fdaassay/.
The protein classes were taken from Myriad RBM Discovery Map V3.3 table. Proteins were considered enriched or depleted in a specific class when the proportion of proteins in that class was larger than in 10,000 randomly sampled set of proteins of the same size, coming from the tested set of 229 proteins. Previous reports of pQTLs associated with monogenic Mendelian disease genes, with primary immunodeficiency genes or with genes coding for FDA-approved biomarkers were identified through QTLbase [42] (http://mulinlab.org/qtlbase).
Disease loci enrichment
To assess the potential association with diseases or other traits of the cis and trans-pQTLs reported in this work was assessed, we used hits from the NHGRI-EBI Genome Wide Association Studies (GWAS) Catalog [41], downloaded on the 2019/03/22 (file name gwas_catalog_v1.0-associations_e95_r2019-03-22.tsv). SNPs associated with a trait or a disease with a reported p-value ≤ 1e−08 and mapped to autosomes were kept. SNPs associated with traits containing “blood,” “plasma” or “serum,” and “protein” were removed. A set of pQTLs was declared as enriched when the proportion of pQTLs in a set that were GWAS SNPs or in linkage disequilibrium with GWAS SNPs (r2 ≥ 0.8) was larger in 95% of 10,000 randomly sampled set of SNPs of the same size, matched by MAF (bins of 5%). Randomly sampled SNPs were drawn from 122,757 and 384,897 SNPs, selected respectively from the 1,674,134 and 5,201,092 SNPs tested for cis and trans associations respectively (with the --indep-pairwise 100 5 0.5 function of PLINK v1.9). When several traits or diseases were associated with one locus, the most significant one was selected.
Functional annotation of pQTLs
The cis-pQTL molecular consequences were assessed using VEP v97 [49]. Each cis-pQTL was annotated based on the canonical transcript of the gene coding [50] for the regulated protein.