NC P/LPs in 10,389 cancer cases
We identified NC P/LPs in the TCGA cohort of 10,389 adult cancer cases across different ancestral populations. The cohort contained germline genome data that passed quality control procedures as described by TCGA PanCanAtlas Germline working group [12]. Genetic ancestry analyses of PanCanAltas AIM and Germline working groups [12, 13, 18] stratified the cohort into 305 individuals of the Latinx/Native American, 971 of the African American, 8279 of the European, 652 of the East Asian, 50 of the South Asian, 106 of mixed (> 20% genetic admixture) ancestries, and 26 of other ancestry (Methods, Fig. 1).
In the 10,389 cases, TCGA PanCanAtlas Germline working group previously conducted variant calling that resulted in ~ 1.46 billion variants and utilized CharGer [19] to prioritize variants [12]. Among the CharGer-prioritized variants as well as all the variants in the ACMG 59 genes that are not directly associated with cancer, we systematically filtered them based on allele read count thresholds and association with cancer or cancer syndromes, followed by ACMG classification along with variant reviews by a board-certified molecular geneticist (Methods). Based on this procedure, we identified 2916 heterozygous and 4 homozygous NC P/LPs that were pathogenic or likely pathogenic (Fig. 1A). These NC P/LPs included 753 unique variants distributed across 363 non-cancer genes, harbored in 24.1% (2505/10,389) of the cancer cases (Additional file 3: Table S3). We further examined the frequencies across ancestries: 26.75% of European ancestry were found to harbor NC P/LPs, a higher frequency compared to other ancestries with more than 100 patients, including 15.08% of the Latinx/Native American, 12.98% of the African American, and 11.66% of the East Asian (Fig. 1B). Overall, putative autosomal dominant (AD) variants affected 2.07% (215/10,389) and autosomal recessive (AR) variants affected 22.02% (2288/10,389) of the cohort while showing different frequencies across ancestries (Additional file 4: Figure S1).
Among the ACMG 59 genes [2], for which reporting of secondary findings is recommended, 34 were not directly cancer-related. These 34 ACMG non-cancer genes affected 1.48% of the TCGA cancer cases across all ancestral groups. Restricting to the P/LPs in genes that are not directly cancer-related, previous studies have reported 1.0% (95 % confidence interval (CI) 0.4–1.6%) carriers in 1000 Genome project [20], 1.2% (1.0–1.3%) in eMERGE [21], 1.1% (0.1–2.0%) in European American Framingham Heart Study, and 0.5% (0.2–0.7%) in African-American Jackson Heart Study [22]. The carrier frequencies may vary depending on genes assessed, evidence available at the time of pathogenicity assignment, ancestry of participants, and participant ascertainment method.
Using 94 AR disorders assessed by Haque et al. [23], we found TCGA carrier frequencies of 10.1% (95 % confidence interval (CI) 9.5–10.8%), 6.2% (CI 4.7–7.7%), 1.7% (0.7–2.7%), and 2.6% (0.8–4.4%) in individuals of European, African American, East Asian, and Latinx ancestry for at least one P/LP variant in the 94 disorders. Our estimates are slightly higher in European-ancestry individuals and lower in African American, East Asian, and Latinx compared with Haque et al. [23] reported carrier frequencies of 7.7% (7.5–7.9%), 11.3% (10.9–11.7%), 6.9% (6.3–7.6%), and 6.0% (5.6–6.4%), in the same ancestral groups. The differences between the two studies in part reflect differences in knowledge and standards for classification of variants over time.
Prevalence of NC P/LPs across ancestral populations
We investigated the genes with the highest prevalence of NC P/LPs across ancestries. Fourteen of the ACMG 59 genes were affected by NC P/LPs in TCGA (Fig. 2A). For example, NC P/LPs in KCNQ1, associated with type 1 long QT syndrome, affected 6 European individuals and 1 individual in East Asian or South Asian in the TCGA cohort. Many ACMG 59 genes were restricted to individuals of European population in this cohort, such as LDLR, DSG2, APOB, ATP7B, CACNA1S, FBN1, and TNNI3 (Fig. 2A and Additional file 5: Table S4).
Across populations, the most commonly identified predisposing genes—each with at least 100 carriers—included SPG7, MPO, and ACADM; all three genes are associated with autosomal recessively inherited diseases (Fig. 2A and Additional file 5: Table S4). SPG7 variants were observed in 100 European carriers (1.21%) and one carrier each of Latinx/Native American, African American, and South Asian ancestry. Similar to SPG7, MPO variants were found in 117 European carriers (1.41%), one carrier of Latinx/Native American and African American ancestry. ACADM variants were identified in 115 European carriers (1.39%), 2 carriers of African American and one carrier of mixed ancestry. We also observed several commonly ancestry-specific genes carrying variants that showed occurrences exclusively in one ancestral population. In addition to those exclusively identified in the Europeans, we found two ACMG 59 genes whose NC P/LPs only affected East Asians in TCGA, including SERPINB7 in 8 carriers (1.23%) and SLC22A12 in 6 carriers (0.92%) (Fig. 2A and Additional file 5: Table S4). These results highlight the genetic variants that may disproportionally affect diverse populations.
We sought validation of NC P/LPs in the gnomAD non-cancer cohort (abbreviated as gnomAD below), which had no samples overlapping with the TCGA cohort. The carrier frequencies of the TCGA-identified NC P/LPs were the highest among the gnomAD Ashkenazi Jewish (ASJ), Non-Finnish European (NFE), and Finnish (FIN) compared to other ancestral populations (Additional file 4: Figure S2), consistent with ClinGen’s recent report finding Europeans in gnomAD contained over half of the information on the clinically relevant variants on ClinVar [24]. We also examined the gene-level carrier frequencies of the same variants across different gnomAD ancestral populations. Multiple genes, including SPG7, MPO, and ACADM, showed higher carrier frequencies of this set of NC P/LPs in European individuals from both the TCGA and gnomAD cohorts (Fig. 2A and Additional file 4: Figure S3A). ACADS and GALT variants were present at highest frequency in African Americans in both TCGA and gnomAD; similarly SERPINB7 and SLC22A12 variants appeared predominantly in the East Asian population of both cohorts.
We further investigated the concordance of variant-level frequencies in the matched ancestries between TCGA and gnomAD. We found significant correlations of variant frequencies in East Asian (Pearson R = 0.8, p value = 2.53e−07), European (Pearson R = 0.83, p value = 4.13e−143), and African American (Pearson R = 0.82, p value = 1.35e−12) population, whereas the variant frequencies in Latinx/Native American ancestry did not show significant correlations (Pearson R = 0.09, p value = 0.58, Additional file 5: Table S4), likely due to the smaller sample size or admixtures in the populations. Multiple variants found exclusively in one TCGA ancestry were rediscovered in their respective ancestry cohort of gnomAD, such as ACADS p.W177R in African American, MCCC2 p.L355F in Latinx/Native American, ALPL p.E191K in European, SERPINB7 p.R266* in East Asian, and ACADSB p.Q99* in South Asian (Fig. 2B and Additional file 5: Table S4). Replicated across cohorts, the markedly higher frequencies of these variants in one population compared to the others support their population specificity and potential founder effects.
We further compared the frequency of each NC P/LP in TCGA vs. gnomAD stratified by population using a two-tailed Fisher’s exact text. The analyses identified 57 variants (FDR < 0.05) (Additional file 6: Table S5), the majority of which were TCGA-enriched variants found in the European ancestry and the most significant ones included MPO c.2031-2A>C (splice site variant, myeloperoxidase deficiency), F11 p.E135* and p.F301L (hereditary factor XI deficiency disease), and ACADM p.K333E and p.G271R (medium-chain acyl-coenzyme A dehydrogenase deficiency). In contrast, CYP21A2 p.P454S, TSFM p.Q307* and ALPL p.E191K showed higher frequencies in gnomAD compared to TCGA Europeans (Additional file 4: Figure S3B). Acknowledging the caveat of comparing TCGA vs. gnomAD data from different sequencing platforms, variant calling pipelines, and sampled sources and populations, these results suggest a potential different distribution of NC P/LPs that may be indirectly associated with the cancer phenotype that needs to be further tested.
Gene expression impacted by NC P/LPs
Many of the identified NC P/LPs are truncating variants that are presumed to alter expression of the gene products through mechanisms such as NMD. To interrogate the variant consequences at the gene expression level, we applied a multivariate linear regression model using the expression quantile of the affected gene within each cancer cohort as a dependent variable and variant status as the independent variable, adjusting for age, gender, PC1, PC2, cancer type, and tumor purity (Methods). We found 5 significant and 16 suggestive genes to be differentially expressed, 17 of which showed reduced expression in variant carriers (Fig. 3A).
We further examined the specific variants that co-occurred with low gene expression in the bottom quartile of their respective cancer cohorts (Fig. 3B and Additional file 4: Figure S4). The majority or all nonsense variants carriers of multiple genes showed bottom-quartile expression of the affected genes, such as IRAK4, PHKB, MMACHC, and GFER. Meanwhile, bottom-quartile HEXB gene expression was observed in 13 carriers of the missense variant HEXB p.P417L. Overall, we observed distinct expression effects of different variant types. Among 376 nonsense variants, 137 (36.4%) showed bottom-quartile expression of the respective genes, confirming the potential effects from NMD. In comparison, 271 of 1169 (23.2%) missense variants also showed bottom quartile expression (approximating the 25% in a null scenario, Additional file 7: Table S6), suggesting that many of these missense variants, along with a fraction of truncating variants, may exert their effects through mechanisms independent of altering gene expression.
Allele-specific expression of NC P/LPs
The majority of differentially expressed genes associated with NC P/LPs showed reduced expression (Fig. 3A). Further, many pathogenic variants are assumed to alter gene expression through mechanisms such as NMD, leading that ASE at the RNA-level would be observed for a heterozygous individual. To further assess the mechanism at a variant level, we further identified ASE showing reduced expression of the alternate allele for the rare NC P/LPs (MAF ≤ 0.05%) by performing a one-sided binomial test (Methods). Among the 657 rare NC P/LPs with sufficient read counts in tumor RNA-Seq data for analyses, ~ 26.3% (173/657) showed significant ASE (FDR < 0.05), and another 4.26% (28/657) showed suggestive ASE (0.05 ≤ FDR < 0.15) (Fig. 4A).
We investigated whether the ASE status of variants were disproportionally present in a predicted variant function class by conducting a permutation test (Methods). Among the predicted variant function class, nonsense variants were significantly enriched for those showing ASE (48 %, P < 1E−4), confirming their transcriptional impact that is likely mediated through NMD. In addition, synonymous variants are also enriched for those showing significant ASE (7 out of 7, P < 1E−4), although there were only 7 synonymous variants in our analysis (Fig. 4B and Additional file 8: Table S7), including four East Asian carriers of CYP27A1 c.862G>T, one East Asian carrier of G6PC c.727G>T and two European carriers of DGUOK c.676G>A.
Next, we identified genes enriched for significant ASE NC P/LPs using Fisher’s exact test (Methods). We observed that ASE variants were significantly enriched in 6 genes, including ASS1 (FDR = 0.001), SMARCAL1 (FDR = 0.022), AGXT (FDR = 0.023), KIAA1279 (FDR = 0.023), TMEM216 (FDR = 0.023), and NEK8 (FDR = 0.023) (Fig. 4C). At the variant level, 12 of the 17 ASS1 carriers showed significant ASE, including two carriers of missense variant p.G324S, nonsense variant p.R344*, all three carriers of the nonsense variant p.R279* and five carriers of missense variant p.G390R (Figs. 4D and 5). Five out of the six SMARCAL1 p.E848* carriers also showed significant ASE. All carriers with enough read counts data of the remaining four genes (AGXT, KIAA1279, TMEM216, and NEK8) showed significant ASE (Figs. 4D and 5). Additionally, we identified two genes showing suggestive enrichment of significant ASE NC P/LPs, including G6PC and LAMA2 (FDR = 0.057) (Fig. 4C). For G6PC, significant ASE was found in one each carrier of p.R83C, p.L216, and p.Q347* (Fig. 4D and Additional file 4: Figure S5). For LAMA2, all 3 carriers of p.R1326* showed significant ASE, while the only carrier of p.R1826* showed suggestive ASE (Fig. 4D and Additional file 4: Figure S5). There is no overlap between genes enriched with ASE variants and genes whose expression is associated with variants. The non-concordance between ASE analysis and gene expression impact suggested a possible expression compensation from the reference allele or a lack of power in the gene expression analysis. Altogether, the observed ASE validated the expression consequences of many NC P/LPs.
Mis-splicing effects induced by NC P/LPs
Aside from nonsense variants assumed to undergo NMD, we also observed significant numbers of variants showing ASE and gene expression effects (Figs. 3 and 4). Recent studies have shown that, in addition to canonical splice sites, many missense variants can also affect splicing [17, 25]. Utilizing data from our recently published Massively Parallel Splicing Assay (MaPSy) experiment surveying nearly five thousand variants [17], we systematically identified variants showing mis-splicing effects both in vitro and in vivo matched to the NC P/LPs identified in TCGA patients. Among 226 NC P/LPs that overlapped with MaPSy variants, 36 NC P/LPs showed significant mis-splicing effects, including 30 European carriers of missense GALT p.Q103R, one European carrier of missense SMPD1 p.H423Y, one American carrier of missense RYR1, two European carriers of nonsense NDUFAF2 p.R47*, one European carrier of nonsense HADH p.R236*, and one East Asian carrier of nonsense BBS1 p.E586*. The carrier of nonsense HADH p.R236* showing mis-splicing in MaPSy also showed significant ASE.
To discover additional variants with expression and possibly functional consequences, we assessed the mis-splicing effects that may be exerted by all prioritized variants (i.e., the 35,911 variants in Fig. 1A). Among these, 21,008 were non-cancer related NC P/LPs that yielded insufficient evidence to satisfy a likely pathogenic classification, and 266 were characterized by MaPSy. Fifteen of these prioritized missense variants showed mis-splicing effects, including three European carriers of DARS2 p.R263Q, one European carrier of CD36 p.I413L, one European carrier/one African carrier of SLC29A3 p.M116R, and eight East Asian carriers/one Mixed ancestry carrier of PLG p.A620T. On ClinVar, DARS2 p.R263Q is classified as pathogenic by a single submitter associated with the condition of leukoencephalopathy with brain stem and spinal cord involvement and lactate elevation, whereas SLC29A3 p.M116R is pathogenic without assertion criteria and linked to histiocytosis-lymphadenopathy plus syndrome. The European carrier of CD36 p.I413L, another MaPSy-identified variant associated with platelet glycoprotein IV deficiency, also showed significant ASE along with the suggestive ASE found in the European carrier of the MaPSy-identified SLC29A3 p.M116R. These variants have additional functional evidence of mRNA effects in patient samples and/or functional assays, which strengthen variant interpretation assertions.