Mutational tolerance patterns for PTEN abundance
We previously developed VAMP-seq, a generalizable method to simultaneously measure the effects of thousands of missense variants of a protein on intracellular abundance [9]. In VAMP-seq, each cell expresses a different protein variant directly fused to a fluorescent protein such as the enhanced green fluorescent protein (EGFP), so each cell’s level of fluorescence is directly proportional to the steady-state abundance of that protein variant. Single-copy, site-directed genomic integration into a HEK 293T landing pad cell line permits the expression of a library of thousands of protein variants in a pooled format [23]. The pooled cells are then separated into four bins of graded fluorescence using fluorescence-activated cell sorting (FACS). High throughput DNA sequencing is used to quantify the distribution of each variant across the four bins, and this distribution is analyzed to yield an abundance score for each variant. VAMP-seq has thus far been applied to five proteins: PTEN [9], TPMT [9], NUDT15 [24], VKOR [25], and CYP2C9 [26].
To supplement the 4407 PTEN variants whose abundance we measured initially [9], we generated a new PTEN library focused on positions with low coverage in the initial experiment (Additional file 1: Fig S1A, B). We re-amplified 198 low coverage positions using degenerate NNK primers, fused the resulting PTEN variant library to EGFP, and tagged each plasmid with a unique nucleotide barcode. We introduced the library into an improved HEK 293T landing pad cell line [14], which allowed for rapid enrichment for modified cells expressing the library by triggering apoptosis of the unmodified cells in the culture using the built-in iCasp9 cassette. We conducted seven replicate VAMP-seq experiments using this library, yielding additional abundance data for 4186 unique variants within this secondary library. 272 variants were observed in 5 or more replicates in both the initial and new experiments, and abundance scores for these overlapping variants were well correlated (Pearson’s r2 0.70, Fig. 1A). A linear model fit to the scores of the overlapping variants had a slope of 0.89 and an intercept of 0.11, suggesting that the abundance scores from the two libraries could be combined without additional normalization steps.
Thus, we aggregated abundance scores from the initial and new libraries (Additional file 1: Fig S1C). We created a composite abundance score by taking the mean of all replicate scores for variants observed in 4 or more replicates, a threshold we set using a permutation test of the separation of synonymous and nonsense variants (Additional file 1: Fig S2) [17]. As expected, the number of variants encoded by two or more unique codons increased with the composite dataset (Additional file 1: Fig S3A). The final filtered dataset of 4721 variants included 174 synonymous variants, 160 nonsense variants, and 4387 missense variants including 764 variants not scored in the initial experiment (Fig. 1B). 3904 variants scored in the initial experiment received modestly refined scores, reducing the coefficient of variation for many variants (Additional file 1: Fig S3B). This resulted in a net 12.5% increase in the number of variants with a coefficient of variation less than 0.5 (Additional file 1: Fig S3C). We compared our revised scores to a set of 24 PTEN variants we had individually assessed for steady-state abundance in the original manuscript, and while the correlation coefficient did not change with the refined score, this is likely because the correlation with the original data was already high (Spearman ρ2 = 0.93; Additional file 1: Fig S3D). Altogether, the number of variants confidently assigned low abundance classifications increased from 1260 to 1423, and the number of variants confidently assigned WT-like classifications increased from 1577 to 1738 variants (Additional file 1: Fig S3E). Thus, as cell and library engineering approaches improve, VAMP-seq datasets can be bolstered with additional replicates containing new variants, and existing datasets can be reanalyzed to create more complete and accurate datasets.
An advantage of the composite abundance dataset was an increased number of positions with high variant coverage, revealing patterns of amino acid substitutions tolerated at each position. The composite dataset included 61 positions where approximately 90% of missense variants were scored (increased from an original 50 positions) and 22 positions that had full coverage (increased from an original 9 positions). We performed hierarchical clustering of the high coverage position abundance scores, which yielded groups of intolerant, partially tolerant, and tolerant positions (Fig. 1C). Roughly half (n = 33) of positions were nearly uniformly tolerant to substitutions. In contrast, 15 positions were partially tolerant to substitution, while the remaining 13 positions were intolerant. Residues Arg173, Gly251, and Asp326 were almost entirely intolerant to substitution (Fig. 1C). All three of these residues are located in the interface between the PTEN phosphatase and C2 domains, suggesting that specific characteristics of the WT amino acids at these positions are critical for keeping the interface intact. Consistent with this hypothesis, Arg173 and Asp326 make extensive polar contacts in the PTEN structure (Fig. 1D). With the exception of His61, the remaining intolerant positions encoded phenylalanine, isoleucine, leucine, or valine residues and were only partially tolerant of other bulky hydrophobic side-chains (Fig. 1C).
Missing data is often computationally imputed using machine learning to yield a complete dataset needed for certain downstream analyses [27, 28]. A recent study imputed missing PTEN abundance data prior to training a logistic regression classifier of PTEN variant effect and molecular phenotype [12]. 693 of the imputed abundance values were scored in our refined dataset, allowing an independent confirmation of the accuracy of the imputation. We examined the correlation between the imputed abundance data and newly acquired data from our second library and found moderate correlation and scaling between the imputed and experimentally determined values (slope: 0.53; Pearson’s r2: 0.54; Additional file 1: Fig S4). This was consistent with the correlation observed by the authors during 10-fold cross-validation of their initial imputation algorithm with the initial abundance dataset (Pearson’s r2: 0.56). The lower than expected slope is largely explainable by a difference in scaling of the imputed data, which ranged from 0.25 to 1, whereas the measured values ranged from 0 to 1.
We next examined positions which were poorly imputed to better understand the situations in which the imputation algorithm struggled to accurately predict abundance scores. Variants of Arg173 and Gly251, two of the three positions highlighted earlier as being almost entirely intolerant to substitution, were incorrectly imputed with intermediate abundance scores, likely because our analyses showed that these WT residues were unusually critical for maintaining PTEN abundance (Fig. 1C, magenta triangles). Imputed values for proline residues were generally low, even for residues such as Pro103 and Pro248, which were highly amenable to substitution and yielded numerous variants with WT-like scores (Fig. 1C, blue triangle). The algorithm also struggled with Tyr180, Leu182, and Asp268, which were residues that were partially tolerant to substitution and exhibited a wide range of abundance scores depending on the variant, whereas the imputed values roughly approximated the positional mean (Fig. 1C, cyan triangles). By providing additional experimental data and reducing reliance on imputation, the additional abundance data we furnish here will improve accuracy in the downstream uses of the abundance data.
Classification of PTEN variants by abundance and activity
Our composite abundance scores, along with the PTEN lipid phosphatase scores measured in yeast [10], provide two distinct measures of the properties of a total of 4178 PTEN missense variants. We integrated these two datasets to separate PTEN variants into four distinct subsets: WT-like, loss of abundance only, loss of activity only, and loss of both abundance and activity. To perform this four-way classification, we focused on variants that were confidently scored by both assays and thus could be categorized according to both properties (Fig. 2A; see Methods). The majority of variants were in agreement with both assays, as 51% of the classified variants were WT-like for both properties (Fig. 2A, green), while 21% exhibited loss of both activity and abundance (Fig. 2A, purple).
The remaining 28% of variants exhibited discrepancies between the two assays, pointing to subtler molecular phenotypes than near-complete losses of phosphatase activity through loss of intracellular abundance. The smallest subsets were the 6% of variants classified as loss of activity only variants, where phosphatase activity was abrogated without affecting protein steady-state abundance (Fig. 2A, orange). The remaining variants, accounting for 22% of the total classified variants, were loss of abundance only (Fig. 2A, turquoise). These are likely variants that have no inherent effect on phosphatase activity, yet may reduce the total amount of intracellular PTEN expressed in cells, such that the net result may be hypomorphic functioning in cells. This included Asp331Gly, which also exhibited reduced abundance when expressed in U87-MG glioblastoma cells [29], but had near-WT phosphatase activity when purified and equimolar amounts were tested in vitro [29, 30].
The preponderance of loss of abundance only variants prompted us to examine the relationship between the two assays more closely. We first analyzed the scaled activity scores of a panel of 20 variants exhibiting a range of abundances that had been individually assessed for their steady-state mean fluorescence intensity when fused with GFP [9]. The variants in this panel only consistently showed reduced activity when their measured abundances were profoundly reduced, by at least 5-fold (Fig. 2B). Thus, very low abundance variants can score as WT-like in the yeast rescue activity assay.
To better understand whether these low abundance variants without reduced yeast rescue activity scores were clinically important, we focused on the subset of low abundance PTEN variants classified as pathogenic in ClinVar, or associated with autism spectrum disorder (ASD) or PTEN Hamartoma Tumor Syndrome (PHTS) in a recently published cohort of well curated PTEN variant positive individuals seen at Cleveland Clinic (CC cohort) [12]. Of the 40 total variants in this set, 22 (55%) were classified as loss of both activity and abundance variants (Fig. 2C). The remaining 18 variants (45%) were confidently assessed as low in abundance, with 8 considered loss of abundance only due to their high activity scores. The other 10 exhibited intermediate activity. Notably, 12 of these clinically meaningful variants (Tyr27Ser, Gly129Arg, Met134Thr, Arg173Cys, Arg173His, Thr202Ile, Pro246Leu, Gly251Val, Asp252Gly, Lys254Thr, Asn276Ser, Asp326Asn) have reduced in vitro activity, reduced abundance, or altered function when assessed in human cell models by other labs, further supporting these variants’ perturbed function [30,31,32,33,34,35,36,37,38]. For the 6 remaining pathogenic variants, the low abundance score in our dataset is the only measurement of altered experimental consequence to date. Thus, it is tempting to speculate that at least some of these loss of abundance only variants are pathogenic by sole virtue of their lowered abundance despite scoring as WT-like or indeterminate in the yeast activity assay. However, it is also possible that noise in either assay, a lack of sensitivity to subtle but clinically meaningful loss of activity, or alterations in function not captured by the yeast activity assay, could be responsible for the pathogenicity of these variants.
We next asked how variants in each subset were distributed within the PTEN protein structure. At 42 positions, the majority of variants led to both loss of abundance and loss of activity, and these largely mapped to the buried regions of both PTEN domains (Fig. 2D, purple spheres). At 17 positions, the majority of variants led to a loss of abundance only, and these positions were located around the periphery of the PTEN structure, especially within the C2 domain (Fig. 2D, turquoise spheres). At 9 positions, variants led to a loss of activity only, and these largely mapped around the active site but included Ala333, a membrane-proximal residue on the C2 domain (Fig. 2D, orange spheres).
We then analyzed how each subset related to the variants found in publicly available databases for PHTS, ASD, and various tumors biopsied from cancer patients, or in the CC cohort [12]. We found 59 germline pathogenic or likely pathogenic variants associated with PHTS in ClinVar, or characterized as having classic PHTS symptoms in the CC cohort, that were also confidently scored in both abundance and activity datasets. Of these, 46 were PHTS only variants that were not observed associated with autism spectrum disorder or developmental disabilities in the SFARI database or the CC cohort. Compared to all possible single-nucleotide variants, or variants observed in unaffected population databases such as GnomAD or TOPMed, these PHTS only variants were enriched for loss of abundance and activity variants, as well as loss of activity only variants (Fig. 2E, purple and orange bars). In contrast, loss of abundance variants were slightly reduced, while WT-like variants were reduced (Fig. 2E, turquoise and green bars). There were 8 variants that were only associated with autism spectrum disorder. While the sample is small, there were enrichments in loss of abundance or loss of activity variants, but no enrichment in variants with concomitant losses to both abundance and activity. On the other hand, there were 13 germline variants associated with both classic PHTS symptoms and autism spectrum disorder. Seven of these variants were loss of both abundance and activity.
The enrichment of loss of activity and abundance variants in individuals with PHTS, particularly those also exhibiting autism spectrum disorders, was consistent overall with the phenotypes observed in a similar analysis by Mighell et al. [12]. There was a notable difference in the enrichments of loss of abundance only or loss of activity only variants. They observed more loss of abundance only variants associated with PHTS, whereas we observed more loss of abundance only variants associated with autism spectrum disorder. Their method differed from ours, as their analysis looked at frequencies of variants and variant classes observed within PTEN variant-positive individuals in their cohort, while we utilized a larger set of data where information on frequency was not always available. We thus examined how each abundance and activity class was populated by unique PTEN variants, and the aforementioned differences may be due to these methodological differences. Regardless, distinct clinical groups appear potentially enriched with different PTEN molecular phenotypes. In contrast, there were only slight enrichments and depletions observed in the 154 variants of uncertain significance (VUS) in ClinVar that were also confidently scored in both datasets. 52 of these variants exhibited either loss of abundance or loss of activity based on these criteria and may be prime targets for reclassification in the future.
Reclassifying these variants will require incorporation of PTEN-specific considerations for clinical interpretation by expert working groups [39]. Along these lines, the initial PTEN abundance and activity data were used to create a logistic regression model capable of separating clinically significant PTEN variants from other variants [12]. This model revealed that PTEN variants with intermediate activity in yeast or truncation-like missense variants were appreciably more likely to also cause PHTS, supporting the use of abundance and activity measurements as evidence of pathogenicity [12]. This study relied on imputed missing abundance scores to train their model. By providing additional experimental data and reducing reliance on imputation, the additional abundance data we provide here will empower refinement of PTEN variant reclassification.
Next, we examined enrichment of variants in the different PTEN abundance and activity subsets in breast, uterine, lung, colorectal, prostate, skin, and brain cancers found in various cancer genomics datasets accessed with cBioPortal [20, 21]. Here, the goal was to use our data to better distinguish different types of potentially cancer-driving, PTEN loss-of-function variants, from the potentially innocuous PTEN variants that may have coincidentally accumulated during tumor development. We estimated a null model of mutation in the absence of selection by calculating the frequencies of each subset possible through single nucleotide variation (Fig. 3, grey bars). WT-like variants were uniformly depleted as compared to our null model, likely due to corresponding enrichments of variants from the other functionally damaging subsets outcompeting them. Loss of abundance only variants also appeared to be depleted, likely due to the fact that, by definition, these variants retain at least partial activity, which may be sufficient to counteract oncogenesis in most circumstances. In contrast, variants exhibiting both a loss of abundance and activity were uniformly enriched across cancer types. This finding is consistent with the enrichment of low abundance variants we previously observed [9] and suggests PTEN loss of function through loss of abundance along with loss of activity is a common contributor to oncogenesis across cancer types.
The remaining loss of activity only subset is particularly interesting as it includes dominant-negative PTEN variants [6]. Homodimerization is thought to keep PTEN in its active conformation and allow it to exert maximal PtdIns(3,4,5)P3 phosphatase activity [40]. Accordingly, cells encoding a WT PTEN allele exhibited greater Akt intracellular signaling when the dominant-negative variant Cys124Ser was co-expressed as compared to a null or destabilized variant [9, 40]. Consistent with this observation, transgenic mice where one allele was replaced with known dominant-negative alleles such as Cys124Ser and Gly129Glu exhibited increased tumor burden [40, 41]. Thus, we examined this subset for both known and potentially uncharacterized dominant-negative variants amongst the differing cancers.
Loss of activity only variants were differentially enriched across cancers, with breast, uterine, prostate, and lung cancers exhibiting the greatest enrichments (Fig. 3A). The known dominant-negative variants Cys124Ser, Gly129Glu, Arg130Gly, and Arg130Gln were observed at a much higher frequency than predicted by the null model, where they were collectively only possible through 5 of the 3618 single nucleotide-driven codon changes (Fig. 3B). These known dominant-negative variants were the major contributors to the enrichment of loss of activity only variants, contributing 58% of the enrichment observed in breast, 87% in uterine, and 100% in lung cancers. Thus, no additional loss of activity only lung cancer variants were scored, while uterine and breast cancers had enriched loss of activity only variants in addition to the known dominant negatives (Fig. 3B). We hypothesized that this additional loss of activity only variants observed in uterine and breast cancers might represent new dominant-negative variants.
Identifying potential dominant-negative variants
To test this hypothesis and identify new PTEN dominant-negative variants, we quantified levels of Akt activation loop phosphorylation at Thr308 (pAkt) in cells expressing both a variant and a WT copy of PTEN [40] (Fig. 4, Additional file 1: Fig S5). We implemented this assay by expressing PTEN variants using our HEK 2393T landing pad cells, which already express WT PTEN [9]. Using this approach, we previously identified Pro38Ser as a dominant-negative variant, as it resulted in increased pAkt levels similar to the known dominant-negative Cys124Ser variant [9].
We chose a small panel of loss of activity only variants to screen using this assay including Asp24Gly, Asp92His, Arg130Pro, and Arg159Ser which were selected because they were observed multiple times in breast cancer, uterine cancer, or both. We also chose Tyr16Ser, Tyr46Asp, and Thr160Pro, which were observed one or zero times in tumors, allowing us to test whether observation in tumors correlated with dominant-negative activity. Finally, we included the known low abundance variants Leu345Gln and Asp252Gly, along with the known dominant-negative variant Cys124Ser, as controls.
As expected, all of the loss of activity only variants were expressed at near WT levels and were clearly distinguishable from Leu345Gln and Asp252Gly, the loss of abundance controls (Fig. 4A). Overexpression of WT PTEN reduced pAkt levels below that of unrecombined cells. In contrast, the loss of abundance control variants had pAkt levels similar to unrecombined cells, suggesting we could qualitatively distinguish variants presumably functional for PtdIns(3,4,5)P3 phosphatase activity over those that were inactive. Importantly, all of the loss of activity only variants exhibited pAkt levels equal to or greater than the unrecombined and loss of abundance controls, confirming these variants were, indeed, loss of activity. Along with the known dominant-negative variant Cys124Ser, the Arg130Pro and Asp92His variants exhibited markedly increased pAkt levels (Fig. 4B). This increase in pAkt signal was not due to elevated AKT1 expression, suggesting a specific effect on Akt1 signaling. The remaining loss of activity only variants exhibited intermediate pAkt signals and were thus less conclusive.
Importantly, each variant’s ability to drive Akt1 Thr308 phosphorylation correlated with the variant’s incidence in cancers (Pearson’s r: 0.76) (Fig. 4C). Arg130Pro was found to be mutated four times in sequenced breast cancers, and three times in uterine cancers, and once in an esophageal cancer (Fig. 4C). Asp92His was mutated in three independent instances of breast cancer. Consistent with these results, the Asp92His variant is present in the CAMA-1 breast cancer cell line tested in the Cancer Cell Line Encyclopedia (CCLE) [42]. This cell line exhibits normal PTEN expression and elevated pAkt, similar to other cell lines expressing the known dominant-negative variants Arg130Gln, Cys124Ser, and Arg130Gly in CCLE [42] (Fig. 4D). In contrast, Asp24Gly and Arg159Ser, which were mutated two times in breast and uterine cancers, respectively, did not exhibit increased pAkt intensity in our assay. Thus, the PTEN large-scale variant functional datasets, when combined with cancer genomics data, can identify dominant-negative variants that may exhibit altered disease severity.