Annotation of pseudogenic gene segments by massively parallel sequencing of rearranged lymphocyte receptor loci
- Jared Dean†1,
- Ryan O. Emerson†1,
- Marissa Vignali†1,
- Anna M. Sherwood1,
- Mark J. Rieder1,
- Christopher S. Carlson2 and
- Harlan S. Robins2Email author
© Dean et al. 2015
Received: 10 July 2015
Accepted: 5 November 2015
Published: 23 November 2015
The adaptive immune system generates a remarkable range of antigen-specific T-cell receptors (TCRs), allowing the recognition of a diverse set of antigens. Most of this diversity is encoded in the complementarity determining region 3 (CDR3) of the β chain of the αβ TCR, which is generated by somatic recombination of noncontiguous variable (V), diversity (D), and joining (J) gene segments. Deletion and non-templated insertion of nucleotides at the D-J and V-DJ junctions further increases diversity. Many of these gene segments are annotated as non-functional owing to defects in their primary sequence, the absence of motifs necessary for rearrangement, or chromosomal locations outside the TCR locus.
We sought to utilize a novel method, based on high-throughput sequencing of rearranged TCR genes in a large cohort of individuals, to evaluate the use of functional and non-functional alleles. We amplified and sequenced genomic DNA from the peripheral blood of 587 healthy volunteers using a multiplexed polymerase chain reaction assay that targets the variable region of the rearranged TCRβ locus, and we determined the presence and the proportion of productive rearrangements for each TCRβ V gene segment in each individual. We then used this information to annotate the functional status of TCRβ V gene segments in this cohort.
For most TCRβ V gene segments, our method agrees with previously reported functional annotations. However, we identified novel non-functional alleles for several gene segments, some of which were used exclusively in our cohort to the detriment of reported functional alleles. We also saw that some gene segments reported to have both functional and non-functional alleles consistently behaved in our cohort as either functional or non-functional, suggesting that some reported alleles were not present in the population studied.
In this proof-of-principle study, we used high-throughput sequencing of the TCRβ locus of a large cohort of healthy volunteers to evaluate the use of functional and non-functional alleles of individual TCRβ V gene segments. With some modifications, our method has the potential to be extended to gene segments in the α, γ, and δ TCR loci, as well as the genes encoding for B-cell receptor chains.
During T-cell development, immature T-cells undergo somatic rearrangement of their T-cell receptor (TCR) loci within the thymus . This rearrangement accounts for the vast sequence diversity found in mature TCRs, which in turn allows TCRs to bind to the great diversity of antigens presented by major histocompatibility complex molecules on the surface of other cells. The TCR protein is composed of two molecules, encoded by the TCRα and the TCRβ genes (or, in a small proportion of T-cells, by the TCRγ and δ genes). Diversity in the TCRβ and TCRδ chains results from the recombination of a large number of variable (V), diversity (D), and joining (J) gene segments, whereas only V and J gene segments recombine to generate the TCRα and TCRγ chains. Additional diversity is achieved by deletion and non-templated insertion of nucleotides at the junctions . A similar process occurs in B-cells, and results in the generation of heavy and light chains of the immunoglobulin receptors and secreted antibodies.
Different sources report variable numbers of V, D, and J gene segments for the TCRB locus [2–4]; for example, the international ImMunoGeneTics (IMGT) database reports 68 TRV, 14 TRJ, and 2 TRD gene segments, corresponding to 146, 16, and 3 alleles, respectively, including both functional and non-functional alleles . However, it is likely that there is some variation in gene numbers within the human population. Recent studies indicate that true polymorphisms are easily missed because of limitations in the number and diversity of individuals analyzed through low-throughput methods [5, 6], both in ethnically diverse populations such as Papua New Guineans and Mexicans, and in well-studied Caucasian cohorts (reviewed in ). The same sequencing studies have suggested that many polymorphisms have been reported in error . Thus, we sought to evaluate the presence of functional and non-functional alleles in a large cohort of healthy individuals using high-throughput sequencing of the variable region of the beta chain of the TCR.
It is estimated that several million distinct TCRβ sequences (measured by counting unique rearranged complementarity determining region 3 [CDR3] sequences) are present in the peripheral blood of a typical human, including many TCRβ sequences that utilize each of the available V, D, and J gene segments . The repertoire of germline gene segments comprising the genomic TCRβ locus is therefore an important contributor to sequence diversity in naïve T cells, and thus to the ability of the adaptive immune system to engage pathogens and to recognize aberrant proteins such as those generated by tumor cells .
Owing to the random nature of sequence editing during VDJ recombination, most rearrangements result in non-functional TCRβ genes: either a stop codon is created, the V and J gene segments are not in the same coding frame, or a pseudogenic gene segment that encodes a major defect outside the CDR3 region is incorporated. T cells in which the first VDJ rearrangement results in a non-functional TCRβ gene often undergo VDJ recombination of their second allele, thus allowing the cell a second chance to rearrange a functional TCRβ . After gene rearrangement, positive selection in the thymus ensures that all mature T cells have at least one TCRβ allele encoding a functional TCRβ protein; T cells carrying only non-productive rearrangements undergo apoptosis . In addition, some pseudogeneic alleles carry non-functional Recombination Signal Sequences (RSSs), which prevent their incorporation into rearranged genes [12, 13].
By definition, pseudogenes cannot encode for productively rearranged receptor genes, and therefore can only be observed as the second allele in cells that also express a functional TCRβ [11, 14]. Functional gene segments can be observed both in productive and in non-productive rearrangements. Herein, we take advantage of this fact to evaluate the use of functional and non-functional alleles of V gene segments in a large cohort of healthy volunteers. Because pseudogenic gene segments can only be observed in non-productive rearrangements, there is no selective pressure to ensure that CDR3 length is a multiple of three or that it lacks a stop codon, and thus rearrangements at the TCRβ locus including these gene segments should lead to genes with key motifs that are in-frame and have no premature stops less than one third of the time. In contrast, functional genes will have a much higher rate of CDR3 sequences that are in frame with the flanking V and J genes, because they must be in-frame and free of stop codons in all cases where they are part of a productive rearrangement, in addition to being in-frame one third of the time they are part of a non-productive rearrangement. For example, if we assume that 80 % of rearrangements of functional genes are productive, functional genes would be in frame approximately 87 % of the time (i.e., 80 % plus one third of 20 %).
Multiplex amplification across the VDJ junction allows direct sequencing of the genomic DNA of TCRβ CDR3 regions from millions of T cells simultaneously , thus enabling a much deeper analysis than previously possible. By sequencing tens of millions of TCRβ CDR3 sequences from hundreds of healthy volunteers and analyzing the frequency of unique productive and non-productive rearrangements that contain each TCRβ V gene segment, we derived rules that allow the classification of individual immune gene segments into functional and non-functional categories. Comparison of these results to IMGT , the most widely used immune gene database, showed that our data agree with the current functional annotation of the majority of TCRβ V gene segments. However, our results suggest that two TCRβ V gene segments that were only reported to contain functional alleles were overwhelmingly seen as non-functional alleles in our cohort, suggesting the existence of novel pseudogenic alleles that are present at high frequency in this population. We also identified a subset of TCRβ V gene segments that had not previously been recognized as having multiple alleles, whose functional status segregated in our cohort. Finally, two genes currently annotated as having both functional and non-functional alleles behaved uniformly in our cohort, one as a functional gene and the other as a pseudogene.
Human peripheral blood samples were obtained from healthy volunteers under a protocol following written informed consent approved and supervised by a Fred Hutchinson Cancer Research Center Institutional Review Board. The research included in this work conforms to the Helsinki Declaration. Further details of the cohort are given in ; data from 587 individuals in that cohort (i.e., those with data available at the time of writing this study) were used in the secondary analysis included in this work. The sequencing data for the 587 individuals can be accessed from www.adaptivebiotech.com/pub/Dean, and has also been deposited in the Dryad Digital Repository (doi:10.5061/dryad.t47g3).
High-throughput TCRβ sequencing
Genomic DNA was extracted from cell samples using the Qiagen DNeasy Blood Extraction Kit (Qiagen, Gaithersburg, MD, USA). We amplified and sequenced the CDR3 region of the rearranged TCRβ genes using previously described protocols [8, 17]. Briefly, a multiplexed polymerase chain reaction (PCR) method was employed using a mixture of 60 forward primers specific to TCR Vβ gene segments and 13 reverse primers specific to TCR Jβ gene segments, and 87 base pair (bp) reads were obtained using the Illumina HiSeq System (Illumina Inc., San Diego, CA). Raw HiSeq sequence data were preprocessed to remove errors in the primary sequence of each read, and to compress the data. A nearest neighbor algorithm was used to collapse the data into unique sequences by merging closely related sequences, to remove both PCR and sequencing errors. The TCRβ CDR3 region, as defined by the IMGT collaboration , begins with the second conserved cysteine encoded by the 3′ portion of the V gene segment and ends with the conserved phenylalanine encoded by the 5′ portion of the J gene segment. The number of nucleotides between these codons determines the length and therefore the frame of the CDR3 region. Each sequence was required to have a minimum six-nucleotide match to one of the V gene segments and one of the J gene segments. To ensure that our conclusions were conservative, sequences that did not unambiguously match a single V gene segment (because of deleted nucleotides or errors) were excluded from our analyses. This resulted in the exclusion of data from five V gene segments annotated as functional in the IMGT database because they did not contain any unique nucleotide sequence within our 87-bp sequencing reads (TRBV6-2 and TRBV6-3). Next, each rearrangement was scored as productive if (a) there were no stop codons in the reading frame of the CDR3 region, and (b) it was in frame with the V and the J gene (i.e., CDR3 length was a multiple of three nucleotides); or classified as non-productive otherwise.
We calculated the proportion of time that each TCRβ V gene segment was identified as being part of a productive rearrangement, the average of this metric for each gene segment in the cohort, and its standard deviation among all individuals. To define a threshold for functional and non-functional (i.e., pseudogene/open reading frame [ORFs]) genes, we first assumed that the current annotations present in the IMGT database are mostly correct, and thus calculated the median proportion of productive rearrangements both for gene segments currently annotated as functional and those currently annotated as non-functional; the median was used to avoid statistical artifacts from a small number of gene segments whose annotations might not match the alleles present in our cohort. Also, gene segments with known functional and non-functional alleles were ignored for this calculation. In an attempt to re-classify gene segments based on these medians, we used the midpoint to define 56.7 % as the threshold separating functional from non-functional genes.
To identify genes with both functional and non-functional alleles segregating within our cohort, we looked for substantial differences in the proportion of productive rearrangement between individuals. Specifically, we segregated gene segments into “fixed” (consistently functional or consistently non-functional) and “segregating” (functional and non-functional alleles) classes. To capture gene segments with concordant versus discrepant behavior across individuals, we used the standard deviation of the proportion of productive rearrangements as a metric. Again assuming that most IMGT annotations are correct but there are some outliers, we defined the median of each group using current annotations and took the midpoint of those medians as a threshold. This midpoint was then used to identify a threshold value of standard deviation (5 %) to annotate gene segments as either fixed or segregating with respect to functionality.
Final annotation for each gene segment was performed as follows: gene segments that displayed a standard deviation above 5 % were deemed to have both types of alleles (i.e., functional alleles and non-functional pseudogene/ORF alleles) segregating in our cohort. For gene segments with a standard deviation below 5 %, we annotated them as functional if they were seen in productive rearrangements 56.7 % of the time or more, and as non-functional/pseudogenes otherwise.
Results and discussion
In order to functionally annotate human TCRβ V gene segments, we performed deep sequencing of the TCRβ CDR3 regions on a set of PBMC samples obtained from 587 healthy volunteers . Genomic DNA was obtained from these samples, amplified using a multiplex PCR assay that targets the rearranged TCRβ CDR3 region, and sequenced as previously described . Each CDR3 sequence was aligned against reference sequences in the IMGT database to determine which V, D, and J gene segments were utilized in each rearrangement . Sequences were deemed to be productive if the rearranged gene segments were observed to maintain the coding frame between the V gene segment, the CDR3 region, and the downstream J gene segment, and no premature stop codons were detected. In total we generated 276 Gb of sequence data, and observed 117 million unique TCR sequences with unambiguous V gene classification from an estimated total of 179 million total T cells profiled. The data can be accessed from www.adaptivebiotech.com/pub/Dean.
Finally, although only observed in 186 individuals in our cohort, TRBV16-1, which is annotated in IMGT as a gene segment with functional and non-functional alleles, behaved in our cohort as a fully functional gene segment, displaying a percentage productive rearrangement of 86.4 % and a standard deviation of 3 %, whereas TRBV7-4, also currently annotated as having both functional and non-functional alleles, actually had the lowest percentage of productive rearrangements of all gene segments analyzed and a low standard deviation, and thus consistently behaved as a pseudogene in our cohort.
By assessing high-throughput sequencing data derived from peripheral T cells that have undergone thymic selection in a large cohort of healthy individuals, we annotated TCRβ V gene segments as either functional, pseudogene/ORFs, or as having both functional and non-functional alleles. We found that most TCRβ V gene segments currently annotated as functional genes in the IMGT database were observed as in-frame rearrangements approximately 90 % of the time, whereas the vast majority of TCRβ V gene segments annotated as either pseudogenes or ORFs by IMGT were found as in-frame rearrangements approximately 30 % of the time. Thus, in both these cases, our method confirms the current annotation for these gene segments. However, we observed a few notable discrepancies: TCRβ V gene segments TRBV6-8 and TRBV6-9 are currently annotated as functional; however, our analysis suggests the existence of pseudogenic alleles, because they were only found in productive rearrangements approximately 20 % of the time. Moreover, we observed several examples of genes annotated as having only functional genes in IMGT that appear to have both functional and non-functional alleles that segregate in this cohort. Finally, we saw one gene segment annotated as having both functional and non-functional alleles that consistently behaved as a functional gene segment in this cohort, while a second gene segment annotated as having both functional and non-functional allele consistently behaved as a pseudogene in this cohort.
Importantly, the method described in this study, which is ideally suited to study the TCRB locus, has the potential to be applied to other loci that undergo selection such as the α, γ, and δ gene segments of the TCR and the genes that encode for immunoglobulins. Although subtleties related to, for example, the different selective pressures that γ and δ T cells are exposed to will require some modifications to this approach , we expect that the basic logic described above will be applicable to other loci. In closing, we hypothesize that population-level variation in the repertoire of available TCR/immunoglobulin gene segments may represent an underappreciated source of heritable diversity in the adaptive immune system.
We thank Dr Erik Yusko for helpful comments and suggestions regarding data analysis and display.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Davis MM, Bjorkman PJ. T-cell antigen receptor genes and T-cell recognition. Nature. 1988;334:395–402.View ArticlePubMedGoogle Scholar
- Ashwell JD, Weissman A. T-cell antigen receptor genes, gene products and coreceptors. In: Rich RR, editor. Clinical immunology, principles and practice. 2nd ed. London: Mosby International Limited; 2001. p. 5.1–5.19.Google Scholar
- Janeway CASM, Travers P, Walport M. Immunobiology: the immune system in health and disease. 6th ed. New York: Garland Science; 2004.Google Scholar
- Giudicelli V, Chaume D, Lefranc MP. IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res. 2005;33:D256–261.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Y, Jackson KJ, Sewell WA, Collins AM. Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error. Immunol Cell Biol. 2008;86:111–5.View ArticlePubMedGoogle Scholar
- Gadala-Maria D, Yaari G, Uduman M, Kleinstein SH. Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. Proc Natl Acad Sci U S A. 2015;112:E862–870.PubMed CentralView ArticlePubMedGoogle Scholar
- Watson CT, Breden F. The immunoglobulin heavy chain locus: genetic variation, missing data, and implications for human disease. Genes Immun. 2012;13:363–73.View ArticlePubMedGoogle Scholar
- Robins HS, Campregher PV, Srivastava SK, Wacher A, Turtle CJ, Kahsai O, et al. Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood. 2009;114:4099–107.PubMed CentralView ArticlePubMedGoogle Scholar
- Miller JF, Sadelain M. The journey from discoveries in fundamental immunology to cancer immunotherapy. Cancer Cell. 2015;27:439–49.View ArticlePubMedGoogle Scholar
- Aifantis I, Buer J, von Boehmer H, Azogui O. Essential role of the pre-T cell receptor in allelic exclusion of the T cell receptor beta locus. Immunity. 1997;7:601–7.View ArticlePubMedGoogle Scholar
- Malissen M, Trucy J, Jouvin-Marche E, Cazenave PA, Scollay R, Malissen B. Regulation of TCR alpha and beta gene allelic exclusion during T-cell development. Immunol Today. 1992;13:315–22.View ArticlePubMedGoogle Scholar
- Nadel B, Tang A, Escuro G, Lugo G, Feeney AJ. Sequence of the spacer in the recombination signal sequence affects V(D)J rearrangement frequency and correlates with nonrandom Vkappa usage in vivo. J Exp Med. 1998;187:1495–503.PubMed CentralView ArticlePubMedGoogle Scholar
- Feeney AJ, Tang A, Ogwaro KM. B-cell repertoire formation: role of the recombination signal sequence in non-random V segment utilization. Immunol Rev. 2000;175:59–69.View ArticlePubMedGoogle Scholar
- Oltz EM. Regulation of antigen receptor gene assembly in lymphocytes. Immunol Res. 2001;23:121–33.View ArticlePubMedGoogle Scholar
- Lefranc MP. IMGT® databases, web resources and tools for immunoglobulin and T cell receptor sequence analysis. Leukemia. 2003;17:260–6.View ArticlePubMedGoogle Scholar
- Emerson R, DeWitt W, Vignali M, Gravley J, Desmarais C, Carlson C, et al. Immunosequencing reveals diagnostic signatures of chronic viral infection in T cell memory. bioarXiv. 2015. doi:http://dx.doi.org/10.1101/026567.
- Carlson CS, Emerson RO, Sherwood AM, Desmarais C, Chung MW, Parsons JM, et al. Using synthetic templates to design an unbiased multiplex PCR assay. Nat Commun. 2013;4:2680.View ArticlePubMedGoogle Scholar
- Yousfi Monod M, Giudicelli V, Chaume D, Lefranc MP. IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics. 2004;20 Suppl 1:i379–385.View ArticlePubMedGoogle Scholar
- Adams EJ, Gu S, Luoma AM. Human gamma delta T cells: evolution and ligand recognition. Cell Immunol. 2015;296:31–40.View ArticlePubMedGoogle Scholar