Choice of transcripts and software has a large effect on variant annotation
© McCarthy et al.; licensee BioMed Central Ltd. 2014
Received: 7 October 2013
Accepted: 20 March 2014
Published: 31 March 2014
Skip to main content
© McCarthy et al.; licensee BioMed Central Ltd. 2014
Received: 7 October 2013
Accepted: 20 March 2014
Published: 31 March 2014
Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail.
This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl’s Variant Effect Predictor), when using Ensembl transcripts.
We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies.
Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
The advent of accessible and relatively inexpensive high-throughput sequencing technology has resulted in extensive sequencing of whole human genomes or exomes in a research setting and seems likely to lead to an explosion of genomic sequencing in a clinical context. While there remain challenges in unambiguously determining an individual’s genome or exome sequence [1, 2], our focus here is on the downstream interpretation of that sequence. Let us take as a starting point a specified list of positions, assumed to be correct, at which the nucleotides in the individual’s sequence differ from the human reference sequence. We will restrict our scope here to single nucleotide variants (SNVs) and short indels. A crucial step in linking sequence variants with changes in phenotype is variant annotation.
Variant annotation is the process of assigning functional information to DNA variants. There are many different types of information that could be associated with variants, from measures of sequence conservation  to predictions about the effect of a variant on protein structure and function [4–6]. Here we focus on the most fundamental level of variant annotation, which is categorising each variant based on its relationship to coding sequences in the genome and how it may change the coding sequence and affect the gene product.
The coding sequences of the genome are, broadly speaking, the genes: ‘gene’ has come to refer principally to a genomic region producing (through transcription) polyadenylated mRNAs that encode a protein . We refer to these polyadenylated mRNAs as ‘transcripts’, although the term transcript can refer to any RNAs produced from the transcription of a genomic DNA sequence. Thus, there are non-coding transcripts that do not encode a protein, but nevertheless can have a function, for example in regulation. When considering transcripts in the context of genomic DNA sequences, a transcript is defined by its exons, introns and UTRs and their locations. Many separate transcripts may overlap any given position in the genome, and it is not uncommon for genes to have many different transcripts (or ‘isoforms’), of which they tend to express many simultaneously .
Our understanding of the protein-coding sequences in the genome is summarised in the set of transcripts we believe to exist. Thus, variant annotation depends on the set of transcripts used as the basis for annotation. The widely used annotation databases and browsers – ENSEMBL, REFSEQ and UCSC – contain sets of transcripts that can be used for variant annotation, as well as a wealth of information of many other kinds as well, such as ENCODE data about the function of non-coding regions of the genome. A transcript set may therefore also contain information about regions of the genome that regulate expression of transcripts.
To annotate DNA variants we therefore require a set of transcripts that summarises our understanding of the genome. For each variant, we use a software tool to determine the likely effect of the variant based on the transcripts (or other genomic features) that overlap the variant’s position. One or more possible annotations for the variant can then be reported.
Frequently, however, variant annotation is more complex. Typical pipelines are not well suited for handling a variant that could have one consequence for one transcript and a different consequence for a different transcript. Even where a gene is relatively well defined and does not overlap other genes, we may have many transcripts (isoforms of the gene) to choose from, often supported by varying levels of evidence for their existence and structure. It is common for a gene to have multiple transcripts overlapping a given position in the genome, so given a set of transcripts a software tool has to choose which one to use. If it provides an annotation for the variant for each transcript, then the question becomes what annotation to report. If the software reports all possible annotations from all possible transcripts then the user must decide how to prioritise different, competing annotations, or how to integrate them into downstream analysis. This issue is further exacerbated in the uncommon case of a single variant affecting multiple genes (each of which likely has multiple transcripts). Current annotation tools vary in approaches to reporting consequences of a variant in multiple genes at once. Choice of the underlying set of transcripts used for annotation can give the user more control over transcript use. Transcript sets from different sources can have different characteristics. For example, both ENSEMBL and REFSEQ contain transcripts established from experimental evidence utilising automated annotation pipelines and manual curation, but their precise requirements for inclusion of transcripts differ. The result is that the ENSEMBL transcript set is larger than the REFSEQ set (see Additional file 1), but the REFSEQ transcript set is not simply a subset of the ENSEMBL transcript set.
Related to this issue is the fact that any given variant can have several plausible annotations, even when considering just a single transcript as the basis for annotation. Choosing the ‘best’ annotation is frequently not clear-cut, as in the case of the variant NC_000006.11:g.30558477_30558478insA, a single-base insertion at the end of an exon (Figure 1B). This variant could be annotated as a frameshift insertion in a coding sequence (which it is), or as a stop-loss variant (as it falls in a stop codon). In fact, the correct annotation is that this is a synonymous variant. In many cases we would be misled into thinking that the variant is a frameshift or stop-loss variant, and therefore be likely to assume it has a functional effect and include it in any list of variants of interest for further investigation. Indeed, one of the software tools used for this study reports a frameshift insertion annotation and the other a stop-loss annotation for this variant, when using ENSEMBL transcripts. In this example there seems to be a single best annotation, but many cases are more ambiguous, with several equally valid possible annotations. The software tool must make some sort of choice in such cases as to which annotation to report for the variant (and transcript used). There are many other annotation tools available (for example, Mutalyzer 2 , VAT , VAAST 2.0 , GATK VariantAnnotator  and SnpEff ), which will have better or worse performance for certain variants, but here we want to make the more general point using two very widely used annotation tools, that there is a large degree of discrepancy between any two annotation tools, and researchers need to be aware of this when choosing a tool and conducting analysis.
A third major issue complicating variant annotation is the question of how to deal with genes and pseudogenes. We have widely varying levels of information available for different genes. Should we treat variants in well-characterised genes in the same way as those in pseudogenes or non-genic regions of the genome? There is not currently a clear solution to this issue, although distinctions are usually made between annotations given from coding and non-coding transcripts. Again, careful choice of transcript set used for annotation can help.
Transcript set: a summary of information about genomic features, particularly the structure of transcripts (sequence and locations of exons, introns, UTRs and regulatory regions), used as the basis for determining the likely functional consequence of a variant.
Software tool: a piece of software that when given a particular variant can query a transcript set and return the functional annotation (or possibly annotations) of that variant. An annotation tool uses a particular algorithm applied to a given set of transcripts for annotating variants.
We examine the effects of fixing one and then the other on a set of over 80 million SNVs and short indels from a large clinical sequencing project (see Methods). ANNOVAR is a popular annotation software tool, so we compare the results from ANNOVAR when used with the REFSEQ and ENSEMBL transcript sets. We also compare the annotation results from ANNOVAR and another popular annotation tool, VEP, the Variant Effect Predictor tool from ENSEMBL, when using the ENSEMBL transcript set and characterise the sorts of differences in annotation between the two tools and the apparent errors that ANNOVAR and VEP tend to make in annotation. Beyond issues specific to these particular transcript sets and software tools, we consider good practice for whole-genome annotation and problems that are yet to be solved.
The data used in this paper come from the WGS500 Project, a collaboration between the University of Oxford, Oxford Biomedical Research Centre and Illumina, Inc, to sequence 500 genomes of clinical relevance. Samples were accepted from patients where positive findings would have immediate clinical translational relevance in terms of clinical diagnosis, prognosis, genetics counselling and reproductive options, or treatment selection. As seen in some of the published studies that participated in the WGS500 project [26–32], this large umbrella project consists of many smaller sub-projects, focusing on particular diseases. All patients in this study gave written informed consent. The relevant research ethics committee (REC) reference numbers are: Central Oxfordshire Research Ethics Committee (05/Q1605/88), Hammersmith and Queen Charlotte’s and Chelsea REC (06/Q0406/151), NRES Committee South Central—Oxford B (12/SC/0381), NRES Committee South Central—Southampton A (12/SC/0044), NRES Committee North West—Haydock (03/0/97 version 3), NRES Committee Yorkshire & The Humber—South Yorkshire (10/H1310/73), Oxfordshire Research Ethics Committee (06/Q1605/3), Oxfordshire Research Ethics Committee B (04.OXB.017; 09/H0605/3) and Oxfordshire Research Ethics Committee C (09/H0606/74; 09/H0606/5), Riverside Research Ethics Committee (09/H0706/20) and Southampton and South-West Hampshire REC A (06/Q1702/99). The research conformed to the Helsinki Declaration and to local legislation.
We focus here on whole genomes of 276 individuals sequenced as part of the WGS500 project. The samples included 80 patients with immune disease, 151 individuals from Mendelian disease studies (primarily parent–child trios) and 45 germ-line DNA samples from cancer patients. The sequencing was conducted using 100-bp paired-end protocols on either the Illumina HiSeq 2000 instrument  or the Illumina HiSeq 2500 in standard mode , with a mixture of v2.5 and v3.0 chemistries, to at least 25 × average coverage. Sequence reads were generated using the Illumina Off-Line Basecaller (v1.9.3)  and mapped to the human reference genome GRCh37d5/hg19d5 using Stampy, predominantly versions 1.0.12_(r975) and 1.0.13_(r1160) . Picard (picard-tools v1.67) was used to merge data and de-duplicate merged BAM files . Variants were called from the aligned sequence reads using Platypus, version 0.1.9 . The raw data for annotation are VCF (variant call format) files  containing information about the called variants.
In total, 80,995,744 unique variant calls were obtained from 276 individual genomes in the fifth freeze of the project’s data, and merged into a preliminary union file. We compare functional annotations for 80,981,575 variants from the preliminary union file using transcript sets from different genome annotation databases and different annotation software tools, restricting ourselves to the set of variants for which an annotation was obtained using at least one transcript set or software tool.
Variant annotations were obtained using the software tool ANNOVAR(version 2013Feb21), using both the REFSEQ (release 57, January 2013) and ENSEMBL (version 69, October 2012) transcript sets [10, 40]. We used the default transcript sets from REFSEQ and ENSEMBL. REFSEQ records are selected and curated from public sequence archives, so a REFSEQ record represents a synthesis, by a person or group, reducing the redundancy in the database. The REFSEQ database does not contain all possible (or even all observed) transcripts or gene models, but those that it does annotate feature strong evidence for their existence, structure and (possibly) function. Of a total of 105,258 human transcripts in REFSEQ release 57, 41,501 were used by ANNOVAR in the reported annotations for the variants in this study.
Similarly, ENSEMBL provides genome resources for chordate genomes with a particular focus on human genome data. ENSEMBL makes available substantial and diverse transcript information, including the CCDS [13, 41], Human and Vertebrate Analysis and Annotation (HAVANA) , Vertebrate Genome Annotation (Vega) , ENCODE data  and the GENCODE gene and transcript sets . There are 208,677 transcripts in ENSEMBL version 69, of which 115,901 were used in reported annotations for this comparison.
A broad interpretation of splicing regions was used for ANNOVAR annotations, so that all variants within six bases of an intron/exon boundary would fall into ANNOVAR’s splicing annotation category. ANNOVAR returns a single annotation for each variant. If there are several relevant transcripts for a particular variant, then ANNOVAR will return the annotation with the most severe consequence according to its rules of precedence.
Variant annotations were also obtained using version 2.7 of ENSEMBL’s VEP, based on the ENSEMBL version 69 transcript set. As VEP returns all possible annotations for each variant (given the transcripts present at each variant’s location in the genome), we prioritised annotations using a common-sense ranking of the ‘severity’ of the consequence of the variant (Additional file 1: Table S3) to make the VEP annotation results directly comparable with those from ANNOVAR. This prioritisation for consequences from VEP is just one possible way to prioritise variants and this subjectivity could affect the extent of matching between annotations from ANNOVAR and VEP. The most severe consequence for each variant was reported and compared to the ANNOVAR results.
We compared results across all annotation categories for the REFSEQ/ENSEMBL comparison. A comparison table (union_rfs_ens_comparison.tab), was produced with a custom Perl script from VCF files containing ANNOVAR annotations when using REFSEQ and ENSEMBL transcripts and gene information for the transcript(s) used for each annotation. ANNOVAR reports only the ‘most damaging’ annotation, but can return transcript information for all transcripts that would give the annotation reported. Subsequent statistical analysis was done in R version 2.15.0 .
For the comparison of ANNOVAR and VEP we focused on exonic variants (and especially loss-of-function (LoF) and nonsynonymous variants) for the ANNOVAR/VEP comparison as these are currently of the greatest interest in the majority of annotation applications in whole-genome sequencing (WGS) studies. A VCF file containing all variants for comparison with annotations from ANNOVAR was processed to obtain VEP annotations, and the results were processed with a custom Python script to create a table of variants (ANV_VEP_ens_comparison_best_annos.tab). The table provides annotation results obtained using ENSEMBL transcripts with ANNOVAR and VEP, and information on transcripts used. The table was then analysed with R. We used the ENSEMBL Web Browser (archive version of ENSEMBL 69)  and the UCSC Genome Browser  to inspect sets of variants identified to be of particular interest by comparing annotations using the DNA sequence and other information available in the browser. Source code for the analyses described here is available in the repository containing the data, along with a ‘README’ file that provides more details about the data and source code files.
Putative LoF variants: variants that are likely to cause a gene product to be subject to nonsense-mediated decay and result in lost (or impaired) function of the gene. We include in this high-level category frameshift deletions, frameshift insertions, and stop-gain, stop-loss and (most) splicing variants. Where finer resolution splicing categories are available (for example from VEP and some other annotation tools), we classify variants in splice acceptor and splice donor sites as LoF and other splicing variants as generically exonic (defined below). Annovar does not provide subcategories of splicing variants, so for our study we include all splicing variants in the LoF high-level category.
Nonsynonymous and missense variants: variants in exons that change the amino acid sequence encoded by the gene (but are not LoF), including single-base changes and nonframeshift indels. For this study we include VEP’s ‘splice_region_variant’ annotation in the missense high-level category as this reflects the fact that general splice region variants are usually of a similar level of interest as canonical missense variants.
Synonymous variants: variants located in exons that do not change the translated amino acid sequence.
Exonic variants: variants that fall anywhere in exons or splicing regions, so this includes all variants in the LoF, nonsynonymous and synonymous high-level categories above.
The exact terms used to denote annotation categories differ between ANNOVAR and VEP, but the correspondence in terms is almost always clear (Additional file 1: Table S4). There are three exonic categories used by VEP (initiator codon variant, stop retained variant and other coding) for which there is no direct equivalent among the ANNOVAR categories.
The comparison of annotation results from ANNOVAR using either the REFSEQ or ENSEMBL transcript sets shows that the choice of transcript set has a large effect on the ultimate variant annotations. Across all 80 million variants there is an overall match rate of 85%. However, the matching annotation rate is 44% for LoF variants, the set of variants of most interest for biological and medical studies. The match rate is also substantially lower than the overall match rate for variants in non-coding RNA and UTR regions, but there is better agreement for exonic and intronic variants. This observation accords with what we would expect: in areas of the genome where more is known about the protein-coding structure of the sequence, the annotations when using the two transcript sets agree more closely.
Same software, different transcripts: REFSEQ vs ENSEMBL by ANNOVAR annotation category
ALL LOF and MISSENSE
The asymmetry in the differences in annotations between REFSEQ and ENSEMBL is striking. We see many more exonic annotations, across all LoF, nonsynonymous and synonymous categories, when using ENSEMBL transcripts (Table 1 and Additional file 1: Table S1). There are several thousand variants that are called exonic by ENSEMBL and yet are called as intergenic, intronic or in a non-coding RNA by REFSEQ. Conversely, there are only a few hundred exonic variants from REFSEQ that are annotated as intergenic, intronic or in non-coding RNA according to ENSEMBL. Using ENSEMBL here would gain over 2,000 frameshift indels and over 1,000 stop-gain/stop-loss variants compared with using REFSEQ, which all LoF variants of substantial interest for follow-up. This asymmetry is not surprising when we consider the composition of the two transcript sets. The REFSEQ set contains 105,258 human transcripts in release 57, for which the protein-coding sequences cover approximately 1.07% of the genome (34 Mb). ANNOVAR actively used 41,501 of these transcripts for annotation of this set of variants. The ENSEMBL version 69 set contains 208,677 transcripts (192,635 on chromosomes 1 to 22, X and Y, excluding patches and alternate loci), covering approximately 28% of the genome (892 Mb), including introns. The protein-coding sequences in the ENSEMBL transcript set cover approximately 1.12% of the genome (35 Mb). Of these transcripts, 115,091 were actively used for the annotation of this set of variants, including the set of 92,776 transcripts containing protein-coding sequences.
This extent of discrepancy in annotations can be partially explained by the fact that a high proportion of REFSEQ transcripts have an equivalent or highly similar transcript in ENSEMBL, but in the other direction there are many transcripts in ENSEMBL that do not appear to have a similar transcript in REFSEQ. ANNOVAR reports the most severe consequence for a variant across all transcripts present at that position in the genome, so with more transcripts available when using ENSEMBL there is an elevated chance of finding a more severe consequence for one of the ENSEMBL transcripts. Examples of variants with striking differences in annotation help to characterise the sorts of differences seen (Additional file 1: Figures S1 to S8). We saw no significant differences in annotation agreement rates across different variant frequencies (Additional file 1: Table S5a).
Same transcripts, different software: ANNOVAR and VEP annotations for exonic variants
ANV + VEP
OTHER CODING total
ALL LOF and MISSENSE
In total, 637,841 variants were given exonic annotations by either ANNOVAR or VEP (Table 2). Of these, 551,983 (86.5%) had exactly matching annotations from the two tools and 556,387 (87.2%) have category matching annotations. However, the match rate is substantially lower (65% for exact matches, 66% for category matches) for LoF annotations (Table 2). We observe that 89% of exonic variants from VEP get an exactly matching annotation from ANNOVAR and 96% of exonic variants according to ANNOVAR get an exactly matching annotation from VEP. These percentages of agreement should not be taken to show that ANNOVAR is ‘more accurate’ than VEP – the difference between the tools for exonic variants is driven by the larger number of splicing annotations from VEP, which is due to a difference in the definition of a splicing variant used by the two tools.
Considering all annotation categories for VEP and ANNOVAR annotations shows a substantial amount of disagreement in annotations from the two tools, even when using the same transcripts (Additional file 1: Figures S1 and S2). We observe relatively lower concordance for intergenic, intronic, miRNA and splicing variants. Even in well-defined categories such as nonsynonymous (missense) and frameshift, we see a large amount of disagreement in annotations between the two tools. We saw no significant differences in annotation agreement rates across different variant frequencies (Additional file 1: Table S5b).
To characterise the sorts of apparent errors or inconsistencies that commonly emerge in annotation by ANNOVAR and VEP, we investigated cases for which annotations from ANNOVAR and VEP disagree. Although it is counter-intuitive (since the annotations were based on the same set of transcripts), ANNOVAR and VEP do not always use the same transcript for the annotation of a variant. This is a result of the interaction of different annotation categories, different precedence rules and the fact (for this study) of reporting only one consequence for each variant. When characterising differences and apparent errors in annotation, we looked at variants for which we know ANNOVAR and VEP did indeed use the same transcript as the basis for annotation. We focused on LoF variants – frameshift, stop-gain, stop-loss and splicing – as they are currently of most interest in disease studies, and we saw better than 90% agreement between ANNOVAR and VEP annotations for nonsynonymous and synonymous variant categories (Table 2). Where possible (as in the case of splicing annotations), we discuss differences in annotation algorithms that are likely causes of differences in annotation, but detailed information on annotation algorithms is not available for ANNOVAR or VEP, even in online documentation [49, 50].
We observed over 2,000 variants that are annotated as frameshift by either ANNOVAR or VEP but not the other (Additional file 1: Table S6). Among these, we found that ANNOVAR annotates over 300 variants as frameshift despite them being SNVs, so the ANNOVAR annotation is unequivocally incorrect for these variants. For the majority of these variants, however, it is not possible to say conclusively from manual inspection whether the ANNOVAR or VEP annotation is correct.
All of the variants that are annotated as frameshift by VEP but not by ANNOVAR are genuine indels and none are a multiple of three bases, so VEP looks to be correctly identifying these variants as frameshift indels. Several hundred variants get nonframeshift, nonsynonymous and synonymous annotations from ANNOVAR, which are incompatible with the frameshift annotations from VEP. The frameshift annotations seem reasonable, so ANNOVAR looks to give incorrect annotations for these variants. Several hundred other variants are annotated as stop-gain by ANNOVAR and frameshift by VEP. A stop-gain annotation is not necessarily incompatible with a frameshift annotation from VEP, as ANNOVAR inspects the transcript produced by the insertion/deletion and sometimes finds that a stop codon is introduced by the indel. Following its precedence rules, it then returns an annotation of stop-gain rather than frameshift. The disagreement between annotations for such variants is thus reasonable once we take into account how the two tools report annotations. From looking at specific examples it appears that only a small fraction of variants get an incorrect annotation from both software tools.
When we look at the variants annotated as stop-gain by ANNOVAR, but not by VEP (when the same transcripts are used), we see that the majority (437 of the 570) are given frameshift annotations by VEP (Additional file 1: Table S7). We saw above that ANNOVAR’s precedence rules can lead it to give a stop-gain annotation to an indel for which frameshift would otherwise be a reasonable annotation. Here too, all of the variants annotated as frameshift by VEP seem to be genuine frameshift variants (as they are indels that are not a multiple of 3 bp in size). These discrepancies, therefore, reflect a difference in precedence for reporting annotations, rather than a true difference between the annotation algorithms, and the ANNOVAR annotation (assuming it correctly identifies introduced stop codons) adds information of interest. There is a much smaller number of variants given missense (77) and synonymous (39) annotations by VEP (Additional file 1: Table S7a).
Manual inspection in the ENSEMBL Genome Browser of ten of those discrepant variants on chromosome 1 shows that for eight of the ten missense (from VEP) variants, the VEP annotation looks correct (for two variants neither annotation looks correct; see Additional file 1: Table S10 for details of these variants). For other discrepant variants, manual inspection reveals that the VEP annotation looks correct more often than the ANNOVAR annotation (see Additional file 1: ‘Supplementary Results’ for more details). When we look at variants annotated as stop-gain by VEP and either frameshift or nonframeshift by ANNOVAR, we see that approximately 20% (30 variants) of these are SNVs, which cannot be correctly annotated as frameshift or nonframeshift (as these terms only apply to an insertion or deletion). Thus, the ANNOVAR annotations for these particular variants cannot be correct, and must simply be a result of a software bug. For the remaining variants it is difficult to assess whether the ANNOVAR or VEP annotation is better. Even after taking into account the differences in annotation caused by different precedence rules, the stop-gain annotations from VEP look more reliable than those from ANNOVAR.
There are only small numbers of variants that are annotated as stop-loss by ANNOVAR and not by VEP, but almost all of these are annotated as frameshift by VEP. Inspection reveals that all of these variants are indeed indels that are not a multiple of three bases, therefore annotations of frameshift from VEP are reasonable. Looking closely at the variants reveals that there is a roughly even split between when the ANNOVAR or the VEP annotation look better. There are only 16 variants that are annotated as stop-loss by VEP and as something else by ANNOVAR when the two tools use the same transcript for annotation (Additional file 1: Table S8).
The category (or categories) of splicing variants is a source of many differences in annotations from different annotation software tools. Unlike most other categories of annotation, in the field there are still multiple notions of what entails a splicing variant. ANNOVAR defines just one broad category, splicing, for these variants: any variant within x bp of a splicing junction receives the annotation splicing. The value of x can be specified by the user of ANNOVAR, and for our annotations here we used a broad definition of splicing by setting x=6. In contrast, VEP uses three categories of splicing variant: (1) splice donor variant, a splice variant that changes the two-base region at the 5′ end of an intron; (2) splice acceptor variant, a splice variant that changes the two-base region at the 3′ end of an intron and (3) splice region variant, a sequence variant in which a change has occurred within the region of the splice site, either within one to three bases of the exon or three to eight bases of the intron. VEP thus gives more useful information, through its subcategories of splicing variants, about the likely function of a variant. We also see that differences in annotation can arise simply as a result of differing definitions of what a splicing variant is, rather than any truly substantial differences in the algorithms producing the annotations. We investigated these differences in annotation on variants where both tools used the same transcript for annotation, and annotations did not match, that is, a variant with a splicing annotation from ANNOVAR did not get an annotation of one of splice donor variant, splice acceptor variant or splice region variant, or the inverse.
The major source of difference in splicing annotations is that the overwhelming proportion of ANNOVAR splicing variants that receive non-splicing annotations from VEP actually receive one of VEP’s three splicing annotations, but reported as being in a non-coding transcript (Additional file 1: Table S9). This result suggests that VEP does a better job at reporting when the transcript it uses for annotation is non-coding, but that there may actually not be such a large degree of difference between splicing annotations as appears initially. We also see here the combined effect of different definitions of splicing variants and precedence rules that result in a splicing variant found in one transcript being reported instead of a less ‘serious’ variant seen in another transcript. We see a large number of variants annotated as synonymous by ANNOVAR and as splice region variant by VEP, and all are in an exon, either in the first three bases (5′ end) or last three bases (3′ end) of the exon. Thus, these annotation differences seem to be a systematic result of differences in the annotation algorithms used by ANNOVAR and VEP, and for these variants the VEP annotations look to be better.
The results of our comparison of annotations obtained using REFSEQ and ENSEMBL transcript sets emphasise the importance of the choice of transcript set used for annotation. Applying the same annotation software with different transcript sets saw a matching rate of 44% for putative LoF annotations. Though not done here, transcript sets from REFSEQ and ENSEMBL (or other sources) can be restricted to a subset of transcripts to exclude low confidence annotations. Where a specific tissue of interest is known, annotation could be restricted to use only the set of transcripts known to be expressed in that tissue. Defining a targeted set of transcripts will not always be easy, but for sequencing studies where the cost of false positives (e.g. through follow-up experiments) is high, and where information on the expression of specific transcripts exists, a set of high-confidence transcripts tailored to the study at hand may be preferable. Projects like GENCODE aim to provide a carefully curated transcript set supported by experimental evidence [15, 51–53], so through efforts such as these we may see annotation results converge as (ideally tissue-specific) transcript sets align across different repositories. For the time being, though, large differences remain.
Variant annotation remains challenging for current software tools: differing choices made in annotation packages on how to analyse, categorise and prioritise annotations for a variant lead to differing annotations from different tools, even when using the same set of transcripts as the basis for annotation. Differences in annotations from different software tools (e.g. 64% overall agreement for LoF annotations) are not as large as those seen when using different transcript sets (44% overall agreement for LoF annotations), and are often caused by differences in the annotation categories defined by different tools. Nevertheless, the extent of the differences seen show that, again, careful consideration must be given when choosing a software tool to make sure that it is well suited to the goals of the scientific investigation.
Standardising definitions of variants across the field, to reduce the scope for apparent differences in annotations returned by different software tools and to crystallise the (epistemic) meaning of terms used for annotations, could be of value. In our results here, for example, differing definitions of splicing variants cause tens of thousands of annotation differences. The Sequence Ontology Project  may help with this. It would be beneficial for phase information to be used in annotating variants in close proximity, given, for example, the extent of ‘rescue’ of LoF variants by nearby variants . Currently, annotation tools typically do not associate any measure of uncertainty with reported variant annotations. Such information could be useful for downstream analysis, especially for consideration when allocating resources for follow-up experiments on variants of interest. When a high level of certainty about the validity of an annotation is required, variants could be annotated with two software tools and variants with differing annotations flagged to be treated with caution.
In the comparison of annotation tools here, we restricted each tool to report just the most severe consequence annotation for each variant, to avoid comparisons becoming too unwieldy. However, VEP and other annotation tools can (and often by default do) report annotations for all transcripts, providing extra information that is often valuable. Adding this extra information, as with utilising phase information or tissue-specific transcripts, increases the challenges for data processing and interpretation by adding complexity to the treatment of variant annotation, but with good reason: this added complexity reflects the underlying biology, so taking this information into account potentially adds significant value to analyses of DNA variants.
Our understanding of the human genome continues to improve rapidly even as we gain a better appreciation of the genome’s complexity. As a result, at some point we may see the variant annotations from different approaches converge. For the time being, though, we confront an epistemic challenge (determining the meaning or function of variants observed) because our ontological foundations (knowledge and understanding of what all sequences in the genome actually do) remain unresolved or unclear. Thus, the choices of transcript set and software tool can have substantial effects on the annotation results obtained, and from there, large effects on all downstream aspects of the analysis of WGS data. Variant annotation is not yet a plug-and-play procedure and should not be treated as such.
In addition to different variant annotation approaches (of which there are more than we have compared here), there are different sequencing technologies, read mappers and variant callers. Each of these can potentially have substantial impact on the final variants and annotations obtained, but comparison of other sources of variation is beyond the scope of this paper. We refer interested readers to systematic comparisons of other aspects of the next-generation sequencing pipeline, for example comparisons of benchtop high-throughput sequencing technologies , short-read mappers , variant callers  and variant-calling pipelines as a whole [59, 60].
We have aimed to highlight the effect on final annotation results that can arise from two aspects of analyses of whole genome (or whole exome) sequence data, namely, choice of transcript and choice of annotation software. While we are not advocating any particular software or transcript set, we suggest researchers be aware of the impact of these choices, and hope our comparisons may inform such decisions.
We have quantified the extent of disparity in variant annotation when different transcript sets and different software tools are used. This comparison of annotations for 80 million human DNA variants revealed many substantial differences between annotations based on different transcript sets and different software tools. The extent of differences in annotations was particularly large in annotation categories of most interest, namely, putative LoF and nonsynonymous variants. We found many more variants with annotations in interesting categories when using ENSEMBL transcripts compared with REFSEQ transcripts only. If it is important not to miss potential LoF variants, then there are advantages to using ENSEMBL transcripts. If it is important to reduce false positives, then a carefully curated set of transcripts tailored to the study at hand may be preferred. Even when using the same transcript set, different annotation software packages can provide substantially different annotations.
There are variants with potentially severe effects that are identified with one method and not another. We require consistent, accurate and reliable annotation of variants to support the use of WGS in making diagnostic and treatment decisions. The dependence of current annotation results on the set of transcripts and software used can be managed, with sufficient care, in the research context. However, more work is required to improve variant annotation for clinical use. The differences in annotation due to choice of transcript set and software package quantified here should be given due consideration when undertaking variant annotation in practice. Careful thought needs to be given to the choice of transcript sets and software packages for variant annotation in sequencing studies.
Consensus Coding Sequence
loss of function
single nucleotide variant
Variant Effect Predictor
The authors wish to thank the patients and their families for participating in this study. We are also grateful to Alistair Pagnamenta for helpful discussions and suggestions to improve the analysis and manuscript, and Gil McVean and Jenny Taylor for their comments on an earlier version of the manuscript. This work was supported by a Wellcome Trust Core Grant for the Wellcome Trust Centre for Human Genetics (090532/Z/09/Z). PD is supported by a Wellcome Trust Senior Investigator Award (095552/Z/11/Z). DJM is supported by a General Sir John Monash Scholarship from the General Sir John Monash Foundation, Australia. MAR is funded by a Clarendon Scholarship, NDM Studentship and Green Templeton College Award from the University of Oxford.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.