Genome annotation for clinical genomic diagnostics: strengths and weaknesses

Table 3 Comparison of computationally derived annotation versus manually derived annotation

Annotation procedure	Automatic annotation—for example, Ensembl	Manual annotation—for example, HAVANA
Genome analysis	Very quick	Very slow and labour intensive
Annotation consistency	Consistent	Risk of subjectivity—achieving consistency requires careful training and monitoring
Sequence quality	Flexible; can use unfinished, short-read NGS sequence, shotgun assembly	Best results on high-quality sequence, but can offer great insight into lower-quality assembly
Functional annotation	Limited; lacks the comprehensive detail of manual annotation and frequently misassigns related sequences, i.e. protein-coding loci and pseudogenes	Extensive use of biotypes, such as coding, pseudogene, lncRNA, NMD, etc.
Complex genomic regions	Limited in ability to represent complex structures and other nonstandard features	Superior representation and resolution of gene families and able to define CDS regions of complicated gene structures
Gene annotation	Many false-positive and false-negative calls at locus level in all gene biotypes	Better coverage of loci and alternatively spliced transcripts
Pseudogenes	Limited	Able to predict pseudogenes and differentiate them from genuine coding genes
Poly(A) features	Limited	Annotates poly(A) features
Flexibility	Error prone; forces problems such as non-canonical splicing, and can only look at sequences more or less in isolation	Deals with inconsistencies in data; consults literature and other databases; can compare paralogues and orthologues, and can rapidly integrate new sequencing technologies

CDS, coding sequence; HAVANA, Human and Vertebrate Analysis and Annotation; lncRNA, long non-coding RNA; NGS, next-generation sequencing; NMD, nonsense-mediated decay
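
The biotype and provenance distinctions in Table 3 can be inspected directly in a merged gene set. The following is a minimal sketch, assuming a GENCODE-style GTF file in which the second column records the annotation source (for example HAVANA or ENSEMBL) and the `gene_type` attribute records the biotype. The function name `tally_biotypes` and the toy records are hypothetical and serve only as illustration.

```python
from collections import Counter


def tally_biotypes(gtf_lines):
    """Count gene records per (source, biotype) pair in GTF-formatted lines.

    Assumes GENCODE-style GTF: column 2 is the annotation source and the
    attribute column carries a `gene_type` key (an assumption of this sketch).
    """
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only gene-level records are counted; transcripts and exons are skipped.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        source = fields[1]
        # Parse the attribute column: `key "value"; key "value"; ...`
        attrs = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition(" ")
            attrs[key] = value.strip('"')
        counts[(source, attrs.get("gene_type", "unknown"))] += 1
    return counts


# Toy records (hypothetical), not taken from any real gene set.
example = [
    '##description: toy example',
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t1500\t1900\t.\t-\t.\tgene_id "G2"; gene_type "processed_pseudogene";',
    'chr2\tENSEMBL\tgene\t200\t700\t.\t+\t.\tgene_id "G3"; gene_type "protein_coding";',
    'chr2\tENSEMBL\tgene\t900\t1200\t.\t+\t.\tgene_id "G4"; gene_type "lncRNA";',
]

for (source, biotype), n in sorted(tally_biotypes(example).items()):
    print(source, biotype, n)
```

Grouping by source and biotype together makes it easy to see, for instance, how many pseudogene calls in a region come from manual rather than automatic annotation.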

ISSN: 1756-994X