
Table 3 Comparison of computationally derived annotation versus manually derived annotation

From: Genome annotation for clinical genomic diagnostics: strengths and weaknesses

| Annotation procedure | Automatic annotation (for example, Ensembl) | Manual annotation (for example, HAVANA) |
| --- | --- | --- |
| Genome analysis | Very quick | Very slow and labour intensive |
| Annotation consistency | Consistent | Risk of subjectivity; achieving consistency requires careful training and monitoring |
| Sequence quality | Flexible; can use unfinished, short-read NGS sequence, shotgun assembly | Best results on high-quality sequence, but can offer great insight into lower-quality assembly |
| Functional annotation | Limited, lacking comprehensive detail of manual annotation; frequently misassigns related sequences, i.e. protein-coding loci and pseudogenes | Extensive use of biotypes, such as coding, pseudogene, lncRNA, NMD, etc. |
| Complex genomic regions | Limited in ability to represent complex structures and other nonstandard features | Superior representation and resolution of gene families and able to define CDS regions of complicated gene structures |
| Gene annotation | Many false-positive and false-negative calls at locus level in all gene biotypes | Better coverage of loci and alternatively spliced transcripts |
| Pseudogenes | Limited | Able to predict pseudogenes and differentiate from genuine coding genes |
| Poly(A) features | Limited | Annotates poly(A) features |
| Flexibility | Error prone, forces problems such as non-canonical splicing and can only look at sequences more or less in isolation | Deals with inconsistencies in data, consults literature and other databases, can compare paralogues and orthologues and rapidly integrate new sequencing technologies |

CDS, coding sequence; HAVANA, Human and Vertebrate Analysis and Annotation; lncRNA, long non-coding RNA; NGS, next-generation sequencing; NMD, nonsense-mediated decay