Table 3 Comparison of computationally derived annotation versus manually derived annotation

From: Genome annotation for clinical genomic diagnostics: strengths and weaknesses

| Annotation procedure | Automatic annotation (for example, Ensembl) | Manual annotation (for example, HAVANA) |
| --- | --- | --- |
| Genome analysis | Very quick | Very slow and labour intensive |
| Annotation consistency | Consistent | Risk of subjectivity; achieving consistency requires careful training and monitoring |
| Sequence quality | Flexible; can use unfinished, short-read NGS sequence, shotgun assembly | Best results on high-quality sequence, but can offer great insight into lower-quality assembly |
| Functional annotation | Limited; lacks the comprehensive detail of manual annotation and frequently misassigns related sequences, i.e. protein-coding loci and pseudogenes | Extensive use of biotypes, such as coding, pseudogene, lncRNA, NMD, etc. |
| Complex genomic regions | Limited in ability to represent complex structures and other nonstandard features | Superior representation and resolution of gene families; able to define CDS regions of complicated gene structures |
| Gene annotation | Many false-positive and false-negative calls at locus level in all gene biotypes | Better coverage of loci and alternatively spliced transcripts |
| Pseudogenes | Limited | Able to predict pseudogenes and differentiate them from genuine coding genes |
| Poly(A) features | Limited | Annotates poly(A) features |
| Flexibility | Error prone; forces problems such as non-canonical splicing and can only look at sequences more or less in isolation | Deals with inconsistencies in data, consults literature and other databases, can compare paralogues and orthologues and rapidly integrate new sequencing technologies |

CDS, coding sequence; HAVANA, Human and Vertebrate Analysis and Annotation; lncRNA, long non-coding RNA; NGS, next-generation sequencing; NMD, nonsense-mediated decay
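
As a rough illustration of the biotype and pseudogene rows above, the following minimal Python sketch tallies gene biotypes per annotation source from GTF-style records. It assumes a nine-column, tab-separated GTF layout in which the second column names the annotation source (for example `HAVANA` or `ENSEMBL`) and the attribute column carries a `gene_type` or `gene_biotype` key; the function name `count_biotypes_by_source` and the example records are hypothetical and are not taken from the article.

```python
from collections import Counter


def count_biotypes_by_source(gtf_lines):
    """Tally (source, biotype) pairs for gene records in GTF-style lines.

    Assumption: tab-separated GTF with 9 columns, the source in column 2
    and a gene_type or gene_biotype key in the attribute column.
    """
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only gene-level records are counted here.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        source = fields[1]
        # Parse attributes of the form: key "value"; key "value";
        attrs = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition(" ")
            attrs[key] = value.strip('"')
        biotype = attrs.get("gene_type") or attrs.get("gene_biotype") or "unknown"
        counts[(source, biotype)] += 1
    return counts


# Hypothetical records for illustration only (not real loci).
example_records = [
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t1500\t2100\t.\t-\t.\tgene_id "G2"; gene_type "processed_pseudogene";',
    'chr1\tENSEMBL\tgene\t3000\t4200\t.\t+\t.\tgene_id "G3"; gene_type "protein_coding";',
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
]

biotype_counts = count_biotypes_by_source(example_records)
for (source, biotype), n in sorted(biotype_counts.items()):
    print(source, biotype, n)
```

A tally of this kind only describes what labels each source assigned; it does not by itself show which annotation is correct at a given locus.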