
Table 4 Important areas to consider for genome annotation

From: Genome annotation for clinical genomic diagnostics: strengths and weaknesses

Genome assembly is not complete
  The human assembly is still incomplete and is still being refined
  The current assembly is GRCh38, which still contains fragmented genes and incorrectly represented gene duplications, yet most analysis is still performed on GRCh37

Transcriptome is still incomplete
  Some exons are still not represented in the human genome owing to low expression or temporal expression in tissue that has not yet been interrogated
  WES kits will not contain all exons
  WGS-negative cases should be iteratively re-analysed as new transcriptional features are revealed

Reference annotation datasets can be missing key features
  Automatic annotation is fast but not as accurate as manual annotation
  CCDS: missing UTRs
  LRG: single, usually canonical, transcript; potential for missing exons; choice of transcript is arbitrary
  RefSeq: based on transcriptome; potential for missing exons and problems with inconsistent mapping to the reference assembly
  Annotation does not necessarily determine which transcripts are the most likely to be functional, and the longest one might not be the major one

Non-coding genome
  Long-range gene interactions are poorly understood; methods such as Capture Hi-C will provide insight into these interactions
  Previously ignored transcript biotypes such as NMD and retained intron are now known to have important regulatory roles in disease
  Non-coding RNAs have an important role in disease, yet they are hard to predict and their function remains largely unknown

Biotype associations
  A biotype conflict in annotation datasets will cause incorrect variant calls, for example lncRNA versus coding gene, or coding gene versus pseudogene

Transcript expression profile
  Is the transcript expressed in the correct tissue for the disease phenotype?
  Is the transcript expressed at the right developmental time for the disease phenotype?
Abbreviations: CCDS, Collaborative Consensus Coding Sequence project; lncRNA, long non-coding RNA; LRG, Locus Reference Genomic project; NMD, nonsense-mediated decay; WES, whole-exome sequencing; WGS, whole-genome sequencing
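
Two of the points in the table lend themselves to a small illustration: the longest transcript might not be the major one, and a transcript should be expressed in the tissue relevant to the disease phenotype. The following Python sketch is one possible way to express those checks. It is not a method from the article. The `Transcript` class, the function names, the transcript identifiers, the 1.0 expression cut-off and all expression values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # All fields are illustrative; real annotation records carry far more detail.
    transcript_id: str
    biotype: str
    length: int
    # Hypothetical expression values per tissue (e.g. TPM); missing tissue = 0.
    expression: dict = field(default_factory=dict)


def longest(transcripts):
    """Return the longest transcript, a common but potentially misleading default."""
    return max(transcripts, key=lambda t: t.length)


def major_in_tissue(transcripts, tissue, min_expression=1.0):
    """Return the most highly expressed transcript in `tissue`.

    Returns None when no transcript reaches `min_expression`, which is an
    assumed cut-off rather than an established threshold.
    """
    expressed = [
        t for t in transcripts
        if t.expression.get(tissue, 0.0) >= min_expression
    ]
    if not expressed:
        return None
    return max(expressed, key=lambda t: t.expression[tissue])


# Made-up transcripts of a single hypothetical gene.
transcripts = [
    Transcript("TX_A", "protein_coding", 5200, {"heart": 0.2, "brain": 3.1}),
    Transcript("TX_B", "protein_coding", 3100, {"heart": 48.0, "brain": 0.0}),
    Transcript("TX_C", "retained_intron", 4100, {"heart": 6.5}),
]

longest_tx = longest(transcripts)
heart_tx = major_in_tissue(transcripts, "heart")
liver_tx = major_in_tissue(transcripts, "liver")
```

In this made-up data the longest transcript (`TX_A`) is barely expressed in heart, whereas the shorter `TX_B` dominates there, so interpreting a variant from a cardiac case against the longest transcript alone could be misleading. No transcript is expressed in liver, so `major_in_tissue` returns `None` for that tissue. The developmental-time question in the table could be handled in the same way by keying expression on stage as well as tissue.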