Table 2 GENCODE annotation biotypes (2017)

From: Genome annotation for clinical genomic diagnostics: strengths and weaknesses

Biotype Description
Protein coding Contains an ORF that has strong coding potential
 Known coding 100% identical to known RefSeq protein or Swiss-Prot entry
 Novel coding Shares >60% length with known coding sequence from RefSeq, or Swiss-Prot, or has cross-species/family support or domain evidence
 Putative coding Shares <60% length with known coding sequence from RefSeq, or Swiss-Prot, or has an alternative first or last coding exon
 Nonsense-mediated decay If the coding sequence (following the appropriate reference) of a transcript finishes >50 bp from a downstream splice site, then it is tagged as NMD. If the variant does not cover the full reference coding sequence, then it is annotated as NMD if NMD is unavoidable—i.e. no matter what the exon structure of the missing portion is, the transcript will be subject to NMD
 Non-stop decay Transcripts that have poly(A) features (including signal) without a prior stop codon in the CDS—i.e. a non-genomic poly(A) tail attached directly to the CDS without a 3′ UTR; these transcripts are subject to degradation
 Retained intron Alternatively spliced transcript believed to contain intronic sequence relative to other, coding, variants
 Processed transcript Cannot assign an ORF, but is part of a coding locus
lncRNA Long non-coding RNA—lacks protein-coding potential and is of length >200 bp
 Bidirectional promoter Transcription start sites of the lncRNA model and the protein-coding model are on opposite strands and within 200 bp of one another, or are found in the same CpG island
 3-Prime overlapping Transcription start site and/or published experimental data support independent transcription from the 3′ UTR of a coding gene
 Antisense At least one variant overlaps a protein-coding locus on the opposite strand, or evidence of antisense regulation of a coding gene has been published
 lincRNA Long intergenic ncRNA: does not overlap (neither sense nor antisense) a coding gene
 Sense intronic In an intron of a coding gene; no exonic overlap
 Sense overlapping Contains a coding gene in an intron; no exonic overlap
Pseudogene Matches to protein, but ORF disrupted by frameshifts and/or premature stop codons
 Processed Lacks introns and arose from retrotransposition of parent gene mRNA
 Unprocessed Can contain introns and is produced by genomic duplication
 Transcribed Locus-specific transcripts indicate transcription; these can be classified into ‘processed’ and ‘unprocessed’
 Translated Locus-specific protein mass spectrometry data suggest translation; the connection is maintained with the pseudogene biotype until the experimental community validates it as a coding gene
 Polymorphic Pseudogene owing to a single-nucleotide variant (SNV), or insertion-deletion variant (indel); but the same gene is translated in other individuals/haplotypes/strains
 Unitary Species-specific unprocessed pseudogene, without a parent gene, that has an active orthologue in another species
  1. Data sourced from GENCODE project [196]
  2. ncRNA, noncoding RNA; ORF, open reading frame; UTR, untranslated region
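
The 50 bp rule in the nonsense-mediated decay row can be illustrated with a minimal sketch. It is not GENCODE's implementation. It assumes a transcript is given as a list of exon lengths in 5′ to 3′ order, that `cds_end` is the 1-based transcript coordinate of the last base of the coding sequence, and that distance is measured from that base to the last base before an exon-exon junction. The function names are hypothetical, and the partial-CDS case described in the table (where NMD must be unavoidable) is not modelled.

```python
from itertools import accumulate

NMD_THRESHOLD_BP = 50  # threshold stated in the table


def splice_junctions(exon_lengths):
    """Return the 1-based transcript coordinate of the last base
    before each exon-exon junction (hypothetical helper)."""
    # Cumulative exon lengths mark exon ends; the final exon end is
    # the transcript end, not a junction, so it is dropped.
    return list(accumulate(exon_lengths))[:-1]


def is_nmd_candidate(exon_lengths, cds_end, threshold=NMD_THRESHOLD_BP):
    """True if the CDS ends more than `threshold` bp upstream of at
    least one downstream splice junction (assumed reading of the rule)."""
    return any(
        junction - cds_end > threshold
        for junction in splice_junctions(exon_lengths)
    )


if __name__ == "__main__":
    exons = [200, 150, 300]  # junctions fall after bases 200 and 350
    print(is_nmd_candidate(exons, cds_end=250))  # CDS ends 100 bp before a junction
    print(is_nmd_candidate(exons, cds_end=320))  # CDS ends 30 bp before a junction
    print(is_nmd_candidate(exons, cds_end=500))  # CDS ends in the last exon
```

Under these assumptions, a CDS ending exactly 50 bp upstream of a junction is not tagged, because the table specifies a distance greater than 50 bp.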