Genome annotation for clinical genomic diagnostics: strengths and weaknesses
Genome Medicine volume 9, Article number: 49 (2017)
The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.
Advances in genomic technologies over the past 20 years have provided researchers with unprecedented data relating to genome variation in different diseases. However, even after whole-exome sequencing (WES), the genetic basis for a particular phenotype remains unclear in a considerable proportion of patients. Here, we examine how genomic annotation might influence variant identification, using examples drawn mostly from common and rarer neurological disorders. We highlight why the present technology can fail to identify the pathogenic basis of a patient’s disorder, or produce an incorrect result in which the wrong variant is labelled as causative. For these reasons, we believe it is important to re-analyse unresolved cases as newer technology and software improve gene and genome annotation. The aim of this paper is to make common genomic techniques accessible to clinicians through the use of figures and examples that help to explain genome sequencing, gene classification and genome annotation in the context of pathogenic sequence variation. Finally, we discuss how new genomic techniques will improve our ability to identify pathogenic sequence variation.
The Human Genome Project (HGP) was launched officially in 1987 by the US Department of Energy to sequence the approximately 3 billion base-pairs (bp) that constitute the human genome. The first draft sequence was published in 2001, and computational annotation, a process that attributes a biological function to genomic elements, described 30,000 to 40,000 protein-coding genes across 22 pairs of autosomes and the X and Y sex chromosomes in a genome of 2.9 billion bases (gigabases, Gb). The precise size and gene count of the reference human genome remain uncertain to this day because sequence gaps remain and the classification of genes continues to be refined. Consequently, additions are continually made to the genome to fill sequence gaps. The most recent published estimates suggest that just under 20,000 protein-coding genes are present in a genome of approximately 3.1 Gb. The HGP enabled research ranging from initial studies of sequence variation on chromosome 22 to more recent medical advances that now see DNA sequencing used routinely in large-scale research programs, such as the Deciphering Developmental Disorders (DDD) study [8, 9]. Sequencing for the HGP used the chain terminator method, more commonly known as ‘Sanger sequencing’. Owing to the better-quality sequence data and read length associated with Sanger sequencing compared with current sequencing technologies, Sanger sequencing is still used to confirm sequence variants.
Current methods for producing the raw sequence data for whole-genome sequencing (WGS) are placed into two categories based upon the length of the nucleotide sequence produced, or sequence ‘read’. Short-read technology comes from Illumina Inc. and uses well-established chemistry to identify the sequence of nucleotides in a given short segment of DNA. Illumina sequencing platforms such as the HiSeq X produce reads of 150 to 250 bp and are used to read sequences from both ends of a DNA fragment. This ‘next-generation’ technology is a dramatic improvement over older Sanger sequencing methods, which produced longer reads but at much higher cost. More recently, ‘third-generation’ technologies from Pacific Biosciences (PacBio) and Oxford Nanopore are gaining users and making an impact. These third-generation methods generate longer reads, up to tens of thousands of base-pairs per read, but with higher error rates.
The speed of DNA sequencing, the amount of sequence that can be produced and the number of genomes that can be sequenced have increased massively with next-generation sequencing (NGS) techniques. Such advances have enabled large collaborative projects that look at variation in a population, such as the 1000 Genomes Project, as well as those investigating the medical value of WGS, such as the UK 100,000 Genomes Project. It is hoped that WGS will facilitate the research, diagnosis and treatment of many diseases.
Once a patient genome has been sequenced, it needs to be aligned to the reference genome and analysed for variants. Typically, software algorithms such as the Burrows-Wheeler Aligner (BWA) are used for short- and long-read alignment, and the Genome Analysis Toolkit (GATK) is used to identify or ‘call’ sequence variants. Figure 1 illustrates a typical genome analysis pipeline and describes the file formats commonly used: FASTQ, BAM and VCF.
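To make the last of these file formats more concrete, the following is a minimal sketch of how one data line of a VCF file could be read. It assumes the standard tab-separated column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the record type, the function name and the example line are invented for illustration and do not come from any pipeline described here.

```python
from dataclasses import dataclass


@dataclass
class VcfRecord:
    chrom: str
    pos: int     # 1-based position on the reference genome
    ref: str     # reference allele
    alts: list   # one or more alternative alleles
    info: dict   # key/value annotations from the INFO column


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value  # flag entries without '=' map to ''
    return VcfRecord(chrom, int(pos), ref, alt.split(","), info_dict)


# Invented example line: a G>A / G>T site with read depth 30.
example = "9\t12345\t.\tG\tA,T\t50\tPASS\tDP=30;DB"
record = parse_vcf_line(example)
print(record.chrom, record.pos, record.ref, record.alts, record.info)
```

In practice an established VCF library would normally be used; the sketch is only meant to show what information a variant call carries forward into annotation.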
Pathogenic sequence variation can range in size from single-nucleotide variants (SNVs) and small insertions and deletions (‘indels’) of fewer than 50 base-pairs in length, to larger structural variants (SVs), which are generally classified as regions of genomic variation greater than 1 kb, such as copy-number variants (CNVs), insertions, retrotransposon elements, inversions, segmental duplications, and other such genomic rearrangements [24, 25]. Currently, only the consequences of non-synonymous variants in protein-coding elements can be predicted routinely and automatically, by algorithms such as SIFT and PolyPhen, yet many different types of variants are implicated in disease. As sequencing techniques begin to move away from ‘gene panel’ testing to WGS, it is crucial to understand the structure of genes and any regulatory features that might lie within intra/intergenic regions, as changes in any of these regions might have a crucial impact on the function of a gene.
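The size classes above can be expressed as a short sketch. It assumes variants are given as plain reference and alternative allele strings (symbolic alleles for large SVs are not handled), and the labels for same-length multi-base changes and for the 50 bp to 1 kb range are our own, because the definitions quoted in the text do not cover those cases.

```python
def classify_by_size(ref: str, alt: str) -> str:
    """Classify a REF/ALT allele pair using the size thresholds in the text."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(ref) - len(alt))  # net number of inserted or deleted bases
    if size == 0:
        return "multi-nucleotide substitution"  # label is an assumption
    if size < 50:
        return "indel"
    if size > 1000:
        return "SV"
    return "intermediate"  # 50 bp to 1 kb: outside both quoted definitions


print(classify_by_size("A", "G"))
print(classify_by_size("ATG", "A"))
print(classify_by_size("A", "A" + "C" * 1500))
```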
Recently, the American College of Medical Genetics and Genomics (ACMG) recommended a set of standards and guidelines to help medical geneticists assign pathogenicity for Mendelian disorders using standardized nomenclature and the evidence used to support the assignment. For example, the terms ‘mutation’ and ‘polymorphism’ have often been used misleadingly, with assumptions made that ‘mutation’ is pathogenic, whereas ‘polymorphism’ is benign. One recommendation that ACMG makes is therefore that both these terms are replaced by ‘variant’, with the following modifiers: (1) pathogenic, (2) likely pathogenic, (3) uncertain significance, (4) likely benign, or (5) benign. Accordingly, we use the term variant here. A standard gene-variant nomenclature is maintained and versioned by the Human Genome Variation Society (HGVS). Both ACMG and HGVS examples are illustrated in Table 1.
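As an illustration of how these two conventions might be handled in software, the sketch below stores the five ACMG modifiers as a fixed list and parses only the simplest HGVS case, a single-base substitution described at the coding-DNA level. The pattern, the function name and the example description are illustrative assumptions; full HGVS nomenclature covers many more variant types than this pattern accepts.

```python
import re

# The five ACMG modifiers, in the order given in the text.
ACMG_CLASSES = (
    "pathogenic",
    "likely pathogenic",
    "uncertain significance",
    "likely benign",
    "benign",
)

# Matches only a coding-DNA single-base substitution,
# for example '<accession>.<version>:c.<position><ref>><alt>'.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):c\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description: str):
    """Return the parts of a simple c. substitution, or None if unsupported."""
    match = HGVS_SUBSTITUTION.match(description)
    if match is None:
        return None
    parts = match.groupdict()
    parts["position"] = int(parts["position"])
    return parts


# Illustrative description; the classification attached to it is invented.
variant = parse_substitution("NM_004006.2:c.4375C>T")
classification = ACMG_CLASSES[2]
print(variant, classification)
```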
Classifying genes and other genomic elements
Current gene sets identify under 20,000 protein-coding genes and over 15,000 long non-coding RNAs (lncRNAs) [29, 30]. In this section, for clinicians who might not be familiar with gene structure and function, we present the important elements of different parts of protein-coding genes, and other categories of genomic elements, such as pseudogenes and elements of the non-coding genome such as lncRNAs, and we highlight their potential functionality, illustrated with examples of their roles in disease. We demonstrate the importance of classifying such regions correctly and why incorrect classification could impact the interpretation of sequence variation.
Important elements of protein coding genes
A eukaryotic gene is typically organized into exons and introns (Fig. 2), although some genes, for example SOX3, which is associated with X-linked mental retardation, can have a single exon structure. The functional regions of protein-coding genes are typically designated as the coding sequence (CDS) and the 5′ and 3′ untranslated regions (UTRs) (Fig. 2).
The 5′ UTR of a transcript contains regulatory regions. For example, some upstream open reading frames (uORFs; which are sequences that begin with an ATG codon and end in a stop codon, meaning that they have the potential to be translated) in the 5′ UTR are translated to produce proteins that could enhance or suppress the function of the main CDS. Experimental techniques such as cap-analysis gene expression (CAGE) are used to identify transcription start sites (TSSs) (Fig. 2a).
Variants in the CDS are generally the most well studied and understood area of pathogenic sequence variation. For example, approximately 700 pathogenic CDS variants have been reported in the epilepsy-associated gene SCN1A.
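The definition of a uORF given above (an ATG followed in the same reading frame by a stop codon) can be turned into a simple search. This is a minimal sketch that reports only uORFs lying entirely within the supplied 5′ UTR sequence; the function name and the example sequence are invented, and real annotation would also weigh experimental evidence of translation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5: str):
    """Return (start, end) pairs, 0-based and end-exclusive, for each ATG
    that is followed in frame by a stop codon within the 5' UTR."""
    seq = utr5.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the ATG until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


utr = "CCATGGCTTAAGG"  # invented sequence containing ATG GCT TAA
for start, end in find_uorfs(utr):
    print(start, end, utr[start:end])
```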
The 3′ UTR of a transcript can contain binding sites for regulatory factors such as RNA-binding proteins (RBPs) and microRNAs (miRNAs) (Fig. 2a). Interestingly, the 3′ UTR has been linked to overall translation efficiency and stability of the mRNA. The 5′ and 3′ UTRs can also interact with each other to regulate translation through a closed-loop mechanism. Important sequence motifs involved in controlling the expression of a gene include promoters, enhancers and silencers, which are found in exonic, intragenic and intergenic regions (Fig. 2a).
A multi-exonic eukaryotic gene can produce different disease phenotypes through alternative protein isoforms that result from the use of alternative splice site/exon combinations (Fig. 3). Canonical splice sites are generally conserved at the 5′ (donor) and 3′ (acceptor) ends of vertebrate introns. The GT–intron–AG configuration is the most common, although other, rarer instances of splice sites are found, such as GC–intron–AG and AT–intron–AC.
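The three splice-site configurations named above can be checked directly from an intron sequence. The sketch below assumes the intron is supplied on the transcribed strand, from its first to its last base; the function name and the example sequences are invented for illustration.

```python
SPLICE_SITE_CLASSES = {
    ("GT", "AG"): "canonical GT-AG",
    ("GC", "AG"): "rare GC-AG",
    ("AT", "AC"): "rare AT-AC",
}


def classify_intron(intron: str) -> str:
    """Classify an intron by its donor (first two) and acceptor (last two) bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to contain both splice sites")
    donor, acceptor = seq[:2], seq[-2:]
    return SPLICE_SITE_CLASSES.get((donor, acceptor), "non-canonical")


print(classify_intron("GTAAGTCCCTTTCAG"))
print(classify_intron("ATATCCTTAC"))
```

A variant that changes one of these four bases would move an intron out of its class, which is one reason why core splice-site positions are examined closely during variant interpretation.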
Although there can be an abundant transcript that is expressed in a particular cell, the same transcript might not dominate elsewhere, and, even if a dominant transcript is identified, the transcript might not be functional . Differential expression can be both tissue- and age-specific , can occur in response to different environmental signals [41, 42], and an exon expressed in one tissue might not be relevant to further analysis if it is not expressed in the tissue where a disease phenotype is present. For example, genes expressed in brain generally have longer 3′ UTRs than those in other tissues, and such differences could impact miRNA binding sites and other regulatory regions . Studies have shown that retained introns have an important role in brain gene expression and regulation [44, 45].
Polyadenylation (poly(A)), which involves addition of the poly(A) tail, is important for nuclear export to the cytosol for translation by the ribosome and also helps with mRNA stability (Fig. 2d). Many annotated genes also have more than one poly(A) site, which can be functional in different tissues or different stages of development .
After translation, the polypeptide chain produced by the ribosome might need to undergo post-translational modification, such as folding, cutting or chemical modifications, before it is considered to be a mature protein product (Fig. 2e). Noonan syndrome is believed to result from the disruption of the phosphorylation-mediated auto-inhibitory loop of the Src-homology 2 (SH2) domain during post-translational modification.
Transcripts that contain premature stop codons (perhaps as a result of using an alternative splice donor, splice acceptor, or inclusion/exclusion of an alternative exon, which causes a CDS frameshift) are degraded through the nonsense-mediated decay (NMD) cellular surveillance pathway (Fig. 4) [47, 48]. NMD was originally believed to degrade erroneous transcripts, but much evidence has been found to suggest it is also an active regulator of transcription [49, 50]. Several NMD factors have been shown to be important for the regulation of neurological events such as synaptic plasticity and neurogenesis [51–53].
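How annotation software might flag such transcripts can be sketched with a widely used heuristic that is not described in this article: a stop codon lying more than roughly 50 nucleotides upstream of the last exon–exon junction is predicted to trigger NMD. The threshold, the function name and the exon lengths below are assumptions for illustration, and real transcripts can escape or undergo NMD contrary to this rule.

```python
def predicted_nmd_target(exon_lengths, stop_codon_end, threshold=50):
    """Apply the '50-nucleotide rule' heuristic to a spliced transcript.

    exon_lengths: exon lengths in transcript order.
    stop_codon_end: 0-based transcript coordinate just after the stop codon.
    threshold: assumed minimum distance (nt) upstream of the last junction.
    """
    if len(exon_lengths) < 2:
        return False  # single-exon transcript: no exon-exon junction
    last_junction = sum(exon_lengths[:-1])  # coordinate of the final junction
    return last_junction - stop_codon_end > threshold


exons = [100, 200, 150]  # invented exon lengths; last junction at position 300
print(predicted_nmd_target(exons, stop_codon_end=200))  # stop 100 nt upstream
print(predicted_nmd_target(exons, stop_codon_end=400))  # stop in the last exon
```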
Two other types of cellular surveillance pathways are known to exist: non-stop decay and no-go decay. Non-stop decay is a process that affects transcripts that have poly(A) features but do not have a prior stop codon in the CDS. The translation of such transcripts could produce harmful peptides with a poly-lysine amino acid sequence at the C-terminal end of the peptide—therefore, these transcripts are subject to degradation. Similar to NMD transcripts, either aberrant splicing or SNVs can cause the generation of these transcripts . Finally, no-go decay is triggered by barriers that block ribosome movement on the mRNA .
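The link between a missing stop codon and a poly-lysine tail follows from the genetic code, in which the codon AAA encodes lysine (K). The short sketch below checks a CDS for an in-frame stop codon and shows the peptide that would result from translating into a poly(A) tail; the function names and sequences are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(cds: str) -> bool:
    """Return True if any codon in the CDS reading frame is a stop codon."""
    seq = cds.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def poly_a_peptide(tail_length: int) -> str:
    """Translate a run of adenines codon by codon: every AAA codon gives K."""
    return "K" * (tail_length // 3)


cds = "ATGGCTGCA"  # invented CDS with no in-frame stop codon
if not has_in_frame_stop(cds):
    # The ribosome would read on into the poly(A) tail.
    print("non-stop transcript; tail peptide:", poly_a_peptide(12))
```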
The functional importance of pseudogenes
Pseudogenes are traditionally regarded as ‘broken’ copies of active genes. Freed of selective pressure, they have typically lost the ability to encode functional proteins through the occurrence of nonsense variations, frameshifts, truncation events, or loss of essential regulatory elements. The majority of pseudogenes fall into one of two categories: processed and unprocessed (Fig. 5, Table 2) .
Processed pseudogenes represent back-integration or retrotransposition of an RNA molecule into the genome sequence, and, although they generally lack introns, they frequently incorporate the remains of the poly(A) tail. Processed pseudogenes are often flanked by direct repeats that might have some function in inserting the pseudogene into the genome, and are often missing sequence compared with their parent gene (Fig. 5) . By contrast, unprocessed pseudogenes are defunct relatives of functional genes that arise through faulty genomic duplication resulting in missing (parts of) exons and/or flanking regulatory regions (Fig. 5).
Computational annotation of pseudogenes tends to suffer from significant false positives/negatives and can cause problems that result from the misalignment of NGS data. Specifically, identification of transcribed pseudogenes and single-exon pseudogenes can be a challenge . Such difficulties were demonstrated where it was found that more than 900 human pseudogenes have evidence of transcription, indicating functional potential [58, 59]. Consequently, the ability to distinguish between pseudogenes and the functional parent gene is essential when predicting the consequence of variants.
MacArthur and colleagues  reported that reference sequence and gene annotation errors accounted for 44.9% of candidate loss-of-function (LoF) variants in the NA12878 genome, which belongs to the daughter from a trio of individuals belonging to the CEPH/Utah pedigree whose genomes were sequenced to high depth as part of the HapMap project . The NA12878 genome sequence and transformed cells from the same individual (the GM12878 cell line) are often used as a reference in other projects [62, 63]. After reannotation of protein-coding genes harbouring 884 putative LoF variants, 243 errors in gene models were identified, 47 (19.3%) of which were updated from protein-coding to pseudogene, removing a significant source of false-positive LoF annotation .
Transcripts derived from the pseudogene locus PTENP1 have been shown to regulate the parent PTEN locus . Deletion of PTENP1 has been reported to downregulate PTEN expression in breast and colon cancer  and melanoma , and downregulation of PTENP1 through methylation of its promoter sequence in clear-cell renal cell carcinoma suppresses cancer progression . Although PTENP1 has not yet been associated with any neuronal disorders, both PTEN and PTENP1 are expressed in multiple brain tissues [67, 68].
The non-coding genome
Most of the genome is non-coding, and therefore most variation occurs in non-coding regions. To understand the effect of a sequence variant in such regions, the non-coding elements need to be classified. Non-coding elements consist of cis-regulatory elements such as promoters and distal elements (for example, enhancers)  and non-coding RNAs (ncRNAs). Large collaborative initiatives, such as ENCODE  and RoadMap Epigenomics , have been tasked to create comprehensive maps of these regions. The Ensembl regulatory build  and Variant Effect Predictor (VEP)  are able to determine whether variants fall within such regions, but are not yet able to determine pathogenicity, although tools that do so are beginning to emerge, such as FunSeq  and Genomiser .
The ncRNAs are generally divided into two groups, small RNAs (sRNAs) and lncRNAs. sRNAs include miRNAs, Piwi-interacting RNAs (piRNAs), short interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs) and other short RNAs . The sRNAs can be predicted using tools such as Infernal  and Rfam , which makes the interpretation of sequence variation and consequence easier, especially when compared with the analysis of lncRNAs. However, correctly discriminating functional copies from pseudogenes remains a challenge.
Of particular interest to the study of neurological disease are microRNAs (miRNAs), which are small (approximately 20 nucleotides) ncRNAs that are involved in the regulation of post-transcriptional gene expression . miRNAs can trigger transcript degradation, modify translational efficiency and downregulate gene expression by triggering epigenetic changes (DNA methylation and histone modifications) at the promoter of target genes, and are the best-understood of the ncRNAs. Studies have shown that variants in miRNA binding sites are associated with some neurological diseases, and there is evidence for a role in epilepsy, suggesting that miRNAs might be good candidates for the development of novel molecular approaches for the treatment of patients with epilepsy [79, 80]. For example, miRNA MIR328 binds to the 3′ UTR of PAX6 to regulate its expression. However, variation in the miRNA binding site reduces the binding affinity of MIR328, which in turn results in an increase in the abundance of PAX6 transcripts, which is associated with electrophysiological features of Rolandic epilepsy . The EpiMiRNA consortium is investigating the role of miRNAs in the development, treatment and diagnosis of temporal lobe epilepsy .
The classification of lncRNAs is increasingly used to convey functional information, despite the fact that we know relatively little about the role or mechanism of the vast majority of them (Fig. 6). The term lncRNA was itself established to distinguish longer ncRNAs from the small ncRNAs, which were initially separated using an experimental threshold of >200 nucleotides; this remains the simplest definition of a lncRNA. RNA sequencing (RNA-Seq) assays have now identified potentially tens, if not hundreds, of thousands of lncRNA transcripts, which has inevitably led to the naming of many proposed subclasses of lncRNA [84, 85]. Without any international agreement on the classification of lncRNAs, proposed subclasses have been classified based on either length, function, sequence or structural conservation, or association with either protein-coding genes, DNA elements, subcellular location or a particular biological state. lncRNAs are hard to predict owing to their size, but also because they are expressed at low levels and, unlike miRNAs, lack a known tertiary structure. A recent study by Nitsche and colleagues showed that >85% of lncRNAs have conserved splice sites that can be dated back to the divergence of placental mammals.
lncRNAs, such as XIST , have been studied for some time, yet little is known about the function of most. However, they are gaining interest within the scientific and medical community  owing to their potential involvement in disease [88, 89]. Experiments in mouse models have demonstrated that dysregulation of certain lncRNAs could be associated with epilepsy , and a role in gene regulation is proposed for the vast number of unstudied cases , which makes them interesting candidates for new targeted therapies and disease diagnostics . For example, experiments in a knock-in mouse model of Dravet syndrome have shown that the upregulation of the healthy allele of SCN1A by targeting a lncRNA improved the seizure phenotype .
CNVs also play an important role in human disease and can affect multiple coding genes, resulting in dosage effects, truncation of single genes or novel fusion products between two genes. CNVs have also been shown to be pathogenic in non-coding regions . Talkowski and colleagues  observed a CNV causing disruption in the long-intergenic non-coding RNA (lincRNA) LINC00299 in patients with severe developmental delay, raising the possibility that lincRNAs could play a significant role in developmental disorders. More recently, Turner et al.  reported WGS of 208 patients from 53 families with simplex autism and discovered small deletions within non-coding putative regulatory regions of DSCAM, implicated in neurocognitive dysfunction in Down syndrome. These CNVs were transmitted from the mother to the male proband.
Repetitive sequences and transposable elements are known to be involved in disease and are believed to make up more than two-thirds of the human genome. They also have a strong association with genomic CNVs. Long interspersed nuclear elements (LINEs) and Alu repeats (which are types of retrotransposons) have been associated with increased genomic instability through non-allelic homologous recombination events and can lead to pathogenic duplications and deletions. Alu–Alu repeat recombinations within the introns of ALDH7A1 have been associated with pyridoxine-dependent epilepsy. The ability to accurately detect repetitive sequences is of great importance owing to the problems they can cause during the aligning or assembling of sequence reads, and the human genome is commonly analysed for repeats using Repbase annotation and computational algorithms, such as the hidden Markov model (HMM)-derived database Dfam.
The ability to understand the function of a gene and how variation might affect its function is dependent upon understanding its structure, which can be elucidated by genome annotation. Genome annotation in its simplest form proceeds by ab initio gene prediction algorithms that search a genome for putative gene structures [103,104,105] such as signals associated with transcription, protein-coding potential and splicing . Although these gene-prediction algorithms were used in the early analysis of the human genome [107, 108], they are limited in both accuracy and coverage . The current automated gene-annotation tools, such as Ensembl, provide fast computational annotation of eukaryotic genomes using evidence derived from known mRNA , RNA-Seq data  and protein sequence databases .
Computational annotation systems are essential for providing an overview of gene content in newly sequenced genomes and those with fewer resources assigned to annotation, yet manual annotation is still regarded as the ‘gold standard’ for accurate and comprehensive annotation (Table 3) . As part of the ENCODE project, which was established to investigate all functional elements in the human genome , a genome-annotation assessment project was developed to assess the accuracy of computational gene annotation compared with a manually annotated test-set produced by the Human and Vertebrate Analysis and Annotation (HAVANA) team . Although the best computational methods identified ~70% of the manually annotated loci, prediction of alternatively spliced transcript models was significantly less accurate, with the best methods achieving a sensitivity of 40–45%. Conversely, 3.2% of transcripts only predicted by computational methods were experimentally validated.
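The sensitivity figures quoted above compare a predicted set of features with a manually annotated reference set. A minimal sketch of that comparison is given below, treating each transcript model as an identifier that either matches or does not; the function names and the transcript labels are invented, and the actual assessment used more detailed criteria than exact identity.

```python
def sensitivity(predicted: set, reference: set) -> float:
    """Fraction of reference (manually annotated) models that were predicted."""
    if not reference:
        raise ValueError("reference set is empty")
    return len(predicted & reference) / len(reference)


def fraction_unsupported(predicted: set, reference: set) -> float:
    """Fraction of predictions that are absent from the reference set."""
    if not predicted:
        raise ValueError("predicted set is empty")
    return len(predicted - reference) / len(predicted)


# Invented labels standing in for transcript models.
manual = {"tx1", "tx2", "tx3", "tx4", "tx5"}
computed = {"tx1", "tx2", "tx6"}

print(sensitivity(computed, manual))           # 2 of 5 manual models recovered
print(fraction_unsupported(computed, manual))  # 1 of 3 predictions unmatched
```

Predictions absent from the reference set are not necessarily wrong, which is why a small proportion of computational-only transcripts could still be validated experimentally.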
Only two groups, HAVANA and Reference Sequence (RefSeq) , produce genome-wide manual transcript annotation. The HAVANA team is based at the Wellcome Trust Sanger Institute, UK, and provides manual gene and transcript annotation for high-quality, fully finished ‘reference’ genomes, such as that of human . HAVANA manual annotation is supported by computational and wet lab groups who, through their predictions, highlight regions of interest in the genome to be followed up by manual annotation, identify potential features missing from annotation and experimentally validate the annotated transcripts, then provide feedback to computational groups to help improve the analysis pipelines.
The RefSeq collection of transcripts and their associated protein products is manually annotated at the National Center for Biotechnology Information (NCBI) in the USA. Although many RefSeq transcripts are completely manually annotated, a significant proportion are not: for example in NCBI Homo sapiens Annotation Release 106, approximately 45% of transcripts were classified as being computationally annotated . Furthermore, unlike HAVANA transcripts, which are annotated on the genome, RefSeq transcripts are annotated independently of the genome and based upon the mRNA sequence alone, which can lead to difficulty mapping to the genome.
The GENCODE  gene set takes advantage of the benefits of both manual annotation from HAVANA and automated annotation from the Ensembl gene build pipeline by combining the two into one dataset. GENCODE describes four primary gene functional categories, or biotypes: protein-coding gene, pseudogene, lncRNA and sRNA. The adoption of further biotypes, at both the gene level and transcript level, has enriched annotation greatly (Table 2). The final gene set is overwhelmingly manually annotated (~100% of all protein-coding loci and ~95% of all transcripts at protein-coding genes are manually annotated). Computational annotation predictions of gene features are provided to give hints to manual annotators and direct attention to unannotated probable gene features, and are also used to quality control (QC) manual annotation to identify and allow correction of both false-positive and false-negative errors.
GENCODE and RefSeq collaborate to identify agreed CDSs in protein-coding genes and to try and reach agreement where there are differences as part of the collaborative Consensus CoDing Sequence (CCDS) project [115, 116]. These CDS models, which do not include 5′ or 3′ UTRs, are frequently used in exome panels alongside the full RefSeq and GENCODE gene sets that form the majority of the target sequences in exome panels.
The GENCODE gene set improves on the CCDS set as it is enriched with additional alternatively spliced transcripts at protein-coding genes as well as pseudogene and lncRNA annotation, and as such is the most detailed gene set . GENCODE is now incorporated into the two most widely used commercial WES kits [118, 119], with fewer variants of potential medical importance missed .
To present genome annotation in a meaningful and useful manner, publicly available, web-based interfaces for viewing annotation have been provided—for example, the Ensembl Genome Browser  and the UCSC browser  (Fig. 7), both of which display the GENCODE models. The GENCODE genes are updated twice a year, whereas CCDS is updated at least once a year. All transcripts are assigned a unique stable identifier, which only changes if the structure of the transcript changes, making the temporal tracking of sequences easy.
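Tracking annotation over time can be sketched as a comparison of identifiers between two releases. The example assumes identifiers written in the versioned form used in GENCODE files, a stable identifier followed by a dot and a version number, and it treats a change in version as the signal that a model was updated; the function names and the identifiers below are invented.

```python
def split_versioned_id(identifier: str):
    """Split 'ENST00000000001.2' into ('ENST00000000001', 2)."""
    stable_id, _, version = identifier.partition(".")
    return stable_id, int(version) if version else None


def changed_between_releases(old_ids, new_ids):
    """Return stable IDs present in both releases whose version differs."""
    old = dict(split_versioned_id(i) for i in old_ids)
    new = dict(split_versioned_id(i) for i in new_ids)
    return sorted(sid for sid in old.keys() & new.keys() if old[sid] != new[sid])


# Invented identifiers for two hypothetical releases.
release_a = ["ENST00000000001.1", "ENST00000000002.3"]
release_b = ["ENST00000000001.2", "ENST00000000002.3", "ENST00000000003.1"]
print(changed_between_releases(release_a, release_b))
```

A variant interpreted against a transcript that appears in such a list is a natural candidate for re-analysis.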
A great deal of functionality is provided by genome browsers, such as: displaying and interrogating genome information by means of a graphical interface, which is integrated with other related biological databases; identifying sequence variation and its predicted consequence using VEP; investigating phenotype information and tissue-specific gene expression; and searching for related sequences in the genome using BLAST. Figure 7 presents by way of example the gene KCNT1, which is associated with early infantile epileptic encephalopathies  displayed in both the Ensembl and UCSC genome browsers.
Using comparative genomics to confirm gene functionality
Sequence data from other organisms are essential for interpreting the human genome owing to the functional conservation of important sequences in evolution  that can then be identified by their similarity . The zebrafish, for example, has a high genetic and physiological homology to human, with approximately 70% of human genes having at least one zebrafish orthologue. This means that the zebrafish model can provide independent verification of a gene being involved in human disease. Zebrafish also develop very quickly and are transparent, and so the fate, role and life cycle of individual cells can be followed easily in the developing organism. This makes the zebrafish a highly popular vertebrate model organism with which to study complex brain disorders [125, 126], and it has been essential for modelling disease in the DDD study .
Likewise, owing to a combination of experimental accessibility and ethical concerns, the mouse is often used as a proxy with which to study human disease [128, 129], and this justified the production of a high-quality, finished, reference mouse genome sequence, similar to that of the human sequence . Murine behavioural traits, tissues, physiology and organ systems are all extremely similar to those of human , and their genomes are similar too, with 281 homologous blocks of at least 1 Mb  and over 16,000 mouse protein-coding genes with a one-to-one orthology to human . The large number of knockout mouse models available can be used to study many neurological diseases in patients , such as the Q54 transgenic mouse used to study Scn2A seizure disorders . Recent studies in rodent models of epilepsy have identified changes in miRNA levels in neural tissues after seizures, which suggests that they could be key regulatory mechanisms and therapeutic targets in epilepsy . It is therefore important that high-quality annotation for these model organisms is maintained, so that genes and transcripts can be compared across these organisms consistently . With the advent of CRISPR–Cas9 technology, it is now possible to engineer specific changes into model organism genomes to assess the effects of such changes on gene function .
Nevertheless, model organism genomes and human genomes differ. For example, the laboratory mouse is highly inbred, whereas the human population is much more heterogeneous. Furthermore, many environmental and behavioural components are known to affect disease in certain mouse strains, which are factors that are not clearly understood in human disease. Although comparative genomics helps to build good gene models in the human genome and understand gene function and disease, basing predictions in clinical practice upon animal models alone might lead to misdiagnosis.
New techniques to improve functional annotation of genomic variants
NGS technologies facilitate improvements in gene annotation that have the potential to improve the functional annotation and interpretation of genomic variants. The combination of both long and short NGS reads  will change the scope of annotation. While short-read RNA-Seq assays may be able to produce hundreds of millions of reads and quantify gene expression, they are generally unable to represent full-length transcripts, which makes the assembly of such transcripts incredibly difficult . However, the greater read lengths produced by new sequencing technologies such as PacBio and synthetic long-read RNA-Seq (SLR-Seq), which uses Illumina short-read sequencing on single molecules of mRNA, have the potential to produce sequence for complete transcripts in a single read. In addition, utilizing longer-read technologies such as that from PacBio has already been shown to improve resolution of regions of the genome with SVs , and emerging technologies, such as 10X genomics , promise further improvements. This is especially important because WES is unable to represent structural variation reliably. The importance of representing such regions through WGS has been demonstrated by numerous neurological diseases associated with SVs, including cases of severe intellectual disability . Other examples of SV-induced neurological disease include Charcot–Marie–Tooth disease, which is most commonly caused by gene-dosage effects as a result of a duplication on the short arm of chromosome 17 , although other causes are known ; Smith–Magenis syndrome, caused by copy-number variants on chromosome 17p12 and 17p11.2 ; and Williams–Beuren syndrome, caused by a hemizygous microdeletion involving up to 28 genes on chromosome 7q11.23 .
Together, NGS data will also lead to the discovery of new exons and splice sites that both extend and truncate exons in a greater diversity of tissues and cell types. Whether the novel exons or splice sites belong to protein-coding transcripts, to potential regulatory transcripts, or to transcripts likely to be targets of the NMD pathway, such technologies will permit better functional annotation of the variants that overlap them. One example is the re-annotation as exonic of variants that were previously called intronic. Similarly, a previously described synonymous substitution, or benign non-synonymous substitution, could affect core splice-site bases of a novel splice junction. RNA-Seq assays are able to discern expression of individual exons, allowing prioritisation of variants expressed in appropriate tissues for a disease. In the future, clinical investigation could target the genome in conjunction with the transcriptome (for example, using patient tissue as the basis for RNA-Seq assays) to identify regions where genes are expressed irregularly.
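The splice-site point can be illustrated with a minimal sketch. It assumes 1-based inclusive exon coordinates on the forward strand and treats the two intronic bases flanking each internal exon boundary as the core splice site. The exon coordinates, the two transcript models and the function names are hypothetical and chosen only for illustration; they do not describe a real gene or an existing tool.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand (assumption)


def core_splice_positions(exons: List[Exon]) -> Set[int]:
    """Return the intronic positions within 2 bp of each internal exon boundary."""
    ordered = sorted(exons)
    positions: Set[int] = set()
    for i, (start, end) in enumerate(ordered):
        if i > 0:
            # Acceptor site: the two intronic bases immediately before the exon.
            positions.update({start - 2, start - 1})
        if i < len(ordered) - 1:
            # Donor site: the two intronic bases immediately after the exon.
            positions.update({end + 1, end + 2})
    return positions


def affects_core_splice_site(pos: int, exons: List[Exon]) -> bool:
    """True if the variant position lies in a core splice site of this transcript."""
    return pos in core_splice_positions(exons)


# Hypothetical models: the novel transcript uses an earlier donor site,
# truncating the first exon from position 200 to position 150.
annotated = [(100, 200), (500, 600)]
novel = [(100, 150), (500, 600)]

variant_pos = 151  # exonic in the annotated model
print(affects_core_splice_site(variant_pos, annotated))  # False
print(affects_core_splice_site(variant_pos, novel))      # True
```

In this toy case the same position is an ordinary exonic base in the annotated transcript but the first base of the donor site in the novel one, which is the situation in which a substitution previously described as synonymous could warrant re-evaluation.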
Transcriptomics datasets, such as CAGE, RAMPAGE and polyA-seq, aid the accurate identification of the 5′ (for the two former) and 3′ (for the latter) ends of transcripts. This knowledge allows researchers to better annotate the functionality of a biotype, specifically enabling the addition of CDS where this was not previously possible, and enriching the functional annotation of overlapping variants. Furthermore, knowledge of termini allows the confident annotation of 5′ and 3′ UTRs that could harbor important regulatory sequences such as uORFs and miRNA target sites.
Other datasets, such as mass spectrometry (MS) and ribosome profiling (RP, or Riboseq), indicate translation, either by directly identifying proteins (MS) or by identifying translation on the basis of ribosomal binding to mRNA transcripts (RP), which aids the accurate identification of the presence and extent of expression of the CDS. Combining these datasets with cross-species conservation of protein-coding potential found by PhyloCSF allows annotators to identify previously unannotated protein-coding loci and confirm lncRNAs as lacking in protein-coding potential.
With the increasing importance of epigenetics and its role in neurological disorders, such as epilepsy, several companies are making detection of these features a priority, for example by detecting methylated nucleotides directly as part of their sequencing reaction. Other well-described genetic marks are the DNase hypersensitivity sites that are often found in regions of active transcription. However, before these marks are considered in the process of annotation, we will require better experimental datasets that validate them. To put such marks into context and aid validation, gene annotation must be as accurate and comprehensive as possible so that potential cis (local) and trans (distant) interactions can be identified. Regulatory regions such as enhancers can be described as part of the extended gene and represent the next frontier for gene annotation. Data such as Capture Hi-C and ChIA-PET can identify physical connections between regulatory regions affected by variation and the genes they regulate, which can often be located a great distance away. This could mean that variants that were previously considered to be benign could in future be reclassified as pathogenic. For example, variants in evolutionarily conserved transcription factor binding sites are believed to have a role in narcolepsy.
Computational and manual genome-annotation methods that have been described have relied almost exclusively on traditional transcriptional evidence to build or extend models of genes and their transcripts. While the number of sequences in public databases continues to increase, genes expressed at very low levels, or with restricted expression profiles (such as many non-coding loci), are likely to remain either under-represented or incomplete when relying on such evidence [160, 161].
New technologies and software will help assess the complexity of loci much more thoroughly through the investigation of alternative splicing/translation start sites/poly(A) sites, alternative open reading frames, and so on. They will also allow the revisiting of the human genome, for example, to investigate evolutionarily conserved regions and regulatory features for functionality and to identify new non-coding loci structures as well as new coding transcripts.
We have reviewed how important regions of the genome that harbor pathogenic sequence variation can lie outside the CDS of genes. We have discussed how researchers can better understand why an incorrect interpretation of a pathogenic variant could arise. Such reasons can range from the human reference genome being incomplete, not all exons being represented in public databases, to incorrect annotation of transcripts/exons owing to their expression in a different tissue or at a different developmental stage to the disease phenotype. Table 4 gives a summary of such examples. As such, considerable efforts continue to be made to increase the catalogue of new genes involved in diseases, such as neurological disease. However, even well-studied genes should be revisited iteratively to identify novel features that previous technology could not detect. For example, a recent publication by Djemie and colleagues revisited patients who had presented with Dravet syndrome, typically associated with SCN1A variants, but had been SCN1A variant-negative after clinical sequencing. By re-testing with NGS, it was possible to identify 28 variants that were overlooked with Sanger sequencing. Around 66% of the reported false-negative results were attributed to human error, whereas many of the others were a result of poor base-calling software.
It is important to remember that the full human transcriptome has yet to be annotated across all human tissues. Clearly, while gene panels and whole-exome sequences are a great start to getting a diagnosis, they are not perfect as they are snapshots of sequence at a particular point in time, meaning that pathogenic sequence variants that lie in yet-to-be-annotated exons will not be detected. This emphasizes the power of whole-genome sequences as, unlike exomes, they can be re-analysed at any point in the future as new gene structures are found. To identify such features, it will be important to update the annotation of disease genes using the most relevant experimental methods and tissue to help identify transcripts that might be expressed at low levels or only at certain developmental stages.
Similarly, improvements in the understanding and annotation of gene structures can lead to reclassification of variants as less pathogenic than previously believed, with implications for treatment strategies. For example, de la Hoya and colleagues demonstrated that improvements to understanding of native alternative splicing events in the breast cancer susceptibility gene BRCA1 show that the risk of developing cancer is unlikely to be increased for carriers of truncating variants in exons 9 and 10, or indeed other alleles that retain 20–30% tumour-suppressor function, even where such variants had been previously characterized as pathogenic.
Accordingly, it is essential to consider multiple transcripts for pathogenic variant discovery, unlike the standard clinical approach of only considering a ‘canonical’ transcript, invariably based on the longest CDS but not necessarily on any expression values. Such situations could result in ambiguous HGVS nomenclature when transcript IDs are not specified, and, as a result, important variants might be missed if variant analysis is only performed against the canonical transcript. For example, a variant can be classified as intronic based on the canonical transcript but could be exonic when based upon an alternatively spliced transcript. Such technical challenges illustrate the difficulties for clinicians when dealing with clinical reports containing details of identified variants (for example, HGVS identifiers) and attempting to map them accurately to function and allow variant interpretation.
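A minimal sketch of why the choice of transcript matters is given here. It classifies one genomic position against every transcript model of a locus rather than against the canonical transcript alone. The transcript identifiers (`TX_canonical`, `TX_alt`), the exon coordinates and the function names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive (assumption)


def classify_position(pos: int, exons: List[Exon]) -> str:
    """Classify a position as exonic, intronic or outside one transcript model."""
    ordered = sorted(exons)
    if pos < ordered[0][0] or pos > ordered[-1][1]:
        return "outside"
    for start, end in ordered:
        if start <= pos <= end:
            return "exonic"
    # Within the transcript span but not in any exon.
    return "intronic"


def classify_across_transcripts(
    pos: int, transcripts: Dict[str, List[Exon]]
) -> Dict[str, str]:
    """Report the classification of one position for each transcript of a locus."""
    return {tid: classify_position(pos, exons) for tid, exons in transcripts.items()}


# Hypothetical locus: the alternative transcript includes an extra exon (300-350).
transcripts = {
    "TX_canonical": [(100, 200), (500, 600)],
    "TX_alt": [(100, 200), (300, 350), (500, 600)],
}

print(classify_across_transcripts(320, transcripts))
# {'TX_canonical': 'intronic', 'TX_alt': 'exonic'}
```

Reporting the result per transcript identifier, rather than as a single label, also avoids the ambiguity that arises when a variant description does not state which transcript it refers to.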
A solution to this problem would be to identify all the high-confidence transcripts and call variants against these transcripts, highlighting variants that might have severe effects against one or more such transcripts. To improve sensitivity, these findings could be weighted by transcript expression level in the disease-relevant tissue(s) (Fig. 8). To improve sensitivity even further, RNA-Seq assays from different developmental stages could be interrogated to see whether exons are expressed at the correct developmental stage as that of the disease phenotype.
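One possible reading of this weighting idea is sketched here, under stated assumptions. Each transcript-level consequence is given an illustrative severity value, and the severities are averaged using each transcript's expression (for example, TPM) in the disease-relevant tissue as the weight. The severity values, consequence labels, transcript identifiers and expression figures are all hypothetical placeholders and are not taken from any published scoring scheme.

```python
from typing import Dict

# Illustrative severity values only; a real pipeline would need a validated scheme.
SEVERITY: Dict[str, float] = {
    "stop_gained": 1.0,
    "missense": 0.5,
    "synonymous": 0.1,
    "intronic": 0.0,
}


def expression_weighted_score(
    consequences: Dict[str, str], tissue_tpm: Dict[str, float]
) -> float:
    """Average consequence severity across transcripts, weighted by expression.

    consequences maps transcript ID -> consequence label for one variant.
    tissue_tpm maps transcript ID -> expression in the disease-relevant tissue.
    Transcripts without an expression value contribute a weight of zero.
    """
    total = sum(tissue_tpm.get(tid, 0.0) for tid in consequences)
    if total == 0:
        return 0.0
    weighted = sum(
        SEVERITY[label] * tissue_tpm.get(tid, 0.0)
        for tid, label in consequences.items()
    )
    return weighted / total


# Hypothetical variant: intronic in the canonical transcript, truncating in an
# alternative transcript that dominates expression in the tissue of interest.
consequences = {"TX_canonical": "intronic", "TX_alt": "stop_gained"}
brain_tpm = {"TX_canonical": 2.0, "TX_alt": 18.0}

print(expression_weighted_score(consequences, brain_tpm))
```

With these placeholder numbers the variant scores highly even though it is intronic in the canonical transcript, because the affected alternative transcript accounts for most of the expression. The same function could be applied to expression values from different developmental stages to compare scores across stages.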
Also of interest and concern is where genes thought to be implicated in a specific disease are now thought to have insufficient evidence for their role in disease. For example, the following genes were previously thought to be associated with epilepsy: EFHC1, SCN9A, CLCN2, GABRD, SRPX2 and CACNA1H. The Epilepsy Genetics Initiative (EGI) attempts to address such problems by iteratively re-analysing WES and WGS of epilepsy cases every 6 months.
The overwhelming amount of sequence variation that is generated by WES and WGS means that many variants produced will have no role in disease. Therefore, the use of databases that contain sequence variants from global sequencing projects, such as ExAC and the 1000 Genomes Project, can help to filter out common variants and so identify rare variants [60, 172]. Such databases can be used to identify those genes that are intolerant of any variation in their sequence, and, when variants in such genes are identified in patients, this could be an indicator of pathogenic sequence variation. Other variant databases, such as The Human Gene Mutation Database (HGMD) and ClinVar, provide information on inherited disease variants and on relationships between variants and phenotype. Genomic interpretation companies are now providing increasingly quick pathogenic variant interpretation turnaround times [176,177,178,179]. However, the value of such interpretation will only be as good as the gene annotation that is used for genome analysis and interpretation, demonstrating the need for continual updating and improvement of current gene sets.
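The frequency-based filtering step can be sketched as follows. This is a minimal illustration that assumes each variant has already been looked up in a population database and carries either an allele frequency or no value at all if it was never observed. The `Variant` class, the variant labels and the 1% threshold are assumptions made for the example; an appropriate threshold depends on the disorder and its mode of inheritance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    label: str
    # Allele frequency in a population database; None means not observed there.
    population_af: Optional[float]


def filter_rare(variants: List[Variant], max_af: float = 0.01) -> List[Variant]:
    """Keep variants that are absent from the database or at or below max_af."""
    return [
        v for v in variants
        if v.population_af is None or v.population_af <= max_af
    ]


# Hypothetical candidates from one patient.
candidates = [
    Variant("var_common", 0.25),
    Variant("var_rare", 0.0001),
    Variant("var_unseen", None),
]

print([v.label for v in filter_rare(candidates)])
# ['var_rare', 'var_unseen']
```

Rarity alone does not establish pathogenicity; the filter only reduces the candidate list that is then interpreted against gene annotation and clinical databases.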
Genome annotation is also increasingly seen as essential for the development of pharmacological interventions, such as drug design. Typically, drug design targets the main transcript of a gene (the choice of such a transcript is not necessarily informed by biological data, but is generally based upon the longest transcript), yet, as mentioned previously, it is now understood that certain transcripts can be expressed in different tissues, or at certain developmental times. For example, the onconeural antigen Nova-1 is a neuron-specific RNA-binding protein, and its activity is inhibited by paraneoplastic antibodies. It is encoded by NOVA1, which is only expressed in neurons. The alternative splicing of exon 5 of the epilepsy-associated gene SCN1A generates isoforms of the voltage-gated sodium channel that differ in their sensitivity to the anti-epileptic medications phenytoin and lamotrigine. Finally, isoform switching in the mouse gene Dnm1 (encoding dynamin-1), as a result of alternative splicing of exon 10 during embryonic to postnatal development, causes epilepsy.
With new drugs having a high failure rate and associated financial implications [183,184,185], it is not unreasonable to suggest that identifying tissue-specific exons and transcripts through annotation has the potential to reduce such failure rates significantly. New methods of generating genomic data must therefore be adopted continually and interrogated by annotators to facilitate the translation of genomic techniques into the clinic in the form of genomic medicines.
Such advances will begin to address some of the controversies and challenges for clinicians that the fast advances in genomics bring. They will help to understand why current technology can fail to identify the pathogenic basis of a patient’s disorder, or, more worryingly, why it can produce an incorrect result where the wrong variant is labelled as causative. This understanding will help clinicians to explain the advantages and limitations of genomics to families and healthcare professionals when caring for patients. The implication is that it will empower them to request reanalysis of unsolved cases as newer technology improves the annotation of gene structure and function. It will also encourage clinicians to request referral for disease modification when therapy becomes available for a clinical disease caused by specific genomic alterations.
Abbreviations
ACMG: American College of Medical Genetics and Genomics
CAGE: Cap-analysis gene expression
CCDS: Consensus coding sequence
DDD: Deciphering Developmental Disorders
HAVANA: Human and Vertebrate Analysis and Annotation
HGP: Human Genome Project
HGVS: Human Genome Variation Society
Indel: Insertion and deletion
lincRNA: Long-intergenic non-coding RNA
lncRNA: Long non-coding RNA
NCBI: National Center for Biotechnology Information
ORF: Open reading frame
TSS: Transcription start site
VEP: Variant effect predictor
EpiPM Consortium. A roadmap for precision medicine in the epilepsies. Lancet Neurol. 2015;14:1219–28.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. Erratum in: Nature. 2001;411:720. Szustakowki, J [corrected to Szustakowski, J]. Nature 2001 Aug 2;412(6846):565.
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–45.
Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9:e1001091.
GENCODE. Human GENCODE version 24. 2016. http://www.gencodegenes.org/stats/current.html. Accessed 14 Feb 2017.
Ensembl. Ensembl Human, release 83, GRC38. 2016. http://www.ensembl.org/Homo_sapiens/Info/Annotation. Accessed 14 Feb 2017.
Mullikin JC, Hunt SE, Cole CG, Mortimore BJ, Rice CM, Burton J, et al. An SNP map of human chromosome 22. Nature. 2000;407:516–20.
Firth HV, Wright CF. The Deciphering Developmental Disorders (DDD) study. Dev Med Child Neurol. 2011;53:702–3.
Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature. 2017;542:433–8.
Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977;74:5463–7.
Papandreou A, McTague A, Trump N, Ambegaonkar G, Ngoh A, Meyer E, et al. GABRB3 mutations: a new and emerging cause of early infantile epileptic encephalopathy. Dev Med Child Neurol. 2016;58:416–20.
Illumina. Illumina Inc. https://www.illumina.com/. Accessed 26 Apr 2017.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–9.
McPherson JD. A defining decade in DNA sequencing. Nat Methods. 2014;11:1003–5.
1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–73.
100K Genomes. Sequencing 100000 Genomes. 2014. http://www.genomicsengland.co.uk/. Accessed 14 Feb 2017.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–95.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38:1767–71.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8.
Chiang C, Scott AJ, Davis JR, Tsang EK, Li X, Kim Y, et al. The impact of structural variation on human gene expression. Nat Genet. 2017;49:692–9.
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81.
Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, et al. Copy number variation: new insights in genome diversity. Genome Res. 2006;16:949–61.
Frousios K, Iliopoulos CS, Schlitt T, Simpson MA. Predicting the functional consequences of non-synonymous DNA sequence variants--evaluation of bioinformatics tools and development of a consensus strategy. Genomics. 2013;102:223–8.
Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17:405–24.
HGVS. HGVS nomenclature. 2017. http://www.hgvs.org/mutnomen. Accessed 24 Apr 2017.
Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, et al. EGASP: the human ENCODE genome annotation assessment project. Genome Biol. 2006;7 Suppl 1:S2. 1–31.
Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014;42(Database issue):D756–63.
Bauters M, Frints SG, Van Esch H, Spruijt L, Baldewijns MM, de Die-Smulders CE, et al. Evidence for increased SOX3 dosage as a risk factor for X-linked hypopituitarism and neural tube defects. Am J Med Genet A. 2014;164A:1947–52.
Araujo PR, Yoon K, Ko D, Smith AD, Qiao M, Suresh U, et al. Before it gets started: regulating translation at the 5′ UTR. Comp Funct Genomics. 2012;2012:475731.
Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 2003;100:15776–81.
Parihar R, Ganesh S. The SCN1A gene variants and epileptic encephalopathies. J Hum Genet. 2013;58:573–80.
Beaudoing E, Freier S, Wyatt JR, Claverie JM, Gautheret D. Patterns of variant polyadenylation signal usage in human genes. Genome Res. 2000;10:1001–10.
Kang MK, Han SJ. Post-transcriptional and post-translational regulation during mouse oocyte maturation. BMB Rep. 2011;44:147–57.
Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem. 2003;72:291–336.
Burset M, Seledtsov IA, Solovyev VV. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 2000;28:4364–75.
Gonzalez-Porta M, Frankish A, Rung J, Harrow J, Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 2013;14:R70.
Jaffe AE, Shin J, Collado-Torres L, Leek JT, Tao R, Li C, et al. Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci. 2015;18:154–61.
Wang Z, Burge CB. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA. 2008;14:802–13.
Lianoglou S, Garg V, Yang JL, Leslie CS, Mayr C. Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 2013;27:2380–96.
Miura P, Shenker S, Andreu-Agullo C, Westholm JO, Lai EC. Widespread and extensive lengthening of 3′ UTRs in the mammalian brain. Genome Res. 2013;23:812–25.
Yap K, Lim ZQ, Khandelia P, Friedman B, Makeyev EV. Coordinated regulation of neuronal mRNA steady-state levels through developmentally controlled intron retention. Genes Dev. 2012;26:1209–23.
Braunschweig U, Barbosa-Morais NL, Pan Q, Nachman EN, Alipanahi B, Gonatopoulos-Pournatzis T, et al. Widespread intron retention in mammals functionally tunes transcriptomes. Genome Res. 2014;24:1774–86.
Reimand J, Wagih O, Bader GD. Evolutionary constraint and disease associations of post-translational modification sites in human genomes. PLoS Genet. 2015;11:e1004919.
Cheng J, Maquat LE. Nonsense codons can reduce the abundance of nuclear mRNA without affecting the abundance of pre-mRNA or the half-life of cytoplasmic mRNA. Mol Cell Biol. 1993;13:1892–902.
Nagy E, Maquat LE. A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem Sci. 1998;23:198–9.
Zhao Y, Lin J, Xu B, Hu S, Zhang X, Wu L. MicroRNA-mediated repression of nonsense mRNAs. Elife. 2014;3:e03032.
Boutz PL, Bhutkar A, Sharp PA. Detained introns are a novel, widespread class of post-transcriptionally spliced introns. Genes Dev. 2015;29:63–80.
Nguyen LS, Jolly L, Shoubridge C, Chan WK, Huang L, Laumonnier F, et al. Transcriptome profiling of UPF3B/NMD-deficient lymphoblastoid cells from patients with various forms of intellectual disability. Mol Psychiatry. 2012;17:1103–15.
Adlakha YK, Saini N. Brain microRNAs and insights into biological functions and therapeutic potential of brain enriched miRNA-128. Mol Cancer. 2014;13:33.
Lin YS, Wang HY, Huang DF, Hsieh PF, Lin MY, Chou CH, et al. Neuronal splicing regulator RBFOX3 (NeuN) regulates adult hippocampal neurogenesis and synaptogenesis. PLoS One. 2016;11:e0164164.
Sundermeier T, Ge Z, Richards J, Dulebohn D, Karzai AW. Studying tmRNA-mediated surveillance and nonstop mRNA decay. Methods Enzymol. 2008;447:329–58.
Shoemaker CJ, Green R. Translation drives mRNA quality control. Nat Struct Mol Biol. 2012;19:594–601.
Frankish A, Harrow J. GENCODE pseudogenes. Methods Mol Biol. 2014;1167:129–55.
Vanin EF. Processed pseudogenes: characteristics and evolution. Annu Rev Genet. 1985;19:253–72.
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–74.
Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, et al. The GENCODE pseudogene resource. Genome Biol. 2012;13:R51.
MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–8.
International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–320.
Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32:246–51.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature. 2010;465:1033–8.
Poliseno L, Haimovic A, Christos PJ, Vega Y Saenz de Miera EC, Shapiro R, Pavlick A, et al. Deletion of PTENP1 pseudogene in human melanoma. J Invest Dermatol. 2011;131:2497–500.
Yu G, Yao W, Gumireddy K, Li A, Wang J, Xiao W, et al. Pseudogene PTENP1 functions as a competing endogenous RNA to suppress clear-cell renal cell carcinoma progression. Mol Cancer Ther. 2014;13:3086–97.
GTEX. GTEX. 2017. http://www.gtexportal.org/. Accessed 24 Apr 2017.
Atlas. Expression Atlas. https://www.ebi.ac.uk/gxa/home. Accessed 12 Feb 2017.
Wittkopp PJ, Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat Rev Genet. 2011;13:59–69.
Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.
Aken BL, Achuthan P, Akanni W, Amode MR, Bernsdorff F, Bhai J, et al. Ensembl 2017. Nucleic Acids Res. 2017;45(D1):D635–42.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al. The Ensembl variant effect predictor. Genome Biol. 2016;17:122.
Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science. 2013;342:1235587.
Smedley D, Schubach M, Jacobsen JO, Köhler S, Zemojtel T, Spielmann M, et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am J Hum Genet. 2016;99:595–606.
Morris KV, Mattick JS. The rise of regulatory RNA. Nat Rev Genet. 2014;15:423–37.
Barquist L, Burge SW, Gardner PP. Studying RNA homology and conservation with infernal: from single sequences to RNA families. Curr Protoc Bioinformatics. 2016;54:12.13.1–12.13.25.
Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 2015;43(Database issue):D130–7.
Ambros V. The functions of animal microRNAs. Nature. 2004;431:350–5.
Henshall DC. MicroRNA and epilepsy: profiling, functions and potential clinical applications. Curr Opin Neurol. 2014;27:199–205.
Ren L, Zhu R, Li X. Silencing miR-181a produces neuroprotection against hippocampus neuron cell apoptosis post-status epilepticus in a rat model and in children with temporal lobe epilepsy. Genet Mol Res. 2016;15(1); doi:10.4238/gmr.15017798.
Panjwani N, Wilson MD, Addis L, Crosbie J, Wirrell E, Auvin S, et al. A microRNA-328 binding site in PAX6 is associated with centrotemporal spikes of rolandic epilepsy. Ann Clin Transl Neurol. 2016;3:512–22.
Reschke CR, Silva LF, Norwood BA, Senthilkumar K, Morris G, Sanz-Rodriguez A, et al. Potent anti-seizure effects of locked nucleic acid antagomirs targeting miR-134 in multiple mouse and rat models of epilepsy. Mol Ther Nucleic Acids. 2017;6:45–56.
Ulitsky I, Bartel DP. lincRNAs: genomics, evolution, and mechanisms. Cell. 2013;154:26–46.
Wright MW. A short guide to long non-coding RNA gene nomenclature. Hum Genomics. 2014;8:7.
St Laurent G, Wahlestedt C, Kapranov P. The Landscape of long noncoding RNA classification. Trends Genet. 2015;31:239–51.
Nitsche A, Rose D, Fasold M, Reiche K, Stadler PF. Comparison of splice sites reveals that long noncoding RNAs are evolutionarily well conserved. RNA. 2015;21:801–12.
McHugh CA, Chen CK, Chow A, Surka CF, Tran C, McDonel P, et al. The Xist lncRNA interacts directly with SHARP to silence transcription through HDAC3. Nature. 2015;521:232–6.
Liu Z, Sun M, Lu K, Liu J, Zhang M, Wu W, et al. The long noncoding RNA HOTAIR contributes to cisplatin resistance of human lung adenocarcinoma cells via downregualtion of p21(WAF1/CIP1) expression. PLoS One. 2013;8:e77293.
Zhang X, Weissman SM, Newburger PE. Long intergenic non-coding RNA HOTAIRM1 regulates cell cycle progression during myeloid maturation in NB4 human promyelocytic leukemia cells. RNA Biol. 2014;11:777–87.
Lee DY, Moon J, Lee ST, Jung KH, Park DK, Yoo JS, et al. Dysregulation of long non-coding RNAs in mouse models of localization-related epilepsy. Biochem Biophys Res Commun. 2015;462:433–40.
Morris KV. The theory of RNA-mediated gene evolution. Epigenetics. 2015;10:1–5.
Vitiello M, Tuccoli A, Poliseno L. Long non-coding RNAs in cancer: implications for personalized therapy. Cell Oncol (Dordr). 2015;38:17–28.
Hsiao J, Yuan TY, Tsai MS, Lu CY, Lin YC, Lee ML, et al. Upregulation of haploinsufficient gene expression in the brain by targeting a long non-coding RNA improves seizure phenotype in a model of Dravet syndrome. EBioMedicine. 2016;9:257–77.
Zhang F, Lupski JR. Non-coding genetic variants in human disease. Hum Mol Genet. 2015;24(R1):R102–10.
Talkowski ME, Maussion G, Crapper L, Rosenfeld JA, Blumenthal I, Hanscom C, et al. Disruption of a large intergenic noncoding RNA in subjects with neurodevelopmental disabilities. Am J Hum Genet. 2012;91:1128–34.
Turner TN, Hormozdiari F, Duyzend MH, McClymont SA, Hook PW, Iossifov I, et al. Genome sequencing of autism-affected families reveals disruption of putative noncoding regulatory DNA. Am J Hum Genet. 2016;98:58–74.
Zhou W, Zhang F, Chen X, Shen Y, Lupski JR, Jin L. Increased genome instability in human DNA segments with self-chains: homology-induced structural variations via replicative mechanisms. Hum Mol Genet. 2013;22:2642–51.
Chen L, Zhou W, Zhang L, Zhang F. Genome architecture and its roles in human copy number variation. Genomics Inform. 2014;12:136–44.
Mefford HC, Zemel M, Geraghty E, Cook J, Clayton PT, Paul K, et al. Intragenic deletions of ALDH7A1 in pyridoxine-dependent epilepsy caused by Alu-Alu recombination. Neurology. 2015;85:756–62.
de Koning AP, Gu W, Castoe TA, Batzer MA, Pollock DD. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011;7:e1002384.
Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6:11.
Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, et al. The Dfam database of repetitive DNA families. Nucleic Acids Res. 2016;44(D1):D81–9.
Burge CB, Karlin S. Finding the genes in genomic DNA. Curr Opin Struct Biol. 1998;8:346–54.
Salamov AA, Solovyev VV. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000;10:516–22.
Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19 Suppl 2:ii215–25.
Mudge J, Harrow J. Methods for improving genome annotation. In: Alterovitz G, Ramoni MF, editors. Knowledge based bioinformatics: from analysis to interpretation. Chichester, West Sussex: John Wiley & Sons; 2010. p. 209–14.
Hattori M, Fujiyama A, Taylor TD, Watanabe H, Yada T, Park HS, et al. The DNA sequence of human chromosome 21. Nature. 2000;405:311–9. Erratum in: Nature. 2000;407:110.
Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, et al. The DNA sequence of human chromosome 22. Nature. 1999;402:489–95. Erratum in: Nature. 2000;404:904.
Karsch-Mizrachi I, Nakamura Y, Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. 2012;40(Database issue):D33–7.
Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet. 2012;13:329–42.
UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011;39(Database issue):D214–9.
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4. 1–9.
ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816.
Frankish A, Uszczynska B, Ritchie GR, Gonzalez JM, Pervouchine D, Petryszak R, et al. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics. 2015;16 Suppl 8:S2.
Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009;19:1316–23.
Farrell CM, O’Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, et al. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res. 2014;42(Database issue):D865–72.
Mudge JM, Frankish A, Harrow J. Functional transcriptomics in the post-ENCODE era. Genome Res. 2013;23:1961–73.
SeqCap. SeqCap EZ Human Exome Library v3.0. 2014. http://sequencing.roche.com/products/nimblegen-seqcap-target-enrichment/seqcap-ez-system/seqcap-ez-exome-v3.html. Accessed 12 Feb 2017.
Chen R, Im H, Snyder M. Whole-exome enrichment with the agilent sureselect human all exon platform. Cold Spring Harb Protoc. 2015;2015:626–33.
Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, et al. The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet. 2011;19:827–31.
Tyner C, Barber GP, Casper J, Clawson H, Diekhans M, Eisenhart C, et al. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 2017;45(D1):D626–34.
Barcia G, Fleming MR, Deligniere A, Gazula VR, Brown MR, Langouet M, et al. De novo gain-of-function KCNT1 channel mutations cause malignant migrating partial seizures of infancy. Nat Genet. 2012;44:1255–9.
Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–13.
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–50.
Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature. 2013;496:498–503. Erratum in: Nature. 2014;505:248.
Kalueff AV, Stewart AM, Gerlai R. Zebrafish as an emerging model for studying complex brain disorders. Trends Pharmacol Sci. 2014;35:63–75.
Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature. 2015;519:223–8.
Skarnes WC, Rosen B, West AP, Koutsourakis M, Bushell W, Iyer V, et al. A conditional knockout resource for the genome-wide study of mouse gene function. Nature. 2011;474:337–42.
Steward CA, Gonzalez JM, Trevanion S, Sheppard D, Kerry G, Gilbert JG, et al. The non-obese diabetic mouse sequence, annotation and variation resource: an aid for investigating type 1 diabetes. Database (Oxford). 2013;2013:bat032.
Church DM, Goodstadt L, Hillier LW, Zody MC, Goldstein S, She X, et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 2009;7:e1000112.
Hofker MH, Deursen JV. Transgenic mouse: methods and protocols. Methods in molecular biology. Totowa, NJ: Humana Press; 2003. p. 3741. xiii.
Pevzner P, Tesler G. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci U S A. 2003;100:7672–7.
MGI. MGI-Mouse Vertebrate Homology. 2017. http://www.informatics.jax.org/homology.shtml. Accessed 24 Apr 2017.
Kearney JA, Plummer NW, Smith MR, Kapur J, Cummins TR, Waxman SG, et al. A gain-of-function mutation in the sodium channel gene Scn2a results in seizures and behavioral abnormalities. Neuroscience. 2001;102:307–17.
Henshall DC, Hamer HM, Pasterkamp RJ, Goldstein DB, Kjems J, Prehn JH, et al. MicroRNAs in epilepsy: pathophysiology and clinical utility. Lancet Neurol. 2016;15:1368–76.
Bult CJ, Eppig JT, Blake JA, Kadin JA, Richardson JE, Mouse Genome Database Group. Mouse genome database 2016. Nucleic Acids Res. 2016;44(D1):D840–7.
Ma X, Chen C, Veevers J, Zhou X, Ross RS, Feng W, et al. CRISPR/Cas9-mediated gene manipulation to create single-amino-acid-substituted and floxed mice with a cloning-free method. Sci Rep. 2017;7:42244.
Leiter EH, von Herrath M. Animal models have little to teach us about type 1 diabetes: 2. In opposition to this proposal. Diabetologia. 2004;47:1657–60.
Roep BO, Atkinson M. Animal models have little to teach us about type 1 diabetes: 1. In support of this proposal. Diabetologia. 2004;47:1650–6.
Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin Chem. 2009;55:641–58.
Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013;10:1177–84.
Gordon D, Huddleston J, Chaisson MJ, Hill CM, Kronenberg ZN, Munson KM, et al. Long-read sequence assembly of the gorilla genome. Science. 2016;352:aae0344.
Zheng GX, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016;34:303–11.
Gilissen C, Hehir-Kwa JY, Thung DT, van de Vorst M, van Bon BW, Willemsen MH, et al. Genome sequencing identifies major causes of severe intellectual disability. Nature. 2014;511:344–7.
Lupski JR, de Oca-Luna RM, Slaugenhaupt S, Pentao L, Guzzetta V, Trask BJ, et al. DNA duplication associated with Charcot-Marie-Tooth disease type 1A. Cell. 1991;66:219–32.
Speevak MD, Farrell SA. Charcot-Marie-Tooth 1B caused by expansion of a familial myelin protein zero (MPZ) gene duplication. Eur J Med Genet. 2013;56:566–9.
Yuan B, Neira J, Gu S, Harel T, Liu P, Briceño I, et al. Nonrecurrent PMP22-RAI1 contiguous gene deletions arise from replication-based mechanisms and result in Smith-Magenis syndrome with evident peripheral neuropathy. Hum Genet. 2016;135:1161–74.
Corley SM, Canales CP, Carmona-Mora P, Mendoza-Reinosa V, Beverdam A, Hardeman EC, et al. RNA-Seq analysis of Gtf2ird1 knockout epidermal tissue provides potential insights into molecular mechanisms underpinning Williams-Beuren syndrome. BMC Genomics. 2016;17:450.
Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013;23:169–80.
Derti A, Garrett-Engele P, Macisaac KD, Stevens RC, Sriram S, Chen R, et al. A quantitative atlas of polyadenylation in five mammals. Genome Res. 2012;22:1173–83.
Zhang G, Annan RS, Carr SA, Neubert TA. Overview of peptide and protein analysis by mass spectrometry. Curr Protoc Protein Sci. 2010;Chapter 16:Unit16.1.
Ingolia NT. Ribosome profiling: new views of translation, from single codons to genome scale. Nat Rev Genet. 2014;15:205–13.
Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 2011;27:i275–82.
Jakovcevski M, Akbarian S. Epigenetic mechanisms in neurological disease. Nat Med. 2012;18:1194–204.
Henshall DC, Kobow K. Epigenetics and epilepsy. Cold Spring Harb Perspect Med. 2015;5(12); doi:10.1101/cshperspect.a022731.
PacBio. Detecting DNA Base Modification. 2017. http://www.pacb.com/wp-content/uploads/2015/09/WP_Detecting_DNA_Base_Modifications_Using_SMRT_Sequencing.pdf. Accessed 24 Apr 2017.
Mifsud B, Tavares-Cadete F, Young AN, Sugar R, Schoenfelder S, Ferreira L, et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat Genet. 2015;47:598–606.
Fullwood MJ, Ruan Y. ChIP-based methods for the identification of long-range chromatin interactions. J Cell Biochem. 2009;107:30–9.
Guturu H, Chinchali S, Clarke SL, Bejerano G. Erosion of conserved binding sites in personal genomes points to medical histories. PLoS Comput Biol. 2016;12:e1004711.
Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL, et al. The reality of pervasive transcription. PLoS Biol. 2011;9:e1000625. discussion e1001102.
Bussotti G, Leonardi T, Clark MB, Mercer TR, Crawford J, Malquori L, et al. Improved definition of the mouse transcriptome via targeted RNA sequencing. Genome Res. 2016;26:705–16.
Frankish A, Mudge JM, Thomas M, Harrow J. The importance of identifying alternative splicing in vertebrate genome annotation. Database (Oxford). 2012;2012:bas014.
Djemie T, Weckhuysen S, von Spiczak S, Carvill GL, Jaehn J, Anttonen AK, et al. Pitfalls in genetic testing: the story of missed SCN1A mutations. Mol Genet Genomic Med. 2016;4:457–64.
Mercimek-Mahmutoglu S, Patel J, Cordeiro D, Hewson S, Callen D, Donner EJ, et al. Diagnostic yield of genetic testing in epileptic encephalopathy in childhood. Epilepsia. 2015;56:707–16.
Foo JN, Liu JJ, Tan EK. Whole-genome and whole-exome sequencing in neurological diseases. Nat Rev Neurol. 2012;8:508–17.
de la Hoya M, Soukarieh O, López-Perolio I, Vega A, Walker LC, van Ierland Y, et al. Combined genetic and splicing analysis of BRCA1 c.[594-2A > C; 641A > G] highlights the relevance of naturally occurring in-frame transcripts for developing disease gene variant classification algorithms. Hum Mol Genet. 2016;25:2256–68.
MacArthur JA, Morales J, Tully RE, Astashyn A, Gil L, Bruford EA, et al. Locus Reference Genomic: reference sequences for the reporting of clinically relevant sequence variants. Nucleic Acids Res. 2014;42(Database issue):D873–8.
Subaran RL, Conte JM, Stewart WC, Greenberg DA. Pathogenic EFHC1 mutations are tolerated in healthy individuals dependent on reported ancestry. Epilepsia. 2015;56:188–94.
Helbig I, Tayoun AA. Understanding genotypes and phenotypes in epileptic encephalopathies. Mol Syndromol. 2016;7:172–81.
Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–91.
1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526:68–74.
MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, et al. Guidelines for investigating causality of sequence variants in human disease. Nature. 2014;508:469–76.
Hwang S, Kim E, Lee I, Marcotte EM. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep. 2015;5:17875.
Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet. 2017. doi: 10.1007/s00439-017-1779-6
Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2016;44(D1):D862–8.
Congenica. Congenica Ltd. 2017. https://www.congenica.com/. Accessed 24 Apr 2017.
Sophia-Genetics. Sophia Genetics. 2017. http://www.sophiagenetics.com/home.html. Accessed 24 Apr 2017.
WuXi. WuXi NextCODE. https://www.wuxinextcode.com/. Accessed 7 Apr 2017.
Omicia. Omicia 2016. http://www.omicia.com/. Accessed 24 Apr 2017.
Barrie ES, Smith RM, Sanford JC, Sadee W. mRNA transcript diversity creates new opportunities for pharmacological intervention. Mol Pharmacol. 2012;81:620–30.
Buckanovich RJ, Yang YY, Darnell RB. The onconeural antigen Nova-1 is a neuron-specific RNA-binding protein, the activity of which is inhibited by paraneoplastic antibodies. J Neurosci. 1996;16:1114–22.
Boumil RM, Letts VA, Roberts MC, Lenz C, Mahaffey CL, Zhang ZW, et al. A missense mutation in a highly conserved alternate exon of dynamin-1 causes epilepsy in fitful mice. PLoS Genet. 2010;6. doi: 10.1371/journal.pgen.1001046
Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov. 2010;9:203–14.
Arrowsmith J, Miller P. Trial watch: phase II and phase III attrition rates 2011–2012. Nat Rev Drug Discov. 2013;12:569.
Hay M, Thomas DW, Craighead JL, Economides C, Rosenthal J. Clinical development success rates for investigational drugs. Nat Biotechnol. 2014;32:40–51.
Vengoechea J, Parikh AS, Zhang S, Tassone F. De novo microduplication of the FMR1 gene in a patient with developmental delay, epilepsy and hyperactivity. Eur J Hum Genet. 2012;20:1197–200.
Lemke JR, Lal D, Reinthaler EM, Steiner I, Nothnagel M, Alber M, et al. Mutations in GRIN2A cause idiopathic focal epilepsy with rolandic spikes. Nat Genet. 2013;45:1067–72.
Epi4K Consortium. De novo mutations in SLC1A2 and CACNA1A are important causes of epileptic encephalopathies. Am J Hum Genet. 2016;99:287–98.
Bilguvar K, Oztürk AK, Louvi A, Kwan KY, Choi M, Tatli B, et al. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature. 2010;467:207–10.
Coutinho AM, Oliveira G, Katz C, Feng J, Yan J, Yang C, et al. MECP2 coding sequence and 3′UTR variation in 172 unrelated autistic patients. Am J Med Genet B Neuropsychiatr Genet. 2007;144B:475–83.
Combi R, Dalprà L, Ferini-Strambi L, Tenchini ML. Frontal lobe epilepsy and mutations of the corticotropin-releasing hormone gene. Ann Neurol. 2005;58:899–904.
Ramser J, Abidi FE, Burckle CA, Lenski C, Toriello H, Wen G, et al. A unique exonic splice enhancer mutation in a family with X-linked mental retardation and epilepsy points to a novel role of the renin receptor. Hum Mol Genet. 2005;14:1019–27.
Elkon R, Ugalde AP, Agami R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet. 2013;14:496–506.
Lynch DC, Revil T, Schwartzentruber J, Bhoj EJ, Innes AM, Lamont RE, et al. Disrupted auto-regulation of the spliceosomal gene SNRPB causes cerebro-costo-mandibular syndrome. Nat Commun. 2014;5:4483.
Qureshi IA, Mehler MF. Emerging roles of non-coding RNAs in brain evolution, development, plasticity and disease. Nat Rev Neurosci. 2012;13:528–41.
GENCODE. GENCODE annotation biotypes. https://www.gencodegenes.org/gencode_biotypes.html. Accessed 24 Apr 2017.
Kozak M. An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987;15:8125–48.
Ivanov IP, Firth AE, Michel AM, Atkins JF, Baranov PV. Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences. Nucleic Acids Res. 2011;39:4220–34.
Brenner S, Barnett L, Katz ER, Crick FH. UGA: a third nonsense triplet in the genetic code. Nature. 1967;213:449–50.
Venters BJ, Pugh BF. Genomic organization of human transcription initiation complexes. Nature. 2013;502:53–8.
Mitchell PJ, Tjian R. Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science. 1989;245:371–8.
Fatemi M, Pao MM, Jeong S, Gal-Yam EN, Egger G, Weisenberger DJ. Footprinting of mammalian promoters: use of a CpG DNA methyltransferase revealing nucleosome positions at a single molecule level. Nucleic Acids Res. 2005;33:e176.
Down TA, Hubbard TJ. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 2002;12:458–61.
Matlin AJ, Clark F, Smith CW. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005;6:386–98.
We thank Jane Rogers for her guidance, Eugene Bragin for his informatics input, and Imogen Steward, who is still awaiting her genetic diagnosis but was instrumental in the undertaking of this manuscript. We hope that this paper will support patients such as Imogen, now and in the future.
This work is funded by the National Institutes of Health grant U41HG007234 to the GENCODE project, and Wellcome Trust grant (WT098051) to the Sanger Institute and the European Molecular Biology Laboratory. Part of this work was undertaken at University College London Hospitals, which received a proportion of funding from the NIHR Biomedical Research Centres funding scheme. We are grateful for support from the Epilepsy Society.
All authors wrote and proof-read the manuscript and then read and approved the final manuscript. CAS managed the writing and organized the article and produced all figures and tables after mutual discussion with all authors. APJP, BAM and SMS provided clinical input and structure.
CAS is Technical Support Scientist at Congenica Ltd, a clinical interpretation company and one of the partners for the UK 100,000 Genomes Project. CAS’s daughter has been diagnosed with West syndrome, an early infantile epileptic encephalopathy, and is currently on the 100,000 Genomes Project. JH is Program Manager for Population Sequencing at Illumina Inc. The other authors declare that they have no competing interests.