A 26-hour system of highly sensitive whole genome sequencing for emergency management of genetic diseases
- Neil A. Miller†1,
- Emily G. Farrow†1, 2, 3, 4,
- Margaret Gibson1,
- Laurel K. Willig1, 2, 4,
- Greyson Twist1,
- Byunggil Yoo1,
- Tyler Marrs1,
- Shane Corder1,
- Lisa Krivohlavek1,
- Adam Walter1,
- Josh E. Petrikin1, 2, 4,
- Carol J. Saunders1, 2, 3, 4,
- Isabelle Thiffault1, 3,
- Sarah E. Soden1, 2, 4,
- Laurie D. Smith1, 2, 3, 4,
- Darrell L. Dinwiddie5,
- Suzanne Herd1,
- Julie A. Cakici1,
- Severine Catreux6,
- Mike Ruehle6 and
- Stephen F. Kingsmore1, 2, 3, 4, 7Email author
© Miller et al. 2015
Received: 15 May 2015
Accepted: 10 September 2015
Published: 30 September 2015
While the cost of whole genome sequencing (WGS) is approaching the realm of routine medical tests, it remains too tardy to help guide the management of many acute medical conditions. Rapid WGS is imperative in light of growing evidence of its utility in acute care, such as in diagnosis of genetic diseases in very ill infants, and genotype-guided choice of chemotherapy at cancer relapse. In such situations, delayed, empiric, or phenotype-based clinical decisions may meet with substantial morbidity or mortality. We previously described a rapid WGS method, STATseq, with a sensitivity of >96 % for nucleotide variants that allowed a provisional diagnosis of a genetic disease in 50 h. Here improvements in sequencing run time, read alignment, and variant calling are described that enable 26-h time to provisional molecular diagnosis with >99.5 % sensitivity and specificity of genotypes. STATseq appears to be an appropriate strategy for acutely ill patients with potentially actionable genetic diseases.
Genomic medicine is a new discipline whereby an individual’s genome information is used to guide personal strategies for disease prevention, etiologic diagnosis, and therapeutic selection [1, 2]. Despite its recent implementation into clinical care, genomic medicine is already transforming the diagnosis, molecular staging, prognostic determination, and management of patients with symptoms suggestive of genetic diseases, particularly Mendelian disorders (those caused by defects in single genes) and recurrent cancers [3–11]. Genomic medicine is transformative in these applications because it rapidly and simultaneously tests nearly all genes potentially relevant to that patient’s disease, largely irrespective of the clinician’s differential diagnosis list or detailed knowledge of all of the conditions being tested [6, 11]. This is particularly powerful for patients with very rare or newly discovered diseases, atypical clinical presentations or responses to treatment, and actionable pharmacogenomic findings. Timely molecular diagnosis, staging, and prognosis along with pharmacogenomic-based guidance can immediately engender a treatment shift from interim, phenotype-driven, population-based management to precision medicine with definitive, individualized therapies, and management plans, as well as drug exposures attuned to genotype and molecular prognosis . In particular, there is increasing evidence that rapid whole genome sequencing (WGS) can be useful in the acute care of infants with genetic diseases in neonatal and pediatric intensive care units [6, 11–13].
While the cost of WGS has fallen dramatically, it remains too slow to be suitable for guidance in the management of many acute medical conditions. We previously described diagnostic WGS for genetic diseases in 50 h (WGS50), with 77–96 % sensitivity and approximately 99.5 % specificity for detection of nucleotide variants . Fifty hours was the interval between receipt of a blood sample and identification of a provisional molecular diagnosis, provided that the diagnosis was readily apparent upon variant filtering. WGS50 had two principal time components: 2 × 100 cycles of sequencing-by-synthesis (SBS) was approximately 25.5 h. The identification of nucleotide variants by sequence alignment, variant calling, and genotyping was approximately 17.5 h. Here we describe second generation STATseq, with improved timeliness, sensitivity, and scalability.
Study design, setting, and participants
This study was approved by the Institutional Review Board (IRB) at Children’s Mercy – Kansas City (CM-KC). It conforms to the Declaration of Helsinki. Participants were principally parent–child trios enrolled in a research biorepository who received WGS in addition to standard diagnostic tests to diagnose monogenic disorders of unknown etiology in affected infants [6, 11, 12]. Affected infants with suspected genetic disorders were nominated by a treating physician, typically a neonatologist. A standard form requesting the primary signs and symptoms, past diagnostic testing results, differential diagnosis or candidate genes, pertinent family history, availability of biologic parents for enrollment, and whether rapid WGS would potentially affect treatment was submitted for immediate evaluation by a team of experts at the Center for Pediatric Genomic Medicine (CPGM) at CM-KC. Infants received rapid WGS if the likely diagnosis was of a type that was detectable by next-generation sequencing and had any potential to alter management or genetic counseling. Patients were not required to undergo standardized clinical examinations or diagnostic testing prior to referral; standard etiologic testing was performed as clinically indicated. Infants likely to have disorders associated with cytogenetic abnormalities were not accepted unless standard testing for those disorders was negative. Informed written consent was obtained from parents. Retrospective samples, UDT103 and UDT173, were blinded validation samples with known molecular diagnoses for a genetic disease . Reference sample NA12878 was obtained from the Coriell Institute repository.
Ascertainment of clinical features
The clinical features of affected infants were ascertained comprehensively by physician and family interviews and review of the medical record. Phenotypic features were translated into Human Phenotype Ontology (HPO) terms and mapped to 6,000 genetic diseases with the clinicopathologic correlation tools Phenomizer and/or SSAGA [6, 11, 14, 15]. Briefly, Phenomizer uses term-similarity measures to calculate a similarity score for query HPO terms entered by the user and terms used to annotate diseases in HPO. It then assigns a P value using statistical modeling to compare the similarity score obtained for the specific set of phenotypic terms entered into the distribution of similarity scores obtained using randomly chosen HPO term combinations. The P value was then used to rank diseases in the differential genetic diagnosis. The Phenomizer differential genetic diagnosis is exported as a tab-separated value file. Diagnoses without known causative genes are removed. Where the likely inheritance pattern is apparent, the Phenomizer output is limited to the appropriate inheritance mode. Where Phenomizer reports many equally scoring values, the differential diagnosis is performed using several different but overlapping term sets, for example with a key feature being listed as mandatory rather than observed, or with removal of clinical features that are felt to likely represent a second, unrelated disorder. Finally, the Phenomizer list may be pruned to 100–250 entries, if necessary, based on manual inspection of the fit of diseases to the clinical features at various P value cutoffs.
Similarly, SSAGA is web-based software that facilitates entry of Human Phenotype Ontology (HPO) terms related to the clinical features observed in an individual patient (Additional file 1: Figure S1). SSAGA provides a differential diagnosis that is limited to all Online Mendelian Inheritance in Man (OMIM), Orphanet, and DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) disease entries that match at least one entered feature [16, 17]. Diseases in the differential genetic diagnosis can be ranked by the number of matching terms entered.
Genome sequencing and quality control
DNA isolation from peripheral blood was automated utilizing the MSMI Chemagen Instrument equipped with liquid dispensing manifolds (Perkin Elmer, Baesweiler, Germany). Briefly, a 24-well head is used to isolate 1.8 mL of blood per sample. The system is fully enclosed to comply with biosafety standards, and isolation is completed in approximately 2 h resulting in an average of approximately 40 μg of DNA/mL of blood.
For 18-h WGS performed in Essex, isolated genomic DNA was prepared using a modification of the standard Illumina TruSeq sample preparation. Briefly, DNA was sheared using a Covaris S2 Biodisruptor, end-repaired, A-tailed, and adaptor-ligated. PCR was omitted. Libraries were purified using SPRI beads (Beckman Coulter). For 18-h WGS, the amount of DNA used was optimized, based on experience of varying the input from representative DNA samples, and allowed a concentration to be selected that produced a known cluster density after the library was denatured using 0.1 M NaOH and presented to the flowcell.
At CM-KC, genomic DNA was prepared for WGS using either TruSeq PCR Free (Illumina) or KAPA HYPER (KAPA Biosystems) without PCR amplification according to manufacturer’s protocols. HYPER library preparation without PCR is completed in 90 min, with an additional 90 min allotted for QC. Briefly, 2 μg of DNA was sheared with a Covaris S2 Biodisruptor, end-repaired, A-tailed, and adaptor-ligated. Quantitation of libraries was carried out by real-time PCR.
Samples for rapid WGS were each loaded onto one or two flowcells, followed by sequencing on Illumina HiSeq2500 instruments using HiSeq Rapid SBS v1 chemistry that were set to rapid run mode (SBS26) or with customized faster flowcell scanning times (SBS18). Cluster generation, followed by 2 × 101 cycle sequencing reads, separated by the paired-end turnaround, were performed automatically on the instrument. Prior to SBS18 optimization, tiles failing quality metrics were manually excluded before further analysis. Genome sequencing was performed as research, largely in a manner that complies with routine diagnostic tests, as defined by the Clinical Laboratory Improvements Act (CLIA) guidelines [18–21]. However, genome sequencing was not performed herein as a CLIA laboratory developed test.
Next generation sequencing analysis
Sequence data were generated with Illumina RTA 18.104.22.168 & CASAVA-1.8.2, aligned to the human reference GRCh37.p5 using GSNAP , and nucleotide (nt) variants were detected and genotyped with the Genome Analysis Tool Kit  (GATK, versions 1.6. and 3.2). Sequence analysis used FASTQ, bam, and VCF files. The largest deletion variant detected was 263 nt, and the largest insertion was 469 nt.
Variants were annotated with the Rapid Understanding of Nucleotide variant Effect Software (RUNES, v3.3.5) [6, 11, 12]. RUNES incorporates data from ENSEMBL’s Variant Effect Predictor (VEP) software , produces comparisons to NCBI dbSNP, known disease variants from the Human Gene Mutation Database , and performs additional in silico prediction of variant consequences using RefSeq and ENSEMBL gene annotations . RUNES categorized each variant according to ACMG recommendations for reporting sequence variation [18–21] and with an allele frequency (MAF) derived from CPGM’s Variant Warehouse database of approximately 90 million variants and 3,900 individuals [6, 11, 12]. Category 1 variants had previously been reported to be disease-causing. Category 2 variants had not previously been reported to be disease-causing, but were of types that were expected to be pathogenic (loss of initiation, premature stop codon, disruption of stop codon, whole gene deletion, frameshifting indel, disruption of splicing). Category 3 were variants of unknown significance that were potentially disease-causing (non-synonymous substitution, in-frame indel, disruption of polypyrimidine tract, overlap with 5’ exonic, 5’ flank or 3’ exonic splice contexts). Category 4 were variants that were probably not causative of disease (synonymous variants that were unlikely to produce a cryptic splice site, intronic variants >20 nt from the intron/exon boundary, and variants commonly observed in unaffected individuals). Category 5 variants were known to be benign. All variants, together with their RUNES annotations, are stored in a queriable warehouse database (Additional file 2: Figure S2). Inputs to the RUNES pipeline were a genomic variant file (.vcf or .gvcf); the pipeline produces a JSON document that is used as input to the VIKING interpretation tool.
The DRAGEN pipeline operates on a single-server hybrid hardware/software platform, with a dual Intel Xeon central processing units (CPUs), and a custom Peripheral Component Interconnect Express (PCIe) board with a field-programmable gate array (FPGA) and 32 GB of Dynamic random-access memory (DRAM) attached directly via four double data rate type three synchronous dynamic random-access memory DDR3 SDRAM channels. Critical compute-intensive functions of the pipeline are performed by custom massively parallel FPGA logic for maximum speed, while other functions run in optimized multi-threaded software on the Xeon cores, for maximum flexibility. A parallel (redundant array of independent disks, RAID 0) Solid State Drive (SSD) file system provides the I/O bandwidth necessary to feed the processing pipeline, and FPGA compress/decompress engines maintain throughput to and from compressed file formats.
DRAGEN read mapping/alignment
DRAGEN uses a hash table index of a reference genome to map many overlapping seeds from each read to exact matches in the reference. The hash table is constructed from any chosen reference with a multi-threaded tool, in as little as 6 min for a whole human genome, and loaded into the FPGA-board DRAM prior to mapping operations. The entire read mapping process is performed by custom FPGA logic, with software layers streaming unaligned reads from FASTQ (or Illumina BCL) files to the PCIe board via DMA, and simultaneously streaming aligned read records back into host memory, for BAM output and/or variant calling.
DRAGEN’s hash-based mapping uses a novel dynamic seed extension method: when a primary seed (default 21 nt) matches more than a maximum number of reference locations (default 16), longer seeds from all these reference positions are populated into the hash table, such that specific extended seed sequences will match fewer reference locations. Seeds are extended symmetrically, up to 64 nt in each direction, with a maximum of 149 nt from a 21 nt primary seed. Long seed extensions were done in multiple short increments, averaging 3–4 nt in each direction, with different extended seed patterns terminating at different extension lengths, as needed to match no more than the maximum number of reference positions.
When a hash table query is made for a common primary seed, a single EXTEND record (merging the contents of two or more objects together into the first object) is retrieved, indicating the number of additional read bases to join onto the seed in each direction. The additional bases were hashed (along with a unique identifier for the pre-extended seed), and another hash table query was made, which may return yet another EXTEND record, iteratively. When an adequate extended seed length was achieved, the next hash table query retrieved a list of up to the maximum number of matching reference positions and orientations. This iterative seed extension method yields similar results to incremental suffix-tree or Burrows-Wheeler mapping but with dramatically fewer index memory accesses, which is critical to DRAGEN’s mapping speed.
In FPGA logic, read pairs are load-balanced over several DRAGEN map/align engines. An engine extracts many overlapping seeds from each read, by default starting at every even offset (50 % density). These are mapped by DRAM hash table queries, each to zero or more reference positions, with forward or reverse-complemented orientation determined for each match. The several engines nearly saturate the four local DDR3 interfaces with hash bucket read bursts and reference sequence fetches for alignment. Matches along similar alignment diagonals are grouped into seed chains, which are conservatively filtered; by default, a short seed chain can be filtered out if another seed chain at least four times longer mostly overlaps it in the read.
Lists of seed chains from paired end reads are examined to detect pairs with appropriate insert sizes and orientations. For each unpaired seed chain, a rescue scan may be executed to search for the mate within the expected insert window; mate K-mer matches within a configurable Hamming distance (the number of positions at which the corresponding nucleotides are different) result in new candidate positions being added to the list of seed chains. Each seed chain or rescue match is then extended by gapless local alignment, permitting single nucleotide variants (SNVs) and clipping but not nucleotide insertions and deletions (indels). The collection of gapless alignment results for each read is analyzed by heuristics, to judge for which ones Smith-Waterman gapped alignment would have a non-trivial likelihood of improving the overall read pair results.
Each Smith-Waterman aligner uses an array of 56 parallel scoring cells, virtually arranged into an anti-diagonal wavefront, which steps one position horizontally or vertically each clock cycle. The wavefront scores a generally diagonal swath of cells through the alignment rectangle but steers automatically to re-center the best alignment path after indel events. Back-trace from the maximum scoring cell runs simultaneous with the following alignment operation, yielding a CIGAR string, which indicates soft clipping and indel positions.
All gapped and gapless alignment results are compared to obtain best and second-best scores. For paired ends, pair scores are computed, each as the sum of the two alignment scores minus a pairing penalty, based on the deviation from the empirical mean insert; and the best scoring pair is reported. The quality of read mapping (MAPQ) is estimated primarily in proportion to the difference between best and second best scores, the proportionality coefficient varying by read length; second-order factors such as the number of scores very close to the second-best are also considered. When the best alignment does not cover a read, up to three supplementary (chimeric) alignments are optionally reported for other segments of the read.
DRAGEN sorting and duplicate marking
After mapping, reads are sorted by reference position; PCR or optical duplicates are optionally flagged. An initial sorting phase operates on aligned reads returning from the FPGA. Final sorting and duplicate marking commences when mapping completes; these operations overlap variant calling when the latter is requested, and add almost zero time to the FASTQ-to-VCF pipeline.
DRAGEN variant calling
The DRAGEN variant caller runs mostly in highly optimized software, for maximum flexibility of the algorithms. Only stable, compute-intensive operations are accelerated by FPGA engines. DRAGEN implements multi-threaded parallelism in a single pass over the whole reference genome, without launching multiple caller processes on various subsets of the reference. A single call to the DRAGEN executable runs the entire pipeline from FASTQ to VCF, for the whole genome. Mapping/alignment is done in one pass over the reads, and all steps of variant calling (in addition to read sorting and duplicate marking) run simultaneously in a software/hardware pipeline emitting VCF results.
First, callable reference regions are identified, with sufficiently aligned coverage. Within these, a fast scan of the sorted reads identifies active regions, centered around pileup columns with non-trivial evidence of a variant, and padded with enough context to cover significant non-reference content nearby, extra wide where there is evidence of indels.
Aligned reads are clipped within each active region and assembled into a De Bruijn graph, edges weighted by observation counts using the reference sequence as a backbone. If the graph is degenerate, it is reconstructed using longer K-mers. After some graph cleanup and simplification, all source-to-sink paths are extracted as candidate haplotypes, up to a limit (default 128). If this cap must be enforced, higher-weight paths are preferred. Each haplotype is Smith-Waterman aligned back to the reference genome to identify the variants it represents, and re-synchronized with read alignments.
Then for each read-haplotype pair, the probability P(r|H) of observing the read, assuming the haplotype was the true starting sample, is estimated using a pair hidden Markov model (HMM). Since the haplotype is assumed true, only errors in sample preparation and sequencing are considered. Essentially, the probabilities of all possible alignments (edit combinations) of the read to the haplotype are calculated and summed, using a dynamic programming matrix very similar to affine-gap Smith-Waterman, except summing rather than maximizing path probabilities. At each row (read position) in the matrix, mismatch probabilities are taken from base quality scores and MAPQ, and gap probabilities are derived from a PCR error model sensitive to repetitive sequence content.
This pair-HMM calculation is the most expensive step, and, therefore, is accelerated in custom FPGA logic. Reads and haplotypes to be compared are queued up for HMM processing by software threads completing previous steps and sent to the FPGA by direct memory access (DMA). A load balancer distributes work over more than 100 small HMM engines, each of which is pipelined to calculate all three probabilities (for match, insert, and delete states) for one matrix cell per clock cycle. Calculated P(r|H) values DMA back to host memory, where they are picked up by downstream software threads.
Scanning by reference position over the active region, candidate genotypes are formed from diploid combinations of variant events (SNVs or indels) observed in the earlier Smith-Waterman alignments of the haplotypes to the reference. For each event (including reference), the conditional probability P(r|e) of observing each overlapping read is estimated as the maximum P(r|H) for haplotypes supporting the event. These are multiplied to yield the conditional probability P(R|e) of observing the whole read pileup, and using Bayes’ formula, the posterior probability P(e1e2|R) of each diploid genotype (diplotype) is calculated, and the winner is called.
Causative variants were identified primarily with Variant Integration and Knowledge INterpretation in Genomes (VIKING) software (Additional file 2: Figure S2 and Additional file 3: Figure S3) [6, 11]. Inputs for VIKING were the annotated genomic variant file produced by the RUNES pipeline and a SSAGA (Symptom and Sign Associated Genome Analysis) or Phenomizer record, comprising the clinical features of the affected patient, corresponding diseases in the differential diagnosis, and the respective disease genes (Additional file 1: Figure S1) [6, 11, 14, 15]. The SSAGA or Phenomizer record was created during the laboratory steps in WGS26. Alternatively, a menu of pre-determined candidate gene lists can be utilized to filter variants in VIKING, such as genes with OMIM records, or genes previously associated with mitochondrial disorders. VIKING integrated the superset of relevant disease mappings and annotated variant genotypes. By allowing dynamic filtering of variants based on variables such as individual clinical features, diseases, genes, assigned ACMG-type pathogenicity category, allele frequency, genotype, and inheritance pattern, VIKING assists in identification of a differential diagnosis. VIKING settings can be saved, which allows configuration in a manner that can enable a provisional molecular diagnosis to be determined in as little as seconds. VIKING also allowed data mark-up, sessions to be saved, and export of fields in formats suitable for inclusion in diagnostic reports.
In a typical interpretation session, variants were filtered by limitation to ACMG Categories 1–3 and MAF <1 %, <0.5 %, <0.1 %, or to those that are unique to the proband or to the family, dependent on the clinical impression (Additional file 3: Figure S3). All potential monogenetic inheritance patterns were examined, including de novo, recessive, dominant, X-linked, mitochondrial, and, where possible, somatic variation. Where a single likely causative heterozygous variant for a recessive disorder was identified, the entire coding domain was manually inspected using the Integrated Genome Viewer (IGV) for coverage, additional variants, as were variants for that locus called in the appropriate parent that may have had low coverage in the proband . VIKING featured link-outs to IGV that are refreshed in a trio on a variant-by-variant basis allowing rapid examination of pattern of inheritance, quality of alignment, and local sequence features (such as simple sequence repeats). Expert interpretation and literature curation were performed for all likely causative variants with regard to evidence for pathogenicity. VIKING featured link-outs to a warehouse of approximately 90 million variants in approximately 3,900 individuals, OMIM, HGNC, HGMD, Entrez Gene, ENSEMBL and pathways information, facilitating rapid literature curation (Additional file 2: Figure S2). Analysis was performed sequentially by two experts. Sanger sequencing was used for clinical confirmation of all diagnostic genotypes. Reporting was performed by an ACMG fellow laboratory director who was an expert in WGS analysis in single gene diseases. Additional expert consultation and functional confirmation were performed when the subject’s phenotype differed from previous mutation reports for that disease gene.
26-h whole genome sequencing
Breakdown of times of principal steps for rapid diagnostic whole genome sequencing
DNA isolation, QC and shearing
PCR-free library prep
WGS library QC
% > Q30
RUNES variant annotation
WGS26, SBS18, and Dragen v1.2
WGS26, SBS18, and Dragen v1.2
WGS26, SBS18, and Dragen v1.2
WGS26, SBS18, and Dragen v1.2
Second, the time taken for sequence alignment, variant detection, and genotyping was reduced from approximately 15 h in WGS50, gapped alignment, and variant calling with CASAVA v1.8.2 (Illumina), to approximately 40 min for WGS26 with the novel DRAGEN aligner and variant caller (Table 1). DRAGEN accelerated these steps by highly parallel alignments to a sorted reference genome and customized high-memory computer hardware with high IO throughput.
Third, in WGS50 variants were annotated for likely functional consequence with Rapid Understanding of Nucleotide variant Effect Software (RUNES) software in 2.5 h. In WGS26, RUNES was accelerated to 30 min by software refactoring. Fourth, WGS50 utilized manual manipulation of spreadsheets for analysis and interpretation of variants. In WGS26, these steps were performed with the interpretation software VIKING (Variant Integration and Knowledge INterpretation In Genomes). VIKING facilitated genome analysis and interpretation by allowing dynamic filtering of variants based on variables such as individual clinical features, diseases, genes, assigned ACMG-type pathogenicity category, allele frequency, genotype, and inheritance pattern (Additional file 2: Figure S2). For example, VIKING filtering to display variants with: (1) an ACMG-type pathogenicity category of 1–3; (2) an allele frequency of less than 0.1 %; (3) that fit a recessive inheritance pattern (homozygous, compound heterozygous, or hemizygous); and (4) that are in OMIM monogenic disease-associated genes, yielded 16 variants in eight genes in WGS18 of sample UDT_103 (Additional file 2: Figure S2). Of these, only two variants in one gene fit the clinical features of the patient and a bona fide inheritance pattern. They were UNC13D NM_199242.2 c.2955-2A > G in 16 of 31 reads with quality score 99, which was inherited maternally, and c.859-3C > A in 18 of 35 reads with quality score 99, which was inherited paternally. These variants gave an actionable, provisional diagnosis of Hemophagocytic lymphohistiocytosis type 3. In complex cases, where no causative diplotype is identified by such semi-automated analysis, a thorough manual analysis ensues that can take many hours.
Analytic performance of 26-h WGS
Comparison of the analytic performance of a conventional alignment and variant calling pipeline (GSNAP with GATK minus VQSR), with a novel, extremely rapid method (DRAGEN)
SBS18 yield (GB)
Alignments with mapping quality >20
% Paired Reads
% Chimeric Reads
Rare, potentially pathogenic variants
Analytic sensitivity (GeT-RM or SNP array)
Analytic specificity (GeT-RM or SNP array)
Analytic sensitivity (full GIAB)
Analytic specificity (full GIAB)
Optimizing the sensitivity and specificity of 26-h WGS
GSNAP/GATK-VQSR, while providing excellent sensitivity and specificity, was both costly and insufficiently rapid to be ideal for diagnosis in acutely ill neonates. Compute time on a 608 Intel Xeon core Linux cluster with 6 TB of DDR3 RAM and 20 TB SATA hard drives was 22.5 h. A number of alternative alignment and variant detection algorithms and hardware were evaluated. The most rapid and sensitive of these was DRAGEN v1.2 (Edico Genomics, La Jolla, CA, USA). Compute time on two 12-core Intel Xeon processors with hyper-threading technology (with 128 GB of RAM and 8 × 400 GB RAID-0 SSD on the staging disk) was 41 min for 40X WGS (Table 1). The analytic performance of DRAGEN and GSNAP/GATK-VQSR were compared in three SBS18 runs at two sites with varying sequence yield (Table 2). DRAGEN identified a similar number of genomic nucleotide variants to GSNAP/GATK-VQSR (averages of 4,719,492 and 4,736,550, respectively, in the three Caucasian genomes), and similar number of rare, potentially pathogenic variants (averages of 684 and 629 variants of ACMG categories 1–3 with allele frequencies <1 %, respectively, Table 2). However, DRAGEN provided both very rapid alignment and variant calling and slightly higher sensitivity and specificity than GSNAP/GATK 1.6 or 3.2 without VQSR (as high as 99.9 % for both; Tables 1 and 2).
These data support the use of multiple alignment and calling algorithms for maximal sensitivity, and highlight deficiencies in the NA12878 reference datasets.
Here we have described methods for rapid, medical WGS (version 2 STATseq), with greater analytic sensitivity (99.5 % in a 40X genome), faster time to result, and improved scalability. Twenty-six hours was the shortest elapsed time from receipt of a blood sample to diagnosis of a genetic disease. Twenty-six hours was possible when readily apparent upon application of a standardized set of variant filters using VIKING software and integration of an automated differential diagnosis based on SSAGA or Phenomizer software. It assumed no time interval between steps in the protocol. Maximum analytic sensitivity was achieved by combining variant calls of the DRAGEN and GSNAP/GATK 3.2-VQSR pipelines. The most significant innovations were as follows: First, approximately 18–21 h to generate 30–47-fold, 2 × 100 nt SBS with a modified Illumina HiSeq 2500. Second, approximately 1 h for read alignment, variant calling, annotation, and interpretation. Importantly, the methods were replicated both in a research laboratory (Illumina, Essex) and in a genome center in a children’s hospital (CM-KC).
In addition to speed, the methods described herein enable scaling of medical WGS to approximately 350 samples per year per sequencing instrument. The DRAGEN alignment and variant calling hardware and software has specifications which are likely to make genome sequencing practicable in many hospital laboratories, such as reducing the need for cloud computing or a large local cluster. The VIKING software greatly alleviates the burden of genome analysis and interpretation and allows common inheritance modes to be rapidly examined. When a diplotype of likely pathogenic variants is observed in a gene that ranks high on the differential diagnosis, interpretation can be performed in minutes. Ruling out a genetic diagnosis or making a diagnosis in situations of novel phenotype expansion, however, is an arduous process, involving hours of effort by a highly experienced, laboratory geneticist, even when assisted by software.
Diagnostic sensitivity is the single most important attribute for medical WGS. Here we examined a surrogate, namely analytic sensitivity for nucleotide variants; 99.9 % analytic sensitivity and specificity of genome-wide genotypes was obtained with high quality, 47X genome sequence and the DRAGEN pipeline in a 26-h format. Notably, this figure reflects both substitutions and indels (of size up to 469 nt). Possibly more remarkable was 99.4 % analytic sensitivity and specificity with 20X genome sequence and the same methods. Furthermore, analytic sensitivity was further increased when two alignment algorithms and variant callers are used, as has been suggested [12, 29]. Herein, we achieved maximum analytic sensitivity by combining variant calls of the DRAGEN and GSNAP/GATK 3.2-VQSR pipelines. In a diagnostic use-case, the issue of which two conflicting genotypes to retain at sites where variant genotypes differ between the two pipelines may be solved simply by retention of the more pathogenic genotype. Greater sensitivity resulted in a remarkable increase in rare variants, which were being over-filtered by conventional pipelines. Thus, the approximately 2.8 billion nucleotides of genomes that can be genotyped with paired, short reads contain approximately 5 million nucleotide variants in individuals of northern European ancestry, and approximately 6 million in those of African ancestry. The ultra-sensitive GSNAP/GATK-VQSR pipeline has been in routine use in a clinical laboratory with Sanger confirmation of hundreds of diagnostic genotypes [6, 11] (Saunders et al., unpublished). This experience has confirmed the results reported herein – namely that analytic specificity remains adequate for clinical use despite such sensitivity. In short, we believe that we, as a community, been missing many variants due to the limitations of our software algorithms. However, in contrast to analytic sensitivity, further work is needed to determine whether a two-pipeline method improves diagnostic yield sufficiently to be cost effective in light of decreased specificity.
For greatest usefulness as a clinical diagnostic tool, WGS must genotype all genomic sites, whether reference or variant calls. In this manner, WGS can be used both for diagnosis and to rule out treatable genetic diagnoses. From a clinician perspective, ruling out treatable genetic disease diagnoses or diseases with benign prognosis is paramount for clinical decision-making. In particular, in acutely ill infants in a NICU setting, end-of-life decisions are common, with most deaths resulting from withholding or withdrawing care after careful weighting of the prognosis by the care team and family [11, 30]. Notably, the methods described here assign values to approximately 2.8 billion nucleotides, whether variant genotypes, reference genotypes, or no call. With further software development, it should be possible to generate an automatic report of the completeness of genotyping of all protein coding nucleotides and intron-exon boundaries of relevant disease genes with defined coverage and quality scores, and, thereby, in the future, to ‘rule out’ specific diagnoses.
Limitations of rapid WGS
While medical WGS is becoming increasingly robust, especially relative to exome sequencing, it is appropriate to highlight its current analytic limitations for genetic disease diagnosis. The analytic sensitivity for variants other than nucleotide variants is too low for use as a stand-alone clinical test. Notable deficiencies of paired, short-read WGS are analytic sensitivity and specificity for pathogenic structural variations and triplet repeat expansions. Phenotype-associated genes with highly homologous pseudogenes require custom software solutions to disambiguate variants mapping to the gene or pseudogene. The biggest limitation for medical WGS and exome sequencing, however, is the interpretation of variants of uncertain significance (VUS). For these reasons, genetic disease diagnosis will continue to require multiple types of testing, including functional and confirmatory testing, for the foreseeable future.
Another current limitation of WGS26 is that it is a research method, and confirmatory testing of causative genotypes, which is typically required for diagnostic reporting, takes at least two days. Upon protocol validation to meet CLIA and CAP guidelines for laboratory developed tests (LDTs), however, the requirement for confirmatory testing will be decided on a case-by-case basis by an accredited laboratory director. Over the next several years, however, some type of FDA approval will also be required for high complexity LDTs, such as medical WGS. A pre-investigational device exemption inquiry was made for clinical research use of WGS50 for diagnosis of genetic diseases in acutely ill infants in our level IV (regional) NICU. Encouragingly, the FDA conferred non-significant risk status for these methods for research use in this setting.
A third limitation of current WGS is lack of comprehensive negative predictive value. On a gene-by-gene basis, current WGS allows visual inspection for gaps in exonic or intronic coverage. Thus, where a single diagnosis – such as MSUD – must be ruled out, this can readily be accomplished. A significant advantage of WGS over exome sequencing is more complete coverage. In particular, exome sequencing tends to suffer loss of coverage for first exons. In addition to imperfect analytic sensitivity, however, diagnostic sensitivity is limited by lack of knowledge of all pathogenic variants. In particular, pathogenic intronic and regulatory variants are under-represented in clinical databases, and, in contrast to exonic variants of uncertain significance, cannot not yet robustly assayed by in silico pathogenicity prediction tools.
It is interesting to speculate what the fastest time to diagnostic result might be with current WGS technology. Technically, a substantial reduction in sample preparation time from 7.5 h should be possible. With customized robotics, these pre-analytic steps should be feasible in 2 h. SBS should be possible in approximately 10 h with 2 × 50 cycles. In silico modeling suggests that analytic sensitivity and specificity for nucleotide variants would remain >95 % with such read lengths. Stranneheim et al. have described pulsed whole genome sequencing with analysis of results iteratively at 35, 50, 75, and 100 cycles . When combined with the DRAGEN system, there is the possibility of near real-time analysis of results whereby sequencing continues until a diagnosis is achieved. While further reductions in time to result may seem pedantic, sub-24 h time to result can be material since medical rounds typically occur once a day. Thus, return of results between 07:00–11:00 allows their significance to be discussed by the whole medical team when fresh. Off-hours results are returned to an on-duty physician who is likely to need specialist consultation.
An unsolved need for medical WGS26 is sample multiplexing, both to lower the cost of testing and to allow trios to be analyzed simultaneously. Sequencing of parent–infant trios is necessary for genetic disease diagnosis since the most common mechanism of causative mutations is de novo. WGS26 is performed one sample at a time (dual flowcells) at a reagent cost of $6,500, which is more than eight-fold greater than WGS on a HiSeq X. WGS26 sequencer depreciation is approximately $714 per genome at full capacity (350 genomes per year), compared with $137 on a HiSeq X. Technician cost is similar (approximately $70 per genome). The cost of computation and automated analysis varies widely with scale, but around a median of approximately $100 per genome. Interpretation and reporting is in the range of $70–$700 depending on the number and types of variants identified in a trio. Thus, cost is a significant barrier to broad adoption of WGS26, particularly given the mark-ups in price that are commonly employed in the US medical system to offset negotiated discounts or lack of payment. An attractive compromise between cost and speed is the HiSeq X configured to perform approximately 450 GB of 2 × 75 nt sequencing in a trio in 33 h on a single flowcell, for a total turnaround time of approximately 41 h. An alternative is rapid exome sequencing (WES), using the WGS26 software and hardware. With ongoing improvements in the hybridization kinetics of exome capture probes, and in the representation of all exons, a 36-h, 100X WES of three trios per $6,500 run should be feasible.
Finally, it is worth briefly mentioning the medical applications that currently may benefit from a 26-h, rather than a less costly 6-day, medical genome. These are applications that have a relatively high likelihood of guiding acute medical decisions in clinical situations where a delay is likely to result in significant morbidity or mortality. Currently, the best defined such application is in the differential diagnosis of certain single gene diseases. One example is maple syrup urine disease (MSUD, OMIM #248600), in which irritability and poor feeding typically occur within 48 h of delivery. Lethargy, intermittent apnea, opisthotonus, and stereotyped movements are evident by day of life 4. Diagnosis and institution of treatment before the onset of these neurologic signs significantly reduces the lifetime risk of mental illness and global functional impairment [32, 33]. Mass spectrometry of blood at 48 h of life is used to screen infants for MSUD in newborn screening programs. While typically positive in affected newborns, results may be delayed until after onset of neurologic signs or an initial screen may be falsely negative if the newborn has not fed appropriately after delivery.
Our initial clinical experience with rapid WGS involved 35 parent–infant trios [6, 11, 12]. All infants were acutely ill, aged less than 4 months at the time of enrollment, had a suspected genetic cause of disease, and lacked a molecular diagnosis. Clinical features in these infants were typically apparent at birth. Rapid WGS provided a genetic diagnosis in 20 (57 %) infants. In nine (45 %) infants receiving a diagnosis, the condition had not been considered in the differential diagnosis at the time of enrollment. Thirteen (65 %) diagnoses were noted to have acute clinical utility, and four (20 %) diagnoses had strongly favorable effects on management. However, six (30 %) diagnosed infants were started on palliative care and 120-day mortality was 57 %. A randomized, prospective clinical study of rapid WGS is now in progress to ascertain the extensibility of these results to broad NICU populations. Clearly, while the application of rapid WGS for NICU diagnosis of genetic disease appears tremendously promising, translating diagnoses into effective precision medicine is in its infancy.
Twenty-six-hour STATseq appears to be an appropriate strategy for acutely ill patients with potentially actionable genetic diseases. Having demonstrated improved analytic performance of version 2 STATseq, and time to result of 26 h, the next step is to retrospectively analyze the diagnostic yield of these methods, particularly in cases where no diagnostic diplotype was identified by conventional WGS.
Data and materials
The genomic sequence data for this study have been deposited in the database dbGAP with accession number phs000564. The CM-KC software described herein is in development for availability as freeware for research use.
The authors thank Kevin P. Hall, James Richardson, Stewart Macarthur, Paul Smith, Sean Humphray, Jacqueline C. Weir, Jason Betley, Zoya Kingsbury, Holly Duckworth, Russell J. Grocock, and Elliott Margulies of Illumina, Inc., Chesterford Research Park, Little Chesterford, Essex, UK, for development of SBS18 and for transferring it to CM-KC. We also thank Nhu Bui and Holly Zink for technical help, and Tom Wu for the provision of GSNAP. A deo lumen, ab amicis auxilium.
Supported by Children’s Mercy Hospital, Marion Merrell Dow Foundation, W.T. Kemper Foundation, Patton Trust, Pat & Gil Clements Foundation, Claire Giannini Foundation, National Institute of Child Health and Human Development and National Human Genome Research Institute grant U19HD077693, National Institute of Diabetes and Digestive and Kidney Diseases grant R01DK091823, and in-kind support from Illumina, Inc.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Beaudet AL. 1998 ASHG presidential address. Making genomic medicine a reality. Am J Hum Genet. 1999;64:1–13.PubMed CentralView ArticlePubMed
- Green ED, Guyer MS. Charting a course for genomic medicine from base pairs to bedside. Nature. 2011;470:204–13.View ArticlePubMed
- McCarthy JJ, McLeod HL, Ginsburg GS. Genomic medicine: a decade of successes, challenges, and opportunities. Sci Transl Med. 2013;5:189sr4.View ArticlePubMed
- Hudson TJ. Genome variation and personalized cancer medicine. J Intern Med. 2013;274:440–50.View ArticlePubMed
- Tran B, Brown AM, Bedard PL, Winquist E, Goss GD, Hotte SJ, et al. Feasibility of real time next generation sequencing of cancer genes linked to drug response: results from a clinical trial. Int J Cancer. 2013;132:1547–55.View ArticlePubMed
- Soden SE, Saunders CJ, Willig LK, Farrow EG, Smith LD, Petrikin JE, et al. Effectiveness of exome and genome sequencing guided by acuity of illness for diagnosis of neurodevelopmental disorders. Sci Transl Med. 2014;6:265ra168.PubMed CentralView ArticlePubMed
- Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA. 2014;312:1870–9.PubMed CentralView ArticlePubMed
- Lee H, Deignan JL, Dorrani N, Strom SP, Kantarci S, Quintero-Rivera F, et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA. 2014;312:1880–7.PubMed CentralView ArticlePubMed
- Dixon-Salazar TJ, Silhavy JL, Udpa N, Schroth J, Bielas S, Schaffer AE, et al. Exome sequencing can improve diagnosis and alter patient management. Sci Transl Med. 2012;4:138ra78.PubMed CentralView ArticlePubMed
- Srivastava S, Cohen JS, Vernon H, Barañano K, McClellan R, Jamal L, et al. Clinical whole exome sequencing in child neurology practice. Ann Neurol. 2014;76:473–83.View ArticlePubMed
- Willig LK, Petrikin JE, Smith LD, Saunders CJ, Thiffault I, Miller NA, et al. Whole-genome sequencing for identification of Mendelian disorders in critically ill infants: a retrospective analysis of diagnostic and clinical findings. Lancet Respir Med. 2015;5:377–87.View Article
- Saunders CJ, Miller NA, Soden SE, Dinwiddie DL, Noll A, Alnadi NA, et al. Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci Transl Med. 2012;4:154ra135.PubMed CentralView ArticlePubMed
- Priest JR, Ceresnak SR, Dewey FE, Malloy-Walton LE, Dunn K, Grove ME, et al. Molecular diagnosis of long QT syndrome at 10 days of life by rapid whole genome sequencing. Heart Rhythm. 2014;11:1707–13.View ArticlePubMed
- Kohler S, Doelken SC, Rath A, Ayme S, Robinson PN. Ontological phenotype standards for neurogenetics. Hum Mutat. 2012;33:1333–9.View ArticlePubMed
- Zemojtel T, Kohler S, Mackenroth L, Jäger M, Hecht J, Krawitz P, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med. 2014;6:252ra123.PubMed CentralView ArticlePubMed
- Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), 13 July 2015. Available at: http://omim.org.
- Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, Rajan D, et al. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources. Am J Hum Genet. 2009;84:524–33.PubMed CentralView ArticlePubMed
- Maddalena A, Bale S, Das S, Grody W, Richards S, ACMG Laboratory Quality Assurance Committee. Technical standards and guidelines: molecular genetic testing for ultra-rare disorders. Genet Med. 2005;7:571.View ArticlePubMed
- Richards CS, Bale S, Bellissimo DB, Das S, Grody WW, Hegde MR, et al. ACMG recommendations for standards for interpretation and reporting of sequence variations: Revisions 2007. Genet Med. 2008;10:294.View ArticlePubMed
- Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17:405–24.View ArticlePubMed
- Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al. ACMG clinical laboratory standards for next-generation sequencing. Genet Med. 2013;15:733–47.PubMed CentralView ArticlePubMed
- Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26:873–81.PubMed CentralView ArticlePubMed
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.PubMed CentralView ArticlePubMed
- McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010;26:2069.PubMed CentralView ArticlePubMed
- Stenson PD, Ball EV, Howells K, Phillips AD, Mort M, Cooper DN. The Human Gene Mutation Database: providing a comprehensive central mutation database for molecular diagnostics and personalized genomics. Hum Genomics. 2009;4:69.PubMed CentralView ArticlePubMed
- Dreszer TR, Karolchik D, Zweig AS, Hinrichs AS, Raney BJ, Kuhn RM, et al. The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res. 2012;40:D918.PubMed CentralView ArticlePubMed
- Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178.PubMed CentralView ArticlePubMed
- DePristo M, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8.PubMed CentralView ArticlePubMed
- O’Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013;5:28.PubMed CentralView ArticlePubMed
- Weiner J, Sharma J, Kilbride H, Lantos J. How infants die in the neonatal intensive care unit: trends from 1999 through 2008. Arch Pediatr Adolesc Med. 2011;165:630–4.View ArticlePubMed
- Stranneheim H, Engvall M, Naess K, Lesko N, Larsson P, Dahlberg M, et al. Rapid pulsed whole genome sequencing for comprehensive acute diagnostics of inborn errors of metabolism. BMC Genomics. 2014;15:1090.PubMed CentralView ArticlePubMed
- Muelly ER, Moore GJ, Bunce SC, Mack J, Bigler DC, Morton DH, et al. Biochemical correlates of neuropsychiatric illness in maple syrup urine disease. J Clin Invest. 2013;123:1809–20.PubMed CentralView ArticlePubMed
- Strauss KA, Puffenberger EG, Morton DH. One community’s effort to control genetic disease. Am J Public Health. 2012;102:1300–6.PubMed CentralView ArticlePubMed
- Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J. The mosaic genome of warm-blooded vertebrates. Science. 1985;228:953–8.View ArticlePubMed
- Cuny G, Soriano P, Macaya G, Bernardi G. The major components of the mouse and human genomes. 1. Preparation, basic properties and compositional heterogeneity. Eur J Biochem. 1981;115:227–33.View ArticlePubMed
- 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.View Article