Study design, setting, and participants
This study was approved by the Institutional Review Board (IRB) at Children’s Mercy – Kansas City (CM-KC) and conformed to the Declaration of Helsinki. Participants were principally parent–child trios enrolled in a research biorepository who received WGS in addition to standard diagnostic testing for monogenic disorders of unknown etiology in affected infants [6, 11, 12]. Affected infants with suspected genetic disorders were nominated by a treating physician, typically a neonatologist. A standard form requesting the primary signs and symptoms, past diagnostic test results, differential diagnosis or candidate genes, pertinent family history, availability of biological parents for enrollment, and whether rapid WGS could potentially affect treatment was submitted for immediate evaluation by a team of experts at the Center for Pediatric Genomic Medicine (CPGM) at CM-KC. Infants received rapid WGS if the likely diagnosis was of a type detectable by next-generation sequencing and had any potential to alter management or genetic counseling. Patients were not required to undergo standardized clinical examinations or diagnostic testing prior to referral; standard etiologic testing was performed as clinically indicated. Infants likely to have disorders associated with cytogenetic abnormalities were not accepted unless standard testing for those disorders was negative. Written informed consent was obtained from parents. Retrospective samples UDT103 and UDT173 were blinded validation samples with known molecular diagnoses for a genetic disease [12]. Reference sample NA12878 was obtained from the Coriell Institute repository.
Ascertainment of clinical features
The clinical features of affected infants were ascertained comprehensively by physician and family interviews and review of the medical record. Phenotypic features were translated into Human Phenotype Ontology (HPO) terms and mapped to 6,000 genetic diseases with the clinicopathologic correlation tools Phenomizer and/or Symptom and Sign Associated Genome Analysis (SSAGA) [6, 11, 14, 15]. Briefly, Phenomizer uses term-similarity measures to calculate a similarity score between the query HPO terms entered by the user and the terms used to annotate diseases in HPO. It then assigns a P value by statistically comparing the similarity score obtained for the specific set of phenotypic terms entered with the distribution of similarity scores obtained using randomly chosen HPO term combinations. The P value is then used to rank diseases in the differential genetic diagnosis. The Phenomizer differential genetic diagnosis is exported as a tab-separated values file. Diagnoses without known causative genes are removed. Where the likely inheritance pattern is apparent, the Phenomizer output is limited to the appropriate inheritance mode. Where Phenomizer reports many equally scoring diseases, the differential diagnosis is performed using several different but overlapping term sets, for example with a key feature listed as mandatory rather than observed, or with removal of clinical features felt likely to represent a second, unrelated disorder. Finally, the Phenomizer list may be pruned to 100–250 entries, if necessary, based on manual inspection of the fit of diseases to the clinical features at various P value cutoffs.
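A minimal Python sketch of the empirical P value ranking described above; the set-overlap score is a toy stand-in for Phenomizer’s semantic term-similarity measure, and all names and defaults are illustrative:

```python
import random

def similarity(query_terms, disease_terms):
    """Toy stand-in for Phenomizer's semantic term-similarity score."""
    query = set(query_terms)
    return len(query & set(disease_terms)) / len(query)

def empirical_p_value(query_terms, disease_terms, all_hpo_terms, n_random=1000):
    """P value: how often random term sets of equal size score at least as well."""
    observed = similarity(query_terms, disease_terms)
    null_scores = [
        similarity(random.sample(all_hpo_terms, len(query_terms)), disease_terms)
        for _ in range(n_random)
    ]
    return sum(s >= observed for s in null_scores) / n_random

# Diseases are then ranked by ascending P value to form the differential diagnosis.
```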
Similarly, SSAGA is web-based software that facilitates entry of HPO terms related to the clinical features observed in an individual patient (Additional file 1: Figure S1). SSAGA provides a differential diagnosis limited to all Online Mendelian Inheritance in Man (OMIM), Orphanet, and DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) disease entries that match at least one entered feature [16, 17]. Diseases in the differential genetic diagnosis can be ranked by the number of matching terms entered.
Genome sequencing and quality control
DNA isolation from peripheral blood was automated on a Chemagic MSM I instrument equipped with liquid dispensing manifolds (Perkin Elmer, Baesweiler, Germany). Briefly, a 24-well head is used to isolate DNA from 1.8 mL of blood per sample. The system is fully enclosed to comply with biosafety standards, and isolation is completed in approximately 2 h, yielding an average of approximately 40 μg of DNA per mL of blood.
For 18-h WGS performed in Essex, isolated genomic DNA was prepared using a modification of the standard Illumina TruSeq sample preparation. Briefly, DNA was sheared using a Covaris S2 Biodisruptor, end-repaired, A-tailed, and adaptor-ligated. PCR was omitted. Libraries were purified using SPRI beads (Beckman Coulter). For 18-h WGS, the amount of input DNA was optimized, based on experience with varying the input from representative DNA samples, allowing selection of a concentration that produced a known cluster density after the library was denatured with 0.1 M NaOH and presented to the flowcell.
At CM-KC, genomic DNA was prepared for WGS using either TruSeq PCR Free (Illumina) or KAPA HYPER (KAPA Biosystems) without PCR amplification, according to the manufacturers’ protocols. HYPER library preparation without PCR is completed in 90 min, with an additional 90 min allotted for QC. Briefly, 2 μg of DNA was sheared with a Covaris S2 Biodisruptor, end-repaired, A-tailed, and adaptor-ligated. Quantitation of libraries was carried out by real-time PCR.
Samples for rapid WGS were each loaded onto one or two flowcells and sequenced on Illumina HiSeq 2500 instruments using HiSeq Rapid SBS v1 chemistry, either in rapid run mode (SBS26) or with customized, faster flowcell scanning times (SBS18). Cluster generation, followed by 2 × 101 cycle sequencing reads separated by the paired-end turnaround, was performed automatically on the instrument. Prior to SBS18 optimization, tiles failing quality metrics were manually excluded before further analysis. Genome sequencing was performed as research, largely in a manner that complies with routine diagnostic testing as defined by the Clinical Laboratory Improvement Amendments (CLIA) [18–21]. However, genome sequencing was not performed herein as a CLIA laboratory-developed test.
Next-generation sequencing analysis
Sequence data were generated with Illumina RTA 1.12.4.2 and CASAVA-1.8.2 and aligned to the human reference GRCh37.p5 using GSNAP [22]; nucleotide (nt) variants were detected and genotyped with the Genome Analysis Toolkit (GATK, versions 1.6 and 3.2) [23]. Sequence analysis used FASTQ, BAM, and VCF files. The largest deletion variant detected was 263 nt, and the largest insertion was 469 nt.
Variants were annotated with the Rapid Understanding of Nucleotide variant Effect Software (RUNES, v3.3.5) [6, 11, 12]. RUNES incorporates data from ENSEMBL’s Variant Effect Predictor (VEP) software [24], compares variants with NCBI dbSNP and with known disease variants from the Human Gene Mutation Database [25], and performs additional in silico prediction of variant consequences using RefSeq and ENSEMBL gene annotations [26]. RUNES categorized each variant according to ACMG recommendations for reporting sequence variation [18–21] and assigned a minor allele frequency (MAF) derived from CPGM’s Variant Warehouse database of approximately 90 million variants from approximately 3,900 individuals [6, 11, 12]. Category 1 variants had previously been reported to be disease-causing. Category 2 variants had not previously been reported to be disease-causing but were of types expected to be pathogenic (loss of initiation, premature stop codon, disruption of stop codon, whole gene deletion, frameshifting indel, disruption of splicing). Category 3 variants were of unknown significance but potentially disease-causing (non-synonymous substitution, in-frame indel, disruption of the polypyrimidine tract, overlap with 5’ exonic, 5’ flank, or 3’ exonic splice contexts). Category 4 variants were probably not causative of disease (synonymous variants unlikely to create a cryptic splice site, intronic variants >20 nt from the intron/exon boundary, and variants commonly observed in unaffected individuals). Category 5 variants were known to be benign. All variants, together with their RUNES annotations, are stored in a queryable warehouse database (Additional file 2: Figure S2). Input to the RUNES pipeline is a genomic variant file (.vcf or .gvcf); the pipeline produces a JSON document that is used as input to the VIKING interpretation tool.
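A minimal Python sketch of the five-way categorization described above, assuming each variant is a dict of RUNES-style annotations; the consequence strings and field names are illustrative, not RUNES’s actual schema:

```python
EXPECTED_PATHOGENIC = {   # Category 2 consequence types
    "loss_of_initiation", "premature_stop", "stop_disruption",
    "whole_gene_deletion", "frameshift_indel", "splice_disruption",
}
POSSIBLY_PATHOGENIC = {   # Category 3 consequence types
    "nonsynonymous", "inframe_indel", "polypyrimidine_tract",
    "exonic_splice_context",
}

def acmg_category(variant):
    """Assign a RUNES-style category from annotations on a variant dict."""
    if variant.get("known_disease_causing"):   # e.g. an HGMD disease variant match
        return 1
    if variant.get("known_benign"):
        return 5
    if variant["consequence"] in EXPECTED_PATHOGENIC:
        return 2
    if variant["consequence"] in POSSIBLY_PATHOGENIC:
        return 3
    return 4   # synonymous, deep intronic (>20 nt), or common in unaffected individuals
```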
DRAGEN
The DRAGEN pipeline operates on a single-server hybrid hardware/software platform with dual Intel Xeon central processing units (CPUs) and a custom Peripheral Component Interconnect Express (PCIe) board bearing a field-programmable gate array (FPGA) and 32 GB of dynamic random-access memory (DRAM) attached directly via four double data rate type three synchronous dynamic random-access memory (DDR3 SDRAM) channels. Critical compute-intensive functions of the pipeline are performed by custom massively parallel FPGA logic for maximum speed, while other functions run in optimized multi-threaded software on the Xeon cores for maximum flexibility. A parallel (redundant array of independent disks, RAID 0) solid-state drive (SSD) file system provides the I/O bandwidth necessary to feed the processing pipeline, and FPGA compression/decompression engines maintain throughput to and from compressed file formats.
DRAGEN read mapping/alignment
DRAGEN uses a hash table index of a reference genome to map many overlapping seeds from each read to exact matches in the reference. The hash table is constructed from any chosen reference with a multi-threaded tool, in as little as 6 min for a whole human genome, and loaded into the FPGA-board DRAM prior to mapping operations. The entire read mapping process is performed by custom FPGA logic, with software layers streaming unaligned reads from FASTQ (or Illumina BCL) files to the PCIe board via direct memory access (DMA), and simultaneously streaming aligned read records back into host memory for BAM output and/or variant calling.
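A minimal Python sketch of the indexing-and-seeding idea (exact-match k-mer lookup only; reverse-complement handling and DRAGEN’s on-disk hash format are omitted, and all names are illustrative):

```python
from collections import defaultdict

def build_index(reference, k=21):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_seeds(read, index, k=21, step=2):
    """Extract overlapping seeds (every even offset, 50% density) and look them up."""
    hits = []
    for offset in range(0, len(read) - k + 1, step):
        for ref_pos in index.get(read[offset:offset + k], []):
            # Candidate alignment diagonal is ref_pos - offset.
            hits.append((offset, ref_pos))
    return hits
```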
DRAGEN’s hash-based mapping uses a novel dynamic seed extension method: when a primary seed (default 21 nt) matches more than a maximum number of reference locations (default 16), longer seeds from all these reference positions are populated into the hash table, such that specific extended seed sequences will match fewer reference locations. Seeds are extended symmetrically, up to 64 nt in each direction, to a maximum of 149 nt from a 21 nt primary seed. Long seed extensions are done in multiple short increments, averaging 3–4 nt in each direction, with different extended seed patterns terminating at different extension lengths, as needed to match no more than the maximum number of reference positions.
When a hash table query is made for a common primary seed, a single EXTEND record is retrieved, indicating the number of additional read bases to join onto the seed in each direction. The additional bases are hashed (along with a unique identifier for the pre-extended seed), and another hash table query is made, which may return yet another EXTEND record, iteratively. When an adequate extended seed length is achieved, the hash table query retrieves a list of up to the maximum number of matching reference positions and orientations. This iterative seed extension method yields results similar to incremental suffix-tree or Burrows-Wheeler mapping but with dramatically fewer index memory accesses, which is critical to DRAGEN’s mapping speed.
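A Python sketch of the build and query sides of this scheme. A fixed per-side extension increment (EXT_STEP) stands in for DRAGEN’s variable per-seed increments, keys are (seed sequence, extension per side) rather than a hardware hash, and hits whose extension window runs off the reference are simply dropped; all of these are simplifying assumptions:

```python
from collections import defaultdict

K, MAX_HITS, EXT_STEP, MAX_SEED = 21, 16, 4, 149   # defaults described above

def build_table(ref):
    """Hash table whose entries are ('HITS', positions) or ('EXTEND', step)."""
    buckets = defaultdict(list)
    for pos in range(len(ref) - K + 1):
        buckets[ref[pos:pos + K]].append(pos)   # positions of each primary seed
    table = {}
    stack = [(seed, hits, 0) for seed, hits in buckets.items()]
    while stack:
        seed, hits, ext = stack.pop()
        if len(hits) <= MAX_HITS or K + 2 * (ext + EXT_STEP) > MAX_SEED:
            table[(seed, ext)] = ("HITS", hits)
            continue
        table[(seed, ext)] = ("EXTEND", EXT_STEP)
        longer = defaultdict(list)
        for pos in hits:   # extend symmetrically around the primary seed
            lo, hi = pos - ext - EXT_STEP, pos + K + ext + EXT_STEP
            if lo >= 0 and hi <= len(ref):
                longer[ref[lo:hi]].append(pos)
        for long_seed, long_hits in longer.items():
            stack.append((long_seed, long_hits, ext + EXT_STEP))
    return table

def query(read, offset, table):
    """Iteratively extend a seed until a HITS record (or a dead end) is reached."""
    ext = 0
    while True:
        lo, hi = offset - ext, offset + K + ext
        if lo < 0 or hi > len(read):
            return []               # read too short for the required extension
        entry = table.get((read[lo:hi], ext))
        if entry is None:
            return []               # extended seed absent from the reference
        kind, value = entry
        if kind == "HITS":
            return value
        ext += value                # EXTEND: fetch more read bases and re-query
```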
In FPGA logic, read pairs are load-balanced over several DRAGEN map/align engines. Each engine extracts many overlapping seeds from each read, by default starting at every even offset (50 % density). These are mapped by DRAM hash table queries, each to zero or more reference positions, with forward or reverse-complemented orientation determined for each match. Together, the engines nearly saturate the four local DDR3 interfaces with hash bucket read bursts and reference sequence fetches for alignment. Matches along similar alignment diagonals are grouped into seed chains, which are conservatively filtered; by default, a short seed chain can be filtered out if another seed chain at least four times longer mostly overlaps it in the read.
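A sketch of the chain filter, assuming each chain is summarized by its read-coordinate span; the 50 % reading of "mostly overlaps" is an assumption:

```python
def filter_chains(chains, ratio=4, overlap_frac=0.5):
    """Drop a seed chain if a much longer chain mostly overlaps it in the read."""
    kept = []
    for i, (start, end) in enumerate(chains):
        length = end - start
        dominated = any(
            j != i
            and (o_end - o_start) >= ratio * length          # at least 4x longer
            and min(end, o_end) - max(start, o_start) >= overlap_frac * length
            for j, (o_start, o_end) in enumerate(chains)
        )
        if not dominated:
            kept.append((start, end))
    return kept
```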
Lists of seed chains from paired end reads are examined to detect pairs with appropriate insert sizes and orientations. For each unpaired seed chain, a rescue scan may be executed to search for the mate within the expected insert window; mate K-mer matches within a configurable Hamming distance (the number of positions at which the corresponding nucleotides are different) result in new candidate positions being added to the list of seed chains. Each seed chain or rescue match is then extended by gapless local alignment, permitting single nucleotide variants (SNVs) and clipping but not nucleotide insertions and deletions (indels). The collection of gapless alignment results for each read is analyzed by heuristics, to judge for which ones Smith-Waterman gapped alignment would have a non-trivial likelihood of improving the overall read pair results.
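A sketch of the rescue scan for an unpaired chain; the probe length and Hamming threshold are illustrative assumptions, and only the mate’s leading k-mer is probed:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def rescue_scan(mate_seq, ref, window_start, window_end, k=32, max_dist=4):
    """Scan the expected-insert window for k-mer matches of the mate."""
    probe = mate_seq[:k]
    candidates = []
    for pos in range(max(0, window_start), min(window_end, len(ref) - k + 1)):
        if hamming(probe, ref[pos:pos + k]) <= max_dist:
            candidates.append(pos)   # each becomes a new candidate seed chain
    return candidates
```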
Each Smith-Waterman aligner uses an array of 56 parallel scoring cells, virtually arranged into an anti-diagonal wavefront, which steps one position horizontally or vertically each clock cycle. The wavefront scores a generally diagonal swath of cells through the alignment rectangle but steers automatically to re-center the best alignment path after indel events. Back-trace from the maximum scoring cell runs simultaneously with the following alignment operation, yielding a CIGAR string, which indicates soft clipping and indel positions.
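A software sketch of the affine-gap Smith-Waterman recurrences and CIGAR back-trace that the hardware wavefront evaluates; the scoring constants are illustrative, and the full dynamic-programming matrices are kept in memory rather than swept by a wavefront:

```python
from itertools import groupby

MATCH, MISMATCH, GAP_OPEN, GAP_EXT = 2, -3, -5, -1   # illustrative scores

def smith_waterman(read, ref):
    """Affine-gap local alignment returning (score, CIGAR with soft clips)."""
    NEG = float("-inf")
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0.0] * cols for _ in range(rows)]   # best local score ending at (i, j)
    E = [[NEG] * cols for _ in range(rows)]   # paths ending in a ref-consuming gap (D)
    F = [[NEG] * cols for _ in range(rows)]   # paths ending in a read-consuming gap (I)
    best, bi, bj = 0.0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(H[i][j - 1] + GAP_OPEN, E[i][j - 1] + GAP_EXT)
            F[i][j] = max(H[i - 1][j] + GAP_OPEN, F[i - 1][j] + GAP_EXT)
            diag = H[i - 1][j - 1] + (MATCH if read[i - 1] == ref[j - 1] else MISMATCH)
            H[i][j] = max(0.0, diag, E[i][j], F[i][j])
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Back-trace from the maximum-scoring cell, tracking which matrix we are in.
    ops, state, i, j = [], "H", bi, bj
    while i > 0 and j > 0 and not (state == "H" and H[i][j] == 0.0):
        if state == "H":
            if H[i][j] == E[i][j]:
                state = "E"
            elif H[i][j] == F[i][j]:
                state = "F"
            else:
                ops.append("M"); i -= 1; j -= 1
        elif state == "E":   # run of reference bases absent from the read
            ops.append("D")
            state = "H" if E[i][j] == H[i][j - 1] + GAP_OPEN else "E"
            j -= 1
        else:                # run of read bases absent from the reference
            ops.append("I")
            state = "H" if F[i][j] == H[i - 1][j] + GAP_OPEN else "F"
            i -= 1
    ops.reverse()
    cigar = "".join(f"{len(list(g))}{op}" for op, g in groupby(ops))
    if i > 0:
        cigar = f"{i}S" + cigar            # soft-clip the unaligned read prefix
    if len(read) - bi > 0:
        cigar += f"{len(read) - bi}S"      # soft-clip the unaligned read suffix
    return best, cigar
```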
All gapped and gapless alignment results are compared to obtain best and second-best scores. For paired ends, pair scores are computed, each as the sum of the two alignment scores minus a pairing penalty based on the deviation from the empirical mean insert size; the best-scoring pair is reported. The quality of read mapping (MAPQ) is estimated primarily in proportion to the difference between the best and second-best scores, with the proportionality coefficient varying by read length; second-order factors, such as the number of scores very close to the second-best, are also considered. When the best alignment does not cover the full read, up to three supplementary (chimeric) alignments are optionally reported for other segments of the read.
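The pairing penalty and MAPQ proportionality are described above only qualitatively; the functional forms and constants below are assumptions for illustration, not DRAGEN’s actual model:

```python
def pair_score(score1, score2, insert, mean_insert, sigma=100.0, weight=1.0):
    """Sum of alignment scores minus a penalty growing with insert-size deviation."""
    return score1 + score2 - weight * abs(insert - mean_insert) / sigma

def estimate_mapq(best, second_best, read_len, c=0.4, cap=60):
    """MAPQ roughly proportional to the best vs. second-best score gap."""
    coeff = c * 100.0 / read_len   # proportionality coefficient varies with read length
    return min(cap, max(0, int(coeff * (best - second_best))))
```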
DRAGEN sorting and duplicate marking
After mapping, reads are sorted by reference position; PCR or optical duplicates are optionally flagged. An initial sorting phase operates on aligned reads returning from the FPGA. Final sorting and duplicate marking commences when mapping completes; these operations overlap variant calling when the latter is requested, and add almost zero time to the FASTQ-to-VCF pipeline.
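A sketch of position-based duplicate flagging; the Picard-style grouping key (chromosome, unclipped 5’ position, strand) and the base-quality-sum tiebreak are common conventions used here as stand-ins for DRAGEN’s duplicate definition, and optical-duplicate detection is omitted:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Flag all but the highest-quality read at each 5' position and strand."""
    groups = defaultdict(list)
    for r in reads:   # each read is a dict with illustrative field names
        groups[(r["chrom"], r["unclipped_5prime_pos"], r["strand"])].append(r)
    for group in groups.values():
        group.sort(key=lambda r: sum(r["base_qualities"]), reverse=True)
        for dup in group[1:]:
            dup["is_duplicate"] = True   # keep only the best-supported copy
    return reads
```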
DRAGEN variant calling
The DRAGEN variant caller runs mostly in highly optimized software, for maximum flexibility of the algorithms. Only stable, compute-intensive operations are accelerated by FPGA engines. DRAGEN implements multi-threaded parallelism in a single pass over the whole reference genome, without launching multiple caller processes on various subsets of the reference. A single call to the DRAGEN executable runs the entire pipeline from FASTQ to VCF, for the whole genome. Mapping/alignment is done in one pass over the reads, and all steps of variant calling (in addition to read sorting and duplicate marking) run simultaneously in a software/hardware pipeline emitting VCF results.
First, callable reference regions with sufficient aligned coverage are identified. Within these, a fast scan of the sorted reads identifies active regions, centered on pileup columns with non-trivial evidence of a variant and padded with enough context to cover significant non-reference content nearby, with extra-wide padding where there is evidence of indels.
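A sketch of the active-region scan; the thresholds and padding width are illustrative defaults, overlapping padded regions are not merged, and indel-triggered extra padding is omitted:

```python
def find_active_regions(pileup, min_evidence=0.15, min_depth=5, pad=50):
    """Mark columns with a non-trivial non-reference fraction, then pad them.

    `pileup` is a list of (depth, nonref_count) per reference position."""
    active = [
        depth >= min_depth and nonref / depth >= min_evidence
        for depth, nonref in pileup
    ]
    regions, start = [], None
    for pos, flag in enumerate(active):
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            regions.append((max(0, start - pad), min(len(active), pos + pad)))
            start = None
    if start is not None:
        regions.append((max(0, start - pad), len(active)))
    return regions
```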
Aligned reads are clipped within each active region and assembled into a De Bruijn graph using the reference sequence as a backbone, with edges weighted by observation counts. If the graph is degenerate, it is reconstructed using longer K-mers. After some graph cleanup and simplification, all source-to-sink paths are extracted as candidate haplotypes, up to a limit (default 128); if this cap must be enforced, higher-weight paths are preferred. Each haplotype is Smith-Waterman aligned back to the reference genome to identify the variants it represents, and re-synchronized with read alignments.
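A sketch of the assembly step, where nodes are (k−1)-mers, `source` and `sink` are the flanking reference (k−1)-mers, and graph cleanup is omitted; field names and the weight-ordered depth-first search are illustrative:

```python
from collections import defaultdict

def build_graph(reads, reference, k=25):
    """Edge-weighted De Bruijn graph; the reference backbone contributes weight 1."""
    edges = defaultdict(int)
    for seq in [reference] + list(reads):
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            edges[(km[:-1], km[1:])] += 1
    return edges

def candidate_haplotypes(edges, source, sink, max_paths=128):
    """Enumerate source-to-sink paths, exploring heavier edges first."""
    adj = defaultdict(list)
    for (u, v), w in edges.items():
        adj[u].append((w, v))
    for u in adj:
        adj[u].sort(reverse=True)   # prefer higher-weight paths under the cap
    paths = []
    def dfs(node, seq, seen):
        if len(paths) >= max_paths:
            return
        if node == sink:
            paths.append(seq)
            return
        for _, nxt in adj[node]:
            if nxt not in seen:   # a revisit means a cycle: rebuild with longer k
                dfs(nxt, seq + nxt[-1], seen | {nxt})
    dfs(source, source, {source})
    return paths
```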
Then for each read-haplotype pair, the probability P(r|H) of observing the read, assuming the haplotype was the true starting sample, is estimated using a pair hidden Markov model (HMM). Since the haplotype is assumed true, only errors in sample preparation and sequencing are considered. Essentially, the probabilities of all possible alignments (edit combinations) of the read to the haplotype are calculated and summed, using a dynamic programming matrix very similar to affine-gap Smith-Waterman, except summing rather than maximizing path probabilities. At each row (read position) in the matrix, mismatch probabilities are taken from base quality scores and MAPQ, and gap probabilities are derived from a PCR error model sensitive to repetitive sequence content.
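A textbook pair-HMM forward pass in Python that computes P(r|H) as described; the flat gap probabilities stand in for DRAGEN’s PCR-aware, repeat-sensitive gap model, MAPQ is not folded into the emissions, and the free-start/free-end boundary conditions follow the common GATK-style convention:

```python
def pair_hmm_forward(read, quals, hap, gap_open=1e-4, gap_ext=0.1):
    """P(r|H): sum of the probabilities of all alignments of read to haplotype."""
    m, n = len(read), len(hap)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]   # match/mismatch state
    I = [[0.0] * (n + 1) for _ in range(m + 1)]   # insertion (read-only) state
    D = [[0.0] * (n + 1) for _ in range(m + 1)]   # deletion (haplotype-only) state
    for j in range(n + 1):
        D[0][j] = 1.0 / n        # the read may start anywhere on the haplotype
    for i in range(1, m + 1):
        err = 10.0 ** (-quals[i - 1] / 10.0)      # base error from quality score
        for j in range(1, n + 1):
            emit = (1.0 - err) if read[i - 1] == hap[j - 1] else err / 3.0
            M[i][j] = emit * ((1 - 2 * gap_open) * M[i - 1][j - 1]
                              + (1 - gap_ext) * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            I[i][j] = gap_open * M[i - 1][j] + gap_ext * I[i - 1][j]
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1]
    # The read must be fully consumed; it may end anywhere on the haplotype.
    return sum(M[m][j] + I[m][j] for j in range(1, n + 1))
```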
This pair-HMM calculation is the most expensive step and is, therefore, accelerated in custom FPGA logic. Reads and haplotypes to be compared are queued up for HMM processing by software threads completing previous steps and are sent to the FPGA by DMA. A load balancer distributes work over more than 100 small HMM engines, each of which is pipelined to calculate all three probabilities (for the match, insert, and delete states) for one matrix cell per clock cycle. Calculated P(r|H) values are returned by DMA to host memory, where they are picked up by downstream software threads.
Scanning by reference position over the active region, candidate genotypes are formed from diploid combinations of variant events (SNVs or indels) observed in the earlier Smith-Waterman alignments of the haplotypes to the reference. For each event (including reference), the conditional probability P(r|e) of observing each overlapping read is estimated as the maximum P(r|H) for haplotypes supporting the event. These are multiplied to yield the conditional probability P(R|e) of observing the whole read pileup, and using Bayes’ formula, the posterior probability P(e1e2|R) of each diploid genotype (diplotype) is calculated, and the winner is called.
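A sketch of the genotyping step; the text specifies P(r|e) as a max over supporting haplotypes and a Bayesian posterior over diplotypes, but leaves the diploid combination implicit, so the equal per-read mixture below is one common convention, and all containers and the prior are illustrative:

```python
import math

def diplotype_posteriors(reads, read_liks, event_support, prior):
    """Posterior P(e1 e2 | R) over diploid event pairs.

    read_liks[r][h] = P(r|H); event_support[e] = haplotypes carrying event e;
    prior[(e1, e2)] = genotype prior (assumed strictly positive)."""
    def p_read_given_event(r, e):
        return max(read_liks[r][h] for h in event_support[e])   # best support
    log_lik = {}
    for (e1, e2), pr in prior.items():
        ll = math.log(pr)
        for r in reads:
            # Diploid mixture: the read came from either allele with equal chance.
            ll += math.log(0.5 * p_read_given_event(r, e1)
                           + 0.5 * p_read_given_event(r, e2))
        log_lik[(e1, e2)] = ll
    top = max(log_lik.values())
    weights = {g: math.exp(v - top) for g, v in log_lik.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}   # call the argmax diplotype
```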
VIKING
Causative variants were identified primarily with Variant Integration and Knowledge INterpretation in Genomes (VIKING) software (Additional file 2: Figure S2 and Additional file 3: Figure S3) [6, 11]. Inputs for VIKING were the annotated genomic variant file produced by the RUNES pipeline and a SSAGA or Phenomizer record comprising the clinical features of the affected patient, the corresponding diseases in the differential diagnosis, and the respective disease genes (Additional file 1: Figure S1) [6, 11, 14, 15]. The SSAGA or Phenomizer record was created during the laboratory steps of WGS26. Alternatively, a menu of pre-determined candidate gene lists, such as genes with OMIM records or genes previously associated with mitochondrial disorders, can be used to filter variants in VIKING. VIKING integrated the superset of relevant disease mappings and annotated variant genotypes. By allowing dynamic filtering of variants based on variables such as individual clinical features, diseases, genes, assigned ACMG-type pathogenicity category, allele frequency, genotype, and inheritance pattern, VIKING assists in identification of a differential diagnosis. VIKING settings can be saved, allowing configurations that enable a provisional molecular diagnosis to be reached in as little as seconds. VIKING also allowed data mark-up, sessions to be saved, and export of fields in formats suitable for inclusion in diagnostic reports.
In a typical interpretation session, variants were filtered by limitation to ACMG Categories 1–3 and MAF <1 %, <0.5 %, or <0.1 %, or to those unique to the proband or the family, depending on the clinical impression (Additional file 3: Figure S3). All potential monogenic inheritance patterns were examined, including de novo, recessive, dominant, X-linked, mitochondrial, and, where possible, somatic variation. Where a single likely causative heterozygous variant for a recessive disorder was identified, the entire coding region was manually inspected using the Integrative Genomics Viewer (IGV) for coverage and additional variants, as were variants at that locus called in the appropriate parent that may have had low coverage in the proband [27]. VIKING featured link-outs to IGV that were refreshed in a trio on a variant-by-variant basis, allowing rapid examination of the pattern of inheritance, quality of alignment, and local sequence features (such as simple sequence repeats). Expert interpretation and literature curation were performed for all likely causative variants with regard to evidence for pathogenicity. VIKING featured link-outs to a warehouse of approximately 90 million variants from approximately 3,900 individuals, OMIM, HGNC, HGMD, Entrez Gene, ENSEMBL, and pathways information, facilitating rapid literature curation (Additional file 2: Figure S2). Analysis was performed sequentially by two experts. Sanger sequencing was used for clinical confirmation of all diagnostic genotypes. Reporting was performed by an ACMG fellow laboratory director who was an expert in WGS analysis of single gene diseases. Additional expert consultation and functional confirmation were performed when the subject’s phenotype differed from previous mutation reports for that disease gene.
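A minimal Python sketch of such a filtering pass; the variant dict fields, genotype encodings, and inheritance tests are illustrative stand-ins for VIKING’s interface, and compound heterozygotes would additionally require gene-level pairing of variants:

```python
def filter_variants(variants, trio, max_category=3, max_maf=0.01,
                    inheritance="de_novo"):
    """Filter annotated variants by ACMG-type category, MAF, and inheritance.

    `trio` = (proband, mother, father) sample ids; each variant dict carries
    'category', 'maf', and a 'genotypes' map keyed by sample id."""
    proband, mother, father = trio
    passing = []
    for v in variants:
        if v["category"] > max_category or v["maf"] >= max_maf:
            continue
        gt = v["genotypes"]
        if inheritance == "de_novo":
            fits = gt[proband] == "het" and gt[mother] == gt[father] == "ref"
        elif inheritance == "recessive":
            fits = (gt[proband] == "hom_alt"
                    and gt[mother] == "het" and gt[father] == "het")
        elif inheritance == "dominant":
            fits = gt[proband] in ("het", "hom_alt")
        else:
            fits = True   # other patterns (X-linked, mitochondrial) omitted here
        if fits:
            passing.append(v)
    return passing
```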