- Method
- Open Access
Phylogenetically typing bacterial strains from partial SNP genotypes observed from direct sequencing of clinical specimen metagenomic data
- Received: 19 February 2015
- Accepted: 15 May 2015
- Published: 9 June 2015
Abstract
We describe an approach for genotyping bacterial strains from low coverage genome datasets, including metagenomic data from complex samples. Sequence reads from unknown samples are aligned to a reference genome where the allele states of known SNPs are determined. The Whole Genome Focused Array SNP Typing (WG-FAST) pipeline can identify unknown strains with much less read data than is needed for genome assembly. To test WG-FAST, we resampled SNPs from real samples to understand the relationship between low coverage metagenomic data and accurate phylogenetic placement. WG-FAST can be downloaded from https://github.com/jasonsahl/wgfast.
Keywords
- Single Nucleotide Polymorphism
- Whole Genome Sequence
- Methicillin Resistant Staphylococcus Aureus
- Single Nucleotide Polymorphism Genotype
- Single Nucleotide Polymorphism Locus
Background
Whole genome sequencing (WGS) is a powerful and increasingly available technology for understanding the evolutionary and epidemiological relationships among bacterial pathogens. For bacterial disease outbreaks, whole genome analysis has been used to identify and attribute the outbreak sources for many bacterial pathogens, including Escherichia coli O104 [1], Vibrio cholerae [2], Klebsiella spp. [3], methicillin resistant Staphylococcus aureus (MRSA) [4] and even Bacillus anthracis [5]. Adding the genetic relationships of isolates to other standard epidemiological correlates (for example, time and space) offers the power to identify disease outbreaks that would not otherwise be apparent. This approach has been highly successful using sub-genomic DNA methods (for example, multi-locus sequence typing (MLST) [6]) but the use of whole genome sequencing will replace these in the near future due to precision and accuracy of strain identification offered by this near comprehensive technology [7].
The advent of molecular diagnostics (for example, polymerase chain reaction (PCR)) has led to improved pathogen identification, in part, because they are not dependent upon isolation and subsequent culturing of the pathogen. But the currently dominant disease-tracking methods (for example, pulsed field gel electrophoresis (PFGE)) only work with isolated pure cultures, leading to the possibility that disease tracking efforts will be diminished in this new age [8]. Molecular epidemiological methods using the power of WGS that parallel molecular diagnostics with direct application to complex specimens are needed. In fact, recent studies have used this approach to associate diseases with the infectious agent [9, 10].
WGS analysis to identify pathogen strains would seem possible through the metagenomic deep sequencing of clinical specimens, but genome coverage of a specific microbe is hard to predict and the pathogen may represent only a minor component in the microbiome of the infected tissue [11]. Many pathogen populations have low diversity and, hence, single nucleotide polymorphism (SNP) discovery with low-genome coverage leads to greater misidentification due to sequencing errors than true SNP genotyping. To reduce this ‘signal-to-noise’ problem, we developed the Whole Genome Focused Array SNP Typing (WG-FAST) method, where only known SNPs with defined allele states are scored. These are derived from a reference population where high quality genomic data are available to generate a highly robust phylogenetic reconstruction. Sequencing reads are aligned to a reference genome annotated with the positions of known SNPs and their allelic states. The metagenomic SNP genotype of the unknown pathogen can then be placed into the most likely phylogenetic position. It is the reference population SNP database that defines the best possible model for population structure, which is then used as a reference for unknown SNP genotypes identified from less than ideal (for example, low coverage) datasets. We also present several approaches for establishing confidence in phylogenetic placement including hypothesis-testing methods that generate odds ratio probabilities. This is essential because the precision of phylogenetic placement will be unique for each application and is dependent upon a number of variables including: (1) the SNP/genome density in the reference population; (2) the depth of genome coverage from the unknown sample; and (3) the phylogenetic topology in the actual placement position of the reference population. Thus, placement confidence metrics must be established for each unknown sample. WG-FAST will allow the use of deep metagenomic sequencing data to identify strains from complex samples such as clinical specimens, food matrices, and the environment, alleviating the requirement for pure cultures to accomplish molecular epidemiological goals.
Methods
Single nucleotide polymorphism (SNP) discovery
The robust characterization of SNPs in a reference set of isolates is a necessary first step in the WG-FAST analysis pipeline. A pipeline to wrap methods discussed below, known as the Northern Arizona SNP Pipeline (NASP), is publically available (tgennorth.github.io/NASP/). Our strategy for reference SNP identification is to use only the non-redundant core genome sequences to avoid missing data and misuse of paralogous regions. To create a reference database, raw reads or assembled genomes are aligned to a reference genome with BWA-MEM [12] or NUCmer [13], respectively. SNPs and insertion/deletions (indels) can be identified with variant callers including the UnifiedGenotyper method in GATK [14, 15], SAMtools [16], VarScan [17], and/or SOLSNP ([18]). Called SNPs can then filtered using user-defined thresholds for read depth (default = 3×) and allele frequency proportion (default = 90 %). All called SNPs are then placed into a matrix that includes the nucleotide calls in each position of the reference genome for all genomes queried. Benchmarking tests on a single genome (E. coli C227-11) with 12 million reads, 100 bases in length, took 4 h 25 min to place and perform 100 subsampling confidence tests using eight processors on a single node with 48 Gb of RAM.
Whole genome focused array SNP typing (WG-FAST) pipeline
Source code for WG-FAST is publically available at [19] under a GPL v3 license. The required input for a WG-FAST analysis includes a NASP-formatted SNP matrix, a phylogeny inferred with RAxML [20], a reference genome assembly, and a directory including single or paired-end reads with ‘.fastq.gz’ extensions. Dependencies for WG-FAST include BWA-MEM, GATK, Picard-tools ([21]), DendroPy [22], RAxML v8 [23], BioPython [24], Trimmomatic [25], and SAMtools [16]; many of these dependencies are included in the WG-FAST repository. A script to generate the formatted, required phylogeny from the SNP matrix is included with WG-FAST.
A workflow of the WG-FAST pipeline. Input files are shown in white boxes and include a SNP matrix, a corresponding phylogeny, a reference genome assembly, and a directory of reads from all unknowns. The output files include a phylogeny with unknowns highlighted in red, all pairwise patristic distances, the average depth of coverage across the reference assembly, and if the subsample routine is invoked, the resulting P values
An optional subsampling routine is built into WG-FAST in order to test the robustness of a given placement on a phylogenetic tree. From the final phylogeny, the two closest genomes to each unknown, based on patristic distances, are identified. The SNPs from the two neighbors are then sampled at the same coverage level as each unknown and a new SNP matrix is created. Each matrix is then converted into a multi-FASTA and the samples are placed into the phylogeny with the EPA algorithm. The patristic distance to the reference isolate is then calculated for each subsample and is compared to the ‘true’ patristic distance using all SNPs; the reference is used because its position is fixed and the ‘Reference’ name is the same for each NASP-formatted SNP matrix, regardless of the target organism. The null hypothesis is that a random subsampling placement will differ significantly from the ‘correct’ placement based on a comparison of patristic distances. The number of times that the distance from the reference is different from the known placement is calculated for 100 replicates based on a user-defined threshold. From a set of 100 replicates, if the number of samples placed incorrectly is fewer than 5, then the P value is <0.05 and the placement can tentatively be trusted. For large datasets (that is, hundreds of thousands of SNPs and hundreds to thousands of genomes), this subsampling routine may be impractical, as 200 placements are required if 100 replicates is selected by the user.
A WG-FAST test case
To test the utility of the WG-FAST pipeline, approximately 700 E. coli genome assemblies were downloaded from GenBank [28]; E. coli was used as the test case due to the large number of assembled genomes in public databases and due to the non-clonal nature of the species [29]. SNPs from all genomes were identified using E. coli K-12 W3110 (accession # NC_007779) [30] as the reference, and a SNP matrix was generated with NASP. A maximum likelihood phylogeny (Additional file 1: Data file 1) was inferred on this concatenated SNP alignment with RAxML v. 8.1.13 using the following parameters: −f d -p 12345 -m GTRGAMMA. Closely related genomes, based on phylogenetic relatedness, were then manually removed, resulting in 255 genomes (Additional file 2). Autapomorphic SNPs (that is, private SNP alleles) in the outgroup genome, TW10509 (AEKA00000000) [31], belonging to a ‘cryptic’ lineage of E. coli [32], were also removed. The resulting SNP matrix consisted of greater than 225,000 SNPs (Additional file 3: Data file 2).
Read subsampling
To test the robustness of the WG-FAST pipeline using a low number of reads, sequence reads were randomly sampled at varied depths (50–100,000 read pairs), from published E. coli datasets (Additional file 4). One hundred separate datasets at each read depth were then processed with WG-FAST. The minimum number of called positions in order to correctly genotype the unknown ≥95 % of the time, based on a patristic distance ratio (query patristic distance to reference/true patristic distance to reference) between 0.99 and 1.01, was identified for each genome. Multiple isolates from different regions of the tree and sequence data from multiple sequencing platforms were analyzed.
SNP subsampling
In addition to subsampling raw reads, positions present in the SNP matrix were subsampled for each genome in the phylogeny. SNPs were sampled at a lowest frequency of 50, then sampled every 100 SNPs subsequently, until the patristic distance of 95 % of 100 iterations, compared to the reference, was between 0.99 and 1.01, compared to the patristic distance of the placement using all available positions.
SNP matrix correlation with subsamplings
To identify the fewest number of reference positions required in order to obtain a comparable matrix to a matrix using all available SNPs, a subsampling method was employed. A user-provided number of SNPs were randomly selected from the matrix, the reduced matrix was converted into a multi-FASTA, and a distance matrix was calculated with mothur [33]. A distance matrix was also generated from the complete SNP matrix with mothur. A Mantel test was then performed on the two matrices with mothur, using the Pearson correlation. The Pearson correlation value at each SNP level, with 100 replicates, was then plotted. A script to wrap these functions is available with WG-FAST (subsample_snps_pearson.py).
Metagenomic analysis
To test the WG-FAST method on metagenomic samples, 53 datasets from a recent metagenomic survey of stool samples from the 2011 E. coli O104:H4 outbreak [34] were downloaded and processed with WG-FAST (Additional file 5). The subsample routine was run on all samples using 100 iterations.
In silico mixtures
In some clinical samples, mixtures of multiple closely related conspecific strains have been observed [35]. To determine how mixtures will affect phylogenetic placement using WG-FAST, several artificial mixtures were generated (Additional file 6) and processed with WG-FAST. When processed with WG-FAST, a minimum coverage of 1× and a minimum proportion of 60 % was used.
Error rate calculation
WG-FAST is intended to phylogenetically genotype isolates from complex samples where the desired signal could be faint. In these cases, error in the data could confound accurate phylogenetic characterization. To test the error rate, raw reads were mapped against the E. coli genome TY-2482 (SRR292862) with BWA-MEM, and a BAM file was generated. At each position in the reference chromosome, the most frequent base was removed and the counts of the alternate alleles, which represents error, were summed. Average error and associated standard deviation was calculated across the entire reference chromosome.
Results
Whole genome focused array SNP typing (WG-FAST) pipeline
WG-FAST was developed as a parallel, open source method to accurately genotype novel isolates from high read coverage (for example, 50× reference genome coverage) or from metagenomic data in the context of a known phylogenetic or population genetic structure (Fig. 1). This method can be used to type new bacterial populations, where the tree structure should either not be altered, the read depth is low (<1×), as is the case with metagenomic samples, or where computation of a new tree is too computationally expensive. WG-FAST is an open-source application written in Python and relies on published and validated tools for read alignment, single nucleotide polymorphism (SNP) calling, and the placement of samples in a phylogenetic context.
Intrinsic error rate
One potential pitfall to identifying SNPs from low coverage samples is mistaking sequencing errors for true variants. We estimated the single-read base call error rate across the E. coli chromosome in isolate TY-2482 [36] to understand its effect on genotyping accuracy; the average error rate in this dataset was 0.16 % (SD ± 0.43 %) (Additional file 7). Although this error rate is low, at 1× coverage of a model bacterial genome (for example, 5 Mbp) this would result in the discovery of roughly 8,000 false SNPs. These errors would lead to incorrect calls across the reference genome, which would confound the analysis of true SNPs in many epidemiological analyses where the true variation can be much less. While the use of short read error correction tools, such as Hammer [37] or Musket [38], prior to WG-FAST should reduce many of these errors, the common solution is to increase sequence coverage to verify a particular SNP. With high read coverage, the false SNP discovery is small, but this is difficult and expensive to achieve in a metagenomic analysis of complex specimens. Rather, the WG-FAST approach limits base calling to known SNP positions and therefore minimizes the impact of this error rate. In a 1,000 SNP genotype, fewer than 2 SNPs would be falsely identified at this rate. If metagenomic data are used to generate a genotype at known genomic positions and with known allele states, the sequencing error has little consequence on a multi-locus genotype determination.
Escherichia coli dataset and phylogeny
As a WG-FAST test case, approximately 700 E. coli genome assemblies were downloaded from GenBank; E. coli was used as a test case due to the large number of sequenced genomes. For this analysis, closely related genomes were manually removed based on phylogenetic redundancy, resulting in a dataset of 255 genomes (Additional file 2). SNPs were then identified from NUCmer [39] alignments and a phylogeny was generated from the concatenated SNP alignment (approximately 225,000 SNPs) with RAxML v8 [23], using TW10509 (accession #AEKA00000000) as the root (Additional file 8) and K-12 W3110 [30] as the reference. The retention index (RI) [40] of the tree was 0.80, demonstrating significant homoplasy in the underlying SNP data, probably resulting from historical recombination among lineages. However, the major E. coli phylogroups, labeled A through E, were monophyletic and consistent with previous analyses [41].
Subsample SNP correlations
The effect of increasing SNP loci on genetic distance estimation. A pairwise distance matrix between 255 E. coli genomes and based upon approximately 225,000 SNPs was calculated and then correlated to matrices generated using fewer SNPs. SNP subsampling was performed 100 times at each level. Each subsampled matrix was then converted into a multi-FASTA and a distance matrix was calculated with mothur [33]. Distance matrices were compared with the Mantel test function in mothur and the Pearson correlation value was calculated and plotted for each subsampling level. The mean and standard deviation for all iterations were calculated and plotted
Subsampling SNPs from the complete SNP matrix
The effect of SNP number on phylogenetic placement. The SNPs were randomly subsampled from each genome at different levels and were then inserted into the phylogeny 30 separate times with either the genome removed (gray) or not removed (red). The number of reference positions required for accurate placement are illustrated by the height of bars, but were capped at 9,500 positions. The phylogenomic clade designations (a-e) are shown at each relevant node
Subsampling Reads for WG-FAST placement
A phylogeny and associated heatmap showing the placement of read subsamples for 100 iterations of three E. coli isolates. The maximum likelihood phylogeny was inferred by RAxML [20] from a concatenation of approximately 225,000 single nucleotide polymorphisms called against the reference genome, K-12 W3110 [30]. Raw reads were randomly sampled from three genomes at four different depths and reference positions were identified. Each genome, at each level, was then inserted into the phylogeny with the evolutionary placement algorithm (EPA) in RAxML [26]. Duplicate placements were removed from the tree and redundancies were represented as a heatmap
Strain mixtures
In some cases, a single pathogen from a given species will be dominant in a clinical specimen [42], but not always. To test the effect of strain mixtures in silico, we used E. coli as the test case, which is a normal inhabitant of the healthy human gut. Reads from the reference isolate O104:H4, strain C227-11 [1], were mixed with reads from the O157:H7 isolate 8624 [43] at different proportions (90:10, 80:20, 70:30, 60:40) in a total of 10,000 read pairs (100 bp reads) (Additional file 6). At a read mixture of 80:20, the dominant sample was still accurately genotyped, although a longer branch length was observed due to homoplasious SNPs and unwarranted additional phylogenetic steps (Additional file 10). At a 70:30 mixture, the unknowns were no longer placed into the dominant strain clade, and could not be accurately typed. At a 60:40 mixture, most samples erroneously grouped with the reference with longer branch lengths. Strain mixtures at near equal proportions, which is not anticipated based on analyses of stool samples, would definitely confound accurate placement with WG-FAST. Importantly, however, this problem can be identified due to the presence of long branches leading to each unknown sample with highly homoplasious characters. More detailed analyses to identify the homoplasious SNPs and separate them has the potential to deconvolute mixtures into the source genotypes and allow their phylogenetic placement.
Metagenomic sample analysis
A phylogenetic tree showing the placement of 53 metagenomic samples associated with the 2011 outbreak of E. coli O104:H4 [34]. Raw reads from each sample were mapped against the reference genome, K-12 W3110 [30], and all positions were called by GATK [15]. Filtered positions were then inserted into the E. coli phylogeny with the evolutionary placement algorithm (EPA) in RAxML [26]. The placement of the reference O104:H4 isolate, TY-2482, is shown in green, the reference genome is shown in gray, correctly genotyped samples are shown in red, while other genotyped E. coli isolates are shown in blue
Metrics to measure placement accuracy
A decision tree to help a user decide the confidence in the WG-FAST placement of a sample
Discussion
Analysis of the microbiome has been largely performed at higher taxonomic levels (for example, genus, species) and focused primarily upon the 16S rRNA gene [44] but these analyses are now increasingly using full metagenomic data sets [45]. This offers the opportunity to move the taxonomic discrimination to levels below that of species (that is, precise strain identification). However, genome-based classification is still complicated by low coverage datasets and ambiguous classification due to database biases, including incomplete datasets for many organisms. For pathogen identification, multiple reference genomes are usually available for a known pathogen and should only increase as whole genome sequence (WGS) data becomes easier to generate and analyze.
The availability of WGS data has led to the problem of how to analyze newly sequenced isolates in the context of existent data. This is especially a problem when trying to accurately genotype the causative agent of infections from WGS obtained directly from clinical specimens. Our WG-FAST pipeline is able to phylogenetically genotype isolates from single isolate sequencing projects, low coverage sequencing projects, or from complex samples with variable coverage, such as metagenomics projects sequenced from human samples. Specifically, WG-FAST was designed to accurately genotype new isolates where the read coverage is below 1× and may not be able to be genotyped by methods that rely on higher query coverage. Although studies have been published that use a similar concept [46], WG-FAST represents the only publically-available pipeline that can perform these functions and provide statistical support for a given phylogenetic placement. Although WG-FAST currently only works on SNP data, other data, including indels, can also be used to discriminate between related isolates and may provide additional phylogenetic resolution between genomes.
To test the application of WG-FAST towards real samples, a set of 53 metagenomic datasets from an analysis of diarrheal samples associated with a Escherichia coli O104:H4 outbreak [34], were processed. All samples identified by Loman et al. as positive for O104:H4 and unmixed, based on a separate bioinformatics pipeline, were also correctly genotyped by WG-FAST. However, additional samples were identified as positive for O104:H4 by WG-FAST that were reported as negative by Loman et al. This demonstrates that the ability of WG-FAST to genotype based on partial genotypes may allow for lower level detection than read assembly or mapping methods that require higher reference coverage in order to classify a pathogen. The long branches on some samples demonstrate signs of mixtures of multiple isolates (Fig. 5), which were also identified by Loman et al.; because WG-FAST does not discover novel SNPs, any branch lengths are indicative of homoplasy created by character state conflicts.
The placement of artificial mixtures demonstrated that at near equal proportions, WG-FAST can place a sample in the wrong location (Additional file 10). When a long branch is observed on a placed sample, other evidence must be considered when evaluating the quality of a placement, including the number of positions required for accurate placement, determined by subsampling, in that region of the phylogeny. Removing homoplasious SNPs has the potential to resolve the mixture into dominant and subdominant strains. For each dataset studied, a similar analysis should be conducted in order to understand the limits of the placement method. To determine the fewest number of reads and reference positions that still result in accurate phylogenetic placement, a subsampling approach was employed. The subsampling experiments based on subsampled SNPs demonstrated that a minimum of 100 reference positions must be called in the case of the E. coli dataset used in this study to accurately genotype unknowns ≥95 % of the time. However, the region of the tree where the unknown falls can drastically affect the number of required positions, which can be greater than 9,500 (approximately 0.002× genome coverage). There was a strong correlation with the number of positions required for accurate placement and the topology of the tree. In general, nodes that were filled with closely related isolates required only approximately 100 positions for accurate placement, while nodes containing isolates with long branch lengths required far more positions to be called for accurate placement. The sequence analysis of additional diverse isolates will help fill in blank regions in the tree and create a reference phylogeny that will be better able to place unknown isolates at very low read coverage.
When compiling a reference database for a pathogen of interest, the clonality of an organism should be considered. For highly recombinant pathogens, such as Burkholderia pseudomallei, WG-FAST analysis may require additional positions to be called in order to separate the clonal signal from the recombinant signal. For highly clonal pathogens, the issue becomes the relative lack of polymorphisms in the dataset. For example, only 2,298 SNPs are able to describe the global phylogenetic diversity in Yersinia pestis [47], which will require more sequence reads to accurately place an unknown due to the reduced size of the available SNP search space.
The large sequence datasets that are now available to most researchers have presented new problems, both computationally and methodologically, for the analysis of new isolates. WG-FAST presents a method to characterize new isolates in the context of a reference population. The applications to this method include assigning isolates to known outbreaks, as described in this study, typing unknown isolates to specific phylogenetic lineages, and may provide the resolution to resolve transmission routes, although additional experimentation is required before this is verified. As sequence data, both single isolate and metagenomic, become more commonplace, methods that scale linearly with huge datasets, such as WG-FAST, will become critical for the analysis of clinical pathogens.
Conclusions
In this study, we demonstrate how WG-FAST can be used to genotype isolates at the strain level from complex samples using low levels of sequence data obtained from metagenomics studies. While WG-FAST can also be used in conjunction with single isolate genomics datasets, it is especially powerful when analyzing low coverage datasets. In addition to genotyping, WG-FAST performs statistical analysis to help assess the quality of an unknown placement. We demonstrate that in E. coli, WG-FAST can be used to genotype from metagenomic datasets, place samples accurately at extremely low reference genome coverage, and provide a confidence landscape when assessing placement confidence. As reference databases and sequence datasets become more complex, methods such as WG-FAST are required for strain-level genotyping.
Declarations
Acknowledgements
The authors would like to thank Dr. Talima Pearson for discussions on phylogenetic placement. This work was supported by the United States Department of Homeland Security Science and Technology grant no. HSHQDC-10-C-00139.
Authors’ Affiliations
References
- Rasko DA, Webster DR, Sahl JW, Bashir A, Boisen N, Scheutz F, et al. Origins of the E.coli strain causing an oubreak of hemolytic-uremic syndrome in Germany. N Engl J Med. 2011;365:709–17.PubMedPubMed CentralView ArticleGoogle Scholar
- Hendriksen RS, Price LB, Schupp JM, Gillece JD, Kaas RS, Engelthaler DM, et al. Population genetics of Vibrio cholerae from Nepal in 2010: evidence on the origin of the Haitian outbreak. mBio. 2011;2:e00157–11.PubMedPubMed CentralView ArticleGoogle Scholar
- Snitkin ES, Zelazny AM, Thomas PJ, Stock F, Group NCSP, Henderson DK, et al. Tracking a hospital outbreak of carbapenem-resistant Klebsiella pneumoniae with whole-genome sequencing. Sci Transl Med. 2012;4:148ra116.PubMedPubMed CentralView ArticleGoogle Scholar
- Price LB, Stegger M, Hasman H, Aziz M, Larsen J, Andersen PS, et al. Staphylococcus aureus CC398: host adaptation and emergence of methicillin resistance in livestock. mBio. 2012;3:e00305–11.PubMedPubMed CentralView ArticleGoogle Scholar
- Rasko DA, Worsham PL, Abshire TG, Stanley ST, Bannan JD, Wilson MR, et al. Bacillus anthracis comparative genome analysis in support of the Amerithrax investigation. Proc Natl Acad Sci U S A. 2011;108:5027–32.PubMedPubMed CentralView ArticleGoogle Scholar
- Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, et al. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A. 1998;95:3140–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Long SW, Beres SB, Olsen RJ, Musser JM. Absence of patient-to-patient intrahospital transmission of Staphylococcus aureus as determined by whole-genome sequencing. MBio. 2014;5:301692–14.View ArticleGoogle Scholar
- Shopsin B, Kreiswirth BN. Molecular epidemiology of methicillin-resistant Staphylococcus aureus. Emerg Infect Dis. 2001;7:323–6.PubMedPubMed CentralView ArticleGoogle Scholar
- Fischer N, Rohde H, Indenbirken D, Gunther T, Reumann K, Lutgehetmann M, et al. Rapid metagenomic diagnostics for suspected outbreak of severe pneumonia. Emerg Infect Dis. 2014;20:1072–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Wilson MR, Naccache SN, Samayoa E, Biagtan M, Bashir H, Yu G, et al. Actionable diagnosis of neuroleptospirosis by next-generation sequencing. N Engl J Med. 2014;370:2408–17.PubMedPubMed CentralView ArticleGoogle Scholar
- Hajishengallis G, Darveau RP, Curtis MA. The keystone-pathogen hypothesis. Nat Rev Microbiol. 2012;10:717–25.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXivorg. 2013;arXiv:1303.3997 [q-bio.GN].Google Scholar
- Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30:2478–83.PubMedPubMed CentralView ArticleGoogle Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8.PubMedPubMed CentralView ArticleGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76.PubMedPubMed CentralView ArticleGoogle Scholar
- SolSNP – Use of a modified Kolmogorov-Smirnov statistic and data filtering to call variants. Available at: http://sourceforge.net/projects/solsnp/.
- The whole genome focused array SNP typing pipeline. Available at: https://github.com/jasonsahl/wgfast.
- Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.PubMedView ArticleGoogle Scholar
- Picard Tools. Available at: http://broadinstitute.github.io/picard/.
- Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010;26:1569–71.PubMedView ArticleGoogle Scholar
- Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3.PubMedPubMed CentralView ArticleGoogle Scholar
- Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–3.PubMedPubMed CentralView ArticleGoogle Scholar
- Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.PubMedPubMed CentralView ArticleGoogle Scholar
- Berger SA, Krompass D, Stamatakis A. Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol. 2011;60:291–302.PubMedPubMed CentralView ArticleGoogle Scholar
- FigTree. Available at: http://tree.bio.ed.ac.uk/software/figtree/.
- Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2012;40:D48–53.PubMedPubMed CentralView ArticleGoogle Scholar
- Dykhuizen DE, Green L. Recombination in Escherichia coli and the definition of biological species. J Bacteriol. 1991;173:7257–68.PubMedPubMed CentralGoogle Scholar
- Hayashi K, Morooka N, Yamamoto Y, Fujita K, Isono K, Choi S, et al. Highly accurate genome sequences of Escherichia coli K-12 strains MG1655 and W3110. Mol Syst Biol. 2006;2:2006–0007.PubMedPubMed CentralView ArticleGoogle Scholar
- Luo C, Walk ST, Gordon DM, Feldgarden M, Tiedje JM, Konstantinidis KT. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc Natl Acad Sci U S A. 2011;108:7200–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Walk ST, Alm EW, Gordon DM, Ram JL, Toranzos GA, Tiedje JM, et al. Cryptic lineages of the genus Escherichia. Appl Environ Microbiol. 2009;75:6534–44.PubMedPubMed CentralView ArticleGoogle Scholar
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.PubMedPubMed CentralView ArticleGoogle Scholar
- Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, et al. A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. JAMA. 2013;309:1502–10.PubMedView ArticleGoogle Scholar
- Hasman H, Saputra D, Sicheritz-Ponten T, Lund O, Svendsen CA, Frimodt-Moller N, et al. Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples. J Clin Microbiol. 2014;52:139–46.PubMedPubMed CentralView ArticleGoogle Scholar
- Rohde H, Qin J, Cui Y, Li D, Loman NJ, Hentschke M, et al. Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. N Engl J Med. 2011;365:718–24.PubMedView ArticleGoogle Scholar
- Medvedev P, Scott E, Kakaradov B, Pevzner P. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics. 2011;27:i137–41.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu Y, Schroder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics. 2013;29:308–15.PubMedView ArticleGoogle Scholar
- Delcher AL, Salzberg SL, Phillippy AM. Using MUMmer to identify similar regions in large sequence sets. Curr Protoc Bioinformatics 2003, Chapter 10:Unit 10 13.Google Scholar
- Farris JS. The retention index and the rescaled consistency index. Cladistics. 1989;5:417–9.View ArticleGoogle Scholar
- Sahl JW, Steinsland H, Redman JC, Angiuoli SV, Nataro JP, Sommerfelt H, et al. A comparative genomic analysis of diverse clonal types of enterotoxigenic Escherichia coli reveals pathovar-specific conservation. Infect Immun. 2011;79:950–60.PubMedPubMed CentralView ArticleGoogle Scholar
- Willner D, Low S, Steen JA, George N, Nimmo GR, Schembri MA, et al. Single clinical isolates from acute uncomplicated urinary tract infections are representative of dominant in situ populations. mBio. 2014;5:e01064–13.PubMedPubMed CentralView ArticleGoogle Scholar
- Hazen TH, Sahl JW, Fraser CM, Donnenberg MS, Scheutz F, Rasko DA. Refining the pathovar paradigm via phylogenomics of the attaching and effacing Escherichia coli. Proc Natl Acad Sci U S A. 2013;110:12810–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI. The human microbiome project. Nature. 2007;449:804–10.PubMedPubMed CentralView ArticleGoogle Scholar
- Fisher CK, Mehta P. Identifying keystone species in the human gut microbiome from metagenomic timeseries using sparse linear regression. arXiv:11115572 [csDS] 2014.Google Scholar
- Doughty EL, Sergeant MJ, Adetifa I, Antonio M, Pallen MJ. Culture-independent detection and characterisation of Mycobacterium tuberculosis and M. africanum in sputum samples using shotgun metagenomics on a benchtop sequencer. Peer J. 2014;2:e585.PubMedPubMed CentralView ArticleGoogle Scholar
- Cui Y, Yu C, Yan Y, Li D, Li Y, Jombart T, et al. Historical variations in mutation rate in an epidemic pathogen, Yersinia pestis. Proc Natl Acad Sci U S A. 2013;110:577–82.PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.