Genomics and outbreak investigation: from sequence to consequence
© BioMed Central Ltd 2013
Published: 29 April 2013
Outbreaks of infection can be devastating for individuals and societies. In this review, we examine the applications of new high-throughput sequencing approaches to the identification and characterization of outbreaks, focusing on the application of whole-genome sequencing (WGS) to outbreaks of bacterial infection. We describe traditional epidemiological analysis and show how WGS can be informative at multiple steps in outbreak investigation, as evidenced by many recent studies. We conclude that high-throughput sequencing approaches can make a significant contribution to the investigation of outbreaks of bacterial infection and that the integration of WGS with epidemiological investigation, diagnostic assays and antimicrobial susceptibility testing will precipitate radical changes in clinical microbiology and infectious disease epidemiology in the near future. However, several challenges remain before WGS can be routinely used in outbreak investigation and clinical practice.
Outbreaks of infection can be devastating for individuals and societies. In medieval times, the Black Death led to the death of up to a third of the inhabitants of Europe. More recently, an outbreak of Shiga-toxin-producing Escherichia coli (STEC) struck Germany in May-June 2011, resulting in over 3,000 cases and over 50 deaths, and provided ample evidence of the harrowing effects of bacterial infection on a modern, industrialized society [2, 3].
In its loosest sense, the term 'outbreak' can be used to refer to any increase in the incidence of a given infection, which can occur in response to local, societal or environmental changes: for example, one might see an increase in the prevalence of staphylococcal wound infections when hospital ward or operating theatre cleaning procedures change, or when there are changes in the use of antibiotics. However, in the strictest sense (which we adopt here), the term implies a series of infections caused by indistinguishable or closely linked isolates, which are sufficiently similar to justify talking about 'an outbreak strain'. Such outbreaks can range in size from a few individuals, for instance in a family outbreak or an outbreak on a hospital ward, to epidemics that rage across countries or continents.
Investigation of a suspected outbreak has two aims: termination of the cluster of disease and prevention of similar occurrences by understanding how such outbreaks originate. A key question surfaces at the start of any such investigation: is one really seeing an outbreak in the strictest sense, caused by a single strain, or is one merely seeing an increased incidence of infection, involving multiple unrelated strains? The answer to this question is of more than academic interest, as it dictates how the finite resources available for infection control are best deployed. For example, evidence of cross infection with a single methicillin-resistant Staphylococcus aureus (MRSA) strain on a ward might prompt an aggressive strategy of patient isolation and decolonization, whereas an increase in infections caused by diverse staphylococcal strains (presumably each derived from the patient's own microbiota) might prompt a look at policies for wound care or antibiotic usage. Similarly, identification and characterization of an outbreak strain or the discovery of its source or mode of transmission influences the behavior of the infection control team - potential responses include removal of the source, interruption of transmission or strengthening of host defenses.
A selection of recent outbreaks*
Disease or pathogen
Airborne, point source
Stoke on Trent, UK
Likely source a hot tub
Airborne, propagated human-to-human
2012 to now
South Wales, UK
Subsequent to poor take-up of measles, mumps and rubella (MMR) vaccine
Airborne, propagated human-to-human
2011 to now
England and Wales, UK
Perhaps related to waning immunity in adults
Airborne, propagated human-to-human
Spread through social links, including nightclub
Link between cases unclear
Bloodstream infection, common source
2009 to 2012
Europe, including UK
100s of cases
Thought to be associated with contaminated batch of heroin
Exposure to animal feces
E. coli O157
Sutton Coldfield, UK
Contact between humans and animals in suburban park
Food-borne, point source
Linked to consumption of watermelon
Late 2011 to early 2012
Northern Ireland, UK
Associated with contaminated hospital water supplies
2010 to now
Occurred 10 months after powerful earthquake
2008 to now
Exacerbated by consequences of economic collapse, including poor water sanitation
Zoonotic, animal-to-human spread
Virus type known to be circulating in birds
Here, we examine the applications of new high-throughput sequencing approaches to the identification and characterization of outbreaks, focusing on the application of whole-genome sequencing (WGS) to outbreaks of bacterial infection. We describe how traditional epidemiological analysis works and show how WGS can be informative at multiple steps in outbreak investigation.
Although traditional epidemiology can often track down the source of an outbreak (for example, a case-control study can identify the foodstuff responsible for a food-poisoning outbreak [9, 10]), for several decades laboratory investigations have also had an important role in outbreak investigation and management. Thus, when suspicion of an outbreak has been raised on clinical or epidemiological grounds, the laboratory can provide evidence to confirm or dismiss a common microbial cause. Alternatively, an increase in laboratory reports of a given pathogen may provide the first evidence that an outbreak is under way.
However, in addition to providing diagnostic information, the laboratory also offers epidemiological typing, which provides an assessment of how closely cases are related to each other. In broad terms, this means classifying isolates as unrelated (not part of an outbreak) or sufficiently closely related (in extremis, indistinguishable) to represent epidemic transmission.
Epidemiological typing requires the identification of stable distinguishing characteristics. Initially, this relied on analyses of useful phenotypic features (such as serological profiles, growth characteristics or susceptibilities to bacteriophage or antimicrobial agents). However, the arrival of molecular biology in general and specifically of the polymerase chain reaction (PCR) led to a profusion of genotypic approaches, largely documenting differences in patterns of bands seen on gels: examples include pulsed-field gel electrophoresis, ribotyping, variable-number tandem-repeat typing, random amplification of polymorphic DNA, arbitrarily primed PCR and repetitive-element PCR.
This riotous proliferation of genotypic typing methods, often with complex and non-standardized workflows, led Achtman in the late 1990s to coin the phrase YATM for 'yet another typing method' and to pioneer, with others, the adoption of sequence-based approaches, notably multilocus sequence typing (MLST). In this approach, differences in stretches of DNA sequence from conserved housekeeping genes are used to assign bacterial isolates to sequence types, which, in turn, often fall into larger clonal complexes. Sequence-based approaches bring the advantage of portability; in other words, results from one laboratory can be easily compared with those from others around the world. In addition, archiving of information in national or international datasets allows isolates and outbreaks to be placed in the wider context of pathogen population structure.
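The assignment step in MLST amounts to looking up an isolate's allelic profile in a shared table. The following is a minimal Python sketch of that idea, not the interface of any real MLST database: the locus names, allele numbers and sequence-type table are all hypothetical, and seven loci are used only because that is a common choice for such schemes.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme: the locus names and every number below are
# invented for illustration and do not come from any real MLST database.
LOCI = ("locA", "locB", "locC", "locD", "locE", "locF", "locG")

# Hypothetical profile table: allele numbers (in LOCI order) -> sequence type.
PROFILE_TO_ST: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 1, 1, 1, 1, 2): 2,
    (3, 4, 1, 2, 5, 1, 1): 3,
}


def assign_sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Return the sequence type for an isolate, or None if the profile is not in the table."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILE_TO_ST.get(profile)


def shared_loci(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Count loci at which two isolates carry the same allele number."""
    return sum(1 for locus in LOCI if a[locus] == b[locus])


if __name__ == "__main__":
    isolate_1 = dict(zip(LOCI, (1, 1, 1, 1, 1, 1, 1)))
    isolate_2 = dict(zip(LOCI, (1, 1, 1, 1, 1, 1, 2)))
    print(assign_sequence_type(isolate_1), assign_sequence_type(isolate_2))
    print(shared_loci(isolate_1, isolate_2))
```

Because the profile is just a short tuple of integers, it is easy to exchange between laboratories. The same compactness means that two isolates with identical profiles cannot be told apart, however many differences lie elsewhere in their genomes.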
Yet, despite the advantages of sequence-based typing, drawbacks remain. For example, there is a lack of standardization, as evidenced by the existence of multiple MLST databases and even multiple competing MLST schemes for the same species [14, 15]. In addition, costs and complex workflows mean that most pathogen typing is performed in batch mode, retrospectively, in reference laboratories that struggle to provide data with real-time impact - one possible exception is the near-real-time typing of Mycobacterium tuberculosis isolates in the UK. Approaches such as MLST also lack the resolution needed to reconstruct chains of transmission within outbreaks, tending instead to lump all isolates from an outbreak together as 'indistinguishable' members of the same sequence type.
WGS promises to deliver the ultimate high-resolution genotypic typing method [17–20]. Although we recognize that virologists pioneered the use of WGS for pathogen typing, targeting genomes small enough for WGS with traditional Sanger sequencing, here we will concentrate on the application of WGS to outbreaks of bacterial infection, catalyzed by the recent arrival in the marketplace of a range of technologies that fall under the umbrella term 'high-throughput sequencing' (sometimes called 'next-generation sequencing') [22, 23].
High-throughput sequencing, especially with the arrival of bench-top sequencers [24, 25], brings methodologies for bacterial WGS that are simple, quick and cheap enough to fall within the remit of an average-sized clinical or research laboratory. Through a single unified workflow, it becomes possible to identify all the features of interest of a bacterial isolate, speeding up the detection and investigation of outbreaks and delivering data in a portable digital format that can be shared internationally.
How whole-genome sequencing contributes to each step in outbreak investigation
Contribution of whole-genome sequencing (WGS)
Confirming the existence of an outbreak
Bench-top sequencing of whole bacterial genomes in near real time to confirm or refute the existence of outbreaks of MRSA or C. difficile
Open-ended diagnostic metagenomics to identify and characterize outbreak strain
WGS and/or metagenomics leads to the development of diagnostic reagents then used in defining cases within an outbreak
Descriptive study: collecting data and generating hypotheses
Integration of WGS with geographical data to uncover modes of spread of typhoid
Reconstruction of routes of transmission, including hidden transmission events
Identification of virulence factors and antimicrobial resistance
Analysis and hypothesis testing
Iterative refinements to assumptions and models
Institution and verification of control measures
Documenting effects of vaccination on pathogen populations
Confirmation that infections are imported rather than locally transmitted
Need for user-friendly digital output easily transferred between laboratories and expert advice of clinical academics at home in research and clinical environments
When pathogens are endemic, for example, MRSA or Clostridium difficile in healthcare facilities, it can be difficult to decide whether one or more outbreaks are under way or whether there has simply been a general rise in the incidence of infection. Eyre and colleagues showed that bench-top sequencing of whole bacterial genomes could be used in near real time to confirm or refute the existence of outbreaks of MRSA or C. difficile in an acute hospital setting. In particular, they found that the genome sequences from an apparent cluster of C. difficile infections turned out to be unrelated and so did not represent an outbreak sensu stricto.
Metagenomics, that is, wholesale sequencing of DNA extracted from complex microbial communities without culture, capture or enrichment of pathogens or their sequences, provides an exciting new approach to the identification and characterization of outbreak strains that does away with the need for laboratory culture or target-specific amplification or enrichment. This approach has been used to identify the causes of outbreaks of viral infection. Most recently, diagnostic metagenomics has been applied to stool samples collected during the German outbreak of STEC O104:H4, allowing recovery of draft genomes from the outbreak strain and several other pathogens and showing the applicability of diagnostic metagenomics to bacterial infections.
Case definition within an outbreak usually involves a combination of clinical and laboratory criteria; for instance, a complex of symptoms and an associated organism. This definition can then be used for active case finding to identify additional patients in the cluster. During the German STEC outbreak, rapid genome sequencing together with crowd-sourced bioinformatics analyses led to the development of a set of diagnostic reagents that could then be used in defining cases within the outbreak. Similarly, during new outbreaks of viral infection, genome-scale sequencing can act as a precursor to the development of simpler specific tests that can be used in case definition [31, 32].
During this phase of outbreak investigation, inferences from sequence data (such as on phylogeny, transmissibility, virulence or resistance) can be integrated with clinical and environmental metadata (such as geographical, temporal or anatomical data) to generate hypotheses and build and test models. For example, in a landmark study, Baker and colleagues combined high-resolution genotyping and geospatial analysis to uncover the modes of transmission of endemic typhoid fever in an urban setting in Nepal.
During this phase of hypothesis generation, it may be possible to infer hidden transmission events. For instance, when faced with the recurrence of a strain of C. difficile in a hospital after more than 3 years of absence, Eyre and colleagues concluded that unsuspected community transmission of C. difficile was the most likely explanation for their observations. They also noted that most of their C. difficile cases were unrelated to other recent cases in the hospital, from which they concluded that their hospital infection control policies were working as well as they could and that further reductions in the incidence of C. difficile infections would have to rely on additional and different interventions.
In some cases, it may be possible to hypothesize what determinants underlie the success of an outbreak strain. For example, the sasX gene (a mobile genetic element-encoded gene involved in nasal colonization and pathogenesis) appeared to be a key determinant of the successful spread of MRSA in China, and genes for the Panton-Valentine toxin were hypothesized to contribute to the spread of a novel MRSA genotype that caused an outbreak in a British special care baby unit.
Prediction of resistance phenotype from genotype has been applied routinely for years to viral pathogens such as human immunodeficiency virus, for which the cataloguing of resistance mutations in a publicly accessible database has greatly strengthened the utility of the approach. Data are accumulating from S. aureus and from E. coli strains that produce extended-spectrum beta-lactamases showing that WGS can be used to predict the resistance phenotype in bacteria (Nicole Stoesser, Department of Microbiology, John Radcliffe Hospital, Oxford, personal communication). Well-maintained databases documenting links between genotypes and resistance phenotypes are likely to add value to such ventures.
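In its simplest form, predicting phenotype from genotype is a lookup of detected resistance determinants against a curated catalogue. The sketch below illustrates this in Python under loud assumptions: the two-entry catalogue is a toy stand-in for a maintained database, the function name is hypothetical, and the detection of genes in the assembled genome is taken as already done.

```python
from typing import Dict, Iterable, Set

# Toy catalogue (an assumption for illustration): resistance gene -> drugs affected.
# mecA is the classic determinant of methicillin resistance in S. aureus, and
# CTX-M enzymes are extended-spectrum beta-lactamases active against cefotaxime.
CATALOGUE: Dict[str, Set[str]] = {
    "mecA": {"methicillin"},
    "blaCTX-M": {"cefotaxime"},
}


def predict_resistance(detected_genes: Iterable[str], drugs: Iterable[str]) -> Dict[str, str]:
    """Label each drug 'R' if a catalogued determinant was detected.

    A drug with no hit is reported as 'no known determinant' rather than
    'susceptible', because a catalogue can only report what it already contains.
    """
    resistant: Set[str] = set()
    for gene in detected_genes:
        resistant |= CATALOGUE.get(gene, set())
    return {drug: ("R" if drug in resistant else "no known determinant") for drug in drugs}


if __name__ == "__main__":
    print(predict_resistance(["mecA"], ["methicillin", "cefotaxime"]))
```

The cautious label for drugs without a hit reflects a real limitation: novel mechanisms or changes in gene expression can confer resistance that no catalogue entry yet describes.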
Host factors associated with disease may also be identified during data collection. Increasingly, whole-genome sequences of humans are available and being used to study population genetic risks for diseases, as reviewed recently by Chapman and Hill.
During this stage, there is often a series of iterative refinements to assumptions and models. For example, in a detailed retrospective analysis of tuberculosis cases in the English Midlands, Walker and colleagues first documented the diversity of M. tuberculosis genotypes in their collection and then explored how the patterns of genome diversity were reflected in contemporaneous and serial isolates from individual patients and among isolates from household outbreaks. This allowed them to define cut-offs in the number of SNPs that could be used to rule isolates in or out of a recent transmission event. In some instances, they could then allocate cases to clusters in which a link had been suspected, but had not been proven, by conventional epidemiological methods. In other cases, where a link had been suspected on grounds of ethnicity, they were able to exclude recent transmission within the West Midlands region.
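The use of a SNP cut-off to rule isolates in or out of a cluster can be sketched in a few lines of Python. This is an illustrative sketch only, not the method of any particular study: it assumes the genomes have already been aligned to equal length, the function names are hypothetical, and the cut-off is a parameter the user must justify for the organism in question.

```python
from itertools import combinations
from typing import Dict, List, Set


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two aligned sequences, skipping N and gap characters."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    ignore = {"N", "-"}
    return sum(1 for x, y in zip(a, b) if x != y and x not in ignore and y not in ignore)


def cluster_by_cutoff(seqs: Dict[str, str], cutoff: int) -> List[Set[str]]:
    """Single-linkage clustering: isolates within `cutoff` SNPs of each other are joined."""
    clusters: List[Set[str]] = [{name} for name in seqs]
    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= cutoff:
            ca = next(c for c in clusters if a in c)
            cb = next(c for c in clusters if b in c)
            if ca is not cb:
                ca |= cb  # merge the second cluster into the first
                clusters = [c for c in clusters if c is not cb]
    return clusters


if __name__ == "__main__":
    # Hypothetical ten-base "genomes", purely for illustration.
    isolates = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "TTGTACCTGG"}
    print(cluster_by_cutoff(isolates, cutoff=2))
```

Single linkage is a deliberate choice here: a chain of transmission A to B to C should stay in one cluster even if A and C differ by more than the cut-off.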
Outbreaks of meningococcal disease caused by serogroup C have largely been eradicated in the UK by vaccination. However, a retrospective genomic analysis of strains from a meningococcal outbreak allowed chains of transmission to be identified. This study pioneered the automated comparison of WGS data using a new public database, the Bacterial Isolate Genome Sequence Database (BIGSdb); the development of this kind of user-friendly, open-access tool is likely to underpin the adoption of WGS in epidemiological investigations in a clinical and public health environment.
Relatedness between isolates within an outbreak (and more widely) is often assessed by the construction of a phylogenetic tree. Such phylogenetic inferences can enable the identification of sources or reservoirs of infection: examples include the acquisition of leprosy by humans from wild armadillos and the acquisition of Mycobacterium bovis in cattle from sympatric badger populations [41, 42]. Integration of phylogeny with geography has allowed the origins and spread of pandemics and epidemics to be traced, including the Yersinia pestis pandemic and, controversially, the 2010 cholera outbreak in Haiti, which has been traced to Nepalese peacekeepers.
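To give a feel for how a tree can be built from pairwise differences between isolates, the following Python sketch implements UPGMA, one of the simplest distance-based clustering methods. It is offered as a teaching example under stated assumptions: the distances are hypothetical, the function name is invented, only the branching order is returned (not branch lengths), and UPGMA assumes a constant rate of change across lineages. Published outbreak studies generally rely on more sophisticated approaches such as maximum-likelihood or Bayesian inference.

```python
from typing import Dict, FrozenSet, List, Tuple


def upgma(names: List[str], dist: Dict[Tuple[str, str], float]) -> str:
    """Cluster isolates by UPGMA and return the branching order as a Newick string."""
    sizes: Dict[str, int] = {name: 1 for name in names}  # cluster label -> number of isolates
    d: Dict[FrozenSet[str], float] = {frozenset(pair): v for pair, v in dist.items()}
    while len(sizes) > 1:
        labels = sorted(sizes)
        # Find the closest pair of current clusters (ties broken alphabetically).
        _, a, b = min(
            (d[frozenset((x, y))], x, y)
            for i, x in enumerate(labels)
            for y in labels[i + 1:]
        )
        merged = f"({a},{b})"
        # Distance from the merged cluster to each other cluster is the
        # size-weighted average of the distances from its two parts.
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


if __name__ == "__main__":
    # Hypothetical pairwise SNP distances between three isolates.
    distances = {("A", "B"): 1.0, ("A", "C"): 5.0, ("B", "C"): 5.0}
    print(upgma(["A", "B", "C"], distances))
```

In this toy input, isolates A and B are joined first because they differ least, and C attaches to that pair afterwards.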
Molecular phylogenies also make it possible to look back over years, decades, even centuries. For example, He and colleagues showed that two distinct strains of fluoroquinolone-resistant C. difficile 027 emerged in the USA in 1993 to 1994, and that these showed different patterns of global spread. Genomic information, together with estimates from the sequence data of the time since isolates had diverged ('molecular clock' estimates), allowed them to reconstruct detailed routes of transmission within the UK. Similar studies have revealed patterns of the global spread of cholera, Shigella sonnei and MRSA [36, 46, 47].
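One simple way to obtain a molecular clock estimate is root-to-tip regression: the number of changes separating each isolate from the root of the tree is plotted against its sampling date, the slope of the fitted line gives the rate, and the point where the line crosses zero gives an estimate of when the lineage arose. The Python sketch below shows the arithmetic with invented data; it is not the method used in the studies cited above, which typically apply more elaborate statistical models, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def root_to_tip_regression(
    years: Sequence[float], snps_from_root: Sequence[float]
) -> Tuple[float, float]:
    """Ordinary least-squares fit of SNPs-from-root against sampling year.

    Returns (rate in SNPs per genome per year, estimated year of the root).
    """
    n = len(years)
    if n < 2 or n != len(snps_from_root):
        raise ValueError("need at least two isolates with matching inputs")
    mean_x = sum(years) / n
    mean_y = sum(snps_from_root) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("isolates must span more than one sampling date")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, snps_from_root))
    slope = sxy / sxx
    if slope <= 0:
        raise ValueError("no positive temporal signal in the data")
    intercept = mean_y - slope * mean_x
    root_year = -intercept / slope  # year at which the fitted line reaches zero SNPs
    return slope, root_year


if __name__ == "__main__":
    # Invented data: sampling years and SNP counts from the inferred root.
    rate, origin = root_to_tip_regression([2000, 2002, 2004, 2006], [10, 14, 18, 22])
    print(rate, origin)
```

The check for a positive slope matters in practice: if later isolates are no more divergent than earlier ones, the data carry no usable temporal signal and any date estimate would be meaningless.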
Vaccination provides a means of disrupting transmission by removing susceptible hosts from the population. For example, immunity to specific capsule types responsible for pneumococcal infection is targeted by their inclusion in a multivalent vaccine. High-throughput sequencing studies provide clear evidence that capsule switching is occurring in pneumococcal populations in response to vaccination, which has implications for disease control and vaccine design [48, 49].
Viral illnesses have long been the target of successful vaccination programs. WGS analysis of rubella virus cases from the USA has confirmed that indigenous disease has been eradicated and that all the cases there are imported, with virus sequences matching those found elsewhere in the world.
To be useful to clinicians, whole-genome sequence data must be readily accessible in a portable, easily stored and searched, user-friendly format. However, data sharing even through established hospital informatics systems is a non-trivial task, particularly given the current diversity in sequencing platforms and analytical pipelines. Perhaps the answer here is to ensure the involvement of clinical academics with the relevant research credentials and accreditation to make clinical decisions, who might be best placed to pioneer the use of WGS data to manage outbreaks.
Whole-genome sequencing in outbreak investigations: opportunities and challenges
Provision of data on a timescale that allows clinical interventions
Costs now comparable to those of other clinically relevant expenditure (such as of antibiotic treatment or bed occupancy)
Use now comparable to that of other automated laboratory systems
Delivers far richer data than any previous method
Potential for open-ended one-size-fits-all culture-independent workflow
Chasing a moving target: difficult to devise stable and agreed standard operating procedures in the face of relentless technical innovation
Proof needed that WGS cost-effective across a range of clinical applications
Difficulties in predicting phenotype from genotype
Still sufficiently technically demanding to require input of skilled staff
Resistance to adoption of potentially disruptive technology
Provides portable, digital, library-based approach
Large datasets require significant hardware for storage and analysis
Need for standardized, robust, user-friendly analysis pipelines
Issues over data storage, ownership and presentation need to be resolved
Integration with healthcare informatics systems to allow easy communication with clinicians
WGS provides highest possible resolution
Potential to link pathogen discovery, biology and evolution with phylogeny and epidemiology to facilitate iterative hypothesis generation, testing and refinement
Need to move beyond SNP typing of draft genomes of colony-purified isolates to embrace full range of genome variation, including within-patient variation
Better integration with conventional epidemiology required to place data in context and evaluate hypothesized routes of transmissions
Acquiring clinical metadata often remains a bottleneck
There is still a need for improved speed, ease of use, accuracy and longer read lengths. However, given the ongoing, relentless improvements in performance and cost-effectiveness of high-throughput sequencing, it is likely that these financial and technical challenges will be met relatively easily over the coming years. Nonetheless, improvements in the analysis, archiving and sharing of WGS data need to occur before sequencing results can become trustworthy enough to guide clinical decision-making. Significant investment in establishing standards, databases and communication tools will be required to maximize the opportunities provided by WGS in epidemiology. There may also be organizational and ethical issues with data ownership and access.
Careful contextualization of WGS data will be needed before robust conclusions can be drawn, ideally within an agreed framework of standard operating procedures. Interpretation of genomic data requires a detailed knowledge of within-host and between-host genotypic diversity, whether defined at a single time point or longitudinally. Readings from the molecular clock provide the temporal information needed to reconstruct the emergence and evolution of lineages and transmission events within an outbreak. This means that extensive benchmarking will be needed to determine the rates of genomic change, which are likely to be species- and even lineage-specific. Only when WGS data have been obtained from a large number of epidemiologically linked and unlinked cases in a given lineage will it be possible to define cut-offs for the genomic differences that allow linked and unlinked cases to be accurately defined. This may also rely on comparisons with an 'outgroup', that is, a group of cases that clearly fall outside the outbreak cluster.
Estimates of rates of genetic change have been published for some organisms: for example, S. aureus mutates relatively rapidly, with 3 × 10⁻⁶ mutations per site per year, corresponding to 8.4 SNPs per genome per year [3, 39], whereas M. tuberculosis evolves slowly, acquiring only 0.5 SNPs per genome per year [27, 53–55]. However, such data are available for only a very limited number of other pathogens. This will need to be expanded significantly before routine use of WGS data becomes a reality. We suspect that there may be consistent differences in the mode and rate of genotypic change between organisms for which an asymptomatic carrier state (for example C. difficile) or a latent period (M. tuberculosis) exists and those, such as measles, for which there is no carrier state.
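The S. aureus figures quoted above are related by a simple multiplication: a rate per nucleotide site, multiplied by the length of the genome, gives the expected number of SNPs per genome per year. The Python sketch below works through that arithmetic. The genome length of 2.8 million base pairs is an assumption, chosen because it is the value implied by the two quoted figures; the function names are hypothetical.

```python
def snps_per_genome_per_year(rate_per_site_per_year: float, genome_length_bp: int) -> float:
    """Convert a per-site substitution rate into expected SNPs per genome per year."""
    return rate_per_site_per_year * genome_length_bp


def expected_pairwise_snps(per_genome_rate: float, years_since_common_ancestor: float) -> float:
    """Expected SNP difference between two isolates sampled at the same time.

    Both lineages accumulate changes independently after they diverge,
    hence the factor of two.
    """
    return 2 * per_genome_rate * years_since_common_ancestor


if __name__ == "__main__":
    # Assumed S. aureus genome length (bp), implied by the figures in the text.
    s_aureus = snps_per_genome_per_year(3e-6, 2_800_000)
    print(round(s_aureus, 1))
    # The M. tuberculosis per-genome rate is taken directly from the text.
    print(expected_pairwise_snps(0.5, 3))
```

The contrast between the two organisms shows why cut-offs must be species-specific: over the same three years, a pair of S. aureus isolates would be expected to differ by roughly fifty SNPs, whereas a pair of M. tuberculosis isolates would differ by about three.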
In conclusion, it is clear that WGS is already transforming the practice of outbreak investigation. However, the dizzyingly fast pace of change in this field, with steady improvements in high-throughput sequencing, makes predictions about the future difficult, particularly now that nanopore sequencing technologies are poised to deliver a revolution in our ability to sequence macromolecules in clinical samples (not just DNA, but also RNA and even proteins) [56, 57]. Portable nanopore technologies might provide a route to real-time near-patient testing and environmental sampling, as well as delivering a combined read-out of genotype and phenotype in bacterial cells (perhaps even allowing direct detection of the expression of resistance determinants). It also seems likely that clinical diagnostic metagenomics, perhaps equipped with target-specific enhancements such as sorting or capture of cells or DNA, will deliver improved genomic epidemiological information, including insights into within-patient pathogen population genetics and identification and typing of non-culturable or difficult-to-culture organisms.
One thing is certain: the future of bacterial outbreak investigation will rely on a new paradigm of genomics and metagenomics. Therefore, it is up to all clinical and epidemiological researchers to embrace the opportunities and meet the challenges of this new way of working.
MLST: multilocus sequence typing
MRSA: methicillin-resistant Staphylococcus aureus
STEC: Shiga-toxin-producing Escherichia coli
TMW is an MRC Research Training Fellow.