Genomics and outbreak investigation: from sequence to consequence

Outbreaks: definition and classification
Outbreaks of infection can be devastating for individuals and societies. In medieval times, the Black Death killed up to a third of the inhabitants of Europe [1]. More recently, an outbreak of Shiga-toxin-producing Escherichia coli (STEC) struck Germany in May-June 2011, resulting in over 3,000 cases and over 50 deaths, and provided ample evidence of the harrowing effects of bacterial infection on a modern, industrialized society [2,3]. In its loosest sense, the term 'outbreak' can be used to refer to any increase in the incidence of a given infection, which can occur in response to local, societal or environmental changes: for example, one might see an increase in the prevalence of staphylococcal wound infections when hospital ward or operating theatre cleaning procedures change, or when there are changes in the use of antibiotics. However, in the strictest sense (which we adopt here), the term implies a series of infections caused by indistinguishable or closely linked isolates, which are sufficiently similar to justify talking about 'an outbreak strain'. Such outbreaks can range in size from a few individuals, for instance in a family outbreak or an outbreak on a hospital ward, to epidemics that rage across countries or continents.
Investigation of a suspected outbreak has two aims: termination of the cluster of disease and prevention of similar occurrences by understanding how such outbreaks originate. A key question surfaces at the start of any such investigation: is one really seeing an outbreak in the strictest sense, caused by a single strain, or is one merely seeing an increased incidence of infection, involving multiple unrelated strains? The answer to this question is of more than academic interest, as it dictates how the finite resources available for infection control are best deployed. For example, evidence of cross infection with a single methicillin-resistant Staphylococcus aureus (MRSA) strain on a ward might prompt an aggressive strategy of patient isolation and decolonization, whereas an increase in infections caused by diverse staphylococcal strains (presumably each derived from the patient's own microbiota) might prompt a look at policies for wound care or antibiotic usage. Similarly, identification and characterization of an outbreak strain, or the discovery of its source or mode of transmission, influences the behavior of the infection control team: potential responses include removal of the source, interruption of transmission or strengthening of host defenses.
In the past decade, many different kinds of outbreaks have hit the headlines (Table 1), with concern focused on the spread of multidrug-resistant strains in hospitals (such as MRSA) [4] or in the community (such as multidrug-resistant tuberculosis [5]); the threat of bioterrorism [6]; and 'emerging infections', caused by newly discovered pathogens, such as severe acute respiratory syndrome (SARS) or infection with the novel coronavirus 2012 (HCoV-EMC/2012) [7,8], or by novel variants of previously recognized species or strains, such as STEC O104:H4 [2,3]. Outbreaks are often linked to social factors, including mass travel, migration, conflict or societal breakdown, or to environmental threats, such as earthquakes or floods. They can arise from exposure to a common source in the environment (for example, legionellosis arising from a water source); when the period of exposure is brief, these events are termed 'point-source outbreaks'. Alternatively, outbreaks can be propagated by human-to-human spread or, in the case of zoonoses, such as swine or bird flu, can result from the spread to humans from animal reservoirs. Outbreaks can also be classified according to context, for example whether they occur in the community or in healthcare settings, or according to the mode of transmission, for example food-borne, waterborne, airborne or vector-borne.
Here, we examine the applications of new high-throughput sequencing approaches to the identification and characterization of outbreaks, focusing on the application of whole-genome sequencing (WGS) to outbreaks of bacterial infection. We describe how traditional epidemiological analysis works and show how WGS can be informative at multiple steps in outbreak investigation.

Epidemiological typing: progress and problems
Although traditional epidemiology can often track down the source of an outbreak (for example, a case-control study can identify the foodstuff responsible for a food-poisoning outbreak [9,10]), for several decades laboratory investigations have also had an important role in outbreak investigation and management [11]. Thus, when suspicion of an outbreak has been raised on clinical or epidemiological grounds, the laboratory can provide evidence to confirm or dismiss a common microbial cause. Alternatively, an increase in laboratory reports of a given pathogen may provide the first evidence that an outbreak is under way.
However, in addition to providing diagnostic information, the laboratory also offers epidemiological typing, which provides an assessment of how closely cases are related to each other. In broad terms, this means classifying isolates as unrelated (not part of an outbreak) or sufficiently closely related (in extremis, indistinguishable) to represent epidemic transmission.
Epidemiological typing requires the identification of stable distinguishing characteristics. Initially, this relied on analyses of useful phenotypic features (such as serological profiles, growth characteristics or susceptibilities to bacteriophage or antimicrobial agents) [11]. However, the arrival of molecular biology in general and specifically of the polymerase chain reaction (PCR) led to a profusion of genotypic approaches, largely documenting differences in patterns of bands seen on gels: examples include pulsed-field gel electrophoresis, ribotyping, variable-number tandem repeat typing, random amplification of polymorphic DNA, arbitrarily primed PCR and repetitive-element PCR [11]. This riotous proliferation of genotypic typing methods, often with complex and non-standardized workflows, led Achtman in the late 1990s to coin the phrase YATM for 'yet another typing method' [12] and to pioneer, with others, the adoption of sequence-based approaches, notably multilocus sequence typing (MLST) [13]. In this approach, differences in stretches of DNA sequence from conserved housekeeping genes are used to assign bacterial isolates to sequence types, which, in turn, often fall into larger clonal complexes. Sequence-based approaches bring the advantage of portability; in other words, results from one laboratory can be easily compared with those from others around the world. In addition, archiving of information in national or international datasets allows isolates and outbreaks to be placed in the wider context of pathogen population structure.
Yet, despite the advantages of sequence-based typing, drawbacks remain. For example, there is a lack of standardization, as evidenced by the existence of multiple MLST databases and even multiple competing MLST schemes for the same species [14,15]. In addition, costs and complex workflows mean that most pathogen typing is performed in batch mode, retrospectively, in reference laboratories that struggle to provide data with real-time impact; one possible exception is the near-real-time typing of Mycobacterium tuberculosis isolates in the UK [16]. Approaches such as MLST also lack the resolution needed to reconstruct chains of transmission within outbreaks, tending instead to lump all isolates from an outbreak together as 'indistinguishable' members of the same sequence type.
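The logic of MLST, and its resolution limit, can be illustrated with a minimal sketch. The locus names, allele numbers and sequence-type table below are invented for illustration and do not correspond to any real MLST scheme or database.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme; locus names are placeholders.
LOCI = ("locA", "locB", "locC", "locD", "locE", "locF", "locG")

# Hypothetical lookup table: allele profile -> sequence type (ST).
PROFILE_TO_ST: Dict[Tuple[int, ...], int] = {
    (1, 4, 1, 4, 12, 1, 10): 5,
    (2, 2, 2, 2, 6, 3, 2): 30,
}


def assign_sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for an allele profile, or None if the profile is unknown."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILE_TO_ST.get(profile)


# Two isolates from a suspected outbreak with identical allele profiles.
isolate_1 = dict(zip(LOCI, (1, 4, 1, 4, 12, 1, 10)))
isolate_2 = dict(zip(LOCI, (1, 4, 1, 4, 12, 1, 10)))
# An isolate whose profile is absent from the (hypothetical) table.
isolate_3 = dict(zip(LOCI, (9, 9, 9, 9, 9, 9, 9)))

st_1 = assign_sequence_type(isolate_1)
st_2 = assign_sequence_type(isolate_2)
st_3 = assign_sequence_type(isolate_3)
print(st_1, st_2, st_3)
```

Because only a handful of loci are compared, isolate_1 and isolate_2 receive the same sequence type however many differences exist elsewhere in their genomes, which is the lumping effect described above.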

The promise of whole-genome sequencing
WGS promises to deliver the ultimate high-resolution genotypic typing method [17-20]. Although we recognize that virologists pioneered the use of WGS for pathogen typing, targeting genomes small enough for WGS with traditional Sanger sequencing [21], here we will concentrate on the application of WGS to outbreaks of bacterial infection, catalyzed by the recent arrival in the marketplace of a range of technologies that fall under the umbrella term 'high-throughput sequencing' (sometimes called 'next-generation sequencing') [22,23].
High-throughput sequencing, especially with the arrival of bench-top sequencers [24,25], brings methodologies for bacterial WGS that are simple, quick and cheap enough to fall within the remit of an average-sized clinical or research laboratory. Through a single unified workflow, it becomes possible to identify all the features of interest of a bacterial isolate, speeding up the detection and investigation of outbreaks and delivering data in a portable digital format that can be shared internationally.
By delivering a definitive catalog of genetic polymorphisms (especially single-nucleotide polymorphisms or SNPs), WGS delivers far greater resolution than traditional methods. For instance, whereas MLST identified only a single sequence type for a collection of MRSA isolates, WGS identified several distinct clusters [26]. Two recent studies of tuberculosis transmission have shown that the resolution of WGS with SNP typing is much higher than that provided by the previous 'gold standard' typing method, mycobacterial interspersed repetitive unit variable number tandem repeat (MIRU-VNTR) typing [27,28]. WGS also links epidemiology to pathogen biology, delivering unprecedented insights into genome evolution, genome structure and gene content, including information on clinically important markers, such as resistance and virulence genes [11] (Figure 1).
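As a rough illustration of SNP-based resolution, the sketch below counts pairwise SNP differences between isolates. It assumes the genomes have already been aligned to a common reference so that all sequences have the same length; the isolate names and toy sequences are invented, and real pipelines involve read mapping, variant calling and filtering steps that are not shown here.

```python
from itertools import combinations
from typing import Dict, Tuple

VALID_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences carry different bases.

    Positions with an ambiguous or missing call (anything other than
    A, C, G or T) in either sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def pairwise_snp_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return the SNP distance for every pair of isolates in the alignment."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Toy alignment (hypothetical isolates, 10 aligned positions).
alignment = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGNACGTAC",  # one uncalled position, otherwise identical
    "isolate_3": "ACGAACGTTC",  # differs from isolate_1 at two positions
}
distances = pairwise_snp_distances(alignment)
for pair, d in distances.items():
    print(pair, d)
```

Isolates that would share a single sequence type under MLST can thus be separated, or shown to be identical, by a genome-wide count of differences.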

Applications of genome sequencing in outbreak investigation
Traditional outbreak investigation can be divided into discrete steps, although these often overlap. WGS has the potential to contribute to each of these steps (Table 2).

Confirming the existence of an outbreak
When pathogens are endemic, for example, MRSA or Clostridium difficile in healthcare facilities, it can be difficult to decide whether one or more outbreaks are under way or whether there has simply been a general rise in the incidence of infection. Eyre and colleagues [25] showed that bench-top sequencing of whole bacterial genomes could be used in near real time to confirm or refute the existence of outbreaks of MRSA or C. difficile in an acute hospital setting. In particular, they found that the genome sequences from an apparent cluster of C. difficile infections turned out to be unrelated and so did not represent an outbreak sensu stricto [25].
Metagenomics, that is, wholesale sequencing of DNA extracted from complex microbial communities, provides an exciting new approach to the identification and characterization of outbreak strains that does away with the need for laboratory culture or for target-specific capture, amplification or enrichment of pathogens or their sequences. This approach has been used to identify the causes of outbreaks of viral infection [29]. Most recently, diagnostic metagenomics has been applied to stool samples collected during the German outbreak of STEC O104:H4, allowing recovery of draft genomes from the outbreak strain and several other pathogens and showing the applicability of diagnostic metagenomics to bacterial infections [30].

Case definition
Case definition within an outbreak usually involves a combination of clinical and laboratory criteria; for instance, a complex of symptoms and an associated organism. This definition can then be used for active case finding to identify additional patients in the cluster. During the German STEC outbreak, rapid genome sequencing together with crowd-sourced bioinformatics analyses led to the development of a set of diagnostic reagents that could then be used in defining cases within the outbreak [3]. Similarly, during new outbreaks of viral infection, genome-scale sequencing can act as a precursor to the development of simpler specific tests that can be used in case definition [31,32].

Descriptive study
During this phase of outbreak investigation, inferences from sequence data (such as on phylogeny, transmissibility, virulence or resistance) can be integrated with clinical and environmental metadata (such as geographical, temporal or anatomical data) to generate hypotheses and build and test models. For example, in a landmark study, Baker and colleagues [33] combined high-resolution genotyping and geospatial analysis to uncover the modes of transmission of endemic typhoid fever in an urban setting in Nepal.
During this phase of hypothesis generation, it may be possible to infer hidden transmission events. For instance, when faced with the recurrence of a strain of C. difficile in a hospital after more than 3 years of absence, Eyre and colleagues [25] concluded that unsuspected community transmission of C. difficile was the most likely explanation for their observations. They also noted that most of their C. difficile cases were unrelated to other recent cases in the hospital, from which they concluded that their hospital infection control policies were working as well as they could and that further reductions in the incidence of C. difficile infections would have to rely on additional and different interventions.
In some cases, it may be possible to hypothesize what determinants underlie the success of an outbreak strain. For example, the sasX gene (a mobile genetic element-encoded gene involved in nasal colonization and pathogenesis) appeared to be a key determinant of the successful spread of MRSA in China [34], and genes for the Panton-Valentine toxin were hypothesized to contribute to the spread of a novel MRSA genotype that caused an outbreak in a British special care baby unit [26].
Prediction of resistance phenotype from genotype has been applied routinely for years to viral pathogens such as human immunodeficiency virus, for which the cataloguing of resistance mutations in a publicly accessible database has greatly strengthened the utility of the approach [35]. Data are accumulating from S. aureus [36] and from E. coli strains that produce extended-spectrum beta-lactamases showing that WGS can be used to predict the resistance phenotype in bacteria (Nicole Stoesser, Department of Microbiology, John Radcliffe […]). Host factors associated with disease may also be identified during data collection. Increasingly, whole-genome sequences of humans are available and being used to study population genetic risks for diseases, as reviewed recently by Chapman and Hill [37].
Figure 1. Whole-genome sequencing delivers high-resolution typing and insights into pathogen biology. In this hypothetical example, the two large ovals represent sets of isolates (small ovals) that have been assigned to genotypes using conventional laboratory typing. Clouds indicate clusters within those genotypes built using epidemiological data. Whole-genome sequencing provides a more detailed view of pathogen epidemiology, revealing previously unseen links (red lines) between genome-sequenced isolates (filled small ovals) within and between genotypes. Whole-genome sequencing also provides insights into pathogen biology, including the factors associated with virulence (represented here by toxin gene X) and drug resistance (represented here by resistance gene Y).

Analysis and hypothesis testing
During this stage, there is often a series of iterative refinements to assumptions and models. For example, in a detailed retrospective analysis of tuberculosis cases in the English Midlands, Walker and colleagues [27] first documented the diversity of M. tuberculosis genotypes in their collection and then explored how the patterns of genome diversity were reflected in contemporaneous and serial isolates from individual patients and among isolates from household outbreaks. This allowed them to define cut-offs in the number of SNPs that could be used to rule isolates in or out of a recent transmission event. In some instances, they could then allocate cases to clusters in which a link had been suspected, but had not been proven, by conventional epidemiological methods. In other cases, where a link had been suspected on grounds of ethnicity, they were able to exclude recent transmission within the West Midlands region.
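One simple way to apply a SNP cut-off of this kind is to group isolates whose pairwise distance falls at or below the threshold. The sketch below uses single-linkage grouping with a union-find structure; this is an illustrative choice rather than the method used in the study, and the threshold, isolate names and distances are placeholders, not published values.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_by_snp_threshold(
    isolates: Iterable[str],
    distances: Dict[Tuple[str, str], int],
    max_snps: int,
) -> List[List[str]]:
    """Group isolates by single linkage: any pair within max_snps is joined."""
    parent = {name: name for name in isolates}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= max_snps:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in parent:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise SNP distances between four patient isolates.
example_distances = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 3,
    ("A", "D"): 150, ("B", "D"): 148, ("C", "D"): 151,
}
HYPOTHETICAL_CUTOFF = 5  # placeholder value, to be calibrated per species
clusters = cluster_by_snp_threshold("ABCD", example_distances, HYPOTHETICAL_CUTOFF)
print(clusters)
```

With these invented numbers, isolates A, B and C are compatible with recent transmission, whereas D is ruled out. In practice the cut-off has to be calibrated against within-patient and household diversity, as described above.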
Outbreaks of meningococcal disease caused by serogroup C have largely been eradicated in the UK by vaccination. However, a retrospective genomic analysis of strains from a meningococcal outbreak allowed chains of transmission to be identified [38]. This study pioneered the automated comparison of WGS data using a new public database, the Bacterial Isolate Genome Sequence Database (BIGSdb) [39]; the development of this kind of user-friendly, open-access tool is likely to underpin the adoption of WGS in epidemiological investigations in a clinical and public health environment.
Relatedness between isolates within an outbreak (and more widely) is often assessed by the construction of a phylogenetic tree [40]. Such phylogenetic inferences can enable the identification of sources or reservoirs of infection: examples include the acquisition of leprosy by humans from wild armadillos and the acquisition of Mycobacterium bovis in cattle from sympatric badger populations [41,42]. Integration of phylogeny with geography has allowed the origins and spread of pandemics and epidemics to be traced, including the Yersinia pestis pandemic [43] and, controversially, the 2010 cholera outbreak in Haiti, which has been traced to Nepalese peacekeepers [44]. Molecular phylogenies also make it possible to look back over years, decades, even centuries. For example, He and colleagues [45] showed that two distinct strains of fluoroquinolone-resistant C. difficile 027 emerged in the USA in 1993 to 1994, and that these showed different patterns of global spread. Genomic information, together with estimates from the sequence data of the time since isolates had diverged ('molecular clock' estimates) allowed them to reconstruct detailed routes of transmission within the UK. Similar studies have revealed patterns of the global spread of cholera, Shigella sonnei and MRSA [36,46,47].
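The arithmetic behind a 'molecular clock' estimate can be sketched under strong simplifying assumptions: a strict clock (a constant substitution rate), two isolates sampled at the same time, and mutations accumulating independently along both lineages since their common ancestor. Published analyses typically use more elaborate phylogenetic models, so the function below is only a back-of-envelope illustration, and the example numbers are hypothetical.

```python
def years_since_common_ancestor(
    snp_difference: int, snps_per_genome_per_year: float
) -> float:
    """Estimate time to the most recent common ancestor of two isolates.

    Under a strict clock, both lineages accumulate SNPs at the same rate,
    so the expected pairwise difference after t years is 2 * rate * t.
    """
    if snps_per_genome_per_year <= 0:
        raise ValueError("rate must be positive")
    if snp_difference < 0:
        raise ValueError("SNP difference cannot be negative")
    return snp_difference / (2.0 * snps_per_genome_per_year)


# Hypothetical example: two isolates differing by 10 SNPs in an organism
# that accumulates 0.5 SNPs per genome per year.
estimate = years_since_common_ancestor(10, 0.5)
print(estimate)
```

The factor of two reflects the fact that differences accumulate along both branches leading from the common ancestor to the two sampled isolates.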

Institution and verification of control measures
Vaccination provides a means of disrupting transmission by removing susceptible hosts from the population. For example, immunity to specific capsule types responsible for pneumococcal infection is targeted by their inclusion in a multivalent vaccine. High-throughput sequencing studies provide clear evidence that capsule switching is occurring in pneumococcal populations in response to vaccination, which has implications for disease control and vaccine design [48,49]. Viral illnesses have long been the target of successful vaccination programs. WGS analysis of rubella virus cases from the USA has confirmed that indigenous disease has been eradicated and that all the cases there are imported, with virus sequences matching those found elsewhere in the world [50].

Table 2. How whole-genome sequencing contributes to each step in outbreak investigation
Step | Contribution of whole-genome sequencing (WGS) | References
Confirming the existence of an outbreak | Bench-top sequencing of whole bacterial genomes in near real time to confirm or refute the existence of outbreaks of MRSA or C. difficile | [25]
Confirming the existence of an outbreak | Open-ended diagnostic metagenomics to identify and characterize outbreak strain | [30]
Case definition | WGS and/or metagenomics leads to the development of diagnostic reagents then used in defining cases within an outbreak | [3,31,32]
Descriptive study: collecting data and generating hypotheses | Integration of WGS with geographical data to uncover modes of spread of typhoid | [38]
Descriptive study: collecting data and generating hypotheses | Reconstruction of routes of transmission, including hidden transmission events | [25,45,59,60]
Descriptive study: collecting data and generating hypotheses | Identification of virulence factors and antimicrobial resistance | [26,34,36]
Analysis and hypothesis testing | Iterative refinements to assumptions and models | [25,27,36,41-47]
Institution and verification of control measures | Documenting effects of vaccination on pathogen populations | [48,49]
Institution and verification of control measures | Confirmation that infections are imported rather than locally transmitted | [25,27,50]
Communication | Need for user-friendly digital output easily transferred between laboratories and expert advice of clinical academics at home in research and clinical environments |

Communication
To be useful to clinicians, whole-genome sequence data must be readily accessible in a portable, easily stored and searched, user-friendly format. However, data sharing even through established hospital informatics systems is a non-trivial task, particularly given the current diversity in sequencing platforms and analytical pipelines. Perhaps the answer here is to ensure the involvement of clinical academics with the relevant research credentials and accreditation to make clinical decisions, who might be best placed to pioneer the use of WGS data to manage outbreaks.

Conclusions and future perspectives
As we have seen, there is now ample evidence that WGS can make a significant contribution to the investigation of outbreaks of bacterial infection. It is therefore safe to conclude that once WGS has been integrated with epidemiological investigation, diagnostic assays and antimicrobial susceptibility testing, we will soon see large changes in the practice of clinical microbiology and infectious disease epidemiology. Nonetheless, several challenges remain before WGS can be routinely used in clinical practice (Table 3).
There is still a need for improved speed, ease of use, accuracy and longer read lengths. However, given the ongoing, relentless improvements in performance and cost-effectiveness of high-throughput sequencing, it is likely that these financial and technical challenges will be met relatively easily over the coming years [51]. Nonetheless, improvements in the analysis, archiving and sharing of WGS data need to occur before sequencing results can become trustworthy enough to guide clinical decision-making. Significant investment in establishing standards, databases and communication tools will be required to maximize the opportunities provided by WGS in epidemiology. There may also be organizational and ethical issues with data ownership and access [52].
Careful contextualization of WGS data will be needed before robust conclusions can be drawn, ideally within an agreed framework of standard operating procedures. Interpretation of genomic data requires a detailed knowledge of within-host and between-host genotypic diversity, whether defined at a single time point or longitudinally. Readings from the molecular clock provide the temporal information needed to reconstruct the emergence and evolution of lineages and transmission events within an outbreak. This means that extensive benchmarking will be needed to determine the rates of genomic change, which are likely to be species- and even lineage-specific. Only when WGS data have been obtained from a large number of epidemiologically linked and unlinked cases in a given lineage will it be possible to define cut-offs for the genomic differences that allow linked and unlinked cases to be accurately defined. This may also rely on comparisons with an 'outgroup', that is, a group of cases that clearly fall outside the outbreak cluster.
Estimates of rates of genetic change have been published for some organisms: for example, S. aureus mutates relatively rapidly, at about 3 × 10⁻⁶ mutations per site per year, corresponding to 8.4 SNPs per genome per year [3,39], whereas M. tuberculosis evolves slowly, acquiring only 0.5 SNPs per genome per year [27,53-55]. However, such data are available for only a very limited number of other pathogens. This evidence base will need to be expanded significantly before routine use of WGS data becomes a reality. We suspect that there may be consistent differences in the mode and rate of genotypic change between organisms for which an asymptomatic carrier state (for example C. difficile) or a latent period (M. tuberculosis) exists and those, such as measles, for which there is no carrier state.
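The relationship between a per-site rate and a per-genome rate is simple arithmetic, sketched below. The sketch assumes that the quoted figure of 3 × 10⁻⁶ is a rate per nucleotide site per year, and uses approximate genome lengths of 2.8 Mb for S. aureus and 4.4 Mb for M. tuberculosis; these genome lengths are assumptions introduced here for illustration and are not stated in the text.

```python
def snps_per_genome_per_year(rate_per_site_per_year: float, genome_length_bp: int) -> float:
    """Convert a per-site substitution rate into an expected per-genome rate."""
    return rate_per_site_per_year * genome_length_bp


def rate_per_site_per_year(snps_per_genome: float, genome_length_bp: int) -> float:
    """Convert a per-genome rate back into a per-site rate."""
    return snps_per_genome / genome_length_bp


# Assumed approximate genome lengths (not taken from the text).
S_AUREUS_GENOME_BP = 2_800_000
M_TUBERCULOSIS_GENOME_BP = 4_400_000

s_aureus_rate = snps_per_genome_per_year(3e-6, S_AUREUS_GENOME_BP)
m_tb_site_rate = rate_per_site_per_year(0.5, M_TUBERCULOSIS_GENOME_BP)

print(round(s_aureus_rate, 1))   # per-genome rate for S. aureus
print(f"{m_tb_site_rate:.1e}")   # per-site rate implied for M. tuberculosis
```

Under these assumptions the per-genome figure for S. aureus comes out at about 8.4 SNPs per year, in line with the value quoted above, and the per-site rate implied for M. tuberculosis is more than an order of magnitude lower.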
In conclusion, it is clear that WGS is already transforming the practice of outbreak investigation. However, the dizzyingly fast pace of change in this field, with steady improvements in high-throughput sequencing, makes predictions about the future difficult, particularly now that nanopore sequencing technologies are poised to deliver a revolution in our ability to sequence macromolecules in clinical samples (not just DNA, but also RNA and even proteins) [56,57]. Portable nanopore technologies might provide a route to real-time near-patient testing and environmental sampling, as well as delivering a combined read-out of genotype and phenotype in bacterial cells (perhaps even allowing direct detection of the expression of resistance determinants). It also seems likely that clinical diagnostic metagenomics [30], perhaps equipped with target-specific enhancements such as sorting or capture of cells or DNA, will deliver improved genomic epidemiological information, including insights into within-patient pathogen population genetics and identification and typing of non-culturable or difficult-to-culture organisms.
One thing is certain: the future of bacterial outbreak investigation will rely on a new paradigm of genomics and metagenomics. Therefore, it is up to all clinical and epidemiological researchers to embrace the opportunities and meet the challenges of this new way of working.