Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies

Background Tuberculosis, caused by bacteria in the Mycobacterium tuberculosis complex (MTBC), is a major global public health burden. Strain-specific genomic diversity in the known lineages of MTBC is an important factor in pathogenesis that may affect virulence, transmissibility, host response and emergence of drug resistance. Fast and accurate tracking of MTBC strains is therefore crucial for infection control, and our previous work developed a 62-single nucleotide polymorphism (SNP) barcode to inform on the phylogenetic identity of 7 human lineages and 64 sub-lineages. Methods To update this barcode, we analysed whole genome sequencing data from 35,298 MTBC isolates (~ 1 million SNPs) covering 9 main lineages and 3 similar animal-related species (M. tuberculosis var. bovis, M. tuberculosis var. caprae and M. tuberculosis var. orygis). The data was partitioned into training (N = 17,903, 50.7%) and test (N = 17,395, 49.3%) sets and were analysed using an integrated phylogenetic tree and population differentiation (FST) statistical approach. Results By constructing a phylogenetic tree on the training MTBC isolates, we characterised 90 lineages or sub-lineages or species, of which 30 are new, and identified 421 robust barcoding mutations, of which a minimal set of 90 was selected that included 20 markers from the 62-SNP barcode. The barcoding SNPs (90 and 421) discriminated perfectly the 86 MTBC isolate (sub-)lineages in the test set and could accurately reconstruct the clades across the combined 35k samples. Conclusions The validated 90 SNPs can be used for the rapid diagnosis and tracking of MTBC strains to assist public health surveillance and control. To facilitate this, the SNP markers have now been incorporated into the TB-Profiler informatics platform (https://github.com/jodyphelan/TBProfiler).


Background
Tuberculosis, caused by bacteria in the Mycobacterium tuberculosis complex (MTBC), is a major global burden causing approximately ten million active cases and killing 1.5 million people in 2018 (www.who.int/tb). The MTBC consists of Mycobacterium tuberculosis sensu stricto (Mtb) (lineages 1, 2, 3, 4 and 7) and M. tuberculosis var. africanum (lineages 5 and 6; M. africanum), which cause human disease, but others including M. tuberculosis var. bovis affect predominantly animals [1]. Recently, new Mtb lineages (8,9) have been proposed [2,3]. The MTBC lineages vary in their geographic distribution and spread, being endemic in different locations around the globe, leading to the hypothesis that the strain types are specifically adapted to different human populations [4]. Lineage 2 is particularly mobile with evidence of recent spread from Asia to Europe and Africa. Lineage 4 is common in Europe and southern Africa, with regions of high TB incidence and high levels of HIV co-infection, whilst lineages 5, 6 and 7 appear isolated within West Africa and Ethiopia, respectively [1].
There is some evidence to suggest that MTBC lineages can determine the transmission, control, and clinical outcome of pulmonary and extra-pulmonary tuberculosis. In particular, variational phenotypes include differences in the emergence of drug resistance, transmissibility, virulence, host response, disease site and severity [5,6]. Such phenotypes confer advantages for those MTBC lineages and may lead to an increased likelihood of disease spread and poorer prognosis for patients. Whether increased virulence is associated with poorer prognosis is unclear, with some studies reporting increased mortality risk with strains thought to be less virulent [7]. Of particular concern are the emergence of drug-resistant, multidrug-resistant (MDR-TB) and extensively drug-resistant (XDR-TB) strains, where Beijing strains show strong linear-resistance associations [8]. However, there is considerable inter-strain variation within lineages. For example, when comparing two different Beijing sub-lineages, the "ancient" (atypical) and "modern" (typical) strains show differences in geographical distribution, drug resistance and virulence patterns [9]. In particular, the "modern" sub-lineage is distributed worldwide and has been largely associated with MDR-TB and XDR-TB and hypervirulence [9].
Tracking the spread of lineages is of great importance in tuberculosis research and control. Rapid lineage identification enables the analysis of phenotypic associations, informs on likely provenance and can assist in the prediction of potential future outbreaks. The molecular barcoding of lineages and sub-lineages can be used to classify clinical isolates to aid in the evaluation of tools to control the disease, including therapeutics and vaccines, whose effectiveness may vary by strain type [1,5]. Historically, strain identification has involved the genotyping of tandem repeats (e.g. spoligotypes) and large deletions (regions of difference (RDs)) [10], but these approaches are being replaced by methods analysing data from whole genome sequencing (WGS) technologies. These approaches include in silico spoligotyping and RD detection, the characterisation of lineage-associated single nucleotide polymorphisms (SNPs) and higher resolution methods such as core genome MLST [11]. SNP-based approaches can be applied in silico or implemented within a laboratory typing assay [12,13]. Although the SNP-defined lineages do not offer the same resolution as using the whole genome, they provide a valuable insight into the epidemiology of circulating strains. A 62-SNP barcode was developed using WGS data for 1601 MTBC isolates and was the first to position samples within clades of a global phylogeny of 7 human lineages and 64 sub-lineages, covering all common strain types [1].
Here  [14], which has been used to profile more than fifty thousand MTBC for strain types and drug resistance, and will thereby assist with barcode implementation for research and infection control activities.

Principal component analysis and phylogenetic tree
Distance matrices and the principal components of the multi-FASTA files were computed with Plink software (v1.90b4; https://www.cog-genomics.org/plink2) [23]. The distance matrices were used for the new cluster identification. Maximum likelihood phylogenetic trees were constructed from the multi-FASTA file using IQ-TREE (v1.6.12) (http://www.iqtree.org/) [24]. A general time reversible model with rate heterogeneity set to a discrete Gamma model and an ascertainment bias correction were used (parameters -m GTR+G+ASC), with 1000 bootstrap samples used to measure branch quality and robustness. Phylogenetic trees were generated for all MTBC isolates, as well as for each main lineage separately. The resulting Newick-formatted tree files were visualised and annotated with metadata in iTOL (v5.2; https://itol.embl.de/) [25]. These metadata included the 62-SNP barcode sub-lineage predictions [1], allowing for the rapid identification of outliers. By annotating the branches with ancestral mutations, it was possible to inform on SNP markers for barcoding.

Lineage revision and new sub-lineage identification
The visual inspection of the phylogenetic trees (and principal component analysis plots) revealed that some pre-existing (sub-)lineages (as defined using the 62-SNP barcode) could be merged or split, as well as new ones created. The original 62-SNP barcode was constructed to reflect the original strain-type families used by researchers based on spoligotypes and RDs. We sought to analyse the phylogenetic tree to further divide these clades where obvious splits in the phylogeny existed. To aid in old lineage revision and new lineage identification, phylogenetic trees relating to lineages 1 to 9 and animal strains were analysed using a semi-automated procedure. Each tree was traversed (and each clade inspected) from root to tip using the ETE3 Toolkit (v3.1.1) package in Python3 (http://etetoolkit.org/) [26]. We identified metrics and parameters such as branch bootstrap support values and intra/inter-cluster SNP distances to determine splits in the tree, which led to clusters that are separated by long branch lengths from other isolates. Whilst traversing, the following criteria had to be met to establish clades leading to new or revised sub-lineages: (1) a minimum clade size of 20, with a branch supported by a bootstrap value of > 95; (2) differences in the distributions of SNP distances where comparing the isolates within and outside the clade, using a Welch t test assuming unequal variances [27] (P < 0.05) and a Cohen's d effect size [28] (d > 0.5); (3) the ratio of the branch length of the clade compared to the mean branch length of its descendants (ratio > 1); (4) estimation of the number of clade-informative SNPs, requiring at least 10 SNPs with a fixation index (F ST ) [29] value of 1; (5) confirmation of the clade through visual inspection of the tree. Each of the parameter thresholds was based on established cut-offs or determined using standard point of inflection methods [1]. The population differentiation F ST statistic assigns a strength of association between each SNP and (sub-)lineage, with a score of 1 indicating that the SNP allele is fixed in the sub-lineage of interest and not present outside that group. Using the five criteria led to the addition of 87 (27 new) sub-lineages or lineages (including 8 and 9), or changing the branch position of established others (e.g. 1.2 and 1.1.1.1) (see Additional file 1: Fig. S1). The SNP-IT tool for identifying species in MTBC [30] was applied to the M. bovis, M. orygis and M. caprae isolates (N = 110; test set), and three barcoding SNPs were required for these mycobacteria. The overall number of (sub-)lineages or species covered was 90.

Barcoding SNPs
To ensure that the required 90 clade-specific mutations ("potential barcoding SNPs", all with F ST = 1) were robust, where possible, we retained synonymous SNPs in essential genes [31], and excluded those in drug resistance loci (from TB-Profiler [14]) and non-essential PE/PPE gene families [32]. From those retained "robust" SNPs (n = 421), a minimal set of one per lineage included preferentially those already present in the 62-SNP barcode [1] and, if not possible, (arbitrarily) the lowest position was chosen. The gene functional categories were extracted from Tuberculist (tuberculist.epfl.ch), and the frequency of ontologies across all potential barcoding, robust and minimal SNPs, was assessed for differences across lineage using the chi-squared tests.

Validation of lineage barcode
To validate the final set of robust 421 clade-defining SNPs (Additional file 1: Table S2), the 17,395 samples in the testing set (with 572,021 SNPs) were used. The (sub-)lineage of these samples was predicted with TB-Profiler [14]. At the same time, a phylogenetic tree was reconstructed of the training and test samples together using FastTree2 software [33]. To assess the sensitivity and specificity of the predictions, this tree was traversed in the ETE3 Toolkit, and test samples were examined for their presence in the clades defined by the training dataset.

MTBC isolates, SNPs and phylogeny
Across a total of 35,298 MTBC isolates with sequencing data, we identified 1,014,762 high-quality SNPs. The isolates represented all MTBC lineages (1-9), M. bovis, M. orygis and M. caprae, but the majority were from lineages 4 (51.6%), 2 (25.2%), 3 (11.1%) and 1 (9.5%), with the frequency of others being at most 1% (Additional file 1: Table S1). Whilst it is a convenience set of sampled isolates, the geographical distribution of the lineages was as expected, with lineage 2 dominating in Southeast Asia, lineages 1 and 3 predominant in South Asia, lineage 4 abundant in Europe, Americas and Africa and lineages 5 and 6 present in West Africa (Fig. 1) Table S1). A phylogenetic tree was constructed on the training isolates and confirmed the clustering by lineage and sub-lineages (Fig. 2). Similarly, a principal component analysis of the 35k isolates using the~1 million SNPs revealed the expected clustering by lineage or species (Additional file 1: Fig. S1(a)). Phylogenetic trees were constructed for each lineage separately and confirmed the sub-lineage and strain-type clustering (Additional file 1: Fig. S1(b)-(f)). However, by assessing the fine-scale clustering of sublineages predicted by the 62-SNP barcode, outlying samples were revealed and suggested a need for the repositioning of mutations underlying the clades or, alternatively, the creation of new sub-lineages that were on long branches (Additional file 1: Fig. S12(b, c)). In some cases, new sub-lineages reflected existing RD-or spoligotype-based strain classifications which were imperfectly or not captured using the 62-SNP barcode (see Additional file 1: Fig. S2 (d,e)).

Barcoding SNPs
By traversing the whole MTBC and lineage-based phylogenetic trees using a semi-automated algorithm, it was possible to modify sub-lineages within the flexible nomenclature structure of the previous barcode [1], as well as define clade-informative SNPs. The phylogenetic analyses characterised 27 additional (sub-)lineages covering lineages 1 (8), 3 (2), 4 (15), 8 (1) and 9 (1). The final number of (sub-)lineages in Mtb was 85 (L (ineage)1 16, L2 7, L3 7, L4 52, L7 1, L8 1, L9 1) and M. africanum was 2 (L5 1, L6 1) ( Table 1; Fig. 2), requiring 87 SNP markers. A further three SNP markers were required to   To find informative SNPs for each of the 90 MTBC clades, we used the population differentiation metric F ST to identify mutations that were only present in the isolates in the selected (sub-)lineage of interest (F ST = 1). We identified 8128 potential barcoding SNPs (with F ST = 1) across the 90 clades (Table 1). These barcoding SNPs were distributed evenly genome-wide, with no visible clustering of informative mutations for individual lineages (Additional file 1: Fig. S3). Of these SNPs, 7282 (89.6%) were in genic regions, with mutations leading to 4699 non-synonymous (NS) and 2564 synonymous (S) amino acid changes, as well as 20 changes in non-coding genes. By focusing on essential genes, 889 (10.9%) SNPs remained (499 NS, 390 S). Furthermore, variants in Bolded are changes from the barcode in reference [1]-either new sub-lineages or new barcoding SNPs; MDR-TB multidrug-resistant TB, which is resistant to at least rifampicin and isoniazid drugs. *All potential barcoding SNPs (F ST = 1). **Final robust SNP set, based on synonymous changes in essential and non-drug resistance genes only (except 12 sub-lineages which had no informative SNPs in essential genes; see Additional file 1: Table S2) Fig. 2 Phylogenetic tree of Mycobacterium tuberculosis complex isolates. A representative tree with a maximum of 10 isolates per sub-lineage (important regions of difference (RDs) are also highlighted) drug-resistance-associated genes were removed, leaving 824 SNPs (464 NS and 360 S mutations). Across all lineages, except lineages 8 (N = 2) and 9 (N = 3) which had small sample sizes, we compared the distribution of gene functions for all potential barcoding SNPs in all characterised genes (7060/7282 SNPs) with only those in essential (and non-drug resistance) loci (790/824 SNPs) (Additional file 1: Fig. S4). The distribution of gene function for all potential barcoding SNPs is similar across all lineages. However, after filtering for essential and nondrug-resistant genes, lineage 2 has a relatively high proportion of non-synonymous SNP mutations in cell wall and cell process genes, whilst for lineage 6, M. bovis, M. caprae and M. orygis, there are relatively higher proportions of non-synonymous SNP mutations in intermediary metabolism and pathway genes. For 11 (sub-)lineages, there were no potential barcoding SNPs lying within essential and non-drug resistance genes, so they were identified in non-essential and non-PE/PPE loci (Additional file 1: Table S3) (180 SNPs, 61 synonymous mutations). By considering only the SNPs with synonymous changes, similar to the selection strategy applied in [1], a total of 421 SNPs were considered suitable for barcoding the 90 (sub-)lineages (Table 1; Additional file 1: Table  S2). Of these, 20 SNPs represented (sub-)lineages in the 62-SNP barcode [1] and were therefore retained, leading to 70 new SNPs chosen for final (sub-)lineage classification (Additional file 1: Table S3). Across the 60 (sub-)lineages common to the 62-and 90-SNP barcodes, the 40 new SNPs had higher F ST values than those in the old barcode (Additional file 1: Fig. S5). Using the test set (N = 17,395) which had representation of 86 of the 90 (sub-)lineages, we found that the minimal set of 90 SNPs had perfect predictive performance for all clades (all sensitivities and specificities of value 1). This analysis excluded four (sub-)lineages (3.1.2.2, 4.6.2.1, 8 and 9), which had no test samples.

Comparisons to other software
The barcode was compared to lineage predictions from SNP-IT [30] software, a 27 strain-type system covering MTBC, including 6 animal lineages that are not present in our large dataset. First, we assessed the assigned major MTBC lineages (1-6) by both barcodes and found complete concordance. Second, we quantified how the increased number of strain types in our barcode (n = 90) improved the resolution of sub-lineage assignment over the SNP-IT tool. For 14 of the 21 SNP-IT strain types present in our data, the 90-SNP approach provides higher resolution of clades (range 2 to 15 sub-lineages per SNP-IT clade) (Additional file 1: Fig. S6). Six other strain types have direct mapping between our barcode and SNP-IT, and there is one instance where isolates classified as M. bovis with our barcode are further classified into M. bovis BCG and M. bovis bovis using SNP-IT.

Discussion
MTBC strain types and lineages are distributed phylogeographically and have been associated with differences in the emergence of drug resistance, transmissibility, virulence, host response, vaccine efficacy, disease site and severity [5,6,34]. However, further research into lineage, genotype-phenotype associations are required. Such research needs to be underpinned by molecular barcodes of MTBC (sub-)lineages, strain types and species. Here, we updated a 62-SNP barcode that forms a highly resolved phylogenetic identification system that determines 7 lineages, 64 sub-lineages and M. bovis, but was constructed using~1600 MTBC isolates with WGS data [1]. Using twenty-fold more MTBC isolates with WGS data, we identified and validated a set of 90 robust SNPs (of 421 alternatives) to cover a global phylogeny of 9 lineages, 87 sub-lineages, M. bovis, M. caprae and M. orygis. These SNPs can be used to construct highresolution and reproducible phylogenies, which can be incorporated within diagnostic assays and assess genotype-phenotype associations. By extending an established 62-SNP barcode system with a flexible nomenclature [1], it was possible to update and add seamlessly (sub-)lineages and species and in the future include potentially novel strain types should they be reported. Such modifications could involve inclusion of SNPs to barcode other MTBC animal lineages or partitioning of M. africanum lineages 5 and 6 into sublineages [3]. Further, incorporating drug resistance loci will further enhance the usefulness of the 90-SNP barcode as an important tool for tuberculosis control and elimination activities worldwide. To assist this, the 90-SNP variants have been incorporated into the publicly available TB-Profiler informatics tool [14], which predicts resistance to 14 anti-tuberculosis drugs from WGS data.
Our barcode development focused on SNPs, but future work could include other types of strain-specific polymorphisms (e.g. insertions, deletions and large structural variants), which are less common than SNPs, but may have major functional consequences. An analysis of the gene ontologies of the barcoding SNPs revealed some differences across lineages, but there is a need to the characterise functional effects of the lineage-specific SNP variants, as these could provide insights into disease control measures. Overall, we have provided an updated molecular barcode for MTBC strain types, with ninety robust markers that can be detected from applications of WGS or integrated within high-throughput genotyping or sequencing (e.g. amplicon) platforms to inform ongoing TB surveillance and control.

Conclusions
The use of molecular barcoding of MTBC bacteria causing tuberculosis can provide insights into outbreaks and help to reveal strain types that are more virulent and prone to drug resistance. In an analysis of 35,298 isolates from MTBC, we update an established 62-SNP barcode with a minimal set of 90 genetic markers, which now cover M. tuberculosis (7 lineages, 85 sub-lineages), M. africanum (2 lineages), M. bovis, M. caprae and M. orygis bacteria. The new barcode has been implemented within the publicly available TB-Profiler informatics tool, to assist the rapid, simple and reliable phylogenetic identification of individual MTBC isolates, thereby aiding clinical studies in the tracking, maintenance and phenotypic determination of MTBC pathogens.
Additional file 1: Table S1. The study samples (N=35,298) used and their lineages. Table S2. Robust barcoding SNPs (421 SNPs, including the 90 SNPs in Table S3). Table S3. The ninety minimal barcoding SNPs. Figure S1. Population structure of the Mycobacterium tuberculosis complex isolates by lineage. Figure S2. Examples of discrepancies using the 62-SNP barcode. Figure S3. The genome-wide distribution of barcoding SNPs (F ST =1) for each lineage. Figure S4. Functional differences between genes containing lineage-barcoding (F ST =1) SNPs. Figure S5. Differentiation of sub-lineages when comparing the 62-versus 90-SNP barcodes. Figure S6. The increased resolution of our 90-SNP barcode (implemented in TB-Profiler software) over the comparable (sub-)lineages of the SNP-IT tool.