GenTB: A user-friendly genome-based predictor for tuberculosis resistance powered by machine learning

Background Multidrug-resistant Mycobacterium tuberculosis (Mtb) is a significant global public health threat. Genotypic resistance prediction from Mtb DNA sequences offers an alternative to laboratory-based drug-susceptibility testing. User-friendly and accurate resistance prediction tools are needed to enable public health and clinical practitioners to rapidly diagnose resistance and inform treatment regimens. Results We present Translational Genomics platform for Tuberculosis (GenTB), a free and open web-based application to predict antibiotic resistance from next-generation sequence data. The user can choose between two potential predictors, a Random Forest (RF) classifier and a Wide and Deep Neural Network (WDNN) to predict phenotypic resistance to 13 and 10 anti-tuberculosis drugs, respectively. We benchmark GenTB’s predictive performance along with leading TB resistance prediction tools (Mykrobe and TB-Profiler) using a ground truth dataset of 20,408 isolates with laboratory-based drug susceptibility data. All four tools reliably predicted resistance to first-line tuberculosis drugs but had varying performance for second-line drugs. The mean sensitivities for GenTB-RF and GenTB-WDNN across the nine shared drugs were 77.6% (95% CI 76.6–78.5%) and 75.4% (95% CI 74.5–76.4%), respectively, and marginally higher than the sensitivities of TB-Profiler at 74.4% (95% CI 73.4–75.3%) and Mykrobe at 71.9% (95% CI 70.9–72.9%). The higher sensitivities were at an expense of ≤ 1.5% lower specificity: Mykrobe 97.6% (95% CI 97.5–97.7%), TB-Profiler 96.9% (95% CI 96.7 to 97.0%), GenTB-WDNN 96.2% (95% CI 96.0 to 96.4%), and GenTB-RF 96.1% (95% CI 96.0 to 96.3%). Averaged across the four tools, genotypic resistance sensitivity was 11% and 9% lower for isoniazid and rifampicin respectively, on isolates sequenced at low depth (< 10× across 95% of the genome) emphasizing the need to quality control input sequence data before prediction. We discuss differences between tools in reporting results to the user including variants underlying the resistance calls and any novel or indeterminate variants Conclusions GenTB is an easy-to-use online tool to rapidly and accurately predict resistance to anti-tuberculosis drugs. GenTB can be accessed online at https://gentb.hms.harvard.edu, and the source code is available at https://github.com/farhat-lab/gentb-site. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00953-4.

Keywords: Tuberculosis, Drug resistance, Drug-susceptibility testing, Diagnostics, Whole genome sequencing, Machine learning, MDR-TB, XDR-TB Background Human tuberculosis, a chronic infectious disease caused by members of the Mycobacterium tuberculosis complex, is a leading cause of death from a bacterial infectious agent [1]. The proliferation of multidrug-resistant tuberculosis (MDR-TB) is threatening TB prevention and control activities worldwide [1]. Timely detection of antimicrobial resistance is vital to guide therapeutic options and contain transmission. Antimicrobial resistance is conventionally determined by in vitro drug susceptibility tests (DST) on solid or liquid antibiotic-containing culture, which uses drug-specific testing breakpoints ("critical concentration") to classify the infecting strain into drugsusceptible or drug-resistant [2]. Being contingent on mycobacteria's slow growth rate, these phenotypic tests require days to weeks [3,4]. In contrast, molecular methods have emerged as rapid resistance prediction alternatives to complement and speed up traditional DST, leveraging known and reliable genotype-phenotype relationships between variants in the M. tuberculosis genome and in vitro drug resistance [5].
Over recent years, whole-genome sequencing (WGS) of M. tuberculosis has become an affordable tool to provide genetic information for genotypic resistance prediction and high-resolution outbreak reconstruction [6]. Large scale genotype-phenotype assessments have demonstrated high diagnostic accuracy for clinical use to predict susceptibility to first-line drugs based on WGS [7]. While some evidence suggests that WGS-based mycobacterial diagnostics is feasible with fast turnaround in a clinical research setting further validation studies under routine care conditions are warranted [3]. Following these results, public health authorities have begun to discontinue phenotypic testing when pan susceptibility is predicted from the genotype, a step with considerable cost and time benefits [8]. Start-to-end applications which analyze sequencing data to predict resistance phenotypes and are accessible to non-bioinformatic experts are required as WGS based analyses become part of the standardized diagnostic process in clinical laboratories. A range of published tools available for command-line [9,10] or web-based/desktop use [11][12][13] or both [14,15] exists. These applications vary in quality control and sequence preprocessing steps and rely on detecting pre-defined resistance-conferring mutations such as single nucleotide polymorphisms (SNPs) or small insertions/deletions (indels) in the WGS data to predict the resistance phenotype. They also vary in the type of information fed back to the user including error rates and specific variants detected.
Here, we present GenTB (https://gentb.hms.harvard. edu), an open user-friendly start-to-end application to predict drug resistance phenotypes to 13 drugs from WGS data. Resistance prediction is made based on a previously observed set of variant positions spanning 18 resistance-associated genetic loci and a validated random forest (RF) classifier [16] as well as a wide and deep neural network (WDNN) combining a logistic regression model with a multilayer perceptron [17]. GenTB provides access to multivariate statistical models of resistance to non-expert users. These models can consider the simultaneous effect of one or more mutations. GenTB's online interface allows users to interactively explore the sequencing data, prediction results and geographic distributions. The GenTB analysis pipeline is also available for command-line use wrapped in Snakemake [18]. In this study, we benchmark these two classification models implemented in GenTB along with two other tools with a command-line interface, TB-profiler [14], and Mykrobe [15], on a large dataset of > 20 k clinical M. tuberculosis isolates (Additional file 1) starting from raw Illumina sequence data.

Backend and website build
GenTB is a bespoke Django website hosted by the Harvard Medical School O2 high performance computing environment and collaboratively developed on GitHub (https://github.com/farhat-lab/gentb-site) [19]. The website uses off-the-shelf frontend components; Bootstrap for styling and mobile-friendly delivery, nvd3 for plots and graphs, resumable.js for robust uploading and supplements these with custom Javascript functionality for integration. The backend is a Python-Django web service using a PostgreSQL database which integrates with Dropbox for file uploading, and python-chore for slurm cluster job submission and management. GenTB predict jobs are run by modular programs organized into pipelines. The modularity allows for easy maintenance and management of dependencies and outputs. Administration screens allow a non-expert developer design new program calls and construct new pipelines and integrate them without redeployment of the website. Further tools provide error tracking. GenTB predict results are integrated into the PostgreSQL database allowing website generated plots to be populated quickly. All generated files for the intermediary pipeline steps are provided for download by the user. GenTB Map uses a PostGIS database to rapidly link strain mutation and lineage information with geo-spatial objects; these are fed into the leaflet.js display to render strain information to the user. Map allows users to display strain data groupings by country, lineage, drug resistance phenotype, or specific genetic mutation through tabs that can nest the groups in any order.

Raw read processing
Upon uploading single-end or paired-end FastQ files, GenTB first validates the input using fastQValidator (Fig. 1). Low-quality reads and sequencing adapters are then trimmed with fastp [20]. Kraken is used as a quality control step to assess the percentage of reads that map to the M. tuberculosis complex using a custom-built Kraken database comprising M. tuberculosis complex reference sequences [21]. Reads not classified as M. tuberculosis complex are filtered internally using seqtk (https://github.com/lh3/seqtk). Paired-read matching is performed with fastq-pair (https://github.com/linsalrob/ fastq-pair) followed by minimap2 alignment (parameters: -ax sr) of reads to the H37Rv reference genome (AL123456) [22]. In the present benchmarking we removed unclassified reads only for those isolates with > 10% non-M. tuberculosis reads. Samtools is used for sorting the aligned reads, removing duplicates, and indexing [23]. After read mapping is performed the coverage of drug resistance genes is confirmed (Additional file 2: Table S1). Sequence read datasets with a coverage of < 95% at 10× or less across these resistance genes will not be further processed, and an error message is displayed to the user. Variants are called with pilon (parameters: default) [24] to obtain SNPs and indels in the variant calling format (VCF) requiring that they have a PASS or Amb filter tags with read allele frequency > 0.40. The allele frequency threshold of 0.40 was chosen based on our observation that lower thresholds only marginally increased sensitivity of resistance prediction with a larger decrease in specificity (see Farhat et al [16]). Fast-Lineage-Caller then detects the M. tuberculosis lineage based on five lineage typing schemes as implemented by Freschi et al. [25]. Subsequently, invariant sites in the VCF file are removed, and Fig. 1 Schematic overview of the GenTB pipeline. Raw sequence data is quality checked and adapter trimmed before alignment to the H37Rv reference strain (accession AL123456). Variants are called with Pilon, and variant matrices used by the prediction models are prepared using custom scripts available on Github. The analysis will fail if quality criteria are not met (blunt end arrows). Numbers represent the three moments in the pipeline where users can upload their data to predict resistance for their isolate. a custom Perl script annotates each variant as frameshift, synonymous or non-synonymous, stop codon, indel along with the H37Rv locus tag for each respective gene. A custom python script generates a matrix file with all model features/variables in the columns used as input to the two prediction steps specified below. These scripts are available from Github [19] and are open source (AGPLv3 license). All intermediate sequence files are accessible to the user for download and verification.
For runtime evaluation, we pulled start and end time of all successfully completed pipeline runs submitted between April 12th and May 12th and computed average and median processing times.

Operation
GenTB is a free tool and registration is open to everyone. User registration is needed for security and to allow users to run predictions, track intermediary files and results. Users with low internet bandwidth can use the Dropbox integration to upload files. Both raw sequence reads and variants in variant call format (VCF) can be uploaded for resistance prediction. The required minimum genomic input, i.e., in case of targeted sequencing data, is specified on the input page and derived from Farhat et al. [16]. The user can select an option to delete uploaded source data after prediction or otherwise to save it for their future access through GenTB. Files are user-specific and not shared or accessible by others. Users can submit their genomic data for prediction and log off and will be sent an email with a link to the results when they are completed. Their prediction result will be stored indefinitely unless the user deletes it. Raw sequence data and intermediary files will be stored for three days.
The upload and processing stability of the GenTB online interface has been tested with up to 300 isolates uploaded in one batch. For batch processing of larger numbers of raw sequence data, we provide a command-line GenTB workflow based on Snakemake v5.20.1 [18] where dependent software will be sourced via conda [26]. The Snakemake workflow can be accessed via Github (https:// github.com/farhat-lab/gentb-snakemake) [27]. This repository contains a README file detailing the installation process and a description of the output files.

Validation sequencing and phenotype data
We collated a database of 20,408 Illumina raw sequence read datasets for which laboratory-based phenotypic DST data was available from public sources (Additional file 1). Sequence data was downloaded from NCBI nucleotide databases. Custom scripts were used to pool the phenotype data from Patric [28], ReseqTB [29], Zignol et al. [30], Wollenberg et al. [31], Phelan et al. [32], Hicks et al. [33], Coll et al. [34], and Dheda et al. [35] (scripts available at https://github.com/farhat-lab/ resdata-ng). Phenotypic testing was performed using WHO endorsed methods (Additional file 2: Table S2). Sequence data was merged in case of multiple sequencing runs per isolate for downstream processing and resistance prediction. The 20,408 sequence read datasets are not completely independent of the original training datasets with a small overlap (< 5%) and we thus do not expect this to affect the diagnostic accuracy.
Genotypic resistance prediction using two statistical models Two multivariate models are used to predict the resistance phenotype, an RF model (GenTB-RF) and a WDNN (GenTB-WDNN). GenTB-RF was trained on isolates with available resistance phenotype data and was validated as described in Farhat et al. [16]. Briefly, 1397 clinical isolates sampled as detailed in Farhat et al. [16] underwent targeted sequencing at 18 drug resistance loci using molecular inversion probes and in parallel underwent binary drug culture-based DST to 13 drugs (Additional file 2: Table S1). One RF was built for each drug using the randomForest R package (v. 4.6.7) with a subset of the total 992 SNPs/indels observed. Variants of highest importance for resistance prediction to each drug were selected by iteratively paring down the model and measuring loss of performance. Important variants are shown in Additional file 2: Fig. S1 for isoniazid and rifampicin.
Pyrazinamide resistance is known to rely on a large number of individually rare variants. Given the large increase in published M. tuberculosis WGS and linked DST data as well as the recent implication of novel resistance loci we retrained the pyrazinamide RF here using a newer version of randomForest R package (v. 4.6.-14) on variants in the genes pncA, panD, clpC1, and clpP [36]. We used 75% (15,267 isolates) of the dataset to train the model and 25% (5098 isolates) to validate its performance. During retraining, we excluded silent variants, those that occurred only in phenotypically susceptible isolates, and the final model was trained on 393 variants occurring in 3,262 phenotypically pyrazinamide resistant isolates [25]. We chose the randomForest mtry variable that yielded the smallest out-of-bag error and varied the classwt variable to maximize the sum of sensitivity and specificity.
GenTB-WDNN is a multitask logistic regression model combined with a multilayer perceptron. It has been previously shown to have equal or higher performance than the RF architecture when both are trained on the same data [17]. GenTB-WDNN was trained on 3,601 isolates (sampled as detailed in Chen et al. [17]) for 11 drugs using the Keras 2.2.4 library in Python 3.6 with a TensorFlow 1.8.0 backend. The model uses 222 features (i.e., SNPs or small insertions/deletions) along with derived variables (i.e., the number of nonsynonymous SNPs across all resistance-conferring genes) to predict the resistance phenotype. GenTB-WDNN was trained on the same genetic loci like GenTB-RF plus the resistance genes rpsA (plus its promoter region) and eis (plus its promoter region) (Additional file 2: Table S1).

Performance of GenTB and comparison with other tools
To assess the performance of GenTB for predicting resistance, all isolates were processed through the GenTB pipeline. We compared the diagnostic accuracy with two leading resistance prediction tools, TB-profiler 2.8.12 [14] and Mykrobe v0.9.0 [15], that were run with default parameters. These two tools rely on a curated list of mutations and make a resistance call once they identify one of these mutations in a sample. The two tools and two GenTB prediction models' predictive ability was obtained by comparing the genotypic prediction to the phenotype data that was considered the ground truth. We calculated the true positive rate (sensitivity), the true negative rate (specificity), and area under the receiver operating curve (AUC for short) to measure test accuracy for each drug and tool. We evaluated 1,000 probability thresholds per drug to call resistance or susceptibility for GenTB-RF while using the GenTB-WDNN thresholds described in Chen et al. [17] (Additional file 2: Fig.  S2 and Additional file 2: Fig. S3).

Statistical analyses and data visualization
Prediction files from all tools were parsed and analyzed in Jupyter Notebooks running Python 3.7 using the Pandas [37] and JSON libraries. Receiver operating characteristic curves were plotted using the Seaborn library [38]. The Vioplot package was used for violin plots [39]. We used the randomforestExplainer v0.10.1 package to visualize important variants in random forest models. Summary tables were created in R version 3.6.3 [40] using the packages from the tidyverse [41] and kable (https://cran.r-project.org/web/packages/kableExtra/ index.html). Sequencing depth in resistance loci was calculated and plotted using Mosdepth version 0.2.9 [42]. Confidence intervals were obtained by bootstrapping, comparing 5000 predictions per tool and drug on a resampled dataset.

Comparison of output between tools
We collated the output files and information produced by the GenTB online application, the webserver of TB-Profiler (https://tbdr.lshtm.ac.uk, version 3.0.0), and the Desktop version of Mykrobe (MacOS app v0.90) using one example raw sequence dataset (accession ERR1664619). The tools' output was compared based on the following criteria: (1) type and accessibility of output data formats; (2) communication of genotypic prediction results, i.e., binary classification versus probability; (3) disclosure of the prediction model's error rate; (4) description of known resistance-conferring variants identified; (5) reporting any novel mutation not listed in the resistance variant database; (6) detailed account of detected lineage variants and what lineage typing scheme was used; and (7) report quality metrics on the input sequence data.

Questionnaire
We conducted a survey among past GenTB users in May 2021 to explore how easy the GenTB website is to use (Additional file 3). The survey was developed using Google Forms in English and administered on May 3rd to 166 registered users. The five questionnaire items assessed how (1) pleasing, (2) how clear, (3) how easy to use, (4) how stable, and (5) how usable the GenTB tool is. Responses were recorded on a Likert scale where 0 means "worst" and 10 "best."

Results
A user-friendly application to analyze M. tuberculosis sequencing data GenTB was developed as a free and benchmarked online application to help public health and clinical practitioners deconvolute the complexity of M. tuberculosis WGS data. GenTB Predict allows users to predict resistance to 13 anti-TB drugs from a clinical isolate's raw Illumina sequence data (FASTQ) obtained from either WGS or targeted sequencing. Two validated machine learning models are used to make predictions: GenTB-RF and GenTB-WDNN ("Implementation" and [16,17]). GenTB-RF is the default prediction model. The average computing time based on 23 runs for the prediction pipeline was 35 min (SD 4 min, median 35 min, IQR 33 to 38 min). In addition to the GenTB Predict function that we focus on here, the web-application has additional features for sharing, mapping, and exploring M. tuberculosis genetic and phenotypic data (Fig. 2). GenTB Data enables researchers to store, version, and share M. tuberculosis sequence and phenotype data and is powered by the Dataverse research data repository [43]. Users can select an option to delete source files upon processing the prediction. GenTB Map enables users to geographically visualize genetic and phenotype data. Users can explore the subset of 20,408 isolates with geographic tags (n = 12,547 isolates) used for GenTB predict validation ("Implementation") or can upload and explore their own data in enriched-VCF format (https:// gitlab.com/doctormo/evcf/-/blob/master/docs/Enriched_ VCF_Format.md). Raw data and results can be exported to a tabular data format.
A questionnaire-based evaluation of GenTB's userfriendliness among previous users (12 respondents) showed that GenTB is a clear, pleasing, stable, easy to use, and usable application (Additional file 2: Fig. S4).

Dataset description
We curated a dataset of 20,408 M. tuberculosis isolates with known phenotypic resistance status to benchmark GenTB Predict performance ("Implementation" and Additional file 1). We excluded 29 isolates as they failed FastQ validation. Of the remaining, 1339 isolates did not pass our taxonomy filter criterion, and their non-M. tuberculosis reads were removed. The GenTB pipeline identified an additional 499 isolates where more than 5% of the genome was covered at depth <10x and these isolates were excluded from further analysis. These isolates had a median depth of 21x (IQR 17 to 26). The remaining 19,880 isolates with high quality sequencing data were majority lineage 4 (52%), with lesser representation of lineage 2 (21%), lineage 3 (15%), lineage 1 (10%), M. bovis (0.6%), lineage 6 (0.3%), and lineage 5 (0.2%). Completeness of phenotypic DST data varied by drug and was highest for the first-line drugs rifampicin (98.3%), isoniazid (96.4%), ethambutol (77.5%), and pyrazinamide (71.5%) (Additional file 2: Table S3). The second and third-line drug phenotype data ranged from 35.1% completeness for streptomycin to 7.8% for ethionamide. Of the 20,408 isolates, 13,817 were phenotypically susceptible to first-line drugs, 4743 (23.3%) were phenotypically MDR (i.e., resistant to isoniazid and rifampicin) and 396 (1.9%) were phenotypically XDR (MDR and resistant to fluoroquinolones and the secondline injectables-amikacin, kanamycin, or capreomycin). We ran GenTB-RF and GenTB-WDNN to predict resistance on 19,880 isolates and compared the predictions to phenotypic data.

Predictive performance of the GenTB-Random Forest
We assessed each tools' predictive performance by comparison with phenotypic culture-based DST results. Overall, the four tools had comparable performance characterized by varying sensitivities and high specificities (Tables 1 and 2, Fig. 3A, Additional file 2: Fig. S5). Diagnostic performance was better for first-line than second-line drugs. As sensitivity varied most widely, we discuss it by drug class below. Specificities varied less by tool or by drug. GenTB-RF's diagnostic specificity was > 92% for all drugs including the second-line injectables . GenTB-RF's specificities were similar or higher than the other three tools with the exception of pyrazinamide (94% [95% CI 93-95]) and streptomycin (89% [95% CI = 88-90]) compared to TB-Profiler (96% and 95%, respectively) as well as Mykrobe (98% and 95%, respectively).
GenTB-RF predictions for pyrazinamide using the original model (v1.0) had low sensitivity at 56% (95% CI 54-58) with adequate specificity (98% [95% CI = 98-99]) compared to the other tools when evaluated on the 19,880 isolates (2336 phenotypically resistant and 11,932 susceptible) [16]. Pyrazinamide resistance is known to be caused by a large number of individually rare variants in the gene pncA [44]. Given the large interval increase in available WGS data and recent implication of novel resistance loci (panD, clpC1, clpP) [36] since GenTB-RF was last trained, we assessed the number of rare variants in the four aforementioned genes linked to pyrazinamide resistance. In a random 75% subset of the 20,379 isolates, we detected a total of 393 different variants in pncA, panD, clpC1, and clpP with 40% (158/393) occurring only once. The majority of these variants, i.e., 73% (285/393) were not previously seen by the original model. As a result of these observations, we retrained a GenTB-RFv2.0, on 75% of the data using all 393 nonsynonymous variants including singletons and insertion/ deletion variants from pncA, panD, clpC1, and clpP. The retrained model, when benchmarked on an independent validation dataset of 5,098 isolates, offered a sensitivity (79% [95% CI 76-83]) and specificity (94% [95% CI 93-95]) similar to the other tools) ( Table 1).

Second-line drugs
For second-line drugs, larger discrepancies between genotype and resistance phenotype have been previously described compared with first-line drugs [14,15]. Diagnostic sensitivity to the second-line injectable drugs amikacin and kanamycin ranged between 63 and 68% across the four tools, with the exception of a sensitivity of 55% by TB-Profiler for amikacin (Table 1). Specificities to   Table S4 and Additional file 2: Table S5). For levofloxacin, GenTB's diagnostic sensitivity was 81% (95% CI 73-88) with a specificity of 77% (95% CI 66-87) (Additional file 2: Table S4).

Predictive performance of GenTB-WDNN
We sought to determine the performance of GenTB-WDNN that was previously shown to outperform other statistical resistance prediction approaches [17]. To determine the probability of phenotypic resistance, GenTB-WDNN combines multitask logistic regression to learn the effect of individual mutations with a threelayer perceptron to account for more complex epistatic effects on antibiotic resistance [45,46]. Similar to GenTB-RF, the overall GenTB-WDNN performance was marked by high prediction accuracy of first-line drug resistance and lower accuracy of second-line resistance ( Table 1). AUC 95% CI overlapped for all drugs between the two models except for ofloxacin and rifampicin for which the GenTB-RF AUC was higher (Table 2). For streptomycin, the GenTB-WDNN offered the best sensitivity and specificity of all four models (sensitivity 87%, 95% CI 85-88%, specificity 87% (95%CI 86-88%). Specificities were > 95% for all drugs except for streptomycin (87%, 95% CI 85 to 88) and ethambutol (93%, 95% CI 93 to 94).

Predictive performance depends on sequencing depth
We evaluated the need for quality control on sequencing depth as several tools do not currently implement this prior to resistance prediction [9,14,15]. We observed predictive performance to be highly dependent on sequencing depth as indicated by lower sensitivity to predict rifampicin or isoniazid resistance by all four tools for the 499 isolates that did not meet the threshold of ≥ 10×depth across > 95% of the genome (median depth of 21×, IQR 17 to 26, Fig. 3E, F). Using GenTB-RF, the mean sensitivity of isoniazid and rifampicin prediction was 84.6% (SD 3.6) and 87.3% (SD 3.6) respectively among low-depth isolates, compared with 91% and 93%, respectively, on high-depth isolates (Additional file 2: Table S6, Fig. 3E, F). Loss of sensitivity due to low sequencing depth was comparable across the four tools.

Discordant resistance predictions
To gain insight into model performance, we probed discrepancies between GenTB-RF's genotype-based prediction and the resistance phenotype. We focused on this model as it had the highest overall sensitivity. We examined specifically rifampicin and isoniazid as resistance to these two drugs defines MDR-TB, and their genetic resistance mechanisms are well understood. We investigated isolates for which GenTB-RF predicted resistance while the phenotype was reported as susceptible (false positives) and isolates for which GenTB-RF predicted susceptibility with a resistant phenotype (false negatives). We confirmed that false negative predictions were not due to low sequencing depth in relevant drug resistance loci (i.e., that depth was ≥ 10× across all bases, Additional file 2: Fig. S6 and Additional file 2: Fig. S7).

Isoniazid false negatives
Among the 518 false negative isoniazid predictions (phenotypically resistant isolates predicted susceptible by GenTB-RF), 194/518 (37%) harbored non-silent variants in isoniazid resistance-associated genes (Additional file 2: Table S8). Only 13 of the 139 unique variants observed in the 518 isolates were seen before by GenTB-RF and none of these were considered important isoniazid resistance mutations. KatG W328L was the variant detected most frequently (occurred in 10/518 isolates predicted false negative) and although not previously seen by GenTB-RF was described to occur in 0.2% of isoniazid resistance in one study [50]. Most variants linked to isoniazid resistance observed in these isolates were rare, i.e., 134/139 (96%) occurred in ≤ 3 isolates.

Output comparison across the three tools
All four tools are accessible to the non-experienced user via either an online interface (GenTB, TB-Profiler) or via a Desktop application. We compared each tool's output using the criteria specified in the "Implementation" section (Table 3). GenTB-RF provides a heatmap indicating the probability of resistance including the models' error rate with all prediction and intermediary files available for download. TB-Profiler and Mykrobe present binary (resistant or susceptible) predictions in overview tables with download options in CSV or JSON formats, respectively. TB-Profiler and GenTB present resistance causing variants and variants not associated with resistance. All tools provide the main-and sub-lineage call made but GenTB also specifies the lineage typing schemes used.

Discussion
The increasing affordability of WGS and our improving comprehension of mycobacterial drug resistance mechanisms has placed sequencing at the forefront of M. tuberculosis resistance diagnosis in clinical and public health laboratories (e.g., Public Health England in the UK and the Centers for Disease Control and Prevention in the USA) [7,51]. Yet, the complexity of resistance biology is such that large and diverse bacterial isolate datasets are needed to confirm the accuracy of genotype-based resistance prediction and its generalizability. Further, the required computational resources and knowledge to conduct sequencing analysis prohibit both the access to and confidence in WGS based resistance prediction in clinics in both low-and high-incidence settings. High confidence automated tools that are systematically benchmarked on diverse datasets are needed to facilitate adoption, and to act as the standard for future tool development and regulation by oversight agencies such as the World Health Organization (WHO).
GenTB is an automated open tool for resistance prediction from WGS. Here we benchmarked its two prediction models against two other leading TB prediction tools. Both GenTB models predicted resistance and susceptibility against first-line drugs with high accuracy. Predictive performance for second line drugs showed lower sensitivity, and this may be helped by studying a larger number of isolates with ethionamide, amikacin, capreomycin, kanamycin, and fluoroquinolone resistance in the future. Specificity was high for several second line drugs, i.e., capreomycin, kanamycin, and ofloxacin. This high specificity may be used to rule out resistance when no resistance-conferring variant for these drugs was found. A detailed analysis of discrepant predictions made by GenTB-RF illustrated that a number of false positive predictions were supported by canonical resistance variants, e.g., non-silent mutation in the rpoB RRDR in case of rifampicin, suggesting that their phenotypes were erroneously labeled as susceptible. Similarly, nearly half (48%) of the variants found in isoniazid false positive predictions are canonical resistance variants. These isoniazid resistance variants, the large proportion (60%) of phenotypic resistance to another drug among these isolates, and the knowledge that isoniazid is usually a gateway drug resistance, suggest that some phenotypes were erroneously characterized as susceptible [52]. Accordingly, specificity of genotype-based prediction in practice maybe even higher than reported here (Table  1). Given the estimated 2% prevalence of rifampicin resistance among new TB cases in the USA in 2019 [1], GenTB-RF's diagnostic accuracy translates to a positive predictive value of 49.5% and a negative predictive value of 99.9%.
For isolates with a resistant rifampicin phenotype that were predicted susceptible by GenTB-RF, we found a mutation in the rpoB RRDR in a nearly a quarter (23%) of isolates that reasonably accounts for the resistance phenotype, but had not been seen by the model previously. For the remaining majority of false negatives (71% for rifampicin) no relevant resistance variant was found. In these cases, phenotypic resistance remained unexplained and could be due to erroneous phenotypes or yet unknown resistance mechanisms. For isolates with a resistant isoniazid phenotype predicted susceptible, no important resistance-conferring mutations were found.
In these cases, phenotypic resistance could be due to rare and yet undescribed resistance variants. A substantial proportion of false negative predictions to isoniazid or rifampicin had genotypic resistance to at least another drug (48% of rifampicin false negatives and 40% of isoniazid false negatives). These observations overall suggest that a viable option to reduce false negative predictions by current models would be to leverage genotypic predictions to other drugs and flag such isolates for complementary phenotypic DST. In the future as new larger datasets of paired genotype and resistance phenotype are curated, e.g. by efforts sponsored by the WHO [29], retraining existing resistance prediction models will improve diagnostic sensitivity.
The final output produced by the four tools varies in terms of detail and type of variants reported with GenTB providing the most detail. In addition to resistanceassociated variants, GenTB outputs a description of novel variants in resistance genes that have not been previously seen by GenTB's models. The phylogenetic lineage calling procedure implemented in GenTB [25] uses several validated typing schemes to facilitate comparisons across lineage schemes.
Unlike other published resistance prediction tools that rely on a curated list of resistance-conferring mutations that call resistance when a specific variant is present, GenTB-RF and GenTB-WDNN use multivariable statistical models to predict resistance phenotype. These models are better suited to account for the complex relationships between resistance genotype and phenotype. Among the advantages of multivariate prediction models is that relationships between  [45,46]. As such, the two models provide a probability value that a given isolate is resistant or susceptible rather than a binary classification. This is relevant in case of variants that, if present alone, confer only weak to no resistance, but may confer complete resistance if present in combination. Also, each variable in a multivariable model has different weights depending on the strength of association with resistance in the training data, reflecting the biological reality where variants cause differing levels of resistance. The benchmarking data presented here confirm that these multivariate models offer gains in sensitivity over the other two tools that use curated mutation lists; however, this comes at a small decrease in specificity overall. Given its higher overall performance GenTB-RF is currently implemented as the default prediction model. As larger and more diverse data will become available for model training, especially for prediction of resistance more quantitatively, i.e., to predict minimum inhibitory concentrations or MICs, we anticipate multivariate models including the more complex GenTB-WDNN architecture to have an even bigger advantage over direct association of mutation lists. This study was not without limitations. An important prerequisite for reliable genotypic resistance prediction is the quality of the raw sequencing data. Variants and small indels in resistance-conferring genes can be accurately and confidently called from Illumina raw sequence data if the genes are adequately covered at an acceptable sequencing depth [53]. However, short-read sequencing data is recognized to have lower sensitivity for detecting more complex genomic variants including long indels or structural variation and these may have been missed in this study. But these latter types of variants are expected to be rare. Our finding of "apparent" false positive predictions (i.e., resistance call by GenTB-RF while susceptible phenotype) in isolates harboring canonical resistance variants portends some erroneous phenotypes in our ground truth dataset. None of the three tools studied predict resistance to recently introduced or repurposed drugs, such as bedaquiline or linezolid, due to limited phenotypic resistance data which is partially driven by limited clinical experience with them thus far [54]. Due to the scale and public nature of the dataset used for benchmarking in this study, we were unable to retest the laboratory-based drug susceptibility profiles of isolates with discordant predictions, but hope that it provides a test closer to a "real-world" scenario for these tool's application. We also note that the models' performance as benchmarked here is an average across the currently publicly available datasets that represent isolates from different countries. In the future, large country-specific datasets will further validate their diagnostic accuracy based on the local epidemiology.

Conclusions
The rapid emergence and affordability of sequencing of M. tuberculosis along with the herein confirmed high accuracy of several genotypic resistance prediction tools supports the use of informatically assisted treatment design in the clinical setting. Independent benchmarking efforts will facilitate regulatory reviews and assessments and build confidence in the tools' performances. As genotypic resistance predictions will accompany and increasingly replace laboratory-based resistance phenotyping performance criteria will need to be defined to guide clinical and public health laboratories in their use. Lastly, it will be important to communicate the confidence and uncertainty that is inherent to all genotypic predictions to clinicians and provide clear diagnostic algorithms in case of genotype-phenotype discordances.
Additional file 1. Accessions and phenotypic drug susceptibility profiles of isolates included in the benchmarking dataset.
Additional file 2: Fig S1. Characteristics of variables used for resistance prediction by Gentb-RF for isoniazid and rifampicin. Fig S2. Probability distribution and selected threshold for Gentb-RF resistance predictions. Fig S3. Probability distribution and thresholds according to [17] of Gentb-WDNN resistance predictions. Fig S4. User-friendliness evaluation of the GenTB tool. Fig S5. Diagnostic performance of the four prediction tools across antituberculosis drugs. Fig S6. Sequencing depth of resistance-conferring genes in isolates falsely predicted susceptible to first line agents. Fig S7. Sequencing depth of resistance-conferring genes in isolates falsely predicted susceptible to second line agents. Table S1. Genetic loci used for random forest model training. Table S2. Phenotypic drug susceptibility testing methods used by studies included in this benchmarking dataset. Table S3. Frequencies and percentages of available drug susceptibility data per drug. Table S4. Diagnostic accuracy comparison of tools for drugs with insufficient phenotype data and pyrazinamide performance on all isolates. Table S5. Area under the Receiver Operating Characteristic curve for GenTB-RF and GenTB-WDNN. Table  S6. Diagnostic accuracy to rifampicin and isoniazid across low-depth and