From: SRST2: Rapid genomic surveillance for public health and hospital microbiology labs

Summary of SRST2 approach. Inputs are reads (fastq format) and one or more databases of reference allele sequences for typing (fasta format). Reads are aligned to all reference sequences (using bowtie2) and each alignment is processed (using SAMtools). At each position in each alignment, the numbers of matching and mismatching bases are determined and a binomial test is performed to assess the evidence against the reference allele, resulting in a set of P values for each reference allele sequence. To determine which of the known reference alleles is most likely present at a given locus, the P value distributions for the known alleles are compared as described in the text. Briefly, for each allele, the P values expected if the reads were derived from the reference allele in the presence of a given level of sequencing error (set to 1% of bases by default) are regressed on those actually observed, similar to a Q-Q plot; the slope of the fitted line, which increases with the strength of evidence against the reference allele, is calculated and taken as the score for that allele. The scores file (optional output) contains the score for each allele at each locus, along with additional information about the alignment for each allele, including percent coverage. For each locus, the allele with the lowest score is accepted as the closest matching allele (small arrows) and reported in the output table. In MLST mode, sequence type (ST) definitions are provided as input and used by SRST2 to calculate the ST for each read set.
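
The scoring idea summarised above can be illustrated with a minimal sketch. This is not SRST2's own code: the function names, the allele names and the toy pileup counts are hypothetical, the binomial test is taken to be one-sided (probability of seeing at least the observed number of mismatches at the given error rate), expected P values are assumed to be uniform quantiles, and the line is fitted with observed -log10 P values on the y axis so that the slope grows with evidence against the allele.

```python
import math


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    tail = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return min(tail, 1.0)  # guard against floating-point overshoot


def position_pvalues(pileup, error_rate=0.01):
    """One P value per covered position.

    pileup is a list of (matches, mismatches) counts, one tuple per position.
    """
    pvals = []
    for matches, mismatches in pileup:
        depth = matches + mismatches
        if depth == 0:
            continue  # assumption: uncovered positions are simply skipped
        pvals.append(binom_upper_tail(mismatches, depth, error_rate))
    return pvals


def allele_score(pvals):
    """Slope of observed vs expected -log10(P), in the style of a Q-Q plot."""
    n = len(pvals)
    if n < 2:
        raise ValueError("need at least two positions to fit a line")
    # Observed values, strongest evidence first; clamp to avoid log10(0).
    observed = sorted((-math.log10(max(p, 1e-300)) for p in pvals), reverse=True)
    # Expected values under the simplifying assumption of uniform P values.
    expected = [-math.log10(i / (n + 1)) for i in range(1, n + 1)]
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, observed))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    return sxy / sxx  # ordinary least-squares slope


# Toy data (hypothetical): 50 positions at depth 30 for two alleles of one locus.
pileups = {
    "alleleA": [(30, 0)] * 50,                   # reads agree with the reference
    "alleleB": [(30, 0)] * 45 + [(0, 30)] * 5,   # five positions disagree
}
scores = {name: allele_score(position_pvalues(p)) for name, p in pileups.items()}
best_allele = min(scores, key=scores.get)  # lowest score = closest match
```

In this toy case the allele whose positions all match the reads receives the lower score and is selected, mirroring the "lowest score" rule in the caption.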

