Systematic identification and quantification of phase variation in commensal and pathogenic Escherichia coli

Bacteria have been shown to generate constant genetic variation in a process termed phase variation. We present a tool based on whole genome sequencing that allows detection and quantification of coexisting genotypes mediated by genomic inversions in bacterial cultures. We tested our method on widely used strains of Escherichia coli, and detected stable and reproducible phase variation in several invertible loci. These are shown here to be responsible for maintaining constant variation in populations grown from a single colony. Applying this tool on other bacterial strains can shed light on how pathogens adjust to hostile environments by diversifying their genomes. Electronic supplementary material The online version of this article (doi:10.1186/s13073-014-0112-4) contains supplementary material, which is available to authorized users.

The output of the BWA algorithm is a SAM file. SAM is a text tabular file where each row represents a single WGS read. Only a few of the to create the matrix data, which contains 3 columns (c2, c4 and c9).

c. Use Inversion Detection Algorithm
As is described in the Supplementary Methods, the algorithm for inversions detection searches for clusters of abnormal reads concentrated on a sloped line forming the funnel pattern when plotting gap-size against genomic location.
To execute the algorithm, run the command line: [location, score] = detect_inversions(data); The output of the function is a vector of genomic locations containing putative inversions and a corresponding vector of scores, based on the size of the cluster of abnormal reads in that location.

d. Visualize the Inversion (funnel or contact matrix + ref)
Once we obtain a set of putative inversions, we can visualize each genomic location and inspect the funnel pattern that it produces. Run the command line: region = reads_over_region(data, center, range); in order to visualize the funnel pattern around the putative inversion. Set the parameter center to the genomic location you wish to inspect. Running this command on the e14 inversion reported in the paper will produce the following figure: The output of the function is region, which is a matrix containing all reads that map to the area around the inversion. An alternative way to view the inversion is by plotting a read's location against its pair's (a contact matrix view), by running: plot(region(:,2), region(:,2) + region(:,3), '.');

Genomic Location
Gap-Size

e. Identify Inversion's Edges
One of the ways to identify inversion's edges is to find hybrid readsreads that combine two separate genomic sequences. BWA treats these reads by trimming them and mapping only one part to the genome. When a cluster of reads is soft trimmed at the same genomic location, it might mark the edge of an inversion. Run the command line: [location, trimmed] = find_trimmed_clusters(region) to get a list of all the genomic locations marked by trimmed reads' edges. Use this function to determine the inversion's edges.

f. Prepare Two Reference Genomes
After we identified a funnel-forming inversion, we now wish to quantify how abundant it is in the population. Our approach for quantifying an inversion is to map the WGS data to a small part of the genome with and without the inversion, single out the pool of reads that map normally to one genome and abnormally to the other (variable reads), and calculate the fraction of reads which map normally to the inverted genome of the pool of variable reads.
First we want to create to small reference genomes containing the genomic area around the inversion. Run the command line: prepare_reference_genomes(reference_genome, center, range, left_boundary, right_boundary) This function receives a fasta file of the reference genome, and the boundaries of the inversion and prepare two smaller fasta files containing the inverted locus, one with the inversion embedded to the sequence (inverted) and another showing the un-inverted sequence (normal).
After the two reference genomes were created, run BWA on the original fastq files two times, using normal.fasta and inverted.fasta as reference genomes, resulting in two SAM files: normal.sam & inverted.fasta.

g. Quantify The Inversion
After mapping to the two reference genomes, the matlab function quantify can be used to extract the ratio inverted/normal genotypes in the clone. Run the command line: with the filenames of the two SAM files as inputs, to produce the desired ratio.