Profiling diverse sequence tandem repeats in colorectal cancer reveals co-occurrence of microsatellite and chromosomal instability involving Chromosome 8

We developed a sensitive sequencing approach that simultaneously profiles microsatellite instability, chromosomal instability, and subclonal structure in cancer. We assessed diverse repeat motifs across 225 microsatellites on colorectal carcinomas. Our study identified elevated alterations at both selected tetranucleotide and conventional mononucleotide repeats. Many colorectal carcinomas had a mix of genomic instability states that are normally considered exclusive. An MSH3 mutation may have contributed to the mixed states. Increased copy number of chromosome arm 8q was most prevalent among tumors with microsatellite instability, including a case of translocation involving 8q. Subclonal analysis identified co-occurring driver mutations previously known to be exclusive. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00958-z.

coverage of the targeted region within 100 to 200 bases; (ii) adjustable primer probe density; (iii) a primer probe GC content between 30% and 65%; (iv) uniqueness of the last 20 bases for the 3' portion of the primer probe in the human genome with a string edit distance of 1 from any other genome location; (v) no overlap of the last 10 bases with a known SNP as annotated in dbSNP Build ID 131;and (vi) no sequences immediately adjacent to highly repetitive sequences.
These adapters contain a 7-base indexing sequence (xxxxxx*T) directly following the sequencing primer binding site. The adapters were annealed in a final concentration of 15 µM per adapter in Nuclease Free Duplex Buffer (IDT) by a 1% temperature ramp from 94°C to 20°C, after an initial 5 min 94°C denaturation step. Unlike standard Illumina adapters, our modified library adapters are only complementary to the P5 primer on the flow cell surface. The portion that is complementary to the P7 primer is introduced in the primer probe extension step.
Primer probes were column-synthesized at the Stanford Genome Technology Center and combined in an isomolar pool.
The KAPA HyperPlus library preparation kit (Roche) was used for the following steps. For each library, 1 μg gDNA was subject to random fragmentation with the KAPA enzyme mix; the incubation was at 37°C for 9 min, directly followed by incubation on ice. A-tailing enzyme mix was added to the final fragmentation products and the fragmented library was A-tailed with incubation at 65°C for 30 min. Because the random fragmentation creates blunt-ended breaks, the end-repair step was omitted. The DNA ligase mix including 75 pmol annealed adapter was added to the A-tailed library. The reaction volume was incubated at 20°C for 15 min.
Afterwards, the library products were purified with AMPure XP beads in a bead solution-tosample ratio of 0.8. The size distribution of the sequencing library was measured with the DNA High Sensitivity Reagent Kit on a LabChip GX (Perkin-Elmer) per the manufacturer's protocol.
The purified library was then used directly for single primer targeted sequencing with no additional steps.
The flow cell modification and capture assay procedures were modified from those previously described [23]. The targeting process requires (1) hybridization and extension of the target oligonucleotides onto a solid phase support such as a flow cell, followed by capturing of the sequencing library by overnight hybridization; and (2) extension of the captured library and standard Illumina cluster generation. For runs performed using High Output mode, we used v3 sequencing reagents (Illumina) for 2x101 cycle paired end reads. For runs performed using Rapid Run mode, we used v2 reagents (Illumina) for 125 (read 1) and 101 (read 2) cycles of paired end sequencings. For all the HiSeq experiments, image analysis and base calling were performed using the HCS 2.2.58 and RTA 1.18.64 software (Illumina).

Microsatellite mutation calling
After single-end alignment to the NCBI v37 human genome, we added a custom SAM file tag (ZP) to the aligned reads. This tag denotes which microsatellite probe generated the read as well as the linked target STR information (e.g. repeat motif and flanking sequences). We used the Read 2 (R2) aligned sequences, which include the capture probe sequence and residual genomic sequences, to determine the tag for paired end reads. This indexing method does not require Read 1 (R1) sequences to align to the genome; both aligned and unaligned reads are tagged based on alignment of their R2 mate to a designated primer probe sequence. After the indexing, we evaluate reads to determine whether both the expected 5' and 3' STR flanking sequences are present in R1. For this study, we aggregated the 'STR-spanning' reads, which contain a combined sequence that includes an 8-base 5' flanking sequence, a variable length region containing at least a minimum number of STR motif repeats, and an 8-base 3' flanking region. Using the reads that contain this combined sequence, the STR motif repeat count was calculated by dividing the number of bases in the variable length region by the length of the STR motif. For example, if the variable length region is 28 bases and the STR motif is GATA (tetranucleotide), then the STR motif repeat count is seven. The R1 reads encompassing entire STRs (i.e. STR-spanning reads) are counted and summarized by motif repeat count (i.e. allele) to provide a basis for determining the genotype and extent of mutations in microsatellites. The distribution of alleles and relative percentage of reads for each allele was used to compare tumor versus normal samples (Additional File 1: Figure S2a).
For a given microsatellite locus, we let = ( 1 , … , ) be the read count for each of the alleles. Letting = ∑ be the total number of reads observed for the locus, we computed the allele coverage proportion vector for the microsatellite as follows: = ( 1 , … , ), = .
These proportion vectors exhibited peaks at alleles that are present in a sample. Using samples from a matched normal source, we first determined alleles based on the criteria previously described [23]. We assumed n( ) ≤ 2. To determine whether a microsatellite locus has a somatic allele shift in tumor DNA versus its paired normal , we used a weighted Euclidean distance between the two proportion vectors: We excluded the proportions of normal alleles, , when calculating (i.e. a zero weight for normal alleles). This weighting minimized the contribution of copy number changes in the distance. If is greater than a given threshold, we consider that the microsatellite locus has an allele shift. The threshold was decided to minimize false positive and negative rates after fitting the most probable Gaussian mixture densities to our data (Additional File 1: Figure S2b). To evaluate the extent of MSI of a tumor, we calculated the fraction of microsatellite loci with allele shift versus the total number of measured microsatellite loci.

Digital PCR confirmation of copy number alterations
For a number of copy number alterations, we used digital PCR assay for validation. First, we identified the coordinates defining individual gene regions of interest (ROI) within their exons.
We designed all copy number primers and probes with Primer3, using input gene sequences from NCBI Nucleotide's GRCh37.p10 Primary Assembly [27,28]. Each primer was verified for uniqueness with UCSC Genome Browser's Blat based on the GRCh37/hg19 assembly [29].
We obtained all primers from Integrated DNA Technologies (San Diego, CA) and MGB hydrolysis probes from Life Technologies (NY, USA). For the genes ERBB2, FGFR1 and MET, we used a hydrolysis probe-based primer design, including a FAM-tagged MGB probe flanked by two primers for each ROI. We used primers and a VIC-probe for a previously defined [20] highly conserved region on Chromosome 1 (UC1) as a reference for all copy number assays.
This method has been previously validated for detecting copy number alterations in cancer [19,20].
For the genes AURKA, CDK4, FLT3 and VEGFA, we utilized an analogous multiplex digital PCR assay using a single-color DNA-binding dye (EvaGreen) as we have previously described [30]. Briefly, the multiplexing is based on discriminating differences in fluorescence intensity as dependent on amplicon length. In our previous work, we showed the ability to discriminate amplicons that differed by as little as 10bp. Here we use the same principle to multiplex an ROI and a reference, with ROI primers amplifying similar or slightly longer amplicons (60-66 bp).
The droplets can then be clustered based on the difference in fluorescent signal amplitude; populations containing no template correspond to the lowest fluorescent amplitude. This design allowed us to multiplex the ROI and UC1 assays in the same well.
The digital PCR assays were run on a QX200 droplet digital PCR system (Bio-Rad) across a series of annealing temperatures [19]. The 20 µl single-color ddPCR reaction mixtures consisted of 2X ddPCR Evagreen Supermix, 100 nM each primer and 20 ng restriction digested gDNA. Droplets were generated as described above and cycled as follows: 95ºC for 5 min (1 cycle); 95ºC for 30s, 52-62ºC for 1 min (40 cycles); 4ºC for 5 min, 90 ºC for 5 min (1 cycle); and 4ºC hold. After PCR, each well was read individually with the QX200 TM droplet reader on one fluorescent channel. We identified the annealing temperature that produced four distinct and fully separated droplet populations: droplets containing no template, reference amplicon only droplets, target amplicon only droplets, or both the reference and target amplicon template fragments.
We assessed each patient sample with three independent replicates for gene copy number. For samples analyzed in the hydrolysis probe-based assay, droplets were clustered using QuantaSoft (version1.2.10.0). For samples analyzed with the single-color dye-based assay, we exported the raw droplet fluorescence amplitudes from Bio-Rad QuantaSoft software. The concentration of each target was calculated as follows:

− ln( )
where droplet volume is the volume of an individual droplet in µls. We set this number as 0.0009ul for EvaGreen assays. Copy number of each well was calculated as follows: We calculated a weighted average and standard deviation of copy number across triplicates wells, based on the total number of droplets read in each replicate.
Supplementary Figure S1. Comparison between conventional and single primer targeted sequencing methods.
The single primer-targeted sequencing method used in the current study (OS-Seq) is compared with a conventional method, generally used for the whole exome and gene panels. This single primer targeting method uses a single adapter library, while the conventional method uses a ready-to-sequence dual-adapter library. The capture process itself completes the libraries for the subsequent sequencing step. Steps for targeting genomic sequence involves a round of hybridization and single primer extension. This process is simpler compared to the conventional method which generally uses multiple wash steps after immobilizing the captured library with magnetic beads. Moreover, targeted sequencing is easily automated with a commercial system (e.g. Illumina cBot). Most importantly, OS-Seq requires no post capture PCR step when using capture probes already immobilized on the sequencing flowcell. Therefore, a sequencing read represents a single DNA molecule. PCR-born indels, or stutters can be minimized especially when targeting microsatellites.
Supplementary Figure S2. Determination of microsatellite allelic shift. (a) Definition of microsatellite distance between tumor and normal samples. An example of allelic shift at the BAT25 locus of the P799 tumor is shown. Left panels show the allele histograms generated by our OS-Seq method. Relative abundance (sequencing read count) of DNA molecules and different microsatellite alleles (number of motif repeats) are indicated on the y-and x-axes, respectively. From the normal allele profile, the heterozygote alleles (38 and 41 motif repeats) are apparent. The allele proportion is calculated by dividing the read count of each allele by the total read count from the locus. The example shown at the bottom right are the proportion vectors from the normal and tumor samples (P N and P T ), and the distance value calculated following the definition of microsatellite distance provided at the top right. (b) Kernel density distribution of microsatellite distance values from all 46 tumor / normal comparisons at 225 microsatellite loci. The distribution is separately shown for mono-/ dinucleotide and tri-/ tetranucleotide microsatellites. Mixture of two Gaussian distributions are assumed: one for wild type, and the other for shifted alleles. A threshold (dotted vertical line) is indicated at the intersection of the two density curves.