Multiple approaches for massively parallel sequencing of HCoV-19 (SARS-CoV-2) genomes directly from clinical samples

COVID-19 has caused a major epidemic worldwide, however, much is yet to be known about the epidemiology and evolution of the virus. One reason is that the challenges underneath sequencing HCoV-19 directly from clinical samples have not been completely tackled. Here we illustrate the application of amplicon and hybrid capture (capture)-based sequencing, as well as ultra-high-throughput metatranscriptomic (meta) sequencing in retrieving complete genomes, inter-individual and intra-individual variations of HCoV-19 from clinical samples covering a range of sample types and viral load. We also examine and compare the bias, sensitivity, accuracy, and other characteristics of these approaches in a comprehensive manner. This is, to date, the first work systematically implements amplicon and capture approaches in sequencing HCoV-19, as well as the first comparative study across methods. Our work offers practical solutions for genome sequencing and analyses of HCoV-19 and other emerging viruses.


INTRODUCTION 53
As of 14 March 2020, human coronavirus 2019 (HCoV-19) has surpassed severe acute 54 respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coro-55 navirus (MERS-CoV) in every aspect, infecting over 140,000 people in more than 110 coun-56 tries, with a mortality of over 5,000 1,2 . So far, coronaviruses have caused three major 57 epidemics in the past two decades, posing a great challenge to global health and economy. single nucleotide variations (iSNVs) and its allele frequency have also contributed to anti-86 viral therapy and drug resistance, e.g., to reveal highly conserved genes during the outbreak 87 that potentially serve as ideal therapeutic targets 19,21 . However, it is a challenge to accurately 88 detect iSNVs from clinical samples, especially when the samples are subjected to extra 89 steps of enrichment and amplification. 90 Therefore, we aim to comprehensively compare the sensitivity, inter-individual (variant) and 91 intra-individual (iSNV) accuracy, and other general features of different approaches by sys-92 tematically utilizing ultra-high-throughput metatranscriptomic, hybrid capture-based, and 93 amplicon-based sequencing approaches to obtain genomic information of HCoV-19 from 94 serial dilutions of a cultured isolate and directly from clinical samples. We present a reason-95 able sequencing strategy that fits into different scenarios, and estimate the minimal amount 96 of sequencing data necessary for downstream HCoV-19 genome analyses. Our study, to-97 gether with our tailor-made experimental workflows and bioinformatics pipelines, offers very the situation, but this requires high-standard laboratory settings and expertise apart from 114 being time-consuming. Also, unwanted mutations that are not concordant with original clin-115 ical samples may be introduced during the culturing process. 116 To enrich adequate viral content for whole-genome sequencing in a convenient manner, we 117 pursued two other methods: multiplex PCR amplification (amplicon) and hybrid capture 118 (capture) (Fig. 1). We designed a systematic study to comprehensively validate the bias, 119 sensitivity, inter-individual (variant) and intra-individual (iSNV) accuracy of multiple ap-120 proaches by sequencing serial dilutions of a cultured isolate (unpublished), as well as the 121 eight clinical samples (Fig. 2). We performed qRT-PCR of 10-fold serial dilutions (D1-D7) of 122 the cultured isolate, and the Ct was 17.3, 20.8, 24.5 for, 28.7, 31.8, 35, and 39.9, respec-123 tively, indicating the undiluted RNA (D0) of the cultured isolate contained ~1E+08 genome 124 copies per mL. For amplicon sequencing, we utilized two kits comprising of two set of pri-125 mers generating PCR products of 300-400 bp and 100-200 bp, respectively. The ~400 bp 126 amplicon-based sequencing was implemented in all samples and analyzed throughout the 127 study, while the ~200 bp amplicon-based sequencing was only applied in the cultured isolate 128 for coverage analysis.
to inadequate depth in amplicon sequencing (Fig. 3a). Another pitfall is that amplification 142 across the genome can hardly be unbiased, causing difficulties in complete genome assem-143 bly. Indeed, amplicon sequencing exhibited a higher level of bias compared with meta-144 transcriptomic sequencing, in terms of coverage across the viral genomes from the cultural 145 isolate and the clinical samples tested in our study (Fig. 3b, d, Supplementary Fig. 1). To 146 our surprise, however, capture sequencing was almost as unbiased as meta sequencing, 147 demonstrating better performance than the previous capture method used to enrich ZIKV 148 despite the HCoV-19 genome is ~3 fold larger than ZIKV 22 (Fig. 3b, c). Two reasons 149 amongst others were likely to be accountable to this improvement, 1) we utilized 506 pieces 150 of 120 ssDNA probes covering 2x of the HCoV-19 genome to capture the libraries, 2) we 151 employed the DNBSEQ sequencing technology that features PCR-free rolling circle replica-152 tion (RCR) of DNA Nanoballs (DNBs) 23,24 . 153 The sequencing results of amplicon and capture approaches revealed dramatic increases 154 in the ratio of HCoV-19 reads out of the total reads compared with meta sequencing, sug-155 gesting the enrichment was highly efficient -5596-fold in capture method and 5710-fold in 156 amplicon method for each sample on average (Supplementary Table 1-2). To further com-157 pare the sensitivity of different methods, we plotted the number of HCoV-19 reads per million 158 (HCoV-19-RPM) of total sequencing reads against the viral concentration for each sample. 159 The productivity was similar between the two methods when the input RNA of the cultured 160 isolate contained 1E+05 genome copies per mL and above (Fig. 3e). However, amplicon 161 sequencing produced 10-100 fold more HCoV-19 reads than capture sequencing when the input RNA concentration of the cultured isolate was 1E+04 genome copies per mL and lower, suggesting amplicon-based enrichment was more efficient than capture for more 164 challenging samples (conc. ≤ 1E+04 genome copies per mL, or Ct ≥ 28.7) (Fig. 3e). Meta 165 sequencing -as expected -produced dramatically lower HCoV-19-RPM than the other two 166 methods among clinical samples tested with a wide range of Ct values, whereas amplicon 167 and capture were generally comparable to each other (Fig. 3e). Considering the costs for 168 sequencing, storage, and analysis increase substantially with larger datasets, we tried to 169 estimate how much sequencing data must be produced for each approach in order to 170 achieve 10X depth across 95% of the HCoV-19 genome, and the results can be found in 171 Supplementary Table 3. As a practical, cost-effective guidance for future sequencing, we 172 also assessed the minimum amount of data required to pass the stringent filters (≥ 95% 173 coverage and method-specific depth, see Methods) in our pipelines corresponding to differ-1E+06 copies of HCoV-19 genome per mL, while capture sequencing requires 24,474 to 9 177 Mb for the same situation (Fig. 3g, Supplementary Table 4). 178

Investigation of inter-and intra-individual variations.
To determine the accuracy of dif-179 ferent approaches in discovering inter-individual genetic diversity, we tested each method 180 in calling the single nucleotide variations (SNVs) and verified some of the SNVs with Sanger 181 sequencing ( Supplementary Fig. 2). Two to five SNVs were identified within each clinical 182 sample, and in all the seven samples, SNVs identified by the three methods were concord-183 ant except that capture missed one SNV at position 16535 in GZMU0014 (Fig. 4a). We then 184 investigated the allele frequencies of these sites across methods, and found that alleles 185 identified by capture sequencing displayed lower frequencies than the other two methods, 186 especially for GZMU0014, GZMU0030, and GZMU0042 where the viral load was lower (Ct 187 ≥ 29), which explained why capture sequencing neglected an SNV in our pipeline when the 188 cutoff of SNV calling was set as 80% allele frequency (Fig. 4b). These data indicate that 189 amplicon sequencing is more accurate than capture sequencing in identifying SNVs, espe-190 cially for challenging samples. sequencing has also allowed us to investigate RNA expression patterns of the overall mi-210 crobiome and host content and thus suitable for discovering new viruses, distinguishing co-211 infections, and dissecting virus-host interactions. To explore the microbiota, we performed 212 further metatranscriptomics analysis of the clinical samples. We were able to identify host 213 nucleic acids in all of the samples, and over 95% of total reads were from the host in sputum, 214 nasal, and throat samples ( Supplementary Fig. 4a). Virus contributed to less than 5% of 215 reads in anal swab and throat swab while more than 50% of reads in nasal swab (Supple-216 mentary Fig. 4b). These results suggest nasal swab could be the most ideal sample type for 217 viral detection among the four sample types, which agrees with recent clinical evidence 25 . 218 Among the viral reads, over 90% were Coronaviridae, which is consistent with clinical diag-219 nostics ( Supplementary Fig. 4c). Reads from other viruses were also identified, indicating 220 further measurements could be taken to confirm if co-infection exists ( Supplementary Fig.  221 4). Bacterial composition was also shown, providing support for scientific research, as well bioinformatics pipelines tailored in this study, e.g., 1) the bias of amplicon sequencing can 252 be improved by reducing the amount of cycles in the 1st PCR or optimize the molar ratios 253 of primers (Fig. 1a), 2) the amplicon sequencing is particularly convenient compared with 254 previous counterparts since the fragmentation and library construction steps are omitted 255 cons using single-end 400 nt reads (Fig. 1a), 3) using anything less than 506 pieces of 120 257 ssDNA probes in hybrid capture may attenuate the sequencing coverage while increase the 258 bias, 4) metatranscriptomic sequencing was conducted with an ultra-high-throughput se-259 quencing platform so that the successful rate was substantially higher than usual. 5) the 260 minimal amount of data necessary for analyzing the HCoV-19 genome from clinical samples 261 across methods is higher than that predicted by data from the cultured isolate was probably 262 due to the high nucleic acids background from the host and other microbes (Supplementary 263 Table 3-4, Supplementary Fig. 4). Also, we do not take into account the time spent in se- 100 nt strategy using the same protocol described above, generating 37 Gb sequencing 305 data for each sample on average. 306

Amplicon-based enrichment and sequencing 307
Total RNA was reverse transcribed to synthesize the first-strand cDNA with random 308 hexamers and Superscript II reverse transcriptase kit (Invitrogen, Carlsbad, USA).  Table 5. 20 µl of first-strand instructions, including lambda phage gDNA unless specified. 2 ng of Human gDNA was 317 added to each PCR reaction of the cultured isolate. The PCR was performed as follows: 5 318 min at 37°C, 10 min at 95°C, 15 cycles of (10 s at 95°C, 1min at 64°C, 1min at 60°C to 10 s reaction were unique dual indexed. After the 2nd PCR, products were purified following the 324 same procedures as the 1st PCR and quantified using the Qubit dsDNA High Sensitivity 325 assay on Qubit 3.0 (Life Technologies). PCR products of samples yielding sufficient material 326 (> 5 ng/µl) were pooled at equimolar to a total DNA amount of 300 ng before converting to 327 single-stranded circular DNA. DNBs-based libraries were generated from 20 μl of single-328 stranded circular DNA pools and sequenced on the MGISEQ-2000 platform with single-end 329 400 nt strategy, generating 1.8 Gb sequencing data for each sample on average. 330

sequencing data 378
HCoV-19 reads of metatranscriptomic and hybrid capture sequencing data were identified 379 by aligning the HcoV-19-like reads to the HCoV-19 reference genome (GISAID accession: 380 EPI_ISL_402119) with BWA in a strict manner of coverage ≥ 95% and identity ≥ 90%. For 381 comparisons of the coverage and depth of the viral genome across samples and methods, 382 we normalized the viral reads to total sequencing reads with HCoV-19 Reads Per Million 383 (HCoV-19-RPM). HCoV-19-RPM for amplicon sequencing data was calculated by the same 384 pipelines we applied for metatranscriptomic and hybrid capture sequencing data. 385 To estimate the minimum data requirements for genome assembling and intra-individual 386 variation analysis, we applied gradient-based sampling to the HCoV-19 genome align-387 ments (referred to BAM files) to each dataset using Samtools (v1.9) 34 . The effective genome 388 coverage was set as 95% for all three MPS methods. Considering the distinct 389 technologies used in different methods, we set method-dependent thresholds of effective 390 depth as follows: 1) ≥ 10X for metatranscriptomic sequencing; 2) ≥ 20X for hybrid capture 391 sequencing; and 3) ≥ 100X for amplicon sequencing. We next calculated the coverage and 392 depth within each subsampled BAM file per sample to determine the minimal BAM file that 393 could meet the above thresholds of both coverage and sequencing depth. The method-394 dependent minimum amount of sequencing data of each sample were estimated accord-395 ingly. We assessed the correlations between the HCoV-19 genome copies per mL in diluted 396 samples of cultured isolates and the minimum amount of sequencing data for amplicon-397 and capture-based methods using Pearson correlation coefficient (R) with the function 398 scatter from the R package ggpubr (v3.6.1) 38 . 399

Consistency in variants calling performance among methods 400
Except for amplicon sequencing samples, variants calling in metatranscriptomic and hybrid 401 For amplicon sequencing data, only base positions covered by ≥100X reads were used for 417 iSNVs calling. For metatranscriptomic and hybrid capture-based sequencing data, the 418 thresholds of depth were set as 10X and 20X, respectively. The candidate iSNVs were fur-419 ther filtered for quality as follows: 1) frequency filtering, only minor alleles (frequency ≥ 5% 420 and <50%) and major alleles (frequency ≥ 50% and ≤ 95%) were remained; 2) depth filter-421 ing, iSNVs with fewer than five forward or reverse reads were removed; and 3) strand bias 422 filtering (not applicable to single-end reads of amplicon sequencing), iSNVs were removed 423 if there were more than a 10-fold strand bias or a 5-fold difference between the strand bias 424 of the variant call and that of the reference call. 425

Taxonomy of clinical samples by unbiased metatranscriptomic sequencing 426
For metatranscriptomic sequencing of clinical samples, raw sequencing data of a single se-427 quence lane (approximately 60-75 Gb per sample) was used to simultaneously assess the 428 RNA expression patterns of human, bacteria and viruses in clinical samples from COVID-429 19 patients. We first used software fastp (v0.19.5) 27 to filter low-quality reads and remove 430 adapter with parameters: -5 -3 -q 20 -c -l 30. After QC, we mapped high-quality reads to 431 hg19 and removed human ribosomal RNA (rRNA) reads by SOAP2 v2.21 42 (parameters: -432 m 0 -x 1000 -s 28 -l 32 -v 5 -r 1), and the remaining RNA reads were then aligned to hg19 433 by HISAT2 43 with default settings to identify non-rRNA human transcripts as previously de-434 scribed. Next, we applied Kraken 2 44 (version 2.0.8-beta, parameters: --threads 24 --confi-435 dence 0) to assign microbial taxonomic ranks to non-human RNA reads against the large