Intra-host Variation and Evolutionary Dynamics of SARS-CoV-2 Population in COVID-19 Patients

As of middle May 2020, the causative agent of COVID-19, SARS-CoV-2, has infected over 4 million people with more than 300 thousand death as official reports1,2. The key to understanding the biology and virus-host interactions of SARS-CoV-2 requires the knowledge of mutation and evolution of this virus at both inter- and intra-host levels. However, despite quite a few polymorphic sites identified among SARS-CoV-2 populations, intra-host variant spectra and their evolutionary dynamics remain mostly unknown. Here, using deep sequencing data, we achieved and characterized consensus genomes and intra-host genomic variants from 32 serial samples collected from eight patients with COVID-19. The 32 consensus genomes revealed the coexistence of different genotypes within the same patient. We further identified 40 intra-host single nucleotide variants (iSNVs). Most (30/40) iSNVs presented in single patient, while ten iSNVs were found in at least two patients or identical to consensus variants. Comparison of allele frequencies of the iSNVs revealed genetic divergence between intra-host populations of the respiratory tract (RT) and gastrointestinal tract (GIT), mostly driven by bottleneck events among intra-host transmissions. Nonetheless, we observed a maintained viral genetic diversity within GIT, showing an increased population with accumulated mutations developed in the tissue-specific environments. The iSNVs identified here not only show spatial divergence of intra-host viral populations, but also provide new insights into the complex virus-host interactions.

RT-qPCR (Table S1). All patients had direct contacts with confirmed cases during the early 72 stage of the outbreak. Most patients, except P15 and P62, had severe symptoms and received 73 mechanical ventilation in ICU, including the patient P01 who passed away eventually. The 74 4 patient P01 also showed much lower antibody (IgG and IgM) responses (Table S1) compared 75 to other patients. We then deep sequenced the 62 clinical samples using metatranscriptomic 76 and/or hybrid capture methods ( Table S1). The numbers of SARS-CoV-2 reads per million 77 (SARS-CoV-2 RPM) among the metatranscriptomic data correlated well with the corresponding 78 RT-qPCR cycle threshold (Ct), reflecting a robust estimation of viral load (R = 0.71, P = 6.7e-11) 79 (Fig. 1a). The respiratory tract (RT: Nose, Sputum, Throat) and gastrointestinal tract (GIT: Anus, 80 Feces) samples showed higher SARS-CoV-2 RPMs compared to gastric mucosa and urine 81 samples (Fig. 1b). Furthermore, RT and GIT samples from two patients with mild symptoms 82 showed relatively low viral loads among their respective sample types. The data here may 83 reflect an active replication of SARS-CoV-2 in RT and GIT, especially in patients with severe 84 symptoms 3,4 .

85
Here, using metatranscriptomic data, we obtained 32 consensus complete genomes 86 from the clinical samples with at least 60-fold sequence coverage (Table S1 and Table S2).

87
Comparing the assemblies to the reference sequence (GISAID accession: EPI_ISL_402119) 88 revealed 14 consensus variants (6 synonymous and 8 non-synonymous) located mostly in 89 ORF1ab, S and N genes ( Table S2). Most of the consensus variants were also detected among 90 public sequences, including the widespread associated variants (C8782T and T28144C) 91 detected in four patients (P10, P13, P14 and P62). The novel consensus variant causes a 92 frameshift at the end of ORF8 in the patient P14, showing the phenotypic plasticity during the 93 evolution of SARS-CoV-2. Evolutionary relationships showed that the consensus SARS-CoV-2 94 genomes of the GZMU cohort belonged to distinct clades, including clades defined by T28144C 95 and A23403G, respectively (Fig. 1c). Remarkably, we observed distinct SARS-CoV-2 genomes 96 co-existed in the GIT samples of the patient (P08) with three nucleotide differences ( Fig. 1d and  Although plenty of polymorphic sites were identified among SARS-CoV-2 populations, 100 intra-host variant spectra of closely related viral genomes are mostly disguised by the 101 consensus sequences. We firstly examined the reproducibility of our experimental procedures 102 for allele frequency identification. Only a minor difference of alternative allele frequencies (AAFs) 103 was observed among biological replicates of two selected samples (Fig. S1), showing that the 104 estimated population composition was marginally affected by independent experimental 105 procedures. To control false discovery rate, we applied a stringent approach to detect iSNVs.

106
The iSNVs were identified from the 32 samples using metatranscriptomic data and then verified 107 using hybrid capture data, which are available for most (27/32) samples (Table S3 and Table   108 S4). Overall, we observed 1 to 23 iSNVs in six patients with a cut-off of 5% minor allele 109 frequency ( Fig. 2a and Fig. 2b). When an iSNV was discovered in one patient, we reduced the 110 cut-off to 2% to detect that iSNV from the rest samples of the same patient (see methods). The

111
AAFs of iSNVs detected from the metagenomic data correlated well with those of the hybrid 112 capture data (Spearman's ρ = 0.99, P < 2.2e-16; Fig. S2). Furthermore, the numbers of the 113 observed iSNVs did not correlate with the sequencing coverage ( Fig. S3), suggesting that the 114 coverage of metatrancriptomic data was sufficient to estimate intra-host variation in most 115 samples.

116
We further analyzed intra-host variation across genes for evidence of natural selection.

117
Overall, the 40 identified iSNV sites (10 synonymous iSNVs and 30 non-synonymous iSNVs) 118 distributed evenly across genomic regions ( Fig. 2c; Table S3). High proportion of non-119 synonymous iSNVs suggests that most iSNVs were either under frequent positive selection or 120 insufficient purifying selection. However, we did not observe significant difference in AAFs 121 between non-synonymous and synonymous iSNVs (Fig. 2d) and among codon positions ( Fig.   122 S4), indicating a relaxed intra-host selection. It is likely that most of those non-synonymous 123 iSNVs will be removed by purifying selection and/or genetic drift in a longer timescale 6 . 6 Nonetheless, the exact functional and evolutionary relevance of the intra-host variants remain to 125 be explored.

126
One central task when estimating intra-host variation is to identify the source of iSNVs.

127
Overall, the distribution of the iSNVs among samples does not correlate well with the consensus 128 SNPs (Fig. 2a). Samples carrying the same consensus SNPs generally had different iSNVs, 129 particularly in P01, P10 and P13. Here, we classified the iSNVs into i) rare iSNVs (30/40) 130 detected in a single patient, and ii) common iSNVs (10/40) detected in at least two patients 131 and/or identical to consensus variants. The ten common iSNVs did not show significant higher 132 AAFs than the rare iSNVs ( Fig. 2e). Notably, the ten common iSNVs include two iSNVs 133 (G11083T and C21711T) exclusively detected in the GIT populations of P01, P08 and P10

134
( Table S4). Among the common iSNVs, G11083T is the most widespread consensus variant 135 distributed in multiple lineages of SARS-CoV-2, suggesting that it might derive from recurring 136 mutations on distinct strains rather than the mutation on a single ancestral strain. Interestingly, 137 although G11083T was detected as an intra-host variant in the GIT samples of three patients, it 138 was not detected in the corresponding RT samples, indicating a recurrent mutation of this loci, 139 especially in the GIT population. Interestingly, G11083T locate in a region encoding a predicted 140 T-cell epitope 7 , suggesting that recurrent mutation may provide genetic plasticity to better adapt 141 against host defenses.

142
Using Shannon entropy, we observed a significantly higher genetic diversity within the 143 GIT samples than that of RT samples (Wilcoxon rank-sum test, P = 1.4e-05; Fig. 3a and Table   144 S5), reflecting an increased viral population size within the GIT samples. We further investigated 145 the genetic differentiation between the two places. Notably, no iSNVs was shared between RT 146 and GIT samples from the same patients, suggesting a clear genetic divergence among intra-147 host viral populations. Here we used L1-norm distance to estimate genetic dissimilarity among 148 samples based on iSNVs and their AAFs and compared that between samples within and 149 among hosts ( Fig. 3b and Table S6). As expected, genetic distances among samples from the 7 same host were smaller than those among inter-host samples ( Fig. 3b and Table S6). Within 151 each host, the greatest genetic differentiation was observed among GIT samples and between 152 GIT and RT samples, while the differentiation among RT samples was relatively small. For 153 example, seven iSNVs were shared among the GIT samples of P01, while none of them was 154 observed in RT samples (Fig. 2a). It seems that the frequent genetic divergence between GIT 155 and RT populations is mostly driven by bottleneck events during distant intra-host transmissions.

166
including four iSNVs with increased AAFs and two iSNVs with decreased AAFs across the three 167 sampling dates (Fig. 4a). Given their similar growth rates but distinct allele frequencies, it is 168 likely that more than two genetically related haplotypes co-existed in within P01. Similar patterns 169 were also observed in the GIT population of P08 (Fig. 4b). Notably, the dynamics of intra-host 170 variants changed the consensus allele (>50%) of three genomic loci (3160, 21711 and 28854) 171 of P08. Taken together, the iSNVs and their frequencies suggest that the viral populations in 172 GIT is more stable than those in RT. Nonetheless, in both P01 and P08, we observed increased 173 AAFs of C21711T and G11083T, suggesting that these two variants might be adaptively 174 selected, especially in the GIT. Whether viral adaptation is involved in the intra-host divergence 175 among distant populations warrants further investigation.

8
We further phased the proximal iSNVs into local haplotypes using paired-end mapped 177 reads (Table S7). Most minor haplotypes had one nucleotide difference from the dominant 178 haplotype of the same sample, suggesting that they might derive from the main strain of the 179 population. Nonetheless, we observed one exception in the GIT population of P01, covering the 180 variable sites of C21707T, C21711T and A21717G (Fig. S5). With the cut-off of 1%, one

197
The stochastic process between and within intra-host populations seems to also attenuate the

203
This result is also consistent with the high viral load in GIT (Fig. 1b)

270
To prevent false discovery, base positions reporting an alternative allele with the following 271 conditions were masked as N: 1) sequencing coverage less than 5-fold; 2) sequencing 272 coverage less than 10-fold and the proportion of reads with the alternative allele less than 80%.

273
The collected coronaviridae-like reads were also de novo assembled using SPAdes (v. 3.14.0)

281
Available consensus sequences of SARS-CoV-2 (Table S8) were collected from GISAID 282 database (https://www.gisaid.org/) on 5th April, 2020, after the removal of highly homologous 283 sequences, 122 representative virus strains (Table S8) were used to infer evolutionary 284 relationships with the assembled genomes. Within the GZMU cohort, only one genome was 285 selected when more than one identical genome was achieved from the same patient. The 286 assembled SARS-CoV-2 and selected representative genomes were aligned using MAFFT with 287 default settings. A maximum likelihood (ML) tree was inferred using the software IQ-TREE

288
(v.1.6.12) 19 , with the best fit nucleotide substitution model selected by ModelFinder from the 289 same software. The inferred ML tree was then visualized using the R package ggtree 20 (v.3.10).

290
Major branches and the defining nucleotide mutations were manually labelled.

311
(https://github.com/alimanfoo/pysamstats), and then subject to iSNV site identification with 312 following criteria: 1) base quality larger than 20; 2) sequencing coverage of paired-end mapped 313 reads >= 10; 3) at least five reads support the minor allele 4) minor allele frequency >= 5%; 5) 314 strand bias ratio of reads with the minor allele and reads with major allele less than ten-fold. To 315 minimize false discoveries, sites with more than one alternative allele were filtered out.

316
Biological effects of the identified iSNVs were annotated using the SnpEff (v.2.0.5) with default 317 settings. Alternative allele frequencies (AAFs) at the identified iSNV sites were measured by the 318 proportion of paired-end mapped reads with alternative alleles. When an iSNV was detected in 319 one patient, the detection cut-off of that iSNV was reduced to 2% for the rest samples of the 320 same patient. Only the AAFs more than 2% with at least three reads were kept for the following 321 analyses. All the iSNVs were verified using hybrid capture data when available. At the iSNV 322 sites, the allele with higher frequency was defined as major allele, while one with less frequency 323 was defined as minor allele, regardless whether it is different from the reference allele. A

324
heatmap was generated to visualize the AAFs for all samples using the pheatmap package in R

325
(v.3.6.1). A subset of the identified iSNVs were validated by Sanger sequencing using the