Sampling and sequencing
COVID-19-positive cases were selected from all nasopharyngeal swabs sent to the Public Health Laboratory of Québec (Laboratoire de santé publique du Québec, LSPQ) from the beginning of the pandemic until June 1st, 2020. During this period, we aimed to sequence as many samples as possible, although the fraction of cases sequenced dropped after late March 2020. Here we present 2921 high-quality SARS-CoV-2 genomes (Additional file 2: Table S1). Our sequencing strategy during this time had no specific bias towards outbreaks; the dataset is therefore relatively unbiased, with the caveat that sampling during the first wave was biased towards travellers, symptomatic individuals, and those directly exposed to a known case. We attempted to sequence all randomly selected qPCR-positive samples, without filtering on cycle threshold (Ct) value. To protect patient confidentiality, sampling dates in the publicly released data are only reported accurately from March 10th, 2020 onwards; all samples taken before that date are assigned March 1st, with true sampling dates falling between February 25th and March 9th, 2020. The true sampling dates were used in the phylogenetic analyses described below. Note that our dataset of 2921 consensus sequences is available through GISAID and the Canadian SARS-CoV-2 repository, CanCOGeN's VirusSeq Portal (https://virusseq-dataportal.ca/explorer) (Additional file 2: Table S1).
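To illustrate the date-masking rule, a minimal R sketch (with a hypothetical metadata file and column names) would be:

```r
# Date masking for the public release (hypothetical column names):
# true sampling dates before March 10th, 2020 are replaced by March 1st, 2020.
meta <- read.csv("sample_metadata.csv", stringsAsFactors = FALSE)
meta$sample_date <- as.Date(meta$sample_date)
meta$public_date <- meta$sample_date
meta$public_date[meta$public_date < as.Date("2020-03-10")] <- as.Date("2020-03-01")
```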
Total nucleic acid extraction was performed on the NucliSENS easyMAG automated platform using 200 μL of nasopharyngeal swab specimen. The presence of SARS-CoV-2 was assessed by a qPCR diagnostic test targeting genes E and N [14]. Targeted SARS-CoV-2 amplification, library preparation, and sequencing were performed at the McGill Genome Centre as follows. Briefly, RNA samples were processed in a 96-well plate format, including positive and negative controls on each plate. A targeted amplification strategy was used based on the ARTIC V3 primer scheme (https://github.com/artic-network/artic-ncov2019 [19]), using the V3 primers only, without the redundant V1 primers. For primer pairs 5, 17, 23, 26, 66, 70, 74, 91, 97, and 64, for which lower coverage was observed, a separate low-amplification (LA1) pool was prepared to increase the number of reads in the corresponding regions. For post-PCR cleanup, pools 1 and 2 were combined, while pool LA1 was cleaned up separately, quantified, and added to the combined pools 1 + 2 at equimolar concentration. Samples from plates 1–4 were prepared for Nanopore sequencing as described: https://www.protocols.io/view/sars-cov-2-mcgill-nanopore-sequencing-protocol-sup-bjajkicn. For Nanopore sequencing, we used native barcodes on pooled amplicons and loaded 20–40 ng of library onto the flow cell. Samples from plates 5–8 were prepared for Nextera Flex Illumina sequencing as described:
https://www.protocols.io/view/sars-cov-2-mcgill-nextera-flex-sequencing-protocol-bisbkean. For samples sequenced on Illumina, the Nextera Flex kit was used starting from 150 ng of DNA, following the manufacturer's procedure. Plate 9 was sequenced using both Nanopore and Illumina technologies, as well as with the CleanPlex assay (Paragon Genomics) followed by MGI sequencing. For samples that were sequenced more than once, the dataset with the higher coverage was used to generate consensus sequences and for subsequent phylogenetic analysis. For the CleanPlex assay, we used the CleanPlex for MGI SARS-CoV-2 research panel (Paragon Genomics), which utilizes 343 primer pairs tiled across the viral genome as described [20]. The manufacturer's protocol (UG4002-1) was followed, with the modification of increasing the multiplex PCR cycle number to 16 to improve sequencing of samples with qPCR Ct values > 29, followed by sequencing on an MGI DNBSEQ G400 instrument.
Basecalling and consensus sequence generation
All samples were aligned to the reference genome of the Severe Acute Respiratory Syndrome Coronavirus 2 isolate Wuhan-Hu-1 (GenBank accession MN908947.3) [21]. Aligned reads were then used to produce a consensus sequence using pipelines based on the ARTIC Network novel coronavirus bioinformatics protocol (https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html). A brief description of the pipeline, including software packages and important parameters, is provided for each sequencing platform below.
Datasets produced using the Nextera Flex Illumina protocol were first filtered to remove host reads. To do so, reads were aligned to a hybrid reference combining SARS-CoV-2 (MN908947.3) and GRCh38 using bwa-mem (v0.7.17) [22], and any read mapping to the human portion of the reference, regardless of mapping quality (including zero), was removed from the dataset. The remaining reads were trimmed using cutadapt (v2.10) [23], then aligned to the SARS-CoV-2 reference (MN908947.3) using bwa-mem (v0.7.17) [22]. After alignment, reads were filtered using sambamba (v0.7.0) [24] to remove read pairs with an insert size outside the 60–300-bp range, as well as unmapped reads, secondary alignments, and reads not in FR/RF orientation. iVar (v1.3) [25] was used to trim any remaining primers. Samtools (v1.9) [26] was used to produce a pileup, which was then used as input to iVar (v1.3) to create a consensus sequence for regions with a minimum of 10× depth, using reads with a Q score > 20 and a minimum allele frequency of 0.75. A full description of the process can be found here:
https://c3g.github.io/covseq_McGill/SARS_CoV2_Sequencing/Illumina_overview.html.
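The exact commands are documented at the link above; purely as an illustration, the final consensus-calling step with the stated thresholds could look like the following (file names are placeholders), wrapped in R for consistency with our other code examples:

```r
# Pileup piped into iVar consensus: >= 10x depth, base quality > 20,
# minimum allele frequency 0.75 (file names are placeholders).
system(paste(
  "samtools mpileup -aa -A -d 0 -B -Q 0 sample.primertrimmed.sorted.bam |",
  "ivar consensus -p sample.consensus -q 20 -t 0.75 -m 10"
))
```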
Datasets produced using the CleanPlex MGI protocol were processed with the same pipeline as the Illumina Nextera Flex samples, except that the ARTIC Network primer and amplicon information was replaced with the corresponding CleanPlex information. A full description of the process can be found here:
https://c3g.github.io/covseq_McGill/SARS_CoV2_Sequencing/MGI_overview.html.
Raw data produced by Nanopore sequencing were basecalled using guppy (v3.4.4) [27] with the high-accuracy model (dna_r9.4.1_450bps_hac). Reads were demultiplexed with guppy (v3.4.4), requiring barcodes on both ends. Reads were filtered by size to remove anything outside the 400–700-bp range using the ARTIC Network "guppyplex" tool. Reads were aligned with minimap2 (v2.17) [28], then filtered to remove incorrect primer pairs, and high-depth regions were randomly downsampled to 800× depth per strand using the ARTIC Network framework. Nanopolish (v0.13.1) [29] was used to call variants in regions with a minimum depth of 16× and a flank of 10 bp. After masking regions with coverage below 20×, the called variants were used to generate a consensus sequence with bcftools (v1.9) [26] consensus. A full description of the process can be found here:
https://c3g.github.io/covseq_McGill/SARS_CoV2_Sequencing/ONT_overview.html.
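Again as an illustration only (sample, directory, and scheme names are placeholders), the size filtering and variant-calling steps correspond to the standard ARTIC commands:

```r
# Size-filter reads to the 400-700-bp amplicon range.
system2("artic", c("guppyplex",
  "--min-length", "400", "--max-length", "700",
  "--directory", "fastq_pass/barcode01",
  "--output", "sample01.fastq"))

# Align, filter primer pairs, downsample high-depth regions, and call
# variants with nanopolish (the ARTIC v1 default without --medaka).
system2("artic", c("minion",
  "--normalise", "800",
  "--threads", "8",
  "--read-file", "sample01.fastq",
  "--fast5-directory", "fast5_pass",
  "--sequencing-summary", "sequencing_summary.txt",
  "--scheme-directory", "primer_schemes",
  "nCoV-2019/V3", "sample01"))
```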
For samples sequenced with two or more technologies, all datasets were processed separately using the methods described above, and the resulting consensus sequences were compared so as to keep only the most complete consensus for downstream analyses, as determined by the number of missing bases (Ns). After excluding consensus sequences with > 5% Ns, we were left with 2921 consensus sequences for further analysis (an 83.14% passing rate). The consensus sequences were deposited in GISAID under the accession numbers listed in Additional file 2: Table S1. The raw sequence data are available in NCBI as described in the "Availability of data and materials" section below. The list of authors and laboratories for consensus sequences obtained from GISAID is in Additional file 3: Table S2.
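A minimal R sketch of the completeness filter (assuming a single FASTA of candidate consensus sequences):

```r
library(ape)
# Fraction of Ns per consensus; keep sequences with <= 5% Ns. For samples
# sequenced on several platforms, the consensus with the fewest Ns is kept.
seqs   <- read.FASTA("all_consensus.fasta")
n_frac <- sapply(seqs, function(s) mean(as.character(s) == "n"))
keep   <- names(seqs)[n_frac <= 0.05]
```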
Positive and negative controls
Across our dataset, we included two types of negative controls and two types of positive controls:
- Extraction-negative controls (Ext controls) consist of a water blank included in every plate of samples before extraction. They are processed alongside the samples from RNA extraction onwards.
- Reverse transcription-negative controls (RT controls) consist of an extraction buffer blank included in plates before reverse transcription. They are processed alongside the samples from reverse transcription onwards.
- The viral culture-positive control consists of RNA extracted from a viral culture of the B.1 lineage carrying 6 known SNVs. It is processed alongside the samples from reverse transcription onwards.
- The AccuKit RNA/cDNA positive control (Accugenomics Inc., catalog number 1231) consists of artificial RNA or cDNA molecules with introduced SNVs every 90 bp. The artificial molecules do not cover the whole genome: the RNA version covers 94% of the SARS-CoV-2 genome, whereas the cDNA version covers 83%. It is processed alongside the samples from reverse transcription onwards (RNA version) or immediately after (cDNA version). For more information, consult the provider's website at https://accugenomics.com/accukit-sars-cov-2/. Note that a previous version of the kit was used for this dataset; notably, the cDNA version of the kit is no longer available.
All cases detected before March 17, 2020, were processed in plates that included only extraction-negative controls and viral culture-positive controls. Samples detected after that date were processed in plates including all four controls (with only one of the two versions of the AccuKit control included). All controls were processed using the same bioinformatic pipelines as the rest of the samples.
For a plate to pass quality control, extraction controls were not allowed to exceed 2× average depth of coverage or to produce a consensus sequence covering more than 1% of the genome length. Only 2 of 42 plates failed this test during processing, and the samples on those plates were repeated. Reverse transcription controls were held to the same thresholds, but none exceeded them during processing.
Positive controls were evaluated based on coverage (consensus sequences of AccuKit controls should not cover more of the genome than the percentage targeted by the kit) and on detection of the known SNVs. All positive controls passed these evaluations.
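As a sketch of the negative-control check (assuming a hypothetical per-control summary table), in R:

```r
# Columns assumed: plate, type ("Ext"/"RT"), mean_depth, pct_genome.
ctrl <- read.csv("control_metrics.csv")
neg  <- subset(ctrl, type %in% c("Ext", "RT"))
# A plate fails if a negative control exceeds 2x average depth or yields a
# consensus covering more than 1% of the genome length.
failed_plates <- unique(neg$plate[neg$mean_depth > 2 | neg$pct_genome > 1])
```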
Phylogenetic analysis
Raw and time-scaled phylogenomic trees were built using the Nextstrain pipeline (https://github.com/nextstrain, version 1.16.2) [30] installed in a conda environment (https://github.com/conda/conda, version 4.8.3). This pipeline uses the Augur toolkit (https://github.com/nextstrain/augur, version 7.0.2) [31] to filter, align, and mask genomic sequences, build divergence and time-scaled trees, and produce an output file processed by the Auspice web interface (https://github.com/nextstrain/auspice, version 2.16.0) [30] for exploring phylodynamic and phylogenomic data. Augur removed all sequences shorter than 27,500 bp, as well as those sampled after June 1st, 2020. The Augur align module was then called to perform multiple sequence alignment with MAFFT (https://github.com/GSLBiotech/mafft, version v7.463) [32] using Wuhan-Hu-1 (GenBank accession MN908947) as the reference genome. The final alignment was masked at the beginning (first 100 sites), at the end (last 50 sites), and at positions 18529, 29849, 29851, and 29853 (sites of known low sequencing quality and homoplasies). We then used Augur to select the sequences from GISAID most similar to our 2921 Québec sequences. These global context sequences were grouped by country and month, keeping a maximum of 100 sequences and at most 5 identical sequences per country-month combination. In this way, a total of 12,801 genomes were pulled from GISAID on October 30th, 2020 (Additional file 3: Table S2).
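For illustration, in recent Augur versions the filtering, alignment, and masking steps described above take roughly the following form (file names are assumptions):

```r
# Drop short sequences and those sampled after June 1st, 2020.
system2("augur", c("filter",
  "--sequences", "sequences.fasta", "--metadata", "metadata.tsv",
  "--min-length", "27500", "--max-date", "2020-06-01",
  "--output", "filtered.fasta"))
# Align against the Wuhan-Hu-1 reference with MAFFT.
system2("augur", c("align",
  "--sequences", "filtered.fasta",
  "--reference-sequence", "MN908947.gb",
  "--output", "aligned.fasta"))
# Mask alignment ends and known problematic sites.
system2("augur", c("mask",
  "--sequences", "aligned.fasta",
  "--mask-from-beginning", "100", "--mask-from-end", "50",
  "--mask-sites", "18529", "29849", "29851", "29853",
  "--output", "masked.fasta"))
```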
We used IQ-TREE (http://www.iqtree.org/, version 1.6.12) [33] with the GTR substitution model to construct one phylogenetic tree of Québec-only sequences and another of Québec plus global context sequences. Branch lengths, sampling dates, and ancestral states (geographic regions, nucleotide and amino acid sequences) at internal nodes were inferred with the Augur refine and traits modules, which call TreeTime (https://github.com/neherlab/treetime, version 0.7.5) [34] (using the same default parameters as the public builds; https://github.com/nextstrain/ncov). Finally, the Augur export module compiles the single results file required for data visualization in Auspice. All Nextstrain analyses were executed on a 64-bit CentOS server version 7.4.1708 using 40 CPUs.
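The corresponding tree-building and time-scaling calls would take roughly this form (file names assumed):

```r
# Maximum-likelihood tree under GTR; IQ-TREE writes masked.fasta.treefile.
system2("iqtree", c("-s", "masked.fasta", "-m", "GTR", "-nt", "AUTO"))
# Time-scale the tree and infer branch lengths with TreeTime via Augur.
system2("augur", c("refine",
  "--tree", "masked.fasta.treefile",
  "--alignment", "masked.fasta",
  "--metadata", "metadata.tsv",
  "--timetree",
  "--output-tree", "timetree.nwk",
  "--output-node-data", "branch_lengths.json"))
```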
Clade assignment was done during the Nextstrain build. As input, the Augur clades module takes the phylogenetic tree, the observed and inferred nucleotide sequences at each node, and a clade configuration file in which each clade is defined by a specific combination of positions and nucleotide variants. As an alternative clade assignment scheme, we also used the Phylogenetic Assignment of Named Global Outbreak Lineages (pangolin) [12] combined with lineages (https://github.com/cov-lineages/lineages) version 2020-05-09 (Pangolin 2.3.2 and pangoLearn 2021-02-21, https://cov-lineages.org/pangolin_docs/pangolearn.html).
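Illustrative invocations for the two assignment schemes (recent tool versions; input and output names are placeholders):

```r
# Nextstrain clade assignment from the tree, per-node mutations, and the
# clade configuration file.
system2("augur", c("clades",
  "--tree", "timetree.nwk",
  "--mutations", "nt_muts.json", "aa_muts.json",
  "--clades", "clades.tsv",
  "--output-node-data", "clades.json"))
# PANGO lineage assignment with pangolin.
system2("pangolin", c("consensus_sequences.fasta",
                      "--outfile", "lineage_report.csv"))
```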
To infer introduction events into Québec (QC), we used discrete character ancestral state reconstruction (ASR) to label nodes of the global context time tree as QC or non-QC. Three methods were implemented in R [35]: (1) maximum likelihood (the ace function in the ape package v5.4-1 [36], assuming the equal rates model), and unordered Fitch parsimony (the fitch.mvsl function in mvSLOUCH v2.6.1 [37]) with either (2) delayed (DELTRAN) or (3) accelerated (ACCTRAN) transformation algorithms to resolve ambiguous nodes. Nodes were assigned to the QC state when supported by ≥ 50% of the marginal likelihood (ML) or by a parsimony state assignment of 1. To find transitions from non-QC to QC, we first identified every QC node or tip preceded by a non-QC node; among this set, we kept only the most basal nodes, discarding any node that descends from another node in the set. The parent of each remaining node is the non-QC node of the transition, and the introduction event is considered to have happened along that branch. Note that these methods likely underestimate the number of introductions. For each non-QC to QC transition, the most basal QC leaf (or leaves) was recorded. These candidates were then cross-checked with travel history data and were only recorded if at least one leaf had travel history. For genetically identical genomes in two travellers with the same travel history, we assumed a single introduction event; this is a conservative estimate, since in principle both could have caused secondary infections in Québec, and counting by travel history alone would have yielded two separate introductions. In the case of a polytomy with multiple basal QC sequences, the one with the shortest branch length was chosen. If no travel history was available, the closest outgroup of the introduction event was used to assign its likely origin. The descendants of each identified transition were used to define QC transmission lineages. The date of the non-QC node (immediately prior to the first QC node) was used as the TMRCA of the introduction event (Additional file 1: Figure S1). For the largest QC transmission lineages (containing > 20 cases), the TMRCA was also inferred using BEAST (see below). Phylogenetic visualizations and dataset manipulation were done in R using a suite of packages: ape [36], phylotools [38], phytools [39], phangorn [40], tidyverse [41], ggtree [42], and treeio [43].
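A minimal sketch of the maximum-likelihood variant of this procedure, assuming a rooted time tree and a hypothetical tip-naming convention for Québec sequences:

```r
library(ape)

tree   <- read.tree("global_context_timetree.nwk")
# Hypothetical naming convention: Québec tips start with "Canada/Qc-".
states <- setNames(ifelse(grepl("^Canada/Qc-", tree$tip.label), "QC", "nonQC"),
                   tree$tip.label)

# ML ancestral state reconstruction under the equal-rates model.
fit <- ace(states, tree, type = "discrete", model = "ER")

# Assign internal nodes to QC when supported by >= 50% marginal likelihood.
node_state <- ifelse(fit$lik.anc[, "QC"] >= 0.5, "QC", "nonQC")
all_states <- c(states[tree$tip.label], node_state)  # tips 1..Ntip, then nodes

# Candidate introductions: edges whose parent is non-QC and child is QC.
parent <- tree$edge[, 1]
child  <- tree$edge[, 2]
introductions <- which(all_states[parent] == "nonQC" & all_states[child] == "QC")
length(introductions)  # number of candidate introduction events
```

The descendants of each such edge's child node define a candidate QC transmission lineage, to be cross-checked against travel history as described above.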
Phylodynamics
The molecular clock signal was assessed by plotting the root-to-tip phylogenetic distance against time using TempEst [44]. The largest (> 20 sequenced cases) QC transmission lineages were analyzed using Bayesian phylogenetic tree reconstruction with Markov chain Monte Carlo (MCMC) implemented in BEAST v2.6.2 [45] with the birth-death skyline (BDSKY) model, assuming a gamma-distributed Hasegawa-Kishino-Yano (HKY) nucleotide substitution model (with a uniform prior 0.25 [0, 1] on the nucleotide frequencies, a lognormal prior 2 [0, ∞] on κ, and a gamma category count of 4 with an exponential prior 0.5 [0, ∞] on the shape parameter), and a strict molecular clock (0.8 × 10−3 substitutions/site/year). Using this model, we estimated the effective reproduction number (Re), TMRCA, and sampling proportion. The prior on the reproduction number was a lognormal distribution (initial = 2 [0, 10], M = 0, S = 0.5), on the origin a normal distribution (mean = 0.1, σ = 0.05, initial = 10 [0, ∞]), on the rate of becoming uninfectious a normal distribution (mean = 10, σ = 1.3, initial = 1 [0, ∞]), and on the sampling rate a beta distribution (α = 1, β = 5, initial = 0.01 [0, 1]). MCMC chains were run for 50 million generations sampling every 50,000 steps for lineages with > 100 cases, and for 30 million generations sampling every 30,000 steps for lineages with < 100 cases. Convergence was accepted when all posteriors gave effective sample sizes (ESS) > 300 after a 10% burn-in.
Calculation of population genetic metrics
We calculated Tajima's D to infer changes in the viral effective population size and deviations from a standard neutral evolutionary model. We divided the data into eight 2-week time periods between February 20, 2020, and June 10, 2020. For each time period, we randomly sampled 20 viral consensus sequences to calculate Tajima's D and repeated this procedure 99 times to obtain confidence intervals. We calculated both a combined value of D across all sequences and a separate estimate for each PANGO viral lineage. Lineages or time bins with fewer than 20 sequences were discarded. We calculated D as described [46]:
$$ D_{\mathrm{Tajima}} = \frac{\theta_{\pi} - \theta_w}{\sqrt{\hat{V}\left(\theta_{\pi} - \theta_w\right)}} $$
(1)
where \( \hat{V} \) denotes the expected sampling variance of (θπ − θw). θπ is the nucleotide diversity, calculated from the average number of pairwise differences among consensus sequences:
$$ \theta_{\pi} = \frac{Nb\_reads\_pwdiff}{\sum_{i=1}^{n} C(N, 2)} $$
(2)
where n is the genome length, N is the number of consensus sequences, C(N, 2) is the binomial coefficient counting the number of pairs of consensus sequences in a set of size N, and Nb_reads_pwdiff is the number of pairwise nucleotide differences. Because pairwise differences are maximized by intermediate-frequency mutations, θπ is most sensitive to such mutations. θw is another estimator of nucleotide diversity, calculated from the number of segregating sites, and is more sensitive to low-frequency mutations:
$$ \theta_w = \frac{S}{a_1} $$
(3)
$$ a_1 = \sum_{i=1}^{N-1} \frac{1}{i} $$
(4)
where \( S \) is the number of segregating sites and \( a_1 \) is a normalizing factor that accounts for the number of consensus sequences (N).
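These quantities can be computed directly from an alignment; the following R sketch follows Tajima's (1989) formulas for the variance normalization, assuming a character matrix `aln` (rows = consensus sequences, columns = aligned positions, Ns and gaps already handled):

```r
tajima_D <- function(aln) {
  N <- nrow(aln)  # number of consensus sequences
  # theta_pi: average pairwise differences over all C(N, 2) pairs.
  pairs  <- combn(N, 2)
  pwdiff <- apply(pairs, 2, function(p) sum(aln[p[1], ] != aln[p[2], ]))
  theta_pi <- mean(pwdiff)
  # theta_w: segregating sites scaled by a1.
  S  <- sum(apply(aln, 2, function(col) length(unique(col)) > 1))
  a1 <- sum(1 / seq_len(N - 1))
  theta_w <- S / a1
  # Variance terms from Tajima (1989).
  a2 <- sum(1 / seq_len(N - 1)^2)
  b1 <- (N + 1) / (3 * (N - 1))
  b2 <- 2 * (N^2 + N + 3) / (9 * N * (N - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (N + 2) / (a1 * N) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)
  (theta_pi - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
}
```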
We also estimated dN/dS, the ratio of nonsynonymous (dN) to synonymous (dS) substitution rates, by comparing consensus sequences to the reference genome (GenBank MN908947.3), allowing us to infer changes in selective pressure at the protein level:
$$ \frac{dN}{dS} = \frac{Nb_{nsub}/Nb_{nss}}{Nb_{ssub}/Nb_{ss}} $$
(5)
where \( Nb_{nsub} \) is the number of nonsynonymous substitutions, \( Nb_{nss} \) the number of nonsynonymous sites, \( Nb_{ssub} \) the number of synonymous substitutions, and \( Nb_{ss} \) the number of synonymous sites. We only considered consensus sequences with more than one synonymous mutation, to ensure finite values of dN/dS. These analyses were coded in R [35].
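As an off-the-shelf alternative to the counting approach of Eq. (5), dN/dS can also be approximated with seqinr's kaks() (Li 1993 method); this sketch assumes a hypothetical in-frame codon alignment with the reference as the first sequence:

```r
library(seqinr)
# Hypothetical input: an in-frame codon alignment, reference first.
aln  <- read.alignment("orf_codon_alignment.fasta", format = "fasta")
res  <- kaks(aln)
ka   <- as.matrix(res$ka)[1, -1]  # dN of each consensus vs. the reference
ks   <- as.matrix(res$ks)[1, -1]  # dS of each consensus vs. the reference
dnds <- ifelse(ks > 0, ka / ks, NA)  # finite values need >= 1 synonymous change
```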