Pervasive generation of non-canonical subgenomic RNAs by SARS-CoV-2

SARS-CoV-2, a positive-sense RNA virus in the family Coronaviridae, has caused a worldwide pandemic of coronavirus disease 2019 or COVID-19. Coronaviruses generate a tiered series of subgenomic RNAs (sgRNAs) through a process involving homology between transcriptional regulatory sequences (TRS) located after the leader sequence in the 5′ UTR (the TRS-L) and TRS located near the start of ORFs encoding structural and accessory proteins (TRS-B) near the 3′ end of the genome. In addition to the canonical sgRNAs generated by SARS-CoV-2, non-canonical sgRNAs (nc-sgRNAs) have been reported. However, the consistency of these nc-sgRNAs across viral isolates and infection conditions is unknown. The comprehensive definition of SARS-CoV-2 RNA products is a key step in understanding SARS-CoV-2 pathogenesis. Here, we report an integrative analysis of eight independent SARS-CoV-2 transcriptomes generated using three sequencing strategies, five host systems, and seven viral isolates. Read-mapping to the SARS-CoV-2 genome was used to determine the 5′ and 3′ coordinates of all junctions in viral RNAs identified in these samples. Using junctional abundances, we show nc-sgRNAs make up as much as 33% of total sgRNAs in cell culture models of infection, are largely consistent in abundance across independent transcriptomes, and increase in abundance over time during infection. By assessing the homology between sequences flanking the 5′ and 3′ junction points, we show that nc-sgRNAs are not associated with TRS-like homology. By incorporating read coverage information, we find strong evidence for subgenomic RNAs that contain only 5′ regions of ORF1a. Finally, we show that non-canonical junctions change the landscape of viral open reading frames. We identify canonical and non-canonical junctions in SARS-CoV-2 sgRNAs and show that these RNA products are consistently generated by many independent viral isolates and sequencing approaches. These analyses highlight the diverse transcriptional activity of SARS-CoV-2 and offer important insights into SARS-CoV-2 biology.


Background
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged from Wuhan, China, in December 2019 and rapidly led to a worldwide pandemic, known as COVID-19 [1]. The initial sequencing report found that this virus maintains 89.1% nucleotide identity to SARS-like coronaviruses found in bats in China [1].
Coronaviruses (CoV) including SARS-CoV-2 contain large,~30 kb, positive-sense, single-stranded RNA genomes with a unique genome organization structure. The 5′ end of the genome contains two large open reading frames (ORFs), ORF1a and ORF1b. ORF1a produces a large polyprotein, while ribosomal slippage at an RNA pseudoknot structure and slippery sequence at the end of ORF1a occasionally leads to a frameshift and subsequent translation of a joint ORF1a and ORF1b polyprotein [2]. ORF1a and ORF1ab polyproteins contain protease activity that can cleave the polyproteins into a variety of nonstructural proteins (Nsp) capable of facilitating viral transcription and replication and modulating host transcription and translation. Following ORF1b, there is an arrangement of ORFs encoding structural and accessory proteins that varies in content and order across the CoV family. While structural proteins are incorporated into emergent virions, accessory proteins are thought to be dispensable for replication in cell culture but to increase viral fitness in vivo [3][4][5].
Initial translation of ORF1ab can occur directly from incoming viral genomic RNA, generating the ORF1a and ORF1ab polyproteins and initiating infection. However, because the viral genome is so large, and because downstream ORFs are far from the 5′ end of the genome, effective translation of these ORFs requires the generation of subgenomic RNAs (sgRNAs) [6]. To generate these sgRNAs, it is thought that CoV undergoes a process of discontinuous extension during synthesis of negativestranded RNA templates. As the virus-encoded RNAdependent RNA polymerase progresses from the 3′ end towards the 5′ end of the viral genome, it encounters transcriptional regulatory sequences (TRS) upstream of each major ORF (TRS-B). Here, the polymerase can skip to a similar TRS that is 3′ of a shared leader sequence within the 5′ UTR of the CoV genome (TRS-L). These antisense sgRNAs are then transcribed, resulting in a series of tiered sgRNAs containing the same 5′ leader sequence [6].
Several studies have characterized SARS-CoV-2 sgRNAs using Nanopore direct RNA sequencing (dRNAseq). In contrast to short-read sequencing, which can produce read pairs of 150 bp each, dRNAseq is capable of directly sequencing entire transcripts including the full length of the viral genome (29,903 bases). This allows dRNAseq to identify RNA isoforms and variants that may be challenging to deconvolute using short-read sequencing. Using nanopore sequencing, Taiaroa et al. identified eight major sgRNAs, as well as a small number of non-canonical junctions between the 5′ leader sequence and downstream regions of the SARS-CoV-2 genome [7]. Kim et al. identified a total of 9 distinct subgenomic RNAs, as well as a series of non-canonical subgenomic RNAs (nc-sgRNAs) that contain unexpected junctions between 5′ and 3′ sequences [8]. In addition, they confirmed the existence of non-canonical junctions at various sites across the genome using short-read DNA nanoball sequencing [8]. Davidson et al. identified a series of transcripts encoding the nucleocapsid (N) protein, but with distinct small internal deletions, and validated this finding with proteomic evidence [9].
While these previous studies identified non-canonical RNA junctions and unexpected RNA species in isolation, there have been no integrative analyses of the SARS-CoV-2 junctions. It is unclear if the identified noncanonical junctions are broadly representative of SARS-CoV-2 biology or if they are a result of dataset-or isolate-specific artifacts. Here, we report an integrative analysis of eight independent SARS-CoV-2 transcriptomes generated using three sequencing strategies, five host systems, and seven viral isolates to comprehensively characterize the landscape of canonical and noncanonical sgRNAs generated by SARS-CoV-2.

Sequencing data
Details on all studied transcriptomes are listed in Table 1.
In the case of the Finkel et al. and Emanual et al. transcriptomes, the 24-hpi samples were used in all analyses except for the investigation of junction changes over time.

Determining junction coordinates
Reads in all transcriptomes were mapped against the SARS-CoV-2 reference genome (accession NC_ 045512.2) using minimap2 [15] and custom parameters detailed in minimap_sars2.sh and inspired by Kim et al. [8], and unmapped reads were discarded. The exceptions are the transcriptomes from Finkel et al. and Emanuel et al.-because their read lengths were very short (60 and 72 bases in length respectively), mapping was conducted with STAR [16] using settings detailed in star.sh.
The 5′ and 3′ junction positions of junction-spanning reads were determined with generate_synthetic_transcripts.py. This script reads a BAM file and the SARS-CoV-2 genome fasta and, by parsing the BAM alignment CIGAR and start position, determines the start, end, and junction coordinates of the alignment. Any minimap2called deletion or intron of at least 1000 bases in length was considered a junction. Junction coordinates were then written to a tab-delimited output file for further analysis.

Analysis and visualization of junctions
The 5′ and 3′ coordinates of each junction were processed in R [17], and figures generated with ggplot2 [18]. Gene maps were generated using the gggenes R package [19].
For the global distribution of reads in each transcriptome, specific 5′-3′ junctions that occur at least twice are represented as arcs. Each inverted peak represents a histogram of 5′ or 3′ junctions at each genome position, with a bin size of 100 bases-this means that the magnitude of each peak represents the total number of 5′ or 3′ junctions within each 100 base span. Code is global_ junctions.Rmd.
The distribution of major 3′ junctions was generated by plotting a histogram of 3′ junction positions that occur after position 21000 and that have a 5′ end prior to position 100. Bin size is 20 bases. Code is canonical_ 3prime_junctions.Rmd.
The percentage of non-canonical junctions in each was calculated by first assigning each individual 5′-3′ junction as canonical or non-canonical. A junction was considered to be canonical if its 5′ position was within 20 bases of the TRS-L core sequence ACGAAC (starting at position 69), and its 3′ junction was within 15 bases of a TRS-B (defined by the presence of the TRS core sequence). The percentage of total junctions that were assigned as non-canonical was then determined. Code is global_junctions.Rmd.

Identifying dataset-specific junctions
The consistency of non-canonical junctions across transcriptomes was determined separately for 5′ and 3′ junctions based on one replicate each from Taiaroa  For each SARS-CoV-2 genome position, the percentage of noncanonical junctions at that site was calculated for each transcriptome (X). The mean (μ) and standard deviation (σ) of percentages at each site across the five transcriptomes were then calculated. For each position in each transcriptome, a Z-score was calculated by subtracting the observed percentage by the mean and dividing by the standard deviation ((X − μ)/σ). The landscape of non-canonical junctions was determined by splitting the SARS-CoV-2 genome into sequential 100 nt spans and calculating the percentage of non-canonical junctions with a 5′ or 3′ end in each span. Z-scores were calculated in a similar manner as base-by-base Z-scores, but by comparing percentages in each 100 nt span across To determine the aggregated change in each class of junctions, each 5′-3′ junction was assigned to a class based on the location of its 5′ and 3′ position. If a junction's 5′ position was located within 20 bases of the TRS-L, and its 3′ position was located within 15 bases of a TRS-B, it was assigned to the ORF closest to the TRS-B. Otherwise, the junction was assigned as noncanonical. The percentage of total junctions in each class at the early timepoint was subtracted from the class percentages at the late timepoint for each dataset. Code is junctions_over_time.Rmd.

Assessing homology near 5′ and 3′ positions of each junction
The identity and length of homologous sequences flanking the 5′ and 3′ sites of each junction were determined by identifying the 30 bases flanking each junction site. The longest identical subsequence between these two 30 base sequences was then collected. The most common homology sequence was plotted for each group of junctions. Here, a junction was considered canonical if it has a 3′ end within 15 bases of a canonical TRS-B or was assigned as "Within ORF" if it was otherwise within an ORF. The exception were junctions with a 5′ end within ORF1a-these were assigned to ORF1a. The total distribution of homology lengths was also determined for canonical and non-canonical junctions. To determine the random distribution of homology lengths, all 30 base sequences (30mers) present in the SARS-CoV-2 genome were determined. The longest identical subsequence between 100,000 random pairs of 30mers that are derived from at least 1000 bases apart on the genome was determined. Code is find_trs_lengths.py and TRS_ plotting.Rmd.

Analysis of ORFs on dRNAseq reads
generate_synthetic_transcripts.py was used to generate "coordinate-derived transcripts" from the SARS-CoV-2 mapped BAMs from the Taiaroa et al., Kim et al., and Davidson et al. dRNAseq transcriptomes. This script reads a BAM file and the SARS-CoV-2 genome fasta and, by parsing the BAM alignment CIGAR and start position, determines the start, end, and junction coordinates of the alignment. Coordinate-derived transcripts were created by using these coordinates to retrieve the associated sequence from the SARS-CoV-2 genome. Any minimap2-called deletion or intron of at least 1000 bases in length was considered a junction. All coordinatederived transcripts were then aligned against a minimal leader sequence fragment (5′-CTTCCCAGGTAA CAAACCAACCAACTTTCGATCTCTTGTAGATCTG TTCTC-3′) using blastn [21]. This minimal sequence was used instead of the full leader sequence because inspection of 5′ coverage in the three long-read datasets revealed a significant drop in coverage at the end of this sequence. Transcripts with an alignment of at least 45 nucleotides in length and a percent identity of at least 85% and that contain the start of this sequence within the first 50 nucleotides of the transcript were kept. Transcripts not containing this leader sequence were discarded. ORFs were called directly from each transcript using Prodigal [22], using parameters listed in prodigal.sh. ORFs were allowed to initiate from any nUG start codon, were not required to be the most 5′ ORF on each transcript, and were only reported if they are at least 30 codons in length. Each ORF was translated into amino acid sequence using prodigal_to_orf_direct.py. If multiple Prodigal-predicted ORFs contained an overlapping amino acid sequence, the longest ORF was output. Each ORF was output in fasta format and labeled with transcript of origin and coordinates of the ORF on the transcript.
The amino acid sequence of each ORF was then mapped against canonical SARS-CoV-2 proteins (from accession NC_045512.2) as well as Prodigal-predicted unannotated ORFs from the SARS-CoV-2 genome using DIAMOND [23]. DIAMOND parameters are described in diamond.sh and include an E-value threshold of 10. During downstream analysis, alignments were considered valid if the aligned portions maintained at least 95% amino acid identity to the reference protein.
Various statistics of each ORF, including alignment information, the start position of the ORF on the transcript, and fusion information, were generated from the DIAMOND alignment file using parse_orf_ assignments.py. An ORF was considered variant if it was not the same length as the canonical protein, with a notable exception. Because there are alternative start codons upstream of some annotated genes, and Prodigal would call ORFs using these start codons if they are present on a transcript, ORFs were allowed to have up to 20 amino acids of additional N-terminal amino acids while still being classified as canonical if they end at the expected position. If an ORF had matches against multiple SARS-CoV-2 proteins, it was given a primary assignment of the protein with the longest alignment length.

Results
Analysis of three independent dRNAseq datasets reveals canonical and non-canonical junctions To determine the major RNA species present in SARS-CoV-2-infected cells, we applied a uniform computational approach to three independent dRNAseq datasets from Taiaroa et al., Kim et al., and Davidson et al. [7][8][9]. To assess sgRNA presence, we identified junction-spanning reads following read-mapping against the SARS-CoV-2 reference genome.
Analysis of the 5′ and 3′ locations of junctions separated by at least 1000 bases revealed that the majority of junctions consist of a 5′ end originating in the 5′ UTR and a 3′ end at distinct sites in the 3′ region of the genome (Fig. 1a-c). These represent the canonical junctions between the leader sequence and 3′ sequences that generate the major sgRNAs [6]. Plotting the counts of 3′ junctions that contain a 5′ end within the first 100 bases revealed eight prevalent junction points along the 3′ end of the genome (Fig. 1d-f), all of which are proximal to the TRS core sequence ACGAAC. Interestingly, we observed that the 3′ junction peak and TRS-B closest to ORF6 lands 151 bases within the M ORF. Based on these junction analyses and the locations of identified TRS body sites, predicted sgRNAs and their most 5′ gene or genes are shown in Fig. 1g. We predict 8 major sgRNAs based both on identification of the TRS core sequence and presence of a dominant junction peak. In contrast to Kim et al. [8] but in concordance with Taiaroa et al. [7] and Davidson et al. [9], we do not find evidence of a prominent 3′ junction point near ORF7B (Fig. 1d-f). and 3′ junctions at that position was calculated and plotted as an inverse peak. The histogram bin size is 100 bases, meaning each inverse peak represents the cumulative count of 5′ or 3′ junctions occurring within that span. Curved lines represent the 5′ and 3′ locations of junctions that occur at least twice. Red curves represent canonical junctions, black curves represent non-canonical junctions. "Taiaroa" [7], "Kim" [8], and "Davidson" [9] represent the three independent SARS-CoV-2 dRNAseq datasets investigated. d-f A histogram of 3′ junctions past position 21000 with a 5′ end before position 100 are plotted. Dashed lines indicate the start coordinates of annotated viral genes. Bin size is 20 bases. g Based on junction analysis, the predicted major species of virus-produced RNAs are represented. The most 5′ gene or genes on each subgenomic RNA is listed Many SARS-CoV-2 junctions are non-canonical Similar to observations by Kim et al. [8], we identified an unexpectedly diffuse pattern of non-canonical junctions across the genome (Fig. 1a-c, black arcs), suggestive of many unexpected RNA species. Had only canonical junctions been observed, the arcs on this plot would be exclusively between the first 100 bases and defined points on the x-axis (Fig. 1a-f).
Because the transcriptomes in these long-read sequencing studies [7][8][9] all consist of dRNAseq from infected Vero cells, we sought to confirm this observation using independent transcriptomes from other SARS-CoV-2 isolates, in other host systems, and with other sequencing technologies. To this end, we assessed five additional transcriptomes generated in Vero cells [10], A549 cells [11], Calu3 cells [13], bronchial organoids [12], and ferret nasal washings [11] using Illumina PolyA or total RNA sequencing. Information regarding each sample is detailed in Table 1. Junction analysis in these transcriptomes likewise revealed pervasive non-canonical junctions (Additional file 1: Fig. S1A-E). Quantification of the relative abundance of canonical and non-canonical junctions in these transcriptomes revealed that up to 33% of junctions were non-canonical in cell culture (Fig. 2a, Additional file 1: Fig. S1F). In addition, the transcriptome from a ferret's nasal washing contained 51% non-canonical junctions, indicating that non-canonical junctions are also produced in vivo.
We considered if non-canonical junctions were driven by dataset-specific artifacts. For example, it is possible that defective particles containing defective viral genomes (DVGs), often with large internal deletions that The percentage of non-canonical junctions at each genome position in each dataset was determined (X). The mean (μ) and standard deviation (σ) of percentages at each position across the five datasets was calculated. For each position in each dataset, the Z-score was calculated as (X − μ)/σ (i.e., the number of standard deviations away from the mean), and the percentages and Z-scores for each position in each dataset were plotted. c, d Each point represents the percentage and Z-score of one position in one dataset. For each position in the SARS-CoV-2 genome, the percentage and Z-score of non-canonical junctions with a 5′ end (c) or a 3′ end (d) was determined as described above for five independent datasets: Taiaroa, Kim, Davidson, Finkel, and Blanco-Melo. Points with a percentage above 4% of non-canonical junctions and a Z-score above 1 were highlighted increase their replicative fitness, could accumulate during preparation of virion stocks [24] and generate the observed non-canonical junctions during subsequent infections. If this was true, we would expect to observe non-canonical junctions specific to a single virion preparation and accumulation of these specific noncanonical junctions over the course of infection. To address this possibility, we identified dataset-specific junctions by identifying junctions that are differentially abundant across datasets (Fig. 2b). While there is evidence of position-by-position variability across transcriptomes, this analysis revealed that there are only six cases where a junction site in one transcriptome is both highly abundant (> 4% of non-canonical junctions) and variable (Z > 1) across five independent SARS-CoV-2 sequencing studies [7][8][9][10][11] (Fig. 2c, d). To further investigate the distribution of non-canonical junctions in each transcriptome, we determined the percentage of non-canonical junctions in sequential 100 nt spans across the SARS-CoV-2 genome (Additional file 1: Fig. S2). This analysis revealed that, while there is some heterogeneity in noncanonical junction distributions in some datasets, the general landscape of non-canonical junctions is similar (Additional file 1: Fig. S2C, S2D). Together, these data suggest that non-canonical junctions are not entirely driven by dataset-specific artifacts.
Of note, we found that three 5′ junction sites in the

The proportion of non-canonical junctions increases with time after infection
Next, we assessed changes in junctions over time using the approach detailed in Fig. 3 using transcriptomes from early (4 or 5 h post infection) and late (24 h post infection) timepoints from Finkel et al. [10] and Emanual et al. [13]. This analysis revealed that there are large decreases in 5′ junctions originating at position 64, and in 3′ junctions ending in 28254 (proximal to the N TRS-B), with a smaller decrease in 3′ junctions ending at 26467 (proximal to the M TRS-B) (Fig. 3b, c). Notably, the Finkel et al.-specific 5′ junction at 16971 increased in relative abundance over time. In contrast, there was no prominent increase in the relative abundance of junctions at other positions (Fig. 3b, c). Instead, there are small but widespread increases in non-canonical junction sites throughout the SARS-CoV-2 genome (Fig. 3d, e). Quantification of the total percent changes of canonical and non-canonical junctions revealed a large decrease in canonical N junctions and increase in non-canonical junctions (Fig. 3f, g). These results indicate that the abundance of non-canonical junctions increases over time in cell culture. It is unclear if a similar pattern occurs during infection in vivo.

Non-canonical junctions are not associated with TRS-like homology
While homology between the TRS-L and TRS-B is thought to be critical for generation of canonical sgRNAs, the role of TRS-like homology in the generation of nc-sgRNAs is unclear. To address this question, we determined the 30 bases surrounding the 5′ and 3′ sites of each junction and identified the longest continuous subsequence in common between these sequence (Fig. 4a). We found that junctions with a 3′ end landing within 15 nucleotides of a TRS-B overwhelmingly have junction homology associated with variants of the TRS core sequence (Fig. 4b-d, blue). In contrast, junctions landing within ORFs and not proximal to a TRS-B do not have a prevalent homologous sequence (Fig. 4b-d,  red). The exception are intra-M junctions, whose homology is a consequence of the location of the ORF6 TRS within the M ORF.
While there is not a single strong motif driving noncanonical, intra-ORF junctions, it is still possible that non-canonical junctions are associated with junctionspecific homology between the 5′ and 3′ junction sites. To address this possibility, we tallied the length of homology between the 5′ and 3′ sites of canonical and noncanonical junctions. This analysis revealed that canonical junctions have markedly longer homology lengths (mean > 10 bases) than non-canonical junctions (mean < 5 bases) (Fig. 4e-g). Indeed, the homology lengths of non-canonical junctions are only slightly longer than homology lengths between 100,000 random pairs of SARS-CoV-2-derived 30 base sequences (Fig. 4e-g). These data indicate that non-canonical junctions are not associated with TRS-like homology.
sgRNAs containing only a 5′ region of ORF1a are produced in all studied cell culture and in vivo infection systems Assessment of the three dRNAseq transcriptomes revealed evidence of sgRNAs that contain only the 5′ region of ORF1a (Additional file 1: Fig. S3A). This is similar to an observation previously reported by Kim et al. [8]. Inspection of read coverage of ORF1a in these transcriptomes confirmed that there is elevated 5′ ORF1a coverage with specific decreases in coverage concomitantly with sharp increases in junctions near genome positions 1600 and 6000 (Fig. 5a-c).
While there is clearly higher coverage of the 5′ region of ORF1a in the three dRNAseq transcriptomes, it remained unclear if these 5′ ORF1a sgRNAs are a major percentage of total ORF1a in the cell, or if this finding was due to a systematic bias favoring shorter reads in nanopore sequencing. To address this question, we assessed ORF1a coverage in four independent Illumina polyA-enriched RNAseq transcriptomes. Analysis of these transcriptomes revealed a similarly elevated 5′ ORF1a coverage and consistent drops in coverage near genome positions 1600 and 6000 (Fig. 5d-g). Importantly, one of these transcriptomes is from bronchial organoids (Fig. 5g) and one transcriptome is from the nasal washings of an infected ferret (Fig. 5f), suggesting that 5′ ORF1a subgenomic RNAs can be produced in vivo and were not only due to artifacts of highly replicative conditions in cell culture. It is likely that polyA selection, as well as a technical bias against long transcripts in nanopore sequencing, has exaggerated the Fig. 3 Non-canonical junctions accumulate over time in cell culture. a Illustration of computational approach used to determine change in junction percentages over time. For each position in the SARS-CoV-2 genome and separately for 5′ and 3′ junctions, the number of junctions at that position was calculated. Using these numbers, the percentage of 5′ or 3′ junctions falling at each position was calculated. The change in junction percentage at each position is defined as the difference between the position's junction percentage at the late and early timepoints, and this junction change was plotted for each position. b-e The change in junction percentage for 5′ (orange) and 3′ (blue) junctions over time in the Finkel (b, d) and Emanuel (c, e) datasets was determined as described above. Changes in the percentage of junctions Positions with at change greater than 2.5% are annotated with text on the plot. Panels d and e are zoomed in versions of b and c. f, g Junctions were assigned into groups based on their 5′ and 3′ junction positions. If a junction had a 5′ end within 20 bases of the TRS-L and within 15 bases of a TRS-B, it was considered a canonical junction belonging to the ORF with a start closest to the TRS-B. Otherwise, it was considered non-canonical. The percentage of junctions falling into each category was calculated for early and late timepoints, and the difference between each category's percentage in the late vs early timepoint was plotted increased density at the 5′ region of ORF1a. However, analysis of short-read transcriptomes, including one generated from RNA that was not polyA-enriched (Fig. 5h), also revealed a consistent elevation in 5′ ORF1a coverage. Together, these data suggest that the excess 5′ ORF1a coverage is not entirely driven by these biases.
Across the eight transcriptomes, there is anywhere between 2-fold and 17-fold excess coverage of the 5′ region of ORF1a when comparing the 5′ and 3′ 1000 bases (median 3.1-fold). We found that there is similar elevated 5′ ORF1a coverage in the circulating human coronavirus CoV 229E [14] (Additional file 1: Fig. S4), supporting the existence of 5′ ORF1a RNAs in other coronaviruses. All together, these data support the hypothesis that subgenomic RNAs containing just the 5′ region of ORF1a are produced by SARS-CoV-2.

Junctions have the potential to generate variant open reading frames
To test the hypothesis that non-canonical junctions can result in variant open reading frames (ORFs) in nc-sgRNAs, we assessed the ORF content of each read in the three dRNAseq transcriptomes. Because dRNAseq can have an error rate of greater than 10% [25], we generated "coordinate-derived transcripts" by determining the start, end, and junction coordinates of each raw transcript by mapping to SARS-CoV-2 (NC_045512.2) and Fig. 4 Non-canonical junctions are not associated with TRS-like homology. a Illustration of computational approach used to assess homologous sequences between the 5′ and 3′ junction points. For each of the three dRNAseq datasets, the 30 bases flanking the 5′ and 3′ junction points were assessed, and the longest homologous sequence between these two 30 base regions was determined. b-d Homologous sequences present in the 15 bases on either side of the 5′ and 3′ ends of each junction. Junctions are classified by the location of their 3′ end-if this is within 15 bases of the canonical TRS site or if it falls within a gene it is assigned accordingly. The only exception is ORF1a-junctions with a 5′ end originating in ORF1a are assigned to ORF1a. Labels represent the most common homologous sequence between the ends of each junction. The core TRS sequence ACGAAC is underlined. e-g The length of the longest homologous sequence for canonical (C) and non-canonical (NC) are plotted. Here, junctions were considered canonical if they have a 5′ end within 20 nucleotides of the TRS-L and a 3′ end within 15 nucleotides of a TRS-B. (R) represents random homology lengths. The points for (R) were calculated by first extracting all possible 30 base sequences (30mers) from the SARS-CoV-2 genome, and then assessing the length of the longest homologous sequence between 100,000 random pairs of 30mers that are separated by at least 1000 bases on the genome. Within each column, the relative width of each band represents the relative abundance of junctions with each homology length. The value at the top of each column is the mean homology length extracting the associated genomic sequence. While this approach is capable of overcoming sequencing errors and can resolve major RNA species, it is not capable of resolving small or structurally complex RNA rearrangements. Because the 5′ end of nanopore reads may be truncated due to technical factors during sequencing [26], we leveraged the assumption that functional sgRNAs should contain the 5′ leader sequence to identify reads generated from complete transcripts. While excluding reads that do not contain the leader sequence may have reduced sensitivity, it prevents identification of variant ORFs as a consequence of truncated reads.
ORFs were predicted directly from the coordinatederived transcripts, and ORFs of unexpected length were identified as "variant." Assessment of the relative abundance of variant and canonical ORFs revealed evidence of variant ORFs of ORF1a/b, S, and M (Fig. 6a, b, c).
To characterize the identified M, S, and ORF1a ORF variants, we determined the stop and start points of each ORF variant relative to their canonical counterparts (Fig. 6d-l). This analysis revealed that all theoretical M ORF variants were predicted to start at a UUG start codon 42 amino acids before the end of the M gene and end at the canonical stop codon (Fig. 6d, g, j). M ORF variants make up between 7.9 and 19% of M ORFs in the three dRNAseq datasets and are generated by the major junction point at the ORF6 TRS sequence within the M reading frame (Fig. 1d-f). If translated, M ORF variants would lack the N-terminal transmembrane domains and contain regions predicted to be present on the inside of the virion or in the cytosol. While the efficacy of translation initiation from this start codon is unknown, it has an acceptable Kozak consensus site with an A at the − 3 position.
S ORF variants make up between 5.1 and 24.2% of all S ORFs identified in these datasets (Fig. 6a-c). The majority of S ORF variants are predicted from nc-sgRNAs with 3′ junctions landing within the S ORF (Additional file 1: Fig. S2D-F), resulting in N-terminal truncations with theoretical AUG and non-AUG start codons (Fig. 6e, h, k). If translated, these variants would lack the N-terminal signal sequence [27] and, given the variations in ORF start sites, it is unclear if these ORFs encode functional isoforms.
Our identification of ORF1a variants is consistent with the observed patterns of 5′ ORF1a coverage (Fig. 5). The peak end position of identified ORF1a variant ORFs is around amino acid position 500 (Fig. 6f, i, l), consistent with the observed decrease in read coverage around genome position 1600. This position falls within the NSP2 amino acid sequence suggesting that, if translated, many ORF1a ORF variants would consist of NSP1 and Nterminal fragments of NSP2. Interestingly, we found that approximately 1/3 of ORF1a variants are in-frame fusions with downstream ORFs, encoding mostly N (Fig. 6m-o). We were only able to identify full-length

Discussion
While prior studies have identified nc-sgRNAs in SARS-CoV-2 [8,9] and other coronaviruses [14], there has been no multi-study analysis of SARS-CoV-2 nc-sgRNAs. We conducted an integrative analysis of eight independent SARS-CoV-2 transcriptomes to catalog unexpected, noncanonical RNA species. We show that SARS-CoV-2 produces nc-sgRNAs, that these nc-sgRNAs increase in abundance over time, and that they are not associated with TRS-like homology. We highlight strong evidence supporting the existence of nc-sgRNAs containing only the 5′ region of ORF1a and show that the presence of nc-sgRNAs can alter the landscape of viral ORFs.
We found that canonical sgRNAs are consistently produced across SARS-CoV-2 transcriptomes, and identify eight major sgRNAs encoding S, ORF3a, E, M, ORF6, ORF7a/b, ORF8, and N. Each of these sgRNAs is supported by abundant 3′ junctions originating at a TRS-B just upstream of the associated ORF. Similar to other reports, we do not find evidence of a major series of sgRNAs encoding ORF10 [7][8][9]. We also do not find Color indicates if the fusion partner is on the N and C terminus of ORF1a, if the terminus is ambiguous, or if the fusion is a "self" fusion between an upstream and downstream region of ORF1a strong evidence of a major group of sgRNAs encoding ORF7B despite the presence of a TRS-core-like sequence (AAGAAC) proximal to the ORF7B start.
Interestingly, up to 33% of sgRNAs from infected cells in cell culture are non-canonical. One potential source of nc-sgRNAs are DVGs introduced by defective particles during infection. These can accumulate during the propagation of virus stocks. While information on the passaging of viral stocks is only available for four transcriptomes, at least five of the six SARS-CoV-2 transcriptomes generated from cell lines were generated following a low MOI infection, which may reduce the impact of defective particles [28,29]. Indeed, while there is some variation in the absolute abundance of noncanonical junctions across transcriptomes that may reflect dataset-specific factors, the landscape of noncanonical junctions across transcriptomes is generally similar. One exception is that dataset-specific junctions are found in the Finkel et al. transcriptomes. In particular, there is a strong 5′ junction point at position 16971 that represents over 12% of non-canonical junctions in two Finkel et al. replicates but is largely absent in other transcriptomes. A hypothesis that could explain this observation is that there is a population of defective particles in the Finkel et al. virion preparation that contain an internal deletion. This DVG, which is smaller than the full genome, may have a replicative advantage during virus stock preparation [24]. Indeed, the Finkel et al.specific junction at 16971 displays a relatively large increase in relative abundance over time, in contrast to smaller and non-specific increases of non-canonical junctions at other positions.
The consistent presence of non-canonical junctions across the SARS-CoV-2 genome in independent transcriptomes suggests that the bulk of the observed nc-sgRNAs are not dataset-specific but are generalizable to SARS-CoV-2 biology. Furthermore, their non-specific increase in abundance over time in cell culture supports a model in which nc-sgRNAs are generated nonspecifically and at higher frequencies later in infection. We found that non-canonical junctions are not associated with TRS-like homology, raising the possibility that there is another mechanism behind their formation. For example, it is possible that factors such as RNA structure [30], accumulation of viral proteins, or actions of the viral RdRp and associated cofactors [31] could influence nc-sgRNA formation. Indeed, a recent study of the murine betacoronavirus MHV suggests that the 3′ to 5′ exoribonuclease Nsp14 influences recombination site selection of the viral RdRp [31]. Additional studies are necessary to gain a mechanistic understanding of the factors that influence nc-sgRNA generation and to understand the phenotypic effects of nc-sgRNAs on SARS-CoV-2 pathogenesis. Notably, considering that defective viral genomes in negative-sense RNA viruses have been associated with antiviral immunity, dendritic cell maturation, and interferon production [5,24,32,33], it will be important to assess if SARS-CoV-2 nc-sgRNAs are associated with immune activity.
Junctions in viral RNA change the landscape of viral ORFs. Because the ORF6 TRS-B is located 151 bases into the M ORF, sgRNAs created using this TRS-B contain a 3′ region of M. There is a UUG located 41 codons prior to the end of M that will now be situated prior to the ORF6 on these sgRNAs. While UUG is a relatively weak start codon, there is potential that this UUG could stimulate translation of a C-terminal M fragment or regulate the expression of the downstream ORF6 as an upstream ORF [34]. Additional studies are needed to understand if this M ORF variant is translated and to understand if its UUG start codon influences ORF6 translation.
In addition to this M ORF variant, we found that some nc-sgRNAs contain S and ORF1a ORF variants. Many of the S ORF variants would lack the N-terminal signal sequence [35], and especially considering the variation in the hypothetical start positions of these variants, it is unclear if they are translated. Notably, subgenomic RNAs encoding variant S ORFs have been reported for SARS-CoV [36].
In contrast, the ORF1a ORF variants have a peak stop position around ORF1a amino acid position 500, which correlates well with an observed decrease in ORF1a 5′ coverage around genome position 1600. There is strong support for the existence of 5′ ORF1a-containing nc-sgRNAs across all eight transcriptomes studied here. Although there is not a strong peak of 5′ junctions at specific positions in the 5′ region of ORF1a, the consistent decreases in coverage and concomitant increases in 5′ junctions around positions 1600 and 6000 are notable. It is possible that there are biological factors at these positions that mediate these generalized inflection points but, unlike TRS-mediated homology, do not concentrate junctions to a specific base. As posited by Kim et al. [8], the existence of subgenomic RNAs with the 5′ end of ORF1a raises the possibility that these nc-sgRNAs may influence the relative abundance of Nsps, making this observation an important target for future study.

Conclusions
In conclusion, we show that SARS-CoV-2 generates nc-sgRNAs in cell culture and in vivo and that these nc-sgRNAs are largely consistent across independent transcriptomes and sequencing technologies. We show that nc-sgRNAs are not associated with TRS-like homology, suggesting there may be a homology-independent mechanism driving their formation. Finally, we show that non-canonical junctions have the potential to influence the landscape of viral ORFs and that there is especially