Complex Structural Variants Resolved by Short-Read and Long-Read Whole Genome Sequencing in Mendelian Disorders

Complex structural variants (cxSVs) are genomic rearrangements comprising multiple structural variants, typically involving three or more breakpoint junctions. They contribute to human genomic variation and can cause Mendelian disease, however they are not typically considered during genetic testing. Here, we investigate the role of cxSVs in Mendelian disease using short-read whole genome sequencing (WGS) data from 1,324 individuals with neurodevelopmental or retinal disorders from the NIHR BioResource project. We present four cases of individuals with a cxSV affecting Mendelian disease-associated genes. Three of the cxSVs are pathogenic: a de novo duplication-inversion-inversion-deletion affecting ARID1B in an individual with Coffin-Siris syndrome, a deletion-inversion-duplication affecting HNRNPU in an individual with intellectual disability and seizures, and a homozygous deletion-inversion-deletion affecting CEP78 in an individual with cone-rod dystrophy. Additionally, we identified a de novo duplication-inversion-duplication overlapping CDKL5 in an individual with neonatal hypoxic-ischaemic encephalopathy. Long-read sequencing technology used to resolve the breakpoints demonstrated the presence of both a disrupted and an intact copy of CDKL5 on the same allele; therefore, it was classified as a variant of uncertain significance. Analysis of sequence flanking all breakpoint junctions in all the cxSVs revealed both microhomology and longer repetitive sequences, suggesting both replication and homology based processes. Accurate resolution of cxSVs is essential for clinical interpretation, and here we demonstrate that long-read WGS is a powerful technology by which to achieve this. Our results show cxSVs are an important although rare cause of Mendelian disease, and we therefore recommend their consideration during research and clinical investigations.


Main
Structural variants (SVs) are a major source of variation in the human genome and collectively account for more differences between individuals than single nucleotide variants (SNVs). [1][2][3] The canonical forms include balanced and unbalanced SVs, such as inversions, insertions, translocations, deletions and duplications. More complex rearrangements are typically composed of three or more breakpoint junctions and cannot be characterised as a single canonical SV type. These are known as non-canonical or complex SVs (cxSVs). 4; 5 Several previous studies have reported clinically relevant cxSVs in individuals with Mendelian disorders. For example, a duplication-triplication-inversion-duplication was found at the MECP2 and PLP1 loci in individuals with Rett syndrome, 6 and a duplication-inversionterminal deletion of chromosome 13 was present in foetuses with 13q deletion syndrome, 7 among others. [8][9][10] Recently, pathogenic cxSVs associated with autism spectrum disorder (ASD) and neuropsychiatric disorders have also been reported. 11; 12 Whole-genome sequencing (WGS) studies have shown that cxSVs are considerably more abundant and diverse than had been previously appreciated, representing an estimated 2% of the SVs in the human genome, and each human genome contains on average 14 cxSVs. 11 The presence of multiple types of cxSVs has also been independently observed in several other studies. 6; 12-15 Extreme cases of cxSVs, such as chromothripsis, have also been identified in both cancer cells and the germline, and involve hundreds of rearrangements often concerning more than one chromosome. 16; 17 11 Nevertheless, the study of cxSVs has been challenging due to technical limitations. Complex SVs have been reported in projects such as the 1000 Genomes, 1; 18 but these primarily focused on the canonical types. [19][20][21] With the rapid expansion of high-throughput sequencing (HTS) technologies including long-read WGS, genome-wide characterization of SVs with high precision has been achieved, 1 facilitating the study of more complex forms of SVs.
However, systematic identification of Mendelian disease-causing cxSVs in large cohorts is lacking. Here we aim to systematically investigate the role that cxSVs may play in rare Mendelian disorders in our cohort of 1,324 individuals using short-read and long-read WGS.
All participants were consented for the NIHR BioResource research study, which performs WGS of individuals with undiagnosed rare disorders. They were recruited from three different sub-projects: 725 were in the Inherited Retinal Disorders (IRD) project, 472 were in the Neurological and Developmental Disorders (NDD) project, and 127 were in the Next Generation Children (NGC) project, which performs diagnostic trio WGS of individuals from Neonatal and Paediatric Intensive Care Units. All participants provided written informed consent. The study was approved by the East of England Cambridge South national institutional review board (13/EE/0325).
We performed short-read WGS on all the subjects as previously described. 22 To identify cxSVs we followed a three step framework as previously suggested by Quinlan et al. 4 First we called canonical structural variants from the short-read WGS data, second we identified putative cxSVs by filtering and clustering adjacent SV calls, and third we reconstructed the likely architecture of the variant by a combination of manual analysis of sequencing reads, Sanger sequence analysis of PCR products across the breakpoint junctions, microarrays, and long-read WGS ( Figure S1). Canonical SVs were called by Canvas, 23 which identifies copy number gains and losses based on read depth, and Manta, 24 which calls translocations, deletions, tandem duplications, insertions, and inversions, and is based on both paired read fragment spanning and split read evidence. Filtering and clustering these SV calls then revealed putative cxSVs. Briefly, SVs were initially filtered to keep only those that pass standard Illumina quality filters, do not overlap previously reported CNVs in healthy cohorts, 25 and are rare (Minor Allele Frequency < 0.01 in the NIHR BioResource (n= 9,453)). To cluster the filtered SVs we identified those where at least one breakpoint was within 1Kb of the breakpoint of another SV in the same individual, and categorized them into different types using a custom script implemented in R. We kept only those clusters that overlapped a coding exon of a disease-associated gene. Disease-associated gene lists were assembled from various sources including OMIM, RetNet and literature searches, then curated to ensure they comply with previously described criteria. 26 The lists comprise 1,423 genes (NDD) and 248 genes (IRD). For NGC participants, trio analysis focused on de novo and rare biallelic variant discovery unrestricted by a gene list. We also filtered out putative cxSVs that only included multiple deletions, as these appear to be caused by insertion of a reverse transcribed mature mRNA molecule elsewhere in the genome. 27 This resulted in 81 candidate cxSVs.
We performed careful inspection of each candidate cxSV using Integrative Genomics Viewer (IGV). 28 We excluded a proportion that were actually canonical SVs -for example, dispersed duplications are frequently miscalled as an overlapping deletion and tandem duplication 29leaving 46 likely real cxSVs. Of these, 42 were deemed unlikely to be pathogenic because the participant's phenotype was not consistent with the phenotype associated with the gene, or they were heterozygous in a gene known to cause disease with a recessive mode of inheritance, with no second hit. To identify second hits within a gene, we considered SNVs and indels filtered for rare, high-quality, coding variants in a disease-associated gene as previously described. 22 We also excluded cases where pathogenic SNVs or indels in a disease-associated gene could entirely account for the disease.
The four clinically relevant cxSVs were found in four individuals; P1 and P4 from the NGC project, P2 from NDD and P3 from the IRD project. None of them were observed in our internal cohort of 9,453 genomes, ClinVar or DECIPHER. All predicted novel breakpoint junctions were confirmed in all four participants by Sanger sequencing using standard protocols. To confirm predicted copy number changes and regions of homozygosity we used Illumina SNP genotyping array as previously described, 22 and/or CytoScan® 750K Cytogenetics Solution microarray (Affymetrix). Here we describe the phenotypes and genetic architecture of the four cxSVs.
Participant 1 (P1) presents a de novo duplication-inversion-inversion-deletion encompassing ARID1B (MIM: 135900) that causes Coffin-Siris syndrome (CSS [MIM: 135900]). This individual was a four-month old female who was born prematurely and presented with characteristic features of CSS as a neonate. CSS is a multiple malformation syndrome characterized by intellectual disability, severe speech impairment, coarse facial features, microcephaly, developmental delay and hypoplastic nails of the fifth digits. 30 A large cxSV was identified on chromosome 6, comprising a 3.3Mb duplication, two inversions of 4.9Kb and 3.3Mb, and a 16.3Mb deletion ( Figure 1A). A total of 87 proteincoding genes were within the structural variant boundaries (Table S1) (Table S1). Of these 6, only ARID1B has been previously reported as disease-associated with a LOF mechanism.
Haploinsufficiency of ARID1B causes CSS; consistent with the phenotype of P1. We also looked at the 10 autosomal recessive genes within the deletion and did not find a second likely pathogenic variant in any. No disease-associated gene that was present within the duplicated region had been reported to be triplosensitive. Furthermore, the first inversion and the 3'breakpoint of the second inversion were within CNKSR3 (MIM: 617476). However, CNKSR3 has not previously been associated with disease, and is not constrained for LOF variation in ExAC, thus the effect of this inversion on the phenotype remains unknown.
For P1, Sanger sequencing of the PCR product generated over the breakpoint junctions confirmed J2 and J3 ( Figure 1A). Microarray was performed concurrently with WGS, which confirmed the copy number changes. We observed that the duplication-inversion-inversiondeletion variant occurred on the paternal chromosome, by looking at the parental origin of the hemizygous variants in the deleted region. Therefore, we found a de novo cxSV (dupINVinvDEL) in an individual with CSS, classified as pathogenic according to the ACMG guidelines. 32 Although the LOF of ARID1B likely explains the phenotype of this individual, she presented with a very severe case of CSS, so it is possible that other genes affected by the cxSV might contribute to the phenotype. Participant 2 (P2) has a deletion-inversion-duplication encompassing HNRNPU (MIM: 602869). This individual is a 22-year-old male who presented at term with hypotonia. All his early developmental milestones were delayed and he presented with tonic-clonic seizures at 9 months. His seizure disorder has been managed by medication but has continued episodically into adulthood. He presented with significant intellectual disability, autism, and limited speech and language, and MRI showed partial agenesis of the corpus callosum and enlarged ventricles. We identified a cxSV on chromosome 1, formed by a 1.2 Mb deletion and a 246Kb duplication flanking an inversion of 505Kb ( Figure 1B). This variant encompassed 8 genes (Table S1), of which two were previously associated with disease:  The Manta and Canvas calls from short-read WGS suggested two possible molecular models. Model one: formed by two closely located duplications followed by an inversion affecting only one copy of each duplicated segment ( Figure S2A). 12 And model two: formed by two closely located duplications followed by an inversion affecting both copies of the 3' duplicated segment ( Figure S2B). In both models the non-duplicated nested fragment (named C) is inverted. Importantly, model one suggests that there is an intact copy of CDKL5 on the non-reference allele, in addition to the disrupted copy; model two does not.
Therefore, resolving the model was essential for the interpretation of the pathogenicity of this variant. We attempted PCR amplification over the predicted new formed junctions, and only the one supporting the disrupted CDKL5, common to both, was amplified and confirmed by Sanger sequencing (Figure2S). Both duplications were confirmed by microarray.
In order to resolve the model, we performed long-read WGS of DNA from this individual using Oxford Nanopore Technologies (ONT). The sample was prepared using the 1D ligation library prep kit (SQK-LSK108), and genomic libraries were sequenced on R9 flowcell. Read sequences were extracted from base-called FAST5 files by albacore (version 2.0.2) to generate FASTQ files, and then aligned against the GRCh37 human reference genome using NGMLR (version 0.2.6) 36 and LAST (version 912). 37 Analysis was performed using default parameters, and for LAST we used first last-train function to optimise alignment scoring.
We obtained a median read length of 8,136bp ( Figure S3A), 56% of the genome was covered with a minimum coverage of 3x ( Figure S3B), and around 97% of the reads mapped to the human genome (GRCh37). All the breakpoints of the cxSV were covered by at least four reads.
Coverage was insufficient to resolve the cxSV using long-read SV calling algorithms such as Sniffles 36 or NanoSV, 38 (for which a minimum coverage of 10x is recommended). In lieu of this, we manually reviewed the split long reads across the cxSV junction breakpoints. Eight of the reads that covered the cxSV breakpoints were identified as inherited from the paternal chromosome, either by SNP phasing (Figure 1D, J2, J3, J4 and J6) or by indirect phasing based on the assumption that breakpoint junctions occur on the same allele ( Figure 1D, J5). Therefore, ONT sequencing allowed us to identify two reads supporting the junction that was initially not possible to confirm by Sanger sequencing (J5) due to sequence homology. By phasing analysis, we were also able to identify three reads supporting an intact copy of CDKL5 in the allele inherited from the father, confirming that model one ( Figure 1D) is likely to be correct. Although we were able to confirm that the cxSV harbours an intact copy of CDKL5, we cannot conclude that the de novo cxSV identified is benign and unrelated to the neonatal phenotype since we cannot rule out the possibility that regulatory elements are affected. Thus, this variant was classified as a variant of unknown significance (VUS). The child is currently one year old and developmentally normal with no seizures, but remains under ongoing follow up.
It has previously been reported that DNA replication-based mechanisms such as microhomology-mediated break-induced replication (MMBIR) or fork stalling and template switching (FoSTeS) are likely to be the primary mechanism responsible for the formation of cxSVs. 4; 5; 39-41 Our data overall support this as there is microhomology of at least 3bp in 8 of the 9 novel breakpoint junctions in the four individuals (Table S2). We also observe in P4 the insertion of a 100bp Alu sequence in J2 junction. It has been previously suggested that Alu elements could facilitate template switching and annealing via microhomology between replication forks. 41 However, additional evaluation of the breakpoint sequences with RepeatMasker also identified longer repeats and low complexity sequences in three of the individuals. In P1 we found that sequence flanking two of the breakpoints had high homology with SINE sequences (ERVL-MaLRs), one with LINE sequences (L2) and one with DNA/hAT-Charlie (MER3) sequences (Table 1). In P2 we noted that sequence flanking three of the breakpoints had homology with SINE sequences (Alu and MIR), and in P3, sequences surrounding all the breakpoints presented high homology with LINEs. This raises the possibility of involvement of homologous recombination mechanisms. 5 Interestingly, we also note that the inversions in P1, P2 and P4 involved junctions that were newly formed by the duplications in our predicted models.
In this study, we evaluate the role of cxSVs in Mendelian disorders using short-read and long-read WGS. By analysing cxSVs from WGS data in 1,324 individuals with NDD or IRD, we found that 46 (3.5%) have a rare cxSVs overlapping disease-associated genes, and at least 3 (0.3%) are disease-causing. This compares to approximately 5-20% of individuals with Mendelian disorders who have a clinically relevant canonical SV. 22; 42; 43 However, 0.3% is likely to be an underestimate due to three reasons.
First, a substantial proportion of individuals in NDD and IRD cohorts are pre-screened by microarray prior to enrolment in the project. It has been reported that 7.6% of all rare duplications detected by microarray are part of a complex rearrangement. 12 This misclassification is in part due to the impossibility for detecting inversions by microarray, since inversions are involved in 84.8% of cxSVs. 11 Thus, it is likely that many 'canonical CNVs' detected by microarray are actually misclassified cxSVs.
Second, short-read WGS is limited in identifying and resolving these variants, since they are prone to occur at repetitive regions where mapping and variant calling algorithms have lower sensitivity. Certain categories of cxSVs such as chromothripsis and those formed by nested rather than chained breakpoints could also be excluded by our filtering and clustering method. These limitations may now be addressed by long-read sequencing technologies such as Nanopore, which have the advantage of reads of 10-100Kb allowing for more accurate mapping particularly over repetitive regions, and facilitating phasing. 38 Although coverage is lower and error rate is higher than short-read WGS, various studies have demonstrated the power of long-read WGS to detect SVs and cxSVs. 36; 38; 44-46 Third, estimating the frequency of cxSVs requires defining them, which is not always straightforward because the distinction between canonical SVs, cxSVs, and chromoanagenesis can be unclear. 39; 47 It is therefore perhaps appropriate to consider types of human genomic variation as a continuum rather than discrete classes, progressing from SNVs (that cause the least disruption to the genome), through indels, canonical SVs and cxSVs to the highly disruptive chromoanagenesis and aneuploidies.
The low frequency of pathogenic cxSVs compared to other classes of pathogenic variants is likely to be due to differential rates of the causative mutational processes and higher purifying selection pressure on cxSVs. 2; 11 Consequently, it is unsurprising that three of the cxSVs reported here (in P1, P2 (confirmation in progress), and P4) are de novo. Two were also confirmed to have occurred on the paternally inherited allele, consistent with previously reported observations that ~80% of de novo mutations are of paternal origin. 48; 49 Complex SVs are not typically considered in research and clinical diagnostic pipelines, in part due to the technical and analytical challenges around identification and interpretation.
Therefore, when analysis is restricted to canonical SVs alone, the full scope of genome-wide structural variation is overlooked. This study demonstrates the potential for detecting and characterising cxSVs from short-read WGS to identify possible clinically relevant cxSVs, then performing long-read sequencing where necessary to resolve breakpoints. An alternative possible method is long insert WGS (liWGS), which has been successfully employed to detect cxSVs in other studies, and has the advantage of improved mapping particularly over repetitive regions due to the large fragments, but has a lower resolution of ~5Kb. [11][12][13] Our experience with P4, who has a VUS involving CDKL5, demonstrates that understanding the genetic architecture of a cxSV is essential for interpreting the pathogenicity of the variant, especially if the gene of interest is disrupted by a duplication or inversion rather than a deletion. The impact of a deletion on the function of affected genes is generally assumed to be LOF. However, the consequence of a duplication can be uncertain and depends on precisely how the variant rearranges the gene, as well as gene-specific factors such as dosage sensitivity. Furthermore, duplications intersecting regulatory regions can result in a different phenotype from variants within the gene itself. 50 Investigating cxSVs in our cohort identified previously reported subclasses (delINVdup, delINVdel and dupINVdup in P2, P3 and P4 respectively), as well as a dupINVinvDEL in P1. 11 As has been previously anticipated, with the study of larger population-scale cohorts and the use of higher resolution technologies such as long-read WGS, many additional subclasses will continue to be discovered. 11 However, inferring the likely molecular mechanism of cxSV formation is challenging due to the complexity of the events. The high frequency of microhomology observed at the breakpoint junctions of the cxSVs in our participants and the presence of an inserted sequence in three of them is generally consistent with the hypothesis that replication-based mechanisms such as FoSTeS/MMBIR are the primary mechanism responsible for the formation of cxSVs. However, we also find evidence for variable number tandem repeat (VNTR) sequences in all the participants.
Taken together, these results suggest a hypothesis whereby multiple mutational mechanisms could mediate the formation of cxSVs, as novel sequence junctions in intermediate derivative chromosomes formed by canonical SV could facilitate additional mutational events.
Our work demonstrates that cxSVs contribute to rare Mendelian disorders and provides insight into identifying and resolving cxSVs by using both short and long-read WGS. We suggest that cxSVs should be included into research and clinical diagnosis and considered when screening SVs in the human genome. Detailed characterization of cxSVs and largescale WGS studies are essential for unveiling the complex architecture and determining accurate population frequencies. Furthermore, we demonstrate a combinatorial approach of short-read WGS and confirmation by microarray, Sanger sequencing, and long-read WGS as an approach for resolving and characterizing cxSVs.

Supplemental data
Supplemental data includes three figures and two tables. ONT reads (paternal allele) J2 J3 J4