A method for complete characterization of complex germline rearrangements from long DNA reads

1. Department of Human Genetics, Yokohama City University Graduate School of Medicine 2. Research Institute for Microbial Diseases, Osaka University, Suita, Japan 3. Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST) 4. Graduate School of Frontier Sciences, University of Tokyo 5. Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST 6. Contributed equally


Introduction
Various germline DNA sequence changes are known to cause rare genetic disorders. Many small nucleotide-level changes (one to a few bases) in 4,143 genes have been reported in OMIM (https://www.omim.org/) (as of Aug 24, 2019), which are known as single gene disorders. In addition to these small changes, large structural variations of the chromosomes can also cause diseases.
Previous studies on pathogenic structural changes in patients with genetic/genomic disorders found chromosomal abnormalities by microscopy, by detecting copy number variations (CNVs) using microarrays 1 , or by detecting both CNVs and breakpoints using high-throughput short read sequencing 2 .
However, there are difficulties in precisely identifying sequence-level changes especially in highly similar repetitive sequences (e.g. simple repeats, recentlyintegrated transposable elements), or in finding how these rearrangements are ordered 3 . Long read sequencing (PacBio or nanopore) is advantageous for characterizing rearrangements in such cases, and is recently beginning to be used for patient genome analysis to identify pathogenic variations [4][5][6] . In addition, if rearrangements are complex (e.g. chromothripsis), long read sequencing (reads are more than 10 kb in length) has a further advantage because one read may encompass all or much of a complex rearrangement 7 . Chromothripsis is a chaotic complex rearrangement, where many fragments of the genome are rearranged into derivative chromosomes. Current approaches to analyze chromothripsis usually need manual inspection to reconstruct whole rearrangements. Detection and reconstruction methods for complex rearrangements are needed to characterize pathogenic variations from whole genome sequencing data.
In order to understand rearrangements between two sequences (e.g. a read and a genome), we must determine equivalent positions, i.e. bases descended from the same base in the most recent common ancestor of the sequences. This is not necessarily easy, due to sequences that are similar but not equivalent (e.g. alpha-1 and alpha-2 globin). If we compare two sequences that have both undergone deletions, duplications, and rearrangements since their common ancestor, it seems hard to reliably determine equivalent bases.
To make the problem tractable, we impose an assumption: that we are comparing a derived sequence (a DNA read) to an ancestral sequence (the genome) 8 . This means that every part of the read is descended from (equivalent to) a unique part of the genome. (The exception is "spontaneously generated" sequence not descended from an ancestor: this is rare, and we allow for it by allowing parts of the read to not align anywhere.) Thus, we need to accurately: divide the read into (one or more) parts and align each part to the genome. To do this, we first learn the rates of small insertions, deletions, and each kind of substitution in the reads 9 , then find the most-likely division and alignment based on these rates 8,10 . We can also calculate the probability that each base is wrongly aligned, which is high when part of a read aligns almost equally well to several genome loci. This approach was previously used to characterize rearrangements that are "localized", i.e. encompassed by one DNA read 8 .
Here we extend this approach, to: find arbitrary (non-localized) rearrangements, subtract rearrangements found in control individuals, then order and orient rearranged DNA reads to fully reconstruct complex rearrangements in derivative chromosomes.
. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint

Nanopore sequencing of 4 patients with chromosomal translocations
We sequenced genomic DNA from 4 patients with reciprocal chromosomal translocations using a nanopore long read sequencer, Table 1). Clinical information of these patients is described in the Supplementary methods and elsewhere [11][12][13][14] 11,12 has de novo reciprocal translocation between chr2 and chrX, 46,X,t(X;2)(q22;p13) (Fig2a). The breakpoints were not detected by short read . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint sequencing 15 though they were detected by more-painstaking breakpoint PCR 11 , so we tested whether we could find this rearrangement with long reads. We performed PromethION DNA sequencing (112 Gb), and found 2,773 groups of rearranged reads compared to human reference genome hg38. After subtracting rearrangements present in 33 controls, we found 80 patient-only groups, of which two involve both chr2 and chrX (Fig2b). These are exactly the reciprocal chr2-X translocation (Fig2c, Supplementary Fig 2). The breakpoints agreed with reported breakpoints determined by Sanger sequencing 11 Supplementary Fig 3).
These types of retrotransposon are known to be active or polymorphic in humans [16][17][18] . One case appears to be an orphan 3'-transduction from an L1HS in chr20: the L1HS was transcribed with readthrough into 3' flanking sequence, then the 3'-end of this transcript (without any L1HS sequence) was reversetranscribed and integrated into chr10 (Fig2e). Such orphan transductions can cause disease 19 . We also found an insertion of mitochondrial DNA (NUMT) into chr2 (Fig2e). Some of these rearrangements have been previously found in other humans, e.g. the ERV-K LTR inserted in chr12 20 . Thus our subtraction of rearrangements found in other humans was not thorough, especially because patient 1 is Caucasian whereas most of our controls (32/33) are Japanese.

Patient 2
Patient 2 (Nishimura et al. described as Case1) 11 has reciprocal chromosomal translocation between chr4 and chrX, 46,X,t(X;4)(q21.3;p15.2) and a 4 kb deletion of chrX and a 7 kb deletion of chr4 (Fig3a): these were . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint found previously by Southern blot combined with inverse PCR sequencing 11 but not by short read sequencing 15 . We performed PromethION DNA sequencing (117 Gb), and found 3,336 groups of rearranged reads relative to the reference genome, which reduced to 33 groups after control subtraction (Fig3b). Only 2 out of 33 groups involve both chr4 and chrX: they show a reciprocal unbalanced chromosomal translocation exactly as described previously 11,15 (Fig3c, Supplementary Fig 4). Another of the 33 groups shows a 43 kb deletion near the translocation site at chrX:107943791-107986323 (Fig3c, Supplementary Fig   4), which eliminates the TEX13B gene ( Supplementary Fig 4), and was not previously described 15  including a 10kb deletion that removes most of the TRIM48 gene.

Patient 3: complex rearrangements at chr7-chr15 translocation
We next analyzed Patient 3 whose precise structure of chromosomal translocations was only partly solved before 13,15 . Patient 3 was reported to have two reciprocal chromosomal translocations between chr7 and chr15 as well as between chr9 and chr14, t(7;15)(q21;q15) and t(9,14)(q21;q11.2) (Fig4a), and has 4.6 Mb and ~1 Mb deletions on chr15 and chr7, respectively, which were predicted by microarray, although the precise locations of breakpoints were not detected in detail. We performed whole genome nanopore sequencing (95 Gb) . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint on this patient and found 3,351 groups of rearranged reads relative to the reference genome, which reduced to 43 groups after control subtraction (Fig4b).
Fifteen out of 43 groups are involved in the two translocations: dnarrangelink found a unique way to order and orient them without changing the number of chromosomes (Fig4c, Supplementary Fig 6). At first, there seem to be two groups involving both chr9 and chr14, which accurately indicate the balanced chr9-chr14 translocation described previously 15 . However, dnarrange-link additionally identified a complex rearrangement for t(9,14)(q21;q11.2). A part of chr4 was unexpectedly inserted to derivative chr9 (Fig4d). This rearrangement was not investigated in the previous analyses, as chr7q21 was the primary locus for split-foot. In addition to this, dnarrange respectively, which were detected by microarray (Fig 4e). Note that these deletions are not present in any part of the rearrangement, but only in the fullyreconstructed rearrangement: they are holistic properties of the complex rearrangement. One candidate gene for split-foot, SEM1 was not disrupted, nor had altered expression in lymphoblastoid cells ( Supplementary Fig 7a, b, Supplementary Results).
. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint A striking feature of these rearrangements is that the rearranged fragments come from near-exactly adjacent parts of the ancestral genome (Fig4d, e). This suggests that the rearrangements occurred by shattering of the ancestral genome into multiple fragments, which rejoined in a different order and orientation with loss of some fragments. Such shattering naturally explains why the fragments come from adjacent parts of the ancestor 8 .
We performed Sanger sequence confirmation for all 18 breakpoints  Table 6). There were only minor differences (usually 0 or 1 bases) between Sanger sequenceconfirmed breakpoints and dnarrange predicted breakpoints from lamassemble consensus sequences ( Supplementary Fig 9).
The other rearrangements are mostly local tandem-duplication or insertions (Supplementary Table 4  CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review)  Supplementary Fig 11). Dotplot pictures of reads that cross the chr1 breakpoint suggest that there is a reciprocal translocation, but the other half of the read aligns (with low confidence) to satellite or simple repeat sequences at centromeric regions on multiple different chromosomes (Fig5d, two example reads are shown). This limitation might be overcome by obtaining reads long enough to extend beyond the centromeric repeats, or perhaps by obtaining a reference genome that is more accurate in centromeric regions.

Discussion
We analyzed a variety of chromosomal translocations in 4 patients, who were selected because previous studies had difficulty in determining precise breakpoints by conventional approaches including microarrays and short read sequencing. Especially, complex rearrangements in Patient 3 were not solved even by intensive analysis 13,15 . Our method could not only precisely detect breakpoints but also characterize how shattered fragments were ordered and oriented. To the best of our knowledge, there has been no method to filter patient-only rearrangements, and connect them to reconstruct rearranged chromosomes by an automatic algorithm.
Recently, long read sequencing is becoming available for individual genome analysis due to a decrease in cost and increase in output data size. Accordingly, there have been a few approaches to use long read sequencing to detect structural variations 7,8,21 , including tandem-repeat changes in rare genetic diseases 6 , providing evidence that long read sequencing has a clear advantage in precisely detecting rearrangements. We observed that multiple breakpoints were jointly detected in a single read in Patient 3 (Supplementary . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint Fig 8d, e), because long enough reads can cover several breakpoints, which is helpful to phase and order rearrangements. There are continuous efforts to obtain longer nanopore reads, however, in case of complex rearrangements (e.g. chromothripsis), it is not easy to cover whole rearrangements, as seen in  Table 5). In summary, our approach using dnarrange and long read sequencing is superior to conventional approaches (e.g. microarray) because: it can 1) connect multiple rearrangements, 2) subtract shared rearrangements, and 3) detect balanced chromosomal rearrangements (e.g. inversion).
Our approach in this study narrowed down patient-only rearrangements using 33 controls. The number of rearrangements decreased exponentially with the first few samples to a few hundreds. This may be due to the presence of common rearrangements in the population. We suspect large numbers of controls will not be needed if there is a target rearrangement locus (e.g. 4p15.2) because the number of candidates is small. In all 4 patients, .
CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint patient-only (not present in at least 66 autosomal alleles of 33 controls) rearrangements were fewer than 100. If we were to further narrow down to ultra-rare variations that may cause rare congenital disorders, a larger number of controls may be considered. Patient 1 has more patient-only groups of rearranged reads (80) than the other patients (33, 43 and 14). This is because the patient is Caucasian and most of the control data used were Japanese (32/33 datasets). Applying ethnicity-matched controls, or parents or other relatives, will be useful to further remove benign rearrangements.
We noticed that large fractions of these rearrangements are insertions or tandem multiplications (Supplementary Table 4 Table 7). Interestingly, most of the inserted sequences were aligned to TEs. TE insertions may be a common type of rare variation seen in individuals. In addition to TE-insertion, we detected rare processed pseudogene insertion in 3 patients. Two of these insertions were previously described with allele frequency 1-10% in Japanese (MFF) and 1-10% in non-Japanese (MATR3) 26 . We also observed non-tandem duplications that do not seem to be retrotranspositions: interestingly, about half of these are localized, i.e. a copy of a DNA segment is inserted near (e.g., within a few kb of) the original segment 8 (see blue highlighted loci in Supplementary Table 5).
Our analysis proves useful despite its dubious assumption that the reference genome is ancestral to the DNA reads. This may be partly because we focus on disease-causing rearrangements, which are likely to be derived.
Also, incorrect rearrangements due to a non-ancestral reference may be found . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint in both patients and controls, thus filtered out. It would be useful to construct a reference human genome that is ancestral (and complete), as far as possible, because this simplifies the relationship between the reference and extant human DNA sequences 8 .
Our method in combination with subtracting shared rearrangements in control datasets has a great strength in precisely detecting chromosomal rearrangements, including inversions, translocations, TE insertions, NUMT and processed pseudogene insertions. There has been no method that can effectively subtract rearrangements shared in the population, thus we believe our method is useful to analyze complex rearrangements in a clinical setting (i.e. rare genetic disease or perhaps cancer genomes). We also showed a limitation of our method: detecting rearrangements in large repetitive regions beyond the length of long reads in Patient 4. These regions are still elusive and highly variable between individuals. To date there is no good method to detect rearrangements in large repetitive regions (e.g. centromeric or telomeric repeats) genome-wide. We hope our understanding of these still-intractable regions will expand as sequencing technologies advance.
In conclusion, we developed an effective method to find chromosomal aberration, with precise breakpoint identification, only from long read sequencing. Our method also provides an automatic algorithm for reconstruction of complex rearrangements. Long read sequencing may be considered when chromosomal abnormalities are suspected.
. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint

Samples and ethical issues
All genomic DNA from patients and controls were examined after obtaining informed consent. Experimental protocols were approved by institutional review board of Yokohama City University under the number of A19080001. dnarrange dnarrange finds DNA reads that have rearrangements relative to a reference genome, and discards "case" reads that share rearrangements with "control" reads (Supplementary Methods). It takes one or more files of read-to-genome alignments, where each file is a "case" or a "control". It assumes the alignments have this property, which is guaranteed by last-split: each read base is aligned to at most one genome base. dnarrange first performs these steps, for cases and controls: 1. In order to recognize large "deletions" as rearrangements, if an alignment has deletions >= g (a threshold; default 10kb), split it into separate alignments either side of these deletions.
2. Get rearranged reads. We classify rearrangements into four types: interchromosome, inter-strand (if a read's alignment jumps between the two strand of a chromosome), non-colinear (if a read's alignment jumps backwards on the chromosome), and "big gap" (if a read's alignment jumps forwards on the chromosome by >= g).
3. Discard any "case" read that shares a rearrangement with any "control" read. (Two reads are deemed to share a rearrangement if they have similar . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint rearrangements that overlap in the genome: the precise criteria are in the Supplementary methods, Supplementary Fig 13, 14.) It then performs these steps, for cases only: 4. Discard any read with any rearrangement not shared by any other read.
Repeat this step until no further reads are discarded (so that dnarrange has the useful property of idempotence).

Group reads that share rearrangements. First, a link is made between any
pair of reads that share a rearrangement. Then, groups are connected components, i.e. sets of reads linked directly or indirectly.
6. Discard groups with fewer than 3 reads. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint * The left end is downstream of (has higher reference coordinate than) the right end.
In order to infer the actual links, we require some further information or assumption. We make this assumption: there are as many links as possible, or equivalently, the derived genome has as few chromosomes as possible. For example, in Fig6a, B1 may be linked to C2, but in that case it becomes impossible to link C1 to anything, and D1 to anything. Based on our assumption, we instead link B1 to C1 and D1 to C2. In this example, dnarrange-link infers two derivative chromosomes: one is reconstructed from two reads by linking A2 to E1, the other is reconstructed from three reads by linking D1 to C2 and C1 to B1 (Fig6b).
The two types of end, with linkability relationship, define a bipartite graph. To infer the links based on our assumption, we find a "maximum matching" in this graph. If there is more than one maximum matching, one is chosen arbitrarily, and a warning message is printed. In . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint 1. Calculate the rates of insertion, deletion, and substitutions between two reads by "doubling" the rates from last-train, because errors occur in both reads.
2. Use these rates to find pairwise alignments between the reads with LAST.
LAST also calculates the probability that each pair of bases is wrongly aligned (which is high when there are alternative alignments with near-equal likelihood). Some results using a prototype of lamassemble were published previously 6 .

Sanger-sequence confirmation of breakpoints
. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) (which The copyright holder for this preprint this version posted October 5, 2019. ; https://doi.org/10.1101/19006379 doi: medRxiv preprint PCR primers for breakpoints estimated from rearrangements were designed using primer3 plus software (Supplementary Table 6). PCR amplification was done using ExTaq (Takara) and LATaq, then amplified products were Sanger sequenced using BioDye Terminator v3.1 Cycle Sequencing kit with 3130xl genetic analyzer (Applied Biosystems, CA, USA).   Long DNA reads are aligned to a reference genome using LAST (blue box), then dnarrange finds rearranged reads, and groups reads that overlap the same rearrangement (pink box). lamassemble merges/assembles each group of reads into a consensus sequence (yellow box). When there is a "complex" rearrangement (more than one group of rearranged reads is needed to understand the full structure of the rearrangement), dnarrange-link was used to infer the order and orientation of the groups, and thereby reconstruct derivative chromosomes (green box).      b. Derivative chr R was reconstructed by linking A2 to E1 (left). Derivative chr S was reconstructed by linking B1 to C1, and D1 to C2 (right). B1 can also be linked to C2, but in that case it is impossible to link C1 to anything, and D1 to anything, thus this possibility was suppressed.

Web resources
. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review) . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. was not certified by peer review)