ClinPrior: an algorithm for diagnosis and novel gene discovery by network-based prioritization

Background Whole-exome sequencing (WES) and whole-genome sequencing (WGS) have become indispensable tools to solve rare Mendelian genetic conditions. Nevertheless, there is still an urgent need for sensitive, fast algorithms to maximise WES/WGS diagnostic yield in rare disease patients. Most tools devoted to this aim take advantage of patient phenotype information for prioritization of genomic data, although are often limited by incomplete gene-phenotype knowledge stored in biomedical databases and a lack of proper benchmarking on real-world patient cohorts. Methods We developed ClinPrior, a novel method for the analysis of WES/WGS data that ranks candidate causal variants based on the patient’s standardized phenotypic features (in Human Phenotype Ontology (HPO) terms). The algorithm propagates the data through an interactome network-based prioritization approach. This algorithm was thoroughly benchmarked using a synthetic patient cohort and was subsequently tested on a heterogeneous prospective, real-world series of 135 families affected by hereditary spastic paraplegia (HSP) and/or cerebellar ataxia (CA). Results ClinPrior successfully identified causative variants achieving a final positive diagnostic yield of 70% in our real-world cohort. This includes 10 novel candidate genes not previously associated with disease, 7 of which were functionally validated within this project. We used the knowledge generated by ClinPrior to create a specific interactome for HSP/CA disorders thus enabling future diagnoses as well as the discovery of novel disease genes. Conclusions ClinPrior is an algorithm that uses standardized phenotype information and interactome data to improve clinical genomic diagnosis. It helps in identifying atypical cases and efficiently predicts novel disease-causing genes. This leads to increasing diagnostic yield, shortening of the diagnostic Odysseys and advancing our understanding of human illnesses. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-023-01214-2.

-Table S6.List of identified genes, OMIM nomenclature and numbers of cases identified in our cohort.
-Table S8.Molecular function GO terms enriched in the HSP/CA network.
-Table S9.Biological process GO terms enriched in the HSP/CA network.
-Table S10.Cellular compartment GO terms enriched in the HSP/CA network.
-Table S11.Candidate genes in the HSP/CA expanded network.

Benchmarking using a synthetic WES dataset
To test the performance of the method at identifying causal variants in sequencing data, we generated a synthetic whole exome sequencing (WES) dataset consisting of 68,210 VCF files (one for each of the pathogenic variants retrieved from the ClinVar December 2019 release).
We excluded variants classified as "modifier" and in noncoding regions and those genes with HPO-gene associations below 5, resulting in 66,800 variants [1].The synthetic dataset was created by inserting a single pathogenic variant from the set of 66,800 in high-confidence variant calls for one individual (NA12878) published by the Genome in a Bottle (GIAB) consortium, used frequently for benchmarking purposes [2].Each pathogenic variant zygosity was adjusted to match the OMIM disease's mode of inheritance for the associated gene (heterozygous for dominant diseases, homo or hemizygous for recessive diseases).Synthetic exomes were then filtered based on frequency (MAF<0.1% according to the VEP filter tool in each of the 1000 Genomes Project, NHLBI GO Exome Sequencing Project (ESP), ExAC and gnomAD) and variant impact excluding noncoding regions and modifier ones, resulting in 642 variants.We then ran ClinPrior using as an input 1) the filtered VCF files and 2) the patient phenotypic features created with the HPOs of the pathogenic variant-associated gene extracted from the HPO-gene associations from the phenotypic layer.
The prioritization process in ClinPrior is weighted by two variables: i) calculation of the phenotypic association metric derived from prior phenotype-gene association knowledge and further network propagation of this metric, and ii) calculation of the variant deleteriousness score.Using the synthetic data from 66,800 WES, we consider four different benchmarking scenarios: Scenario 1.
 Patient with perfect phenotypic match (all HPOs associated to the gene).Scenario 2.

Scenario 3.
 Novel candidate causal gene (removed candidate HPOs-gene associations from the 439.300HPO-gene associations, therefore without prior knowledge)  Patient with perfect phenotypic match (all HPOs associated to the gene).

Scenario 4.
 Novel candidate causal gene (removed candidate HPOs-gene associations from the 439.300HPO-gene associations, therefore without prior knowledge)  Patient with unmatching phenotypic data (random HPOs).
In the first scenario 1, we considered an ideal situation where the simulated patient's phenotype perfectly matches the clinical description from the databases associated with the causal gene.Here, the phenotypic association score is the highest possible, resulting in the desired near-perfect top-1 identification (AUROC = 0.9994).This perfect scenario is in contrast with the rest: i) Scenario 2, with random phenotypic data, where patient phenotypic mismatch diminishes the phenotypic association metric, and the prioritization is driven mostly by the variant deleteriousness score (AUROC = 0.8393); ii) Scenario 3, with simulated novel candidate genes, removing prior HPO-gene associations of the candidate, but with perfect patient phenotypic matching.Here, both the phenotypic metric propagation and the variant deleteriousness score are the main drivers of variant ranking (AUROC = 0.7824); and iii) Scenario 4, or worse-case scenario, with simulated novel candidate genes but with unmatched (random) patient phenotypic data.Here, only the variant deleteriousness score is the main driver of variant prioritization (AUROC=0.6489).

HSP/CA expanded network
We constructed an HSP/CA expanded network with an initial list of 718 seed genes with the terms "spastic paraplegia" or "ataxia" in HPOs included in the OMIM database.Next, we used ClinPrior to obtain a list of the 1,000 prioritized genes for each seed gene after considering the HPO-gene associations of the 718-gene from the phenotypic layer as the patient.Then, we selected the most recurrent genes, with at least 75 appearances, among the 718 lists with the top 1,000 prioritized genes.With this procedure, we obtained 2,187 genes that we extracted from the global physical and functional ClinPrior networks, resulting in a final HSP/CA expanded interactome of 27,759 gene-gene interactions.Network available in the NDEx repository [3].
To assess whether there was greater connectivity in an HSP/CA expanded network than in the global network, we calculated i) the number of edges between protein pairs and ii) the average path length in the HSP/CA network by calculating the shortest paths between all protein pairs.We then compared these statistics for 1,000 permutations of a randomly selected set of 2,187 proteins derived from the global network.Finally, we calculated the Z scores to determine how far the measures of the HSP/CA expanded network deviate from the expected mean (μ).
To evaluate which pathways or functional categories were enriched in the HSP/CA network, we followed a similar strategy as described elsewhere [4].Briefly, we used hypergeometricbased tests from the GOstats package [5].We used p < 0.001 as the cut-off point for GO terms with fewer than 1,000 protein members to determine which GO terms were significantly enriched.

WES-WGS and variant calling
For WES analysis, capture was performed using the SeqCap EZ Human Exome Kit v3.0 (Roche Nimblegen, USA) or the SureSelect XT Human All Exon V5 50 Mb kit (Agilent, USA) with 100-bp paired-end read sequences, and for WGS, a PCR-free library with 150-bp paired-end read sequences was generated on a HiSeq 2000-4000 platform (Illumina, Inc.USA) at Centre Nacional d'Anàlisi Genòmica (CNAG Barcelona, Spain).Sequences were aligned to hg19 by Burrows-Wheeler Aligner (BWA mem), and single nucleotide variants and small insertions/deletions (indels) were identified using GATK, applying GATK's best practices for germline single nucleotide polymorphism (SNP) and indel discovery in WES [6].Structural variants, including CNVs, were called using ClinSV v0.9 [7].CNVs for genome data were detected by ControlFreeC v11.5 [8,9] using a window size of 20 kb and a step size of 4 kb.All the variants identified by at least one tool were considered for further analysis.

Variant segregation and classification
Sanger sequencing was used in all cases to confirm the findings, and for family segregation, 8 of the 11 de novo variants were tested in both parents.Especially for novel variants, downstream targeted inheritance testing was critical to variant classification.
The candidate variants were strictly classified following the ACMG/AMP standards and guidelines for the interpretation of sequence variants [10][11][12].We used the VarSome search engine [13] to annotate the identified variants.Discrepancies in the classification in some variants when compared to VarSome are due to the finding of another pathogenic variant in trans in the same gene (PM3), a de novo variant (PS2, PM6), a good phenotypic fit (PP4) or functional validation (PS3).A case was considered solved if variants were classified as pathogenic or likely pathogenic.Cases with a variant of unknown significance (VUS) but compatible segregation studies and specific clinical and MRI findings highly suggestive of a given disease were also considered solved.Incidental findings were reviewed in all patients according to published guidelines [14,15].

Functional validation
Several VUSs were functionally tested by different methods, including transfection assays, lipidomics, cDNA sequencing or minigene splicing assays, targeted metabolomics and mitochondrial respiration assays, patch-clamp assays to measure potassium currents, serylation assays and yeast complementation studies, mRNA and protein quantification with immunofluorescence analysis, and quantitative real-time (qRT-PCR) for CNV validation (additional Tables S4 and Table S7).

Detection of the biallelic intronic expansion AAGGG in RFC1
Forty-three patients with negative results in WES were studied.We made the diagnosis of pathogenic biallelic expansion in RFC1 in those cases in whom 1) RP-PCR did not repeatedly amplify any product compared to healthy controls, and 2) RP-PCR showed a characteristic "sawtooth" pattern.PCR primers (F:TCAAGTGATACTCCAGCTACACCGTTGC and R:GTGGGAGACAGGCCAATCACTTCAG) were extracted from Cortese et al. [16].
Polymerase chain reactions (PCRs) were performed in a final volume of 50 µl using Taq polymerase (Sigma).The amplification products were visualized after 2% agarose gel electrophoresis (1 h, 60 V).The primers used in the RP-PCR were extracted from Rafehi et al [17].PCR was performed in a final volume of 50 μl using 20 ng genomic DNA, 0.8 mM of the primers CANVAS_FAM_2F and M13R, and 0.2 mM of the first M13R_CANVAS_RE_R using Phusion FLASH High Fidelity MASTER MIX (Thermo Fisher) polymerase.Products were diluted in formamide and analysed in an ABI3730xl DNA Analyser-sequencer using GeneMapper (Applied Biosystems).

PanelApp panels related to Hereditary Spastic Paraplegia or Ataxia
We examined the genes diagnosed in our real-world cohort, including pathogenic variants detected by both WES and WGS and variants in candidate genes validated by us (validated candidate), but excluding phenotype-matching VUSs, in the following 6 different PanelApp panels (https://panelapp.genomicsengland.co.uk/) related to hereditary spastic paraplegia or ataxia.

Receiver Operating Characteristic (ROC).
The predictive performance of ClinPrior was evaluated by plotting the area under the receiver operating characteristic (AUROC) curve to assess: i) variant prioritization of known disease genes and candidate disease genes in 66,800 synthetic WES analysed, and ii) the optimal number of HPOs required for gene prioritization in 82 patients from the real-world cohort.
When querying the 66,800 synthetic WES, we considered the top 50 positions of the variant ranked lists, and when querying the 82 patients (100 runs each), we considered the top 1,000 positions of the gene ranked list after average prioritization (the top ~5% of genes queried).
When we look at the top 50 positions (if we do top 50 for variant prioritization (i)) and the top 1,000 positions (if we do top 1,000 for gene prioritization in HPO benchmarking (ii)), the predicted rank is progressively weighted, i.e. the AUROC decreases progressively as the prior rank increases.Thus, if ClinPrior prioritizes at position 1, the resulting AUROC is also 1, and if the rank is 10, the AUROC is 0.8163 (i) or 0.991 (ii).For a rank of 50 or higher (i) or 1,000 or higher (ii) the AUROC is 0. The pROC package in R was used to calculate AUCs along with their standard errors and 95% confidence intervals [18].We used plotROC [19] and ggplot2 [20] packages to plot the ROC curves.The rank lists for each scenario are available on zenodo [1].

Additional Results
We enrolled 135 families with undiagnosed HSP and/or CA after targeted screening for the most common genetic causes, as described in the Materials and Methods.Based on the phenotypic traits, we classified patients into several groups: i) pure spastic paraplegia, ii) pure CA, and iii) complex phenotypes presenting ataxia and/or spasticity with other symptoms (Fig. 3A).The probands were 85 males and 50 females, with ages ranging from 1 month to 83 years (median 32 years).The age of clinical onset ranged from the first month of life to 72 years (median 7 years); age was less than 20 years in 85 patients (63%) and more than 20 years in 50 patients.The median evolution of disease before WES testing was 9 years (1 month -53 years), and it was longer than 10 years in 56% of patients.Consanguinity was reported in 16 families (12%).
Illustrative clinical cases 1. Atypical phenotypes GFAP (IDSPG4): In this family with 3 generations (the proband, a 46-year-old woman; her son; her cousin; and her cousin's son) affected by adult-onset, mild spastic paraplegia, cranial and medullar MRI findings were initially described as normal.A heterozygous novel missense variant (p.Gly18Val) in the GFAP gene cosegregated in all affected family members, as well as in two asymptomatic relatives (the proband's sister and her mother).
Mutations in this gene cause Alexander disease (OMIM #203450) [21], an autosomaldominant leukodystrophy with described adult presentations [22].We therefore decided to clinically re-evaluate all family members; all of them showed clear signs of cerebellar dysfunction and spastic paraparesis, and two patients were paucisymptomatic, presenting mild alterations in neurological examinations, namely, scoliosis, nystagmus, diplopia, hyperreflexia and the Babinski sign.MRI reexamination of all patients showed notable spinal cord and medulla atrophy, in contrast to what was observed in the less affected patients, who presented mild signal changes in the trunk and less atrophy.Magnetic resonance spectroscopy showed a metabolite profile suggestive of astrocyte hypertrophy, consistent with neuroaxonal degeneration [23].We validated this variant using a transfection assay to test the capacity of the GFAP protein encoded by the gene carrying p.Gly18Val variant to induce protein aggregation in the astrocytoma cell Line U251-MG.We did not observe inclusions, dot-like clumps or aggregates in the p.Gly18Val-mutant construct, but we did observe abnormally large cell sizes with long astrocytic processes, a phenotype that was confirmed by quantitative image analysis.Thus, we proposed considering astrocyte hypertrophy as an additional criterion of pathogenicity in the functional evaluation of unreported variants [24].NDUFS6 (IDSPG55): This patient was a 66-year-old female with spastic paraparesis, hyperreflexia and dysarthria beginning at 10 years of age, accompanied by seizures with good response to antiepileptic drugs, tremor, cavus feet, keratoconus, cataracts, vocal cord weakness, and mild developmental delay.She learned to talk at 5 years of age.Nerve conduction studies revealed axonal sensorimotor neuropathy, and cranial MRI showed a signal intensity alteration in the posterolateral portion of both lenticulate nuclei that was hyperintense in proton density and T2 images.She had an affected sister.They were born from consanguineous parents.Metabolic studies including lactate were not available.Both patients were homozygous for the splicing variant in NDUFS6, c.309+5G>A, classified as pathogenic according to the ACMG criteria [11,25].This variant was previously reported in two different patients with Leigh syndrome and was validated functionally [26,27].The clinical picture and MRI findings were compatible with Leigh syndrome.Most previous reports suggested that NDUFS6 pathogenic variants invariably lead to neonatal or early childhood death; instead, these patients showed a slowly progressive disease, as seen in Leigh syndrome caused by mutations in other genes.To date, mutations in more than 80 genes have been found to be related to Leigh syndrome [28].Therefore, this family expands the clinical phenotype associated with NDUFS6.

ACER3 (IDSPG75):
This patient, a 16-year-old male born to consanguineous parents, had an affected male cousin.Both had shown early-onset, slowly progressive pure spastic paraparesis and had normal MRI findings.They harbour one homozygous variant in the ACER3 gene p.(Gly211Cys), classified as pathogenic.Only six cases of ACER3-related progressive leukodystrophy have been reported [29,30], and all of these patients showed severe developmental delay with neurological regression, mainly including truncal hypotonia, spasticity, dystonia, seizures, feeding problems and, except for one patient, acquired lateinfantile-onset microcephaly.Regarding the imaging findings, they exhibited abnormal periventricular and deep WM signals, a thin corpus callosum and progressive atrophy.These patients had a noticeably milder disease than previously reported patients, thus expanding the clinical spectrum of patients carrying these variants.

KIDINS220 (IDSPG18):
This patient was a 10-year-old female with spastic paraparesis and preserved cognition.Her development was considered normal until 4 years of age.The findings of MRI performed at 10 years of age were normal.She was heterozygous for a functionally validated variant (p.Ser1352Glyfs43Ter) in KIDINS220.A complex-spastic paraplegia phenotype known as spastic paraplegia, intellectual disability, nystagmus, and obesity syndrome (SINO) has been related to heterozygous variants in this gene, and reports show invariably delayed psychomotor development, intellectual disability, spasticity, cerebral atrophy with reduced white matter volume, nystagmus and increased height and weight [31,32].Furthermore, homozygous variants lead to a severe autosomal-recessive cognitive disorder characterized by the onset of arthrogryposis and ventriculomegaly (VENARG) that is not compatible with life [33].This patient had a noticeably milder disease than previously reported patients, thus expanding the clinical spectrum of patients carrying these variants.

COL6A3 (IDSPG161):
This patient, a 56-year-old woman presenting with severe and congenital axonal peripheral neuropathy, developed pyramidal signs at 3 years of age.She had no previous remarkable family or personal history.We identified a homozygous variant (p.Lys2483Glu) in COL6A3, a gene that has been associated with Bethlem myopathy, dystonia and Ulrich congenital muscular dystrophy.More than 10 patients with this mutation have been reported [34,35].This patient shows an atypical form because predominantly distal involvement and the pattern of neurogenic involvement in the neurophysiological study are present.Patients with both axonal neuropathy and contractures may have a collagen VIrelated disorder.

PMM2 (IDSPG123):
This patient, a 13-year-old male patient presenting with nonprogressive congenital ataxia with oculomotor apraxia associated with intellectual disability (IQ: 50), neurodevelopmental delay, and hypotonia, was born to healthy nonconsanguineous parents.MRI showed nonprogressive global cerebellar atrophy.Two variants in compound heterozygosity were found in PMM2, a gene associated with congenital glycosylation disorder type Ia, CDG Ia (OMIM: # 212065) [36].The p.Arg141His variant has been reported in numerous publications as the variant most frequently found in patients with CDG Ia [37].The intronic branch site c.640-23A>G was previously described in a patient with CDG Ia who presented this variant in compound heterozygosity with the p.Arg141His variant.The adenosine substituted in the c.640-23A> G variant is found in the sequence of the branch-site TTCAT, a highly conserved domain within intron 7 that facilitates the removal of the intron in the splicing process [38].Strikingly, patient IDSPG123 presented a fundamental neurological phenotype and exhibited normal phosphomannomutase activity, similar to that observed in patients with other "mild" splicing variants, who can show normal values of enzymatic activity even in the presence of disease.

SPTAN1 (IDSPG134):
This patient, a male with no relevant personal or familial antecedents, presented with pure spastic paraparesis at 7 years old.His condition was slowly progressive, and at 21 years old, he had normal MRI and nerve conduction study findings.WES detected a nonsynonymous single nucleotide variant (c.55C>T:p.Arg19Trp) in SPTAN1, and Sanger segregation confirmed a de novo state.De novo variants in nonerythrocytic alpha-II-spectrin (SPTAN1) cause several phenotypes, including i) early-infantile epileptic encephalopathy type 5 (OMIM #613477), an early onset epilepsy with global developmental delay and spastic quadriplegia [39]; ii) CA with hypoplastic brain structures and intellectual disability without seizures [40]; iii) late-onset hereditary motor neuropathy [41]; and iv) a neurodevelopmental phenotype with peripheral sensorimotor neuropathy [42].A previous report suggested that biallelic SPTAN1 variants may lead to pure autosomal-recessive hereditary spastic paraparesis [43]; therefore, we performed WGS in this patient to find a second variant.WGS did not find another deleterious variant.Thus, the case of this patient further expands the phenotypic spectrum associated with SPTAN1 variants to pure spastic paraparesis with de novo heterozygous variants.

LONP1 (IDSPG166):
In this family, two siblings, aged 39 and 36 years, had child-onset CA with pyramidal involvement that was slowly progressive.MRI revealed cerebellar atrophy and a dilated 4 th ventricle.They had two variants in a compound heterozygous state in the LONP1 gene, classified as pathogenic and likely pathogenic after applying ACMG criteria.A multisystemic phenotype known as CODAS, an acronym for cerebral ocular dental auricular skeletal syndrome, has been linked to this gene.Affected patients have developmental delay, craniofacial anomalies, cataracts, ptosis, a median nasal groove, delayed tooth eruption, hearing loss, short stature, delayed epiphyseal ossification, metaphyseal hip dysplasia, and vertebral coronal clefts [44].In this family, there are predominant neurological manifestations such as ataxia and spastic paraparesis, but these manifestations are not associated with ocular, auricular, or skeletal abnormalities.Thus, this family expands the clinical spectrum associated with LONP1.

PDK3 (IDSPG172):
This patient, a 50-year-old woman presenting with severe and congenital axonal sensorimotor peripheral neuropathy, developed pyramidal signs at 13 years of age.She had no previous remarkable family or personal history.We identified a de novo hemizygous variant (c.473G>A) in PDK3, a gene that has been associated with Charcot-Marie-Tooth disease, X-linked dominant, 6.Only two patients with this mutation have been reported.The presence of spastic paraplegia had not been previously reported in Charcot-Marie-Tooth disease, X-linked dominant, 6.

KCNA1 (IDLNF52):
This patient, an 8-year-old male presenting with a severe combination of dyskinesia and neonatal epileptic encephalopathy, was born to consanguineous parents.WES revealed a homozygous variant (p.Val368Leu) in KCNA1, involving a conserved residue in the pore domain, close to the selectivity signature sequence for K+ ions (TVGYG).
Functional analysis showed that the mutant protein alone failed to produce functional channels in the homozygous state, while the coexpression of the mutant protein with wildtype protein had no effects on K+ currents, which were similar to those with wild-type protein alone.Over 50 families affected by the episodic ataxia type 1 disease spectrum have been described with mutations in KCNA1, encoding the voltage-gated K+ channel subunit Kv1.1.All these mutations are either transmitted in an autosomal-dominant mode or found as de novo events.This clinical phenotype was reported in a previous publication [45].

SARS1 (IDSPG64):
This patient, a 14-year-old female presenting with a global developmental delay, with late acquisition of independent walking at early childhood, motor clumsiness and delayed speech, was able to produce only a few bisyllables during early childhood.Early clumsiness evolved to overt signs of spastic paraparesis that worsened slowly during middle childhood but subsequently remained stable.WES revealed a heterozygous variant (NM_006513.4:c.969+1_969+3del) that causes the ablation of a canonical splice site in the boundary of exon 7 and intron 7. Functional analysis showed that this change causes the inclusion of 16 intronic bp into the cDNA, which results in a loss-offunction, dominant negative effect in complementation assays in S. cerevisiae and serylation assays.This family presents a novel complex spastic paraplegia phenotype and mode of inheritance (de novo dominant) for a variant in SARS1.This clinical phenotype was reported in a previous publication [46].

Cases with dual diagnoses
POLR3A and CACNA1A (IDLNF56): This patient, a 15-year-old female, presented with moderate intellectual disability, ADHD, behavioural abnormalities (obsessions, mood disorder, emotional lability, visual hallucinations), and absence and myoclonic seizures beginning at thirteen years old.She also had nystagmus, strabismus and instability starting in the first years of life.MRI showed periventricular heterogeneous T2 WM hyperintensities and hypointensity in the globus pallidus, thalamic anterolateral nuclei, dentate nuclei, optic radiations, and pyramidal tracts, with mild atrophy of the cerebellar superior vermis.WES analysis identified two variants in POLR3A in compound heterozygosis, classified as pathogenic (c.2171G>A:p.Cys724Tyr) and likely pathogenic (c.2113C>G:p.Pro705Ala), as well as a heterozygous LoF variant in CACNA1A (c.1637dup:p.Tyr546Ter) that was revealed to be de novo after the segregation study.The patient's clinical picture could be more related to the variant in CACNA1A, but the radiological pattern was more consistent with the POLR3A variant.

WGS cases with SNVs
We obtained a positive WGS diagnosis for 3 additional families with SNVs in the SPG7, SPTBN2 and SPTAN1 genes.In one case (IDSPG21), we found a deep intronic splicing variant in the second allele (c.286+853A>G in SPG7) [47].In a second case (IDSPG125), a single pathogenic variant in one allele was detected in SPTBN2, a gene in which both autosomal-dominant and autosomal-recessive modes of inheritance with the same phenotype of CA had been described (spinocerebellar ataxia, autosomal recessive 14, SCAR14, OMIM #615386 and spinocerebellar ataxia 5, SCA5, OMIM #600224).The absence of findings in WGS ascertained a dominant inheritance mode in this family.Finally, in a third patient (IDSPG134) with pure spastic paraplegia harbouring a single de novo variant in SPTAN1 (of described dominant inheritance for a developmental phenotype or recessive for complex spastic paraplegia), WGS ruled out a second variant in trans, thus uncovering a novel phenotype linked to a dominant inheritance mode.
Benchmarking for HPO number optimisation on a real-world cohort.
We performed a benchmark using the 82 causal genes from the real-world cohort patients and their associated HPOs to determine the number of HPOs required for optimal prioritization of the genes causing the patient's phenotype.For each patient (x), we ran ClinPrior with an increasing number of HPOs (2, 3, 4, until we reached the maximum number of HPOs available from each patient, number of HPOs = z).For each z (2 or 3 or 4 HPOs…), we ran ClinPrior 100 times for each patient (x) with HPOs randomly selected from the patient's total HPOs.The predictive performance of ClinPrior was evaluated by plotting the AUROC to assess the optimal number of HPOs.Fig. S1 shows that the AUROCs improve as number of HPOs (z) increases.Fig. S1: Gene rank yield by HPOs number.Gene prioritization performance by area under the receiver operating characteristic curve (AUROC) in identifying the optimal number of HPOs.Each ROC curve represents the prioritization performance using a different set of HPOs from each patient, ranging from 2 to 30 randomly selected HPOs.The curves were calculated by integrating 100 runs per patient (82 patients) and considering only the top 1,000 positions (the top ~5% of genes interrogated) with progressive weighting in the prioritization rank.A) All HPOs.B) high-frequency (>1%) HPOs.C) low-frequency (<1%) HPOs.
We assessed whether the optimal number of HPOs varied when the HPOs were more or less specific.Therefore, we divided the HPOs of each patient into two groups according to whether the normalised frequency of each HPO term within the 439,379 gene-HPO associations was less than or greater than 1%.We re-ran ClinPrior 100 times for each patient (x) and increasing number of HPOs (z), but now selecting either common (>1%) or rare (<1%) HPOs.We evaluated the performance by looking at the ROC and once again, we observed that as the number of HPOs increased, the AUROCs improved.Of note, when the HPOs are more specific (less frequent), results improve (AUROC = 0.844 with 10 HPOs).
Further, in a situation with both specific and unspecific HPOs, we observe that with a number of HPOs around 20, the yield is close to the best results of scenario with specific HPOs (AUROC=0.790with 24 HPOs).
Similar performance was obtained when evaluating the frequency of gene prioritization ranks (<1%, <5%, <10%, <20% and >20% of 23,511 total genes) in the same three scenarios: i) no HPO frequency constraint (mix of specific and non-specific terms), ii) HPO frequency above 1% and iii) HPO frequency below 1% (Fig. S2).Optimal prioritization performance was achieved with about 7 to 10 HPOs in a rare scenario.In conclusion, we recommend using as many HPOs as possible, but using specific (below 1%) HPOs is also more informative.
Interestingly, using common HPOs (more frequent than 1%) does not compromise accurate prioritization, and using a mix of at least 15 specific and non-specific HPOs, ClinPrior can still perform accurate prioritization with similar diagnostic yield as using 7-10 specific HPOs.Fig. S2: ClinPrior performance by HPO number.Prioritization yield in ranking 82 causal genes to top positions: <1%, <5%, <10%, <20% and >20% (out of 23,511 total genes).The three stacked bar plots show the frequency of gene ranks as the number of HPOs increases.In plot (A) there is no constraint on HPO frequency, and in plots (B) and (C) there is a constraint with HPO frequencies >1% and <1%, respectively.
The difference in the maximum number of HPOs displayed is because in a patient with an average of 30 HPOs (A), approximately 15 will have a frequency >1% (B) and the remaining 15 will have a frequency <1% (C).Table S5.ACMG criteria for identified variant classification.Table S6.List of identified genes, OMIM nomenclature and numbers of cases identified in our cohort.The SARS1 c.969+1_969+3del variant causes the ablation of a canonical splice site in the boundary of exon 7 and intron 7. Functional analysis showed that this change causes the inclusion of 16 intronic bp into the cDNA, which results in a loss-offunction, dominant negative effect in complementation assays in S. cerevisiae and serylation assays [46].A targeted metabolites analysis towards key pathways catalyzed by this enzyme was performed detecting a significant decrease in glycine/serine ratio and an increase in 5-methyltetrahydrofolate/folate ratio compared to controls.Moreover, a knockdown approach for disease modeling in Drosophila melanogaster was performed as described in García-Cazorla et al [   Table S8.Molecular function GO terms enriched in the HSP network.Top 50 molecular function (MF) GO terms enriched in the HSP network integrating proteins by using a hypergeometric distribution function statistical analysis.GOMFID, Gene Ontology term identification number; P value, probability value for each GO term tested; OddsRatio, the strength of the association; ExpCount, the expected number of selected genes annotated at the GO term; Count, the number of genes present in the HSP network that are annotated at the GO term, Size, the number of total proteins that are annotated at the GO term; Term, the GO term name.Table S9.Biological process GO terms enriched in the HSP network.Top 50 biological process (BP) GO terms enriched in the HSP network integrating proteins by using a hypergeometric distribution function statistical analysis.GOBPID, Gene Ontology term identification number; P value, probability value for each GO term tested; OddsRatio, the strength of the association; ExpCount, the expected number of selected genes annotated at the GO term; Count, the number of genes present in the HSP network that are annotated at the GO term, Size, the number of total proteins that are annotated at the GO term; Term, the GO term name.Table S10.Cellular compartment GO terms enriched in the HSP network.Top 50 cellular compartment (CC) GO terms enriched in the HSP network integrating proteins by using a hypergeometric distribution function statistical analysis.GOCCID, Gene Ontology term identification number; P value, probability value for each GO term tested; OddsRatio, the strength of the association; ExpCount, the expected number of selected genes annotated at the GO term; Count, the number of genes present in the HSP network that are annotated at the GO term, Size, the number of total proteins that are annotated at the GO term; Term, the GO term name.Table S11.Candidate genes in the HSP expanded network.Seventy-five candidate genes predicted by the prioritization tool to be associated with HSP disease.Gene constraints: lof_z, the Z-score value of being LoF intolerant; mis_z, the Z-score value of being intolerant to missense variation; pLI, the probability of being LoF intolerant; pRec, the probability of being intolerant to homozygous variation.

Table S2 .
Clinical table.Clinical manifestations, main complementary exams and genes identified in diagnosed cases.Cases in which, after retrospective review, we considered that the clinical diagnosis was possible.
PCR was carried out to measure the relative copy number of the human SPAST gene (exon 5-exon 5, exon 7-exon 7, and intron 7-intron 7) compared to the human MEMO1 (exon 2-exon 2) or NLRC4 (exon 4-exon 4) genes.IDSPG132 and father exhibit only one copy of the SPAST gene compared to the mother and 8 healthy individuals.

Table S4 .
CNV cases.Validation of CNV variants from WGS.
[45]alidate ACER3 variants, a targeted lipidomics analysis towards sphingolipids was performed demonstrating a similar lipid profile as the already published for Edvardson et al, Journal of Medical Genetics 2016[30].Moreover, qRT-PCR analysis showed decreased mRNA expression of ACER3 in patients' fibroblasts.Patch-clamp studies of variant p.Val368Leu revealed that mutant protein alone failed to produce functional channels in homozygous state, while coexpression with wild-type produced no effects on K+ currents, similar to wild-type protein alone.These findings are concordant with the reported mode of inheritance in this family and the severity of the patient's phenotype[45].To confirm the association between the c.1423-12 C>G LAMA1 variant and the partial inclusion of exon 11, we performed a minigene splicing assay.Semi-quantitative fluorescent RT-PCR revealed a normal/abnormal-splicing ratio of 0.8 for the wild-type allele and 0.2 for the mutated allele.This intronic mutation resulted into generation of various isoforms in the minigene splicing assay, showing that ~20% residual wild-type isoform is still expressed by the intronic-mutated allele alone.A targeted lipidomics analysis towards phospholipids was performed demonstrating a similar lipid profile as the already published for Vaz et al 48.Moreover, cDNA analysis showed two different transcripts for Case1: the first carried the missense variant, whereas the second used an alternative donor site in intron 11, resulting in the inclusion of 102 bp into the coding sequence, leading to the insertion of 34 amino as described in Vélez-Santamaría et al[53].
[47]Sanger sequencing of fibroblast cDNA revealed that c.286+853A>G results into creation of a transcript which includes a 75 bp pseudoexon located in SPG7 's intron 2. This in-frame pseudoexon includes at least two codon stops.Western Blot showed reduced levels of SPG7, confirming a loss of function effect[47].Western blot and Immunofluorescence for detyrosinated tyrosin, after paclitaxel treatment to increase deTyr-tubulin pools revealed a strong decrease of deTyr-tubulin levels in patients compared to controls.We have performed cotransfection studies of VASH1 and SVBP (WT/MUT) in HeLa cells showing that p.Leu49Pro variant reduces VASH1 and SVBP soluble quantities.

Table S7 .
Cases with functionally validated variants