Human phenotype ontology annotation and cluster analysis to unravel genetic defects in 707 cases with unexplained bleeding and platelet disorders

Background Heritable bleeding and platelet disorders (BPD) are heterogeneous and frequently have an unknown genetic basis. The BRIDGE-BPD study aims to discover new causal genes for BPD by high throughput sequencing using cluster analyses based on improved and standardised deep, multi-system phenotyping of cases. Methods We report a new approach in which the clinical and laboratory characteristics of BPD cases are annotated with adapted Human Phenotype Ontology (HPO) terms. Cluster analyses are then used to characterise groups of cases with similar HPO terms and variants in the same genes. Results We show that 60% of index cases with heritable BPD enrolled at 10 European or US centres were annotated with HPO terms indicating abnormalities in organ systems other than blood or blood-forming tissues, particularly the nervous system. Cases within pedigrees clustered closely together on the bases of their HPO-coded phenotypes, as did cases sharing several clinically suspected syndromic disorders. Cases subsequently found to harbour variants in ACTN1 also clustered closely, even though diagnosis of this recently described disorder was not possible using only the clinical and laboratory data available to the enrolling clinician. Conclusions These findings validate our novel HPO-based phenotype clustering methodology for known BPD, thus providing a new discovery tool for BPD of unknown genetic basis. This approach will also be relevant for other rare diseases with significant genetic heterogeneity. Electronic supplementary material The online version of this article (doi:10.1186/s13073-015-0151-5) contains supplementary material, which is available to authorized users.


Background
Bleeding and platelet disorders (BPD) are a heterogeneous group of rare diseases caused by abnormalities of coagulation factors, platelets or blood vessel walls. Although in most cases the major clinical feature is abnormal bleeding, BPD are frequently associated with phenotypes in other organ systems, particularly the immune system (for example, Wiskott-Aldrich syndrome (WAS, ORPHA906)), skeleton (for example, Thrombocytopenia with Absent Radius syndrome (ORPHA3320)), eye (for example, Hermansky-Pudlak Syndrome (HPS, ORPHA79430)) and kidney (for example, MYH9-related disorder (MYH9-RD, ORPHA182050)). BPD may be inherited as autosomal recessive, autosomal dominant, X-linked, or complex traits. Collectively, BPD represent a significant diagnostic and management challenge to health care systems [1].
Diagnosis of BPD currently requires clinical evaluation, then tests such as coagulation factor activity and platelet function assays [2,3]. However, for mild BPD, this enables diagnosis to the level of a defective coagulation factor or a platelet pathway in only 40% to 60% of cases [4,5]. The proportion of cases with BPD that receive genetic diagnosis is even lower [6], partly because the genetic basis of most BPD remains unknown, particularly for prevalent sub-groups such as platelet secretion defects [5]. Even for BPD with a known genetic basis, diagnosis may not be possible if phenotype tests are insufficiently specific to point towards relevant genes. Moreover, many BPD with causal variants in the same gene are clinically heterogeneous, as illustrated by MYH9-RD in which macrothrombocytopenia is a consistent feature, but neutrophil Döhle-like inclusions, renal impairment, deafness and cataracts are variable [7,8]. High throughput sequencing (HTS) has the potential to circumvent these limitations by providing a diagnostic tool for genetic diagnosis and by enabling gene discovery to increase the repertoire of disorders for which genetic diagnosis is possible.
Analysis of large case collections recruited through consortia of investigators and technologies such as HTS assist discovery of rare BPD genes [9,10]. However, gene discovery additionally requires analysis of shared data through systematic phenotype coding to identify similarities between unrelated cases. Phenotype coding also facilitates creation of clinical registries, genotype-phenotype databases and biobanks. Examples of coding systems include the World Health Organization International Classification of Diseases (ICD) [11], which provides a post-diagnosis classification of disease organized by organ system that is unsuitable for coding rare genetic diseases particularly those that affect multiple organs. The International Health Terminology Standards Development Organisation Systematized Nomenclature of Medicine Clinical Terminology (SNOMED-CT) [12] is an alternative system, but is not optimised for laboratory results and has greater emphasis on specific diseases rather than phenotypes. Platforms such as Online Mendelian Inheritance in Man (OMIM) [13] and Orphanet [14] provide descriptions of known genetic disorders but do not utilise systematic phenotype terms. The Bleeding History Phenotype Ontology [15] is a hierarchical ontology system that may enable phenotypic similarities between some cases with BPD to be resolved. However, this ontology has a limited repertoire of phenotypic terms outside those immediately pertinent to bleeding. Moreover, there are no terms for the results of laboratory tests such as light transmission aggregation or dense granule secretion. These are essential for the accurate phenotypic description of platelet function disorders, which are the largest single group of BPD of unknown genetic basis.
The Human Phenotype Ontology (HPO) project is an international initiative to support the phenotypic annotation of genetic disorders available under an open-source licence [16]. The HPO version 887 contains a set of 10,371 terms that describe abnormalities of human phenotype and 13,556 relations between the HPO terms organised hierarchically through is-a relations [17]. For example, 'thrombocytopenia is-a abnormal platelet count is-a abnormality of thrombocytes'. This enables phenotypes to be described using a standardised and controlled vocabulary with greater detail and flexibility than other coding systems. The HPO has previously been applied to diseases using phenotypic terms from Orphanet and OMIM [18] but it has not previously been used to describe and compare phenotype data from individual cases. Systematic comparison of phenotype data from individual cases assists gene discovery because clusters of cases with similar phenotypes are likely to share defects in the same gene or interacting set of genes [19].
The BRIDGE-BPD study [20] is a multicentre, observational consortium study that aims to identify causal gene defects in cases with BPD of unknown aetiology by deep multi-system phenotyping and HTS. We report the results of Stage 1 of the BRIDGE-BPD study in which we develop new HPO terminology to facilitate annotation of BPD phenotypes in 707 BPD cases recruited at 10 European and US centres. We show for the first time how HPO annotation can be used to describe the phenotypes of individual cases within a large rare disease collection and how a novel statistical clustering approach using HPO data guides gene discovery.

Study overview and enrolment criteria
The target population for the BRIDGE-BPD study comprises children and adults with disorders of platelet number, volume, morphology or function or with pathological bleeding that cannot be explained by standard laboratory tests. To ensure enrolment of cases with a high likelihood of a genetic BPD, the inclusion criteria require features such as a BPD from an early age, a BPD that is part of a syndromic disorder or a family history of a BPD. Cases with an acquired BPD (Table 1) or a known genetic BPD are excluded unless they display bleeding or laboratory features that cannot be explained by this diagnosis alone. For example, a case with haemophilia A and unexplained thrombocytopenia is considered eligible on the basis of thrombocytopenia. Most cases had already been investigated within research and specialist clinical diagnostic facilities, maximising exclusion of acquired or known genetic BPD. The BRIDGE-BPD study is approved by a UK Research Ethics Committee (Cambridgeshire 1 Research Ethics Committee 10/H0304/66) and appropriate national ethics authorities for non-UK enrolment centres (Additional file 1). All study procedures were performed after the participants provided informed written consent and were in accordance with the Declaration of Helsinki. Sequence data for cases who provided consent for public access have been deposited at the European Genome-Phenome Archive (EGAS00001001172).

Recruitment of cases and collection of phenotype data
Index cases and pedigree members were identified by screening clinic lists, case notes and local registries. Potential participants were checked for eligibility and were invited to give informed written consent. Demographic and clinical data, laboratory test results and pedigree relationships were recorded on a case report form.

BRIDGE-BPD database
Pseudonymised demographic and clinical data were transferred to electronic data capture pages usually as quantitative or categorical variables but with a minority as free text to maintain detail. Bleeding symptoms were recorded as numerical severity scores for the 12 major symptoms within the MCMDM-1 VWD Bleeding Assessment Tool [21], or as the terms 'yes' or 'no' for each symptom (Additional file 2). Laboratory test results were recorded as quantitative variables, except for tests such as platelet light transmission aggregation (LTA) and ATP secretion, which were recorded as 'normal' or 'abnormal' according to the interpretation of the enrolling clinician. Platelet morphology determined by light or electron microscopy was recorded as free text.

Development of HPO terminology for BPD
In order to ensure accurate annotation of the bleeding and platelet phenotype, 80 terms and associated is-a relationships were added to HPO [17], in parallel with development of hpoPlot, a new free software tool that summarises HPO codes of a set of cases, released on the Comprehensive R Archive Network [22]. The HPO modifications were predominantly within the abnormality of blood and bloodforming tissue leading class and its constituent classes 'abnormality of thrombocytes', 'abnormal bleeding', 'abnormality of coagulation' and 'abnormal thrombosis'. Some terms required creation of a new 'abnormal platelet morphology' class (Additional file 3). Many terms within 'abnormality of blood and blood-forming tissue' overlap with other system-specific HPO leading classes, particularly 'abnormality of the immune system' (Additional file 4).

Automated and suggested HPO terms
To improve the reliability and efficiency of HPO coding, cases were annotated with relevant terms automatically for abnormal bleeding symptoms or abnormal categorical laboratory test results and for some numerical laboratory test results, if outside a gender-specific reference interval. For the remaining laboratory test results, suggested HPO terms were presented for manual confirmation. This was necessary to prevent inappropriate coding if a laboratory test result was abnormal because of a co-morbidity or if the reference interval was not applicable because the case was a child or pregnant. HPO terms for other clinical or laboratory abnormalities could be selected through expansion of parent HPO terms or by searching for a specific HPO term. Platelet count less than 100 × 10 9 /L or greater than 400 × 10 9 /L, or Acquired bleeding or platelet disorders, including any of the following: Mean platelet volume less than 6 fL or greater than 12 fL, or Use of any medication known to affect platelet function or cause bleeding

Measuring similarity and clustering strength
We define the relative information content (IC) of HPO terms on the basis of their rarity within the BPD case collection. Measures of phenotypic similarity between a pair of individuals, represented as two sets of HPO terms, are then determined by the overall rareness of the pair's shared terms [23]. The IC of term t is given by where p t is the frequency of the term in the BPD collection. The similarity between two terms, s and t, is defined as where anc(x) denotes the ancestor terms of x, that is, the terms {y|x 'isa' y}. The similarity between two cases represented as two sets of terms, D a and D b , is given by This definition ensures symmetry of the similarity measure. A scale-independent measure of phenotypic dissimilarity was also computed between two cases a and b with respect to the rest of the collection: where D a is the set of HPO terms associated with case a.
The rank distance among a set Z of more than two cases is given by the mean over all pairwise distances: To evaluate the phenotypic similarity of a group of cases, Z, with respect to a containing collection, a one-tail Monte Carlo P value for testing whether the distance between cases in Z was less than what would be expected by chance in a similar size group was computed as: where W is a set of 250,000 subsets of |Z| index cases drawn at random from the entire collection. Although these P values are marginally uniform under the null, there may be a small correlation between them because the same data are reused in each set of permutations. However, if |Z| is small relative to the total sample size, the correlation should be negligible.

HPO-based clustering of cases
Unsupervised clustering of unrelated BPD cases was performed by applying the Partitioning Around Medoids (PAM) algorithm [24], to a square distance matrix M in which the ath row and bth column was set to -sim(D a , D b ). Affected relatives were excluded to ensure that the ensuing clusters did not depend on enrolment patterns of the affected relatives of particular index cases. The discriminatory power of HPO nodes shared by cases within each cluster was determined by performing a Fisher exact test on the contingency table containing the number of cases inside/outside the cluster vs the number of cases with/ without the HPO node. Each cluster was then summarised by the HPO node having the smallest P value overall and up to two nodes from other distinct lineages in the HPO graph having a P value smaller than 10 -3 . When there were multiple nodes within a lineage fulfilling these criteria, only the most significant node was retained.

Whole exome sequencing and variant reporting
Genomic DNA was isolated from venous blood or saliva obtained from the cases at enrolment or from archived samples. DNA library capture was performed using ROCHE NimbleGen SeqCap EZ 64 Mb Human Exome Library version 3.0 (ROCHE NimbleGen, Inc., Madison, WI, USA). The libraries were sequenced on an Illumina Hiseq 2000 instrument (Additional file 5 [25][26][27]). In order to filter for technical artefacts and variants unlikely to be pathogenic, variants were excluded from further analysis if they fulfilled any of the following criteria: (1) variant allele frequency >0.1% in any of the reference cohorts (Additional file 6); (2) variant not predicted to alter protein by snpEff 3.4 [28]; (3) variant not present in other affected pedigree members recruited to BRIDGE-BPD; (4) <3 reads supporting the alternate allele; (5) allele count >1 among 20 in-house whole-genome sequenced controls; (6) overall allele count >20 including exomes from other BRIDGE projects.

Assigning causality to rare variants in known BPD genes
In order to assist analysis of variants identified in the BPD cases, we utilised the ThromboGenomics platform version 1.0 list of 49 genes that have already been linked to human platelet or coagulation disorders (Additional file 7) [29,30]. Rare variants in ThromboGenomics genes were classified using the following definitions that are consistent with current EuroGentest guidelines for assigning pathogenicity to variants [31]: (1)

Characteristics of the study collection
In Stage 1 of the BRIDGE-BPD study, 648 index cases and 59 affected pedigree members (414 female, 293 male) were recruited at 10 enrolment centres ( Figure 1). WES has been completed for 519 cases (Additional file 8).
Platelet count (PLT) was recorded for 639 of the 648 index cases, of which 140 (21.9%) had PLT <100 × 10 9 /L and 53 (8.3%) had PLT >400 × 10 9 /L. Mean platelet volume (MPV) was recorded for 460 index cases of which one (0.2%) had MPV <6 fL and 91 (19.8%) had MPV >12 fL. Since some automated cytometers do not enumerate the MPV in blood samples with large platelets, enrolling clinicians were able to record whether large platelets were visible by light or electron microscopy. This identified a further 56 index cases with large platelets. Bleeding occurred across the range of platelet counts and volumes (Figure 2A,B). A total of 330 (50.9%) index cases had abnormal platelet morphology, defined as abnormal structure or size determined by light or electron microscopy (Additional file 9).
After excluding cases with PLT <100 × 10 9 /L or an unmeasured PLT in which reliability of LTA results cannot be guaranteed, there were 428 (66.0%) index cases with an LTA test result recorded for at least one agonist. A total of 196/428 (45.8%) index cases had two or more and 80/428 (18.7%) had three or more agonist responses that were classified as abnormal by the enrolling clinician ( Figure 2C).

Clinical phenotype and HPO coding
The presence or absence of up to 10 bleeding symptoms for males and 12 bleeding symptoms for females was recorded for 528 index cases. Among the 197 male cases, 45 experienced four or more different bleeding symptoms and 12 had six or more symptoms. Among the 331 female cases, 58 experienced six or more different bleeding symptoms and nine had eight or more symptoms. The most common bleeding symptoms were cutaneous bleeding (343 index cases), bleeding from minor wounds (231 index cases) and epistaxis (215 index cases; Figure 3).
The median number of HPO terms annotated to each case was 7.5 (range 1 to 23; Figure 4A). All index cases were annotated with one or more HPO term from the 'abnormality of blood and blood-forming tissue' leading class. Of the 648 index cases, 387 (59.7%) were also annotated with one or more HPO term outside of 'abnormality of blood and blood-forming tissue' after excluding terms that are 'descendants' of both the 'abnormality of blood and blood-forming tissue' class and other leading classes (Additional file 4 and Figure 4B). The most common other HPO terms described abnormalities of the nervous system (149 index cases), immune system (107 index cases) and skeletal system (102 index cases). When index cases were grouped by abnormality in each other organ system, the frequencies of terms relevant to thrombocytopenia, thrombocytosis, bleeding, platelet morphology and platelet aggregation reflected the overall proportions of these terms in the collection for most organ systems. However, in cases with terms relevant to the nervous system and growth these terms showed significantly different distributions (P <0.05 after Bonferroni correction, which allows for dependence between tests in addition to multiplicity; Figure 4B).

Clustering of cases using HPO
We performed unsupervised clustering of the HPO-encoded phenotype data in order to obtain an undirected characterisation of different subgroups within the heterogeneous BPD collection and assess whether particular sets of HPO terms tended to co-occur among cases in these groups. The 648 unrelated index cases were partitioned into 30 clusters ranging in size from five to 36 cases. Clinically recognisable subgroups included 'Impaired epinephrineinduced platelet aggregation, Impaired ADP-induced platelet aggregation' (Additional file 10, cluster 18 and Figure 5) which is a frequently reported pattern of abnormality in  light transmission aggregation results identified previously as a Gi-pathway defect [33]. A further prominent cluster was ' Autism spectrum disorder, Thrombocytosis, Decreased mean platelet volume' (Additional file 10, cluster 29 and Figure 5) illustrating an increasingly recognised association between platelet abnormalities and neurological disorders [34].
A new approach to gene discovery using HPO-encoded phenotype data In order to resolve causal gene variants from the numerous candidates in a rare disease study in which genetic heterogeneity is expected, it is typically assumed that cases with causal variants in a gene or set of related genes have similar phenotypes. We therefore hypothesised that cases with rare coding variants in a disease gene would tend to cluster strongly on the basis of their HPO-encoded phenotypes. To validate this approach, phenotype closeness was first computed within the pedigree groups without provisional syndromic diagnoses recruited to the BRIDGE-BPD study, who are expected to share a causal gene variant. The    significantly close clustering (P <0.05 for each test; Figure 6). A meta-analysis using Fisher's method to assess whether these groups clustered closely together as a whole yielded a P value that is less than the numerical resolution of the analysis software. These data demonstrate that despite occasional anomalies, BPD cases in which a common genetic basis is likely tend to cluster on the basis of their HPO terms.
Cases with variants in ACTN1 cluster strongly on the basis of their HPO terms We applied our approach under an autosomal dominant model and discovered that the gene for which cases had the strongest phenotype similarity was ACTN1 (P = 3.4 × 10 -8 ).
Causal variants in ACTN1 have been identified in 18 recently reported pedigrees with a phenotype that comprises mild bleeding and macrothrombocytopenia with no other syndromic features [35][36][37]. Consistent with these previous descriptions, amongst the 21 index cases and three pedigree members with ACTN1 variants in the BRIDGE-BPD analysis, 23/24 (96%) displayed thrombocytopenia and 22/24 (92%) large platelets (Table 2). HPO-based clustering of this group occurred irrespective of whether the ACTN1 variants were classified as PV, LPV or VUS. Cluster 1 (Additional file 10) was significantly enriched for variants in ACTN1 (Fisher's exact test P value 1.79 × 10 -5 ).

Pertinent and non-pertinent rare variants in MYH9
We also assessed the HTS and data analysis pipelines with the autosomal dominant disorder MYH9-RD. We selected this disorder because previous descriptions of MYH9-RD suggested that cases are usually readily identified clinically and were unlikely to have been recruited. Despite this, we identified nine different MYH9 missense variants in 13 unrelated BPD cases and one pedigree case. These included seven index cases with rare variants classified as PV or LPV and six index cases with VUS. In order to investigate whether HPO annotation and cluster analysis would have enabled detection of these cases as a distinct phenotype group we explored the HPO terms that had been assigned to these cases. The seven index cases with PV or LPV in MYH9 included six cases with terms indicating macrothrombocytes, six with terms indicating thrombocytopenia and one with neutrophil Döhlebodies but none with other recognised features such as renal impairment, sensorineural deafness or cataract ( Table 3). The seven index cases with PV or LPV clustered closely together based on the HPO terms (P = 0.005) while the six index cases with VUS did not cluster (P = 0.684). However, the 13 index cases as a whole did not cluster strongly based on their HPO phenotypes (P = 0.363) due to the dilution of PV with VUS. This observation highlights the need for improved algorithms for selecting candidate variants based on predictions of pathogenicity but also supports our stringent approach for assigning causality to variants in known genes.

Variant identification in the ThromboGenomics gene list
The BRIDGE-BPD HTS and data analysis pipelines were tested by evaluating coding variants in the ThromboGenomics list of known genes linked to autosomal recessive or X-linked recessive BPD. Since cases with BPD of known genetic aetiology were excluded from enrolment, we predicted that causal variants in the ThromboGenomics genes would be uncommon. Consistent with this, we identified PV in only two ThromboGenomics genes that completely explained the phenotype of one case with HPS3 and two cases with WAS. Further cases with PV in F9 and in F8 displayed reduced FIX and FVIII activity, respectively, that were explained by the observed PV. However, these cases also displayed abnormal platelet function indicating that the PV only partially explained the phenotype. Three cases had variants classified as LPV in ThromboGenomics genes, which cannot be considered causal for the BPD phenotype without further confirmatory investigations (Table 4). A further 10 index cases had variants classified as VUS in ThromboGenomics genes because there was no plausible association between the gene and phenotype (Additional file 11). This group included cases with variants in coagulation factor genes who had normal levels of coagulation factors.

Discussion
Heritable BPD are individually rare but collectively common diseases that have heterogeneous clinical characteristics. This clinical complexity, genetic heterogeneity and the large number of candidate genes for BPD has previously hampered gene discovery. In this exploratory Stage 1 of the BRIDGE-BPD study, we first enhanced HPO terminology to enable standardized annotation of the phenotypes of BPD cases. After enrolling the largest collection of BPD cases reported to date, we demonstrated that HPO annotation enabled the characterisation of clusters of BPD cases with similar phenotypes, which we hypothesise have causal genetic variants in the same or related genes. The use of HPO to facilitate research within neurogenetics by comparing standard database descriptions of diseases has been described previously [18]. However, this is the first report of the application of HPO to characterise individual case phenotypes and to aid gene discovery through statistical cluster analysis. The complex clinical characteristics of the 648 index cases in the Stage 1 BRIDGE-BPD study reflect our desire to enrol a comprehensive collection of cases with different subgroups of disorders within BPD, including bleeding of unknown origin. Previously reported collections have  comprised BPD cases without prior investigation [4,5] and collections linked by similar phenotypes such as abnormal platelet number [38], platelet function [33] or von Willebrand disease [21,39]. Thus, the BRIDGE-BPD collection is unique in its size and diversity. In common with previous collections, the BRIDGE-BPD cases were predominantly females, who experienced more bleeding symptoms than males because of obstetric and heavy menstrual bleeding. There was marked heterogeneity in PLT, MPV, platelet morphology, and particularly in LTA platelet function test results, highlighting the diversity of heritable platelet function disorders [33]. A striking finding from the Stage 1 BRIDGE-BPD study was that a median of 7.5 HPO terms were required to annotate the phenotype of each case. These comprised at least one term from 'abnormality of blood and bloodforming tissue', reflecting the study inclusion criteria. However, 60% of cases also had HPO terms reflecting abnormality in other organ systems, even after removing overlapping terms. We also observed a statistically significant difference in the frequency of five haematological HPO terms in cases with a nervous system abnormality compared to other cases (P = 0.001) reflecting more frequent platelet function and morphology defects. This is consistent with the existing literature indicating that neurological and platelet function disorders often coincide [34]. The statistically significant difference in the frequency of the five haematological HPO terms between cases with abnormal growth and those without (P = 8.12 × 10 -6 ) arose because of enrolment of a group of cases with growth defects at a single centre. One potential criticism of Stage 1 of the BRIDGE-BPD study is that cases were recruited from some centres with highly specialist interests, which may reduce the wider applicability of the study findings. This effect is illustrated by our finding of phenotypic associations between haematological terms and both growth and neurological terms arising from the specialist practice at the University of Leuven. However, this effect is partially offset by recruitment from the UK enrolment centres which are more typical tertiary referral centres, and will not occur in subsequent stages of the BRIDGE-BPD project which will recruit more widely.
A further unique attribute of the BRIDGE-BPD methodology is that it enables calculation of the closeness between groups of HPO terms annotated to different cases. This provides a measure of phenotypic similarity between cases and enables the definition of case subgroups to assist the analysis of genotype data. We have provided proof of principle of this approach by demonstrating low P values for phenotypic similarity between BPD cases from the same pedigrees and between unrelated cases with provisional clinical diagnoses of syndromic BPD, in which both groups are likely to share a causal genetic variant. This included examples with known candidate genes (HPS and WAS) and with unknown genes (Gorham-Stout and Roifman syndromes, pseudohypoparathyroidism type Ib).
We also evaluated phenotype similarity in groups of cases sharing a rare coding variant in the same gene. The most significantly similar group corresponded to cases harbouring variants in ACTN1. Although this particular association could be found through traditional regression-based approaches against platelet count, the HPO approach allows associations between genotypes and combinations of any kind of HPO-encoded phenotypes to be discerned. Here, an association was found between the presence of a variant and two terms jointly -the 'Thrombocytopenia' and 'Increased Mean Platelet Volume' terms -while regression analysis is typically performed trait by trait in a univariate fashion. Certainly, standard regression methods cannot be used to model heterogeneous combinations of data types (binary, quantitative, categorical), nor can they account for the hierarchical nature of ontologically encoded outcome data. Furthermore, patient HPO data can be linked to data in online human databases such as OMIM, and to model system databases such as the Mouse Phenotype Ontology. In future, this could be used to prioritise genes based on the similarity of groups of cases to ontological phenotype terms derived from the literature.
We assessed index cases harbouring variants in MHY9 who were not considered likely to have MYH9-RD by the enrolling clinicians based on clinical evaluation. In this group, the cases with PV or LPV in MYH9 displayed a high degree of similarity whereas individuals with VUS in MYH9 were not similar. Furthermore, the index cases with PV or LPV in MYH9 were phenotypically different from the index cases with VUS. This illustrates that HPO-based clustering can reveal phenotype and genotype links that cannot be made by enrolling clinicians in isolation but only in combination with powerful methods for distinguishing variants by predicted pathogenicity.
We also demonstrated potential limitations of this approach arising from differences in phenotype coding between and within enrolling centres. For example, two pedigrees showed poor clustering by phenotype, with phenotypic similarity P values of 0.42 and 0.33. In one of these pedigrees, the lack of similarity between the two members was because the index case had detailed phenotypic evaluation but the pedigree case consented only to limited evaluation. In the other pedigree, the lack of similarity was due to the marked difference in age at evaluation. This illustrates how the age-dependence of HPO terms such as bleeding symptoms may result in paediatric cases clustering away from adult cases. The potential confounding effect caused by gender-specific HPO terms was minimised by excluding these terms from calculation of similarity scores. Despite these limitations, recruiting from a heterogeneous set of centres is crucial to obtain the sample sizes required to achieve power for gene discovery.
The approach we have presented for computing similarity scores relies on pre-selecting groups of cases based on the presence of rare variants fulfilling certain criteria, such as presumed mode of inheritance. Sometimes, only a subset of the cases in a group will be explained by variants in the same gene, thus diluting the strength of phenotypic similarity of the entire group. For ACTN1, which was used to model similarity clustering analysis, the effect of the dilution was minimal, but in other cases it may play a larger role. The development of new methods to offset this effect is a worthwhile area of future research.

Conclusions
We have demonstrated that HPO annotation of the large and diverse BRIDGE-BPD collection enables the identification of clusters of individuals with phenotypic similarities who are likely to have causal genetic variants in the same or related genes. Our international consortium has enabled comparison of standardised phenotypic descriptions among cases across the world, which improves statistical power by increasing the size of case groups.
We have validated the methodologies developed for the BRIDGE-BPD study, using BPD such as ACTN1-related disorder, which have been newly reported since the start of recruitment to our study collection. However, these approaches are equally applicable to disorders of unknown genetic basis, in which cases are also expected to show similarity of HPO terms. It is noteworthy that the cases in the stage 1 BRIDGE-BPD study with PV or LPV in established BPD genes together account for less than 10% of the study collection. We anticipate that identification of further case subgroups based on similarity of HPO terms will be a powerful source of new gene discoveries in the remainder of the study collection and among cases in the ongoing enrolment programme. This gene discovery approach, pioneered here in bleeding and platelet disorders, is broadly applicable to other rare disease groups.