DrABC: deep learning accurately predicts germline pathogenic mutation status in breast cancer patients based on phenotype data

Background: Identifying breast cancer patients with DNA repair pathway-related germline pathogenic variants (GPVs) is important for effectively employing systemic treatment strategies and risk-reducing interventions. However, current criteria and risk prediction models for prioritizing genetic testing among breast cancer patients do not meet the demands of clinical practice due to insufficient accuracy.
Methods: The study population comprised 3041 breast cancer patients enrolled from seven hospitals between October 2017 and 11 August 2019, who underwent germline genetic testing of 50 cancer predisposition genes (CPGs). Associations among GPVs in different CPGs and endophenotypes were evaluated using a case-control analysis. A phenotype-based GPV risk prediction model named DNA-repair Associated Breast Cancer (DrABC) was developed based on hierarchical neural network architecture and validated in an independent multicenter cohort. The predictive performance of DrABC was compared with currently used models including BRCAPRO, BOADICEA, Myriad, PENN II, and the NCCN criteria.
Results: In total, 332 (11.3%) patients harbored GPVs in CPGs, including 134 (4.6%) in BRCA2, 131 (4.5%) in BRCA1, 33 (1.1%) in PALB2, and 37 (1.3%) in other CPGs. GPVs in CPGs were associated with distinct endophenotypes including the age at diagnosis, cancer history, family cancer history, and pathological characteristics. We developed a DrABC model to predict the risk of GPV carrier status in BRCA1/2 and other important CPGs. In predicting GPVs in BRCA1/2, the performance of DrABC (AUC = 0.79 [95% CI, 0.74–0.85], sensitivity = 82.1%, specificity = 63.1% in the independent validation cohort) was better than that of previous models (AUC range = 0.57–0.70). In predicting GPVs in any CPG, DrABC (AUC = 0.74 [95% CI, 0.69–0.79], sensitivity = 83.8%, specificity = 51.3% in the independent validation cohort) was also superior to previous models in their current versions (AUC range = 0.55–0.65). After training these previous models with the Chinese-specific dataset, DrABC still outperformed all other methods except for BOADICEA, which was the only previous model with the inclusion of pathological features. The DrABC model also showed higher sensitivity and specificity than the NCCN criteria in the multi-center validation cohort (83.8% and 51.3% vs. 78.8% and 31.2%, respectively, in predicting GPVs in any CPG). The DrABC model implementation is available online at http://gifts.bio-data.cn/.
Conclusions: By considering the distinct endophenotypes associated with different CPGs in breast cancer patients, a phenotype-driven prediction model based on hierarchical neural network architecture was created for identification of hereditary breast cancer. The model achieved superior performance in identifying GPV carriers among Chinese breast cancer patients.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13073-022-01027-9.


Background
Breast cancer is the most common cancer in women around the world [1]. Approximately 10% of patients with breast cancer carry germline pathogenic variants (GPVs) in cancer predisposition genes (CPGs) implicated in the DNA repair pathway [2,3]. Distinguishing breast cancer patients with GPVs is essential for employing systemic treatment strategies and risk-reducing interventions [4,5]. However, fewer than 10% of these carriers are referred for genetic testing in current clinical practice because of the cost and time involved [6,7].
The probability of carrying GPVs among breast cancer patients has long been evaluated in terms of family cancer history and clinical characteristics, such as the age at diagnosis and tumor pathological information [8,9]. Among the most commonly used criteria are those of the National Comprehensive Cancer Network (NCCN) [7,10-12]. However, adhering to the current NCCN criteria would overlook nearly half of breast cancer patients with a clinically actionable GPV [7,11-13]. Nonetheless, routine genetic testing of all or most breast cancer patients would require vastly increased genetic counseling and management, which might not be easily achieved with presently available resources [14]. Furthermore, extending population-based genetic testing to populations with few or no founder mutations might pose a considerable financial burden, ethical concerns, and other barriers [15,16]. Therefore, an accurate prediction model for GPVs in clinically actionable genes is urgently needed. Recently, deep learning algorithms were demonstrated to improve clinical practice in genomic diagnostics due to their high accuracy and ability to extract information from big data [17]. Recent studies have demonstrated that deep learning is a feasible and potentially useful tool for predicting germline BRCA1/2 status in cancer patients using demographic and clinical characteristics, medical images, or pathology images [18-20]. However, it is not known whether deep learning algorithms can improve the selection of breast cancer patients to undergo genetic testing.
Here, we evaluated the family history of multiple cancer types and detailed phenotypes in a multi-center cohort of 3041 female Chinese breast cancer patients who underwent multigene genetic testing. Based on the distinct endophenotypes of breast cancer patients with GPVs in genes involved in homologous recombination and other DNA repair pathways, we designed a deep learning-driven model named DrABC (DNA-repair Associated Breast Cancer) to improve the accuracy in identifying carriers of GPVs in CPGs among breast cancer patients.

Study participants and design
In this multi-center cohort study, we consecutively recruited unselected female patients with breast cancer from October 1, 2017, to August 31, 2019, at the Cancer Hospital of Chinese Academy of Medical Sciences and Peking Union Medical College (CHCAMS, i.e., the discovery cohort) and six other hospitals (i.e., the validation cohort), including (1) Huanxing Cancer Hospital, (2) Guiyang Maternal and Child Healthcare Hospital in participating hospital. Written informed consent was obtained from each participant. This article follows the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines [21].
As a result, 3041 women with breast cancer were enrolled, while 113 patients without available samples were excluded. The germline genetic test and analysis of 50 CPGs and detailed phenotypic evaluation were conducted in the remaining 2928 patients.

Phenotype data
We collected phenotypic data including the age at diagnosis, family cancer history, personal cancer history, pathological features, molecular subtype, and clinical stage (Additional file 1: Supplementary method). Molecular subtyping was performed based on hormone receptor (HR, including estrogen receptor [ER] and progesterone receptor [PR]) and HER2 status [22]. Staging was determined according to the 8th edition of the breast cancer staging classification from the American Joint Committee on Cancer [23].

GPV analysis
Genomic DNA was extracted from peripheral blood or saliva. GPVs in patients from each center were analyzed by their local diagnostic laboratory, which generated a clinical genetic test report for each participant. Each laboratory provided results by the enrichment of the coding regions and consensus splice sites of 50 CPGs in the DNA repair pathway using a targeted panel followed by sequencing (Additional file 1: Supplementary method) [24,25]. Only novel variants or variants with < 0.1% population frequency in the 1000 Genomes (October 2013) and the Genome Aggregation Database (gnomAD, http://gnomad.broadinstitute.org/) were collected in this study. The clinical significance of each GPV was evaluated based on a 5-tier classification system of pathogenic/likely pathogenic (P/LP), benign/likely benign (B/LB), and variants of uncertain significance (VUS) according to guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology and an in-house pipeline [25-28]. The variants in BRCA1/2 were further analyzed according to the ENIGMA expert panel review [29,30]. For those variants without available expert panel results, the consensus classifications in ClinVar were used. Variants classified as P/LP were considered pathogenic in this study (Additional file 1: Supplementary method).

DrABC model development
The DrABC risk prediction model was designed based on a hierarchical neural network that starts with an input layer of 25 neurons, corresponding to features of carriers of GPVs in CPGs, followed by two hidden layers. Dropout is applied to the hidden layers, with each neuron having a 25% chance of being disabled during training, which helps prevent the model from overfitting. In addition, a non-linear activation function, the Scaled Exponential Linear Unit (SELU) [31], is applied to the output of the hidden layers, which helps keep the representation distributions close to Gaussian. Finally, the output layer consists of two neurons with a sigmoid activation function, such that it produces two valid probabilities (i.e., in the range [0, 1]): $P_1$ and $P_2$. Using $P_1$ and $P_2$, the final prediction is calculated using the following equations:

$$P_a = P_1, \qquad P_b = P_1 \times P_2, \qquad P_c = P_1 \times (1 - P_2),$$

where $P_a$ is the probability of having a mutation in any CPG, $P_b$ is the probability of having a BRCA1/2 mutation, and $P_c$ is the probability of having mutations in other CPGs.
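The architecture described above can be sketched as a plain forward pass. This is an illustrative sketch, not the authors' implementation: the weights are random and untrained, the initialisation scheme is an assumption, the hidden-layer sizes (16 and 8) are taken from the Results, and the mapping from $P_1$ and $P_2$ to the three reported probabilities assumes $P_1$ is the probability of a GPV in any CPG and $P_2$ is the conditional probability of a BRCA1/2 GPV given carrier status, as defined in the Discussion. Dropout is omitted because it is only active during training.

```python
import numpy as np

# SELU constants (Klambauer et al., 2017)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled Exponential Linear Unit."""
    x = np.asarray(x, dtype=float)
    # expm1 is evaluated only on the non-positive part to avoid overflow
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, sizes=(25, 16, 8, 2)):
    """Random (untrained) weights; the initialisation is an assumption."""
    return [
        (rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x):
    """Inference-time forward pass: 25 -> 16 -> 8 -> 2."""
    h = x
    for w, b in params[:-1]:
        h = selu(h @ w + b)          # hidden layers use SELU
    w, b = params[-1]
    p = sigmoid(h @ w + b)           # two output neurons, each in [0, 1]
    p1, p2 = p[..., 0], p[..., 1]
    return {
        "P_a": p1,                   # GPV in any CPG
        "P_b": p1 * p2,              # GPV in BRCA1/2
        "P_c": p1 * (1.0 - p2),      # GPV in other CPGs
    }


rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.normal(size=(4, 25))         # four hypothetical patients, 25 features each
out = forward(params, x)
print({name: np.round(values, 3) for name, values in out.items()})
```

Under this decomposition the BRCA1/2 and other-CPG probabilities always sum to the any-CPG probability, which keeps the three outputs mutually consistent.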
With the paired input features and ground truth annotations of [$P_a$, $P_b$, $P_c$] (in the form of one-hot encoding), we trained 101 deep learning models using cross-entropy loss via gradient descent. The final prediction is derived by aggregating results from all deep learning models through the ensemble learning strategy (Additional file 1: Supplementary method) [32,33]. The cutoff points for each prediction scenario were determined to achieve 90% sensitivity (or the maximum sensitivity).
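The aggregation and cutoff selection can be illustrated with a small sketch. Two assumptions are made here because the details are in the supplementary method rather than the main text: ensemble members are combined by a simple mean of their predicted probabilities, and the cutoff is the largest threshold whose sensitivity still reaches the target. The function names and the simulated scores are hypothetical.

```python
import numpy as np


def ensemble_mean(member_probs):
    """Average predicted probabilities over ensemble members (axis 0).
    Mean aggregation is an assumption, not the documented strategy."""
    return np.mean(np.asarray(member_probs, dtype=float), axis=0)


def cutoff_for_sensitivity(y_true, scores, target=0.90):
    """Largest cutoff t such that calling 'score >= t' positive
    gives sensitivity >= target."""
    y_true = np.asarray(y_true).astype(bool)
    pos_scores = np.sort(np.asarray(scores, dtype=float)[y_true])[::-1]  # descending
    n_pos = pos_scores.size
    if n_pos == 0:
        raise ValueError("at least one positive case is required")
    # number of carriers that must be flagged (epsilon guards float round-off)
    k = max(1, int(np.ceil(target * n_pos - 1e-9)))
    return pos_scores[k - 1]


# Simulated example: 200 patients, 101 noisy ensemble members
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
base = 0.3 + 0.4 * y_true                      # carriers tend to score higher
members = np.clip(base + rng.normal(0.0, 0.2, size=(101, 200)), 0.0, 1.0)

scores = ensemble_mean(members)
cutoff = cutoff_for_sensitivity(y_true, scores, target=0.90)
sensitivity = np.mean(scores[y_true == 1] >= cutoff)
specificity = np.mean(scores[y_true == 0] < cutoff)
print(f"cutoff={cutoff:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Fixing sensitivity first and reading off the resulting specificity mirrors a triage setting, where missing a carrier is costlier than testing a non-carrier.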
To compare the performance of the DrABC model with that of other machine learning models, we evaluated six kinds of commonly used machine learning algorithms, including a fixed grid of Generalized Linear Models (GLMs), a naive Bayes (NB) classifier, five pre-specified Gradient Boosting Machine (GBM) models, three pre-specified and a random grid of eXtreme Gradient Boosting (XGBoost) models, a default Random Forest (RF), a near-default Deep Neural Net (DNN), and a random grid of DNNs. All models were trained on the discovery dataset to predict whether a breast cancer patient carries GPVs in any CPG using an inner five-fold cross-validation strategy. For each algorithm family, only the best model was retained to represent the maximum performance of each kind. These common machine learning algorithms were run using the R package h2o [34].
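For readers unfamiliar with the procedure, the five-fold split itself can be sketched as follows. This is a generic Python illustration and not the h2o workflow used in the study; the shuffling seed and the absence of stratification are assumptions, since the text does not state how folds were formed. The sample size of 1701 is the discovery cohort size reported in the Results.

```python
import numpy as np


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


splits = list(kfold_indices(1701, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx.size} validation={val_idx.size}")
```

Each patient appears in exactly one validation fold, so the cross-validated performance is computed on predictions for every patient in the discovery cohort without using that patient for training.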

Statistical analysis
Student's t-tests were used to analyze age at enrollment and age at diagnosis. The prevalence of personal cancer history, family cancer history, tumor size, histological grade, ER/PR/androgen receptor (AR)/HER2 status, and lymph node metastasis were compared using Pearson χ² or Fisher's exact tests. The risk of carrying a GPV in BRCA1/2 or CPGs was also estimated using NCCN guidelines (version 1.2020) [12], BRCAPRO (version 2.1-7) [35,36], Myriad II [37], PENN II [38], and BOADICEA (v3) [39] models in the multi-center validation cohort. Sensitivity, specificity, accuracy, and area under the receiver operating characteristic (ROC) curve (AUC) were calculated to evaluate the predictive performance of DrABC, other machine learning models, and previous models. ROC curves were compared using DeLong's test [40] with the algorithm of Sun and Xu [41]. Two-sided p < 0.05 was considered statistically significant. Statistical analysis was performed using SPSS version 15.0 (SPSS, USA) and R statistical software, version 3.5.1. The Youden index (J = sensitivity + specificity − 1) was used to evaluate the balance and potential effectiveness of each model with the suggested threshold [42].
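The performance measures named above can be computed as in the following sketch. It is a minimal NumPy illustration, not the SPSS/R code used in the study; the function names and the toy labels are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation. The final lines apply the Youden index to the validation-cohort sensitivity and specificity reported in the abstract for predicting GPVs in any CPG.

```python
import numpy as np


def youden_j(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and Youden J from binary calls."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / y_true.size,
        "youden_j": youden_j(sens, spec),
    }


def auc_rank(y_true, scores):
    """AUC as the probability that a random carrier outscores a random
    non-carrier, counting ties as one half."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Toy data (hypothetical): 3 carriers, 4 non-carriers
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1])
metrics = confusion_metrics(y_true, scores >= 0.5)
auc = auc_rank(y_true, scores)
print(metrics, round(auc, 3))

# Youden index from the reported validation-cohort values (any CPG)
j_drabc = youden_j(0.838, 0.513)
j_nccn = youden_j(0.788, 0.312)
print(round(j_drabc, 3), round(j_nccn, 3))
```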

Using a deep learning model to predict GPVs in DNA repair genes
To ensure data integrity, 249 patients with VUSs and 247 patients without complete clinical information or family cancer history were excluded from model construction [46]. A total of 1701 patients from the CHCAMS constituted the discovery cohort, and 731 patients from six other institutions constituted the independent multi-center validation cohort (Additional file 7: Fig. S5). We used 25 clinical features associated with GPVs in CPGs to develop the prediction model. These 25 features correspond to an input layer of 25 neurons (Additional file 8: Table S2), followed by two hidden layers of 16 and 8 neurons, respectively (Additional file 9: Fig. S6). In the inner five-fold cross-validation in the discovery cohort, DrABC performed slightly better than the other traditional machine learning models, although the differences were not significant (p > 0.05 when comparing each model with DrABC; Additional file 10: Fig. S7).

Performance of DrABC versus previous models
DrABC generates probabilities of whether a breast cancer patient carries GPVs in BRCA1/2, other CPGs except for BRCA1/2, or any CPG. In predicting GPVs in any CPG, the AUCs for DrABC were 0.80 (95% CI, 0.78-0.83) for the discovery cohort and 0.74 (95% CI, 0.69-0.79) for the validation cohort, which were superior to those for previous models (AUC = 0.65 for BRCAPRO [35], AUC = 0.57 for BOADICEA [39], AUC = 0.56 for Myriad [37], and AUC = 0.61 for PENN II [38] in the validation cohort; p < 0.01 when comparing each model with DrABC; Fig. 3A, Table 2, and Additional file 11: Table S3).
[Fig. 3 caption fragment: ... BRCA1/2 (D). **p < 0.01, ****p < 0.0001, when compared with DrABC.]
[...] diagnosed with breast cancer at ≤ 65 years of age could increase the sensitivity to 100% but reduced specificity to 2.5% and accuracy to 15.7%. When achieving the highest detection rate, DrABC had a sensitivity of 90.8% and specificity of 53.2% for all GPVs in the discovery cohort and a sensitivity of 83.8% and specificity of 51.3% for all GPVs in the multi-center validation cohort (Table 2 and Additional file 12: Table S4).
In predicting GPVs in BRCA1/2, the AUCs for DrABC were 0.81 (95% CI, 0.78-0.84) for the discovery cohort and 0.79 (95% CI, 0.74-0.85) for the validation cohort, which were also superior to those for previous models (AUC = 0.70 for BRCAPRO [35], AUC = 0.59 for BOADICEA [39,47], AUC = 0.59 for Myriad [37], and AUC = 0.63 for PENN II [38] in the validation cohort; p < 0.01 when comparing each model with DrABC; Fig. 3B, Table 2, and Additional file 11: Table S3). DrABC had a sensitivity of 85.6% and specificity of 65.5% for GPVs in BRCA1/2 in the discovery cohort and a sensitivity of 82.1% and specificity of 63.1% for GPVs in BRCA1/2 in the validation cohort, when achieving the highest detection rate (Additional file 12: Table S4). Compared to previous models, DrABC demonstrated the highest Youden index with the corresponding threshold for detecting GPVs in BRCA1/2 or any CPG (Table 2), suggesting that DrABC has a more balanced performance than previous models.

Contributions of family cancer history and pathological features to DrABC performance
To identify the contributions of individual feature groups to the deep learning model, we assessed the performance of DrABC after eliminating family cancer history or pathological feature data in the validation cohort. Eliminating family cancer history data did not markedly reduce the performance of DrABC, with AUCs of 0.72 in predicting GPVs in any CPG, 0.75 in predicting GPVs in BRCA1/2, and 0.52 in predicting GPVs in other CPGs. However, eliminating pathological feature data reduced the performance of DrABC, with AUCs of 0.62 in predicting GPVs in any CPG, 0.66 in predicting GPVs in BRCA1/2, and 0.44 in predicting GPVs in other CPGs (Additional file 15: Fig. S10). Therefore, pathological features represent an important predictive factor for hereditary breast cancer.

Reconstructing previous prediction models using in-house data in discovery cohort
To investigate the contribution of the Chinese-specific training dataset to the superior performance of the DrABC model over the previous models, we reconstructed the previous prediction models of BRCAPRO, BOADICEA, Myriad, and PENN II using the underlying algorithms (i.e., Bayes' theorem for BRCAPRO and BOADICEA, and logistic regression for Myriad and PENN II; through the R package h2o [34]) and input variables (Additional file 16:  Fig. S11 and Additional file 18: Table S6).

Online DrABC tool
We implemented a website interface (http://gifts.bio-data.cn/) to accommodate extensions to the DrABC model and make it easily accessible to healthcare providers and researchers (Additional file 19: Fig. S12). A user guide is provided in Additional file 20: A user guide for the DrABC model.

Discussion
Breast cancer patients with GPVs in BRCA1/2 and other breast cancer-associated genes benefit from particular patterns of systemic treatments and risk-reducing interventions [48]. Although risk prediction models have been developed for combined groups of patients with breast or ovarian cancer as well as healthy individuals with a family history of hereditary breast and ovarian cancer [36,37,49,50], no clinical tool has been specifically developed for patients already diagnosed with breast cancer. Therefore, we developed and validated a reliable prediction model using deep learning algorithms to identify GPV carriers among unselected breast cancer patients with better accuracy than previous models and no trend toward overfitting.
In this study, we compared and tested the currently available risk prediction models and identified the following shortfalls and limitations: (1) The probability of carrying GPVs was derived from data from multi-generation families and computed based on the family history of specific cancers, age at diagnosis, and ancestry [37,50,51]. Thus, the performance of these models in small family structures with simple pedigrees would be significantly limited [8]. (2) Evolutionarily recent or de novo mutations may have a more significant influence on disease susceptibility or protection than ancient mutations (Additional file 16: Table S5) [52]. (3) The vast majority of models were developed based on data derived from European populations, and their performance in Asian populations has not been validated [53]. (4) Most of the existing models were specifically designed to predict the GPV carrier risk in BRCA1/2 genes and thus cannot be readily used to assess the risk for other breast cancer predisposition loci, which are also important for personalized healthcare decisions.
Thus, to identify whether the superior performance of DrABC may also be attributed to its Chinese-specific training dataset, we reconstructed the previous prediction models of BRCAPRO, BOADICEA, Myriad, and PENN II using the corresponding algorithms and input variables and trained them in the discovery cohort of this study. As a result, the performance of DrABC was superior to those of the reconstructed models of BRCAPRO, Myriad, and PENN II, but similar to that of the reconstructed BOADICEA model. Notably, only DrABC and the reconstructed BOADICEA model have incorporated pathological features in the algorithm. Collectively, the DrABC model has shown better performance in the Chinese population than all these previous models in their current versions. After training these previous models with the Chinese-specific dataset, the previous models without pathological information still could not compete with the DrABC model, while the BOADICEA model, which includes pathological features, demonstrated similar performance to the DrABC model.
In comparison with traditional machine learning models, DrABC achieved slightly better performance, but the differences were not significant (Additional file 10: Fig. S7). A similar lack of difference among machine learning models was observed in a previous study predicting GPV status in pancreatic cancer patients [20]. However, although based on a similar deep learning technique, the DNN models had the worst performance, with an AUC of 0.75, suggesting that DNNs in particular are difficult to make perform well without careful design. In addition, because this is a complex classification task with three categories, we specifically designed the prediction model based on a hierarchical neural network producing two probabilities, $P_1$ and $P_2$, where $P_1$ is the probability of having a mutation in any CPG and $P_2$ is the probability of having a BRCA1/2 mutation when the patient is known to carry a mutation in any CPG. In summary, DrABC is a purpose-designed and well-performing model for this scenario.
As each CPG has distinct endophenotypes in terms of clinical and pathological features, the detailed phenotype of a proband with breast cancer should be incorporated in risk prediction. However, a previous study found that incorporating ER/PR/HER2 status into the BOADICEA model did not improve its predictive accuracy [53], which is inconsistent with the present study. Intriguingly, pathological features contributed more than family cancer history to the ability of DrABC to predict GPVs in BRCA1/2 and any CPG, which might explain the superior performance of DrABC and the reconstructed BOADICEA model over the other previous models.
Asian breast cancer patients exhibit several unique features. Breast cancer is diagnosed at much younger ages in Asian women than in women from Western countries [1,54]. Moreover, BRCA2 mutations are more common than BRCA1 mutations in Asian women as compared with Caucasian women [55,56]. However, we found that breast cancer patients with GPVs in BRCA2 have less distinct endophenotypes than those with GPVs in BRCA1. These two features reduce the performance of previous risk prediction models and criteria [53]. To our knowledge, DrABC is the first available GPV risk prediction model suitable for Asian breast cancer patients, which might contribute to its better performance than previous models based on Western populations. Therefore, we introduced an applicable pipeline for GPV carrier risk assessment among patients with breast cancer (Additional file 21: Fig. S13). This approach would strike a balance between identifying more GPV carriers and testing fewer breast cancer patients and, in turn, would bolster national guidelines for genetic testing and reduce healthcare costs. However, we cannot rule out that testing breast cancer patients with a low predicted risk of GPVs would further increase the detection rate [57]; such testing should be undertaken with consideration of local healthcare resources and patient preferences.
This study has some limitations. As our study included few carriers of GPVs in CPGs other than BRCA1/2, their endophenotypes were not well represented. Although this study employed a multi-center design, only Chinese female patients with breast cancer were investigated. Extending the use of this model to other ethnicities requires further tuning by training the model with an ethnicity-specific dataset, followed by validation in larger cohorts from the corresponding population.

Conclusions
Breast cancer patients with GPVs in different CPGs exhibit distinct endophenotypes. Based on these distinct features, we developed and validated a phenotype-driven risk prediction model using a deep learning algorithm to identify GPV carriers among unselected breast cancer patients in a multi-center cohort. The DrABC model better predicted the risk of carrying GPVs in BRCA1/2 or other CPGs in the Chinese population compared to previous risk prediction models which were trained in other populations. This robust germline defect risk stratification tool can be utilized to triage patients at higher risk for genetic testing.