Network-based approaches elucidate differences within APOBEC and clock-like signatures in breast cancer

Background Studies of cancer mutations have typically focused on identifying cancer driving mutations that confer growth advantage to cancer cells. However, cancer genomes accumulate a large number of passenger somatic mutations resulting from various endogenous and exogenous causes, including normal DNA damage and repair processes or cancer-related aberrations of DNA maintenance machinery as well as mutations triggered by carcinogenic exposures. Different mutagenic processes often produce characteristic mutational patterns called mutational signatures. Identifying mutagenic processes underlying mutational signatures shaping a cancer genome is an important step towards understanding tumorigenesis. Methods To investigate the genetic aberrations associated with mutational signatures, we took a network-based approach considering mutational signatures as cancer phenotypes. Specifically, our analysis aims to answer the following two complementary questions: (i) what are functional pathways whose gene expression activities correlate with the strengths of mutational signatures, and (ii) are there pathways whose genetic alterations might have led to specific mutational signatures? To identify mutated pathways, we adopted a recently developed optimization method based on integer linear programming. Results Analyzing a breast cancer dataset, we identified pathways associated with mutational signatures on both expression and mutation levels. Our analysis captured important differences in the etiology of the APOBEC-related signatures and the two clock-like signatures. In particular, it revealed that clustered and dispersed APOBEC mutations may be caused by different mutagenic processes. In addition, our analysis elucidated differences between two age-related signatures—one of the signatures is correlated with the expression of cell cycle genes while the other has no such correlation but shows patterns consistent with the exposure to environmental/external processes. Conclusions This work investigated, for the first time, a network-level association of mutational signatures and dysregulated pathways. The identified pathways and subnetworks provide novel insights into mutagenic processes that the cancer genomes might have undergone and important clues for developing personalized drug therapies.


NETPHIX Method
Our analysis for the association of gene alteration information utilized NETPHIX, which was developed to identify network based associations for continuous cancer phenotypes. While the algorithm can deal with more general cases, we used a simple version of NETPHIX, which identifies subnetworks associated with an increased level of phenotypes. See (Kim et al. 2019) for the full details of NETPHIX algorithm.
The optimization problem is formally defined as follows. We are given a graph G = (V, E), with vertices V = {1, . . . , n} representing genes and edges E representing interactions among genes. Let P denote the set of m patients (or cell lines). For each sample j ∈ P , we are also given a phenotype profile value w j ∈ R which quantitatively measures a phenotype (e.g., mutation counts in our study). Let P i ⊆ P be the set of patients in which gene i ∈ V is altered. We say that a patient j ∈ P is covered by gene i ∈ V if j ∈ P i i.e. if gene i is altered in sample j. We say that a sample j ∈ P is covered by a subset of genes (or vertices) S ⊆ V , if there exists at least one vertex v in S such that j ∈ P v .
The goal is to identify a connected subgraph S of G of at most k vertices such that the sum of the weights of the samples covered by S is maximized. The weights are computed based on mutation counts. Since we are interested in functionally complementary mutations, we also penalize coverage overlap when a sample is covered more than once by S by assigning a penalty p j for each of the additional times sample j is covered by S. Let c S (j) be the number of times element j ∈ P is covered by S. For a set S of genes, we define its weight W (S) as: Thus, we define the optimization problem as follows: Given a graph G defined on a set of n vertices V , a set P , a family of subsets P = {P 1 , . . . , P n } where for each i, P i ⊆ P is associated with i ∈ V , weights w j and penalties p j ≥ 0 for each sample j ∈ P , find the subset S ⊆ V of ≤ k connected vertices maximizing W (S).
We then formulated the problem as integer linear programming (ILP) as follows and solved it to optimality with CPLEX. l:il∈E Let z j be a binary variable equal to 1 if sample j is covered by a gene i and 0 otherwise. Let y j denote the number of genes in I that cover sample j in the solution. Finally, let w j be the weight of sample j and p j be the penalty for sample j. Although the general problem is NP-hard, we obtained the optimal solution to the ILP instances using CPLEX. We ran the program with k = 1...7 and the statistical significance of the selected modules was then assessed with permutation tests.

Construction of Gene Alteration Table
The gene level alteration information for the input to NETPHIX is constructed by utilizing all somatic point mutations and small indels for the same 560 patients data. In general, we defined a gene g to be altered for a patient p if it has at least one "valid" mutation in the genomic region of g for p. The definition of "valid" mutations can be different for each signature as we further refined the information by removing mutations attributed to the signature. For example, the input alteration table used for the association with Signature 2 is constructed after removing all somatic mutations assigned to Signature 2. Formally, for the alteration table ALT i used for association with Signature i, a gene g in ALT i is defined to be altered only if it has at least one non-silent mutation in the genomic region of g that is not attributed to Signature i. For ALT 3 and ALT 8 , we additionally removed all indels as these signatures are believed to lead to a high burden of indels. Finally, we augmented the alteration table if the gene is annotated as being biallelic inactivated (Supplementary Table 4a Figure S1: Gene expression correlation modules (refers to Fig. 2).
Clustering of all genes significantly correlated with at least one of the signatures. This shows a more fine-grained clustering (12 clusters) than in Fig. 2. A heatmap of mean expression correlation for each cluster and signature (left), number of genes in each cluster (middle), and representative GO terms enriched in each cluster of genes (right) are shown. Sig. 3D BRCA1 BRCA2 RB1 cover Figure S2: Subnetworks identified by NETPHIX using less stringent cut-off (refers to Fig. 3). The best m (the module size) using less stringent cut-offs was selected as maximal index for which the optimal objective function increased more than 1% with respect to previous index and the phenotype p-value did not increase. Panel for each signature consists of a network view of a module (left) and a heatmap showing the association of selected gene alterations with signature strength across patients (right). The network node size indicates the gene robustness (regarding NETPHIX results for different random initialization runs of SigMa) while the darkness of red color represents its individual association score (p-value). Each heatmap shows the number of mutations attributed to a given signature for all samples (orange; top row; log 10 scale) sorted from low to high (columns). For each gene in the module, gene mutations observed in each sample caused by other signatures are shown in gray, while samples not altered are in white. The last row shows the mutation profile of the entire subnetwork in black. Only subnetworks that changed with respect to the normal cut-offs (see Fig. 3 and Materials and Methods) are shown. Results for Signatures 2D and 3C did not change with respect to the normal cut-offs and results for Signatures 1D and 5D stayed insignificant (the FDR adjusted p-value above 0.1).