scGRNom: a computational pipeline of integrative multi-omics analyses for predicting cell-type disease genes and regulatory networks

Understanding cell-type-specific gene regulatory mechanisms from genetic variants to diseases remains challenging. To address this, we developed a computational pipeline, scGRNom (single-cell Gene Regulatory Network prediction from multi-omics), to predict cell-type disease genes and regulatory networks including transcription factors and regulatory elements. With applications to schizophrenia and Alzheimer’s disease, we predicted disease genes and regulatory networks for excitatory and inhibitory neurons, microglia, and oligodendrocytes. Further enrichment analyses revealed cross-disease and disease-specific functions and pathways at the cell-type level. Our machine learning analysis also found that cell-type disease genes improved clinical phenotype predictions. scGRNom is a general-purpose tool available at https://github.com/daifengwanglab/scGRNom. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00908-9.


Background
Recent genome-wide association studies (GWAS) studies have identified a variety of genetic risk variants associated with multiple brain diseases. For example, a recent study found 109 pleiotropic loci significantly associated with at least two brain disorders [1]. Many cross-disease common genetic risk factors have revealed many shared functional consequences in clinical presentations [2]. Recent studies have also revealed shared symptoms at both psychiatric and physical levels between neurodegenerative and neuropsychiatric diseases [3]. For instance, 97% of Alzheimer's disease patients develop neuropsychiatric symptoms throughout the disease [4]. Besides, additional insights into each disease's progression and causes have further demonstrated the highly interlinked nature of both disease types [5]. However, our understanding of the molecular mechanisms of genetic variants between diseases remains elusive, particularly at the cell-type levels.
Alzheimer's disease (AD) and schizophrenia (SCZ) are neurodegenerative and neuropsychiatric diseases, respectively. Both are significantly associated with genetic variants and have complex underlying cellular and molecular mechanisms from genotype to phenotype [6,7]. Notably, AD is physiologically characterized by accumulations of amyloid beta plaques and neurofibrillary tau protein tangles in the brain [8]. Amyloid beta plaques primarily originate from the apolipoprotein E-encoding gene APOE and its multiple variants. The APOE gene is a single step in the broader amyloidogenic processing pathway (APP), and additional genes involved in the process contribute to the regulation of amyloid beta production [6]. Much work has identified major genes of interest involved in the APP [6]. However, a distinct need still exists to further explore these disease loci to understand better the interplay between their regulatory elements and eventual amyloid beta creation and accumulation. Similarly, neurofibrillary tau tangles are associated with many genetic loci and require a study of the highly complex molecular mechanisms required to achieve disease pathology [8]. Further, the downstream effects from both amyloid beta and neurofibrillary tangles within and between various cell types add additional complexity toward linking specific regulatory events and elements with clinical pathology [9,10].
Also, SCZ is a neuropsychiatric disorder characterized by disruptions in dopamine, glutamate, and GABA-based receptor signaling pathways [11]. Pathologically, the direct connection is less clear between known molecular abnormalities and observed physical changes through neurological imaging studies [12]. Thus, attempting to understand the interactions between activation of known risk genes and higher-level pathway disruptions may help elucidate the causes for structural shifts in SCZ patients [10]. At a psychiatric level, the alterations to various cortical structures create multiple forms of symptoms, including positive (e.g., hallucinogenic episodes) and negative (e.g., anti-social tendencies) [13]. Finally, GWAS for AD cohorts has revealed multiple conserved genetic loci that could encode shared risk factors between the two diseases [14]. Clinical presentations, specifically psychiatric effects, create a crucial point of intersection to be explored where general psychosis was found in up to 60% of AD patients, including hallucination events as well as other effects mirroring those of the positive symptoms found in SCZ patients [15]. Thus, studying shared risk variants and genes between both diseases may help elucidate functional genomics of interest in both diseases and can further uncover cross-disease and disease-specific mechanisms between neurodegenerative and neuropsychiatric diseases.
Recently, advances in single-cell sequencing technologies have generated a great deal of excitement and interest in studying functional genomics at cellular resolution. For example, scRNA-seq and scATAC-seq techniques have measured the transcriptomics and epigenomics of individual cells in the human brain [16,17]. Further computational analyses have clustered cells into many cell types [16]. The cells in the same type share similar transcriptional activities such as gene expression and genomic functions. Differential gene expression across cell types is a complex, multi-gene dynamic process that tightly regulates and controls functions and is governed by gene regulatory factors such as transcriptional factors (TFs) and non-coding regulatory elements. These factors cooperate as a gene regulatory network (GRN) to facilitate the correct cellular and molecular functions on the genome scale. Disrupted cooperation can give rise to abnormal gene expression, such as those present in diseases. Therefore, GRN has been used as a robust system to infer genomic functions and molecular mechanisms, especially for human diseases [18].
Recent analyses have also revealed that brain disease risk variants are located in non-coding regulatory elements (e.g., enhancers). The risk genes likely have celltype-specific effects for both neuronal and non-neuronal cell types [19,20]. Besides, recent single-cell studies suggest changes to cell-type-specific gene expression in brain diseases [9,10]. However, our understanding of the underlying gene regulatory mechanisms driving celltype-and disease-specific gene expression, especially across diseases, remains elusive. To better understand cell-type gene regulatory mechanisms, several computational methods have recently been developed to predict cell-type GRNs [21], such as PIDC [22], GENIE3 [23], and GRNBoost2 [24], aiming to provide deeper mechanistic insights on how transcription factors regulate target gene expression at the cell-type level. However, these methods typically only use single omics (e.g., transcriptomics) and predict networks based on statistical associations (e.g., co-expression), providing insufficient mechanistic insights into gene regulation at the cellular resolution. For instance, how the disease variants affect the transcription factor binding sites (TFBSs) on the distal regulatory elements (e.g., enhancers) that control disease genes is still unclear, especially at the cell-type level. Thus, it is essential to integrate emerging multiomics data to understand cell-type gene regulation, especially involving non-coding regulatory elements. Recent studies have shown that integrating multi-omics data can reduce the impact of noise from a single omics data and achieve better prediction accuracy [25].
To explore these ideas, we developed a computational pipeline, scGRNom (single-cell Gene Regulatory Network prediction from multi-omics), to integrate multiomics data and predict cell-type GRNs linking TFs, regulatory elements (e.g., enhancers and promoters), and target genes. scGRNom is a general-purpose tool opensource available at https://github.com/daifengwanglab/ scGRNom [26]. In particular, we applied scGRNom to the multi-omics data at the cellular resolution, such as chromatin interactions, epigenomics, and single-cell transcriptomics of primary cell types in the human brain, including different excitatory and inhibitory neuronal types, microglia, and oligodendrocyte. Our predictions have high overlapping with state-of-the-art methods for revealing TFs and target genes [21], but they provide additional information on cell-type gene regulation, such as linking the regulatory elements to the genes. We also found that the enhancers in our celltype GRNs are enriched with GWAS SNPs in human brain diseases, including psychiatric disorders and AD. Thus, we further linked the GWAS SNPs that interrupt TFBSs to cell-type disease genes based on the cell-type GRNs for SCZ and AD and found cross-disease and disease-specific genomic functions at the cell-type level. Finally, we found that the cell-type disease genes shared by AD and SCZ have improved predicting clinical phenotypes in AD, like disease staging and cognitive impairment.

Methods
Predicting gene regulatory networks from multi-omics data scGRNom is a computational pipeline in R as a generalpurpose tool [26] to (I) integrate multi-omics datasets for predicting gene regulatory networks linking transcription factors, non-coding regulatory elements, and target genes and (II) identify disease genes and regulatory elements. scGRNom can be applied in general to predict either bulk or cell-type disease genes and regulatory networks. First, for predicting gene regulatory networks from multi-omics, scGRNom has three steps ( Fig. 1), each of which is available as an R function: Step 1: Finding chromatin interactions. The function, scGRNom_interaction, inputs the chromatin interaction data (e.g., Hi-C) and predicts all possible interactions between enhancers and promoters in the data or the userprovided list-for example, those from topologically associating domains (TADs) in Hi-C data. In addition, the function uses an R package, GenomicInteractions [27], to annotate interacting regions and link them to genes.
Step 2: Inferring the transcription factor binding sites on interacting regions. The function, scGRNom_getTF, infers the transcription factor binding sites (TFBSs) based on consensus binding site sequences in the enhancers and promoters that potentially interact from the previous step scGRNom_interaction. It outputs a reference gene regulatory network linking these TF, enhancers, and/or promoters of genes. In particular, this function uses TFBSTools [28] to obtain the position weight matrices of the TFBS motifs from the JASPAR database [29] and predicts the TFBS locations on the enhancers and promoters via mapping TF motifs. The function further links TFs with binding sites on all possible interacting enhancers and promoters and outputs the reference regulatory network. Furthermore, this function can run on a parallel computing version via an R package, motifmatchr [30] for computational speed-up.
Step 3: Predicting the gene regulatory network. The function, scGRNom_getNt, predicts the final gene Fig. 1 The computational pipeline, scGRNom, for predicting the gene regulatory network via multi-omics data. The pipeline inputs the chromatin interactions (e.g., from Hi-C) of regulatory elements (e.g., enhancer-promoter), identifies the transcription factor binding sites (TFBSs) on interacting regulatory elements, predicts TF-target gene expression relationships (e.g., high coefficients from Elastic net regression), and finally outputs a gene regulatory network linking TFs (cyan), regulatory elements (purple) to target genes (green) regulatory network based on the TF-target gene expression relationships in the reference network. The reference gene regulatory network from the previous step provides all possible regulatory relationships (wires) between TF, enhancers, and target genes. However, the chromatin interacting regions are broad, so that many TFs likely have binding sites on them. Also, changes in gene expression may trigger different regulatory wires in the reference network. To refine our maps and determine the activity status of regulatory wires, we apply elastic net regression, a machine learning method that has successfully modeled TF-target gene expression relationships in the gene regulatory networks by our previous work [10]. Further, suppose the chromatin accessibility information is available (e.g., from scATACseq data for a cell type). In that case, the function can also filter the enhancers based on their chromatin accessibilities and then output the network links only having the enhancers with high accessibility (e.g., overlapped with scATAC-seq peaks). The parameter "open_chrom" inputs a list of user-defined chromatin accessible regions.
Mathematically, given a gene expression dataset and a reference network (e.g., from scGRNom_getTF), the function uses the TF expression to predict each target gene expression and finds the TF with high regression coefficients. Given a target gene, let y∈R n be a gene expression vector modeling its expression values across n samples (e.g., n cells from single-cell data) and X∈R nÂm be the gene expression matrix of m TFs across n samples. Those m TFs should link to the target gene from the reference network, implying possible regulatory relationships to the gene. The elastic net regression model then aims to find the optimal coefficients for m TFs c Ã ∈ R m to solve the following optimization problem: where α and β are parameters to adjust the contributions from L 2 and L 1 regularizations of c∈R m . The samples are randomly divided into the training and testing sets by the parameter, train_ratio (e.g., if train_ratio = 0.7, then 70% training and 30% testing data). The optimal TF coefficients c * are estimated by the training data. Also, for the model evaluation, the mean square error (MSE) of the regression is calculated and reported by ‖y test − X test c * ‖ 2 using the testing data. Further, the top TFs with high coefficients can be either selected by absolute coefficient values (the parameter, cutoff_absolute) or a percentage from all m TFs (the parameter, cutoff_percentage). Finally, the function outputs a final gene regulatory network linking the top TFs as well as their linked enhancers (from the reference network, if any) to all possible target genes.

Identifying cell-type disease genes and regulatory elements
In addition to predicting gene regulatory networks, the pipeline also provides another function, scGRNom_dis-Genes, for identifying cell-type disease genes and regulatory elements (e.g., enhancers, promoters). This function's input includes a cell-type gene regulatory network and a list of GWAS SNPs associated with a disease (Fig. 2). The function uses an R package, GenomicRanges [31], to overlap these disease SNPs with the enhancers and promoters of the input cell-type gene regulatory network, and then find the ones that interrupt the binding sites of all possible TFs (TFBSs) on the enhancers and promoters by motif-breakR [32]. It finally maps the overlapped enhancers or promoters and TFs with interrupted TFBSs onto the input network to find the linked genes and enhancers/promoters as the output cell-type disease genes and regulatory elements.
Application to multi-omics data and data processing for predicting cell-type gene regulatory networks in the human brain We applied the scGRNom pipeline to the multi-omics data for the human brain, including cell-type chromatin interactions [19], transcription factor binding sites [28], single-cell transcriptomics [16], and recent cell-type open chromatin regions [17] for predicting cell-type GRNs in the human brain. In particular, we predicted cell-type gene regulatory networks for major cell types: excitatory and inhibitory neurons, microglia, and oligodendrocyte for the human brain [16]. The excitatory neuronal types include Ex1, Ex2, Ex3e, Ex4, Ex5b, Ex6a, Ex6b, Ex8, and Ex9. The inhibitory neuronal types include In1a, In1b, In1c, In3, In4a, In4b, In6a, In6b, In7, and In8. We first input recently published cell-type chromatin interactome data in the human brain [19] to scGRNom_interaction to reveal all possible interactions from enhancers to gene promoters in the neuronal, microglia, and oligodendrocyte types. The genome annotation was from TxDb.Hsapiens.UCSC.hg19.knownGene [33]. We then predicted a reference regulatory network for each of these cell types using scGRNom_getTF. Finally, given a cell type, we input the single-cell gene expression data for the type and the reference network from scGRNom_getTF to scGRNom_getNt for predicting the cell-type gene regulatory network (GRN). Specifically, to make our predicted networks comparable across cell types, we made the following data preparation and processing steps. First, although different studies have generated increasing numbers of single-cell data (e.g., for microglia [9]), we used the data from one study (GSE97942) [16] that includes the gene expression data (UMI) of individual cells of primary cell types, all from one postmortem tissue of human frontal cortex, aiming not to introduce additional batches from different studies, which helps our further comparative analyses across cell types. In addition, we filtered the genes that express in less than 100 cells. We then normalized gene expression by Seurat 4.0 [34] for further removing noises and batches across cell types. We also applied the method MAGIC [35] to impute the single-cell gene expression of all cells to address potential dropout issues. Then, for each cell type, we removed the lowly expressed genes with log10(sum of imputed gene expression levels of the cells of the cell type+1) < 1. The numbers of genes and cells for each cell type used for prediction are available in Additional file 1.
For predicting a cell-type GRN using the Elastic net regression model, we randomly split the cells into 70% training and 30% testing sets. We then selected the best Elastic net model that minimized the mean square error (MSE). We then filtered the target genes based on the goodness of fit by MSEs. In particular, we removed the target genes predicted by Elastic net regression with MSE > 0.1 and also TF-target gene with absolute Elastic net coefficient < 0.01. The output cell-type GRN consists of the network edges that link TFs, enhancers (if any), and target genes (TGs), as well as the Elastic net coefficient of TF-TGs for each edge. Finally, we provided two versions of each cell-type GRN (Additional file 2): (I) The edges that include the enhancers that overlap the cell-type open chromatin regions predicted by recent scATAC-seq data (broad excitatory and inhibitory neurons, microglia, and oligodendrocyte) [17] Fig. 2 Identification of cell-type disease genes and regulatory elements. The function of our pipeline, scGRNom_disGenes, predicts cell-type disease genes and regulatory elements. First, it inputs a cell-type GRN (top right) and disease-associated SNPs from GWAS (top left). Second, the function identifies the disease SNPs that interrupt the binding sites of the TFs on the enhancers or promoters that link to the target genes in the input GRN (middle). Finally, the function outputs a list of the target genes and regulatory elements (enhancers or promoters) linked by disease SNPs (bottom). Comparison with state-of-the-art methods We compared our scGRNom predictions with existing state-of-the-art methods. In particular, we input the single-cell gene expression data for each cell type to a recently published benchmark framework, BEELINE [21], and predicted the cell-type regulatory networks using three of the most consistent and highly accurate methods, PIDC [22], GENIE3 [23], and GRNBoost2 [24]. These methods only input gene expression data to predict all possible TF-target gene (TG) regulatory links based on their expression relationships, without considering the regulatory elements or open chromatin regions at the cell-type level. Thus, after applying to our processed single-cell gene expression data, they generated more network edges than us because scGRNom only keeps the TF-TG links in which TFs have binding sites on the regulatory elements (e.g., enhancers and promoters). Therefore, to make these networks comparable with ours, we extended our networks by selecting up to the top 30% TFs for each TG and then checked if our TF-TG links predicted by the state-of-the-art method (also up to top 30% TFs for each TG selected for each method). In particular, given a cell-type GRN by scGRNom, we selected top K TFs per target gene (TG) to see if the TF-TG pairs were also predicted by PIDC, GENIE3, or GRNBoost2 (again picking top K TFs per TG). We varied K values from 0 to 30% and then calculated the percentages of the scGRNom's TF-TG pairs that can be predicted by one of those methods (Additional file 3: Figure S1).

GWAS SNPs and heritability enrichment analyses for the enhancers in the cell-type gene regulatory networks in the human brain
Genome-wide association studies (GWAS) have identified a variety of genetic risk variants, including single nucleotide polymorphisms (SNPs) that are significantly associated with diseases and phenotypes (i.e., the traits). For example, recent GWAS studies have identified many SNPs associated with AD (2357 credible SNPs) [6] and SCZ (6105 credible SNPs) [7]. In addition to the credible SNPs, we also included additional SNPs with p<5e−5 from the AD and SCZ GWAS summary statistics for linking potential additional cell-type disease genes [36]. We applied the partitioned linkage disequilibrium score regression (LDSC) [37] to evaluate the heritability explained by the enhancers of our cell-type gene regulatory networks for GWAS SNPs. In particular, our heritability enrichment analyses used the GWAS summary statistics for the diseases or traits: SCZ [7], AD [6], autism spectrum disorder (ASD) [38], bipolar disorder (BPD) [39], amyotrophic lateral sclerosis (ALS) [40], major depressive disorder (MDD) [41], intelligence [42], multiple sclerosis (MS) [43], Parkinson's disease (PD) [44], attention-deficit hyperactivity disorder (ADHD) [45], education [46], type 2 diabetes (T2D) [47], inflammatory bowel disease (IBD) [48], and coronary artery disease (CAD) [49]. Also, we provided the numbers of GWAS SNPs for AD and SCZ that interrupt the binding sites of at least one of all possible TFBSs and the binding sites of the regulatory TFs in each cell-type GRN (Additional file 3: Table  S1).

Identification and enrichment analyses of cell-type disease genes
Given a cell type and a disease type (AD or SCZ), we input the cell-type GRNs (both versions) and the GWAS SNPs for the disease to the scGRNom_disGenes function for identifying the cell-type disease genes. We identified the cell-type disease genes (AD and SCZ) for all the cell types as above, including excitatory and inhibitory neuronal subtypes, microglia, and oligodendrocyte (Additional file 4). We also merged the disease genes for excitatory/inhibitory neuronal subtypes as the broad excitatory/inhibitory neuronal disease genes. We used the web app, Metascape [50], to find the enrichments of cell-type disease genes such as KEGG pathways, GO terms, protein-protein interactions, and diseases (via DisGeNET). Enrichment p-values shown in this paper were adjusted using the Benjamin-Hochberg (B-H) correction. Also, we looked at the expression levels of our cell-type disease genes in the disease samples. In particular, we compared the published population gene expression data in AD for single-cell expression [9].

Machine learning prediction of clinical phenotypes from cell-type disease genes
Finally, we used the machine learning approach to predict clinical phenotypes from our cell-type disease genes using the population data of the ROSMAP project, an independent AD cohort [51]. Given a clinical phenotype, we assume that X i ∈ R d represents the expression data of d disease genes for the ith individual in the cohort and Y i ∈ {0, 1} represents the binarized class of the ith individual's phenotype with i ∈ 1, …, n individuals for training. We then found the optimal logistic regression model with the parameters fβ Ã 0 ; β Ã 1 ∈R d g to classify the phenotype from the disease gene expression data via minimizing the following loss function: where (.) T is the transpose operation. Also, we performed cross-validation (K = 5) for the individual samples with 80% training and 20% testing sets. We also balanced the class sample size in each training set by the weighting method [52] so that the baseline of the classification accuracy is 50% for two classes. We used the individuals from the training sets to train the classification model. We then used the individuals from the testing sets to evaluate the classification performance, i.e., accuracy. In particular, we predicted four clinical phenotypes in ROSMAP: Braak stages that measure the severity of neurofibrillary tangle (NFT) pathology (Braak early stages (0, 1, 2, 3) vs. late stages (4, 5, 6)), CERAD scores that measure neuritic plaques (no AD vs. AD), the diagnosis of cognitive status (DCFDX, no or mild cognitive impairments (1, 2, 3) vs. Alzheimer's dementia (4, 5)), and the cognitive status at the time of death (COGDX, no or mild cognitive impairments (1, 2, 3) vs. Alzheimer's dementia (4,5)).

Results
Predicting cell-type gene regulatory networks in the human brain We applied the scGRNom pipeline to the multi-omics data for the human brain, including cell-type chromatin interactions [19], transcription factor binding sites [28], single-cell transcriptomics [16], and cell-type open chromatin regions [17] ("Methods" section). We predicted the cell-type gene regulatory networks (GRNs) for both glial and neuronal cell types in the human brain, including microglia, oligodendrocyte, excitatory neuronal subtypes (Ex1, Ex2, Ex3e, Ex4, Ex5b, Ex6a, Ex6b, Ex8, and Ex9), and inhibitory neuronal subtypes (In1a, In1b, In1c, In3, In4a, In4b, In6a, In6b, In7, and In8). Each cell-type GRN links TFs to enhancers to target genes (TGs) and has two versions that (I) only include the network edges with open cell-type enhancers predicted by recent scATAC-seq data [17] and (II) only include the edges with top 10% TFs with highest absolute coefficients for each target gene without using cell-type open chromatin regions. The reason why we included the second version is that the open chromatin regions predicted by scATAC-seq may not be highly accurate and cell-typespecific, given that the scATAC-seq data is noisy and also currently unavailable for neuronal subtypes (e.g., Ex1-9, In1-8 Our GRNs reveal many known cell-type-specific regulations. Figure 3A visualizes the subnetworks among TFs for select cell types (i.e., TGs are also TFs). For example, two known TFs, MEF2A and RFX3, that control microglia phenotypes play hub roles in the microglia network [53,54]. The nuclear factors NFIA, NFIX, and FOXP1 controlling neural differentiation and gliogenesis are also hub genes in the oligodendrocyte network [55][56][57]. MEF2C regulating inhibitory and excitatory synapses is a central node in both excitatory and inhibitory networks (e.g., Ex1 and In6b) [58]. In addition to cell-type TFs, we also observed the cell-type-specific expression relationships between TFs and TGs (high correlation). For instance, in Fig. 3B, E2F3-LRRK2, STAT2-FBXO32, IRF2-DYNC1, and ATF4-EPB41L1 show cell-type-specific expression relationships across the cells of microglia, oligodendrocyte, Ex1, and In6b types.
In addition, we compared our predicted cell-type gene regulatory networks with existing state-of-the-art methods for predicting cell-type gene regulatory networks, particularly those that are consistent and highly accurate PIDC, GENIE3, and GRNBoost2 benchmarked by BEELINE [21] ("Methods" section). The percentages of the overlapped TF-TG links of the cell-type network between scGRNom (both versions) and the state-of-theart methods are over 50% ("Methods" section, Additional file 3: Figure S1). This suggests a high consistency between scGRNom and these methods. However, these methods predicted TF-TG links without providing information on regulatory elements like enhancers. Thus, we looked further at the enhancers in our cell-type GRNs and found that they have significantly high heritability enrichments for GWAS SNPs of multiple brain diseases and traits (p<0.05, "Methods" section). For example, the enhancers in our excitatory-neuron and inhibitoryneuron GRNs have high enrichments for AD, SCZ, major depressive disorder, intelligence, and education (Fig. 4A, Additional file 3: Figure S2). Besides, these brain-cell-type enhancers do not have significant enrichment of GWAS SNPs for non-brain diseases. Therefore, the heritability enrichment analysis suggests that the enhancers in our cell-type GRNs have potential pleiotropic roles associated with multiple brain diseases or traits. Finally, we also compared our cell-type networks with public GRN databases such as TRRUST [59], Dorothea [60], and RegNetwork [61]. We found that the overlaps are not significant (hypergeometric test p-value > 0.999, Additional file 3: Table S2). In fact, those public GRNs were primarily inferred by integrating different studies from the literature (e.g., via physical interactions, co-expressed genes at the bulk tissue level) and thus may not be specific for the neuronal and glial cell types in the human brain. However, the overlapped network edges suggest the potential associations of those public GRNs with the human brain's cell types. For example, we found that 319 TRRUST edges, 5935 Dorothea edges, and 106 RegNetwork edges overlap with at least one of our cell-type GRNs.

Identifying cell-type disease genes in AD and SCZ for neuronal and glial cell types
We used these cell-type GRNs to link GWAS SNPs with disease risk genes for each cell type, advancing knowledge on cross-disease and disease-specific interplays among genetic, transcriptional, and epigenetic risks at cellular resolution. In particular, we chose SCZ and AD, two majorly represented neuropsychiatric and neurodegenerative diseases with potential convergent underlying mechanisms [4], and we linked a number of cell-type disease genes ("Methods" section, Additional file 4) and performed their enrichments (Additional file 5). We found that many disease genes present in one or a few cell types only, suggesting potential cell-type-specific contributions to AD and SCZ (Additional file 3: Figure  S3). As shown in Fig. 4B, the cell-type AD genes are also significantly enriched with known disease genes for AD and other dementia diseases such as brain atrophy and cerebral amyloid angiopathy (p < 0.01). For example, the microglia AD genes, including APP, CLU, BACE2, and BIN1, are specifically enriched for other diseases such as neurofibrillary degeneration, Parkinson's dementia, and aging memory impairment. Also, the cell-type SCZ genes are enriched for various disorders such as neurodevelopmental, mental, and mood disorders and depression (p < 0.01, Fig. 4C). Furthermore, many cell-type disease genes have corresponding expression activities in the disease samples. For example, 72 excitatory AD genes (86%) and 65 inhibitory AD genes (80%) (plus 22 oligodendrocyte AD genes and three microglia AD genes) are significantly differentially expressed in the corresponding cell types in AD individuals, respectively (p < 0.05) [9].
The functional enrichment analyses for our cell-type disease genes also uncover known genomic functions and pathways at the cell-type level. For instance, the microglia AD genes are enriched with amyloid beta formation and clearance [62], MAPK signaling [63], and neuron death [64] (Fig. 4D), and the oligodendrocyte AD genes are enriched with Tau protein binding [65]. This is vital to understanding a multitude of diseases that commonly demonstrate atrophy of cortical tissue as a hallmark feature. In SCZ genes, we also found that multiple key hallmark pathways were enriched, such as dopaminergic synapse [66], trans-synaptic signaling [67], and synapse organization [68] for excitatory SCZ genes (Fig. 4E). For inhibitory SCZ genes, we observed that MAPK family signaling [69], regulation of NMDA receptor activity [70], dopaminergic synapses [66], and neurotransmission [70] are enriched.

Comparative analyses reveal the interplays between genomic functions, pathways, cell types, and diseases
In addition to cell-type-specific pathways in these diseases, we also identified those involving multiple cell types in each disease, implying that potential cell-type interactions drive the disease pathology. For example, the enrichment of SCZ primarily includes changes in synapse structures and cell shaping and differentiation (Fig. 5A). Clinically, this is consistent with the consensus that SCZ is strictly neuropsychiatric as opposed to degenerative. In particular, cell morphogenesis and regulation of neuron differentiation are enriched in all four major cell type SCZ genes (p < 0.01). Early life neurodevelopmental genetic markers may suggest causal links with alterations in hippocampal cell differentiation points on the front of cell morphogenesis, leading to cascades of downstream effects [71]. This has primarily been studied and modeled within the scope of iPSCbased analyses, which make correlations and connections to the clinical presentation more difficult due to the additional abstraction from the standard pathologybased analysis. Also, the BDNF signaling pathway that potentially relates to intercellular communications is enriched as well in multiple cell types (p < 0.01) [72]. Finally, we also observed that protein-protein interactions (PPIs) are enriched among the disease genes of the cell types at a higher level. As shown in Fig. 5B, the SCZ genes for dopaminergic synapse, NMDA receptors, glutamate binding, and activation are shared by multiple cell types and have strong PPIs, implying protein-level cross-type coordination [73]. In AD, multiple pathways were significantly enriched across various cell types (Fig.  5C). For instance, the catabolic process for the AD key player, amyloid precursor protein (APP), is enriched with both glial and neuronal types (p < 0.01) [74].
Furthermore, we found that cross-disease conserved functions/pathways are involved in one or multiple cell types, revealing potential novel functional interplays across cell types and diseases (Fig. 5D). For example, the cell morphogenesis involved in differentiation is enriched in both AD and SCZ neuronal genes. Another example is that the vesicle-mediated transport is enriched for both AD and SCZ microglia genes. In total, we found 11 microglia genes shared by AD and SCZ. In particular, the VEGF signaling pathway is enriched in the SCZ microglia genes. In the general theme of AD pathology, increased VEGF expression has been linked to worse cognitive outcomes in postmortem analysis [75]. Similarly, multiple meta-analyses have revealed differential expression levels between healthy controls and SCZ patients [76]. However, little has been done to link potential gene function to cell-type-level interactions and pathways. Here, these SCZ-AD shared microglia genes may help explain shared higher-level dysfunction between both diseases as evidenced by higher expression. Also, we found that some functions involve different cell types across diseases. The dendrite development has been found in both SCZ and AD pathology [77,78]. We found that it is mainly enriched with microglia in AD but neuronal types in SCZ.
More interestingly, when exploring the interactions between cell types that change between diseases, the disease-specific pathologies enter to explain the cause of discrepancies. In particular, for AD, it is shown that phagocytic microglia are activated during the early stages of synaptic decline, leading to eventual neuroinflammation and programmed cell death [79]. For SCZ, the oligodendrocyte enrichment reveals similar intercellular mechanisms between excitatory and inhibitory neurons, specifically those regulating neuron differentiation (Fig.  5A), providing potential direction for future exploration and validation of the communication role of oligodendrocytes [80].

Prediction of clinical phenotypes from cell-type disease genes
Finally, we want to investigate the clinical applications of our cell-type disease genes. To this end, we looked at the population-level gene expression data for AD in the ROSMAP cohort [51]. In particular, we first found that many cell-type AD genes have significantly associated expression levels with clinical phenotypes across individuals in ROSMAP. For example, out of 195 cell-type AD genes, we found that 72 genes are significantly associated with the Braak stages that measure the severity of neurofibrillary tangle (NFT) pathology (ANOVA p < 0.05), 78 genes with the CERAD scores that measure neuritic plaques (ANOVA p < 0.05), 89 genes with the diagnosis of Fig. 5 Cross-cell-type conserved and cell-type-specific functions, pathways, and protein-protein interactions in schizophrenia and Alzheimer's disease. Darkness in heatmaps is proportional to −log10(p) of the enrichment. a The enrichments of select conserved and specific functions/pathways (e.g., GO, KEGG, REACTOME) across cell-type disease genes in schizophrenia. b The enrichments of protein-protein interactions among major cell-type disease genes in schizophrenia: excitatory neuron (broad, red), inhibitory neuron (broad, blue), microglia (green), and oligodendrocyte (purple). c The enrichments of select conserved and specific functions/pathways across cell-type disease genes in Alzheimer's disease. d The enrichments of select conserved and specific functions/pathways across cell-type disease genes between schizophrenia and Alzheimer's disease cognitive status (DCFDX) (ANOVA p < 0.05), 73 genes with the cognitive status at the time of death (COGDX) (ANOVA p < 0.05), and 92 genes with the Mini-Mental State Examination (MMSE) scores (Pearson correlation p < 0.05). In total, 135 cell-type AD genes were significantly associated with at least one clinical phenotype in ROSMAP (Additional file 6).
In addition to statistically significant associations between our cell-type disease genes and clinical phenotypes, we also applied the machine learning approach to predict clinical phenotypes from these cell-type disease genes using ROSMAP data ("Methods" section). Specifically, for each clinical phenotype, we used the logistic regression model to classify individual states of the phenotype (as classes) from their expression data of 53 AD-SCZ shared cell-type genes. We performed the cross-validation (K = 5) for the individual samples with 80% training and 20% testing sets. As shown in Fig. 6, our average accuracy values for classifying all four major clinical phenotypes: Braak stages, CERAD scores, DCFDX status, and COGDX status, are much higher than the baselines (50%), the random select genes, and AD genes from the latest GWAS study [6]. This suggests that using cell-type disease genes shared by AD and SCZ has improved predicting those clinical phenotypes, especially for cognitive-related ones in AD.

Discussion
This paper focuses on the scGRNom's application to single-cell data for AD and SCZ. However, scGRNom is general-purpose for understanding functional genomics and gene regulation in other diseases. Besides the cell types, the pipeline also predicts the gene regulatory networks for cell clusters (unknown cell types) and bulk tissue types. Furthermore, recent eQTL studies have identified various SNPs associating with gene expression in multiple brain tissue types using the population data such as GTEx [81] and PsychENCODE [10]. Although those brain eQTLs suggest the SNP-gene association at the bulk tissue level, we still found that several eQTLs in PsychENCODE match our linked GWAS SNPs and celltype disease genes, e.g., SNPs chr2:127846321 for and chr20:43598154 for STK4, two microglia disease genes for AD and SCZ, respectively. This suggests potential cell-type effects of these human brain eQTLs. Thus, increasing single-cell data at the population level allows us to predict the cell-type eQTLs [82], which will likely help understand cell-type gene regulation and refine linking disease genes at the cell-type level.
For linking disease genes, we primarily used the interrupted TFBSs by GWAS SNPs. Future studies utilizing scGRNom would be able to take advantage of the evergrowing number of GWAS for a wide variety of diseases as well as single-cell data. However, additional information can also help link GWAS SNPs to disease genes. For example, existing tools such as FUMA [83] have linked GWAS loci to genes by integrating information from multiple resources, providing more functional linkages from genotype to genes to phenotype in human diseases. Thus, incorporating multiple GWAS data (e.g., Fig. 6 Prediction accuracy of AD clinical phenotypes from disease genes. The AD population data for prediction was from the ROSMAP cohort [51]. AD clinical phenotypes include Braak-stages that measure the severity of neurofibrillary tangle (NFT) pathology; Cerad-scores that measure neuritic plaques; Cogdx-cognitive status at the time of death; and Dcfdx-the diagnosis of cognitive status. The bar height represents the average accuracy of cross-validation (K = 5) from the prediction using logistic regression ("Methods" section). Red: scGRNom's cell-type disease genes shared by AD and SCZ (SCZ-AD genes). Green: AD genes from GWAS [6]. Blue: randomly select genes (same number as SCZ-AD genes) various brain regions) exposes key areas of observed phenotypes. Previous studies have demonstrated the caution that must be exercised when attempting to correlate GWAS data with clinical phenotypes, and methods such as our analysis mitigate these effects [20]. A similar methodology as outlined could be used where common loci within each set of summary statistics are incorporated and established before integration into the cell-type GRNs, thus linking neuronal spatial information with known mutation sites in patients along with potentially cell-type-specific functionality. Expanding past the cell types examined here into additional multicellular analyses is also possible given expanded interactome data. Notably, this would allow for further investigation of complex neuropsychological diseases and cases where the line between different clinical classifications becomes blurred and leads to additional complications about clinically relevant genetic therapies. One such example includes autism spectrum disorder (ASD), where clinical presentations can vary in multiple axes of severity, creating a broad spectrum of phenotypes. In such cases, potentially linking specific symptoms or aspects of a particular subset of ASD to particular brain regions and cell types allows for a better-informed picture of functional consequences associated with genetic mutation sites. Such connections could aid in determining genetic risk factors associated with variations in edge case patients; they also create the opportunity to take advantage of induced pluripotent stem cell (iPSC) technology using genetic engineering technologies to create point mutations matching computationally identified genes. Moreover, because the singlecell multi-omics data we used for predicting cell-type GRNs are not specific for particular diseases (AD or SCZ), we used the SNPs disrupting all possible TFs on the enhancers and promoters from our cell-type GRNs to link at large cell-type disease genes. However, the scGRNom pipeline is general-purpose and able to work for incoming disease-specific single-cell multi-omics and link to cell-type disease genes via interrupted regulatory TFs in the diseases.
Machine learning has also been widely used to analyze multi-omics, such as multiview learning and deep learning [10,84]. Multiview learning has great potential for understanding functional multi-omics and revealing nonlinear interactions across omics. Therefore, integrating such emerging machine learning approaches will enable identifying different cross-omic patterns, especially for increasing single-cell multi-omics data and providing more comprehensive mechanistic insights in cell-type gene regulation and linking to disease genes. For example, this means adding more omics such as methylation data that reflect epigenetic changes that may occur due to wide variations of inherited and environmental factors [85]. At a deeper functional level, variations in methylation have been attributed to alterations in splicing activity, ultimately impacting the regulation and expression of key genes [86]. Additionally, integrating proteomic data at a single-cell level enhances the broader picture formed through additional data sources even further [87]. Lastly, expanding past simple methylation and proteomics allows for the ability to include all forms of data incorporated through the single-cell cytometry [88].

Conclusions
We developed a computational pipeline, scGRNom, to integrate multi-omics data and predict gene regulatory networks (GRNs), which link TFs, non-coding regulatory elements (e.g., enhancers), and target genes. With applications to the data from single-cell multi-omics of the human brain, we predicted cell-type GRNs for both neuronal (e.g., excitatory, inhibitory) and glial cell types (e.g., microglia, oligodendrocyte). Further, scGRNom can input cell-type GRNs and disease risk variants to link disease genes at the cell-type level, such as brain diseases like AD and SCZ. These disease genes revealed conserved and specific genomic functions across neuropsychiatric and neurodegenerative diseases, providing cross-disease regulatory mechanistic insights at the cellular resolution. Although this paper focuses on AD and SCZ, scGRNom is a general-purpose tool for understanding functional genomics and gene regulation in other diseases, at either bulk tissue or cell-type levels. Finally, scGRNom is open-source available at https:// github.com/daifengwanglab/scGRNom [26].