Study design
In summary, this study describes a scalable framework for inferring UR genes on dynamic cellulome- and genome-wide scales (Fig. 1). This framework is based on time series scRNA-seq analyses of allergen/diluent stimulated peripheral blood mononuclear cells (PBMC) from patients with SAR and healthy controls. We hypothesized that UR genes would be found at early time points. In order to identify URs, the scRNA-seq data were organized into directed MNMs of the different time points. We reasoned that the predicted directed molecular interactions in the early MNMs could be traced to a UR gene. However, no linear hierarchies could be identified in the MNMs, even before stimulation. Instead, our analyses of single-cell and bulk profiling data from allergic and other inflammatory diseases supported a quantitative approach to prioritizing UR, which was based on their predicted effects on downstream target cells.
Participants
We included sixteen patients with SAR and fourteen matched healthy controls, from whom samples were obtained outside of the pollen season when both patients and healthy controls were asymptomatic. Of these, samples from fourteen controls and eight SAR patients were collected in Linköping, Sweden, and the remainder in Vienna, Austria. The samples from Sweden were used for isolating and in vitro culture of PBMC for scRNA-seq. Also, sera and supernatants were used to determine cytokine protein expression levels. The eight remaining samples, which were collected in Vienna, were used for blocking validation studies (Additional file 1). The median (range) ages of the patients were 29.5 (23~59) and 29.5 (21~39) for the healthy controls (Additional file 1). The Swedish samples from SAR patients were obtained for research purposes at the Department of Clinical Sciences, Linköping University. The patients from Vienna were also collected for research purposes at the Department of Pathophysiology and Allergy Research, Medical University of Vienna. The control samples were obtained from the donated blood at the Blood Donor Central at Linköping University Hospital. The inclusion criteria for SAR patients were a positive history for birch- or grass-pollen, or both pollens, inducing allergic rhinitis for at least 2 years. The sensitivity to birch or grass pollen was confirmed with skin prick tests (ALK Abello, Hørsholm, Denmark) and by an ImmunoCap Rapid Test (Phadia, Thermo Fisher Scientific, Uppsala, Sweden), which tests for birch, grass, and house dust mite sensitivity in all subjects. Exclusion criteria were positive IgE responses to house dust mites as well as any other diseases, including malignancies, diabetes mellitus, infectious diseases, severe deviation of the nasal septum, or nasal polyps. Written, informed consent was obtained from all participants, and the study was approved by the ethics committees of the universities of Linköping and Vienna.
In vitro culture of allergen-challenge of PBMCs
PBMCs were enriched from buffy coats from healthy controls and fresh peripheral blood using Lymphoprep (Axis-shield, UK) with density centrifugation. PBMCs were incubated with birch pollen extract (BPE) (ALK Abello, 10 𝜇g/mL) at a density of 106 cells/mL for different time periods in RPMI 1640 supplemented with 10% fetal bovine serum (FBS) (Gibco, USA). Before obtaining the cells, all sera were collected. The sera were aliquoted, frozen, and stored at −70 °C. All supernatants were collected after 12 h, 1 day, 2 days, 3 days, 5 days, and 7 days, respectively, and aliquoted, frozen and stored at −70 °C.
scRNA-seq wet lab protocol
The scRNA-seq experiments for the samples from SAR patients and controls were performed using the Seq-Well technique [19]. In short, single-cell suspensions were prepared from cultured cells or fresh blood at each time point using standard techniques. Cells were counted and co-loaded with barcoded and functionalized oligo-dT beads (Chemgenes, Wilmington, Massachusetts, USA, cat. no. MACOSKO-2011-10) on microwell arrays synthesized as described with minor changes [19, 20]. For each sample, 20,000 live cells and ~110,000 beads were loaded per array. The cells and beads co-loaded in the microwell array were covered with previously plasma-treated polycarbonate membranes. Next, the array with the cover membrane was placed in a shaker at 37 °C for 45 min. After the membrane was removed, cell lysis and hybridization bead removal, reverse transcription, and whole transcriptome amplification were performed. Libraries were prepared for each sample using the Nextera XT DNA Library Preparation Kit (Illumina, San Diego, USA, cat. no. FC-131-1096) according to the manufacturer’s instructions. Libraries from three samples were pooled together for sequencing using the NextSeq 500/550 system, and sequencing results were analyzed as described below.
scRNA-seq data processing
The single-cell data were processed into digital gene expression matrices following James Nemeshof Harvard Medical School’s McCarrol Lab’s Drop-seq Core Computational Protocol (version 1.0.1, Drop-Seq tools v1.12) [21, 22] using bcl2fastq (v2.19.1) conversion [23] and Picard software (v2.9.0) [24]. The indexed reference for alignment of the reads was generated from GRCh38 (April 2017, Ensembl) [25] using STAR software (v2.5.3) [26]. Only primary alignments towards the reference genome were considered during downstream analyses, according to the mapping quality using STAR software. The quality of cells was assessed by having a minimum of 10,000 reads, 400 transcripts, 200 genes, and less than 20% mitochondrial genes per cell. Outliers were removed based on empirical evaluation of the distribution of transcripts count over the cells, i.e., all cells with > 16,000 transcripts, due to the risk of duplicates in the library resulting in two or more cells sharing a cell barcode [27]. This processing resulted in a total of 9031, 8873, and 2871 cells for the non-stimulated, allergen-stimulated, and uncultured control, respectively, from the healthy individuals, and 5755, 6124, and 2244 cells for the non-stimulated, allergen-stimulated, and uncultured control, respectively, from the allergic individuals. To reduce the noise in the data, k-nearest neighbor (KNN)-smoothing with Python (v3.7.4) [28] was applied using a k of ~0.1% of the total number of cells (i.e., k = 14 and k = 21 for the data from allergic and healthy individuals, respectively) [29]. This proportional cut-off was selected to ensure that between-cell differences are maintained, but also to reduce noise optimally. The percentage 0.1% was selected based on expected cell type’s percentages among PBMC in healthy individuals [30].
Microarray data processing (reference datasets)
To define cell types in the scRNA-seq data as described below, reference microarray data from B cells, CD4+, CD8+, monocytes, natural killer (NK) cells, naïve T cells, PBMC, T helper (Th)1, T helper (Th)17, T helper (Th)2, and T regulatory (Treg) cells were processed and analyzed [20]. Briefly, all microarrays were normalized using LIMMA R-package (version 3.32.10, R version 3.4) [31, 32]. We performed background correction using background Correct function with method “normexp,” followed by between arrays quantile normalization (normalize Between Arrays, method “quantile”). All probes with an expression below 1.2 times the background signal were removed.
Cell type identification
To cluster the cells and define the cell types, reference component analysis (RCA) (v1.0) was performed [33] in R (v3.4) [32], for the healthy and patient group separately. The cells were projected against reference bulk-profiling data in a stepwise manner using different references for deeper subtyping in each step. First, cells were identified as monocytes, dendritic cells, B cells, or T/NK cells. For this step, the reference was constructed based on the HG_U133A/GNF1H (Affymetrix) gene atlas data set [34, 35], including only the cell types of interest, as described in the original paper [33]. In short, all genes with log10 (fold change) expression values greater than or equal to 0.5, relative to the median across all samples, were included. The resulting reference contained 746 genes (Additional file 2). T/NK cells were then further divided into CD4+ T cells, CD8+ T cells, and NK cells. Thereafter, CD4+ T cells were divided into their subtypes: Th1-, Th2-, Th17-, and regulatory T cells. For these steps, the references were constructed based on the microarray data from [20], including only the cell types of interest in each of the references. Here, all genes with log10 (fold change) expression values greater than or equal to 0.3 and 1, relative to the median across all samples, were included, resulting in references containing 1494 and 316 genes for the T/NK and CD4+ T cell subtyping, respectively (Additional file 2). The different log10 (fold change) cut-offs used to construct these references were based on empirical evaluations of the resulting clusters, where the aim was to obtain a clear separation of the cell types with a minimum number of cells of unclear identity.
During each step in the clustering and cell typing procedure, we also saved the Pearson correlation P-value for each cell from the RCA algorithm. With the aim to ensure credible cell types in the data, we removed the cells that did not match any cluster (P-value > 0.05 for all cell types in the reference), as previously described [20]. This filtering process resulted in a loss of ~12% of the cells (2345 cells from the healthy samples and 1839 cells from the allergic samples).
The clustering is presented in Fig. 3A and commented on in Additional file 3: Supplementary note 1.
Global transcriptomic changes analysis
Global transcriptomic changes between time points for each cell type separately were assessed with a Euclidian distance-based method, as described [36].
For each cell type separately, we calculated average expression over all cells of the same type at the same time point, generating a library of representative cells (e.g., a representative cell of monocytes at day 1 was represented as a vector of the same length as the number of expressed genes, where the ith value within that vector was an average expression of the ith gene overall monocytes recovered at day 1). Next, we calculated the pairwise Euclidian distance between representative cells of the cell type between time points (e.g., monocytes at day 1 and monocytes at day 2). This distance was compared to a random distance calculated based on 1000 permutations. In each iteration time point labels were shuffled between cells of the same type (e.g., a monocyte recovered at day 1 might have been assigned as a monocyte at day 3 during a particular permutation), followed by calculation of the random representative cell type at the time point. The distance calculated between two different time points for each cell type was then compared to the null distribution of distances obtained from permutations.
Differentially expressed genes
For single-cell data, DEGs were identified between allergen stimulated and diluent stimulated patients for each time point using Monocle (version 2.6.41) [37, 38] in R (v3.4) [32], as previously described [20]. When setting up the data for analysis, using newCellDataSet() function, a negative binomial distribution was defined (expressionFamily=negbinomial) and the lowest detection limit was set as lowerDetectionLimit= 0.5. Genes detected in at least three cells within a group were included in differential expression analysis using the differentialGeneTest() function. Genes were considered as significantly differentially expressed if the q-value < 0.05. Additionally, DEGs were calculated between allergen-stimulated allergic and healthy individuals, for each time point, and between uncultured control samples from allergic and healthy individuals.
Identification of pathways, URs, and construction of MNMs
Pathways and URs were identified using the ingenuity pathway analysis (IPA) software (2019Q4-2021Q1) from January 19 to May 2021 [39]. Specifically, pathways in DEGs from bulk and single-cell data were identified using Canonical Pathways of Core analysis in IPA. To construct MNM at each time point, we started by identifying disease-associated genes (i.e., DEGs between allergen stimulated cells in patients and healthy controls) using the methods described above. If >5000 DEGs were found between two groups, the top 5000 DEGs (lowest q-values) were used for the IPA analysis, due to limitations of size allowance in IPA. Using those gene lists, MNMs were constructed: The Upstream Analysis of IPA software was queried for prediction of the URs of cell type-specific DEGs for each cell type at each time point separately. If such an UR was found in another cell type, a directed interaction between the two cell types was inferred. This analysis was performed at each time point separately. Here, we focused on URs that were secreted or membrane bound. If such an UR was found, an interaction was assumed between the cell types. For example, PDGFB was a predicted UR of DEGs in all cell types at day 2. PDGFB was identified as a DEG in monocytes. This observation led to the identification of a potential directed interaction from monocytes to all other cell types at day 2 (Fig. 7C). To reduce complexity, we did not incorporate potential autocrine interactions between monocytes.
Potential drug identification
All possible drugs targeting DEGs at different time points in all cell types separately were identified with Molecules of Core analysis in IPA. Top 5000 DEGs were used for the IPA analysis, due to limitations of size allowance in IPA.
Identification and ranking of the most important predicted URs or pathways
First, we analysed the top 5000 DEGs in all cells at all time points by IPA (version 2019Q4-2020Q1) from January 19 to May 2021 [31]. We identified all predicted URs and retrieved the lists of URs of each cell type, which were selected according to the following criteria: P-value <0.05 and |z-score| ≥2. Next, we counted the number of occurrences for each significant UR of each cell type at each time point. Finally, we ranked all significant URs, or pathways, based on the number of occurrences. We repeated the same analyses for enriched pathways of each cell type at each time point.
In vitro validation assays
For blocking experiments, PBMCs (1 × 106 cells) from eight SAR patients were stimulated for 5 days with 10 μg/mL BPE in the absence or presence of either a neutralizing anti-human IL-4 antibody (MAB204) or anti-human PDGF-BB antibody (AF-220-NA, both from R&D Systems, Minneapolis, USA) at a concentration of 5 μg/mL. Cell culture supernatants were collected at day 5 and the levels of IL-6, IL-13, and VEGF were measured by Luminex technology on a Bio-Plex 200, according to the manufacturer’s specifications (Bio-Rad Laboratories, California, USA). After collection of cell culture supernatants, PBMCs (2 × 105 cells) were transferred to a 96-well plate and after another 16 h, T-cell proliferation was evaluated using incorporation of tritiated thymidine. Stimulation indices (SI) were calculated by dividing counts per minute (cpm) measured in stimulated cultures by cpm in cultures without stimuli.
Validation experiment using human magnetic multiplex beads assay
The detection sensitivity for the protein and cytokine assays were CCL5 1.8 pg/mL; IL-4, 9.3 pg/mL; GM-CSF (CSF2): 4.1 pg/mL; IFN-α: 0.26 pg/mL; IFN-γ: 0.4 pg/mL; IL1RN, 18.0 pg/mL; IL-5, 0.5 pg/mL; IL-13, 36.6 pg/mL; PDGF-BB, 0.2 pg/mL. We used a five-parameter logistic curve to get the standard curve. Any sample that fell outside the recovery range (70~130%) in the curve area was considered inaccurate. If protein values lay outside of the detection limit for calculations, we assumed they were either the highest detection limit or a value of zero. We then performed a double-sided Wilcoxon rank sum test, by IBM SPSS statistics version 26, to compare protein concentrations in serum and supernatant between patients and healthy controls.
Other datasets information
In this study, we used two bulk profiling datasets and two scRNA-seq datasets, which included inflammatory diseases, namely atopic dermatitis (AD), ulcerative colitis (UC), and Crohn’s disease (CD). The bulk microarrays were used to analyze the gene expression in endoscopic-derived intestinal mucosal biopsies from patients with inflammatory bowel diseases (UC and CD) and healthy controls, and intestinal mucosal biopsies included both lesional and non-lesional gut mucosa in GSE75214 [40] (in total 97 UC patients, eight CD patients, and 11 controls). GSE32924 [41] included paired samples of both lesional and non-lesional skin from 12 patients with AD compared with normal human skin from eight healthy controls [41]. Using GEO2R [42] with default settings, we identified DEGs between lesional and healthy control samples, as well as between non-lesional and healthy controls, in the gut or skin, from patients with UC, CD, or AD. The data were annotated using the National Center for Biotechnology Information (NCBI) generated platform [43]. They were then adjusted for multiple testing using the Benjamini-Hochberg procedure [44]. The data were then sorted to only include significant DEGs (FDR < 0.05) for downstream analysis, which included GWAS gene enrichments. Genome-Wide Association Studies (GWAS) genes were obtained from DisGeNET [45] (data downloaded on Feb 9, 2021). In total, 416 GWAS genes for CD, 359 genes for UC, and 126 genes for AD were retrieved. Enrichment analyses of GWAS genes respective DEGs were performed using Fisher’s exact test. All genes from respective datasets were used as background for the enrichment analysis.
URs were identified using The Upstream Analysis of IPA. Significant URs were selected based on |z-score|≥2 and overlap P-value < 0.05. Similarly, all drugs with possible effects on DEGs were identified using Molecules of Core analysis in IPA as described above. To view how DEGs were predicted to interact with each other in AD (GSE32924 [41]), we created a gene network using Networks of Core analysis in IPA. First, Molecules tool was used to identify all possible networks representing interactions between single UR and a single-downstream effect (function/disease) and the dataset genes involved in both. The top five networks with the highest consistency score were merged and visualized as a graph.
The scRNA-seq data for UC included 12 colon biopsies from five patients including four non-inflamed and four inflamed biopsies as well as four healthy individuals [46]. The DEGs between biopsies from non-inflamed tissues from patients and healthy controls, as well as between inflamed biopsies from patients and healthy controls were downloaded from [46] and used for further downstream analyses. The scRNA-seq data for CD (E-MTAB-8901) included ileum biopsies from seven patients and eight healthy individuals, where five samples were from inflamed tissue, two samples from non-inflamed tissue, and eight samples from healthy control tissue [47]. Using the processed data, with cell types as identified in [47], we identified DEGs between non-inflamed and healthy ileums, as well as between inflamed and healthy ileums, using Monocle as described under the “Differentially expressed genes” section above.