imply: improving cell-type deconvolution accuracy using personalized reference profiles
Genome Medicine volume 16, Article number: 65 (2024)
Abstract
Using computational tools, bulk transcriptomics can be deconvoluted to estimate the abundance of constituent cell types. However, existing deconvolution methods are conditioned on the assumption that the whole study population is served by a single reference panel, ignoring person-to-person heterogeneity. Here, we present imply, a novel algorithm to deconvolute cell type proportions using personalized reference panels. Simulation studies demonstrate reduced bias compared with existing methods. Real data analyses on longitudinal consortia show that disparities in cell type proportions are associated with several disease phenotypes in Type 1 diabetes and Parkinson’s disease. imply is available through the R/Bioconductor package ISLET at https://bioconductor.org/packages/ISLET/.
Background
Tissues are complex samples composed of different cell types, and real bulk transcriptomic data are often weighted sums of multiple signals over several different cell types [19]. In large-scale and population-level clinical studies, like the Parkinson’s Disease Biomarkers Program (PDBP) and The Cancer Genome Atlas (TCGA), transcriptomic samples are often collected from complex tissues. For admixed tissue samples, differentially expressed transcriptional profiles from different phenotypical groups can be caused by either cell-type composition disparities or underlying cell-type-specific (CTS) gene expression heterogeneity. Studies have shown that cell type proportions are confounded with other phenotypical covariates, like age, sex, or clinical outcomes, in bulk transcriptomic data analysis [5, 6]. As a result, ignoring cell-type compositions in gene expression analysis would inflate the false positive rate when identifying relevant genetic features. Accurate cell type proportion deconvolution is thus vital, especially for cell types with low abundance and weak biological signals, where the real biological differences could be overshadowed by technical noise [5, 30, 39].
Recently, several statistical and deep learning methods have been proposed to deconvolute cell type abundance from bulk transcriptome data. These methods utilize linear least squares regression [12, 49, 55], quadratic programming [25], support vector regression [11, 43], non-negative matrix factorization [21, 45], and deep neural networks (DNNs) [9, 38]. They share the same goal of quantifying the unknown abundances of various cell types and can be broadly grouped into two categories: Reference-Based (RB) and Reference-Free (RF). RB deconvolution relies on a cell-type-specific (CTS) gene expression signature matrix (reference panel) composed of pre-selected features known to differentiate cell types, while RF deconvolution estimates cell type proportions in the absence of a reference panel. Naturally, the accuracy of cell type abundance inference depends on the quality of the signature matrix, and a more accurate reference panel improves cell type abundance estimation [6]. RF deconvolution, in contrast, offers flexibility where reference panels are hard to obtain.
Currently, all RB deconvolution methods require a reference panel as the input, shared across all subjects. For example, CIBERSORT [43], an RB deconvolution method [5, 6], provides a verified signature panel, LM22. It is useful for leukocyte deconvolution and includes 547 marker genes that distinguish 22 hematopoietic cell types. xCell [4] combines gene set enrichment with deconvolution techniques and introduces curated gene signatures representing 64 distinct cell types, including a wide range of both adaptive and innate immune cells. However, using a single reference panel across the whole population is a very strong assumption. It ignores person-to-person heterogeneity in CTS gene expression and deviates from the biological fact that the gene expression profile can vary, even for one purified cell type, depending on environmental influences, age, sex, the subject’s health status, and treatment paradigms [1, 8, 15, 18, 23, 27, 28, 40, 48]. Mismatched reference signatures can impair deconvolution accuracy [22, 47]. The problem is further exacerbated for longitudinally observed and repeatedly measured data, where intra-subject samples share information and inter-subject heterogeneities are relatively strong. Recent research shows that models incorporating personalized effects can accurately retrieve cell type reference panels on an individual basis [16]. However, to date, no method is available that takes advantage of personalized reference panels to precisely deconvolute cell type proportions, especially when longitudinal samples are available.
Here we develop a new deconvolution algorithm, imply (improving cell-type deconvolution using personalized reference), as depicted in Fig. 1. imply utilizes personalized reference panels to precisely deconvolute cell type proportions using longitudinal or repeatedly measured data. It borrows information across the repeatedly measured transcriptome samples within each subject to recover personalized reference panels. The personalized references are then used to improve cell type deconvolution. The method consists of three stages. In the first stage, using a reference panel shared across the population, we deconvolute the bulk transcriptomic data and estimate initial cell type proportions. The first stage is based on support vector regression (SVR), as it has been shown to be a leading framework for conventional deconvolution problems [43]. In the second stage, we use a mixed-effect modeling framework to retrieve personalized reference panels based on subjects’ phenotypical information, observed bulk transcriptomic data, and the initial cell type proportions from the first stage. In the third and final stage, we use the recovered personalized reference panels, together with the repeated measurements of bulk transcriptomic data for each subject, to estimate cell type proportions. The rationale for this three-stage approach is straightforward: the personalized reference panel is more accurate than the population-level signature, and a more accurate reference panel in turn leads to a more precise deconvolution.
We conducted extensive in silico simulations and real data analyses to test the performance of imply. The simulation results showed significantly increased accuracy in cell type proportion estimation compared with existing approaches. imply reduced bias in deconvolution and increased the correlation between the estimated and the ground-truth cell type abundance. Real data analyses on two large longitudinal consortia, The Environmental Determinants of Diabetes in the Young (TEDDY study) [29] and the Parkinson’s Disease Biomarkers Program (PDBP study), showed more realistic deconvolution results that align with low-throughput experiments. The results suggested that disparities in the proportions of certain cell types are associated with several disease phenotypes in Type 1 diabetes and Parkinson’s disease. imply has been implemented and integrated into the Bioconductor package ISLET [17] and is available at https://bioconductor.org/packages/ISLET/.
Methods
Overview of imply
The primary objective of imply is to improve the accuracy of cell abundance estimation through the use of a “subject- and cell-type-specific reference panel”, which is a personalized CTS reference panel unique to each study participant. The algorithm is structured into three stages. In Stage I, the initial cell type compositions are estimated using a population-level CTS reference panel. This first stage closely resembles existing deconvolution frameworks, but provides a valid initial estimate for downstream stages. The core component of imply lies in Stage II, where we optimize the usage of repeatedly measured samples within each subject. Multi-measured samples within each subject are assumed to share the same reference panel but to vary in cell type composition. Here, mixed-effect modeling is naturally adopted to capture the group-level average (fixed effect) and subject-level deviations (random effect). The output from this stage, for each subject, is a personalized reference panel. In Stage III, the personalized deconvolution is conducted by adopting the personalized reference panel from Stage II for each subject.
Model notations
We use g to index the features (e.g., genes), where \(g=1,2,3,\dots , G\), and n to index the study subjects, where \(n=1,2,3,\dots , N\). For each subject n, the repeated or longitudinal samples are indexed by i, where \(i=1,2,\dots , t_n\). For each subject n, the observed bulk transcriptome dataset can be represented by \(\varvec{y_n}\) below, which is of dimension \(G \times t_n\):
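Written out from the index definitions above, with the row and column layout inferred from the stated \(G \times t_n\) dimension:

\[
\varvec{y_n} =
\begin{pmatrix}
y_{1n1} & y_{1n2} & \cdots & y_{1nt_n} \\
y_{2n1} & y_{2n2} & \cdots & y_{2nt_n} \\
\vdots & \vdots & \ddots & \vdots \\
y_{Gn1} & y_{Gn2} & \cdots & y_{Gnt_n}
\end{pmatrix}
\]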
Here, each element \(y_{gni}\) in \(\varvec{y_n}\) has three indexes: g to index the feature, n to index the subject, and i to index the sample. The whole bulk transcriptome dataset from all subjects, which can be represented as \(\varvec{Y} = (\varvec{y_1}, \varvec{y_2}, \cdots , \varvec{y_N})\), is of dimension \(G \times T\). Here, T is the total number of samples across N subjects, and thus \(T= \sum _{n=1}^{N}t_n\). We use k to index cell types, where \(k=1,2,\dots , K\).
Stage I: Initial cell type proportion estimation
Initially, similar to existing deconvolution methods, we obtain a single reference panel \(\varvec{E}\) of dimension \(J \times K\) (\(J < G\)) for the study population, where J indicates the total number of usable and discriminative signature genes for deconvolution. Previous studies have demonstrated the feasibility of constructing a reference panel from pure cell line data or annotated single-cell RNA-seq (scRNA-seq) data [44, 49], and thus \(\varvec{E}\) can be treated as known. With the observed bulk data \(\varvec{y_n}\) and the initial reference panel \(\varvec{E}\), as illustrated in the top-left of Fig. 1, the first-round reference-based ‘coarse’ deconvolution is conducted using a \(\nu\)-Support Vector Regression algorithm (\(\nu\)-SVR) based on a linearity assumption. This strategy has proven successful in leading deconvolution algorithms. To be specific, this stage requires both the signature matrix \(\varvec{E}\) and the feature-overlapped RNA-sequencing data \(\varvec{y_n}\), comprising only the features that overlap with the marker genes in the signature matrix. For each subject n and sample i, the deconvolution is thus a regression problem: \(\varvec{y}_{\cdot ni} = \varvec{E} \varvec{\theta }_{E,ni\cdot } + \varvec{b}\), where \(\varvec{b}\in R^J\) is the error term. Our initial deconvolution parameter of interest is \(\varvec{\theta }_{E,ni\cdot }\), which can be estimated by minimizing the following objective function:
The solved \(\hat{\varvec{\theta }}_{E,ni\cdot }=\left(\hat{\theta }_{E,ni1}, \hat{\theta }_{E,ni2},\dots , \hat{\theta }_{E,niK}\right)'\), for each sample, is a cell type abundance vector of dimension \(K \times 1\). The constraints of the objective function and the parameters \(\epsilon\), C, \(\xi _j\), and \(\xi _j^*\) are detailed in the Additional file 1: Method Details. Negative estimates in \(\hat{\varvec{\theta }}_{E,ni\cdot }\) are set to 0, and the remaining coefficients are normalized to sum to one, which is the general practice in proportion deconvolution. Repeating this process for all T samples across all subjects, we obtain the deconvoluted cell compositions. It is worth noting that although this deconvolution stage differs little from existing methods, it provides a valid initial estimate for downstream steps.
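In the standard \(\nu\)-SVR primal form, transcribed into the notation of this section, the objective can be sketched as below. This is the textbook formulation rather than a verbatim copy of the authors' expression; the authoritative version, with its constraints, is in the Additional file 1: Method Details.

\[
\min _{\varvec{\theta }_{E,ni\cdot },\, \epsilon ,\, \xi ,\, \xi ^*} \; \frac{1}{2}\left\Vert \varvec{\theta }_{E,ni\cdot }\right\Vert ^2 + C\left( \nu \epsilon + \frac{1}{J}\sum _{j=1}^{J}\left( \xi _j + \xi _j^*\right) \right)
\]

Here \(\xi _j\) and \(\xi _j^*\) are the slack variables for signature gene j, \(\epsilon\) is the width of the insensitive tube, and C is the cost parameter.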
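The post-processing step described above (zeroing negative coefficients, then rescaling to unit sum) can be sketched in a few lines. This is a minimal illustration with NumPy; the function name `normalize_proportions` and the toy coefficients are hypothetical and not part of the ISLET package.

```python
import numpy as np

def normalize_proportions(theta_raw):
    """Clip negative coefficients to zero and rescale each sample to sum to one.

    theta_raw: array of shape (K,) or (T, K) holding raw regression
    coefficients, one row per sample and one column per cell type.
    """
    theta = np.clip(np.asarray(theta_raw, dtype=float), 0.0, None)
    total = theta.sum(axis=-1, keepdims=True)
    if np.any(total == 0):
        raise ValueError("all coefficients are non-positive for at least one sample")
    return theta / total

# Hypothetical raw coefficients for two samples and four cell types.
raw = np.array([[0.6, -0.1, 0.3, 0.3],
                [0.2, 0.2, -0.4, 0.1]])
props = normalize_proportions(raw)
```

Each row of `props` is non-negative and sums to one, so it can be read directly as a vector of cell type proportions.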
Stage II: Personalized reference panel recovery
In this stage, the inputs are the original bulk transcriptome data \(\varvec{y_n}\) and the solved cell type compositions \(\hat{\varvec{\theta }}_{E,ni\cdot }\), for all samples from subject n. The goal is to solve for “subject- and cell-type-specific” reference panels. The key is to optimize the usage of repeatedly measured samples within each subject. In this stage, we assume that the multi-measured samples within each subject share the same CTS reference panel. In other words, the transcriptome variations in observed bulk samples, within each subject, are primarily caused by cell type composition discrepancies. This is a moderate assumption, considering the compositional nature of multiple samples from the same tissue (for example, samples from multiple regions per brain). Here, mixed-effect regression is a natural choice to capture the group-wise transcriptome average (fixed effect) and subject-level deviations (random effect) from the group average. Such modeling also allows for the consideration of additional covariates (Additional file 1: Method Details). Using the original bulk transcriptome data \(\varvec{y_n}\) and the cell-type-specific and sample-specific compositions \(\hat{\theta }_{E,nik}\) from Stage I, the following linear mixed-effect regression can be formulated for each gene g. Here, we drop the gene index g to simplify notation, but note that this framework can be applied in parallel to all genes of interest to solve for “subject- and cell-type-specific” references.
Here, the known independent variables are a group label \(z_n\) for each subject n, and the estimated cell type compositions \(\hat{\theta }_{E,nik}\) from Stage I for all samples. The coefficients of interest include the group-level fixed effects \(m_{k}\) and \(\beta _k\), and the subject-level random effect \(u_{nk}\). The interpretation is straightforward: \(m_{k}\) is the average gene expression level in cell type k for the control group (\(z_n=0\)), and \(m_{k}+\beta _k\) is the average gene expression in cell type k for the case group (\(z_n=1\)). Thus, \(\beta _k\) is the differential expression between the two groups for cell type k. Most importantly, the random effect \(u_{nk}\) represents a subject-specific deviation from the group-wise mean expression in cell type k. Note that this modeling example reflects the most basic scenario, where study subjects originate from two groups (for example, cancer versus normal) and a binary scalar \(z_n\) indicates the group label. This modeling can be extended naturally to incorporate additional covariates, either at the cell-type level or the subject level. Modeling details and the design matrix setup are specified in the Additional file 1: Method Details. \(\hat{m}_k\), \(\hat{\beta }_k\) and \(\hat{u}_{nk}\) are obtained by a penalized least squares algorithm with restricted maximum likelihood. The subject- and cell-type-specific reference panel is obtained by addition (fixed effect + random effect), with respect to each corresponding condition, cell type, and subject. To be specific, for subject n and cell type k, its purified reference expression is \(r_{nk} = \hat{m}_{k}+z_n\hat{\beta }_{k} + \hat{u}_{nk}\). After repeating the same model for all G genes and adding the gene index g back to \(r_{gnk}\), the personalized reference panel for each subject n can be represented by a matrix \(\varvec{R}_n\) of dimension \(G\times K\):
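Based on the coefficient definitions in this section, the per-gene model can be written as follows. The Gaussian forms of the random effect and residual are the usual linear mixed-model convention and are an assumption here; the variance symbols \(\sigma _k^2\) and \(\sigma _0^2\) are introduced only for illustration.

\[
y_{ni} = \sum _{k=1}^{K}\left( m_k + z_n\beta _k + u_{nk}\right) \hat{\theta }_{E,nik} + \epsilon _{ni}, \qquad u_{nk}\sim N\left( 0,\sigma _k^2\right) , \quad \epsilon _{ni}\sim N\left( 0,\sigma _0^2\right)
\]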
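Following the definition of \(r_{gnk}\) above, with genes in rows and cell types in columns as implied by the \(G\times K\) dimension:

\[
\varvec{R}_n =
\begin{pmatrix}
r_{1n1} & r_{1n2} & \cdots & r_{1nK} \\
r_{2n1} & r_{2n2} & \cdots & r_{2nK} \\
\vdots & \vdots & \ddots & \vdots \\
r_{Gn1} & r_{Gn2} & \cdots & r_{GnK}
\end{pmatrix}
\]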
Stage III: Personalized deconvolution
With the personalized reference panel \(\varvec{R}_n\) available for each subject n, and the original bulk mixture transcriptome data, as shown in the lower-right corner of Fig. 1, we use non-negative least squares to deconvolute the cell type abundance \(\varvec{\Theta }_{I,n}\). Here, we solve for \(\varvec{\Theta }_{I,n}\) by optimizing the following objective function, for each subject n, under the constraint \(\varvec{\Theta }_{I,n} \ge 0\):
\(\varvec{\Theta }_{I,n}\) is of dimension \(K\times t_n\) for each subject n. This is a joint optimization across all the samples of a subject simultaneously, rather than a sample-wise optimization, using the subject-specific signature matrix \(\varvec{R}_n\) and quadratic programming. The subscript I stands for the imply-estimated cell type abundance, in contrast to the coarse deconvolution abundance \(\varvec{\theta }_{E}\) from Stage I. Note that \(\nu\)-SVR with sample-wise optimization can also be used in this stage as an alternative approach; we name this variant implys. Overall, instead of using the population-level signature matrix \(\varvec{E}\), the key of imply is to adopt a personalized \(\varvec{R}_n\) for the cell type abundance inference.
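A least-squares objective consistent with this description is shown below. The Frobenius-norm form is an assumption inferred from the use of non-negative least squares over all \(t_n\) samples of a subject.

\[
\hat{\varvec{\Theta }}_{I,n} = \mathop {\textrm{arg}\,\textrm{min}}\limits _{\varvec{\Theta }_{I,n}\ge 0}\; \left\Vert \varvec{y_n} - \varvec{R}_n\varvec{\Theta }_{I,n}\right\Vert _F^2
\]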
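As an illustration of Stage III, the sketch below solves a non-negative least squares problem for every sample of one subject with `scipy.optimize.nnls`. It is not the ISLET implementation: the function name `personalized_deconvolution`, the toy data, and the final rescaling to unit sum are all assumptions. With only a non-negativity constraint, a Frobenius-norm objective separates across columns, so solving column by column gives the same minimizer as a joint solve.

```python
import numpy as np
from scipy.optimize import nnls

def personalized_deconvolution(y_n, R_n):
    """Estimate cell type proportions for one subject.

    y_n: (G, t_n) bulk expression, one column per repeated sample.
    R_n: (G, K) personalized reference panel for the same subject.
    Returns a (K, t_n) matrix whose columns are rescaled to sum to one
    (the rescaling is an assumption, mirroring the Stage I convention).
    """
    K, t_n = R_n.shape[1], y_n.shape[1]
    theta = np.zeros((K, t_n))
    for i in range(t_n):
        # nnls returns (solution, residual norm); keep the solution only.
        theta[:, i], _ = nnls(R_n, y_n[:, i])
    col_sums = theta.sum(axis=0, keepdims=True)
    return theta / np.where(col_sums > 0, col_sums, 1.0)

# Noise-free toy example: 50 genes, 3 cell types, 3 repeated samples.
rng = np.random.default_rng(1)
R_n = rng.uniform(1.0, 100.0, size=(50, 3))
theta_true = np.array([[0.5, 0.2, 0.6],
                       [0.3, 0.3, 0.1],
                       [0.2, 0.5, 0.3]])   # each column sums to one
y_n = R_n @ theta_true
theta_hat = personalized_deconvolution(y_n, R_n)
```

On noise-free input the true proportions are recovered up to numerical precision; with real counts the estimate depends on how well \(\varvec{R}_n\) matches the subject.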
Simulations
Pure celltypespecific expression profiles
The notations for gene g, subject n, sample i, and cell type k are borrowed from the previous section. The simulation scheme is adapted from our prior benchmark study [39], which offers a comprehensive and flexible simulation framework. We used a real cell line RNA-seq dataset [34] to obtain the distribution of gene expression parameters on a genome-wide scale. This dataset has six immune cell types (neutrophils, monocytes, B cells, CD4 T cells, CD8 T cells, and natural killer cells). For each cell type, the CTS gene expression parameters, namely the expression means (\(\mu _{gk}\)) and biological dispersions (\(\phi _{gk}\)), are obtained using the PROPER [53] package. As expected, both expression means and dispersions are correlated across cell types. Therefore, for the reference panel simulation, we use the Multivariate Normal Distribution (MVN) to capture correlations for both expression mean and dispersion, in the log scale. We use \(\hat{\varvec{\Sigma }}_{m}\) (\(\bar{\varvec{\mu }}_m\)) and \(\hat{\varvec{\Sigma }}_{\phi }\) (\(\bar{\varvec{\mu }}_\phi\)) to denote the variance-covariance matrices (mean vectors) of expression mean and dispersion, respectively. The dimensions match the number of cell types, and the details of \(\hat{\varvec{\Sigma }}_{m}\), \(\bar{\varvec{\mu }}_m\), \(\hat{\varvec{\Sigma }}_{\phi }\), and \(\bar{\varvec{\mu }}_\phi\) can be found in the Additional file 1: Simulation Details. We conduct 30 iterations for each simulation scenario, with six cell types and 1,000 genes:
Note that the mean expression \(\varvec{M}\) and the biological dispersion \(\varvec{\Phi }\) are still parameter matrices for downstream usage. The case and control groups share the same \(\varvec{\Phi }\), but have distinct mean expressions. The effect size of differential expression is defined by the log fold change (LFC), denoted by \(\Delta\). The means for control and case are denoted by \(\varvec{M}_{Ctrl}=\varvec{M}\) and \(\varvec{M}_{Case}=\varvec{M}+\Delta\). We introduce 10% of differentially expressed (DE) genes in each of cell types 1, 2, 3, and 4. The true CTS gene expression matrix \(\varvec{P}\) is derived from a Gamma Distribution for both case and control:
Subject-to-subject variations (SSV) are also introduced, implemented as percentage changes in expression over the baseline in \(\varvec{P}_{case/ctrl}\). Variations are then added to \(\varvec{P}_{case/ctrl}\) to obtain subject-specific underlying gene expression matrices \(\varvec{P}_n\). To reflect various levels of variation, the level of SSV can take the following ranges: 0-5%, 5-10%, 10-20%, and 20-50%. The total subject count per case/control group can be 25, 50, 75, or 100. The subject-level cell-type-specific underlying gene expression is shared across multiple samples, and each subject is measured 3 times.
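One way to write the sampling step described above, assuming each gene's row of log-scale means and dispersions is drawn independently (the row-wise notation \(\varvec{M}_{g\cdot }\), \(\varvec{\Phi }_{g\cdot }\) is introduced here for illustration):

\[
\varvec{M}_{g\cdot }\sim \textrm{MVN}\left( \bar{\varvec{\mu }}_m, \hat{\varvec{\Sigma }}_{m}\right) , \qquad \varvec{\Phi }_{g\cdot }\sim \textrm{MVN}\left( \bar{\varvec{\mu }}_\phi , \hat{\varvec{\Sigma }}_{\phi }\right) , \qquad g=1,2,\dots ,1000
\]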
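A plausible form of this draw, under the common Gamma-Poisson convention in which the Gamma mean equals the expression mean and its squared coefficient of variation equals the dispersion, is given below. The shape and scale parameterization is an assumption, and \(\varvec{M}\) and \(\varvec{\Phi }\) are taken to be on the log scale as stated earlier.

\[
P_{gk}\sim \textrm{Gamma}\left( \textrm{shape}=\frac{1}{e^{\Phi _{gk}}},\; \textrm{scale}=e^{M_{gk}}\, e^{\Phi _{gk}}\right) , \qquad \varvec{M}\in \left\{ \varvec{M}_{Ctrl}, \varvec{M}_{Case}\right\}
\]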
Cell type proportions and observed read counts
To generate the cell type proportions, we borrow information from multiple well-labeled single-cell RNA-seq studies. We mix and bootstrap cell labels from a combined pool and obtain the empirical cell proportions from this resampling. We fit a Dirichlet Distribution to estimate its \(\varvec{\alpha }\) parameters and simulate cell type proportions from it. The detailed procedures for generating cell proportions are outlined in the Additional file 1: Simulation Details. The simulated sample-specific cell proportions are:
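In distributional form, with a single \(\varvec{\alpha }\) shown for simplicity (whether \(\varvec{\alpha }\) differs by group is specified in the Additional file 1, not here):

\[
\varvec{\theta }_{T,ni\cdot } = \left( \theta _{T,ni1}, \theta _{T,ni2}, \dots , \theta _{T,niK}\right) \sim \textrm{Dirichlet}\left( \varvec{\alpha }\right)
\]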
The vectors \(\varvec{\theta }_{T,ni\cdot }\) are reorganized into the cell composition matrix \(\varvec{\Theta }_T\). The sample-specific underlying gene expression is the weighted average across cell types in \(\varvec{P}_n\) by \(\varvec{\theta }_{T,ni\cdot }\), denoted as \(\varvec{\lambda }_{ni} = \varvec{P}_n\times \varvec{\theta }_{T,ni\cdot }^{'}\), and will follow a Gamma Distribution as well [41]. \(\varvec{\lambda }_{ni}\) is then passed to the Poisson Distribution to generate the observed RNA-sequencing count data, denoted as \(\varvec{y}_{\cdot ni}\sim \textrm{Poisson}(\varvec{\lambda }_{ni})\) for subject n at measurement i across all G genes. Overall, the Gamma Distribution models biological variation, the Dirichlet Distribution regulates cell type proportion variation, and the Poisson Distribution mimics technical noise related to the randomness in the sequencing experiments. This multi-step simulation design enables the separation of biological and technical noise [16, 39], among other factors, to facilitate a comprehensive simulation study for our model testing.
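The generative chain described in this section (MVN in the log scale, then Gamma, Dirichlet, and Poisson draws) can be sketched for a single subject as follows. All numeric parameters below are placeholders rather than the values in the Additional file 1, the Gamma parameterization is an assumption, and the subject-to-subject variation and differential expression steps are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
G, K, t_n = 1000, 6, 3   # genes, cell types, repeated samples per subject

# Hypothetical log-scale mean vectors and covariance matrices (placeholders).
mu_m, mu_phi = np.full(K, 5.0), np.full(K, -2.0)
Sigma_m = 0.5 * np.eye(K) + 0.5      # diagonal 1.0, off-diagonal 0.5
Sigma_phi = 0.2 * np.eye(K) + 0.1    # diagonal 0.3, off-diagonal 0.1

# Log-scale expression means and dispersions, correlated across cell types.
M = rng.multivariate_normal(mu_m, Sigma_m, size=G)        # (G, K)
Phi = rng.multivariate_normal(mu_phi, Sigma_phi, size=G)  # (G, K)

# Gamma draw with mean exp(M) and squared CV exp(Phi): assumed parameterization.
P_n = rng.gamma(shape=1.0 / np.exp(Phi), scale=np.exp(M) * np.exp(Phi))

# Dirichlet proportions for each repeated sample; alpha is a placeholder.
alpha = np.array([8.0, 4.0, 3.0, 2.0, 2.0, 1.0])
theta = rng.dirichlet(alpha, size=t_n)                    # (t_n, K)

# Weighted average across cell types gives the Poisson rate per gene and sample.
lam = P_n @ theta.T                                       # (G, t_n)
y_n = rng.poisson(lam)                                    # observed counts
```

Because the three samples share `P_n` and differ only in `theta`, the sketch reproduces the key structural assumption that repeated samples of one subject share a reference panel.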
Input signature matrix
The signature matrix is required by the algorithm as an input. To obtain it, we first take the average across all \(\varvec{P}_n\) matrices to get the CTS gene expression mean matrix. Then 300 or 500 pseudo-marker genes are selected by the findRefinx function (ordered by coefficients of variation) from TOAST [32] to establish a signature matrix as the input for imply.
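The idea of ranking genes by coefficient of variation can be illustrated with a short NumPy sketch. This is not TOAST's `findRefinx` implementation; the function name `top_cv_genes` and the toy matrix are hypothetical, and the CV is computed across cell types of a mean expression matrix as an assumption.

```python
import numpy as np

def top_cv_genes(ref_mean, n_markers):
    """Return row indices of the n_markers genes with the largest
    coefficient of variation (sd / mean) across cell types."""
    ref_mean = np.asarray(ref_mean, dtype=float)
    mean = ref_mean.mean(axis=1)
    sd = ref_mean.std(axis=1, ddof=1)
    # Genes with zero mean get CV = 0 so they are never selected first.
    cv = np.divide(sd, mean, out=np.zeros_like(sd), where=mean > 0)
    order = np.argsort(-cv, kind="stable")
    return order[:n_markers]

# Toy mean matrix: 3 genes (rows) by 3 cell types (columns).
ref = np.array([[10.0, 10.0, 10.0],   # flat across cell types, CV = 0
                [1.0, 1.0, 28.0],     # strongly cell-type specific
                [5.0, 10.0, 15.0]])   # moderately variable
markers = top_cv_genes(ref, 2)
signature = ref[markers]              # rows kept as the signature matrix
```

Genes whose expression is concentrated in one cell type rank first, which is the property a signature matrix needs to discriminate cell types.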
Evaluation metrics
We use \(\varvec{\Theta }\) and \(\hat{\varvec{\Theta }}\) to denote the ground truth and estimated cellular abundances, which sum to one and are bounded between zero and one. Naturally, a central goal here is to assess how good the cellular abundance estimator \(\hat{\varvec{\Theta }}\) is. Specifically, we denote imply’s deconvolution values as \(\hat{\varvec{\Theta }}_I\), and an existing method’s deconvolution results as \(\hat{\varvec{\Theta }}_E\). The existing methods comprise currently available deconvolution approaches that do not consider personalized reference panels. The following evaluation metrics are adopted for benchmarking:
Absolute bias differences (ABD) and relative absolute bias differences (rABD)
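A form of these two metrics consistent with the interpretation that follows is sketched here. Averaging the absolute bias over all samples and cell types, written \(\textrm{mean}|\cdot |\), is an assumption.

\[
\textrm{ABD} = \textrm{mean}\left| \hat{\varvec{\Theta }}_I - \varvec{\Theta }\right| - \textrm{mean}\left| \hat{\varvec{\Theta }}_E - \varvec{\Theta }\right| , \qquad \textrm{rABD} = \frac{\textrm{ABD}}{\textrm{mean}\left| \hat{\varvec{\Theta }}_E - \varvec{\Theta }\right| }
\]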
Here, for both ABD and rABD, a value smaller than zero means that imply successfully reduces the estimation bias. A smaller value further indicates better performance.
Correlation differences (CD)
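Written as a difference of correlations with the ground truth (the choice of correlation coefficient, for example Pearson, is not stated here and is left open):

\[
\textrm{CD} = \textrm{cor}\left( \hat{\varvec{\Theta }}_I, \varvec{\Theta }\right) - \textrm{cor}\left( \hat{\varvec{\Theta }}_E, \varvec{\Theta }\right)
\]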
Here, if CD>0, then imply increases the correlation between the estimation and the ground truth. A larger value indicates favorable performance.
Lin’s concordance correlation coefficient (CCC) and its variations
Lin’s concordance correlation coefficient (Lin’s CCC) [31], denoted as \(\rho _{\textrm{C}}\), has been extensively used to evaluate the concordance between a new measure and a gold standard measurement, and is defined as \(\rho _{\textrm{C}}(\varvec{\Theta }, \hat{\varvec{\Theta }}) = 1 - \frac{E\left[ (\varvec{\Theta } - \hat{\varvec{\Theta }})^2\right] }{E_I\left[ (\varvec{\Theta } - \hat{\varvec{\Theta }})^2\right] }\), where \(E_I\) indicates the expectation under the assumption that \(\varvec{\Theta }\) and \(\hat{\varvec{\Theta }}\) are independent. Lin’s CCC is bounded between 1 (perfect agreement) and -1 (disagreement), and the concordance improves as \(\rho _{\textrm{C}} (\varvec{\Theta }, \hat{\varvec{\Theta }})\) approaches 1. Additionally, we adopt a Euclidean distance-based variation of Lin’s CCC, obtained by replacing the expected squared difference with the Euclidean distance, denoted as \(\rho _{\textrm{C,E}}\), defined below:
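Following that description, the Euclidean variant can be sketched by swapping the squared difference for the Euclidean norm of the difference between composition vectors. Whether the norm enters squared or unsquared is an assumption here.

\[
\rho _{\textrm{C,E}}\left( \varvec{\Theta }, \hat{\varvec{\Theta }}\right) = 1 - \frac{E\left[ \left\Vert \varvec{\Theta } - \hat{\varvec{\Theta }}\right\Vert _2\right] }{E_I\left[ \left\Vert \varvec{\Theta } - \hat{\varvec{\Theta }}\right\Vert _2\right] }
\]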
Another option is to employ the Aitchison [2] distance-based Concordance Correlation Coefficient (CCC), which is explained in detail in the Additional file 1: Evaluation Metric, with the results provided in the Additional file 1: Simulation Results. These metrics are adopted because they have been shown to be statistically more rigorous for dependent measures that are subject to positivity and unit-sum constraints [13], as is often the case for compositional proportion outcomes. If imply yields increased concordance and improved precision, we would expect positive values in the differences of CCC. These metrics are respectively defined below:
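Each difference compares the concordance of imply with that of an existing method, for example \(\Delta \rho _{\textrm{C}} = \rho _{\textrm{C}}(\hat{\varvec{\Theta }}_I, \varvec{\Theta }) - \rho _{\textrm{C}}(\hat{\varvec{\Theta }}_E, \varvec{\Theta })\), and analogously for \(\Delta \rho _{\textrm{C,E}}\). The sketch below computes the sample version of Lin's CCC and the resulting \(\Delta \rho _{\textrm{C}}\) with NumPy. The function names and the toy proportions are hypothetical, and flattening all samples and cell types into one vector is an assumption.

```python
import numpy as np

def lin_ccc(x, y):
    """Sample version of Lin's concordance correlation coefficient:
    2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return float(2.0 * cov_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2))

def delta_ccc(theta_imply, theta_existing, theta_true):
    """Concordance gain of the personalized estimate over an existing one."""
    return lin_ccc(theta_imply, theta_true) - lin_ccc(theta_existing, theta_true)

# Toy proportions for one sample with three cell types.
truth = np.array([0.2, 0.3, 0.5])
est_imply = np.array([0.2, 0.3, 0.5])      # matches the truth exactly
est_existing = np.array([0.3, 0.3, 0.4])   # biased toward the first cell type
gain = delta_ccc(est_imply, est_existing, truth)
```

A positive `gain` indicates that the personalized estimate agrees more closely with the ground truth than the existing estimate does.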
Overview of the PDBP and TEDDY cohorts
Real data analysis was conducted on two cohorts: the Parkinson’s Disease Biomarkers Program (PDBP) and The Environmental Determinants of Diabetes in the Young (TEDDY) [29]. The PDBP consortium provides repeatedly measured RNA-seq datasets together with demographic and clinical information collected from participants with or without Parkinson’s Disease (PD), recruited from multiple medical centers and research institutions in the United States between November 2012 and August 2018. The PDBP cohort data were collected longitudinally for each subject, allowing us to track changes in cell type composition and disease progression over time. In our study, de-identified participants with at least three observations over time were retained. A total of 399 PD patients and 173 controls, with 2599 longitudinal samples over 2 years, were included. Longitudinal RNA samples in PDBP were extracted from whole blood. Clinical data include information about patients’ medical history, symptoms, disease status, total Montreal Cognitive Assessment (MoCA) scores, and MDS-UPDRS Part III motor scores. The TEDDY cohort is a multicenter pediatric study of Type 1 diabetes (T1D). TEDDY screened and enrolled participants with susceptibility to T1D based on Human Leukocyte Antigen (HLA) genotypes from six clinical centers in four countries (U.S., Finland, Germany, and Sweden). A total of 8,676 high-risk infants were enrolled from birth and followed every 3 months for blood sample collection and islet autoantibody (IAbs) measurement up to 4 years of age. Details of sample collection, RNA sequencing procedures, bioinformatics processing, and quality control are described in the Additional file 1: Method Details and [54]. The longitudinal whole blood transcriptome data enable the imply deconvolution.
Results
We first evaluate imply’s deconvolution accuracy using synthetic data generated through the steps described earlier. imply is the only method that re-estimates cell type proportions using subject-specific reference panels from longitudinal bulk data; therefore, a direct comparison with existing deconvolution methods is not available. Nevertheless, we designed the benchmark to be inclusive of existing methods. TCA [46], designed for csDE gene detection, integrates a re-estimation feature for refining initially noisy cell proportion inputs. Specifically, TCA takes a maximum-likelihood (ML) approach to derive model parameters given initial cell proportions, and the proportions are subsequently updated based on these estimated parameters. TCA requires preliminary cell proportions for effective re-estimation. We employ non-negative least squares and \(\nu\)-SVR to acquire the initial inputs for TCA, and label the results TCAn and TCAs, respectively, so that they can be benchmarked against imply. ISLET [16] is the first method to retrieve individual-specific reference panels from repeated samples, based on the Expectation-Maximization (EM) algorithm. ISLET can serve as an alternative to our mixed-effect model for solving subject-specific reference panels. Here, we consider ISLETs and ISLETn, the ISLET variants in which the final personalized deconvolution is conducted by SVR or non-negative least squares, respectively. We also introduce a variant of imply in which Stage III is performed by SVR instead of non-negative least squares. This variant is denoted as implys. We comprehensively benchmark our proposed personalized deconvolution methods, imply and its variant implys, against the other algorithms: TCAn, TCAs, ISLETn, and ISLETs.
Additionally, we compare imply with representative deep learning-based algorithms, Scaden [38] and TAPE [9], as well as popular statistical modeling methods, CIBERSORTx [44] and MuSiC [51], under a baseline simulation setting detailed in the Additional file 1: Figs. S27 and S28.
imply increases precision in cell-type deconvolution
We start with a baseline simulation scenario with six cell types, two disease groups, and 100 subjects per group with 3 replicates per subject. The subject-specific variation (SSV) in the underlying CTS gene expression panels is up to 5%. To simulate csDE genes, we introduce 10% of DE genes in each of cell types 1, 2, 3, and 4. The effect size is characterized by the log fold change (LFC), set to 0.5. Figure 2A shows the reference panels estimated by imply versus the ground truth. Overall, we observe good accuracy in personalized reference panel recovery, especially among high-expression genes. This result demonstrates the fidelity of Stage II and lays a foundation for personalized deconvolution in Stage III. Next, we evaluate whether imply’s final cell type deconvolution, from Stage III, reduces bias. There are two main aspects to consider for accuracy benchmarking: one is to compare with alternative frameworks that do not use personalized reference panels; the other is to benchmark against existing methods. Figure 2B shows the scatterplot of the estimated cell type proportions versus the true proportions. Our result is overlaid on top of the result from CIBERSORT, one of the state-of-the-art methods. imply yields higher precision in deconvolution, as its estimates cluster closer to the diagonal line. In Fig. 2C-F, the bias reductions are quantitatively assessed and compared using the metrics introduced previously: ABD, rABD, CD, and \(\Delta \rho _{\textrm{C,E}}\). Each point in a boxplot represents one simulation iteration, with the red dotted line at zero indicating the baseline of not using personalized reference panels. Thus, the zero line represents an existing deconvolution method, such as CIBERSORT, that does not consider personalized reference panels. For ABD and rABD, lower values indicate a greater increase in deconvolution accuracy, while for CD and \(\Delta \rho _{\textrm{C,E}}\), higher values indicate improved concordance with the true values.
Notably, imply consistently demonstrates the most substantial reduction in deconvolution bias and the highest concordance with the truth. In contrast, TCA performs poorly, especially when the initial proportion inputs are estimated through non-negative least squares (TCAn). Even when the initial proportion input is derived from CIBERSORT, the bias reduction achieved by TCA (TCAs) is not as large as that achieved by imply. Furthermore, we notice that subject-specific reference panels estimated by ISLET also benefit personalized deconvolution, as illustrated by ISLETs and ISLETn. However, the improvements are not as pronounced as those achieved by imply. The Wilcoxon signed-rank test was conducted to assess the statistical significance of the superiority of imply over the other models. Detailed test results can be found in the Additional file 1: Tables S3-S6. Moreover, the Additional file 1: Figs. S27 and S28 present further benchmarking analysis conducted under slightly different setups, showing that imply outperforms CIBERSORTx [44], MuSiC [51], and two deep learning-based methods, Scaden [38] and TAPE [9], under the baseline simulation setup.
We also explore the methods’ performance under various simulation scenarios and summarize the results in Table 1. The table shows ABDs averaged across simulation replicates, with their standard errors, at exhaustive combinations of subject-specific variations (SSV = 0-5%, 5-10%, 10-20%), effect sizes (LFC = 0.5, 1, 1.25), and sample sizes (N = 25, 50, 100). Bold font highlights the algorithm with the largest bias reduction for each scenario. imply and implys consistently reduce deconvolution bias across all scenarios.
Benchmarking at cell-type resolution
We next investigate the deconvolution accuracy for each cell type. Figure 3A shows the ABD and \(\Delta \rho _{\textrm{C}}\) outcomes of 30 replicates for each cell type with an SSV range of 0-5%, a sample size of 75, and an effect size of 0.5. Across all cell types, we see a discernible reduction in bias when personalized reference panels are adopted. imply and implys consistently stand out, yielding a significant enhancement in concordance within each cell type compared to the other models. The heatmap in Fig. 3B shows the average rABD at various combinations of sample sizes and effect sizes, separated by cell type. At large effect sizes, the improvements in deconvolution accuracy achieved by imply are notably more pronounced. However, rABD changes little with sample size. The simulation results also suggest a connection between bias reduction and cell type abundance; specifically, deconvolution accuracies for more abundant cell types are highly sensitive to LFC changes (see the Additional file 1: Simulation Results for additional details). In contrast, for minor cell types, their small contribution makes deconvolution an even more challenging task, where the sequencing noise could easily dominate the underlying biological variation. The Additional file 1: Fig. S31 contains additional simulation results specifically addressing minor cell types.
Influential factors in deconvolution accuracy
We further zoom in to study how sample size, effect size, and subject-specific variation affect personalized deconvolution. In Fig. 4A, ABD and \(\Delta \rho _{\textrm{C,E}}\) for imply, together with ISLETn and TCAs, are presented across LFC values ranging from 0 (null) to 1.5. imply consistently exhibits the lowest ABD in all scenarios and the highest \(\Delta \rho _{\textrm{C,E}}\) in most settings. These results indicate the advantage of adopting personalized reference panels. In addition, imply provides the most stable performance (i.e., the smallest variation) among the three methods as the effect size increases. Figure 4B shows the same metrics across various sample sizes. As expected, ABD decreases as the sample size increases. imply consistently maintains the highest \(\Delta \rho _{\textrm{C,E}}\) across sample sizes. In Fig. 4C, we further investigate the \(\Delta \rho _{\textrm{C,E}}\) alteration percentages, defined as \(\Delta \rho _{\textrm{C,E}}\% = \frac{\Delta \rho _{\textrm{C,E}}}{\rho _{\textrm{C,E}}(\Theta _E,\Theta )}\times 100\%\), at different levels of SSV, which are annotated in the top row. We observe a robust pattern across different effect sizes, sample sizes, and SSVs, and conclude that imply and implys consistently provide the largest concordance improvement.
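As a worked illustration of the alteration percentage, the sketch below computes Lin's concordance correlation coefficient [31] and the percentage change relative to an existing method. It assumes that \(\Delta \rho\) is the concordance of the personalized estimate with the truth minus the concordance of the existing estimate \(\Theta _E\) with the truth, which is inferred from the denominator of the formula above; the exact definition is given in Additional file 1. The function names and proportions are hypothetical.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def ccc_change_percent(theta_true, theta_existing, theta_personalized):
    """Concordance gain of the personalized estimate, in percent.

    Assumption: gain = CCC(personalized, truth) - CCC(existing, truth),
    divided by CCC(existing, truth).
    """
    rho_existing = lins_ccc(theta_existing, theta_true)
    rho_personal = lins_ccc(theta_personalized, theta_true)
    return (rho_personal - rho_existing) / rho_existing * 100.0


# Hypothetical proportions of one cell type in four samples.
theta_true = [0.1, 0.2, 0.3, 0.4]
theta_existing = [0.2, 0.3, 0.4, 0.5]      # perfectly correlated but shifted by 0.1
theta_personalized = [0.1, 0.2, 0.3, 0.4]  # matches the truth

rho_existing = lins_ccc(theta_existing, theta_true)
gain = ccc_change_percent(theta_true, theta_existing, theta_personalized)
print(f"CCC of existing estimate: {rho_existing:.4f}; gain: {gain:.1f}%")
```

The shifted estimate has a Pearson correlation of 1 with the truth but a lower concordance, which is why a concordance measure, rather than correlation alone, is informative about bias.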
Application of imply to longitudinal transcriptomic datasets
We applied imply to analyze the longitudinal transcriptomic datasets from both the PDBP and TEDDY [29] consortia. For the PDBP dataset, the mean proportions across all visit times of six cell types, including B cells, monocytes, CD4, CD8, NK cells, and other cells, are shown for cases and controls in Fig. 5A. Here, B cells contribute the most among the six cell types, while NK cells contribute the least. The visualization suggests higher CD8 proportions in the PD group than in the control group, while CD4 proportions in the PD group are lower. Figure 5B displays the heatmap of Pearson correlations among the six cell types. B cells, monocytes, and CD4 all show negative pairwise correlations. Figure 5C shows boxplots of CD8 cell type proportions comparing cases and controls at each time point. The median CD8 proportion in cases is higher than that in controls at each time point. The CD4 and CD8 cell type proportions, broken down by each subject's visit time, are shown in Fig. 5D and E, respectively. For the CD4 cell type, the mean proportions in the case group are lower than those in the control group at each visit time. For the CD8 cell type, the mean proportions among cases are higher than those among controls at each visit time. These findings are well aligned with previous studies in which PD patients showed elevated CD8 proportions and reduced CD4 proportions relative to controls [7, 20, 26, 52]. We also benchmarked imply against the existing method CIBERSORT, as shown in Fig. 5F. Using CIBERSORT, the p-value of the Wilcoxon rank-sum test is 0.0111 and the median difference is \(-0.007\) for CD8 proportions between cases and controls. It incorrectly suggests that the CD8 cell type proportion is lower in cases than in controls. In contrast, imply yields a p-value less than \(10^{-16}\) and a median difference of 0.58, which shows the correct effect size direction. It also improves discrimination between cases and controls, as shown in the ROC plot.
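The case-control comparison above combines a Wilcoxon rank-sum test, a median difference, and an ROC curve. The following sketch shows how these three quantities relate for one cell type. It is a plain-Python illustration using the normal approximation with tie correction and no continuity correction, which is an assumption and not necessarily the software or settings used for Fig. 5F; the proportions are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_sum_test(cases, controls):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (median difference, AUC, p-value); no continuity correction.
    """
    n1, n2 = len(cases), len(controls)
    pooled = list(cases) + list(controls)
    ranks = average_ranks(pooled)
    u_cases = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance == 0:  # every value identical: no evidence of a shift
        p_value = 1.0
    else:
        z = (u_cases - n1 * n2 / 2) / math.sqrt(variance)
        p_value = math.erfc(abs(z) / math.sqrt(2))
    auc = u_cases / (n1 * n2)  # P(case > control), ties counted as 1/2
    return median(cases) - median(controls), auc, p_value


# Hypothetical CD8 proportions for five cases and five controls.
cases = [0.30, 0.35, 0.40, 0.45, 0.50]
controls = [0.10, 0.15, 0.20, 0.25, 0.32]
median_diff, auc, p_value = rank_sum_test(cases, controls)
print(f"median difference {median_diff:.2f}, AUC {auc:.2f}, p = {p_value:.4f}")
```

The sign of the median difference gives the effect direction, and the rank-sum statistic rescaled by the number of case-control pairs equals the area under the ROC curve, so a stronger rank-sum result and a better ROC curve are two views of the same separation.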
We also explored the associations between the various cell type proportions and clinical outcomes, including total UPSIT score, total Montreal Cognitive Assessment (MoCA) score, and MDS UPDRS part III motor score, which provide additional assessments of patients' cognitive and motor function in PD. Additionally, association studies with cerebrospinal fluid (CSF) were conducted (Additional file 1: Figs. S21–S25). For the T1D study of TEDDY, the disease status of interest (i.e., cases) is the onset of pancreatic islet autoantibodies (IA). The longitudinal analysis of re-quantified cellular composition identifies NK cell abundance as higher in males than in females (\(p<0.0001\)), as illustrated in Fig. 5H. Previous research in TEDDY reported that a higher risk of IA is associated with viral infection during the first 6 months of life [50]. The sex difference in NK cell fraction in Fig. 5H could be a consequence of early-life vaccination or viral infection [10], since infants are exposed to exogenous antigens and are highly susceptible to infections. In this analysis, we use longitudinal samples of IA cases and controls collected at the age of 9–21 months, and compare the cell fractions deconvoluted by imply between groups. Figure 5I shows that the NK cell proportions are significantly lower (\(p<0.0001\)) in the participants who developed IA at a young age than in controls, while this trend is not observed in the initial cell abundance estimated by CIBERSORT (\(p=0.77\), Additional file 1: Fig. S26). The relatively higher NK cell abundance in males (vs. females) and controls (vs. cases) among TEDDY participants is consistent with the previous finding that males have a lower risk of autoimmunity than females [37].
Furthermore, we perform a downstream csDE gene analysis on IA status based on the imply-deconvoluted cell type fractions, using ISLET [16] with FDR\(<0.1\). The cell type proportions improved by imply enabled the detection of DE genes in CD4 T cells and identified more NK-cell-specific DE genes (\(n>300\)) than a previous csDE gene testing result (\(n=30\)) based on the proportions deconvoluted by AutoGeneS [3]. The IA csDE genes based on the improved cell fractions include markers for multiple T cell receptors (e.g., TRBV, TRDV, TRGV, TRJV) and genes regulating immune responses, such as CAMP and CRK. CAMP gene expression was found to be associated with serum levels of vitamin D in studies of innate immunity [14, 24, 36], while the TEDDY cohort also reported a strong linkage between vitamin D and the risk of IA [33]. The CRK protein is involved in NK cell inhibitory receptor signaling and modulates the signaling of activating receptors, which may function as a two-way molecular switch to control NK cell-mediated cytotoxicity [35, 42].
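The FDR\(<0.1\) threshold used for csDE gene calling can be illustrated with a short sketch. The text does not state which FDR procedure was applied, so the Benjamini-Hochberg adjustment below is an assumption, and the p-values are hypothetical rather than taken from the TEDDY analysis.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene p-values for one cell type.
p_values = [0.001, 0.02, 0.03, 0.2, 0.5]
adjusted = benjamini_hochberg(p_values)

# Genes called csDE at FDR < 0.1 (indices into p_values).
selected = [i for i, q in enumerate(adjusted) if q < 0.1]
print([round(q, 3) for q in adjusted], selected)
```

With these values the first three genes pass the threshold; in practice the adjustment would be applied separately to the gene-level test results for each cell type.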
Discussion
The computational deconvolution of admixed bulk tissue samples is drawing substantial interest in omics. The interest is growing as deconvolution methodologies are being developed and as increasingly large datasets become available, with and without repeated measures. We are among the first to consider personalized reference panels in deconvolution. Our computational framework optimizes the usage of shared information in longitudinal samples from each subject. Alternative machine learning approaches, such as Expectation-Maximization (EM) and non-negative matrix factorization algorithms, could also extract personalized reference panels and have been implemented in ISLET [16] and CIBERSORTx [44]. Nevertheless, these methods lack the conciseness and computational efficiency of the proposed linear mixed-effects modeling framework.
A limitation of imply is the requirement of an initial signature matrix as the input in Stage I, which could affect the initial cell type abundance estimates passed to downstream stages. An alternative approach is to initialize cell fractions using external multi-subject reference cell count data, such as single-cell profiling and labeling, flow cytometry, or imaging. For some genes, the random effect variance estimate may shrink towards zero, likely due to the adoption of penalized MLE. In such scenarios, the CTS heterogeneity between individuals would not be fully recovered. Furthermore, intra-individual heterogeneity was not considered in reference panel recovery. This is because our present work was motivated by the bulk transcriptome of longitudinal blood samples, many of which were collected from healthy controls. In those scenarios, the underlying pure gene expression panel for each subject is relatively stable over time. Our previous work [16] suggests that the intra-individual CTS heterogeneity, when assessed using longitudinal PBMC scRNA-seq data, is trivial compared with inter-individual variation. Hence, our future work will include the curation of longitudinal scRNA-seq data from distinct tissue types or disease populations and the incorporation of potential variations between time points at cell type resolution.
Conclusions
In this work, we present our statistical framework imply to conduct cell-type deconvolution in bulk data using personalized reference panels. imply leverages repeated bulk RNA-seq samples to purify a personalized reference transcriptome, and then jointly quantifies the cell abundances across multiple samples per individual. We show the advantage of using personalized reference panels through extensive in silico simulation studies and the analytical results of two large-scale longitudinal consortia. imply can produce more accurate and realistic deconvolution results.
Availability of data and materials
1. imply is implemented and integrated into the R/Bioconductor package ISLET [17], which is available at https://bioconductor.org/packages/ISLET.
2. The PDBP bulk transcriptome and related clinical data are publicly available on request to AMP-PD at https://amp-pd.org.
3. The TEDDY [29] bulk transcriptome dataset has been deposited in NCBI's database of Genotypes and Phenotypes (dbGaP) with the primary accession code phs001442.v3.p2.
Abbreviations
CTS: Cell-Type-Specific
PDBP: Parkinson’s Disease Biomarkers Program
TEDDY: The Environmental Determinants of Diabetes in the Young
TCGA: The Cancer Genome Atlas
RB/RF: Reference-Based/Reference-Free
(cs)DE: (cell-type-specific) Differentially Expressed
scRNA-seq: single-cell RNA-seq
SSV: Subject-to-Subject Variations
(r)ABD: (Relative) Absolute Bias Differences
CD: Correlation Differences
Lin’s CCC: Lin’s Concordance Correlation Coefficient
LFC: Log-Fold-Change
MLE: Maximum Likelihood Estimation
SVR: Support Vector Regression
EM: Expectation-Maximization
References
Aguirre-Gamboa R, Joosten I, Urbano PC, van der Molen RG, van Rijssen E, van Cranenbroek B, et al. Differential effects of environmental and genetic factors on T and B cell immune traits. Cell Rep. 2016;17(9):2474–87.
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V. Logratio analysis and compositional distance. Math Geol. 2000;32:271–5.
Aliee H, Theis FJ. AutoGeneS: automatic gene selection using multi-objective optimization for RNA-seq deconvolution. Cell Syst. 2021;12(7):706–15.
Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18:1–14.
Avila Cobos F, Vandesompele J, Mestdagh P, De Preter K. Computational deconvolution of transcriptomics data from mixed cell populations. Bioinformatics. 2018;34(11):1969–79.
Avila Cobos F, Alquicira-Hernandez J, Powell JE, Mestdagh P, De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat Commun. 2020;11(1):5650.
Baba Y, Kuroiwa A, Uitti RJ, Wszolek ZK, Yamada T. Alterations of T-lymphocyte populations in Parkinson disease. Parkinsonism Relat Disord. 2005;11(8):493–8.
Çalışkan M, Baker SW, Gilad Y, Ober C. Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet. 2015;11(4):e1005111.
Chen Y, Wang Y, Chen Y, Cheng Y, Wei Y, Li Y, et al. Deep autoencoder for interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis. Nat Commun. 2022;13(1):6735.
Cheng MI, Li JH, Riggan L, Chen B, Tafti RY, Chin S, et al. The X-linked epigenetic regulator UTX controls NK cell-intrinsic sex differences. Nat Immunol. 2023;24(5):1–12.
Chiu YJ, Hsieh YH, Huang YH. Improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. BMC Med Genet. 2019;12:1–17.
Clarke J, Seo P, Clarke B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics. 2010;26(8):1043–9.
Cui Y, Peng L, Hu Y, Lai HJ. Assessing the reproducibility of microbiome measurements based on concordance correlation coefficients. J R Stat Soc Ser C Appl Stat. 2021;70(4):1027–48.
de Oliveira ALG, Chaves AT, Cardoso MS, Pinheiro GRG, Antunes DE, de Faria Grossi MA, et al. Reduced vitamin D receptor (VDR) and cathelicidin antimicrobial peptide (CAMP) gene expression contribute to the maintenance of inflammatory immune response in leprosy patients. Microbes Infect. 2022;24(6–7):104981.
Di Biase MA, Geaghan MP, Reay WR, Seidlitz J, Weickert CS, Pébay A, et al. Cell type-specific manifestations of cortical thickness heterogeneity in schizophrenia. Mol Psychiatry. 2022;27(4):2052–60.
Feng H, Meng G, Lin T, Parikh H, Pan Y, Li Z, et al. ISLET: individual-specific reference panel recovery improves cell-type-specific inference. Genome Biol. 2023;24(1):174.
Feng H, Meng G, Li Q. ISLET: Individual-Specific cell typE referencing Tool. https://doi.org/10.18129/B9.bioc.ISLET. Bioconductor version: Release 3.18. 2023.
Findley AS, Monziani A, Richards AL, Rhodes K, Ward MC, Kalita CA, et al. Functional dynamic genetic effects on gene regulation are specific to particular cell types and environmental conditions. Elife. 2021;10:e67077.
Finotello F, Trajanoski Z. Quantifying tumor-infiltrating immune cells from transcriptomics data. Cancer Immunol Immunother. 2018;67(7):1031–40.
Galiano-Landeira J, Torra A, Vila M, Bove J. CD8 T cell nigral infiltration precedes synucleinopathy in early stages of Parkinson’s disease. Brain. 2020;143(12):3717–33.
Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study. Infect Genet Evol. 2012;12(5):913–21.
Ghaffari S, Bouchonville KJ, Saleh E, Schmidt RE, Offer SM, Sinha S. BEDwARS: a robust Bayesian approach to bulk gene expression deconvolution with noisy reference signatures. Genome Biol. 2023;24(1):1–30.
Gibson G. The environmental contribution to gene expression profiles. Nat Rev Genet. 2008;9(8):575–81.
Gombart AF, Saito T, Koeffler HP. Exaptation of an ancient Alu short interspersed element provides a highly conserved vitamin D-mediated innate immune response in humans and primates. BMC Genomics. 2009;10(1):1–11.
Gong T, Szustakowski JD. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics. 2013;29(8):1083–5.
Hisanaga K, Asagi M, Itoyama Y, Iwasaki Y. Increase in peripheral CD4 bright+ CD8 dull+ T cells in Parkinson disease. Arch Neurol. 2001;58(10):1580–3.
Idaghdour Y, Czika W, Shianna KV, Lee SH, Visscher PM, Martin HC, et al. Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nat Genet. 2010;42(1):62–7.
Kedlian VR, Donertas HM, Thornton JM. The widespread increase in inter-individual variability of gene expression in the human brain with age. Aging (Albany NY). 2019;11(8):2253.
Krischer J, Rewers M, She JX, Ziegler AG, Toppari J, Lernmark Å, et al. The Environmental Determinants of Diabetes in the Young Study (TEDDY). dbGaP Genotypes and Phenotypes. phs001442.v3.p2. 2021. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001442.v4.p3. Accessed 26 July 2023.
Kuhn A, Kumar A, Beilina A, Dillman A, Cookson MR, Singleton AB. Cell population-specific expression analysis of human cerebellum. BMC Genomics. 2012;13:1–15.
Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45(1):255–68.
Li Z, Wu Z, Jin P, Wu H. Dissecting differential signals in high-throughput data from complex tissues. Bioinformatics. 2019;35(20):3898–905.
Li Q, Liu X, Yang J, Erlund I, Lernmark Å, Hagopian W, et al. Plasma metabolome and circulating vitamins stratified onset age of an initial islet autoantibody and progression to Type 1 diabetes: the TEDDY Study. Diabetes. 2021;70(1):282–92.
Linsley PS, Speake C, Whalen E, Chaussabel D. Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PLoS ONE. 2014;9(10):e109760.
Liu D. The adaptor protein Crk in immune response. Immunol Cell Biol. 2014;92(1):80–9.
Lowry MB, Guo C, Zhang Y, Fantacone ML, Logan IE, Campbell Y, et al. A mouse model for vitamin D-induced human cathelicidin antimicrobial peptide gene expression. J Steroid Biochem Mol Biol. 2020;198:105552.
Markle JG, Frank DN, Mortin-Toth S, Robertson CE, Feazel LM, Rolle-Kampczyk U, et al. Sex differences in the gut microbiome drive hormone-dependent regulation of autoimmunity. Science. 2013;339(6123):1084–8.
Menden K, Marouf M, Oller S, Dalmia A, Magruder DS, Kloiber K, et al. Deep learning–based cell composition analysis from tissue expression profiles. Sci Adv. 2020;6(30):eaba2619.
Meng G, Tang W, Huang E, Li Z, Feng H. A comprehensive assessment of cell type-specific differential expression methods in bulk data. Brief Bioinform. 2023;24(1):bbac516.
Modlich O, Prisack HB, Munnes M, Audretsch W, Bojar H. Immediate gene expression changes after the first course of neoadjuvant chemotherapy in patients with primary breast cancer disease. Clin Cancer Res. 2004;10(19):6418–31.
Moschopoulos PG. The distribution of the sum of independent gamma random variables. Ann Inst Stat Math. 1985;37(1):541–4.
Nabekura T, Chen Z, Schroeder C, Park T, Vivier E, Lanier LL, et al. Crk adaptor proteins regulate NK cell expansion and differentiation during mouse cytomegalovirus infection. J Immunol. 2018;200(10):3420–8.
Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12(5):453–7.
Newman AM, Steen CB, Liu CL, Gentles AJ, Chaudhuri AA, Scherer F, et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol. 2019;37(7):773–82.
Qiao W, Quon G, Csaszar E, Yu M, Morris Q, Zandstra PW. PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. PLoS Comput Biol. 2012;8(12):e1002838.
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, et al. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nat Commun. 2019;10(1):3417.
Sutton GJ, Poppe D, Simmons RK, Walsh K, Nawaz U, Lister R, et al. Comprehensive evaluation of deconvolution methods for human brain gene expression. Nat Commun. 2022;13(1):1358.
Troester MA, Hoadley KA, Sørlie T, Herbert BS, Børresen-Dale AL, Lønning PE, et al. Cell-type-specific responses to chemotherapeutics in breast cancer. Cancer Res. 2004;64(12):4218–26.
Tsoucas D, Dong R, Chen H, Zhu Q, Guo G, Yuan GC. Accurate estimation of cell-type composition from gene expression data. Nat Commun. 2019;10(1):2975.
Vehik K, Lynch KF, Wong MC, Tian X, Ross MC, Gibbs RA, et al. Prospective virome analyses in young children at increased genetic risk for type 1 diabetes. Nat Med. 2019;25(12):1865–72.
Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun. 2019;10(1):380.
Wang P, Yao L, Luo M, Zhou W, Jin X, Xu Z, et al. Single-cell transcriptome and TCR profiling reveal activated and expanded T cell populations in Parkinson’s disease. Cell Disc. 2021;7(1):52.
Wu H, Wang C, Wu Z. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015;31(2):233–41.
Xhonneux LP, Knight O, Lernmark Å, Bonifacio E, Hagopian WA, Rewers MJ, et al. Transcriptional networks in at-risk individuals identify signatures of type 1 diabetes progression. Sci Transl Med. 2021;13(587):eabd5666.
Zhong Y, Wan YW, Pang K, Chow LM, Liu Z. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC Bioinformatics. 2013;14(1):1–10.
Acknowledgements
The TEDDY study is funded by the National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Allergy and Infectious Diseases, National Institute of Child Health and Human Development, National Institute of Environmental Health Sciences, Centers for Disease Control and Prevention, and JDRF. We thank the TEDDY study data coordinating center at Health Informatics Institute, University of South Florida, for data processing and sharing. The PDBP consortium is supported by the National Institute of Neurological Disorders and Stroke (NINDS) at the National Institutes of Health. A full list of PDBP investigators can be found at https://pdbp.ninds.nih.gov/policy. The PDBP investigators have not participated in reviewing the data analysis or content of the manuscript. Data used in the preparation of this article were obtained from the Accelerating Medicine Partnership (AMP) Parkinson’s Disease (AMP PD) Knowledge Platform.
Funding
This work was partially supported by the National Institutes of Health [U24DK097771 via the NIDDK Information Network’s (dkNET) New Investigator Pilot Program in Bioinformatics (PI: Q.L.) and Cancer Center Support Grant P30CA021765 to Q.L.], the American Cancer Society Institutional Research Grant (ACS IRG) [IRG1618621 to H.F.] through Case Comprehensive Cancer Center, and the American Lebanese Syrian Associated Charities (ALSAC) to Q.L.
Author information
Contributions
G.M., Q.L., and H.F. developed and implemented the algorithm. Y.P., W.T., and L.Z. conducted real data analysis. G.M., Y.C., and S.H. conducted the simulations. G.M., H.F., and Q.L. wrote the manuscript. F.S., M.W., R.W., and J.K. contributed to simulation or real data results interpretation. All authors reviewed the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary Information
Additional file 1.
imply: improving cell-type deconvolution accuracy using personalized reference profiles, Supplementary Materials. It includes detailed methodological descriptions, simulation details, evaluation metrics, and results, as well as analyses on real data sets.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Meng, G., Pan, Y., Tang, W. et al. imply: improving cell-type deconvolution accuracy using personalized reference profiles. Genome Med 16, 65 (2024). https://doi.org/10.1186/s13073-024-01338-z