Background

Bulk RNA-sequencing experiments reveal average gene expression values for all cells present in a sample mixture. Computational deconvolution methods separate the mixed signals from the aggregated expression and provide estimation of cellular components without physical isolations. The inferred cellular proportions are important to understand the ecosystem of the tissue and can be used as covariates in differential expression, reducing false positives and false negatives [1, 2]. Moreover, for heterogeneous bulk samples like tumor [3], deconvolution enables identification and quantification of the infiltrating immune populations, which provides rich prognostic values and can guide targeted therapy (e.g., in immunotherapy) [4,5,6,7,8].

Numerous deconvolution methodologies have been developed (see [1], for review), aiming at estimation of cell-type abundance from bulk transcriptomic data. Depending on if and how priori knowledge used, these methods can be broadly classified into four categories: regression-based, marker-based, and reference-free methods. Regression-based methods require an expression matrix as input, which consists of a cell type-specific expression profile for selected genes. These methods then solve the deconvolution as a regression problem. A comprehensive evaluation of factors involved in regression-based methods, like data transformation, normalization, and regression algorithms, can be found elsewhere [9]. Marker-based methods require a set of genes that characterize the expression patterns in different cell types and return either an enrichment score [10] that is unitless or abundance estimates [11, 12].

Reference-free methods [12,13,14] are completely unsupervised and do not require any prior knowledge as input. Such methods are based on finding a simplex, which is a geometric data structure expected under ideal mixture proportion scenarios. Finally, we note that a recent advance in deconvolution uses a Bayesian framework that relies on a reference matrix but uses it in a way that is distinct from reference based approaches [15].

Rapid development of deconvolution methodologies now raises another challenge of evaluating their performance across diverse realistic settings. Many benchmarking studies have been undertaken to meet this demand [2, 9, 16, 17]. Regardless of the focus of their evaluation, all benchmarking efforts rely on datasets with known ground truth. To acquire such data, one traditional approach involves using real bulk data with paired cell type fraction information, which can be derived from fluorescence-activated cell sorting (FACS) or immunohistochemistry (IHC) staining [17, 18]. However, this approach is restricted by the extensive experimental labor and limited sample availability, making it less practical for large-scale benchmarking studies. An alternative approach is computational mixing where purified expressions of different cell-types are mixed in controlled proportions [10, 19, 20]. While the purely computational strategy can generate large datasets, this approach has the clear limitation that it makes the strong assumption that proportion variation and random noise are the only source of variance in the data.

Increasing availability of single-cell data [21] offers the opportunity to create more realistic simulations. Instead of computational mixing of pure expression states, individual single cell profiles are added together in controlled proportions [9, 17, 18, 22]. This has the explicit advantage over pure computational mixing as it introduces more variations in the simulated samples. However, as we will show in this work, while this approach has rapidly become the standard method for bulk simulation, the problem with unrealistic biological variance is only partially resolved. To simulate data compatible with bulk measurements, a large number of cells (typically hundreds) are added for each simulated sample. As such the pure cell type-specific expression in each sample, while not exactly identical, tends towards the global mean of that cell type in the source scRNA data, enforcing the unrealistic assumption that there is no systematic variation beyond cell-type proportions. One possible solution is to take into account intra-sample heterogeneity in the simulated bulk mixtures. In Chu et al.’s [15] study, they created such simulated bulk mixtures by restricting that the malignant cells aggregated to form a simulated bulk sample originating from the same biological sample. Dong et al. [23] and Menden et al. [24] implemented a simulation strategy that involved repetitively sub-sampling cells from the same patient, ensuring proper inter-sample heterogeneity. However, there is currently no general evaluation of how heterogeneity affects the deconvolution results, as compared with bulk simulation using random cells.

In this study, we introduce a novel heterogeneous simulation approach that aims at capturing accurate biological variance. Through systematic comparison of these simulation methods, we demonstrated that bulk simulation methods using random cells do not reflect realistic biological heterogeneity while our newly proposed approach does. Leveraging the varying heterogeneity levels in the simulated bulk samples, we provided an in-depth comparison of different categories of deconvolution methods using our systematic benchmarking frameworks (Additional file 1: Fig. S1), aiming to elucidate the impact of heterogeneity on the results. By summarizing deconvolution performance across experimental repeats, we found that introducing biological heterogeneity has a notable effect on the deconvolution results, with reference-free methods being most affected. Our study can guide researchers in choosing the most appropriate deconvolution methods, and the highly realistic simulation framework we proposed can facilitate further methodological development.

Results

Exploring biological variance in simulated bulk data: influence of different simulation strategies

In previous benchmarking studies [2, 9, 17], the evaluation of deconvolution performance relied on simulated bulk expression using predefined cell-type fractions as ground truth. These studies employed a “homogeneous” simulation approach, where single-cell profiles from single-cell data were combined randomly within each cell type and aggregated in proportions. However, this approach only accounts for cell-type proportion level variance and overlooks other sources of biological variance.

To address this limitation, we explored alternative simulation methods aiming at introducing more biological variance within the simulated samples. We proposed a “heterogeneous” simulation setting where cells used to compose the cell type components of a simulated bulk sample are constrained to come from the same biological samples (Additional file 1: Fig. S2), thus capturing the sample-level heterogeneity [25, 26]. We also considered a less “heterogeneous” setting where only malignant cells are originated from the same sample in the simulation, which we referred to as “semi-heterogeneous” simulation, inspired from Chu’s benchmarking work [15].

We postulated that these three simulation methods—homogeneous, semi-heterogeneous, and heterogeneous—will inherently produce samples with distinct levels of variance, reflecting varied capacity to mimic real biological complexity. To test this, we applied these simulation strategies on four distinct single-cell datasets (Table 1, Additional file 2: Table S1), resulting in a total of 12 simulated expression profiles for evaluation. For cell type fraction simulation, we adopted a beta distribution-based strategy, allowing the mean and variances of each cell-type fractions to be approximately matched to those of real data (see the “Methods” section; Additional file 1: Fig. S3). To create baseline bulk expression for variance comparison, we aggregated single cells from the same patients and used them as approximations of real bulk samples (see the “Methods” section). We also collected expression profiles from the TCGA datasets [27] when the relevant tumor type is available. A detailed description of bulk datasets used in variance comparison can be found in Additional file 2: Table S2.

Table 1 Single-cell datasets used in benchmarking

We illustrate our framework with bulk data simulated from Jerby_Arnon2018_SKCM. Utilizing the coefficient of variation (CV) of gene expressions as a measure of intra-sample variance, we first compared gene-level CV between simulated bulk samples and baseline bulk expression. Our findings revealed that the heter-simulated bulk samples exhibited variance closely aligned with that of actual bulk samples, while the homo-simulated samples displayed generally lower variability and semi-heter simulated samples failing between (Fig. 1a). Summarized gene-CV at pathway levels [43] further confirmed this finding, and by extending the CV analysis to include real TCGA bulk samples from the same tumor type, we showed that heter-simulated samples retained proper biological variance compared with real bulk samples (Fig. 1b).

Fig. 1
figure 1

Comparison of different bulk simulation strategies. a Scatter plots comparing coefficient of variation (CV) for all genes between the simulated samples and baseline expression. b Heatmap comparing average CV of genes from different biological pathways. c Boxplot comparing pairwise correlations between simulated bulk samples, with the dashed line indicating the average pairwise correlation in baseline expression. d Heatmaps showing gene correlations in bulk samples, using the top 300 most variable genes from the baseline expression. e Scatter plots comparing variance of 50 hallmark gene sets between simulated and baseline bulk expression, with each dot representing the averaged CV of genes from a hallmark gene set. In ad, all simulated datasets and baseline expression are derived from the Jerby_Arnon2018_SKCM dataset. Throughout the comparison, each simulated dataset comprises 100 simulated samples, and baseline expression is generated by aggregating single-cell expression from the same biological sample, serving as an approximation to real bulk samples

Moving forward, we calculated pairwise correlations between the simulated samples to gain further insight into biological variance (Fig. 1c). The distribution of pairwise correlations across different simulation settings revealed distinct variance levels, with homo-simulated samples showing the highest pairwise correlations, and the heter-simulated samples exhibiting the least pairwise similarities, aligning more closely with real-world settings. Moreover, as we visualized pairwise correlation between genes under different simulation settings, we found that the heter-simulated samples effectively captured proper gene correlations (Fig. 1d). They maintained appropriate gene clusters, similar to those seen in baseline bulk expression, and demonstrated reasonable coefficient correlations. In contrast, homo-simulated samples resulted in false-positive gene clustering structures and spuriously high gene correlations. We note that in this comparison we only focused on the top 300 most variant genes due to the immense scale of the gene-gene correlation matrix; the same patterns hold when specifically considering cell-type marker genes as well (Additional file 1: Fig. S4, Additional file 2: Table S3).

In Fig. 1e, we extended the variance comparison to all datasets by comparing pathway-level variance in simulated samples against real bulk expressions. Our results showed that across each dataset, the heter-simulated samples closely mirrored the actual variance observed in real data, exhibiting pathway-level CVs that align closely with those of real bulk samples, whereas the homo-simulated samples are in general less variable. We note that for bulk data simulated from Riemondy2022_MB dataset, the semi-simulated and heter-simulated samples show little difference with respect to pathway variances; this is because the simulated MB samples are mainly dominated by malignant cells, making it less distinguishable between two strategies.

In addition to the three simulation methods mentioned above, we also investigated four additional bulk simulation approaches: the “favilaco” method [9], the “immunedeconv” method, [17], the “SCDC” method [23], and one “heterogeneous” simulation method that does not rely on single-cell sampleID information, which we referred to as the “heterogeneous-sampleIDfree” method (see the “Methods” section). These expanded methodologies, along with our original simulation strategies, have been integrated into our published package, offering a comprehensive toolkit for bulk simulations.

An extensive comparison of all seven simulation methods is detailed in Additional file 1: Fig. S4-S7. Our analysis revealed that the “heterogeneous-sampleIDfree” method successfully maintained comparable variance without the constraints of sample ID dependency, offering a significant advantage for single-cell datasets where sample IDs may be limited or absent. Conversely, the other two methods, “immunedeconv” and “favilaco,” which were originally implemented for deconvolution benchmarking studies [9, 17], exhibited a notable lack of variance in simulated samples. Finally, the “SCDC” method [23], although it achieved variance comparable to that of real bulk samples, carries the risk of generating empty gene expression values. This is due to its reliance on repeated sub-sampling from the same patient, which can easily suffer from the sparsity issue in single cells [44] when aggregating over only a limited number of cells.

Together, our results suggested that different simulation strategies result in simulated bulk samples with diverse levels of biological variance. Specially, step** through homogeneous, semi-heterogeneous, and heterogeneous simulation, the heterogeneity level inside samples is increasing with the final heterogeneous simulation closely retaining the characteristics observed in real bulk samples.

Bulk simulation using random cells ignores heterogeneity within constituent cell types

Heterogeneity of tumors between different patients with the same tumor type has long been recognized [45]. Despite similar histological appearance, different patients can have intrinsically different genomic landscapes. In clinical practice, this heterogeneity motivates molecular subty** and enables personalized treatment protocols [46, 47]. Retaining biological heterogeneity within simulated bulk samples is essential for realistic bulk simulation.

To illustrate the limitation of bulk simulation using randomly selected cells (namely homogeneous simulation), we considered a simulation setup using single cell Medulloblastoma (MB) dataset from Riemondy et al. [31]. Medulloblastoma is a well-recognized heterogeneous brain cancer with four distinct subtypes based on genetic characteristics: WNT, SHH, Group 3, and Group 4 [48, 49]. Analyzing the tSNE clustering of malignant populations (Fig. 2a), we found that cells from the same subtype predominantly clustered together, revealing marked differences between subtypes. Additionally, within each subgroup, further patient-specific heterogeneity was observed. We note that such intra-heterogeneity of malignant cells extends beyond this scRNA dataset and is also found in other tumor types (Additional file 1: Fig. S8). Additionally, beyond just malignant populations, non-malignant cells may also exhibit diverse patterns across different samples (Additional file 1: Fig. S9). Together, these findings suggested even among cells identified as the same cell type, intra-tumor heterogeneity is not uncommon.

Fig. 2
figure 2

Bulk simulation using random cells failed to retain intra-tumor variations. a tSNE plot of = 31,823 malignant cells from 28 medulloblastoma patients in Riemondy2022_MB dataset, colored by MB subtypes (left) and patient ID (right). b tSNE plot showing that 500 randomly selected malignant cells are evenly dispersed among 31,823 malignant cells. c Barplot showing the average expression levels of 22 MB subtype-specific genes in different groups of samples, colored by different MB subtypes. Upper panel: average expression levels of these genes across different MB subtype patients from Riemondy2022_MB; bottom panel: average expression values in 100 homogeneous simulated bulk samples, with error bars indicating the 10th and 90th quantiles. d Diverging bar plot comparing variance of hallmark gene-sets between homo-simulated and patient-specific bulk expression from Riemondy2022_MB, with patient-specific expression defined as the aggregated expression of single cells from the same patient

Despite this prevalent intra-heterogeneity, random selection of cells results in an evenly distributed selection of cells (Fig. 2b), and such selection, if performed repeatedly, will create a homogeneous expression profile with low variance. In Fig. 2c, we simulated a bulk expression dataset in this manner and compared it to actual MB patient profiles (Additional file 2: Table S2). By analyzing the expression levels of 22 MB-subtype specific genes (Additional file 2: Table S4) [46], we observed distinct expression patterns across different MB subtypes in real patient profiles, whereas the simulated samples exhibited minimal heterogeneity in expression values of these genes, with the 10% and 90% expression quantiles fluctuate around the average level.

Moreover, analyzing the variation at the pathway level using the hallmark genes [43], we found that systematic pathway-level variance is pervasive in real data but is not recapitulated in the random cell simulation (Fig. 2d). Together, these results suggested that bulk simulation methods employing random single-cell selection overlook meaningful biological variability, resulting in a “homogeneous” profile characterized by low variance.

Benchmarking framework

To systematically evaluate the performance of different categories of deconvolution methods and examine how different bulk simulation strategies will impact the performance, we designed a benchmarking framework as depicted in Additional file 1: Fig. S1. The deconvolution methods we included span four categories of deconvolution methodologies as mentioned in previous sections: reference-free, regression-based, marker-based, and Bayesian method.

For reference-free methods, we selected debCAM (referred to as CAMfree) [12] and linseed [14]. For regression-based methods, we included five regression algorithms that have been previously developed or implemented for deconvolution: MuSiC [50], Robust Partial Correlations (RPC) [51], weighted robust linear regression (wRLM) [52], CIBERSORT [53], and non-negative least squares (nnls) [54]. For marker-based method, we considered debCAM-marker [12], TOAST-marker [55], and gsva [56]. For Bayesian method, we explored the recently published BayesPrism [15].

Procedures to evaluate deconvolution results vary in terms of whether the agreement between ground truth and inferred proportions is assessed by correlation or squared error and whether performance is evaluated per-cell type or globally. We focused our evaluation on per cell-type Pearson correlation, which reflects the accuracy of downstream inference such as the difference in proportions between two groups. We also calculated root mean square error (RMSE) values, which evaluate if the inferred proportions are correct on the absolute scale across different cell types, with smaller RMSE indicating better performance. The deconvolution pipelines including simulation, deconvolution, and evaluation are then applied to eight published single-cell cohorts (Table 1) and repeated 10 times for each cohort.

Regression-based methods differ in their robustness to heterogeneity

Regression based approaches dominate the deconvolution field with many available methods and extensive independent benchmarking [9, 50, 57]. All regression methods fit a model that assumes that bulk expression matrix \(Y_{g \times s}\) with g genes and s samples, can be expressed as a \(Y_{g \times s}=X_{g \times k} P_{k \times s}+E_{g \times s}\), where P is the proportion matrix and k is the number of cell-types. X is the given reference matrix and the task is to fit P, which is a general regression problem.

A first decision in regression-based approaches involves the construction of the reference matrix X, a process termed as hyper-parameter reference construction. Selecting an optimal subset of genes is crucial for effective performance [58], while utilizing the entire gene set often results in poor outcomes (data not shown). The second methodological choice is the form of the regression problem itself, whether to use constraints, feature weights, and how to formulate the loss. For example: squared loss with constraints gives the basic non-negative least squares formulation (nnls). Alternative approaches involve weighted regression, robust regression, epsilon-insensitive loss (CIBERSORT), etc. Importantly, since the feature selection and regression are decoupled, we can combine different methods arbitrarily.

In our benchmarking work, we investigated four different reference matrix construction methods: CIBERSORTx [59], autogeneS [60], and two marker selection methods derived from differential expression (DE) analysis: limma [61] and scran (see the “Methods” section for details). Additionally, we considered an “all genes” setting, which is applicable exclusively to MuSiC, where all genes are utilized by default and no feature selection is performed.

Implementing these reference construction approaches to different regression methods, we systematically examined all possible combinations of these methodologies across eight distinct cohorts (Table 1), with 10 simulations for each cohorts (excluding “all genes” from methods other than MuSiC). A representative result using bulk data simulated from the Puram2017_HNSCC dataset is presented in Fig. 3a. We found that in the homogeneous setting, the results from different methodological combinations are remarkably consistent, with correlations spanning from 0.98 to 1. However, the performance of methods diverges with increasing heterogeneity. The results on all datasets is presented in Fig. 3b, where we showed that the performance of all methods decays as heterogeneity is added, while some methods decay less rapidly. We further summarized the influence of various methodological decisions by fitting a multi-linear regression to the Pearson correlation results with dataset and cell-type as a covariate (Fig. 3c; see the “Methods” section).

Fig. 3
figure 3

Impact of bulk simulation strategies on regression based deconvolution. a Heatmap comparing the deconvolution performance of regression models (rows) under different bulk simulation strategies, as evaluated by average Pearson correlation (upper panel) and average RMSE values (lower panel) over 10 experimental repeats for the Puram2017_HNSCC dataset. The columns of the heatmaps correspond to different reference construction methods, with the “all genes” column being specifically applicable to MuSiC, suggesting that all genes will be used as the input, which is the default setting for MuSiC. b Line plot comparing Pearson r of regression-based methods under various simulation strategies across eight different datasets, with each row representing a reference construction choice and the error bars indicating the min and max level of Pearson r over 10 experimental repeats. c Box plot showing the effect size (estimated by coefficients from the multivariate regression model) of different methodological choices in deconvolution performance across different simulated bulk conditions. The error bars indicate the 95% confidence intervals of the coefficients

Altogether these results revealed that effect of methodological choices to be more pronounced with increased heterogeneity levels. Considering the summary in Fig. 3c, we find that in the homogeneous setting the reference choice method contributes relatively little (0.0067 to 0.016) and the coefficients are not significantly different. However, as heterogeneity levels in the bulk samples increase, the choice of reference becomes increasingly important and the CIBERSORTx reference dominates compared to other methods. We also note that the recently proposed optimization-based method autogeneS is indeed best in the homogeneous setting but is no better than differential expression-based method under heterogeneous settings.

Regarding regression choices, MuSiC consistently outperforms the others, irrespective of the chosen reference, under heterogeneous settings (Fig. 3b, c). On the other hand, nnls demonstrates the highest sensitivity to heterogeneity, exhibiting a sharp decline in performance as heterogeneity escalates and consistently ranking as the least effective among the regression models. The method of intermediate performance wRLM, RPC, and CIBERSORT are conceptually similar in that they make the regression problem robust in the technical statistical sense of robustness to outliers. This achieved by altering the loss function from squared loss to a function that grows less rapidly: epsilon insensitive loss for CIBERSORT and Huber loss for both RPC and wRLM (both use R based “rlm” function with default parameters). Overall these three conceptually similar approaches are indistinguishable when using sophisticated reference algorithms (CIBERSORTx or autogeneS, Fig. 3b top two rows), except that CIBERSORT regression lags behind the other two on DE-based references in four out of eight datasets. In Additional file 1: Fig. S11, we expanded our analysis to include another robust regression-based method, FARDEEP [71]. We included major immune cell types from their annotations for further study: DC, Neutrophil, NK cell, T cells. For all the macrophage subpopulations: chemokine myeloid, complement myeloid, M2-activated myeloid, and non-activated microglia, we relabeled them into macrophages to ensure a reasonable resolution of cell types. Immune cells that are classified as “Proliferate” or do not have any subty** label are excluded from further study. For ovarian cancer dataset Izar2020_OV, we excluded cell type “Erythrocyte” due to limited single-cell size. For the remaining scRNA datasets, we used their original cell-type labels.

Bulk data used for variance comparison

When comparing the variance of the simulated bulk expression, we considered two types of “real” bulk data for comparison. First, baseline expression: we aggregated single cells from the same biological samples within the scRNA-seq data, which serves as an approximation of actual bulk samples [72, 73]. Specifically, this was achieved by calculating the mean expression from the CPM-normalized expression of cells from the same biological samples. The resulting profiles maintain consistent library sizes and are ready for variance analysis. Second, real bulk expression data from The Cancer Genome Atlas (TCGA): we accessed the TCGA cohorts from https://xenabrowser.net/. The gene expression data was acquired in HTSeq-FPKM format and subsequently transformed to Transcript Per Million (TPM) for variance comparison. Only samples from primary and metastatic sites were selected for the analysis. A detailed description of the datasets used for variance comparison can be found in Additional file 2: Table S2.

Simulation of cell-type frequencies

Cell-type fractions simulation from beta distribution

To introduce variances into the cellular compositions of simulated bulk samples, we simulated cell-type frequencies that are close to that in real bulk samples. The cell-type proportions of each patient from the scRNA dataset were used as an approximation to the cell-type frequencies of real bulk samples. We fitted a beta distribution for each cell type and drew random values from the fitted distribution as the simulated frequencies. Randomly selected frequencies for different cell types are then scaled and summed to one for each simulated bulk sample. This fraction simulation method, refereed to as “beta” method in Additional file 1: Fig. S3, is implemented throughout the simulated bulk expression in this study.

Cell-type fractions simulation from external resources

In Additional file 1: Fig. S3, we evaluated the performance of our fraction simulation approach by comparing it with an external method employed by Avila Cobos et al. [9] (referred to as “favilaco”), and a basic fraction simulation utilizing the Dirichlet distribution. The “favilaco” method, which is part of the bulk simulation pipeline proposed by Avila et al., is now available as the \(bulkSimulator\_favilaco()\) function in the deconvBenchmarking package. This function simultaneously simulates bulk expression data and fractions, which we used to extract and compare the simulated fraction distributions with other methods.

For the Dirichlet-based simulation, we set the shape parameter \(\alpha\) to reflect the relative abundance of each cell type. Specifically, \(\alpha\) is set to the cell type frequencies from the single-cell dataset used for bulk simulation and adjusted by a dispersion parameter to modulate the spread of the distribution. We considered a set of varying levels of dispersion parameters 0.01, 0.05, 0.001 in the simulation. The Dirichlet-based simulation was conducted using the \(fracSimulator\_Dirichlet()\) function from the deconvBenchmarking package.

By comparing the distribution of the simulated fractions under different settings, we showed that “favilaco”-based method significantly deviates from the baseline’s mean and variance, and while Dirichlet distribution exhibits comparable distributions, it requires additional tuning of the dispersion parameter, making it less efficient than the beta distribution based method we proposed.

Bulk simulation strategies

Using the same source scRNA-seq dataset and simulated cell-type fractions, each simulated bulk expression in this study was comprised of 100 simulated samples, created using the following strategies:

Homogeneous simulation

We standardized the widely used bulk simulation method that aggregates over random cells as homogeneous simulation. Specifically, in each simulated sample, n single cells are aggregated linearly, with their proportions aligned to the simulated cell-type frequencies. The value of n is set to approximate the typical number of single cells of a biological sample from the scRNA data (Additional file 2: Table S1).

Semi-heterogeneous simulation

We restricted that the malignant parts of each simulated bulk sample come from the same patient, while the non-malignant parts are randomly selected regardless of where they are from. Specifically, for each simulated bulk sample i, the malignant expression signal come exclusively from a randomly selected patient’s malignant profile \(C_{malignant}\) and is weighted according to the simulated malignant fraction, and the non-malignant single cells are randomly selected and weighted according to the corresponding simulated frequencies.

Heterogeneous simulation

We restricted that both malignant and non-malignant parts of each simulated bulk sample come from the same biological sample. Specifically, for each simulated bulk sample i, given a cell-type k, the expression signal of cell-type k comes exclusively from a randomly selected patient’s k profile \(C_k\) and is weighted according to the simulated fraction (Additional file 1: Fig. S2).

Note that in both semi-heterogeneous and heterogeneous simulation settings, we employed additional randomization to prevent using the exact same cells across different simulated samples. Specifically, when aggregating patient-specific single cells to create a unique patient profile, we randomly select between 50 and 100% of the single cells. Moreover, we introduced a threshold parameter to specify the limited number of cells used for aggregation. If the number of patient-specific single cells falls below this threshold, we aggregate across multiple samples to prevent sparsity issues [44] within the cell-type specific profile. These two methods are now implemented as the \(bulkSimulator\_semi()\) and \(bulkSimulator\_heter()\) function in the deconvBenchmarking package.

sampleID independent heterogeneous simulation

We restricted that each cell-type component in the simulated bulk sample is constrained to originating from the same sub-cluster. Specifically, sub-cluster information for each cell type is obtained using the quickCluster() function from scran R package, with the min.size parameter set to 10 [74].

Bulk simulation from external resources

We incorporated three additional published approaches for bulk simulation comparison: the “favilaco” method [9], the “immunedeconv” method [17], and the “SCDC” method [23]. Note that only the “immunedeconv” method supports user-provided fractions; therefore, we passed the simulated fractions to this method; the “favilaco” and “SCDC” methods do not support user-provided fractions, so we retained their default settings in the bulk simulation. These methods are implemented as \(bulkSimulator\_favilaco()\), \(bulkSimulator\_SCDC()\), and \(bulkSimulator\_immunedeconv()\) function in the deconvBenchmarking package, which we used for bulk simulation.

Calculation of biological variance in bulk samples

The following statistics are calculated as indicators of biological variance. Note that the baseline expression referred to below represents pseudobulk samples from single-cell expression, which is obtained by aggregating single cells from the same patients, as an approximation of real bulk sample.

Coefficient of variation (CV)

For each gene i in the simulated and baseline bulk samples, we calculated CV values on the log transformed expression using the following formula:

$$\begin{aligned} CV_{i} = \frac{\sigma _i}{\mu _i} \end{aligned}$$

Average coefficient of variation (CV) for biological pathways

We downloaded the hallmark gene list from https://www.gsea-msigdb.org/gsea/msigdb/ and calculated the average CV values for genes included in each genelist, which is used as indicators for pathway-level variance.

Pairwise correlations between genes

We considered two sets of genes for calculating the gene-gene correlation matrix: (1) the top 300 most variable genes from the baseline bulk expression and (2) the cell type marker genes derived from limma-based differential expression analysis. The detailed listing of the genes used can be found in Additional file 2: Table S3. Within each simulated bulk expression and the baseline expression, we calculated the Pearson correlations between these genes and visualized them in a heatmap [75, 76].

Pairwise similarities between samples

Using the top 300 most variable genes from the baseline bulk expression, we calculated the pairwise Pearson correlation between samples for each simulated bulk expression. We then visualized the distribution of these statistics using a boxplot (Additional file 1: Fig. S7).

Reference construction

To generate necessary input for reference-based methods (regression-based, marker-based, and BayesPrism), we applied the following reference construction methods using the training cells.

In particular, for interchangeability between the signature matrices and cell type markers, we applied the \(refMarkers\_sigMatrixList()\) and \(refMatrix\_markerList()\) functions we developed in the deconvBenchmarking R package. Specifically, to create a signature matrix from a set of gene lists, we first averaged the expression values of single cells from the same cell type, resulting in a raw gene-by-cell-type matrix; we then refined this matrix by subsetting it with the marker genes. To identify cell-type markers from a given reference matrix, we assigned each gene to the cell type with the highest expression.

We note that for MuSiC, the regression-based method in our benchmarking, the values generated in the reference matrix are not directly used by MuSiC itself. Instead, they serve as a feature selection step, where the genes identified in the reference matrix are used as the marker input for MuSiC.

CIBERSORTx signature matrix

We used the “Create Signature Matrix” module from the online CIBERSORTx portal (https://CIBERSORTx.stanford.edu) to generate the signature matrix, with the training expression as input and all the parameters set to default values. We note that CIBERSORTx has a size limit for the input, so we downsampled the training cells and shrunk the input size when necessary. The resulting signature matrices are used directly as input for regression-based methods. Note that the CIBERSORTx-based reference typically contains thousands of genes in its signature. When converting this reference matrix to cell-type marker genes using the \(refMatrix\_markerList()\) function, we employed the \(maximum\_n\) = 100 parameter to ensure that each cell type can have at most 100 marker genes, prioritized by the fold change in the reference matrix.

autogeneS signature matrix

The python package autogeneS [60] is utilized for signature matrix construction. We used the following parameters in the optimize function: \(ngen=3000, seed=0, mode = fixed, nfeatures=400\). The resulting optimized reference matrix with pareto index 0 is selected as autogeneS signature matrix.

limma derived cell-type specific markers

R package limma [61] is used to identify cell-type-specific markers that are differentially expressed within each cell-type. Specifically, the one-against-rest comparison is performed for the statistical test, comparing each cell-type against all other cell-types combined. Genes with a log fold change greater than 2 are considered as cell-type specific markers, and a parameter \(minimum\_n = 5\) is introduced to ensure a sufficient number of markers for each cell type. Cell types with fewer than \(minimum\_n\) genes passing the log2 fold change threshold will be excluded from the marker list. Note that for the Riemondy2022_MB and Jerby_Arnon2018_SKCM dataset, the log fold change threshold is relaxed to 1 to ensure proper number of cell-type specific markers.

scran-derived cell-type-specific markers

The BayesPrism :: get.exp.stat() function, which incorporates the pairwiseTTests() function from the scran package, is used to identify cell-type specific markers. Specifically, pairwise comparison between cell types is performed for the statistical test [74]. The marker gene filtering process here utilizes the same log fold change threshold and \(minimum\_n\) parameter as applied in limma derived marker identification.

Reference for BayesPrism

We designated cell type labels from the single-cell dataset as cell.type.label and subclassified malignant populations based on their biological origins as cell.state.label in the BayesPrism :: new.prism() function, the same BayesPrism reference construction strategy used in Hippen et al.’s study [65]. We adhered to the reference derived directly from the scRNA data, bypassing the reference update step in BayesPrism, which entails generating a new reference from initial deconvolution results. This decision was based on our observation that utilizing the updated reference contributes minimally to performance enhancement (Additional file 1: Fig. S17).

Highly variable (hv) genes

For computational efficiency, we selected highly variable genes as candidates to run autogeneS and DE analysis. We used the plot.scRNA.outlier() function from R package BayesPrism to calculate the maximum cell-type specificity score for each gene. Genes with max.spec greater than a threshold value (0.5 for autogeneS and 0.3 for DE analysis) are selected for downstream analysis. This filtering narrows down the gene candidates from more than 10 thousands to thousands.

Deconvolution methods

Deconvolution methods are applied to the simulated bulk samples in their linear scale (non-log transformed) following recommendations from previous benchmarking studies [9, 16]. All methods evaluated in the study are wrapped in the deconvBenchmarking R package we developed: https://github.com/humengying0907/deconvBenchmarking.

Evaluation of deconvolution performance

Pearson correlation and root mean square error (RMES) values are used to evaluate the accuracy of different deconvolution methods. Specifically, for Pearson correlation, we calculated per cell-type correlations by comparing estimated fractions to known fractions within each cell type, with higher Pearson r corresponding to better performance. This process results in a set of per cell-type correlations for each simulated bulk dataset. These correlation values are then averaged to yield the overall correlation performance score for a given deconvolution method. For RMSE values, we focus on a global comparison between the estimated and known fractions for all cell types altogether, with smaller RMSE values indicating a lower absolute difference and thus better performance.

The averaged Pearson r across experimental repeats is used to represent the overall performance of a method. Variation in Pearson r and RMSE values across experimental repeats is visually inspected by comparing the minimum and maximum level across 10 experimental repeats, which help evaluate the reproducibility and stability of each method.

For reference-free deconvolution methods, where cell-type labels are not explicitly provided, we calculated pairwise Pearson correlations between the estimated and the known fractions and assign the unnamed cell type to the cell type with which it showed the highest correlation (Additional file 1: Fig. S18).

Multi-linear regression model of deconvolution performance

In order to evaluate the impact of methodological choices on deconvolution performance for regression-based methods, we utilized a linear modeling approach. Specifically, we fitted a linear model using the lm() function in R to predict Pearson correlation coefficients based on various predictors, incorporating dataset and cell types as covariates. The model formula employed was:

$$\begin{aligned} \text {model} = lm( \text {correlation} \sim 1 + \text {regression}\_\text {choice} + \text {reference}\_\text {choice} + \text {dataset} \times \text {cell}\_\text {type}) \end{aligned}$$

We designated nnls as the baseline regression choice, and no selection with MuSiC regression as the baseline reference choice for comparison. The coefficients obtained from the model indicate, conditional on all other effects, the relative impact of a specific methodological choice over the baseline. Higher coefficients indicate a more significant impact of this methodological choice on overall performance. The model was fitted separately for bulk data simulated from different approaches, in order to explore the effect shift of methodological choices under different conditions.