Background

Liquid biopsy, the molecular analysis of body fluids, has emerged as a promising tool in cancer research, offering a more accessible assessment of patient health status compared to traditional tissue biopsies. Advancements in technology have enabled extensive genomic and transcriptomic analysis of DNA and RNA, especially in blood [1,2,3,4]. Among the cell-free nucleic acids being studied, cell-free RNA (cfRNA) has garnered increasing attention [5], including for its tissue and cell-type specificity [6,7,8].

Cellular deconvolution, a powerful computational approach, is used to determine the cellular origin of RNA in mixed transcriptomic data, such as bulk RNA-seq [9]. A multitude of cell types contributes to the blood cell-free transcriptome [6]; as a result, comparisons between such heterogeneous samples may obscure critical differences that are driven by only a few cell types [10]. Knowing which specific cell types are responsible for the differences observed in the blood during, for example, carcinogenesis will provide a more comprehensive characterization of cell-free transcriptome perturbations [11]. Furthermore, identifying the relevant cell types may accelerate the development of more targeted diagnostic strategies.

Cellular deconvolution using cfRNAs has been demonstrated to yield promising results in an increasing number of studies [6, 7, 9]. The deconvolution algorithm performs its own internal gene filtering.

To generate principal component analysis (PCA) plots, samples were separately normalized with the variance stabilizing transformation using the function “vst” from the R package DESeq2 (version 1.34.0) [39]. Plots were generated with the R package ggplot2 (version 3.4.1) [40].

Cell-type deconvolution

Cell-type deconvolution was performed using the R package Bisque (version 1.0.5) [52] with a reference single-cell dataset. As the study focused on liver cancer, we used a liver-derived single-cell dataset hypothesized to predominantly capture liver-specific signals from the data. We selected the reference single-cell dataset in accordance with the guidelines set by the authors of the Bisque algorithm, which stipulate a minimum of three single-cell samples [52], and ensured that it featured robustly defined cell-type annotations. The reference single-cell dataset was generated by MacParland et al. from the livers of five healthy donors and contained cell type annotations for 8,444 cells [41]. The dataset containing log2CPM values and the corresponding annotation file were downloaded from the GEO database (accession number GSE115469). The single-cell reference data and the cell-free datasets were transformed into ExpressionSet class objects with the function “ExpressionSet” from the R package Biobase (version 2.54.0) [42]. To facilitate the cell-type deconvolution, the cell subtype annotations for hepatocytes, T cells, macrophages and liver sinusoidal endothelial cells (LSECs) were collapsed.
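The collapsing of cell subtype annotations described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the subtype label names and the prefix-matching rule are hypothetical, and the actual annotation file for GSE115469 may label subtypes differently.

```python
from collections import Counter

# Hypothetical prefix-to-parent mapping; real subtype labels may differ.
COLLAPSE_PREFIXES = {
    "Hepatocyte": "Hepatocytes",
    "T_Cells": "T cells",
    "Macrophage": "Macrophages",
    "LSECs": "LSECs",
}


def collapse_label(label):
    """Map a subtype label to its parent cell type; keep other labels as-is."""
    for prefix, parent in COLLAPSE_PREFIXES.items():
        if label.startswith(prefix):
            return parent
    return label


# Hypothetical per-cell annotations, one label per cell.
cell_labels = [
    "Hepatocyte_1",
    "Hepatocyte_2",
    "T_Cells_alpha_beta",
    "Macrophage_inflammatory",
    "LSECs_periportal",
    "Cholangiocytes",
]

collapsed = [collapse_label(label) for label in cell_labels]
counts = Counter(collapsed)
print(counts)
```

Collapsing subtypes in this way reduces the number of closely related reference profiles, so that the deconvolution estimates one proportion per broad cell type.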

Finally, decomposition was carried out using the function “ReferenceBasedDecomposition” from the package Bisque with the parameter “use.overlap = FALSE” for each dataset. The Chen et al. dataset with the additional non-liver solid tumor samples was analyzed separately and was not used in the modeling steps.

Statistical test computation

To test if hepatocyte proportions were greater in liver cancer samples compared to other samples, a one-sided, unpaired Wilcoxon test (Wilcoxon rank-sum test) was calculated using the deconvolution results of all samples with the function “wilcox_test” from the R package rstatix (version 0.7.2) [43]. To this end, the parameters “paired = FALSE,” “exact = TRUE” and “alternative = "greater"” were used. For multiple comparisons, p-values were adjusted using the Benjamini–Hochberg method with the “adjust_pvalue” function and parameter “method = "BH"” from the R package rstatix. Effect size (r) and corresponding confidence intervals were generated with the function “wilcox_effsize” using the parameters “alternative = "greater",” “paired = FALSE,” “nboot = 100” and “ci = TRUE” from the R package rstatix.
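To make the multiple-comparison step concrete, the following is a minimal Python sketch of the Benjamini–Hochberg adjustment. It is a stand-in for illustration only; the study used “adjust_pvalue” from rstatix, and the function name `bh_adjust` and the example p-values here are hypothetical.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Walk from the largest p-value to the smallest so that a running
    # minimum enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = bh_adjust(raw)
print(adjusted)
```

Each raw p-value is scaled by the number of tests divided by its ascending rank, and the running minimum ensures that a larger raw p-value never receives a smaller adjusted value.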

To test if the hepatocyte proportions were greater in the plasma compared with extracellular vesicles (EVs) of five liver cancer patients, a one-sided, paired Wilcoxon test (Wilcoxon signed-rank test) was performed as previously described, with the only change being the parameter “paired = TRUE.” Effect size and corresponding confidence intervals were calculated as previously described, with the only change being “paired = TRUE.” The results were visualized with the R packages rstatix, ggpubr (version 0.5.0) [44] and ggplot2.
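With only five pairs, the exact one-sided signed-rank test has a coarse set of attainable p-values. The Python sketch below enumerates the exact null distribution to show this. It is an illustration under stated assumptions, not the rstatix implementation: it assumes no zero or tied differences, and the proportion values are made up.

```python
from itertools import product


def signed_rank_greater_exact(x, y):
    """Exact one-sided (x > y) Wilcoxon signed-rank test for small samples.

    Returns (V, p), where V is the sum of ranks of the positive differences.
    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    abs_diffs = [abs(d) for d in diffs]
    if 0 in abs_diffs or len(set(abs_diffs)) != n:
        raise ValueError("sketch assumes no zero or tied differences")

    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank

    v = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null, each rank is positive or negative with equal chance,
    # so enumerate all 2^n sign assignments.
    at_least_v = 0
    for signs in product((0, 1), repeat=n):
        total = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if total >= v:
            at_least_v += 1
    return v, at_least_v / 2 ** n


# Hypothetical hepatocyte proportions for five paired samples.
plasma = [0.35, 0.42, 0.31, 0.38, 0.45]
ev = [0.30, 0.33, 0.29, 0.31, 0.32]
v_stat, p_value = signed_rank_greater_exact(plasma, ev)
print(v_stat, p_value)
```

When all five differences are positive, V = 15 and the exact one-sided p-value is 1/32 = 0.03125, which is the smallest value this test can return with five pairs.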

Hepatocyte proportion-based classification

To analyze the feasibility of classifying liver cancer and healthy samples based on hepatocyte proportions, we tested 20 hepatocyte proportion cutoffs ranging from 0.2 to 0.4 in all cfRNA datasets, with samples above the cutoff classified as liver cancer (LC) patients and all others as healthy donors (HD). Accuracy, sensitivity and specificity were computed at each cutoff with the function “confusionMatrix” from the R package caret (version 6.0-93) [45] and were used to generate a scatter plot using the R package ggplot2. A confusion matrix plot was generated at the cutoff with the highest classification accuracy with the function “evaluate” and a modified version of the function “plot_confusion_matrix” from the R package cvms (version 1.3.9.9000) [46].
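The cutoff sweep can be sketched in a few lines of Python. This is an illustration rather than the caret-based code used in the study: it assumes the 20 cutoffs are evenly spaced with both endpoints included (the text does not state the spacing), and the proportions and labels are hypothetical.

```python
def cutoff_grid(start=0.2, stop=0.4, n=20):
    """Evenly spaced cutoffs including both endpoints (an assumption)."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def evaluate_cutoff(proportions, labels, cutoff):
    """Classify samples above the cutoff as LC, all others as HD."""
    tp = fp = tn = fn = 0
    for proportion, truth in zip(proportions, labels):
        predicted = "LC" if proportion > cutoff else "HD"
        if predicted == "LC" and truth == "LC":
            tp += 1
        elif predicted == "LC" and truth == "HD":
            fp += 1
        elif predicted == "HD" and truth == "HD":
            tn += 1
        else:
            fn += 1
    # Assumes both classes are present, so the denominators are non-zero.
    return {
        "cutoff": cutoff,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical hepatocyte proportions and true labels.
proportions = [0.15, 0.22, 0.28, 0.25, 0.35, 0.41, 0.45, 0.38]
labels = ["HD", "HD", "HD", "HD", "LC", "LC", "LC", "LC"]

results = [evaluate_cutoff(proportions, labels, c) for c in cutoff_grid()]
best = max(results, key=lambda r: r["accuracy"])  # first cutoff with top accuracy
print(best)
```

Raising the cutoff trades sensitivity for specificity, which is why all three metrics are tracked across the grid before a single cutoff is selected for the confusion matrix.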

Model construction

Random forest

We built random forest models both to determine the relative importance of cell types between biological conditions and sample sources and to evaluate the diagnostic capabilities of various predictors. First, random forest models were built with each dataset using the cell-type proportions as input using the function “randomForest” with the parameter “importance = TRUE” from the R package randomForest (version 4.7-1.1) [47]. Afterward, the generated models were used as input for the function “varImpPlot” with the parameter “type = TRUE” from the R package randomForest, which calculates how much the model accuracy decreases without a certain predictor (feature). Finally, the results were visualized using the R package ggplot2.
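The idea behind an accuracy-decrease importance measure can be sketched with a simplified permutation scheme in Python. This is not the randomForest implementation, which permutes predictors on out-of-bag samples for each tree; here a fixed threshold rule stands in for a fitted model, and all names and values are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == truth for row, truth in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature across samples."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        # Copy each row, replacing only the shuffled feature.
        permuted = [dict(row, **{feature: value}) for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical stand-in for a fitted classifier; uses one feature only."""
    return "LC" if row["hepatocytes"] > 0.3 else "HD"


# Hypothetical cell-type proportions per sample.
rows = [
    {"hepatocytes": 0.15, "t_cells": 0.30},
    {"hepatocytes": 0.22, "t_cells": 0.25},
    {"hepatocytes": 0.28, "t_cells": 0.35},
    {"hepatocytes": 0.25, "t_cells": 0.20},
    {"hepatocytes": 0.35, "t_cells": 0.28},
    {"hepatocytes": 0.41, "t_cells": 0.22},
    {"hepatocytes": 0.45, "t_cells": 0.31},
    {"hepatocytes": 0.38, "t_cells": 0.27},
]
labels = ["HD", "HD", "HD", "HD", "LC", "LC", "LC", "LC"]

importance = {
    feature: permutation_importance(toy_model, rows, labels, feature)
    for feature in ("hepatocytes", "t_cells")
}
print(importance)
```

A feature the model does not rely on shows no accuracy drop when shuffled, whereas shuffling an informative feature degrades accuracy; ranking features by this drop gives the importance ordering.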

To assess the performance of predictors, we trained a model with the Roskams-Hieter et al. dataset, chosen for its balanced structure and informative sample composition (Additional file 1: Fig. S1), using either the raw counts of gene markers reported by Roskams-Hieter et al. [24] and Chen et al. [2: File S1).

In light of the strengths of each diagnostic model and the enhanced performance of the combined gene marker model, we decided to integrate some of the cellular deconvolution results into the combined gene marker model. Based on previous results, we integrated the proportions of hepatocytes, cholangiocytes, PECs and LSECs. The new, integrated model displayed the highest overall accuracy among all models and closely matched the sensitivity and specificity of the deconvolution and gene marker models. The integrated model thus enables more comprehensive modeling of liquid biopsy data, incorporating not only gene marker expression but also additional data, such as cell-type proportions. We expect the performance of integrated models to improve as they incorporate more, and more varied, types of liquid biopsy data. In particular, given the relatively low sensitivity of prospective liquid biopsy assays, incorporating cell-type proportions from targeted cellular deconvolution can mitigate that issue to a degree.

Potential confounding factors remain a major issue for the clinical adoption of liquid biopsy. A comprehensive exploration of potential sources of variation in the blood cell-free transcriptome can mitigate these concerns. While our analysis showed that age, one of the major confounders in liquid biopsy [61, 62], had no discernible effect on the efficacy of the targeted cellular deconvolution model for either male or female samples, we identified sample generation date as vitally important. Although sometimes unavoidable, extended storage of blood samples, especially in improper conditions, should be avoided whenever possible for optimal outcomes. Further exploration is needed to identify other unknown confounders and, where possible, mitigate their adverse effects.

Conclusions

In this study, we showed the viability of liquid biopsy studies that are translatable across different conditions. Furthermore, we highlighted the potential of targeted cellular deconvolution, and of deconvolution in general, for blood cell-free transcriptomic studies, where it can improve cfRNA characterization and assist in the development of enhanced diagnostic assays.

In the future, we envision the application of targeted cellular deconvolution to other conditions and the expansion of our data through deeper analysis of, for example, liver cirrhosis-derived samples. The increase in assay accuracy gained by adding cell-type proportion data to other liquid biopsy biomarkers, in the framework of “integrated liquid biopsy,” can facilitate its clinical adoption. Finally, as new liquid biopsy datasets are continuously being generated, the need for meta-analyses, comparison and integration of diverse and extensive information will continue to grow, and we expect more gene markers to be discovered. We believe that the strategies outlined in this study will contribute to these efforts and expedite the clinical adoption of liquid biopsy diagnostic assays.