Background

Tumors are generally considered to be of monoclonal origin, with mutations facilitating the expansion of a single malignant cell into visible tumor tissue, as well as the progression from carcinoma in situ to metastatic carcinoma. Notably, patient-derived tumor tissue includes cancer cells as well as other cell types, such as infiltrating immune cells and fibroblasts. Traditional bulk RNA sequencing can therefore provide only an averaged characterization of such highly complex tissue, pooling the signal from cancer cells and non-cancer cells. Although bulk RNA approaches have made many invaluable contributions to medical science [1], they overlook the distinct phenotypic and functional traits of single cells within tumor samples.

Since the first single-cell RNA sequencing (scRNA-seq) study was published in 2009 [2], various commercial platforms and methods have been developed for scRNA-seq, a technique that allows tissues to be studied at single-cell resolution. Many researchers within the life sciences employ scRNA-seq to investigate diverse biological functions, and great achievements have been made in tumor research in particular. The single-cell resolution afforded by scRNA-seq enables direct measurement of the transcriptional output of cells from tumor samples [3], comparison of the transcriptomes of different cells, identification of rare cell subpopulations, such as heterogeneous tumor subpopulations [4], and detection of differences between stimulated dendritic cells [5], in turn providing unprecedented insights that have contributed to the development of cancer therapy.

In the present review, we introduce methods for the preparation of solid tumor samples, related scRNA-seq platforms, the analysis of scRNA-seq data, and achievements in tumor research facilitated by the use of scRNA-seq.

Single-cell RNA sequencing

To carry out scRNA-seq, solid tumor samples first need to be processed to effectively isolate viable single cells from the tissue of interest [6]. Thereafter, the single cells are lysed to obtain RNA, which is reverse transcribed into cDNA and then amplified to construct a sequencing library. The most suitable sequencing instrument must be selected based on the experimental scheme and research objectives. After sequencing, the data need to be correctly analyzed to reveal new findings.

Preparation of solid tumor specimens

Many published articles on the use of scRNA-seq in cancer research have detailed the preparation of solid tumor specimens. After cancer is diagnosed, tumor tissues are removed by biopsy and treated immediately to preserve single cells. Tumor tissue is usually cut into sections of approximately 1 mm³ and washed with PBS to remove fat, visible vessels, and surrounding necrotic areas [6] should be excluded before subsequent bioinformatics analysis to mitigate its influence on downstream analysis results. The percentage of mitochondrial reads is a common quality control metric [15]. A large number of mitochondrial transcripts indicates that a cell is in a state of stress [16], so a threshold is commonly applied to exclude data from cells with too many mitochondrial transcripts. The proportion of ribosomal reads is another commonly used quality control metric. Because scRNA-seq is mainly used to study functional (messenger) RNA, cells that still have a high proportion of ribosomal reads after ribosome removal cannot be further analyzed [17]. In addition to the quality control methods for a single dataset, a method denoted ‘scRNABatchQC’ has been proposed that facilitates quality assessment across datasets to intuitively detect biases and outliers [18]. To aid researchers who are intimidated by scRNA-seq analysis, Etherington et al. [19] developed tools and training materials that can be used for scRNA-seq training and quality control.
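The mitochondrial-read threshold described above can be illustrated with a minimal Python sketch. It assumes human gene symbols, in which mitochondrial genes carry the "MT-" prefix, and uses a hypothetical cutoff of 20%; the function names and the toy counts are illustrative only and do not come from any of the cited tools.

```python
def mito_fraction(cell_counts, mito_prefix="MT-"):
    """Return the fraction of a cell's counts that map to mitochondrial genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    mito = sum(count for gene, count in cell_counts.items()
               if gene.upper().startswith(mito_prefix))
    return mito / total


def filter_cells(cells, max_mito_fraction=0.2):
    """Keep cells whose mitochondrial fraction does not exceed the cutoff.

    The default cutoff of 0.2 is an assumption for illustration; a suitable
    value depends on tissue and protocol.
    """
    return {cell_id: counts for cell_id, counts in cells.items()
            if mito_fraction(counts) <= max_mito_fraction}


# Toy data: gene -> read (or UMI) count for each cell.
cells = {
    "cell_a": {"MT-CO1": 5, "ACTB": 60, "GAPDH": 35},    # 5% mitochondrial
    "cell_b": {"MT-CO1": 40, "MT-ND1": 20, "ACTB": 40},  # 60% mitochondrial
}
kept = filter_cells(cells)
print(sorted(kept))
```

In this toy example only cell_a passes the filter. In practice the cutoff is usually chosen after inspecting the distribution of mitochondrial fractions across all cells.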

Batch-effect correction

During the scRNA-seq experimental procedure, when cells subject to different conditions are cultured, captured, and sequenced separately, batch effects will be evident [20]. There are several methods available for batch-effect correction of scRNA-seq data, including Seurat 3 [21], MMD-ResNet [22], Harmony [23], Scanorama [24], Liger [25], scMerge [26], ZINB-WaVE [27], and others. Based on a variety of evaluation indicators, Harmony, Liger, and Seurat 3 are the recommended methods for dealing with batch effects, among which Harmony is the first choice due to its shorter run time [28]. Recently, a novel numerical algorithm for batch-effect correction of bulk and scRNA-seq data was proposed, denoted ‘scBatch’. This approach is not limited by the hypothesis of the batch-effect generation mechanism, and is superior to the benchmark batch-effect correction algorithms [29].
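To make the idea of a batch effect concrete, the following Python sketch applies the simplest possible correction: per-gene mean-centering within each batch. This is deliberately not Harmony, Liger, Seurat 3, or scBatch; it is a toy baseline that assumes every batch contains the same mixture of cell types, an assumption that rarely holds in real tumor data. All names and values are illustrative.

```python
def center_batches(expr, batches):
    """Remove per-batch, per-gene mean shifts from an expression matrix.

    expr    : list of cells, each a list of per-gene expression values
    batches : list of batch labels, one per cell
    Each batch's gene-wise mean is subtracted and the global gene-wise mean
    is added back, so corrected values stay on the original scale.
    """
    n_genes = len(expr[0])
    global_mean = [sum(row[g] for row in expr) / len(expr) for g in range(n_genes)]

    rows_by_batch = {}
    for row, batch in zip(expr, batches):
        rows_by_batch.setdefault(batch, []).append(row)
    batch_mean = {
        batch: [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
        for batch, rows in rows_by_batch.items()
    }

    return [
        [row[g] - batch_mean[batch][g] + global_mean[g] for g in range(n_genes)]
        for row, batch in zip(expr, batches)
    ]


# Two batches with identical cells, except that gene 0 is shifted by +10 in b2.
expr = [[1.0, 2.0], [3.0, 4.0], [11.0, 2.0], [13.0, 4.0]]
batches = ["b1", "b1", "b2", "b2"]
corrected = center_batches(expr, batches)
print(corrected)
```

After centering, matching cells from the two batches have identical profiles. When cell-type composition differs between batches, this kind of centering would also remove genuine biological differences, which is one motivation for the more elaborate methods listed above.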

Normalization

Data normalization is essential for scRNA-seq to make gene expression comparable within and/or between samples. A number of methods have been developed for the normalization of RNA-seq data [30,31,32,33]. However, the majority of methods follow the same principle as bulk RNA-seq normalization and, thus, are not applicable to scRNA-seq data [30, 31]. Nevertheless, several methods have recently been devised to normalize scRNA-seq data, such as SCONE [34] and regularized negative binomial regression [35]. SCONE provides a flexible framework for users to choose appropriate normalization methods. Normalization using regularized negative binomial regression effectively eliminates technical differences due to different sequencing depth without inhibiting biological heterogeneity. A previous study [36] compared seven scRNA-seq data normalization methods with regard to reduction of noise or bias, and found that each of these methods was suitable to normalize specific types of data for further downstream analysis.
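As a point of reference for what normalization has to achieve, the sketch below shows plain library-size scaling followed by a log transform. This is a simple baseline, not SCONE and not regularized negative binomial regression; the scale factor of 10,000 and the function name are assumptions made for illustration.

```python
import math


def normalize_cell(counts, scale=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x).

    counts : list of raw counts for one cell (one entry per gene)
    scale  : target total count per cell (an illustrative default)
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale) for c in counts]


# Two cells with the same composition but a 10-fold difference in depth.
shallow = normalize_cell([1, 3])
deep = normalize_cell([10, 30])
print(shallow, deep)
```

The two cells yield the same normalized values, showing that the difference in sequencing depth has been removed while the relative expression of the two genes is retained.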

Cell cycle phase assignment

Determining the cell cycle phase of a single cell can facilitate understanding of biological processes such as tumorigenesis [37,38,39,40,41] and cell differentiation [42, 43], and makes it possible to remove the confounding effects of cell cycle phase before downstream analysis. Scialdone et al. [44] described and compared six supervised cell cycle prediction methods based on a cell transcriptome, of which the parameter-free PCA-based method and the custom predictor known as the “Pairs” method performed best in allocating cells to the correct cell cycle stage. Buettner et al. [45] proposed a computational method, denoted ‘single-cell latent variable model’ (scLVM), which can be used to eliminate variation caused by the cell cycle and other confounding factors before downstream analysis. Recently, Hsiao et al. [46] proposed a new method to characterize the progress of the cell cycle. Rather than assigning cells to discrete cell cycle stages (G1, S or G2/M phase), as is traditional, it quantifies the cell cycle progression of induced pluripotent stem cells on a continuum, providing a basis for characterizing the cell cycle in other cell types.
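The discrete assignment of cells to G1, S, or G2/M can be illustrated with a marker-score heuristic. The Python sketch below is not the PCA-based or “Pairs” predictor, nor scLVM; it is a simplified scoring scheme in which the two-gene marker lists, the scoring rule, and the toy expression values are all assumptions chosen for illustration.

```python
# Small illustrative marker sets; real analyses use much longer curated lists.
S_GENES = ("MCM2", "PCNA")
G2M_GENES = ("TOP2A", "MKI67")


def assign_phase(cell, s_genes=S_GENES, g2m_genes=G2M_GENES):
    """Assign a cell to G1, S, or G2/M from marker expression.

    cell : dict mapping gene symbol -> normalized expression
    A phase score is the mean marker expression minus the cell's mean
    expression over all genes. Cells with no positive score are called G1.
    """
    if not cell:
        return "G1"
    background = sum(cell.values()) / len(cell)

    def score(genes):
        values = [cell.get(gene, 0.0) for gene in genes]
        return sum(values) / len(values) - background

    s_score, g2m_score = score(s_genes), score(g2m_genes)
    if s_score <= 0 and g2m_score <= 0:
        return "G1"
    return "S" if s_score > g2m_score else "G2/M"


cells = {
    "c1": {"MCM2": 5, "PCNA": 6, "TOP2A": 0, "MKI67": 1, "ACTB": 3},
    "c2": {"MCM2": 0, "PCNA": 1, "TOP2A": 7, "MKI67": 6, "ACTB": 1},
    "c3": {"MCM2": 1, "PCNA": 1, "TOP2A": 1, "MKI67": 1, "ACTB": 6},
}
phases = {cell_id: assign_phase(profile) for cell_id, profile in cells.items()}
print(phases)
```

A hard three-way call like this is exactly what continuum-based approaches aim to replace, since cells near a phase boundary receive an arbitrary label.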

Cell clustering

One of the basic goals of scRNA-seq data analysis is to identify cell types from experimental samples to elucidate tissue complexity and heterogeneity. Due to the importance of cell type recognition, efforts have been made to develop new algorithms, including CountClust [47], CIDR [48], SIMLR [49], SAFE [50], and other advanced methods. A few studies [51,52,53,54] have compared and summarized diverse clustering algorithms for scRNA-seq data analysis. Unlike previous methods, Geddes et al. [55] proposed the first ensemble clustering framework based on autoencoder dimension reduction, which could be combined with different clustering algorithms to promote the accurate recognition of cell types. Since then, several new clustering methods have emerged, including DivBiclust [56], a biclustering-based framework, SAME [57], which extracts cluster solutions from multiple methods, and PARC [58], which is suitable for large-scale single-cell data. To overcome flaws associated with the manual labeling of cell types, Shao et al. [59] developed an automatic annotation toolkit, denoted ‘scCATCH’, based on clustering, which can accurately annotate cell types with acceptable repeatability.
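For readers unfamiliar with clustering, the sketch below implements plain k-means on a toy two-dimensional embedding. It is not CountClust, CIDR, SIMLR, SAFE, or any other method cited above, and the deterministic initialization (the first k points) is a simplification made so that the example is reproducible; all names and coordinates are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20):
    """Cluster points with Lloyd's k-means algorithm.

    points : list of coordinate tuples (e.g. cells in a reduced space)
    k      : number of clusters
    Centers are initialized to the first k points (an assumption made for
    reproducibility; random or k-means++ seeding is more common).
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Two well-separated groups of "cells" in a 2-D embedding.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The resulting labels separate the cells near the origin from those near (10, 10). Real scRNA-seq clustering typically operates on many more dimensions and must also decide how many clusters to report.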

Reconstruction of cell trajectory and pseudo-time

By reconstructing cell trajectory and pseudo-time based on scRNA-seq data, dynamic processes in cells can be calculated and simulated, which is of great significance for understanding the transition between cell states in cancer [60]. Currently available algorithms include Monocle 2 [61], Monocle 3 [62], TSCAN [63], Slingshot [64], SLICE [65], LISA [66], p-Creode [67], Waddington-OT [68] and others. Considering the exponential growth in the size of scRNA-seq data, Chen et al. [66] proposed an unsupervised method, denoted ‘LISA’, for the reconstruction of cell trajectory and pseudo-time for a large number of scRNA-seq datasets. p-Creode is another unsupervised algorithm that can predict cell state-transition trajectories. Waddington-OT uses the mathematical method of optimal transport (OT) to infer ancestor-descendant fate and reconstruct cell trajectories. In addition to the three methods mentioned above, the algorithms Monocle 2 (and its newer version, Monocle 3), TSCAN, and Slingshot have been shown to perform well at reconstructing cell trajectory and pseudo-time [69].
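The notion of pseudo-time can be illustrated with a toy tree-based ordering. The Python sketch below builds a minimum spanning tree over cells with Prim's algorithm and defines pseudo-time as the distance from a chosen root cell along the tree. It is not an implementation of any tool listed above; the choice of root, the use of Euclidean distance, and the toy coordinates are assumptions.

```python
import math


def mst_pseudotime(points, root=0):
    """Order cells by their distance from a root cell along a spanning tree.

    points : list of coordinate tuples (cells in a reduced space)
    root   : index of the cell assumed to be the start of the process
    The tree is grown from the root with Prim's algorithm; each newly added
    cell inherits its parent's pseudo-time plus the connecting edge length.
    """
    pseudotime = {root: 0.0}
    # best[i] = (distance to nearest tree node, index of that tree node)
    best = {i: (math.dist(points[root], points[i]), root)
            for i in range(len(points)) if i != root}
    while best:
        nxt = min(best, key=lambda i: best[i][0])
        edge_length, parent = best.pop(nxt)
        pseudotime[nxt] = pseudotime[parent] + edge_length
        # Check whether the new tree node offers a shorter link to any cell.
        for i in best:
            candidate = math.dist(points[nxt], points[i])
            if candidate < best[i][0]:
                best[i] = (candidate, nxt)
    return [pseudotime[i] for i in range(len(points))]


# Cells lying along a line, listed out of order; cell 0 is taken as the root.
points = [(0, 0), (3, 0), (1, 0), (2, 0)]
times = mst_pseudotime(points)
order = sorted(range(len(points)), key=lambda i: times[i])
print(times, order)
```

Sorting cells by the returned values recovers their order along the line. Real trajectories can branch, which is why dedicated algorithms model lineages rather than a single ordering.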

Differential expression and gene set enrichment analysis

One of the most common uses of gene expression data is the identification of differentially expressed (DE) genes under different experimental conditions (e.g., stimulated versus non-stimulated, mutant versus wild-type, or between different time points), and thus the determination of the root cause of phenotypic differences observed under different conditions [70]. A zero value for a gene’s expression level in scRNA-seq data can arise in two ways. One is the “real” zero, caused by the changing characteristics of single-cell gene transcription, while the other is the “dropout” zero, caused by technical reasons, which often affects the validity of differential expression analysis. Miao et al. [71] developed the R package DEsingle, which can accurately distinguish between the two types of zeros. DECENT is a DE gene analysis method for UMI-based scRNA-seq data that analyzes the inferred pre-dropout distributions of RNA molecules [72]. In addition to dropout zeros, another challenge in differential expression analysis of scRNA-seq data is multimodal data distribution. ZIAQ is the first approach to consider both dropout rates and the complex distributions of scRNA-seq data, and can be used to identify more DE genes [73].
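The reason zeros complicate DE analysis can be seen in a small per-gene summary. The Python sketch below reports two separate quantities for one gene in two groups of cells: the log2 fold change of mean expression and the fraction of cells with non-zero expression. It does not distinguish real zeros from dropout zeros, which requires a statistical model such as those in the packages cited above; the pseudocount and the toy values are assumptions.

```python
import math


def summarize_gene(group_a, group_b, pseudocount=1.0):
    """Summarize one gene's expression in two groups of cells.

    Returns the log2 fold change of group_b over group_a (with a
    pseudocount to avoid division by zero) and each group's detection
    rate, i.e. the fraction of cells with non-zero expression.
    """
    def mean(values):
        return sum(values) / len(values)

    def detection_rate(values):
        return sum(1 for v in values if v > 0) / len(values)

    return {
        "log2_fc": math.log2((mean(group_b) + pseudocount)
                             / (mean(group_a) + pseudocount)),
        "detect_a": detection_rate(group_a),
        "detect_b": detection_rate(group_b),
    }


# Same fraction of zeros in both groups, but higher expression when detected.
control = [0, 0, 2, 2]
treated = [0, 6, 6, 0]
summary = summarize_gene(control, treated)
print(summary)
```

Here the detection rate is identical in both groups while the mean differs, so the two summaries carry different information. A gene could equally differ only in detection rate, which a test on means alone may miss.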

We usually group DE genes according to their participation in common biological processes to facilitate the interpretation of results [74]. Existing gene set enrichment (GSE) analysis methods include DAVID [75], PAGE [76], CAMERA [77] and others, but almost all of these methods are more suitable for bulk RNA-seq analysis [78]. In addition, almost all existing GSE methods are used as a separate step after DE analysis. Considering the above shortcomings, Ma et al. [79] proposed IDEA, a computational method integrating DE analysis and GSE analysis for scRNA-seq, which could greatly improve the outcomes of both.
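One common form of gene set analysis, over-representation testing, can be written in a few lines. The Python sketch below computes a one-sided hypergeometric p-value for the overlap between a DE gene list and a gene set. It illustrates the general idea only and is not the specific statistic used by DAVID, PAGE, CAMERA, or IDEA; the function name and the toy numbers are assumptions.

```python
from math import comb


def enrichment_p_value(n_universe, n_set, n_de, n_overlap):
    """One-sided hypergeometric p-value for gene set over-representation.

    n_universe : number of genes tested in total
    n_set      : number of those genes belonging to the gene set
    n_de       : number of differentially expressed genes
    n_overlap  : number of DE genes that belong to the gene set
    Returns the probability of an overlap of at least n_overlap if the DE
    genes were drawn at random from the universe.
    """
    total = comb(n_universe, n_de)
    largest_possible = min(n_set, n_de)
    favourable = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_de - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return favourable / total


# 10 genes tested, 3 in the pathway, 3 DE genes, all 3 of them in the pathway.
p = enrichment_p_value(n_universe=10, n_set=3, n_de=3, n_overlap=3)
print(p)
```

Because this test takes a fixed DE gene list as input, it runs as a separate step after DE analysis, which is the limitation that integrated approaches are designed to address.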

Gene regulatory network inference

The combination of active transcription factors and their target genes is usually described in gene regulatory networks (GRNs). Revealing these regulatory interactions is the goal of GRN inference methods, which provide valuable insights for the identification of causal regulatory factors in biological processes [74]. One class of GRN inference methods is based on Boolean network models, such as the SCNS toolkit [80] and BTR [81]. Another approach to the inference of regulatory networks is based on co-expression analysis; examples include SINCERA [82], which is designed specifically for scRNA-seq data. In addition, there are algorithms based on ordinary differential equations, such as SCODE [83] and InferenceSnapshot [84]. Recently, Moerman et al. [85] proposed the GRNBoost2 and Arboreto frameworks, which can help researchers to deduce high-quality GRNs from large datasets in a reasonable amount of time.
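The co-expression idea can be sketched as a thresholded correlation network. The Python example below connects two genes when the absolute Pearson correlation of their expression across cells exceeds a cutoff. It is not SINCERA or any other cited method; the gene names, the cutoff of 0.9, and the toy values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for gene pairs with |r| at or above the threshold.

    expr : dict mapping gene name -> list of expression values across cells
    """
    edges = []
    for gene1, gene2 in combinations(sorted(expr), 2):
        r = pearson(expr[gene1], expr[gene2])
        if abs(r) >= threshold:
            edges.append((gene1, gene2, r))
    return edges


# Hypothetical genes measured in four cells.
expr = {
    "TF1": [1, 2, 3, 4],
    "TARGET1": [2, 4, 6, 8],  # rises together with TF1
    "OTHER": [5, 1, 1, 5],    # uncorrelated with both
}
edges = coexpression_edges(expr)
print(edges)
```

Correlation is symmetric, so an edge of this kind does not indicate which gene regulates the other, or whether both respond to a third factor. Boolean and differential-equation models are among the approaches that attempt to capture direction and dynamics.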

Progress of single-cell RNA sequencing in tumors

Cancer patients may be unresponsive to therapy due to drug resistance and metastasis of single cells, both of which constitute major challenges in the treatment of malignant tumors. About 90% of available drugs are effective in less than half of patients [86]. Cancer is associated with the interaction of thousands of gene products, and genotypes as well as interactions vary greatly within and between tumors [87], which is a key reason for the failure of some drugs. In contrast to bulk analysis, which does not account for the differences between cancer cells and their cancer-related counterparts, scRNA-seq provides unprecedented resolution for the analysis of each individual malignant cell, stromal cell, endothelial cell, parenchymal cell, and immune cell, as well as gene expression and pathway activation [88]. Thus, scRNA-seq provides insights that contribute to the development of strategies for cancer treatment and personalized medicine (see Additional file 1).

Highlighting intra- and inter-tumoral heterogeneity

The considerable heterogeneity of tumors and tumor tissue samples between different patients is an important reason for treatment failure [89]. Therefore, understanding the functional status of individual tumor cells and recognizing cell subset composition and characteristics is of great significance for cancer biology and treatment strategies (Table 1 (see Additional file 2)).

Glioblastoma (GBM) is the most common primary malignant brain tumor in adults [90]. It is the glioma with the highest degree of malignancy, and the most often seen in clinical practice, with poor prognosis and a lack of effective treatment regimens [91]. In 2014, Patel et al. [92] analyzed 430 cells of 5 primary GBMs (all IDH1/2 wild-type primary GBMs) using scRNA-seq, and found that these cells differed in the expression of various programs related to carcinogenic signaling pathways, cell proliferation, the immune response, and hypoxic stress. In 2018, Yuan et al. [93] analyzed high-grade glioma (HGG) with large-scale parallel scRNA-seq using a high-density microwell system, and found that, similar to oligodendrocyte progenitors, glioma cells exhibited proliferative characteristics. In contrast, similar to astrocytes, neuroblasts, and oligodendrocytes, glioma cells exhibited an amitotic state in tumors.

Melanoma is a highly malignant skin cancer with four clinically distinguishable subtypes and is responsible for approximately half of skin cancer-related deaths in Japan [94]. Gerber et al. [95] used scRNA-seq to analyze transcription in cells from three different metastatic melanoma patients (BRAF/NRAS wild type, BRAF mutant/NRAS wild type, and BRAF wild type /NRAS mutant). BRAF/NRAS wild-type samples had a low-abundance subgroup with high expression of ABC transporters, while cells from the other two samples exhibited more homogeneous single-cell gene expression patterns.

Head and neck squamous cell carcinoma (HNSCC) encompasses a group of malignant tumors originating from the squamous epithelium of the oral cavity, oropharynx, larynx, and hypopharynx [

Discovery of invasion and metastasis mechanisms

The ability of single-cell gene expression profiling to identify specific patterns of gene expression allows for the elucidation of mechanisms underlying tumor invasion and metastasis [154]. The combination of scRNA-seq with genotyping can more accurately distinguish malignant cells from normal cells [101]. In the future, different omics technologies could be combined with scRNA-seq technology to characterize individual cells more comprehensively. With a deepening understanding of the cellular dynamics of cancer, the efficacy of personalized medicine will improve, ultimately saving lives and reducing the global burden of cancer on healthcare systems.