Introduction

Colorectal cancer (CRC) is a molecularly heterogeneous disease [1, 2]. The heterogeneity of the cell types involved in CRC carcinogenesis makes it difficult to elucidate cell lineages using traditional developmental biology techniques such as bulk transcriptomics [3]. Single-cell transcriptomics now makes it possible to deconstruct a tumor into its diverse cell subpopulations and thus to gain a better understanding of the underlying biology, such as subtyping [4,5,6]. However, the spatial or anatomical information inherent in the tissue architecture is lost when single-cell transcriptomics is used alone.

Spatial transcriptomics (ST) is an emerging technology that adds spatial dimensionality and tissue morphology information to the single-cell transcriptomics data of cells in an undissociated tissue, thus helping to preserve precise spatial or anatomical information. Overcoming the throughput limitation of in situ hybridization (ISH) methods, ST allows unbiased mapping of transcripts in individual tissue sections at spatial resolution by using spatially barcoded oligo-deoxythymidine microarrays [7]. As a high-throughput, spatially resolved transcriptomic tool, ST has been used to study architecturally complex tissues and diseases, including melanoma [8], prostate cancer [9], cardiac sarcoidosis [10], non-small cell lung cancer [11], and the cortex of humans and other species [12, 13], as well as their spatiotemporal characteristics [14,15,16].

Extensive multimodal studies have unraveled the molecular landscapes of diverse diseases [17]. Combining single-cell and spatial transcriptomics, two complementary and powerful technologies, has been shown to scale to architecturally complex tissues and to provide meaningful biological insight across a range of tissues and pathologies, such as melanoma [18], bone marrow [19], prostate cancer [20], pancreatic ductal adenocarcinomas [21], myocardial infarction [22], lung fibroblasts [23], spinal cord [24], and plants such as rice root [25].

The tumor microenvironment (TME) comprises various cell types (immune cells, fibroblasts, endothelial cells, etc.) and extracellular components (growth factors, cytokines, extracellular matrix, hormones, etc.) that surround tumor cells [26]. Since many currently used anticancer therapies target non-tumor components, such as the extracellular matrix (ECM) [27], the immune system and the vascular system [28], understanding the cellular components and how their dynamic interactions shape the tumor landscape is particularly important.

In this study, we aim to provide a comprehensive global view of tumor heterogeneity and intercellular interaction networks in CRC using single-cell transcriptional profiles coupled with spatial transcriptional profiles. By analyzing the single-cell and spatial transcriptional profiles of 41,700 cells from 3 treatment-naïve patients with CRC, we generated a molecular map of all major CRC populations based on single-cell RNA sequencing (scRNA-seq). The malignant cells among the epithelial cells were identified and categorized into seven subclasses (tumor_CAV1, tumor_ATF3_JUN | FOS, tumor_ZEB2, tumor_VIM, tumor_WSB1, tumor_LXN, tumor_PGM1), which may aid the molecular subtyping of colorectal cancers. In addition, we used spatially resolved transcriptomics in combination with computational tools to attribute cell types to different CRC niches. Tumor regions annotated on the cryosections showed high TMSB4X expression, suggesting that TMSB4X is a typical marker of tumorigenesis. The stromal region was characterized by the VIM gene, which was also a typical feature of one subtype of malignant cells in the CRC scRNA-seq data. Furthermore, we inferred an important interaction between tumor and stromal regions mediated by the gene pair RPS19 and C5AR1, which act as ligand and receptor, respectively.

Results

Landscape view of cell composition in tumors, adjacent tissues and peripheral blood in patients with CRC

To shed light on the complexity of the TME in CRC, we performed scRNA-seq along with spatial transcriptome sequencing on viable cells derived from matched tumor and adjacent tissues, as well as peripheral blood mononuclear cells (PBMCs), of 3 patients with CRC (Fig. 1a, Supplementary Table S1). On average, we obtained more than 150 G sequencing reads for each sample, with a median sequencing saturation of 91.40% (87.0%–95.5%). A total of 41,700 cells were identified in 9 samples derived from 3 patients (10,347, 13,241 and 18,112 cells in tumor tissues, adjacent tissues and peripheral blood, respectively; Supplementary Table S2). We obtained approximately 1000 genes and 2500 unique molecular identifiers (UMIs) per cell, indicating sufficient coverage and transcript representation. After quality control filtering (removing cells with few detected features and features detected in few cells), we retained 35,666 high-quality cells for further analysis.

Fig. 1: Cell type identification in human CRC by 10X Genomics scRNA-seq.

a Workflow of sample collection and single-cell transcriptome analysis from Chinese patients with CRC. b t-distributed stochastic neighbor embedding (t-SNE) plot of 35,666 high-quality cells from CRC patients (the CRC scRNA-seq dataset), grouped into eight major cell types (left panel, top). Average proportions of the global cell types in tumor tissues, adjacent tissues and blood (left panel, bottom). The normalized expression of marker genes for each cell type (right panel). c Gene expression heatmap analyzed by 10X Genomics scRNA-seq. d Proportions of the global cell types in individual samples with CRC.

To define cell clusters with similar expression profiles, we performed dimensionality reduction by t-distributed stochastic neighbor embedding (tSNE) as implemented in the Seurat package. Each cluster was then assigned to a specific cell subpopulation on the basis of the expression of the most variable genes and canonical markers, including those of epithelial cells (EPCAM, KRT5, PHGR1, LGALS4, and TFF3), T cells (CD4+ T cells: PTPRC, CD3D, and CD4; CD8+ T cells: PTPRC, CD3D, and CD8A), B cells (CD19 and MS4A1), monocytes (CD14 and ITGAX, which encodes CD11c), natural killer (NK) cells (FCGR3A and NCAM1), endothelial cells (CDH5, PLVAP, CLDN5, and VWF), fibroblasts (LUM, DCN, and COL1A1), and mast cells (KIT, CPA3, MS4A2, and TPSAB1) (Fig. 1b). In addition to these well-known markers, we analyzed cluster-specific genes via differential gene expression analysis (Supplementary Table S3). These cluster-specific marker genes included FBLN1 for fibroblasts, as well as MT1A and PLN for smooth muscle cells (Fig. 1c, Fig. S1e). In total, eight cell types in CRC were identified based on canonical markers and cluster-specific genes: epithelial cells, fibroblasts, endothelial cells, monocytes, T cells, NK cells, B cells, and mast cells. The heterogeneous composition of the TME in CRC across tumor tissues, normal tissues and peripheral blood is consistent with a recent single-cell transcriptome study of CRC [29].

To characterize the cell compositions of tumor tissues, normal tissues and peripheral blood in CRC, we investigated the proportion of each cell type. An overall increase in myeloid cell populations and a decrease in B cell populations were observed in tumor tissues compared with normal tissues (Fig. 1b, bottom; Fig. S1d), suggesting a redirected immune response in CRC patients. In detail, the proportion of monocytes increased approximately 2.5-fold, whereas the proportions of NK cells and B cells decreased (to about 0.3–0.4 times those in normal tissues) in tumors, suggesting myeloid immunosuppression in the CRC TME (Fig. S1a,b, Supplementary Table S4). To further explore the distinct cell compositions of the TME across individuals, more detailed proportions were assessed (Fig. 1d). For example, the proportion of epithelial cells in patient T0602 was lower than that in patient T0529 and higher than that in patient T0609 (Fig. 1d, left; Fig. S1 c1; Supplementary Table S5). Since the transition from normal epithelium to intraepithelial neoplasia was found to be associated with CRC patient survival [30, 31], the differences in epithelial cells across individuals may be important for survival and are worthy of further investigation. Considering that cellular proportions determined by scRNA-seq may be biased toward an underrepresentation of malignant cells derived from epithelial cells [32], we also examined the proportions of immune and stromal cells among all cells other than epithelial cells (which include tumor cells), as in a previous study [29]. The results showed that the myeloid cell-driven immune response in patient T0529 was stronger than that in the other two patients (Fig. 1d, right; Fig. S1 c2).

Epithelial cells represent multiple lineages, including a lineage of malignant cells

It has been suggested that human colon cancer cells recapitulate the multilineage differentiation processes of normal colon epithelia. To investigate the contribution of each lineage to CRC heterogeneity at single-cell resolution, we subclustered each cell type to identify subpopulations. To annotate these subpopulations, we incorporated a published CRC cohort comprising tumor regions and matched normal mucosa from 6 CRC patients [29] and transferred its subtype annotations to our dataset with the Seurat R package (Fig. 2a). Since the transition from normal epithelium to intraepithelial neoplasia was found to be associated with CRC patient survival [24], we focused on epithelial cells and found 9 subpopulations, namely CD19+CD20+ B cells, crypt cells, enterocytes, goblet cells, intermediate, mature colonotypes, proinflammatory, stem-like, and tumor cells (Fig. 2b).

Fig. 2: Transcriptome signatures and heterogeneity in normal and tumor epithelial cells.

a t-SNE plot of the CRC scRNA-seq dataset color-coded by colorectal subtypes. b t-SNE plot of all 5887 epithelial cells (tumor/malignant cells are included) of the CRC scRNA-seq dataset color-coded by subtypes. c–e The semisupervised trajectory of all epithelial cells inferred by Monocle v2, color-coded by state (c), subtype (d) or stemness (e). Stemness levels were calculated as the mean expression of the stem-like signature. f Volcano plot showing differentially expressed genes between tumor cells and other normal epithelial cells (non-malignant cells) (P-value < 0.05, Wilcoxon rank sum test, loge(fold change) > 0.25). g Significant biological processes (GO terms) enriched in tumor/malignant cells by clusterProfiler (hypergeometric test). h t-SNE plot of 3150 tumor cells derived from the CRC scRNA-seq dataset, color-coded by cell subtypes. i, j The trajectory of tumor cells inferred by Monocle v2, color-coded by cell subtypes (i) and sample origins (j).

To distinguish malignant from nonmalignant cells among the epithelial cells, we performed scRNA-seq-based copy number variation (CNV) and subclustering analysis (Fig. S2a, b). The proportions of malignant cells in each epithelial subcluster are shown in Fig. S2c. The trajectory revealed a transcriptional hierarchy defining seven molecular states (Fig. 2c, top). Cells from tumor tissues dominated the divergent differentiation states 2 and 5, suggesting that tissue of origin is arranged along pseudotime (Fig. 2c, bottom).

To illustrate the differentiation paths across the multiple lineages of the epithelial cell populations, we inferred a semisupervised trajectory with monocle2 [33], which revealed a transcriptional hierarchy defining three branches. The hierarchy comprised malignant epithelial cells, normal epithelial cells (including goblet cells and stem-like/transit amplifying cells) and immune-related cell types (including proinflammatory and mature colonotypes); it originated from normal epithelial cells and branched toward malignant epithelial cells (gray) and immune-related cell types (light green), respectively (Fig. 2d). Projection of malignant epithelial cells along the epithelial cell differentiation trajectory revealed segregation of tumor cells from normal epithelial cell types and stem-like populations. The greater stemness of malignant epithelial cells suggested the regenerative/proliferative potential of these tumor cells (Fig. 2e). Hypoxia and epithelial-mesenchymal transition (EMT) were also investigated in the malignant epithelial cell populations (Fig. S2e).

Transcriptional and functional features of malignant cells reveal heterogeneity in CRC patients

To characterize the malignant cell populations, we compared the transcriptional features of malignant and nonmalignant cells. The malignant epithelial cell populations were characterized by upregulated expression of S100A4, VEGFA, MYC, and ICAM1 (intercellular adhesion molecule-1), according to their significant differential expression (loge|fold change| > 0.25, t test, p value < 0.05) (Fig. 2f, left). The most differentially expressed gene EMP3 (epithelial membrane protein 3), which has been identified as a tumor suppressor in breast cancer [34], glioma [

Fig. 3: Spatial transcriptome (ST) of CRC and mapping of cell types at spatial resolution.

a A pathologic section from tumor tissues of one CRC patient (T0602). b Annotations obtained by integration analysis of the CRC scRNA-seq dataset and CRC5_1 in the ST-seq dataset using Seurat label transfer. c Clustering of the CRC5_1 ST spots and annotation of the CRC5_1 tumor cryosection on the ST slide. The CRC5_1 cryosection was obtained from tumor tissues of patient T0602. d Expression levels of genes with subtype-specific patterns in CRC5_1 ST spots. i Standardized expression levels of five genes in CRC5_1 in the ST-seq dataset. e A pathologic section from normal, adjacent tissues of the CRC patient (T0602). f Annotations obtained by integration analysis of the CRC scRNA-seq dataset and CRC5N_1 in the ST-seq dataset using Seurat label transfer. g Clustering of the CRC5N_1 ST spots and annotation of the CRCN5_1 normal cryosection on the ST slide. The CRC5N_1 cryosection was obtained from adjacent tissues of patient T0602. h Expression levels of genes with subtype-specific patterns in CRC5N_1 ST spots. j Standardized expression levels of five genes in CRC5N_1 in the ST-seq dataset.

First, the spatial transcriptomics data were integrated with the scRNA-seq data using Seurat v3 anchor-based integration to annotate each region in the corresponding section [37, 38]. Every spot in the spatial data was considered a weighted mix of the cell types identified by scRNA-seq. For each spot, the cell type with the maximum prediction score among all cell types transferred from the scRNA-seq dataset is shown (Fig. 3b, f). After further adjustment on the basis of annotated histological features, we annotated four anatomical regions in the CRC5_1 section (derived from a tumor tissue, Fig. 3c) and two in the CRCN5_1 section (derived from an adjacent tissue, Fig. 3g). We observed many characteristic genes that showed higher expression in annotated regions, especially in the tumor tissues (Fig. 3i) compared with normal tissues (Fig. 3j). Notably, five DEGs between malignant and non-malignant cells in the CRC scRNA-seq dataset were among them: IFITM1, CXCL1, CXCL8, S100A4, and TGFBI. Their higher expression in tumor or stromal regions is shown in Fig. S3. IFITM1 was highly expressed and spatially restricted to the annotated tumor regions. IFITM1 is a member of the interferon-induced transmembrane protein family. It has been reported to be involved in gallbladder carcinoma, esophageal adenocarcinoma, colorectal cancer, and gastric cancer [39]. Fang et al. showed that overexpression of IFITM1 promoted the aggressiveness of CRC cells, whereas knockdown of IFITM1 expression inhibited cell migration, invasion or tumorigenicity in vitro [

Materials and methods

Subjects and clinical characteristics

Patients were eligible if the tumor was clinical stage 2 or stage 3 and there was no intestinal obstruction or abdominal infection. Three patients were included, and all were treatment-naive before tumor resection. The mechanisms underlying heterogeneity at the single-cell level remain unknown. Matched adjacent normal tissues, primary tumors and peripheral blood were obtained from all 3 patients (CRC0529, CRC0602, CRC0609). Detailed clinical information is shown in Supplementary Table S7. All sampling and experimental steps in this study were approved by the Ethics Committee of Zhuhai People’s Hospital Affiliated with Jinan University (Research projects IRB Review Approval Notice: LW-[2022]#1). Informed consent documents were signed by the participants before sample collection and data acquisition. Participants received no compensation for taking part in this study.

Preparation of single-cell suspensions

All tissue samples were washed twice with cold PBS. Tissue samples were cut into pieces of about 1 mm³ and placed in a petri dish with cold PBS, then transferred into a centrifuge tube, where an appropriate amount of enzyme was added and the tube was shaken at a set temperature for a period of time. After standing for 2–3 min, the supernatant was collected and passed through a filter membrane to remove large clumps. The cells were collected by centrifugation, resuspended in red blood cell lysis buffer, incubated for 2–3 min at room temperature, and then centrifuged at 120×g at 4 °C for 3 min. Samples were resuspended again in cold PBS.

Droplet-based single-cell sequencing

Using the Single Cell 5’ Library and Gel Bead Kit (10X Genomics, 120237) and Chromium Single Cell A Chip Kit (10X Genomics, 120236), the cell suspension was loaded onto the Chromium single-cell controller (10X Genomics) to generate single-cell gel beads in emulsion (GEMs) according to the manufacturer’s protocol. Briefly, single cells were suspended in PBS containing 0.04% bovine serum albumin. Approximately 10,000 cells were added to each channel, and about 6000 cells were recovered. The captured cells were lysed, and the released RNA was barcoded via reverse transcription in individual GEMs. Reverse transcription was performed at 53 °C for 45 min, followed by 85 °C for 5 min, and the temperature was then held at 4 °C in a C1000 Touch Thermal Cycler (Bio Rad). After reverse transcription, single-cell droplets were broken and the single-strand cDNA was isolated and cleaned with Cleanup Mix containing DynaBeads (Thermo Fisher Scientific). cDNA was generated and amplified, and quality was assessed using the Agilent 4200. Single-cell RNA-seq libraries were prepared using the Single Cell 5’ Library Gel Bead Kit V2 following the manufacturer’s instructions. Next-generation sequencing was performed on an Illumina Novaseq6000 with a sequencing depth of at least 100,000 reads per cell and a paired-end 150 bp strategy (performed by CapitalBio Technology, Beijing).

Single cell RNA-seq (scRNA-seq) data processing

Sequencing data were aligned to the human reference genome (GRCh38) and processed using CellRanger (version 4.0.0). The gene expression matrix from the CellRanger pipeline was filtered and normalized using the Seurat R package (v3.2) [37]. Cells were retained if they met the following criteria: (i) within the top 99% of cells by unique molecular identifier (UMI) count; (ii) >200 detected genes; and (iii) <25% of UMI counts from mitochondrial genes. After the removal of low-quality cells, the gene expression matrices were normalized to the total UMI counts per cell and transformed to the natural log scale. The datasets of the individual samples were then integrated using the “FindIntegrationAnchors” and “IntegrateData” functions in Seurat. The Louvain algorithm was applied to iteratively group proximal cells together using the “FindClusters” function with a resolution of 0.6. Visualization was achieved with both t-distributed stochastic neighbor embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP).
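As an illustration only, the cell-level filtering and normalization described above could be sketched in Python as follows. This is not the Seurat code used in the study. The function names and the dictionary layout are hypothetical, criterion (i) is read here as discarding the lowest 1% of cells by UMI count, and the scale factor of 10,000 is an assumption that the text does not state.

```python
import math

def qc_filter(cells, drop_bottom_percent=1, min_genes=200, max_mito_frac=0.25):
    """Apply the three cell-level criteria to a list of per-cell summaries.

    Each cell is a dict with keys 'id', 'umi' (total UMI count),
    'genes' (number of detected genes) and 'mito_umi' (mitochondrial UMIs).
    """
    # (i) Rank by total UMI count and drop the lowest-ranked percent of cells
    #     (assumed reading of "top 99% of cells").
    ranked = sorted(cells, key=lambda c: c["umi"], reverse=True)
    n_drop = len(ranked) * drop_bottom_percent // 100
    kept = ranked[: len(ranked) - n_drop]
    # (ii) more than min_genes detected genes;
    # (iii) mitochondrial fraction of UMIs below max_mito_frac.
    return [
        c for c in kept
        if c["genes"] > min_genes
        and c["umi"] > 0
        and c["mito_umi"] / c["umi"] < max_mito_frac
    ]

def log_normalize(counts, scale_factor=10_000):
    """Scale one cell's counts by its total UMI count, then take natural log(1 + x).

    Assumes the cell has a non-zero total, which the QC filter guarantees.
    """
    total = sum(counts)
    return [math.log1p(x / total * scale_factor) for x in counts]
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the number of retained cells is to each criterion.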

Cell type annotation was performed with SingleR [70] against the Blueprint and Encode reference dataset, followed by marker-based correction. We classified all cells into eight major cell types: T cells, B cells, NK cells, monocytes, epithelial cells, fibroblasts, endothelial cells, and mast cells.

10x Visium Spatial transcriptomics (ST)

Cryosections were cut at 10-μm thickness and mounted onto the GEX arrays. Sections were placed on a Thermocycler Adaptor with the active surface facing up, incubated for 1 min at 37 °C, fixed for 30 min in methanol at −20 °C, and then stained with H&E (Eosin, Dako CS701; Hematoxylin, Dako S3309; bluing buffer, CS702). Brightfield images were taken on a Leica DMI8 whole-slide scanner at 10× resolution.

Visium spatial gene expression was processed using Visium spatial gene expression slide and Reagent Kit (10× Genomics, PN-1000184). For each well, Slide Cassette was used to create leakproof wells for adding reagents. 70 μL Permeabilization enzyme was added and incubated at 37 °C for 20 min. Each well was washed with 100 μL SSC, and 75 μL reverse transcription Master Mix was added for cDNA Synthesis.

cDNA libraries were then prepared for sequencing. After first-strand synthesis, the reverse transcription Master Mix was removed from the wells, 75 μL of 0.08 M KOH was added and incubated for 5 min at room temperature, and the KOH was then removed and the wells were washed with 100 μL EB buffer. A total of 75 μL Second Strand Mix was added to each well for second-strand synthesis. cDNA amplification was performed on an S1000™ Touch Thermal Cycler (Bio Rad). Visium spatial libraries were constructed using the Visium spatial Library construction kit (10× Genomics, PN-1000184) according to the manufacturer’s instructions. The libraries were sequenced on an Illumina Novaseq6000 sequencer with a sequencing depth of at least 100,000 reads per spot using a paired-end 150 bp (PE150) strategy (performed by CapitalBio Technology, Beijing).

Spatial transcriptome sequencing (ST-seq) data processing

The sequencing reads were mapped to the GRCh38 human genome and expression was quantified with spaceranger (v1.0.0). Further analysis was performed with Seurat (version 3.0.2). To annotate spots, we applied the integration workflow introduced in Seurat v3, which enables the probabilistic transfer of cell types from the scRNA-seq data to the ST data. Specifically, we first identified pairwise correspondences between single cells and single spots to quantify the batch effect. Each spot was then annotated based on the transcriptomic similarity between spots and cell types in the scRNA-seq dataset. This probabilistic transfer procedure was implemented using the FindTransferAnchors (dims=1:30) and TransferData (dims=1:30) functions in Seurat, using the combined top 100 DEGs of each cell type.
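The final step of such a label transfer, assigning each spot the cell type with the highest prediction score, can be sketched in a few lines of Python. This is a hedged illustration of the argmax step only, not a reimplementation of Seurat's anchor finding. The function name and the nested-dictionary input format are hypothetical.

```python
def assign_spot_labels(prediction_scores):
    """Pick the best-scoring cell type for every spot.

    prediction_scores maps spot id -> {cell type: prediction score}.
    Returns spot id -> (best cell type, its score). Ties resolve to the
    first cell type encountered with the maximum score.
    """
    labels = {}
    for spot, scores in prediction_scores.items():
        best = max(scores, key=scores.get)
        labels[spot] = (best, scores[best])
    return labels
```

Returning the score alongside the label keeps low-confidence spots visible, which is useful when the annotation is later adjusted against histological features.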

Differential expression and functional enrichment analysis

After dimensional reduction and projection of all cells into two-dimensional space by tSNE and UMAP, cells were clustered according to common features. The “FindAllMarkers” function in Seurat was used to find markers for each of the identified clusters. Using the differentially expressed genes (DEGs) of each cluster, we performed functional enrichment analysis with clusterProfiler (v3.10.1), with |log2 fold change| > 0 and p.adj < 0.05 as thresholds (hypergeometric test). The enrichment analysis covered Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, Reactome and Disease. Gene set enrichment analysis was performed with the Java version of the GSEA application (v2.2.2.4), using predefined gene sets from the Molecular Signatures Database (MSigDB, v6.2).
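For readers unfamiliar with the hypergeometric test used in over-representation analysis, a minimal Python sketch is given below. It computes the upper-tail probability of seeing at least k gene-set members among n DEGs drawn from a background of N genes, of which K belong to the set. The function and argument names are hypothetical, and the sketch omits the multiple-testing adjustment that produces p.adj.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    k: DEGs that fall in the gene set
    n: total number of DEGs
    K: genes in the gene set (within the background)
    N: total number of background genes
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of every overlap at least as large as the observed one.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

For example, with a background of 10 genes, a set of 4 and 3 DEGs of which 2 are in the set, the function evaluates (6·6 + 4·1)/120, that is, one third.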

Regulon analyses

Regulon scores for individual cells were computed using the SCENIC (single-cell regulatory network inference and clustering) pipeline [71]. A log-normalized expression matrix of neuronal cells was used as an input into the pySCENIC workflow (https://pyscenic.readthedocs.io/en/latest/index.html) with default settings to infer regulons (master TFs and their target genes).

CNV estimation and identification of malignant cells

The chromosomal copy number variation (CNV) profile of single cells was inferred with the R package inferCNV (version 1.0.4) [72]. The average signal was used as the reference to define a baseline of normal karyotype, and this average copy number value was subtracted from all cells. The following parameters were applied: cutoff=0.1, cluster_by_groups=TRUE, HMM = TRUE, and denoise=TRUE.
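The baseline subtraction described above can be pictured with the following Python sketch. It is a simplified illustration, not inferCNV itself: it skips the smoothing, HMM and denoising steps, and the function name, the input layout and the choice of which cells count as the reference are all assumptions.

```python
def subtract_reference_baseline(expr, reference_cells):
    """Centre every cell on the average signal of the reference cells.

    expr maps cell id -> list of per-gene signal values (same gene order
    for all cells). reference_cells lists the ids treated as the normal
    baseline (a hypothetical choice made by the caller).
    """
    n_genes = len(next(iter(expr.values())))
    # Per-gene average over the reference cells defines the baseline.
    baseline = [
        sum(expr[cell][g] for cell in reference_cells) / len(reference_cells)
        for g in range(n_genes)
    ]
    # Subtract the baseline from every cell, reference cells included.
    return {
        cell: [value - base for value, base in zip(values, baseline)]
        for cell, values in expr.items()
    }
```

After this step, values near zero indicate a signal close to the reference, while sustained positive or negative runs along a chromosome suggest gains or losses.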

Cell-cell communication analysis

To explore cell-cell communication via ligand–receptor interactions, we employed the strategy proposed by Vento-Tormo et al. [73], based on CellPhoneDB (v2.0), a public repository of ligands, receptors and their interactions [74]. The interaction score between two cell types for a specific ligand-receptor pair was based on the mean expression of the ligand in one cell type and of the corresponding receptor in the other cell type. To identify significant cell-cell interactions, we randomly permuted the cell type labels of all cells 1,000 times to calculate the significance of each pair (p-value < 0.01). This procedure was performed for all pairs of cell types. The interactions between distinct cell subpopulations via putative ligand-receptor pairs were visualized using the ggplot2 package.
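The label-permutation test can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the CellPhoneDB implementation: the score is taken to be the average of the mean ligand expression in the sender type and the mean receptor expression in the receiver type, and the function names, argument layout and fixed random seed are hypothetical.

```python
import random
from statistics import mean

def interaction_score(ligand, receptor, labels, sender, receiver):
    """Average of mean ligand expression in `sender` cells and mean
    receptor expression in `receiver` cells (assumed form of the score)."""
    lig = [x for x, t in zip(ligand, labels) if t == sender]
    rec = [x for x, t in zip(receptor, labels) if t == receiver]
    return (mean(lig) + mean(rec)) / 2

def permutation_p_value(ligand, receptor, labels, sender, receiver,
                        n_perm=1000, seed=0):
    """Empirical p-value: share of label shuffles whose score is at least
    as large as the observed score."""
    rng = random.Random(seed)
    observed = interaction_score(ligand, receptor, labels, sender, receiver)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling keeps the number of cells per type fixed.
        rng.shuffle(shuffled)
        if interaction_score(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Here `ligand` and `receptor` are per-cell expression values in the same cell order as `labels`. A pair would be called significant when the returned proportion falls below the chosen threshold, 0.01 in the text.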

Single-cell trajectory analysis

We used Monocle v2 [33] to illustrate the cell state transitions in all epithelial cells and in tumor cells of the CRC scRNA-seq dataset, and in the CRC5_1 tumor cryosection of the ST-seq dataset. This R package applies a reversed graph embedding technique to reconstruct single-cell trajectories. UMI count matrices and the negbinomial.size parameter were used to create a CellDataSet object with default settings. We filtered variable genes with the following cutoff criteria: (1) expressed in more than 10 cells; (2) average expression value > 0.1; and (3) Qval < 0.01. These variable genes were used for semisupervised trajectory reconstruction. Dimensional reduction and cell ordering were performed using the DDRTree method and the orderCells function.
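The three gene-level cutoffs listed above amount to a simple conjunction of filters, sketched here in Python for clarity. The function name and the per-gene summary keys are hypothetical stand-ins for the statistics that Monocle reports. The sketch only shows how the thresholds combine and does not reproduce the differential expression test that yields the q-values.

```python
def select_ordering_genes(gene_stats, min_cells=10, min_mean=0.1, max_qval=0.01):
    """Return genes passing all three cutoffs.

    gene_stats maps gene name -> dict with keys
    'num_cells_expressed', 'mean_expression' and 'qval'.
    """
    return [
        gene
        for gene, stats in gene_stats.items()
        if stats["num_cells_expressed"] > min_cells   # (1) expressed in > 10 cells
        and stats["mean_expression"] > min_mean       # (2) average expression > 0.1
        and stats["qval"] < max_qval                  # (3) Qval < 0.01
    ]
```

The selected genes would then be passed on as the ordering genes for trajectory reconstruction.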