Introduction

Esophageal squamous cell carcinoma (ESCC) is a common gastrointestinal malignancy in some parts of the world, such as China, with a poor prognosis and high mortality, mainly because specific measures for early diagnosis and effective therapies are lacking. ESCC is a striking paradigm of stepwise carcinoma development, progressing sequentially from inflammation (INF) through hyperplasia (HYP), dysplasia (DYS), and carcinoma in situ (CIS) to invasive carcinoma (ICA)1. However, how ESCC initiates and develops remains largely unknown to date. This long-standing question is the major obstacle to early intervention and to improved clinical care of the disease. Exploring the mechanisms underlying ESCC formation and identifying biomarkers are therefore crucial tasks for early detection, diagnosis, and precision treatment of the cancer.

Recent genome studies, including The Cancer Genome Atlas (TCGA) project, have identified many genome variations in ESCC by whole-exome or whole-genome sequencing of clinical tissue samples2,3. Although these studies have revealed an important role of the identified genome alterations in ESCC, it remains unresolved how mutations drive normal epithelial cells through precancerous lesions to invasive carcinoma, because all of these studies had a cross-sectional design. Another important issue is that somatic mutations alone might not be sufficient for ESCC initiation and development, because such mutations also occur in pathologically normal human esophageal tissues4. These findings imply that other mechanisms, such as transcriptome aberrations, merit further investigation in ESCC tumorigenesis. In addition, the complex context of the tumor microenvironment (TME) has been shown to have important roles in tumor initiation and development. Elucidating the dynamic transcriptomic changes of TME cellular components during tumorigenesis is therefore essential to discovering how ESCC develops. Recently established single-cell transcriptomic analysis is a promising approach that makes it possible to analyze complex cellular compositions and to decipher cell state transitions in tissue samples5.

Capturing continuous tumorigenic lesions from a single patient over time is impossible, so addressing this issue depends heavily on well-established animal models. Fortunately, the chemical carcinogen 4-nitroquinoline 1-oxide (4NQO) has been shown to induce ESCC development in mice in a manner that mimics the tumorigenic processes of ESCC in humans6,7. The distinct stages of ESCC tumorigenesis induced by the carcinogen provide an excellent opportunity to interrogate cell state transitions by single-cell RNA sequencing (scRNA-seq) and thereby to elucidate the dynamics of ESCC tumorigenesis.

Here we report a single-cell transcriptomic profiling study of various cell types across every pathological stage of ESCC initiation and development in a 4NQO-induced mouse model. We have built a complete atlas, characterized the transcriptomic profiles of esophageal epithelial cells in transition under carcinogen exposure, and elucidated how they evolve over time. We have also depicted the transition landscapes of non-epithelial cells, i.e., fibroblasts and immune cells, in the esophageal microenvironment at different stages of tumorigenesis. Furthermore, we have found that some key changes in mice also occur in human esophageal tissue samples. These results shed light on the phenotypes and transition fates of different cell types across the tumorigenic processes of ESCC in an animal model and may have implications for human ESCC.

Results

Bulk RNA-seq and scRNA-seq of mouse esophageal samples

To explore the transcriptomic alterations at various pathological stages during ESCC tumorigenesis, 8-week-old female C57BL/6 mice were treated with 4NQO for 16 weeks, which resulted in five recognizable precancerous and cancerous lesions in the esophagus, i.e., INF, HYP, DYS, CIS, and ICA (Fig. 1a; Supplementary Fig. 1a). All mice receiving 4NQO developed the expected esophageal lesions at the different time points of the experiment. The number of ESCCs per animal (mean ± SD) at week 26 was 6.0 ± 3.6. We first performed conventional RNA-seq on mini-bulks of normal epithelial samples obtained by laser-capture microdissection (LCM) from control mice aged 1, 2, 8, and 25 months and on various precancerous or cancerous lesions from 4NQO-treated mice. Principal component analysis (PCA) of the differentially expressed genes showed that the expression programs in mice exposed to 4NQO were substantially different from those in control mice; however, the expression profiles of 4NQO-exposed and non-exposed mice overlapped to some extent. For instance, the transcriptomic profile of stage INF in 4NQO-exposed mice was similar to that of control mice aged 25 months (Fig. 1b), indicating that, because of high intra-tissue heterogeneity, conventional RNA-seq could not precisely resolve the path of malignant cell transition during the development and progression of ESCC.

Fig. 1: Experimental design of RNA-seq on 4NQO-induced esophageal lesions in mice.

a Induction of esophageal precancerous and cancerous lesions in mice. Mice were treated with 4NQO in drinking water (100 μg/ml) for 16 weeks and then kept without 4NQO treatment for another 10 weeks (upper panel). Mice were killed before (week 0), during (week 12), and after treatment (weeks 20, 22, 24, or 26). Hematoxylin–eosin (H&E) staining and immunohistochemistry (IHC) analysis of Mki67 on esophageal epithelium slides clearly identified six different pathological stages, i.e., normal (NOR), inflammation (INF), hyperplasia (HYP), dysplasia (DYS), carcinoma in situ (CIS), and invasive carcinoma (ICA) (lower panel). Similar staining results were observed in over three visual fields from each stage of esophageal lesions (more staining images in Supplementary Fig. 2d). Scale bar, 100 μm. b Plot of principal component analysis (PCA) of mini-bulk tissue RNA-seq on different pathological lesions, indicated by different colors. M, month of mouse age. c Overview of the experimental design of scRNA-seq. Pathological lesions of the esophagus were dissected and digested into single-cell suspensions for further separation using FITC-CD45 antibody via FACS (1–4). CD45+ and CD45− cells, whose numbers in the different lesions are shown in the right panel, were each subjected to scRNA-seq.

We therefore conducted time-ordered single-cell transcriptomic profiling of various esophageal lesions in 4NQO-exposed mice. For stage NOR and stage INF, we used the whole esophagus from 30 and 20 mice, respectively. For the other stages, we used the lesion foci; the sample numbers were 32 HYP from 25 mice, 24 DYS from 17 mice, 24 CIS from 20 mice, and 25 ICA from 23 mice (Fig. 1c). A total of 66,089 cells, including 29,975 CD45+ immune cells and 36,114 CD45− non-immune cells, were obtained across the pathological stages (Fig. 1c; Supplementary Fig. 1b). The median number of unique molecular identifiers (UMIs) per cell was 7748 in immune cells and 9370 in non-immune cells, with a median of 1936 detected genes for immune cells and 2620 for non-immune cells (Supplementary Fig. 1c). Based on the expression of canonical markers, we classified immune cells into T cells, B cells, myeloid cells, and natural killer cells (Fig. 1c; Supplementary Fig. 1d, e) and identified four clusters of non-immune cells, namely epithelial cells, fibroblasts, endothelial cells, and myocytes, by using t-distributed stochastic neighbor embedding (tSNE) (Fig. 1c; Supplementary Fig. 1f, g).

Identifying epithelial cell types during ESCC tumorigenesis

To discover how normal esophageal epithelium develops into invasive carcinoma, we next examined the expression alterations and functional changes in epithelial cells during the transition from normal to precancer or cancer. We identified 1756 epithelial cells across all six stages that were classified into six subtypes designated as EpiC 1 to EpiC 6 (Fig. 2a; Supplementary Fig. 2a and Supplementary Table 1). Through the analysis of pathway activities (Fig. 2b; Supplementary Fig. 2b), we found that EpiC 1 (n = 339) had higher expression of genes (e.g., Birc5, Mki67, Top2a, and Ube2c) in mitosis and proliferation8,9,10,

Methods

Human biospecimen collection

ESCC tumor, dysplasia lesions and tumor-adjacent (>5 cm) normal tissues of the same patients (n = 4) used for LCM and esophageal lesions of different pathological stages (n = 45) used for bulk RNA sequencing were collected during surgery or endoscopy in Linzhou Esophageal Cancer Hospital (Henan Province, China) from 2018 to 2019. The various lesions were diagnosed independently by at least two pathologists according to the American Joint Committee on Cancer Eighth edition. No patient had received chemotherapy or radiotherapy before biopsy or surgery. This study was approved by the Institutional Review Boards of Cancer Hospital, Chinese Academy of Medical Sciences and informed consent was obtained from each patient. Clinical information was collected from patients’ medical records.

Induction of multi-staged ESCC development and sample preparations

Animal experiments in this study were conducted in compliance with approved protocols and guidelines from the Institutional Animal Care and Use Committee of the Chinese Academy of Medical Sciences. Eight-week-old female C57BL/6 mice, purchased from the Beijing Huafukang Bioscience Company in China, were maintained in a local housing facility under controlled conditions (23 ± 1 °C, 50 ± 10% humidity, and a 12 h/12 h light-dark cycle). Mice were treated with 4NQO (Sigma-Aldrich) in drinking water (100 μg/ml) for 16 weeks to induce multi-staged ESCC carcinogenesis. Drinking water containing the carcinogen was replaced once a week with a freshly prepared solution, and mice had access to drinking water ad libitum during treatment. After 16 weeks of carcinogen treatment, the 4NQO drinking water was replaced by sterile water until the mice were killed.

The esophageal lesions were identified by two independent pathologists based on the histopathological criteria described previously58 (Fig. 1a). Briefly, stage NOR was a well-oriented stratified epithelium consisting of a basal zone and a superficial zone. Stage INF was normal epithelium with focal aggregates of epithelial lymphocytes. Stages HYP and DYS were defined by loss of polarity in the epithelial cells, nuclear pleomorphism, hyperchromasia, and increased or abnormal mitoses: in stage HYP these abnormalities were confined to the lower third of the epithelium, whereas in stage DYS they were present in the lower two thirds. Lesions in which such abnormal changes involved the entire thickness of the epithelium were considered carcinoma in situ (stage CIS). Stage ICA was defined as a lesion with invasion into the sub-epithelial tissues. Esophageal samples from 4NQO-treated mice were subjected to single-cell RNA sequencing at six time points corresponding to the different pathological stages: stage NOR (at week 0), stage INF (at week 12), stage HYP (at week 20), stage DYS (at week 22), stage CIS (at week 24), and stage ICA (at week 26). A group of control mice not treated with 4NQO were killed at months 1, 2, 8, and 25 (n = 2 at each time point). The esophagus was removed immediately after the animal was killed. Cross-sections of the esophagus were cut and stored at –80 °C, and sections of the frozen tissues were stained with hematoxylin–eosin (H&E) for histopathological examination and microdissection for bulk RNA sequencing.
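The layered criteria above amount to a decision rule on how much of the epithelial thickness is affected. The following is a minimal Python sketch of that rule, not the pathologists' actual procedure. The function name, the `atypia_fraction` input (a hypothetical score between 0 and 1), the two boolean flags, and the numerical cut-off of 1/3 are all illustrative assumptions. The treatment of anything beyond the lower third but short of full thickness as DYS is also an assumption.

```python
def classify_stage(atypia_fraction, invades_subepithelium=False,
                   lymphocyte_aggregates=False):
    """Map simplified histological features to a stage label.

    atypia_fraction is a hypothetical score in [0, 1]: the share of the
    epithelial thickness, measured from the basal side, that shows atypia
    (loss of polarity, pleomorphism, abnormal mitoses).
    """
    if not 0.0 <= atypia_fraction <= 1.0:
        raise ValueError("atypia_fraction must lie in [0, 1]")
    if invades_subepithelium:
        return "ICA"  # invasion into the sub-epithelial tissues
    if atypia_fraction >= 1.0:
        return "CIS"  # entire thickness of the epithelium involved
    if atypia_fraction > 1.0 / 3.0:
        return "DYS"  # assumption: beyond the lower third, not full thickness
    if atypia_fraction > 0.0:
        return "HYP"  # confined to the lower third
    # No atypia: distinguish inflamed from normal epithelium.
    return "INF" if lymphocyte_aggregates else "NOR"
```

For example, `classify_stage(0.6)` gives `"DYS"` under these assumed cut-offs, while any lesion flagged as invasive is labeled `"ICA"` regardless of the score.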

Bulk RNA sequencing and data analysis

Tissue samples from humans or mice were cut into 5–10 consecutive sections (8 μm), and the epithelial layer, containing 30–50 cells on each section, was micro-dissected with a Leica LMD7000 laser-capture microdissection system. RNA was isolated from the mini-bulk samples and cDNA was prepared for sequencing based on the Geo-seq protocol59. Sequencing libraries were built using the TruePrep DNA Library Prep Kit V2 for Illumina (Vazyme) and evaluated on a Bioanalyzer (DNA HS kit, Agilent). RNA-seq data from human and murine samples were mapped to the GRCh38 human genome and the GRCm38 murine genome, respectively, by HISAT2 (version 2.1.0)60 with default parameters. The gene expression matrix of raw read counts annotated by HTSeq (version 0.6.1p1) was processed using DESeq2 (version 1.22.2)61 and visualized by showing the first 3 dimensions calculated by the plotPCA function. We used TRIzol to extract bulk RNA from patient biopsies (n = 45) and constructed sequencing libraries using the NEBNext Ultra II RNA Library Prep Kit for Illumina. Sequencing data were processed by HISAT2 and HTSeq as described above. The normalized expression from bulk samples of human precancerous lesions was used to estimate the neutrophil fraction with CIBERSORT (version 1.06)62.
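DESeq2 normalizes raw read counts with median-of-ratios size factors by default. The sketch below illustrates that normalization step in plain Python. It is not the DESeq2 code itself, and the text does not state which further transformation was applied before plotPCA. The function names and the list-of-lists input layout are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of genes, each a list of raw read counts per sample.
    Genes with a zero count in any sample are skipped because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has a non-zero count in every sample")
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide each sample's counts by its size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

With two samples in which the second is sequenced exactly twice as deeply as the first, the two size factors differ by a factor of two, and the normalized counts of each fully expressed gene become equal across samples.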

Single-cell RNA sequencing (scRNA-seq) and data analysis

For mice at stages NOR and INF, the whole esophagi were taken; for mice at other stages, the dysplastic/malignant lesions were taken immediately after the mice were killed. The tissue samples of ESCC and various precursor lesions were gently minced into small pieces and digested in RPMI-1640 medium (Invitrogen) containing collagenase IV (Gibco) and hyaluronidase (Sigma-Aldrich). Cells were stained with CD45-FITC antibody (553080, BD Biosciences, dilution 1:20) for fluorescence-activated cell sorting (FACS) on a FACSAria sorter (BD Biosciences). Single cells with or without FITC signal, representing immune or non-immune cells, respectively, were sorted and captured separately in nanoliter droplets using Chromium (10× Genomics). scRNA-seq libraries were prepared using Chromium Single Cell 5′ Reagent Kits (10× Genomics), and sequencing was performed on an Illumina HiSeq X Ten System.

Raw gene expression matrices obtained per sample using CellRanger (version 2.1.0, 10× Genomics) were combined using the Seurat R package (version 2.3.4)63. Genes detected in <0.1% of all cells were filtered out. We further excluded cells with <500 detected genes and cells in which >10% of expression came from mitochondrial genes. After quality control, 66,089 cells were further analyzed for their gene expression profiles. Genes with normalized expression between 0.0125 and 3 and dispersion >0.5 were selected as highly variable genes. The resulting data were first summarized by principal component analysis (PCA), and the first several PCs were then selected for tSNE dimensional reduction using the default settings of the RunTSNE function. The numbers of resulting highly variable genes and the selected PCs are shown in Supplementary Table 2. Cell clusters in the resulting two-dimensional representation were annotated as known biological cell types using canonical marker genes.
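The quality-control thresholds and the highly variable gene cut-offs described above can be sketched in plain Python as follows. This is an illustration of the stated thresholds only, not the Seurat implementation. The function names, the input layouts, and the `mt-` prefix used to recognize mouse mitochondrial genes are assumptions. The per-gene mean and dispersion values are taken as precomputed inputs, and per-cell gene counts are computed here on the unfiltered matrix, which is one possible reading of the order of operations.

```python
def qc_filter(matrix, gene_names, min_cell_frac=0.001, min_genes=500,
              max_mito_frac=0.10, mito_prefix="mt-"):
    """Return (kept_cell_indices, kept_gene_indices).

    matrix: list of cells, each a list of UMI counts ordered like gene_names.
    """
    n_cells = len(matrix)
    n_genes = len(gene_names)
    # Keep genes detected in at least min_cell_frac of all cells.
    kept_genes = []
    for g in range(n_genes):
        detected_in = sum(1 for cell in matrix if cell[g] > 0)
        if detected_in >= min_cell_frac * n_cells:
            kept_genes.append(g)
    # Assumption: mitochondrial genes are recognized by their symbol prefix.
    mito = [g for g, name in enumerate(gene_names)
            if name.lower().startswith(mito_prefix)]
    kept_cells = []
    for i, cell in enumerate(matrix):
        total = sum(cell)
        n_detected = sum(1 for c in cell if c > 0)
        # Empty cells are treated as failing the mitochondrial check.
        mito_frac = sum(cell[g] for g in mito) / total if total else 1.0
        if n_detected >= min_genes and mito_frac <= max_mito_frac:
            kept_cells.append(i)
    return kept_cells, kept_genes


def select_hvg(stats, mean_range=(0.0125, 3.0), min_dispersion=0.5):
    """Pick highly variable genes from precomputed statistics.

    stats: dict mapping gene -> (mean_expression, dispersion).
    """
    low, high = mean_range
    return [gene for gene, (mean_expr, disp) in stats.items()
            if low < mean_expr < high and disp > min_dispersion]
```

The defaults mirror the thresholds in the text (0.1% of cells, 500 genes, 10% mitochondrial expression, mean between 0.0125 and 3, dispersion above 0.5); smaller values are convenient when trying the functions on a toy matrix.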

Major cell-type clustering and marker gene identification

We reanalyzed epithelial cells and stromal cells separately to identify their sub-clusters by using highly variable gene identification and dimensional reduction as described above (Supplementary Table 2). Cells with mixed features were removed from further analysis (e.g., cells expressing both Cd3d and Cd19, indicating T cell and B cell multiplets). Clusters were identified using the FindClusters function, and the specific gene markers for each cluster were determined using the FindAllMarkers function implemented in the Seurat package.
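The removal of cells with mixed features can be illustrated with a small Python sketch that flags cells co-expressing markers of two lineages. Only the Cd3d/Cd19 pair comes from the text. The function name, the dictionary input layout, and the expression threshold of zero are assumptions for illustration.

```python
def flag_multiplets(expr, marker_pairs=(("Cd3d", "Cd19"),), threshold=0.0):
    """Return the set of cell ids that express both genes of any marker pair.

    expr: dict mapping cell id -> dict mapping gene -> expression value.
    A gene counts as expressed when its value exceeds threshold (assumed 0).
    """
    flagged = set()
    for cell_id, genes in expr.items():
        for gene_a, gene_b in marker_pairs:
            if (genes.get(gene_a, 0.0) > threshold
                    and genes.get(gene_b, 0.0) > threshold):
                flagged.add(cell_id)
                break
    return flagged
```

Further lineage pairs can be passed through `marker_pairs`; the cells returned would then be dropped before sub-clustering.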

Gene set variation analysis (GSVA)

Pathway analyses were predominantly performed on the 50 hallmark pathways described in the Molecular Signatures Database (MSigDB, version 6.2)64. We also assessed biological process activities using the biological process gene sets of the Gene Ontology (GO). To assign pathway activity estimates to individual cells, we applied GSVA using standard settings, as implemented in the GSVA package (version 1.30.0)65. To assess differential pathway activities between sub-clusters of cells, we contrasted the activity scores of the cells using the Limma package (version 3.38.3)66. Differential pathway activities were calculated for each identified cluster. The t-values of selected significantly differential pathways (P < 0.05) among the top 10 were visualized using heatmaps with the average pathway activity scores of each cluster.
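To make the idea of a per-cluster t-value concrete, the sketch below compares the activity scores of one pathway in each cluster against all remaining cells. It uses an ordinary Welch t-statistic as a stand-in; Limma computes moderated t-statistics from a linear model, so the numbers would differ. The one-versus-rest contrast, the function names, and the input layout are assumptions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (each needs at least 2 values)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        raise ValueError("both groups have zero variance")
    return (mean(a) - mean(b)) / se


def cluster_vs_rest(scores, labels):
    """t-statistic of each cluster against all other cells for one pathway.

    scores: per-cell activity scores of a single pathway.
    labels: cluster label of each cell, in the same order.
    """
    result = {}
    for cluster in sorted(set(labels)):
        inside = [s for s, lab in zip(scores, labels) if lab == cluster]
        outside = [s for s, lab in zip(scores, labels) if lab != cluster]
        result[cluster] = welch_t(inside, outside)
    return result
```

A positive value indicates higher average pathway activity in the cluster than in the remaining cells, which is the quantity a heatmap of t-values would display.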

Analysis of transcription factor expression

SCENIC (version 1.1.0)67 was used to assess the transcriptional activity of high-quality epithelial cells (UMI > 5500). The analysis used the motif database for RcisTarget and GRNBoost (corresponding to GENIE3 1.4.3, AUCell 1.4.1 and RcisTarget 1.2.1; with mm10__refseq-r80__10kb_up_and_down_tss.mc9nr). The input matrix was read counts.

Cell transition trajectory and diffusion map analysis

Monocle 2 (version 2.10.1)68 was used for trajectory analysis of high-quality epithelial cells (UMI > 5500). The top 100 markers of each cluster were all used for cell ordering. Dimensionality reduction and trajectory construction were performed on the selected genes with default methods and parameters. We calculated diffusion components using the RunDiffusion function implemented in the Seurat package with default parameters. The first three dimensions were used to draw diffusion maps. The mean coordinates of all cells in a cluster were taken as the center of that cluster. The stage NOR cell with the largest total distance to the other stages was used as the start point.
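The choice of start point can be sketched as follows, under one reading of the description above: group centers are the mean diffusion-map coordinates of each stage, and the start cell is the stage NOR cell whose summed Euclidean distance to the centers of all other stages is largest. The function names and input layout are assumptions, and the use of Euclidean distance is also an assumption, since the text does not name the metric.

```python
import math


def centers(coords, groups):
    """Mean coordinate of each group.

    coords: list of coordinate tuples (e.g. the first three diffusion
    components); groups: group label of each cell, in the same order.
    """
    sums, counts = {}, {}
    for point, group in zip(coords, groups):
        acc = sums.setdefault(group, [0.0] * len(point))
        for k, value in enumerate(point):
            acc[k] += value
        counts[group] = counts.get(group, 0) + 1
    return {g: tuple(v / counts[g] for v in acc) for g, acc in sums.items()}


def pick_start_cell(coords, groups, start_group="NOR"):
    """Index of the start_group cell farthest, in summed Euclidean
    distance, from the centers of all other groups."""
    ctr = centers(coords, groups)
    others = [c for g, c in ctr.items() if g != start_group]
    candidates = [i for i, g in enumerate(groups) if g == start_group]
    if not candidates or not others:
        raise ValueError("need start_group cells and at least one other group")
    return max(candidates,
               key=lambda i: sum(math.dist(coords[i], c) for c in others))
```

`math.dist` requires Python 3.8 or later. The returned index refers to the position of the cell in the input lists.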

Analysis of interaction between cell types

We used CellPhoneDB (version 2.0.6)69 with default arguments to reveal interactions between cell types. Each cluster analyzed was downsampled to 100 cells because of the low cell numbers of some epithelial clusters. Interactions of epithelial cells with immune cells and of immune cells with epithelial cells are shown separately.
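Downsampling each cluster to a fixed size before an interaction analysis can be sketched as below. The function name, the dictionary input, and the fixed random seed are assumptions; the text states only the target of 100 cells per cluster, not how the cells were drawn. Clusters smaller than the target are kept whole here, which is also an assumption.

```python
import random


def downsample_clusters(cells_by_cluster, max_cells=100, seed=0):
    """Randomly subsample every cluster to at most max_cells cells.

    cells_by_cluster: dict mapping cluster name -> list of cell ids.
    Sampling is without replacement; the seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    sampled = {}
    for cluster, cells in cells_by_cluster.items():
        if len(cells) > max_cells:
            sampled[cluster] = rng.sample(cells, max_cells)
        else:
            sampled[cluster] = list(cells)
    return sampled
```

Equalizing cluster sizes in this way keeps large clusters from dominating statistics that depend on the number of cells per group.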

Immunohistochemistry and immunofluorescent detection

Formalin-fixed paraffin-embedded (FFPE) sections of esophageal precursor lesions were collected from 30 patients between 2016 and 2018 in Linzhou Cancer Hospital, including INF (n = 5), HYP (n = 13), DYS (n = 8) and CIS (n = 3), to validate the results obtained in mice. The protein expression levels of the marker genes were detected by IHC staining for mouse tissues and by immunofluorescence (IF) for human specimens, using the antibodies (Abcam) shown in Supplementary Table 3. The samples were incubated with antibodies against Ki67 (1:50 for IHC, ab16667), Top2a (1:8000 for IHC, 1:10,000 for IF, ab52934), Aldh3a1 (1:200 for IHC, 1:600 for IF, ab76976), Atf3 (1:200 for IHC, 1:600 for IF, ab216569), S100a8 (1:500 for IHC, 1:1500 for IF, ab92331), Mmp14 (1:2000 for IHC, 1:6000 for IF, ab51074), or Itga6 (1:250 for IHC, 1:750 for IF, ab181551). Opal multiplex staining was performed according to the Opal 5-Color Manual IHC Kit (Perkin Elmer). Opal DAPI, Opal 520, Opal 570, Opal 620, and Opal 690 were used to generate different signals. Slides were counterstained with DAPI (1:2000) for nuclei visualization and subsequently coverslipped using VectaShield Hardset mounting medium. The slides were imaged using the Vectra Polaris Automated Quantitative Pathology Imaging System (Perkin Elmer). We used inForm software (Perkin Elmer) to unmix and remove autofluorescence and to analyze the multispectral images.

Statistical analysis

Statistical analyses were conducted using R (v3.5.1)70 and Prism 7 (GraphPad Software). Pearson’s correlation was calculated with the R function cor(), and significance was determined using a two-sided unpaired Wilcoxon rank-sum test. P < 0.05 was considered statistically significant.
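For readers who want to see what the two statistics compute, the following plain-Python sketch gives the Pearson correlation coefficient and the rank-sum statistic W, defined as the number of pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half. It does not compute P values, and the function names are assumptions for illustration; the analyses in this study were run in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def rank_sum_w(a, b):
    """Rank-sum statistic W for sample a versus sample b.

    Counts pairs (x from a, y from b) with x > y; a tie adds 0.5.
    """
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
```

W ranges from 0, when every value in the first sample is below every value in the second, to the product of the two sample sizes in the opposite case.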

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.