Introduction

Recent developments in single-cell sequencing technologies have made it possible to analyze individual cells. A number of reports have demonstrated that single-cell analysis provides pivotal information for elucidating cellular plasticity and diversity within a given population of cells in vitro and in vivo. There are a number of potential applications for scRNA-seq analysis. Among them, cancer is considered one of the most important targets1. Cancer is a complex cellular ecosystem that consists of various cell types, including cancer cells, cancer-associated fibroblasts, tumor-infiltrating leukocytes and vascular cells2. In cancer, no two individual cells are identical, as they are located in varying micro-environmental conditions and interact with different cells. Even among clonal cancer cells, diverse phenotypic features are frequently observed in each individual cell during the clonal evolution of cancers3,4,5. It is important to focus on the molecular diversity of cancer cells to understand the mechanisms underlying the emergence of drug-resistant and metastasizing cells. Detailed knowledge of such intra-tumor heterogeneity would provide crucial information for understanding the eventual development of drug-resistant cells or metastatic dissemination in cancer and also generate potential opportunities for novel pharmaceutical interventions6,7. Most cancers acquire resistance to anti-cancer drugs, including gefitinib, one of the most well-characterized molecular-targeting anti-cancer drugs for EGFR in lung adenocarcinomas, after a period of drug treatment8,9,10. Several drug resistance-acquiring mutations, such as the T790M mutation in the EGFR gene, have been reported11. These drug-resistant mutations emerge in a small number of cells and spread to the population12.
Before drug resistance is fixed in the form of genomic mutations, transcriptomic diversity is thought to provide the initial repertoire that allows cells to survive initial selection13. Details of the molecular process that eventually leads to drug resistance remain mostly elusive despite several pioneering studies14. It has been difficult to examine cellular diversity using conventional methods, which subject groups of cells to assays in bulk. Even the most powerful datasets describing cancers, such as the TCGA, ICGC and COSMIC databases, contain only limited data on cancer cell heterogeneity15,16,17,18,19.

It is expected that single-cell analysis will provide substantial novel insights into the identification and characterization of rare cells among cancer cells20,21,22,23,24. Such single-cell analyses were initiated with single-cell RNA-seq (scRNA-seq) analysis. In scRNA-seq, single cells are separated by micro-pipetting, laser capture micro-dissection, FACS, micro-fluidic and micro-droplet-based methods25,26,27,28,7C inset table, Fisher’s exact test, p = 0.39).

Module activities in other cell lines

We examined whether the high activities of the DUSP1-AURKA gene modules were unique to gefitinib-treated PC9 cells. Using the micro-chamber datasets for the other cell lines, we conducted a similar module analysis. First, we performed clustering analysis using all genes and found that the cells were separated according to their cell line of origin (Sup. Fig. S8A). Of note, untreated and gefitinib-treated cells were not separated in H1975 and H2228, which gefitinib should not affect (Sup. Fig. S8B). Distinct patterns were observed for the PC9 and II-18 cells, in which the untreated and gefitinib-treated cells were distinct. H1650 cell clustering was marginal, perhaps reflecting their partial response to gefitinib. When we clustered each of the cell lines by the “magenta” module (the DUSP1 gene module) in PC9 cells, we did not identify clear outlier cells in the gefitinib-treated II-18 cells. Several cells had high expression levels of AURKA and DUSP1 genes (s_255 and s_274 cells; indicated by arrows; Fig. 8A). However, their expression levels were lower than those of the outlier PC9 cell, s_062 (top panel; Fig. 8A).

Figure 8

Module analysis in another cell line. (A) Hierarchical clustering analyses were conducted using the genes included in the “magenta” module in PC9 (top) and II-18 cells (second). The expression levels (RPKM) of AURKA and DUSP1 are also shown in the bar plot (middle). The scatterplot shows the relationship between the expression levels of two genes, DUSP1 (x-axis) and AURKA (y-axis) (bottom). (B) Clustering of II-18 cells by the module “II-18-red.” Hierarchical clustering analyses were conducted using the genes included in the II-18 module “red” (top). The treatment of individual cells, their MEred value, their expression level of SOX4, their MEturquoise value and their CD44 expression level are shown. Two cells (s_252 and s_247) show low expression of SOX4 and high expression of CD44.

Instead, when we conducted a similar analysis for II-18 cells, we identified a distinct module, “red.” This module consisted of 132 genes (Sup. Table S8) with the SOX4 gene as a core. Clustering using this module identified two outlier cells (s_252 and s_247 cells) (Fig. 8B) that exhibited high levels of ME-turquoise and CD44 expression. Analysis using the micro-droplet dataset revealed that such cells represented 0.06% of the entire population (Sup. Fig. S9). SOX4 has been reported to act as a tumor suppressor gene, depending on the context, by promoting cell cycle arrest and apoptosis53,54,55. CD44 is known as a cancer stem or cancer progenitor marker in several tumors56,57. Another report indicates that cells expressing CD44 show stem-cell-like properties58. Aberrant activation of this module was not observed in PC9. Cancer cells may utilize some common modules for survival but may more frequently use unique modules, depending on their original transcriptomic status.

Possible biological relevance of outlier cells

Using data from clinical samples, we investigated the potential phenotypic significance of the outlier cells with high expression levels of DUSP1 and AURKA genes. For this purpose, we used the TCGA dataset9, which provides transcriptome information as well as clinical information for 506 lung adenocarcinoma patients. We divided the patients into two groups based on AURKA and DUSP1 expression levels (Fig. 9A, inset table). Again, there seemed to be little or no direct correlation between the expression levels of these genes in the clinical samples. Twenty-nine cases showed high expression levels of both genes. We compared the overall survival times of the patients depending on their expression levels of AURKA and DUSP1 genes. We observed that the patients with high expression levels of both AURKA and DUSP1 genes showed a poor prognosis compared with cases with normal or low expression levels of either gene (Fig. 9B). Activation of AURKA and DUSP1 genes may have a favorable effect on the survival of lung cancer cells. Outlier cells with modules highly related to AURKA and DUSP1 would have survival advantages, particularly under severe conditions, and may contribute to the development of small populations of more malignant cells for which anti-cancer drugs are less effective.

Figure 9

Biological relevance based on the TCGA-LUAD dataset. (A) Heatmap of 506 TCGA patients, showing their expression levels of AURKA and DUSP1. The patients were divided into nine groups based on the expression profiles. In the right margin, the groups of patients are shown (top). The number of patients in each group is shown in the inset table (middle). The color bar in the box shows the gene expression levels and the groups (bottom). (B) The Kaplan-Meier curve shows that patients with high expression levels of both AURKA and DUSP1 are associated with a poor prognosis. Statistical significance (p-value) of differences between the two groups is shown in the plot.

Conclusions

In this study, we first evaluated two representative analytical platforms used for scRNA-seq, the micro-chamber and the micro-droplet methods. We found that the datasets obtained from these platforms are similar in nature, although each platform has its own advantages and disadvantages. To make the most of these advantages, we combined the datasets generated from the two platforms in the later sections. In the first part of the paper, we generated datasets from the micro-chamber and the micro-droplet platforms. On both platforms, the datasets consisted of the gene expression information of each individual cell. Such a separation was possible even though the two platforms isolate single cells by different methods. Namely, for the micro-chamber platform, we collected individual cells by separating them into different chambers. With the micro-droplet platform, we separated the cells by confining them in micro-droplets, where the mRNAs of each cell were labeled with distinct barcodes in individual droplets. As a result, even though the number of analyzed cells and the sequencing depth per cell differed between the two platforms, we could compare the expression information at the individual cell level. Detailed technical evaluations and comparisons of these platforms revealed that both methods were highly reproducible and concordant. However, the difference in sequencing depth, which depends on the number of cells analyzed at a given sequencing cost, gave the datasets distinct features. Indeed, both methods had inherent advantages and disadvantages. Namely, the micro-chamber system enabled us to examine the detailed character of each cell in terms of its gene expression information.
However, the feasible number of cells is too small to detect a rare population of cells or, if such cells are detected, to estimate their frequency and variance. Conversely, the micro-droplet platform examined a much larger number of cells. We considered the micro-droplet platform to be still useful, despite the expression information from a given single cell being relatively poor. This is the only currently available platform that can analyze >5,000 cells at the same time. Without this platform, it would be essentially impossible to characterize a population of cells regarding the gene expression information of each individual cell. Therefore, the micro-chamber method should be used as a complement to the micro-droplet method. Namely, the former has an advantage in sequencing depth and rich expression information for each single cell, while the latter has an advantage in the population analysis of the cells. However, the limited sequencing depth for each cell makes interpretation of the gene expression data on its own difficult.

In the second part of this paper, we demonstrated how to address these limitations by integrating the data from the two platforms. First, we provided a statistical inference originating from the micro-chamber dataset that could predict the missing values in the micro-droplet dataset. Second, we identified a minor population of cells using a transcriptional module-based approach with the micro-chamber dataset. Further analyses using the micro-droplet dataset revealed the frequency and divergence of such cells in the entire population. In particular, we identified two modules with the AURKA gene and the DUSP1 gene as their cores. Interestingly, simultaneous activation of those genes was associated with the poorest prognosis in clinical samples. We believe that single-cell analysis would provide indispensable information for further analysis of the molecular basis underlying the emergence of such cancer cells.

Further detailed evaluations are clearly needed to validate the clinical relevance of the observed heterogeneity of cancer cells. Diverse in vivo microenvironments are likely to impose further complicating factors on cellular gene expression. Several methods to monitor single-cell transcriptomes in vivo are being developed. However, the resolution and precision of the data are still limited. By taking advantage of cell lines, we believe that this work provides a first step towards a thorough understanding of the diverse nature of cancer.

Materials and Methods

Cell culture

PC9 and II-18 cells were acquired from the RIKEN Bio Resource Center (catalog numbers RCB4455 and RCB2093), and H1650, H1975 and H2228 were acquired from the American Type Culture Collection (catalog numbers CRL5883, CRL5908 and CRL5953). The cells were grown in RPMI-1640 medium (Wako, 189–02145) with 10% fetal bovine serum (FBS), MEM Non-Essential Amino Acid Solution (catalog number M7145, Sigma-Aldrich, St. Louis, MO) and penicillin and streptomycin in an incubator maintained at 37 °C with 5% CO2. For gefitinib (CAS 184475-35-2, Santa Cruz Biotechnology) treatment, the drug was added to the culture medium at a final concentration of 1 μM. Twenty-four hours after the drug treatment, the cells were harvested. For the untreated control, DMSO was added to the culture medium in place of gefitinib. For each experiment, 10^6 cells were harvested and separated using bead-seq and a Chromium Single Cell 3’ (10× Genomics, version 1).

Single-cell RNA-seq with the micro-chamber system

We prepared libraries according to Matsunaga et al.31 and utilized the HiSeq 2500 platform (Illumina) with 50-base single-end reads. For the PC9 replicate samples, we performed 35-base single-end reads. To remove ribosomal RNA, the generated RNA-seq tags were mapped to rmRNA, and unmapped reads were removed. Trimmed reads were aligned to the human reference genome (UCSC hg19) by TopHat/Bowtie. Using our Perl script, RNA-seq tag counts were calculated as reads per kilobase RNA per million mapped tags (RPKM)59.
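The RPKM normalization named above can be illustrated with a minimal Python sketch. It is not the Perl script used in this study. The function name `rpkm` and the toy numbers are hypothetical, and the sketch assumes the standard definition (tag count scaled by transcript length in kilobases and by millions of mapped tags).

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of RNA per million mapped tags (standard definition)."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # count / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Toy example (hypothetical numbers): 500 tags on a 2,000-base transcript
# in a cell with 1,000,000 mapped tags.
example_value = rpkm(500, 2000, 1_000_000)
```

Because the value is divided by transcript length, RPKM suits the full-length micro-chamber libraries; the length-independent ppm measure is used for the 3′-tag micro-droplet data.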

Single-cell RNA-seq with the micro-droplet system

Using Chromium Single Cell 3′, libraries were prepared according to the manufacturer’s instructions. We used a HiSeq 2500 Rapid run platform to generate 50-base paired-end reads. RNA-seq tags from the Chromium experiments were aligned using Cell Ranger software. Using our Perl script, sequences with low quality and PCR duplicates were removed. Trimmed reads were sorted based on their cell barcode, and only cell barcodes with >5 k tags were selected. Using our Perl script, RNA-seq tag counts were calculated as parts per million mapped tags (ppm).
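The barcode selection and ppm normalization described above might look like the following Python sketch. It is an illustration only, not the Perl script used in this study. The function names and the input layout (one `(cell_barcode, gene)` pair per retained tag) are assumptions, as is the choice of each cell's own total as the ppm denominator.

```python
from collections import Counter

MIN_TAGS_PER_BARCODE = 5000  # "only cell barcodes with >5 k tags were selected"


def counts_per_barcode(tags):
    """tags: iterable of (cell_barcode, gene) pairs, one per retained tag."""
    per_cell = {}
    for barcode, gene in tags:
        per_cell.setdefault(barcode, Counter())[gene] += 1
    return per_cell


def select_cells(per_cell, min_tags=MIN_TAGS_PER_BARCODE):
    """Keep barcodes whose total tag count is strictly greater than min_tags."""
    return {bc: genes for bc, genes in per_cell.items()
            if sum(genes.values()) > min_tags}


def to_ppm(gene_counts):
    """Scale one cell's tag counts to parts per million mapped tags."""
    total = sum(gene_counts.values())
    return {gene: n * 1e6 / total for gene, n in gene_counts.items()}
```

A typical use would be `{bc: to_ppm(c) for bc, c in select_cells(counts_per_barcode(tags)).items()}`, which yields one ppm profile per selected barcode.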

Correlation analysis between two platforms

When the values of the results from the two different platforms, which have distinct numbers of cells and sequencing depths, were compared, the statistical significance of the difference was evaluated by the indicated methods. For the correlation analysis at the cell-to-cell level, we selected the individual cells having the largest, second largest and third largest numbers of sequence tags and designated them as “top1”, “top2” and “top3” cells, respectively, for each of the platforms.
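Selecting the cells with the most sequence tags is a simple ranking step. The Python sketch below shows one way to do it; the function name `top_cells` and the cell IDs in the example are hypothetical.

```python
def top_cells(tag_counts, n=3):
    """Return the n cell IDs with the most sequence tags, largest first.

    tag_counts: {cell_id: number of sequence tags} for one platform.
    """
    ranked = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)
    return [cell for cell, _ in ranked[:n]]


# Toy example with hypothetical cell IDs and tag counts.
example_top = top_cells({"c1": 10, "c2": 50, "c3": 30, "c4": 5})
```

The first, second and third entries of the returned list correspond to the top1, top2 and top3 cells for that platform.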

Cell cycle analysis of PC9 cells

As shown in Fig. 5A at the top left, we used 44 PC9 DMSO-treated cells and 20 cell cycle-regulated genes (four genes per phase) to refine the cell state; CCNE1, E2F1, CDC6 and PCNA were used for G1/S phase, RFC4, DHFR, RRM2 and RAD51 for S phase, CDC2, TOP2A, CCNF and CCNA2 for G2 phase, STK15, BUB1, CCNB1 and PLK1 for G2/M phase, and PTTG1, RAD21, VEGFC and CDKN3 for M/G1 phase. These gene sets were obtained from Whitfield et al.60. The expression levels (RPKM) of each gene in the gene set of each single cell were calculated and scaled. To order the cells, we compared the average scores of the five phases.
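The scaling and phase comparison above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in this study: "scaled" is taken to mean a z-score across cells, and each cell is labeled with the phase whose average score is highest. The function names are hypothetical, and the example uses only a subset of the listed genes.

```python
from statistics import mean, pstdev


def zscore_across_cells(values):
    """Scale one gene's expression across cells to mean 0, s.d. 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def phase_scores(expr, phase_genes):
    """expr: {gene: [RPKM per cell]}; returns {phase: [average score per cell]}."""
    scaled = {g: zscore_across_cells(v) for g, v in expr.items()}
    scores = {}
    for phase, genes in phase_genes.items():
        present = [scaled[g] for g in genes if g in scaled]
        if not present:
            raise ValueError(f"no expression data for phase {phase}")
        scores[phase] = [mean(col) for col in zip(*present)]
    return scores


def assign_phase(scores):
    """Label each cell with the phase whose average score is highest (assumption)."""
    phases = list(scores)
    n_cells = len(scores[phases[0]])
    return [max(phases, key=lambda p: scores[p][i]) for i in range(n_cells)]


# Toy example: two cells, two phases, two genes per phase.
toy_expr = {"CCNE1": [10, 1], "E2F1": [8, 2], "RRM2": [1, 9], "RAD51": [2, 7]}
toy_sets = {"G1/S": ["CCNE1", "E2F1"], "S": ["RRM2", "RAD51"]}
toy_labels = assign_phase(phase_scores(toy_expr, toy_sets))
```

Cells can then be ordered by their assigned phase before the heatmap is drawn.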

For the micro-droplet datasets, we first attempted to draw a heatmap using the same method as for the micro-chamber dataset. However, we could not draw the heatmap as in the top of Fig. 5A because of the missing values in the micro-droplet datasets. To overcome this problem, we analyzed the cell cycle of each cell based on the method previously reported by Macosko et al.33. As shown at the bottom left of Fig. 5A, we used 5,166 PC9 DMSO-treated cells and 603 genes from Macosko et al. From those genes, we excluded genes with a low correlation to the cell state (r < 0.2). Twenty-one genes correlated with the G1/S phase, 14 genes with the S phase, 39 genes with the G2/M phase, 51 genes with the M phase, and 19 genes with the M/G1 phase remained (a total of 144 genes, Sup. Table 2). We calculated the expression levels of these genes and averaged the normalized (log2(PPM+1)) values in each phase. We scaled these scores and obtained a phase-specific score for 5,166 cells. Next, we compared the pattern of the phase-specific scores to nine potential patterns to determine the cell phases and ordered the cells according to their phases. Of the 5,166 cells, 2,812 cells were grouped into five phases. In contrast, the other 2,354 cells were estimated to be intermediate between G1/S and S, G2/M and M, and M/G1 and M phases. With the ordered datasets, we ran the R package “gplots” and used the “heatmap.2” routine included in this package61.
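The gene filtering and phase scoring for the micro-droplet data can be sketched in Python as shown here. It is a simplified illustration, not the exact procedure of this study or of Macosko et al. It assumes that "correlation to the cell state" means Pearson's r between a gene's log2(PPM+1) values and the mean log2(PPM+1) of all genes in the same phase. The function names and toy values are hypothetical.

```python
from math import log2, sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def filter_phase_genes(ppm, genes, min_r=0.2):
    """Keep genes whose log2(PPM+1) profile correlates (r >= min_r) with the phase mean.

    ppm: {gene: [ppm per cell]}; genes: candidate genes for one phase.
    """
    logged = {g: [log2(v + 1) for v in ppm[g]] for g in genes if g in ppm}
    phase_mean = [mean(col) for col in zip(*logged.values())]
    return [g for g, v in logged.items() if pearson_r(v, phase_mean) >= min_r]


def phase_score(ppm, genes):
    """Average log2(PPM+1) over the retained genes of one phase, per cell."""
    logged = [[log2(v + 1) for v in ppm[g]] for g in genes]
    return [mean(col) for col in zip(*logged)]


# Toy example: genes "a" and "b" rise together across four cells, "c" falls.
toy_ppm = {"a": [0, 1, 3, 7], "b": [0, 3, 7, 15], "c": [7, 3, 1, 0]}
toy_kept = filter_phase_genes(toy_ppm, ["a", "b", "c"])
toy_score = phase_score(toy_ppm, toy_kept)
```

The per-phase scores would then be scaled across cells and compared with the expected phase patterns to order the cells.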

At the top right of Fig. 5A, we used the same datasets as in the heatmap. At the bottom right of Fig. 5A, we used the 2,812 cells that had been grouped into five phases. We did not use the cells estimated to be intermediate between phases. To generate a two-dimensional projection, we reduced the dimensionality of those two datasets by principal component analysis (PCA)62. We plotted the individual cells using the R package “ggplot2”63.

MAPK analysis of PC9 cells

To determine the expression levels of genes included in the MAPK/ERK pathway in Fig. 5B, we mapped the tag counts of each gene in the illustration43. We used the top1 cell, the cell with the largest number of mapped reads per cell, from each platform to color the figures64.

Estimation of missing values

To estimate missing values, we combined gene expression data from the two systems. We used 232 DMSO-treated cells and 210 gefitinib-treated cells from the present study. Genes with an average RPKM >10 across different cells were selected from the micro-chamber datasets. We also used the micro-droplet system datasets as predictors and to construct predictive models. All expression levels were converted to base-10 logarithms; missing values were replaced with a pseudo value of 0.01 before the logarithms were taken. There were 4,901 and 4,845 genes in the micro-chamber system with RPKM >10 for the DMSO- and gefitinib-treated cells, respectively. The expression levels of the genes in the micro-chamber system were encoded as explanatory variables, and the other genes that were not consistently among the explanatory variables were encoded as response variables. LASSO regression was then performed65. The response functions of LASSO were subsequently employed with the micro-droplet system datasets to predict gene expression levels.
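To make the regression step concrete, the following Python sketch implements LASSO by coordinate descent for a single response gene. It is a didactic illustration, not the “glmnet” implementation used in this study. It minimizes (1/2n)·||y − b0 − Xb||² + λ·||b||₁, does not standardize the explanatory variables (glmnet does so by default), and does not reproduce the choice of λ from the glmnet path. The function names are hypothetical, and treating a zero value as missing is an assumption.

```python
from math import log10


def log10_with_pseudo(value, pseudo=0.01):
    """Base-10 log; a missing value (None, or 0 by assumption) becomes the pseudo value first."""
    return log10(value if value else pseudo)


def soft_threshold(z, lam):
    """Shrink z towards zero by lam (the LASSO proximal operator)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent. X: rows = cells, columns = explanatory genes; y: response gene."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]  # centered columns
    resid = [v - y_mean for v in y]  # residual for beta = 0
    beta = [0.0] * p
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue  # constant column carries no information
            # covariance of column j with the partial residual (column j's own term added back)
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_bj
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_mean))
    return intercept, beta


def lasso_predict(intercept, beta, row):
    """Predicted (log-scale) expression of the response gene for one cell."""
    return intercept + sum(b * x for b, x in zip(beta, row))
```

In this setting, a model fitted on micro-chamber cells would be applied to the log-transformed micro-droplet profile of each cell through `lasso_predict`, with one fitted model per response gene.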

To validate the estimation, we used the gene expression levels of the micro-droplet system dataset that were not missing and compared them with the values that had been estimated according to the computational method (Fig. 6A and D). The global correlation coefficients were determined by calculating Pearson’s r between the experimental values and predicted values of all the cells.

All the R programs were executed using R version 3.3.1, and the R package “glmnet” was employed to perform the LASSO regression. The parameter lambda in the LASSO regression was set to the 10th value of the lambda list in the “glmnet” R package, and other parameters were set to their default values66.

Module-based single-cell analysis

We ran the R package “WGCNA” and estimated co-expression network modules. First, we used 66 cells (DMSO-treated and gefitinib-treated PC9 cells)44. We clustered the samples and detected and removed five outlier cells with low expression levels (<5 RPKM) for more than 5,000 genes. We removed genes that were not expressed at more than 5 RPKM in at least one cell. Based on the scRNA-seq data from 61 PC9 cells, we identified 71 modules and listed the genes included in those modules and the ME value of each cell. To evaluate the characteristics of these modules, we also conducted an eigengene network analysis and gene ontology (GO) enrichment analysis, which are included in the WGCNA package. We repeated the same process for the other four cell lines: II-18, H1650, H1975, and H2228. Figures were generated based on the identified modules (Sup. Table S9).

To create Fig. 7A, we used 61 PC9 cells (44 DMSO-treated and 17 gefitinib-treated cells) and the expression levels of genes included in the module “lightsteelblue1”. First, we rearranged the cells in order of their MElightsteelblue1 values and represented the treatment (DMSO or gefitinib) and MElightsteelblue1 value for each cell with a bar plot. We then transformed the expression levels of the genes in the module “lightsteelblue1” to log2(RPKM+0.01) values and drew a heatmap. We used heatmap.2, which is included in the R package “gplots.” In the right margin, we show the expression levels of four genes, the top3 module genes and AURKA, and the MEmagenta value for each cell with a bar plot.

To create Fig. 7C, we used the expression levels of the genes included in the module “magenta.” We projected 9,544 cells based on their PC scores onto a two-dimensional map using t-Distributed Stochastic Neighbor Embedding (t-SNE)67. Cells were clustered into two clusters based on the k-means score and colored by treatment, orange for DMSO and blue for gefitinib.

To create Fig. 8, we gathered data from 429 cells (Sup. Table 5) and applied hierarchical clustering based on the genes included in the modules “II-18-red” (top) and “magenta (PC9 module)” (bottom).

Survival analysis

To analyze the TCGA dataset, we downloaded the RNA-seq v2 data and clinical information for the TCGA lung adenocarcinoma (TCGA-LUAD) dataset from the NCI Genomic Data Commons using TCGA-Assembler v2.0.1 (data downloaded on 2017/03/09)68. We obtained 506 cases with both RNA-seq and clinical data. We downloaded the RNA-seq dataset with TCGA-Assembler using the following options: assayPlatform = “gene.normalized_RNAseq” and cancerType = “LUAD.” We transformed the expression levels to their log2(expression + 1) values. In the present study, we denoted expression as “high” when its level was > average + 0.5 s.d. and “low” when its level was < average − 0.5 s.d. We downloaded the clinical dataset with TCGA-Assembler using the option cancerType = “LUAD.” The data for overall survival for each case were extracted from clinical patient and follow-up files. Kaplan-Meier analysis with the log-rank test was conducted using the survival package in R69.
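The expression grouping and the Kaplan-Meier estimate can be illustrated with a short Python sketch. It is not the R survival package workflow used in this study, and the log-rank test is not implemented. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from math import log2
from statistics import mean, stdev


def classify_expression(values, k=0.5):
    """Label each case "high", "low" or "normal" from log2(expression + 1).

    high: > average + k * s.d.; low: < average - k * s.d. (sample s.d. assumed).
    """
    logged = [log2(v + 1) for v in values]
    mu, sd = mean(logged), stdev(logged)
    labels = []
    for x in logged:
        if x > mu + k * sd:
            labels.append("high")
        elif x < mu - k * sd:
            labels.append("low")
        else:
            labels.append("normal")
    return labels


def kaplan_meier(times, events):
    """Kaplan-Meier estimator.

    times: follow-up time per case; events: 1 = death observed, 0 = censored.
    Returns [(time, survival probability)] at each time with at least one death.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored cases both leave the risk set
    return curve
```

Cases labeled "high" for both AURKA and DUSP1 would form one group and all remaining cases the other, with `kaplan_meier` applied to each group separately.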

Availability of data and material

The sequence data from this study have been submitted to the DNA Data Bank of Japan under accession numbers DRA005922-DRA005929.