Abstract
Dimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When jointly visualising multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose dataset-specific clusters. To circumvent these batch effects, we propose an embedding procedure that uses a t-SNE visualization constructed on a reference data set as a scaffold for embedding new data points. Each data instance from a new, unseen, secondary data is embedded independently and does not change the reference embedding. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach by analyzing six recently published single-cell gene expression data sets with up to tens of thousands of cells and thousands of genes. The batch effects in our studies are particularly strong as the data comes from different institutions using different experimental protocols. The visualizations constructed by our proposed approach are clear of batch effects, and the cells from secondary data sets correctly co-cluster with cells of the same type from the primary data. We also show the predictive power of our simple, visual classification approach in t-SNE space matches the accuracy of specialized machine learning techniques that consider the entire compendium of features that profile single cells.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Two-dimensional embeddings and their visualizations may assist in the analysis and interpretation of high-dimensional data. Intuitively, two data instances should be co-located in the resulting visualization if their multi-dimensional profiles are similar. For this task, non-linear embedding techniques such as t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten & Hinton, 2008) or uniform manifold approximation and projection (McInnes & Healy, 2011), covariate shift (Bickel et al., 2009) or data set shift (Quionero-Candela et al., 2009). In bioinformatics, the domain-specific differences are more commonly referred to as batch effects (Butler et al., 2018; Haghverdi et al., 2018; Stuart et al., 2019).
Massive, multi-variate biological data sets often suffer from these source-specific biases. The focus of this work is single-cell genomics, a domain that was selected due to high biomedical relevance and abundance of recently published data. Single-cell RNA sequencing (scRNA-seq) data sets are the result of isolating RNA molecules from individual cells, which serve as an estimate of the expression of cell’s genes. The studies can exceed thousands of cells and tens of thousands of genes, and typically start with cell type analysis. Here, it is expected that cells of the same type would cluster together in two-dimensional data visualization (Wolf et al., 2018). For instance, Fig. 1a shows t-SNE embedded data from mouse brain cells originating from the visual cortex (Hrvatin et al., 2018) and the hypothalamus (Chen et al., Full size image
The proposed solution implements a map** of new data into an existing t-SNE visualization. While the utility of such an algorithm was already hinted at in recent publication (Kobak & Berens, 2019), we here provide its practical and theoretically-grounded implementation. Considering the abundance of recent publications on batch effect removal, we present surprising evidence that a computationally more direct and principled embedding procedure solves the batch effects problem when constructing interpretable visualizations from different data sources.
Our contributions are twofold:
-
1.
We introduce a theoretically-grounded extension of the t-SNE visualization algorithm that supports embedding new data points into existing reference visualizations. Our extension is readily incorporated into existing approximation schemes, enabling its applications to large data sets. We show that optimization using the default t-SNE parameters is highly unstable and proposes parameter values leading to stable convergence.
-
2.
We show that the proposed t-SNE extensions can mitigate batch effects in the data sets and demonstrate this feature in treating single-cell gene expression data.