Introduction

Recent advances in genomics and microscopy enable the collection of single cell gene expression data (scRNA-seq) across cells from spatial1 and temporal2 coordinates. Understanding how cells aggregate information across spatio-temporal scales, and how gene expression variability in turn reflects this aggregation process, remains challenging. A particular experimental design challenge stems from the fact that existing techniques (e.g., smFISH3, seqFISH4, MERFISH5, ISS6) rely on the pre-selection of a small number of target genes or markers, and are thus incapable of capturing the full transcriptomic information required to characterize subtle differences in cell populations. Selecting the best such markers (marker selection) is statistically and computationally challenging, depending both on the nonlinearity of the data and on the type of differences to be captured.

Marker selection is the product of both prior knowledge and computational analysis of previously collected scRNA-seq data. Computationally, it aims to reduce the dimension of data such as gene expression (from thousands of genes to a few) to enable downstream analyses such as visualization, cell type recovery, identification of gene programs, or gene panel design for interventional studies, akin to principal component analysis (PCA)7 or variational autoencoders (VAE)26,28. MarkerMap builds on differentiable feature selection methods targeted at explainability tasks in machine learning; such methods have primarily been developed with text data in mind, and their performance has hence not been previously evaluated in a comprehensive way in the context of single cell studies. The relationship of MarkerMap to these methods and other previous approaches is discussed in Methods and Tables 1, 2, and 3.

Table 1 Classification performance metrics
Table 2 Full transcriptome reconstruction
Table 3 Random Forest classification performance metrics

MarkerMap is available as well-documented open-source software, along with tutorials and example workflows. The package provides a framework for custom-designed feature selection methods along with metrics for evaluation (Fig. 1).

Fig. 1: Computational pipeline of MarkerMap.

Data are imported as an n × d array of expression counts, together with optional annotations. During preprocessing, some genes are removed, and the rest undergo scaling, normalization, and a log(1+X) transform (Methods). Then MarkerMap or a variety of other marker selection algorithms are run to pick k markers. These markers are used for downstream tasks including benchmarking, UMAP embedding, and data reconstruction. The architecture of MarkerMap is depicted in the lower right. Given input signals, a differentiable sampling process selects a global set of markers. In the supervised setting, when annotations are available, the signal restricted to the selected markers is fed to a neural network that predicts labels. In the unsupervised version, the signal restricted to the selection is fed to a variational autoencoder that aims to reconstruct the original signal with no label information. The joint loss version uses a convex combination of the reconstruction loss and the classification loss. A circle represents a source of random inputs used for differentiable sampling, a technique for iteratively assigning weight to informative features (Methods).

Improving accuracy in supervised scRNA-seq studies

We evaluated the performance of MarkerMap in the context of five publicly available scRNA-seq studies: Zeisel32, a CITE-seq-based data set33, a mouse brain scRNA-seq data set34, the Paul15 stem cell data set35, and the SSv4 V1 data set36 (see Methods for a full description of the data sets and the data processing pipeline).

MarkerMap’s performance is benchmarked against other non-linear approaches which, despite addressing related tasks, have not been previously compared to one another. In detail, we considered the following feature selection baselines (Methods): PERSIST37, LassoNet25, SMaSH16, and Concrete VAE28. We also adapted a continuous relaxation Gumbel-Softmax technique from41,42,43. Classification performance remained stable, within a small margin, when training labels were noised; this can be seen as a consequence of the consistency of certain estimators: 44 shows this to be the case for a nearest neighbor classifier under general conditions. Such a margin is large enough to accommodate realistic expectations of mislabelling error in data sets; we do, however, note that there may be more complex, adversarial, or systematic sources of error for which robustness may not hold. Figure 2 also shows the good performance of a set of random markers when the number of markers is sufficiently large45 and chosen to characterize a single cell type.

Prospects for reconstruction in unsupervised settings

As a generative model, MarkerMap allows the reconstruction of the full transcriptomic input from the selected set of most informative markers. To understand the limits of this recovery, we first quantified the reconstruction quality by comparing distributional properties of the original and reconstructed data sets. Specifically, variances of genes from the reconstructed data were computed and compared to the variances of their counterparts in the original test data of the Mouse Brain data set, following unsupervised MarkerMap training with an 80–20% train-test split. The variances of the reconstructed data were lower than those of the original data (Fig. 3). This is a common phenomenon for generative models obtained with variational autoencoders, known as variance shrinkage46,47. To further visualize this, both test data and reconstructed data were projected onto the first two principal eigenvectors of the test data (Fig. 3).

Fig. 3: Downstream MarkerMap evaluation: visualization and reconstruction.

A UMAP embeddings of the Zeisel data for different values of k markers. B UMAP embeddings of the Mouse Brain data for different values of k markers. For all UMAPs, the parameters n_neighbors = 50, min_dist = 0.1 were used. C In rows 1 and 3, histograms of gene expression variance values from the Mouse Brain data set for the original values and their corresponding reconstructions across cell types. In rows 2 and 4, PCA projections onto the first two eigenvectors of the original data along with their reconstructed counterparts. Additional variance and UMAP embedding figures are presented in Supplementary Fig. 1 (variance plots) and Supplementary Figs. 2, 3, 4 (UMAP embeddings).

We further assessed whether, despite variance differences, the highly variable genes in the original data are recapitulated in the reconstructed one. To this end, two metrics for relative ranking were employed: the Jaccard Index and the Spearman rank correlation coefficient, ρ. Additionally, the average \({\ell }_{2}\) distance between the reconstructed expression profiles and the original expression profiles was computed per cell type (Evaluation Metrics and Methods).

Each of these metrics was computed for both the reconstructed data from MarkerMap and reconstructed data from a related generative model, scVI48. The scVI model learns the parameters of a zero-inflated negative binomial distribution for modeling gene counts from scRNA-seq data48. While both MarkerMap and scVI use a variational autoencoder framework for reconstruction, MarkerMap tries to reconstruct the full gene expression from the input of a small number of discrete markers, while scVI uses the full gene expression as input. In these experiments we used 50 markers for MarkerMap. Compared to scVI, MarkerMap generally scores worse on the variance metrics and better on the \({\ell }_{2}\) distance (Table 4). However, it should be noted that MarkerMap and scVI have slightly different goals, which suggests these results are to be expected. Unsupervised MarkerMap tries to find the best k markers that optimally reconstruct the full data, while the scVI model learns a low dimensional manifold from which data is generated. A direction of future exploration is leveraging the differentiable sampling scheme of MarkerMap and the generative power of scVI to improve MarkerMap’s reconstruction ability while preserving its interpretability.

Table 4 Quality metrics for full transcriptome reconstruction

Discussion

In this work we propose MarkerMap, a data-driven, generative, neural network framework for feature selection. Given scRNA-seq data, we employ differentiable sampling methods to find a global set of genetic markers with competitive performance in downstream classification (of cell type) and reconstruction (of the entire transcriptome of unseen test data). The supervised version selects the markers that maximize label prediction accuracy. The unsupervised version selects markers that maximize the reconstruction accuracy of a variational autoencoder (with no label information). A mixed MarkerMap is also available, combining both label prediction and transcriptome reconstruction. Our experiments suggest that, even though differentiable sampling techniques based on properties of the Gumbel distribution are often suggested for interpretable machine learning tasks, they can underperform. Hence, the mathematically appealing continuous relaxation procedure alone is not enough to explain why MarkerMap is competitive with respect to alternatives. Additional exploration, both experimental and theoretical, is required to understand this empirical result. In this work, we provide a competitive solution to feature selection in a real biological context. Most importantly, we provide a tool where related solutions from different fields can be compared, to aid future research in this area. A promising future application of this tool is the design of probes for spatial transcriptomics studies.

We provide an extensive numerical benchmark of both supervised and unsupervised tools in the context of genetic marker selection on real single cell gene expression data sets. We show that while all methods exhibit better performance as the number of selected markers increases, the methods differ in stability when presented with noisy labels. The baselines considered originate from different research communities and, despite addressing similar tasks, have not been previously compared to one another.

MarkerMap introduces new concepts from explainable machine learning in a transcriptomic-centric setting. We show that MarkerMap is competitive across real data sets, thus offering the potential for optimal combinatorial experimental design with downstream analysis in mind. MarkerMap is available as a pip-installable Python package that is easy to use, robust, and reproducible, making it appropriate for the experimental design of transcriptomic studies, along with the development of new metrics and methodology.

As deep generative models inspired by the growing explainability27,28 and foundation model37 literatures become popular in genomics, we sought to establish benchmarks for exploring both the potential and limitations of such tools, and thus included them in our analysis. Our message is simple: the flexibility of generative models can, in principle, improve both clustering and imputation, despite the need for more computational resources. This is increasingly the case for larger datasets, with a larger number of clusters and richer subclusters. Even if the improvements are small, they could be crucial in cases where rare cell types exist.

However, we saw large variability in the performance of the different generative models considered, even when they share architectural similarities (e.g., PERSIST37 can perform worse than Scanpy subroutines on small datasets). Documenting such behaviors is crucial as the architectures of generative models become more involved. However, this skepticism should not temper the enthusiasm for generative model research: having access to good generative models means the ability to generate counterfactual data and to simulate perturbational scenarios in both spatial and non-spatial settings. While this lies outside the scope of our current paper, we hope to expand this exploration in follow-up work.

Methods

MarkerMap

MarkerMap is a generative method belonging to the class of differentiable sampling techniques for subset selection26,27,28. Existing differentiable sampling techniques aim to find local features that suit each input individually. These methods have been used in, and are most relevant to, language contexts, where the input is usually a variable-length sequence representing text. For example, in an online market setting, we might want to learn which specific words or groups of words in a review are most predictive of the score associated with that review. Instead, MarkerMap seeks a global set of features (markers, when referring to genes), amenable to the structure of scRNA-seq data, which results in optimization differences.

In a nutshell, given high dimensional data points or gene expression profiles \({\{{x}_{i}\}}_{i = 1}^{n}\subset {{\mathbb{R}}}^{d}\), arranged in a matrix \(X\in {{\mathbb{R}}}^{n\times d}\), the feature selection problem aims to find a subset of coordinates (i.e., markers, genes) S ⊂ {1, …, d}, ∣S∣ = K, relevant to a given downstream task (i.e., clustering, visualization, reconstruction). For example, in sparse linear regression, data X is used to predict responses \(Y\in {{\mathbb{R}}}^{n}\) so that Y ≈ Xβ when only a small subset of the columns making up X is relevant for the prediction. Similarly, in non-linear settings, the search is over a joint pair (β, f), where f is a non-linear function so that Y ≈ f(Xβ).

Instead of optimizing for β, differentiable sampling methods assume informative samples are generated from a continuous distribution over a simplex, with K the number of features to be selected26,27,28,29. This is accomplished through a selector layer. In detail, the selector layer contains K nodes, indexed k = 1, …, K. Each node is associated with a d-dimensional real-valued vector γ(k), which governs the probability that a feature will be selected and whose entries j are equal to:

$$\begin{array}{r}{\gamma }_{j}^{(k)}=\frac{\exp \left(\left(\log \left({\pi }_{j}^{(k)}\right)\,+\,{g}_{j}^{(k)}\right)/\tau \right)}{\mathop{\sum }\nolimits_{s = 1}^{d}\exp \left(\left(\log \left({\pi }_{s}^{(k)}\right)\,+\,{g}_{s}^{(k)}\right)/\tau \right)},\end{array}$$
(1)

where \({g}_{j}^{(k)}\) are independent samples from a Gumbel distribution with location 0 and scale 1, τ is positive and real, and π(k) represents the class probabilities of a categorical distribution. Each γ(k) is a vector following a Gumbel-Softmax distribution, independently introduced by29 and26. This distribution takes the form

$${p}_{\pi ,\tau }({\gamma }^{(1)},...{\gamma }^{(K)})=(K-1)!{\tau }^{K-1}{\left(\mathop{\sum }\limits_{i = 1}^{K}\frac{{\pi }^{(i)}}{{\left({\gamma }^{(i)}\right)}^{\tau }}\right)}^{-K}\mathop{\prod }\limits_{i=1}^{K}\left(\frac{{\pi }^{(i)}}{{\left({\gamma }^{(i)}\right)}^{\tau +1}}\right),$$
(2)

and can be visualized over the (K − 1)-dimensional simplex.

The number τ is referred to as the temperature and the values \(\log {\pi }^{(k)}\) are called logits. The logits control how likely a feature or gene j is to be selected as representative feature k out of the total K features we can select. For an input \({x}_{i}={({x}_{ij})}_{j = 1}^{d}\), each node k of the selector layer outputs \({x}_{i}\,*\,{\gamma }^{(k)}\), which essentially masks the genes that are deemed uninformative. As the temperature τ approaches 0, \(Pr({\gamma }_{j}^{(k)}=1)\to {\pi }_{j}^{(k)}/{\sum }_{s}{\pi }_{s}^{(k)}\), and only one feature of xi is selected and matched with a unique selector node k28.
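As a concrete illustration of equation (1), the relaxed sampling step can be written in a few lines of PyTorch. This is a minimal sketch of the technique, not the package’s own code; the function name and tensor shapes are our own choices. PyTorch also ships torch.nn.functional.gumbel_softmax, which implements the same relaxation.

```python
import torch
import torch.nn.functional as F

def relaxed_selection(log_pi: torch.Tensor, K: int, tau: float) -> torch.Tensor:
    """Draw K relaxed selection vectors gamma^(k), as in Eq. (1).

    log_pi: (d,) logits over the d genes; tau: temperature > 0.
    Returns a (K, d) tensor whose rows lie on the probability simplex.
    """
    logits = log_pi.expand(K, -1)                 # one draw per selector node
    u = torch.rand_like(logits).clamp(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))            # Gumbel(0, 1) samples g_j^(k)
    return F.softmax((logits + gumbel) / tau, dim=-1)

# Each selector node k masks the input: x_masked_k = x * gamma[k].
# As tau -> 0, each row of gamma approaches a one-hot vector.
```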

Illustrative Toy Example

Consider a data set \(X={\{{x}_{i}\}}_{i = 1}^{n}\) of gene expression profiles, corresponding to n cells with 10,000 variable genes. In this toy scenario, we only record whether these genes are overexpressed (+1) or underexpressed (−1) with respect to some control population. A natural task would be to compress our data by expressing it in a lower dimension, which is often achieved with a VAE (Architecture). The cells come in two states A and B, depending on how they respond to a particular perturbation, a response which we observe. Assume that whether a cell is in state A or state B only depends on 3 genes in the following way: a cell is in state A if the genes are either all overexpressed (1, 1, 1) or all underexpressed (−1, −1, −1); otherwise it is in state B. None of the genes is individually informative of the clustering: the mean per cluster would be 0 for all genes. Initially, we may not know which of the 10,000 genes are indicative of the cell states, but we would like to obtain a reduced number of genes capable of accurately predicting cell state. For example, without prior information, one might assume all genes are equally good at this task, so our initial ‘weighting’ of the genes would be (\(\frac{1}{10,000},\frac{1}{10,000},\ldots \frac{1}{10,000}\)) (Parameter Initialization), which translates to a continuous probability distribution through equation (1), i.e., they specify the πj. The uniformity corresponds to randomly picking any of the genes as informative features. Then, through the variational optimization algorithm (Optimization), these weights are iterated on until, upon convergence, the genes that are jointly more indicative of cell state have a higher probability weight, while also being sufficient for the reconstruction of the entire gene expression. This will correspond to a vector (\(\frac{1}{3},\frac{1}{3},\frac{1}{3},0,0\ldots 0\)), assuming the first three genes are the most informative.
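A minimal numpy sketch of this toy construction (the cell count and random seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10_000
X = rng.choice([-1, 1], size=(n, d))   # over (+1) / under (-1) expression calls

# A cell is in state A iff the 3 informative genes agree: (1,1,1) or (-1,-1,-1).
y = np.where(np.abs(X[:, :3].sum(axis=1)) == 3, "A", "B")

# No gene is individually informative: per-state means are ~0 for every gene,
# so only a method that scores genes jointly can recover the first three.
print(X[y == "A"][:, :3].mean(axis=0), X[y == "B"][:, :3].mean(axis=0))
```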

Optimization

Letting p(x) be the probability distribution over the d-dimensional data X and given a set of labels Y, MarkerMap learns: a) a subset of markers S of size K, b) a reconstruction function \({f}_{\theta }:{{\mathbb{R}}}^{K}\to {{\mathbb{R}}}^{d}\), and c) a classifier \({f}_{W}:{{\mathbb{R}}}^{K}\to {{{\mathcal{Y}}}}\).

To learn these elements, the following empirical objective is optimized:

$$\begin{array}{r}\mathop{{\mathrm{arg}}\,{\mathrm{min}}}\limits_{S,\theta, W}{{\mathbb{E}}}_{p(x)}[\parallel {f}_{\theta }({x}_{S})-x{\parallel }_{2}+\ell ({f}_{W}({x}_{S}),y(x))],\end{array}$$
(3)

where the first term optimizes signal reconstruction from a subset of markers xS and the second objective minimizes the expected classification risk, both over the unknown distribution p(x) with respect to a loss function \(\ell\). In practice, we consider the alternative empirical objective

$$\begin{array}{r}\mathop{{\mathrm{arg}}\,{\mathrm{min}}}\limits_{S,\theta, W}\,\alpha \parallel {f}_{\theta }({X}_{S})-X{\parallel }_{2}+(1-\alpha )\,\ell ({f}_{W}({X}_{S}),Y),\end{array}$$
(4)

where α ∈ [0, 1] balances the reconstruction loss against the classification loss. MarkerMap considers three separate objectives: a supervised objective with α = 0, an unsupervised objective with α = 1, and a joint objective with α = 0.5. More generally, α can be treated as a tunable (but fixed) hyperparameter that weighs the reconstruction and classification terms in the optimization objective. Because full reconstruction is nominally the harder task (one can achieve low classification error without information about the entire gene expression), it acts as a bottleneck: even when α is small, the convergence of MarkerMap depends on the quality of the reconstruction. Depending on the user-specified goal, the three proposed values of α provide either a classifier (α = 0), which may select a smaller number of genes with good performance; a generative model (α = 1), which is capable of signal reconstruction, possibly at the cost of additional markers; or both (α = 0.5). One may also choose a different, data- or problem-specific value of α.
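The three regimes can be expressed as a single convex combination of the two loss terms. A minimal sketch, assuming a cross-entropy classification loss \(\ell\) and a mean per-cell \({\ell }_{2}\) reconstruction error (the losses used by the package may differ in detail):

```python
import torch
import torch.nn.functional as F

def joint_loss(x, x_hat, class_logits, y, alpha: float) -> torch.Tensor:
    """Eq. (4): alpha=1 unsupervised, alpha=0 supervised, alpha=0.5 joint."""
    recon = torch.linalg.vector_norm(x_hat - x, ord=2, dim=-1).mean()
    clf = F.cross_entropy(class_logits, y)
    return alpha * recon + (1 - alpha) * clf
```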

Optimizing this objective is difficult due to the combinatorial search over the subset S. We address this challenge heuristically by expanding on continuous sampling techniques27 in a batch learning setting49. In a nutshell, b = 1, 2, …, B batches are sampled without replacement from the data set (X, Y). The selected features are then computed and aggregated across batches as follows (a code sketch of steps 1 and 2 follows the list):

  1. Instance-wise logits \(\log {\pi }_{i}^{b}={f}_{\pi }({x}_{i})\) are generated for each xi in batch b, where fπ is a neural network. Averaging them yields an intermediate average batch logit \(\log {\pi }^{b}\).

  2. The average batch logits are computed by aggregating information from the current and previous batches, \(\log {\pi }^{b}\leftarrow \beta \log {\pi }^{b-1}+(1-\beta )\log {\pi }^{b}\), β ∈ (0, 1), much like the update for the mean moment in BatchNorm49.

  3. The K continuous d-dimensional one-hot-encoded vectors \({\gamma }^{(k),b}={({\gamma }_{j}^{(k),b})}_{j = 1}^{d}\) are generated from \(\log {\pi }^{b}\) via continuous relaxation, see (1).

  4. Each γ(k),b selects one of the K features by element-wise multiplication \({X}_{S}^{b}={X}^{b}\boxtimes {\gamma }^{b}\).

  5. The resulting \({X}_{S}^{b}\) then becomes the input to a variational-autoencoder-like architecture, which includes a classifier loss as well as a reconstruction loss (Fig. 1 and Eq. (4)).

  6. All network weights are updated through stochastic gradient descent steps, following the optimization of the appropriate loss in Eq. (4) until convergence. The steps are repeated for B timesteps, corresponding to the number of batches.
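Steps 1 and 2 amount to an exponential moving average of per-batch logits. A schematic version, where f_pi stands for the logit network fπ and beta for the momentum-like hyperparameter β (names are ours):

```python
import torch

def aggregate_global_logits(f_pi, batches, beta: float = 0.9) -> torch.Tensor:
    """Running average of batch logits (steps 1-2), akin to BatchNorm's
    running mean. During training, gradients additionally flow through f_pi."""
    log_pi = None
    for xb in batches:                        # b = 1, ..., B
        batch_log_pi = f_pi(xb).mean(dim=0)   # average instance-wise logits
        log_pi = batch_log_pi if log_pi is None else (
            beta * log_pi + (1 - beta) * batch_log_pi)
    return log_pi                             # global (d,) logits over genes
```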

Architecture

The three main components of MarkerMap’s architecture are the neural network fπ for instance-wise logit generation, the task-specific feed-forward network fW for classification, and the variational autoencoder fθ for encoding and reconstruction. The neural network fπ is an encoder with two hidden layers and a sampling layer performing relaxed subset sampling.

Scalability

Training MarkerMap on the 4581 genes and 39,583 cells of the Mouse Brain data set (the largest data set considered) on public cloud GPUs resulted in a training time of 5 minutes for supervised classification tasks and 15 minutes for unsupervised tasks. LassoNet performed similarly when the architecture (number of hidden layers and units) and batch sizes were chosen to be similar to those of MarkerMap. RankCorr and SMaSH achieved smaller training times, under a minute, but require supervised signals. The differential expression tests in Scanpy and COSG are quick but also require supervised signals. PERSIST benefits somewhat from taking a two-step approach to learning markers, but the initial step makes the method take longer.

Benchmarks

We contrast MarkerMap against several subset selection methods. The methods have been introduced in different communities and many have not been previously compared to one another.

  • LassoNet: A residual feed-forward network that makes use of an \({\ell }_{1}\) penalty on network weights in order to induce sparsity in selected features25.

  • Concrete VAE: a traditional VAE architecture that assumes a discrete distribution on latent parameters and performs inference using the formulation of the concrete distribution (also known as Gumbel-Softmax distribution)26.

  • Global-Gumbel VAE: adapted from27. A VAE architecture related to the Concrete VAE.

  • SMaSH Random Forest: A classical Random Forest classification algorithm implemented in the SMaSHpy library (see https://pypi.org/project/smashpy)16.

  • RankCorr: A non-parametric marker selection method using (statistical) rank correlation, implemented in the RankCorr library (see https://github.com/ahsv/RankCorr)15.

  • PERSIST: An autoencoder model similar to MarkerMap that finds markers in an unsupervised fashion, or uses cell labels for supervised learning. PERSIST uses specific loss functions geared towards scRNA-seq data and a two step process to find the most relevant markers. (see https://github.com/iancovert/persist/)37

  • COSG: A differential expression test based on cosine similarity of expression of different genes. (see https://github.com/genecell/COSG)38.

  • Scanpy: The ‘rank_genes_groups’ function of this package performs differential expression tests based on the cell groups with a number of different statistical methods. We tested the t-test, the t-test with overestimated variance, the Wilcoxon rank-sum test, and the Wilcoxon rank-sum test with tie correction (see https://scanpy.readthedocs.io/en/stable/index.html)39.

  • Scanpy Highly Variable Genes: The ‘highly_variable_genes’ function of this package39 implements an unsupervised method from Seurat52 to select genes that are highly variable.

The differential expression tests are one-vs-all methods. For these, we took one marker from each cell type, removing duplicates, until a further full round would put us over our budget k. We then took the markers with the highest score (COSG) or lowest p-value (Scanpy) until we had k markers, as sketched below.
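A schematic of this budgeting procedure; the function and argument names are ours, with per_type_ranked mapping each cell type to its markers sorted best-first and global_ranked holding all candidates sorted by score or p-value:

```python
def pick_k_markers(per_type_ranked: dict, global_ranked: list, k: int) -> list:
    """One marker per cell type per round (skipping duplicates) while a full
    round still fits in the budget, then top up with the best overall."""
    chosen, depth = [], 0
    max_depth = max(len(genes) for genes in per_type_ranked.values())
    while depth < max_depth and len(chosen) + len(per_type_ranked) <= k:
        for genes in per_type_ranked.values():
            if depth < len(genes) and genes[depth] not in chosen:
                chosen.append(genes[depth])
        depth += 1
    for gene in global_ranked:                 # fill the remaining budget
        if len(chosen) == k:
            break
        if gene not in chosen:
            chosen.append(gene)
    return chosen
```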

Data sets

We used publicly available real world data sets from established single cell analysis pipelines, where the problem of marker selection is of interest in the context of explaining cluster assignment. In each data set, the labels correspond to cell types.

Zeisel data set

The Zeisel data set contains data from 3005 cells and 4000 genes32. The cells were collected from the mouse somatosensory cortex (S1) and hippocampal CA1 region. The labels correspond to 7 major cell types and were obtained through biclustering of the full gene expression data set. For the Zeisel (subtypes) data set, we used the more specific 47 cell types. We removed cells whose specific cell types were unknown, leaving 2816 cells.

CITE-seq data set

Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) is a single cell method that allows joint readouts from gene expression and proteins. The CITE-seq data set contains data from 8617 cells and 500 genes33. These cells correspond to major cord blood cells across 13 cell types, obtained from the clustering of combined gene expression and protein read-out data, and not from the clustering of the original single cell data set alone.

Paul data set

The Paul data set35 consists of 2730 mouse bone marrow cells, collected with the MARS-seq protocol. Post processing, each cell contains 3451 genes. The Paul data set contains progenitor cells that are differentiating, hence the data appear to follow a continuous trajectory. The associated outputs represent 10 discrete cell types sampled along these trajectories; hence, the cell types are not well separated35. After removing general genes and housekeeping genes, we are left with 3074 genes. For this data set we do not further remove genes based on cell type because the data set is already small.

Mouse brain data set

This data set is a spatial transcriptomic data set, containing data from 40,572 cells and 31,053 genes from diverse neuronal and glial cell types across stereotyped anatomical regions in the mouse brain34. The output labels correspond to the major cell types identified by the authors. Prior to the pre-processing described below, we perform additional gene and cell filtering because training with the full data set was not feasible for the unsupervised model on public cloud infrastructure. We start by removing cells with unknown cell types. Then we keep only those genes that satisfy the following two conditions: (1) they are present in at least 0.05% of cells and (2) they are present in 3% of cells or the average gene expression level in cells where the gene is present is greater than 1.12. These particular values are somewhat arbitrary and could be changed based on the researcher’s desires. After this filtering we are left with 39,583 cells and 12,869 genes; further pre-processing described below will reduce the number of genes to 4581. In the Mouse Brain data set we use the 9 major cell types after removing those that are unknown. In the Mouse Brain (subtypes) data set we use all 59 specific cell types, which includes two different unknown categories. When leaving these two unknown categories in, we are working with 40,532 cells and 7115 genes after the pre-processing described below.

SSv4 V1 data set

The SSv4 data set36 consists of cells collected from the mouse primary visual cortex (V1). This publicly available data set includes initial pre-processing done by PERSIST37, which reduces the data set to 13,349 cells and 10,000 genes with 98 cell types. We removed 6 cell types that each had fewer than 4 cells because many of our supervised methods require multiple representatives per class. After the further pre-processing described below, the resulting data set consists of 13,342 cells and 4293 genes.

Data processing

The data were processed and filtered following16,33. In particular, we first remove genes associated with general cell function as well as housekeeping genes. Next, we remove genes which are present in less than 30% of cells for every cell type. We also remove genes which are present in over 75% of cells for at least 50% of the cell types. Lastly, we normalize the gene counts per cell so that each cell has the same total gene expression, we perform a \({\log }_{2}(1+x)\) transform of the cell counts, and we center and scale the data so that each gene has mean 0 and variance 1. When evaluating the generative data, we forgo normalizing gene counts across cells and the per-gene centering and scaling. Instead, we only perform the log2(1 + X) transform and then set the mean and variance of the entire data matrix X to 0 and 1, respectively.
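The final normalization steps map naturally onto standard Scanpy calls. A minimal sketch under the stated conventions, assuming a recent Scanpy version where log1p accepts a base argument; the input path is hypothetical, and the gene filtering steps described above would be applied first:

```python
import scanpy as sc

adata = sc.read_h5ad("expression.h5ad")  # hypothetical input, cells x genes

sc.pp.normalize_total(adata)   # equal total gene expression per cell
sc.pp.log1p(adata, base=2)     # log2(1 + x) transform of the counts
sc.pp.scale(adata)             # per-gene mean 0, variance 1
```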

Evaluation metrics

Given K, most of the methods select the top K features informative of the ground-truth labels. The exceptions, RankCorr and LassoNet, do not allow the selection of an exact number of features, as they rely on a regularization parameter that controls feature sparsity. In those cases, we selected K features by grid searching for the regularizer value that yields the desired number of features, as in the sketch below.
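A generic version of this grid search, where fit_select stands in for a LassoNet or RankCorr fit at a given regularization level (names are ours):

```python
def grid_search_k(fit_select, grid, k):
    """Return the selection (over a grid of regularizer values) whose size
    is closest to the target k, stopping early on an exact match."""
    best, best_gap = None, float("inf")
    for lam in grid:
        selected = fit_select(lam)       # set of selected feature indices
        gap = abs(len(selected) - k)
        if gap < best_gap:
            best, best_gap = selected, gap
        if gap == 0:
            break
    return best
```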

For each baseline and data set, the selected features were then used as the only input to either a nearest neighbors classifier or a random forest classifier. For each data set, method, and classifier type, we report two quantities, the misclassification rate and the average F1 score, along with their corresponding confusion matrices. These quantities are defined as follows, for ground truth clusters c = 1, 2, …, C; a code sketch computing both follows the list.

  • Average misclassification rate. The misclassification rate of a given cluster is defined as

    $$\begin{array}{r}{M}_{c}=1-\frac{T{P}_{c}}{T{P}_{c}\,+\,F{P}_{c}},\end{array}$$
    (5)

    where TPc and FPc correspond to the number of true positive and false positive predictions, respectively. We report the average misclassification \(\frac{1}{C}{\sum }_{c}\,{M}_{c}\).

  • Average F1 score. Per cluster, the F1 score is defined as

    $$\begin{array}{r}{F}_{c}=\frac{2{P}_{c}{R}_{c}}{{P}_{c}\,+\,{R}_{c}},\end{array}$$
    (6)

    where Pc and Rc are the precision and recall of the classifier for a cluster c. We report the average F1 score \(\frac{1}{C}{\sum }_{c}\,{F}_{c}\).
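Both averages can be read off the confusion matrix. A minimal sketch (for brevity, it does not guard against clusters with zero true or predicted members, which would divide by zero):

```python
import numpy as np

def average_metrics(conf: np.ndarray) -> tuple[float, float]:
    """conf[i, j]: number of cells of true cluster i predicted as cluster j.
    Returns (average misclassification, average F1), Eqs. (5)-(6)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp     # predicted as c but truly another cluster
    fn = conf.sum(axis=1) - tp     # truly c but predicted as another cluster
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    m = 1 - precision                                    # M_c, Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)   # F_c, Eq. (6)
    return m.mean(), f1.mean()
```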

When evaluating the reconstructed data, we use the Jaccard Index, the Spearman correlation coefficient ρ, the \({\ell }_{2}\) distance, and the \({\ell }_{1}\) distance; a combined code sketch follows the list. Let \(X\in {{\mathbb{R}}}^{n\times d}\) be our data as before, and let \(\tilde{X}\in {{\mathbb{R}}}^{n\times d}\) be the reconstructed data.

  • Jaccard Index. First we calculate the variances of each gene in the original data. Since each gene is a column of X, the variance of those columns is a d-length vector which we will denote \({\sigma }_{X}^{2}\). Next we find the rank vector of the variances, \(R({\sigma }_{X}^{2})\), where the largest variance is assigned 1, the second largest is assigned 2, and so on until the smallest variance is assigned d. We use the ranks to find the indices of the largest 20% of the variances:

    $${I}_{X}=\left\{i:R\left({\sigma }_{X}^{2}\right)[i]\le \frac{d}{5}\right\}$$
    (7)

    We follow the same process for the reconstructed data to get the set of indices \({I}_{\tilde{X}}\). Finally, we calculate the Jaccard Index on these two sets of indices to determine their similarity53:

    $$J=\frac{\left\vert {I}_{X}\cap {I}_{\tilde{X}}\right\vert }{\left\vert {I}_{X}\cup {I}_{\tilde{X}}\right\vert }$$
    (8)

    The Jaccard Index ranges from 0 to 1, and higher values indicate that more of the highly variable genes from the original data are also highly variable in the reconstructed data.

  • Spearman correlation coefficient. The Spearman correlation coefficient is exactly the Pearson correlation coefficient calculated on the ranks of a vector’s values, rather than the raw values. Thus, we first calculate the rank vectors of the gene variances as we did for the Jaccard Index, \(R({\sigma }_{X}^{2})\) and \(R({\sigma }_{\tilde{X}}^{2})\). Finally we calculate the correlation coefficient:

    $$\rho =\frac{\,{{\mbox{cov}}}\,\left(R\left({\sigma }_{X}^{2}\right),R\left({\sigma }_{\tilde{X}}^{2}\right)\right)}{{\sigma }_{R\left({\sigma }_{X}^{2}\right)}{\sigma }_{R\left({\sigma }_{\tilde{X}}^{2}\right)}}$$
    (9)

    where \({\sigma }_{R({\sigma }_{X}^{2})}\) and \({\sigma }_{R({\sigma }_{\tilde{X}}^{2})}\) are the standard deviations of the ranks of the original data and the reconstructed data respectively. This ρ is the Spearman correlation coefficient—values closer to one indicate higher similarity of the ranks of the gene variances.

  • \({\ell }_{2}\) Distance. To calculate the \({\ell }_{2}\) distance, we take the average over all cells of the \({\ell }_{2}\) distance between the original cell and the reconstructed cell:

    $$\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\parallel {x}_{i}-{\tilde{x}}_{i}{\parallel }_{2}$$
    (10)

    where xi is the ith row of X. Lower values indicate that the original data and reconstructed data are more similar.

  • \({\ell }_{1}\) Distance. To calculate the \({\ell }_{1}\) distance, we take the average over all cells of the \({\ell }_{1}\) distance between the original cell and the reconstructed cell:

    $$\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\parallel {x}_{i}-{\tilde{x}}_{i}{\parallel }_{1}$$
    (11)

    where xi is the ith row of X. Lower values indicate that the original data and reconstructed data are more similar.
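All four reconstruction metrics in one sketch; the variance ranks are computed with np.argsort, and the Spearman coefficient is delegated to SciPy:

```python
import numpy as np
from scipy.stats import spearmanr

def reconstruction_metrics(X: np.ndarray, X_rec: np.ndarray):
    """Jaccard index of the top-20% most variable genes (Eqs. 7-8),
    Spearman rho of the variance ranks (Eq. 9), and mean per-cell
    l2 / l1 distances (Eqs. 10-11)."""
    var_x, var_r = X.var(axis=0), X_rec.var(axis=0)
    top = X.shape[1] // 5                          # largest 20% of variances
    idx_x = set(np.argsort(var_x)[::-1][:top])
    idx_r = set(np.argsort(var_r)[::-1][:top])
    jaccard = len(idx_x & idx_r) / len(idx_x | idx_r)
    rho = spearmanr(var_x, var_r).correlation      # rank correlation
    l2 = np.linalg.norm(X - X_rec, ord=2, axis=1).mean()
    l1 = np.linalg.norm(X - X_rec, ord=1, axis=1).mean()
    return jaccard, rho, l2, l1
```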

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.