Introduction

Recent advances in genomics and microscopy enable the collection of single cell gene expression data (scRNA-seq) across cells from spatial1 and temporal2 coordinates. Understanding how cells aggregate information across spatio-temporal scales, and how gene expression variability in turn reflects this aggregation process, remains challenging. A particular experimental design challenge stems from the fact that existing techniques (e.g., smFISH3, seqFISH4, MERFISH5, ISS6) rely on the pre-selection of a small number of target genes or markers, and are thus incapable of capturing the full transcriptomic information required to characterize subtle differences in cell populations. Selecting the best such markers (marker selection) is statistically and computationally challenging, depending both on the nonlinearity of the data and on the type of differences to be captured.

Marker selection is the product of both prior knowledge and computational analysis of previously collected scRNA-seq data. Computationally, it aims to reduce the dimension of data such as gene expression (from thousands of genes to a few) to enable downstream analyses such as visualization, cell type recovery, identification of gene programs, or gene panel design for interventional studies, akin to principal component analysis (PCA)7 or variational autoencoders (VAE)26,28. MarkerMap builds on differentiable feature selection methods targeted at explainability tasks in machine learning; such methods have primarily been developed with text data in mind, and their performance has hence not been previously evaluated in a comprehensive way in the context of single cell studies. The relationship of MarkerMap to these methods and other previous approaches is discussed in Methods and Tables 1, 2, and 3.

Table 1 Classification performance metrics
Table 2 Full transcriptome reconstruction
Table 3 Random Forest classification performance metrics

MarkerMap is available as well-documented open-source software, along with tutorials and example workflows. The package provides a framework for custom-designed feature selection methods along with metrics for evaluation (Fig. 1).

Fig. 1: Computational pipeline of MarkerMap.

Data are imported as an n × d array of expression counts, together with optional annotations. During preprocessing, some genes are removed, and the rest undergo scaling, normalization, and a log(1+X) transform (Methods). Then MarkerMap or a variety of other marker selection algorithms are run to pick k markers. These markers are used for downstream tasks including benchmarking, UMAP embedding, and data reconstruction. The architecture of MarkerMap is depicted in the lower right. Given input signals, a differentiable sampling process selects a global set of markers. In the supervised setting, when annotations are available, the signal restricted to the selected markers is fed to a neural network that predicts labels. In the unsupervised version, the signal restricted to the selection is fed to a variational autoencoder that aims to reconstruct the original signal with no label information. The joint loss version uses a convex combination of the reconstruction loss and the classification loss. A circle represents a source of random inputs used for differentiable sampling, a technique for iteratively assigning weight to informative features (Methods).

Improving accuracy in supervised scRNA-seq studies

We evaluated the performance of MarkerMap in the context of five publicly available scRNA-seq studies: Zeisel32, a CITE-seq-based data set33, a mouse brain scRNA-seq data set34, the Paul15 stem cell data set35, and the SSv4 V1 data set36 (see Methods for a full description of the data sets and the data processing pipeline).

MarkerMap’s performance is benchmarked against other non-linear approaches which, despite addressing related tasks, have not been previously compared to one another. In detail, we considered the following feature selection baselines (Methods): PERSIST37, LassoNet25, SMaSH16, and Concrete VAE28. We also adapted a continuous relaxation Gumbel-Softmax technique from41,42,43. Classification performance remained stable, within a small margin, when training labels were noised; this can be seen as a consequence of the consistency of certain estimators: 44 shows this to be the case for a nearest neighbor classifier under general conditions. Such a margin is large enough to accommodate realistic expectations of mislabelling error in data sets; we do, however, note that there may be more complex, adversarial, or systematic sources of error for which robustness may not hold. Figure 2 also shows the good performance of a set of random markers when the number of markers is sufficiently large45 and chosen to characterize a single cell type.

Prospects for reconstruction in unsupervised settings

As a generative model, MarkerMap allows the reconstruction of the full transcriptomic input from the selected set of most informative markers. To understand the limits of this recovery, we first quantified the reconstruction quality by comparing distributional properties of the original and reconstructed data sets. Specifically, variances of genes from the reconstructed data were computed and compared to the variances of their counterparts in the original test data of the Mouse Brain data set, following unsupervised MarkerMap training with an 80–20% train-test split. The variances of the reconstructed data were lower than those of the original data (Fig. 3). This is a common phenomenon for generative models obtained with variational autoencoders, known as variance shrinkage46,47. To further visualize this, both test data and reconstructed data were projected onto the first two principal eigenvectors of the test data (Fig. 3).

Fig. 3: Downstream MarkerMap evaluation: visualization and reconstruction.

A UMAP embeddings of the Zeisel data for different values of k markers. B UMAP embeddings of the Mouse Brain data for different values of k markers. For all UMAPs, the parameters n_neighbors = 50, min_dist = 0.1 were used. C In rows 1 and 3, histograms of gene expression variance values from the Mouse Brain data set for the original values and their corresponding reconstructions across cell types. In rows 2 and 4, PCA projections onto the first two eigenvectors of the original data along with their reconstructed counterparts. Additional variance and UMAP embedding figures are presented in Supplementary Fig. 1 (variance plots) and Supplementary Figs. 2, 3, 4 (UMAP embeddings).

We further assessed whether, despite variance differences, the highly variable genes in the original data are recapitulated in the reconstructed one. To this end, two metrics for relative ranking were employed: the Jaccard Index and the Spearman rank correlation coefficient, ρ. Additionally, the average \({\ell }_{2}\) distance between the reconstructed expression profiles and the original expression profiles was computed per cell type (Evaluation Metrics and Methods).

Each of these metrics was computed for both the reconstructed data from MarkerMap and reconstructed data from a related generative model, scVI48. The scVI model learns the parameters of a zero-inflated negative binomial distribution for modeling gene counts from scRNA-seq data48. While both MarkerMap and scVI use a variational autoencoder framework for reconstruction, MarkerMap tries to reconstruct the full gene expression from the input of a small number of discrete markers, while scVI uses the full gene expression as input. In these experiments we used 50 markers for MarkerMap. Compared to scVI, MarkerMap generally scores worse on the variance metrics and better on the \({\ell }_{2}\) distance (Table 4). However, it should be noted that MarkerMap and scVI have slightly different goals, which suggests these results are to be expected. Unsupervised MarkerMap tries to find the best k markers that optimally reconstruct the full data, while the scVI model learns a low dimensional manifold from which data is generated. A direction of future exploration is leveraging the differentiable sampling scheme of MarkerMap and the generative power of scVI to improve MarkerMap’s reconstruction ability while preserving its interpretability.

Table 4 Quality metrics for full transcriptome reconstruction

Discussion

In this work we propose MarkerMap, a data-driven, generative, neural network framework for feature selection. Given scRNA-seq data, we employ differentiable sampling methods to find a global set of genetic markers with competitive performance in downstream classification (of cell type) and reconstruction (of the entire transcriptome of unseen test data). The supervised version selects the markers that maximize label prediction accuracy. The unsupervised version selects markers that maximize the reconstruction accuracy of a variational autoencoder (with no label information). A mixed MarkerMap is also available, combining both label prediction and transcriptome reconstruction. Our experiments suggest that, even though differentiable sampling techniques based on properties of the Gumbel distribution are often suggested for interpretable machine learning tasks, they can underperform. Hence, the mathematically appealing continuous relaxation procedure alone is not enough to explain why MarkerMap is competitive with respect to alternatives. Additional exploration, both experimental and theoretical, is required to understand this empirical result. In this work, we provide a competitive solution to feature selection in a real biological context. Most importantly, we provide a tool where related solutions from different fields can be compared, to aid future research in this area. A promising future application of this tool is the design of probes for spatial transcriptomics studies.

We provide an extensive numerical benchmark of both supervised and unsupervised tools in the context of genetic marker selection on real single cell gene expression data sets. We show that while all methods exhibit better performance as the number of selected markers increases, the methods differ in stability when presented with noisy labels. The baselines considered originate from different research communities and, despite addressing similar tasks, have not been previously compared to one another.

MarkerMap introduces new concepts from explainable machine learning in a transcriptomic-centric setting. We show that MarkerMap is competitive across real data sets, thus offering the potential for optimal combinatorial experimental design with downstream analysis in mind. MarkerMap is available as a pip-installable Python package that is easy to use, robust, and reproducible, making it appropriate for the experimental design of transcriptomic studies, along with the development of new metrics and methodology.

As deep generative models inspired by the growing explainability27,28 and foundation model37 literatures become popular in genomics, we sought to establish benchmarks for exploring both the potential and limitations of such tools, and thus included them in our analysis. Our message is simple: the flexibility of generative models can, in principle, improve both clustering and imputation, despite the need for more computational resources. This is increasingly the case for larger datasets, with a larger number of clusters and richer subclusters. Even if the improvements are small, they could be crucial in cases where rare cell types exist.

However, we saw large variability in the performance of the different generative models considered, even when they share architectural similarities (e.g., PERSIST37 can perform worse than Scanpy subroutines on small datasets). Documenting such behaviors is crucial as the architectures of generative models become more involved. However, this skepticism should not temper the enthusiasm for generative model research: having access to good generative models means the ability to generate counterfactual data and to simulate perturbational scenarios in both spatial and non-spatial settings. While this lies outside the scope of our current paper, we hope to expand this exploration in follow-up work.

Methods

MarkerMap

MarkerMap is a generative method belonging to the class of differentiable sampling techniques for subset selection26,27,28. Existing differentiable sampling techniques aim to find local features that suit each input individually. These methods have been used in, and are most relevant to, language contexts, where the input is usually a variable-length sequence representing text. For example, in an online market setting, we might want to learn which specific words or groups of words in a review are most predictive of the score associated with that review. Instead, MarkerMap seeks a global set of features (markers, when referring to genes), amenable to the structure of scRNA-seq data, which results in optimization differences.

In a nutshell, given high dimensional data points or gene expression profiles \({\{{x}_{i}\}}_{i = 1}^{n}\subset {{\mathbb{R}}}^{d}\), arranged in a matrix \(X\in {{\mathbb{R}}}^{n\times d}\), the feature selection problem aims to find a subset of coordinates (i.e., markers, genes) S ⊂ {1, …, d}, ∣S∣ = K, relevant to a given downstream task (i.e., clustering, visualization, reconstruction). For example, in sparse linear regression, data X is used to predict responses \(Y\in {{\mathbb{R}}}^{n}\) so that Y ≈ Xβ when only a small subset of the columns making up X is relevant for the prediction. Similarly, in non-linear settings, the search is over a joint pair (β, f), where f is a non-linear function so that Y ≈ f(Xβ).

Instead of optimizing for β, differentiable sampling methods assume informative samples are generated from a continuous distribution over a simplex, with K the number of features to be selected26,27,28,29. This is accomplished through a selector layer. In detail, the selector layer contains K nodes, indexed k = 1, …, K. Each node is associated with a d-dimensional real-valued vector γ(k), which governs the probability that a feature will be selected and whose entries j are equal to:

$$\begin{array}{r}{\gamma }_{j}^{(k)}=\frac{\exp \left(\left(\log \left({\pi }_{j}^{(k)}\right)\,+\,{g}_{j}^{(k)}\right)/\tau \right)}{\mathop{\sum }\nolimits_{s = 1}^{d}\exp \left(\left(\log \left({\pi }_{s}^{(k)}\right)\,+\,{g}_{s}^{(k)}\right)/\tau \right)},\end{array}$$
(1)

where \({g}_{j}^{(k)}\) are independent samples from a Gumbel distribution with location 0 and scale 1, τ is positive and real, and π(k) represents the class probabilities of a categorical distribution. Each γ(k) is a vector following a Gumbel-Softmax distribution, independently introduced by29 and26. This distribution takes the form

$${p}_{\pi ,\tau }({\gamma }^{(1)},...{\gamma }^{(K)})=(K-1)!{\tau }^{K-1}{\left(\mathop{\sum }\limits_{i = 1}^{K}\frac{{\pi }^{(i)}}{{\left({\gamma }^{(i)}\right)}^{\tau }}\right)}^{-K}\mathop{\prod }\limits_{i=1}^{K}\left(\frac{{\pi }^{(i)}}{{\left({\gamma }^{(i)}\right)}^{\tau +1}}\right),$$
(2)

and can be visualized over the (K − 1)-dimensional simplex.

The number τ is referred to as the temperature and the values \(\log {\pi }^{(k)}\) are called logits. The logits control how likely a feature or gene j is to be selected as representative feature k out of the total K features we can select. For an input \({x}_{i}={({x}_{ij})}_{j = 1}^{d}\), each node k of the selector layer outputs \({x}_{i}\,*\,{\gamma }^{(k)}\), which essentially masks the genes that are deemed uninformative. As the temperature τ approaches 0, \(Pr({\gamma }_{j}^{(k)}=1)\to {\pi }_{j}^{(k)}/{\sum }_{s}{\pi }_{s}^{(k)}\), and only one feature of xi is selected and matched with a unique selector node k28.
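As a concrete illustration of equation (1), the relaxed sampling step can be written in a few lines of PyTorch. This is a minimal sketch of the technique, not the package’s own code; the function name and tensor shapes are our own choices. PyTorch also ships torch.nn.functional.gumbel_softmax, which implements the same relaxation.

```python
import torch
import torch.nn.functional as F

def relaxed_selection(log_pi: torch.Tensor, K: int, tau: float) -> torch.Tensor:
    """Draw K relaxed selection vectors gamma^(k), as in Eq. (1).

    log_pi: (d,) logits over the d genes; tau: temperature > 0.
    Returns a (K, d) tensor whose rows lie on the probability simplex.
    """
    logits = log_pi.expand(K, -1)                 # one draw per selector node
    u = torch.rand_like(logits).clamp(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))            # Gumbel(0, 1) samples g_j^(k)
    return F.softmax((logits + gumbel) / tau, dim=-1)

# Each selector node k masks the input: x_masked_k = x * gamma[k].
# As tau -> 0, each row of gamma approaches a one-hot vector.
```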

Illustrative Toy Example

Consider a data set \(X={\{{x}_{i}\}}_{i = 1}^{n}\) of gene expression profiles, corresponding to n cells with 10,000 variable genes. In this toy scenario, we only record whether these genes are overexpressed (+1) or underexpressed (−1) with respect to some control population. A natural task would be to compress our data by expressing it in a lower dimension, which is often achieved with a VAE (Architecture). The cells come in two states A and B, depending on how they respond to a particular perturbation, a response which we observe. Assume that whether a cell is in state A or state B only depends on 3 genes in the following way: a cell is in state A if the genes are either all overexpressed (1, 1, 1) or all underexpressed (−1, −1, −1); otherwise it is in state B. None of the genes is individually informative of the clustering: the mean per cluster would be 0 for all genes. Initially, we may not know which of the 10,000 genes are indicative of the cell states, but we would like to obtain a reduced number of genes capable of accurately predicting cell state. For example, without prior information, one might assume all genes are equally good at this task, so our initial ‘weighting’ of the genes would be (\(\frac{1}{10,000},\frac{1}{10,000},\ldots \frac{1}{10,000}\)) (Parameter Initialization), which translates to a continuous probability distribution through equation (1), i.e., they specify the πj. The uniformity corresponds to randomly picking any of the genes as informative features. Then, through the variational optimization algorithm (Optimization), these weights are iterated on until, upon convergence, the genes that are jointly more indicative of cell state have a higher probability weight, while also being sufficient for the reconstruction of the entire gene expression. This will correspond to a vector (\(\frac{1}{3},\frac{1}{3},\frac{1}{3},0,0\ldots 0\)), assuming the first three genes are the most informative.
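A minimal numpy sketch of this toy construction (the cell count and random seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10_000
X = rng.choice([-1, 1], size=(n, d))   # over (+1) / under (-1) expression calls

# A cell is in state A iff the 3 informative genes agree: (1,1,1) or (-1,-1,-1).
y = np.where(np.abs(X[:, :3].sum(axis=1)) == 3, "A", "B")

# No gene is individually informative: per-state means are ~0 for every gene,
# so only a method that scores genes jointly can recover the first three.
print(X[y == "A"][:, :3].mean(axis=0), X[y == "B"][:, :3].mean(axis=0))
```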

Optimization

Letting p(x) be the probability distribution over the d-dimensional data X and given a set of labels Y, MarkerMap learns: a) a subset of markers S of size K, b) a reconstruction function \({f}_{\theta }:{{\mathbb{R}}}^{K}\to {{\mathbb{R}}}^{d}\), and c) a classifier \({f}_{W}:{{\mathbb{R}}}^{K}\to {{{\mathcal{Y}}}}\).

To learn these elements, the following empirical objective is optimized:

$$\begin{array}{r}\mathop{{\mathrm{arg}}\,{\mathrm{min}}}\limits_{S,\theta, W}{{\mathbb{E}}}_{p(x)}[\parallel {f}_{\theta }({x}_{S})-x{\parallel }_{2}+\ell ({f}_{W}({x}_{S}),y(x))],\end{array}$$
(3)

where the first term optimizes signal reconstruction from a subset of markers xS and the second objective minimizes the expected classification risk, both over the unknown distribution p(x) with respect to a loss function \(\ell\). In practice, we consider the alternative empirical objective

$$\begin{array}{r}\mathop{{\mathrm{arg}}\,{\mathrm{min}}}\limits_{S,\theta, W}\,\alpha \parallel {f}_{\theta }({X}_{S})-X{\parallel }_{2}+(1-\alpha )\,\ell ({f}_{W}({X}_{S}),Y),\end{array}$$
(4)

where α ∈ [0, 1] balances the reconstruction loss against the classification loss. MarkerMap considers three separate objectives: a supervised objective with α = 0, an unsupervised objective with α = 1, and a joint objective with α = 0.5. More generally, α can be treated as a tunable (but fixed) hyperparameter that weighs the reconstruction and classification terms in the optimization objective. Because full reconstruction is nominally the harder task (one can achieve low classification error without information about the entire gene expression), it acts as a bottleneck: even when α is small, the convergence of MarkerMap depends on the quality of the reconstruction. Depending on the user-specified goal, the three proposed values of α provide either a classifier (α = 0), which may select a smaller number of genes with good performance; a generative model (α = 1), which is capable of signal reconstruction, possibly at the cost of additional markers; or both (α = 0.5). One may also choose a different, data- or problem-specific value of α.
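The three regimes can be expressed as a single convex combination of the two loss terms. A minimal sketch, assuming a cross-entropy classification loss \(\ell\) and a mean per-cell \({\ell }_{2}\) reconstruction error (the losses used by the package may differ in detail):

```python
import torch
import torch.nn.functional as F

def joint_loss(x, x_hat, class_logits, y, alpha: float) -> torch.Tensor:
    """Eq. (4): alpha=1 unsupervised, alpha=0 supervised, alpha=0.5 joint."""
    recon = torch.linalg.vector_norm(x_hat - x, ord=2, dim=-1).mean()
    clf = F.cross_entropy(class_logits, y)
    return alpha * recon + (1 - alpha) * clf
```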

Optimizing this objective is difficult due to the combinatorial search over the subset S. We address this challenge heuristically by expanding on continuous sampling techniques27 in a batch learning setting49. In a nutshell, b = 1, 2, …, B batches are sampled without replacement from the data set (X, Y). The selected features are then computed and aggregated across batches as follows (a code sketch of steps 1 and 2 follows the list):

  1. Instance-wise logits \(\log {\pi }_{i}^{b}={f}_{\pi }({x}_{i})\) are generated for each xi in batch b, where fπ is a neural network. Averaging them yields an intermediate average batch logit \(\log {\pi }^{b}\).

  2. The average batch logits are computed by aggregating information from the current and previous batches, \(\log {\pi }^{b}\leftarrow \beta \log {\pi }^{b-1}+(1-\beta )\log {\pi }^{b}\), β ∈ (0, 1), much like the update for the mean moment in BatchNorm49.

  3. The K continuous d-dimensional one-hot-encoded vectors \({\gamma }^{(k),b}={({\gamma }_{j}^{(k),b})}_{j = 1}^{d}\) are generated from \(\log {\pi }^{b}\) via continuous relaxation, see (1).

  4. Each γ(k),b selects one of the K features by element-wise multiplication \({X}_{S}^{b}={X}^{b}\boxtimes {\gamma }^{b}\).

  5. The resulting \({X}_{S}^{b}\) then becomes the input to a variational-autoencoder-like architecture, which includes a classifier loss as well as a reconstruction loss (Fig. 1 and Eq. (4)).

  6. All network weights are updated through stochastic gradient descent steps, following the optimization of the appropriate loss in Eq. (4) until convergence. The steps are repeated for B timesteps, corresponding to the number of batches.
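Steps 1 and 2 amount to an exponential moving average of per-batch logits. A schematic version, where f_pi stands for the logit network fπ and beta for the momentum-like hyperparameter β (names are ours):

```python
import torch

def aggregate_global_logits(f_pi, batches, beta: float = 0.9) -> torch.Tensor:
    """Running average of batch logits (steps 1-2), akin to BatchNorm's
    running mean. During training, gradients additionally flow through f_pi."""
    log_pi = None
    for xb in batches:                        # b = 1, ..., B
        batch_log_pi = f_pi(xb).mean(dim=0)   # average instance-wise logits
        log_pi = batch_log_pi if log_pi is None else (
            beta * log_pi + (1 - beta) * batch_log_pi)
    return log_pi                             # global (d,) logits over genes
```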

Architecture

The three main components of MarkerMap’s architecture are the neural network fπ for instance-wise logit generation, the task-specific feed-forward network fW for classification, and the variational autoencoder fθ for encoding and reconstruction. The neural network fπ is an encoder with two hidden layers and a sampling layer performing relaxed subset sampling.

Scalability

Training MarkerMap on the 4581 genes and 39,583 cells of the Mouse Brain data set (the largest data set considered) on public cloud GPUs resulted in a training time of 5 minutes for supervised classification tasks and 15 minutes for unsupervised tasks. LassoNet performed similarly when the architecture (number of hidden layers and units) and batch sizes were chosen to be similar to those of MarkerMap. RankCorr and SMaSH achieved smaller training times, under a minute, but require supervised signals. The differential expression tests in Scanpy and COSG are quick but also require supervised signals. PERSIST benefits somewhat from taking a two-step approach to learning markers, but the initial step makes the method take longer.

Benchmarks

We contrast MarkerMap against several subset selection methods. The methods have been introduced in different communities and many have not been previously compared to one another.

  • LassoNet: A residual feed-forward network that makes use of an \({\ell }_{1}\) penalty on network weights in order to induce sparsity in selected features25.

  • Concrete VAE: a traditional VAE architecture that assumes a discrete distribution on latent parameters and performs inference using the formulation of the concrete distribution (also known as Gumbel-Softmax distribution)26.

  • Global-Gumbel VAE: adapted from27. A VAE architecture related to the Concrete VAE.

  • SMaSH Random Forest: A classical Random Forest classification algorithm implemented in the SMaSHpy library (see https://pypi.org/project/smashpy)16.

  • RankCorr: A non-parametric marker selection method using (statistical) rank correlation, implemented in the RankCorr library (see https://github.com/ahsv/RankCorr)15.

  • PERSIST: An autoencoder model similar to MarkerMap that finds markers in an unsupervised fashion, or uses cell labels for supervised learning. PERSIST uses specific loss functions geared towards scRNA-seq data and a two step process to find the most relevant markers. (see https://github.com/iancovert/persist/)37

  • COSG: A differential expression test based on cosine similarity of expression of different genes. (see https://github.com/genecell/COSG)38.

  • Scanpy: The ‘rank_genes_groups’ function of this package performs differential expression tests based on the cell groups with a number of different statistical methods. We tested the t-test, the t-test with overestimated variance, the Wilcoxon rank-sum test, and the Wilcoxon rank-sum test with tie correction (see https://scanpy.readthedocs.io/en/stable/index.html)39.

  • Scanpy Highly Variable Genes: The ‘highly_variable_genes’ function of this package39 implements an unsupervised method from Seurat52 to select genes that are highly variable.

The differential expression tests are one-vs-all methods. For these, we took one marker from each cell type, removing duplicates, until a further full round would put us over our budget k. We then took the markers with the highest score (COSG) or lowest p-value (Scanpy) until we had k markers, as sketched below.
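A schematic of this budgeting procedure; the function and argument names are ours, with per_type_ranked mapping each cell type to its markers sorted best-first and global_ranked holding all candidates sorted by score or p-value:

```python
def pick_k_markers(per_type_ranked: dict, global_ranked: list, k: int) -> list:
    """One marker per cell type per round (skipping duplicates) while a full
    round still fits in the budget, then top up with the best overall."""
    chosen, depth = [], 0
    max_depth = max(len(genes) for genes in per_type_ranked.values())
    while depth < max_depth and len(chosen) + len(per_type_ranked) <= k:
        for genes in per_type_ranked.values():
            if depth < len(genes) and genes[depth] not in chosen:
                chosen.append(genes[depth])
        depth += 1
    for gene in global_ranked:                 # fill the remaining budget
        if len(chosen) == k:
            break
        if gene not in chosen:
            chosen.append(gene)
    return chosen
```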

Data sets

We used publicly available real world data sets from established single cell analysis pipelines, where the problem of marker selection is of interest in the context of explaining cluster assignment. In each data set, the labels correspond to cell types.

Zeisel data set

The Zeisel data set contains data from 3005 cells and 4000 genes32. The cells were collected from the mouse somatosensory cortex (S1) and hippocampal CA1 region. The labels correspond to 7 major cell types and were obtained through biclustering of the full gene expression data set. For the Zeisel (subtypes) data set, we used the more specific 47 cell types. We removed cells whose specific cell types were unknown, leaving 2816 cells.

CITE-seq data set

Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) is a single cell method that allows joint readouts from gene expression and proteins. The CITE-seq data set contains data from 8617 cells and 500 genes33. These cells correspond to major cord blood cells across 13 cell types, obtained from the clustering of combined gene expression and protein read-out data, and not from the clustering of the original single cell data set alone.

Paul data set

The Paul data set35 consists of 2730 mouse bone marrow cells, collected with the MARS-seq protocol. Post processing, each cell contains 3451 genes. The Paul data set contains progenitor cells that are differentiating, hence the data appear to follow a continuous trajectory. The associated outputs represent 10 discrete cell types sampled along these trajectories; hence, the cell types are not well separated35. After removing general genes and housekeeping genes, we are left with 3074 genes. For this data set we do not further remove genes based on cell type because the data set is already small.

Mouse brain data set

This data set is a spatial transcriptomic data set, containing data from 40,572 cells and 31,053 genes from diverse neuronal and glial cell types across stereotyped anatomical regions in the mouse brain34. The output labels correspond to the major cell types identified by the authors. Prior to the pre-processing described below, we perform additional gene and cell filtering because training with the full data set was not feasible for the unsupervised model on public cloud infrastructure. We start by removing cells with unknown cell types. Then we keep only those genes that satisfy the following two conditions: (1) they are present in at least 0.05% of cells and (2) they are present in 3% of cells or the average gene expression level in cells where the gene is present is greater than 1.12. These particular values are somewhat arbitrary and could be changed based on the researcher’s desires. After this filtering we are left with 39,583 cells and 12,869 genes; further pre-processing described below will reduce the number of genes to 4581. In the Mouse Brain data set we use the 9 major cell types after removing those that are unknown. In the Mouse Brain (subtypes) data set we use all 59 specific cell types, which includes two different unknown categories. When leaving these two unknown categories in, we are working with 40,532 cells and 7115 genes after the pre-processing described below.

SSv4 V1 data set

The SSv4 data set36 consists of cells collected from the mouse primary visual cortex (V1). This publicly available data set includes initial pre-processing done by PERSIST37, which reduces the data set to 13,349 cells and 10,000 genes with 98 cell types. We removed 6 cell types that each had fewer than 4 cells because many of our supervised methods require multiple representatives per class. After the further pre-processing described below, the resulting data set consists of 13,342 cells and 4293 genes.

Data processing

The data were processed and filtered following16,33. In particular, we first remove genes associated with general cell function as well as housekeeping genes. Next, we remove genes which are present in less than 30% of cells for every cell type. We also remove genes which are present in over 75% of cells for at least 50% of the cell types. Lastly, we normalize the gene counts per cell so that each cell has the same total gene expression, we perform a \({\log }_{2}(1+x)\) transform of the cell counts, and we center and scale the data so that each gene has mean 0 and variance 1. When evaluating the generative data, we forgo normalizing gene counts across cells and the per-gene centering and scaling. Instead, we only perform the log2(1 + X) transform and then set the mean and variance of the entire data matrix X to 0 and 1, respectively.
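The final normalization steps map naturally onto standard Scanpy calls. A minimal sketch under the stated conventions, assuming a recent Scanpy version where log1p accepts a base argument; the input path is hypothetical, and the gene filtering steps described above would be applied first:

```python
import scanpy as sc

adata = sc.read_h5ad("expression.h5ad")  # hypothetical input, cells x genes

sc.pp.normalize_total(adata)   # equal total gene expression per cell
sc.pp.log1p(adata, base=2)     # log2(1 + x) transform of the counts
sc.pp.scale(adata)             # per-gene mean 0, variance 1
```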

Evaluation metrics

Given K, most of the methods select the top K features informative of the ground-truth labels. The exceptions, RankCorr and LassoNet, do not allow the selection of an exact number of features, as they rely on a regularization parameter that controls feature sparsity. In those cases, we selected K features by grid searching for the regularizer value that yields the desired number of features, as in the sketch below.
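A generic version of this grid search, where fit_select stands in for a LassoNet or RankCorr fit at a given regularization level (names are ours):

```python
def grid_search_k(fit_select, grid, k):
    """Return the selection (over a grid of regularizer values) whose size
    is closest to the target k, stopping early on an exact match."""
    best, best_gap = None, float("inf")
    for lam in grid:
        selected = fit_select(lam)       # set of selected feature indices
        gap = abs(len(selected) - k)
        if gap < best_gap:
            best, best_gap = selected, gap
        if gap == 0:
            break
    return best
```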

For each baseline and data set, the selected features were then used as the only input to either a nearest neighbors classifier or a random forest classifier. For each data set, method, and classifier type, we report two quantities, the misclassification rate and the average F1 score, along with their corresponding confusion matrices. These quantities are defined as follows, for ground truth clusters c = 1, 2, …, C; a code sketch computing both follows the list.

  • Average misclassification rate. The misclassification rate of a given cluster is defined as

    $$\begin{array}{r}{M}_{c}=1-\frac{T{P}_{c}}{T{P}_{c}\,+\,F{P}_{c}},\end{array}$$
    (5)

    where TPc and FPc correspond to the number of true positive and false positive predictions, respectively. We report the average misclassification \(\frac{1}{C}{\sum }_{c}\,{M}_{c}\).

  • Average F1 score. Per cluster, the F1 score is defined as

    $$\begin{array}{r}{F}_{c}=\frac{2{P}_{c}{R}_{c}}{{P}_{c}\,+\,{R}_{c}},\end{array}$$
    (6)

    where Pc and Rc are the precision and recall of the classifier for a cluster c. We report the average F1 score \(\frac{1}{C}{\sum }_{c}\,{F}_{c}\).
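Both averages can be read off the confusion matrix. A minimal sketch (for brevity, it does not guard against clusters with zero true or predicted members, which would divide by zero):

```python
import numpy as np

def average_metrics(conf: np.ndarray) -> tuple[float, float]:
    """conf[i, j]: number of cells of true cluster i predicted as cluster j.
    Returns (average misclassification, average F1), Eqs. (5)-(6)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp     # predicted as c but truly another cluster
    fn = conf.sum(axis=1) - tp     # truly c but predicted as another cluster
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    m = 1 - precision                                    # M_c, Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)   # F_c, Eq. (6)
    return m.mean(), f1.mean()
```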

When evaluating the reconstructed data, we use the Jaccard Index, the Spearman correlation coefficient ρ, the \({\ell }_{2}\) distance, and the \({\ell }_{1}\) distance; a combined code sketch follows the list. Let \(X\in {{\mathbb{R}}}^{n\times d}\) be our data as before, and let \(\tilde{X}\in {{\mathbb{R}}}^{n\times d}\) be the reconstructed data.

  • Jaccard Index. First we calculate the variances of each gene in the original data. Since each gene is a column of X, the variance of those columns is a d-length vector which we will denote \({\sigma }_{X}^{2}\). Next we find the rank vector of the variances, \(R({\sigma }_{X}^{2})\), where the largest variance is assigned 1, the second largest is assigned 2, and so on until the smallest variance is assigned d. We use the ranks to find the indices of the largest 20% of the variances:

    $${I}_{X}=\left\{i:R\left({\sigma }_{X}^{2}\right)[i]\le \frac{d}{5}\right\}$$
    (7)

    We follow the same process for the reconstructed data to get the set of indices \({I}_{\tilde{X}}\). Finally, we calculate the Jaccard Index on these two sets of indices to determine their similarity53:

    $$J=\frac{\left\vert {I}_{X}\cap {I}_{\tilde{X}}\right\vert }{\left\vert {I}_{X}\cup {I}_{\tilde{X}}\right\vert }$$
    (8)

    The Jaccard Index ranges from 0 to 1, and higher values indicate that more of the highly variable genes from the original data are also highly variable in the reconstructed data.

  • Spearman correlation coefficient. The Spearman correlation coefficient is exactly the Pearson correlation coefficient calculated on the ranks of a vector’s values, rather than the raw values. Thus, we first calculate the rank vectors of the gene variances as we did for the Jaccard Index, \(R({\sigma }_{X}^{2})\) and \(R({\sigma }_{\tilde{X}}^{2})\). Finally we calculate the correlation coefficient:

    $$\rho =\frac{\,{{\mbox{cov}}}\,\left(R\left({\sigma }_{X}^{2}\right),R\left({\sigma }_{\tilde{X}}^{2}\right)\right)}{{\sigma }_{R\left({\sigma }_{X}^{2}\right)}{\sigma }_{R\left({\sigma }_{\tilde{X}}^{2}\right)}}$$
    (9)

    where \({\sigma }_{R({\sigma }_{X}^{2})}\) and \({\sigma }_{R({\sigma }_{\tilde{X}}^{2})}\) are the standard deviations of the ranks of the original data and the reconstructed data respectively. This ρ is the Spearman correlation coefficient—values closer to one indicate higher similarity of the ranks of the gene variances.

  • \({\ell }_{2}\) Distance. To calculate the \({\ell }_{2}\) distance, we take the average over all cells of the \({\ell }_{2}\) distance between the original cell and the reconstructed cell:

    $$\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\parallel {x}_{i}-{\tilde{x}}_{i}{\parallel }_{2}$$
    (10)

    where xi is the ith row of X. Lower values indicate that the original data and reconstructed data are more similar.

  • \({\ell }_{1}\) Distance. To calculate the \({\ell }_{1}\) distance, we take the average over all cells of the \({\ell }_{1}\) distance between the original cell and the reconstructed cell:

    $$\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\parallel {x}_{i}-{\tilde{x}}_{i}{\parallel }_{1}$$
    (11)

    where xi is the ith row of X. Lower values indicate that the original data and reconstructed data are more similar.
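All four reconstruction metrics in one sketch; the variance ranks are computed with np.argsort, and the Spearman coefficient is delegated to SciPy:

```python
import numpy as np
from scipy.stats import spearmanr

def reconstruction_metrics(X: np.ndarray, X_rec: np.ndarray):
    """Jaccard index of the top-20% most variable genes (Eqs. 7-8),
    Spearman rho of the variance ranks (Eq. 9), and mean per-cell
    l2 / l1 distances (Eqs. 10-11)."""
    var_x, var_r = X.var(axis=0), X_rec.var(axis=0)
    top = X.shape[1] // 5                          # largest 20% of variances
    idx_x = set(np.argsort(var_x)[::-1][:top])
    idx_r = set(np.argsort(var_r)[::-1][:top])
    jaccard = len(idx_x & idx_r) / len(idx_x | idx_r)
    rho = spearmanr(var_x, var_r).correlation      # rank correlation
    l2 = np.linalg.norm(X - X_rec, ord=2, axis=1).mean()
    l1 = np.linalg.norm(X - X_rec, ord=1, axis=1).mean()
    return jaccard, rho, l2, l1
```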

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.