Abstract
The emergence and development of single-cell RNA sequencing (scRNA-seq) techniques enable researchers to perform large-scale transcriptomic profiling at single-cell resolution. Unsupervised clustering of scRNA-seq data is central to most studies and essential for identifying novel cell types and their gene expression logic. Although an increasing number of algorithms and tools are available for scRNA-seq analysis, practical guides that help users navigate this landscape remain scarce. This chapter presents an overview of the scRNA-seq data analysis pipeline, covering quality control, batch effect correction, data standardization, cell clustering and visualization, cluster correlation analysis, and marker gene identification. Taking two widely used analysis packages, Scanpy and MetaCell, as examples, we provide hands-on guidelines and a comparison of best practices for these essential analysis steps and for data visualization. Additionally, we compare both packages and their algorithms using an scRNA-seq dataset of the ctenophore Mnemiopsis leidyi, a representative of one of the earliest animal lineages and critical to understanding the origin and evolution of animal novelties. This pipeline can also be helpful for analyses of other taxa, especially prebilaterian animals for which such tools are still under development (e.g., Placozoa and Porifera).
References
Moroz LL (2015) Biodiversity meets neuroscience: from the sequencing ship (Ship-Seq) to deciphering parallel evolution of neural systems in Omic's era. Integr Comp Biol 55(6):1005–1017
Moroz LL (2018) NeuroSystematics and periodic system of neurons: model vs reference species at single-cell resolution. ACS Chem Neurosci 9(8):1884–1903
Hernandez-Nicaise M-L (1991) Ctenophora. In: Harrison FW, Westfall JA (eds) Microscopic anatomy of invertebrates: Placozoa, Porifera, Cnidaria, and Ctenophora. Wiley, New York, pp 359–418
Nielsen C (2012) Animal evolution: interrelationships of the living phyla. Oxford University Press, Oxford
Nielsen C (2019) Early animal evolution: a morphologist's view. R Soc Open Sci 6(7):190638
Li Y et al (2021) Rooting the animal tree of life. Mol Biol Evol 38(10):4322–4333
Moroz LL (2012) Phylogenomics meets neuroscience: how many times might complex brains have evolved? Acta Biol Hung 63(Suppl 2):3–19
Moroz LL et al (2014) The ctenophore genome and the evolutionary origins of neural systems. Nature 510(7503):109–114
Ryan JF et al (2013) The genome of the ctenophore Mnemiopsis leidyi and its implications for cell type evolution. Science 342(6164):1242592
Schultz DT et al (2023) Ancient gene linkages support ctenophores as sister to other animals. Nature 618(7963):110–117
Whelan NV et al (2015) Error, signal, and the placement of Ctenophora sister to all other animals. Proc Natl Acad Sci U S A 112(18):5773–5778
Whelan NV et al (2017) Ctenophore relationships and their placement as the sister group to all other animals. Nat Ecol Evol 1(11):1737–1746
Moroz LL, Kohn AB (2015) Unbiased view of synaptic and neuronal gene complement in ctenophores: are there pan-neuronal and pan-synaptic genes across Metazoa? Integr Comp Biol 55(6):1028–1049
Moroz LL, Kohn AB (2016) Independent origins of neurons and synapses: insights from ctenophores. Philos Trans R Soc Lond Ser B Biol Sci 371(1685):20150041
Moroz LL, Romanova DY (2022) Alternative neural systems: what is a neuron? (Ctenophores, sponges and placozoans). Front Cell Dev Biol 10:1071961
Moroz LL, Romanova DY, Kohn AB (2021) Neural versus alternative integrative systems: molecular insights into origins of neurotransmitters. Philos Trans R Soc Lond Ser B Biol Sci 376(1821):20190762
Martindale MQ (2022) Emerging models: the “development” of the ctenophore Mnemiopsis leidyi and the cnidarian Nematostella vectensis as useful experimental models. Curr Top Dev Biol 147:93–120
Martindale MQ, Henry JQ (2015) Ctenophora. In: Wanninger A (ed) Evolutionary developmental biology of invertebrates 1: introduction, non-Bilateria, Acoelomorpha, Xenoturbellida, Chaetognatha. Springer Vienna, Vienna, pp 179–201
Sebe-Pedros A et al (2018) Early metazoan cell type diversity and the evolution of multicellular gene regulation. Nat Ecol Evol 2(7):1176–1188
Baran Y et al (2019) MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions. Genome Biol 20(1):1–19
Sachkova MY et al (2021) Neuropeptide repertoire and 3D anatomy of the ctenophore nervous system. Curr Biol 31(23):5274–5285 e6
Hayakawa E et al (2022) Mass spectrometry of short peptides reveals common features of metazoan peptidergic neurons. Nat Ecol Evol 6(10):1438–1448
Moroz LL (2009) On the independent origins of complex brains and neurons. Brain Behav Evol 74(3):177–190
Zappia L, Theis FJ (2021) Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol 22(1):1–18
Zappia L, Phipson B, Oshlack A (2018) Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput Biol 14(6):e1006245
Svensson V, da Veiga Beltrame E, Pachter L (2020) A curated database reveals trends in single-cell transcriptomics. Database 2020:baaa073
Wolf FA, Angerer P, Theis FJ (2018) SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19(1):1–5
Shannon P et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Satija R et al (2015) Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33(5):495–502
McCarthy DJ et al (2017) Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8):1179–1186
Jin S et al (2021) Inference and analysis of cell-cell communication using CellChat. Nat Commun 12(1):1–20
Luecken MD, Theis FJ (2019) Current best practices in single‐cell RNA‐seq analysis: a tutorial. Mol Syst Biol 15(6):e8746
Amezquita RA et al (2020) Orchestrating single-cell analysis with Bioconductor. Nat Methods 17(2):137–145
Andrews TS et al (2021) Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc 16(1):1–9
Cao J et al (2019) The single-cell transcriptional landscape of mammalian organogenesis. Nature 566(7745):496–502
Ziegler CG et al (2020) SARS-CoV-2 receptor ACE2 is an interferon-stimulated gene in human airway epithelial cells and is detected in specific cell subsets across tissues. Cell 181(5):1016–1035. e19
Mathys H et al (2019) Single-cell transcriptomic analysis of Alzheimer’s disease. Nature 570(7761):332–337
Bornstein C et al (2018) Single-cell mapping of the thymic stroma identifies IL-25-producing tuft epithelial cells. Nature 559(7715):622–626
Giladi A et al (2018) Single-cell characterization of haematopoietic progenitors and their trajectories in homeostasis and perturbed haematopoiesis. Nat Cell Biol 20(7):836–846
Alpaydin E (2020) Introduction to machine learning. MIT Press
Pedregosa F et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11):2579
McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426
Lun AT, Bach K, Marioni JC (2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol 17(1):1–14
Bacher R et al (2017) SCnorm: robust normalization of single-cell RNA-seq data. Nat Methods 14(6):584–586
Cole MB et al (2019) Performance assessment and selection of normalization procedures for single-cell RNA-Seq. Cell Syst 8(4):315–328. e8
Lytal N, Ran D, An L (2020) Normalization methods on single-cell RNA-seq data: an empirical survey. Front Genet 11:41
Street K et al (2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19(1):1–16
Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118–127
Polański K et al (2020) BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36(3):964–965
Korsunsky I et al (2019) Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods 16(12):1289–1296
Haghverdi L et al (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36(5):421–427
Hie B, Bryson B, Berger B (2019) Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol 37(6):685–691
Tran HTN et al (2020) A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21(1):1–32
Zheng GX et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8(1):1–12
Stuart T et al (2019) Comprehensive integration of single-cell data. Cell 177(7):1888–1902. e21
Grün D et al (2015) Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525(7568):251–255
Wang B et al (2017) Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 14(4):414–416
Kiselev VY et al (2017) SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14(5):483–486
Lin P, Troup M, Ho JW (2017) CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 18(1):1–11
Yau C (2016) pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinform 17(1):1–11
Zeisel A et al (2015) Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347(6226):1138–1142
Jiang L et al (2016) GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol 17(1):1–13
Qiu X et al (2017) Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14(10):979–982
Traag VA, Waltman L, Van Eck NJ (2019) From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9(1):1–12
Blondel VD et al (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008
Duò A, Robinson MD, Soneson C (2018) A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 7:1141
Zhang S et al (2020) Review of single-cell RNA-seq data clustering for cell type identification and characterization. arXiv preprint arXiv:2001.01006
Liu B, Li Y, Zhang L (2022) Analysis and visualization of spatial transcriptomic data. Front Genet 12:785290
Coifman RR et al (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci 102(21):7426–7431
Haghverdi L et al (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13(10):845–848
Xu C, Su Z (2015) Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31(12):1974–1980
Patterson-Cross RB, Levine AJ, Menon V (2021) Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinform 22(1):1–13
Teaching team at the Harvard Chan Bioinformatics Core. Introduction to Single-cell RNA-seq. [cited 2022 04/10]; Available from: https://hbctraining.github.io/scRNA-seq/lessons/07_SC_clustering_cells_SCT.html
Paul Hoffman SL (2022) Seurat - guided clustering tutorial. [cited 2022 04/10]; Available from: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
Fruchterman TM, Reingold EM (1991) Graph drawing by force‐directed placement. Softw Pract Exp 21(11):1129–1164
Wolf FA et al (2019) PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol 20(1):1–9
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60
Welch BL (1947) The generalization of ‘STUDENT'S’ problem when several different population variances are involved. Biometrika 34(1–2):28–35
Musser JM et al (2021) Profiling cellular diversity in sponges informs animal cell type and nervous system evolution. Science 374(6568):717–723
Varoqueaux F et al (2018) High cell diversity and complex peptidergic signaling underlie placozoan behavior. Curr Biol 28(21):3495–3501 e2
Dries R et al (2021) Advances in spatial transcriptomic data analysis. Genome Res 31(10):1706–1718
Tarashansky AJ et al (2021) Mapping single-cell atlases throughout Metazoa unravels cell type evolution. eLife 10:e66747
Liu X, Shen Q, Zhang S (2023) Cross-species cell-type assignment from single-cell RNA-seq data by a heterogeneous graph neural network. Genome Res 33(1):96–111
Wang R et al (2023) Construction of a cross-species cell landscape at single-cell level. Nucleic Acids Res 51(2):501–516
Wang J et al (2021) Tracing cell-type evolution by cross-species comparison of cell atlases. Cell Rep 34(9):108803
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 4. Springer
Gan G, Ma C, Wu J (2020) Data clustering: theory, algorithms, and applications. SIAM
Ross SM (2014) Introduction to probability models. Academic Press
Zelle JM (2004) Python programming: an introduction to computer science. Franklin, Beedle & Associates, Inc
Chambers JM (2008) Software for data analysis: programming with R, vol 2. Springer
Moroz LL (2023) Brief history of Ctenophora. Methods Mol Biol. in press
Burkhardt P, Jekely G (2021) Evolution of synapses and neurotransmitter systems: the divide-and-conquer model for early neural cell-type evolution. Curr Opin Neurobiol 71:127–138
Moroz LL, Mukherjee K, Romanova DY (2023) Nitric oxide signaling in ctenophores. Front Neurosci 17:1125433
Acknowledgments
This work was supported in part by the Human Frontier Science Program (RGP0060/2017) and National Science Foundation (IOS-1557923) grants to LLM. Research reported in this publication was also supported in part by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under Award Number R01NS114491 (to LLM). D.R. was supported by the Russian Science Foundation grant (23-14-00050). The content is solely the authors’ responsibility and does not necessarily represent the official views of the National Institutes of Health.
Appendix
1.1 Recommended Books
- Pattern Recognition and Machine Learning [87]
- Data Clustering: Theory, Algorithms, and Applications [88]
- Introduction to Machine Learning [40]
- Introduction to Probability Models [89]
- Python Programming: An Introduction to Computer Science [90]
- Software for Data Analysis: Programming with R [91]
1.2 PAGA Graph
The PAGA algorithm enables similarity analysis between the partitions (or clusters) generated by community-detection-based clustering methods. The edge weight, or connectivity, in the PAGA graph carries the essential information about this similarity. This section briefly introduces how the edge weights of the PAGA graph are computed. A partitioned directed graph is denoted as G, containing e edges and n nodes, where each node corresponds to one cell. Group i contains \( n_i \) nodes, which have \( e_i \) outgoing edges in total. The target coarse-grained PAGA graph is represented as \( G^{\ast} = \left(V^{\ast}, E^{\ast}\right) \), where \( {V}^{\ast }=\left\{{v}_1^{\ast },\dots, {v}_M^{\ast}\right\} \) is the set of the M cell groups and \( e^{\ast} \in E^{\ast} \) is a PAGA edge estimated by the PAGA algorithm. A random variable \( \epsilon_{ij} \) describes the number of edges connecting cell group i to cell group j when edges are placed at random. Its distribution \( p\left(\epsilon_{ij}\right) \) is calculated as:
\[ p\left(\epsilon_{ij}\mid e_i, n_i, n_j\right) = \binom{e_i}{\epsilon_{ij}} \left(\frac{n_j}{n-1}\right)^{\epsilon_{ij}} \left(1-\frac{n_j}{n-1}\right)^{e_i-\epsilon_{ij}} \]
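The expression \( p\left(\epsilon_{ij}\mid e_i, n_i, n_j\right) \) is a binomial distribution with expectation \( \frac{e_i{n}_j}{n-1} \) and variance \( \frac{e_i{n}_j\left(n-{n}_j-1\right)}{{\left(n-1\right)}^2} \). A new variable \( \epsilon = \epsilon_{ij} + \epsilon_{ji} \) is introduced to provide a symmetric metric for the similarity of two clusters. Suppose the clusters are large enough that the binomial distributions can be approximated by Gaussians; then the distribution of \( \epsilon \) can be approximated as:
\[ \epsilon \sim \mathcal{N}\left(\frac{e_i{n}_j+{e}_j{n}_i}{n-1},\ \frac{e_i{n}_j\left(n-{n}_j-1\right)+{e}_j{n}_i\left(n-{n}_i-1\right)}{{\left(n-1\right)}^2}\right) \]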
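Suppose the actual number of edges between cluster i and cluster j is \( {\epsilon}_{ij}^{\mathrm{sym}} \) and the expected number of edges is \( {\hat{\epsilon}}_{ij}=\frac{e_i{n}_j+{e}_j{n}_i}{n-1} \); the edge weight \( w_{ij} \) is defined as:
\[ w_{ij} = \begin{cases} \dfrac{{\epsilon}_{ij}^{\mathrm{sym}}}{{\hat{\epsilon}}_{ij}}, & {\epsilon}_{ij}^{\mathrm{sym}} < {\hat{\epsilon}}_{ij} \\ 1, & \text{otherwise} \end{cases} \]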
If the number of actual edges is larger than the expected value, the connectivity is set to 1, the upper bound of the connectivity value. If the given partitioned graph is undirected, one can convert it to a bidirected graph by replacing each edge with two independent directed edges, one pointing to each of the two linked nodes. The same PAGA calculation strategy can then be applied.
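The edge-weight calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Scanpy implementation: the function name `paga_connectivity`, the edge-list input format, and the toy graph are hypothetical. It assumes that the connectivity is the observed number of inter-cluster edges (counted in both directions) divided by the expected number under random connection, capped at 1, and that the count of outgoing edges for a cluster includes edges that stay within the cluster.

```python
from collections import Counter


def paga_connectivity(edges, labels):
    """Sketch of PAGA-style edge weights (hypothetical helper).

    edges  : iterable of (source, target) pairs of cell indices (directed).
    labels : sequence where labels[k] is the cluster of cell k.
    Returns a dict mapping a sorted cluster pair (a, b) to its weight.
    """
    n = len(labels)               # total number of cells (nodes)
    sizes = Counter(labels)       # n_i: number of cells per cluster
    out_edges = Counter()         # e_i: outgoing edges per cluster
    observed = Counter()          # symmetric inter-cluster edge counts

    for source, target in edges:
        a, b = labels[source], labels[target]
        out_edges[a] += 1
        if a != b:
            observed[frozenset((a, b))] += 1

    weights = {}
    for pair, count in observed.items():
        a, b = sorted(pair)
        # Expected symmetric edge count under random connection.
        expected = (out_edges[a] * sizes[b] + out_edges[b] * sizes[a]) / (n - 1)
        # Ratio of observed to expected edges, capped at the upper bound 1.
        weights[(a, b)] = min(1.0, count / expected)
    return weights


# Toy example: two clusters of three cells joined by a single edge (2 -> 3).
toy_labels = ["A", "A", "A", "B", "B", "B"]
toy_edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
toy_weights = paga_connectivity(toy_edges, toy_labels)
print(toy_weights)
```

In the toy graph, cluster A has four outgoing edges and cluster B has three, so the expected number of edges between them is (4·3 + 3·3)/5 = 4.2; the single observed edge therefore gives a connectivity of 1/4.2 ≈ 0.24, indicating weakly connected clusters.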
Copyright information
© 2024 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Li, Y., Sun, C., Romanova, D.Y., Wu, D.O., Fang, R., Moroz, L.L. (2024). Analysis and Visualization of Single-Cell Sequencing Data with Scanpy and MetaCell: A Tutorial. In: Moroz, L.L. (eds) Ctenophores. Methods in Molecular Biology, vol 2757. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3642-8_17
DOI: https://doi.org/10.1007/978-1-0716-3642-8_17
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3641-1
Online ISBN: 978-1-0716-3642-8
eBook Packages: Springer Protocols