Abstract
Advances in next-generation sequencing (NGS) technologies have resulted in a broad array of large-scale gene expression studies and an unprecedented volume of whole-transcriptome messenger RNA (mRNA) sequencing data, generated by RNA sequencing (RNA-seq). These studies include the Genotype-Tissue Expression (GTEx) project and The Cancer Genome Atlas (TCGA), among others. Here we cover some of the commonly used datasets, provide an overview of how to begin the analysis pipeline, and describe how to explore and interpret the data provided by these publicly available resources.
Copyright information
© 2021 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Zoabi, Y., Shomron, N. (2021). Processing and Analysis of RNA-seq Data from Public Resources. In: Shomron, N. (eds) Deep Sequencing Data Analysis. Methods in Molecular Biology, vol 2243. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1103-6_4
DOI: https://doi.org/10.1007/978-1-0716-1103-6_4
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1102-9
Online ISBN: 978-1-0716-1103-6
eBook Packages: Springer Protocols