Abstract
Next-generation sequencing technologies have made it possible to obtain, at a relatively low cost, a detailed snapshot of the RNA transcripts present in a tissue sample. The resulting reads are usually binned by gene, exon, or other region of interest; thus the data typically amount to read counts for tens of thousands of features, on no more than dozens or hundreds of observations. It is often of interest to use these data to develop a classifier in order to assign an observation to one of several pre-defined classes. However, the high dimensionality of the data poses statistical challenges: because there are far more features than observations, many existing classification techniques cannot be directly applied. In recent years, a number of proposals have been made to extend existing classification approaches to the high-dimensional setting. In this chapter, we discuss the use of, and modifications to, logistic regression, linear discriminant analysis, principal components analysis, partial least squares, and the support vector machine in the high-dimensional setting. We illustrate these methods on two RNA-sequencing data sets.
indicates joint first authorship.
Notes
- 1. The training set is the set of observations used to fit the classifier.
- 2. We thank Liguo Wang for providing us the raw counts for the prostate cancer data set used in [49].
- 3.
- 4. Briefly, R-fold cross-validation involves splitting the observations in the training set into R folds. Then, for r = 1, …, R, we build classifiers for a range of tuning parameter values using all observations except those in the rth fold. We then calculate the error \(e_{r}\) of each of these classifiers on the observations in the rth fold. Finally, we calculate the cross-validation error as \(\frac{1} {R}\sum _{r=1}^{R}e_{r}\). The tuning parameter value corresponding to the minimum cross-validation error is selected.
- 5. Proposals have been made for an ℓ1-penalized SVM that results in a sparse decision rule, but the standard SVM decision rule involves all of the features [100].
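The cross-validation procedure described in Note 4 can be sketched concretely. The following is an illustrative NumPy sketch, not code from the chapter: ridge regression stands in for the chapter's classifiers, squared error stands in for classification error, and all function and variable names (`choose_lambda`, `cv_error`, etc.) are hypothetical.

```python
# Minimal sketch of R-fold cross-validation for tuning-parameter selection,
# using ridge regression as a stand-in model (assumption, not the chapter's code).
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X'X + lam * I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, R=5, seed=0):
    # Assign each observation to one of R folds, then average held-out errors:
    # (1/R) * sum over r of e_r, as in Note 4.
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % R
    errs = []
    for r in range(R):
        train, test = folds != r, folds == r
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))  # e_r
    return float(np.mean(errs))

def choose_lambda(X, y, lambdas, R=5):
    # Select the tuning parameter minimizing the cross-validation error
    return min(lambdas, key=lambda lam: cv_error(X, y, lam, R))

# Toy data: 60 observations, 20 features, 3 of them truly predictive
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 20))
beta_true = np.zeros(20)
beta_true[:3] = 2.0
y = X @ beta_true + rng.standard_normal(60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = choose_lambda(X, y, lambdas)
print(best)
```

In practice one would replace the ridge stand-in with the classifier being tuned and `e_r` with misclassification error; the selection logic is unchanged.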
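Note 5's contrast between sparse and non-sparse decision rules can be illustrated with a toy computation. This sketch is an assumption-laden stand-in, not the ℓ1-penalized SVM of [100]: it applies proximal gradient descent (ISTA) to an ℓ1-penalized logistic loss, which shares the key property that the soft-thresholding step sets many coefficients exactly to zero, so only a few features enter the decision rule.

```python
# Toy illustration (not the method of [100]): an l1 penalty, applied here to
# logistic loss via proximal gradient (ISTA), zeroes out most coefficients.
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the l1 penalty: shrink toward zero, clip at zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_logistic(X, y, lam, step=0.5, iters=500):
    # ISTA: gradient step on the logistic loss, then soft-threshold
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        probs = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (probs - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy data: 100 observations, 30 features, only the first 2 truly informative
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
w_true = np.zeros(30)
w_true[:2] = 3.0
y = (X @ w_true + 0.1 * rng.standard_normal(100) > 0).astype(float)

w = l1_logistic(X, y, lam=0.1)
print(np.count_nonzero(w))  # far fewer than 30 features survive
```

An unpenalized or ℓ2-penalized fit on the same data would return 30 nonzero coefficients, which is the sense in which the standard decision rule "involves all of the features."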
References
Agresti, A.: Categorical Data Analysis. Wiley, New York (2002)
Aguilera, A.M., Escabias, M., Valderrama, M.J.: Using principal components for estimating logistic regression with high-dimensional multicollinear data. Comput. Stat. Data Anal. 50(8), 1905–1924 (2006)
Allen, D.M.: The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16(1), 125–127 (1974)
Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010)
Bair, E., Hastie, T., Paul, D., Tibshirani, R.: Prediction by supervised principal components. J. Am. Stat. Assoc. 101(473), 119–137 (2006)
Bair, E., Tibshirani, R.: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2(4), e108 (2004)
Barshan, E., Ghodsi, A., Azimifar, Z., Zolghadri Jahromi, M.: Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recogn. 44(7), 1357–1371 (2011)
Bickel, P.J., Levina, E.: Some theory for Fisher’s linear discriminant function, naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10(6), 989–1010 (2004)
Boulesteix, A.L.: PLS dimension reduction for classification with microarray data. Stat. Appl. Genet. Mol. Biol. 3(1), 1–33 (2004)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M., Haussler, D.: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97(1), 262–267 (2000)
Bullard, J., Purdom, E., Hansen, K., Dudoit, S.: Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinform. 11, 94 (2010)
Chun, H., Keleş, S.: Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. Roy. Stat. Soc. Ser. B (Stat. Meth.) 72(1), 3–25 (2010)
Chung, D., Keles, S.: Sparse partial least squares classification for high dimensional data. Stat. Appl. Genet. Mol. Biol. 9(1), Article 17 (2010)
Clemmensen, L., Hastie, T., Witten, D., Ersbøll, B.: Sparse discriminant analysis. Technometrics 53(4), 406–413 (2011)
Collins, M., Dasgupta, S., Schapire, R.E.: A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems, pp. 617–624 (2001)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9, 1269–1294 (2008)
d’Aspremont, A., El Ghaoui, L., Jordan, M.I., Lanckriet, G.R.: A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007)
Datta, S., Pihur, V., Datta, S.: An adaptive optimal ensemble classifier via bagging and rank aggregation with applications to high dimensional data. BMC Bioinformatics 11(1), 427 (2010)
De Leeuw, J.: Principal component analysis of binary data by iterated singular value decomposition. Comput. Stat. Data Anal. 50(1), 21–39 (2006)
Dietterich, T.G.: Ensemble methods in machine learning. In: Multiple Classifier Systems, pp. 1–15. Springer, Berlin (2000)
Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., et al.: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 14(6), 671–683 (2013)
Ding, B., Gentleman, R.: Classification using generalized partial least squares. J. Comput. Graph. Stat. 14(2), 280–298 (2005)
Donoho, D.L., Johnstone, I.M.: Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat. Assoc. 90(432), 1200–1224 (1995)
Dudoit, S., Fridlyand, J., Speed, T.P.: Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97(457), 77–87 (2002)
Efron, B.: Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78(382), 316–331 (1983)
Fort, G., Lambert-Lacroix, S.: Classification using partial least squares with penalized logistic regression. Bioinformatics 21(7), 1104–1111 (2005)
Frank, L.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35(2), 109–135 (1993)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Software 33(1), 1–22 (2010)
Friedman, J.H.: Regularized discriminant analysis. J. Am. Stat. Assoc. 84(405), 165–175 (1989)
Fu, X., Fu, N., Guo, S., Yan, Z., Xu, Y., Hu, H., Menzel, C., Chen, W., Li, Y., Zeng, R., et al.: Estimating accuracy of RNA-seq and microarrays with proteomics. BMC Genom. 10, 161 (2009)
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D.: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10), 906–914 (2000)
Geisser, S.: The predictive sample reuse method with applications. J. Am. Stat. Assoc. 70(350), 320–328 (1975)
Grosenick, L., Greer, S., Knutson, B.: Interpretable classifiers for FMRI improve prediction of purchases. IEEE Trans. Neural Syst. Rehabil. Eng. 16(6), 539–548 (2008)
Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1), 86–100 (2007)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1–3), 389–422 (2002)
Haas, B.J., Zody, M.C., et al.: Advancing RNA-seq analysis. Nat. Biotech. 28(5), 421–423 (2010)
Hastie, T., Buja, A., Tibshirani, R.: Penalized discriminant analysis. Ann. Stat. 23(1), 73–102 (1995)
Hastie, T., Tibshirani, R.: Discriminant analysis by Gaussian mixtures. J. Roy. Stat. Soc. Ser. B (Methodological) 58(1), 155–176 (1996)
Hastie, T., Tibshirani, R., Buja, A.: Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc. 89, 1255–1270 (1994)
Hastie, T., Tibshirani, R., Friedman, J.J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)
Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)
James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning. Springer, New York (2013)
Jolliffe, I.: Principal Component Analysis. Wiley, New York (2005)
Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12(3), 531–547 (2003)
Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)
Kannan, K., Wang, L., Wang, J., Ittmann, M.M., Li, W., Yen, L.: Recurrent chimeric RNAs enriched in human prostate cancer identified by deep sequencing. Proc. Natl. Acad. Sci. 108(22), 9172–9177 (2011)
Lee, S., Huang, J.Z., Hu, J.: Sparse logistic principal components analysis for binary data. Ann. Appl. Stat. 4(3), 1579–1601 (2010)
Lee, S.I., Lee, H., Abbeel, P., Ng, A.Y.: Efficient L1 regularized logistic regression. In: Proceedings of the 21st National Conference on Artificial Intelligence, pp. 401–408. AAAI Press, Menlo Park (2006)
Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11(10), 733–739 (2010)
Leng, C.: Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Comput. Biol. Chem. 32(6), 417–425 (2008)
Li, J., Witten, D.M., Johnstone, I.M., Tibshirani, R.: Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics 13(3), 523–538 (2012)
Ma, Z.: Sparse principal component analysis and iterative thresholding. Ann. Stat. 41(2), 772–801 (2013)
Mai, Q., Zou, H., Yuan, M.: A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99(1), 29–42 (2012)
Malone, J.H., Oliver, B.: Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol. 9, 34 (2011)
Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic, New York (1980)
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., Gilad, Y.: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18(9), 1509–1517 (2008)
Marx, B.D.: Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics 38(4), 374–381 (1996)
Marx, B.D., Smith, E.P.: Principal component estimation for generalized linear regression. Biometrika 77(1), 23–31 (1990)
McCarthy, D.J., Chen, Y., Smyth, G.K.: Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res. 40(10), 4288–4297 (2012)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Chapman and Hall, Boca Raton (1989)
Meier, L., Van De Geer, S., Bühlmann, P.: The group lasso for logistic regression. J. Roy. Stat. Soc. Ser. B (Stat. Meth.) 70(1), 53–71 (2008)
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., Wold, B.: Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Meth. 5(7), 621–628 (2008)
Nguyen, D.V., Rocke, D.M.: Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18(9), 1216–1226 (2002)
Nguyen, D.V., Rocke, D.M.: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18(1), 39–50 (2002)
Opitz, D., Maclin, R.: Popular ensemble methods: An empirical study. J. Artif. Intell. Res. 11, 169–198 (1999)
Oshlack, A., Wakefield, M.J.: Transcript length bias in RNA-seq data confounds systems biology. Biol. Direct 4(14) (2009)
Ozsolak, F., Milos, P.M.: RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12(2), 87–98 (2010)
Park, M.Y., Hastie, T.: L1-regularization path algorithm for generalized linear models. J. Roy. Stat. Soc. Ser. B (Stat. Meth.) 69(4), 659–677 (2007)
Park, P.J.: ChIP–seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009)
Quackenbush, J.: Microarray data normalization and transformation. Nat. Genet. 32, 496–501 (2002)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2013). http://www.R-project.org/
Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140 (2010)
Robinson, M.D., Oshlack, A.: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010)
Schein, A.I., Saul, L.K., Ungar, L.H.: A generalized linear model for principal component analysis of binary data. In: Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, pp. 14–21 (2003)
Shao, J.: Linear model selection by cross-validation. J. Am. Stat. Assoc. 88(422), 486–494 (1993)
Shen, H., Huang, J.Z.: Sparse principal component analysis via regularized low rank matrix approximation. J. Multivariate Anal. 99(6), 1015–1034 (2008)
Shendure, J.: The beginning of the end for microarrays? Nat. Meth. 5(7), 585–587 (2008)
Stone, M.: Cross-validatory choice and assessment of statistical predictions. J. Roy. Stat. Soc. Ser. B (Methodological) 36, 111–147 (1974)
Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., Conesa, A.: Differential expression in RNA-seq: a matter of depth. Genome Res. 21(12), 2213–2223 (2011)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. Ser. B (Methodological) 58, 267–288 (1996)
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99(10), 6567–6572 (2002)
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci. 18(1), 104–117 (2003)
Trendafilov, N.T., Jolliffe, I.T.: Projected gradient approach to the numerical solution of the SCoTLASS. Comput. Stat. Data Anal. 50(1), 242–253 (2006)
Trendafilov, N.T., Jolliffe, I.T.: DALASS: variable selection in discriminant analysis via the LASSO. Comput. Stat. Data Anal. 51(8), 3718–3736 (2007)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (2000)
Wang, Z., Gerstein, M., Snyder, M.: RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10(1), 57–63 (2009)
Weston, J., Watkins, C.: Multi-class support vector machines. Technical report, Citeseer (1998)
Witten, D., Tibshirani, R., Gu, S.G., Fire, A., Lui, W.O.: Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biol. 8(58) (2010)
Witten, D.M.: Classification and clustering of sequencing data using a Poisson model. Ann. Appl. Stat. 5(4), 2493–2518 (2011)
Witten, D.M., Tibshirani, R.: Penalized classification using Fisher’s linear discriminant. J. Roy. Stat. Soc. Ser. B (Stat. Meth.) 73(5), 753–772 (2011)
Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009)
Wold, H., et al.: Estimation of principal components and related models by iterative least squares. Multivariate Anal. 1, 391–420 (1966)
Wold, S.: Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 20(4), 397–405 (1978)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. Ser. B (Stat. Meth.) 68(1), 49–67 (2006)
Zhu, J., Hastie, T.: Classification of gene microarrays by penalized logistic regression. Biostatistics 5(3), 427–443 (2004)
Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. Adv. Neural Inform. Process. Syst. 16(1), 49–56 (2004)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B (Stat. Meth.) 67(2), 301–320 (2005)
Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)
Acknowledgements
D.W. received support for this work from NIH Grant DP5OD009145, NSF CAREER Award DMS-1252624, and a Sloan Foundation Research Fellowship.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Tan, K.M., Petersen, A., Witten, D. (2014). Classification of RNA-seq Data. In: Datta, S., Nettleton, D. (eds) Statistical Analysis of Next Generation Sequencing Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-07212-8_11
DOI: https://doi.org/10.1007/978-3-319-07212-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07211-1
Online ISBN: 978-3-319-07212-8
eBook Packages: Mathematics and Statistics (R0)