Abstract
To meet the challenges of big data, many new statistical tools have been developed in recent years. In this review, we summarize some of these approaches to give an overview of the current state of development. We focus on the case where the number of observations is much larger than the dimension of the unknown parameters, although we also mention some investigations related to high-dimensional data. We discuss both methods that use subsamples and methods that process the whole data set piece by piece.
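As an illustration of the two families of methods named above (not an algorithm taken from the chapter), both ideas can be sketched for ordinary least squares: a subsampling estimator fits the model on a small uniform random subsample, while a piece-by-piece estimator accumulates the sufficient statistics X'X and X'y block by block and recovers the full-data estimate exactly. The sample size, block size, and subsample size below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a tall linear regression: n observations, p parameters, n >> p.
n, p = 100_000, 5
X = rng.standard_normal((n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.standard_normal(n)

# Benchmark: full-data ordinary least squares.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# (1) Subsampling: fit the same model on a small uniform random subsample.
r = 2_000
idx = rng.choice(n, size=r, replace=False)
beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

# (2) Piece-by-piece processing: accumulate X'X and X'y over blocks,
# so only one block needs to be held in memory at a time.
XtX = np.zeros((p, p))
Xty = np.zeros(p)
block = 10_000
for start in range(0, n, block):
    Xb, yb = X[start:start + block], y[start:start + block]
    XtX += Xb.T @ Xb
    Xty += Xb.T @ yb
beta_stream = np.linalg.solve(XtX, Xty)
```

The block-wise estimator reproduces the full-data solution up to floating-point error, while the uniform-subsample estimator trades some statistical efficiency for computation; the optimal-subsampling literature surveyed in the chapter studies how to choose subsampling probabilities to reduce that loss.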
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Yao, Y., Wang, H. (2021). A Selective Review on Statistical Techniques for Big Data. In: Zhao, Y., Chen, D.-G. (eds) Modern Statistical Methods for Health Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-72437-5_11
DOI: https://doi.org/10.1007/978-3-030-72437-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72436-8
Online ISBN: 978-3-030-72437-5
eBook Packages: Mathematics and Statistics (R0)