A Selective Review on Statistical Techniques for Big Data

Yao, Yaqiong; Wang, HaiYing

doi:10.1007/978-3-030-72437-5_11

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

To meet the challenges of big data, many new statistical tools have been developed in recent years. In this review, we summarize some of these approaches to give an overview of the current state of development. We focus on the case in which the number of observations is much larger than the dimension of the unknown parameters, although we also mention some investigations related to high-dimensional data. We discuss methods that use subsamples as well as methods that process the whole data set piece by piece.

Author information

Authors and Affiliations

University of Connecticut, Storrs, CT, USA
Yaqiong Yao & HaiYing Wang

Corresponding author

Correspondence to HaiYing Wang.

Editor information

Editors and Affiliations

Department of Mathematics & Statistics, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, KwaZulu-Natal, South Africa
(Din) Ding-Geng Chen

About this chapter

Cite this chapter

Yao, Y., Wang, H. (2021). A Selective Review on Statistical Techniques for Big Data. In: Zhao, Y., Chen, D.-G. (eds) Modern Statistical Methods for Health Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-72437-5_11

DOI: https://doi.org/10.1007/978-3-030-72437-5_11
Published: 15 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72436-8
Online ISBN: 978-3-030-72437-5
eBook Packages: Mathematics and Statistics (R0)
