Test Set Sizing via Random Matrix Theory

Dubbs, Alexander

doi:10.1007/s43069-024-00292-1

Research
Published: 19 February 2024

Volume 5, article number 17, (2024)
Operations Research Forum

Alexander Dubbs¹

Abstract

This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines “ideal” as satisfying the integrity metric, i.e., the empirical model error equals the actual measurement noise and thus fairly reflects the value of the model, or its lack of value. This paper is the first to solve for the training and test set sizes for any model in a way that is truly optimal. The number of data points in the training set is a root of a quartic polynomial, derived in Theorem 1, that depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulty was realizing that the problems herein had already been discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model; many integrals over this ensemble are known, since they are in the style of Selberg and Aomoto. Mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.
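
The setting described in the abstract can be explored numerically. The following is a minimal Monte Carlo sketch, not the method of the paper: the quartic polynomial of Theorem 1 and the exact definition of the integrity metric are not given in this preview, so the squared gap between the test-set mean squared error and the true noise variance is used here as an assumed stand-in. The function names `integrity_gap` and `best_train_size`, the identity covariance, and the randomly drawn true parameters are all hypothetical choices made for illustration.

```python
import numpy as np


def integrity_gap(m, n, n_train, sigma=1.0, trials=200, seed=0):
    """Monte Carlo estimate of E[(test MSE - sigma^2)^2] for one split size.

    Assumed stand-in for the integrity metric; the paper's own
    definition may differ.
    """
    # Re-seeding per call means every split size sees the same simulated data.
    rng = np.random.default_rng(seed)
    gaps = np.empty(trials)
    for t in range(trials):
        # Assumption: identity covariance and a random true parameter vector.
        X = rng.standard_normal((m, n))
        beta = rng.standard_normal(n)
        y = X @ beta + sigma * rng.standard_normal(m)

        # First n_train rows train the model, the rest test it.
        X_tr, y_tr = X[:n_train], y[:n_train]
        X_te, y_te = X[n_train:], y[n_train:]

        # Ordinary least squares fit on the training rows.
        beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

        test_mse = np.mean((y_te - X_te @ beta_hat) ** 2)
        gaps[t] = (test_mse - sigma**2) ** 2
    return gaps.mean()


def best_train_size(m, n, **kwargs):
    """Brute-force search over training-set sizes for the smallest gap."""
    # Keep the regression overdetermined and leave at least one test point.
    candidates = range(n + 2, m)
    scores = {k: integrity_gap(m, n, k, **kwargs) for k in candidates}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    k, _ = best_train_size(m=60, n=5, trials=200)
    print(f"simulated best training-set size for m=60, n=5: {k}")
```

A brute-force sweep like this only approximates the optimum and is noisy for small `trials`; its value is as a sanity check against which a closed-form training-set size, such as the root described in the abstract, could be compared.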

Availability of Data and Materials

The data is publicly available at the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/, and was originally analyzed by [21,22,23,24,25,26].

References

1. Larsen J, Goutte C (1999) On optimal data split for generalization estimation and model selection. Proceedings of the IEEE workshop on neural networks for signal processing IX. IEEE, pp 225–234. https://doi.org/10.1109/NNSP.1999.788141
2. Picard RR, Berk KN (1990) Data splitting. Am Stat 44:140–147
3. Afendras G, Markatou M (2019) Optimality of training/test size and resampling effectiveness in cross-validation. J Stat Plan Inference 199:286–301
4. Guyon I (1997) A scaling law for the validation-set training-set size ratio. AT&T Bell Laboratories, pp 1–11
5. Guyon I, Makhoul J, Schwartz R, Vapnik V (1998) What size test set gives good error rate estimates? IEEE Trans Pattern Anal Mach Intell 20:52–64
6. Kearns M (1997) A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split. Neural Comput 9:1143–1161
7. Dumitriu I, Edelman A (2002) Matrix models for beta ensembles. J Math Phys 43:5830–5847
8. Selberg A (1944) Remarks on a multiple integral. Norsk Mat Tidsskr 26:71–78
9. Aomoto K (1987) On the complex Selberg integral. Q J Math 38:385–399
10. Fyodorov YV, Le Doussal P (2016) Moments of the position of the maximum for GUE characteristic polynomials and for log-correlated Gaussian processes. J Stat Phys 164:190–240
11. Muirhead RJ (1982) Aspects of multivariate statistical theory. Wiley Series in Probability and Statistics, Hoboken, New Jersey
12. Lippert RA (2003) A matrix model for the \(\beta\)-Jacobi ensemble. J Math Phys 44:4807–4816
13. Killip R, Nenciu I (2004) Matrix models for circular ensembles. Int Math Res Not 2004:2665–2701
14. Andrews GE, Askey R, Roy R (1999) Special functions, encyclopedia of mathematics and its applications, 71. Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK
15. Savin DV, Sommers H-J (2006) Shot noise in chaotic cavities with an arbitrary number of open channels. Phys Rev B 73:081307
16. Sommers H-J, Wieczorek W, Savin DV (2007) Statistics of conductance and shot-noise power for chaotic cavities. Acta Phys Pol A 112:691–697
17. Savin DV, Sommers H-J, Wieczorek W (2008) Nonlinear statistics of quantum transport in chaotic cavities. Phys Rev B 77:125332
18. Novaes M (2011) Asymptotics of Selberg-like integrals by lattice path counting. Ann Phys 326:828–838
19. Forrester PJ (2022) Joint moments of a characteristic polynomial and its derivative for the circular \(\beta\)-ensemble. Prob Math Phys 3:145–170
20. Mezzadri F, Reynolds AK, Winn B (2017) Moments of the eigenvalue densities and of the secular coefficients of \(\beta\)-ensembles. Nonlinearity 30:1034
21. Coraddu A, Oneto L, Ghio A, Savio S, Anguita D, Figari M (2016) Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Proceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime Environment 230:136–153
22. Cho D, Yoo C, Im J, Cha D-H (2020) Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth Space Sci 7:e2019EA000740
23. Chicco D, Jurman G (2020) Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20:16
24. Abid F et al (2019) Predicting forest fire in Algeria using data mining techniques: Case study of the decision tree algorithm. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD 2019), Marrakech, Morocco
25. Rafiei MH, Adeli H (2015) A novel machine learning model for estimation of sale prices of real estate units. J Constr Eng Manag 142:04015066
26. Santos MS, Abreu PH, García-Laencina PJ, Simão A, Carvalho A (2015) A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J Biomed Inform 58:49–59
27. Wigner EP (1955) Characteristic vectors of bordered matrices with infinite dimensions. Ann Math 62:548–564
28. Marchenko VA, Pastur LA (1967) Distribution of the eigenvalues in certain sets of random matrices. Matematicheskii Sbornik 72:507–536
29. Wachter KW (1978) The strong limits of random matrix spectra for sample matrices of independent elements. Ann Probab 6:1–18

Acknowledgements

The author is grateful for the help of two anonymous editors who improved the quality of this paper substantially.

Funding

None.

Author information

Authors and Affiliations

400 Central Park West, New York, NY, 10025, USA
Alexander Dubbs

Contributions

The author is the sole author of this work.

Corresponding author

Correspondence to Alexander Dubbs.

Ethics declarations

Ethics Approval

Not applicable.

Competing Interests

The author declares no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dubbs, A. Test Set Sizing via Random Matrix Theory. Oper. Res. Forum 5, 17 (2024). https://doi.org/10.1007/s43069-024-00292-1

Received: 15 July 2023
Accepted: 06 January 2024
Published: 19 February 2024
DOI: https://doi.org/10.1007/s43069-024-00292-1
