Test Set Sizing via Random Matrix Theory

  • Research
  • Published:
Operations Research Forum

Abstract

This paper uses techniques from Random Matrix Theory to find the ideal training/test data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. "Ideal" is defined by an integrity metric: the empirical model error equals the actual measurement noise, and so fairly reflects the model's value, or lack thereof. This paper is the first to solve for the training and test set sizes of any model in a way that is truly optimal. The optimal number of training points is the root of a quartic polynomial, derived in Theorem 1, that depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise all drop out of the calculation. The critical mathematical difficulty was recognizing that the relevant quantities arise in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, over which many integrals, in the style of Selberg and Aomoto, are known. The mathematical results are supported by thorough computational evidence. This paper is a step toward automatic choices of training/test set sizes in machine learning.
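The trade-off the paper optimizes can be probed numerically. The sketch below is not the paper's derivation (the quartic polynomial of Theorem 1 is not reproduced here); it is a minimal Monte Carlo illustration, with assumed values of m, n, and the noise level sigma, of how the mean test-set error of an ordinary-least-squares fit varies with the training-set size t, for comparison against the true noise variance sigma².

```python
import numpy as np

def expected_test_mse(m, n, t, sigma=0.5, trials=200, seed=0):
    """Monte Carlo estimate of the mean test-set MSE of an OLS fit
    trained on t of m points (illustrative only, not the paper's method).

    Each of the m rows of X is an independent n-dimensional standard
    Gaussian; y = X @ beta + noise, with noise ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((m, n))       # i.i.d. Gaussian data points
        beta = rng.standard_normal(n)         # arbitrary true parameters
        y = X @ beta + sigma * rng.standard_normal(m)
        Xtr, ytr = X[:t], y[:t]               # training set, size t
        Xte, yte = X[t:], y[t:]               # test set, size m - t
        bhat, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
        total += np.mean((Xte @ bhat - yte) ** 2)
    return total / trials

m, n, sigma = 200, 5, 0.5
for t in (20, 50, 100, 150):
    mse = expected_test_mse(m, n, t, sigma)
    print(f"t={t:3d}  mean test MSE={mse:.4f}  noise variance={sigma**2:.4f}")
```

As t grows the mean test error falls toward sigma², but the test set shrinks and the error estimate becomes noisier; the paper's integrity metric makes this trade-off precise, and Theorem 1's quartic picks the t that resolves it.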

Figures 1–4 (not shown).

Availability of Data and Materials

The data is publicly available at the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/, and was originally analyzed by [21–26].

References

  1. Larsen J, Goutte C (1999) On optimal data split for generalization estimation and model selection. In: Proceedings of the IEEE workshop on neural networks for signal processing IX. IEEE, pp 225–234. https://doi.org/10.1109/NNSP.1999.788141

  2. Picard RR, Berk KN (1990) Data splitting. Am Stat 44:140–147

  3. Afendras G, Markatou M (2019) Optimality of training/test size and resampling effectiveness in cross-validation. J Stat Plan Inference 199:286–301

  4. Guyon I (1997) A scaling law for the validation-set training-set size ratio. AT&T Bell Laboratories, pp 1–11

  5. Guyon I, Makhoul J, Schwartz R, Vapnik V (1998) What size test set gives good error rate estimates? IEEE Trans Pattern Anal Mach Intell 20:52–64

  6. Kearns M (1997) A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split. Neural Comput 9:1143–1161

  7. Dumitriu I, Edelman A (2002) Matrix models for beta ensembles. J Math Phys 43:5830–5847

  8. Selberg A (1944) Remarks on a multiple integral. Norsk Mat Tidsskr 26:71–78

  9. Aomoto K (1987) On the complex Selberg integral. Q J Math 38:385–399

  10. Fyodorov YV, Le Doussal P (2016) Moments of the position of the maximum for GUE characteristic polynomials and for log-correlated Gaussian processes. J Stat Phys 164:190–240

  11. Muirhead RJ (1982) Aspects of multivariate statistical theory. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ

  12. Lippert RA (2003) A matrix model for the \(\beta \)-Jacobi ensemble. J Math Phys 44:4807–4816

  13. Killip R, Nenciu I (2004) Matrix models for circular ensembles. Int Math Res Not 2004:2665–2701

  14. Andrews GE, Askey R, Roy R (1999) Special functions. Encyclopedia of Mathematics and its Applications, vol 71. Cambridge University Press, Cambridge

  15. Savin DV, Sommers H-J (2006) Shot noise in chaotic cavities with an arbitrary number of open channels. Phys Rev B 73:081307

  16. Sommers H-J, Wieczorek W, Savin DV (2007) Statistics of conductance and shot-noise power for chaotic cavities. Acta Phys Pol A 112:691–697

  17. Savin DV, Sommers H-J, Wieczorek W (2008) Nonlinear statistics of quantum transport in chaotic cavities. Phys Rev B 77:125332

  18. Novaes M (2011) Asymptotics of Selberg-like integrals by lattice path counting. Ann Phys 326:828–838

  19. Forrester PJ (2022) Joint moments of a characteristic polynomial and its derivative for the circular \(\beta \)-ensemble. Prob Math Phys 3:145–170

  20. Mezzadri F, Reynolds AK, Winn B (2017) Moments of the eigenvalue densities and of the secular coefficients of \(\beta \)-ensembles. Nonlinearity 30:1034

  21. Coraddu A, Oneto L, Ghio A, Savio S, Anguita D, Figari M (2016) Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Proc Inst Mech Eng Part M: J Eng Marit Environ 230:136–153

  22. Cho D, Yoo C, Im J, Cha D-H (2020) Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth Space Sci 7:e2019EA000740

  23. Chicco D, Jurman G (2020) Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20:16

  24. Abid F et al (2019) Predicting forest fire in Algeria using data mining techniques: case study of the decision tree algorithm. In: Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD 2019), Marrakech, Morocco

  25. Rafiei MH, Adeli H (2015) A novel machine learning model for estimation of sale prices of real estate units. J Constr Eng Manag 142:04015066

  26. Santos MS, Abreu PH, García-Laencina PJ, Simão A, Carvalho A (2015) A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J Biomed Inform 58:49–59

  27. Wigner EP (1955) Characteristic vectors of bordered matrices with infinite dimensions. Ann Math 62:548–564

  28. Marchenko VA, Pastur LA (1967) Distribution of the eigenvalues in certain sets of random matrices. Mat Sb 72:507–536

  29. Wachter KW (1978) The strong limits of random matrix spectra for sample matrices of independent elements. Ann Probab 6:1–18

Acknowledgements

The author is grateful for the help of two anonymous editors who improved the quality of this paper substantially.

Funding

None.

Author information

Contributions

The author is the sole author of this work.

Corresponding author

Correspondence to Alexander Dubbs.

Ethics declarations

Ethics Approval

Not applicable.

Competing Interests

The author declares no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dubbs, A. Test Set Sizing via Random Matrix Theory. Oper. Res. Forum 5, 17 (2024). https://doi.org/10.1007/s43069-024-00292-1
