Abstract
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines “ideal” as satisfying the integrity metric, i.e., the empirical model error equals the actual measurement noise and thus fairly reflects the value of the model, or the lack thereof. This paper is the first to solve for the training and test set sizes for any model in a way that is truly optimal. The number of data points in the training set is a root of a quartic polynomial, derived in Theorem 1, that depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulty was realizing that the problems herein had already been discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model. Many integrals over this ensemble are known, since they are in the style of Selberg and Aomoto. The mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.
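The setting in the abstract can be explored numerically. The sketch below is illustrative only and does not reproduce the quartic of Theorem 1 or the paper's exact integrity metric. It assumes that a split can be scored by the mean squared gap between the test-set mean squared error and the true noise variance, and it estimates that gap by Monte Carlo for every training size. The function name `split_deviation`, the random covariance factor, and the default parameter values are hypothetical choices made for this example.

```python
import numpy as np


def split_deviation(m, n, sigma=1.0, trials=200, seed=0):
    """Estimate E[(test MSE - sigma^2)^2] for each training-set size.

    Assumption: this squared gap stands in for the paper's integrity
    metric; it is not the formula derived in the paper.
    Returns (train_sizes, deviations) as NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical covariance factor and true parameters, drawn once.
    a = rng.standard_normal((n, n))
    beta = rng.standard_normal(n)

    # At least n + 1 points to fit, and at least 1 point left to test.
    train_sizes = np.arange(n + 1, m)
    deviations = np.zeros(len(train_sizes))

    for _ in range(trials):
        # m independent n-dimensional Gaussian rows with covariance a @ a.T.
        x = rng.standard_normal((m, n)) @ a.T
        y = x @ beta + sigma * rng.standard_normal(m)
        for i, p in enumerate(train_sizes):
            # Ordinary least squares on the first p rows.
            coef = np.linalg.lstsq(x[:p], y[:p], rcond=None)[0]
            # Empirical error on the remaining m - p rows.
            test_mse = np.mean((y[p:] - x[p:] @ coef) ** 2)
            deviations[i] += (test_mse - sigma**2) ** 2

    return train_sizes, deviations / trials


if __name__ == "__main__":
    sizes, dev = split_deviation(m=40, n=3)
    print("training size with smallest estimated gap:", sizes[np.argmin(dev)])
```

Rerunning with a different `seed` changes the covariance factor and the true parameters, which gives a rough way to see how sensitive the estimated best split is to them under this assumed criterion.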
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs43069-024-00292-1/MediaObjects/43069_2024_292_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs43069-024-00292-1/MediaObjects/43069_2024_292_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs43069-024-00292-1/MediaObjects/43069_2024_292_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs43069-024-00292-1/MediaObjects/43069_2024_292_Fig4_HTML.png)
References
Larsen J, Goutte C (1999) On optimal data split for generalization estimation and model selection. Proceedings of the IEEE workshop on neural networks for signal processing IX. IEEE, pp 225–234. https://doi.org/10.1109/NNSP.1999.788141
Picard RR, Berk KN (1990) Data splitting. Am Stat 44:140–147
Afendras G, Markatou M (2019) Optimality of training/test size and resampling effectiveness in cross-validation. J Stat Plan Inference 199:286–301
Guyon I (1997) A scaling law for the validation-set training-set size ratio. AT&T Bell Laboratories, pp 1–11
Guyon I, Makhoul J, Schwartz R, Vapnik V (1998) What size test set gives good error rate estimates? IEEE Trans Pattern Anal Mach Intell 20:52–64
Kearns M (1997) A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split. Neural Comput 9:1143–1161
Dumitriu I, Edelman A (2002) Matrix models for beta ensembles. J Math Phys 43:5830–5847
Selberg A (1944) Remarks on a multiple integral. Norsk Mat Tidsskr 26:71–78
Aomoto K (1987) On the complex Selberg Integral. Q J Math 38:385–399
Fyodorov YV, Le Doussal P (2016) Moments of the position of the maximum for GUE characteristic polynomials and for log-correlated Gaussian processes. J Stat Phys 164:190–240
Muirhead RJ (1982) Aspects of multivariate statistical theory. Wiley Series in Probability and Statistics, Hoboken, New Jersey
Lippert RA (2003) A matrix model for the \(\beta\)-Jacobi ensemble. J Math Phys 44:4807–4816
Killip R, Nenciu I (2004) Matrix models for circular ensembles. Int Math Res Not 2004:2665–2701
Andrews GE, Askey R, Roy R (1999) Special functions, encyclopedia of mathematics and its applications, 71. Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK
Savin DV, Sommers H-J (2006) Shot noise in chaotic cavities with an arbitrary number of open channels. Phys Rev B 73:081307
Sommers H-J, Wieczorek W, Savin DV (2007) Statistics of conductance and shot-noise power for chaotic cavities. Acta Phys Pol A 112:691–697
Savin DV, Sommers H-J, Wieczorek W (2008) Nonlinear statistics of quantum transport in chaotic cavities. Phys Rev B 77:125332
Novaes M (2011) Asymptotics of Selberg-like integrals by lattice path counting. Ann Phys 326:828–838
Forrester PJ (2022) Joint moments of a characteristic polynomial and its derivative for the circular \(\beta\)-ensemble. Prob Math Phys 3:145–170
Mezzadri F, Reynolds AK, Winn B (2017) Moments of the eigenvalue densities and of the secular coefficients of \(\beta\)-ensembles. Nonlinearity 30:1034
Coraddu A, Oneto L, Ghi A, Savio S, Anguita D, Figari M (2016) Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Proceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime Environment 230:136–153
Cho D, Yoo C, Im J, Cha D-H (2020) Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth Space Sci 7:e2019EA000740
Chicco D, Jurman G (2020) Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20:16
Abid F et al (2019) Predicting forest fire in Algeria using data mining techniques: Case study of the decision tree algorithm. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD 2019), Marrakech, Morocco
Rafiei MH, Adeli H (2015) A novel machine learning model for estimation of sale prices of real estate units. J Constr Eng Manag 142:04015066
Santos MS, Abreu PH, García-Laencina PJ, Simão A, Carvalho A (2015) A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J Biomed Inform 58:49–59
Wigner EP (1955) Characteristic vectors of bordered matrices with infinite dimensions. Ann Math 62:548–564
Marchenko VA, Pastur LA (1967) Distribution of the eigenvalues in certain sets of random matrices. Matematicheskii Sbornik 72:507–536
Wachter KW (1978) The strong limits of random matrix spectra for sample matrices of independent elements. Ann Probab 6:1–18
Acknowledgements
The author is grateful for the help of two anonymous editors who improved the quality of this paper substantially.
Funding
None.
Author information
Contributions
The author is the sole author of this work.
Ethics declarations
Ethics Approval
Not applicable.
Competing Interests
The author declares no competing interests.
About this article
Cite this article
Dubbs, A. Test Set Sizing via Random Matrix Theory. Oper. Res. Forum 5, 17 (2024). https://doi.org/10.1007/s43069-024-00292-1
DOI: https://doi.org/10.1007/s43069-024-00292-1