Abstract
This chapter presents regularization and selection methods for linear and nonlinear (parametric) models. These are important Machine Learning techniques because they target three distinct objectives: (1) improving prediction; (2) identifying and estimating models for causal inference in high-dimensional data settings; (3) detecting feature importance. The chapter starts by presenting model selection as a tool for improving prediction accuracy and for model identification and estimation with high-dimensional data. It then addresses regularized linear models, focusing on the Lasso, Ridge, and Elastic-net estimators. Next, it covers regularized nonlinear models, which extend the linear ones to generalized linear models (GLMs). Subsequently, it illustrates optimal subset selection algorithms, which are purely computational approaches to optimal modeling and feature-importance extraction. After delving into the statistical properties of regularized regression, the chapter discusses causal inference in high-dimensional settings with both exogenous and endogenous treatments. The applied part of the chapter is fully dedicated to the Stata, R, and Python implementations of the methods presented in the theoretical part.
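To make the three penalties concrete: the Elastic-net minimizes the residual sum of squares plus a penalty of the form λ[α‖β‖₁ + (1 − α)‖β‖₂²], which reduces to the Lasso at α = 1 and to Ridge at α = 0. The following is a minimal illustrative sketch in Python, not the chapter's own code, assuming scikit-learn is available and using synthetic sparse data; the chapter's applied part provides its own Stata, R, and Python implementations.

    # Minimal sketch (assumes scikit-learn); illustrative only,
    # not the chapter's own code.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

    # Synthetic sparse design: 100 observations, 50 regressors,
    # only 5 of which carry signal.
    X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                           noise=10.0, random_state=0)

    # Each estimator selects its penalty strength by cross-validation.
    lasso = LassoCV(cv=5).fit(X, y)                           # pure l1: shrinks and selects
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)  # pure l2: shrinks only
    enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)         # l1/l2 mixture

    # The Lasso's exact zeros expose the selected support directly.
    print("Lasso keeps", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1], "regressors")

Because the Ridge penalty never produces exact zeros, only the Lasso and Elastic-net fits perform variable selection; this distinction is developed formally in the chapter.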
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Cerulli, G. (2023). Model Selection and Regularization. In: Fundamentals of Supervised Machine Learning. Statistics and Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-41337-7_3
DOI: https://doi.org/10.1007/978-3-031-41337-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41336-0
Online ISBN: 978-3-031-41337-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)