Model Selection and Regularization

Fundamentals of Supervised Machine Learning

Part of the book series: Statistics and Computing (SCO)


Abstract

This chapter presents regularization and selection methods for linear and nonlinear (parametric) models. These are important machine learning techniques, as they target three distinct objectives: (1) improving prediction; (2) model identification and causal inference in high-dimensional data settings; and (3) feature-importance detection. The chapter begins by motivating model selection as a tool for improving prediction accuracy and for model identification and estimation in high-dimensional data settings. It then addresses regularized linear models, focusing on the Lasso, Ridge, and Elastic-net estimators. Next, it covers regularized nonlinear models, which extend the linear ones to generalized linear models (GLMs). It subsequently illustrates optimal subset selection algorithms, which are purely computational approaches to optimal modeling and feature-importance extraction. After delving into the statistical properties of regularized regression, the chapter discusses causal inference in high-dimensional settings with both exogenous and endogenous treatments. The applied part of the chapter is fully dedicated to Stata, R, and Python implementations of the methods presented in the theoretical part.
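To make the applied side concrete, the following is a minimal illustrative sketch in Python (one of the three languages covered by the chapter), fitting Lasso, Ridge, and Elastic-net regressions with cross-validated penalties via scikit-learn. It is not the chapter's own code: the simulated dataset and all parameter choices are assumptions made purely for illustration.

```python
# Illustrative sketch (not the chapter's code): cross-validated Lasso, Ridge,
# and Elastic-net on simulated sparse data, using scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split

# Simulated data: 200 observations, 50 features, only 5 of which
# truly enter the outcome equation (sparse signal).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Penalty parameters chosen by k-fold cross-validation on the training set.
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5,
                    random_state=0).fit(X_train, y_train)

for name, model in [("Lasso", lasso), ("Ridge", ridge), ("Elastic-net", enet)]:
    n_selected = np.sum(model.coef_ != 0)  # Ridge keeps all coefficients
    print(f"{name:12s} test R^2 = {model.score(X_test, y_test):.3f}, "
          f"nonzero coefficients = {n_selected}")
```

The contrast in the printed output reflects the behavior discussed in the chapter: the L1 penalty (Lasso, and partly Elastic-net) sets some coefficients exactly to zero and thus performs variable selection, whereas the L2 penalty (Ridge) only shrinks coefficients toward zero.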



Author information

Correspondence to Giovanni Cerulli.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Cerulli, G. (2023). Model Selection and Regularization. In: Fundamentals of Supervised Machine Learning. Statistics and Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-41337-7_3
