Random Forests

Applied Statistical Learning

Part of the book series: Statistics and Computing (SCO)

Abstract

Bagging refers to fitting a learning algorithm on bootstrap samples and aggregating the results. A random forest performs bagging of trees and, in addition, considers only a random subset of the x-variables at each split. This promotes the use of a larger number of x-variables and makes the algorithm less dependent on a small number of variables. For any one tree, roughly one third of the observations are not in the bootstrap sample and form an out-of-bag sample. For a given tree, the out-of-bag sample can be used as a validation sample, giving the algorithm the unique ability to tune parameters without a separate validation sample. This is particularly useful when the available training data are limited. A case study predicts the math achievement of Portuguese high school students.
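
To make the out-of-bag idea concrete, the following is a minimal sketch in Python with scikit-learn (the chapter itself works with the Stata command rforest); the simulated data and the candidate values for the number of variables tried per split are illustrative assumptions. Each candidate forest is scored on its out-of-bag observations, so no separate validation sample is needed.

# Minimal sketch (scikit-learn, not the chapter's rforest command): choose
# the number of x-variables considered at each split by out-of-bag (OOB) error.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Simulated regression data; purely illustrative.
X, y = make_regression(n_samples=500, n_features=20, noise=1.0, random_state=0)

best = None
for m in (2, 5, 10, 20):              # candidate x-variables per split
    rf = RandomForestRegressor(
        n_estimators=500,
        max_features=m,               # random subset of x-variables per split
        oob_score=True,               # score each observation on the trees
                                      # whose bootstrap sample excluded it
        random_state=0,
    )
    rf.fit(X, y)
    if best is None or rf.oob_score_ > best[1]:
        best = (m, rf.oob_score_)

print(f"OOB-selected variables per split: {best[0]} (OOB R^2 = {best[1]:.3f})")

Because every tree leaves out roughly a third of the observations, this selection reuses the training data without ever scoring a tree on observations it was fit on.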

The wisdom of tree crowds


Notes

  1. Rerunning rforest for different numbers of iterations is inefficient. Data for such a plot can in principle be generated from a single run of the random forest algorithm (see the sketch after these notes); however, the WEKA Java plugin that rforest calls does not support that functionality.

  2. In linear regression, the denominator of the MSE is not n but \(n-p\), where p is the number of estimated parameters: the error variance is estimated by \(\sum_i (y_i - \hat{y}_i)^2/(n-p)\) rather than \(\sum_i (y_i - \hat{y}_i)^2/n\). This makes only a small difference when p is small relative to n, and we ignore it here.
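
The single-run idea in note 1 can be sketched outside of rforest: scikit-learn's warm_start option grows one forest incrementally, so the out-of-bag error can be recorded at every forest size from a single fitting process. The simulated data and the grid of tree counts are illustrative assumptions.

# Sketch of note 1's single-run idea (scikit-learn, not rforest/WEKA):
# grow one forest incrementally and record the OOB error at each size,
# instead of refitting a new forest for every number of iterations.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=1.0, random_state=0)

# warm_start=True keeps the trees already fit; each call to fit() below
# adds only the new trees up to the current n_estimators.
rf = RandomForestRegressor(warm_start=True, oob_score=True, random_state=0)

oob_curve = []
for n_trees in range(25, 501, 25):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)                                      # fits the added trees only
    oob_curve.append((n_trees, 1.0 - rf.oob_score_))  # OOB error as 1 - R^2

for n_trees, err in oob_curve:
    print(f"{n_trees:4d} trees: OOB error {err:.3f}")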



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Schonlau, M. (2023). Random Forests. In: Applied Statistical Learning. Statistics and Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-33390-3_10

