Abstract
Bagging refers to fitting a learning algorithm on bootstrap samples and aggregating the results. A random forest bags trees and, in addition, considers only a random subset of the x-variables at each split. This promotes the use of a larger number of x-variables and makes the algorithm less dependent on a small number of them. For any one tree, roughly one third of the observations are not in the bootstrap sample; they form the out-of-bag sample. For a given tree, the out-of-bag sample can serve as a validation sample, giving the algorithm the unique ability to tune parameters without a separate validation sample. This is particularly useful when the available training data are limited. A case study predicts the math achievement of Portuguese high school students.
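To make these ideas concrete, here is a minimal Stata sketch using the user-written rforest command mentioned in the notes. The outcome g3 (a math grade) and the predictors x1-x10 are placeholder names, not the chapter's actual variables, and the sketch assumes the option names (type(), iterations(), numvars(), seed()) and the stored result e(OOB_Error) documented for rforest.

    * The command is user written; install it once with: ssc install rforest
    * Fit a regression forest: 500 bagged trees, with 3 randomly chosen
    * x-variables considered at each split. g3 and x1-x10 are placeholders.
    rforest g3 x1-x10, type(reg) iterations(500) numvars(3) seed(12345)

    * The out-of-bag error doubles as a validation estimate, so no
    * separate validation sample is needed.
    display "Out-of-bag error: " e(OOB_Error)

    * Predictions for the data in memory.
    predict g3_hat

Because each tree sees only a bootstrap sample, the roughly one third of observations left out of that sample supply the out-of-bag error displayed above; tuning numvars() then amounts to refitting and comparing e(OOB_Error) across candidate values.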
Notes
- 1.
Rerunning rforest for different numbers of iterations is inefficient (see the sketch after these notes). Data for such a plot can in principle be generated from a single run of a random forest algorithm; however, the WEKA Java plugin that rforest calls does not support that functionality.
- 2.
In linear regression, the denominator of the MSE is not \(n\) but \(n-p\), where \(p\) is the number of parameters estimated (written out after these notes). This usually makes only a small difference when \(p\) is small relative to \(n\), and we ignore it here.
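The rerun approach described in Note 1 can be sketched as follows: rforest is refit at a grid of iteration counts and the out-of-bag error is recorded each time. Variable names are again placeholders, and the sketch assumes rforest stores the out-of-bag error in e(OOB_Error).

    * Record the out-of-bag error for a grid of forest sizes
    * (inefficient: each run regrows the forest from scratch).
    tempname sim
    postfile `sim' iterations oob_error using oob_curve, replace
    forvalues iter = 50(50)500 {
        rforest g3 x1-x10, type(reg) iterations(`iter') seed(12345)
        post `sim' (`iter') (e(OOB_Error))
    }
    postclose `sim'

    * Plot the error curve against the number of trees.
    use oob_curve, clear
    line oob_error iterations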
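And the two variants of the mean squared error contrasted in Note 2, with \(y_i\) the observed and \(\hat{y}_i\) the predicted outcome for observation \(i\):

\[
\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad \text{versus} \qquad
\frac{1}{n-p} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 .
\]

When \(p\) is small relative to \(n\), the two differ only by the factor \(n/(n-p) \approx 1\), which is why the distinction is ignored here.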
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Schonlau, M. (2023). Random Forests. In: Applied Statistical Learning. Statistics and Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-33390-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33389-7
Online ISBN: 978-3-031-33390-3
eBook Packages: Mathematics and Statistics (R0)