Introduction

Breiman developed the idea of bootstrap aggregation (bagging) models1, commonly used with bootstrap averages of tree models, as a way of flexibly modeling data. Bootstrap averaging is a way of reducing the prediction variance of single tree models. However, correlations between trees imply that there are limits to the reduction in prediction error achieved by increasing the number of trees. The random forest (RF) model was developed by Breiman2 as a way of reducing correlation between bootstrapped trees, by limiting the number of variables used for splitting at each tree node. RF models often achieve much lower prediction error than bagging models. RF models have proved a straightforward machine learning method, much used because of their ability to provide accurate predictions for large and complex datasets and their availability in many software packages. The semi-parametric model is determined by three user-specified parameters, one of the more critical being the stopping criterion for node splitting, the minimum node size of each potential parent node. The node size regulates the model complexity of each tree in the forest and has implications for the statistical performance of the algorithm. In a recent paper Arsham et al.3 proposed using the size of the offspring nodes as the stopping criterion, and showed in a series of simulation studies circumstances in which performance over a standard RF model could be improved in this way.
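The limit that tree correlation places on bagging can be made explicit with a standard textbook identity, given here as an illustrative sketch rather than a result of this paper. Assume, purely for illustration, that the \(B\) tree predictions \(T_b(x)\) at a given point are identically distributed with variance \(\sigma^2\) and a common non-negative pairwise correlation \(\rho\) (these symbols are introduced only for this example). The variance of their average is then

\[ \operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 . \]

As \(B\) increases the second term vanishes but the first does not, so under these assumptions it is the reduction of \(\rho\), which RF pursues by restricting the variables considered at each split, that lowers the attainable variance.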

The original RF algorithm by Breiman2 used the minimum size of the parent node to limit tree growth. This implementation of the RF algorithm has been utilized in several packages including the randomForest4 and ranger5 packages; ranger5 appears to be among the most efficient implementations of the standard RF algorithm. The problem of how to select the node size in RF models has been much studied in the literature6,7. There are a number of available packages that allow for alternatives to the standard parental node size limit for node splitting. In particular the randomForestSRC8 and the partykit9,10 R packages both allow for splits to be limited by the size of the child nodes.

In this short paper we outline a number of variant types of RF algorithms, generalizations of the RF model developed by Breiman2, which use a number of different criteria for stopping tree expansion, in addition to the canonical ones of Breiman2 and Arsham et al.3. We illustrate fits of the models to the National Health and Nutrition Examination Survey (NHANES) data and four other datasets, the Tasmanian Abalone data, the Boston Housing crime rate data, the Los Angeles ozone concentration data, and the MIT servo data; these last four datasets are all as used in the paper of Breiman2. Further description of the data is given in Table 1.

Table 1 Description of five datasets fitted.

Results

As can be seen from Table 2 and Fig. 1, for the NHANES, Tasmanian Abalone and Los Angeles Ozone datasets the default (parent node size) tree-expansion limitation yields the lowest mean square prediction error (MSPE), although in all cases the MSPE is very close for most other tree-expansion limitation statistics. In particular the MSPE using leaf-node limitation is within 2% of that for parent-node limitation. However, for the Boston Housing data leaf-node limitation yields an MSPE that is substantially better, by about 4%, than parent-node limitation, and indeed any other method of tree limitation. For the MIT servo data the MSPE using 25–75% intercentile range limitation is substantially better than any other; the only other method that works nearly as well uses the 10–90% intercentile range. All other methods of tree-expansion limitation, in particular both leaf-node and parent-node methods, have MSPE that is at least 15% larger (Table 2). In general, use of the two intercentile range statistics (intercentile 10–90% range, intercentile 25–75% range) to control tree expansion yields much less variation in MSPE; in particular, using the 25–75% range, the MSPE exceeds that of the best tree-expansion method for each dataset by no more than 5% (Fig. 1).

Table 2 Measures of goodness of fit (mean square cross-validated test error) to glycohemoglobin percentage, estimated from hold-out test set (2017–2018 NHANES data) associated with fit of random forest model fit to 2015–2016 NHANES data, and similar measures of goodness of fits to Tasmanian Abalone data, Boston Housing data, Los Angeles Ozone data and MIT Servo data.
Figure 1

Percentage increase in mean square predictive error (MSPE) for each stopping rule over the tree expansion rule yielding lowest MSPE, for each dataset.
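The quantity plotted in Fig. 1 is a simple relative comparison against the best rule for a dataset. A minimal sketch of that calculation follows; the function name and the MSPE values are hypothetical and chosen only for illustration, and are not taken from Table 2.

```python
from typing import Dict


def pct_increase_over_best(mspe_by_rule: Dict[str, float]) -> Dict[str, float]:
    """Percentage by which each rule's MSPE exceeds the lowest MSPE."""
    best = min(mspe_by_rule.values())
    return {rule: 100.0 * (m - best) / best for rule, m in mspe_by_rule.items()}


# Made-up MSPE values for one dataset (illustration only, not from Table 2).
example = {"parent_node": 1.00, "leaf_node": 1.02, "icr_25_75": 1.04}
increase = pct_increase_over_best(example)
```

By construction the best rule maps to 0, so a statement such as "within 5% of the best method" corresponds to every value for that rule, across datasets, being at most 5.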

Discussion

We have presented a number of alternative tree-expansion stopping rules for RF models. It appears that for some datasets, in particular the NHANES, Tasmanian Abalone and Los Angeles Ozone data, the new types of stopping rules that we fit have very similar MSPE to the standard stopping rules normally used by RF models (Table 2, Fig. 1). However, for two other datasets, the Boston Housing and MIT Servo data, it is clear that two particular variant stopping rules fit substantially better than the standard RF model (Table 2, Fig. 1). In general, use of the intercentile 25–75% range statistic to control tree expansion yields much less variation in MSPE, and MSPE that is also closer to the optimal. The MSPE for this measure exceeds that of the best tree-expansion method for each dataset by no more than 5% (Fig. 1).

One of the parameters in the RF algorithm is the minimum size of the node below which the node would remain unsplit. This is very commonly available in implementations of the RF algorithm, in particular in the randomForest package4. The problem of how to select the node size in RF models is much studied in the literature. In particular Probst et al.7 review the topic of hyperparameter tuning in RF models, with a subsection dedicated to the choice of terminal node size. This has also been discussed from a more theoretical point of view in a related article by Probst et al.6. As Probst et al. document, the optimal node size is often quite small, and in many packages the default is set to 1 for classification trees and 5 for regression trees7. There are a number of packages available that allow for alternatives to the standard parental node size limit for node splitting. In particular the randomForestSRC8 and the partykit9,10 R packages both allow for splits to be limited by the size of the offspring node. As far as we are aware no statistical package uses the range, variance or centile range based limits demonstrated here. It should be noted that limits on parental and offspring node size are not equivalent. While it is obviously the case that if the offspring node size is at least \(n\) then the parental node size must be at least \(2n\), the reverse is clearly not the case. For example, the candidate splits of a particular node of size \(2n\) would in general produce offspring nodes of sizes \(1,2,\ldots,n - 1,n,n + 1,\ldots,2n - 1\). Were one to insist on terminal nodes being of size at least \(n\) then only the split into two nodes each of size \(n\) would be considered, whereas without restriction on the size of the terminal nodes potential candidates would in general include nodes of size \(1,2,\ldots,n - 1,n + 1,\ldots,2n - 1\) also, although the splitting variables might not in general allow all these to occur.
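The non-equivalence of the two limits can be checked by enumerating the child sizes each one permits. The sketch below is illustrative only: the function and its arguments (min_parent, min_leaf) are hypothetical names, and it assumes that every split size is attainable, which the splitting variables may not allow in practice.

```python
from typing import List, Optional, Tuple


def admissible_child_sizes(
    parent_size: int,
    min_parent: Optional[int] = None,
    min_leaf: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """List (left, right) child sizes allowed under the given size limits."""
    # Parent-size limit: a node smaller than min_parent is not split at all.
    if min_parent is not None and parent_size < min_parent:
        return []
    pairs = []
    for left in range(1, parent_size):
        right = parent_size - left
        # Offspring-size limit: both children must have at least min_leaf cases.
        if min_leaf is not None and min(left, right) < min_leaf:
            continue
        pairs.append((left, right))
    return pairs


n = 5
by_parent_rule = admissible_child_sizes(2 * n, min_parent=2 * n)
by_leaf_rule = admissible_child_sizes(2 * n, min_leaf=n)
```

With n = 5, the parent-size limit leaves all 2n − 1 = 9 candidate splits of a node of size 10 available, whereas the offspring-size limit leaves only the even split into two nodes of size 5.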

Numerous variants of the RF model have been created, many with implementations in R software. For example, quantile regression RF, introduced by Meinshausen11, combines quantile regression with random forests, with an implementation provided in the quantregForest package. Garge et al.12 implemented a model-based partitioning of the feature space, and developed the associated R software mobForest (although this has now been removed from the CRAN archive). Seibold et al.13 also used recursive partitioning RF models, which were fitted to amyotrophic lateral sclerosis data. Seibold et al. have also developed software for fitting such models, in the R model4you package14. Segal and ** rules with specific application to regression trees. However, the basic idea would obviously easily carry over to classification trees, using for example the Gini or cross-entropy loss functions.

Methods

Data

The NHANES data that we use comprises data for the 2015–2016 and 2017–2018 screening samples, the former used to train the RF and the latter as test set. There are n = 4292 individuals in the 2015–2016 data, and n = 4051 individuals in the 2017–2018 data. A total of 19 descriptive variables (features) are used in the model, with laboratory glycohemoglobin percentage as the outcome variable, a continuous measure. The population weights given in these two datasets are used to weight mean square error (MSE). The version of the NHANES data is exactly as used in the paper of Arsham et al.3. We also employ four other datasets, the Tasmanian Abalone data, the Boston Housing crime rate data, the Los Angeles ozone concentration data, and the MIT servo data; these last four datasets are all as used in the paper of Breiman2. A description of all these datasets is given in Table 1. The five datasets are all given in Supplement S1.
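As an illustration of weighting the mean square error by population weights, a minimal sketch is given below. The normalization by the sum of the weights is an assumption made for this example, since the exact form of the weighted MSE is not spelled out above, and the function name is hypothetical.

```python
from typing import Sequence


def weighted_mse(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted mean square error, normalized by the sum of the weights (assumed)."""
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return squared / total_weight


# Toy values: the second observation carries three times the weight of the first.
value = weighted_mse([1.0, 2.0], [1.0, 4.0], [1.0, 3.0])
```

With equal weights this reduces to the ordinary MSE, so the same function can serve for datasets without survey weights.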

Statistical methods

There are few adjustable parameters in the standard RF algorithm2, specifically the number of trees (i.e. the number of bootstrap samples, ntree), the number of variables sampled per node (mtry) used to determine the growth of the tree, and the maximum number of nodes per tree (maxnodes). The version of the algorithm that we have implemented incorporates a number of additional parameters that determine whether tree generation is halted, specifically:

(a) The proportion of the total variance (in the total dataset) of the outcome variable in a given node used to determine whether to stop the further development of the tree from that node downwards;

(b) The proportion of the total range (= maximum − minimum) (in the total dataset) of the outcome variable in a given node used to determine whether to stop the further development of the tree from that node downwards;

(c) The proportion of the intercentile range [X%, 100 − X%] (in the total dataset) of the outcome variable in a given node used to determine whether to stop the further development of the tree from that node downwards. We used X = 10% and X = 25%;

(d) The minimum number of observations per parent node;

(e) The minimum number of observations per terminal (leaf) node.

The tree generation at a particular node is halted if any of conditions (a)–(e) is triggered. In most implementations of the standard RF model2, for example the R randomForest package4, only criterion (d) is available; in some software, in particular in the randomForestSRC8 and partykit9 R packages, criteria (d) and (e) are available as options. The paper of Arsham et al.3 outlined the use of criterion (e) in the context of regression trees. Table 2 outlines the minimum mean square prediction error (MSPE) obtained using the 2017–2018 NHANES data as test set, with model training via the 2015–2016 data. For all other datasets MSPE was defined via tenfold cross validation. In all cases MSPE was the minimum value using ntree = 1000 trees with maxnodes = 1000. We set the number of variables sampled per node, mtry, to about half the total number of independent variables, so mtry = 10, 4, 7, 5, 2, for the NHANES, Tasmanian Abalone, Boston Housing, Los Angeles Ozone and MIT Servo datasets, respectively.
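The way criteria (a)–(e) combine can be sketched as a single node-level check. This is not the Fortran implementation of Supplement S1; it is a minimal illustration under stated assumptions: a node is halted when its statistic falls below the specified proportion of the same statistic in the full dataset, criterion (e) is approximated by requiring room for two leaves of the minimum size, and percentiles use linear interpolation. All names (StopRules, should_halt and their fields) are hypothetical.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Optional, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100), linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def intercentile_range(values: Sequence[float], x: float) -> float:
    """Width of the [x%, 100 - x%] intercentile range."""
    return percentile(values, 100.0 - x) - percentile(values, x)


@dataclass
class StopRules:
    var_frac: Optional[float] = None    # (a) proportion of total variance
    range_frac: Optional[float] = None  # (b) proportion of total range
    icr_frac: Optional[float] = None    # (c) proportion of intercentile range
    icr_x: float = 25.0                 # X in [X%, 100 - X%]
    min_parent: Optional[int] = None    # (d) minimum parent node size
    min_leaf: Optional[int] = None      # (e) minimum leaf node size


def should_halt(node_y: Sequence[float], all_y: Sequence[float], r: StopRules) -> bool:
    """True if any active criterion halts expansion at this node (assumed semantics)."""
    n = len(node_y)
    if n < 2:
        return True  # nothing left to split
    if r.min_parent is not None and n < r.min_parent:
        return True
    if r.min_leaf is not None and n < 2 * r.min_leaf:
        return True  # simplification: no split can give two large-enough leaves
    if r.var_frac is not None and pvariance(node_y) < r.var_frac * pvariance(all_y):
        return True
    if r.range_frac is not None and (
        max(node_y) - min(node_y) < r.range_frac * (max(all_y) - min(all_y))
    ):
        return True
    if r.icr_frac is not None and (
        intercentile_range(node_y, r.icr_x)
        < r.icr_frac * intercentile_range(all_y, r.icr_x)
    ):
        return True
    return False


all_y = [float(v) for v in range(1, 101)]
node_y = [50.0, 51.0, 52.0, 53.0]
halt_by_range = should_halt(node_y, all_y, StopRules(range_frac=0.05))
halt_by_icr = should_halt(node_y, all_y, StopRules(icr_frac=0.05, icr_x=25.0))
keep_growing = should_halt(node_y, all_y, StopRules(range_frac=0.01))
```

Leaving a field as None disables that criterion, so the standard parent-size rule corresponds to setting only min_parent.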

In all cases the categorical variables are treated simply as numeric (non-categorical) variables. We also performed additional model fits in which we used Breiman’s method of coding categorical variables2, but as these generally yielded inferior model fits, as measured by the minimum MSPE, we do not report these further.

The Fortran 95-2003 code implementing the regression random forest algorithm described above is given in Supplement S1, along with a number of parameter steering files for the five datasets fitted.

Ethics declaration

This study has been approved annually by the National Center for Health Statistics Research Ethics Review Board (ERB), and all methods were performed in accordance with the relevant guidelines and regulations of that ERB. All participants signed a form documenting their informed consent, and participants gave informed consent to storing specimens of their blood for future research.