Keywords

1 Introduction

Legacy supervised machine learning algorithms that are commonly used in pattern recognition tasks include Bayesian Networks, Neural Networks, Support Vector Machines, k-Nearest Neighbours and Decision Trees (DTs) [16]. Although sustained research into each of these individual classification and regression algorithms has led to improved effectiveness, it is widely accepted that ensemble methods generally out perform them [26]. Ensemble methods such as bagging, boosting and Random Forests (RFs) combine the individual outputs of several base learners into a more reliable aggregate committee decision [26]. RFs are ensembles of DTs generated from a bootstrapped training data set [6]; they have gained cross-disciplinary popularity due to their high accuracy rates and simple interpretation [32].

The history of RFs can be traced back to the first breed of DTs: CART [7], ID3 and C4.5 [21], which use a common recursive divide-and-conquer approach to partition the training data set until class homogeneity is achieved. These DT types differ in terms of splitting criteria, types of attributes allowed, type of output provided (regression and/or classification), support for missing values, tree pruning strategy and ability to detect outliers [25]. Tree pruning is normally required to reduce the problem of overfitting the classification model on the given training instances. Although the concept of bagging was one of the first techniques to combine outputs from several random DTs, the resultant trees were found to be highly correlated [26]. RFs seek to improve the effectiveness of bagging by reducing the resemblance between trees and thus, simultaneously reduce variance and bias errors. Bias error reflects how inaccurate, learned models are at capturing the significant trends in the training data while variance error captures how monotonous the models are at labeling training instances [18, p. 311]. Simultaneously low, bias and variance errors, are desirable in ensemble classifiers as they indicate that individual base classifiers are highly accurate but diverseFootnote 1.

An alternative DT-based ensemble technique is boosting, which aims to improve weak learning algorithms through a committee method that gives more focus to incorrectly classified instances [14, 26]. This is done by specifically assigning weights to: (1) individual classifiers in the committee based on training set error and (2) misclassified instances in the training set in order to increase their influence in the next iteration. Boosting has however been found to be less popular than RFs and bagging due to its lack of consistency and low convergence likelihood [26, p. 150]. Furthermore, RFs offer greater opportunity for parallel execution than boosting which has sequential iterations that are dependent on their predecessors.

In DT induction, the criteria used to decide which attribute is the most suitable for partitioning the data set portion at each node, is crucial for achieving high classification accuracy. A study by Raileanu et al. [22] sought to theoretically analyze the two most commonly used splitting criteria/metrics: the gini-index and information gain functions in order to solve the general problem of selecting the most suitable criteria for a given data set. Their findings revealed that the frequency of disagreement between the two criteria is only 2%, thus confirming previously published empirical results which assert that it is impossible to decide on which of the two tests is preferable [22]. This however does not mean, the two criteria can not be optimized individually, neither should it preclude the search for other criteria that may be more effective. Although RFs are an elegant and effective classification technique, there is room for achieving locally optimal attributes [23] and a need for capturing other attribute relations besides conditional interactions [32].

Feature engineering/construction refers to the common practice of creating new features from an existing feature set in a bid to assist classification algorithms to better distinguish between similar previously encountered instances [15]. It is a part of the broader topic of data representation, which seeks to unravel more informative perspectives of a data set by mainly using dimensionality reduction methods such as Linear Discriminant Analysis (LDA) and Principal Component analysis (PCA) [2]. In practise, the new augmented feature set from feature engineering may be used as is or subsequently complemented by a feature selection stage which identifies the most suitable subset of features. Typical feature engineering strategies include computations such as rule based conjunctions, rational differences and polynomial relations based on existing features [15].

This study proposes an optimized gini-index metric and a shape based feature engineering technique as a means towards improving the effectiveness of RFs. A steepend gini-index function is used to replace the conventional gini-index function in order to induce a preferential bias towards probabilities that suggest purity. A novel feature coherence model, based on the shape of a synthetic radial feature contour, is proposed for injecting new attributes to reflect general feature correlation within an instance. The specific question that the study seeks to answer is whether this steepened gini-index splitting function and the proposed shape feature injection can improve the effectiveness of RFs.

The remainder of this paper is structured as follows. Section 2 outlines the RF algorithm in detail and reports on previous work centred on its optimization. Section 3 explains the proposed new methods while Sect. 4 describes the framework used for experimentation. The results of the study are presented in Sect. 5 then Sect. 6 concludes the study and looks at proposed future work.

2 Random Forests

Although RFs also use sampling of training instances with replacement (bootstrapped sampling) like bagging, they introduce additional stochastic behaviour by choosing a random set of predictors (attributes or features) without replacement at each DT node. The conventional RF algorithm (commonly referred to as Forest-RI) can be formally described by Algorithm 1; alternative descriptions can be found in [6, 14, 26].

figure a

The maximum permissible purity of DT nodes, along with the parameters ns and d, can be used as constraints for limiting the size and sensitivity of each DT in the RF. The purity of a node is the highest proportion of any class present in its sample. A choice can be made between the constraints imposed by parameters ns and d, since either constraint can effectively limit tree depth, albeit though different criteria. The most commonly used values in literature for the parameter m are 1, sqrt(M) and \(log_2(M)+1\) [3]. The sample size, n is normally set to N, the size of the training set [6], to yield a sampling ratio of 1. In each of the Ntree iterations, a DT is induced based on the given inputs; thus creating a forest of DTs, \(\{T_t,\ t=1, \dotsc ,Ntree\}\).

Each DT, \(T_t\) can be considered as a classifier, \(\{h(x,\theta _t)\}\) where \({\theta _t}\) is a set of independent but identically distributed stochastic vectors and x is a new instance to be classified [6]. \(\theta _t\) is determined during induction by the set of random samples and features chosen with and without replacement respectively. An individual DT classifier is considered to have perfectly mastered its training set if \(h(x_i,\theta _t) = y_i \ \forall \ (x_i,y_i) \in S_T\). The algorithm terminates when \(L_{nt}\) is empty; at this point all leaf nodes in a given DT are labelled with one of the C possible classifications based on the majority class within their node’s sample. For a given test instance, a RF solicits predictions from each of its Ntree DTs and the ensemble prediction is usually determined by a majority vote.

Although the gini-index was used in the Forest-RI algorithm to determine the attribute value yielding the best split, it has since been found to be weak at identifying strong conditional associations among features [17]. A study by Robnik and Sikonja [23] sought to improve the effectiveness of RFs by using several attribute evaluation measures instead of just one, then aggregating DT votes using the margin achieved on similar out-of-bag instances as a weight. The evaluation heuristics used were: gini-index, gain ratio, Minimum Description Length (MDL), ReliefF or Myopic ReliefF; with each heuristic being applied to its fifth of the trees in a RF. A slight increase in effectiveness is observed when comparing the use of five heuristics against the Gini index alone. This improvement is especially visible on data sets with strong feature dependencies and is attributed to the use of the ReliefF algorithm which manages to decrease DT correlation while retaining prediction strength. A more significant improvement in RF classification accuracy is achieved across several data sets when adopting the new voting strategy.

The pioneering study on RFs by Breiman [6] also proposed a variation in the induction of RFs by using random linear combinations of inputs, a procedure known as Forest-RC. At any given node, L existing numerical features are randomly chosen and added together using random weights in the range [−1,1] to form a new feature. F such features are generated and the best splitting condition is chosen from the range of all possible feature values. Experiments on 19 data sets using \(L=3\) and \(F=2\) or 8 show that although Forest-RC is generally more comparable to Adaboost than Forest-RI, it is not necessarily superior.

Due to the multiple steps in the Forest-RI algorithm, there are several options for improving performance and effectiveness. Improvement in performance can be facilitated through an implementation that exploits parallelization opportunities presented by modern multi-core processors and GPUs like the FastRF, LibRF and the CUDARF algorithms [12]. Improvement in RF effectiveness is generally facilitated by creating a committee of diverse but highly accurate tree based classifiers; a comprehensive survey of such RF variants can be found in [11, 17].

3 Proposed Methods

This study aims to improve the effectiveness of RFs by using an alternative splitting function and a feature construction technique that captures inter-feature relationships. The foundations for these two concepts are elaborated below.

3.1 Split Functions

In the original Forest-RI algorithm, the gini-index [7] is used to obtain the best splitting criteria from the random subset of features that represent the instances internal to each DT node. Since the gini-index is a measure of impurity, it can be used to estimate how well a given splitting condition separates the instances within a node into their different classes. The gini-index (also known as gini impurityFootnote 2) was proposed by Breiman et al. [8] as a splitting criteria for DTs known as Classification and Regression Trees (CART). Although 4 other splitting criteria (symmetric Gini, twoing, ordered twoing and class probability) were also proposed, the Gini index generally performs best. Other impurity measures that can be used as alternatives to the gini-index include entropy and classification error [30]. Because the CART algorithm is generally used to build binary DTs, it is more applicable to binary, numerical and ordinal attributes as their values can be partitioned into 2 groups. For numerical attributes, all possible values of each feature are evaluated as possible thresholds for splitting a node and the value that yields the highest gini-index is chosen.

The Gini index of a DT node can be formulated as follows [11]:

$$\begin{aligned} Gini(S) = 1 - \sum _{i=1}^C{p(i)^2}\text {,} \end{aligned}$$
(1)

which is equivalent to:

$$\begin{aligned} Gini(S) = \sum _{i=1}^C{p(i)*(1 - p(i))}\text {,} \end{aligned}$$
(2)

where S is the sample of instances in a node and C is the number of class labels in the data set. If all the classes in the data set are enumerated from 1 to C, then p(i) is the probability of the \(i^{th}\) class in a particular node. Likewise the entropy and classification error metrics can be formulated as in Eqs. 3 and 4 respectively [24, 30].

$$\begin{aligned} E(S) = -\sum _{i=1}^{N}{p(i)*log(pi)} \text {.} \end{aligned}$$
(3)
$$\begin{aligned} CE(S) = 1 - \max _{1 \le i \le C}{p(i)} \text {.} \end{aligned}$$
(4)

When considering the viability of an attribute splitting condition or value, the chosen impurity metric is calculated for each potential child node resulting from such a split and normalized using the probability of the child node. These normalized total impurity values are then summed up to represent the combined impurity for the splitting condition in question. The splitting condition yielding the lowest normalized total impurity is potentially the best condition as it corresponds to the highest purity in the given context. The general approach in literature is to use the impurity gained at a particular child node relative to its parent as opposed to the absolute impurity of the child node [23, 31]. The generic formulation of this approach for all impurity based measures in the context of binary DTs is as follows [23]:

$$\begin{aligned} \varDelta impurity(a_v) = impurity(S_0) - \sum _{j=1}^{2}{p(S_j|a_v)*impurity(S_j|a_v)}\text {,} \end{aligned}$$
(5)

where a is the attribute to split on using \(a_v\) as the specific condition or value, \(S_0\) is the parent node and \(S_j\) is one of the two child nodes. \(\varDelta impurity\) is normally referred to as impurity gain; when applied to the gini-index it is referred to as gini gain. Despite the availability of other gain splitting functions such as Information Gain and Gain Ratio, the Gini Gain was chosen by Breiman for use in RFs because of its simplicity and effectiveness [23]. Information Gain and Gain Ratio are both based on entropy, with the latter being a normalized derivation of the former. The complexity of Information Gain and Gain Ratio arises mainly from their logarithmic computations.

The approach adopted towards formulating alternative impurity measures was to visually analyse the behaviour of existing metrics (gini-index, entropy and class error) in the context of different probabilities. The assumption made was that the gini-index and entropy were superior to class error [8, 30], with the gini-index being preferred primarily because of its computational simplicity [22]. The task was then to identify or formulate alternative functions that had a similar shape to the gini-index and entropy functions for input in the range [0,1]. Two proposals were made: the Gaussian and Steepened Gini Index (SGI) functions, with the latter proving more effective than the former.

Gaussian Impurity Function. The Gaussian function was proposed due to its graph which is a symmetric bell curve shape that is similar to the graph of the gini-index. It is widely used in statistics to describe normal distributions and can be formulated as follows [13]:

$$\begin{aligned} G(x) = a*e^{-(\frac{(x-b)^2}{2c^2})}\text {,} \end{aligned}$$
(6)

where a is the amplitude, b is the position of the peak within the bell shape and c is the standard deviation which controls the spread of the bell shape. Since the input of an impurity measure is a probability, this Gaussian distribution is over input values between 0 and 1. In this study, the parameter b was set to 0.5 since probabilities at the tails of the distribution are generally indicative of lower impurity while those in the middle are more likely to correspond with diversity. In a statistical normal distribution, about 99.7% of the data values lie within three standard deviation [9], hence the standard deviation is one third of the distance from the mean to either of the tail ends. We thus set parameter c to 0.1667 (which is \(\frac{0.5}{3}\)). After considering a few options (0.125, 0.25, 0.5, and 1), the parameter a was set to 0.5 as this value seemed to yield higher accuracy rates on the sonar UCI data setFootnote 3 [19]. The resulting Gaussian impurity function, shown in Eq. 7, simply replaces class probabilities with corresponding Gaussian distribution outputs.

$$\begin{aligned} Gaussian(S) = \sum _{i=1}^C{G(i)}\text {.} \end{aligned}$$
(7)

Figure 1 shows the behaviour of the Gini index, entropy, classification error, SGI and the proposed Gaussian metric for probabilities encountered in two-class nodes of a dichotomous DTFootnote 4. It can be observed that all metrics are:

  1. 1.

    symmetric, showing that a p(i) vs \(1 - p(i)\) node class split has the same impurity as its inverse, \(1 - p(i)\) vs p(i);

  2. 2.

    maximized at probability 0.5 when classes are equally represented within a node and,

  3. 3.

    minimized for pure nodes which are represented at the tail ends where all metrics give an impurity value of 0 except the Gaussian metric which slightly deviates from the trend with a value of 0.011.

Fig. 1.
figure 1

Impurity functions for two-class node distributions

The Gaussian metric was anticipated to emphasize the difference between the impurity of probabilities in the middle of the distribution and that of those towards the tails; this was expected to favour splits where one of the classes is clearly dominant. Although preliminary experiments on the sonar data set showed the Gaussian metric to be comparable with the gini-index, the former was consistently lower than the latter. Hence, the Gaussian metric was not explored in further experiments and an alternative was sort.

Steepend Gini Index. Our interpretation of the inferiority of the Gaussian metric to the gini-index is based on the fact that split scores are determined using the relative as opposed to absolute impurity of a node. This highlights the importance of comparing the metrics based on gradient in order to model the concept of relative impurity. Since nodes are expected to gain in purity as we move down a DT, it is proposed that a desirable metric would be one that has a steeper negative impurity gradient towards the leaves of the DT. At this stage, a split that leads to greater node purity should be favoured since this node is unlikely to undergo further purification.

This reasoning led us to explore the possibility of modifying the gini-index function such that it yields a steeper gradient towards the two ends of its symmetric shape. After a few attempts at adapting the gini-index function in order to achieve this behaviour, the following function was adopted:

$$\begin{aligned} SGI(S) = \sum _{i=1}^C{\frac{p(i)*(1 - p(i))+\sqrt{p(i)*(1 - p(i))}}{2}}\text {.} \end{aligned}$$
(8)
Fig. 2.
figure 2

Gradual change in purity for two-class node splits

Figure 1 shows that the SGI metric is indeed a steepend version of the gini-index. Figure 2 simulates the situation where the dorminant class within a node consistently achieves an increase of 10% in purity as we move down a DT. The root node is assumed to have an equal proportion of two classes while the leaf nodes are pure. From this simulation it is evident that although the Gaussian metric has a similar shape to other metrics, the same can not be said for its gradient function. The gini-index is observed to maintain a consistant gradient while the SGI tries to initially mimick this consistancy but then steepens towards the end. Although entropy appears to have similar behaviour to SGI, it has an initially less consistant gradiant change than SGI and is less steeper at the final node transformation.

3.2 Shape Based Feature Engineering

It was initially envisaged that the adoption of the SGI metric would be enough to yield a significant improvement in the effectiveness of RFs; previous literature however seems to suggest that alternative impurity measures alone may only provide minor improvement [22]. Robnik [23] observes that although the gini-index offers good performance, it evaluates attributes separately and does not take attribute inter-relationships into account. The ReliefF measure was then proposed as a solution for alleviating misclassifications due to high feature interactions. Our work explores feature engineering as an alternative approach for capturing important information from multiple features simultaneously.

Feature engineering/construction entails transforming a given input feature set in order to give a more adequate representation of instances during the training and testing of machine learning models [27]. In some cases, the generated set of new features is deliberately smaller than the original feature set for improved computational efficiency and in order to remove irrelevant features. Examples of such approaches include clustering, LDA and PCA; which are also used for data compression [27]. In other cases, the size of the set of generated features is not constrained and may even be bigger than that of the original feature set. Examples of such approaches in the context of DTs include the FRINGE [20] and Forest-RC [6] algorithms which used Boolean operators and linear combinations respectively to generate a new set of composite features.

Instead of constructing new features that are each based on a selected subset of the original input feature set, this study proposes the formulation of new features that capture the level of coherence between all the feature values in a given instance. The underlying assumption is that if such properties can be empirically quantified, they could be used as extra features for improved class discrimination. This proposed formulation is inspired by the statistical radar/spider chart, which graphically displays multivariate data using a two-dimensional chart of multiple numerical variables represented on a radial axes with a common starting line [29]. Normalized input attributes are allocated fixed orientations from the centre using the order provided by the data set and the contour produced by a given instance is used to characterize its level of feature coherence. These contours can then be analyzed using existing shape descriptors; at this stage only the circularity property has been used as the viability of this concept is still being explored. A circularity score is normally deduced by measuring the perimeter of a closed contour as well as the area of the region it encloses, then computing [5]

$$\begin{aligned} \frac{perimeter^2}{4*\pi *area}\text {.} \end{aligned}$$
(9)

The fact that the axes of radar charts are numerical, restricts the application of this method to data sets with only numerical feature values. A hypothetical radar chart is shown in Fig. 3 as an illustration of how normalized feature values can be used to plot radial graphs. The main conclusion that can be drawn from the plots is that spherical contours should be more indicative of greater feature coherence than jagged shapes; it is also expected that instances from the same class should have similarly shaped radial graphs. In cases where instance feature values are in the range [0,1), the generated shapes were observed to be generally the same. This effectively meant that there was no improvement in instance differentiation after adding circularity as an extra feature, especially for data sets with few attributes. This problem was alleviated by scaling the normalized values to the range [0,10). It was envisage that there would be a need for the normalization of feature values so as to reduce the bias of circularity towards any one feature. Indeed, experiments on the sonar data set showed an improvement in classification effectiveness when the injected circularity score was calculated from normalized feature values. The original feature values were however, left un-normalized to avoid distorting the data sets.

Fig. 3.
figure 3

Hypothetical feature coherence example. Instance1 has greater circularity than Instance2

4 Experimental Protocol

The main of objective of this study is to improve the effectiveness of RFs through the use of a SGI feature evaluation heuristic and a shape-based feature set characterization method. In order to ascertain the effectiveness of these two techniques on RFs, four experiments are conducted in line with the experimental design shown in Fig. 4. The remainder of this section ellaborates on the sub-components of this design.

Fig. 4.
figure 4

Experimental design

4.1 Dataset Preparation

The effectiveness of the methods proposed in this study is tested using 10Footnote 5 data sets from the UCI repository [19]. These data sets are drawn from Robnik [23] and Breiman’s [6] studies on RFs; we exclude data sets with nominal attributes and missing values as these properties are beyond the scope of the present study. An additional constraint is enforced to exclude data sets with more than 3000 instances, for computational reasons. The characteristics of the chosen data sets are summarized in Table 1, which reveals the diversity of the problems represented, in terms of data set size (N), number of features (M) and number of classes (C).

Table 1. UCI datasets

Each of the chosen data sets is preprocessed by converting all feature values to floating points and class labels to integers, then saved in a consistant format. The output of the data preparation phase is two data set files: one with the original feature set only and another with the circularity feature included.

4.2 Static Random List Generation

Given the highly stochastic nature of RFs, it is imperative to ensure a significant level of contextual consistancy in their induction and evaluation in order to minimize the effect of uncertainty on the outcome of comparative analyses. A deliberate effort is made to subject each RF variant taking part in a comparative experiment, to the same training and testing instances. This is accomplished by generating static/fixed lists of cross validation folds and training samples, before any tree induction or experimentation takes place.

The lists that are generated only contain indexes of instances in the data set, for the sake of efficiency. One random number generator seed is used to produce an entire collection of static random lists. This collection is subsequently used to ensure RF training and testing consistancy in four evaluation experiments. The random number generator is reset at the start of each experiment to enable flexible reenactment of all 4 experiments. The same results can be reproduced repeatedly regardless of the order in which the individual experiments are conducted.

Although the adopted experimental design (including the RF algorithm), offers several opportunities for parallelization on multi-core processors, this generally comes at the expense of a deterministic execution cycle since parallel loop iterations are typically non-deterministic and may differ from run to run [4, p. 670]. In our case, we require loop iterations to follow a specific order of execution in order to ensure that experimental results can be reproduced verbatim. As a result, no parallelization is implemented in this study. This unfortunately forces us to cap the size of data sets to be investigated, in order to avoid computational overhead.

4.3 Random Forest Analysis

In line with the methodology adopted by Robnik [23] we execute all experiments under the following settings: (1) The recommended RF parameter values are used: the number of trees, \(Ntrees = 100\); the number of attributes randomly chosen at each DT node, \(m = sqrt(M)\)Footnote 6; and the cut-off node size, \(n_s=\) 5. (2) All data sets are evaluated using 10-fold cross-validation. The following additional default parameter values are used: a sampling ratio of 1 and a maximum tree depth, d equal to n.

The four experiments conducted, test the effectiveness of the standard Forest-RI algorithm using either the gini or SCI impurity measure, with and without shape feature injection. We refer to each of these four modifications to the Forest-RI algorithm as RF variants. The use of 10-fold cross validation means that each experiment trains and tests 10 RFs. The accuracy of a RF is calculated as the percentage of test set instances that it correctly classifies. To capture the overall effectiveness of a particular RF approach, we record statistics such as the minimum, maximum, mean and median from the 10 fold accuracies. The code to facilitate all these experiments was implemented in C++, with the opencv library being used for shape feature calculation.

The output of the four experiments is used to compare the effectiveness of the Forest-RI algorithm in the following five contexts: (1) using gini impurity with and without shape feature injection, (2) using SGI impurity with and without shape feature injection, (3) using gini impurity vs SGI, both without shape feature injection, (4) using gini impurity vs SGI, both with shape feature injection and (5) using gini impurity without shape feature injection vs SGI with shape feature injection. We consider this comparison to be objective, since all Forest-RI variants are trained and tested on precisely the same instances.

Since preliminary experiments revealed that it was possible for results to differ on different runs of cross validation, it seemed appropriate to adopt repeated cross validation [33]. For each data set, the cross validations of four experiments are repeated 30 times [28, 34], with a different seed in each case. To avoid ambiguity, we propose some terminology for referring to the statistics recorded in this study. For each complete run of 10-fold cross validation using one of the four experiments, we record the following fold accuracies (FAs): minimum-FA, maximum-FA, mean-FA and median-FA. After the 30 cross validation repeats, we calculate the cross validation accuracies (CVAs) for each of the four experiments. Specifically, minimum-CVA, maximum-CVA and mean-CVA are derived using the minimum minimum-FA, maximum maximum-FA and mean mean-FA respectively.

Table 2. Wilcoxon signed-ranks test p-values from comparing 30 repeated experiment pairs. p-values are classified as either significant at 0.05 level , significant at 0.1 level , or non-significant . +gini and -gini represent RFs using the gini-index with and without shape feature injection respectively. Likewise with steepend gini-index (sgi).

We ultimately seek to establish whether any comparative difference in effectiveness can be attributed to the proposed techniques under investigation or the stochastic properties of RFs and training vs. testing set splits. The Wilcoxon signed-ranks test [10, 23] is used to evaluate the statistical significance of such a difference in RF effectiveness. For each of the five effectiveness comparisons conducted, the null hypothesis assumes there is no difference in the performance of the two RF approaches in question.

5 Results

Table 2 shows the p-values associated with the Wilcoxon signed-ranks test comparing pairs of median-FAs from 30 repeated experiments. Each pair either compares RFs based on splitting function or inclusion of shape feature injection; the training and testing sets used in both cases are however identical. We thus have two groups of 30 corresponding median-FAs and the p-values indicate the overall probability that the difference in corresponding median-FAs is due to chance. A lower p-value is indicative of a more significant difference in median-FAs of the two groups of RFs. We used this test to demonstrate whether the RF variants proposed in this study yield significant differences in effectiveness. Out of the 45 tests done, 5 and 13 tests were significant at an \(\alpha \) level of 0.1 and 0.05 respectively. Although this means that in the majority of tests, the RF variants under comparison showed insignificant differences in effectiveness; we note that the +gini vs +sgi and -gini vs +sgi tests each recorded significant differences in 5 of the 9 data sets used. For the bupa data set all tests yielded insignificant differences while the yeast data set attained significant differences in all but the -sgi vs +sgi test.

Fig. 5.
figure 5

Frequencies of median-FA superiority after 30 experiment repeats

Table 3. CVA statistics (%) over 30 repeats

Since the Wilcoxon signed-ranks test merely highlights the significance of differences in performance, we rely on median-FA comparisons to infer on the superiority of one RF variant over another. Figure 5 shows the number of times the median-FA of a RF variant was higher than that of its competitor over 30 cross validation repeats. Although no clear trend of superiority is demonstrated in the 9 data sets used, we note that the RF variants considered have strengths in different contexts. For example, the german-numeric data set seems to favour +gini over -gini, +sgi over -sgi, +sgi over +gini and +sgi over -gini; which is the exact opposite of the yeast data set context. For some data sets (for example isonosphere, iris and segmentation), most of the 30 experiment repeats yielded exactly the same median-FA while the remaining repeats favoured one RF variant.

Table 3 shows the range of FAs over the 300 (30 cross validation repeats, each with 10 folds) times RFs are trained and tested for each data set. In previous literature such as [1, 23], classification effectiveness is reported using the mean-FA of just one cross validation cycle. Although our mean-CVA results are shown to be comparable with accuracies in previous literature, this statistic alone does not give a comprehensive and reliable picture of RF performance. By reporting the minimum-CVA, maximum-CVA and mean-CVA, we give an indication of the worst, best and average performance of a given RF variant. The distribution of our mean-CVAs confirm the finding of a disagreement of only 2% between splitting criteria, made in previous literature [22]. However, the minimum-CVAs show a greater level of variance. For example, a difference of −14% is shown between the mean-CVAs of -gini and other RF variants.

6 Conclusion

This study sought to improve the effectiveness of RFs through the use of a steepend gini-index and shape feature injection. Although such improvements are indeed recorded over some data sets, the general trend is that of an insignificant difference in effectiveness. When considering the mean-CVA and minimum-CVA results of -gini vs +sgi, we note that the latter outperforms the former over more datasets; we therefore conclude that the steepened gini-index splitting function and the proposed shape feature injection can improve the effectiveness of RFs.

In addition to the proposed RF variants, a major contribution of this study is an experimental framework which allows for a high level of contextual consistancy and repeatability in the induction and evaluation of RFs. Previous studies such as [1, 23] have used the outcome of single runs of cross validation on multiple data sets as evidence of apparent algorithm optimization. We have argued that any claimed superiority should be demonstrated under highly controlled conditions that limit unnecessary stochastic variation, and sustained over multiple repetitions.

Over the course of this study, some opportunities for further work have been identified, we conclude by outlining some of these areas. The large differences in minimum-CVA over several data sets, highlight the potential of creating a RF that uses a hybrid of the RF variants considered in this study. In such a case, the hybrid RF would be equiped to deal with the varying level of complexity in different data sets. Additionally, the extreme weaknesses of one RF variant in some contexts could be compensated for by the better performance of another. Future work will focus on exploring this idea of a hybrid set of RF variants in conjunction with weighted voting. Since some of the minimum-CVAs may have been caused by unfavourable random cross validation splits, the use of stratified cross validation in future work may provide a slightly more controlled training and testing environment. A simple shape descriptor has been adopted in this study; extensions to this work may consider other more advanced shape characterization methods such as moments.