Joint maximization of accuracy and information for learning the structure of a Bayesian network classifier

Halbersberg, Dan; Wienreb, Maydan; Lerner, Boaz

doi:10.1007/s10994-020-05869-5

Joint maximization of accuracy and information for learning the structure of a Bayesian network classifier

Published: 28 February 2020

Volume 109, pages 1039–1099, (2020)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

Joint maximization of accuracy and information for learning the structure of a Bayesian network classifier

Download PDF

Dan Halbersberg¹,
Maydan Wienreb¹ &
Boaz Lerner¹

1728 Accesses
10 Citations
1 Altmetric
Explore all metrics

Abstract

Although recent studies have shown that a Bayesian network classifier (BNC) that maximizes the classification accuracy (i.e., minimizes the 0/1 loss function) is a powerful tool in both knowledge representation and classification, this classifier: (1) focuses on the majority class and, therefore, misclassifies minority classes; (2) is usually uninformative about the distribution of misclassifications; and (3) is insensitive to error severity (making no distinction between misclassification types). In this study, we propose to learn the structure of a BNC using an information measure (IM) that jointly maximizes the classification accuracy and information, motivate this measure theoretically, and evaluate it compared with six common measures using various datasets. Using synthesized confusion matrices, twenty-three artificial datasets, seventeen UCI datasets, and different performance measures, we show that an IM-based BNC is superior to BNCs learned using the other measures—especially for ordinal classification (for which accounting for the error severity is important) and/or imbalanced problems (which are most real-life classification problems)—and that it does not fall behind state-of-the-art classifiers with respect to accuracy and amount of information provided. To further demonstrate its ability, we tested the IM-based BNC in predicting the severity of motorcycle accidents of young drivers and the disease state of ALS patients—two class-imbalance ordinal classification problems—and show that the IM-based BNC is accurate also for the minority classes (fatal accidents and severe patients) and not only for the majority class (mild accidents and mild patients) as are other classifiers, providing more informative and practical classification results. Based on the many experiments we report on here, we expect these advantages to exist for other problems in which both accuracy and information should be maximized, the data is imbalanced, and/or the problem is ordinal, whether the classifier is a BNC or not. Our code, datasets, and results are publicly available http://www.ee.bgu.ac.il/~boaz/software.

Multi-dimensional Bayesian network classifiers: A survey

Article 11 July 2020

Constrained Naïve Bayes with application to unbalanced data classification

Article Open access 20 October 2021

Unified Performance Measure for Binary Classification Problems

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction and related work

Classifiers, e.g., the neural network (NN), random forest (RF), and support vector machine (SVM), excel in prediction but not in knowledge representation, which is needed in problems for which key factor identification is sought, such as in an attempt to understand possible causes of accidents, a disease, or a machine/process fault. The Bayesian network (BN) excels in knowledge representation, which makes it ideal to identify key factors, but it is not considered a supreme classifier. To achieve high accuracy (ACC), learning the structure of a BN classifier (BNC) should maximize a (discriminative) score that is specific to classification and not a generative one based on the likelihood function that may fit a general BN structure, but not necessarily that of a BNC structure. Indeed, when a BNC was learned to minimize the 0/1 loss function, it showed superiority to BNCs learned using marginal and class-conditional likelihood-based scores and even to state-of-the-art classifiers like NN and SVM (Kelner and Lerner 2012).

However, by maximizing accuracy (minimizing the 0/1 loss function) in learning its structure, the BNC—similar to other machine learning classifiers—cannot account for the error distribution and, thus, is not informative enough about the classification result and the contribution of each class to the error (Provost et al. 1998; García et al. 2010), and it may also be sub-optimal (Ranawana and Palade 2006). Other discriminative measures used in learning a classifier, such as the area under curve (AUC), suffer from the same shortcoming, because they all relate to ACC. Moreover, in most cases, these measures only suit binary classification problems. Also, it may explain why other studies (García et al. 2009) suggested measures such as the consensus measure of accuracy.

On the other hand, measures that maximize information and account for error distribution, e.g., mutual information (MI) (Cover and Thomas 2012), the Matthew correlation coefficient (MCC) (Baldi et al. 2000), and the confusion entropy (CEN) (Wei et al. 2010) usually are not accurate enough. Labatut and Cherifi (2011) claimed that most of the non-accuracy measures were initially developed for other purposes than to compare/evaluate classifiers (e.g., to measure the association between two random variables, the alignment between two raters, or the similarity between two sets). Therefore, they may lead to confusing terminology or even to wrong interpretation, or they may be noisy and ad hoc for a particular problem.

A second challenge for a BNC, as well as for all other machine-learning classifiers, is that for imbalanced data, they usually predict all (or almost all, depending on the imbalance level) samples of the minority classes as of the majority class. These classifiers show high accuracy, which is in the order of the prior probability of the majority class, since they classify all samples to this class, but at the same time, they may misclassify all samples of the minority classes. Class imbalance can traditionally be tackled using different approaches, e.g., random sampling—upsampling the minority class(es) or downsampling the majority class (Chawla 2005; Provost 2000). However, these two sampling methods result in over-fitting and domain deformation or loss of data, respectively. In addition, tackling imbalance by random downsampling or upsampling, or applying different costs to different misclassifications provides an optimistic ACC estimate, and thus is not recommended (Provost 2000). Also other accuracy-driven measures, e.g., precision, sensitivity, and specificity lead to sub-optimal solutions in the presence of class imbalance (Ranawana and Palade 2006). More advanced methods to tackle class imbalance include feature selection (Wasikowski and Chen 2010); sampling subsets of the classes (Liu et al. 2009); combination of down- and upsampling using e.g., the synthetic minority over-sampling technique (SMOTE) (Chawla et al. 2002); combination of down–upsampling with an ensemble of classifiers (Galar et al. 2012) or with feature selection (Lerner et al. 2007); cost-sensitive learning (Domingos 1999); measuring the balanced accuracy (over all classes) (Brodersen et al. 2010) or its geometric mean (García et al. 2010); and hierarchical decomposition of the classification task, where each hierarchy level is designed to tackle a simpler problem that is represented by classes that are approximately balanced (Lerner et al. 2007). Although probably never tested, classifiers—BNCs and others—learned using information measures such as MI, MCC, and CEN should be less affected by class imbalance data but at the same time also less accurate.

A third challenge is that 0/1 loss-function classifiers do not account differently for different error severities, as they count all misclassifications the same, both for performance evaluation and in learning. However, when the class (target) variable is ordinal, exploiting the ordinal nature of this variable may facilitate learning the classifier and make it more accurate. Considering an ordinal target variable Y, taking one of M values, such that $V_1< \cdots <V_M$, a learning algorithm can take into account the natural ordering of this variable to induce a classifier, which harnesses this extra information to improve its accuracy. One such classifier is the cumulative probability tree (Frank and Hall 2001), for which Y is transformed into $M-1$ binary variables such that the ith binary variable represents the test $Y>V_i$. The model then comprises $M-1$ tree classifiers, where the ith tree is trained to output $P(Y>V_i)$. Another ordinal classifier is the cumulative link model (CLM) (Agresti 2011) that is an extension of the generalized linear model (GLM) for ordinal classification. A third ordinal classifier is the ordinal decision tree, which generalizes the classification and regression tree (CART) (Breiman et al. 1984) to ordinal target variables by considering splitting functions based on ordinal impurity functions (Piccareta 2008), which are specific implementations of the generalized Gini impurity function for a node. Principally—although we are not aware of any such study—the mean absolute error, MAE, (Hyndman and Koehler 2006), which sometimes is used to evaluate the error between a prediction and the true value, may also be used to augment learning an ordinal classifier. While such a measure can capture the ordinal information in a problem and potentially penalize different errors differently as we desired, it is not informative regarding the error distribution and is still sensitive to class imbalance.

To motivate this study further, let’s consider two examples. The first is prediction of the severity of young-driver (YD) motorcycle accidents (MAs). Road injuries are the leading cause of death among YDs (ages 18–24) (Toledo et al. 2012); YDs make up 9–13% of the population, but their percentage in driver fatalities is 18–30% (OECD 2006). Besides the tragic human cost, a fatal accident costs (OECD 2006) around $1.5M, where in the US alone, the cost of YD road accidents in 2002 was $40 billion. MAs are particularly deadly, and luckily fatal MAs are only $\sim 1\%$ of all accidents, whereas severe and minor accidents are around 12% and 87% of the accidents, respectively. However, experiments show that MA classifiers tend to focus on the majority class of minor accidents at the expense of the minority classes of severe and fatal accidents (Halbersberg and Lerner 2019). In addition they are uninformative about their error distribution and are insensitive to error severity (making, e.g., no distinction between misclassification of fatal accidents as severe or minor although the former is less harsh than the latter). Road-safety experts wish their MA classifier to not only maximize accuracy, but also to be informative about its errors, to be as indifferent as possible to data imbalance between minor and fatal accidents, and to penalize misclassifications of fatal accidents as severe and as minor differently.

The second example is prediction of the disease state of an ALS patient. Amyotrophic lateral sclerosis (ALS) is a devastating neurodegenerative illness of the human motor system with an unknown pathogenesis (Kiernan et al. 2011), which is still not visibly affected by the therapies available today, and from which 50% of patients die within three to five years of onset, and about 20% survive between five to ten years (Mitchell and Borasio 2007; Kiernan et al. 2011). The ALS functional rating scale (ALSFRS) is a widely accepted metric in the ALS medical community for the evaluation of ALS-related disability and progression (Brooks et al. 1996), with values between 0 for no functionality and 4 for full functionality for ten ALSFRS items describing physical functionalities in, e.g., breathing, speaking, and walking. By considering the ALSFRS as the target (class label), we may define ALS disease state prediction as an ordinal problem. With respect to the relative frequencies of ALSFRS values, which typically may vary from around 1% for ALSFRS of 0 to 42% and 35% for values of 3 and 4, respectively, disease state prediction also becomes a class imbalance problem. ALS patients, along with their doctors and carers, wish for disease state prediction to be very accurate (Gordon and Lerner 2019) but at the same time informative, to not be fooled by the imbalance among disease states, and to consider mild misclassification less harshly than severe misclassification.

In this study, we propose to learn a BNC, which leverages knowledge representation, using measures replacing the 0/1 loss function and trading accuracy and information. We are interested in learning the BNC using a measure that maximizes both accuracy and information, considers the error distribution, admits class imbalance, and accounts for error severity (which is significant only for ordinal problems). First, we consider existing measures, such as MI, MCC, and CEN, that all use the entire confusion matrix and not just its diagonal (as ACC) and, therefore, have the potential to meet at least some of our concerns. In addition, we evaluate the MAE, which naturally accounts for error severity. Second, since none of these measures accounts for all concerns, we propose next a novel information measure (IM), trading accuracy and information, that accounts for all of them. Third, we extend this measure further, adding to it a term that trades off accuracy and IM, giving the measure an additional degree of flexibility. Then we motivate the proposed measures and thoroughly evaluate them, comparing them with the existing measures theoretically and using several performance measures (which are the same learning measures), synthesized confusion matrices, artificial datasets, UCI ordinal datasets, and three real ordinal problems. We show the advantages of the IM-based BNC compared with BNCs that are learned using alternative measures and other state-of-the-art classifiers with respect to maximization of accuracy and information in ordinal class-imbalance problems. These advantages are manifested here for many databases and several real-world problems, but we believe they hold true for other problems (e.g., ranking problems) having the same requirements, and for classifiers other than the BNC.

In summary, our contribution is that: (1) We propose to utilize the BNC using a measure replacing the 0/1 loss function to jointly maximize accuracy and information, consider the error distribution, admit class imbalance, and account for error severity in tackling class-imbalance ordinal classification problems; (2) Since our theoretical and empirical evaluation of existing measures showed that none of the existing measures accounts for all these concerns, we suggest a novel information measure (IM) that has all the above desired properties; (3) We motivate the proposed measure and thoroughly evaluate it theoretically in comparison with the existing measures and empirically using several performance measures, synthesized confusion matrices, artificial datasets, UCI ordinal datasets, and three real ordinal problems; and (4) We demonstrate the advantages of the IM-based BNC compared with BNCs that are learned using existing measures and with other state-of-the-art classifiers (e.g., NN, SVM, BNC, and RF) with respect to maximization of accuracy and information in ordinal class-imbalance problems. We manifested these advantages using many databases and several real-world problems, and we believe these hold true for other problems (e.g., ranking problems) having the same requirements, and for classifiers other than the BNC.

The rest of this paper is organized as follows. In Sects. 2 and 3, we review the BNC and candidate measures for learning its structure, respectively. In Sect. 4, we propose new measures for learning a BNC and demonstrate how to control their values to trade learning among the conflicting requirements of accuracy, information, and error severity. In Sect. 5, we experimentally evaluate our information measures comparing them with existing measures using synthesized confusion matrices that pose different classification scenarios and challenges. In Sect. 6, we expand our evaluation and compare empirically BNCs learned based on our (as well as other) measures with state-of-the-art classifiers using databases representing artificial and real-world problems. Finally in Sect. 7, we summarize the study and draw important conclusions.

2 Bayesian network classifiers

The BN compactly represents the joint probability distribution P over a set of variables $X=\{X_1, \ldots ,X_n\}$, each, in the discrete case, having a finite set of mutually exclusive states. It consists of a network structure G and a set of parameters $\theta $, where $G=(V,E)$ is a directed acyclic graph in which the nodes V in G are in one-to-one correspondence with the variables in X, and the edges E in G encode a set of conditional independence assertions about variables in X. $\theta $ consists of local probability distributions, each for each variable $X_i$ given its parents $PA(X_i)$ in G. Given the network, the joint probability distribution over X comprises the local distributions as (Heckerman 1998):

$$\begin{aligned} P(X_1,X_2,...,X_n)=\prod _{i=1}^{n} P(X_i|PA(X_i)). \end{aligned}$$

(1)

Learning the structure of the BN from a dataset D is NP hard (Cooper and Herskovits 1992), and thus is usually performed heuristically and sub-optimally using, e.g., the search and score (S&S) approach by which the structure that maximizes a score function, which measures the fitness of the structure to the data, is selected. One such score (measure) is the a posteriori probability of the network given the data, P(G|D) (or the marginal likelihood, P(D|G), for equally probable structures) (Cooper and Herskovits 1992), and another measure is based on the minimum description length principle (Lam and Bacchus 1994), penalizing model complexity, where both scores are asymptotically equivalent and correct. However, these scores, similar to other log likelihood (LL) or information-based scores, either likelihood-equivalent or not (Heckerman et al. 1995), cannot optimize a classifier (Friedman et al. 1997) because they are not directed in maximizing the classification accuracy. Instead, it was suggested to learn a BNC by maximizing the conditional log likelihood (CLL) of G given D (Grossman and Domingos 2004):

$$\begin{aligned} CLL(G|D)=\log \prod _{i=1}^{N} P(c_i|v_i')=\sum _{i=1}^{N} \log P(c_i|v_i')=LL(G|D)- \sum _{i=1}^{N} \log P(v_i'), \end{aligned}$$

(2)

where $v_{i}'$ and $c_i$ are the feature vector and class label, respectively, for the ith of N instances. However, the computation of CLL is exponential in the number of instances N, and also, although CLL is asymptotically correct, for a finite sample, the class maximizing CLL can only indicate the correct classification, but it can not guarantee it (Kelner and Lerner 2012).

A score that measures the degree of compatibility between a possible state of the class variable C and the correct class is the 0/1 loss function:

$$\begin{aligned} L(c_i,{\widehat{c}}_i)={\left\{ \begin{array}{ll} 0, &{} c_i = {\widehat{c}}_i \quad ,\\ 1, &{} c_i \ne {\widehat{c}}_i \end{array}\right. } \end{aligned}$$

(3)

where ${\widehat{c}}_i$ is the estimated class label for the ith instance. Instead of selecting a structure based on summation of supervised marginal likelihoods over the dataset (2), the risk minimization by cross validation (RMCV) score selects a structure based on summation of false decisions about the class state over the dataset (Kelner and Lerner 2012),

$$\begin{aligned} \begin{aligned} RMCV(D,G)=\frac{1}{K} \sum _{k=1}^{K} \frac{K}{N} \sum _{i=1}^{N/K} L(c_{ki}, {\arg \max }_c P(C=c\text {|} v_{ki}', D \backslash D^K_k,G)), c \in \{c_1, \ldots ,c_M\}, \end{aligned} \end{aligned}$$

(4)

where the training set D is divided into K non overlap** validation sets $D_k^K$ (each having N/K instances of the form $v_{ki}=(c_{ki},v_{ki}')$), and for each such validation set, an effective training set has $|D \setminus D_k^K|$ (i.e., $N(K-1)/K$) instances. As part of the cross validation (CV), the classification error rate, i.e., the RMCV score, is measured on all vectors of $D_k^K$ and averaged over the K validation sets. No use of the test set is made during learning. Note that the RMCV score is normalized by the dataset size N, whereas (2) is not. Although normalization has the same effect on all learned structures, it can clarify the meaning of the score (i.e., an error rate) and help in comparing scores over datasets. Moreover, sharing the same range of values ([0, 1]), RMCV establishes its correspondence to classification accuracy. Also, note that the same RMCV measure can be used for learning the BNC and for evaluating its accuracy, which makes learning oriented towards classification.

To compute the RMCV, the candidate structure has to be turned into a classifier by learning its parameters. Local probabilities are modeled using the unrestricted multinominal distribution (Heckerman 1998), where the distribution parameters are obtained using maximum likelihood (ML) (Cooper and Herskovits 1992), similar to (Kontkanen et al. 1999). Moreover, it has been empirically shown (Grossman and Domingos 2004) that ML parameter estimation does not deteriorate the results compared to maximum conditional likelihood estimation, which can only be obtained by computationally expensive numerical approximation. Learning a BN rather than a structure has an additional cost of parameter learning, though this cost is negligible while using ML estimation and fully observed data.

Starting with the empty or naïve Bayesian graph and using a simple hill-climbing search with the RMCV score establish the RMCV structure learning algorithm for BNCs (Kelner and Lerner 2012). The hill-climbing implementation includes a search over all neighbor graphs at each iteration. A neighbor graph is defined as a single modification of the current graph using one of the following operators: edge addition, deletion, or reversal provided that the derived graph remains a directed acyclic graph. The RMCV BNC showed superiority to other BNCs and state-of-the-art classifiers using synthetic and UCI datasets and, thus, is used in this study to represent a BNC. However, as it is based on the 0/1 loss function, RMCV, similar to other classifiers, is prone to all weaknesses of classifiers as described in Sect. 1.

3 Evaluating classifier performance

How can we know whether the classification model we have constructed is the most suitable one? Performance measures that evaluate multi-class classifiers are usually based on the confusion matrix between predicted and true classes (Baldi et al. 2000). Although this matrix summarizes all correct and wrong predictions (Table 1), and thereby may represent the classifier error distribution, the common way to evaluate classifier performance is based on the classification accuracy (Ferri et al. 2009; Jurman et al. 2012), i.e., the 0/1 loss function (RMCV score), which is the (normalized) matrix trace.

Table 1 A confusion matrix for a three-class classification problem

Joint maximization of accuracy and information for learning the structure of a Bayesian network classifier

Abstract

Similar content being viewed by others

Multi-dimensional Bayesian network classifiers: A survey

Constrained Naïve Bayes with application to unbalanced data classification

Unified Performance Measure for Binary Classification Problems

1 Introduction and related work

2 Bayesian network classifiers

3 Evaluating classifier performance

3.1 Mutual information

3.2 Confusion entropy

3.3 Matthew correlation coefficient

3.4 Mean absolute error

4 Trading between information and accuracy

Lemma 1

Proof

Lemma 2

Proof

5 Measure evaluation using synthesized confusion matrices

5.1 Sensitivity to class imbalance

5.2 Sensitivity to the number of classes

5.3 Sensitivity to the error severity

5.4 Sensitivity to the error distribution

5.5 ACC–information tradeoff

6 Experiments and results

6.1 Artificial datasets

6.2 UCI and real-world databases

6.3 Comparison to state-of-the-art algorithms

7 Discussion

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix

Appendix

1.1 Information measure with alpha

1.2 Sensitivity analysis

1.2.1 Accuracy

1.2.2 Mean absolute error

Lemma 3

Proof

Lemma 4

Proof

1.2.3 Mutual Information

Lemma 5

Proof

Lemma 6

Proof

Lemma 7

Proof

1.2.4 Information measure

1.2.5 Confusion entropy

1.3 Artificial BN sampling

1.4 IM scores for artificial databases

1.5 Run time measured by number of neighbors for artificial BNs

1.6 IM scores for UCI databases

1.7 Run Time measured by number of neighbors for UCI BNs

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation