Introduction

The Five-hundred-meter Aperture Spherical radio Telescope (FAST), also known as the “China Sky Eye”, is a major national project and the world’s largest single-aperture radio telescope. It is located in Pingtang County, Guizhou Province. Among FAST’s various scientific goals is the search for pulsars. The discovery of new pulsars and their subsequent observation with FAST can provide new means to address fundamental problems in physics, such as testing general relativity, exploring the black hole at the center of the Milky Way, probing the interstellar medium of the Milky Way, and constraining the internal state of neutron stars. In this study, we investigate imbalanced data classification and apply it to the pulsar candidate dataset to exclude interfering signals as far as possible while ensuring no loss of pulsar signals.

Image classification is a classic and enduring research problem in machine learning. With the advent of deep learning and the development of GPU hardware, deep convolutional neural networks [1, 2] have achieved great success in this field. In most practical situations, however, there is a considerable imbalance among the categories, i.e., the sample size of one or more classes is significantly higher than that of the others. Such datasets are referred to as imbalanced datasets; the class with the larger number of samples is called the negative class, while the other is the positive class. In medical clinical data, for example, the majority of persons are healthy and only a tiny fraction are diseased cases [3]. In the pulsar candidate dataset, most of the samples are noise, and just a few are viable candidates [4]. The phenomenon of data imbalance is similarly widespread in fields such as fraud detection and bioinformatics. Therefore, the study of imbalanced data classification holds great significance and has a broad range of applications.

Most machine learning algorithms are designed under the assumption of class balance, i.e., that the cost of misclassification is the same for every category. When such methods are applied to an imbalanced dataset, the learned model therefore skews toward the majority class. This is because, under class imbalance, the majority class essentially dominates the gradient responsible for updating the model weights, increasing the misclassification of the minority class. Various strategies have been proposed to augment the weight of the minority class, either by altering the data distribution or by improving the loss function. Methods that alter the data distribution are easy to implement and can yield promising results in some fields. However, they also introduce a considerable quantity of redundant data, resulting in overfitting [5], or discard critical information, leading to underfitting. Such strategies may also inflate the computational cost: oversampling the pulsar candidate dataset HTRU, which contains 1,196 positive samples and 89,996 negative samples, for example, might have disastrous repercussions. In recent years, researchers have increasingly turned to improving the loss function to alleviate the model bias, because it involves neither additional computation nor information redundancy or loss. On this basis, the study explores imbalanced data classification from the perspective of improving the loss function and presents an asymmetric gradient penalty approach based on the power exponential function (Gradient Penalty based on Power Exponential Function, GPPE).

The study is motivated by the following three considerations: (1) Negative samples are widely available, and a considerable part of them are easily identifiable or carry overlapping information; such data should provide lower gradients to the model during training. (2) The majority class may contain singular or mislabeled samples, so it is crucial to free model optimization from the entanglement of such samples. (3) “Hard” negative samples, with predicted probabilities of 0.6 to 0.9, should be given a larger gradient to optimize the model.

The main contributions of the study are summarized in the following three aspects.

(i) A novel imbalanced data classification method is proposed based on the power exponential function. Then, the rationality of the method and the selection of the hyperparameters are discussed from the perspective of the gradient.

(ii) A number of imbalanced binary datasets with varying categories and imbalance ratios are extracted from MNIST, CIFAR10, and CIFAR100 to demonstrate the effectiveness of the methodology. Finally, the method is applied to the pulsar candidate dataset HTRU.

(iii) The proposed method is extended to address imbalanced multi-category classification. Step and linear imbalanced datasets are created from well-known datasets such as MNIST, CIFAR10, CIFAR100, and Caltech101 to validate the performance.

The study is organized as follows: section “Related work” presents an overview of existing research on the classification of imbalanced data. The proposed method is described in section “Methods”, along with a theoretical analysis. Experimental evidence of the method’s reliability is presented in section “Experimental analysis”. Section “Algorithm application” applies the methodology to the pulsar candidate dataset. Finally, we conclude the study and outline directions for further investigation.

Related work

Imbalanced data classification is a hot and tricky subject in the realm of artificial intelligence. Broadly, the existing work can be divided into three categories: data-level, algorithm-level, and hybrid techniques [6, 7]. To begin, the data-level strategy focuses on altering the data distribution to raise the weight of minorities through random oversampling (ROS) or random undersampling (RUS). Methods often employed include the synthetic minority oversampling technique (SMOTE) [8] and its variants Borderline-SMOTE, Safe-Level-SMOTE, and so on. With the proposal and refinement of Generative Adversarial Nets (GANs) [9, 10], oversampling approaches based on GANs have received increasing attention in recent years and are being utilized for imbalanced data classification [11,12,13]. In contrast to SMOTE, which is based on Euclidean distance, a GAN learns the data distribution through a network model and samples the minority class to rebalance the dataset. In principle, this is an ideal sampling approach; in practice, however, stability and convergence remain barriers to its widespread adoption.

Second, the algorithm-level method enhances the loss function in order to boost the weight of the minority class in model optimization. Unlike the data-level strategy, this methodology does not change the original data distribution and hence adds no additional computational cost to the network. These approaches are primarily realized by designing new loss functions, implementing cost-sensitive learning, or altering decision thresholds [14]. The most common way is to build an advanced loss function, such as the Mean False Error loss (MFE) [15], the Focal loss (FL) [16], and the Asymmetric Focal loss (ASL) [17]. In the field of image segmentation, especially for medical images, the Combo loss (CL) [18], a weighted sum of the Dice loss [19] and a modified cross-entropy, was proposed to deal effectively with data imbalance. A significant limitation of the Dice loss is that it weights false positives and false negatives equally, leading to high precision but low recall in image segmentation [20]. Therefore, the literature [21] suggests replacing the Dice loss with the Tversky loss while exponentiating it with a focal parameter to focus on hard classes, defining the Focal Tversky loss (FTL). Furthermore, the Hybrid Focal loss (HFL) is defined as the sum of the Focal Tversky loss and the Focal loss in [22]. By allocating different costs to various categories, cost-sensitive learning overcomes the category imbalance [23]. Typically, the cost of misclassifying the minority is substantially higher than that of misclassifying the majority; however, determining the cost matrix is not easy. Cost-sensitive deep neural networks (DNN) [24] and cost-sensitive CNNs that learn the cost matrix (CoSen) [25] are two common strategies. The threshold modification approach achieves category balance by altering the decision threshold at the output layer of the network [26].

Finally, hybrid approaches combine data-level and algorithm-level methods; the most common are deep over-sampling (DOS) [27], large margin local embedding (LMLE), and class rectification with hard sample mining (CRL) [28].

FL and ASL are two loss functions that have gained significant attention for their improved performance. Under FL, positive and negative samples face identical gradient penalties; however, a higher gradient penalty on the negative samples also suppresses the positive samples’ contribution to the model [17]. The ASL technique then employs a decoupled asymmetric approach to penalize only the negatives, but while it satisfies considerations (1) and (2) in the introduction, it ignores (3). As a consequence, this study proposes an asymmetric gradient penalty technique based on the power exponential function (GPPE) for imbalanced data classification. According to the analysis, the proposed strategy is a deeper integration of FL and ASL, and it is also clearly superior to these two loss functions in terms of evaluation metrics.

Methods

In this section, we introduce imbalanced data classification, the cross-entropy loss, and its augmented variants FL and ASL. Then, we analyze the difficulties these loss functions encounter when applied to imbalanced data and offer a novel technique.

Binary classification and cross entropy loss

Suppose the dataset containing N samples is denoted as \(X=\{(x_i, y_i): i=1,2,\ldots ,N\}\), where \(x_i \in R^d\) indicates that each sample is a point in d-dimensional space and \(y_i \in \{0, 1\}\) means that the label is binary. If the size \(N_1\) of the positive category and the size \(N_0\) of the negative category satisfy \( N_0 \gg N_1\), then the dataset is said to be imbalanced, and \(IR=N_0/N_1\) is known as the imbalance ratio. In general, we define \(y=0\) as the majority class, i.e., the negative class, and \(y=1\) as the minority class, i.e., the positive class. Generally speaking, the higher the imbalance ratio, the more difficult the dataset is to classify. The most commonly used loss function for binary classification is the cross-entropy loss, which can be expressed as:

$$\begin{aligned} L_{\textrm{CE}}(p,y)=-\sum _{i=1}^N\left( y_i\log (p_i)+(1-y_i)\log (1-p_i)\right) \end{aligned}$$
(1)

where \(p_i=P(y=1 \vert x_i)\) represents the probability that the sample \(x_i\) is predicted to be positive. If the output of \(x_i\) after passing through the feature extractor is denoted as \(z_i\), then \(p_i\) can be calculated by the Sigmoid function or the Softmax function. As in the literature [16], if we define \(p_t\) as:

$$\begin{aligned} p_t= {\left\{ \begin{array}{ll} p &{}\quad y=1 \\ 1-p &{}\quad y=0 \end{array}\right. } \end{aligned}$$
(2)

Then, the cross entropy loss can be expressed as \(L_{\textrm{CE}}(p,y)=L_{\textrm{CE}}(p_t)=-\log (p_t)\). To address the imbalance more effectively, Lin et al. developed the focal loss based on cross entropy loss. The focal loss is defined as:

$$\begin{aligned} L_{\textrm{FL}}(p_t)=-(1-p_t)^\alpha \log (p_t) \end{aligned}$$
(3)

where \(\alpha \) is a hyperparameter and \((1-p_t)^\alpha \) is the dynamic scaling factor, which shifts the focus of model optimization away from easily recognizable examples and toward hard ones. By splitting the positive and negative samples, the focal loss can be further decomposed as:

$$\begin{aligned} L_{\textrm{FL}}(p)=L_{\textrm{FL}}^+(p) + L_{\textrm{FL}}^-(p) \end{aligned}$$
(4)

where \(L_{\textrm{FL}}^+(p)=-(1-p)^\alpha \log (p)\) and \(L_{\textrm{FL}}^-(p)=-p^\alpha \log (1-p)\). Ridnik et al. [17] subsequently argued that while a large \(\alpha \) effectively reduces the contribution of negative samples, it also eliminates the gradient contribution of positive samples. As a result, they proposed a decoupled asymmetric focal loss, represented by

$$\begin{aligned} L_{\textrm{ASL}}(p)=L_{\textrm{ASL}}^+(p) + L_{\textrm{ASL}}^-(p) \end{aligned}$$
(5)

where \(L_{\textrm{ASL}}^+(p)=-(1-p)^{\alpha _+}\log (p)\) and \(L_{\textrm{ASL}}^-(p)=-p^{\alpha _-}\log (1-p)\). \({\alpha _+}\) and \({\alpha _-}\) are the regulatory factors of the positive and negative samples, respectively, and satisfy \({\alpha _-} > {\alpha _+}\). At the same time, to reduce the entanglement of easily identifiable negative samples in model optimization, the probability of negative samples is shifted, namely

$$\begin{aligned} L_{\textrm{ASL}}^-(p)=-p_{\gamma }^{\alpha _-}\log (1-p_{\gamma }) \end{aligned}$$
(6)

where \(p_{\gamma }= \max (p-\gamma , 0)\), and \(\gamma \) is a hyperparameter denoting the probability shifting threshold.
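For concreteness, the two losses can be sketched in PyTorch in their binary forms as follows. This is our own minimal illustration; the function names, the default values, and the mean reduction are assumptions rather than the reference implementations of [16, 17].

```python
import torch

def focal_loss(p, y, alpha=2.0):
    """Focal loss (FL), Eq. (4): one shared exponent alpha
    modulates both the positive and the negative term."""
    eps = 1e-8  # keeps log() finite at p = 0 or p = 1
    loss_pos = -((1 - p) ** alpha) * torch.log(p + eps)        # L_FL^+
    loss_neg = -(p ** alpha) * torch.log(1 - p + eps)          # L_FL^-
    return (y * loss_pos + (1 - y) * loss_neg).mean()

def asymmetric_loss(p, y, alpha_pos=0.0, alpha_neg=4.0, gamma=0.05):
    """Asymmetric focal loss (ASL), Eqs. (5)-(6): decoupled exponents,
    plus the probability shift p_gamma = max(p - gamma, 0) on negatives."""
    eps = 1e-8
    p_shift = torch.clamp(p - gamma, min=0.0)
    loss_pos = -((1 - p) ** alpha_pos) * torch.log(p + eps)            # L_ASL^+
    loss_neg = -(p_shift ** alpha_neg) * torch.log(1 - p_shift + eps)  # L_ASL^-
    return (y * loss_pos + (1 - y) * loss_neg).mean()
```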

Asymmetric gradient penalty based on power exponential function

Three factors are taken into account in our design. First, the widely available negative samples may be interspersed with mislabeled or singular samples. This issue is also considered in the literature [29], which offers a negative sample reduction technique based on Euclidean distance, although that technique is not suitable for high-dimensional image datasets. In contrast, we approach the problem from the standpoint of the loss function. If the model classifies a negative sample as positive with a very high probability (e.g., greater than 0.95), it is reasonable to believe that it is a mislabeled point or singularity, and the sample’s contribution to model optimization should be discarded (gradient of 0), which is consistent with the literature [17]. Second, for the negative category, samples with excessively high or low prediction probability should not interfere with model optimization (i.e., they should provide a smaller gradient). The model should concentrate on samples with prediction probabilities between 0.6 and 0.9, and the loss function is supposed to offer a bigger gradient for such examples. Finally, the dynamic modifiers of both FL and ASL are power functions of the probability, i.e., they decay with a fixed exponent. According to our analysis, the decay rate should itself be a probability-related variable, which implies that the dynamic adjustment factor should be a power exponential function of the probability. Based on the foregoing considerations, the proposed loss function is as follows:

$$\begin{aligned} L_{\textrm{GPPE}}(p)=L_{\textrm{GPPE}}^+(p) + L_{\textrm{GPPE}}^-(p) \end{aligned}$$
(7)

where \(L_{\textrm{GPPE}}^+(p)=-(1-p)^\theta \log (p)\), \(L_{\textrm{GPPE}}^-(p)=-p^{(\alpha p+ \beta )}\log (1-p)\), and \(\theta \), \(\alpha \), \(\beta \) are the hyperparameters. \(\theta \) is a constant and defaults to 0. Furthermore, to meet the aforesaid requirements, we adopt a trick similar to that used in the literature [17], namely shifting the probability. Unlike them, however, probability shifting in this study is considered exclusively from the perspective of the gradient (see section “Method analysis”). Hence the threshold is very low, set to \(\gamma =0.05\) in the experiments, and no probability shifting is performed in the dynamic adjustment factor. To sum up, \(L_{\textrm{GPPE}}^-(p)\) can be further expressed as:

$$\begin{aligned} L_{\textrm{GPPE}}^-(p)=-p^{(\alpha p+ \beta )}\log (1-p_{\gamma }) \end{aligned}$$
(8)

where \(p_{\gamma }= \max (p-\gamma , 0)\). When \(\alpha \) is 0 and the probability shifting also covers the dynamic adjustment factor, the method in this study becomes ASL. Meanwhile, if the same decay factor is shared by positive and negative samples, the method degenerates to FL. Furthermore, if the hyperparameters are chosen so that the exponent vanishes (\(\alpha = \beta = 0\), with no shifting), the method degrades to the CE loss. As a result, the proposed loss function is an extension and integration of CE, FL, and ASL. According to the definition of \(p_t\), the formula of GPPE can be rewritten as:

$$\begin{aligned} L_{\textrm{GPPE}}(p_t)=-(1-p_t)^{\phi (p_t)}\log (\varphi (p_t)) \end{aligned}$$
(9)

where

$$\begin{aligned} \phi (p_t)= {\left\{ \begin{array}{ll} \theta &{}\quad y=1 \\ \alpha (1-p_t)+\beta &{}\quad y=0 \end{array}\right. } \end{aligned}$$
(10)

and

$$\begin{aligned} \varphi (p_t)= {\left\{ \begin{array}{ll} p_t &{}\quad y=1 \\ \min (p_t+\gamma , 1) &{}\quad y=0 \end{array}\right. } \end{aligned}$$
(11)

The algorithm flow is as follows:

Algorithm 1 Calculate loss of GPPE
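In place of the algorithm figure, the following is a minimal PyTorch sketch of the binary GPPE loss of Eqs. (7)–(8). The defaults \(\theta =0\) and \(\gamma =0.05\) follow the text, while the interface and the mean reduction are our assumptions; the authors’ released code is at https://github.com/gzmtzly/GPPE.

```python
import torch

def gppe_loss(logits, targets, alpha=-2.0, beta=6.0, theta=0.0, gamma=0.05):
    """Binary GPPE loss, Eqs. (7)-(8).

    logits  -- raw network outputs z, shape (N,)
    targets -- labels y in {0, 1}, shape (N,), minority class = 1
    Negatives get the power exponential modulator p^(alpha*p + beta);
    positives use the fixed exponent theta (theta = 0 reduces L^+ to CE).
    """
    eps = 1e-8
    p = torch.sigmoid(logits)
    p_shift = torch.clamp(p - gamma, min=0.0)   # p_gamma = max(p - gamma, 0)

    # L_GPPE^+(p) = -(1 - p)^theta * log(p)
    loss_pos = -((1 - p) ** theta) * torch.log(p + eps)
    # L_GPPE^-(p) = -p^(alpha*p + beta) * log(1 - p_gamma);
    # the shift is applied only inside the log, not in the modulator
    loss_neg = -(p ** (alpha * p + beta)) * torch.log(1 - p_shift + eps)

    return (targets * loss_pos + (1 - targets) * loss_neg).mean()
```

For a network with a single output unit, a call might look like `loss = gppe_loss(model(x).squeeze(1), y.float())`.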

Fig. 1 Image of loss function and gradient. a Loss function image, b gradient image

Fig. 2 The effect of the parameters on the gradient. a, b are the effects of \(\alpha \) and \(\beta \) on the gradient respectively

Method analysis

The gradient is the driving force of network parameter updates. In this section, we investigate the rationality of the proposed loss function and the choice of the hyperparameters from the perspective of the gradient. Since the loss calculation for positive samples is the same as that of CE, only the gradient of negative samples is considered in the analysis. The sigmoid function is used as the classifier. According to the chain rule, the gradient of GPPE with respect to the output z of the network is as follows:

$$\begin{aligned} \frac{\textrm{d}L_{\textrm{GPPE}}}{\textrm{d}z} = \frac{\textrm{d}L_{\textrm{GPPE}}}{\textrm{d}p}\,\frac{\textrm{d}p}{\textrm{d}z} = p^{\alpha p+\beta } \left[ \frac{1}{1-p_{\gamma }}-\left( \alpha \log (p) + \frac{\alpha p + \beta }{p}\right) \log (1-p_{\gamma })\right] p(1-p) \end{aligned}$$
(12)
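As a sanity check (ours, not part of the original experiments), Eq. (12) can be verified against PyTorch’s automatic differentiation on the negative-sample term. The closed form assumes \(p > \gamma \), so that \(p_{\gamma } = p - \gamma \); the logits below are chosen accordingly.

```python
import torch

alpha, beta, gamma = -2.0, 6.0, 0.05

z = torch.linspace(-2.5, 4.0, 9, requires_grad=True)  # sigmoid(z) > gamma here
p = torch.sigmoid(z)
p_shift = torch.clamp(p - gamma, min=0.0)

# Negative-sample GPPE loss, Eq. (8)
loss = -(p ** (alpha * p + beta)) * torch.log(1 - p_shift)
loss.sum().backward()   # autograd gradient dL/dz

with torch.no_grad():
    # Closed form of Eq. (12)
    grad = (p ** (alpha * p + beta)) * (
        1 / (1 - p_shift)
        - (alpha * torch.log(p) + (alpha * p + beta) / p)
        * torch.log(1 - p_shift)
    ) * p * (1 - p)

print(torch.allclose(z.grad, grad, atol=1e-5))  # True
```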

Figure 1 depicts the various loss functions and their accompanying gradients.

Figure 1a shows the different loss functions. When \(p \rightarrow 0\), the proposed loss tends to zero and converges faster than CE and FL. Therefore, the model is less influenced by easily classified negative samples. On the other hand, as \(p \rightarrow 1\), the loss of both CE and FL tends to infinity, whereas GPPE converges to \(-\log (\gamma )\). Figure 1b shows the gradient of each loss function. It can be seen that GPPE yields a lower gradient for easy samples (probability less than 0.5), concentrates more gradient on hard samples (probability 0.6 to 0.9), which may be adjusted using the parameters \(\alpha \) and \(\beta \) (see Fig. 2), and assigns a gradient close to 0 to singular negative samples (probability tending to 1). Therefore, GPPE meets the three requirements stated in the introduction. Although ASL shares some similarities with GPPE, the latter offers greater flexibility in describing hard samples and enforces a stricter gradient penalty on easy samples. Theoretically, the proposed strategy is better adapted to imbalanced datasets, especially those with a larger imbalance ratio. Figure 2 shows the influence of the hyperparameters on the gradient.

Table 1 Binary classification confusion matrix

As seen in Fig. 2a, when \(\beta \) is set to 6, the crest of the gradient curve shifts to the right and narrows as \(\alpha \) increases. It can be inferred that the gradient of the negative samples becomes zero as \(\alpha \rightarrow +\infty \), in which case the negative class contributes nothing to the optimization of the model while the positive class provides the entire gradient. Conversely, as \(\alpha \) decreases, the crest shifts to the left and widens, increasing the contribution of easy negative samples to the gradient and inhibiting the gradient contribution of positive samples. Similarly, Fig. 2b shows that when \(\alpha \) is set to \(-2\), the crest moves to the right and becomes narrower as \(\beta \) increases, thus reducing the gradient contribution of the negative samples. Therefore, \(\alpha \) and \(\beta \) should be chosen with caution. The gradient analysis suggests that \(\alpha = -2\) with \(\beta \) set to 2, 4, or 6 may best fulfill the three requirements in the introduction.
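The crest behaviour described above can also be reproduced numerically. The following sketch (our own illustration) locates the peak of the negative-sample gradient of Eq. (12) for the hyperparameter settings discussed:

```python
import numpy as np

def neg_grad(p, alpha, beta, gamma=0.05):
    """Negative-sample gradient of GPPE w.r.t. the logit z, Eq. (12)."""
    ps = np.maximum(p - gamma, 0.0)
    mod = p ** (alpha * p + beta)
    bracket = 1 / (1 - ps) - (alpha * np.log(p)
                              + (alpha * p + beta) / p) * np.log(1 - ps)
    return mod * bracket * p * (1 - p)

p = np.linspace(0.06, 0.999, 5000)
for alpha, beta in [(-2, 2), (-2, 4), (-2, 6)]:
    g = neg_grad(p, alpha, beta)
    print(f"alpha={alpha}, beta={beta}: crest at p = {p[np.argmax(g)]:.2f}")
# With alpha fixed at -2, larger beta moves the crest toward higher
# probabilities, concentrating the penalty on hard negatives.
```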

Experimental analysis

Next, we conduct comparative experiments on different datasets with different imbalance ratios to further demonstrate the effectiveness of the algorithm. The experiments are divided into two parts. First, using the widely used MNIST, CIFAR10, CIFAR100, and Caltech101, several groups of categories are selected for imbalanced binary and multi-category classification experiments, and the results are compared with CE, FL, CL, ASL, FTL, and HFL. Second, the method is applied to the pulsar candidate dataset HTRU [30] to demonstrate its applicability.

Evaluation metrics

For imbalanced binary classification, the Recall, the Precision, and their harmonic mean, the F_score, are frequently employed to assess model performance. Table 1 defines the confusion matrix of binary classification. The Precision, Recall, and F_score are then calculated as:

$$\begin{aligned} \text {Precision} = \frac{\text {TP}}{\text {TP}+\text {FP}} \end{aligned}$$
(13)
$$\begin{aligned} \text {Recall} = \frac{\text {TP}}{\text {TP}+\text {FN}} \end{aligned}$$
(14)
$$\begin{aligned} \text {F}\_{\text {score}} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(15)

In addition, the area under the receiver operating characteristic (ROC) curve (AUC) and the average precision (AP) are important indicators of model performance. Of course, on easily identifiable datasets like MNIST, the various loss functions all perform well on the metrics above, making it challenging to observe significant differences. For such datasets, the sum of the numbers of false positives and false negatives, abbreviated F_PN, is therefore used as the performance metric. From the confusion matrix, F_PN is calculated as

$$\begin{aligned} \text {F}\_{\text {PN}} = \text {FP} + \text {FN} \end{aligned}$$
(16)

For imbalanced multi-category classification, the macro F_score (mF1), mean AUC (mAUC), and mean AP (mAP) are the main metrics. In addition, the mean accuracies of the majority and minority classes, abbreviated Acc_maj and Acc_min, are employed to compare model performance.
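For reference, a minimal sketch (our own helper, not from the paper’s code) computing the binary metrics of Eqs. (13)–(16) from predicted and true labels:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, Recall, F_score (Eqs. 13-15) and F_PN (Eq. 16);
    the minority (positive) class is labelled 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score, fp + fn   # last value is F_PN
```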

Network structure

The purpose of the experiments is to evaluate the classification performance of the various loss functions. Therefore, all settings except the loss function are kept identical. A convolutional neural network (CNN) is used as the feature extractor on all datasets, although the network depth varies with the dataset. Figures 3, 4, 5 and 6 show the CNN structures used on MNIST, CIFAR10/CIFAR100, Caltech101, and HTRU, respectively.

Fig. 3 Network structure on MNIST

Fig. 4 Network structure on CIFAR10/CIFAR100

Fig. 5 Network structure on Caltech101

Fig. 6 Network structure on HTRU

All experiments are implemented in PyTorch. The batch size is uniformly set to 100, the number of training epochs to 100, and the learning rate to 0.001. Adam is selected as the optimizer, with parameters \(\beta _1 = 0.5\) and \(\beta _2 = 0.999\). The code and hyperparameters of ASL are available at https://github.com/Alibaba-MIIL/ASL and ours at https://github.com/gzmtzly/GPPE.
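Under these settings, the shared training procedure can be sketched as follows. This is our own minimal illustration reusing the `gppe_loss` sketch above, with a stand-in network and random placeholder data in place of the CNNs of Figs. 3, 4, 5 and 6 and the real datasets; only the optimizer settings, batch size, and epoch count come from the text.

```python
import torch
import torch.nn as nn

# Stand-in network with a single output logit; the actual CNNs are
# those shown in Figs. 3-6 and differ per dataset.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 14 * 14, 1),
)

# Placeholder data standing in for one of the imbalanced subsets.
X = torch.randn(3050, 1, 28, 28)
y = (torch.arange(3050) < 50).float()        # 50 positives, 3000 negatives
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=100, shuffle=True)

# Settings from the text: lr 0.001, Adam with betas (0.5, 0.999)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.5, 0.999))

for epoch in range(100):                     # 100 epochs, batch size 100
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = gppe_loss(model(xb).squeeze(1), yb)  # GPPE sketch above
        loss.backward()
        optimizer.step()
```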

Classification experiment on MNIST

MNIST is a handwritten digit dataset with 10 categories, comprising 60,000 training images and 10,000 test images. Two types of imbalanced datasets, binary and multi-category, are extracted as experimental data. The data distribution is shown in Table 2.

Table 2 Imbalanced data distribution on MNIST

For the binary datasets \(\{(C_i, C_j) \}\), the former category is positive with a size of 50, whereas the latter is negative with sizes of 1000, 3000, and 5000, respectively (Dis. 1–6). The imbalance ratios of the binary datasets constructed on MNIST are therefore 20:1, 60:1, and 100:1. In total we obtain 6 different training datasets, and the test data are extracted from the original test set with the corresponding categories \(\{(C_i,C_j) \}\); a construction sketch is given below. For the multi-category datasets, Dis. 7 and Dis. 8 take the former 5 categories as minorities with 50 samples each, and the latter 5 as majorities with 3000 and 5000 samples, respectively. Dis. 9 and Dis. 10 choose the even classes as the minority, with data distributions similar to Dis. 7 and Dis. 8. Thus, the imbalanced datasets constructed on MNIST are characterized by a fixed sample size for the minority classes and a variable sample size for the majority classes. In addition, the multi-category datasets contain several minority categories and several majority categories.
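For illustration, such a binary subset can be carved out of torchvision’s MNIST as follows. The helper and the example class pair are our own; only the subset sizes follow Table 2.

```python
import torch
from torchvision import datasets, transforms

def imbalanced_pair(pos_class, neg_class, n_pos=50, n_neg=3000, seed=0):
    """Binary subset (C_i, C_j): n_pos minority and n_neg majority samples."""
    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    g = torch.Generator().manual_seed(seed)
    pos_idx = torch.nonzero(mnist.targets == pos_class).squeeze(1)
    neg_idx = torch.nonzero(mnist.targets == neg_class).squeeze(1)
    pos_idx = pos_idx[torch.randperm(len(pos_idx), generator=g)[:n_pos]]
    neg_idx = neg_idx[torch.randperm(len(neg_idx), generator=g)[:n_neg]]
    keep = torch.cat([pos_idx, neg_idx]).tolist()
    return torch.utils.data.Subset(mnist, keep)

# e.g. an IR = 60:1 pair; the class indices here are illustrative
train_subset = imbalanced_pair(pos_class=3, neg_class=8, n_pos=50, n_neg=3000)
```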

Table 3 FN+FP of different loss functions on MNIST

The experiments are performed 10 times on the binary datasets and 5 times on the multi-category datasets for each loss function, and the final results are obtained by averaging. Here, \(\text {F}\_{\text {PN}}\) is used to compare model performance on the binary datasets, and the accuracies of the minority and majority classes are used for the multi-category datasets. Six loss functions, namely CE, FL, ASL, CL, FTL, and HFL, are compared in the binary classification experiments, while only the first three are employed for the multi-category classification.

Table 3 presents the experimental results on the binary datasets (Dis. 1–6), from which three conclusions can be drawn: (1) For each dataset, the F_PN of GPPE is the lowest, meaning the error rate is the lowest. In comparison to CE, FL, ASL, CL, FTL, and HFL, the F_PN of GPPE falls by 105, 105, 111, 89, 131, and 95 on Dis. 1; by 104, 111, 109, 102, 177, and 94 on Dis. 2; and by 114, 123, 97, 109, and 131 (apart from FTL) on Dis. 3. (2) GPPE performs better on extremely imbalanced datasets. The FP of GPPE decreases from 11 to 2 on Dis. 1–3 and from 11 to 0 on Dis. 4–5, while the overall advantage in F_PN is maintained as the imbalance ratio increases from 20:1 to 100:1. (3) Besides GPPE, FL, CL, and HFL also perform competitively. On the ultra-high imbalance ratio dataset (100:1), gradient vanishing is observed in the FTL method and the model fails to train.

Figure 7 displays the comparison of \(F\_PN\) for 10 runs on Dis. 5. It is clear that in each round of experiments, the GPPE obtains the lowest \(F\_PN\) and also the best stability.

Fig. 7 The comparison of \(F\_PN\) for 10 runs on Dis. 5

Tables 4 and 5 present the results of imbalanced multi-category classification. Table 4 reports the accuracies of the minority classes as well as the majority classes. The results on Dis. 7 indicate that the overall accuracy of GPPE increases by 4.21%, 4.53%, and 4.5% compared to CE, FL, and ASL, respectively. Notably, while the accuracy on the majority classes decreases slightly, there is a significant improvement of 8.59%, 9.18%, and 9.12% on the minority classes, which is of greater concern. Similar results are obtained on Dis. 8, with increases of 5.26%, 5.03%, and 4.93% in overall accuracy, and of 10.42%, 10.05%, and 9.97% on the minority categories, respectively.

Table 4 Accuracy comparison on Dis. 7 and Dis. 8
Table 5 Metrics of different loss functions on Dis. 9–10 of MNIST
Table 6 Imbalanced data distribution on CIFAR10
Table 7 Precision, Recall and F_score comparison on Dis. 1–2 of CIFAR10

Table 5 reflects the performance of each method on Dis. 9–10 through accuracy, mF1, mAUC, and mAP. It can be concluded from Table 5 that: (1) GPPE outperforms the other three methods across the board on both imbalance-ratio datasets, which is difficult to achieve because the focus of each metric is not exactly the same. (2) The results on Dis. 9 show that GPPE improves mF1 by 4.16%, 4.71%, and 4.02% compared to CE, FL, and ASL, respectively, while improving mAUC by 0.18, 0.11, and 0.15. The results on Dis. 10 reveal similar findings, indicating that GPPE minimizes false positives while minimizing false negatives to the greatest extent.

In summary, the experiments on MNIST illustrate that: (1) GPPE is a highly competitive method for imbalanced datasets with a fixed number of samples in minority classes and a varying number of samples in majority classes. The effectiveness of GPPE is more pronounced in datasets with higher imbalances. (2) GPPE exhibits remarkable adaptability in imbalanced datasets with numerous majority classes and numerous minority classes.

Classification experiment on CIFAR10

CIFAR10 is a natural image dataset comprising ten categories, each with 5000 samples in the training set and 1000 samples in the test set. As with MNIST, binary and multi-category datasets are constructed on CIFAR10. The data distribution is shown in Table 6. Dis. 1–2 are imbalanced binary datasets with the category pair \((3, 8)\). The former is the minority class with a size of 500, while the latter is the majority class with sizes of 1000 and 3000, respectively; the imbalance ratios are thus 2:1 and 6:1. Dis. 3–4 are imbalanced multi-category datasets in which the last category is chosen as the majority, with imbalance ratios of 15:1 and 10:1. Unlike the multi-class datasets on MNIST, those constructed on CIFAR10 contain only one majority category with a fixed sample size, while the sample size of the minority categories varies. The experiments are performed 3 times on the binary and multi-category datasets for each loss function, and the final results are obtained by averaging.

Fig. 8 AUC and AP comparison on Dis. 1–2 of CIFAR10

Table 8 Accuracy, mF1, mAUC and mAP comparison on Dis. 3–4 of CIFAR10

The results on Dis. 1–2 of CIFAR10 are displayed in Table 7 and Fig. 8. As can be observed from Table 7, both the Recall and the F_score of GPPE are significantly improved at the cost of a slight decrease in Precision. The results on Dis. 1 reveal that the F_score of GPPE improves by 3.88%, 4.59%, 4.77%, 4.27%, 5.73%, and 5.22% compared to CE, FL, ASL, CL, FTL, and HFL, respectively. The results on Dis. 2 show improvements of 9.03%, 9.91%, 11.06%, 10.78%, 15.02%, and 12.66%, respectively. The gain on Dis. 2 is more pronounced, indicating that GPPE is more effective on datasets with higher imbalance ratios.

Figure 8 presents the comparison of AUC and AP on Dis. 1–2. On Dis. 1, GPPE is optimal in both AUC and AP, with AP improved by 0.0012, 0.003, 0.0011, 0.0026, 0.0067, and 0.0029, respectively; on Dis. 2, AP is enhanced by 0.0093, 0.0062, 0.0132, 0.0272, 0.1176, and 0.0377, respectively. Two similar conclusions can be drawn: (1) GPPE not only maintains a significant advantage in both AUC and AP, but the higher the imbalance ratio, the more dramatic the advantage. (2) The FTL method is less effective and hardly adapts to high imbalance ratio datasets.

Table 8 presents the imbalanced multi-category classification results on Dis. 3–4. From the findings on Dis. 3, it can be concluded that, compared with CE, FL, and ASL, GPPE shows a significant improvement in all metrics: the accuracy improves by 7.01%, 7.89%, and 6.89%; mF1 rises by 7.22%, 8.1%, and 7.17%; mAUC increases by 0.0622, 0.0234, and 0.0655; and mAP increases by 0.0786, 0.0431, and 0.0828. Similar conclusions can be drawn on Dis. 4. Comparatively, the effect on Dis. 3 is more significant, indicating that GPPE holds a greater advantage on highly imbalanced multi-category datasets.

To summarize, the experiments on CIFAR10 illustrate that: (1) GPPE is robust for imbalanced binary datasets with more complex structures. (2) The performance of GPPE is superior when dealing with imbalanced multi-class datasets that have a single majority class and numerous minority classes, and the dominance becomes more prominent as the number of samples from the minority class decreases.

Classification experiment on CIFAR100

CIFAR100 covers 60,000 images corresponding to 100 classes (600 images/class), which are further grouped into 20 superclasses. The standard train/test split is 500/100 images per class, or 2,500/500 per superclass. Imbalanced binary and multi-category classification experiments are carried out on CIFAR100, with the exact data distribution shown in Table 9. The training data for the binary datasets (Dis. 1–2) consist of large carnivores (superclass 8) and large omnivores (superclass 11), with imbalance ratios of 10:1 and 5:1. On the multi-category datasets (Dis. 3–4), the first five classes are set as majority classes and the rest as minority classes, with imbalance ratios of 10:1 and 5:1. Unlike the datasets constructed on MNIST and CIFAR10, the imbalanced datasets constructed on CIFAR100 are characterized by a fixed sample size for the majority classes and a variable sample size for the minority classes.

Table 9 Imbalanced data distribution on CIFAR100
Table 10 F_score, AUC and AP comparison on Dis. 1–2 of CIFAR100
Table 11 Accuracy, mF1, mAUC and mAP comparison on Dis. 3–4 of CIFAR100

Table 10 presents the binary classification results on CIFAR100. The results show that the F_score of GPPE is boosted by 14.69%, 14.55%, 13.98%, 14.18%, 17.61%, and 16.13% on Dis. 1, and by 10.54%, 12.82%, 13.05%, 11.67%, 16.6%, and 14.74% on Dis. 2. We can therefore conclude that GPPE performs outstandingly on high imbalance ratio datasets. The superiority of GPPE can also be observed in the AUC and AP.

Table 11 shows the results of the multi-category classification on CIFAR100. Compared with the other three methods, GPPE is superior on all metrics. On Dis. 3, the accuracy increases by 1.69%, 1.5%, and 1.24%, and mF1 rises by 2.04%, 1.72%, and 1.5%; on Dis. 4, the accuracy is enhanced by 0.58%, 1.44%, and 0.72%, and mF1 is boosted by 1.07%, 1.45%, and 1.04%. The comparison of the results on Dis. 3 and Dis. 4 again reveals that the method is more effective for datasets with higher imbalance ratios. Meanwhile, it is reasonable to conclude that GPPE is also competitive in imbalanced multi-category classification.

In conclusion, the results on CIFAR100 demonstrate that GPPE remains effective even when the number of samples in the majority class is constant and the number of samples in the minority class varies. Furthermore, it shows greater resilience to imbalanced datasets, regardless of whether they are binary or multi-category classification sets.

Classification experiment on Caltech101

Excluding the background category, Caltech101 contains 8,677 images divided into 101 categories, with the number of images per category varying from 31 to 800. The dataset is inherently imbalanced, as shown in Fig. 9, which displays the distribution of the categories with more than 60 samples. Unlike the step imbalance datasets constructed on MNIST, CIFAR10, and CIFAR100, Caltech101 exhibits linear imbalance, and the imbalance is also present in the test set. It is therefore necessary to predetermine which categories belong to the majority. In the experiment, the images are uniformly resized to \(65\times 65\), and 60%/20%/20% of the data are divided into training/cross-validation/test sets (each class is split according to this ratio). The first six categories (sample size \(>200\)) in Fig. 9 are defined as the majority classes.

Fig. 9 Data distribution on Caltech101

Table 12 presents the results obtained on Caltech101. GPPE shows a considerable improvement in accuracy compared to CE, FL, and ASL, with increases of 3.63%, 2.93%, and 2.01%, respectively, and improves the F_score by 5.34%, 5.46%, and 2.82%, respectively. The results also indicate notable benefits in mAUC and mAP, suggesting that GPPE is a strong performer on linearly imbalanced multi-class datasets.

Table 12 Accuracy, mF1, mAUC and mAP comparison on Caltech101

To summarize, the experiment on Caltech101 indicates that the GPPE also performs competitively for datasets that are naturally linearly imbalanced (including train and test sets).

Hyperparameter analysis

In this section, we demonstrate through experiments how the hyperparameters affect the classification of imbalanced data. The dataset consists of categories 3 and 8 extracted from CIFAR10, where category 3 contains 500 samples and category 8 contains 2000 samples, i.e., an imbalance ratio of 4:1. The experiments are divided into two groups. The first assesses the influence of the parameter \(\beta \) on classification performance with \(\alpha \) frozen (\(\alpha =-2\)); the second investigates the effect of the parameter \(\alpha \) with \(\beta \) frozen (\(\beta =6\)). The results of the two groups are shown in Table 13 and Fig. 10, respectively.

Table 13 The effect of parameter \(\beta \) (\(\alpha =-2\))
Fig. 10 The effect of parameter \(\alpha \) (\(\beta = 6\))

Table 13 indicates that with \(\alpha \) frozen, the Precision decreases slightly as \(\beta \) increases while the Recall increases significantly, showing that the influence of the negative samples on the model is weakening and, conversely, the gradient contribution of the positive samples is growing, which is exactly consistent with our gradient analysis (Fig. 2b). As illustrated in Fig. 10, with \(\beta \) frozen, the Precision increases slowly, the Recall decreases continuously, and the overall performance declines slightly as \(\alpha \) shrinks. This indicates that the influence of the negative class is strengthening and the gradient contribution of the positive class is suppressed, which coincides with the analysis in Fig. 2a. According to the gradient analysis in section “Method analysis” and the experiments, \(\{(-2,2),(-2,4),(-2,6)\}\) are excellent combinations of the hyperparameters \((\alpha , \beta )\).

Algorithm application

This section discusses the application of GPPE to the identification of pulsar candidates. A pulsar is a type of rapidly rotating neutron star, and the screening and identification of pulsar candidates is a classic imbalanced data classification task. HTRU is the first publicly accessible pulsar candidate dataset, containing 1,196 positive samples (pulsars) and 89,996 negative samples (RFI), an imbalance ratio of around 75:1. Unlike the datasets in section “Experimental analysis”, HTRU is therefore characterized by natural step imbalance. In the experiment, the time-vs-phase plot, abbreviated sub-int, is extracted as the experimental data and resized to \(64 \times 64\). Figure 11 depicts representative positive and negative samples.

Fig. 11 HTRU samples in the experiment. a, b are positive and negative samples respectively

We divide 30% of the raw data into the training set, 20% into the cross-validation set, and the remaining 50% into the test set; the exact data sizes are presented in Table 14. As far as pulsar identification is concerned, the cost of a false negative is much higher than that of a false positive, so among the evaluation metrics we focus more on the F_score than on the AUC. As a comparison method, random oversampling (ROS) is also applied to HTRU: we randomly oversample the positive samples, resulting in 27,000 samples for each of the positive and negative classes (a sketch of this resampling is given below). The experimental results are summarized in Table 15, and the time consumed by the various methods in a single training epoch is displayed in Fig. 12.
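For reference, the ROS baseline can be sketched as follows. This is our own illustration; only the target size of 27,000 samples per class comes from the text.

```python
import numpy as np

def random_oversample(X, y, target=27000, seed=0):
    """Resample both classes to `target` examples each; the minority
    is drawn with replacement (ROS), the majority without."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in (0, 1):
        c_idx = np.where(y == c)[0]
        idx.append(rng.choice(c_idx, size=target,
                              replace=len(c_idx) < target))
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]
```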

Table 14 The split HTRU dataset
Table 15 Comparison of experimental results on HTRU
Fig. 12 Time consumed in a single training epoch on HTRU

From Table 15, it can be observed that: (1) ROS is completely ineffective when both the training and test sets are imbalanced. On closer analysis, we believe there are two reasons for this: overfitting and the imbalance of the test set. ROS oversamples the positive samples by a factor of roughly 75, so the trained model overfits the positive class. The experimental results show an FP of 16,656 and an FN of 3; since there are only 596 positive samples in the test set, a Precision of 0.0344 follows from Eq. (13). (2) The GPPE method is extremely competitive. In terms of F_score, the comprehensive performance of GPPE improves on CE, FL, ASL, CL, FTL, and HFL by 1.89%, 1.14%, 1.04%, 0.81%, 1.19%, and 3.2%, respectively; in terms of Recall, it is lifted by 4.2%, 4.7%, 2.85%, 1.18%, 4.36%, and 6.04%. The Precision of GPPE is not optimal, but this is perfectly acceptable for pulsar identification, which is more concerned with false negatives. As can be seen from Fig. 12, GPPE also costs slightly less time per epoch than the other methods, so the method exhibits a slight advantage in efficiency as well. In short, GPPE can significantly reduce the interfering signals while preserving the pulsar signals, making this strategy well suited to the identification of pulsar candidates.

Discussion and conclusion

This study analyzes the principle, advantages, and disadvantages of the focal loss and the asymmetric focal loss, and presents an asymmetric gradient penalty approach based on the power exponential function for imbalanced data classification. The rationality of the method is examined from the perspective of the gradient. Comparative experiments on the MNIST, CIFAR10, CIFAR100, and Caltech101 datasets illustrate the performance of the algorithm, and the method is finally applied to the imbalanced pulsar candidate dataset HTRU. Interesting directions for future investigation include employing a similar gradient strategy on the positive class to improve Precision, and combining the method with GANs to improve the diversity and controllability of generated data on imbalanced datasets.