Introduction

The Five-hundred-meter Aperture Spherical radio Telescope (FAST), also known as the “China Sky Eye”, is a major national project and the world’s largest single-aperture radio telescope. It is located in Pingtang County, Guizhou Province. Among FAST’s various scientific goals is the search for pulsars. The discovery of new pulsars and their subsequent observation with FAST can provide new means to address fundamental problems in physics, such as testing general relativity, exploring the black hole at the center of the Milky Way, probing the interstellar medium of the Milky Way, and constraining the internal state of neutron stars. In this study, we investigate imbalanced data classification and apply it to the pulsar candidate dataset to exclude interfering signals as far as possible while ensuring no loss of pulsar signals.

Image classification is a classic and enduring research problem in machine learning. With the advent of deep learning and the development of GPU hardware, deep convolutional neural networks [1, 2] have achieved great success in this field. In most practical situations, however, there is a considerable imbalance among the categories, i.e., the sample size of one or more classes is significantly higher than that of the others. Such datasets are referred to as imbalanced datasets; the class with the larger number of samples is called the negative class, while the other is the positive class. In medical clinical data, for example, the majority of persons are healthy and only a tiny fraction are diseased cases [3]. In the pulsar candidate dataset, most of the samples are noise, and just a few are viable candidates [4]. The phenomenon of data imbalance is similarly widespread in fields such as fraud detection and bioinformatics. Therefore, the study of imbalanced data classification holds great significance and has a broad range of applications.

Most machine learning algorithms are designed under the assumption of class balance, i.e., that the cost of misclassification is the same for every category. When such methods are applied to an imbalanced dataset, the learned model therefore skews toward the majority class. This is because, under class imbalance, the majority class essentially dominates the gradient responsible for updating the model weights, increasing the misclassification of the minority class. Various strategies have been proposed to augment the weight of the minority class, either by altering the data distribution or by improving the loss function. Methods that alter the data distribution are easy to implement and can yield promising results in some fields. However, they also introduce a considerable quantity of redundant data, resulting in overfitting [5], or discard critical information, leading to underfitting. Such strategies may also inflate the computational cost: oversampling the pulsar candidate dataset HTRU, which contains 1,196 positive samples and 89,996 negative samples, for example, might have disastrous repercussions. In recent years, researchers have increasingly turned to improving the loss function to alleviate the model bias, because it involves neither additional computation nor information redundancy or loss. On this basis, the study explores imbalanced data classification from the perspective of improving the loss function and presents an asymmetric gradient penalty approach based on the power exponential function (Gradient Penalty based on Power Exponential Function, GPPE).

The study is motivated by the following three considerations: (1) Negative samples are widely available, and a considerable part of them are easily identifiable or carry overlapping information; such data should provide lower gradients to the model during training. (2) The majority class may contain singular or mislabeled samples, so it is crucial to free model optimization from the entanglement of such samples. (3) “Hard” negative samples, with predicted probabilities of 0.6 to 0.9, should be given a larger gradient to optimize the model.

The main contributions of the study are summarized in the following three aspects.

(i) A novel imbalanced data classification method is proposed based on the power exponential function. Then, the rationality of the method and the selection of the hyperparameters are discussed from the perspective of the gradient.

(ii) A number of imbalanced binary datasets with varying categories and imbalance ratios are extracted from MNIST, CIFAR10, and CIFAR100 to demonstrate the effectiveness of the methodology. Finally, the method is applied to the pulsar candidate dataset HTRU.

(iii) The proposed method is extended to address imbalanced multi-category classification. Step and linear imbalanced datasets are created from well-known datasets such as MNIST, CIFAR10, CIFAR100, and Caltech101 to validate the performance.

The study is organized as follows: section “Related work” presents an overview of existing research on the classification of imbalanced data. The proposed method is described in section “Methods”, along with a theoretical analysis. Experimental evidence of the method’s reliability is presented in section “Experimental analysis”. Section “Algorithm application” applies the methodology to the pulsar candidate dataset. Finally, we conclude the study and outline directions for further investigation.

Related work

Imbalanced data classification is a hot and tricky subject in the realm of artificial intelligence. Broadly, the existing work can be divided into three categories: data-level, algorithm-level, and hybrid techniques [6, 7]. To begin, the data-level strategy focuses on altering the data distribution to raise the weight of minorities through random oversampling (ROS) or random undersampling (RUS). Methods often employed include the synthetic minority oversampling technique (SMOTE) [8] and its variants Borderline-SMOTE, Safe-Level-SMOTE, and so on. With the proposal and refinement of Generative Adversarial Nets (GANs) [9, 10], oversampling approaches based on GANs have received increasing attention in recent years and are being utilized for imbalanced data classification [11,12,13]. In contrast to SMOTE, which is based on Euclidean distance, a GAN learns the data distribution through a network model and samples the minority class to rebalance the dataset. In principle, this is an ideal sampling approach; in practice, however, stability and convergence remain barriers to its widespread adoption.

Second, the algorithm-level method enhances the loss function in order to boost the weight of the minority class in model optimization. Unlike the data-level strategy, this methodology does not change the original data distribution and hence adds no additional computational cost to the network. These approaches are primarily realized by designing new loss functions, implementing cost-sensitive learning, or altering decision thresholds [14]. The most common way is to build an advanced loss function, such as the Mean False Error loss (MFE) [15], the Focal loss (FL) [16], and the Asymmetric Focal loss (ASL) [17]. In the field of image segmentation, especially for medical images, the Combo loss (CL) [18], a weighted sum of the Dice loss [19] and a modified cross-entropy, was proposed to deal effectively with data imbalance. A significant limitation of the Dice loss is that it weights false positives and false negatives equally, leading to high precision but low recall in image segmentation [20]. Therefore, the literature [21] suggests replacing the Dice loss with the Tversky loss while exponentiating it with a focal parameter to focus on hard classes, defining the Focal Tversky loss (FTL). Furthermore, the Hybrid Focal loss (HFL) is defined as the sum of the Focal Tversky loss and the Focal loss in [22]. By allocating different costs to various categories, cost-sensitive learning overcomes the category imbalance [23]. Typically, the cost of misclassifying the minority is substantially higher than that of misclassifying the majority; however, determining the cost matrix is not easy. Cost-sensitive deep neural networks (DNN) [24] and cost-sensitive CNNs that learn the cost matrix (CoSen) [25] are two common strategies. The threshold modification approach achieves category balance by altering the decision threshold at the output layer of the network [26].

Finally, hybrid approaches combine data-level and algorithm-level methods; the most common are deep over-sampling (DOS) [27], large margin local embedding (LMLE), and class rectification with hard sample mining (CRL) [28].

FL and ASL are two loss functions that have gained significant attention for their improved performance. Under FL, positive and negative samples face identical gradient penalties; however, a higher gradient penalty on the negative samples also suppresses the positive samples’ contribution to the model [17]. The ASL technique then employs a decoupled asymmetric approach to penalize only the negatives, but while it satisfies considerations (1) and (2) in the introduction, it ignores (3). As a consequence, this study proposes an asymmetric gradient penalty technique based on the power exponential function (GPPE) for imbalanced data classification. According to the analysis, the proposed strategy is a deeper integration of FL and ASL, and it is also clearly superior to these two loss functions in terms of evaluation metrics.

Methods

In this section, we introduce imbalanced data classification, the cross-entropy loss, and its augmented variants FL and ASL. Then, we analyze the difficulties these loss functions encounter when applied to imbalanced data and offer a novel technique.

Binary classification and cross entropy loss

Suppose the dataset containing N samples is denoted as \(X=\{(x_i, y_i): i=1,2,\ldots ,N\}\), where \(x_i \in R^d\) indicates that each sample is a point in d-dimensional space and \(y_i \in \{0, 1\}\) means that the label is binary. If the size \(N_1\) of the positive category and the size \(N_0\) of the negative category satisfy \( N_0 \gg N_1\), then the dataset is said to be imbalanced, and \(IR=N_0/N_1\) is known as the imbalance ratio. In general, we define \(y=0\) as the majority class, i.e., the negative class, and \(y=1\) as the minority class, i.e., the positive class. Generally speaking, the higher the imbalance ratio, the more difficult the dataset is to classify. The most commonly used loss function for binary classification is the cross-entropy loss, which can be expressed as:

$$\begin{aligned} L_{\textrm{CE}}(p,y)=-\sum _{i=1}^N\left( y_i\log (p_i)+(1-y_i)\log (1-p_i)\right) \end{aligned}$$
(1)

where \(p_i=P(y=1 \vert x_i)\) represents the probability that the sample \(x_i\) is predicted to be positive. If the output of \(x_i\) after passing through the feature extractor is denoted as \(z_i\), then \(p_i\) can be calculated by the Sigmoid function or the Softmax function. As in the literature [16], if we define \(p_t\) as:

$$\begin{aligned} p_t= {\left\{ \begin{array}{ll} p &{}\quad y=1 \\ 1-p &{}\quad y=0 \end{array}\right. } \end{aligned}$$
(2)

Then, the cross entropy loss can be expressed as \(L_{\textrm{CE}}(p,y)=L_{\textrm{CE}}(p_t)=-\log (p_t)\). To address the imbalance more effectively, Lin et al. developed the focal loss based on cross entropy loss. The focal loss is defined as:

$$\begin{aligned} L_{\textrm{FL}}(p_t)=-(1-p_t)^\alpha \log (p_t) \end{aligned}$$
(3)

where \(\alpha \) is a hyperparameter and \((1-p_t)^\alpha \) is the dynamic scaling factor, which shifts the focus of model optimization away from easily recognizable examples and toward hard ones. By splitting the positive and negative samples, the focal loss can be further decomposed as:

$$\begin{aligned} L_{\textrm{FL}}(p)=L_{\textrm{FL}}^+(p) + L_{\textrm{FL}}^-(p) \end{aligned}$$
(4)

where \(L_{\textrm{FL}}^+(p)=-(1-p)^\alpha \log (p)\) and \(L_{\textrm{FL}}^-(p)=-p^\alpha \log (1-p)\). Ridnik et al. [17] subsequently argued that while a large \(\alpha \) effectively reduces the contribution of negative samples, it also eliminates the gradient contribution of positive samples. As a result, they proposed a decoupled asymmetric focal loss, represented by

$$\begin{aligned} L_{\textrm{ASL}}(p)=L_{\textrm{ASL}}^+(p) + L_{\textrm{ASL}}^-(p) \end{aligned}$$
(5)

where \(L_{\textrm{ASL}}^+(p)=-(1-p)^{\alpha _+}\log (p)\) and \(L_{\textrm{ASL}}^-(p)=-p^{\alpha _-}\log (1-p)\). \({\alpha _+}\) and \({\alpha _-}\) are the regulatory factors of the positive and negative samples, respectively, and satisfy \({\alpha _-} > {\alpha _+}\). At the same time, to reduce the entanglement of easily identifiable negative samples in model optimization, the probability of negative samples is shifted, namely

$$\begin{aligned} L_{\textrm{ASL}}^-(p)=-p_{\gamma }^{\alpha _-}\log (1-p_{\gamma }) \end{aligned}$$
(6)

where \(p_{\gamma }= \max (p-\gamma , 0)\), and \(\gamma \) is a hyperparameter denoting the probability shifting threshold.
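For concreteness, the two losses can be sketched in PyTorch in their binary forms as follows. This is our own minimal illustration; the function names, the default values, and the mean reduction are assumptions rather than the reference implementations of [16, 17].

```python
import torch

def focal_loss(p, y, alpha=2.0):
    """Focal loss (FL), Eq. (4): one shared exponent alpha
    modulates both the positive and the negative term."""
    eps = 1e-8  # keeps log() finite at p = 0 or p = 1
    loss_pos = -((1 - p) ** alpha) * torch.log(p + eps)        # L_FL^+
    loss_neg = -(p ** alpha) * torch.log(1 - p + eps)          # L_FL^-
    return (y * loss_pos + (1 - y) * loss_neg).mean()

def asymmetric_loss(p, y, alpha_pos=0.0, alpha_neg=4.0, gamma=0.05):
    """Asymmetric focal loss (ASL), Eqs. (5)-(6): decoupled exponents,
    plus the probability shift p_gamma = max(p - gamma, 0) on negatives."""
    eps = 1e-8
    p_shift = torch.clamp(p - gamma, min=0.0)
    loss_pos = -((1 - p) ** alpha_pos) * torch.log(p + eps)            # L_ASL^+
    loss_neg = -(p_shift ** alpha_neg) * torch.log(1 - p_shift + eps)  # L_ASL^-
    return (y * loss_pos + (1 - y) * loss_neg).mean()
```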

Asymmetric gradient penalty based on power exponential function

Three factors are taken into account in our design. First, the widely available negative samples may be interspersed with mislabeled or singular samples. This issue is also considered in the literature [29], which offers a negative sample reduction technique based on Euclidean distance, although that technique is not suitable for high-dimensional image datasets. In contrast, we approach the problem from the standpoint of the loss function. If the model classifies a negative sample as positive with a very high probability (e.g., greater than 0.95), it is reasonable to believe that it is a mislabeled point or singularity, and the sample’s contribution to model optimization should be discarded (gradient of 0), which is consistent with the literature [17]. Second, for the negative category, samples with excessively high or low prediction probability should not interfere with model optimization (i.e., they should provide a smaller gradient). The model should concentrate on samples with prediction probabilities between 0.6 and 0.9, and the loss function is supposed to offer a bigger gradient for such examples. Finally, the dynamic modifiers of both FL and ASL are power functions of the probability, i.e., they decay with a fixed exponent. According to our analysis, the decay rate should itself be a probability-related variable, which implies that the dynamic adjustment factor should be a power exponential function of the probability. Based on the foregoing considerations, the proposed loss function is as follows:

$$\begin{aligned} L_{\textrm{GPPE}}(p)=L_{\textrm{GPPE}}^+(p) + L_{\textrm{GPPE}}^-(p) \end{aligned}$$
(7)

where \(L_{\textrm{GPPE}}^+(p)=-(1-p)^\theta \log (p)\), \(L_{\textrm{GPPE}}^-(p)=-p^{(\alpha p+ \beta )}\log (1-p)\), and \(\theta \), \(\alpha \), \(\beta \) are the hyperparameters. \(\theta \) is a constant and defaults to 0. Furthermore, to meet the aforesaid requirements, we adopt a trick similar to that used in the literature [17], namely shifting the probability. Unlike them, however, probability shifting in this study is considered exclusively from the perspective of the gradient (see section “Method analysis”). Hence the threshold is very low, set to \(\gamma =0.05\) in the experiments, and no probability shifting is performed in the dynamic adjustment factor. To sum up, \(L_{\textrm{GPPE}}^-(p)\) can be further expressed as:

$$\begin{aligned} L_{\textrm{GPPE}}^-(p)=-p^{(\alpha p+ \beta )}\log (1-p_{\gamma }) \end{aligned}$$
(8)

where \(p_{\gamma }= \max (p-\gamma , 0)\). When \(\alpha \) is 0 and the probability shifting also covers the dynamic adjustment factor, the method in this study becomes ASL. Meanwhile, if the same decay factor is shared by positive and negative samples, the method degenerates to FL. Furthermore, if the hyperparameters are chosen so that the exponent vanishes (\(\alpha = \beta = 0\), with no shifting), the method degrades to the CE loss. As a result, the proposed loss function is an extension and integration of CE, FL, and ASL. According to the definition of \(p_t\), the formula of GPPE can be rewritten as:

$$\begin{aligned} L_{\textrm{GPPE}}(p_t)=-(1-p_t)^{\phi (p_t)}\log (\varphi (p_t)) \end{aligned}$$
(9)

where

$$\begin{aligned} \phi (p_t)= {\left\{ \begin{array}{ll} \theta &{}\quad y=1 \\ \alpha (1-p_t)+\beta &{}\quad y=0 \end{array}\right. } \end{aligned}$$
(10)

and

$$\begin{aligned} \varphi (p_t)= {\left\{ \begin{array}{ll} p_t &{}\quad y=1 \\ \min (p_t+\gamma , 1) &{}\quad y=0 \end{array}\right. } \end{aligned}$$
(11)

The algorithm flow is as follows:

Algorithm 1 Calculate loss of GPPE
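In place of the algorithm figure, the following is a minimal PyTorch sketch of the binary GPPE loss of Eqs. (7)–(8). The defaults \(\theta =0\) and \(\gamma =0.05\) follow the text, while the interface and the mean reduction are our assumptions; the authors’ released code is at https://github.com/gzmtzly/GPPE.

```python
import torch

def gppe_loss(logits, targets, alpha=-2.0, beta=6.0, theta=0.0, gamma=0.05):
    """Binary GPPE loss, Eqs. (7)-(8).

    logits  -- raw network outputs z, shape (N,)
    targets -- labels y in {0, 1}, shape (N,), minority class = 1
    Negatives get the power exponential modulator p^(alpha*p + beta);
    positives use the fixed exponent theta (theta = 0 reduces L^+ to CE).
    """
    eps = 1e-8
    p = torch.sigmoid(logits)
    p_shift = torch.clamp(p - gamma, min=0.0)   # p_gamma = max(p - gamma, 0)

    # L_GPPE^+(p) = -(1 - p)^theta * log(p)
    loss_pos = -((1 - p) ** theta) * torch.log(p + eps)
    # L_GPPE^-(p) = -p^(alpha*p + beta) * log(1 - p_gamma);
    # the shift is applied only inside the log, not in the modulator
    loss_neg = -(p ** (alpha * p + beta)) * torch.log(1 - p_shift + eps)

    return (targets * loss_pos + (1 - targets) * loss_neg).mean()
```

For a network with a single output unit, a call might look like `loss = gppe_loss(model(x).squeeze(1), y.float())`.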

Fig. 1 Image of loss function and gradient. a Loss function image, b gradient image

Fig. 2 The effect of the parameters on the gradient. a, b are the effects of \(\alpha \) and \(\beta \) on the gradient respectively

Method analysis

The gradient is the driving force of network parameter updates. In this section, we investigate the rationality of the proposed loss function and the choice of the hyperparameters from the perspective of the gradient. Since the loss calculation for positive samples is the same as that of CE, only the gradient of negative samples is considered in the analysis. The sigmoid function is used as the classifier. According to the chain rule, the gradient of GPPE with respect to the output z of the network is as follows:

$$\begin{aligned} \frac{\textrm{d}L_{\textrm{GPPE}}}{\textrm{d}z} = \frac{\textrm{d}L_{\textrm{GPPE}}}{\textrm{d}p}\,\frac{\textrm{d}p}{\textrm{d}z} = p^{\alpha p+\beta } \left[ \frac{1}{1-p_{\gamma }}-\left( \alpha \log (p) + \frac{\alpha p + \beta }{p}\right) \log (1-p_{\gamma })\right] p(1-p) \end{aligned}$$
(12)
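As a sanity check (ours, not part of the original experiments), Eq. (12) can be verified against PyTorch’s automatic differentiation on the negative-sample term. The closed form assumes \(p > \gamma \), so that \(p_{\gamma } = p - \gamma \); the logits below are chosen accordingly.

```python
import torch

alpha, beta, gamma = -2.0, 6.0, 0.05

z = torch.linspace(-2.5, 4.0, 9, requires_grad=True)  # sigmoid(z) > gamma here
p = torch.sigmoid(z)
p_shift = torch.clamp(p - gamma, min=0.0)

# Negative-sample GPPE loss, Eq. (8)
loss = -(p ** (alpha * p + beta)) * torch.log(1 - p_shift)
loss.sum().backward()   # autograd gradient dL/dz

with torch.no_grad():
    # Closed form of Eq. (12)
    grad = (p ** (alpha * p + beta)) * (
        1 / (1 - p_shift)
        - (alpha * torch.log(p) + (alpha * p + beta) / p)
        * torch.log(1 - p_shift)
    ) * p * (1 - p)

print(torch.allclose(z.grad, grad, atol=1e-5))  # True
```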

Figure 1 depicts the various loss functions and their accompanying gradients.

Figure 1a shows the different loss functions. When \(p \rightarrow 0\), the proposed loss tends to zero and converges faster than CE and FL. Therefore, the model is less influenced by easily classified negative samples. On the other hand, as \(p \rightarrow 1\), the loss of both CE and FL tends to infinity, whereas GPPE converges to \(-\log (\gamma )\). Figure 1b shows the gradient of each loss function. It can be seen that GPPE yields a lower gradient for easy samples (probability less than 0.5), concentrates more gradient on hard samples (probability 0.6 to 0.9), which may be adjusted using the parameters \(\alpha \) and \(\beta \) (see Fig. 2), and assigns a gradient close to 0 to singular negative samples (probability tending to 1). Therefore, GPPE meets the three requirements stated in the introduction. Although ASL shares some similarities with GPPE, the latter offers greater flexibility in describing hard samples and enforces a stricter gradient penalty on easy samples. Theoretically, the proposed strategy is better adapted to imbalanced datasets, especially those with a larger imbalance ratio. Figure 2 shows the influence of the hyperparameters on the gradient.

Table 1 Binary classification confusion matrix

As seen in Fig. 2a, when \(\beta \) is set to 6, the crest of the gradient curve shifts to the right and narrows as \(\alpha \) increases. It can be inferred that the gradient of the negative samples becomes zero as \(\alpha \rightarrow +\infty \), in which case the negative class contributes nothing to the optimization of the model while the positive class provides the entire gradient. Conversely, as \(\alpha \) decreases, the crest shifts to the left and widens, increasing the contribution of easy negative samples to the gradient and inhibiting the gradient contribution of positive samples. Similarly, Fig. 2b shows that when \(\alpha \) is set to \(-2\), the crest moves to the right and becomes narrower as \(\beta \) increases, thus reducing the gradient contribution of the negative samples. Therefore, \(\alpha \) and \(\beta \) should be chosen with caution. The gradient analysis suggests that \(\alpha = -2\) with \(\beta \) set to 2, 4, or 6 may best fulfill the three requirements in the introduction.
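The crest behaviour described above can also be reproduced numerically. The following sketch (our own illustration) locates the peak of the negative-sample gradient of Eq. (12) for the hyperparameter settings discussed:

```python
import numpy as np

def neg_grad(p, alpha, beta, gamma=0.05):
    """Negative-sample gradient of GPPE w.r.t. the logit z, Eq. (12)."""
    ps = np.maximum(p - gamma, 0.0)
    mod = p ** (alpha * p + beta)
    bracket = 1 / (1 - ps) - (alpha * np.log(p)
                              + (alpha * p + beta) / p) * np.log(1 - ps)
    return mod * bracket * p * (1 - p)

p = np.linspace(0.06, 0.999, 5000)
for alpha, beta in [(-2, 2), (-2, 4), (-2, 6)]:
    g = neg_grad(p, alpha, beta)
    print(f"alpha={alpha}, beta={beta}: crest at p = {p[np.argmax(g)]:.2f}")
# With alpha fixed at -2, larger beta moves the crest toward higher
# probabilities, concentrating the penalty on hard negatives.
```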

Experimental analysis

Next, we conduct comparative experiments on different datasets with different imbalance ratios to further demonstrate the effectiveness of the algorithm. The experiments are divided into two parts. First, using the widely used MNIST, CIFAR10, CIFAR100, and Caltech101, several groups of categories are selected for imbalanced binary and multi-category classification experiments, and the results are compared with CE, FL, CL, ASL, FTL, and HFL. Second, the method is applied to the pulsar candidate dataset HTRU [30] to demonstrate its applicability.

Evaluation metrics

For imbalanced binary classification, the Recall, the Precision, and their harmonic mean, the F_score, are frequently employed to assess model performance. Table 1 defines the confusion matrix of binary classification. The Precision, Recall, and F_score are then calculated as:

$$\begin{aligned} \text {Precision} = \frac{\text {TP}}{\text {TP}+\text {FP}} \end{aligned}$$
(13)
$$\begin{aligned} \text {Recall} = \frac{\text {TP}}{\text {TP}+\text {FN}} \end{aligned}$$
(14)
$$\begin{aligned} \text {F}\_{\text {score}} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(15)

In addition, the area under the receiver operating characteristic (ROC) curve (AUC) and the average precision (AP) are important indicators of model performance. Of course, on easily identifiable datasets like MNIST, the various loss functions all perform well on the metrics above, making it challenging to observe significant differences. For such datasets, the sum of the numbers of false positives and false negatives, abbreviated F_PN, is therefore used as the performance metric. From the confusion matrix, F_PN is calculated as

$$\begin{aligned} \text {F}\_{\text {PN}} = \text {FP} + \text {FN} \end{aligned}$$
(16)

For imbalanced multi-category classification, the macro F_score (mF1), mean AUC (mAUC), and mean AP (mAP) are the main metrics. In addition, the mean accuracies of the majority and minority classes, abbreviated Acc_maj and Acc_min, are employed to compare model performance.
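For reference, a minimal sketch (our own helper, not from the paper’s code) computing the binary metrics of Eqs. (13)–(16) from predicted and true labels:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, Recall, F_score (Eqs. 13-15) and F_PN (Eq. 16);
    the minority (positive) class is labelled 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score, fp + fn   # last value is F_PN
```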

Network structure

The purpose of the experiments is to evaluate the classification performance of the various loss functions. Therefore, all settings except the loss function are kept identical. A convolutional neural network (CNN) is used as the feature extractor on all datasets, although the network depth varies with the dataset. Figures 3, 4, 5 and 6 show the CNN structures used on MNIST, CIFAR10/CIFAR100, Caltech101, and HTRU, respectively.

Fig. 3 Network structure on MNIST

Fig. 4 Network structure on CIFAR10/CIFAR100

Fig. 5 Network structure on Caltech101

Fig. 6 Network structure on HTRU

All experiments are implemented in PyTorch. The batch size is uniformly set to 100, the number of training epochs to 100, and the learning rate to 0.001. Adam is selected as the optimizer, with parameters \(\beta _1 = 0.5\) and \(\beta _2 = 0.999\). The code and hyperparameters of ASL are available at https://github.com/Alibaba-MIIL/ASL and ours at https://github.com/gzmtzly/GPPE.
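Under these settings, the shared training procedure can be sketched as follows. This is our own minimal illustration reusing the `gppe_loss` sketch above, with a stand-in network and random placeholder data in place of the CNNs of Figs. 3, 4, 5 and 6 and the real datasets; only the optimizer settings, batch size, and epoch count come from the text.

```python
import torch
import torch.nn as nn

# Stand-in network with a single output logit; the actual CNNs are
# those shown in Figs. 3-6 and differ per dataset.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 14 * 14, 1),
)

# Placeholder data standing in for one of the imbalanced subsets.
X = torch.randn(3050, 1, 28, 28)
y = (torch.arange(3050) < 50).float()        # 50 positives, 3000 negatives
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=100, shuffle=True)

# Settings from the text: lr 0.001, Adam with betas (0.5, 0.999)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.5, 0.999))

for epoch in range(100):                     # 100 epochs, batch size 100
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = gppe_loss(model(xb).squeeze(1), yb)  # GPPE sketch above
        loss.backward()
        optimizer.step()
```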

Classification experiment on MNIST

MNIST is a handwritten digit dataset with 10 categories, comprising 60,000 training images and 10,000 test images. Two types of imbalanced datasets, binary and multi-category, are extracted as experimental data. The data distribution is shown in Table 2.

Table 2 Imbalanced data distribution on MNIST

For the binary datasets \(\{(C_i, C_j) \}\), the former category is positive with a size of 50, whereas the latter is negative with sizes of 1000, 3000, and 5000, respectively (Dis. 1–6). The imbalance ratios of the binary datasets constructed on MNIST are therefore 20:1, 60:1, and 100:1. In total we obtain 6 different training datasets, and the test data are extracted from the original test set with the corresponding categories \(\{(C_i,C_j) \}\); a construction sketch is given below. For the multi-category datasets, Dis. 7 and Dis. 8 take the former 5 categories as minorities with 50 samples each, and the latter 5 as majorities with 3000 and 5000 samples, respectively. Dis. 9 and Dis. 10 choose the even classes as the minority, with data distributions similar to Dis. 7 and Dis. 8. Thus, the imbalanced datasets constructed on MNIST are characterized by a fixed sample size for the minority classes and a variable sample size for the majority classes. In addition, the multi-category datasets contain several minority categories and several majority categories.
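For illustration, such a binary subset can be carved out of torchvision’s MNIST as follows. The helper and the example class pair are our own; only the subset sizes follow Table 2.

```python
import torch
from torchvision import datasets, transforms

def imbalanced_pair(pos_class, neg_class, n_pos=50, n_neg=3000, seed=0):
    """Binary subset (C_i, C_j): n_pos minority and n_neg majority samples."""
    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    g = torch.Generator().manual_seed(seed)
    pos_idx = torch.nonzero(mnist.targets == pos_class).squeeze(1)
    neg_idx = torch.nonzero(mnist.targets == neg_class).squeeze(1)
    pos_idx = pos_idx[torch.randperm(len(pos_idx), generator=g)[:n_pos]]
    neg_idx = neg_idx[torch.randperm(len(neg_idx), generator=g)[:n_neg]]
    keep = torch.cat([pos_idx, neg_idx]).tolist()
    return torch.utils.data.Subset(mnist, keep)

# e.g. an IR = 60:1 pair; the class indices here are illustrative
train_subset = imbalanced_pair(pos_class=3, neg_class=8, n_pos=50, n_neg=3000)
```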

Table 3 FN+FP of different loss functions on MNIST

The experiments are performed 10 times on the binary datasets and 5 times on the multi-category datasets for each loss function, and the final results are obtained by averaging. Here, \(\text {F}\_{\text {PN}}\) is used to compare model performance on the binary datasets, and the accuracies of the minority and majority classes are used for the multi-category datasets. Six loss functions, namely CE, FL, ASL, CL, FTL, and HFL, are compared in the binary classification experiments, while only the first three are employed for the multi-category classification.

Table 3 presents the experimental results on the binary datasets (Dis. 1–6), from which three conclusions can be drawn: (1) For each dataset, the F_PN of GPPE is the lowest, meaning the error rate is the lowest. In comparison to CE, FL, ASL, CL, FTL, and HFL, the F_PN of GPPE falls by 105, 105, 111, 89, 131, and 95 on Dis. 1; by 104, 111, 109, 102, 177, and 94 on Dis. 2; and by 114, 123, 97, 109, and 131 (apart from FTL) on Dis. 3. (2) GPPE performs better on extremely imbalanced datasets. The FP of GPPE decreases from 11 to 2 on Dis. 1–3 and from 11 to 0 on Dis. 4–5, while the overall advantage in F_PN is maintained as the imbalance ratio increases from 20:1 to 100:1. (3) Besides GPPE, FL, CL, and HFL also perform competitively. On the ultra-high imbalance ratio dataset (100:1), gradient vanishing is observed in the FTL method and the model fails to train.

Figure 7 displays the comparison of \(F\_PN\) for 10 runs on Dis. 5. It is clear that in each round of experiments, the GPPE obtains the lowest \(F\_PN\) and also the best stability.

Fig. 7 The comparison of \(F\_PN\) for 10 runs on Dis. 5

Tables 4 and 5 present the results of imbalanced multi-category classification. Table 4 reports the accuracies of the minority classes as well as the majority classes. The results on Dis. 7 indicate that the overall accuracy of GPPE increases by 4.21%, 4.53%, and 4.5% compared to CE, FL, and ASL, respectively. Notably, while the accuracy on the majority classes decreases slightly, there is a significant improvement of 8.59%, 9.18%, and 9.12% on the minority classes, which is of greater concern. Similar results are obtained on Dis. 8, with increases of 5.26%, 5.03%, and 4.93% in overall accuracy, and of 10.42%, 10.05%, and 9.97% on the minority categories, respectively.

Table 4 Accuracy comparison on Dis. 7 and Dis. 8
Table 5 Metrics of different loss functions on Dis. 9–10 of MNIST
Table 6 Imbalanced data distribution on CIFAR10
Table 7 Precision, Recall and F_score comparison on Dis. 1–2 of CIFAR10

Table 5 reflects the performance of each method on Dis. 9–10 through accuracy, mF1, mAUC, and mAP. It can be concluded from Table 5 that: (1) GPPE outperforms the other three methods across the board on both imbalance-ratio datasets, which is difficult to achieve because the focus of each metric is not exactly the same. (2) The results on Dis. 9 show that GPPE improves mF1 by 4.16%, 4.71%, and 4.02% compared to CE, FL, and ASL, respectively, while improving mAUC by 0.18, 0.11, and 0.15. The results on Dis. 10 reveal similar findings, indicating that GPPE minimizes false positives while minimizing false negatives to the greatest extent.

In summary, the experiments on MNIST illustrate that: (1) GPPE is a highly competitive method for imbalanced datasets with a fixed number of samples in minority classes and a varying number of samples in majority classes. The effectiveness of GPPE is more pronounced in datasets with higher imbalances. (2) GPPE exhibits remarkable adaptability in imbalanced datasets with numerous majority classes and numerous minority classes.

Classification experiment on CIFAR10

CIFAR10 is a natural image dataset comprising ten categories, each with 5000 samples in the training set and 1000 samples in the test set. As with MNIST, binary and multi-category datasets are constructed on CIFAR10. The data distribution is shown in Table 6. Dis. 1–2 are imbalanced binary datasets with the category pair \((3, 8)\). The former is the minority class with a size of 500, while the latter is the majority class with sizes of 1000 and 3000, respectively; the imbalance ratios are thus 2:1 and 6:1. Dis. 3–4 are imbalanced multi-category datasets in which the last category is chosen as the majority, with imbalance ratios of 15:1 and 10:1. Unlike the multi-class datasets on MNIST, those constructed on CIFAR10 contain only one majority category with a fixed sample size, while the sample size of the minority categories varies. The experiments are performed 3 times on the binary and multi-category datasets for each loss function, and the final results are obtained by averaging.

Fig. 8 AUC and AP comparison on Dis. 1–2 of CIFAR10

Table 8 Accuracy, mF1, mAUC and mAP comparison on Dis. 3–4 of CIFAR10

The results on Dis. 1–2 of CIFAR10 are displayed in Table 7 and Fig. 8. As can be observed from Table 7, both the Recall and the F_score of GPPE are significantly improved at the cost of a slight decrease in Precision. The results on Dis. 1 reveal that the F_score of GPPE improves by 3.88%, 4.59%, 4.77%, 4.27%, 5.73%, and 5.22% compared to CE, FL, ASL, CL, FTL, and HFL, respectively. The results on Dis. 2 show improvements of 9.03%, 9.91%, 11.06%, 10.78%, 15.02%, and 12.66%, respectively. The gain on Dis. 2 is more pronounced, indicating that GPPE is more effective on datasets with higher imbalance ratios.

Figure 8 presents the comparison of AUC and AP on Dis. 1–2. On Dis. 1, GPPE is optimal in both AUC and AP, with AP improved by 0.0012, 0.003, 0.0011, 0.0026, 0.0067, and 0.0029, respectively; on Dis. 2, AP is enhanced by 0.0093, 0.0062, 0.0132, 0.0272, 0.1176, and 0.0377, respectively. Two similar conclusions can be drawn: (1) GPPE not only maintains a significant advantage in both AUC and AP, but the higher the imbalance ratio, the more dramatic the advantage. (2) The FTL method is less effective and hardly adapts to high imbalance ratio datasets.

Table 8 presents the imbalanced multi-category classification results on Dis. 3–4. From the findings on Dis. 3, it can be concluded that, compared with CE, FL, and ASL, GPPE shows a significant improvement in all metrics: the accuracy improves by 7.01%, 7.89%, and 6.89%; mF1 rises by 7.22%, 8.1%, and 7.17%; mAUC increases by 0.0622, 0.0234, and 0.0655; and mAP increases by 0.0786, 0.0431, and 0.0828. Similar conclusions can be drawn on Dis. 4. Comparatively, the effect on Dis. 3 is more significant, indicating that GPPE holds a greater advantage on highly imbalanced multi-category datasets.

To summarize, the experiments on CIFAR10 illustrate that: (1) GPPE is robust for imbalanced binary datasets with more complex structures. (2) The performance of GPPE is superior when dealing with imbalanced multi-class datasets that have a single majority class and numerous minority classes, and the dominance becomes more prominent as the number of samples from the minority class decreases.

Classification experiment on CIFAR100

CIFAR100 covers 60,000 images corresponding to 100 classes (600 images/class), which are further grouped into 20 superclasses. The standard train/test split is 500/100 images per class, or 2,500/500 per superclass. Imbalanced binary and multi-category classification experiments are carried out on CIFAR100, with the exact data distribution shown in Table 9. The training data for the binary datasets (Dis. 1–2) consist of large carnivores (superclass 8) and large omnivores (superclass 11), with imbalance ratios of 10:1 and 5:1. On the multi-category datasets (Dis. 3–4), the first five classes are set as majority classes and the rest as minority classes, with imbalance ratios of 10:1 and 5:1. Unlike the datasets constructed on MNIST and CIFAR10, the imbalanced datasets constructed on CIFAR100 are characterized by a fixed sample size for the majority classes and a variable sample size for the minority classes.

Table 9 Imbalanced data distribution on CIFAR100
Table 10 F_score, AUC and AP comparison on Dis. 1–2 of CIFAR100
Table 11 Accuracy, mF1, mAUC and mAP comparison on Dis. 3–4 of CIFAR100

Table 10 presents the binary classification results on CIFAR100. The results show that the F_score of GPPE is boosted by 14.69%, 14.55%, 13.98%, 14.18%, 17.61%, and 16.13% on Dis. 1, and by 10.54%, 12.82%, 13.05%, 11.67%, 16.6%, and 14.74% on Dis. 2. We can therefore conclude that GPPE performs outstandingly on high imbalance ratio datasets. The superiority of GPPE can also be observed in the AUC and AP.

Table 11 shows the results of the multi-category classification on CIFAR100. Compared with the other three methods, GPPE is superior on all metrics. On Dis. 3, the accuracy increases by 1.69%, 1.5%, and 1.24%, and mF1 rises by 2.04%, 1.72%, and 1.5%; on Dis. 4, the accuracy is enhanced by 0.58%, 1.44%, and 0.72%, and mF1 is boosted by 1.07%, 1.45%, and 1.04%. The comparison of the results on Dis. 3 and Dis. 4 again reveals that the method is more effective for datasets with higher imbalance ratios. Meanwhile, it is reasonable to conclude that GPPE is also competitive in imbalanced multi-category classification.

In conclusion, the results on CIFAR100 demonstrate that GPPE remains effective even when the number of samples in the majority class is constant and the number of samples in the minority class varies. Furthermore, it shows greater resilience to imbalanced datasets, regardless of whether they are binary or multi-category classification sets.

Classification experiment on Caltech101

Excluding the background category, Caltech101 contains 8,677 images divided into 101 categories, with the number of images per category varying from 31 to 800. The dataset is inherently imbalanced, as shown in Fig. 9, which displays the distribution of the categories with more than 60 samples. Unlike the step imbalance datasets constructed on MNIST, CIFAR10, and CIFAR100, Caltech101 exhibits linear imbalance, and the imbalance is also present in the test set. It is therefore necessary to predetermine which categories belong to the majority. In the experiment, the images are uniformly resized to \(65\times 65\), and 60%/20%/20% of the data are divided into training/cross-validation/test sets (each class is split according to this ratio). The first six categories (sample size \(>200\)) in Fig. 9 are defined as the majority classes.

Fig. 9 Data distribution on Caltech101

Table 12 presents the results obtained on Caltech101. GPPE shows a considerable improvement in accuracy compared to CE, FL, and ASL, with increases of 3.63%, 2.93%, and 2.01%, respectively, and improves the F_score by 5.34%, 5.46%, and 2.82%, respectively. The results also indicate notable benefits in mAUC and mAP, suggesting that GPPE is a strong performer on linearly imbalanced multi-class datasets.

Table 12 Accuracy, mF1, mAUC and mAP comparison on Caltech101

To summarize, the experiment on Caltech101 indicates that the GPPE also performs competitively for datasets that are naturally linearly imbalanced (including train and test sets).

Hyperparameter analysis

In this section, we demonstrate through experiments how the hyperparameters affect the classification of imbalanced data. The dataset consists of categories 3 and 8 extracted from CIFAR10, where category 3 contains 500 samples and category 8 contains 2000 samples, i.e., an imbalance ratio of 4:1. The experiments are divided into two groups. The first assesses the influence of the parameter \(\beta \) on classification performance with \(\alpha \) frozen (\(\alpha =-2\)); the second investigates the effect of the parameter \(\alpha \) with \(\beta \) frozen (\(\beta =6\)). The results of the two groups are shown in Table 13 and Fig. 10, respectively.

Table 13 The effect of parameter \(\beta \) (\(\alpha =-2\))
Fig. 10 The effect of parameter \(\alpha \) (\(\beta = 6\))

Table 13 indicates that with \(\alpha \) frozen, the Precision decreases slightly as \(\beta \) increases while the Recall increases significantly, showing that the influence of the negative samples on the model is weakening and, conversely, the gradient contribution of the positive samples is growing, which is exactly consistent with our gradient analysis (Fig. 2b). As illustrated in Fig. 10, with \(\beta \) frozen, the Precision increases slowly, the Recall decreases continuously, and the overall performance declines slightly as \(\alpha \) shrinks. This indicates that the influence of the negative class is strengthening and the gradient contribution of the positive class is suppressed, which coincides with the analysis in Fig. 2a. According to the gradient analysis in section “Method analysis” and the experiments, \(\{(-2,2),(-2,4),(-2,6)\}\) are excellent combinations of the hyperparameters \((\alpha , \beta )\).

Algorithm application

This section discusses the application of GPPE to the identification of pulsar candidates. A pulsar is a type of rapidly rotating neutron star, and the screening and identification of pulsar candidates is a classic imbalanced data classification task. HTRU is the first publicly accessible pulsar candidate dataset, containing 1,196 positive samples (pulsars) and 89,996 negative samples (RFI), an imbalance ratio of around 75:1. Unlike the datasets in section “Experimental analysis”, HTRU is therefore characterized by natural step imbalance. In the experiment, the time-vs-phase plot, abbreviated sub-int, is extracted as the experimental data and resized to \(64 \times 64\). Figure 11 depicts representative positive and negative samples.

Fig. 11 HTRU samples in the experiment. a, b are positive and negative samples respectively

We divide 30% of the raw data into the training set, 20% into the cross-validation set, and the remaining 50% into the test set; the exact data sizes are presented in Table 14. As far as pulsar identification is concerned, the cost of a false negative is much higher than that of a false positive, so among the evaluation metrics we focus more on the F_score than on the AUC. As a comparison method, random oversampling (ROS) is also applied to HTRU: we randomly oversample the positive samples, resulting in 27,000 samples for each of the positive and negative classes (a sketch of this resampling is given below). The experimental results are summarized in Table 15, and the time consumed by the various methods in a single training epoch is displayed in Fig. 12.
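For reference, the ROS baseline can be sketched as follows. This is our own illustration; only the target size of 27,000 samples per class comes from the text.

```python
import numpy as np

def random_oversample(X, y, target=27000, seed=0):
    """Resample both classes to `target` examples each; the minority
    is drawn with replacement (ROS), the majority without."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in (0, 1):
        c_idx = np.where(y == c)[0]
        idx.append(rng.choice(c_idx, size=target,
                              replace=len(c_idx) < target))
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]
```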

Table 14 The split HTRU dataset
Table 15 Comparison of experimental results on HTRU
Fig. 12 Time consumed in a single training epoch on HTRU

From Table 15, it can be observed that: (1) ROS is completely ineffective when both the training and test sets are imbalanced. On closer analysis, we believe there are two reasons for this: overfitting and the imbalance of the test set. ROS oversamples the positive samples by a factor of roughly 75, so the trained model overfits the positive class. The experimental results show an FP of 16,656 and an FN of 3; since there are only 596 positive samples in the test set, a Precision of 0.0344 follows from Eq. (13). (2) The GPPE method is extremely competitive. In terms of F_score, the comprehensive performance of GPPE improves on CE, FL, ASL, CL, FTL, and HFL by 1.89%, 1.14%, 1.04%, 0.81%, 1.19%, and 3.2%, respectively; in terms of Recall, it is lifted by 4.2%, 4.7%, 2.85%, 1.18%, 4.36%, and 6.04%. The Precision of GPPE is not optimal, but this is perfectly acceptable for pulsar identification, which is more concerned with false negatives. As can be seen from Fig. 12, GPPE also costs slightly less time per epoch than the other methods, so the method exhibits a slight advantage in efficiency as well. In short, GPPE can significantly reduce the interfering signals while preserving the pulsar signals, making this strategy well suited to the identification of pulsar candidates.

Discussion and conclusion

This study analyzes the principle, advantages, and disadvantages of the focal loss and the asymmetric focal loss, and presents an asymmetric gradient penalty approach based on the power exponential function for imbalanced data classification. The rationality of the method is examined from the perspective of the gradient. Comparative experiments on the MNIST, CIFAR10, CIFAR100, and Caltech101 datasets illustrate the performance of the algorithm, and the method is finally applied to the imbalanced pulsar candidate dataset HTRU. Interesting directions for future investigation include employing a similar gradient strategy on the positive class to improve Precision, and combining the method with GANs to improve the diversity and controllability of generated data on imbalanced datasets.