1 Introduction


The most common use case of Machine Learning (ML) is supervised learning, which inherently requires a labeled dataset demonstrating the desired outcome to the ML model. This initial step of acquiring a labeled dataset can only be accomplished by often rare and costly human domain experts; automation is not possible, as automation via ML is exactly the task that should be learned. For example, the average cost for the common labeling task of reliably segmenting a single image is 6.40 USD (Footnote 1). At the same time, recent advances in the field of Neural Networks (NN), such as Transformer-encoder Vaswani et al (2017) models for Natural Language Processing (NLP) (with BERT Devlin et al (2019) being the most prominent example) or Convolutional Neural Networks (CNN) LeCun et al (2015) for computer vision, have resulted in huge deep NNs which require even more labeled training data. Reducing the amount of labeled data is therefore a primary objective in making ML more applicable in real-world scenarios. The focus of this paper is on the NLP domain and Transformer-encoder NNs respectively, but the proposed methods can also be applied without any further work to other deep NN models and domains. A concrete real-world use-case scenario our research targets is text classification, e.g. the categorization of legal documents such as regulatory documents to ensure adherence to legal requirements.

Noting that deep NNs require a large amount of labeled data, Active Learning (AL) is a popular method to reduce the required human effort in creating a labeled dataset by avoiding the labeling of nearly identical, and therefore redundant, samples. AL is an iterative process, which decides step-by-step which samples to label first, based on the existing knowledge in the form of the currently labeled samples. In each AL cycle a new subset of unlabeled samples is actively selected for labeling by human annotators; thus, the set of labeled samples grows continuously and the quality of the learned ML model is gradually improved. Due to the iterative selection, the knowledge already present in the so-far labeled data can be leveraged to select the most promising samples to be labeled next. The goal is to reduce the amount of necessary labeling work while keeping the same model performance. AL achieves this by preventing the annotation of redundant samples, which the model has already learned to represent properly. The challenge of applying AL is an almost paradoxical problem: how to decide which samples are most beneficial to the ML model without knowing their labels, since this is exactly the task to be learned by the to-be-trained ML model.

Despite successful application in a variety of domains (Gonsior et al, 2020; Gal et al, 2017; Lowell et al, 2019), AL fails to work for very deep NNs such as Transformer-encoder models, rarely beating pure random sampling. The common explanation (Karamcheti et al, 2021; Gleave and Irving, 2022; Sankararaman et al, 2022; D’Arcy and Downey, 2022) is that AL methods favor hard-to-learn samples, often simply called outliers, which negates the potential benefits gained from AL. Another potential explanation – to the best of our knowledge not yet covered in the AL literature – could be the calculation of the uncertainty of the NN. Nearly all AL strategies rely on a method measuring the uncertainty of the to-be-trained ML model in its prediction as probabilities. The reasoning behind this is that the samples with high model uncertainty are the most useful ones to learn from and should therefore be labeled first. For NNs, typically the softmax activation function is used for the last layer, and its output is interpreted as the confidence probability of the NN. But interpreting the softmax output as the true model confidence is a fallacy Pearce et al (2021). We therefore compare eight alternative Uncertainty-measures for AL in an extensive end-to-end evaluation of fine-tuning Transformer models on seven common classification datasets.

Our main contributions are:

  • An empirical comparison of eight alternative Uncertainty-measures to the vanilla softmax function in the context of AL, applied to the fine-tuning of Transformer-encoder models.

  • A proposal of the novel and easy-to-implement method Uncertainty-Clipping (UC), which mitigates the negative effect of uncertainty-based AL methods favoring outliers.

  • A systematic evaluation of the Uncertainty-Clipping method, demonstrating how it improves nearly all Uncertainty-measures.

Fig. 1: Standard Active Learning Cycle including our proposed Uncertainty-Clipping (UC) to influence the uncertainty-based ranking (using the probability \(P_\theta (y|x)\) of the learner model \(\theta \) in predicting class y for a sample x) by ignoring the top-k results

The remainder of this paper is structured as follows: In Section 2, we briefly explain AL, the Transformer model architecture, and the softmax function. Section 3 presents the alternative Uncertainty-measures, Section 4 describes our experimental setup. We present our results and discussion in Section 5, and conclude in Section 6.

2 Active Learning Basics

This section introduces the standard AL cycle in Section 2.1 and gives an overview of the three categories of AL strategies: uncertainty-based strategies in Section 2.2, diversity-based strategies in Section 2.3, and combined strategies in Section 2.4. We conclude in Section 2.5 by explaining why we focus on improving uncertainty-based strategies in this work.

2.1 Active Learning Cycle

Supervised learning techniques inherently rely on an annotated dataset. AL is a well-known technique for saving human effort by iteratively selecting exactly those unlabeled samples for expert labeling that are the most useful ones for the overall classification task Settles (2012). The goal is to train a classification model \(\theta \) that maps samples \(x \in \mathcal {X}\) to a respective label \(y \in \mathcal {Y}\); for the training, the labels have to be provided by an oracle, often one or multiple human annotators. Figure 1 shows a standard pool-based AL cycle: Given a small initial labeled dataset \(\mathcal {L}= \{(x_i,y_i)\}_{i=0}^n\) of n samples \(x_i \in \mathcal {X}\) with their respective labels \(y_i \in \mathcal {Y}\), and a large unlabeled pool \(\mathcal {U}= \{x_i\}, x_i \not \in \mathcal {L}\), an ML model called learner \(\theta : \mathcal {X} \mapsto \mathcal {Y}\) is trained on the labeled set. A query strategy \(f:\mathcal {U}\longrightarrow Q\) then subsequently chooses a batch of \(b\) unlabeled samples \(Q\), which will be labeled by the oracle (human expert) and added to the set of labeled data \(\mathcal {L}\). This AL cycle repeats \(\tau \) times until a stopping criterion is met.
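To make the cycle concrete, the following minimal sketch (our own illustration; `train`, `query_strategy`, and `oracle_label` are hypothetical stand-ins for the learner training routine, the query strategy f, and the human oracle) implements the loop in Python:

```python
# Minimal sketch of the pool-based AL cycle of Fig. 1. `train`,
# `query_strategy`, and `oracle_label` are hypothetical stand-ins,
# not part of any specific framework.
import numpy as np

def active_learning_loop(X_labeled, y_labeled, X_pool, train, query_strategy,
                         oracle_label, batch_size=25, n_iterations=20):
    for _ in range(n_iterations):                        # tau AL iterations
        theta = train(X_labeled, y_labeled)              # fit learner on L
        query_idx = query_strategy(theta, X_pool, batch_size)  # select batch Q
        y_new = oracle_label(X_pool[query_idx])          # human annotations
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])  # grow L
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, query_idx, axis=0)    # shrink U
    return theta
```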

2.2 Uncertainty-based AL

Commonly used AL query strategies relying on informativeness use the uncertainty of the learner model \(\theta \) to select the AL query. The uncertainty is defined as the inverse of the confidence/probability \(P_{\theta }(y|x)\) of the learner in classifying a sample x with the label y. The idea is to label samples in those regions the learner model is most uncertain about, thereby intentionally decreasing the model’s overall uncertainty. The simplest informativeness-based AL strategy is Uncertainty Least Confidence (LC) Lewis and Gale (1994). This strategy selects those samples the learner model is most uncertain about, i.e. where the probability \(P_{\theta }(\hat{y}|x)\) of the most probable label \(\hat{y}\) is the lowest:

$$\begin{aligned} f_{LC}(\mathcal {U}) =\underset{{x \in \mathcal {U}}}{\arg \max } \left( 1-P_{\theta }(\hat{y}|x)\right) \end{aligned}$$
(1)

A variant of the uncertainty strategy is Uncertainty Max-Margin (MM) Scheffer et al (2001), which selects those samples where the difference between the certainty for the most probable class \(\hat{y}_1\) and the second most probable class \(\hat{y}_2\) is the lowest:

$$\begin{aligned} {{f}_{MM}}(\mathcal {U}) = \underset{{x \in \mathcal {U}}}{\arg \min } \left( P_{\theta }(\hat{y}_1|x)-P_{\theta }(\hat{y}_2|x)\right) \end{aligned}$$
(2)

Another variant is Uncertainty Entropy (Ent) Shannon (1948), where the entropy of the label distribution is used to measure the uncertainty of the learner:

$$\begin{aligned} {f_{Ent}}(\mathcal {U}) = \underset{x \in \mathcal {U}}{\arg \max } \left( - \sum _i P_{\theta }(\hat{y}_i|x)\log P_{\theta }(\hat{y}_i|x)\right) \end{aligned}$$
(3)
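For illustration, a minimal sketch (our own, not taken from any AL library) of how the three scores in Eqs. 1–3 can be computed from a matrix of predicted class probabilities; all helpers return scores where higher means more uncertain:

```python
# Sketch of the three strategies in Eqs. 1-3, computed from a matrix of
# predicted probabilities P_theta(y|x) with shape (n_samples, n_classes).
import numpy as np

def least_confidence(probs):
    return 1.0 - probs.max(axis=1)                      # Eq. 1

def max_margin(probs):
    part = np.sort(probs, axis=1)                       # ascending per row
    return -(part[:, -1] - part[:, -2])                 # Eq. 2, negated margin

def entropy(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)   # Eq. 3

probs = np.array([[0.55, 0.40, 0.05],
                  [0.90, 0.05, 0.05]])
print(np.argsort(-least_confidence(probs))[:1])         # sample 0 queried first
```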

All uncertainty-based strategies have in common that they need the confidence of the learner model in a quantified form to rank the unlabeled samples from most certain to most uncertain. For deep NNs, like Transformer-encoder based language models, the softmax activation function is typically used.

Query-by-committee (QBC) Seung et al (1992), another widely used strategy, uses a committee of multiple learner models and selects those samples first on which the models disagree the most. A large drawback of this strategy is the increased runtime, especially when using large NNs as learner models.

Recent uncertainty-based strategies include Cartography Active Learning (CAL) Zhang and Plank (2021).

3.1 Softmax as Uncertainty-measure

All uncertainty-based AL strategies have in common that they need the uncertainty of the learner model in a quantified form to rank the unlabeled samples. In this paper, this is called an Uncertainty-measure. The inverse of an Uncertainty-measure, the confidence-probability of an ML model, should reflect how probable it is that the model’s own predictions are true. For example, a confidence of \(70\%\) should mean a correct prediction in 70 out of 100 cases. Due to the property that all its components sum to 1, the softmax function is often used as a makeshift probability measure for the certainty of NNs:

$$\begin{aligned} \sigma (z)_i = \frac{\exp (z_i)}{\sum ^K_{j=1}\exp (z_j)}, \text { for } i=1,\dots , K \end{aligned}$$
(4)

The output of the i-th neuron of the last layer before entering the activation function is called a logit and denoted as \(z_i\); K denotes the number of neurons in the last layer.

But as has been pointed out by other researchers Lakshminarayanan et al (2017); Weiss and Tonella (2022); Gleave and Irving (2022); Sankararaman et al (2022); D’Arcy and Downey (2022), the training objective for NNs is purely to maximize the value of the correct output neuron, not to produce a true confidence-probability. An inherent limitation of the softmax function is its inability to express – even in the theoretical case – zero confidence in its prediction, as the sum of all possible outcomes always equals 1. Previous works have indicated that softmax-based confidence is often overconfident Gal and Ghahramani (2016).

Pearce et al (2021); Hein et al (2019) deeply investigate the foundations of the vanilla softmax function as a confidence-probability. They show that especially NNs using the typical ReLU activation function for the inner layers can easily be tricked into being overly confident in any prediction by simply scaling the input \(x\) with an arbitrarily large value \(\alpha >1\) to \(\tilde{x} = \alpha x\).
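A toy numeric illustration of this effect (our own sketch; it assumes the logits scale linearly with the input, as in a bias-free ReLU network):

```python
# Toy illustration: scaling the logits by alpha > 1 pushes the softmax
# confidence toward 1 without any new information about the input.
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # hypothetical logits for an input x
for alpha in [1, 2, 10]:
    print(alpha, softmax(alpha * logits).max())
# 1 -> ~0.63, 2 -> ~0.84, 10 -> ~1.00: top-class confidence approaches 1
```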

3.2 Alternative Uncertainty-measures

Even though it is known that an Uncertainty-measure is a crucial component for uncertainty-based AL Lakshminarayanan et al (2017); Blundell et al (2015), especially in the context of deep NNs like Transformer models Schröder et al (2022), little research has been done on solely comparing Uncertainty-measures with the goal of AL for Transformer-encoder models in mind. We selected seven methods from the literature suitable as alternative Uncertainty-measures of deep NNs such as Transformer-encoder models. They can be divided into five categories Gawlikowski et al (2021): a) single network deterministic methods, which deterministically produce the same result for each NN forward pass (Inhibited Softmax (IS) Możejko et al (2018), TrustScore (TrSc) Jiang et al (2018), and Evidential Neural Networks (Evi) Sensoy et al (2018)), b) Bayesian methods, which sample from a distribution and therefore yield non-deterministic results (Monte-Carlo Dropout (MC) Gal and Ghahramani (2016)), c) ensemble methods, which combine multiple deterministic models into a single decision (Softmax Ensemble Seung et al (1992)), d) calibration methods, which calibrate the softmax function (Label Smoothing (LS) Szegedy et al (2016) and Temperature Scaling (TeSc) Zhang et al (2020)), and e) test-time augmentation methods, which, similarly to the ensemble methods, augment the input samples and return the combined prediction for the augmented samples. The last category is a subject of future research, as we could not find a subset of data augmentation techniques that reliably worked well for our use case across different datasets.

More elaborate AL strategies like BALD Kirsch et al (2019) or QUIRE Huang et al (2010) not only focus on the confidence-probability measure, but also make use of the vector space to label a diverse training set, including regions far away from the classification boundary. As the focus of this paper is purely on evaluating the influence of the confidence prediction methods, we deliberately use only the most basic AL strategy, Uncertainty Least Confidence.

In the following, the core ideas of the individual methods are briefly explained. More details, reasonings, and the exact formulas can be found in the original papers.

Inhibited Softmax (IS). The Inhibited Softmax method Możejko et al (2018) is a simple extension of the vanilla softmax function by an additional constant factor \(\alpha \in \mathbb {R}\), which enhances the effect of the absolute magnitude of the single logit value \(z_i\) on the softmax output:

$$\begin{aligned} \sigma (z)_i = \frac{\exp (z_i)}{\sum ^K_{j=1}\exp (z_j)+\exp (\alpha )} \end{aligned}$$
(5)

To ensure that the added fraction is not removed during the training process, several changes to the NN have to be made, including: a) removing the bias b from the input of the neuron activation function, and b) extending the loss function by a special evident regularisation term.
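As an illustration, Eq. 5 can be sketched on top of precomputed logits; the network modifications a) and b) required to make it effective during training are omitted here:

```python
# Sketch of the Inhibited Softmax output (Eq. 5); `alpha` is the constant
# inhibition factor. The training-time network changes are omitted.
import numpy as np

def inhibited_softmax(logits, alpha=1.0):
    e = np.exp(logits - logits.max())         # stabilized exponentials
    inhibitor = np.exp(alpha - logits.max())  # shift alpha consistently
    return e / (e.sum() + inhibitor)
```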

TrustScore (TrSc). The TrustScore Jiang et al (2018) method uses the set of available labeled data to calculate a TrustScore, independent of the NN model. In a first step, the available labeled data is clustered into a single high-density region for each class. The TrustScore ts of a sample \(x\) is then calculated as the ratio of the distance from \(x\) to the cluster of the nearest class \(c_{closest}\) other than the predicted class, and the distance to the cluster of the predicted class \(\hat{y}\):

$$\begin{aligned} ts_x= \frac{dist(x, c_{closest})}{dist(x, c_{\hat{y}})} \end{aligned}$$
(6)

Therefore, the TrustScore is high when the cluster of the nearest other class is far away relative to the cluster of the predicted class, indicating a trustworthy classification; conversely, a low TrustScore indicates a potentially wrong classification. The distance metric as well as the calculation of the clusters is based on the k-nearest-neighbors algorithm.
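A minimal sketch of Eq. 6, simplifying the cluster computation to one centroid per class (the original method derives density-filtered high-density sets via k-nearest neighbors):

```python
# Sketch of Eq. 6 with per-class centroids as a simplification.
import numpy as np

def trust_score(x, centroids, predicted_class):
    """centroids: dict mapping class id -> centroid vector."""
    d_pred = np.linalg.norm(x - centroids[predicted_class])
    d_closest = min(np.linalg.norm(x - mu)               # nearest other class
                    for c, mu in centroids.items() if c != predicted_class)
    return d_closest / (d_pred + 1e-12)                  # high = trustworthy
```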

Evidential Neural Networks (Evi). Evidential Neural Networks Sensoy et al (2018) treat the vanilla softmax outputs as a parameter set over a Dirichlet distribution. The prediction acts as evidence supporting the given parameter set out of the distribution, and the confidence-probability of the NN reflects the Dirichlet probability density function over the possible softmax outputs.
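A rough sketch of the underlying subjective-logic view (following Sensoy et al (2018); the evidential training loss is omitted): non-negative evidence outputs parameterize a Dirichlet distribution, and the remaining uncertainty mass shrinks as evidence accumulates.

```python
# Sketch: evidence e_k (e.g. ReLU of logits) gives Dirichlet parameters
# alpha_k = e_k + 1; the uncertainty mass K / sum(alpha) shrinks with evidence.
import numpy as np

def evidential_uncertainty(evidence):
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S             # per-class belief masses
    uncertainty = len(alpha) / S      # remaining uncertainty mass
    return belief, uncertainty
```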

Monte-Carlo Dropout (MC). Monte-Carlo Dropout Gal and Ghahramani (2016) is a Bayesian method that uses the NN dropout regularization method to construct an ensemble of the same trained model for “free”. Dropout refers to randomly disabling neurons during the training phase, originally with the aim of reducing overfitting of the to-be-trained network. For Monte-Carlo Dropout, the dropout method is applied during the prediction phase. As neurons are disabled randomly, this results in a large Gaussian sample space of different models. Therefore, each model, with differently dropped-out neurons, produces a potentially different prediction. Combining the vanilla softmax predictions using the arithmetic mean produces a combined Uncertainty-measure.
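A minimal PyTorch sketch of this idea, assuming `model` returns logits (real implementations usually re-enable only the dropout modules, since `train()` also affects layers such as batch normalization):

```python
# Sketch of Monte-Carlo Dropout: keep dropout active at prediction time and
# average the softmax outputs of several stochastic forward passes.
import torch

def mc_dropout_predict(model, x, n_passes=20):
    model.train()                                 # dropout stays active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_passes)])
    return probs.mean(dim=0)                      # arithmetic mean over passes
```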

Softmax Ensemble. The softmax ensemble approach uses an ensemble of NN models, similar to Monte-Carlo Dropout. The predictions of the ensemble can be interpreted as votes on the prediction. The disagreement among the voters then acts as the Uncertainty-measure and can be calculated in two ways: either as Vote Entropy (VE) or as Kullback-Leibler Divergence (KLD) McCallum and Nigam (1998):

$$\begin{aligned} VE(x) = -\sum _i\frac{V(\hat{y}_i, x)}{K}\log \frac{V(\hat{y}_i, x)}{K} \end{aligned}$$
(7)

with K being the number of ensemble models, and \(V(\hat{y}_i, x)\) denoting the number of ensemble models assigning the class \(\hat{y}_i\) to the sample \(x\). The complete equation for the KLD is omitted for brevity. Using an ensemble of softmax models inside an AL strategy results in the uncertainty-based Query-by-committee Seung et al (1992) AL strategy.
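As a sketch, vote entropy per Eq. 7 can be computed from the hard predictions of the K ensemble members (our own illustration, treating \(0 \log 0\) as 0):

```python
# Sketch of vote entropy (Eq. 7): `votes` holds the hard prediction of each
# of the K ensemble models for each sample.
import numpy as np

def vote_entropy(votes, n_classes):
    """votes: int array of shape (K_models, n_samples) with class ids."""
    K, n_samples = votes.shape
    ve = np.zeros(n_samples)
    for c in range(n_classes):
        frac = (votes == c).sum(axis=0) / K   # V(y_c, x) / K
        nz = frac > 0                         # 0 * log 0 treated as 0
        ve[nz] -= frac[nz] * np.log(frac[nz])
    return ve
```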

Fig. 3: Histogram distribution of exemplary uncertainty values for a single AL iteration of the TREC-6 dataset before Uncertainty-Clipping. The x-axis ranges from 0, full certainty, to 1, full uncertainty

Fig. 4: Thresholds used for the different Uncertainty-Clipping methods, displayed over a distribution of exemplary uncertainty values for a single AL iteration. The x-axis ranges from 0, full certainty, to 1, full uncertainty

Temperature Scaling (TeSc). Temperature Scaling Zhang et al (2020) is a model calibration method that is applied after training and changes the calculation of the softmax function by introducing a temperature \(T>0\):

$$\begin{aligned} \sigma (z)_i = \frac{\exp (z_i/T)}{\sum ^K_{j=1}\exp (z_j/T)} \end{aligned}$$
(8)

For \(T=1\) the softmax function stays the same as the original version; for values \(T<1\) the softmax output of the largest logit is increased; and for values \(T>1\) (the recommended case when using Temperature Scaling) the output of the most probable logit is decreased. This has a dampening effect on the overall confidence. The value of the temperature T is computed empirically using the existing labeled set of samples at application time. The parameter is therefore different for each AL iteration.
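A minimal sketch of Eq. 8 (the fitting of T, e.g. by minimizing the negative log-likelihood on the labeled data, is omitted):

```python
# Sketch of the temperature-scaled softmax (Eq. 8).
import numpy as np

def softmax_with_temperature(logits, T=1.5):
    z = logits / T                            # T > 1 dampens confidence
    z = z - z.max(axis=-1, keepdims=True)     # numerical stabilization
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```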

Label Smoothing (LS). Label Smoothing Szegedy et al (2016) removes a fraction \(\alpha \) from the loss target of the predicted class and distributes it uniformly among the other classes by adding \(\frac{\alpha }{K-1}\) to the other targets, with K being the number of classification classes. In contrast to Temperature Scaling, Label Smoothing is not applied after the network has been trained, but directly during the training process via a modified loss function.
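A minimal sketch of the smoothed targets (our own illustration; these replace the one-hot targets inside the cross-entropy loss during training):

```python
# Sketch of label-smoothed targets: 1 - alpha for the true class and
# alpha / (K - 1) spread uniformly over the remaining classes.
import numpy as np

def smooth_targets(y, n_classes, alpha=0.1):
    targets = np.full((len(y), n_classes), alpha / (n_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - alpha
    return targets
```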

3.3 Uncertainty-Clipping (UC)

Algorithm 1: Top-k Clipping

Algorithm 2: First Peak Clipping

Algorithm 3: First Valley Clipping

The aforementioned Uncertainty-measures can directly be used in AL strategies to sort the pool of unlabeled samples and select exactly those samples for labeling that have the lowest confidence/highest uncertainty. As repeatedly reported by others Karamcheti et al (2021); Gleave and Irving (2022), the samples with the highest uncertainty values (cf. Fig. 3) are supposedly outliers that have ambiguous labels. Therefore, a high uncertainty is expected, but it is not a good indicator for a prioritized labeling of these samples, as an ML model which is primarily trained on ambiguous outliers will often need a lot of labeled data to separate the target classes correctly. Ignoring the most uncertain outlier samples results in selecting the second-most uncertain samples for labeling, which in most cases contain almost as much classification-boundary information as the most uncertain samples, and in the case of outliers even more. The challenge here is to detect potential outliers without knowing the true labels, while at the same time preventing too many samples from being excluded from labeling. The thresholds used by the different clipping methods are displayed in Fig. 4.

This idea of throwing away some samples to improve accuracy has also been noted to work well for Transformer models by Sankararaman et al (2022). Top-k Clipping. Algorithm 1 outlines the simplest variant: the uncertainty values \({\textbf {u}}\) are computed and sorted (lines 1 and 2); afterwards, the top-k uncertainty values are removed (lines 3 and 4). This method can be used in combination with any AL strategy that uses uncertainty to rank the pool of unlabeled samples. A fixed threshold has the advantage of a very low implementation overhead, but the disadvantage of being highly dependent on the parameter k: a too low value may ignore too few samples, a too high value too many. The following two methods aim to circumvent this restriction.
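A minimal sketch of this variant (our own illustration; choosing k as a fixed fraction of the pool size is one possible parameterization):

```python
# Sketch of Top-k Clipping: skip the k most uncertain samples before
# selecting the query batch.
import numpy as np

def topk_clipped_query(uncertainties, batch_size, clip_fraction=0.05):
    order = np.argsort(-uncertainties)            # most uncertain first
    k = int(len(uncertainties) * clip_fraction)   # suspected outliers to skip
    return order[k:k + batch_size]                # query next-most uncertain
```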

First Peak Clipping. Other indicators for filtering uncertainty distributions are local maxima and minima. The goal is to determine the first peak/local maximum in the uncertainty distribution, which is potentially caused by the outlier samples. Algorithm 2 illustrates our proposed implementation of this idea. First, we calculate a kernel density estimation to get a smooth probability density function from the distribution of the uncertainty values (line 4). Based on this, we determine the first local maximum from the right (line 5), as not all uncertainty distributions have a second peak to the right. Additionally, to prevent filtering out too many samples, we limit this method using the top-k clipping threshold from the first method (lines 6 and 7).

A disadvantage of this and the following method is the dependency on the existence of local maxima, which not every distribution of uncertainty values exhibits.

First Valley Clipping. The idea is very similar to First Peak Clipping, but instead of taking the maximum of the first hill from the right, we use the valley after the first hill as the threshold, as outlined in Algorithm 3.
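A combined sketch of both threshold detectors (our own illustration of the idea in Algorithms 2 and 3; the additional top-k guard and other safeguards are omitted):

```python
# Sketch of First Peak / First Valley Clipping: a Gaussian kernel density
# estimate smooths the uncertainty distribution; the threshold is the
# right-most local maximum (peak) or the valley next to it.
import numpy as np
from scipy.signal import argrelextrema
from scipy.stats import gaussian_kde

def clipping_threshold(uncertainties, mode="peak", grid_size=200):
    grid = np.linspace(uncertainties.min(), uncertainties.max(), grid_size)
    density = gaussian_kde(uncertainties)(grid)    # smoothed density function
    peaks = argrelextrema(density, np.greater)[0]  # indices of local maxima
    valleys = argrelextrema(density, np.less)[0]   # indices of local minima
    if mode == "peak" and len(peaks) > 0:
        return grid[peaks[-1]]                     # first peak from the right
    if mode == "valley" and len(valleys) > 0:
        return grid[valleys[-1]]                   # valley after that peak
    return uncertainties.max()                     # no extremum: clip nothing
```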

4 Experimental Setup

Details to reproduce our evaluation are provided in this section. In support of the reproducibility initiative, we enable other researchers to re-use our work by making our source code fully publicly available on GitHub (Footnote 3).

4.1 Setup

We extended the AL framework small-text Schröder et al (2022) with the methods presented in Section 3.3 in combination with the implemented uncertainty measurements.

The AL experiments can be evaluated in a multitude of ways. At its core, after each AL iteration a standard ML metric is measured on a withheld dedicated test set. We decided on the accuracy (acc) metric, calculated on a withheld test dataset. It is possible to compare the test accuracy of the last iteration, the mean of the last five iterations (\(acc_{last5}\)), and the mean of all iterations. The last one equals the area under the curve when plotting the so-called AL learning curve. As an effective AL strategy should select the most valuable samples for labeling first, those metrics that include the accuracy of multiple iterations are often closer to real use cases. Nevertheless, at the beginning of the labeling process the fluctuation of the test accuracy is very high for most strategies and often contains surprisingly little information about which strategy is better. The influence of the initially labeled samples is simply so high that a better strategy with a bad starting point has no chance against a bad strategy with a good starting point. But after a couple of AL iterations the results stabilize and good strategies can be reliably distinguished from bad ones, as each strategy tends to approach its own characteristic threshold regardless of the starting point. Therefore, we use the mean accuracy of the last five iterations, deliberately ignoring the first iterations with their highly fluctuating results.
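For clarity, the chosen metric is simply (our own sketch):

```python
# Sketch of the acc_last5 metric: mean test accuracy over the final five
# AL iterations, ignoring the noisy early iterations.
import numpy as np

def acc_last5(accuracy_per_iteration):
    return float(np.mean(accuracy_per_iteration[-5:]))
```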

Additionally, we measured the runtime. AL, applied in real-life scenarios, is an interactive process: decisions of the AL strategy should be made in the magnitude of single-digit seconds, as longer calculation times render the annotation process impractical.

As each experiment was repeated 10 times, we report in the following the arithmetic mean over the 10 repetitions, separated by dataset.