1 Introduction


The most common use case of Machine Learning (ML) is supervised learning, which inherently requires a labeled dataset demonstrating the desired outcome to the ML model. This initial step of acquiring a labeled dataset can only be accomplished by often rare and costly human domain experts; automation is not possible, as automation via ML is exactly the task that should be learned. For example, the average cost for the common labeling task of reliably segmenting a single image is 6.40 USD (Footnote 1). At the same time, recent advances in the field of Neural Networks (NN), such as Transformer-encoder Vaswani et al (2017) models for Natural Language Processing (NLP) (with BERT Devlin et al (2019) being the most prominent example) or Convolutional Neural Networks (CNN) LeCun et al (2015) for computer vision, have resulted in huge deep NNs which require even more labeled training data. Reducing the amount of labeled data is therefore a primary objective in making ML more applicable in real-world scenarios. The focus of this paper is on the NLP domain and Transformer-encoder NNs respectively, but the proposed methods can also be applied without any further work to other deep NN models and domains. A concrete real-world use-case scenario our research targets is text classification, e.g. the categorization of legal documents such as regulatory documents to ensure adherence to legal requirements.

Noting that deep NNs require a large amount of labeled data, Active Learning (AL) is a popular method to reduce the required human effort in creating a labeled dataset by avoiding the labeling of nearly identical, and therefore redundant, samples. AL is an iterative process, which decides step-by-step which samples to label first, based on the existing knowledge in the form of the currently labeled samples. In each AL cycle a new subset of unlabeled samples is actively selected for labeling by human annotators; thus, the set of labeled samples grows continuously and the quality of the learned ML model is gradually improved. Due to the iterative selection, the knowledge already present in the so-far labeled data can be leveraged to select the most promising samples to be labeled next. The goal is to reduce the amount of necessary labeling work while keeping the same model performance. AL achieves this by preventing the annotation of redundant samples, which the model has already learned to represent properly. The challenge of applying AL is an almost paradoxical problem: how to decide which samples are most beneficial to the ML model without knowing their labels, since this is exactly the task to be learned by the to-be-trained ML model.

Despite successful application in a variety of domains (Gonsior et al, 2020; Gal et al, 2017; Lowell et al, 2019), AL fails to work for very deep NNs such as Transformer-encoder models, rarely beating pure random sampling. The common explanation (Karamcheti et al, 2021; Gleave and Irving, 2022; Sankararaman et al, 2022; D’Arcy and Downey, 2022) is that AL methods favor hard-to-learn samples, often simply called outliers, which negates the potential benefits gained from AL. Another potential explanation – to the best of our knowledge not yet covered in the AL literature – could be the calculation of the uncertainty of the NN. Nearly all AL strategies rely on a method measuring the uncertainty of the to-be-trained ML model in its prediction as probabilities. The reasoning behind this is that the samples with high model uncertainty are the most useful ones to learn from and should therefore be labeled first. For NNs, typically the softmax activation function is used for the last layer, and its output is interpreted as the confidence probability of the NN. But interpreting the softmax output as the true model confidence is a fallacy Pearce et al (2021). We therefore compare eight alternative Uncertainty-measures for AL in an extensive end-to-end evaluation of fine-tuning Transformer models on seven common classification datasets.

Our main contributions are:

  • An empirical comparison of eight alternative Uncertainty-measures to the vanilla softmax function in the context of AL, applied to the fine-tuning of Transformer-encoder models.

  • A proposal of the novel and easy-to-implement method Uncertainty-Clipping (UC), which mitigates the negative effect of uncertainty-based AL methods favoring outliers.

  • A systematic evaluation of the Uncertainty-Clipping method, demonstrating how it improves nearly all Uncertainty-measures.

Fig. 1: Standard Active Learning Cycle including our proposed Uncertainty-Clipping (UC) to influence the uncertainty-based ranking (using the probability \(P_\theta (y|x)\) of the learner model \(\theta \) in predicting class y for a sample x) by ignoring the top-k results

The remainder of this paper is structured as follows: In Section 2, we briefly explain AL, the Transformer model architecture, and the softmax function. Section 3 presents the alternative Uncertainty-measures, Section 4 describes our experimental setup. We present our results and discussion in Section 5, and conclude in Section 6.

2 Active Learning Basics

This section introduces the standard AL cycle in Section 2.1 and gives an overview of the three categories of AL strategies: uncertainty-based strategies in Section 2.2, diversity-based strategies in Section 2.3, and combined strategies in Section 2.4. We conclude in Section 2.5 by explaining why we focus on improving uncertainty-based strategies in this work.

2.1 Active Learning Cycle

Supervised learning techniques inherently rely on an annotated dataset. AL is a well-known technique for saving human effort by iteratively selecting exactly those unlabeled samples for expert labeling that are the most useful ones for the overall classification task Settles (2012). The goal is to train a classification model \(\theta \) that maps samples \(x \in \mathcal {X}\) to a respective label \(y \in \mathcal {Y}\); for the training, the labels have to be provided by an oracle, often one or multiple human annotators. Figure 1 shows a standard pool-based AL cycle: Given a small initial labeled dataset \(\mathcal {L}= \{(x_i,y_i)\}_{i=0}^n\) of n samples \(x_i \in \mathcal {X}\) with their respective labels \(y_i \in \mathcal {Y}\), and a large unlabeled pool \(\mathcal {U}= \{x_i\}, x_i \not \in \mathcal {L}\), an ML model called learner \(\theta : \mathcal {X} \mapsto \mathcal {Y}\) is trained on the labeled set. A query strategy \(f:\mathcal {U}\longrightarrow Q\) then subsequently chooses a batch of \(b\) unlabeled samples \(Q\), which will be labeled by the oracle (human expert) and added to the set of labeled data \(\mathcal {L}\). This AL cycle repeats \(\tau \) times until a stopping criterion is met.
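To make the cycle concrete, the following minimal sketch (our own illustration; `train`, `query_strategy`, and `oracle_label` are hypothetical stand-ins for the learner training routine, the query strategy f, and the human oracle) implements the loop in Python:

```python
# Minimal sketch of the pool-based AL cycle of Fig. 1. `train`,
# `query_strategy`, and `oracle_label` are hypothetical stand-ins,
# not part of any specific framework.
import numpy as np

def active_learning_loop(X_labeled, y_labeled, X_pool, train, query_strategy,
                         oracle_label, batch_size=25, n_iterations=20):
    for _ in range(n_iterations):                        # tau AL iterations
        theta = train(X_labeled, y_labeled)              # fit learner on L
        query_idx = query_strategy(theta, X_pool, batch_size)  # select batch Q
        y_new = oracle_label(X_pool[query_idx])          # human annotations
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])  # grow L
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, query_idx, axis=0)    # shrink U
    return theta
```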

2.2 Uncertainty-based AL

Commonly used AL query strategies relying on informativeness use the uncertainty of the learner model \(\theta \) to select the AL query. The uncertainty is defined as the inverse of the confidence/probability \(P_{\theta }(y|x)\) of the learner in classifying a sample x with the label y. The idea is to label samples in those regions the learner model is most uncertain about, thereby intentionally decreasing the model’s overall uncertainty. The simplest informativeness-based AL strategy is Uncertainty Least Confidence (LC) Lewis and Gale (1994). This strategy selects those samples the learner model is most uncertain about, i.e. where the probability \(P_{\theta }(\hat{y}|x)\) of the most probable label \(\hat{y}\) is the lowest:

$$\begin{aligned} f_{LC}(\mathcal {U}) =\underset{{x \in \mathcal {U}}}{\arg \max } \left( 1-P_{\theta }(\hat{y}|x)\right) \end{aligned}$$
(1)

A variant of the uncertainty strategy is Uncertainty Max-Margin (MM) Scheffer et al (2001), which selects those samples where the difference between the certainty for the most probable class \(\hat{y}_1\) and the second most probable class \(\hat{y}_2\) is the lowest:

$$\begin{aligned} {{f}_{MM}}(\mathcal {U}) = \underset{{x \in \mathcal {U}}}{\arg \min } \left( P_{\theta }(\hat{y}_1|x)-P_{\theta }(\hat{y}_2|x)\right) \end{aligned}$$
(2)

Another variant is Uncertainty Entropy (Ent) Shannon (1948), where the entropy of the label distribution is used to measure the uncertainty of the learner:

$$\begin{aligned} {f_{Ent}}(\mathcal {U}) = \underset{x \in \mathcal {U}}{\arg \max } \left( - \sum _i P_{\theta }(\hat{y}_i|x)\log P_{\theta }(\hat{y}_i|x)\right) \end{aligned}$$
(3)
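For illustration, a minimal sketch (our own, not taken from any AL library) of how the three scores in Eqs. 1–3 can be computed from a matrix of predicted class probabilities; all helpers return scores where higher means more uncertain:

```python
# Sketch of the three strategies in Eqs. 1-3, computed from a matrix of
# predicted probabilities P_theta(y|x) with shape (n_samples, n_classes).
import numpy as np

def least_confidence(probs):
    return 1.0 - probs.max(axis=1)                      # Eq. 1

def max_margin(probs):
    part = np.sort(probs, axis=1)                       # ascending per row
    return -(part[:, -1] - part[:, -2])                 # Eq. 2, negated margin

def entropy(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)   # Eq. 3

probs = np.array([[0.55, 0.40, 0.05],
                  [0.90, 0.05, 0.05]])
print(np.argsort(-least_confidence(probs))[:1])         # sample 0 queried first
```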

All uncertainty-based strategies have in common that they need the confidence of the learner model in a quantified form to rank the unlabeled samples from most certain to most uncertain. For deep NNs, like Transformer-encoder based language models, the softmax activation function is typically used.

Query-by-committee (QBC) Seung et al (1992), another widely used strategy, uses a committee of multiple learner models and selects those samples first on which the models disagree the most. A large drawback of this strategy is the increased runtime, especially when using large NNs as learner models.

Recent uncertainty-based strategies include Cartography Active Learning (CAL) Zhang and Plank (2021).

3.1 Softmax as Uncertainty-measure

All uncertainty-based AL strategies have in common that they need the uncertainty of the learner model in a quantified form to rank the unlabeled samples. In this paper, this is called an Uncertainty-measure. The inverse of an Uncertainty-measure, the confidence-probability of an ML model, should reflect how probable it is that the model’s own predictions are true. For example, a confidence of \(70\%\) should mean a correct prediction in 70 out of 100 cases. Due to the property that all its components sum to 1, the softmax function is often used as a makeshift probability measure for the certainty of NNs:

$$\begin{aligned} \sigma (z)_i = \frac{\exp (z_i)}{\sum ^K_{j=1}\exp (z_j)}, \text { for } i=1,\dots , K \end{aligned}$$
(4)

The output of the i-th neuron of the last layer before entering the activation function is called a logit and denoted as \(z_i\); K denotes the number of neurons in the last layer.

But as has been pointed out by other researchers Lakshminarayanan et al (2017); Weiss and Tonella (2022); Gleave and Irving (2022); Sankararaman et al (2022); D’Arcy and Downey (2022), the training objective for NNs is purely to maximize the value of the correct output neuron, not to produce a true confidence-probability. An inherent limitation of the softmax function is its inability to express – even in the theoretical case – zero confidence in its prediction, as the sum of all possible outcomes always equals 1. Previous works have indicated that softmax-based confidence is often overconfident Gal and Ghahramani (2016).

Pearce et al (2021); Hein et al (2019) deeply investigate the foundations of the vanilla softmax function as a confidence-probability. They show that especially NNs using the typical ReLU activation function for the inner layers can easily be tricked into being overly confident in any prediction by simply scaling the input \(x\) with an arbitrarily large value \(\alpha >1\) to \(\tilde{x} = \alpha x\).
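A toy numeric illustration of this effect (our own sketch; it assumes the logits scale linearly with the input, as in a bias-free ReLU network):

```python
# Toy illustration: scaling the logits by alpha > 1 pushes the softmax
# confidence toward 1 without any new information about the input.
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # hypothetical logits for an input x
for alpha in [1, 2, 10]:
    print(alpha, softmax(alpha * logits).max())
# 1 -> ~0.63, 2 -> ~0.84, 10 -> ~1.00: top-class confidence approaches 1
```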

3.2 Alternative Uncertainty-measures

Even though it is known that an Uncertainty-measure is a crucial component for uncertainty-based AL Lakshminarayanan et al (2017); Blundell et al (2015), especially in the context of deep NNs like Transformer models Schröder et al (2022), little research has been done on solely comparing Uncertainty-measures with the goal of AL for Transformer-encoder models in mind. We selected seven methods from the literature suitable as alternative Uncertainty-measures of deep NNs such as Transformer-encoder models. They can be divided into five categories Gawlikowski et al (2021): a) single network deterministic methods, which deterministically produce the same result for each NN forward pass (Inhibited Softmax (IS) Możejko et al (2018), TrustScore (TrSc) Jiang et al (2018), and Evidential Neural Networks (Evi) Sensoy et al (2018)), b) Bayesian methods, which sample from a distribution and therefore yield non-deterministic results (Monte-Carlo Dropout (MC) Gal and Ghahramani (2016)), c) ensemble methods, which combine multiple deterministic models into a single decision (Softmax Ensemble Seung et al (1992)), d) calibration methods, which calibrate the softmax function (Label Smoothing (LS) Szegedy et al (2016) and Temperature Scaling (TeSc) Zhang et al (2020)), and e) test-time augmentation methods, which, similarly to the ensemble methods, augment the input samples and return the combined prediction for the augmented samples. The last category is a subject of future research, as we could not find a subset of data augmentation techniques that reliably worked well for our use case across different datasets.

More elaborate AL strategies like BALD Kirsch et al (2019) or QUIRE Huang et al (2010) not only focus on the confidence-probability measure, but also make use of the vector space to label a diverse training set, including regions far away from the classification boundary. As the focus of this paper is purely on evaluating the influence of the confidence prediction methods, we deliberately use only the most basic AL strategy, Uncertainty Least Confidence.

In the following, the core ideas of the individual methods are briefly explained. More details, reasonings, and the exact formulas can be found in the original papers.

Inhibited Softmax (IS). The Inhibited Softmax method Możejko et al (2018) is a simple extension of the vanilla softmax function by an additional constant factor \(\alpha \in \mathbb {R}\), which enhances the effect of the absolute magnitude of the single logit value \(z_i\) on the softmax output:

$$\begin{aligned} \sigma (z)_i = \frac{\exp (z_i)}{\sum ^K_{j=1}\exp (z_j)+\exp (\alpha )} \end{aligned}$$
(5)

To ensure that the added fraction is not removed during the training process, several changes to the NN have to be made, including: a) removing the bias b from the input of the neuron activation function, and b) extending the loss function by a special evident regularisation term.
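As an illustration, Eq. 5 can be sketched on top of precomputed logits; the network modifications a) and b) required to make it effective during training are omitted here:

```python
# Sketch of the Inhibited Softmax output (Eq. 5); `alpha` is the constant
# inhibition factor. The training-time network changes are omitted.
import numpy as np

def inhibited_softmax(logits, alpha=1.0):
    e = np.exp(logits - logits.max())         # stabilized exponentials
    inhibitor = np.exp(alpha - logits.max())  # shift alpha consistently
    return e / (e.sum() + inhibitor)
```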

TrustScore (TrSc). The TrustScore Jiang et al (2018) method uses the set of available labeled data to calculate a TrustScore, independent of the NN model. In a first step, the available labeled data is clustered into a single high-density region for each class. The TrustScore ts of a sample \(x\) is then calculated as the ratio of the distance from \(x\) to the cluster of the nearest class \(c_{closest}\) other than the predicted class, and the distance to the cluster of the predicted class \(\hat{y}\):

$$\begin{aligned} ts_x= \frac{dist(x, c_{closest})}{dist(x, c_{\hat{y}})} \end{aligned}$$
(6)

Therefore, the TrustScore is high when the cluster of the nearest other class is far away relative to the cluster of the predicted class, indicating a trustworthy classification; conversely, a low TrustScore indicates a potentially wrong classification. The distance metric as well as the calculation of the clusters is based on the k-nearest-neighbors algorithm.
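A minimal sketch of Eq. 6, simplifying the cluster computation to one centroid per class (the original method derives density-filtered high-density sets via k-nearest neighbors):

```python
# Sketch of Eq. 6 with per-class centroids as a simplification.
import numpy as np

def trust_score(x, centroids, predicted_class):
    """centroids: dict mapping class id -> centroid vector."""
    d_pred = np.linalg.norm(x - centroids[predicted_class])
    d_closest = min(np.linalg.norm(x - mu)               # nearest other class
                    for c, mu in centroids.items() if c != predicted_class)
    return d_closest / (d_pred + 1e-12)                  # high = trustworthy
```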

Evidential Neural Networks (Evi). Evidential Neural Networks Sensoy et al (2018) treat the vanilla softmax outputs as a parameter set over a Dirichlet distribution. The prediction acts as evidence supporting the given parameter set out of the distribution, and the confidence-probability of the NN reflects the Dirichlet probability density function over the possible softmax outputs.
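A rough sketch of the underlying subjective-logic view (following Sensoy et al (2018); the evidential training loss is omitted): non-negative evidence outputs parameterize a Dirichlet distribution, and the remaining uncertainty mass shrinks as evidence accumulates.

```python
# Sketch: evidence e_k (e.g. ReLU of logits) gives Dirichlet parameters
# alpha_k = e_k + 1; the uncertainty mass K / sum(alpha) shrinks with evidence.
import numpy as np

def evidential_uncertainty(evidence):
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S             # per-class belief masses
    uncertainty = len(alpha) / S      # remaining uncertainty mass
    return belief, uncertainty
```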

Monte-Carlo Dropout (MC). Monte-Carlo Dropout Gal and Ghahramani (2016) is a Bayesian method that uses the NN dropout regularization method to construct an ensemble of the same trained model for “free”. Dropout refers to randomly disabling neurons during the training phase, originally with the aim of reducing overfitting of the to-be-trained network. For Monte-Carlo Dropout, the dropout method is applied during the prediction phase. As neurons are disabled randomly, this results in a large Gaussian sample space of different models. Therefore, each model, with differently dropped-out neurons, produces a potentially different prediction. Combining the vanilla softmax predictions using the arithmetic mean produces a combined Uncertainty-measure.
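A minimal PyTorch sketch of this idea, assuming `model` returns logits (real implementations usually re-enable only the dropout modules, since `train()` also affects layers such as batch normalization):

```python
# Sketch of Monte-Carlo Dropout: keep dropout active at prediction time and
# average the softmax outputs of several stochastic forward passes.
import torch

def mc_dropout_predict(model, x, n_passes=20):
    model.train()                                 # dropout stays active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_passes)])
    return probs.mean(dim=0)                      # arithmetic mean over passes
```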

Softmax Ensemble. The softmax ensemble approach uses an ensemble of NN models, similar to Monte-Carlo Dropout. The predictions of the ensemble can be interpreted as votes on the prediction. The disagreement among the voters then acts as the Uncertainty-measure and can be calculated in two ways: either as Vote Entropy (VE) or as Kullback-Leibler Divergence (KLD) McCallum and Nigam (1998):

$$\begin{aligned} VE(x) = -\sum _i\frac{V(\hat{y}_i, x)}{K}\log \frac{V(\hat{y}_i, x)}{K} \end{aligned}$$
(7)

with K being the number of ensemble models, and \(V(\hat{y}_i, x)\) denoting the number of ensemble models assigning the class \(\hat{y}_i\) to the sample \(x\). The complete equation for the KLD is omitted for brevity. Using an ensemble of softmax models inside an AL strategy results in the uncertainty-based Query-by-committee Seung et al (1992) AL strategy.
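As a sketch, vote entropy per Eq. 7 can be computed from the hard predictions of the K ensemble members (our own illustration, treating \(0 \log 0\) as 0):

```python
# Sketch of vote entropy (Eq. 7): `votes` holds the hard prediction of each
# of the K ensemble models for each sample.
import numpy as np

def vote_entropy(votes, n_classes):
    """votes: int array of shape (K_models, n_samples) with class ids."""
    K, n_samples = votes.shape
    ve = np.zeros(n_samples)
    for c in range(n_classes):
        frac = (votes == c).sum(axis=0) / K   # V(y_c, x) / K
        nz = frac > 0                         # 0 * log 0 treated as 0
        ve[nz] -= frac[nz] * np.log(frac[nz])
    return ve
```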

Fig. 3: Histogram distribution of exemplary uncertainty values for a single AL iteration of the TREC-6 dataset before Uncertainty-Clipping. The x-axis ranges from 0, full certainty, to 1, full uncertainty

Fig. 4: Thresholds used for the different Uncertainty-Clipping methods, displayed over a distribution of exemplary uncertainty values for a single AL iteration. The x-axis ranges from 0, full certainty, to 1, full uncertainty

Temperature Scaling (TeSc). Temperature Scaling Zhang et al (2020) is a model calibration method that is applied after training and changes the calculation of the softmax function by introducing a temperature \(T>0\):

$$\begin{aligned} \sigma (z)_i = \frac{\exp (z_i/T)}{\sum ^K_{j=1}\exp (z_j/T)} \end{aligned}$$
(8)

For \(T=1\) the softmax function stays the same as the original version; for values \(T<1\) the softmax output of the largest logit is increased; and for values \(T>1\) (the recommended case when using Temperature Scaling) the output of the most probable logit is decreased. This has a dampening effect on the overall confidence. The value of the temperature T is computed empirically using the existing labeled set of samples at application time. The parameter is therefore different for each AL iteration.
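A minimal sketch of Eq. 8 (the fitting of T, e.g. by minimizing the negative log-likelihood on the labeled data, is omitted):

```python
# Sketch of the temperature-scaled softmax (Eq. 8).
import numpy as np

def softmax_with_temperature(logits, T=1.5):
    z = logits / T                            # T > 1 dampens confidence
    z = z - z.max(axis=-1, keepdims=True)     # numerical stabilization
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```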

Label Smoothing (LS). Label Smoothing Szegedy et al (2016) removes a fraction \(\alpha \) from the loss target of the predicted class and distributes it uniformly among the other classes by adding \(\frac{\alpha }{K-1}\) to the other targets, with K being the number of classification classes. In contrast to Temperature Scaling, Label Smoothing is not applied after the network has been trained, but directly during the training process via a modified loss function.
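A minimal sketch of the smoothed targets (our own illustration; these replace the one-hot targets inside the cross-entropy loss during training):

```python
# Sketch of label-smoothed targets: 1 - alpha for the true class and
# alpha / (K - 1) spread uniformly over the remaining classes.
import numpy as np

def smooth_targets(y, n_classes, alpha=0.1):
    targets = np.full((len(y), n_classes), alpha / (n_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - alpha
    return targets
```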

3.3 Uncertainty-Clipping (UC)

Algorithm 1: Top-k Clipping

Algorithm 2: First Peak Clipping

Algorithm 3: First Valley Clipping

The aforementioned Uncertainty-measures can directly be used in AL strategies to sort the pool of unlabeled samples and select exactly those samples for labeling that have the lowest confidence/highest uncertainty. As repeatedly reported by others Karamcheti et al (2021); Gleave and Irving (2022), the samples with the highest uncertainty values (cf. Fig. 3) are supposedly outliers that have ambiguous labels. Therefore, a high uncertainty is expected, but it is not a good indicator for a prioritized labeling of these samples, as an ML model which is primarily trained on ambiguous outliers will often need a lot of labeled data to separate the target classes correctly. Ignoring the most uncertain outlier samples results in selecting the second-most uncertain samples for labeling, which in most cases contain almost as much classification-boundary information as the most uncertain samples, and in the case of outliers even more. The challenge here is to detect potential outliers without knowing the true labels, while at the same time preventing too many samples from being excluded from labeling. The thresholds used by the different clipping methods are displayed in Fig. 4.

This idea of throwing away some samples to improve accuracy has also been noted to work well for Transformer models by Sankararaman et al (2022). Top-k Clipping. Algorithm 1 outlines the simplest variant: the uncertainty values \({\textbf {u}}\) are computed and sorted (lines 1 and 2); afterwards, the top-k uncertainty values are removed (lines 3 and 4). This method can be used in combination with any AL strategy that uses uncertainty to rank the pool of unlabeled samples. A fixed threshold has the advantage of a very low implementation overhead, but the disadvantage of being highly dependent on the parameter k: a too low value may ignore too few samples, a too high value too many. The following two methods aim to circumvent this restriction.
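A minimal sketch of this variant (our own illustration; choosing k as a fixed fraction of the pool size is one possible parameterization):

```python
# Sketch of Top-k Clipping: skip the k most uncertain samples before
# selecting the query batch.
import numpy as np

def topk_clipped_query(uncertainties, batch_size, clip_fraction=0.05):
    order = np.argsort(-uncertainties)            # most uncertain first
    k = int(len(uncertainties) * clip_fraction)   # suspected outliers to skip
    return order[k:k + batch_size]                # query next-most uncertain
```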

First Peak Clipping. Other indicators for filtering uncertainty distributions are local maxima and minima. The goal is to determine the first peak/local maximum in the uncertainty distribution, which is potentially caused by the outlier samples. Algorithm 2 illustrates our proposed implementation of this idea. First, we calculate a kernel density estimation to get a smooth probability density function from the distribution of the uncertainty values (line 4). Based on this, we determine the first local maximum from the right (line 5), as not all uncertainty distributions have a second peak to the right. Additionally, to prevent filtering out too many samples, we limit this method using the top-k clipping threshold from the first method (lines 6 and 7).

A disadvantage of this and the following method is the dependency on the existence of local maxima, which not every distribution of uncertainty values exhibits.

First Valley Clipping. The idea is very similar to First Peak Clipping, but instead of taking the maximum of the first hill from the right, we use the valley after the first hill as the threshold, as outlined in Algorithm 3.
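A combined sketch of both threshold detectors (our own illustration of the idea in Algorithms 2 and 3; the additional top-k guard and other safeguards are omitted):

```python
# Sketch of First Peak / First Valley Clipping: a Gaussian kernel density
# estimate smooths the uncertainty distribution; the threshold is the
# right-most local maximum (peak) or the valley next to it.
import numpy as np
from scipy.signal import argrelextrema
from scipy.stats import gaussian_kde

def clipping_threshold(uncertainties, mode="peak", grid_size=200):
    grid = np.linspace(uncertainties.min(), uncertainties.max(), grid_size)
    density = gaussian_kde(uncertainties)(grid)    # smoothed density function
    peaks = argrelextrema(density, np.greater)[0]  # indices of local maxima
    valleys = argrelextrema(density, np.less)[0]   # indices of local minima
    if mode == "peak" and len(peaks) > 0:
        return grid[peaks[-1]]                     # first peak from the right
    if mode == "valley" and len(valleys) > 0:
        return grid[valleys[-1]]                   # valley after that peak
    return uncertainties.max()                     # no extremum: clip nothing
```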

4 Experimental Setup

Details to reproduce our evaluation are provided in this section. In support of the reproducibility initiative, we enable other researchers to re-use our work by making our source code fully publicly available on GitHub (Footnote 3).

4.1 Setup

We extended the AL framework small-text Schröder et al (2022) with the methods presented in Section 3.3 in combination with the implemented uncertainty measurements.

The AL experiments can be evaluated in a multitude of ways. At its core, after each AL iteration a standard ML metric is measured on a withheld dedicated test set. We decided on the accuracy (acc) metric, calculated on a withheld test dataset. It is possible to compare the test accuracy of the last iteration, the mean of the last five iterations (\(acc_{last5}\)), and the mean of all iterations. The last one equals the area under the curve when plotting the so-called AL learning curve. As an effective AL strategy should select the most valuable samples for labeling first, those metrics that include the accuracy of multiple iterations are often closer to real use cases. Nevertheless, at the beginning of the labeling process the fluctuation of the test accuracy is very high for most strategies and often contains surprisingly little information about which strategy is better. The influence of the initially labeled samples is simply so high that a better strategy with a bad starting point has no chance against a bad strategy with a good starting point. But after a couple of AL iterations the results stabilize and good strategies can be reliably distinguished from bad ones, as each strategy tends to approach its own characteristic threshold regardless of the starting point. Therefore, we use the mean accuracy of the last five iterations, deliberately ignoring the first iterations with their highly fluctuating results.
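For clarity, the chosen metric is simply (our own sketch):

```python
# Sketch of the acc_last5 metric: mean test accuracy over the final five
# AL iterations, ignoring the noisy early iterations.
import numpy as np

def acc_last5(accuracy_per_iteration):
    return float(np.mean(accuracy_per_iteration[-5:]))
```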

Additionally, we measured the runtime. AL, applied in real-life scenarios, is an interactive process: decisions of the AL strategy should be made in the magnitude of single-digit seconds, as longer calculation times render the annotation process impractical.

As each experiment was repeated 10 times, we report in the following the arithmetic mean over the 10 repetitions, separated by dataset.