1 Introduction

Within the last decade, enormous advances in deep neural networks (DNNs) have been realized, encouraging their adoption in a variety of research fields where complex systems have to be modeled or understood, such as earth observation, medical image analysis, or robotics. Although DNNs have become attractive in high-risk fields such as medical image analysis (Nair et al. 2020; Roy et al. 2019; Seebock et al. 2020; LaBonte et al. 2019; Reinhold et al. 2020; Eggenreich et al. 2020) or autonomous vehicle control (Feng et al. 2018; Choi et al. 2019; Amini et al. 2018; Loquercio et al. 2020), their deployment in mission- and safety-critical real world applications remains limited. The main factors responsible for this limitation are

  • the lack of expressiveness and transparency of a deep neural network’s inference model, which makes it difficult to trust its outcomes (Roy et al. 2019),

  • the inability to distinguish between in-domain and out-of-domain samples (Lee et al. 2018a; Mitros and Mac Namee 2019) and the sensitivity to domain shifts (Ovadia et al. 2019),

  • the inability to provide reliable uncertainty estimates for a deep neural network’s decision (Ayhan and Berens 2018) and frequently occurring overconfident predictions (Guo et al. 2017; Wilson and Izmailov 2020), and

  • the sensitivity to adversarial attacks that make deep neural networks vulnerable to sabotage (Rawat et al. 2017; Serban et al. 2018; Smith and Gal 2018).

These factors are mainly based on an uncertainty already included in the data (data uncertainty) or a lack of knowledge of the neural network (model uncertainty). To overcome these limitations, it is essential to provide uncertainty estimates, such that uncertain predictions can be ignored or passed to human experts (Gal and Ghahramani 2016). Providing uncertainty estimates is not only important for safe decision-making in high-risk fields, but it is also crucial in fields where the data sources are highly inhomogeneous and labeled data is rare, such as in remote sensing (Rußwurm et al. 2020; Gawlikowski et al. 2022). Also for fields where uncertainties form a crucial part of the learning techniques, such as for active learning (Gal et al. 2017b; Chitta et al.

  • Sources and types of uncertainty (Sect. 2),

  • Recent studies and approaches for estimating uncertainty in DNNs (Sect. 3),

  • Uncertainty measures and methods for assessing the quality and impact of uncertainty estimates (Sect. 4),

  • Recent studies and approaches for calibrating DNNs (Sect. 5),

  • An overview of frequently used evaluation data sets, available benchmarks, and implementations (Sect. 6),

  • An overview of real-world applications using uncertainty estimates (Sect. 7),

  • A discussion of current challenges and future directions of research (Sect. 8).

  • The basic descriptions of uncertainty representations in neural networks are not problem specific, and many of the proposed methods (e.g., BNNs or ensembles of neural networks) can be applied to many different types of problems such as classification, regression, or segmentation. Unless stated otherwise, the presented methods are not limited to a specific type of problem. For a deeper dive into specific applications of the methods, we refer to the section on applications and to further readings in the referenced literature.

    2 Uncertainty in deep neural networks

    A neural network is a non-linear function \(f_\theta\) parameterized by model parameters \(\theta\) (i.e. the network weights) that maps from a measurable input set \(\mathbb {X}\) to a measurable output set \(\mathbb {Y}\), i.e.

    $$\begin{aligned} f_\theta : \mathbb {X}\rightarrow \mathbb {Y}\qquad f_\theta (x)=y. \end{aligned}$$
    (1)

    For a supervised setting, we further have a finite set of training data \(\mathcal {D}\subseteq \mathbb {D}=\mathbb {X}\times \mathbb {Y}\) containing N data samples and corresponding targets, i.e.

    $$\begin{aligned} \mathcal {D}=(\mathcal {X},\mathcal {Y})=\{x_n,y_n\}_{n=1}^N\subseteq \mathbb {D}\,. \end{aligned}$$
    (2)
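For illustration only, the notation in (1) and (2) can be instantiated with a small, hypothetical model; the architecture, the random toy data, and the three-class output below are placeholders chosen for this example and are not part of the survey.

```python
# Minimal, hypothetical instantiation of (1) and (2): a small network f_theta
# mapping X = R^4 to Y (here: logits of 3 classes) and a toy training set D.
import torch
import torch.nn as nn

torch.manual_seed(0)

f_theta = nn.Sequential(                  # f_theta : X -> Y, parameterized by theta
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 3),
)

N = 100
X_train = torch.randn(N, 4)               # samples x_1, ..., x_N
y_train = torch.randint(0, 3, (N,))       # targets y_1, ..., y_N, so D = {(x_n, y_n)}

x_star = torch.randn(1, 4)                # a new data sample x*
y_star = f_theta(x_star).argmax(dim=1)    # predicted target f_theta(x*) = y*
print(y_star.item())
```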

    For a new data sample \(x^*\in \mathbb {X}\), a neural network trained on \(\mathcal {D}\) can be used to predict a corresponding target \(f_\theta (x^*) = y^*\). We consider four different steps from the raw information in the environment to a prediction by a neural network with quantified uncertainties, namely

1. the data acquisition process: The occurrence of some information in the environment (e.g. a bird’s singing) and a measured observation of this information (e.g. an audio record).

2. the DNN building process: The design and training of a neural network.

3. the applied inference model: The model which is applied for inference (e.g. a BNN or an ensemble of neural networks).

4. the prediction’s uncertainty model: The modelling of the uncertainties caused by the neural network and/or by the data.

In practice, these four steps contain several potential sources of uncertainty and errors, which in turn affect the final prediction of a neural network. The five factors that we consider most vital for causing uncertainty in a DNN’s predictions are

    • the variability in real world situations,

    • the errors inherent to the measurement systems,

    • the errors in the architecture specification of the DNN,

    • the errors in the training procedure of the DNN,

    • the errors caused by unknown data.

    In the following, the four steps leading from raw information to uncertainty quantification on a DNN’s prediction are described in more detail. Within this, we highlight the sources of uncertainty that are related to the single steps and explain how the uncertainties are propagated through the process. Finally, we introduce a model for the uncertainty on a neural network’s prediction and introduce the main types of uncertainty considered in neural networks.

The goal of this section is to give an accessible idea of the uncertainties in neural networks. Hence, for the sake of simplicity, we only describe and discuss the mathematical properties that are relevant for understanding the approaches and applying the methodology in different fields.

    2.1 Data acquisition

    In the context of supervised learning, the data acquisition describes the process where measurements x and target variables y are generated in order to represent a (real world) situation \(\omega\) from some space \(\Omega\). In the real world, a realization of \(\omega\) could for example be a bird, x a picture of this bird, and y a label stating ‘bird’. During the measurement, random noise can occur and information may get lost. We model this randomness in x by

    $$\begin{aligned} x\vert \omega \sim p_{x\vert \omega }. \end{aligned}$$
    (3)

Equivalently, the corresponding target variable y is derived, where the description is either based on another measurement or is the result of a labeling process. For both cases, the description can be affected by noise and errors and we state it as

    $$\begin{aligned} y\vert \omega \sim p_{y\vert \omega }. \end{aligned}$$
    (4)

    A neural network is trained on a finite data set of realizations of \(x\vert \omega _i\) and \(y\vert \omega _i\) based on N real world situations \(\omega _1,\ldots ,\omega _N\),

    $$\begin{aligned} \mathcal {D}=\{x_i, y_i\}_{i=1}^N\,. \end{aligned}$$
    (5)
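The acquisition model in (3)–(5) can be simulated with a toy example; the concrete choices below (a scalar latent situation, Gaussian measurement noise, a small label-flip probability) are illustrative assumptions only.

```python
# Toy simulation of the data acquisition model: latent situations omega,
# noisy measurements x | omega (eq. 3), noisy targets y | omega (eq. 4),
# collected into a finite training set D (eq. 5).
import numpy as np

rng = np.random.default_rng(0)

N = 1000
omega = rng.uniform(-2.0, 2.0, size=N)        # real world situations omega_1, ..., omega_N

sigma_x, p_flip = 0.3, 0.05                   # assumed measurement and label noise levels
x = omega + rng.normal(0.0, sigma_x, size=N)  # x | omega ~ p_{x|omega}
y_clean = (omega > 0.0).astype(int)           # ideal target derived from omega
flip = rng.random(N) < p_flip                 # label noise in the labeling process
y = np.where(flip, 1 - y_clean, y_clean)      # y | omega ~ p_{y|omega}

D = list(zip(x, y))                           # D = {(x_i, y_i)}_{i=1}^N
print(len(D), D[0])
```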

When collecting the training data, two factors can cause uncertainty in a neural network trained on this data. First, the sample space \(\Omega\) should be sufficiently covered by the training data \(x_1,\ldots ,x_N\) for \(\omega _1,\ldots ,\omega _N\). Here, one has to take into account that for a new sample \(x^*\) it in general holds that \(x^*\ne x_i\) for all training samples \(x_i\). Consequently, the target has to be estimated based on the trained neural network model, which directly leads to the first factor of uncertainty:

    Factor I: Variability in real world situations

Most real world environments are highly variable and almost constantly affected by changes. These changes affect parameters such as temperature, illumination, clutter, and physical objects’ size and shape. Changes in the environment can also affect the appearance of objects; for example, plants after rain look very different from plants during a drought. When real world situations change compared to the training set, this is called a distribution shift. Neural networks are sensitive to distribution shifts, which can lead to significant changes in the performance of a neural network.

The second factor is based on the measurement system, which has a direct effect on the correlation between the samples and the corresponding targets. The measurement system generates information \(x_i\) and \(y_i\) that describe \(\omega _i\) but might not contain enough information to learn a direct mapping from \(x_i\) to \(y_i\). This means that there might be highly different real world information \(\omega _i\) and \(\omega _j\) (e.g. city and forest) resulting in very similar corresponding measurements \(x_i\) and \(x_j\) (e.g. temperature) or similar corresponding targets \(y_i\) and \(y_j\) (e.g. label noise that labels both samples as forest). This directly leads to our second factor of uncertainty:

    Factor II: Error and noise in measurement systems

The measurements themselves can be a source of uncertainty on the neural network’s prediction. This can be caused by limited information in the measurements, such as the image resolution. Moreover, it can be caused by noise, for example sensor noise, motion, or mechanical stress, leading to imprecise measurements. Furthermore, false labeling is also a source of uncertainty that can be seen as an error or noise in the measurement system. It is referred to as label noise and affects the model by reducing the confidence on the true class prediction during training. Depending on the intensity, this type of noise and errors can be used to regularize the training process and to improve robustness and generalization (Goodfellow et al. 2016; Peterson et al. 2019; Lukasik et al. 2020).

    2.2 Deep neural network design and training

    The design of a DNN covers the explicit modeling of the neural network and its stochastic training process. The assumptions on the problem structure induced by the design and training of the neural network are called inductive bias (Battaglia et al. 2018). We summarize all decisions of the modeler on the network’s structure (e.g. the number of parameters, the layers, the activation functions, etc.) and training process (e.g. optimization algorithm, regularization, augmentation, etc.) in a structure configuration s. The defined network structure gives the third factor of uncertainty in a neural network’s predictions:

    Factor III: Errors in the model structure

The structure of a neural network has a direct effect on its performance and therefore also on the uncertainty of its prediction. For instance, the number of parameters affects the memorization capacity, which can lead to under- or over-fitting on the training data. Regarding uncertainty in neural networks, it is known that deeper networks tend to be overconfident in their softmax output, meaning that they assign too much probability to the class with the highest probability score (Guo et al. 2017).

    For a given network structure s and a training data set \(\mathcal {D}\), the training of a neural network is a stochastic process and therefore the resulting neural network \(f_\theta\) is based on a random variable,

    $$\begin{aligned} \theta \vert D, s \sim p_{\theta \vert D,s}. \end{aligned}$$
    (6)

The process is stochastic due to random decisions such as the order of the data, random initialization, or random regularization such as augmentation or dropout. The loss landscape of a neural network is highly non-linear and the randomness in the training process in general leads to different local optima \(\theta ^*\) and hence to different models (Lakshminarayanan et al. 2017). Also, parameters such as batch size, learning rate, and the number of training epochs affect the training and result in different models. Depending on the underlying task, these models can differ significantly in their predictions for single samples, even leading to a difference in the overall model performance. This sensitivity to the training process directly leads to the fourth factor for uncertainties in neural network predictions:

    Factor IV: Errors in the training procedure

The training process of a neural network includes many parameters that have to be defined (batch size, optimizer, learning rate, stopping criteria, regularization, etc.), and also stochastic decisions within the training process (batch generation and weight initialization) take place. All these decisions affect the local optima, and it is therefore very unlikely that two training processes deliver the same model parameterization. A training data set that suffers from imbalance or low coverage of single regions in the data distribution also introduces uncertainties on the network’s learned parameters, as already described for the data acquisition. This might be softened by applying augmentation to increase the variety or by balancing the impact of single classes or regions on the loss function.

    Since the training process is based on the given training data set \(\mathcal {D}\), errors in the data acquisition process (e.g. label noise) can result in errors in the training process.
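The stochasticity of the training process described by (6) can be illustrated by training the same architecture on the same data twice, changing only the random seed; the tiny model, the toy labels, and the optimizer settings below are placeholders for the sake of the example.

```python
# Two training runs that differ only in the random seed (initialization and
# data order) generally end in different local optima theta* (cf. eq. 6) and
# can disagree on single samples.
import torch
import torch.nn as nn

def train(seed, X, y, epochs=200):
    torch.manual_seed(seed)                      # random initialization
    model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        perm = torch.randperm(len(X))            # random data order
        loss = nn.functional.cross_entropy(model(X[perm]), y[perm])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

torch.manual_seed(123)
X = torch.randn(256, 2)
y = (X[:, 0] * X[:, 1] > 0).long()               # toy XOR-like labels

m1, m2 = train(0, X, y), train(1, X, y)
x_star = torch.randn(20, 2)
disagree = (m1(x_star).argmax(1) != m2(x_star).argmax(1)).sum().item()
print(disagree, "of 20 predictions differ between the two runs")
```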

    2.3 Inference

    The inference describes the prediction of an output \(y^*\) for a new data sample \(x^*\) by the neural network. At this time, the network is trained for a specific task. Thus, samples that are not inputs for this task cause errors and are therefore also a source of uncertainty:

    Factor V: Errors caused by unknown data

Especially in classification tasks, a neural network that is trained on samples derived from a world \(\mathcal {W}_1\) can also be capable of processing samples derived from a completely different world \(\mathcal {W}_2\). This is for example the case when a network trained on images of cats and dogs receives a sample showing a bird. Here, the source of uncertainty does not lie in the data acquisition process, since we assume a world to contain only feasible inputs for a prediction task. Even though the practical result might be equivalent to that of excessive noise on a sensor or a complete sensor failure, the data considered here represents a valid sample, but for a different task or domain.

    2.4 Predictive uncertainty model

    As a modeller, one is mainly interested in the uncertainty that is propagated onto a prediction \(y^*\), the so-called predictive uncertainty. Within the data acquisition model, the probability distribution for a prediction \(y^*\) based on some sample \(x^*\) is given by

    $$\begin{aligned} p(y^*\vert x^*) = \int _\Omega p(y^*\vert \omega )p(\omega \vert x^*)d\omega \end{aligned}$$
    (7)

    and a maximum a posteriori (MAP) estimation is given by

    $$\begin{aligned} y^* = \arg \max _y p(y \vert x^*). \end{aligned}$$
    (8)

Since the modeling is based on the unavailable latent variable \(\omega\), one uses an approximate representation based on a sampled training data set \({\mathcal {D}}=\{x_i,y_i\}_{i=1}^N\) containing N samples and corresponding targets. The distribution and the MAP estimator in (7) and (8) for a new sample \(x^*\) are then approximated based on the known examples by

$$\begin{aligned} p(y^*\vert x^*) \approx p(y^*\vert \mathcal {D},x^*) \end{aligned}$$
    (9)

    and

    $$\begin{aligned} y^* = \arg \max _y p(y \vert \mathcal {D},x^*). \end{aligned}$$
    (10)

    In general, the distribution given in (9) is unknown and can only be estimated based on the given data in D. For this estimation, neural networks form a very powerful tool for many tasks and applications.

    The prediction of a neural network is subject to both model-dependent and input data-dependent errors, and therefore the predictive uncertainty associated with \(y^*\) is in general separated into data uncertainty [also statistical or aleatoric uncertainty (Hüllermeier and Waegeman 2021)] and model uncertainty [also systemic or epistemic uncertainty (Hüllermeier and Waegeman 2021)]. Depending on the underlying approach, additional explicit modeling of distributional uncertainty (Malinin and Gales 2018) is used to model the uncertainty, which is caused by examples from a region not covered by the training data. Figure 1 illustrates the described types of uncertainty for regression and classification tasks.

    2.4.1 Model- and data uncertainty

The model uncertainty covers the uncertainty that is caused by shortcomings in the model, either by errors in the training procedure, an insufficient model structure, or lack of knowledge due to unknown samples or poor coverage of the training data set. In contrast to this, data uncertainty is related to uncertainty that directly stems from the data. Data uncertainty is caused by information loss when representing the real world within a data sample and represents the distribution stated in (7). For example, in regression tasks, noise in the input and target measurements causes data uncertainty that the network cannot learn to correct. In classification tasks, samples that do not contain enough information in order to identify one class with 100% certainty cause data uncertainty on the prediction. The information loss is a result of the measurement system, e.g. by representing real world information by image pixels with a specific resolution, or by errors in the labelling process.

For the five presented factors for uncertainties on a neural network’s prediction, this means the following: Only Factor II represents a source of aleatoric uncertainty, since it causes insufficient data that makes a certain prediction impossible. For all other factors, the source of uncertainty lies in the experimental setup and is related to epistemic uncertainty. The uncertainty induced by Factor I is a result of the insufficient coverage of the data distribution in the training data. Factor III and Factor IV clearly represent shortcomings in the training and the modelling of the network. Factor V is also related to epistemic uncertainty, since the data itself might be fine but the unknown domain is not included in the modelling and hence the model lacks the knowledge of how to handle this data. Figure 2 illustrates the discussed stages of a neural network pipeline employed in a remote sensing classification task, along with the diverse sources of uncertainties that impact the resulting predictions.

    While model uncertainty can be (theoretically) reduced by improving the architecture, the learning process, or the training data set, the data uncertainties cannot be explained away (Kendall and Gal 2017). Therefore, DNNs that are capable of handling uncertain inputs and that are able to remove or quantify the model uncertainty and give a correct prediction of the data uncertainty are of paramount importance for a variety of real world mission- and safety-critical applications.

Fig. 1 Visualization of the data, the model, and the distributional uncertainty for classification and regression models

    The Bayesian framework offers a practical tool to reason about uncertainty in deep learning (Gal and Ghahramani 2015). In Bayesian modeling, the model uncertainty is formalized as a probability distribution over the model parameters \(\theta\), while the data uncertainty is formalized as a probability distribution over the model outputs \(y^*\), given a parameterized model \(f_\theta\). The distribution over a prediction \(y^*\), the predictive distribution, is then given by

    $$\begin{aligned} p(y^*\vert x^*, D)&=\int \underbrace{p(y^*\vert x^*,\theta )}_{\text {Data}}\underbrace{p(\theta \vert D)}_{\text {Model}}d\theta \,. \end{aligned}$$
    (11)

The term \(p(\theta \vert D)\) is referred to as the posterior distribution on the model parameters and describes the uncertainty on the model parameters given a training data set D. The posterior distribution is in general not tractable. While ensemble approaches seek to approximate it by learning several different parameter settings and averaging over the resulting models (Lakshminarayanan et al. 2017), Bayesian inference reformulates it using Bayes’ theorem (Bishop and Nasrabadi 2006)

    $$\begin{aligned} p(\theta \vert D) = \frac{p(D\vert \theta )p(\theta )}{p(D)}. \end{aligned}$$
    (12)

The term \(p(\theta )\) is called the prior distribution on the model parameters, since it does not take any information but the general knowledge on \(\theta\) into account. The term \(p(D\vert \theta )\) represents the likelihood that the data in D is a realization of the distribution predicted by a model parameterized with \(\theta\). Many loss functions are motivated by or can be related to the likelihood function. Loss functions that are derived from maximizing the log-likelihood (for an assumed distribution) are, for example, the cross-entropy and the mean squared error (Ritter et al. 2018).
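As a brief illustration of this connection (a standard derivation sketched here for convenience, not quoted from the survey), the negative log-likelihood of a single training pair \((x_n, y_n)\) recovers the familiar loss terms, where \({\hat{p}}_{y_n}(x_n;\theta )\) denotes the predicted probability of the true class \(y_n\):

$$\begin{aligned} -\log p(y_n\vert x_n,\theta )&= -\log {\hat{p}}_{y_n}(x_n;\theta ) \qquad \text {(categorical likelihood: cross-entropy)}\\ -\log p(y_n\vert x_n,\theta )&= \frac{1}{2\sigma ^2}\left( y_n-f_\theta (x_n)\right) ^2+\text {const} \qquad \text {(Gaussian likelihood with fixed } \sigma ^2\text {: squared error)} \end{aligned}$$

Summing over the training set gives the usual empirical risk, and adding the negative log-prior \(-\log p(\theta )\) as a regularizer corresponds to MAP estimation.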

    Even with the reformulation given in (12), the predictive distribution given in (11) is still intractable. To overcome this, several different ways to approximate the predictive distribution were proposed. A broad overview of the different concepts and some specific approaches is presented in Sect. 3.

    2.4.2 Distributional uncertainty

    Depending on the approaches that are used to quantify the uncertainty in \(y^*\), the formulation of the predictive distribution might be further separated into data, distributional, and model parts (Malinin and Gales 2018):

    $$\begin{aligned} p(y^*\vert x^*, D)=\int \int \underbrace{p(y\vert \mu )}_{\text {Data}}\underbrace{p(\mu \vert x^*,\theta )}_{\text {Distributional}}\underbrace{p(\theta \vert D)}_{\text {Model}}d\mu d\theta . \end{aligned}$$
    (13)

    The distributional part in (13) represents the uncertainty on the actual network output, e.g. for classification tasks this might be a Dirichlet distribution, which is a distribution over the categorical distribution given by the softmax output. Modeled this way, distributional uncertainty refers to uncertainty that is caused by a change in the input-data distribution, while model uncertainty refers to uncertainty that is caused by the process of building and training the DNN. As modeled in (13), the model uncertainty affects the estimation of the distributional uncertainty, which affects the estimation of the data uncertainty.
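To illustrate how the distributional part in (13) can look for classification in practice, the following toy sketch parameterizes a Dirichlet distribution from network logits; the exponential link function and the reading of the total concentration \(\alpha _0\) as (inverse) distributional uncertainty are common but purely illustrative choices, not a method prescribed by the survey.

```python
# Illustrative sketch: using network logits to parameterize a Dirichlet
# distribution over the categorical (softmax) class probabilities, cf. eq. (13).
# The mean of the Dirichlet gives the class prediction, while a small total
# concentration alpha_0 signals high distributional uncertainty.
import numpy as np

def dirichlet_from_logits(logits):
    alpha = np.exp(logits) + 1.0          # illustrative link ensuring alpha_k >= 1
    alpha_0 = alpha.sum()                 # total evidence / concentration
    expected_probs = alpha / alpha_0      # mean of the Dirichlet = categorical prediction
    return alpha, alpha_0, expected_probs

confident_logits = np.array([5.0, 0.1, 0.2])   # much evidence for class 0
uncertain_logits = np.array([0.1, 0.2, 0.1])   # little evidence for any class

for logits in (confident_logits, uncertain_logits):
    alpha, alpha_0, probs = dirichlet_from_logits(logits)
    print(np.round(probs, 2), "alpha_0 =", round(float(alpha_0), 2))
```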

While most methods presented in this paper only distinguish between model and data uncertainty, approaches specialized in out-of-distribution (OOD) detection often explicitly aim at representing the distributional uncertainty (Malinin and Gales 2018; Nandy et al. 2020). A more detailed presentation of different approaches for quantifying uncertainties in neural networks is given in Sect. 3. In Sect. 4, different measures for assessing the different types of uncertainty are presented.

    2.5 Uncertainty classification

    On the basis of the input data domain, the predictive uncertainty can also be classified into three main classes:

    • In-domain uncertainty (Ashukha et al. 2019)


      In-domain uncertainty represents the uncertainty related to an input drawn from a data distribution assumed to be equal to the training data distribution. The in-domain uncertainty stems from the inability of the deep neural network to explain an in-domain sample due to a lack of in-domain knowledge. From a modeler’s point of view, in-domain uncertainty is caused by design errors (model uncertainty) and the complexity of the problem at hand (data uncertainty). Depending on the source of the in-domain uncertainty, it might be reduced by increasing the quality of the training data (set) or the training process (Hüllermeier and Waegeman 2021).

    • Domain-shift uncertainty (Ovadia et al. 2019)


      Domain-shift uncertainty denotes the uncertainty related to an input drawn from a shifted version of the training distribution. The distribution shift results from insufficient coverage by the training data and the variability inherent to real world situations. A domain-shift might increase the uncertainty due to the inability of the DNN to explain the domain-shift sample on the basis of the seen samples at training time. Some errors causing domain shift uncertainty can be modeled and can therefore be reduced. For example, occluded samples can be learned by the deep neural network to reduce domain shift uncertainty caused by occlusions (DeVries and Taylor 2017). However, it is difficult if not impossible to model all errors causing domain shift uncertainty, e.g., motion noise (Kendall and Gal 2017). From a modeler’s point of view, domain-shift uncertainty is caused by external or environmental factors but can be reduced by covering the shifted domain in the training data set.

    • Out-of-domain uncertainty (Hendrycks and Gimpel 2017; Liang et al. 2018b; Shafaei et al. 2019; Mundt et al. 2019)


Out-of-domain uncertainty represents the uncertainty related to an input drawn from the subspace of unknown data. The distribution of unknown data is different and far from the training distribution. While a DNN can extract in-domain knowledge from domain-shift samples, it cannot extract in-domain knowledge from out-of-domain samples. For example, while domain-shift uncertainty describes phenomena like a blurred picture of a dog, out-of-domain uncertainty describes the case when a network that learned to classify cats and dogs is asked to predict a bird. The out-of-domain uncertainty stems from the inability of the DNN to explain an out-of-domain sample due to its lack of out-of-domain knowledge. From a modeler’s point of view, out-of-domain uncertainty is caused by input samples for which the network is not meant to give a prediction, or by insufficient training data.

Since the model uncertainty captures what the DNN does not know due to a lack of in-domain or out-of-domain knowledge, it captures all of them: in-domain, domain-shift, and out-of-domain uncertainty. In contrast, the data uncertainty captures in-domain uncertainty that is caused by the nature of the data the network is trained on, for example overlapping samples and systematic label noise.

Fig. 2 The illustration shows the different steps of a neural network pipeline, based on the earth observation example of land cover classification (here settlement and forest) from optical images. The different factors that affect the predictive uncertainty are highlighted in the boxes. Factor I is shown as changing environments, e.g. cloud-covered trees and different types and colors of trees. Factor II is shown by insufficient measurements that cannot directly be used to separate settlement from forest, and by label noise. In practice, the resolution of such images can be low, which would also be part of Factor II. Factor III and Factor IV represent the uncertainties caused by the network structure and the stochastic training process, respectively. Factor V, in contrast, is represented by feeding the trained network with unknown types of images, namely cows and pigs

    3 Uncertainty estimation

As described in Sect. 2, several factors may cause model and data uncertainty and affect a DNN’s prediction. This variety of sources of uncertainty makes the complete exclusion of uncertainties in a neural network impossible for almost all applications. Especially in practical applications employing real world data, the training data is only a subset of all possible input data, which means that a mismatch between the DNN domain and the unknown actual data domain is often unavoidable. However, an exact representation of the uncertainty of a DNN prediction is also not possible to compute, since the different uncertainties can in general not be modeled accurately and are most often even unknown. Therefore, the estimation of the uncertainty in a DNN’s prediction is a popular and vital field of research. The data uncertainty part is normally represented in the prediction, e.g. in the softmax output of a classification network or in the explicit prediction of a standard deviation in a regression network (Kendall and Gal 2017). In contrast to this, several approaches that model the model uncertainty and seek to separate it from the data uncertainty, in order to obtain an accurate representation of the data uncertainty, have been introduced (Kendall and Gal 2017; Malinin and Gales 2018; Lakshminarayanan et al. 2017).

    In general, the methods for estimating the uncertainty can be split into four different types based on the number (single or multiple) and the nature (deterministic or stochastic) of the used DNNs.

    • Single deterministic methods give the prediction based on one single forward pass within a deterministic network. The uncertainty quantification is either derived by using additional (external) methods or is directly predicted by the network.

    • Bayesian methods cover all kinds of stochastic DNNs, i.e. DNNs where two forward passes of the same sample generally lead to different results.

    • Ensemble methods combine the predictions of several different deterministic networks at inference.

    • Test-time augmentation methods give the prediction based on one single deterministic network but augment the input data at test-time in order to generate several predictions that are used to evaluate the certainty of the prediction.

Fig. 3 Visualization of the four different types of uncertainty quantification methods presented in this paper

    In the following, the main ideas and further extensions of the four types are presented and their main properties are discussed. In Fig. 3, an overview of the different types and methods is given. In Fig. 4, the different underlying principles that are used to differentiate between the different types of methods are presented. Table 1 summarizes the main properties of the methods presented in this work, such as complexity, computational effort, memory consumption, flexibility, and others.

    Table 1 An overview of the four general methods presented in this paper, namely Bayesian neural networks, ensembles, single deterministic neural networks, and test-time data augmentation
Fig. 4 A visualization of the basic principles of uncertainty modeling of the four presented general types of uncertainty prediction in neural networks. For a given input sample \(x^*\), each approach delivers a prediction \(y^*\), a representation of the model uncertainty \(\sigma _{\text {model}}\), and a value of the data uncertainty \(\sigma _{\text {data}}\). A Single deterministic model, B Bayesian neural network, C ensemble approach, and D test-time data augmentation. The mean and the standard deviation are only used to keep the visualization simple; in practice, other methods could be utilized. For the deterministic approaches, the idea of predicting the parameters of a probability distribution is visualized; other approaches that are based on tools in addition to the prediction network are not visualized here

    Table 2 Overview of the properties of internal and external deterministic single network methods

    3.1 Single deterministic methods

For deterministic neural networks, the parameters are deterministic and each repetition of a forward pass delivers the same result. Under the term single deterministic network methods for uncertainty quantification, we summarize all approaches where the uncertainty on a prediction \(y^*\) is computed based on one single forward pass within a deterministic network. In the literature, several such approaches can be found. They can be roughly categorized into approaches where one single network is explicitly modeled and trained in order to quantify uncertainties (Sensoy et al. 2018; Malinin and Gales 2018; Możejko et al. 2018; Nandy et al. 2020; Oala et al. 2020) and approaches that use additional components in order to give an uncertainty estimate on the prediction of a network (Raghu et al. 2019; Ramalho and Miranda 2020; Oberdiek et al. 2018; Lee and AlRegib 2020). While for the first type the uncertainty quantification affects the training procedure and the predictions of the network, the latter type is in general applied to already trained networks. Since trained networks are not modified by such methods, they have no effect on the network’s predictions. In the following, we call these two types internal and external uncertainty quantification approaches.

    3.1.1 Internal uncertainty quantification approaches

Many of the internal uncertainty quantification approaches follow the idea of predicting the parameters of a distribution over the predictions instead of a direct pointwise maximum-a-posteriori estimation. Often, the loss function of such networks takes the expected divergence between the true distribution and the predicted distribution into account, e.g. in Malinin and Gales (2018, 2019). The distribution over the outputs can be interpreted as a quantification of the model uncertainty (see Sect. 2), trying to emulate the behavior of Bayesian modeling of the network parameters (Nandy et al. 2020). The prediction is then given as the expected value of the predicted distribution.

    For classification tasks, the output in general represents class probabilities. These probabilities are a result of applying the softmax function

    $$\begin{aligned} \begin{aligned}&\text {softmax}:{\mathbb {R}}^K\rightarrow \left\{ z\in {\mathbb {R}}^K\vert z_i \ge 0, \sum _{k=1}^K z_k =1\right\} \\&\text {softmax}(z)_j = \frac{\exp (z_j)}{\sum _{k=1}^K\exp (z_k)} \end{aligned} \end{aligned}$$
    (14)

    for multiclass settings and the sigmoid function

    $$\begin{aligned} \begin{aligned}&\text {sigmoid}:{\mathbb {R}}\rightarrow [0,1] \\&\text {sigmoid}(z) = \frac{1}{1+\exp (-z)} \end{aligned} \end{aligned}$$
    (15)

for binary classification tasks on the logits z. These probabilities can already be interpreted as a prediction of the data uncertainty. However, it is widely discussed that neural networks are often over-confident and that the softmax output is often poorly calibrated, leading to inaccurate uncertainty estimates (Vasudevan et al. 2019; Hendrycks and Gimpel 2017; Sensoy et al. 2018; Możejko et al. 2020). Evidential neural networks (Sensoy et al. 2018) address this for classification by placing a Dirichlet distribution over the class probabilities, and this idea was later transferred from classification tasks to regression tasks by learning the parameters of an evidential normal inverse gamma distribution over an underlying normal distribution. Charpentier et al. (2020) avoided the need of OOD data for the training process by using normalizing flows to learn a distribution over a latent space for each class. A new input sample is projected onto this latent space and a Dirichlet distribution is parameterized based on the class-wise densities of the received latent point.
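The functions in (14) and (15), together with the naive use of the resulting probabilities (e.g. via their entropy) as a data-uncertainty estimate, can be sketched as follows; the example logits are arbitrary and only serve as an illustration.

```python
# Softmax (14) and sigmoid (15) applied to raw logits, and the predictive
# entropy as a simple (often over-confident) proxy for the data uncertainty.
import numpy as np

def softmax(z):
    z = z - z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

sharp = softmax(np.array([8.0, 0.5, 0.2]))   # almost all mass on one class
flat = softmax(np.array([1.0, 0.9, 1.1]))    # mass spread over the classes

print(np.round(sharp, 3), "entropy:", round(float(entropy(sharp)), 3))
print(np.round(flat, 3), "entropy:", round(float(entropy(flat)), 3))
print("binary case, sigmoid(0.3):", round(float(sigmoid(0.3)), 3))
```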

Besides the Dirichlet distribution based approaches described above, several other internal approaches exist. In Liang et al. (2018b), a relatively simple approach based on small perturbations of the input data and temperature scaling calibration is presented, leading to an efficient differentiation between in- and out-of-distribution samples. Możejko et al. (2021) introduced a method that computes virtual residuals on the training samples of a regression task based on a cross-validation-like pre-training step. With the original training data expanded by the information of these residuals, the actual predictor is trained to give a prediction and a value of certainty. The experiments indicated that the virtual residuals represent a promising tool to avoid overconfident network predictions.
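As a rough sketch of the two ingredients mentioned for Liang et al. (2018b), temperature scaling rescales the logits before the softmax, and a small gradient-based input perturbation can further separate the confidence scores of in- and out-of-distribution samples; the model, the temperature T, and the step size epsilon below are arbitrary placeholders, and the code is only an illustration in the spirit of the method, not its exact formulation.

```python
# Illustrative sketch of temperature scaling and a small gradient-based input
# perturbation for OOD scoring (in the spirit of Liang et al. 2018b). The
# model, the temperature T and the step size epsilon are arbitrary placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
T, epsilon = 1000.0, 0.002

x = torch.randn(1, 4, requires_grad=True)
score = F.log_softmax(model(x) / T, dim=1).max()   # temperature-scaled confidence
score.backward()

# Nudge the input in the direction that increases the confidence score.
x_tilde = x + epsilon * x.grad.sign()
with torch.no_grad():
    final_score = F.softmax(model(x_tilde) / T, dim=1).max().item()

# Thresholding final_score is then used to separate in- from out-of-distribution samples.
print(final_score)
```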

    3.1.2 External uncertainty quantification approaches

External uncertainty quantification approaches do not affect the models’ predictions, since the evaluation of the uncertainty is separated from the underlying prediction task. Furthermore, several external approaches can be applied to already trained networks at the same time without affecting each other. Raghu et al. (2019) argued that when both tasks, the prediction and the uncertainty quantification, are done by one single method, the uncertainty estimation is biased by the actual prediction task. Therefore, they recommended a “direct uncertainty prediction” and suggested training two neural networks, one for the actual prediction task and a second one for the prediction of the uncertainty on the first network’s predictions. Similarly, Ramalho and Miranda (2020) introduced an additional neural network for uncertainty estimation. But in contrast to Raghu et al. (2019), the representation space of the training data is considered and the density around a given test sample is evaluated. The additional neural network uses this training data density in order to predict whether the main network’s estimate is expected to be correct or false. Hsu et al. (2020) detected out-of-distribution examples in classification tasks at test-time by predicting total probabilities for each class, in addition to the categorical distribution given by the softmax output. The class-wise total probability is predicted by applying the sigmoid function to the network’s logits. Based on these total probabilities, OOD examples can be identified as those with low class probabilities for all classes.

In contrast to this, Oberdiek et al. (2018) took the sensitivity of the model, i.e. the model’s slope, into account by using gradient metrics for the uncertainty quantification in classification tasks. Lee and AlRegib (2020) applied a similar idea but made use of back-propagated gradients. In their work, they presented state-of-the-art results on out-of-distribution and corrupted input detection.

    3.1.3 Summing up single deterministic methods

Compared to many other principles, single deterministic methods are computationally efficient in training and evaluation. For training, only one network has to be trained, and often the approaches can even be applied to pre-trained networks. Depending on the actual approach, only a single or at most two forward passes have to be performed for evaluation. The underlying networks may contain more complex loss functions, which slow down the training process (Sensoy et al. 2018), or external components that have to be trained and evaluated additionally (Raghu et al. 2019). But in general, this is still more efficient than the number of predictions needed for ensemble-based methods (Sect. 3.3), Bayesian methods (Sect. 3.2), and test-time data augmentation methods (Sect. 3.4). A drawback of single deterministic neural network approaches is the fact that they rely on a single opinion and can therefore become very sensitive to the underlying network architecture, training procedure, and training data.

    3.2 Bayesian neural networks

Bayesian Neural Networks (BNNs) (Denker et al. 1987; Tishby et al. 1989; Buntine and Weigend 1991) have the ability to combine the scalability, expressiveness, and predictive performance of neural networks with Bayesian learning, as opposed to learning via the maximum likelihood principle. This is achieved by inferring the probability distribution over the network parameters \(\theta =(w_1,\ldots ,w_K)\). More specifically, given a training input-target pair (x, y), the posterior distribution over the space of parameters \(p(\theta \vert x,y)\) is modelled by assuming a prior distribution over the parameters \(p(\theta )\) and applying Bayes theorem:

    $$\begin{aligned} p(\theta \vert x,y) = \frac{p(y\vert x,\theta )p(\theta )}{p(y\vert x)} \propto p(y\vert x,\theta )p(\theta ). \end{aligned}$$
    (16)

    Here, the normalization constant in (16) is called the model evidence \(p(y\vert x)\) which is defined as

    $$\begin{aligned} p(y \vert x) = \int p(y\vert x, \theta )p(\theta )d\theta . \end{aligned}$$
    (17)

    Once the posterior distribution over the weights has been estimated, the prediction of output \(y^*\) for a new input data \(x^*\) can be obtained by Bayesian Model Averaging or Full Bayesian Analysis that involves marginalizing the likelihood \(p(y\vert x,\theta )\) with the posterior distribution:

    $$\begin{aligned} p(y^*\vert x^*, x, y) = \int p(y^*\vert x^*,\theta ) p(\theta \vert x,y)d\theta . \end{aligned}$$
    (18)

This Bayesian way of prediction is a direct application of the law of total probability and endows the model with the ability to compute principled predictive uncertainty. The integral in (18) is intractable for most common prior-posterior pairs, and approximation techniques are therefore typically applied. The most widespread approximation, the Monte Carlo approximation, follows the law of large numbers and approximates the expected value by the mean of N stochastic networks, \(f_{\theta _1},\ldots ,f_{\theta _N}\), parameterized by N samples, \(\theta _1, \theta _2,\ldots , \theta _N\), from the posterior distribution of the weights, i.e.

    $$\begin{aligned} y^*\, \approx \, \frac{1}{N}\sum _{i=1}^{N} y_i^* \, = \, \frac{1}{N}\sum _{i=1}^{N} f_{\theta _i}(x^*). \end{aligned}$$
    (19)

Wilson and Izmailov (2020) argue that a key advantage of BNNs lies in this marginalization step, which in particular can improve both the accuracy and the calibration of modern deep neural networks. We note that the use cases of BNNs are not limited to uncertainty estimation but open up the possibility of bringing the powerful Bayesian toolbox into deep learning. Notable examples include Bayesian model selection (MacKay 1992a; Sato 2001; Corduneanu and Bishop 2001; Ghosh et al. 2019), model compression (Louizos et al. 2017; Federici et al. 2017; Achterhold et al. 2018), active learning (MacKay 1992b; Gal et al. 2017b; Kirsch et al. 2019), continual learning (Nguyen et al. 2018; Ebrahimi et al. 2020; Farquhar and Gal 2019; Li et al. 2020), theoretic advances in Bayesian learning (Khan et al. 2019) and beyond. While the formulation is rather simple, there exist several challenges. For example, no closed-form solution exists for the posterior inference, as conjugate priors do not typically exist for complex models such as neural networks (Bishop and Nasrabadi 2006). Hence, approximate Bayesian inference techniques are often needed to compute the posterior probabilities. Yet, directly using approximate Bayesian inference techniques has proven to be difficult, as the size of the data and the number of parameters are too large for the use cases of deep neural networks. In other words, the integrals of the above equations are not computationally tractable as the size of the data and the number of parameters grow. Moreover, specifying a meaningful prior for deep neural networks is another challenge that is less understood.
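Given samples \(\theta _1,\ldots ,\theta _N\) from (an approximation of) the posterior, the Monte Carlo estimate (19) can be written down in a few lines. In the sketch below, the "posterior samples" are generated by simply perturbing a point estimate of a toy logistic model, which is purely illustrative; in practice, the samples would come from one of the methods discussed in Sects. 3.2.1-3.2.3.

```python
# Monte Carlo approximation (19) of the predictive distribution (18): average
# the predictions of N networks parameterized by samples theta_1, ..., theta_N.
# The "posterior samples" below are stand-ins obtained by perturbing a point
# estimate of a toy logistic model (purely for illustration).
import numpy as np

rng = np.random.default_rng(0)

def f(theta, x):                          # a tiny "network": 1D logistic model
    return 1.0 / (1.0 + np.exp(-(theta[0] * x + theta[1])))

theta_hat = np.array([2.0, -0.5])
posterior_samples = [theta_hat + 0.3 * rng.normal(size=2) for _ in range(100)]

x_star = 0.4
preds = np.array([f(theta_i, x_star) for theta_i in posterior_samples])

y_star = preds.mean()                     # Monte Carlo estimate of the prediction (19)
spread = preds.std()                      # spread over samples reflects model uncertainty
print(round(float(y_star), 3), round(float(spread), 3))
```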

    In this survey, we classify the BNNs into three different types based on how the posterior distribution is inferred to approximate Bayesian inference:

    • Variational inference (Hinton and Van Camp 1993; Barber and Bishop 1998)


      Variational inference approaches approximate the (in general intractable) posterior distribution by optimizing over a family of tractable distributions.

    • Sampling approaches (Neal 1992)


      Sampling approaches deliver a representation of the target random variable from which realizations can be sampled. Such methods are based on Markov Chain Monte Carlo and further extensions.

    • Laplace approximation (Denker and LeCun 1991; MacKay 1992c)


      Laplace approximation simplifies the target distribution by approximating the log-posterior distribution and then, based on this approximation, deriving a normal distribution over the network weights.

These three types differ in multiple criteria that are of interest for practitioners. While variational inference and the Laplace approximation offer an analytical expression of the uncertainty and are derived in a deterministic manner, the sampling approaches generate samples and lack such an analytical expression and determinism. Here, it is important to note that variational inference is deterministic, even though many approximations of it are based on stochastic sampling. On the other hand, the sampling approaches are not biased by the network’s predictions and have the theoretical capability to combine multiple modes (i.e. multiple local solutions), whereas variational inference and the Laplace approximation only operate in the neighbourhood of a single mode. At the same time, a possible convergence to a solution is significantly harder to assess for the sampling approaches. Considering the computational costs, the Laplace approximation comes at roughly the cost of a normal neural network training, while variational inference is slowed down by regularization and additional parameters that are needed for representing the uncertainty. The sampling approaches are the most costly at training time, since the training is already based on sampling. Further, the Laplace approximation has the advantage that it can be applied to pre-trained networks without any changes needed. At inference time, all the presented approaches are relatively costly, since all are based on multiple forward passes in order to approximate the underlying probability distribution. An overview of the main differences of the three types can be found in Table 3.

Table 3 Overview of the properties of different types of Bayesian neural network approaches, as also discussed in the introduction to Sect. 3.2. The properties are stated relative to each other

While limiting our scope to these three categories, we also acknowledge several advances in related domains of BNN research. Some examples are (i) approximate inference techniques such as alpha divergence (Hernández-Lobato et al. 2016; Li and Gal 2017; Minka et al. 2005), expectation propagation (Minka 2001; Zhao et al. 2020), assumed density filtering (Hernández-Lobato and Adams 2015), etc., (ii) probabilistic programming to exploit modern Graphical Processing Units (GPUs) (Tran et al. 2016, 2017; Bingham et al. 2019; Cabañas et al. 2019), (iii) different types of priors (Ito et al. 2005; Sun et al. 2018), (iv) advancements in the theoretical understanding of BNNs (Depeweg et al. 2017; Khan et al. 2019; Farquhar et al. 2020), (v) uncertainty propagation techniques to speed up the marginalization procedures (Postels et al. 2019) and (vi) computations of aleatoric uncertainty (Gast and Roth 2018; Depeweg et al. 2018).

    3.2.1 Variational inference

The goal of variational inference is to infer the posterior probabilities \(p(\theta \vert x,y)\) using a pre-specified family of distributions \(q(\theta )\). Here, this so-called variational family \(q(\theta )\) is defined as a parametric distribution; an example is the multivariate normal distribution, whose parameters are the mean and the covariance matrix. The main idea of variational inference is to find the settings of these parameters that make \(q(\theta )\) close to the posterior of interest \(p(\theta \vert x,y)\). This measure of closeness between the probability distributions is given by the Kullback–Leibler (KL) divergence

    $$\begin{aligned} \text {KL}(q\Vert p) = {\mathbb {E}}_q\left[ \text {log} \frac{q(\theta )}{p(\theta \vert x,y)} \right] . \end{aligned}$$
    (20)

Since the KL divergence in (20) contains the intractable posterior \(p(\theta \vert x, y)\), it cannot be minimized directly. Instead, the evidence lower bound (ELBO), a function that is equal to the negative KL divergence up to a constant, is optimized. For a given prior distribution on the parameters \(p(\theta )\), the ELBO is given by

$$\begin{aligned} L = {\mathbb {E}}_q\left[ \log \frac{p(y\vert x,\theta )p(\theta )}{q(\theta )}\right] \end{aligned}$$
    (21)

    and for the KL divergence

    $$\begin{aligned} \text {KL}(q\Vert p) = -L + \log p(y\vert x) \end{aligned}$$
    (22)

    holds.

Variational methods for BNNs were pioneered by Hinton and Van Camp (1993), where the authors derived a diagonal Gaussian approximation to the posterior distribution of neural networks, couched in information theory as a minimum description length argument. Another notable extension in the 1990s was proposed by Barber and Bishop (1998), in which the full covariance matrix was chosen as the variational family and the authors demonstrated how the ELBO can be optimized for neural networks. Several modern approaches can be viewed as extensions of these early works (Hinton and Van Camp 1993; Barber and Bishop 1998), with a focus on how to scale variational inference to modern neural networks.

An evident direction within the current methods is the use of stochastic variational inference (or Monte Carlo variational inference), where the optimization of the ELBO is performed using mini-batches of data. One of the first connections to stochastic variational inference was proposed by Graves (2011) with Gaussian priors. In 2015, Blundell et al. (2015) introduced Bayes By Backprop, a further extension of stochastic variational inference (Graves 2011) to non-Gaussian priors, and demonstrated how the stochastic gradients can be made unbiased. Notably, Kingma et al. (2015) introduced the local reparameterization trick to reduce the variance of the stochastic gradients. One of the key concepts is to reformulate the loss function of the neural network as the ELBO. As a result, the intractable posterior distribution is indirectly optimized and variational inference becomes compatible with back-propagation, with certain modifications to the training procedure. These extensions widely focus on the fragility of stochastic variational inference that arises due to sensitivity to initialization, prior definition, and the variance of the gradients. These limitations have been addressed recently by Wu et al. (2018), where a hierarchical prior was used and the moments of the variational distribution are approximated deterministically.
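As a minimal sketch of this family of methods, the snippet below performs stochastic variational inference with a mean-field Gaussian variational family, a standard normal prior, and the reparameterization trick, in the spirit of Bayes By Backprop; the toy data, the single-layer model, the prior, and all hyperparameters are illustrative assumptions and not taken from the referenced works.

```python
# Minimal sketch of stochastic variational inference for a single linear layer:
# mean-field Gaussian variational family q(theta) = N(mu, diag(sigma^2)),
# standard normal prior, reparameterization trick, negative ELBO as the loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(256, 2)
y = (X[:, 0] + X[:, 1] > 0).long()              # toy binary classification data

D_in, D_out = 2, 2
mu = torch.zeros(D_out, D_in + 1, requires_grad=True)              # variational means
rho = torch.full((D_out, D_in + 1), -3.0, requires_grad=True)      # pre-softplus stddevs
opt = torch.optim.Adam([mu, rho], lr=0.05)

def forward(x, w):
    weights, bias = w[:, :-1], w[:, -1]
    return x @ weights.t() + bias

for step in range(500):
    sigma = F.softplus(rho)
    w = mu + sigma * torch.randn_like(mu)       # reparameterization trick
    nll = F.cross_entropy(forward(X, w), y, reduction="sum")
    # Closed-form KL between q = N(mu, sigma^2) and the prior p(theta) = N(0, 1).
    kl = 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * torch.log(sigma)).sum()
    loss = nll + kl                             # negative ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()

# Prediction by Monte Carlo averaging over samples from q(theta), cf. eq. (19).
x_star = torch.randn(5, 2)
with torch.no_grad():
    probs = torch.stack([
        F.softmax(forward(x_star, mu + F.softplus(rho) * torch.randn_like(mu)), dim=1)
        for _ in range(50)
    ]).mean(0)
print(probs)
```

In practice, the KL term is typically rescaled when mini-batches are used, and variational parameters are maintained for every layer of the network.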

The above works commonly assumed mean-field approximations as the variational family, neglecting the correlations between the parameters. In order to make more expressive variational distributions feasible for deep neural networks, several works proposed to perform inference using the matrix normal distribution (Louizos and Welling 2016; Zhang et al. 2018a; Sun et al. 2017) or more expressive variants (Bae et al. 2021). A related direction, called drop connect, applies the dropout idea to the network weights instead of the activations. This was found to be more robust in its uncertainty representation, even though it was shown that a combination of both can lead to higher accuracy and robustness in the test predictions (McClure and Kriegeskorte 2016). Lastly, connections of variational inference to Adam (Khan et al. 2018), RMS Prop (Khan et al. 2017), and batch normalization (Atanov et al. 2019) have been further suggested in the literature.

    3.2.2 Sampling methods

Sampling methods, also often called Monte Carlo methods, are another family of Bayesian inference algorithms that represent uncertainty without a parametric model. Specifically, sampling methods use a set of hypotheses (or samples) drawn from the distribution and offer the advantage that the representation itself is not restricted by the type of distribution (e.g. it can be multi-modal or non-Gaussian); hence, probability distributions are obtained non-parametrically. Popular algorithms within this domain are particle filtering, rejection sampling, importance sampling, and Markov Chain Monte Carlo sampling (MCMC) (Bishop and Nasrabadi 2006). In the case of neural networks, MCMC is often used, since alternatives such as rejection and importance sampling are known to be inefficient for such high-dimensional problems. The main idea of MCMC is to sample from arbitrary distributions by a transition in state space, where this transition is governed by a record of the current state and the proposal distribution that aims to estimate the target distribution (e.g. the true posterior). To explain this, let us start by defining a Markov Chain: a Markov Chain is a distribution over random variables \(x_1, \cdots , x_T\) which follows the state transition rule:

    $$\begin{aligned} p(x_1,\cdots , x_T) = p(x_1)\prod _{t=2}^{T}p(x_t\vert x_{t-1}), \end{aligned}$$
    (23)

i.e. the next state only depends on the current state and not on any other former state. In order to draw samples from the true posterior, MCMC sampling methods first generate samples in an iterative, Markov Chain fashion. Then, at each iteration, the algorithm decides to either accept or reject the samples, where the probability of acceptance is determined by certain rules. In this way, as more and more samples are produced, their values can approximate the desired distribution.

Hamiltonian Monte Carlo or Hybrid Monte Carlo (HMC) (Duane et al. 1987) is an important variant of the MCMC sampling methods (pioneered for neural networks by Neal (1992, 1994, 1995); Neal et al. (2011)) and is often considered the gold standard of Bayesian inference (Neal et al. 2011; Dubey et al. 2016; Li and Gal 2017). The algorithm works as follows: (i) start by initializing a set of parameters \(\theta\) (either randomly or in a user-specific manner). Then, for a given number of total iterations, (ii) instead of a random walk, a momentum vector, i.e. an auxiliary variable \(\rho\), is sampled, and the current value of the parameters \(\theta\) is updated via the Hamiltonian dynamics:

$$\begin{aligned} H(\rho , \theta ) = -\log p(\rho , \theta ) = -\log p(\rho \vert \theta ) - \log p(\theta ). \end{aligned}$$
    (24)

Defining the potential energy \(V(\theta ) = -\log p(\theta )\) and the kinetic energy \(T(\rho \vert \theta ) = -\log p(\rho \vert \theta )\), the update steps via Hamilton’s equations are governed by

    $$\begin{aligned} \frac{d\theta }{dt}&= \frac{\partial H}{\partial \rho } = \frac{\partial T}{\partial \rho } \ \text {and} \end{aligned}$$
    (25)
    $$\begin{aligned} \frac{d\rho }{dt}&= - \frac{\partial H}{\partial \theta } = -\frac{\partial T}{\partial \theta } - \frac{\partial V}{\partial \theta }. \end{aligned}$$
    (26)

The so-called leapfrog integrator is used as a solver (Leimkuhler and Reich 2004). (iii) For each step, a Metropolis acceptance criterion is applied to either reject or accept the samples (similar to MCMC). Unfortunately, HMC requires the processing of the entire data set per iteration, which is computationally too expensive when the data set grows to millions or even billions of samples. Hence, many modern algorithms focus on how to perform the computations stochastically, in a mini-batch fashion. In this context, Welling and Teh (2011) proposed for the first time to combine Stochastic Gradient Descent (SGD) with Langevin dynamics [a form of MCMC (Rossky et al. 1978; Roberts and Stramer 2002; Neal et al. 2011)] in order to obtain a scalable approximation to MCMC algorithms based on mini-batch SGD (Kushner and Yin 2003; Goodfellow et al. 2016). The work demonstrated that performing Bayesian inference on deep neural networks can be as simple as running a noisy SGD. This method does not include the momentum term of HMC, as it uses first-order Langevin dynamics, and it opened up a new research area on Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC).
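The resulting stochastic gradient Langevin dynamics (SGLD) update, a mini-batch gradient step on the log posterior plus Gaussian noise whose variance matches the step size, can be sketched for a toy Bayesian logistic regression; the data, the prior precision, and the step-size schedule below are illustrative choices, not prescribed by Welling and Teh (2011).

```python
# Illustrative SGLD-style update on a toy Bayesian logistic regression:
#   theta <- theta + (eps/2) * (grad log p(theta) + (N/n) * sum_i grad log p(y_i|x_i,theta))
#            + Normal(0, eps)
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 2))
w_true = np.array([1.5, -2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def grad_log_lik(theta, xb, yb):                # gradient of the Bernoulli log-likelihood
    p = 1.0 / (1.0 + np.exp(-xb @ theta))
    return xb.T @ (yb - p)

theta = np.zeros(2)
n, tau = 32, 1.0                                # mini-batch size, Gaussian prior precision
samples = []
for t in range(5000):
    eps = 1e-3 / (1 + t) ** 0.55                # decreasing step-size schedule
    idx = rng.integers(0, N, size=n)            # mini-batch
    grad = -tau * theta + (N / n) * grad_log_lik(theta, X[idx], y[idx])
    theta = theta + 0.5 * eps * grad + rng.normal(scale=np.sqrt(eps), size=2)
    if t > 2500:                                # discard burn-in
        samples.append(theta.copy())

print(np.mean(samples, axis=0))                 # posterior mean estimate over the retained samples
```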

Consequently, several extensions are available, which include the use of second-order information such as preconditioning and optimizing with the Fisher Information Matrix (FIM) (Ma et al. 2015; Marceau-Caron and Ollivier 2017; Nado et al. 2018), the Hessian (Simsekli et al. 2016; Zhang and Sutton 2011; Fu et al. 2016), adapting a preconditioning diagonal matrix (Li et al. 2016a), generating samples from non-isotropic target densities using Fisher scoring (Ahn et al. 2012), and samplers in the Riemannian manifold (Patterson and Teh 2013) using first-order Langevin dynamics and Levy diffusion noise and momentum (Ye and Zhu 2018). Within these methods, so-called parameter-dependent diffusion matrices are incorporated with the intention to offset the stochastic perturbation of the gradient. To do so, “thermostat” ideas (Ding et al. 2014; Shang et al. 2015; Leimkuhler and Shang 2016) are proposed so that a prescribed constant temperature distribution is maintained with the parameter-dependent noise. Ahn et al. (2014) devised a distributed computing system for SG-MCMC to exploit modern computing routines, while Wang et al. (2018b) showed that Generative Adversarial Networks (GANs) can be used to distill the samples for improved memory efficiency, instead of distillation for enhancing the run-time capabilities of computing predictive uncertainty (Balan et al. 2015). Lastly, other recent trends are techniques that reduce the variance (Dubey et al. 2016; Zou et al. 2018) and bias (Durmus et al. 2016; Durmus and Moulines 2019) arising from stochastic gradients.

Concurrently, there have been solid advances in the theory of SG-MCMC methods and their applications in practice. Sato and Nakagawa (2014) showed for the first time that the SGLD algorithm with constant step size weakly converges; Chen et al. (2015) showed that faster convergence rates and more accurate invariant measures can be obtained for SG-MCMC with higher-order integrators rather than a first-order Euler integrator, while Teh et al. (2016) studied the consistency and fluctuation properties of SGLD. As a result, verifiable conditions obeying a central limit theorem for which the algorithm is consistent, and how its asymptotic bias-variance decomposition depends on step-size sequences, have been discovered. A more detailed review of SG-MCMC with a focus on supporting theoretical results can be found in Nemeth and Fearnhead (2021). In practice, SG-MCMC techniques have been applied to shape classification and uncertainty quantification (Li et al. 2016b), to empirically study and validate the effects of tempered posteriors (also called cold posteriors) (Wenzel et al. 2020), and to train deep neural networks so that they generalize and avoid over-fitting (Ye et al. 2017; Chandra et al. 2019).

    3.2.3 Laplace approximation

The goal of the Laplace Approximation is to estimate the posterior distribution over the parameters of a neural network \(p(\theta \mid x,y)\) around a local mode of the loss surface with a Multivariate Normal distribution. The Laplace Approximation to the posterior can be obtained by taking the second-order Taylor series expansion of the log posterior over the weights around the MAP estimate \(\hat{\theta }\) given some data (x, y). If we assume a Gaussian prior with a scalar precision value \(\tau >0\), which corresponds to the commonly used \(L_2\)-regularization, the Taylor series expansion results in

$$\begin{aligned} \log p(\theta \mid x,y)&\approx \log p({\hat{\theta }} \mid x,y) \\&\quad - \frac{1}{2}(\theta -{\hat{\theta }})^T(H + \tau I)(\theta -{\hat{\theta }}), \end{aligned}$$

where the first-order term vanishes because the gradient of the log posterior \(\nabla \log p(\theta \mid x,y)\) is zero at the maximum \({\hat{\theta }}\), and H denotes the Hessian of the negative log-likelihood \(-\log p(y \mid x,\theta )\) evaluated at \({\hat{\theta }}\). Taking the exponential on both sides and normalizing the resulting density, the weight posterior is approximately a Gaussian with mean \({\hat{\theta }}\) and covariance matrix \((H+\tau I)^{-1}\). This means that the model uncertainty is represented by the Hessian H, resulting in a Multivariate Normal distribution:

$$\begin{aligned} \theta \mid x,y \sim {\mathcal {N}}\left( {\hat{\theta }},\, (H+\tau I)^{-1}\right) . \end{aligned}$$
    (27)

In contrast to the two other methods described, the Laplace approximation can be applied to already trained networks and is generally applicable when using standard loss functions such as MSE or cross-entropy and piece-wise linear activations (e.g. ReLU). MacKay (1992c) and Denker and LeCun (1991) pioneered the Laplace approximation for neural networks in the early 1990s, and several modern methods provide extensions to deep neural networks (Botev et al. 2017; Martens and Grosse 2015; Ritter et al. 2018; Lee et al. 2020).

The core of the Laplace Approximation is the estimation of the Hessian. Unfortunately, due to the enormous number of parameters in modern neural networks, the Hessian matrices cannot be computed in a feasible way, in contrast to the relatively small networks in MacKay (1992c) and Denker and LeCun (1991). Consequently, several different ways for approximating H have been proposed in the literature. A brief review is as follows. Instead of diagonal approximations [e.g. Becker and LeCun (1989), Salimans and Kingma (2016)], several researchers have focused on including the off-diagonal elements [e.g. Liu and Nocedal (1989), Hennig (2013) and Le Roux and Fitzgibbon (2010)]. Amongst them, the layer-wise Kronecker factor approximations of Grosse and Martens (2016), Martens and Grosse (2015) and Botev et al. (2017) stand out for their scalability, building on the long line of earlier work on Bayesian neural networks (Chen et al. 1993; Barber and Bishop 1998; Neal 1992; Denker and LeCun 1991; MacKay 1992c). There are also emerging challenges on new frontiers beyond accurate inference techniques. Some examples are: (i) how to specify meaningful priors (Ito et al. 2005; Sun et al. 2018), (ii) how to efficiently marginalize over the parameters for fast predictive uncertainty (Balan et al. 2015; Postels et al. 2019; Hobbhahn et al. 2022; Lee et al. 2022), (iii) infrastructure such as new benchmarks, evaluation protocols and software tools (Mukhoti et al. 2018; Tran et al. 2017; Bingham et al. 2019; Filos et al. 2019), and (iv) better understanding of the current methodologies and their potential applications (Farquhar et al. 2020; Wenzel et al. 2020; Mukhoti and Gal 2018; Feng et al. 2019).
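As a toy illustration of the basic recipe (our own sketch under the assumption of a logistic regression model, where the Hessian is available in closed form), the following example computes the MAP estimate, forms the Laplace approximation \(\mathcal{N}(\hat{\theta }, (H+\tau I)^{-1})\), and samples weights to obtain predictive uncertainty. For deep networks, H would have to be replaced by one of the approximations discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data
N, D = 500, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

tau = 1.0                                   # prior precision (L2 strength)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# 1) Find the MAP estimate with plain gradient descent
w = np.zeros(D)
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) + tau * w          # gradient of the negative log posterior
    w -= 0.1 * grad / N

# 2) Laplace approximation: Hessian of the negative log-likelihood at the MAP
p = sigmoid(X @ w)                          # recompute at the MAP estimate
S = p * (1 - p)                             # per-sample Bernoulli variances
H = (X * S[:, None]).T @ X                  # H = X^T diag(S) X
cov = np.linalg.inv(H + tau * np.eye(D))    # posterior covariance (H + tau I)^(-1)

# 3) Predictive uncertainty by sampling weights from N(w_MAP, cov)
x_star = np.array([1.0, 1.0])
w_samples = rng.multivariate_normal(w, cov, size=2000)
p_samples = sigmoid(w_samples @ x_star)
print("mean prediction:", p_samples.mean(), "std:", p_samples.std())
```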

    3.3 Ensemble methods

    3.3.1 Principles of ensemble methods

Ensembles derive a prediction based on the predictions received from multiple so-called ensemble members. They target a better generalization by making use of synergy effects among the different models, arguing that a group of decision-makers tends to make better decisions than a single decision-maker (Sagi and Rokach 2018; Hansen and Salamon 1990). For an ensemble \(f:X \rightarrow Y\) with members \(f_i:X \rightarrow Y\) for \(i \in \{1,2,\ldots ,M\}\), this can, for example, be implemented by simply averaging over the members’ predictions,

    $$\begin{aligned} f(x):=\frac{1}{M} \sum _{i=1}^M f_i(x). \end{aligned}$$
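A minimal sketch of this averaging step is given below (our own illustration; the hypothetical helper ensemble_predict treats the members as arbitrary callables and additionally returns the member disagreement as a simple uncertainty indicator):

```python
import numpy as np

def ensemble_predict(members, x):
    """Average the outputs of M independently trained ensemble members.

    `members` is a list of callables mapping an input to per-class
    probabilities (or regression outputs).
    """
    preds = np.stack([f(x) for f in members])      # shape (M, ...)
    return preds.mean(axis=0), preds.std(axis=0)   # prediction and disagreement

# Toy usage with three dummy "members" that disagree slightly
members = [lambda x, b=b: 1 / (1 + np.exp(-(x + b))) for b in (-0.2, 0.0, 0.2)]
mean, spread = ensemble_predict(members, np.array([0.5, -1.0]))
print(mean, spread)
```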

Based on this intuitive idea, numerous works applying ensemble methods to practical tasks can be found in the literature, for example in bio-informatics (Cao et al. 2020; Nanni et al. 2020; Wei et al. 2017), remote sensing (Lv et al. 2017; Dai et al. 2019; Marushko and Doudkin 2020), and reinforcement learning (Kurutach et al. 2018; Rajeswaran et al. 2017). Besides the improvement in accuracy, ensembles give an intuitive way of representing the model uncertainty on a prediction by evaluating the variety among the members’ predictions.

Compared to Bayesian and single deterministic network approaches, ensemble methods have two major differences. First, the general idea behind ensembles is relatively simple, and the different types of ensemble methods do not differ fundamentally across application fields. Hence, this section focuses on different strategies to train an ensemble and on some variations that make ensemble methods more efficient. Second, ensemble methods were originally not introduced to explicitly handle and quantify uncertainties in neural networks. Although deriving uncertainty from ensemble predictions is straightforward, ensembles were first introduced and discussed with the goal of improving predictive accuracy, since they actually aim at reducing the model uncertainty (Hansen and Salamon 1990). Therefore, many works on ensemble methods do not explicitly take the uncertainty into account. Notwithstanding this, ensembles have been found to be well suited for uncertainty estimation in neural networks (Lakshminarayanan et al. 2017).

    3.3.2 Single- and multi-mode evaluation

    One main point where ensemble methods differ from the other methods presented in this paper is the number of local optima that are considered, i.e. the differentiation into single-mode and multi-mode evaluation.

In order to create synergies and average out false predictions of single members, the members of an ensemble have to behave differently in case of an uncertain outcome. The mapping defined by a neural network is highly non-linear and hence the optimized loss function contains many local optima to which a training algorithm could converge. Deterministic neural networks converge to one single local optimum in the solution space (Fort et al. 2019). Other approaches, e.g. BNNs, still converge to one single optimum, but additionally take the uncertainty on this local optimum into account (Fort et al. 2019). This means that neighbouring points within a certain region around the solution also affect the loss and also influence the prediction of a test sample. Since these methods focus on single regions, the evaluation is called single-mode evaluation. In contrast to this, ensemble methods consist of several networks, which should converge to different local optima. This leads to a so-called multi-mode evaluation (Fort et al. 2019).

    Fig. 7

    A visualization of the different evaluation behaviours of deterministic neural networks, Bayesian neural networks, and the ensemble of deterministic neural networks. The x-axis indicates the network parameters \(\theta\) and the y-axis represents the loss value. While the deterministic network learns the parameters based on a pointwise estimation, the Bayesian neural network also takes the surrounding of the single point into account. The ensemble of deterministic methods optimizes pointwise but learns several different parameter settings

    In Fig. 7, the considered parameters of a single-mode deterministic, single-mode probabilistic (Bayesian), and multi-mode ensemble approach are visualized. The goal of multi-mode evaluation is that different local optima could lead to models with different strengths and weaknesses in the predictions such that a combination of several such models brings synergy effects improving the overall performance.

    3.3.3 Bringing variety into ensembles

    One of the most crucial points when applying ensemble methods is to maximize the variety in the behaviour among the single networks (Renda et al. 2019; Lakshminarayanan et al. 2017). In order to increase the variety, several different approaches can be applied:

    • Random initialization and data shuffle

      Due to the very non-linear loss landscape, different initializations of a neural network lead in general to different training results. Since the training is realized on mini-batches, the order of the training data points also affects the final result.

    • Bagging and boosting

Bagging (bootstrap aggregating) and boosting are two strategies that vary the distribution of the used training data sets by sampling new sets of training samples from the original set; a minimal bootstrap-sampling sketch is given after this list. Bagging samples from the training data uniformly and with replacement (Bishop and Nasrabadi 2006). Due to the sampling with replacement, an ensemble member can see individual samples several times in its training set while missing some other training samples entirely. For boosting, the members are trained one after another, and the probability of including a sample in the next training set is based on the performance of the already trained ensemble (Bishop and Nasrabadi 2006).

    • Data augmentation

      Augmenting the input data randomly for each ensemble member leads to models trained on different data points and therefore in general to a larger variety among the different members.

    • Ensemble of different network architecture

      The combination of different network architectures leads to different loss landscapes and can therefore also increase the diversity in the resulting predictions (Herron et al. 2020).
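The bootstrap sampling underlying bagging, referred to in the list above, can be sketched as follows (our own illustration; bagging_indices and train_member are hypothetical placeholders for the data sampling and an arbitrary training routine):

```python
import numpy as np

rng = np.random.default_rng(42)

def bagging_indices(n_samples, n_members):
    """Draw one bootstrap sample (uniform, with replacement) per member."""
    return [rng.choice(n_samples, size=n_samples, replace=True)
            for _ in range(n_members)]

# Toy usage: 5 members, 8 training samples
X = np.arange(8).reshape(-1, 1).astype(float)
y = (X[:, 0] > 3).astype(float)

def train_member(X_boot, y_boot):
    # Placeholder "model": simply memorizes the class ratio of its bootstrap set
    ratio = y_boot.mean()
    return lambda x: np.full(len(x), ratio)

members = [train_member(X[idx], y[idx]) for idx in bagging_indices(len(X), 5)]
print([m(X)[0] for m in members])   # members differ due to the resampling
```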

In several works, it has been shown that the variety induced by random initialization is sufficient and that bagging can even lead to weaker performance (Lee et al. 2015; Lakshminarayanan et al. 2017). Livieris et al. (2021) evaluated different bagging and boosting strategies for ensembles of weight-constrained neural networks. Interestingly, they found that bagging performs better for a small number of ensemble members while boosting performs better for a large number. Nanni et al. (2019) evaluated and compared ensembles based on different types of image augmentation for bioimage classification tasks. Guo and Gould (2015) used augmentation methods within an ensemble approach for object detection. Both works stated that the ensemble approach using augmentations improves the resulting accuracy. In contrast to this, Rahaman et al. (2021) and Wen et al. (2021b) stated with respect to uncertainty quantification that image augmentation can harm the calibration of an ensemble and that post-processing calibration methods have to be slightly adapted when using ensemble methods. Other ways of inducing variety for specific tasks have also been introduced. For instance, in Kim et al. (2018), the members are trained with different attention masks in order to focus on different parts of the input data. Other approaches focused on the training process and introduced learning rate schedulers that are designed to discover several local optima within one training process (Huang et al. 2017; Yang and Wang 2020). Subsequently, an ensemble can be built from the local optima found within a single training run. It is important to note that, if not explicitly stated otherwise, the works and approaches presented so far target improvements in predictive accuracy and do not explicitly consider uncertainty quantification.

    3.3.4 Ensemble methods and uncertainty quantification

Besides the improvement in accuracy, ensembles are widely used for modelling uncertainty on predictions of complex models, such as in climate prediction (Leutbecher and Palmer 2008; Parker 2013). Accordingly, ensembles are also used for quantifying the uncertainty on a deep neural network’s prediction, and over the last years they have become more and more popular for such tasks (Lakshminarayanan et al. 2017; Renda et al. 2019). Lakshminarayanan et al. (2017) are often referenced as a base work on uncertainty estimation derived from ensembles of neural networks and as a reference for the competitiveness of deep ensembles. They introduced an ensemble training pipeline to quantify predictive uncertainty within DNNs. In order to handle data and model uncertainty, the member networks are designed with two heads, representing the prediction and a predicted value of data uncertainty on the prediction. The approach is evaluated with respect to accuracy, calibration, and out-of-distribution detection for classification and regression tasks. In all tests, the method performs at least as well as the BNN approaches used for comparison, namely Monte Carlo dropout and probabilistic backpropagation. Lakshminarayanan et al. (2017) also showed that shuffling the training data and randomly initializing the training process induces sufficient variety in the models in order to predict the uncertainty for the given architectures and data sets. Furthermore, bagging was even found to worsen the predictive uncertainty estimation, extending the findings of Lee et al. (2015), who found bagging to worsen the predictive accuracy of ensemble methods on the investigated tasks. Gustafsson et al. (2020) introduced a framework for the comparison of uncertainty quantification methods with a specific focus on real life applications. Based on this framework, they compared ensembles and Monte Carlo dropout and found ensembles to be more reliable and applicable to real life applications. These findings endorse the results reported by Beluch et al. (2018), who found ensemble methods to deliver more accurate and better calibrated predictions on active learning tasks than Monte Carlo dropout. Ovadia et al. (2019) evaluated different uncertainty quantification methods based on test sets affected by distribution shifts. The extensive evaluation covers a variety of model types and data modalities. As a takeaway, the authors stated that already for a relatively small ensemble size of five, deep ensembles seem to perform best and are more robust to data set shifts than the compared methods. Vyas et al. (2018) presented an ensemble method for the improved detection of out-of-distribution samples. For each member, a subset of the training data is considered as out-of-distribution. For the training process, a loss is introduced that seeks a minimum margin greater than zero between the average entropy of the in-domain and the out-of-distribution subsets, which leads to a significant improvement in out-of-distribution detection.
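For regression tasks, Lakshminarayanan et al. (2017) combine the members’ two-headed outputs by treating the ensemble as a uniform mixture of Gaussians. The following sketch (our own NumPy illustration of that aggregation rule; combine_gaussian_members is a hypothetical helper) turns per-member means and variances into a single predictive mean and variance:

```python
import numpy as np

def combine_gaussian_members(means, variances):
    """Combine M Gaussian member predictions (mu_i, sigma_i^2) into the
    mean and variance of the uniform mixture over the members."""
    means, variances = np.asarray(means), np.asarray(variances)
    mixture_mean = means.mean(axis=0)
    # Law of total variance: E[sigma^2 + mu^2] - (E[mu])^2
    mixture_var = (variances + means ** 2).mean(axis=0) - mixture_mean ** 2
    return mixture_mean, mixture_var

# Toy usage: three members predicting one test point each
mu, var = combine_gaussian_members(means=[1.0, 1.2, 0.9],
                                   variances=[0.05, 0.04, 0.06])
print(mu, var)
```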

    3.3.5 Making ensemble methods more efficient

Compared to single model methods, ensemble methods come along with a significantly increased computational effort and memory consumption (Sagi and Rokach 2018; Malinin et al. 2020). When deploying an ensemble for a real life application, the available memory and computational power are often limited. Such limitations can easily become a bottleneck (Kocić et al. 2019) and may become critical for applications with limited reaction time. Reducing the number of models leads to lower memory and computational power consumption. Pruning approaches reduce the complexity of ensembles by pruning members and reducing the redundancy among them. For that, several approaches based on different diversity measures have been developed to remove single members without strongly affecting the performance (Guo et al. 2018; Cavalcanti et al. 2016; Martínez-Muñoz et al. 2008).

Distillation is another approach where the number of networks is reduced to one single model. It is the procedure of teaching a single network to represent the knowledge of a group of neural networks (Buciluǎ et al. 2006). The first works on the distillation of neural networks were motivated by restrictions when deploying large-scale classification problems (Buciluǎ et al. 2006). The original classification problem is separated into several sub-problems focusing on single blocks of classes that are difficult to differentiate. Several smaller trainer networks are trained on the sub-problems and then teach one student network to separate all classes at the same time. In contrast to this, ensemble distillation approaches capture the behaviour of an ensemble with one single network. The first works on ensemble distillation used the average of the softmax outputs of the ensemble members in order to teach a student network the derived predictive uncertainty (Hinton et al. 2015). Englesson and Azizpour (2019) justify the resulting predictive distributions of this approach and additionally cover the handling of out-of-distribution samples. When averaging over the members’ outputs, the model uncertainty, which is represented in the variety of the ensemble outputs, gets lost. To overcome this drawback, researchers applied the idea of learning higher order distributions, i.e. distributions over a distribution, instead of directly predicting the output (Lindqvist et al. 2020; Malinin et al. 2020). The members are then distilled based on the divergence from the average distribution. The idea is closely related to the prior networks (Malinin and Gales 2018) and the evidential neural networks (Sensoy et al. 2018), which are described in Sect. 3.1. Malinin et al. (2020) modelled ensemble members and the distilled network as prior networks predicting the parameters of a Dirichlet distribution. The distillation then seeks to minimize the KL divergence between the averaged Dirichlet distributions of the ensemble members and the output of the distilled network. Lindqvist et al. (2020) generalized this idea to any other parameterizable distribution. With that, the method is also applicable to regression problems, for example by predicting a mean and a standard deviation to describe a normal distribution. Within several tests, the distilled models generated by these approaches are able to distinguish between data uncertainty and model uncertainty. Although distillation methods cannot completely capture the behaviour of the underlying ensemble, it has been shown that they are capable of delivering good and, in some experiments, even comparable results (Lindqvist et al. 2020; Malinin et al. 2020; Reich et al. 2020).

Other approaches, such as sub-ensembles (Valdenegro-Toro 2019) and batch ensembles (Wen et al. 2019), seek to reduce the computational effort and memory consumption by sharing parts among the single members. It is important to note that the possibility of using different model architectures for the ensemble members may be lost when parts of the ensemble are shared. Also, the training of the models cannot be run in a completely independent manner. Therefore, the actual time needed for training does not necessarily decrease in the same way as the computational effort does.

Sub-ensembles (Valdenegro-Toro 2019) divide a neural network architecture into two sub-networks. The trunk network is responsible for the extraction of general information from the input data, and the task network uses this information to fulfill the actual task. In order to train a sub-ensemble, first, the weights of each member’s trunk network are fixed based on the resulting parameters of a single model’s training process. Subsequently, the parameters of each ensemble member’s task network are trained independently from the other members. As a result, the members are built with a common trunk and an individual task sub-network. Since the training and the evaluation of the trunk network have to be done only once, the number of computations needed for training and testing decreases to a fraction of \(\frac{M \cdot N_{\text {task}} + N_{\text {trunk}}}{M\cdot N}\), where \(N_{\text {task}}\), \(N_{\text {trunk}}\), and N stand for the number of variables in the task networks, the trunk network, and the complete network, respectively. Valdenegro-Toro (2019) further motivated the usage of a shared trunk network by arguing that the trunk network is in general computationally more costly than the task network. In contrast to this, batch ensembles (Wen et al. 2019) connect the member networks with each other at every layer. The ensemble members’ weights are described as the Hadamard product of one shared weight matrix \(W \in \mathbb {R}^{n\times m}\) and M individual rank-one matrices \(F_i \in \mathbb {R}^{n \times m}\), each linked with one of the M ensemble members. Each rank-one matrix can be written as an outer product \(F_i=r_is_i^\text {T}\) of two vectors \(r_i\in \mathbb {R}^{n}\) and \(s_i\in \mathbb {R}^{m}\), and hence the matrix \(F_i\) can be described by \(n+m\) parameters. With this approach, each additional ensemble member increases the number of parameters only by the factor \(\frac{n+m}{M\cdot (n+m)+n\cdot m} + 1\) instead of \(\frac{M+1}{M}=1 + \frac{1}{M}\). On the one hand, with this approach, the members are not independent anymore such that all members have to be trained in parallel. On the other hand, the authors also showed that the parallelization can be realized similarly to the optimization on mini-batches and can be carried out on a single unit.
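The weight construction of batch ensembles can be sketched as follows (our own NumPy illustration of the Hadamard-product idea described above, not the authors' implementation; member_forward is a hypothetical helper for a single linear layer):

```python
import numpy as np

rng = np.random.default_rng(3)

n, m, M = 4, 3, 2                      # input dim, output dim, ensemble size
W_shared = rng.normal(size=(n, m))     # one weight matrix shared by all members
r = rng.normal(size=(M, n))            # per-member vectors r_i (n parameters each)
s = rng.normal(size=(M, m))            # per-member vectors s_i (m parameters each)

def member_forward(x, i):
    """Forward pass of ensemble member i for one linear layer (no bias/activation)."""
    F_i = np.outer(r[i], s[i])         # rank-one matrix F_i = r_i s_i^T
    W_i = W_shared * F_i               # Hadamard product gives the member weights
    return x @ W_i

x = rng.normal(size=(1, n))
preds = np.stack([member_forward(x, i) for i in range(M)])
print(preds.shape)                     # (M, 1, m): one output per member
```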

    3.3.6 Sum up ensemble methods

Ensemble methods are very easy to apply since no complex implementation or major modification of the standard deterministic model has to be realized. Furthermore, ensemble members are trained independently from each other, which makes the training easily parallelizable, and trained ensembles can be extended easily. The main challenge when working with ensemble methods is the need to introduce diversity among the ensemble members. For accuracy, uncertainty quantification, and out-of-distribution detection, random initialization, data shuffling, and augmentations have been found to be sufficient for many applications and tasks (Lakshminarayanan et al. 2017; Nanni et al. 2019). Since these techniques may be applied anyway, they do not require much additional effort. However, the independence of the single ensemble members leads to a linear increase in the required memory and computational power with each additional member, for training as well as for testing. This limits the deployment of ensemble methods in many practical applications where the computational power or memory is limited, the application is time-critical, or very large networks with high inference time are involved (Malinin et al. 2020).

Many aspects of ensemble approaches are only investigated with respect to predictive accuracy and do not take predictive uncertainty into account. This also holds for the comparison of different training strategies for a broad range of problems and data sets. Especially since the overconfidence of single members can be transferred to the whole ensemble, strategies that encourage the members to deliver different false predictions instead of all delivering the same false prediction should be investigated further. For a better understanding of ensemble behaviour, further evaluations of the loss landscape, as done by Fort et al. (2019), could offer interesting insights.

    3.4 Test-time augmentation

Inspired by ensemble methods and adversarial examples (Ayhan and Berens 2018), test-time augmentation is one of the simpler predictive uncertainty estimation techniques. The basic idea is to create multiple variants of each test sample by applying data augmentation techniques and to feed all of them to the network in order to compute a predictive distribution from which uncertainty can be measured. The rationale behind this method is that the augmented test samples allow the exploration of different views of the sample and are therefore capable of capturing its uncertainty. In general, test-time augmentation can use the same augmentation techniques that are used for regularization during training and has been shown to improve calibration on in-distribution data as well as out-of-distribution data detection (Ashukha et al. 2019; Lyzhov et al. 2020). Mostly, this technique of test-time augmentations has been used in medical image processing (Wang et al. 2018a, 2019; Ayhan and Berens 2018; Moshkov et al. 2020). One of the reasons for this is that the field of medical image processing already makes heavy use of data augmentations when using deep learning (Ronneberger et al. 2015), so it is quite easy to apply those same augmentations during test time to calculate the uncertainties. Another reason is that collecting medical images is costly, thus forcing practitioners to rely on data augmentation techniques. Moshkov et al. (2020) used the test-time augmentation technique for cell segmentation tasks. For that, they created multiple variations of the test data before feeding it to a trained UNet or Mask R-CNN architecture. Following this, they used majority voting to create the final output segmentation mask and discussed the policies of applying different augmentation techniques and how they affect the final predictive results of the deep networks.
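A minimal sketch of the basic procedure (our own illustration; model, the augmentations, and tta_predict are hypothetical placeholders) collects predictions over several augmented copies of a test sample and uses their spread as an uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(7)

def tta_predict(model, x, augmentations, n_repeats=1):
    """Collect predictions for several augmented copies of one test sample."""
    preds = [model(aug(x)) for aug in augmentations for _ in range(n_repeats)]
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)   # prediction and uncertainty

# Toy usage: a dummy "model" on 1-D inputs and two simple augmentations
model = lambda x: 1 / (1 + np.exp(-x.sum()))       # placeholder network
augmentations = [lambda x: x,                                          # identity
                 lambda x: x + rng.normal(scale=0.1, size=x.shape)]    # jitter
mean, std = tta_predict(model, np.ones(4), augmentations, n_repeats=5)
print(mean, std)
```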

Overall, test-time augmentation is an easy method for estimating uncertainties because it keeps the underlying model unchanged, requires no additional data, and is simple to put into practice with off-the-shelf libraries. Nonetheless, it needs to be kept in mind that when applying this technique, one should only apply valid augmentations to the data, meaning that the augmentations should not generate data from outside the target distribution. According to Shanmugam et al. (2020), test-time augmentation can change many correct predictions into incorrect predictions (and vice versa) due to many factors such as the nature of the problem at hand, the size of the training data, the deep neural network architecture, and the type of augmentation. To limit the impact of these factors, Shanmugam et al. (2020) proposed a learning-based method for test-time augmentation that takes these factors into consideration. In particular, the proposed method learns a function that aggregates the predictions from each augmentation of a test sample. Similar to Shanmugam et al. (2020), Lyzhov et al. (2020) proposed a method, named “greedy policy search”, for constructing a test-time augmentation policy by choosing augmentations to be included in a fixed-length policy. Similarly, Kim et al. (2020) proposed a method for learning a loss predictor from the training data for instance-aware test-time augmentation selection. The predictor selects test-time augmentations with the lowest predicted loss for a given sample.

Although learnable test-time augmentation techniques (Shanmugam et al. 2020; Lyzhov et al. 2020; Kim et al. 2020) help to select valid augmentations, one of the major open questions is to find out the effect of different kinds of augmentations on the uncertainty. It can, for example, happen that a simple augmentation such as reflection is not able to capture much of the uncertainty, while some domain-specialized stretching and shearing captures more uncertainty. It is also important to find out how many augmentations are needed to correctly quantify uncertainties in a given task. This is particularly important in applications like earth observation, where inference might be needed on a global scale with limited resources.

    3.5 Neural network uncertainty quantification approaches for real life applications

    In order to use the presented methods on real life tasks, several different considerations have to be taken into account. The memory and computational power are often restricted while many real world tasks may be time-critical (Kocić et al. 2019). An overview of the main properties is given in Table 1.

The presented approaches all come along with advantages and disadvantages, depending on the properties a user is interested in. While ensemble methods and test-time augmentation methods are relatively easy to apply, Bayesian approaches deliver a clear description of the uncertainty on the model parameters and also provide a deeper theoretical basis. The computational effort and memory consumption are common restrictions in real life applications, where single deterministic network approaches perform best, but the distillation of ensembles or efficient Bayesian methods can also be taken into consideration. Within the different types of Bayesian approaches, the performance, the computational effort, and the implementation effort still vary strongly. Laplace approximations are relatively easy to apply and, compared to sampling approaches, require much less computational effort. Furthermore, pretrained networks often already exist for an application. In this case, the Laplace approximation and external deterministic single network approaches can in general be applied to already trained networks.

Another important aspect that has to be taken into account for uncertainty quantification in real life applications is the source and type of uncertainty. For real life applications, out-of-distribution detection forms perhaps the most important challenge in order to avoid unexpected decisions of the network and to be aware of adversarial attacks. Especially since many motivations for uncertainty quantification are based on risk minimization, methods that deliver risk-averse predictions are an important field to evaluate. Many works have already demonstrated the capability of detecting out-of-distribution samples on several tasks and have built a strong fundamental tool set for the deployment in real life applications (Yu and Aizawa 2019; Vyas et al. 2018; Ren et al. 2019; Gustafsson et al. 2020). However, in real life, the tasks are much more difficult than detecting out-of-distribution samples among benchmark data sets (e.g., MNIST or CIFAR), and the main challenge lies in comparing such approaches against each other on several real-world data sets. The work of Gustafsson et al. (2020) forms a first important step towards an evaluation of methods that better suits the demands in real life applications. Interestingly, they show that in their tests ensembles outperform the considered Bayesian approaches. This indicates that the multi-mode evaluation given by ensembles is a powerful property for real life applications. Nevertheless, Bayesian approaches have delivered strong results as well and furthermore come along with a strong theoretical foundation (Lee et al. 2020; Hobbhahn et al. 2022; Eggenreich et al. 2020; Gal et al. 2017b). A promising direction is the combination of efficient ensemble strategies and Bayesian approaches, which could capture the variability in the model parameters while still considering several modes for a prediction. Also, single deterministic approaches such as prior networks (Malinin and Gales 2018; Nandy et al. 2020; Sensoy et al. 2018; Zhao et al. 2019) deliver comparable results while consuming significantly less computational power. However, this efficiency often comes along with the problem that separate sets of in- and out-of-distribution samples have to be available for the training process (Zhao et al. 2019; Nandy et al. 2020). In general, the development of new problem and loss formulations, as given in Nandy et al. (2020), leads to a better understanding and description of the underlying problem and forms an important field of research.

    4 Uncertainty measures and quality

    In Sect. 3, we presented different methods for modeling and predicting different types of uncertainty in neural networks. In order to evaluate these approaches, measures have to be applied to the derived uncertainties. In the following, we present different measures for quantifying the different predicted types of uncertainty. In general, the correctness and trustworthiness of these uncertainties are not automatically given. In fact, there are several reasons why evaluating the quality of the uncertainty estimates is a challenging task.

• First, the quality of the uncertainty estimation depends on the underlying method for estimating uncertainty. This is exemplified in the work undertaken by Yao et al. (2019), which shows that different approximations of Bayesian inference (e.g. Gaussian and Laplace approximations) result in different qualities of uncertainty estimates.

• Second, there is a lack of ground truth uncertainty estimates (Lakshminarayanan et al. 2017), and defining ground truth uncertainty estimates is challenging. For instance, if we define the ground truth uncertainty as the uncertainty across human subjects, we still have to answer questions such as “How many subjects do we need?” or “How do we choose the subjects?”.

    • Third, there is a lack of a unified quantitative evaluation metric (Huang et al. 2019b). To be more specific, uncertainty is defined differently in different machine learning tasks such as classification, segmentation, and regression. For instance, prediction intervals or standard deviations are used to represent uncertainty in regression tasks, while entropy (and other related measures) are used to capture uncertainty in classification and segmentation tasks.

    4.1 Evaluating uncertainty in classification tasks

    For classification tasks, the network’s softmax output already represents a measure of confidence. But since the raw softmax output is neither very reliable (Hendrycks and Gimpel 2017) nor can it represent all sources of uncertainty (Smith and Gal 2018), further approaches and corresponding measures were developed.

    4.1.1 Measuring data uncertainty in classification tasks

    Consider a classification task with K different classes and a probability vector network output p(x) for some input sample x. In the following p is used for simplification and \(p_k\) stands for the k-th entry in the vector. In general, the given prediction p represents a categorical distribution, i.e. it assigns a probability to each class to be the correct prediction. Since the prediction is not given as an explicit class but as a probability distribution, (un)certainty estimates can be directly derived from the prediction. In general, this pointwise prediction can be seen as estimated data uncertainty (Kendall and Gal 2017). However, as discussed in Sect. 2, the model’s estimation of the data uncertainty is affected by model uncertainty, which has to be taken into account separately. In order to evaluate the amount of predicted data uncertainty, one can for example apply the maximal class probability or the entropy measures:

    $$\begin{aligned}&\text {Maximal probability:} \quad&p_{\text {max}}&=\max \left\{ p_k\right\} _{k=1}^K&\end{aligned}$$
    (28)
    $$\begin{aligned}&\text {Entropy:}&\text {H}(p)&=-\sum _{k=1}^Kp_k\log _2(p_k)&\end{aligned}$$
    (29)

The maximal probability gives a direct representation of certainty, while the entropy describes the average level of information contained in the predictive distribution. Even though a softmax output should represent the data uncertainty, one cannot tell from a single prediction how much model uncertainty affects this specific prediction as well.
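Both measures can be computed directly from a predicted probability vector, as in the following minimal sketch (our own illustration of Eqs. (28) and (29)):

```python
import numpy as np

def max_probability(p):
    """Maximal class probability (Eq. 28); higher values mean more certainty."""
    return np.max(p, axis=-1)

def entropy(p, eps=1e-12):
    """Shannon entropy in bits (Eq. 29); higher values mean more uncertainty."""
    return -np.sum(p * np.log2(p + eps), axis=-1)

p = np.array([0.7, 0.2, 0.1])     # example softmax output for K = 3 classes
print(max_probability(p), entropy(p))
```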

    4.1.2 Measuring model uncertainty in classification tasks

As already discussed in Sect. 3, a single softmax prediction is not a very reliable way for uncertainty quantification since it is often badly calibrated (Smith and Gal 2018), and it does not contain any information about the certainty the model itself has in this specific output (Smith and Gal 2018). An (approximated) posterior distribution \(p(\theta \vert D)\) over the learned model parameters can help to obtain better uncertainty estimates. With such a posterior distribution, the softmax output itself becomes a random variable and one can evaluate its variation, i.e. uncertainty. For simplicity, we denote \(p(y\vert \theta , x)\) also as p, and it will be clear from the context whether p depends on \(\theta\) or not. The most common measures for this are the mutual information (MI), the expected Kullback–Leibler divergence (EKL), and the predictive variance. Basically, all these measures compute the expected divergence between the (stochastic) softmax output and the expected softmax output

$$\begin{aligned} {\hat{p}} = {\mathbb {E}}_{\theta \sim p(\theta \vert D)}\left[ p(y\vert x, \theta )\right] \,. \end{aligned}$$
    (30)

The MI uses entropy to measure the mutual dependence between two variables. In the described case, the difference between the information given in the expected softmax output and the expected information in the softmax output is computed, i.e.

    $$\begin{aligned} \text {MI}\left( \theta , y \vert x, D\right) = \text {H}\left[ {\hat{p}}\right] - {\mathbb {E}}_{\theta \sim p(\theta \vert D)}\text {H}\left[ p(y \vert x, \theta )\right] \,. \end{aligned}$$
    (31)

    Smith and Gal (2018) pointed out that the MI is minimal when the knowledge about model parameters does not increase the information in the final prediction. Therefore, the MI can be interpreted as a measure of model uncertainty.

    The Kullback–Leibler divergence measures the divergence between two given probability distributions. The EKL can be used to measure the (expected) divergence among the possible softmax outputs,

    $$\begin{aligned} {\mathbb {E}}_{\theta \sim p(\theta \vert D)}\left[ KL({\hat{p}}\,\Vert \,p)\right] ={\mathbb {E}}_{\theta \sim p(\theta \vert D)}\left[ \sum _{i=1}^K {\hat{p}}_i \log \left( \frac{{\hat{p}}_i}{p_i}\right) \right] \,, \end{aligned}$$
    (32)

    which can also be interpreted as a measure of uncertainty on the model’s output and therefore represents the model uncertainty.

    The predictive variance evaluates the variance on the (random) softmax outputs, i.e.

    $$\begin{aligned} \sigma (p)&= {\mathbb {E}}_{\theta \sim p(\theta \vert D)} \left[ \left( p - {\hat{p}} \right) ^2\right] \,. \end{aligned}$$
    (33)

As described in Sect. 3, an analytically described posterior distribution \(p(\theta \vert D)\) is only given for a subset of the Bayesian methods. And even for an analytically described distribution, the propagation of the parameter uncertainty into the prediction is in almost all cases intractable and has to be approximated, for example with Monte Carlo approximation. Similarly, ensemble methods collect predictions from M neural networks, and test-time data augmentation approaches receive M predictions from M different augmentations applied to the original input sample. For all these cases, we receive a set of M samples, \(\left\{ p^i\right\} _{i=1}^M\), which can be used to approximate the intractable or even undefined underlying distribution. With these approximations, the measures defined in (31), (32), and (33) can be applied straightforwardly; only the expectations have to be replaced by averages over the samples. For example, the expected softmax output becomes

    $$\begin{aligned} {\hat{p}} \approx \frac{1}{M}\sum _{i=1}^M p^i\,. \end{aligned}$$

The expectations in (31), (32), and (33) are approximated analogously.
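Given a stack of M sampled softmax outputs, obtained from a BNN, an ensemble, or test-time augmentation, the sample-based approximations of (30)-(33) can be computed as in the following sketch (our own illustration; natural logarithms are used for the entropy terms):

```python
import numpy as np

def uncertainty_measures(p_samples, eps=1e-12):
    """p_samples: array of shape (M, K) with M sampled softmax outputs."""
    p_hat = p_samples.mean(axis=0)                                    # Eq. (30)

    entropy = lambda p: -np.sum(p * np.log(p + eps), axis=-1)
    mutual_info = entropy(p_hat) - entropy(p_samples).mean()          # Eq. (31)

    expected_kl = np.mean(                                            # Eq. (32)
        np.sum(p_hat * np.log((p_hat + eps) / (p_samples + eps)), axis=-1))

    pred_variance = np.mean((p_samples - p_hat) ** 2, axis=0)         # Eq. (33)
    return p_hat, mutual_info, expected_kl, pred_variance

# Toy usage: M = 4 sampled predictions over K = 3 classes
p_samples = np.array([[0.7, 0.2, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.8, 0.1, 0.1],
                      [0.5, 0.4, 0.1]])
print(uncertainty_measures(p_samples))
```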

    4.1.3 Measuring distributional uncertainty in classification tasks

Although these uncertainty measures are widely used to capture the variability among several predictions derived from BNNs (Kendall and Gal 2017), ensemble methods (Lakshminarayanan et al. 2017), or test-time data augmentation methods (Ayhan and Berens 2018), they cannot capture distributional shifts in the input data or OOD examples, which could lead to a biased inference process and a falsely stated confidence. If all predictors attribute a high probability mass to the same (false) class label, this induces a low variability among the estimates. Hence, the network seems to be certain about its prediction, while the uncertainty in the prediction itself (given by the softmax probabilities) is also evaluated to be low. To tackle this issue, several approaches described in Sect. 3 take the magnitude of the logits into account, since a larger logit indicates larger evidence for the corresponding class (Sensoy et al. 2018). Thus, the methods either interpret the total sum of the (exponentials of the) logits as the precision value of a Dirichlet distribution (see the description of Dirichlet priors in Sect. 3.1) (Malinin and Gales 2018, 2019; Nandy et al. 2020), or as a collection of evidence that is compared to a defined constant (Sensoy et al. 2018; Możejko et al. 2018). One can also derive a total class probability for each class individually by applying the sigmoid function to each logit (Hsu et al. 2020). Based on the class-wise total probabilities, OOD samples might be detected more easily, since all classes can have a low probability at the same time. Other methods deliver an explicit measure of how well new data samples suit the training data distribution. Based on this, they also give a measure of whether a sample will be predicted correctly (Ramalho and Miranda 2020).

    4.1.4 Performance measure on complete data set

While the measures described above evaluate individual predictions, others evaluate the usage of these measures on a set of samples. Measures of uncertainty can be used to separate correctly and falsely classified samples or in-domain and OOD samples (Hendrycks and Gimpel 2017). For that, the samples are split into two sets, for example in-domain and OOD or correctly classified and falsely classified. The two most common approaches are the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. Both methods generate curves based on different thresholds of the underlying measure. For each considered threshold, the ROC curve plots the true positive rate against the false positive rate, and the PR curve plots the precision against the recall. While the ROC and PR curves give a visual idea of how well the underlying measures are suited to separate the two considered test cases, they do not give a qualitative measure. To reach this, the area under the curve (AUC) can be evaluated. Roughly speaking, the AUC gives the probability that a randomly chosen positive sample leads to a higher measure than a randomly chosen negative sample. For example, the maximum softmax value in general ranks correctly classified examples higher than falsely classified examples. Hendrycks and Gimpel (2017) showed for several application fields that correct predictions in general have a higher predicted certainty in the softmax value than false predictions. Especially for the evaluation of in-domain and OOD examples, the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC) are commonly used (Nandy et al. 2020; Malinin and Gales 2018, 2019). The clear weakness of these evaluations is the fact that the performance is evaluated and the optimal threshold is computed based on a given test data set. A distribution shift from the test set distribution can ruin the whole performance and make the derived thresholds impractical.

    4.2 Evaluating uncertainty in regression tasks

    4.2.1 Measuring data uncertainty in regression predictions

In contrast to classification tasks, where the network typically outputs a probability distribution over the possible classes, regression tasks only predict a pointwise estimate without any hint of data uncertainty. As already described in Sect. 3, a common approach to overcome this is to let the network predict the parameters of a probability distribution, for example a mean vector and a standard deviation for a normally distributed uncertainty (Lakshminarayanan et al. 2017; Kendall and Gal 2017). In doing so, a measure of data uncertainty is directly given. The predicted standard deviation allows an analytical statement about the probability that the (unknown) true value lies within a specific region. The interval that covers the true value with probability \(\alpha\) (under the assumption that the predicted distribution is correct) is given by

$$\begin{aligned} \left[ {\hat{y}}-\Phi ^{-1}\left( \tfrac{1+\alpha }{2}\right) \cdot \sigma ;\quad {\hat{y}}+\Phi ^{-1}\left( \tfrac{1+\alpha }{2}\right) \cdot \sigma \right] \end{aligned}$$
    (34)

where \(\Phi ^{-1}\) is the quantile function of the standard normal distribution, i.e. the inverse of its cumulative distribution function. For a given probability value p, the quantile function gives a boundary such that \(100\cdot p\%\) of a standard normal distribution’s probability mass lies on values smaller than \(\Phi ^{-1}(p)\). Quantiles assume some probability distribution and interpret the given prediction as the expected value of the distribution.

    In contrast to this, other approaches (Pearce et al. 2018; Su et al. 2018) directly predict a so-called prediction interval (PI)

    $$\begin{aligned} PI(x) = \left[ B_l, B_u\right] \end{aligned}$$
    (35)

in which the true value is assumed to lie. Such intervals represent the uncertainty as a uniform distribution over the interval without giving a concrete point prediction. The certainty of such approaches can, as the name indicates, be directly measured by the size of the predicted interval. The Mean Prediction Interval Width (MPIW) can be used to evaluate the average certainty of the model (Pearce et al. 2018; Su et al. 2018). In order to evaluate the correctness of the predicted intervals, the Prediction Interval Coverage Probability (PICP) can be applied (Pearce et al. 2018; Su et al. 2018). The PICP represents the fraction of ground truth values that are actually captured by the predicted intervals and is defined as

    $$\begin{aligned} \text {PICP}=\frac{c}{n}, \end{aligned}$$
    (36)

    where n is the total number of predictions and c is the number of ground truth values that are actually captured by the predicted intervals.
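A minimal sketch of both interval metrics (our own illustration of the MPIW and of Eq. (36)):

```python
import numpy as np

def mpiw(lower, upper):
    """Mean Prediction Interval Width: average size of the predicted intervals."""
    return np.mean(upper - lower)

def picp(y_true, lower, upper):
    """Prediction Interval Coverage Probability (Eq. 36):
    fraction of ground truth values that fall inside their predicted interval."""
    covered = (y_true >= lower) & (y_true <= upper)
    return covered.mean()

# Toy usage: three test samples with predicted intervals
y_true = np.array([1.0, 2.5, 0.3])
lower  = np.array([0.8, 1.9, 0.5])
upper  = np.array([1.4, 2.7, 0.9])
print(mpiw(lower, upper), picp(y_true, lower, upper))   # 0.6 and 2/3
```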

    4.2.2 Measuring model uncertainty in regression predictions

In Sect. 2, it is described that model uncertainty is mainly caused by the model’s architecture, the training process, and underrepresented areas in the training data. Hence, there is no real difference in the causes and effects of model uncertainty between regression and classification tasks, such that model uncertainty in regression tasks can be measured equivalently to the way already described for classification tasks, i.e. in most cases by approximating an average prediction and measuring the divergence among the single predictions (Kendall and Gal 2017).

    4.3 Evaluating uncertainty in segmentation tasks

The evaluation of uncertainties in segmentation tasks is very similar to the evaluation in classification problems. The uncertainty in segmentation tasks is estimated using approximations of Bayesian inference (Nair et al. 2020; Roy et al. 2019; LaBonte et al. 2019; Eaton-Rosen et al. 2018; McClure et al. 2019; Soleimany et al. 2019; Soberanis-Mukul et al. 2020; Seebock et al. 2020) or test-time data augmentation techniques (Wang et al. 2019). In the context of segmentation, the uncertainty in pixel-wise segmentation is measured using confidence intervals (LaBonte et al. 2019; Eaton-Rosen et al. 2018), the predictive variance (Soleimany et al. 2019; Seebock et al. 2020), the predictive entropy (Roy et al. 2019; Wang et al. 2019; McClure et al. 2019; Soberanis-Mukul et al. 2020), or the mutual information (Nair et al. 2020). The uncertainty in structure (volume) estimation is obtained by averaging over all pixel-wise uncertainty estimates (Seebock et al. 2020; McClure et al. 2019). The quality of volume uncertainties is assessed by evaluating the coefficient of variation, the average Dice score, or the intersection over union (Roy et al. 2019; Wang et al. 2019). These metrics measure the agreement in area overlap between multiple estimates in a pair-wise fashion. Ideally, a false segmentation should result in an increase in pixel-wise and structure uncertainty. To evaluate whether this is the case, Nair et al. (2020) evaluated the pixel-level true positive rate and false detection rate as well as the ROC curves for the retained pixels at different uncertainty thresholds. Similar to Nair et al. (2020), McClure et al. (2019) also analyzed the area under the ROC curve.

    5 Calibration

    A predictor is called well-calibrated if the derived predictive confidence represents a good approximation of the actual probability of correctness (Guo et al. 2017). Therefore, in order to make use of uncertainty quantification methods, one has to be sure that the network is well calibrated. Formally, for classification tasks a neural network \(f_\theta\) is calibrated (Kuleshov et al. 2018) if it holds that

$$\begin{aligned} \forall p \in [0,1]:\quad \frac{\sum _{i=1}^N \sum _{k=1}^K y_{i,k}\cdot {\mathbb {I}}\{f_\theta (x_i)_k=p\}}{\sum _{i=1}^N \sum _{k=1}^K{\mathbb {I}}\{f_\theta (x_i)_k=p\}} \xrightarrow []{N \rightarrow \infty } p\,. \end{aligned}$$
    (37)

Here, \({\mathbb {I}}\{\cdot \}\) is the indicator function that is either 1 if the condition is true or 0 if it is false, and \(y_{i,k}\) is the k-th entry of the one-hot encoded ground truth vector of a training sample \((x_i,y_i)\). This formulation means that, for example, \(30\%\) of all predictions with a predictive confidence of \(70\%\) should actually be false. For regression tasks, calibration can be defined such that predicted confidence intervals should match the confidence intervals empirically computed from the data set (Kuleshov et al. 2018), i.e.

    $$\begin{aligned} \forall p \in [0,1]:\quad \sum _{i=1}^N\frac{{\mathbb {I}}\left\{ y_i\in \text {conf}_{p}(f_\theta (x_i))\right\} }{N} \xrightarrow []{N \rightarrow \infty } p, \end{aligned}$$
    (38)

    where \(\text {conf}_p\) is the confidence interval that covers p percent of a distribution.

A DNN is called under-confident if the left-hand sides of (37) and (38) are larger than p. Equivalently, it is called over-confident if the terms are smaller than p. The calibration property of a DNN can be visualized using a reliability diagram, as shown in Fig. 8.
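A reliability diagram as in Fig. 8 can be produced by binning the predicted confidences and comparing the per-bin accuracy with the per-bin average confidence; a minimal sketch of this binning step (our own illustration; reliability_bins is a hypothetical helper) is given below:

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin average confidence and accuracy for a reliability diagram.

    confidences: predicted confidence of the chosen class per sample.
    correct:     boolean array, True if the prediction was correct.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    avg_conf, accuracy, weight = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            avg_conf.append(confidences[mask].mean())
            accuracy.append(correct[mask].mean())
            weight.append(mask.mean())        # fraction of samples in this bin
    return np.array(avg_conf), np.array(accuracy), np.array(weight)

# Toy usage: an overconfident classifier (accuracy below confidence)
conf = np.array([0.95, 0.9, 0.85, 0.8, 0.75, 0.7])
corr = np.array([True, False, True, False, True, False])
print(reliability_bins(conf, corr, n_bins=5))
```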

In general, calibration errors are caused by factors related to model uncertainty (Guo et al. 2017). This is intuitively clear since, as discussed in Sect. 2, data uncertainty represents the underlying uncertainty with which an input x and a target y represent the same real world information. Hence, correctly predicted data uncertainty would lead to a perfectly calibrated neural network. In practice, several works pointed out that deeper networks tend to be more overconfident than shallower ones (Guo et al. 2017; Seo et al. 2019; Li and Hoiem 2020).

    Several methods for uncertainty estimation presented in Sect. 3 also improve the network’s calibration (Lakshminarayanan et al. 2017; Gal and Ghahramani 2016). This is clear since these methods quantify model and data uncertainty separately and aim at reducing the model uncertainty on the predictions. Besides the methods that improve calibration by reducing the model uncertainty, a large and growing body of literature has investigated methods for explicitly reducing calibration errors. These methods are presented in the following, followed by measures to quantify the calibration error. It is important to note that these methods do not reduce the model uncertainty, but propagate the model uncertainty onto the representation of the data uncertainty. For example, if a binary classifier is overfitted and predicts all samples of a test set as class A with probability 1, while half of the test samples are actually class B, the recalibration methods might map the network output to 0.5 in order to have reliable confidence. This probability of 0.5 is not equivalent to the data uncertainty but represents the model uncertainty propagated onto the predicted data uncertainty.

    Fig. 8

    a Reliability diagram showing an overconfident classifier: The bin-wise accuracy is smaller than the corresponding confidence. b Reliability diagram of an underconfident classifier: The bin-wise accuracy is larger than the corresponding confidence. c Reliability diagram of a well-calibrated classifier: The confidence fits the actual accuracy for the single bins

    5.1 Calibration methods

    Calibration methods can be classified into three main groups according to the step when they are applied:

    • Regularization methods applied during the training phase (Szegedy et al. 2016; Pereyra et al. 2017; Lee et al. 2018a; Müller et al. 2019; Venkatesh and Thiagarajan 2019)

      These methods modify the objective, optimization, and/or regularization procedure in order to build DNNs that are inherently calibrated.

    • Post-processing methods applied after the training process of the DNN (Guo et al. 2017; Wenger et al. 2020)

These methods require a held-out calibration data set to adjust the prediction scores for recalibration. They only work under the assumption that the distribution of the left-out validation set is equivalent to the distribution on which inference is done. Hence, the size of the validation data set can also influence the calibration result.

    • Neural network uncertainty estimation methods

      Approaches, as presented in Sect. 3, that reduce the amount of model uncertainty on a neural network’s confidence prediction, also lead to a better calibrated predictor. This is because the remaining predicted data uncertainty better represents the actual uncertainty on the prediction. Such methods are based for example on Bayesian methods (Izmailov et al. 2020; Foong et al. 2019; Zhang et al. 2019; Laves et al. 2019; Wilson and Izmailov 2020) or deep ensembles (Lakshminarayanan et al. 2017; Mehrtash et al. 2020).

    In the following, we present the three types of calibration methods in more detail (Fig. 9).

    Fig. 9

    Visualization of the different types of uncertainty calibration methods presented in this paper

    5.1.1 Regularization methods

Regularization methods for calibrating confidences manipulate the training of DNNs by modifying the objective function or by augmenting the training data set. The goal and idea of these regularization methods are very similar to those of the methods presented in Sect. 3.1, which mainly quantify model and data uncertainty separately within a single forward pass. However, while the methods in Sect. 3.1 quantify the model and data uncertainty, the calibration methods discussed here regularize the training in order to minimize the model uncertainty. As a consequence, the model uncertainty cannot be obtained at inference anymore. This is the main motivation for separating the approaches presented below from those presented in Sect. 3.1.

    One popular regularization based calibration method is label smoothing (Szegedy et al. 2016). For label smoothing, the labels of the training examples are modified by taking a small portion \(\alpha\) of the true class’ probability mass and assigning it uniformly to the false classes. For hard, non-smoothed labels, the optimum cannot be reached in practice, as the gradient of the output with respect to the logit vector z,

    $$\begin{aligned} \begin{aligned} \nabla _z \text {CE}(y, {{\hat{y}}}(z))&= \text {softmax}(z) - y \\&= \frac{\exp (z)}{\sum _{i=1}^K \exp (z_i)}-y\,, \end{aligned} \end{aligned}$$
    (39)

    can only converge to zero with increasing distance between the logits of the true and false classes. As a result, the logits of the correct class become much larger than the logits of the incorrect classes, and the logits of the incorrect classes can differ strongly from each other. Label smoothing avoids this: while it generally leads to a higher training loss, the calibration error decreases and the accuracy often increases as well (Müller et al. 2019).
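
    To make the mechanism concrete, the following minimal NumPy sketch constructs smoothed targets as described above; the function name smooth_labels and the value of \(\alpha\) are illustrative choices, not taken from the cited works.

        import numpy as np

        def smooth_labels(one_hot_targets, alpha=0.1):
            # Redistribute a fraction alpha of the probability mass uniformly
            # over all K classes (label smoothing in the sense of Szegedy et al. 2016).
            num_classes = one_hot_targets.shape[-1]
            return (1.0 - alpha) * one_hot_targets + alpha / num_classes

        # Example with three classes, true class at index 1:
        # smooth_labels(np.array([[0., 1., 0.]])) -> [[0.0333, 0.9333, 0.0333]]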

    Seo et al. (2019) extended the idea of label smoothing and directly aimed at reducing the model uncertainty. For this, they sampled T forward passes of a stochastic neural network already at training time. Based on the T forward passes of a training sample \((x_i,y_i)\), a normalized model variance \(\alpha _i\) is derived as the mean of the Bhattacharyya coefficients (Comaniciu et al. 2000) between the T individual predictions \({{\hat{y}}}_1,\ldots ,{{\hat{y}}}_T\) and the average prediction \({\bar{y}} = \frac{1}{T}\sum _{t=1}^T{{\hat{y}}}_t\),

    $$\begin{aligned} \begin{aligned} \alpha _i&= \frac{1}{T}\sum _{t=1}^T BC({\bar{y}}_i, {{\hat{y}}}_{i,t}) \\&=\frac{1}{T}\sum _{t=1}^T \sum _{k=1}^K \sqrt{{\bar{y}}_{i,k} \cdot {{\hat{y}}}_{i,t,k}}\,. \end{aligned} \end{aligned}$$
    (40)

    Based on this \(\alpha _i\), Seo et al. (2019) introduced the variance-weighted confidence-integrated loss function, a convex combination of two contradicting loss functions,

    $$\begin{aligned} \begin{aligned} L^{\text {VWCI}}(\theta )=-\sum _{i=1}^N(1-\alpha _i)L_{\text {GT}}^{(i)}(\theta ) + \alpha _i L_{\text {U}}^{(i)}(\theta )\,, \end{aligned} \end{aligned}$$
    (41)

    where \(L_\text {GT}^{(i)}\) is the mean cross-entropy computed for the training sample \(x_i\) with given ground-truth \(y_i\). \(L_\text {U}\) represents the mean KL-divergence between a uniform target probability vector and the computed prediction. The adaptive smoothing parameter \({\alpha }_i\) pushes predictions of training samples with high model uncertainty (given by high variances) towards a uniform distribution while increasing the prediction scores of samples with low model uncertainty. As a result, variances in the predictions of a single sample are reduced and the network can then be applied with a single forward pass at inference.
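
    A minimal PyTorch sketch of this loss under the stated definitions is given below; the tensor shapes, the direction of the KL term, and the function name vwci_loss are assumptions made for illustration, and both terms are written as non-negative losses to be minimized.

        import torch

        def vwci_loss(logits_T, targets):
            # Sketch of the variance-weighted confidence-integrated loss.
            # logits_T: (T, N, K) logits from T stochastic forward passes,
            # targets: (N,) integer class labels.
            T, N, K = logits_T.shape
            probs = logits_T.softmax(dim=-1)                 # predictions y_hat_{i,t}
            mean_probs = probs.mean(dim=0)                   # average prediction y_bar_i

            # alpha_i: mean Bhattacharyya coefficient between each pass and the mean (Eq. 40)
            bc = (mean_probs.unsqueeze(0) * probs).sqrt().sum(dim=-1)   # (T, N)
            alpha = bc.mean(dim=0)                                      # (N,)

            log_probs = probs.clamp_min(1e-12).log()
            # L_GT: mean cross-entropy to the ground truth over the T passes
            l_gt = -log_probs[:, torch.arange(N), targets].mean(dim=0)
            # L_U: mean KL divergence from a uniform target to the prediction
            uniform = torch.full_like(probs, 1.0 / K)
            l_u = (uniform * (uniform.log() - log_probs)).sum(dim=-1).mean(dim=0)

            return ((1 - alpha) * l_gt + alpha * l_u).sum()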

    Pereyra et al. (2017) combated the overconfidence issue by adding the negative entropy to the standard loss function, i.e. a penalty that increases with the network’s predicted confidence. This results in the entropy-based objective function \(L^H\), which is defined as

    $$\begin{aligned} L^H(\theta ) = -\frac{1}{N} \sum _{i=1}^{N} y_i \log {\hat{y}}_i - \alpha _i H({\hat{y}}_i), \end{aligned}$$
    (42)

    where \(H({\hat{y}}_i)\) is the entropy of the output and \(\alpha _i\) is a parameter that controls the strength of the entropy-based confidence penalty. The parameter \(\alpha _i\) can be computed in the same way as for the VWCI loss.
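
    A corresponding PyTorch sketch of the entropy-based confidence penalty is shown below; for simplicity a fixed penalty weight is used instead of the sample-wise \(\alpha _i\), which is an assumption of this illustration.

        import torch
        import torch.nn.functional as F

        def entropy_penalized_loss(logits, targets, alpha=0.1):
            # Cross-entropy minus a weighted entropy bonus, i.e. a penalty on
            # confident (low-entropy) predictions in the spirit of Pereyra et al. (2017).
            probs = logits.softmax(dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # H(y_hat)
            return F.cross_entropy(logits, targets) - alpha * entropy.mean()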

    Instead of regularizing the training process by modifying the objective function, Thulasidasan et al. (2019) regularized it by using a data-agnostic data augmentation technique named mixup (Zhang et al. 2018b). In mixup training, the network is not only trained on the training data but also on virtual training samples \(({\tilde{x}}, {\tilde{y}})\) generated by a convex combination of two random training pairs \((x_i,y_i)\) and \((x_j,y_j)\), i.e.

    $$\begin{aligned} \tilde{x}&= \lambda x_i + (1 - \lambda ) x_j \end{aligned}$$
    (43)
    $$\begin{aligned} \tilde{y}&= \lambda y_i + (1 - \lambda ) y_j. \end{aligned}$$
    (44)
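
    A minimal NumPy sketch of this convex combination is given below; drawing \(\lambda\) from a Beta distribution and the chosen parameter value follow common practice in the mixup literature rather than Eqs. (43)–(44) themselves, and the function name is illustrative.

        import numpy as np

        def mixup_batch(x, y_onehot, beta_param=0.4, rng=None):
            # Virtual samples (x_tilde, y_tilde) as the convex combination of a
            # batch with a shuffled copy of itself (Eqs. 43-44); lambda is drawn
            # from a Beta distribution as is common in the mixup literature.
            rng = np.random.default_rng() if rng is None else rng
            lam = rng.beta(beta_param, beta_param)
            perm = rng.permutation(len(x))
            x_tilde = lam * x + (1.0 - lam) * x[perm]
            y_tilde = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
            return x_tilde, y_tilde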

    According to Thulasidasan et al. (2019), the label smoothing resulting from mixup training can be viewed as a form of entropy-based regularization, resulting in an inherent calibration of networks trained with mixup. Maroñas et al. (2020) count mixup training among the most popular data augmentation regularization techniques due to its ability to improve both calibration and accuracy. However, they argued that in mixup training the data uncertainty in mixed inputs affects the calibration and that mixup therefore does not necessarily improve the calibration; they underlined this claim empirically. Similarly, Rahaman et al. (2021) experimentally showed that the distributional shift induced by data augmentation techniques such as mixup training can negatively affect the confidence calibration. Based on this observation, Maroñas et al. (2020) proposed a new objective function that explicitly takes the calibration performance on the unmixed input samples into account. Inspired by the expected calibration error (ECE, see Sect. 5.2) (Naeini et al. 2015), they measured the calibration performance on the unmixed samples of each batch b by the differentiable squared difference between the batch accuracy and the mean confidence on the batch samples. The total loss is given as a weighted combination of the original loss on mixed and unmixed samples and the calibration measure evaluated only on the unmixed samples:

    $$\begin{aligned} L^{ECE}(\theta ) = \frac{1}{B} \sum _{b \in B} L^b(\theta ) + \beta ECE_b, \end{aligned}$$
    (45)

    where \(L^b(\theta )\) is the original unregularized loss using training and mixed samples included in batch b and \(\beta\) is a hyperparameter controlling the relative importance given to the batchwise expected calibration error \(ECE_b\). By adding the batchwise calibration error for each batch \(b \in B\) to the standard loss function, the miscalibration induced by mixup training is regularized.
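
    The following PyTorch sketch illustrates this batch-wise combination as described above; the function signature, the handling of the soft mixup labels, and treating the batch accuracy as a constant are assumptions of this illustration rather than details taken from Maroñas et al. (2020).

        import torch
        import torch.nn.functional as F

        def mixup_with_batch_ece(model, x, y, x_mixed, y_mixed, beta=1.0):
            # Batch loss on unmixed samples (hard labels y) and mixup samples
            # (soft labels y_mixed), plus a differentiable squared gap between
            # batch accuracy and mean confidence on the unmixed samples only.
            logits = model(x)
            logits_mixed = model(x_mixed)
            base_loss = F.cross_entropy(logits, y) \
                + (-y_mixed * logits_mixed.log_softmax(dim=-1)).sum(dim=-1).mean()

            probs = logits.softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            acc = (pred == y).float().mean()       # non-differentiable, acts as a constant
            ece_b = (acc - conf.mean()).pow(2)     # batch-wise calibration penalty
            return base_loss + beta * ece_b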

    In the context of data augmentation, Patel et al. (2021) improved the calibration of uncertainty estimates by using on-manifold data augmentation. While mixup training combines training samples, on-manifold adversarial training generates out-of-domain samples using adversarial attacks. They experimentally showed that on-manifold adversarial training outperforms mixup training in improving the calibration. Similar to Patel et al. (2021), Hendrycks et al. (2019) showed that exposing classifiers to OOD examples during training can help to improve the calibration.

    5.1.2 Post-processing methods

    Post-processing (or post-hoc) methods are applied after the training process and aim at learning a re-calibration function. For this, a subset of the training data is held out during training and used as a calibration set. The re-calibration function is applied to the network’s outputs (e.g. the logit vector) and yields an improved calibration learned on the held-out calibration set. Zhang et al. (2020) discussed three requirements that should be satisfied by post-hoc calibration methods. They should

    1. preserve the accuracy, i.e. they should not affect the predictor’s performance,

    2. be data efficient, i.e. only a small fraction of the training data set should be left out for the calibration, and

    3. be able to approximate the correct re-calibration map as long as there is enough data available for calibration.

    Furthermore, they pointed out that none of the existing approaches fulfills all three requirements.

    For classification tasks, the most basic but still very efficient way of post-hoc calibration is temperature scaling (Guo et al. 2017). For temperature scaling, the temperature \(T>0\) of the softmax function

    $$\begin{aligned} \text {softmax}(z_i) = \frac{\exp (z_i/T)}{\sum _{j=1}^K\exp (z_j/T)}, \end{aligned}$$
    (46)

    is optimized on the calibration set. For \(T=1\) the function remains the regular softmax function. For \(T>1\) the entropy of the output increases, i.e. the predicted confidence decreases. For \(T\in (0,1)\) the entropy decreases and, consequently, the predicted confidence increases. As already mentioned above, a perfectly calibrated neural network outputs MAP estimates. Since the learned transformation can only affect the uncertainty, log-likelihood based losses such as the cross-entropy do not have to be replaced by a special calibration loss. While the data efficiency and the preservation of the accuracy are given, the expressiveness of basic temperature scaling is limited (Zhang et al. 2020).
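
    A minimal PyTorch sketch of temperature scaling is given below; Guo et al. (2017) fit T by minimizing the negative log-likelihood on the held-out set, while the particular optimizer, step count, and helper name used here are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def fit_temperature(logits_val, labels_val, steps=200, lr=0.01):
            # Learn a single temperature T > 0 on a held-out calibration set by
            # minimizing the NLL of softmax(logits / T).
            log_t = torch.zeros(1, requires_grad=True)     # T = exp(log_t) keeps T > 0
            optimizer = torch.optim.Adam([log_t], lr=lr)
            for _ in range(steps):
                optimizer.zero_grad()
                loss = F.cross_entropy(logits_val / log_t.exp(), labels_val)
                loss.backward()
                optimizer.step()
            return log_t.exp().item()

        # At inference, calibrated probabilities are softmax(logits / T); the
        # argmax and hence the accuracy remain unchanged.

    5.2 Evaluating calibration quality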

    Evaluating calibration consists of measuring the statistical consistency between the predictive distributions and the observations (Vaicenavicius et al. 2019). For classification tasks, several calibration measures are based on binning. For that, the predictions are ordered by the predicted confidence \({{\hat{p}}}_i\) and grouped into M bins \(b_1,\ldots ,b_M\). Subsequently, the calibration of the individual bins is evaluated by setting the average bin confidence

    $$\begin{aligned} \text {conf}(b_m)=\frac{1}{\vert b_m \vert } \sum _{s\in b_m}{\hat{p}}_s \end{aligned}$$
    (47)

    in relation to the average bin accuracy

    $$\begin{aligned} \text {acc}(b_m) = \frac{1}{\vert b_m \vert } \sum _{s \in b_m} \mathbbm {1}({\hat{y}}_s=y_s), \end{aligned}$$
    (48)

    where \({\hat{y}}_s\), \(y_s\), and \({\hat{p}}_s\) refer to the predicted class label, the true class label, and the predicted confidence of a sample s, respectively. As noted in Guo et al. (2017), confidences are well-calibrated when \(\text {acc}(b_m)=\text {conf}(b_m)\) holds for each bin. For a visual evaluation of a model’s calibration, the reliability diagram introduced by DeGroot and Fienberg (1983) is widely used. In a reliability diagram, \(\text {conf}(b_m)\) is plotted against \(\text {acc}(b_m)\). For a well-calibrated model, the plot should be close to the diagonal, as visualized in Fig. 8. The basic reliability diagram visualization does not distinguish between different classes. In order to do so, and hence to improve the interpretability of the calibration error, Vaicenavicius et al. (2019) used an alternative visualization named multidimensional reliability diagram.

    For a quantitative evaluation of a model’s calibration, different calibration measures can be considered. The Expected Calibration Error (ECE) is a widely used binning-based calibration measure (Naeini et al. 2015; Guo et al. 2017; Laves et al. 2019; Mehrtash et al. 2020; Thulasidasan et al. 2019; Wenger et al. 2020). For the ECE, M equally-spaced bins \(b_1,\ldots ,b_M\) are considered, where \(b_m\) denotes the set of indices of samples whose confidences fall into the interval \(I_m=[\frac{m-1}{M},\frac{m}{M}]\). The ECE is then computed as the weighted average of the bin-wise calibration errors, i.e.

    $$\begin{aligned} \text {ECE} = \sum _{m=1}^{M}\frac{\vert b_m \vert }{N}\vert \text {acc}(b_m)-\text {conf}(b_m)\vert . \end{aligned}$$
    (49)
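
    A minimal NumPy sketch of this binned computation is given below; the per-bin (confidence, accuracy) pairs it aggregates are exactly the points shown in a reliability diagram. Function and argument names are illustrative.

        import numpy as np

        def expected_calibration_error(confidences, predictions, labels, num_bins=10):
            # Top-label ECE with M equally spaced bins (Eq. 49). The per-bin
            # (confidence, accuracy) pairs are the points of a reliability diagram.
            bins = np.linspace(0.0, 1.0, num_bins + 1)
            n = len(confidences)
            ece = 0.0
            for lo, hi in zip(bins[:-1], bins[1:]):
                in_bin = (confidences > lo) & (confidences <= hi)
                if in_bin.any():
                    acc = (predictions[in_bin] == labels[in_bin]).mean()
                    conf = confidences[in_bin].mean()
                    ece += in_bin.sum() / n * np.abs(acc - conf)
            return ece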

    For the ECE, only the predicted confidence score (top-label) is considered. In contrast to this, the Static Calibration Error (SCE) (Nixon et al. 2019; Ghandeharioun et al. 2019) considers the predictions of all classes (all-labels). For each class, the SCE computes the calibration error within the bins and then averages across all the bins, i.e.

    $$\begin{aligned} \text {SCE} = \frac{1}{K} \sum _{k=1}^{K} \sum _{m=1}^{M} \frac{\vert b_{m_k} \vert }{N} \vert \text {conf}(b_{m_k})-\text {acc}(b_{m_k}) \vert . \end{aligned}$$
    (50)

    Here, \(\text {conf}(b_{m_k})\) and \(\text {acc}(b_{m_k})\) are the confidence and accuracy of bin \(b_m\) for class label k, respectively. Nixon et al. (2019) empirically showed that all-label calibration measures such as the SCE are more effective in assessing the calibration error than top-label calibration measures such as the ECE.
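
    A corresponding NumPy sketch of the all-label SCE is shown below; interpreting the per-class bin accuracy as the observed frequency of that class within the bin is an assumption of this illustration.

        import numpy as np

        def static_calibration_error(probs, labels, num_bins=10):
            # All-label SCE (Eq. 50): bin the predicted probability of every class
            # separately, compare it with the observed frequency of that class in
            # each bin, and average the weighted gaps over the K classes.
            n, k = probs.shape
            bins = np.linspace(0.0, 1.0, num_bins + 1)
            sce = 0.0
            for c in range(k):
                conf_c = probs[:, c]
                hits_c = (labels == c).astype(float)
                for lo, hi in zip(bins[:-1], bins[1:]):
                    in_bin = (conf_c > lo) & (conf_c <= hi)
                    if in_bin.any():
                        gap = np.abs(hits_c[in_bin].mean() - conf_c[in_bin].mean())
                        sce += in_bin.sum() / n * gap
            return sce / k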

    In contrast to the ECE and SCE, which group predictions into M equally-spaced bins (which in general leads to different numbers of evaluation samples per bin), the adaptive calibration error (Nixon et al. 2019; Ghandeharioun et al. 2019) adaptively groups predictions into R bins of different widths but with an equal number of predictions per bin. With this adaptive bin size, the adaptive Expected Calibration Error (aECE)

    $$\begin{aligned} \text {aECE} = \frac{1}{R}\sum _{r=1}^{R} \vert \text {conf}(b_r) - \text {acc}(b_r) \vert , \end{aligned}$$
    (51)

    and the adaptive Static Calibration Error (aSCE)

    $$\begin{aligned} \text {aSCE} = \frac{1}{K R} \sum _{k=1}^{K} \sum _{r=1}^{R} \vert \text {conf}(b_{r_k})-\text {acc}(b_{r_k}) \vert \end{aligned}$$
    (52)

    are defined as extensions of the ECE and the SCE. As has been empirically shown in Patel et al. (2021) and Nixon et al. (2019), the adaptive binning calibration measures \(\text {aECE}\) and \(\text {aSCE}\) are more robust to the number of bins than the corresponding equal-width binning calibration measures \(\text {ECE}\) and \(\text {SCE}\).
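
    The adaptive variant can be sketched by sorting the confidences and splitting them into equally populated bins, as in the following NumPy illustration; the names and the handling of ties are assumptions.

        import numpy as np

        def adaptive_ece(confidences, predictions, labels, num_bins=10):
            # Adaptive ECE (Eq. 51): sort the confidences, split them into R bins
            # holding (roughly) the same number of predictions, and average the
            # unweighted per-bin |accuracy - confidence| gaps.
            order = np.argsort(confidences)
            gaps = []
            for idx in np.array_split(order, num_bins):
                if len(idx) == 0:
                    continue
                acc = (predictions[idx] == labels[idx]).mean()
                conf = confidences[idx].mean()
                gaps.append(abs(acc - conf))
            return float(np.mean(gaps))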

    It is important to note that in a multi-class setting the calibration measures can suffer from imbalance in the test data. Even when the calibration is computed classwise, the computed errors are weighted by the number of samples per class. Consequently, larger classes can overshadow poor calibration on small classes, comparable to accuracy values in classification tasks on imbalanced data (Pulgar et al. 2017).

    6 Data sets and baselines

    In this section, we collect commonly used tasks and data sets for evaluating uncertainty estimation among existing works. In addition, a variety of baseline approaches commonly used for comparison against newly proposed methods are presented. By reviewing the relevant information of these experiments, we hope that both researchers and practitioners can benefit: the former can gain a basic understanding of recent benchmark tasks, data sets, and baselines in order to design appropriate experiments to validate their ideas more efficiently, while the latter can use the provided information to select a suitable approach to start from, based on a concise overview of the tasks and data sets on which it has been validated.

    In the following, we will introduce the data sets and baselines summarized in Table 4 according to the taxonomy used throughout this review.

    Table 4 Overview of frequently compared benchmark approaches, tasks and their data sets among existing works organized according to the taxonomy of this paper. From left to right, the columns indicate the approach considered, the tasks evaluated with the corresponding approaches, the data sets used for this evaluation, and the baselines commonly used in these works

    The structure of the table is designed to organize the main contents of this section concisely, hoping to provide a clear overview of the relevant works. We group the approaches of each category into one of four blocks and extract the most commonly used tasks, data sets, and provided baselines for each column respectively. The corresponding literature is listed at the bottom of each block to facilitate lookup. Note that we focus on methodological comparison here, but not the choice of architecture for different methods which has an impact on performance as well. Due to the space limitation and visual density, we only show the most important elements (task, data set, baselines) ranked according to the frequency of use in the literature we have researched.

    The main results are as follows. One of the most frequent tasks for evaluating uncertainty estimation methods is the regression task, where samples close and far away from the training distribution are studied. Furthermore, the calibration of uncertainty estimates in the case of classification problems is very often investigated. Further noteworthy tasks are OOD detection and robustness against adversarial attacks. In the medical domain, calibration of semantic segmentation results is the predominant use case.

    The choice of data sets is mostly consistent among all reviewed works. For regression, toy data sets are employed for visualizing uncertainty intervals, while the UCI data sets (Dua and Graff 2017) are studied in light of (negative) log-likelihood comparison. The most common data sets for calibration and OOD detection are MNIST (LeCun et al. 1998; Deng 2012), CIFAR10 and CIFAR100 (Krizhevsky 2009) as well as SVHN (Netzer et al. 2011), while ImageNet (Deng et al. 2009) and its tiny variant are also studied frequently. These form distinct pairs when OOD detection is studied: models trained on CIFAR variants are evaluated on SVHN and vice versa, while MNIST is paired with variants of itself such as notMNIST and FashionMNIST.

    For medical applications involving clinical background data, distribution shifts are quite common. This is because medical information is highly confidential, and machine learning models are often trained on data from different sources (e.g., varying patient backgrounds or measurement systems) than those they are applied to later.

    In robotics, uncertainty quantification has a long history, with probabilistic state estimation as a core task (Koh et al. 2006; Bailey and Durrant-Whyte 2006; Montemerlo et al. 2002; Kaess et al. 2010). As a result, many probabilistic methods such as factor graphs (Dellaert et al. 2017; Loeliger 2004) are now the work-horse of advanced consumer products such as robotic vacuum cleaners and unmanned aerial vehicles. In the case of planning and control, estimation problems are widely treated as Bayesian sequential learning problems, and sequential decision-making frameworks such as POMDPs (Silver and Veness 2010; Ross et al. 2008) assume a probabilistic treatment of the underlying planning problems. With probabilistic representations, many reinforcement learning algorithms are backed by stability guarantees for safe interactions in the real world (Richards et al. 2018; Berkenkamp et al. 2016, 2017). Lastly, there have also been several advances ranging from reasoning [from semantics (Grimmett et al. 2016) to joint reasoning with geometry] and embodiment [e.g. active perception (Bajcsy 1988)] to learning [e.g. active learning (Triebel et al. 2016; Narr et al. 2016; Cohn et al. 1996) and identifying unknown objects (Nguyen et al. 2015; Wong et al. 2020; Boerdijk et al. 2021)].

    Similarly, with the advent of deep learning, many researchers proposed new methods to quantify the uncertainty in deep learning as well as ways to further exploit such information. As opposed to the many generic approaches, we summarize task-specific methods and their application in practice in the following. Notably, Richter and Roy (2017) proposed to perform novelty detection using auto-encoders, where the reconstructed outputs of the auto-encoder are used to decide how much one can trust the network’s predictions. Peretroukhin et al. (2016).

    Unlike typical computer vision scenarios, where the image acquisition equipment is close to the subject, EO satellites are hundreds of kilometers away from the observed scene. The sensitivity of the sensors, atmospheric absorption properties, and surface reflectance properties all contribute to uncertainties in the acquired data. Integrating the knowledge of physical EO systems, which also contain information about uncertainty models in those systems, is another major open issue. However, for several applications in EO, measuring uncertainties is not merely desirable but an important requirement of the field. For example, geo-variables derived from EO data may be assimilated into process models (ocean, hydrological, weather, climate, etc.), and the assimilation requires the probability distribution of the estimated variables.

    8 Conclusion and outlook

    8.1 Conclusion—how well do the current uncertainty quantification methods work for real world applications?

    Even though many advances in uncertainty quantification for neural networks have been made over the last years, their adoption in practical mission- and safety-critical applications is still limited. There are several reasons for this, which are discussed one by one in the following:

    • Missing validation of existing methods over real-world problems 

      Although DNNs have become the de facto standard in solving numerous computer vision and medical image processing tasks, the majority of existing models are not able to appropriately quantify the uncertainty that is inherent to their inferences, particularly in real world applications. This is primarily because the baseline models are mostly developed using standard data sets such as CIFAR10/100, ImageNet, or well-known regression data sets that are specific to a particular use case and are therefore not readily applicable to complex real-world environments, such as low-resolution satellite data or other data sources affected by noise. Although many researchers from other fields apply uncertainty quantification in their domain (Rußwurm et al. 2020; Loquercio et al. 2020; Choi et al. 2019), a broad and structured evaluation of existing methods on different real world applications is not available yet. Works such as Gustafsson et al. (2020) have already taken a first step towards a real-life evaluation.

    • Lack of standardized evaluation protocol 

      Existing methods for evaluating the estimated uncertainty are better suited to compare uncertainty quantification methods based on measurable quantities such as the calibration (Nado et al. 2021) or the performance on OOD detection (Malinin and Gales 2018). As described in Sect. 6, these tests are performed on standardized data sets within the machine learning community. Furthermore, the experimental details might differ from paper to paper (Mukhoti et al. 2018). However, a clear standardized protocol of tests that should be performed on uncertainty quantification methods is still not available. For researchers from other domains, it is difficult to directly identify state-of-the-art methods for the field they are interested in, let alone to decide on which sub-field of uncertainty quantification to focus. This makes the direct comparison of the latest approaches difficult and also limits the acceptance and adoption of currently existing methods for uncertainty quantification.

    • Inability to evaluate the uncertainty associated with a single decision 

      Existing measures for evaluating the estimated uncertainty (e.g., the expected calibration error) are based on the whole testing data set. This means that, analogous to classification tasks on unbalanced data sets, the uncertainty associated with single samples or small groups of samples may get biased towards the performance on the rest of the data set. For practical applications, however, assessing the reliability of an individual predicted confidence would offer far more possibilities than an aggregated reliability computed on some test data that is independent of the current situation (Kull and Flach 2014). Especially for mission- and safety-critical applications, pointwise evaluation measures could be of paramount importance and hence such evaluation approaches are very desirable.

    • Lack of ground truth uncertainties 

      Current methods are evaluated empirically, and their performance is underlined by reasonable and explainable uncertainty values. A ground truth uncertainty that could be used for validation is in general not available. Additionally, even though existing methods are calibrated on given data sets, one cannot simply transfer these results to any other data set, since one has to be aware of shifts in the data distribution and of the fact that many fields can only cover a tiny portion of the actual data environment. In application fields such as EO, the preparation of a huge amount of training data is hard and expensive, and hence synthetic data can be used to train a model. For such artificial data, artificial uncertainties in labels and data should be taken into account to gain a better understanding of the uncertainty quantification performance. The gap between real and synthetic data, or between estimated and real uncertainty, further limits the adoption of currently existing methods for uncertainty quantification.

    • Explainability issue  

      Existing methods for neural network uncertainty quantification deliver certainty estimates without any indication of what causes possible uncertainties. Even though those certainty values often look reasonable to a human observer, one does not know whether the uncertainties are actually predicted based on the same observations the human observer made. Without being sure about the reasons and motivations behind single uncertainty estimates, a proper transfer from one data set to another, or even only a domain shift, is much harder to realize with guaranteed performance. Regarding safety-critical real-life applications, the lack of explainability makes the application of the available methods significantly harder. Besides the explainability of neural network decisions, existing methods for uncertainty quantification are not well understood on a higher level. For instance, explaining the behavior of single deterministic approaches, ensembles, or Bayesian methods is a current direction of research and remains difficult to grasp in every detail (Fort et al. 2019). It is, however, crucial to understand how those methods operate and capture uncertainty in order to identify pathways for refinement and to detect and characterize uncertainty, failures, and important shortcomings (Fort et al. 2019).

    8.2 Outlook

    • Generic evaluation framework

      As already discussed above, there are still problems regarding the evaluation of uncertainty methods, such as the lack of ’ground truth’ uncertainties, the inability to test on single instances, and missing standardized benchmarking protocols. To cope with such issues, the provision of an evaluation protocol containing various concrete baseline data sets and evaluation metrics that cover all types of uncertainty would undoubtedly help to boost research in uncertainty quantification. The evaluation with regard to risk-averse and worst-case scenarios should also be considered there. This means that predictions made with a very high predicted certainty should never fail, such as the prediction of a red or green traffic light. Such a general protocol would enable researchers to easily compare different types of methods against an established benchmark as well as on real world data sets. The adoption of such a standard evaluation protocol should be encouraged by conferences and journals.

    • Expert & systematic comparison of baselines 

      A broad and structured comparison of existing methods for uncertainty estimation on real world applications is not available yet. An evaluation on real world data is not even standard in current machine learning research papers. As a result, given a specific application, it remains unclear which method for uncertainty estimation performs best and whether the latest methods outperform older ones on real world examples as well. This is partly caused by the fact that researchers from other domains who use uncertainty quantification methods generally present successful applications of single approaches to the specific problem or data set at hand. Considering this, there are several points that could be adopted for a better comparison within the different research domains. For instance, domain experts should also compare different approaches against each other and point out the weaknesses of single approaches in their domain. Similarly, for a better comparison among several domains, the works from the different real world domains could be collected and exchanged on a central platform. Such a platform might also help machine learning researchers by providing an additional source of challenges from the real world and would pave the way to broadly highlighting weaknesses in the current state-of-the-art approaches. Google’s repository on baselines in uncertainties in neural networks (Nado et al. 2021)Footnote 5 could be such a platform and a step towards achieving this goal.

    • Uncertainty ground truths

      It remains difficult to validate existing methods due to the lack of uncertainty ground truths. An actual uncertainty ground truth, on which methods could be compared in an ImageNet-like manner, would make the evaluation of predictions on single samples possible. To reach this, the data generation process and the sources of uncertainty occurring in it, such as the labeling process, might be investigated in more detail.

    • Explainability and physical models

      Knowing the actual reasons for a falsely high certainty or a low certainty makes it much easier to engineer the methods for real life applications, which in turn increases people’s trust in such methods. Recently, Antorán et al. (2020) claimed to have published the first work on explainable uncertainty estimation. Uncertainty estimates, in general, form an important step towards explainable artificial intelligence. Explainable uncertainty estimates would give an even deeper understanding of the decision process of a neural network, which, for the practical deployment of DNNs, should incorporate the desired ability to be risk-averse while remaining applicable in the real world (especially in safety-critical applications). The possibility of improving explainability with physically based arguments also offers great potential. While DNNs are very flexible and efficient, they do not directly embed the domain-specific expert knowledge that is mostly available and can often be described by mathematical or physical models, for example in earth system science problems (Reichstein et al. 2019). Such physics-guided models offer a variety of possibilities to include explicit knowledge as well as practical uncertainty representations into a deep learning framework (Willard et al. 2020; De Bézenac et al. 2019).