1 Introduction

Having attracted great attention in both academia and the digital economy, deep neural networks (DNNs, Goodfellow et al. (2016)) are about to become vital components of safety-critical applications. Examples include autonomous driving (Bojarski et al.,

  • by deriving a novel and surprisingly simple Wasserstein-based learning objective for sub-networks that simultaneously optimizes task performance and uncertainty quality,

  • by conducting an extensive empirical evaluation where W-dropout outperforms state-of-the-art uncertainty techniques w.r.t. various benchmark metrics, not only in-data but also under data shifts,

  • and by introducing two novel uncertainty measures: a non-saturating calibration score and a measure for distributional tails that allows analyzing worst-case scenarios w.r.t. uncertainty quality.

The remainder of the paper is organized as follows: first, we present related work on uncertainty estimation in neural networks in Sect. 2. Next, Wasserstein dropout is introduced in Sect. 3. We study the uncertainties induced by Wasserstein dropout on various datasets in Sect. 4, paying special attention to safety-relevant evaluation schemes and metrics. An outlook in Sect. 5 concludes the paper.

    2 Related work

    Approaches to estimate predictive uncertainties can be broadly categorized into three groups: Bayesian approximations, ensemble approaches and parametric models.

Monte Carlo dropout (Gal and Ghahramani, 2016) is a prominent representative of the first group. It offers a Bayesian motivation, conceptual simplicity and scalability to application-size neural networks (NNs). This combination distinguishes MC dropout from other Bayesian neural network (BNN) approximations like those in Blundell et al. (2015) and Ritter et al. (2020). A computationally more efficient version of MC dropout is one-layer or last-layer dropout (see e.g. Kendall and Gal (2017)). Alternatively, analytical moment propagation allows sampling-free MC-dropout inference at the price of additional approximations (e.g. Postels et al. (2019)). Further extensions of MC dropout target improved performance by learning layer-specific drop rates using Concrete distributions (Gal et al., 2017), by integrating aleatoric uncertainty via a parametric approach (Kendall and Gal, 2017), and by employing input-dependent dropout distributions (Fan et al., 2021). Note that dropout training is also used, independently of any uncertainty context, for better model generalization (Srivastava et al., 2014). An alternative sampling-based approach is SWAG, which constructs a Gaussian model weight distribution from the (last segment of the) training trajectory (Maddox et al., 2019).

Ensembles of neural networks, so-called deep ensembles (Lakshminarayanan et al., 2017), are another popular approach to uncertainty modeling. Comparative studies of uncertainty mechanisms (Gustafsson et al., 2020; Snoek et al., 2019) highlight their advantageous uncertainty quality, making deep ensembles a state-of-the-art method. Fort et al. (2019) argue that ensembles capture the multi-modality of loss landscapes, thus yielding potentially more diverse sets of solutions. When used in practice, these ensembles additionally include parametric uncertainty prediction for each of their members.

The third group are the aforementioned parametric modeling approaches that extend point estimates by an additional model output that is interpreted as variance or covariance (Heskes, 1996; Nix and Weigend, 1994). Typically, these approaches optimize a (Gaussian) negative log-likelihood (NLL, Nix and Weigend (1994)) and can be easily integrated with other approaches; for a review see Khosravi et al. (2011). A more recent representative of this group is deep evidential regression (Amini et al., 2020), which places a prior distribution on the Gaussian parameters. A closely related model class is deep kernel learning. It approaches uncertainty modeling by combining NNs and Gaussian processes (GPs) in various ways, e.g., via an additional layer (Iwata and Ghahramani, \(f_{\theta }\) yielding for each input \(x_i\) a distribution \({\mathcal {D}}_{\tilde{\theta }}(x_i)\) over network predictions. During MC dropout inference, the final prediction is given by the mean of a sample from \({\mathcal {D}}_{\tilde{\theta }}(x_i)\), while the uncertainty associated with this prediction can be estimated as the sum of its variance and a constant uncertainty offset. The value of the latter term requires dataset-specific optimization. During MC dropout training, minimizing the objective function, e.g., the mean squared error (MSE), shifts all sub-network predictions towards the same training targets. For a more formal explanation of this behavior, and without loss of generality, let \(f_{\theta }\) be an NN with one-dimensional output. The expected MSE for a training sample \((x_i,y_i)\) under the model’s output distribution \({\mathcal {D}}_{\tilde{\theta }}(x_i)\) is given by

    $$\begin{aligned} E_{\tilde{\theta }}\left[ (f_{\tilde{\theta }}(x_i) - y_i)^2 \right] = \left( \mu _{\tilde{\theta }}(x_i) - y_i\right) ^2 + \sigma _{\tilde{\theta }}^2(x_i)\ , \end{aligned}$$
    (1)

    with sub-network mean \(\mu _{\tilde{\theta }}(x_i) =E_{\tilde{\theta }}[f_{\tilde{\theta }}(x_i)]\) and variance \(\sigma _{\tilde{\theta }}^2(x_i) = E_{\tilde{\theta }}[f_{\tilde{\theta }}^2(x_i)] - E_{\tilde{\theta }}[ f_{\tilde{\theta }}(x_i) ]^2\). Therefore, training simultaneously minimizes the squared error between sub-network mean \(\mu _{\tilde{\theta }}(x_i)\) and target \(y_i\) as well as the variance \(\sigma ^2_{\tilde{\theta }}(x_i)\).
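
As a quick numerical sanity check of Eq. (1), the following minimal sketch (an illustration with an assumed Gaussian stand-in for \({\mathcal {D}}_{\tilde{\theta }}(x_i)\); the concrete values are arbitrary) verifies that the expected squared error under the output distribution decomposes into squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, y = 1.3, 0.7, 2.0              # sub-network mean / std and training target
preds = rng.normal(mu, sigma, 1_000_000)  # samples standing in for f_theta~(x_i)

lhs = np.mean((preds - y) ** 2)           # Monte Carlo estimate of E[(f - y)^2]
rhs = (mu - y) ** 2 + sigma ** 2          # squared bias + variance, cf. Eq. (1)
print(lhs, rhs)                           # both approx. 0.98
```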

As we, in contrast, seek to employ sub-networks to model aleatoric uncertainty, minimizing the variance over the sub-networks is not desirable for our purpose. Instead, we aim at explicitly fitting the sub-network variance \(\sigma ^2_{\tilde{\theta }}(x_i)\) to the input-dependent, i.e. heteroscedastic, data variance. That is to say, we not only match the mean values as in (1) but seek to match the entire data distribution \({\mathcal {D}}_y(x_i)\) by means of the model’s output distribution \({\mathcal {D}}_{\tilde{\theta }}(x_i)\). This output distribution is induced by applying Bernoulli dropout to all activations of the network. The matching is technically realized by minimizing a distance measure between the two distributions \({\mathcal {D}}_{\tilde{\theta }}(x_i)\) and \({\mathcal {D}}_y(x_i)\). While, in principle, various distances could be used, we require two properties: (i) the distance needs to be non-saturating, i.e. it needs to grow monotonically and unboundedly with the actual mismatch between the distributions. This is desirable as, for safety reasons, we want to penalize strong mismatches. (ii) The distance needs to have a simple, closed form, which is required for the subsequent, bootstrap-inspired approximations (see below). The (squared) 2-Wasserstein distance (Villani, 2008) fulfills both of these propertiesFootnote 1 and is therefore employed in the following. Assuming that both distributions \({\mathcal {D}}_{\tilde{\theta }}(x_i)\) and \({\mathcal {D}}_y(x_i)\) are GaussianFootnote 2 then yields a compact analytical expression

    $$\begin{aligned} \text {WS}_2^2(x_i)&=\mathrm {WS}^2_2\left[ {\mathcal {D}}_{\tilde{\theta }}(x_i), {\mathcal {D}}_y(x_i)\right] \nonumber \\&=\mathrm {WS}^2_2\left[ {\mathcal {N}}(\mu _{\tilde{\theta }}(x_i), \sigma _{\tilde{\theta }}(x_i)), {\mathcal {N}}(\mu _y(x_i), \sigma _y(x_i))\right] \nonumber \\&=\left( \mu _{\tilde{\theta }} (x_i) - \mu _{y}(x_i)\right) ^2+\left( \sigma _{\tilde{\theta }}(x_i) - \sigma _{y}(x_i)\right) ^2 \,, \end{aligned}$$
    (2)

    with \(\mu _{\tilde{\theta }}(x_i) = E_{\tilde{\theta }}[f_{\tilde{\theta }}(x_i)]\) and \(\sigma _{\tilde{\theta }}^2(x_i) = E_{\tilde{\theta }}[(f_{\tilde{\theta }}(x_i) - E_{\tilde{\theta }}[f_{\tilde{\theta }}(x_i)])^2]\), and \(\mu _y,\sigma _y\) defined analogously w.r.t. the data distribution.
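
For one-dimensional Gaussians, Eq. (2) can be evaluated directly; the small helper below (a sketch, with hypothetical function and argument names) makes the closed form explicit:

```python
def ws2_squared_gaussian(mu_p: float, sigma_p: float,
                         mu_q: float, sigma_q: float) -> float:
    """Squared 2-Wasserstein distance between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2), cf. Eq. (2)."""
    return (mu_p - mu_q) ** 2 + (sigma_p - sigma_q) ** 2

print(ws2_squared_gaussian(0.0, 1.0, 0.0, 1.0))  # 0.0 for identical distributions
print(ws2_squared_gaussian(1.0, 2.0, 0.0, 1.0))  # 2.0: mean and spread each contribute 1.0
```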

In practice however, (2) cannot be readily used as the distribution of \(y\) given \(x_i\) is typically not accessible. Instead, for a given, fixed value of \(x_i\) from the training set, only a single value \(y_i\) is known. Therefore, we take \(y_i\) as a (rough) one-sample approximation of the mean \(\mu _y(x_i)\), resulting in \(\mu _y(x_i) \approx y_i\) and \(\sigma _y^2(x_i) \approx E_y[(y - y_i)^2]\). However, \(\sigma _y^2(x_i)\) cannot be inferred from a single sample. Inspired by parametric bootstrapping (Dekking et al., 2005; Hastie et al., 2009), we therefore approximate the empirical data variance (for a given mean value \(y_i\) and input \(x_i\)) with samples from our model, i.e., we approximate \(E_y[(y - y_i)^2]\) by

    $$\begin{aligned} E_{\tilde{\theta }}[(f_{\tilde{\theta }}(x_i) - y_i)^2] = (\mu _{\tilde{\theta }}(x_i)-y_i)^2 + \sigma _{\tilde{\theta }}^2(x_i). \end{aligned}$$
    (3)

Inserting our approximations \(\mu _y(x_i) \approx y_i\) and \(\sigma _y^2(x_i) \approx (\mu _{\tilde{\theta }}(x_i)-y_i)^2 + \sigma _{\tilde{\theta }}^2(x_i)\) into (2) yields the Wasserstein dropout loss (W-dropout) for a data point \((x_i,y_i)\) from the training distribution:

$$\begin{aligned} \text {WS}_2^2(x_i) \approx (\mu _{\tilde{\theta }}(x_i)-y_i)^2 + \left[ \sqrt{\sigma _{\tilde{\theta }}^2(x_i)} -\sqrt{(\mu _{\tilde{\theta }}(x_i)-y_i)^2 + \sigma _{\tilde{\theta }}^2(x_i)} \right] ^2\,. \end{aligned}$$
    (4)

Considering a mini-batch of size M instead of a single data point, we arrive at the optimization objective \(\text {WS}_{\mathrm{batch}}^2 = \frac{1}{M}\sum _{i=1}^M\text {WS}_2^2(x_i)\). In practice, \(\mu _{\tilde{\theta }}(x_i) \approx \frac{1}{L}\sum _{l=1}^L f_{\tilde{\theta }_l}(x_i)\) and \(\sigma _{\tilde{\theta }}^2(x_i) \approx \frac{1}{L}\sum _{l=1}^L f_{\tilde{\theta }_l}^2(x_i) - (\frac{1}{L}\sum _{l=1}^L f_{\tilde{\theta }_l}(x_i))^2\) are approximated by empirical estimators using a sample size L. In contrast to MC dropout, we thereby require L stochastic forward passes per data point during training (instead of one), while the inference procedure is exactly the same.
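
To make the objective concrete, the following PyTorch sketch implements Eq. (4) with the empirical estimators above. It is a minimal illustration, not the reference implementation: the function name, the small \(\epsilon\) added for numerical stability inside the square roots, and the example network are assumptions; the hyperparameters (\(L=5\), drop rate \(p=0.1\), two hidden layers of 100 units) follow Sect. 4.

```python
import torch
import torch.nn as nn

def wasserstein_dropout_loss(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                             L: int = 5, eps: float = 1e-8) -> torch.Tensor:
    """Batch objective WS^2_batch for inputs x of shape (M, d_in) and targets y of shape (M, 1)."""
    model.train()                                              # keep dropout active during the L passes
    preds = torch.stack([model(x) for _ in range(L)], dim=0)   # (L, M, 1) sub-network predictions
    mu = preds.mean(dim=0)                                     # empirical sub-network mean
    var = preds.var(dim=0, unbiased=False)                     # empirical sub-network variance
    sq_err = (mu - y) ** 2
    # Eq. (4); eps is a small stabilizer for the square roots (implementation detail, not part of Eq. (4))
    per_sample = sq_err + (torch.sqrt(var + eps) - torch.sqrt(sq_err + var + eps)) ** 2
    return per_sample.mean()                                   # average over the mini-batch

# Usage sketch with the fully connected setup described in Sect. 4.1:
model = nn.Sequential(nn.Linear(8, 100), nn.ReLU(), nn.Dropout(p=0.1),
                      nn.Linear(100, 100), nn.ReLU(), nn.Dropout(p=0.1),
                      nn.Linear(100, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss = wasserstein_dropout_loss(model, x, y)
loss.backward()
optimizer.step()
```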

Besides the regression tasks considered here, our approach could be useful for other objectives that use or benefit from an underlying distribution, e.g., Dirichlet distributions to quantify uncertainty in classification, as discussed in the conclusion.

    4 Experiments

We first outline the scope of our empirical study in Sect. 4.1 and begin with experiments on illustrative and visualizable toy datasets in Sect. 4.2. Next, we benchmark W-dropout on various 1D datasets (mostly from the UCI machine learning repository (Dua and Graff, 2017)) in Sect. 4.3, considering both in-data and distribution-shift scenarios. In Sect. 4.4, W-dropout is applied to the complex task of object detection using the compact SqueezeDet architecture (Wu et al., 2017).

    4.1 Benchmark approaches and evaluation measures

In this subsection, we present the considered benchmark approaches (first paragraph) and evaluation measures for uncertainty modeling. Besides established measures (second paragraph), we propose two novel uncertainty scores: an unbounded calibration measure and an uncertainty tail measure for the analysis of worst-case scenarios w.r.t. uncertainty quality (third and fourth paragraph). A brief overview of the technical setup (last paragraph) concludes the subsection.

    Benchmark approaches

We compare W-dropout networks to archetypes of uncertainty modeling, namely approximate Bayesian techniques, parametric uncertainty, and ensembling approaches. From the first group, we pick MC dropout (abbreviated as MC, Gal and Ghahramani (2016)) and Concrete dropout (CON-MC, Gal et al. (2017)). The variance of MC is given as the sample variance plus a dataset-specific regularization term. The networks employing these methods do not exhibit parametric uncertainty outputs (see below). We additionally consider SWA-Gaussian (SWAG, Maddox et al. (2019)), which samples from a Gaussian model weight distribution that is constructed based on model parameter configurations along the (final segment of the) training trajectory. While these sampling-based approaches integrate uncertainty estimation into the structure of the entire network, parametric approaches model the variance directly as an output of the neural network (Nix and Weigend, 1994). Such networks typically output mean and variance of a Gaussian distribution \((\mu , \sigma ^2)\) and are trained by likelihood maximization. This approach is denoted as PU for parametric uncertainty. Ensembles of PU-networks (Lakshminarayanan et al., 2017), referred to as deep ensembles, are a widely used state-of-the-art method for uncertainty estimation (Snoek et al., 2019). Deep evidential regression (PU-EV, Amini et al. (2020)) extends this parametric approach and considers prior distributions over \(\mu\) and \(\sigma\). Kendall and Gal (2017) consider drawing multiple dropout samples from a parametric uncertainty model and aggregating the resulting predictions for \(\mu\) and \(\sigma\); we denote this approach PU-MC. Moreover, we consider ensembles of non-parametric standard networks, which we refer to as DEs, while ensembles whose members additionally provide PU-based uncertainty are called PU-DEs. All considered types of networks provide estimates \((\mu _i,\sigma _i)\) where \(\sigma _i\) is obtained either as direct network output (PU, PU-EV), by sampling (MC, CON-MC, SWAG, W-dropout) or as an ensemble aggregate (DE, PU-DE). For PU-MC, a combination of parametric output and sampling is employed. Throughout this section, we subsume PU, PU-EV, PU-DE and PU-MC as “parametric methods”.

    Standard evaluation measures

In all experiments, we evaluate both regression performance and uncertainty quality. Regression performance is quantified by the root-mean-square error \(\sqrt{1/N\,\sum _i (\mu _i-y_i)^2 }\) (RMSE, Bishop (2006)). Another established metric in the uncertainty community is the (Gaussian) negative log-likelihood (NLL), \(1/N \sum _i \left( \log \sigma _i +(\mu _i - y_i)^2/(2 \sigma _i^2) + c \right)\), a hybrid between performance and uncertainty measure (Gneiting and Raftery, 2007), see Appendix C.2 for a discussion. Throughout the paper, we ignore the constant \(c=\log \sqrt{2\pi }\) of the NLL. The expected calibration error (ECE, Kuleshov et al. (2018)), in contrast, is not biased towards well-performing models and is in that sense a pure uncertainty measure. It reads ECE \(= \sum _{j=1}^B \vert \tilde{p}_j - 1/B\vert\) for B equally spaced bins in quantile space and \(\tilde{p}_j = \vert \{r_i \vert q_j \le \tilde{q}(r_i) < q_{j+1}\}\vert /N\), the empirical frequency of data points falling into such a bin. The normalized prediction residuals \(r_i\) are defined as \(r_i = (\mu _i - y_i)/\sigma _i\). Further, \(\tilde{q}\) is the cdf of the standard normal distribution \({\mathcal {N}}(0,1)\) and \([q_j,q_{j+1})\) are equally spaced intervals on [0, 1], i.e., \(q_j=(j-1)/B\).
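
The ECE variant defined above can be computed in a few lines; the sketch below (the function name and the bin count B = 10 are illustrative choices) bins the standard-normal CDF values of the normalized residuals and compares the empirical bin frequencies to the ideal 1/B:

```python
import numpy as np
from scipy.stats import norm

def expected_calibration_error(mu, sigma, y, n_bins: int = 10) -> float:
    r = (mu - y) / sigma                            # normalized prediction residuals
    q = norm.cdf(r)                                 # map residuals to quantile space [0, 1]
    freq, _ = np.histogram(q, bins=np.linspace(0.0, 1.0, n_bins + 1))
    return float(np.abs(freq / len(r) - 1.0 / n_bins).sum())

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 10_000)
print(expected_calibration_error(np.zeros_like(y), np.ones_like(y), y))        # near 0: well calibrated
print(expected_calibration_error(np.zeros_like(y), 0.2 * np.ones_like(y), y))  # large: variance underestimated
```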

    An unbounded uncertainty calibration measure

A desirable property of an uncertainty measure is a signal that grows (preferably linearly) with the misalignment between predicted and ideal uncertainty estimates, especially when handling strongly deviating uncertainty estimates. As the Wasserstein metric fulfills this property, we not only use it for model optimization but propose to consider the 1-Wasserstein distance of normalized prediction residuals (WS) as a complementary uncertainty evaluation measure. It is generally applicable and by no means restricted to W-dropout networks. In detail, the 1-Wasserstein distance (Villani, 2008), also known as earth mover’s distance (Rubner et al., 1998), is a transport-based measure, denoted by \(d_{\mathrm{WS}}\), between two probability densities, with Wasserstein GANs (Arjovsky et al., 2017) as its most prominent application in machine learning. In the context of uncertainty estimation, we use the Wasserstein distance to measure deviations of uncertainty estimates \(\{r_i\}_i\) from ideal (Gaussian)Footnote 3 calibration, which is given if \(y_i \sim {\mathcal {N}}(\mu _i,\sigma _i)\) with accompanying normalized residuals \(r_i \sim {\mathcal {N}}(0,1)\), i.e. we calculate \(d_{\mathrm{WS}}\left( \{r_i\}_i,{\mathcal {N}}(0,1)\right)\). Like ECE, this is a pure uncertainty measure. However, it is not based on quantiles but directly on normalized residuals and can therefore resolve deviations on all scales. For example, two strongly but differently ill-calibrated uncertainty estimators would result in (almost) identical ECE values, while WS would resolve the difference in magnitude. Let us compare ECE and WS more systematically: we consider normal distributions \({\mathcal {N}}(\mu , 1)\) and \({\mathcal {N}}(0, \sigma )\) (see Fig. 2) that are shifted (top left panel, dark blue) and squeezed/stretched (bottom left panel, dark blue), respectively. Their deviations from the ideal normalized residual distribution (the standard normal, red) are measured in terms of both ECE (r.h.s., blue) and WS (r.h.s., orange). For large values of \(\vert \mu \vert\) and \(\sigma\), ECE is bounded while WS increases linearly, showing the better sensitivity of the latter towards strong deviations. For small values, \(\sigma \rightarrow 0\), ECE takes its maximum value and WS a value of 1. In Fig. 3, we visualize these value pairs (WS\((\sigma )\), ECE\((\sigma )\)) (gray lines), i.e. \(\sigma\) serves as curve parameter. The upper ‘branch’ corresponds to \(0<\sigma <1\), the lower ‘branch’ to \(\sigma > 1\). For comparison, the pairs (WS, ECE) of various networks trained on standard regression datasets are visualized (see Sect. 4.3 for experimental details and results). They approximately follow the theoretical \(\sigma\)-curve, emphasizing that both under- and overestimating the variance is of practical relevance. Since WS does not saturate for underestimated variances, a given WS value allows these two cases to be distinguished more easily than a given ECE value. While one might rightfully argue that the higher sensitivity of WS leads to a certain susceptibility to potential outliers, this can be addressed by regularizing the normalized residuals or by filtering extreme outliers.
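
A sketch of this WS measure is given below. The reference \({\mathcal {N}}(0,1)\) is represented by its quantiles at midpoints, a standard approximation of the 1-Wasserstein distance between an empirical sample and a continuous distribution; the function name and this particular discretization are implementation choices, not prescribed by the text above.

```python
import numpy as np
from scipy.stats import norm

def ws_calibration(mu, sigma, y) -> float:
    """1-Wasserstein distance between normalized residuals and the standard normal."""
    r = np.sort((mu - y) / sigma)                   # sorted normalized residuals
    u = (np.arange(len(r)) + 0.5) / len(r)          # quantile midpoints
    return float(np.mean(np.abs(r - norm.ppf(u))))  # quantile-based 1D W1 approximation

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 10_000)
print(ws_calibration(np.zeros_like(y), np.ones_like(y), y))        # close to 0: well calibrated
print(ws_calibration(np.zeros_like(y), 0.2 * np.ones_like(y), y))  # large: WS grows with underestimation
```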

    Fig. 2

    Comparison of the proposed Wasserstein-based measure (WS) and the expected calibration error (ECE). We measure the deviation between a standard normal distribution \({\mathcal {N}}(0,1)\) (lhs, red) and shifted normal distributions \({\mathcal {N}}(\mu , 1)\) (top left, dark blue) and squeezed/stretched normal distributions \({\mathcal {N}}(0, \sigma )\) (bottom left, dark blue), respectively. The resulting ECE values (orange) and WS values (blue) on the rhs emphasize the higher sensitivity of WS in case of large distributional differences. For details on ECE and WS, see text (Color figure online)

    Fig. 3

    Dependency between the Wasserstein-based measure and the expected calibration error for Gaussian toy data (gray curves) and for 1D standard datasets (point cloud, see Sect. 4.3 for details). The toy curves are obtained by plotting (WS\((\sigma )\), ECE\((\sigma )\)) from Fig. 2 (bottom right). For 1D standard datasets, uncertainty methods are encoded via plot markers, data splits via color. Datasets are not encoded and cannot be distinguished (see Appendix C for more details). Each plot point corresponds to a cross-validated trained network (Color figure online)

    A novel uncertainty tail measure

We furthermore introduce a measure for distributional tails that allows analyzing worst-case scenarios w.r.t. uncertainty quality, thus reflecting safety considerations. Such potentially critical worst-case scenarios are signified by the above-mentioned outliers, where the locally predicted uncertainty strongly underestimates the actual model error. A better understanding of uncertainty estimates in these scenarios might allow determining lower bounds on the operation quality of safety-critical systems. For this, we consider normalized residuals \(r_i=(\mu _i-y_i)/\sigma _i\) based on the prediction estimates \((\mu _i,\sigma _i)\) for a given data point \((x_i,y_i)\). As stated, we restrict our analysis to uncertainty estimates that underestimate model errors, i.e., \(\vert r_i\vert \gg 1\). These cases might be more harmful than overly large uncertainties, \(\vert r_i\vert \ll 1\), which likely trigger a conservative system behavior. We quantify uncertainty quality for worst-case scenarios as follows: for a given (test) dataset, the absolute normalized residuals \(\{\vert r_i\vert \}_i\) are calculated. We determine the \(99\%\) quantile \(q_{0.99}\) of this set and calculate the mean value over all \(\vert r_i\vert > q_{0.99}\), the so-called expected tail loss at quantile \(99\%\) (\(\text {ETL}_{0.99}\), Rockafellar and Uryasev (2002)). The ETL\(_{0.99}\) thus measures the average uncertainty quality of the worst \(1\%\).
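
The computation of ETL\(_{0.99}\) is straightforward; the sketch below (with an illustrative function name) averages the absolute normalized residuals above their 99% quantile:

```python
import numpy as np

def expected_tail_loss(mu, sigma, y, quantile: float = 0.99) -> float:
    """Mean of the absolute normalized residuals above their `quantile` threshold (ETL)."""
    abs_r = np.abs((mu - y) / sigma)
    return float(abs_r[abs_r > np.quantile(abs_r, quantile)].mean())

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 100_000)
# For ideally calibrated Gaussian residuals, ETL_0.99 is roughly 2.9.
print(expected_tail_loss(np.zeros_like(y), np.ones_like(y), y))
```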

    Technical setup

For the toy datasets and the 1D standard datasets, we use almost identical setups of 2 hidden layers with ReLU activations, using 50 neurons per layer for the toy datasets and 100 for the 1D standard datasets. All dropout-based networks (MC, CON-MC, W-dropout) apply Bernoulli dropout to all hidden activations. For W-dropout networks, we sample \(L = 5\) sub-networks in each optimization step; other values of L are considered in Appendix B. On the smaller toy datasets, we afford \(L=10\). For MC and W-dropout, the drop rate is set to \(p = 0.1\) (see Appendix B for other values of p). The drop rate of CON-MC, in contrast, is learned during training and (mostly) takes values between \(p=0.2\) and \(p=0.5\). For ensemble methods (DE, PU-DE) we employ 5 networks. All NNs are optimized using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. Additionally, we apply standard normalization to the input and output features of all datasets to enable better comparability. The number of training epochs and cross-validation runs depends on the dataset size. Further technical details on the networks, the training procedure, and the implementation of the uncertainty methods can be found in Appendix A.1. By using least-squares regression, we make the standard assumption that errors follow a Gaussian distribution. This is reflected in the (standard) definitions of the above-named measures, i.e., all uncertainty measures quantify the set of outputs \(\{(\mu _i, \sigma _i)\}\) relative to a Gaussian distribution.

    4.2 Toy datasets

    Fig. 4

    Comparison of uncertainty approaches (columns) on two 1D toy datasets: a noisy one (top) and a high-frequency one (bottom). Test data ground truth (respective first row) is shown with mean estimates (resp. second row) and standard deviations (resp. third row). The light green dashed curve (third row) indicates the ground truth uncertainty. Similar uncertainty approaches (columns) are grouped together, W-dropout is highlighted by a yellow frame (Color figure online)

To illustrate qualitative behaviors of the different uncertainty techniques, we consider two \({\mathbb {R}}\rightarrow {\mathbb {R}}\) toy datasets. This benchmark puts a special focus on the handling of aleatoric heteroscedastic uncertainty. The first dataset is Gaussian white noise with an x-dependent amplitude, see the first row of Fig. 4. The second dataset is a polynomial overlaid with a high-frequency, amplitude-modulated sine, see the fourth row of Fig. 4. The explicit equations for the toy datasets used here can be found in Appendix A.2.

While the uncertainty in the first dataset (‘toy-noise’) is clearly visible, it is less obvious for the fully deterministic second dataset (‘toy-hf’). There is, however, an effective uncertainty due to the limited expressivity of the model, as the shallow networks employed are empirically not able to fit (all) fluctuations of ‘toy-hf’ (see fifth row of Fig. 4). One might (rightfully) argue that this is a sign of insufficient model capacity. But in more realistic, e.g., higher-dimensional and sparser, datasets, the distinction between true noise and complex information becomes exceedingly difficult to make, and regularization is actively used to suppress the modeling of (ideally) undesired fluctuations. As the Nyquist-Shannon sampling theorem states, deterministic fluctuations above a cut-off frequency can no longer be resolved with limited data (Landau, 1967). They therefore become virtually indistinguishable from random noise.

The mean estimates of all uncertainty methods (second and fifth row in Fig. 4) look alike on both datasets. They approximate the noise mean and the polynomial, respectively. In the latter case, all methods rudimentarily fit some individual fluctuations. The variance estimation (third and sixth row in Fig. 4), in contrast, reveals significant differences between the methods: MC dropout variants and other non-parametric ensembles are not capable of capturing heteroscedastic aleatoric uncertainty. This behavior of MC is expected as it was primarily introduced to account for model uncertainty. The non-parametric DE is effectively optimized in a similar fashion. In contrast, NLL-optimized PU networks have a home-turf advantage on these datasets since the parametric variance is explicitly optimized to account for the present heteroscedastic aleatoric uncertainty. W-dropout is the only non-parametric approach that accounts for the presence of this kind of uncertainty. While the results look similar, the underlying mechanisms are fundamentally different: on the one hand, explicit prediction of the uncertainty; on the other hand, implicit modeling via distribution matching. Accompanying quantitative evaluations can be found in Table 7 in Appendix A.2. To collect further evidence that W-dropout approximates the ground-truth uncertainty \(\sigma _{\mathrm{true}}\) appropriately, we fit it to ‘noisy line’ toy datasets in Appendix A.2. Both large and small \(\sigma _{\mathrm{true}}\) values are correctly matched, indicating that W-dropout is not just adding an uncertainty offset but flexibly spreads/contracts its sub-networks as intended. In the following, we substantiate these encouraging results of W-dropout on toy data with an empirical study on 1D standard datasets and an application to a modern object detection network.

    4.3 Standard 1D regression datasets

Next, we study standard regression datasets, extending the dataset selection in Gal and Ghahramani (2016) by adding four additional datasets: ‘diabetes’, ‘abalone’, ‘california’, and ‘superconduct’. Table 8 in Appendix A.3 provides details on dataset sources, preprocessing and basic statistics. Apart from train- and test-data results, we study regression performance and uncertainty quality under data shift. Such distributional changes and uncertainty quantification are closely linked since the latter is a rudimentary “self-assessment” mechanism that helps to judge model reliability. These judgements gain importance for model inputs that are structurally different from the training data.

    Data splits

Natural candidates for such non-i.i.d. splits are splits along the main directions of the data in input and output space, respectively. Here, we consider 1D regression tasks; output-based splits are therefore simply done on a scalar label variable (see Fig. 5, right). We call such a split label-based (for a comparable split, see, e.g., Foong et al. (2019)). In input space, the first component of a principal component analysis (PCA) provides a natural direction (see Fig. 5, left). Projecting the data points onto this first PCA-axis yields the scalar values the PCA-split is based on. Note that these projections are only considered for data splitting; they are not used for model training. Splitting data along such a direction in input or output space into, e.g., 10 equally large chunks creates 2 outer data chunks and 8 inner data chunks. Training a model on 9 of these chunks such that the remaining chunk for evaluation is an inner chunk is called data interpolation. If the remaining test chunk is an outer chunk, it is data extrapolation. For example, for labels running from 0 to 1, (label-based) extrapolation testing would consider only data with a label larger than 0.9, while training would be performed on the smaller label values. We introduce this distinction as extrapolation is expected to be considerably more difficult than ‘bridging’ between feature combinations that were seen during training; a sketch of both split types is given below.
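
A minimal sketch of the two split types, assuming generic arrays X and y (all names and the helper function are illustrative, not taken from the original implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

def chunk_split(scores: np.ndarray, n_chunks: int = 10, test_chunk: int = 9):
    """Sort by a scalar score, cut into equally large chunks, hold one chunk out for testing."""
    chunks = np.array_split(np.argsort(scores), n_chunks)
    test_idx = chunks[test_chunk]
    train_idx = np.concatenate([c for i, c in enumerate(chunks) if i != test_chunk])
    return train_idx, test_idx

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.normal(size=1000)

# PCA-based split: project onto the first principal component; holding out chunk 9
# (an outer chunk) corresponds to extrapolation.
pca_scores = PCA(n_components=1).fit_transform(X).ravel()
train_idx_pca, test_idx_pca = chunk_split(pca_scores, test_chunk=9)

# Label-based split: holding out chunk 5 (an inner chunk) corresponds to interpolation.
train_idx_lbl, test_idx_lbl = chunk_split(y, test_chunk=5)
```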

    Fig. 5

    Scheme of two non-i.i.d. splits: a PCA-based split in input space (left) and label-based split in output space (right). While datasets appear to be convex here, they are (most likely) not in reality

More general information on training and dataset-dependent modifications to the experimental setup is relegated to the technical Appendix A.1. The presented results are obtained as follows: for each of the 14 standard datasets, we calculate (for each uncertainty method) the per-dataset scores: RMSE, mean NLL, ECE and WS. To improve statistical significance, these scores are 5- or 10-fold cross-validated, i.e. they are averages across the respective number of folds. Given the (fold-averaged) per-dataset scores for all 14 standard datasets, we calculate and visualize their mean and median values as well as quantile intervals (see Figs. 6 and 7). For high-level summaries of the results on in-data and out-of-data test sets please refer to Table 1 and Table 2, respectively. While the mean values characterize the average behavior of the uncertainty methods, the displayed 75% quantiles indicate how well methods perform on the more challenging datasets. A small 75% quantile value thus hints at consistent stability of an uncertainty mechanism across a variety of tasks.

    Table 1 Regression performance (RMSE) and uncertainty quality (NLL, ECE, WS) of W-dropout and various uncertainty benchmarks. W-dropout yields the best uncertainty scores while providing a competitive RMSE value. Each number is the average across 14 standard 1D (test) datasets. The figures in this table correspond to the blue crosses in the second columns of Figs. 6 and 7, respectively. See text for further details
    Table 2 Out-of-data analysis of W-dropout and various uncertainty benchmarks. Regression performance (RMSE) and uncertainty quality (NLL, ECE, WS) are displayed. As for in-domain test data, W-dropout outperforms the other uncertainty methods without sacrificing regression quality. Each number is obtained by two-fold averaging: firstly, across two types of out-of-data test sets (label-based and PCA-based splits) and secondly, across 14 standard 1D datasets. The figures in this table are based on the blue crosses in the last four columns of Figs. 6 and 7, respectively. See text for further details
    Fig. 6

    Root-mean-square errors (RMSEs (\(\downarrow\)), top row) and expected calibration errors (ECEs (\(\downarrow\)), bottom row) of different uncertainty methods under i.i.d. conditions (first and second panel in each row) and under various kinds of data shift (third to sixth panel in each row, see text for details). W-dropout (light blue background) is compared to 8 benchmark approaches. Each blue cross is the mean over 14 1D regression datasets. Orange line markers indicate median values. The gray vertical bars reach from the 25% quantile (bottom horizontal line) to the 75% quantile (top horizontal line) (Color figure online)

    Regression quality

First, we consider regression performance, see Table 1 and the first two panels in the top row of Fig. 6. Averaging the RMSE values across the 14 datasets yields almost identical test results for all uncertainty methods (see Table 1). On training data (Fig. 6, first panel in top row), in contrast, we find the parametric methods to exhibit larger train-data RMSEs, which could be due to NLL optimization favoring adaptation of the variance rather than the mean. However, this regularizing NLL training comes along with a smaller generalization gap, leading to competitive test RMSEs (see Table 1 and the second panel in the top row of Fig. 6). W-dropout is on a par with the benchmark approaches, i.e. our optimization objective does not lead to degraded regression quality.Footnote 4 Next, we investigate model performance under data shift, visualized in the third to sixth panel in the top row of Fig. 6. For interpolation setups (fourth and sixth panel), regression quality is comparable between all methods. As expected, performances under these data shifts are (slightly) worse compared to those on i.i.d. test sets. The more challenging extrapolation setups (third and fifth panel) amplify the deterioration in performance across all methods. Again, W-dropout yields competitive RMSE values (see also Table 2).

    Expected calibration errors

Figure 6 (bottom row) provides average ECE values of the outlined uncertainty methods under i.i.d. conditions (first and second panel), under label-based data shifts (third and fourth panel) and under PCA-based data shifts (fifth and sixth panel). On training data, PU performs best, followed by PU-EV and all other methods. Interestingly, both SWAG and W-dropout show a relatively broad range of ECE values on the various training datasets. This could be interpreted as a form of over-estimation of the present uncertainty; for W-dropout, this effect occurs mostly on smaller datasets with lower data variability. However, looking at the i.i.d. test results (Table 1 and second panel in the bottom row of Fig. 6), we find W-dropout to provide the lowest averaged ECE (Table 1), followed by the PU-based (implicit) ensembles of PU-DE and PU-MC. The calibration quality of W-dropout is moreover the most consistent one across the datasets, as can be seen from its small 75% quantile value (Fig. 6, second panel in bottom row).

    Looking at the stability w.r.t. data shift, i.e., extra- and interpolation based on label-split or PCA-split, again W-dropout reaches the smallest calibration errors (followed by PU-DE and PU-MC, see Table 2). Regarding the 75% quantiles, W-dropout consistently provides one of the best results on all out-of-data (OOD) test sets.

    Fig. 7

Negative log-likelihoods (NLLs (\(\downarrow\)), top row) and Wasserstein distances (WS (\(\downarrow\)), bottom row) of different uncertainty methods under i.i.d. conditions (first and second panel in each row) and under various kinds of data shift (third to sixth panel in each row, see text for details). W-dropout (light blue background) is compared to 8 benchmark approaches. Each blue cross is the mean over 14 standard regression datasets. Orange line markers indicate median values. The gray vertical bars reach from the 25% quantile (bottom horizontal line) to the 75% quantile (top horizontal line) (Color figure online)

    Negative log-likelihoods

For the unbounded NLL (see Table 1 and the top row of Fig. 7), the results are more widely distributed compared to the (bounded) ECE values. W-dropout reaches the smallest mean value on i.i.d. test sets, followed by MC and PU-MC (Table 1). The mean NLL value of PU is above the upper plot limit in Fig. 7 (second panel in the upper row), indicating a rather weak stability of this method. On PCA-interpolate and PCA-extrapolate test sets (Fig. 7, last two panels in the upper row), MC, PU-MC and W-dropout networks perform best. On label-interpolate and label-extrapolate test sets, MC and W-dropout networks lead when considering average values, followed by PU-EV. The mean NLLs of many other approaches are above the upper plot limit. Averaging all these OOD results in Table 2, we find W-dropout to provide the overall smallest NLL values, narrowly followed by MC. Note that median results are not as widely spread and PU-DE, MC, PU-MC and W-dropout perform comparably well. These qualitative differences between mean and median behavior indicate that most methods perform poorly ‘once in a while’. This is a noteworthy observation, as stability across a variety of data shifts and datasets can be seen as a crucial requirement for an uncertainty method. In that sense, W-dropout models yield high stability w.r.t. NLL.

    Wasserstein distances

Studying Wasserstein distances, we again observe the smallest scores on test data for W-dropout, followed by PU-MC and PU-DE (see Table 1 and the second panel in the bottom row of Fig. 7). While PU provides the best WS value on training data, its generalization behavior is less stable: on test data, its mean and 75% quantile take high values beyond the plot range. Under data shift (Table 2 and third to sixth panel in bottom row of Fig. 7), W-dropout and MC are in the lead, CON-MC and DE follow on ranks three and four. On label-based data shifts, MC and W-dropout outperform all other methods by a significant margin when considering average values. As for NLL, we find the mean values for PU-DE and PU-MC to be significantly above their respective median values, indicating again weaknesses w.r.t. the stability of parametric methods. Here as well, not only good average results but also consistency over the datasets and splits are hallmarks of Wasserstein dropout.

    Epistemic uncertainty

Summarizing these evaluations on 1D regression datasets, we find W-dropout to yield better and more stable uncertainty estimates than the state-of-the-art methods of PU-DE and PU-MC. We moreover observe advantages for W-dropout under PCA- and label-based data shifts. These results suggest that W-dropout induces uncertainties which increase under data shift, i.e., it approximately models epistemic uncertainty. This conjecture is supported by Fig. 8, which visualizes the uncertainties of MC dropout (blue) and W-dropout (orange) for transitions from in-data to out-of-data. As expected, these shifts lead to increased (epistemic) uncertainty for MC dropout. This also holds for W-dropout, which behaves highly similarly under data shift, indicating that it “inherits” this ability from MC dropout: both approaches match sub-networks to training data and these sub-networks “spread” when leaving the training data distribution. Since W-dropout models heteroscedastic, i.e. input-dependent, aleatoric uncertainty, we notice a higher variability of its uncertainties in Fig. 8 compared to the ones of MC dropout.

    For further (visual) inspections of uncertainty quality, see the residual-uncertainty scatter plots in Appendix A.4. A reflection on NLL and comparisons of the different uncertainty measures on 1D regression datasets can be found in Appendix A.3.

    Fig. 8

    Extrapolation behavior of W-dropout (orange) and MC dropout (blue). Two extrapolation “directions” (rows) and two datasets (columns) are considered. The vertical bar in each panel separates training data (left) from out-of-data (OOD, right). Scatter points show the predicted standard deviation for individual data points. The colored solid lines show averages over points in equally-sized bins and reflect the expected growth of epistemic uncertainty in the OOD-region. For details on the data splits and extrapolations please refer to Sect. 4.3 and Appendix A.3 (Color figure online)

    Table 3 Study of worst-case scenarios for different uncertainty methods: W-dropout (W-Drop), PU-DE and PU-MC are compared to the ideal Gaussian case for i.i.d. and non-i.i.d. data splits. Uncertainty quality in these scenarios is quantified by the expected tail loss at the \(99\%\) quantile (ETL\(_{0.99}\)). Each mean and max value is taken over the ETLs of 110 models trained on 15 different datasets

    Expected tail loss

For both toy and standard regression datasets, we calculate the expected tail loss at the 99% quantile (ETL\(_{0.99}\)) on test data. Doing this for all trained networks yields a total of 110 ETL\(_{0.99}\) values per uncertainty method when including cross-validation. As a tail measure, the ETL\(_{0.99}\) evaluates a specific aspect of the distribution of uncertainty estimates. Studying such a property is useful if the uncertainty estimate distribution as a whole is appropriate, as measured e.g. by the ECE. We thus restrict the ETL\(_{0.99}\) analysis to the three methods that provide the best ECE values, namely PU-MC, PU-DE and W-dropout. The mean and maximum values of their ETL\(_{0.99}\)’s are reported in Table 3. While none of these methods gets close to the ideal ETL\(_{0.99}\)’s of the desired \({\mathcal {N}}(0,1)\) Gaussian, W-dropout networks exhibit significantly less pronounced tails and therefore higher stability compared to PU-MC and PU-DE. This holds true over all considered test sets. Deviations from standard normal increase from the i.i.d. train-test split over the PCA-based train-test split to the label-based one. We attribute the lower stability of PU-DE to the nature of the PU networks that compose the ensemble, although their inherent instability (see Table 9 in Appendix A.3) is largely suppressed by ensembling. Considering the tail of the distribution of the prediction residuals \(\vert r_i\vert\), however, reveals that regularization of PU by ensembling might not work in every single case. It is then unlikely that larger ensembles are able to fully cure this instability issue. Regularizing PU by applying dropout (PU-MC) leads to only mild improvement. W-dropout networks, in contrast, encode uncertainty into the structure of the entire network, thus yielding improved stability compared to parametric approaches. Further analysis shows that the large normalized residuals \(r_i=(\mu _i-y_i)/\sigma _i\), which cause the large \(\text {ETL}_{0.99}\) values, correspond (on average) to large absolute errors \((\mu _i-y_i)\).Footnote 5 This underpins the practical relevance of the ETL analysis, as large absolute errors are more harmful than small ones in many contexts, e.g. when detecting traffic participants.

    Dependencies between uncertainty measures

All uncertainty-related measures (NLL, ECE, WS, ETL) relate predicted uncertainties to actually occurring model residuals, with each of them putting emphasis on different aspects of the considered samples: NLL is biased towards well-performing models, ECE measures deviations within quantile ranges, the Wasserstein distance resolves distances between normalized residuals, and ETL focuses on distribution tails. The empirically observed dependencies between WS and ECE are visualized in Fig. 3. In addition to WS and ECE, we consider Kolmogorov–Smirnov (KS) distances (Stephens, 1974) on normalized residuals in Fig. 21 in Appendix C.

While all these scores are expectably correlated, noteworthy deviations from ideal correlation occur. Therefore, we advocate for uncertainty evaluations based on various measures to avoid overfitting to a specific formalization of uncertainty. The top panel of Fig. 21 reflects the higher sensitivity of the Wasserstein distance compared to ECE: we observe two “slopes”. The first one corresponds to models that overestimate uncertainties, i.e., \(\sigma _{\tilde{\theta }} > \vert \mu _{\tilde{\theta }} - y_i\vert\) on average. In these scenarios, WS is typically below 1, as 1 would be the WS distance between a delta distribution at zero (corresponding to \(\sigma _{\tilde{\theta }} \rightarrow \infty\)) and the expected \({\mathcal {N}}(0,1)\) Gaussian. The second “slope” contains models that underestimate uncertainties, i.e., \(\sigma _{\tilde{\theta }} < \vert \mu _{\tilde{\theta }} - y_i\vert\). WS is not bounded in these scenarios and is thus, unlike ECE, able to resolve differences between any two uncertainty estimators.

    4.4 Application to object regression

    Table 4 Basic statistics of the harmonized object detection datasets. Dataset size and number of annotated objects are reported for train data (first two columns) and test data (last two columns). For details on dataset harmonization, see text and references therein

After studying toy and standard regression datasets, we turn towards the challenging task of object detection (OD), using the SqueezeDet model (Wu et al., 2017), a fully convolutional neural network. First, we adapt the W-dropout objective to SqueezeDet (see the following paragraph). Next, we introduce the six considered OD datasets and sketch central technical aspects of training and inference. Since OD networks are often employed in open-world applications (like autonomous vehicles or drones), they likely encounter various types of concept shift during operation. In such novel scenarios, well-calibrated “self-assessment” capabilities help to foster safe functioning. We therefore evaluate Wasserstein-SqueezeDet not only in-domain but also on corrupted and augmented test data as well as on other object detection datasets (see the last paragraphs of this subsection).

    Architecture

    SqueezeDet takes an RGB input image and predicts three quantities: (i) 2D bounding boxes for detected objects (formalized as a 4D regression task), (ii) a confidence score for each predicted bounding box and (iii) the class of each detection. Its architecture is as follows: First, a sequence of convolutional layers extracts features from the input image. Next, dropout with a drop rate of \(p=0.5\) is applied to the final feature representations. Another convolutional layer, the ConvDet layer, finally estimates prediction candidates. In more detail, SqueezeDet predictions are based on so-called anchors, initial bounding boxes with prototypical shapes. The ConvDet layer computes for each such anchor a confidence score, class scores and offsets to the initial position and shape. The final prediction outputs are obtained by applying a non-maximum-suppression (NMS) procedure to the prediction candidates. The original loss of SqueezeDet is the sum of three terms. It reads \(L_{\mathrm{SqueezeDet}} = L_{\mathrm{regres}} + L_{\mathrm{conf}} + L_{\mathrm{class}}\) with the bounding box regression loss \(L_{\mathrm{regres}}\), a confidence-score loss \(L_{\mathrm{conf}}\) and the object-classification loss \(L_{\mathrm{class}}\). Our modification of the learning objective is restricted to the L2 regression loss:

    $$\begin{aligned} L_{\mathrm{regres}} = \frac{\lambda _{\mathrm{bbox}}}{N_{\mathrm{obj}}} \sum _{i=1}^{W} \sum _{j=1}^{H} \sum _{k=1}^{K} \sum _{\xi \in \{x,y,w,h\}} I_{ijk} \left[ ({\delta \xi }_{ijk} - \delta \xi _{ijk}^G)^2 \right] \end{aligned}$$
    (5)

with \({\delta \xi }_{ijk}\) and \(\delta \xi _{ijk}^G\) being estimates and ground truth expressed in coordinates relative to the k-th anchor at grid point (i, j), where \(\xi \in \{x,y,w,h\}\). See Wu et al. (2017) for descriptions of all other loss parameters. Applying W-dropout component-wise to this 4D regression problem yields

$$\begin{aligned} L_{\mathrm{regres}, \mathrm{W}} = \frac{\lambda _{\mathrm{bbox}}}{N_{\mathrm{obj}}} \sum _{i=1}^{W} \sum _{j=1}^{H} \sum _{k=1}^{K} \sum _{\xi \in \{x,y,w,h\}} I_{ijk} \left[ {\mathcal {W}}(\xi _{ijk}) \right] \,, \end{aligned}$$

    where

    $$\begin{aligned} {\mathcal {W}}(\xi _{ijk}) = \left( \mu _{\delta \xi _{ijk}} - \delta \xi _{ijk}^G\right) ^2 + \left( \sqrt{\sigma _{\delta \xi _{ijk}}^2} -\sqrt{\left( \mu _{\delta \xi _{ijk}} -\delta \xi _{ijk}^G\right) ^2 + \sigma _{\delta \xi _{ijk}}^2}\right) ^2 \end{aligned}$$

    with \(\mu _{\delta \xi _{ijk}} = \frac{1}{L} \sum _{l=1}^L \delta \xi _{ijk}^{(l)}\) being the sample mean and \(\sigma _{\delta \xi _{ijk}}^2 = \frac{1}{L} \sum _{l=1}^L (\delta \xi _{ijk}^{(l)} - \mu _{\delta \xi _{ijk}})^2\) being the sample variance over L dropout predictions \(\delta \xi _{ijk}^{(l)}\) for \(\xi \in \{x,y,w,h\}\).
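
A minimal PyTorch sketch of this component-wise objective is given below. It is an illustration only: tensor shapes, the object mask, the default \(\lambda _{\mathrm{bbox}}\), and the small stabilizer \(\epsilon\) are assumptions, and the normalization by the number of assigned anchors stands in for \(N_{\mathrm{obj}}\).

```python
import torch

def w_dropout_box_loss(deltas: torch.Tensor, deltas_gt: torch.Tensor,
                       obj_mask: torch.Tensor, lambda_bbox: float = 1.0,
                       eps: float = 1e-8) -> torch.Tensor:
    """deltas: (L, W, H, K, 4) dropout samples; deltas_gt: (W, H, K, 4); obj_mask: (W, H, K) in {0, 1}."""
    mu = deltas.mean(dim=0)                          # anchor- and coordinate-wise sample mean
    var = deltas.var(dim=0, unbiased=False)          # anchor- and coordinate-wise sample variance
    sq_err = (mu - deltas_gt) ** 2
    # W(xi_ijk) as defined above, evaluated for all anchors and coordinates at once
    w = sq_err + (torch.sqrt(var + eps) - torch.sqrt(sq_err + var + eps)) ** 2
    w = (w * obj_mask.unsqueeze(-1)).sum()           # I_ijk: only anchors assigned to objects contribute
    return lambda_bbox * w / obj_mask.sum().clamp(min=1.0)

# Toy shapes: L = 5 dropout passes on a 4 x 3 grid with K = 9 anchors.
L, W, H, K = 5, 4, 3, 9
loss = w_dropout_box_loss(torch.randn(L, W, H, K, 4),
                          torch.randn(W, H, K, 4),
                          torch.randint(0, 2, (W, H, K)).float())
```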

    Datasets

We train SqueezeDet networks on six traffic scene datasets: KITTI (Geiger et al., 2012), SynScapes (Wrenninge and Unger, 2006).Footnote 7

The number of clusters is chosen for each image to match the average number of detections across the 50 forward passes. Each cluster is summarized by its mean detection and standard deviation. To ensure meaningful statistics, we discard clusters with 4 or fewer detections. The cluster means are matched with ground truth. We exclude predictions from the evaluation if their IoU with ground truth is \(\le 0.1\). For each dataset, SqueezeDet’s maximum number of detections is chosen proportionally to the average number of ground truth objects per image.
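
A hedged sketch of this aggregation step is shown below. The clustering algorithm is not specified in the text above, so k-means over the 4D box parameters, the function name, and the toy data are assumptions; the choice of the number of clusters and the discarding rule follow the description.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_dropout_detections(boxes_per_pass, min_cluster_size: int = 5):
    """boxes_per_pass: list of (n_det_i, 4) arrays, one per stochastic forward pass."""
    all_boxes = np.concatenate(boxes_per_pass, axis=0)
    # number of clusters = average number of detections across the forward passes
    n_clusters = max(1, int(round(np.mean([len(b) for b in boxes_per_pass]))))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(all_boxes)
    means, stds = [], []
    for c in range(n_clusters):
        members = all_boxes[labels == c]
        if len(members) < min_cluster_size:          # discard clusters with 4 or fewer detections
            continue
        means.append(members.mean(axis=0))           # mean detection of the cluster
        stds.append(members.std(axis=0))             # per-coordinate uncertainty estimate
    return np.array(means), np.array(stds)

# Toy example: 50 forward passes, each detecting 3 boxes around three fixed centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 5.0, 5.0], [30.0, 10.0, 5.0, 5.0], [60.0, 20.0, 5.0, 5.0]])
boxes = [centers + rng.normal(scale=0.5, size=centers.shape) for _ in range(50)]
mean_boxes, box_stds = aggregate_dropout_detections(boxes)
```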

    Table 5 Regression performance and uncertainty quality of SqueezeDet-type networks on KITTI data. W-SqueezeDet (W-SqzDet) is compared with the default MC-SqueezeDet (MC-SqzDet). The values of NLL, ECE and WS are aggregated across their respective four dimensions, for details see Appendix A.5 and Table 12 therein

    In-data evaluation

To assess model performance, we report the mean intersection over union (mIoU) and RMSE (in pixel space) between predicted bounding boxes and matched ground truths. The quality of the uncertainty estimates is measured by (coordinate-wise) NLL, ECE, WS and ETL. Table 5 shows a summary of our results on train and test data for the KITTI dataset. The results for NLL, ECE, WS and ETL have been averaged across the 4 regression coordinates. MC-SqueezeDet (abbreviated as MC-SqzDet) and W-SqueezeDet (W-SqzDet) show comparable regression results in terms of RMSE and mIoU, with slight advantages for MC-SqueezeDet. At this point, we only consider versions of SqueezeDet that provide uncertainty scores. For a discussion regarding performance degradation w.r.t. the deterministic SqueezeDet (approximately \(10\%\), see Table 13), please refer to Appendix A.5. Considering uncertainty quality, we find substantial advantages for W-SqueezeDet across all evaluation measures. These advantages are due to the estimation of heteroscedastic aleatoric uncertainty during training (see also the trajectories of the test statistics during training for BDD100k in Fig. 18 in Appendix A.5).

The test RMSE and ECE values of all six OD datasets are visualized as diagonal elements in Fig. 9. The (mostly) ‘violet’ RMSE diagonals for MC-SqueezeDet and W-SqueezeDet (top row of Fig. 9) again indicate comparable regression performances. Datasets are ordered by size from small (top) to large (bottom). The large NuImages test set appears to be the most challenging one. Regarding ECE (bottom row of Fig. 9), W-SqueezeDet performs consistently stronger, see the ‘violet’ W-SqueezeDet diagonal (smaller values) and the ‘red’ MC-SqueezeDet diagonal (higher values). These findings qualitatively resemble those on the standard regression datasets and indicate that W-dropout works well on a modern application-scale network.

To analyze how well these OD uncertainty mechanisms function on test data that is structurally different from training data, we consider two types of out-of-data analyses in the following: first, we study SqueezeDet models that are trained on one OD dataset and evaluated on the test sets of the remaining five OD datasets. This is a rather ‘semantic’ OOD study, as features like object statistics and scene composition vary between training and OOD test sets. Second, we consider networks that are trained on one OD dataset and evaluated on corrupted versions (defocus blur, Gaussian noise) of the respective test set, thus facing changed ‘low-level’ features, i.e. less sharp edges due to blur and textures overlaid with pixel noise, respectively.

    Fig. 9

    In-data and out-of-data evaluation of MC-SqueezeDet (lhs) and W-SqueezeDet (rhs) on six OD datasets. We consider regression quality (RMSE, top row) and uncertainty quality (ECE, bottom row). For each heatmap entry, the row label refers to the training dataset, the column label to the test dataset. Thus, diagonal matrix elements are in-data evaluations, non-diagonal elements are OOD analyses. W-SqueezeDet provides substantially smaller ECE values both in-data and out-of-data

    Out-of-data evaluation on other OD datasets

We train one SqueezeDet on each of the six OD datasets and evaluate each of these models on the test sets of the remaining five datasets. The resulting OOD regression scores and OOD ECE values are visualized as off-diagonal elements in Fig. 9 for MC-SqueezeDet (left column) and W-SqueezeDet (right column). Since datasets are ordered by size (a rough proxy for dataset complexity), the upper triangular matrix corresponds to cases in which the evaluation dataset is especially challenging (“easy to hard”), while the lower triangular matrix subsumes easier test sets compared to the respective i.i.d. test set (“hard to easy”). Accordingly, we observe (on average) lower RMSE values in the lower triangular matrix for both SqueezeDet variants. The ECE values of W-SqueezeDet are once more smaller (‘violet’) compared to MC-SqueezeDet (‘red’). The ECE diagonal of W-SqueezeDet is visually more pronounced compared to the one of MC-SqueezeDet since uncertainty calibration is effectively optimized during the training of W-SqueezeDet. The Nightowls dataset causes a cross-shaped pattern, indicating that neither transfers of Nightowls models to other datasets nor transfers from other models to Nightowls work well. This behavior can be understood as the feature distributions of Nightowls’ nighttime images diverge from the (mostly) daytime images of the other datasets. The high uncertainty quality of W-SqueezeDet is underpinned by the evaluations of NLL and WS (see Fig. 17 and text in Appendix A.5).

    Table 6 Out-of-data evaluation of MC-SqueezeDet (MC-SqzDet) and W-SqueezeDet (W-SqzDet) on distorted OD datasets. Each model is trained on the original dataset and evaluated on two modified versions of the respective test set: a blurred one (first two columns) and a noisy one (last two columns), see text for details. We report the expected calibration error (ECE) and find W-SqueezeDet to perform better than MC-SqueezeDet on most datasets

    Out-of-data evaluation on corrupted datasets

In contrast to the analysis above, we now focus on ‘non-semantic’ data shifts due to technical distortions. For each test set, we generate a blurred and a noisy version.Footnote 8 Two examples of these transformations can be found in Fig. 16 in Appendix A.5. In accordance with previous results, W-SqueezeDet provides smaller ECE values compared to MC-SqueezeDet on most blurred and noisy test sets (see Table 6). We observe a less substantial deterioration of uncertainty quality for blurring compared to adding pixel noise, possibly because the latter more strongly affects the short-range pixel correlations that the networks rely on.
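
For illustration, a minimal sketch of such test-set corruptions is given below. It only approximates the transformations named above: a Gaussian filter stands in for defocus blur, and the severity parameters are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def corrupt(image: np.ndarray, mode: str, seed: int = 0) -> np.ndarray:
    """image: float array in [0, 1] with shape (H, W, 3)."""
    if mode == "blur":
        # Gaussian blur over the spatial axes only (stand-in for defocus blur).
        return gaussian_filter(image, sigma=(3.0, 3.0, 0.0))
    if mode == "noise":
        noise = np.random.default_rng(seed).normal(0.0, 0.08, image.shape)
        return np.clip(image + noise, 0.0, 1.0)      # additive Gaussian pixel noise
    raise ValueError(f"unknown corruption mode: {mode}")

img = np.random.default_rng(0).random((128, 256, 3))
blurred, noisy = corrupt(img, "blur"), corrupt(img, "noise")
```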

    5 Conclusion

The prevailing approaches to uncertainty quantification rely on parametric uncertainty estimates by means of a dedicated network output. In this work, we propose a novel type of uncertainty mechanism, Wasserstein dropout, that quantifies (aleatoric) uncertainty in a purely non-parametric manner: by revisiting and newly assembling core concepts from existing dropout-based uncertainty methods, we construct distributions of randomly drawn sub-networks that closely approximate the actual data distributions. This is achieved by a natural extension of the Euclidean metric (\(L_2\)-loss) for points to the 2-Wasserstein metric for distributions. In the limit of vanishing distribution width, i.e. vanishing uncertainty, both metrics coincide. Assuming Gaussianity and making a bootstrap approximation, we replace the metric by a compact loss objective that affords stable training. To the best of our knowledge, W-dropout is the first non-parametric method to model aleatoric uncertainty in neural networks. It outperforms the ubiquitous parametric approaches, as, e.g., shown by our comparison to deep ensembles (PU-DE).

    An extensive additional study of uncertainties under data shift further reveals advantages of W-dropout models compared to deep ensembles (PU-DE) and parametric models combined with dropout (PU-MC): the Wasserstein-based technique still provides (on average) better calibrated uncertainty estimates while coming along with a higher stability across a variety of datasets and data shifts. In contrast, we find parametric uncertainty estimation (PU) to be prone to instabilities that are only partially cured by the regularizing effects of explicit or implicit (dropout-based) ensembling (PU-DE, PU-MC). With respect to worst-case scenarios, W-dropout networks are by a large margin better than either PU-DE or PU-MC. This makes W-dropout especially suitable for safety-critical applications like automated driving or medical diagnosis where (even rarely occurring) inadequate uncertainty estimates might lead to injuries and damage. Furthermore, while our theoretical derivation focuses on aleatoric uncertainty, the presented distribution-shift experiments suggest that W-dropout is also able to capture epistemic uncertainty. Finding a theoretical explanation for that is subject of future research.

With respect to computational demands, W-dropout is roughly equivalent to MC dropout (MC) and, in fact, could be used as a drop-in replacement for the latter. While L-fold sampling of sub-networks increases the training complexity, we observe an increase of training time that is significantly below a factor of L in our implementation. Inference is performed in the same way for both methods and thus their run-time complexities are equivalent as well. In comparison to deep ensembles, W-dropout’s use of a single network reduces requirements on training and storage at the expense of multiple forward passes during inference. This property is shared with MC, and approaches exist to reduce the prediction cost, for instance last-layer dropout or sampling-free inference via analytical moment propagation (see also Postels et al. (2019)).

    In addition to the toy and 1D regression experiments, SqueezeDet is selected as a representative of large-scale object detection networks. We find the above mentioned properties of Wasserstein dropout to carry over to Wasserstein-SqueezeDet, namely the enhanced uncertainty quality and its increased stability under different types of data shifts. At the same time observed performance losses are minimal. Overall, our experiments on SqueezeDet show that W-dropout scales to larger networks relevant for practical applications.

    When intending to employ uncertainty estimation as a safeguard against model errors, distributional properties of the normalized residuals gain importance. To address such properties we introduce the ETL as a measure for rare and critical cases where uncertainty is strongly underestimated. While we find that W-dropout leads to more Gaussian residuals compared to our benchmarks we still observe remaining deviations. A priori, it is not clear whether the aleatoric uncertainty in complex data is Gaussian or whether such rare cases could be better described with more heavy-tailed distributions. If this is the case, the question arises of whether dropout mechanisms are flexible enough to model distributions outside the Gaussian regime, which we investigated in Sicking et al. (2020).

Taking a step back, the idea of exchanging the distributions allows applying our framework to a variety of tasks beyond regression and makes the migration from single-point modeling to full distributions a rather general concept. Replacing, e.g., Gaussians with Dirichlet distributions makes an application to classification conceivable; Malinin and Gales (2018) already employ parametric (Dirichlet) distributions to quantify uncertainty. Conceptually, our findings suggest that distribution modeling based on sampling generalizes better than its parameterized counterparts, an observation that might find applications far outside the scope of uncertainty quantification.