Introduction

Artificial intelligence (AI), which is already having an impact in the field of medicine, will play an even larger role during the next few years [1]. Modern deep neural networks (DNNs) have produced remarkable achievements in data analysis, classification, and image processing. DNNs have increasingly drawn the attention of experts, as their application to medical data can improve the precision of medical applications. If large datasets are available, neural networks can model very complex phenomena more effectively than traditional statistical methods. Unfortunately, their performance is directly correlated with the amount of available training data [1]. This is a non-trivial limitation when datasets are inherently scarce (e.g., rare diseases or unusual/early-research data), data aggregation is not possible, and/or augmentation capabilities are limited. Deep learning models are also vulnerable to overfitting, especially when constrained by small datasets, which in turn negatively impacts their capacity for generalization [2]. This is an important challenge in situations where silent failures (i.e., the network confidently misclassifying data) can lead to dramatic outcomes, such as in medical diagnosis [3]. Additionally, no epistemic uncertainty, particularly significant when training data are lacking, is provided in either classification or regression use cases. Many solutions, such as dropout (during training) [4], data augmentation [5], and k-fold cross validation [6], have been proposed in the literature to counteract overfitting and correctly assess performance. Despite these efforts, problems regarding the interpretability of the output and the related uncertainty still exist. To mitigate these issues, the Bayesian paradigm can be viewed as a systematic framework for analyzing and training uncertainty-aware neural networks, with good learning capabilities from small datasets and resistance to overfitting [7].
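Of the countermeasures mentioned above, k-fold cross validation is the simplest to make concrete. A minimal, library-free sketch of the fold-splitting step is shown below; the fold count, seed, and dataset size are illustrative, not those of this study:

```python
import random

def k_fold_splits(indices, k, seed=0):
    """Partition sample indices into k folds; each fold serves once as the
    validation set while the remaining k-1 folds form the training set."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        val = folds[i]
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        yield train, val

# Every sample appears in exactly one validation fold across the k rounds.
for train, val in k_fold_splits(range(10), k=5):
    assert sorted(train + val) == list(range(10))
```

Averaging the model's score over the k validation folds gives a less biased performance estimate than a single train/validation split, which is especially relevant with small datasets.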
Particularly, Bayesian neural networks (BNNs) are a viable framework for using deep learning in contexts where there is a need to produce information capable of alerting the user if a system should fail to generalize [8]. Many studies have investigated the use of the Bayesian paradigm in medicine for classification tasks. Some applications concern the classification of histopathological images [9], oral cancer images [10], and resting state functional magnetic resonance imaging (rs-fMRI) images for Alzheimer’s disease [11]. More applications of the Bayesian paradigm are available in the thorough review work by Abdullah et al. [12].

Bayesian Neural Networks

The concept behind BNNs comes from the application of the Bayesian paradigm to artificial neural networks (ANNs) in order to render them probabilistic systems. The Bayesian approach to probability (in contrast to the frequentist approach) stems from the meaning of Bayes’ rule, shown in Eq. 1:

$$\begin{aligned} P(H|D) = \frac{P(D|H)P(H)}{P(D)} \,, P(D) = \int P(D|\theta ) P(\theta ) \,d\theta \end{aligned}$$
(1)
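As a toy numerical illustration of Eq. 1: with a discrete hypothesis space, the evidence P(D) reduces to a sum over hypotheses rather than an integral. The numbers below are invented purely for illustration:

```python
# Two competing hypotheses with prior beliefs P(H) (illustrative values).
priors = {"H1": 0.5, "H2": 0.5}
# Likelihood P(D|H) of the observed data under each hypothesis.
likelihoods = {"H1": 0.8, "H2": 0.2}

# Evidence P(D): sum of likelihood * prior over all hypotheses,
# the discrete counterpart of the integral in Eq. 1.
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior P(H|D) via Bayes' rule; normalizing by the evidence
# guarantees the posteriors sum to 1.
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
print(posteriors)  # H1 becomes more plausible after observing D
```

For a neural network, the hypothesis space is the continuous space of weight configurations, which is exactly why the evidence integral becomes intractable.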

where P(H|D) is called the posterior, P(D|H) the likelihood, P(H) the prior, and P(D) the evidence. P(D) is obtained by integrating over all possible parameter values in order to normalize the posterior. This step is intractable for practical models and is tackled through various approaches (see also the predictive posterior later). H and D respectively represent the hypothesis and the available data. Applying Bayes’ formula to train a predictor can be thought of as learning from data D [8]. One possible description of a BNN is that of a stochastic neural network trained using Bayesian inference [8]. The design and implementation of a BNN comprise two steps: the definition of the network architecture and the selection of a stochastic model (in terms of a prior distribution on the network’s parameters and/or prior confidence in the predictive capabilities) [8]. The stochastic part of the model parametrization can be viewed as the formation of the hypothesis H [8]. Eq. 1 also gives a more complete picture of the probabilistic point of view on the training process. Initially, the prior is defined during the network’s construction. We then compute the likelihood (how well the model fits the data) through some probabilistic alternative to forward and back-propagation. Lastly, we normalize the result by the evidence (all the possible models fitting the data) in order to update our prior belief with the newly found information and construct the new posterior. This process is repeated over several epochs, as for classic neural networks, until performance criteria are met. Epistemic uncertainty is included in the posterior [8] during training and at inference. More precisely, once the model is trained, at inference time an approximate form of the predictive posterior, whose analytical form is shown in Eq. 2, is used.

$$\begin{aligned} P(\hat{y}|\hat{x},D) = \int P(\hat{y}|\hat{x},\theta )P(\theta |D) \,d\theta \end{aligned}$$
(2)

where \(P(\hat{y}|\hat{x},D)\) represents the probability of new data given the known data, \(P(\hat{y}|\hat{x},\theta )\) represents the probability with respect to the model parameters, and \(P(\theta |D)\) accounts for the effect the known data have on the parameters. This means that, with the same stochastic model and equal inputs, different outputs can be given, cumulatively providing an epistemic uncertainty profile. True Bayesian inference for large neural networks is intractable (the evidence and the predictive posterior require integrals over millions of parameters), so alternative methods, such as variational inference [13], Markov chain Monte Carlo [14], and dropout Bayesian approximation [15], are used in order to render these models computationally feasible. Giving more insight into the world of BNNs is not in the scope of this article, but good resources are available in the literature, such as Jospin et al. [8] and Mullachery et al.

Training required \(\sim\)1 s for the CNN, DropCNN, and DropBCNN, while the BCNN required \(\sim\)6.5 s (reducible to 4.5 s by only taking one Monte Carlo estimate of the gradient). Classifying an image required \(\sim\)1.9 ms for the CNN, DropCNN, and DropBCNN and \(\sim\)9.5 ms for the BCNN. Note that, to obtain a useful classification with the corresponding uncertainty profiles, the probabilistic networks need to classify an image n different times and then vote by majority, so the time for the DropBCNN and BCNN should be considered n times (n = 100 in our case).
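The repeated-classification and majority-vote scheme described above can be sketched in a few lines. Since the actual networks are not reproduced here, the stochastic forward pass is simulated by a dummy sampler; the class names match those of this study, but the sampling weights and seed are purely illustrative:

```python
import random
from collections import Counter

def mc_predict(stochastic_forward, x, n=100):
    """Approximate the predictive posterior (Eq. 2) by Monte Carlo:
    run n stochastic forward passes on the same input and vote by majority."""
    votes = Counter(stochastic_forward(x) for _ in range(n))
    label, count = votes.most_common(1)[0]
    confidence = count / n  # fraction of passes agreeing with the winning label
    return label, confidence, votes

# Dummy stand-in for a stochastic network: each call implicitly draws new
# parameters theta ~ P(theta|D); here the draw is simulated with fixed odds.
rng = random.Random(42)
def dummy_forward(x):
    return rng.choices(["AL", "CTRL", "ATTR"], weights=[0.7, 0.2, 0.1])[0]

label, confidence, votes = mc_predict(dummy_forward, x=None, n=100)
print(label, confidence)  # the spread of votes across classes is the epistemic uncertainty profile
```

The n-fold cost noted above is visible directly in the code: one prediction requires n forward passes, which is why the BCNN and DropBCNN inference times must be multiplied by n.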

Table 1 Hyperparameters for the proposed networks

Results

Figure 3 shows a representative example of the learning curves for the CNN, DropCNN (dropout layers inactive at evaluation, dropout probability of 25% and 50%), DropBCNN (dropout layers active at evaluation, dropout probability of 25% and 50%), and the BCNN. Table 2 shows the accuracy results for the four tested networks. Data are shown for accuracy on the training, validation, and test sets. Moreover, the validation-test mismatch is provided as a measure of the capacity of the network to detect out-of-distribution (OOD) data [3]. Although the learning curves for the BCNN seem to paint a worse picture compared to the other models, the BCNN behavior is actually the desired one in order to avoid silent failures in deep learning systems. This is visible in Table 2, where we see a strong reduction in validation-test mismatch in terms of accuracy when going from the DropBCNN (p = 0.5) (Bayesian approximation) to the BCNN (\(\sim\)7%, p-value < 0.05), and an even stronger reduction compared to the deterministic model (\(\sim\)15%, p-value < 0.05). This indicates the improved capability of the BCNN to learn correct features and to spot OOD inputs while using the same patients (of the training set) in the validation set. Furthermore, the BCNN is capable of achieving comparable accuracy on the test set with respect to the deterministic CNN (see Table 2). The Bayesian models are also able to provide a measure of epistemic uncertainty, as seen in Table 6 and Fig. 4. This information, not available when using deterministic networks, is invaluable for assessing the reliability of the prediction, especially in medicine. Uncertainty profiles can also be used to improve performance, give the model the capability to resist adversarial attacks [32], refuse classification under a certain confidence threshold to avoid failures, and guide the acquisition of more data towards where the epistemic uncertainty is highest.
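One simple way to exploit the uncertainty profile, as mentioned above, is to refuse any classification whose majority-vote confidence falls below a threshold and defer it to a human reader. A minimal sketch follows; the threshold value and the vote counts are illustrative, not those used in this study:

```python
def classify_or_refuse(votes, n, threshold=0.8):
    """Return the majority label if its vote fraction reaches the threshold,
    otherwise refuse the prediction (None) so a clinician can review the case."""
    label, count = max(votes.items(), key=lambda kv: kv[1])
    confidence = count / n
    if confidence < threshold:
        return None, confidence  # refused: epistemic uncertainty too high
    return label, confidence

# Illustrative vote counts over n = 100 stochastic forward passes.
print(classify_or_refuse({"AL": 92, "CTRL": 5, "ATTR": 3}, n=100))    # confident -> accepted
print(classify_or_refuse({"AL": 45, "CTRL": 40, "ATTR": 15}, n=100))  # spread out -> refused
```

In practice the threshold would be tuned on the validation set, trading coverage (fraction of cases the model answers) against the risk of a silent failure.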
Both the DropBCNN and BCNN are able to provide uncertainty metrics, but as can be seen in Table 6, the fully Bayesian model displays a greater discrepancy both between the “Correct” and “Incorrect” confidence (\(\sim\)7% more compared to the best DropBCNN with p = 0.5, p-value < 0.05) and between “AL” and “CTRL & ATTR” (\(\sim\)7% more compared to the best DropBCNN with p = 0.5, p-value < 0.05). This is in line with the confusion matrices in Fig. 5 and with the precision, recall, and F1-score metrics, which show better prediction capabilities for the AL classification than for the CTRL vs ATTR discrimination across all the models (max p-value < 0.05). One must certainly take into consideration the higher computational cost of the BCNN compared to the DropBCNN and CNN. In this sense, the Bayesian approximation can be seen as a way of retaining a measure of uncertainty while compromising between the better performance of a fully Bayesian model and the lower computational cost of a deterministic CNN.

Study’s Limitations

The main limitation of this work lies in the specific case study (early acquired cardiac PET images from CA patients) approached with the described methodology, in particular in the limited dataset and in the fact that the severity of the disease was not accounted for (as a general index across the various subtypes is not available), possibly leading to biased data and a biased dataset split. Better exploring the capabilities and potential of the Bayesian framework in similar scenarios and producing a severity metric based on PET acquisitions are objectives of future work. Moreover, better tuning of the models and a deeper exploration of possible approximations and algorithms to improve Bayesian inference performance and computational cost could also be considered for future work.

Conclusion

In the present work, four models were developed to assess, through a CA classification case study, the capability of BCNNs to overcome some of the limitations of deep learning in data scarcity scenarios. The developed BCNN showed comparable accuracy on the test dataset with respect to the deterministic CNN; it was also able to reduce silent failures by spotting OOD inputs better than the deterministic and approximate Bayesian models. Moreover, both the approximate Bayesian DropBCNN and the BCNN provided epistemic uncertainty. It is well known that epistemic uncertainty is fundamental for enriching the prediction and delivering crucial information to improve model performance, better interpret results, and possibly construct thresholds to refuse classification.