1 Introduction

Deep learning (DL) has achieved impressive performance in many application areas, especially in object recognition [34]. However, its nested non-linear structure combined with millions of parameters makes deep learning a black-box method that arrives at decisions in a non-transparent manner. Therefore, the explainability of DL models [9, 13, 21, 33, 35] and the development of associated techniques to visualize, explain, and interpret the behavior of DL algorithms have recently received increasing attention [3, 8, 16, 19, 22, 24, 25, 28, 30, 31, 36, 38, 40]. Apart from methods that optimize activations [10, 25, 36] or invert input representations [8, 22], visualizing and explaining the decision process of convolutional neural networks (CNNs) by saliency methods is an intensively discussed approach [1, 3, 16, 19, 20, 24, 28, 30, 31, 38, 39, 40].

Saliency methods attempt to explain an algorithm’s decision by assigning pixel-level values that reflect the importance of input components in terms of their contribution to the classification result. They generally produce so-called saliency maps [30] (also known as input contribution heatmaps or feature importance maps), which aim to explain the decision process of CNNs through

  i. Input Modification, i.e., assigning a relevance to a pixel based on the drop in prediction probability caused by the pixel’s perturbation [20, 38, 39],

  ii. Class Activation, i.e., combining the activation pattern of a higher-level layer with further information, such as the network’s output [40],

  iii. Backpropagation, i.e., tracing the contribution of the output nodes backwards through the network to the input nodes [3, 16, 38].

A particularly common backpropagation approach is layer-wise relevance propagation (LRP) introduced by Lapuschkin [3, 19]. Layer-wise relevance propagation relies on the assumption that the total amount of relevance is preserved when the classification decision is decomposed backwards into pixel-wise relevance scores. This so-called layer-wise conservation principle postulates that the sum of relevance assigned to the neurons of two adjacent CNN layers remains the same. Despite the popularity of saliency methods for explaining DL models, a significant number of papers have addressed the stability and robustness of saliency methods more intensively [1, 3, 12, 15, 18, 19, 23, 28]. Our work therefore explores the use of adversarial examples as a further tool to help evaluate the robustness and explanatory power of techniques devoted to the explainability of DL models, exemplified by LRP.

1.1 Related Work on the Explanatory Power of Saliency Methods

In the work of Samek et al. [28], the change in classification probability is presented as a possible measure to evaluate the explanation of the decision process provided by LRP. To investigate the explanatory power of LRP, they suggest replacing the input variables considered most relevant with samples from a probability distribution, such as the uniform or Dirichlet distribution. In this case, a large decrease in classification probability caused by a perturbation of the input variables with the highest relevance scores is considered an indicator of a suitable explanation. A similar idea is followed by Bach et al. [3], who evaluated the impact of single-value perturbations on the detection result by flipping pixels with highly positive and highly negative relevance scores, as well as pixels with relevance scores close to zero. Lapuschkin [19] presents a more generalized approach by employing an iterative greedy procedure to evaluate the expected behavior of LRP.

The work of Ghorbani et al. [12], on the other hand, shows for various gradient-based methods and DeepLIFT that the same object classification for two extremely similar images can be explained by different saliency maps. They apply slight modifications to the input images that leave the classification of the modified images and the originals essentially unchanged but lead to substantially different saliency maps. Similarly, Kindermans et al. [18] analyze the invariance of saliency-map-generating methods to transformations of the input data which have no impact on the prediction outcome. Instead of modifying input images, Heo et al. [15] adversarially manipulate the classification model, leaving the model accuracy unchanged while achieving a dramatic change in explanation. They perform model manipulations that result in modified models classifying an input object with nearly the same classification probability as the original model, yet the saliency maps differ significantly depending on the underlying model. Further, Adebayo et al. [1] present a sanity check for saliency methods comprising model randomization and data randomization tests. They find that some saliency methods (e.g., gradient\(\odot \)input) can be interpreted as implicitly implemented techniques analogous to edge detection, tending to detect edges rather than explain decisions.

Similarities in the saliency maps of marginally perturbed images intentionally designed to cause a major shift in classification (also known as adversarial examples) are the starting point of our analysis.

2.1 Layer-Wise Relevance Propagation

Consider an input image \(X\in \mathbb {R}^n\) processed by a CNN whose components in layer l are generally denoted by \(z^l \in \mathbb {R}^{n^l}\). The mapping between the i-th component of layer l and the j-th component of layer \(l+1\) is defined by \(z_{i\rightarrow j}^{l, l+1}\), so that

$$\begin{aligned} z_j^{l+1} = \sum \limits _{i=1}^{n^l} z_{i\rightarrow j}^{l, l+1} \end{aligned}$$
(1)

holds for all \(j\in \big \{1,\ldots ,n^{l+1}\big \}\) (see Fig. 2).

Fig. 2

Principle of layer-wise relevance propagation, showing the idea of redistributing the relevance score \(R^{l+1}_j\), \(j\in \{1,\ldots ,n^{l+1}\}\), of the j-th component of layer \(l+1\) (right) depending on the corresponding input components’ forward contributions (left)
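For a fully-connected layer, the forward contributions in Eq. (1) are simply the weighted inputs, and summing them over i recovers the pre-activation of component j. The following NumPy sketch illustrates this under that assumption; layer sizes are illustrative and bias terms are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

n_l, n_lp1 = 4, 3                  # sizes of layer l and layer l+1 (illustrative)
z_l = rng.normal(size=n_l)         # components z^l of layer l
W = rng.normal(size=(n_l, n_lp1))  # weights connecting layer l to layer l+1

# Forward contributions z_{i->j}^{l,l+1}: share of component i in pre-activation j.
contributions = z_l[:, None] * W   # shape (n_l, n_lp1)

# Eq. (1): every component of layer l+1 is the sum of its forward contributions
# (bias terms are omitted in this sketch).
z_lp1 = contributions.sum(axis=0)
assert np.allclose(z_lp1, z_l @ W)
```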

Using layer-wise relevance propagation, the degree of a component’s influence on the final decision score is measured by the relevance score R, which constitutes a relative measure of a component’s contribution to the network’s outcome. The sign of R indicates the direction of the contribution: \(R>0\) indicates a positive contribution and \(R<0\) a contradictory one, i.e., a contribution that contradicts the final decision score. A component with a relevance score close to zero is expected to be irrelevant with regard to the decision made by the classifier.

Proceeding from the model’s final layer with an initial relevance score given by the network’s output, LRP successively propagates the relevances backwards through the network until the input layer is reached. Considering a multi-class classification problem solved via deep neural networks, the classifier’s prediction usually results in a vector containing probabilities for each existing class. In this case, the relevance score is initialized with the value of the class that is to be explained. Under the assumption that the relevance \(R_j^{l+1}\) of each component \(z_j^{l+1}\) of layer \(l+1\) (cf. Eq. 1) has already been determined, the relevances of the previous layer’s components \(z_i^{l}\) are given by

$$\begin{aligned} R_i^{l} = \sum \limits _{j=1}^{n^{l+1}} R_{i \leftarrow j}^{l,l+1} \end{aligned}$$
(2)

(cf. Fig. 2). The relevance message \(R_{i \leftarrow j}^{l,l+1}\), directed from component j to component i, describes the share of the relevance score \(R_j^{l+1}\) that can be traced back to the i-th component of layer l and can be determined, for example, according to the decomposition rules listed in Table 1. The signum function occurring in Table 1 is defined as follows:

$$\begin{aligned} \text {sgn}\left( z_j^{l+1}\right) ={\left\{ \begin{array}{ll} \ \ 1 &{} \, \text {if } z_j^{l+1}\ge 0\\ -1 &{} \, \text {otherwise} \ \ . \end{array}\right. } \end{aligned}$$

Positive forward contributions \(\left( z_{i\rightarrow j}^{l,l+1}\right) ^+\) and negative forward contributions \(\left( z_{i\rightarrow j}^{l,l+1}\right) ^-\) are defined by

$$\begin{aligned} \left( z_{i\rightarrow j}^{l,l+1}\right) ^+ = \max \left( 0, \ z_{i\rightarrow j}^{l,l+1}\right) \end{aligned}$$
(3)

and

$$\begin{aligned} \left( z_{i\rightarrow j}^{l,l+1}\right) ^- = \min \left( 0, \ z_{i\rightarrow j}^{l,l+1}\right) . \end{aligned}$$
(4)

Further decomposition rules, as well as a detailed description of the aforementioned ones, can be found in [3, 19] and [24].

Table 1 Relevance message based on commonly used decomposition rules
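The relevance messages of the commonly used rules LRP-0, LRP-\(\varepsilon \) and LRP-\(\alpha \beta \) can be sketched for a single fully-connected layer as follows. The formulations follow standard presentations in the LRP literature; the stabilizer placement, bias handling and the \(\alpha - \beta = 1\) convention are assumptions and may differ in detail from the configuration in Table 1.

```python
import numpy as np

def lrp_dense(z_l, W, R_next, rule="lrp0", eps=0.01, alpha=1.0):
    """Redistribute the relevance R_next of layer l+1 to layer l (Eq. 2).

    Common formulations of LRP-0, LRP-eps and LRP-alpha/beta for a single
    fully-connected layer without bias terms (illustrative sketch).
    """
    contrib = z_l[:, None] * W            # forward contributions z_{i->j}
    z_next = contrib.sum(axis=0)          # pre-activations of layer l+1 (Eq. 1)

    if rule == "lrp0":
        denom = z_next
    elif rule == "lrp_eps":
        # eps * sgn(z_j) stabilizes small denominators, absorbing weak contributions
        denom = z_next + eps * np.where(z_next >= 0, 1.0, -1.0)
    elif rule == "lrp_ab":
        beta = alpha - 1.0                # assumed convention: alpha - beta = 1
        pos, neg = np.maximum(contrib, 0), np.minimum(contrib, 0)
        pos_sum = pos.sum(axis=0) + 1e-12  # small constants avoid division by zero
        neg_sum = neg.sum(axis=0) - 1e-12
        messages = (alpha * pos / pos_sum - beta * neg / neg_sum) * R_next
        return messages.sum(axis=1)
    else:
        raise ValueError(rule)

    messages = contrib / denom * R_next   # relevance messages R_{i<-j}
    return messages.sum(axis=1)           # Eq. (2)

rng = np.random.default_rng(1)
z_l, W = rng.normal(size=8), rng.normal(size=(8, 3))
R_next = np.maximum(z_l @ W, 0)           # toy relevance of layer l+1
R_prev = lrp_dense(z_l, W, R_next, rule="lrp0")
# Layer-wise conservation principle (exact for LRP-0 without bias terms):
assert np.isclose(R_prev.sum(), R_next.sum())
```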

2.2 Adversarial Examples

Adversarial examples, which are especially common in the area of image classification and object recognition, are inputs intentionally designed to deceive machine learning models and provoke misclassifications with high probability. They are characterized by a close resemblance to the training data and cannot be distinguished from regular input images by human observers (see Fig. 3).

Fig. 3

Results of the L-BFGS attack targeting the classes airplane, automobile, bird, cat, deer, dog, frog, ship and truck based on an image \(X\in \mathbb {R}^n\) (bottom row, third from the left) originally assigned to the class horse. All adversarial examples show a classification probability \(p_{c'}\) over \(99\ \%\) towards the attack’s corresponding target class \(c'\in \mathcal {C}\)

Based on a pre-trained classifier \(f: \mathbb {R}^n\longrightarrow \mathcal {C}\), the generation of adversarial examples can be formulated as a constrained optimization problem. Given a set of potential classes \(\mathcal {C}\) as well as a reference image \(X\in \mathbb {R}^n\) belonging to the class \(c\in \mathcal {C}\), the optimization problem reads as follows:

$$\begin{aligned} \begin{aligned}&\min \limits _{\eta } \ \Arrowvert \eta \Arrowvert _p\\&\ \text {s.t.} \ \ f(X') = c'\\&\ \ \ \ \ \ \ f(X) = c\\&\ \ \ \ \ \ \ c' \ne c\\&\ \ \ \ \ \ \ L\le X' \le U \end{aligned} \end{aligned}$$
(5)

where \(\eta =X-X'\) denotes the discrepancy between the reference image and the adversarial example \(X'\in \mathbb {R}^n\). The parameters \(L,U\in \mathbb {R}^n\) represent the component-wise lower and upper bounds on the pixel values of the adversarial example belonging to the target class \(c'\in \mathcal {C}\). Determining the minimal perturbation \(\eta ^*\) needed to provoke a misclassification of the reference image is a complex, nontrivial problem. A wide range of algorithms, so-called adversarial attacks, enable an approximate solution of this problem. Adversarial attacks, such as the Fast Gradient Sign Method or the L-BFGS attack (see Sect. 3.2), are usually based on different algorithmic approaches and assumptions. An extensive survey of existing adversarial attacks is given by [29] and [37].
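Problem (5) is typically not solved exactly. A common approximation, and the basic idea behind the L-BFGS attack, is to minimize a penalized surrogate that trades the perturbation norm against the loss towards the target class under the box constraints. The sketch below uses SciPy's L-BFGS-B routine and assumes a tf.keras model returning class probabilities; the trade-off constant `c_penalty` is a hypothetical parameter, and this is not the Foolbox implementation actually used for the experiments (cf. Sect. 3.2).

```python
import numpy as np
import tensorflow as tf
from scipy.optimize import minimize

def lbfgs_attack(model, x, target, c_penalty=1.0, bounds=(0.0, 1.0)):
    """Approximate solution of problem (5): minimize
    c_penalty * ||x' - x||_2^2 + cross_entropy(f(x'), target)  s.t.  L <= x' <= U.

    `model` is assumed to return class probabilities; the result is only an
    approximation and need not reach the target class for every image.
    """
    x = x.astype(np.float32)
    shape = x.shape
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    def objective(x_flat):
        x_adv = tf.Variable(x_flat.reshape(shape)[None, ...].astype(np.float32))
        with tf.GradientTape() as tape:
            loss = (c_penalty * tf.reduce_sum((x_adv - x[None]) ** 2)
                    + loss_fn(np.array([target]), model(x_adv)))
        grad = tape.gradient(loss, x_adv)
        return float(loss.numpy()), grad.numpy().ravel().astype(np.float64)

    box = [bounds] * x.size  # component-wise constraints L <= x'_k <= U
    result = minimize(objective, x.ravel().astype(np.float64), jac=True,
                      method="L-BFGS-B", bounds=box)
    return result.x.reshape(shape).astype(np.float32)
```

In the original formulation of the attack, the penalty constant is additionally adjusted by a line search until the adversarial example is actually assigned to the target class; this step is omitted in the sketch.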

3 Experimental Settings

3.1 Dataset and Network Architecture

The generation of the adversarial examples that form the foundation of the analysis in this paper, as well as the training of the underlying classifier, is based on the CIFAR-10 dataset. In the research area of machine learning, CIFAR-10 is a commonly used benchmark dataset of RGB images characterized by a comparatively low image resolution (\(32\times 32\)). The dataset comprises 60,000 images (50,000 training and 10,000 test samples) belonging to the classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck, with 6,000 samples each. The available classes are denoted by \(c_i\in \mathcal {C}\) for \(i\in \{1,\ldots ,10\}\) and thus \(\mathcal {C}=\{c_i \ | \ i\in \{1,\ldots ,10\}\}\) holds.

Table 2 Detailed specification of the CNN architecture

In the field of image classification, there is already a considerable number of CNN architectures that achieve excellent results on CIFAR-10 [4]. These CNNs are generally very deep and equipped with innovative architectural elements (e.g., skip connections [17]). To circumvent potential dependencies on a specific architectural design and to ensure a straightforward and transparent analysis of relevant features, we deliberately selected a CNN characterized by a simpler and less deep network architecture. The network’s architectural design is based solely on fundamental structural elements, such as convolutional, max-pooling, and fully-connected layers, whose configuration is inspired by the state-of-the-art classifier VGG [17] (cf. Table 2). Unlike VGG networks, our CNN architecture features a significantly smaller number of trainable parameters. VGG16 [31], for instance, includes approximately 138 million trainable parameters, while our CNN consists of only 307,936 trainable parameters (cf. Table 2). Training and evaluating the selected network on CIFAR-10 leads to a training accuracy of \(90.87~\%\) and a validation accuracy of \(89.02~\%\). Therefore, 8,902 test images and 47,552 training images are correctly classified.
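Since Table 2 is not reproduced here, the following Keras sketch merely illustrates the kind of small VGG-style network described above, built only from convolutional, max-pooling and fully-connected layers. Filter counts and layer depths are illustrative assumptions and do not reproduce the exact 307,936-parameter configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_small_vgg_style_cnn(num_classes=10):
    """Small VGG-style CNN for 32x32x3 CIFAR-10 images (illustrative configuration)."""
    model = models.Sequential([
        layers.Conv2D(16, 3, activation="relu", padding="same", input_shape=(32, 32, 3)),
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes),      # softmax pre-activations ("logits")
        layers.Activation("softmax"),   # probabilistic output p_c
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```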

3.2 Generation of Adversarial Examples

The generation of the adversarial examples is based on the 8,902 correctly classified images of the CIFAR-10 test dataset using the L-BFGS attack (cf. Sect. 2.2). The L-BFGS attack is an iterative white-box attack based on the limited-memory BFGS method for bound-constrained optimization (L-BFGS-B for short), a numerical optimization algorithm described in detail by Byrd et al. [6]. Since the L-BFGS attack is a targeted adversarial attack, the attack’s desired target class \(c'\in \mathcal {C}\) needs to be specified in advance. To obtain a wide variety of adversarial examples for later analysis, every class of CIFAR-10 is chosen once as the attack’s target class. The attack is therefore executed nine times for each correctly classified image \(X\in \mathbb {R}^n\), i.e., for all targets \(c'\in \mathcal {C}\) with \(c'\ne y\), where \(y=c\) for \(c\in \mathcal {C}\) denotes the image’s true label. The implementation of the L-BFGS attack provided by the Python library Foolbox 2.4.0 [11, 26] was used to create the adversarial examples. Further information regarding the algorithmic specification of the attack’s implementation can be found in [32].
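A sketch of how such a targeted attack can be run with Foolbox 2.x is given below. The wrapper and attack classes follow the Foolbox 2.x API as documented at the time (TensorFlowEagerModel, LBFGSAttack, TargetClass); exact call signatures may differ between Foolbox versions, and the snippet is an illustration rather than the script used for the experiments.

```python
import numpy as np
import foolbox

# `model` is the trained tf.keras CNN from Sect. 3.1, pixel values scaled to [0, 1].
# Class names follow the Foolbox 2.x API; signatures may differ between versions.
fmodel = foolbox.models.TensorFlowEagerModel(model, bounds=(0.0, 1.0))

def targeted_lbfgs(images, labels, target_class):
    """Targeted L-BFGS attack towards `target_class` for a batch of images."""
    attack = foolbox.attacks.LBFGSAttack(
        fmodel, criterion=foolbox.criteria.TargetClass(target_class))
    # Depending on the Foolbox version, failed attacks are returned as None or NaN.
    return attack(images.astype(np.float32), labels.astype(np.int64))

# For each correctly classified image, the attack is run once per target class c' != y.
```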

3.3 Generation of Relevance Scores

The relevance scores for both adversarial examples and original images (cf. Fig. 1) are determined according to the basic relevance decomposition rule (LRP-0), the \(\varepsilon \)-rule (LRP-\(\varepsilon \)), and the \(\alpha \beta \)-rule (LRP-\(\alpha \beta \)) (cf. Table 1), implemented in Python 3.6.10 using Tensorflow 2.1.0. The advanced decomposition rules LRP-\(\varepsilon \) and LRP-\(\alpha \beta \) are executed for different parameter values, i.e., for \(\varepsilon \in \{0.0001, 0.01,0.1,1\}\) and \(\alpha \in \{1,2\}\), to also evaluate the parameters’ effects on the final relevance scores. Given the tensorial representation of the input images, the application of the relevance decomposition rules results in 3072 relevance scores per adversarial example or original image, respectively.

To allow a clear distinction between relevant and irrelevant input components, especially when processing off-manifold data such as adversarial examples, the softmax pre-activation values are used as initial relevance scores instead of the classifier’s final probabilistic outcome \(p_{c}\in [0,1]\), \(c\in \mathcal {C}\). To avoid excessively large relevance scores and to enable comparability while maintaining the relevance scores’ ratios and signs within each image, the final relevance scores are normalized separately for each image using the maximum norm. As opposed to [24], no composition of different decomposition rules depending on a layer’s position within the CNN architecture is used.
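The initialization and normalization just described can be sketched as follows. The helper assumes that the softmax is applied as a separate final layer (as in the illustrative architecture sketch above), so that the input of the last layer holds the pre-activation values; the layer-wise propagation of the relevance itself (Table 1) is not shown.

```python
import numpy as np
import tensorflow as tf

def initial_relevance(model, image, class_index):
    """Softmax pre-activation of the class to be explained, used as initial relevance.

    Assumes the softmax is a separate final layer, so that the input of
    model.layers[-1] provides the pre-activation values (illustrative assumption).
    """
    logits_model = tf.keras.Model(model.inputs, model.layers[-1].input)
    return float(logits_model(image[None, ...]).numpy()[0, class_index])

def normalize_relevance(relevance_map):
    """Per-image normalization by the maximum norm; preserves signs and ratios."""
    return relevance_map / np.max(np.abs(relevance_map))
```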

3.4 Analysis of the Relevance Scores

The analysis is conducted based on the relevance scores of 77,402 generated adversarial examples, as well as the relevance scores of the correctly classified test images of CIFAR-10 (cf. Sect. 3.3). For our investigations, we applied established methods of descriptive statistics and exploratory data analysis, such as expected values, standard deviations, quantile values and ranges, as well as visual evaluation and verification via histograms and saliency maps. The statistical analysis is performed separately for relevance scores above and below zero, due to the different interpretation of positive and negative relevance scores (cf. Sect. 2.1).

Input components with highly positive or particularly contradictory contributions to the classifier’s outcome, i.e., components with high absolute relevance scores, are of particular interest in the context of our analyses. If the hypothesis holds that different classification decisions must lead to distinct saliency maps in order to provide reasonable explanations, there should be a significant discrepancy between adversarial examples and original images, especially in the extreme value ranges. Therefore, the focus is on the components with the most extreme relevance scores in each sample, as well as on the largest \(1~\%\) of the positive relevance scores and the smallest \(1~\%\) of the negative relevance scores. This selection is motivated by the quantile values shown in Tables 3 and 4. In addition to the analysis of significantly influential components of adversarial examples and original images, non-influential components, i.e., components with relevance scores close to zero, are examined as well. Due to their comparatively low relevance scores, components with a positive score below 0.001 and a negative score above \(-0.001\) are assumed to be irrelevant for the classification.
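The selection of influential and non-influential components described above translates directly into NumPy; the thresholds (top and bottom \(1~\%\) and the \(\pm 0.001\) band) are the ones stated in the text, applied to one image's normalized relevance scores.

```python
import numpy as np

def select_components(relevance, q=0.01, zero_band=1e-3):
    """Index sets of interest for one image's 3072 normalized relevance scores."""
    pos = relevance[relevance > 0]
    neg = relevance[relevance < 0]

    most_extreme = np.argmax(np.abs(relevance))                  # most extreme score
    top_pos = np.where(relevance >= np.quantile(pos, 1 - q))[0]  # largest 1% of positive scores
    bottom_neg = np.where(relevance <= np.quantile(neg, q))[0]   # smallest 1% of negative scores
    irrelevant = np.where(np.abs(relevance) < zero_band)[0]      # scores within (-0.001, 0.001)

    return most_extreme, top_pos, bottom_neg, irrelevant
```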

Furthermore, we establish a ranking describing the relevance shift between components of original images and components of adversarial examples triggered by the application of the adversarial attack (cf. Fig. 4). To this end, the input components of each image are sorted separately and in descending order according to their relevance score, without distinguishing between positive and negative values. The relevance shift of each component is defined by the difference between the position of a component in the relevance ranking based on the original image and its position in the relevance ranking based on the corresponding adversarial example. Hence, a positive shift indicates a positional degradation and a negative shift a positional enhancement of a component when looking at adversarial examples. A shift of zero implies that the position of a component remains unchanged. When analyzing the positional shift of individual components, we focus primarily on the components with the largest or the largest \(1~\%\) of the positive relevance scores, similar to the statistical evaluation. Additionally, we investigate the change in position for the most relevant \(10~\%\) of the components. In the following, the relevance ranking for components of adversarial examples is referred to as the adversarial relevance ranking and the ranking based on components of original images as the original relevance ranking.

Fig. 4

Description of the procedure for generating the relevance ranking based on the relevance scores of original images and corresponding adversarial examples (see input). Step 1: The relevance scores (here LRP Score) are sorted for each image and associated adversarial examples individually in descending order. The pixel with the highest score is ranked first, while the pixel with the lowest score is ranked last. A pixel can be identified by its position in the image frame (here Pixel Pos.). Step 2: The position of each pixel in the original relevance ranking and the adversarial relevance ranking is compared, resulting in a positional relevance shift for each pixel (here Pos. Shift)
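The ranking procedure of Fig. 4 amounts to sorting each image's relevance scores in descending order and comparing the resulting per-pixel ranks between the original and the adversarial relevance ranking. A minimal sketch follows; ranks are 0-based here, and the sign convention is chosen so that positive shifts correspond to the positional degradation described above.

```python
import numpy as np

def relevance_ranks(scores):
    """Rank of every pixel when scores are sorted in descending order (0 = highest)."""
    ranks = np.empty(scores.size, dtype=int)
    ranks[np.argsort(-scores)] = np.arange(scores.size)
    return ranks

def positional_shift(original_scores, adversarial_scores):
    """Positive shift = positional degradation of a pixel in the adversarial ranking."""
    orig_ranks = relevance_ranks(original_scores.ravel())
    adv_ranks = relevance_ranks(adversarial_scores.ravel())
    return adv_ranks - orig_ranks

# Example use: pixels among the top 1% of the original ranking that stay in the
# adversarial top 1% (3072 components -> roughly the 31 highest-ranked pixels):
#   top_k = original_scores.size // 100
#   in_both = (orig_ranks < top_k) & (adv_ranks < top_k)
```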

4 Results

4.1 Classification Accuracy

The application of the L-BFGS attack according to the experimental setup sketched above (cf. Sect. 3.2) results in a total of 77,402 adversarial examples, which corresponds to a success rate of \(96.61 \ \%\). The vast majority of adversarial examples, more precisely \(99 \ \%\), show a classification probability \(p_{c'}\) above \(93.37 \ \%\) towards their respective target class \(c'\in \mathcal {C}\). Only in \(0.5 \ \%\) of the cases does \(p_{c'}\le 54.82 \ \%\) hold. In these cases, the adversarial attack is not able to create powerful adversarial examples that convincingly fool the underlying CNN. In \(0.1 \ \%\) of the cases, \(p_{c'}\) is even lower than \(26.62 \ \%\).

A closer look at these low-probability adversarial examples (i.e., \(p_{c'}\le 54.82 \ \%\)) revealed that most of them result from images originally assigned to the classes dog and cat. It can also be observed that 39 out of 84 low-probability adversarial examples resulting from the L-BFGS attack targeting the class cat are originally assigned to the class dog. A similar picture was found when examining the adversarial examples of the target class dog. This observation is not surprising, since these classes appear to be generally confused by the underlying CNN due to their similar visual appearance. Furthermore, it was striking that, regardless of the image’s true label \(y=c\) with \(c\in \mathcal {C}\) and the attack’s target class \(c'\in \mathcal {C}\), the remaining probability \(1-p_{c'}\) was typically assigned to one or two other classes at a lower but similar level. In this context, the second highest probability was almost always assigned to the image’s true class c. Despite the images’ low resolution of \(32\times 32\), the adversarial examples cannot be distinguished from their original images by human observers, as Fig. 3 compellingly illustrates.

4.2 Relevance Scores

4.2.1 Visual Evaluation

The visual verification and direct comparison of the input contribution heatmaps of an original image and its corresponding adversarial examples reveal no significant differences (cf. Figs. 5 and 6). Despite strongly divergent classification decisions and high classification probabilities, there is almost no difference between the components of adversarial examples marked as relevant and the relevant components of original images. Even though individual pixels undergo minor changes in the absolute magnitude of their relevance scores and some previously insignificant pixels seem to become relevant as a result of the adversarial attack, the majority of pixels appear to have a strong impact on both the classification of the adversarial example and the classification of the original image.

This can be observed especially for input contribution heatmaps derived from relevance scores obtained by applying LRP-\(\alpha \beta \) with \(\alpha = 1\) (cf. Figs. 5 and 6). Even though some background components seem to become relevant for the classifier’s outcome through the changes induced by the adversarial attack, the contours of the original objects are clearly visible in the input contribution heatmaps of the adversarial examples. This implies that components marked as relevant for original images seem to be relevant for adversarial examples as well, albeit leading to a significantly different classification decision with high probability towards the pre-defined target class (here automobile, bird, cat, horse and truck). In some cases, the negative relevance scores even overpower the positive ones (e.g., Fig. 6, original class bird, target class cat), replicating the original object’s contour lines. Thus, the heatmaps seem to clearly contradict the result of the classifier.

Fig. 5

Results of the application of LRP-\(\varepsilon \) and LRP-\(\alpha \beta \) to correctly classified images of CIFAR-10 belonging to the classes automobile, bird, cat, horse and truck

Fig. 6

Results of the application of LRP-\(\varepsilon \) and LRP-\(\alpha \beta \) to adversarial examples belonging to the target classes \(c'\in \{{\textit{automobile}}, {\textit{bird}}, {\textit{cat}}, {\textit{horse}}, {\textit{truck}}\}\)

Accordingly, a visual verification seems to be ambiguous and not sufficient to explain the entirely different classification results of original images and adversarial examples. Furthermore, the question arises whether a visual verification of relevance scores based on human interpretation of contour lines can actually explain the influence of a component within the complex structure of a deep neural network. However, to base the evaluation on more than a visual inspection, a statistical evaluation of the differences in saliency maps is presented in the following section.

4.2.2 Statistical Evaluation

Regardless of the applied decomposition rule, the statistical evaluation shows that, on average, \(0.36~\%\) of the adversarial components have a relevance score of zero and are therefore considered non-influential to the final decision score \(f(X')\), \(X'\in \mathbb {R}^n\). For correctly classified images, on average only \(0.24~\%\) of the components have a relevance score of zero. Looking at the quantile values in Tables 3 and 4, only \(1~\%\) of the positive and \(1~\%\) of the negative relevance scores appear to be significant for the final classification decision. The majority of the components, however, seem to have no significant impact according to LRP, as their relevance scores are an order of magnitude lower than those beyond the \(1~\%\) and \(99~\%\) quantiles. This is also reflected by the relevance scores’ expected value, which ranges from zero (LRP-0) to 0.0516 (LRP-\(\alpha \beta \), \(\alpha =1\)) for adversarial examples and from zero to 0.055 for original images.

Considering the relevance scores’ quantile values summarized in Tables 3 and 4, there is no discernible difference between the relevance scores of adversarial examples and those of original images. In both cases, the relevance scores obtained by LRP-0 and LRP-\(\varepsilon \) with \(\varepsilon \in \{0.0001, 0.01, 0.1\}\) are symmetrically distributed around zero. The distributions of the relevance scores obtained by LRP-\(\varepsilon \) with \(\varepsilon =1\) and LRP-\(\alpha \beta \) with \(\alpha \in \{1,2\}\), on the other hand, are slightly skewed to the right, which is due to the nature of the applied decomposition rules.

Table 3 Quantiles of the adversarial examples’ relevance scores obtained by applying LRP-0, LRP-\(\varepsilon \) and LRP-\(\alpha \beta \)
Table 4 Quantiles of the original images’ relevance scores obtained by applying LRP-0, LRP-\(\varepsilon \) and LRP-\(\alpha \beta \)

In the case of LRP-\(\varepsilon \), the parameter \(\varepsilon \) absorbs a certain amount of relevance and thus eliminates weak or contradictory contributions as \(\varepsilon \) grows. Accordingly, with an increasing parameter value the number of irrelevant components increases and only the most salient components survive, which is also reflected by the quantile values in Tables 3 and 4. Furthermore, it can be observed that the relevance scores’ standard deviation also declines with growing \(\varepsilon \), showing values below 0.1184. Additionally, the gaps between the quantile values of the relevance scores change for \(\varepsilon =1\), and the relevance scores of allegedly influential components tend to become even larger. This seems to allow a more precise distinction between relevant and irrelevant features. In contrast to LRP-\(\varepsilon \), the observed distribution shift for the relevance scores obtained by LRP-\(\alpha \beta \) is due to the different weighting of positive and negative forward contributions. Especially interesting is the significant difference in the lower quantile values between the relevance scores of adversarial components and those of components of original images for \(\alpha =1\): in the case of original images, \(2\%\) of the components are assigned a relevance score less than zero, whereas only \(0.1~\%\) of the relevance scores associated with adversarial examples lie in the negative value range. Nevertheless, in both cases the relevance scores have similar expected values (0.0516 for adversarial examples, 0.055 for original images) and standard deviations (0.0930 for adversarial examples, 0.1013 for original images). Similar to LRP-\(\varepsilon \), the variation of the gap between the relevance scores’ quantile values can be observed for LRP-\(\alpha \beta \) as well.

Regardless of the applied decomposition rule (i.e., LRP-0, LRP-\(\varepsilon \) or LRP-\(\alpha \beta \)), the examination and direct comparison of the relevance scores for both adversarial examples and original images using quantiles, expected values and standard deviations revealed no major differences between their relevance scores. Even the analysis of highly influential or non-influential components showed neither significant differences between the relevance scores of adversarial examples and original images, nor general differences between positive and negative relevance scores. Hence, the statistical analysis indicates a rather ambiguous behavior of LRP as well (cf. Sect. 4.2.1), supporting the conjecture of insufficient explanatory power, especially when considering defective data such as adversarial examples.

4.2.3 Relevance Ranking

The results above are also supported by the established relevance ranking for components of original images and components of adversarial examples, as well as by the direct comparison of their ranking position according to Sect. 3.4. Particularly striking are the results of the relevance ranking for adversarial examples and original images based on the relevance scores obtained by applying LRP-\(\alpha \beta \) with \(\alpha =1\).

Fig. 7

Frequency distribution of the positional change of the most relevant component of each original image according to LRP-\(\alpha \beta \), \(\alpha =1\), which also belongs to the top \(1\%\) (left) or the top \(10\%\) (right) of the adversarial input components

The examination of the components with the highest relevance score in each original image according to LRP-\(\alpha \beta \) with \(\alpha =1\) shows that \(96.72\%\) of these components are also among the \(10\%\) of the components most relevant for the classification of the corresponding adversarial example. In \(75.21\%\) of the cases, the most relevant component of the original image even belongs to the \(1\%\) top-scored components of the associated adversarial example. When looking at the positional shift of an original image’s most relevant component that still belongs to the top \(1\%\) or top \(10\%\) of the most relevant components in the adversarial ranking, a comparatively small change in position can be observed. For the majority of these components (more precisely, \(70\%\) of them), the change in position is below 7 when considering the top \(1\%\) of the components within the adversarial ranking, and less than 54 when taking the top \(10\%\) into account. This observation is also illustrated by Fig. 7. In \(3.75\%\) of the cases, the most relevant component of the original image and the most relevant component of the corresponding adversarial example are identical.

Considering the \(1\%\) of the original images’ most relevant components according to LRP-\(\alpha \beta \) with \(\alpha =1\), it can be observed that \(43.37\%\) of them also belong to the \(1\%\) of the top-ranked components of the corresponding adversarial examples. Approximately \(93\%\) of the original images’ top \(1\%\) even belong to the top \(10\%\) of the adversarial components mainly responsible for the classifier’s outcome. As illustrated by Fig. 8, the absolute positional change of an original component within the adversarial relevance ranking is less than 10 in \(70\%\) of the cases when looking at the \(1\%\) of the top-ranked adversarial components, and less than 54 when considering the top \(10\%\). Of particular interest is the change in position of the most relevant \(1\%\) of the original components that also belong to the \(1\%\) of the most relevant components of the respective adversarial examples; here, an average positional change of zero can be observed. In the case of the components with the highest \(10\%\) of relevance scores in the original image, \(64.12\%\) are also among the most relevant \(10\%\) in the adversarial relevance ranking, even though the adversarial examples are classified with an equally high probability. In fact, this even applies to strong attacks where the target class differs greatly from the original class, e.g., images originally belonging to the class bird vs. their corresponding adversarial examples belonging to the target class truck.

Fig. 8

Frequency distribution of the positional change of the \(1\%\) most relevant components of each original image according to LRP-\(\alpha \beta \), \(\alpha =1\), which also belong to the top \(1\%\) (left) or the top \(10\%\) (right) of the adversarial input components

Fig. 9

Frequency distribution of the positional change of the \(10\%\) most relevant components of each original image according to LRP-\(\alpha \beta \), \(\alpha =1\), which also belong to the top \(10\%\) of the adversarial input components

Furthermore, the positional change of the components that belong to both the top \(10\%\) of the original images and the top \(10\%\) of the corresponding adversarial examples averages zero as well (cf. Fig. 9). For \(50\%\) of these components, the absolute change in position is below 51. The relevance ranking and positional shift analysis were performed analogously based on relevance scores obtained by LRP-\(\alpha \beta \) with \(\alpha =2\) and LRP-\(\varepsilon \) with \(\varepsilon =1\). The overall tendencies of these results are similar to the results based on relevance scores obtained by LRP-\(\alpha \beta \) with \(\alpha =1\). Therefore, these results will not be discussed further; the main results can be found in Table 5.

Since the majority of the top-ranked \(1\%\) experience only a marginal relative change in position of 2.3 per thousand and the majority of the top-ranked \(10\%\) merely undergo a relative position change of \(1.76\%\), these top-score shifts between original images and adversarial examples cannot be considered a reliable foundation for explaining the change in classification between adversarial examples and original images. Given that the analysis did not discriminate between class affiliations or target class dependencies, these results indicate a general, characteristic problem of layer-wise relevance propagation.

Table 5 Share of the original images’ top-ranked components that also belong to the corresponding share of the adversarial examples’ top-ranked components based on the relevance scores obtained by LRP-\(\alpha \beta \) with \(\alpha =2\) and LRP-\(\varepsilon \) with \(\varepsilon =1\)

4.3 Discussion

Adversarial examples are generally characterized by high similarity to the original data. Therefore, edges in images rarely undergo significant changes in adversarial attacks (cf. Fig. 3). This feature is clearly highlighted by LRP by carving out almost identical contour lines for both the original image and the adversarial examples (cf. Figs. 5 and 6) while they are classified differently. Consequently, LRP emphasizes the image contour lines rather than actually explaining the network’s decision. This finding is also supported by the observations of Adebayo et al. [1], who show that some saliency methods (e.g., gradient\(\odot \)input) work like an edge detector, in combination with the work of Ancona et al. [2], who show that gradient\(\odot \)input is strongly related to LRP and even equivalent in some configurations. Taking further into account that visual comparison between object contours and saliency maps is a "poor guide in judging whether the saliency map is sensitive to the underlying model" [1], our results lead to the conclusion that assigning a relevance score to individual input components based on a layer-wise conservation principle to measure their importance in the decision process does not properly explain the behavior of a deep neural network.

However, this rationale is not inconsistent with other evaluations using relevance score-dependent perturbations of input components to analyze the explanatory power of LRP [3, 19, 28]. Since these approaches change components of objects marked relevant, they technically change the edges of these objects which should reduce the classification probability. But this does not explain how the network arrives at its decisions, because in the case of adversarial examples—where components allegedly relevant according to LRP mainly remain unchanged (cf. Figs. 7 and 8)—the classification decision is completely overturned.

5 Conclusion

In this paper, we presented a comprehensive statistical analysis and a novel approach to evaluating the explanatory power of LRP using adversarial examples as a relevance-score-independent perturbation. The performed analyses demonstrate that there is no significant difference between the saliency maps of adversarial images and those of the corresponding original ones. This leads to the conclusion that there is no evidence that LRP in its current form explains the CNN’s decision process for original images or adversarial examples in a comprehensible way. Nevertheless, our analyses show that adversarial examples offer the potential to uncover inconsistencies in the robustness and stability of explanations obtained by saliency methods. We believe that adversarial examples are a useful addition to the means of evaluating the explanatory power of such methods.

While our work was a first step in this direction, the presented approach can be used for consistency evaluations of other explainability methods (e.g., LIME [27, 35]) as well.