1 Introduction

Machine learning research and the development of machine learning models for applications are typically focused on accuracy and model performance for certain tasks. Yet, the interpretability and explainability of those models is often limited (Guidotti et al. 2018; Biran and Cotton 2017). While an accurate model might perform well on a given task and given data, we often do not understand what influenced its decisions. Explainable AI (XAI) techniques can enable such an understanding and thereby help to build trust and to validate the robustness of the model (Ribeiro et al. 2016a). Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) are two explainable AI techniques that have been gaining popularity recently (Roy et al. 2022). LIME was proposed by Ribeiro et al. (2016a) as a technique that provides an explanation for the predictions of any classification model. LIME supports both textual and tabular data, and its results can easily be visualized graphically. Further, LIME was validated and successfully used in several user studies designed to help humans understand how machine learning models make their decisions (Magesh et al. 2020; Goyal et al. 2017; Ribeiro et al. 2016b). SHAP was introduced by Lundberg and Lee (2017) as a unified approach to interpreting machine learning models. SHAP supports text classification with deep neural networks and can, in this domain, be compared to LIME. Kokalj et al. (2021) found that LIME and SHAP are the most widely used permutation-based explanation methods.

Within the domain of software engineering, the identification of bugs among the issues reported in an issue tracking system is an important task that supports the triage of issues, e.g., for the localization of bugs (Mills et al. 2018), and that can be supported by text mining. Following the guideline by Herzig et al. (2013), we define a bug as an issue report that documents corrective maintenance tasks that require semantic changes to the source code. Based on this characterization of bug issues, Herzig et al. (2013) provided clear guidelines, which researchers use to create ground truth data for issue types (Herzig et al. 2013; Herbold et al. 2022). Such data can be used to train and evaluate machine learning models, which provide promising results for the automated identification of bug issues (e.g., Herbold et al. 2020; Palacio et al. 2019; Von der Mosel et al. 2022). However, while past research showed that machine learning models can achieve satisfactory performance, we only have a limited understanding of how they achieve it.

With this study, we narrow this gap in our knowledge and aim to understand whether researchers can utilize XAI methods to discern how machine learning for the identification of bug issues works. Our study is set up as a confirmatory study, based on assumptions that we derive from prior work on issue type prediction. We analyze the explanations for the prediction of bug issues on a large scale and, at the same time, deepen our understanding of LIME and SHAP as tools for explaining machine learning models. We investigate whether it is easier to explain why an issue is a bug rather than why it is not a bug. Additionally, we evaluate the hypothesis that SHAP generates better explanations than LIME. We augment this with a look at the impact of data from different projects as a potentially powerful confounding factor, to understand if the explanation quality might rather be driven by where the data originates from. All analyses are based on manually validated explanations, which the authors rate according to four categories, i.e., whether the explanation is related to the prediction, whether the terms used in the explanation are unambiguous, whether the explanation captures the context, and whether we gain insights into the machine learning algorithm from the explanation. The contributions of our research are the following:

  • A large data set of LIME and SHAP explanations of issue type predictions, rated according to the quality of the explanations. The data contains 3,090 issues, including the machine learning model’s prediction of whether they are a bug or not and the corresponding LIME and SHAP explanations.

  • We found that, contrary to our assumption, there is no significant difference in the explanation quality between bug issues and non-bug issues. This indicates that the models are powerful enough to not only learn a signal for the rather narrow concept of bugs, but also for the very broad concept of non-bugs, which includes, e.g., feature requests, dependency updates, questions, tasks, and release planning.

  • While our results are inconclusive regarding the absence of any effect, we show that even though some projects might have a small effect on the explanation quality, the quality is mostly independent of the projects, i.e., stable across contexts.

  • Our raters generally gave very positive ratings for the quality of the explanations of both LIME and SHAP, though LIME has some problems with ambiguity and contextuality because its explanations are based on individual words, in comparison to SHAP, which can use sentence fragments to avoid this problem.

  • The high quality of the explanations and the ability to explain both bug issues and non-bug issues indicate that the model we used to identify bugs has learned this task robustly, in a generalizable manner, and based on a signal that is aligned with our human expectations.

The remainder of this paper is structured as follows. In Section 2, we give an overview of the related work. Next, we introduce our research questions and hypotheses in Section 3. Then, we describe our research protocol, including materials, variables, execution plan, and study design in Section 4. Thereafter follow the results of our study in Section 5 as well as a discussion of said results in Section 6. In Section 7, we describe the threats to validity in regard to our study, followed by the conclusion in Section 8.

2 Background and Related Work

Explainability methods can be divided into global and local explainability (Guidotti et al. 2018). The goal of global techniques is to provide an understanding of the whole model: the explanation is able to explain all entries from the data set and can also apply to generic new data. Local explanations, in contrast, aim to explain only a single instance or a small number of entries from the data set (Visani et al. 2020).

In our study, we consider explanations that are generated using LIME and SHAP. Ribeiro et al. (2016a) proposed LIME as a technique to explain the predictions of any classifier in an interpretable and faithful manner. To this end, LIME perturbs the input around the instance to be explained and thereby generates a neighborhood of instances. These generated instances are weighted according to their proximity to the original instance. Finally, a linear model is fitted that approximates the classifier well in the neighborhood of the original instance. The technique has already been used in different domains, for example, music content analysis (Mishra et al. 2017) or the classification of lymph node metastases (Palatnik de Sousa et al. 2019). Furthermore, LIME has shown promising results in the domain of defect prediction (Jiarpakdee et al. 2021). In this study, we extend this scope to issues from software projects, specifically the classification of issues into bugs and non-bugs. Besides LIME, we use SHAP as a second explanation algorithm. Lundberg and Lee (2017) proposed SHAP as a unified method to explain the predictions of different machine learning models. SHAP uses concepts from cooperative game theory and assigns each attribute of the input of a machine learning model an importance value, based on the impact on the prediction when the feature is present or absent during the SHAP estimation (Lundberg and Lee 2017). In order to explain different types of models, SHAP has different variants such as Linear SHAP, Low-Order SHAP, Max SHAP, or Deep SHAP. These variants help to reduce the compute time required to calculate the model-specific explanation. In this work, we use the Deep SHAP variant for natural language models. Lundberg and Lee (2017) noted that LIME is a special case of SHAP.
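To make the difference between the two algorithms more concrete, the following minimal sketch illustrates how LIME and SHAP explanations could be generated for a binary bug/non-bug text classifier in Python. The classifier is a toy keyword heuristic that stands in for a real model, all names are illustrative, and the sketch uses the generic SHAP text interface rather than the Deep SHAP configuration described later in Section 4.3.

```python
# Illustrative sketch: generating LIME and SHAP explanations for a binary
# "bug" vs. "non-bug" text classifier (toy stand-in model, not the study's pipeline).
import numpy as np
from lime.lime_text import LimeTextExplainer
import shap

class_names = ["non-bug", "bug"]

def predict_proba(texts):
    # Stand-in for the real classifier: a keyword heuristic so the sketch runs
    # end-to-end; replace with the fine-tuned model's class probabilities.
    bug_terms = ("error", "exception", "crash", "fail")
    p_bug = np.array([0.9 if any(t in str(x).lower() for t in bug_terms) else 0.1 for x in texts])
    return np.column_stack([1.0 - p_bug, p_bug])

issue_text = "NullPointerException when closing the connection to the server"

# LIME: perturb the text locally and fit a weighted linear surrogate model.
lime_explainer = LimeTextExplainer(class_names=class_names)
lime_exp = lime_explainer.explain_instance(issue_text, predict_proba, num_features=10)
print(lime_exp.as_list())  # [(word, weight), ...]

# SHAP: attribute the prediction to text fragments; a text masker lets SHAP
# group tokens into fragments.
masker = shap.maskers.Text(r"\W+")
shap_explainer = shap.Explainer(predict_proba, masker, output_names=class_names)
shap_values = shap_explainer([issue_text])
print(shap_values[0].values)  # per-fragment contributions per class
```

In this sketch, LIME returns a list of (word, weight) pairs for the ten requested features, while SHAP returns one contribution per text fragment and class.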

Tantithamthavorn and Jiarpakdee (2021) outlined that explainable AI is very important for software engineering. The main focus in the current literature is on defect prediction. For instance, multiple studies have shown that LIME is able to help developers localize which lines of code are the most risky and to explain the predictions of defect models (Tantithamthavorn and Jiarpakdee 2021; Jiarpakdee et al. 2021). However, these studies worked with tabular data, in contrast to our application to text mining. Similarly, Roy et al. (2022) compare LIME and SHAP explanations based on tabular data for defect prediction. In their study, they find that LIME and SHAP cannot be used interchangeably and that they tend to disagree more on the ranking of features than on the sign of a feature’s importance. Extending the idea of LIME, Pornprasit et al. (2021) proposed PyExplainer, a local rule-based model-agnostic technique for generating explanations of Just-In-Time defect predictions. In two case studies, PyExplainer outperforms LIME in explaining the local instance, for example, by improving the synthetic neighbor generation. However, PyExplainer works only with tabular data and has no support for textual data.

An important task in the training of defect prediction models is the correct classification of issue types. The potentially large impact of correct classifications on defect prediction research was shown by Herzig et al. (2013).

Herbold et al. (2020) summarized that knowledge about how issue type predictions work is still limited. To the best of our knowledge, no explainable AI techniques (including LIME and SHAP) have been investigated for issue type predictions so far.

3 Research Questions and Hypotheses

Considering our primary goal of the study, we state the following research question and hypotheses.

  • RQ: How is the correctness of the prediction and the quality of the explanation correlated?

    • H1 Correct predictions of bug issues have a higher qualitative score than correct predictions of non-bug issues.

    • H2 The projects have no direct influence on the qualitative scores.

    • H3 The qualitative scores of SHAP are higher than the scores of LIME.

We derived hypothesis H1 from the definition of bugs within the guidelines by Herzig et al. (2013), which define how to identify bugs within issue tracking systems. While the criteria for bugs are relatively focused (e.g., dereferencing a null pointer, memory issues, or other crashes), there are many different types of issues for non-bugs, e.g., changing requirements, requests for new features, suggestions of refactorings, architectural considerations, documentation tasks, and licensing issues. Consequently, when a machine learning model needs to decide whether an issue is a bug or not, we believe that it focuses on identifying characteristics of bugs (e.g., occurrence of exceptions, dissatisfaction of users, mention of mistakes). We hypothesize that if issues exhibit the characteristics of bugs, they are classified as such; otherwise, they are classified as non-bugs. If we are correct, this would mean that the explanations for bug issue predictions should focus on the above-mentioned characteristics of bugs and be helpful. However, it also follows that predictions of non-bugs are the result of the absence of a signal, which means that LIME and SHAP would not be able to find a good explanation, as there is no signal to explain.

We derived hypothesis H2 from the work by Ribeiro et al. (2016b). They demonstrated that LIME is able to provide insights to the user of a machine learning model in different domains. They show that these insights help the user to understand the model and its behavior and to build trust in the quality of the predictions (Ribeiro et al. 2016b). Thus, this trust is a direct consequence of the quality of the explanations. However, Alvarez-Melis and Jaakkola (2018) showed that LIME is not always stable, and Zhang et al. (2019) investigated potential reasons for the lack of robustness. They showed that a potential source is the randomness in its sampling procedure. Overall, we assume that this source has a smaller effect on the quality of the explanation than other sources, like the machine learning model itself. Thus, the qualitative score should be stable across different projects for the same machine learning model. For SHAP explanations, Man and Chan (2021) showed that the stability of SHAP depends on the background sample size: with an increase in the background sample size, the stability increases. Since we plan to investigate a large data set and are not aware of any reason why the arguments for LIME should not apply to SHAP, we expect that the hypothesis also holds for SHAP.

We derived hypothesis H3 from the work of Kokalj et al. (2021) and Lundberg and Lee (2017). Lundberg and Lee (2017) derived the explanation algorithm SHAP as a generalization of LIME, which is suitable for and tested on deep neural networks. Therefore, we believe that SHAP outperforms LIME. In comparison, Ribeiro et al. (2016b) proposed LIME as an explanation tool for any classifier. Lundberg and Lee (2017) compared explanations from LIME and SHAP to validate their consistency with human intuition. They found a much stronger agreement between human explanations and SHAP than with other methods. However, we are not aware of an independent confirmation of these findings for the domain of issue type prediction. A further reason for this hypothesis is the work by Slack et al. (2020), who outlined that LIME is more vulnerable to fooling than SHAP: adversarial classifiers can be constructed that fool post hoc explanation techniques which rely on input perturbations (Slack et al. 2020). Thus, the stability of LIME on a large data set could suffer from these vulnerabilities, which would lead to an inferior performance.

4 Research Protocol

This section discusses the research protocol, which was pre-registered (Ledel and Herbold 2022) for this study. Structure and contents are closely aligned with the pre-registration. All deviations are summarized in Section 4.5.

4.1 Materials

In this section, we describe the materials, including the data set of issues and the issue type predictions of the machine learning model. Next, we describe the subjects, namely the LIME and SHAP explanations of the issue type predictions.

4.1.1 Predictions of Bug and Non-Bug Issues

For the prediction of issue types, we used a fine-tuned seBERT model, which was demonstrated by Trautsch and Herbold (2022) to outperform the best performing approach from a recent benchmark (Herbold et al. 2020). Since, for the scope of this paper, we are only interested in a prediction output with two classes, bug and non-bug, we fine-tuned the model with a training set consisting of manually validated bug reports. These reports were provided by Herbold et al. (2020) in a data set that consists of 30,922 issues with 11,154 manually validated bug reports from 38 open source projects. Their data set is, to the best of our knowledge, the largest manually validated data set of bug reports.
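For illustration, the following sketch shows how a fine-tuned seBERT-style checkpoint could be loaded for binary issue classification with the Hugging Face transformers library; the checkpoint identifier is a placeholder and not the model artifact used in our study.

```python
# Minimal sketch: loading a fine-tuned transformer for binary issue
# classification. The checkpoint path is hypothetical.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

checkpoint = "path/to/finetuned-sebert-bug-classifier"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("NullPointerException when closing the connection"))
```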

4.1.2 Subjects

The subjects of this study are the LIME and SHAP explanations of issue type predictions. We investigate the explanatory power of LIME and SHAP to examine the relationship between the prediction performance of the model and the quality of the LIME and SHAP explanations. Furthermore, we compare the quality of the LIME and SHAP explanations with each other. We use stratified sampling of issues for each project, since the class of bug issues is smaller than the class of non-bugs (Taherdoost 2016). We use 10% of the issues from each project to ensure that the sample contains issues from every project. This results in a sample size of 3,090 issues. An overview is provided in Table 1. The table further provides insight into the prediction performance of the fine-tuned seBERT model. Overall, the model achieved an accuracy of \(94\%\) and an F\(_1\)-measure of 0.85, i.e., a strong performance that is higher than that of all models from a recent benchmark, which already determined that such models are sufficiently mature to be used in certain use cases (Herbold et al. 2020).
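The stratified sampling can be implemented concisely; the following sketch assumes a pandas DataFrame with the hypothetical columns project and is_bug.

```python
# Minimal sketch of the stratified 10% sampling per project; the DataFrame
# column names "project" and "is_bug" are illustrative.
import pandas as pd

def stratified_sample(issues: pd.DataFrame, frac: float = 0.10, seed: int = 1) -> pd.DataFrame:
    # Sample the same fraction from every (project, class) group so that all
    # projects and both classes are represented in the sample.
    return (
        issues.groupby(["project", "is_bug"], group_keys=False)
        .apply(lambda group: group.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )
```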

Table 1 Statistics about our sampled data, i.e. the number of validated bug issues, the number of validated non-bug issues, the absolute numbers for true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn), as well as the total number of issues sampled from each project in the data set provided by Herbold et al. (2020)

Based on an initial estimation for a sample size of 3,090, we calculated the effort required to rate the explanations. Assuming that rating a single issue takes between 5 and 10 minutes, we expected the total amount of work for a single rater to be at least \(5~\frac{\text{minutes}}{\text{issue}} \cdot 3{,}090~\text{issues} = 15{,}450~\text{minutes} = 257.5~\text{hours} \approx 33~\text{days}\). Accordingly, the maximum amount of work expected for a single rater was \(10~\frac{\text{minutes}}{\text{issue}} \cdot 3{,}090~\text{issues} = 30{,}900~\text{minutes} = 515~\text{hours} \approx 66~\text{days}\). The time per issue can vary significantly based on the clarity, length, and level of detail of the issue.

Prior studies of LIME and SHAP explanations are limited in the number of analyzed explanations and in the scope of applications, e.g., highlighting lines of code in defect predictions to support the developer (Wattanakriengkrai et al. 2020). We broaden this scope in our study with respect to the task, i.e., issue type classification, and the number of rated explanations. A further limitation is that only few studies have compared LIME and SHAP explanations against each other in a user study.

4.2 Variables

We rate the LIME and SHAP explanations of the predictions of the machine learning model based on a qualitative rating of each explanation for the four categories listed in Table 2. The categories are designed to measure the quality from different perspectives and aim to help us understand whether the explanations are directly related to the definition of bug issues (e.g., using terms like feature or bug), whether the words used are unambiguous (e.g., polysemes that are used out of context), whether the different aspects of the explanation are contextually related (e.g., whether words are close to each other in the text or semantically related, like components and features of these components), and whether the overall explanation provides insights into how the machine learning algorithm likely determined the result. For each category, we decide if the category applies (+1), is neutral (0), or does not apply (-1). In this sense, neutral means that the criteria only partially apply, e.g., that part of the explanation is related, but other parts are not. The average qualitative rating is then computed by averaging over the four categories. We note that this rating is similar to a three-point Likert scale (Likert 1932).

Table 2 List of categories for which a positive, neutral, or negative rating is performed for each explanation

For each LIME and SHAP explanation, we measure the following dependent variables with respect to the issue type prediction of the approaches.

  • \(qs \in [-4, 4]\): The qualitative rating of the explanation (LIME or SHAP) of an issue. For each explanation, qs equals the sum of the ratings of the four categories, averaged over all raters (see the sketch below). The minimum value of \(-4=\frac{-1 + -1 + -1}{3} \cdot 4\) is achieved if all four categories do not apply for all raters; the maximum value of \(4=\frac{1 + 1 + 1}{3} \cdot 4\) results if all four categories apply for all raters.
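The following sketch shows this computation for a single explanation; the ratings are assumed to be available as a 3 (raters) × 4 (categories) array with entries in {-1, 0, +1}.

```python
# Minimal sketch of the qs computation for one explanation.
import numpy as np

def qualitative_score(ratings: np.ndarray) -> float:
    # ratings: shape (3 raters, 4 categories) with entries in {-1, 0, +1}.
    # Sum the four category ratings per rater, then average over the raters;
    # the result lies in [-4, 4].
    return float(ratings.sum(axis=1).mean())

print(qualitative_score(np.ones((3, 4))))   # all categories apply -> 4.0
print(qualitative_score(-np.ones((3, 4))))  # no category applies -> -4.0
```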

As independent variables, we use a one-hot encoding of the confusion of the prediction, such that we have

  • \(tp \in \{0, 1\}\): 1 if the prediction of the issue by the machine learning model is a true positive, 0 otherwise.

  • \(tn \in \{0, 1\}\): 1 if the prediction of the issue by the machine learning model is a true negative, 0 otherwise.

  • \(fp \in \{0, 1\}\): 1 if the prediction of the issue by the machine learning model is a false positive, 0 otherwise.

  • \(fn \in \{0, 1\}\): 1 if the prediction of the issue by the machine learning model is a false negative, 0 otherwise.

Moreover, we consider the following confounding variables, i.e., variables that may be alternative explanations for our results.

  • \(p_i \in \{0,1\}\) for \(i=1,...,38\): A one-hot encoding of the project to which the issue belongs. 1 if the issue belongs to project i, 0 otherwise. This confounder considers if the quality of the explanations may also be explained by the projects, for which they are produced.

  • \(n_i \in \{1,...,3090\}\) for \(i=1,...,3\): the number of issues already seen by each of the three raters. For the first presented issue, \(n_i=1\); for the second issue, \(n_i=2\); and so on. This confounder measures whether the expectations of a rater change over time, i.e., whether the rater judges the quality of the explanations differently between the first and the last issues.

Fig. 1 Visualization of an example LIME and SHAP explanation of an issue description. We provide both a bar chart of the impact of different terms on the label and a highlighting of these words within the text of the issue description

4.3 Execution Plan

The main task of the execution is to rate the LIME and SHAP explanations with respect to their quality. Three of our co-authors act as raters and assess all issues independently; the average of these three qualitative assessments is the foundation of our variable qs.

The task of the raters is to perform a qualitative rating of the LIME and SHAP explanations. During the rating process, a rater is presented with one issue at a time in a random order. For each issue, we show the rater the explanations of LIME and SHAP side-by-side, with the left-to-right order of the explanations being randomized. Below each explanation, the rater then provides their rating based on the categories from Table 2. After the rater finishes the rating, they submit the data, and a new, previously unrated issue is selected at random. One might note that the execution plan could be switched to an AB/BA design (Madeyski and Kitchenham 2018). However, because we only have three raters, the population sizes of n=2 for AB and n=1 for BA would be extremely low, leading to uncertain results.

Figure 1 shows a screenshot of the labeling tool. For each XAI method, it displays a text field containing the issue title and the issue description. LIME and SHAP use this text to calculate the relevant words based on the local prediction of the machine learning model (Ribeiro et al. 2016a). We chose to execute LIME with ten features. On the one hand, too few features may result in an inadequate explanation (Zhang et al. 2019). On the other hand, since LIME already chooses the best local approximation for the given number of features, more features do not increase the level of information presented to the rater. The output of SHAP is structured differently: SHAP does not require a parameter that fixes the number of features, and the features that SHAP identifies consist of groups of words. In order to make the output similar to the LIME explanations, we choose only the ten most relevant word groups, matching the feature limit set for LIME. Regarding the configuration of SHAP, we use Deep SHAP with an optimization for natural language models. The optimization adds coalitional rules to traditional Shapley values, which helps to explain large modern natural language models using few function evaluations.
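The selection of the ten most relevant SHAP word groups can be expressed as follows; the sketch assumes that the explanation has already been flattened into parallel lists of text fragments and scores, and all names are illustrative.

```python
# Minimal sketch: keep only the k fragments with the largest absolute
# contribution, mirroring the feature limit used for LIME.
import numpy as np

def top_k_fragments(fragments: list[str], scores: list[float], k: int = 10):
    order = np.argsort(-np.abs(np.asarray(scores, dtype=float)))[:k]
    return [(fragments[i], scores[i]) for i in order]

print(top_k_fragments(["the connection might change", "you try", "same"],
                      [0.31, 0.12, -0.05], k=2))
```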

In the labeling tool, we display the ten most relevant features from the concatenated title and description according to their association with the category “bug” or “non-bug”. In the text field, the features associated with the category “bug” are highlighted in red, while those associated with the category “non-bug” are highlighted in blue. We also show the score computed for each feature by either LIME or SHAP (Ribeiro et al. 2016a). The score reflects the relevance of the feature to the outcome and is represented by the color strength of the highlight: the feature with the highest score receives an alpha value of 1 and the other features are scaled linearly. Additionally, a bar chart is displayed on top of every text field to support the explanation.
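The linear scaling of the highlight strength corresponds to the following computation; the feature scores in the example are illustrative.

```python
# Minimal sketch of the highlight strength: the feature with the highest
# absolute score gets alpha 1.0, all other features are scaled linearly.
def highlight_alphas(scores: dict[str, float]) -> dict[str, float]:
    max_abs = max(abs(s) for s in scores.values())
    return {feature: abs(s) / max_abs for feature, s in scores.items()}

print(highlight_alphas({"support": 0.42, "Better": 0.21, "when": -0.07}))
# {'support': 1.0, 'Better': 0.5, 'when': 0.166...}
```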

We present the ratings for the categories introduced in Table 2 as three-point Likert scales. The raters decide whether each category applies, is neutral, or does not apply for each of the explanations individually. This decision is based on three aspects: 1) the explanation of the categories from Table 2; 2) the issue title and description provided in the text box; and 3) the raters’ interpretation of the depicted terms and their impact on the prediction as depicted by the bar charts. A rater can only submit the result for the complete issue at once. As described in Section 4.2, we only rate the quality of the explanation, not of the prediction itself. Thus, the true label of the issue is displayed for the rater.

Each rater starts with a short tutorial that clarifies the functionality of the toolkit to avoid wrong interpretations of the explanations. Additionally, we explain the definition of bug issues according to Herzig et al. (2013), which is used in the data, to establish common knowledge among the raters. After completion, the rater starts with the first issue. The issue is automatically selected at random from the sampled issues from all projects that were not already completed by that rater. A rater is allowed to skip an issue; however, we record the position of the issue with the confounding variable n. Each issue that was selected according to the sampling strategy is shown to every rater of the study. Thus, each rater has to rate 3,090 issues for both explanation algorithms.

4.4 Study Design

To answer our research question and hypotheses, we execute the following three analysis phases on our collected data. Each of the phases addresses one of our hypotheses.

The core of our analysis is a linear model, which we use to determine the relationship between our variables. We then analyze the coefficients of the linear models to evaluate our hypotheses. As input for the fitting of the linear model, we use all pairs of issues and predictions (see Section 4.2). We measure the goodness-of-fit of the models with the \(R^2\) coefficient.

Additionally, we provide the distributions of the dependent variable qs for each of the explanation models: overall, grouped by the predicted labels, and grouped by the correctness of the predictions. To enable more in-depth insights into the qualitative ratings, we also provide the distributions of the separate quality categories of the explanations. These distributions are used to augment the analysis of the hypotheses and the resulting discussion of the results, to provide insights into the explanations beyond the pure statistical analysis.

To gain insights into the validity of our data, we measure the agreement between raters. We report Cronbach’s \(\alpha \) (Cronbach 1951) to estimate the reliability of the qualitative ratings for each category listed in Table 2, which is defined as

$$\begin{aligned} \alpha = \frac{n}{n - 1} (1-\frac{\sum _{i}V_i}{V_t}) \end{aligned}$$
(1)

where n is the number of items (in our case, the three raters), \({V_i}\) is the variance of the ratings of the individual items, and \({V_t}\) is the overall variance when all items are considered, i.e., the variance of the summed ratings for that category.
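A direct implementation of Eq. (1) is straightforward; the sketch below assumes the ratings of one category in a DataFrame with one column per rater and cross-checks the result with the pingouin library. The toy ratings are illustrative.

```python
# Minimal sketch of Eq. (1) for one category; rows are rated explanations,
# columns are the three raters, values are in {-1, 0, +1}.
import pandas as pd
import pingouin as pg

def cronbach_alpha(ratings: pd.DataFrame) -> float:
    n = ratings.shape[1]                               # number of items (raters)
    item_variances = ratings.var(axis=0, ddof=1)       # V_i per rater
    total_variance = ratings.sum(axis=1).var(ddof=1)   # V_t of the summed ratings
    return n / (n - 1) * (1 - item_variances.sum() / total_variance)

ratings = pd.DataFrame({"r1": [1, 0, 1, -1], "r2": [1, 1, 0, -1], "r3": [0, 1, 1, -1]})
print(cronbach_alpha(ratings))
print(pg.cronbach_alpha(data=ratings))  # cross-check with pingouin
```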

Table 3 Interpretation of Spearman’s \(\rho \) according to Cohen (1988)

We complement this measurement with Spearman’s \(\rho \) (Spearman 1904), which we compute between pairs of raters for qs and for each category of Table 2 separately, allowing us to gain insights into the differences between the four categories. Spearman’s \(\rho \) is defined as

$$\begin{aligned} \rho = \frac{cov(RS_1, RS_2)}{\sigma _{RS_1}\sigma _{RS_2}}, \end{aligned}$$
(2)

where \(RS_1, RS_2\) are the ranks of two raters, and \(\sigma _{RS_1}, \sigma _{RS_2}\) the standard deviations of the ranks. We use the table from Cohen (1988) for the interpretation of \(\rho \) (see Table 3).
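In practice, Spearman’s \(\rho \) and its p-value can be obtained directly from scipy; the two rating vectors below are illustrative.

```python
# Minimal sketch of Eq. (2) using scipy.
from scipy.stats import spearmanr

rater_1 = [4.0, 2.0, 3.0, -1.0, 0.0]
rater_2 = [3.0, 2.0, 4.0, -2.0, 1.0]

rho, p_value = spearmanr(rater_1, rater_2)
print(f"rho={rho:.3f}, p={p_value:.3f}")
```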

4.4.1 Hypothesis H1

We use a linear model of qs based on our independent and confounding variables, i.e.,

$$\begin{aligned} qs = b_0 + b_1\cdot tp +b_2\cdot tn + b_3\cdot fp + b_4\cdot fn \\ + b_5\cdot n_1 + b_6\cdot n_2 + b_7\cdot n_3 + \sum _{i=1}^m b^p_i \cdot p_i, \end{aligned}$$
(3)

where tp, tn, fp, and fn represent the correctness of the prediction as true positive, true negative, false positive, and false negative. Additionally, we evaluate the statistical difference between qs on the subset of issues that are true positives (i.e., correctly predicted bug issues) and the subset of issues that are true negatives (i.e., correct predictions of non-bug issues). Depending on the normality of the data, we either use the t-test with Cohen’s d to measure the effect size, or the Mann-Whitney U test with Cliff’s \(\delta \) as effect size. Based on our hypothesis, we expect that the coefficient for tp is greater than the coefficient for tn, as well as a significant difference between the subsets with a non-negligible effect size.
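The following sketch outlines this analysis, assuming a pandas DataFrame with the variables from Section 4.2; the column names are illustrative and the sketch is not the pre-registered analysis script itself.

```python
# Minimal sketch of the H1 analysis on a DataFrame with illustrative columns
# qs, tp, tn, fp, fn, n_1, n_2, n_3, and project.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import mannwhitneyu

def analyze_h1(df: pd.DataFrame):
    # Linear model corresponding to Eq. (3); C(project) expands the one-hot
    # encoding of the projects. Note that tp + tn + fp + fn always sum to one,
    # so the coefficients should be interpreted relative to the intercept.
    model = smf.ols(
        "qs ~ tp + tn + fp + fn + n_1 + n_2 + n_3 + C(project)", data=df
    ).fit()

    # Compare qs of true positives against true negatives.
    qs_tp = df.loc[df["tp"] == 1, "qs"].to_numpy()
    qs_tn = df.loc[df["tn"] == 1, "qs"].to_numpy()
    _, p_value = mannwhitneyu(qs_tp, qs_tn, alternative="two-sided")

    # Cliff's delta: P(tp > tn) - P(tp < tn) over all pairs.
    diff = qs_tp[:, None] - qs_tn[None, :]
    cliffs_delta = (diff > 0).mean() - (diff < 0).mean()
    return model, p_value, cliffs_delta
```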

4.4.2 Hypothesis H2

We use the linear model derived from (3) to analyze hypothesis H2, i.e., we consider the magnitudes of the coefficients for \(p_i\), with \(p_i\) being the \(m=38\) one-hot encoded projects. We expect that \(p_i\) are associated with a small coefficient and that there is no correlation between \(p_i\) and qs.

4.4.3 Hypothesis H3

We use the average qs of LIME and SHAP to analyze hypothesis H3, i.e., we evaluate the statistical difference between the qs of the subsets of LIME and SHAP explanations. Depending on the normality of the data, we either use the paired t-test with Cohen’s d to measure the effect size, or the Wilcoxon signed-rank test with Cliff’s \(\delta \) as effect size. We expect that the qs of the subset of SHAP explanations is significantly greater than that of the subset of LIME explanations, with a non-negligible effect size.
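A sketch of this comparison, assuming arrays with the per-issue average qs of the LIME and the SHAP explanations in the same issue order, could look as follows.

```python
# Minimal sketch of the H3 comparison.
import numpy as np
from scipy.stats import wilcoxon

def compare_lime_shap(qs_lime: np.ndarray, qs_shap: np.ndarray):
    # Paired, non-parametric test on the per-issue differences.
    _, p_value = wilcoxon(qs_shap, qs_lime)
    # Cliff's delta as the effect size: P(shap > lime) - P(shap < lime).
    diff = qs_shap[:, None] - qs_lime[None, :]
    delta = (diff > 0).mean() - (diff < 0).mean()
    return p_value, delta
```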

4.5 Summary of Deviations

During the execution of the research, we encountered situations that required us to deviate from the pre-registered research protocol. All deviations are listed below.

  • We implemented SHAP using the Deep SHAP variant introduced by Lundberg and Lee (2017). In contrast to the implementation of LIME, Deep SHAP generated explanations which may consist of more than one token. This leaked the information about which explainer the rater was looking at. Since both explanations were shown side-by-side, the raters could freely decide if they first label the LIME or the SHAP explanation in a way that we cannot measure with the \(first_{lime}\) variable. We therefore do not include this analysis in the paper.

  • We planned to generate separate explanations for the title of an issue and its description. We deviated from this plan and instead concatenated the title and description, separated by a whitespace. This was required since the underlying model, seBERT, makes its prediction on a single document. The deviation further influences the visualization of the explanations. Instead of the five-most-relevant words from the title and the ten-most-relevant words from the description, we displayed and highlighted the ten-most-relevant words from the joined text.

  • We use the more appropriate Cronbach’s \(\alpha \) (Cronbach 1951) for the inter-rater agreement, as Fleiss’ \(\kappa \) (Fleiss 1971) ignores the order of the items on our Likert scale.

  • The study design for hypothesis 2 inadvertently said that the true positives / true negatives from hypothesis 1 would be analyzed instead of the \(p_i\) actually related to hypothesis 2.

  • We extended the general reporting on the quality to also report qs grouped by the correctness of the predictions. This extends the already planned reporting grouped by explanation model and grouped by predicted labels.

5 Results

We now present the results of our experiments. First, we determine the reliability of the qualitative ratings. Then we take a look at the distribution of the data over different subsets and perform statistical analysis to understand their properties. Finally, we present the linear model, which is at the core of our analysis, including related statistical analysis.

5.1 Reliability of the Qualitative Ratings

This subsection describes the measures taken to determine the reliability of the ratings. It follows the research protocol in Section 4.4.

Cronbach’s \(\alpha \) allows us to determine the inter-rater reliability for the qualitative ratings obtained during the execution of our study. Table 4 contains the results for both the overall agreement per category and separate values for each algorithm, calculated with the pingouin library version 0.5.3 in Python. We acknowledge that our data is not normally distributed and that this limits the validity of Cronbach’s \(\alpha \) as a measurement for inter-rater reliability. However, Sheng and Sheng (2012) found that estimates for very large populations, like the one we are considering in our study, are robust, i.e., the effect of this should not be strong. We also acknowledge that the agreement values are relatively low compared to the threshold of 0.6 for Cronbach’s \(\alpha \) that is typically required to be considered at least satisfactory, and even more so in comparison to the often used threshold of 0.7 (Taber 2017). However, as Taber (2017) discusses, for tasks where broad knowledge is required (in our case thousands of bugs in tens of projects), lower values of \(\alpha \) can still be acceptable. Specifically, we observe that the agreement in the category contextual is especially low, while the highest agreement can be found in the category insightful, especially for the LIME ratings. We further discuss the implications and reasons for this low agreement in Section 6.6.

Table 4 Arithmetic mean (M), standard deviation (\(\sigma \)) and Cronbach’s \(\alpha \) between the three raters for the ratings per category as well as for the subsets of explanations by LIME and SHAP
Table 5 Spearman’s \(\rho \) between each pair of raters regarding the four categories and the dependent variable qs
Fig. 2 Distribution of qs (a) per algorithm; (b) for predictions as bugs (tp, fp) and non-bugs (tn, fn); (c) correct predictions (tp, tn) and incorrect predictions (fp, fn); and (d) correct predictions of bugs (tp) and correct predictions of non-bugs (tn)

Next, we used Spearman’s \(\rho \) to gain insight into the agreement, or disagreement, between the pairs of raters regarding the four categories. We can observe a moderate correlation between rater 1 and rater 3 in the category insightful, while all other categories and combinations correlate weakly at best. The results can be found in Table 5, and they can be interpreted with the help of Table 3. Three of the p-values indicate that the respective correlation may be due to chance, i.e., it is not statistically significant. A thorough analysis of the low agreement will be presented in the discussion section.

5.2 Distributions

This section describes the distributions of the qualitative categories and the dependent variable qs based on different subsets, as well as the statistical analysis that were performed on those subsets.

Figure 2(a)-(c) shows the distribution of qs grouped by the explanation algorithm, the predicted labels, and whether the prediction is correct. We observe that there is an overall positive sentiment among the raters towards the explanations. We further observe that SHAP ratings are most frequent in the interval [3, 4], while LIME has lower ratings, most frequently in the interval [2, 3). The label of the prediction seems to lead to only a small difference, which we explore in greater detail when we consider H1: while there are slightly more predictions with very good ratings in the interval [3, 4] for bug issues, there are more predictions for non-bug issues in the interval [2, 3). We also observe that incorrect predictions have lower ratings than correct predictions. However, even the incorrect predictions mostly get positive explanations, though with more lower scores than for the correct predictions.

Figure 3 shows the ratings of the individual categories grouped by the algorithm. We observe that the categories related and insightful have a clearer tendency towards a positive rating than the categories unambiguous and contextual. Figure 3(b) and (c) for the ambiguity and contextuality hint at the reason why LIME has fewer perfect ratings. For the ambiguity, the most frequent ratings of LIME are located below 0.6, while SHAP leans towards the highest possible ratings. For the contextuality, this difference is less pronounced, but here LIME also scores lower than SHAP. In comparison, the differences between LIME and SHAP for the relatedness and insightfulness are very small.

Further, we calculated the correlation between the ratings per category using Spearman’s \(\rho \). The results are displayed in Fig. 4. The notable correlations are the following:

  • With LIME, all categories have correlations between 0.511 and 0.691. With SHAP, the correlations are between 0.457 and 0.745.

  • The strongest correlation exists between the categories related and insightful for both LIME (\(\rho =0.691\)) and SHAP (\(\rho =0.745\)). The categories contextual and insightful have the second-strongest relationship (LIME: \(\rho =0.654\), SHAP: \(\rho =0.652\)).

  • With SHAP, unambiguous and contextual correlate moderately (\(\rho =0.457\)), while the correlation with LIME is a bit stronger at 0.582.

Fig. 3 Distribution of ratings per category, per algorithm

Fig. 4 Correlation between the ratings per category, calculated with Spearman’s \(\rho \)

Table 6 Coefficients and p-values of the linear model

5.3 Linear Model

Table 7 Statistical results of the linear model

Table 6 shows the coefficients of the linear model of our dependent variable qs based on our independent variables. Table 7 shows the goodness-of-fit and the global p-value (i.e., whether any coefficient is likely significant). The reported \(R^2\) value of 0.071, as an indicator of the amount of variance in the data that the linear regression can explain, is low. This means that the linear model of the independent variables does not explain the data well. The p-value is close to zero, which indicates that there are significant coefficients and that the observed tendencies are unlikely to be random. Specific independent variables are further analyzed in the following sections regarding our hypotheses.

5.4 H1: Correct Predictions of Bug Issues have a Higher Qualitative Score than Correct Predictions of Non-Bug Issues

Figure 2(d) shows the distribution of qs of the correct predictions of bug issues (tp) versus the correct predictions of non-bug issues (tn). The figure indicates a possible small difference, because there are slightly fewer tn in the interval [3, 4] than tp. However, in the interval [2, 3), we observe a slightly stronger and opposite effect, i.e., more tn than tp, which offsets this difference on average. Since the distributions of qs for the subsets of tp and tn do not follow normal distributions, we applied the Mann-Whitney U test to check for a statistically significant difference between the ratings of the subsets. The results can be found in Table 8. The Mann-Whitney U test reported a p-value of 0.182, indicating that there is no significant difference between the distributions of the true positives and the true negatives. We therefore do not report Cliff’s \(\delta \). Thus, the result of the Mann-Whitney U test matches our observations from the analysis of Fig. 2(d), i.e., there does not seem to be a significant difference in qs between tn and tp.

Our linear model reports coefficients of 0.658 for the variable tp and 0.629 for tn. This indicates that both tp and tn have a strong positive effect on the dependent variable, with little difference between them. Furthermore, the p-values indicate statistical significance for both coefficients.

Table 8 Results of Mann-Whitney U tests and Cliff’s \(\delta \) with the effect sizes for pairs of subsets over qs

5.5 H2: The Projects have No Direct Influence on the Qualitative Scores

Table 6 shows the coefficients of the linear model for the projects \(p_i\) as indicators for correlation. The coefficients are between 0.327 and \(-0.303\). Only the coefficients of the projects ant-ivy, commons-lang, opennlp and systemml are statistically significant according to the p-values. Given a uniform distribution of p-values under the null hypothesis, we would have expected around two projects with a p-value lower than 0.05 without adjusting the family-wise error rate for repeated tests.

5.6 H3: The Qualitative Scores of SHAP are Higher than the Scores of LIME

Table 8 presents the median values of the subsets of LIME and SHAP explanations over qs. We observe that the median of SHAP at 2.667 is higher than the median of LIME at 2.000. We further observe that the distributions of LIME and SHAP over qs displayed in Fig. 2 do not follow a normal distribution. We therefore apply the Mann-Whitney U test to investigate the difference between the qualitative scores. The result of the test is a p-value of \(<0.001\), indicating a significant difference between the subsets. From Fig. 2 and from the median values, it seems apparent that this difference lies in a more favorable rating of the explanations made by SHAP compared to LIME.

6 Discussion

In this section, we discuss the results of our study, including the statistical tests, the linear model, and the information gathered through the coding process.

6.1 RQ: How is the Correctness of the Prediction and the Quality of the Explanation Related?

We observe that the explanations get better quality ratings when the predictions are correct. Thus, the accuracy seems to have an impact, and we can expect that models with a higher accuracy likely have better explanations than models with a lower accuracy for the same task. However, this does not mean that there are not also many high ratings for the explanations of incorrect predictions, or that the ratings for incorrect predictions are negative. Instead, there are only few negative ratings, and about 20% of the incorrect predictions even have an average qs of over three, i.e., a very positive rating. This indicates that the explanations for incorrect predictions are not bad, but are not sufficiently convincing for the raters to achieve consistently high scores. The high ratings for some wrong predictions indicate that the explanation algorithms are even somewhat able to convince raters of the predicted labels, as the quality of the explanations should otherwise not achieve a very high score. That being said, the coefficients of our linear model (see Table 6) indicate that the explanations rather convince raters that non-bugs are bugs (positive coefficient of 0.400 for fp). In contrast, when the model mispredicts a bug as a non-bug, there is a rather strong negative effect on the explanation quality (negative coefficient of -0.780 for fn). While our results do not allow for a definitive conclusion regarding the reason for this, the rationale for our hypothesis H1 may be an explanation: bugs have a clear signal, and if the explanation contradicts this clear signal, this negatively affects the explanation quality.

In the following, we discuss the specific aspects regarding the explanation quality, for which we formulated hypotheses, in greater detail to gain more insights into the reasons for the differences in the quality of explanations, beyond the accuracy of the prediction.

6.2 H1: Correct Predictions of Bug Issues have a Higher Qualitative Score than Correct Predictions of Non-Bug Issues

We hypothesized that the correct predictions of bug issues (true positives) have a higher qualitative score than correct predictions of non-bug issues (true negatives). This was rooted in the idea that there are few and rather clear criteria for what constitutes a bug, whereas there are many different kinds of issues for non-bugs. Therefore, we expected that the predictions and explanations would be based on the presence or absence of characteristics of a bug. Ultimately, we assumed, this should mean that explanations for true positives receive higher qualitative ratings.

Contrary to our assumption, our data reveals no significant difference between the true positives and true negatives. Both LIME and SHAP are able to explain true negatives with a similarly high quality as they can explain true positives. This indicates that there must be strong and distinct factors in explanations of true negative predictions that identify them as such. This means that the models are sufficiently powerful to learn the diverse patterns required to identify different kinds of non-bugs.


6.3 H2: The Projects have No Direct Influence on the Qualitative Score

We postulated the hypothesis that the explanation quality of predictions is not influenced by the projects in our data. In other words, issues should have a similar level of quality and expressiveness towards their label regardless of the project. Yet, our linear model showed that for \(10\%\) of the investigated projects, there was a significant influence on the quality, with absolute values of the coefficients between 0.190 and 0.288. Based on the design of our study, we would expect that the coefficients of 5% of the projects are significant, as this is the expected random error rate under the null hypothesis that the coefficients are not significant. Thus, the strength of this effect is beyond the expectation of a random result. However, we note that the values of the coefficients are rather low in comparison to the intercept and the one-hot encoding of the confusion matrix. When we look at the projects themselves for explanations, we do not find any notable pattern, as they cover different topics (i.e., build system, basic programming language elements, natural language processing, and machine learning), have different sizes (between 96 KLOC for commons-lang and 842 KLOC for SystemML, line counts for all files except HTML), and different ages (between 2003 for the initial release of commons-lang and 2015 for SystemML). Moreover, the 10% is not far from our expectation, especially when we consider that, in absolute numbers, this means that we only observe four instead of two out of 38 projects. Taken together, our results do not yield a conclusive picture. For example, the overall poor fit of the linear model could also explain this slightly higher number of projects, or it could be due to the disagreement between our raters.

Fig. 5 Example of an issue from the labeling interface. The upper half shows the explanation generated by LIME and the lower half the explanation generated by SHAP. The example was selected because its mean LIME ratings for the categories unambiguous and contextual are neutral, while insightful is still positive, and the respective SHAP ratings are all positive

6.4 H3: The Qualitative Scores of SHAP are Higher than the Scores of LIME

We assumed that SHAP should yield better explanations, as there is a variant tailored towards transformer models (Deep SHAP) that is able to detect sentence fragments instead of only single words, and because SHAP is a generalization of LIME. The data regarding the distributions of the explanations that were generated by LIME and SHAP support this hypothesis. Both the graph and the statistical test confirm that SHAP has higher qualitative scores than LIME, with a small effect size. The likely reason for this is not that SHAP’s explanations use more related terms, but rather that the explanations themselves are less ambiguous and more contextual; the insightfulness drops only slightly. The difference regarding ambiguity and context can directly be explained by how the algorithms work, i.e., the capability of SHAP to identify sentence fragments, which adds context and reduces ambiguity. While it may seem strange that the problems with ambiguity and context do not directly translate into less insight, our experience from labeling the data indicates that this is where our human knowledge can often fill the gap: even if the explanation is only partially easy to interpret within the context, the aspects that are contextual, as well as our own capability to disambiguate the explanations, enable us to still form an understanding of what happened.

To demonstrate this, we show two example issues. Figure 5 shows an issue where SHAP has positive ratings for unambiguous and contextual, while LIME has neutral ratings for these categories, but SHAP and LIME both have a positive rating for the insightfulness (Table 9). The rating for SHAP is obvious: it directly identified the complete text block that describes the core of the suggested improvement. Thus, the explanation may be a bit verbose, but its fragments are unambiguous, match the context well, and provide a clear insight into what affected the classification and why. With LIME, this looks different: when all words are considered together, one gets an insightful picture, i.e., that this is likely about “Better” “support” for “ZIP” and that the reason for the classification is the strong positive influence of the terms “support”, “Better”, and “useful”. Nevertheless, there are also problems with the explanation, e.g., the ambiguous term “when” (conjunction or adverb) and the usage of “cryptic” out of context (the issue is about encryption, not about a cryptic error message).

Table 9 Mean of ratings for the example-issue in Fig. 5

However, that this is not always the case is demonstrated by the issue shown in Fig. 6, where the negative ratings for unambiguous and contextual translate into a loss of insightfulness (Table 10). The four most important features identified by LIME are “it”, “same”, “create” and “connection”, while SHAP identified “one of them is”, “the connection might change”, “you try” and “same”. The features identified by LIME have mixed meanings and would require more context to provide insights (e.g., what kind of connection, what happens with the connection). From the features identified by SHAP, the fragments “the connection might change” and “you try” have an intrinsic meaning that hints in the direction of the unexpected behavior documented by the issue, i.e., that connections may change.


6.5 The Effect of the Labeling Process on the Qualitative Score

The labeling stretched out over the course of multiple months, which may affect the ratings, e.g., due to a learning curve of the raters. To control for this impact, we consider the confounding variables \(n_i\), which model the influence of the number of issues already labeled on the overall qualitative rating within our linear model. The coefficients for two raters are significant. While the absolute values of \(9\cdot 10^{-5}\) and \(2\cdot 10^{-4}\) may seem too small to have any effect, we need to consider that we rated 3,090 issues overall. When we multiply the coefficients by the number of issues, we can estimate the difference between the first issue and the last issue:

$$\begin{aligned} 9\cdot 10^{-5} \cdot 3090 = 0.278 \\ 2\cdot 10^{-4} \cdot 3090 = 0.618 \end{aligned}$$
(4)

This means that for rater two, on average one category was rated better \((+1)\) every fourth issue towards the end. For rater three, this happened every other issue. For rater one, there was no significant effect. Thus, while there may be a small effect, the concept drift over time should not affect our results in a meaningful way.

Fig. 6 Example of an issue from the labeling interface. The upper half shows the explanation generated by LIME and the lower half the explanation generated by SHAP. The example was selected because its mean LIME ratings for the categories unambiguous, contextual, and insightful are negative, while the respective SHAP ratings are positive

6.6 Impact of the Rating Reliability

As reported in Section 5.1, the overall agreement between the raters is not at a high level. To better understand if and how this affects our results, we looked at the data in greater detail. We observe that the trends for two raters look similar, regardless of the overall low agreement. The third rater is, in general, more positive, i.e., assigned higher scores than the other two, especially regarding the unambiguity and the context. A discussion among the raters indicates that the likely reason for this is a difference in the interpretation of the categories. Two raters always considered all ten shown fragments of the explanation. The third rater rather interpreted the explanation as a ranking, taking the importance of the terms for the explanation into account. As a consequence, if, e.g., the first three terms were sufficient for the explanation and the others only had a very low importance according to the algorithm, the third rater only took these three terms into account when determining the ambiguity and contextuality of the explanation. Since in such cases the remaining fragments of the explanation tend to be more random (as the algorithms already found an explanation, but need to continue finding fragments), this explains why these ratings are more positive.

Table 10 Mean of ratings for the example issue in Fig. 6

Nevertheless, these differences in interpretation affect our data, raising the question whether the conclusions that we reported above are likely to hold. To gain insights into this, we analyzed how our conclusions for each hypothesis would change if we were to only use one of the raters. For H1, we arrive at the same conclusions, even if we consider the three raters individually. For H2, we observe a similarly low number of projects with significant but small coefficients. Notably, there is no tendency for specific projects here at all, i.e., there is only one project that is reported as significant for two of our raters (eagle), but this project is not among the four projects when we consider all raters together. This actually supports our interpretation of the results for H2, i.e., that the observed significant coefficients could be explained by the randomness of p-values under the null hypothesis. For H3, we found that the results for two raters agree with our conclusions. However, for one of the raters, the data indicates no difference between LIME and SHAP. Overall, we observe that interpretations of individual explanations from explainable AI seem to be subjective, but also that general trends mostly hold when we consider a large sample.

6.7 Implications for Future Research

From our findings, there are two implications for future research that we want to highlight. The first concerns the use case we studied, i.e., the identification of bugs within issues. Our results indicate that such models are not only mature in terms of the performance they achieve, but our inspection of the local explanations of the results also indicates that they have a reliable understanding of what constitutes a bug or a non-bug issue. This raises the confidence that such models can work reliably when deployed within projects. However, how this actually supports developers and raises the productivity within projects is still not well understood. Consequently, we believe that the most important gap in our current knowledge that requires further attention by researchers is not the development of better models, but rather moving existing models into practice as part of case studies. Such case studies could also involve XAI, e.g., to understand if and how it is important for the adoption of such tools by building trust with developers.

The second major implication is that different XAI algorithms may give different explanations and, further, that explanations are perceived differently by humans. This is no new knowledge, but both aspects should not be underestimated when designing studies on the utility of XAI. In our work, this affected the agreement between raters (see Section 6.6) and also led to the effect that sometimes we have quite different explanations, but they are both insightful (see Fig. 6). How this affects research depends on the kind of study conducted.

In our case, we conducted a quantitative study and were able to deal with this problem through scale, i.e., by labeling such a large amount of data that the differences in interpretation do not affect the overall results, as the signal persists through the noise introduced by the differences between human judgments. Alternatively, one could try to design a better measurement construct in which the agreement between judgments is higher. This could, e.g., be achieved with more fine-grained variables that capture individual aspects and are subsequently aggregated during the analysis to evaluate broader aspects like the insightfulness (see the sketch below). However, we would argue that this is just a different kind of scale, as the number of human judgments per explanation is increased, with the goal that the individual judgments become less subjective. The advantage is that fine-grained variables may allow more insights. However, because the interpretation of explanations is still subject to the rater's interpretation, such a strategy also carries the risk that there will still be too much noise and that the scale in terms of the number of explanations rated is not sufficiently high to yield conclusive results. This risk is further increased by the larger number of variables, which also increases the risk of observing random results (e.g., inadvertent p-hacking). This is typically countered by using more data, which requires even more effort when there are more variables.
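
As an illustration of such a finer-grained construct, the sketch below aggregates several narrow rating items into a single broader score. The item names and values are hypothetical; the point is only that each explanation now requires several judgments instead of one.

# Sketch of a finer-grained measurement construct: several narrow items
# per explanation and rater are aggregated into one broader score.
import pandas as pd

items = pd.DataFrame({
    "explanation_id": [1, 1, 2, 2],
    "rater": ["A", "B", "A", "B"],
    "terms_relevant": [4, 3, 2, 2],     # hypothetical fine-grained item
    "context_captured": [5, 4, 2, 3],   # hypothetical fine-grained item
    "actionable": [4, 4, 1, 2],         # hypothetical fine-grained item
})

# Average first over the items, then over the raters, to obtain one
# broader 'insightfulness' score per explanation.
items["insightfulness"] = items[
    ["terms_relevant", "context_captured", "actionable"]
].mean(axis=1)
aggregated = items.groupby("explanation_id")["insightfulness"].mean()
print(aggregated)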

Another approach is to use a qualitative research design, in which explanations are not rated deductively on fixed scales but, e.g., with an inductive coding of aspects of explanations that is subsequently discussed either in interviews or in group sessions between raters. Such designs have the advantage that they yield a better understanding of how raters understand individual explanations, typically at the cost of scale and external validity. Consequently, such designs are well-suited to understand what the target audience of explanations gains, or to identify specific weaknesses in explanations and algorithms, but it may be more difficult to draw general conclusions as we do in this work. Of course, such a qualitative research design can also be combined with a quantitative design as a mixed-methods approach, in which deductive coding on fixed scales is augmented by identifying interesting aspects of explanations that are subsequently discussed in interviews.

Consequently, we advise future work to carefully take the difficulty of interpreting the results of XAI algorithms into account when designing studies, and to assess how the different methods and the effort involved with them are best suited to address the research questions. These considerations are study-dependent, but there is one piece of advice we can give based on our results independent of design and study goals: avoid, and do not trust, small-scale studies when evaluating XAI.

7 Threats to Validity

We report the threats to the validity of our work following the classification by Cook et al. (1979), as suggested for software engineering by Wohlin et al. (2012). Additionally, we discuss the reliability, as suggested by Runeson and Höst (2009).

7.1 Construct Validity

The construct of our study assumes that the categories related, unambiguous, contextual, and insightful are suitable to describe the quality of the LIME and SHAP explanations. Our results indicate that these criteria seem to measure distinct, but not unrelated, aspects of the quality, as indicated by our correlation analysis in Section 5.2. Moreover, our raters did not find that an additional category was missing. Both observations indicate that this choice does not impact the validity of our results. On the contrary, these categories allowed us to identify properties that distinguish the algorithms from each other. The second major aspect of our construct is our choice of model, i.e., a linear regression of the quality. The low goodness-of-fit indicates that this model may not be sufficiently powerful to express the relation between our variables. We discuss the impact of this on our conclusions below, together with other threats to the internal validity. A smaller threat is the instability of LIME (Garreau and von Luxburg 2020; Zhang et al. 2019), i.e., that different runs could yield different explanations. It is unclear how exactly this relates to the quality; the instability might affect the quality of individual explanations, or it might not. This aspect notwithstanding, we mitigate this threat by considering a large sample of issues. Thus, even if there were some cases in which LIME's instability affects the quality of individual explanations, the amount of data should still yield reliable trends. Another choice that possibly impacts the validity is our decision to let LIME and SHAP use all words for the explanations. As an alternative, we could have removed stopwords, as they may not yield much valuable information. However, modern natural language processing models based on the transformer architecture (Vaswani et al. 2017) do not conduct such filtering anymore, as they consider the whole context, including stopwords. Consequently, filtering stopwords would mean that we would not allow the XAI algorithms to access the full information that was utilized by the prediction model, which would have added new and, from our perspective, more severe threats to the validity.
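
The following sketch illustrates this configuration choice: the raw issue text is passed to the explainers without any stopword filtering, so the XAI algorithms see the same input as the prediction model. The classifier is a trivial stand-in and the SHAP setup shown here is the model-agnostic text explainer; our study used a fine-tuned model and the Deep SHAP variant, so the actual setup differs in detail.

# Sketch: explaining a text classifier on the raw issue text, without
# removing stopwords, so the explainers see the same input as the model.
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Trivial stand-in for the actual classifier: assigns a higher bug
    # probability when typical failure terms occur in the text.
    probs = []
    for text in texts:
        p_bug = 0.9 if any(w in text.lower() for w in ("crash", "exception", "fail")) else 0.2
        probs.append([1.0 - p_bug, p_bug])
    return np.array(probs)

issue_text = "App crashes with a NullPointerException when saving the settings."

# LIME: perturbs the text by removing words and fits a local surrogate.
lime_explainer = LimeTextExplainer(class_names=["non-bug", "bug"])
lime_explanation = lime_explainer.explain_instance(issue_text, predict_proba, num_features=10)
print(lime_explanation.as_list())

# SHAP: model-agnostic text masker; the whole text, including stopwords,
# is available to the explainer.
shap_explainer = shap.Explainer(predict_proba, shap.maskers.Text(r"\W+"))
shap_values = shap_explainer([issue_text])
print(shap_values.values[0])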

Another aspect of our construct is our decision to show the explanations of LIME and SHAP side-by-side. Alternatively, we could have shown only one explanation at a time, such that we would have rated both explanations at different points in time, independently of each other. While this would arguably have reduced the risk that the rating of the LIME explanation affects the rating of the SHAP explanation (or vice versa), it would have added other problems. Notably, we could still not rule out that, when seeing the second explanation, we would have been consciously or unconsciously influenced by what happened earlier. Since the distance between observing the two explanations would have been unpredictable, this would be very hard to control for and would, at least, require an additional confounding variable for this purpose. Due to this, we decided to show the explanations side-by-side, which also saved time during labeling as we required only half as many context switches (i.e., understanding an issue).

7.2 Internal Validity

The biggest threat to the internal validity of our results is the low goodness-of-fit of the linear model. The coefficients of a badly fitting model are less certain and may be noisy. Since the interpretation of our results is mostly based on exactly these coefficients, this could be an alternative explanation for all our observations, i.e., that we could not really conclude anything and instead interpreted noise. However, we observed that the results remained stable when we considered the raters individually, and the interpretation of the coefficients for the true positives and true negatives agrees with the statistical tests. Thus, while we believe there is a substantial risk that the estimates of the coefficients are not exact, our results also indicate that there is a low risk that we are misled in our interpretation of the results.
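
To make this concern concrete, the following sketch shows how the goodness-of-fit and the uncertainty of the coefficients of a linear model can be inspected, here with statsmodels. The formula and column names are hypothetical and only mirror the structure of such an analysis; they are not our exact model specification.

# Sketch: inspecting goodness-of-fit and coefficient uncertainty of a
# linear regression of the quality ratings. Column names and the formula
# are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("explanation_ratings.csv")
model = smf.ols("quality ~ C(algorithm) + C(project)", data=data).fit()

print(f"R^2 = {model.rsquared:.3f}")   # a low value signals a poor fit
print(model.conf_int())                # wide intervals mean uncertain coefficients
print(model.pvalues)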

7.3 External Validity

Since our study is limited to LIME and SHAP for textual data, our results may not generalize beyond this setting, e.g., to other explanation algorithms or other types of data. Notably, the difference between LIME and SHAP may be smaller when there is no specialized SHAP variant like Deep SHAP that captures specific properties of the problem. Moreover, other data may be better suited to the way LIME constructs explanations: tabular data, e.g., has better specified features, which may suit the design of LIME, whose core is a linear regression with regularization. One aspect that we believe does generalize beyond our setting is the subjectivity of the individual interpretation of the quality of explanations. Even though all our raters agree that, on average, the explanations often provided them with insights, the low agreement indicates that we often tended to disagree on individual issues: what one rater found insightful, another may not have, based on their background and interpretation of the current case. We believe that such individual differences are, in general, hard to avoid and need to be accounted for in future studies on the quality of XAI models.
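
To make the remark about LIME's design concrete, the sketch below shows the core idea of LIME in a strongly simplified form: a regularized linear surrogate is fitted to random perturbations of a single instance, weighted by their similarity to that instance. This is a conceptual illustration under simplifying assumptions, not the actual LIME implementation.

# Simplified illustration of LIME's core idea: a regularized linear
# surrogate fitted on perturbed versions of one instance.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
words = ["crash", "exception", "when", "saving", "settings"]

def black_box(masks):
    # Stand-in for the model: the probability of 'bug' rises with the
    # presence of the failure-related words at indices 0 and 1.
    return 0.1 + 0.4 * masks[:, 0] + 0.4 * masks[:, 1]

# Perturb the instance by randomly switching words on and off.
masks = rng.integers(0, 2, size=(500, len(words)))
predictions = black_box(masks)

# Weight perturbed samples by their similarity to the original instance
# (here simply the fraction of words kept) and fit a ridge surrogate.
similarity = masks.mean(axis=1)
surrogate = Ridge(alpha=1.0).fit(masks, predictions, sample_weight=similarity)

for word, coefficient in zip(words, surrogate.coef_):
    print(f"{word}: {coefficient:.3f}")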

7.4 Reliability

Due to the subjective nature of the interpretation of the quality of an explanation, there is a non-negligible threat that the study results could be different if the study were to be executed by a different set of researchers. As we discuss in Section 6.6, this threat seems to be more relevant for individual data points than for overall trends. Thus, while we believe that especially the coefficients are uncertain, our data points in the direction of reliable conclusions based on the general trends. Nevertheless, we believe that it would be extremely interesting to see an exact replication of our work to better understand the reliability issues at work here, as they may not only be related to this study, but to the assessment of XAI approaches in general.

8 Conclusion

Within this study, we manually analyzed 3090 explanations of the prediction of bug issues for each of LIME and SHAP. We found that the explanations of both LIME and SHAP are, generally speaking, of high quality. Notably, we did not find the expected difference in explanation quality between correct predictions of bugs and of non-bugs, which means that the model found explainable signals for both categories. We further investigated whether we can rule out that specific projects influence the overall rating and therefore act as a confounder between an issue and its rating. Our results indicate that there may be some small influence, but the signal we detect for this is rather weak, such that we also cannot rule out complete independence from the projects. Additionally, the results of our experiments indicate that SHAP outperforms LIME. Specifically, SHAP explanations have a lower ambiguity and a higher contextuality than LIME explanations, which can be attributed to the ability of the Deep SHAP variant to capture sentence fragments. Overall, these findings indicate that the bug issue prediction model we used robustly learned to describe bugs, as the signal neither seems to be (strongly) project dependent, nor restricted to only bugs. Instead, explanations are of high quality regardless of the context, aside from an expected drop in explanation quality for incorrect predictions.

We also conclude that rating explanations of issue classifications is a highly subjective task, as the raters of our study had a low agreement and distinctly different perspectives on the explanations and the interpretation of the categories. For this reason, we recommend that researchers who want to study the quality of explanations of classification tasks on natural language must not omit the reporting of the rater agreement. It is further important to go beyond just reporting agreement and to also investigate how the perspectives of the raters compare.