1 Introduction

Large language models (LLMs) represent the de-facto solution for dealing with complex natural language processing (NLP) tasks such as sentiment analysis [1], question answering [2], and many others [3]. The ever-increasing popularity of such data-driven approaches is largely due to their remarkable performance improvements over human counterparts on tasks such as judging the grammatical acceptability of a sentence [4] and text translation [5]. In this context, the foreseeable future of intelligent agent systems is likely to be deeply intertwined with LLMs. Intelligent agents exploiting NLP-enabled processes for human-agent interaction, as well as for inter-agent communication within complex Multi-Agent Systems (MASs), are going to become more and more popular [24, 25].

Fig. 1 Overview of the GELPE extraction process. The LLM belonging to a smart agent is examined by a single LPE mechanism, generating a set of relevant lemmas (see Eq. 16 for more details). The set of relevant lemmas, along with the LLM’s predictions for a set of available inputs, are used to optimise the CART model to mimic the LLM behaviour. Thereafter, the CART model can be converted quickly into a logic program equivalent to the starting LLM.

We test the proposed framework over a large set of text classification domains, ranging from simple scenarios—e.g., spam text classification [26, 27]—to challenging tasks such as the Moral Foundation Twitter Corpus (MFTC) [28]. Perhaps surprisingly, our experiments show that the explanations of different LPEs are far from being correlated, highlighting how explanation quality is highly dependent on the chosen eXplainable Artificial Intelligence (xAI) approach and the respective scenario at hand. There are huge discrepancies in the results of different state-of-the-art local explainers, each of which identifies a set of relevant concepts that largely differs from the others—at least in terms of relative impact scores. These results highlight the fragility of xAI approaches for NLP, caused mainly by the complexity of large NN models, their inclination to extreme fitting of data and the lack of sound techniques for comparing xAI mechanisms. Notably, the proposed experiments also highlight how GELPE enables the extraction of reliable surrogate logic programs from LLMs with high fidelity over a broad set of datasets. The extracted knowledge is not only faithful to the original model, but also quite simple, as the complexity of the logic program is kept bounded depending on the number of relevant lemmas selected. Throughout our experimental evaluation, we analyse the computational requirements of the proposed extraction process and the efficiency of the extracted logic program. Numerical results highlight the efficiency of the extracted surrogate model, which improves over the original LLM in terms of required processing time and consumed energy. The results show how the proposed framework enables the deployment of intelligent solutions over resource-constrained environments by identifying transparent surrogate models. We also highlight that leveraging LLMs to tackle a learning task in NLP does not always represent the best option, as alternative equivalent solutions that are simple, small and transparent can be available [29, 30].


Contributions: We summarise our contributions as follows:

  • We present the first framework for comparing explanations obtained by leveraging different LPEs over LLMs. The proposed scheme is designed to assess the correlation level of LPEs over a broad set of input sentences.

  • We test the correlation performance of seven different LPEs over nine different NLP datasets, showcasing how state-of-the-art LPEs strongly disagree.

  • We present GELPE, the first framework for extracting global explanations from the output of LPE processes, enabling the extraction of logic rules from LLMs.

  • We study the performance of GELPE considering its fidelity with respect to LLMs, the complexity of the extracted rules and its achievable efficiency improvements, showcasing encouraging results.

Organization: Sect. 2 discusses the basic concepts of available explanation mechanisms in NLP, along with the necessary distinction between local and global explanations. Section 3 presents the methodology used in this paper for comparing LPE mechanisms and building global explanations from LPEs' outputs. The experimental evaluation of our methodology is made available in Sect. 4, in which we first focus on the comparison between the available LPEs in Sect. 4.3, while Sect. 4.4 presents the knowledge extraction results. Subsequently, Sect. 5 discusses the limitations of the proposed methodology, whereas Sect. 6 concludes the paper with some insight into the possible extensions of our work.

Glossary: Table 1 summarises notations used in the article.

Table 1 Summary of glossary.

2 Background: explanation mechanisms in NLP

The set of explanation extraction mechanisms available in the xAI community is often categorised along two main axes [31, 32]: (i) local against global explanations, and (ii) self-explaining against post-hoc approaches. In the former context, local identifies the set of explainability approaches that, given a single input, produce an explanation of the reasoning process followed by the NN model to output its prediction for that input [35]. Given the complexity of the NN models leveraged for tackling most NLP tasks, it is worth noticing that there is a significant lack of global explainability systems, whereas a variety of local xAI approaches are available [36, 37].

Regarding the latter aspect, we define as post-hoc those explainability approaches that apply to an already optimised black-box model from which some sort of insight is required [38]. Therefore, a post-hoc approach requires additional operations to be performed after the model outputs its predictions [39]. Conversely, inherently explainable—self-explaining—mechanisms aim at building a predictor having a transparent reasoning process by design—e.g., CART [40]. Therefore, a self-explaining approach can be seen as generating the explanation along with its prediction, using the information emitted by the model as a result of the prediction process [39].

In the context of local post-hoc explanation approaches, a popular solution in NLP is to explain the reasoning process by highlighting how different portions of the input sample differently impact the produced output, assigning a relevance score to each input component. The relevance score is then highlighted using some saliency map to ease the visualisation of the obtained explanation. Therefore, it is also common for local post-hoc explanations to be referred to as saliency approaches, as they aim at highlighting salient components.

3 Methodology

In this section, we present our methodology for comparing LPE mechanisms and building global explanations from LPE’s outputs. We first overview the proposed approach in Sect. 3.1. Subsequently, the set of LPE mechanisms adopted in our experiments are presented in Sect. 3.2, and the aggregation approaches leveraged to obtain global impact scores from LPE outputs are described in Sect. 3.3. In Sect. 3.4 we present the metrics used to identify the correlation between LPEs. Finally, in Sect. 3.5 we propose GELPE as a novel methodology to build global explanations of LLMs on top of LPE’s outputs.

3.1 Overview

Measuring different LPE approaches over single local explanations is a complex task. This is why we first consider measuring how much LPEs correlate with each other over a set of fixed samples. The underlying assumption of our framework is that the various LPE techniques aim at explaining the same NN model used for prediction. Therefore, while explanations may differ over local samples, one could reasonably assume that reliable LPEs, when applied over a vast set of samples—sentences or sets of sentences—should converge to similar (correlated) results. Indeed, the underlying LLM always considers the same set of concepts—lemmas—to be relevant for its inference. A lack of correlation between different LPE mechanisms would hint at a conflict among the sets of concepts that each explanation mechanism considers relevant for the LLM—thus making at least one, if not all, of the explanations unreliable.

We first analyse the correlation between a set of LPEs over the same pool of samples, and define \(\epsilon _{ NN }\) as a LPE technique applied to the NN model at hand. Being local, \(\epsilon _{ NN }\) is applied to a single input sample \(\textbf{x}_{i}\), producing as output one impact score for each component (token) \(l_{k}\) of the input sample. Throughout the remainder of the paper, we consider \(l_{k}\) to be the lemmas corresponding to the input components. Mathematically, we define the output impact score for a single token or its corresponding lemma as \(j \left( l_{k}, \epsilon _{ NN } (\textbf{x}_{i}) \right)\). Depending on the given \(\epsilon _{ NN }\), the corresponding impact score j may be associated with a single label, making j a scalar value, or with a set of labels, making j a vector—one scalar value for each label. To enable the comparison between different LPEs, we define the aggregated impact scores of a LPE mechanism over a NN model and a set of samples \(\mathcal {S}\) as \(\epsilon _{ NN }(\mathcal {S})\). In our framework we obtain \(\epsilon _{ NN }(\mathcal {S})\) by aggregating \(\epsilon _{ NN }(\textbf{x}_{i}) \text { for each } \textbf{x}_{i} \in \mathcal {S}\) using an aggregation operation \(\mathcal {A}\)—mathematically:

$$\begin{aligned} \epsilon _{ NN }\left( \mathcal {S}\right) = \mathcal {A} \left( \left\{ \epsilon _{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) . \end{aligned}$$
(1)

By defining a correlation metric \(\mathcal {C}\), we obtain from Eq. 1 the following for describing the correlation between two LPE techniques:

$$\begin{aligned} \begin{aligned} \mathcal {C} \left( \epsilon _{ NN }\left( \mathcal {S}\right) , \epsilon '_{ NN }\left( \mathcal {S}\right) \right) = \mathcal {C} \big (&\mathcal {A} \left( \left\{ \epsilon _{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) , \\&\mathcal {A} \left( \left\{ \epsilon '_{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) \big ) \end{aligned} \end{aligned}$$
(2)

where \(\epsilon _{ NN }\) and \(\epsilon '_{ NN }\) are two LPE techniques applied to the same NN model.

The aggregated explanations \(\epsilon _{ NN }\left( \mathcal {S}\right)\) obtained from LPE’s outputs can also be leveraged as a starting point for building transparent surrogate models of the original LLM, as they highlight the impact of each lemma or token in the LLM decision process. Constructing a transparent surrogate model allows for extracting explanations of the global reasoning process of the black-box LLM, enabling knowledge extraction, model debugging, and interaction with a human user or other intelligent agents. To this extent, we here propose GELPE as a novel framework for constructing a logic program—represented as a set of sequential propositional rules—that mimics the LLM behaviour starting from a set of locally relevant lemmas \(\epsilon _{ NN }\left( \mathcal {S}\right)\), extracted using a single LPE. More in detail, GELPE relies on transparent-by-design models such as CART optimised over the LLM outputs, rather than the dataset considered.

We rely on CART models as they represent one of the easiest and most reliable approaches to identify human-readable rules—under the form of trees—from complex structured data. In summary, the optimisation of CART models involves selecting input variables and split points on those variables until a suitable tree is constructed. The selection of which input variables and split points to use is performed using a greedy algorithm aiming at minimising a given cost function. Finally, the tree construction process ends using a predefined stopping criterion, such as a minimum number of training instances assigned to each leaf node of the tree. The set of tree-structured rules extracted using CART can be easily translated into a list of sequential, human-readable expressions that contain logic expressions over the input variables, by extracting one rule for each leaf used in the CART model. Therefore, CART represents a very popular solution for extracting explanations from fuzzy data or black-box classifiers, trying to mimic their outputs. However, a thorough background on CART models is out of the scope of this paper and we refer interested readers to [40].

Since CART relies on structured—usually tabular—data to perform optimisation and inference, we convert the input sentences into a binary format, expressing the presence or absence of relevant lemmas and their combinations. The binarised input is used to optimise the underlying CART model, from which it is possible to extract the equivalent logic program \(\mathcal {P}\). Mathematically, we represent the knowledge extraction procedure as:

$$\begin{aligned} \mathcal {P} = \mathcal {H} \left\{ (bin_{\epsilon _{ NN }\left( \mathcal {S}\right) }(\textbf{x}_{i}), \, NN (\textbf{x}_{i})) \ \forall \ \textbf{x}_{i} \in \mathcal {S} \right\} , \end{aligned}$$
(3)

where \(\mathcal {H}\) identifies the transparent-by-design model used to extract the explanation logic program \(\mathcal {P}\), \(bin_{\epsilon _{ NN }\left( \mathcal {S}\right) }\) represents the binarization process used to convert the sentence \(\textbf{x}_{i}\) into a corresponding binary vector of lemma occurrences, and \(NN (\textbf{x}_{i})\) identifies the output of the LLM when fed with input sentence \(\textbf{x}_{i}\).

3.2 Local post-hoc explanations

In our framework, we consider seven different LPE approaches for extracting local explanations \(j \left( l_{k}, \epsilon _{ NN } (\textbf{x}_{i}) \right)\) from an input sentence \(\textbf{x}_{i}\) and the trained LLM—identified as \(NN\). The seven LPEs are selected in order to represent as faithfully as possible the state-of-the-art of xAI approaches in NLP. Subsequently, we briefly describe each of the seven selected LPEs. However, a detailed analysis of these LPEs is out of the scope of this paper and we refer interested readers to [33, 39, 41].

3.2.1 Gradient sensitivity analysis (GS)

Gradient Sensitivity Analysis (GS) probably represents the simplest approach for assigning relevance scores to input components. GS relies on computing gradients over input components as \(\dfrac{\delta f_{\tau _{m}}(\textbf{x}_{i})}{\delta \textbf{x}_{i,k}}\), which represents the derivative of the output with respect to the \(k^{th}\) component of \(\textbf{x}_{i}\). Following this approach, the local impact score of an input component can thus be defined as:

$$\begin{aligned} j \left( l_{k}, \epsilon _{{ NN }} (\textbf{x}_{i}) \right) = \dfrac{\delta f_{\tau _{m}}(\textbf{x}_{i})}{\delta \textbf{x}_{i,k}}, \end{aligned}$$
(4)

where \(f_{\tau _{m}}(\textbf{x}_{i})\) represents the predicted probability of an input sequence \(\textbf{x}_{i}\) over a target class \(\tau _{m}\). While simple, GS has been shown to be an effective approach for approximating the relevance of input components. However, this approach suffers from a variety of drawbacks, mainly linked with its inability to define negative contributions of input components for a specific prediction—i.e., negative impact scores.
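For illustration purposes, the following minimal sketch shows how GS can be computed for a Hugging Face BERT classifier; the model name, the randomly initialised two-class head, and the reduction of per-dimension gradients to one score per token via an L2 norm are our assumptions rather than details prescribed above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative stand-in for the fine-tuned LLM analysed in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def gradient_sensitivity(sentence: str, target_class: int):
    enc = tokenizer(sentence, return_tensors="pt")
    # Embed tokens explicitly so that gradients can be taken w.r.t. the input embeddings.
    embeds = model.bert.embeddings.word_embeddings(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    prob = logits.softmax(dim=-1)[0, target_class]   # f_{tau_m}(x_i)
    prob.backward()
    # Collapse per-dimension derivatives into one score per token (L2 norm, our choice).
    return embeds.grad[0].norm(dim=-1).tolist()

print(gradient_sensitivity("free entry to win a prize", target_class=1))
```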

3.2.2 Gradient \(\times\) input (GI)

Aiming to address some of the limitations affecting GS, the Gradient \(\times\) Input (GI) approach defines the relevance score assignment as the GS score multiplied—element-wise—by \(\textbf{x}_{i,k}\) [42]. Therefore, mathematically speaking, GI impact scores are defined as:

$$\begin{aligned} j \left( l_{k}, \epsilon _{{ NN }} (\textbf{x}_{i}) \right) = \textbf{x}_{i,k} \cdot \dfrac{\delta f_{\tau _{m}}(\textbf{x}_{i})}{\delta \textbf{x}_{i,k}}, \end{aligned}$$
(5)

where notation follows the one of Eq. 4. Being very similar to GS, GI also inherits most of its limitations.

3.2.3 Layer-wise relevance propagation (LRP)

Building on top of gradient-based relevance score mechanisms—such as GS and GI—Layer-wise Relevance Propagation (LRP) proposes a novel mechanism relying on the conservation of relevance scores across the layers of the NN at hand. Indeed, LRP relies on the following assumptions: (i) the NN can be decomposed into several layers of computation; (ii) there exists a relevance score \(R_{d}^{(l)}\) for each dimension \(\textbf{z}_{d}^{(l)}\) of the vector \(\textbf{z}^{(l)}\) obtained as the output of the \(l^{th}\) layer of the NN; and (iii) the total relevance score across dimensions should be preserved through all layers of the NN model, mathematically:

$$\begin{aligned} f(\textbf{x}) = \sum _{d \in L}R_{d}^{(L)} = \sum _{d \in L-1}R_{d}^{(L-1)} = \cdots = \sum _{d \in 1}R_{d}^{(1)}, \end{aligned}$$
(6)

where, \(f(\textbf{x})\) represents the predicted probability distribution of an input sequence \(\textbf{x}\), and L the number of layers of the NN at hand. Moreover, LRP defines a propagation rule for obtaining \(R_{d}^{(l)}\) from \(R^{(l+1)}\). However, the derivation of the propagation rule is out of the scope of this paper, thus we refer interested readers to [43, 44]. In our experiments we consider as impact scores the relevance scores of the input layer, namely \(j \left( l_{k}, \epsilon _{ NN } (\textbf{x}_{i}) \right) = R_{d}^{(1)}\).

3.2.4 Layer-wise attention tracing (LAT)

Since LLMs rely heavily on self-attention mechanisms [45], recent efforts propose to identify input component relevance scores by analysing solely the relevance scores of the attention heads of LLM models, introducing Layer-wise Attention Tracing (LAT) [46, 47]. Building on top of LRP, LAT proposes to redistribute the inner relevance scores \(R^{(l)}\) across dimensions using solely self-attention weights. Therefore, LAT defines a custom redistribution rule as:

$$\begin{aligned} R_{i}^{(l)} = \sum _{\textit{k}} \sum _{h} \textbf{a}^{(h)} R_{k, h}^{(l+1)}, \end{aligned}$$
(7)

where h corresponds to the attention head index, \(\textbf{a}^{(h)}\) are the corresponding learnt weights of the attention head, and k ranges over the neurons for which i is an input. Similarly to LRP, we here consider as impact scores the relevance scores of the input layer, namely \(j \left( l_{k}, \epsilon _{ NN } (\textbf{x}_{i}) \right) = R^{(1)}\).

3.2.5 Integrated gradient (HESS)

Motivated by the shortcomings of previously proposed gradient-based relevance score attribution mechanisms—such as GS and GI—Sundararajan et al. [48] propose the Integrated Gradient approach. The proposed approach aims at explaining the relevance of input sample components by integrating the gradient along a trajectory of the input space linking a baseline value \(\textbf{x}'_{i}\) to the sample under examination \(\textbf{x}_{i}\). Therefore, the relevance score of the \(k^{th}\) component of the input sample \(\textbf{x}_{i}\) is obtained following

$$\begin{aligned} j \left( l_{k}, \epsilon _{ NN } (\textbf{x}_{i}) \right) = \left( \textbf{x}_{i,k} - \textbf{x}'_{i,k}\right) \cdot \int _{t=0}^{1} \dfrac{\delta f(\textbf{x}'_{i} + t \cdot (\textbf{x}_{i} - \textbf{x}'_{i}))}{\delta \textbf{x}_{i,k}} \, dt, \end{aligned}$$
(8)

where \(\textbf{x}_{i,k}\) represents the \(k^{th}\) component of the input sample \(\textbf{x}_{i}\). By integrating the gradient along an input space trajectory, the authors aim at addressing the locality issue of gradient information. In our experiments we refer to the Integrated Gradient approach as HESS, as for its implementation we rely on the integrated hessian library available for hugging face models.Footnote 1
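As a rough illustration of Eq. 8, the sketch below approximates the path integral with a Riemann sum over the embedding space, using an all-zero baseline; this is not the integrated-hessians library actually used in our experiments, and the model name, baseline choice, and per-token reduction are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).eval()

def integrated_gradients(sentence: str, target_class: int, steps: int = 20):
    enc = tokenizer(sentence, return_tensors="pt")
    x = model.bert.embeddings.word_embeddings(enc["input_ids"]).detach()
    baseline = torch.zeros_like(x)             # x'_i: an all-zero embedding baseline (our choice)
    total = torch.zeros_like(x)
    for t in torch.linspace(0.0, 1.0, steps):  # Riemann approximation of the path integral
        point = (baseline + t * (x - baseline)).detach().requires_grad_(True)
        prob = model(inputs_embeds=point,
                     attention_mask=enc["attention_mask"]).logits.softmax(-1)[0, target_class]
        prob.backward()
        total += point.grad
    # (x_i - x'_i) times the average gradient along the path, reduced to one score per token.
    return ((x - baseline) * total / steps).sum(dim=-1)[0].tolist()

print(integrated_gradients("free entry to win a prize", target_class=1))
```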

3.2.6 SHapley additive exPlanations (SHAP)

SHapley Additive exPlanations (SHAP) relies on Shapley values to identify the contribution of each component of the input sample toward the final prediction distribution. The Shapley value concept derives from game theory, where it represents a solution for a cooperative game, found by assigning a distribution of the total surplus generated by the players' coalition. SHAP computes the impact of an input component as its marginal contribution toward a label \(\tau _{m}\), computed by deleting the component from the input and evaluating the output discrepancy. Firstly defined for explaining simple NN models [36], in our experiments we leverage the extension of SHAP supporting transformer models such as BERT [49], available in the SHAP python library.Footnote 2
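A minimal usage sketch of the SHAP library with a transformers text pipeline follows; the pipeline, model name, and example sentence are placeholders, and in practice the fine-tuned classifier from Sect. 4.2 would be wrapped instead.

```python
import shap
from transformers import pipeline

# Illustrative pipeline; in our experiments the fine-tuned BERT classifier is wrapped instead.
classifier = pipeline("text-classification", model="bert-base-uncased", top_k=None)

explainer = shap.Explainer(classifier)
shap_values = explainer(["free entry to win a prize, text now to claim"])

# One additive impact score per token and per output label.
print(shap_values.values[0])
```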

3.2.7 Local interpretable model-agnostic explanations (LIME)

Similarly to SHAP, Local Interpretable Model-agnostic Explanations (LIME) relies on input sample perturbation to identify its relevant components. Here, the predictions of the NN at hand are explained by learning an explainable surrogate model [37]. In detail, in order to obtain its explanations, LIME constructs a set of samples from the perturbation of the input observation under examination. The constructed samples are considered to be close to the observation to be explained from a geometric perspective, thus considering small perturbations of the input. The explainable surrogate model is then trained over the constructed set of samples, obtaining the corresponding local explanation. Given an input sentence, we here obtain its perturbed versions via word—or token—removal and word substitution. In our experiments, we rely on the already available LIME python library.Footnote 3
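The sketch below shows one way to wire the LIME library to a transformers classifier; the pipeline, label ordering, and example sentence are illustrative assumptions rather than the exact setup of our experiments.

```python
import numpy as np
from transformers import pipeline
from lime.lime_text import LimeTextExplainer

clf = pipeline("text-classification", model="bert-base-uncased", top_k=None)

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities.
    outputs = clf(list(texts))
    return np.array([[d["score"] for d in sorted(out, key=lambda d: d["label"])]
                     for out in outputs])

explainer = LimeTextExplainer(class_names=["LABEL_0", "LABEL_1"])
explanation = explainer.explain_instance("free entry to win a prize",
                                         predict_proba, num_features=10)
print(explanation.as_list())   # [(token, local impact score), ...]
```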

3.3 Aggregating local explanations

Once local explanations of the NN model are obtained for each input sentence, we aggregate them to obtain a global list of concept impact scores. Before aggregating the local impact scores, we convert the words composing local explanations into their corresponding lemmas (i.e., concepts) to avoid issues when aggregating different words expressing the same concept (e.g., hate and hateful). As no bullet-proof solution exists for the aggregation of different impact scores, we adopt four different approaches in our experiments, namely:

Sum:

A simple summation operation is leveraged to obtain the aggregated score for each lemma. While simple, this aggregation approach is effective when dealing with additive impact scores such as SHAP values. However, it suffers from lemma frequency issues, as it tends to overestimate frequent lemmas with low average impact scores. Global impact scores are here defined as \(J(l_{k}, \epsilon _{ NN }) = \sum _{i=1}^{N} j \left( l_{k}, \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right)\). Therefore, we define \(\mathcal {A}\) as

$$\begin{aligned} \mathcal {A} \left( \left\{ \epsilon _{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) = \left\{ \sum _{i=1}^{N} j \left( l_{k} , \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right) \textit{ for each } l_{k} \in \mathcal {S} \right\} . \end{aligned}$$
(9)
Absolute sum:

Here we sum the absolute values of the local impact scores—rather than their true values—to increase the awareness of global impact scores towards lemmas having both high positive and high negative impact over some sentences. Mathematically, we obtain aggregated scores as \(J(l_{k}, \epsilon _{ NN }) = \sum _{i=1}^{N} \vert j \left( l_{k}, \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right) \vert\).

$$\begin{aligned} \mathcal {A} \left( \left\{ \epsilon _{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) = \left\{ \sum _{i=1}^{N} \vert j \left( l_{k} , \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right) \vert \textit{ for each } l_{k} \in \mathcal {S} \right\} . \end{aligned}$$
(10)
Average:

Similar to the sum operation, here we obtain aggregated scores by averaging local impact scores, thus avoiding possible overshooting issues arising when dealing with very frequent lemmas. Mathematically, we define \(J(l_{k}, \epsilon _{ NN }) = \frac{1}{N} \cdot \sum _{i=1}^{N} j \left( l_{k}, \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right)\).

$$\begin{aligned} \mathcal {A} \left( \left\{ \epsilon _{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) = \left\{ \frac{1}{N} \cdot \sum _{i=1}^{N} j \left( l_{k} , \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right) \textit{ for each } l_{k} \in \mathcal {S} \right\} . \end{aligned}$$
(11)
Absolute average:

Similarly to the absolute sum, here we average the absolute values of local impact scores to better manage lemmas with a skewed impact, as well as to tackle frequency issues. Global impact scores are here defined as \(J(l_{k}, \epsilon _{ NN }) = \frac{1}{N} \cdot \sum _{i=1}^{N} \vert j \left( l_{k}, \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right) \vert\).

$$\begin{aligned} \mathcal {A} \left( \left\{ \epsilon _{ NN }(\textbf{x}_{i}) \textit{ for each } \textbf{x}_{i} \in \mathcal {S} \right\} \right) = \left\{ \frac{1}{N} \cdot \sum _{i=1}^{N} \vert j \left( l_{k} , \epsilon _{ NN } \left( \textbf{x}_{i}\right) \right) \vert \textit{ for each } l_{k} \in \mathcal {S} \right\} . \end{aligned}$$
(12)

Since the selection of the aggregation mechanism may influence the correlation between different LPEs, in our experiments we analyse LPE correlation under the same aggregation scheme. Moreover, we also analyse how aggregation impacts the correlation of impact scores over the same LPE, highlighting how leveraging the absolute value of the impact scores yields results highly similar to adopting their true values—see Sect. 4.3.2.
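To make the four aggregation operators concrete, a minimal sketch is given below; it assumes each local explanation is represented as a dict mapping lemmas to impact scores, and it divides by the number of sentences in which a lemma occurs, which is our reading of the frequency argument above.

```python
from collections import defaultdict

def aggregate(local_explanations, mode="average"):
    """local_explanations: one {lemma: impact score j} dict per sentence in S.
    mode is one of: sum, absolute_sum, average, absolute_average."""
    totals, occurrences = defaultdict(float), defaultdict(int)
    for scores in local_explanations:
        for lemma, j in scores.items():
            totals[lemma] += abs(j) if "absolute" in mode else j
            occurrences[lemma] += 1
    if "average" in mode:
        return {lemma: totals[lemma] / occurrences[lemma] for lemma in totals}
    return dict(totals)

local_scores = [{"free": 0.9, "win": 0.7}, {"win": -0.4, "lunch": 0.1}]
print(aggregate(local_scores, "average"), aggregate(local_scores, "absolute_sum"))
```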

3.4 Comparing explanations

Each aggregated global explanation J depends on a corresponding label \(\tau _{m}\), since LPEs produce either a scalar impact value for a single \(\tau _{m}\) or a vector of impact scores, one for each \(\tau _{m}\). Therefore, recalling Sect. 3.3, we can define the set of aggregated global scores depending on the label they refer to as follows:

$$\begin{aligned} \mathcal {J}_{\tau _{m}} \left( \epsilon _{ NN }, \mathcal {S}\right) = \left\{ J \left( l_{k}, \epsilon _{ NN } \right) \vert \tau _{m} \textit{ for each } l_{k} \in \mathcal {S} \right\} . \end{aligned}$$
(13)

\(\mathcal {J}_{\tau _{m}} \left( \epsilon _{ NN }, \mathcal {S}\right)\) represents a distribution of impact scores over the set of lemmas—i.e., concepts—available in the sample set for a specific label. To compare the distributions of impact scores extracted using two LPEs—i.e., \(\mathcal {J}_{\tau _{m}} \left( \epsilon _{ NN }, \mathcal {S}\right)\) and \(\mathcal {J}_{\tau _{m}} \left( \epsilon '_{ NN }, \mathcal {S}\right)\)—we use the Pearson correlation, which is defined as the ratio between the covariance of two variables and the product of their standard deviations, and which measures their level of linear correlation. The selected correlation metric is applied to the normalised impact scores. Indeed, different LPEs produce impact scores that may differ considerably in terms of their magnitude. By normalising the impact scores, we map them to a fixed interval, allowing for a direct comparison of \(\mathcal {J}_{\tau _{m}}\) over different \(\epsilon _{ NN }\). Mathematically, we refer to the normalised global impact scores as \(\Vert \mathcal {J}_{\tau _{m}} \Vert\). Therefore, we define the correlation score between two sets of global impact scores for a single label as:

$$\begin{aligned} \begin{aligned} \rho \left( \Vert \mathcal {J}_{\tau _{m}} \left( \epsilon _{ NN }, \mathcal {S}\right) \Vert , \Vert \mathcal {J}_{\tau _{m}} \left( \epsilon '_{ NN }, \mathcal {S}\right) \Vert \right) = \rho \big (&\Vert \left\{ J \left( l_{k}, \epsilon _{ NN } \right) \vert \tau _{m} \textit{ for each } l_{k} \in \mathcal {S} \right\} \Vert , \\&\Vert \left\{ J \left( l_{k}, \epsilon '_{ NN } \right) \vert \tau _{m} \textit{ for each } l_{k} \in \mathcal {S} \right\} \Vert \big ) \end{aligned} \end{aligned}$$
(14)

where \(\rho\) refers to the Pearson correlation used to compare couples of \(\mathcal {J}_{\tau _{m}} \left( \epsilon _{ NN }, \mathcal {S}\right)\). Throughout our analysis we experimented with similar correlation metrics, such as Spearman correlation and simple vector distance—similarly to [50]—, obtaining similar results. Therefore, to avoid redundancy we here show only the Pearson correlation results. Throughout our experiments, we consider a simple min-max normalisation process, scaling the scores to the range \(\left[ 0,1\right]\).

As we aim at obtaining a measure of similarity between LPEs applied over the same set of samples, we can average the correlation scores \(\rho\) obtained for each label \(\tau _{m}\) over the set of labels \(\mathcal {T}\). Therefore, we mathematically define the correlation score of two LPEs, putting together Eqs. 13, 2 and 14 as:

$$\begin{aligned} \mathcal {C} \left( \epsilon _{ NN }\left( \mathcal {S}\right) , \epsilon '_{ NN }\left( \mathcal {S}\right) \right) = \frac{1}{M} \cdot \sum _{m=1}^{M} \rho \left( \Vert \mathcal {J}_{\tau _{m}} \left( \epsilon _{ NN }, \mathcal {S}\right) \Vert , \Vert \mathcal {J}_{\tau _{m}} \left( \epsilon '_{ NN }, \mathcal {S}\right) \Vert \right) \end{aligned}$$
(15)

where M is the total number of labels, belonging to \(\mathcal {T}\).
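Putting Eqs. 13-15 together, the correlation score can be computed roughly as in the sketch below, using scipy's Pearson correlation and min-max normalisation; restricting the comparison to lemmas present in both explanations is an implementation assumption.

```python
import numpy as np
from scipy.stats import pearsonr

def minmax(values):
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + 1e-12)

def lpe_correlation(global_a, global_b):
    """global_*: {label tau_m: {lemma: aggregated impact J}} for two LPEs."""
    rhos = []
    for label in global_a:
        lemmas = sorted(set(global_a[label]) & set(global_b[label]))
        a = minmax([global_a[label][lemma] for lemma in lemmas])
        b = minmax([global_b[label][lemma] for lemma in lemmas])
        rhos.append(pearsonr(a, b)[0])   # rho for one label tau_m
    return float(np.mean(rhos))          # C: average over the label set T
```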

3.5 GELPE: global explanations from LPEs

Although useful, local explanations are limited, as they do not highlight the general reasoning principles of the underlying model, but rather focus solely on the input components relevant for a specific prediction. Aiming at overcoming such limitations, we here present GELPE as the first—to the best of our knowledge—framework for extracting global explanations from LPEs. Relying on LPE outputs, GELPE allows for the adoption of reliable local extraction mechanisms, while extending their impact to the global reasoning process of the black-box model. Figure 1 presents an overview of GELPE's working process.

The aggregated explanations \(\epsilon _{ NN }\left( \mathcal {S}\right)\) obtained from a single LPE's output are leveraged as a starting point for building a transparent surrogate model of the original LLM. GELPE relies on transparent-by-design models such as CART optimised over the LLM outputs, rather than the dataset considered. As described in Eq. 3, during the optimisation process of the CART model, input sentences are converted into a binary format, expressing the presence or absence of relevant lemmas and their combinations. In order to convert a sentence \(\textbf{x}_{i}\) into its binary format, we consider the \(\mathcal {K}\) most valuable lemmas for each class identified during the aggregation process presented in Sect. 3.3. The \(\mathcal {K}\) most valuable lemmas are the ones with the highest aggregated impact scores over a set of sample sentences for a single LPE mechanism. To avoid relying only on keywords, and accounting instead for more complex constructs, we also consider the set of skipgrams built from combinations of the \(\mathcal {K}\) most valuable lemmas. In this context, skipgrams define co-occurrences of relevant lemmas over a span of limited token sequences [51]. With such a procedure we build a set of valuable lemmas and sequences \(\mathcal {L}\) defined as:

$$\begin{aligned} \mathcal {L} = \{(L_{i}), (L_{i}, L_{j}), (L_{i}, L_{j}, L_{k}), \ldots \ \forall \, i, j, k \in \mathcal {K}\}, \end{aligned}$$
(16)

where \(L_{i}\) represents the lemma in the \(i^{th}\) position of the sorted lemmas list—in terms of relevance—and \((L_{i}, \ldots , L_{j})\) represents the concatenation of two or more lemmas. Once the set of most relevant lemmas and corresponding sequences \(\mathcal {L}\) is available, we can define the binarised version of an input sentence as the binary vector that identifies the presence or absence of each lemma and sequence in the considered sentence. Mathematically, the binarisation function can be defined as follows:

$$\begin{aligned} \textbf{x}_{bin} = bin_{\epsilon _{ NN }\left( \mathcal {S}\right) }(\textbf{x}_{i}) = \mathbbm {1}(x_{i,j} \in \mathcal {L}) \, || \, \mathbbm {1}(skip(x_{i,j-n},\ldots ,x_{i,j}) \in \mathcal {L}) \quad \forall \, j \in \textbf{x}_{i}, \end{aligned}$$
(17)

where \(x_{i,j}\) represent the components—i.e., tokens or lemmas—of the input sentence \(\textbf{x}_{i}\), \(skip(x_{i,j-n},\ldots ,x_{i,j})\) the corresponding skipgrams built from the last n input components, and \(\mathbbm {1}\) represents the indicator function, being equal to 1 if the lemma/skipgram belongs to \(\mathcal {L}\) and 0 otherwise. Finally, || represents the concatenation operation between vectors. As an example, consider the input sentence the dog is an animal with four legs and the set of most relevant lemmas extracted by a given LPE to be \(\mathcal {L} = \{animal, \, face, \, legs\}\). Then the corresponding binarised version of the input sentence is shown in Fig. 2, where the \(+\) symbol is used to identify the concatenation of two relevant lemmas inside a sentence—i.e., lemma1 \(+\) lemma2 can be interpreted as lemma1 followed by lemma2.

Fig. 2 Sentence binarization approach in GELPE.
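A minimal sketch of the binarisation of Eq. 17 for the example above follows; the whitespace tokeniser stands in for proper lemmatisation, and the single (2,5)-skipgram listed is purely illustrative.

```python
RELEVANT = ["animal", "face", "legs"]     # K most impactful lemmas returned by the LPE
SKIPGRAMS = [("animal", "legs")]          # illustrative (2,5)-skipgrams built over RELEVANT

def binarize(sentence, window=5):
    tokens = sentence.lower().split()     # stands in for lemmatisation of the sentence
    features = [int(lemma in tokens) for lemma in RELEVANT]
    for first, second in SKIPGRAMS:
        # The skipgram fires if both lemmas co-occur within a span of `window` tokens.
        hit = any(tokens[i] == first and second in tokens[i + 1:i + window]
                  for i in range(len(tokens)))
        features.append(int(hit))
    return features

print(binarize("the dog is an animal with four legs"))   # -> [1, 0, 1, 1]
```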

The binarised input is used to optimise the underlying CART model, from which it is possible to extract the equivalent logic program \(\mathcal {P}\)—see Eq. 3. \(\mathcal {P}\) is extracted by identifying one rule for each leaf used in the CART model optimised over the LLM outputs. The obtained logic program \(\mathcal {P}\) represents an explanation of the black-box LLM in the form of a set of sequential propositional rules containing lemmas, sequences of lemmas, and negations thereof. Extracted rules are sequential, meaning that each propositional rule applies if and only if the previous ones were not valid. As GELPE relies on the CART model, the extracted rules can only identify the presence or absence of a specific set of keywords and sequences, which represents a limitation of this approach. However, by varying the value of \(\mathcal {K}\) and the length and expressiveness of the skipgram construction process, the GELPE extraction procedure can be tuned to consider sequences of lemmas as complex as needed to fit the LLM reasoning process well. To keep the complexity of the extraction process under control, throughout our experiments we rely at most on (2,5)-skipgrams—i.e., we build sequences of lemmas of length at most two which are contained within a span of five input tokens. An example of the knowledge extracted by GELPE, along with an analysis of its correctness, is made available in Sect. 4.4.3.
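A sketch of this final step—fitting a bounded-depth CART on binarised inputs labelled with the LLM's own predictions and reading the tree off as rules—is shown below with toy data; the feature names and samples are illustrative, and scikit-learn's export_text stands in for our rule-translation step.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-ins: binarised sentences and the LLM predictions they received.
feature_names = ["subscribe", "check", "subscribe+check"]
X_bin = np.array([[1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0]])
y_llm = np.array(["spam", "spam", "ham", "ham"])

# A depth bound in the spirit of Eq. 18 keeps the extracted program small.
cart = DecisionTreeClassifier(max_depth=4, min_samples_leaf=1).fit(X_bin, y_llm)

# Each root-to-leaf path becomes one propositional clause of the logic program P.
print(export_text(cart, feature_names=feature_names))
```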

4 Experiments

In this section we present the setup and results of our experiments. More in detail, we first analyse the set of datasets used in our experimental evaluation in Sect. 4.1, along with the model training details and its obtained performance in Sect. 4.2. We then focus on the comparison between the available LPEs, showing the correlation between their explanations in Sect. 4.3. Section 4.4 presents the knowledge extraction results, analysing the performance of the knowledge extractor model, along with the complexity of the extracted knowledge. Finally, we analyse the efficiency of the knowledge extraction model, showcasing the improvements in terms of time and energy consumption over the LLM counterpart. The source code of our framework and experiments is publicly available.Footnote 4

4.1 Datasets

In our experiments, we aim at analysing the correlation among different LPEs and the feasibility of global knowledge extraction from LLMs over a large set of scenarios. Therefore, we consider a heterogeneous set of datasets targeting text classification tasks, ranging from easy to complex setups. More in detail, we target the SMS [26] and YOUTUBE [27] spam classification datasets as easy setups, having two highly separable classes. Here, each sample represents a text—either obtained from text messages or from comments posted in the comment section of YouTube videos—manually labelled as spam or legitimate (ham). Although available, the metadata information—such as the author's name and publication date—is not used. As a slightly more complex setup, we consider the TREC [52] dataset, containing 4,965 labelled questions. In this context, each sample represents a question belonging to one of six classes—i.e., Abbreviation, Entity, Description, Human, Location, Numeric-value—to be semantically classified. Finally, as a complex setup we select the MFTC datasets as the target classification task. The MFTC dataset is composed of 35,108 tweets—sentences—and can be considered as a collection of different datasets. Each split of MFTC corresponds to a different context: tweets corresponding to the dataset samples are collected following a certain event or target. As an example, tweets belonging to the Black Lives Matter (BLM) split were collected during the period of Black Lives Matter protests in the US. The list of all MFTC subjects considered in our experiments is the following: (i) All Lives Matter (ALM), (ii) Black Lives Matter (BLM), (iii) Baltimore protests (BLT), (iv) 2016 presidential election (ELE), (v) MeToo movement (MT), (vi) hurricane Sandy (SND). Each tweet in MFTC is labelled, following the same moral theory, with one or more of the following 11 moral values: (i) care/harm, (ii) fairness/cheating, (iii) loyalty/betrayal, (iv) authority/subversion, (v) purity/degradation, (vi) non-moral. Ten of the 11 available moral values are obtained as a moral concept and its opposite expression—e.g., fairness refers to the act of supporting fairness and equality, while cheating refers to the act of exploiting others. Given the subjectivity of morality, each tweet is labelled by multiple annotators, and the final moral labels are obtained via majority voting.

As the size of each dataset represents a relevant component to take into account, Table 2 reports the number of sentences belonging to each dataset. Throughout our experiments we use 70% of the samples of each dataset as the training set, on which LLMs are trained and both local and global explanations are fitted. The remaining 30% of samples are kept for testing the LLM performance as well as the quality of both local and global explanations.

Table 2 Size of the considered datasets.

4.2 Model training

The SMS, YOUTUBE, and TREC datasets represent standard multi-class single-label classification tasks. Therefore, we tackle the classification task over those datasets using a standard cross-entropy loss [53]. Meanwhile, when tackling MFTC we follow state-of-the-art approaches for dealing with the morality classification task [54, 55]. Thus, we treat the morality classification problem as a multi-class multi-label classification task, using a binary cross-entropy loss [53]. Differently from recent approaches, we here do not rely on the sequential training paradigm for the MFTC datasets, but rather train each model solely on the MFTC split at hand. Indeed, in our experiments, we do not aim at obtaining strong transferability between domains, but rather focus on analysing LPE behaviour.

For all datasets we leverage BERT as the LLM to be optimised [11], and define one NN model for each dataset, optimising its parameters over 70% of the samples and leaving the remaining 30% for testing purposes. We leverage the pre-trained bert-base-uncased model—available in the Hugging Face python libraryFootnote 5—as the starting point of our training process. Each model is trained using the standard Stochastic Gradient Descent (SGD) optimization procedure for 3 epochs, with a learning rate of \(5 \times 10^{-5}\), a batch size of 16 and a maximum sequence length of 64. We keep track of the macro F1-score of each model to identify its performance over the test samples. Table 3 shows the performance of the trained BERT model.

Table 3 BERT performance over considered datasets.
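A compact sketch of this fine-tuning loop with the stated hyperparameters (SGD, 3 epochs, learning rate 5e-5, batch size 16, maximum sequence length 64) follows; the two-sentence dataset is a toy stand-in for the 70% training split of each dataset.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5)

texts = ["free entry to win a prize", "see you at lunch"]    # toy stand-in for the training split
labels = [1, 0]

model.train()
for epoch in range(3):                                        # 3 epochs
    for i in range(0, len(texts), 16):                        # batch size 16
        batch = tokenizer(texts[i:i + 16], truncation=True, max_length=64,
                          padding=True, return_tensors="pt")
        out = model(**batch, labels=torch.tensor(labels[i:i + 16]))
        out.loss.backward()                                   # cross-entropy loss for single-label tasks
        optimizer.step()
        optimizer.zero_grad()
```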

4.3 Local post-hoc explainers comparison

We analyse the extent to which different LPEs are aligned in their process of identifying impactful concepts for the underlying NN model. With this aim, we train a BERT model over a specific dataset (following the approach described in Sect. 4.2) and compute the pairwise correlation \(\mathcal {C} \left( \epsilon _{ NN }\left( \mathcal {S}\right) , \epsilon '_{ NN }\left( \mathcal {S}\right) \right)\) (as described in Sect. 3) for each pair of LPEs in the selected set. To avoid issues caused by model overfitting over the training set, which would render explanations unreliable, we apply each \(\epsilon _{ NN }\) over the test set of the selected dataset.

4.3.1 Local post-hoc explainers disagreement

Using the pairwise correlation values we construct the correlation matrices shown in Figs. 3 and 4, which highlight how there exists only a very weak correlation between most LPEs over different datasets. Here, it is interesting to notice how only a few specific pairs or clusters of LPEs highly correlate with each other. For example, GS, GI, and LRP show moderate-to-high correlation scores, mainly due to their reliance on computing the gradient of the prediction to identify impactful concepts. However, this is not the case for all LPE pairs relying on similar approaches. For example, GI and gradient integration—HESS in the matrices—show little to no correlation, although they are both gradient-based approaches for producing local explanations. Similarly, SHAP and LIME show no correlation even if they both rely on input perturbation and are considered the state-of-the-art.

Fig. 3 \(\mathcal {C} \left( \epsilon _{ NN }\left( \mathcal {S}\right) , \epsilon '_{ NN }\left( \mathcal {S}\right) \right)\) using average aggregation as \(\mathcal {A}\) over the SMS (left) and YOUTUBE (right) dataset.

Fig. 4 \(\mathcal {C} \left( \epsilon _{ NN }\left( \mathcal {S}\right) , \epsilon '_{ NN }\left( \mathcal {S}\right) \right)\) using average aggregation as \(\mathcal {A}\) over the ALM (left) and BLM (right) dataset.

Figures 3 and 4 highlight how the vast majority of LPE pairs show very small to no correlation at all, exposing how the selected approaches actually disagree. Interestingly enough, the disagreement between LPEs holds true for every dataset studied in our analyses, no matter the complexity or simplicity of the learning task and the samples considered. This finding represents a fundamental result of our study, as it demonstrates how no agreement exists between LPEs even when they are applied to the same model and dataset, even on very simple classification tasks such as the one represented by the SMS dataset. The reasons behind the large discrepancies among LPEs may be various, but mostly come down to the following:

  • Some of the LPEs considered in the literature do not represent reliable solutions for identifying the reasoning principles of LLMs.

  • Each of the uncorrelated LPEs highlights a different set or subset of reasoning principles of the underlying model.

Therefore, our results show how complex it is to identify a set of fair and reliable metrics to spot the best LPE, or even reliable LPEs, as they seem to produce uncorrelated explanations. Similar results to the ones shown in Figs. 3 and 4 are obtained for all datasets and are made available at https://github.com/AndAgio/GELPE.

4.3.2 Aggregation affects correlation

Since our LPE correlation metric is dependent on \(\mathcal {A}\), we here analyse how the selection of different aggregation strategies impacts the correlation between LPEs. To understand the impact of \(\mathcal {A}\) on \(\mathcal {C}\), we plot the correlation matrices for a single dataset, varying the aggregation approach, thus obtaining the four correlation matrices shown in Fig. 5.

Fig. 5 \(\mathcal {C} \left( \epsilon _{ NN }\left( \mathcal {S}\right) , \epsilon '_{ NN }\left( \mathcal {S}\right) \right)\) using different aggregations over the ALM dataset.

From Fig. 5c, d one could notice a strong correlation between different LPEs. This seems to be in contrast with the results found in Sect. 4.3. However, the strong correlation achieved when relying on summation aggregation is not caused by an actual correlation between explanations, but rather by the susceptibility of summation to token frequency. Indeed, since the summation aggregation approaches do not take into account the occurrence frequency of lemmas in \(\mathcal {S}\), they tend to overestimate the relevance of popular concepts. Intuitively, using these aggregations, a rather impactless lemma appearing 5000 times would obtain a global impact higher than a very impactful lemma appearing only 10 times. These results highlight the importance of relying on average-based aggregation approaches when constructing global explanations from LPE outputs.

Figure 5 also points out how leveraging the absolute value of LPE scores leads to higher correlation scores. The reason behind this is to be found in the impact score distributions. While true local impact scores are distributed over the set of real numbers \(\mathbb {R}\), computing the absolute value of local impacts j shifts their distribution to \(\mathbb {R}^{+}\), shrinking possible differences between positive and negative scores. Moreover, LPE outputs rely much more heavily on scoring positive contributions using positive impact scores, and typically give less focus to negative impact scores. Therefore, the output of LPEs is generally unbalanced towards positive impact scores, making negative impact scores mostly negligible.

4.3.3 LPEs visualization examples

The results obtained over various LPEs when considering several input sentences identify a large discrepancy between the available LPE approaches. To better visualize the disagreement between LPEs, we here visualize the output of LPE explanations over a few of the sentences belonging to the considered datasets in Figs. 6 and 7. More in detail, we plot the LPE relevance scores for each token over the sentences of the datasets used in our experiments. Generally speaking, higher scores identify the most relevant tokens for the LLM prediction, while low scores identify non-relevant tokens. Negative scores are assigned to the tokens that negatively influence the prediction for a specific class, thus identifying the tokens that should drift the prediction towards a different class.

Fig. 6 Example of LPEs influence scores over the sentence achieving the lowest (left) and highest (right) correlation of LPEs in the SMS dataset.

Fig. 7 Example of LPEs influence scores over the sentence achieving the lowest (left) and highest (right) correlation of LPEs in the ALM dataset.

Figure 6 shows the LPE scores over two sentences of the SMS dataset. The left-side plot is obtained for a sentence where LPEs are far from being correlated, thus highlighting the disagreement between LPEs and confirming the findings of Sect. 4.3.1. The difference in LPE influence scores is evident across most tokens, with each LPE considering several different candidates as the most relevant tokens—e.g., SHAP focuses on anti, GS focuses on invest, LRP focuses on in, etc. On the other hand, the right-side plot is obtained for a sentence where LPEs are slightly correlated, thus showing some agreement among most LPEs. However, even considering sentences where LPEs generally agree, it is possible to notice how a few approaches are far from being perfectly adherent to the majority of LPEs. For example, SHAP and LIME assign an almost zero influence score to all tokens, while other LPEs tend to produce non-negligible scores. Similar results are obtained for the ALM dataset and shown in Fig. 7. However, for the ALM dataset, the disagreement among LPEs is evident even when selecting the sentence achieving the highest LPE correlation (right-side plot). Similar results to the ones shown in Figs. 6 and 7 are obtained for all sentences in each dataset considered, and are made available at https://github.com/AndAgio/GELPE.

4.4 Knowledge extraction

We here analyse whether and to what extent it is possible to extract a knowledge base representing the trained LLM from each LPE, and how much these are aligned in their process of explaining the underlying NN model. With this aim, we rely on the GELPE global explainer construction process presented in Sect. 3.5, extracting a set of rules representing the LLM decision process for each dataset at hand. As the building process depends on the number of most impactful lemmas, we vary the hyperparameter \(\mathcal {K}\) to select the top-\(\mathcal {K}\) relevant lemmas for each class. After the relevant lemmas are selected from a given LPE, we construct the skipgrams of relevant lemmas as the set of skipgrams occurring in the training set that are composed of relevant lemmas only. Skipgrams are considered to extend the capabilities of the extraction process to sequences of relevant concepts rather than blindly focusing only on single tokens. Once the relevant lemmas and skipgrams are available, we convert the samples of the training set into binary vectors describing the presence or absence of each lemma and skipgram. We optimise the CART model on the binary vectors representing the training samples and extract the corresponding knowledge from the tree as a set of ordered propositional rules. The extracted rules are sequential, meaning that one rule applies if and only if the previous rules were not successful in identifying the relevant prediction. To avoid incurring an unbearable number of propositional clauses—which would hinder the utility of the knowledge extraction process—we limit the depth of the CART model to be:

$$\begin{aligned} depth = \mu \cdot \dfrac{\Lambda }{\mathcal {K}*|\mathcal {Y}|}, \end{aligned}$$
(18)

where \(\Lambda\) represents the total number of relevant lemmas and skipgrams identified from the LPE, \(|\mathcal {Y}|\) represents the number of classes of the classification task at hand, and \(\mu\) represents a hyperparameter that we set to \(\mu =5\) empirically. Throughout the remainder of this paper, we leverage the average operation as the aggregation function \(\mathcal {A}\), as it represents the least biased aggregation process. However, we also experiment with other aggregation functions, such as sum, absolute sum, and absolute average, obtaining similar results. Therefore, in order to avoid redundancy we here show only the average aggregation results.
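For concreteness, a possible implementation of the depth bound of Eq. 18 is sketched below; the rounding to an integer depth is our assumption.

```python
def cart_depth_bound(num_features, k, num_classes, mu=5):
    """num_features is Lambda: the count of relevant lemmas plus their skipgrams."""
    return max(1, round(mu * num_features / (k * num_classes)))

# e.g. 600 lemmas/skipgrams, K = 100 and 2 classes -> depth bound of 15
print(cart_depth_bound(600, 100, 2))
```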

4.4.1 Knowledge fidelity

To assess the performance of the proposed knowledge extraction process from LPEs, we measure the fidelity of the predictions obtained using the propositional rules against the corresponding LLM predictions. The fidelity metric measures the percentage of instances in which the propositional rule predictions and model predictions are equivalent, thus measuring the accuracy of the knowledge extraction process. Since GELPE relies on the output of a single LPE mechanism to produce the logic program equivalent to the LLM at hand, we compare the fidelity performance of GELPE over all the LPEs presented in Sect. 3.2. Tables 4 and 5 present the fidelity of the GELPE extraction process over the SMS and YOUTUBE datasets. In those simple scenarios, the proposed approach extracts a set of accurate rules, representing with high fidelity the decision process of the underlying LLM. Using GELPE, we enable the extraction of simple and easy-to-understand rules from the complex black-box model.

Table 4 Fidelity of the extracted knowledge w.r.t. to the original BERT model over the SMS dataset.
Table 5 Fidelity of the extracted knowledge w.r.t. to the original BERT model over the YOUTUBE dataset.
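Fidelity is computed by comparing the surrogate's predictions with the LLM's own predictions sample by sample, roughly as in the minimal sketch below.

```python
import numpy as np

def fidelity(rule_predictions, llm_predictions):
    # Percentage of samples on which the logic program and the LLM agree.
    rule_predictions = np.asarray(rule_predictions)
    llm_predictions = np.asarray(llm_predictions)
    return float((rule_predictions == llm_predictions).mean() * 100)

print(fidelity(["spam", "ham", "spam"], ["spam", "ham", "ham"]))   # -> 66.66...
```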

Over more complex datasets, the performance of the knowledge extracted using GELPE varies depending on the dataset at hand. Table 6 shows the fidelity of GELPE over the BLT dataset, where the explanation model achieves up to 95.09% fidelity. Meanwhile, Tables 7 and 8 present the fidelity results over the ELE and SND datasets respectively, where the proposed GELPE extraction seems to struggle to achieve high fidelity values. This is due to the underlying complexity of the dataset at hand. For some tasks—e.g. YOUTUBE, BLT—considering the most relevant lemmas and their skipgram combinations is sufficient, while others—e.g. ELE, SND—require a more complex understanding of the inner sentence constructs.

Table 6 Fidelity of the extracted knowledge w.r.t. to the original BERT model over the BLT dataset.
Table 7 Fidelity of the extracted knowledge w.r.t. to the original BERT model over the ELE dataset.
Table 8 Fidelity of the extracted knowledge w.r.t. to the original BERT model over the SND dataset.

As expected, increasing the number of relevant lemmas \(\mathcal {K}\) considered to optimise GELPE results in higher fidelity, as the underlying CART model takes into account a broader set of meaningful features. However, increasing \(\mathcal {K}\) beyond a certain threshold results in unbearable rule complexity and smaller fidelity gains. The increase in rule complexity also hinders the understandability of the extracted explanation, which represents a fundamental aspect to take into account. This phenomenon is clearly shown in Tables 7 and 8, where the fidelity grows by up to 20% when \(\mathcal {K}\) ranges from 50 to 250.

Interestingly, the disagreement between different LPEs also seems to affect the performance of the obtained global explainer model. Fidelity results highlight that GELPE explanations obtained from highly correlated LPEs such as GI and GS achieve comparable performance levels. Meanwhile, propositional rules obtained from uncorrelated LPEs result in different fidelity levels. While expected, such a behaviour represents a useful finding, as it allows for the identification of more reliable LPEs as the ones that result in a higher level of fidelity—e.g., LIME in most scenarios.

4.4.2 Knowledge complexity

The ideal extraction process should output a set of sequential propositional rules that is as faithful as possible w.r.t. the underlying LLM. However, the dimensionality of the extracted program should be kept small to limit the complexity burden of the analysis process. An overly complex knowledge base would not be useful for analysing the inner working principles of the explained LLM, as it would be mostly impossible to process by a human interpreter. To assess the complexity of the extracted knowledge, we track the length of the logic program and its cumbersomeness. In this context, the length L represents the number of clauses in the obtained explanation, while the cumbersomeness C represents the average number of atoms in each clause. L and C represent two fundamental parameters for describing the complexity of the extracted logic program. Lengthier programs are more complex to read and may result in the reader getting lost. On the other hand, a higher cumbersomeness translates directly into longer rules, which are by default more complex to understand, as human users generally struggle with complex multi-variable reasoning. Moreover, longer rules are generally more specific, as they require linking multiple input variables—and possibly their interactions—to a specific output label. Therefore, when long rules are extracted it possibly means that the LLM exhibits a specific behaviour over a specific input. This phenomenon can translate directly into the identification of bias issues, overfitting problems and much more.
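Under the assumption that the extracted program is represented as a list of clauses, each being a list of atoms, the two complexity measures can be computed as in the following sketch.

```python
def program_complexity(program):
    """program: list of clauses, each clause a list of atoms (lemmas, skipgrams, negations)."""
    length = len(program)                                              # L: number of clauses
    cumbersomeness = sum(len(clause) for clause in program) / length   # C: average atoms per clause
    return length, cumbersomeness

toy_program = [["subscribe"], ["check", "not org"], ["suscribe", "withing"]]
print(program_complexity(toy_program))   # -> (3, 1.666...)
```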

For each dataset considered we keep track of L and C and analyse their variability over each LPE and \(\mathcal {K}\) value. Tables 9 and 10 show the complexity of the GELPE output over the YOUTUBE and ELE datasets, respectively. The results highlight the relevant difference in terms of the complexity required to extract reliable explanations when dealing with simple or complex classification tasks. Both L and C are kept small for each LPE and \(\mathcal {K}\) combination over the YOUTUBE dataset, while still being able to reach high fidelity (see Table 5). Meanwhile, the ELE moral classification task requires higher values of L and C in order to achieve a satisfactory level of fidelity (see Table 7).

Table 9 Complexity of the extracted knowledge over the YOUTUBE dataset.
Table 10 Complexity of the extracted knowledge over the ELE dataset.

Table 9 also highlights a dependency between the complexity of the extracted explanations and the parameter \(\mathcal {K}\). In the vast majority of cases, a higher \(\mathcal {K}\) produces a more complex global explanation program, usually characterized by a higher number of clauses L and a larger number of atoms for each clause C. This is expected, since a higher value of \(\mathcal {K}\) identifies a broader set of relevant lemmas considered during the optimization of the CART explainer, thus increasing the number of features available to construct propositional clauses. However, it is interesting to notice how the almost-linear dependency on \(\mathcal {K}\) affects C more than L, since L can be bounded during the CART optimization process via pruning. The increased complexity of the obtained explanation represents a fundamental aspect to take into account when considering leveraging GELPE, as explanations need to be bounded in complexity to remain human-readable. The limitation of the CART depth—see Eq. 18—helps from this perspective, as it allows keeping the complexity of the explainer under control in complex setups, such as the ELE dataset. This phenomenon can be seen in Table 10, where the complexity of the extracted explanations remains stable over \(\mathcal {K}\). However, depth limitation is not drawback-free, as it hinders the achievement of high fidelity values.

4.4.3 Knowledge visualisation

We visualise the logic programs obtained from the knowledge extraction process to analyse their correctness and understandability. Figure 8 shows the logic program \(\mathcal {P}\) obtained from the GELPE extraction process when leveraging LIME as LPE and \(\mathcal {K}=50\) on the YOUTUBE dataset. The extracted knowledge is characterised by a manageable complexity, having a small number of relatively short clauses. In this context, the summation symbol \(+\) identifies the concatenation of two relevant lemmas inside a sentence—lemma1 \(+\) lemma2 can be interpreted as lemma1 followed by lemma2. Moreover, we recall that the extracted rules are sequential, meaning that each propositional rule applies if and only if the previous rules did not. For example, in Fig. 8 the last rule, specifying that the message is spam, is valid only if none of the previous 31 rules matched a class output. Interestingly, the extracted knowledge also shows some relevant properties, such as the identification of spam comments as those containing certain hyperlinks (org lemma), subscription-related lemmas (sub and subscribe), as well as grammatical errors (suscribe rather than subscribe and withing rather than within).

Fig. 8
figure 8

Logic program \(\mathcal {P}\) obtained from the GELPE extraction process when leveraging LIME as LPE and \(\mathcal {K}=50\) on the YOUTUBE dataset.
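
The ordered semantics of the extracted rules can be mimicked with a few lines of code; the sketch below is a hypothetical illustration whose rule conditions and lemmas are invented for the example, not taken from Fig. 8.

```python
# Illustrative sketch of the sequential semantics of the extracted rules: each
# rule fires only if all preceding rules did not, and a final fact acts as the
# default prediction. Rule bodies and lemmas are hypothetical examples.

def followed_by(lemmas, first, second):
    """True iff `first` is immediately followed by `second` (the '+' operator)."""
    return any(a == first and b == second for a, b in zip(lemmas, lemmas[1:]))

def classify(lemmas, rules, default):
    for condition, label in rules:   # rules are evaluated in order
        if condition(lemmas):
            return label
    return default                   # reached only if no previous rule matched

rules = [
    (lambda s: "subscribe" in s, "spam"),
    (lambda s: followed_by(s, "check", "out"), "spam"),
]
print(classify("please subscribe to my channel".split(), rules, default="ham"))
```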

Figure 9 shows the extracted knowledge when GELPE is used with SHAP and \(\mathcal {K}=100\) over the BLM dataset. Here, it is also possible to notice relevant concepts being extracted from the LLM decision process. For example, the proposed extraction process identifies that the combination of the keywords obey and rape, as well as the keyword murder alone, results in the text being considered harmful. Meanwhile, the sequence standing + injustice, along with the justice keyword, identifies fairness as the predicted sentiment. Finally, since the extracted rules are sequential, the loyalty fact at the end of the program serves as the default prediction whenever none of the extracted rules applies. These results highlight the effectiveness of the proposed GELPE framework, which enables the extraction of meaningful logic rules from the LLM reasoning principles with high fidelity.

Fig. 9
figure 9

Logic program \(\mathcal {P}\) obtained from the GELPE extraction process when leveraging SHAP as LPE and \(\mathcal {K}=100\) on the BLM dataset.

4.4.4 Resource efficiency

The proposed GELPE framework allows for the extraction of sequential propositional rules from LLMs starting from LPE outputs. In an ideal scenario, the logic program obtained as a result of the GELPE process contains a handful of simple—i.e., short—clauses. The execution of such a simple program—a surrogate of the original LLM—requires little computational power, as it does not rely on complex operations, such as large matrix multiplications, that require GPUs or hardware-specific solutions. However, the complexity of the GELPE output can grow quickly depending on the set of considered lemmas and skipgrams, thus hindering its efficiency. Therefore, it is fundamental to assess the ability of the proposed GELPE framework to produce a resource-friendly surrogate model of the original LLM. To this end, we measure the time and energy efficiency of the original LLM against a few of the logic programs obtained using GELPE. In more detail, we run the original BERT model both in a GPU-enabled scenario—using a Tesla V100S-PCIE with 32 GB of memory—and in a CPU-only scenario—using an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz. We rely on the pyJoules library (Footnote 6) for measuring the energy consumption and latency of both LLM and logic program executions. pyJoules is a software toolkit relying on (i) the Intel “Running Average Power Limit” (RAPL) technology (Footnote 7) to estimate the power consumption of the CPU, RAM and integrated GPU devices; and on (ii) the Nvidia Management Library (NVML) technology (Footnote 8) to measure the energy consumption of Nvidia GPU devices. Therefore, pyJoules represents a reliable solution to measure the energy footprint of a host machine during the execution of a piece of Python code. We compare the efficiency of BERT against the most faithful logic program—i.e., the one obtained with LIME as LPE—and against the simplest one—i.e., the one obtained with SHAP as LPE. For each LPE, we consider two setups, having the lowest and highest value of \(\mathcal {K}\)—i.e., \(\mathcal {K}=50\) and \(\mathcal {K}=250\), respectively. The logic programs obtained via GELPE for each LPE are run using only the CPU device. We keep track of the average time \(\overline{t}\) required to infer the prediction over a single sample and the corresponding average energy consumed \(\overline{E}\). Table 11 shows the obtained results over all datasets.

Table 11 Resource efficiency comparison of BERT against GELPE for each dataset.
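
As a rough illustration of how such measurements can be instrumented, the sketch below combines a wall-clock timer with pyJoules, assuming its measure_energy decorator API; bert_predict and samples are hypothetical placeholders for the model forward pass and the evaluation set.

```python
# Illustrative sketch of per-sample latency/energy measurement. The pyJoules
# decorator below is assumed to report RAPL (CPU/RAM) and Nvidia GPU energy
# readings for each call; bert_predict and samples are placeholders.
import time
from pyJoules.energy_meter import measure_energy

@measure_energy
def bert_predict(sample):
    ...  # placeholder: one forward pass of the LLM (or one logic-program run)

def average_latency(predict_fn, samples):
    """Average wall-clock time per sample, in seconds."""
    start = time.perf_counter()
    for sample in samples:
        predict_fn(sample)
    return (time.perf_counter() - start) / len(samples)
```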

The obtained results highlight how, over simple setups such as SMS and YOUTUBE, the surrogate model obtained using GELPE always outperforms its BERT counterpart. This is due to the small task complexity, which enables the proposed framework to extract a small set of simple clauses mimicking the model behaviour. Indeed, the efficiency of the obtained logic program depends on the complexity of the clauses that must be analysed to reach a prediction. Meanwhile, over more complex setups such as the ELE dataset, where GELPE outputs a large set of long clauses, it is possible to outperform the BERT counterpart only when considering a small value of \(\mathcal {K}\). Notably, however, it is always possible to find a surrogate logic model obtained via GELPE that is more efficient than running the LLM over the CPU. These results highlight the advantage of leveraging a simple rule-based approach over sub-symbolic models when hardware acceleration is not available. As such, the proposed model represents a feasible solution for scenarios where the deployment setup is composed of resource-constrained devices, such as embedded devices and micro-controllers. In these scenarios, running the original LLM would not be acceptable, due to latency and memory issues, while GELPE’s output is a resource-efficient, transparent program that is easily deployable. Therefore, the obtained results show that the GELPE surrogate model is not just an explainable and transparent twin of the original LLM, but also an efficient one.

5 Discussion and limitations

Fidelity vs. efficiency trade-off The set of experiments proposed in Sect. 4 highlights how, in some scenarios, it is possible to identify a logic program surrogate of the original LLM that achieves both high fidelity and high efficiency. However, generally speaking there exists an intrinsic trade-off between the achievable fidelity of the surrogate logic program and its resource-efficiency gains. Indeed, Table 11 highlights how resource-efficiency gains are usually achievable whenever small logic programs are enforced using a small set of relevant lemmas—i.e., small \(\mathcal {K}\) values. However, these small programs do not attain the best achievable fidelity. Consider for example the YOUTUBE dataset, where GELPE relying on LIME with \(\mathcal {K}=50\) achieves 88% fidelity, against the best fidelity of 94% achieved with \(\mathcal {K}=150\). On the other hand, logic programs extracted using large sets of relevant lemmas—e.g., \(\mathcal {K}=250\)—usually achieve higher fidelity, while being less effective in reducing resource consumption. Therefore, the fidelity vs. efficiency trade-off can be identified as one of the limitations of the proposed approach. However, while this trade-off exists, it is fundamental to note that it matters only when hardware acceleration—e.g., using GPUs—is available. Indeed, even the largest logic programs—which are expected to be the most faithful—extracted with GELPE perform similarly—from the resource-efficiency perspective—to the LLM at hand whenever the latter runs on CPU only (see Table 11). Moreover, the application of knowledge extraction mechanisms is generally considered in those scenarios where model opacity is a no-go. Therefore, we consider the trade-off between achievable fidelity and resource efficiency to be far less relevant in scenarios where the available hardware is limited or where transparency represents the most important requirement.

BERT and other LLMs Throughout our investigation we consider BERT as the target LLM architecture. Indeed, BERT represents the first large NN model—comprising 340 million weights—targeting NLP and trained on a large corpus of data collected from the web, namely the BooksCorpus dataset and a dump of the English Wikipedia of the time. We rely on BERT as it allows for the quick implementation of all LPE approaches available in the state of the art. Indeed, some of these approaches require access to the inner mechanisms of the NN model to produce their explanations, and are thus not applicable to closed-source models such as the GPT family. The exclusive focus on BERT represents a limitation of the proposed work, as the behaviour of BERT may differ significantly from that of other LLMs. Therefore, we consider the application of our methodology to several different LLMs—wherever possible—as a future extension of this work. Moreover, we note that larger models—such as GPT or Llama—might exhibit emergent properties not appearing in the adopted BERT model [56]. Such emergent properties may lead to different results being achieved with the same methodology proposed in this paper, thus requiring further investigation. Indeed, emergent properties clash with model interpretability, making larger LLMs even more complex to analyse and inspect using available LPEs. Therefore, it is reasonable to expect an even larger lack of correlation amongst available LPEs over larger LLMs, caused by the inherently fuzzy nature of their emergent properties, which is difficult to analyse from a single or a few examples.

On the LLM reasoning principles Since the very first proposal of LLMs, the research community has largely explored and speculated on their ability to reason over complex concepts. However, the definition of LLMs’ reasoning capabilities and their limitations remains an open research question in the literature. Indeed, there is no definitive proof of the extent to which LLMs can process complex concepts with human-like logical reasoning behaviour. Therefore, we stress that in this paper, whenever we refer to the reasoning principles of LLMs, we mean the process by which the model elaborates the given textual information, without assuming any human-like reasoning capability from the LLM. Accordingly, the explanations extracted using GELPE mimic the information elaboration process of the LLM, rather than conjecturing the LLM’s logical reasoning capabilities. Therefore, the reasoning process carried out in the logic program may be profoundly different from the reasoning capabilities of LLMs, and rather represents the logic grounding of how information is elaborated sub-symbolically by the LLM.

6 Conclusions and future work

As intelligent agents are going to increasingly rely on LLMs for smooth interaction with humans and other agents, a fundamental issue for intelligent MASs is to open the LLM black boxes, enabling the explanation of their inner reasoning principles. However, xAI techniques for NLP still suffer from several issues, linked to the heterogeneity of available local explanation techniques and the lack of robust global explanation processes. Inspired by these limitations, we propose a novel approach for enabling a fair comparison among state-of-the-art local post-hoc explanation mechanisms, aiming at identifying the extent to which their extracted explanations correlate. We rely on a novel framework for extracting and comparing global impact scores from local explanations obtained from LPEs, and apply such a framework over several text classification datasets, ranging from simple to complex tasks. Our experiments show how most LPE explanations are far from being mutually correlated when LPEs are applied over a large set of input samples. These results highlight what we called the “quarrel” among state-of-the-art local explainers, underscoring the current fragility of xAI approaches for NLP. The disagreement is apparently caused by each explainer focusing on a different set or subset of relevant concepts, or imposing a different distribution on top of them. Furthermore, we propose a novel approach to construct global explanations—in the form of logic programs—of the original LLM starting from the LPE outputs. We test the global explanation extraction approach—namely GELPE—over a broad set of scenarios, highlighting its fidelity against the sub-symbolic model and the simplicity of the extracted knowledge. Moreover, we analyse the efficiency of the extracted logic programs, showing how it is possible to extract a logic program that is equivalent to the original LLM while being faster and less energy-consuming in scenarios where hardware acceleration is not available. Therefore, our experiments show how the extraction process can be leveraged to enable the deployment of NLP applications over resource-constrained environments, such as embedded devices and microcontrollers. These findings also highlight how—for some learning tasks—leveraging LLMs might represent an overcomplication, as it is possible to achieve similar performance using simple and small logic programs.

Future work will likely include the application of the proposed methodology to a broad range of state-of-the-art LLMs, starting from Llama and other available open-source architectures, aiming at showing whether—and to what extent—the findings of this paper apply to models other than BERT. Similarly, we intend to extend the analysis of the trade-off between the achievable fidelity and efficiency of the surrogate logic programs extracted using GELPE. Finally, although the proposed framework is applied to the NLP realm, it represents a useful starting point for analysing the relevance of LPEs in different domains, such as computer vision [57, 58], graph processing [59,60,61] and many more.