1 Introduction

Over the past few years, tax authorities have started to use machine learning techniques to process the large volumes of data they have about taxpayers. In doing so, they hope to leverage that data for a multitude of purposes, such as detecting fraudulent behaviour by taxpayers, integrating new sources of information, or simply doing tasks that would otherwise be neglected due to personnel shortages (Collosa 2021). The use of AI approaches can potentially allow tax authorities to increase their performance in gathering information and enforcing the law.

The development and use of tax AI in the public sector is, however, subject to legal constraints. In most countries, tax administration (just like other parts of public administration) is subject to the principle of legality: it can only act when authorized by law and in the form authorized by law (Hadwick 2022). Among these formal principles, most countries adopt some form of the duty to give reasons, which obliges administrative decision-makers to specify the facts and the laws that guide their decision-making (Fink and Finck 2022). This means that, regardless of whether an AI system is making a decision or providing inputs for a human decision-maker, the role of the system in decision-making must be explained (Bibal et al. 2021).

In this paper, we examine the suitability of current XAI techniques for providing explanations of decisions about tax. To do so, we make use of a prototype system for fraud detection, developed in collaboration with the Buenos Aires tax authority. This system, as further detailed in Sect. 3 below, is not yet at state-of-the-art performance, but it is nonetheless illustrative of the goals and approaches that power real-world applications of AI in tax. As such, it provides a realistic baseline for evaluating potential explanation models and assessing them vis-à-vis the relevant legal background.

This paper contributes by:

  • Giving an expert-based account of the feasibility of current XAI methods in the context of tax law, based on the dataset derived from real-life experiences.

  • Analysing the interplay between legal requirements, expectations of legal experts, and technical possibilities.

  • Laying the groundwork for future guidance on how to secure taxpayers’ constitutional rights and increase tax morale in the areas in which tax authorities rely on AI.

To evaluate said explanations, the paper proceeds as follows. Section 2 provides an overview of current scholarship on XAI and the specific requirements for explanations in the tax domain. Section 3.1 presents the dataset. Section 3.2 then discusses how the tax fraud detector was implemented, i.e. which machine learning models were chosen. Section 3.3 discusses the various XAI methods used and shows the results of their use. The general schema of the implemented system, as well as how this structure connects with the structure of this manuscript, is presented in Fig. 1. Section 4 moves to a qualitative assessment of the explanations produced in the previous section, contrasting them with the legal requirements that any AI-supported decision must meet. This comparison leaves something to be desired, so we conclude the paper by proposing technical and legal paths forward toward proper explanations in the tax domain.

Fig. 1 The structure of the implemented system and of the manuscript

2 Related work

In this paper, we use the term explainable artificial intelligence (XAI) to refer to the development of techniques that make the functioning of an AI system understandable for a given audience (Arrieta et al. 2020). XAI methods aim to show how the AI system’s input affects its output by revealing the link between the data ingested by the system and the decision it makes. Accordingly, XAI methods may provide decision-makers with an account of how a given AI-based system works, thereby allowing a technically guided explanation to be transformed into a justification by the public authority.

2.1 XAI curriculum

The methods included here were selected on the basis of their popularity in the XAI community, as evidenced by the availability of manuals and research papers. The popularity of explanation methods such as SHAP, LIME, anchors and counterfactuals makes it essential to test them in the area of tax law. All of these explanation generation methods work by examining a model’s input and output and do not rely on its inner workings. Thus, they are in principle applicable to any category of machine learning model.

The usability of the methods presented herein has been scrutinised in other works. A comparison of the results achieved using SHAP and LIME found that neither offers a clear advantage over the other and that end users assess them similarly (Górski and Ramakrishna 2021). Slack et al. (2020) showed that it is possible to prepare a malicious classifier that hides the real rationale for a decision when a perturbation-based explainer (such as SHAP or LIME) is used. Better sampling techniques are being developed to mitigate this adversarial attack (Vreš and Robnik-Šikonja 2022), which seems a necessity if explanation models are to evoke users’ trust in AI-based systems. Works such as (Bench-Capon 1993) and (Schweighofer 2022) have shown that there can be a disconnection between what a black-box classifier has internalized and the legal rationale for a decision. This can to some extent be identified using SHAP (Schweighofer 2022), but the identification is not straightforward. This does not mean that no part of the judicial decision-making process can be trusted to automation. For instance, data gathering for the purpose of making legal decisions has proven to be trustworthy (Barysė and Sarel 2023).

This disconnection can in principle be overcome by including domain knowledge in the explainer. Other authors have suggested building a hybrid system that includes domain knowledge in order to present results in a way that is better catered to the needs of prospective users in a legally oriented task (Branting et al. 2021). The authors used subsymbolic methods (embeddings, a technique that couples a text fragment with its representation in a high-dimensional vector space) together with manual annotation for supervised learning. This made it possible to include domain knowledge when the system’s predictions were explained, and the authors found this solution to be of higher value than one that used a heatmap, i.e. one that highlighted the text the neural model deemed most important without reference to predefined legal concepts.

One of the challenges of currently available explanation methods is that they implicitly rely on background knowledge (Combi et al. 2022). Such knowledge is often not in the possession of the people who are interested in the predictions of an AI-based system. For example, a heatmap can show which parts of a scanned brain contribute to a diagnosis, but this is of little use for a patient with limited or no medical knowledge (Robbins 2019). By analogy, the same applies to taxpayers targeted by the AI systems used by tax authorities. These systems are currently explainable only to a limited extent, as their explanations are typically understandable by AI domain experts and require more information before they can be used by lawyers or taxpayers. Other authors have already noticed that there are many stakeholders (groups of people interested in explanations), but most XAI scholarship seems to cater to system developers (Langer et al. 2021). For example, if the explanations were to be presented to a judge in a court case, they would have to be mediated through an expert’s testimony (Kuźniacki et al. 2022).

Even with the support of tax experts, XAI methods in the field of tax law may not be understandable by taxpayers, owing to absent or insufficient collaboration and mutual understanding between the AI experts developing and deploying the AI systems used by tax authorities and tax experts. For example, methods such as SHAP feature importance plots are able to show the variables that impact the neural network’s prediction to the greatest extent, but showing how they interact with other features and background knowledge to arrive at the result is a whole new endeavor. In other words, currently available explanations are important stepping stones in research that will have to provide the wider context in which the decision was made: “beliefs and motivations; hypotheses of other (human, animal or AI) agents’ intentions; interpretation of external cultural expectations; or, processes used to generate its own explanation” (Dazeley et al. 2021). Current explainability methods fall short in terms of causality: they present the model’s relevant modules and input data, which does not necessarily lead to the user’s satisfaction and understanding in the context of a given task (Holzinger et al. 2019). Generative AI opens up new possibilities for generating explanations, and these yield promising results that have the capacity to make XAI easy to understand for laypersons (Yu et al. 2022).

2.2 Legal requirements for XAI in the tax domain

For the purposes of this paper, it is important to distinguish between three concepts: interpretability, explainability, and justification. The first two are occasionally used as synonyms, but we follow some scholars (Arrieta et al. 2020) in ascribing different meanings to them. Interpretability refers to an inherent quality of a machine learning model that allows a human (usually an expert; Kolkman 2022) to make sense of it, whereas explainability refers to the possibility of designing an interface that allows a human to make sense of the model. In both cases, what one wants to understand is the model and the outputs it produces (Creel 2020).

Justification, in contrast, is less concerned with understanding and more with the legal value of a decision. Under the principle of legality (Craig 2020), administrative decisions are only valid to the extent that they are grounded on legal authority. From a legal perspective, it follows that a decision by a tax authority—which is a form of administrative body representing the executive state’s (fiscal) power—must be justified with reference to the existing laws, regulations, and other legal instruments applicable to a decision. In fact, these authorities are obliged, to a large extent, to present these reasons to the persons affected by the decision and sometimes to the public (Schauer 1994; Bardutzky 2022). This reason-giving duty continues to apply whenever an AI system becomes part of an administrative decision-making procedure (Bibal et al. 2021).

Since explanation and justification are different things, XAI techniques, by definition, are not sufficient to produce justifications of the kind expected by the law. Nonetheless, it has been argued that explanations of AI-based decisions are necessary for evaluating any justification of such a decision. Since the information provided by AI systems is an important factor in decision-making processes (Demková 2021), any assessments of how the law is applied to a given context must engage with how the system processed data and how that data was used. Consequently, some authors (Fink and Finck 2022) have argued that explanations must be a part of the reason-giving whenever an AI system is used in administrative contexts. At the same time, others (Ferrario and Loi 2022; Mehdiyev et al. 2021; Zerilli, Bhatt, and Weller 2022) have argued that explanations can contribute to the acceptance of AI-based tax decisions by taxpayers. However, some have raised warnings about how the use of XAI may create undue constraints to legal decision-making (Esposito 2022), or launder unacceptable decisions through the manipulation of explanations (Bordt et al. 2022) or, more generally by validating institutional practices of secrecy (Busuioc et al. 2023). XAI, therefore, is not a panacea for algorithmic transparency in the government, or an automation of justification, but a necessary element of the overall governance of public sector AI.

To fulfill this role, XAI solutions in the public sector must be tailored to meet the informational needs imposed by law. In the tax domain, such tailoring means that an explanation of a tax decision must provide the information that is needed to evaluate whether that decision complies with the applicable laws and the rights of taxpayers (Kuźniacki et al. 2022). In particular, XAI can play an important role in preventing arbitrary and discriminatory decisions that infringe the right to privacy without proper judicial oversight, such as those already detected in some jurisdictions (Amnesty International 2021). Notably, AI systems without oversight might: (i) produce biased decisions, (ii) be used for purposes beyond the legitimate scope that motivated their introduction, or (iii) be used in ways that deprive taxpayers of their right to contest potentially wrongful decisions (Kuźniacki et al. 2022). All these risks are compounded by the various forms of opacity that surround AI systems, which may preclude taxpayers from learning about the tax decision-making procedure or even about the existence of a decision based on an AI system in the first place. For the legal system to reach the stage at which automated decision-making can be implemented, the infrastructure and the information-gathering process need to change so that they can be served by machines and accompanied by sufficient explanations (Reiling 2020). Additionally, it has been noted that changing the way legal decision-making is organized requires the involvement of the system’s actors, such as judges and the tax administration, to fully reorganize the procedures and create an ecosystem capable of using new technologies (Sourdin 2022).

The decisions of tax authorities must comply with the principle of formal motivation. To comply with this principle, all factual and legal grounds on which the decision is based should be mentioned and explained by the tax authorities, unless tax and/or trade secrecy requires tax authorities not to reveal certain (especially factual) information concerning their decisions (Kuźniacki et al. 2022). The justification for such decisions must be clear and precise and reflect the real motives behind the decision. If human decision-makers have no access to the explanations of systems they rely on, they might end up simply following any recommendations from those systems (Wagner 2019), or even adopting a selective form of compliance, in which they disregard any solutions that “seem off” and follow what “seems plausible” (Alon-Barkat and Busuioc 2023). Whenever that happens, these decision-makers cannot offer any reason beyond “computer says no”, a rationale that is incompatible with the principle of legality (Oswald 2018), as held by courts in various jurisdictions (Fink and Finck 2022; Kuźniacki et al. 2022; Zandstra and Brouwer 2022). Hence, the use of AI systems in the tax administration is unlikely to be lawful without some support from XAI techniques. The specific configurations of these techniques, however, will depend on the particular requirements of the jurisdiction in which the system is used.

3 Empirical studies

3.1 Dataset

Any evaluation of XAI approaches in the tax domain must consider the context in which AI is used. However, some areas of the government—notably law enforcement (Curtin 2020) and tax authorities (Hadwick 2022)—use the law to prevent, or at least restrict, disclosure of information about the algorithms they use and the data that feeds those algorithms. Accordingly, there is little published technical work on the application of XAI techniques in the context of tax authorities (Kuźniacki et al. 2022; Mehdiyev et al. 2021; Kuźniacki 2022).

There are a number of legal datasets targeted at machine learning projects, but they differ in creation methodology. The COMPAS dataset discloses the decision that was made in practice and the criteria for that decision. The criteria were extracted from the relevant authorities using public records, and the dataset’s authors merged them with prison and jail information (Barenstein 2003). Such scraped data is often not immediately useful and further processing is frequently needed. The data needs to be manually extracted or annotated, or specialized preprocessing tools have to be developed (Górski et al. 2020).

In this paper, we make use of a dataset prepared by the Buenos Aires tax authorities for the purposes of fraud detection. It stands out from the aforementioned datasets by focusing on tax law and on the facts of the case that tax authorities use to assess the risk of fraudulent activity. This dataset is not yet available to the public (and the tax authorities have received our feedback regarding its future development), but the following lines describe its general characteristics, starting with the relevant legal background.

Argentina has three jurisdictions with taxing powers: national, provincial and municipal. At the provincial level, all the 23 Argentine provinces as well as the Autonomous City of Buenos Aires (CABA) impose a Gross Turnover Tax (GTT) on the regular activity of commerce, industry, services or any other activity carried out within their jurisdictions. This GTT is levied on gross revenues resulting from the regular and onerous exercise of commerce, industry, profession, business, services or any other onerous activity conducted on a regular basis within the respective provincial jurisdiction.

The presumptive tool can legitimately be used to detect the possible existence of fraud or tax evasion. Tax evasion is understood as “elimination or reduction of tax produced within a country, by those who are legally bound to pay it and who achieve that result by means of fraudulent or omissive conducts that violate legal provisions” (Villegas Héctor 2001), that is, the failure to fulfil a tax obligation through illegitimate means. It is distinct from legal ways of avoiding that obligation (elusion) and from choosing not to carry out the event that gives rise to the obligation in the first place (economy of choice).

In that regard, the legislator establishes that the following “may be used as indicators, among others: the capital invested in the exploitation, fluctuations in assets, volume of transactions or sales from previous tax periods, the amount of purchases, the existence of merchandise, the existence of raw materials, dividends, general expenses, wages and salaries, the rent of property used for the business, industry or exploitation and of the house-room, the taxpayer’s standard of living, the normal performance of businesses, exploitations or similar enterprises from the same branch; and any other elements of judgement that are in the possession of the Administration or which must be provided to it by the taxpayer or liable person, chambers of commerce or industry, banks, trade union associations, public or private entities, collection agents or any other person who possesses useful information in this respect related to the taxpayer which is related with the verification and determination of the taxable events” (arts. 247 s paragraph, Fiscal Code). The formula laid down by the legislator is broad and includes, but is not limited to, the aforementioned indicators. In addition, article 248 establishes different systems for determination on a presumptive basis.

Based on that broad legal definition, the Buenos Aires Tax Authorities prepared a dataset that reflects the practical side of their work. The dataset consists of nine features denoting the existence of the facts that the authorities use to assess the probability of tax fraud and the status of the taxpayer (cf. Table 1). The features are binary and arranged in a table that records, for each case, whether a given fact took place and whether the case was ultimately assessed as fraudulent.

Table 1 Description of risk features, as delivered by Buenos Aires Tax Authority

There are some cases with missing data, i.e. it is not declared whether a certain fact occurred in a given case. Upon consultation with the Buenos Aires tax authority, we found that this denotes the unavailability of the given data for a particular case. Since the lack of data is itself valuable information, we encoded it using an arbitrarily chosen number (2).
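The encoding step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' preprocessing code: the helper name `encode_missing` and the feature names `Feature_B` and `Feature_C` are hypothetical, and only the sentinel value 2 comes from the text.

```python
import math

MISSING_CODE = 2  # arbitrarily chosen sentinel for "data not available", as in the text


def encode_missing(row):
    """Return a copy of `row` in which None/NaN values are replaced by MISSING_CODE."""
    encoded = {}
    for name, value in row.items():
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        encoded[name] = MISSING_CODE if is_missing else int(value)
    return encoded


# Hypothetical case; Feature_B and Feature_C are placeholder names.
case = {"Excess_Deductions": 1, "Feature_B": float("nan"), "Feature_C": 0}
encoded_case = encode_missing(case)
```

Treating missingness as a third category, instead of dropping or imputing the affected rows, keeps the 3290 incomplete cases in the data and lets a classifier learn from the absence of information.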

Figure 2 presents the number of times a given value was assigned to a given feature in the dataset: 1 denotes the existence of a fact in a given case, 0 denotes that such a fact did not occur, and NaN denotes missing data. The dataset consists of 6465 cases, of which 3290 rows have at least one NaN value. As this is a real-life dataset, it exhibits data imbalance. That is, there is a significant disproportion between the number of cases that are fraudulent and those that are not (612 instances in total). Such imbalance is inevitable due to the underlying phenomenon described, i.e. tax fraud, by its nature, always constitutes a tiny fraction of the behaviour of all taxpayers, here within the group of taxpayers subject to the tax law described above (Zareapoor and Shamsolmoali 2015).

Fig. 2 Histogram of the distribution of feature values. This is a part of the data analysis performed before the development of the tax fraud detector and the generation of explanations

Moreover, we discovered that the majority of rows are either repeated completely or differ only in whether a fraud was committed in a given case. In other words, there are dataset rows that exhibit the same feature values and differ only in the final outcome, i.e. whether fraud was committed. In machine learning terms, the dataset is noisy. This suggests that more features that differentiate various cases should be introduced in the future, and this need has been relayed to the Buenos Aires tax authority. This lack of predictive potential is a known phenomenon in the case of datasets that consist of categorical features. Also, the features Collecting_Agent and Higher_Rate_Tax_Payer are constant in the dataset (they are always 0), so they offer no predictive power and are not used in modelling.
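Both checks mentioned above (rows with identical features but different outcomes, and features that never vary) are straightforward to express in code. The following is a minimal sketch under stated assumptions: the label column name `fraud` and the toy rows are hypothetical, and the helper names are illustrative only.

```python
from collections import defaultdict


def find_conflicting_groups(rows, label_key="fraud"):
    """Return feature combinations that occur with more than one label value."""
    labels_by_features = defaultdict(set)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != label_key))
        labels_by_features[key].add(row[label_key])
    return [dict(key) for key, labels in labels_by_features.items() if len(labels) > 1]


def constant_features(rows, label_key="fraud"):
    """Return the names of features that take a single value across all rows."""
    names = [k for k in rows[0] if k != label_key]
    return [n for n in names if len({row[n] for row in rows}) == 1]


# Toy rows (hypothetical values): the first two share features but differ in outcome.
rows = [
    {"Excess_Deductions": 1, "Collecting_Agent": 0, "fraud": 1},
    {"Excess_Deductions": 1, "Collecting_Agent": 0, "fraud": 0},
    {"Excess_Deductions": 0, "Collecting_Agent": 0, "fraud": 0},
]
conflicts = find_conflicting_groups(rows)
constants = constant_features(rows)
```

In this toy input, the one conflicting group marks cases that no classifier can separate on the given features, while the one constant feature can be dropped before modelling without losing information.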

We have generated a synthetic dataset based on the one described in the preceding paragraphs. This was done using the normalizing flow algorithm (Durkan et al. 2019). Synthetic dataset generation aims to create completely new datasets that mimic the distribution of samples in the original one. This algorithm-based generation allowed us to create a dataset consisting of 1300 samples, 999 non-fraudulent and 301 fraudulent. This synthetic dataset allows us to prepare machine learning models without the need to perform a human subject research study, whilst still allowing us to assess the algorithms’ performance. The use of such a dataset further mitigates any worries regarding the privacy and data safety of the taxpayers mentioned in the dataset.

3.2 Classifiers

The dataset described above was used as the starting point for the implementation of several well-known classification models. While these implementations were meant as a proof of concept for a potential automated detector, they are not the focal point of this study. However, it is still important to present the classifiers implemented. Such a system could in principle be used by tax authorities either to identify potentially fraudulent behavior for further scrutiny, or to provide an assessment of the fraudulent nature of the conduct. The nature of the dataset and the features identified by the tax authorities make the overall system better suited to the former role, that is, supporting the tax administration as it exercises its lawful discretion. This role is also the one suggested by considerations of efficiency and compatibility with fundamental taxpayer rights. The discretion itself raises various questions in the context of AI, but the answers to them lie beyond the scope of our paper (cf. De Cooman 2023).

The classifiers implemented and presented in this chapter are as follows: a decision tree, a random forest, logistic regression, a simple neural network (with three fully connected hidden layers, sized 20, 15 and 10 respectively), a hybrid neural network-decision tree model (with the tree created from the model using the ANN-DT algorithm (Schmitz et al. 1999)), k-nearest neighbors (KNN), Bayesian rule lists, and XGBoost (XGB). One-hot encoding was used. The hyperparameter space was also explored and the best model was chosen.
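As an illustration of the one-hot encoding mentioned above, the sketch below expands each three-valued feature (0, 1, or the missing-data code 2) into three indicator columns. It assumes that the encoding was applied to these three-valued features, which the text does not spell out, and the name `Feature_B` is a placeholder.

```python
CATEGORIES = (0, 1, 2)  # fact absent, fact present, data missing


def one_hot_row(row, feature_names):
    """Expand each categorical feature into len(CATEGORIES) indicator values."""
    vector = []
    for name in feature_names:
        vector.extend(1 if row[name] == category else 0 for category in CATEGORIES)
    return vector


features = ["Excess_Deductions", "Feature_B"]
vec = one_hot_row({"Excess_Deductions": 1, "Feature_B": 2}, features)
```

With this representation, a model does not treat the arbitrary code 2 as "twice as much" as 1, which matters for models such as logistic regression or neural networks that interpret inputs numerically.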

This exploration used the traditional method of testing hyperparameters of varying order of magnitude using the grid search method. This implementation used fivefold cross-validation to perform the search (Agrawal and Agrawal 2021). This led to setting the following parameters: for logistic regression, C = 10, solver = newton_cg, tol = 1e-5, penalty = l1; for the decision tree, max_depth = 7, criterion = gini, max_features = 10, min_samples_leaf = 1, min_samples_split = 5, min_weight_fraction_leaf = 0, splitter = random; for KNN, algorithm = ball_tree, leaf_size = 100, n_neighbors = 10, p = 1; for Random Forest, max_depth = 7, max_features = 1, min_impurity_decrease = 0, min_samples_leaf = 1, min_samples_split = 5, min_weight_fraction_leaf = 0, n_estimators = 10; for XGB, eta = 1, gamma = 0.1, max_depth = 6; for SVC, C = 10, kernel = poly. The neural network was trained for 250 epochs (using early stopping), with batch size = 16. This was implemented using the following Python libraries: TensorFlow 2.10.0, scikit-learn 1.1.2, XGBoost 1.6.2, imodels 1.3.18.
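The mechanics of a grid search with fivefold cross-validation can be sketched in plain Python. This is not the authors' pipeline (in practice a library routine such as scikit-learn's `GridSearchCV` would typically be used). The toy "threshold" classifier, the synthetic data and all function names below are assumptions made purely for illustration.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(fit, score, X, y, param_grid, k=5):
    """Try every parameter combination; return the one with the best mean CV score."""
    folds = k_fold_indices(len(X), k)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(
                score(model, [X[i] for i in held_out], [y[i] for i in held_out])
            )
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_threshold(X, y, threshold):
    """Toy model: flag a case when at least `threshold` risk facts are present."""
    return threshold


def accuracy(model, X, y):
    preds = [1 if sum(x) >= model else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Synthetic toy data: three binary features, label 1 when two or more facts are present.
X = [[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(20)]
y = [1 if sum(x) >= 2 else 0 for x in X]
best_params, best_score = grid_search_cv(
    fit_threshold, accuracy, X, y, {"threshold": [1, 2, 3]}
)
```

Each candidate configuration is scored on five held-out folds, and the configuration with the highest mean score is retained. The parameter values reported above are the outcome of this kind of search over each model's own grid.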

The classifiers’ performance has been evaluated using a number of metrics. Herein (Table 2), we show the results in terms of accuracy and ROC AUC. Those values are presented as means obtained using bootstrapping (with n = 1000), alongside the 95% confidence intervals (Adibi 2004). For the hybrid neural network + decision tree solution, the network that achieved 0.66 accuracy during the bootstrapping was chosen as the base from which to generate a tree. Additionally, confusion matrices are presented for those bootstrapped models whose accuracy score was closest to the mean.
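A percentile bootstrap for a mean accuracy and its 95% confidence interval can be sketched as follows. This is a simplified illustration: it resamples a fixed set of predictions, whereas the procedure described above appears to involve bootstrapped models. The function name, the seed and the toy labels are assumptions.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) of accuracy over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Draw n case indices with replacement and score the resampled set.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    mean = sum(scores) / n_resamples
    return mean, lower, upper


# Toy labels: 100 cases, of which the hypothetical classifier gets 70 right.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 35 + [0] * 15 + [0] * 35 + [1] * 15
mean, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
```

The interval bounds are simply the 2.5th and 97.5th percentiles of the resampled scores, so no distributional assumption about the accuracy is required.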

Table 2 Performance of various implementations of tax fraud detectors. Those detectors were created before explanations of their decisions were generated

The results are presented in Table 2 for a test set comprising 20% of all the instances. In general, the results are on par, with the exception of the Bayesian Rule List. In this research, we did not strive to maximize the performance metrics of the models. Rather, we have treated them as a starting point for explainability analysis for tax law applications. In this respect, it can already be noted that the neural network and the hybrid neural network + decision tree solution built from the same network achieved very similar performance metrics. The latter is also a readily interpretable solution (cf. the next section). Any neural network can in principle be represented by a decision tree (Aytekin

Anchors

Anchors provide explanations in natural language, framed in conditional terms: if this, then that. While the actual contents of such a conditional might not be straightforward to parse, this conditional structure allows taxpayers to identify decision rules that should be discussed in court because of their importance in determining tax consequences in a concrete tax case.

Counterfactuals

Counterfactual explanations have the potential to be useful for lawyers, as their selective and contrastive nature makes them closer to the kind of explanation produced by human beings. Current implementations of counterfactuals, however, fall short of that potential. Whenever the change in the outcome is due to a change in present values, the reader of the table might be able to turn that counterfactual into actionable information. The features are to be read in the same way as in the dataset itself (cf. Table 1 and its description). For example, the change in the feature Excess_Deductions shown in Table 8, from the presence of the fact denoted by that feature (1) to its absence (0), leads to a change in the outcome, and so taxpayers might adjust their behaviour in the future. However, the meaning of the counterfactual is less clear when the absence of data leads to a change in outcome. This is the case, for example, when the second counterfactual leads to no tax fraud detection because of a change in the data for Labour Cost Sales, from the absence of the fact (0) to missing data about it (2).
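The two kinds of counterfactual discussed above can be reproduced with a deliberately simple search that changes one feature at a time. This brute-force sketch is not the counterfactual method evaluated in the paper. The toy decision rule, the identifier `Labour_Cost_Sales` (written with underscores here) and the function names are assumptions for illustration.

```python
VALUES = (0, 1, 2)  # 0 = fact absent, 1 = fact present, 2 = data missing


def single_feature_counterfactuals(predict, instance):
    """List (feature, old_value, new_value) changes that flip the prediction."""
    original = predict(instance)
    found = []
    for name, current in instance.items():
        for candidate in VALUES:
            if candidate == current:
                continue
            changed = dict(instance, **{name: candidate})
            if predict(changed) != original:
                found.append((name, current, candidate))
    return found


def toy_predict(case):
    """Hypothetical rule, NOT the paper's model: flag fraud (1) or not (0)."""
    return int(case["Excess_Deductions"] == 1 and case["Labour_Cost_Sales"] != 2)


instance = {"Excess_Deductions": 1, "Labour_Cost_Sales": 0}
cfs = single_feature_counterfactuals(toy_predict, instance)
```

The first result (Excess_Deductions from 1 to 0) is actionable for a taxpayer, whereas the results that move a feature to the missing-data code 2 flip the outcome without suggesting any behaviour the taxpayer could adopt, which is the interpretive difficulty noted above.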

From the perspective of taxpayers, the low comprehensibility of counterfactuals makes this method less assessable (weak + +), as the interpretation of the XAI outputs will require expert interpretation before it can be used. While some taxpayers, particularly those with more resources, might be able to extract more insights from the outputs of the counterfactual model tested above, this is unlikely to be the case of the ordinary taxpayer, who is not supported by resources of major tax advisory firms. Contrastingly, a more comprehensive formulation of the counterfactuals might make the outputs more useful for users, even if it comes at the expense of some of their potential for assessment in the hands of sophisticated users.

4.2.4 Interpretable methods

Bayesian rule list

A non-technical reader might easily get acquainted with the outputs of a Bayesian Rule List. Such a list relies on the kind of if–then-else rules described for anchors, which are also presented in natural language. However, rule lists have the advantage of allowing users to use these rules to evaluate both the general behaviour of the system and its application to particular cases. As such, the technique is very suitable for situations in which taxpayers might need to make adversarial inquiries to contest discriminatory and arbitrary outputs of tax AI systems.

Coefficient interpretation

While numerical coefficients might be understandable for software developers and domain experts, a taxpayer might lack the context needed to understand the relative magnitude of a coefficient. As such, the interpretation of coefficients for a logistic regression might be useful for taxpayers if they are mathematically savvy or supported by technical experts. Otherwise, it might not provide much in terms of actionable insights for evaluating or contesting the decisions made by the algorithm.

Hybrid neural network with decision tree solution

In abstract terms, the logic of such a hybrid solution is very accessible to the lay taxpayer. Depending on the positive or negative answer regarding each feature in each node, this decision tree goes down through all nodes and branches until a definitive classification of tax fraud or the lack of it is revealed. In practice, however, the sheer number of factors involved in decision-making might result in trees that exceed human cognitive abilities (Miller 1956).
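The traversal described above, including the record of answers that serves as the explanation, can be sketched as follows. The tree below is a hypothetical two-level example rather than the tree extracted from the paper's network, and the name `Feature_B` is a placeholder.

```python
def classify_with_path(node, case):
    """Walk the tree from the root to a leaf, recording each yes/no answer."""
    path = []
    while "label" not in node:
        answer = case[node["feature"]] == 1  # is the fact present in this case?
        path.append((node["feature"], answer))
        node = node["yes"] if answer else node["no"]
    return node["label"], path


# Hypothetical toy tree: inner nodes test one feature, leaves carry a label.
toy_tree = {
    "feature": "Excess_Deductions",
    "yes": {
        "feature": "Feature_B",
        "yes": {"label": "fraud"},
        "no": {"label": "no fraud"},
    },
    "no": {"label": "no fraud"},
}
label, path = classify_with_path(toy_tree, {"Excess_Deductions": 1, "Feature_B": 0})
```

The returned path is the per-case explanation: it lists exactly which questions were asked and how they were answered. Its length grows with the depth of the tree, which is why large trees become hard to follow.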

In the example presented in Sect. 3, we have used some techniques to reduce the complexity of the tree, such as presenting only a part of it. But, even though the figure presents only a tiny branch of that tree, it is still difficult to understand. In addition, looking at only a part of the tree may mislead observers: not only might the same decision be reached through a different combination of the factors in that tree, but a particular branch of the tree might not include all elements considered in the decision. As such, much of the value of the decision tree approach will depend on the approaches used to select the part of the tree that is presented as an explanation.

5 Conclusions

In this paper, we have introduced a new dataset, synthetically generated from real-life, tax-related data prepared by the Buenos Aires tax authorities. This presentation was supplemented by a study aimed at uncovering the feature requirements for an explainable system that such authorities may be interested in using, as well as a technical study in which we compared several explainers.

LIME scores the best in the evaluation of XAI methods, but under the current legal setup this is not enough to meet the minimum standard for direct use in tax decision-making. For that to happen, it would need to be slightly more comprehensible. Counterfactuals, if made more comprehensible, would also be a good candidate to contribute to explainable tax AI. Perhaps a good approach would be to design an XAI method that merges LIME and counterfactuals with high (at least strong + +) comprehensibility in mind. This would also translate to very high (+ + +) assessability, thereby creating an XAI method that meets a minimum standard for explainable tax AI.

When it comes to interpretable models, the Bayesian Rule List scored the best overall, and its outputs are also clearer than those produced by all XAI methods. This approach appears to ensure the explainability of tax AI for taxpayers in the best way due to its strong comprehensibility and assessability. The hybrid decision model also showed some potential, but it ultimately is not comprehensible enough for use by most taxpayers. To address these shortcomings, a potential direction for development would combine a Bayesian Rule List with a deep neural network, using the former to reflect the latter’s output production logic. By doing so, it might be possible to provide explanations that are even more informative than the Bayesian Rule List for the purposes of tax AI.

In the near future, we aim to mitigate the limitations of the technical study, strengthening the contributions of this paper. Firstly, we base the crux of our findings on qualitative analyses that draw on the various experiences of the authors, overcoming the difficulties that tend to arise from such co-operations (Ratcheva 2009). Nevertheless, the dataset presented herein calls for a more quantitative study to be performed. Secondly, the dataset is based on features identified by the tax authorities themselves. Although they were screened against the needs of an ML-based system, there is a distinct possibility that better performance could be attained if additional features were brought into the dataset. This work was able to highlight the challenges encountered by computer scientists when they develop explainable systems, as well as the lawyers’ expectations towards such systems. Thus, it can serve as background work that guides how to ensure taxpayers’ rights and increase tax morale in the AI-reliant areas of tax law. This goes to show that transparent and explainable AI contributes to a more equal distribution of that knowledge. Moreover, this can facilitate the creation of new user-centric and domain-specific XAI methods, a challenge in its own right.