1 Introduction

Software is the main driver of innovation of our times. Software-defined systems are permeating our communication, perception, and storage technology as well as our personal interactions with technical systems at an unprecedented pace. “Software-defined everything” is among the hottest buzzwords in IT today [78, 121].

At the same time, we are doomed to trust these systems, despite being unable to inspect or look inside the software we are facing: The owners of the physical hull of ‘everything’ are typically not the ones owning the software defining ‘everything’, nor will they have the right to look at what and how ‘everything’ is defined. This is because commercial software is typically protected by intellectual property rights of the software manufacturer. This prohibits any attempt to disassemble the software or to reconstruct its inner workings, even though it is this very software that is forecast to define ‘everything’. The use of machine-learnt software components amplifies the problem considerably by adding opacity of its own kind. Since the commercial interests of software manufacturers are seldom aligned with the interests of end users, the promise of ‘software-defined everything’ might well become a dystopia from the perspective of individual digital sovereignty. In this article, we address two of the most pressing incarnations of problematic software behaviour.

Diesel emissions scandal

A massive example of software-defined collective damage is the diesel emissions scandal. Over a period of more than 10 years, millions of diesel-powered cars have been equipped with illegal software that altogether polluted the environment for the sake of commercial advantages of the car manufacturers. At its core, this was made possible by the fact that only a single, precisely defined test setup was put in place for checking conformance with exhaust emissions regulations. This made it a trivial software engineering task to identify the test particularities and to turn off emission cleaning outside these particular conditions. This is an archetypal instance of software doping.

Software doping can be formally characterised as a violation of a cleanness property of a program [10, 32]. A detailed and comparative account of meaningful cleanness definitions related to software doping is available [16, Chapter 3]. One cleanness notion that has proven suitable to detect diesel emissions doping is robust cleanness [16, 19]. It is based on the assumption that there is some well-defined and agreed standard input/output behaviour of the system, which the definition extends to the vicinity of inputs and outputs close to the standard behaviour. The precise specification of “vicinity” and of “standard behaviour” is assumed to be part of a contract between software manufacturer and user. That contract entails the standard behaviour, distance functions for input and output values, and distance thresholds to define the input and output vicinity, respectively. With this, a system behaviour is considered clean if its output is (or stays) in the output vicinity of the standard, unless the input is (or moves) outside the standard’s input vicinity (see Fig. 1).

Example 1

Every car model that is to enter the market in the European Union (and other countries) must be compliant with local regulations. As part of this homologation process, common to all of these regulations is the need for executing a test under precisely defined lab conditions, carried out on a chassis dynamometer. In this test, the car has to follow a speed profile, which is called a test cycle in the regulations. At the time when the diesel scandal surfaced, the New European Driving Cycle (NEDC) [128] was the single test cycle used in the European Union. It has by now been replaced by the Worldwide harmonized Light vehicles Test Cycle (WLTC) [124] in many countries. We refer to previous work for more details [16, 19, 22]. From the perspective of fraud prevention, having only a single test cycle is a major weakness of the homologation procedure. Robust cleanness can overcome this problem. It admits the consideration of driving profiles that stay in the bounded vicinity of one of several standardised test cycles (i.e., NEDC as well as WLTC), while enforcing bounds on the deviations regarding exhaust emissions.

Discrimination mitigation

Another set of exemplary scenarios we consider in this article are high-risk AI systems, i.e., systems empowered by AI technology whose functioning may introduce risks to health, safety, or fundamental rights of human individuals. The European Union is currently developing the AI Act [40, 41] that sets out to mitigate many of the risks that such systems pose. Application areas of concern include credit approval [95], decisions on visa applications [84], admissions to higher education [27, 133], screening of individuals in predictive policing [58], selection in HR [92,93,94], judicial decisions (as with COMPAS [3, 30, 34, 72]), tenant screening [115], and more. In many of these areas, there are legitimate interests and valid reasons for using well-understood AI technology, although the risks associated with their use to date are manifold.

It is widely recognised that discrimination by unfair classification and regression models is one particularly important risk. As a result, a colourful zoo of different operationalisations of unfairness has emerged [96, 131], which should be seen less as a set of competing approaches and more as mutually complementary [52]. At the same time, a consensus is emerging that human oversight is an important piece of the puzzle for mitigating and minimising societal risks of AI [59, 83, 129]. Accordingly, that principle made it into recent drafts of legislation including the European AI Act [40, 41] or certain US state laws [132].

The generic approach we develop for software-doping analysis turns out to be powerful enough to provide automated assistance for human overseers of high-risk AI systems. Apart from spelling out the necessary refocusing, we illustrate the challenge that our work helps to overcome by an exemplary, albeit hypothetical, admission system for higher education (inspired by [27, 133]).

Example 2

A large university assigns scores to applicants aiming to enter their computer science PhD program. The scores are computed using an automated, model-based procedure \({\textsf{P}} \) which is based on three data points: the position of the applicant’s last graduate institution in an official, subject-specific ranking, the applicant’s most recent grade point average (GPA), and their score in a subject-specific standardised test taken as part of the application procedure. The system then automatically computes a score for the candidate based on an estimation of how successful it expects them to be as students. A dedicated university employee, Unica, is in charge of overseeing the individual outcomes of \({\textsf{P}} \) and is supposed to detect cases where the output of \({\textsf{P}} \) is or appears flawed. The university pays special attention to fairness in the scoring procedure, so Unica has to watch out for any signs of potential unfairness. Unica is supposed to desk-reject candidates whose scores are below a certain, predefined threshold—unless she finds problems with \({\textsf{P}} \)’s scoring. Without any additional support, Unica, as the human overseer in the loop, must manually check all cases for signs of unfairness as they are processed. This can be a tedious, complicated, and error-prone task and as such constitutes an impediment for the assumed scalability of the automated scoring process to high numbers of applicants. Therefore, she at least requires tool support that assists her in detecting when something is off about the scoring of individual applicants.
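For concreteness, the following is a minimal sketch of what a scoring procedure like \({\textsf{P}} \) could look like; the feature weights, the normalisations, and the desk-rejection threshold are purely hypothetical and not taken from any real admission system.

```python
# Hypothetical sketch of a scoring procedure like P from Example 2.
# All weights, normalisations, and the threshold are illustrative assumptions.
def score(ranking_position: int, gpa: float, test_score: float) -> float:
    """Map the three applicant data points to a single admission score in [0, 1]."""
    ranking_bonus = max(0.0, 1.0 - ranking_position / 100.0)  # better rank -> larger bonus
    return 0.3 * ranking_bonus + 0.4 * (gpa / 4.0) + 0.3 * (test_score / 100.0)

DESK_REJECT_THRESHOLD = 0.5  # hypothetical cut-off below which Unica desk-rejects

s = score(ranking_position=12, gpa=3.6, test_score=81.0)
print(s, s < DESK_REJECT_THRESHOLD)  # 0.867 False -> not desk-rejected
```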

This support can be made real by exploiting the technical contributions of this article, in terms of a runtime monitor that provides automated assistance to the human overseer and is itself based on the probabilistic falsification technique we develop. As we will explain, func-cleanness, a variant of cleanness, is a suitable basis for rolling out runtime monitors for such high-risk systems that are able to detect and flag discrimination or unfair treatment of humans.

The contributions made by this article are threefold.

Detecting software doping using probabilistic falsification. The paper starts off by developing the theory of robust cleanness and func-cleanness. We provide characterisations in the temporal logics HyperSTL and STL, which are then used for an adaptation of existing probabilistic falsification techniques [1, 49]. Altogether, this reduces the problem of software doping detection to the problem of falsifying the logical characterisation of the respective cleanness definition.

Falsification-based test input generation. Recent work [19] proposes a formal framework for robust cleanness testing, with the ambition of making it usable in practice, namely for emissions tests conducted with a real diesel car on a chassis dynamometer. However, that approach leaves open how to perform test input selection in a meaningful manner. The probabilistic falsification technique presented in this article addresses this shortcoming. It supports the testing procedure by guiding it towards test inputs that make the robust cleanness tests likely to fail.

Promoting effective human oversight. We discuss and demonstrate how the technical contributions of this paper contribute to effective human oversight of high-risk systems, as required by the current proposal of the AI Act. The hypothetical university admission scenario introduced above will serve as a demonstrator for shedding light on the applicability of our approach as well as the principles behind it. On a technical level, we provide a runtime monitor for individual fairness based on probabilistic falsification of func-cleanness. On a conceptual level, we consider it important to clarify which duties come with the usage of such a system: from a legal perspective, particularly considering the AI Act; from a philosophical perspective, substantiating the ethical dimension; and from a psychological perspective, deliberating on how the overseeing can become effective.

This paper is based on a conference publication [17]. Relative to that paper, the development of the theory here is more complete and now includes temporal logic characterisations for func-cleanness. On the conceptual side, this article adds a principled analysis of the applicability of func-cleanness to effective human oversight, spelled out in the setting of admission to higher education. We live up to the societal complexity of this new example and provide an interdisciplinary situation analysis and an interdisciplinary assessment of our proposed solution. Accordingly, although the technical realisation is based on the probabilistic falsification approach outlined in this article, our solution is substantially more thoughtful than a naive instantiation of the falsification framework.

This article is structured as follows. Section 2 provides the preliminaries for the contributions in this article. Section 3 develops the theoretical foundations necessary to use the concept of probabilistic falsification with robust cleanness and func-cleanness. Section 4 demonstrates how the probabilistic falsification approach can be combined with the previously proposed testing approach [19] for robust cleanness, with a focus on tampered emission cleaning systems of diesel cars. Section 5 develops the technical realisation of a fairness monitor based on func-cleanness for high-risk systems. Section 6 evaluates the fairness monitor from the perspectives of philosophy, psychology, and law. Finally, Sect. 7 summarises the contributions of this article and discusses limitations of our approaches. The appendix of this article contains additional technical details, proofs, and further philosophical and juridical explanations.

2 Background

2.1 Software doping

After early informal characterisations of software doping [10, 13], D’Argenio et al. [32] propose a collection of formal definitions that specify when a piece of software is clean. The authors call software doped (w.r.t. a cleanness definition) whenever it does not satisfy such a cleanness definition. We focus on robust cleanness and func-cleanness in this article [32].

We define by \(\mathbb {R}_{\ge 0}{:}{=}\{ x\in \mathbb {R}\mid x \ge 0 \}\) the set of non-negative real numbers, by \(\overline{\mathbb {R}}{:}{=}\mathbb {R}\cup \{ -\infty , \infty \}\) the set of extended reals [104], and by \(\overline{\mathbb {R}}_{\ge 0}{:}{=}\mathbb {R}_{\ge 0}\cup \{ \infty \}\) the set of the non-negative extended real numbers. We say that a function \(d: X \times X \rightarrow \overline{\mathbb {R}}_{\ge 0}\) is a distance function if and only if it satisfies \(d(x,x)=0\) and \(d(x,y)=d(y,x)\) for all x, \(y \in X\). We let \(\sigma [k]\) denote the kth literal of the finite or infinite word \(\sigma \).

Reactive Execution Model

We can view a nondeterministic reactive program as a function \({\textsf{S}}:{\textsf{In}}^\omega \rightarrow 2^{({\textsf{Out}}^\omega )}\) perpetually mapping inputs \({\textsf{In}}\) to sets of outputs \({\textsf{Out}}\) [32]. To formally model contracts that specify the concrete configuration of robust cleanness or func-cleanness, we denote by \({\textsf{StdIn}}\subseteq {\textsf{In}}^\omega \) the input space of the system designated to define the standard behaviour, and by \(d_{\textsf{In}}:({\textsf{In}}\times {\textsf{In}})\rightarrow \overline{\mathbb {R}}_{\ge 0}\) and \(d_{\textsf{Out}}:({\textsf{Out}}\times {\textsf{Out}})\rightarrow \overline{\mathbb {R}}_{\ge 0}\) distance functions on inputs, respectively outputs.

Fig. 1 Robust cleanness intuition

For robust cleanness, we additionally consider two constants \(\kappa _\textsf{i}, \kappa _\textsf{o}\in \overline{\mathbb {R}}_{\ge 0}\). \(\kappa _\textsf{i}\) defines the maximum distance that a non-standard input must have to a standard input to be considered in the cleanness evaluation. For this evaluation, \(\kappa _\textsf{o}\) defines the maximum distance between two outputs such that they are still considered sufficiently close. Intuitively, the contract defines tubes around standard inputs and their outputs. For example, in Fig. 1, \(\textsf{i}\) is a standard input and \(d_{\textsf{In}} \) and \(\kappa _\textsf{i}\) implicitly define a \(2\kappa _\textsf{i}\) wide tube around \(\textsf{i}\). Every input \(\textsf{i}'\) that is within this tube will be evaluated on its outputs. Similarly, \(d_{\textsf{Out}} \) and \(\kappa _\textsf{o}\) define a tube around each of the outputs of \(\textsf{i}\). An output for \(\textsf{i}'\) that is within this tube satisfies the robust cleanness condition. Together, the above objects constitute a formal contract \(\mathcal {C} = \langle {\textsf{StdIn}}, d_{\textsf{In}}, d_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \). Robust cleanness is composed of two separate definitions called l-robust cleanness and u-robust cleanness. Assuming a fixed standard behaviour of a system, l-robust cleanness imposes a lower bound on the non-standard outputs that a system must exhibit, while u-robust cleanness imposes an upper bound. Such lower and upper bound considerations are necessary because of the potential nondeterministic behaviour of the system; for deterministic systems the two notions coincide. We remark that in this article we are using past-forgetful distance functions and the trace integral variants of robust cleanness and func-cleanness (see Biewer [16, Chapter 3] for details).

Definition 1

A nondeterministic reactive program \({\textsf{S}}:{\textsf{In}}^\omega \rightarrow 2^{({\textsf{Out}}^\omega )}\) is robustly clean w.r.t. contract \(\mathcal {C} = \langle {\textsf{StdIn}}, d_{\textsf{In}}, d_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \) if for every standard input \(\textsf{i}\in {\textsf{StdIn}}\) and input sequence \(\textsf{i}'\in {\textsf{In}}^\omega \) it is the case that

  1. for every \(\textsf{o}\in {\textsf{S}} (\textsf{i})\), there exists \(\textsf{o}'\in {\textsf{S}} (\textsf{i}')\), such that for every index \(k\in \mathbb {N}\), if \(d_{\textsf{In}}(\textsf{i}[j],\textsf{i}'[j])\le \kappa _\textsf{i}\) for all \(j\le k\), then it holds that \(d_{\textsf{Out}}(\textsf{o}[k],\textsf{o}'[k]) \le \kappa _\textsf{o}\), (l-robust cleanness)

  2. for every \(\textsf{o}'\in {\textsf{S}} (\textsf{i}')\), there exists \(\textsf{o}\in {\textsf{S}} (\textsf{i})\), such that for every index \(k\in \mathbb {N}\), if \(d_{\textsf{In}}(\textsf{i}[j],\textsf{i}'[j])\le \kappa _\textsf{i}\) for all \(j\le k\), then it holds that \(d_{\textsf{Out}}(\textsf{o}[k],\textsf{o}'[k]) \le \kappa _\textsf{o}\). (u-robust cleanness)

We will in the following refer to Definition 1.1 for l-robust cleanness and Definition 1.2 for u-robust cleanness. Intuitively, l-robust cleanness enforces that whenever an input \(\textsf{i}'\) remains within \(\kappa _\textsf{i}\) vicinity around the standard input \(\textsf{i}\), then for every standard output \(\textsf{o}\in {\textsf{S}} (\textsf{i})\), there must be a non-standard output \(\textsf{o}' \in {\textsf{S}} (\textsf{i}')\) that is in \(\kappa _\textsf{o}\) proximity of \(\textsf{o}\). Referring to Fig. 1, every \(\textsf{i}'\) in the tube around \(\textsf{i}\) must produce for every standard output \(\textsf{o}\in {\textsf{S}} (\textsf{i})\) at least one output \(\textsf{o}' \in {\textsf{S}} (\textsf{i}')\) that resides in the \(\kappa _\textsf{o}\)-tube around \(\textsf{o}\). In other words, for non-standard inputs the system must not lose behaviour that it can exhibit for a standard input in \(\kappa _\textsf{i}\) proximity.

For u-robust cleanness the standard and non-standard output switch roles. It enforces that whenever an input \(\textsf{i}'\) remains within \(\kappa _\textsf{i}\) vicinity around the standard input \(\textsf{i}\), then for every output \(\textsf{o}' \in {\textsf{S}} (\textsf{i}')\) the system can exhibit for this non-standard input, there must be a standard output \(\textsf{o}\in {\textsf{S}} (\textsf{i})\) that is in \(\kappa _\textsf{o}\) proximity of \(\textsf{o}'\). Referring to Fig. 1, every \(\textsf{i}'\) in the tube around \(\textsf{i}\) must only produce outputs \(\textsf{o}' \in {\textsf{S}} (\textsf{i}')\) that are in the \(\kappa _\textsf{o}\)-tube of at least one \(\textsf{o}\in {\textsf{S}} (\textsf{i})\). In other words, for non-standard inputs within \(\kappa _\textsf{i}\) proximity of a standard input the system must not introduce new behaviour, i.e., it must not exhibit an output that is further than \(\kappa _\textsf{o}\) away from the set of standard outputs.
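To make the interplay of the two thresholds concrete, the following minimal sketch checks the pointwise condition of Definition 1 on finite traces of a deterministic system (for which the l- and u-variants coincide); the choice of absolute differences as distance functions and all concrete values are illustrative assumptions.

```python
# Sketch of the pointwise condition of Definition 1 for finite traces of a
# deterministic system; distances and parameters are illustrative assumptions.
def robustly_clean(std_in, std_out, inp, out, d_in, d_out, kappa_i, kappa_o):
    """As long as the inputs stay kappa_i-close to the standard input,
    the outputs must stay kappa_o-close to the standard output."""
    for k in range(len(std_in)):
        if any(d_in(std_in[j], inp[j]) > kappa_i for j in range(k + 1)):
            continue  # input has left the kappa_i tube: no obligation at step k
        if d_out(std_out[k], out[k]) > kappa_o:
            return False  # output leaves the kappa_o tube although inputs stayed close
    return True

abs_d = lambda x, y: abs(x - y)
print(robustly_clean([1, 2, 3], [0, 0, 0], [1.3, 2.6, 3.9], [0, 0, 5],
                     abs_d, abs_d, kappa_i=1, kappa_o=2))  # False: output deviates by 5 > 2
```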

A generalisation of robust cleanness is func-cleanness. A cleanness contract for func-cleanness replaces the constants \(\kappa _\textsf{i}\) and \(\kappa _\textsf{o}\) by a function \(f: \overline{\mathbb {R}}_{\ge 0}\rightarrow \overline{\mathbb {R}}_{\ge 0}\) inducing a dynamic threshold for output distances based on the distance between the inputs producing such outputs.

Definition 2

A nondeterministic reactive system \({\textsf{S}} \) is func-clean w.r.t. contract \(\mathcal {C} = \langle {\textsf{StdIn}}, d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \) if for every standard input \(\textsf{i}\in {\textsf{StdIn}}\) and input sequence \(\textsf{i}'\in {\textsf{In}}^\omega \) it is the case that

  1. for every \(\textsf{o}\in {\textsf{S}} (\textsf{i})\), there exists \(\textsf{o}'\in {\textsf{S}} (\textsf{i}')\), such that for every index \(k\in \mathbb {N}\), \(d_{\textsf{Out}} (\textsf{o}[k],\textsf{o}'[k]) \le f(d_{\textsf{In}} (\textsf{i}[k],\textsf{i}'[k]))\), (l-func-cleanness)

  2. for every \(\textsf{o}'\in {\textsf{S}} (\textsf{i}')\), there exists \(\textsf{o}\in {\textsf{S}} (\textsf{i})\), such that for every index \(k \in \mathbb {N}\), \(d_{\textsf{Out}} (\textsf{o}[k],\textsf{o}'[k]) \le f(d_{\textsf{In}} (\textsf{i}[k],\textsf{i}'[k]))\). (u-func-cleanness)

We will in the following refer to Definition 2.1 for l-func-cleanness and Definition 2.2 for u-func-cleanness.

For the fairness monitor in Sect. 5 we will use a simpler variant of func-cleanness for deterministic sequential programs. For a deterministic program \({\textsf{P}}\), the lower and upper bound requirements coincide, yielding the following simplified definition.

Definition 3

A deterministic sequential program \({\textsf{P}} \) is func-clean w.r.t. contract \(\mathcal {C} = \langle {\textsf{StdIn}}, d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \) if for every standard input \(\textsf{i}\in {\textsf{StdIn}}\) and input \(\textsf{i}' \in {\textsf{In}}\), it holds that \(d_{\textsf{Out}} ({\textsf{P}} (\textsf{i}),{\textsf{P}} (\textsf{i}')) \le f(d_{\textsf{In}} (\textsf{i},\textsf{i}'))\).
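Definition 3 can be operationalised directly. A minimal sketch, assuming real-valued inputs and outputs, absolute differences as distances, and a linear threshold function f, looks as follows.

```python
# Sketch of the func-cleanness check of Definition 3 for a deterministic
# sequential program P; the program, distances, and f are assumptions.
def func_clean(P, std_inputs, inputs, d_in, d_out, f) -> bool:
    """Every output must stay within f of the distance of the producing inputs."""
    return all(d_out(P(i), P(i2)) <= f(d_in(i, i2))
               for i in std_inputs for i2 in inputs)

P = lambda x: 2 * x + 1          # some deterministic program
f = lambda delta: 3 * delta      # dynamic output threshold
abs_d = lambda x, y: abs(x - y)
print(func_clean(P, std_inputs=[0.0, 1.0], inputs=[0.5, 1.5, 10.0],
                 d_in=abs_d, d_out=abs_d, f=f))  # True: |P(i)-P(i')| = 2|i-i'| <= 3|i-i'|
```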

Mixed-IO System Model

The reactive execution model above has the strict requirement that for every input, the system produces exactly one output. Recent work [18, 19] instead considers mixed-IO models, where a program \({\textsf{L}} \subseteq ({\textsf{In}}\cup {\textsf{Out}})^\omega \) is a subset of traces containing both inputs and outputs, but without any restriction on the order or frequency in which inputs and outputs appear in the trace. In particular, they are not required to strictly alternate (but they may, and in this way the reactive execution model can be considered a special case [16]). A particularity of this model is the distinct output symbol \(\delta \) for quiescence, i.e., the absence of an output. For example, finite behaviour can be expressed by adding infinitely many \(\delta \) symbols to a finite trace.

The new system model induces consequences regarding cleanness contracts. Every mixed-IO trace is projected into an input, respectively output domain. The set of input symbols contains one additional element \(\text {--}_\textsf{i}\), which indicates that in the respective step an output was produced, masking the concrete output. Similarly, the set of output symbols contains the additional element \(\text {--}_\textsf{o}\) to mask a concrete input symbol. Projection on inputs \( {\downarrow _\textsf{i}} :\) \(({\textsf{In}}\cup {\textsf{Out}})^\omega \rightarrow ({\textsf{In}}\cup \{\text {--}_\textsf{i}\})^\omega \) and projection on outputs \( {\downarrow _\textsf{o}} :({\textsf{In}}\cup {\textsf{Out}})^\omega \rightarrow ({\textsf{Out}}\cup \{\text {--}_\textsf{o}\})^\omega \) are defined for all traces \(\sigma \in ({\textsf{In}}\cup {\textsf{Out}})^\omega \) and \(k\in \mathbb {N}\) as follows: \(\sigma {\downarrow _\textsf{i}} [k] {:}{=}{\textbf { if }} \sigma [k] \in {\textsf{In}}{\textbf { then }} \sigma [k] {\textbf { else }} \text {--}_\textsf{i}\) and similarly \(\sigma {\downarrow _\textsf{o}} [k] {:}{=}{\textbf { if }} \sigma [k] \in {\textsf{Out}}{\textbf { then }} \sigma [k] {\textbf { else }} \text {--}_\textsf{o}\). The distance functions \(d_{\textsf{In}} \) and \(d_{\textsf{Out}} \) apply on input and output symbols or their respective masks, i.e., they are functions \(({\textsf{In}}\cup \{\text {--}_\textsf{i}\}) \times ({\textsf{In}}\cup \{\text {--}_\textsf{i}\}) \rightarrow \overline{\mathbb {R}}_{\ge 0}\) and, respectively, \(({\textsf{Out}}\cup \{\text {--}_\textsf{o}\}) \times ({\textsf{Out}}\cup \{\text {--}_\textsf{o}\}) \rightarrow \overline{\mathbb {R}}_{\ge 0}\). Finally, instead of a set of standard inputs \({\textsf{StdIn}}\), we evaluate mixed-IO system cleanness w.r.t. a set of standard behaviour \({\textsf{Std}} \subseteq {\textsf{L}} \). Thus, not only inputs, but also outputs can be defined as standard behaviour, and for an input, one of its outputs can be considered in \({\textsf{Std}} \) while a different output can be excluded from \({\textsf{Std}} \). As a consequence, the set \({\textsf{Std}} \) is specific for some mixed-IO system \({\textsf{L}} \), because \({\textsf{Std}} \) is useful only if \({\textsf{Std}} \subseteq {\textsf{L}} \). To emphasise this difference we will call the tuple \(\mathcal {C} = \langle {\textsf{Std}}, {d}_{\textsf{In}}, {d}_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \) a (cleanness) context (instead of a cleanness contract). Robust cleanness of mixed-IO systems w.r.t. such a context is defined below [19].
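For finite trace prefixes, the two projections are straightforward to implement; the sketch below encodes the masking symbols \(\text {--}_\textsf{i}\) and \(\text {--}_\textsf{o}\) as the string tags "-i" and "-o", which is an encoding choice made only for this illustration.

```python
# Sketch of the input and output projections for finite prefixes of mixed-IO
# traces; the encodings of In, Out, and the masking symbols are assumptions.
MASK_I, MASK_O = "-i", "-o"

def proj_in(trace, In):
    """sigma projected on inputs: keep input symbols, mask everything else with -_i."""
    return [sym if sym in In else MASK_I for sym in trace]

def proj_out(trace, Out):
    """sigma projected on outputs: keep output symbols (incl. quiescence delta),
    mask the rest with -_o."""
    return [sym if sym in Out else MASK_O for sym in trace]

In, Out = {"1i", "2i", "3i", "0i"}, {"7o", "delta"}
sigma = ["1i", "2i", "3i", "7o", "0i", "delta"]
print(proj_in(sigma, In))    # ['1i', '2i', '3i', '-i', '0i', '-i']
print(proj_out(sigma, Out))  # ['-o', '-o', '-o', '7o', '-o', 'delta']
```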

Definition 4

A mixed-IO system \({\textsf{L}} \subseteq ({\textsf{In}}\cup {\textsf{Out}})^\omega \) is robustly clean w.r.t. context \(\mathcal {C} = \langle {\textsf{Std}}, {d}_{\textsf{In}}, {d}_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \) if and only if \({\textsf{Std}} \subseteq {\textsf{L}} \) and for all \(\sigma \in {\textsf{Std}} \) and \(\sigma '\in {\textsf{L}} \),

  1. there exists \(\sigma ''\in {\textsf{L}} \) with \(\sigma ' {\downarrow _\textsf{i}} = \sigma '' {\downarrow _\textsf{i}}\), such that for every index \({k\in \mathbb {N}}\) it holds that whenever \({d}_{\textsf{In}}(\sigma [j] {\downarrow _\textsf{i}},\sigma '[j] {\downarrow _\textsf{i}})\le \kappa _\textsf{i}\) for all \(j\le k\), then \({d}_{\textsf{Out}}(\sigma [k] {\downarrow _\textsf{o}},\sigma ''[k] {\downarrow _\textsf{o}})\le \kappa _\textsf{o}\), (l-robust cleanness)

  2. there exists \(\sigma ''\in {\textsf{Std}} \) with \(\sigma {\downarrow _\textsf{i}}=\sigma '' {\downarrow _\textsf{i}}\), such that for every index \(k\in \mathbb {N}\) it holds that whenever \({d}_{\textsf{In}}(\sigma [j] {\downarrow _\textsf{i}},\sigma '[j] {\downarrow _\textsf{i}})\le \kappa _\textsf{i}\) for all \(j\le k\), then \({d}_{\textsf{Out}}(\sigma '[k] {\downarrow _\textsf{o}},\sigma ''[k] {\downarrow _\textsf{o}})\le \kappa _\textsf{o}\). (u-robust cleanness)

We will in the following refer to Definition 4.1 for l-robust cleanness and Definition 4.2 for u-robust cleanness. Definition 4 universally quantifies a standard trace \(\sigma \). For l-robust cleanness, the universal quantification of \(\sigma '\) effectively only quantifies an input sequence; the input projection for the existentially quantified \(\sigma ''\) must match the projection for \(\sigma '\). The remaining parts of the definition are conceptually identical to their reactive systems counterpart in Definition 1.1. For u-robust cleanness, the existentially quantified trace \(\sigma ''\) is obtained from set \({\textsf{Std}} \) in contrast to l-robust cleanness, where \(\sigma ''\) can be any arbitrary trace of \({\textsf{L}}\). This is necessary, because u-robust cleanness is defined w.r.t. a cleanness context; from knowing that \(\sigma \in {\textsf{Std}} \) is a standard trace and by enforcing that \(\sigma {\downarrow _\textsf{i}} = \sigma '' {\downarrow _\textsf{i}}\) we cannot conclude that also \(\sigma ''\in {\textsf{Std}} \).

Definition 5 gives the definition of func-cleanness for mixed-IO systems.

Definition 5

A mixed-IO system \({\textsf{L}} \subseteq ({\textsf{In}}\cup {\textsf{Out}})^\omega \) is func-clean w.r.t. context \(\mathcal {C} = \langle {\textsf{Std}}, {d}_{\textsf{In}}, {d}_{\textsf{Out}}, f \rangle \) if and only if \({\textsf{Std}} \subseteq {\textsf{L}} \) and for all \(\sigma \in {\textsf{Std}} \) and \(\sigma '\in {\textsf{L}} \),

  1. there exists \(\sigma ''\in {\textsf{L}} \) with \(\sigma ' {\downarrow _\textsf{i}} = \sigma '' {\downarrow _\textsf{i}}\), such that for every index \(k\in \mathbb {N}\), it holds that \({d}_{\textsf{Out}}(\sigma [k] {\downarrow _\textsf{o}},\sigma ''[k] {\downarrow _\textsf{o}})\le f({d}_{\textsf{In}}(\sigma [k] {\downarrow _\textsf{i}},\sigma '[k] {\downarrow _\textsf{i}}))\), (l-func-cleanness)

  2. there exists \(\sigma ''\in {\textsf{Std}} \) with \(\sigma {\downarrow _\textsf{i}}=\sigma '' {\downarrow _\textsf{i}}\), such that for every index \(k\in \mathbb {N}\), it holds that \({d}_{\textsf{Out}}(\sigma '[k] {\downarrow _\textsf{o}},\sigma ''[k] {\downarrow _\textsf{o}})\le f({d}_{\textsf{In}}(\sigma [k] {\downarrow _\textsf{i}},\sigma '[k] {\downarrow _\textsf{i}}))\). (u-func-cleanness)

We will in the following refer to Definition 5.1 for l-func-cleanness and Definition 5.2 for u-func-cleanness.

2.2 Temporal logics

2.2.1 HyperLTL

Linear Temporal Logic (LTL) [97] is a popular formalism to reason about properties of traces. A trace is an infinite word where each literal is a subset of \(\textsf{AP}\), the set of atomic propositions. We interpret programs as circuits encoded as sets \({\textsf{C}} \subseteq (2^{\textsf{AP}})^\omega \) of such traces. LTL provides expressive means to characterise sets of traces, often called trace properties. For some set of traces T, a trace property defines a subset of T (for which the property holds), whereas a hyperproperty defines a set of subsets of T (constituting combinations of traces for which the property holds). In this way it specifies which traces are valid in combination with one another. Many temporal logics have been extended to corresponding hyperlogics supporting the specification of hyperproperties.

HyperLTL [31] is such a temporal logic for the specification of hyperproperties of reactive systems. It extends LTL with trace quantifiers and trace variables that make it possible to refer to multiple traces within a logical formula. A HyperLTL formula is defined by the following grammar, where \(\pi \) is drawn from a set \(\mathcal {V}\) of trace variables and a from the set \(\textsf{AP}\):

The quantifiers \(\exists \) and \(\forall \) quantify existentially and universally, respectively, over the set of traces. For example, the formula \(\mathop {\forall {\pi }.}\mathop {\exists {\pi '}.}\phi \) means that for every trace \(\pi \) there exists another trace \(\pi '\) such that \(\phi \) holds over the pair of traces. To account for distinct valuations of atomic propositions across distinct traces, the atomic propositions are indexed with trace variables: for some atomic proposition \(a\in \textsf{AP}\) and some trace variable \(\pi \in \mathcal {V}\), \(a_\pi \) states that a holds in the initial position of trace \(\pi \). The temporal operators and Boolean connectives are interpreted as usual for LTL. Further operators are derivable: the eventually operator enforces \(\phi \) to eventually hold in the future, the globally operator enforces \(\phi \) to always hold, and the weak-until operator allows \(\phi \) to always hold as an alternative to the obligation for \(\phi '\) to eventually hold.

HyperLTL Characterisations of Cleanness

D’Argenio et al. [32] assume distinct sets of atomic propositions to encode inputs and outputs. That is, they assume a set \(\textsf{AP}= \textsf{AP}_\textsf{i}\cup \textsf{AP}_\textsf{o}\) of atomic propositions, where \(\textsf{AP}_\textsf{i}\) and \(\textsf{AP}_\textsf{o}\) are the atomic propositions that define the input values and, respectively, the output values. Thus, in the context of Boolean circuit encodings of programs, we take \({\textsf{In}}=2^{\textsf{AP}_\textsf{i}}\) and \({\textsf{Out}}=2^{\textsf{AP}_\textsf{o}}\). We capture the following natural correspondence between reactive programs and Boolean circuits: a circuit \({\textsf{C}} \) can be interpreted as a function , where

(1)

with \(t {\downarrow _{A}}\) defined by \((t {\downarrow _{A}})[k]=t[k]\cap A\) for all \(k\in \mathbb {N}\).

The HyperLTL formulas below contain, for convenience, non-atomic propositions. Their semantics is encoded by atomic propositions and Boolean connectives according to a Boolean encoding of inputs and outputs. We refer to the original work for the details [32, Table 1]. Further, we assume that there is a quantifier-free HyperLTL formula \({\textsf{StdIn}}_\pi \) that can check whether the trace represented by trace variable \(\pi \) is in the set of standard inputs \({\textsf{StdIn}}\subseteq {\textsf{In}}^\omega \). That is, \({\textsf{StdIn}}_\pi \) should be defined such that for every trace \(t\in {\textsf{C}} \) it holds that \(\{ \pi {:}{=}t \} \models _{\textsf{C}} {\textsf{StdIn}}_\pi \) if and only if \((t {\downarrow _{\textsf{AP}_\textsf{i}}}) \in {\textsf{StdIn}}\).

Proposition 1 shows HyperLTL formulas for l-robust cleanness and u-robust cleanness, respectively.

Proposition 1

Let \({\textsf{C}} \) be a set of infinite traces over \(2^{{\textsf{AP}}}\), let be the reactive system constructed from \({\textsf{C}} \) according to Equation 1, and let \(\mathcal {C} = \langle {\textsf{StdIn}}, d_{\textsf{In}}, d_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \) be a contract for robust cleanness. Then is l-robustly clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{C}} \) satisfies the HyperLTL formula

and is u-robustly clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{C}} \) satisfies the HyperLTL formula

The first quantifier (for \(\pi _1\)) in both formulas implicitly quantifies the standard input \(\textsf{i}\) and the second quantifier (for \(\pi _2\)) implicitly quantifies the second input \(\textsf{i}'\). Due to the potential nondeterminism in the behaviour of the system, the third, existential, quantifier for \(\pi _1'\), respectively \(\pi _2'\) is necessary. While the formula for l-robust cleanness has the universal quantification on the outputs of the program that takes the standard input \(\textsf{i}\) and the existential quantification on the output for \(\textsf{i}'\), the formula for u-robust cleanness works the other way around. Thus, the formulas capture the \(\forall \exists \) alternation in Definition 1. The weak until operator has exactly the behaviour necessary to represent the interaction between the distances of inputs and the distances of outputs.

The HyperLTL formulas for func-cleanness are given below.

Proposition 2

Let \({\textsf{C}} \) be a set of infinite traces over \(2^{{\textsf{AP}}}\), let be the reactive system constructed from \({\textsf{C}} \) according to Equation 1, and let \(\mathcal {C} = \langle {\textsf{StdIn}}, d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \) be a contract for func-cleanness. Then is l-func-clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{C}} \) satisfies the HyperLTL formula

and is u-func-clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{C}} \) satisfies the HyperLTL formula

2.2.2 Signal temporal logic

LTL enables reasoning over traces \(\sigma \in {(2^\textsf{AP})}^\omega \) for which it is necessary to encode values using the atomic propositions in \(\textsf{AP}\). Each literal in a trace represents a discrete time step of an underlying model. Thus, \(\sigma \) can equivalently be viewed as a function \(\mathbb {N}\rightarrow 2^\textsf{AP}\). One extension of LTL is Signal Temporal Logic (STL) [33, 76], which instead is used for reasoning over real-valued signals that may change in value along an underlying continuous time domain. In this article, we generalise the original work and use generalised timed traces (GTTs) [53], which, for some value domain X and time domain \(\mathcal {T}\), define traces as functions \(\mathcal {T} \rightarrow X\). The time domain \(\mathcal {T}\) can be either \(\mathbb {N}\) (discrete-time) or \(\mathbb {R}_{\ge 0}\) (continuous-time). For the value domain we will use vectors of real values \(X = \mathbb {R}^n\) for some \(n > 0\) or, to express mixed-IO traces, the set \(X = {\textsf{In}}\cup {\textsf{Out}}\).

STL formulas can express properties of systems modelled as sets \({\textsf{M}} \subseteq (\mathcal {T}\rightarrow X)\) of traces by making the atomic properties refer to booleanisations of the signal values. The syntax of the variant of STL that we use in this article is as follows, where \(f \in X \rightarrow \mathbb {R}\):

STL replaces atomic propositions by threshold predicates of the form \(f > 0\), which hold if and only if function f applied to the trace value at the current time returns a positive value. The Boolean operators and the Until operator are very similar to those of HyperLTL. The Next operator \(\mathop {\textsf{X}}\) is not part of STL, because “next” is without precise meaning in continuous time. The definitions of the derived operators (eventually, globally, and weak until) are the same as for HyperLTL. Formally, the Boolean semantics of an STL formula \(\phi \) at time \(t \in \mathcal {T}\) for a trace \({w}\in \mathcal {T} \rightarrow X\) is defined inductively:

A system \({\textsf{M}}\) satisfies a formula \(\phi \), denoted \({\textsf{M}} \models \phi \), if and only if for every \({w}\in {\textsf{M}} \) it holds that \({w},0 \models \phi \).

Quantitative Interpretation

STL has been extended by a quantitative semantics [1, 33, 49]. This semantics is designed in such a way that whenever \(\rho (\phi , w, t) \ne 0\), its sign indicates whether \(w, t \models \phi \) holds in the Boolean semantics. For any STL formula \(\phi \), trace w and time t, if \(\rho (\phi , w, t) > 0\), then \(w, t \models \phi \) holds, and if \(\rho (\phi , w, t) < 0\), then \(w, t \models \phi \) does not hold. The quantitative semantics \(\rho (\phi , w, t)\) for an STL formula \(\phi \), trace w, and time t is defined inductively:
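As an illustration of how such a quantitative semantics can be computed, the sketch below implements a min/max-based robustness computation for a small discrete-time fragment over finite traces (threshold predicates, negation, conjunction, and the globally operator); it follows the standard constructions of [1, 33, 49] but is deliberately simplified and not the full semantics used in this article.

```python
# Min/max-based robustness sketch for a small discrete-time STL fragment
# over finite traces; a simplified illustration, not the full semantics.
def rho(phi, w, t=0):
    kind = phi[0]
    if kind == "pred":        # ("pred", f) encodes the predicate f > 0
        return phi[1](w[t])
    if kind == "not":         # ("not", phi1)
        return -rho(phi[1], w, t)
    if kind == "and":         # ("and", phi1, phi2)
        return min(rho(phi[1], w, t), rho(phi[2], w, t))
    if kind == "globally":    # ("globally", phi1): phi1 must hold at every future step
        return min(rho(phi[1], w, k) for k in range(t, len(w)))
    raise ValueError(f"unknown operator {kind}")

# Robustness of "globally (5 - |x| > 0)", i.e., the signal stays within (-5, 5):
phi = ("globally", ("pred", lambda x: 5 - abs(x)))
print(rho(phi, [1.0, -2.0, 4.5]))   # 0.5  (satisfied with small margin)
print(rho(phi, [1.0, -6.0, 4.5]))   # -1.0 (violated)
```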

Robustness and Falsification

The value of the quantitative semantics can serve as a robustness estimate and as such be used to search for a violation of the property at hand, i.e., to falsify it. The robustness of STL formula \(\phi \) is its quantitative value at time 0, that is, \(\mathcal {R}_\phi (w) {:}{=}\rho (\phi , w, 0)\). So, falsifying a formula \(\phi \) for a system \({\textsf{M}} \) boils down to a search problem with the goal condition \(\mathcal {R}_\phi (w) < 0\). Successful falsification algorithms solve this problem by understanding it as the optimisation problem \(\textsf{minimise}_{w \in {\textsf{M}}} \mathcal {R}_\phi (w)\). Algorithm 1 [1, 88] (Monte-Carlo falsification) sketches an algorithm for Monte-Carlo Markov Chain falsification, which is based on acceptance-rejection sampling [29].

An input to the algorithm is an initial trace w and a computable robustness function \(\mathcal {R}\). Robustness computation for STL formulas has been addressed in the literature [33, 49]; we omit this discussion here. The third input \({\textsf{PS}}\) is a proposal scheme that proposes a new trace to the algorithm based on the previous one (line 2). The parameter \(\beta \) (used in line 3) can be adjusted during the search and is a means to avoid being trapped in local minima, preventing to find a global minimum.
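Since the pseudo-code of Algorithm 1 is not reproduced here, the following sketch only conveys the general shape of such an acceptance-rejection falsification loop; the proposal scheme, the acceptance rule, and the stopping criterion are generic assumptions rather than the exact Algorithm 1 of [1, 88].

```python
import math, random

# Generic sketch of Monte-Carlo Markov Chain falsification via
# acceptance-rejection sampling; details are assumptions of this sketch.
def falsify(w, R, propose, beta=1.0, max_iter=1000):
    """Search for a trace w with robustness R(w) < 0, i.e., a counterexample."""
    r = R(w)
    for _ in range(max_iter):
        if r < 0:
            return w                      # falsifying trace found
        w_new = propose(w)                # proposal scheme PS (cf. line 2)
        r_new = R(w_new)
        # accept improvements always, deteriorations with a beta-controlled
        # probability to escape local minima (cf. line 3)
        if r_new <= r or random.random() < math.exp(-beta * (r_new - r)):
            w, r = w_new, r_new
    return None                           # budget exhausted without counterexample

# Toy use: falsify "globally x <= 4" by randomly perturbing a constant signal.
R = lambda w: min(4 - x for x in w)
propose = lambda w: [x + random.uniform(-1, 1) for x in w]
print(falsify([0.0] * 5, R, propose))
```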

Notably, there exists prior work by Nguyen et al. [89] that discusses an extension of STL to HyperSTL though using a non-standard semantic underpinning. In this context, they present a falsification approach restricted to the fragment “t-HyperSTL” where, according to the authors, “a nesting structure of temporal logic formulas involving different traces is not allowed”. Therefore, none of our cleanness definitions belongs to this fragment.

3 Logical characterisation of Mixed-IO cleanness

In this section we provide a temporal logic characterisation of robust cleanness and func-cleanness for mixed-IO systems. For this, we propose a HyperSTL semantics (different from that of [89]) and HyperSTL formulas for robust cleanness and func-cleanness. We explain how these formulas can be applied to mixed-IO traces and prove that the characterisation is correct. Furthermore, for the special case that \({\textsf{Std}} \) is a finite set, we reformulate the HyperSTL formulas characterising u-robust cleanness and u-func-cleanness as equivalent STL formulas.

Hyperlogics over Continuous Domains

Previous work [89] extends STL to HyperSTL echoing the extension of LTL to HyperLTL. We use a similar HyperSTL syntax in this article:

The meaning of the universal and existential quantifier is as for HyperLTL. In contrast to HyperLTL (and to the existing definition of HyperSTL), we consider it insufficient to allow propositions to refer to only a single trace. In HyperLTL atomic propositions of individual traces can be compared by means of the Boolean connectives. To formulate thresholds for real values, however, we feel the need to allow real values from multiple traces to be combined in the function f, and thus to appear as arguments of f. Hence, in our semantics of HyperSTL, \(f>0\) holds if and only if the result of f, applied to all traces quantified over, is greater than 0. For this to work formally, the arity of function f is the number m of traces quantified over at the occurrence of \(f>0\) in the formula, so \(f: X^m \rightarrow \mathbb {R}\).

A trace assignment [31] \(\Pi : \mathcal {V} \rightarrow {\textsf{M}} \) is a partial function assigning traces of \({\textsf{M}} \) to variables. Let \(\Pi [\pi {:}{=}w]\) denote the same function as \(\Pi \), except that \(\pi \) is mapped to trace w. The Boolean semantics of HyperSTL is defined below.

Definition 6

Let \(\psi \) be a HyperSTL formula, \(t \in \mathcal {T}\) a time point, \({\textsf{M}} \subseteq (\mathcal {T}\rightarrow X)\) a set of GTTs, and \(\Pi \) a trace assignment. Then, the Boolean semantics for \({\textsf{M}}, \Pi , t\models \psi \) is defined inductively:

A system \({\textsf{M}} \) satisfies a formula \(\psi \) if and only if \({\textsf{M}}, \varnothing , 0 \models \psi \). The quantitative semantics for HyperSTL is defined below:

Definition 7

Let \(\psi \) be a HyperSTL formula, \(t \in \mathcal {T}\) a time point, \({\textsf{M}} \subseteq (\mathcal {T}\rightarrow X)\) a set of GTTs, and \(\Pi \) a trace assignment. Then, the quantitative semantics for \(\rho (\psi , {\textsf{M}}, \Pi , t)\) is defined inductively:


HyperSTL Characterisation

The HyperLTL characterisations in Sect. 2.2.1 assume the system to be a subset of \((2^\textsf{AP})^\omega \) and work with distances between traces by means of a Boolean encoding into atomic propositions. By using HyperSTL, we can characterise cleanness for systems that are representable as subsets of \((\mathcal {T} \rightarrow X)\).

We can take the HyperLTL formulas from Proposition 1 and 2 and transform them into HyperSTL formulas by applying simple syntactic changes. We get for l-robust cleanness the formula

(2)

u-robust cleanness is characterised by

(3)

for l-func-cleanness we get the formula

(4)

and, finally, u-func-cleanness is encoded by

(5)

The quantifiers remain unchanged relative to the formulas in Propositions 1 and 2. The formulas use generic projection functions \( {\downarrow _\textsf{i}}: X \rightarrow {\textsf{In}}\) and \( {\downarrow _\textsf{o}}: X \rightarrow {\textsf{Out}}\) to extract the input values, respectively output values from a trace. To apply the formulas, these functions must be instantiated with functions for the concrete instantiation of the value domain X of the traces to be analysed. For example, for \({\textsf{In}}= \mathbb {R}^m\), \({\textsf{Out}}=\mathbb {R}^l\), and \({\textsf{M}} \subseteq (\mathcal {T} \rightarrow \mathbb {R}^{m+l})\), the projections could be defined for every \({w}= (s_1, \dots , s_m, s_{m+1}, \dots , s_{m+l})\) as \({w} {\downarrow _\textsf{i}} = (s_1, \dots , s_m)\) and \({w} {\downarrow _\textsf{o}} = (s_{m+1}, \dots , s_{m+l})\). The input equality requirement for two traces \(\pi \) and \(\pi '\) is ensured by globally enforcing \(\textsf{eq}(\pi {\downarrow _\textsf{i}}, \pi ' {\downarrow _\textsf{i}}) \le 0\). \(\textsf{eq}\) is a generic function that returns zero if its arguments are identical and a positive value otherwise. It must be instantiated for concrete value domains. For example, \(\textsf{eq}((s_1, \dots , s_m), (s'_1, \dots , s'_m))\) could be defined as the sum of the component-wise distances \(\sum _{1 \le i \le m} |{s_i - s'_i} |\). Finally, in the above formulas we perform simple arithmetic operations to match the syntactic requirements of HyperSTL.
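For the vector-valued instantiation just described, the projections and the \(\textsf{eq}\) function could be realised as follows; the dimensions m and l and the concrete choice of \(\textsf{eq}\) are the assumptions spelled out above.

```python
# Sketch of the generic projections and the eq function for signal values in
# R^(m+l): the first m components are inputs, the remaining l are outputs.
def proj_in(value, m):
    """The input components of a signal value."""
    return value[:m]

def proj_out(value, m):
    """The output components of a signal value."""
    return value[m:]

def eq(xs, ys):
    """0 iff the two input vectors are identical, a positive value otherwise."""
    return sum(abs(x - y) for x, y in zip(xs, ys))

v = (1.0, 2.0, 7.5)                    # m = 2 inputs, l = 1 output
print(proj_in(v, 2), proj_out(v, 2))   # (1.0, 2.0) (7.5,)
print(eq((1.0, 2.0), (1.0, 2.5)))      # 0.5 > 0: the inputs differ
```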

Formulas (3) and (5) are prepared to express u-robust cleanness, respectively u-func-cleanness w.r.t. both cleanness contracts or cleanness contexts. That is, we assume the existence of a function \({\textsf{Std}} _\pi \) that returns a positive value if and only if the trace assigned to \(\pi \) encodes a standard input (when considering cleanness contracts) or encodes an input and output that constitute a standard behaviour (when considering cleanness contexts). Explicitly requiring that \(\pi '_1\) represents a standard behaviour echoes the setup in Definitions 4.2 and 5.2.

We remark that for encoding \({\textsf{Std}} _{\pi }\), due to the absence of the Next-operator in HyperSTL, it might be necessary to add a clock signal \(s(t) = t\) to traces in a preprocessing step.

Example 3

Let \({\textsf{In}}= {\textsf{Out}}= \mathbb {R}\) be the sets representing real-valued inputs and outputs, \(\mathcal {T} = \mathbb {N}\) be the discrete time domain, and \(X = {\textsf{In}}\times {\textsf{Out}}\) the value domain that considers pairs of inputs and outputs as values. We consider the robust cleanness context \(\mathcal {C} = \langle {\textsf{Std}}, d_{\textsf{In}}, d_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \), where \({\textsf{Std}} = \{ w_0, w_1 \}\) contains the two standard traces

\({w}_0 = (1;0)\,(2;0)\,(3;0)\,(4;0)\,\cdots \) and \({w}_1 = (1;1)\,(2;2)\,(3;3)\,(4;4)\,\cdots \).

For the distance functions we use the absolute differences, i.e., \(d_{\textsf{In}} (v_1, v_2) = d_{\textsf{Out}} (v_1, v_2) = |{v_1 - v_2} |\). Let the value thresholds be \(\kappa _\textsf{i}= 1\) and \(\kappa _\textsf{o}= 2\), and let \( {\downarrow _\textsf{i}}, {\downarrow _\textsf{o}}, \textsf{eq}\) and \({\textsf{Std}} _\pi \) be defined as explained above. We consider the non-standard traces \({w}_A = (1.3;0)\,(2.6;0)\,(3.9;0)\,(5.2;0)\,\cdots \), \({w}_B = (1.3;1.3)\,(2.6;2.6)\,(3.9;3.9)\,(5.2;5.2)\,\cdots \), and .

The HyperSTL formulas \(\psi _{\textsf {l-rob}}\) and \(\psi _{\textsf {u-rob}}\) reason about sets of traces. For example, the set \({\textsf{M}} = \{ {w}_0, {w}_1, {w}_A, {w}_B \}\) satisfies both formulas. If both \(\pi _1\) and \(\pi _2\) represent standard traces, then \(\pi _1 {\downarrow _\textsf{i}} = \pi _2 {\downarrow _\textsf{i}}\), because \({w}_0 {\downarrow _\textsf{i}} = {w}_1 {\downarrow _\textsf{i}}\), and the formulas hold for \(\pi '_2 = \pi _1\), respectively \(\pi '_1 = \pi _2\). Otherwise, assume that \(\pi _1\) represents \({w}_0\) and \(\pi _2\) represents \({w}_B\) (the reasoning for other combinations of traces is similar).

First considering \(\psi _{\textsf {l-rob}}\), we pick \({w}_A\) for \(\pi '_2\). We get that \(\pi _2 {\downarrow _\textsf{i}} = \pi '_2 {\downarrow _\textsf{i}}\), because \({w}_B {\downarrow _\textsf{i}} = {w}_A {\downarrow _\textsf{i}}\). Hence, we globally have \(|{\pi _2 {\downarrow _\textsf{i}} - \pi '_2 {\downarrow _\textsf{i}}} | = 0\) and, thus, \(\textsf{eq}(\pi _2 {\downarrow _\textsf{i}}, \pi '_2 {\downarrow _\textsf{i}}) = 0\). At time steps \(0 \le t \le 3\), the distance between the outputs \(|{{w}_0 {\downarrow _\textsf{o}}(t) - {w}_A {\downarrow _\textsf{o}}(t)} |\) is at most \(\kappa _\textsf{o}\). Hence, the left operand of the weak-until operator holds and the formula is satisfied for \(t \le 3\). At time \(t = 3\) we have that \(|{{w}_0 {\downarrow _\textsf{i}}(t) - {w}_A {\downarrow _\textsf{i}}(t)} | = |{4.0 - 5.2} | > \kappa _\textsf{i}\). Hence, the right operand of the weak-until operator holds and \(\psi _{\textsf {l-rob}}\) is satisfied also for \(t \ge 3\). Notice that if we removed \({w}_A\) from \({\textsf{M}} \), then \({\textsf{M}} \) would violate \(\psi _{\textsf {l-rob}}\), because there is no possible choice for \(\pi '_2\) that has the same inputs as \({w}_B\) and where the output distances to \({w}_0\) are below the \(\kappa _\textsf{o}\) threshold.

To satisfy \(\psi _{\textsf {u-rob}}\), we pick \({w}_1\) for \(\pi '_1\). The reasoning why the formula holds for this choice is analogous to \(\psi _{\textsf {l-rob}}\). Notice that if we add the trace to \({\textsf{M}} \), then \(\psi _{\textsf {u-rob}}\) is violated. Concretely, \(\pi _2\) could represent ; then, whether we pick \({w}_0\) or \({w}_1\) for \(\pi '_1\), we eventually get outputs that violate the \(\kappa _\textsf{o}\) constraint, while the \(\kappa _\textsf{i}\) constraint is always satisfied. For example, if we compare and \({w}_1\), then we have for all time steps \(t \le 3\) that , but at time \(t = 3\) we get . Hence, at \(t = 3\) the left and right operand of the weak-until operator are false, so \(\psi _{\textsf {u-rob}}\) is violated.

Correctness under Mixed-IO Interpretation

Mixed-IO signals are defined in the discrete time domain \(\mathbb {N}\) and value domain \({\textsf{In}}\cup {\textsf{Out}}\). The abstract functions \( {\downarrow _\textsf{i}}\) and \( {\downarrow _\textsf{o}}\) can be defined equally to the syntactically identical projection functions for mixed-IO models defined in Sect. 2.1. The function \(\textsf {eq}(\textsf{i}_1, \textsf{i}_2)\) can be defined using the distance function \(d_{\textsf{In}} \) and some arbitrary small \(\varepsilon > 0\):

$$\begin{aligned} \textsf {eq}(\textsf{i}_1, \textsf{i}_2)&{:}{=}{\left\{ \begin{array}{ll} 0, &{} \text {if } \textsf{i}_1 = \textsf{i}_2\\ d_{\textsf{In}} (\textsf{i}_1, \textsf{i}_2) + \varepsilon , &{} \text {if } \textsf{i}_1 \ne \textsf{i}_2 \wedge \textsf{i}_1, \textsf{i}_2 \in {\textsf{In}}\\ \infty , &{} \text {otherwise. } \end{array}\right. } \end{aligned}$$
(6)

In the second clause of the above definition we add some positive value \(\varepsilon \) to the result of \(d_{\textsf{In}} \), because \(d_{\textsf{In}} (\textsf{i}_1, \textsf{i}_2)\) could be 0 even if \(\textsf{i}_1 \ne \textsf{i}_2\). For the correctness of the above HyperSTL formulas, however, it is crucial that \(\textsf {eq}(\textsf{i}_1, \textsf{i}_2) = 0\) if and only if \(\textsf{i}_1= \textsf{i}_2\). For a good performance of the falsification algorithm, we will nevertheless want to make use of \(d_{\textsf{In}} \) if \(\textsf{i}_1 \ne \textsf{i}_2\).

Proposition 3 shows that HyperSTL formulas (2) and (3) under the mixed-IO interpretation outlined above indeed characterise l-robust cleanness and u-robust cleanness. Proposition 4 shows the same for func-cleanness.

Proposition 3

Let \({\textsf{L}} \subseteq \mathbb {N}\rightarrow ({\textsf{In}}\cup {\textsf{Out}})\) be a mixed-IO system and \(\mathcal {C} = \langle {\textsf{Std}},d_{\textsf{In}},d_{\textsf{Out}},\kappa _\textsf{i},\kappa _\textsf{o}\rangle \) a contract or context for robust cleanness with \({\textsf{Std}} \subseteq {\textsf{L}} \). Further, let \({\textsf{Std}} _\pi \) be a quantifier-free HyperSTL subformula, such that \({\textsf{L}}, \{ \pi {:}{=}w \}, 0 \models {\textsf{Std}} _\pi \) if and only if \({w}\in {\textsf{Std}} \). Then, \({\textsf{L}} \) is l-robustly clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{L}}, \varnothing , 0 \models \psi _{\text {l-rob}}\), and \({\textsf{L}} \) is u-robustly clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{L}}, \varnothing , 0 \models \psi _{\text {u-rob}}\).

Proposition 4

Let \({\textsf{L}} \subseteq \mathbb {N}\rightarrow ({\textsf{In}}\cup {\textsf{Out}})\) be a mixed-IO system and \(\mathcal {C} = \langle {\textsf{Std}},d_{\textsf{In}},d_{\textsf{Out}},f \rangle \) a contract or context for func-cleanness with \({\textsf{Std}} \subseteq {\textsf{L}} \). Further, let \({\textsf{Std}} _\pi \) be a quantifier-free HyperSTL subformula, such that \({\textsf{L}}, \{ \pi {:}{=}w \}, 0 \models {\textsf{Std}} _\pi \) if and only if \({w}\in {\textsf{Std}} \). Then, \({\textsf{L}} \) is l-func-clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{L}}, \varnothing , 0 \models \psi _{\text {l-fun}}\), and \({\textsf{L}} \) is u-func-clean w.r.t. \(\mathcal {C}\) if and only if \({\textsf{L}}, \varnothing , 0 \models \psi _{\text {u-fun}}\).

STL Characterisation for Finite Standard Behaviour

In many practical settings—when the different standard behaviours are spelled out upfront explicitly, as in NEDC and WLTC —it can be assumed that the number of distinct standard behaviours \({\textsf{Std}} \) is finite (while there are infinitely many possible behaviours in \({\textsf{M}} \)). Finiteness of \({\textsf{Std}} \) makes it possible to remove by enumeration the quantifiers from the u-robust cleanness and u-func-cleanness HyperSTL formulas. This opens the way to work with the STL fragment of HyperSTL, after proper adjustments. In the following, we assume that the set \({\textsf{Std}} = \{ w_1, \dots , w_c \}\) is an arbitrary standard set with \(c\) unique standard traces, where every \({w}_k: \mathcal {T} \rightarrow X\) uses the same time domain \(\mathcal {T}\) and value domain X.

To encode the HyperSTL formulas (3) and (5) in STL, we use the concept of self-composition, which has proven useful for the analysis of hyperproperties [9, 51]. We concatenate a trace under analysis \({w}: \mathcal {T} \rightarrow X\) and the standard traces \({w}_1\) to \({w}_c\) to the composed trace \(w_+ = ({w}, {w}_1, \dots , {w}_c) \subseteq (\mathcal {T} \rightarrow X^{c+1})\). Given a system \({\textsf{M}} \subseteq (\mathcal {T} \rightarrow X)\) and a set \({\textsf{Std}} = \{ {w}_1, \dots , {w}_c \} \subseteq {\textsf{M}} \), we denote by \({\textsf{M}} \circ {\textsf{Std}} {:}{=}\{ ({w}, {w}_1, \dots , {w}_c) \mid {w}\in {\textsf{M}} \}\) the system in which every trace in \({\textsf{M}}\) is composed with the standard traces in \({\textsf{Std}}\). For every \({w}_+ \in {\textsf{M}} \circ {\textsf{Std}} \), we will in the following STL formula write \({w}\) to mean the projection on \({w}_+\) to the trace \({w}\), and we write \({w}_k\), for \(1 \le k \le c\), to mean the projection on \({w}_+\) to the kth standard trace.
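The composition \({\textsf{M}} \circ {\textsf{Std}} \) can be realised by pairing each system trace pointwise with the finitely many standard traces, as in the following sketch for finite, equal-length discrete-time traces (a simplifying assumption of the sketch).

```python
# Sketch of the self-composition M o Std for finite, equal-length discrete-time
# traces: every system trace is zipped pointwise with the c standard traces.
def compose(trace, std_traces):
    """Build w_+ = (w, w_1, ..., w_c) as a single trace over the product domain."""
    return [(w_k,) + tuple(std[k] for std in std_traces)
            for k, w_k in enumerate(trace)]

def self_composition(M, Std):
    return [compose(w, Std) for w in M]

Std = [[0, 0, 0], [1, 2, 3]]           # c = 2 standard traces
M = Std + [[0, 1, 2]]                  # the system contains Std plus one further trace
print(self_composition(M, Std)[2])     # [(0, 0, 1), (1, 0, 2), (2, 0, 3)]
```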

Theorem 5

Let \({\textsf{L}} \subseteq \mathbb {N}\rightarrow ({\textsf{In}}\cup {\textsf{Out}})\) be a mixed-IO system and \(\mathcal {C} = \langle {\textsf{Std}},d_{\textsf{In}},d_{\textsf{Out}},\kappa _\textsf{i},\kappa _\textsf{o}\rangle \) a context for robust cleanness with finite standard behaviour \({\textsf{Std}} = \{ w_1, \dots , w_c \} \subseteq {\textsf{L}} \). Then, \({\textsf{L}} \) is u-robustly clean w.r.t. \(\mathcal {C}\) if and only if \(({\textsf{L}} \circ {\textsf{Std}}) \models \varphi _{\textsf {u-rob}}\), where

The theorem for u-func-cleanness is analogous to Theorem 5.

Theorem 6

Let \({\textsf{L}} \subseteq \mathbb {N}\rightarrow ({\textsf{In}}\cup {\textsf{Out}})\) be a mixed-IO system and \(\mathcal {C} = \langle {\textsf{Std}},d_{\textsf{In}},d_{\textsf{Out}},f \rangle \) a context for func-cleanness with finite standard behaviour \({\textsf{Std}} = \{ w_1, \dots , w_c \} \subseteq {\textsf{L}} \). Then, \({\textsf{L}} \) is u-func-clean w.r.t. \(\mathcal {C}\) if and only if \(({\textsf{L}} \circ {\textsf{Std}}) \models \varphi _{\textsf {u-fun}}\), where

Example 4

We consider the robust cleanness context \(\mathcal {C} = \langle {\textsf{Std}}, d_{\textsf{In}}, d_{\textsf{Out}}, \kappa _\textsf{i}, \kappa _\textsf{o}\rangle \) where \({\textsf{Std}} = \{ w_1, w_2 \}\) contains the two standard traces \(w_1 = 1_\textsf{i}\, 2_\textsf{i}\, 3_\textsf{i}\, 7_\textsf{o}\, 0_\textsf{i}\, \delta ^\omega \) and \(w_2 = 0_\textsf{i}\, 1_\textsf{i}\, 2_\textsf{i}\, 3_\textsf{i}\, 6_\textsf{o}\, \delta ^\omega \). We here decorate inputs with index \(\textsf{i}\) and outputs with index \(\textsf{o}\), i.e., \(w_1\) describes a system receiving the three inputs 1, 2, and 3, then producing the output 7, and finally receiving input 0 before entering quiescence. We take

$$\begin{aligned} \! d_{\textsf{In}} (\textsf{i}_1, \textsf{i}_2) = {\left\{ \begin{array}{ll} |{\textsf{i}_1 - \textsf{i}_2} |,\!\!\!\! &{} \text { if } \textsf{i}_1, \textsf{i}_2\in {\textsf{In}}\\ 0, &{} \text { if } \textsf{i}_1\! =\! \textsf{i}_2\! =\! \text {--}_\textsf{i}\\ \infty , &{} \text { otherwise,} \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \! d_{\textsf{Out}} (\textsf{o}_1, \textsf{o}_2) = {\left\{ \begin{array}{ll} |{\textsf{o}_1 - \textsf{o}_2} |,\!\!\!\! &{} \text { if } \textsf{o}_1, \textsf{o}_2\in {\textsf{Out}}\backslash \{ \delta \} \\ 0, &{} \text { if } \textsf{o}_1\! =\! \textsf{o}_2\! =\! \text {--}_\textsf{o}\text { or } \textsf{o}_1\! =\! \textsf{o}_2\! =\! \delta \\ \infty , &{} \text { otherwise.} \end{array}\right. } \end{aligned}$$

The contractual value thresholds are assumed to be \(\kappa _\textsf{i}= 1\) and \(\kappa _\textsf{o}= 6\).

Assume we are observing the trace \(w = 0_\textsf{i}\, 1_\textsf{i}\, 2_\textsf{i}\, 6_\textsf{o}\, 0_\textsf{i}\, \delta ^\omega \) to be monitored with STL formula \(\varphi _{\textsf {u-rob}}\) (from Theorem 5). First notice that for combinations of a and b in \(\varphi _{\textsf {u-rob}}\), where \(a \ne b\), the subformula is always false, because \({w}_1\) and \({w}_2\) have different (input) values at time point 0. Hence, it remains to show that

For the first conjunct, the input distance between inputs in \({w}\) and \({w}_1\) is always 1 at positions 1 to 3, it is 0 at position 4 (because \(\text {--}_\textsf{i}\) is compared to \(\text {--}_\textsf{i}\)), and it remains 0 at position 5 and beyond. Thus, \({d}_{\textsf{In}}({w}_1 {\downarrow _\textsf{i}}, {w} {\downarrow _\textsf{i}}) - \kappa _\textsf{i}\) is always at most 0, and the right operand of the temporal operator is always false. Consequently, by the definition of that operator, its left operand must always hold, i.e., \({d}_{\textsf{Out}}({w}_1 {\downarrow _\textsf{o}}, {w} {\downarrow _\textsf{o}})\) must always be less than or equal to 6. This is the case for \({w}_1\) and \({w}\): at all positions except for 4, \(\text {--}_\textsf{o}\) is compared to \(\text {--}_\textsf{o}\) (or \(\delta \) to \(\delta \)), so the distance is 0, and at position 4, the distance of 6 and 7 is 1.

For the second subformula, \({w}\) is compared to \({w}_2\). These two traces are comparable only to a limited extent: the order of input and output is swapped at the last two positions before quiescence. Hence, the right operand of the operator is true at position 4, and the formula holds for the remaining trace. For positions 1 to 3, the input distances are 0, because the input values are identical. At these positions, the left operand must hold. The values are input values, so \(\text {--}_\textsf{o}\) is compared to \(\text {--}_\textsf{o}\) at each position. This distance is defined to be 0, so it holds that \(-6 \le 0\), and the formula is satisfied. Since both subformulas hold, their conjunction holds, too, and trace \({w}\) qualifies as robustly clean. There could, however, be other system traces, not considered in this example, that violate robust cleanness of the system overall.
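The reasoning of Example 4 can also be replayed mechanically. The following Python sketch is our own simplified rendering—an "output stays close until the input drifts too far" check rather than the full STL semantics—using the distance functions and traces from the example; the value encoding and all names are ours.

```python
import math

IN, OUT, QUIES = "i", "o", "delta"   # kinds of positions in mixed-IO traces

def d_in(x, y):
    """Input distance of Example 4 (projection to inputs is implicit in the kinds)."""
    (kx, vx), (ky, vy) = x, y
    if kx == IN and ky == IN:
        return abs(vx - vy)
    if kx != IN and ky != IN:          # both project to --_i
        return 0.0
    return math.inf

def d_out(x, y):
    """Output distance of Example 4."""
    (kx, vx), (ky, vy) = x, y
    if kx == OUT and ky == OUT:
        return abs(vx - vy)
    if kx == ky:                       # both --_o, or both quiescent
        return 0.0
    return math.inf

def clean_wrt(w, std, kappa_i, kappa_o):
    """Simplified check: the output distance must stay within kappa_o at every
    position, unless the input distance has exceeded kappa_i at some earlier or
    the current position."""
    for k in range(len(w)):
        input_far = any(d_in(w[j], std[j]) > kappa_i for j in range(k + 1))
        if not input_far and d_out(w[k], std[k]) > kappa_o:
            return False
    return True

i = lambda v: (IN, v)
o = lambda v: (OUT, v)
delta = (QUIES, None)

w1 = [i(1), i(2), i(3), o(7), i(0), delta]
w2 = [i(0), i(1), i(2), i(3), o(6), delta]
w  = [i(0), i(1), i(2), o(6), i(0), delta]

# Both checks succeed, matching the conclusion of Example 4.
print(all(clean_wrt(w, s, kappa_i=1, kappa_o=6) for s in (w1, w2)))  # True
```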

Restriction of input space

Robust cleanness puts semantic requirements on fragments of a system’s input space, outside of which the system’s behaviour remains unspecified. Typically, the fragment of the input space covered is rather small. To falsify the STL formula \(\varphi _{\textsf {u-rob}}\) from Theorem 5, the falsifier faces two challenging tasks. First, it has to find a way to stay in the relevant input space, i.e., to select inputs with a distance of at most \(\kappa _\textsf{i}\) from the standard behaviour. Only if this is assured can it search for an output large enough to violate the \(\kappa _\textsf{o}\) requirement. In this setting, a large robustness estimate provided by the quantitative semantics of STL cannot serve as an indicator for deciding whether an input is too far off or whether an output stays too close to the standard behaviour. We can improve the efficiency of the falsification process significantly by narrowing the input space available to the falsifier upfront.

In practice, test execution traces will always be finite. In previous real-life do** tests, test execution lengths have been bounded by some constant \(B \in \mathbb {N}\) [19], i.e., systems are represented as sets of finite traces \({\textsf{M}} \subseteq ({\textsf{In}}\cup {\textsf{Out}})^{B}\) (each of which can, for formal reasons, be considered suffixed with \(\delta ^\omega \)). In this bounded horizon, we can provide a predicate discriminating between relevant and irrelevant input sequences. Formally, the restriction to the relevant input space fragment of a system \({\textsf{M}} \subseteq ({\textsf{In}}\cup {\textsf{Out}})^{B}\) is given by the set \({\textsf{In}}_{{\textsf{Std}}, \kappa _\textsf{i}} = \{ w \in {\textsf{M}} \mid \exists w' \in {\textsf{Std}}.\ \bigwedge _{k=0}^{B\!-\!1} (d_{\textsf{In}} (w[k] {\downarrow _\textsf{i}}, w'[k] {\downarrow _\textsf{i}}) \le \kappa _\textsf{i}) \}\). Since \({\textsf{Std}} \) and \(B\) are finite, membership is computable.
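As a concrete illustration (ours, with illustrative names), membership in \({\textsf{In}}_{{\textsf{Std}}, \kappa _\textsf{i}}\) is a direct translation of the set comprehension; a falsifier’s proposal scheme can use such a predicate to discard irrelevant inputs before ever running the system. The sketch assumes traces given as position-indexed lists and a pointwise input distance function that already handles the input projection.

```python
def in_relevant_input_space(w, std_traces, d_in, kappa_i, horizon):
    """Membership test for In_{Std, kappa_i}: the candidate trace must stay within
    input distance kappa_i of at least one standard trace at every position up to
    the bounded horizon B (illustrative sketch)."""
    return any(
        all(d_in(w[k], std[k]) <= kappa_i for k in range(horizon))
        for std in std_traces
    )
```

With the distances of Example 4, for instance, the trace \(w\) from that example lies in the relevant fragment for \(\kappa _\textsf{i}= 1\), via the standard trace \(w_1\).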

There are rare cases in which this optimisation may prevent the falsifier from finding a counterexample. This is only the case if there is an input prefix leading to a violation of the formula for which there is no suffix such that the whole trace satisfies the \(\kappa _\textsf{i}\) constraint. Below is a pathological example in which this could make a difference.

Example 5

Apart from \(\textrm{NO}_x\) emissions, NEDC (and WLTC) tests are used to measure fuel consumption. Consider a contract similar to the contracts above, but with fuel rate as the output quantity. Assuming a “normal” fuel rate behaviour during the standard test, there might be a test within a reasonable \(\kappa _\textsf{i}\) distance in which fuel is wasted excessively. Then, the fuel tank might run empty before the intended end of the test, so the test could not be finished within the \(\kappa _\textsf{i}\) distance, because the speed would be constantly 0 towards the end. The actually driven test is not in the set \({\textsf{In}}_{{\textsf{Std}}, \kappa _\textsf{i}}\), but there is a prefix within \(\kappa _\textsf{i}\) distance that violates the robust cleanness property.

Notably, there may be additional techniques to reduce the size of the input space. For example, if the next input symbol depends on the history of inputs, this constraint could be considered in the proposal scheme.

4 Supervision of diesel emission cleaning systems

The severity of the diesel emissions scandal showed that the regulations alone are insufficient to prevent car manufacturers from implementing tampered—or doped—emission cleaning systems. Recent work [19] shows that robust cleanness is a suitable means to extend the precisely defined behaviour of cars for the NEDC to test cycles within a \(\kappa _\textsf{i}\) range around the NEDC. To demonstrate the usefulness of robust cleanness, the essential details of the emission testing scenario were modelled: the set of inputs is the set of speed values, and an output value represents the amount of emissions—in particular, the nitric oxide (\(\textrm{NO}_x\)) emissions—measured at the exhaust pipe of a car. The distance functions are the absolute differences of speed, respectively \(\textrm{NO}_x\), values, and the standard behaviour is the singleton set containing a trace that consists of the inputs defining the test cycle, followed by the average amount of \(\textrm{NO}_x\) gas measured during the test. Thus, formally, we get \({\textsf{In}}= \mathbb {R}\), \({\textsf{Out}}= \mathbb {R}\), \({\textsf{Std}} = \{ \text {NEDC} \cdot \textsf{o} \}\),Footnote 3 and \(d_{\textsf{In}} \) and \(d_{\textsf{Out}} \) as defined in Example 4 [19].

The STL formulas developed in the previous section, combined with the probabilistic falsification approach, give rise to further improvements to the existing testing-based work [19] on diesel do** detection.

To use the falsification algorithm in Algorithm 1, we implement the restriction of the input space to \({\textsf{In}}_{\{ \text {NEDC} \cdot \textsf{o} \},\kappa _\textsf{i}}\) as explained in Sect. 3. With this restriction the STL formula \(\varphi _{\textsf {u-rob}}\) from Theorem 5 can be simplified to

(7)

This is because the conjunction and disjunction over standard traces become obsolete when there is only a single standard trace. For the same reason, the requirement comparing standard traces becomes obsolete, as the compared traces are always identical. In the remaining subformula, the right operand is always false because of the restricted input space. We implemented Algorithm 1 with the robustness computed according to formula (7).

In practice, running tests like the NEDC with real cars is a time-consuming and expensive endeavour. Furthermore, rental companies usually prohibit carrying out tests on chassis dynamometers with rented cars. On the other hand, car emission models for simulation are not available to the public—and models provided by the manufacturer cannot be considered trustworthy. To carry out our experiments, we instead use an approximation technique that estimates the amount of \(\textrm{NO}_x\) emissions of a car along a certain trajectory based on data recorded during previous trips with the same car, sampled at a frequency of \({1}\,\text {Hz}\) (one sample per second). Notably, these trips do not need to have much in common with the trajectory to be approximated. A trip is represented as a finite sequence \(\vartheta \in (\mathbb {R}\times \mathbb {R}\times \mathbb {R})^*\) of triples, where each such triple \((v, a, n)\) represents the speed, the acceleration, and the (absolute) amount of \(\textrm{NO}_x\) emitted at a particular time instant in the sample. Speed and acceleration can be considered the main parameters influencing the instantaneous emission of \(\textrm{NO}_x\). This is, for instance, reflected in the regulations [67, 124], where the decisive quantities for validating test routes for real-world driving emissions tests on public roads are speed and acceleration.

A recording \(\mathcal {D}\) is the union of finitely many trips \(\vartheta \). We can turn such a recording into a predictor \(\mathcal {P}\) of the \(\textrm{NO}_x\) values given pairs of speed and acceleration as follows:

$$\begin{aligned} \mathcal {P}(v,a) = \textsf{average} [n \ | \ (\exists v', a'.\ (|{v-v'} | \le 2 \ \wedge |{a-a'} | \le 2 \ \wedge \ (v', a', n) \in \mathcal {D}))]. \end{aligned}$$

The amount of \(\textrm{NO}_x\) assigned to a pair \((v, a)\) here is the average of all \(\textrm{NO}_x\) values seen in the recording \(\mathcal {D}\) for \(v \pm \ell \) and \(a \pm \ell \), with \(0 \le \ell \le 2\). To overcome measurement inaccuracies and to increase the robustness of the approximated emissions, the speed and acceleration may deviate by up to \({2}\,\hbox {km}/\hbox {h}\) and \({2}\,\hbox {m}/\hbox {s}^{2}\), respectively. This tolerance is adopted from the official NEDC regulation [128], which allows up to \({2}\,\hbox {km}/\hbox {h}\) of deviation while driving the NEDC.
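A direct implementation of this predictor is straightforward. The sketch below (ours) assumes that the recording \(\mathcal {D}\) is given as a list of (speed, acceleration, \(\textrm{NO}_x\)) triples; the behaviour for query points with no nearby samples is our own choice and not prescribed by the construction above.

```python
def make_predictor(recording, tol_v=2.0, tol_a=2.0):
    """Turn a recording D (samples (v, a, n) taken at 1 Hz) into a predictor
    P(v, a) that averages all NOx values recorded for speeds within tol_v km/h
    and accelerations within tol_a m/s^2 of the query point."""
    def predict(v, a):
        nearby = [n for (v2, a2, n) in recording
                  if abs(v - v2) <= tol_v and abs(a - a2) <= tol_a]
        return sum(nearby) / len(nearby) if nearby else None  # None: no data nearby
    return predict

# Hypothetical usage (load_trips is a placeholder, not an existing tool):
# predictor = make_predictor(load_trips("a20_trips.csv"))
# nox_estimate = predictor(32.0, 0.4)
```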

To demonstrate the practical applicability of our implementation of Algorithm 1 and our \(\textrm{NO}_x\) approximation, we report here on experiments with an Audi A6 Avant Diesel admitted in June 2020 as well as with its successor model admitted in 2021. We will refer to the former as car A20 and to the latter as car A21. We used the app \({\textsf{LolaDrives}}\) to perform in total six low-cost RDE tests—two with A20 and fourFootnote 4 with A21—and recorded the data received from the cars’ diagnosis ports. The raw data is available on Zenodo [15]. Using the emissions predictor proposed above we estimate that for an NEDC test A20 emits \({86}\,\hbox {mg}/\hbox {km}\) of \(\textrm{NO}_x\) and that A21 emits \({9}\,\hbox {mg}/\hbox {km}\). Car A20 has previously been falsified w.r.t. the RDE specification. Neither A20 nor A21 has been falsified w.r.t. robust cleanness.

Before turning to falsification, we spell out meaningful contexts for robust cleanness. We identified suitable \({\textsf{In}}\), \({\textsf{Out}}\), \({\textsf{Std}} \), \(d_{\textsf{In}} \), and \(d_{\textsf{Out}} \) at the beginning of the section. For \(\kappa _\textsf{i}\), it turned out that \(\kappa _\textsf{i}= {15}\,\hbox {km}/\hbox {h}\) is a reasonable choice, as it leaves enough flexibility for human-caused driving mistakes and intended deviations [19]. The threshold for \(\textrm{NO}_x\) emissions under lab conditions is \({80}\,\hbox {mg}/\hbox {km}\). The emission limits for RDE tests depend on the admission date of the car. Cars admitted in 2020 or earlier must emit at most \({168}\,\hbox {mg}/\hbox {km}\), and cars admitted later must adhere to the limit of \({120}\,\hbox {mg}/\hbox {km}\). For our experiments, we use \(\kappa _\textsf{o}= {88}\,\hbox {mg}/\hbox {km}\) for A20 and \(\kappa _\textsf{o}= {40}\,\hbox {mg}/\hbox {km}\) for A21 to have the same tolerances as for RDE tests. Effectively, the upper threshold for A20 is \(86 + 88 = {174}\,\hbox {mg}/\hbox {km}\), and for A21 the limit is \(9 + 40 = {49}\,\hbox {mg}/\hbox {km}\). Notice that for software do** analysis, the output observed for a certain standard behaviour and the constant \(\kappa _\textsf{o}\) define the effective threshold; this threshold is typically different from the thresholds defined by the regulation.

We modified Algorithm 1 by adding a timeout condition: if the algorithm is not able to find a falsifying counterexample within 3,000 iterations, it terminates and returns both the trace for which the smallest robustness has been observed and its corresponding robustness value. Hence, if falsification of robust cleanness for a system is not possible, the algorithm outputs an upper bound on how robustly the system satisfies robust cleanness.
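The following sketch (ours) shows the shape of this modification, assuming a robustness function for formula (7) and a proposal scheme restricted to \({\textsf{In}}_{{\textsf{Std}}, \kappa _\textsf{i}}\); the simulated-annealing details of Algorithm 1 are reduced here to a Metropolis-style acceptance step with a fixed temperature.

```python
import math
import random

def falsify_with_timeout(propose, robustness, initial, max_iters=3000, temperature=1.0):
    """Falsification loop with an iteration budget: stop early if a negative
    robustness (counterexample) is found; otherwise return the trace with the
    smallest robustness observed within max_iters iterations."""
    current = initial
    current_rob = robustness(current)
    best, best_rob = current, current_rob
    for _ in range(max_iters):
        if best_rob < 0:                           # robust cleanness falsified
            break
        candidate = propose(current)               # stays in the relevant input space
        cand_rob = robustness(candidate)
        if cand_rob < best_rob:
            best, best_rob = candidate, cand_rob
        # Accept better candidates always, worse ones with a probability that
        # decreases with how much worse they are (simulated-annealing style).
        if cand_rob < current_rob or \
           random.random() < math.exp((current_rob - cand_rob) / temperature):
            current, current_rob = candidate, cand_rob
    return best, best_rob
```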

Fig. 2
figure 2

NEDC speed profile (blue, dashed) and input falsifying \(\mathcal {C} \) for \(\kappa _\textsf{o}= {88}\,\hbox {mg}/\hbox {km}\) (red) with \({182}\,\hbox {mg}/\hbox {km}\) of emitted \(\textrm{NO}_x\)

For the concrete case of diesel emissions, the robustness value during the first 1180 inputs (sampled from the restricted input space \({\textsf{In}}_{{\textsf{Std}}, \kappa _\textsf{i}}\)) is always \(\kappa _\textsf{o}\). When the NEDC output \(o_\text {NEDC} \) and the non-standard output \(o\) are compared, the robustness value is \(\kappa _\textsf{o}- |{o_\text {NEDC}- o} |\) (cf. eq. (7), the quantitative semantics of STL, and the definition of \(d_{\textsf{Out}} \)). Hence, for test cycles with small robustness values, we get \(\textrm{NO}_x\) emissions \(o\) that are either very small or very large compared to \(o_\text {NEDC} \). We ran the modified Algorithm 1 on A20 and A21 for the contexts defined above. For A20, it found a robustness value of \(-8\), i.e., it was able to falsify robust cleanness relative to the assumed contract and found a test cycle for which \(\textrm{NO}_x\) emissions of \({182}\,\hbox {mg}/\hbox {km}\) are predicted. The test cycle is shown in Fig. 2. For A21, the smallest robustness estimate found—even after 100 independent executions of the algorithm—was 38, i.e., A21 is predicted to satisfy robust cleanness with a very high robustness estimate. The corresponding test cycle is shown in Fig. 3.

Fig. 3
figure 3

NEDC speed profile (blue, dashed) and input maximising \(\textrm{NO}_x\) emissions to \({11}\,\hbox {mg}/\hbox {km}\) (red)

On Do** Tests for Cyber-physical Systems

The proposed probabilistic falsification approach for finding instances of software do** needs several hundred iterations. This is problematic for testing real-world cyber-physical systems (CPS) to which inputs cannot be passed in an automated way. To conduct a test with a car, for example, the input to the system is a test cycle that is passed to the vehicle by driving it. Notably, we consider here the scenario that the CPS is tested by an entity different from the manufacturer. While the latter might have tools to overcome these technical challenges, the former typically does not have access to them.

We propose the following integrated testing approach for effective do** tests of cyber-physical systems. The big picture is provided in Fig. 4. In a first step, the CPS is used under real-world conditions without enforcing any specific constraints on the inputs to the system. For all executions, the inputs and outputs are recorded. So, essentially, the system can be used as needed by the user, but all interactions with it are recorded. From these recordings, a model can be learned that predicts the output of the system for arbitrary inputs (whether they were covered in the recorded data or not). Such learning can be as simple as using statistics, as we did for the emissions example above, or as complex as using deep neural nets. For the learned model, the probabilistic falsification algorithm computes a test input that falsifies it—inputs to this model can be passed automatically, and an output is produced almost instantly. The resulting input serves as an input for the real CPS. If the prediction was correct, the real system is falsified as well. If it was incorrect, the learned model can be refined and the process starts again.
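A schematic rendering of this loop (ours; every callable is a placeholder to be instantiated per application) could look as follows.

```python
def integrated_cps_test(recordings, learn_model, falsify, run_on_real_cps,
                        violates, refine, max_rounds=5):
    """Sketch of the integrated testing loop of Fig. 4: learn a model from
    recorded real-world usage, falsify the model, replay the falsifying input on
    the real CPS, and refine the model if the real system does not confirm the
    predicted violation."""
    model = learn_model(recordings)
    for _ in range(max_rounds):
        test_input = falsify(model)                # probabilistic falsification on the model
        if test_input is None:
            return None                            # the model could not be falsified
        real_output = run_on_real_cps(test_input)  # e.g., driving the proposed test cycle
        if violates(test_input, real_output):
            return test_input, real_output         # the real CPS is falsified as predicted
        model = refine(model, test_input, real_output)
    return None
```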

Fig. 4
figure 4

Integrated testing approach

For diesel emissions, the first part of this integrated testing approach has been carried out as part of the work reported in this article. We leave the second part—evaluating the generated test traces from Figs. 2 and 3 with a real car—for future work.

Technical Context

Software do** theory provides a formal basis for enlarging the requirements on vehicle exhaust emissions beyond too narrow lab test conditions. That conceptual limitation has by now been addressed by the official authorities responsible for car type approval [124, 127]: The old NEDC-based test procedure is replaced by the newer Worldwide Harmonised Light Vehicles Test Procedure (WLTP), which is deemed to be more realistic. WLTP replaces the NEDC test by a new WLTC test, but the WLTC is still just a single test scenario. In addition, WLTP embraces so-called Real Driving Emissions (RDE) tests to be conducted on public roads. A recently launched mobile phone app [20, 22], \({\textsf{LolaDrives}}\), harnesses runtime monitoring technology to make low-cost RDE tests accessible to everyone.

Learning or approximating the behaviour of a system under test has been studied intensively. Meinke and Sindhu [82] were among the first to present a testing approach incrementally learning a Kripke structure representing a reactive system. Volpato and Tretmans [130] propose a learning approach which gradually refines an under- and over-approximation of an input-output transition system representing the system under test. The correctness of this approach relies on several assumptions, e.g., an oracle indicating when, for some trace, all outputs that extend the trace to a valid system trace have been observed.

5 Individual fairness of systems evaluating humans

Example 2 introduces a new application domain for cleanness definitions. Unica uses an AI system that is supposed to assist her with the selection of applicants for a hypothetical university. Cleanness of such a system can be related to the fair treatment of the humans that are evaluated by it. A usable fairness analysis can happen no later than at runtime, since Unica needs to make a timely decision on whether to include an applicant in further considerations. We describe technical measures that help in mitigating this challenge by providing her with information from an individual fairness analysis in a suitable, purposeful, and expedient way. To this end, we propose a formal definition of individual fairness extending the one by [35] and based on func-cleanness. We develop a runtime monitor that analyses every output of \({\textsf{P}} \) immediately after \({\textsf{P}} \)’s decision; it strategically searches for unfair treatment of a particular individual by comparing them to relevant hypothetical alternative individuals, so as to provide a fairness assessment in a timely manner.

Much like \({\textsf{P}} \) is to support Unica, AI systems—in the broadest sense of the word—more and more often support human decision makers. Undoubtedly, such systems should be compliant with applicable law (such as the future European AI Act [40, 41] or the Washington State facial recognition law [132]) and ought to minimise any risks to health, safety or fundamental rights. Sometimes, we cannot mitigate all of these risks in advance by technical measures; moreover, some risk mitigation requires trade-off decisions involving features that are difficult or impossible to operationalise and formalise. This is why it is essential that a human effectively oversees the system (which is also emphasised by several institutions such as UNESCO [129] and the European High Level Expert Group [59]). Effective human oversight, however, is only possible with appropriate technical measures that allow human overseers to better understand the system at runtime [70, 71]. From a technical point of view, this raises the pressing question of what such technical measures can and ought to look like to actually enable humans to live up to these responsibilities. Our contribution is intended to bridge the gap between the normative expectations of law and society and the current reality of technological design.

5.1 Positioning within related research topics

Our contribution draws on and adds to three vibrant topics of current research, namely Explainable AI (XAI), AI fairness, and discrimination.

XAI

Many of the most successful AI systems today are some kind of black boxes [11]. Accordingly, the field of ‘Explainable AI’ [54] focuses on the question of how to provide users (and possibly other stakeholders) with more information via several key perspicuity properties [117] of these systems and their outputs to make them understand these systems and their outputs in ways necessary to meet various desiderata [5, 28, 69, 74, 85, 91]. The concrete expectations and promises associated with various XAI methods are manifold. Among them are enabling warranted trust in systems [12, 62, 65, 102, 111], increasing human-system decision-making performance [68] for instance through increasing human situation awareness when operating systems [109], enabling responsible decision-making and effective human oversight [14, 80, 114], as well as identifying and reducing discrimination [74]. It often remains unclear what kind of explanations are generated by the various explainability methods and how they are meant to contribute to the fulfilment of the desiderata, even though these questions have become the subject of systematic and interdisciplinary research [69, 103].

Our approach can be taxonomised along at least two different distinctions [70, 86, 101, 102, 116]: First, it is model-agnostic (not model-specific), i.e., it is not tailored to a particular class of models but operates on observable behaviour—the inputs and outputs of the model. Second, our method is a local method (not global), i.e., it is meant to shed light on certain outputs rather than the system as a whole.

(Un-)Fair Models

Fairness, discrimination, justice, equal opportunity, bias, prejudice, and many more such concepts are part of a meaningfully interrelated cluster that has been analysed and dissected for millennia [6, 7]. Many fields are traditionally concerned with the concepts of fairness and discrimination, ranging from philosophy [6, 7, 36, 52, 98,99,100] to legal sciences [25, 57, 125, 131], to psychology [60, 136], to sociology [2, 63], to political theory [99], to economics [55]. Nowadays, it has also become a technological topic that calls for cross-disciplinary perspectives [50].

Fig. 5
figure 5

Sketch of different origins of unfairness in a decision process supported by a system; the dotted box indicates which unfairness our monitoring targets

With regard to fairness, there are two distinctions that are especially relevant to our work. First, a distinction is made between individual fairness, i.e., that similar individuals are treated similarly [35], and group fairness, i.e., that there is adequate group parity [23]. Measures of individual fairness are often close to the Aristotelian dictum to treat like cases alike [6, 7]. In a sense, operationalisations of individual fairness are robustness measures [24, 118], but instead of requiring robustness with respect to noise or adversarial attacks, measures of individual fairness, such as the one by Dwork et al. [35], call for robustness with respect to highly context-dependent differences between representations of human individuals. Second, recent work from the field of law [131] suggests differentiating between bias-preserving and bias-transforming fairness metrics. Bias-preserving fairness metrics seek to avoid adding new bias. For such metrics, historic performances are the benchmarks for models, with equivalent error rates for each group being a constraint. In contrast, bias-transforming metrics do not accept existing bias as a given or neutral starting point, but aim at adjustment. Therefore, they require making a ‘positive normative choice’ [131], i.e. actively deciding which biases the system is allowed to exhibit, and which it must not exhibit.

Over the years, many concrete approaches have been suggested to foster different kinds of fairness in artificial systems, especially in AI-based ones [74, 81, 96, 131, 134]. Yet, to the best of our knowledge, an approach like ours is still missing. One of the approaches that is closest to ours, namely that by John et al. [64], is not local and therefore not suitable for runtime monitoring. Also, it is not model-agnostic. So, to the best of our knowledge, our approach provides a new contribution to the debate on unfairness detection.

It is important to recognise that our approach can only be understood as part of a more holistic approach to preventing or reducing unfairness. After all, there are many sources of unfairness [8] (also see Fig. 5). Therefore, not every technical measure is able to detect every kind of unfairness, and eliminating one source of unfairness might not be sufficient to eliminate all unfairness. Our approach tackles only unfairness introduced by the system, but not other kinds of unfairness.

Discrimination

We understand discrimination as dissimilar treatment of similar cases or similar treatment of dissimilar cases without justifying reason. This is a definition that can also be found in the law [44, §43]. Our work is exclusively focused on discrimination qua dissimilar treatment of similar cases. Discrimination requires a thoughtful and largely non-formalisable consideration of ‘justifying reason’. However, we will exploit the relation between discrimination and fairness: unfairness in a system can arguably be a good proxy for discrimination—even though not every unfair treatment by a system necessarily constitutes discrimination (especially not in the legal sense). Thus, a tool that highlights cases of unfairness in a system can be highly instrumental in detecting discriminatory features of a system. It is not viable, though, to let such a tool rule out unfair treatment fully automatically without human oversight, since there could be justifying reason to treat two similar inputs in a dissimilar way.

5.2 Individual fairness

Unica from Example 2 should be able to detect individual unfairness. An operationalisation thereof by Dwork et al. [35] is based on the Lipschitz condition to enforce that similar individuals are treated similarly. To measure similarity, they assume the existence of an input distance function \(d_{\textsf{In}} \) and an output distance function \(d_{\textsf{Out}} \). This assumption is very similar to the one that we implicitly made in the previous sections for robust cleanness and func-cleanness. However, in the case of the fair treatment of humans, finding reasonable distance functions is more challenging than for the examples in the previous sections. Dwork et al. assume that both distance functions perfectly measure distances between individualsFootnote 5 and between outputs of the system, respectively, but admit that in practice these distance functions are at best approximations of a ground truth. They suggest that distance measures might be learned, but there is no one-size-fits-all approach to selecting distance measures. Indeed, obtaining such distance metrics is a topic of active research [61, 87, 135]. Additionally, the Lipschitz condition assumes a Lipschitz constant L to establish a linear constraint between input and output distances.

Definition 8

A deterministic sequential program \({\textsf{P}}: {\textsf{In}}\rightarrow {\textsf{Out}}\) is Lipschitz-fair w.r.t. \(d_{\textsf{In}}: {\textsf{In}}\times {\textsf{In}}\rightarrow \mathbb {R}\), \(d_{\textsf{Out}}: {\textsf{Out}}\times {\textsf{Out}}\rightarrow \mathbb {R}\), and a Lipschitz constant L, if and only if for all \(\textsf{i}_1, \textsf{i}_2 \in {\textsf{In}}\), \(d_{\textsf{Out}} ({\textsf{P}} (\textsf{i}_1), {\textsf{P}} (\textsf{i}_2)) \le L \cdot d_{\textsf{In}} (\textsf{i}_1, \textsf{i}_2)\).
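Operationally, Definition 8 amounts to a check over pairs of inputs. The sketch below (ours) verifies the condition on a finite sample of the input domain only, which is a pragmatic simplification—the definition itself quantifies over all of \({\textsf{In}}\).

```python
import itertools

def is_lipschitz_fair(program, sampled_inputs, d_in, d_out, L):
    """Check the Lipschitz condition d_Out(P(i1), P(i2)) <= L * d_In(i1, i2)
    for all pairs drawn from a finite sample of inputs (illustrative sketch)."""
    return all(
        d_out(program(i1), program(i2)) <= L * d_in(i1, i2)
        for i1, i2 in itertools.combinations(sampled_inputs, 2)
    )
```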

Lipschitz-fairness comes with some restrictions that limit its suitability for practical application:

\(d_{\textsf{In}} \)-\(d_{\textsf{Out}} \)-relation::

High-risk systems are typically complex systems and call for more complex fairness constraints than the linearly bounded output distances provided by the Lipschitz condition. For example, using the Lipschitz condition prevents us from allowing small local jumps in the output while at the same time forbidding jumps of the same rate of increase over larger ranges of the input space (also see the supplementary material).

Input relevance::

The condition quantifies over the entire input domain of a program. This overlooks two things. First, it is questionable whether each input in such a domain is plausible as a representation of a real-world individual, and whether a system is unfair for two implausible and purely hypothetical inputs is largely irrelevant in practice. Second, it ignores that merely potential unfair treatment is at most a threat, not necessarily already a harm [106]. Therefore, even with a restriction to only plausible applicants, the analysis might take into account more inputs than needed for many real-world applications. What is important in practice is the ability to determine whether actual applicants are treated unfairly—and for this it is often not necessary to look at the entire input domain.

Monitorability::

In a monitoring scenario with the Lipschitz condition in place, a fixed input \(\textsf{i}_1\) must be compared to potentially all other inputs \(\textsf{i}_2\). Since the input domain of the system can be arbitrarily large, the Lipschitz condition is not yet suitable for monitoring in practice (for a related point see John et al. [64]).

We propose a notion of individual fairness that is based on Definition 3. Instead of cleanness contracts we consider here fairness contracts, which are tuples \(\mathcal {F} = \langle d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \) containing input and output distance functions and the function f relating input distances and output distances. Notably, the set of standard inputs \({\textsf{StdIn}}\) known from cleanness contracts is not part of a fairness contract; it is unknown what qualifies an input to be ‘standard’ in the context of fairness analyses. Still, our fairness definition evaluates fairness for a set of individuals \({\mathcal {I}}\subseteq {\textsf{In}}\) (e.g., a set of applicants), which has conceptual similarities to the set \({\textsf{StdIn}}\). A fairness contract specifies certain fairness parameters for a concrete context or situation. Such parameters should generally not already include \({\mathcal {I}}\) to avoid introducing new unfairness through the monitor by tailoring it to specific inputs individually or by treating certain inputs differently from others. Func-fairness can thus be defined as follows:

Definition 9

A deterministic sequential program \({\textsf{P}}: {\textsf{In}}\rightarrow {\textsf{Out}}\) is func-fair for a set \({\mathcal {I}}\subseteq {\textsf{In}}\) of actual inputs w.r.t. a fairness contract \(\mathcal {F} = \langle d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \), if and only if for every \(\textsf{i}\in {\mathcal {I}}\) and \(\textsf{i}' \in {\textsf{In}}\), \(d_{\textsf{Out}} ({\textsf{P}} (\textsf{i}), {\textsf{P}} (\textsf{i}')) \le f(d_{\textsf{In}} (\textsf{i},\textsf{i}'))\).
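For a finite pool of synthetic inputs, Definition 9 translates directly into code. The sketch below (ours) is exhaustive over that pool, whereas Sect. 5.3 replaces the exhaustive search by probabilistic falsification.

```python
def is_func_fair(program, actual_inputs, synthetic_inputs, d_in, d_out, f):
    """Check func-fairness of program P for the actual inputs in I against a
    finite pool of synthetic inputs (illustrative sketch)."""
    return all(
        d_out(program(i_a), program(i_s)) <= f(d_in(i_a, i_s))
        for i_a in actual_inputs
        for i_s in synthetic_inputs
    )
```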

The idea behind func-fairness is that every individual in set \({\mathcal {I}}\) is compared to potential other inputs in the domain of \({\textsf{P}} \). These other inputs do not necessarily need to be in \({\mathcal {I}}\), nor do they need to have “physical counterparts” in the real world. Driven by the insights of the Input relevance restriction of Lipschitz-fairness, we explicitly distinguish inputs in the following and will call inputs that are given to \({\textsf{P}} \) by a user actual inputs, denoted \({\textsf{i}_\textsf{a}}\), and call inputs to which such \({\textsf{i}_\textsf{a}}\) are compared synthetic inputs, denoted \(\textsf{i}_\textsf{s}\). Actual inputs are typicallyFootnote 6 inputs that have a real-world counterpart, while this might or might not be true for synthetic inputs. At first glance, an alternative to using synthetic inputs is to use only actual inputs, e.g., to compare every actual input with every other actual input in \({\mathcal {I}}\). For example, for a university admission, all applicants could be compared to every other applicant. However, this would heavily rely on contingencies: the detection of unfair treatment of an applicant would depend on whether they were lucky enough that, coincidentally, another candidate has applied who aids in unveiling the system’s unfairness towards them. Instead, func-fairness prefers to over-approximate the set of plausible inputs that actual inputs are compared to, rather than under-approximating it by comparing only to other inputs in \({\mathcal {I}}\). This way, the attention of the human exercising oversight of the system might be drawn to cases that are actually not unfair, but as a competent human in the loop, they will most likely be able to judge that the input was compared to an implausible counterpart. This usually enables more effective human oversight than an under-approximation that fails to alert the human to unfair cases.

Notice that func-fairness is a conservative extension of Lipschitz-fairness. With \({\mathcal {I}}= {\textsf{In}}\) and \(f(x) = L \cdot x\), func-fairness mimics Lipschitz-fairness. Wachter et al. [131] classify the Lipschitz-fairness of Dwork et al. [35] as bias-transforming. As we generalise this and introduce no element that has to be regarded as bias-preserving, our approach arguably is bias-transforming, too.

Func-fairness, with its function f, provides a powerful tool to model complex fairness constraints. How such an f is defined has a profound impact on the quality of the fairness analysis. A full discussion about which types of functions make a good f goes beyond the scope of this article. A suitable choice for f and the distance functions \(d_{\textsf{In}} \) and \(d_{\textsf{Out}} \) heavily depends on the context in which fairness is analysed—there is no one-size-fits-all solution. Func-fairness makes this explicit with the formal fairness contract \(\mathcal {F} = \langle d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \).

5.3 Fairness monitoring

Algorithm 2
figure b

\(\textsf{FairnessMonitor}\), with \(\xi \text {-}\textsf{min}\ S = (\xi , \textsf{i}_1, \textsf{i}_2)\) only if \((\xi , \textsf{i}_1, \textsf{i}_2) \in S\) and for all \((\xi ', \textsf{i}'_1, \textsf{i}'_2) \in S\), \(\xi ' \ge \xi \)

We develop a probabilistic-falsification-based fairness monitor that, given a set of actual inputs, searches for a synthetic counterexample to falsify a system \({\textsf{P}} \) w.r.t. a fairness contract \(\mathcal {F}\). To this end, it is necessary to provide a quantitative description of func-fairness that satisfies the characteristics of a robustness estimate. We call this description fairness score. For an actual input \({\textsf{i}_\textsf{a}}\) and a synthetic input \(\textsf{i}_\textsf{s}\) we define the fairness score as \(F({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s}) {:}{=}f(d_{\textsf{In}}({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s})) - d_{\textsf{Out}}({\textsf{P}} ({\textsf{i}_\textsf{a}}),{\textsf{P}} (\textsf{i}_\textsf{s}))\). F is indeed a robustness estimate function: if \(F({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s})\) is non-negative, then \(d_{\textsf{Out}}({\textsf{P}} ({\textsf{i}_\textsf{a}}),{\textsf{P}} (\textsf{i}_\textsf{s})) \le f(d_{\textsf{In}}({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s}))\), and if it is negative, then \(d_{\textsf{Out}}({\textsf{P}} ({\textsf{i}_\textsf{a}}),{\textsf{P}} (\textsf{i}_\textsf{s})) \not \le f(d_{\textsf{In}}({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s}))\). For a set of actual inputs \({\mathcal {I}}\), the definition generalises to \(F({\mathcal {I}}, \textsf{i}_\textsf{s}) {:}{=}\min \{ F({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s}) \mid {\textsf{i}_\textsf{a}}\in {\mathcal {I}} \}\), i.e., the overall fairness score is the minimum of the concrete fairness scores of the inputs in \({\mathcal {I}}\). Notice that \(\mathcal {R}_{\mathcal {I}}(\textsf{i}_\textsf{s}) {:}{=}F({\mathcal {I}}, \textsf{i}_\textsf{s})\) is essentially the quantitative interpretation of \(\varphi _{\textsf {u-func}}\) (from Theorem 6) after simplifications attributed to the fact that \({\textsf{P}}\) is a sequential and deterministic program (cf. Definition 2.2 vs. Definition 3).
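In code, the fairness score is essentially a one-liner; the sketch below (ours) renders both the pairwise score and its generalisation to a set of actual inputs.

```python
def fairness_score(program, d_in, d_out, f, i_a, i_s):
    """F(i_a, i_s) = f(d_In(i_a, i_s)) - d_Out(P(i_a), P(i_s));
    a negative value witnesses a violation of func-fairness."""
    return f(d_in(i_a, i_s)) - d_out(program(i_a), program(i_s))

def fairness_score_for_set(program, d_in, d_out, f, actual_inputs, i_s):
    """F(I, i_s): the minimum pairwise score over all actual inputs in I."""
    return min(fairness_score(program, d_in, d_out, f, i_a, i_s)
               for i_a in actual_inputs)
```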

Algorithm 2 shows \(\textsf{FairnessMonitor}\), which builds on Algorithm 1 to search for the minimal fairness score in a system \({\textsf{P}} \) for fairness contract \(\mathcal {F} \). The algorithm stores fairness scores in triples that also contain the two inputs for which the fairness score was computed. The minimum in a set of such triples is defined by the function \(\xi \text {-}\textsf{min} \) that returns the triple with the smallest fairness score of all triples in the set. The first line of \(\textsf{FairnessMonitor}\) initialises the variable \(\textsf{i}_\textsf{s}\) with an arbitrary actual input from \({\mathcal {I}}\). For this value of \(\textsf{i}_\textsf{s}\), the algorithm checks the corresponding fairness scores for all actual inputs \({\textsf{i}_\textsf{a}}\in {\mathcal {I}}\) and stores the smallest one. In line 3, the globally smallest fairness score triple is initialised. In line 5 it uses the proposal scheme to get the next synthetic input \(\textsf{i}_\textsf{s}'\). Line 6 is similar to line 2: for the newly proposed \(\textsf{i}_\textsf{s}'\) it finds the smallest fairness score, stores it, and updates the global minimum if it found a smaller fairness score (line 7). Lines 8-13 come from Algorithm 1. The only difference is that in addition to \(\textsf{i}_\textsf{s}\) we also store the fairness score \(\xi \). Line 4 of Algorithm 2 differs from Algorithm 1 by terminating the falsification process after a timeout occurs (similar to the adaptation of Algorithm 1 in Sect. 4). Hence, the algorithm does not (exclusively) aim to falsify the fairness property, but aims at minimising the fairness score; even if the fair treatment of the inputs in \({\mathcal {I}}\) cannot be falsified in a reasonable amount of time, we still learn how robustly they are treated fairly, i.e., how far the least fairly treated individual in \({\mathcal {I}}\) is away from being treated unfairly. After the timeout occurs, the algorithm returns the triple with the overall smallest seen fairness score \(\xi _\textsf{min}\), together with the actual input \(\textsf{i}_1\) and the synthetic input \(\textsf{i}_2\) for which \(\xi _\textsf{min}\) was found. In case \(\xi _\textsf{min}\) is negative, \(\textsf{i}_2\) is a counterexample for \({\textsf{P}} \) being func-fair.
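The following Python sketch (ours) mirrors the structure of Algorithm 2 under simplifying assumptions: the proposal scheme is a parameter, the acceptance step is Metropolis-style with a fixed temperature, the fairness score is recomputed inline for self-containment, and the comments are approximate pointers to the line-by-line description above.

```python
import math
import random
import time

def fairness_monitor(program, d_in, d_out, f, actual_inputs, propose,
                     timeout_s=60.0, temperature=1.0):
    """Search for the synthetic input with the smallest fairness score and
    return the triple (score, actual input, synthetic input)."""
    def score(i_s):
        # F(I, i_s), together with the actual input attaining the minimum.
        scored = [(f(d_in(i_a, i_s)) - d_out(program(i_a), program(i_s)), i_a)
                  for i_a in actual_inputs]
        return min(scored, key=lambda pair: pair[0])

    i_s = next(iter(actual_inputs))                     # initialise with an actual input
    xi, i_a = score(i_s)
    best = (xi, i_a, i_s)                               # globally smallest score so far
    deadline = time.monotonic() + timeout_s
    while best[0] >= 0 and time.monotonic() < deadline: # stop on falsification or timeout
        i_s_new = propose(i_s)                          # proposal scheme for synthetic inputs
        xi_new, i_a_new = score(i_s_new)
        if xi_new < best[0]:
            best = (xi_new, i_a_new, i_s_new)
        # Metropolis-style acceptance of the proposed synthetic input.
        if xi_new < xi or random.random() < math.exp((xi - xi_new) / temperature):
            i_s, xi = i_s_new, xi_new
    return best
```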

\(\textsf{FairnessMonitor}\) implements a sound \(\mathcal {F}\)-unfairness detection, as stated in Proposition 7. However, it is not complete, i.e., it is not generally the case that \({\textsf{P}} \) is func-fair for \({\mathcal {I}}\) if \(\xi \) is positive. It may happen that there is a counterexample, but \(\textsf{FairnessMonitor}\) did not succeed in finding it before the timeout. This is analogous to results obtained for model-agnostic robust cleanness analysis [19].

Proposition 7

Let \({\textsf{P}}: {\textsf{In}}\rightarrow {\textsf{Out}}\) be a deterministic sequential program, \(\mathcal {F} = \langle d_{\textsf{In}}, d_{\textsf{Out}}, f \rangle \) a fairness contract, and \({\mathcal {I}}\) a set of actual inputs. Further, let \((\xi _\textsf{min}, \textsf{i}_1, \textsf{i}_2)\) be the result of \(\textsf{FairnessMonitor} ({\textsf{P}}, \mathcal {F}, {\mathcal {I}})\). If \(\xi _\textsf{min}\) is negative, then \({\textsf{P}} \) is not func-fair for \({\mathcal {I}}\) w.r.t. \(\mathcal {F}\).

Moreover, \(\textsf{FairnessMonitor}\) circumvents major restrictions of Lipschitz-fairness:

\(d_{\textsf{In}} \)-\(d_{\textsf{Out}} \)-relation::

Func-fairness defines constraints between input and output distances by means of a function f, which also allows complex fairness constraints to be expressed. For a more elaborate discussion, see the supplementary material.

Input relevance::

Func-fairness explicitly distinguishes between actual and synthetic inputs. This way, func-fairness acknowledges a possible obstacle of the fairness theory when it comes to a real-world usage of the analysis, namely that only some elements of the system’s input domain might be plausible and that usually only few of them become actual inputs that have to be monitored for unfairness.

Monitorability::

\(\textsf{FairnessMonitor}\) demonstrates that func-fairness is monitorable. It resolves the quantification over \({\textsf{In}}\) by means of probabilistic falsification, using the robustness estimate function F defined above.

Algorithm 3
figure c

FairnessAwareSystem

Towards func-fairness in the loop

Fig. 6
figure 6

Schematic visualisation of \(\textsf{FairnessAwareSystem}\)

If a high-risk system is in operation, a human in the loop must oversee the correct and fair functioning of the outputs of the system. To do this, the human needs real-time fairness information. Figure 6 shows how this can be achieved by coupling the system \({\textsf{P}} \) and the \(\textsf{FairnessMonitor}\) in Algorithm 2 in a new system called \(\textsf{FairnessAwareSystem}\). \(\textsf{FairnessAwareSystem}\) is sketched in Algorithm 3. Intuitively, the \(\textsf{FairnessAwareSystem}\) is a higher-order program that is parameterised with the original program \({\textsf{P}} \) and the fairness contract  \(\mathcal {F}\). When instantiated with these parameters, the program takes arbitrary (actual) inputs \({\textsf{i}_\textsf{a}}\) from \({\textsf{In}}\). In the first step, it does a fairness analysis using \(\textsf{FairnessMonitor}\) with arguments \({\textsf{P}}\), \(\mathcal {F}\), and \(\{ {\textsf{i}_\textsf{a}} \}\). To make fairness scores comparable, \(\textsf{FairnessAwareSystem}\) normalises the fairness score \(\xi \) received from \(\textsf{FairnessMonitor}\) by dividingFootnote 7 it by the output distance limit \(f(d_{\textsf{In}}({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s}))\). For fair outputs, the score will be between 0 (almost unfair) and 1 (as fair as possible).Footnote 8 Outputs that are not func-fair are accompanied by a negative score representing how much the limit \(f(d_{\textsf{In}}({\textsf{i}_\textsf{a}}, \textsf{i}_\textsf{s}))\) is exceeded. A fairness score of \(-n\) means that the output distance of \({\textsf{P}} ({\textsf{i}_\textsf{a}})\) and \({\textsf{P}} (\textsf{i}_\textsf{s})\) is \(n+1\) times as high as that limit. Finally, \(\textsf{FairnessAwareSystem}\) returns the triple with \({\textsf{P}} \)’s output for \({\textsf{i}_\textsf{a}}\), the normalised fairness score, and the synthetic input with its output witnessing the fairness score.
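Coupling the monitor and the system then takes only a few lines. The sketch below (ours) uses a monitor callable such as the fairness_monitor sketch above; the guard against a zero output distance limit is our own addition, not part of the description above.

```python
def fairness_aware_system(program, d_in, d_out, f, propose, monitor, i_a, timeout_s=60.0):
    """Run P on the actual input i_a, monitor its fairness (e.g., with the
    fairness_monitor sketch above), and return P's output together with the
    normalised fairness score and the witnessing synthetic input and output."""
    xi, _, i_s = monitor(program, d_in, d_out, f, [i_a], propose, timeout_s=timeout_s)
    limit = f(d_in(i_a, i_s))
    normalised = xi / limit if limit > 0 else xi   # guard against a zero limit (our choice)
    return program(i_a), normalised, (i_s, program(i_s))
```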

Interpretation of monitoring results

Especially when \(\textsf{FairnessAwareSystem}\) finds a violation of func-fairness, the suitable interpretation and appropriate response to the normalised fairness score proves to be a non-trivial matter that requires expertise.

Example 6

Instead of using \({\textsf{P}} \) from Example 2 on its own, Unica now uses \(\textsf{FairnessAwareSystem}\) with a suitable fairness contract, and thereby receives a fairness score along with \({\textsf{P}} \)’s verdict on each applicant. (Which fairness contracts are suitable is an open research problem, see Limitations & Challenges in Sect. 7.) If the fairness score is negative, she can also take into account the information on the synthetic counterpart returned by \(\textsf{FairnessAwareSystem}\). Among the 4096 applicants for the PhD program, the monitoring assigns a negative fairness score to three candidates: Alexa, who received a low score, Eugene, who was scored very highly, and John, who got an average score. According to their scoring, Alexa would be desk-rejected, while Eugene and John would be considered further.

Alexa’s synthetic counterpart, let’s call him Syntbad, is ranked much higher than Alexa. In fact, he is ranked so high that Syntbad would not be desk-rejected. Unica compares Alexa and Syntbad and finds that they only differ in one respect: Syntbad’s graduate university is the one in the official ranking that is immediately below the one that Alexa attended. Unica does some research and finds that Alexa’s institution is predominantly attended by People of Colour, while this is not the case for Syntbad’s institution. Therefore, \(\textsf{FairnessAwareSystem}\) helped Unica not only to find an unfair treatment of Alexa, but also to uncover a case of potential racial discrimination.

John’s counterpart, Synclair, is ranked much lower than him. Unica manually inspects John’s previous institution (an infamous online university), his GPA of 1.8, and his test result of only 13%. She finds that this very much suggests that John will not be a successful PhD candidate and desk-rejects him. Therefore, Unica has successfully used \(\textsf{FairnessAwareSystem}\) to detect a fault in the scoring system \({\textsf{P}} \) whereby John would have been treated unfairly in a way that would have been to his advantage.

Eugene received a top score, but his synthetic counterpart, Syna, received only an average one. Unica suspects that Eugene was ranked too highly given his graduate institution, GPA, and test score. However, as he would not have been desk-rejected either way, nothing changes for Eugene, and the unfairness he was subject to has no effect on him.

The cases of John and Eugene share similarities with the configuration in (b) in Fig. 7, the one of Alexa with (a), and the ones of all other 4093 candidates with (c).

Fig. 7
figure 7

Exemplary illustration of configurations of an input (red cross) and its synthetic counterparts (grey circles) and the synthetic counterpart with the minimal fairness score (blue polygon); with a two-dimensional input space (grid) and a one-dimensional output

If our monitor finds only a few problematic cases in a (sufficiently large and diverse) set of inputs, our monitoring helps Unica from our running example by drawing her attention to the cases that require special attention. Thereby, individuals who are judged by the system have a better chance of being treated fairly, since even rare instances of unfair treatment are detected. If, on the other hand, the number of problematic cases found is large, or Unica finds especially concerning cases or patterns, this can point to larger issues within the system. In these cases, Unica should take appropriate steps and make sure that the system is no longer used until it is clarified why so many violations or concerning patterns are found. If the system is found to be systematically unfair, it should arguably be removed from the decision process. A possible conclusion could also be that the system is unsuitable for certain use cases, e.g., for use on individuals from a particular group. Accordingly, it might not have to be removed altogether but only restricted such that problematic use cases are avoided. In any case, significant findings should also be fed back to the developers or deployers of the potentially problematic system. A fairness monitoring such as in \(\textsf{FairnessAwareSystem}\) or a fairness analysis as in \(\textsf{FairnessMonitor}\) could also be useful to developers, regulating authorities, watchdog organisations, or forensic analysts, as it helps them to check the individual fairness of a system in a controlled environment.

6 Interdisciplinary assessment of fairness monitoring

Regulations for car-related emissions have been in force for a considerable amount of time; thus, their legal interpretation is mostly clear. In the case of human oversight of AI systems, the AI Act is new and parts of it are legally ambiguous. This raises the question of whether our approach meets requirements that go beyond pre-theoretical deliberations. Even though comprehensive analyses would go far beyond the scope of this paper, we will nevertheless assess some key normative aspects in philosophical and legal terms, and also briefly turn to the related empirical aspects, especially from psychology.

6.1 Psychological assessment

Fairness monitoring promises various advantages in terms of human-system interaction in application contexts—provided it is extended by an adequate user interface—which call for empirical tests and studies. We will only discuss a possible benefit that closely aligns with the current draft of the AI Act: our approach may support effective human oversight. Two central aspects of effective oversight are situation awareness and warranted trust. Our method highlights unfairness in outputs which can be expected to increase users’ situation awareness (i.e., ‘the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning and the projection of their status in the near future’ [37, p. 36]), which is a variable central for effective oversight [38]. In the minimal case, this allows users to realise that something requires their attention and that they should check the outputs for plausibility and adequacy. In the optimal case and after some experience with the monitor, it may even allow users to predict instances where a system will produce potentially unfair outputs. In any case, the monitoring should enable them to understand limitations of the system and to feed back their findings to developers who can improve the system. This leads us to warranted trust, which includes that users are able to adequately judge when to rely on system outputs and when to reject them [62, 73]. Building warranted trust strongly depends on users being able to assess system trustworthiness in the given context of use [73, 110]. According to their theoretical model on trust in automation, Lee and See [73] propose that trustworthiness relates to different facets of which performance (e.g., whether the system performs reliably with high accuracy) and process (e.g., knowing how the system operates and whether the system’s decision-processes help to fulfil the trustor’s goals) are especially relevant in our case. Specifically, fairness monitoring should enable users to more accurately judge system performance (e.g., by revealing possible issues with system outputs) and system processes (e.g., whether the system’s decision logic was appropriate). In line with Lee and See’s propositions, this should provide a foundation for users to be better able to judge system trustworthiness and should thus be a promising means to promote warranted trust. In consequence, our monitoring provides a needed addition to high-risk use contexts of AI because it offers information enabling humans to more adequately use AI-based systems in the sense of possibly better human-system decision performance and with respect to user duties as described in the AI Act.

6.2 Philosophical assessment

More effective oversight promises more informed decision-making. This, in turn, enables morally better decisions and outcomes, since humans can morally ameliorate outcomes in terms of fairness and can see to it that moral values are promoted. Also, fairness monitoring helps in safeguarding fundamental democratic values if it is applied to potentially unfair systems which are used in certain societal institutions of a high-risk character such as courts or parliaments. It could, for example, make AI-aided court decisions more transparent and promote equality before the law. However, since our approach requires finding context-appropriate and morally permissible parameters for \(\mathcal {F}\), moral requirements arise to enable the finding of such parameters. This not only affects, e.g., developers of such systems, but also those who are in a position to enforce that adequate parameters are chosen, such as governmental authorities, supervising institutions or certifiers.

Apart from that, various parties arguably have a legitimate interest in adequately ascribing moral responsibility for the outcomes of certain decisions to human deciders [14]—regardless of whether the decision-making process is supported by a system. Adequately ascribing moral responsibility is not always possible, though. One precondition for moral responsibility is that the agent had sufficient epistemic access to the consequences of their actions [90, 119], i.e., that they have enough and sufficiently well-justified beliefs about the results of their decision. Someone overseeing a university selection process (like Unica) should, for example, have sufficiently well-justified beliefs that, at the very least, their decisions do not result in more unfairness in the world. If the admission process is supported by a black-box system, though, Unica cannot be expected to have any such beliefs, since she lacks insight into the fairness of the system. Therefore, adequate responsibility ascription is usually not possible in this scenario. Our monitoring alleviates this problem by providing the decider with better epistemic access to the fairness of the system.

\(\textsf{FairnessAwareSystem}\) helps in making Unica’s role in the decision process significant and not only that of a mere button-pusher. \(\textsf{FairnessAwareSystem}\) makes it possible for her to fulfil some of the responsibilities and duties plausibly associated with her role. For example, she can now be realistically expected to not only detect, but also resolve at least some cases of apparent unfairness competently (although she may need additional information to do so). In this respect, she should not be ‘automated away’ (cf. [79]).

6.3 Legal assessment

A central legislative debate of our time is how to counter the risks AI systems can pose to the health and safety or fundamental rights of natural persons. Protective measures must be taken at various levels: First, before being permitted on the market, it must be ensured ex ante that such high-risk AI systems are in conformity with mandatory requirementsFootnote 9 regarding safety and human rights. This means in particular that the selection of the properties which a system should exhibit requires a positive normative choice and should not simply replicate biases present in the status quo [131]. In addition, AI systems must be designed and developed in such a way that natural persons can oversee their functioning. For this purpose, it is necessary for the provider to identify appropriate human oversight measures before placing the system on the market or putting it into service. In particular, such measures should guarantee that the natural persons to whom human oversight has been assigned have the necessary competence, training and authority to carry out that role [40, recital 48] [41, Art. 14 (5)].

Second, during runtime, the proper functioning of high-risk AI systems, which have been placed on the market lawfully, must be ensured. To achieve this goal, a bundle of different measures is needed, ranging from legal obligations to implement and perform meaningful oversight mechanisms to user training and awareness in order to counteract ‘automation bias’. Furthermore, the AI Act proposal requires deployers to inform the provider or distributor and suspend the use of the system when they have identified any serious incidents or any malfunctioning [40, 41, Art. 29(4)].

Third, and ex post, providers must act and take the necessary corrective actions as soon as they become aware, e.g. through information provided by the deployer, that the high-risk system does not (or no longer) meet the legal requirements [40, 41, Art. 16(g)]. To this end, they must establish and document a system of monitoring that is proportionate to the type of AI technology and the risks of the high-risk AI system [40, 41, Art. 61(1)].

Fairness monitoring can be helpful in all three of the above respects. Therefore, we argue that there is even a legal obligation to use technical measures such as the method presented in this paper if this is the only way to ensure effective human oversight.

7 Conclusion and future work

This article brings together software do** theory and probabilistic falsification techniques. To this end, it proposes a suitable HyperSTL semantics and characterises robust cleanness and func-cleanness as HyperSTL formulas and, for the special case of finite standard behaviour, as STL formulas. Software do** techniques have been extensively applied to tampered diesel emission cleaning systems; this article continues this path of research by demonstrating how testing of real cars can become more effective. For the first time, we apply software do** techniques to high-risk (AI) systems. We propose a runtime fairness monitor to promote effective human oversight of high-risk systems. The development of this monitor is complemented by an interdisciplinary evaluation from a psychological, philosophical, and legal perspective.

Limitations & Challenges

A challenge for anyone employing robust cleanness or func-cleanness analysis is the selection of suitable parameters, especially \(d_{\textsf{In}}\), \(d_{\textsf{Out}}\), and f, or \(\kappa _\textsf{i}\) and \(\kappa _\textsf{o}\). Because these parameters are highly context-sensitive, there are no paradigmatic candidates that one can default to. Instead, they have to be selected carefully with the concrete system, the structure of the input data, and the situation of use in mind.

Reasonable choices for robust cleanness analysis of diesel emissions have been proposed in recent work [19, 21]. With respect to individual fairness analysis, the potential systems to which \(\textsf{FairnessAwareSystem}\) or \(\textsf{FairnessMonitor}\) can be applied are still too diverse to give general recommendations for the contract parameters. A further technical limitation is that f, \(d_{\textsf{In}}\), and \(d_{\textsf{Out}}\) must be computable.
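To make the shape of such a contract concrete, the following Python sketch shows one possible way to bundle hypothetical distance functions and thresholds for a hiring scenario. All feature names, weights, and threshold values are illustrative assumptions, not recommendations derived from this work.

```python
# Hypothetical contract parameters for an individual-fairness analysis of a
# hiring system. All features, weights, and thresholds are illustrative
# assumptions; in practice they must be negotiated for the concrete system.
from dataclasses import dataclass
from typing import Callable, Dict

Applicant = Dict[str, float]  # e.g. {"years_experience": 4.0, "test_score": 71.0}

@dataclass
class Contract:
    d_in: Callable[[Applicant, Applicant], float]   # distance on inputs
    d_out: Callable[[float, float], float]          # distance on outputs
    kappa_i: float                                  # input vicinity threshold
    kappa_o: float                                  # output vicinity threshold

def d_in(a: Applicant, b: Applicant) -> float:
    # Weighted Manhattan distance over a fixed feature set (assumed here).
    weights = {"years_experience": 1.0, "test_score": 0.1}
    return sum(w * abs(a[k] - b[k]) for k, w in weights.items())

def d_out(x: float, y: float) -> float:
    # Outputs are assumed to be scores in [0, 1]; plain absolute difference.
    return abs(x - y)

contract = Contract(d_in=d_in, d_out=d_out, kappa_i=2.0, kappa_o=0.1)

# Two applicants within the input vicinity (d_in <= kappa_i) would then be
# expected to receive outputs within the output vicinity (d_out <= kappa_o).
```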

With particular regard to fairness analysis, we also identify non-technical limitations. As seen in Fig. 5, our fairness monitoring aims to uncover a particular kind of unfairness, namely individual unfairness that originates from within the system. This excludes all kinds of group unfairness as well as unfairness from sources other than the system. Another limitation is the human's competence to interpret the system outputs. Even though this limitation is not inherent to our approach, it will arguably be relevant in some practical cases, and an implementation of the monitoring always has to happen with the human in mind. For example, the design of the tool should avoid creating the false impression that the system is proven to be fair for an individual if no counterexample has been found. Interpretations like this could lead to inflated judgements of system trustworthiness and eventually to overtrusting system outputs [110, 112]. Also, it might be reasonable to limit access to the monitoring results: if individuals who are processed by the system have full access to their fairness analysis, they could use it to ‘game’ the system, i.e. they could use the synthetic inputs to slightly modify their own input such that they receive a better outcome. While more transparency for the user is generally desirable, this has to be kept in mind to avoid introducing new unfairness on a meta-level.

Future Work

The probabilistic falsification technique we use in this article can be seen as a modular framework consisting of several interchangeable components. One of these components is the optimisation technique used to find the input with minimal robustness value. Algorithm 1 uses a simulated annealing technique [29, 107], but other techniques have been proposed for temporal logic falsification, too [4, 108]. We plan to look further into such alternative optimisation techniques and to evaluate whether they offer benefits w.r.t. cleanness falsification.
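To illustrate where an alternative optimiser would plug in, the following minimal Python sketch shows a generic simulated-annealing loop that searches for an input minimising a robustness value. The robustness function, the input representation, and the proposal step are placeholders; the sketch does not reproduce Algorithm 1 and makes no claim about its exact cooling schedule or acceptance criterion.

```python
import math
import random
from typing import Callable, List

def falsify(robustness: Callable[[List[float]], float],
            initial: List[float],
            perturb: Callable[[List[float]], List[float]],
            steps: int = 1000,
            temp0: float = 1.0) -> List[float]:
    """Search for an input with minimal (ideally negative) robustness.

    A negative robustness value witnesses a violation of the property under
    test. This is a generic simulated-annealing loop, not Algorithm 1 itself.
    """
    current = best = initial
    r_current = r_best = robustness(initial)
    for k in range(steps):
        temp = temp0 * (1 - k / steps)       # simple linear cooling schedule
        candidate = perturb(current)
        r_candidate = robustness(candidate)
        # Accept improvements always; accept worse candidates with a
        # temperature-dependent probability to escape local minima.
        accept = (r_candidate < r_current or
                  random.random() < math.exp(-(r_candidate - r_current) / max(temp, 1e-9)))
        if accept:
            current, r_current = candidate, r_candidate
            if r_current < r_best:
                best, r_best = current, r_current
        if r_best < 0:                       # counterexample found
            break
    return best
```

In such a framework, replacing the optimiser only means swapping out this search loop while keeping the robustness computation and the system-under-test interface unchanged.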

The fairness monitoring approach has been presented using a toy example; it is not claimed to be readily applicable to real-life scenarios. Besides the future work that has already been mentioned throughout the paper, we are planning various extensions of our approach and are working on an implementation that will allow us to integrate the monitoring into a real system. Moreover, we plan to test the possible benefits and shortcomings of the approach in user studies where decision-makers are tasked to make hiring decisions with and without the fairness monitoring approach. Further work will encompass the improvement and embedding of the algorithm \(\textsf{FairnessAwareSystem}\) into a proper tool that can be used by non-computer-scientists, and the extension of the monitoring technique to cover more types of unfairness.

For example, logging the output of the fairness monitor could be used to identify groups that are especially likely to be treated unfairly by the system: the individual fairness verdicts provided by \(\textsf{FairnessAwareSystem}\) and \(\textsf{FairnessMonitor}\) may be logged and considered for further fairness assessments or other means of quality assurance of system P. Statistical analysis might unveil that individuals of certain groups are treated unfairly more frequently than individuals from other groups (see the sketch at the end of this section). Depending on the distinguishing features of the evaluated group, this can uncover problems in P, especially if protected attributes, such as gender, race, or age, are taken into account. Thereby, system fairness can be assessed for protected attributes without including them in the input of P, which should generally be avoided, and even without disclosing them to the human in the loop. By evaluating the monitoring logs from sufficiently many diverse runs of \(\textsf{FairnessAwareSystem}\), our local method can be lifted such that it resembles a global method for many practical applications, i.e. we can make statistical statements about the general fairness of P. Such an evaluation can also be used to extract prototypes and counterexamples in the spirit of Kim et al. [66], illustrating the tendency to judge unfairly. This is an interesting combination of individual and group fairness that we want to look into further. Other insights from the research on reactive systems [19, 21, 32] can potentially be used to further enrich the monitoring.

Finally, various disciplines have to join forces to resolve highly interdisciplinary questions, such as what constitutes reasonable and adequate choices for f, \(d_{\textsf{In}}\), and \(d_{\textsf{Out}}\) in given contexts of application.
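As a closing illustration of the logging-based group analysis outlined above, the following Python sketch aggregates hypothetical monitor verdicts by a protected attribute that is kept outside the input of P. The log format, the group labels, and the attribute are assumptions made purely for illustration.

```python
# Hypothetical post-hoc analysis of logged fairness verdicts. Each entry is
# assumed to stem from one monitored decision of FairnessMonitor or
# FairnessAwareSystem, with the protected attribute stored separately from
# the inputs that P actually processes.
from collections import Counter

verdict_log = [
    {"group": "A", "flagged_unfair": True},
    {"group": "A", "flagged_unfair": False},
    {"group": "B", "flagged_unfair": False},
    # ... one entry per monitored decision
]

counts = Counter((e["group"], e["flagged_unfair"]) for e in verdict_log)
for group in sorted({e["group"] for e in verdict_log}):
    flagged = counts[(group, True)]
    total = flagged + counts[(group, False)]
    print(f"group {group}: {flagged}/{total} decisions flagged as potentially unfair")

# A marked difference in flag rates between groups would prompt a closer look
# at P, without the protected attribute ever being part of P's input.
```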