Introduction

The use of algorithmic and data-driven solutions has rapidly increased in fields where high-stakes decisions are made [1]. This growing dependence highlights the importance of transparency and interpretability in artificial intelligence (AI) systems [10].

The recent studies on counterfactual explanations highlight several key desiderata such as validity, sparsity, plausibility, actionability, and diversity [4, 11]. Validity requires counterfactuals to match the model’s decision-making. Sparsity emphasizes minimal changes from the original instance, simplifying and interpreting the counterfactual. Plausibility ensures counterfactuals are realistic and believable. Actionability makes counterfactual suggestions practical and implementable by focusing on the user’s ability to act. Finally, diversity promotes multiple counterfactuals to clarify the model’s decision boundaries. Achieving one of these desiderata can compromise another, so they must be balanced. For instance, sparsity may reduce explanation plausibility or diversity. In general, counterfactual-based methods prioritize soundness, whereas data-summarizing techniques favor completeness despite being less sound [12]. Our study addresses these trade-offs by creating holistic counterfactual explanations that meet these critical requirements.

The literature on the evaluation counterfactual explanations includes various methods and measures for quality assessment. Emphasis is placed on the importance of practicality and robustness in evaluation processes [13]. Additionally, the significance of robust and plausible explanations for enhancing individual fairness is highlighted [14]. The necessity for creating diverse and actionable explanations is also underscored, along with the introduction of frameworks for their generation and metrics for assessment [12]. The literature further advocates for user testing and qualitative analysis in evaluations, identifying key deficits in existing methodologies [30] implemented a granularity-based workflow scheduling algorithm tailored for cloud computing environments. This approach not only optimized the scheduling process but also enhanced performance metrics like makespan and virtual machine utilization. By incorporating granular computing into our study, we can take a more detailed approach to identifying counterfactual search objectives at various levels of granularity. Moreover, it facilitates comprehension of the broader implications of the generated explanations, ensuring a thorough and comprehensive analysis of the scheduling problems in consideration. The decision to use this methodology greatly enhances our study, providing a more profound understanding of the complexities of counterfactual explanations in scheduling settings.

Methodology

In our proposed approach, we initially train a predictive process monitoring model on historical process execution data to accurately predict operation processing durations. With these predictions, we proceed to generate an initial schedule using MILP (see Fig. 2). The subsequent phase involves generating counterfactual explanations through the application of the NSGA-II for multi-objective optimization. In each iteration of this process, the algorithm selects an individual operation, assessing it against multiple objectives: conventional counterfactual goals like proximity and reduction in processing duration, along with the overall makespan and sequence similarity to the original schedule. Notably, each iteration includes optimizing the schedule to determine the obtained makespan. Modifications are made to the feature vector of the chosen operation while maintaining the status of other operations, thus incorporating these variables into an integrated multi-objective counterfactual search. This strategy aims not only to enhance the efficiency of individual operations but also to preserve the overall effectiveness and coherence of the entire manufacturing process.

The “Predicting Processing Time of Operations’’ section introduces and presents the preliminaries of process mining concepts for the JSSP context and discusses the predictive modeling of processing times for operations. The “JSSP: Objective, Variables, and Constraints” section presents the mathematical optimization approach for JSSP, detailing its parameters, decision variables, objective function, and constraints. In the “Multi-objective Counterfactual Explanations’’ section, the focus is on multi-objective optimization, specifically identifying system-level counterfactual explanations. Finally, the “Multi-objective Counterfactual Explanations’’ section examines the use of the NSGA-II algorithm in addressing the multi-objective aspects of the problem.

Fig. 2
figure 2

Overview of the proposed multi-objective optimization approach for counterfactual explanations

Predicting Processing Time of Operations

The predictive modeling phase of our proposed approach seeks to leverage historical process event data to forecast the processing times for each operation, an aspect often overlooked in optimization strategies. This subsection introduces tailored notations and definitions derived from the domain of process mining, specifically adapted to align with the JSSP context. This customization is essential for a detailed appreciation of the complexities and dynamics prevalent in JSSP, as viewed through the process mining lens. Within this adjusted framework, “events” correspond to individual operations, “traces” are associated with jobs, and “event logs” are reinterpreted to represent operation logs. This refined approach enhances our understanding and analysis of JSSP, bridging the gap between theoretical optimization and practical application challenges. Following the formal definition of the prediction problem, we proceed to outline the ML algorithms selected to address the described regression challenge.

Definition 1

(Operation) An operation is a tuple \(o = (a, j, m, r, t_{start},\) \(t_{complete}, v_1, \ldots , v_n)\), where:

  • \(a \in \mathcal {A}\) represents the activity type of the operation;

  • \(j \in \mathcal {J}\) is the job identifier;

  • \(m \in \mathcal {M}\) denotes the machine on which the operation is performed;

  • \(r \in \mathcal {R}\) denotes the resources utilized during the operation;

  • \(t_{start} \in \mathbb {N}\) is the start timestamp of the operation (defined as seconds since \(1 / 1 / 1970\), a Unix epoch time representation);

  • \(t_{complete} \in \mathbb {N}\) is the completion timestamp of the operation;

  • \(v_1, \ldots , v_n\) represents the list of operation-specific attributes, where \(\forall 1 \le \tilde{n} \le n: v_{\tilde{n}} \in \mathcal {V}_{\tilde{n}}\), with \(\mathcal {V}_{\tilde{n}}\) denoting the domain of the \(\tilde{n}^{th}\) attribute.

Consequently, \(\mathcal {O} = \mathcal {A} \times \mathcal {J} \times \mathcal {M} \times \mathcal {R} \times \mathcal {T}_{start} \times \mathcal {T}_{complete} \times \mathcal {V}_1 \times \cdots \times \mathcal {V}_n\) is defined as the universe of operations. Moreover, we define the following projection functions given the operation \(o \in \mathcal {O}\):

  • \(p_a :\mathcal {O} \rightarrow \mathcal {A},\; p_a(o) = a\),

  • \(p_j :\mathcal {O} \rightarrow \mathcal {J},\; p_j(o) = j\),

  • \(p_m :\mathcal {O} \rightarrow \mathcal {M},\; p_m(o) = m\),

  • \(p_r :\mathcal {O} \rightarrow \mathcal {R},\; p_r(o) = r\),

  • \(p_{t_{start}} :\mathcal {O} \rightarrow \mathcal {T}_{start},\; p_{t_{start}}(o) = t_{start}\),

  • \(p_{t_{complete}} :\mathcal {O} \rightarrow \mathcal {T}_{complete},\; p_{t_{complete}}(o) = t_{complete}\),

  • \(p_{v_i} :\mathcal {O} \rightarrow \mathcal {V}_i,\; p_{v_i}(o) = v_i, \forall 1 \le i \le n\)

Definition 2

(Job) Let \(\sigma \in \mathcal {O}^{*}\) denote a job, where \(\sigma _{j} = \langle o_{1}, o_{2}, \ldots , o_{{|\sigma _{j} |}}\rangle \) is a finite sequence of operations corresponding to a specific job \(j\), and \(\mathcal {J} \) denote the universe of jobs. Each operation \(o_{i}\) in \(\sigma \) is unique (occurs no more than once). Furthermore, for any two operations \(o_i, o_{i'} \in \sigma \), the following conditions hold:

  • \(p_j(o_{i}) = p_j(o_{i'})\): This ensures all operations within \(\sigma \) belong to the same job, maintaining job consistency within the sequence.

  • \(p_{t_{start}}(o_{i}) \le p_{t_{start}}(o_{i'})\) for \(1 \le i < i' \le {|\sigma _{j} |}\): This enforces a chronological order based on the operation start timestamps, reflecting the real-world progression of tasks within the job.

  • \(p_{t_{complete}}(o_{i}) \le p_{t_{start}}(o_{i'})\) for \(1 \le i < i' \le {|\sigma _{j} |}\): This enforces the termination of an operation before the start of the subsequent operation pertaining to the same job.

  • The total number of operations in \(\sigma _j\), also known as the trace length, is represented by \(|\sigma _j |\).

Definition 3

(Partial Job Sequences) In the context of predictive process monitoring, two methods for constructing partial sequences from a job sequence \(\sigma _j\) are defined as follows: prefix and suffix extraction, represented by functions \(hd^{i}(\sigma _j)\) and \(tl^{i}(\sigma _j)\) respectively:

  • Prefix Extraction (\(hd^{i}\)): Given an integer \(i\) within the range \([1, |\sigma _j |]\), \(hd^{i}(\sigma _j)\) constructs a prefix as the sequence \(\langle o_{1}, o_{2}, \ldots , o_{\min (i, |\sigma _j |)}\rangle \), comprising the first \(i\) operations of \(\sigma _j\).

  • Suffix Extraction (\(tl^{i}\)): For the same integer \(i\), \(tl^{i}(\sigma _j)\) yields a suffix as \(\langle o_{w}, o_{w+1}, \ldots , o_{|\sigma _j |}\rangle \) where \(w = \max (|\sigma _j |- i + 1, 1)\), capturing the last \(i\) operations of \(\sigma _j\).

Definition 4

(Operation Log) The operation log \(\mathcal {O}_{\mathcal {J}}\) is defined as a set of completed jobs, \(\mathcal {O}_{\mathcal {J}} = \{\sigma _j \mid j \in \mathcal {J}\}\).

Definition 5

(Prediction of Processing Times) Given a feature extraction function \( \phi : \mathcal {O} \rightarrow \mathcal {X} \) that extracts job and operation-level features for each operation, thus map** the raw data into a structured feature vector \( x \), we define the output variable \( y \). This variable quantifies the processing time for an operation \( o \), calculated as \( y = p_{t_{\text {complete}}}(o) - p_{t_{\text {start}}}(o) \). With this groundwork, the input feature space \( \mathcal {X} \), which consists of vectors derived from operation attributes, and the output space \( \mathcal {Y} \), indicative of processing times, are established. The dataset \( \{(x_1, y_1), (x_2, y_2), \ldots , (x_N, y_N)\} \) comprises \( N \) pairs of these vectors and their corresponding processing times. The objective is to develop a function \( f: \mathcal {X} \rightarrow \mathcal {Y} \), which predicts the processing time \( \hat{y} \) for a new feature vector \( \hat{x}\) corresponding to an operation \( \hat{o} \). The function \( f \) aims to approximate the actual processing time as precisely as possible.

To tackle the regression problem in predictive process monitoring, our study employs Gradient Boosting Machines (GBM) [31]. GBM is highly suitable for tabular data due to its ability to intricately model complex, non-linear relationships within structured datasets. This method iteratively improves prediction accuracy by building a series of decision trees, each correcting the errors of its predecessor. GBM’s flexibility for parameter tuning and robustness against common data issues like missing values make it superior in handling diverse and challenging datasets. Overall, GBM’s precision and adaptability render it an optimal choice for the nuanced demands of predictive process monitoring.

For benchmarking of the model performance, we examine Generalized Linear Models (GLM) [32] and Distributed Random Forests (DRF) [33], with implementations provided by the h2o-library (see the “Dataset Overview and Tools’’ section) alongside GBM. GLM stands as a traditional statistical approach, extending linear regression to accommodate various relationships through different link functions and distributions. GLM’s inherent interpretability and its capacity to elucidate variable significance provide invaluable insights into process dynamics. DRF, in contrast, offers a robust ensemble method by combining multiple decision trees to enhance predictive accuracy and mitigate over-fitting. Although this model lacks inherent interpretability, the incorporation of randomness in data sampling and feature selection broadens the model’s applicability, aiding in capturing a wide array of patterns, with its scalability allowing for large-scale datasets, a common scenario in process monitoring endeavors.

JSSP: Objective, Variables, and Constraints

The second stage of our proposed methodology focuses on the application of optimization algorithms to the specified scheduling challenge, guided by the forecasts provided by the predictive model. In this study, we consider a scheduling environment consisting of a finite set of jobs \(\mathcal {J}\), machines \(\mathcal {M}\), and workers \(\mathcal {R}\). Each job \(j \in \mathcal {J}\) is composed of a sequence of operations \(\sigma _{j}\), which must be processed in a pre-determined order, respecting the precedence constraint. Operations are processed by machines and workers, where the respective activity of each operation \(o_{ij}\) is eligible to be processed by a subset of machines \(\mathcal {M}_{i,j}\) and requires a worker capable of operating the corresponding machine from \(\mathcal {R}_{\mathcal {M}_{i,j}}\) for its execution. The availability of jobs and machines is assumed from the start of the scheduling horizon.

The processing time for each operation, as executed by a worker on a chosen machine, is determined through an ML model. Constraints ensure that each machine and worker is engaged in at most one operation at any time. Moreover, an operation can be executed no more than once and must wait for its preceding operations to be completed before it commences. The scheduling objective is to minimize the overall completion time, known as the makespan. This goal is pursued by solving the problem regarding the sequencing of operations. The formal definition of all other elements is described as follows:

Indices

  • \( i \): Index for operations, \( i = 1, 2, \ldots , |\sigma _j |\).

  • \( j \): Index for jobs, \( j = 1, 2, \ldots , |\mathcal {J} |\).

  • \( k \): Index for machines, \( k = 1, 2, \ldots , |\mathcal {M} |\).

  • \( l \): Index for workers, \( l = 1, 2, \ldots , |\mathcal {R} |\).

The outlined indices provide a structured approach to categorize relevant elements pertaining to the JSSP. Building upon this structure, we now turn our attention to the definition of the sets:

Sets

  • \( \mathcal {J} \): Set of all jobs, \( \mathcal {J} = \{1, 2, \ldots , u\} \).

  • \( \mathcal {M} \): Set of all machines, \( \mathcal {M} = \{1, 2, \ldots , v\} \).

  • \( \mathcal {R} \): Set of all workers, \( \mathcal {R} = \{1, 2, \ldots , w\} \).

  • \( \sigma _j \): Set of operations for job \( j \)\( \sigma _j = \{o_{j1}, o_{j2}, \ldots , o_{j_{|\sigma _j |}}\} \).

  • \( \mathcal {M}_{i,j} \): Subset of machines capable of processing the activity \(p_a(o_{ij})\) of operation \(o_{ij}\) of job \( j \).

  • \( \mathcal {R}_m \): Subset of workers capable of operating machine \( m \).

These sets encapsulate the aggregate elements of the scheduling environment, thereby furnishing a comprehensive view of the jobs, machines, and workers involved. Next, we progress to the definition of a central parameter:

Parameters

  • \( \hat{y}_{i,j,m,r}\): Processing time for operation \( o_{i,j} \) using machine \( m \) and worker \( r \), as predicted by the employed predictive model.

Decision Variables

  • \(p_{t_{start}}(o_{ij}) \): Start time of operation \( o_{ij} \).

  • \( X_{i,j,m,r} \): Binary variable that equals 1 if operation \( o_{ij} \) is assigned to machine \( k \) and worker \( r \), 0 otherwise.

These decision variables form the operative layer of our approach, representing actionable elements within the optimization problem, and enabling manipulation and optimization of the operation scheduling. Leveraging the previous definitions, we introduce an objective function as a driving force toward optimal solutions:

Objective Function

Minimize the makespan and the completion time of the last operation of the last job.

$$\begin{aligned} \min C_{\text {max}} \end{aligned}$$
(1)
$$\begin{aligned} \begin{aligned} C_{\text {max}} =&\ \max _{j \in \mathcal {J}, o_{ij} \in \sigma _j} \Biggr \{ p_{t_{start}}(o_{ij}) \\ {}&+ \sum _{m \in \mathcal {M}_{i,j}} \sum _{r \in \mathcal {R}_k} \left( \hat{y}_{i,j,m,r} \cdot X_{i,j,m,r}\right) \biggl \} \end{aligned} \end{aligned}$$
(2)

This function encapsulates the primary goal, minimizing the schedule makespan, and serves as the criterion for evaluating the effectiveness of identified scheduling solutions. While the objective function provides a target for optimization, the following constraints define the space of feasible solutions:

Constraints

Event Precedence Constraint

This constraint rigorously ensures that the chronological order of operations within a job is meticulously adhered to. In particular, each operation must not start before the previous operation in the job sequence is completed.

$$\begin{aligned} \begin{aligned}&p_{t_{start}}(o_{ij}) + \sum _{m \in \mathcal {M}_{i,j}} \sum _{r \in \mathcal {R}_m} \left( \hat{y}_{i,j,m,r} \cdot X_{i,j,m,r}\right) \\ {}&\le p_{t_{start}}(o_{(i+1)j}), \; \forall o_{ij} \in \sigma _j \setminus \{o_{{|\sigma _j |}j}\}, j \in \mathcal {J} \end{aligned} \end{aligned}$$
(3)

Machine and Worker Assignment Constraint

Adhering to the underlying structures of the manufacturing process, each operation must be processed by exactly one machine and one worker.

$$\begin{aligned} \sum _{m \in \mathcal {M}_{i,j}} \sum _{r \in \mathcal {R}_m} X_{i,j,m,r} = 1 \quad , \forall o_{ij} \in \sigma _j, j \in \mathcal {J} \end{aligned}$$
(4)

Machine and Worker Utilization Constraint

Due to the nature of the underlying manufacturing scenario, a machine must be operated and monitored by a corresponding worker for the total duration of an operation. Thus, each machine and worker can only be allocated to one operation at a time.

$$\begin{aligned} \begin{aligned}&p_{t_{start}}(o_{ij}) + \sum _{m \in \mathcal {M}_{i,j}} \sum _{r \in \mathcal {R}_m} \left( \hat{y}_{i,j,m,r} \cdot X_{i,j,m,r}\right) \le p_{t_{start}}(o_{i'j'}), \\&\forall i \ne i', j \ne j', m \in \mathcal {M}, r \in \mathcal {R} \end{aligned} \end{aligned}$$
(5)

with the additional condition that if \( X_{i,j,m,r} = X_{i',j',m,r} = 1\), then the operations \( o_{ij}, o_{i'j'} \) cannot overlap in time.

Non-Negativity of Start Time Constraint

Precluding any unrealistic scheduling scenarios, each operation must have a non-negative start time.

$$\begin{aligned} p_{t_{start}}(o_{ij}) \ge 0 \quad \forall o_{ij} \in \sigma _j, j \in \mathcal {J} \end{aligned}$$
(6)

In addressing the described JSSP, this study adopts a MILP approach. The relevance of the MILP method in solving JSSP lies in its proficiency in managing complex constraints and multiple objectives. MILP excels in articulating the discrete aspects of scheduling tasks, encompassing vital constraints like sequential task ordering and resource allocation. Although alternative optimization methodologies exist, MILP distinguishes itself through its capability to deliver precise and optimal solutions within a reasonable computational timeframe, particularly in highly complex scenarios. This makes it a potent tool for addressing the complexities inherent in JSSP.

Multi-objective Counterfactual Explanations

Upon completing the training of the ML-based predictive process monitoring model and establishing an initial schedule, we advance to the third stage of our proposed approach, which entails the generation of multi-objective counterfactual explanations. The formal definition and specifics of this process are outlined below.

Definition 6

(Counterfactual Search) Given a predictive model \( f: \mathcal {X} \rightarrow \mathcal {Y} \), with \( \mathcal {Y} \in \mathbb {R} \), a set of desired operation outcomes \( \mathcal {Y'} \subseteq \mathbb {R} \), a schedule \( s \subseteq \mathcal {O}_{\mathcal {J}} \), and the corresponding makespan \( C_{max, s} \in \mathbb {R}\), and a set of desired schedule outcomes \( C'_{max, s} \subseteq \mathbb {R}\), a counterfactual explanation in form of an alternative schedule \( s_{x'} \) for the original schedule \( s \) based on an altered feature vector \( x' \) for the original feature vector \( x\) of an operation \( o\) from a job \( \sigma \in s\) satisfies the following conditions:

  1. 1.

    the prediction \( f(x') \) for the altered feature vector \( x' \) lies in the proximity of desired outcomes \( \mathcal {Y}' \)

  2. 2.

    the altered feature vector \( x' \) lies in the proximity of the original feature vector \( x \)

  3. 3.

    the altered feature vector \( x' \) consists of plausible feature values for the underlying scenario

  4. 4.

    the alternative makespan \( C_{max, s_{x'}} \) lies in the proximity of the of desired schedule outcomes \( C'_{max, s}\) after the optimization of operation sequencing

  5. 5.

    the alternative schedule \( s_{x'} \) lies in the proximity of the original schedule \( s \) after the optimization of operation sequencing

The counterfactual search is defined as a multi-objective minimization task as follows:

$$\begin{aligned} \begin{aligned}&\min _{x,s_x} Obj(x,s_x) = \min _{x, s_x} \left( Obj_1( f(x'), \mathcal {Y}'), Obj_2(x', x),\right. \\&\left. Obj_3(x', \mathcal {X}^{Obs}), Obj_4(C_{max, s_{x'}} , C'_{max, s}), Obj_5(s_{x'}, s) \right) \end{aligned} \end{aligned}$$
(7)

with \(Obj: \mathcal {X} \rightarrow \mathbb {R}^5 \) and \(\mathcal {X}^{Obs}\) being the observed feature vector space. \(Obj_1\) quantifies the distance between prediction \( f(x') \) for the an altered feature vector \( x' \) and the desired outcomes \( \mathcal {Y}' \) as follows:

$$ Obj_1\left( f(x'), \mathcal {Y}'\right) = {\left\{ \begin{array}{ll} 0 &{} \text{ if } f(x') \in \mathcal {Y}' \\ \inf _{y' \in \mathcal {Y}'} |f(x')-y'|&{} \text{ else } \end{array}\right. } $$

\(Obj_2\) quantifies the distance between an altered feature vector \( x' \) and its original \( x \) by employing the Gower-Distance as follows:

$$\begin{aligned} Obj_2\left( x', x\right) = \frac{1}{N} \sum _{i=1}^N \delta _G\left( x'_i, x_i\right) \in [0,1] \end{aligned}$$
(8)

with \(\delta _G\) applying the respective measuring method to each variable type in the following manner:

$$\begin{aligned} \begin{aligned}&\delta _G\left( x'_i, x_i\right) \\ {}&= {\left\{ \begin{array}{ll}\frac{1}{\mathcal {X}_i}|x'_i-x_i |&{} \text{ if } x'_i, x_i \text{ are } \text{ numerical } \text{ variables } \\ \mathbb {I}_{x'_i \ne x_i} &{} \text{ if } x'_i, x_i \text{ are } \text{ categorical } \text{ variables }\end{array}\right. } \end{aligned} \end{aligned}$$
(9)

with \(\mathcal {X}_i\) being the range of applicable values for feature i, observed in \(\mathcal {X}^{Obs}\). \(Obj_3\) quantifies whether the altered feature vector \( x' \) consists of plausible values given the observed feature vectors from \(\mathcal {X}^{Obs}\) as follows:

$$\begin{aligned} Obj_3\left( x', {X}^{Obs} \right) = {\left\{ \begin{array}{ll} 0 &{} \text{ if } x' \in {X}^{Obs} \\ +\infty &{} \text{ else } \end{array}\right. } \end{aligned}$$
(10)

\(Obj_4\) quantifies the distance between an alternative makespan \( C_{max, s_{x'}} \) and its desired schedule outcomes \( C'_{max, s}\) as follows:

$$\begin{aligned} \begin{aligned}&Obj_4\left( C_{max, s_{x'}} , C'_{max, s}\right) \\ {}&= {\left\{ \begin{array}{ll} C_{max, s_{x'}} - C'_{max, s} &{} \text{ if } C_{max, s_{x'}} \in C'_{max, s} \\ +\infty &{} \text{ else } \end{array}\right. } \end{aligned} \end{aligned}$$
(11)

\(Obj_5\) quantifies the distance between an altered schedule \( s_{x} \) and its original \( s \) by employing the Levenshtein-Distance with regards to the sequencing order of operations within the schedule as follows:

$$\begin{aligned} Obj_5\left( s_{x'}, s \right) = C_{|s_{x'}|, |s|}, \end{aligned}$$
(12)

with \( C_{|s_{x'}|, |s|}\) denoting the following modified version of the generic recursion for the Levenshtein distance:

$$\begin{aligned} C_{i,i'} = \min {\left\{ \begin{array}{ll} 0, &{} \text{ if } i=i'=0 \\ C_{i-2, i'-2} + 1, &{} \text{ if } i, i'>0 \end{array}\right. } \end{aligned}$$
(13)

This modified version of the generic recursion is based in the condition that alternative schedules consist of the same amount of operations and only differ in the sequencing thereof. Thus, any alternative schedule is a result of a limited amount of operation transpositions from the original schedule and the difference between two sequences can be measured by the minimum amount of operation transpositions.

We employ the NSGA-II algorithm to tackle our multi-objective optimization problem, renowned for its efficacy in such scenarios due to its low computational demand, incorporation of elitism, and its capacity to sustain a diverse set of solutions [34, 35]. NSGA-II excels in optimizing multiple, often conflicting, objectives concurrently, yielding a spectrum of Pareto optimal solutions that represent the most efficient trade-offs. The process begins with the initialization of a random population, which is then organized into Pareto fronts based on fitness. The foremost front comprises solutions that are unrivaled, and deemed the “best” currently available. Selection for reproduction is based on both the non-domination rank and crowding distance of solutions, ensuring a balance between optimality and diversity. Through crossover and mutation, alongside merging with the parent population, NSGA-II incorporates elitism and ensures diversity, preparing for the next generation by reassessing dominance relationships due to new entrants. This iterative process, guided by crowding distance, prevents convergence to a singular solution, instead exploring a broad objective space. Termination is typically determined by a set number of generations. NSGA-II’s strategy of balancing exploration and exploitation in the solution space ensures a comprehensive search for optimal outcomes in multi-objective contexts. Essential inputs for NSGA-II include the number of decision variables, fitness function, objective dimensions, population size, and the probabilities for crossover and mutation.

For the underlying scenario, the decision variables consist of the index of the operation that is being manipulated, an appropriate alternative machine for the selected operation, and an appropriate resource able to operate the selected machine. This information is encoded to numerical values on which genetic operations can be performed via the NSGA-II implementation. A corresponding decoder yields the operation index, machine, and resource, allowing for the analysis of non-dominated solutions. The fitness function encompasses the decoding of the encoded operation index, machine, and resource, the estimation of the processing time of the altered operation via the underlying ML model, the rescheduling of the altered operation processing time, and the subsequent evaluation of the altered schedule. Lastly, the scores for four determined objectives are returned: the improvement in processing time for the altered operation, the improvement in schedule makespan, the Gower distance between the vectors of the original and altered operation, and the Levenshtein distance between the original and altered schedule.

Experiment Settings

This section gives a general overview of the underlying dataset, the tools employed for implementing our proposed approach, hyperparameter optimization for the ML model, and a description of utilized model evaluation metrics. Moreover, it outlines the framework within which the study’s experiments were conducted, detailing the dataset, implementation tools, predictive modeling, optimization approach, counterfactual explanation generation, and the metrics used for evaluating the performance of these components. This comprehensive overview ensures clarity on the methodology applied in the research, encompassing data preparation, analytical tools selection, hyperparameter tuning processes for the ML model, and the criteria for assessing the effectiveness and accuracy of the proposed counterfactual generation approach.

Dataset Overview and Tools

Table 1 illustrates a selected portion of the operational log data, providing insights into specific aspects of the operations. Table 2 presents a general statistic for the underlying operations log, which is an extended version of the dataset utilized in [3 depicts the search spaces for each hyperparameter. For the final model, the best-performing model type with the most efficacious hyperparameter configurations was adopted.

Table 3 Hyperparameter optimization settings for GLM, DRF, and GBM models

Evaluation Metrics

In order to evaluate the model performance of the adopted machine learning approach, we scrutinize two aspects of the predictive strength. Firstly, concerning point predictions, the assessment of the predictive quality for regression models can rely on a range of metrics that endeavor to capture the prediction error in relation to the underlying ground truth value. In this study, we employ RMSE and Mean Absolute Error (MAE) as the chosen metrics. The RMSE is derived by computing the squared differences between the predicted and actual values for the given dataset, followed by calculating the square root of the average:

$$\begin{aligned} RMSE=\sqrt{\frac{1}{N}\sum _{i=1}^{N}\left( y_{i}-\hat{y_{i}}\right) ^{2}} \end{aligned}$$
(14)

As posited by [38], the RMSE is considerably influenced by the distribution of error magnitudes, which may render it misleading for assessing model quality in specific instances. Consequently, the authors propose MAE as a more comprehensible metric. The MAE is calculated by determining the average of the absolute differences between the predicted and actual values for the given dataset:

$$\begin{aligned} MAE = \frac{1}{N} \sum _{i=1}^{N} {{|y_{i}-\hat{y_{i}} |}} \end{aligned}$$
(15)

Following the suggestion in [39], where the RMSE was demonstrated to be a more reliable evaluation metric in cases with Gaussian error distribution, both metrics will be employed in this study to capture model performance.

Scheduling Scenario Settings

A scheduling scenario is constructed, providing a default schedule as a starting point and a baseline for further analysis. For the construction of the scenario, operations from jobs pertaining to the production of five different articles are sampled and the processing times of each operation are predicted by the fitted model. Afterward, the schedule is optimized using the adopted MILP approach, poses as a starting point for finding counterfactual explanations, and is leveraged to demonstrate the proposed counterfactual search approach.

In particular, the following five jobs are processed: Job 111679 produces ten pieces of the article “Torispherical Head DIN 28011” with activities performed in succession on the following machine types Dishing Press, Flanging, Edge Deburring, Flanging, Edge Deburring. Job 112996 produces six “Torispherical Head DIN 28011” articles but under different material specifications, using Dishing Press, Flanging, and Edge Deburring machines. Job 114735 produces four “Ellipsoidal Head DIN 28013” articles, utilizing the machine for Dishing Press, Flanging and Edge Deburring, and Cleanup. Job 116591 produces one “Dished Bottom,” using the machine for Welding, Seam Grinding, Dishing Press, Flanging and Edge Deburring and Cleanup. Lastly, job 117583 produces one “Normal Bottom,” employing 3D-Profiling, Welding, Seam Grinding, Dishing Press, Flanging, and Edge Deburring. For Dishing Press operations can be executed on up to four machines, Edge Deburring provides three machines and Flanging offers two machine options. The allocation of machines and workers as it was found in the underlying operations log was used as the default configuration, forwarding the construction of a viable schedule to the MILP component of the proposed approach.

Counterfactual Explanations Generation and Evaluation

Counterfactual explanations offer insight by suggesting the smallest alterations to input variables that would have led to a different prediction by the employed model, thus furnishing a direct and intuitive understanding of model behavior. Various methodologies for generating counterfactual explanations exist [4], each characterized by its unique approach to navigating the challenges inherent in explaining complex models. Techniques such as random search, nearest neighbor counterfactual explanations [10] (NNCE), and NSGA-II exemplify the diversity of strategies employed to construct meaningful and actionable counterfactuals. These methodologies differ significantly in terms of their underlying principles and the trade-offs they present between explanation simplicity, computational efficiency, and realism of the proposed counterfactuals. As such, the selection of a counterfactual generation technique must be informed by the specific context of application, taking into account the nuances of the model being explained and the objectives of the involved stakeholders.

For the evaluation of generated counterfactual explanations, the previously described scheduling scenario is leveraged as a baseline for the fitness function of the NSGA-II approach (see the “Multi-objective Counterfactual Explanations’’ section) and corresponding counterfactual explanations. In particular, a comparative analysis is performed against a random search and a NNCE approach, examining the quality and quantity of the generated explanations under the aspects of improvements in schedule makespan, operation processing time, distance to the original operation, and schedule, number of identified counterfactual explanations as well as the computation times.

NSGA-II

NSGA-II is designed for optimizing multiple conflicting objectives simultaneously. It distinguishes itself through an efficient sorting mechanism that categorizes solutions into different fronts based on their dominance relationships, prioritizing solutions that are non-dominated and allowing for a high degree of diversity. In our implementation, NSGA-II commences with an initial population of potential counterfactuals generated through informed random mutations, adhering to operational constraints. These constraints include the suitability of machines for specific operations and the capability of resources to handle these machines. The algorithm then iteratively evolves this population over ten generations, applying selection, crossover, and mutation of genetic operators, guided by the objectives delineated in Definition 6. In particular, the crossover probability was set at 80% with a crossover distribution index set to 5, while mutation probability was set to 20% with a mutation distribution index set to 10. Through successive generations, NSGA-II refines the population, ultimately converging toward a set of optimal counterfactual schedules that satisfy the predetermined objectives.

Random Search

Random search offers a straightforward and intuitive approach by randomly generating potential counterfactuals and evaluating their viability in subsequent steps, emphasizing simplicity and ease of implementation. However, its effectiveness is contingent on the search space size and may require extensive computational resources for complex models. The random search was selected as a baseline approach toward generating counterfactual explanations and implemented as follows: Operations from the default schedule are selected randomly and their manipulatable features are iterated. As with the NSGA-II implementation, these manipulations were restricted to machines viable for the operation type as well as resources being able to operate said machines. The employed ML model then scored the altered operations, estimating their processing times. In the next step, for each altered operation a corresponding alternative schedule was optimized and evaluated with regard to the objectives determined in Definition 6.

NNCE

NNCE identifies counterfactuals by incorporating a distance measure during the generation of counterfactuals, thus locating the closest instances that would result in a different decision. This method benefits from its reliance on actual data points, ensuring realistic and achievable counterfactuals. Nonetheless, its efficacy is limited by the representativeness of the dataset as well as the restrictions imposed by the utilized proximity measurement. Thus, finding counterfactual explanations for the optimization across multiple objectives depends on the concrete implementation of NNCE. NNCE was selected as an advanced approach toward generating counterfactual explanations and implemented similar to random search with the following specifics: Due to the manipulatable features, in particular machine and resource, both being categorical, we employed the Gower distance as a proximity measurement, allowing any alternative feature value to be picked with equal probability. This proximity measurement yields greater distances with an increasing number of altered variable values. To ensure only the nearest neighbors are selected, we restricted variable value iterations to only one of those variables. After optimizing and evaluating the alternative schedules, the nearest neighbor counterfactuals are ranked based on the defined objectives (see Definition 6).

We further examined various levels of search depth based on the number of explored parameter combinations and reiterated our tests ten times to assess the robustness of the performed approaches. In order to capture significant results, pairwise Welch’s t-test [40, 41] are performed as well as effect size estimations via Hedges [42] and a Bayesian analysis [43] across various search depths with regards to improvements in schedule makespan and operation processing time as well as the number of identified counterfactual explanations. Finally, we evaluate and compare the properties of the generated counterfactuals in terms of validity, sparsity, plausibility, actionability, proximity, and diversity [4, 11].

Results

This section presents the results yielded by the proposed methodology and described approaches. First, the results of the hyperparameter optimization are presented, including a comparative analysis of the model performance based on the introduced metrics. Second, an exemplary scheduling scenario is illustrated as a basis for the ensuing evaluation and detailed comparison of the proposed approaches for generating counterfactual explanations. The section concludes with a comparative analysis of the random search, NNCE, and the NSGA-II approach for finding applicable counterfactual explanations.

Hyperparameter Optimization and Model Evaluation

Each of the GLM, DRF, and GBM models underwent hyperparameter optimization as described in the “Machine Learning Settings’’ section. The optimized settings, as well as the MAE and RMSE metrics across 10-fold cross-validation via training data and a final evaluation via test data, are documented in Table 4. The GLM, with the alpha-value of 0.5 and lambda-value of 10\(^{-2}\), demonstrated a moderate performance with an MAE of 53.9±0.61 and an RMSE of 118.9 ± 13.2 during 10-fold cross-validation, and an MAE of 52.7 and RMSE of 112.5 on the test dataset. The DRF model, configured with a maximum tree depth of 15 (max_depth = 15), 8 selected columns at each level (mtries = 8), 100 trees (ntrees = 100), and a row sample rate of 1 (sample_rate = 1), outperformed the GLM model with a remarkably lower MAE of 41.5 ± 0.61 and RMSE of 102.5 ± 15.0 during cross-validation, with similar results on the test data (MAE = 41.6, RMSE = 98.34). The GBM model fitted with a column sample rate of 0.2 (col_sample_rate = 0.2), a learning rate of 0.01 (learn_rate = 0.01), a maximum tree depth of 15 (max_depth = 15), with 2000 trees (ntrees = 2000), and row sample rate of 1 (sample_rate = 1), surpassed other models on evaluations with training and test data. In particular, the GBM achieved the lowest MAE (39.97 ± 0.3) and RMSE (98.0 ± 15.3) during cross-validation as well as the final evaluation with the test data (MAE = 39.35, RMSE 93.3). In the context of the evaluation via test data, the GBM outperformed the GLM by a noteworthy margin, a reduction of \(\sim \) 25% for MAE and \(\sim \) 16% for RMSE, the differences between the GBM and DRF models are comparatively minor, exhibiting differences slightly above 5% for both MAE and RMSE. With the fitted GBM models outperforming any other model type with regard to predictive strength, the final model was trained with the identified hyperparameter settings. Corresponding hyperparameter settings and evaluation metrics are documented in Table 4.

Table 4 Results of the hyperparameter optimization for GLM, DRF, and GBM models
Fig. 3
figure 3

Gantt chart depicting the default production schedule for the exemplary scenario

Default Schedule with MILP

To evaluate the proposed approach, an exemplary scenario was constructed by sampling five jobs pertaining to the production of five different articles. An optimized schedule was calculated by utilizing the predictions of the adopted ML model for the estimated processing times of each involved operation (see the “Scheduling Scenario Settings’’ section).

Figure 3 is a Gantt chart depicting the default schedule for the production process of five different vessel bases and showcasing the operational timeline for the production of these metal components. The chart is organized by lanes, each corresponding to a specific machine, with operations scheduled along a timeline, measuring the makespan of the schedule. The operations are denoted by horizontal bars within each lane, and their lengths represent the processing time of the respective operations. The colors of these bars correspond to distinct jobs, identified by unique numerical codes, with each comprising a sequence of operations necessary to complete the manufacturing of the respective component. Vertical lines accentuate the completion time of jobs and are color-coded accordingly. Each of the 24 operations is accompanied by a label above it, depicting the identifier of the responsible worker, with 21 unique workers being assigned. The operations are distributed across eleven machines pertaining to seven machine types, 3D-Profiling, Cleaning, Dishing Press, Edge Deburring, Flanging, Seam Grinding and Welding, offering a selection of machines for Dishing Press, Edge Deburring, Flanging with up to three options. The depicted jobs exhibit a diverse range of start, processing, and end times, spanning over a total of 1003 min (schedule makespan), elucidating the complexity of the scheduling environment, where various jobs are interlaced through shared resources and sequential dependencies.

Counterfactual Explanations

The analysis of the results pertaining to the generated counterfactual explanations in the given scenario is divided into two distinct research questions:

  • RQ1. How does the proposed multi-objective search for counterfactual explanations compare against other approaches in producing relevant counterfactual explanations?

  • RQ2. How does the proposed approach contribute to a comprehensive exploration of the generated counterfactual explanations?

To address these questions, first, the JSSP scenario outlined in experiment settings is constructed with a default schedule as a baseline for further analysis. Next, various possible explanations for the scheduling process are generated using random search, NNCE, and NSGA-II, and compared afterward. Lastly, we examine solutions regarding schedule makespan in detail by exploring the corresponding alternative production schedules yielded by two different counterfactual explanations of the proposed approach.

Comparative Evaluation Regarding Objectives

To answer the RQ1, counterfactual explanations were generated by applying the proposed multi-objective NSGA-II approach, and a comparative analysis was performed against a random search and NNCE strategy. In particular, each methodology explored the alteration of the machine and resource parameters of one operation at a time and was tested across varying search depths, exploring 50, 100, and 200 parameter combinations. For the random search and NNCE, the total space of feasible parameter combinations was explored accordingly, whereas the NSGA-II approach created a population of 5, 10, or 20 parameter combinations with each of the 10 produced generations, kee** the top performers at the first Pareto front. In order to evaluate the robustness of the applied approaches, the described process was repeated ten times and results were aggregated and analyzed accordingly.

Table 5 Comparative analysis of random search, NNCE, and NSGA-II for counterfactual explanation search. The schedule to be optimized consisted of five jobs with a total amount of 24 operations. Each approach was executed ten times in order to mitigate outliers, with the table displaying averaged results as well as corresponding standard deviations

Table 5 depicts the averaged results and corresponding standard deviations with regards to the improvements of schedule makespan, operation processing time, the distances between the original and altered schedules or operations, the total number of identified counterfactual explanations, and the number of explanations without makespan improvements among the identified counterfactual explanations of each iteration. For the evaluation of the computational cost, the execution times for each approach were measured as well and are examined in the “Comparative Evaluation Regarding Computation Times’’ section.

Makespan Improvement

NNCE exhibits the lowest average makespan improvements, ranging from 25.8 to 26.3 min, followed by random search with values ranging from 41.3 to 47.3 min. The performance does not improve with more parameter combinations, indicating a potential inefficiency in exploring the solution space. Standard deviations decrease with increasing number of explored parameter combinations, suggesting an increasing stability of random search and NNCE with increasing search depth. Considering the nature of the NNCE approach, the limited improvements to the makespan can be attributed to the limitations regarding the explored parameter combinations, thus neglecting potential candidates for counterfactual explanations that differ in more than one feature from its original. NSGA-II demonstrates superior performance, particularly at 200 parameter combinations, achieving a peak average improvement of 71.8 min. The standard deviation is notably higher, ranging from 23.3 to 29.7 min, indicating a comparatively higher variability of identified counterfactuals.

Operation Improvement

Regarding operation processing times, random search exhibits improvements ranging from 26.8 to 27.7 min on average, with a maximum standard deviation of 3.6 min, and is slightly outperformed by NNCE, with improvements ranging between 29.3 and 30 min. NSGA-II yields the highest and lowest improvements to the operation processing time, with values of 42.3 min for 200 parameter combinations and 25.3 min for 50 parameter combinations. The stability of NSGA-II highly depends on the number of parameter combinations, with standard deviations ranging from 12.1 min for 50 parameter combinations to 5.9 min for 200 parameter combinations. The difference in stability between makespan and processing time improvements of NSGA-II can be attributed to the random sampling of the starting population as well as the population. With the increasing number of explored parameter combinations, the tournament structure of NSGA-II contributes to retaining previously identified solutions from one generation to the next, thus yielding the highest improvements of the three algorithms.

Schedule Distance

The comparative analysis of the schedule distances reveals a notable variation across different approaches. Random search and NSGA-II exhibit relatively higher schedule distances, with values ranging from 7.5 to 9.0. This indicates a significant deviation between the original and altered schedules generated through these methodologies. The accompanying standard deviations, ranging from 0.4 to 2.5, suggest variability in the consistency of these approaches, with random search displaying marginally better figures, implying a slightly more consistent performance compared to NSGA-II. The NNCE approach demonstrates superior performance in minimizing schedule distances, with the lowest average distance being 6.23 and the highest at 6.37, coupled with a more constrained standard deviation span of 0.587 to 1.22.

Operation Distance

In terms of operation distance, there’s a discernible differentiation in performance across the methodologies. Random search and NSGA-II again present comparable figures, with operation distances in the proximity of 0.77. However, it is noteworthy that NNCE outperforms both random search and NSGA-II in this metric, consistently achieving operation distances as low as 0.5 without any standard deviation. This performance of NNCE can be attributed to its strategic limitation to single variable value iterations, which inherently restricts the scope of alteration and thus maintains a closer alignment with the original operation parameters. This precision enables NNCE to not only generate counterfactuals that are easier to achieve but also ensure a higher fidelity to the original operational setup. NSGA-II, while showing a slight improvement over random search, still falls short of the benchmark set by NNCE, suggesting that despite NSGA-II’s advancements in exploring multiple parameter combinations, it does not necessarily translate to a proportionate enhancement in maintaining operation proximity. The analysis elucidates the distinct capabilities and limitations of each approach in terms of maintaining schedule and operation proximity in the generation of counterfactual explanations. NNCE’s methodology emerges as particularly effective in achieving closer approximations to the original configurations, thereby offering a compelling option for generating counterfactuals with minimized deviations.

Fig. 4
figure 4

Significance test for makespan improvements of altered schedules for identified counterfactual explanations, comparing random search, NNCE, and NSGA-II

Number of Identified Explanations

The analysis of the number of identified explanations demonstrates a significant variation across the different algorithms, with random search and NNCE notably outperforming NSGA-II. Specifically, at the parameter combination of 200, Random Search and NNCE identified an average of 95.1 and 98.6 explanations, respectively. These figures significantly exceed the performance of NSGA-II, which identified far fewer explanations, respectively averaging 4.3, 6.7, and 10.8 for 50, 100, and 200 explored parameter combinations. The standard deviation associated with these findings, ranging from 0.5 to 9.1, indicates variability in the consistency of explanation identification across the algorithms, with NNCE demonstrating a notably tighter range of deviation, thus suggesting a higher consistency in the amount of identified counterfactual explanations. NSGA-II’s lower performance in this regard is due to its inherent population size limitations, which constrain its ability to identify a broader array of explanations within the same parameter combination scope. As an example, with 200 parameter combinations being explored over ten populations, the population size limit constraints NSGA-II to a maximum of 20 unique counterfactual explanations.

Fig. 5
figure 5

Significance test for processing time improvements of operations for identified counterfactual explanations, comparing random search, NNCE, and NSGA-II

Number of Excluded Explanations

Approximately 20 to 25% of the explanations identified by the examined algorithms did not result in an improvement of the schedule makespan, indicating a noteworthy proportion of explanations that, while potentially optimizing processing times for operations, do not contribute to the overall efficiency of the schedule. This finding highlights the complexity of job scheduling challenges, where not all operational improvements necessarily lead to better schedule outcomes. The number of excluded explanations, ranging from an average of 5.1 to 21.8 with a standard deviation of 0.5 to 4.2, suggests that a significant fraction of operational adjustments, despite being identified as potential improvements, do not directly translate into enhanced schedule performance. This discrepancy could provide valuable insights into the granularity of job scheduling processes, particularly in understanding the minimum necessary operational improvements that could lead to meaningful schedule optimizations. It reflects the inherent challenge in aligning operational modifications with overarching scheduling objectives, emphasizing the need for a careful consideration of both operational and schedule-level impacts in the pursuit of optimization strategies.

For further performance evaluation, pairwise comparisons of the three search approaches were performed to identify significant differences. In particular, Welch’s t-test was performed with a p-value threshold of 0.05 as well as a Bayesian analysis, contrasting different search approaches across explored parameter combinations for improvements in makespan, operation processing time, the number of identified explanations and computation time. Figures 4, 5, and 6, and respectively illustrate the results of the significance tests, each divided into three subplots (A), (B), and (C) for the parameter combinations 50, 100, and 200, respectively. Figure 4A) visualizes the makespan improvement for each of the ten iterations of the random search, NNCE and NSGA-II via boxplots and violin plots. The textual descriptions above the plot detail the test method (Welch), the degrees of freedom for the t-statistic (2, 16.2), the value of the statistic (21.6), the p-value (0.000027), the effect size type (partial omega-squared) and the estimate (0.7), the confidence interval ([0.4, 1.0]) and the total number of observations (30), while the descriptions below the plot document the Bayes analysis with the natural logarithm of the Bayes Factor BF\(_{01}\) (\(-\)6.2), the posterior type and its estimation (0.5) as well as the credible interval ([0.2, 0.6]), the prior type (Cauchy) and value (0.7). The Figs. 5 and 6 illustrate results in an analogous manner. The results of the significance analysis further confirm previous observations: NSGA-II outperforms random search and NNCE by a significant margin with regard to average improvements in makespan and operation processing time, particularly as the number of explored parameter combinations increases. Differences between random search and NNCE are significant for improvements in the schedule makespan and can be attributed, as previously stated, to the limitations imposed by the nearest neighbor search of NNCE. For the number of identified explanations, random search and NNCE significantly outperform NSGA-II across any number of explored parameter combinations, due to the restricted population size of the NSGA-II approach.

Fig. 6
figure 6

Significance test for the number of identified counterfactual explanations, comparing random search, NNCE, and NSGA-II

In each of the three test categories above, the disparities between the examined approaches became more pronounced as more parameter combinations were investigated. These results demonstrate that NSGA-II exhibits superior performance with regard to identifying counterfactual explanations with improved schedule makespans and operation processing times, particularly at higher levels of explored parameter combinations. Its effectiveness in optimizing complex schedules, however, is accompanied by higher variability, as evidenced by the standard deviation values. This suggests that while NSGA-II can achieve significant improvements, its outcomes may be less predictable compared to random search or NNCE. Furthermore, random search and NNCE significantly outperformed NSGA-II with regard to the number of identified solutions, particularly at higher parameter combinations. This could be attributed to the exhaustive nature of random search in exploring the solution space. NNCE provides a slightly higher amount of identified counterfactuals than random search, which can be attributed to its search space being restricted to alternatives similar to the original, thus resulting in a minimally larger amount of potentially viable solutions. However, the number of identified solutions that fail to improve the schedule makespan suggests that random search and NNCE both exhibit diminishing returns, trading the quality of the identified counterfactuals for quantity, emphasized by NSGA-II providing the highest improvements regarding makespan and operation processing times. Subsequent statistical analyses corroborated the superior performance of NSGA-II in refining makespan and operation processing times as the number of explored parameter combinations increased. Our findings elucidate the nuanced strengths and limitations of each approach, presenting NSGA-II as a notably effective strategy for optimizing complex schedules through counterfactual explanations, albeit with an inherent variability that warrants consideration.

Comparative Evaluation Regarding Computation Times

Evaluating computational costs is essential for optimizing algorithm efficiency and ensuring their practical applicability. It allows for the identification of resource-efficient algorithms and facilitates a balance between outcome quality and computational expenditure. This assessment is particularly vital in complex tasks, where the computational intensity of algorithms can significantly impact their viability and effectiveness, guiding the selection toward solutions that effectively balance accuracy, speed, and resource demands. For this purpose, the computation times of the random search, NNCE, and NSGA-II approaches were measured and evaluated, with the results being presented in Table 5 and Fig. 7. In light of the utilized h2o library (see the “Dataset Overview and Tools’’ section, the computational costs of transforming tabular data from R-data-frames to h2o-data-frames and vice versa turned out as costly and the intricacies of implementing methods from this library were deemed a disruptive factor in comparing the three approaches fairly. Hence, for the comparative evaluation of computation times, we isolated the data format transformation steps that are specific to the implemented algorithm and excluded their computation time, so that the analysis results do not hinge on the chosen ML packages and their implementation.

Fig. 7
figure 7

Significance test for the computation time, comparing random search, NNCE, and NSGA-II

Regarding computation time, all three approaches exhibit a linear increase proportional to the amount of explored parameter combinations, with random search and NNCE showing similar results slightly in favor for random search. In particular, the exploration of 50, 100, and 200 parameter combinations takes an average of 5.81 s, 11.07 s, and 21.9 s respectively for a random search, and is closely followed by NNCE. NSGA-II, however, requires more computational resources, exhibiting an approximately 10% increase in computation time across all explored parameter combinations. This increase in computational cost can be attributed to the iterative generation and evaluation of counterfactual explanations generated by NSGA-II as opposed to the straightforward nature of the random search and NNCE approaches. Furthermore, it should be noted that the measured computation times excluded the time necessary for any data format transformations, which are specific to the implementation of the chosen ML model. As an example, random search is able to sample 200 parameter combinations from the feature space and proceed to the scoring by the chosen ML model, whereas NSGA-II integrates the ML scoring at each generation before the tournament step. The significance tests presented in Fig. 7 further confirm that the computational costs associated with random search and NNCE vary noticeably from NSGA-II, showing significant differences across all search depths. NSGA-II, with its sophisticated evolutionary operations, incurs higher computational demands due to the complexity of its non-dominated sorting and diversity preservation mechanisms. Random Search, while exhaustive, has relatively lower computational requirements, stemming from its simplicity. NNCE occupies a middle ground, as its nearest neighbor search involves moderate computational efforts, balanced by its targeted exploration. Thus, the selection of an algorithm must consider the trade-off between computational efficiency and the quality of optimization outcomes, highlighting the necessity of aligning computational resources with optimization objectives.

Comparative Evaluation Regarding Counterfactual Desiderata

This section delineates a comparative analysis among the random search, NNCE, and NSGA-II methodologies, in the context of generating counterfactual explanations that adhere to the specific desiderata of validity, sparsity, plausibility, actionability, proximity, and diversity (see the “Counterfactual Explanations Generation and Evaluation” section).

Random Search

The random search method possesses the capability to generate valid counterfactuals due to its exhaustive exploration of the search space. Our implementation restricted variable alterations to machines and corresponding resources viable for the operation, however, this approach does not inherently guarantee the validity or actionability of the counterfactuals produced, as it lacks a targeted mechanism. This may result in the generation of counterfactuals that suggest changes beyond the control or capacity of the decision-maker, thereby reducing their practical value. Sparsity is not a primary consideration within the random search approach and may yield counterfactuals that involve numerous changes from the original instance, potentially complicating their interpretation and diminishing their usefulness. Given its disregard for domain-specific knowledge and constraints, this approach runs the risk of generating counterfactuals that are implausible within the context of the application domain and can limit the applicability of the explanations in real-world scenarios. Disregarding the limitations imposed by our implementation, this method conventionally lacks a systematic approach to ensure proximity between the original instance and the generated counterfactuals. As a result, the counterfactuals may exhibit significant deviations from the original instance, undermining their applicability and transferability to other application scenarios. While this approach is capable of producing a diverse array of counterfactuals due to its non-deterministic nature, it lacks a structured methodology to ensure that this diversity translates into a meaningful variety of actionable and plausible explanations.

NNCE

The NNCE apporach is explicitly designed to increase the validity of counterfactual explanations by focusing on the identification of nearest neighbors that lead to a change in the decision outcome. By prioritizing the identification of the closest instances that result in a different decision, NNCE naturally encourages the generation of sparse counterfactuals in the proximity of the original instance. This focus on minimal changes supports the creation of explanations that are simpler and more interpretable. The reliance on nearest neighbors further increases the likelihood of generating plausible counterfactuals, enhancing their credibility, and utility. Similar to our implementation of random search, variable alterations were limited to machines and resources able to operate them, offering actionable changes. However, the leveraged nearest-neighbor methodology does not ensure actionability by default. Although NNCE is capable of generating counterfactuals that are close to the original decision boundary, its focus on nearest changes may limit the diversity of the counterfactuals produced. This concentration may overlook other valid and potentially insightful counterfactuals, as indicated with regard to the analysis of average makespan improvements in the “Comparative Evaluation Regarding Objectives” section.

NSGA-II

NSGA-II is able to effectively generate valid counterfactuals by optimizing across multiple objectives, ensuring that the counterfactuals produced are capable of changing the model’s predictions in a meaningful way. The configuration of this approach allows for the prioritization of sparsity and actionability through its fitness function, similar to the implementation of random search and NNCE, where alterations were constrained to machines suitable for the specified operation type, in addition to the resources capable of managing these machines. NSGA-II aims to maintain proximity between the original instance and the counterfactuals through its consideration of proximity as an optimization objective. This focus supports the generation of counterfactuals that represent minimal and feasible departures from the original instance. By nature of its evolutionary algorithm framework, NSGA-II excels in supporting diversity among the generated counterfactual explanations. This algorithm produces a range of Pareto optimal solutions, offering a broad spectrum of alternative explanations and insights into the decision-making process. However, this approach lacks the implementation of domain-specific knowledge into its optimization process, thus necessitating a manual examination of the plausibility of the generated counterfactuals.

Comprehensive Exploration of Multi-objective Counterfactuals

To address RQ2 regarding the quality and applicability of identified counterfactuals, two exemplary counterfactual explanations will be examined. An excerpt of all identified alternative operation settings identified via NSGA-II for 200 explored parameter combinations is depicted in Fig. 8, with Fig. 8A) presenting the found counterfactuals in a tabular manner, providing information about relevant variable values as well as operation and makespan improvements.

Fig. 8
figure 8

Exemplary extract of A the list of identified counterfactual explanations using NSGA-II with a search depth of 200 parameter combinations, showing the top 10 counterfactual explanations in terms of schedule and operation distance as well as improvements in schedule makespan and processing times, B the original instance corresponding to the counterfactual that was selected for further examination, and C a verbalization of the selected counterfactual explanation. The row of the selected counterfactual and its corresponding original are highlighted yellow, indicating alterations to the machine and resource with orange color. Processing times and makespan as well as their improvements are highlighted using blue color-coding

Figure 8B) depicts the original variable values of the altered operation for context, with Fig. 8C) providing a textual description of the relevant alterations and predicted improvements to the operation and job schedule. Figure 9 depicts the Pareto front constructed by NSGA-II with a 3D-visualization of a scatter plot along the axis of operation (“Operation_Improvement”), makespan improvement (“Schedule_Improvement”) and the distance between the original and altered operation (“Operation_Distance”), with the distance between the original and altered schedule being depicted through a gradient color-coding. Each point represents a counterfactual explanation and its position within the plot is determined by its quality regarding the objectives of the counterfactual search (see Definition 6).

The first example demonstrates an improvement in the processing time for the Flanging activity performed on machine “C” using worker with the identifier “137” (see Fig. 3) of job 11679 by allocating it to machine “B” and worker “775” (see Fig. 10). Although the improvement for the operation is 59 min, significantly reducing the predicted processing time from 227 to 168 min, the schedule cannot be optimized in a way that reduces the total makespan. The completion times for jobs 116,591, 112,996, and 117,583 were moved forward, with the completion time of job 11679 being postponed. Although slight changes are observed in the schedules, the sequencing of operations remains largely unchanged.

The second example demonstrates the maximum possible improvement with the change of a single operation while disregarding substantial changes in the scheduling of operations, changing the machine for the first operation of job 11679 from Dishing Press “A” to “D” and the responsible worker from “405” to “125” (see Fig. 11). This change improved the predicted duration of the operation by 29 min (from 203 min to 174 min) as well as the schedule makespan by 192 min, moving the completion times of all jobs forward as a consequence. When compared with the first example, the reduction in operation processing time is comparably small, however, the position of the operation within the production environment allows for more drastic changes to the process schedule and, consequently, radical improvements with regards to the makespan.

Fig. 9
figure 9

Snippet of the 3D-visualization of the multi-dimensional Pareto front regarding operation and schedule distances as well as corresponding improvements, yielded by NSGA-II with a search depth of 200 parameter combinations, depicting the scores of identified counterfactual explanations in the form of points, positioned and color-coded accordingly

Fig. 10
figure 10

Alternative schedule without makespan improvement, highlighting Flanging performed on the alternative machine “B” by worker “775” after alteration. Although the processing time of the highlighted operation decreased by 59 min and several jobs terminated earlier, the schedule makespan could not be improved

These examples highlight the complexities of multi-objective optimization in the scheduling problem and stress the importance of comprehensive exploration when implementing optimization strategies. The contrast between the two scenarios underscores that while targeted improvements in individual operations can be beneficial, their overall impact on the production schedule can vary significantly. This necessitates a more thorough and explorative approach to the optimization process.

Discussion

Our study has demonstrated a novel approach to leveraging counterfactual explanations within the context of JSSP, an intersection of AI, operations research (OR), and process mining. This research has successfully combined complex aspects of decision-making by using a multi-objective optimization framework, specifically the NSGA-II. This has shown how even small changes in operational parameters can have a big effect on the overall efficiency of a system. This integration not only amplifies the potential of AI and OR in process optimization but also enriches the field of process mining with a deeper understanding of how individual process steps influence the larger system.

In recognizing the specialized focus of our investigation on generating counterfactuals for production scheduling scenarios, we also highlight the extensive applicability of our methods across a broad array of domains. These diverse areas-ranging from supply chain optimization and healthcare planning optimization to capacity planning, energy market modeling, transport planning, and logistics optimization, as well as integrated sales and operations planning-benefit from the synergistic integration of predictive analytics and optimization algorithms. Our research contributes significantly to these fields by providing a methodological framework for generating valuable counterfactual insights, which is not only relevant but indispensable for enhancing decision-making processes. Therefore, while our work may originate from a niche perspective, its implications are far-reaching, underscoring the transferability and critical importance of our approach in various interdisciplinary contexts. This extension of our research into other domains emphasizes the universal applicability and potential impact of our findings, illustrating that, despite its specialized nature, our work serves as a foundational pillar for advancing practices in numerous other sectors where predictive modeling and optimization converge.

Fig. 11
figure 11

Alternative schedule, highlighting the activity performed on the alternative Dishing Press “C” by worker “125” after alteration. With a comparably small reduction in the processing time of the highlighted operation, the schedule makespan was improved by 192 min with several jobs terminated earlier compared to the original schedule

The relevance of our study to decision augmentation is particularly significant, offering a robust framework that enhances the capabilities of decision-makers in complex environments. By integrating counterfactual explanations with multi-objective optimization, we provide a tool that augments human decision-making with deep, AI-driven insights. This approach allows decision-makers to understand not only what decision might be optimal but also why it is so by revealing the intricate interdependencies and potential impacts of various choices. It empowers them to make informed decisions that consider a wide range of scenarios and outcomes, ultimately leading to more effective, efficient, and responsible outcomes. This augmentation is especially valuable in sectors where the complexity and volume of data exceed human capacity to analyze, making AI an essential partner in the decision-making process.

In the context of industrial analytics, our study’s approach to responsible AI is also relevant, offering a framework that aligns with the industry’s growing need for transparent, ethical, and accountable AI systems. By applying counterfactual explanations within a multi-objective optimization setting, we provide a means for industrial analytics to not only make AI-driven decisions transparent but also ensure these decisions are ethically sound and accountable. In an industrial setting, where decisions often have substantial economic and operational implications, understanding the rationale behind AI predictions and recommendations is crucial. Our approach allows for a clear exposition of how different variables and scenarios impact outcomes, making AI systems more interpretable and trustworthy to engineers, managers, and stakeholders. This transparency is vital for maintaining confidence in AI-driven industrial processes. Additionally, our methodology’s emphasis on ethical considerations in AI decision-making ensures that industrial analytics systems are designed to be fair, avoiding biases that could lead to inefficiencies or unfair practices. This aspect is particularly important in industries where AI decisions might impact workforce management, environmental compliance, or safety protocols. Lastly, by enhancing the accountability of AI systems in an industrial analytics setting, our study ensures that AI-driven decisions can be audited and improved upon. This aspect is crucial for industries to adapt and evolve their AI systems responsibly, ensuring continuous improvement and alignment with industry standards and regulations. Therefore, in the sphere of industrial analytics, our approach not only advances the technical capabilities of AI but also ensures these advancements are responsibly aligned with the broader goals and values of the industry.

Furthermore, the study emphasizes the relevance of system thinking in manufacturing and process management. By considering the manufacturing process as an interconnected system, our approach enables a comprehensive evaluation of the implications of individual changes on the entire system. This systemic perspective is crucial in identifying optimal solutions that enhance individual operations and overall process efficiency. It encourages a shift from focusing solely on isolated process improvements to a more integrated approach that considers the cumulative effect of these improvements on the entire system. In this regard, incorporating the granular computing paradigm is pivotal in sha** our analytical approach and aligning it with the composition of process mining data. This concept enables nuanced navigation and consideration of objectives and factors across different system levels when identifying counterfactual explanations, facilitating the transition from local optimizations to systematic improvements. The adaptability and structured perspective provided by granular computing are particularly congruent with the nature of process mining data, which inherently comprises multi-level, interconnected information. By employing Granular Computing, we effectively segment the objectives for identifying counterfactuals within the complex scheduling problem into more manageable and interpretable units, ensuring enhancements contribute meaningfully to the system’s overall efficiency. This methodological alignment enhances our capability to dissect and interpret process mining data in an insightful and actionable manner. Consequently, our application of granular computing fosters a framework for identifying impactful counterfactual explanations and amplifies the potential for systemic improvements within the intricate dynamics of job-shop scheduling. The paradigm’s emphasis on considering the granularity of data and system components ensures our findings are comprehensive and relevant, paving the way for holistic advancements rather than isolated improvements. Through this lens, granular computing serves as a cornerstone of our study, guiding our exploration of process mining data toward achieving significant, system-wide enhancements.

Conclusion

In conclusion, this research introduced a sophisticated method for generating counterfactual explanations in complex scheduling scenarios, specifically in the context of the JSSP. The core methodology integrated a predictive ML model with MILP for initial schedule generation, followed by the application of the NSGA-II for multi-objective optimization. The results of the study revealed the effectiveness of the NSGA-II approach in producing robust counterfactual explanations, particularly in improving schedule makespan and operation processing times. This was notably more efficient than the alternative counterfactual explanation method, especially at higher levels of parameter combinations. Furthermore, the exploration of specific counterfactual scenarios underscored the critical role of operation placement within the scheduling environment. It was observed that improvements in individual operations could have significantly different impacts on the overall schedule, depending on their interaction with other operations and their position in the process flow. Overall, the study successfully demonstrated a comprehensive framework for enhancing both individual operation efficiencies and system-wide performance in scheduling environments. The insights gained from this research not only contribute to the field of explainable AI but also provide valuable guidelines for future applications in optimization and scheduling tasks.