1 Introduction

1.1 Probabilistic digital twins

The use of digital twins has emerged as one of the major technology trends of the last few years. In essence, a digital twin (DT) is a digital representation of some physical system, including data from observations of the physical system, which can be used to perform forecasts, evaluate the consequences of potential actions, simulate possible future scenarios, and in general inform decision making without requiring interference with the physical system. From a theoretical perspective, a digital twin may be regarded as consisting of the following two components:

  • A set of assumptions regarding the physical system (e.g. about the behaviour or relationships among system components and between the system and its environment), often given in the form of a physics-based numerical simulation model.

  • A set of information, usually in the form of a set of observations, or records of the relevant actions taken within the system.

In some cases, a digital twin may be desired for a system whose attributes and behaviours are not deterministic, but stochastic. For example, the degradation and failure of physical structures or machinery are typically described as stochastic processes. A system’s performance may be impacted by weather or financial conditions, which may also be most appropriately modelled as stochastic. Sometimes the functioning of the system itself is stochastic, as in supply chains or production chains involving stochastic variation in demand and in the performance of various system components.

Even for systems or phenomena that are deterministic in principle, a model will never give a perfect rendering of reality. There will typically be uncertainty about the model’s structure and parameters (i.e. epistemic uncertainty), and if the consequences of actions can be critical, such uncertainties need to be captured and handled appropriately by the digital twin. In general, the system of interest will have both stochastic elements (aleatory uncertainty) and epistemic uncertainty.

If we want to apply digital twins to inform decisions in systems where the analysis of uncertainty and risk is important, certain properties are required:

  1. The digital twin must capture uncertainties: This could be done by using a probabilistic representation for uncertain system attributes.

  2. It should be possible to update the digital twin as new information becomes available: This could be new evidence in the form of data, or changes in the underlying assumptions about the system.

  3. For the digital twin to be informative in decision making, it should be possible to query the model sufficiently fast: This could mean making use of surrogate models or emulators, which introduces additional uncertainties.

These properties are paraphrased from [1], which provides a detailed discussion on the use of digital twins for on-line risk assessment.

A digital twin that complies with these properties is referred to as a probabilistic digital twin (PDT). Other formulations of PDTs have also been proposed in recent years. In [2] the PDT is characterized through probabilistic machine learning from a physical system’s observables, which are associated with properties such as parameters or state variables of a physics-based model. The PDT is first trained as a predictive model before it is exercised in a decision making context. A similar framework is outlined in [3], where the PDT takes the form of a dynamic decision network (an extension of a dynamic Bayesian network with decision nodes). The time evolution of the physical asset and the dynamical updating of the PDT are emphasized in this approach. Other applications of PDTs can be found in [4], also starting from a Bayesian network perspective, and in [5], which instead deals with uncertainty modelling through polynomial chaos expansion. These methods differ in their specific mathematical implementation, but the general idea is the same, with the main goal of combining physics-based and probabilistic modelling to support decision making under uncertainty.

In this paper we propose a mathematical framework for defining PDTs, starting from a measure-theoretic perspective. This is not in conflict with the current literature on PDTs, but we take a slightly more generic approach which gives us the vocabulary to properly separate aleatory and epistemic uncertainty. This lets us analyze the effect of gathering information, which is needed e.g. for optimal experimental design, and which is also important for many safety-critical applications.

With respect to the three properties of PDTs described above, we will build on the Bayesian probabilistic framework which is a natural choice to satisfy Items 1 and 2.

A numerical model of a complex physical system can often be computationally expensive, for instance if it involves the numerical solution of nontrivial partial differential equations. In a probabilistic setting this is prohibitive, as a large number of evaluations (e.g. PDE solves) is needed for tasks involving uncertainty propagation, such as prediction and inference. Applications towards real-time decision making also set natural restrictions on the runtime of such queries. This is why property 3 is important, and why probabilistic models of complex physical phenomena often involve the use of approximate alternatives, usually obtained by “fitting” a computationally cheap model to the output of a few expensive model runs. These computationally cheap approximations are often referred to as response surface models, surrogate models or emulators in the literature.

Introducing this kind of approximation for computational efficiency also means that we introduce additional epistemic uncertainty into our modelling framework. By epistemic uncertainty we mean, in short, any form of uncertainty that can be reduced by gathering more information (to be discussed further later on). In our context, this uncertainty may in principle be reduced by running the expensive numerical models instead of the cheaper approximations.

1.2 Sequential decision making

Many interesting sequential decision making problems arise because our knowledge about the system we operate on changes as we learn about the outcomes of our decisions. That is, each decision may affect the epistemic uncertainty on which the next decision will be based.

For certain applications it is also important that decisions are robust with respect to what we do not know, i.e. with respect to epistemic uncertainty. Although we will not restrict the framework presented in this paper to any specific type of sequential decision making objectives, we will mainly focus on problems related to optimal information gathering. That is, the decisions we consider are related to acquiring information (e.g. by running an experiment) in order to reduce the epistemic uncertainty with respect to some specified objective (e.g. estimating some quantity of interest). This is referred to as optimal experimental design in the statistics literature [6] and active learning within machine learning [7].

If we have a probabilistic model of some data generating process (for instance given by a PDT), we can simulate posterior distributions (of some quantity of interest) corresponding to what we might observe after a specific action. If we combine this with a utility function, given as a functional of the posterior distribution, we can try to optimize our decisions with respect to this utility. Methods of this sort are called Bayesian optimal designs [8]. Some examples are utility functions based on the Kullback–Leibler divergence between the prior and posterior [9, 10], mutual information [11], predictive uncertainty [6] or the cost of uncertainty [12]. For further details see e.g. the review in [8] and the references therein.
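To make the idea of a simulation-based utility concrete, the following sketch (our own toy construction, not taken from the references above) estimates the expected Kullback–Leibler divergence between posterior and prior for a conjugate Beta–Bernoulli model, and uses it to compare two candidate experiments. All names and numbers are illustrative.

```python
# Illustrative sketch (our toy construction): myopic Bayesian design for a
# Beta-Bernoulli model, where the utility of an experiment is the expected
# Kullback-Leibler divergence between posterior and prior (expected information gain).
from scipy.special import betaln, digamma
from scipy.stats import betabinom

def kl_beta(a1, b1, a0, b0):
    """KL divergence KL( Beta(a1, b1) || Beta(a0, b0) )."""
    return (betaln(a0, b0) - betaln(a1, b1)
            + (a1 - a0) * digamma(a1)
            + (b1 - b0) * digamma(b1)
            + (a0 - a1 + b0 - b1) * digamma(a1 + b1))

def expected_information_gain(a, b, n_trials):
    """Average KL(posterior || prior) over the prior predictive distribution
    of the outcome of n_trials Bernoulli observations."""
    eig = 0.0
    for k in range(n_trials + 1):
        p_k = betabinom.pmf(k, n_trials, a, b)      # prior predictive P(k successes)
        eig += p_k * kl_beta(a + k, b + n_trials - k, a, b)
    return eig

prior_a, prior_b = 1.0, 1.0
for n in (1, 5):                                    # two candidate experiments
    print(f"n = {n}: expected information gain = {expected_information_gain(prior_a, prior_b, n):.4f}")
```

The larger experiment has a higher expected information gain; in a Bayesian optimal design the utility would be traded off against the cost of the experiment.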

In the sequential setting, it is quite common to search for a myopic strategy, which means that we look one time step into the future and only consider the immediate consequence of a decision. This is of course generally suboptimal, compared to considering all future decisions and the observations they might produce. However, for non-trivial models, finding an exact solution for the multi-step alternative is usually computationally intractable. An alternative that has received increasing attention in recent years is to make use of reinforcement learning to find an approximate solution in the multi-step case. Some examples of this approach by the use of policy gradient reinforcement learning can be found in [13,14,15]. The examples presented in this paper (Sect. 5) show a different approach using Q-learning.
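As a point of reference for the Q-learning approach mentioned above, the following is a generic tabular Q-learning sketch on a toy episodic problem of our own making; it is not the example of Sect. 5, and all parameters are arbitrary.

```python
# Generic tabular Q-learning sketch (not the paper's Sect. 5 example): the agent
# learns action values Q(s, a) for a toy episodic problem with a small discrete
# state space, by interacting with a simulator of the environment.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_episodes = 5, 2, 2000
gamma, alpha, eps = 0.95, 0.1, 0.1

def step(s, a):
    """Toy environment: action 1 moves right with prob. 0.8, action 0 stays.
    Reaching the last state ends the episode with reward 1."""
    s_next = min(s + 1, n_states - 1) if (a == 1 and rng.random() < 0.8) else s
    done = s_next == n_states - 1
    return s_next, (1.0 if done else -0.01), done

Q = np.zeros((n_states, n_actions))
for _ in range(n_episodes):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])  # temporal-difference update
        s = s_next

print(np.argmax(Q, axis=1))  # greedy policy per state
```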

An example that we will consider in this paper is the problem of optimal experimental design for structural reliability analysis, where information is acquired sequentially in order to evaluate the reliability of a physical operation; this is a relevant application of a PDT. The quantity of interest here is the probability of structural failure (e.g. structural collapse due to loads caused by wind and waves). Assume that to estimate this probability, a large number of physics-based simulations (e.g. Finite Element Analysis) related to the relevant failure mechanism (e.g. fracture) is needed. Usually, only a small number of such simulations can be performed in practice. The experimental design problem is then to decide which experiments (here simulation runs) to perform in order to build a surrogate model that can be used to estimate the failure probability with a sufficient level of confidence. This is a problem that has received considerable attention (see e.g. [16,17,18,19,20,21,22]). These methods all make use of a myopic (one-step lookahead) criterion to determine the “optimal” experiment, as a multi-step or full dynamic programming formulation of the optimization problem becomes numerically infeasible. In [16] the authors consider the case where there are different types of experiments to choose from. Here, the myopic assumption can still be justified, but if the different types of experiments are associated with different costs, it can be difficult to apply in practice (e.g. if a feasible solution requires expensive experiments with delayed reward). We will revisit this example in this paper, but with a method that looks multiple steps ahead.

1.3 Contribution of this paper

In this paper we will review the mathematical framework of sequential decision making, and connect this to the definition of a PDT. Traditionally, there are two main solution strategies for solving discrete time sequential decision making problems: maximum principles and dynamic programming. We review these two solution methods, and conclude that the PDT framework is well suited for a dynamic programming approach. However, dynamic programming suffers from the curse of dimensionality, i.e. the number of possible sequences of decisions and state realizations grows exponentially with the size of the state space. In fact, when doing numerical backward induction in dynamic programming, the objective function must be computed for each combination of state and decision values. This makes the method too computationally demanding to be applicable in practice for problems where the state space is large; see Agrell and Dahl [16] for a discussion of this. Hence, we are typically not able to solve a PDT sequential decision making problem in practice directly via dynamic programming.

As a generic solution to the problem of optimal sequential decision making we instead propose an alternative based on reinforcement learning. This means that when we consider the problem of finding an optimal decision policy, instead of truncating the theoretical optimal solution by e.g. looking only one step ahead, we try to approximate the optimal policy. This approximation can be done by using e.g. a neural network. Here we will frame the sequential decision making setup as a Markov decision process (MDP), in general as a partially observable MDP (POMDP), where a state is represented by the information available at any given time. This kind of state specification is often referred to as the information state-space. As a generic approach to deep reinforcement learning using PDTs, we propose an approach using neural networks that operate on the information state-space directly.
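To illustrate what “operating on the information state-space directly” can look like, the following is a minimal sketch of the deep sets idea, using plain numpy with randomly initialized (untrained) weights; dimensions and names are hypothetical.

```python
# A minimal sketch (our illustration, with hypothetical dimensions) of the deep sets
# idea: the information state is a set of observations {(d_i, o_i)}, and a
# permutation-invariant embedding is obtained by sum-pooling a per-element
# encoder before a final mapping. Plain numpy stands in for a trainable model.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, d_out = 3, 16, 8          # element dim, latent dim, embedding dim
W1 = rng.normal(size=(d_in, d_hidden))    # encoder weights (would be learned)
W2 = rng.normal(size=(d_hidden, d_out))   # pooled-representation weights

def relu(x):
    return np.maximum(x, 0.0)

def embed_information_state(obs_set):
    """obs_set: array of shape (n_observations, d_in); n may vary between states."""
    if len(obs_set) == 0:
        pooled = np.zeros(d_hidden)              # empty information set
    else:
        pooled = relu(obs_set @ W1).sum(axis=0)  # permutation-invariant sum pooling
    return relu(pooled @ W2)                     # fixed-size embedding; could feed a downstream Q-network

# The embedding is the same regardless of the order of the observations:
obs = rng.normal(size=(4, d_in))
assert np.allclose(embed_information_state(obs), embed_information_state(obs[::-1]))
```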

Main contributions. In this paper we will:

(i) Propose a mathematical framework for modelling epistemic uncertainty based on measure theory.

(ii) Present a mathematical definition of the probabilistic digital twin (PDT). This is a mathematical framework for modelling physical systems with aleatory and epistemic uncertainty.

(iii) Introduce the problem of sequential decision making, and illustrate how this problem can be solved (at least in theory) via maximum principle methods or the dynamic programming principle.

(iv) Discuss the curse of dimensionality for these solution methods, and illustrate how the sequential decision making problem in the PDT can be viewed as a partially observable Markov decision process.

(v) Explain how reinforcement learning (RL) can be applied to find approximate optimal strategies for sequential decision making in the PDT, and propose a generic approach using a deep sets architecture that enables RL directly on the information state-space. We end with a numerical example to illustrate this approach.

The paper is structured as follows: In Sect. 2 we introduce epistemic uncertainty and suggest modelling this via \(\sigma\)-algebras. In Sect. 3, we present the mathematical framework, as well as a formal definition, of a probabilistic digital twin (PDT), and discuss how such PDTs are used in practice. Then, in Sect. 4, we introduce the problem of stochastic sequential decision making. We discuss the traditional solution approaches, in particular dynamic programming, which is theoretically a suitable approach for decision problems that can be modelled using a PDT. However, due to the curse of dimensionality, using dynamic programming directly is typically not tractable. We therefore turn to reinforcement learning using function approximation as a practical alternative. In Sect. 5, we show how an approximate optimal strategy can be achieved using deep reinforcement learning, and we illustrate the approach with a numerical example. Finally, in Sect. 6 we conclude and sketch some future work in this direction.

2 A measure-theoretic treatment of epistemic uncertainty

In this section, we review the concepts of epistemic and aleatory uncertainty, and introduce a measure-theoretic framework for modelling epistemic uncertainty.

2.1 Motivation

In uncertainty quantification (UQ), it is common to consider two different kinds of uncertainty: aleatory (stochastic) and epistemic (knowledge-based) uncertainty. We say that uncertainty is epistemic if we foresee the possibility of reducing it through gathering more or better information. For instance, uncertainty related to a parameter that has a fixed but unknown value is considered epistemic. Aleatory uncertainty, on the other hand, is the uncertainty which cannot (from the modeller’s perspective) be affected by gathering information alone. Note that the characterization of aleatory and epistemic uncertainty depends on the modelling context. For instance, the result of a coin flip may be viewed as epistemic, if we imagine a physics-based model that could predict the outcome exactly (given all initial conditions, etc.). However, under most circumstances it is most natural to view a coin flip as aleatory, or as containing both aleatory and epistemic uncertainty (e.g. if the bias of the coin is unknown). Der Kiureghian and Ditlevsen [23] provide a detailed discussion of the differences between aleatory and epistemic uncertainty.

In this paper, we have two main reasons for distinguishing between epistemic and aleatory uncertainty. First, we would like to make decisions that are robust with respect to epistemic uncertainty. Secondly, we are interested in studying the effect of gathering information. Modelling epistemic uncertainty is a natural way of doing this.

In the UQ literature, aleatory uncertainty is typically modelled via probability theory. However, epistemic uncertainty is represented in many different ways. For instance, Helton et al. [24] consider four different ways of modelling epistemic uncertainty: Interval analysis, possibility theory, evidence theory (Dempster-Shafer theory) and probability theory.

In this paper we take a measure-theoretic approach. This provides a framework that is relatively flexible with respect to the types of assumptions that underlie the epistemic uncertainty. As a motivating example, consider the following typical setup used in statistics:

Example 2.1

(A parametric model)

Let \({{\textbf {X}}}= (Y, \theta )\) where Y is a random variable representing some stochastic phenomenon, and assume Y is modelled using a given probability distribution, \(P(Y | \theta )\), that depends on a parameter \(\theta\) (e.g. \(Y \sim {\mathcal {N}}(\mu , \sigma )\) with \(\theta = (\mu , \sigma )\)). Assume that we do not know the value of \(\theta\), and we therefore consider \(\theta\) as a (purely) epistemic parameter. For some fixed value of \(\theta\), the random variable Y is (purely) aleatory, but in general, as the true value of \(\theta\) is not known, Y is associated with both epistemic and aleatory uncertainty.

The model \({{\textbf {X}}}\) in Example 2.1 can be decoupled into an aleatory component \(Y|\theta\) and an epistemic component \(\theta\). Any property of the aleatory uncertainty in \({{\textbf {X}}}\) is determined by \(P(Y | \theta )\), and is therefore a function of \(\theta\). For instance, the probability \(P(Y \in A | \theta )\) and the expectation \(E[f(Y) | \theta ]\), are both functions of \(\theta\). There are different ways in which we can choose to address the epistemic uncertainty in \(\theta\). We could consider intervals, for instance the minimum and maximum of \(P(Y \in A | \theta )\) over any plausible value of \(\theta\), or assign probabilities, or some other measure of belief, to the possible values \(\theta\) may take. However, in order for this to be well-defined mathematically, we need to put some requirements on A, f and \(\theta\). By using probability theory to represent the aleatory uncertainty, we implicitly assume that the set A and function f are measurable, and we will assume that the same holds for \(\theta\). We will describe in detail what is meant by measurable in Sect. 2.2 below. Essentially, this is just a necessity for defining properties such as distance, volume or probability in the space where \(\theta\) resides.

In this paper we will rely on probability theory for handling both aleatory and epistemic uncertainty. This means that, along with the measurability requirement on \(\theta\), we have the familiar setup for Bayesian inference:

Example 2.2

(A parametric model—Inference and prediction)

If \(\theta\) from Example 2.1 is a random variable with distribution \(P(\theta )\), then \({{\textbf {X}}}= (Y, \theta )\) denotes a complete probabilistic model (capturing both aleatory and epistemic uncertainty). \({{\textbf {X}}}\) is a random variable with distribution

$$\begin{aligned} P({{\textbf {X}}}) = P(Y | \theta )P(\theta ). \end{aligned}$$

Let I be some piece of information from which Bayesian inference is possible, i.e. \(P({{\textbf {X}}}| I)\) is well defined. We may then define the updated joint distribution

$$\begin{aligned} P_{\text {new}}({{\textbf {X}}}) = P(Y | \theta )P(\theta | I), \end{aligned}$$

and the updated marginal (predictive) distribution for Y becomes

$$\begin{aligned} P_{\text {new}}(Y) = \int P(Y | \theta ) dP(\theta | I). \end{aligned}$$

Note that the distribution \(P_{\text {new}}({{\textbf {X}}})\) in Example 2.2 is obtained by only updating the belief with respect to epistemic uncertainty, and that

$$\begin{aligned} P_{\text {new}}({{\textbf {X}}}) \ne P({{\textbf {X}}}| I) = P(Y|I, \theta )P(\theta |I). \end{aligned}$$

For instance, if I corresponds to an observation of Y, e.g. \(I = \{Y = y \}\), then \(P(Y|I) = \delta (y)\), the Dirac delta at y, whereas \(P(\theta | I)\) is the updated distribution for \(\theta\) having observed one realization of Y. In the following, we will refer to the kind of Bayesian updating in Example 2.2 as epistemic updating.
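As a concrete (and deliberately simple) instance of this epistemic updating, the following sketch uses a conjugate normal model with known aleatory standard deviation; the numbers are our own choice and are not tied to any particular application.

```python
# A sketch of the epistemic updating in Example 2.2 for a concrete conjugate case
# (our choice of numbers): Y ~ N(mu, sigma) with known sigma, epistemic parameter
# theta = mu with prior N(m0, s0). Observing I = {Y = y} updates P(theta) only;
# the new predictive P_new(Y) is the integral of P(Y | theta) over P(theta | I).
import numpy as np

sigma = 1.0                 # known aleatory standard deviation
m0, s0 = 0.0, 2.0           # prior on the epistemic parameter mu
y_obs = 1.7                 # the information I = {Y = y_obs}

# Conjugate update: P(mu | I) is again normal.
s1 = 1.0 / np.sqrt(1.0 / s0**2 + 1.0 / sigma**2)
m1 = s1**2 * (m0 / s0**2 + y_obs / sigma**2)

# Predictive distribution P_new(Y) by Monte Carlo: sample mu from the posterior,
# then Y | mu from the unchanged aleatory model.
rng = np.random.default_rng(0)
mu_samples = rng.normal(m1, s1, size=100_000)
y_pred = rng.normal(mu_samples, sigma)

print(f"posterior for mu: N({m1:.3f}, {s1:.3f}^2)")
print(f"predictive mean {y_pred.mean():.3f}, predictive std {y_pred.std():.3f}")
# Note that P_new(Y) is not the degenerate P(Y | I) = delta(y_obs).
```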

This epistemic updating of the model considered in Examples 2.1 and 2.2 should be fairly intuitive, if

  1. All epistemic uncertainty is represented by a single parameter \(\theta\), and

  2. \(\theta\) is a familiar object like a number or a vector in \({\mathbb {R}}^n\).

But what can we say in a more general setting? It is common that epistemic uncertainty comes from lack of knowledge related to functions. This is the case with probabilistic emulators and surrogate models. The input to these functions may contain epistemic and/or aleatory uncertainty as well. Can we talk about isolating and modifying the epistemic uncertainty in such a model, without making reference to the specific details of how the model has been created? In the following we will show that with the measure-theoretic framework, we can still make use of a simple formulation like the one in Example 2.2.

2.2 The probability space

Let \({{\textbf {X}}}\) be a random variable containing both aleatory and epistemic uncertainty. In order to describe how \({{\textbf {X}}}\) can be treated like in Examples 2.1 and 2.2, but for the general setting, we will first recall some of the basic definitions from measure theory and measure-theoretic probability.

To say that \({{\textbf {X}}}\) is a random variable means that \({{\textbf {X}}}\) is defined on some measurable space \((\Omega , {\mathcal {F}})\). Here, \(\Omega\) is a set, and if \({{\textbf {X}}}\) takes values in \({\mathbb {R}}^n\) (or some other measurable space), then \({{\textbf {X}}}\) is a so-called measurable function, \({{\textbf {X}}}:\Omega \rightarrow {\mathbb {R}}^n\) (to be defined precisely later). Any randomness or uncertainty about \({{\textbf {X}}}\) is just a consequence of uncertainty regarding \(\omega \in \Omega\). As an example, \({{\textbf {X}}}\) could relate to some 1-year extreme value, whose uncertainty comes from day to day fluctuations or some fundamental stochastic phenomenon represented by \(\omega \in \Omega\). Natural sources of such uncertainty are, for example, the weather or large-scale human behaviour. Therefore, whether modelling weather, option prices, structural safety at sea or traffic networks, stochastic models are appropriate.

The probability of the event \(\{ {{\textbf {X}}}\in E \}\), for some subset \(E \subset {\mathbb {R}}^n\), is really the probability of \(\left\{ \omega \in {{\textbf {X}}}^{-1}(E) \right\}\). Technically, we need to ensure that \(\left\{ \omega \in {{\textbf {X}}}^{-1}(E) \right\}\) is something that we can compute the probability of, and for this we need \({\mathcal {F}}\). \({\mathcal {F}}\) is a collection of subsets of \(\Omega\), and represents all possible events (in the “\(\Omega\)-world”). When \({\mathcal {F}}\) is a \(\sigma\)-algebra, the pair \((\Omega , {\mathcal {F}})\) becomes a measurable space.

So, when we define \({{\textbf {X}}}\) as a random variable taking values in \({\mathbb {R}}^n\), this means that there exists some measurable space \((\Omega , {\mathcal {F}})\), such that any event \(\left\{ {{\textbf {X}}}\in E \right\}\) in the “\({\mathbb {R}}^n\)-world” (which has its own \(\sigma\)-algebra) has a corresponding event \(\left\{ \omega \in {{\textbf {X}}}^{-1}(E) \right\} \in {\mathcal {F}}\) in the “\(\Omega\)-world”. It also means that we can define a probability measure on \((\Omega , {\mathcal {F}})\) that gives us the probability of each event, but before we introduce any specific probability measure, \({{\textbf {X}}}\) will just be a measurable function.

  • We start with assuming that there exists some measurable space \((\Omega , {\mathcal {F}})\) where \({{\textbf {X}}}\) is a measurable function.

The natural way to make \({{\textbf {X}}}\) into a random variable is then to introduce some probability measure P on \({\mathcal {F}}\), giving us the probability space \((\Omega , {\mathcal {F}}, P)\).

  • Given a probability measure P on \((\Omega , {\mathcal {F}})\) we obtain the probability space \((\Omega , {\mathcal {F}}, P)\) on which \({{\textbf {X}}}\) is defined as a random variable.

We have considered here, for familiarity, that \({{\textbf {X}}}\) takes values in \({\mathbb {R}}^n\). When no measure and \(\sigma\)-algebra is stated explicitly, one can assume that \({\mathbb {R}}^n\) is endowed with the Lebesgue measure (which underlies the standard notion of length, area and volume, etc.) and the Borel \(\sigma\)-algebra (the smallest \(\sigma\)-algebra containing all open sets). Generally, \({{\textbf {X}}}\) can take values in any measurable space. For example, \({{\textbf {X}}}\) can map from \(\Omega\) to a space of functions. This is important in the study of stochastic processes.

2.3 The epistemic sub \(\sigma\)-algebra \({\mathcal {E}}\)

In the probability space \((\Omega , {\mathcal {F}}, P)\), recall that the \(\sigma\)-algebra \({\mathcal {F}}\) contains all possible events. For any random variable \({{\textbf {X}}}\) defined on \((\Omega , {\mathcal {F}}, P)\), the knowledge that some event has occurred provides information about \({{\textbf {X}}}\). This information may relate to \({{\textbf {X}}}\) in a way that it only affects epistemic uncertainty, only aleatory uncertainty, or both. We are interested in specifying the events \(e \in {\mathcal {F}}\) that are associated with epistemic information alone. It is the probability of these events we want to update as new information is obtained. The collection \({\mathcal {E}}\) of such sets is itself a \(\sigma\)-algebra, and we say that \({\mathcal {E}} \subseteq {\mathcal {F}}\) is the sub \(\sigma\)-algebra of \({\mathcal {F}}\) representing epistemic information.

We illustrate this in the following examples. In Example 2.3, we consider the simplest possible scenario represented by the flip of a biased coin, and in Example 2.4 a familiar scenario from uncertainty quantification involving uncertainty with respect to functions.

Example 2.3

(Coin flip)

Define \({{\textbf {X}}}= (Y, \theta )\) as in Example 2.1, and let \(Y \in \{ 0, 1 \}\) denote the outcome of a coin flip where “heads” is represented by \(Y = 0\) and “tails” by \(Y = 1\). Assume that \(P(Y = 0) = \theta\) for some fixed but unknown \(\theta \in [0, 1]\). For simplicity we assume that \(\theta\) can only take two values, \(\theta \in \{ \theta _1, \theta _2 \}\) (e.g. there are two coins but we do not know which one is being used).

Then, \(\Omega = \{ 0, 1 \} \times \left\{ \theta _1, \theta _2 \right\}\), \({\mathcal {F}} = 2^\Omega\) and \({\mathcal {E}} = \left\{ \emptyset , \Omega , \left\{ (0, \theta _1), (1, \theta _1) \right\} , \left\{ (0, \theta _2), (1, \theta _2) \right\} \right\}\).
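For readers who prefer to see these sets enumerated, the following small script (our illustration) constructs \(\Omega\), \({\mathcal {F}} = 2^\Omega\) and \({\mathcal {E}}\) for this example and confirms that \({\mathcal {E}}\) contains exactly the four events listed above.

```python
# A small check (our illustration) of the sets in Example 2.3: Omega has four
# outcomes, F is the full power set, and E contains exactly the events that can
# be described in terms of theta alone.
from itertools import chain, combinations

t1, t2 = "theta1", "theta2"
Omega = {(0, t1), (1, t1), (0, t2), (1, t2)}

def power_set(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

F = power_set(Omega)   # all 16 events
# An event is epistemic if its membership does not depend on the coin outcome.
E = [e for e in F if all(((0, th) in e) == ((1, th) in e) for th in (t1, t2))]
# E = {empty set, Omega, {(0,theta1),(1,theta1)}, {(0,theta2),(1,theta2)}}
print(len(F), len(E))  # 16, 4
```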

Example 2.4

(UQ)

Let \({{\textbf {X}}}= ({{\textbf {x}}}, {{\textbf {y}}})\) where \({{\textbf {x}}}\) is an aleatory random variable, and \({{\textbf {y}}}\) is the result of a fixed but unknown function applied to \({{\textbf {x}}}\). We let \({{\textbf {y}}}= {\hat{f}}({{\textbf {x}}})\) where \({\hat{f}}\) is a function-valued epistemic random variable.

If \({{\textbf {x}}}\) is defined on a probability space \((\Omega _{{\textbf {x}}}, {\mathcal {F}}_{{\textbf {x}}}, P_{{\textbf {x}}})\) and \({\hat{f}}\) is a stochastic process defined on \((\Omega _f, {\mathcal {F}}_f, P_f)\), then \((\Omega , {\mathcal {F}}, P)\) can be defined as the product of the two spaces and \({\mathcal {E}}\) as the projection \({\mathcal {E}} =\left\{ \Omega _{{\textbf {x}}}\times A | A \in {\mathcal {F}}_f\right\}\).

In the following, we assume that the epistemic sub \(\sigma\)-algebra \({\mathcal {E}}\) has been identified.

Given a random variable \({{\textbf {X}}}\), we say that \({{\textbf {X}}}\) is \({\mathcal {E}}\)-measurable if \({{\textbf {X}}}\) is measurable as a function defined on \((\Omega , {\mathcal {E}})\). We say that \({{\textbf {X}}}\) is independent of \({\mathcal {E}}\), if the conditional probability \(P({{\textbf {X}}}| e)\) is equal to \(P({{\textbf {X}}})\) for any event \(e \in {\mathcal {E}}\). With our definition of \({\mathcal {E}}\), we then have for any random variable \({{\textbf {X}}}\) on \((\Omega , {\mathcal {F}}, P)\) that

  • \({{\textbf {X}}}\) is purely epistemic if and only if \({{\textbf {X}}}\) is \({\mathcal {E}}\)-measurable,

  • \({{\textbf {X}}}\) is purely aleatory if and only if \({{\textbf {X}}}\) is independent of \({\mathcal {E}}\).

2.4 Epistemic conditioning

Let \({{\textbf {X}}}\) be a random variable on \((\Omega , {\mathcal {F}}, P)\) that may contain both epistemic and aleatory uncertainty, and assume that the epistemic sub \(\sigma\)-algebra \({\mathcal {E}}\) is given. By epistemic conditioning, we want to update the epistemic part of the uncertainty in \({{\textbf {X}}}\) using some set of information I. In Example 2.3 this means updating the probabilities \(P(\theta = \theta _1)\) and \(P(\theta = \theta _2)\), and in Example 2.4 this means updating \(P_f\). In order to achieve this in the general setting, we first need a way to decouple epistemic and aleatory uncertainty. This can actually be made fairly intuitive, if we rely on the following assumption:

Assumption 2.5

There exists a random variable \(\theta :\Omega \rightarrow \Theta\) that generates \({\mathcal {E}}\).

If this generator \(\theta\) exists, then for any fixed value \(\theta \in \Theta\), we have that \({{\textbf {X}}}| \theta\) is independent of \({\mathcal {E}}\). Hence \({{\textbf {X}}}| \theta\) is purely aleatory and \(\theta\) is purely epistemic.

We will call \(\theta\) the epistemic generator, and we can interpret \(\theta\) as a signal that reveals all epistemic information when known. That is, if \(\theta\) could be observed, then knowing the value of \(\theta\) would remove all epistemic uncertainty from our model. As it turns out, under fairly mild conditions one can always assume the existence of this generator. One sufficient condition is that \((\Omega , {\mathcal {F}}, P)\) is a standard probability space, in which case the statement holds up to sets of measure zero. This is a technical requirement to avoid pathological cases, and as it does not provide any additional intuition of immediate use, we postpone further explanation to Appendix A.

Example 2.6

(Coin flip—epistemic generator)

In the coin flip example, the variable \(\theta \in \{ \theta _1, \theta _2 \}\) which generates \({\mathcal {E}}\) is already specified.

Example 2.7

(UQ – epistemic generator)

In this example, when \((\Omega , {\mathcal {F}}, P)\) is the product of an aleatory space \((\Omega _{{\textbf {x}}}, {\mathcal {F}}_{{\textbf {x}}}, P_{{\textbf {x}}})\) and an epistemic space \((\Omega _f, {\mathcal {F}}_f, P_f)\), we could let \(\theta :\Omega = \Omega _{{\textbf {x}}}\times \Omega _f \rightarrow \Omega _f\) be the projection \(\theta (\omega _{{\textbf {x}}}, \omega _f) = \omega _f\).

Alternatively, given only the space \((\Omega , {\mathcal {F}}, P)\) where both \({{\textbf {x}}}\) and \({\hat{f}}\) are defined, assume that \({\hat{f}}\) is a Gaussian process (or some other stochastic process for which the Karhunen–Loève theorem holds). Then there exists a sequence of deterministic functions \(\phi _1, \phi _2, \dots\) and an infinite-dimensional variable \(\theta = (\theta _1, \theta _2, \dots )\) such that \({\hat{f}}({{\textbf {x}}}) = \sum _{i=1}^{\infty } \theta _i \phi _i({{\textbf {x}}})\), and we can let \({\mathcal {E}}\) be generated by \(\theta\).
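The following sketch (our illustration, with an arbitrary squared-exponential kernel on a grid) shows a truncated version of this construction: the process is represented by deterministic basis functions weighted by random coefficients \(\theta_i\), which is exactly the object that the epistemic \(\sigma\)-algebra can be generated by.

```python
# A sketch (our illustration) of the construction in Example 2.7: a Gaussian process
# f-hat on a grid is written as a sum of deterministic functions phi_i weighted by
# random coefficients theta_i (a truncated Karhunen-Loeve expansion).
import numpy as np

x = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2)    # squared-exponential kernel

eigvals, eigvecs = np.linalg.eigh(K)                         # eigendecomposition of the covariance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 10                                                       # truncation level
phi = eigvecs[:, :m] * np.sqrt(np.maximum(eigvals[:m], 0))   # deterministic basis functions

rng = np.random.default_rng(0)
theta = rng.standard_normal(m)                               # epistemic coefficients, theta_i ~ N(0, 1)
f_sample = phi @ theta                                       # one realization of f-hat on the grid

print(f_sample.shape)   # (200,) -- conditioning on data would update the law of theta
```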

The decoupling of epistemic and aleatory uncertainty is then obtained by considering the joint variable \(({{\textbf {X}}}, \theta )\) instead of \({{\textbf {X}}}\) alone, because

$$\begin{aligned} P({{\textbf {X}}}, \theta ) = P({{\textbf {X}}}| \theta )P(\theta ). \end{aligned}$$
(1)

From (1) we see how the probability measure P becomes the product of the epistemic probability \(P(\theta )\) and the aleatory probability \(P({{\textbf {X}}}| \theta )\) when applied to \(({{\textbf {X}}}, \theta )\).

Given new information, I, we will update our beliefs about \(\theta\), \(P(\theta ) \rightarrow P(\theta | I)\), and we define the epistemic conditioning as follows:

$$\begin{aligned} P_{\text {new}}({{\textbf {X}}}, \theta ) = P({{\textbf {X}}}| \theta )P(\theta | I). \end{aligned}$$
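For the coin-flip setting of Example 2.3, epistemic conditioning reduces to a simple Bayes update of the two epistemic probabilities, as in the following sketch (the observed data are invented for illustration).

```python
# A sketch (our numbers) of epistemic conditioning for the coin-flip example:
# observing I = "k heads in n flips" updates P(theta) by Bayes' rule, while the
# aleatory model P(Y | theta) is left untouched.
import numpy as np

thetas = np.array([0.5, 0.99])        # possible biases, P(Y = 0) = theta
prior = np.array([0.5, 0.5])          # prior epistemic probabilities

n, k = 10, 5                          # information I: 5 heads (Y = 0) in 10 flips
likelihood = thetas**k * (1 - thetas)**(n - k)
posterior = prior * likelihood
posterior /= posterior.sum()

# P_new(X, theta) = P(X | theta) P(theta | I): e.g. the new marginal P_new(Y = 0)
p_heads_new = np.sum(posterior * thetas)
print(posterior, p_heads_new)
```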

2.5 Two types of assumptions

Consider the probability space \((\Omega , {\mathcal {F}}, P)\), with epistemic sub \(\sigma\)-algebra \({\mathcal {E}}\). Here \({\mathcal {E}}\) represents epistemic information, which is the information associated with assumptions. In other words, an epistemic event \(e \in {\mathcal {E}}\) represents an assumption. In fact, given a class of assumptions, the following Remark 2.8 shows why \(\sigma\)-algebras are appropriate structures.

Remark 2.8

Let \({\mathcal {E}}\) be a collection of assumptions. If \(e \in {\mathcal {E}}\), this means that it is possible to assume that e is true. If it is also possible to assume that e is false, then \(\bar{e} \in {\mathcal {E}}\) as well. It may then also be natural to require that \(e_1, e_2 \in {\mathcal {E}} \Rightarrow e_1 \cap e_2 \in {\mathcal {E}}\), and so on. These are the defining properties of a \(\sigma\)-algebra.

For any random variable \({{\textbf {X}}}\) defined on \((\Omega , {\mathcal {F}}, P)\), when \({\mathcal {E}}\) is a sub \(\sigma\)-algebra of \({\mathcal {F}}\), \({{\textbf {X}}}| e\) for \(e \in {\mathcal {E}}\) is well defined, and represents the random variable under the assumption e. In particular, given any fixed epistemic event \(e \in {\mathcal {E}}\) we have a corresponding aleatory distribution \(P({{\textbf {X}}}| e)\) over \({{\textbf {X}}}\), and the conditional \(P({{\textbf {X}}}| {\mathcal {E}})\) is the random measure corresponding to \(P({{\textbf {X}}}| e)\) when e is a random epistemic event in \({\mathcal {E}}\). Here, the global probability measure P when applied to e, P(e), is the belief that e is true. In Sect. 2.4 we discussed updating the part of P associated with epistemic uncertainty. We also introduced the epistemic generator \(\theta\) in order to associate the event e with an outcome \(\theta (e)\), and make use of \(P({{\textbf {X}}}| \theta )\) in place of \(P({{\textbf {X}}}| {\mathcal {E}})\). This provides a more intuitive interpretation of the assumptions that are measurable, i.e. those whose belief we may specify through P.

Of course, the measure P is also based on assumptions; for instance, we may in Example 2.1 assume that Y follows a normal distribution. One could in principle specify a (measurable) space of probability distributions, of which the normal distribution is one example. Otherwise, we view the normality assumption as a structural assumption related to the probabilistic model for \({{\textbf {X}}}\), i.e. the measure P. These kinds of assumptions cannot be treated the same way as assumptions related to measurable events. For instance, the consequence of the complement assumption “Y does not follow a normal distribution” is not well defined.

In order to avoid any confusion, we split the assumptions into two types:

  1. The measurable assumptions represented by the \(\sigma\)-algebra \({\mathcal {E}}\), and

  2. the set M of structural assumptions underlying the probability measure P.

This motivates the following definition.

Definition 2.9

(Structural assumptions) We let M denote the set of structural assumptions that defines a probability measure on \((\Omega , {\mathcal {F}})\), which we may write \(P_M(\,\cdot \,)\) or \(P(\,\cdot \, | M)\).

We may also refer to M as the non-measurable assumptions, to emphasize that M contains all the assumptions not covered by \({\mathcal {E}}\). When there is no risk of confusion we will also suppress the dependency on M and just write \(P(\,\cdot \,)\). Stating the set M explicitly is typically only relevant for scenarios where we consider changes being made to the actual system that is being modelled, or for evaluating different candidate models, e.g. through the marginal likelihood P(I|M). In practice one would also state M so that decision makers can determine their level of trust in the probabilistic model, and the appropriate level of caution when applying the model.

As we will see in the upcoming section, making changes to M and making changes to how \(P_M\) acts on events in \({\mathcal {E}}\) are the two main ways in which we update a probabilistic digital twin.

3 The Probabilistic Digital Twin

The object that we will call probabilistic digital twin, PDT for short, is a probabilistic model of a physical system. It is essentially a (possibly degenerate) probability distribution of a vector \({{\textbf {X}}}\), representing the relevant attributes of the system, but where we in addition require the specification of epistemic uncertainty (assumptions) and how this uncertainty may be updated given new information.

Before presenting the formal definition of a probabilistic digital twin, we start with an example showing why the identification of epistemic uncertainty is important.

3.1 Why distinguish between aleatory and epistemic uncertainty?

The decoupling of epistemic and aleatory uncertainty (as described in Sect. 2.4) is central in the PDT framework. There are two good reasons for doing this:

  1. We want to make decisions that are robust with respect to epistemic uncertainty.

  2. We want to study the effect of gathering information.

Item 1 relates to the observation that decision theoretic approaches based on expectation may not be robust. That is, if we marginalize out the epistemic uncertainty and consider only \(E_{\theta }[P({{\textbf {X}}}| \theta )] = \int P({{\textbf {X}}}| \theta ) dP_\theta\), the resulting decisions may not be robust. We give two examples of this below, see Examples 3.1 and 3.2. Item 2 means that by considering the effect of information on epistemic uncertainty, we can evaluate the value of gathering information. This is discussed in further detail in Sect. 4.6.

In the context of PDTs, it is essential that the sources of epistemic uncertainty are understood. First of all, because we need to understand how it may be reduced (as it is by definition reducible), but also because we want to understand the criticality of assumptions made under imperfect or lacking knowledge. Within risk analysis of engineering systems, which is a relevant application area for PDTs, this is important. This is discussed from a general risk perspective in [25, 26] and specifically for PDTs in [1].

Example 3.1

(Coin flip – robust decisions)

Continuing from the coin flip example (see Example 2.3), we let \(\theta _1 = 0.5\) and \(\theta _2 = 0.99\). Assume that you are given the option to guess the outcome of the coin flip Y. If you guess correctly, you collect a reward of \(R = 10^6\), otherwise you have to pay \(L = 10^6\). A priori your belief about the bias of the coin is that \(P(\theta = 0.5) = P(\theta = 0.99) = 0.5\). If you consider betting on \(Y = 0\), then the expected return, obtained by marginalizing over \(\theta\), becomes \(P(\theta = 0.5)(0.5R - 0.5L) + P(\theta = 0.99)(0.99 R - 0.01L) = 490{,}000\).

This is a scenario where decisions supported by taking the expectation with respect to epistemic uncertainty are not robust: we believe that \(\theta = 0.5\) and \(\theta = 0.99\) are equally likely, and if \(\theta = 0.5\) we will lose \(10^6\) half of the time by betting on \(Y = 0\).
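The numbers above can be verified with a few lines of arithmetic (using the reward and loss from the example):

```python
# Checking the expected returns in Example 3.1 when betting on Y = 0.
R, L = 10**6, 10**6
ret_theta = {0.5: 0.5 * R - 0.5 * L, 0.99: 0.99 * R - 0.01 * L}   # expectation given theta
expected = 0.5 * ret_theta[0.5] + 0.5 * ret_theta[0.99]           # marginalized over theta
print(ret_theta, expected)   # {0.5: 0.0, 0.99: 980000.0} 490000.0
```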

Example 3.2

(UQ – robust decisions)

This example is a continuation of Examples 2.4 and 2.7.

In structural reliability analysis, we are dealing with an unknown function g with the property that the event \(\{ y = g({{\textbf {x}}}) < 0 \}\) corresponds to failure. When g is represented by a random function \({\hat{g}}\) with epistemic uncertainty, the failure probability is also uncertain. Or in other words, if \({\hat{g}}\) is epistemic then \({\hat{g}}\) is a function of the generator \(\theta\). Hence, the failure probability is a function of \(\theta\). We want to make use of a conservative estimate of the failure probability, i.e. use a conservative value of \(\theta\). \(P(\theta )\) tells us how conservative a given value of \(\theta\) is.
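A minimal numerical sketch of this idea, with a hypothetical toy limit state rather than a real structural model, is the following: the failure probability is computed as a function of \(\theta\), and a conservative estimate is taken as an upper quantile over the epistemic distribution \(P(\theta)\).

```python
# A sketch (hypothetical toy numbers) of a conservative failure probability estimate:
# the limit state g depends on an epistemic parameter theta, so the failure
# probability p_f(theta) = P(g < 0 | theta) is itself uncertain, and a conservative
# estimate can be taken as an upper quantile of p_f over P(theta).
import numpy as np

rng = np.random.default_rng(0)

def failure_probability(theta, n_mc=50_000):
    """p_f(theta) for a toy limit state g = capacity(theta) - load."""
    load = rng.lognormal(mean=0.0, sigma=0.3, size=n_mc)
    capacity = 2.0 + theta                      # epistemic shift of the capacity
    return np.mean(capacity - load < 0.0)

theta_samples = rng.normal(0.0, 0.5, size=200)  # epistemic distribution P(theta)
p_f = np.array([failure_probability(t) for t in theta_samples])

print("mean p_f:", p_f.mean())
print("conservative (95th percentile over theta):", np.quantile(p_f, 0.95))
```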

3.2 The attributes \({{\textbf {X}}}\)

To define a PDT, we start by considering a vector \({{\textbf {X}}}\) consisting of the attributes of some system. This means that \({{\textbf {X}}}\) is a representation of the physical object or asset that we are interested in. In general, \({{\textbf {X}}}\) describes the physical system. In addition, \({{\textbf {X}}}\) must contain attributes related to any type of information that we want to make use of. For instance, if the information consists of observations, the relevant observable quantities, as well as attributes related to measurement errors or noise, may be included in \({{\textbf {X}}}\). In general, we will think of a model of a system as a set of assumptions that describes how the components of \({{\textbf {X}}}\) are related or behave. The canonical example here is where some physical quantity is inferred from observations including errors and noise, in which case a model of the physical quantity (physical system) is connected with a model of the data generating process (observational system). We are interested in modelling dependencies with associated uncertainty related to the components of \({{\textbf {X}}}\), and treat \({{\textbf {X}}}\) as a random variable.

The attributes \({{\textbf {X}}}\) characterise the state of the system and the processes that the PDT represents. \({{\textbf {X}}}\) may for instance include:

  • System parameters representing quantities that have a fixed, but possibly uncertain, value. For instance, these parameters may be related to the system configuration.

  • System variables that may vary in time, and whose value may be uncertain.

  • System events, i.e. the occurrence of defined state transitions.

In risk analysis, one is often concerned with risk measures given as quantified properties of \({{\textbf {X}}}\), usually in terms of expectations. For instance, if \({{\textbf {X}}}\) contains some extreme value (e.g. the 100-year wave) or some specified event of failure (using a binary variable), the expectations of these may be compared against risk acceptance criteria to determine compliance.

3.3 The PDT definition

Based on the concepts introduced so far, we define the PDT as follows:

Definition 3.3

(Probabilistic Digital Twin) A Probabilistic Digital Twin (PDT) is a triplet \(({{\textbf {X}}}, A, I)\), where \({{\textbf {X}}}\) is a vector of attributes of a system, A contains the assumptions needed to specify a probabilistic model, and I contains information regarding actions and observations:

  • \(A = ((\Omega , {\mathcal {F}}), {\mathcal {E}}, M)\), where \((\Omega , {\mathcal {F}})\) is a measurable space where \({{\textbf {X}}}\) is measurable, and \({\mathcal {E}}\) is the sub \(\sigma\)-algebra representing epistemic information. M contains the structural assumptions that define a probability measure \(P_M\) on \((\Omega , {\mathcal {F}})\).

  • I is a set consisting of events of the form (d, o), where d encodes a description of the conditions under which the observation o was made, and where the likelihood \(P(o | {{\textbf {X}}}, d)\) is well defined. For brevity, we will write this likelihood as \(P(I | {{\textbf {X}}})\) when I contains multiple events of this sort.

When M is understood, and there is no risk of confusion, we will drop stating the dependency on M explicitly and just refer to the probability space \((\Omega , {\mathcal {F}}, P)\).

It is important to note that consistency between I and \(P({{\textbf {X}}})\) is required. That is, when using the probabilistic model for \({{\textbf {X}}}\), it should be possible to simulate the type of observations given by I. In this case the likelihood \(P(I | {{\textbf {X}}})\) is well defined, and the epistemic updating of \({{\textbf {X}}}\) can be obtained from Bayes’ theorem.

Finally, we note that with this setup the information I may contain observations made under different conditions than what is currently specified through M. The information I is generally defined as a set of events, given as pairs (d, o), where d encodes the relevant action leading to observing o, as well as a description of the conditions under which o was observed. Here d may relate to modifications of the structural assumptions M, for instance if the causal relationships that describe the model of \({{\textbf {X}}}\) under observation of o are not the same as what is currently represented by M. This is the scenario when we perform controlled experiments. Alternatively, (d, o) may represent a passive observation, e.g. \(d =\) ’measurement taken from sensor 1 at time 01:02:03’ and \(o = 1.7\) mm. We illustrate this in the following example.

Example 3.4

(Parametric regression)

Let \((x_1, x_2)\) denote two physical quantities where \(x_2\) depends on \(x_1\), and let \((y, \varepsilon )\) represent an observable quantity where y corresponds to observing \(x_2\) together with additive noise \(\varepsilon\). Set \({{\textbf {X}}}= (x_1, x_2, y, \varepsilon )\).

Fig. 1: A standard regression model as a PDT

We define a model M corresponding to \(x_1 \sim p_{x_1}(x_1 | \theta _1)\), \(x_2 = f(x_1, \theta _2)\), \(y = x_2 + \varepsilon\) and \(\varepsilon \sim p_{\varepsilon }\), where \(p_{x_1}\) is a probability density depending on the parameter \(\theta _1\) and \(f(\cdot , \theta _2)\) is a deterministic function depending on the parameter \(\theta _2\). \(\theta _1\) and \(\theta _2\) are epistemic parameters for which we define a joint density \(p_{\theta }\).

Assume that \(I = \left\{ \left( d^{(1)}, o^{(1)}\right) , \dots , \left( d^{(n)}, o^{(n)}\right) \right\}\) is a set of controlled experiments, where \(d^{(i)} = \left( \text {set } x_1 = x_{1}^{(i)}\right)\) and \(o^{(i)}\) is a corresponding observation of \(y|\left( x_1 = x_{1}^{(i)}, \varepsilon = \varepsilon ^{(i)}\right)\) for a selected set of inputs \(x_{1}^{(1)}, \dots , x_{1}^{(n)}\) and unknown i.i.d. \(\varepsilon ^{(i)} \sim p_\varepsilon\). In this scenario, regression is performed by updating the distribution \(p_\theta\) to agree with the observations:

$$\begin{aligned} p_{\theta }(\theta | I)&= p_{\theta }\left( \theta _1 | \theta _2\right) p_{\theta }(\theta _2 | I)\\&= \frac{1}{Z} \left( \prod _{i=1}^{n} p_{\varepsilon }\left( o^{(i)} - f\left( x_{1}^{(i)}, \theta _2\right) \right) \right) p_{\theta }(\theta ), \end{aligned}$$
(2)

where Z is a constant ensuring that the updated density integrates to one.

If instead I corresponds to direct observations, \(d^{(i)} = \left( \text {observe } y^{(i)}\right)\), \(o^{(i)} = y^{(i)}\), then \(p_{\theta }(\theta | I)\) corresponds to using \(x_1\) instead of \(x_{1}^{(i)}\) and multiplying with \(p_{x_1}\left( x_1|\theta _1\right)\) in (2).
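To make the update (2) concrete, the following sketch implements the controlled-experiment case for a toy choice \(f(x_1, \theta_2) = \theta_2 x_1\) with Gaussian noise, computing the posterior over \(\theta_2\) on a grid; the model and numbers are ours, chosen only for illustration.

```python
# A sketch (our toy instance, not the paper's) of the controlled-experiment update (2):
# f(x1, theta2) = theta2 * x1 with Gaussian noise, and the posterior over theta2 is
# computed on a grid by multiplying the noise density of the residuals with the prior.
import numpy as np

rng = np.random.default_rng(0)
sigma_eps = 0.2
theta2_true = 1.5

x1_design = np.array([0.0, 0.5, 1.0, 1.5, 2.0])                         # d^(i): set x1 = x1^(i)
o = theta2_true * x1_design + rng.normal(0, sigma_eps, x1_design.size)  # observed y values o^(i)

theta2_grid = np.linspace(0.0, 3.0, 601)
prior = np.exp(-0.5 * (theta2_grid - 1.0)**2 / 1.0**2)                  # N(1, 1) prior, unnormalized

def normal_pdf(r, s):
    return np.exp(-0.5 * (r / s)**2) / (s * np.sqrt(2 * np.pi))

residuals = o[None, :] - theta2_grid[:, None] * x1_design[None, :]
likelihood = normal_pdf(residuals, sigma_eps).prod(axis=1)              # prod_i p_eps(o_i - f(x1_i, theta2))

posterior = prior * likelihood
posterior /= np.trapz(posterior, theta2_grid)                           # the 1/Z normalization

print("posterior mean of theta2:", np.trapz(theta2_grid * posterior, theta2_grid))
```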

Note that the scenario with controlled experiments in Example 3.4 corresponds to a different model than the one in Fig. 1. This is a familiar scenario in the study of causal inference, where actively setting the value of \(x_1\) is the do-operator (see Pearl [27]), which breaks the link between \(\theta _1\) and \(x_1\).

3.4 Corroded pipeline example

To give a concrete example of a system where the PDT framework is relevant, we consider the following model from Agrell and Dahl [16]. This is based on a probabilistic structural reliability model for the assessment of corroded pipelines, which is the basis for the recommended practice DNV GL RP-F101 [28]. It is a model of a physical failure mechanism called pipeline burst, which may occur when the pipeline’s ability to withstand the high internal pressure has been reduced as a consequence of corrosion. We give only a general overview of this model here, and refer to Example 4 in Agrell and Dahl [16] for specific details. Later, in Sect. 5.6, we will revisit this example and make use of reinforcement learning to search for an optimal way of updating the PDT.

A graphical representation of the model is shown in Fig. 2. The steel thickness t and the diameter D represent the pipeline geometry, and the material is represented by the ultimate tensile strength s. The pipeline has a defect, in the form of a rectangle with depth d and length l. For some pipeline (D, t, s) with defect (d, l), the pipeline’s pressure resistance capacity (the maximum differential pressure the pipeline can withstand before bursting) can be determined. By running a Finite Element simulation of the physical mode of failure, we obtain the theoretical capacity \(p_{\text {FE}}\). We then model the true capacity \(p_c\) of the pipeline as a function of \(p_{\text {FE}}\) and the model discrepancy \(X_{\text {m}}\) (representing the difference between true and theoretical capacity). We assume that \(X_{\text {m}}\) is independent of the type of pipeline and defect, and also that the standard deviation \(\sigma _{\text {m}}\) of \(X_{\text {m}}\) is fixed, so that only the mean \(\mu _{\text {m}}\) is inferred from experiments. The limit state equation representing the transition from a safe to a failed state is expressed as \(g = p_c - p_d\), where \(p_d\) is the pressure load. The probability of failure is defined as \(P(g \le 0)\).

Fig. 2: Graphical representation of the corroded pipeline structural reliability model. The shaded nodes d, \(p_{\text {FE}}\) and \(\mu _{\text {m}}\) have associated epistemic uncertainty

If we let \({{\textbf {X}}}\) be the random vector containing all of the nodes in Fig. 2, then \({{\textbf {X}}}\) represents a probabilistic model of the physical system. In this example, we want to model some of the uncertainty related to the defect size, the model uncertainty, and the capacity as epistemic. We assume that the defect depth d has a fixed but unknown value, that can be inferred through observations that include noise. Similarly, the model uncertainty \(X_{\text {m}}\) can be determined from experiments. Uncertainty with respect to \(p_{\text {FE}}\) comes from the fact that evaluating the true value of \(p_{\text {FE}} | (D, t, s, d, l)\) involves a time-consuming numerical computation. Hence, \(p_{\text {FE}}\) can only be known for a finite, and relatively small set of input combinations. We can let \({\hat{p}}_{\text {FE}}\) denote a stochastic process that models our uncertainty about \(p_{\text {FE}}\). To construct a PDT from \({{\textbf {X}}}\) we will let \({\hat{p}}_{\text {FE}}\) take the place of \(p_{\text {FE}}\), and specify that \(d, \mu _{\text {m}}\) and \({\hat{p}}_{\text {FE}}\) are epistemic, i.e. \({\mathcal {E}} = \sigma (d, \mu _{\text {m}}, {\hat{p}}_{\text {FE}})\).

If we want a way to update the epistemic uncertainty based on observations, we also need to specify the relevant data generating process. In this example, we assume that there are three different ways of collecting data:

  1. Defect measurement: We assume that noise perturbed observations of the relative depth, \(d/t + \varepsilon\), can be made.

  2. Computer experiment: Evaluate \(p_{\text {FE}}\) at some selected input (D, t, s, d, l).

  3. Lab experiment: Obtain one observation of \(X_{\text {m}}\).

As the defect measurements require the specification of an additional random variable, we have to include \(\varepsilon\) or \((d/t)_{\text {obs}} = d/t + \varepsilon\) in \({{\textbf {X}}}\) as part of the complete probabilistic model. This would then define a PDT where epistemic updating is possible.

The physical system that the PDT represents in this example is rarely viewed in isolation. For instance, the random variables representing the pipeline geometry and material are the result of uncertainty or variations in how the pipeline has been manufactured, installed and operated. And the size of the defect is the result of a chemical process, where scientific models are available. It could therefore be natural to view the PDT from this example as a component of a bigger PDT, where probabilistic models of the manufacturing, operating conditions and corrosion process, etc. are connected. This form of modularity is often emphasized in the discussion of digital twins, and likewise for the kind of Bayesian network type of models as considered in this example.

4 Sequential decision making

We now consider how the PDT framework may be adopted in real-world applications. As with any statistical model of this form, the relevant types of applications are related to prediction and inference. Since the PDT is supposed to provide a one-to-one correspondence (including uncertainty) with a real physical system, we are interested in using the PDT to understand the consequences of actions that we have the option to make. In particular, we will consider the discrete sequential decision making scenario, where we get the opportunity to make a decision, receive information related to the consequences of this decision, and use this to inform the next decision, and so on.

In this kind of scenario, we want to use the PDT to determine an action or policy for how to act optimally (with respect to some case-specific criterion). By a policy here we mean the instructions for how to select among multiple actions given the information available at each discrete time step. We describe this in more detail in Sect. 4.3 where we discuss how the PDT is used for planning. When we make use of the PDT in this way, we consider the PDT as a “mental model” of the real physical system, which an agent uses to evaluate the potential consequences of actions. The agent then decides on some action to make, observes the outcome, and updates its beliefs about the true system, as illustrated in Fig. 3.

Fig. 3: A PDT as a mental model of an agent taking actions in the real world. As new experience is gained, the PDT may be updated by changing the structural assumptions M that defined the probability measure P, or updating belief with respect to epistemic events through conditioning on the new set of information I. The changes in structural assumptions and epistemic information are represented by \(\Delta M\) and \(\Delta I\) respectively. As part of the planning process, the PDT may simulate possible scenarios as indicated by the inner circle

Whether the agent applies a full policy or just a single action (the first in the policy) before collecting information and updating the probabilistic model depends on the type of application at hand. In general it is better to update the model as often as possible, preferably between each action, but the computational time needed to perform this updating might make that impossible to achieve in practice.

4.1 Mathematical framework of sequential decision making

In this section, we briefly recap the mathematical framework of stochastic, sequential decision making in discrete time. We first recall the general framework, and in the following Sect. 4.2, we show how this relates to our definition of a PDT.

Let \(t = 0, 1, 2, \ldots , N-1\) and consider a discrete time system where the state of the system, \(\{x_t\}_{t \ge 1}\), is given by

$$\begin{aligned} x_{t+1} = f_t(x_t, u_t, w_t), \quad t = 0, 1, 2, \ldots , N-1. \end{aligned}$$
(3)

Here, \(x_t\) is the state of the system at time t, \(u_t\) is a control and \(w_t\) is a noise, or random parameter, at time t. Note that the control, \(u_t\), is a decision which can be made by an agent (the controller) at time t. This control is to be chosen from a set of admissible controls \({\mathcal {A}}_t\) (possibly, but not necessarily, depending on time). Also, \(f_t\), \(t=0, 1, 2, \ldots , N-1\), are functions mapping from the space of state variables (the state space), controls and noise into the set of possible states of \(\{x_t\}_{t \ge 0}\). The precise structure of the state space, the set of admissible controls and the random parameter space depends on the particular problem under consideration. Note that due to the randomness in \(w_t\), \(t=0,1,2, \ldots , N-1\), the system state \(x_t\) and control \(u_t\), \(t=1, 2, \ldots , N-1\), also become random variables.

We remark that because of this, the state equation is sometimes written in the following form,

$$\begin{aligned} x_{t+1}(\omega ) = f_t(x_t(\omega ), u_t(\omega ), \omega ) \end{aligned}$$
(4)

where \(\omega \in \Omega\) is a scenario in a scenario space \(\Omega\) (representing the randomness). Sometimes, the randomness is suppressed for notational convenience, so the state equation becomes \(x_{t+1} = f_t(x_t, u_t)\), \(t = 0, 1, 2, \ldots , N-1\).

Note that in the state Eq. (3) (alternatively, Eq. (4)), \(x_{t+1}\) only depends on the previous time step, i.e. \(x_{t}, u_{t}, w_{t}\). This is the Markov property (as long as we assume that the distribution of \(w_t\) does not depend on past values \(w_s\), \(s=0,1, \ldots , t-1\), but only on \(x_t, u_t\)). That is, the next system state only depends on the previous one. Since this Markovian framework is what will be used throughout this paper as we move on to reinforcement learning for a probabilistic digital twin, we focus on this case. However, we remark that there is a large body of theory on sequential decision making which does not rely on Markovianity; this theory is built around maximum principles instead of dynamic programming.

The aim of the agent is to minimize a cost function under the state constraint (3) (or alternatively, (4)). We assume that this cost function is of the following, additive form,

$$\begin{aligned} E\left[ g(x_N) + \sum _{t=0}^{N-1} h_t\left( x_t, u_t, w_t\right) \right] \end{aligned}$$

where the expectation is taken with respect to an a priori given probability measure. That is, we sum over all instantaneous costs \(h_t(x_t, u_t, w_t)\), \(t=0, 1, \ldots , N-1\), which depend on the state of the system, the control and the randomness, and add a terminal cost \(g(x_N)\) which only depends on the system state at the terminal time \(t=N\). This function is called the objective function.

Hence, the stochastic sequential decision making problem of the agent is to choose admissible controls \(u_t\), \(t =0, 1, 2, \ldots , N-1\) in order to optimize

$$\begin{aligned}{} & {} \min _{u_t \in {\mathcal {A}}_t, t \ge 0} E\left[ g(x_N)+\sum _{t=0}^{N-1} h_t\left( x_t, u_t, w_t\right) \right] \nonumber \\{} & {} \quad \text{ such } \text{ that } x_{t+1} = f_t\left( x_t, u_t, w_t\right) ,\nonumber \\{} & {} \quad t = 0, 1, 2, \ldots , N-1. \end{aligned}$$
(5)

Typically, we assume that the agent has full information in the sense that they can choose the control at time t based on (fully) observing the state process up until this time, but that they are not able to use any more information than this (e.g. future or inside information).

This problem formulation is very similar to that of a continuous time stochastic optimal control problem.

Remark 4.1

(A note on continuous time) This framework is parallel to that of stochastic optimal control in continuous time. The main differences in the framework in the continuous time case are that the state equation is typically a stochastic differential equation, and the sum is replaced by an integral in the objective function. For a detailed introduction to continuous time control, see e.g. Øksendal [29].

Other versions of sequential decision making problems include inside information optimal control, partial information optimal control, infinite time horizon optimal control and control with various delay and memory effects. One can also consider problems where further constraints, either on the control or the state, are added to problem (5).

In Bertsekas [30], the sequential decision making problem (5) is studied via the dynamic programming algorithm. This algorithm is based on the Bellman optimality principle, which says that an optimal policy chosen at some initial time must be optimal when the problem is re-solved at a later stage, given the state resulting from the initial choice.
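To make this concrete, the following is a minimal sketch of the dynamic programming (backward induction) algorithm for problem (5), assuming finite state, control and noise sets and, for simplicity, a time-homogeneous system; the function and variable names are illustrative only and are not taken from [30].

```python
def backward_induction(states, controls, noise, p_w, f, h, g, N):
    """Solve min E[g(x_N) + sum_t h(x_t, u_t, w_t)] s.t. x_{t+1} = f(x_t, u_t, w_t)
    by backward induction, assuming finite sets and a time-homogeneous system f,
    stage cost h and noise distribution p_w (dict w -> probability)."""
    V = {x: g(x) for x in states}            # value-to-go at the terminal time N
    policy = []                              # policy[t][x] = optimal control at time t
    for t in reversed(range(N)):
        V_new, pi_t = {}, {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls:
                # expected stage cost plus value-to-go, averaging over the noise w
                val = sum(p_w[w] * (h(x, u, w) + V[f(x, u, w)]) for w in noise)
                if val < best_val:
                    best_u, best_val = u, val
            V_new[x], pi_t[x] = best_val, best_u
        V, policy = V_new, [pi_t] + policy
    return V, policy                         # V[x] = optimal expected cost from x at t = 0
```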

4.2 Sequential decision making in the PDT

Now, we show how the sequential decision making framework from the previous section can be used to solve sequential decision making problems in the PDT.

We may apply this sequential decision making framework to our PDT by letting

$$\begin{aligned} x_t:= {{\textbf {X}}}_t. \end{aligned}$$

That is, the state process for the PDT sequential decision making problem is the random vector of attributes \({{\textbf {X}}}_t\). Note that in Definition 3.3, there is no time-dependency in the attributes \({{\textbf {X}}}\). However, since we are interested in sequential decision making in the PDT, we need to assume that the PDT develops over time (or over some other indexed set, e.g. the accumulated information).

Hence, the stochastic sequential decision making problem of the PDT-agent is to choose admissible controls \(u_t\), \(t =0, 1, 2, \ldots , N-1\) in order to optimize

$$\begin{aligned}{} & {} \min _{u_t \in {\mathcal {A}}_t, t \ge 0} E\left[ g({{\textbf {X}}}_N) + \sum _{t=0}^{N-1} h_t\left( {{\textbf {X}}}_t, u_t, w_t\right) \right] \nonumber \\{} & {} \quad \text{ such } \text{ that } {{\textbf {X}}}_{t+1} = f_t\left( {{\textbf {X}}}_t, u_t, w_t\right) , \nonumber \\{} & {} \quad t = 0, 1, 2, \ldots , N-1. \end{aligned}$$
(6)

Here, the set of admissible controls, \(\{u_t\}_{t \ge 0} \in {\mathcal {A}}\), is problem specific. So are the functions \(h_t\) and \(f_t\) for \(t \ge 0\), as well as the terminal cost function g. Given a particular problem, these functions are defined based on the goal of the PDT-agent as well as the updating of the PDT given new input.

4.3 Planning in the PDT

In this section, we discuss how the PDT can be used for planning. That is, how we use the PDT to identify an optimal policy, without acting in the real world, but by instead simulating what will happen in the real world given that the agent chooses specific actions (or controls, as they are called in the sequential decision making literature, see Sect. 4.1). We use the PDT as a tool to find a plan (policy), or a single action (first action of policy), to perform in the real world.

In order to solve our sequential decision making problem in the PDT, we have chosen to use a reinforcement learning formulation. This essentially corresponds to choosing the dynamic programming method for solving the optimal control problem. Because we will use a DPP (dynamic programming principle) approach, we need all the assumptions that come with it: a Markovian framework, or the possibility of transforming the problem into something Markovian. We also need the Bellman equation to hold in order to avoid issues with time-inconsistency. To ensure this, we for example need to use exponential discounting, and the objective function cannot contain, say, a conditional expectation of the state process entering in a non-linear way. Finally, our planning problem cannot have state constraints.

Starting with an initial PDT as a digital representation of a physical system given our current knowledge, we assume that there are two ways to update the PDT:

  1. 1.

    Changing or updating the structural assumptions M, and hence the probability measure \(P_M\).

  2. 2.

    Updating the information I.

The structural assumptions M are related to the probabilistic model for \({{\textbf {X}}}\). Recall from Sect. 2.4 that these assumptions define the probability measure \(P_M\). Often, this probability measure is taken as given in stochastic modeling. In practice, however, probability measures are not given to us, but are chosen by analysts based on previous knowledge. Hence, the structural assumptions M may be updated because of new knowledge that is external to the model, or for other reasons the analysts view as important.

Updating the information is our main concern in this paper, since this is related to the agent making costly decisions in order to gather more information. An update of the information also means (potentially) reducing the epistemic uncertainty in the PDT. Optimal information gathering in the PDT will be discussed in detail in the following Sect. 4.6.

4.4 MDP, POMDP and their relation to DPP

In this section, we briefly recall the definitions of Markov decision processes, partially observable Markov decision processes and explain how these relate to the sequential decision making framework of Sect. 4.1.

Markov decision processes (MDPs) are discrete-time stochastic control processes of a specific form. An MDP is a tuple

$$\begin{aligned} (S, A, P_a, R_a), \end{aligned}$$

where S is a set of states (the state space) and A is a set of actions (action space). Also,

$$\begin{aligned} P_a(s, s') = P_a\left( s_{t+1} = s' | a_t=a, s_t=s \right) \end{aligned}$$

is the probability of going from state s at time t to state \(s'\) at time \(t+1\) if we do action a at time t. Finally, \(R_a(s,s')\) is the instantaneous reward of transitioning from state s at time t to state \(s'\) at time \(t+1\) by doing action a (at time t).

An MDP satisfies the Markov property: given that the process is in state s and action a is taken at time t, the next state \(s_{t+1}\) is conditionally independent of all earlier states and actions.
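As a small illustration, the tuple \((S, A, P_a, R_a)\) can be represented directly in code; the dictionary layout and the function name below are our own choices and are not prescribed by the MDP formalism.

```python
import random

# A finite MDP (S, A, P_a, R_a) stored as nested dictionaries (illustrative layout):
#   P[a][s] is a list of (s_next, probability) pairs,
#   R[a][(s, s_next)] is the instantaneous reward of that transition under action a.
def step(P, R, s, a):
    """Sample one transition s -> s' and the associated reward, given action a."""
    next_states, probs = zip(*P[a][s])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[a][(s, s_next)]

# Example: two states, one action "a", with rewards attached to the transitions
P = {"a": {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]}}
R = {"a": {(0, 0): 0.0, (0, 1): 1.0, (1, 1): 0.0}}
s_next, r = step(P, R, s=0, a="a")
```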

Remark 4.2

(MDP and DPP)

Note that this definition of an MDP is essentially the same as our DPP framework of Sect. 4.1. In the MDP notation, we say actions, while in the control notation, it is common to use the word control. In Sect. 4.1, we talked about instantaneous cost functions, but here we talk about instantaneous rewards. Since minimization and maximization problems are equivalent (as \(\inf \{z\}= - \sup \{-z\}\)), so are these two concepts. Furthermore, the definition of the transition probabilities \(P_a\) in the MDP framework corresponds to the Markov assumption of the DPP method. In both frameworks, we talk about the system states, though in the DPP framework we model this directly via Eq. (3).

A generalization of MDPs are partially observable Markov decision processes (POMDPs). While an MDP is a 4-tuple, a POMDP is a 6-tuple,

$$\begin{aligned} \left( S, A, P_a, R_a, {\bar{\Omega }}, O\right) . \end{aligned}$$

Here (like before), S is the state space, A is the action space, \(P_a\) gives the conditional transition probabilities between the different states in S and \(R_a\) gives the instantaneous rewards of the transitions for a particular action a.

In addition, we have \({\bar{\Omega }}\), which is a set of observations. In contrast to the MDP framework, with POMDPs, the agent no longer observes the state s directly, but only an observation \(o \in {\bar{\Omega }}\). Furthermore, the agent knows O which is a set of conditional observation probabilities. That is,

$$\begin{aligned} O(o | s', a) \end{aligned}$$

is the probability of observing \(o \in {\bar{\Omega }}\) given that the system has transitioned to state \(s'\) after action a was taken.

The objective of the agent in the POMDP sequential decision problem is to choose a policy, that is actions at each time, in order to optimize

$$\begin{aligned} \max _{\{a_t\} \in A} E\left[ \sum _{t=0}^{T} \lambda ^t r_t \right] \end{aligned}$$
(7)

where \(r_t\) is the reward earned at time t (depending on \(s_t, a_t\) and \(s_{t+1}\)), and \(\lambda \in [0, 1]\) is a number called the discount factor. The discount factor can be used to introduce a preference for immediate rewards over more distant rewards, which may be relevant for the problem at hand, or it may be used purely for numerical efficiency. Hence, the agent aims to maximize its expected discounted reward over all future times. Note that it is also possible to consider problem (7) over an infinite time horizon, or with a separate terminal reward function. This is similar to the DPP sequential decision making framework of Sect. 4.1.

In order to solve a POMDP, it is necessary to include memory of past actions and observations. Indeed, the inclusion of partial observations means that the problem is no longer Markovian. However, the POMDP can be Markovianized by transforming it into a belief-state MDP. In this case, the agent summarizes all information about the past in a belief vector b(t), which is updated as time passes. See [31], Chapter 12.2.3, for details.
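A minimal sketch of this belief update is given below, for a finite state set; the dictionary-based representation of the transition and observation models is illustrative only.

```python
def update_belief(b, P_a, O_a, o):
    """POMDP belief update after taking an action with transition model P_a and
    observation model O_a, and then observing o.
    b: dict s -> P(s), P_a: dict (s, s') -> prob, O_a: dict (s', o) -> prob."""
    next_states = {sp for (_, sp) in P_a}
    b_new = {}
    for sp in next_states:
        pred = sum(P_a.get((s, sp), 0.0) * b[s] for s in b)   # prediction step
        b_new[sp] = O_a.get((sp, o), 0.0) * pred              # correction step
    z = sum(b_new.values())                                    # = P(o | b, a)
    return {sp: p / z for sp, p in b_new.items()} if z > 0 else b
```

The updated belief is itself the state of the belief-state MDP, so acting optimally in the POMDP reduces to acting optimally on such belief vectors.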

4.5 MDP (and POMDP) in the PDT framework

In this section, we show how the probabilistic digital twin can be incorporated in a reinforcement learning framework, in order to solve sequential decision problems in the PDT.

In Sect. 4.2, we showed how we can use the mathematical framework of sequential decision making to solve optimal control problems for a PDT-agent. Also, in Sect. 4.4, we saw (in Remark 4.2) that the MDP (or, more generally, POMDP) framework essentially corresponds to that of the DPP. In theory, we could use the sequential decision making framework and the DPP to solve optimal control problems in the PDT. However, due to the curse of dimensionality, this will typically not be practically tractable. In order to resolve this, we cast the PDT sequential decision making problem into a reinforcement learning framework, in particular an MDP framework. This will enable us to solve the PDT optimal control problem via deep reinforcement learning, which offers suitable tools to overcome the curse of dimensionality.

To define a decision making process in the PDT as an MDP, we need to determine our state space, action space, (Markovian) transition probabilities and reward function.

  • The action space A: These are the possible actions within the PDT. These may depend on the problem at hand. In the next Sect. 4.6, we will discuss optimal information gathering, where the agent can choose between different types of experiments, at different costs, in order to gain more information. In this case, the action space is the set of possible decisions that the agent can choose from in order to attain more information.

  • The state space S: We define a state as a PDT (or equivalently, a version of a PDT that evolves in discrete time \(t = 0, 1, \dots\)). A PDT represents our belief about the current physical state of a system, and it is defined by some initial assumptions together with the information acquired through time. In practice, if the structural assumptions are not changed, we may let the information available at the current time represent a state. This means that our MDP will consist of belief-states, represented by information, from which inference about the true physical state can be made. This is a standard way of creating an MDP from a POMDP, so we can view the PDT state-space as a space of beliefs about some underlying partially observable physical state. Starting from a PDT, we define the state space as all updated PDTs we can reach by taking actions in the action space A.

  • The transition probabilities \(P_a\): Based on our chosen definition of the state space, the transition probabilities are the probabilities of going from one level of information to another, given the action chosen by the agent. For example, if the agent chooses to make decision (action) d, what is the probability of going from the current level of information to another (equal or better) level. This is given by epistemic conditioning of the PDT with respect to the given information set \(I = \{(d,o)\}\), which is based on the decision d and the new observation o. When it comes to updates of the structural assumptions M, we consider this as deterministic transitions.

  • The reward \(R_a\): The reward function, or equivalently, cost function, will depend on the specific problem at hand. To each action \(a \in A\), we assume that we have an associated reward \(R_a\). In the numerical examples in Sect. 5, we give specific examples of how these rewards can be defined.

As mentioned in Sect. 4.3, there are two ways to update the PDT: Updating the structural assumptions M and updating the information I. If we update the PDT by (only) adding to the information set I, we always have the Markov property.

If we also update M, then the preservation of the Markov property is not guaranteed. In this case, using a maximum principle deep learning algorithm instead of the DPP based deep RL is a possibility, see [32].

Remark 4.3

Note that in the case where we have a very simple PDT with only discrete variables and only a few actions, then the RL approach is not necessary. In this case, the DPP method as done in traditional optimal control works well, and we can apply a planning algorithm to the PDT in order to derive an optimal policy. However, in general, the state-action space of the PDT will be too large for this. Hence, traditional planning algorithms, and even regular RL may not be feasible due to the curse of dimensionality. In this paper, we will consider deep reinforcement learning as an approach to deal with this. We discuss this further in Sect. 5.

Note that what determines an optimal action or policy will of course depend on what objective the outcomes are measured against. That is, what do we want to achieve in the real world? There are many different objectives we could consider. In the following we present one generic objective related to optimal information gathering, where the PDT framework is suitable.

4.6 Optimal information gathering

A generic, but relevant, objective in optimal sequential decision making is for the PDT simply to “improve itself”, that is, to reduce epistemic uncertainty with respect to some quantity of interest. Another option is to maximize the Kullback–Leibler divergence with respect to epistemic uncertainty as a general objective. This would mean that we aim to collect the information that “will surprise us the most”. See for instance [8] for a review of some common alternatives.

By definition, a PDT contains an observational model related to the data generating process (the epistemic conditioning relies on this). This means that we can simulate the effect of gathering information, and we can study how to do this optimally. In order to define what we mean by an optimal strategy for gathering information, we then have to specify the following:

  • Objective: What we need the information for. For example, what kind of decision do we intend to support using the PDT? Is it something we want to estimate? What is the required accuracy needed? For instance, we might want to reduce epistemic uncertainty with respect to some quantity, e.g. a risk metric such as a failure probability, expected extreme values, etc.

  • Cost: The cost related to the relevant information-gathering activities.

Then, from the PDT together with a specified objective and cost, one alternative is to define the optimal strategy as the strategy that minimizes the (discounted) expected cost needed to achieve the objective (or equivalently achieves the objective while maximizing reward).

Example 4.4

(Coin flip – information gathering) Continuing from Example 3.1, imagine that before making your final bet, you can flip the coin as many times as you like in order to learn about \(\theta\). Each of these test flips will cost \(10.000 \). You also get the opportunity to replace the coin with a new one, at the cost of \(100.000 \).

An interesting problem is now how to select an optimal strategy for when to test, bet or replace in this game. And will such a strategy be robust? What if there is a limit on the total number of actions that can be performed? In Sect. 5.5 we illustrate how reinforcement learning can be applied to study this problem, where the coin represents a component with reliability \(\theta\), that we may test, use or replace.

5 Deep Reinforcement Learning with PDTs

In this section we give an example of how reinforcement learning can be used for planning, i.e. finding an optimal action or policy, with a PDT. The reinforcement learning paradigm is especially relevant for problems where the state and/or action space is large, or dynamical models where specific transition probabilities are not easily attainable but where efficient sampling is still feasible. In probabilistic modelling of complex physical phenomena, we often find ourselves in this kind of setting.

5.1 Reinforcement Learning (RL)

Reinforcement learning, in short, aims to optimize sequential decision problems through sampling from an MDP (Sutton and Barto [33]). We think of this as an agent taking actions within an environment, following some policy \(\pi (a|s)\), which gives the probability of taking action a if the agent is currently in state s. Generally, \(\pi (a|s)\) represents a (possibly degenerate) probability distribution over actions \(a \in A\) for each \(s \in S\). The agent’s objective is to maximize the amount of reward it receives over time, and a policy \(\pi\) that achieves this is called an optimal policy.

Given a policy \(\pi\) we can define the value of a state \(s \in S\) as

$$\begin{aligned} v_{\pi }(s) = E\left[ \sum _{t=0}^{T} \lambda ^t r_t | s_0 = s \right] \end{aligned}$$
(8)

where \(r_t\) is the reward earned at time t (depending on \(s_t, a_t\) and \(s_{t+1}\)), given that the agent follows policy \(\pi\) starting from \(s_0 = s\). That is, for \(P_a\) and \(R_a\) given by the MDP, \(a_t \sim \pi \left( a_t | s_t\right)\), \(s_{t+1} \sim P_{a_t}\left( s_t, s_{t+1}\right)\) and \(r_t \sim R_{a_t}\left( s_t, s_{t+1}\right)\). Here we make use of a discount factor \(\lambda \in [0, 1]\) in the definition of cumulative reward. If we want to consider \(T = \infty\) (continuing tasks) instead of \(T < \infty\) (episodic task), then \(\lambda < 1\) is generally necessary.

The optimal value function is defined as the one that maximizes (8) over all policies \(\pi\). The optimal action at each state \(s \in S\) then corresponds to acting greedily with respect to this value function, i.e. selecting the action \(a_t\) that maximizes the expected immediate reward plus the discounted value of the next state \(s_{t+1}\). Likewise, it is common to define the action-value function \(q_{\pi }(s, a)\), which corresponds to the expected cumulative return of first taking action a in state s and following \(\pi\) thereafter. RL generally involves some form of Monte Carlo simulation, where a large number of episodes are sampled from the MDP, with the goal of estimating or approximating the optimal value of states, state-action pairs, or an optimal policy directly.

Theoretically this is essentially equivalent to the DPP framework, but with RL we are mostly concerned with problems where optimal solutions cannot be found easily and some form of approximation is needed. By the use of flexible approximation methods combined with adaptive sampling strategies, RL makes it possible to deal with large and complex state and action spaces.
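As a minimal sketch of this sampling-based approach, the following shows tabular Q-learning with an epsilon-greedy behaviour policy; the Gym-style environment interface (env.reset() and env.step(a) returning (next state, reward, done)) and the hyperparameter values are our own assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, lam=0.95, eps=0.1):
    """Tabular Q-learning: estimate q(s, a) by sampling episodes from the MDP."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # bootstrapped target: immediate reward plus discounted value of s_next
            target = r if done else r + lam * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```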

5.2 Function approximation

One way of using function approximation in RL is to define a parametric function \({\hat{v}}(s| {{\textbf {w}}}) \approx v_{\pi }(s)\), given by a set of weights \({{\textbf {w}}}\in {\mathbb {R}}^d\), and try to learn the value function of an optimal policy by finding an appropriate value for \({{\textbf {w}}}\). Alternatively, we could approximate the value of a state-action pair, \({\hat{q}}(s, a| {{\textbf {w}}}) \approx q_{\pi }(s, a)\), or a policy \({\hat{\pi }}(a | s, {{\textbf {w}}}) \approx \pi (a | s)\). The general goal is then to optimize \({{\textbf {w}}}\), using data generated by sampling from the MDP; the RL literature contains many different algorithms designed for this purpose. In the case where a neural network is used for function approximation, it is often referred to as deep reinforcement learning. One alternative, which we will make use of in an example later on, is the deep Q-learning approach (DQN) as introduced by van Hasselt et al. [34], which represents the value of a set of m actions at a state s using a multi-layered neural network

$$\begin{aligned} {\hat{q}}(s| {{\textbf {w}}}) :S \rightarrow {\mathbb {R}}^m. \end{aligned}$$
(9)

Note here that \({\hat{q}}(s| {{\textbf {w}}})\) is a function defined on the state space S. In general, any approximation of the value functions v or q, or of the policy \(\pi\), is defined on S or \(S \times A\). A question that then arises is how we can define parametric functions on the state space S when we are dealing with PDTs. We can assume that we have control over the set of admissible actions A, in the sense that this is something we define, and creating parametric functions defined on A should not be a problem. But as discussed in Sect. 4.5, S will consist of belief-states.

5.3 Defining the state space

We are interested in an MDP where the transition probabilities \(P_a(s, s')\) correspond to updating a PDT as a consequence of action a. In that sense, s and \(s'\) are PDTs. Given a well-defined set of admissible actions, the state space S is then the set of all PDTs that can be obtained starting from some initial state \(s_0\), within some defined horizon.

Recall that going from s to \(s'\) then means keeping track of any changes made to the structural assumptions M and the information I, as illustrated in Fig. 3. From now on, we will for simplicity assume that updating the PDT only involves epistemic conditioning with respect to the information I. This is a rather generic situation; finding a suitable representation of changes in M would have to be handled for the specific use case under consideration. Assuming some initial PDT \(s_0\) is given, any state \(s_t\) at a later time t is then uniquely defined by the set of information \(I_t\) available at time t. Representing states by information in this way is something that is often done to transform a POMDP into an MDP. That is, although the true state \(s_t\) at time t is unknown in a POMDP, the information \(I_t\), and consequently our belief about \(s_t\), is always known at time t. Inspired by the POMDP terminology, we may therefore view a PDT as a belief-state, which seems natural as the PDT is essentially a way to encode our beliefs about some real physical system.

Hence, we will proceed with defining the state space S as the information state-space, which is the set of all sets of information I. Although this is a very generic approach, we will show that there is a way of defining a flexible parametric class of functions on S. But we must emphasize that if there are other ways of compressing the information I, for instance due to conjugacy in the epistemic updating, then these are probably much more efficient. Example 5.1 below shows exactly what we mean by this.

Example 5.1

(Coin flip – information state-space) In the coin flip example (Example 2.3), all of our belief with respect to epistemic uncertainty is represented by the number \(\psi = P(\theta = \theta _1)\). Given some observation \(Y = y \in \{0, 1 \}\), the epistemic conditioning corresponds to

$$\begin{aligned} \psi \rightarrow \frac{\beta _1(y) \psi }{\beta _1(y) \psi + \beta _2(y) (1 - \psi )}, \end{aligned}$$

where, for \(j = 1, 2\), \(\beta _j(y) = \theta _j\) if \(y = 0\) and \(\beta _j(y) = 1-\theta _j\) if \(y = 1\).

In this example, the information state-space consists of all sets of the form \(I_t = \{ y_1, \dots , y_t \}\) where each \(y_i\) is binary. However, if the goal is to let \(I_t\) be the representation of a PDT, we could just as well use \(\psi _t\), i.e. define \(S = [0, 1]\) as the state space. Alternatively, the number of heads and tails (0s and 1s) provides the same information, so we could also make use of \(S = \{ 0, \dots , N \} \times \{ 0, \dots , N \}\) where N is an upper limit on the total number of flips we consider.
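A minimal sketch of this epistemic conditioning, and of how the compressed state \(\psi _t\) can replace the full information set \(I_t\), is given below; the particular values \(\theta _1 = 0.99\) and \(\theta _2 = 0.5\) anticipate the reliability example in Sect. 5.5 and are otherwise arbitrary.

```python
def update_psi(psi, y, theta_1, theta_2):
    """Epistemic conditioning in the coin flip example: the posterior
    P(theta = theta_1) after observing a single flip y in {0, 1}."""
    b1 = theta_1 if y == 0 else 1.0 - theta_1
    b2 = theta_2 if y == 0 else 1.0 - theta_2
    return b1 * psi / (b1 * psi + b2 * (1.0 - psi))

# The compressed state psi carries the same information as the full set I_t,
# e.g. after observing the flips [0, 0, 1] with a 50/50 prior:
psi = 0.5
for y in [0, 0, 1]:
    psi = update_psi(psi, y, theta_1=0.99, theta_2=0.5)
```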

5.4 Deep learning on the information state-space

Let S be a set of sets \(I \subset {\mathbb {R}}^d\). We will assume that each set \(I \in S\) consists of a finite number of elements \(y \in {\mathbb {R}}^d\), but we do not require that all sets I have the same size. We are interested in functions defined on S.

A function f is permutation invariant if \(f\left( \left\{ {{\textbf {y}}}_1, \dots , {{\textbf {y}}}_N \right\} \right) = f\left( \left\{ {{\textbf {y}}}_{\kappa (1)}, \dots , {{\textbf {y}}}_{\kappa (N)} \right\} \right)\) for any permutation \(\kappa\). It can be shown that, under fairly mild assumptions, such functions have the following decomposition

$$\begin{aligned} f(I) = \rho \left( \sum _{{{\textbf {y}}}\in I} \phi ({{\textbf {y}}}) \right) . \end{aligned}$$
(10)

These sum decompositions were studied by Zaheer et al. [35] and later by Wagstaff et al. [36], who showed that if \(|I| \le p\) for all \(I \in S\), then any continuous function \(f :S \rightarrow {\mathbb {R}}\) can be written as (10) for some suitable functions \(\phi :{\mathbb {R}}^d \rightarrow {\mathbb {R}}^p\) and \(\rho :{\mathbb {R}}^p \rightarrow {\mathbb {R}}\). The motivation in [35, 36] was to enable supervised learning of permutation invariant and set-valued functions, by replacing \(\rho\) and \(\phi\) with flexible function approximators, such as Gaussian processes or neural networks. Other forms of decomposition, replacing the summation in (10) with something else that can be learned, have also been considered by Soelch et al. [37]. For reinforcement learning, we will make use of the form (10) to represent functions defined on the information state-space S, such as \({\hat{v}}(s| {{\textbf {w}}})\), \({\hat{q}}(s, a| {{\textbf {w}}})\), or \({\hat{\pi }}(a | s, {{\textbf {w}}})\), using a neural network with parameter \({{\textbf {w}}}\). In the remaining part of this paper we present two examples showing how this works in practice.
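Before turning to these examples, the following is a minimal sketch of the decomposition (10) with \(\phi\) and \(\rho\) as small multi-layer perceptrons; the use of PyTorch, the layer sizes and the class name are our own choices and are not prescribed by [35, 36].

```python
import torch
import torch.nn as nn

class SetFunction(nn.Module):
    """f(I) = rho(sum_{y in I} phi(y)), as in Eq. (10), with phi and rho as MLPs."""
    def __init__(self, d_in, d_latent=16, d_out=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_latent))
        self.rho = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_out))

    def forward(self, I):
        # I is a tensor of shape (n, d_in) holding the n elements of the set;
        # summing over the set dimension makes the output permutation invariant.
        return self.rho(self.phi(I).sum(dim=0))

# A set of 3 observations in R^6: the output is unchanged under any permutation of the rows
f = SetFunction(d_in=6, d_out=4)
value = f(torch.randn(3, 6))
```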

5.5 The “coin flip” example

Throughout this paper we have presented a series of small examples involving a biased coin, represented by \({X} = (Y, \theta )\). In Example 4.4 we ended by introducing a game where the player has to select whether to bet on, test or replace the coin. As a simple illustration we will show how reinforcement learning can be applied in this setting.

But now, we will imagine that the coin Y represents a component in some physical system, where \(Y = 0\) corresponds to the component functioning and \(Y = 1\) represents failure. The probability \(P(Y = 1) = 1 - \theta\) is then the component’s failure probability, and we say that \(\theta\) is the reliability.

For simplicity we assume that \(\theta \in \{ 0.5, 0.99 \}\), and that our initial belief is \(P(\theta = 0.5) = 0.5\). That is, when we buy a new component, there is a 50% chance of getting a “bad” component (that fails 50% of the time), and consequently a 50% probability of getting a “good” component (that fails 1% of the time).

We consider a project going over \(N = 10\) days. Each day we will decide between one of the following 4 actions:

  1. 1.

    Test the component (flip the coin once). Cost \(r = -10.000 \).

  2. 2.

    Replace the component (buy a new coin). Cost \(r = -100.000 \).

  3. 3.

    Use the component (bet on the outcome). Obtain a reward of \(r = 10^6 \) if the component works (\(Y = 0\)) and a cost of \(r = -10^6 \) if the component fails (\(Y = 1\)).

  4. 4.

    Terminate the project (set \(t = N\)), \(r = 0\).

We will find a deterministic policy \(\pi :S \rightarrow A\) that maps from the information state-space to one of the four actions. The information state-space S is here represented by the number of days left of the project, \(n = N - t\), together with the set \(I_t\) of observations of the component that is currently in use at time t. If we let \(S_Y\) contain all sets of the form \(I = \{ Y_1, \dots , Y_t \}\), for \(Y_i \in \{ 0, 1 \}\) and \(t < N\), then

$$\begin{aligned} S = S_Y \times \{ 1, \dots , N \} \end{aligned}$$
(11)

represents the information state-space. In this example we use the deep Q-learning (DQN) approach described by van Hasselt et al. [34]. Here we define a neural network

$$\begin{aligned} {\hat{q}}(s| {{\textbf {w}}}) = {\hat{q}}(I, n| {{\textbf {w}}}):S \rightarrow {\mathbb {R}}^4, \end{aligned}$$

that represents the action-value of each of the four actions. The optimal policy is then obtained by, at each state s, selecting the action corresponding to the maximal component of \({\hat{q}}\). We implement this by writing \({\hat{q}}(I, n| {{\textbf {w}}}) = h(f(I| {{\textbf {w}}}_f), n| {{\textbf {w}}}_h)\), where \(h(f, n | {{\textbf {w}}}_h) :{\mathbb {R}}^4\times {\mathbb {R}} \rightarrow {\mathbb {R}}^4\), and the function \(f(I|{{\textbf {w}}}_f)\) taking sets as input is defined as in (10) using two functions \(\phi (\,\cdot \,|{{\textbf {w}}}_\phi ) :{\mathbb {R}} \rightarrow {\mathbb {R}}^4\) and \(\rho (\,\cdot \,|{{\textbf {w}}}_\rho ) :{\mathbb {R}}^4 \rightarrow {\mathbb {R}}^4\). We let \(h(\,\cdot \,|{{\textbf {w}}}_h)\), \(\phi (\,\cdot \,|{{\textbf {w}}}_\phi )\) and \(\rho (\,\cdot \,|{{\textbf {w}}}_\rho )\) be multi-layer perceptrons, so that \({\hat{q}}(\,\cdot \,| {{\textbf {w}}})\) becomes a feedforward neural network with a combined parameter vector \({{\textbf {w}}}= ({{\textbf {w}}}_h, {{\textbf {w}}}_\phi , {{\textbf {w}}}_\rho )\) that is optimized using the Q-learning algorithm.
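For illustration, a minimal sketch of the corresponding environment (the 10-day game with the four actions and rewards listed above) is given below; the class name, the Gym-style interface and the state encoding as (observations of the current component, days left) are our own choices, intended to be plugged into a Q-learning loop such as the one sketched in Sect. 5.1.

```python
import random

class ComponentEnv:
    """The game of Sect. 5.5: each day choose to test (0), replace (1),
    use (2) or terminate (3) the component."""
    def __init__(self, N=10, thetas=(0.5, 0.99)):
        self.N, self.thetas = N, thetas

    def reset(self):
        self.theta = random.choice(self.thetas)   # unknown true reliability
        self.I, self.n = [], self.N               # observations, days left
        return (tuple(self.I), self.n)

    def step(self, a):
        r = 0.0
        if a == 0:                                # test: one flip, cost 10.000
            self.I.append(0 if random.random() < self.theta else 1)
            r = -10_000
        elif a == 1:                              # replace: new component, cost 100.000
            self.theta = random.choice(self.thetas)
            self.I, r = [], -100_000
        elif a == 2:                              # use: +/- 10^6 depending on the outcome
            y = 0 if random.random() < self.theta else 1
            self.I.append(y)
            r = 1_000_000 if y == 0 else -1_000_000
        else:                                     # terminate the project (set t = N)
            self.n = 1
        self.n -= 1
        return (tuple(self.I), self.n), r, self.n == 0
```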

We start by finding a policy that optimizes the cumulative reward over the 10 days (without discounting). To test the model, we run a large number of episodes and store the total reward after each episode. As it turns out, this policy prefers to “gamble” that the component works rather than performing tests. In the case where the starting component is reliable (which happens 50% of the time), a high reward can be obtained by selecting action 3 at every opportunity. The general “idea” with this policy is that if action 3 results in failure, the following action is to replace the component (action 2), unless there are few days left of the project, in which case action 4 is selected. We call this the “unconstrained” policy.

Although the unconstrained policy gives the largest expected reward, there is an approximately 50% chance that it will produce a failure, i.e. that action 3 is selected with \(Y = 1\) as the resulting outcome. One way to reduce this failure probability is to introduce the constraint that action 3 (using the component) is not allowed unless we have a certain level of confidence that the component is reliable. We introduce this type of constraint by requiring that \(P(\theta = 0.99) > 0.9\) (a constraint on epistemic uncertainty). The optimal policy under this constraint will start with running experiments (action 1), before deciding whether to replace (action 2), use the component (action 3), or terminate the project (action 4). Figure 4 shows the distributions of the cumulative reward over 1000 simulated episodes, for the constrained and unconstrained policies obtained by RL, together with a completely random policy for comparison.

Fig. 4

Total reward after 1000 episodes for a random policy, the unconstrained policy, and the agent which is subjected to the constraint that action 3 is not allowed unless \(P(\theta = 0.99) > 0.9\)

In this example, the information state-space could also be defined in a simpler way, as explained in Example 5.1. As a result, the reinforcement learning task is simplified. Using the two different state-space representations, we obtained the same results, shown in Fig. 4. Finally, we should note that in the case where defining the state space as in (11) is necessary, the constraint \(P(\theta = 0.99) > 0.9\) is not practical. That is, if we could estimate this probability efficiently, then we would also have access to the compressed information state-space. One alternative could then be to instead consider the uncertain failure probability \(p_f(\theta ) = P(Y = 1 | \theta )\), and set a limit on e.g. \(E[p_f] + 2 \cdot \text {Std}(p_f)\). This is the approach taken in the following example concerning failure probability estimation.

5.6 Corroded pipeline example

Here we revisit the corroded pipeline example from Agrell and Dahl [16] which we introduced in Sect. 3.4. In this example, we specify the epistemic uncertainty with respect to model discrepancy, the size of a defect, and the capacity \(p_{\text {FE}}\) coming from a Finite Element simulation. If we let \(\theta\) be the epistemic generator, we can write the failure probability conditioned on epistemic information as \(p_f(\theta ) = P(g \le 0 | \theta )\). In [16] the following objective was considered: Determine with confidence whether \(p_f(\theta ) < 10^{-3}\). That is, when we consider \(p_f\) as a purely epistemic random variable, we want to either confirm that the failure probability is less than the target \(10^{-3}\) (in which case we can continue operations as normal), or to detect with confidence that the target is exceeded (and we have to intervene). We say the objective is achieved if we obtain either \(E[p_f] + 2 \cdot \text {Std}(p_f) < 10^{-3}\) or \(E[p_f] - 2 \cdot \text {Std}(p_f) > 10^{-3}\) (where \(E[p_f]\) and \(\text {Std}(p_f)\) can be efficiently approximated using the method developed in [16]). There are three ways in which we can reduce epistemic uncertainty:

  1. 1

    Defect measurement: Noise perturbed measurement that reduces uncertainty in the defect size d.

  2. 2

    Computer experiment: Evaluate \(p_{\text {FE}}\) at some selected input (Dtsdl), to reduce uncertainty in the surrogate \({\hat{p}}_{\text {FE}}\) used to approximate \(p_{\text {FE}}\).

  3. 3

    Lab experiment: Obtain one observation of \(X_{\text {m}}\), which reduces uncertainty in \(\mu _{\text {m}}\).

The set of information corresponding to defect measurements is \(I_{\text {Measure}} \subset {\mathbb {R}}\) as each measurement is a real valued number. Similarly, \(I_{\text {Lab}} \subset {\mathbb {R}}\) as well, and \(I_{\text {FE}} \subset {\mathbb {R}}^6\) when we consider a vector \({{\textbf {y}}}\in {\mathbb {R}}^6\) as an experiment \([D, t, s, d, l, p_{\text {FE}}]\). In this example, we exploit conjugacy in the representation of \(I_{\text {Measure}}\) and \(I_{\text {Lab}}\) as discussed in Example 5.1. All relevant information from \(I_{\text {Measure}}\) is represented by the posterior distribution of the quantity that is measured, which is normally distributed defined by its mean and standard deviation. The same is true for \(I_{\text {Lab}}\). Hence, the information represented by \(I_{\text {Measure}}\) and \(I_{\text {Lab}}\) can be represented by a single vector in \({\mathbb {R}}^4\) (see [16] for details). We therefore define the information state-space as \(S = S_{\text {FE}} \times {\mathbb {R}}^4\), where \(S_{\text {FE}}\) consists of finite subsets of \({\mathbb {R}}^6\). We use RL to determine which of the three types of experiments to perform, and define the action space \(A = \{ \text {Measurement}, \text {FE}, \text {Lab} \}\). Note that when we decide to run a computer experiment, we also have to specify the input (Dtsdl). This is a separate decision making problem regarding design of experiments. For this we make use of the myopic (one-step lookahead) method developed in [16], although one could in principle use RL for this as well. This kind of decision making, where one first decides between different types of tasks to perform, and then proceeds to find a way to perform the selected task optimally, is often referred to as hierarchical RL in the reinforcement learning literature. Actually, [16] considers a myopic alternative for also selecting between the different types of experiments, and it was observed that this might be challenging in practice if there are large differences in cost between the experiments. This was the motivation for studying the current example, where we now define the reward (cost) r as a direct consequence of \(a \in A\) as follows: \(r = -10\) for \(a = \text {Measurement}\), \(r = -1\) for \(a = \text {Lab}\) and \(r = -0.1\) for \(a = \text {FE}\).
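As a small illustration of the reward structure and stopping rule used in this example, consider the sketch below; estimate_pf_moments is a hypothetical placeholder for the approximation of \(E[p_f]\) and \(\text {Std}(p_f)\) from [16], and only the reward mapping and the termination check are shown.

```python
# Rewards (costs) per action, as specified above
COST = {"Measurement": -10.0, "Lab": -1.0, "FE": -0.1}
PF_TARGET = 1e-3

def objective_achieved(mean_pf, std_pf, target=PF_TARGET):
    """Stop once the two-standard-deviation interval for the epistemic failure
    probability lies entirely below or entirely above the target value.
    mean_pf and std_pf would come from a routine such as estimate_pf_moments(pdt),
    a placeholder for the approximation method of [16]."""
    return mean_pf + 2.0 * std_pf < target or mean_pf - 2.0 * std_pf > target
```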

In this example we also use the DQN approach of van Hasselt et al. [34], where we define a neural network

$$\begin{aligned} {\hat{q}}(s, {{\textbf {w}}}) :S = S_{\text {FE}} \times {\mathbb {R}}^4 \rightarrow {\mathbb {R}}^3, \end{aligned}$$

that gives, for each state s, the (near optimal) value of each of the three actions. The neural network is constructed in the same way as in the “coin flip” example in Sect. 5.5, and trained with the same Q-learning approach (Algorithm 1 [38]).

The objective in this RL example is to estimate a failure probability using as few resources as possible. If an agent achieves the criterion on epistemic uncertainty reduction, namely that the expected failure probability plus or minus two standard deviations lies entirely above or below the target value, we say that the agent has succeeded, and we report the sum of the costs of all performed experiments. We also set a maximum limit of 40 experiments; after 40 tries the agent has failed. To compare against the policy obtained by RL, we consider the random policy that selects between the three actions uniformly at random. We also consider a more “human like” benchmark policy, which corresponds to first running 10 computer experiments, followed by one lab experiment, then one defect measurement, then 10 new computer experiments, and so on.

The final results from simulating 100 episodes with each of the three policies are shown in Fig. 5.

Fig. 5

Total cost (negative reward) after 100 successful episodes. For the random and benchmark policies, the success rate (achieving the objective within 40 experiments in total) was around 60%, whereas the RL agent succeeded in 94% of the episodes

6 Concluding remarks

To conclude our discussion, we recall that in this paper, we have:

  • Given a measure-theoretic discussion of epistemic uncertainty and formally defined epistemic conditioning.

  • Provided a mathematical definition of a probabilistic digital twin (PDT).

  • Connected PDTs with sequential decision making problems, and discussed several solution approaches (maximum principle, dynamic programming, MDP and POMDP).

  • Argued that using (deep) reinforcement learning to solve sequential decision making problems in the PDT is a good choice for practical applications today.

  • For the specific use-case of optimal information gathering, we proposed a generic solution using deep reinforcement learning on the information state-space.

Further research in this direction includes looking at alternative solution methods and reinforcement learning algorithms in order to handle different PDT frameworks. A possible idea is to use a maximum principle (MP) approach instead of the DPP approach that reinforcement learning is based on. By using one of the MP based algorithms in [32], we may avoid the Markovianity requirement and possible time-inconsistency issues (see e.g. Rudloff et al. [39]), and we can also allow for state constraints. For instance, this is of interest when the objective of the sequential decision making problem in the PDT is to minimize a risk measure such as CVaR or VaR, see Cheridito and Stadje [40] and Artzner et al. [41], respectively. Both of these risk measures are known to cause time-inconsistency in the Bellman equation, and hence the DPP (and also reinforcement learning) cannot be applied in a straightforward manner. Another relevant topic is whether formal methods for model checking can be applied to the reinforcement learning approach considered in this paper. This is particularly important for extending beyond cost optimization towards optimizing decisions that may also impact the level of safety of the physical system. The application of PDTs for safe reinforcement learning (see e.g. [42]) is therefore a relevant topic. This is work in progress.