1 Abbreviations

BSDE: Backward stochastic difference equation

i.i.d.: Independent and identically distributed

MAP estimator: Maximum a posteriori estimator

DR-expectation: Either “data-driven robust expectation” or “divergence robust expectation” (the acronym is deliberately ambiguous)

StaticUP: Static generators, uncertain prior framework

DynamicUP: Dynamic generators, uncertain prior framework

StaticDR: Static generators, DR-expectation framework

DynamicDR: Dynamic generators, DR-expectation framework

2 Introduction

Filtering is a common problem in many applications. The essential concept is that there is an unseen Markov process, which influences the state of some observed process, and our task is to approximate the state of the unseen process using a form of Bayes’ theorem. Many results have been obtained in this direction, most famously the Kalman filter (Kalman 1960; Kalman and Bucy 1961), which assumes the underlying processes considered are Gaussian, and gives explicit formulae accordingly. Similarly, under the assumption that the underlying process is a finite-state Markov chain, a general formula to calculate the filter can be obtained (the Wonham filter, Wonham (1965)). These results are well known, in both discrete and continuous time (see Bain and Crisan (2009) or Cohen and Elliott (2015) Chapter 21 for further general discussion).

In this paper, we consider a simple setting in discrete time, where the underlying process is a finite-state Markov chain. Our concern is to study uncertainty in the dynamics of the underlying processes, in particular, its effect on the behaviour of the corresponding filter. That is, we assume that the observer has only imperfect knowledge of the dynamics of the underlying process and of their relationship with the observation process, and wishes to incorporate this uncertainty in their estimates of the unseen state. We are particularly interested in allowing the level of uncertainty in the filtered state to be endogenous to the filtering problem, arising from the uncertainty in parameter estimates and process dynamics.

We model this uncertainty in a general manner, using the theory of nonlinear expectations, and concern ourselves with a description of uncertainty for which explicit calculations can be carried out, and which can be motivated by considering statistical estimation of parameters. We then apply this to building a dynamically consistent expectation for random variables based on future states, and to a general control problem, with learning, under uncertainty.

2.1 Basic filtering

Consider two stochastic processes, X={Xt}t≥0 and Y={Yt}t≥0. Let Ω be the space of paths of (X,Y) and \(\mathbb {P}\) be a probability measure on Ω. We denote by \(\{\mathcal {F}_{t}\}_{t\ge 0}\) the (completed) filtration generated by X and Y and denote by \(\mathcal {Y}=\{\mathcal {Y}_{t}\}_{t\ge 0}\) the (completed) filtration generated by Y. The key problem of filtering is to determine estimates of ϕ(Xt) given \(\mathcal {Y}_{t}\), that is, \(\mathbb {E}_{\mathbb {P}}[\phi (X_{t})|\mathcal {Y}_{t}]\), where ϕ is an arbitrary Borel function.

Suppose that X is a Markov chain with (possibly time-dependent) transition matrix \(A_{t}^{\top }\) under \(\mathbb {P}\) (the transpose here saves notational complexity later). Without loss of generality, we assume that X takes values in the standard basis vectors \(\{e_{i}\}_{i=1}^{N}\) of \(\mathbb {R}^{N}\) (where N is the number of states of X), and so we write

$$X_{t} = A_{t} X_{t-1} + M_{t},$$

where \(\mathbb {E}_{\mathbb {P}}[M_{t+1}|\mathcal {F}_{t}] = 0\), so \(\mathbb {E}_{\mathbb {P}}[X_{t}|\mathcal {F}_{t-1}] = A_{t} X_{t-1}\).

We suppose the process Y is multivariate real-valued. The law of Y depends on X, in particular, the \(\mathbb {P}\)-distribution of Yt given {Xs}s≤t∪{Ys}s<t (that is, given all past observations of X and Y and the current state of X) is

$$Y_{t} \sim c(y;t, X_{t})d\mu(y)$$

for μ a reference measure on \(\left (\mathbb {R}^{d}, \mathcal {B}\left (\mathbb {R}^{d}\right)\right)\), where ∼ is used to indicate the density of the distribution of a random variable.

For simplicity, we assume that Y0≡0, so no information is revealed about X0 at time 0. It is convenient to write Ct(y)=C(y;t) for the diagonal matrix with entries c(y;t,ei), so that

$$C_{t}(y) X_{t} = c(y;t, X_{t})X_{t}.$$

Note that these assumptions, in particular the values of A and C, depend on the choice of probability measure \(\mathbb {P}\). Conversely, as our space Ω is the space of paths of (X,Y), the measure \(\mathbb {P}\) is determined by A and C. We call A and C the generators of our probability measure.

As we have assumed Xt takes values in the standard basis in \(\mathbb {R}^{N}\), the expectation \(\mathbb {E}_{\mathbb {P}}[X_{t}|\mathcal {Y}_{t}]\) determines the entire conditional distribution of Xt given \(\mathcal {Y}_{t}\). In this discrete time context, the filtering problem can be solved in a fairly simple manner: Suppose we have already calculated \(p_{t-1}:=\mathbb {E}_{\mathbb {P}}[X_{t-1}|\mathcal {Y}_{t-1}]\). Then, by linearity and the dynamics of X, using the fact

$$\mathbb{E}_{\mathbb{P}}\left[M_{t}\left|\mathcal{Y}_{t-1}\right.\right] = \mathbb{E}_{\mathbb{P}}\left[\left.\mathbb{E}_{\mathbb{P}}\left[M_{t}\left|\mathcal{F}_{t-1}\right.\right]\right|\mathcal{Y}_{t-1}\right]=0,$$

we can calculate

$$\mathbb{E}_{\mathbb{P}}\left[X_{t}\left|\mathcal{Y}_{t-1}\right.\right] = \mathbb{E}_{\mathbb{P}}\left[\left.A_{t} X_{t-1}+ M_{t}\right|\mathcal{Y}_{t-1}\right] = A_{t} p_{t-1}.$$

Bayes’ theorem then states that, with probability one, with ∝ denoting equality up to proportionality,

$$\mathbb{P}\left(\left.X_{t}=e_{i}\right|\mathcal{Y}_{t}\right) = \mathbb{P}\left(X_{t}=e_{i}\left|\{Y_{s}\}_{s< t}, Y_{t}\right.\right) \propto c(Y_{t};t, e_{i}) \mathbb{P}(X_{t}=e_{i}|\mathcal{Y}_{t-1}),$$

which can be written in a simple matrix form, as given in the following theorem which summarizes the classical filter.

Theorem 1

For X a hidden Markov chain with transition matrix \(A^{\top }_{t}\), and Y an observation process with conditional density (given Xt) given by

$$Y_{t}|X_{t} \sim c(y;t,X_{t})d\mu(y) = \mathbf{1}^{\top} C_{t}(y)X_{t} d\mu(y),$$

the conditional distribution \(\mathbb {E}[X_{t}|\mathcal {Y}_{t}] = p_{t}\) satisfies the recursion

$$ p_{t} = \mathbb{G}(p_{t-1}, A_{t}, C_{t}, Y_{t}) := \frac{C_{t}(Y_{t}) A_{t} p_{t-1}}{\mathbf{1}^{\top} C_{t}(Y_{t}) A_{t} p_{t-1}}, $$
(1)

where 1 denotes a vector with all components 1.

We call pt the “filter state” at time t. Note that, if we assume the density c is positive, At is irreducible and pt−1 has all entries positive, then pt will also have all entries positive.
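For concreteness, the recursion (1) is straightforward to implement. The following Python sketch is purely illustrative (the function names and the Gaussian observation density are our own choices, not part of the development above); it performs one filter step given the vector of observation densities evaluated at Yt.

```python
import numpy as np
from scipy.stats import norm

def filter_step(p_prev, A, c_vals):
    """One step of the recursion (1): p_t = G(p_{t-1}, A_t, C_t, Y_t).

    p_prev : length-N probability vector p_{t-1} = E[X_{t-1} | Y_{t-1}]
    A      : N x N matrix with columns summing to one, so E[X_t | X_{t-1}] = A X_{t-1}
    c_vals : length-N vector of densities c(Y_t; t, e_i), the diagonal of C_t(Y_t)
    """
    u = c_vals * (A @ p_prev)   # C_t(Y_t) A_t p_{t-1}
    return u / u.sum()          # normalise by 1^T C_t(Y_t) A_t p_{t-1}

# Illustration: two states, Gaussian observations with state-dependent means.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
means = np.array([-1.0, 1.0])
p = np.array([0.5, 0.5])
for y in (0.3, -1.2, 0.8):      # a few hypothetical observations
    p = filter_step(p, A, norm.pdf(y, loc=means))
```

As noted above, if the densities are positive and A is irreducible, each step preserves positivity of the entries of p.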

In practice, the key problem with implementing these methods is the requirement that we know the underlying transition matrix A and the density C. These are generally not known perfectly, but need to be estimated prior to the implementation of the filter. Uncertainty in the choice of these parameters will lead to uncertainty in the estimates of the filtered state, and the aim of this paper is to derive useful representations of that uncertainty.

As variation in the choice of A and C corresponds to a different choice of measure \(\mathbb {P}\), we see that using an uncertain collection of generators corresponds naturally to uncertainty regarding \(\mathbb {P}\). This type of uncertainty, where the probability measure is not known, is commonly referred to as “Knightian” uncertainty (with reference to Knight (1921), related ideas are also discussed by Keynes (1921) and Wald (1945)).

Effectively, we wish to consider the propagation of uncertainty in Bayesian updating (as the filter is simply a special case of this). Huber and Ronchetti (2009, p. 331) briefly touch on this, but argue (based on earlier work by Kong) that this propagation is computationally infeasible. Their approach, however, was based on Choquet integrals, rather than nonlinear (convex) expectations in the style of Peng (2010) and others. In the coming sections, we see how the structure of nonlinear expectations allows us to derive comparatively simple rules for updating.

Remark 1

While we will present our theory in the context where X is a finite state Markov chain, our approach does not depend in any significant way on this assumption. In particular, it would be equally valid, mutatis mutandis, if we supposed that X followed the dynamics of the Kalman filter, and our uncertainty was on the coefficients of the filter. We specialize to the Markov chain case purely for the sake of concreteness.

The aim of this paper is to provide, with a minimum of technical complexity, the basic structures which underlie this approach to filtering with a nonlinear expectation. It proceeds as follows: In Section 3, we give some key facts about the measures which naturally appear in a filtering context. In Section 4, we introduce the theory of nonlinear expectations, and a means of connecting these with statistical estimation from Cohen (2017). Section 5 unites these expectations with filtering, giving recursive equations which replace the filtering equation in Theorem 1; it also outlines some concrete simplifications of this general structure, depending on whether the underlying parameters can vary through time and how new information is to be incorporated. In Section 6, we consider dynamic properties of this nonlinear expectation when looking at future events, and some connections with the theory of (discrete time) BSDEs. Finally, Section 7 considers a generic control problem in this context.

3 Conditionally Markov measures

In order to incorporate learning in our nonlinear expectations and filtering, it is useful to extend slightly from the family of measures previously described. In particular, we wish to allow the dynamics to depend on past observations, while preserving enough Markov structure to enable filtering. The following classes of probability measures will be of interest.

Definition 1

We write \(\mathcal {M}_{1}\) for the space of probability measures equivalent to a reference measure \(\mathbb {P}\).

Let \(\mathcal {M}_{M}\subset \mathcal {M}_{1}\) denote the probability measures under which

  • X is a Markov chain, that is, for all t, Xt+1 is independent of \(\mathcal {F}_{t}\) given Xt;

  • {Ys}s≥t+1 is independent of \(\mathcal {F}_{t}\) given Xt+1;

  • both X and Y are time homogeneous, that is, the conditional distributions of Xt+1|Xt and Yt|Xt do not depend on t.

Let \(\mathcal {M}_{M|\mathcal {Y}}\subset \mathcal {M}_{1}\) denote the probability measures under which

  • X is a conditional Markov chain, that is, for all t, Xt+1 is independent of \(\mathcal {F}_{t}\) given Xt and {Ys}s≤t; and

  • {Ys}s≥t+1 is independent of \(\mathcal {F}_{t}\) given {Xt+1}∪{Ys}s≤t.

We note that, if we consider a measure in \(\mathcal {M}_{M|\mathcal {Y}}\), there is a natural notion of the generators A and C. In particular, \(\mathcal {M}_{M}\) corresponds to those measures under which the generators A and C are constant, while \(\mathcal {M}_{M|\mathcal {Y}}\) corresponds to those measures under which the generators A and C are functions of time and {Ys}s<t (i.e. \(\{\mathcal {Y}_{t}\}_{t \ge 0}\)-predictable processes).

For each t, these generators determine the measure on \(\mathcal {F}_{t}\) given \(\mathcal {F}_{t-1}\), and (together with the distribution of X0) this determines the measure at all times. It is straightforward to verify that our filtering equations hold for all measures in \(\mathcal {M}_{M|\mathcal {Y}}\), with the appropriate modification of the generators.

Definition 2

For a measure \(\mathbb {Q}\in \mathcal {M}_{M|\mathcal {Y}}\), we write \(\left (A^{\mathbb {Q}},C^{\mathbb {Q}}(\cdot)\right)\) for the generator of (X,Y) under \(\mathbb {Q}\), recalling that \(C^{\mathbb {Q}}_{t}(y) = \text {diag}\left (\left \{c^{\mathbb {Q}}_{t}(y; e_{i})\right \}_{i=1}^{N}\right)\), and that \(A^{\mathbb {Q}}_{t}\) and \(C^{\mathbb {Q}}_{t}\) are now allowed to depend on {Ys}s<t. For notational convenience, we shall typically not write the dependence on {Ys}s<t explicitly.

Similarly, for a \(\{\mathcal {Y}_{t}\}_{t\ge 0}\)-predictable process (At,Ct(·))t≥0 taking values in the product of the space of transition matrices and the space of diagonal matrix-valued functions, where each diagonal element is a probability density on \(\mathbb {R}^{d}\), and p0 a probability vector in \(\mathbb {R}^{N}\), we write \(\mathbb {Q}(A,C, p_{0})\) for the measure with generator (At,Ct(·))t≥0 and initial distribution \(\mathbb {E}_{\mathbb {Q}}[X_{0}]=p_{0}\).

In what follows, we will be variously wishing to restrict a measure \(\mathbb {Q}\) to a σ-algebra, and to condition a measure on a σ-algebra. To prevent notational confusion, we shall write \(\mathbb {Q}\|_{\mathcal {F}}\) for the restriction of \(\mathbb {Q}\) to \(\mathcal {F}\), and \(\mathbb {Q}|_{\mathcal {F}}\) for \(\mathbb {Q}\) conditioned on \(\mathcal {F}\).

In our setting, our fundamental problem is that we do not know what measure is “true”, and so work instead under a family of measures. In this context, measure changes can be described as follows.

Proposition 1

Let \(\bar {\mathbb {P}}\) be a reference probability measure under which X is a sequence of i.i.d. uniform random variables from the basis vectors \(\{e_{1},...e_{N}\}\subset \mathbb {R}^{N}\) and {Yt}t≥0 is independent of X, with i.i.d. distribution Yt∼dμ. The measure \(\mathbb {Q}(A,C, p_{0})\in \mathcal {M}_{M|\mathcal {Y}}\) has Radon–Nikodym derivative (or likelihood)

$$\frac{d\mathbb{Q}(A,C, p_{0})\|_{\mathcal{F}_{T}}}{d\bar{\mathbb{P}}\|_{\mathcal{F}_{T}}} = N\left(X_{0}^{\top} p_{0}\right) \prod_{t=1}^{T} \left(\left(X_{t}^{\top} A_{t} X_{t-1}\right)\left(\mathbf{1}^{\top} C(Y_{t}) X_{t}\right)\right). $$

Remark 2

The requirement that \(\bar {\mathbb {P}}\) is a probability measure is unnecessary (i.e., in Bayesian parlance, the reference distribution may be improper). For example, we can use Lebesgue measure on \(\mathbb {R}\) as the marginal reference measure μ for Yt without difficulty, in which case ct(y) is the usual (Lebesgue) density of the distribution of Yt.

Proof

A simple verification of this result is possible by factoring the proposed Radon–Nikodym density as the product of three terms:

$$\frac{d\mathbb{Q}(A,C, p_{0})\|_{\mathcal{F}_{T}}}{d\bar{\mathbb{P}}\|_{\mathcal{F}_{T}}} = \frac{X_{0}^{\top} p_{0}}{1/N} \cdot \prod_{t=1}^{T} \left(X_{t}^{\top} A_{t} X_{t-1}\right)\cdot \prod_{t=1}^{T} \left(\mathbf{1}^{\top} C(Y_{t}) X_{t}\right).$$

The first term, \((X_{0}^{\top } p_{0})/(1/N)\), changes the distribution of X0 from uniform to p0, as is seen from the calculation

$$E_{\bar{\mathbb{P}}}\left[X_{0}\left(X_{0}^{\top} p_{0}\right)/(1/N)\right] = N E_{\bar{\mathbb{P}}}\left[X_{0} X_{0}^{\top}\right] p_{0} = N N^{-1} I_{N} p_{0} = p_{0}.$$

This is clearly a probability density with respect to \(\bar {\mathbb {P}}\).

The second term, \(\prod _{t=1}^{T} \left (X_{t}^{\top } A_{t} X_{t-1}\right)\), changes the conditional distribution of Xt|{Xs}s<t (for each t) from a uniform distribution to the probability vector AtXt−1, as can be demonstrated by a calculation similar to that for X0 in the first term. As the columns of At sum to one, it is easy to verify that this product has expectation 1 (conditional on X0) and is nonnegative, that is, it is a probability density which does not modify the distribution of X0.

The third term changes the conditional distribution of Yt|{Xt}∪{Xs,Ys}s≠t from μ to \(\mathbf {1}^{\top } C(y)X_{t}\, d\mu (y)=c(y;X_{t})d\mu (y)\). This is most easily seen by calculating

$$E_{\mathbb{Q}}[ g(Y_{t})|\{X_{t}, X_{s}, Y_{s}\}_{s\neq t}] = \int g(y) \mathbf{1}^{\top} C(y) X_{t} d\mu(y) = \int g(y)c(y; X_{t})d\mu(y) $$

for a general bounded Borel function g. As c is defined to be a density, it is again easy to verify that the product \(\prod \left (\mathbf {1}^{\top } C(Y_{t}) X_{t}\right)\) is a probability density with respect to \(\bar {\mathbb {P}}\), and that it does not modify the distribution of X.

As we are on a canonical space, the measure \(\mathbb {Q}\) is determined by the laws of X and Y, and so we have the result. □

The above proposition gives a Radon–Nikodym derivative adapted to the full filtration \(\{\mathcal {F}_{t}\}_{t\ge 0}\). In practice, it is also useful to consider the corresponding Radon–Nikodym derivative adapted to the observation filtration \(\{\mathcal {Y}_{t}\}_{t\ge 0}\). As this filtration is generated by the process Y, it is enough to multiply together the conditional distributions of \(Y_{t}|\mathcal {Y}_{t-1}\), leading to the following convenient representation. Recall that

$$\sum_{i} p_{i} c_{t}(y; e_{i}) = \mathbf{1}^{\top} C_{t}(y) p.$$

Proposition 2

For \(\mathbb {Q}(A,C, p_{0})\in \mathcal {M}_{M|\mathcal {Y}}\), the Radon–Nikodym derivative restricted to \(\mathcal {Y}_{t}\) is given by

$$L^{\text{obs}}(\mathbb{Q}(A,C, p_{0})|\mathbf{y}):=\frac{d\mathbb{Q}(A,C, {p_{0}})\|_{\mathcal{Y}_{T}}}{d{\bar{\mathbb{P}}}\|_{\mathcal{Y}_{T}}} = \prod_{t=1}^{T} \mathbf{1}^{\top} C_{t}(Y_{t}) A_{t} p_{t-1}^{(A,C), p_{0}}, $$

where \(p_{t-1}^{(A,C), p_{0}}\) is the solution to the filtering problem in the measure \(\mathbb {Q}(A,C,p_{0})\), as determined by (1) (and so includes further dependence on {Ys}s<t).

Proof

The distribution of \(Y_{t}|\mathcal {Y}_{t-1}\) is determined by (for Borel sets B)

$$\begin{aligned} E_{\mathbb{Q}}[1_{Y_{t}\in B}|\mathcal{Y}_{t-1}] &= E_{\mathbb{Q}}\left[E_{\mathbb{Q}}[1_{Y_{t}\in B}|\mathcal{Y}_{t-1}\vee \sigma(X_{t})] \Big|\mathcal{Y}_{t-1}\right] \\ &= E_{\mathbb{Q}}\left[\int_{B} \mathbf{1}^{\top} C_{t}(y)X_{t} d\mu(y)\Big|\mathcal{Y}_{t-1}\right]\\ &= \int_{B} \mathbf{1}^{\top} C_{t}(y)E_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t-1}] d\mu(y) \\&= \int_{B} \mathbf{1}^{\top} C_{t}(y)A_{t} p_{t-1}^{(A,C),p_{0}} d\mu(y). \end{aligned} $$

As \(Y_{t}|\mathcal {Y}_{t-1}\) has distribution μ under \(\bar {\mathbb {P}}\), it follows that \(\mathbf {1}^{\top } C_{t}(Y_{t})A_{t} p_{t-1}^{(A,C),p_{0}}\) is the Radon–Nikodym density of the conditional law of \(Y_{t}|\mathcal {Y}_{t-1}\) under \(\mathbb {Q}\) with respect to its law under \(\bar {\mathbb {P}}\). As {Ys}s≤T generates \(\mathcal {Y}_{T}\), the result follows by induction. □
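Proposition 2 also suggests how the observed likelihood is computed in practice: the normalizing constants discarded in (1) are exactly the factors of Lobs. The following Python sketch (assuming, for brevity, time-homogeneous generators; the names are our own) accumulates the log-likelihood alongside the filter:

```python
import numpy as np

def log_likelihood_obs(p0, A, C, ys):
    """log L_obs(Q(A,C,p0)|y) from Proposition 2, accumulated alongside the filter.

    C : function mapping an observation y to the length-N vector {c(y; e_i)}
    """
    p, loglik = np.asarray(p0, dtype=float), 0.0
    for y in ys:
        u = C(y) * (A @ p)            # C(Y_t) A p_{t-1}
        loglik += np.log(u.sum())     # log of 1^T C(Y_t) A p_{t-1}
        p = u / u.sum()               # filter update, as in (1)
    return loglik
```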

In order to apply classical statistical methods, we take a (generic) parameterization of this family of measures. This will also allow us to encode which parts of the generators we believe are static (and so can be learnt from observations), and which are dynamic (and so will violate stationarity).

Assumption 1

For fixed m>0, we assume we are given a (Borel measurable) function \(\Phi :\mathbb {N}\times \mathbb {R}^{m}\times \mathbb {R}^{m} \to \mathbb {A}\) such that the generators satisfy

$$\Phi(t, \mathfrak{S}, \mathfrak{D}_{t})=(A_{t},C_{t}),$$

where \(\mathfrak {S}\) is a static parameter, and \(\mathfrak {D}_{t}\) is a parameter which may vary at each point in time. We write \(\mathcal {Q}\) for the family of measures in \(\mathcal {M}_{M|\mathcal {Y}}\) induced by this parameterization, and typically omit to write the argument t of Φ.

With a slight abuse of notation, if \((A_{t},C_{t})=\Phi (t, \mathfrak {S}, \mathfrak {D}_{t})\), we write \(\mathbb {Q}(\mathfrak {S}, \mathfrak {D}, p_{0})\) as an alias for the measure \(\mathbb {Q}(A, C, p_{0})\), and \(\mathbb {G}(t, \mathfrak {S}, \mathfrak {D}_{t}, p)\) as an alias for the function \(\mathbb {G}(t, A, C, p)\) defined in (1).

4 Nonlinear expectations

In this section, we introduce the concepts of nonlinear expectations and convex risk measures, and discuss their connection with penalty functions on the space of measures. These objects provide a technical foundation with which to model the presence of uncertainty in a random setting. This theory is explored in some detail in Föllmer and Schied (2002b). Other key works which have used or contributed to this theory, in no particular order, are Hansen and Sargent (2008) (see also Hansen and Sargent (2005, 2007) for work related to what we present here); Huber and Ronchetti (2009); Peng (2010); El Karoui et al. (1997); Delbaen et al. (2010); Duffie and Epstein (1992); Rockafellar et al. (2006); Riedel (2004) and Epstein and Schneider (2003). We base our terminology on that used in Föllmer and Schied (2002b) and Delbaen et al. (2010).

We here present, without proof, the key details of this theory as needed for our analysis.

Definition 3

For a σ-algebra \(\mathcal {G}\) on Ω, let \(L^{\infty }(\mathcal {G})\) denote the space of essentially bounded \(\mathcal {G}\)-measurable random variables. A nonlinear expectation on \(L^{\infty }(\mathcal {G})\) is a map

$$\mathcal{E}:L^{\infty}(\mathcal{G}) \to \mathbb{R}$$

satisfying the assumptions

  • Strict Monotonicity: for any \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {G})\), if ξ1≥ξ2 a.s., then \(\mathcal {E}(\xi _{1}) \geq \mathcal {E}(\xi _{2})\) and, if in addition \(\mathcal {E}(\xi _{1})=\mathcal {E}(\xi _{2})\), then ξ1=ξ2 a.s.;

  • Constant triviality: for any constant k, \(\mathcal {E}(k)=k\);

  • Translation equivariance: for any \(k\in \mathbb {R}\), \(\xi \in L^{\infty }(\mathcal {G})\), \(\mathcal {E}(\xi +k)= \mathcal {E}(\xi)+k\).

A “convex” expectation also satisfies

  • Convexity: for any λ∈[0,1], \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {G})\),

    $$\mathcal{E}(\lambda \xi_{1}+ (1-\lambda) \xi_{2}) \leq \lambda \mathcal{E}(\xi_{1})+ (1-\lambda) \mathcal{E}(\xi_{2}).$$

If \(\mathcal {E}\) is a convex expectation, then the operator defined by \(\rho (\xi) = \mathcal {E}(-\xi)\) is called a convex risk measure. A particularly nice class of convex expectations is those which satisfy

  • Lower semicontinuity: For a sequence \(\{\xi _{n} \}_{n\in \mathbb {N}}\subset L^{\infty }(\mathcal {G})\) with ξn↑ξ pointwise (and \(\xi \in L^{\infty }(\mathcal {G})\)), we have \(\mathcal {E}(\xi _{n}) \uparrow \mathcal {E}(\xi)\).

The following theorem (which was expressed in the language of risk measures) is due to Föllmer and Schied (2002a) and Frittelli and Rosazza Gianin (2002).

Theorem 2

Suppose \(\mathcal {E}\) is a lower semicontinuous convex expectation. Then there exists a “penalty” function \(\mathcal {R}: \mathcal {M}_{1}\to [0,\infty ]\) such that

$$\mathcal{E}(\xi) = \sup_{\mathbb{Q}\in \mathcal{M}_{1}} \left\{\mathbb{E}_{\mathbb{Q}}[\xi] -\mathcal{R}(\mathbb{Q})\right\}.$$

Provided \(\mathcal {R}(\mathbb {Q})<\infty \) for some \(\mathbb {Q}\) equivalent to \(\mathbb {P}\), we can restrict our attention to measures in \(\mathcal {M}_{1}\) equivalent to \(\mathbb {P}\) without loss of generality.

Remark 3

This result gives some intuition as to how a convex expectation can model “Knightian” uncertainty. One considers all the possible probability measures on the space, and then selects the maximal expectation among all measures, penalizing each measure depending on its plausibility, as measured by \(\mathcal {R}(\cdot)\). As convexity of \(\mathcal {E}\) is a natural requirement of an “uncertainty averse” assessment of outcomes, Theorem 2 shows that this is the only way to construct an “expectation” \(\mathcal {E}\) which penalizes uncertainty, while preserving monotonicity, translation equivariance, and constant triviality (and lower semicontinuity).

In order to relate our penalty function with the temporal structure of filtering, we focus our attention on measures in our parametric family \(\mathcal {Q}\), defined in Assumption 1. We also allow the penalty \(\mathcal {R}\) to depend on time. In the analysis of this paper, the following definition allows us to obtain a (forward) recursive structure in our nonlinear expectations, as one might expect in a filtering context.

Definition 4

We say a family of penalty functions \(\{\mathcal {R}_{t}\}_{t\ge 0}\), is additive if it can be written in the form

$$\mathcal{R}_{t}(\mathbb{Q}) = \left\{\begin{array}{cc} \left(\frac{1}{k}\alpha_{t}(\mathbb{Q}, \{Y_{s}\}_{s\le t})\right)^{k'}& \text{if }\mathbb{Q}\in \mathcal{Q}\\+\infty & \text{otherwise,} \end{array}\right. $$

where, if \(\mathbb {Q} = \mathbb {Q}(\mathfrak {S}, \mathfrak {D}, p_{0})\) and \(p^{\mathbb {Q}}_{t}\) is the solution of the filtering Eq. 1 under \(\mathbb {Q}\), the function αt is of the form

$$\alpha_{t}\left(\mathbb{Q}, \{Y_{s}\}_{s\le t}\right) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) + \sum_{s\le t} \gamma_{s}\left(\mathfrak{S}, \mathfrak{D}_{s}, \{Y_{n}\}_{n\le s}, p_{s-1}^{\mathbb{Q}}\right)+m_{t}.$$

Here k and k′ are positive constants, κprior and {γt}t≥0 are known real functions bounded below, and mt is a \(\mathcal {Y}_{t}\)-measurable scalar random variable which ensures the normalization condition \(\inf _{\mathbb {Q}} \alpha \left (\mathbb {Q}, \{Y_{s}\}_{s\le t}\right)= 0\) holds for almost all observation sequences {Ys}s≥0.

4.1 DR-expectations

From the discussion above, it is apparent that we can focus our attention on calculating the penalty function \(\mathcal {R}\), rather than working with the nonlinear expectation directly. This penalty function is meant to encode how “unreasonable” a probability measure \(\mathbb {Q}\) is as a model for our outcomes.

In Cohen (2017), we have considered a framework which links the choice of the penalty function to statistical estimation of a model. The key idea of Cohen (2017) is to use the negative log-likelihood function for this purpose, where the likelihood is taken against an arbitrary reference measure, and evaluated using the observed data. This directly uses the statistical information from observations to quantify our uncertainty.

In this paper, we make a slight extension of this idea, to explicitly incorporate prior beliefs. In particular, we replace the log-likelihood with the log-posterior density, which in turn gives an additional term in the penalty.

Definition 5

For \(\mathbb {Q}\in \mathcal {Q}\), the observed likelihood \(L^{\text {obs}}(\mathbb {Q}|\mathbf {y})\) is given in Proposition 2. Inspired by a “Bayesian” approach, we augment this by the addition of a prior distribution over \(\mathcal {Q}\). Suppose a (possibly improper) prior is given, with density in terms of the parameters \((p_{0}, \mathfrak {S}, \{\mathfrak {D}_{t}\}_{t\ge 0})\)

$$\exp\left(-\kappa_{\text{prior}}(\mathfrak{S}, p_{0}) - \sum_{t} \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t})\right).$$

The posterior relative density is given by the product

$$L(\mathbb{Q}|\mathbf{y}) = L^{\text{obs}}(\mathbb{Q}|\mathbf{y})\exp\left(-\kappa_{\text{prior}}(\mathfrak{S}, p_{0}) - \sum_{t} \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t})\right).$$

The “\(\mathcal {Q}|\mathbf {y}_{t}\)-divergence” is defined to be the normalized negative log-posterior relative density

$$ \alpha_{\mathbf{y}}(\mathbb{Q}):= -\log\left(L(\mathbb{Q}|\mathbf{y})\right) + \sup_{\tilde{\mathbb{Q}}\in \mathcal{Q}}\left\{\log\left(L\left(\tilde{\mathbb{Q}}|\mathbf{y}\right)\right)\right\}. $$
(2)

Remark 4

The right-hand side of (2) is well defined whether or not a maximum a posteriori (MAP) estimator exists. Given a MAP estimate \(\hat {\mathbb {Q}}\in \mathcal {Q}\), we would have the simpler representation

$$\alpha_{\mathbf{y}}(\mathbb{Q}):= -\log\left(\frac{L(\mathbb{Q}|\mathbf{y})}{L(\hat{\mathbb{Q}}|\mathbf{y})}\right).$$

Definition 6

For fixed observations yt=(Y1,Y2,...,Yt), for uncertainty aversion parameters k>0 and k′∈[1,∞], we define the convex expectation

$$ \mathcal{E}_{\mathbf{y}_{t}}^{k,k'}(\xi):= \sup_{\mathbb{Q}\in \mathcal{Q}}\left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}] -\left(\frac{1}{k}\alpha_{\mathbf{y}_{t}}(\mathbb{Q})\right)^{k'}\right\}, $$
(3)

where we adopt the convention \(x^{\infty }=0\) for x∈[0,1] and +∞ otherwise.

We call \(\mathcal {E}_{\mathbf {y}_{t}}^{k,k'}\) the “DR-expectation” (with parameters k,k′). We may omit to write k,k′ for notational simplicity.

With deliberate ambiguity, the acronym “DR” can either stand for “divergence robust” or “data-driven robust”.

Remark 5

By construction, \(\mathcal {Q}\) is parameterized by \(\mathfrak {S}\) and \(\{\mathfrak {D}_{t}\}_{t\ge 0}\), which lie in \(\mathbb {R}^{m}\) for some m. The divergence and conditional expectations given yt are continuous with respect to this parameterization and can be constructed to be Borel measurable with respect to yt. Consequently, measure theoretic concerns which arise from taking the supremum will not cause difficulty, in particular, the DR-expectation defined in (3) is guaranteed to be a Borel measurable function of yt for every ξ. (This follows from Filippov’s implicit function theorem, see, for example, Cohen and Elliott (2015) Appendix 10.)

Remark 6

The choice of parameters k and k′ determines much of the behaviour of the nonlinear expectation. The role of k is simple: it scales the uncertainty aversion, as a higher value of k results in smaller penalties, and hence a DR-expectation lying further above the MAP expectation. The parameter k′ determines the “curvature” of the uncertainty aversion. Taking k′=∞ results in the DR-expectation being positively homogeneous, that is, it is a coherent expectation in the sense of Artzner et al. (1999). In Cohen (2017), the asymptotic behaviour of the DR-expectation is studied, under the assumption of i.i.d. observations. For k′=1, the DR-expectation corresponds (for large samples) to the expected value under the maximum likelihood model plus k/2 times the sampling variance of the expectation, while for k′=∞, the DR-expectation corresponds to the expected value under the maximum likelihood model plus \(\sqrt {2k}\) times the sampling standard error of the expectation. In this paper, we will not be considering such an asymptotic result, so the values of k and k′ will not play a significant role. Their presence nevertheless gives a more general class of penalty functions in Definition 4, and they are kept for notational consistency with other papers considering the DR-expectation.

Remark 7

In principle, we could now apply the DR-expectation framework to a filtering context as follows: Take a collection of models \(\mathcal {Q}\). For a random variable ξ, and for each measure \(\mathbb {Q}\in \mathcal {Q}\), compute \(E_{\mathbb {Q}}[\xi |\mathbf {y}_{t}]\) and \(\alpha _{\mathbf {y}_{t}}(\mathbb {Q})\). Taking a supremum as in (3), we obtain the DR-expectation. However, this is generally not computationally tractable in this form.
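To make the steps of Remark 7 concrete, the following Python sketch evaluates (3) by brute force over a finite grid of candidate generators, with a flat prior (so that the divergence (2) reduces to a difference of log-likelihoods). All names are our own illustrative choices; this is exactly the computation whose cost motivates the recursive methods of Section 5.

```python
import numpy as np

def dr_expectation(phi, p0, ys, candidate_generators, k=1.0, k_prime=1.0):
    """Brute-force DR-expectation (3) of phi(X_t) over candidate models.

    phi : length-N array of values phi(e_i)
    candidate_generators : list of pairs (A, C), with A an N x N transition
        matrix (columns summing to one) and C a function y -> {c(y; e_i)}
    """
    results = []
    for A, C in candidate_generators:
        p, loglik = np.asarray(p0, dtype=float), 0.0
        for y in ys:                          # filter + likelihood, Prop. 2
            u = C(y) * (A @ p)
            loglik += np.log(u.sum())
            p = u / u.sum()
        results.append((p @ phi, loglik))     # (E_Q[phi(X_t)|y_t], log L_obs)
    best = max(l for _, l in results)         # sup of log-likelihoods, as in (2)
    return max(e - ((best - l) / k) ** k_prime for e, l in results)
```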

Lemma 1

Let \(\{\mathcal {F}_{t}\}_{t\ge 0}\) be a filtration such that Y is adapted. For \(\mathcal {F}_{t}\)-measurable random variables, the choice of horizon Tt in the definition of the penalty function α is irrelevant. In particular, for \(\mathcal {F}_{t}\)-measurable ξ, if \(\mathcal {Q}\|_{\mathcal {F}_{t}}=\{\mathbb {Q}\|_{\mathcal {F}_{t}}: \mathbb {Q}\in \mathcal {Q}\}\), we know

$$\mathcal{E}_{\mathbf{y}_{t}}(\xi) = \sup_{\mathbb{Q}\in \mathcal{Q}\|_{\mathcal{F}_{t}}}\left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}] -\left(\frac{1}{k}\alpha_{\mathbf{y}_{t}}(\mathbb{Q}\|_{\mathcal{Y}_{t}})\right)^{k'}\right\}, $$

where \(\alpha _{\mathbf {y}_{t}}(\mathbb {Q}\|_{\mathcal {Y}_{t}})\) is defined as above, in terms of the restricted measure \(\mathbb {Q}\|_{\mathcal {Y}_{t}}\).

Proof

By construction, \(\alpha _{\mathbf {y}_{t}}\phantom {\dot {i}\!}\) is obtained from the posterior relative density, which is determined by the restriction of \(\mathbb {Q}\) to \(\mathcal {Y}_{t}\subseteq \mathcal {F}_{t}\), while the expectation depends only on the restriction of \(\mathbb {Q}\) to \(\mathcal {F}_{t}\). As these are the only terms needed to compute the DR-expectation, the result follows. □

Theorem 3

The penalty in the DR-expectation is additive, in the sense of Definition 4.

Proof

From Proposition 2, we have the likelihood

$$L^{\text{obs}}(\mathbb{Q}(A, C, p_{0})|\mathbf{y}_{t}) = \prod_{s=1}^{t}\mathbf{1}^{\top} C_{s}(Y_{s}) A_{s} p_{s-1}^{(A,C), p_{0}},$$

where \(p_{s}^{(A,C), p_{0}}\) is the solution to the filtering Eq. 1. By Lemma 1, the penalty in the DR-expectation is given by

$$\begin{aligned} &\alpha_{\mathbf{y}_{t}}(\mathbb{Q}\|_{\mathcal{Y}_{t}})\\ &= - \log\left(L^{\text{obs}}(\mathbb{Q}(A, C, p_{0})|\mathbf{y}_{t}) \cdot e^{-\kappa_{\text{prior}}(\mathfrak{S}, p_{0}) - \sum_{t} \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t})}\right)-m_{t}\\ &= \kappa_{\text{prior}}(\mathfrak{S}, p_{0}) + \sum_{s\le t}\left[ \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{s}) -\log\left(\mathbf{1}^{\top} C_{s}(Y_{s}) A_{s} p_{s-1}^{(A,C), p_{0}}\right)\right]-m_{t}, \end{aligned} $$

where mt is chosen to ensure \(\inf _{\mathbb {Q}}(\alpha _{\mathbf {y}_{t}}(\mathbb {Q}) \equiv 0\). As \((A_{t}, C_{t}(\cdot)) = \Phi (\mathfrak {S}, \mathfrak {D}_{t})\), we obtain the desired form by setting

$$\gamma_{t}(\mathfrak{S}, \mathfrak{D}_{t}, \{Y_{s}\}_{s\le t}, p) = \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t}) -\log\left(\mathbf{1}^{\top} C_{t}(Y_{t}) A_{t} p\right).$$

Remark 8

The purpose of the nonlinear expectation is to give an “upper” estimate of a random variable, accounting for uncertainty in the underlying probabilities. This is closely related to robust estimation in the sense of Wald (1945). In particular, one can consider the robust estimator given by

$$\mathrm{arg\,inf}_{\hat\xi\in\mathbb{R}^{N}} \mathcal{E}_{\mathbf{y}_{t}}\left(\left\|\xi -\hat \xi\right\|^{2}\right),$$

which gives a “minimax” estimate of ξ, given the observations yt and a quadratic loss function. The advantage of the nonlinear expectation approach is that it allows one to construct such an estimate for every random variable/loss function, giving a cost-specific quantification of uncertainty in each case.

We can also see a connection with the theory of H∞ filtering (see, for example, Grimble and El Sayed (1990) or more recently Zhang et al. (2009) and references therein, or the more general H∞-control theory in Başar and Bernhard (1991)). In this setting, we look for estimates which perform best in the worst-case situation, where “worst” is usually defined in terms of a perturbation to the input signal or coefficients. In our setting, we focus not on the estimation problem directly, but on the “dual” problem of building an upper expectation, i.e., calculating the “worst” expectation over a class of perturbations to the coefficients (our setting is general enough that perturbation of the signal can also be included through shifting the coefficients).

Remark 9

There are also connections between our approach and what is called “risk-sensitive filtering”, see, for example, James et al. (1994); Dey and Moore (1995); or the review of Boel et al. (2002) and references therein (from an engineering perspective); or Hansen and Sargent (2007); Hansen and Sargent (2008) (from an economic perspective). In their setting, one uses the nonlinear expectation defined by

$$\mathcal{E}(\xi|\mathcal{Y}_{t}) = -k \log \mathbb{E}_{\mathbb{P}}\big[\exp(- \xi/k)\big|\mathcal{Y}_{t}\big],$$

for some choice of robustness parameter 1/k>0. This leads to significant simplification, as dynamic consistency and recursivity are guaranteed in every filtration (see Graf (1980) and Kupper and Schachermayer (2009), and further discussion in Section 6). The corresponding penalty function is given by the conditional relative entropy,

$$\mathcal{R}_{t}(\mathbb{Q}) = k \mathbb{E}_{\mathbb{Q}}[\log(d\mathbb{Q}/d\mathbb{P})|\mathcal{Y}_{t}],$$

which is additive (Definition 4) and the one-step penalty can be calculated accordingly. In this case, the optimization defining the nonlinear expectation could also be taken over \(\mathcal {M}_{1}\), so this approach has a claim to be including “nonparametric” uncertainty, as all measures are considered, rather than purely Markov measures or measures in a parametric family (however, the optimization can be taken over conditionally Markov measures, and one will obtain an identical result!).

The difficulty with this approach is that it does not allow for easy incorporation of knowledge of the error of estimation of the generators (A,C) into the level of robustness: the only parameter available to choose is k, which multiplies the relative entropy. A small choice of k corresponds to a small penalty, hence a very robust expectation, but this robustness is not directly linked to the estimation of the generators (A,C). Therefore, the impact of statistical estimation error remains obscure, as k is chosen largely exogenously to this error. For this reason, our approach, which directly allows for the penalty to be based on the statistical estimation of the generators, has advantages over this simpler method.
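For comparison, the risk-sensitive expectation of Remark 9 is a one-line computation given the filter state; a brief sketch (illustrative names):

```python
import numpy as np

def entropic_expectation(phi, p, k=1.0):
    """Risk-sensitive expectation -k log E_P[exp(-phi(X_t)/k) | Y_t],
    where the conditional law of X_t is the filter state p."""
    return -k * np.log(p @ np.exp(-np.asarray(phi) / k))
```

As k→∞, this recovers the linear expectation p⊤φ, while small k gives a very conservative value, consistent with the discussion above.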

5 Recursive penalties

The DR-expectation provides us with an approach to including statistical estimation in our valuations. However, the calculations suggested by Remark 7 are generally intractable in their stated form. In this section, we shall see how the assumption that the penalty is additive (Definition 4) can be used to simplify our calculations.

Our arguments will be based on dynamic programming techniques. For the sake of precision and brevity, we here state a (forward in time) abstract “dynamic programming principle” which we can call on in later arguments.

Theorem 4

Let U be a topological space which is equal to the countable union of compact metrizable subsets of itself. For some m>0, let \(g_{t}:\mathbb {R}^{m}\times U\to \mathbb {R}^{m}\) be a sequence of Borel measurable functions. For any sequence u=(u1,...,uT) in U, let the sequence \(Z_{t}^{\mathbf {u},z_{0}}\) be defined by the recursion

$$Z_{t}^{\mathbf{u},z_{0}} = g_{t}\left(Z_{t-1}^{\mathbf{u},z_{0}}, u_{t}\right) \qquad \text{and}\qquad Z_{0}^{\mathbf{u},z_{0}} = z_{0}.$$

For each u∈U, we write g−1(·,u) for the (set-valued) inverse of g(·,u).

Suppose we have a sequence of Borel measurable maps \(\mathcal {A}_{t}:\mathbb {R}\times U\times \mathbb {R}^{m}\to \mathbb {R}\) such that \(v\mapsto \mathcal {A}(v,u,z)\) is nondecreasing and continuous (uniformly in u,z) and \(\mathcal {A}(v, u,z) \to \infty \) as v→∞. For each u and z0, we define the sequence of values at each time t by

$$V_{t}(\mathbf{u}, z_{0}) = \mathcal{A}_{t}\left[V_{t-1}(\mathbf{u},z_{0}), u_{t}, Z^{\mathbf{u},z_{0}}_{t-1}\right]\qquad \text{and}\qquad V_{0}(\mathbf{u},z_{0}) = v_{0}(z_{0}).$$

Then, the minimal value,

$$V^{*}_{t}(z) := \inf_{\left\{\mathbf{u}, z_{0}: Z^{\mathbf{u},z_{0}}_{t}=z\right\}} V_{t}(\mathbf{u},z_{0}),$$

satisfies the recursion

$$V^{*}_{t}(z) = \inf_{u\in U} \inf_{y\in g^{-1}_{t}(z,u)}\left\{\mathcal{A}_{t}\left[V^{*}_{t-1}(y), u, y\right]\right\} \qquad \text{and}\qquad V^{*}_{0}(z_{0})=v_{0}(z_{0}), $$
(4)

(with the convention that the infimum of the empty set is +∞).

Proof

We proceed by induction. Clearly, the result holds at t=0, as does the (at t=0 empty) statement

$$V^{*}_{t}(z) = +\infty \quad \text{for all }z\not\in\bigcup_{\mathbf{u}, z_{0}}\left\{ Z^{\mathbf{u},z_{0}}_{t}\right\}. $$
(5)

Suppose then that (4) and (5) hold at t=n−1. For every ε>0, there exists (u,z0) such that

$$\begin{aligned} V_{n}(\mathbf{u}, z_{0}) &= \mathcal{A}_{n}\left[V_{n-1}(\mathbf{u},z_{0}), u_{n}, Z^{\mathbf{u}}_{n-1}\right]\\ &\leq \mathcal{A}_{n}\left[V_{n-1}^{*}\left(Z^{\mathbf{u}}_{n-1}\right)+\epsilon, u_{n}, Z^{\mathbf{u}}_{n-1}\right]. \end{aligned} $$

Taking the infimum over \(\left \{\mathbf {u}, z_{0}:Z_{n}^{\mathbf {u}} = z\right \}\) (which can be done measurably with respect to z, given Filippov’s implicit function theorem; see, for example, Cohen and Elliott (2015) Appendix 10) and sending ε→0 gives

$$\begin{aligned} \inf_{\left\{\mathbf{u}, z_{0}:Z_{n}^{\mathbf{u}} = z\right\}}V_{n}(\mathbf{u}, z_{0}) & = \inf_{\left\{\mathbf{u}, z_{0}:Z_{n}^{\mathbf{u}} = z\right\}} \mathcal{A}_{n}\left[V_{n-1}^{*}\left(Z^{\mathbf{u}}_{n-1}\right), u_{n}, Z^{\mathbf{u}}_{n-1}\right]. \end{aligned} $$

From the definition of g, we know that

$$\left\{\mathbf{u},z_{0}: Z^{\mathbf{u},z_{0}}_{n} = z\right\}=\left\{\mathbf{u},z_{0}: Z^{\mathbf{u}, z_{0}}_{n-1}\in g^{-1}_{n}(z, u_{n})\right\}$$

from which we derive

$$\begin{aligned} V_{n}^{*}(z) & = \inf_{\left\{\mathbf{u},z_{0}: Z^{\mathbf{u}, z_{0}}_{n-1}\in g^{-1}_{n}(z, u_{n})\right\}} \mathcal{A}_{n}\left[V_{n-1}^{*}\left(Z^{\mathbf{u},z_{0}}_{n-1}\right), u_{n}, Z^{\mathbf{u},z_{0}}_{n-1}\right]. \end{aligned} $$

The right side of this equation depends on u,z0 only through the values of \(Z^{\mathbf {u},z_{0}}_{n-1}\) and un. In particular, considering the set of attainable y, that is, for \(y\in \bigcup _{\mathbf {u}, z_{0}}\left \{ Z^{\mathbf {u},z_{0}}_{n-1}\right \}\), we change variables to write

$$\begin{aligned} V_{n}^{*}(z) &=\inf_{u_{n}\in U}\inf_{\substack{y\in g^{-1}_{n}(z, u_{n})\\y\in \bigcup_{\mathbf{u}, z_{0}}\left\{ Z^{\mathbf{u},z_{0}}_{n-1}\right\}}} \mathcal{A}_{n}\left[V_{n-1}^{*}(y), u_{n}, y\right]. \end{aligned} $$

As the infimum on the empty set is +∞, we also obtain (5) at time n, and simplify to give (4) for t=n, completing the inductive proof. □

Corollary 1

Suppose, instead of Z being defined by a forward recursion, for some Borel measurable function \(\tilde g\) we had the backward recursion

$$\tilde g_{t}\left(Z_{t}^{\mathbf{u},z_{0}}, u_{t}\right) = Z_{t-1}^{\mathbf{u},z_{0}}.$$

The result of Theorem 4 still holds (with effectively the same proof), where we write \(\tilde g_{t}\) instead of \(g_{t}^{-1}\), so the second infimum in (4) is unnecessary (as \(\tilde g(z,u)\) is single-valued).
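On finite grids, the recursion (4) can be computed by a forward sweep which avoids forming the set-valued inverse g−1 explicitly: one iterates over previous states y and controls u, and updates the value at z=gt(y,u). The following Python sketch assumes (hypothetically) that the grids are closed under g; it illustrates Theorem 4 rather than providing a general-purpose solver.

```python
import numpy as np

def forward_dp(states, controls, g, A_op, v0, T):
    """Forward dynamic programming recursion (4) on finite grids.

    g(y, u, t)       : forward map, Z_t = g_t(Z_{t-1}, u_t); assumed to send
                       grid points to grid points
    A_op(v, u, y, t) : aggregator A_t[v, u, y], nondecreasing in v
    v0               : dict mapping each state z to V_0(z) = v_0(z)
    Returns V with V[t][z] = V*_t(z); unattained states keep +inf, as in (5).
    """
    V = [dict(v0)]
    for t in range(1, T + 1):
        Vt = {z: np.inf for z in states}
        for y in states:                  # sweep over previous states ...
            if V[t - 1][y] == np.inf:
                continue                  # ... which are attainable
            for u in controls:
                z = g(y, u, t)
                val = A_op(V[t - 1][y], u, y, t)
                if z in Vt and val < Vt[z]:
                    Vt[z] = val
        V.append(Vt)
    return V
```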

For practical purposes, it is critical that we refine our approach to provide a recursive construction of our nonlinear expectation. In classical filtering, one obtains a recursion for expectations \(\mathbb {E}[\phi (X_{t})|\mathcal {Y}_{t}]\), for Borel functions ϕ; one does not typically consider the expectations of general random variables. Similarly, we will consider the expectations of random variables ϕ(Xt).

Proposition 3

For each t, there exists a \(\mathcal {Y}_{t}\otimes \mathcal {B}(\mathbb {R})\)-measurable function κt such that, for every Borel function ϕ,

$$ \begin{aligned} \mathcal{E}_{\mathbf{y}_{t}}(\phi(X_{t})) &:= \sup_{\mathbb{Q}\in\mathcal{Q}} \left\{\mathbb{E}_{\mathbb{Q}}[\phi(X_{t})|\mathbf{y}_{t}]-\mathcal{R}_{t}(\mathbb{Q})\right\}\\ &= \sup_{q\in S_{N}^{+}}\left\{\sum_{i} q_{i} \phi(e_{i}) - \left(\frac{1}{k}\kappa_{t}(\omega, q)\right)^{k'}\right\}, \end{aligned} $$
(6)

where \(S_{N}^{+}\) denotes the probability simplex in \(\mathbb {R}^{N}\), that is, \(S_{N}^{+}=\{x\in \mathbb {R}^{N}: \sum _{i} x_{i} = 1,\quad x_{i}\geq 0 \quad \forall i\}\).

Proof

Fix the observations yt. Taking ΩX to be the possible states of Xt, that is, the basis vectors in \(\mathbb {R}^{N}\) with the discrete topology and corresponding Borel σ-algebra, the space of measures on ΩX is represented by the probability simplex \(S_{N}^{+}\). We consider the map

$$\phi\mapsto\mathcal{E}'(\phi(\omega_{X})):= \mathcal{E}_{\mathbf{y}_{t}}(\phi(X_{t}))$$

as a nonlinear expectation with underlying space ΩX. By Theorem 2, it follows that there exists a penalty function \(\kappa _{\mathbf {y}_{t}}(q)\phantom {\dot {i}\!}\) such that

$$\mathcal{E}'(\phi(\omega_{X})) = \sup_{q\in S_{N}^{+}}\left\{\sum_{i} q_{i} \phi(e_{i}) - \left(\frac{1}{k}\kappa_{\mathbf{y}_{t}}(q)\right)^{k'}\right\}.$$

Taking a regular version of this penalty function (which by convex duality exists as \(\mathcal {E}\) is measurable in yt), we can write \(\kappa _{t}(\omega, q) = \kappa _{\mathbf {y}_{t}}(q)\phantom {\dot {i}\!}\) as desired. □

Our aim is to find a recursion for κt, for various choices of \(\mathcal {R}\). Our constructions will depend on the following object.

Definition 7

Recall from Theorem 1 and Assumption 1 that, given a generator \((A_{t},C_{t}(\cdot))=\Phi (\mathfrak {S}, \mathfrak {D}_{t}) \in \mathbb {A}\) at time t, our filter dynamics are described by the recursion (up to proportionality)

$$p_{t} = \mathbb{G}(p_{t-1}, \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}) \propto C(Y_{t})A p_{t-1}.$$

We correspondingly define the (set-valued) inverse

$$\mathbb{G}^{-1}(p; \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}) = \left\{p'\in S_{N}^{+} : \mathbb{G}\left(p', \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}\right) = p\right\}.$$

For notational simplicity, we will omit the argument Yt when this does not lead to confusion.

The set \(\mathbb {G}^{-1}(p; \mathfrak {S}, \mathfrak {D}_{t},Y_{t})\) represents the filter states at time t−1 which evolve to p at time t, assuming the generator of our process (at time t) is given by \(\Phi (\mathfrak {S}, \mathfrak {D}_{t})\) and we observe Yt. This set may be empty, if no such filter states exist. As the matrix A is generally not invertible (even accounting for the restriction to \(S_{N}^{+}\)), the set \(\mathbb {G}^{-1}(p; \mathfrak {S}, \mathfrak {D}_{t}, Y_{t})\) is not generally a singleton.

5.1 Filtering with uncertainty

We now show that if we assume our penalty is additive, then the function κ appearing in (6) can be obtained in a recursive manner.

Theorem 5

Suppose \(\mathcal {R}_{t}\) is additive, in the sense of Definition 4. Then, a function κ satisfying (6) is given by

$$\kappa_{t}(p) = \inf_{\mathfrak{S}}K_{t}(p, \mathfrak{S}),$$

where Kt satisfies the recursion

$$K_{t}(p,\mathfrak{S}) = \inf_{\mathfrak{D}_{t}} \left\{\inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{S}, \mathfrak{D}_{t}, Y_{t})} \left\{K_{t-1}\left(p', \mathfrak{S}\right) + \gamma_{t}\left(\mathfrak{S}, \mathfrak{D}_{t}, \{Y_{s}\}_{s\le t}, p'\right)\right\}\right\}-m'_{t}, $$

with initial value \(K_{0}(p_{0}, \mathfrak {S}) = \kappa _{\text {prior}}(p_{0}, \mathfrak {S})\), where \(m'_{t}\) is chosen to ensure we have the normalization \(\inf _{p, \mathfrak {S}}K_{t}(p, \mathfrak {S}) \equiv 0\).

Proof

As we know that \(\mathcal {R}_{t}\) is additive, we have the representation \(\mathcal {R}_{t}(\mathbb {Q}) = \left (k^{-1} \alpha _{t}(\mathbb {Q}, \{Y_{s}\}_{s\le t})\right)^{k'}\), where

$$\alpha_{t}(\mathbb{Q}, \{Y_{s}\}_{s\le t}) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) + \sum_{s\le t} \gamma_{s}\left(\mathfrak{S}, \mathfrak{D}_{s}, \{Y_{n}\}_{n\le s}, p_{s-1}^{\mathbb{Q}}\right)+m_{t}.$$

As \(\mathbb {E}_{\mathbb {Q}}[\phi (X_{t})|\mathbf {y}_{t}]\) depends only on the conditional law of \(X_{t}|\mathcal {Y}_{t}\) under \(\mathbb {Q}\), it is easy to see that (6) is satisfied when

$$ \left(\frac{1}{k}\kappa_{t}(p)\right)^{k'} = \inf_{\{\mathbb{Q}:\mathbb{E}_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t}]= p\}}\mathcal{R}_{t}(\mathbb{Q}) = \inf_{\{\mathbb{Q}:\mathbb{E}_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t}]= p\}}\left(\frac{1}{k}\alpha_{t}(\mathbb{Q})\right)^{k'}. $$
(7)

We wish to write the minimization in (7) as a recursive control problem, to which we can apply Theorem 4. Given p0, \(\mathfrak {S}\), and \(\{\mathfrak {D}_{s}\}_{s\le t}\), the law of \(X_{t}|\mathcal {Y}_{t}\) is given by the solution to the filtering Eq. 1. Write \(Z_{t} = (p_{t}, \mathfrak {S})\), and \(u_{t} = \mathfrak {D}_{t}\), so that Z is a state process defined by \(Z_{0} = z_{0} = (p_{0}, \mathfrak {S})\) and the recursion (controlled by \(\mathbf {u} =\{u_{s}\}_{s\le t} = \{\mathfrak {D}_{s}\}_{s\le t}\))

$$Z_{t} = \hat{\mathbb{G}}(Z_{t-1}, \mathfrak{D}_{t}) := \left(\mathbb{G}\left(p_{t-1}, \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}\right),\, \mathfrak{S}\right).$$

Omitting the constant mt from the definition of αt, we define

$$V_{t}(z_{0}, \mathbf{u}) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) + \sum_{s\le t} \gamma_{s}\left(\mathfrak{S}, \mathfrak{D}_{s}, \{Y_{n}\}_{n\le s}, p_{s-1}\right).$$

Taking \(\mathcal {A}_{t}\) to be the operator

$$\mathcal{A}_{t}\left[V_{t-1}, u_{t}, Z_{t-1}^{\mathbf{u}, z_{0}}\right]= V_{t-1} + \gamma_{t}(\mathfrak{S}, u_{t}, \{Y_{s}\}_{s\le t}, p_{t-1}),$$

we see that V satisfies the structure assumed in Theorem 4. Therefore, its minimal value satisfies

$$\begin{aligned} V^{*}_{t}(z) &= \inf_{\left\{z_{0}, \mathbf{u}:Z_{t}^{z_{0}, \mathbf{u}} = z\right\}} V_{t}(z_{0}, \mathbf{u}) \\ &= \inf_{\mathfrak{D}_{t}} \left\{\inf_{(p', \mathfrak{S})= z'\in \hat{\mathbb{G}}^{-1}(z, \mathfrak{D}_{t})} \left\{V^{*}_{t-1}(z') + \gamma_{t}\left(\mathfrak{S}, \mathfrak{D}_{t}, \{Y_{s}\}_{s\le t}, p'\right)\right\}\right\} \end{aligned} $$

with initial value \(V_{0}(z) = \kappa _{\text {prior}}(p_{0}, \mathfrak {S})\). We renormalize this by setting \(m^{\prime }_{t} = \inf _{z} V^{*}_{t}(z)\) and \(K_{t}(p, \mathfrak {S}) := V^{*}_{t}(z) -m'_{t}\), and so obtain the stated dynamics for K. By construction, we know

$$K_{t}(p, \mathfrak{S}) = \inf_{\{\mathfrak{D}_{s}\}_{s\le t}}\left\{\alpha_{t}\left(\mathbb{Q}, \{Y_{s}\}_{s\leq t}\right): \mathbb{E}_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t}]= p, \mathbb{Q} = \mathbb{Q}\left(p_{0}, \mathfrak{S}, \{\mathfrak{D}_{s}\}_{s\le t}\right)\right\}.$$

It follows that (7), and hence (6), are satisfied by taking \(\kappa _{t}(p) = \inf _{\mathfrak {S}} K_{t}(p, \mathfrak {S})\), as desired. □
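In the same spirit as the proof, the recursion for Kt in Theorem 5 can be computed numerically by a forward sweep over a discretized set of filter states, avoiding the set-valued inverse 𝔾−1. A Python sketch for one time step with a fixed static parameter 𝔖 (grid, rounding, and names are our own illustrative choices):

```python
import numpy as np

def penalty_step(K_prev, grid, D_grid, G_step, gamma_t):
    """One step of the recursion for K_t(., S) in Theorem 5, S fixed.

    K_prev        : dict p' -> K_{t-1}(p', S), over grid points (tuples)
    G_step(p', D) : the filter map G(p', S, D, Y_t), rounded onto the grid
    gamma_t(D, p'): one-step penalty gamma_t(S, D, {Y_s}_{s<=t}, p')
    Sweeping forward over p' avoids computing the set-valued inverse G^{-1}.
    """
    K_t = {p: np.inf for p in grid}
    for p_prev, k_prev in K_prev.items():
        if k_prev == np.inf:
            continue
        for D in D_grid:
            p = G_step(p_prev, D)
            val = k_prev + gamma_t(D, p_prev)
            if p in K_t and val < K_t[p]:
                K_t[p] = val
    m = min(K_t.values())                 # renormalise so inf_p K_t = 0
    return {p: v - m for p, v in K_t.items()}
```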

5.2 Examples

In this section, we will seek to outline a few key settings where this theory can be applied.

5.2.1 Static generators, uncertain prior (StaticUP)

We first consider the case where uncertainty is given over the prior inputs to the filter. In particular, this “prior uncertainty” is not updated given new observations, and \(\mathcal {R}\) will not change through time.

Framework 1

(StaticUP) In a StaticUP setting, the inputs to the filtering problem are the initial filter state p0 and the generator (A,C(·)), which we parameterize solely using the static parameter \(\mathfrak {S}\). In particular, we exclude dependence on the “dynamic” parameters \(\{\mathfrak {D}_{t}\}_{t\ge 0}\). To represent our uncertain prior, we take a penalty

$$\mathcal{R}(\mathbb{Q}) = \left(\frac{1}{k} \alpha(\mathbb{Q})\right)^{k'} = \left(\frac{1}{k}\kappa_{\text{prior}}(p_{0}, \mathfrak{S})\right)^{k'}$$

for some prescribed penalty κprior.

We now apply Theorem 5 (omitting dependence on \(\mathfrak {D}_{t}\), as we are in a purely static setting) to see that a dynamic version κt of the penalty function, satisfying (6), can be computed as

$$\kappa_{t}(p) = \inf_{\mathfrak{S}}K_{t}(p, \mathfrak{S}),$$

where Kt satisfies the recursion

$$K_{t}(p,\mathfrak{S}) = \inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{S}, Y_{t})} \left\{K_{t-1}(p', \mathfrak{S})\right\}-m'_{t}. $$

Assuming \(\inf _{(p_{0}, \mathfrak {S})}\kappa _{\text {prior}}(p_{0}, \mathfrak {S})=0\), we further compute \(m^{\prime }_{t} \equiv 0\). This completely characterizes the penalty function, and hence the nonlinear expectation.

Remark 10

Inspired by the DR-expectation, a possible choice of penalty function κprior would be the negative log-density of a prior distribution for the inputs \((p_{0}, \mathfrak {S})\), shifted to have minimal value zero. Alternatively, taking an empirical Bayesian perspective, κprior could be the log-likelihood from a prior calibration process. In this case, we are incorporating our prior statistical uncertainty regarding the parameters in the filtering problem.

Remark 11

We emphasize that there is no learning of the generator being done in this framework—the penalty applied at time t=0 is simply propagated forward; our observations do not affect our opinion of the plausible generators. In particular, if we assume no knowledge of the initial state (i.e., a zero penalty), then we will have no knowledge of the state at time t (unless the observations cause the filter to degenerate).

Example 1

For a concrete example of the StaticUP framework, we take the class of models in \(\mathcal {M}_{M}\) where A and C are perfectly known and A=I, so Xt=X0 is constant (but X0 is unknown). We take N=2, so X takes only one of two values. For the observation distribution C, we assume that

$$Y_{t}|(X_{t}=e_{1}) \sim \text{Bernoulli}(a), \qquad Y_{t}|(X_{t}=e_{2}) \sim \text{Bernoulli}(b),$$

where a,b∈(0,1) are fixed constants. Effectively, in this example we are using filtering to determine which of two possible parameters is the correct mean for our observation sequence. It is worth emphasising that the filter process p corresponds to the posterior probabilities, in a Bayesian setting, of the events that our Bernoulli process has parameter a or b.

It is useful to note that, from classical Bayesian statistical calculations, for a given p0, one can see that the corresponding value of pt is determined from the log-odds ratio,

$$\log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right) = \log\left(\frac{p_{0}^{1}}{p_{0}^{2}}\right)+ {t\bar{Y}_{t}}\log\left(\frac{a}{b}\right)+t(1-\bar Y_{t})\log\left(\frac{1-a}{1-b}\right),$$

where \(\bar Y_{t} = t^{-1} \sum _{s\leq t} Y_{s}\) is the average number of successes observed at time t.
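The log-odds formula can be checked numerically against the matrix form of the filter (1); the following short Python check (with illustrative parameter values) confirms that each observation y=1 adds log(a/b) and each y=0 adds log((1−a)/(1−b)) to the log-odds:

```python
import numpy as np

a, b = 0.7, 0.4
p = np.array([0.5, 0.5])                       # p_0
log_odds = np.log(p[0] / p[1])
rng = np.random.default_rng(0)
ys = rng.binomial(1, a, size=20)               # data generated from state e_1
for y in ys:
    c = np.array([a, b]) if y == 1 else np.array([1 - a, 1 - b])
    p = c * p / (c @ p)                        # filter step (1) with A = I
    log_odds += np.log(a / b) if y == 1 else np.log((1 - a) / (1 - b))
assert np.isclose(log_odds, np.log(p[0] / p[1]))
```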

To write down the StaticUP penalty function, let the (known) dynamics be described by \(\mathfrak {S}^{*}\). Consequently, we can write \(K(p, \mathfrak {S})=\infty \) for all \(\mathfrak {S}\neq \mathfrak {S}^{*}\). We initialize with a known penalty \(\kappa _{\text {prior}}(p, \mathfrak {S}^{*})=\kappa _{0}(p)\) for all \(p\in S_{N}^{+}\). As \(\mathfrak {S}^{*}\) is known, there is no distinction between K and κ, that is,

$$\kappa_{t}(p) =\inf_{\mathfrak{S}} K_{t}(p, \mathfrak{S}) = K_{t}(p, \mathfrak{S}^{*}) = \inf_{\left\{p_{0}: \mathbb{E}_{\mathbb{Q}(\mathfrak{S}^{*}, p_{0})}[X_{t}|\mathcal{Y}_{t}] = p\right\}}\left\{\kappa_{0}(p_{0})\right\}.$$

In this example, given the closed-form solution to the filtering problem, the penalty is most simply expressed in terms of the log-odds; we can then explicitly calculate the (unique) initial distribution p0 which would evolve to a given p at time t. In particular, the time-t penalty is given by a shift of the initial penalty:

$$\kappa_{t}\bigg(\log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right)\bigg) = \kappa_{0}\bigg(\log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right)-{t\bar Y_{t}}\log\left(\frac{a}{b}\right)-t(1-\bar Y_{t})\log\left(\frac{1-a}{1-b}\right) \bigg).$$

Remark 12

This example demonstrates the following behaviour:

  • If the initial penalty is zero, then the penalty at time t is also zero—there is no learning of which state we are in.

  • When parameterized by the log-odds ratio, there is no variation in the curvature of the penalty (and so no change in our “uncertainty”), we simply shift the penalty around, corresponding to our changing posterior probabilities.

  • The update of κ is done purely using the tools of Bayesian statistics, rather than having any direct incorporation of our uncertainty.

Remark 13

We point out that this is, effectively, the model of uncertainty proposed by Walley (1991) (see, in particular, Walley (1991) section 5.3, although there he takes a model where the unknown parameter is Beta distributed). See also Fagin and Halpern (1990).

5.2.2 Dynamic generators, uncertain prior (DynamicUP)

If we model the generator (A,C) as fixed and unknown (i.e., it depends only on \(\mathfrak {S}\)), calculation of \(K_{t}(p, \mathfrak {S})\) suffers from a curse of dimensionality—the dimension of \(\mathfrak {S}\) determines the size of the domain of Kt. On the other hand, if we suppose the generator at time t depends only on the dynamic parameters \(\mathfrak {D}_{t}\), we can use dynamic programming to obtain a lower-dimensional problem.

Framework 2

(DynamicUP) In the DynamicUP setting, for an initial penalty on the initial hidden state, κprior(p0), and a penalty on the time-t generator, \(\gamma _{t}(\mathfrak {D}_{t})\), our total penalty is given by \(\mathcal {R}(\mathbb {Q}) = \left (\frac {1}{k} \alpha (\mathbb {Q})\right)^{k'}\), where we now have

$$\alpha(\mathbb{Q}) = \kappa_{\text{prior}}(p_{0}) + \sum_{s=1}^{\infty} \gamma_{s}(\mathfrak{D}_{s}).$$

In this case, as we ignore the static parameter \(\mathfrak {S}\), we simplify Theorem 5 through the identity \(\kappa _{t}(p) = K_{t}(p, \mathfrak {S})\). This yields the recursion

$$\kappa_{t}(p) = \inf_{\mathfrak{D}_{t}} \left\{\inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{D}_{t}, Y_{t})} \left\{\kappa_{t-1}(p') + \gamma_{t}(\mathfrak{D}_{t})\right\}\right\}-m'_{t} $$

and again, if we assume \(\inf _{\mathbb {Q}}\alpha (\mathbb {Q}) \equiv 0\), we then conclude \(m^{\prime }_{t} \equiv 0\).

This formulation of the uncertain filter allows us to use dynamic programming to solve our problem forward in time. In the setting of Example 1, as the generator is perfectly known, there is no distinction between the dynamic and static cases.

A continuous-time version of this setting (for a Kalman–Bucy filter) is considered in detail in Allan and Cohen (2019a).

5.2.3 Static generators, DR-expectation (StaticDR)

In the above examples, we have regarded the prior as uncertain and used this to penalize over models. We did not use the data to modify our penalty function \(\mathcal {R}\). The DR-expectation gives us an alternative approach in which the data guides our model choice more directly. In what follows, we apply the DR-expectation in our filtering context and observe that it gives a slightly different recursion for the penalty function. Again, we can consider models where our generator is constant (i.e., depends only on \(\mathfrak {S}\)) or changes dynamically (i.e., depends only on \(\mathfrak {D}_{t}\)).

Framework 3

(StaticDR) As in the StaticUP framework, we assume that the generator (A,C) is determined by the static parameter \(\mathfrak {S}\). For each \(\mathfrak {S}\), with \(\mathbb {Q}= \mathbb {Q}(A,C, p_{0})\) and \((A,C)=\Phi (\mathfrak {S})\), we have a penalty \(\mathcal {R}(\mathbb {Q}) = \left (\frac {1}{k} \alpha (\mathbb {Q})\right)^{k'}\) given by the log-posterior density

$$\alpha(\mathbb{Q}\|_{\mathcal{Y}_{t}}) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) - \log L^{\text{obs}}\left(\mathbb{Q}(A,C, p_{0})\big|\mathbf{y}_{t}\right)+m_{t}$$

which is additive, as shown in Theorem 3. Applying Theorem 5, we see that the penalty can be written \(\kappa _{t}(p) = \inf _{\mathfrak {S}} K_{t}(p, \mathfrak {S})\), where \(K_{0}(p, \mathfrak {S}) = \kappa _{\text {prior}}(p, \mathfrak {S})\) and K satisfies the recursion

$$K_{t}(p,\mathfrak{S}) = \inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{S}, Y_{t})} \left\{K_{t-1}(p', \mathfrak{S}) - \log c^{\mathfrak{S}}\left(Y_{t};A^{\mathfrak{S}} p'\right)\right\}-m'_{t}. $$

Unlike in the uncertain prior cases, we cannot typically claim that \(m^{\prime }_{t} \equiv 0\); instead, it is a random process depending on our observations.

Remark 14

Comparing Framework 1 (StaticUP) with Framework 3 (StaticDR), we see that the key distinction is the presence of the log-likelihood term. This term implies that observations of Y will affect our quantification of uncertainty, rather than purely updating each model.

Example 2

In the setting of Example 1, recall that X is constant, so we know (A,C(·)). One can calculate the StaticDR penalty either directly, or through solving the stated recursion using the dynamics of p. As in the StaticUP case, the result is most simply expressed by first calculating p0 from pt through

$$\log\left(\frac{p_{0}^{1}}{p_{0}^{2}}\right) = \log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right)-{t\bar Y_{t}}\log\left(\frac{a}{b}\right)-t\left(1-\bar Y_{t}\right)\log\left(\frac{1-a}{1-b}\right)$$

and then

$$\kappa_{t}(p_{t}) = \kappa_{0}(p_{0})- \log\left(p_{0}^{1} a^{t\bar Y_{t}}(1-a)^{t\left(1-\bar Y_{t}\right)} + p_{0}^{2} b^{t\bar Y_{t}}(1-b)^{t\left(1-\bar Y_{t}\right)}\right)-m_{t}, $$

where mt is chosen to ensure infpκt(p)=0. From this, we see that the likelihood modifies our uncertainty directly, rather than us simply propagating each model via Bayes’ rule. A consequence of this is that if we start with extreme uncertainty (κ0≡0), then our observations teach us what models are reasonable, thereby reducing our uncertainty (i.e., we will find κt(p)>0 for p∈(0,1) when t>0).
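
A quick numerical check of this learning effect, in the setting of the example (two constant hidden states and Bernoulli observations; the parameters a, b and the simulated data are illustrative choices, and we parameterize by p0, which is in bijection with pt):

```python
import numpy as np

# With kappa_0 = 0, the DR penalty becomes strictly positive away from the
# likelihood-maximising state. a, b and the simulated data are illustrative.
a, b = 0.7, 0.4
Y = np.random.default_rng(1).binomial(1, a, size=50)   # data from state e_1
t, Ybar = len(Y), Y.mean()

p0 = np.linspace(0.001, 0.999, 999)                    # grid for p_0^1
lik = (p0 * a ** (t * Ybar) * (1 - a) ** (t * (1 - Ybar))
       + (1 - p0) * b ** (t * Ybar) * (1 - b) ** (t * (1 - Ybar)))
kappa = -np.log(lik)
kappa -= kappa.min()                                   # m_t enforces inf = 0
print(kappa[[0, 499, 998]])   # > 0 except near the likelihood maximiser
```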

Remark 15

It is interesting to ask what the long-term behaviour of these uncertain filters will be. In Cohen (2017), the long-run behaviour of the DR-expectation based on i.i.d. observations is derived and, in principle, a similar analysis is possible here. Using the asymptotic analysis of maximum likelihood estimation for hidden Markov models in Leroux (1992) or Douc et al. (2011), we know that the MLE will converge with probability one to the true parameter, under appropriate regularity conditions. Here, the presence of the prior influences this slightly; however, this impact vanishes as t→∞. With further regularity assumptions, one can also show that the log-likelihood function, divided by the number of observations, almost surely converges to the relative entropy between a proposed model and the true model (see, for example, Leroux (1992) section 5). If one also knew that the relative entropy is smooth and convex, the analysis of Cohen (2017) Theorems 4 and 5 would be possible, showing that the DR-expectation corresponds to adding a term related to the sampling variance of the hidden state. In particular, as the number of observations increases, the DR-expectation will converge to the expected value under the filter with the true parameters.

5.2.4 Dynamic generators, DR-expectation (DynamicDR)

As in the uncertain prior case, it is often impractical to calculate a recursion for \(K(p, \mathfrak {S})\) given the high dimension of \(\mathfrak {S}\). We therefore consider the case when (A,C) depends only on the dynamic parameters \(\mathfrak {D}_{t}\).

Framework 4

(DynamicDR) As before, for each \(\{\mathfrak {D}_{s}\}_{s\ge 0}\), with \(\mathbb {Q}= \mathbb {Q}(A,C, p_{0})\) and \((A_{t},C_{t}) = \Phi (\mathfrak {D}_{t})\), we have a penalty \(\mathcal {R}(\mathbb {Q}) = \left (\frac {1}{k} \alpha (\mathbb {Q})\right)^{k'}\) given by the log-posterior density

$$\alpha(\mathbb{Q}\|_{\mathcal{Y}_{t}}) = \kappa_{\text{prior}}(p_{0}) - \log L^{\text{obs}}\left(\mathbb{Q}(A,C, p_{0})\big|\mathbf{y}_{t}\right) +m_{t}.$$

From Theorem 3, we know that the log-posterior density is additive. Applying Theorem 5, and the identity \(\kappa _{t}(p) = K_{t}(p, \mathfrak {S})\), we conclude that the penalty κt(p) in (6) can be computed from the recursion

$$\kappa_{t}(p) = \inf_{\mathfrak{D}_{t}} \left\{\inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{D}_{t}, Y_{t})} \left\{\kappa_{t-1}(p') + \gamma_{\text{prior}}(t,\mathfrak{D}_{t}; \{Y_{s}\}_{s< t}) - \log\left(c^{\mathfrak{D}_{t}}\left(Y_{t}; A^{\mathfrak{D}_{t}} p'\right)\right)\right\}\right\}-m_{t}', $$

with initial value κ0(p)=κprior(p), where mt′ is chosen to ensure \(\inf _{p\in S_{N}^{+}}\kappa _{t}(p)=0\) for all t.

Remark 16

We expect less difference between the dynamic uncertain prior and dynamic DR-expectation settings than between their static counterparts. This is because only limited learning is possible in the dynamic DR-expectation: as \(\mathfrak {D}_{t}\) may vary independently at every time, the DR-expectation has only a single observation with which to infer the value of each \(\mathfrak {D}_{t}\). This increases the relative importance of the prior term γprior, which describes our understanding of typical values of the generator. In practice, the key distinction between the dynamic DR-expectation and uncertain prior models appears when the initial penalty is near zero: in this case, the DR-expectation regularizes the initial state quickly, while the uncertain prior model may remain near zero indefinitely.

Example 3

In the setting of Example 2, as the dynamics are perfectly known, there is again no difference between the dynamic and static generator DR-expectation cases.

A continuous-time version of this setting (for a Kalman–Bucy filter) is considered in Allan and Cohen (2019b).

6 Dynamically consistent expectations

Up to this point, we have constructed nonlinear expectations which assess the current hidden state given the observations to date, that is, maps

$$\mathcal{E}_{\mathbf{y}_{t}}: L^{\infty}(\sigma(X_{t})\otimes \mathcal{Y}_{t}) \to L^{\infty}(\mathcal{Y}_{t}).$$

If we wish to calculate expectations of future states, then we may wish to consider doing so in a filtration-consistent manner. This is of particular importance when considering optimal control problems.

Definition 8

For a fixed horizon T>0, suppose that for each t<T we have a mapping \(\mathcal {E}(\cdot |\mathcal {Y}_{t}):L^{\infty }(\mathcal {Y}_{T}) \to L^{\infty }(\mathcal {Y}_{t})\). We say that \(\mathcal {E}\) is a \(\mathcal {Y}\)-consistent convex expectation if \(\mathcal {E}(\cdot |\mathcal {Y}_{t})\) satisfies the following assumptions, analogous to those above:

  • Strict Monotonicity: for any \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {Y}_{T})\), if ξ1≥ξ2 a.s., then \(\mathcal {E}(\xi _{1}|\mathcal {Y}_{t}) \geq \mathcal {E}(\xi _{2}|\mathcal {Y}_{t})\) a.s., and if, in addition, \(\mathcal {E}(\xi _{1}|\mathcal {Y}_{t})=\mathcal {E}(\xi _{2}|\mathcal {Y}_{t})\), then ξ1=ξ2 a.s.;

  • Constant triviality: for \(b\in L^{\infty }(\mathcal {Y}_{t})\), \(\mathcal {E}(b|\mathcal {Y}_{t})=b\);

  • Translation equivariance: for any \(b\in L^{\infty }(\mathcal {Y}_{t})\), \(\xi \in L^{\infty }(\mathcal {Y}_{T})\), \(\mathcal {E}(\xi +b|\mathcal {Y}_{t})= \mathcal {E}(\xi |\mathcal {Y}_{t})+b\);

  • Convexity: for any λ∈[0,1], \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {Y}_{T})\),

    $$\mathcal{E}(\lambda \xi_{1}+ (1-\lambda) \xi_{2}|\mathcal{Y}_{t}) \leq \lambda \mathcal{E}(\xi_{1}|\mathcal{Y}_{t})+ (1-\lambda) \mathcal{E}(\xi_{2}|\mathcal{Y}_{t});$$
  • Lower semicontinuity: for a sequence \(\{\xi _{n} \}_{n\in \mathbb {N}}\subset L^{\infty }(\mathcal {Y}_{T})\) with \(\xi _{n} \uparrow \xi \in L^{\infty }(\mathcal {Y}_{T})\) pointwise, \(\mathcal {E}(\xi _{n}|\mathcal {Y}_{t}) \uparrow \mathcal {E}(\xi |\mathcal {Y}_{t})\) pointwise for every t<T;

and the additional assumptions

  • \(\{\mathcal {Y}_{t}\}_{t\ge 0}\)-consistency: for any s<t<T, any \(\xi \in L^{\infty }(\mathcal {Y}_{T})\),

    $$\mathcal{E}(\xi|\mathcal{Y}_{s}) = \mathcal{E}(\mathcal{E}(\xi|\mathcal{Y}_{t})|\mathcal{Y}_{s});$$
  • Relevance: for any t<T, any \(A\in \mathcal {Y}_{t}\), \(\mathcal {E}(I_{A}\xi |\mathcal {Y}_{t}) = I_{A} \mathcal {E}(\xi |\mathcal {Y}_{t})\).

The assumption of \(\mathcal {Y}\)-consistency is sometimes simply called recursivity, time consistency, or dynamic consistency (and is closely related to the validity of the dynamic programming principle); however, it is important to note that this property depends on the choice of filtration. In our context, consistency with the observation filtration \(\mathcal {Y}\) is natural, as this describes the information available for us to make decisions.

Remark 17

Definition 8 is equivalent to considering a lower semicontinuous convex expectation, as in Definition 3, and assuming that for any \(\xi \in L^{\infty }(\mathcal {Y}_{T})\) and any t<T, there exists a random variable \(\xi _{t}\in L^{\infty }(\mathcal {Y}_{t})\) such that \(\mathcal {E}(I_{A} \xi) = \mathcal {E}(I_{A} \xi _{t})\) for all \(A\in \mathcal {Y}_{t}\). In this case, one can define \(\mathcal {E}(\xi |\mathcal {Y}_{t}) = \xi _{t}\) and verify that the definition given is satisfied (see Föllmer and Schied (2002b); Cohen and Elliott (2010)).

Much work has been done on the construction of dynamic nonlinear expectations (see, for example, Epstein and Schneider (2003); Duffie and Epstein (1992); El Karoui et al. (1997); Cohen and Elliott (2010); and references therein). In particular, there have been close relations drawn between these operators and the theory of BSDEs (for a setting covering the discrete-time examples we consider here, see Cohen and Elliott (2010); Cohen and Elliott (2011)).

Remark 18

The importance of \(\mathcal {Y}\)-consistency is twofold: First, it guarantees that, when using a nonlinear expectation to construct the value function for a control problem, an optimal policy will be consistent in the sense that (assuming an optimal policy exists) a policy which is optimal at time zero will remain optimal in the future. Second, \(\{\mathcal {Y}_{t}\}_{t\ge 0}\)-consistency allows the nonlinear expectation to be calculated recursively, working backwards from a terminal time. This leads to a considerable simplification numerically, as it avoids a curse of dimensionality in intertemporal control problems.

Remark 19

One issue in our setting is that our lack of knowledge does not simply line up with the arrow of time: we are unaware of events which occurred in the past, as well as those which are in the future. This leads to delicacies in questions of dynamic consistency. Conventionally, this has often been considered in a setting of “partially observed control”, and these issues are resolved by taking the filter state pt to play the role of a state variable, and solving the corresponding “fully observed control problem” with pt as underlying. In our context, we do not know the value of pt; instead, we have the (even higher dimensional) penalty function Kt as a state variable.

In the following sections, we will outline how our earlier approach can be extended to provide a dynamically consistent expectation, and how enforcing dynamic consistency will modify our perception of risk.

6.1 Asynchronous expectations

We will focus our attention on constructing a dynamically consistent nonlinear expectation for random variables in \(L^{\infty }(\sigma (X_{T})\otimes \mathcal {Y}_{T})\), given observations up to times t<T. Throughout this section, we will use the following construction:

Definition 9

Suppose we have a nonlinear expectation

$$\mathcal{E}_{\mathbf{y}_{T}}: L^{\infty}(\sigma(X_{T})\otimes \mathcal{Y}_{T}) \to L^{\infty}(\mathcal{Y}_{T})$$

constructed for our nonlinear filtering problem, as above, and are given a \(\mathcal {Y}\)-consistent family of maps

$$\overleftarrow{\mathcal{E}}(\cdot|\mathcal{Y}_{t}): L^{\infty}(\mathcal{Y}_{T}) \to L^{\infty}(\mathcal{Y}_{t}).$$

We then extend \(\overleftarrow{\mathcal {E}}\) to variables in \(L^{\infty }(\sigma (X_{T})\otimes \mathcal {Y}_{T})\) by the composition

$$\overleftarrow{\mathcal{E}}(\cdot|\mathcal{Y}_{t}) := \overleftarrow{\mathcal{E}}(\mathcal{E}_{\mathbf{y}_{T}}(\cdot)|\mathcal{Y}_{t}).$$

Given this definition, our key aim is to construct the \(\mathcal {Y}\)-consistent family \(\overleftarrow{\mathcal {E}}(\cdot |\mathcal {Y}_{t})\), in a way which “agrees” with our uncertainty in the underlying filter. As we are in discrete time, we can construct a \(\mathcal {Y}\)-consistent family through recursion, if we have its definition over each single step. The definition of the DR-expectation can be applied to generate these one-step expectations in a natural way.

Definition 10

For \(\mathcal {R}\) an additive penalty function (Definition 4), we define the one-step expectation, for \(\xi \in L^{\infty }(\mathcal {Y}_{t+1})\), by

$$\overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) = \mathrm{ess\,sup}_{\mathbb{Q}\in \mathcal{Q}} \left\{\mathbb{E}_{\mathbb{Q}}\left[\xi- \mathcal{R}_{t+1}(\mathbb{Q})|\mathcal{Y}_{t}\right] \right\}, $$

where the essential supremum is taken among the bounded \(\mathcal {Y}_{t}\)-measurable random variables. Using this, we define a \(\mathcal {Y}\)-consistent expectation \(L^{\infty }(\sigma (X_{T})\otimes \mathcal {Y}_{T}) \to L^{\infty }(\mathcal {Y}_{t})\) by recursion,

$$\overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) = \overleftarrow{\mathcal{E}}(\overleftarrow{\mathcal{E}}\left(\xi|\mathcal{Y}_{t+1}\right)|\mathcal{Y}_{t}) = \overleftarrow{\mathcal{E}}(\cdots \overleftarrow{\mathcal{E}}(\mathcal{E}_{\mathbf{y}_{T}}(\xi)|\mathcal{Y}_{T-1})\cdots |\mathcal{Y}_{t}).$$
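
When Y takes finitely many values and the supremum is restricted to a finite family of candidate models, this recursion is straightforward to implement. In the sketch below (ours), each candidate model supplies a one-step law q(y), standing in for \(\mathbf{1}^{\top} C_{t+1}(y)A_{t+1} p\), and a penalty value; both are placeholder inputs rather than quantities computed from the filter.

```python
# Sketch of the one-step expectation and its backward composition, for binary
# Y and a finite family of candidate models. Each model is a pair (q, pen):
# q[y] is its one-step law for Y_{t+1} and pen[y] its penalty R_{t+1}; these
# stand in for the filter-based quantities and are illustrative assumptions.
def one_step(xi_next, models):
    """E-arrow(xi | Y_t): sup over models of E_q[xi - pen]."""
    return max(sum(q[y] * (xi_next[y] - pen[y]) for y in (0, 1))
               for q, pen in models)

def backward(xi_T, models, T):
    """Compose one_step T times, for a payoff depending on Y_T only."""
    xi = dict(xi_T)
    for _ in range(T):
        v = one_step(xi, models)
        xi = {0: v, 1: v}   # with this payoff, the value is path-independent
    return xi[0]

models = [({0: 0.5, 1: 0.5}, {0: 0.0, 1: 0.0}),
          ({0: 0.3, 1: 0.7}, {0: 0.2, 1: 0.2})]
print(backward({0: 0.0, 1: 1.0}, models, T=3))
```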

Remark 20

It is necessary to use the penalty \(\mathcal {R}_{t+1}\) in this definition, as our penalty should include the behaviour of the generator Ct+1(·), which determines the distribution of Yt+1|Xt+1.

Recall that, as \(\mathcal {Y}\) is generated by Y, the Doob–Dynkin lemma states that any \(\mathcal {Y}_{t+1}\)-measurable function ξ is simply a function of {Ys}st+1, so we can write

$$ \xi(\omega) = \hat\xi(Y_{t+1}, \{Y_{s}\}_{s\leq t}). $$
(8)

For any conditionally Markov measure \(\mathbb {Q}\), if \(\mathbb {Q}\) has generator (At,Ct(·))t≥0, it follows that

$$\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]= \int_{\mathbb{R}^{d}} \hat{\xi}(y, \{Y_{s}\}_{s\leq t}) \left(\mathbf{1}^{\top} C_{t+1}(y)A_{t+1} p_{t}\right) d\mu(y).$$
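
For finitely many observation values, this integral reduces to a sum. For instance, with N=2 hidden states, d=2 observation values, and illustrative numbers:

```python
import numpy as np

# E_Q[xi | Y_t] as a finite sum over observation values y; all numbers are
# illustrative. C[y] = diag(c(y; e_1), c(y; e_2)) and columns of A sum to 1.
A = np.array([[0.9, 0.2], [0.1, 0.8]])
C = {0: np.diag([0.3, 0.6]), 1: np.diag([0.7, 0.4])}
p_t = np.array([0.5, 0.5])
xi_hat = {0: 2.0, 1: -1.0}           # xi as a function of Y_{t+1}
cond_exp = sum(xi_hat[y] * (np.ones(2) @ C[y] @ A @ p_t) for y in (0, 1))
print(cond_exp)                      # the weights over y sum to one
```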

In particular, we apply this integral representation to our penalty function to define the function \(\hat {\mathcal {R}}\) such that

$$ \hat{\mathcal{R}}_{t+1}(Y_{t+1}, p, \mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{\le t}) = \left(\frac{1}{k}\left(K_{t}(p, \mathfrak{S}) + \gamma_{t+1}\left(\mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t+1}, p\right)\right)\right)^{k'}. $$
(9)

Applying this to our definition of \(\overleftarrow{\mathcal {E}}\), we obtain the following representation.

Lemma 2

The one-step expectation \(\overleftarrow{\mathcal {E}}\) can be written

$$\begin{aligned} \overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) &= \underset{\mathbb{Q}\in \mathcal{Q}}{\mathrm{ess\,sup}} \left\{\mathbb{E}_{\mathbb{Q}}[\xi- \mathcal{R}_{t+1}(\mathbb{Q})|\mathcal{Y}_{t}] \right\}\\ &= \underset{\mathfrak{S}, \mathfrak{D}_{t+1}, p}{\mathrm{ess\,sup}} \left\{\int_{\mathbb{R}^{d}}\left(\hat{\xi}\left(y, \{Y_{s}\}_{s\leq t}\right) - \hat{\mathcal{R}}_{t+1}\left(y,p, \mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\leq t}\right)\right)\right. \\ &\qquad\qquad\qquad \left.\left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p\right) d\mu(y) \right\}, \end{aligned} $$

where K is the dynamic penalty constructed in Theorem 5 and \((A_{t+1}, C_{t+1}(\cdot))\equiv \Phi (\mathfrak {S}, \mathfrak {D}_{t+1})\).

Proof

Write

$$\overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) = \underset{\mathbb{Q}\in \mathcal{Q}}{\mathrm{ess\,sup}} \left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]- \mathbb{E}_{\mathbb{Q}}[\mathcal{R}_{t+1}(\mathbb{Q})|\mathcal{Y}_{t}] \right\}.$$

We know that

$$\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]= \int_{\mathbb{R}^{d}} \hat{\xi}(y, \{Y_{s}\}_{s\leq t}) \left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p_{t}\right) d\mu(y),$$

which depends on \(\mathbb {Q}\) only through At+1, Ct+1 and pt, or equivalently, through the parameters \(\mathfrak {S}\), \(\mathfrak {D}_{t+1}\) and pt. In particular, as \(\mathcal {R}\) is additive, we can substitute in its structure and simplify using the definition of K in Theorem 5 to obtain

$$\begin{aligned} \overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) &=\underset{\mathbb{Q}\in \mathcal{Q}}{\mathrm{ess\,sup}}\left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]\right.\\ &\qquad\quad - \left.\mathbb{E}_{\mathbb{Q}}\left[\left(\frac{1}{k}\left(K_{t}(p_{t}, \mathfrak{S}) + \gamma_{t+1}(\mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t+1}, p_{t})\right)\right)^{k'}\Big|\mathcal{Y}_{t}\right] \right\}. \end{aligned} $$

Using the definition of \(\hat {\mathcal {R}}\), we change these conditional expectations to integrals, and obtain the desired representation. □

Remark 21

There is a surprising form of double-counting of the penalty here. To see this, let us assume that ϕ does not depend on Y. If we consider \(\xi _{t+1} = \mathcal {E}_{\mathbf {y}_{t+1}}(\phi (X_{t+1}))\), then we have included a penalty for the proposed model at t+1, that is,

$$\xi_{t+1} = \mathcal{E}_{\mathbf{y}_{t+1}}(\phi(X_{t+1})) = \sup_{p\in S_{N}^{+}}\left\{\sum_{i} p^{i}\phi(e_{i})-\left(\frac{1}{k}K_{t+1}(p)\right)^{k'}\right\},$$

where Kt+1(p) is the penalty associated with the filter state at time t+1, which includes the penalty γt+1 on the parameters \(\mathfrak {S}\) and \(\mathfrak {D}_{t+1}\).

When we calculate \(\overleftarrow{\mathcal {E}}(\xi _{t+1}|\mathcal {Y}_{t})\), we do so by using the penalty \(K(p_{t}, \mathfrak {S}) + \gamma _{t+1}(\mathfrak {S}, \mathfrak {D}_{t+1}, \{Y_{s}\}_{s\le t+1}, p_{t})\), which again includes the term γt+1 which penalizes unreasonable values of the parameters \(\mathfrak {S}\) and \(\mathfrak {D}_{t+1}\). This “double counting” of the penalty corresponds to us including both our “uncertainty at time t+1” (in \(\mathcal {E}_{\mathbf {y}_{t+1}}\)), and also our “uncertainty at t about our uncertainty at t+1” (in \(\overleftarrow{\mathcal {E}}(\cdot |\mathcal {Y}_{t})\)).

Remark 22

One should be careful in this setting, as the recursively-defined nonlinear expectation will be optimized for a different value of \(\mathfrak {S}\) at every time. As \(\mathfrak {S}\) is considered to be a static penalty, this is an internal inconsistency in the modelling of our uncertainty—we always estimate assuming that \(\mathfrak {S}\) has never changed, but evaluate the future by considering our possible future opinions of the value of \(\mathfrak {S}\).

6.2 Review of BSDE theory

While it is useful to give a recursive definition of our nonlinear expectation, a better understanding of its dynamics is of practical importance. In what follows, for the dynamic generator case, we consider the corresponding BSDE theory, assuming that Yt can take only finitely many values, as in Cohen and Elliott (2010). We now present the key results of Cohen and Elliott (2010), in a simplified setting.

In what follows, we suppose that Y takes d values, which we associate with the standard basis vectors in \(\mathbb {R}^{d}\). For simplicity, we write 1 for the vector in \(\mathbb {R}^{d}\) with all components 1.

Definition 11

Write \(\bar {\mathbb {P}}\) for a probability measure under which {Yt}t≥0 is an i.i.d. sequence, uniformly distributed over the d states, and write M for the \(\bar {\mathbb {P}}\)-martingale difference process \(M_{t} = Y_{t}-d^{-1}\boldsymbol{1}\). As in Cohen and Elliott (2010), M has the property that any \(\mathcal {Y}\)-adapted \(\bar {\mathbb {P}}\)-martingale L can be represented as \(L_{t}= L_{0} + \sum _{0\le s< t}Z_{s} M_{s+1}\) for some \(\mathcal {Y}\)-adapted process Z (and Z is unique up to addition of a multiple of 1).

Remark 23

The construction of Z in fact also shows that, if L is written \(L_{t}= \tilde L(Y_{1},...,Y_{t-1}, Y_{t})\), then \(e_{i}^{\top } Z_{t-1} = \tilde L(Y_{1},..., Y_{t-1}, e_{i})\) for every i (up to addition of a multiple of 1).
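
This representation is easy to verify numerically over a single step; in the sketch below, d=3 and the function \(\tilde L\) is an illustrative choice.

```python
import numpy as np

# Check that L_{t+1} - L_t = Z_t M_{t+1} on each outcome, with d = 3 and an
# illustrative Ltilde giving the values of L_{t+1} on {Y_{t+1} = e_i}.
d = 3
Ltilde = np.array([2.0, -1.0, 0.5])
Z = Ltilde                            # Remark 23: e_i^T Z_t = Ltilde(e_i)
L_t = Ltilde.mean()                   # martingale property under P-bar
for i in range(d):
    M = np.eye(d)[i] - np.ones(d) / d       # M_{t+1} on {Y_{t+1} = e_i}
    assert np.isclose(L_t + Z @ M, Ltilde[i])
```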

We can then define a BSDE (Backward Stochastic Difference Equation) with solution (ξ,Z):

$$ \xi_{t}(\omega) - \sum_{t\leq u< T} f(\omega, u, \xi_{u}(\omega), Z_{u}(\omega)) + \sum_{t\leq u< T} Z_{u}(\omega) M_{u+1}(\omega) = \xi_{T}(\omega), $$
(10)

where T is a finite deterministic terminal time, f is a \(\mathcal {Y}\)-adapted map \(f:\Omega \times \{0,...,T\} \times \mathbb {R} \times \mathbb {R}^{d}\rightarrow \mathbb {R}\), and ξT is a given \(\mathbb {R}\)-valued \(\mathcal {Y}_{T}\)-measurable terminal condition. For simplicity, we henceforth omit the ω argument of ξ, Z, and M.

The general existence and uniqueness result for BSDEs in this context is as follows:

Theorem 6

Suppose f is such that the following two assumptions hold:

  1. (i)

    For any ξ, if Z1=Z2+k1 for some scalar k, then \(f\left (\omega,t, \xi _{t}, Z^{1}_{t}\right) = f\left (\omega, t, \xi _{t}, Z^{2}_{t}\right)\), \(\bar {\mathbb {P}}\)-a.s. for all t.

  2. (ii)

    For any \(z\in \mathbb {R}^{d}\), for all t, for \(\bar {\mathbb {P}}\)-almost all ω, the map

    $$\xi\mapsto \xi-f(\omega, t, \xi, z)$$

    is a bijection \(\mathbb {R}\rightarrow \mathbb {R}\).

Then, for any terminal condition ξT which is essentially bounded, \(\mathcal {Y}_{T}\)-measurable, and \(\mathbb {R}\)-valued, the BSDE (10) has a \(\mathcal {Y}\)-adapted solution (ξ,Z). Moreover, this solution is unique up to indistinguishability for ξ and up to addition of multiples of 1 for Z.
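
Condition (ii) is precisely what makes each backward step well-posed: given ξt+1, one reads off Zt from the martingale representation and solves a scalar equation for ξt. The following sketch makes this explicit, with an illustrative driver satisfying (i) and (ii).

```python
import numpy as np
from scipy.optimize import brentq

# One backward step of BSDE (10) under P-bar (Y uniform on d states).
# xi_next[i] is the time-(t+1) value on {Y_{t+1} = e_i}; the driver below is
# an illustrative choice satisfying conditions (i) and (ii) of Theorem 6.
def bsde_step(xi_next, f, t):
    Z = np.asarray(xi_next, dtype=float)   # Z_t, up to multiples of 1
    target = Z.mean()                      # E[xi_{t+1} | Y_t] under P-bar
    # Condition (ii): xi -> xi - f(t, xi, Z) is a bijection, so the root
    # of xi - f(t, xi, Z) - target = 0 exists and is unique.
    return brentq(lambda x: x - f(t, x, Z) - target, -1e6, 1e6)

f = lambda t, xi, z: 0.1 * (z - z.mean()).max()   # satisfies (i): shift-free
print(bsde_step([0.0, 1.0, 0.5], f, t=0))
```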

In this setting, we also have a comparison theorem:

Theorem 7

Consider two discrete-time BSDEs as in (10), corresponding to coefficients f1,f2, and terminal values \(\xi ^{1}_{T}, \xi ^{2}_{T}\). Suppose the conditions of Theorem 6 are satisfied for both equations, and let (ξ1,Z1) and (ξ2,Z2) be the associated solutions. Suppose the following conditions hold:

  1. (i)

    \(\xi ^{1}_{T}\geq \xi ^{2}_{T} \bar {\mathbb {P}}\)-a.s.

  2. (ii)

    \(\bar {\mathbb {P}}\)-a.s., for all times t and every \(\xi \in \mathbb {R}\) and \(z\in \mathbb {R}^{d}\),

    $$f^{1}(\omega, t, \xi, z) \geq f^{2}(\omega, t, \xi, z).$$
  3. (iii)

    \(\bar {\mathbb {P}}\)-a.s., for all t, f1 satisfies

    $$f^{1}\left(\omega, t, \xi_{t}^{2}, Z_{t}^{1}\right) - f^{1}\left(\omega, t, \xi_{t}^{2}, Z_{t}^{2}\right)\geq\min_{j\in \mathbb{J}_{t}}\left\{\left(Z^{1}_{t}-Z^{2}_{t}\right)\left(e_{j}-d^{-1}\boldsymbol{1}\right)\right\},$$

    where \(\mathbb {J}_{t} :=\{i:\bar {\mathbb {P}}(Y_{t+1}=e_{i} | \mathcal {Y}_{t})>0\}\).

  4. (iv)

    \(\bar {\mathbb {P}}\)-a.s., for all t and all \(z\in \mathbb {R}^{d}\), ξξf1(ω,t,ξ,z) is an increasing function.

It is then true that ξ1≥ξ2, \(\bar {\mathbb {P}}\)-a.s. A driver f1 satisfying (iii) and (iv) will be called “balanced”.

Finally, we also know that all dynamically consistent nonlinear expectations can be represented through BSDEs:

Theorem 8

The following two statements are equivalent.

  1. (i)

    \(\overleftarrow{\mathcal {E}}(\cdot |\mathcal {Y}_{t})\) is a \(\mathcal {Y}\)-consistent, dynamically translation invariant, nonlinear expectation.

  2. (ii)

    There exists a driver f which is balanced, independent of ξ, and satisfies the normalisation condition f(ω,t,ξt,0)=0, such that, for all ξT, the value of \(\xi _{t} = \overleftarrow{\mathcal {E}}(\xi _{T}|\mathcal {Y}_{t})\) is the solution to a BSDE with terminal condition ξT and driver f.

Furthermore, these two statements are related by the equation

$$f(\omega, t, \xi, z) =\overleftarrow{\mathcal{E}}(z M_{t+1}|\mathcal{Y}_{t}).$$

6.3 BSDEs for future expectations

By applying the above general theory, we can easily see that our nonlinear expectation has a representation as the solution to a particular BSDE.

Theorem 9

Write

$$\xi_{t}:=\overleftarrow{\mathcal{E}}\left(\phi\left.\left(X_{T}, \{Y_{t}\}_{t\le T}\right)\right|\mathcal{Y}_{t}\right).$$

The dynamically consistent expectation satisfies the BSDE

$$\xi_{t+1} = \xi_{t}-f\left(Z_{t}; \hat{\mathcal{R}}_{t+1}\right) + Z_{t} M_{t+1}$$

with driver

$$\begin{aligned} & f\left(Z_{t}; \hat{\mathcal{R}}_{t+1}\right)\\ &:= \underset{p, \mathfrak{S}, \mathfrak{D}_{t+1}}{\mathrm{ess\,sup}}\sum_{i}\left\{ \left(Z^{i}-\hat{\mathcal{R}}_{t+1}(e_{i}, p, \mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t})\right) \left(\mathbf{1}^{\top} C_{t+1}(e_{i})A_{t+1} p\right)\right.\\ &\qquad\qquad\qquad - \left.d^{-1}Z^{i}\right\}, \end{aligned} $$

where \((A_{t+1}, C_{t+1}(\cdot)) \equiv \Phi (\mathfrak {S}, \mathfrak {D}_{t+1})\).

Proof

As ξt+1 is \(\mathcal {Y}_{t+1}\)-measurable, by the Doob–Dynkin lemma there exists a Borel measurable function \(\hat {\xi }_{t+1}\) such that \(\xi _{t+1}=\hat {\xi }_{t+1}(Y_{t+1})\) (omitting to write {Ys}st as an argument). We write Zt for the vector whose ith component is \(\hat {\xi }_{t+1}(e_{i})\). From the definition of M, as in the proof of the martingale representation theorem in Cohen and Elliott (2010), it follows that

$$\xi_{t+1} - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}] = Z_{t} M_{t+1}\qquad \text{and}\qquad \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}] = \sum_{i} d^{-1} Z^{i}.$$

We then calculate, using Lemma 2 (simplified to our finite-state setting and omitting {Ys}st as an argument),

$$\begin{aligned} \xi_{t} - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}] &= \overleftarrow{\mathcal{E}}(\xi_{t+1}|\mathcal{Y}_{t}) - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\\ &= \underset{p, \mathfrak{S}, \mathfrak{D}_{t+1}}{\mathrm{ess\,sup}}\left\{\sum_{i} \left(Z^{i}-\hat{\mathcal{R}}_{t+1}(e_{i}, p,\mathfrak{S}, \mathfrak{D}_{t+1})\right) \left(\mathbf{1}^{\top} C_{t+1}(e_{i})A_{t+1} p\right) - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\right\}\\ &= f\left(Z_{t}; \hat{\mathcal{R}}_{t+1}\right). \end{aligned} $$

The answer follows by rearrangement. □
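
For a finite family of candidate values of \((p, \mathfrak {S}, \mathfrak {D}_{t+1})\), the driver of Theorem 9 is a finite maximization. A sketch, in which q and R are placeholder inputs standing for \(\mathbf{1}^{\top} C_{t+1}(e_{i})A_{t+1} p\) and \(\hat{\mathcal{R}}_{t+1}(e_{i},\cdot)\):

```python
import numpy as np

# Sketch of the driver of Theorem 9 for binary Y and a finite candidate set.
# For each candidate, q[i] stands in for 1^T C_{t+1}(e_i) A_{t+1} p and R[i]
# for R-hat_{t+1}(e_i, p, S, D_{t+1}); both are illustrative placeholders.
def driver(Z, candidates):
    Z = np.asarray(Z, dtype=float)
    d = len(Z)
    return max(float(((Z - R) * q).sum() - Z.sum() / d)
               for q, R in candidates)

candidates = [(np.array([0.6, 0.4]), np.array([0.0, 0.0])),
              (np.array([0.2, 0.8]), np.array([0.3, 0.3]))]
print(driver([1.0, -0.5], candidates))
```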