1 Introduction

Multi-stage stochastic programs (MSPs) have been widely used to model practical decision making over time and under uncertainty, e.g., multiperiod portfolio selection [13], pension fund management [23], the production and trading of gas [1] and electricity [21], dynamic fleet management [11], and airline revenue management [26]. For these problems, the random inputs are usually modeled as a multivariate random process, and the decisions at individual stages depend on the historical realizations of that process, making the problem an optimization problem over functional spaces. In stochastic programming, it is assumed that the distribution of the random process is known, and the goal is to find optimal decisions that minimize the total expected cost, where the expectation is taken with respect to the true distribution of the random process. Nowadays, however, classical stochastic programming is challenged on the grounds that the exact probability distribution is hardly ever known and can only be partially inferred from the statistical information contained in historical data.

To overcome this issue, the distributionally robust optimization (DRO) technique was proposed, and extensive research has been conducted over the last decade. Providing a trade-off between the stochastic and robust frameworks, DRO aims at finding an optimal decision \(x\in {\mathbb {R}}^d\) that minimizes the worst-case expected cost \(\sup _{{\mathbb {Q}}\in {\mathcal {D}}}{\mathbb {E}}_{\mathbb {Q}}[f(x,\xi )]\) with the random cost function \(f(\cdot ,\cdot ):{\mathbb {R}}^d\times {\mathbb {R}}^s\rightarrow {\mathbb {R}}\). Here it is assumed that the probability distribution \({\mathbb {Q}}\) of the random vector \(\xi \in {\mathbb {R}}^s\) is not known exactly but lies in an ambiguity set of probability distributions \({\mathcal {D}}\). Typical ways to construct \({\mathcal {D}}\) include: specifying the moments of the input distribution, see Delage and Ye [12], Scarf, Arrow, and Karlin [32], Wiesemann, Kuhn, and Sim [38], and Xu and Sun [41]; putting constraints on the support set of the input distribution, see Shapiro [34]; bounding the distance of the input distribution from a nominal distribution, for instance, Ben-Tal et al. [3] construct the set with \(\phi \)-divergence; or specifying a set that contains the underlying probability distribution \({\mathbb {Q}}\) with high probability based on statistical hypothesis tests, e.g., Bertsimas et al. [4]. To develop a more realistic model, static distributionally robust optimization has recently been extended to the multi-stage case in papers such as Analui and Pflug [2] and Pflug and Pichler [27], where the ambiguity set of distributions is constructed as a neighborhood of the nominal distribution by defining a nested distance. Xin and Goldberg [40] studied a multi-stage distributionally robust newsvendor problem with prior knowledge of the support set and information about the first two moments of the demand distribution at each stage.
The ambiguity set based on nested moments of the underlying transition probabilities for Markov decision processes is studied in Xu and Mannor [42] and extended in Yu and Xu [43]. For both single-stage and multi-stage DROs, a very important issue is the proper construction of the ambiguity set. As Ye [37] states, the DRO method can be a double-edged sword: an unsuitably constructed ambiguity set \({\mathcal {D}}\) leads to overly conservative solutions. A basic requirement is that the ambiguity set include the true distribution, and an easy way to ensure this through any of the above constructions is to make the set large enough. But the larger the ambiguity set, the more conservative the resulting solution. The best way to construct the ambiguity set thus remains unclear, and this issue becomes much trickier in the multi-stage case.

In practice, the true distribution is hardly ever observable but can be inferred from existing data in stochastic programs. Nowadays, in the age of big data, as more and more historical data become available, decision makers can obtain meaningful prior insights about future uncertainty from the vast amount of data. The convergence of these phenomena has given rise to the increasingly widespread application of data-driven techniques. This approach is also adopted for the estimation of distributions in stochastic programs, and the existing work can be roughly classified into the following two categories.

Nonparametric data-driven approaches provide one way to model MSPs. This class of modeling approaches mainly uses nonparametric methods to construct an approximation of the distribution function and then derives corresponding multi-stage stochastic optimization algorithms to solve the model. For example, Hannah and Powell [19] used nonparametric density estimation for the joint distribution. Based on this idea, Pflug and Pichler [28] not only used historical data to estimate the probability distribution through kernel density estimation, but also designed a corresponding optimization algorithm that enjoys a guarantee of asymptotic optimality under some strong technical assumptions. Hanasusanto and Kuhn [18] presented another nonparametric approach wherein the conditional distributions in dynamic stochastic programming are estimated using kernel regression. Bertsimas, Shtern and Sturt [5] proposed a data-driven approach to multi-stage stochastic linear optimization that can be asymptotically optimal. Nevertheless, directly constructing an ambiguity set based on sample paths makes it impossible to explicitly or dynamically describe the inter-stage dependence. Moreover, the computational tractability of the proposed approach relies on linear decision rules, which greatly limit the decision space and may fail to recover the true optimal decision. Since no additional assumptions are made on the distribution, current nonparametric approaches are computationally demanding.

Unlike nonparametric approaches, data-driven parametric methods for estimating the unknown probability distribution in MSP problems usually proceed as follows: they assume that the distribution belongs to a parametric family, fit the distribution with historical data, and then make decisions by solving the resulting approximate MSP problem. The techniques used to solve MSPs often require relevant structural information about the uncertainty at different stages, see Shapiro et al. [35], and a well-performing solution cannot be attained without an accurate estimate of the inter-stage correlation structure. Therefore, given prior knowledge about the true distribution of the uncertain parameters, a fundamental difficulty of this approach is how to properly estimate a parametric model from the available data so that it suitably reflects the dynamic evolution of the random data process and enhances the overall performance of the resulting algorithm.

Recently, we noticed an interesting statistical approach to overcoming the drawbacks of the usual distributionally robust optimization approaches. To avoid the conservative and pessimistic focus on the worst case in DRO, Zhou et al. [39] proposed a new Bayesian approach to the uncertainty in the true distribution \({\mathbb {P}}^c\), called Bayesian risk optimization (BRO). The authors assumed that the underlying distribution of the random vector \(\xi \) lies in a parameterized family of distributions \(\{{\mathbb {P}}_\theta ,\theta \in \Theta \}\), where \(\Theta \) is the parameter space. Let \(\theta ^c\in \Theta \) be the unknown true parameter corresponding to \({\mathbb {P}}^c\). From a Bayesian perspective, \(\theta ^c\) can be viewed as a realization of a belief random variable \(\tilde{\theta }\), whose posterior distribution can be computed on the basis of historical data. The problem they proposed is as follows:

$$\begin{aligned} \min _{x\in {\mathcal {X}}}\rho _{{\mathbb {P}}_n}\{{\mathbb {E}}_{{\mathbb {P}}_{\theta }}[h(x,\xi )]\}, \end{aligned}$$
(1)

where the feasible set \({\mathcal {X}}\) is a closed subset of \({\mathbb {R}}^d\), the realization of the random vector \(\xi \) lives in \({\mathbb {R}}^s\), and h is a cost function with \(h(\cdot ,\cdot ):{\mathcal {X}}\times {\mathbb {R}}^s\rightarrow {\mathbb {R}}\). \(\rho \) is a risk functional applied to \({\mathbb {E}}_{{\mathbb {P}}_{\theta }}[h(x,\xi )]\), treated as a random variable induced by \(\theta \sim {\mathbb {P}}_n\), with \({\mathbb {P}}_n\) being the posterior distribution of \(\tilde{\theta }\).
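To make formulation (1) concrete, the following sketch solves a hypothetical instance numerically; the quadratic cost \(h\), the normal-normal conjugate model, and the choice of \(\rho \) as the plain expectation over the posterior are all illustrative assumptions, not part of the BRO framework itself.

```python
import numpy as np

# Minimal sketch of problem (1) under assumed ingredients:
# h(x, xi) = (x - xi)^2 with xi ~ N(theta, 1), a conjugate N(0, 10^2) prior
# on the belief variable theta~, and rho taken as the expectation over the
# posterior P_n (one admissible choice of risk functional).
rng = np.random.default_rng(0)
theta_c = 2.0                                   # unknown true parameter
data = rng.normal(theta_c, 1.0, size=50)        # historical observations

# Normal-normal conjugate update for the posterior of theta~.
prior_mean, prior_var, lik_var = 0.0, 100.0, 1.0
post_var = 1.0 / (1.0 / prior_var + len(data) / lik_var)
post_mean = post_var * (prior_mean / prior_var + data.sum() / lik_var)

# Inner expectation in closed form: E_{P_theta}[(x - xi)^2] = (x - theta)^2 + 1.
def inner(x, theta):
    return (x - theta) ** 2 + lik_var

# Outer risk functional approximated by posterior sampling of theta.
thetas = rng.normal(post_mean, np.sqrt(post_var), size=4000)
xs = np.linspace(-5.0, 5.0, 2001)
objective = np.array([inner(x, thetas).mean() for x in xs])
x_star = xs[objective.argmin()]
```

With this quadratic cost the outer minimizer coincides (up to grid resolution) with the posterior-sample mean of \(\theta \), which makes the interplay between the two expectations in (1) easy to inspect.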

This modeling technique makes us realize that, with the information of prior and posterior distributions, one can consider not only the ambiguity set of possible distributions but also the possible structure of the underlying distribution, which represents the belief, based on available data, about the prior knowledge of the underlying distributions. Therefore, with the Bayesian technique, we can obtain a much more accurate distributional structure, since additional information beyond the observed historical scenarios can be used to parameterize the true distribution. In this way, we can combine the information in the data with the prior knowledge of decision makers to resolve the difficulties in existing DROs and parametric data-driven MSPs. With some assumptions about the distribution’s structure, we can also reduce the computational burden of nonparametric approaches. Moreover, as the sample size tends to infinity, the posterior distribution approaches normality with its mean being the true parameter under some regularity conditions. By utilizing this conclusion, we can estimate the unknown parameters and thus the true distribution. Given the above observations, the modeling framework in (1) is interesting and practical, but it is only applicable to static or single-stage problems. Extending this kind of Bayesian technique to the multi-stage case and utilizing the idea of data-driven methods, we introduce a new kind of MSP model called multi-stage Bayesian expectation optimization (BEO), which uses the Bayesian posterior expectation to estimate the true distribution parameters. The new model can, we hope, avoid the idealized distribution assumption in traditional MSPs, overcome the dilemma about the construction of ambiguity sets in DROs, and improve the computational efficiency of current data-driven MSPs.

Concretely, we propose a data-driven approximation method for MSPs with unknown distributions that (i) can directly cope with the dynamic correlation of stage-wise probability distributions through their parameters, (ii) ensures convergence to the underlying MSP problem as the data size tends to infinity, and (iii) can be practically solved through tailored algorithms. Without the convergence guarantee, our data-driven multi-stage stochastic programming approach would not be of any practical interest. The convergence property is thus of practical importance, as it ensures that the multi-stage BEO method optimally approximates the underlying true multi-stage stochastic programming problem. Our data-driven approach introduces an alternative optimization framework with the help of Bayesian statistics and provides theoretical justification for the efficient solution of realistic multi-stage stochastic programs.

The rest of the paper is organized as follows. Section 2 introduces the multi-stage BEO problem. Section 3 is devoted to establishing asymptotic results related to the posterior distribution and consistency conclusions for the multi-stage BEO problem. Section 4 proposes two algorithms for solving single-stage and multi-stage stochastic convex programming problems, respectively. Section 5 provides a group of numerical results to test the consistency and reveal some insights into the practical advantages of the proposed formulation compared with existing ones. The final section summarizes the paper.

2 Model and prerequisites

To describe a \(T\)-stage (\(T\in {\mathbb {N}}\), \(T\ge 2\)) stochastic optimization problem, let \(\xi _t\in {\mathbb {R}}^s\) denote the random variables at stage \(t\). \(\xi _{t}:\Omega \rightarrow \Xi _t\) is defined on some probability space \((\Omega ,{\mathcal {F}}, {\mathbb {P}})\), and the support set \(\Xi _t\subset {\mathbb {R}}^s\) is equipped with a Borel \(\sigma \)-algebra \({\mathcal {B}}_{\Xi _t}\). Then we have an \({\mathbb {R}}^{Ts}\)-valued stochastic process \(\pmb {\xi }=\{{\xi }_t\}^T_{t=1}\) to model a finite data sequence under uncertainty. Suppose that at stage \(t\), the decision vector \(x_t\in {\mathbb {R}}^{d}\) depends only on the information known up to time \(t\), which consists of the random block vector \(\pmb {\xi }^t:=(\xi _1,\ldots ,\xi _t)\). This property is equivalent to the measurability of \(x_t\) with respect to the \(\sigma \)-field induced by \(\pmb {\xi }^t\), \({\mathcal {F}}_t:=\sigma ( \pmb {\xi }^t)\subset {\mathcal {F}}\). That is, \(\pmb {\xi }^t\) can be treated as a random variable on the probability space \((\Omega ,{\mathcal {F}}_t,{\mathbb {P}})\). We have \({\mathcal {F}}_1\equiv \{\emptyset ,\Omega \}\) because there is no uncertainty at \(t=1\), and obviously \({\mathcal {F}}_t\subset {\mathcal {F}}_{t+1}\) for \(t=1,\ldots ,T-1\). Denote by \(\Xi ^t\) the support set of \(\pmb {\xi }^t\) and by \({\mathbb {P}}^t:={\mathbb {P}}[\pmb {\xi }^t\in \cdot ]\) the corresponding probability measure. Since \({\mathcal {F}}_1=\{\emptyset ,\Omega \}\), both \(\xi _{1}\) and \(x_1\) at stage \(t=1\) are deterministic. In what follows, we assume that \(\pmb {\xi }\in L_p(\Omega ,{\mathcal {F}},{\mathbb {P}};{\mathbb {R}}^{Ts})\) for some \(p\in [1,+\infty )\).

The MSP problem with convex cost functions can be described as

$$\begin{aligned} \min \limits _{x_1\in {\mathcal {X}}_1}h_1(x_1)+{\mathbb {E}}\left[ \inf \limits _{{x}_2\in {\mathcal {X}}_2(x_1,\pmb {\xi }^2)}h_2({x}_2,{\xi }_2)+{\mathbb {E}}\left[ \cdots +{\mathbb {E}}\left[ \inf \limits _{{x}_T\in {\mathcal {X}}_T({x}_{T-1},\pmb {\xi }^T)}h_T({x}_T,{\xi }_T)\right] \right] \right] , \end{aligned}$$
(2)

where \({\mathcal {X}}_1\subseteq {\mathbb {R}}^d\) and \({\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})\subseteq {\mathbb {R}}^d\) are feasible regions for decisions at stage t, and are compact and convex sets for given \(x_{t-1}\) and \(\pmb {\xi }^t\). \(h_1:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) and \(h_t:{\mathbb {R}}^d\times {\mathbb {R}}^s\rightarrow {\mathbb {R}}\) are convex with respect to (w.r.t.) \(x_t\) and measurable w.r.t. \(\xi _{t}\). As is common in the stochastic programming literature, we assume that \({\mathcal {X}}_t\) \((2\le t\le T)\) affinely depends on \(\pmb {\xi }^t\).

To practically describe the inter-stage dependence, we assume that the distribution of the random variable \(\xi _t\) lies in a parameterized family of distributions \(\{{\mathbb {P}}_{\theta _t},\theta _t\in \Theta _t\}\) for each stage \(t=1,\ldots ,T\), where \(\Theta _t\) is the parameter space equipped with a Borel \(\sigma \)-algebra \({\mathcal {B}}_{\Theta _t}\) and \(\theta _t^c(\pmb {\xi }^{t-1})\in \Theta _t\) is the unknown true parameter depending on \(\pmb {\xi }^{t-1}\). Then the dynamic evolution of \(\xi _t\) and \(\theta _t^c\) has the form

$$\begin{aligned} \theta _2^c\rightsquigarrow \xi _2\rightsquigarrow \theta _3^c(\pmb {\xi }^2)\rightsquigarrow \xi _3\rightsquigarrow \cdots \rightsquigarrow \theta _T^c(\pmb {\xi }^{T-1})\rightsquigarrow \xi _T. \end{aligned}$$

We denote \(\theta ^c_t:=\theta _t^c(\pmb {\xi }^{t-1})\) for short. With the true underlying distribution of \(\xi _t\), model (2) is equivalent to

$$\begin{aligned} \min \limits _{x_1\in {\mathcal {X}}_1}h_1(x_1)+{\mathbb {E}}_{{\mathbb {P}}_{\theta _2^c}}&\left[ \inf \limits _{x_2\in {\mathcal {X}}_2(x_1,\pmb {\xi }^2)}h_2(x_2,\xi _2)+{\mathbb {E}}_{{\mathbb {P}}_{\theta _3^c}}\left[ \right. \right. \nonumber \\&\left. \left. \cdots +{\mathbb {E}}_{{\mathbb {P}}_{\theta _T^c}}\left[ \inf \limits _{x_T\in {\mathcal {X}}_T({x}_{T-1},\pmb {\xi }^T)}h_T(x_T,\xi _T)\right] \right] \right] . \end{aligned}$$
(3)

We now present the data-driven approach for solving the multi-stage stochastic convex optimization problem (2). From a Bayesian perspective, \(\theta _t^c\) can be seen as a realization of a random variable \(\tilde{\theta }_t\), whose posterior distribution \({\mathbb {P}}_{t,n_t}\) can be computed from the prior distribution of \(\tilde{\theta }_{t}\) and \(n_t\) real observations \(\xi _t^1(\pmb {\xi }^{t-1}),\xi _t^2(\pmb {\xi }^{t-1}),\ldots ,\xi _t^{n_t}(\pmb {\xi }^{t-1})\) generated from \(\theta _t^c\), which are determined at stage \(t-1\). Meanwhile, it is known from Bayesian asymptotic theory [16] that the posterior distribution of \(\tilde{\theta }_t\) converges weakly to a point mass at \(\theta _t^c\) at an exponential rate. Inspired by this result, instead of problem (3), we propose solving the following multi-stage BEO problem

$$\begin{aligned} \min \limits _{x_1\in {\mathcal {X}}_1}h_1(x_1)+{\mathbb {E}}_{{{\mathbb {P}}_{{\hat{\theta }}_{2}}}}&\left[ \inf \limits _{x_2\in {\mathcal {X}}_2(x_1,\pmb {\xi }^2)}h_2(x_2,\xi _2)+{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{3}}}\left[ \right. \right. \nonumber \\&\left. \left. \cdots +{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{T}}}\left[ \inf \limits _{x_T\in {\mathcal {X}}_T({x}_{T-1},\pmb {\xi }^T)}h_T(x_T,\xi _T)\right] \right] \right] . \end{aligned}$$
(4)

In (4), we denote \({\hat{\theta }}_{t}:={\mathbb {E}}_{{\mathbb {P}}_{t,n_t}}\tilde{\theta }_t\) for short.
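For conjugate models, the plug-in estimator \({\hat{\theta }}_{t}\) is available in closed form. The following minimal sketch illustrates this in an assumed Bernoulli–Beta stage model; the model, prior, and all numbers are hypothetical.

```python
import numpy as np

# Sketch of the plug-in estimator theta_hat_t = E_{P_{t,n_t}}[theta~_t] in a
# hypothetical conjugate setting: a Bernoulli(theta) stage distribution with a
# Beta(a, b) prior, whose posterior after s successes in n trials is
# Beta(a + s, b + n - s) with mean (a + s) / (a + b + n).
rng = np.random.default_rng(1)
theta_c = 0.3                   # unknown true stage parameter
a, b = 1.0, 1.0                 # uniform prior on Theta_t = [0, 1]

theta_hat = {}
for n in (10, 100, 10_000):
    s = int(rng.binomial(n, theta_c))
    theta_hat[n] = (a + s) / (a + b + n)    # posterior mean plugged into (4)
```

As the sample size grows, the posterior mean settles near the true parameter, which is the behavior the consistency analysis in Section 3 formalizes.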

To make the above formulations concise, we derive the corresponding dynamic programming equations. Let \(H_t\) denote the recourse function of model (3) at stage \(t\); then we have

$$\begin{aligned} H_t(x_{t-1},\pmb {\xi }^t):=\inf _{x_t\in {\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})}h_t(x_t,\xi _t)+{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t+1}^c}}[H_{t+1}(x_{t},\pmb {\xi }^{t+1})],\ \ t=2,\ldots ,T, \end{aligned}$$
(5)

where \(H_{T+1}\equiv 0\).

Similarly, the recourse function \(\hat{H}_t\) of model (4) at stage \(t\) can be recursively determined as:

$$\begin{aligned} \hat{H}_t(x_{t-1},\pmb {\xi }^t):=\inf _{x_t\in {\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})}h_t(x_t,\xi _t)+{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t+1}}}[\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})],\ \ t=2,\ldots ,T, \end{aligned}$$
(6)

where \(\hat{H}_{T+1}\equiv 0\).

As demonstrated in [36], the optimal solutions of (5) and (6) for \(t=2,\ldots ,T\) provide optimal policies for problem (3) and problem (4), respectively. That is, a policy is optimal if and only if it is derived from the optimal solutions of the corresponding dynamic programming equations. Thus, in the subsequent study of consistency, we will primarily focus on the stage-wise convergence w.r.t. (6) from a dynamic programming perspective.
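To illustrate how recursion (6) can be evaluated by backward induction, here is a toy sketch; the quadratic stage costs, two-point stage distributions, stage-independent feasible set, and plug-in probabilities are all hypothetical simplifications.

```python
import numpy as np

# Backward-induction sketch of recursion (6) for a toy T = 3 instance: scalar
# decisions, stage costs h_t(x, xi) = (x - xi)^2, a stage-independent feasible
# set X = [0, 0.5] (so the x_{t-1}-dependence drops out), two-point stage
# distributions, and hypothetical plug-in probabilities theta_hat_t.
T = 3
X = np.linspace(0.0, 0.5, 501)          # discretized feasible set
support = (0.2, 0.8)                    # two-point support of each xi_t
theta_hat = {2: 0.4, 3: 0.6}            # plug-in P(xi_t = 0.8) per stage

def expected_next(t):
    """E_{P_{theta_hat_{t+1}}}[H_hat_{t+1}], with H_hat_{T+1} = 0."""
    if t == T:
        return 0.0
    p = theta_hat[t + 1]
    return (1 - p) * H_hat(t + 1, support[0]) + p * H_hat(t + 1, support[1])

def H_hat(t, xi_t):
    """Recourse function (6) evaluated at the current realization xi_t."""
    return min((x - xi_t) ** 2 for x in X) + expected_next(t)

# First-stage problem (4) with h_1(x_1) = x_1^2, which is minimized at x_1 = 0.
value = 0.0 + expected_next(1)
```

On real scenario trees the same recursion is evaluated with memoized recourse values per node; the fully recursive form above is only practical for tiny instances.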

In contrast to traditional stochastic optimization and DRO, the multi-stage BEO approach integrates one’s posterior knowledge of the underlying distributions. This knowledge is obtained by updating a Bayesian prior distribution with available data, reflecting a data-driven technique. The motivation for proposing this type of multi-stage BEO model is inspired by a theoretical result highlighted in [30], expressed as

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t+1}}}[\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})]=&\int _{{\mathbb {R}}^s}\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})d{\mathbb {P}}_{{\hat{\theta }}_{t+1}}\\ =&\int _{{\mathbb {R}}^s}\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})\left[ \int _{\Theta _{t+1}}d{\mathbb {P}}_{{\tilde{\theta }}_{t+1}}d{\mathbb {P}}_{t+1,n_{t+1}}\right] \\ =&\int _{\Theta _{t+1}}\int _{{\mathbb {R}}^s}\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})d{\mathbb {P}}_{{\tilde{\theta }}_{t+1}}d{\mathbb {P}}_{t+1,n_{t+1}}\\ =&{\mathbb {E}}_{{\mathbb {P}}_{t+1,n_{t+1}}}{\mathbb {E}}_{{\mathbb {P}}_{{\tilde{\theta }}_{t+1}}}[\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})]. \end{aligned} \end{aligned}$$
(7)

This equation demonstrates that applying the expectation (w.r.t. \({\mathbb {P}}_{t+1,n_{t+1}}\)) on \(\tilde{\theta }_{t+1}\) can effectively hedge against the input uncertainty on the value function \({\mathbb {E}}_{{\mathbb {P}}_{{\tilde{\theta }}_{t+1}}}[\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})]\) for each stage \(t=2,\ldots ,T\).
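The interchange in (7) can be checked numerically in settings where the plug-in distribution \({\mathbb {P}}_{{\hat{\theta }}_{t+1}}\) coincides with the posterior average of the stage distributions, e.g., when the density is affine in the parameter, as for mixture weights; the components, the discrete stand-in posterior, and the value function below are hypothetical.

```python
import numpy as np

# Monte Carlo sanity check of identity (7) in an assumed setting where it is
# exact: xi follows a two-component Gaussian mixture whose density is affine
# in the unknown weight theta, so the distribution at the posterior mean
# theta_hat equals the posterior average of the mixture distributions.
rng = np.random.default_rng(2)

def H(xi):
    return np.maximum(xi, 0.0)      # stand-in for the value function H_hat_{t+1}

def E_H_given_theta(theta, n=200_000):
    """E_{P_theta}[H(xi)] for xi ~ theta*N(2,1) + (1-theta)*N(-1,1)."""
    z = rng.random(n) < theta
    xi = np.where(z, rng.normal(2.0, 1.0, n), rng.normal(-1.0, 1.0, n))
    return H(xi).mean()

# Discrete stand-in for the posterior P_{t+1, n_{t+1}} over theta.
post_support = np.array([0.2, 0.5, 0.7])
post_probs = np.array([0.3, 0.4, 0.3])
theta_hat = float(post_support @ post_probs)

lhs = E_H_given_theta(theta_hat)                         # E_{P_theta_hat}[H]
rhs = sum(w * E_H_given_theta(th)                        # iterated expectation
          for th, w in zip(post_support, post_probs))
```

Up to Monte Carlo error, the two sides agree, mirroring the chain of equalities in (7).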

Furthermore, to show that the multi-stage BEO model can overcome the over-conservatism of current DRO models, we compare it with the multi-stage DRO problem. It is known from [24] that the multi-stage DRO problem under parametric assumptions can be recursively formulated as:

$$\begin{aligned} \begin{aligned} \hat{H}_t^{DR}(x_{t-1},\pmb {\xi }^t):=\inf _{x_t\in {\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})}h_t(x_t,\xi _t)&\\ +\max _{\theta _{t+1}\in \Theta _{t+1}}{\mathbb {E}}_{{\mathbb {P}}_{{\theta }_{t+1}}}&[\hat{H}^{DR}_{t+1}(x_{t},\pmb {\xi }^{t+1})],\ \ t=2,\ldots ,T, \end{aligned} \end{aligned}$$
(8)

where \(\hat{H}^{DR}_{T+1}\equiv 0\). \(\hat{H}_t^{DR}\) can be treated as the recourse function at stage \(t\) under the DRO framework, and \(\Theta _t\) stands for the uncertainty set of possible parameter realizations of the parameterized distribution at stage \(t\). We can therefore illustrate how our multi-stage BEO model avoids the over-conservatism typically encountered in multi-stage DRO from the following two perspectives. Firstly, the objective function of multi-stage DRO in (8) always considers the worst-case scenario over the uncertainty set \(\Theta _{t+1}\). In contrast, \(\hat{\theta }_{t+1}\) in our model is simply a specific element of \(\Theta _{t+1}\), reflecting a more nuanced approach to uncertainty. Secondly, the reformulation presented in (7) indicates that our multi-stage BEO model introduces greater flexibility by employing the Bayesian posterior distribution \({\mathbb {P}}_{t+1,n_{t+1}}\). This is advantageous over the worst-case measure used in multi-stage DRO, as it offers a probabilistic framework that weighs the entire range of parameters rather than concentrating on a single worst-case parameter.

In essence, the proposed Bayesian approach not only enables a more comprehensive understanding of potential scenarios, but also allows for a more adaptive and responsive optimization strategy. By considering a spectrum of possibilities rather than focusing solely on extreme cases, the multi-stage BEO model provides a balanced and practical solution to optimization problems where uncertainty is a key factor.
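The conservatism gap between (8) and the BEO plug-in can already be seen in a single-stage toy problem; the newsvendor-style cost, the exponential demand model, and all numbers below are hypothetical.

```python
import numpy as np

# Toy single-stage comparison of the BEO plug-in objective against the DRO
# worst case as in (8): newsvendor-style cost with exponential demand and a
# hypothetical uncertainty set Theta = [0.5, 2.0] of rate parameters.
c, b = 1.0, 5.0                          # order cost and shortage penalty
thetas = np.linspace(0.5, 2.0, 16)       # discretized uncertainty set Theta
theta_hat = 1.2                          # hypothetical posterior-mean estimate

def expected_cost(x, theta):
    # E[c*x + b*max(xi - x, 0)] for xi ~ Exponential(rate=theta), closed form
    return c * x + b * np.exp(-theta * x) / theta

xs = np.linspace(0.0, 6.0, 6001)
x_beo = xs[np.argmin(expected_cost(xs, theta_hat))]
x_dro = xs[np.argmin([max(expected_cost(x, th) for th in thetas) for x in xs])]
```

Here the worst case is always attained at the smallest rate (the heaviest tail), so the DRO order quantity `x_dro` exceeds the plug-in quantity `x_beo`, illustrating the extra conservatism discussed above.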

3 Consistency of multi-stage BEO

To lay the theoretical foundation for model (4), we need to ensure that, as the data size goes to infinity, the objectives \(\hat{H}_t\), \(2\le t\le T\), of problem (6), and thus of the multi-stage BEO problem (4), recover the true objectives \(H_t\), and that the optimal solutions of the multi-stage BEO problem converge to the true optimal solutions. This corresponds to the consistency analysis of problem (4). To this end, let \(D_{KL}({\mathbb {P}}\Vert {\mathbb {Q}})\) denote the Kullback–Leibler divergence between two distributions \({\mathbb {P}}\) and \({\mathbb {Q}}\), and let “a.s.” be short for “almost surely”. In our Bayesian setting, we write \(D_{KL}(\theta ^c\Vert \theta ):=D_{KL}({\mathbb {P}}_{\theta ^c}\Vert {\mathbb {P}}_{\theta })\) for brevity. Moreover, we denote the KL neighborhood \(D_{KL}^\epsilon (\theta ^c)\) of \(\theta ^c\) as \(\{\theta :D_{KL}({\theta ^c}\Vert {\theta })\le \epsilon \}\).

As early as 1965, Schwartz [33] established the weak consistency of the posterior distribution under the condition that the prior has the Kullback–Leibler property (KL property). This classical result was used in Wu [39] to establish the asymptotics of BRO. For our purpose, we aim to extend this theory, under a weaker assumption, to the dynamic case with multiple stages. First, we propose the following conditional variants of the assumptions adopted in [33] and [39].

Assumption 1

(sufficient conditions for consistency under \({\mathbb {P}}_{\theta ^c_t}\)).

  1. (i)

    \(\Theta _t, 2\le t\le T\), is a compact set.

  2. (ii)

    For fixed \(\pmb {\xi }^{t-1}\), the conditional distribution \({\mathbb {P}}_{\theta _t}\) has a density function \(p_{\theta _t}(\cdot )\) which is \({\mathcal {B}}_{\Theta _t}\bigotimes {\mathcal {B}}_{\Xi _t}\)-measurable.

  3. (iii)

    For any neighborhood \(V_t \in \) \({\mathcal {B}}_{\Theta _t}\) of \(\theta ^c_t\), there exists a sequence of uniformly consistent tests of the hypothesis \(\tilde{\theta }_t=\theta ^c_t\) against the alternative \(\tilde{\theta }_t\in \Theta _t\backslash V_t\).

  4. (iv)

    For any \(\epsilon _t> 0\) and any neighborhood \(V_t\in {\mathcal {B}}_{\Theta _t}\) of \(\theta ^c_t\), \(V_t\) contains a subset \( W_t \) such that \(\pi _t(W_t) > 0\) and \(D_{KL} ({\theta ^c_t}\Vert {\theta _t}) <\epsilon _t\) for all \(\theta _t\in W_t\), where \(\pi _t\) denotes the prior distribution of \(\tilde{\theta }_t\) on \(\Theta _t\).

Remark 1

As a standard assumption in Bayesian frameworks, the compactness of \(\Theta _t\) in (i) ensures the existence of the posterior estimator \(\hat{\theta }_t\), which is essential for the multi-stage BEO model to be well defined. Moreover, it is known from [33] that the posterior distribution is consistent if the true distribution \({\mathbb {P}}^c\) can be suitably tested against the complements of neighborhoods of \({\mathbb {P}}^c\) and the Kullback–Leibler neighborhoods of \({\mathbb {P}}^c\) receive positive probability under the prior, which is known as the KL property. Thus conditions (ii), (iii) and (iv) can be regarded as the conditional version of this property and are fundamental to the study of posterior uniform consistency hereafter.

First, we have the following result.

Theorem 1

Suppose Assumption 1 holds. Then for any neighborhood \(V_t \in {\mathcal {B}}_{\Theta _t}\) of \(\theta ^c_t\), \({\mathbb {P}}_{t,n_t}(V_t)\rightarrow 1\) a.s. (\({\mathbb {P}}_{\theta ^c_t}\)) as \( n_t \rightarrow \infty \).

Proof

Let \(p_{n_t}=\frac{1}{\pi _t(V_t^c)}\int _{V^c_t}p^{n_t}_{\theta _t}\pi _t(d\theta _t)\) and \(q_{n_t}=\frac{1}{\pi _t(W_t)}\int _{W_t}p^{n_t}_{\theta _t}\pi _t(d\theta _t)\). We have the following inequality:

$$\begin{aligned} {\mathbb {P}}_{t,n_t}(V_t^c)=\frac{\int _{V^c_t}p^{n_t}_{\theta _t}\pi _t(d\theta _t)}{\int _{\Theta _t}p^{n_t}_{\theta _t}\pi _t(d\theta _t)}\le \frac{\pi _t(V_t^c)p_{n_t}}{\pi _t(W_t)q_{n_t}}. \end{aligned}$$

We will demonstrate that \(p_{n_t}/q_{n_t}\rightarrow 0\) a.s. (\({\mathbb {P}}_{\theta ^c_t}\)).

By Lemma 6.1 in [6], there exist integers \(k\) and \(r\) such that, with \(m\) determined by \(mk\le n_t<(m+1)k\),

$$\begin{aligned} {\mathbb {P}}_{\theta _t^c}\left\{ \frac{p_{n_t}}{p_{\theta _t^c}^{n_t}}>\epsilon _m\right\} \le \frac{2e^{-\frac{mr}{2}}}{\epsilon _m}\sqrt{1-e^{-mr}}. \end{aligned}$$

Choosing \(\epsilon _m=e^{-\frac{mr}{4}}\) gives

$$\begin{aligned} {\mathbb {P}}_{\theta _t^c}\left\{ \frac{p_{n_t}}{p_{\theta _t^c}^{n_t}}>e^{-\frac{mr}{4}}\right\} \le 2e^{-\frac{mr}{4}}. \end{aligned}$$
(9)

The set \(A_{n_t}=\left\{ \frac{p_{n_t}}{p_{\theta _t^c}^{n_t}}>e^{-\frac{n_tr}{2k}}\right\} \) is contained in the set on the left side of (9), and \(2e^{r/2}e^{-\frac{n_tr}{2k}}\) dominates the right side of (9). It thus follows from (9) and the Borel–Cantelli lemma that \({\mathbb {P}}_{\theta _t^c}(\cap ^\infty _{N=1}\cup _{n_t\ge N}A_{n_t})=0\). That is, almost surely there exists an integer \(N_1\) such that

$$\begin{aligned} \frac{p_{n_t}}{p_{\theta _t^c}^{n_t}}\le e^{-\frac{n_tr}{2k}}, \ \ \forall n_t>N_1. \end{aligned}$$
(10)

To find an upper bound for \(\frac{q_{n_t}}{p_{\theta _t^c}^{n_t}}\), we define \(H(\theta _t)={\mathbb {E}}_{\theta _{t}^c}\left( \log \frac{p_{\theta _{t}}(\xi _t)}{p_{\theta _{t}^c}(\xi _t)}\right) \) and \(\phi _{n_t}(\theta _t)=\frac{1}{n_t}\sum _{i=1}^{n_t}\left[ \log p_{\theta _{t}}(\xi _t^i)-\log p_{\theta _{t}^c}(\xi _t^i)\right] \). For each \(\theta _t\), \(\phi _{n_t}(\theta _t)\rightarrow H(\theta _t)\) a.s. (\({\mathbb {P}}_{\theta ^c_t}\)) by the strong law of large numbers. Meanwhile, by Fubini’s theorem, there exists a set of \({\mathbb {P}}_{\theta ^c_t}\)-measure zero outside of which \(\phi _{n_t}\rightarrow H\) a.s. (\(\nu \)), where \(\nu (B)=\frac{1}{\pi _t(W_t)}\pi _t(W_t\cap B)\), \(B\in {\mathcal {B}}_{\Theta _t}\). For fixed \(\epsilon > 0\) and \(W_t\) given in condition (iv), an application of Fatou’s lemma and Hölder’s inequality gives that for some \(N_2\) and each \(n_t>N_2\),

$$\begin{aligned} \int e^{n_t\phi _{n_t}(\theta _t)}\nu (d\theta _t)=\frac{q_{n_t}}{p_{\theta _t^c}^{n_t}}\ge e^{-n_t\epsilon }, \ \ \forall n_t>N_2,\ \ a.s.\ \ ({\mathbb {P}}_{\theta ^c_t}). \end{aligned}$$
(11)

With (10) and (11), we obtain

$$\begin{aligned} {\mathbb {P}}_{t,n_t}(V_t^c)\le \frac{\pi _t(V_t^c)}{\pi _t(W_t)}e^{-n_t(\frac{r}{2k}-\epsilon )},\ \forall n_t>\max (N_1,N_2). \end{aligned}$$

The result then follows by choosing \(\epsilon <\frac{r}{2k}\). \(\square \)

To proceed, we need the following definition of weak convergence.

Definition 1

([6], weak convergence) A sequence of distributions \(\hat{{\mathbb {P}}}_n\) is said to converge to \({{\mathbb {P}}}\) weakly, denoted as \(\hat{{\mathbb {P}}}_n\Rightarrow {{\mathbb {P}}}\), if and only if \(\int f(\omega )\hat{{\mathbb {P}}}_n(d\omega )\rightarrow \int f(\omega ) {{\mathbb {P}}}(d\omega )\) for each bounded and continuous function \(f:\Omega \rightarrow {\mathbb {R}}\).

Theorem 2

Under Assumption 1, \({\mathbb {P}}_{t,n_t}\Rightarrow \delta _{\theta ^c_t}\) a.s. (\({\mathbb {P}}_{\theta ^c_t}\)), \(t=2,\ldots ,T\), where \(\delta _{\theta ^c_t}\) is a point mass on \(\theta ^c_t\).

Proof

For each stage \(t=2,\ldots ,T\), we denote by \(\Theta _t^k\subseteq \Theta _t\) an open ball centered at \(\theta _t^c\) with radius \(\frac{1}{k}\). By Lemma 3.3 in [39], we have that for any bounded and continuous function \(f\) and any fixed positive integer \(m\), \({\mathbb {P}}_{t,n_t}(\Theta _t\setminus \Theta _t^m)\rightarrow 0\) and thus \(\int _{\Theta _t\setminus \Theta _t^m}f(\theta _t){\mathbb {P}}_{t,n_t}(d\theta _t)\rightarrow 0\) as \(n_t\rightarrow \infty \). It follows that

$$\begin{aligned} \inf _{\theta _t\in \Theta _t^m}f(\theta _t)\le \liminf _{n_t\rightarrow \infty }\int _{\Theta _t}f(\theta _t){\mathbb {P}}_{t,n_t}(d\theta _t)\le \limsup _{n_t\rightarrow \infty }\int _{\Theta _t}f(\theta _t){\mathbb {P}}_{t,n_t}(d\theta _t)\le \sup _{\theta _t\in \Theta _t^m}f(\theta _t). \end{aligned}$$

Letting \(m\rightarrow \infty \), the continuity of \(f\) and Definition 1 imply that \({\mathbb {P}}_{t,n_t}\Rightarrow \delta _{\theta ^c_t}\). \(\square \)

Theorem 2 establishes the weak consistency of the posterior distribution in the multi-stage case. Then we can concentrate on the consistency of the multi-stage BEO problem (4).
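The concentration described by Theorems 1 and 2 is easy to visualize numerically; the following sketch assumes a hypothetical Bernoulli stage model with a conjugate Beta prior.

```python
import numpy as np

# Numerical illustration of Theorems 1-2 in an assumed Bernoulli(theta_c)
# stage model with a uniform Beta(1, 1) prior: the posterior mass assigned to
# the fixed neighborhood V = (theta_c - 0.05, theta_c + 0.05) tends to 1 as
# the sample size n_t grows.
rng = np.random.default_rng(3)
theta_c = 0.6
mass = {}
for n in (20, 200, 20_000):
    s = int(rng.binomial(n, theta_c))                # sufficient statistic
    draws = rng.beta(1 + s, 1 + n - s, 100_000)      # posterior Beta(1+s, 1+n-s)
    mass[n] = float(np.mean(np.abs(draws - theta_c) < 0.05))
```

The posterior mass of the neighborhood increases toward 1 with the sample size, matching the a.s. convergence \({\mathbb {P}}_{t,n_t}(V_t)\rightarrow 1\) asserted by Theorem 1.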

3.1 Consistency of objective functions

We first establish a crucial result for the strong consistency of the multi-stage BEO problem’s objective functions by showing their uniform convergence.

Assumption 2

\({\mathbb {P}}_{\theta _t}\) has a conditional density \(p_{\theta _t}\) that is continuous w.r.t. \(\theta _t\).

This is a mild assumption because a suitably parameterized family of distributions can be chosen to guarantee the continuity of \(p_{\theta _t}\) w.r.t. \(\theta _{t}\) even if \(\xi _t\) has a complex structure, e.g., time-series models [17, Appendix A]. We illustrate this with an example which includes rather flexible classes of distributions.

Example 1

(Finite mixture distributions controlled by parameters) Widely adopted in many fields such as machine learning, a finite mixture distribution is usually defined as \(\xi \sim \sum _{i=1}^{m}\theta _i^c{\mathbb {P}}_i\), where each \({\mathbb {P}}_i\) has a known distribution function. In our setting, we can treat this kind of mixture distribution as a family of distributions governed by the weights \(\theta _i^c, 1 \le i \le m\), which are unknown. That is, the density can be viewed as a continuous function w.r.t. \(\theta = (\theta _1, \ldots , \theta _m)\), i.e., \(p_{\theta }=\sum _{i=1}^{m}\theta _i p_i\). From this perspective, we can then determine the posterior distribution numerically, e.g., using MCMC methods [16].
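As a concrete sketch of this computation, a one-dimensional weight can also be handled by simple numerical integration over a grid instead of full MCMC; the two components, the sample size, and the flat prior below are hypothetical choices.

```python
import numpy as np

# Grid-based posterior for the weight of a two-component instance of the
# mixture family above: xi ~ theta*N(2,1) + (1-theta)*N(-2,1) with known
# components, unknown weight theta, and a flat prior on [0, 1].
rng = np.random.default_rng(4)
theta_c, n = 0.7, 400
z = rng.random(n) < theta_c
data = np.where(z, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))

def npdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

grid = np.linspace(0.001, 0.999, 999)    # grid over the weight space
loglik = np.array([np.log(th * npdf(data, 2.0)
                          + (1 - th) * npdf(data, -2.0)).sum()
                   for th in grid])
post = np.exp(loglik - loglik.max())     # flat prior => posterior ∝ likelihood
post /= post.sum()
theta_hat = float(grid @ post)           # posterior mean E_{P_n}[theta~]
```

Subtracting the maximum log-likelihood before exponentiating keeps the normalization numerically stable, a standard trick for grid posteriors.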

The following lemma demonstrates the convergence of conditional expectations of integrands w.r.t. \(n_t\).

Lemma 1

Suppose Assumptions 1 and 2 hold, and \(h_t(x_t,\cdot )\) is bounded and continuous on \(\Xi _t\) for every \(x_t \in {\mathcal {X}}_t\). Then for every fixed \(x_t\in {\mathcal {X}}_t\), we have \({\mathbb {P}}_{{\hat{\theta }_{t}}}\Rightarrow {\mathbb {P}}_{\theta _t^c}\), which implies

$$\begin{aligned}{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]\rightarrow {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)],\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}).\end{aligned}$$

Proof

Since \(\Theta _t\) is compact, the identity mapping \(I(\theta _t)=\theta _t\) is bounded on \(\Theta _t\). Therefore, it follows from Definition 1 that \(\hat{\theta }_{t}={\mathbb {E}}_{{\mathbb {P}}_{t,n_t}}\theta _t\rightarrow {\mathbb {E}}_{\delta _{\theta _t^c}}\theta _t=\theta _t^c \). With the assumption that the density \(p_{\theta _t}\) is continuous w.r.t. \(\theta _t\), \(p_{{\hat{\theta }_t}}\) converges to \(p_{\theta ^c_t}\) pointwise \(a.s.\ ({\mathbb {P}}_{\theta _t^c}\)). Then, by Scheffé’s theorem in [6] and Theorem 2.2 in [7], we have \({\mathbb {P}}_{{\hat{\theta }_{t}}}\Rightarrow {\mathbb {P}}_{\theta _t^c}\) for every fixed \(x_t\). It follows directly from the definition of weak convergence that \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]\rightarrow {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\) as \( n_t \rightarrow \infty \). \(\square \)
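The convergence of the posterior mean \(\hat{\theta }_{t}\rightarrow \theta _t^c\) used in this proof can be observed numerically in a simple conjugate model. The following sketch (our own illustration with a hypothetical Beta-Bernoulli pair, not the paper's setting) computes the posterior mean under a Beta(1,1) prior for growing sample sizes.

```python
import random

random.seed(1)

THETA_C = 0.3        # hypothetical true parameter theta^c
ALPHA = BETA = 1.0   # Beta(1,1) (uniform) prior

def posterior_mean(n):
    """Posterior mean of theta after n i.i.d. Bernoulli(THETA_C) samples."""
    s = sum(1 for _ in range(n) if random.random() < THETA_C)
    # Conjugacy: posterior is Beta(ALPHA + s, BETA + n - s).
    return (ALPHA + s) / (ALPHA + BETA + n)

estimates = [posterior_mean(n) for n in (10, 100, 10000)]
errors = [abs(e - THETA_C) for e in estimates]
```

As n grows, the estimate approaches \(\theta ^c\), mirroring \(\hat{\theta }_{t}\rightarrow \theta _t^c\) above.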

To establish the consistency of optimal values and solutions, we need to show the uniform convergence of objective functions. The precise definition of uniform convergence is as follows.

Definition 2

([35] uniform convergence) A sequence of functions \(\{\hat{f}_n(x)\}\) converges uniformly to f(x) with probability 1 (w.p.1) on \({\mathcal {X}}\) if

$$\begin{aligned} \sup _{x\in {\mathcal {X}}}\vert \hat{f}_n(x)-f(x)\vert \rightarrow 0 \ \ w.p.1,\ as\ \ n\rightarrow \infty , \end{aligned}$$

and we denote it as

$$\begin{aligned} \hat{f}_n(x)\rightrightarrows f(x)\ \ w.p.1,\ as\ n\rightarrow \infty . \end{aligned}$$

Theorem 3

Suppose Assumptions 1 and 2 hold and \({\mathcal {X}}_t\) is a nonempty and compact subset of \({\mathbb {R}}^{d}\). If

  1. (i)

    \(h_t(x_t,\xi _t)\) is continuous at any \((x_t,\xi _{t})\in {\mathcal {X}}_t\times \Xi _t\),

  2. (ii)

    \(h_t(x_t,\xi _t), x_t\in {\mathcal {X}}_t\), is dominated by an integrable function,

  3. (iii)

    the samples are i.i.d. from the true distribution \({\mathbb {P}}_{\theta _t^c}\),

then the expected value function \({\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\) is finite valued and continuous on \({\mathcal {X}}_t\), and

$$\begin{aligned} {\mathbb {E}}_{{\mathbb {P}}_{\hat{\theta }_{t}}}[h_t(x_t,\xi _t)]\rightrightarrows {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)], \ \ \forall {x_t} \in {\mathcal {X}}_t,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}). \end{aligned}$$

Proof

It follows from condition (ii) that there exists an integrable function \(g_t(\cdot )\) such that \(\vert {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\vert \le {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[g_t(\xi _t)]\), and consequently \(\vert {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\vert <+\infty \) for all \(x_t\in {\mathcal {X}}_t\). For each \(x_t\in {\mathcal {X}}_t\), we can choose a sequence of points \(\{x_t^k\}_{k=1}^\infty \) in \({\mathcal {X}}_t\) such that \(x_t^k\) converges to \(x_t\). Moreover, condition (ii) implies that

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t^k,\xi _t)]={\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[\lim _{k\rightarrow \infty }h_t(x_t^k,\xi _t)] \end{aligned}$$

by Lebesgue’s dominated convergence theorem.

Since, by (i), \(h_t(x_t^k,\xi _t)\rightarrow h_t(x_t,\xi _t)\) w.p.1, it follows that \({\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t^k,\xi _t)]\rightarrow {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\), and hence \({\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\) is continuous w.r.t. \(x_t\). We define the set \(V_k:= \{x_t \in {\mathcal {X}}_t: \Vert x_t -\bar{x}_t\Vert \le \gamma _k \}\) and

$$\begin{aligned} \delta _k(\xi _t):=\sup _{x_t\in V_k}\left| h_t(x_t,\xi _t)-h_t(\bar{x}_t,\xi _t)\right| , \end{aligned}$$
(12)

where \(\bar{x}_t\) is a fixed point in \({\mathcal {X}}_t\) and \(\{\gamma _k\}_{k=1}^\infty \) is a sequence of positive numbers converging to 0.

Condition (i) implies the measurability of \(h_t(x_t,\xi _t)\), which leads to the Lebesgue measurability of \(\delta _k(\xi _t)\). Moreover, \(\delta _k(\xi _t), k\in {\mathbb {N}}\), is dominated by an integrable function by condition (ii). Condition (i) also implies that the function \(\delta _k(\xi _t)\) is continuous on \(\Xi _t\) and converges to 0 as k tends to \(\infty \) for a.e. \(\xi _t\in \Xi _t\). Then by Lebesgue’s dominated convergence theorem, we have that

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[\delta _k(\xi _t)]={\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[\lim _{k\rightarrow \infty }\delta _k(\xi _t)]=0. \end{aligned}$$
(13)

We can further obtain that

$$\begin{aligned} \left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(\bar{x}_t,\xi _t)]\right| \le {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}\vert h_t(x_t,\xi _t)-h_t(\bar{x}_t,\xi _t)\vert , \end{aligned}$$

and hence

$$\begin{aligned} \sup _{x_t\in V_k}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(\bar{x}_t,\xi _t)]\right| \le {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[\delta _k(\xi _t)]. \end{aligned}$$
(14)

It is known from Lemma 1 that \({\mathbb {P}}_{{\hat{\theta }}_{t}}\Rightarrow {\mathbb {P}}_{\theta ^c_t}\). Therefore, the right-hand side of (14) converges \(a.s.\ ({\mathbb {P}}_{\theta _t^c})\) to \({\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[\delta _k(\xi _t)]\) as \(n_t \rightarrow \infty \). Together with (13), this implies that for any given \(\epsilon > 0\), there exists a neighborhood W of \(\bar{x}_t\) such that, \(a.s.\ ({\mathbb {P}}_{\theta _t^c})\), for sufficiently large \(n_t\) we have

$$\begin{aligned} \sup _{x_t\in W\cap {\mathcal {X}}_t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(\bar{x}_t,\xi _t)]\right| <\epsilon . \end{aligned}$$

Since \({\mathcal {X}}_t\) is compact, there exist a finite number of points \(x_t^1,\ldots ,x_t^m\in {\mathcal {X}}_t\) and corresponding neighborhoods \(W_1,\ldots ,W_m\) which form a cover of \({\mathcal {X}}_t\). It can then be deduced that \(a.s.\ ({\mathbb {P}}_{\theta _t^c})\) for \(n_t\) large enough, the following holds:

$$\begin{aligned} \sup _{x_t\in W_j\cap {\mathcal {X}}_t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t^j,\xi _t)]\right| <\epsilon ,\ \ j=1,\ldots ,m. \end{aligned}$$
(15)

Furthermore, since \({\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\) is continuous on \({\mathcal {X}}_t\), the above neighborhoods can be chosen in such a way that

$$\begin{aligned} \sup _{x_t\in W_j\cap {\mathcal {X}}_t}\left| {\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t^j,\xi _t)]\right| <\epsilon ,\ \ j=1,\ldots ,m. \end{aligned}$$
(16)

Again by the weak convergence and Lemma 1, we have that \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]\) converges pointwise to \({\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)], a.s.\ ({\mathbb {P}}_{\theta _t^c})\). Therefore,

$$\begin{aligned} \left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t^j,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t^j,\xi _t)]\right| <\epsilon ,\ \ j=1,\ldots ,m,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}) \end{aligned}$$
(17)

for \(n_t\) large enough. It follows from (15)–(17) that for \(n_t\) large enough, we have

$$\begin{aligned} \sup _{x_t\in {\mathcal {X}}_t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}[h_t(x_t,\xi _t)]-{\mathbb {E}}_{{\mathbb {P}}_{\theta _t^c}}[h_t(x_t,\xi _t)]\right| <3\epsilon ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}). \end{aligned}$$
(18)

Since \(\epsilon > 0\) is arbitrary, the proof is complete. \(\square \)

Remark 2

Theorem 3 provides sufficient conditions for the uniform convergence of the objective functions with respect to the posterior parameter estimates. Without these conditions, the pointwise convergence of the objective function still holds, as Lemma 1 shows, but it might not be uniform.

3.2 Consistency of optimal values and solutions

With the results established in the last subsection, we can now demonstrate the consistency of optimal values and solutions. Let

$$\begin{aligned} \hat{S}_{t}^{n_{t+1}}:= \arg \min _{x_t\in {\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})}h_{t}(x_{t},\xi _{t})+{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t+1}}}\left[ \hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})\right] \end{aligned}$$

be the stagewise optimal solution set of the multi-stage BEO problem (4) at stage \(t, t=1,\ldots ,T-1\), and let

$$\begin{aligned} {S_t:=\arg \min _{x_t\in {\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})}h_{t}(x_{t},\xi _{t})+{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t+1}^c}}\left[ H_{t+1}(x_{t},\pmb {\xi }^{t+1})\right] } \end{aligned}$$

be the stagewise optimal solution set of problem (3) at stage \(t, t=1,\ldots ,T-1\). The following definition shows how to measure the deviation between \({\hat{S}_{t}^{n_{t+1}}}\) and \(S_t\).

Definition 3

For \(A,B \subset {\mathbb {R}}^d\), define \({\mathbb {D}}(A,B):= \sup _{x\in A} dist(x,B)\), where \(dist(x,B):= \inf _{y\in B}\Vert x-y\Vert \) and \(\Vert \cdot \Vert \) denotes an arbitrary norm.
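For finite sets, the deviation in Definition 3 can be evaluated directly. A minimal sketch with the Euclidean norm and hypothetical point sets (note that \({\mathbb {D}}\) is not symmetric):

```python
import math

def dist(x, B):
    """dist(x, B) = inf over y in B of ||x - y|| (Euclidean norm, finite B)."""
    return min(math.dist(x, y) for y in B)

def deviation(A, B):
    """D(A, B) = sup over x in A of dist(x, B); one-sided, not symmetric."""
    return max(dist(x, B) for x in A)

A = [(0.0, 0.0), (1.0, 1.0)]
B = [(0.0, 0.0)]
d_ab = deviation(A, B)   # distance from the farthest point of A to B
d_ba = deviation(B, A)   # generally different from d_ab
```

Here \({\mathbb {D}}(A,B)=\sqrt{2}\) while \({\mathbb {D}}(B,A)=0\), since every point of B already lies in A.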

Our proofs rely on the following fundamental result.

Lemma 2

([35], Theorem 5.3) Let \({\mathcal {X}}\) be a compact subset of \({\mathbb {R}}^d\). Suppose that a sequence of continuous functions \(\hat{f}_n: {\mathcal {X}}\rightarrow {\mathbb {R}}\) converges uniformly to a continuous function f. Then for \(S_n=\arg \min _{x\in {\mathcal {X}}}\hat{f}_n(x)\) and \(S=\arg \min _{x\in {\mathcal {X}}}f(x)\), we have \({\mathbb {D}}(S_n,S)\rightarrow 0\) as \(n\rightarrow \infty \). Furthermore, we have \(\hat{f}^{*}_{n}\rightarrow f^*\), where \(\hat{f}^*_n:= \min _{x\in {\mathcal {X}}} \hat{f}_n(x)\) and \(f^*:= \min _{x\in {\mathcal {X}}} f(x)\).
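As a numerical sanity check on Lemma 2 (our own toy instance, not from [35]): \(\hat{f}_n(x)=(x-1/n)^2\) converges uniformly to \(f(x)=x^2\) on \([-1,1]\), and the grid minimizers of \(\hat{f}_n\) converge to the minimizer of f at 0.

```python
# Toy instance: f_hat_n(x) = (x - 1/n)^2 converges uniformly to f(x) = x^2
# on [-1, 1], with sup-error 2/n + 1/n^2 attained at x = -1.
def f_hat(n, x):
    return (x - 1.0 / n) ** 2

def f(x):
    return x ** 2

grid = [i / 1000.0 - 1.0 for i in range(2001)]   # grid over [-1, 1]

def sup_err(n):
    """Sup-norm distance between f_hat_n and f over the grid."""
    return max(abs(f_hat(n, x) - f(x)) for x in grid)

def argmin_hat(n):
    """Grid minimizer of f_hat_n; equals 1/n up to grid resolution."""
    return min(grid, key=lambda x: f_hat(n, x))

errs = [sup_err(n) for n in (1, 10, 100)]   # shrinks toward 0
xs = [argmin_hat(n) for n in (1, 10, 100)]  # approaches argmin f = 0
```

The uniform errors decrease toward 0 and the minimizers approach 0, illustrating both conclusions of the lemma.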

In many studies of the quantitative stability of multi-stage stochastic programs, it is natural to impose some continuity and convexity properties on the objective function. For example, it was assumed in [15, 22] that \(h_t\) is a uniformly continuous function defined on \({\mathcal {X}}_t\times \Xi _t\) and also a Lipschitz function for every fixed \(x_t\). The following theorem is our main result on the consistency of the recourse function (6) of problem (4) with that of problem (3) under the true distribution at each stage t.

Theorem 4

Suppose Assumptions 1 and 2 hold and \({\mathcal {X}}_t, 1\le t\le T\), are nonempty convex and compact subsets of \({\mathbb {R}}^{d}\). If

  1. (i)

    \(h_t(x_t,\xi _{t})\) is continuous and convex on \({\mathcal {X}}_t\times \Xi _t\),

  2. (ii)

    \(h_t(x_t,\xi _t), x_t\in {\mathcal {X}}_t\), is dominated by an integrable function,

  3. (iii)

    \({\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})\) is uniformly bounded for all \(x_{t-1}\),

  4. (iv)

    the samples are i.i.d. from the true distribution,

then, at each stage \(t, \hat{H}_t\) and \(H_t\) are continuous and convex on \({\mathcal {X}}_t\times \Xi ^{t+1}\) and

$$\begin{aligned} h_{t-1}(x_{t-1},\xi _{t-1})+&{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightrightarrows \\&h_{t-1}(x_{t-1},\xi _{t-1})+{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ H_t(x_{t-1},\pmb {\xi }^{t})\right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}), \end{aligned}$$
$$\begin{aligned} {\mathbb {D}}({\hat{S}_{t-1}^{n_{t}}},S_{t-1})\rightarrow 0\ \ as\ \ n_t\rightarrow \infty ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}),\ \ \text {and} \end{aligned}$$
$$\begin{aligned} \min _ {x_{t-1}} h_{t-1}(x_{t-1},\xi _{t-1})+&{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}} \left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightarrow \\&\min _{x_{t-1}}h_{t-1}(x_{t-1},\xi _{t-1})+{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t})\right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}). \end{aligned}$$

Proof

The proof is by mathematical induction on t. Since \(H_{T+1}=\hat{H}_{T+1}\equiv 0\), we have \(H_{T}(x_{T-1},\pmb {\xi }^{T})=\hat{H}_{T}(x_{T-1},\pmb {\xi }^{T})=\min _{x_T\in {\mathcal {X}}_T}h_T(x_T,\xi _T)\). Condition (iii) ensures that the level sets \(\text {lev}_{\{\alpha ,\,x_{T-1}\}} h_{T}(x_T,\xi _{T})\) are nonempty and uniformly bounded for all \(x_{T-1}\), since level sets must be subsets of \({\mathcal {X}}_T\); here the level set is defined as

$$\begin{aligned} \text {lev}_{\{\alpha ,\,x_{T-1}\}} h_{T}(x_T,\xi _{T}):=\{x_T\in {\mathcal {X}}_T(x_{T-1},\pmb {\xi }^{T}):h_T(x_T,\xi _{T})\le \alpha \}. \end{aligned}$$

Since \({\mathcal {X}}_T\) is a compact set in a finite dimensional vector space, any closed and bounded subset of \({\mathcal {X}}_T\) must be compact. Hence \(h_T\) satisfies the inf-compactness condition in [8]. Using Proposition 4.4 in [8] and Proposition 2.1 in [14], we can conclude that the optimal value functions \(H_{T}(x_{T-1},\pmb {\xi }^{T})\) and \(\hat{H}_{T}(x_{T-1},\pmb {\xi }^{T})\) are continuous and jointly convex w.r.t. \(x_{T-1}\) and \(\xi _{T}\).

By (i), (ii), (iv), and Theorem 3, we can conclude that

$$\begin{aligned} {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{T}}}&\left[ h_{T-1}(x_{T-1},\xi _{T-1})+\hat{H}_{T}(x_{T-1},\pmb {\xi }^{T}) \right] \rightarrow \\&{\mathbb {E}}_{{\mathbb {P}}_{\theta _{T}^c}}\left[ h_{T-1}(x_{T-1},\xi _{T-1})+H_{T}(x_{T-1},\pmb {\xi }^{T})\right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _T^c}) \end{aligned}$$

uniformly on \({\mathcal {X}}_{T-1}\). Combining this with Lemma 2 yields

$$\begin{aligned} {\mathbb {D}}(\hat{S}_{T-1}^{n_T},S_{T-1})\rightarrow 0\ \ as\ \ n_T\rightarrow \infty ,\ \ \text {and} \end{aligned}$$
$$\begin{aligned} \min \limits _{x_{T-1}}{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}&\left[ h_{T-1}(x_{T-1},\xi _{T-1})+\hat{H}_{T}(x_{T-1},\pmb {\xi }^{T})\right] \rightarrow \\ {}&\min \limits _{x_{T-1}}{\mathbb {E}}_{{\mathbb {P}}_{\theta _{T}^c}}\left[ h_{T-1}(x_{T-1},\xi _{T-1})+H_{T}(x_{T-1},\pmb {\xi }^{T}) \right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _T^c}). \end{aligned}$$

Therefore, we have proved the theorem for T. Assume that the conclusion is true for \(t+1,\ldots ,T-1\). We now show that it also holds for t.

We first prove the continuity and convexity of \(\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})\) and \(H_{t}(x_{t-1},\pmb {\xi }^{t})\) on \({\mathcal {X}}_t\times \Xi ^{t+1}\). With the continuity of \(h_t(x_t,\xi _{t}), \hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})\) and \(H_{t+1}(x_{t},\pmb {\xi }^{t+1})\) from the conclusion for stage \(t+1\), we can deduce the continuity of \(\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})\) and \(H_{t}(x_{t-1},\pmb {\xi }^{t})\) by an argument similar to that for stage T. Moreover, since the expectations of the convex functions \(\hat{H}_{t+1}(x_{t},\pmb {\xi }^{t+1})\) and \(H_{t+1}(x_{t},\pmb {\xi }^{t+1})\) are still convex, it follows that \(\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})\) and \(H_{t}(x_{t-1},\pmb {\xi }^{t})\) are also convex.

As for the consistency,

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] -{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \\&\quad = \underbrace{{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] -{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] }_{\dagger }\\&\qquad +\underbrace{{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] -{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] }_{\dagger \dagger }. \end{aligned} \end{aligned}$$
(19)

From the induction assumption for \(t+1\), we have \(\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})\rightarrow H_{t}(x_{t-1},\pmb {\xi }^{t})\), which means \(\left| \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})-H_{t}(x_{t-1},\pmb {\xi }^{t})\right| <\epsilon \) when \(n_t\) is large enough. We then have

$$\begin{aligned} \begin{aligned}&\vert {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] -{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \vert \\&\quad \le {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\vert \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})-H_{t}(x_{t-1},\pmb {\xi }^{t}) \vert <{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\epsilon \le \epsilon . \end{aligned} \end{aligned}$$

Hence \(\vert \dagger \vert <\epsilon \) when \(n_t\) is larger than a specific integer \(N_1\), which is large enough to ensure that \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \) converges pointwise to \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \). From Lemma 1, we have

$$\begin{aligned} {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightarrow {\mathbb {E}}_{{\mathbb {P}}_{\theta ^c_t}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}), \end{aligned}$$

which means that there exists an integer \(N_2\) such that \(\vert \dagger \dagger \vert <\epsilon \) when \(n_t>N_2\). Thus

$$\begin{aligned} {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightarrow {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}) \end{aligned}$$

pointwise.

It remains to show that \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \) converges uniformly to \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \).

We first prove that \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \) converges uniformly to \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \). Due to the convexity of \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \) and \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \), it is known from Corollary 7.18 in [31] that \(\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t})\) converges uniformly to \(H_{t}(x_{t-1},\pmb {\xi }^{t})\) on every compact subset of \({\mathcal {X}}_{t-1}\times \Xi ^t\). Thus we have \({\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightrightarrows {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \), which means that there exists an integer \(N_3\) such that

$$\begin{aligned} \sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] -{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| <\epsilon \end{aligned}$$

for all \(n_t>N_3\). Furthermore, from

$$\begin{aligned}{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightrightarrows {\mathbb {E}}_{{\mathbb {P}}_{\theta ^c_t}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] , \end{aligned}$$

we have that there exists an integer \(N_4\) such that

$$\begin{aligned} \sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t})\right] -{\mathbb {E}}_{{\mathbb {P}}_{\theta ^c_t}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| <\epsilon , \end{aligned}$$

when \(n_t>N_4\).

Then by choosing \(n_t>\max (N_1,N_2,N_3,N_4)\), we can draw the conclusion that

$$\begin{aligned}{} & {} \sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right. \\{} & {} \qquad -\left. {\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| \\{} & {} \quad \le \sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+\hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right. \\{} & {} \qquad -\left. {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| \\{} & {} \qquad +\sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right. \\{} & {} \qquad -\left. {\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ h_{t-1}(x_{t-1},\xi _{t-1})+H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| \\{} & {} \quad = \sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] -{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| \\{} & {} \qquad + \sup _{(x_{t-1},\,\pmb {\xi }^{t})\in {\mathcal {X}}_{t-1}\times \Xi ^t}\left| {\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }_{t}}}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t})\right] -{\mathbb {E}}_{{\mathbb {P}}_{\theta ^c_t}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \right| <2\epsilon . \end{aligned}$$

Since the above inequality holds for arbitrary \(\epsilon >0\), it is obvious that

$$\begin{aligned} h_{t-1}(x_{t-1},\xi _{t-1})+{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}_{t}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightrightarrows h_{t-1}(x_{t-1},\xi _{t-1})+{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t}) \right] . \end{aligned}$$

Combining this with Lemma 2, we can finally obtain that

$$\begin{aligned} {\mathbb {D}}({\hat{S}_{t-1}^{n_{t}}},S_{t-1})\rightarrow 0\ \ as\ \ n_t\rightarrow \infty ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}),\ \ \text {and} \end{aligned}$$
$$\begin{aligned} \min \limits _{x_{t-1}}h_{t-1}(x_{t-1},\xi _{t-1})+&{\mathbb {E}}_{{{\mathbb {P}}_{{\hat{\theta }}_{t}}}}\left[ \hat{H}_{t}(x_{t-1},\pmb {\xi }^{t}) \right] \rightarrow \\&\min \limits _{x_{t-1}}h_{t-1}(x_{t-1},\xi _{t-1})+{\mathbb {E}}_{{\mathbb {P}}_{\theta _{t}^c}}\left[ H_{t}(x_{t-1},\pmb {\xi }^{t})\right] ,\ \ a.s.\ ({\mathbb {P}}_{\theta _t^c}) . \end{aligned}$$

This shows that the conclusions hold for stage t, and the proof is complete by the principle of mathematical induction. \(\square \)

Upon establishing the convergence of the recourse functions at each stage as articulated in Theorem 4, we further extend our analysis to the convergence of the multi-stage BEO problem as a whole. This broader perspective considers not only the convergence of the objective functions and solutions at individual stages but also how these convergences collectively shape the overall solution of the original multi-stage problem. Theorem 4 provides a solid foundation by demonstrating that, under certain assumptions, the approximations at individual stages are consistent with respect to the true distribution. With this conclusion, we now shift our focus to how these stage-wise consistencies cumulatively affect the overall objective function and optimal solution of the entire multi-stage BEO problem. Specifically, we show the consistency of the multi-stage BEO problem (4) by analyzing its convergence properties as the sample sizes increase to infinity. We find that when the size of the historical data sample for estimating \(\hat{\theta }_t\) at each stage goes to infinity, both the optimal value and the set of optimal solutions of the multi-stage BEO problem (4) converge to their counterparts of the original MSP (3) under the true but unknown distribution.

For this purpose, let \(v^c\) represent the optimum value and \(S^c\) the set of optimal solutions for problem (3). Additionally, let \(N=\min \{n_2,n_3,\ldots ,n_T\}.\) Correspondingly, \(\hat{v}^N\) denotes the optimum value and \(\hat{S}^N\) the set of optimal solutions for the multi-stage BEO problem (4). With these notations, we present our conclusions on the convergence analysis for the multi-stage BEO problem.

Theorem 5

Suppose Assumptions 1 and 2 hold and \({\mathcal {X}}_t, 1\le t\le T\), are nonempty convex and compact subsets of \({\mathbb {R}}^{d}\). If

  1. (i)

    \(h_t(x_t,\xi _{t})\) is continuous and convex on \({\mathcal {X}}_t\times \** _t\),

  2. (ii)

    \(h_t(x_t,\xi _t), x_t\in {\mathcal {X}}_t\), is dominated by an integrable function,

  3. (iii)

    \({\mathcal {X}}_t(x_{t-1},\pmb {\xi }^{t})\) is uniformly bounded for all \(x_{t-1}\),

  4. (iv)

    the samples are i.i.d. from the true distribution,

then \(\hat{v}^N\rightarrow v^c\) and \({\mathbb {D}}(\hat{S}^N,S^c)\rightarrow 0\) as N tends to infinity.

Proof

Let \(\hat{h}^N(x)\) represent the objective function value of the multi-stage BEO problem (4) under a feasible policy x, and similarly, let h(x) denote that of the MSP problem (3) with the true distribution under a feasible policy x. Then it follows that

$$\begin{aligned}\hat{v}^N=\hat{h}^N(\hat{x}^*)\le \hat{h}^N({x}^*), \end{aligned}$$

where \(\hat{x}^*\) and \(x^*\) represent the optimal policies of problems (4) and (3), respectively. By examining the upper limit as the size of the historical data paths grows to infinity, we find that

$$\begin{aligned} \limsup _{N\rightarrow \infty }\hat{v}^N=\limsup _{N\rightarrow \infty }\hat{h}^N(\hat{x}^*)\le \limsup _{N\rightarrow \infty }\hat{h}^N({x}^*)=\lim _{N\rightarrow \infty }\hat{h}^N({x}^*)=h(x^*)=v^c, \end{aligned}$$
(20)

where the second equality holds due to Theorem 4: once the decision variable is fixed, the limit of the objective value exists and coincides with that under the true distribution.

Next, we demonstrate the convergence of the sequence \(\left\{ \hat{v}^{N}\right\} \). Observing that \(\left\{ \hat{v}^{N}\right\} \) is bounded because of (20), if it were not convergent, we could identify two subsequences converging to distinct values, say \(v_1\) and \(v_2\), i.e., \(\left\{ \hat{v}^{N_k}\right\} \rightarrow v_1, \left\{ \hat{v}^{N_j}\right\} \rightarrow v_2\) and \(v_1 \ne v_2\). Given the boundedness of the corresponding optimal solutions \(\left\{ \hat{x}^{*,N_k}\right\} \) and \(\left\{ \hat{x}^{*,N_j}\right\} \) due to the compactness assumption on the feasible regions, there exist convergent subsequences of each of these two sequences. For notational brevity, we assume without loss of generality that \(\hat{x}^{*,N_k} \rightarrow x_1, \hat{x}^{*,N_j} \rightarrow x_2\), where \(x_1\) and \(x_2\) are two feasible policies. Thus, we have

$$\begin{aligned} \begin{aligned}&v_1=\lim _{k \rightarrow \infty } \hat{v}^{N_k}=\lim _{k \rightarrow \infty } \hat{h}^{N_k}(\hat{x}^{*,N_k})=h\left( x_1\right) \ge v^c, \\&v_2=\lim _{j \rightarrow \infty } \hat{v}^{N_j}=\lim _{j \rightarrow \infty } \hat{h}^{N_j}\left( \hat{x}^{*,N_j}\right) =h\left( x_2\right) \ge v^c. \end{aligned} \end{aligned}$$

Meanwhile, we know from (20) that \(v_1 \le v^c\) and \(v_2 \le v^c\), so \(v_1=v^c=v_2\), which is a contradiction. Therefore, the sequence \(\left\{ \hat{v}^{N}\right\} \) converges and we have

$$\begin{aligned} \lim _{N \rightarrow \infty } \hat{v}^N=\limsup _{N \rightarrow \infty } \hat{v}^N=v^c. \end{aligned}$$

Finally, to prove the convergence of the solution set \(\hat{S}^N\) to \(S^c\), we assume, by contradiction, that \({\mathbb {D}}(\hat{S}^N,S^c)\) does not converge to zero as N increases to infinity. Then there exist a positive number \(\epsilon _0\) and a sequence of optimal policies \(\left\{ \hat{x}^{*,N_k}\right\} \) such that \(dist(\hat{x}^{*,N_k}, S^c)>\epsilon _0\) for all k. Following the same argument as above, we can find a subsequence of \(\left\{ \hat{x}^{*,N_k}\right\} \) that converges to \(\bar{x}\), with \(\bar{x}\) being a feasible policy. We can then deduce that

$$\begin{aligned} \lim _{k \rightarrow \infty } \hat{h}^{N_k}\left( \hat{x}^{*,N_k} \right) =v^c=h(\bar{x}). \end{aligned}$$

Since \(\left\{ \hat{x}^{*,N_k}\right\} \) is a sequence of optimal policies of the multi-stage BEO problem (4), we have \(\bar{x} \in S^c\). However, \(\hat{x}^{*,N_k} \rightarrow \bar{x}\) together with \(dist(\hat{x}^{*,N_k}, S^c)>\epsilon _0\) implies \(dist(\bar{x},S^c) \ge \epsilon _0>0\), which is a contradiction. This completes the proof. \(\square \)

4 Solution of the multi-stage BEO problem

In the previous section, we showed that with a vast amount of data, the proposed BEO approach provides a well-performing approximation to the multi-stage stochastic convex optimization problem when the underlying distribution is unknown. In this section, we consider how to solve the BEO problem in the single-stage and multi-stage cases, respectively.

4.1 Algorithm for single-stage BEO problem based on stochastic approximation

For simplicity, let \(H(x,\theta ):={\mathbb {E}}_{{\mathbb {P}}_\theta }[h(x,\xi )]\). Our single-stage BEO can then be described as

$$\begin{aligned} \min _{x\in {\mathcal {X}}}H(x,{\mathbb {E}}_{{\mathbb {P}}_n}\theta ). \end{aligned}$$
(21)

In order to fully utilize the structural properties of the single-stage BEO, we adopt the stochastic approximation (SA, see, for example, [20]) framework to design an algorithm for problem (21). Owing to its simple iteration framework and applicability to nonsmooth problems, SA has been widely adopted for solving stochastic programs. In what follows, we denote by \(d(x,\xi ):=\frac{dh(x,\xi )}{dx}\) the gradient of the cost function \(h(x,\xi )\) w.r.t. x. To solve the single-stage BEO problem (21), we propose an SA-type algorithm of the following form:

$$\begin{aligned} x_{i+1}=\Pi _{{\mathcal {X}}}[x_{i}+\epsilon _iD_i], \end{aligned}$$
(22)

where \({\mathcal {X}}\) is a convex and closed feasible region, \(\{\epsilon _i\}_{i\ge 0}\) is a step size sequence satisfying \(\epsilon _i>0\) and \(\sum _{i=1}^\infty \epsilon _i=\infty \), \(D_i\) denotes a descent direction of the cost function \(h(x,\xi )\), and \(\Pi _{{\mathcal {X}}}\) is the projection operator that projects the candidate iteration point back onto the feasible set \({\mathcal {X}}\). A typical choice of \(D_i\) is the negative gradient vector, computed, for instance, with some finite difference scheme. The SA algorithm with a proper estimator of the gradient converges to a local optimum under some mild assumptions (e.g., [9]). We now consider the calculation of \(d(x,\xi )\) for the single-stage BEO problem.
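To make iteration (22) concrete, the following minimal sketch runs it on a hypothetical instance (our own example, not the paper's): minimize \({\mathbb {E}}[(x-\xi )^2]\) over \({\mathcal {X}}=[0,1]\) with \(\xi \sim N(0.3,1)\), taking \(D_i\) as the exact negative stochastic gradient and \(\epsilon _i=0.5/i\). The constrained minimizer is \(x^*={\mathbb {E}}[\xi ]=0.3\).

```python
import random

random.seed(2)

# Hypothetical instance: h(x, xi) = (x - xi)^2, X = [0, 1], xi ~ N(0.3, 1),
# so the minimizer of E[h] over X is x* = E[xi] = 0.3 (an interior point).
def project(x, lo=0.0, hi=1.0):
    """Projection Pi_X onto the closed convex set [lo, hi]."""
    return max(lo, min(hi, x))

x = 0.9                        # arbitrary feasible starting point
for i in range(1, 20001):
    xi = random.gauss(0.3, 1.0)
    d = -2.0 * (x - xi)        # descent direction: minus the stochastic gradient
    eps = 0.5 / i              # positive steps with sum eps_i = infinity
    x = project(x + eps * d)   # iteration (22)
```

With the diminishing step sizes, the iterates settle near \(x^*=0.3\) while remaining feasible at every iteration.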

For presentation convenience, we only consider the results for the one-dimensional decision variable x. The multi-dimensional case can be handled by forming the gradient vector one dimension at a time, treating each dimension as a one-dimensional parameter while fixing the rest. To prove the consistency of estimators, we need the following crucial result.

Lemma 3

([9] Proposition 1) Assume that the function \(h(x,\xi )\) satisfies \(\vert h(x_1,\xi )-h(x_2,\xi )\vert <K(\xi )\vert x_1-x_2\vert \) for all \(x_1, x_2\in {\mathcal {X}}\), where \(K(\xi )\) is a random variable with \({\mathbb {E}}[K(\xi )]<\infty \), and that \(d(x,\xi )\) exists w.p. 1 for all \(x\in {\mathcal {X}}\), with \({\mathcal {X}}\) an open set. Let \(D_h\) denote the set of points at which h is differentiable. If \({\mathbb {P}}(\xi \in D_h)=1\) for all \(x\in {\mathcal {X}}\), then \(d{\mathbb {E}}[h(x,\xi )]/dx={\mathbb {E}}[d(x,\xi )]\) for all \(x\in {\mathcal {X}}\).

Suppose that the function \(h(x,\xi )\) defining H in (21) satisfies the assumptions of Lemma 3. Then we have the following equation for the gradient of the single-stage BEO:

$$\begin{aligned} \frac{d{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}}}[h(x,\xi )]}{dx}={\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}}}\left[ \frac{dh(x,\xi )}{dx}\right] ={\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}}}[d(x,\xi )]. \end{aligned}$$
(23)

That is, the gradient and expectation operations are interchangeable.

With (23), we can estimate the gradient of the objective function of problem (21) in practice by the sample average \(\frac{1}{m}\sum _{j=1}^{m}d(x,\xi _j)\), where \(\xi _j, 1\le j\le m,\) are i.i.d. with common distribution \({\mathbb {P}}_{\hat{\theta }}\) and \(\hat{\theta }={\mathbb {E}}_{{\mathbb {P}}_n}\tilde{\theta }\). This provides a strongly consistent estimator of the gradient. If a closed form of the parameter \(\hat{\theta }\) is not available, one can approximate it by a simulation method such as MCMC [16].

With the above preparation, the concrete SA-type algorithm for solving problem (21) can be described as the following Algorithm 1.

Algorithm 1
figure a

Solution of single-stage BEO via SA
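As an illustration, the projected SA iteration (22) with the sample-average gradient estimator can be sketched as follows. The quadratic cost \(h(x,\xi )=(x-\xi )^2\), the interval feasible set and all numerical constants below are hypothetical choices for demonstration, not part of Algorithm 1.

```python
import random

def sa_beo(grad, sample_xi, x0, lo, hi, n_iter=2000, m0=10, seed=1):
    """Projected SA iteration x_{i+1} = Pi_X[x_i + eps_i * D_i] (cf. (22)),
    with descent direction D_i = -(1/m_i) * sum_j d(x_i, xi_j), the xi_j
    drawn i.i.d. from the estimated input model P_theta_hat."""
    rng = random.Random(seed)
    x = x0
    for i in range(1, n_iter + 1):
        m_i = m0 + i // 10                       # monotonically increasing {m_i}
        d_bar = sum(grad(x, sample_xi(rng)) for _ in range(m_i)) / m_i
        eps_i = 1.0 / i                          # sum eps_i = inf, sum eps_i^2 < inf
        x = min(hi, max(lo, x - eps_i * d_bar))  # projection onto [lo, hi]
    return x

# Toy instance: h(x, xi) = (x - xi)^2 with xi ~ Exp(2) under P_theta_hat,
# so the minimizer of E[h] over [0, 10] is E[xi] = 0.5.
x_hat = sa_beo(grad=lambda x, xi: 2.0 * (x - xi),
               sample_xi=lambda rng: rng.expovariate(2.0),
               x0=2.0, lo=0.0, hi=10.0)
```

With the diminishing step sizes and growing batch sizes of Theorem 6, the iterate settles near the true minimizer of the expected cost.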

Remark 3

Compared with the BRO problem (1), our single-stage BEO problem has a distinct advantage: its objective function is much easier to estimate. Specifically, the BRO approach needs many more samples because of its nested structure. For instance, if n samples are used to estimate the inner expectation and m samples to estimate the outer risk function, it takes a total of \(n\times m\) samples to estimate the objective function in the BRO problem (1). In contrast, estimating the objective function of a single-stage BEO problem needs only n samples, obviously fewer than BRO requires.

As for the convergence of Algorithm 1, we have the following theorem.

Theorem 6

Suppose that \(\{m_i\}\) is a monotonically increasing sequence, the step size sequence \(\{\epsilon _i\}_{i\ge 0}\) satisfies \(\sum _{i=1}^\infty \epsilon _i=\infty \) and \(\sum _{i=1}^\infty \epsilon _i^2<\infty \). Then Algorithm 1 with \(D_i=\frac{1}{m_i}\sum _{j=1}^{m_i}d(x,\xi _{j})\) converges w.p.1 to a unique stationary point of the following ordinary differential equation

$$\begin{aligned} \dot{x}=-{\mathbb {E}}_{{\mathbb {P}}_{{\hat{\theta }}}}[d(x,\xi )]. \end{aligned}$$

Proof

The conclusion follows directly from Theorem 2.1 of [20]. \(\square \)

4.2 Algorithm for multi-stage BEO based on scenario tree generation

In Sect. 3, we established theoretical guarantees that the multi-stage BEO problem (4) provides a good approximation to the original MSP problem (3) as long as the data size is sufficiently large. In this subsection, we consider the scenario tree approximation technique for solving problem (4) numerically.

The discrete approximation to problem (4) can be developed in two steps. First, based on Theorem 4, we obtain a posterior estimate of the parameter \(\theta _t\), and thereby the density and conditional density of \({\mathbb {P}}_{{\hat{\theta }}_{t}}\). Second, with the estimated density models, a scenario tree that discretely approximates the data process \(\{\xi ^t\}\) can be generated by sampling the conditional distributions.

In line with our description of the dynamic inter-stage dependence, we recursively generate a scenario tree stage by stage: first, estimate the posterior distribution \({\mathbb {P}}_{1,n_1}\) and the parameter \(\hat{\theta }_{1,n_1}\) at the first stage \(t = 1\), and discretize the approximate distribution \({\mathbb {P}}_{\hat{\theta }_{1,n_1}}\) by the discrete measure \(\sum _{i=1}^{n_1}p_i\delta _{\xi _{1,i}}\) of \(n_1\) samples. This can be achieved with existing algorithms in the literature, for instance, the stochastic approximation algorithm in [29]. Recursively, assuming that the sub-tree of the first t stages has been generated, for each historical path \((\xi _{1},\ldots ,\xi _{t})\), i.e., each node at stage t, estimate the conditional distribution at stage \(t+1\) by sampling from the conditional distribution \({\mathbb {P}}_{{\hat{\theta }}_{t,n_{t+1}}}\). This approximate distribution can similarly be represented by a discrete probability measure with \(n_{t+1}\) samples. As is common in the scenario tree generation literature, we consider a symmetric scenario tree with the branching structure \([s_1,\ldots ,s_T]\). The resulting scenario tree generation algorithm is presented as Algorithm 2. Moreover, to ensure the accuracy and practicality of Algorithm 2, at each non-leaf node we utilize the Wasserstein measure to find the best empirical distribution and a clustering technique to determine the best branches.

Algorithm 2
figure b

Scenario tree generation for problem (4).
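A minimal sketch of the recursive stage-by-stage construction behind Algorithm 2 (omitting the Wasserstein-based discretization and the clustering refinement) might look as follows. The Gaussian conditional sampler is a hypothetical stand-in for sampling from \({\mathbb {P}}_{{\hat{\theta }}_{t}}\), and equal child weights replace the optimized probabilities.

```python
import random

def build_tree(branching, sample_child, value, stage=0, prob=1.0):
    """Grow a symmetric scenario tree with branching structure [s_1,...,s_T]:
    each node at stage t spawns s_{t+1} equally weighted children drawn from
    the conditional distribution given the node's value."""
    node = {"value": value, "prob": prob, "children": []}
    if stage < len(branching):
        s = branching[stage]
        for _ in range(s):
            child = build_tree(branching, sample_child,
                               sample_child(value, stage), stage + 1, prob / s)
            node["children"].append(child)
    return node

def count_scenarios(node):
    """Number of root-to-leaf paths, i.e. S = prod_t s_t for a symmetric tree."""
    if not node["children"]:
        return 1
    return sum(count_scenarios(c) for c in node["children"])

rng = random.Random(0)
# Hypothetical AR(1)-style conditional: a child's value depends on its parent's.
tree = build_tree([3, 3, 2, 2], lambda v, t: 0.5 * v + rng.gauss(0.5, 1.0), 1.0)
```

For the branching structure \([3,3,2,2]\) this yields \(S=36\) scenarios, matching \(S=\prod _{t}s_t\) below.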

With Algorithm 2, a finite-dimensional approximation to problem (4) is obtained by considering \(S=\prod _{t=1}^Ts_t\) scenarios and assigning each scenario a discrete probability \(p^s\). This yields the following reformulation of problem (4) based on a scenario tree consisting of the scenarios \(\pmb {\xi }_s:=(\xi _1^s,\ldots ,\xi _T^s), s=1,\ldots ,S\):

$$\begin{aligned} \begin{aligned}&\min \sum _{s=1}^{S}p^s\sum _{t=1}^{T}h_t(x_t^s,\xi _{t}^s)\\ \text {s.t.}\quad&x_t^s\in {\mathcal {X}}_t(x_{t-1}^s,\pmb {\xi }_s)\quad \forall s,\ \forall t;\\&x_1^{k}=x_1^j\quad \forall k,j;\\&x_t^{k}=x_t^j\quad \text {whenever}\quad \pmb {\xi }_k^{t-1}\equiv \pmb {\xi }_j^{t-1},\quad 2\le t\le T. \end{aligned} \end{aligned}$$
(24)

Due to the continuity and convexity assumptions on \(h_t(\cdot ,\cdot )\) and \({\mathcal {X}}_t(\cdot ,\cdot )\), problem (24) is a constrained convex programming problem and can be readily solved by off-the-shelf packages. We have thus shown how to solve problem (4), and hence problem (3), efficiently in practice.
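The nonanticipativity constraints in (24), i.e. the equalities \(x_t^k=x_t^j\) for scenarios sharing a history, can be generated by grouping scenarios with identical prefixes. This helper is an illustrative assumption, not part of the paper's implementation.

```python
def nonanticipativity_groups(scenarios, t):
    """Return groups of scenario indices that share the same history
    (xi_1, ..., xi_{t-1}); within each group the decisions x_t^s of (24)
    must coincide."""
    groups = {}
    for s, path in enumerate(scenarios):
        groups.setdefault(tuple(path[:t - 1]), []).append(s)
    return list(groups.values())

# Three scenarios over T = 2 stages; the first two share xi_1 = 1, so their
# stage-2 decisions must be equal, while x_1 is common to all scenarios.
paths = [(1, 2), (1, 3), (2, 2)]
stage2 = nonanticipativity_groups(paths, 2)  # groups by xi_1
stage1 = nonanticipativity_groups(paths, 1)  # one group: empty history
```

Each group then contributes equality constraints linking the corresponding scenario copies of \(x_t\) in the convex program.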

5 Numerical experiments

In this section, we illustrate the consistency conclusions and the practical solution of BEO problems through numerical experiments. Corresponding to the two cases examined in the last section, we consider two typical applications for the single-stage and multi-stage BEO problems, respectively. We conduct numerical experiments to compare the practical value of our BEO approach with that of other data-driven approaches in the literature.

5.1 A queuing system

In this subsection, we illustrate the single-stage BEO problem by considering the following first-come-first-served M/M/1 queueing system:

$$\begin{aligned} \min _{x>0}H(x)={\left\{ \begin{array}{ll} \min \{{\mathbb {E}}_{\theta ^c}[T(x;\xi )]+\frac{c_0}{x},M\}, &{}\text {if}\ \ \theta ^cx<1;\\ M,&{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(25)

Here \(\xi \) denotes the random interarrival time, \(T(x; \xi )\) denotes the steady-state average customer waiting time, and \(c_0\) and M are pre-specified constants standing for the unit service cost and the cost limit, respectively. With the assumption that \(\xi \) follows an exponential distribution with density \(p(\xi ;\theta ^c)=\theta ^c\exp (-\theta ^c\xi )\), the expectation function \({\mathbb {E}}_{\theta ^c}[T(x;\xi )]\) has the closed form \(\frac{x}{1-\theta ^cx}\) in (25), which leads to the analytical expression \(x^*=\frac{\sqrt{c_0}}{\sqrt{c_0}\theta ^c+1}\) for the optimal solution of problem (25). This explicit solution will be used as a benchmark against which different data-driven methods are compared in what follows.

Although the parameter \(\theta ^c\) of the underlying distribution is unknown in reality, we can observe i.i.d. interarrival time samples \(\xi _{1},\ldots ,\xi _n\) from the true distribution. The traditional parametric method fits the parameter \(\theta ^c\) by maximum likelihood estimation and then optimizes the problem under the estimated distribution; this is called the “empirical simulation optimization" (ESO) method. Instead, we propose a Bayesian approach which quantifies the uncertainty of the random vector with prior knowledge. Specifically, we take the Gamma distribution \(\Gamma (a_0, b_0)\), which is widely used to estimate the parameter of an exponential distribution, as the prior. Under this setting, the conjugate posterior distribution is

$$\begin{aligned}p(\hat{\theta }\vert \xi _1,\ldots ,\xi _n)=\Gamma (a_0+n,b_0+\sum _{i=1}^{n}\xi _i) \end{aligned}$$

and the expectation of the posterior distribution is \(\hat{\theta }=\frac{a_0+n}{b_0+\sum _{i=1}^{n}\xi _i}\). Then we can cope with problem (25) by considering the following optimization problem under the estimated input model:

$$\begin{aligned} \min _{x>0}\hat{H}(x)=\min _{x>0}{\mathbb {E}}_{\hat{\theta }}[T(x;\xi )]+\frac{c_0}{x}. \end{aligned}$$
(26)

This problem corresponds to a single-stage BEO problem and we can solve it with Algorithm 1.
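For this example the whole pipeline — conjugate update, posterior mean \(\hat{\theta }\), and the closed-form solution of (26) inherited from (25) — fits in a few lines; the sample size and random seed below are illustrative.

```python
import random

def beo_queue_solution(samples, a0=2.0, b0=0.0, c0=1.0):
    """Posterior mean theta_hat = (a0 + n) / (b0 + sum xi_i) under the
    Gamma(a0, b0) prior, plugged into x* = sqrt(c0) / (sqrt(c0)*theta + 1)."""
    n = len(samples)
    theta_hat = (a0 + n) / (b0 + sum(samples))
    return c0 ** 0.5 / (c0 ** 0.5 * theta_hat + 1.0)

rng = random.Random(42)
data = [rng.expovariate(10.0) for _ in range(1000)]  # true theta^c = 10
x_hat = beo_queue_solution(data)                     # close to x* = 1/11
```

With 1000 samples the posterior mean concentrates near the true rate, so the BEO solution lands close to \(x^*\approx 0.091\).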

With respect to different \(\hat{\theta }\)s, problem (26) with the same interarrival time samples leads to different optimal solutions \(\hat{x}^*\). In order to assess the performance of these solutions, we introduce an indicator, the expected square-deviation, to evaluate the deviation between the objective function value \({{H}(\hat{x}^*)}\) at the estimated optimal solution \(\hat{x}^*\) and the true optimal value \(H(x^*)\) for the different methods:

$$\begin{aligned} {D(\hat{x}^*)={\mathbb {E}}\left[ \left( \frac{{H}(\hat{x}^*)-H(x^*)}{H(x^*)}\right) ^2\right] }. \end{aligned}$$

As we can see, the value of D measures, on average, the deviation of the solution obtained by the corresponding method from the true optimum. Thus the bigger the value of D, the larger the error from the true optimum value.

To compare the performances of the BRO, BEO and ESO methods, we independently run K repetitive experiments. For every replication \(k, k=1,\ldots ,K\), we generate a sample set of n samples \(\{\xi _{1},\ldots ,\xi _n\}\) from the underlying distribution and find the optimal solution \(\hat{x}^{*,k}\) of the approximation problem (26) by the BRO, BEO and ESO methods, respectively. Finally, we compute the average square-deviation \({D=\frac{1}{K}\sum _{k=1}^{K}\left( \frac{{H}(\hat{x}^{*,k})-H(x^*)}{H(x^*)}\right) ^2}\).
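The indicator D is straightforward to compute once the K replications are done; a sketch with made-up objective values:

```python
def avg_square_deviation(obj_values, h_star):
    """D = (1/K) * sum_k ((H(x_hat_k) - H(x*)) / H(x*))^2 over K replications."""
    return sum(((v - h_star) / h_star) ** 2 for v in obj_values) / len(obj_values)

# Two hypothetical replications whose objective values miss H(x*) = 12 by +/- 1:
d = avg_square_deviation([13.0, 11.0], 12.0)  # ((1/12)^2 + (1/12)^2) / 2 = 1/144
```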

We set the parameter values in the specific numerical experiment as follows: the true parameter \(\theta ^c\) of the underlying distribution is 10, the unit service cost \(c_0\) is 1.0, the cost limit M is 500.0, the number of replications K is 100, while the parameters in the prior Gamma distribution are chosen as \(a_0 = 2\) and \(b_0 = 0\). Under the above setting, it is not difficult to solve the original problem (25). The true optimal solution is \(x^*\approx 0.091\) and the optimum value is \(H(x^*) = 12\).

Table 1 Comparison of solutions obtained with different methods

Table 1 shows the numerical results obtained with BEO, ESO, as well as BRO with the risk measure chosen as the expectation and VaR, respectively. Due to space limitations, readers can refer to [10, 44] for detailed algorithms solving the resulting Mean BRO and VaR BRO. The first column of Table 1 shows the sample size n; under each method, the first subcolumn presents the average value of optimal solutions \(\hat{x}^{*}=\frac{1}{K}\sum _{k=1}^K\hat{x}^{*,k}\) obtained over the K replications, and the second subcolumn shows the average square-deviation D. We have the following observations from the numerical results in Table 1:

  • First of all, the average square-deviation of each of the four methods monotonically decreases as n increases, as expected.

  • For all the examined sample sizes, BEO can always find the true optimal solution, and its average square-deviation is consistently the smallest among the four examined methods, except for \(n=1000\).

  • As n increases, the optimal solutions obtained with Mean BRO and VaR BRO monotonically increase and approach the true optimal solution, but never reach it.

  • The average square-deviation of BEO is always significantly smaller than that of Mean BRO or VaR BRO except for \(n=1000\); on the other hand, the average square-deviations of both Mean BRO and VaR BRO decrease faster than that of BEO; in terms of the average square-deviation, Mean BRO is better than VaR BRO for small n while the opposite holds for large n.

  • Although ESO, like BEO, can find the true optimal solution for different n, its average square-deviation is always the biggest among the four examined methods, indeed much larger for all n smaller than 1000.

When facing a single-stage problem, our BEO formulation obtains a better optimal solution and overcomes the shortcomings of both ESO and BRO. The optimal value obtained with BEO is quite stable whether the sample size is large or small.

From this numerical experiment, we might conclude that the proposed Bayesian expectation optimization technique can overcome the shortcomings of both ESO and BRO in terms of a smaller deviation and a better optimal solution, especially in the case of a relatively small sample size.

5.2 A multi-stage inventory problem

To demonstrate the consistency conclusions about the multi-stage BEO problem in Sect. 3 and Algorithm 2 designed in the last section, we consider the re-modeling and solution of multi-stage inventory problems, a widely used example in the stochastic optimization literature. We aim at exploring the practical value of the multi-stage BEO problem (4) in applications with complex situations, especially cases with a relatively large number of stages (e.g., \(T \ge 5\)) and/or a relatively small number of historical scenarios (e.g., \( N \le 100\)).

The general multi-stage inventory problem can be described as

$$\begin{aligned} \min {\mathbb {E}}U\left( \sum _{t=1}^{T-1}b_tx_t+\sum _{t=2}^{T}c_t\eta _t+\sum _{t=1}^{T-1}h_t\nu _t-\sum _{t=2}^{T}s_t\xi _t-d\nu _T\right) \end{aligned}$$
(27)
$$\begin{aligned} \text {s.t.}\ \ x_{t-1}+\nu _{t-1}-\xi _t=\nu _t-\eta _t,\ \ t = 2,\ldots ,T, \end{aligned}$$
(28)
$$\begin{aligned} \nu _t\ge 0\ \ t = 2,\ldots ,T, \end{aligned}$$
(29)
$$\begin{aligned} \eta _t\ge 0\ \ t = 2,\ldots ,T. \end{aligned}$$
(30)

Here, for each stage \(t = 1,\ldots ,T-1, b_t\) denotes the purchase price, \(s_t\) denotes the selling price, \(h_t\) denotes the inventory holding cost, \(c_t\) denotes the cost of purchasing additional inventory from another retailer, d denotes the final value of the inventory and \(\zeta _1\) is the initial value of the inventory. The random data \(\xi _t\) represents the uncertain demand. The decision variable \(x_t\ge 0\) represents the order size, \(\nu _t\) is the amount of stock that will be carried to the next stage and \(\eta _t\) is the shortfall of stock at stage t.

What’s more, following [25], U denotes a convex disutility function of the cost value,

$$\begin{aligned}U(h)={\left\{ \begin{array}{ll} h^{1+\delta }, &{}\text {if}\ h\ge 0,\\ h, &{}\text {if}\ h<0,\\ \end{array}\right. } \end{aligned}$$

with \(\delta =1\).

We consider a five-stage problem. The Poisson process is often used to describe the number of events in a given time interval, so, without loss of generality, we assume that the true demand process follows a Poisson model. However, since a single Poisson model cannot accurately reflect inter-stage correlated demands, we let the true parameter of the Poisson model evolve according to a time-inhomogeneous auto-regressive AR(1) model. Concretely, for \(\xi _{t}, t= 2,\ldots ,5\), we have:

$$\begin{aligned} \xi _{t}\sim \text {Poisson}(\theta ^c_{t-1}),\quad \theta ^c_{t-1}=\psi \theta ^c_{t-2}+\mu +{\varepsilon _{t-1}},\quad \varepsilon _{t-1}\sim {\mathcal {N}}(0,1). \end{aligned}$$
(31)

The initial demand is \(\xi _{1}\) = 65, the unit cost for the inventory is \(d = 2\) and the initial stock value is \(\zeta _1 = 2\). We choose \(\psi =0.5\) and \(\mu =0.5\xi _{t-1}\). The values of other deterministic parameters are chosen as those in Table 2.
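A path of the demand process (31) can be simulated as below. The standard library has no Poisson sampler, so Knuth's inversion-by-products method is used (adequate for moderate rates), and initializing the rate at \(\xi _1\) is our assumption.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler: multiply uniforms until the product drops
    below exp(-lam); fine for moderate rates such as those here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_demand(T=5, xi1=65, psi=0.5, seed=0):
    """One path of (31): theta_{t-1} = psi*theta_{t-2} + 0.5*xi_{t-1} + eps_{t-1}
    with eps ~ N(0, 1), and xi_t ~ Poisson(theta_{t-1}) for t = 2,...,T."""
    rng = random.Random(seed)
    xi = [xi1]
    theta = float(xi1)  # assumption: start the rate at the initial demand
    for _ in range(2, T + 1):
        theta = psi * theta + 0.5 * xi[-1] + rng.gauss(0.0, 1.0)
        xi.append(poisson(theta, rng))
    return xi

path = simulate_demand()
```

Because \(\mu \) carries half of the previous demand, the simulated rate (and hence the demand) hovers around the initial level of 65.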

Table 2 Values of deterministic and prior parameters

Due to the random noise in the parameters of the multi-stage inventory problem, we cannot find an analytical expression for the optimal value of problem (27)–(30). The optimal value of the real problem is therefore estimated by solving its SAA problem with 150,000 samples, which yields \(-\)2000.6552.

We use a Gamma distribution \(\Gamma (a_{0,t}, b_{0,t})\) as a prior for the random demand at stage t, which is conjugate with the Poisson distribution. Hence, the posterior distribution is

$$\begin{aligned} p(\theta _t\vert \xi ^1_t,\ldots ,\xi ^{n_t}_t)=\Gamma (a_{0,t}+\sum _{i=1}^{n_t}\xi _t^i,b_{0,t}+n_t), \end{aligned}$$

and the expectation of the posterior distribution is \(\hat{\theta }_{t}=\frac{a_{0,t}+\sum _{i=1}^{n_t}\xi ^i_t}{b_{0,t}+n_t}\). The values of prior parameters \(a_{0,t}\) and \(b_{0,t}\) are presented in the last two columns of Table 2.
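The conjugate update itself is a one-liner; for example, with illustrative prior parameters and demand counts:

```python
def posterior_mean_poisson(a0, b0, counts):
    """Gamma(a0, b0) prior + Poisson observations -> posterior
    Gamma(a0 + sum(counts), b0 + n); return its mean theta_hat_t."""
    return (a0 + sum(counts)) / (b0 + len(counts))

theta_hat = posterior_mean_poisson(1.0, 1.0, [3, 4, 5])  # (1 + 12) / (1 + 3)
```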

We now examine the performance of the optimal solutions obtained with respect to different sample sizes. With the demand process generated through (31), Algorithm 2 is utilized to discretize problem (27)–(30) approximately as:

$$\begin{aligned} \begin{aligned}&\min \sum _{s=1}^{S}p^sU\left( \sum _{t=1}^{T-1}b_tx_t^s+\sum _{t=2}^{T}c_t \eta _{t}^s+\sum _{t=1}^{T-1}h_t \nu _{t}^s-\sum _{t=2}^{T}s_t\xi _t^s-d \nu _{T}^s\right) \\ \text {s.t.}\quad&x_{t-1}^s+ \nu _{t-1}^s-\xi _t^s= \nu _{t}^s- \eta _{t}^s \ s=1,\ldots ,S,\ t=2,\ldots ,T,\\&x_1^{k}=x_1^j\quad k,j=1,\ldots ,S,\\&x_t^{k}=x_t^j\quad \text {whenever}\quad \pmb {\xi }_k^{t-1}\equiv \pmb {\xi }_j^{t-1},\\&\nu _{t}^s\ge 0\ \ t = 2,\ldots ,T,\\&\eta _{t}^s\ge 0\ \ t = 2,\ldots ,T. \end{aligned} \end{aligned}$$
(32)

Then problem (32) can be easily solved as a convex programming problem.

To assess the approximate solutions, we measure their performance by the average square-deviation of the function value of each solution from the true optimum value, similarly to the single-stage case above. When solving the resulting multi-stage BEO problem, we set the tree structure in Algorithm 2 as \(3\times 3\times 2\times 2, 5\times 5\times 3\times 3\) and \(10\times 10\times 8\times 8\), denoted Tree 1, Tree 2 and Tree 3, respectively. For comparison purposes, we also solve problem (27)–(30) by the SAA method, the Bayesian DRO approach proposed in [24] and the robust stochastic optimization (RSO) method of [5].

By considering different sample sizes, Tables 3, 4, 5, 6, 7 and 8 provide detailed results obtained with Algorithm 2, SAA, DRO and RSO, respectively. The first row in each table presents the average objective function value over \(K=500\) replications, the second row shows the average square-deviation (SD), reported on the scale of \( 10^{-4}\), and the third row shows the average computation time (CPU time in seconds) to solve problem (32) by the different methods. With a sample size of 10,000, the RSO method needs more than 30 min per solution; since the solution must be repeated 500 times, evaluating the approximate solution obtained with the RSO approach would take more than 10 days. Therefore, we do not show the results of the RSO approach in Table 8.

The encountered convex programming problems were solved in a Python 3.7 environment with the Gurobi solver. All experiments were performed on a 64-bit PC with 12 GB of RAM and a 3.20 GHz processor.

Table 3 Comparison of four methods with sample size 10
Table 4 Comparison of four methods with sample size 20
Table 5 Comparison of four methods with sample size 50
Table 6 Comparison of four methods with sample size 100
Table 7 Comparison of four methods with sample size 1000
Table 8 Comparison of four methods with sample size 10,000

We have the following observations from the numerical results in Tables 3, 4, 5, 6, 7 and 8:

  • As the sample size increases, the average square-deviations of all four approaches monotonically decrease and the average objective value in each case gets closer to the true optimal value, as expected.

  • For the same sample size, the bigger the scenario tree, the smaller the average square-deviation, and the average objective value approaches the real optimal value as the scenario tree becomes larger. This phenomenon can be attributed to the fact that a larger scenario tree can more accurately represent the complexity and variability inherent in real-world uncertainty.

  • When the sample size is as large as 10,000, a small scenario tree cannot fully utilize all the distribution information of the samples. This explains why the average square-deviation of a larger tree exhibits a more pronounced decreasing trend compared to smaller trees. However, as demonstrated in the results, a drawback of the scenario tree algorithm is the significant increase in computational burden as the tree size grows. This observation underscores an important trade-off to consider when selecting the size of a scenario tree: larger trees can provide more precise solutions but at the cost of higher computational demands.

  • Moreover, the multi-stage BEO approach with a small sample size can achieve the same overall performance as or even performs better than the SAA method with a large sample size. For example, the solution of the multi-stage BEO method with 20 samples provides a better objective value, smaller average square-deviation and shorter computation time than those of the SAA method with 1000 samples. This superiority can be largely credited to the multi-stage BEO method’s enhanced ability to incorporate and process the intricate dynamics and interstage dependencies present in multi-stage decision processes, which are often overlooked in traditional methods like SAA.

  • Both DRO and RSO exhibit some degree of over-conservatism. As the sample size increases, the reduction in average square-deviation is not particularly pronounced. This persistent conservativeness, especially for large data sets, is undesirable. As data availability increases, the solutions obtained under the worst-case model do not show significant variations, limiting their adaptability to more nuanced data information or trends. Moreover, the DRO approach fails to generate a well-performing solution at any sample size: its average square-deviation is always the largest among the different approaches, which again shows its over-conservatism. On the other hand, the average square-deviation of the RSO approach is always the smallest among the different approaches and sample sizes. Nevertheless, the computation time of the RSO approach varies significantly compared with that of the other approaches.

  • Compared with the other approaches, the estimated solution of SAA converges to the true solution faster as the sample size varies from 10 to 10,000, but the computation time of the SAA method also explodes with the increase of the sample size. In contrast, the computation time of the multi-stage BEO method under each tree structure, and that of the DRO approach, change very little across different sample sizes.

Summarizing the above observations, when comprehensively considering the approximation of the optimal value, stability and solution time, our proposed BEO approach is practical and reliable for solving multi-stage convex stochastic programs, and performs better than relevant methods such as SAA.

6 Conclusions and future work

We propose a new framework, multi-stage Bayesian expectation optimization, to transform usual multi-stage stochastic programs so as to better reflect the potential uncertainty of the probability distributions in stochastic optimization and to cope with the inter-stage dependence in MSP. We prove the weak convergence of Bayesian posterior distributions and, subsequently, the consistency of the objective functions and optimal solutions of multi-stage BEO problems. This gives assurance that the proposed method provides a well-performing approximation to the stochastic programming problem under the true distribution if the sample size is sufficiently large. Based on our theoretical results, we propose two algorithms to solve the single-stage and multi-stage BEO problems, respectively. Our numerical results show that, with both large and small sample sizes, the proposed BEO approach outperforms typical approaches such as BRO and SAA. Therefore, our method is more practical and can produce a high-quality solution in the face of difficult sample acquisition or the huge computational cost of MSPs with many stages or complex distributions.

Since the BEO method is based on a prior distribution, the selection of prior parameters has a certain influence on the posterior distribution when the sample size is small. Therefore, choosing appropriate prior parameters and tree structures can significantly improve the performance of the BEO approach. How to balance the size of the samples or the scenario tree against the stability of the obtained results is an important issue worthy of further investigation from a practical point of view. What’s more, the nature of Bayes’ theorem for data analysis and parameter estimation makes Bayesian approaches especially amenable to robust settings; another interesting issue is therefore the extension of the proposed multi-stage BEO method to multi-stage distributionally robust optimization problems.