Abstract
Constrained partially observable Markov decision processes (CPOMDPs) have been used to model various real-world phenomena. However, they are notoriously difficult to solve to optimality, and there exist only a few approximation methods for obtaining high-quality solutions. In this study, grid-based approximations are used in combination with linear programming (LP) models to generate approximate policies for CPOMDPs. A detailed numerical study is conducted with six CPOMDP problem instances considering both their finite and infinite horizon formulations. The quality of approximation algorithms for solving unconstrained POMDP problems is established through a comparative analysis with exact solution methods. Then, the performance of the LP-based CPOMDP solution approaches for varying budget levels is evaluated. Finally, the flexibility of LP-based approaches is demonstrated by applying deterministic policy constraints, and a detailed investigation into their impact on rewards and CPU run time is provided. For most of the finite horizon problems, deterministic policy constraints are found to have little impact on expected reward, but they introduce a significant increase to CPU run time. For infinite horizon problems, the reverse is observed: deterministic policies tend to yield lower expected total rewards than their stochastic counterparts, but the impact of deterministic constraints on CPU run time is negligible in this case. Overall, these results demonstrate that LP models can effectively generate approximate policies for both finite and infinite horizon problems while providing the flexibility to incorporate various additional constraints into the underlying model.
Data Availability
All the datasets are publicly available and can be obtained from the cited sources.
References
Ahluwalia VS, Steimle LN, Denton BT (2021) Policy-based branch-and-bound for infinite-horizon multi-model Markov decision processes. Computers & Operations Research 126:105108
Alagoz O, Ayvaci MU, Linderoth JT (2015) Optimally solving Markov decision processes with total expected discounted reward function: Linear programming revisited. Computers & Industrial Engineering 87:311–316
Ayer T, Alagoz O, Stout N (2012) A POMDP approach to personalize mammography screening decisions. Operations Research 60(5):1019–1034
Ayvaci M, Alagoz O, Burnside E (2012a) The effect of budgetary restrictions on breast cancer diagnostic decisions. Manufacturing & Service Operations Management 14(4):600–617
Ayvaci MU, Alagoz O, Burnside ES (2012b) The effect of budgetary restrictions on breast cancer diagnostic decisions. Manufacturing & Service Operations Management 14(4):600–617
Bravo RZB, Leiras A, Cyrino Oliveira FL (2019) The use of UAVs in humanitarian relief: an application of POMDP-based methodology for finding victims. Production and Operations Management 28(2):421–440
Caramia M, Dell’Olmo P (2020) Multi-objective optimization. In: Multi-objective Management in Freight Logistics: Increasing Capacity, Service Level, Sustainability, and Safety with Optimization Algorithms, pp 21–51
Cassandra A (1994) Optimal policies for partially observable Markov decision processes. Brown University, Providence, RI
Cassandra A (2003) Simple examples. http://www.pomdp.org/examples/, Accessed 09 Jan 2019
Cassandra AR (1998) Exact and approximate algorithms for partially observable Markov decision processes. Brown University
Cassandra AR, Kaelbling LP, Littman ML (1994) Acting optimally in partially observable stochastic domains. In: AAAI
Celen M, Djurdjanovic D (2020) Integrated maintenance and operations decision making with imperfect degradation state observations. Journal of Manufacturing Systems 55:302–316
Cevik M, Ayer T, Alagoz O, Sprague BL (2018) Analysis of mammography screening policies under resource constraints. Production and Operations Management 27(5):949–972
Deng S, Xiang Z, Zhao P, Taheri J, Gao H, Yin J, Zomaya AY (2020) Dynamical resource allocation in edge for trustable internet-of-things systems: A reinforcement learning method. IEEE Transactions on Industrial Informatics 16(9):6103–6113
Egorov M, Sunberg ZN, Balaban E, Wheeler TA, Gupta JK, Kochenderfer MJ (2017) POMDPs.jl: A framework for sequential decision making under uncertainty. The Journal of Machine Learning Research 18(1):831–835
Erenay F, Alagoz O, Said A (2014) Optimizing colonoscopy screening for colorectal cancer prevention and surveillance. Manufacturing & Service Operations Management 16(3):381–400
Gan K, Scheller-Wolf AA, Tayur SR (2019) Personalized treatment for opioid use disorder. Available at SSRN 3389539
Jiang X, Wang X, Xi H (2017) Finding optimal policies for wideband spectrum sensing based on constrained POMDP framework. IEEE Transactions on Wireless Communications 16(8):5311–5324. https://doi.org/10.1109/TWC.2017.2708124
Kavaklioglu C, Cevik M (2022) Scalable grid-based approximation algorithms for partially observable Markov decision processes. Concurrency and Computation: Practice and Experience 34(5):e6743
Kim D, Lee J, Kim K, Poupart P (2011) Point-based value iteration for constrained POMDPs. In: Twenty-Second International Joint Conference on Artificial Intelligence, pp 1968–1974
Lee J, Kim GH, Poupart P, Kim KE (2018) Monte-Carlo tree search for constrained POMDPs. Advances in Neural Information Processing Systems 31
Lovejoy W (1991a) A Survey of Algorithmic Methods for Partially Observed Markov Decision Processes. Annals of Operations Research 28:47–66
Lovejoy W (1991b) Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1):162–175
Ma X, Xu H, Gao H, Bian M, Hussain W (2022) Real-time virtual machine scheduling in industry iot network: A reinforcement learning method. IEEE Transactions on Industrial Informatics 19(2):2129–2139
Maillart LM (2006) Maintenance policies for systems with condition monitoring and obvious failures. IIE Transactions 38(6):463–475
McLay LA, Mayorga ME (2013) A dispatching model for server-to-customer systems that balances efficiency and equity. Manufacturing & Service Operations Management 15(2):205–220
Monahan G (1982) State of the art - A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28(1):1–16
Pajarinen J, Kyrki V (2017) Robotic manipulation of multiple objects as a POMDP. Artificial Intelligence 247:213–228
Parr R, Russell S (1995) Approximating optimal policies for partially observable stochastic domains. In: IJCAI, vol 95, pp 1088–1094
Pineau J, Gordon G, Thrun S (2006) Anytime Point-Based Approximations for Large POMDPs. JAIR 27:335–380
Poupart P, Malhotra A, Pei P, Kim KE, Goh B, Bowling M (2015) Approximate linear programming for constrained partially observable Markov decision processes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 29
Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons
Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48:67–113
Roijers DM, Whiteson S, Oliehoek FA (2015) Point-based planning for multi-objective POMDPs. In: Twenty-Fourth International Joint Conference on Artificial Intelligence
Rudin W (1987) Real and complex analysis, 3rd edn. McGraw-Hill
Sandikci B (2010) Reduction of a POMDP to an MDP. Wiley Encyclopedia of Operations Research and Management Science
Sandıkçı B, Maillart LM, Schaefer AJ, Alagoz O, Roberts MS (2008) Estimating the patient’s price of privacy in liver transplantation. Operations Research 56(6):1393–1410
Silver D, Veness J (2010) Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems 23
Smith T, Simmons R (2012) Heuristic search value iteration for POMDPs. arXiv:1207.4166
Sondik EJ (1971) The optimal control of partially observable Markov processes. Stanford University
Spaan MT (2012) Partially observable Markov decision processes. In: Reinforcement Learning, Springer, pp 387–414
Steimle LN, Ahluwalia VS, Kamdar C, Denton BT (2021a) Decomposition methods for solving Markov decision processes with multiple models of the parameters. IISE Transactions 53(12):1295–1310
Steimle LN, Kaufman DL, Denton BT (2021b) Multi-model Markov decision processes. IISE Transactions 53(10):1124–1139
Suresh (2005) Sampling from the simplex. Available from http://geomblog.blogspot.com/2005/10/sampling-from-simplex.html Accessed on 26 Feb 2015
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press
Treharne JT, Sox CR (2002) Adaptive inventory control for nonstationary demand and partial information. Management Science 48(5):607–624
Walraven E, Spaan MT (2018) Column generation algorithms for constrained POMDPs. Journal of Artificial Intelligence Research 62:489–533
Wray KH, Czuprynski K (2022) Scalable gradient ascent for controllers in constrained POMDPs. In: 2022 International Conference on Robotics and Automation (ICRA), IEEE, pp 9085–9091
Yılmaz ÖF (2020) An integrated bi-objective u-shaped assembly line balancing and parts feeding problem: optimization model and exact solution method. Annals of Mathematics and Artificial Intelligence pp 1–18
Yılmaz ÖF, et al. (2021) Tactical level strategies for multi-objective disassembly line balancing problem with multi-manned stations: an optimization model and solution approaches. Annals of Operations Research pp 1–51
Young S, Gašić M, Thomson B, Williams JD (2013) POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179
Acknowledgements
This research was enabled in part by the support provided by the Digital Research Alliance of Canada (alliancecan.ca).
Ethics declarations
Conflicts of interest
No potential conflict of interest was reported by the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of upper bound approximation
The proof of the upper bound approximation requires two intermediate theorems:
Theorem 1
The optimal value function, \(V^*\), is piecewise linear and convex.
Proof
See [8].\(\square \)
Theorem 2
For any real-valued convex function \(f\) and discrete random variable \(X\),
\(f(\mathbb {E}(X)) \le \mathbb {E}(f(X)),\)
where \(\mathbb {E}(\cdot )\) gives the expected value.
Proof
See [35, Thm. 3.3].\(\square \)
For completeness, the proof of the upper bound approximation (Theorem 3) is reproduced from [37].
Theorem 3
Let \(\hat{\mathcal {V}}(g)\) give the grid-based upper bound approximation of the true optimal value, \(V(g)\). Then, \(\hat{\mathcal {V}}(g) \ge V(g)\).
Proof
The proof in [37] proceeds by induction over the value iteration algorithm. From (3b) and (3c), the optimal value at time \(t\) for the belief \(b\) is
For the first iteration, where \(t=T\), from (3a), the initial values for the optimal and approximate value functions are \(\hat{\mathcal {V}}_T(g)=V^*_T(g)=\sum _{i\in \mathcal {S}} g_{i}R_{i}\). Assume that Theorem 3 holds for a given \(t\in [1,T]\). The value for the generic \(t-1\) is
Recall that because \(\mathcal {G}\) contains all of the corner points, any \(b\in \mathcal {B}(\mathcal {S})\) can be represented as a convex combination of \(g\in \mathcal {G}\) i.e., \(b\) can be replaced with \(\sum _{k\in \mathcal {K}}\beta _{k}g_k\) for non-negative \(\beta \) satisfying \(\sum _{k\in \mathcal {K}}\beta _{k}=1\). Thus
Theorem 1 states that \(V^*\) satisfies the properties needed to apply Theorem 2 as follows:
Per the induction hypothesis, this is
\(\square \)
Appendix B: Proof of lower bound approximation
The lower bound approximation outlined in [23] is given in Theorem 4.
Theorem 4
Let \(\hat{\varGamma }\) be the set of \(\alpha \)-vectors corresponding to the grid \(\mathcal {G}\) and \(\hat{\mathcal {V}}(b)\) be the approximate expected value generated by \(\hat{\varGamma }\) for the belief point \(b\). Then, \(\hat{\mathcal {V}}(b) \le V(b)\).
Proof
After dropping the time index in (4), the inequality in (28) becomes
As \(\varGamma \) is defined as the set of all \(\alpha \)-vectors, by definition \(\hat{\varGamma }\subseteq \varGamma \). Thus, (29a) can be rewritten as:
Substituting in (4)
It follows that the right-hand side of (29c) is at least \(\hat{\mathcal {V}}(b)\).\(\square \)
Appendix C: Grid construction
The approximation techniques employed in this study require the generation of a set of grid points, which provide a finite set of discrete belief states to approximate the continuous, infinite belief simplex. Specifically, these grid sets are generated using a slightly modified version of the fixed-resolution grid approach.
1.1 C.1: Fixed-resolution grid construction approach
Using the fixed-resolution grid approach, the resulting grid set contains beliefs sampled at equidistant intervals in each dimension of the state space according to a resolution parameter \(\rho \). Specifically, for any grid point \(\varvec{g}\), the \(i\)th component \(g_i\) can be any integer multiple of \(\rho ^{-1}\), subject to the constraint that all components must be non-negative and cannot exceed 1. Thus, each component of \(\varvec{g}\) must belong to the set
As \(\varvec{g}\) represents a belief state, the sum of its components must equal 1. Therefore, for a problem with \(|\mathcal {S}|\) states and a resolution value of \(\rho \), the approximate grid set \(\mathcal {G}\) can be generated by computing the Cartesian product of \(\mathcal {H}(\rho )\) with itself \(|\mathcal {S}|\) times and filtering out all grid points whose components do not sum to 1. That is
where
Fixed-resolution grid set example Consider a problem with two states and a resolution value \(\rho =2\). Following the fixed-resolution grid construction approach, first note that each component of \(\varvec{g}\) must belong to the set
Then, \(\tilde{\mathcal {G}}\) is computed as
Only the grids in \(\tilde{\mathcal {G}}\) whose components sum to 1 are kept, resulting in
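For concreteness, the fixed-resolution construction can be sketched in a few lines of Python (the function name and the floating-point tolerance are illustrative):

```python
from itertools import product

def fixed_resolution_grid(n_states, rho):
    # H(rho): integer multiples of 1/rho in [0, 1]
    h = [k / rho for k in range(rho + 1)]
    # Cartesian product of H(rho) with itself |S| times, keeping only
    # the points whose components sum to 1 (i.e., valid belief states)
    return [g for g in product(h, repeat=n_states)
            if abs(sum(g) - 1.0) < 1e-9]

# Two states with rho = 2 reproduces the worked example above
print(fixed_resolution_grid(2, 2))  # [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
```

With 11 states and \(\rho =2\), this routine also reproduces the grid set size \(|\mathcal {G}_{\rho =2}|=66\) quoted in Appendix C.2.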
1.2 C.2: Modified grid construction approach
Following the fixed-resolution grid approach, the grid set generated using a resolution value of \(\rho \) is denoted \(\mathcal {G}_\rho \), where
Strictly constructing grid sets according to this approach is not feasible for problems with more than a few core states because, for many grid-based solution algorithms, the grid sets become prohibitively large for increasing \(\rho \) values. For example, in the \(4\times 3\) problem, which has 11 states, the grid set sizes are \(|\mathcal {G}_{\rho =1}|=11\), \(|\mathcal {G}_{\rho =2}|=66\), \(|\mathcal {G}_{\rho =3}|=286\), \(|\mathcal {G}_{\rho =4}|=1001\), and \(|\mathcal {G}_{\rho =5}|=3003\). That is, the size of \(\mathcal {G}\) cannot be finely controlled. To address this, one can combine multiple grid sets generated using the fixed-resolution grid approach, taking only as many grid points as desired. For a desired grid set size of \(N\), fixed-resolution grid sets are constructed iteratively until the first one whose size exceeds \(N\) is encountered. Denoting this set as \(\mathcal {G}_\iota \), by definition, \(|\mathcal {G}_{\iota - 1}|\le N\). Grid points can then be sampled from \(\mathcal {G}_\iota \setminus \mathcal {G}_{\iota - 1}\) and added to \(\mathcal {G}_{\iota -1}\) until it contains \(N\) grid points. Because there are many candidate points, they should not be selected arbitrarily, as this may result in a higher grid density in some dimensions. Accordingly, grids are drawn at equidistant intervals from the sorted set difference \(\mathcal {G}_{\iota }\setminus \mathcal {G}_{\iota - 1}\). The overall procedure is summarized in Algorithm 1.
Modified grid-construction example Consider a problem with 3 states, where the desired number of grids, \(N\), is 5. Following Algorithm 1, the final value of \(\iota \) is found to be 2, as \(\mathcal {G}_2\) is the first grid set whose size (6) exceeds \(N\). Subsequently,
The step size \(\eta \) is obtained as
Lastly, the 1st and 2nd elements of \(\mathcal {G}^*\) are added to \(\mathcal {G}\). Thus, the final grid set becomes
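The worked example above can be reproduced with the following sketch of Algorithm 1. It assumes floor division for the step size \(\eta \) and equidistant draws from the lexicographically sorted set difference; both are illustrative choices consistent with the example:

```python
from itertools import product

def fixed_resolution_grid(n_states, rho):
    # Fixed-resolution grid set (Appendix C.1)
    h = [k / rho for k in range(rho + 1)]
    return [g for g in product(h, repeat=n_states)
            if abs(sum(g) - 1.0) < 1e-9]

def modified_grid(n_states, N):
    # Increase the resolution until the grid set first exceeds N points
    iota = 1
    while len(fixed_resolution_grid(n_states, iota)) <= N:
        iota += 1
    base = fixed_resolution_grid(n_states, iota - 1)   # |base| <= N
    extra = sorted(set(fixed_resolution_grid(n_states, iota)) - set(base))
    need = N - len(base)
    if need == 0:
        return base
    # Step size eta: draw the remaining points at equidistant intervals
    # from the sorted set difference (floor division assumed here)
    eta = len(extra) // need
    return base + [extra[k * eta] for k in range(need)]

# Three states, N = 5: iota = 2 and the 1st and 2nd elements of the
# sorted set difference are added, matching the worked example
print(len(modified_grid(3, 5)))  # 5
```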
Appendix D: Calculating the grid transition probabilities
Figure 6 summarizes the process of calculating the grid transition probabilities. Note that these transition probabilities are essential inputs to the linear programming formulations for CPOMDPs discussed in this paper. The steps in this process can be summarized as follows:
-
(1)
For each grid point, action, observation combination, the updated belief state is computed.
-
(2)
The interpolation weights for \(g'\) are computed.
-
(3)
The transition probabilities between grid points \(g\in \mathcal {G}\) are computed.
-
(4)
The approximate value function for time \(t+1\) is computed.
Algorithms 2 and 3 show the calculation of the grid transition probabilities in detail for finite horizon and infinite horizon cases, respectively. Before solving the corresponding LP models for the CPOMDPs, these algorithms are first run to obtain the \(f\) values.
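The first three steps above can be sketched as follows. This is an illustrative simplification: the belief update is the standard Bayesian update, but the interpolation of step (2) is replaced by snapping each updated belief to its nearest grid point, whereas Algorithms 2 and 3 distribute the probability mass over several grid points via interpolation weights:

```python
import numpy as np

def grid_transitions(grids, P, Z, a):
    """f[k, l]: approximate probability of moving from grid point k to
    grid point l under action a.  P[a] is the |S| x |S| transition
    matrix and Z[a] the |S| x |O| observation matrix for action a."""
    K = len(grids)
    f = np.zeros((K, K))
    for k, g in enumerate(grids):
        for o in range(Z[a].shape[1]):
            # Step 1: probability of observing o after acting in g ...
            pr_o = (g @ P[a]) @ Z[a][:, o]
            if pr_o == 0:
                continue
            # ... and the Bayesian belief update for (g, a, o)
            b_new = (g @ P[a]) * Z[a][:, o] / pr_o
            # Steps 2-3: the nearest grid point receives the full
            # observation mass (a simplification of interpolation)
            l = min(range(K), key=lambda j: np.linalg.norm(grids[j] - b_new))
            f[k, l] += pr_o
    return f

# Tiger problem, 'listen' action: the state does not change and the
# observation is correct with probability 0.85
grids = [np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.0, 0.0])]
P = {0: np.eye(2)}
Z = {0: np.array([[0.85, 0.15], [0.15, 0.85]])}
f = grid_transitions(grids, P, Z, 0)
```

Each row of the resulting matrix sums to 1, as required of the \(f\) values fed into the LP models.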
Appendix E: Numerical examples
1.1 Tiger problem parameters
The following examples illustrate the application of ITLP to both the finite horizon and infinite horizon formulations of the tiger problem. Specifically, this problem is defined as follows [8]:
-
States: \(\{\text {Tiger left}~(s_0),~\text {Tiger right}~(s_1)\}\)
-
Actions: \(\{\text {Listen}~(a_0),~\text {Open left door}~(a_1),~\text {Open right door}~(a_2)\}\)
-
Observations: \(\{\text {Tiger left}~(o_0),~\text {Tiger right}~(o_1)\}\)
-
Transition Probabilities: Recall that the probability of transitioning from state \(i\) to state \(j\) at time t after taking action a is \(p_{ij}^{ta}\). In the tiger problem, transition probabilities are stationary: they are constant with respect to time. As a result, transition probabilities can simply be written as \(p_{ij}^{a}\). The transition probabilities are provided in Table 7.
-
Observation Probabilities: As with transition probabilities, the observation probabilities are stationary for the tiger problem. Therefore, the probability of making observation \(o\) at time \(t\) after taking action \(a\) and arriving in state \(j\), \(z_{jo}^{ta}\), can be written without the time index as \(z_{jo}^{a}\). These observation probabilities are given in Table 8.
-
Immediate Rewards: The reward for taking action \(a\) in state \(i\) at time \(t\) is \(w_{ia}^{t}\). The rewards are stationary for the tiger problem, so they can be simply denoted as \(w_{ia}\), and are given in Table 9.
-
Costs: In this paper, the tiger POMDP is extended beyond [8]’s definition to include a cost on each action. Specifically, the cost of each door opening action (\(a_1\), \(a_2\)) is taken as 1, and the cost of listening (\(a_0\)) is taken as 2.
For the purposes of the ensuing examples, the belief simplex for the tiger problem \(\mathcal {B}(\mathcal {S})\) is approximated with the grid set \(\mathcal {G}=\{[0, 1], [0.5, 0.5], [1, 0]\}\). The terminal reward for exiting the decision process in one of these grids is set to the reward earned by taking the unconstrained optimal action in that grid. As the optimal action in [1, 0] is to open the door on the right, the terminal reward is 10. Similarly, the reward for ending in [0, 1] is 10. The optimal action when the belief is [0.5, 0.5] is to listen, so the reward is \(-1\). For simplicity, the initial belief distribution parameter is set as \(\delta =[0, 1, 0]\), meaning that the starting belief state is [0.5, 0.5].
1.2 E2: Finite horizon CPOMDP for the tiger problem
Consider the tiger problem as described above with a horizon of 3 (i.e., 2 decision epochs), a budget of 3, and a discount factor of \(\lambda =1\). For a given belief state \(b\), the expected reward for taking action \(a\) is simply \(\sum _{i\in \mathcal {S}} b_{i}w_{ia}\). For example, the expected reward for opening the door on the left (\(a_1\)) when \(b=[0.5, 0.5]\) is
It directly follows that the coefficient of \(x_{tka}\) in the objective function for the LP is this expected reward (i.e., the coefficient of \(x_{tka}\) is \(\sum _{i\in \mathcal {S}} g_{ki}w_{ia}\)). In the tiger problem formulated above, this value is time independent. Following (10), it turns out that for the selected grid set, the obtained grid transition probabilities (\(f_{k\ell }^{ta}\)) are time independent as well, so the time index can be dropped. These probabilities are given in Table 10.
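These objective coefficients are plain dot products. The reward values below follow the standard tiger problem (listening costs 1, opening the correct door earns 10, opening the tiger's door costs 100) and are assumed here in place of Table 9, which is not reproduced in this excerpt:

```python
import numpy as np

# Rows: states (tiger left s0, tiger right s1); columns: actions
# (listen a0, open left a1, open right a2).  Assumed standard values.
w = np.array([[-1.0, -100.0, 10.0],
              [-1.0, 10.0, -100.0]])

def objective_coefficient(b, a):
    # Coefficient of x_{tka}: expected immediate reward of action a
    # under the belief (grid point) b
    return float(np.dot(b, w[:, a]))

print(objective_coefficient([0.5, 0.5], 1))  # -45.0
```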
In this problem, the objective function given by (13a) is
the constraints from (13b) are
the constraints from (13c) are
the constraints from (13d) are
and the budget constraint in (13e) is
giving the final LP for this finite horizon problem as:
The optimal \(x_{tka}\) values obtained by solving this model are given in Table 11.
1.3 E3: Infinite horizon CPOMDP for the tiger problem
We also provide the sample approximate formulation for the infinite horizon version of the tiger problem. Here, a discount factor \(\lambda =0.9\) is chosen. Given the problem definition, Table 10 also gives the \(f_{k\ell }^{a}\) values for this infinite horizon formulation. This is due to the very small grid size used here for this problem formulation. The objective from (18a) is then
the constraints in Equation (18b) are
and the budget constraint from Equation (18c) is
leading to a final LP for this infinite horizon CPOMDP example as
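This final LP can be assembled and handed to an off-the-shelf solver. The sketch below is illustrative: the function and argument names are hypothetical, the variables \(x_{ka}\) are assumed to be flattened row-major, and the actual \(r\), \(c\), and \(f\) coefficients come from Tables 9 and 10:

```python
import numpy as np
from scipy.optimize import linprog

def solve_infinite_cpomdp(r, c, f, delta, lam, budget):
    """Assemble and solve the infinite horizon occupancy-measure LP.

    r[k, a]: expected reward of action a at grid point k
    c[k, a]: expected cost of action a at grid point k
    f[a][k, l]: grid transition probabilities for action a
    delta[l]: initial belief distribution over grid points
    """
    K, A = r.shape
    obj = -r.flatten()  # linprog minimizes, so negate the rewards
    # Flow-balance constraints, one per grid point l:
    #   sum_a x[l, a] - lam * sum_k sum_a f[a][k, l] * x[k, a] = delta[l]
    A_eq = np.zeros((K, K * A))
    for l in range(K):
        for k in range(K):
            for a in range(A):
                A_eq[l, k * A + a] = float(k == l) - lam * f[a][k, l]
    # Budget constraint: total expected discounted cost <= budget
    A_ub = c.flatten()[None, :]
    res = linprog(obj, A_ub=A_ub, b_ub=[budget],
                  A_eq=A_eq, b_eq=delta, bounds=(0, None))
    return res.x.reshape(K, A), -res.fun
```

On a trivial single-grid, single-action instance with \(\lambda =0.9\), the occupancy mass solves \(x-0.9x=1\), i.e., \(x=10\), which is a quick sanity check on the constraint assembly.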
Appendix F: Analysis with threshold-type policies
This section provides an application of threshold-type policy constraints for the tiger problem, as a use case for problem-specific constraints that can be incorporated into the ITLP algorithm. As illustrated in Figure 7, threshold-type policy constraints help establish decision thresholds over the belief states. The threshold-type policy constraints are imposed based on the stochastic dominance relation between the belief states. That is, \(b^{\ell }\) stochastically dominates \(b^{k}\), denoted as \(b^{\ell } \succ _s b^{k}\), if \(\sum _{i=j}^{|\mathcal {S}|-1} b_i^{\ell } \ge \sum _{i=j}^{|\mathcal {S}|-1} b_i^{k}\) for all \(j \in \{0,1,\ldots , |\mathcal {S}|-1\}\).
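This dominance check amounts to comparing tail sums of the two belief vectors; a minimal sketch (the tolerance is illustrative):

```python
def dominates(b_l, b_k, tol=1e-12):
    # b_l stochastically dominates b_k if every tail sum of b_l is at
    # least the corresponding tail sum of b_k
    return all(sum(b_l[j:]) >= sum(b_k[j:]) - tol
               for j in range(len(b_l)))

# For instance, [0, 1] stochastically dominates [0.1, 0.9]
print(dominates([0.0, 1.0], [0.1, 0.9]))  # True
```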
The threshold-type policy constraints for the action \(a_1 = \texttt {Open Left}\) can be formulated as
where \(\theta _{tka} \in \{0,1\}\) takes value 1 if action \(a\) is selected at decision epoch \(t\), and takes value 0 otherwise (recall that these are the same binary variables that are included to obtain deterministic policies). For instance, if \(g^0 = [0,1]\) and \(g^1 = [0.1, 0.9]\), then \(g^{0} \succ _s g^{1}\). It follows that if the optimal action for \(g^1\) is \(a_1\), then the optimal action for \(g^0\) should also be \(a_1\). This is because, if the POMDP policy prescribes taking action \(a_1 = \texttt {Open Left}\) for a belief state \(g^1\) that indicates a lower likelihood of the tiger occupying state \(s=\texttt {Tiger Right}\) (i.e., \(g_1^1 = 0.9\)), then this policy should prescribe the same action for a belief state \(g^0\) that is even more suitable for taking action \(a_1 = \texttt {Open Left}\) (i.e., because \(g_1^0 = 1.0\)). Note that threshold-type policies can be similarly derived for \(a_2 = \texttt {Open Right}\). Figure 8 provides sample policies obtained by incorporating the threshold-type policy constraints into a finite horizon CPOMDP model with three different budget levels. As expected, the obtained policies are all of threshold type.
Depending on the problem specifications, such threshold-type policy constraints can be formulated and added to the CPOMDP model. However, as their characterization relies on the stochastic ordering of the belief states, not all problems are suitable for imposing threshold-type policy constraints. For instance, in the paint problem, items that are unflawed, unblemished, and painted (i.e., \(s=\texttt {NFL-NBL-PA}\)) are expected to be shipped; however, the ordering of the other three states (i.e., \(s\in \{ \texttt {NFL-NBL-NPA}, \texttt {FL-NBL-PA}, \texttt {FL-BL-NPA} \}\)) cannot be easily established for this action, making the characterization of threshold-type policies non-trivial for this problem. Similar observations can be made for the other three actions (i.e., paint, inspect, and reject).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Helmeczi, R.K., Kavaklioglu, C. & Cevik, M. Linear programming-based solution methods for constrained partially observable Markov decision processes. Appl Intell 53, 21743–21769 (2023). https://doi.org/10.1007/s10489-023-04603-7