Linear programming-based solution methods for constrained partially observable Markov decision processes

Abstract

Constrained partially observable Markov decision processes (CPOMDPs) have been used to model various real-world phenomena. However, they are notoriously difficult to solve to optimality, and there exist only a few approximation methods for obtaining high-quality solutions. In this study, grid-based approximations are used in combination with linear programming (LP) models to generate approximate policies for CPOMDPs. A detailed numerical study is conducted with six CPOMDP problem instances considering both their finite and infinite horizon formulations. The quality of approximation algorithms for solving unconstrained POMDP problems is established through a comparative analysis with exact solution methods. Then, the performance of the LP-based CPOMDP solution approaches for varying budget levels is evaluated. Finally, the flexibility of LP-based approaches is demonstrated by applying deterministic policy constraints, and a detailed investigation into their impact on rewards and CPU run time is provided. For most of the finite horizon problems, deterministic policy constraints are found to have little impact on expected reward, but they introduce a significant increase in CPU run time. For infinite horizon problems, the reverse is observed: deterministic policies tend to yield lower expected total rewards than their stochastic counterparts, but the impact of deterministic constraints on CPU run time is negligible in this case. Overall, these results demonstrate that LP models can effectively generate approximate policies for both finite and infinite horizon problems while providing the flexibility to incorporate various additional constraints into the underlying model.


Data Availability

All datasets are publicly available and can be obtained from the cited sources.

References

1. Ahluwalia VS, Steimle LN, Denton BT (2021) Policy-based branch-and-bound for infinite-horizon multi-model Markov decision processes. Computers & Operations Research 126:105108

2. Alagoz O, Ayvaci MU, Linderoth JT (2015) Optimally solving Markov decision processes with total expected discounted reward function: Linear programming revisited. Computers & Industrial Engineering 87:311–316

3. Ayer T, Alagoz O, Stout N (2012) A POMDP approach to personalize mammography screening decisions. Operations Research 60(5):1019–1034

4. Ayvaci MU, Alagoz O, Burnside ES (2012a) The effect of budgetary restrictions on breast cancer diagnostic decisions. Manufacturing & Service Operations Management 14(4):600–617

5. Ayvaci MU, Alagoz O, Burnside ES (2012b) The effect of budgetary restrictions on breast cancer diagnostic decisions. Manufacturing & Service Operations Management 14(4):600–617

6. Bravo RZB, Leiras A, Cyrino Oliveira FL (2019) The use of UAVs in humanitarian relief: An application of POMDP-based methodology for finding victims. Production and Operations Management 28(2):421–440

7. Caramia M, Dell'Olmo P (2020) Multi-objective optimization. In: Multi-objective Management in Freight Logistics: Increasing Capacity, Service Level, Sustainability, and Safety with Optimization Algorithms. Springer, pp 21–51

8. Cassandra A (1994) Optimal policies for partially observable Markov decision processes. Brown University, Providence, RI

9. Cassandra A (2003) Simple examples. http://www.pomdp.org/examples/, Accessed 09 Jan 2019

10. Cassandra AR (1998) Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University

11. Cassandra AR, Kaelbling LP, Littman ML (1994) Acting optimally in partially observable stochastic domains. In: Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI)

12. Celen M, Djurdjanovic D (2020) Integrated maintenance and operations decision making with imperfect degradation state observations. Journal of Manufacturing Systems 55:302–316

13. Cevik M, Ayer T, Alagoz O, Sprague BL (2018) Analysis of mammography screening policies under resource constraints. Production and Operations Management 27(5):949–972

14. Deng S, Xiang Z, Zhao P, Taheri J, Gao H, Yin J, Zomaya AY (2020) Dynamical resource allocation in edge for trustable internet-of-things systems: A reinforcement learning method. IEEE Transactions on Industrial Informatics 16(9):6103–6113

15. Egorov M, Sunberg ZN, Balaban E, Wheeler TA, Gupta JK, Kochenderfer MJ (2017) POMDPs.jl: A framework for sequential decision making under uncertainty. The Journal of Machine Learning Research 18(1):831–835

16. Erenay F, Alagoz O, Said A (2014) Optimizing colonoscopy screening for colorectal cancer prevention and surveillance. Manufacturing & Service Operations Management 16(3):381–400

17. Gan K, Scheller-Wolf AA, Tayur SR (2019) Personalized treatment for opioid use disorder. Available at SSRN 3389539

18. Jiang X, Wang X, Xi H (2017) Finding optimal policies for wideband spectrum sensing based on constrained POMDP framework. IEEE Transactions on Wireless Communications 16(8):5311–5324. https://doi.org/10.1109/TWC.2017.2708124

19. Kavaklioglu C, Cevik M (2022) Scalable grid-based approximation algorithms for partially observable Markov decision processes. Concurrency and Computation: Practice and Experience 34(5):e6743

20. Kim D, Lee J, Kim K, Poupart P (2011) Point-based value iteration for constrained POMDPs. In: Twenty-Second International Joint Conference on Artificial Intelligence, pp 1968–1974

21. Lee J, Kim GH, Poupart P, Kim KE (2018) Monte-Carlo tree search for constrained POMDPs. In: Advances in Neural Information Processing Systems, vol 31

22. Lovejoy W (1991a) A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research 28:47–66

23. Lovejoy W (1991b) Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1):162–175

24. Ma X, Xu H, Gao H, Bian M, Hussain W (2022) Real-time virtual machine scheduling in industry IoT network: A reinforcement learning method. IEEE Transactions on Industrial Informatics 19(2):2129–2139

25. Maillart LM (2006) Maintenance policies for systems with condition monitoring and obvious failures. IIE Transactions 38(6):463–475

26. McLay LA, Mayorga ME (2013) A dispatching model for server-to-customer systems that balances efficiency and equity. Manufacturing & Service Operations Management 15(2):205–220

27. Monahan G (1982) State of the art - A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28(1):1–16

28. Pajarinen J, Kyrki V (2017) Robotic manipulation of multiple objects as a POMDP. Artificial Intelligence 247:213–228

29. Parr R, Russell S (1995) Approximating optimal policies for partially observable stochastic domains. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), pp 1088–1094

30. Pineau J, Gordon G, Thrun S (2006) Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research 27:335–380

31. Poupart P, Malhotra A, Pei P, Kim KE, Goh B, Bowling M (2015) Approximate linear programming for constrained partially observable Markov decision processes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 29

32. Puterman ML (2014) Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons

33. Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48:67–113

34. Roijers DM, Whiteson S, Oliehoek FA (2015) Point-based planning for multi-objective POMDPs. In: Twenty-Fourth International Joint Conference on Artificial Intelligence

35. Rudin W (1987) Real and complex analysis, 3rd edn. McGraw-Hill

36. Sandikci B (2010) Reduction of a POMDP to an MDP. In: Wiley Encyclopedia of Operations Research and Management Science

37. Sandıkçı B, Maillart LM, Schaefer AJ, Alagoz O, Roberts MS (2008) Estimating the patient's price of privacy in liver transplantation. Operations Research 56(6):1393–1410

38. Silver D, Veness J (2010) Monte-Carlo planning in large POMDPs. In: Advances in Neural Information Processing Systems, vol 23

39. Smith T, Simmons R (2012) Heuristic search value iteration for POMDPs. arXiv:1207.4166

40. Sondik EJ (1971) The optimal control of partially observable Markov processes. PhD thesis, Stanford University

41. Spaan MT (2012) Partially observable Markov decision processes. In: Reinforcement Learning, Springer, pp 387–414

42. Steimle LN, Ahluwalia VS, Kamdar C, Denton BT (2021a) Decomposition methods for solving Markov decision processes with multiple models of the parameters. IISE Transactions 53(12):1295–1310

43. Steimle LN, Kaufman DL, Denton BT (2021b) Multi-model Markov decision processes. IISE Transactions 53(10):1124–1139

44. Suresh (2005) Sampling from the simplex. http://geomblog.blogspot.com/2005/10/sampling-from-simplex.html, Accessed 26 Feb 2015

45. Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press

46. Treharne JT, Sox CR (2002) Adaptive inventory control for nonstationary demand and partial information. Management Science 48(5):607–624

47. Walraven E, Spaan MT (2018) Column generation algorithms for constrained POMDPs. Journal of Artificial Intelligence Research 62:489–533

48. Wray KH, Czuprynski K (2022) Scalable gradient ascent for controllers in constrained POMDPs. In: 2022 International Conference on Robotics and Automation (ICRA), IEEE, pp 9085–9091

49. Yılmaz ÖF (2020) An integrated bi-objective U-shaped assembly line balancing and parts feeding problem: Optimization model and exact solution method. Annals of Mathematics and Artificial Intelligence, pp 1–18

50. Yılmaz ÖF, et al (2021) Tactical level strategies for multi-objective disassembly line balancing problem with multi-manned stations: An optimization model and solution approaches. Annals of Operations Research, pp 1–51

51. Young S, Gašić M, Thomson B, Williams JD (2013) POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179

Acknowledgements

This research was enabled in part by the support provided by the Digital Research Alliance of Canada (alliancecan.ca).

Author information

Correspondence to Mucahit Cevik.

Ethics declarations

Conflicts of interest

No potential conflict of interest was reported by the authors.


Appendices

Appendix A: Proof of upper bound approximation

The proof of the upper bound approximation requires two intermediate theorems:

Theorem 1

The optimal value function, \(V^*\), is piecewise linear and convex.

Proof

See [8].\(\square \)

Theorem 2

For any real-valued convex function \(f\) and discrete random variable \(X\)

$$\begin{aligned} f(\mathbb {E}[X])\le \mathbb {E}[f(X)] \end{aligned}$$
(24)

where \(\mathbb {E}[\cdot ]\) denotes the expectation.

Proof

See [35, Thm. 3.3].\(\square \)

For completeness, the proof of the upper bound approximation (Theorem 3) is reproduced from [37].

Theorem 3

Let \(\hat{\mathcal {V}}(g)\) give the grid-based upper bound approximation of the true optimal value, \(V(g)\). Then,

$$\begin{aligned} \hat{\mathcal {V}}(g)\ge V(g)\qquad \forall g\in \mathcal {G}\end{aligned}$$
(25)

Proof

The proof proceeds by induction over the value iteration recursion [37]. From (3b) and (3c), the optimal value at time \(t\) for the belief \(\varvec{b}\) is

$$\begin{aligned} V_{t}^{*}(\varvec{b}) = \max _{a\in \mathcal {A}}\left\{ \sum _{i\in \mathcal {S}} b_{i} w_{i}^{ta} + \sum _{i\in \mathcal {S}}b_{i} \bigg ( \sum _{o\in \mathcal {O}} \sum _{j\in \mathcal {S}} z_{jo}^{ta} \ p_{ij}^{ta}\ V_{t+1}^*(\varvec{b}') \bigg )\right\} \end{aligned}$$
(26)

For the base case, where \(t=T\), from (3a), the optimal and approximate value functions coincide: \(\hat{\mathcal {V}}_T(g)=V^*_T(g)=\sum _{i\in \mathcal {S}} b_{i}R_{i}\). Now assume that Theorem 3 holds at time \(t\). The value at the preceding epoch \(t-1\) is

$$\begin{aligned} V_{t-1}^{*}(\varvec{b}) = \max _{a\in \mathcal {A}}\Bigg \{\sum _{i\in \mathcal {S}} b_{i} w_{i}^{(t-1)a} + \sum _{i\in \mathcal {S}}b_{i} \bigg ( \sum _{o\in \mathcal {O}} \sum _{j\in \mathcal {S}} z_{jo}^{(t-1)a} \, p_{ij}^{(t-1)a}\, V_{t}^*(\varvec{b}') \bigg )\Bigg \} \end{aligned}$$
(27a)

Recall that because \(\mathcal {G}\) contains all of the corner points, any \(b\in \mathcal {B}(\mathcal {S})\) can be represented as a convex combination of grid points \(g\in \mathcal {G}\); i.e., \(b\) can be replaced with \(\sum _{k\in \mathcal {K}}\beta _{k}g_k\) for non-negative weights \(\beta \) satisfying \(\sum _{k\in \mathcal {K}}\beta _{k}=1\). Thus

$$\begin{aligned} = \max _{a\in \mathcal {A}}\Bigg \{\sum _{i\in \mathcal {S}} b_{i} w_{i}^{(t-1)a} + \sum _{i\in \mathcal {S}}b_{i} \bigg ( \sum _{o\in \mathcal {O}} \sum _{j\in \mathcal {S}} z_{jo}^{(t-1)a} \, p_{ij}^{(t-1)a}\, V_{t}^*\Big (\sum _{k\in \mathcal {K}}\beta _{k}g_k\Big ) \bigg )\Bigg \} \end{aligned}$$
(27b)

By Theorem 1, \(V^*\) is convex, so Theorem 2 can be applied:

$$\begin{aligned} \le \max _{a\in \mathcal {A}}\Bigg \{\sum _{i\in \mathcal {S}} b_{i} w_{i}^{(t-1)a} + \sum _{i\in \mathcal {S}}b_{i} \bigg ( \sum _{o\in \mathcal {O}} \sum _{j\in \mathcal {S}} z_{jo}^{(t-1)a} \, p_{ij}^{(t-1)a} \sum _{k\in \mathcal {K}}\beta _{k}V_{t}^*(g_k) \bigg )\Bigg \} \end{aligned}$$
(27c)

Per the induction hypothesis, this is

$$\begin{aligned} \le \max _{a\in \mathcal {A}}\Bigg \{\sum _{i\in \mathcal {S}} b_{i} w_{i}^{(t-1)a} + \sum _{i\in \mathcal {S}}b_{i} \bigg ( \sum _{o\in \mathcal {O}} \sum _{j\in \mathcal {S}} z_{jo}^{(t-1)a} \, p_{ij}^{(t-1)a} \sum _{k\in \mathcal {K}}\beta _{k}\hat{\mathcal {V}}_{t}(g_k) \bigg )\Bigg \} \end{aligned}$$
(27d)
$$\begin{aligned} = \hat{\mathcal {V}}_{t-1}(g) \end{aligned}$$
(27e)

\(\square \)

Appendix B: Proof of lower bound approximation

The lower bound approximation outlined in [23] is given in Theorem 4.

Theorem 4

Let \(\hat{\varGamma }\) be the set of \(\alpha \)-vectors corresponding to the grid \(\mathcal {G}\) and \(\hat{\mathcal {V}}(b)\) be the approximate expected value generated by \(\hat{\varGamma }\) for the belief point \(b\). Then

$$\begin{aligned} \hat{\mathcal {V}}(b) \le V^*(b),\qquad \forall b\in \mathcal {B}(\mathcal {S})\end{aligned}$$
(28)

Proof

After dropping the time index in (4), the inequality in (28) becomes

$$\begin{aligned} \hat{\mathcal {V}}(b) \le \max _{\alpha \in \varGamma }\{\varvec{b}\cdot \alpha \} \end{aligned}$$
(29a)

As \(\varGamma \) is defined as the set of all \(\alpha \)-vectors, by definition \(\hat{\varGamma }\subseteq \varGamma \). Thus, (29a) can be rewritten as:

$$\begin{aligned} \hat{\mathcal {V}}(b) \le \max \left\{ \max _{\alpha \in \hat{\varGamma }}\{\varvec{b}\cdot \alpha \}, \max _{\alpha \in \varGamma \setminus \hat{\varGamma }}\{\varvec{b}\cdot \alpha \}\right\} \end{aligned}$$
(29b)

Substituting in (4)

$$\begin{aligned} \hat{\mathcal {V}}(b) \le \max \left\{ \hat{\mathcal {V}}(b), \max _{\alpha \in \varGamma \setminus \hat{\varGamma }}\{\varvec{b}\cdot \alpha \}\right\} \end{aligned}$$
(29c)

It follows that the right-hand side of (29c) is at least \(\hat{\mathcal {V}}(b)\), so the inequality holds.\(\square \)

Appendix C: Grid construction

The approximation techniques employed in this study require the generation of a set of grid points, which provide a finite set of discrete belief states to approximate the continuous, infinite belief simplex. Specifically, these grid sets are generated using a slightly modified version of the fixed-resolution grid approach.

1.1 C.1: Fixed-resolution grid construction approach

Using the fixed-resolution grid approach, the resulting grid set contains beliefs sampled at equidistant intervals in each dimension of the state space according to a resolution parameter \(\rho \). Specifically, for any grid point \(\varvec{g}\), the \(i\)th component \(g_i\) can be any integer multiple of \(\rho ^{-1}\), subject to the constraint that all components must be non-negative and cannot exceed 1. Thus, each component of \(\varvec{g}\) must belong to the set

$$\begin{aligned} \mathcal {H}(\rho )=\left\{ 0, \frac{1}{\rho }, \frac{2}{\rho }, \ldots , \frac{\rho -2}{\rho }, \frac{\rho -1}{\rho }, 1\right\} \end{aligned}$$
(30)

As \(\varvec{g}\) represents a belief state, the sum of its components must equal 1. Therefore, for a problem with \(|\mathcal {S}|\) states and a resolution value of \(\rho \), the approximate grid set \(\mathcal {G}\) can be generated by computing the Cartesian product of \(\mathcal {H}(\rho )\) with itself \(|\mathcal {S}|\) times and filtering out all grid points whose components do not sum to 1. That is

$$\begin{aligned} \mathcal {G}=\left\{ g\mid g\in \tilde{\mathcal {G}},~\sum _i g_i =1 \right\} \end{aligned}$$
(31)

where

$$\begin{aligned} \tilde{\mathcal {G}} = \underbrace{\mathcal {H}(\rho )\times \mathcal {H}(\rho ) \times \ldots \times \mathcal {H}(\rho )}_{|\mathcal {S}| \text { times}} \end{aligned}$$
(32)

Fixed-resolution grid set example Consider a problem with two states and a resolution value \(\rho =2\). Following the fixed-resolution grid construction approach, first note that each component of \(\varvec{g}\) must belong to the set

$$\begin{aligned} \mathcal {H}(2)=\left\{ 0, \frac{1}{2}, 1\right\} \end{aligned}$$
(33a)

Then, \(\tilde{\mathcal {G}}\) can be computed as

$$\begin{aligned} \tilde{\mathcal {G}} = \left\{ 0, \tfrac{1}{2}, 1\right\} \times \left\{ 0, \tfrac{1}{2}, 1\right\} \end{aligned}$$
(33b)
$$\begin{aligned} = \left\{ [0, 0], [0, \tfrac{1}{2}], [0, 1], [\tfrac{1}{2}, 0], [\tfrac{1}{2}, \tfrac{1}{2}], [\tfrac{1}{2}, 1], [1, 0], [1, \tfrac{1}{2}], [1, 1]\right\} \end{aligned}$$
(33c)

Only the grids in \(\tilde{\mathcal {G}}\) whose components sum to 1 are kept, resulting in

$$\begin{aligned} \mathcal {G}=\left\{ [0, 1], [\tfrac{1}{2}, \tfrac{1}{2}], [1, 0]\right\} \end{aligned}$$
(33d)
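For concreteness, the construction in (30)–(32) can be sketched in a few lines of Python (our own illustration, not code from the paper); the two-state, \(\rho =2\) example above is reproduced at the bottom:

```python
import itertools
import numpy as np

def fixed_resolution_grid(n_states, rho):
    """Implements (30)-(32): take the |S|-fold Cartesian product of
    H(rho) = {0, 1/rho, ..., 1} and keep only the points summing to 1.
    Brute-force enumeration; practical only for small |S| and rho."""
    H = [i / rho for i in range(rho + 1)]
    return [np.array(g) for g in itertools.product(H, repeat=n_states)
            if abs(sum(g) - 1.0) < 1e-9]

# The two-state example with rho = 2 recovers (33d):
print(fixed_resolution_grid(2, 2))
# [array([0., 1.]), array([0.5, 0.5]), array([1., 0.])]
```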

1.2 C.2: Modified grid construction approach

Following the fixed-resolution grid approach, the grid set generated using a resolution value of \(\rho \) is denoted \(\mathcal {G}_\rho \), where

$$\begin{aligned} |\mathcal {G}_\rho | = \left( {\begin{array}{c}|\mathcal {S}| + \rho - 1\\ |\mathcal {S}| - 1\end{array}}\right) \end{aligned}$$
(34)

Strictly constructing grid sets according to this approach is not feasible for problems with more than a few core states because, for many grid-based solution algorithms, the grid sets become prohibitively large as \(\rho \) increases. For example, in the \(4\times 3\) problem, which has 11 states, the grid set sizes are \(|\mathcal {G}_{\rho =1}|=11\), \(|\mathcal {G}_{\rho =2}|=66\), \(|\mathcal {G}_{\rho =3}|=286\), \(|\mathcal {G}_{\rho =4}|=1001\), and \(|\mathcal {G}_{\rho =5}|=3003\) for resolution values ranging from one to five. That is, the size of \(\mathcal {G}\) cannot be finely controlled. To address this, multiple grid sets generated using the fixed-resolution grid approach can be combined, taking only as many grid points as desired. For a desired grid set size of \(N\), one can iteratively construct fixed-resolution grid sets until the first one whose size exceeds \(N\) is encountered. Denoting this set as \(\mathcal {G}_\iota \), by definition, \(|\mathcal {G}_{\iota - 1}|\le N\). Then, grid points can be sampled from \(\mathcal {G}_\iota \setminus \mathcal {G}_{\iota - 1}\) and added to \(\mathcal {G}_{\iota -1}\) until it contains \(N\) grid points. As there are many grid points to choose from, it is important not to select these points arbitrarily, as doing so may result in a higher grid density in some dimensions. Accordingly, grid points are drawn at equidistant intervals from the sorted set difference \(\mathcal {G}_{\iota }\setminus \mathcal {G}_{\iota - 1}\). The overall procedure is summarized in Algorithm 1 and illustrated by the example and code sketch below.

Algorithm 1 Modified grid set construction

Modified grid-construction example Consider a problem with 3 states, where the desired number of grids, \(N\), is 5. Following Algorithm 1, the final value of \(\iota \) is found to be 2, as \(\mathcal {G}_2\) is the first grid set whose size (6) exceeds \(N\). Subsequently,

$$\begin{aligned} \mathcal {G}= \mathcal {G}_{\iota =1}&= [[0, 0, 1], [0, 1, 0], [1, 0, 0]] \end{aligned}$$
(35a)
$$\begin{aligned} \mathcal {G}^*&=[[0, \tfrac{1}{2}, \tfrac{1}{2}], [\tfrac{1}{2}, 0, \tfrac{1}{2}], [\tfrac{1}{2}, \tfrac{1}{2}, 0]] \end{aligned}$$
(35b)

The step size \(\eta \) is obtained as

$$\begin{aligned} \eta =\lfloor N~/~|\mathcal {G}^*| \rfloor = \lfloor 5~/~3\rfloor = 1 \end{aligned}$$
(35c)

Lastly, the 1st and 2nd elements of \(\mathcal {G}^*\) are added to \(\mathcal {G}\). Thus, the final grid set becomes

$$\begin{aligned} \mathcal {G}=[[0, 0, 1], [0, \tfrac{1}{2}, \tfrac{1}{2}], [0, 1, 0], [\tfrac{1}{2}, 0, \tfrac{1}{2}], [1, 0, 0]] \end{aligned}$$
(35g)
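A compact Python sketch of this procedure is given below, reusing fixed_resolution_grid from the previous sketch. Since Algorithm 1 itself is rendered as an image, the tie-breaking details here (lexicographic sorting of the set difference, slice-based equidistant sampling) are our assumptions, chosen so that the worked example above is reproduced:

```python
def modified_grid(n_states, N):
    """Sketch of Algorithm 1: grow fixed-resolution grid sets until one
    exceeds N points, then top up the previous set with points sampled at
    equidistant intervals from the sorted set difference. Assumes N >= |S|."""
    iota = 1
    while len(fixed_resolution_grid(n_states, iota)) <= N:
        iota += 1
    G = [tuple(g) for g in fixed_resolution_grid(n_states, iota - 1)]
    diff = sorted(set(tuple(g) for g in fixed_resolution_grid(n_states, iota))
                  - set(G))                 # G* = G_iota \ G_{iota-1}, sorted
    eta = max(1, N // len(diff))            # step size, as in (35c)
    G += diff[::eta][:N - len(G)]           # equidistant sample, then merge
    return sorted(G)

# Three states and N = 5 reproduce (35g):
for g in modified_grid(3, 5):
    print(g)
```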

Appendix D: Calculating the grid transition probabilities

Figure 6 summarizes the process of calculating the grid transition probabilities; these probabilities are essential inputs to the linear programming formulations for CPOMDPs discussed in this paper. The steps in this process, sketched in code below, can be summarized as follows:

  1. For each grid point, action, and observation combination, the updated belief state is computed.

  2. The interpolation weights for \(g'\) are computed.

  3. The transition probabilities between grid points \(g\in \mathcal {G}\) are computed.

  4. The approximate value function for time \(t+1\) is computed.

Fig. 6 A flowchart for the grid transition probability calculation process

Algorithms 2 and 3 show the calculation of the grid transition probabilities in detail for finite horizon and infinite horizon cases, respectively. Before solving the corresponding LP models for the CPOMDPs, these algorithms are first run to obtain the \(f\) values.

Algorithm 2 Grid transition probability calculation (finite horizon)
Algorithm 3 Grid transition probability calculation (infinite horizon)
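The following Python sketch (ours, not the paper's) implements steps (1)–(3) for the stationary-parameter case; Algorithm 2 adds a time index for finite horizon problems. The interpolation routine is deliberately left abstract, since the choice of convex-combination weights depends on the grid-based approximation being used; for the small tiger-problem grid of Appendix E it can simply return the exact convex weights over the three grid points.

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """One Bayesian filter step: b'_j is proportional to
    z_{jo}^a * sum_i p_{ij}^a b_i. Returns (b', Pr(o | b, a))."""
    unnorm = Z[a][:, o] * (b @ P[a])   # P[a]: |S|x|S|, Z[a]: |S|x|O|
    pr_o = unnorm.sum()
    return (unnorm / pr_o if pr_o > 0 else b), pr_o

def grid_transition_probs(G, P, Z, interpolate):
    """f[k, a, l] = Pr(next grid point is g_l | current grid point g_k, a).
    `interpolate(b, G)` must return convex-combination weights of b over G."""
    K, A = len(G), len(P)
    f = np.zeros((K, A, K))
    for k, g in enumerate(G):
        for a in range(A):
            for o in range(Z[a].shape[1]):
                b_next, pr_o = belief_update(np.asarray(g), a, o, P, Z)
                if pr_o > 0:
                    f[k, a] += pr_o * np.asarray(interpolate(b_next, G))
    return f
```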

Appendix E: Numerical examples

1.1 E1: Tiger problem parameters

The following examples illustrate the application of ITLP to both the finite horizon and infinite horizon formulations of the tiger problem. Specifically, this problem is defined as follows [8]:

  • States: \(\{\text {Tiger left}~(s_0),~\text {Tiger right}~(s_1)\}\)

  • Actions: \(\{\text {Listen}~(a_0),~\text {Open left door}~(a_1),~\text {Open right door}~(a_2)\}\)

  • Observations: \(\{\text {Tiger left}~(o_0),~\text {Tiger right}~(o_1)\}\)

  • Transition Probabilities: Recall that the probability of transitioning from state \(i\) to state \(j\) at time \(t\) after taking action \(a\) is \(p_{ij}^{ta}\). In the tiger problem, transition probabilities are stationary: they are constant with respect to time. As a result, transition probabilities can simply be written as \(p_{ij}^{a}\). The transition probabilities are provided in Table 7.

  • Observation Probabilities: As with transition probabilities, the observation probabilities are stationary for the tiger problem. Therefore, the probability of making observation \(o\) at time \(t\) after taking action \(a\) and arriving in state \(j\), \(z_{jo}^{ta}\), can be written without the time index as \(z_{jo}^{a}\). These observation probabilities are given in Table 8.

  • Immediate Rewards: The reward for taking action \(a\) in state \(i\) at time \(t\) is \(w_{ia}^{t}\). The rewards are stationary for the tiger problem, so they can be simply denoted as \(w_{ia}\), and are given in Table 9.

  • Costs: In this paper, the tiger POMDP is extended beyond [8]’s definition to include a cost on each action. Specifically, the cost of each door opening action (\(a_1\), \(a_2\)) is taken as 1, and the cost of listening (\(a_0\)) is taken as 2.

Table 7 Tiger problem transition probabilities
Table 8 Tiger problem observation probabilities
Table 9 Tiger problem immediate rewards
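Since Tables 7–9 appear only as images in the published version, the following snippet records the canonical tiger problem parameters from [8] in the notation used above; the values are consistent with the grid transition probabilities in Table 10:

```python
import numpy as np

# Actions: a0 = listen, a1 = open left, a2 = open right;
# states: s0 = tiger left, s1 = tiger right. Listening leaves the state
# unchanged; opening a door resets the problem, so the next state is
# equally likely to be s0 or s1.
P = [np.eye(2),                     # p^{a0} (Table 7): listen
     np.full((2, 2), 0.5),         # p^{a1}: open left
     np.full((2, 2), 0.5)]         # p^{a2}: open right

# Listening reports the tiger's position correctly with probability 0.85;
# after opening a door, both observations are uninformative.
Z = [np.array([[0.85, 0.15],       # z^{a0} (Table 8)
               [0.15, 0.85]]),
     np.full((2, 2), 0.5),
     np.full((2, 2), 0.5)]

# w[i, a] (Table 9): -1 for listening, +10 for the tiger-free door,
# -100 for the door hiding the tiger.
w = np.array([[-1.0, -100, 10],
              [-1.0, 10, -100]])
```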

For the purposes of the ensuing examples, the belief simplex for the tiger problem \(\mathcal {B}(\mathcal {S})\) is approximated with the grid set \(\mathcal {G}=\{[0, 1], [0.5, 0.5], [1, 0]\}\). The terminal reward for exiting the decision process in one of these grids is set to the reward earned by taking the unconstrained optimal action in that grid. As the optimal action in [1, 0] is to open the door on the right, the terminal reward is 10. Similarly, the reward for ending in [0, 1] is 10. The optimal action when the belief is [0.5, 0.5] is to listen, so the reward is \(-1\). For simplicity, the initial belief distribution parameter is set as \(\delta =[0, 1, 0]\), meaning that the starting belief state is [0.5, 0.5].

1.2 E2: Finite horizon CPOMDP for the tiger problem

Consider the tiger problem as described above with a horizon of 3 (i.e., 2 decision epochs), a budget of 3, and a discount factor of \(\lambda =1\). For a given belief state \(b\), the expected reward for taking action \(a\) is simply \(b\cdot w_{a}\), where \(w_{a}=(w_{ia})_{i\in \mathcal {S}}\). For example, the expected reward for opening the door on the left (\(a_1\)) when \(b=[0.5, 0.5]\) is

$$\begin{aligned} b\cdot w_{1} = (0.5~~0.5)\cdot \left( \begin{array}{c}-100 \\ 10\end{array}\right) = -45 \end{aligned}$$
(36)

It directly follows that the coefficient of \(x_{tka}\) in the LP objective function is this expected reward (i.e., the coefficient of \(x_{tka}\) is \(b_k\cdot w_{a}\)). In the tiger problem formulated above, this value is time independent. Following (10), it turns out that for the selected grid set, the obtained grid transition probabilities (\(f_{k\ell }^{ta}\)) are time independent as well, so the time index can be dropped. These probabilities are given in Table 10.

Table 10 Grid transition probabilities (\(f_{k\ell }^{a}\))

In this problem, the objective function given by (13a) is

$$\begin{aligned}&- x_{0,0,0} + 10 x_{0,0,1} - 100 x_{0,0,2} - x_{0,1,0} - 45 x_{0,1,1} - 45 x_{0,1,2} - x_{0,2,0} - 100 x_{0,2,1} + 10 x_{0,2,2} \\&\quad - x_{1,0,0} + 10 x_{1,0,1} - 100 x_{1,0,2} - x_{1,1,0} - 45 x_{1,1,1} - 45 x_{1,1,2} - x_{1,2,0} - 100 x_{1,2,1} + 10 x_{1,2,2} \\&\quad + 10 x_{N0} - x_{N1} + 10 x_{N2} \end{aligned}$$
(37a)

the constraints from (13b) are

$$\begin{aligned} x_{0,0,0} + x_{0,0,1} + x_{0,0,2}&= 0 \end{aligned}$$
(37b)
$$\begin{aligned} x_{0,1,0} + x_{0,1,1} + x_{0,1,2}&= 1 \end{aligned}$$
(37c)
$$\begin{aligned} x_{0,2,0} + x_{0,2,1} + x_{0,2,2}&= 0 \end{aligned}$$
(37d)

the constraints from (13c) are

$$\begin{aligned} - x_{0,0,0} - 0.35 x_{0,1,0} + x_{1,0,0} + x_{1,0,1} + x_{1,0,2} = 0 \end{aligned}$$
(37e)
$$\begin{aligned} - x_{0,0,1} - x_{0,0,2} - 0.3 x_{0,1,0} - x_{0,1,1} - x_{0,1,2} - x_{0,2,1} - x_{0,2,2} + x_{1,1,0} + x_{1,1,1} + x_{1,1,2} = 0 \end{aligned}$$
(37f)
$$\begin{aligned} - 0.35 x_{0,1,0} - x_{0,2,0} + x_{1,2,0} + x_{1,2,1} + x_{1,2,2} = 0 \end{aligned}$$
(37g)

the constraints from (13d) are

$$\begin{aligned} - x_{1,0,0} - 0.35 x_{1,1,0} + x_{N0} = 0 \end{aligned}$$
(37h)
$$\begin{aligned} - x_{1,0,1} - x_{1,0,2} - 0.3 x_{1,1,0} - x_{1,1,1} - x_{1,1,2} - x_{1,2,1} - x_{1,2,2} + x_{N1} = 0 \end{aligned}$$
(37i)
$$\begin{aligned} - 0.35 x_{1,1,0} - x_{1,2,0} + x_{N2} = 0 \end{aligned}$$
(37j)

and the budget constraint in (13e) is

$$\begin{aligned} \begin{aligned} 2 x_{0,0,0}&+ x_{0,0,1} + x_{0,0,2} + 2 x_{0,1,0} + x_{0,1,1} + x_{0,1,2} + 2 x_{0,2,0} \\ {}&+ x_{0,2,1} + x_{0,2,2} + 2 x_{1,0,0} + x_{1,0,1} + x_{1,0,2} + 2 x_{1,1,0} \\ {}&+ x_{1,1,1} + x_{1,1,2} + 2 x_{1,2,0} + x_{1,2,1} + x_{1,2,2} \le 3 \end{aligned} \end{aligned}$$
(37k)

giving the final LP for this finite horizon problem as:

$$\begin{aligned} \max&\quad (37\textrm{a}) \end{aligned}$$
(38a)
$$\begin{aligned} \mathrm {s.t.}&\quad (37\textrm{b})-(37\textrm{k})\end{aligned}$$
(38b)
$$\begin{aligned}&\quad (13\textrm{e}) \end{aligned}$$
(38c)

The optimal \(x_{tka}\) values obtained by solving this model are given in Table 11.

Table 11 Finite horizon optimal \(x_{tka}\)
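For illustration, the LP (38a)–(38c) can be assembled and solved programmatically with any LP solver; the Python sketch below (our illustration, not the paper's implementation) uses scipy.optimize.linprog, with the rewards, costs, and Table 10 transition probabilities hard-coded. Solving it should reproduce the optimal \(x_{tka}\) values reported in Table 11.

```python
import numpy as np
from scipy.optimize import linprog

# Grid set G = {[0,1], [0.5,0.5], [1,0]} (indices k = 0, 1, 2);
# actions a0 = listen, a1 = open left, a2 = open right.
r = np.array([[-1.0, 10, -100],    # expected immediate reward g_k . w_a
              [-1.0, -45, -45],
              [-1.0, -100, 10]])
r_term = np.array([10.0, -1, 10])  # terminal rewards at the three grid points
cost = np.array([2.0, 1, 1])       # action costs; budget is 3
delta = np.array([0.0, 1, 0])      # initial belief is grid point 1 = [0.5, 0.5]
T = 2                              # decision epochs (horizon 3)

f = np.zeros((3, 3, 3))            # f[k, a, l] from Table 10
f[:, 1, 1] = f[:, 2, 1] = 1        # opening either door resets belief to [0.5, 0.5]
f[0, 0, 0] = f[2, 0, 2] = 1        # listening at a corner belief stays there
f[1, 0, :] = [0.35, 0.3, 0.35]     # listening at [0.5, 0.5]

n = 9 * T + 3                      # x[t, k, a] variables plus terminal x_N[l]
idx = lambda t, k, a: 9 * t + 3 * k + a

c = np.zeros(n)                    # objective (37a)
for t in range(T):
    for k in range(3):
        c[idx(t, k, 0):idx(t, k, 0) + 3] = r[k]
c[9 * T:] = r_term

A_eq, b_eq = [], []
for k in range(3):                 # (37b)-(37d): initial belief distribution
    row = np.zeros(n)
    row[idx(0, k, 0):idx(0, k, 0) + 3] = 1
    A_eq.append(row); b_eq.append(delta[k])
for t in range(1, T):              # (37e)-(37g): flow balance between epochs
    for l in range(3):
        row = np.zeros(n)
        row[idx(t, l, 0):idx(t, l, 0) + 3] = 1
        for k in range(3):
            for a in range(3):
                row[idx(t - 1, k, a)] -= f[k, a, l]
        A_eq.append(row); b_eq.append(0)
for l in range(3):                 # (37h)-(37j): terminal flow balance
    row = np.zeros(n)
    row[9 * T + l] = 1
    for k in range(3):
        for a in range(3):
            row[idx(T - 1, k, a)] -= f[k, a, l]
    A_eq.append(row); b_eq.append(0)

budget_row = np.concatenate([np.tile(cost, 3 * T), np.zeros(3)])  # (37k)

# linprog minimizes and enforces x >= 0 by default, so negate the objective.
res = linprog(-c, A_ub=[budget_row], b_ub=[3], A_eq=A_eq, b_eq=b_eq)
print("expected total reward:", -res.fun)
print("x:", res.x.round(4))
```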

1.3 E3: Infinite horizon CPOMDP for the tiger problem

We also provide a sample approximate formulation for the infinite horizon version of the tiger problem. Here, a discount factor of \(\lambda =0.9\) is chosen. Given the problem definition, Table 10 also gives the \(f_{k\ell }^{a}\) values for this infinite horizon formulation; this is due to the very small grid set used for this problem. The objective from (18a) is then

$$\begin{aligned} - x_{0, 0} + 10 x_{0, 1} - 100 x_{0, 2} - x_{1, 0} - 45 x_{1, 1} - 45 x_{1, 2} - x_{2, 0} - 100 x_{2, 1} + 10 x_{2, 2} \end{aligned}$$
(39a)

the constraints in Equation (18b) are

$$\begin{aligned} 0.1 x_{0, 0} + x_{0, 1} + x_{0, 2} - 0.315 x_{1, 0} = 0 \end{aligned}$$
(39b)
$$\begin{aligned} - 0.9 x_{0, 1} - 0.9 x_{0, 2} + 0.73 x_{1, 0} + 0.1 x_{1, 1} + 0.1 x_{1, 2} - 0.9 x_{2, 1} - 0.9 x_{2, 2} = 1 \end{aligned}$$
(39c)
$$\begin{aligned} - 0.315 x_{1, 0} + 0.1 x_{2, 0} + x_{2, 1} + x_{2, 2} = 0 \end{aligned}$$
(39d)

and the budget constraint from Equation (18c) is

$$\begin{aligned} 2 x_{0, 0} + x_{0, 1} + x_{0, 2} + 2 x_{1, 0} + x_{1, 1} + x_{1, 2} + 2 x_{2, 0} + x_{2, 1} + x_{2, 2} \le 11.5 \end{aligned}$$
(39e)

leading to a final LP for this infinite horizon CPOMDP example as

$$\begin{aligned} \max&\quad (39\text {a}) \end{aligned}$$
(40)
$$\begin{aligned} \mathrm {s.t.}&\quad (39\text {b})-(39\text {e}) \end{aligned}$$
(41)
$$\begin{aligned}&\quad (18\text {d}) \end{aligned}$$
(42)
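The corresponding sketch for the infinite horizon LP (40)–(42), again using scipy.optimize.linprog as an assumed solver, builds the occupancy-balance matrix directly from the \(f\) values and \(\lambda \); the remaining constraints of the main-text formulation beyond nonnegativity (which linprog enforces by default) are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

lam, budget = 0.9, 11.5
r = np.array([[-1.0, 10, -100], [-1.0, -45, -45], [-1.0, -100, 10]])
cost = np.array([2.0, 1, 1])
delta = np.array([0.0, 1, 0])
f = np.zeros((3, 3, 3))                 # Table 10, as in the finite horizon sketch
f[:, 1, 1] = f[:, 2, 1] = 1
f[0, 0, 0] = f[2, 0, 2] = 1
f[1, 0, :] = [0.35, 0.3, 0.35]

c = r.flatten()                         # (39a): objective coefficient of x[k, a]
A_eq = np.zeros((3, 9))                 # (39b)-(39d): occupancy balance per grid l
for l in range(3):
    for k in range(3):
        for a in range(3):
            A_eq[l, 3 * k + a] = (k == l) - lam * f[k, a, l]

res = linprog(-c, A_ub=[np.tile(cost, 3)], b_ub=[budget],  # (39e): budget
              A_eq=A_eq, b_eq=delta)
print("expected total discounted reward:", -res.fun)
print("x:", res.x.round(4))
```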
Fig. 7 Tiger problem threshold-type policy illustration

Fig. 8 Tiger problem threshold-type policy samples with \(|\mathcal {G}|=250\)

Appendix F: Analysis with threshold-type policies

This section applies threshold-type policy constraints to the tiger problem, giving a use case for problem-specific constraints that can be incorporated into the ITLP algorithm. As illustrated in Figure 7, threshold-type policy constraints establish decision thresholds over the belief states. The constraints are imposed based on the stochastic dominance relation between belief states: \(b^{\ell }\) stochastically dominates \(b^{k}\), denoted \(b^{\ell } \succ _s b^{k}\), if \(\sum _{i\ge j} b_i^{\ell } \ge \sum _{i\ge j} b_i^{k}\) for all \(j \in \{0,1,\hdots , |\mathcal {S}|-1\}\).

The threshold-type policy constraints for the action \(a_1 = \texttt {Open Left}\) can be formulated as

$$\begin{aligned} \theta _{tk1} \le \theta _{t\ell 1}, \quad \forall g^{\ell } \succ _s g^{k}, \forall t \end{aligned}$$
(43)

where \(\theta _{tka} \in \{0,1\}\) takes value 1 if action \(a\) is selected at grid point \(k\) at decision epoch \(t\), and 0 otherwise (recall that these are the same binary variables that are included to obtain deterministic policies). For instance, if \(g^0 = [0,1]\) and \(g^1 = [0.1, 0.9]\), then \(g^{0} \succ _s g^{1}\). It follows that if the optimal action for \(g^1\) is \(a_1\), then the optimal action for \(g^0\) should also be \(a_1\). This is because, if the POMDP policy prescribes taking action \(a_1 = \texttt {Open Left}\) for a belief state \(g^1\) that indicates a lower likelihood of the tiger occupying state \(s=\texttt {Tiger Right}\) (i.e., \(g_1^1 = 0.9\)), then this policy should prescribe the same action for a belief state \(g^0\) that is even more suitable for taking action \(a_1 = \texttt {Open Left}\) (i.e., because \(g_1^0 = 1.0\)). Note that threshold-type policies can be similarly derived for \(a_2 = \texttt {Open Right}\). Figure 8 provides sample policies obtained by incorporating the threshold-type policy constraints into a finite horizon CPOMDP model with three different budget levels. As expected, the obtained policies are all of threshold type. A sketch of the constraint-generation step follows.
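As an illustration of how the dominated pairs for which (43) must be imposed can be enumerated programmatically (our sketch, not the paper's code), consider:

```python
import numpy as np

def dominates(g_l, g_k):
    """First-order stochastic dominance over the ordered state set:
    every upper-tail sum of g_l is at least that of g_k."""
    gl, gk = np.asarray(g_l), np.asarray(g_k)
    return all(gl[j:].sum() >= gk[j:].sum() - 1e-12 for j in range(len(gl)))

def threshold_constraint_pairs(G):
    """Pairs (k, l) for which constraint (43), theta[t,k,1] <= theta[t,l,1],
    must be added to the model for every decision epoch t."""
    return [(k, l) for k, gk in enumerate(G) for l, gl in enumerate(G)
            if k != l and dominates(gl, gk)]

print(threshold_constraint_pairs([(0, 1), (0.1, 0.9), (0.5, 0.5)]))
# [(1, 0), (2, 0), (2, 1)]: [0,1] dominates both others; [0.1,0.9]
# dominates [0.5,0.5], matching the example in the text.
```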

Depending on the problem specifications, such threshold-type policy constraints can be formulated and added to the CPOMDP model. However, as their characterization relies on a stochastic ordering of the belief states, not all problems are suitable for imposing threshold-type policy constraints. For instance, in the paint problem, the natural action is to ship items that are unflawed, unblemished, and painted (i.e., \(s=\texttt {NFL-NBL-PA}\)); however, the ordering of the other three states (i.e., \(s\in \{ \texttt {NFL-NBL-NPA}, \texttt {FL-NBL-PA}, \texttt {FL-BL-NPA} \}\)) cannot be easily established for this action, making the characterization of threshold-type policies non-trivial for this problem. Similar observations can be made for the other three actions (i.e., paint, inspect, and reject).


Cite this article

Helmeczi, R.K., Kavaklioglu, C. & Cevik, M. Linear programming-based solution methods for constrained partially observable Markov decision processes. Appl Intell 53, 21743–21769 (2023). https://doi.org/10.1007/s10489-023-04603-7
