Abstract
Constrained partially observable Markov decision processes (CPOMDPs) have been used to model various real-world phenomena. However, they are notoriously difficult to solve to optimality, and there exist only a few approximation methods for obtaining high-quality solutions. In this study, grid-based approximations are used in combination with linear programming (LP) models to generate approximate policies for CPOMDPs. A detailed numerical study is conducted with six CPOMDP problem instances considering both their finite and infinite horizon formulations. The quality of approximation algorithms for solving unconstrained POMDP problems is established through a comparative analysis with exact solution methods. Then, the performance of the LP-based CPOMDP solution approaches for varying budget levels is evaluated. Finally, the flexibility of LP-based approaches is demonstrated by applying deterministic policy constraints, and a detailed investigation into their impact on rewards and CPU run time is provided. For most of the finite horizon problems, deterministic policy constraints are found to have little impact on expected reward, but they introduce a significant increase to CPU run time. For infinite horizon problems, the reverse is observed: deterministic policies tend to yield lower expected total rewards than their stochastic counterparts, but the impact of deterministic constraints on CPU run time is negligible in this case. Overall, these results demonstrate that LP models can effectively generate approximate policies for both finite and infinite horizon problems while providing the flexibility to incorporate various additional constraints into the underlying model.
Data Availability
All the datasets are publicly available and can be obtained from the cited sources.
References
Ahluwalia VS, Steimle LN, Denton BT (2021) Policy-based branch-and-bound for infinite-horizon multi-model Markov decision processes. Computers & Operations Research 126:105108
Alagoz O, Ayvaci MU, Linderoth JT (2015) Optimally solving Markov decision processes with total expected discounted reward function: Linear programming revisited. Computers & Industrial Engineering 87:311–316
Ayer T, Alagoz O, Stout N (2012) A POMDP approach to personalize mammography screening decisions. Operations Research 60(5):1019–1034
Ayvaci M, Alagoz O, Burnside E (2012a) The effect of budgetary restrictions on breast cancer diagnostic decisions. Manufacturing & Service Operations Management 14(4):600–617
Ayvaci MU, Alagoz O, Burnside ES (2012b) The effect of budgetary restrictions on breast cancer diagnostic decisions. Manufacturing & Service Operations Management 14(4):600–617
Bravo RZB, Leiras A, Cyrino Oliveira FL (2019) The use of UAVs in humanitarian relief: an application of POMDP-based methodology for finding victims. Production and Operations Management 28(2):421–440
Caramia M, Dell’Olmo P (2020) Multi-objective optimization. In: Multi-objective Management in Freight Logistics: Increasing Capacity, Service Level, Sustainability, and Safety with Optimization Algorithms, pp 21–51
Cassandra A (1994) Optimal policies for partially observable Markov decision processes. Brown University, Providence, RI
Cassandra A (2003) Simple examples. http://www.pomdp.org/examples/, Accessed 09 Jan 2019
Cassandra AR (1998) Exact and approximate algorithms for partially observable Markov decision processes. Brown University
Cassandra AR, Kaelbling LP, Littman ML (1994) Acting optimally in partially observable stochastic domains. In: AAAI
Celen M, Djurdjanovic D (2020) Integrated maintenance and operations decision making with imperfect degradation state observations. Journal of Manufacturing Systems 55:302–316
Cevik M, Ayer T, Alagoz O, Sprague BL (2018) Analysis of mammography screening policies under resource constraints. Production and Operations Management 27(5):949–972
Deng S, Xiang Z, Zhao P, Taheri J, Gao H, Yin J, Zomaya AY (2020) Dynamical resource allocation in edge for trustable internet-of-things systems: A reinforcement learning method. IEEE Transactions on Industrial Informatics 16(9):6103–6113
Egorov M, Sunberg ZN, Balaban E, Wheeler TA, Gupta JK, Kochenderfer MJ (2017) POMDPs.jl: A framework for sequential decision making under uncertainty. The Journal of Machine Learning Research 18(1):831–835
Erenay F, Alagoz O, Said A (2014) Optimizing colonoscopy screening for colorectal cancer prevention and surveillance. Manufacturing & Service Operations Management 16(3):381–400
Gan K, Scheller-Wolf AA, Tayur SR (2019) Personalized treatment for opioid use disorder. Available at SSRN 3389539
Jiang X, Wang X, Xi H (2017) Finding optimal policies for wideband spectrum sensing based on constrained POMDP framework. IEEE Transactions on Wireless Communications 16(8):5311–5324. https://doi.org/10.1109/TWC.2017.2708124
Kavaklioglu C, Cevik M (2022) Scalable grid-based approximation algorithms for partially observable Markov decision processes. Concurrency and Computation: Practice and Experience 34(5):e6743
Kim D, Lee J, Kim K, Poupart P (2011) Point-based value iteration for constrained POMDPs. In: Twenty-Second International Joint Conference on Artificial Intelligence, pp 1968–1974
Lee J, Kim GH, Poupart P, Kim KE (2018) Monte-Carlo tree search for constrained POMDPs. Advances in Neural Information Processing Systems 31
Lovejoy W (1991a) A Survey of Algorithmic Methods for Partially Observed Markov Decision Processes. Annals of Operations Research 28:47–66
Lovejoy W (1991b) Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1):162–175
Ma X, Xu H, Gao H, Bian M, Hussain W (2022) Real-time virtual machine scheduling in industry iot network: A reinforcement learning method. IEEE Transactions on Industrial Informatics 19(2):2129–2139
Maillart LM (2006) Maintenance policies for systems with condition monitoring and obvious failures. IIE Transactions 38(6):463–475
McLay LA, Mayorga ME (2013) A dispatching model for server-to-customer systems that balances efficiency and equity. Manufacturing & Service Operations Management 15(2):205–220
Monahan G (1982) State of the art - A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28(1):1–16
Pajarinen J, Kyrki V (2017) Robotic manipulation of multiple objects as a POMDP. Artificial Intelligence 247:213–228
Parr R, Russell S (1995) Approximating optimal policies for partially observable stochastic domains. In: IJCAI, vol 95, pp 1088–1094
Pineau J, Gordon G, Thrun S (2006) Anytime Point-Based Approximations for Large POMDPs. JAIR 27:335–380
Poupart P, Malhotra A, Pei P, Kim KE, Goh B, Bowling M (2015) Approximate linear programming for constrained partially observable Markov decision processes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 29
Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons
Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48:67–113
Roijers DM, Whiteson S, Oliehoek FA (2015) Point-based planning for multi-objective POMDPs. In: Twenty-Fourth International Joint Conference on Artificial Intelligence
Rudin W (1987) Real and complex analysis, 3rd edn. McGraw-Hill
Sandikci B (2010) Reduction of a POMDP to an MDP. Wiley Encyclopedia of Operations Research and Management Science
Sandıkçı B, Maillart LM, Schaefer AJ, Alagoz O, Roberts MS (2008) Estimating the patient’s price of privacy in liver transplantation. Operations Research 56(6):1393–1410
Silver D, Veness J (2010) Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems 23
Smith T, Simmons R (2012) Heuristic search value iteration for POMDPs. arXiv:1207.4166
Sondik EJ (1971) The optimal control of partially observable Markov processes. Stanford University
Spaan MT (2012) Partially observable Markov decision processes. In: Reinforcement Learning, Springer, pp 387–414
Steimle LN, Ahluwalia VS, Kamdar C, Denton BT (2021a) Decomposition methods for solving Markov decision processes with multiple models of the parameters. IISE Transactions 53(12):1295–1310
Steimle LN, Kaufman DL, Denton BT (2021b) Multi-model Markov decision processes. IISE Transactions 53(10):1124–1139
Suresh (2005) Sampling from the simplex. Available from http://geomblog.blogspot.com/2005/10/sampling-from-simplex.html Accessed on 26 Feb 2015
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press
Treharne JT, Sox CR (2002) Adaptive inventory control for nonstationary demand and partial information. Management Science 48(5):607–624
Walraven E, Spaan MT (2018) Column generation algorithms for constrained POMDPs. Journal of Artificial Intelligence Research 62:489–533
Wray KH, Czuprynski K (2022) Scalable gradient ascent for controllers in constrained POMDPs. In: 2022 International Conference on Robotics and Automation (ICRA), IEEE, pp 9085–9091
Yılmaz ÖF (2020) An integrated bi-objective u-shaped assembly line balancing and parts feeding problem: optimization model and exact solution method. Annals of Mathematics and Artificial Intelligence pp 1–18
Yılmaz ÖF, et al. (2021) Tactical level strategies for multi-objective disassembly line balancing problem with multi-manned stations: an optimization model and solution approaches. Annals of Operations Research pp 1–51
Young S, Gašić M, Thomson B, Williams JD (2013) POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179
Acknowledgements
This research was enabled in part by the support provided by the Digital Research Alliance of Canada (alliancecan.ca).
Ethics declarations
Conflicts of interest
No potential conflict of interest was reported by the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of upper bound approximation
The proof of the upper bound approximation requires two intermediate theorems:
Theorem 1
The optimal value function, \(V^*\), is piecewise linear and convex.
Proof
See [8].\(\square \)
Theorem 2
For any real-valued convex function \(f\) and discrete random variable \(X\),
\(f(\mathbb {E}(X)) \le \mathbb {E}(f(X)),\)
where \(\mathbb {E}(\cdot )\) gives the expected value.
Proof
See [35, Thm. 3.3].\(\square \)
For completeness, the proof of the upper bound approximation (Theorem 3) is reproduced from [37].
Theorem 3
Let \(\hat{\mathcal {V}}(g)\) give the grid-based upper bound approximation of the true optimal value, \(V(g)\). Then, \(\hat{\mathcal {V}}(g) \ge V(g)\).
Proof
The proof in [37] proceeds by induction over the value iteration algorithm. From (3b) and (3c), the optimal value at time \(t\) for the belief \(b\) is
For the first iteration, where \(t=T\), from (3a), the initial values for the optimal and approximate value functions are \(\hat{\mathcal {V}}_T(g)=V^*_T(g)=\sum _{i\in \mathcal {S}} g_{i}R_{i}\). Assume that Theorem 3 holds for a given \(t\in [1,T]\). The value for the generic \(t-1\) is
Recall that because \(\mathcal {G}\) contains all of the corner points, any \(b\in \mathcal {B}(\mathcal {S})\) can be represented as a convex combination of \(g\in \mathcal {G}\) i.e., \(b\) can be replaced with \(\sum _{k\in \mathcal {K}}\beta _{k}g_k\) for non-negative \(\beta \) satisfying \(\sum _{k\in \mathcal {K}}\beta _{k}=1\). Thus
Theorem 1 states that \(V^*\) satisfies the properties needed to apply Theorem 2 as follows:
Per the induction hypothesis, this is
\(\square \)
Appendix B: Proof of lower bound approximation
The lower bound approximation outlined in [23] is given in Theorem 4.
Theorem 4
Let \(\hat{\varGamma }\) be the set of \(\alpha \)-vectors corresponding to the grid \(\mathcal {G}\) and \(\hat{\mathcal {V}}(b)\) be the approximate expected value generated by \(\hat{\varGamma }\) for the belief point \(b\). Then, \(\hat{\mathcal {V}}(b) \le V(b)\).
Proof
After dropping the time index in (4), the inequality in (28) becomes
As \(\varGamma \) is defined as the set of all \(\alpha \)-vectors, by definition \(\hat{\varGamma }\subseteq \varGamma \). Thus, (29a) can be rewritten as:
Substituting in (4)
It follows that the right-hand side of (29c) is at least \(\hat{\mathcal {V}}(b)\).\(\square \)
Appendix C: Grid construction
The approximation techniques employed in this study require the generation of a set of grid points, which provide a finite set of discrete belief states to approximate the continuous, infinite belief simplex. Specifically, these grid sets are generated using a slightly modified version of the fixed-resolution grid approach.
1.1 C.1: Fixed-resolution grid construction approach
Using the fixed-resolution grid approach, the resulting grid set contains beliefs sampled at equidistant intervals in each dimension of the state space according to a resolution parameter \(\rho \). Specifically, for any grid point \(\varvec{g}\), the \(i\)th component \(g_i\) can be any integer multiple of \(\rho ^{-1}\), subject to the constraint that all components must be non-negative and cannot exceed 1. Thus, each component of \(\varvec{g}\) must belong to the set
As \(\varvec{g}\) represents a belief state, the sum of its components must equal 1. Therefore, for a problem with \(|\mathcal {S}|\) states and a resolution value of \(\rho \), the approximate grid set \(\mathcal {G}\) can be generated by computing the Cartesian product of \(\mathcal {H}(\rho )\) with itself \(|\mathcal {S}|\) times and filtering out all grid points whose components do not sum to 1. That is
where
Fixed-resolution grid set example Consider a problem with two states and a resolution value \(\rho =2\). Following the fixed-resolution grid construction approach, first note that each component of \(\varvec{g}\) must belong to the set
Then, \(\tilde{\mathcal {G}}\) is computed as
Only the grids in \(\tilde{\mathcal {G}}\) whose components sum to 1 are kept, resulting in
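For concreteness, the fixed-resolution construction can be sketched in a few lines of Python (the function name and the floating-point tolerance are illustrative):

```python
from itertools import product

def fixed_resolution_grid(n_states, rho):
    # H(rho): integer multiples of 1/rho in [0, 1]
    h = [k / rho for k in range(rho + 1)]
    # Cartesian product of H(rho) with itself |S| times, keeping only
    # the points whose components sum to 1 (i.e., valid belief states)
    return [g for g in product(h, repeat=n_states)
            if abs(sum(g) - 1.0) < 1e-9]

# Two states with rho = 2 reproduces the worked example above
print(fixed_resolution_grid(2, 2))  # [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
```

With 11 states and \(\rho =2\), this routine also reproduces the grid set size \(|\mathcal {G}_{\rho =2}|=66\) quoted in Appendix C.2.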
1.2 C.2: Modified grid construction approach
Following the fixed-resolution grid approach, the grid set generated using a resolution value of \(\rho \) is denoted \(\mathcal {G}_\rho \), where
Strictly constructing grid sets according to this approach is not feasible for problems with more than a few core states because, for many grid-based solution algorithms, the grid sets become prohibitively large for increasing \(\rho \) values. For example, in the \(4\times 3\) problem, which has 11 states, the grid set sizes are \(|\mathcal {G}_{\rho =1}|=11\), \(|\mathcal {G}_{\rho =2}|=66\), \(|\mathcal {G}_{\rho =3}|=286\), \(|\mathcal {G}_{\rho =4}|=1001\), and \(|\mathcal {G}_{\rho =5}|=3003\). That is, the size of \(\mathcal {G}\) cannot be finely controlled. To address this, one can combine multiple grid sets generated using the fixed-resolution grid approach, taking only as many grid points as desired. For a desired grid set size of \(N\), fixed-resolution grid sets are constructed iteratively until the first one whose size exceeds \(N\) is encountered. Denoting this set as \(\mathcal {G}_\iota \), by definition, \(|\mathcal {G}_{\iota - 1}|\le N\). Grid points can then be sampled from \(\mathcal {G}_\iota \setminus \mathcal {G}_{\iota - 1}\) and added to \(\mathcal {G}_{\iota -1}\) until it contains \(N\) grid points. Because there are many candidate points, they should not be selected arbitrarily, as this may result in a higher grid density in some dimensions. Accordingly, grids are drawn at equidistant intervals from the sorted set difference \(\mathcal {G}_{\iota }\setminus \mathcal {G}_{\iota - 1}\). The overall procedure is summarized in Algorithm 1.
Modified grid-construction example Consider a problem with 3 states, where the desired number of grids, \(N\), is 5. Following Algorithm 1, the final value of \(\iota \) is found to be 2, as \(\mathcal {G}_2\) is the first grid set whose size (6) exceeds \(N\). Subsequently,
The step size \(\eta \) is obtained as
Lastly, the 1st and 2nd elements of \(\mathcal {G}^*\) are added to \(\mathcal {G}\). Thus, the final grid set becomes
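The worked example above can be reproduced with the following sketch of Algorithm 1. It assumes floor division for the step size \(\eta \) and equidistant draws from the lexicographically sorted set difference; both are illustrative choices consistent with the example:

```python
from itertools import product

def fixed_resolution_grid(n_states, rho):
    # Fixed-resolution grid set (Appendix C.1)
    h = [k / rho for k in range(rho + 1)]
    return [g for g in product(h, repeat=n_states)
            if abs(sum(g) - 1.0) < 1e-9]

def modified_grid(n_states, N):
    # Increase the resolution until the grid set first exceeds N points
    iota = 1
    while len(fixed_resolution_grid(n_states, iota)) <= N:
        iota += 1
    base = fixed_resolution_grid(n_states, iota - 1)   # |base| <= N
    extra = sorted(set(fixed_resolution_grid(n_states, iota)) - set(base))
    need = N - len(base)
    if need == 0:
        return base
    # Step size eta: draw the remaining points at equidistant intervals
    # from the sorted set difference (floor division assumed here)
    eta = len(extra) // need
    return base + [extra[k * eta] for k in range(need)]

# Three states, N = 5: iota = 2 and the 1st and 2nd elements of the
# sorted set difference are added, matching the worked example
print(len(modified_grid(3, 5)))  # 5
```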
Appendix D: Calculating the grid transition probabilities
Figure 6 summarizes the process of calculating the grid transition probabilities. Note that these transition probabilities are essential inputs to the linear programming formulations for CPOMDPs discussed in this paper. The steps in this process can be summarized as follows:
-
(1)
For each grid point, action, observation combination, the updated belief state is computed.
-
(2)
The interpolation weights for \(g'\) are computed.
-
(3)
The transition probabilities between grid points \(g\in \mathcal {G}\) are computed.
-
(4)
The approximate value function for time \(t+1\) is computed.
Algorithms 2 and 3 show the calculation of the grid transition probabilities in detail for finite horizon and infinite horizon cases, respectively. Before solving the corresponding LP models for the CPOMDPs, these algorithms are first run to obtain the \(f\) values.
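The first three steps above can be sketched as follows. This is an illustrative simplification: the belief update is the standard Bayesian update, but the interpolation of step (2) is replaced by snapping each updated belief to its nearest grid point, whereas Algorithms 2 and 3 distribute the probability mass over several grid points via interpolation weights:

```python
import numpy as np

def grid_transitions(grids, P, Z, a):
    """f[k, l]: approximate probability of moving from grid point k to
    grid point l under action a.  P[a] is the |S| x |S| transition
    matrix and Z[a] the |S| x |O| observation matrix for action a."""
    K = len(grids)
    f = np.zeros((K, K))
    for k, g in enumerate(grids):
        for o in range(Z[a].shape[1]):
            # Step 1: probability of observing o after acting in g ...
            pr_o = (g @ P[a]) @ Z[a][:, o]
            if pr_o == 0:
                continue
            # ... and the Bayesian belief update for (g, a, o)
            b_new = (g @ P[a]) * Z[a][:, o] / pr_o
            # Steps 2-3: the nearest grid point receives the full
            # observation mass (a simplification of interpolation)
            l = min(range(K), key=lambda j: np.linalg.norm(grids[j] - b_new))
            f[k, l] += pr_o
    return f

# Tiger problem, 'listen' action: the state does not change and the
# observation is correct with probability 0.85
grids = [np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.0, 0.0])]
P = {0: np.eye(2)}
Z = {0: np.array([[0.85, 0.15], [0.15, 0.85]])}
f = grid_transitions(grids, P, Z, 0)
```

Each row of the resulting matrix sums to 1, as required of the \(f\) values fed into the LP models.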
Appendix E: Numerical examples
1.1 Tiger problem parameters
The following examples illustrate the application of ITLP to both the finite horizon and infinite horizon formulations of the tiger problem. Specifically, this problem is defined as follows [8]:
-
States: \(\{\text {Tiger left}~(s_0),~\text {Tiger right}~(s_1)\}\)
-
Actions: \(\{\text {Listen}~(a_0),~\text {Open left door}~(a_1),~\text {Open right door}~(a_2)\}\)
-
Observations: \(\{\text {Tiger left}~(o_0),~\text {Tiger right}~(o_1)\}\)
-
Transition Probabilities: Recall that the probability of transitioning from state \(i\) to state \(j\) at time t after taking action a is \(p_{ij}^{ta}\). In the tiger problem, transition probabilities are stationary: they are constant with respect to time. As a result, transition probabilities can simply be written as \(p_{ij}^{a}\). The transition probabilities are provided in Table 7.
-
Observation Probabilities: As with transition probabilities, the observation probabilities are stationary for the tiger problem. Therefore, the probability of making observation \(o\) at time \(t\) after taking action \(a\) and arriving in state \(j\), \(z_{jo}^{ta}\), can be written without the time index as \(z_{jo}^{a}\). These observation probabilities are given in Table 8.
-
Immediate Rewards: The reward for taking action \(a\) in state \(i\) at time \(t\) is \(w_{ia}^{t}\). The rewards are stationary for the tiger problem, so they can be simply denoted as \(w_{ia}\), and are given in Table 9.
-
Costs: In this paper, the tiger POMDP is extended beyond [8]’s definition to include a cost on each action. Specifically, the cost of each door opening action (\(a_1\), \(a_2\)) is taken as 1, and the cost of listening (\(a_0\)) is taken as 2.
For the purposes of the ensuing examples, the belief simplex for the tiger problem \(\mathcal {B}(\mathcal {S})\) is approximated with the grid set \(\mathcal {G}=\{[0, 1], [0.5, 0.5], [1, 0]\}\). The terminal reward for exiting the decision process in one of these grids is set to the reward earned by taking the unconstrained optimal action in that grid. As the optimal action in [1, 0] is to open the door on the right, the terminal reward is 10. Similarly, the reward for ending in [0, 1] is 10. The optimal action when the belief is [0.5, 0.5] is to listen, so the reward is \(-1\). For simplicity, the initial belief distribution parameter is set as \(\delta =[0, 1, 0]\), meaning that the starting belief state is [0.5, 0.5].
1.2 E2: Finite horizon CPOMDP for the tiger problem
Consider the tiger problem as described above with a horizon of 3 (i.e., 2 decision epochs), a budget of 3, and a discount factor of \(\lambda =1\). For a given belief state \(b\), the expected reward for taking action \(a\) is simply \(\sum _{i\in \mathcal {S}} b_{i}w_{ia}\). For example, the expected reward for opening the door on the left (\(a_1\)) when \(b=[0.5, 0.5]\) is
It directly follows that the coefficient of \(x_{tka}\) in the objective function for the LP is this expected reward (i.e., the coefficient of \(x_{tka}\) is \(\sum _{i\in \mathcal {S}} g_{ki}w_{ia}\)). In the tiger problem formulated above, this value is time independent. Following (10), it turns out that for the selected grid set, the obtained grid transition probabilities (\(f_{k\ell }^{ta}\)) are time independent as well, so the time index can be dropped. These probabilities are given in Table 10.
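These objective coefficients are plain dot products. The reward values below follow the standard tiger problem (listening costs 1, opening the correct door earns 10, opening the tiger's door costs 100) and are assumed here in place of Table 9, which is not reproduced in this excerpt:

```python
import numpy as np

# Rows: states (tiger left s0, tiger right s1); columns: actions
# (listen a0, open left a1, open right a2).  Assumed standard values.
w = np.array([[-1.0, -100.0, 10.0],
              [-1.0, 10.0, -100.0]])

def objective_coefficient(b, a):
    # Coefficient of x_{tka}: expected immediate reward of action a
    # under the belief (grid point) b
    return float(np.dot(b, w[:, a]))

print(objective_coefficient([0.5, 0.5], 1))  # -45.0
```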
In this problem, the objective function given by (13a) is
the constraints from (13b) are
the constraints from (13c) are
the constraints from (13d) are
and the budget constraint in (13e) is
giving the final LP for this finite horizon problem as:
The optimal \(x_{tka}\) values obtained by solving this model are given in Table 11.
1.3 E3: Infinite horizon CPOMDP for the tiger problem
We also provide the sample approximate formulation for the infinite horizon version of the tiger problem. Here, a discount factor \(\lambda =0.9\) is chosen. Given the problem definition, Table 10 also gives the \(f_{k\ell }^{a}\) values for this infinite horizon formulation. This is due to the very small grid size used here for this problem formulation. The objective from (18a) is then
the constraints in Equation (18b) are
and the budget constraint from Equation (18c) is
leading to a final LP for this infinite horizon CPOMDP example as
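This final LP can be assembled and handed to an off-the-shelf solver. The sketch below is illustrative: the function and argument names are hypothetical, the variables \(x_{ka}\) are assumed to be flattened row-major, and the actual \(r\), \(c\), and \(f\) coefficients come from Tables 9 and 10:

```python
import numpy as np
from scipy.optimize import linprog

def solve_infinite_cpomdp(r, c, f, delta, lam, budget):
    """Assemble and solve the infinite horizon occupancy-measure LP.

    r[k, a]: expected reward of action a at grid point k
    c[k, a]: expected cost of action a at grid point k
    f[a][k, l]: grid transition probabilities for action a
    delta[l]: initial belief distribution over grid points
    """
    K, A = r.shape
    obj = -r.flatten()  # linprog minimizes, so negate the rewards
    # Flow-balance constraints, one per grid point l:
    #   sum_a x[l, a] - lam * sum_k sum_a f[a][k, l] * x[k, a] = delta[l]
    A_eq = np.zeros((K, K * A))
    for l in range(K):
        for k in range(K):
            for a in range(A):
                A_eq[l, k * A + a] = float(k == l) - lam * f[a][k, l]
    # Budget constraint: total expected discounted cost <= budget
    A_ub = c.flatten()[None, :]
    res = linprog(obj, A_ub=A_ub, b_ub=[budget],
                  A_eq=A_eq, b_eq=delta, bounds=(0, None))
    return res.x.reshape(K, A), -res.fun
```

On a trivial single-grid, single-action instance with \(\lambda =0.9\), the occupancy mass solves \(x-0.9x=1\), i.e., \(x=10\), which is a quick sanity check on the constraint assembly.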
Appendix F: Analysis with threshold-type policies
This section provides an application of threshold-type policy constraints for the tiger problem, as a use case for problem-specific constraints that can be incorporated into the ITLP algorithm. As illustrated in Figure 7, threshold-type policy constraints help establish decision thresholds over the belief states. The threshold-type policy constraints are imposed based on the stochastic dominance relation between the belief states. That is, \(b^{\ell }\) stochastically dominates \(b^{k}\), denoted as \(b^{\ell } \succ _s b^{k}\), if \(\sum _{i=j}^{|\mathcal {S}|-1} b_i^{\ell } \ge \sum _{i=j}^{|\mathcal {S}|-1} b_i^{k}\) for all \(j \in \{0,1,\ldots , |\mathcal {S}|-1\}\).
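This dominance check amounts to comparing tail sums of the two belief vectors; a minimal sketch (the tolerance is illustrative):

```python
def dominates(b_l, b_k, tol=1e-12):
    # b_l stochastically dominates b_k if every tail sum of b_l is at
    # least the corresponding tail sum of b_k
    return all(sum(b_l[j:]) >= sum(b_k[j:]) - tol
               for j in range(len(b_l)))

# For instance, [0, 1] stochastically dominates [0.1, 0.9]
print(dominates([0.0, 1.0], [0.1, 0.9]))  # True
```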
The threshold-type policy constraints for the action \(a_1 = \texttt {Open Left}\) can be formulated as
where \(\theta _{tka} \in \{0,1\}\) takes value 1 if action \(a\) is selected at decision epoch \(t\), and takes value 0 otherwise (recall that these are the same binary variables that are included to obtain deterministic policies). For instance, if \(g^0 = [0,1]\) and \(g^1 = [0.1, 0.9]\), then \(g^{0} \succ _s g^{1}\). It follows that if the optimal action for \(g^1\) is \(a_1\), then the optimal action for \(g^0\) should also be \(a_1\). This is because, if the POMDP policy prescribes taking action \(a_1 = \texttt {Open Left}\) for a belief state \(g^1\) that indicates a lower likelihood of the tiger occupying state \(s=\texttt {Tiger Right}\) (i.e., \(g_1^1 = 0.9\)), then this policy should prescribe the same action for a belief state \(g^0\) that is even more suitable for taking action \(a_1 = \texttt {Open Left}\) (i.e., because \(g_1^0 = 1.0\)). Note that threshold-type policies can be similarly derived for \(a_2 = \texttt {Open Right}\). Figure 8 provides sample policies obtained by incorporating the threshold-type policy constraints into a finite horizon CPOMDP model with three different budget levels. As expected, the obtained policies are all of threshold type.
Depending on the problem specifications, such threshold-type policy constraints can be formulated and added to the CPOMDP model. However, as their characterization relies on the stochastic ordering of the belief states, not all problems are suitable for imposing threshold-type policy constraints. For instance, in the paint problem, items that are unflawed, unblemished, and painted (i.e., \(s=\texttt {NFL-NBL-PA}\)) are expected to be shipped; however, the ordering of the other three states (i.e., \(s\in \{ \texttt {NFL-NBL-NPA}, \texttt {FL-NBL-PA}, \texttt {FL-BL-NPA} \}\)) cannot be easily established for this action, making the characterization of threshold-type policies non-trivial for this problem. Similar observations can be made for the other three actions (i.e., paint, inspect, and reject).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Helmeczi, R.K., Kavaklioglu, C. & Cevik, M. Linear programming-based solution methods for constrained partially observable Markov decision processes. Appl Intell 53, 21743–21769 (2023). https://doi.org/10.1007/s10489-023-04603-7