
Multi-view reinforcement learning for sequential decision-making with insufficient state information

Original Article | International Journal of Machine Learning and Cybernetics

Abstract

Most reinforcement learning methods describe sequential decision-making as a Markov decision process, in which the effect of an action depends only on the current state. This is reasonable only when the state is correctly defined and the state information is sufficiently observed, so the learning efficiency of reinforcement learning methods based on the Markov decision process is limited when the state information is insufficient. The partially observable Markov decision process and the history-based decision process have been proposed to describe sequential decision-making with insufficient state information. However, both formulations tend to overlook important information in the currently observed state, so the learning efficiency of reinforcement learning methods based on these two processes is also limited when the state information is insufficient. In this paper, we propose a multi-view reinforcement learning method to address this problem. The motivation is that the interaction between the agent and its environment should be considered from the views of history, present, and future to compensate for insufficient state information. Based on these views, we construct a multi-view decision process to describe sequential decision-making with insufficient state information. A multi-view reinforcement learning method is then obtained by combining the multi-view decision process with the actor-critic framework. In the proposed method, multi-view clustering is performed so that each type of sample can be sufficiently exploited. Experiments show that the proposed method is more effective than the compared state-of-the-art methods. The source code can be downloaded from https://github.com/jamieliuestc/MVRL.
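To make the idea above concrete, the sketch below (Python, assuming numpy and scikit-learn) stores replay transitions as three views: a window of past observations (history view), the current observation (present view), and the next observation (future view), and clusters the stored samples before drawing a minibatch so that every cluster contributes. This is a minimal illustrative sketch, not the authors' implementation (see the repository linked above); the buffer design, the plain k-means stand-in for multi-view clustering, and all names and hyperparameters are assumptions made for illustration.

# Minimal sketch of a clustered, multi-view replay buffer (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

class MultiViewReplayBuffer:
    def __init__(self, n_clusters=4):
        self.n_clusters = n_clusters
        self.transitions = []  # (history, obs, action, reward, next_obs)

    def add(self, history, obs, action, reward, next_obs):
        # history: stacked past observations (history view); obs: present view;
        # next_obs: future view. All are fixed-size numpy arrays.
        self.transitions.append((history, obs, action, reward, next_obs))

    def _features(self):
        # Concatenate the three views into one vector per transition,
        # a simple stand-in for a learned multi-view representation.
        return np.stack([np.concatenate([h.ravel(), o.ravel(), o2.ravel()])
                         for h, o, _, _, o2 in self.transitions])

    def sample(self, batch_size, rng=np.random):
        # Cluster the stored transitions and draw roughly the same number of
        # samples from every cluster, so each type of sample is exploited.
        k = min(self.n_clusters, len(self.transitions))
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(self._features())
        per_cluster = max(1, batch_size // k)
        picked = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            picked.extend(rng.choice(members, per_cluster, replace=True))
        return [self.transitions[i] for i in picked]

In practice the cluster assignments would be refreshed only periodically rather than on every call to sample(), since re-clustering the whole buffer dominates the cost of this sketch.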


Data availability

All data generated or analyzed during this study are available at https://github.com/jamieliuestc/MVRL.


Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants No. 61772120 and No. 62276065.

Author information


Corresponding author

Correspondence to Shiping Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proof of Theorem 1

Proof

$$\begin{aligned} \Vert B^{\pi }Q_{1}(h,o,a) - B^{\pi }Q_{2}(h,o,a) \Vert _{\infty }&= \max _{h,o,a} \left| \mathbb {E}_{a'\sim \pi ,o'\sim p^{\pi }}\left[ r(h,o,a) + \gamma Q_{1}(h',o',a') - r(h,o,a) - \gamma Q_{2}(h',o',a')\right] \right| \\&= \max _{h,o,a} \left| \sum _{o'\in \mathcal {O}}p(o'|h,o,a)\,\gamma \sum _{a'\in \mathcal {A}}\pi (a'|h',o') \left[ Q_{1}(h',o',a')-Q_{2}(h',o',a')\right] \right| \\&\le \max _{h,o,a} \sum _{o'\in \mathcal {O}}p(o'|h,o,a)\,\gamma \sum _{a'\in \mathcal {A}}\pi (a'|h',o') \left| Q_{1}(h',o',a')-Q_{2}(h',o',a')\right| \\&\le \max _{h,o,a} \sum _{o'\in \mathcal {O}}p(o'|h,o,a)\,\gamma \max _{\widetilde{h},\widetilde{o},\widetilde{a}} \left| Q_{1}(\widetilde{h},\widetilde{o},\widetilde{a})-Q_{2}(\widetilde{h},\widetilde{o},\widetilde{a})\right| \\&\le \gamma \max _{h,o,a}\left| Q_{1}(h,o,a)-Q_{2}(h,o,a)\right| \\&= \gamma \Vert Q_{1}(h,o,a) - Q_{2}(h,o,a) \Vert _{\infty }. \end{aligned}$$

Thus \(B^{\pi }\) in (14) is a contraction mapping in the sup-norm when the MvDP is finite. By the contraction mapping principle, \(B^{\pi }\) has a unique fixed point satisfying \(Q(h,o,a)=\mathbb {E}_{a'\sim \pi ,o'\sim p^{\pi }}[r(h,o,a)+\gamma Q(h',o',a')]\). By the Bellman equation (13), this unique fixed point is \(Q^{\pi }\). \(\square\)
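The bound above can also be checked numerically on a small finite problem. The following sketch (Python with numpy) is only a sanity check under simplifying assumptions, not code from the paper: the pair \((h,o)\) is folded into a single state index, the policy is a fixed random stochastic policy, and all names are made up for illustration.

# Numerical sanity check of the gamma-contraction of the evaluation operator.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition probabilities
r = rng.standard_normal((nS, nA))               # r(s, a): reward
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi(a | s): fixed stochastic policy

def B_pi(Q):
    # (B^pi Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) sum_{a'} pi(a'|s') Q(s', a')
    V = (pi * Q).sum(axis=1)                    # expected Q under pi, per state
    return r + gamma * P @ V

Q1, Q2 = rng.standard_normal((2, nS, nA))
lhs = np.abs(B_pi(Q1) - B_pi(Q2)).max()         # ||B^pi Q1 - B^pi Q2||_inf
rhs = gamma * np.abs(Q1 - Q2).max()             # gamma * ||Q1 - Q2||_inf
print(lhs <= rhs + 1e-12)                       # True: the contraction bound holds

Replacing the expectation under pi in B_pi with a maximum over actions gives the analogous check for the optimality operator of Theorem 2.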

1.2 Proof of Theorem 2

Proof

$$\begin{aligned} \Vert B Q_{1}(h,o,a) - B Q_{2}(h,o,a) \Vert _{\infty }&= \max _{h,o,a} \left| \max _{\pi } \mathbb {E}_{a'\sim \pi ,o'\sim p^{\pi }}\left[ r(h,o,a) + \gamma Q_{1}(h',o',a')\right] - \max _{\pi } \mathbb {E}_{a'\sim \pi ,o'\sim p^{\pi }}\left[ r(h,o,a) + \gamma Q_{2}(h',o',a')\right] \right| \\&\le \max _{h,o,a} \left| \max _{\pi } \sum _{o'\in \mathcal {O}}p(o'|h,o,a)\,\gamma \sum _{a'\in \mathcal {A}}\pi (a'|h',o') \left| Q_{1}(h',o',a')-Q_{2}(h',o',a')\right| \right| \\&\le \max _{h,o,a,\pi } \sum _{o'\in \mathcal {O}}p(o'|h,o,a)\,\gamma \sum _{a'\in \mathcal {A}}\pi (a'|h',o') \left| Q_{1}(h',o',a')-Q_{2}(h',o',a')\right| \\&\le \max _{h,o,a} \sum _{o'\in \mathcal {O}}p(o'|h,o,a)\,\gamma \max _{\widetilde{h},\widetilde{o},\widetilde{a}} \left| Q_{1}(\widetilde{h},\widetilde{o},\widetilde{a})-Q_{2}(\widetilde{h},\widetilde{o},\widetilde{a})\right| \\&\le \gamma \max _{h,o,a}\left| Q_{1}(h,o,a)-Q_{2}(h,o,a)\right| \\&= \gamma \Vert Q_{1}(h,o,a) - Q_{2}(h,o,a) \Vert _{\infty }. \end{aligned}$$

Therefore, \(B\) in (15) is a contraction mapping in the sup-norm when the MvDP is finite. By the contraction mapping principle, \(B\) has a unique fixed point satisfying \(Q(h,o,a)=\max _{\pi }\mathbb {E}_{a'\sim \pi ,o'\sim p^{\pi }}[r(h,o,a)+\gamma Q(h',o',a')]\). By the Bellman equation (13), this unique fixed point is \(Q^{\pi ^*}\). \(\square\)
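For completeness, the standard consequence of the contraction property used in both proofs can be made explicit; this is a routine restatement of the Banach fixed-point argument, not an additional result of the paper. For any bounded \(Q\) and any \(k\ge 1\),

$$\begin{aligned} \Vert B^{k}Q - Q^{\pi ^*} \Vert _{\infty } = \Vert B(B^{k-1}Q) - BQ^{\pi ^*} \Vert _{\infty } \le \gamma \Vert B^{k-1}Q - Q^{\pi ^*} \Vert _{\infty } \le \cdots \le \gamma ^{k}\Vert Q - Q^{\pi ^*} \Vert _{\infty }, \end{aligned}$$

which tends to zero as \(k\rightarrow \infty\) because \(0\le \gamma <1\). Hence repeated application of \(B\) converges geometrically to \(Q^{\pi ^*}\), and the same argument with \(B^{\pi }\) in place of \(B\) gives geometric convergence to \(Q^{\pi }\).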

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, M., Zhu, W. & Wang, S. Multi-view reinforcement learning for sequential decision-making with insufficient state information. Int. J. Mach. Learn. & Cyber. 15, 1533–1552 (2024). https://doi.org/10.1007/s13042-023-01981-9

