Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14172)

Abstract

Policy-based algorithms are among the most widely adopted techniques in model-free RL, thanks to their strong theoretical grounding and good properties in continuous action spaces. Unfortunately, these methods require precise and problem-specific hyperparameter tuning to achieve good performance, and they tend to struggle when asked to accomplish a series of heterogeneous tasks. In particular, the selection of the step size has a crucial impact on their ability to learn a highly performing policy, affecting the speed and stability of the training process and often being the main culprit for poor results. In this paper, we tackle these issues with a Meta Reinforcement Learning approach, by introducing a new formulation, known as meta-MDP, that can be used to solve any hyperparameter selection problem in RL with contextual processes. After providing a theoretical Lipschitz bound on the difference in performance across tasks, we adopt the proposed framework to train a batch RL algorithm to dynamically recommend the most adequate step size for different policies and tasks. Finally, we present an experimental campaign showing the advantages of selecting an adaptive learning rate in heterogeneous environments.
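
To make the meta-MDP formulation concrete, below is a minimal, illustrative Python sketch, not the paper's implementation. It replaces the contextual MDP and the policy-gradient estimator with a toy quadratic surrogate J(theta; omega) with exact gradients, uses a small set of discrete candidate step sizes as meta-actions, rewards the one-step improvement in J, and performs a single myopic regression pass (scikit-learn's ExtraTreesRegressor standing in for a full batch RL algorithm such as Fitted Q-Iteration). All names, the surrogate objective, and the feature construction are assumptions made for illustration.

# Minimal sketch of the meta-MDP idea: learn to recommend a step size as a
# function of (task context, current policy parameters, gradient). The toy
# objective J, the features, and all names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

def J(theta, omega):
    """Toy stand-in for the expected return of parameters theta on task omega."""
    return -np.sum((theta - omega) ** 2)

def grad_J(theta, omega):
    """Exact gradient of the surrogate; the paper uses policy-gradient estimates."""
    return -2.0 * (theta - omega)

STEP_SIZES = [1e-3, 1e-2, 1e-1, 5e-1]   # discrete meta-actions (candidate step sizes)
dataset = []                            # meta-transitions: (meta-state, step size, improvement)

# Collect meta-transitions: each meta-step performs one gradient-ascent update.
for _ in range(200):
    omega = rng.uniform(-1.0, 1.0, size=2)   # sampled task context
    theta = rng.normal(size=2)               # current policy parameters
    for _ in range(5):
        g = grad_J(theta, omega)
        alpha = float(rng.choice(STEP_SIZES))                 # exploratory step size
        theta_next = theta + alpha * g
        meta_reward = J(theta_next, omega) - J(theta, omega)  # one-step improvement
        meta_state = np.concatenate([omega, theta, g])
        dataset.append((meta_state, alpha, meta_reward))
        theta = theta_next

X = np.array([np.append(s, a) for s, a, _ in dataset])
y = np.array([r for _, _, r in dataset])

# One regression pass: a myopic stand-in for a batch RL algorithm such as FQI.
q_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

def recommend_step_size(omega, theta):
    """Greedy meta-policy: pick the step size with the best predicted improvement."""
    s = np.concatenate([omega, theta, grad_J(theta, omega)])
    preds = q_model.predict(np.array([np.append(s, a) for a in STEP_SIZES]))
    return STEP_SIZES[int(np.argmax(preds))]

print(recommend_step_size(np.array([0.3, -0.7]), np.zeros(2)))

In the actual setting, J and grad_J would be replaced by Monte Carlo estimates of the expected return and of the policy gradient obtained by sampling trajectories on the drawn task, and the single regression pass by the full batch RL procedure.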

Notes

  1. The full version of the paper, including the appendix, is available at http://arxiv.org/abs/2306.07741.

  2. For the sake of brevity, when a variable depends on the policy \(\pi _{\boldsymbol{\theta }}\), only \(\boldsymbol{\theta }\) is shown in the superscript.

  3. Assuming that the policy and its gradient are Lipschitz continuous (LC) w.r.t. \(\boldsymbol{\theta }\); see the sketch after these notes.

  4. Since the environment is an alteration of the classic Cartpole, results cannot be compared directly against standard benchmarks.

  5. The expected return varies considerably with the task \(\boldsymbol{\omega }\); hence, learning curves formatted as in the other plots of Fig. 2 would show very high variance, regardless of the robustness of the models.
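
As a concrete illustration of the Lipschitz-continuity assumption mentioned in note 3, the condition can be sketched as follows; the constants \(L_{\pi }\) and \(L_{\nabla \pi }\) are generic placeholders rather than the paper's notation:

\[
| \pi _{\boldsymbol{\theta }}(a \mid s) - \pi _{\boldsymbol{\theta }'}(a \mid s) | \le L_{\pi }\, \Vert \boldsymbol{\theta } - \boldsymbol{\theta }'\Vert , \qquad \Vert \nabla _{\boldsymbol{\theta }} \pi _{\boldsymbol{\theta }}(a \mid s) - \nabla _{\boldsymbol{\theta }'} \pi _{\boldsymbol{\theta }'}(a \mid s)\Vert \le L_{\nabla \pi }\, \Vert \boldsymbol{\theta } - \boldsymbol{\theta }'\Vert \qquad \forall s, a,\ \forall \boldsymbol{\theta }, \boldsymbol{\theta }'.
\]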

Acknowledgments

This work is supported by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence).

Author information

Corresponding author

Correspondence to Luca Sabbioni.

Ethics declarations

Ethical Statement

Hyperparameter selection for policy-based algorithms has a significant impact on the ability to learn a highly performing policy in RL, especially with heterogeneous tasks, where different contexts may require different solutions. Our approach shows that it is possible to automatically learn configurations as good as those that could otherwise be identified only through manual fine-tuning of the parameters. Consequently, our work can be seen as a further step in the AutoML direction, in which a practitioner could run the algorithm and, with some guidance, obtain optimal performance in just a few steps, without the need for manual fine-tuning. Beyond this, we are not aware of any societal consequences of our work, such as impacts on welfare, fairness, or privacy.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sabbioni, L., Corda, F., Restelli, M. (2023). Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_30

  • DOI: https://doi.org/10.1007/978-3-031-43421-1_30

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science, Computer Science (R0)
