Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14172)

Abstract

Policy-based algorithms are among the most widely adopted techniques in model-free RL, thanks to their strong theoretical grounding and good properties in continuous action spaces. Unfortunately, these methods require precise and problem-specific hyperparameter tuning to achieve good performance, and they tend to struggle when asked to accomplish a series of heterogeneous tasks. In particular, the selection of the step size has a crucial impact on their ability to learn a highly performing policy, affecting the speed and stability of the training process and often being the main culprit for poor results. In this paper, we tackle these issues with a Meta Reinforcement Learning approach, by introducing a new formulation, known as meta-MDP, that can be used to solve any hyperparameter selection problem in RL with contextual processes. After providing a theoretical Lipschitz bound on the difference in performance across tasks, we adopt the proposed framework to train a batch RL algorithm to dynamically recommend the most adequate step size for different policies and tasks. Finally, we present an experimental campaign showing the advantages of selecting an adaptive learning rate in heterogeneous environments.
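
To make the meta-MDP formulation concrete, below is a minimal, illustrative Python sketch, not the paper's implementation. It replaces the contextual MDP and the policy-gradient estimator with a toy quadratic surrogate J(theta; omega) with exact gradients, uses a small set of discrete candidate step sizes as meta-actions, rewards the one-step improvement in J, and performs a single myopic regression pass (scikit-learn's ExtraTreesRegressor standing in for a full batch RL algorithm such as Fitted Q-Iteration). All names, the surrogate objective, and the feature construction are assumptions made for illustration.

# Minimal sketch of the meta-MDP idea: learn to recommend a step size as a
# function of (task context, current policy parameters, gradient). The toy
# objective J, the features, and all names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

def J(theta, omega):
    """Toy stand-in for the expected return of parameters theta on task omega."""
    return -np.sum((theta - omega) ** 2)

def grad_J(theta, omega):
    """Exact gradient of the surrogate; the paper uses policy-gradient estimates."""
    return -2.0 * (theta - omega)

STEP_SIZES = [1e-3, 1e-2, 1e-1, 5e-1]   # discrete meta-actions (candidate step sizes)
dataset = []                            # meta-transitions: (meta-state, step size, improvement)

# Collect meta-transitions: each meta-step performs one gradient-ascent update.
for _ in range(200):
    omega = rng.uniform(-1.0, 1.0, size=2)   # sampled task context
    theta = rng.normal(size=2)               # current policy parameters
    for _ in range(5):
        g = grad_J(theta, omega)
        alpha = float(rng.choice(STEP_SIZES))                 # exploratory step size
        theta_next = theta + alpha * g
        meta_reward = J(theta_next, omega) - J(theta, omega)  # one-step improvement
        meta_state = np.concatenate([omega, theta, g])
        dataset.append((meta_state, alpha, meta_reward))
        theta = theta_next

X = np.array([np.append(s, a) for s, a, _ in dataset])
y = np.array([r for _, _, r in dataset])

# One regression pass: a myopic stand-in for a batch RL algorithm such as FQI.
q_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

def recommend_step_size(omega, theta):
    """Greedy meta-policy: pick the step size with the best predicted improvement."""
    s = np.concatenate([omega, theta, grad_J(theta, omega)])
    preds = q_model.predict(np.array([np.append(s, a) for a in STEP_SIZES]))
    return STEP_SIZES[int(np.argmax(preds))]

print(recommend_step_size(np.array([0.3, -0.7]), np.zeros(2)))

In the actual setting, J and grad_J would be replaced by Monte Carlo estimates of the expected return and of the policy gradient obtained by sampling trajectories on the drawn task, and the single regression pass by the full batch RL procedure.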

Notes

  1. The full version of the paper, including the appendix, is available at http://arxiv.org/abs/2306.07741.

  2. For the sake of brevity, when a variable depends on the policy \(\pi _{\boldsymbol{\theta }}\), only \(\boldsymbol{\theta }\) is shown in the superscript.

  3. Assuming that the policy and its gradient are Lipschitz continuous (LC) w.r.t. \(\boldsymbol{\theta }\); see the sketch after these notes.

  4. Since the environment is an alteration of the classic Cartpole, results cannot be compared directly against standard benchmarks.

  5. The expected return varies considerably with the task \(\boldsymbol{\omega }\); hence, learning curves formatted as in the other plots of Fig. 2 would show very high variance, regardless of the robustness of the models.
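
As a concrete illustration of the Lipschitz-continuity assumption mentioned in note 3, the condition can be sketched as follows; the constants \(L_{\pi }\) and \(L_{\nabla \pi }\) are generic placeholders rather than the paper's notation:

\[
| \pi _{\boldsymbol{\theta }}(a \mid s) - \pi _{\boldsymbol{\theta }'}(a \mid s) | \le L_{\pi }\, \Vert \boldsymbol{\theta } - \boldsymbol{\theta }'\Vert , \qquad \Vert \nabla _{\boldsymbol{\theta }} \pi _{\boldsymbol{\theta }}(a \mid s) - \nabla _{\boldsymbol{\theta }'} \pi _{\boldsymbol{\theta }'}(a \mid s)\Vert \le L_{\nabla \pi }\, \Vert \boldsymbol{\theta } - \boldsymbol{\theta }'\Vert \qquad \forall s, a,\ \forall \boldsymbol{\theta }, \boldsymbol{\theta }'.
\]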

Acknowledgments

This work is supported by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence).

Author information

Corresponding author

Correspondence to Luca Sabbioni.

Ethics declarations

Ethical Statement

Hyperparameter selection for policy-based algorithms has a significant impact on the ability to learn a highly performing policy in RL, especially with heterogeneous tasks, where different contexts may require different solutions. Our approach shows that it is possible to automatically learn configurations as good as those that could otherwise be identified only through manual fine-tuning of the parameters. Consequently, our work can be seen as a further step in the AutoML direction, in which a practitioner could run the algorithm and, with some guidance, obtain optimal performance in just a few steps, without the need for manual fine-tuning. Beyond this, we are not aware of any societal consequences of our work, such as impacts on welfare, fairness, or privacy.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sabbioni, L., Corda, F., Restelli, M. (2023). Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_30

  • DOI: https://doi.org/10.1007/978-3-031-43421-1_30

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science, Computer Science (R0)
