Abstract
Policy-based algorithms are among the most widely adopted techniques in model-free RL, thanks to their strong theoretical grounding and good properties in continuous action spaces. Unfortunately, these methods require precise and problem-specific hyperparameter tuning to achieve good performance, and they tend to struggle when asked to accomplish a series of heterogeneous tasks. In particular, the selection of the step size has a crucial impact on their ability to learn a highly performing policy, affecting both the speed and the stability of the training process, and it is often the main culprit for poor results. In this paper, we tackle these issues with a Meta Reinforcement Learning approach, introducing a new formulation, the meta-MDP, that can be used to solve any hyperparameter selection problem in RL involving contextual processes. After providing a theoretical Lipschitz bound on the difference in performance across tasks, we adopt the proposed framework to train a batch RL algorithm that dynamically recommends the most adequate step size for different policies and tasks. Finally, we present an experimental campaign showing the advantages of an adaptive learning rate in heterogeneous environments.
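To make the core idea concrete, here is a minimal, self-contained sketch (not the authors' code) of step-size selection framed as a meta-decision problem. The objective `J(theta, omega) = -(theta - omega)^2` is a toy stand-in for the expected return of a task with context `omega`, the candidate step-size grid is an assumption, and the greedy one-step-lookahead selector is a placeholder for the meta-policy that the paper trains with batch RL across tasks.

```python
# Sketch of adaptive step-size selection across heterogeneous task contexts.
# All names here (J, grad_J, CANDIDATE_STEPS) are illustrative assumptions,
# not the paper's implementation.

CANDIDATE_STEPS = (1e-3, 1e-2, 1e-1, 3e-1)  # step-size grid (assumption)

def J(theta, omega):
    """Toy 'expected return' of policy parameter theta on task context omega."""
    return -(theta - omega) ** 2

def grad_J(theta, omega):
    """Exact gradient of the toy objective (stands in for a PG estimate)."""
    return -2.0 * (theta - omega)

def select_step_size(theta, omega):
    """Pick the candidate step maximizing one-step improvement; the paper
    instead learns this mapping (meta-state -> step size) with batch RL."""
    g = grad_J(theta, omega)
    return max(CANDIDATE_STEPS, key=lambda a: J(theta + a * g, omega))

def train(theta, omega, n_updates=50):
    for _ in range(n_updates):
        alpha = select_step_size(theta, omega)        # adaptive step size
        theta = theta + alpha * grad_J(theta, omega)  # gradient ascent
    return theta

if __name__ == "__main__":
    # Heterogeneous task contexts: each omega defines a different task.
    for omega in (-1.0, 0.5, 2.0):
        print(f"omega={omega:+.1f} -> theta={train(0.0, omega):.4f}")
```

The point of the sketch is the control flow: the step size is re-selected at every update as a function of the current parameters and the task context, rather than fixed once per run.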
Notes
- 1. The full version of the paper, including the appendix, is available at http://arxiv.org/abs/2306.07741.
- 2. For the sake of brevity, when a variable depends on the policy \(\pi _{\boldsymbol{\theta }}\), only \(\boldsymbol{\theta }\) is shown in the superscript.
- 3. By assuming that the policy and its gradient are Lipschitz continuous (LC) w.r.t. \(\boldsymbol{\theta }\); a schematic form of such a bound is sketched after these notes.
- 4. Since our environment is an alteration of the classic Cartpole, results cannot be compared against standard benchmarks.
- 5. The expected return varies substantially with the task \(\boldsymbol{\omega }\); hence, learning curves in the format of the other plots in Fig. 2 would show very high variance, regardless of the robustness of the models.
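Schematically, a Lipschitz performance bound of the kind referenced in the abstract and in note 3 relates the performance of a fixed policy across two task contexts. The constant \(L_{J}\) and the metric \(d\) over the context space \(\Omega \) below are illustrative placeholders, not the paper's exact statement:

\[ \left| J(\boldsymbol{\theta }, \boldsymbol{\omega }_1) - J(\boldsymbol{\theta }, \boldsymbol{\omega }_2) \right| \;\le \; L_{J}\, d(\boldsymbol{\omega }_1, \boldsymbol{\omega }_2), \qquad \forall \, \boldsymbol{\omega }_1, \boldsymbol{\omega }_2 \in \Omega , \]

where \(J(\boldsymbol{\theta }, \boldsymbol{\omega })\) denotes the expected return of policy \(\pi _{\boldsymbol{\theta }}\) on the task with context \(\boldsymbol{\omega }\).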
Acknowledgments
This paper is supported by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence).
Ethics declarations
Ethical Statement
Hyperparameter selection for policy-based algorithms has a significant impact on the ability to learn a highly performing policy in RL, especially with heterogeneous tasks, where different contexts may require different solutions. Our approach shows that it is possible to automatically learn configurations matching the best ones that could otherwise be identified only through manual fine-tuning of the parameters. Consequently, our work can be seen as a further step in the AutoML direction, in which a practitioner could run the algorithm and, with some guidance, obtain optimal performance in just a few steps, without the need for manual fine-tuning. Beyond this, we are not aware of any societal consequences of our work, for instance regarding welfare, fairness, or privacy.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sabbioni, L., Corda, F., Restelli, M. (2023). Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_30
DOI: https://doi.org/10.1007/978-3-031-43421-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science, Computer Science (R0)