Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes

  • Full Length Paper
  • Series A
  • Mathematical Programming

Abstract

We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an \({{\mathcal {O}}}(1/\epsilon )\) (resp., \({{\mathcal {O}}}(1/\epsilon ^2)\)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where \(\epsilon \) denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by \({{\mathcal {O}}}\{(\log _\gamma \epsilon ) [(1-\gamma )L/\mu ]^{1/2}\log (1/\epsilon )\}\) (resp., \({{\mathcal {O}}} \{(\log _\gamma \epsilon ) (L/\epsilon )^{1/2}\}\)) for problems with strongly (resp., general) convex regularizers. Here \(\gamma \) denotes the discount factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility and thus expands the applicability of RL models.
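To fix ideas, the following minimal sketch (illustrative only, not the algorithm as developed in the paper) shows one tabular PMD update in the unregularized case (\(h \equiv 0\)) with the Kullback–Leibler divergence as the Bregman distance, where the prox-mapping admits the closed form \(\pi _{k+1}(a|s) \propto \pi _k(a|s)\exp \{-\eta _k Q^{\pi _k}(s,a)\}\); the tabular arrays, stepsize, and Q-values are hypothetical placeholders.

```python
import numpy as np

def pmd_update(pi, Q, eta):
    """One tabular policy mirror descent (PMD) step with the KL divergence as the
    Bregman distance and no regularizer (h = 0), for illustration only.

    pi  : (|S|, |A|) array, current policy pi_k(a|s) with strictly positive entries
    Q   : (|S|, |A|) array, action-value function Q^{pi_k}(s, a) (costs, so we descend)
    eta : stepsize

    Returns pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(-eta * Q^{pi_k}(s, a)),
    the closed-form solution of the KL prox-mapping over each simplex pi(.|s)."""
    logits = np.log(pi) - eta * Q
    logits -= logits.max(axis=1, keepdims=True)       # subtract row-wise max for stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy usage on a 2-state, 3-action problem with arbitrary Q-values.
pi0 = np.full((2, 3), 1.0 / 3.0)
Q0 = np.array([[1.0, 0.5, 2.0], [0.2, 0.9, 0.4]])
print(pmd_update(pi0, Q0, eta=1.0))
```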


Notes

  1. It is worth noting that we do not enforce \(\pi (a|s) > 0\) when defining \(\omega (\pi (\cdot |s))\), since all the search points generated by our algorithms satisfy this condition.

References

  1. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: On the theory of policy gradient methods: optimality, approximation, and distribution shift. arXiv:1908.00261 (2019)

  2. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  3. Bellman, R., Dreyfus, S.: Functional approximations and dynamic programming. Math. Tables Other Aids Comput. 13(68), 247–251 (1959)

  4. Bhandari, J., Russo, D.: A note on the linear convergence of policy gradient methods. arXiv:2007.11120 (2020)

  5. Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. arXiv:2007.06558 (2020)

  6. Dang, C.D., Lan, G.: On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators. Comput. Optim. Appl. 60(2), 277–310 (2015)

  7. Even-Dar, E., Kakade, S.M., Mansour, Y.: Online Markov decision processes. Math. Oper. Res. 34(3), 726–736 (2009)

  8. Facchinei, F., Pang, J.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Volumes I and II. Comprehensive Study in Mathematics. Springer, New York (2003)

  9. Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML) (2002)

  10. Khodadadian, S., Chen, Z., Maguluri, S.T.: Finite-sample analysis of off-policy natural actor-critic algorithm. arXiv:2102.09318 (2021)

  11. Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation. arXiv:2011.02987 (2020)

  12. Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning. arXiv:2011.08434 (2020)

  13. Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer, Switzerland (2020)

  14. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. arXiv:1906.10306 (2019)

  15. Mei, J., Xiao, C., Szepesvari, C., Schuurmans, D.: On the global convergence rates of softmax policy gradient methods. arXiv:2005.06392 (2020)

  16. Nemirovski, A.S., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)

  17. Nemirovski, A.S., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, New York (1983)

  18. Nesterov, Y.E.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. AN SSSR 269, 543–547 (1983)

  19. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn. Wiley, New York (1994)

  20. Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 5668–5675. AAAI Press (2020)

  21. Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NIPS’99: Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 1057–1063 (1999)

  22. Tomar, M., Shani, L., Efroni, Y., Ghavamzadeh, M.: Mirror descent policy optimization. arXiv:2005.09814 (2020)

  23. Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018)

  24. Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence. arXiv:1909.01150 (2020)

  25. Wolfer, G., Kontorovich, A.: Statistical estimation of ergodic Markov chain kernel over discrete state space. arXiv:1809.05014v6 (2020)

  26. Xu, T., Wang, Z., Liang, Y.: Improving sample complexity bounds for actor-critic algorithms. arXiv:2004.12956 (2020)

Acknowledgements

The author is very grateful to Caleb Ju, Sajad Khodadadian, Tianjiao Li, Yan Li, and two anonymous reviewers for their careful reading of earlier versions of this paper and for their suggested corrections.

Author information

Corresponding author

Correspondence to Guanghui Lan.

Additional information

This research was partially supported by NSF grants 1909298 and 1953199 and NIFA grant 2020-67021-31526. The paper was first released at arXiv:2102.00135 on 01/30/2021.

Appendices

Appendix A: Concentration bounds for \(l_\infty \)-bounded noise

We first show how to bound the expectation of the maximum of a finite number of sub-exponential random variables.

Lemma 25

Let \(\left\Vert X\right\Vert _{\psi _1}:= \inf \{t > 0: \mathbb {E}[\exp (|X|/t)] \le \exp (2) \}\) denote the sub-exponential norm of X. For a given sequence of sub-exponential random variables \(\{X_i\}_{i=1}^n\) with \(\mathbb {E}[X_i] \le v\) and \(\left\Vert X_i\right\Vert _{\psi _1} \le \sigma \), we have

$$\begin{aligned} \mathbb {E}[\max _i X_i] \le C \sigma (\log n + 1) + v, \end{aligned}$$

where C denotes an absolute constant.

Proof

By the property of sub-exponential random variables (Section 2.7 of [23]), we know that \(Y_i = X_i - \mathbb {E}\left[ X_i\right] \) is also sub-exponential with \(\left\Vert Y_i\right\Vert _{\psi _1} \le C_1 \left\Vert X_i\right\Vert _{\psi _1} \le C_1 \sigma \) for some absolute constant \(C_1 > 0\). Hence by Proposition 2.7.1 of [23], there exists an absolute constant \(C > 0\) such that \( \mathbb {E}[\exp (\lambda Y_i)] \le \exp (C^2 \sigma ^2 \lambda ^2) , ~ \forall |\lambda | \le 1/( C \sigma ). \) Using the previous observation, we have

$$\begin{aligned}&\exp ( \mathbb {E}[\lambda \max _{i} Y_i]) \le \mathbb {E}[ \exp (\lambda \max _i Y_i)] \le \mathbb {E}[\textstyle \sum _{i=1}^n \exp (\lambda Y_i) ] \\&\quad \le n \exp (C^2 \sigma ^2 \lambda ^2), ~ \forall |\lambda | \le \frac{1}{C \sigma }, \end{aligned}$$

which implies \( \mathbb {E}[\max _i Y_i] \le \log n / \lambda + C^2 \sigma ^2 \lambda , ~ \forall |\lambda | \le 1/(C \sigma ). \) Choosing \(\lambda = 1/(C \sigma )\), we obtain \( \mathbb {E}\left[ \max _i Y_i\right] \le C \sigma (\log n + 1 ) . \) By combining this relation with the definition of \(Y_i\), we conclude that \( \mathbb {E}[\max _i X_i] \le \mathbb {E}[\max _i Y_i ]+ v \le C \sigma (\log n + 1) + v. \)

\(\square \)
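As a quick numerical illustration of Lemma 25 (a sketch, not part of the analysis above), one can check the \(\log n\) growth of \(\mathbb {E}[\max _i X_i]\) on exponential random variables, a simple sub-exponential family; the constants and sample sizes below are illustrative.

```python
import numpy as np

# Quick numerical illustration of Lemma 25 (illustrative only): for n i.i.d.
# Exp(sigma) variables -- a simple sub-exponential family with ||X_i||_{psi_1}
# proportional to sigma -- we have E[max_i X_i] = sigma * (1 + 1/2 + ... + 1/n),
# which grows like sigma * log n, consistent with the C*sigma*(log n + 1) + v bound
# (here with v = E[X_i] = sigma and the absolute constant C taken as 1).
rng = np.random.default_rng(0)
sigma, trials = 1.0, 20_000
for n in (10, 100, 1000):
    samples = rng.exponential(scale=sigma, size=(trials, n))
    est = samples.max(axis=1).mean()               # Monte Carlo estimate of E[max_i X_i]
    bound = sigma * (np.log(n) + 1) + sigma        # C*sigma*(log n + 1) + v with C = 1
    print(f"n={n:5d}   E[max] ~ {est:.3f}   bound ~ {bound:.3f}")
```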

Proposition 7

For \(\delta ^{k} := Q^{\pi _k, \xi _k} - Q^{\pi _k} \in \mathbb {R}^{|{{\mathcal {S}}} | \times |{{\mathcal {A}}} |}\), we have

$$\begin{aligned} \mathbb {E}_{\xi _k}[ \Vert \delta ^k\Vert _{\infty }^2 ] \le \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \left[ \gamma ^{2T_k} + \tfrac{\kappa }{M_k} (\log (|{{\mathcal {S}}} | |{{\mathcal {A}}} |) + 1)\right] , \end{aligned}$$

where \(\kappa >0\) denotes an absolute constant.

Proof

To proceed, we denote \(\delta ^{k}_{s,a} := Q^{\pi _k, \xi _k}(s,a) - Q^{\pi _k}(s,a) \), and hence

$$\begin{aligned}\mathbb {E}_{\xi _k} \Vert Q^{\pi _k, \xi _k} - Q^{\pi _k}\Vert _{\infty }^2 = \mathbb {E}_{\xi _k} [\max _{s \in {{\mathcal {S}}}, a \in {{\mathcal {A}}}} (\delta ^k_{s,a})^2]. \end{aligned}$$

Note that by definition, for each \((s,a)\) pair, we have \(M_k\) independent trajectories of length \(T_k\) starting from \((s,a)\). Let us denote \(Z_i := \sum _{t = 0}^{T_k - 1} \gamma ^t \left[ c(s_t^i, a_t^i) + h^{\pi _k}(s_t^i) \right] \), \(i = 1, \ldots , M_k\). Hence,

$$\begin{aligned} Q^{\pi _k, \xi _k} (s,a)&= \frac{1}{M_k} \textstyle \sum _{i=1}^{M_k} \textstyle \sum _{t = 0}^{T_k - 1} \gamma ^t \left[ c(s_t^i, a_t^i) + h^{\pi _k}(s_t^i) \right] = \tfrac{1}{M_k} \sum _{i=1}^{M_k} Z_i, \\ \delta ^{k}_{s,a}&= \frac{1}{M_k} \textstyle \sum _{i=1}^{M_k} (Z_i - Q^{\pi _k}(s,a)), ~~ Z_i - Q^{\pi _k}(s,a) \in [-\tfrac{{\overline{c}} + {\overline{h}}}{1-\gamma }, \tfrac{{\overline{c}} + {\overline{h}}}{1-\gamma }]. \end{aligned}$$

Since the random variables \(Z_i - Q^{\pi _k}(s,a)\), \(i = 1, \ldots , M_k\), are independent, it is immediate to see that \(Y_{s,a} := (\delta ^k_{s,a})^2\) is sub-exponential with \(\left\Vert Y_{s,a}\right\Vert _{\psi _1} \le \frac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k}\). Also note that

$$\begin{aligned} \mathbb {E}_{\xi _k} [Y_{s,a}] = \mathbb {E}_{\xi _k} [(\delta ^k_{s,a})^2] = \mathrm {Var}(\delta ^k_{s,a}) + (\mathbb {E}\delta ^k_{s,a})^2 \le \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}. \end{aligned}$$

Thus in view of Lemma 25 with \(\sigma = \tfrac{ ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k}\), and \(v = \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \frac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}\), we conclude that

$$\begin{aligned} \mathbb {E}[ \Vert \delta ^k\Vert _{\infty }^2]&= \mathbb {E}[ \max _{s\in {{\mathcal {S}}}, a\in {{\mathcal {A}}}} (\delta ^k_{s,a})^2] = \mathbb {E}[ \max _{s\in {{\mathcal {S}}}, a\in {{\mathcal {A}}}} Y_{s,a}] \\&\le \tfrac{C ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} (\log (|{{\mathcal {S}}} | |{{\mathcal {A}}} |) + 1) + \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}. \end{aligned}$$

\(\square \)
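For concreteness, the truncated Monte Carlo estimator \(Q^{\pi _k, \xi _k}\) analyzed in Proposition 7 can be sketched as follows; the MDP interface (mdp.cost, mdp.step, mdp.h) and the policy representation are hypothetical placeholders rather than the paper's implementation.

```python
import numpy as np

def q_hat(mdp, pi, s, a, gamma, M, T, rng):
    """Truncated Monte Carlo estimate of the regularized action-value Q^pi(s, a),
    mirroring the estimator analyzed in Proposition 7: the average over M
    independent trajectories of length T started from (s, a), each accumulating
    the discounted costs c(s_t, a_t) + h^pi(s_t).

    The interface is hypothetical: mdp.cost(s, a), mdp.h(pi, s) and mdp.step(s, a, rng)
    stand in for the cost, the regularizer term and the transition sampler; pi is an
    (|S|, |A|) array of action probabilities."""
    total = 0.0
    for _ in range(M):
        st, at, ret = s, a, 0.0
        for t in range(T):
            ret += gamma ** t * (mdp.cost(st, at) + mdp.h(pi, st))
            st = mdp.step(st, at, rng)                  # sample s_{t+1} ~ P(.|s_t, a_t)
            at = rng.choice(pi.shape[1], p=pi[st])      # sample a_{t+1} ~ pi(.|s_{t+1})
        total += ret
    return total / M                                    # Q^{pi, xi}(s, a)
```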

Appendix B: Bias for conditional temporal difference methods

Proof of Lemma 18

For simplicity, let us denote \({\bar{\theta }}_t \equiv \mathbb {E}[\theta _t]\), \(\zeta _t \equiv (\zeta _t^1, \ldots , \zeta _t^\alpha )\) and \(\zeta _{\lceil t\rceil } = (\zeta _1, \ldots , \zeta _t)\). Also let us denote \(\delta ^F_t := F^\pi (\theta _t) - \mathbb {E}[{\tilde{F}}^\pi (\theta _t,\zeta _t^\alpha )|\zeta _{\lceil t-1\rceil }]\) and \({\bar{\delta }}^F_t = \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[\delta ^F_t]\). It follows from Jensen’s inequality and Lemma 17 that

$$\begin{aligned} \Vert {\bar{\theta }}_t - \theta ^*\Vert _2 = \Vert \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[ \theta _t] - \theta ^*\Vert _2 \le \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[\Vert \theta _t - \theta ^*\Vert _2] \le R. \end{aligned}$$
(7.1)

Also by Jensen’s inequality, Lemma 16 and Lemma 17, we have

$$\begin{aligned} \Vert {\bar{\delta }}^F_t\Vert _2&= \Vert \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[ \delta ^F_t]\Vert _2 \le \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[\Vert \delta ^F_t\Vert _2]\nonumber \\&\le C \rho ^\alpha \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[\Vert \theta _t -\theta ^*\Vert _2]\le C R\rho ^\alpha . \end{aligned}$$
(7.2)

Notice that

$$\begin{aligned} \theta _{t+1}&= \theta _t - \beta _t {\tilde{F}}^\pi (\theta _t,\zeta _t^\alpha )\\&= \theta _t - \beta _t F^\pi (\theta _t) + \beta _t [F^\pi (\theta _t) - {\tilde{F}}^\pi (\theta _t,\zeta _t^\alpha )]. \end{aligned}$$

Now, conditioning on \(\zeta _{\lceil t-1\rceil }\) and taking expectation w.r.t. \(\zeta _t\) in (5.11), we have \( \mathbb {E}[\theta _{t+1}|\zeta _{\lceil t-1\rceil }] = \theta _t - \beta _t F^\pi (\theta _t) + \beta _t \delta ^F_t. \) Taking further expectation w.r.t. \(\zeta _{\lceil t-1\rceil }\) and using the linearity of F, we have \( {\bar{\theta }}_{t+1} = {\bar{\theta }}_t - \beta _t F^\pi ({\bar{\theta }}_t) + \beta _t {\bar{\delta }}^F_t, \) which implies

$$\begin{aligned} \Vert {\bar{\theta }}_{t+1} - \theta ^*\Vert _2^2&= \Vert {\bar{\theta }}_t - \theta ^* - \beta _t F^\pi ({\bar{\theta }}_t) + \beta _t {\bar{\delta }}^F_t\Vert _2^2\\&= \Vert {\bar{\theta }}_t - \theta ^*\Vert _2^2 - 2 \beta _t \langle F^\pi ({\bar{\theta }}_t) - {\bar{\delta }}^F_t, {\bar{\theta }}_t - \theta ^*\rangle + \beta _t^2 \Vert F^\pi ({\bar{\theta }}_t) - {\bar{\delta }}^F_t\Vert _2^2\\&\le \Vert {\bar{\theta }}_t - \theta ^*\Vert _2^2 - 2 \beta _t \langle F^\pi ({\bar{\theta }}_t) - {\bar{\delta }}^F_t, {\bar{\theta }}_t - \theta ^*\rangle \\&\quad + 2 \beta _t^2 [ \Vert F^\pi ({\bar{\theta }}_t)\Vert _2^2 + \Vert {\bar{\delta }}^F_t\Vert _2^2]. \end{aligned}$$

The above inequality, together with (7.1), (7.2) and the facts that

$$\begin{aligned} \langle F^\pi ({\bar{\theta }}_t), {\bar{\theta }}_t - \theta ^*\rangle = \langle F^\pi ({\bar{\theta }}_t) - F^\pi (\theta ^*), {\bar{\theta }}_t - \theta ^*\rangle \ge \varLambda _{\min } \Vert {\bar{\theta }}_t - \theta ^*\Vert _2^2\\ \Vert F^\pi ({\bar{\theta }}_t)\Vert _2 = \Vert F^\pi ({\bar{\theta }}_t) - F^\pi (\theta ^*)\Vert _2 \le \varLambda _{\max }\Vert {\bar{\theta }}_t - \theta ^*\Vert _2, \end{aligned}$$

then imply that

$$\begin{aligned} \Vert {\bar{\theta }}_{t+1} - \theta ^*\Vert _2^2&\le (1 - 2 \beta _t \varLambda _{\min } + 2 \beta _t^2 \varLambda _{\max }^2) \Vert {\bar{\theta }}_t - \theta ^*\Vert _2^2 + 2\beta _t C R^2 \rho ^\alpha \nonumber \\&\quad + 2 \beta _t^2 C^2 R^2\rho ^{2\alpha } \nonumber \\&\le (1 - \tfrac{3}{t+ t_0 -1}) \Vert {\bar{\theta }}_t - \theta ^*\Vert _2^2 + 2\beta _t C R^2 \rho ^\alpha + 2 \beta _t^2 C^2 R^2\rho ^{2\alpha }, \end{aligned}$$
(7.3)

where the last inequality follows from

$$\begin{aligned} 2(\beta _t \varLambda _{\min } - \beta _t^2 \varLambda _{\max }^2)&= 2 \beta _t ( \varLambda _{\min } - \beta _t\varLambda _{\max }^2 ) = 2 \beta _t (\varLambda _{\min } - \tfrac{2 \varLambda _{\max }^2}{\varLambda _{\min } (t+ t_0 -1)})\\&\ge 2 \beta _t (\varLambda _{\min } - \tfrac{2 \varLambda _{\max }^2}{\varLambda _{\min } t_0 }) \ge \tfrac{3}{2} \beta _t \varLambda _{\min } = \tfrac{3}{t+ t_0 -1} \end{aligned}$$

due to the selection of \(\beta _t\) in (5.13). Now let us denote \( \varGamma _t := {\left\{ \begin{array}{ll} 1 & t =0,\\ (1 - \tfrac{3}{t+ t_0 -1})\varGamma _{t-1} & t \ge 1, \end{array}\right. } \) or equivalently, \(\varGamma _t = \tfrac{(t_0 - 1) (t_0 -2) (t_0 -3)}{(t+t_0-1) (t + t_0 -2) (t+ t_0 -3)}\). Dividing both sides of (7.3) by \(\varGamma _t\) and taking the telescopic sum, we have

$$\begin{aligned} \tfrac{1}{\varGamma _t} \Vert {\bar{\theta }}_{t+1} - \theta ^*\Vert _2^2&\le \Vert {\bar{\theta }}_1 - \theta ^*\Vert _2^2 + 2 C R^2 \rho ^\alpha \textstyle \sum _{i=1}^t \tfrac{\beta _i}{\varGamma _i} + 2 C^2 R^2\rho ^{2\alpha } \textstyle \sum _{i=1}^t \tfrac{\beta _i^2}{\varGamma _i}. \end{aligned}$$

Noting that

$$\begin{aligned} \textstyle \sum _{i=1}^t \tfrac{\beta _i}{\varGamma _i}&= \tfrac{2}{\varLambda _{\min }} \textstyle \sum _{i=1}^t \tfrac{(i+t_0 - 2)(i+t_0-3)}{(t_0-1)(t_0-2)(t_0-3)} \le \tfrac{2 \textstyle \sum _{i=1}^t (i+t_0 - 2)^2}{\varLambda _{\min }(t_0-1)(t_0-2)(t_0-3)}\\&\le \tfrac{2 (t+t_0-1)^3}{3\varLambda _{\min }(t_0-1)(t_0-2)(t_0-3)},\\ \textstyle \sum _{i=1}^t \tfrac{\beta _i^2}{\varGamma _i}&\le \tfrac{4 \textstyle \sum _{i=1}^t (i+t_0-3)}{\varLambda _{\min }^2(t_0-1)(t_0-2)(t_0-3)} \le \tfrac{2 (t+t_0-2)^2}{\varLambda _{\min }^2(t_0-1)(t_0-2)(t_0-3)}, \end{aligned}$$

we conclude

$$\begin{aligned} \Vert {\bar{\theta }}_{t+1} - \theta ^*\Vert _2^2&\le \tfrac{(t_0 - 1) (t_0 -2) (t_0 -3)}{(t+t_0-1) (t + t_0 -2) (t+ t_0 -3)} \Vert {\bar{\theta }}_1 - \theta ^*\Vert _2^2 + 2 C R^2 \rho ^\alpha \tfrac{2 (t+t_0-1)^2}{3\varLambda _{\min } (t + t_0 -2) (t+ t_0 -3)} \\&\quad + 2 C^2 R^2\rho ^{2\alpha } \tfrac{2 (t+t_0-2)}{\varLambda _{\min }^2(t+t_0-1) (t+ t_0 -3)}\\&\le \tfrac{(t_0 - 1) (t_0 -2) (t_0 -3)}{(t+t_0-1) (t + t_0 -2) (t+ t_0 -3)} \Vert {\bar{\theta }}_1 - \theta ^*\Vert _2^2 + \tfrac{8 C R^2 \rho ^\alpha }{3\varLambda _{\min }} + \tfrac{C^2 R^2\rho ^{2\alpha }}{\varLambda _{\min }^2}, \end{aligned}$$

from which the result holds since \({\bar{\theta }}_1 = \theta _1\). \(\square \)
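To make the recursion above concrete, the following minimal sketch (illustrative only: i.i.d. rather than Markovian noise, and toy problem data) runs the iteration \(\theta _{t+1} = \theta _t - \beta _t {\tilde{F}}^\pi (\theta _t,\zeta _t^\alpha )\) with the stepsize \(\beta _t = \tfrac{2}{\varLambda _{\min }(t+t_0-1)}\) implied by the proof of Lemma 18.

```python
import numpy as np

# Minimal sketch (not the paper's code) of the stochastic iteration analyzed in
# Appendix B: theta_{t+1} = theta_t - beta_t * F_tilde(theta_t, noise), with the
# diminishing stepsize beta_t = 2 / (Lambda_min * (t + t_0 - 1)) recovered from the
# proof of Lemma 18.  F(theta) = A (theta - theta_star) is a toy linear operator
# satisfying the monotonicity/Lipschitz conditions used in the proof; the i.i.d.
# noise (instead of Markovian noise), A, theta_star and t_0 are illustrative choices.
rng = np.random.default_rng(1)
d = 5
A = np.diag(np.linspace(0.5, 2.0, d))                # Lambda_min = 0.5, Lambda_max = 2.0
lam_min, lam_max = 0.5, 2.0
theta_star = rng.normal(size=d)
t0 = int(np.ceil(8 * lam_max ** 2 / lam_min ** 2))   # ensures the stepsize condition used to derive (7.3)
theta = np.zeros(d)
for t in range(1, 5001):
    beta = 2.0 / (lam_min * (t + t0 - 1))
    F_tilde = A @ (theta - theta_star) + 0.1 * rng.normal(size=d)   # F(theta) + noise
    theta -= beta * F_tilde
print("||theta_T - theta*||_2 =", np.linalg.norm(theta - theta_star))
```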

About this article

Cite this article

Lan, G. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Math. Program. 198, 1059–1106 (2023). https://doi.org/10.1007/s10107-022-01816-5
