Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

Chapter in Geometric Aspects of Functional Analysis

Part of the book series: Lecture Notes in Mathematics (LNM, volume 2327)

Abstract

We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution \(\nu = e^{-f}\) on \(\mathbb {R}^n\). We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming \(\nu \) satisfies a log-Sobolev inequality and the Hessian of f is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We prove convergence guarantees in Rényi divergence of order \(q > 1\) assuming the limit of ULA satisfies isoperimetry, namely either the log-Sobolev or Poincaré inequality. We also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of f, without requiring isoperimetry.

This work was supported in part by NSF awards CCF-1717349, DMS-1839323, CCF-2007443, and CCF-2106644.


Notes

  1. Recall for \(\nu = \mathcal {N}(0,\frac {1}{\alpha } I)\) and \(\rho = \mathcal {N}(0,\frac {1}{\beta } I)\) on \(\mathbb {R}^n\), the relative Fisher information is \(J_\nu (\rho ) = \frac {n}{\beta } (\beta -\alpha )^2\) (a short verification is given after these notes).

  2. Recall \(W_1(\rho ,\nu ) = \sup \{ \mathbb {E}_\rho [g] - \mathbb {E}_\nu [g] \colon g \text{ is }1\text{-Lipschitz}\}\).
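
With the chapter's definition of relative Fisher information, \(J_\nu (\rho ) = \mathbb {E}_\rho [\|\nabla \log \frac {\rho }{\nu }\|{}^2]\), the formula in footnote 1 follows from a one-line computation (added here as a check): for \(\nu = \mathcal {N}(0,\frac {1}{\alpha } I)\) and \(\rho = \mathcal {N}(0,\frac {1}{\beta } I)\) we have \(\nabla \log \frac {\rho (x)}{\nu (x)} = (\alpha -\beta ) x\), so

$$\displaystyle \begin{aligned} J_\nu(\rho) = (\beta-\alpha)^2 \, \mathbb{E}_\rho[\|x\|^2] = \frac{n}{\beta} (\beta-\alpha)^2. \end{aligned} $$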

References

  1. M. Abadi, A. Chu, I. Goodfellow, H.B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (ACM, 2016), pp. 308–318

  2. D. Applegate, R. Kannan, Sampling and integration of near log-concave functions, in Proceedings of the Twenty-third Annual ACM Symposium on Theory of Computing, STOC ’91, New York, NY, USA (ACM, 1991), pp. 156–163

  3. J.C. Baez, Rényi entropy and free energy. Preprint. arXiv:1102.2098 (2011)

  4. S. Bai, T. Lepoint, A. Roux-Langlois, A. Sakzad, D. Stehlé, R. Steinfeld, Improved security proofs in lattice-based cryptography: using the Rényi divergence rather than the statistical distance. J. Cryptol. 31(2), 610–640 (2018)

  5. D. Bakry, M. Émery, Diffusions hypercontractives, in Séminaire de Probabilités XIX 1983/84 (Springer, 1985), pp. 177–206

  6. D. Bakry, F. Barthe, P. Cattiaux, A. Guillin et al., A simple proof of the Poincaré inequality for a large class of probability measures. Electron. Commun. Probab. 13, 60–66 (2008)

  7. K. Balasubramanian, S. Chewi, M.A. Erdogdu, A. Salim, M. Zhang, Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo, in Proceedings of the 2022 Conference on Learning Theory. PMLR (2022)

  8. J.B. Bardet, N. Gozlan, F. Malrieu, P.A. Zitt, Functional inequalities for Gaussian convolutions of compactly supported measures: explicit bounds and dimension dependence. Bernoulli 24(1), 333–353 (2018)

  9. E. Bernton, Langevin Monte Carlo and JKO splitting, in Conference on Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018 (2018), pp. 1777–1798

  10. S.G. Bobkov, F. Götze, Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. J. Funct. Anal. 163(1), 1–28 (1999)

  11. S.G. Bobkov, I. Gentil, M. Ledoux, Hypercontractivity of Hamilton–Jacobi equations. J. Math. Pures Appl. 80(7), 669–696 (2001)

  12. S.G. Bobkov, G.P. Chistyakov, F. Götze, Rényi divergence and the central limit theorem. Ann. Probab. 47(1), 270–323 (2019)

  13. M. Bun, T. Steinke, Concentrated differential privacy: simplifications, extensions, and lower bounds, in Theory of Cryptography Conference (Springer, 2016), pp. 635–658

  14. Y. Cao, J. Lu, Y. Lu, Exponential decay of Rényi divergence under Fokker–Planck equations. J. Stat. Phys. 176, 1172–1184 (2019)

  15. D. Chafaï, Entropies, convexity, and functional inequalities: on \(\phi \)-entropies and \(\phi \)-Sobolev inequalities. J. Math. Kyoto Univ. 44(2), 325–363 (2004)

  16. D. Chafai, F. Malrieu, On fine properties of mixtures with respect to concentration of measure and Sobolev type inequalities, in Annales de l’IHP Probabilités et statistiques, vol. 46 (2010), pp. 72–96

  17. Z. Chen, S.S. Vempala, Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory Comput. 18(9), 1–18 (2022)

  18. Y. Chen, S. Chewi, A. Salim, A. Wibisono, Improved analysis for a proximal algorithm for sampling, in Proceedings of the 2022 Conference on Learning Theory. PMLR (2022)

  19. X. Cheng, P. Bartlett, Convergence of Langevin MCMC in KL-divergence, in Proceedings of Algorithmic Learning Theory, ed. by F. Janoos, M. Mohri, K. Sridharan, volume 83 of Proceedings of Machine Learning Research. PMLR, 07–09 Apr (2018), pp. 186–211

  20. X. Cheng, N.S. Chatterji, Y. Abbasi-Yadkori, P.L. Bartlett, M.I. Jordan, Sharp convergence rates for Langevin dynamics in the nonconvex setting. Preprint. arXiv:1805.01648 (2018)

  21. S. Chewi, T. Le Gouic, C. Lu, T. Maunu, P. Rigollet, A. Stromme, Exponential ergodicity of mirror-Langevin diffusions, in Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 19573–19585

  22. S. Chewi, M.A. Erdogdu, M.B. Li, R. Shen, M. Zhang, Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev, in Proceedings of the 2022 Conference on Learning Theory. PMLR (2022)

  23. T.A. Courtade, Bounds on the Poincaré constant for convolution measures. Ann. l’Inst. Henri Poincaré Probab. Stat. 56(1), 566–579 (2020)

  24. I. Csiszár, Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 41(1), 26–34 (1995)

  25. A. Dalalyan, Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, in Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research. PMLR, 07–10 Jul (2017), pp. 678–689

  26. A.S. Dalalyan, A. Karagulyan, User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, in Stochastic Processes and their Applications (2019)

  27. X. Dong, The gravity dual of Rényi entropy. Nat. Commun. 7, 12472 (2016)

  28. A. Durmus, E. Moulines, E. Saksman, On the convergence of Hamiltonian Monte Carlo. Preprint. arXiv:1705.00166 (2017)

  29. A. Durmus, S. Majewski, B. Miasojedow, Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res. 20(1), 2666–2711 (2019)

  30. R. Dwivedi, Y. Chen, M.J. Wainwright, B. Yu, Log-concave sampling: Metropolis-Hastings algorithms are fast!, in Conference on Learning Theory, COLT 2018, Stockholm, Sweden, 6–9 July (2018), pp. 793–797

  31. C. Dwork, G.N. Rothblum, Concentrated differential privacy. Preprint. arXiv:1603.01887 (2016)

  32. A. Eberle, A. Guillin, R. Zimmer, Couplings and quantitative contraction rates for Langevin dynamics. Ann. Probab. 47(4), 1982–2010 (2019)

  33. M.A. Erdogdu, R. Hosseinzadeh, On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness, in Proceedings of Thirty Fourth Conference on Learning Theory, ed. by M. Belkin, S. Kpotufe, volume 134 of Proceedings of Machine Learning Research. PMLR, 15–19 Aug (2021), pp. 1776–1822

  34. M.A. Erdogdu, R. Hosseinzadeh, M.S. Zhang, Convergence of Langevin Monte Carlo in chi-squared and Rényi divergence, in International Conference on Artificial Intelligence and Statistics. PMLR (2022), pp. 8151–8175

  35. A. Ganesh, K. Talwar, Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC, in Advances in Neural Information Processing Systems, ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin, vol. 33 (Curran Associates, 2020), pp. 7222–7233

  36. A. Garbuno-Inigo, N. Nüsken, S. Reich, Affine invariant interacting Langevin dynamics for Bayesian inference. SIAM J. Appl. Dynam. Syst. 19(3), 1633–1658 (2020)

  37. K. Gatmiry, S.S. Vempala, Convergence of the Riemannian Langevin algorithm. Preprint. arXiv:2204.10818 (2022)

  38. J. Gorham, L. Mackey, Measuring sample quality with kernels, in Proceedings of the 34th International Conference on Machine Learning, ed. by D. Precup, Y.W. Teh, volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR (2017), pp. 1292–1301

  39. L. Gross, Logarithmic Sobolev inequalities. Am. J. Math. 97(4), 1061–1083 (1975)

  40. P. Harremoës, Interpretations of Rényi entropies and divergences. Physica A Stat. Mech. Appl. 365(1), 57–62 (2006)

  41. Y. He, A.B. Hamza, H. Krim, A generalized divergence measure for robust image registration. IEEE Trans. Signal Process. 51(5), 1211–1220 (2003)

  42. Y. He, K. Balasubramanian, M.A. Erdogdu, Heavy-tailed sampling via transformed unadjusted Langevin algorithm. Preprint. arXiv:2201.08349 (2022)

  43. R. Holley, D. Stroock, Logarithmic Sobolev inequalities and stochastic Ising models. J. Stat. Phys. 46(5), 1159–1194 (1987)

  44. R. Holley, D. Stroock, Simulated annealing via Sobolev inequalities. Commun. Math. Phys. 115(4), 553–569 (1988)

  45. M. Iwamoto, J. Shikata, Information theoretic security for encryption based on conditional Rényi entropies, in International Conference on Information Theoretic Security (Springer, 2013), pp. 103–121

  46. Q. Jiang, Mirror Langevin Monte Carlo: the case under isoperimetry, in Advances in Neural Information Processing Systems, ed. by M. Ranzato, A. Beygelzimer, K. Nguyen, P.S. Liang, J.W. Vaughan, Y. Dauphin, vol. 34 (Curran Associates, 2021)

  47. R. Jordan, D. Kinderlehrer, F. Otto, The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)

  48. R. Kannan, L. Lovász, M. Simonovits, Random walks and an \(O^*(n^5)\) volume algorithm for convex bodies. Random Struct. Algorithms 11, 1–50 (1997)

  49. M. Ledoux, Concentration of measure and logarithmic Sobolev inequalities. Sémin. Probab. Strasbourg 33, 120–216 (1999)

  50. Y.T. Lee, S.S. Vempala, Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation, in Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (ACM, 2018), pp. 1115–1121

  51. M.B. Li, M.A. Erdogdu, Riemannian Langevin algorithm for solving semidefinite programs. Preprint. arXiv:2010.11176 (2020)

  52. Y. Li, R.E. Turner, Rényi divergence variational inference, in Advances in Neural Information Processing Systems, vol. 29, ed. by D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, R. Garnett (Curran Associates, 2016), pp. 1073–1081

  53. X. Li, Y. Wu, L. Mackey, M.A. Erdogdu, Stochastic Runge–Kutta accelerates Langevin Monte Carlo and beyond, in Advances in Neural Information Processing Systems, vol. 32 (2019)

  54. L. Lovász, S. Vempala, Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization, in FOCS (2006), pp. 57–68

  55. L. Lovász, S.S. Vempala, Hit-and-run from a corner. SIAM J. Comput. 35(4), 985–1005 (2006)

  56. L. Lovász, S. Vempala, The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms 30(3), 307–358 (2007)

  57. Y.A. Ma, N.S. Chatterji, X. Cheng, N. Flammarion, P.L. Bartlett, M.I. Jordan, Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli 27(3), 1942–1992 (2021)

  58. M.C. Mackey, Time's Arrow: The Origins of Thermodynamic Behavior (Springer, 1992)

  59. O. Mangoubi, A. Smith, Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. Preprint. arXiv:1708.07114 (2017)

  60. O. Mangoubi, N. Vishnoi, Dimensionally tight bounds for second-order Hamiltonian Monte Carlo, in Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, 2018), pp. 6027–6037

  61. O. Mangoubi, N.K. Vishnoi, Nonconvex sampling with the Metropolis-adjusted Langevin algorithm, in Conference on Learning Theory. PMLR (2019), pp. 2259–2293

  62. Y. Mansour, M. Mohri, A. Rostamizadeh, Multiple source adaptation and the Rényi divergence, in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (AUAI Press, 2009), pp. 367–374

  63. G. Menz, A. Schlichting, Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab. 42(5), 1809–1884 (2014)

  64. I. Mironov, Rényi differential privacy, in 2017 IEEE 30th Computer Security Foundations Symposium (CSF) (IEEE, 2017), pp. 263–275

  65. D. Morales, L. Pardo, I. Vajda, Rényi statistics in directed families of exponential experiments. Stat. J. Theor. Appl. Stat. 34(2), 151–174 (2000)

  66. F. Otto, C. Villani, Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal. 173(2), 361–400 (2000)

  67. M. Raginsky, A. Rakhlin, M. Telgarsky, Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis, in Proceedings of the 2017 Conference on Learning Theory, ed. by S. Kale, O. Shamir, volume 65 of Proceedings of Machine Learning Research, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR (2017), pp. 1674–1703

  68. A. Rényi et al., On measures of entropy and information, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (The Regents of the University of California, 1961)

  69. O.S. Rothaus, Diffusion on compact Riemannian manifolds and logarithmic Sobolev inequalities. J. Funct. Anal. 42(1), 102–109 (1981)

  70. M. Talagrand, Transportation cost for Gaussian and other product measures. Geom. Funct. Anal. 6, 587–600 (1996)

  71. T. Van Erven, P. Harremos, Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797–3820 (2014)

  72. S. Vempala, A. Wibisono, Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices, in Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, 2019)

  73. C. Villani, A short proof of the concavity of entropy power. IEEE Trans. Inf. Theory 46(4), 1695–1696 (2000)

  74. C. Villani, Topics in Optimal Transportation. Number 58 in Graduate Studies in Mathematics (American Mathematical Society, 2003)

  75. F.Y. Wang, J. Wang, Functional inequalities for convolution of probability measures, in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 52 (Institut Henri Poincaré, 2016), pp. 898–914

  76. X. Wang, Q. Lei, I. Panageas, Fast convergence of Langevin dynamics on manifold: Geodesics meet log-Sobolev, in Advances in Neural Information Processing Systems, ed. by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin, vol. 33 (Curran Associates, 2020), pp. 18894–18904

  77. A. Wibisono, Sampling as optimization in the space of measures: the Langevin dynamics as a composite optimization problem, in Conference on Learning Theory, COLT 2018, Stockholm, Sweden, 6–9 July 2018 (2018), pp. 2093–3027

  78. A. Wibisono, Proximal Langevin algorithm: rapid convergence under isoperimetry. Preprint. arXiv:1911.01469 (2019)


Acknowledgements

The authors thank Kunal Talwar for explaining the privacy motivation and application of Rényi divergence to data privacy; Yu Cao, Jianfeng Lu, and Yulong Lu for alerting us to their work [14] on Rényi divergence; Xiang Cheng and Peter Bartlett for helpful comments on an earlier version of this paper; and Sinho Chewi for communicating Theorem 8 to us.

Author information

Corresponding author

Correspondence to Santosh S. Vempala.

Appendix

1.1 Review on Notation and Basic Properties

Throughout, we represent a probability distribution \(\rho \) on \(\mathbb {R}^n\) via its probability density function with respect to the Lebesgue measure, so \(\rho \colon \mathbb {R}^n \to \mathbb {R}\) with \(\int _{\mathbb {R}^n} \rho (x) dx = 1\). We typically assume \(\rho \) has full support and smooth density, so \(\rho (x) > 0\) and \(x \mapsto \rho (x)\) is differentiable. Given a function \(f \colon \mathbb {R}^n \to \mathbb {R}\), we denote the expected value of f under \(\rho \) by

$$\displaystyle \begin{aligned} \mathbb{E}_\rho[f] = \int_{\mathbb{R}^n} f(x) \rho(x) \, dx. \end{aligned} $$

We use the Euclidean inner product \(\langle x,y \rangle = \sum _{i=1}^n x_i y_i\) for \(x = (x_i)_{1 \le i \le n}, y = (y_i)_{1 \le i \le n} \in \mathbb {R}^n\). For symmetric matrices \(A, B \in \mathbb {R}^{n \times n}\), let \(A \preceq B\) denote that \(B-A\) is positive semidefinite. For \(\mu \in \mathbb {R}^n\), \(\Sigma \succ 0\), let \(\mathcal {N}(\mu ,\Sigma )\) denote the Gaussian distribution on \(\mathbb {R}^n\) with mean \(\mu \) and covariance matrix \(\Sigma \).

Given a smooth function \(f \colon \mathbb {R}^n \to \mathbb {R}\), its gradient \(\nabla f \colon \mathbb {R}^n \to \mathbb {R}^n\) is the vector of partial derivatives:

$$\displaystyle \begin{aligned} \nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \dots, \frac{\partial f(x)}{\partial x_n} \right). \end{aligned} $$

The Hessian \(\nabla ^2 f \colon \mathbb {R}^n \to \mathbb {R}^{n \times n}\) is the matrix of second partial derivatives:

$$\displaystyle \begin{aligned} \nabla^2 f(x) = \left(\frac{\partial^2 f(x)}{\partial x_i \, \partial x_j} \right)_{1 \le i,j \le n}. \end{aligned} $$

The Laplacian \(\Delta f \colon \mathbb {R}^n \to \mathbb {R}\) is the trace of its Hessian:

$$\displaystyle \begin{aligned} \Delta f(x) = \mathrm{Tr}(\nabla^2 f(x)) = \sum_{i=1}^n \frac{\partial ^2 f(x)}{\partial x_i^2}. \end{aligned} $$

Given a smooth vector field \(v = (v_1,\dots ,v_n) \colon \mathbb {R}^n \to \mathbb {R}^n\), its divergence \(\nabla \cdot v \colon \mathbb {R}^n \to \mathbb {R}\) is

$$\displaystyle \begin{aligned} (\nabla \cdot v)(x) = \sum_{i=1}^n \frac{\partial v_i(x)}{\partial x_i}. \end{aligned} $$

In particular, the divergence of the gradient is the Laplacian:

$$\displaystyle \begin{aligned} (\nabla \cdot \nabla f)(x) = \sum_{i=1}^n \frac{\partial ^2 f(x)}{\partial x_i^2} = \Delta f(x). \end{aligned} $$

For any function \(f \colon \mathbb {R}^n \to \mathbb {R}\) and vector field \(v \colon \mathbb {R}^n \to \mathbb {R}^n\) with sufficiently fast decay at infinity, we have the following integration by parts formula:

$$\displaystyle \begin{aligned} \int_{\mathbb{R}^n} \langle v(x), \nabla f(x) \rangle dx = -\int_{\mathbb{R}^n} f(x) (\nabla \cdot v)(x) dx. \end{aligned} $$

Furthermore, applying this formula with \(v = \nabla g\), and then with the roles of f and g exchanged, gives for any two functions \(f, g \colon \mathbb {R}^n \to \mathbb {R}\) with sufficiently fast decay,

$$\displaystyle \begin{aligned} \int_{\mathbb{R}^n} f(x) \Delta g(x) dx = -\int_{\mathbb{R}^n} \langle \nabla f(x), \nabla g(x) \rangle dx = \int_{\mathbb{R}^n} g(x) \Delta f(x) dx. \end{aligned} $$

When the argument is clear, we omit the argument \((x)\) in the formulae for brevity. For example, the last integral above becomes

$$\displaystyle \begin{aligned} {} \int f \, \Delta g \, dx = -\int \langle \nabla f, \nabla g \rangle \, dx = \int g \, \Delta f \, dx. \end{aligned} $$
(B.1)
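
As an illustrative aside (not from the chapter), the identity (B.1) can be sanity-checked symbolically in one dimension; the choice of test functions and the use of sympy below are ours.

```python
# Symbolic sanity check of (B.1) in dimension one: for rapidly decaying f, g,
# the integrals  ∫ f Δg dx,  -∫ <∇f, ∇g> dx,  ∫ g Δf dx  all agree.
import sympy as sp

x = sp.symbols('x', real=True)
f = sp.exp(-x**2)        # illustrative test function f
g = sp.exp(-2 * x**2)    # illustrative test function g

lhs = sp.integrate(f * sp.diff(g, x, 2), (x, -sp.oo, sp.oo))            # ∫ f g'' dx
mid = -sp.integrate(sp.diff(f, x) * sp.diff(g, x), (x, -sp.oo, sp.oo))  # -∫ f' g' dx
rhs = sp.integrate(g * sp.diff(f, x, 2), (x, -sp.oo, sp.oo))            # ∫ g f'' dx

print(sp.simplify(lhs - mid), sp.simplify(mid - rhs))  # both print 0, consistent with (B.1)
```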

Derivation of the Fokker-Planck Equation

Consider a stochastic differential equation

$$\displaystyle \begin{aligned} {} dX_t = v(X_t) \, dt + \sqrt{2} \, dW_t \end{aligned} $$
(B.2)

where \(v \colon \mathbb {R}^n \to \mathbb {R}^n\) is a smooth vector field and \((W_t)_{t \ge 0}\) is the Brownian motion on \(\mathbb {R}^n\) with \(W_0 = 0\).

We will show that if \(X_t\) evolves following (B.2), then its probability density function \(\rho _t(x)\) evolves following the Fokker-Planck equation:

$$\displaystyle \begin{aligned} {} \frac{\partial \rho_t}{\partial t} = -\nabla \cdot (\rho_t v) + \Delta \rho_t. \end{aligned} $$
(B.3)

We can derive this heuristically as follows; we refer to standard textbooks for a rigorous derivation [58].

For any smooth test function \(\phi \colon \mathbb {R}^n \to \mathbb {R}\), let us compute the time derivative of the expectation

$$\displaystyle \begin{aligned} A(t) = \mathbb{E}_{\rho_t}[\phi] = \mathbb{E}[\phi(X_t)]. \end{aligned} $$

On the one hand, we can compute this as

$$\displaystyle \begin{aligned} {} \dot A(t) = \frac{d}{dt} A(t) = \frac{d}{dt} \int_{\mathbb{R}^n} \rho_t(x) \phi(x) \, dx = \int_{\mathbb{R}^n} \frac{\partial \rho_t(x)}{\partial t} \phi(x) \, dx. \end{aligned} $$
(B.4)

On the other hand, by (B.2), for small \(\epsilon > 0\) we have

$$\displaystyle \begin{aligned} X_{t+\epsilon } &= X_t + \int_t^{t+\epsilon } v(X_s) ds + \sqrt{2} (W_{t+\epsilon }-W_t) \\ &= X_t + \epsilon v(X_t) + \sqrt{2} (W_{t+\epsilon }-W_t) + O(\epsilon ^2) \\ &\stackrel{d}{=} X_t + \epsilon v(X_t) + \sqrt{2\epsilon } Z + O(\epsilon ^2) \end{aligned} $$

where \(Z \sim \mathcal {N}(0,I)\) is independent of \(X_t\), since \(W_{t+\epsilon }-W_t \sim \mathcal {N}(0,\epsilon I)\). Then by Taylor expansion,

$$\displaystyle \begin{aligned} \phi(X_{t+\epsilon }) &\stackrel{d}{=} \phi\left(X_t + \epsilon v(X_t) + \sqrt{2\epsilon } Z + O(\epsilon ^2)\right) \\ &= \phi(X_t) + \epsilon \langle \nabla \phi(X_t), v(X_t) \rangle + \sqrt{2\epsilon } \langle \nabla \phi(X_t), Z \rangle \\ &\quad + \frac{1}{2} 2\epsilon \langle Z, \nabla^2 \phi(X_t) Z \rangle + O(\epsilon ^{\frac{3}{2}}). \end{aligned} $$

Now we take expectation on both sides. Since \(Z \sim \mathcal {N}(0,I)\) is independent of \(X_t\),

$$\displaystyle \begin{aligned} A(t+\epsilon ) &= \mathbb{E}[\phi(X_{t+\epsilon })]\\ &= \mathbb{E}\Big[\phi(X_t) + \epsilon \langle \nabla \phi(X_t), v(X_t) \rangle + \sqrt{2\epsilon } \langle \nabla \phi(X_t), Z \rangle \\ &\qquad + \epsilon \langle Z, \nabla^2 \phi(X_t) Z \rangle \Big] + O(\epsilon ^{\frac{3}{2}}) \\ &= A(t)+ \epsilon \left(\mathbb{E}[\langle \nabla \phi(X_t), v(X_t) \rangle] + \mathbb{E}[\Delta \phi(X_t)]\right) + O(\epsilon ^{\frac{3}{2}}). \end{aligned} $$

Therefore, by integration by parts, this second approach gives

$$\displaystyle \begin{aligned} \dot A(t) &= \lim_{ \epsilon \to 0} \frac{A(t+\epsilon )-A(t)}{\epsilon } \notag \\ &= \mathbb{E}[\langle \nabla \phi(X_t), v(X_t) \rangle] + \mathbb{E}[\Delta \phi(X_t)] \notag \\ &= \int_{\mathbb{R}^n} \langle \nabla \phi(x), \rho_t(x) v(x) \rangle dx + \int_{\mathbb{R}^n} \rho_t(x) \Delta \phi(x) \, dx \notag \\ &= -\int_{\mathbb{R}^n} \phi(x) \nabla \cdot (\rho_t v)(x) \, dx + \int_{\mathbb{R}^n} \phi(x) \Delta \rho_t(x) \, dx \notag \\ &= \int_{\mathbb{R}^n} \phi(x) \left(-\nabla \cdot (\rho_t v)(x) + \Delta \rho_t(x)\right) \, dx. {} \end{aligned} $$
(B.5)

Comparing (B.4) and (B.5), and since \(\phi \) is arbitrary, we conclude that

$$\displaystyle \begin{aligned} \frac{\partial \rho_t(x)}{\partial t} = -\nabla \cdot (\rho_t v)(x) + \Delta \rho_t(x) \end{aligned} $$

as claimed in (B.3).

When \(v = -\nabla f\), the stochastic differential equation (B.2) becomes the Langevin dynamics (7) from Sect. 2.3, and the Fokker-Planck equation (B.3) becomes (8).
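
As a quick consistency check (a standard computation, added here for completeness), the target \(\nu = e^{-f}\) is a stationary solution of the Fokker-Planck equation (B.3) with \(v = -\nabla f\): since \(\nabla \nu = -\nu \nabla f\),

$$\displaystyle \begin{aligned} -\nabla \cdot \big(\nu \, (-\nabla f)\big) + \Delta \nu = \nabla \cdot (\nu \nabla f) + \nabla \cdot (\nabla \nu) = \nabla \cdot (\nu \nabla f - \nu \nabla f) = 0, \end{aligned} $$

so \(\frac {\partial \rho _t}{\partial t} = 0\) when \(\rho _t = \nu \).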

In the proof of Lemma 3, we also apply the Fokker-Planck equation (B.3) when \(v = -\nabla f(x_0)\) is a constant vector field to derive the evolution equation (30) for one step of ULA.
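
For concreteness, recall that ULA discretizes the Langevin dynamics with step size \(h\) via \(x_{k+1} = x_k - h \nabla f(x_k) + \sqrt {2h} \, z_k\), \(z_k \sim \mathcal {N}(0,I)\). The following minimal sketch (ours, not from the chapter; the quadratic potential, step size, and horizon are illustrative choices) runs ULA for the standard Gaussian target \(f(x) = \|x\|{}^2/2\). For this \(f\) the update is a linear Gaussian recursion with stationary variance \(2h/(1-(1-h)^2) = 1/(1-h/2)\) per coordinate, slightly above the target variance 1, which illustrates the bias of the limiting distribution of ULA.

```python
# Minimal ULA sketch (illustrative, not from the chapter): sample from the
# standard Gaussian target f(x) = ||x||^2 / 2, so grad f(x) = x. For this
# quadratic f the iterates form a linear Gaussian recursion whose stationary
# variance per coordinate is 1 / (1 - h/2), slightly above the target value 1.
import numpy as np

rng = np.random.default_rng(0)
n, h, steps, chains = 2, 0.05, 2000, 5000   # dimension, step size, iterations, parallel chains

def grad_f(x):
    return x  # gradient of f(x) = ||x||^2 / 2

x = rng.standard_normal((chains, n))        # arbitrary initialization
for _ in range(steps):
    x = x - h * grad_f(x) + np.sqrt(2 * h) * rng.standard_normal((chains, n))

print("empirical variance per coordinate:", x.var(axis=0))
print("predicted ULA limit variance     :", 1 / (1 - h / 2))
```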

1.2 Remaining Proofs

1.2.1 Proof of Lemma 16

Proof of Lemma 16

Let \(g \colon \mathbb {R}^n \to \mathbb {R}\) be a smooth function, and let \(\tilde g \colon \mathbb {R}^n \to \mathbb {R}\) be the function \(\tilde g(x) = g(T(x))\). Let \(X \sim \nu \), so \(T(X) \sim \tilde \nu \). Note that

$$\displaystyle \begin{aligned} \mathbb{E}_{\tilde \nu}[g^2] &= \mathbb{E}_{X \sim \nu}[g(T(X))^2] = \mathbb{E}_{\nu}[\tilde g^2], \\ \mathbb{E}_{\tilde \nu}[g^2 \log g^2] &= \mathbb{E}_{X \sim \nu}[g(T(X))^2 \log g(T(X))^2] = \mathbb{E}_{\nu}[\tilde g^2 \log \tilde g^2]. \end{aligned} $$

Furthermore, we have \(\nabla \tilde g(x) = \nabla T(x) \, \nabla g(T(x))\). Since T is L-Lipschitz, \(\|\nabla T(x)\| \le L\). Then

$$\displaystyle \begin{aligned} \|\nabla \tilde g(x)\| \le \|\nabla T(x)\| \, \|\nabla g(T(x))\| \le L \|\nabla g(T(x))\|. \end{aligned} $$

This implies

$$\displaystyle \begin{aligned} \mathbb{E}_{\tilde \nu}[\|\nabla g\|{}^2] = \mathbb{E}_{X \sim \nu}[\|\nabla g(T(X))\|{}^2] \ge \frac{\mathbb{E}_{\nu}[\|\nabla \tilde g\|{}^2]}{L^2}. \end{aligned} $$

Therefore,

$$\displaystyle \begin{aligned} \frac{\mathbb{E}_{\tilde \nu}[\|\nabla g\|{}^2]}{\mathbb{E}_{\tilde \nu}[g^2 \log g^2] \,{-}\, \mathbb{E}_{\tilde \nu}[g^2] \log \mathbb{E}_{\tilde \nu}[g^2]} &\,{\ge}\, \frac{1}{L^2} \, \frac{\mathbb{E}_{\nu}[\|\nabla \tilde g\|{}^2]}{\big(\mathbb{E}_{\nu}[\tilde g^2 \log \tilde g^2] \,{-}\, \mathbb{E}_{\nu}[\tilde g^2] \log \mathbb{E}_{\nu}[\tilde g^2]\big)}\\ &\,{\ge}\, \frac{\alpha}{2L^2} \end{aligned} $$

where the last inequality follows from the assumption that \(\nu \) satisfies LSI with constant \(\alpha \). This shows that \(\tilde \nu \) satisfies LSI with constant \(\alpha /L^2\), as desired. □
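
As an illustrative consistency check (ours, using the standard fact that \(\mathcal {N}(0,\sigma ^2 I)\) satisfies LSI with constant \(1/\sigma ^2\)): take \(\nu = \mathcal {N}(0,\frac {1}{\alpha } I)\), which satisfies LSI with constant \(\alpha \), and the linear map \(T(x) = Lx\), which is L-Lipschitz. Then the distribution of \(T(X)\) for \(X \sim \nu \) is

$$\displaystyle \begin{aligned} \tilde \nu = \mathcal{N}\left(0, \frac{L^2}{\alpha} I\right), \end{aligned} $$

which satisfies LSI with constant exactly \(\alpha /L^2\); thus the rate in Lemma 16 is attained by linear maps and cannot be improved in general.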

Proof of Lemma 17

Proof of Lemma 17

We recall the following convolution property of LSI [15]: If \(\nu , \tilde \nu \) satisfy LSI with constants \(\alpha , \tilde \alpha > 0\), respectively, then \(\nu \ast \tilde \nu \) satisfies LSI with constant \(\left (\frac {1}{\alpha }+\frac {1}{\tilde \alpha }\right )^{-1}\). Since \(\mathcal {N}(0,2tI)\) satisfies LSI with constant \(\frac {1}{2t}\), the claim follows: \(\nu \ast \mathcal {N}(0,2tI)\) satisfies LSI with constant \(\left (\frac {1}{\alpha }+2t\right )^{-1} = \frac {\alpha }{1+2t\alpha }\). □

Proof of Lemma 19

Proof of Lemma 19

Let \(g \colon \mathbb {R}^n \to \mathbb {R}\) be a smooth function, and let \(\tilde g \colon \mathbb {R}^n \to \mathbb {R}\) be the function \(\tilde g(x) = g(T(x))\). Let \(X \sim \nu \), so \(T(X) \sim \tilde \nu \). Note that

$$\displaystyle \begin{aligned} \mathrm{Var}_{\tilde \nu}(g) &= \mathrm{Var}_{X \sim \nu}(g(T(X))) = \mathrm{Var}_{\nu}(\tilde g). \end{aligned} $$

Furthermore, we have \(\nabla \tilde g(x) = \nabla T(x) \, \nabla g(T(x))\). Since T is L-Lipschitz, \(\|\nabla T(x)\| \le L\). Then

$$\displaystyle \begin{aligned} \|\nabla \tilde g(x)\| \le \|\nabla T(x)\| \, \|\nabla g(T(x))\| \le L \|\nabla g(T(x))\|. \end{aligned} $$

This implies

$$\displaystyle \begin{aligned} \mathbb{E}_{\tilde \nu}[\|\nabla g\|{}^2] = \mathbb{E}_{X \sim \nu}[\|\nabla g(T(X))\|{}^2] \ge \frac{\mathbb{E}_{\nu}[\|\nabla \tilde g\|{}^2]}{L^2}. \end{aligned} $$

Therefore,

$$\displaystyle \begin{aligned} \frac{\mathbb{E}_{\tilde \nu}[\|\nabla g\|{}^2]}{\mathrm{Var}_{\tilde \nu}(g)} &\ge \frac{1}{L^2} \, \frac{\mathbb{E}_{\nu}[\|\nabla \tilde g\|{}^2]}{\mathrm{Var}_{\nu}(\tilde g)} \ge \frac{\alpha}{L^2} \end{aligned} $$

where the last inequality follows from the assumption that \(\nu \) satisfies the Poincaré inequality with constant \(\alpha \). This shows that \(\tilde \nu \) satisfies the Poincaré inequality with constant \(\alpha /L^2\), as desired. □

Proof of Lemma 20

Proof of Lemma 20

We recall the following convolution property of the Poincaré inequality [23]: If \(\nu , \tilde \nu \) satisfy the Poincaré inequality with constants \(\alpha , \tilde \alpha > 0\), respectively, then \(\nu \ast \tilde \nu \) satisfies the Poincaré inequality with constant \(\left (\frac {1}{\alpha }+\frac {1}{\tilde \alpha }\right )^{-1}\). Since \(\mathcal {N}(0,2tI)\) satisfies the Poincaré inequality with constant \(\frac {1}{2t}\), the claim follows: \(\nu \ast \mathcal {N}(0,2tI)\) satisfies the Poincaré inequality with constant \(\left (\frac {1}{\alpha }+2t\right )^{-1} = \frac {\alpha }{1+2t\alpha }\). □


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Vempala, S.S., Wibisono, A. (2023). Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. In: Eldan, R., Klartag, B., Litvak, A., Milman, E. (eds) Geometric Aspects of Functional Analysis. Lecture Notes in Mathematics, vol 2327. Springer, Cham. https://doi.org/10.1007/978-3-031-26300-2_15
