Abstract
Deep neural network optimization is challenging: large gradients in the chaotic loss landscape lead to unstable behavior during gradient descent. In this paper, we investigate a stable gradient descent algorithm. We revisit the mathematical derivation of the Momentum optimizer and discuss a potential problem on steep walls of the loss landscape. Inspired by the physical motion of a mass, we propose Smooth Momentum, a new optimizer that improves behavior on steep walls. We mathematically analyze the characteristics of the proposed optimizer and prove that Smooth Momentum exhibits improved Lipschitz properties and convergence, which allows stable and faster gradient descent. We also demonstrate how Smooth Gradient, a component of the proposed optimizer, can be plugged into other optimizers, such as Adam. The proposed method offers a regularization effect comparable to batch normalization or weight decay. Experiments demonstrate that our proposed optimizer significantly improves the optimization of transformers, convolutional neural networks, and non-convex functions across various tasks and datasets.
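The precise update rule of Smooth Momentum is given in the paper's Section 2 and is not reproduced in this excerpt. As a purely illustrative sketch, the following assumes the Smooth Gradient rescales a raw gradient g by \(1/\sqrt{\|g\|^2 + \delta}\) — a form consistent with the \(\delta = m^2 g^2\) and \(\|\nabla f(\mathbf{x})\|^2\) terms that appear in the appendices, but an assumption nonetheless — which bounds the step on steep walls while preserving its direction:

```python
import numpy as np

def smooth_gradient(g, delta):
    """Hypothetical Smooth Gradient: rescale g by 1 / sqrt(||g||^2 + delta).

    On steep walls (large ||g||) the result approaches a unit vector, so the
    step stays bounded; for small gradients the scaling is roughly constant.
    """
    g = np.asarray(g, dtype=float)
    return g / np.sqrt(np.dot(g, g) + delta)

def smooth_momentum_step(x, grad_fn, velocity, lr=0.1, beta=0.9, delta=1.0):
    """One heavy-ball step using the smoothed gradient in place of the raw one."""
    v = beta * velocity + smooth_gradient(grad_fn(x), delta)
    return x - lr * v, v
```

With delta = 0 the transform reduces to plain gradient normalization; a larger delta leaves the raw gradient closer to untouched, consistent with the tuning tips in Appendix D.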
Data Availability
The datasets analyzed during the current study are available in the following repositories:
∙ CIFAR-10: http://www.cs.toronto.edu/~kriz/cifar.html
∙ Oxford-IIIT PET: https://www.robots.ox.ac.uk/~vgg/data/pets/
∙ IWSLT14: https://workshop2014.iwslt.org/
Notes
Strictly speaking, the weight x is a vector, not a scalar. In this paper, however, for a clear explanation of our motivation, we use terms such as 1D motion, treating x as a one-dimensional variable in situations where the distinction does not matter.
A function f is said to be Lipschitz continuous if \(\left \| f(x_{1}) - f(x_{2}) \right \| \leq L \left \| x_{1} - x_{2} \right \|\) for all \(x_{1}, x_{2}\). Here, L is called the Lipschitz constant.
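The footnote's definition can be probed numerically. The sketch below is an illustrative helper (not from the paper) that estimates a lower bound on the Lipschitz constant L by sampling point pairs and taking the largest observed ratio:

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=1000, scale=1.0, seed=0):
    """Estimate a lower bound on the Lipschitz constant of f.

    Samples random point pairs and returns the largest observed ratio
    ||f(x1) - f(x2)|| / ||x1 - x2||; the true L can only be larger.
    """
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.normal(0.0, scale, size=(2, dim))
        den = np.linalg.norm(x1 - x2)
        if den > 0.0:
            best = max(best, np.linalg.norm(f(x1) - f(x2)) / den)
    return best
```

For f(x) = 3x the ratio equals 3 for every pair, while for sin the estimate never exceeds its true constant L = 1.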
References
Pal SK, Pramanik A, Maiti J, Mitra P (2021) Deep learning in multi-object detection and tracking: state of the art. Appl Intell 51(9):6400–6429
Mao Q, Sun H, Zuo L, Jia R (2020) Finding every car: a traffic surveillance multi-scale vehicle object detection method. Appl Intell 50(10):3125–3136
Lu L, Wu D, Wu T, Huang F, Yi Y (2020) Anchor-free multi-orientation text detection in natural scene images. Appl Intell 50(11):3623–3637
Gouk H, Frank E, Pfahringer B, Cree MJ (2021) Regularisation of neural networks by enforcing Lipschitz continuity. Mach Learn 110(2):393–416
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Vol 37 of JMLR Workshop and Conference Proceedings, pp 448–456 (JMLR.org)
Santurkar S, Tsipras D, Ilyas A, Madry A (2018) How does batch normalization help optimization? pp 2488–2498
Qiao S, Wang H, Liu C, Shen W, Yuille AL (2019) Weight standardization. CoRR arXiv:1903.10520
Nesterov YE (2004) Introductory lectures on convex optimization - a basic course. Vol 87 of Applied Optimization (Springer)
Li H, Xu Z, Taylor G, Studer C, Goldstein T (2018) Visualizing the loss landscape of neural nets, pp 6391–6401
Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. Vol 28 of JMLR Workshop and Conference Proceedings, pp 1310–1318 (JMLR.org)
Tieleman T, Hinton G (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn 4(2):26–31
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: ICLR
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning, pp 4148–4158
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition, pp 770–778 (IEEE Computer Society)
Saon G et al (2017) English conversational telephone speech recognition by humans and machines, pp 132–136
Qian N (1999) On the momentum term in gradient descent learning algorithms. Neural Netw 12(1):145–151
Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: stochastic gradient descent with momentum and difference. Appl Intell 52(4):3939–3953
Tao W, Pan Z, Wu G, Tao Q (2020) The strength of Nesterov's extrapolation in the individual convergence of nonsmooth optimization. IEEE Trans Neural Netw Learn Syst 31(7):2557–2568
Gui Y, Li D, Fang R (2022) A fast adaptive algorithm for training deep neural networks. Appl Intell
Paszke A et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Wallach H et al (eds) Advances in Neural Information Processing Systems, vol 32, pp 8024–8035 (Curran Associates, Inc.). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Abadi M et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
Yedida R, Saha S, Prashanth T (2021) LipschitzLR: using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51(3):1460–1478
Zhang J, He T, Sra S, Jadbabaie A (2020) Why gradient clipping accelerates training: a theoretical justification for adaptivity. (OpenReview.net)
Curtis FE, Scheinberg K, Shi R (2019) A stochastic trust region algorithm based on careful step normalization. INFORMS J Optim 1(3):200–220
Bello I, Zoph B, Vasudevan V, Le QV (2017) Neural optimizer search with reinforcement learning. Vol 70 of Proceedings of Machine Learning Research, pp 459–468 (PMLR)
Jiaocheng M, Xian S, Xin Z, Yuan Peng Z (2022) Bayes-DCGRU with Bayesian optimization for rolling bearing fault diagnosis. Appl Intell
Ackley D (2012) A connectionist machine for genetic hillclimbing. Vol 28 (Springer Science & Business Media)
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report
Zagoruyko S, Komodakis N (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. (OpenReview.net)
Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. pp 3498–3505 (IEEE Computer Society)
Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J Contr Optimization 30(4):838–855
Liu L et al (2020) On the variance of the adaptive learning rate and beyond. (OpenReview.net)
Riedmiller MA, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. pp 586–591 (IEEE)
Wang Y, Li K, Lei Y (2022) A general multi-scale image classification based on shared conversion matrix routing. Appl Intell 52(3):3249–3265
Peters ME et al (2018) Deep contextualized word representations. pp 2227–2237 (Association for Computational Linguistics)
Vaswani A et al (2017) Attention is all you need. pp 5998–6008
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. pp 4171–4186 (Association for Computational Linguistics)
Dosovitskiy A et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. (OpenReview.net)
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. pp 2818–2826 (IEEE Computer Society)
Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: a method for automatic evaluation of machine translation. pp 311–318 (ACL)
Acknowledgements
This work was supported by Samsung Electronics Co., Ltd (IO201210-08019-01).
Ethics declarations
Conflict of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Theorem 2
We prove Theorem 2 as follows.
Proof
First, we investigate the Hessian matrix:
where
Now, we investigate (LHS) of (14).
The first term in (A13) is
The second term in (A13) is
Thus,
Because \(\delta = m^{2} g^{2} \geq 0\) as mentioned in Section 2, \(\left \| \nabla f(\mathbf {x}) \right \|^{2} \geq 0\), and \(0 \leq D \leq 1\), we obtain
Therefore,
□
Equation (A24) also indicates that the larger the gradient, the smaller the contribution of the second-order terms when using the Smooth Gradient.
Appendix B: Proof of Theorem 3
In this section, we prove Theorem 3. The following proof is inspired by [24]. We use the same assumptions and settings for the optimization, except for the gradient configuration \(h_{k}\). Before starting the main proof, we first establish the following lemma:
Lemma 1
If \(\delta \leq \frac {1-5 L_{0} \eta }{4 {L_{1}^{2}} \eta ^{2}}\), we have,
Proof
Rewriting (B27) yields,
Note that (LHS) of (B28) is,
Thus our claim is, \(- 4 {L_{1}^{2}} \eta ^{2} \delta ^{2} + \delta - 5 L_{0} \eta \delta \geq 0\). This holds if \(\delta \leq \frac {1-5 L_{0} \eta }{4 {L_{1}^{2}} \eta ^{2}}\). □
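For completeness, the final step is elementary: since \(\delta > 0\), dividing the claimed inequality by \(\delta\) preserves its direction,

```latex
-4 L_{1}^{2} \eta^{2} \delta^{2} + \delta - 5 L_{0} \eta \delta \geq 0
\;\Longleftrightarrow\;
-4 L_{1}^{2} \eta^{2} \delta + 1 - 5 L_{0} \eta \geq 0
\;\Longleftrightarrow\;
\delta \leq \frac{1 - 5 L_{0} \eta}{4 L_{1}^{2} \eta^{2}} .
```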
Now, we start the proof of Theorem 3.
Proof
Our goal is to investigate the upper bound of the iteration complexity of the following gradient descent optimization with Smooth Gradient in a deterministic setting.
First, for \(t \in \left [0, 1\right ]\), we parameterize the path between \(\mathbf{x}_{k}\) and \(\mathbf{x}_{k+1}\) as \(\gamma (t) = t(\mathbf{x}_{k+1} - \mathbf{x}_{k}) + \mathbf{x}_{k}\). From (B31), using Taylor's theorem, the triangle inequality, and the Cauchy-Schwarz inequality,
Borrowing Lemma 9 from [24], we obtain,
From the assumption on L0 and L1 and (B34), we have,
From (B33),
Using Lemma 1,
Here, we investigate the lower bound for the following function,
Note that for \(\left \| \nabla f(\mathbf {x}_{k}) \right \| \geq \epsilon \), the above function has minimum at \(\left \| \nabla f(\mathbf {x}_{k}) \right \| = \epsilon \). Thus,
If \(\epsilon \) is sufficiently small, because \(\epsilon ^{2} + \delta \approx \delta \), (RHS) of (B39) can be approximated as \(\frac {\eta \epsilon ^{2}}{2}\). Thus,
From the telescopic sum from zero to the T − 1 iterations of (B40),
Here, we assume that \(\eta = \frac {1}{p L_{0}}\). Because \(\delta > 0\), we need \(1 - 5 L_{0} \eta > 0\), which requires \(p > 5\). Now, we have an upper bound on the iteration complexity of gradient descent with Smooth Gradient,
which means that the gradient descent with a Smooth Gradient converges as \(\mathcal {O} \left (L_{0} \frac {f(\mathbf {x}_{0})-f^{*}}{\epsilon ^{2}} \right )\). Note that Theorem 4 of [24] claims that under the same assumptions for the optimization, the gradient descent converges in \({\Omega } \left ((L_{1} M / log(M) + L_{0}) \frac {f(\mathbf {x}_{0})-f^{*}}{\epsilon ^{2}} \right )\) iterations, where
for the initialization of x0. Thus, when M is large, that is, when the problem has a poor initialization, our gradient descent with a Smooth Gradient can be arbitrarily faster than the vanilla gradient descent. □
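To make the telescoping step in the proof explicit: each iteration with \(\left \| \nabla f(\mathbf {x}_{k}) \right \| \geq \epsilon \) decreases the objective by at least \(\frac{\eta \epsilon^{2}}{2}\), so summing over \(k = 0, \ldots, T-1\) and using \(f(\mathbf{x}_{T}) \geq f^{*}\) together with \(\eta = \frac{1}{p L_{0}}\) yields

```latex
f(\mathbf{x}_{0}) - f^{*} \;\geq\; f(\mathbf{x}_{0}) - f(\mathbf{x}_{T})
\;=\; \sum_{k=0}^{T-1} \left( f(\mathbf{x}_{k}) - f(\mathbf{x}_{k+1}) \right)
\;\geq\; T \, \frac{\eta \epsilon^{2}}{2},
\qquad\text{so}\qquad
T \;\leq\; \frac{2 \left( f(\mathbf{x}_{0}) - f^{*} \right)}{\eta \epsilon^{2}}
\;=\; 2 p L_{0} \, \frac{f(\mathbf{x}_{0}) - f^{*}}{\epsilon^{2}} .
```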
Appendix C: More details on 𝜃
Here, we detail 𝜃, which was introduced in Section 2.2. We also prove the property that \(\tan \theta = \left \| \nabla h(\mathbf {x}) \right \|\).
As discussed in Section 2.2, the mass m moves on a virtual surface (x, h(x)). If we consider x to be an n-dimensional vector, space (x, h(x)) forms the (n + 1) dimension. Because the neural network is non-linear, the loss function and virtual surface have non-linear properties. In other words, h(x) is a non-linear function of x. Here, we linearize h(x) as:
This approximates the non-linear virtual surface h(x) as a linear surface near the point x. Let this plane, which exists in the (n + 1) dimension, be S1, and let the plane h(x) = 0 be S2. Here, 𝜃 is defined as the acute angle between S1 and S2 (Fig. 2). Now, 𝜃 can be obtained by investigating the normal vectors of the two planes. First, (C44) indicates that S1 has a normal vector \(n_{1} = \left (\frac {\partial h}{\partial x_{1}}, \frac {\partial h}{\partial x_{2}}, ..., \frac {\partial h}{\partial x_{n}}, -1\right )\), while S2 has a normal vector \(n_{2} = (0, 0, ..., 0, 1)\). As the angle between the two normal vectors \(n_{1}\) and \(n_{2}\) is \(\pi - \theta \),
Because \(0<\theta <\frac {\pi }{2}\), we obtain,
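The angle computation, written out explicitly: with \(n_{1} \cdot n_{2} = -1\), \(\left \| n_{1} \right \| = \sqrt{1 + \left \| \nabla h(\mathbf{x}) \right \|^{2}}\), and \(\left \| n_{2} \right \| = 1\),

```latex
\cos(\pi - \theta) = \frac{n_{1} \cdot n_{2}}{\left\| n_{1} \right\| \left\| n_{2} \right\|}
= \frac{-1}{\sqrt{1 + \left\| \nabla h(\mathbf{x}) \right\|^{2}}}
\;\Longrightarrow\;
\cos\theta = \frac{1}{\sqrt{1 + \left\| \nabla h(\mathbf{x}) \right\|^{2}}},
\qquad
\sin\theta = \frac{\left\| \nabla h(\mathbf{x}) \right\|}{\sqrt{1 + \left\| \nabla h(\mathbf{x}) \right\|^{2}}},
\qquad
\tan\theta = \left\| \nabla h(\mathbf{x}) \right\| .
```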
Appendix D: Tips for tuning δ
From our experimental results, we provide the following three tips for tuning δ:
-
Essentially, δ should be treated as a hyperparameter. Empirical tuning on a validation set is required, as for the learning rate or the threshold in gradient clipping. We recommend first tuning over {1, 10, 100, 1000, 10000}. However, δ is easier to tune than the learning rate, and sensitive tuning is not required (Section 4.1).
-
For the same regularization but different architectures: because a larger neural network has a larger gradient norm, its δ should be set to a larger value. In Section 4.2.1, δ was identified as 30000 for ResNet-100 and 1000 for ResNet-16.
-
For the same architecture but different regularization: for a non-regularized model, we recommend setting a smaller δ value than for a well-regularized model. This is because when δ is small, the regularization effect of the Smooth Gradient is greater. Experimentally, a δ of 1000 was confirmed for the well-regularized ResNet-16, whereas a δ of 1 was confirmed for the non-regularized PlainNet-16 (Section 4.2). Therefore, the δ setting must account for how regularization is applied, even within the same architecture.
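The first tip amounts to a coarse grid search, which can be sketched as follows; here train_and_validate is a placeholder for whatever training loop and validation metric apply, not a function from the paper:

```python
def tune_delta(train_and_validate, candidates=(1, 10, 100, 1000, 10000)):
    """Coarse grid search over delta, as recommended above.

    train_and_validate(delta) is assumed to train a model with the given
    delta and return a validation loss (lower is better).
    """
    scores = {d: train_and_validate(d) for d in candidates}
    best = min(scores, key=scores.get)
    return best, scores
```

Because δ is reported to be insensitive, this single coarse pass is typically enough; no fine-grained refinement around the winner is required.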
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kim, B.J., Choi, H., Jang, H. et al. Smooth momentum: improving lipschitzness in gradient descent. Appl Intell 53, 14233–14248 (2023). https://doi.org/10.1007/s10489-022-04207-7