
Smooth momentum: improving lipschitzness in gradient descent


Abstract

Deep neural network optimization is challenging. Large gradients in the chaotic loss landscape lead to unstable behavior during gradient descent. In this paper, we investigate a stable gradient descent algorithm. We revisit the mathematical derivation of the Momentum optimizer and discuss a potential problem on steep walls of the loss landscape. Inspired by the physical motion of a mass, we propose Smooth Momentum, a new optimizer that improves behavior on steep walls. We mathematically analyze the characteristics of the proposed optimizer and prove that Smooth Momentum exhibits improved Lipschitz properties and convergence guarantees, which allow stable and faster gradient descent. We also demonstrate how Smooth Gradient, a component of the proposed optimizer, can be plugged into other optimizers, such as Adam. The proposed method offers a regularization effect comparable to batch normalization or weight decay. Experiments demonstrate that the proposed optimizer significantly improves the optimization of transformers, convolutional neural networks, and non-convex functions across various tasks and datasets.


Data Availability

The datasets analyzed during the current study are available in the following repository:

∙ CIFAR-10: http://www.cs.toronto.edu/kriz/cifar.html

∙ Oxford-IIIT PET: https://www.robots.ox.ac.uk/vgg/data/pets/

∙ IWSLT14: https://workshop2014.iwslt.org/

Notes

  1. Strictly speaking, the weight x is a vector, not a scalar. In this paper, however, for a clearer explanation of our motivation, we use terms such as 1D motion and treat x as a one-dimensional variable in situations where the distinction does not matter.

  2. A function f is said to be Lipschitz continuous if \(\left \| f(x_{1}) - f(x_{2}) \right \| \leq L \left \| x_{1} - x_{2} \right \|\) for all \(x_{1}\) and \(x_{2}\). Here, L is called the Lipschitz constant.
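
     For example (a standard illustration, not taken from the paper), \(f(x) = \sin x\) is Lipschitz continuous with Lipschitz constant \(L = 1\): by the mean value theorem,

     $$ \left| \sin x_{1} - \sin x_{2} \right| = \left| \cos \xi \right| \left| x_{1} - x_{2} \right| \leq \left| x_{1} - x_{2} \right| $$

     for some \(\xi\) between \(x_{1}\) and \(x_{2}\).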

References

  1. Pal SK, Pramanik A, Maiti J, Mitra P (2021) Deep learning in multi-object detection and tracking: state of the art. Appl Intell 51(9):6400–6429

  2. Mao Q, Sun H, Zuo L, Jia R (2020) Finding every car: a traffic surveillance multi-scale vehicle object detection method. Appl Intell 50(10):3125–3136

  3. Lu L, Wu D, Wu T, Huang F, Yi Y (2020) Anchor-free multi-orientation text detection in natural scene images. Appl Intell 50(11):3623–3637

  4. Gouk H, Frank E, Pfahringer B, Cree MJ (2021) Regularisation of neural networks by enforcing Lipschitz continuity. Mach Learn 110(2):393–416

  5. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

  6. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Vol 37 of JMLR Workshop and Conference Proceedings, pp 448–456 (JMLR.org)

  7. Santurkar S, Tsipras D, Ilyas A, Madry A (2018) How does batch normalization help optimization? pp 2488–2498

  8. Qiao S, Wang H, Liu C, Shen W, Yuille AL (2019) Weight standardization. arXiv:1903.10520

  9. Nesterov YE (2004) Introductory lectures on convex optimization: a basic course. Vol 87 of Applied Optimization (Springer)

  10. Li H, Xu Z, Taylor G, Studer C, Goldstein T (2018) Visualizing the loss landscape of neural nets. pp 6391–6401

  11. Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. Vol 28 of JMLR Workshop and Conference Proceedings, pp 1310–1318 (JMLR.org)

  12. Tieleman T, Hinton G (2012) Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn 4(2):26–31

  13. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: ICLR

  14. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. pp 4148–4158

  15. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. pp 770–778 (IEEE Computer Society)

  16. Saon G et al (2017) English conversational telephone speech recognition by humans and machines. pp 132–136

  17. Qian N (1999) On the momentum term in gradient descent learning algorithms. Neural Netw 12(1):145–151

  18. Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: stochastic gradient descent with momentum and difference. Appl Intell 52(4):3939–3953

  19. Tao W, Pan Z, Wu G, Tao Q (2020) The strength of Nesterov's extrapolation in the individual convergence of nonsmooth optimization. IEEE Trans Neural Netw Learn Syst 31(7):2557–2568

  20. Gui Y, Li D, Fang R (2022) A fast adaptive algorithm for training deep neural networks. Appl Intell

  21. Paszke A et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Wallach H et al (eds) Advances in Neural Information Processing Systems, vol 32, pp 8024–8035 (Curran Associates, Inc.). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

  22. Abadi M et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org

  23. Yedida R, Saha S, Prashanth T (2021) LipschitzLR: using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51(3):1460–1478

  24. Zhang J, He T, Sra S, Jadbabaie A (2020) Why gradient clipping accelerates training: a theoretical justification for adaptivity. (OpenReview.net)

  25. Curtis FE, Scheinberg K, Shi R (2019) A stochastic trust region algorithm based on careful step normalization. INFORMS J Optim 1(3):200–220

  26. Bello I, Zoph B, Vasudevan V, Le QV (2017) Neural optimizer search with reinforcement learning. Vol 70 of Proceedings of Machine Learning Research, pp 459–468 (PMLR)

  27. Jiaocheng M, Xian S, Xin Z, Yuan Peng Z (2022) Bayes-DCGRU with Bayesian optimization for rolling bearing fault diagnosis. Appl Intell

  28. Ackley D (2012) A connectionist machine for genetic hillclimbing. Vol 28 (Springer Science & Business Media)

  29. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report

  30. Zagoruyko S, Komodakis N (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. (OpenReview.net)

  31. Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. pp 3498–3505 (IEEE Computer Society)

  32. Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J Control Optim 30(4):838–855

  33. Liu L et al (2020) On the variance of the adaptive learning rate and beyond. (OpenReview.net)

  34. Riedmiller MA, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. pp 586–591 (IEEE)

  35. Wang Y, Li K, Lei Y (2022) A general multi-scale image classification based on shared conversion matrix routing. Appl Intell 52(3):3249–3265

  36. Peters ME et al (2018) Deep contextualized word representations. pp 2227–2237 (Association for Computational Linguistics)

  37. Vaswani A et al (2017) Attention is all you need. pp 5998–6008

  38. Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. pp 4171–4186 (Association for Computational Linguistics)

  39. Dosovitskiy A et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. (OpenReview.net)

  40. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. pp 2818–2826 (IEEE Computer Society)

  41. Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: a method for automatic evaluation of machine translation. pp 311–318 (ACL)


Acknowledgements

This work was supported by Samsung Electronics Co., Ltd (IO201210-08019-01).

Author information


Corresponding author

Correspondence to Sang Woo Kim.

Ethics declarations

Conflict of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 2

We prove Theorem 2 as follows.

Proof

First, we investigate the Hessian matrix:

$$ \begin{array}{@{}rcl@{}} && (\mathbf{H}_{s})_{ij} \end{array} $$
(A1)
$$ \begin{array}{@{}rcl@{}} &=& \frac{\partial^{2} s(\mathbf{x})}{\partial x_{i} \partial x_{j}} \end{array} $$
(A2)
$$ \begin{array}{@{}rcl@{}} &=& \frac{\partial}{\partial x_{i}} \left( \frac{\partial s(\mathbf{x})}{\partial x_{j}} \right) \end{array} $$
(A3)
$$ \begin{array}{@{}rcl@{}} &=& \frac{\partial}{\partial x_{i}} \left( \frac{\delta}{\delta + \left\| \nabla f(\mathbf{x}) \right\|^{2}} \frac{\partial f(\mathbf{x})}{\partial x_{j}} \right) \end{array} $$
(A4)
$$ \begin{array}{@{}rcl@{}} &=& \frac{\delta \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{j}} (\delta + \left\| \nabla f(\mathbf{x}) \right\|^{2}) - \delta \frac{\partial f(\mathbf{x})}{\partial x_{j}} (2{\sum}_{k} \frac{\partial f(\mathbf{x})}{\partial x_{k}} \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{k}} ) }{(\delta + \left\| \nabla f(\mathbf{x}) \right\|^{2})^{2}} \end{array} $$
(A5)
$$ \begin{array}{@{}rcl@{}} &=& A(\mathbf{x}) + B(\mathbf{x}), \end{array} $$
(A6)

where

$$ \begin{array}{@{}rcl@{}} A(\mathbf{x}) &=& D \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{j}}, \end{array} $$
(A7)
$$ \begin{array}{@{}rcl@{}} B(\mathbf{x}) & =& -2 \delta^{-1} D^{2} \frac{\partial f(\mathbf{x})}{\partial x_{j}} {\sum}_{k} \frac{\partial f(\mathbf{x})}{\partial x_{k}} \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{k}}, \end{array} $$
(A8)
$$ \begin{array}{@{}rcl@{}} D &=& \frac{\delta}{\delta + \left\| \nabla f(\mathbf{x})\right\|^{2}}. \end{array} $$
(A9)

Now, we investigate (LHS) of (14).

$$ \begin{array}{@{}rcl@{}} && (\nabla s(\mathbf{x}))^{\top} \mathbf{H}_{s} (\nabla s(\mathbf{x})) \end{array} $$
(A10)
$$ \begin{array}{@{}rcl@{}} &=& \sum\limits_{i} \sum\limits_{j} \frac{\partial s(\mathbf{x})}{\partial x_{i}} (\mathbf{H}_{s})_{ij} \frac{\partial s(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A11)
$$ \begin{array}{@{}rcl@{}} &=& D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} (A(\mathbf{x})+B(\mathbf{x})) \frac{\partial f(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A12)
$$ \begin{array}{@{}rcl@{}} &=& D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} A(\mathbf{x}) \frac{\partial f(\mathbf{x})}{\partial x_{j}} \\ && + D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} B(\mathbf{x}) \frac{\partial f(\mathbf{x})}{\partial x_{j}}. \end{array} $$
(A13)

The first term in (A13) is

$$ \begin{array}{@{}rcl@{}} && D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} A(\mathbf{x}) \frac{\partial f(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A14)
$$ \begin{array}{@{}rcl@{}} &=& D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} \left\{D \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{j}} \right\} \frac{\partial f(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A15)
$$ \begin{array}{@{}rcl@{}} &=& D^{3} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{j}} \frac{\partial f(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A16)
$$ \begin{array}{@{}rcl@{}} &=& D^{3}(\nabla f(\mathbf{x}))^{\top} \mathbf{H}_{f} (\nabla f(\mathbf{x})). \end{array} $$
(A17)

The second term in (A13) is

$$ \begin{array}{@{}rcl@{}} && D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} B(\mathbf{x}) \frac{\partial f(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A18)
$$ \begin{array}{@{}rcl@{}} &=& -2 \delta^{-1} D^{4} \sum\limits_{j} \left( \frac{\partial f(\mathbf{x})}{\partial x_{j}} \right)^{2} \sum\limits_{i} \sum\limits_{k} \frac{\partial f(\mathbf{x})}{\partial x_{i}} \frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{k}} \frac{\partial f(\mathbf{x})}{\partial x_{k}} \end{array} $$
(A19)
$$ \begin{array}{@{}rcl@{}} &=& -2 \delta^{-1} D^{4} \left\| \nabla f(\mathbf{x}) \right\|^{2} (\nabla f(\mathbf{x}))^{\top} \mathbf{H}_{f} (\nabla f(\mathbf{x})). \end{array} $$
(A20)

Thus,

$$ \begin{array}{@{}rcl@{}} && (\nabla s(\mathbf{x}))^{\top} \mathbf{H}_{s} (\nabla s(\mathbf{x})) \end{array} $$
(A21)
$$ \begin{array}{@{}rcl@{}} &=& D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} A(\mathbf{x}) \frac{\partial f(\mathbf{x})}{\partial x_{j}} \end{array} $$
(A22)
$$ \begin{array}{@{}rcl@{}} &&+ D^{2} \sum\limits_{i} \sum\limits_{j} \frac{\partial f(\mathbf{x})}{\partial x_{i}} B(\mathbf{x}) \frac{\partial f(\mathbf{x})}{\partial x_{j}} \\ &=& D^{3}(\nabla f(\mathbf{x}))^{\top} \mathbf{H}_{f} (\nabla f(\mathbf{x})) \end{array} $$
(A23)
$$ \begin{array}{@{}rcl@{}} &&-2 \delta^{-1} D^{4} \left\| \nabla f(\mathbf{x}) \right\|^{2} (\nabla f(\mathbf{x}))^{\top} \mathbf{H}_{f} (\nabla f(\mathbf{x})) \\ &=& D^{3} (1 -2 \delta^{-1} D \left\| \nabla f(\mathbf{x}) \right\|^{2} ) (\nabla f(\mathbf{x}))^{\top} \mathbf{H}_{f} (\nabla f(\mathbf{x})). \end{array} $$
(A24)

Because \(\delta = m^{2} g^{2} \geq 0\) as mentioned in Section 2, \( \left \| \nabla f(\mathbf {x}) \right \|^{2} \geq 0\), and \(0 \leq D \leq 1\), we obtain

$$ D^{3} (1 -2 \delta^{-1} D \left\| \nabla f(\mathbf{x}) \right\|^{2} ) \leq D^{3} \leq 1. $$
(A25)

Moreover, because \(2 \delta^{-1} D \left\| \nabla f(\mathbf{x}) \right\|^{2} = 2\left\| \nabla f(\mathbf{x}) \right\|^{2} / (\delta + \left\| \nabla f(\mathbf{x}) \right\|^{2}) \leq 2\), the left-hand side of (A25) is also bounded below by \(-D^{3} \geq -1\), so its absolute value is at most 1. Therefore,

$$ \left| (\nabla s(\mathbf{x}))^{\top} \mathbf{H}_{s} (\nabla s(\mathbf{x})) \right| \leq \left| (\nabla f(\mathbf{x}))^{\top} \mathbf{H}_{f} (\nabla f(\mathbf{x})) \right|. $$
(A26)

Equation (A24) also indicates that the larger the gradient, the more strongly the second-order term is suppressed when the Smooth Gradient is used.
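
The inequality (A26) can also be checked numerically. The following sketch (illustrative only, not part of the original implementation) uses finite differences on a hypothetical toy function f and the closed forms (A7)-(A9) to compare both sides of (A26); the test function, the evaluation point, and δ = 10 are arbitrary choices.

import numpy as np

# Numerical sanity check of (A26) on a toy function (illustrative assumptions only).
def f(x):
    # Hypothetical non-convex test function; not from the paper.
    return x[0]**4 + 3.0 * x[0] * x[1] + np.cos(x[1])

def grad(fun, x, eps=1e-5):
    # Central finite-difference gradient.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (fun(x + e) - fun(x - e)) / (2.0 * eps)
    return g

def hessian(fun, x, eps=1e-4):
    # Finite-difference Hessian obtained by differentiating the gradient.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(x); e[i] = eps
        H[i] = (grad(fun, x + e) - grad(fun, x - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)  # symmetrize

delta = 10.0
x = np.array([1.3, -0.7])

gf, Hf = grad(f, x), hessian(f, x)
D = delta / (delta + gf @ gf)                                # D, as in (A9)
gs = D * gf                                                  # gradient of s(x)
Hs = D * Hf - (2.0 / delta) * D**2 * np.outer(Hf @ gf, gf)   # A(x) + B(x), (A7)-(A8)

lhs = abs(gs @ Hs @ gs)
rhs = abs(gf @ Hf @ gf)
print(lhs, "<=", rhs, ":", bool(lhs <= rhs + 1e-8))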

Appendix B: Proof of Theorem 3

In this section, we prove Theorem 3. The proof is inspired by [24]; we use the same assumptions and optimization settings, except for the configuration of \(h_{k}\). Before the main proof, we begin with the following lemma:

Lemma 1

If \(\delta \leq \frac {1-5 L_{0} \eta }{4 {L_{1}^{2}} \eta ^{2}}\), we have,

$$ \begin{array}{@{}rcl@{}} h_{k} = \eta \frac{\delta}{\delta + \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2}} \leq \frac{1}{5 L_{0} + 4 L_{1} \left\| \nabla f(\mathbf{x}_{k}) \right\|}. \end{array} $$
(B27)

Proof

Rewriting (B27) yields,

$$ \begin{array}{@{}rcl@{}} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2} - 4 L_{1} \eta \delta \left\| \nabla f(\mathbf{x}_{k}) \right\| + \delta - 5 L_{0} \eta \delta \geq 0. \end{array} $$
(B28)

Note that (LHS) of (B28) is,

$$ \begin{array}{@{}rcl@{}} (LHS) &=& \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2} - 4 L_{1} \eta \delta \left\| \nabla f(\mathbf{x}_{k}) \right\| \\ && + 4 {L_{1}^{2}} \eta^{2} \delta^{2} - 4 {L_{1}^{2}} \eta^{2} \delta^{2} + \delta - 5 L_{0} \eta \delta \end{array} $$
(B29)
$$ \begin{array}{@{}rcl@{}} &=& (\left\| \nabla f(\mathbf{x}_{k}) \right\| - 2 L_{1} \eta \delta)^{2} - 4 {L_{1}^{2}} \eta^{2} \delta^{2} \\ &&+ \delta - 5 L_{0} \eta \delta. \end{array} $$
(B30)

Thus, it suffices to show that \(- 4 {L_{1}^{2}} \eta^{2} \delta^{2} + \delta - 5 L_{0} \eta \delta \geq 0\), which holds whenever \(\delta \leq \frac{1-5 L_{0} \eta}{4 {L_{1}^{2}} \eta^{2}}\). □

Now, we start the proof of Theorem 3.

Proof

Our goal is to establish an upper bound on the iteration complexity of the following gradient descent iteration with Smooth Gradient in a deterministic setting.

$$ \begin{array}{@{}rcl@{}} \mathbf{x}_{k+1} &=& \mathbf{x}_{k} - h_{k} \nabla f(\mathbf{x}_{k}), \end{array} $$
(B31)
$$ \begin{array}{@{}rcl@{}} h_{k} &=& \eta \frac{\delta}{\delta + \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2}}. \end{array} $$
(B32)

First, for \(t \in \left[0, 1\right]\), we parameterize the path between \(\mathbf{x}_{k}\) and \(\mathbf{x}_{k+1}\) as \(\boldsymbol{\gamma}(t) = t(\mathbf{x}_{k+1} - \mathbf{x}_{k}) + \mathbf{x}_{k}\). From (B31), using Taylor's theorem, the triangle inequality, and the Cauchy-Schwarz inequality,

$$ \begin{array}{@{}rcl@{}} f(\mathbf{x}_{k+1}) &\leq& f(\mathbf{x}_{k}) - h_{k} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2} \\ && + \frac{\left\| \mathbf{x}_{k+1} - \mathbf{x}_{k} \right\|^{2}}{2} {{\int}_{0}^{1}} \left\| \nabla^{2} f(\boldsymbol{\gamma}(t)) \right\| dt. \end{array} $$
(B33)

Borrowing Lemma 9 from [24], we obtain,

$$ \left\| \nabla f(\boldsymbol{\gamma}(t)) \right\| \leq 4 \left( \frac{L_{0}}{L_{1}} + \left\| \nabla f(\mathbf{x}_{k}) \right\| \right). $$
(B34)

From the assumption on L0 and L1 and (B34), we have,

$$ \left\| \nabla^{2} f(\boldsymbol{\gamma}(t)) \right\| \leq 5 L_{0} + 4 L_{1} \left\| \nabla f(\mathbf{x}_{k}) \right\|. $$
(B35)

From (B33),

$$ \begin{array}{@{}rcl@{}} f(\mathbf{x}_{k+1}) &\leq& f(\mathbf{x}_{k}) - h_{k} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2} \\ && + \frac{ 5 L_{0} + 4 L_{1} \left\| \nabla f(\mathbf{x}_{k}) \right\| }{2} {h_{k}^{2}} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2} . \end{array} $$
(B36)

Using Lemma 1,

$$ f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_{k}) - \frac{h_{k} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2}}{2}. $$
(B37)

Here, we investigate the lower bound for the following function,

$$ \frac{h_{k} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2}}{2} = \frac{\eta \delta}{2} \cdot \frac{\left\| \nabla f(\mathbf{x}_{k}) \right\|^{2}}{\left\| \nabla f(\mathbf{x}_{k}) \right\|^{2} + \delta}. $$
(B38)

Note that for \(\left \| \nabla f(\mathbf {x}_{k}) \right \| \geq \epsilon \), the right-hand side of (B38) is increasing in \(\left \| \nabla f(\mathbf {x}_{k}) \right \|\) and therefore attains its minimum at \(\left \| \nabla f(\mathbf {x}_{k}) \right \| = \epsilon \). Thus,

$$ \frac{h_{k} \left\| \nabla f(\mathbf{x}_{k}) \right\|^{2}}{2} \geq \frac{\eta \delta \epsilon^{2}}{2 (\epsilon^{2} + \delta)}. $$
(B39)

If \(\epsilon\) is sufficiently small, then \(\epsilon^{2} + \delta \approx \delta\), and the (RHS) of (B39) can be approximated by \(\frac{\eta \epsilon^{2}}{2}\). Thus,

$$ f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_{k}) - \frac{\eta \epsilon^{2}}{2}. $$
(B40)

Taking the telescoping sum of (B40) from iteration 0 to \(T-1\),

$$ f^{*} - f(\mathbf{x}_{0}) \leq -\frac{\eta \epsilon^{2} T}{2}. $$
(B41)

Here, we assume that \(\eta = \frac{1}{p L_{0}}\). Because \(\delta > 0\), we need \(1 - 5 L_{0} \eta > 0\), which requires \(p > 5\). Now, we have an upper bound on the iteration complexity of gradient descent with Smooth Gradient,

$$ T \leq \frac{2p L_{0} (f(\mathbf{x}_{0}) - f^{*})}{\epsilon^{2}}, $$
(B42)

which means that gradient descent with the Smooth Gradient converges in \(\mathcal {O} \left (L_{0} \frac {f(\mathbf {x}_{0})-f^{*}}{\epsilon ^{2}} \right )\) iterations. Note that Theorem 4 of [24] shows that, under the same assumptions, vanilla gradient descent requires \({\Omega } \left ((L_{1} M / \log (M) + L_{0}) \frac {f(\mathbf {x}_{0})-f^{*}}{\epsilon ^{2}} \right )\) iterations, where

$$ M = \sup\{ \| \nabla f(\mathbf{x}) \| \mid \mathbf{x} \text{ such that } f(\mathbf{x}) \leq f(\mathbf{x}_{0})\}, $$
(B43)

and \(\mathbf{x}_{0}\) denotes the initialization. Thus, when M is large, that is, when the problem is poorly initialized, gradient descent with the Smooth Gradient can be arbitrarily faster than vanilla gradient descent. □
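
For reference, the deterministic update (B31)-(B32) analyzed above can be sketched in a few lines. The toy objective, its gradient, and the values of η, δ, and the iteration count below are illustrative assumptions, not settings from the paper.

import numpy as np

# Minimal sketch of gradient descent with Smooth Gradient, following (B31)-(B32).
def f(x):
    # Hypothetical steep, non-convex toy objective.
    return 0.5 * x[0]**2 + 5.0 * np.cos(x[1]) + x[1]**2

def grad_f(x):
    return np.array([x[0], -5.0 * np.sin(x[1]) + 2.0 * x[1]])

def smooth_gradient_descent(x0, eta=0.1, delta=10.0, iters=500):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        h_k = eta * delta / (delta + g @ g)  # effective step size, (B32)
        x = x - h_k * g                      # update, (B31)
    return x

x_final = smooth_gradient_descent([3.0, 2.0])
print("x =", x_final, "f(x) =", f(x_final))

The effective step size \(h_{k}\) shrinks automatically where the gradient norm is large, which is precisely the behavior exploited in the proof above.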

Appendix C: More details on 𝜃

Here, we detail 𝜃, which was introduced in Section 2.2. We also prove the property that \(\tan \theta = \left \| \nabla h(\mathbf {x}) \right \|\).

As discussed in Section 2.2, the mass m moves on a virtual surface (x, h(x)). If we consider x to be an n-dimensional vector, the space (x, h(x)) is (n + 1)-dimensional. Because the neural network is non-linear, the loss function and the virtual surface are also non-linear; in other words, h(x) is a non-linear function of x. Here, we linearize h(x) as:

$$ h(\mathbf{x}) = \frac{\partial h}{\partial x_{1}}x_{1} + \frac{\partial h}{\partial x_{2}}x_{2} + ... + \frac{\partial h}{\partial x_{n}}x_{n} + C. $$
(C44)

This approximates the non-linear virtual surface h(x) by a linear surface near the point x. Let this plane in the (n + 1)-dimensional space be S1, and let the plane h(x) = 0 be S2. Here, 𝜃 is defined as the acute angle between S1 and S2 (Fig. 2). Now, 𝜃 can be obtained from the normal vectors of the two planes. First, (C44) indicates that S1 has the normal vector \(n_{1} = \left (\frac {\partial h}{\partial x_{1}}, \frac {\partial h}{\partial x_{2}}, ..., \frac {\partial h}{\partial x_{n}}, -1\right )\), and S2 has the normal vector \(n_{2} = (0, 0, ..., 0, 1)\). Because the angle between the two normal vectors \(n_{1}\) and \(n_{2}\) is \(\pi - \theta\),

$$ \begin{array}{@{}rcl@{}} \cos(\pi-\theta) &=& \frac{n_{1} \cdot n_{2}}{\left\| n_{1} \right\| \left\| n_{2} \right\|} \end{array} $$
(C45)
$$ \begin{array}{@{}rcl@{}} &=& \frac{-1}{\sqrt{\left\| \nabla h(\mathbf{x}) \right\|^{2} + 1}}. \end{array} $$
(C46)

Fig. 2: Illustration showing the definition of 𝜃

Because \(0<\theta <\frac {\pi }{2}\), we obtain,

$$ \tan\theta = \left\| \nabla h(\mathbf{x}) \right\|. $$
(C47)
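
As a quick sanity check of (C47) (an illustrative example, not from the paper), consider the one-dimensional case with the linear surface h(x) = x, a line inclined at 45° to the horizontal plane. Then

$$ \left\| \nabla h(x) \right\| = 1 \quad \Rightarrow \quad \tan\theta = 1 \quad \Rightarrow \quad \theta = \frac{\pi}{4}, $$

which matches the slope of the surface.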

Appendix D: Tips for tuning δ

From our experimental results, we provide the following three tips for tuning δ:

  • Essentially, δ should be treated as a hyperparameter. As with the learning rate or the threshold in gradient clipping, empirical tuning on a validation set is required. We recommend first searching over {1, 10, 100, 1000, 10000} (see the sketch after this list). However, δ is easier to tune than the learning rate, and fine-grained tuning is not required (Section 4.1).

  • For the same regularization but different architectures: because a larger neural network has a larger gradient norm, its δ should be set to a larger value. In Section 4.2.1, the best δ was 30000 for ResNet-100 and 1000 for ResNet-16.

  • For the same architecture but different regularization: for a non-regularized model, we recommend setting a smaller δ than for a well-regularized model, because a smaller δ yields a stronger regularization effect through the Smooth Gradient. Experimentally, δ = 1000 worked well for the well-regularized ResNet-16, whereas δ = 1 worked well for the non-regularized PlainNet-16 (Section 4.2). Therefore, the δ setting should account for how much regularization is applied, even for the same architecture.
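
The coarse search recommended in the first tip can be organized as in the following sketch. Here smooth_gradient mirrors the rescaling \(\delta / (\delta + \left\| \nabla f \right\|^{2})\) used by the Smooth Gradient, while train_and_validate is a hypothetical placeholder to be replaced by an actual training run with Smooth Momentum; neither is code from the paper.

import numpy as np

# Coarse grid search over delta (illustrative sketch only).
def smooth_gradient(g, delta):
    # Rescale a raw gradient g into a Smooth Gradient.
    g = np.asarray(g, dtype=float)
    return delta / (delta + g @ g) * g

def train_and_validate(delta):
    # Placeholder: train with Smooth Momentum using this delta and return the
    # validation accuracy. The random score below is for illustration only.
    rng = np.random.default_rng(seed=int(delta))
    return float(rng.uniform())

candidates = [1, 10, 100, 1000, 10000]   # coarse grid from the first tip
best_delta = max(candidates, key=train_and_validate)
print("selected delta:", best_delta)

# After selection, each raw mini-batch gradient would be rescaled as:
raw_gradient = np.array([0.5, -2.0, 1.5])  # stand-in for a real gradient
print("smoothed gradient:", smooth_gradient(raw_gradient, best_delta))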

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kim, B.J., Choi, H., Jang, H. et al. Smooth momentum: improving lipschitzness in gradient descent. Appl Intell 53, 14233–14248 (2023). https://doi.org/10.1007/s10489-022-04207-7

