Abstract
Deep neural network optimization is challenging: large gradients in the chaotic loss landscape lead to unstable behavior during gradient descent. In this paper, we investigate a stable gradient descent algorithm. We revisit the mathematical derivation of the Momentum optimizer and discuss a potential problem on steep walls of the loss landscape. Inspired by the physical motion of a mass, we propose Smooth Momentum, a new optimizer that improves behavior on steep walls. We mathematically analyze the characteristics of the proposed optimizer and prove that Smooth Momentum exhibits improved Lipschitz properties and convergence, which allows stable and faster gradient descent. We also demonstrate how Smooth Gradient, a component of the proposed optimizer, can be plugged into other optimizers, such as Adam. The proposed method offers a regularization effect comparable to batch normalization or weight decay. Experiments demonstrate that our proposed optimizer significantly improves the optimization of transformers, convolutional neural networks, and non-convex functions across various tasks and datasets.
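The precise update rule of Smooth Momentum is given in the paper's Section 2 and is not reproduced in this excerpt. As a purely illustrative sketch, the following assumes the Smooth Gradient rescales a raw gradient g by \(1/\sqrt{\|g\|^2 + \delta}\) — a form consistent with the \(\delta = m^2 g^2\) and \(\|\nabla f(\mathbf{x})\|^2\) terms that appear in the appendices, but an assumption nonetheless — which bounds the step on steep walls while preserving its direction:

```python
import numpy as np

def smooth_gradient(g, delta):
    """Hypothetical Smooth Gradient: rescale g by 1 / sqrt(||g||^2 + delta).

    On steep walls (large ||g||) the result approaches a unit vector, so the
    step stays bounded; for small gradients the scaling is roughly constant.
    """
    g = np.asarray(g, dtype=float)
    return g / np.sqrt(np.dot(g, g) + delta)

def smooth_momentum_step(x, grad_fn, velocity, lr=0.1, beta=0.9, delta=1.0):
    """One heavy-ball step using the smoothed gradient in place of the raw one."""
    v = beta * velocity + smooth_gradient(grad_fn(x), delta)
    return x - lr * v, v
```

With delta = 0 the transform reduces to plain gradient normalization; a larger delta leaves the raw gradient closer to untouched, consistent with the tuning tips in Appendix D.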
Data Availability
The datasets analyzed during the current study are available in the following repositories:
∙ CIFAR-10: http://www.cs.toronto.edu/~kriz/cifar.html
∙ Oxford-IIIT PET: https://www.robots.ox.ac.uk/~vgg/data/pets/
∙ IWSLT14: https://workshop2014.iwslt.org/
Notes
Strictly speaking, the weight x is a vector, not a scalar. In this paper, however, for a clear explanation of our motivation, we use terms such as 1D motion, treating x as a one-dimensional variable in situations where the distinction does not matter.
A function f is said to be Lipschitz continuous if \(\left \| f(x_{1}) - f(x_{2}) \right \| \leq L \left \| x_{1} - x_{2} \right \|\) for all \(x_{1}, x_{2}\). Here, L is called the Lipschitz constant.
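The footnote's definition can be probed numerically. The sketch below is an illustrative helper (not from the paper) that estimates a lower bound on the Lipschitz constant L by sampling point pairs and taking the largest observed ratio:

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=1000, scale=1.0, seed=0):
    """Estimate a lower bound on the Lipschitz constant of f.

    Samples random point pairs and returns the largest observed ratio
    ||f(x1) - f(x2)|| / ||x1 - x2||; the true L can only be larger.
    """
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.normal(0.0, scale, size=(2, dim))
        den = np.linalg.norm(x1 - x2)
        if den > 0.0:
            best = max(best, np.linalg.norm(f(x1) - f(x2)) / den)
    return best
```

For f(x) = 3x the ratio equals 3 for every pair, while for sin the estimate never exceeds its true constant L = 1.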
References
Pal SK, Pramanik A, Maiti J, Mitra P (2021) Deep learning in multi-object detection and tracking: state of the art. Appl Intell 51(9):6400–6429
Mao Q, Sun H, Zuo L, Jia R (2020) Finding every car: a traffic surveillance multi-scale vehicle object detection method. Appl Intell 50(10):3125–3136
Lu L, Wu D, Wu T, Huang F, Yi Y (2020) Anchor-free multi-orientation text detection in natural scene images. Appl Intell 50(11):3623–3637
Gouk H, Frank E, Pfahringer B, Cree MJ (2021) Regularisation of neural networks by enforcing Lipschitz continuity. Mach Learn 110(2):393–416
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Vol 37 of JMLR Workshop and Conference Proceedings, pp 448–456 (JMLR.org)
Santurkar S, Tsipras D, Ilyas A, Madry A (2018) How does batch normalization help optimization? pp 2488–2498
Qiao S, Wang H, Liu C, Shen W, Yuille AL (2019) Weight standardization. CoRR arXiv:1903.10520
Nesterov YE (2004) Introductory lectures on convex optimization - a basic course. Vol 87 of Applied Optimization (Springer)
Li H, Xu Z, Taylor G, Studer C, Goldstein T (2018) Visualizing the loss landscape of neural nets, pp 6391–6401
Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. Vol 28 of JMLR Workshop and Conference Proceedings, pp 1310–1318 (JMLR.org)
Tieleman T, Hinton G (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn 4(2):26–31
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: ICLR
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning, pp 4148–4158
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition, pp 770–778 (IEEE Computer Society)
Saon G et al (2017) English conversational telephone speech recognition by humans and machines, pp 132–136
Qian N (1999) On the momentum term in gradient descent learning algorithms. Neural Netw 12(1):145–151
Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: stochastic gradient descent with momentum and difference. Appl Intell 52(4):3939–3953
Tao W, Pan Z, Wu G, Tao Q (2020) The strength of Nesterov's extrapolation in the individual convergence of nonsmooth optimization. IEEE Trans Neural Netw Learn Syst 31(7):2557–2568
Gui Y, Li D, Fang R (2022) A fast adaptive algorithm for training deep neural networks. Appl Intell
Paszke A et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Wallach H et al (eds) Advances in Neural Information Processing Systems, vol 32, pp 8024–8035 (Curran Associates, Inc.). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Abadi M et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
Yedida R, Saha S, Prashanth T (2021) LipschitzLR: using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51(3):1460–1478
Zhang J, He T, Sra S, Jadbabaie A (2020) Why gradient clipping accelerates training: a theoretical justification for adaptivity. (OpenReview.net)
Curtis FE, Scheinberg K, Shi R (2019) A stochastic trust region algorithm based on careful step normalization. INFORMS J Optim 1(3):200–220
Bello I, Zoph B, Vasudevan V, Le QV (2017) Neural optimizer search with reinforcement learning. Vol 70 of Proceedings of Machine Learning Research, pp 459–468 (PMLR)
Jiaocheng M, Xian S, Xin Z, Yuan Peng Z (2022) Bayes-DCGRU with Bayesian optimization for rolling bearing fault diagnosis. Appl Intell
Ackley D (2012) A connectionist machine for genetic hillclimbing. Vol 28 (Springer Science & Business Media)
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report
Zagoruyko S, Komodakis N (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. (OpenReview.net)
Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. pp 3498–3505 (IEEE Computer Society)
Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J Contr Optimization 30(4):838–855
Liu L et al (2020) On the variance of the adaptive learning rate and beyond. (OpenReview.net)
Riedmiller MA, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. pp 586–591 (IEEE)
Wang Y, Li K, Lei Y (2022) A general multi-scale image classification based on shared conversion matrix routing. Appl Intell 52(3):3249–3265
Peters ME et al (2018) Deep contextualized word representations. pp 2227–2237 (Association for Computational Linguistics)
Vaswani A et al (2017) Attention is all you need. pp 5998–6008
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. pp 4171–4186 (Association for Computational Linguistics)
Dosovitskiy A et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. (OpenReview.net)
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. pp 2818–2826 (IEEE Computer Society)
Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: a method for automatic evaluation of machine translation. pp 311–318 (ACL)
Acknowledgements
This work was supported by Samsung Electronics Co., Ltd (IO201210-08019-01).
Ethics declarations
Conflict of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Theorem 2
We prove Theorem 2 as follows.
Proof
First, we investigate the Hessian matrix:
where
Now, we investigate (LHS) of (14).
The first term in (A13) is
The second term in (A13) is
Thus,
Because \(\delta = m^{2} g^{2} \geq 0\) as mentioned in Section 2, \(\left \| \nabla f(\mathbf {x}) \right \|^{2} \geq 0\), and \(0 \leq D \leq 1\), we obtain
Therefore,
□
Equation (A24) also indicates that the larger the gradient, the smaller the contribution of the second-order terms when using the Smooth Gradient.
Appendix B: Proof of Theorem 3
In this section, we prove Theorem 3. The following proof is inspired by [24]. We use the same assumptions and settings for the optimization, except for the gradient configuration \(h_{k}\). Before starting the main proof, we first establish the following lemma:
Lemma 1
If \(\delta \leq \frac {1-5 L_{0} \eta }{4 {L_{1}^{2}} \eta ^{2}}\), we have,
Proof
Rewriting (B27) yields,
Note that (LHS) of (B28) is,
Thus our claim is, \(- 4 {L_{1}^{2}} \eta ^{2} \delta ^{2} + \delta - 5 L_{0} \eta \delta \geq 0\). This holds if \(\delta \leq \frac {1-5 L_{0} \eta }{4 {L_{1}^{2}} \eta ^{2}}\). □
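For completeness, the final step is elementary: since \(\delta > 0\), dividing the claimed inequality by \(\delta\) preserves its direction,

```latex
-4 L_{1}^{2} \eta^{2} \delta^{2} + \delta - 5 L_{0} \eta \delta \geq 0
\;\Longleftrightarrow\;
-4 L_{1}^{2} \eta^{2} \delta + 1 - 5 L_{0} \eta \geq 0
\;\Longleftrightarrow\;
\delta \leq \frac{1 - 5 L_{0} \eta}{4 L_{1}^{2} \eta^{2}} .
```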
Now, we start the proof of Theorem 3.
Proof
Our goal is to investigate the upper bound of the iteration complexity of the following gradient descent optimization with Smooth Gradient in a deterministic setting.
First, for \(t \in \left [0, 1\right ]\), we parameterize the path between \(\mathbf{x}_{k}\) and \(\mathbf{x}_{k+1}\) as \(\gamma (t) = t(\mathbf{x}_{k+1} - \mathbf{x}_{k}) + \mathbf{x}_{k}\). From (B31), using Taylor's theorem, the triangle inequality, and the Cauchy-Schwarz inequality,
Borrowing Lemma 9 from [24], we obtain,
From the assumption on L0 and L1 and (B34), we have,
From (B33),
Using Lemma 1,
Here, we investigate the lower bound for the following function,
Note that for \(\left \| \nabla f(\mathbf {x}_{k}) \right \| \geq \epsilon \), the above function has minimum at \(\left \| \nabla f(\mathbf {x}_{k}) \right \| = \epsilon \). Thus,
If \(\epsilon \) is sufficiently small, because \(\epsilon ^{2} + \delta \approx \delta \), (RHS) of (B39) can be approximated as \(\frac {\eta \epsilon ^{2}}{2}\). Thus,
From the telescopic sum from zero to the T − 1 iterations of (B40),
Here, we assume that \(\eta = \frac {1}{p L_{0}}\). Because \(\delta > 0\), we need \(1 - 5 L_{0} \eta > 0\), which requires \(p > 5\). Now, we have an upper bound on the iteration complexity of gradient descent with Smooth Gradient,
which means that the gradient descent with a Smooth Gradient converges as \(\mathcal {O} \left (L_{0} \frac {f(\mathbf {x}_{0})-f^{*}}{\epsilon ^{2}} \right )\). Note that Theorem 4 of [24] claims that under the same assumptions for the optimization, the gradient descent converges in \({\Omega } \left ((L_{1} M / log(M) + L_{0}) \frac {f(\mathbf {x}_{0})-f^{*}}{\epsilon ^{2}} \right )\) iterations, where
for the initialization of x0. Thus, when M is large, that is, when the problem has a poor initialization, our gradient descent with a Smooth Gradient can be arbitrarily faster than the vanilla gradient descent. □
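To make the telescoping step in the proof explicit: each iteration with \(\left \| \nabla f(\mathbf {x}_{k}) \right \| \geq \epsilon \) decreases the objective by at least \(\frac{\eta \epsilon^{2}}{2}\), so summing over \(k = 0, \ldots, T-1\) and using \(f(\mathbf{x}_{T}) \geq f^{*}\) together with \(\eta = \frac{1}{p L_{0}}\) yields

```latex
f(\mathbf{x}_{0}) - f^{*} \;\geq\; f(\mathbf{x}_{0}) - f(\mathbf{x}_{T})
\;=\; \sum_{k=0}^{T-1} \left( f(\mathbf{x}_{k}) - f(\mathbf{x}_{k+1}) \right)
\;\geq\; T \, \frac{\eta \epsilon^{2}}{2},
\qquad\text{so}\qquad
T \;\leq\; \frac{2 \left( f(\mathbf{x}_{0}) - f^{*} \right)}{\eta \epsilon^{2}}
\;=\; 2 p L_{0} \, \frac{f(\mathbf{x}_{0}) - f^{*}}{\epsilon^{2}} .
```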
Appendix C: More details on 𝜃
Here, we detail 𝜃, which was introduced in Section 2.2. We also prove the property that \(\tan \theta = \left \| \nabla h(\mathbf {x}) \right \|\).
As discussed in Section 2.2, the mass m moves on a virtual surface (x, h(x)). If we consider x to be an n-dimensional vector, space (x, h(x)) forms the (n + 1) dimension. Because the neural network is non-linear, the loss function and virtual surface have non-linear properties. In other words, h(x) is a non-linear function of x. Here, we linearize h(x) as:
This approximates the non-linear virtual surface h(x) as a linear surface near the point x. Let this plane, which exists in the (n + 1) dimension, be S1, and let the plane h(x) = 0 be S2. Here, 𝜃 is defined as the acute angle between S1 and S2 (Fig. 2). Now, 𝜃 can be obtained by investigating the normal vectors of the two planes. First, (C44) indicates that S1 has a normal vector \(n_{1} = \left (\frac {\partial h}{\partial x_{1}}, \frac {\partial h}{\partial x_{2}}, ..., \frac {\partial h}{\partial x_{n}}, -1\right )\), while S2 has a normal vector \(n_{2} = (0, 0, ..., 0, 1)\). As the angle between the two normal vectors \(n_{1}\) and \(n_{2}\) is \(\pi - \theta \),
Because \(0<\theta <\frac {\pi }{2}\), we obtain,
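The angle computation, written out explicitly: with \(n_{1} \cdot n_{2} = -1\), \(\left \| n_{1} \right \| = \sqrt{1 + \left \| \nabla h(\mathbf{x}) \right \|^{2}}\), and \(\left \| n_{2} \right \| = 1\),

```latex
\cos(\pi - \theta) = \frac{n_{1} \cdot n_{2}}{\left\| n_{1} \right\| \left\| n_{2} \right\|}
= \frac{-1}{\sqrt{1 + \left\| \nabla h(\mathbf{x}) \right\|^{2}}}
\;\Longrightarrow\;
\cos\theta = \frac{1}{\sqrt{1 + \left\| \nabla h(\mathbf{x}) \right\|^{2}}},
\qquad
\sin\theta = \frac{\left\| \nabla h(\mathbf{x}) \right\|}{\sqrt{1 + \left\| \nabla h(\mathbf{x}) \right\|^{2}}},
\qquad
\tan\theta = \left\| \nabla h(\mathbf{x}) \right\| .
```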
Appendix D: Tips for tuning δ
From our experimental results, we provide the following three tips for tuning δ:
-
Essentially, δ should be treated as a hyperparameter. Empirical tuning on a validation set is required, as for the learning rate or the threshold in gradient clipping. We recommend first tuning over {1, 10, 100, 1000, 10000}. However, δ is easier to tune than the learning rate, and sensitive tuning is not required (Section 4.1).
-
For the same regularization but different architectures: because a larger neural network has a larger gradient norm, its δ should be set to a larger value. In Section 4.2.1, δ was identified as 30000 for ResNet-100 and 1000 for ResNet-16.
-
For the same architecture but different regularization: for a non-regularized model, we recommend setting a smaller δ value than for a well-regularized model. This is because when δ is small, the regularization effect of the Smooth Gradient is greater. Experimentally, a δ of 1000 was confirmed for the well-regularized ResNet-16, whereas a δ of 1 was confirmed for the non-regularized PlainNet-16 (Section 4.2). Therefore, the δ setting must account for how regularization is applied, even within the same architecture.
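The first tip amounts to a coarse grid search, which can be sketched as follows; here train_and_validate is a placeholder for whatever training loop and validation metric apply, not a function from the paper:

```python
def tune_delta(train_and_validate, candidates=(1, 10, 100, 1000, 10000)):
    """Coarse grid search over delta, as recommended above.

    train_and_validate(delta) is assumed to train a model with the given
    delta and return a validation loss (lower is better).
    """
    scores = {d: train_and_validate(d) for d in candidates}
    best = min(scores, key=scores.get)
    return best, scores
```

Because δ is reported to be insensitive, this single coarse pass is typically enough; no fine-grained refinement around the winner is required.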
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kim, B.J., Choi, H., Jang, H. et al. Smooth momentum: improving lipschitzness in gradient descent. Appl Intell 53, 14233–14248 (2023). https://doi.org/10.1007/s10489-022-04207-7