Abstract
In Chap. 6, we discussed various optimization methods for deep neural network training. Although these algorithms take various forms, they are essentially gradient-based local update schemes. However, the biggest obstacle recognized by the entire community is that the loss surfaces of deep neural networks are extremely non-convex and not even smooth. This non-convexity and non-smoothness make the optimization difficult to analyze, and a main concern has been whether popular gradient-based approaches might get stuck in local minimizers.
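To make the concern concrete, the following is a minimal sketch (not from the chapter) of plain gradient descent on a toy non-convex one-dimensional loss; the particular function f and step size eta are illustrative assumptions. Depending only on the initialization, the same update rule converges either to the global minimizer or to a spurious local one.

```python
# Illustrative sketch: gradient descent on a non-convex 1-D loss
#   f(w) = w^4 - 3 w^2 + w,
# which has a global minimizer near w = -1.30 and a local (non-global)
# minimizer near w = +1.13.  The update w <- w - eta * f'(w) ends up at a
# different stationary point depending on where it starts.

def f(w):
    return w**4 - 3 * w**2 + w

def grad_f(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, eta=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= eta * grad_f(w)  # gradient-based local update
    return w

if __name__ == "__main__":
    for w0 in (-2.0, 2.0):
        w_star = gradient_descent(w0)
        print(f"start {w0:+.1f} -> w = {w_star:+.4f}, f(w) = {f(w_star):+.4f}")
```

Running this, the trajectory started at -2.0 reaches the global minimizer while the one started at +2.0 stalls at the local minimizer with a strictly larger loss value; this is the kind of behavior the chapter's analysis of deep network loss landscapes addresses.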