Abstract
Deep neural networks extract features by propagating inputs through a stack of modules, yet how the resulting representations evolve under gradient-based optimization remains poorly understood. Here we use the intrinsic dimension of the representations to study the learning dynamics and find that training undergoes a phase transition from expansion to compression under disparate training regimes. Surprisingly, this phenomenon is ubiquitous across a wide variety of model architectures, optimizers, and data sets. We demonstrate that the variation in intrinsic dimension is consistent with the complexity of the learned hypothesis, which can be quantified by the critical sample ratio, a measure rooted in adversarial robustness. We also show mathematically that this phenomenon can be analyzed in terms of the changing correlations between neurons. Although evoked activity in biological circuits follows a power-law decay, we find that the power-law exponent of the representations in deep neural networks predicts adversarial robustness well only at the end of training, not during it. Together, these results suggest that deep neural networks tend to produce robust representations by adaptively eliminating or retaining redundancy. The code is publicly available at https://github.com/cltan023/learning2022.
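For readers who want a concrete sense of how such an intrinsic-dimension curve can be traced, the sketch below estimates the intrinsic dimension of a set of layer activations with the TwoNN estimator of Facco et al. (2017). It is only an illustrative assumption about methodology: the authors' own implementation is in the repository linked above, and the function name twonn_id and the discard_fraction parameter are introduced here solely for the example.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(x, discard_fraction=0.1):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017) for an
    (n_samples, n_features) array of layer activations."""
    # Distances to the two nearest neighbours; column 0 is the point itself.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(x).kneighbors(x)
    r1, r2 = dists[:, 1], dists[:, 2]
    keep = r1 > 0                                  # drop exact duplicates (undefined ratio)
    mu = np.sort(r2[keep] / r1[keep])
    n = mu.size
    mu = mu[: int(n * (1.0 - discard_fraction))]   # trim the noisiest, largest ratios
    # F(mu) = 1 - mu^(-d)  =>  -log(1 - F) = d * log(mu), with empirical CDF F_i = i / n.
    f = np.arange(1, mu.size + 1) / n
    logs, ys = np.log(mu), -np.log(1.0 - f)
    # Least-squares slope through the origin gives the intrinsic dimension d.
    return float(np.dot(logs, ys) / np.dot(logs, logs))

Applied to activations collected from a fixed layer at successive training epochs, the resulting sequence of estimates would trace the expansion-then-compression curve described above.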
Data availability
The datasets used in this paper are available in public repositories.
Notes
Results with a decayed learning rate can be found in the supplementary material deposited at https://drive.google.com/file/d/171ffVcjG0vYcu5YZiAA74EaL-PbTkCu1/view?usp=sharing.
Acknowledgements
This work is supported in part by the National Key Research and Development Program of China under Grant 2020AAA0105601, in part by the National Natural Science Foundation of China under Grants 12371512 and 62276208, and in part by the Natural Science Basic Research Program of Shaanxi Province under Grant 2024JC-JCQN-02.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tan, C., Zhang, J., Liu, J. et al. Low-dimensional intrinsic dimension reveals a phase transition in gradient-based learning of deep neural networks. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02244-x