Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

Martin, Charles H.; Peng, Tongsu (Serena); Mahoney, Michael W.

doi:10.1038/s41467-021-24025-8

Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

Article
Open access
Published: 05 July 2021

Volume 12, article number 4122, (2021)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue

Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

Download PDF

Charles H. Martin¹,
Tongsu (Serena) Peng¹ &
Michael W. Mahoney ORCID: orcid.org/0000-0001-7920-4652²

28k Accesses
26 Citations
62 Altmetric
Explore all metrics

Abstract

In many applications, one works with neural network models trained by someone else. For such pretrained models, one may not have access to training data or test data. Moreover, one may not know details about the model, e.g., the specifics of the training data, the loss function, the hyperparameter values, etc. Given one or many pretrained models, it is a challenge to say anything about the expected performance or quality of the models. Here, we address this challenge by providing a detailed meta-analysis of hundreds of publicly available pretrained models. We examine norm-based capacity control metrics as well as power law based metrics from the recently-developed Theory of Heavy-Tailed Self Regularization. We find that norm based metrics correlate well with reported test accuracies for well-trained models, but that they often cannot distinguish well-trained versus poorly trained models. We also find that power law based metrics can do much better—quantitatively better at discriminating among series of well-trained models with a given architecture; and qualitatively better at discriminating well-trained versus poorly trained models. These methods can be used to identify when a pretrained neural network has problems that cannot be detected simply by examining training/test accuracies.

Deep Neural Network Ensembles

Deep Bilevel Learning

A systematic review on overfitting control in shallow and deep neural networks

Article 03 March 2021

Introduction

A common problem in machine learning (ML) is to evaluate the quality of a given model. A popular way to accomplish this is to train a model and then evaluate its training/testing error. There are many problems with this approach. The training/testing curves give very limited insight into the overall properties of the model; they do not take into account the (often large human and CPU/GPU) time for hyperparameter fiddling; they typically do not correlate with other properties of interest such as robustness or fairness or interpretability; and so on. A related problem, in particular in industrial-scale artificial intelligence (AI), arises when the model user is not the model developer. Then, one may not have access to the training data or the testing data. Instead, one may simply be given a model that has already been trained—a pretrained model—and need to use it as-is, or to fine-tune and/or compress it and then use it.

Naïvely—but in our experience commonly, among ML practitioners and ML theorists—if one does not have access to training or testing data, then one can say absolutely nothing about the quality of a ML model. This may be true in worst-case theory, but models are used in practice, and there is a need for a practical theory to guide that practice. Moreover, if ML is to become an industrial process, then that process will become compartmentalized in order to scale: some groups will gather data, other groups will develop models, and other groups will use those models. Users of models cannot be expected to know the precise details of how models were built, the specifics of data that were used to train the model, what was the loss function or hyperparameter values, how precisely the model was regularized, etc.

Moreover, for many large scale, practical applications, there is no obvious way to define an ideal test metric. For example, models that generate fake text or conversational chatbots may use a proxy, like perplexity, as a test metric. In the end, however, they really require human evaluation. Alternatively, models that cluster user profiles, which are widely used in areas such as marketing and advertising, are unsupervised and have no obvious labels for comparison and/or evaluation. In these and other areas, ML objectives can be poor proxies for downstream goals.

Most importantly, in industry, one faces unique practical problems such as determining whether one has enough data for a given model. Indeed, high quality, labeled data can be very expensive to acquire, and this cost can make or break a project. Methods that are developed and evaluated on any well-defined publicly available corpus of data, no matter how large or diverse or interesting, are clearly not going to be well-suited to address problems such as this. It is of great practical interest to have metrics to evaluate the quality of a trained model—in the absence of training/testing data and without any detailed knowledge of the training/testing process. There is a need for a practical theory for pretrained models which can predict how, when, and why such models can be expected to perform well or poorly.

In the absence of training and testing data, obvious quantities to examine are the weight matrices of pretrained models, e.g., properties such as norms of weight matrices and/or parameters of Power Law (PL) fits of the eigenvalues of weight matrices. Norm-based metrics have been used in traditional statistical learning theory to bound capacity and construct regularizers; and PL fits are based on statistical mechanics approaches to deep neural networks (DNNs). While we use traditional norm-based and PL-based metrics, our goals are not the traditional goals. Unlike more common ML approaches, we do not seek a bound on the generalization (e.g., by evaluating training/test errors), we do not seek a new regularizer, and we do not aim to evaluate a single model (e.g., as with hyperparameter optimization). Instead, we want to examine different models across common architecture series, and we want to compare models between different architectures themselves. In both cases, one can ask whether it is possible to predict trends in the quality of pretrained DNN models without access to training or testing data.

To answer this question, we provide a detailed empirical analysis, evaluating quality metrics for pretrained DNN models, and we do so at scale. Our approach may be viewed as a statistical meta-analysis of previously published work, where we consider a large suite of hundreds of publicly available models, mostly from computer vision (CV) and natural language processing (NLP). By now, there are many such state-of-the-art models that are publicly available, e.g., hundreds of pretrained models in CV (≥500) and NLP (≈100). (When we began this work in 2018, there were fewer than tens of such models; then in 2020, there are hundreds of such models; and we expect that in a year or two there will be an order of magnitude or more of such models.) For all these models, we have no access to training data or testing data, and we have no specific knowledge of the training/testing protocols. Here is a summary of our main results. First, norm-based metrics do a reasonably good job at predicting quality trends in well-trained CV/NLP models. Second, norm-based metrics may give spurious results when applied to poorly trained models (e.g., models trained without enough data, etc.). For example, they may exhibit what we call Scale Collapse for these models. Third, PL-based metrics can do much better at predicting quality trends in pretrained CV/NLP models. In particular, a weighted PL exponent (weighted by the log of the spectral norm of the corresponding layer) is quantitatively better at discriminating among a series of well-trained versus very-well-trained models within a given architecture series; and the (unweighted) average PL exponent is qualitatively better at discriminating well-trained versus poorly-trained models. Fourth, PL-based metrics can also be used to characterize fine-scale model properties, including what we call layer-wise Correlation Flow, in well-trained and poorly-trained models; and they can be used to evaluate model enhancements (e.g., distillation, fine-tuning, etc.). Our work provides a theoretically principled empirical evaluation—by far the largest, most detailed, and most comprehensive to date—and the theory we apply was developed previously^1,2,3. Performing such a meta-analysis of previously published work is common in certain areas, but it is quite rare in ML, where the emphasis is on develo** better training protocols.

Results

After describing our overall approach, we study in detail three well-known CV architecture series (the VGG, ResNet, and DenseNet series of models). Then, we look in detail at several variations of a popular NLP architecture series (the OpenAI GPT and GPT2 series of models), and we present results from a broader analysis of hundreds of pretrained DNN models.

Overall approach

Consider the objective/optimization function (parameterized by W_ls and b_ls) for a DNN with L layers, and weight matrices W_l and bias vectors b_l, as the minimization of a general loss function ${\mathcal{L}}$ over the training data instances and labels, $\{{{\bf{x}}}_{i},{y}_{i}\}\in {\mathcal{D}}$. For a typical supervised classification problem, the goal of training is to construct (or learn) W_l and b_l that capture correlations in the data, in the sense of solving

$$\mathop{{\rm{argmin}}}\limits_{{{\bf{W}}}_{l},{{\bf{b}}}_{L}}\ \mathop{\sum }\limits_{i=1}^{N}{\mathcal{L}}({E}_{DNN}({{\bf{x}}}_{i}),{y}_{i}),$$

(1)

where the loss function ${\mathcal{L}}(\cdot ,\cdot )$ can take on a myriad of forms⁴, and where the energy (or optimization) landscape function

$${E}_{DNN}=f({{\bf{x}}}_{i};{{\bf{W}}}_{1},\ldots ,{{\bf{W}}}_{L},{{\bf{b}}}_{1},\ldots ,{{\bf{b}}}_{L})$$

(2)

depends parametrically on the weights and biases. For a trained model, the form of the function E_DNN does not explicitly depend on the data (but it does explicitly depend on the weights and biases). The function E_DNN maps data instance vectors (x_i values) to predictions (y_i labels), and thus the output of this function does depend on the data. Therefore, one can analyze the form of E_DNN in the absence of any training or test data.

Test accuracies have been reported online for publicly available pretrained pyTorch models⁵. These models have been trained and evaluated on labeled data $\{{{\bf{x}}}_{i},{y}_{i}\}\in {\mathcal{D}}$, using standard techniques. We do not have access to this data, and we have not trained any of the models ourselves. Our methodological approach is thus similar to a statistical meta-analysis, common in biomedical research, but uncommon in ML. Computations were performed with the publicly available WeightWatcher tool (version 0.2.7)⁶. To be fully reproducible, we only examine publicly available, pretrained models, and we provide all Jupyter and Google Colab notebooks used in an accompanying github repository⁷. See Supplementary Note 1 for details.

Our approach involves analyzing individual DNN weight matrices, for (depending on the architecture) fully connected and/or convolutional layers. Each DNN layer contains one or more layer 2D N_l × M_l weight matrices, W_l, or pre-activation maps, W_i,l, e.g., extracted from 2D Convolutional layers, where N > M. (We may drop the i and/or i, l subscripts below.) The best performing quality metrics depend on the norms and/or spectral properties of each weight matrix, W, and/or, equivalently, it’s empirical correlation matrix, X = W^TW. To evaluate the quality of state-of-the-art DNNs, we consider the following metrics:

$$\,{\text{Frobenius}}\; {\text{Norm}}\,:\parallel {\bf{W}}{\parallel }_{F}^{2}=\parallel {\bf{X}}{\parallel }_{F}=\mathop{\sum }\limits_{i = 1}^{M}{\lambda }_{i}$$

(3)

$$\,{\text{Spectral}}\; {\text{Norm}}\,:\parallel {\bf{W}}{\parallel }_{\infty }^{2}=\parallel {\bf{X}}{\parallel }_{\infty }={\lambda }_{max}$$

(4)

$$\,{\text{Weighted}}\; {\text{Alpha}}\,:\hat{\alpha }=\alpha\, {\mathrm{log}}\,{\lambda }_{max}$$

(5)

$$\alpha {\mbox{-}}{\rm{Norm}}({\rm{or}}\,\, \alpha {\mbox{-}}{\rm{Shatten}}\; {\rm{Norm}}):\parallel {\bf{W}}{\parallel }_{2\alpha }^{2\alpha }=\parallel {\bf{X}}{\parallel }_{\alpha }^{\alpha }=\mathop{\sum }\limits_{i = 1}^{M}{\lambda }_{i}^{\alpha }.$$

(6)

To perform diagnostics on potentially problematic DNNs, we will decompose $\hat{\alpha }$ into its two components, α and λ_max. Here, λ_i is the i^th eigenvalue of the X, λ_max is the maximum eigenvalue, and α is the fitted PL exponent. These eigenvalues are squares of the singular values σ_i of W, ${\lambda }_{i}={\sigma }_{i}^{2}$. All four metrics can be computed easily from DNN weight matrices. The first two metrics are well-known in ML. The last two metrics deserve special mention, as they depend on an empirical parameter α that is the PL exponent that arises in the recently developed Heavy Tailed Self Regularization (HT-SR) Theory^1,2,3.

In the HT-SR Theory, one analyzes the eigenvalue spectrum, i.e., the Empirical Spectral Density (ESD), of the associated correlation matrices^1,2,3. From this, one characterizes the amount and form of correlation, and therefore implicit self-regularizartion, present in the DNN’s weight matrices. For each layer weight matrix W, of size N × M, construct the associated M × M (uncentered) correlation matrix X. Drop** the L and l, i indices, one has

$${\bf{X}}=\frac{1}{N}{{\bf{W}}}^{T}{\bf{W}}.$$

If we compute the eigenvalue spectrum of X, i.e., λ_i such that Xv_i = λ_iv_i, then the ESD of eigenvalues, ρ(λ), is just a histogram of the eigenvalues, formally written as $\rho (\lambda )=\mathop{\sum }\nolimits_{i = 1}^{M}\delta (\lambda -{\lambda }_{i}).$ Using HT-SR Theory, one characterizes the correlations in a weight matrix by examining its ESD, ρ(λ). It can be well-fit to a truncated PL distribution, given as

$$\rho (\lambda ) \sim {\lambda }^{-\alpha },$$

(7)

which is (at least) valid within a bounded range of eigenvalues λ ∈ [λ^min, λ^max].

The original work on HT-SR Theory considered a small number of NNs, including AlexNet and InceptionV3. It showed that for nearly every W, the (bulk and tail) of the ESDs can be fit to a truncated PL, and that PL exponents α nearly all lie within the range α ∈ (1.5, 5)^1,2,3. As for the mechanism responsible for these properties, statistical physics offers several possibilities^8,9, e.g., self-organized criticality^10,11 or multiplicative noise in the stochastic optimization algorithms used to train these models^12,13. Alternatively, related techniques have been used to analyze correlations and information propogation in actual spiking neurons^14,15. Our meta-analysis does not require knowledge of mechanisms; and it is not even clear that one mechanism is responsible for every case. Crucially, HT-SR Theory predicts that smaller values of α should correspond to models with better correlation over multiple size scales and thus to better models. The notion of “size scale” is well-defined in physical systems, to which this style of analysis is usually applied, but it is less well-defined in CV and NLP applications. Informally, it would correspond to pixel groups that are at a greater distance in some metric, or between sentence parts that are at a greater distance in text. Relatedly, previous work observed that smaller exponents α correspond to more implicit self-regularization and better generalization, and that we expect a linear correlation between $\hat{\alpha }$ and model quality^1,2,3.

For norm-based metrics, we use the average of the log norm, to the appropriate power. Informally, this amounts to assuming that the layer weight matrices are statistically independent, in which case we can estimate the model complexity ${\mathcal{C}}$, or test accuracy, with a standard Product Norm (which resembles a data dependent VC complexity),

$${\mathcal{C}} \sim \parallel {{\bf{W}}}_{1}\parallel \times \parallel {{\bf{W}}}_{2}\parallel \times \cdots \times \parallel {{\bf{W}}}_{L}\parallel ,$$

(8)

where ∥ ⋅ ∥ is a matrix norm. The log complexity,

$${\mathrm{log}}\,{\mathcal{C}} \sim {\mathrm{log}}\,\parallel {{\bf{W}}}_{1}\parallel +{\mathrm{log}}\,\parallel {{\bf{W}}}_{2}\parallel +\cdots +{\mathrm{log}}\,\parallel {{\bf{W}}}_{L}\parallel =\mathop{\sum }\limits_{l}{\mathrm{log}}\,\parallel {{\bf{W}}}_{l}\parallel ,$$

(9)

takes the form of an average Log Norm. For the Frobenius Norm metric and Spectral Norm metric, we can use Eq. (9) directly (since, when taking ${\mathrm{log}}\,\parallel {{\bf{W}}}_{l}{\parallel }_{F}^{2}$, the 2 comes down and out of the sum, and thus ignoring it only changes the metric by a constant factor).

The Weighted Alpha metric is an average of α_l over all layers l ∈ {1, …, l}, weighted by the size, or scale, or each matrix,

$$\hat{\alpha }=\frac{1}{L}\mathop{\sum}\limits_{l}{\alpha }_{l}{\mathrm{log}}\,{\lambda }_{max,l}\approx \langle {\mathrm{log}}\,\parallel {\bf{X}}{\parallel }_{\alpha }^{\alpha }\rangle ,$$

(10)

where L is the total number of layer weight matrices. The Weighted Alpha metric was introduced previously³, where it was shown to correlate well with trends in reported test accuracies of pretrained DNNs, albeit on a much smaller and more limited set of models than we consider here.

Based on this, in this paper, we introduce and evaluate the α-Shatten Norm metric,

$$\mathop{\sum}\limits_{l}{\mathrm{log}}\,\parallel {{\bf{X}}}_{l}{\parallel }_{{\alpha }_{l}}^{{\alpha }_{l}}=\mathop{\sum}\limits_{l}{\alpha }_{l}{\mathrm{log}}\,\parallel {{\bf{X}}}_{l}{\parallel }_{{\alpha }_{l}}.$$

(11)

For the α-Shatten Norm metric, α_l varies from layer to layer, and so in Eq. (11) it cannot be taken out of the sum. For small α, the Weighted Alpha metric approximates the Log α-Shatten norm, as can be shown with a statistical mechanics and random matrix theory derivation; and the Weighted Alpha and α-Shatten norm metrics often behave like an improved, weighted average Log Spectral Norm.

Finally, although it does less well for predicting trends in state-of-the-art model series, e.g., as depth changes, the average value of α, i.e.,

$$\bar{\alpha }=\frac{1}{L}\mathop{\sum}\limits_{l}{\alpha }_{l}=\langle \alpha \rangle ,$$

(12)

can be used to perform model diagnostics, to identify problems that cannot be detected by examining training/test accuracies, and to discriminate poorly trained models from well-trained models.

One determines α for a given layer by fitting the ESD of that layer’s weight matrix to a truncated PL, using the commonly accepted Maximum Likelihood method^16,17. This method works very well for exponents between α ∈ (2, 4); and it is adequate, although imprecise, for smaller and especially larger α¹⁸. Operationally, α is determined by using the WeightWatcher tool⁶ to fit the histogram of eigenvalues, ρ(λ), to a truncated PL,

$$\rho (\lambda ) \sim {\lambda }^{\alpha },\ \ \lambda \in [{\lambda }_{min},{\lambda }_{max}],$$

(13)

where λ_max is the largest eigenvalue of X = W^TW, and where λ_min is selected automatically to yield the best (in the sense of minimizing the K-S distance) PL fit. Each of these quantities is defined for a given layer W matrix. See Fig. 1 for an illustration.

**Fig. 1: Schematic of analyzing DNN layer weight matrices W.**

To avoid confusion, let us clarify the relationship between α and $\hat{\alpha }$. We fit the ESD of the correlation matrix X to a truncated PL, parameterized by 2 values: the PL exponent α, and the maximum eigenvalue λ_max. The PL exponent α measures the amount of correlation in a DNN layer weight matrix W. It is valid for λ ≤ λ_max, and it is scale-invariant, i.e., it does not depend on the normalization of W or X. The λ_max is a measure of the size, or scale, of W. Multiplying each α by the corresponding ${\mathrm{log}}\,{\lambda }_{max}$ weighs “bigger” layers more, and averaging this product leads to a balanced, Weighted Alpha metric $\hat{\alpha }$ for the entire DNN. We will see that for well-trained CV and NLP models, $\hat{\alpha }$ performs quite well and as expected, but for CV and NLP models that are potentially problematic or less well-trained, metrics that depend on the scale of the problem can perform anomalously. In these cases, separating $\hat{\alpha }$ into its two components, α and λ_max, and examining the distributions of each, can be helpful.

Comparison of CV models

Each of the VGG, ResNet, and DenseNet series of models consists of several pretrained DNN models, with a given base architecture, trained on the full ImageNet¹⁹ dataset, and each is distributed with the current open source pyTorch framework (version 1.4)²⁰. In addition, we examine a larger set of ResNet models, which we call the ResNet-1K series, trained on the ImageNet-1K dataset¹⁹ and provided on the OSMR Sandbox⁵. For these models, we first perform coarse model analysis, comparing and contrasting the four model series, and predicting trends in model quality. We then perform fine layer analysis, as a function of depth. This layer analysis goes beyond predicting trends in model quality, instead illustrating that PL-based metrics can provide novel insights among the VGG, ResNet/ResNet-1K, and DenseNet architectures.

We examine the performance of the four quality metrics—Log Frobenius norm ($\langle {\mathrm{log}}\,\parallel {\bf{W}}{\parallel }_{F}^{2}\rangle $), Log Spectral norm ($\langle {\mathrm{log}}\,\parallel {\bf{W}}{\parallel }_{\infty }^{2}\rangle $), Weighted Alpha ($\hat{\alpha }$), and Log α-Norm ($\langle {\mathrm{log}}\,\parallel {\bf{X}}{\parallel }_{\alpha }^{\alpha }\rangle $)—applied to each of the VGG, ResNet, ResNet-1K, and DenseNet series. Figure 2 plots the four quality metrics versus reported test accuracies²⁰, as well as a basic linear regression line, for the VGG series. (These test accuracies have been previously reported and made publicly available by others. We take them as given. We do not attempt to reproduce/verify them, since we do not permit ourselves access to training/test data.) Here, smaller norms and smaller values of $\hat{\alpha }$ imply better generalization (i.e., greater accuracy, lower error). Quantitatively, Log Spectral norm is the best; but, visually, all four metrics correlate quite well with reported Top1 accuracies. The DenseNet series has similar behavior. (These and many other such plots can be seen on our publicly available repo.)

**Fig. 2: Comparison of average Log Norm and Weighted Alpha quality metrics for CV models.**

To examine visually how the four quality metrics depend on data set size on a larger, more complex model series, we next look at results on ResNet versus ResNet-1K. Figure 3 compares the Log α-Norm metric for the full ResNet model, trained on the full ImageNet dataset, against the ResNet-1K model, trained on a much smaller ImageNet-1K data set. Here, the Log α-Norm is much better than the Log Frobenius/Spectral norm metrics (although, as Table 1 shows, it is slightly worse than the Weighted Alpha metric). The ResNet series has strong correlation (RMSE of 0.66, R² of 0.9, and Kendall-τ of −1.0), whereas the ResNet-1K series also shows good but weaker correlation (much larger RMSE of 1.9, R² of 0.88, and Kendall-τ of −0.88).

**Fig. 3: Comparison of average α-Norm quality metric for CV models.**

Table 1 Quality metrics (for RMSE, smaller is better; for R², larger is better; for Kendall-τ rank correlation, larger magnitude is better; best is bold) for reported Top1 test error for pretrained models in each architecture series.

Full size table

See Table 1 for a summary of results for Top1 accuracies for all four metrics for the VGG, ResNet, ResNet-1K, and DenseNet series. Similar results are obtained for the Top5 accuracies. The Log Frobenius norm performs well but not extremely well; the Log Spectral norm performs very well on smaller, simpler models like the VGG and DenseNet architectures; and, when moving to the larger, more complex ResNet series, the PL-based metrics, Weighted Alpha and the Log α-Norm, perform the best. Overall, though, these model series are all very well-trodden; and our results indicate that norm-based metrics and PL-based metrics can both distinguish among a series of well-trained versus very-well-trained models, with PL-based metrics performing somewhat (i.e., quantitatively) better on the larger, more complex ResNet series.

In particular, the PL-based Weighted Alpha and Log α-Norm metrics tend to perform better when there is a wider variation in the hyperparameters, going beyond just increasing the depth. In addition, sometimes the purely norm-based metrics such as the Log Spectral norm can be uncorrelated or even anti-correlated with the test accuracy, while the PL-metrrics are positively correlated. See Supplementary Note 2 for additional details.

Going beyond coarse averages to examining quality metrics for each layer weight matrix as a function of depth (or layer id), our metrics can be used to perform model diagnostics and to identify fine-scale properties in a pretrained model. Doing so involves separating $\hat{\alpha }$ into its two components, α and λ_max, and examining the distributions of each. We provide examples of this.

Figure 4 plots the PL exponent α, as a function of depth, for each layer (first layer corresponds to data, last layer to labels) for the least accurate (shallowest) and most accurate (deepest) model in each of the VGG (no BN), ResNet, and DenseNet series. (Many more such plots are available at our repo.)

**Fig. 4: PL exponent (α) versus layer id for VGG, ResNet, and DenseNet.**

In the VGG models, Fig. 4a shows that the PL exponent α systematically increases as we move down the network, from data to labels, in the Conv2D layers, starting with α ≲ 2.0 and reaching all the way to α ~ 5.0; and then, in the last three, large, fully connected (FC) layers, α stabilizes back down to α ∈ [2, 2.5]. This is seen for all the VGG models (again, only the shallowest and deepest are shown), indicating that the main effect of increasing depth is to increase the range over which α increases, thus leading to larger α values in later Conv2D layers of the VGG models. This is quite different than the behavior of either the ResNet-1K models or the DenseNet models.

For the ResNet-1K models, Fig. 4b shows that α also increases in the last few layers (more dramatically than for VGG, observe the differing scales on the Y axes). However, as the ResNet-1K models get deeper, there is a wide range over which α values tend to remain small. This is seen for other models in the ResNet-1K series, but it is most pronounced for the larger ResNet-1K (152) model, where α remains relatively stable at α ~ 2.0, from the earliest layers all the way until we reach close to the final layers.

For the DenseNet models, Fig. 4c shows that α tends to increase as the layer id increases, in particular for layers toward the end. While this is similar to the VGG models, with the DenseNet models, α values increase almost immediately after the first few layers, and the variance is much larger (in particular for the earlier and middle layers, where it can range all the way to α ~ 8.0) and much less systematic throughout the network.

Overall, Fig. 4 demonstrates that the distribution of α values among layers is architecture dependent, and that it can vary in a systematic way within an architecture series. This is to be expected, since some architectures enable better extraction of signal from the data. This also suggests that, while performing very well at predicting trends within an architecture series, PL-based metrics (as well as norm-based metrics) should be used with caution when comparing models with very different architectures.

Figure 4 can be understood in terms of what we will call Correlation Flow. Recall that the average Log α-Norm metric and the Weighted Alpha metric are based on HT-SR Theory^1,2,3, which is in turn based on the statistical mechanics of heavy tailed and strongly correlated systems^8,21,22,23. There, one expects that the weight matrices of well-trained DNNs will exhibit correlations over many size scales, as is well-known in other strongly correlated systems^8,21. This would imply that their ESDs can be well-fit by a truncated PL, with exponents α ∈ [2, 4]. Much larger values (α ≫ 6) may reflect poorer PL fits, whereas smaller values (α ~ 2), are associated with models that generalize better.

Informally, one would expect a DNN model to perform well when it facilitates the propagation of information/features across layers. In the absence of training/test data, one might hypothesize that this flow of information leaves empirical signatures on weight matrices, and that we can quantify this by measuring the PL properties of weight matrices. In this case, smaller α values correspond to layers in which information correlations between data across multiple scales are better captured^1,8. This leads to the hypothesis that small α values that are stable across multiple layers enable better correlation flow through the network. This is similar to recent work on the information bottleneck^24,24,

Methods

To be fully reproducible, we only examine publicly available, pretrained models. All of our computations were performed with the WeightWatcher tool (version 0.2.7)6, and we provide all Jupyter and Google Colab notebooks used in an accompanying github repository⁷, which includes more details and more results.

Additional details on layer weight matrices

Recall that we can express the objective/optimization function for a typical DNN with L layers and with N × M weight matrices W_l and bias vectors b_l as Eq. (2). We expect that most well-trained, production-quality models will employ one or more forms of regularization, such as Batch Normalization (BN), Dropout, etc., and many will also contain additional structure such as Skip Connections, etc. Here, we will ignore these details, and will focus only on the pretrained layer weight matrices W_l. Typically, this model would be trained on some labeled data $\{{d}_{i},{y}_{i}\}\in {\mathcal{D}}$, using Backprop, by minimizing the loss ${\mathcal{L}}$. For simplicity, we do not indicate the structural details of the layers (e.g., Dense or not, Convolutions or not, Residual/Skip Connections, etc.). Each layer is defined by one or more layer 2D weight matrices W_l, and/or the 2D feature maps W_l,i extracted from 2D Convolutional (Conv2D) layers. A typical modern DNN may have anywhere between 5 and 5000 2D layer matrices.

For each Linear Layer, we get a single (N × M) (real-valued) 2D weight matrix, denoted W_l, for layer l. This includes Dense or Fully Connected (FC) layers, as well as 1D Convolutional (Conv1D) layers, Attention matrices, etc. We ignore the bias terms b_l in this analysis. Let the aspect ratio be $Q=\frac{N}{M}$, with Q ≥ 1. For the Conv2D layers, we have a 4-index Tensor, of the form (N × M × c × d), consisting of c × d 2D feature maps of shape (N × M). We extract n_l = c × d 2D weight matrices W_l,i, one for each feature map i = [1, …, n_l] for layer l.

SVD of convolutional 2D layers

There is some ambiguity in performing spectral analysis on Conv2D layers. Each layer is a 4-index tensor of dimension (w, h, in, out), with an (w × h) filter (or kernel) and (in, out) channels. When w = h = k, it gives (k × k) tensor slices, or pre-Activation Maps, W_i,L of dimension (in × out) each. We identify 3 different approaches for running SVD on a Conv2D layer:

1.
run SVD on each pre-Activation Map W_i,L, yielding (k × k) sets of M singular values;
2.
stack the maps into a single matrix of, say, dimension ((k × k × out) × in), and run SVD to get in singular values;
3.
compute the 2D Fourier Transform (FFT) for each of the (in, out) pairs, and run SVD on the Fourier coefficients⁴², leading to ~ (k × in × out) non-zero singular values.

Each method has tradeoffs. Method (3) is mathematically sound, but computationally expensive. Method (2) is ambiguous. For our analysis, because we need thousands of runs, we select method (1), which is the fastest (and is easiest to reproduce).

Normalization of empirical matrices

Normalization is an important, if underappreciated, practical issue. Importantly, the normalization of weight matrices does not affect the PL fits because α is scale-invariant. Norm-based metrics, however, do depend strongly on the scale of the weight matrix—that is the point. To apply RMT, we usually define X with a 1/N normalization, assuming variance of σ² = 1.0. Pretrained DNNs are typically initialized with random weight matrices W₀, with ${\sigma }^{2} \sim 1/\sqrt{N}$, or some variant, e.g., the Glorot/Xavier normalization⁴³, or a $\sqrt{2/N{k}^{2}}$ normalization for Convolutional 2D Layers. With this implicit scale, we do not “renormalize” the empirical weight matrices, i.e., we use them as-is. The only exception is that we do rescale the Conv2D pre-activation maps W_i,L by $k/\sqrt{2}$ so that they are on the same scale as the Linear/Fully Connected (FC) layers.

Special consideration for NLP models

NLP models, and other models with large initial embeddings, require special care because the embedding layers frequently lack the implicit $1/\sqrt{N}$ normalization present in other layers. For example, in GPT, for most layers, the maximum eigenvalue ${\lambda }_{max} \sim {\mathcal{O}}(10-100)$, but in the first embedding layer, the maximum eigenvalue is of order N (the number of words in the embedding), or ${\lambda }_{max} \sim {\mathcal{O}}(1{0}^{5})$. For GPT and GPT2, we treat all layers as-is (although one may want to normalize the first 2 layers X by 1/N, or to treat them as outliers).

Data availability

Data analyzed during the study are all publicly available; and data generated during the study are available along with the code to generate them in our public repository (https://github.com/CalculatedContent/ww-trends-2020).

Code availability

Code sufficient to generate the results of the study is available in our public repository (https://github.com/CalculatedContent/ww-trends-2020).

References

Martin, C. H. & Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. J. Mach. Learn. Res. Accepted for publication.
Martin, C. H. & Mahoney, M. W. Traditional and heavy-tailed self regularization in neural network models. In Proceedings of the 36th International Conference on Machine Learning, pp. 4284–4293 (2019).
Martin, C. H. & Mahoney, M. W. Heavy-tailed Universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 20th SIAM International Conference on Data Mining (2020).
Janocha, K. & Czarnecki, W. M. On loss functions for deep neural networks in classification. Technical Report Preprint: ar**v:1702.05659 (2017).
Sandbox for training convolutional networks for computer vision. https://github.com/osmr/imgclsmob.
WeightWatcher (2018). https://pypi.org/project/WeightWatcher/.
https://github.com/CalculatedContent/ww-trends-2020.
Sornette, D. Critical phenomena in natural sciences: chaos, fractals, selforganization and disorder: concepts and tools. Springer-Verlag, Berlin, 2006.
Nishimori, H. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Oxford University Press, Oxford (2001).
Bak, P., Tang, C. & Wiesenfeld, K. Self-organized criticality: an explanation of 1/f noise. Phys. Rev. Lett. 59, 381–384 (1987).
Article ADS CAS Google Scholar
Watkins, N. W., Pruessner, G., Chapman, S. C., Crosby, N. B. & Jensen, H. J. 25 years of self-organized criticality: concepts and controversies. Space Sci. Rev. 198, 3–44 (2016).
Article ADS Google Scholar
Hodgkinson, L. & Mahoney, M. W. Multiplicative noise and heavy tails in stochastic optimization. Technical Report Preprint: ar**v:2006.06293 (2020).
Sornette, D. & Cont, R. Convergent multiplicative processes repelled from zero: Power laws and truncated power laws. Journal De Physique I 7, 431–444 (1997).
Article ADS Google Scholar
Shew, W. L., Yang, H., Yu, S., Roy, R. & Plenz, D. Information capacity and transmission are maximized in balanced cortical networks with neuronal avalanches. J. Neurosci. 31, 55–63 (2011).
Article CAS Google Scholar
Yu, S., Klaus, A., Yang, H. & Plenz, D. Scale-invariant neuronal avalanche dynamics and the cut-off in size distributions. PLoS ONE 9, e99761 (2014).
Article ADS Google Scholar
Clauset, A., Shalizi, C. R. & Newman, M. E. J. Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009).
Article ADS MathSciNet Google Scholar
Alstott, J., Bullmore, E. & Plenz, D. powerlaw: A python package for analysis of heavy-tailed distributions. PLoS ONE 9, e85777 (2014).
Article ADS Google Scholar
Newman, M. E. J. Power laws, Pareto distributions and Zipf’s law. Contemporary Phys. 46, 323–351 (2005).
Article ADS Google Scholar
Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Article MathSciNet Google Scholar
Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. In Annual Advances in Neural Information Processing Systems 32: Proceedings of the 2019 Conference, pp. 8024–8035 (2019).
Bouchaud, J. P. & Potters, M. Theory of Financial Risk and Derivative Pricing: From Statistical Physics to Risk Management. Cambridge University Press (2003).
Bouchaud, J. P. & Potters, M. Financial applications of random matrix theory: a short review. In G. Akemann, J. Baik, and P. Di Francesco, editors, The Oxford Handbook of Random Matrix Theory. Oxford University Press (2011).
Bun, J., Bouchaud, J.-P. & Potters, M. Cleaning large correlation matrices: tools from random matrix theory. Phys. Rep. 666, 1–109 (2017).
Article ADS MathSciNet Google Scholar
Tishby, N. & Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop, ITW 2015, pp. 1–5 (2015).
Shwartz-Ziv, R. & Tishby, N. Opening the black box of deep neural networks via information. Technical Report Preprint: ar**v:1703.00810 (2017).
Cheng, Y., Wang, D., Zhou, P. & Zhang, T. A survey of model compression and acceleration for deep neural networks. Technical Report Preprint: ar**v:1710.09282 (2017).
Intel Distiller package. https://nervanasystems.github.io/distiller.
Vaswani, A. et al. Attention is all you need. Technical Report Preprint: ar**v:1706.03762 (2017).
Wolf, T. et al. Huggingface’s transformers: State-of-the-art natural language processing. Technical Report Preprint: ar**v:1910.03771 (2019).
OpenAI GPT-2: 1.5B Release. https://openai.com/blog/gpt-2-1-5b-release/.
Belkin, M., Hsu, D., Ma, S. & Mandal, S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 116, 15849–15854 (2019).
Article MathSciNet CAS Google Scholar
He, K., Zhang, X., Ren, S. & Sun, J. Identity map**s in deep residual networks. Technical Report Preprint: ar**v:1603.05027 (2016).
Engel, A. & Van den Broeck, C. P. L. Statistical mechanics of learning. Cambridge University Press, New York, NY, USA (2001).
Martin, C. H. & Mahoney, M. W. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. Technical Report Preprint: ar**v:1710.09553 (2017).
Bahri, Y. et al. Statistical mechanics of deep learning. Ann. Rev.Condensed Matter Phys. 11, 501–528 (2020).
Article Google Scholar
Martin, C. H. & Mahoney, M. W. Statistical mechanics methods for discovering knowledge from modern production quality neural networks. In Proceedings of the 25th Annual ACM SIGKDD Conference, pp. 3239–3240 (2019).
Neyshabur, B., Tomioka, R. & Srebro, N. Norm-based capacity control in neural networks. In Proceedings of the 28th Annual Conference on Learning Theory, pp. 1376–1401 (2015).
Bartlett, P., Foster, D. J. & Telgarsky, M. Spectrally-normalized margin bounds for neural networks. Technical Report Preprint: ar**v:1706.08498 (2017).
Liao, Q., Miranda, B., Banburski, A., Hidary, J. and Poggio, T. A surprising linear relationship predicts test performance in deep networks. Technical Report Preprint: ar**v:1807.09659 (2018).
Eilertsen, G., Jönsson, D., Ropinski, T., Unger, J. and Ynnerman, A. Classifying the classifier: dissecting the weight space of neural networks. Technical Report Preprint: ar**v:2002.05688 (2020).
Unterthiner, T., Keysers, D., Gelly, S., Bousquet, O. and Tolstikhin, I. Predicting neural network accuracy from weights. Technical Report Preprint: ar**v:2002.11448 (2020).
Sedghi, H., Gupta, V. and Long, P. M. The singular values of convolutional layers. Technical Report Preprint: ar**v:1805.10408 (2018).
Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Workshop on Artificial Intelligence and Statistics, pp. 249–256 (2010).

Download references

Acknowledgements

M.W.M. would like to acknowledge ARO, DARPA, NSF, and ONR as well as the UC Berkeley BDD project and a gift from Intel for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred. We would also like to thank Amir Khosrowshahi and colleagues at Intel for helpful discussion regarding the Group Regularization distillation technique.

Author information

Authors and Affiliations

Calculation Consulting, San Francisco, CA, USA
Charles H. Martin & Tongsu (Serena) Peng
ICSI and Department of Statistics, University of California at Berkeley, Berkeley, CA, USA
Michael W. Mahoney

Authors

Charles H. Martin
View author publications
You can also search for this author in PubMed Google Scholar
Tongsu (Serena) Peng
View author publications
You can also search for this author in PubMed Google Scholar
Michael W. Mahoney
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

CHM and MWM designed the research, performed the research, and wrote the paper. TSP performed the research and edited the paper.

Corresponding author

Correspondence to Michael W. Mahoney.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Vardan Papyan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Martin, C.H., Peng, T.(. & Mahoney, M.W. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nat Commun 12, 4122 (2021). https://doi.org/10.1038/s41467-021-24025-8

Download citation

Received: 03 October 2020
Accepted: 12 May 2021
Published: 05 July 2021
DOI: https://doi.org/10.1038/s41467-021-24025-8
Springer Nature Limited

This article is cited by

Neural network structure simplification by assessing evolution in node weight magnitude
- Ralf Riedel
- Aviv Segev
Machine Learning (2024)
Lyapunov-guided representation of recurrent neural network performance
- Ryan Vogt
- Yang Zheng
- Eli Shlizerman
Neural Computing and Applications (2024)
Separation of scales and a thermodynamic description of feature learning in some CNNs
- Inbar Seroussi
- Gadi Naveh
- Zohar Ringel
Nature Communications (2023)
Learning continuous models for continuous physics
- Aditi S. Krishnapriyan
- Alejandro F. Queiruga
- Michael W. Mahoney
Communications Physics (2023)
Adversarial bandit approach for RIS-aided OFDM communication
- Messaoud Ahmed Ouameur
- Lê Dương Tuấn Anh
- Felipe Augusto Pereira de Figueiredo
EURASIP Journal on Wireless Communications and Networking (2022)

Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

From

Abstract

Similar content being viewed by others

Deep Neural Network Ensembles

Deep Bilevel Learning

A systematic review on overfitting control in shallow and deep neural networks

Introduction