Abstract
Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problems, or on untested heuristics. In many supervised machine learning applications, data labeling can be expensive and time-consuming and would benefit from a more rigorous means of estimating labeling requirements. Here, we study the problem of estimating the minimum sample size of labeled training data necessary for training computer vision models, as an exemplar for other deep learning problems. We consider the problem of identifying the minimal number of labeled data points needed to achieve a generalizable representation of the data, a minimum convergence sample (MCS). We use autoencoder loss to estimate the MCS for fully connected neural network classifiers. At sample sizes smaller than the MCS estimate, fully connected networks fail to distinguish classes, and at sample sizes above the MCS estimate, generalizability strongly correlates with the loss function of the autoencoder. We provide an easily accessible, code-free, and dataset-agnostic tool to estimate sample sizes for fully connected networks. Taken together, our findings suggest that MCS and convergence estimation are promising methods to guide sample size estimates for data collection and labeling prior to training deep learning models in computer vision.
Introduction
Supervised learning with deep neural networks has achieved state-of-the-art performance in a diverse range of applications. An adequate number of labeled samples is essential for training these systems, but most real-world data are unlabeled. Label generation can be cumbersome and expensive, and is a major barrier to the development and testing of such systems [1].
Ideally, when confronted with a task and unlabeled data, one would like to estimate how many examples need to be labeled to train a neural network for that task. In this paper, we take a step towards addressing this problem.
Consider a fully connected neural network f of pre-specified dimensions and a dataset X, which is initially unlabeled but for which labels y can be obtained when needed. We define the minimum convergence sample (MCS) for f on X to be the smallest number n such that a subset Xn of n examples drawn at random from X can be labeled and used to train f as a non-trivial classifier, that is, one whose area-under-the-curve (AUC) on a held-out test set is greater than 0.5:

MCS(f, X) = min{ n : AUC_test(f trained on Xn) > 0.5 }
Given that outcomes are balanced, an AUC > 0.5 implies that a model is able to identify some signal in the underlying data, and if that AUC is on the test dataset, this means that the signal identified by the model can generalize to unseen data. In this scenario, below the MCS, we would expect to see little or no correlation between sample size and model performance measured by AUC, whereas above the MCS we would expect to see a positive correlation.
We propose a method for empirically determining the MCS for f on X using only unlabeled data, and we call this estimate the Minimum Convergence Sample Estimate (MCSE). We do this by first constructing an autoencoder g [2], wherein the encoder has a similar number of parameters and hidden layers as f. We train g on increasingly larger (unlabeled) subsets Xi of X; this construction is intended to permit similarities in layer-wise learning between f and g. Under these circumstances, we empirically show that, at each step i, the reconstruction loss L of g is related to the generalization performance of f trained on a similarly sized sample. We also demonstrate how this can be used to determine the MCSE for f on X (Fig. 1).
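The label-free half of this procedure — training g on growing unlabeled subsets and recording held-out reconstruction loss — can be sketched as follows. Here sklearn's `MLPRegressor`, trained to map each input onto itself through a mirrored bottleneck, stands in for the autoencoder g; the data, sizes, and architecture are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPRegressor

# Hypothetical unlabeled pool; no labels are needed at this stage.
X, _ = make_classification(n_samples=3000, n_features=20, random_state=0)

def reconstruction_loss(n, seed=0):
    """Train an autoencoder g (an MLPRegressor mapping X -> X through a
    mirrored encoder/decoder) on n unlabeled examples and return its mean
    squared reconstruction error on held-out data."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(2000, size=n, replace=False)
    g = MLPRegressor(hidden_layer_sizes=(64, 8, 64), max_iter=500,
                     random_state=seed)
    g.fit(X[idx], X[idx])        # target equals input: reconstruction
    recon = g.predict(X[2000:])  # evaluate on data never used for training
    return float(np.mean((recon - X[2000:]) ** 2))

# Loss curve over increasingly larger unlabeled subsets X_i.
losses = {n: reconstruction_loss(n) for n in (20, 50, 100, 200, 500, 1000)}
```

The resulting loss-versus-sample-size curve is the object whose inflection point yields the MCSE.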
As an example, consider classification of the MNIST [3] dataset with a fully connected neural network (Fig. 2). A comparison of the test set AUC curve of f and the loss curve of autoencoder g shows that their inflection points occur at similar sample sizes. We then define the MCSE for f on MNIST as the sample size corresponding to the inflection point in the loss function of g:

MCSE(f, X) = n* such that (d²/dn²) L_g(n) = 0 at n = n*,

where L_g(n) is the reconstruction loss of g trained on n samples.
With sample sizes above the MCSE, the learnability of the dataset on f may be approximated by the ease with which g is able to embed a latent space that fully represents the data. We hypothesized the following relationship between the generalization power of a classifier and the learnability of the dataset by the corresponding autoencoder:

AUC_f(n) ≈ β · L_g(n), for n > MCSE.
β is a scaling constant. We tested this hypothesis by calculating the correlation coefficient below and above the MCSE; results are reported in Table 1. A significant R² indicates a linear correlation between loss and power. We used eight standard computer vision datasets to demonstrate our method. MNIST [3], EMNIST [4], QMNIST [5], and KMNIST [6] are character-recognition datasets composed of 28x28-pixel grayscale images. FMNIST [7].

Other pre-hoc methods, such as empirical process theory, did not extend well to non-linear methods [16]. Post-hoc methods usually involve fitting a learning curve, but fitting a learning curve is trivial for minimum convergence sample estimation because any amount of data should result in a non-zero increase in performance on a training dataset. Moreover, these methods are task-specific, data-specific, and model-specific: one learning curve has no relevance outside that specific task, model, and dataset. Nevertheless, while our experiments validate minimum convergence sample estimation on toy datasets, synthetic data, and one real-world example of medical imaging due to data availability, future work should further validate this method across more tasks and imaging types in the healthcare context.
Our second contribution is the proposal of a method to empirically estimate the MCSE for a given fully connected neural network f. This method allows users to predict the statistical power of a model without training on the entire training set during every trial. It also provides an uncertainty on the estimate, whose variance is inversely correlated with how structured the underlying data are. Our third contribution is a publicly available tool for minimum sample size estimation for fully connected neural networks.
Importantly, there are several natural opportunities to extend our work to more complex models, as discussed below. First, our paper only considered a fully connected network with a relatively simple architecture. One natural question that extends from this work involves assessing how this method fares in estimating the statistical power of convolutional or recurrent neural networks. While adding convolutions would be relatively easy via the addition of another layer, adding attention mechanisms may require additional structural modifications to fully approximate the statistical power of recurrent neural networks or transformers. For our method to be applicable to medical imaging tasks, we anticipate that extending this work to convolutional neural networks remains an important next step. Future work can validate the MCSE on more complex architectures utilizing pre-trained networks and skip connections. Second, the loss function utilized in this analysis was the reconstruction loss, which is a relatively simple choice of loss function. For variational autoencoders, the loss function changes to instead use a KL-divergence, while GANs use JS-divergence and WGANs use Wasserstein divergence [31,32,33].
After taking the second derivative of the loss function, we located the inflection point, its corresponding sample size, and the value of the autoencoder loss at that point. Figure 2 plots the autoencoder loss of g and the AUC of f against sample size. The MCSE is drawn as a vertical line in Fig. 3, coinciding with the inflection point of the autoencoder loss function, and provides a lower bound on the sample size required to improve model performance. The shaded areas represent the error, determined as the autoencoder loss at the log(MCSE ± 1) sample sizes.
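The inflection-point step above can be sketched numerically: differentiate the sampled loss curve twice and find the sign change of the curvature. The synthetic sigmoidal loss curve below is an illustrative stand-in for measured autoencoder losses at log-spaced sample sizes:

```python
import numpy as np

# Illustrative loss curve: a smooth sigmoidal decay in log10(sample size),
# standing in for autoencoder losses measured at log-spaced subset sizes.
log_n = np.linspace(1, 4, 50)                    # log10 of sample size
loss = 1.0 / (1.0 + np.exp(3 * (log_n - 2.5)))   # inflection at log_n = 2.5

# Second derivative via repeated numerical differentiation; the inflection
# point is where the curvature changes sign.
d2 = np.gradient(np.gradient(loss, log_n), log_n)
crossings = np.where(np.diff(np.sign(d2)) != 0)[0]

# Sample size at the inflection point: the MCSE for this toy curve.
mcse = 10 ** log_n[crossings[0] + 1]
```

For this toy curve the recovered inflection point lies near 10^2.5 ≈ 316 samples, matching the curve's construction.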
Third, we determined the correlation between autoencoder loss and area-under-the-curve using R², Kendall’s τ, and Spearman’s ρ (Table 1). These values demonstrated a significant correlation above, but not below, the MCSE (the inflection point of the autoencoder loss curve). To better demonstrate this finding, we plotted the results for Eq. (3) for values above the MCSE (Fig. 3).
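The three statistics above are standard calls in scipy. A minimal sketch on hypothetical loss/AUC pairs (the numbers are invented for illustration, not measurements from the paper):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements above the MCSE: autoencoder loss and test AUC
# at matched sample sizes (monotone toy values for illustration only).
loss = np.array([0.42, 0.35, 0.30, 0.24, 0.20, 0.15, 0.11])
auc = np.array([0.61, 0.66, 0.70, 0.74, 0.79, 0.84, 0.88])

# Linear correlation: R^2 is the squared Pearson r from the regression.
lin = stats.linregress(loss, auc)
r_squared = lin.rvalue ** 2

# Rank correlations: negative, since AUC rises as loss falls.
tau, tau_p = stats.kendalltau(loss, auc)
rho, rho_p = stats.spearmanr(loss, auc)
```

For this perfectly monotone toy data, τ and ρ are exactly −1 and R² is close to 1; real measurements would be noisier.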
Finally, we generalize these results to an n-dimensional hyper-cube and validate them on a medical imaging dataset. To generate the synthetic dataset, we randomly sampled data points from an n-dimensional hyper-cube with side lengths equal to the class separation, then trained a fully connected network classifier and an autoencoder on the result. For the medical dataset, we use the publicly available Kaggle Chest X-ray dataset [34] and accurately predict the minimum number of labeled samples required to learn a meaningful classifier using a fully connected network.
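One way to realize the hyper-cube construction described above is to draw each class uniformly from a cube whose side length equals the class separation, offsetting the second cube by that separation. The function name and exact offsetting scheme below are illustrative assumptions, not the paper's generator:

```python
import numpy as np

def hypercube_classes(n_samples, dim, separation, seed=0):
    """Sample a two-class synthetic dataset: each class is drawn uniformly
    from a dim-dimensional hyper-cube with side length equal to the class
    separation, with the second cube offset by that separation along every
    axis. (Hypothetical construction for illustration.)"""
    rng = np.random.default_rng(seed)
    half = n_samples // 2
    a = rng.uniform(0, separation, size=(half, dim))
    b = rng.uniform(0, separation, size=(n_samples - half, dim)) + separation
    X = np.vstack([a, b])
    y = np.concatenate([np.zeros(half), np.ones(n_samples - half)])
    return X, y

X, y = hypercube_classes(1000, dim=8, separation=2.0)
```

Varying `separation` controls how easily a classifier (and the corresponding autoencoder) can capture the class structure.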
Analysis of the publicly available NIH CXR dataset was carried out with approval of the Institutional Review Board at Icahn School of Medicine at Mount Sinai, New York, NY 10019. The requirement for informed consent was waived as the dataset was completely de-identified.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The image datasets can be accessed via the torchvision library at https://pytorch.org/vision/stable/datasets.html. The synthetic dataset was generated via the sklearn library at https://scikit-learn.org/stable/index.html. The de-identified X-ray dataset is publicly available at https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia.
References
Sambasivan, N. et al. “Everyone wants to do the model work, not the data work”: data cascades in high-stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21 (Association for Computing Machinery, New York, NY, USA, 2021).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning, chap. 14: Autoencoders (MIT Press, 2016).
Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29, 141–142 (2012).
Cohen, G., Afshar, S., Tapson, J. & Van Schaik, A. EMNIST: extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), 2921–2926 (IEEE, 2017).
Yadav, C. & Bottou, L. Cold case: the lost MNIST digits. Advances in Neural Information Processing Systems 32 (2019).
Uday Prabhu, V. Kannada-MNIST: a new handwritten digits dataset for the Kannada language. Preprint at https://arxiv.org/abs/1908.01242 (2019).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint at https://arxiv.org/abs/1708.07747 (2017).
Krizhevsky, A. Learning multiple layers of features from tiny images. Tech. Rep. (2009).
Coates, A., Ng, A. & Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, 215-223 (2011).
Yadav, C. & Bottou, L. Cold case: the lost MNIST digits. Adv. Neural Inf. Process. Syst. 32 (2019).
Northcutt, C. G., Athalye, A. & Mueller, J. Pervasive label errors in test sets destabilize machine learning benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) (2021).
Northcutt, C., Jiang, L. & Chuang, I. Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021).
Jain, S. et al. Visualchexbert: addressing the discrepancy between radiology report labels and image labels. In Proceedings of the Conference on Health, Inference, and Learning, 105-115 (2021).
Guss, W. H. & Salakhutdinov, R. On characterizing the capacity of neural networks using algebraic topology. Preprint at https://arxiv.org/abs/1802.04443 (2018).
Goldfarb, D. Understanding deep neural networks using topological data analysis. Preprint at https://arxiv.org/abs/1811.00852 (2018).
Du, S. et al. How many samples are needed to estimate a convolutional or recurrent neural network? stat 1050, 30 (2019).
Du, S. & Lee, J. On the power of over-parametrization in neural networks with quadratic activation. In International conference on machine learning, 1329-1338 (PMLR, 2018).
Van de Geer, S. A. Applications of Empirical Process Theory, vol. 91 (Cambridge University Press, 2000).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
Van Engelen, J. E. & Hoos, H. H. A survey on semi-supervised learning. Mach. Learn. 109, 373–440 (2020).
Chen, I. Y. et al. Ethical machine learning in healthcare. Ann Rev. Biomed. Data Sci. 4, 123–144 (2021).
Heo, M. & Leon, A. C. Statistical power and sample size requirements for three level hierarchical cluster randomized trials. Biometrics 64, 1256–1262 (2008).
Röhmel, J. Statistical considerations of fda and cpmp rules for the investigation of new anti-bacterial products. Stat. Med. 20, 2561–2571 (2001).
Strasak, A. M., Zaman, Q., Pfeiffer, K. P., Göbel, G. & Ulmer, H. Statistical errors in medical research-a review of common pitfalls. Swiss Med. Wkly. 137, 44–49 (2007).
Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013).
Carneiro, C. F., Moulin, T. C., Macleod, M. R. & Amaral, O. B. Effect size and statistical power in the rodent fear conditioning literature–a systematic review. PloS one 13, e0196258 (2018).
Amanatkar, H. R., Papagiannopoulos, B. & Grossberg, G. T. Analysis of recent failures of disease modifying therapies in alzheimer’s disease suggesting a new methodology for future studies. Expert Rev. Neurother. 17, 7–16 (2017).
He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 30–36 (2019).
Balki, I. et al. Sample-size determination methodologies for machine learning in medical imaging research: a systematic review. Can. Assoc. Radiologists J. 70, 344–353 (2019).
Dobbin, K. K. & Simon, R. M. Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8, 101–117 (2007).
Doersch, C. Tutorial on variational autoencoders. Stat 1050, 13 (2016).
Jolicoeur-Martineau, A. GANs beyond divergence minimization. Preprint at https://arxiv.org/abs/1809.02145 (2018).
Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223 (PMLR, 2017).
Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131 (2018).
Acknowledgements
This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai. Research reported in this paper was supported by the Office of Research Infrastructure of the National Institutes of Health under award number S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Authors and Affiliations
Contributions
Conception and design: F.F.G., A.S.S., G.N.N. and E.K.O. Funding obtainment: G.N.N. and E.K.O. Provision of study data: F.F.G. Collection and assembly of data: F.F.G. and A.S.S. Data analysis and interpretation: all authors. Manuscript writing: all authors. Final approval of the manuscript: all authors. G.N.N. and E.K.O. are joint senior authors.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gulamali, F.F., Sawant, A.S., Kovatch, P. et al. Autoencoders for sample size estimation for fully connected neural network classifiers. npj Digit. Med. 5, 180 (2022). https://doi.org/10.1038/s41746-022-00728-0