When Does Self-supervision Improve Few-Shot Learning?

  • Conference paper
  • Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

We investigate the role of self-supervised learning (SSL) in the context of few-shot learning. Although recent research has shown the benefits of SSL on large unlabeled datasets, its utility on small datasets is relatively unexplored. We find that SSL reduces the relative error rate of few-shot meta-learners by 4%–27%, even when the datasets are small and only images within the datasets themselves are used. The improvements are greater when the training set is smaller or the task is more challenging. Although the benefits of SSL may increase with larger training sets, we observe that SSL can hurt performance when the distributions of the images used for meta-learning and SSL are different. We conduct a systematic study by varying the degree of domain shift and analyzing the performance of several meta-learners on a multitude of domains. Based on this analysis, we present a technique that, for a given dataset, automatically selects images for SSL from a large, generic pool of unlabeled images, which provides further improvements.

References

1. Achille, A., et al.: Task2Vec: task embedding for meta-learning. In: ICCV (2019)
2. Asano, Y.M., Rupprecht, C., Vedaldi, A.: A critical analysis of self-supervision, or what we can learn from a single image. In: ICLR (2020)
3. Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910 (2019)
4. Bertinetto, L., Henriques, J.F., Torr, P.H., Vedaldi, A.: Meta-learning with differentiable closed-form solvers. In: ICLR (2019)
5. Bojanowski, P., Joulin, A.: Unsupervised learning by predicting noise. In: ICML (2017)
6. Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: CVPR (2019)
7. Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: ECCV (2018)
8. Caron, M., Bojanowski, P., Mairal, J., Joulin, A.: Unsupervised pre-training of image features on non-curated data. In: ICCV (2019)
9. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
10. Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C., Huang, J.B.: A closer look at few-shot classification. In: ICLR (2019)
11. Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In: ICML (2018)
12. Cui, Y., Song, Y., Sun, C., Howard, A., Belongie, S.: Large scale fine-grained categorization and domain-specific transfer learning. In: CVPR (2018)
13. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)
14. Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: ICCV (2017)
15. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: NeurIPS (2014)
16. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)
17. Ghiasi, G., Lin, T.Y., Le, Q.V.: DropBlock: a regularization method for convolutional networks. In: NeurIPS (2018)
18. Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., Cord, M.: Boosting few-shot visual learning with self-supervision. In: ICCV (2019)
19. Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting. In: CVPR (2018)
20. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
21. Goyal, P., Mahajan, D., Gupta, A., Misra, I.: Scaling and benchmarking self-supervised visual representation learning. In: ICCV (2019)
22. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
24. Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., Oord, A.v.d.: Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019)
25. Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: ICLR (2019)
26. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)
27. Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L.: Novel dataset for fine-grained image categorization. In: First Workshop on Fine-Grained Visual Categorization, CVPR (2011)
28. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
29. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2 (2015)
30. Kokkinos, I.: UberNet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In: CVPR (2017)
31. Kolesnikov, A., Zhai, X., Beyer, L.: Revisiting self-supervised visual representation learning. In: CVPR (2019)
32. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR), Sydney, Australia (2013)
33. Kuznetsova, A., et al.: The Open Images Dataset V4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982 (2018)
34. Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization. In: ECCV (2016)
35. Lee, K., Maji, S., Ravichandran, A., Soatto, S.: Meta-learning with differentiable convex optimization. In: CVPR (2019)
36. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
37. Maninis, K.K., Radosavovic, I., Kokkinos, I.: Attentive single-tasking of multiple tasks. In: CVPR (2019)
38. Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. In: CVPR (2020)
39. Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q.V., Pang, R.: Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056 (2018)
40. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: CVPR (2006)
41. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)
42. Noroozi, M., Pirsiavash, H., Favaro, P.: Representation learning by learning to count. In: ICCV (2017)
43. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
44. Oreshkin, B., López, P.R., Lacoste, A.: TADAM: task dependent adaptive metric for improved few-shot learning. In: NeurIPS (2018)
45. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
46. Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: CVPR (2016)
47. Qi, H., Brown, M., Lowe, D.G.: Low-shot learning with imprinted weights. In: CVPR (2018)
48. Qiao, S., Liu, C., Shen, W., Yuille, A.L.: Few-shot image recognition by predicting parameters from activations. In: CVPR (2018)
49. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR (2017)
50. Ren, M., et al.: Meta-learning for semi-supervised few-shot classification. In: ICLR (2018)
51. Ren, Z., Lee, Y.J.: Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR (2018)
52. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)
53. Rusu, A.A., et al.: Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960 (2018)
54. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. In: NeurIPS (2018)
55. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)
56. Su, J.C., Maji, S.: Adapting models to signal degradation using distillation. In: BMVC (2017)
57. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: relation network for few-shot learning. In: CVPR (2018)
58. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: ECCV (2020)
59. Trinh, T.H., Luong, M.T., Le, Q.V.: Selfie: self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940 (2019)
60. Van Horn, G., et al.: The iNaturalist species classification and detection dataset. In: CVPR (2018)
61. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: NeurIPS (2016)
62. Wallace, B., Hariharan, B.: Extending and analyzing self-supervised learning across domains. In: ECCV (2020)
63. Welinder, P., et al.: Caltech-UCSD Birds 200. Technical report CNS-TR-2010-001, California Institute of Technology (2010)
64. Wertheimer, D., Hariharan, B.: Few-shot learning with localization in realistic settings. In: CVPR (2019)
65. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)
66. Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S.: Taskonomy: disentangling task transfer learning. In: CVPR, pp. 3712–3722 (2018)
67. Zhai, X., Oliver, A., Kolesnikov, A., Beyer, L.: S4L: self-supervised semi-supervised learning. In: ICCV (2019)
68. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV (2016)
69. Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: unsupervised learning by cross-channel prediction. In: CVPR (2017)

Acknowledgement

This project is supported in part by NSF #1749833 and a DARPA LwLL grant. Our experiments were performed on the University of Massachusetts Amherst GPU cluster obtained under the Collaborative Fund managed by the Massachusetts Technology Collaborative.

Author information

Correspondence to Jong-Chyi Su.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3544 KB)

A Appendix

In Appendix A.1 and Appendix A.2, we provide the numbers behind the figures in Sect. 4.1 and Sect. 4.3, respectively. We show that SSL can also improve traditional fine-grained classification in Appendix A.3 and visualize the learned models in Appendix A.4. Finally, we describe the implementation details in Appendix A.5.

Table 4. Performance on few-shot learning tasks. The mean accuracy (%) and the 95% confidence interval over 600 randomly chosen test experiments are reported for various combinations of loss functions. The top part shows the accuracy on 5-way 5-shot classification tasks, while the bottom part shows the same on 20-way 5-shot. Adding self-supervised losses to the ProtoNet loss improves the 5-way classification results on all seven datasets; on 20-way classification, the improvements are even larger. The last row shows results with a randomly initialized network. The top part of this table corresponds to Fig. 2 in Sect. 4.1.
Table 5. Performance on harder few-shot learning tasks. Accuracies on the novel set are reported for 5-way 5-shot and 20-way 5-shot classification with degraded inputs, and with a subset (20%) of the images in the base set. The loss of color or resolution and the smaller training set make the tasks more challenging, as seen by the drop in the performance of the ProtoNet baseline. However, the improvements from the jigsaw puzzle loss are larger than those reported in Table 4.

1.1 A.1 Results on Few-Shot Learning

Table 4 shows the performance of ProtoNet with different self-supervision on seven datasets. We also test the accuracy of the model on novel classes when trained only with self-supervision on the base set of images. Compared to the randomly initialized model (“None” rows), training the network to predict rotations gives around 2% to 21% improvements on all datasets, while solving jigsaw puzzles only improves on aircrafts and flowers. However, these numbers are significantly worse than learning with supervised labels on the base set, in line with the current literature.

Table 5 shows the performance of ProtoNet with jigsaw puzzle loss on harder benchmarks. The results on the degraded version of the datasets are shown in the top part, and the bottom part shows the results of using only 20% of the images in the base categories. The gains using SSL are higher in this setting.

1.2 A.2 Results on Selecting Images for SSL

Table 6 shows the performance of selecting images for self-supervision, a tabular version of Fig. 5 in Sect. 4.3. “Pool (random)” samples images uniformly from the pool, and thus in proportion to the size of each dataset, while “pool (weight)” tends to pick more images from related domains.

Table 6. Performance on selecting images for self-supervision. Adding more unlabeled images selected randomly from a pool often hurts the performance. Selecting similar images by importance weights improves on all five datasets.

1.3 A.3 Results on Standard Fine-Grained Classification

Here we present results on standard fine-grained classification tasks. In contrast to few-shot transfer learning, all classes are seen in the training set and the test set contains novel images from the same classes. We use the standard training and test splits provided with the datasets. We investigate whether SSL can improve the training of deep networks (e.g., a ResNet-18) trained from scratch (i.e., with random initialization) using only the images and labels in the training set. The accuracy obtained with various loss functions is shown in Table 7. Training with self-supervision improves performance across datasets. On birds, cars, and dogs, predicting rotation gives 4.1%, 3.1%, and 3.0% improvements, while on aircrafts and flowers, the jigsaw puzzle loss yields 0.9% and 3.6% improvements.

Table 7. Performance on standard fine-grained classification tasks. Per-image accuracy (%) on the test set are reported. Using self-supervision improves the accuracy of a ResNet-18 network trained from scratch over the baseline of supervised training with cross-entropy (softmax) loss on all five datasets.

1.4 A.4 Visualization of Learned Models

To understand why the representation generalizes, we visualize what pixels contribute the most to the correct classification for various models. In particular, for each image and model, we compute the gradient of the logits (predictions before softmax) for the correct class with respect to the input image. The magnitude of the gradient at each pixel is a proxy for its importance and is visualized as “saliency maps”. Figure 7 shows these maps for various images and models trained with and without self-supervision on the standard classification task. It appears that the self-supervised models tend to focus more on the foreground regions, as seen by the amount of bright pixels within the bounding box. One hypothesis is that self-supervised tasks force the model to rely less on background features, which might be accidentally correlated to the class labels. For fine-grained recognition, localization indeed improves performance when training from few examples (see [64] for a contemporary evaluation of the role of localization for few-shot learning).
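This computation is easy to reproduce. Below is a minimal PyTorch sketch (not the authors' released code) that computes such a map for a generic classifier; reducing over color channels by taking the maximum gradient magnitude is a common convention and an assumption here.

```python
import torch

def saliency_map(model, image, label):
    """Magnitude of the gradient of the correct-class logit w.r.t. the input.

    image: normalized float tensor of shape (3, H, W).
    label: integer index of the correct class.
    Returns an (H, W) tensor of per-pixel gradient magnitudes.
    """
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)  # add a batch dimension
    logits = model(x)                            # predictions before softmax
    logits[0, label].backward()                  # gradient of the correct-class logit
    # Assumption: reduce over color channels by taking the maximum magnitude.
    return x.grad[0].abs().max(dim=0).values
```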

Fig. 7.
figure 7

Saliency maps for various images and models. For each image we visualize the magnitude of the gradient with respect to the correct class for models trained with various loss functions. The magnitudes are scaled to the same range for easier visualization. The models trained with self-supervision often have lower energy on the background regions when there is clutter. We highlight a few examples with blue borders; the bounding box of the object in each image is shown in red. (Color figure online)

1.5 A.5 Experimental Details

Optimization Details on Few-Shot Learning. During training, especially for the jigsaw puzzle task, we found it beneficial not to track the running mean and variance of the batch normalization layers, and instead to estimate them independently for each batch. We hypothesize that this is because the inputs contain both full-sized images and small patches, which may have different statistics. We do the same at test time. We found that the accuracy goes up as the batch size increases but saturates at a size of 64.
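In PyTorch this corresponds to disabling the running-statistics buffers of the batch-normalization layers so that batch statistics are used both during training and at test time. A minimal sketch of one way to do this (an assumption about the exact mechanics, not the authors' code):

```python
import torch.nn as nn

def use_batch_statistics(model: nn.Module) -> nn.Module:
    """Make every BatchNorm2d layer normalize with per-batch statistics."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = False
            # With no stored buffers, PyTorch falls back to batch statistics
            # even in eval() mode.
            m.running_mean = None
            m.running_var = None
    return model
```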

When training with both supervised and self-supervised losses, a trade-off term \(\lambda \) between the losses can be used, so the total loss is \(\mathcal{L} = (1-\lambda )\mathcal{L}_s + \lambda \mathcal{L}_{ss}\). We find that simply using \(\lambda =0.5\) works best, except when training on mini- and tiered-ImageNet with the jigsaw loss, where we set \(\lambda =0.3\). We suspect that this is because the variation in image sizes and categories is higher there, making the self-supervised task harder to train with limited data. When both the jigsaw and rotation losses are used, we set \(\lambda =0.5\) and average the two self-supervised losses to obtain \(\mathcal{L}_{ss}\).
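A small sketch of this weighting scheme (a hypothetical helper, not the released code):

```python
import torch

def combined_loss(loss_s, loss_ss_list, lam=0.5):
    """Total loss L = (1 - lam) * L_s + lam * L_ss.

    loss_s: supervised loss (e.g. the ProtoNet loss).
    loss_ss_list: list of self-supervised losses; averaged when more than one is used.
    lam: 0.5 in most settings, 0.3 for mini-/tiered-ImageNet with the jigsaw loss.
    """
    loss_ss = torch.stack(loss_ss_list).mean()
    return (1 - lam) * loss_s + lam * loss_ss
```

For example, `combined_loss(proto_loss, [jigsaw_loss, rotation_loss])` corresponds to the setting where both self-supervised losses are averaged.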

For training meta-learners, we use 16 query images per class for each training episode. When only 20% of the labeled data are used, we use 5 query images per class. For MAML, we use 10 query images and the approximation method for backpropagation proposed in [10] to reduce GPU memory usage. When training with a self-supervised loss, it is added to the loss of the outer loop. We use PyTorch [45] for our experiments.

Optimization Details on Domain Classifier. For the domain classifier, we first obtain penultimate-layer features (2048-dimensional) from a ResNet-101 model pre-trained on ImageNet [52]. We then train a binary logistic regression model with weight decay using L-BFGS for 1000 iterations. Images from the labeled dataset form the positive class and images from the pool of unlabeled data form the negative class. A subset of negative images, ten times the number of positive images, is selected uniformly at random. The loss for the positive class is scaled by the inverse of its frequency to account for the much larger number of negative examples.
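The following sketch reproduces this setup with scikit-learn (an assumption; the original implementation is not specified), given pre-extracted 2048-dimensional features and a pool at least ten times the size of the labeled set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_domain_classifier(pos_feats, pool_feats, neg_ratio=10, seed=0):
    """Binary logistic regression separating the labeled dataset from the unlabeled pool.

    pos_feats:  (n_pos, 2048) penultimate-layer ResNet-101 features of the labeled dataset.
    pool_feats: (n_pool, 2048) features of the unlabeled pool (n_pool >= neg_ratio * n_pos).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pool_feats), size=neg_ratio * len(pos_feats), replace=False)
    X = np.concatenate([pos_feats, pool_feats[idx]])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(idx))])
    # The L2 penalty plays the role of weight decay; class_weight="balanced"
    # rescales each class loss by the inverse of its frequency.
    clf = LogisticRegression(solver="lbfgs", max_iter=1000, class_weight="balanced")
    return clf.fit(X, y)
```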

Optimization Details on Standard Classification. For standard classification (Appendix A.3) we train a ResNet-18 network from scratch. All models are trained with the Adam optimizer with a learning rate of 0.001 for 600 epochs with a batch size of 16. Following the conventional setting, we track the running statistics of the batch normalization layers for the softmax baselines (i.e., without self-supervised loss), but do not track these statistics when training with self-supervision.
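A sketch of this training configuration (the data loader and class count are placeholders for the dataset at hand):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def train_standard_classifier(train_loader, num_classes, epochs=600, lr=1e-3):
    """ResNet-18 trained from scratch with Adam; train_loader is assumed to
    yield batches of 16 (image, label) pairs."""
    model = resnet18(num_classes=num_classes)  # random initialization, no pre-training
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```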

Architectures for Self-supervised Tasks. For the jigsaw puzzle task, we follow the architecture of [41], where the task was first proposed. The ResNet-18 produces a 512-dimensional feature for each input patch, on top of which we add a fully-connected (fc) layer with 512 units. The nine patches give nine 512-dimensional feature vectors, which are concatenated. This is followed by an fc layer projecting the 4608-dimensional vector to 4096 dimensions, and a final fc layer with 35 outputs corresponding to the 35 permutations used for the jigsaw task.
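A sketch of this head in PyTorch; only the layer sizes are taken from the description above, and the ReLU activations between layers are an assumption:

```python
import torch
import torch.nn as nn

class JigsawHead(nn.Module):
    """Jigsaw head on top of per-patch ResNet-18 features.

    Each of the 9 patches yields a 512-d feature; after a shared 512-unit fc layer
    the patch features are concatenated (9 x 512 = 4608), projected to 4096
    dimensions, and classified into the 35 permutations.
    """
    def __init__(self):
        super().__init__()
        self.fc_patch = nn.Linear(512, 512)
        self.fc_concat = nn.Linear(9 * 512, 4096)
        self.classifier = nn.Linear(4096, 35)

    def forward(self, patch_feats):        # patch_feats: (batch, 9, 512)
        x = torch.relu(self.fc_patch(patch_feats))
        x = x.reshape(x.size(0), -1)       # concatenate the 9 patch features
        x = torch.relu(self.fc_concat(x))
        return self.classifier(x)
```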

For the rotation prediction task, the 512-dimensional output of the ResNet-18 is passed through three fc layers with {512, 128, 128, 4} units, and the final 4-dimensional output corresponds to the four rotation angles. A ReLU activation and a dropout layer with a dropout probability of 0.5 are added between consecutive fc layers.
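A sketch of one reading of this head, taking the listed sizes as the dimensions traversed by the three fc layers (this interpretation is an assumption):

```python
import torch.nn as nn

# Rotation-prediction head: 512 -> 128 -> 128 -> 4, with ReLU and dropout in between.
rotation_head = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 4),   # one output per rotation angle (0, 90, 180, 270 degrees)
)
```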


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this paper

Su, JC., Maji, S., Hariharan, B. (2020). When Does Self-supervision Improve Few-Shot Learning?. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12352. Springer, Cham. https://doi.org/10.1007/978-3-030-58571-6_38

  • DOI: https://doi.org/10.1007/978-3-030-58571-6_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58570-9

  • Online ISBN: 978-3-030-58571-6

  • eBook Packages: Computer Science, Computer Science (R0)
