MVP: Multimodality-Guided Visual Pre-training

  • Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13690)

Abstract

Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representations by aligning the token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP by performing standard experiments, i.e., pre-training the ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition tasks. In particular, with ViT-Base/16 pre-trained for 300 epochs, MVP reports a 52.4% mIoU on ADE20K, surpassing BEIT (the baseline and previous state-of-the-art) by an impressive margin of 6.8%.
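
To make the objective concrete, here is a minimal PyTorch-style sketch of the pre-training loss suggested by the abstract: a student ViT receives a masked image, and its token-level outputs are aligned with features produced by the frozen vision branch of CLIP on the full image. The function names, the masking interface (patch_mask), and the choice of a negative cosine-similarity loss over the masked positions are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def mvp_loss(student_vit, clip_vision, images, mask):
        """images: (B, 3, H, W); mask: (B, N) bool, True at masked patch tokens.

        Hypothetical interfaces: student_vit(images, patch_mask=...) returns
        (B, N, D) token features predicted from the masked image;
        clip_vision(images) returns (B, N, D) patch features of the full image.
        """
        with torch.no_grad():
            # Frozen CLIP vision branch provides the token-level target space.
            target = F.normalize(clip_vision(images), dim=-1)

        pred = F.normalize(student_vit(images, patch_mask=mask), dim=-1)

        # Negative cosine similarity between predictions and CLIP targets,
        # averaged over the masked token positions (an assumption of this sketch).
        cos = (pred * target).sum(dim=-1)        # (B, N)
        m = mask.float()
        return -(cos * m).sum() / m.sum().clamp(min=1.0)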

Notes

  1. It borrows the framework of masked language modeling (MLM) [11, 19] from natural language processing.

References

  1. Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: Data2vec: a general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555 (2022)

  2. Bao, H., Dong, L., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)

  3. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)

  4. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)

  5. Chen, X., et al.: Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026 (2022)

  6. Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)

  7. Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057 (2021)

  8. Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt Representation Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7

  9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

  10. Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11162–11173 (2021)

  11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  12. Dong, X., et al.: PeCo: perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)

  13. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  14. Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020)

  15. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)

  16. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)

  17. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)

  18. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)

  19. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)

  20. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  21. Li, X., et al.: OSCAR: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  22. Ni, M., et al.: M3P: learning universal representations via multitask multilingual multimodal pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3977–3986 (2021)

  23. Noroozi, M., Favaro, P.: Unsupervised Learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5

  24. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)

  25. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  26. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565 (2018)

  27. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)

  28. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243 (2020)

  29. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)

  30. Van Den Oord, A., et al.: Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 30, 1–10 (2017)

  31. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103 (2008)

  32. Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133 (2021)

  33. Wei, C., et al.: Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1910–1919 (2019)

  34. Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: Exploring the diversity and invariance in yourself for visual pre-training task. arXiv preprint arXiv:2106.00537 (2021)

  35. Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886 (2021)

  36. Yuan, X., et al.: Multimodal contrastive training for visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6995–7004 (2021)

  37. Zhang, M., Jiang, S., Cui, Z., Garnett, R., Chen, Y.: D-VAE: a variational autoencoder for directed acyclic graphs. Adv. Neural Inf. Process. Syst. 32 (2019)

  38. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. 127(3), 302–321 (2019)

  39. Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Contract 61836011 and U20A20183. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.

Author information

Corresponding author

Correspondence to Longhui Wei.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q. (2022). MVP: Multimodality-Guided Visual Pre-training. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13690. Springer, Cham. https://doi.org/10.1007/978-3-031-20056-4_20

  • DOI: https://doi.org/10.1007/978-3-031-20056-4_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20055-7

  • Online ISBN: 978-3-031-20056-4

  • eBook Packages: Computer Science, Computer Science (R0)
