MVP: Multimodality-Guided Visual Pre-training

  • Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13690)

Abstract

Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representations by aligning the token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP by performing standard experiments, i.e., pre-training the ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition tasks. In particular, with ViT-Base/16 pre-trained for 300 epochs, MVP reports a 52.4% mIoU on ADE20K, surpassing BEIT (the baseline and previous state-of-the-art) by an impressive margin of 6.8%.
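
To make the objective concrete, here is a minimal PyTorch-style sketch of the pre-training loss suggested by the abstract: a student ViT receives a masked image, and its token-level outputs are aligned with features produced by the frozen vision branch of CLIP on the full image. The function names, the masking interface (patch_mask), and the choice of a negative cosine-similarity loss over the masked positions are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def mvp_loss(student_vit, clip_vision, images, mask):
        """images: (B, 3, H, W); mask: (B, N) bool, True at masked patch tokens.

        Hypothetical interfaces: student_vit(images, patch_mask=...) returns
        (B, N, D) token features predicted from the masked image;
        clip_vision(images) returns (B, N, D) patch features of the full image.
        """
        with torch.no_grad():
            # Frozen CLIP vision branch provides the token-level target space.
            target = F.normalize(clip_vision(images), dim=-1)

        pred = F.normalize(student_vit(images, patch_mask=mask), dim=-1)

        # Negative cosine similarity between predictions and CLIP targets,
        # averaged over the masked token positions (an assumption of this sketch).
        cos = (pred * target).sum(dim=-1)        # (B, N)
        m = mask.float()
        return -(cos * m).sum() / m.sum().clamp(min=1.0)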

Notes

  1. It borrows the framework of masked language modeling (MLM) [11, 19] from natural language processing.

References

  1. Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: Data2vec: a general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555 (2022)

  2. Bao, H., Dong, L., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)

  3. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)

  4. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)

  5. Chen, X., et al.: Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026 (2022)

  6. Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)

  7. Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057 (2021)

  8. Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt Representation Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7

  9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

  10. Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11162–11173 (2021)

  11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  12. Dong, X., et al.: PeCo: perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)

  13. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  14. Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020)

  15. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)

  16. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)

  17. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)

  18. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)

  19. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)

  20. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  21. Li, X., et al.: OSCAR: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  22. Ni, M., et al.: M3P: learning universal representations via multitask multilingual multimodal pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3977–3986 (2021)

  23. Noroozi, M., Favaro, P.: Unsupervised Learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5

  24. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)

  25. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  26. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565 (2018)

  27. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)

  28. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243 (2020)

  29. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)

  30. Van Den Oord, A., et al.: Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 30, 1–10 (2017)

  31. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103 (2008)

  32. Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133 (2021)

  33. Wei, C., et al.: Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1910–1919 (2019)

  34. Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: Exploring the diversity and invariance in yourself for visual pre-training task. arXiv preprint arXiv:2106.00537 (2021)

  35. Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886 (2021)

  36. Yuan, X., et al.: Multimodal contrastive training for visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6995–7004 (2021)

  37. Zhang, M., Jiang, S., Cui, Z., Garnett, R., Chen, Y.: D-VAE: a variational autoencoder for directed acyclic graphs. Adv. Neural Inf. Process. Syst. 32 (2019)

  38. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. 127(3), 302–321 (2019)

  39. Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Contract 61836011 and U20A20183. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.

Author information

Corresponding author

Correspondence to Longhui Wei.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q. (2022). MVP: Multimodality-Guided Visual Pre-training. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13690. Springer, Cham. https://doi.org/10.1007/978-3-031-20056-4_20

  • DOI: https://doi.org/10.1007/978-3-031-20056-4_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20055-7

  • Online ISBN: 978-3-031-20056-4

  • eBook Packages: Computer Science, Computer Science (R0)
