Abstract
Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representations by aligning token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP through standard experiments, i.e., pre-training ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition tasks. In particular, with ViT-Base/16 pre-trained for 300 epochs, MVP reports 52.4% mIoU on ADE20K, surpassing BEIT (the baseline and previous state-of-the-art) by an impressive margin of 6.8%.
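To make the objective described above concrete, below is a minimal PyTorch-style sketch, not the official implementation: it assumes a student ViT and a frozen CLIP vision encoder that both emit per-patch token features of the same dimension, and uses a linear projection head with a cosine loss on masked positions; these interfaces, the head, and the exact loss form are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MVPPretrainSketch(nn.Module):
    # Illustrative sketch of multimodality-guided MIM: a student ViT is trained
    # to predict, for masked patches, the token-level features produced by a
    # frozen CLIP vision branch acting as the "tokenizer" / target space.
    def __init__(self, student_vit: nn.Module, clip_visual: nn.Module, dim: int = 768):
        super().__init__()
        self.student = student_vit            # trainable ViT-Base/16 backbone (hypothetical interface)
        self.teacher = clip_visual.eval()     # frozen CLIP vision encoder providing targets
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, dim)       # projects student tokens into the teacher feature space

    def forward(self, images: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W); mask: (B, N) boolean, True where a patch is masked.
        with torch.no_grad():
            target = self.teacher(images)                 # (B, N, dim) CLIP patch-token features
        pred = self.head(self.student(images, mask))      # (B, N, dim) student predictions
        # Align student and teacher features on masked positions only,
        # via a cosine-similarity loss on L2-normalised tokens.
        pred = F.normalize(pred, dim=-1)
        target = F.normalize(target, dim=-1)
        return (1.0 - (pred * target).sum(dim=-1))[mask].mean()

In a setup of this kind, only the student ViT would be kept after pre-training and fine-tuned on downstream tasks such as ADE20K segmentation, while the frozen teacher and the projection head would be discarded.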
References
Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: Data2vec: a general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555 (2022)
Bao, H., Dong, L., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)
Chen, X., et al.: Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026 (2022)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057 (2021)
Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt Representation Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11162–11173 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dong, X., et al.: PeCo: perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Li, X., et al.: OSCAR: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Ni, M., et al.: M3P: learning universal representations via multitask multilingual multimodal pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3977–3986 (2021)
Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565 (2018)
Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243 (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Van Den Oord, A., et al.: Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 30, 1–10 (2017)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103 (2008)
Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133 (2021)
Wei, C., et al.: Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1910–1919 (2019)
Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: Exploring the diversity and invariance in yourself for visual pre-training task. arXiv preprint arXiv:2106.00537 (2021)
Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886 (2021)
Yuan, X., et al.: Multimodal contrastive training for visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6995–7004 (2021)
Zhang, M., Jiang, S., Cui, Z., Garnett, R., Chen, Y.: D-VAE: a variational autoencoder for directed acyclic graphs. Adv. Neural Inf. Process. Syst. 32 (2019)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. 127(3), 302–321 (2019)
Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Contract 61836011 and U20A20183. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q. (2022). MVP: Multimodality-Guided Visual Pre-training. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13690. Springer, Cham. https://doi.org/10.1007/978-3-031-20056-4_20