Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling

International Journal of Computer Vision

Abstract

The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. It is often performed in a weakly-supervised manner, where only video-level event labels are provided, i.e., the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video event labels for each modality. However, such labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that can explicitly assign labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit large-scale pretrained models, namely CLIP and CLAP, to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function to exploit these pseudo labels by taking into account their category-richness and segment-richness. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. We perform extensive experiments on the LLP dataset, demonstrate the effectiveness of each proposed design, and achieve state-of-the-art video parsing performance on all types of event parsing, i.e., audio event, visual event, and audio-visual event. Furthermore, our experiments verify that the high-quality segment-level pseudo labels provided by our method can be flexibly combined with other audio-visual video parsing backbones and consistently improve their performance. We also examine the proposed pseudo label generation strategy on the related weakly-supervised audio-visual event localization task, and the experimental results again verify the benefits and generalization of our method.
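For intuition about the first step, the following is a minimal sketch of how segment-level visual pseudo labels can be obtained with a pretrained CLIP model: each segment's key frame is scored against text prompts built from the video's known event categories, and events whose similarity exceeds a threshold become that segment's labels. The frame sampling, prompt template, helper name, and threshold here are illustrative assumptions rather than the paper's exact procedure; the audio counterpart would use a pretrained CLAP model analogously.

```python
# Illustrative sketch of segment-level visual pseudo labeling with CLIP.
# Assumptions (ours, not the paper's): one key frame per 1-second segment,
# a simple "a photo of {event}" prompt, and a fixed cosine-similarity threshold.
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def visual_pseudo_labels(frame_paths, video_events, threshold=0.25):
    """frame_paths: one key frame per segment; video_events: the video-level labels.

    Returns a [num_segments, num_video_events] multi-hot tensor of pseudo labels.
    """
    prompts = clip.tokenize([f"a photo of {e}" for e in video_events]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        labels = []
        for path in frame_paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            sims = (img_feat @ text_feat.T).squeeze(0)  # cosine similarity per event
            labels.append((sims > threshold).float())   # keep events scored above threshold
    return torch.stack(labels)
```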


Data Availability

The LLP dataset for the studied audio-visual video parsing task is publicly available from the official website https://github.com/YapengTian/AVVP-ECCV20. The AVE dataset for the audio-visual event localization task can be accessed at https://github.com/YapengTian/AVE-ECCV18. Tables 1–9 and Figures 3–7 were generated with our source code, which will be released at our GitHub repository https://github.com/jasongief/VPLAN.


Acknowledgements

We would like to thank Dr. Liang Zheng for his constructive suggestions. We also sincerely appreciate the anonymous reviewers for their positive feedback and professional comments.

Author information

Corresponding authors

Correspondence to Dan Guo or Meng Wang.

Additional information

Communicated by Gunhee Kim.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Key R&D Program of China (No. 2022YFB4500601), the National Natural Science Foundation of China (72188101, 62272144, 62020106007, and U20A20183), the Major Project of Anhui Province (202203a05020011), and the Fundamental Research Funds for the Central Universities. This work was also partially supported by the National Key R&D Program of China (No. 2022ZD0160100).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhou, J., Guo, D., Zhong, Y. et al. Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02142-3

