Robust Analysis of Visual Question Answering Based on Irrelevant Visual Contextual Information

  • Conference paper
Artificial Intelligence in China (AIC 2023)

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1043)


Abstract

The growth of social media data has driven the development of Visual Question Answering (VQA), and the robustness of VQA models has emerged as a new public concern. In this work, we extend SwapMix [1] to the VQA-v2 benchmark dataset, removing its restriction to a single manually annotated dataset. We then propose a semantic correlation measure to determine whether a question is related to the entities in an image. Finally, we apply SwapMix-based data augmentation to MCAN, a mainstream model studied in SwapMix, on the VQA-v2 dataset to further improve its robustness; this raises the model's accuracy under attention perturbation from 35.34% to 44.01%. The experiments demonstrate that the proposed method is effective in diagnosing model robustness and regularizing over-reliance on visual context.
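For intuition, here is a minimal sketch of the two ideas the abstract describes: a semantic-correlation filter that marks which detected objects a question actually refers to, and a SwapMix-style perturbation that swaps out the features of the remaining, irrelevant objects. All names (semantic_correlation, swapmix_perturb, vqa_model, feature_bank) and the similarity threshold are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def semantic_correlation(question_emb, object_embs, threshold=0.3):
        """Mark an object as question-relevant when the cosine similarity
        between its embedding and the question embedding exceeds a
        threshold (the 0.3 value is an illustrative assumption)."""
        sims = F.cosine_similarity(object_embs, question_emb.unsqueeze(0), dim=-1)
        return sims > threshold  # boolean mask over detected objects

    def swapmix_perturb(object_feats, relevant_mask, feature_bank):
        """SwapMix-style perturbation: replace the features of
        question-irrelevant objects with features of random objects
        drawn from other images (feature_bank)."""
        perturbed = object_feats.clone()
        irrelevant = (~relevant_mask).nonzero(as_tuple=True)[0]
        swap_idx = torch.randint(0, feature_bank.size(0), (irrelevant.numel(),))
        perturbed[irrelevant] = feature_bank[swap_idx]
        return perturbed

    # Robustness diagnosis: if the predicted answer flips when only
    # irrelevant context is swapped, the model over-relies on visual
    # context. For augmentation, the perturbed features are instead fed
    # back as extra training inputs (vqa_model is hypothetical):
    # relevant = semantic_correlation(q_emb, obj_embs)
    # answer_before = vqa_model(obj_feats, question)
    # answer_after = vqa_model(swapmix_perturb(obj_feats, relevant, bank), question)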


References

  1. Gupta V, Li Z, Kortylewski A, et al (2022) SwapMix: diagnosing and regularizing the over-reliance on visual context in visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5078–5088

  2. Jiang Y, Natarajan V, Chen X, et al (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. https://doi.org/10.48550/arXiv.1807.09956

  3. Lu J, Batra D, Parikh D, et al (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. https://doi.org/10.48550/arXiv.1908.02265

  4. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. https://doi.org/10.48550/arXiv.2103.00020

  5. Yu Z, Yu J, Cui Y, et al (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290

  6. Bao X, Zhou C, Xiao K, et al (2021) A review of visual question answering research. J Softw 32(8):23. https://doi.org/10.13328/j.cnki.jos.006215

  7. Shah M, Chen X, Rohrbach M, Parikh D (2019) Cycle-consistency for robust visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6649–6658

  8. Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. PMLR, pp 12888–12900

  9. Zhu Y, Nayak NM, Roy-Chowdhury AK (2013) Context-aware modeling and recognition of activities in video. In: 2013 IEEE conference on computer vision and pattern recognition, Portland, OR, USA

  10. Yang X, Yang XD, Liu MY, et al (2019) STEP: spatio-temporal progressive learning for video action detection. arXiv preprint

  11. Pathak D, Krahenbuhl P, Donahue J, et al (2016) Context encoders: feature learning by inpainting. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA

  12. Du Y, Duan G, Ai H (2012) Context-based text detection in natural scenes. In: 2012 19th IEEE international conference on image processing, IEEE, pp 1857–1860

  13. Dvornik N, Mairal J, Schmid C (2018) Modeling visual context is key to augmenting object detection datasets. In: Computer vision—ECCV 2018. Lecture notes in computer science, pp 375–391

  14. Wang X, Ji Q (2015) Video event recognition with deep hierarchical context model. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4418–4427

  15. Sun J, Jacobs DW (2017) Seeing what is not there: learning context to determine where objects are missing. arXiv preprint

Acknowledgements

This work was supported by the National People's Committee Young and Middle-aged Talent Training Program (MZR20007), the Xinjiang Uygur Autonomous Region Regional Collaborative Innovation Special Project (Science and Technology Aid Program) (2022E02035), the Hubei Provincial Administration of Traditional Chinese Medicine Research Project on Traditional Chinese Medicine (ZY2023M064), and the General Project of the University Industry-University Research Innovation Fund, Science and Technology Development Center, Ministry of Education (2020QT08).

Author information

Corresponding author

Correspondence to Ze** Zhang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Qin, J., Zhang, Z., Ye, Z., Liu, Z., Cheng, Y. (2024). Robust Analysis of Visual Question Answering Based on Irrelevant Visual Contextual Information. In: Wang, W., Mu, J., Liu, X., Na, Z.N. (eds) Artificial Intelligence in China. AIC 2023. Lecture Notes in Electrical Engineering, vol 1043. Springer, Singapore. https://doi.org/10.1007/978-981-99-7545-7_43

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-7545-7_43

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7544-0

  • Online ISBN: 978-981-99-7545-7

  • eBook Packages: Computer Science, Computer Science (R0)
