Abstract
The growth of social media data has driven the development of Visual Question Answering (VQA), and the robustness of VQA models has emerged as a pressing concern. In this work, we generalize SwapMix [1], which was previously applicable only to a specific manually annotated dataset, to the VQA-v2 benchmark. We then propose a semantic correlation measure to decide whether a question is related to the entities in an image. Finally, we apply SwapMix-style data augmentation to MCAN, a mainstream VQA model, on VQA-v2 to further improve its robustness; this raises the model's accuracy under attention perturbation from 35.34% to 44.01%. Our experiments demonstrate that the proposed method is effective for diagnosing model robustness and regularizing over-reliance on visual context.
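To make the perturbation concrete, below is a minimal Python sketch of a SwapMix-style context swap gated by a question-relevance test. The function names (swapmix_perturb, semantically_relevant) and the keyword matcher are illustrative assumptions, not the authors' released code; the paper's semantic correlation measure is more sophisticated than simple token overlap.

```python
# Minimal sketch of a SwapMix-style visual-context perturbation.
# Hypothetical helper names; this is NOT the authors' released code.
import numpy as np


def semantically_relevant(question: str, object_label: str) -> bool:
    """Crude relevance test: any token of the object label appears in the
    question. A real system would use embedding similarity instead."""
    q_tokens = set(question.lower().split())
    return any(tok in q_tokens for tok in object_label.lower().split())


def swapmix_perturb(features, labels, question, donor_features, rng):
    """Swap region features of question-irrelevant objects with randomly
    chosen region features from a donor image. A robust model's answer
    should be unchanged by this perturbation."""
    perturbed = features.copy()
    for i, label in enumerate(labels):
        if not semantically_relevant(question, label):
            j = rng.integers(donor_features.shape[0])
            perturbed[i] = donor_features[j]  # inject unrelated context
    return perturbed


# Toy usage: 3 detected regions with 2048-d features (bottom-up attention size)
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 2048)).astype(np.float32)
donor = rng.normal(size=(5, 2048)).astype(np.float32)
labels = ["dog", "tree", "frisbee"]
question = "what is the dog catching"

out = swapmix_perturb(feats, labels, question, donor, rng)
changed = [labels[i] for i in range(3) if not np.allclose(out[i], feats[i])]
print(changed)  # -> ['tree', 'frisbee']
```

Note that the toy matcher also flags "frisbee" as irrelevant even though it is the likely answer; avoiding exactly this kind of mistake is what a learned semantic correlation measure is for.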
References
Gupta V, Li Z, Kortylewski A, et al (2022) Swapmix: diagnosing and regularizing the over-reliance on visual context in visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5078–5088
Jiang Y, Natarajan V, Chen X, et al (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. https://doi.org/10.48550/arXiv.1807.09956
Lu J, Batra D, Parikh D, et al (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. https://doi.org/10.48550/arXiv.1908.02265
Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. https://doi.org/10.48550/arXiv.2103.00020
Yu Z, Yu J, Cui Y, et al (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290
Bao X, Zhou C, Xiao K et al (2021) A review of visual question answering research. J Softw 32(8):23. https://doi.org/10.13328/j.cnki.jos.006215
Shah M, Chen X, Rohrbach M, Parikh D (2019) Cycle-consistency for robust visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6649–6658
Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. PMLR, pp 12888–12900
Zhu Y, Nayak NM, Roy-Chowdhury AK (2013) Context-aware modeling and recognition of activities in video. In: 2013 IEEE conference on computer vision and pattern recognition, Portland, OR, USA
Yang X, Yang XD, Liu MY, et al (2019) STEP: spatio-temporal progressive learning for video action detection. arXiv preprint
Pathak D, Krahenbuhl P, Donahue J, et al (2016) Context encoders: feature learning by inpainting. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA
Du Y, Duan G, Ai H (2012) Context-based text detection in natural scenes. In: 2012 19th IEEE international conference on image processing, IEEE, pp 1857–1860
Dvornik N, Mairal J, Schmid C (2018) Modeling visual context is key to augmenting object detection datasets. In: Computer vision—ECCV 2018. Lecture notes in computer science. Springer, pp 375–391
Wang X, Ji Q (2015) Video event recognition with deep hierarchical context model. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4418–4427
Sun J, Jacobs DW (2017) Seeing what is not there: learning context to determine where objects are missing. arXiv preprint
Acknowledgements
This work was supported by the National People’s Committee Young and Middle-aged Talent Training Program (MZR20007), the Xinjiang Uygur Autonomous Region Regional Collaborative Innovation Special Project (Science and Technology Aid Program) (2022E02035), the Hubei Provincial Administration of Traditional Chinese Medicine Research Project on Traditional Chinese Medicine (ZY2023M064), and the General Project of University Industry-University Research Innovation Fund, Science and Technology Development Center, Ministry of Education (2020QT08).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qin, J., Zhang, Z., Ye, Z., Liu, Z., Cheng, Y. (2024). Robust Analysis of Visual Question Answering Based on Irrelevant Visual Contextual Information. In: Wang, W., Mu, J., Liu, X., Na, Z.N. (eds) Artificial Intelligence in China. AIC 2023. Lecture Notes in Electrical Engineering, vol 1043. Springer, Singapore. https://doi.org/10.1007/978-981-99-7545-7_43
DOI: https://doi.org/10.1007/978-981-99-7545-7_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7544-0
Online ISBN: 978-981-99-7545-7
eBook Packages: Computer Science, Computer Science (R0)