Abstract
The growth of social media data has driven the development of Visual Question Answering (VQA), and the robustness of VQA models has emerged as a pressing concern. In this work, we generalize SwapMix [1], which was previously applicable only to a specific manually annotated dataset, to the VQA-v2 benchmark. We then propose a semantic correlation measure to decide whether a question is related to the entities in an image. Finally, we apply SwapMix-style data augmentation to MCAN, a mainstream VQA model, on VQA-v2 to further improve its robustness; this raises the model's accuracy under attention perturbation from 35.34% to 44.01%. Our experiments demonstrate that the proposed method is effective for diagnosing model robustness and regularizing over-reliance on visual context.
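To make the perturbation concrete, below is a minimal Python sketch of a SwapMix-style context swap gated by a question-relevance test. The function names (swapmix_perturb, semantically_relevant) and the keyword matcher are illustrative assumptions, not the authors' released code; the paper's semantic correlation measure is more sophisticated than simple token overlap.

```python
# Minimal sketch of a SwapMix-style visual-context perturbation.
# Hypothetical helper names; this is NOT the authors' released code.
import numpy as np


def semantically_relevant(question: str, object_label: str) -> bool:
    """Crude relevance test: any token of the object label appears in the
    question. A real system would use embedding similarity instead."""
    q_tokens = set(question.lower().split())
    return any(tok in q_tokens for tok in object_label.lower().split())


def swapmix_perturb(features, labels, question, donor_features, rng):
    """Swap region features of question-irrelevant objects with randomly
    chosen region features from a donor image. A robust model's answer
    should be unchanged by this perturbation."""
    perturbed = features.copy()
    for i, label in enumerate(labels):
        if not semantically_relevant(question, label):
            j = rng.integers(donor_features.shape[0])
            perturbed[i] = donor_features[j]  # inject unrelated context
    return perturbed


# Toy usage: 3 detected regions with 2048-d features (bottom-up attention size)
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 2048)).astype(np.float32)
donor = rng.normal(size=(5, 2048)).astype(np.float32)
labels = ["dog", "tree", "frisbee"]
question = "what is the dog catching"

out = swapmix_perturb(feats, labels, question, donor, rng)
changed = [labels[i] for i in range(3) if not np.allclose(out[i], feats[i])]
print(changed)  # -> ['tree', 'frisbee']
```

Note that the toy matcher also flags "frisbee" as irrelevant even though it is the likely answer; avoiding exactly this kind of mistake is what a learned semantic correlation measure is for.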
References
Gupta V, Li Z, Kortylewski A, et al (2022) Swapmix: diagnosing and regularizing the over-reliance on visual context in visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5078–5088
Jiang Y, Natarajan V, Chen X, et al (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. https://doi.org/10.48550/arXiv.1807.09956
Lu J, Batra D, Parikh D, et al (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. https://doi.org/10.48550/arXiv.1908.02265
Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. https://doi.org/10.48550/arXiv.2103.00020
Yu Z, Yu J, Cui Y, et al (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290
Bao X, Zhou C, Xiao K et al (2021) A review of visual question answering research. J Softw 32(8):23. https://doi.org/10.13328/j.cnki.jos.006215
Shah M, Chen X, Rohrbach M, Parikh D (2019) Cycle-consistency for robust visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6649–6658
Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. PMLR, pp 12888–12900
Zhu Y, Nayak NM, Roy-Chowdhury AK (2013) Context-aware modeling and recognition of activities in video. In: 2013 IEEE conference on computer vision and pattern recognition, Portland, OR, USA
Yang X, Yang XD, Liu MY, et al (2019) STEP: spatio-temporal progressive learning for video action detection. arXiv preprint
Pathak D, Krahenbuhl P, Donahue J, et al (2016) Context encoders: feature learning by inpainting. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA
Du Y, Duan G, Ai H (2012) Context-based text detection in natural scenes. In: 2012 19th IEEE international conference on image processing, IEEE, pp 1857–1860
Dvornik N, Mairal J, Schmid C (2018) Modeling visual context is key to augmenting object detection datasets. In: Computer vision—ECCV 2018. Lecture notes in computer science. Springer, pp 375–391
Wang X, Ji Q (2015) Video event recognition with deep hierarchical context model. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4418–4427
Sun J, Jacobs DW (2017) Seeing what is not there: learning context to determine where objects are missing. arXiv preprint
Acknowledgements
This work was supported by the National People’s Committee Young and Middle-aged Talent Training Program (MZR20007), the Xinjiang Uygur Autonomous Region Regional Collaborative Innovation Special Project (Science and Technology Aid Program) (2022E02035), the Hubei Provincial Administration of Traditional Chinese Medicine Research Project on Traditional Chinese Medicine (ZY2023M064), and the General Project of University Industry-University Research Innovation Fund, Science and Technology Development Center, Ministry of Education (2020QT08).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qin, J., Zhang, Z., Ye, Z., Liu, Z., Cheng, Y. (2024). Robust Analysis of Visual Question Answering Based on Irrelevant Visual Contextual Information. In: Wang, W., Mu, J., Liu, X., Na, Z.N. (eds) Artificial Intelligence in China. AIC 2023. Lecture Notes in Electrical Engineering, vol 1043. Springer, Singapore. https://doi.org/10.1007/978-981-99-7545-7_43
DOI: https://doi.org/10.1007/978-981-99-7545-7_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7544-0
Online ISBN: 978-981-99-7545-7
eBook Packages: Computer Science, Computer Science (R0)