Multi-scale motivated neural network for image-text matching

Qin, Xueyang; Li, Lishuang; Pang, Guangyao

doi:10.1007/s11042-023-15321-0

Published: 25 May 2023

Multimedia Tools and Applications, Volume 83, pages 4383–4407 (2024)

Abstract

Existing mainstream image-text matching methods usually measure the relevance of an image-text pair by capturing and aggregating the affinities between textual words and visual regions, but they fail to consider the single-scale matching bias caused by the imbalance between image and text information. In this paper, we design a Multi-Scale Motivated Neural Network (MSMNN) model for image-text matching. In contrast to previous single-scale methods, MSMNN encourages neural networks to extract visual and textual features at three scales (local, global and salient features), exploiting the complementarity of multi-scale matching to reduce the bias of single-scale matching. We also propose a cross-modal interaction module that fuses visual and textual features during local alignment, so as to discover the potential relationships between image-text pairs. Furthermore, we propose a matching score fusion algorithm that fuses the matching results from the three levels and can be freely applied to other initial image-text matching results with negligible overhead. Extensive experiments validate the effectiveness of our method, which achieves competitive results on two well-known datasets, Flickr30K and MSCOCO, improving the evaluation metric mR by 1.04% and 0.59%, respectively, over the advanced method used for comparison.
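
To make the idea of fusing matching scores from several scales concrete, the sketch below combines three image-by-text similarity matrices and scores the result with Recall@K. It is only an illustration under stated assumptions: the abstract does not specify the fusion algorithm, so the weighted sum, the weights, the toy matrices and the function names `fuse_scores`, `recall_at_k` and `mean_recall` are all hypothetical. The sketch also assumes that mR denotes the mean of Recall@K over both retrieval directions, as is common in image-text retrieval work; the paper may define it differently.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def fuse_scores(score_mats: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum of same-shaped image-by-text similarity matrices.

    Assumption: a plain weighted sum stands in for the paper's fusion algorithm.
    """
    if len(score_mats) != len(weights):
        raise ValueError("need one weight per score matrix")
    rows, cols = len(score_mats[0]), len(score_mats[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for mat, w in zip(score_mats, weights):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += w * mat[i][j]
    return fused


def recall_at_k(scores: Matrix, k: int) -> float:
    """Recall@K in percent, assuming column i is the correct match for row i."""
    hits = 0
    for i, row in enumerate(scores):
        # Indices of the k highest-scoring candidates for query i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(scores)


def mean_recall(scores: Matrix, ks: Sequence[int] = (1, 5, 10)) -> float:
    """Average Recall@K over image-to-text and text-to-image retrieval."""
    transposed = [list(col) for col in zip(*scores)]  # text-to-image view
    recalls = [recall_at_k(m, k) for m in (scores, transposed) for k in ks]
    return sum(recalls) / len(recalls)


# Toy scores for 3 images (rows) and 3 texts (columns); values are made up.
local = [[0.9, 0.2, 0.1], [0.3, 0.4, 0.5], [0.1, 0.2, 0.8]]
global_ = [[0.7, 0.1, 0.2], [0.2, 0.8, 0.1], [0.3, 0.1, 0.6]]
salient = [[0.8, 0.3, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.7]]

# Hypothetical weights; in practice they would be tuned on validation data.
fused = fuse_scores([local, global_, salient], weights=[0.4, 0.3, 0.3])
```

In this toy case the local scores alone rank the wrong text first for the second image, while the fused scores rank the correct text first for every image, which is the kind of single-scale bias the abstract describes.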

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China [grant number 62076048]; and the Science and Technology Innovation Foundation of Dalian [grant number 2020JJ26GX035].

Author information

Authors and Affiliations

School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
Xueyang Qin & Lishuang Li
School of Data Science and Software Engineering, Wuzhou University, Wuzhou, 543002, China
Guangyao Pang

Corresponding author

Correspondence to Lishuang Li.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflicts of interest regarding this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qin, X., Li, L. & Pang, G. Multi-scale motivated neural network for image-text matching. Multimed Tools Appl 83, 4383–4407 (2024). https://doi.org/10.1007/s11042-023-15321-0

Received: 05 August 2022
Revised: 07 January 2023
Accepted: 06 April 2023
Published: 25 May 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11042-023-15321-0
