Abstract
Mainstream image-text matching methods usually measure the relevance of an image-text pair by capturing and aggregating the affinities between textual words and visual regions, but they fail to consider the single-scale matching bias caused by the imbalance between image and text information. In this paper, we design a Multi-Scale Motivated Neural Network (MSMNN) model for image-text matching. In contrast to previous single-scale methods, MSMNN encourages the network to extract visual and textual features at three scales (local, global, and salient features), which exploits the complementarity of multi-scale matching to reduce the bias of single-scale matching. We also propose a cross-modal interaction module that fuses visual and textual features during local alignment, so as to discover potential relationships between image-text pairs. Furthermore, we propose a matching score fusion algorithm that fuses the matching results from the three levels; it can be freely applied to other initial image-text matching results with negligible overhead. Extensive experiments validate the effectiveness of our method, which achieves competitive results on two well-known datasets, Flickr30K and MSCOCO, with improvements of 1.04% and 0.59%, respectively, on the mR evaluation metric compared with the advanced method.
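The abstract does not spell out how the matching score fusion algorithm works, so the following is only a minimal sketch of one generic way to fuse matching scores from three levels: a weighted sum of min-max normalized score matrices. The function names (`min_max_normalize`, `fuse_scores`, `rank_texts`), the equal weights, and the toy scores are all hypothetical and are not taken from the paper; this is not the authors' algorithm.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows: images, columns: candidate texts


def min_max_normalize(scores: Matrix) -> Matrix:
    """Rescale a score matrix to [0, 1] so different levels are comparable."""
    flat = [s for row in scores for s in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # All scores identical: no ranking information at this level.
        return [[0.0 for _ in row] for row in scores]
    return [[(s - lo) / span for s in row] for row in scores]


def fuse_scores(matrices: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum of normalized score matrices (assumed late-fusion rule)."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per score matrix")
    normalized = [min_max_normalize(m) for m in matrices]
    rows, cols = len(normalized[0]), len(normalized[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, normalized)) for j in range(cols)]
        for i in range(rows)
    ]


def rank_texts(fused: Matrix, image_index: int) -> List[int]:
    """Return text indices for one image, best match first."""
    row = fused[image_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


# Toy scores for 2 images and 3 texts at each (hypothetical) level.
local_scores = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]]
global_scores = [[0.5, 0.6, 0.1], [0.2, 0.7, 0.3]]
salient_scores = [[0.7, 0.1, 0.2], [0.1, 0.9, 0.5]]

fused = fuse_scores(
    [local_scores, global_scores, salient_scores],
    weights=[1 / 3, 1 / 3, 1 / 3],  # equal weights are an assumption
)
print(rank_texts(fused, 0))
print(rank_texts(fused, 1))
```

In this toy data the global scores alone would rank text 1 above text 0 for the first image, while the fused scores rank text 0 first. This illustrates the complementarity idea described in the abstract. Because the fusion only reads precomputed score matrices, it costs a few arithmetic operations per image-text pair.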
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China [grant number 62076048]; and the Science and Technology Innovation Foundation of Dalian [grant number 2020JJ26GX035].
Ethics declarations
Conflict of Interests
The authors declare that they have no conflicts of interest related to this work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qin, X., Li, L. & Pang, G. Multi-scale motivated neural network for image-text matching. Multimed Tools Appl 83, 4383–4407 (2024). https://doi.org/10.1007/s11042-023-15321-0
DOI: https://doi.org/10.1007/s11042-023-15321-0