Multi-scale motivated neural network for image-text matching

Published in: Multimedia Tools and Applications

Abstract

Existing mainstream image-text matching methods typically measure the relevance of an image-text pair by capturing and aggregating the affinities between textual words and visual regions, but they fail to consider the single-scale matching bias caused by the imbalance between image and text information. In this paper, we design a Multi-Scale Motivated Neural Network (MSMNN) model for image-text matching. In contrast to previous single-scale methods, MSMNN extracts visual and textual features at three scales, namely local features, global features, and salient features, exploiting the complementarity of multi-scale matching to reduce the bias of single-scale matching. We also propose a cross-modal interaction module that fuses visual and textual features during local alignment, so as to uncover latent relationships between image-text pairs. Furthermore, we propose a matching score fusion algorithm that combines the matching results from the three levels and can be applied to other initial image-text matching results with negligible overhead. Extensive experiments validate the effectiveness of our method, which achieves competitive results on two well-known datasets, Flickr30K and MSCOCO, with gains of 1.04% and 0.59% on the mR evaluation metric over the previous state-of-the-art method.
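To make the score-fusion idea concrete, the following is a minimal Python sketch of combining similarity scores from the three scales named above (local, global, salient) into a single matching score. The per-scale min-max normalization, the fixed weights, and the function name fuse_matching_scores are illustrative assumptions, not the authors' published algorithm.

```python
# Hypothetical sketch of multi-scale matching score fusion as described in the
# abstract; the normalization and weights are assumptions for illustration.
import numpy as np

def fuse_matching_scores(local_s, global_s, salient_s, weights=(0.4, 0.3, 0.3)):
    """Fuse (n_images, n_texts) similarity matrices from three scales.

    Each matrix is min-max normalized per scale before a weighted sum,
    so that no single scale dominates the fused score.
    """
    fused = np.zeros_like(np.asarray(local_s, dtype=float))
    for s, w in zip((local_s, global_s, salient_s), weights):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        # Guard against a degenerate scale where all scores are equal.
        s_norm = (s - s.min()) / span if span > 0 else np.zeros_like(s)
        fused += w * s_norm
    return fused

# Toy usage: 2 images x 3 captions, random per-scale similarity scores.
rng = np.random.default_rng(0)
local_s, global_s, salient_s = (rng.random((2, 3)) for _ in range(3))
print(fuse_matching_scores(local_s, global_s, salient_s))
```

A weighted sum over normalized per-scale scores is one common way to combine heterogeneous similarities, and it is consistent with the abstract's claim that the fusion step can be applied to other initial matching results at negligible cost.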

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China [grant number 62076048]; and the Science and Technology Innovation Foundation of Dalian [grant number 2020JJ26GX035].

Author information

Corresponding author

Correspondence to Lishuang Li.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflicts of interest related to this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qin, X., Li, L. & Pang, G. Multi-scale motivated neural network for image-text matching. Multimed Tools Appl 83, 4383–4407 (2024). https://doi.org/10.1007/s11042-023-15321-0
