
Multi-scale dual-modal generative adversarial networks for text-to-image synthesis


Abstract

Generating images from text descriptions is a challenging task due to the natural gap between the textual and visual modalities. Despite the promising results of existing methods, they suffer from two limitations: (1) they focus on image semantic information while failing to fully exploit texture information; (2) they model the correlation between words and the image at only a single, fixed scale, which reduces the diversity and discriminability of the network representations. To address these issues, we propose a Multi-scale Dual-modal Generative Adversarial Network (MD-GAN). The core components of MD-GAN are the dual-modal modulation attention (DMA) and the multi-scale consistency discriminator (MCD). The DMA comprises two blocks: a textual guiding module that captures the correlation between images and text descriptions to rectify the image semantic content, and a channel sampling module that adjusts image texture by selectively aggregating channel-wise information across spatial locations. In addition, the MCD models the correlation between the text and image regions of various sizes, enhancing the semantic consistency between text and images. Extensive experiments on the CUB and MS-COCO datasets show the superiority of MD-GAN over state-of-the-art methods.
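The implementation is not reproduced on this page; the following PyTorch sketch only illustrates one plausible reading of the dual-modal modulation attention (DMA) described in the abstract. The class names, tensor shapes, reduction ratio, and the residual fusion at the end are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of a DMA-style block: a word-to-region textual guiding branch
# plus a channel sampling branch that gates channels using spatially pooled
# information. All design details beyond the abstract are assumptions.
import torch
import torch.nn as nn


class TextualGuiding(nn.Module):
    """Word-to-region cross-attention that rectifies image semantic content."""
    def __init__(self, img_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(word_dim, img_dim)  # map word features into image space

    def forward(self, img_feat, word_feat):
        # img_feat: (B, C, H, W); word_feat: (B, T, word_dim)
        b, c, h, w = img_feat.shape
        regions = img_feat.view(b, c, h * w).transpose(1, 2)            # (B, HW, C)
        words = self.proj(word_feat)                                     # (B, T, C)
        attn = torch.softmax(regions @ words.transpose(1, 2), dim=-1)   # (B, HW, T)
        context = attn @ words                                           # (B, HW, C)
        return context.transpose(1, 2).view(b, c, h, w)


class ChannelSampling(nn.Module):
    """Re-weights channels using information aggregated over spatial locations."""
    def __init__(self, img_dim, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim, img_dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(img_dim // reduction, img_dim), nn.Sigmoid())

    def forward(self, img_feat):
        # Global average pooling over space, then channel-wise gating.
        gate = self.fc(img_feat.mean(dim=(2, 3)))                        # (B, C)
        return img_feat * gate.unsqueeze(-1).unsqueeze(-1)


class DMA(nn.Module):
    """Combines both branches; the residual fusion here is an assumption."""
    def __init__(self, img_dim, word_dim):
        super().__init__()
        self.textual = TextualGuiding(img_dim, word_dim)
        self.channel = ChannelSampling(img_dim)

    def forward(self, img_feat, word_feat):
        return img_feat + self.textual(img_feat, word_feat) + self.channel(img_feat)
```

Under these assumptions, calling `DMA(img_dim=64, word_dim=256)(img_feat, word_feat)` with `img_feat` of shape (B, 64, H, W) and `word_feat` of shape (B, T, 256) returns a modulated feature map with the same shape as `img_feat`.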



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under grants 62072169 and 62172156, and by the National Key Research and Development Program of China under grant 2020YFB1713003.

Author information


Corresponding author

Correspondence to Bin Jiang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jiang, B., Huang, Y., Huang, W. et al. Multi-scale dual-modal generative adversarial networks for text-to-image synthesis. Multimed Tools Appl 82, 15061–15077 (2023). https://doi.org/10.1007/s11042-022-14080-8

