Abstract
Generating images from text descriptions is a challenging task due to the inherent gap between the textual and visual modalities. Despite the promising results of existing methods, they suffer from two limitations: (1) they focus on image semantic information while failing to fully explore texture information; (2) they model the correlation between words and the image at only a fixed scale, which reduces the diversity and discriminability of the network representations. To address these issues, we propose Multi-scale Dual-modal Generative Adversarial Networks (MD-GAN). The core components of MD-GAN are the dual-modal modulation attention (DMA) and the multi-scale consistency discriminator (MCD). The DMA comprises two blocks: a textual guiding module that captures the correlation between images and text descriptions to rectify the image semantic content, and a channel sampling module that adjusts image texture by selectively aggregating channel-wise information over the spatial dimensions. In addition, the MCD models the correlation between the text and image regions of various sizes, enhancing the semantic consistency between text and images. Extensive experiments on the CUB and MS-COCO datasets show the superiority of MD-GAN over state-of-the-art methods.
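The channel sampling module is described here only at the level of "selectively aggregating channel-wise information." As a rough illustration of that idea, a minimal NumPy sketch of channel-wise gating in the spirit of CBAM (Woo et al. 2018, cited below) is given here; the function name, the two-layer gating MLP, and the reduction ratio are illustrative assumptions, not the paper's actual module:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Reweight the channels of a (C, H, W) feature map.

    Global-average-pools each channel to a descriptor vector,
    passes it through a two-layer bottleneck MLP, and gates the
    channels with a sigmoid, so informative channels are kept
    and uninformative ones are suppressed.
    """
    squeezed = feat.mean(axis=(1, 2))            # (C,) channel descriptor
    hidden = np.maximum(0.0, w1 @ squeezed)      # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate in (0, 1)
    return feat * gate[:, None, None]            # rescale each channel map

# Toy usage with random weights (reduction ratio r = 2, assumed).
rng = np.random.default_rng(0)
c, r = 8, 2
feat = rng.standard_normal((c, 4, 4))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
out = channel_attention(feat, w1, w2)
```

Because the gate lies in (0, 1), the module can only attenuate channels, never amplify them; the learned weights decide which channels' texture responses survive into the next stage.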
References
Chen Y, Liu L, Tao J, Xia R, Zhang Q, Yang K, Xiong J, Chen X (2020) The improved image inpainting algorithm via encoder and similarity constraint. Vis Comput, 1–15
Chen Z, Cai H, Zhang Y, Wu C, Mu M, Li Z, Sotelo MA (2019) A novel sparse representation model for pedestrian abnormal trajectory understanding. Expert Syst Appl 138:112753. https://doi.org/10.1016/j.eswa.2019.06.041
Chen Z, Chen D, Zhang Y, Cheng X, Zhang M, Wu C (2020) Deep learning for autonomous ship-oriented small ship detection. Saf Sci 130:104812. https://doi.org/10.1016/j.ssci.2020.104812
Cheng J, Wu F, Tian Y, Wang L, Tao D (2020) Rifegan: rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10911–10920
Dash A, Gamboa JCB, Ahmed S, Liwicki M, Afzal MZ (2017) Tac-gan - text conditioned auxiliary classifier generative adversarial network. arXiv:1703.06412
Fan X, Jiang W, Luo H, Mao W (2020) Modality-transfer generative adversarial network and dual-level unified latent representation for visible thermal person re-identification. Vis Comput, 1–16
Fang Z, Liu Z, Liu T, Hung CC, Xiao J, Feng G (2021) Facial expression gan for voice-driven face generation. Vis Comput, 1–14
Gao L, Chen D, Song J, Xu X, Zhang D, Shen HT (2019) Perceptual pyramid adversarial networks for text-to-image synthesis. Proc AAAI Conf Artif Intell 33:8312–8319
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Gregor K, Danihelka I, Graves A, Rezende D, Wierstra D (2015) Draw: a recurrent neural network for image generation. In: International conference on machine learning (PMLR), pp 1462–1471
Jiang B, Huang W, Huang Y, Yang C, Xu F (2020) Deep fusion local-content and global-semantic for image inpainting. IEEE Access 8:156828–156838
Jiang B, Tu W, Yang C, Yuan J (2020) Context-integrated and feature-refined network for lightweight object parsing. IEEE Trans Image Process 29:5079–5093
Jiang B, Xu F, Huang Y, Yang C, Huang W, Xia J (2020) Adaptive adversarial latent space for novelty detection. IEEE Access 8:205088–205098
Karimi M, Veni G, Yu YY (2020) Illegible text to readable text: An image-to-image transformation using conditional sliced wasserstein adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 552–553
Kimura D, Chaudhury S, Narita M, Munawar A, Tachibana R (2020) Adversarial discriminative attention for robust anomaly detection. In: The IEEE winter conference on applications of computer vision, pp 2172–2181
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv:1312.6114
Li B, Qi X, Lukasiewicz T, Torr P (2019) Controllable text-to-image generation. In: Advances in neural information processing systems, pp 2065–2075
Li B, Qi X, Lukasiewicz T, Torr PH (2020) Manigan: text-guided image manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7880–7889
Li R, Wang N, Feng F, Zhang G, Wang X (2020) Exploring global and local linguistic representation for text-to-image synthesis. IEEE Transactions on Multimedia
Li W, Zhang P, Zhang L, Huang Q, He X, Lyu S, Gao J (2019) Object-driven text-to-image synthesis via adversarial training. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12174–12182
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784
Nam S, Kim Y, Kim SJ (2018) Text-adaptive generative adversarial networks: manipulating images with natural language. In: Advances in neural information processing systems, pp 42–51
Odena A, Olah C, Shlens J (2017) Conditional image synthesis with auxiliary classifier gans. In: International conference on machine learning (PMLR), pp 2642–2651
Peng D, Yang W, Liu C, Lü S (2021) Sam-gan: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Networks (8)
Qiao T, Zhang J, Xu D, Tao D (2019) Mirrorgan: learning text-to-image generation by redescription. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1505–1514
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016) Generative adversarial text to image synthesis. arXiv:1605.05396
Reed SE, Akata Z, Mohan S, Tenka S, Schiele B, Lee H (2016) Learning what and where to draw. In: Advances in neural information processing systems, pp 217–225
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Tan H, Liu X, Li X, Zhang Y, Yin B (2019) Semantics-enhanced adversarial nets for text-to-image synthesis. In: Proceedings of the IEEE international conference on computer vision, pp 10501–10510
Tao M, Tang H, Wu S, Sebe N, Wu F, Jing XY (2020) Df-gan: deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865
Van den Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. In: International conference on machine learning (PMLR), pp 1747–1756
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds-200-2011 dataset. Technical report CNS-TR-2011-001, California Institute of Technology
Wang Y, Yu L, van de Weijer J (2020) Deepi2i: enabling deep hierarchical image-to-image translation by transferring from gans. arXiv:2011.05867
Wang Z, Quan Z, Wang ZJ, Hu X, Chen Y (2020) Text to image synthesis with bidirectional generative adversarial network. In: IEEE International conference on multimedia and expo (ICME). IEEE, pp 1–6
Woo S, Park J, Lee JY, So Kweon I (2018) Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Xia W, Yang Y, Xue JH, Wu B (2021) Tedigan: text-guided diverse face image generation and manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2256–2265
Xian W, Sangkloy P, Agrawal V, Raj A, Lu J, Fang C, Yu F, Hays J (2018) Texturegan: controlling deep image synthesis with texture patches. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8456–8465
Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1316–1324
Yang Y, Wang L, Xie D, Deng C, Tao D (2021) Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis. IEEE Trans Image Process PP(99):1–1
Yin G, Liu B, Sheng L, Yu N, Wang X, Shao J (2019) Semantics disentangling for text-to-image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2327–2336
Yuan M, Peng Y (2019) Ckd: cross-task knowledge distillation for text-to-image synthesis. IEEE Trans Multimed 22(8):1955–1968
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 5907–5915
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41(8):1947–1962
Zhang H, Koh JY, Baldridge J, Lee H, Yang Y (2021) Cross-modal contrastive learning for text-to-image generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhang Z, Xie Y, Yang L (2018) Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6199–6208
Zhou X, Wang Y, Zhu Q, Xiao C, Lu X (2019) Ssg: superpixel segmentation and grabcut-based salient object segmentation. Vis Comput 35(3):385–398
Zhu M, Pan P, Chen W, Yang Y (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5802–5810
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under grant 62072169 and 62172156, and the National Key Research and Development Program of China under grant 2020YFB1713003.
Cite this article
Jiang, B., Huang, Y., Huang, W. et al. Multi-scale dual-modal generative adversarial networks for text-to-image synthesis. Multimed Tools Appl 82, 15061–15077 (2023). https://doi.org/10.1007/s11042-022-14080-8