Abstract
Generating images from text descriptions is a challenging task due to the inherent gap between the textual and visual modalities. Despite the promising results of existing methods, they suffer from two limitations: (1) they focus on image semantic information while failing to fully explore texture information; (2) they model the correlation between words and the image at only a fixed scale, which reduces the diversity and discriminability of the network representations. To address these issues, we propose Multi-scale Dual-modal Generative Adversarial Networks (MD-GAN). The core components of MD-GAN are the dual-modal modulation attention (DMA) and the multi-scale consistency discriminator (MCD). The DMA comprises two blocks: a textual guiding module that captures the correlation between images and text descriptions to rectify the image semantic content, and a channel sampling module that adjusts image texture by selectively aggregating channel-wise information over the spatial dimensions. In addition, the MCD models the correlation between the text and image regions of various sizes, enhancing the semantic consistency between text and images. Extensive experiments on the CUB and MS-COCO datasets show the superiority of MD-GAN over state-of-the-art methods.
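The channel sampling module is described here only at the level of "selectively aggregating channel-wise information." As a rough illustration of that idea, a minimal NumPy sketch of channel-wise gating in the spirit of CBAM (Woo et al. 2018, cited below) is given here; the function name, the two-layer gating MLP, and the reduction ratio are illustrative assumptions, not the paper's actual module:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Reweight the channels of a (C, H, W) feature map.

    Global-average-pools each channel to a descriptor vector,
    passes it through a two-layer bottleneck MLP, and gates the
    channels with a sigmoid, so informative channels are kept
    and uninformative ones are suppressed.
    """
    squeezed = feat.mean(axis=(1, 2))            # (C,) channel descriptor
    hidden = np.maximum(0.0, w1 @ squeezed)      # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate in (0, 1)
    return feat * gate[:, None, None]            # rescale each channel map

# Toy usage with random weights (reduction ratio r = 2, assumed).
rng = np.random.default_rng(0)
c, r = 8, 2
feat = rng.standard_normal((c, 4, 4))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
out = channel_attention(feat, w1, w2)
```

Because the gate lies in (0, 1), the module can only attenuate channels, never amplify them; the learned weights decide which channels' texture responses survive into the next stage.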
References
Chen Y, Liu L, Tao J, Xia R, Zhang Q, Yang K, Xiong J, Chen X (2020) The improved image inpainting algorithm via encoder and similarity constraint. Vis Comput, 1–15
Chen Z, Cai H, Zhang Y, Wu C, Mu M, Li Z, Sotelo MA (2019) A novel sparse representation model for pedestrian abnormal trajectory understanding. Expert Syst Appl 138:112753. https://doi.org/10.1016/j.eswa.2019.06.041
Chen Z, Chen D, Zhang Y, Cheng X, Zhang M, Wu C (2020) Deep learning for autonomous ship-oriented small ship detection. Saf Sci 130:104812. https://doi.org/10.1016/j.ssci.2020.104812
Cheng J, Wu F, Tian Y, Wang L, Tao D (2020) Rifegan: rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10911–10920
Dash A, Gamboa JCB, Ahmed S, Liwicki M, Afzal MZ (2017) Tac-gan - text conditioned auxiliary classifier generative adversarial network. arXiv:1703.06412
Fan X, Jiang W, Luo H, Mao W (2020) Modality-transfer generative adversarial network and dual-level unified latent representation for visible thermal person re-identification. Vis Comput, 1–16
Fang Z, Liu Z, Liu T, Hung CC, Xiao J, Feng G (2021) Facial expression gan for voice-driven face generation. Vis Comput, 1–14
Gao L, Chen D, Song J, Xu X, Zhang D, Shen HT (2019) Perceptual pyramid adversarial networks for text-to-image synthesis. Proc AAAI Conf Artif Intell 33:8312–8319
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Gregor K, Danihelka I, Graves A, Rezende D, Wierstra D (2015) Draw: a recurrent neural network for image generation. In: International conference on machine learning (PMLR), pp 1462–1471
Jiang B, Huang W, Huang Y, Yang C, Xu F (2020) Deep fusion local-content and global-semantic for image inpainting. IEEE Access 8:156828–156838
Jiang B, Tu W, Yang C, Yuan J (2020) Context-integrated and feature-refined network for lightweight object parsing. IEEE Trans Image Process 29:5079–5093
Jiang B, Xu F, Huang Y, Yang C, Huang W, Xia J (2020) Adaptive adversarial latent space for novelty detection. IEEE Access 8:205088–205098
Karimi M, Veni G, Yu YY (2020) Illegible text to readable text: An image-to-image transformation using conditional sliced wasserstein adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 552–553
Kimura D, Chaudhury S, Narita M, Munawar A, Tachibana R (2020) Adversarial discriminative attention for robust anomaly detection. In: The IEEE winter conference on applications of computer vision, pp 2172–2181
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv:1312.6114
Li B, Qi X, Lukasiewicz T, Torr P (2019) Controllable text-to-image generation. In: Advances in neural information processing systems, pp 2065–2075
Li B, Qi X, Lukasiewicz T, Torr PH (2020) Manigan: text-guided image manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7880–7889
Li R, Wang N, Feng F, Zhang G, Wang X (2020) Exploring global and local linguistic representation for text-to-image synthesis. IEEE Transactions on Multimedia
Li W, Zhang P, Zhang L, Huang Q, He X, Lyu S, Gao J (2019) Object-driven text-to-image synthesis via adversarial training. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12174–12182
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784
Nam S, Kim Y, Kim SJ (2018) Text-adaptive generative adversarial networks: manipulating images with natural language. In: Advances in neural information processing systems, pp 42–51
Odena A, Olah C, Shlens J (2017) Conditional image synthesis with auxiliary classifier gans. In: International conference on machine learning (PMLR), pp 2642–2651
Peng D, Yang W, Liu C, Lü S (2021) Sam-gan: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Networks (8)
Qiao T, Zhang J, Xu D, Tao D (2019) Mirrorgan: learning text-to-image generation by redescription. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1505–1514
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016) Generative adversarial text to image synthesis. arXiv:1605.05396
Reed SE, Akata Z, Mohan S, Tenka S, Schiele B, Lee H (2016) Learning what and where to draw. In: Advances in neural information processing systems, pp 217–225
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Tan H, Liu X, Li X, Zhang Y, Yin B (2019) Semantics-enhanced adversarial nets for text-to-image synthesis. In: Proceedings of the IEEE international conference on computer vision, pp 10501–10510
Tao M, Tang H, Wu S, Sebe N, Wu F, Jing XY (2020) Df-gan: deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865
Van den Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. In: International conference on machine learning (PMLR), pp 1747–1756
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds-200-2011 dataset. Technical report CNS-TR-2011-001, California Institute of Technology
Wang Y, Yu L, van de Weijer J (2020) Deepi2i: enabling deep hierarchical image-to-image translation by transferring from gans. arXiv:2011.05867
Wang Z, Quan Z, Wang ZJ, Hu X, Chen Y (2020) Text to image synthesis with bidirectional generative adversarial network. In: IEEE International conference on multimedia and expo (ICME). IEEE, pp 1–6
Woo S, Park J, Lee JY, So Kweon I (2018) Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Xia W, Yang Y, Xue JH, Wu B (2021) Tedigan: text-guided diverse face image generation and manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2256–2265
Xian W, Sangkloy P, Agrawal V, Raj A, Lu J, Fang C, Yu F, Hays J (2018) Texturegan: controlling deep image synthesis with texture patches. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8456–8465
Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1316–1324
Yang Y, Wang L, Xie D, Deng C, Tao D (2021) Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis. IEEE Trans Image Process PP(99):1–1
Yin G, Liu B, Sheng L, Yu N, Wang X, Shao J (2019) Semantics disentangling for text-to-image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2327–2336
Yuan M, Peng Y (2019) Ckd: cross-task knowledge distillation for text-to-image synthesis. IEEE Trans Multimed 22(8):1955–1968
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 5907–5915
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41(8):1947–1962
Zhang H, Koh JY, Baldridge J, Lee H, Yang Y (2021) Cross-modal contrastive learning for text-to-image generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhang Z, Xie Y, Yang L (2018) Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6199–6208
Zhou X, Wang Y, Zhu Q, Xiao C, Lu X (2019) Ssg: superpixel segmentation and grabcut-based salient object segmentation. Vis Comput 35(3):385–398
Zhu M, Pan P, Chen W, Yang Y (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5802–5810
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under grant 62072169 and 62172156, and the National Key Research and Development Program of China under grant 2020YFB1713003.
Cite this article
Jiang, B., Huang, Y., Huang, W. et al. Multi-scale dual-modal generative adversarial networks for text-to-image synthesis. Multimed Tools Appl 82, 15061–15077 (2023). https://doi.org/10.1007/s11042-022-14080-8