Abstract
In this paper, we propose a simple yet effective Text-To-Face (T2F) generative adversarial network, Semantic-Spatial FaceGAN (SS-FaceGAN), for generating facial images from natural language descriptions. Natural language is inherently abstract, whereas images are concrete. This discrepancy makes the task difficult, especially when multiple descriptions must be combined to generate an accurate image. SS-FaceGAN is designed to generate precise facial features from multiple descriptions. It incorporates a novel Focus Spatial (FS) module that predicts masks based on text semantics to refine image feature maps. It also includes an attention mechanism, the Word Attention Reuse (WAR) module, which leverages the potential distribution of each word in the description to compute word-level attention. Experiments demonstrate the effectiveness of our approach.
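The abstract does not specify how the FS and WAR modules are computed, so the following is only an illustrative sketch of the two general ideas it names: word-level attention over an image feature map, and a text-conditioned spatial mask that gates a feature refinement. The function names, tensor shapes, dot-product scoring, and the projection matrices `w_query`, `w_scale`, and `w_shift` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def word_attention(features, words):
    """Generic word-level attention (assumed form, not the paper's WAR module).

    features: (C, H, W) image feature map
    words:    (T, C) word embeddings projected to C channels
    Returns a word-context map (C, H, W) and attention weights (H*W, T).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w).T           # (H*W, C), one row per location
    scores = flat @ words.T / np.sqrt(c)          # (H*W, T) scaled dot products
    attn = softmax(scores, axis=1)                # each location weights the T words
    context = (attn @ words).T.reshape(c, h, w)   # weighted sum of word embeddings
    return context, attn


def text_mask_refine(features, sentence, w_query, w_scale, w_shift):
    """Hypothetical text-conditioned mask gating a channel-wise refinement.

    features: (C, H, W); sentence: (D,) sentence embedding
    w_query, w_scale, w_shift: (D, C) projections (assumed, randomly set below)
    Returns the refined feature map (C, H, W) and the mask (H, W).
    """
    query = sentence @ w_query                                   # (C,)
    logits = np.tensordot(query, features, axes=([0], [0]))      # (H, W)
    mask = 1.0 / (1.0 + np.exp(-logits))                         # values in [0, 1]
    gamma = sentence @ w_scale                                   # (C,) channel scale
    beta = sentence @ w_shift                                    # (C,) channel shift
    # Apply the text-driven scale and shift only where the mask is active.
    refined = features * (1.0 + mask[None] * gamma[:, None, None]) \
        + mask[None] * beta[:, None, None]
    return refined, mask


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))           # toy feature map: C=8, H=W=4
word_emb = rng.standard_normal((5, 8))           # toy caption with T=5 words
sent_emb = rng.standard_normal(6)                # toy sentence embedding, D=6
proj = [0.1 * rng.standard_normal((6, 8)) for _ in range(3)]

ctx, attn = word_attention(feats, word_emb)
refined, mask = text_mask_refine(feats, sent_emb, *proj)
print(ctx.shape, attn.shape, refined.shape, mask.shape)
```

In this sketch, locations where the mask is close to zero keep their original features, while locations the text "focuses" on receive the text-dependent scale and shift. A trained model would learn the projections rather than sample them randomly.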
Data Availability
All data generated or analysed during this study are included in this article.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under grant 62176062.
Author information
Contributions
Qi Guo: Conceptualization of this study, Methodology, Software, Writing original draft. Xiaodong Gu: Supervision, Conceptualization and methodology, Writing original draft, Project administration.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Guo, Q., Gu, X. Towards photorealistic face generation using text-guided Semantic-Spatial FaceGAN. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19320-7
DOI: https://doi.org/10.1007/s11042-024-19320-7