
ManiCLIP: Multi-attribute Face Manipulation from Text

Published in: International Journal of Computer Vision

Abstract

In this paper we present a novel multi-attribute face manipulation method based on textual descriptions. Previous text-based image editing methods either require test-time optimization for each individual image or are restricted to single-attribute editing. Extending these methods to multi-attribute face editing introduces undesired excessive attribute change, e.g., text-relevant attributes are overly manipulated while text-irrelevant attributes are also changed. To address these challenges and achieve natural editing over multiple face attributes, we propose a new decoupling training scheme in which we use group sampling to obtain text segments from the same attribute categories, instead of whole complex sentences. Further, to preserve other existing face attributes, we encourage the model to edit the latent code of each attribute separately via an entropy constraint. During inference, our model is able to edit new face images without any test-time optimization, even from complex textual prompts. We present extensive experiments and analysis to demonstrate the efficacy of our method, which generates natural manipulated faces with minimal text-irrelevant attribute editing. Code and a pre-trained model are available at https://github.com/hwang1996/ManiCLIP.
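
To make the entropy constraint concrete, the following is a minimal sketch in Python/PyTorch of how such a penalty could be written. It is not the authors' actual loss; the tensor shapes, function name, and normalization choice are assumptions for illustration only. The idea is to treat the magnitude of the latent offset predicted for each attribute as a distribution over attributes and penalize its entropy, so that edits concentrate on the text-relevant attributes and leave the others untouched.

    # Hedged sketch (not the paper's exact implementation): an entropy-style
    # penalty over per-attribute latent offsets. Shapes and names are assumed.
    import torch

    def attribute_edit_entropy(offsets: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # offsets: (batch, num_attributes, latent_dim) -- hypothetical per-attribute
        # edits of a StyleGAN-style latent code predicted by the editing network.
        mags = offsets.norm(dim=-1)                               # (batch, num_attributes)
        probs = mags / (mags.sum(dim=-1, keepdim=True) + eps)     # distribution over attributes
        entropy = -(probs * (probs + eps).log()).sum(dim=-1)      # Shannon entropy per sample
        return entropy.mean()                                     # low entropy => sparse, decoupled edits

    if __name__ == "__main__":
        torch.manual_seed(0)
        delta = torch.randn(4, 8, 512, requires_grad=True)   # toy: 4 images, 8 attribute groups
        loss = attribute_edit_entropy(delta)
        loss.backward()                                       # gradient flows back to the editor
        print(f"entropy penalty: {loss.item():.4f}")

In a full training loop, a term of this kind would presumably be added with a small weight alongside the text-matching and image-preservation losses, rather than used on its own.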




Acknowledgements

The datasets generated and/or analysed during the current study are available in the GitHub repository, https://github.com/hwang1996/ManiCLIP. This research is supported, in part, by the Education Bureau of Guangzhou Municipality and the Guangzhou-HKUST (GZ) Joint Funding Program (Grant No. 2023A03J0008). This research is also supported, in part, by the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore, and the National Research Foundation, Prime Minister’s Office, Singapore, under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

Author information


Corresponding authors

Correspondence to Guosheng Lin or Zhiqi Shen.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, H., Lin, G., del Molino, A.G. et al. ManiCLIP: Multi-attribute Face Manipulation from Text. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02088-6


  • DOI: https://doi.org/10.1007/s11263-024-02088-6
