
ManiCLIP: Multi-attribute Face Manipulation from Text

Published in: International Journal of Computer Vision

Abstract

In this paper we present a novel multi-attribute face manipulation method based on textual descriptions. Previous text-based image editing methods either require test-time optimization for each individual image or are restricted to single-attribute editing. Extending these methods to multi-attribute face editing introduces undesired excessive attribute change, e.g., text-relevant attributes are overly manipulated while text-irrelevant attributes are also changed. To address these challenges and achieve natural editing over multiple face attributes, we propose a new decoupling training scheme in which we use group sampling to obtain text segments from the same attribute categories, instead of whole complex sentences. Further, to preserve other existing face attributes, we encourage the model to edit the latent code of each attribute separately via an entropy constraint. During inference, our model is able to edit new face images without any test-time optimization, even from complex textual prompts. We present extensive experiments and analysis to demonstrate the efficacy of our method, which generates natural manipulated faces with minimal text-irrelevant attribute editing. Code and a pre-trained model are available at https://github.com/hwang1996/ManiCLIP.
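
To make the entropy constraint concrete, the following is a minimal sketch in Python/PyTorch of how such a penalty could be written. It is not the authors' actual loss; the tensor shapes, function name, and normalization choice are assumptions for illustration only. The idea is to treat the magnitude of the latent offset predicted for each attribute as a distribution over attributes and penalize its entropy, so that edits concentrate on the text-relevant attributes and leave the others untouched.

    # Hedged sketch (not the paper's exact implementation): an entropy-style
    # penalty over per-attribute latent offsets. Shapes and names are assumed.
    import torch

    def attribute_edit_entropy(offsets: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # offsets: (batch, num_attributes, latent_dim) -- hypothetical per-attribute
        # edits of a StyleGAN-style latent code predicted by the editing network.
        mags = offsets.norm(dim=-1)                               # (batch, num_attributes)
        probs = mags / (mags.sum(dim=-1, keepdim=True) + eps)     # distribution over attributes
        entropy = -(probs * (probs + eps).log()).sum(dim=-1)      # Shannon entropy per sample
        return entropy.mean()                                     # low entropy => sparse, decoupled edits

    if __name__ == "__main__":
        torch.manual_seed(0)
        delta = torch.randn(4, 8, 512, requires_grad=True)   # toy: 4 images, 8 attribute groups
        loss = attribute_edit_entropy(delta)
        loss.backward()                                       # gradient flows back to the editor
        print(f"entropy penalty: {loss.item():.4f}")

In a full training loop, a term of this kind would presumably be added with a small weight alongside the text-matching and image-preservation losses, rather than used on its own.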




Acknowledgements

The datasets generated and/or analysed during the current study are available in the GitHub repository, https://github.com/hwang1996/ManiCLIP. This research is supported, in part, by the Education Bureau of Guangzhou Municipality and the Guangzhou-HKUST (GZ) Joint Funding Program (Grant No. 2023A03J0008). This research is also supported, in part, by the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore, and the National Research Foundation, Prime Minister’s Office, Singapore, under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

Author information


Corresponding authors

Correspondence to Guosheng Lin or Zhiqi Shen.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, H., Lin, G., del Molino, A.G. et al. ManiCLIP: Multi-attribute Face Manipulation from Text. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02088-6


  • DOI: https://doi.org/10.1007/s11263-024-02088-6
