QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation

Hong, Yuxin; Qian, Xuelin; Luo, Simian; Guo, Guodong; Xue, Xiangyang; Fu, Yanwei

doi:10.1007/978-3-031-26351-4_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13846)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

This paper studies the task of conditional Human Motion Animation (cHMA). Given a source image and a driving video, the model should generate a new frame sequence in which the person in the source image performs a motion similar to the pose sequence of the driving video. Despite the success of Generative Adversarial Network (GAN) methods in image and video synthesis, cHMA remains very challenging because it is difficult to use the conditional guidance, such as images or poses, efficiently while also generating images of good visual quality. To this end, this paper proposes a novel model, learning to Quantize, Scrabble, and Craft (QS-Craft), for conditional human motion animation. The key novelty lies in three newly introduced steps: quantize, scrabble, and craft. In particular, our QS-Craft employs a transformer in its structure to exploit attention. The guidance is represented as a pose coordinate sequence extracted from the driving video. Extensive experiments on human motion datasets validate the efficacy of our model.
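
The abstract names a quantize step without describing it on this page. As a generic illustration only, and not the authors' implementation, quantization against a codebook (the Notes mention codebook learning) is commonly done by nearest-neighbour lookup: each feature vector is replaced by the index of its closest codebook entry. The function names, the toy 2-D features, and the three-entry codebook below are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look the indices back up to recover the quantized vectors."""
    return [list(codebook[k]) for k in indices]


# Toy data (assumed for illustration, not taken from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.8], [0.6, 0.6]]

tokens = quantize(features, codebook)
print(tokens)                        # discrete token sequence
print(dequantize(tokens, codebook))  # quantized approximation of the features
```

The resulting index sequence is the kind of discrete token stream a transformer can attend over; how QS-Craft builds and uses its own tokens is described in the full chapter, not here.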

Notes

1. The name of Craft is inspired by the game of Minecraft.
2. We refer the readers to [12] for details of the codebook learning.
3. Please refer to the supplementary for the transformer structure.
4. We reuse the symbols of \(\left\{ s_{i}\right\}\) and \(\left\{ t_{i}\right\}\) after embedding for simplicity.

References

1. Bansal, A., Ma, S., Ramanan, D., Sheikh, Y.: Recycle-GAN: unsupervised video retargeting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_8
2. Berthelot, D., Schumm, T., Metz, L.: BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017)
3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–194 (1999)
4. Cai, H., Bai, C., Tai, Y.-W., Tang, C.-K.: Deep video generation, prediction and completion of human action sequences. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 374–390. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_23
5. Cao, C., Hou, Q., Zhou, K.: Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. Graph. (TOG) 33(4), 1–10 (2014)
6. Cao, C., et al.: The image local autoregressive transformer. In: NeurIPS (2021)
7. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
8. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
9. Chen, M., et al.: Generative pretraining from pixels. In: International Conference on Machine Learning, pp. 1691–1703. PMLR (2020)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
11. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
12. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883 (2021)
13. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500 (2017)
14. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
15. Kumar, M., Weissenborn, D., Kalchbrenner, N.: Colorization transformer. arXiv preprint arXiv:2102.04432 (2021)
16. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
17. Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. arXiv preprint arXiv:1705.09368 (2017)
18. Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., Schiele, B., Fritz, M.: Disentangled person image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 99–108 (2018)
19. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802 (2017)
20. van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. arXiv preprint arXiv:1711.00937 (2017)
21. Qi, G.J.: Loss-sensitive generative adversarial networks on Lipschitz densities. Int. J. Comput. Vision 128(5), 1118–1140 (2020)
22. Qian, X., et al.: Pose-normalized image generation for person re-identification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 661–678. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_40
23. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
24. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
25. Ramesh, A., et al.: Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092 (2021)
26. Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: Animating arbitrary objects via deep motion transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2377–2386 (2019)
27. Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. Adv. Neural. Inf. Process. Syst. 32, 7137–7147 (2019)
28. Siarohin, A., Woodford, O.J., Ren, J., Chai, M., Tulyakov, S.: Motion representations for articulated animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13653–13662 (2021)
29. Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. arXiv preprint arXiv:1602.02282 (2016)
30. Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2Face: real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395 (2016)
31. Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526–1535 (2018)
32. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
33. Wang, T.C., et al.: Video-to-video synthesis. arXiv preprint arXiv:1808.06601 (2018)
34. Wiles, O., Koepke, A.S., Zisserman, A.: X2Face: a network for controlling face generation using images, audio, and pose codes. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 690–706. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_41
35. Xu, C., Fu, Y., Wen, C., Pan, Y., Jiang, Y.G., Xue, X.: Pose-guided person image synthesis in the non-iconic views. IEEE Trans. Image Process. 29, 9060–9072 (2020)
36. Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., Lin, D.: Pose guided human video generation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 204–219. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_13
37. Zhang, W., Zhu, M., Derpanis, K.G.: From actemes to action: a strongly-supervised representation for detailed action understanding. In: 2013 IEEE International Conference on Computer Vision, pp. 2248–2255 (2013). https://doi.org/10.1109/ICCV.2013.280
38. Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., Bai, X.: Progressive pose attention transfer for person image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2347–2356 (2019)

Acknowledgements

This work was supported by the China Postdoctoral Science Foundation (2022M710746) and the Science and Technology Major Project of the Commission of Science and Technology of Shanghai (No. 21XD1402500).

Author information

Authors and Affiliations

School of Data Science, and MOE Frontiers Center for Brain Science, Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China
Yuxin Hong, Xuelin Qian, Simian Luo, Xiangyang Xue & Yanwei Fu
School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China
Xiangyang Xue
Department of CSEE, West Virginia University, Morgantown, USA
Guodong Guo

Corresponding author

Correspondence to Xuelin Qian.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 33257 KB)

About this paper

Cite this paper

Hong, Y., Qian, X., Luo, S., Guo, G., Xue, X., Fu, Y. (2023). QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13846. Springer, Cham. https://doi.org/10.1007/978-3-031-26351-4_27

DOI: https://doi.org/10.1007/978-3-031-26351-4_27
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26350-7
Online ISBN: 978-3-031-26351-4
eBook Packages: Computer Science, Computer Science (R0)
