PointDMIG: a dynamic motion-informed graph neural network for 3D action recognition

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Point clouds contain rich spatial information and thus provide effective supplementary cues for action recognition. Existing action recognition algorithms based on point cloud sequences typically employ complex spatio-temporal local encoding to capture spatio-temporal features, which leads to the loss of spatial information and prevents the establishment of long-term spatial correlations. In this paper, we propose PointDMIG, a network that models long-term spatio-temporal correlations in point cloud sequences while retaining spatial structure information. Specifically, we first employ graph-based static point cloud techniques to construct topological structures for the input point cloud sequences and encode them as static human appearance feature vectors, introducing inherent frame-level parallelism to avoid the loss of spatial information. We then extend this static point cloud technique by integrating the motion of points between adjacent frames into the topological graph structure, capturing the long-term spatio-temporal evolution of static human appearance while preserving its spatial structure. Moreover, to enhance the semantic representation of the point cloud sequences, PointDMIG reconstructs the downsampled point set during feature extraction, further enriching the spatio-temporal information of human body movements. Experimental results on NTU RGB+D 60 and MSR Action 3D show that PointDMIG significantly improves the accuracy of 3D human action recognition based on point cloud sequences. In an extended experiment on gesture recognition on the SHREC 2017 dataset, PointDMIG also achieves competitive results.
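
As a rough, hypothetical sketch of the graph construction the abstract describes, the Python snippet below builds a k-NN topology within each frame of a point cloud sequence and adds motion-informed inter-frame edges by linking each point to its nearest neighbour in the next frame. All names, the choice of k, and the nearest-neighbour matching strategy are illustrative assumptions on our part, not the authors' implementation.

```python
# Minimal sketch (assumed details, not the paper's code): per-frame k-NN
# graphs plus inter-frame "motion" edges carrying displacement vectors.
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours (excluding self) for each point."""
    # points: (N, 3) array of 3D coordinates for one frame
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)           # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]   # (N, k) neighbour indices

def build_motion_informed_graph(frames, k=16):
    """frames: list of (N, 3) arrays, one point cloud per frame.

    Returns intra-frame spatial edges and inter-frame motion edges,
    each motion edge carrying a per-point displacement vector.
    """
    spatial_edges, motion_edges = [], []
    for t, pts in enumerate(frames):
        nbrs = knn_indices(pts, k)                      # intra-frame topology
        src = np.repeat(np.arange(len(pts)), k)
        spatial_edges.append((t, src, nbrs.reshape(-1)))
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            # nearest neighbour in the next frame approximates point motion
            d2 = ((pts[:, None, :] - nxt[None, :, :]) ** 2).sum(-1)
            match = d2.argmin(axis=1)                   # nearest point in next frame
            disp = nxt[match] - pts                     # per-point motion vector
            motion_edges.append((t, match, disp))
    return spatial_edges, motion_edges

# Example: 24 synthetic frames of 512 points each
frames = [np.random.rand(512, 3).astype(np.float32) for _ in range(24)]
spatial, motion = build_motion_informed_graph(frames, k=16)
print(len(spatial), len(motion))  # 24 sets of spatial edges, 23 sets of motion edges
```

In the network itself, such spatial and motion edges would feed subsequent graph convolution and feature extraction stages; the sketch covers only the graph construction step that the abstract alludes to.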

Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61063021).

Author information

Contributions

Yao Du proposed the research topic, designed the research plan and framework, and drafted the initial manuscript. Zhenjie Hou supervised and provided guidance on the research topic, and reviewed and revised the paper. Xing Li designed the experimental methods and analyzed the experimental data. Jiuzhen Liang managed the research project. Kaijun You was responsible for experimental design verification and for data collection and organization. Xinwen Zhou was responsible for revising the paper and organizing the data.

Corresponding author

Correspondence to Zhenjie Hou.

Ethics declarations

Conflict of interest

All authors of this research paper declare that they have no conflict of interest.

Additional information

Communicated by Junyu Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Du, Y., Hou, Z., Li, X. et al. PointDMIG: a dynamic motion-informed graph neural network for 3D action recognition. Multimedia Systems 30, 192 (2024). https://doi.org/10.1007/s00530-024-01395-9
