PointDMIG: a dynamic motion-informed graph neural network for 3D action recognition

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Point clouds contain rich spatial information and thus provide effective supplementary cues for action recognition. Existing action recognition algorithms based on point cloud sequences typically employ complex spatio-temporal local encoding to capture spatio-temporal features, which leads to the loss of spatial information and prevents the establishment of long-term spatial correlations. In this paper, we propose PointDMIG, a network that models long-term spatio-temporal correlations in point cloud sequences while retaining spatial structure information. Specifically, we first employ graph-based static point cloud techniques to construct topological structures for the input point cloud sequences and encode them as static human appearance feature vectors, introducing inherent frame-level parallelism to avoid the loss of spatial information. We then extend this static point cloud technique by integrating the motion of points between adjacent frames into the topological graph structure, capturing the long-term spatio-temporal evolution of static human appearance while preserving its spatial structure. Moreover, to enhance the semantic representation of the point cloud sequences, PointDMIG reconstructs the downsampled point set during feature extraction, further enriching the spatio-temporal information of human body movements. Experimental results on NTU RGB+D 60 and MSR Action 3D show that PointDMIG significantly improves the accuracy of 3D human action recognition based on point cloud sequences. In an extended experiment on gesture recognition on the SHREC 2017 dataset, PointDMIG also achieves competitive results.
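
As a rough, hypothetical sketch of the graph construction the abstract describes, the Python snippet below builds a k-NN topology within each frame of a point cloud sequence and adds motion-informed inter-frame edges by linking each point to its nearest neighbour in the next frame. All names, the choice of k, and the nearest-neighbour matching strategy are illustrative assumptions on our part, not the authors' implementation.

```python
# Minimal sketch (assumed details, not the paper's code): per-frame k-NN
# graphs plus inter-frame "motion" edges carrying displacement vectors.
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours (excluding self) for each point."""
    # points: (N, 3) array of 3D coordinates for one frame
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)           # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]   # (N, k) neighbour indices

def build_motion_informed_graph(frames, k=16):
    """frames: list of (N, 3) arrays, one point cloud per frame.

    Returns intra-frame spatial edges and inter-frame motion edges,
    each motion edge carrying a per-point displacement vector.
    """
    spatial_edges, motion_edges = [], []
    for t, pts in enumerate(frames):
        nbrs = knn_indices(pts, k)                      # intra-frame topology
        src = np.repeat(np.arange(len(pts)), k)
        spatial_edges.append((t, src, nbrs.reshape(-1)))
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            # nearest neighbour in the next frame approximates point motion
            d2 = ((pts[:, None, :] - nxt[None, :, :]) ** 2).sum(-1)
            match = d2.argmin(axis=1)                   # nearest point in next frame
            disp = nxt[match] - pts                     # per-point motion vector
            motion_edges.append((t, match, disp))
    return spatial_edges, motion_edges

# Example: 24 synthetic frames of 512 points each
frames = [np.random.rand(512, 3).astype(np.float32) for _ in range(24)]
spatial, motion = build_motion_informed_graph(frames, k=16)
print(len(spatial), len(motion))  # 24 sets of spatial edges, 23 sets of motion edges
```

In the network itself, such spatial and motion edges would feed subsequent graph convolution and feature extraction stages; the sketch covers only the graph construction step that the abstract alludes to.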

Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61063021).

Author information

Contributions

Yao Du proposed the research topic, designed the research plan and framework, and drafted the initial manuscript. Zhenjie Hou supervised and provided guidance on the research topic, and reviewed and revised the paper. Xing Li designed the experimental methods and analyzed the experimental data. Jiuzhen Liang managed the research project. Kaijun You was responsible for experimental design verification and for data collection and organization. Xinwen Zhou was responsible for revising the paper and organizing the data.

Corresponding author

Correspondence to Zhenjie Hou.

Ethics declarations

Conflict of interest

All authors of this research paper declare that they have no conflict of interest.

Additional information

Communicated by Junyu Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Du, Y., Hou, Z., Li, X. et al. PointDMIG: a dynamic motion-informed graph neural network for 3D action recognition. Multimedia Systems 30, 192 (2024). https://doi.org/10.1007/s00530-024-01395-9
