Human Action Recognition Research Based on Fusion TS-CNN and LSTM Networks

Zan, Hui; Zhao, Gang

doi:10.1007/s13369-022-07236-z

Research Article-Computer Engineering and Computer Science
Published: 20 September 2022

Volume 48, pages 2331–2345, (2023)
Arabian Journal for Science and Engineering

Hui Zan^1,2 & Gang Zhao^2

Abstract

Human action recognition (HAR) technology is currently of significant interest. Traditional HAR methods generally depend on the temporal and spatial information of the video stream. They require large training datasets and have long response times, so they cannot simultaneously meet the technical requirements of real-time interaction: high accuracy, low delay, and low computational cost. For instance, a gymnastic action can last as little as 0.2 s, and the system must go from action capture to recognition and then to visualization of a three-dimensional character model. Only when the response time of the application system is short enough can it guide synchronous training and accurate evaluation. To reduce the dependence on the amount of video data and meet the HAR technical requirements, this paper proposes a three-stream framework (TS-CNN-LSTM) that combines convolutional neural network (CNN) and long short-term memory (LSTM) networks. First, color, depth, and skeleton data of the human body collected by Microsoft Kinect are used as input to reduce the sample size. Second, heterogeneous convolutional networks are established to reduce computing cost and shorten response time. The experimental results demonstrate the effectiveness of the proposed model on the NTU RGB+D dataset, where it reaches a best accuracy of 87.28% in the Cross-subject mode. Compared with state-of-the-art methods, our method uses 75% of the training sample size, while its time and space complexity are only 67.5% and 73.98% of theirs, respectively. The response time for recognizing one set of actions is improved by 0.90–1.61 s, which is especially valuable for timely action feedback. The proposed method provides an effective solution for real-time interactive applications that require timely human action recognition results and responses.
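The data flow described in the abstract (three input streams, a convolutional feature extractor per stream, and an LSTM over time) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give layer configurations or say where the streams are fused, so the sketch assumes fusion by concatenating the final LSTM hidden state of each stream, replaces each CNN with a single linear projection followed by ReLU, and uses random, untrained weights. All names (`TinyLSTM`, `Stream`, `classify`) and all dimensions are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


class TinyLSTM:
    """Single-layer LSTM unrolled over a sequence (no biases, random weights)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate, applied to [input ; previous hidden state].
        self.W = {g: rand_matrix(hid_dim, in_dim + hid_dim, rng) for g in "ifog"}

    def run(self, seq):
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        for x in seq:
            xh = list(x) + h
            i = [sigmoid(v) for v in matvec(self.W["i"], xh)]    # input gate
            f = [sigmoid(v) for v in matvec(self.W["f"], xh)]    # forget gate
            o = [sigmoid(v) for v in matvec(self.W["o"], xh)]    # output gate
            g = [math.tanh(v) for v in matvec(self.W["g"], xh)]  # candidate state
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # final hidden state summarizes the clip


class Stream:
    """One modality: per-frame feature extractor followed by an LSTM."""

    def __init__(self, frame_dim, feat_dim, hid_dim, rng):
        # ASSUMPTION: a linear projection + ReLU stands in for the stream's CNN.
        self.proj = rand_matrix(feat_dim, frame_dim, rng)
        self.lstm = TinyLSTM(feat_dim, hid_dim, rng)

    def encode(self, frames):
        feats = [[max(0.0, v) for v in matvec(self.proj, f)] for f in frames]
        return self.lstm.run(feats)


def classify(streams, clips, W_out):
    # ASSUMPTION: fusion by concatenating the three final hidden states.
    fused = []
    for name in ("color", "depth", "skeleton"):
        fused += streams[name].encode(clips[name])
    return softmax(matvec(W_out, fused))


rng = random.Random(0)
dims = {"color": 12, "depth": 8, "skeleton": 6}  # toy per-frame input sizes
n_frames, hid_dim, n_classes = 5, 4, 3
streams = {k: Stream(d, 6, hid_dim, rng) for k, d in dims.items()}
W_out = rand_matrix(n_classes, 3 * hid_dim, rng)
clips = {k: [[rng.random() for _ in range(d)] for _ in range(n_frames)]
         for k, d in dims.items()}
probs = classify(streams, clips, W_out)
print(probs)
```

Because the weights are untrained, the printed class probabilities are meaningless; the sketch only shows how the color, depth, and skeleton streams might be encoded separately and combined into a single prediction.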

Code or Data Availability

The research in this paper used the NTU RGB+D (or NTU RGB+D 120) Action Recognition Dataset made available by the ROSE Lab at the Nanyang Technological University, Singapore. The authors also thank the GitHub platform for hosting the release of their results.

Acknowledgements

This work was supported by the National Natural Science Foundation of China project "Research on Automatic Segmentation and Recognition of Teaching Scene with the Characteristics of Teaching Behavior" [61977034]; by an open fund (No. jykf20057) of the Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University; and by the Zhejiang Education Science Planning Project (No. 2021SCG309), Zhejiang Province, China. We also thank the anonymous reviewers and journal editors for their professional opinions and suggestions.

Author information

Authors and Affiliations

Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Jinhua, 321004, China
Hui Zan
Faculty of Artificial Intelligence in Education, Central China Normal University, No. 152 Luoyu Road, Hongshan District, Wuhan City, 430079, China
Hui Zan & Gang Zhao

Contributions

Both authors contributed to the study conception and design. Investigation, data curation, and writing of the original draft were performed by Dr. H.Z. Resources and supervision were provided by Prof. G.Z., who is the corresponding author.

Corresponding author

Correspondence to Gang Zhao.

Ethics declarations

Conflict of interest

The authors declare no potential conflicts of interest, whether financial or arising from personal relationships.

Ethics Approval

The submitted manuscript complies with ethical standards. The experiment did not involve human participants and/or animals, and the data collected from human subjects were handled in compliance with the relevant approval. The individuals concerned consented to participation and publication.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zan, H., Zhao, G. Human Action Recognition Research Based on Fusion TS-CNN and LSTM Networks. Arab J Sci Eng 48, 2331–2345 (2023). https://doi.org/10.1007/s13369-022-07236-z

Received: 17 August 2021
Accepted: 25 August 2022
Published: 20 September 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s13369-022-07236-z
