Abstract
Human action recognition analyzes the behavior in a scene according to the spatiotemporal features carried in image sequences. Existing works suffer from ineffective spatial–temporal feature learning. For short video sequences, the critical challenge is to extract informative spatiotemporal features from a limited number of frames; for long video sequences, incorporating long-range contextual information can improve recognition performance. However, conventional methods model an action's spatiotemporal features along a single temporal direction, which makes it difficult to capture contextual information and discards the information carried by the opposite direction. This article proposes a bi-directional network that mimics how a bi-directional Long Short-Term Memory (Bi-LSTM) processes time-series data. Specifically, two 3D Convolutional Neural Networks (3D CNNs) extract spatiotemporal features along the forward and backward image sequences of an action, for each modality individually. After the features of the two branches are integrated, a dynamic-fusion strategy produces a video-level prediction. We conducted comprehensive experiments on the action recognition datasets UCF101 and HMDB51 and achieved 98.0% and 81.4% recognition accuracy, respectively, while reducing the number of input RGB images by three quarters.
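The forward/backward data flow described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `extract_features` function below is a hypothetical stand-in for a trained 3D CNN branch, and the fusion here is a plain average of directional score distributions, whereas the paper uses a learned dynamic-fusion strategy.

```python
import numpy as np

def extract_features(clip, rng):
    # Stand-in for one 3D-CNN branch: a random linear projection of the
    # flattened clip to an 8-class score vector (hypothetical, untrained).
    w = rng.standard_normal((clip.size, 8))
    return clip.reshape(-1) @ w

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bidirectional_predict(clip, rng):
    """Score a clip with a forward and a backward branch, then fuse.

    clip: array of shape (T, H, W), a short grayscale image sequence.
    Each branch sees the frames in one temporal direction, mirroring the
    forward/backward 3D-CNN design; fusion is a simple mean of scores.
    """
    forward = clip            # original temporal order
    backward = clip[::-1]     # reversed temporal order
    f_scores = softmax(extract_features(forward, rng))
    b_scores = softmax(extract_features(backward, rng))
    return (f_scores + b_scores) / 2  # video-level class distribution

rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 8, 8))  # 16 frames of 8x8 "images"
scores = bidirectional_predict(clip, rng)
print(scores.shape)  # (8,) — one probability per action class
```

Because the two branches hold separate parameters (two independent projections above), each direction can specialize in the temporal cues visible from its side of the sequence before fusion.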
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61973065, Grant U20A20197, and Grant 61973063, by the Joint fund of Science & Technology Department of Liaoning Province and State Key Laboratory of Robotics, China under Grant 2020-KF-12-02, by Liaoning Key Research and Development Project 2020JH2/10100040, by the Foundation of National Key Laboratory OEIP-O-202005, and by the Fundamental Research Funds for the Central Universities under Grant N182608004.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Leng, C., Ding, Q., Wu, C. et al. BDNet: a method based on forward and backward convolutional networks for action recognition in videos. Vis Comput 40, 4133–4147 (2024). https://doi.org/10.1007/s00371-023-03073-9