
BDNet: a method based on forward and backward convolutional networks for action recognition in videos

  • Original article
  • Published in The Visual Computer

Abstract

Human action recognition analyzes the behavior in a scene according to the spatiotemporal features carried in image sequences. Existing works suffer from ineffective spatial–temporal feature learning. For short video sequences, the critical challenge is to extract informative spatiotemporal features from a limited number of frames. For long video sequences, combining long-range contextual information can improve recognition performance. However, conventional methods typically model an action's spatiotemporal features along a single temporal direction, which makes it difficult to capture contextual information and ignores information from the opposite direction. This article proposes a bi-directional network that mimics how a bi-directional Long Short-Term Memory (Bi-LSTM) processes time-series data. Specifically, two 3D Convolutional Neural Networks (3D CNNs) extract spatiotemporal features from the forward and backward image sequences of an action for each modality individually. After integrating the features of the two branches, a dynamic-fusion strategy is applied to obtain a video-level prediction. We conducted comprehensive experiments on the action recognition datasets UCF101 and HMDB51, achieving 98.0% and 81.4% recognition accuracy, respectively, while reducing the number of input RGB frames by three quarters.
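To make the two-branch idea concrete, the following is a minimal PyTorch sketch of the forward/backward processing described above. The shallow backbone, layer sizes, single-modality (RGB) input, and the softmax-weighted fusion are illustrative assumptions; they are not the authors' exact BDNet implementation.

```python
# Sketch of a bi-directional two-branch 3D-CNN, assuming a toy backbone and a
# learnable softmax-weighted fusion (both assumptions, not the published BDNet).
import torch
import torch.nn as nn


class BiDirectional3DCNN(nn.Module):
    """One branch reads the clip in temporal order, the other reads it reversed."""

    def __init__(self, num_classes: int = 101):
        super().__init__()

        def backbone() -> nn.Sequential:
            # Placeholder 3D CNN; a real model would use a deeper C3D/I3D-style network.
            return nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
            )

        self.forward_branch = backbone()
        self.backward_branch = backbone()
        # Assumed "dynamic fusion": one learnable weight per branch, normalized by softmax.
        self.fusion_logits = nn.Parameter(torch.zeros(2))
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        fwd = self.forward_branch(clip)
        bwd = self.backward_branch(clip.flip(dims=[2]))  # reverse the temporal axis
        w = torch.softmax(self.fusion_logits, dim=0)
        fused = w[0] * fwd + w[1] * bwd
        return self.classifier(fused)


if __name__ == "__main__":
    model = BiDirectional3DCNN(num_classes=101)
    dummy = torch.randn(2, 3, 16, 112, 112)  # two clips of 16 RGB frames each
    print(model(dummy).shape)  # torch.Size([2, 101])
```

In this sketch the temporal reversal is what distinguishes the two branches; extending it to a second modality (e.g., optical flow) would simply duplicate the pair of branches before fusion.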


Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61973065, Grant U20A20197, and Grant 61973063; by the Joint Fund of the Science & Technology Department of Liaoning Province and the State Key Laboratory of Robotics, China, under Grant 2020-KF-12-02; by the Liaoning Key Research and Development Project under Grant 2020JH2/10100040; by the Foundation of the National Key Laboratory under Grant OEIP-O-202005; and by the Fundamental Research Funds for the Central Universities under Grant N182608004.

Author information

Corresponding authors

Correspondence to Qichuan Ding or Chengdong Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Leng, C., Ding, Q., Wu, C. et al. BDNet: a method based on forward and backward convolutional networks for action recognition in videos. Vis Comput 40, 4133–4147 (2024). https://doi.org/10.1007/s00371-023-03073-9

