Abstract
Human action recognition analyzes the behavior in a scene according to the spatiotemporal features carried in image sequences. Existing works suffer from ineffective spatial–temporal feature learning. For short video sequences, the critical challenge is to extract informative spatiotemporal features from a limited number of frames; for long video sequences, incorporating long-range contextual information can improve recognition performance. However, conventional methods model an action's spatiotemporal features along a single temporal direction, which makes it difficult to capture contextual information and discards the information carried by the opposite direction. This article proposes a bi-directional network that mimics how a bi-directional Long Short-Term Memory (Bi-LSTM) processes time-series data. Specifically, two 3D Convolutional Neural Networks (3D CNNs) extract spatiotemporal features along the forward and backward image sequences of an action, for each modality individually. After the features of the two branches are integrated, a dynamic-fusion strategy produces a video-level prediction. We conducted comprehensive experiments on the action recognition datasets UCF101 and HMDB51 and achieved 98.0% and 81.4% recognition accuracy, respectively, while reducing the number of input RGB images by three quarters.
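The forward/backward data flow described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `extract_features` function below is a hypothetical stand-in for a trained 3D CNN branch, and the fusion here is a plain average of directional score distributions, whereas the paper uses a learned dynamic-fusion strategy.

```python
import numpy as np

def extract_features(clip, rng):
    # Stand-in for one 3D-CNN branch: a random linear projection of the
    # flattened clip to an 8-class score vector (hypothetical, untrained).
    w = rng.standard_normal((clip.size, 8))
    return clip.reshape(-1) @ w

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bidirectional_predict(clip, rng):
    """Score a clip with a forward and a backward branch, then fuse.

    clip: array of shape (T, H, W), a short grayscale image sequence.
    Each branch sees the frames in one temporal direction, mirroring the
    forward/backward 3D-CNN design; fusion is a simple mean of scores.
    """
    forward = clip            # original temporal order
    backward = clip[::-1]     # reversed temporal order
    f_scores = softmax(extract_features(forward, rng))
    b_scores = softmax(extract_features(backward, rng))
    return (f_scores + b_scores) / 2  # video-level class distribution

rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 8, 8))  # 16 frames of 8x8 "images"
scores = bidirectional_predict(clip, rng)
print(scores.shape)  # (8,) — one probability per action class
```

Because the two branches hold separate parameters (two independent projections above), each direction can specialize in the temporal cues visible from its side of the sequence before fusion.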
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61973065, Grant U20A20197, and Grant 61973063, by the Joint fund of Science & Technology Department of Liaoning Province and State Key Laboratory of Robotics, China under Grant 2020-KF-12-02, by Liaoning Key Research and Development Project 2020JH2/10100040, by the Foundation of National Key Laboratory OEIP-O-202005, and by the Fundamental Research Funds for the Central Universities under Grant N182608004.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Leng, C., Ding, Q., Wu, C. et al. BDNet: a method based on forward and backward convolutional networks for action recognition in videos. Vis Comput 40, 4133–4147 (2024). https://doi.org/10.1007/s00371-023-03073-9