
Differential motion attention network for efficient action recognition

  • Research
  • Published:
The Visual Computer

Abstract

Despite the great progress achieved by widely used 3D CNNs and two-stream methods in action recognition, these models impose a heavy computational burden that makes them inefficient, and often infeasible, in real-world scenarios. In this paper, we propose the differential motion attention network (DMANet) to explicitly highlight human dynamics for efficient action recognition. First, observing that consecutive frames contain largely redundant static features, we construct a low-cost unit for discriminative motion extraction that highlights human action trajectories across consecutive frames. Second, since not all spatial regions contribute equally to depicting human actions, we propose an adaptive protocol that dynamically emphasizes informative spatial regions. As an end-to-end lightweight framework, DMANet outperforms costly 3D CNNs and two-stream methods by 2.3% with only 0.23× the computation, and outperforms other efficient methods by 1.6%, on the Something–Something v1 dataset. Experimental results on two temporal-related datasets and the large-scale scene-related Kinetics-400 dataset demonstrate the efficacy of DMANet, and in-depth ablations provide both quantitative and qualitative support for its components.
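To make the two ideas in the abstract concrete, the following is a minimal, self-contained sketch of how a frame-difference motion attention unit could be wired up: consecutive frame features are subtracted to suppress static content, and the resulting motion cue is turned into a spatial gate over the original features. The module name, tensor shapes, channel reduction, and sigmoid gating here are illustrative assumptions for this sketch, not the authors' actual DMANet design.

```python
import torch
import torch.nn as nn


class DifferentialMotionAttention(nn.Module):
    """Illustrative frame-difference attention unit (not the authors' exact design).

    Given per-frame features of shape (N, T, C, H, W), it subtracts neighbouring
    frames to suppress static appearance, squeezes channels with a cheap 1x1 conv,
    and uses the result as a per-frame spatial gate on the original features.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.excite = nn.Conv2d(channels // reduction, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        # Temporal difference between consecutive frames; pad the last step with zeros
        diff = x[:, 1:] - x[:, :-1]                                   # (N, T-1, C, H, W)
        diff = torch.cat([diff, diff.new_zeros(n, 1, c, h, w)], dim=1)
        # Per-frame spatial attention map derived from the motion cue
        a = self.excite(torch.relu(self.squeeze(diff.flatten(0, 1))))  # (N*T, 1, H, W)
        a = torch.sigmoid(a).view(n, t, 1, h, w)
        # Residual gating keeps appearance features while emphasising moving regions
        return x * (1.0 + a)


# Usage on dummy clip features: 2 clips, 8 frames, 64 channels, 56x56 feature maps
feats = torch.randn(2, 8, 64, 56, 56)
out = DifferentialMotionAttention(64)(feats)
print(out.shape)  # torch.Size([2, 8, 64, 56, 56])
```

The residual gating (multiplying by 1 + attention rather than attention alone) is one common way to keep the original appearance features intact while boosting moving regions; whether DMANet uses this exact formulation is an assumption of this sketch.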



Data availability statement

All datasets analyzed during this study are available via the links provided.


Author information

Authors and Affiliations

Authors

Contributions

Caifeng Liu contributed to conceptualization, data curation, formal analysis, investigation, resources, software, validation, visualization, writing—original draft, and writing—review and editing. Fangjie Gu was involved in conceptualization, project administration, investigation, methodology, resources, validation, and writing—review and editing.

Corresponding author

Correspondence to Fangjie Gu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, C., Gu, F. Differential motion attention network for efficient action recognition. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03478-0

