DM-CTSA: a discriminative multi-focused and complementary temporal/spatial attention framework for action recognition

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Video-based human action recognition remains a challenging task, with three main limitations in existing work: (1) most methods are restricted to modeling a single temporal scale; (2) the few methods that do consider multilevel motion features disregard the fact that different features usually contribute differently; and (3) most attention mechanisms attend only to important regions in frames, ignoring the spatial structure information around them. To address these issues, a discriminative multi-focused and complementary temporal/spatial attention framework is presented, which consists of a multi-focused temporal attention network with a multi-granularity loss (M2TEAN) and a complementary spatial attention network with a co-classification loss (C2SPAN). First, M2TEAN not only attends to discriminative multilevel motion features but also highlights the more discriminative features among them. Specifically, a short-term discriminative attention sub-network and a middle-term consistent attention sub-network are constructed to focus on discriminative short-term and middle-term features, respectively, while a long-term evolutive attention sub-network captures the evolution of an action over time. A subsequent multi-focused temporal attention module then further highlights the more discriminative features among the branches. Second, C2SPAN captures discriminative regions in frames while mining the spatial structure information around them. Experiments show that the proposed method achieves state-of-the-art results.
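The fusion step behind M2TEAN's multi-focused temporal attention can be pictured with a short sketch. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes the short-, middle-, and long-term sub-networks each emit one clip-level feature vector, scores each branch with a small linear layer, and fuses the branches with softmax-normalized weights so the more discriminative branch dominates. All names (`MultiFocusedTemporalAttention`, `feat_dim`, and so on) are illustrative assumptions.

```python
# Minimal sketch of the multi-focused temporal fusion idea from the abstract:
# three temporal branches (short-, middle-, long-term) produce clip-level
# features, and a learned attention module re-weights them so that more
# discriminative branches contribute more. Names are illustrative only.
import torch
import torch.nn as nn


class MultiFocusedTemporalAttention(nn.Module):
    """Fuse per-branch features with learned, softmax-normalized weights."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # One scalar attention score per branch, computed from its feature.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, branch_feats: torch.Tensor) -> torch.Tensor:
        # branch_feats: (batch, num_branches, feat_dim), one row per
        # short-/middle-/long-term sub-network output.
        scores = self.score(branch_feats)           # (batch, branches, 1)
        weights = torch.softmax(scores, dim=1)      # normalize across branches
        return (weights * branch_feats).sum(dim=1)  # (batch, feat_dim)


if __name__ == "__main__":
    fusion = MultiFocusedTemporalAttention(feat_dim=512)
    feats = torch.randn(4, 3, 512)  # batch of 4 clips, 3 temporal branches
    print(fusion(feats).shape)      # torch.Size([4, 512])
```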




Acknowledgements

This work was supported in part by the Science and Technology Overall Innovation Project of Shaanxi Province (Grant 2013KTZB03-03-03) and the Key Project of the Shaanxi Province Research and Development Plan (Grant S2018-YF-ZDGY-0187).

Author information

Corresponding author

Correspondence to Ming Tong.

Ethics declarations

Conflict of interest

The authors declare that they have no potential conflicts of interest.

Human and animal rights

The authors declare that this work did not involve research on human participants or animals.

Informed consent

The authors declare that this work contains no material requiring informed consent.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tong, M., Yan, K., Xi, L. et al. DM-CTSA: a discriminative multi-focused and complementary temporal/spatial attention framework for action recognition. Neural Comput & Applic 33, 9375–9389 (2021). https://doi.org/10.1007/s00521-021-05698-0

