A Comparative Study of Different Feature Descriptors for Video-Based Human Action Recognition

Chapter in Intelligent Computing: Image Processing Based Applications

Abstract

Human action recognition (HAR) has been a well-studied research topic in the field of computer vision for the past two decades. The objective of HAR is to detect and recognize actions performed by one or more persons from a series of observations. In this chapter, a comparative study of different feature descriptors applied to HAR on video datasets is presented. In particular, we compute four standard feature descriptors, namely the histogram of oriented gradients (HOG), the gray-level co-occurrence matrix (GLCM), speeded-up robust features (SURF), and the GIST (spatial envelope) descriptor, from RGB videos after performing background subtraction and fitting a minimum bounding box around the human subject. To speed up the overall process, we apply an efficient sparse filtering method, which reduces the number of features by eliminating redundant features and weighting those that remain. Finally, the performance of these feature descriptors is analyzed on three standard benchmark video datasets, namely KTH, HMDB51, and UCF11.
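The preprocessing pipeline described above (background subtraction, minimum bounding box, descriptor extraction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a static reference background, uses simple frame differencing in place of a learned background model, and replaces the full block-normalized HOG descriptor with a single global orientation histogram.

```python
import numpy as np

def subtract_background(frame, background, thresh=25):
    """Absolute-difference background subtraction -> binary foreground mask."""
    return np.abs(frame.astype(int) - background.astype(int)) > thresh

def bounding_box(mask):
    """Minimum bounding box (top, bottom, left, right) around foreground pixels."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

def orientation_histogram(patch, n_bins=9):
    """Simplified HOG-style descriptor: one global histogram of unsigned
    gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # fold into [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)                 # L1-normalize

# Toy example: a static background plus a textured rectangle as the "person".
background = np.zeros((64, 64), dtype=np.uint8)
frame = background.copy()
frame[20:40, 25:45] = 100 + np.arange(20) * 5        # horizontal intensity ramp

mask = subtract_background(frame, background)
t, b, l, r = bounding_box(mask)
descriptor = orientation_histogram(frame[t:b + 1, l:r + 1])
print((t, b, l, r), descriptor.shape)                # (20, 39, 25, 44) (9,)
```

Since the ramp varies only horizontally, virtually all gradient energy falls in the first orientation bin; in the actual pipeline this per-box descriptor would then be passed through sparse filtering before classification.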



Author information

Corresponding author: Pawan Kumar Singh


Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Sadhukhan, S., Mallick, S., Singh, P.K., Sarkar, R., Bhattacharjee, D. (2020). A Comparative Study of Different Feature Descriptors for Video-Based Human Action Recognition. In: Mandal, J., Banerjee, S. (eds) Intelligent Computing: Image Processing Based Applications. Advances in Intelligent Systems and Computing, vol 1157. Springer, Singapore. https://doi.org/10.1007/978-981-15-4288-6_3
