A Novel ConvMixer Transformer Based Architecture for Violent Behavior Detection

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2023)

Abstract

Nowadays, most streets, squares and buildings are monitored by a large number of surveillance cameras. Nevertheless, these cameras are used only to record scenes for analysis after crimes or thefts, not to prevent violent actions automatically. In a few cases a guard may check the videos manually in real time, but this is a very inefficient and expensive process. In this paper we propose a novel approach to the violence detection task using a recent architecture named ConvMixer, a simple CNN that uses patch-based embeddings to obtain superior performance with fewer parameters and computational resources. We also use a technique that arranges frames into super images to encode the temporal information in the spatial dimensions. Our tests on the popular “Real Life Violence Situations” dataset show an accuracy of 0.95, placing our proposed model in second position on the leaderboard for that dataset.
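The super-image idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the grid layout, frame ordering or padding, so the row-major square grid, the zero padding and the function name `make_super_image` below are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def make_super_image(frames, grid=None):
    """Tile T frames of shape (H, W, C) into one (n*H, n*W, C) image.

    Assumptions (not taken from the paper): frames are placed in
    row-major order on an n x n grid, and unused cells are zero-filled.
    """
    frames = np.asarray(frames)
    t, h, w, c = frames.shape
    # Default to the smallest square grid that holds all T frames.
    n = grid or int(np.ceil(np.sqrt(t)))
    pad = n * n - t
    if pad < 0:
        raise ValueError("grid is too small for the number of frames")
    if pad:
        blank = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, blank], axis=0)
    # (n, n, H, W, C) -> (n, H, n, W, C) -> (n*H, n*W, C)
    tiled = frames.reshape(n, n, h, w, c).transpose(0, 2, 1, 3, 4)
    return tiled.reshape(n * h, n * w, c)


# Five tiny single-channel frames, frame i filled with the value i + 1.
clip = np.stack([np.full((2, 3, 1), i + 1) for i in range(5)])
super_image = make_super_image(clip)
```

With a layout of this kind, time is laid out across the image plane, so an ordinary 2D image classifier can in principle pick up motion cues from neighbouring grid cells without 3D convolutions.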

Author information

Corresponding author

Correspondence to Christian Napoli.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alfarano, A., De Magistris, G., Mongelli, L., Russo, S., Starczewski, J., Napoli, C. (2023). A Novel ConvMixer Transformer Based Architecture for Violent Behavior Detection. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2023. Lecture Notes in Computer Science, vol 14126. Springer, Cham. https://doi.org/10.1007/978-3-031-42508-0_1

  • DOI: https://doi.org/10.1007/978-3-031-42508-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42507-3

  • Online ISBN: 978-3-031-42508-0

  • eBook Packages: Computer Science, Computer Science (R0)
