A Novel ConvMixer Transformer Based Architecture for Violent Behavior Detection

Alfarano, Andrea; De Magistris, Giorgio; Mongelli, Leonardo; Russo, Samuele; Starczewski, Janusz; Napoli, Christian

doi:10.1007/978-3-031-42508-0_1

Andrea Alfarano¹³,
Giorgio De Magistris¹³,
Leonardo Mongelli¹³,
Samuele Russo¹⁴,
Janusz Starczewski¹⁵ &
…
Christian Napoli^13,16

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14126))

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

419 Accesses

Abstract

Nowadays most of the streets, squares and buildings are monitored by a large number of surveillance cameras. Nevertheless, these cameras are used only to record scenes to be analyzed after crimes or thefts, and not to prevent violent actions in an automatic way. In few cases there may be a guard who checks the videos manually in real-time, but it is a very inefficient and expensive process. In this paper we proposes a novel approach to Violence Detection task using a recent architecture named ConvMixer, a simple CNN which uses patch-based embeddings in order to obtain superior performance with fewer parameters and computation resources. We also use an interesting technique that consists in arranging frames into super images to encode the temporal information into the spatial dimensions. Our tests on popular “Real Life Violence Situations” dataset highlight a remarkable accuracy of 0.95, placing our proposed model at the second position of the leader board on the same dataset.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

EUR 32.99 /Month

Get 10 units per month
Download Article/Chapter or Ebook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Subscribe now

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Abdali, A.R.: Data efficient video transformer for violence detection. In: 2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT), pp. 195–199 (2021). https://doi.org/10.1109/COMNETSAT53002.2021.9530829
Aggarwal, J., Cai, Q.: Human motion analysis: a review. In: Proceedings IEEE Nonrigid and Articulated Motion Workshop, pp. 90–102 (1997). https://doi.org/10.1109/NAMW.1997.609859
Aggarwal, J., **a, L.: Human activity recognition from 3D data: a review. Pattern Recogn. Lett. 48, 70–80 (2014). https://doi.org/10.1016/j.patrec.2014.04.011
Article Google Scholar
Aremu, T., Zhiyuan, L., Alameeri, R.: Any object is a potential weapon! weaponized violence detection using salient image (2022). https://doi.org/10.48550/ARXIV.2207.12850. ar**v:2207.12850
Calandre, J., Peteri, R., Mascarilla, L.: Optical flow singularities for sports video annotation: detection of strokes in Table Tennis, October 2019
Google Scholar
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers (2020). https://doi.org/10.48550/ARXIV.2005.12872. ar**v:2005.12872
Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset (2017). https://doi.org/10.48550/ARXIV.1705.07750. ar**v:1705.07750
Chen, L.H., Su, C.W., Hsu, H.W.: Violent scene detection in movies. Int. J. Pattern Recogn. Artif. Intell. 25(08), 11611172 (2011). https://doi.org/10.1142/S0218001411009056
Article Google Scholar
Cheng, M., Cai, K., Li, M.: RWF-2000: an open large scale video database for violence detection, pp. 4183–4190, January 2021. https://doi.org/10.1109/ICPR48806.2021.9412502
De Magistris, G., et al.: Vision-based holistic scene understanding for context-aware human-robot interaction. In: Bandini, S., Gasparini, F., Mascardi, V., Palmonari, M., Vizzari, G. (eds.) 20th International Conference of the Italian Association for Artificial Intelligence. Advances in Artificial Intelligence, AIxIA 2021, Virtual Event, Revised Selected Papers, 1–3 December 2021, vol. 13196, pp. 310–325. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-08421-8_21
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale (2020). https://doi.org/10.48550/ARXIV.2010.11929. ar**v:2010.11929
Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition (2021). https://doi.org/10.48550/ARXIV.2104.13586. ar**v:2104.13586
Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: IEEE International Conference on Computer Vision, Nice, France, pp. 726–733 (2003)
Google Scholar
Gao, Y., Liu, H., Sun, X., Wang, C., Liu, Y.: Violence detection using oriented violent flows. Image Vis. Comput. 4849, 37–41 (2016). https://doi.org/10.1016/j.imavis.2016.01.006
Article Google Scholar
Gowda, S.N., Rohrbach, M., Sevilla-Lara, L.: Smart frame selection for action recognition (2020). https://doi.org/10.48550/ARXIV.2012.10671. ar**v:2012.10671
Gupta, A., Karel, A., Sakthi Balan, M.: Discovering cricket stroke classes in trimmed telecast videos. In: Nain, N., Vipparthi, S.K., Raman, B. (eds.) CVIP 2019. CCIS, vol. 1148, pp. 509–520. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-4018-9_45
Chapter Google Scholar
Hassner, T., Itcher, Y., Kliper-Gross, O.: Violent flows: Real-time detection of violent crowd behavior, pp. 1–6, June 2012. https://doi.org/10.1109/CVPRW.2012.6239348
Igor L. O., B., Victor H. C., M., Schwartz, W.R.: BubbleNET: a disperse recurrent structure to recognize activities. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 2216–2220 (2020). https://doi.org/10.1109/ICIP40778.2020.9190769
Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 495–502 (2010). https://doi.org/10.1109/TPAMI.2012.59
Article Google Scholar
Ke, S.R., Thuc, H.L.U., Lee, Y.J., Hwang, J.N., Yoo, J.H., Choi, K.H.: A review on video-based human activity recognition. Computers 2(2), 88–131 (2013). https://doi.org/10.3390/computers2020088. www.mdpi.com/2073-431X/2/2/88
Li, C., Hou, Y., Wang, P., Li, W.: Joint distance maps based action recognition with convolutional neural networks. IEEE Sig. Process. Lett. 24(5), 624–628 (2017). https://doi.org/10.1109/LSP.2017.2678539
Article Google Scholar
Lima, J., Figueiredo, C.: Temporal fusion approach for video classification with convolutional and LSTM neural networks applied to violence detection. Inteligencia Artif. 24, 40–50 (2021). https://doi.org/10.4114/intartif.vol24iss67pp40-50
Article Google Scholar
Lo Presti, L., La Cascia, M.: 3D skeleton-based human action classification: a survey. Pattern Recogn. 53, 130–147 (2016) https://doi.org/10.1016/j.patcog.2015.11.019. www.sciencedirect.com/science/article/pii/S0031320315004392
Much, A., Pottel, S., Sibold, K.: Preconjugate variables in quantum field theory and their applications. Phys. Rev. D 94(6), 065007 (2016). https://doi.org/10.1103/physrevd.94.065007
Article MathSciNet Google Scholar
Mumtaz, A., Sargana, A.B., Habib, Z.: Violence detection in surveillance videos with deep network using transfer learning, pp. 558–563, December 2018. https://doi.org/10.1109/EECS.2018.00109
Ng, J.Y.H., Choi, J., Neumann, J., Davis, L.S.: ActionFlowNet: learning motion representation for action recognition (2016). https://doi.org/10.48550/ARXIV.1612.03052. ar**v:1612.03052
Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification (2015). https://doi.org/10.48550/ARXIV.1503.08909. ar**v:1503.08909
Parmar, P., Morris, B.: HalluciNet-ing spatiotemporal representations using a 2D-CNN (2019). https://doi.org/10.48550/ARXIV.1912.04430. ar**v:1912.04430
Peixoto, B.M., Lavi, B., Martin, J.P.P., Avila, S., Dias, Z., Rocha, A.: Toward subjective violence detection in videos. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, pp. 8276–8280 (2019)
Google Scholar
Pham, H.H., Khoudour, L., Crouzil, A., Zegers, P., Velastin, S.A.: Video-based human action recognition using deep learning: a review (2022). https://doi.org/10.48550/ARXIV.2208.03775. ar**v:2208.03775
Popoola, O.P., Wang, K.: Video-based abnormal human behavior recognition - a review. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(6), 865–878 (2012). https://doi.org/10.1109/TSMCC.2011.2178594
Article Google Scholar
Rahmad, N., As’ari, M.A.: The new convolutional neural network (CNN) local feature extractor for automated badminton action recognition on vision based data. J. Phys: Conf. Ser. 1529, 022021 (2020). https://doi.org/10.1088/1742-6596/1529/2/022021
Article Google Scholar
Ramzan, M., et al.: A review on state-of-the-art violence detection techniques. IEEE Access 7, 107560–107575 (2019)
Article Google Scholar
Shabani, A.H., Clausi, D.A., Zelek, J.S.: Improved spatio-temporal salient feature detection for action recognition. In: British Machine Vision Conference, August 2011, University of Dundee, Dundee, UK (2011)
Google Scholar
Sharma, M., Baghel, R.: Video surveillance for violence detection using deep learning (2020)
Google Scholar
Soliman, M.M., Kamal, M.H., El-Massih Nashed, M.A., Mostafa, Y.M., Chawky, B.S., Khattab, D.: Violence recognition from videos using deep learning techniques, pp. 80–85, December 2019. https://doi.org/10.1109/ICICIS46948.2019.9014714
Soliman, M.M., Kamal, M.H., Nashed, M.A.E.M., Mostafa, Y.M., Chawky, B.S., Khattab, D.R.: Violence recognition from videos using deep learning techniques. In: 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), pp. 80–85 (2019)
Google Scholar
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild (2012). https://doi.org/10.48550/ARXIV.1212.0402. ar**v:1212.0402
Sumon, S.A., Goni, R., Hashem, N.B., Shahria, T., Rahman, R.M.: Violence detection by pretrained modules with different deep learning approaches. Vietnam J. Comput. Sci. 7(01), 19–40 (2020)
Article Google Scholar
Szegedy, C., et al.: Going deeper with convolutions (2014)
Google Scholar
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks (2014). https://doi.org/10.48550/ARXIV.1412.0767. ar**v:1412.0767
Trockman, A., Kolter, J.Z.: Patches are all you need? (2022). https://doi.org/10.48550/ARXIV.2201.09792. ar**v:2201.09792
Ullah, F.U.M., Ullah, A., Muhammad, K., Haq, I., Baik, S.: Violence detection using spatiotemporal features with 3D convolutional neural network. Sensors 19, 2472 (2019). https://doi.org/10.3390/s19112472
Article Google Scholar
Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P.: Deep convolutional neural networks for action recognition using depth map sequences (2015). https://doi.org/10.48550/ARXIV.1501.04686. ar** the path signature methodology and its application to landmark-based human action recognition (2017). https://doi.org/10.48550/ARXIV.1707.03993. ar**v:1707.03993
Zhang, T., Yang, Z., Jia, W., Yang, B., Yang, J., He, X.: A new method for violence detection in surveillance scenes. Multimedia Tools Appl. 75(12), 7327–7349 (2016). https://doi.org/10.1007/s11042-015-2648-8
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00185, Roma, Italy
Andrea Alfarano, Giorgio De Magistris, Leonardo Mongelli & Christian Napoli
Department of Psychology, Sapienza University of Rome, Via dei Marsi 78, 00185, Roma, Italy
Samuele Russo
Department of Computational Intelligence, Czestochowa University of Technology, al. Armii Krajowej 36, 42-200, Czestochowa, Poland
Janusz Starczewski
Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, 00185, Roma, Italy
Christian Napoli

Authors

Andrea Alfarano
View author publications
You can also search for this author in PubMed Google Scholar
Giorgio De Magistris
View author publications
You can also search for this author in PubMed Google Scholar
Leonardo Mongelli
View author publications
You can also search for this author in PubMed Google Scholar
Samuele Russo
View author publications
You can also search for this author in PubMed Google Scholar
Janusz Starczewski
View author publications
You can also search for this author in PubMed Google Scholar
Christian Napoli
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christian Napoli .

Editor information

Editors and Affiliations

Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Krakow, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Alfarano, A., De Magistris, G., Mongelli, L., Russo, S., Starczewski, J., Napoli, C. (2023). A Novel ConvMixer Transformer Based Architecture for Violent Behavior Detection. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2023. Lecture Notes in Computer Science(), vol 14126. Springer, Cham. https://doi.org/10.1007/978-3-031-42508-0_1

Download citation

DOI: https://doi.org/10.1007/978-3-031-42508-0_1
Published: 14 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42507-3
Online ISBN: 978-3-031-42508-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Novel ConvMixer Transformer Based Architecture for Violent Behavior Detection