Abstract
Nowadays most streets, squares and buildings are monitored by a large number of surveillance cameras. Nevertheless, these cameras are used only to record scenes for analysis after crimes or thefts, not to prevent violent actions automatically. In a few cases a guard may check the videos manually in real time, but this is a very inefficient and expensive process. In this paper we propose a novel approach to the violence detection task using a recent architecture named ConvMixer, a simple CNN that uses patch-based embeddings to obtain superior performance with fewer parameters and computational resources. We also use a technique that arranges frames into super images, encoding the temporal information in the spatial dimensions. Our tests on the popular “Real Life Violence Situations” dataset yield an accuracy of 0.95, placing our proposed model in second position on the leaderboard for that dataset.
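The super-image idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the grid layout, the number of frames, or how an incomplete grid is filled, so the row-major ordering, the zero padding, and the names `make_super_image` and `grid_cols` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Frame = List[List[int]]  # one grayscale frame as rows of pixel values


def make_super_image(frames: List[Frame], grid_cols: int) -> Frame:
    """Tile equally sized frames, in temporal order, into one large 2D image.

    Assumption: frames are placed row-major (left to right, top to bottom)
    and missing grid cells are filled with zero-valued frames.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if grid_cols <= 0:
        raise ValueError("grid_cols must be positive")

    height, width = len(frames[0]), len(frames[0][0])
    grid_rows = -(-len(frames) // grid_cols)  # ceiling division

    # Pad with blank frames so the grid is completely filled.
    blank = [[0] * width for _ in range(height)]
    padded = frames + [blank] * (grid_rows * grid_cols - len(frames))

    super_image: Frame = []
    for r in range(grid_rows):
        row_frames = padded[r * grid_cols:(r + 1) * grid_cols]
        for y in range(height):
            # Concatenate pixel row y of every frame in this grid row.
            line: List[int] = []
            for frame in row_frames:
                line.extend(frame[y])
            super_image.append(line)
    return super_image


# Four 2x2 frames, each filled with its own time index 0..3.
frames = [[[t] * 2 for _ in range(2)] for t in range(4)]
super_image = make_super_image(frames, grid_cols=2)
```

With this layout, time steps that are adjacent in the clip become spatial neighbours in the grid, so a purely 2D network such as ConvMixer can pick up motion cues from the spatial arrangement alone.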
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alfarano, A., De Magistris, G., Mongelli, L., Russo, S., Starczewski, J., Napoli, C. (2023). A Novel ConvMixer Transformer Based Architecture for Violent Behavior Detection. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2023. Lecture Notes in Computer Science, vol 14126. Springer, Cham. https://doi.org/10.1007/978-3-031-42508-0_1
DOI: https://doi.org/10.1007/978-3-031-42508-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42507-3
Online ISBN: 978-3-031-42508-0
eBook Packages: Computer Science (R0)