Abstract
Detecting anomalies in streaming data requires continuous, real-time analysis of the stream, which is difficult because the volume and velocity of data streams vary. This work proposes a streaming architecture that can consume bulky, high-speed streams and ingest them for continuous anomaly detection using machine learning algorithms. The high-throughput architecture leverages the proven publish-subscribe model. The publisher discretizes the incoming stream and ingests the discretized streams to two subscribers, which work on them alternately to perform continuous machine learning based anomaly detection. Using two subscribers addresses the inherent computation time that a machine learning algorithm needs for learning and for detecting anomalies. The architecture is implemented using Apache Kafka, a distributed messaging system capable of communicating high volumes of data from one end-point to another. Anomalies are detected using the Random Forest (RF) algorithm with an extended, threshold-value based point anomaly detection method. While the instances of the target attribute are checked for anomalies, the instances of its strongly correlated attributes (if any) are also tested to provide additional knowledge. The proposed architecture has been tested with a case study in which streams of different speeds and sizes are ingested to the subscribers, and the performance of the algorithm has been analyzed in terms of accuracy, computation time and data-miss in two configurations of the architecture, namely single node and distributed node.
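The alternating two-subscriber design can be illustrated with a minimal in-memory sketch. It is not the Kafka implementation reported in this work: plain Python lists stand in for Kafka topics, and the names `discretize` and `dispatch_alternately`, as well as the batch size, are assumptions introduced only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def discretize(stream: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Cut a (possibly unbounded) stream into consecutive fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def dispatch_alternately(
    batches: Iterable[List[float]], n_subscribers: int = 2
) -> List[List[List[float]]]:
    """Hand batch k to subscriber k mod n_subscribers (round-robin)."""
    inboxes: List[List[List[float]]] = [[] for _ in range(n_subscribers)]
    for k, batch in enumerate(batches):
        inboxes[k % n_subscribers].append(batch)
    return inboxes


# Hypothetical stream of ten records, discretized into batches of three.
records = list(range(10))
inbox_a, inbox_b = dispatch_alternately(discretize(records, batch_size=3))
print(inbox_a)  # [[0, 1, 2], [6, 7, 8]]
print(inbox_b)  # [[3, 4, 5], [9]]
```

Because consecutive batches go to different subscribers, one subscriber can still be learning on its batch while the other receives the next one, and every record lands in exactly one inbox.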
The key findings of the study are as follows. The RF algorithm efficiently detects the presence of anomalies in streams without missing any data. In the single node configuration, the algorithm produces an average accuracy of 96.7% and a computation time of 37.5 ms. In the distributed node configuration, it produces an average accuracy of 98.6% and a computation time of 38.5 ms. Another important finding is that the RF algorithm produces consistent accuracy and computation time for streams with a wide range of speeds and sizes. The proposed method is therefore well suited to practical applications with massive streams. Detecting anomalies without latency helps to implement accurate countermeasures at the right time, so that loss of business, finance or resources can be prevented.
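The abstract does not spell out the thresholding rule or how strongly correlated attributes are selected, so the following is only a sketch under assumed rules: an instance is flagged when the absolute difference between its observed and predicted value exceeds a per-attribute threshold, and an attribute counts as strongly correlated when the absolute Pearson coefficient with the target is at least 0.8. The predictions here come from a batch-median stand-in, not from a trained Random Forest, and all names, values and thresholds are hypothetical.

```python
from math import sqrt
from statistics import median
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def strongly_correlated(
    batch: Dict[str, List[float]], target: str, cutoff: float = 0.8
) -> List[str]:
    """Attributes whose |r| with the target reaches the (assumed) cutoff."""
    return [
        name
        for name, values in batch.items()
        if name != target and abs(pearson(values, batch[target])) >= cutoff
    ]


def point_anomalies(
    observed: Sequence[float], predicted: Sequence[float], threshold: float
) -> List[int]:
    """Indices whose absolute residual exceeds the threshold."""
    return [
        i for i, (o, p) in enumerate(zip(observed, predicted)) if abs(o - p) > threshold
    ]


# Hypothetical discretized batch with one injected spike at index 3.
batch = {
    "target": [10.0, 11.0, 12.0, 30.0, 14.0],
    "linked": [20.0, 22.0, 24.0, 60.0, 28.0],  # moves with the target
    "noise": [5.0, 1.0, 4.0, 2.0, 3.0],        # weakly related to the target
}
# Stand-in predictions: the batch median for every instance (NOT a Random Forest).
predicted = {name: [median(values)] * len(values) for name, values in batch.items()}
thresholds = {"target": 5.0, "linked": 10.0}  # assumed per-attribute thresholds

checked = ["target"] + strongly_correlated(batch, "target")
report = {
    name: point_anomalies(batch[name], predicted[name], thresholds[name])
    for name in checked
}
print(report)  # {'target': [3], 'linked': [3]}
```

In a setting closer to the one described, the `predicted` values would be supplied by the trained RF model for each discretized batch; the thresholding and correlation steps would remain separate from the model itself.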
Availability of supporting data
We used two datasets. The first consists of 200 records collected by survey from different individuals and is not available for public use. The second is a synthetic dataset consisting of discretized streams of different speeds; it is also not available for public use.
Acknowledgements
Not required.
Funding
The authors have no funding support.
Contributions
All the authors have contributed.
Ethics declarations
Conflict of interest
The authors do not have any competing interests.
Ethical approval
Not applicable.
Human and animal participants
The authors did not conduct any tests involving humans or animals.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Surianarayanan, C., Kunasekaran, S. & Chelliah, P.R. A high-throughput architecture for anomaly detection in streaming data using machine learning algorithms. Int. j. inf. tecnol. 16, 493–506 (2024). https://doi.org/10.1007/s41870-023-01585-0
DOI: https://doi.org/10.1007/s41870-023-01585-0