A high-throughput architecture for anomaly detection in streaming data using machine learning algorithms

Surianarayanan, Chellammal; Kunasekaran, Saranya; Chelliah, Pethuru Raj

doi:10.1007/s41870-023-01585-0

Original Research
Published: 04 November 2023

Volume 16, pages 493–506, (2024)

International Journal of Information Technology

Chellammal Surianarayanan¹,
Saranya Kunasekaran² &
Pethuru Raj Chelliah³

Abstract

Detecting anomalies in streaming data requires continuous, real-time analysis of the stream, which is difficult because the volume and velocity of data streams vary. This work proposes a streaming architecture that can consume large, high-speed streams and ingest them for continuous anomaly detection using machine learning algorithms. The high-throughput architecture builds on the proven publish-subscribe model. The publisher discretizes the incoming stream and ingests the discretized streams into two subscribers, which work alternately on them for continuous machine-learning-based anomaly detection. Using two subscribers addresses the inherent computation time that a machine learning algorithm needs for learning and for detecting anomalies. The architecture is implemented using Apache Kafka, a distributed messaging system capable of communicating high volumes of data from one endpoint to another. Anomalies are detected using the Random Forest (RF) algorithm with an extended, threshold-value-based point anomaly detection method. While the instances of the target attribute are checked for anomalies, the instances of its strongly correlated attributes (if any) are also tested to provide additional knowledge. The proposed architecture has been tested with a case study in which streams of different speeds and sizes are ingested into the subscribers, and the performance of the algorithm has been analyzed in terms of accuracy, computation time and data-miss in two configurations of the architecture: single node and distributed node.
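As a rough illustration of the publisher-side behaviour described above, the following Python sketch cuts a stream into consecutive batches and hands them to two subscribers in turn. The function names `discretize` and `dispatch_alternately`, the fixed batch size and the simple round-robin rule are assumptions made for illustration only; the abstract does not state how the stream is discretized or scheduled, and no Apache Kafka calls are shown.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def discretize(stream: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Cut a (possibly unbounded) stream into consecutive batches.

    A fixed batch size is an assumption; a time-based window would work too.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


def dispatch_alternately(
    batches: Iterable[List[float]], n_subscribers: int = 2
) -> Iterator[Tuple[int, List[float]]]:
    """Pair batch k with subscriber index k mod n_subscribers.

    With two subscribers, one can still be processing its batch
    while the next batch is handed to the other.
    """
    for k, batch in enumerate(batches):
        yield k % n_subscribers, batch


# Hypothetical stream of ten readings, discretized into batches of four.
readings = list(range(10))
assignments = list(dispatch_alternately(discretize(readings, batch_size=4)))

for subscriber, batch in assignments:
    print(f"subscriber {subscriber} receives {batch}")
```

In this sketch every reading ends up in exactly one batch, which mirrors the data-miss criterion mentioned in the abstract; in a Kafka deployment the subscriber index might instead correspond to a topic or partition, but that mapping is not specified in the source.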

The key findings of the study are as follows. The Random Forest algorithm is found to efficiently detect anomalies in streams without missing any data. In the single-node configuration, the algorithm produces an average accuracy of 96.7% and a computation time of 37.5 ms. In the distributed-node configuration, it produces an average accuracy of 98.6% and a computation time of 38.5 ms. The RF algorithm is also found to produce more consistent accuracy and computation time across streams with a wide range of speeds and sizes. The proposed method is therefore well suited to practical applications with massive streams. Detecting anomalies without latency helps to implement accurate countermeasures at the right time, so that loss of business, finance or resources can be prevented.

Availability of supporting data

We used two datasets. The first consists of 200 records collected by survey from different individuals and is not available for public use. The second is a synthetic data set consisting of discretized streams of different speeds; it is also not available for public use.

Acknowledgements

Not required.

Funding

The authors have no funding support.

Author information

Authors and Affiliations

Centre for Distance and Online Education, Bharathidasan University, Tiruchirappalli, Tamilnadu, India
Chellammal Surianarayanan
Department of Computer Science, Government Arts and Science College, Srirangam, Affiliated to Bharathidasan University, Tiruchirappalli, Tamilnadu, India
Saranya Kunasekaran
Edge AI Division, Reliance Jio Platforms Ltd, Bangalore, India
Pethuru Raj Chelliah

Contributions

All the authors have contributed.

Corresponding author

Correspondence to Chellammal Surianarayanan.

Ethics declarations

Conflict of interest

The authors do not have any competing interests.

Ethical approval

Not applicable.

Human and animal participants

The authors did not conduct any tests on human or animal participants.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Surianarayanan, C., Kunasekaran, S. & Chelliah, P.R. A high-throughput architecture for anomaly detection in streaming data using machine learning algorithms. Int. j. inf. tecnol. 16, 493–506 (2024). https://doi.org/10.1007/s41870-023-01585-0

Received: 22 April 2023
Accepted: 03 October 2023
Published: 04 November 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s41870-023-01585-0
