A high-throughput architecture for anomaly detection in streaming data using machine learning algorithms

  • Original Research
International Journal of Information Technology

Abstract

Detecting anomalies in streaming data requires continuous analysis of the stream in real time. This is difficult because the volume and velocity of data streams vary. The purpose of this work is to propose a streaming architecture capable of consuming bulky, high-speed streams and ingesting them for continuous anomaly detection using machine learning algorithms. A high-throughput architecture has been created by leveraging the proven publish-subscribe model. The publisher discretizes the incoming stream and ingests the discretized streams into two subscribers, which work alternately on the discretized streams to provide continuous machine learning based anomaly detection. Using two subscribers addresses the inherent computation time that a machine learning algorithm needs for learning and for detecting anomalies. The architecture is implemented using Apache Kafka, a distributed messaging system capable of communicating a high volume of data from one end-point to another. Anomalies are detected using the Random Forest (RF) algorithm with an extended, threshold-value based point anomaly detection method. While the instances of the target attribute are checked for anomalies, the instances of its strongly correlated attributes (if any) are also tested for anomalies to provide additional knowledge. The proposed architecture has been tested with a case study in which streams of different speeds and sizes are ingested into the subscribers, and the performance of the algorithm has been analyzed in terms of accuracy, computation time and data-miss in two configurations of the architecture, namely single node and distributed node.
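The publisher-side idea of discretizing a stream and handing the pieces to two subscribers in turn can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python generators instead of Apache Kafka, and the function names `discretize` and `dispatch_alternately`, the fixed batch size, and the strict round-robin alternation are all assumptions made for illustration. How the alternation is mapped onto Kafka topics or partitions is not stated in the abstract.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def discretize(stream: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Cut a continuous stream into fixed-size batches (hypothetical helper).

    The last batch may be shorter, so no record is dropped.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def dispatch_alternately(
    batches: Iterable[List[float]],
) -> Iterator[Tuple[int, List[float]]]:
    """Pair each batch with subscriber 0 or 1 in turn (assumed round-robin).

    While one subscriber is busy with its batch, the next batch goes to
    the other subscriber.
    """
    for index, batch in enumerate(batches):
        yield index % 2, batch


if __name__ == "__main__":
    readings = range(10)  # stand-in for an incoming stream
    for subscriber, batch in dispatch_alternately(discretize(readings, batch_size=3)):
        print(f"subscriber {subscriber} <- {batch}")
```

Because every record lands in exactly one batch and every batch is assigned to exactly one subscriber, the sketch loses no data, which mirrors the data-miss criterion used in the evaluation.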

The main findings of the study are as follows. The RF algorithm is found to detect anomalies in streams efficiently without missing any data. In the single node configuration, the algorithm produces an average accuracy of 96.7% and computation time of 37.5 ms. In the distributed node configuration, it produces an average accuracy of 98.6% and computation time of 38.5 ms. Another important finding is that the RF algorithm produces consistent accuracy and computation time for streams spanning a wide range of speeds and sizes. The proposed method is therefore well suited to practical applications with massive streams. Detecting anomalies without latency helps to implement accurate countermeasures at the right time so that loss of business, finance or resources can be prevented.
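The abstract mentions a threshold-value based point anomaly check and an additional test on strongly correlated attributes, but does not spell out either rule. The following is only an illustrative sketch under stated assumptions: an instance is flagged when its observed value differs from a model prediction by more than a threshold, and an attribute counts as strongly correlated when the absolute Pearson coefficient reaches a cut-off. The predictions would come from a trained model such as a Random Forest; here they are passed in as a plain list. The names `point_anomalies`, `pearson` and `is_strongly_correlated`, and the values 2.0 and 0.8, are hypothetical.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def point_anomalies(
    actual: Sequence[float], predicted: Sequence[float], threshold: float
) -> List[int]:
    """Return indices where |actual - predicted| exceeds the threshold."""
    return [
        i
        for i, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) > threshold
    ]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def is_strongly_correlated(
    target: Sequence[float], other: Sequence[float], min_abs_r: float = 0.8
) -> bool:
    """Assumed rule: |r| at or above the cut-off means 'strongly correlated'."""
    return abs(pearson(target, other)) >= min_abs_r


if __name__ == "__main__":
    actual = [10.0, 10.5, 30.0, 9.8]      # observed target values in one batch
    predicted = [10.2, 10.4, 10.1, 10.0]  # stand-in for model predictions
    print(point_anomalies(actual, predicted, threshold=2.0))
```

In this reading, any attribute for which `is_strongly_correlated` returns true would be passed through `point_anomalies` as well, so that an anomaly in the target attribute can be cross-checked against its companions.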

Availability of supporting data

We used two datasets. The first consists of 200 records collected by survey from different individuals and is not available for public use. The second is a synthetic dataset consisting of discretized streams of different speeds; it is also not available for public use.

Acknowledgements

Not required.

Funding

The authors have no funding support.

Author information

Contributions

All the authors have contributed.

Corresponding author

Correspondence to Chellammal Surianarayanan.

Ethics declarations

Conflict of interest

The authors do not have any competing interests.

Ethical approval

Not applicable.

Human and animal participants

The authors did not conduct any tests on human or animal participants.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Surianarayanan, C., Kunasekaran, S. & Chelliah, P.R. A high-throughput architecture for anomaly detection in streaming data using machine learning algorithms. Int. j. inf. tecnol. 16, 493–506 (2024). https://doi.org/10.1007/s41870-023-01585-0

  • DOI: https://doi.org/10.1007/s41870-023-01585-0
