Abstract
This paper introduces a strategy for clustering online multiple data streams. We assume that several sources are used for recording, over time, data about some physical phenomena. Each source provides repeated measurements at a very high frequency so that it is not possible to store the whole amount of data into some easy-to-access media, but data are available only in batches. Our aim is to discover a partition of the sources (e.g. sensors) into homogeneous clusters, analysing the incoming streams of data. The proposed strategy is based on processing the incoming data batches independently, through an initial summarization of the data batches by histograms and, then, by means of a local clustering performed on the histograms which provides a further data summarization. To keep track of the data proximities among the data streams over time, we use local clustering outputs for updating a proximity matrix. The final partitioning of the streams is obtained by a clustering based on such proximity matrix. Through an application on real and simulated data, we show the effectiveness of our strategy in finding homogeneous groups of sources of data streams.
Similar content being viewed by others
References
Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) Streamkm++: a clustering algorithm for data streams. J Exp Algorithmics 17:2.4:2.1–2.4:2.30. https://doi.org/10.1145/2133803.2184450
Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th conference on foundations of data organization and algorithms
Alseghayer R, Petrov D, Chrysanthis PK, Sharaf MA, Labrinidis A (2017) Detection of highly correlated live data streams. BIRTE 3(1–3):8
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, vol 29. VLDB Endowment, VLDB’03, pp 81–92
Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. Soc Ind Appl Math 43:904–924
Arroyo J, Maté C (2009) Forecasting histogram time series with k-nearest neighbours methods. Int J Forecast 25(1):192–207. https://doi.org/10.1016/j.ijforecast.2008.07.003
Berckmoes B, Lowen R, Van Casteren J (2011) Distances on probability measures and random variables. J Math Anal Appl 374(2):412–428
Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58(2):180–204. https://doi.org/10.1016/j.datak.2005.05.009
Billard L, Diday E (2003) From the statistics of data to the statistic of knowledge: symbolic data analysis. JASA J Am Stat Assoc 98(462):470–487
Caló DG, Montanari A, Viroli C (2014) A hierarchical modeling approach for clustering probability density functions. Comput Stat Data Anal 71:79–91
Cao F, Estert M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise, pp 328–339. https://doi.org/10.1137/1.9781611972764.29
Chan K, Fu AW-C (1999) Efficient time series matching by wavelets. In: Proceedings 15th international conference on data engineering Sydney, NSW, Australia, pp 126–133. https://doi.org/10.1109/ICDE.1999.754915
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’07, pp 133–142. https://doi.org/10.1145/1281192.1281210
Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294. https://doi.org/10.1561/1900000004
Dai BR, Huang JW, Yeh MY, Chen MS (2006) Adaptive clustering for multiple evolving streams. IEEE Trans Knowl Data Eng 18(9):1166–1180. https://doi.org/10.1109/TKDE.2006.137
Diday E, Noirhomme-Fraiture M (eds) (2008) Symbolic data analysis and the SODAS software. Wiley, Hoboken
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh EJ (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2):1542–1552
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining, AAAI Press, KDD’96, pp 226–231
Fränti P (2018) Efficiency of random swap clustering. J Big Data 5(13):1–29
Fränti P, Rezaei M, Zhao Q (2014) Centroid index: cluster level similarity measure. Pattern Recognit 47(9):3034–3045
Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759
Ganguly AR, Gama J, Omitaomu OA, Gaber M, Vatsavai RR (eds) (2008) Knowledge discovery from sensor data. CRC Press, Boca Raton
Garofalakis M, Gehrke J, Rastogi R (eds) (2016) Data stream management. Data-centric systems and applications. Springer, Berlin
Ghesmoune M, Azzag H, Lebbah M (2014) G-stream: growing neural gas over data stream. Springer, Cham, pp 207–214
Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 7(3):419–435
Gong S, Zhang Y, Yu G (2017) Clustering stream data by exploring the evolution of density mountain. Proc VLDB Endow 11(4):393–405. https://doi.org/10.1145/3186728.3164136
González-Rivera G, Arroyo J (2012) Time series modeling of histogram-valued data: the daily histogram time series of S&P500 intradaily returns. Int J Forecast 28(1):20–33. https://doi.org/10.1016/j.ijforecast.2011.02.007
Henderson K, Gallagher B, Eliassi-Rad T (2015) EP-MEANS: an efficient nonparametric clustering of empirical probability distributions. In: SAC ’15 proceedings of the 30th annual ACM symposium on applied computing, pp 893–900
Huang D, Zheng WS, Lai JH, Wang CD (2013) Svstream: a support vector-based algorithm for clustering data streams. IEEE Trans Knowl Data Eng 25:1410–1424. https://doi.org/10.1109/TKDE.2011.263
Irpino A, Iacono M (2011) Improving the MHIST-p algorithm for multivariate histograms of continuous data. In: Classification and multivariate analysis for complex data structures. Springer, pp 155–164. ISBN: 978-3-642-13311-4
Irpino A, Verde R, De Carvalho FAT (2014) Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Syst Appl 41(7):3351–3366
Irpino A, Verde R (2015) Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9(2):143–175. https://doi.org/10.1007/s11634-014-0176-4
Jiang B, Pei J, Tao Y, Lin X (2013) Clustering uncertain data based on probability distribution similarity. IEEE Trans Knowl Data Eng 25:751–763
Kärkkäinen I, Fränti P (2007) Gradual model generator for single-pass clustering. Pattern Recognit 40(3):784–795. https://doi.org/10.1016/j.patcog.2006.06.023
Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst 3:263. https://doi.org/10.1007/PL00011669
Laurinec P, Luck M (2018) Interpretable multiple data streams clustering with clipped streams representation for the improvement of electricity consumption forecasting. Data Min Knowl Disc. https://doi.org/10.1007/s10618-018-0598-2
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability. 1. University of California Press, pp 281–297
Mallows CL (1972) A note on asymptotic joint normality. Ann Math Stat 43(2):508–515
Panaretos VM, Zemel Y (2019) Statistical aspects of Wasserstein distances. Ann Rev Stat Appl 6:1
Rezaei M, Fränti P (2016) Set matching measures for external cluster validity. IEEE Trans Knowl Data Eng 28(8):2173–2186. https://doi.org/10.1109/TKDE.2016.2551240
Rodrigues PP, Gama J, Pedroso J (2008) Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng 20(5):615–627. https://doi.org/10.1109/TKDE.2007.190727
Sato M, Ishii S (2000) On-line EM algorithm for the normalized Gaussian network. Neural Comput 12(2):407–432
Sakurai Y, Papadimitriou S, Faloutsos C (2005) BRAID: stream mining through group lag correlations. ACM SIGMOD’05, pp 599–610
Sebastião R, Gama J (2007) Change detection in learning histograms from data streams. Prog Artif Intell 4874:112–123
Shafer I, Ren K, Boddeti VN, Abe Y, Ganger GR, Faloutsos C (2012) RainMon: an integrated approach to mining bursty timeseries monitoring data. ACM KDD’12, pp 1158–1166
Silva JA, Faria ER, Barros RC, Hruschka ER, Carvalho ACPLFd, Ja Gama (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):13:1–13:31. https://doi.org/10.1145/2522968.2522981
Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering. In: Proceedings of the 3rd international conference on advanced data mining and applications, Springer, Berlin, Heidelberg, ADMA’07, pp 605–615
Verde R, Irpino A (2007) Dynamic clustering of histogram data: using the right metric. Springer, Berlin, pp 123–134
Villani C (2008) Optimal transport: old and new. Springer, Berlin
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114
Zhu Y, Shasha D (2002) StatStream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th international conference on very large data bases. VLDB Endowment, p 358
Zhao J, Ishikawa Y, **ao C, Sugiura K (2018) Histogram construction for difference analysis of spatio-temporal data on array DBMS. Databases Theory Appl. https://doi.org/10.1007/978-3-319-92013-94
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Balzanella, A., Verde, R. Histogram-based clustering of multiple data streams. Knowl Inf Syst 62, 203–238 (2020). https://doi.org/10.1007/s10115-019-01350-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-019-01350-5