Abstract
Clustering is recognized as one of the most prominent tasks in data mining. It aims to partition a given set of elements into homogeneous groups according to some (dis)similarity criterion, without any prior knowledge about the distribution of the data. In this paper, we propose a novel streaming algorithm based on the split technique, which was introduced to avoid retraining from scratch and to ensure incremental clustering. The algorithm clusters continuously arriving chunks of data accompanied by new mixed features under memory and time restrictions. The proposed real-time method uses the split technique to handle the incremental object, attribute, and class learning spaces at once, so the final distribution of the clusters is updated whenever necessary. Updating the final cluster distribution through the split technique yields a promising clustering model. Experiments performed on real mixed data sets show that the proposal is efficient and outperforms the conventional k-prototypes method according to different evaluation measures.
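As a point of reference for the baseline named in the abstract, the following is a minimal sketch of the mixed dissimilarity used by the conventional k-prototypes method (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches), together with the assignment of one arriving chunk to its nearest prototypes. It is not the authors' split algorithm, whose details are not given here. The function names, the `gamma` weight value, and the toy data are assumptions made only for illustration.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """k-prototypes style dissimilarity between an object and a prototype.

    Numeric part: squared Euclidean distance.
    Categorical part: number of mismatching attributes, weighted by gamma.
    (gamma=1.0 is an arbitrary illustrative default.)
    """
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * categorical


def assign_chunk(chunk, prototypes, gamma=1.0):
    """Assign every object of an arriving chunk to its nearest prototype.

    chunk and prototypes are lists of (numeric_values, categorical_values).
    Returns the index of the closest prototype for each object.
    """
    labels = []
    for x_num, x_cat in chunk:
        dists = [
            mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
            for p_num, p_cat in prototypes
        ]
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical toy data: two prototypes and one chunk of three objects.
prototypes = [([0.0, 0.0], ["a"]), ([5.0, 5.0], ["b"])]
chunk = [([0.5, 0.0], ["a"]), ([4.0, 5.0], ["b"]), ([0.0, 0.0], ["b"])]
labels = assign_chunk(chunk, prototypes)
```

In this sketch a larger `gamma` gives categorical mismatches more influence relative to numeric distance; how that weight is chosen in the paper's experiments is not stated in this section.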
Data availability
The data used in the experimentation section are open source and derived from the UCI repository [26], OpenML [27], and the Kaggle data set https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
References
Anderlucci, L., Fortunato, F., Montanari, A.: High-dimensional clustering via random projections. J. Classif. 1–26 (2021)
Bhagat, A., Kshirsagar, N., Khodke, P., Dongre, K., Ali, S.: Penalty parameter selection for hierarchical data stream clustering. Proc. Comput. Sci. 79, 24–31 (2016)
Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.D., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–31 (2013)
Chefrour, A.: Incremental supervised learning: algorithms and applications in pattern recognition. Evol. Intel. 12(2), 97–112 (2019)
Sowjanya, A.M., Shashi, M.: Cluster feature-based incremental clustering approach (CFICA) for numerical data. Int. J. Comput. Sci. Netw. Sec. 10(9), 73–79 (2010)
Lamirel, J. C., Mall, R., Ahmad, M.: Comportement comparatif des méthodes de clustering incrémentales et non incrémentales sur les données textuelles hétérogènes. In: 11th International Francophone Conference on Knowledge Extraction and Management (EGC 2011) (2011)
Sowjanya, A.M., Shashi, M.: A cluster feature-based incremental clustering approach to mixed data. J. Comput. Sci. 7(12), 1875 (2011)
Noorbehbahani, F., Mousavi, S.R., Mirzaei, A.: An incremental mixed data clustering method using a new distance measure. Soft. Comput. 19(3), 731–743 (2015)
Shen, F., Hasegawa, O.: A fast nearest neighbor classifier based on self-organizing incremental neural network. Neural Netw. 21(10), 1537–1547 (2008)
Aggarwal, C.C., Yu, P.S., Han, J., Wang, J.: A framework for clustering evolving data streams. In: Proceedings 2003 VLDB Conference. Morgan Kaufmann, pp. 81–92 (2003)
Ghesmoune, M., Lebbah, M., Azzag, H.: State-of-the-art on clustering data streams. Big Data Anal. 1(1), 1–27 (2016)
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: a clustering algorithm for data streams. J. Exp. Algorithmics 17, 2–1 (2012)
Amini, A., Wah, T.Y., Saboohi, H.: On density-based data streams clustering algorithms: a survey. J. Comput. Sci. Technol. 29(1), 116–141 (2014)
Ounali, C., Ben Rejab, F., Nouira Ferchichi, K.: Incremental algorithm based on split technique. In: International Conference on Intelligent Systems Design and Applications. Springer, Cham, pp. 567–576 (2018)
Bao, J., Wang, W., Yang, T., Wu, G.: An incremental clustering method based on the boundary profile. PLoS ONE 13(4), e0196108 (2018)
Savaresi, S.M., Boley, D.L., Bittanti, S., Gazzaniga, G.: Cluster selection in divisive clustering algorithms. In: Proceedings of the 2002 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, pp. 299–314 (2002)
Marszałek, Z.: Performance tests on merge sort and recursive merge sort for big data processing. Technical Sciences/University of Warmia and Mazury in Olsztyn (2018)
Gorrab, S., Rejab, F.B.: IK-prototypes: incremental mixed attribute learning based on K-prototypes algorithm, a new method. In: International Conference on Intelligent Systems Design and Applications. Springer, Cham, pp. 880–890 (2020)
Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Disc. 2(3), 283–304 (1998)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297 (1967)
Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. DMKD 3(8), 34–39 (1997)
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979)
Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. Theory Methods 3(1), 1–27 (1974)
Xu, R., Wunsch, D.C.: Clustering algorithms in biomedical research: a review. IEEE Rev. Biomed. Eng. 3, 120–154 (2010)
Clarke, K.R., Chapman, M.G., Somerfield, P.J., Needham, H.R.: Dispersion-based weighting of species counts in assemblage analyses. Mar. Ecol. Prog. Ser. 320, 11–27 (2006)
Asuncion, A., Newman, D.: UCI machine learning repository (2007)
Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl. 15(2), 49–60 (2014)
Guo, S., Dong, X.L., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endow. 3(1–2), 417–428 (2010)
Ethics declarations
Conflict of interest
Authors have no conflict of interest to declare.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study. This manuscript is the authors’ original work and has not been submitted simultaneously elsewhere. All authors have checked the manuscript and agreed to the submission.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gorrab, S., Ben Rejab, F. & Nouira, K. Split incremental clustering algorithm of mixed data stream. Prog Artif Intell 13, 51–64 (2024). https://doi.org/10.1007/s13748-024-00316-1
DOI: https://doi.org/10.1007/s13748-024-00316-1