Abstract
The k-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global k-means algorithm is a deterministic method proposed to tackle the random initialization problem of k-means, but it is well known to require a high computational cost. It partitions the data into K clusters by incrementally solving all intermediate k-means sub-problems for \({k=1,\ldots , K}\). For each sub-problem with k clusters, the method executes the k-means algorithm N times, where N is the number of data points. In this paper, we propose the global k-means++ clustering algorithm, an effective way of acquiring clustering solutions of quality akin to those of global k-means at a reduced computational cost. This is achieved by exploiting the center selection probability that is effectively used in the k-means++ algorithm. The proposed method has been tested and compared on various benchmark datasets, yielding very satisfactory results in terms of clustering quality and execution speed.
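To make the incremental scheme concrete, the following NumPy sketch illustrates the idea described above: for each k, instead of trying every one of the N data points as the position of the new center (as global k-means does), a small number of candidates is drawn with probability proportional to the squared distance from the nearest existing center, as in k-means++ seeding. This is an illustrative sketch under our own assumptions, not the authors' reference implementation (see the linked repository for that); the function names and the candidate count `n_candidates` are our own choices.

```python
import numpy as np

def d2_sample(X, centers, rng):
    """Sample one data point with probability proportional to its squared
    distance from the nearest existing center (k-means++ selection)."""
    d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    return X[rng.choice(len(X), p=d2 / d2.sum())]

def lloyd(X, centers, n_iter=100):
    """Standard Lloyd (k-means) iterations from the given initial centers."""
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    error = ((X - centers[labels]) ** 2).sum()  # sum-of-squares clustering error
    return centers, error

def global_kmeans_pp(X, K, n_candidates=10, seed=0):
    """Solve the k-clustering sub-problems incrementally for k = 1..K,
    trying only a few D^2-sampled candidate positions for each new center."""
    rng = np.random.default_rng(seed)
    centers = X.mean(axis=0, keepdims=True)  # the k = 1 solution is the data mean
    solutions = {1: (centers, ((X - centers) ** 2).sum())}
    for k in range(2, K + 1):
        best = None
        for _ in range(n_candidates):
            cand = d2_sample(X, centers, rng)          # candidate new center
            c, err = lloyd(X, np.vstack([centers, cand]))
            if best is None or err < best[1]:
                best = (c, err)
        centers = best[0]
        solutions[k] = best
    return solutions
```

Because each k-level run starts from the converged centers of level k-1 plus one well-placed candidate, the clustering error is non-increasing in k, while the number of k-means executions per level drops from N to `n_candidates`.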
Graphical abstract
Code Availability
The code is available online at https://github.com/gvardakas/global-kmeans-pp.git.
Data availability and access
The datasets analysed during the current study are available in the UCI repository, https://archive.ics.uci.edu/ml/index.php, and in the MNIST database, http://yann.lecun.com/exdb/mnist/.
Notes
The synthetic dataset is available in the following GitHub repository: https://github.com/deric/clustering-benchmark.git.
Experiments were carried out on a machine with an Intel® Core™ i7-8700 CPU at 3.20 GHz and 16 GB of RAM.
The execution time limit was set to 7 days, and the available memory was 16 GB of RAM.
Acknowledgements
This research was supported by the project "Dioni: Computing Infrastructure for Big-Data Processing and Analysis" (MIS No. 5047222), co-funded by the European Union (ERDF) and Greece through the Operational Program "Competitiveness, Entrepreneurship and Innovation", NSRF 2014-2020.
Author information
Authors and Affiliations
Contributions
Both authors contributed equally to all aspects of this research. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest.
Ethical and informed consent for data used
This article does not contain any studies conducted on human or animal subjects by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vardakas, G., Likas, A. Global k-means++: an effective relaxation of the global k-means clustering algorithm. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05636-2