Abstract
The k-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global k-means algorithm is a deterministic method proposed to tackle the random initialization problem of k-means, but it is well known to require a high computational cost. It partitions the data into K clusters by incrementally solving all intermediate k-means sub-problems for \({k=1,\ldots , K}\). For each sub-problem with k clusters, the method executes the k-means algorithm N times, where N is the number of data points. In this paper, we propose the global k-means++ clustering algorithm, an effective way of acquiring clustering solutions of quality akin to those of global k-means at a reduced computational cost. This is achieved by exploiting the center selection probability that is effectively used in the k-means++ algorithm. The proposed method has been tested and compared on various benchmark datasets, yielding very satisfactory results in terms of clustering quality and execution speed.
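To make the incremental scheme concrete, the following NumPy sketch illustrates the idea described above: for each k, instead of trying every one of the N data points as the position of the new center (as global k-means does), a small number of candidates is drawn with probability proportional to the squared distance from the nearest existing center, as in k-means++ seeding. This is an illustrative sketch under our own assumptions, not the authors' reference implementation (see the linked repository for that); the function names and the candidate count `n_candidates` are our own choices.

```python
import numpy as np

def d2_sample(X, centers, rng):
    """Sample one data point with probability proportional to its squared
    distance from the nearest existing center (k-means++ selection)."""
    d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    return X[rng.choice(len(X), p=d2 / d2.sum())]

def lloyd(X, centers, n_iter=100):
    """Standard Lloyd (k-means) iterations from the given initial centers."""
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    error = ((X - centers[labels]) ** 2).sum()  # sum-of-squares clustering error
    return centers, error

def global_kmeans_pp(X, K, n_candidates=10, seed=0):
    """Solve the k-clustering sub-problems incrementally for k = 1..K,
    trying only a few D^2-sampled candidate positions for each new center."""
    rng = np.random.default_rng(seed)
    centers = X.mean(axis=0, keepdims=True)  # the k = 1 solution is the data mean
    solutions = {1: (centers, ((X - centers) ** 2).sum())}
    for k in range(2, K + 1):
        best = None
        for _ in range(n_candidates):
            cand = d2_sample(X, centers, rng)          # candidate new center
            c, err = lloyd(X, np.vstack([centers, cand]))
            if best is None or err < best[1]:
                best = (c, err)
        centers = best[0]
        solutions[k] = best
    return solutions
```

Because each k-level run starts from the converged centers of level k-1 plus one well-placed candidate, the clustering error is non-increasing in k, while the number of k-means executions per level drops from N to `n_candidates`.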
Graphical abstract
Code Availability
The code is available online at https://github.com/gvardakas/global-kmeans-pp.git.
Data availability and access
The datasets analysed during the current study are available in the UCI repository, https://archive.ics.uci.edu/ml/index.php, and in the MNIST database, http://yann.lecun.com/exdb/mnist/.
Notes
The synthetic dataset is available in the following GitHub repository: https://github.com/deric/clustering-benchmark.git.
Experiments were carried out on a machine with an Intel® Core™ i7-8700 CPU at 3.20 GHz and 16 GB of RAM.
The execution time limit was set to 7 days, and the available memory was 16 GB of RAM.
Acknowledgements
This research was supported by the project "Dioni: Computing Infrastructure for Big-Data Processing and Analysis" (MIS No. 5047222), co-funded by the European Union (ERDF) and Greece through the Operational Program "Competitiveness, Entrepreneurship and Innovation", NSRF 2014-2020.
Author information
Authors and Affiliations
Contributions
Both authors contributed equally to all aspects of this research. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest.
Ethical and informed consent for data used
This article does not contain any studies conducted on human or animal subjects by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vardakas, G., Likas, A. Global k-means++: an effective relaxation of the global k-means clustering algorithm. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05636-2