Global k-means++: an effective relaxation of the global k-means clustering algorithm

Published in: Applied Intelligence

Abstract

The k-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. Global k-means is a deterministic algorithm proposed to tackle the random initialization problem of k-means, but it is well known to require a high computational cost. It partitions the data into K clusters by incrementally solving all intermediate sub-problems with \(k = 1, \ldots, K\) clusters. For each sub-problem with k clusters, the method executes the k-means algorithm N times, where N is the number of data points. In this paper, we propose the global k-means++ clustering algorithm, an effective way of acquiring clustering solutions of quality akin to those of global k-means with a reduced computational load. This is achieved by exploiting the center selection probability that is effectively used in the k-means++ algorithm. The proposed method has been tested and compared on various benchmark datasets, yielding very satisfactory results in terms of both clustering quality and execution speed.
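
To make the computational savings concrete, here is a minimal sketch of the idea, assuming scikit-learn's KMeans for the inner refinement step. This is not the authors' implementation (that is in the GitHub repository linked below); the function names and the n_candidates parameter are illustrative. Where global k-means would try every one of the N data points as the next center, the sketch samples only a few candidates with the k-means++ probability, i.e. proportionally to each point's squared distance from its nearest existing center.

```python
import numpy as np
from sklearn.cluster import KMeans

def d2_sample(X, centers, n_candidates, rng):
    """Sample candidate indices with probability proportional to the squared
    distance to the nearest existing center (the k-means++ selection rule)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    p = d2 / d2.sum()
    # Never request more candidates than there are points with nonzero probability.
    return rng.choice(len(X), size=min(n_candidates, np.count_nonzero(p)),
                      replace=False, p=p)

def global_kmeans_pp(X, K, n_candidates=10, seed=0):
    """Incrementally solve the k = 1, ..., K sub-problems. Global k-means runs
    k-means N times per step (once per data point); this relaxation runs it
    only n_candidates times per step."""
    rng = np.random.default_rng(seed)
    centers = X.mean(axis=0, keepdims=True)  # the exact 1-means solution
    for k in range(2, K + 1):
        best = None
        for i in d2_sample(X, centers, n_candidates, rng):
            init = np.vstack([centers, X[i]])  # keep previous centers, add the candidate
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        centers = best.cluster_centers_
    return centers
```

For example, global_kmeans_pp(X, K=10) returns 10 centers for a data matrix X of shape (N, d); each of the K - 1 incremental steps costs n_candidates k-means runs instead of N.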




Code Availability

The code is available online at https://github.com/gvardakas/global-kmeans-pp.git.

Data availability and access

The datasets analysed during the current study are available in the UCI repository, https://archive.ics.uci.edu/ml/index.php, and in the MNIST database, http://yann.lecun.com/exdb/mnist/.

Notes

  1. The synthetic dataset is available in the following GitHub repository: https://github.com/deric/clustering-benchmark.git.

  2. Experiments were carried out on a machine with an Intel® Core™ i7-8700 CPU at 3.20 GHz and 16 GB of RAM.

  3. The time limit was set to 7 days of execution, and the available memory was 16 GB of RAM.


Acknowledgements

This research was supported by the project “Dioni: Computing Infrastructure for Big-Data Processing and Analysis” (MIS No. 5047222), co-funded by the European Union (ERDF) and Greece through the Operational Program “Competitiveness, Entrepreneurship and Innovation”, NSRF 2014-2020.

Author information


Contributions

Both authors contributed equally to all aspects of this research. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Aristidis Likas.

Ethics declarations

Conflict of Interest

The authors declare no conflict of interest.

Ethical and informed consent for data used

This article does not contain any studies conducted on human or animal subjects by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Vardakas, G., Likas, A. Global k-means++: an effective relaxation of the global k-means clustering algorithm. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05636-2


  • DOI: https://doi.org/10.1007/s10489-024-05636-2
