Coresets for kernel clustering

Published in: Machine Learning

Abstract

We devise coresets for kernel \(k\)-Means with a general kernel, and use them to obtain new, more efficient algorithms. Kernel \(k\)-Means has superior clustering capability compared to classical \(k\)-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel \(k\)-Means that works for a general kernel and has size \(\mathrm{poly}(k\epsilon^{-1})\). Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in \(n\). This result immediately implies new algorithms for kernel \(k\)-Means, such as a \((1+\epsilon)\)-approximation in time near-linear in \(n\), and a streaming algorithm using space and update time \(\mathrm{poly}(k\epsilon^{-1}\log n)\). We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel \(\textsc{k-Means++}\) (the kernelized version of the widely used \(\textsc{k-Means++}\) algorithm), and we further use this faster kernel \(\textsc{k-Means++}\) for spectral clustering. In both applications, we achieve a significant speedup and better asymptotic growth, while the error is comparable to baselines that do not use coresets.
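
To make the guarantee concrete: a coreset here is a small weighted set \(S\) whose weighted clustering cost approximates the cost of the full dataset, for every candidate set of centers simultaneously. The following minimal sketch (Python/NumPy, not code from the paper) illustrates only what is being preserved; for simplicity it uses explicit Euclidean points standing in for \(\varphi(X)\) and a hypothetical uniform sample with weights \(n/|S|\), rather than the paper's importance-sampling construction.

```python
import numpy as np

def weighted_kmeans_cost(P, w, C):
    """Weighted k-Means cost: sum_i w[i] * min_j ||P[i] - C[j]||^2."""
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return float(w @ d2.min(axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))   # explicit points, standing in for phi(X)

# Hypothetical "coreset": a uniform sample reweighted by n/|S|.
# This only illustrates the guarantee checked below; it is NOT the
# paper's poly(k/eps)-size importance-sampling construction.
m = 500
idx = rng.choice(len(X), size=m, replace=False)
S, w = X[idx], np.full(m, len(X) / m)

# The coreset guarantee: for every candidate center set C,
# cost_w(S, C) should be within a (1 +/- eps) factor of cost(X, C).
C = rng.normal(size=(4, 5))        # an arbitrary set of k = 4 centers
full = weighted_kmeans_cost(X, np.ones(len(X)), C)
approx = weighted_kmeans_cost(S, w, C)
print(f"relative error: {abs(approx - full) / full:.4f}")
```

With the paper's construction, \(|S| = \mathrm{poly}(k\epsilon^{-1})\) points suffice to bound this relative error by \(\epsilon\) for all center sets at once.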

Availability of data and material

The datasets that we use in the paper are all downloaded from published sources, and we do not collect new data.

Code availability

Our code is available and may be published; it can also be provided for review upon request before the paper's publication.

Notes

  1. In fact, evaluating \(\Vert c^\star - \varphi(u)\Vert^2\) for a single point \(u \in X\) already requires \(\Theta(n^2)\) kernel accesses, since \(\Vert c^\star - \varphi(u)\Vert^2 = K(u, u) - \frac{2}{n} \sum_{x \in X} K(x, u) + \frac{1}{n^2} \sum_{x,y \in X} K(x, y)\) (see the first sketch after these notes).

  2. Some of these results do not explicitly mention kernel \(k\)-Means, but their method is applicable.

  3. Kumar et al. (2004) developed an FPT-PTAS only for vanilla (i.e., not kernel) \(k\)-Means, but it can be adapted to kernel \(k\)-Means by representing a center as a linear combination of points in \(\varphi(X)\) (see the second sketch after these notes).
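
For concreteness, here is a minimal sketch of the computation in Note 1 (Python/NumPy, assuming a precomputed kernel matrix K; illustrative code, not from the paper). The term \(\frac{1}{n^2}\sum_{x,y \in X} K(x,y)\) reads all \(n^2\) kernel entries, which is exactly the \(\Theta(n^2)\) bottleneck. The sanity check uses a linear kernel, for which \(\varphi\) is the identity map and the distance can be verified directly in input space.

```python
import numpy as np

def dist2_to_mean(K, u):
    """||c* - phi(u)||^2 for c* = (1/n) sum_x phi(x), using only kernel
    values: K[u,u] - (2/n) sum_x K[x,u] + (1/n^2) sum_{x,y} K[x,y].
    K.sum() alone touches all n^2 entries, hence Theta(n^2) accesses."""
    n = K.shape[0]
    return float(K[u, u] - 2.0 / n * K[:, u].sum() + K.sum() / n**2)

# Sanity check with a linear kernel (phi = identity), where the
# distance to the mean can be computed directly in input space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
K = X @ X.T
assert np.isclose(dist2_to_mean(K, 7), np.sum((X.mean(axis=0) - X[7]) ** 2))
```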
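
Likewise for Note 3: a center represented as a linear combination \(c = \sum_{x \in X} \alpha_x \varphi(x)\) admits kernel-only distance evaluations, since \(\Vert c - \varphi(u)\Vert^2 = \alpha^\top K \alpha - 2 (K\alpha)_u + K(u, u)\). A hypothetical sketch under the same assumptions:

```python
import numpy as np

def dist2_to_combo(K, alpha, u):
    """||c - phi(u)||^2 for a center c = sum_x alpha[x] * phi(x),
    expanded via the kernel trick:
    alpha^T K alpha - 2 (K alpha)[u] + K[u, u]."""
    return float(alpha @ K @ alpha - 2.0 * (K @ alpha)[u] + K[u, u])

# Uniform coefficients alpha = 1/n recover the distance to the
# mean c* from Note 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
K = X @ X.T                        # linear kernel, for checking only
alpha = np.full(200, 1.0 / 200)
assert np.isclose(dist2_to_combo(K, alpha, 7),
                  np.sum((X.mean(axis=0) - X[7]) ** 2))
```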

References

  • Arthur, D., & Vassilvitskii, S. (2007). K-means++: the advantages of careful seeding. SODA (pp. 1027-1035). SIAM.

  • Balcan, M., Ehrlich, S., & Liang, Y. (2013). Distributed k-means and k-median clustering on general communication topologies. NIPS (pp. 1995-2003).

  • Barger, A., & Feldman, D. (2020). Deterministic coresets for k-means of big sparse data. Algorithms, 13(4), 92.

  • Becchetti, L., Bury, M., Cohen-Addad, V., Grandoni, F., & Schwiegelshohn, C. (2019). Oblivious dimension reduction for k-means: Beyond subspaces and the Johnson-Lindenstrauss lemma. STOC (pp. 1039-1050). ACM.

  • Braverman, V., Jiang, S. H., Krauthgamer, R., & Wu, X. (2021). Coresets for clustering in excluded-minor graphs and beyond. SODA (pp. 2679-2696). SIAM.

  • Chan, T. -H. H., Guerquin, A., & Sozio, M. (2018). Twitter data set. Retrieved from https://github.com/fe6Bc5R4JvLkFkSeExHM/k-center

  • Chen, D., & Phillips, J. M. (2017). Relative error embeddings of the gaussian kernel distance. ALT (Vol. 76, pp. 560-576). PMLR.

  • Chitta, R., Jin, R., Havens, T. C., & Jain, A. K. (2011). Approximate kernel k-means: Solution to large scale kernel clustering. KDD (pp. 895-903). ACM.

  • Chitta, R., Jin, R., & Jain, A. K. (2012). Efficient kernel clustering using random Fourier features. ICDM (pp. 161-170). IEEE Computer Society.

  • Cohen-Addad, V., Saulpic, D., & Schwiegelshohn, C. (2021). A new coreset framework for clustering. STOC (pp. 169-182). ACM.

  • Czumaj, A., & Sohler, C. (2007). Sublinear-time approximation algorithms for clustering via random sampling. Random Structures and Algorithms, 30(1–2), 226–256. https://doi.org/10.1002/rsa.20157

  • Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: Spectral clustering and normalized cuts. KDD (pp. 551-556). ACM.

  • Ding, C. H. Q., He, X., & Simon, H. D. (2005). Nonnegative Lagrangian relaxation of K-means and spectral clustering. ECML (Vol. 3720, pp. 530-538). Springer.

  • Dua, D., & Graff, C. (2017). UCI machine learning repository, adult dataset. Retrieved from https://archive.ics.uci.edu/ml/datasets/adult

  • Feldman, D., & Langberg, M. (2011). A unified framework for approximating and clustering data. STOC (pp. 569-578). ACM.

  • Feldman, D., Monemizadeh, M., & Sohler, C. (2007). A PTAS for k-means clustering based on weak coresets. SoCG (pp. 11-18). ACM.

  • Feldman, D., Schmidt, M., & Sohler, C. (2020). Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3), 601–657.

  • Feng, Z., Kacham, P., & Woodruff, D. P. (2021). Dimensionality reduction for the sum-of-distances metric. ICML (Vol. 139, pp. 3220-3229). PMLR.

  • Ghojogh, B., Ghodsi, A., Karray, F., & Crowley, M. (2021). Reproducing kernel Hilbert space, Mercer’s theorem, eigenfunctions, Nyström method, and use of kernels in machine learning: Tutorial and survey. CoRR, abs/2106.08443 .

  • Girolami, M. A. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3), 780–784.

  • Har-Peled, S., & Mazumdar, S. (2004). On coresets for k-means and k-median clustering. STOC (pp. 291-300). ACM.

  • Henzinger, M., & Kale, S. (2020). Fully-dynamic coresets. ESA (Vol. 173, pp.57:1-57:21). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

  • Huang, L., & Vishnoi, N. K. (2020). Coresets for clustering in Euclidean spaces: Importance sampling is nearly optimal. STOC (pp. 1416-1429). ACM.

  • Joshi, S. C., Kommaraju, R. V., Phillips, J. M., & Venkatasubramanian, S. (2011). Comparing distributions and shapes using the kernel distance. SoCG (pp. 47-56). ACM.

  • Kim, D., Lee, K. Y., Lee, D., & Lee, K. H. (2005). Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognition, 38(4), 607–611.

  • Kumar, A., Sabharwal, Y., & Sen, S. (2004). A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. FOCS (pp. 454-462). IEEE Computer Society.

  • Langberg, M., & Schulman, L. J. (2010). Universal epsilon-approximators for integrals. SODA (pp. 598-607). SIAM.

  • Meek, C., Thiesson, B., & Heckerman, D. (1990). UCI machine learning repository, census1990 dataset. Retrieved from http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)

  • Moro, S., Cortez, P., & Rita, P. (2014). UCI machine learning repository, bank dataset. Retrieved from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

  • Musco, C., & Musco, C. (2017). Recursive sampling for the Nyström method. NIPS (pp. 3833-3845).

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. NIPS (pp. 1177-1184). Curran Associates, Inc.

  • Ren, Y., & Du, Y. (2020). Uniform and non-uniform sampling methods for sub-linear time k-means clustering. ICPR (pp. 7775-7781). IEEE.

  • Schmidt, M. (2014). Coresets and streaming algorithms for the k-means problem and related clustering objectives (Unpublished doctoral dissertation).

  • Schölkopf, B., Smola, A. J., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.

  • Sohler, C., & Woodruff, D. P. (2018). Strong coresets for k-median and subspace approximation: Goodbye dimension. FOCS (pp. 802-813). IEEE Computer Society.

  • Wang, S., Gittens, A., & Mahoney, M. W. (2019). Scalable kernel k-means clustering with Nyström approximation: Relative-error bounds. Journal of Machine Learning Research, 20(12), 1–49.

  • Zhang, R., & Rudnicky, A. I. (2002). A large scale clustering scheme for kernel k-means. ICPR (4) (pp. 289-292). IEEE Computer Society.

Funding

This work is partially supported by ONR Award N00014-18-1-2364, Israel Science Foundation grant #1336/23, a Weizmann-UK Making Connections Grant, a Minerva Foundation grant, the Weizmann Data Science Research Center, a research grant from the Estate of Harry Schutzman, a startup fund from Peking University, and the Advanced Institute of Information Technology, Peking University.

Author information

Contributions

All authors contributed equally to this paper.

Corresponding author

Correspondence to Shaofeng H. -C. Jiang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare, beyond standard cases such as close and recent collaborators, doctoral advisors, etc.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Xiaoli Fern.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, S.H.C., Krauthgamer, R., Lou, J. et al. Coresets for kernel clustering. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06540-z
