Coresets for kernel clustering

Published in: Machine Learning

Abstract

We devise coresets for kernel \(k\)-Means with a general kernel, and use them to obtain new, more efficient algorithms. Kernel \(k\)-Means has superior clustering capability compared to classical \(k\)-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel \(k\)-Means that works for a general kernel and has size \(\mathrm{poly}(k\epsilon^{-1})\). Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in \(n\). This result immediately implies new algorithms for kernel \(k\)-Means, such as a \((1+\epsilon)\)-approximation in time near-linear in \(n\), and a streaming algorithm using space and update time \(\mathrm{poly}(k\epsilon^{-1}\log n)\). We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel \(\textsc{k-Means++}\) (the kernelized version of the widely used \(\textsc{k-Means++}\) algorithm), and we further use this faster kernel \(\textsc{k-Means++}\) for spectral clustering. In both applications, we achieve a significant speedup and better asymptotic growth, while the error is comparable to baselines that do not use coresets.
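
To make the guarantee concrete: a coreset here is a small weighted set \(S\) whose weighted clustering cost approximates the cost of the full dataset, for every candidate set of centers simultaneously. The following minimal sketch (Python/NumPy, not code from the paper) illustrates only what is being preserved; for simplicity it uses explicit Euclidean points standing in for \(\varphi(X)\) and a hypothetical uniform sample with weights \(n/|S|\), rather than the paper's importance-sampling construction.

```python
import numpy as np

def weighted_kmeans_cost(P, w, C):
    """Weighted k-Means cost: sum_i w[i] * min_j ||P[i] - C[j]||^2."""
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return float(w @ d2.min(axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))   # explicit points, standing in for phi(X)

# Hypothetical "coreset": a uniform sample reweighted by n/|S|.
# This only illustrates the guarantee checked below; it is NOT the
# paper's poly(k/eps)-size importance-sampling construction.
m = 500
idx = rng.choice(len(X), size=m, replace=False)
S, w = X[idx], np.full(m, len(X) / m)

# The coreset guarantee: for every candidate center set C,
# cost_w(S, C) should be within a (1 +/- eps) factor of cost(X, C).
C = rng.normal(size=(4, 5))        # an arbitrary set of k = 4 centers
full = weighted_kmeans_cost(X, np.ones(len(X)), C)
approx = weighted_kmeans_cost(S, w, C)
print(f"relative error: {abs(approx - full) / full:.4f}")
```

With the paper's construction, \(|S| = \mathrm{poly}(k\epsilon^{-1})\) points suffice to bound this relative error by \(\epsilon\) for all center sets at once.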

Availability of data and material

The datasets that we use in the paper are all downloaded from published sources, and we do not collect new data.

Code availability

Our code is available and may be published; it can also be provided for review upon request before the paper's publication.

Notes

  1. In fact, evaluating \(\Vert c^\star - \varphi(u)\Vert^2\) for a single point \(u \in X\) already requires \(\Theta(n^2)\) kernel accesses, since \(\Vert c^\star - \varphi(u)\Vert^2 = K(u, u) - \frac{2}{n} \sum_{x \in X} K(x, u) + \frac{1}{n^2} \sum_{x,y \in X} K(x, y)\) (see the first sketch after these notes).

  2. Some of these results do not explicitly mention kernel \(k\)-Means, but their method is applicable.

  3. Kumar et al. (2004) developed an FPT-PTAS only for vanilla (i.e., not kernel) \(k\)-Means, but it can be adapted to kernel \(k\)-Means by representing a center as a linear combination of points in \(\varphi(X)\) (see the second sketch after these notes).
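
For concreteness, here is a minimal sketch of the computation in Note 1 (Python/NumPy, assuming a precomputed kernel matrix K; illustrative code, not from the paper). The term \(\frac{1}{n^2}\sum_{x,y \in X} K(x,y)\) reads all \(n^2\) kernel entries, which is exactly the \(\Theta(n^2)\) bottleneck. The sanity check uses a linear kernel, for which \(\varphi\) is the identity map and the distance can be verified directly in input space.

```python
import numpy as np

def dist2_to_mean(K, u):
    """||c* - phi(u)||^2 for c* = (1/n) sum_x phi(x), using only kernel
    values: K[u,u] - (2/n) sum_x K[x,u] + (1/n^2) sum_{x,y} K[x,y].
    K.sum() alone touches all n^2 entries, hence Theta(n^2) accesses."""
    n = K.shape[0]
    return float(K[u, u] - 2.0 / n * K[:, u].sum() + K.sum() / n**2)

# Sanity check with a linear kernel (phi = identity), where the
# distance to the mean can be computed directly in input space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
K = X @ X.T
assert np.isclose(dist2_to_mean(K, 7), np.sum((X.mean(axis=0) - X[7]) ** 2))
```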
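
Likewise for Note 3: a center represented as a linear combination \(c = \sum_{x \in X} \alpha_x \varphi(x)\) admits kernel-only distance evaluations, since \(\Vert c - \varphi(u)\Vert^2 = \alpha^\top K \alpha - 2 (K\alpha)_u + K(u, u)\). A hypothetical sketch under the same assumptions:

```python
import numpy as np

def dist2_to_combo(K, alpha, u):
    """||c - phi(u)||^2 for a center c = sum_x alpha[x] * phi(x),
    expanded via the kernel trick:
    alpha^T K alpha - 2 (K alpha)[u] + K[u, u]."""
    return float(alpha @ K @ alpha - 2.0 * (K @ alpha)[u] + K[u, u])

# Uniform coefficients alpha = 1/n recover the distance to the
# mean c* from Note 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
K = X @ X.T                        # linear kernel, for checking only
alpha = np.full(200, 1.0 / 200)
assert np.isclose(dist2_to_combo(K, alpha, 7),
                  np.sum((X.mean(axis=0) - X[7]) ** 2))
```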

References

  • Arthur, D., & Vassilvitskii, S. (2007). K-means++: the advantages of careful seeding. SODA (pp. 1027-1035). SIAM.

  • Balcan, M., Ehrlich, S., & Liang, Y. (2013). Distributed k-means and k-median clustering on general communication topologies. NIPS (pp. 1995-2003).

  • Barger, A., & Feldman, D. (2020). Deterministic coresets for k-means of big sparse data. Algorithms, 13(4), 92.

  • Becchetti, L., Bury, M., Cohen-Addad, V., Grandoni, F., & Schwiegelshohn, C. (2019). Oblivious dimension reduction for k-means: Beyond subspaces and the Johnson-Lindenstrauss lemma. STOC (pp. 1039-1050). ACM.

  • Braverman, V., Jiang, S. H., Krauthgamer, R., & Wu, X. (2021). Coresets for clustering in excluded-minor graphs and beyond. SODA (pp. 2679-2696). SIAM.

  • Chan, T. -H. H., Guerquin, A., & Sozio, M. (2018). Twitter data set. Retrieved from https://github.com/fe6Bc5R4JvLkFkSeExHM/k-center

  • Chen, D., & Phillips, J. M. (2017). Relative error embeddings of the gaussian kernel distance. ALT (Vol. 76, pp. 560-576). PMLR.

  • Chitta, R., Jin, R., Havens, T. C., & Jain, A. K. (2011). Approximate kernel k-means: Solution to large scale kernel clustering. KDD (pp. 895-903). ACM.

  • Chitta, R., Jin, R., & Jain, A. K. (2012). Efficient kernel clustering using random Fourier features. ICDM (pp. 161-170). IEEE Computer Society.

  • Cohen-Addad, V., Saulpic, D., & Schwiegelshohn, C. (2021). A new coreset framework for clustering. STOC (pp. 169-182). ACM.

  • Czumaj, A., & Sohler, C. (2007). Sublinear-time approximation algorithms for clustering via random sampling. Random Structures and Algorithms, 30(1–2), 226–256. https://doi.org/10.1002/rsa.20157

  • Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: Spectral clustering and normalized cuts. KDD (pp. 551-556). ACM.

  • Ding, C. H. Q., He, X., & Simon, H. D. (2005). Nonnegative Lagrangian relaxation of K-means and spectral clustering. ECML (Vol. 3720, pp. 530-538). Springer.

  • Dua, D., & Graff, C. (2017). UCI machine learning repository, adult dataset. Retrieved from https://archive.ics.uci.edu/ml/datasets/adult

  • Feldman, D., & Langberg, M. (2011). A unified framework for approximating and clustering data. STOC (pp. 569-578). ACM.

  • Feldman, D., Monemizadeh, M., & Sohler, C. (2007). A PTAS for k-means clustering based on weak coresets. SoCG (pp. 11-18). ACM.

  • Feldman, D., Schmidt, M., & Sohler, C. (2020). Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3), 601–657.

  • Feng, Z., Kacham, P., & Woodruff, D. P. (2021). Dimensionality reduction for the sum-of-distances metric. ICML (Vol. 139, pp. 3220-3229). PMLR.

  • Ghojogh, B., Ghodsi, A., Karray, F., & Crowley, M. (2021). Reproducing kernel Hilbert space, Mercer’s theorem, eigenfunctions, Nyström method, and use of kernels in machine learning: Tutorial and survey. CoRR, abs/2106.08443 .

  • Girolami, M. A. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3), 780–784.

  • Har-Peled, S., & Mazumdar, S. (2004). On coresets for k-means and k-median clustering. STOC (pp. 291-300). ACM.

  • Henzinger, M., & Kale, S. (2020). Fully-dynamic coresets. ESA (Vol. 173, pp.57:1-57:21). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

  • Huang, L., & Vishnoi, N. K. (2020). Coresets for clustering in Euclidean spaces: Importance sampling is nearly optimal. STOC (pp. 1416-1429). ACM.

  • Joshi, S. C., Kommaraju, R. V., Phillips, J. M., & Venkatasubramanian, S. (2011). Comparing distributions and shapes using the kernel distance. SoCG (pp. 47-56). ACM.

  • Kim, D., Lee, K. Y., Lee, D., & Lee, K. H. (2005). Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognition, 38(4), 607–611.

  • Kumar, A., Sabharwal, Y., & Sen, S. (2004). A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. FOCS (pp. 454-462). IEEE Computer Society.

  • Langberg, M., & Schulman, L. J. (2010). Universal epsilon-approximators for integrals. SODA (pp. 598-607). SIAM.

  • Meek, C., Thiesson, B., & Heckerman, D. (1990). UCI machine learning repository, census1990 dataset. Retrieved from http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)

  • Moro, S., Cortez, P., & Rita, P. (2014). UCI machine learning repository, bank dataset. Retrieved from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

  • Musco, C., & Musco, C. (2017). Recursive sampling for the Nyström method. NIPS (pp. 3833-3845).

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. NIPS (pp. 1177-1184). Curran Associates, Inc.

  • Ren, Y., & Du, Y. (2020). Uniform and non-uniform sampling methods for sub-linear time k-means clustering. ICPR (pp. 7775-7781). IEEE.

  • Schmidt, M. (2014). Coresets and streaming algorithms for the k-means problem and related clustering objectives (Unpublished doctoral dissertation).

  • Schölkopf, B., Smola, A. J., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.

  • Sohler, C., & Woodruff, D. P. (2018). Strong coresets for k-median and subspace approximation: Goodbye dimension. FOCS (pp. 802-813). IEEE Computer Society.

  • Wang, S., Gittens, A., & Mahoney, M. W. (2019). Scalable kernel k-means clustering with Nyström approximation: Relative-error bounds. Journal of Machine Learning Research, 20(12), 1–49.

  • Zhang, R., & Rudnicky, A. I. (2002). A large scale clustering scheme for kernel k-means. ICPR (4) (pp. 289-292). IEEE Computer Society.

Funding

This work is partially supported by ONR Award N00014-18-1-2364, Israel Science Foundation grant #1336/23, a Weizmann-UK Making Connections Grant, a Minerva Foundation grant, the Weizmann Data Science Research Center, a research grant from the Estate of Harry Schutzman, a startup fund from Peking University, and the Advanced Institute of Information Technology, Peking University.

Author information

Contributions

All authors contributed equally to this paper.

Corresponding author

Correspondence to Shaofeng H. -C. Jiang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare, beyond standard cases such as close and recent collaborators, doctoral advisors, etc.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Xiaoli Fern.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, S.H.C., Krauthgamer, R., Lou, J. et al. Coresets for kernel clustering. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06540-z
