Abstract
Spectral clustering is a popular and effective clustering method, but it is known to face two significant challenges: scalability and out-of-sample extension. In this paper, we extend the work of Chen (ICPR 2018) on scalable spectral clustering with cosine similarity to handle massive or online data that are too large to be fully loaded into computer memory. We start with a small batch of data drawn from the full set and develop an efficient procedure that learns both the nonlinear embedding and the clustering map from this sample and extends them easily to the rest of the data as they are gradually loaded. We then introduce an automatic approach to selecting the optimal sample size. Combining the two steps yields a streamlined, memory-efficient algorithm that uses only a small number of batches of data (as they become available), with memory and computational costs that are independent of the size of the data. Experiments on benchmark data sets demonstrate the fast speed and excellent accuracy of the proposed algorithm. We conclude the paper by pointing out several future research directions.
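As a sketch of the key computational idea behind spectral clustering with cosine similarity (not the authors' exact algorithm): with row-normalized data, the similarity matrix is W = X̃X̃ᵀ, so the degrees and the top eigenvectors of the normalized affinity D^{-1/2}WD^{-1/2} can be obtained from matrix-vector products and a thin SVD of D^{-1/2}X̃, without ever forming the n×n matrix. A minimal NumPy sketch, assuming dense data and keeping self-similarities on the diagonal (the ICPR 2018 algorithm removes them); all function names are illustrative:

```python
import numpy as np

def spectral_embedding_cosine(X, k):
    """Spectral embedding under cosine similarity via a thin SVD of the
    degree-normalized data matrix, avoiding the n x n similarity matrix."""
    # Row-normalize so that inner products equal cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Degrees d_i = sum_j <x_i, x_j> = <x_i, sum_j x_j>, computed in O(nd).
    d = Xn @ Xn.sum(axis=0)
    # Left singular vectors of D^{-1/2} Xn are eigenvectors of the
    # normalized affinity D^{-1/2} W D^{-1/2} with W = Xn Xn^T.
    A = Xn / np.sqrt(d)[:, None]
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    E = U[:, :k]
    # Row-normalize the embedding (as in Ng-Jordan-Weiss) before k-means.
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def kmeans(E, k, iters=50):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    C = [E[0]]
    for _ in range(1, k):
        dist = np.min(((E[:, None] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(E[np.argmax(dist)])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((E[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = E[labels == j].mean(axis=0)
    return labels

def spectral_clustering_cosine(X, k):
    return kmeans(spectral_embedding_cosine(X, k), k)
```

Because only X̃ (n×d) and its thin SVD are touched, the cost is linear in n, which is what makes the sample-and-extend strategy of the paper feasible.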
The authors thank the anonymous reviewers for careful reviews and useful feedback.
Notes
- 1. When both conditions are violated, one can apply principal component analysis (PCA) to reduce the dimensionality of the data such that the first condition is met.
- 2. To compute this percentage, we need to find the best map between the output labels and the original labels. This is done by using the Kuhn-Munkres algorithm as in [1].
- 3.
- 4. Available at http://qwone.com/~jason/20Newsgroups/; we also used the bydate version.
- 5.
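The label-matching step described in Note 2 can be carried out with SciPy's `linear_sum_assignment`, which implements a Kuhn-Munkres-type solver. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy between predicted and true cluster labels.

    Builds the contingency table of (predicted label, true label) counts
    and solves the assignment problem to find the label permutation that
    maximizes agreement, then returns the fraction of matched points.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    t = np.unique(true_labels)
    p = np.unique(pred_labels)
    counts = np.zeros((len(p), len(t)), dtype=int)
    for i, pi in enumerate(p):
        for j, tj in enumerate(t):
            counts[i, j] = np.sum((pred_labels == pi) & (true_labels == tj))
    # linear_sum_assignment minimizes cost, so negate to maximize matches.
    row, col = linear_sum_assignment(-counts)
    return counts[row, col].sum() / len(true_labels)
```

For example, predictions that merely swap the two label names score 1.0, since the optimal assignment undoes the swap.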
References
Cai, D., Chen, X.: Large scale spectral clustering via landmark-based sparse representation. IEEE Trans. Cybern. 45(8), 1669–1680 (2015)
Chen, G.: Scalable spectral clustering with cosine similarity. In: Proceedings of the 24th International Conference on Pattern Recognition (ICPR), Beijing, China (2018)
Chen, G.: A general framework for scalable spectral clustering based on document models. Pattern Recogn. Lett. 125, 488–493 (2019)
Chen, G., Lerman, G.: Foundations of a multi-way spectral clustering framework for hybrid linear modeling. Found. Comput. Math. (2009). https://doi.org/10.1007/s10208-009-9043-7
Choromanska, A., Jebara, T., Kim, H., Mohan, M., Monteleoni, C.: Fast spectral clustering via the Nyström method. In: Jain, S., Munos, R., Stephan, F., Zeugmann, T. (eds.) Algorithmic Learning Theory, pp. 367–381. Springer, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40935-6_26
Everitt, B.S., Landau, S., Leese, M., Stahl, D.: Dissimilarity and distance measures for continuous data, pp. 51–52. Wiley, Boston, MA (2011)
Fowlkes, C., Belongie, S., Chung, F., Malik, J.: Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 214–225 (2004)
Huang, D., Wang, C.D., Wu, J.S., Lai, J., Kwoh, C.K.: Ultra-scalable spectral clustering and ensemble clustering. IEEE Trans. Knowl. Data Eng. (TKDE) 32, 1212–1226 (2020)
Li, M., Lian, X.C., Kwok, J.T., Lu, B.L.: Time and space efficient spectral clustering via column sampling. In: CVPR 2011, pp. 2297–2304 (2011). https://doi.org/10.1109/CVPR.2011.5995425
Meila, M., Shi, J.: A random walks view of spectral segmentation. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (2001)
Moazzen, Y., Tasdemir, K.: Sampling based approximate spectral clustering ensemble for partitioning data sets. In: Proceedings of the 23rd International Conference on Pattern Recognition (2016)
Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems 14, pp. 849–856 (2001)
Pham, K., Chen, G.: Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs. In: Proceedings of the 12th Workshop on Graph-based Natural Language Processing (TextGraphs-12), pp. 28–37. Association for Computational Linguistics (2018)
Sakai, T., Imiya, A.: Fast spectral clustering with random projection and sampling. In: Perner, P. (ed.) MLDM 2009. LNCS (LNAI), vol. 5632, pp. 372–384. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03070-3_28
Shaham, U., Stanton, K., Li, H., Basri, R., Nadler, B., Kluger, Y.: Spectralnet: spectral clustering using deep neural networks. In: International Conference on Learning Representations (2018)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Tasdemir, K.: Vector quantization based approximate spectral clustering of large datasets. Pattern Recogn. 45(8), 3034–3044 (2012)
Wang, L., Leckie, C., Kotagiri, R., Bezdek, J.: Approximate pairwise clustering for large data sets via sampling plus extension. Pattern Recogn. 44, 222–235 (2011)
Wang, L., Leckie, C., Ramamohanarao, K., Bezdek, J.: Approximate spectral clustering. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 134–146. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01307-2_15
Yan, D., Huang, L., Jordan, M.: Fast approximate spectral clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 907–916 (2009)
Copyright information
© 2024 Springer Nature Switzerland AG
Cite this paper
Li, R., Chen, G. (2024). Fast, Memory-Efficient Spectral Clustering with Cosine Similarity. In: Vasconcelos, V., Domingues, I., Paredes, S. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2023. Lecture Notes in Computer Science, vol 14469. Springer, Cham. https://doi.org/10.1007/978-3-031-49018-7_50
Print ISBN: 978-3-031-49017-0
Online ISBN: 978-3-031-49018-7