Information-Theoretic K-means for Text Clustering

Advances in K-means Clustering

Part of the book series: Springer Theses

Abstract

Information-theoretic clustering uses information-theoretic measures as clustering criteria. A common approach is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While research on Info-Kmeans has shown promising results, handling high-dimensional sparse data such as text corpora remains a challenge. For high-dimensional text vectors, the centroids may contain many zero-value features, which lead to infinite KL-divergence values and create a dilemma when assigning objects to centroids during the iterations of Info-Kmeans. To meet this challenge, this chapter proposes a Summation-based Incremental Learning (SAIL) algorithm for Info-Kmeans clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of Shannon entropy, which avoids the zero-value dilemma. To improve the clustering quality, we further introduce the Variable Neighborhood Search (VNS) meta-heuristic and propose the V-SAIL algorithm, which is then accelerated by a multithreading scheme in PV-SAIL. Experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. V-SAIL and PV-SAIL further improve the clustering quality at a low computational cost.
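
The zero-value dilemma and the entropy-based reformulation described in the abstract can be illustrated with a minimal sketch. It assumes that documents are represented as normalized term-frequency vectors and that a centroid is the weighted mean of its members; the helper names (`kl_divergence`, `entropy`, `centroid`) and the toy vectors are hypothetical, and this is not the chapter's SAIL implementation. The sketch only checks the standard identity that, within one cluster, the weighted sum of KL-divergences to the centroid equals the total weight times the centroid entropy minus the weighted sum of member entropies.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in nats; math.inf if q is zero where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf  # the zero-value dilemma
        total += pi * math.log(pi / qi)
    return total

def entropy(p):
    """Shannon entropy in nats; zero entries contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def centroid(docs, weights):
    """Weighted mean of the member distributions (assumed centroid form)."""
    total = sum(weights)
    dim = len(docs[0])
    return [sum(w * d[j] for d, w in zip(docs, weights)) / total
            for j in range(dim)]

# Toy sparse "documents" over a four-term vocabulary (illustrative only).
doc_a = [0.5, 0.5, 0.0, 0.0]
doc_b = [0.25, 0.75, 0.0, 0.0]
doc_c = [0.0, 0.0, 0.5, 0.5]  # shares no terms with doc_a or doc_b

members = [doc_a, doc_b]
weights = [1.0, 1.0]
m = centroid(members, weights)

# Assigning doc_c by KL-divergence fails: the centroid is zero on its terms.
outside_kl = kl_divergence(doc_c, m)

# Within the cluster, the KL-based objective has an entropy-only form.
kl_objective = sum(w * kl_divergence(d, m) for d, w in zip(members, weights))
entropy_objective = (sum(weights) * entropy(m)
                     - sum(w * entropy(d) for d, w in zip(members, weights)))

print(outside_kl, kl_objective, entropy_objective)
```

The entropy form never divides by a centroid entry, so zero-value features simply contribute nothing to the sum. This is the property the abstract attributes to the equivalent objective function.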

Notes

  1. http://www.trec.nist.gov

  2. http://www.daviddlewis.com/resources/testcollections/reuters21578/

  3. http://www.mathworks.cn/help/toolbox/stats/kmeans.html

  4. http://www.cs.utexas.edu/users/dml/Software/cocluster.html

  5. http://www.glaros.dtc.umn.edu/gkhome/views/cluto

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Wu, J. (2012). Information-Theoretic K-means for Text Clustering. In: Advances in K-means Clustering. Springer Theses. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29807-3_4

  • DOI: https://doi.org/10.1007/978-3-642-29807-3_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29806-6

  • Online ISBN: 978-3-642-29807-3

  • eBook Packages: Computer Science, Computer Science (R0)
