Exploring overlapping clusters using dynamic re-scaling and sampling

  • Regular Paper
Knowledge and Information Systems

Abstract

Until recently, the aim of most text-mining work has been to understand major topics and clusters. Minor topics and clusters have been relatively neglected, even though they may carry important information on rare events. We present a novel method for exploring overlapping clusters of heterogeneous sizes, which is based on vector space modeling, covariance matrix analysis, random sampling, and dynamic re-weighting of document vectors in massive databases. Our system addresses a combination of difficult issues in database analysis: synonymy and polysemy, identification of minor clusters, accommodation of cluster overlap, automatic labeling of clusters based on their document contents, and a user-controlled trade-off between speed of computation and quality of results. To demonstrate the effectiveness of our system, we conducted implementation studies with news articles from the Reuters and LA Times TREC data sets and with artificially generated data that has a known cluster structure.
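The abstract names the ingredients of the method but not its details, so the following is only a generic illustration of how those ingredients can fit together, not the authors' algorithm. It draws a random sample of document vectors, finds the leading direction of the sample covariance matrix by power iteration, then removes that direction from every sampled vector so that the next pass can surface a weaker pattern. All function names, the toy corpus, the power-iteration solver, and the projection-based deflation step (a stand-in for the paper's dynamic re-scaling rule, which this page does not describe) are assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def covariance(rows):
    """Covariance matrix of the rows (documents) over the columns (terms)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
            for i in range(d)]


def leading_direction(mat, rng, iters=300):
    """Unit-length leading eigenvector of a symmetric matrix (power iteration)."""
    v = [rng.gauss(0.0, 1.0) for _ in mat]
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [dot(row, v) for row in mat]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:  # nothing left to find
            break
        v = [x / norm for x in w]
    return v


def deflate(rows, v):
    """ASSUMPTION: projection-based stand-in for re-scaling.
    Removes each vector's component along the direction v."""
    return [[x - dot(r, v) * vi for x, vi in zip(r, v)] for r in rows]


def explore(docs, sample_size, n_directions, seed=0):
    """Sample documents, then alternate covariance analysis and deflation."""
    rng = random.Random(seed)
    rows = rng.sample(docs, min(sample_size, len(docs)))
    directions = []
    for _ in range(n_directions):
        v = leading_direction(covariance(rows), rng)
        directions.append(v)
        rows = deflate(rows, v)
    return directions


def toy_corpus(seed=1):
    """Hypothetical data: 40 documents on terms 0-1, 4 documents on terms 2-3."""
    rng = random.Random(seed)

    def noise():
        return rng.uniform(0.0, 0.1)

    major = [[1 + noise(), 1 + noise(), noise(), noise()] for _ in range(40)]
    minor = [[noise(), noise(), 1 + noise(), 1 + noise()] for _ in range(4)]
    return major + minor


directions = explore(toy_corpus(), sample_size=30, n_directions=2)
for v in directions:
    print([round(x, 3) for x in v])
```

The `sample_size` argument is where a speed-versus-quality trade-off would show up in a sketch like this: a smaller sample makes each covariance computation cheaper but increases the chance that documents from a small cluster are missed entirely.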

Author information

Corresponding author

Correspondence to Mei Kobayashi.

Additional information

Mei Kobayashi received a Bachelor's degree in Chemistry from Princeton and Master's and Ph.D. degrees in Pure and Applied Mathematics from UC Berkeley. She was a student intern in Frick Chemical Laboratory at Princeton, the Biochemical and Math-Physics divisions of Lawrence Berkeley Laboratories, and IBM Research. She has been a Researcher at IBM since 1988 and has been involved in projects ranging from inverse problems, airflow simulation and graphics to speech signal analysis using wavelets. Her most recent work has been on information retrieval, data mining, and unstructured information management. She has served on the Editorial Board of the Bulletin of Japan SIAM and Technical Program Committees of the SIAM Data Mining Conference, SIAM Text Mining Workshops, and Symposiums on Wavelets sponsored by the Japanese Ministry of Education. From 1996 to 1999, she was a Visiting Associate Professor at the Graduate School for Mathematical Sciences of the University of Tokyo.

Masaki Aono received Bachelor's and Master of Science degrees in Information Science from the University of Tokyo and a Ph.D. in Computer Science from Rensselaer Polytechnic Institute. He worked for IBM Research, Tokyo Research Laboratory from 1984 to 2003. He is currently a Professor in the Information and Computer Sciences Department at the Toyohashi University of Technology, where he teaches object-oriented programming, logic circuits, computer architecture, and knowledge data engineering. His current research interests include text and data mining, information extraction, semantic web, and information visualization. His most recent work, on time series data mining from human body bio-signals obtained by microsensors, has been selected to be part of the 21st Century Center of Excellence Program sponsored by the Japanese government. He has been a Japanese delegate of the ISO/IEC JTC1 SC24 Standard Committee since 1996.

About this article

Cite this article

Kobayashi, M., Aono, M. Exploring overlapping clusters using dynamic re-scaling and sampling. Knowl Inf Syst 10, 295–313 (2006). https://doi.org/10.1007/s10115-006-0005-y

  • DOI: https://doi.org/10.1007/s10115-006-0005-y
