Exploring overlapping clusters using dynamic re-scaling and sampling

Kobayashi, Mei; Aono, Masaki

doi:10.1007/s10115-006-0005-y

Regular Paper
Published: 30 March 2006

Volume 10, pages 295–313, (2006)
Knowledge and Information Systems

Mei Kobayashi¹ &
Masaki Aono²

Abstract

Until recently, the aim of most text-mining work has been to understand major topics and clusters. Minor topics and clusters have been relatively neglected even though they may represent important information on rare events. We present a novel method for exploring overlapping clusters of heterogeneous sizes, which is based on vector space modeling, covariance matrix analysis, random sampling, and dynamic re-weighting of document vectors in massive databases. Our system addresses a combination of difficult issues in database analysis, such as synonymy and polysemy, identification of minor clusters, accommodation of cluster overlap, automatic labeling of clusters based on their document contents, and the user-controlled trade-off between speed of computation and quality of results. We conducted implementation studies with news articles from the Reuters and LA Times TREC data sets and artificially generated data with a known cluster structure to demonstrate the effectiveness of our system.
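The abstract names four ingredients: vector space modeling, covariance matrix analysis, random sampling, and dynamic re-weighting of document vectors. The full algorithm is not given in this preview, so the following is only a rough sketch of how such ingredients might fit together, not the authors' method. The function name `explore_directions`, the parameters `sample_size` and `damping`, and the projection-based down-weighting rule are all assumptions made for illustration.

```python
import numpy as np


def explore_directions(docs, n_rounds=3, sample_size=None, damping=1.0, seed=0):
    """Return one dominant direction per round (hypothetical sketch).

    docs        : (n_docs, n_terms) document-term matrix (vector space model)
    sample_size : number of documents drawn per round; None uses all of them
    damping     : fraction of each found direction removed from every document
                  before the next round (assumed re-weighting rule)
    """
    rng = np.random.default_rng(seed)
    residual = np.asarray(docs, dtype=float).copy()
    directions = []
    for _ in range(n_rounds):
        n_docs = residual.shape[0]
        k = n_docs if sample_size is None else min(sample_size, n_docs)
        # Random sampling: estimate the covariance from a subset of documents.
        idx = rng.choice(n_docs, size=k, replace=False)
        cov = np.cov(residual[idx], rowvar=False)
        # eigh returns eigenvalues in ascending order, so the last column
        # of the eigenvector matrix is the direction of largest variance.
        _, eigvecs = np.linalg.eigh(cov)
        v = eigvecs[:, -1]
        directions.append(v)
        # Re-weighting: shrink every document along the direction just found,
        # so that weaker (minor) structure can dominate the next round.
        residual = residual - damping * np.outer(residual @ v, v)
    return np.array(directions)


if __name__ == "__main__":
    toy = np.random.default_rng(42).random((100, 8))  # synthetic stand-in data
    print(explore_directions(toy, n_rounds=3, sample_size=30).shape)
```

In this sketch a smaller `sample_size` lowers the cost of each covariance estimate at the price of a noisier direction, which is one way to picture the speed-versus-quality trade-off mentioned in the abstract. With `damping=1.0` each found direction is removed completely, so successive directions come out mutually orthogonal.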

Author information

Authors and Affiliations

IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken, 242 8502, Japan
Mei Kobayashi
School of Computing and Information and Computer Sciences, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi-shi, Aichi, 441 8580, Japan
Masaki Aono

Corresponding author

Correspondence to Mei Kobayashi.

Additional information

Mei Kobayashi received a Bachelor's degree in Chemistry from Princeton and Master's and Ph.D. degrees in Pure and Applied Mathematics from UC Berkeley. She was a student intern in Frick Chemical Laboratory at Princeton, the Biochemical and Math-Physics divisions of Lawrence Berkeley Laboratories, and IBM Research. She has been a Researcher at IBM since 1988 and has been involved in projects ranging from inverse problems, airflow simulation and graphics to speech signal analysis using wavelets. Her most recent work has been on information retrieval, data mining, and unstructured information management. She has served on the Editorial Board of the Bulletin of Japan SIAM and Technical Program Committees of the SIAM Data Mining Conference, SIAM Text Mining Workshops, and Symposiums on Wavelets sponsored by the Japanese Ministry of Education. From 1996 to 1999, she was a Visiting Associate Professor at the Graduate School for Mathematical Sciences of the University of Tokyo.

Masaki Aono received Bachelor of Science and Master of Science degrees in Information Science from the University of Tokyo and a Ph.D. in Computer Science from Rensselaer Polytechnic Institute. He worked for IBM Research, Tokyo Research Laboratory from 1984 to 2003. He is currently a Professor in the Information and Computer Sciences Department at the Toyohashi University of Technology, where he teaches object-oriented programming, logic circuits, computer architecture, and knowledge data engineering. His current research interests include text and data mining, information extraction, semantic web, and information visualization. His most recent work, on time series data mining from human body bio-signals obtained by microsensors, was selected to be part of the 21st Century Center of Excellence Program sponsored by the Japanese government. He has been a Japanese delegate of the ISO/IEC JTC1 SC24 Standard Committee since 1996.

About this article

Cite this article

Kobayashi, M., Aono, M. Exploring overlapping clusters using dynamic re-scaling and sampling. Knowl Inf Syst 10, 295–313 (2006). https://doi.org/10.1007/s10115-006-0005-y

Received: 12 January 2005
Revised: 20 June 2005
Accepted: 15 December 2005
Published: 30 March 2006
Issue Date: October 2006
DOI: https://doi.org/10.1007/s10115-006-0005-y
