Abstract
Until recently, the aim of most text-mining work has been to understand major topics and clusters. Minor topics and clusters have been relatively neglected even though they may represent important information on rare events. We present a novel method for exploring overlapping clusters of heterogeneous sizes, which is based on vector space modeling, covariance matrix analysis, random sampling, and dynamic re-weighting of document vectors in massive databases. Our system addresses a combination of difficult issues in database analysis, such as synonymy and polysemy, identification of minor clusters, accommodation of cluster overlap, automatic labeling of clusters based on their document contents, and the user-controlled trade-off between speed of computation and quality of results. To demonstrate the effectiveness of our system, we conducted implementation studies with news articles from the Reuters and LA Times TREC data sets and with artificially generated data that has a known cluster structure.
Author information
Mei Kobayashi received a Bachelor's degree in Chemistry from Princeton and Master's and Ph.D. degrees in Pure and Applied Mathematics from UC Berkeley. She was a student intern in Frick Chemical Laboratory at Princeton, the Biochemical and Math-Physics divisions of Lawrence Berkeley Laboratories, and IBM Research. She has been a Researcher at IBM since 1988 and has been involved in projects ranging from inverse problems, airflow simulation, and graphics to speech signal analysis using wavelets. Her most recent work has been on information retrieval, data mining, and unstructured information management. She has served on the Editorial Board of the Bulletin of Japan SIAM and on Technical Program Committees of the SIAM Data Mining Conference, SIAM Text Mining Workshops, and Symposiums on Wavelets sponsored by the Japanese Ministry of Education. From 1996 to 1999, she was a Visiting Associate Professor at the Graduate School for Mathematical Sciences of the University of Tokyo.
Masaki Aono received Bachelor of Science and Master of Science degrees in Information Science from the University of Tokyo and a Ph.D. in Computer Science from Rensselaer Polytechnic Institute. He worked for IBM Research, Tokyo Research Laboratory from 1984 to 2003. He is currently a Professor in the Information and Computer Sciences Department at the Toyohashi University of Technology, where he teaches object-oriented programming, logic circuits, computer architecture, and knowledge data engineering. His current research interests include text and data mining, information extraction, semantic web, and information visualization. His most recent work, on time series data mining from human body bio-signals obtained by microsensors, was selected to be part of the 21st Century Center of Excellence Program sponsored by the Japanese government. He has been a Japanese delegate of the ISO/IEC JTC1 SC24 Standard Committee since 1996.
Cite this article
Kobayashi, M., Aono, M. Exploring overlapping clusters using dynamic re-scaling and sampling. Knowl Inf Syst 10, 295–313 (2006). https://doi.org/10.1007/s10115-006-0005-y