Semi-supervised Document Clustering via Loci

Sutanto, Taufik; Nayak, Richi

doi:10.1007/978-3-319-26187-4_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9419)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Document clustering is a prominent method for mining important information from the vast amount of data available on the web. However, it generally suffers from the curse of dimensionality. Fortunately, in high-dimensional space, data points tend to concentrate in certain areas of a cluster. We take advantage of this phenomenon by introducing a novel, dynamic cluster representation called loci. A cluster's loci are calculated efficiently from document ranking scores generated by a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that assigns documents to clusters using the clusters' loci instead of conventional centroids. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions of promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.
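
The abstract does not spell out how loci are computed, so the following is only a minimal sketch of one possible reading, not the authors' algorithm. It assumes that a document is issued as a query, that the top-ranked labelled documents returned for it are grouped by cluster, and that each group acts as that cluster's locus for the query. The TF-IDF cosine scorer stands in for a real search engine, and the names `build_index`, `rank`, `assign_by_loci`, and `top_k` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return L2-normalised TF-IDF vectors; a toy stand-in for a search engine index."""
    tokenised = [d.lower().split() for d in docs]
    n = len(tokenised)
    df = Counter(t for toks in tokenised for t in set(toks))
    # Smoothed inverse document frequency, always positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors, idf


def rank(query, vectors, idf, top_k):
    """Score indexed documents against the query; return the top_k (score, doc_id) hits."""
    tf = Counter(t for t in query.lower().split() if t in idf)
    q = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    hits = []
    for doc_id, vec in enumerate(vectors):
        score = sum(w / norm * vec.get(t, 0.0) for t, w in q.items())
        if score > 0.0:
            hits.append((score, doc_id))
    hits.sort(reverse=True)
    return hits[:top_k]


def assign_by_loci(query, vectors, idf, labels, top_k=3):
    """Assign the query to the cluster whose share of the top hits has the highest total score."""
    locus_score = defaultdict(float)
    for score, doc_id in rank(query, vectors, idf, top_k):
        locus_score[labels[doc_id]] += score
    if not locus_score:
        return None  # no labelled document matched the query
    return max(locus_score, key=locus_score.get)


# Hypothetical labelled seed documents (the semi-supervised input).
seed_docs = [
    "football match goal striker",
    "league football season goal",
    "stock market shares trading",
    "market investors shares price",
]
seed_labels = ["sport", "sport", "finance", "finance"]
vectors, idf = build_index(seed_docs)
print(assign_by_loci("late goal wins football match", vectors, idf, seed_labels))
```

In this sketch no centroid is ever computed: each assignment looks only at the few top-ranked documents, which is one way the speed advantage claimed in the abstract could arise.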

Author information

Authors and Affiliations

Queensland University of Technology (QUT), Brisbane, Australia
Taufik Sutanto & Richi Nayak
Syarif Hidayatullah State Islamic University, Jakarta, Indonesia
Taufik Sutanto

Corresponding author

Correspondence to Taufik Sutanto.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jianyong Wang
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
Florida Atlantic University, Boca Raton, Florida, USA
Dingding Wang
Victoria University, Melbourne, Victoria, Australia
Hua Wang
Florida International University, Miami, Florida, USA
Shu-Ching Chen
Florida International University, Miami, Florida, USA
Tao Li
Victoria University, Melbourne, Victoria, Australia
Yanchun Zhang

About this paper

Cite this paper

Sutanto, T., Nayak, R. (2015). Semi-supervised Document Clustering via Loci. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9419. Springer, Cham. https://doi.org/10.1007/978-3-319-26187-4_16

DOI: https://doi.org/10.1007/978-3-319-26187-4_16
Published: 18 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26186-7
Online ISBN: 978-3-319-26187-4
eBook Packages: Computer Science, Computer Science (R0)
