A topic-enhanced Dirichlet model for short text stream clustering

  • Original Article
Neural Computing and Applications

Abstract

Short text streams, such as social media comments, are generated continuously, so effective clustering methods are essential for extracting valuable information from them. However, existing research does not address topic concentration in clustering: multiple topics end up mixed in a single cluster, which makes it difficult to summarize the center of each cluster. To tackle this issue, this paper proposes TEDM, a novel topic-enhanced clustering method based on the Dirichlet model. The method clusters dynamically, using topic information to improve the sampling of documents so that documents on the same topic are grouped together. TEDM constructs a dynamic word relation graph to extract topic terms and updates it as documents arrive in the stream, which lets it cope with changing topics. Extensive experiments show that TEDM outperforms state-of-the-art methods on multiple real-world datasets.
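To make the idea of a dynamic word relation graph more concrete, the following is a minimal sketch and not the TEDM algorithm itself. It assumes a plain co-occurrence graph, an exponential forgetting factor (`decay`), and topic terms ranked by weighted degree. The class name, method names, and parameters are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


class WordRelationGraph:
    """Toy co-occurrence graph whose edge weights fade as new documents arrive."""

    def __init__(self, decay=0.9):
        self.decay = decay               # hypothetical forgetting factor, not from the paper
        self.edges = defaultdict(float)  # (word_a, word_b) -> co-occurrence weight

    def update(self, tokens):
        # Fade old evidence so terms from stale topics gradually lose weight.
        for pair in self.edges:
            self.edges[pair] *= self.decay
        # Strengthen the edge between every pair of distinct words in the document.
        for a, b in combinations(sorted(set(tokens)), 2):
            self.edges[(a, b)] += 1.0

    def topic_terms(self, k=3):
        # Score each word by its weighted degree; ties are broken alphabetically.
        score = defaultdict(float)
        for (a, b), weight in self.edges.items():
            score[a] += weight
            score[b] += weight
        ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:k]]


# Feed a tiny stream of tokenized short texts, one document at a time.
graph = WordRelationGraph(decay=0.9)
for doc in (["flood", "rescue", "city"], ["flood", "rescue"], ["match", "goal"]):
    graph.update(doc)
print(graph.topic_terms(k=2))
```

In this sketch, words that keep co-occurring in recent documents accumulate weight, while edges from older documents decay. That is one simple way a graph could track topics that drift over a stream; the scoring and update rules used by TEDM may differ.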

Data Availability

The Tweets and Tweets-T datasets are available at http://trec.nist.gov/data/microblog, and the News and News-T datasets are available at https://news.google.com/news.

Notes

  1. Tweets dataset: http://trec.nist.gov/data/microblog.

  2. News website: https://news.google.com/news/.

  3. http://jwebpro.sourceforge.net/data-web-snippets.tar.gz

Author information

Corresponding author

Correspondence to Kan Liu.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, K., He, J. & Chen, Y. A topic-enhanced dirichlet model for short text stream clustering. Neural Comput & Applic 36, 8125–8140 (2024). https://doi.org/10.1007/s00521-024-09480-w

  • DOI: https://doi.org/10.1007/s00521-024-09480-w
