Abstract
Topic models are designed to discover latent topics in massive micro-blog data. Extracting these topics supports downstream analysis, but the particular nature of the data means that traditional topic model algorithms cannot be applied to it directly. Although topic mining for conventional text has been widely studied in data mining, short texts such as micro-blog posts are marked by network language and newly emerging words. Because the messages are short, the data are sparse, and the descriptions are incomplete, topics cannot be extracted efficiently from micro-blog posts. In this paper, we propose a simple, fast, and effective topic model for short texts, named the couple-word topic model (CWTM). Built on the Dirichlet Multinomial Mixture (DMM) model, it leverages couple-word co-occurrence, rather than traditional word co-occurrence, to distill better topics from short texts. The method alleviates the data sparsity problem and improves model performance, and its parameters are inferred with the Gibbs sampling algorithm. In extensive experiments on two real-world short text collections, CWTM achieves comparable or better topic representations than traditional topic models.
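To make the idea of couple-word co-occurrence concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define how couple words are formed, so the sketch assumes they are unordered pairs of distinct words appearing in the same short text. The function names `couple_words` and `pair_counts` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def couple_words(tokens):
    """Return all unordered pairs of distinct words in one short text.

    Assumption: a "couple word" is any pair of distinct words that
    co-occur in the same text; the abstract does not specify this.
    """
    vocab = sorted(set(tokens))  # sort so each pair has one canonical order
    return list(combinations(vocab, 2))


def pair_counts(docs):
    """Count how often each word pair co-occurs across a corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(couple_words(tokens))
    return counts


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["apple", "iphone", "launch"],
    ["iphone", "launch", "event"],
    ["rain", "storm"],
]
counts = pair_counts(docs)
```

Even in this toy corpus, the pair `("iphone", "launch")` is observed in two different texts, which illustrates how pair-level statistics can link short texts that share little else. How such pairs enter the DMM likelihood and the Gibbs sampling updates is specified in the full paper, not in this abstract.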
Y. Du: This work is supported by the National Natural Science Foundation (Grant Nos. 61472329 and 61532009), the Key Natural Science Foundation of **hua University (Z1412620), and the Innovation Fund of Postgraduate, **hua University.
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Diao, Y., Du, Y., **ao, P., Liu, J. (2017). A CWTM Model of Topic Extraction for Short Text. In: Li, J., Zhou, M., Qi, G., Lao, N., Ruan, T., Du, J. (eds) Knowledge Graph and Semantic Computing. Language, Knowledge, and Intelligence. CCKS 2017. Communications in Computer and Information Science, vol 784. Springer, Singapore. https://doi.org/10.1007/978-981-10-7359-5_9
DOI: https://doi.org/10.1007/978-981-10-7359-5_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7358-8
Online ISBN: 978-981-10-7359-5
eBook Packages: Computer Science (R0)