Abstract
Different topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. We observe that no single technique works well across all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of the topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises (i) calculating a homogeneity score that measures the variation in the data and (ii) selecting the algorithm with the best performance for the calculated homogeneity score. For each homogeneity score, the performance of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, was compared using accuracy and Cohen’s kappa. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data of medium and low homogeneity, NMF was superior to the other algorithms (at medium homogeneity, Cohen’s kappa was 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
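The evaluation and selection steps described in the abstract can be illustrated with a minimal sketch. It assumes each document has one reference label and one predicted topic label; the function names, the example labels, and the lookup table keyed by "high", "medium", and "low" are hypothetical and only mirror the results reported above. How the homogeneity score itself is computed is not given in the abstract, so it is left out here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement and p_e is the agreement expected by
    chance, derived from the marginal label frequencies of both sequences.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Both sequences use one identical label everywhere; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical lookup table that mirrors the findings reported in the abstract.
BEST_BY_HOMOGENEITY = {"high": "BERTopic", "medium": "NMF", "low": "NMF"}


def select_algorithm(homogeneity_band):
    """Return the algorithm the abstract reports as best for a homogeneity band."""
    return BEST_BY_HOMOGENEITY[homogeneity_band]


if __name__ == "__main__":
    # Illustrative labels only, not data from the paper.
    reference = ["billing", "billing", "fraud", "fraud", "fees", "fees"]
    predicted = ["billing", "fraud", "fraud", "fraud", "fees", "billing"]
    print(accuracy(reference, predicted))
    print(cohens_kappa(reference, predicted))
    print(select_algorithm("high"))
```

Kappa corrects accuracy for chance agreement, which is why two algorithms with similar accuracy can still differ noticeably in kappa when the label distribution is skewed.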
Copyright information
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Kashi, K.S., Antenor, A.A., Ramolete, G.I.L., Heinrich, A. (2023). Data Homogeneity Dependent Topic Modeling for Information Retrieval. In: Nandan Mohanty, S., Garcia Diaz, V., Satish Kumar, G.A.E. (eds) Intelligent Systems and Machine Learning. ICISML 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-031-35081-8_6
DOI: https://doi.org/10.1007/978-3-031-35081-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35080-1
Online ISBN: 978-3-031-35081-8
eBook Packages: Computer Science, Computer Science (R0)