Data Homogeneity Dependent Topic Modeling for Information Retrieval

Kashi, Keerthana Sureshbabu; Antenor, Abigail A.; Ramolete, Gabriel Isaac L.; Heinrich, Adrienne

doi:10.1007/978-3-031-35081-8_6

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 471)

Included in the following conference series:

International Conference on Intelligent Systems and Machine Learning

Abstract

Various topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. We observe that no single technique works well across all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of the topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises (i) the calculation of a homogeneity score to measure the variation in the data and (ii) the selection of the algorithm with the best performance for the calculated homogeneity score. For each homogeneity score, the performance of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, was compared using accuracy and Cohen’s kappa. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data of medium and low homogeneity, NMF was superior to the other algorithms (at medium homogeneity, Cohen’s kappa was 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
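The two-step methodology and the kappa-based comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the homogeneity score is computed, so `homogeneity_score` below is an assumption: it uses one minus the normalized Shannon entropy of a label distribution, which may differ from the authors' definition. The `pick_algorithm` helper and its `high_threshold` value are likewise hypothetical; only the mapping (BERTopic for highly homogeneous data, NMF otherwise) comes from the reported results. `cohen_kappa` follows the standard definition of Cohen's kappa.

```python
import math
from collections import Counter


def homogeneity_score(labels):
    """Hypothetical score in [0, 1]: 1 minus normalized Shannon entropy.

    ASSUMPTION: the abstract does not define the score. Here 1.0 means all
    items share one label and 0.0 means the observed labels are evenly spread.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k <= 1:
        return 1.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Normalize by the maximum entropy for k observed categories.
    return 1.0 - entropy / math.log(k)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


def pick_algorithm(score, high_threshold=0.66):
    """Hypothetical selection rule; the threshold is NOT from the paper."""
    return "BERTopic" if score >= high_threshold else "NMF"


if __name__ == "__main__":
    labels = ["billing"] * 9 + ["fraud"]
    score = homogeneity_score(labels)
    print(f"homogeneity={score:.2f}, algorithm={pick_algorithm(score)}")
    print(f"kappa={cohen_kappa([0, 1, 0, 1], [0, 1, 1, 1]):.2f}")
```

Kappa corrects raw accuracy for agreement expected by chance, which matters when the topic distribution is skewed, as it is in highly homogeneous data.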

Author information

Authors and Affiliations

Aboitiz Data Innovation, Goldbell Towers, 47 Scotts Road, Singapore, Singapore
Keerthana Sureshbabu Kashi, Abigail A. Antenor, Gabriel Isaac L. Ramolete & Adrienne Heinrich

Corresponding author

Correspondence to Keerthana Sureshbabu Kashi.

Editor information

Editors and Affiliations

VIT-AP University, Amrāvati, Andhra Pradesh, India
Sachi Nandan Mohanty
University of Oviedo, Oviedo, Spain
Vicente Garcia Diaz
Vardhaman College of Engineering, Hyderabad, India
G. A. E. Satish Kumar

Cite this paper

Kashi, K.S., Antenor, A.A., Ramolete, G.I.L., Heinrich, A. (2023). Data Homogeneity Dependent Topic Modeling for Information Retrieval. In: Nandan Mohanty, S., Garcia Diaz, V., Satish Kumar, G.A.E. (eds) Intelligent Systems and Machine Learning. ICISML 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-031-35081-8_6

DOI: https://doi.org/10.1007/978-3-031-35081-8_6
Published: 10 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35080-1
Online ISBN: 978-3-031-35081-8
eBook Packages: Computer Science, Computer Science (R0)