Data Homogeneity Dependent Topic Modeling for Information Retrieval

  • Conference paper
Intelligent Systems and Machine Learning (ICISML 2022)

Abstract

Different topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. Our observation is that no single technique works well for all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of the topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises (i) the calculation of a homogeneity score that measures the variation in the data and (ii) the selection of the algorithm that performs best for the calculated homogeneity score. For each homogeneity score, the performances of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, were compared using accuracy and Cohen’s kappa scores. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data of medium and low homogeneity, NMF was superior to the other algorithms (for medium homogeneity, a Cohen’s kappa of 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
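
To make the comparison step concrete, the sketch below illustrates it with scikit-learn: fit NMF, LDA, and LSA on a TF-IDF matrix, assign each document to its dominant topic, map topics to reference labels, and score the resulting assignments with accuracy and Cohen’s kappa. This is a minimal illustration, not the authors’ pipeline: the toy corpus, the greedy topic-to-label mapping, and the `homogeneity_proxy` function (mean pairwise cosine similarity) are assumptions introduced here, since the paper’s exact homogeneity score and alignment procedure are not reproduced in this preview; BERTopic would be evaluated analogously through its own fit_transform API.

```python
# Minimal sketch (assumptions noted in comments), not the authors' exact pipeline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.metrics.pairwise import cosine_similarity


def homogeneity_proxy(X):
    """Illustrative proxy only: mean pairwise cosine similarity of the TF-IDF
    vectors (the paper's actual homogeneity score is not specified here)."""
    sims = cosine_similarity(X)
    n = sims.shape[0]
    return (sims.sum() - n) / (n * (n - 1))  # average of the off-diagonal entries


def align_topics_to_labels(topics, labels, n_topics):
    """Greedy majority-vote mapping from topic ids to label ids (an assumption;
    the paper's alignment procedure is not described in this preview)."""
    mapping = {}
    for t in range(n_topics):
        mask = topics == t
        mapping[t] = np.bincount(labels[mask]).argmax() if mask.any() else 0
    return np.array([mapping[t] for t in topics])


def evaluate(model, X, labels, n_topics):
    """Fit a topic model, take each document's dominant topic, and score the
    induced labeling with accuracy and Cohen's kappa."""
    topics = model.fit_transform(X).argmax(axis=1)
    preds = align_topics_to_labels(topics, labels, n_topics)
    return accuracy_score(labels, preds), cohen_kappa_score(labels, preds)


# Toy two-category corpus (placeholder data, not the paper's dataset).
docs = [
    "late fee charged on my credit card",
    "dispute an incorrect credit card charge",
    "mortgage payment was misapplied by the servicer",
    "loan servicer lost my mortgage check",
]
labels = np.array([0, 0, 1, 1])
n_topics = 2

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(f"homogeneity proxy: {homogeneity_proxy(X):.3f}")

models = [
    ("NMF", NMF(n_components=n_topics, init="nndsvd", random_state=0)),
    ("LDA", LatentDirichletAllocation(n_components=n_topics, random_state=0)),
    ("LSA", TruncatedSVD(n_components=n_topics, random_state=0)),
]
for name, model in models:
    acc, kappa = evaluate(model, X, labels, n_topics)
    print(f"{name}: accuracy={acc:.2f}, Cohen's kappa={kappa:.2f}")
```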

Author information

Corresponding author

Correspondence to Keerthana Sureshbabu Kashi.

Copyright information

© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Kashi, K.S., Antenor, A.A., Ramolete, G.I.L., Heinrich, A. (2023). Data Homogeneity Dependent Topic Modeling for Information Retrieval. In: Nandan Mohanty, S., Garcia Diaz, V., Satish Kumar, G.A.E. (eds) Intelligent Systems and Machine Learning. ICISML 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-031-35081-8_6

  • DOI: https://doi.org/10.1007/978-3-031-35081-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35080-1

  • Online ISBN: 978-3-031-35081-8

  • eBook Packages: Computer Science (R0)
