Sentiment Analysis from Social Media Data in Code-Mixed Indian Languages Using Machine Learning Classifiers with TF-IDF and Weighted Word Features

  • Conference paper
Data Science and Big Data Analytics (IDBA 2023)

Abstract

Analyzing sentiment in code-mixed languages is a challenging task. In this research, we perform sentiment analysis on social media data in code-mixed Dravidian languages (Malayalam-English, Tamil-English) and code-mixed Indo-Aryan languages (Hindi-English, Bengali-English). We found that, for all four code-mixed Indian languages, our Weighted Word Unigram (WWU) feature improves accuracy compared with the TF-IDF Char, TF-IDF Word, and TF-IDF Combine features for the n-gram ranges (1,1), (1,2), and (1,3). The experiments show that, for every language, there are half as many WWU features as TF-IDF Word and TF-IDF Combine features for the (1,2) and (1,3) n-gram ranges. Further experiments showed that the WWU features predicted more instances in the “offensive” class for code-mixed Dravidian languages and in the “negative” class for code-mixed Indo-Aryan languages. The datasets for the code-mixed Dravidian languages, labeled “offensive” and “not-offensive”, were taken from the HASOC 2020 shared task, and those for the code-mixed Indo-Aryan languages, labeled “positive” and “negative”, were taken from the SAIL 2017 shared task. We trained a set of machine learning classifiers on these features: multinomial Naive Bayes (MNB), support vector machine, logistic regression, and random forest. Classifier performance was evaluated using precision, recall, and accuracy. For all four code-mixed languages, the WWU feature trained with MNB performed best, with accuracy scores of 0.75, 0.88, 0.77, and 0.79 for Malayalam-English, Tamil-English, Hindi-English, and Bengali-English, respectively.
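The abstract compares feature sets by n-gram range and reports precision, recall, and accuracy. As a minimal sketch only (not the authors' implementation), the Python below illustrates two of those ideas: how the number of distinct word features grows as the n-gram range widens from (1,1) to (1,3), and how the three metrics are computed for a two-class label set. The toy sentences, the labels, and the function names `word_ngrams`, `vocabulary_size`, and `binary_metrics` are invented for illustration. The weighting scheme behind the Weighted Word Unigram feature is not specified in the abstract, so it is not reproduced here.

```python
def word_ngrams(text, ngram_range):
    """Return all word n-grams of `text` for n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def vocabulary_size(corpus, ngram_range):
    """Count the distinct n-gram features found across the whole corpus."""
    vocab = set()
    for doc in corpus:
        vocab.update(word_ngrams(doc, ngram_range))
    return len(vocab)


def binary_metrics(y_true, y_pred, positive):
    """Precision and recall for the `positive` class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy


# Hypothetical toy corpus (assumption: not taken from HASOC 2020 or SAIL 2017).
corpus = ["movie super hit", "movie flop", "super movie"]

# Feature counts for the three n-gram ranges named in the abstract.
sizes = {r: vocabulary_size(corpus, r) for r in [(1, 1), (1, 2), (1, 3)]}
print(sizes)

# Hypothetical gold labels and predictions for four messages.
y_true = ["offensive", "not-offensive", "offensive", "not-offensive"]
y_pred = ["offensive", "offensive", "not-offensive", "not-offensive"]
print(binary_metrics(y_true, y_pred, positive="offensive"))
```

On this toy corpus the unigram range yields the fewest features, which is the general reason a unigram-only feature set stays smaller than (1,2) or (1,3) word n-gram sets. The exact ratio depends on the data.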


Notes

  1. https://scikit-learn.org/

  2. nltk.corpus

  3. scipy.sparse.hstack

  4. https://docs.python.org/3/library/timeit.html


Author information

Corresponding author

Correspondence to Prasad A. Joshi.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Joshi, P.A., Pathak, V.M., Joshi, M.R. (2024). Sentiment Analysis from Social Media Data in Code-Mixed Indian Languages Using Machine Learning Classifiers with TF-IDF and Weighted Word Features. In: Mishra, D., Yang, X.S., Unal, A., Jat, D.S. (eds) Data Science and Big Data Analytics. IDBA 2023. Data-Intensive Research. Springer, Singapore. https://doi.org/10.1007/978-981-99-9179-2_16
