Sentiment Analysis from Social Media Data in Code-Mixed Indian Languages Using Machine Learning Classifiers with TF-IDF and Weighted Word Features

Joshi, Prasad A.; Pathak, Varsha M.; Joshi, Manish R.

doi:10.1007/978-981-99-9179-2_16

Part of the book series: Data-Intensive Research (DIR)

Included in the following conference series: International Conference on Data Science and Big Data Analysis

Abstract

Sentiment analysis of code-mixed languages is a challenging task. In this research, we used social media data in code-mixed Dravidian languages (Malayalam-English, Tamil-English) and code-mixed Indo-Aryan languages (Hindi-English, Bengali-English). For all four code-mixed Indian languages, we found that our Weighted Word Unigram (WWU) feature improves accuracy compared with the TF-IDF Char, TF-IDF Word, and TF-IDF Combine features for the n-gram ranges (1,1), (1,2), and (1,3). The experiments show that, for every language, the WWU feature set is half the size of the TF-IDF Word and TF-IDF Combine feature sets for the (1,2) and (1,3) n-gram ranges. Further experiments showed that the WWU features predicted more instances in the “offensive” class for the code-mixed Dravidian languages and in the “negative” class for the code-mixed Indo-Aryan languages. The datasets for the code-mixed Dravidian languages, labeled “offensive” and “not-offensive”, were taken from the HASOC 2020 shared task; those for the code-mixed Indo-Aryan languages, labeled “positive” and “negative”, were taken from the SAIL 2017 shared task. To train on these features, we applied a set of machine learning classifiers: multinomial Naive Bayes (MNB), support vector machine, logistic regression, and random forest. Classifier performance was evaluated using precision, recall, and accuracy. For all four code-mixed languages, the WWU feature trained with MNB performed best, with accuracy scores of 0.75, 0.88, 0.77, and 0.79 for Malayalam-English, Tamil-English, Hindi-English, and Bengali-English, respectively.
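
The TF-IDF baselines named in the abstract can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sentences are invented (they are not HASOC 2020 or SAIL 2017 data), "TF-IDF Combine" is assumed to mean word and char TF-IDF matrices stacked side by side with scipy.sparse.hstack, LinearSVC stands in for the support vector machine, and macro averaging is an arbitrary choice for precision and recall. The Weighted Word Unigram feature is omitted because the abstract does not define how it is computed.

```python
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented romanized Hindi-English style examples, for illustration only.
train_texts = [
    "movie bahut accha tha loved it",
    "kya mast song hai super hit",
    "yeh film bakwas hai total waste",
    "bahut boring tha time waste",
    "acting kamaal ki thi really nice",
    "story bekar thi very bad",
]
train_labels = ["positive", "positive", "negative", "negative", "positive", "negative"]
test_texts = ["song accha tha nice", "film boring thi waste"]
test_labels = ["positive", "negative"]


def build_features(ngram_range):
    """Return train/test matrices for word, char, and combined TF-IDF."""
    word_vec = TfidfVectorizer(analyzer="word", ngram_range=ngram_range)
    char_vec = TfidfVectorizer(analyzer="char", ngram_range=ngram_range)
    xw_train = word_vec.fit_transform(train_texts)
    xc_train = char_vec.fit_transform(train_texts)
    xw_test = word_vec.transform(test_texts)
    xc_test = char_vec.transform(test_texts)
    return {
        "tfidf_word": (xw_train, xw_test),
        "tfidf_char": (xc_train, xc_test),
        # Assumption: "Combine" = word and char columns placed side by side.
        "tfidf_combine": (
            hstack([xw_train, xc_train]).tocsr(),
            hstack([xw_test, xc_test]).tocsr(),
        ),
    }


# Factories so that every run starts from a fresh, untrained classifier.
classifiers = {
    "MNB": lambda: MultinomialNB(),
    "SVM": lambda: LinearSVC(),
    "LR": lambda: LogisticRegression(max_iter=1000),
    "RF": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    for feat_name, (x_train, x_test) in build_features(ngram_range).items():
        for clf_name, make_clf in classifiers.items():
            clf = make_clf()
            clf.fit(x_train, train_labels)
            pred = clf.predict(x_test)
            results[(ngram_range, feat_name, clf_name)] = {
                "n_features": x_train.shape[1],
                "accuracy": accuracy_score(test_labels, pred),
                "precision": precision_score(
                    test_labels, pred, average="macro", zero_division=0
                ),
                "recall": recall_score(
                    test_labels, pred, average="macro", zero_division=0
                ),
            }

for key, scores in sorted(results.items()):
    print(key, scores)
```

The `n_features` entry makes the growth of the feature set with wider n-gram ranges visible, which is the quantity the abstract compares against the WWU feature set. Scores on six toy sentences carry no meaning; only the structure of the comparison is the point.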

Notes

1. https://scikit-learn.org/
2. nltk.corpus
3. scipy.sparse.hstack
4. https://docs.python.org/3/library/timeit.html

Author information

Authors and Affiliations

Prasad A. Joshi: JET’s Zulal Bhilajirao Patil College, Dhule, India
Varsha M. Pathak: Institute of Management and Research, Jalgaon, India
Manish R. Joshi: K. B. C. North Maharashtra University, Jalgaon, India

Corresponding author

Correspondence to Prasad A. Joshi.

Editor information

Editors and Affiliations

Durgesh Mishra: Symbiosis University of Applied Sciences Indore, Indore, Madhya Pradesh, India
Xin She Yang: Middlesex University, London, UK
Aynur Unal: Stanford University, Stanford, CA, USA
Dharm Singh Jat: Computer Science Department, Namibia University of Science and Technology, Windhoek, Namibia

About this paper

Cite this paper

Joshi, P.A., Pathak, V.M., Joshi, M.R. (2024). Sentiment Analysis from Social Media Data in Code-Mixed Indian Languages Using Machine Learning Classifiers with TF-IDF and Weighted Word Features. In: Mishra, D., Yang, X.S., Unal, A., Jat, D.S. (eds) Data Science and Big Data Analytics. IDBA 2023. Data-Intensive Research. Springer, Singapore. https://doi.org/10.1007/978-981-99-9179-2_16

DOI: https://doi.org/10.1007/978-981-99-9179-2_16
Published: 17 March 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9178-5
Online ISBN: 978-981-99-9179-2
eBook Packages: Intelligent Technologies and Robotics (R0)
