Abstract
Analyzing sentiment in code-mixed languages is a challenging task. In this research, we perform sentiment analysis on social media data in code-mixed Dravidian languages (Malayalam-English, Tamil-English) and code-mixed Indo-Aryan languages (Hindi-English, Bengali-English). For all four code-mixed Indian languages, we find that our designed Weighted Word Unigram (WWU) feature improves accuracy in comparison with the TF-IDF Char, TF-IDF Word, and TF-IDF Combine features for the n-gram ranges (1,1), (1,2), and (1,3). The experiments show that, for every language, the WWU feature set is half the size of the TF-IDF Word and TF-IDF Combine feature sets for the (1,2) and (1,3) n-gram ranges. Further experiments showed that the WWU features predicted more instances in the “offensive” class for code-mixed Dravidian languages and in the “negative” class for code-mixed Indo-Aryan languages. The datasets for the code-mixed Dravidian languages, labeled “offensive” and “not-offensive”, were taken from the HASOC 2020 shared task, and those for the code-mixed Indo-Aryan languages, labeled “positive” and “negative”, were taken from the SAIL 2017 shared task. To train on these features, we apply a set of machine learning classifiers, namely multinomial Naive Bayes (MNB), support vector machine, logistic regression, and random forest. Classifier performance was evaluated using precision, recall, and accuracy. For all four code-mixed languages, the WWU feature trained with MNB performed best, with accuracy scores of 0.75, 0.88, 0.77, and 0.79 for Malayalam-English, Tamil-English, Hindi-English, and Bengali-English, respectively.
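The link between n-gram range and feature count can be illustrated with a minimal sketch. It is not the chapter's implementation: the toy sentences and the helper names `word_ngrams` and `vocabulary_size` are hypothetical, and the sketch only counts distinct word n-grams (the vocabulary a TF-IDF Word feature would be built on), without computing TF-IDF or WWU weights.

```python
def word_ngrams(tokens, ngram_range):
    """Return every word n-gram of `tokens` for n between lo and hi inclusive."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def vocabulary_size(corpus, ngram_range):
    """Count the distinct word n-grams (candidate features) in a corpus."""
    vocab = set()
    for text in corpus:
        # Whitespace tokenization is an assumption made for this sketch.
        vocab.update(word_ngrams(text.lower().split(), ngram_range))
    return len(vocab)


# Hypothetical toy sentences; not taken from the HASOC 2020 or SAIL 2017 data.
toy_corpus = [
    "movie super hai yaar",
    "movie bakwas hai",
    "super movie yaar",
]

# Feature counts for the three n-gram ranges named in the abstract.
sizes = {r: vocabulary_size(toy_corpus, r) for r in [(1, 1), (1, 2), (1, 3)]}
for ngram_range, size in sizes.items():
    print(ngram_range, size)
```

On this toy corpus the unigram vocabulary is the smallest of the three, which is the general reason a unigram-only feature set stays compact. The exact ratio depends on the data, so the sketch does not reproduce the halving reported above.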
Notes
- 1.
- 2. nltk.corpus.
- 3. scipy.sparse.hstack.
- 4.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Joshi, P.A., Pathak, V.M., Joshi, M.R. (2024). Sentiment Analysis from Social Media Data in Code-Mixed Indian Languages Using Machine Learning Classifiers with TF-IDF and Weighted Word Features. In: Mishra, D., Yang, X.S., Unal, A., Jat, D.S. (eds) Data Science and Big Data Analytics. IDBA 2023. Data-Intensive Research. Springer, Singapore. https://doi.org/10.1007/978-981-99-9179-2_16
DOI: https://doi.org/10.1007/978-981-99-9179-2_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9178-5
Online ISBN: 978-981-99-9179-2
eBook Packages: Intelligent Technologies and Robotics (R0)