Abstract
Natural language processing has been extensively researched for high-resource languages such as English and Spanish, but low-resource languages, which lack comparable corpora, present unique challenges and opportunities. Consequently, the performance of learning algorithms on downstream tasks in low-resource languages, such as the Indic languages Bhojpuri and Maithili, is unsatisfactory. To address these challenges, we collect a corpus of 20,000 sentences each in Bhojpuri and Maithili and use these corpora to develop word representations with popular approaches such as Word2Vec, FastText, and GloVe. We evaluate the word representations on machine translation and text classification tasks. Among the models used in the experiments, Word2Vec with Bidirectional Encoder Representations from Transformers (BERT) outperforms GloVe and FastText on the machine translation task, achieving Bilingual Evaluation Understudy (BLEU) scores of 32.95 for Bhojpuri to English and 28.95 for Maithili to English. Additionally, the BERT model achieves a text classification accuracy of 71.23% for Bhojpuri and 66.91% for Maithili. This work may serve as a reasonable starting point for further research in low-resource Indic languages.
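For readers unfamiliar with the BLEU metric reported above, the following is a minimal sketch of a single-reference, sentence-level BLEU score on a 0 to 100 scale. It is an illustration only: the function names `ngrams` and `bleu`, the whitespace tokenisation, and the absence of smoothing are assumptions made here, and this section does not state which BLEU implementation or settings the authors used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Illustrative single-reference BLEU (0-100), no smoothing.

    Assumption: whitespace tokenisation; real evaluations usually
    score a whole corpus with a standard toolkit.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(round(bleu("the cat sat on a mat", reference), 2))
```

A score of 100 means the candidate matches the reference exactly; scores in the low 30s, like those reported in the abstract, indicate partial n-gram overlap with the reference translations.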
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13278-024-01287-w/MediaObjects/13278_2024_1287_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13278-024-01287-w/MediaObjects/13278_2024_1287_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13278-024-01287-w/MediaObjects/13278_2024_1287_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13278-024-01287-w/MediaObjects/13278_2024_1287_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13278-024-01287-w/MediaObjects/13278_2024_1287_Fig5_HTML.png)
Data availability
Data sets generated during the current study are available from the corresponding author on reasonable request.
Acknowledgements
The authors acknowledge the efforts by Mr. Parveen Sinha, Holy Mary School, Nalanda, Bihar, India, and Mr. Bishwanath Yadav, AP Mahila College, Nalanda, Bihar, India, in the data creation process.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
Arun Kumar Yadav: Conceptualization, Methodology, Supervision, Writing (review and editing). Abhishek Kumar: Software, Investigation, Data curation, Writing (original draft). Mohit Kumar: Methodology, Writing (review and editing), Visualization. Divakar Yadav: Conceptualization, Methodology, Formal analysis.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not Applicable
About this article
Cite this article
Yadav, A.K., Kumar, A., Kumar, M. et al. Semantic proximity assessment in Bhojpuri and Maithili: a word embedding perspective. Soc. Netw. Anal. Min. 14, 130 (2024). https://doi.org/10.1007/s13278-024-01287-w
DOI: https://doi.org/10.1007/s13278-024-01287-w