Semantic proximity assessment in Bhojpuri and Maithili: a word embedding perspective

  • Original Article
Social Network Analysis and Mining

Abstract

Natural Language Processing has been extensively researched for resource-rich languages such as English and Spanish, but low-resource languages present unique challenges and opportunities. Consequently, learning algorithms perform unsatisfactorily on downstream tasks in low-resource languages such as the Indic languages Bhojpuri and Maithili. To address these challenges, we collect corpora of 20,000 sentences each in Bhojpuri and Maithili and use them to develop word representations with popular approaches such as Word2Vec, FastText, and GloVe. The word representations are evaluated through machine translation and text classification tasks. Among the models used in the experiments, Word2Vec with Bidirectional Encoder Representations from Transformers (BERT) outperforms GloVe and FastText in the machine translation task, achieving a BiLingual Evaluation Understudy (BLEU) score of 32.95 for Bhojpuri to English and 28.95 for Maithili to English. Additionally, the BERT model achieves a text classification accuracy of 71.23% for Bhojpuri and 66.91% for Maithili. This work may serve as a reasonable starting point for further research in low-resource Indic languages.
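For readers unfamiliar with the BLEU metric reported above, the following is a minimal sketch of an unsmoothed, single-reference, sentence-level BLEU on a 0-100 scale. It is an illustration only: the abstract does not state which BLEU variant, tokenization, or smoothing the authors used, the function names `ngrams` and `sentence_bleu` are hypothetical, and the English example sentences are made up rather than drawn from the paper's corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference, scaled to 0-100.

    Illustrative sketch only; not necessarily the variant used in the paper.
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, any zero n-gram precision gives BLEU = 0.
            return 0.0
        log_precision_sum += math.log(overlap / total)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, with uniform weights.
    return 100.0 * brevity_penalty * math.exp(log_precision_sum / max_n)


# Made-up example sentences (not from the paper's data).
reference = "the farmer went to the market".split()
perfect = "the farmer went to the market".split()
partial = "the farmer went to a market".split()

print(round(sentence_bleu(perfect, reference), 2))
print(round(sentence_bleu(partial, reference), 2))
```

An exact match scores 100, while a single substituted word lowers every n-gram precision and so reduces the score noticeably. Reported results such as those above are typically computed at corpus level with an established toolkit rather than with a hand-written function like this one.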

Data availability

Data sets generated during the current study are available from the corresponding author on reasonable request.

Acknowledgements

The authors acknowledge the efforts by Mr. Parveen Sinha, Holy Mary School, Nalanda, Bihar, India, and Mr. Bishwanath Yadav, AP Mahila College, Nalanda, Bihar, India, in the data creation process.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

Arun Kumar Yadav: Conceptualization, Methodology, Supervision, Writing—review & editing. Abhishek Kumar: Software, Investigation, Data curation, Writing—original draft. Mohit Kumar: Methodology, Writing—review & editing, Visualization. Divakar Yadav: Conceptualization, Methodology, Formal analysis.

Corresponding author

Correspondence to Divakar Yadav.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not Applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yadav, A.K., Kumar, A., Kumar, M. et al. Semantic proximity assessment in Bhojpuri and Maithili: a word embedding perspective. Soc. Netw. Anal. Min. 14, 130 (2024). https://doi.org/10.1007/s13278-024-01287-w

  • DOI: https://doi.org/10.1007/s13278-024-01287-w
