Semantic proximity assessment in Bhojpuri and Maithili: a word embedding perspective

Yadav, Arun Kumar; Kumar, Abhishek; Kumar, Mohit; Yadav, Divakar

doi:10.1007/s13278-024-01287-w

Original Article
Published: 04 July 2024

Volume 14, article number 130 (2024)

Social Network Analysis and Mining

Arun Kumar Yadav¹,
Abhishek Kumar¹,
Mohit Kumar¹ &
…
Divakar Yadav²

Abstract

Natural Language Processing has been extensively researched for languages with abundant resources, such as English and Spanish, but low-resource languages present unique challenges and opportunities. As a result, learning algorithms perform unsatisfactorily on downstream tasks in low-resource languages such as the Indic languages Bhojpuri and Maithili. To address these challenges, we collect a corpus of 20,000 sentences each in Bhojpuri and Maithili and use them to develop word representations with popular approaches such as Word2Vec, FastText, and GloVe. The word representations are evaluated through machine translation and text classification tasks. Among the models used in the experiments, Word2Vec with Bidirectional Encoder Representations from Transformers (BERT) outperforms GloVe and FastText in the machine translation task, achieving a BiLingual Evaluation Understudy (BLEU) score of 32.95 for Bhojpuri to English and 28.95 for Maithili to English. Additionally, the BERT model provides a text classification accuracy of 71.23% for Bhojpuri and 66.91% for Maithili. This work may serve as a reasonable starting point for further research in low-resource Indic languages.
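
As an illustration of what semantic proximity between word representations means in practice, the sketch below ranks words by cosine similarity between their vectors. It is a minimal sketch under stated assumptions and not the authors' implementation: the abstract does not say which similarity measure was used, and the three-dimensional vectors, the placeholder labels `word_a`, `word_b`, `word_c`, and the helper `most_similar` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption): real Word2Vec, FastText, or GloVe
# vectors would be learned from a corpus and have far more dimensions.
toy_embeddings = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [1.0, 1.0, 0.0],
    "word_c": [0.0, 0.0, 1.0],
}


def most_similar(query, embeddings):
    """Rank every other word by cosine similarity to `query`, highest first."""
    query_vec = embeddings[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("word_a", toy_embeddings)
for word, score in ranking:
    print(f"{word}: {score:.4f}")
```

With trained embeddings in place of the toy dictionary, the same ranking would surface the nearest neighbours of a Bhojpuri or Maithili word; a score near 1 indicates vectors pointing in nearly the same direction, and a score near 0 indicates unrelated directions.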

Data availability

Data sets generated during the current study are available from the corresponding author on reasonable request.

Acknowledgements

The authors acknowledge the efforts by Mr. Parveen Sinha, Holy Mary School, Nalanda, Bihar, India, and Mr. Bishwanath Yadav, AP Mahila College, Nalanda, Bihar, India, in the data creation process.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Authors and Affiliations

Dept of Computer Science & Engineering, NIT Hamirpur, Hamirpur, India
Arun Kumar Yadav, Abhishek Kumar & Mohit Kumar
SOCIS, Indira Gandhi National Open University (IGNOU), New Delhi, India
Divakar Yadav

Contributions

Arun Kumar Yadav: Conceptualization, Methodology, Supervision, Writing—review & editing. Abhishek Kumar: Software, Investigation, Data curation, Writing—original draft. Mohit Kumar: Methodology, Writing—review & editing, Visualization. Divakar Yadav: Conceptualization, Methodology, Formal analysis.

Corresponding author

Correspondence to Divakar Yadav.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not Applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yadav, A.K., Kumar, A., Kumar, M. et al. Semantic proximity assessment in Bhojpuri and Maithili: a word embedding perspective. Soc. Netw. Anal. Min. 14, 130 (2024). https://doi.org/10.1007/s13278-024-01287-w

Received: 08 April 2024
Revised: 18 June 2024
Accepted: 19 June 2024
Published: 04 July 2024
DOI: https://doi.org/10.1007/s13278-024-01287-w
