Mono Versus Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition

Litake, Onkar; Sabane, Maithili; Patil, Parth; Ranade, Aparna; Joshi, Raviraj

doi:10.1007/978-981-19-6088-8_56

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 540)

Abstract

Named entity recognition (NER) is the process of recognizing and classifying important information (entities) in text. Proper nouns, such as the name of a person, an organization, or a location, are examples of entities. NER is an important module in applications such as human resources, customer support, search engines, content classification, and academia. In this work, we consider NER for the low-resource Indian languages Hindi and Marathi. Transformer-based models have been widely used for NER tasks. We consider different variations of BERT, namely base BERT, RoBERTa, and ALBERT, and benchmark them on publicly available Hindi and Marathi NER datasets. We provide an exhaustive comparison of different monolingual and multilingual transformer-based models and establish simple baselines currently missing in the literature. We show that the monolingual MahaRoBERTa model performs the best for Marathi NER, whereas the multilingual XLM-RoBERTa performs the best for Hindi NER. We also perform cross-language evaluation and present mixed observations.
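
To make the task concrete, the following is a minimal sketch of how token-level NER output is commonly turned into entities and scored. It assumes the widely used BIO tagging scheme and an entity-level F1 score; the abstract does not state the tagging scheme or the metric used, so both are assumptions, and the function names `bio_to_spans` and `span_f1` as well as the example tags are hypothetical.

```python
from typing import List, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into a list of entity spans (assumed scheme)."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the currently open entity
        if label is not None:
            spans.append((label, start, i))  # close the open entity
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag without a matching open entity starts a new one.
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a predicted span counts only if type and boundaries match."""
    gold_spans = set(bio_to_spans(gold))
    pred_spans = set(bio_to_spans(pred))
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token sentence: a two-token person name and a location.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]  # the location is missed
print(bio_to_spans(gold_tags))
print(round(span_f1(gold_tags, pred_tags), 3))
```

In this example the prediction recovers one of the two gold entities, so precision is 1.0 and recall is 0.5. A cross-language evaluation in the sense of the abstract would apply the same scoring to a model fine-tuned on one language and tested on the other.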

Notes

1.
Multicase BERT: https://huggingface.co/bert-base-multilingual-cased
Indic BERT: https://huggingface.co/ai4bharat/indic-bert
XLM-RoBERTa: https://huggingface.co/xlm-roberta-base
Roberta-Marathi: https://huggingface.co/flax-community/roberta-base-mr
Roberta-Hindi: https://huggingface.co/flax-community/roberta-hindi
indic-transformers-hi-roberta: https://huggingface.co/neuralspace-reverie/indic-transformers-hi-roberta
MahaBERT: https://huggingface.co/l3cube-pune/marathi-bert
MahaRoBERTa: https://huggingface.co/l3cube-pune/marathi-roberta
MahaAlBERT: https://huggingface.co/l3cube-pune/marathi-albert-v2
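
The checkpoints listed in this note can, in principle, be loaded for token classification with the Hugging Face `transformers` library. The sketch below is an illustration under assumptions, not the authors' training code: the dictionary keys are informal labels, the helper name `load_for_ner` and the label set passed to it are hypothetical, and whether each checkpoint is still hosted under the identifier shown has not been verified here.

```python
from typing import Dict, List

# Hub identifiers taken from the URLs in the note above; keys are informal labels.
CHECKPOINTS: Dict[str, str] = {
    "mBERT": "bert-base-multilingual-cased",
    "IndicBERT": "ai4bharat/indic-bert",
    "XLM-RoBERTa": "xlm-roberta-base",
    "RoBERTa-Marathi": "flax-community/roberta-base-mr",
    "RoBERTa-Hindi": "flax-community/roberta-hindi",
    "indic-transformers-hi-roberta": "neuralspace-reverie/indic-transformers-hi-roberta",
    "MahaBERT": "l3cube-pune/marathi-bert",
    "MahaRoBERTa": "l3cube-pune/marathi-roberta",
    "MahaAlBERT": "l3cube-pune/marathi-albert-v2",
}


def load_for_ner(name: str, labels: List[str]):
    """Load a tokenizer and a token-classification model for one checkpoint.

    The import is local so that the mapping above can be used without
    `transformers` installed; calling this function downloads weights.
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    model_id = CHECKPOINTS[name]
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The classification head is freshly initialized and still needs fine-tuning.
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

A call such as `load_for_ner("MahaRoBERTa", ["O", "B-PER", "I-PER"])` would return a tokenizer and a model ready for fine-tuning; the three-label set in that call is a placeholder, since the note does not list the tag sets of the Hindi and Marathi datasets.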

Acknowledgements

This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.

Author information

Authors and Affiliations

SCTR’s Pune Institute of Computer Technology, L3Cube, Pune, Pune, India
Onkar Litake, Maithili Sabane, Parth Patil & Aparna Ranade
Indian Institute of Technology Madras, L3Cube Pune, Pune, India
Raviraj Joshi

Corresponding author

Correspondence to Onkar Litake.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, CMR Institute of Technology, Hyderabad, India
Vinit Kumar Gunjan
Department of Electrical and Computer Engineering, University of Louisville, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Litake, O., Sabane, M., Patil, P., Ranade, A., Joshi, R. (2023). Mono Versus Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition. In: Gunjan, V.K., Zurada, J.M. (eds) Proceedings of 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Lecture Notes in Networks and Systems, vol 540. Springer, Singapore. https://doi.org/10.1007/978-981-19-6088-8_56

DOI: https://doi.org/10.1007/978-981-19-6088-8_56
Published: 24 February 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-6087-1
Online ISBN: 978-981-19-6088-8
eBook Packages: Engineering, Engineering (R0)
