Performance Evaluation of Part-of-Speech Tagging for Bengali Text

Pan, Subrata; Saha, Diganta

doi:10.1007/s40031-021-00630-5

Original Contribution
Published: 29 July 2021

Volume 103, pages 577–589, (2022)

Journal of The Institution of Engineers (India): Series B

Abstract

This paper proposes an approach for assessing the performance of part-of-speech (POS) tagging of Bengali text. Tagging can be viewed as the process of assigning a syntactic category to each token in a text document; the difficulty lies in choosing the appropriate part-of-speech category for a token. To address this difficulty, we propose a method that performs part-of-speech tagging on five domain-independent corpora, and we then assess tagging performance to verify the efficiency of the system. The system is developed in 10 phases: dataset initialization, sentence boundary determination, tokenization, identification of unique tokens, part-of-speech tagging, retrieval of each token and its tagged part-of-speech class, recording of the retrieved outcomes, query processing, performance evaluation, and rank generation. The system successfully tagged 98.97%, 98.35%, 89.93%, 88.46% and 90.01% of the tokens in the five experimental corpora. It obtained its best tagging performance for the Common Noun category compared with the other POS categories. The efficiency of the system is shown through a detailed performance appraisal across 18 part-of-speech categories. Across all corpora, the system successfully tagged 16,504,118 of 18,047,593 distinct tokens, an overall tagging effectiveness of 91.44%, which represents an improvement of about 3.24% over the baseline method.
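The early phases named in the abstract (sentence boundary determination, tokenization, identification of unique tokens, tagging, and performance evaluation) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary lookup tagger, the `TOY_LEXICON` entries, the tag names, and all function names are assumptions made for illustration. Only the token counts and per-corpus percentages at the end are taken from the abstract.

```python
import re
from collections import Counter

# Bengali sentences usually end with the danda "।"; "?" and "!" also occur.
SENTENCE_END = re.compile("[।?!]+")
PUNCTUATION = ",;:\"'()"


def split_sentences(text):
    """Sentence boundary determination: split on terminal punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Whitespace tokenization; the regex class \\w is avoided because it
    does not match Bengali dependent vowel signs and would split words."""
    tokens = (t.strip(PUNCTUATION) for t in sentence.split())
    return [t for t in tokens if t]


def tag_tokens(tokens, lexicon):
    """Hypothetical tagger: a plain lexicon lookup, None when unknown."""
    return [(t, lexicon.get(t)) for t in tokens]


def effectiveness(tagged, total):
    """Percentage of tokens that received a tag."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * tagged / total


# Toy lexicon and tag names are illustrative assumptions, not the paper's tagset.
TOY_LEXICON = {"আমি": "PRONOUN", "তুমি": "PRONOUN", "ভাত": "COMMON_NOUN"}

text = "আমি ভাত খাই। তুমি কি ভাত খাও?"
tokens = [t for s in split_sentences(text) for t in tokenize(s)]
unique_tokens = Counter(tokens)  # unique tokens with their frequencies
tagged = tag_tokens(list(unique_tokens), TOY_LEXICON)
tagged_count = sum(1 for _, tag in tagged if tag is not None)
toy_rate = effectiveness(tagged_count, len(unique_tokens))

# Figures quoted in the abstract.
overall = effectiveness(16_504_118, 18_047_593)
per_corpus = [98.97, 98.35, 89.93, 88.46, 90.01]
unweighted_mean = sum(per_corpus) / len(per_corpus)

print(f"toy corpus: {toy_rate:.2f}%")
print(f"overall from token counts: {overall:.4f}%")
print(f"unweighted mean of corpora: {unweighted_mean:.3f}%")
```

Dividing the quoted token counts gives a value just under 91.45%, which matches the reported 91.44% if the figure was truncated rather than rounded. The unweighted mean of the five per-corpus rates is noticeably higher, which suggests the overall figure is weighted by corpus size. Both readings are inferences from the abstract alone.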

Acknowledgements

The authors thank the Technology Development for Indian Languages (TDIL) Programme of the Ministry of Electronics and Information Technology, Government of India, and the Linguistic Research Unit of the Indian Statistical Institute, Kolkata, for their support. Our heartfelt thanks go to the Society for Natural Language Technology Research (SNLTR) and the Department of Information Technology and Electronics, Government of West Bengal, for their data support. The authors also thank the ABP Group for providing the news dataset, and are very grateful to Shahjalal University of Science and Technology, Bangladesh, for providing a dataset.

Funding

None.

Author information

Subrata Pan
Present address: Jadavpur University, Kolkata, 700032, India

Authors and Affiliations

Jadavpur University, Kolkata, 700032, India
Diganta Saha

Corresponding author

Correspondence to Subrata Pan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pan, S., Saha, D. Performance Evaluation of Part-of-Speech Tagging for Bengali Text. J. Inst. Eng. India Ser. B 103, 577–589 (2022). https://doi.org/10.1007/s40031-021-00630-5

Received: 21 January 2021
Accepted: 27 May 2021
Published: 29 July 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s40031-021-00630-5
