Abstract
This paper proposes an approach to assessing the performance of part-of-speech (POS) tagging for Bengali text. Tagging can be viewed as the process of assigning a syntactic category to each token in a text document. The difficulty lies in choosing the appropriate part-of-speech category for a token. To address this difficulty, we propose a method for part-of-speech tagging on five domain-independent corpora. We then assess tagging performance to verify the efficiency of our system. The system is developed in 10 phases: dataset initialization, sentence boundary determination, tokenization, identification of unique tokens, part-of-speech tagging, retrieval of each token and its tagged part-of-speech class, recording of the retrieved outcomes, query processing, performance evaluation, and rank generation. Five corpora were used in the experiments. The system successfully tagged 98.97%, 98.35%, 89.93%, 88.46% and 90.01% of the tokens in the experimental corpora. It achieved notably strong tagging performance for the Common Noun category compared with other POS categories. The efficiency of the system is shown through a detailed performance appraisal across 18 part-of-speech categories. The system successfully tagged 16,504,118 of the 18,047,593 distinct tokens in the combined corpora, achieving an overall tagging effectiveness of 91.44%, an improvement of about 3.24% over the baseline method.
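The overall effectiveness figure can be checked from the token counts reported in the abstract. The sketch below is illustrative only and is not the authors' implementation: it assumes effectiveness means tagged tokens divided by total distinct tokens, expressed as a percentage, and it assumes rank generation orders corpora by tagging rate. The function names and the corpus labels (`corpus_1` to `corpus_5`) are hypothetical placeholders, since the abstract does not name the corpora.

```python
def tagging_effectiveness(tagged: int, total: int) -> float:
    """Return the percentage of tokens that received a POS tag.

    Assumed definition: 100 * tagged / total.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * tagged / total


def rank_corpora(rates: dict[str, float]) -> list[tuple[str, float]]:
    """Order corpora from highest to lowest tagging rate (assumed ranking rule)."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Token counts as reported in the abstract.
TAGGED_TOKENS = 16_504_118
TOTAL_TOKENS = 18_047_593

# Per-corpus tagging rates in the order listed in the abstract.
# The labels are placeholders, not the corpora's real names.
CORPUS_RATES = {
    "corpus_1": 98.97,
    "corpus_2": 98.35,
    "corpus_3": 89.93,
    "corpus_4": 88.46,
    "corpus_5": 90.01,
}

overall = tagging_effectiveness(TAGGED_TOKENS, TOTAL_TOKENS)
ranking = rank_corpora(CORPUS_RATES)

print(f"overall tagging effectiveness: {overall:.2f}%")
for position, (name, rate) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {rate:.2f}%")
```

Under these assumptions the computed ratio agrees with the reported 91.44% to within rounding. The overall figure is a token-weighted ratio, so it need not equal the plain average of the five per-corpus rates.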
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs40031-021-00630-5/MediaObjects/40031_2021_630_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs40031-021-00630-5/MediaObjects/40031_2021_630_Fig2_HTML.png)
Acknowledgements
The authors are thankful to the Technology Development for Indian Languages (TDIL) Programme of the Ministry of Electronics and Information Technology, Government of India, and to the Linguistic Research Unit of the Indian Statistical Institute, Kolkata, for their support. Our heartfelt thanks go to the Society for Natural Language Technology Research (SNLTR) and the Department of Information Technology and Electronics, Government of West Bengal, for their data support. The authors also thank the ABP Group for providing the news dataset. The authors are very grateful to the Shahjalal University of Science and Technology, Bangladesh, for providing its dataset.
Funding
None.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Pan, S., Saha, D. Performance Evaluation of Part-of-Speech Tagging for Bengali Text. J. Inst. Eng. India Ser. B 103, 577–589 (2022). https://doi.org/10.1007/s40031-021-00630-5
DOI: https://doi.org/10.1007/s40031-021-00630-5