Performance Evaluation of Part-of-Speech Tagging for Bengali Text

  • Original Contribution
Journal of The Institution of Engineers (India): Series B

Abstract

In this paper, we propose an approach for assessing the performance of part-of-speech (POS) tagging of Bengali text. Tagging can be viewed as the process of assigning a syntactic category to each token in a text document. The difficulty lies in choosing the appropriate part-of-speech category for a token. To address this difficulty, we propose a method that performs part-of-speech tagging on five corpora, independent of domain, and we then assess tagging performance to verify the efficiency of the system. The system is developed in 10 phases: dataset initialization, sentence boundary determination, tokenization, identification of unique tokens, part-of-speech tagging, retrieval of each token and its tagged part-of-speech class, recording of the retrieved outcomes, query processing, performance evaluation, and rank generation. Five corpora were used in the experiments. The system successfully tagged 98.97%, 98.35%, 89.93%, 88.46%, and 90.01% of the tokens in the respective experimental corpora. It achieved its best tagging performance for the Common Noun category compared with the other POS categories. The efficiency of the system is shown through a detailed performance appraisal across 18 part-of-speech categories. Overall, the system successfully tagged 16,504,118 of the 18,047,593 distinct tokens in the combined corpora, an overall tagging effectiveness of 91.44%, which represents an improvement of about 3.24% over the baseline method.
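The phases listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the lexicon lookup stands in for the actual tagger, and `TOY_LEXICON`, the tag names, the sample sentence, and all function names are hypothetical. The sketch assumes that sentences end with the Bengali danda (।), that tokens are whitespace-separated, and that effectiveness is the share of distinct tokens that received a tag. Only the two token counts in the last lines come from the abstract.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; not the paper's tagger or tagset.
TOY_LEXICON = {
    "আমি": "PRONOUN",
    "ভাত": "COMMON_NOUN",
    "খাই": "VERB",
    "বই": "COMMON_NOUN",
}


def split_sentences(text):
    """Sentence boundary determination: split on the danda (।) and on ? or !."""
    return [s.strip() for s in re.split(r"[।?!]", text) if s.strip()]


def unique_tokens(text):
    """Whitespace-tokenize every sentence and collect the distinct tokens."""
    return {tok for sent in split_sentences(text) for tok in sent.split()}


def tag_tokens(tokens, lexicon):
    """Look each token up; None marks a token the lexicon could not tag."""
    return {tok: lexicon.get(tok) for tok in tokens}


def tagging_effectiveness(tagged_count, total_count):
    """Percentage of tokens that received a tag."""
    return 100.0 * tagged_count / total_count


def rank_categories(tags):
    """Rank POS categories by the number of distinct tokens assigned to each."""
    return Counter(t for t in tags.values() if t is not None).most_common()


# Toy run on a made-up two-sentence sample.
sample = "আমি ভাত খাই। আমি বই পড়ি।"
tags = tag_tokens(unique_tokens(sample), TOY_LEXICON)
tagged = sum(1 for t in tags.values() if t is not None)
sample_score = tagging_effectiveness(tagged, len(tags))

# Counts reported in the abstract for the combined corpora.
overall_score = tagging_effectiveness(16_504_118, 18_047_593)
```

With the reported counts, `overall_score` comes out just under 91.45, in line with the 91.44% overall effectiveness stated in the abstract. A lexicon lookup is used here only because it keeps the example self-contained; any tagger that returns a category or nothing for a token would slot into `tag_tokens` the same way.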

Acknowledgements

The authors thank the Technology Development for Indian Languages (TDIL) Programme of the Ministry of Electronics and Information Technology, Government of India, and the Linguistic Research Unit of the Indian Statistical Institute, Kolkata, for their support. Our heartfelt thanks go to the Society for Natural Language Technology Research (SNLTR) and the Department of Information Technology and Electronics, Government of West Bengal, for providing data. The authors also thank the ABP Group for the news dataset and are very grateful to the Shahjalal University of Science and Technology, Bangladesh, for providing a dataset.

Funding

None.

Author information

Corresponding author

Correspondence to Subrata Pan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pan, S., Saha, D. Performance Evaluation of Part-of-Speech Tagging for Bengali Text. J. Inst. Eng. India Ser. B 103, 577–589 (2022). https://doi.org/10.1007/s40031-021-00630-5

  • DOI: https://doi.org/10.1007/s40031-021-00630-5
