Abstract
Annotating words at the part-of-speech level, whether manually or by machine, is a difficult task. It is done effectively only when human annotators and computer systems alike are properly trained to identify the morphological properties and syntactic functions of words in a text. In this chapter, we discuss some theoretical aspects and practical issues of part-of-speech (POS) annotation on a written Bengali text corpus. We deliberately leave aside the issues involved in designing and developing an automatic POS annotation tool, since that is not the goal of this chapter. To keep the discussion simple and accessible to readers who are not well-versed in computer applications, we address only some of the primary concerns and challenges involved in POS annotation. Starting with the basic concept of POS annotation, we highlight the underlying differences between POS annotation and morphological processing; define the levels and stages of POS annotation; refer to some of the early works on POS annotation; present a generic scheme for POS annotation; and show how a POS annotated text is utilized in various domains and sub-domains of theoretical, descriptive, applied, computational, and cognitive linguistics. The data and information presented in this chapter are meant primarily for students of less-advanced languages that still lack linguistic resources such as POS annotated texts. The rudimentary ideas presented here may serve as usable inputs for designing linguistic and computational models for POS annotation in these languages.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Dash, N.S. (2021). Part-of-Speech Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_3
DOI: https://doi.org/10.1007/978-981-16-2960-0_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)