Part-of-Speech Annotation

Dash, Niladri Sekhar

doi:10.1007/978-981-16-2960-0_3

Abstract

Annotating words at the part-of-speech level, either manually or by a machine, is a tough task. It is done effectively when human annotators, as well as computer systems, are properly trained so that they can correctly identify morphological properties and syntactic functions of words in a piece of text. We discuss in this chapter some theoretical aspects and practical issues of part-of-speech (POS) annotation on a written Bengali text corpus. We deliberately avoid all those issues and aspects that are required to design and develop an automatic POS annotation tool for a text, since this is not the goal of this chapter. To keep things simple and within the capacity of those readers who are not well-versed in the application of computers, we address here some of the primary concerns and challenges involved in POS annotation. Starting with the basic concept of POS annotation, we highlight the underlying differences between POS annotation and morphological processing; define the levels and stages of POS annotation; refer to some of the early works on POS annotation; present a generic scheme for POS annotation; and show how a POS annotated text is utilized in various domains and sub-domains of theoretical, descriptive, applied, computational, and cognitive linguistics. The data and information presented in this chapter are primarily meant for the students of those less-advanced languages which still lack linguistic resources like POS annotated texts. The rudimentary ideas and information that are presented in this chapter may be treated as valuable and usable inputs for designing linguistic and computational models for POS annotation in these less-advanced languages.

Author information

Authors and Affiliations

Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Dr. Niladri Sekhar Dash

Corresponding author

Correspondence to Niladri Sekhar Dash.

About this chapter

Cite this chapter

Dash, N.S. (2021). Part-of-Speech Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_3

DOI: https://doi.org/10.1007/978-981-16-2960-0_3
Published: 08 July 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)
