Part-of-Speech Annotation

Language Corpora Annotation and Processing

Abstract

Annotating words at the part-of-speech level, whether manually or by machine, is a difficult task. It is done effectively when human annotators and computer systems alike are properly trained to identify the morphological properties and syntactic functions of words in a piece of text. In this chapter, we discuss some theoretical aspects and practical issues of part-of-speech (POS) annotation on a written Bengali text corpus. We deliberately avoid the issues involved in designing and developing an automatic POS annotation tool, since that is not the goal of this chapter. To keep the discussion simple and accessible to readers who are not well-versed in computer applications, we address only some of the primary concerns and challenges of POS annotation. Starting with the basic concept of POS annotation, we highlight the underlying differences between POS annotation and morphological processing; define the levels and stages of POS annotation; refer to some of the early works on POS annotation; present a generic scheme for POS annotation; and show how a POS annotated text is utilized in various domains and sub-domains of theoretical, descriptive, applied, computational, and cognitive linguistics. The data and information presented in this chapter are meant primarily for students of less-resourced languages that still lack linguistic resources such as POS annotated texts. The rudimentary ideas presented here may serve as usable inputs for designing linguistic and computational models for POS annotation in these languages.

Author information

Corresponding author

Correspondence to Niladri Sekhar Dash.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Dash, N.S. (2021). Part-of-Speech Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_3

  • DOI: https://doi.org/10.1007/978-981-16-2960-0_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-2959-4

  • Online ISBN: 978-981-16-2960-0

  • eBook Packages: Education, Education (R0)
