Corpus Text Annotation

Language Corpora Annotation and Processing

Abstract

This chapter addresses some of the basic and preliminary ideas of text annotation and text processing, techniques that are normally applied to corpora of written and spoken texts. Keeping non-trained linguistic scholars and common readers in view, we briefly discuss the basic nature and goal of text annotation, describe its purposes, and refer to its common maxims. A common reader may need these ideas to understand the tools, systems, and techniques of text annotation and processing that are discussed in this book. Next, we report on the different types of text annotation that we apply to written and spoken text corpora. We address these issues keeping in view the theoretical, functional, and referential importance of text annotation and text processing in the analysis and application of natural language data by man and machine in various domains of linguistics and technology. We also draw theoretical distinctions between text annotation and text processing to dispel the confusion faced by both academicians and corporate scholars who use annotated and processed texts as an indispensable resource. We look into the present status of text annotation and processing in both resource-rich and resource-poor languages and propose taking the necessary initiatives to build raw, annotated, and processed corpora for resource-poor languages to address the requirements of language technology, linguistics, and other disciplines. Finally, we discuss how the applicational importance and referential relevance of corpora increase after they are annotated with various kinds of linguistic (and extralinguistic) information and processed at various levels.

Author information

Corresponding author

Correspondence to Niladri Sekhar Dash.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Dash, N.S. (2021). Corpus Text Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_1

  • DOI: https://doi.org/10.1007/978-981-16-2960-0_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-2959-4

  • Online ISBN: 978-981-16-2960-0

  • eBook Packages: Education, Education (R0)
