Corpus Text Annotation

Dash, Niladri Sekhar

doi:10.1007/978-981-16-2960-0_1

Niladri Sekhar Dash

Abstract

This chapter addresses some basic and preliminary ideas of text annotation and text processing, techniques that are normally carried out on a corpus of written and spoken texts. Keeping non-trained linguistic scholars and general readers in view, we briefly discuss the basic nature and goal of text annotation, describe its purposes, and refer to its common maxims. A general reader may need these ideas to understand the tools, systems, and techniques of text annotation and processing that are discussed in this book. Next, we report on the different types of text annotation that we apply to written and spoken text corpora. We address these issues keeping in view the theoretical, functional, and referential importance of text annotation and text processing in the analysis and application of natural language data by man and machine in various domains of linguistics and technology. We also draw theoretical distinctions between text annotation and text processing to dispel the confusion faced by both academicians and corporate scholars who use annotated and processed texts as an indispensable resource. We look into the present status of text annotation and processing in both resource-rich and resource-poor languages and propose taking the necessary initiative to build raw, annotated, and processed corpora for resource-poor languages to address the requirements of language technology, linguistics, and other disciplines. Finally, we discuss how the applicational importance and referential relevance of corpora increase once they are annotated with various kinds of linguistic (and extralinguistic) information and processed at various levels.

Author information

Authors and Affiliations

Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Dr. Niladri Sekhar Dash

Corresponding author

Correspondence to Niladri Sekhar Dash.

About this chapter

Cite this chapter

Dash, N.S. (2021). Corpus Text Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_1

DOI: https://doi.org/10.1007/978-981-16-2960-0_1
Published: 08 July 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)
