Abstract
This chapter addresses some basic, preliminary ideas of text annotation and text processing, techniques that are normally carried out on corpora of written and spoken texts. Keeping non-trained linguistic scholars and general readers in view, we briefly discuss the basic nature and goal of text annotation, describe its purposes, and refer to its common maxims. A general reader may need these ideas to understand the tools, systems, and techniques of text annotation and processing that are discussed in this book. Next, we report on the different types of text annotation that we apply to written and spoken text corpora. We address these issues keeping in view the theoretical, functional, and referential importance of text annotation and text processing in the analysis and application of natural language data by man and machine in various domains of linguistics and technology. We also draw theoretical distinctions between text annotation and text processing to dispel the confusion faced by both academics and corporate scholars who use annotated and processed texts as an indispensable resource. We look into the present status of text annotation and processing in both resource-rich and resource-poor languages and propose initiatives for building raw, annotated, and processed corpora for resource-poor languages to address the requirements of language technology, linguistics, and other disciplines. Finally, we discuss how the applicational importance and referential relevance of corpora increase once they are annotated with various kinds of linguistic (and extralinguistic) information and processed at various levels.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Dash, N.S. (2021). Corpus Text Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_1
DOI: https://doi.org/10.1007/978-981-16-2960-0_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education (R0)