Abstract
This chapter deals with the fundamentals of corpus compilation, approached from a practical perspective. The topics covered follow the key phases of corpus compilation, starting with the initial considerations of representativeness and balance. Next, issues in collecting corpus data are covered, including ethics and metadata. Technical aspects involving formatting and annotation are then presented, followed by suggestions for sharing the corpus with others. Corpus comparison is also discussed, as it merits some reflection when a corpus is created. To further illustrate key concepts and exemplify the varying roles of the corpus in specific research projects, two sample studies are presented. The chapter closes with a brief consideration of future directions in corpus compilation, focusing on the importance of compensating for the inevitable loss of complex information and taking the increasingly multimodal nature of discourse as a case in point.
Notes
- 1.
- 2. Biber (1993:244) defines a sampling frame as “an operational definition of the population, an itemized listing of population members from which a representative sample can be chosen”.
- 3. There are tools that automatically convert pdf files to simple text, such as AntFileConverter http://www.laurenceanthony.net/software/antfileconverter/. Accessed 24 May 2019.
- 4. Fair use is measured through the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion taken, and the effect of the use on the potential market. For more information, see https://fairuse.stanford.edu/overview/fair-use/four-factors/. Accessed 24 May 2019.
- 5. For sample templates, see the forms from the Bavarian Archive for Speech Signals at http://www.phonetik.uni-muenchen.de/Bas/BasTemplateInformedConsent_en.pdf, or Newcastle University at https://www.ncl.ac.uk/media/wwwnclacuk/research/files/Example%20Consent%20Form.pdf. Accessed 29 May 2019.
- 6. See https://baalweb.files.wordpress.com/2016/10/goodpractice_full_2016.pdf. Accessed 24 May 2019.
- 7. See http://www.natcorp.ox.ac.uk/docs/URG/ (Accessed 24 May 2019) and https://web.archive.org/web/20130302203713/http://micase.elicorpora.info/files/0000/0015/MICASE_MANUAL.pdf (Accessed 24 May 2019).
- 8. See https://tei-c.org/. Accessed 24 May 2019.
- 9. See https://creativecommons.org/licenses/. Accessed 24 May 2019.
- 10. See https://ota.ox.ac.uk/ (Accessed 24 May 2019) and https://www.clarin-d.net/en/corpora (Accessed 24 May 2019).
- 11. A similar example is reported on in Chap. 8 on analyzing concordances and involves a study of how foreign doctors are represented in a corpus of British press articles.
References
Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Baker, P., Hardie, A., & McEnery, T. (2006). A glossary of corpus linguistics. Edinburgh: Edinburgh University Press.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Crasborn, O. (2010). What does ‘informed consent’ mean in the internet age? Publishing sign language corpora as open content. Sign Language Studies, 10(2), 276–290.
Douglas, F. M. (2003). The Scottish corpus of texts and speech: Problems of corpus design. Literary and Linguistic Computing, 18(1), 23–37.
Francis, W. N., & Kucera, H. (1964/1979). Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Department of Linguistics, Brown University. http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM. Accessed 24 May 2019.
Hardie, A. (2014). Modest XML for Corpora: Not a standard, but a suggestion. ICAME Journal, 38, 72–103.
Jaworska, S. (2016). A comparative corpus-assisted discourse study of the representations of hosts in promotional tourism discourse. Corpora, 11(1), 83–111. https://doi.org/10.3366/cor.2016.0086.
Johansson, S., Leech, G., & Goodluck, H. (1978). Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Department of English, University of Oslo. http://clu.uni.no/icame/manuals/LOB/INDEX.HTM. Accessed 24 May 2019.
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133.
Leech, G. (1997). Introducing corpus annotation. In R. Garside, G. Leech, & T. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 1–18). London: Longman.
Leech, G. (2007). New resources, or just better old ones? In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web (pp. 134–149). Amsterdam: Rodopi.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.
McEnery, T., & Xiao, R. (2005). Character encoding in corpus construction. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 47–58). Oxford: Oxbow Books.
Partington, A. (2015). Review of Rühlemann (2014) Narrative in English conversation: A corpus analysis of storytelling. ICAME Journal, 39. https://doi.org/10.1515/icame-2015-0011.
Römer, U., & O’Donnell, M. B. (2011). From student hard drive to web corpus (part 1): The design, compilation and genre classification of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora, 6(2), 159–177.
Rühlemann, C., & O’Donnell, M. B. (2012). Introducing a corpus of conversational stories. Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory, 8(2), 313–350. https://doi.org/10.1515/cllt-2012-0015.
Simpson-Vlach, R., & Leicher, S. (2006). The MICASE handbook: A resource for users of the Michigan Corpus of Academic Spoken English. Ann Arbor: University of Michigan Press.
Sinclair, J. (2005). Corpus and text – basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxford: Oxbow Books.
Further Reading
Biber, D. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8(4): 243–257.
Biber’s work is significant not only in having had quite an impact on the field, but also in its attempt to develop empirical methods for evaluating corpus representativeness.
Wynne, M. (Editor). 2005. Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books. http://ota.ox.ac.uk/documents/creating/dlc/
There is a surprising dearth of reference works on corpus compilation. Although this collection of chapters is not recent, it is still worth reading.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Ädel, A. (2020). Corpus Compilation. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_1
DOI: https://doi.org/10.1007/978-3-030-46216-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy; Philosophy and Religion (R0)