Corpus Compilation

Ädel, Annelie

doi:10.1007/978-3-030-46216-1_1

Abstract

This chapter deals with the fundamentals of corpus compilation, approached from a practical perspective. The topics covered follow the key phases of corpus compilation, starting with the initial considerations of representativeness and balance. Next, issues in collecting corpus data are covered, including ethics and metadata. Technical aspects involving formatting and annotation are then presented, followed by suggestions for sharing the corpus with others. Corpus comparison is also discussed, as it merits some reflection when a corpus is created. To further illustrate key concepts and exemplify the varying roles of the corpus in specific research projects, two sample studies are presented. The chapter closes with a brief consideration of future directions in corpus compilation, focusing on the importance of compensating for the inevitable loss of complex information and taking the increasingly multimodal nature of discourse as a case in point.

Notes

1. Samples in the sense ‘text extracts’ are occasionally used in corpora to avoid having one type of text dominate, just because it happens to be long. There are many arguments for using complete texts, however. See e.g. Douglas (2003) and Sinclair (2005).
2. Biber (1993:244) defines a sampling frame as “an operational definition of the population, an itemized listing of population members from which a representative sample can be chosen”.
3. There are tools that automatically convert pdf files to simple text, such as AntFileConverter http://www.laurenceanthony.net/software/antfileconverter/. Accessed 24 May 2019.
4. Fair use is measured through the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion taken, and the effect of the use on the potential market. For more information, see https://fairuse.stanford.edu/overview/fair-use/four-factors/. Accessed 24 May 2019.
5. For sample templates, see the forms from the Bavarian Archive for Speech Signals at http://www.phonetik.uni-muenchen.de/Bas/BasTemplateInformedConsent_en.pdf, or Newcastle University at https://www.ncl.ac.uk/media/wwwnclacuk/research/files/Example%20Consent%20Form.pdf. Accessed 29 May 2019.
6. See https://baalweb.files.wordpress.com/2016/10/goodpractice_full_2016.pdf. Accessed 24 May 2019.
7. See http://www.natcorp.ox.ac.uk/docs/URG/ (Accessed 24 May 2019) and https://web.archive.org/web/20130302203713/http://micase.elicorpora.info/files/0000/0015/MICASE_MANUAL.pdf (Accessed 24 May 2019).
8. See https://tei-c.org/. Accessed 24 May 2019.
9. See https://creativecommons.org/licenses/. Accessed 24 May 2019.
10. See https://ota.ox.ac.uk/ (Accessed 24 May 2019) and https://www.clarin-d.net/en/corpora (Accessed 24 May 2019).
11. A similar example is reported on in Chap. 8 on analyzing concordances and involves a study of how foreign doctors are represented in a corpus of British press articles.
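
To make the idea of a sampling frame in note 2 more concrete, the sketch below treats the frame as an itemized list of population members and draws a random sample from it. Everything specific here is an assumption made for illustration and does not come from the chapter: the text identifiers, the genre labels used as strata, the function name `stratified_sample`, and the choice of taking an equal number of texts per stratum. Other allocation schemes (for example, proportional to stratum size) would be equally plausible.

```python
import random
from collections import defaultdict


def stratified_sample(frame, per_stratum, seed=0):
    """Draw up to `per_stratum` text IDs at random from each stratum.

    `frame` is an itemized listing of (text_id, stratum) pairs, i.e. a
    sampling frame in the sense of note 2. Equal allocation per stratum
    is an illustrative assumption, not a recommendation from the chapter.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group the population members by stratum (here: genre).
    by_stratum = defaultdict(list)
    for text_id, stratum in frame:
        by_stratum[stratum].append(text_id)

    # Sample without replacement inside each stratum.
    sample = {}
    for stratum, ids in sorted(by_stratum.items()):
        k = min(per_stratum, len(ids))  # never ask for more than exists
        sample[stratum] = sorted(rng.sample(ids, k))
    return sample


# Hypothetical sampling frame: every text in the target population,
# listed once with the stratum it belongs to.
frame = [
    ("news_001", "news"), ("news_002", "news"), ("news_003", "news"),
    ("fict_001", "fiction"), ("fict_002", "fiction"),
    ("acad_001", "academic"), ("acad_002", "academic"),
    ("acad_003", "academic"), ("acad_004", "academic"),
]

sample = stratified_sample(frame, per_stratum=2)
for stratum, ids in sample.items():
    print(stratum, ids)
```

Because the frame lists every member explicitly, the selection can be documented and repeated, which is harder when texts are collected opportunistically.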

References

Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Baker, P., Hardie, A., & McEnery, T. (2006). A glossary of corpus linguistics. Edinburgh: Edinburgh University Press.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Crasborn, O. (2010). What does ‘informed consent’ mean in the internet age? Publishing sign language corpora as open content. Sign Language Studies, 10(2), 276–290.
Douglas, F. M. (2003). The Scottish corpus of texts and speech: Problems of corpus design. Literary and Linguistic Computing, 18(1), 23–37.
Francis, W. N., & Kucera, H. (1964/1979). Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Department of Linguistics, Brown University. http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM. Accessed 24 May 2019.
Hardie, A. (2014). Modest XML for Corpora: Not a standard, but a suggestion. ICAME Journal, 38, 72–103.
Jaworska, S. (2016). A comparative corpus-assisted discourse study of the representations of hosts in promotional tourism discourse. Corpora, 11(1), 83–111. https://doi.org/10.3366/cor.2016.0086.
Johansson, S., Leech, G., & Goodluck, H. (1978). Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Department of English, University of Oslo. http://clu.uni.no/icame/manuals/LOB/INDEX.HTM. Accessed 24 May 2019.
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133.
Leech, G. (1997). Introducing corpus annotation. In R. Garside, G. Leech, & T. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 1–18). London: Longman.
Leech, G. (2007). New resources, or just better old ones? In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web (pp. 134–149). Amsterdam: Rodopi.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.
McEnery, T., & Xiao, R. (2005). Character encoding in corpus construction. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 47–58). Oxford: Oxbow Books.
Partington, A. (2015). Review of Rühlemann (2014) Narrative in English conversation: A corpus analysis of storytelling. ICAME Journal, 39. https://doi.org/10.1515/icame-2015-0011.
Römer, U., & O’Donnell, M. B. (2011). From student hard drive to web corpus (part 1): The design, compilation and genre classification of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora, 6(2), 159–177.
Rühlemann, C., & O’Donnell, M. B. (2012). Introducing a corpus of conversational stories. Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory, 8(2), 313–350. https://doi.org/10.1515/cllt-2012-0015.
Simpson-Vlach, R., & Leicher, S. (2006). The MICASE handbook: A resource for users of the Michigan Corpus of Academic Spoken English. Ann Arbor: University of Michigan Press.
Sinclair, J. (2005). Corpus and text – basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxford: Oxbow Books.

Author information

Authors and Affiliations

Dalarna University, Falun, Sweden
Annelie Ädel

Corresponding author

Correspondence to Annelie Ädel.

Editor information

Editors and Affiliations

FNRS Centre for English Corpus Linguistics, Language and Communication Institute, UCLouvain, Louvain-la-Neuve, Belgium
Magali Paquot
Department of Linguistics, University of California, Santa Barbara, CA, USA
Stefan Th. Gries

About this chapter

Cite this chapter

Ädel, A. (2020). Corpus Compilation. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_1

DOI: https://doi.org/10.1007/978-3-030-46216-1_1
Published: 05 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)
