Abstract
This chapter explores the increasingly important role of the web in corpus linguistic research. It describes the two main approaches adopted in the field, which have been termed ‘web as corpus’ and ‘web for corpus’. The former approach attempts to extract linguistic examples directly from the web using standard search engines like Google or other more specialist tools, while the latter uses the web as a source of texts for the building of off-line corpora. The chapter examines the pitfalls of the entry-level ‘web as corpus’ approach before going on to describe in detail the steps involved in using the ‘web for corpus’ approach to build bespoke corpora by downloading data from the web. Through a series of examples from leading research in the field, the chapter examines the significant new methodological challenges the web presents for linguistic study. The overall aim is to outline ways in which these challenges can be overcome through careful selection of data and use of appropriate software tools.
Notes
1. http://www.internetworldstats.com/stats.htm. Accessed 22 May 2019.
2. http://corpus.byu.edu/bnc/. Accessed 22 May 2019.
3. http://www.sigwac.org.uk/. Accessed 22 May 2019.
4. See the website by the same authors for latest estimates: http://www.worldwidewebsize.com/. Accessed 22 May 2019.
5. https://search.googleblog.com/2013/03/billions-of-times-day-in-blink-of-eye.html. Accessed 22 May 2019. This refers to the number of pages the Google software was aware of at that time, not the number of pages actually held in its index.
6. https://searchengineland.com/googles-search-indexes-hits-130-trillion-pages-documents-263378. Accessed 22 May 2019. This information has since been removed from the Google website and updated figures are no longer provided.
7. http://archive.org/. Accessed 22 May 2019.
8. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679. Accessed 22 May 2019.
References
Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC, 2004, 1313–1316.
Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources & Evaluation, 43, 209–226.
Bergh, G. (2005). Min(d)ing English language data on the web: What can Google tell us? ICAME Journal, 29, 25–46.
Bergh, G., Seppänen, A., & Trotta, J. (1998). Language corpora and the internet: A joint linguistic resource. In A. Renouf (Ed.), Explorations in corpus linguistics (pp. 41–54). Amsterdam: Rodopi.
Bergman, M. K. (2001). The deep web: Surfacing hidden value. Journal of Electronic Publishing, 7(1).
Bernardini, S., Baroni, M., & Evert, S. (2006). A WaCky introduction. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus (pp. 9–40). Bologna: GEDIT. http://wackybook.sslmit.unibo.it/pdfs/bernardini.pdf. Accessed 21 May 2019.
Biber, D., & Egbert, J. (2016). Register variation on the searchable web: A multi-dimensional analysis. Journal of English Linguistics, 44(2), 95–137.
Biber, D., & Egbert, J. (2018). Register variation online. Cambridge: Cambridge University Press.
Biber, D., Egbert, J., & Davies, M. (2015). Exploring the composition of the searchable web: A corpus-based taxonomy of web registers. Corpora, 10(1), 11–45.
Brekke, M. (2000). From BNC to the cybercorpus: A quantum leap into chaos? In J. Kirk (Ed.), Corpora Galore (pp. 227–247). Amsterdam: Rodopi.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), 107–117.
Cavaglia, G., & Kilgarriff, A. (2001). Corpora from the web (Information Technology Research Institute Technical Report Series (ITRI-01-06)). University of Brighton. https://www.kilgarriff.co.uk/Publications/2001-CavagliaKilg-CLUK.pdf. Accessed 21 May 2019.
Ciaramita, M., & Baroni, M. (2006). Measuring web-corpus randomness. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus (pp. 127–158). Bologna: GEDIT. http://wackybook.sslmit.unibo.it/pdfs/ciaramita.pdf. Accessed 21 May 2019.
Davies, M., & Fuchs, R. (2015). Expanding horizons in the study of world Englishes with the 1.9 billion word global web-based English Corpus. English World-Wide, 36(1), 1–28.
de Schryver, G. (2002). Web for/as corpus: A perspective for the African languages. Nordic Journal of African Studies, 11(2), 266–282.
Fletcher, W. H. (2004). Making the web more useful as a source for linguistic corpora. In U. Connor & T. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 191–205). Amsterdam: Rodopi.
Gatto, M. (2014). Web as corpus: Theory and practice. London: Bloomsbury.
Giesbrecht, E., & Evert, S. (2009). Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German web as corpus. In Proceedings of the 5th Web as Corpus Workshop (WAC5). San Sebastian, Spain.
Huang, Y., Guo, D., Kasakoff, A., & Grieve, J. (2016). Understanding U.S. regional linguistic variation with Twitter data analysis. Computers Environment and Urban Systems, 59, 244–255.
Hüning, M. (2001). WebCONC. http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi (no longer accessible). Accessed 21 May 2019.
Ide, N., Reppen, R., & Suderman, K. (2002). The American National Corpus: More than the web can provide. In Proceedings of the 3rd language resources and evaluation conference (LREC) (pp. 839–844). Paris: ELRA.
Kehoe, A. (2006). Diachronic linguistic analysis on the web using WebCorp. In A. Renouf & A. Kehoe (Eds.), The changing face of corpus linguistics (pp. 297–307). Amsterdam: Rodopi.
Kehoe, A., & Gee, M. (2007). New corpora from the web: Making web text more “text-like”. Towards Multimedia in Corpus Studies. Helsinki: VARIENG. http://www.helsinki.fi/varieng/series/volumes/02/kehoe_gee/. Accessed 21 May 2019.
Kehoe, A., & Gee, M. (2012). Reader comments as an aboutness indicator in online texts: Introducing the Birmingham blog corpus. Aspects of corpus linguistics: Compilation, annotation, analysis. Helsinki: VARIENG. http://www.helsinki.fi/varieng/journal/volumes/12/kehoe_gee/. Accessed 21 May 2019.
Kehoe, A., & Renouf, A. (2002). WebCorp: Applying the web to linguistics and linguistics to the web. In Proceedings of the 11th international World Wide Web conference. http://web.archive.org/web/20141206025600/http://www2002.org/CDROM/poster/67/. Accessed 21 May 2019.
Keller, F., & Lapata, M. (2003). Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3), 459–484.
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333–347.
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010). A corpus factory for many languages. http://www.sketchengine.co.uk/wp-content/uploads/2015/05/Corpus_Factory_2010.pdf. Accessed 21 May 2019.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
Leech, G. (2007). New resources, or just better old ones? The holy grail of representativeness. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web (pp. 133–149). Amsterdam: Rodopi.
Lutzky, U., & Kehoe, A. (2017). “I apologise for my poor blogging”: Searching for apologies in the Birmingham blog corpus. Corpus Pragmatics, 1(1), 37–56.
Mair, C. (2012). From opportunistic to systematic use of the web as corpus: Do-support with got (to) in contemporary American English. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford handbook of the history of English (pp. 245–255). Oxford: Oxford University Press.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182.
Page, R. (2014). Saying “sorry”: Corporate apologies posted on Twitter. Journal of Pragmatics, 62, 30–45.
Petrović, S., Osborne, M., & Lavrenko, V. (2010). The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT 2010 workshop on computational linguistics in a world of social media (pp. 25–26).
Pomikálek, J., Rychlý, P., & Kilgarriff, A. (2009). Scaling to billion-plus word corpora. Advances in Computational Linguistics: Special Issue of Research in Computing Science, 41, 3–14.
Rayson, P., Charles, O., & Auty, I. (2012). Can Google count? Estimating search engine result consistency. In Proceedings of the seventh Web as Corpus workshop (WAC7) (pp. 23–30). http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf. Accessed 21 May 2019.
Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Rundell, M. (2009). Genius and rubbish and other noun-like adjectives. MacMillan Dictionary Blog. http://www.macmillandictionaryblog.com/noun-like-adjectives. Accessed 21 May 2019.
San Vicente, I., & Manterola, I. (2012). PaCo2: A fully automated tool for gathering parallel corpora from the web. In Proceedings of the eighth international conference on Language Resources and Evaluation (LREC12). http://aclanthology.info/papers/L12-1085/l12-1085. Accessed 21 May 2019.
Schäfer, R. (2016). On bias-free crawling and representative web corpora. In Proceedings of the 10th Web as Corpus workshop (WAC-X) and the EmpiriST shared task (pp. 99–105).
Schäfer, R., & Bildhauer, F. (2012). Building large corpora from the web using a new efficient tool chain. In Proceedings of the eighth international conference on Language Resources and Evaluation (LREC) (pp. 486–493). Istanbul: ELRA.
Schäfer, R., & Bildhauer, F. (2013). Web corpus construction (Vol. 6, pp. 1–145). San Rafael: Morgan & Claypool.
Schmied, J. (2006). New ways of analysing ESL on the WWW with WebCorp and WebPhraseCount. In A. Renouf & A. Kehoe (Eds.), The changing face of corpus linguistics (pp. 309–324). Amsterdam: Rodopi.
Sharoff, S. (2006a). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus (pp. 63–98). Bologna: GEDIT. http://wackybook.sslmit.unibo.it/pdfs/sharoff.pdf. Accessed 21 May 2019.
Sharoff, S. (2006b). Open-source corpora. Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), 435–462.
Sinclair, J. (2005). Corpus and text – Basic principles, and appendix: How to build a corpus. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice. Oxford: Oxbow Books. http://ota.ox.ac.uk/documents/creating/dlc/. Accessed 21 May 2019.
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus workshop (WAC7) (pp. 39–43). http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf. Accessed 21 May 2019.
van den Bosch, A., Bogers, T., & de Kunder, M. (2016). Estimating search engine index size variability: A 9-year longitudinal study. Scientometrics, 107, 839–856.
Further Reading
- Schäfer, R., and Bildhauer, F. 2013. Web Corpus Construction. San Rafael: Morgan & Claypool.
  This book offers a fuller discussion of the technical issues involved in web crawling for linguistic purposes. It includes chapters on web structure, seed word selection and crawling, post-processing, annotation, and corpus evaluation.
- Biber, D., and Egbert, J. 2018. Register Variation Online. Cambridge: CUP.
  Building on the work by Biber et al. (2015) discussed in Representative Study 2, this volume examines the full range of registers found on the searchable web. It explores overall patterns of register variation with a multidimensional analysis and discusses the main lexical, grammatical and situational features of each register, offering important new insights on the language of the web.
- Hundt, M., Nesselhauf, N., and Biewer, C. 2007. Corpus Linguistics and the Web. Amsterdam: Rodopi.
This was the first book-length publication to bring together key perspectives in web corpus research. It includes the chapter on representativeness and balance cited in Sect. 15.2.2 as well as chapters on both the new possibilities and new challenges presented by the web as/for corpus approaches.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kehoe, A. (2020). Web Corpora. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_15
DOI: https://doi.org/10.1007/978-3-030-46216-1_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)