Web Corpora

Kehoe, Andrew

doi:10.1007/978-3-030-46216-1_15

Abstract

This chapter explores the increasingly important role of the web in corpus linguistic research. It describes the two main approaches adopted in the field, which have been termed ‘web as corpus’ and ‘web for corpus’. The former approach attempts to extract linguistic examples directly from the web using standard search engines like Google or other more specialist tools, while the latter uses the web as a source of texts for the building of off-line corpora. The chapter examines the pitfalls of the entry-level ‘web as corpus’ approach before going on to describe in detail the steps involved in using the ‘web for corpus’ approach to build bespoke corpora by downloading data from the web. Through a series of examples from leading research in the field, the chapter examines the significant new methodological challenges the web presents for linguistic study. The overall aim is to outline ways in which these challenges can be overcome through careful selection of data and use of appropriate software tools.

Notes

1. http://www.internetworldstats.com/stats.htm. Accessed 22 May 2019.
2. http://corpus.byu.edu/bnc/. Accessed 22 May 2019.
3. http://www.sigwac.org.uk/. Accessed 22 May 2019.
4. See the website by the same authors for latest estimates: http://www.worldwidewebsize.com/. Accessed 22 May 2019.
5. https://search.googleblog.com/2013/03/billions-of-times-day-in-blink-of-eye.html. Accessed 22 May 2019. This refers to the number of pages the Google software was aware of at that time, not the number of pages actually held in its index.
6. https://searchengineland.com/googles-search-indexes-hits-130-trillion-pages-documents-263378. Accessed 22 May 2019. This information has since been removed from the Google website and updated figures are no longer provided.
7. http://archive.org/. Accessed 22 May 2019.
8. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679. Accessed 22 May 2019.

Author information

Authors and Affiliations

Birmingham City University, Birmingham, UK
Andrew Kehoe

Corresponding author

Correspondence to Andrew Kehoe.

Editor information

Editors and Affiliations

FNRS Centre for English Corpus Linguistics, Language and Communication Institute, UCLouvain, Louvain-la-Neuve, Belgium
Magali Paquot
Department of Linguistics, University of California, Santa Barbara, CA, USA
Stefan Th. Gries

About this chapter

Cite this chapter

Kehoe, A. (2020). Web Corpora. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_15

DOI: https://doi.org/10.1007/978-3-030-46216-1_15
Published: 05 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy; Philosophy and Religion (R0)
