Abstract
Web archiving frameworks are commonly assessed by the quality of their archival records and by their ability to operate at scale. The ubiquity of dynamic web content poses a significant challenge for crawler-based solutions, such as the Internet Archive, that are optimized for scale. Human-driven services such as the Webrecorder tool provide high-quality archival captures but are not optimized to operate at scale. We introduce the Memento Tracer framework, which aims to balance archival quality and scalability. We outline its concept and architecture and evaluate its archival quality and operation at scale. Our findings indicate that its quality is on par with or better than that of established archiving frameworks and that operation at scale comes with a manageable overhead.
Notes
- 6. Selenium WebDriver: https://www.seleniumhq.org/.
- 7. Headless Chrome: https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md.
- 8. WarcProxy: https://github.com/internetarchive/warcprox.
- 12. A screencast of the Memento Tracer Chrome extension and the interactions with a GitHub repository recorded into a trace is available at: https://doi.org/10.6084/m9.figshare.8049839.v1.
- 13. The trace is available at: https://doi.org/10.6084/m9.figshare.8024612.
- 14. The trace is available at: https://doi.org/10.6084/m9.figshare.8024615.
Acknowledgement
This work is supported in part by The Andrew W. Mellon Foundation grant 11600663.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Klein, M., Shankar, H., Balakireva, L., Van de Sompel, H. (2019). The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_15
DOI: https://doi.org/10.1007/978-3-030-30760-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30759-2
Online ISBN: 978-3-030-30760-8
eBook Packages: Computer Science, Computer Science (R0)