Introduction

In recent years, archives and other memory institutions have collected increasing volumes of born-digital material. This new material has disrupted not only the way that memory institutions acquire and preserve their collections, but also how those collections are accessed. This is reflected in new areas of study such as ‘digital humanities’ and ‘computational archival science’ (Goudarouli 2018), in the emergence of networks specifically dedicated to digital material (Netwerk Digitaal Erfgoed 2019; AURA Network 2021), and in the digital strategies published by memory institutions and forthcoming projects for their collections and materials (The British Library 2017; The National Archives 2017).

In line with the recent ‘Collections as Data’ (Padilla et al. 2018) project, this article aims to explore the UK Government Web Archive (UKGWA) using computational methods. The UKGWA has been selected for this work as it is one of the few large born-digital collections currently available without any major restrictions. UKGWA has already been used as an example dataset for workshops and study groups (Storrar and Talboom 2019; Beavan et al. 2021), although these used subsets of the collection and no exploration of the collection in full has been conducted. To provide demonstrations of the article’s discussion points, the authors have created two Python Notebooks for the reader to view or interact with. The Notebooks showcase available data and possibilities beyond keyword searching, and highlight some of the issues encountered when accessing UKGWA (The National Archives 2021k) computationally. They are aids to discussion rather than demonstrations of novel computational techniques and can be accessed by following the links from the GitHub repository (Bell 2021).

After introducing these case studies, the article will discuss the drawbacks and possibilities of this new approach to archival search for web archives, and the feasibility of using Notebooks in terms of the skills and resources needed to implement and use them.

Accessing web archives as data

The born-digital material that is the focus of this article is archived web material. In the last two decades, use of the web has increased exponentially, and a huge amount of important information is now found online. Rapidly evolving technologies and the relatively short life spans of many websites have led memory institutions to archive web materials urgently, with the emphasis on capturing them before they disappear, ahead of integrating them with more traditional collections. This web material is dynamic and diverse, ranging from published scholarly articles to government campaigns (International Internet Preservation Consortium 2021b).

Within the research community, there is a growing interest in archival web material. The Internet Archive was founded in 1996 and UKGWA holds archived websites from the same year, making web archives a primary source for historians of the late twentieth and early twenty-first centuries. A number of events have focused specifically on web archives, including Web Archiving Week (International Internet Preservation Consortium 2017), hosted by the British Library and the School of Advanced Study, University of London; the Web Archiving Conference (International Internet Preservation Consortium 2021a); the Research Infrastructure for the Study of Archived Web material (RESAW) conferences (RESAW 2021); and the Engaging with Web Archives conference (Engaging with Web Archives 2021).

Looking at the programmes of these events in detail, most of the focus is on fostering collaborations, discussing preservation and capture strategies, and talking through individual research projects that have used archival web material. Only a small number of talks address access to this material and the computational methods that might enhance it.

In practice, a growing number of institutions are preserving this material, and the amount of archived web material accessible online is therefore also growing. Not all of this material can be accessed online, mainly due to (predominantly copyright) restrictions. For example, both the British Library and the Dutch National Library only provide access to parts of their collections from their reading rooms (British Library 2021b; Koninklijke Bibliotheek (National Library of the Netherlands) 2021). Open material is generally accessed through a keyword search interface or an online catalogue; examples include the Wayback Machine of the Internet Archive (Internet Archive 2021a) and the search interfaces of the UK Web Archive (UK Web Archive 2021) and UKGWA (The National Archives 2021e). These online catalogues and their simple keyword search systems have received a growing number of critiques in the last decade.

Firstly, users are not used to navigating online catalogues. A survey conducted in the Netherlands on how academics search for digital material summed up the prevalent search method with the phrase ‘Just Google It’. Google was used to find documents within a memory institution because the online catalogue was found to be too difficult to navigate (Kemman et al. 2012). This is not surprising: as Gollins and Bayne (2015) point out, the original catalogues held by archives, which have now been made into online versions, were made for archivists by archivists. A catalogue is difficult to navigate for a user who is more familiar with a Google or Wikipedia search (Gollins and Bayne 2015, 131–33). Memory institutions are aware of this problem and know that a large number of users arrive through Google. The Europeana portal indexed its material and saw a 400 per cent increase in pages visited (Nicholas and Clark 2015, 28).

Secondly, the simple keyword search, the most popular approach for online catalogues and web archive interfaces, has received a lot of criticism. As Gilliland (2016) and Winters and Prescott (2019) point out, keyword search may not be enough to actively engage with digital archival material (Gilliland 2016; Winters and Prescott 2019). Putnam (2016) and Romein et al. (2020) also warn about the pitfalls of using search engines, as they are not transparent and can be biased, influencing the research that is conducted (Putnam 2016; Romein et al. 2020). Furthermore, users themselves are changing and appear to want search options that go beyond keywords. Milligan (2019) examined the research methods of historians and found that querying databases and parsing data, to name just two examples, are slowly becoming more common research methods for historians (Milligan 2019).

Thirdly, the material itself is very different from traditional archival collections. Given the sheer volume of digital, and specifically born-digital, material, it is impossible to index everything to improve searchability. There have been successful attempts to automate part of this process (Corrado 2019; Saleh 2018), but there is also a shift in thinking towards the benefits of this material being digital. This shift started around the 2010s, when digitised material went from being classed as digital surrogates to being seen as enriched and useful data in itself (Nicholson 2013). These collections can be processed by computers to garner new insights from the material, and the number of projects focusing on, and advocating for, seeing this material as data is growing, including ‘Collections as Data’ (Padilla et al. 2018), ‘Digging Into Data’ (Digging into Data Challenge and Trans-Atlantic Platform 2019), and ‘Plugged in and Powered Up’ (The National Archives 2019).

Projects within the web archiving community have also actively explored ideas beyond keyword search. The Web Archive Retrieval Tools (WebART) project ran from 2012 to 2016 and aimed to critically assess the value of web archives, developing a number of tools to access web material. Its focus was on realistic research scenarios that would be conducted by humanities researchers. The project produced a number of papers and conference contributions and a tool named WebARTist, a web archive access system, which is sadly, and perhaps ironically, no longer accessible (WebART 2016).

The Alexandria project also developed tools, aimed at developers, to make it easier to explore and analyse web archives. Its focus was the semantics of the web archive, looking at ways of exploring web archival material without indexing it. The project developed an entity-based search called ArchiveSearch, which makes it possible to use most concepts from Wikipedia as search terms (Kanhabua et al. 2016).

As part of the Big UK Domain Data for Arts and Humanities (BUDDAH) project, the SHINE tool, a prototype historical search engine, incorporates a trend analysis option alongside a more traditional search function (British Library 2021a). Archives Unleashed uses the tools of Big Data to enable large-scale analysis and has teamed up with Archive-It to become more sustainable. Their experimental tools have surfaced in multiple ways (Ruest et al.

Fig. 2: Summaries of Web Archive records by Department, Taxonomy Category, Administrative History common phrases, A–Z common phrases

There is a summary of common phrases from the catalogue description, but with top results such as “series contains dated gathered versions (or 'snapshots')” it is clearly not of use for understanding the archive in aggregate. There is also an obvious difference in the counts between tables, even where the subjects appear to align. The GUK department (352) has more records than UKGWA’s gov.uk blog (146) and gov.uk (92) entries combined (238) because not every site listed with GUK is explicitly identified as gov.uk in the A–Z index. The taxonomy categories have a different purpose, defining records by subject matter rather than department or internet domain: almost 4000 sites are defined as ‘Official publications’. The Department of Health (JA) and the National Health Service feature at, or near, the top of three of the tables, but ‘department health’ is much lower down in the administrative history table. Other topics such as Children, East Midlands, and COVID-19 only appear once. The latter’s prevalence in the UKGWA list is a result of the Archive’s effort to capture, and make available, the government’s responses to the pandemic.

Discovery includes two dates, record start and end, which can be summarised, but the end date is less informative, mostly defaulting to 2100. Dates are extracted from UKGWA using a public-facing API known as CDX (Alam et al. 2016), an index format created by the Internet Archive’s open-source Heritrix web crawler (Internet Archive 2021b). The CDX API can be used to extract the full archiving history of a URL captured in the archive. A request to the API for a URL returns one row for each capture of the page, in various formats. There are 11 fields per row, of which the Notebook uses three: timestamp, status code, and digest. The timestamp is used to analyse when and how often a page has been archived. While it is possible through the search interface to filter by date, it is this API which enables the answering of questions such as ‘which sites were actively archived during the period…?’ or ‘how does frequency of capture vary across the archive, and over time?’.
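As an illustration, the following minimal sketch queries a CDX endpoint for a single URL and extracts the timestamp, status code, and digest fields. The endpoint path and field positions are assumptions based on the standard 11-column CDX format rather than a transcription of the Notebook's code; the Notebook itself should be consulted for the exact parameters used.

```python
import requests

# Assumed UKGWA CDX endpoint; the exact path may differ from the one used in the Notebook.
CDX_ENDPOINT = "https://webarchive.nationalarchives.gov.uk/ukgwa/cdx"

def capture_history(url):
    """Return (timestamp, status, digest) tuples for every capture of a URL."""
    response = requests.get(CDX_ENDPOINT, params={"url": url})
    response.raise_for_status()
    rows = []
    for line in response.text.splitlines():
        fields = line.split(" ")
        if len(fields) < 6:
            continue  # skip malformed or empty rows
        # Standard CDX column order: urlkey, timestamp, original, mimetype,
        # statuscode, digest, ... (11 fields in total)
        rows.append((fields[1], fields[4], fields[5]))
    return rows

if __name__ == "__main__":
    for timestamp, status, digest in capture_history("http://www.salt.gov.uk/")[:5]:
        print(timestamp, status, digest)
```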

The Notebook loads a pre-prepared file (to save time) of CDX data for all of the home pages in the UKGWA (sourced from the A–Z list) and counts pages by year of first snapshot (Fig. 3b). The equivalent summary of Discovery data, using the start date, is shown in Fig. 3a. While the charts are similar, there are some noticeable differences. The Discovery data have higher volumes pre-2005 because the catalogue date reflects the earliest document within a record, and some collections include web material which is held as a standard born-digital record rather than in the web archive. The earliest snapshot of the Statistics Commission website is from January 2004 (The National Archives 2021i) and is catalogued under reference SF 1 (The National Archives 2021g). However, SF 2 (The National Archives 2021g) and SF 3 (The National Archives 2021h) contain documents published on the first and second websites of the Commission, the earliest from 2000. At the other end of the graph, there are no records for 2021 in the Discovery data. This highlights the fact that catalogues are manually maintained and entries are added according to workload and priorities; the web archive is competing with 1000 years of documents. The CDX data, on the other hand, are dynamically updated and reflect live archiving activities.

Fig. 3: a Discovery entries by start date. b A–Z entries by first snapshot
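A summary of this kind can be produced from the CDX extract with a few lines of pandas. The file name and column names below are assumptions made for illustration; the extract accompanying Notebook 1 defines the actual format.

```python
import pandas as pd

# Assumed extract format: one row per capture with 'url' and 'timestamp'
# (14-digit CDX timestamp) columns; the real file may use different names.
cdx = pd.read_csv("ukgwa_homepage_cdx.csv", dtype={"timestamp": str})

# Year of the earliest snapshot for each home page.
first_snapshot = (
    cdx.groupby("url")["timestamp"].min()
       .str[:4].astype(int)
       .rename("first_year")
)

# Count of home pages by year of first snapshot (cf. Fig. 3b).
print(first_snapshot.value_counts().sort_index())
```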

CDX can be used to analyse changes in a resource by combining other fields from the API. The ‘status’ is a standard HTTP response code used to identify whether the retrieval of a web resource has been successful or not. Opening a website in a browser generally triggers a response code of 200, invisibly to the user, who will be more familiar with the jarring response of a 404 (Page not found) code. This code can identify failures in the archiving process, but it is most useful for identifying redirections (301 or 302), which most commonly occur when a site migrates to a new domain. The CDX does not identify the target of a redirection, but the status becomes interesting when combined with the digest.

The digest (or checksum) is the output of a cryptographic hashing function which can be used to identify whether any changes have occurred from one snapshot to another. The key word here is ‘any’: nothing in the checksum enables the nature or extent of a change to be identified or quantified. Web documents are a combination of layout, navigational, informational (banners and pop-ups), and content elements. Navigation menus and banners in particular are generally shared across all pages within a site. A change to the wording of the ‘Cookies on gov.uk’ pop-up message on www.gov.uk will register as a change in the checksum for every page on the domain, even though nothing has fundamentally changed on those pages. In the case of a redirection, the checksum is that of the redirection message sent to the browser, not of the landing page, meaning a change in the redirected-to page is not reflected in the checksum. This is illustrated by Fig. 4 for the home page of the Environment Agency (Environment Agency 2014).
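Building on the capture_history sketch above, the following illustrative snippet flags, for each snapshot, whether the digest differs from the previous one and whether the status indicates a redirect; this is the kind of data underlying Fig. 4. It assumes the same 11-column CDX format as before.

```python
def change_log(url):
    """Yield (timestamp, status, is_redirect, changed) for successive captures of a URL."""
    # capture_history() is the helper defined in the earlier CDX sketch.
    previous_digest = None
    for timestamp, status, digest in capture_history(url):
        # A redirect (301/302) hashes the redirect message, not the landing page,
        # so changes behind a redirect are invisible to the digest.
        is_redirect = status in ("301", "302")
        changed = previous_digest is not None and digest != previous_digest
        yield timestamp, status, is_redirect, changed
        previous_digest = digest

if __name__ == "__main__":
    for row in change_log("http://www.environment-agency.gov.uk/"):
        print(row)
```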

Fig. 4: Frequency of changes in the Environment Agency’s homepage, 1996–2021

The top row of black points indicates snapshots which are unchanged since the previous snapshot, while the bottom row (pink) indicates where changes have occurred. The plot shows how the frequency of capture increased from 2008, and the dense overlap of black and pink dots between 2008 and 2014 suggests an active page with captures frequent enough to record a high percentage of updates. After 2014, there are no further changes, but the status code in the CDX data shows that the page was redirected from 4 May 2014 onwards (Environment Agency 2014). Despite the digest being a blunt tool for identifying change, it is useful for differentiating between a static and an active site.

Notebook 1 ends with a visualisation (a Sankey diagram) demonstrating the linking of data across the two catalogues, shown in Fig. 5. On the left are three administrative history phrases (development agency, regional development, East Midlands) which connect to department codes in the centre. The codes in turn connect to common phrases extracted from A–Z descriptions for websites from those departments. If there is no common phrase in a description, it defaults to ‘other’. The diagram is animated in a correctly rendered Notebook and provides an experimental view that would not be possible without bringing together these two data sources.

Fig. 5: Sankey diagram showing Administrative History terms linked to A–Z terms via department
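A diagram of this kind can be drawn with Plotly's Sankey trace, as in the hedged sketch below; the department codes and link weights are invented for illustration and do not reproduce the Notebook's actual data.

```python
import plotly.graph_objects as go

# Illustrative nodes: administrative history phrases (left), hypothetical
# department codes (centre), and common A-Z phrases (right).
labels = ["development agency", "regional development", "East Midlands",
          "DEPT-1", "DEPT-2",
          "business support", "other"]

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20, thickness=15),
    link=dict(
        source=[0, 1, 2, 3, 3, 4],   # indices into labels
        target=[3, 3, 4, 5, 6, 6],
        value=[5, 3, 4, 6, 2, 4],    # invented link weights
    ),
))
fig.show()
```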

The data used to produce these summaries are a mixture of metadata created for the purpose of filtering a catalogue (e.g. taxonomy codes), metadata generated by the archival process (CDX), and unstructured text whose purpose is to provide information to a human reader (administrative histories and the A–Z index). In the case of the A–Z, there is some structure, with many entries consisting of two parts separated by ‘–’, generally a department followed by a more granular classification (e.g. GOV.UK Blog). Metadata extracted through an API is the ideal source for computational analysis, but in both cases there is scope to improve the APIs. There is no UKGWA taxonomy category, or other indicator, within Discovery, so a keyword search has to be used to find UKGWA records. The CDX API has a lot of functionality (not discussed here), but the URL is a mandatory parameter, so it is unsuitable for speculative querying. Rather than requiring users to scrape the data, the A–Z index could also be provided through an API.

In theory, Discovery could provide this index, but the URLs are not surfaced through the API, and a manually maintained spreadsheet was needed to provide the link between the two systems. Following the call by Moss et al. (2018) to ‘Reconfigure the Archive as Data to be Mined’ (Moss, Thomas, and Gollins 2018), work is in progress, through Project Omega (The National Archives 2021d), to re-engineer the catalogue as linked data, which will improve the connections between the main archive and UKGWA. It will be worthwhile periodically revisiting these Notebooks as that project progresses.

This Notebook has introduced two ways to access archival catalogue information about the UKGWA: through the archive’s APIs, and by scraping information from the web archive’s index. In doing so, it demonstrated the potential of bringing together multiple sources of data, but also the challenges of combining information which has not necessarily been designed with this purpose in mind.

Notebook 2: Crawling the Government Web Archive

The first Notebook focused on available metadata and crawled only the A–Z page of websites. To attempt any further analysis of the content or network structure of the UKGWA, crawling of websites is necessary. Python libraries designed for downloading and processing HTML files from the web are available, and in this case the popular BeautifulSoup library has been used (Richardson 2020). Combined with standard functionality for retrieving a web page programmatically (i.e. not in a browser), the library can be used to extract text, hyperlinks, and other components of a page. The Notebook is not a tutorial on web scraping but highlights some considerations when extracting information from UKGWA websites.
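As a minimal illustration of this kind of crawling (using the standard library's urllib alongside BeautifulSoup, an assumption rather than a record of the Notebook's exact code), the sketch below retrieves one archived page and lists the hyperlinks it contains. The archive URL prefix is illustrative.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# An archived snapshot URL; if this exact snapshot does not exist, the archive
# resolves the request to the closest capture (see discussion below).
PAGE = "https://webarchive.nationalarchives.gov.uk/ukgwa/20141226000648/http://www.gov.uk/"

request = Request(PAGE, headers={"User-Agent": "research-crawler (contact: you@example.org)"})
with urlopen(request) as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# Extract the visible text and every hyperlink target on the page.
text = soup.get_text(separator=" ", strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]

print(f"{len(links)} links found")
print(links[:10])
```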

One feature which differentiates scraping a web archive from scraping the live web is that links may point to other archived pages, or to pages outside of the archive’s scope. The formats encountered during the creation of this Notebook are shown in Table 1.

Table 1 URL formats encountered in hyperlinks extracted from the UKGWA

Standardisation is a straightforward task, but it is important that the user is aware of these multiple formats. In the absence of a timestamp, it is necessary to inherit the timestamp of the page containing the link, and a relative link needs to be prefixed with the URL of the parent page. Other links such as ‘mailto’ or shortcuts to sections of a page (prefixed with a #) also appear, but they apply to web pages in general and are not specific to the web archive.
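A hedged sketch of this standardisation step is shown below. It does not reproduce the exact formats in Table 1; it simply illustrates inheriting the parent page's timestamp and resolving relative links, assuming UKGWA snapshot URLs of the form archive prefix, then 14-digit timestamp, then the original URL.

```python
import re
from urllib.parse import urljoin

ARCHIVE_PREFIX = "https://webarchive.nationalarchives.gov.uk/ukgwa/"  # assumed prefix

# Matches an archived URL and captures its 14-digit timestamp and original URL.
SNAPSHOT = re.compile(re.escape(ARCHIVE_PREFIX) + r"(\d{14})/(.+)")

def standardise(href, parent_url):
    """Return (timestamp, original_url) for a hyperlink found on an archived page."""
    if href.startswith(("mailto:", "#")):
        return None  # not an archived resource
    match = SNAPSHOT.match(href)
    if match:
        # Fully qualified archive link: the timestamp is explicit.
        return match.group(1), match.group(2)
    # Assumes the parent URL is itself an archive snapshot URL.
    parent = SNAPSHOT.match(parent_url)
    parent_timestamp, parent_original = parent.group(1), parent.group(2)
    if href.startswith(("http://", "https://")):
        # Absolute link with no archive prefix: inherit the parent's timestamp.
        return parent_timestamp, href
    # Relative link: inherit the timestamp and resolve against the parent URL.
    return parent_timestamp, urljoin(parent_original, href)
```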

Table 1’s example is found in the 20141226000648 snapshot of www.gov.uk, but be aware that the snapshot referred to in a hyperlink does not necessarily exist. Following this link will redirect the user to a snapshot dated 20141226073441. This is only a seven-hour difference, but the discrepancy between link address and the archived copy can range from seconds to years. The http://www.environment-agency.gov.uk:80/who.html page was first captured 467 days later, on Valentine’s Day 1998 (perhaps it felt unloved), but the hyperlink to it on the Environment Agency home page is labelled as 4 November 1996. Large discrepancies can be due to a difference in crawling depth over time. In the early days of the web archive, often only the home page itself, or the pages directly linked from it, were crawled. Over time, crawling policies changed to capture as much of a site’s content as possible. In this instance, it appears to be an oversight, or perhaps a technical hitch, as other links on the page were crawled. This was the first website captured by the UKGWA.

Two options are available to identify the true snapshot date: follow the link, automatically resolving to the closest archived version; or use the CDX API and apply one’s own logic to find the desired snapshot date. Using the built-in functionality of the archive is the easier option, but there may be performance considerations, as following a URL involves retrieving that URL’s web content (possibly including images and videos). Many links will never resolve because they point to sites outside the scope of the archive. An NHS story about training toddlers to eat vegetables links to the BBC news page the story was sourced from (National Health Service 2014). This link has all the appearance of an archived page, including a timestamp, but was not captured as it is outside the remit of UKGWA. This has a knock-on effect in network analysis, where there is a mixture of bidirectional links, links confirmed as being one-directional (both pages are archived), and links in one direction but lacking evidence that a return link does not exist (only one page is archived).
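The second option can be sketched as follows, reusing the capture_history helper from the earlier CDX example. The logic shown (picking the capture closest in time to the timestamp in the hyperlink) is one plausible choice, not necessarily the Notebook's.

```python
def closest_snapshot(url, target_timestamp):
    """Return the capture timestamp closest to a 14-digit target timestamp."""
    # capture_history() is the helper defined in the earlier CDX sketch.
    captures = [timestamp for timestamp, status, digest in capture_history(url)]
    if not captures:
        return None  # the page was never archived (e.g. outside UKGWA's remit)
    # Comparing timestamps as integers is a rough but workable proxy for closeness in time.
    return min(captures, key=lambda ts: abs(int(ts) - int(target_timestamp)))

# Example: the hyperlink labelled 20141226000648 should resolve to a nearby capture.
print(closest_snapshot("http://www.gov.uk/", "20141226000648"))
```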

Content can be extracted from pages in order to perform text analysis, and the Notebook presents three methods for this task. The Python library used includes functions for extracting text, but navigation menus and embedded JavaScript (code enabling dynamic functionality) are also included in the output. The pages need to be pre-processed using boilerplate removal tools to strip out the menus (Kohlschütter et al. 2010). While these tools are effective, many home pages consist entirely of links and menus, resulting in an empty page following boilerplate removal. Applying the tool to the www.salt.gov.uk page provides an example of this occurring. While retained navigation menus can add unwanted noise when performing text analysis, they are important informational, and contextual, elements in their own right.
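As a rough illustration of the pre-processing problem (not a substitute for a dedicated boilerplate removal tool such as that of Kohlschütter et al. 2010), the sketch below strips script, style, and navigation elements from a parsed page before extracting text. The tag names targeted are an assumption about typical page structure.

```python
from bs4 import BeautifulSoup

def main_text(html):
    """Crudely strip boilerplate-like elements and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that rarely carry the page's substantive content.
    for tag in soup.find_all(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

# A page made up entirely of menus and links may return an empty string here,
# as the article notes for www.salt.gov.uk.
```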

The Notebook concludes with a demonstration of Document Summarisation (Barrios et al. 2018, 2). A number of guides to web scraping have been published, Mitchell (2018) being the most comprehensive and up to date. However, the ethical implications of scraping are only rarely discussed. The social sciences, including criminology, have published a number of articles on web scraping, as they recognise the benefits of the technique but also worry about its ethical implications. Web scraping is legally ambiguous, as existing ethical guidelines are not directly applicable to online research (Luscombe et al. 2021; Brewer et al. 2021).

As Krotov and Silva (2018) propose in their article, it would be best if organisations had ‘terms of use’ around web scraping; currently the decision to scrape material that may be sensitive or copyrighted is left to the user, rather than being governed by a preventative measure (Krotov and Silva 2018). This is echoed by the other articles, which are written from the researcher’s perspective and recommend that researchers consult ‘terms of use’ before considering web scraping (Mitchell 2018; Luscombe et al. 2021; Brewer et al. 2021). However, for most GLAM institutions providing online access to digital material, such guidelines are currently lacking.

A good example is the Internet Archive’s Responsible Crawling guidelines (Osborne 2018). Currently, the UKGWA does not have any such guidelines in place for the use of the web archive. TNA as a whole does have guidelines for web scrapers and bulk downloads, but prohibits anyone from using them without permission (The National Archives 2021l). As previously mentioned, UKGWA is under Crown Copyright (Open Government Licence) and therefore available for research purposes (The National Archives 2021c). Although there are no specific guidelines for the UKGWA yet (possibly highlighting the novelty of computational access to the archive), it is important to remember that this is a live archive, and crawling activity can impact daily archiving activities and potentially other users. Crawling therefore needs to be undertaken responsibly, and some basic guidelines could include:

  • De-anonymise your activity by using a ‘User Agent’ which provides identifiable information including contact details when crawling, in case of problems.

  • If possible, restrict crawling to outside working hours (08.00–18.00 in the UK) to avoid impacting the service.

  • Limit the crawling rate to 3 URIs per second during working hours (if crawling then is unavoidable), or 5 outside them.

  • Contact the web archiving team with any questions (see https://www.nationalarchives.gov.uk/webarchive/information/ for details).

This last point is important. Communicating with whoever is responsible for the archive about plans for web scraping is a common courtesy, and could open up alternative avenues for accessing the material. Being better informed about users’ scraping activities could influence future product development in terms of API features, scaling of computational resources, or the creation of pre-processed data extracts (e.g. Named Entities). These guidelines could be adapted by any memory institution providing access to digital material that has the potential to be web scraped. One way of observing the guidelines above in code is sketched below.
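The sketch below shows one way to follow the first three guidelines when crawling programmatically; the User-Agent string is a placeholder, and the rate limits simply restate the suggestions above.

```python
import time
from datetime import datetime

import requests

# Identify your crawl so the web archiving team can get in touch if there is a problem.
SESSION = requests.Session()
SESSION.headers.update(
    {"User-Agent": "ukgwa-research-crawler/0.1 (your.name@example.org)"}  # placeholder contact
)

def polite_get(url):
    """Fetch a URL while respecting the suggested UKGWA crawl rates."""
    hour = datetime.now().hour
    working_hours = 8 <= hour < 18                  # 08.00-18.00, approximated by local time
    time.sleep(1 / 3 if working_hours else 1 / 5)   # 3 URIs/s in working hours, 5 outside
    return SESSION.get(url, timeout=30)
```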

Web scraping in an ethical manner may not be suitable for a cloud-hosted Notebook environment. At a speed of 5 URIs per second, crawling a meaningful subset of the archive could stretch the tenancy limits of a free cloud service. The over 300,000 snapshots of home pages would take over 1000 minutes to crawl at this rate. Our analysis of a sample of 200 pages found an average of 38 links per page, with the number of links ranging from 1 to 243. At this rate, following the links from one copy of each home page would require 750 minutes of processing.

Generally, exponential growth in the number of links is experienced as the depth of the crawl increases. Crawling is better suited to out-of-hours batch processing on a dedicated server, using specialist tools. Institutions may consider creating downloadable datasets, such as the CDX extract accompanying Notebook 1. This can reduce traffic for common requests, especially if there is API functionality available to enable updating with more recent data (as there is with CDX). As data volumes grow, a non-consumptive service becomes preferable in order to offer the best access for computational methods, but this carries large cost and resource implications for the institution. However, guidelines around web scraping are still strongly recommended, as Mitchell (2018) highlights that an API may not be suitable for all purposes.

Pages on UKGWA may be subject to takedowns (The National Archives 2021j), i.e. removal from the open web, which, while rare enough to have little impact on the reproducibility of results, suggests care must be taken when scraping web content. There is a risk that taken-down content could still be circulating in the community, with archivists unaware of what has been copied and researchers unaware that they are holding closed content. One solution is to follow the GDPR principle of Storage Limitation (Information Commissioner’s Office 2021): delete raw scraped content once entities have been extracted or an alternative data representation, such as tf-idf (Ramos 2003), has been created.
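As an illustration of retaining a derived representation rather than the raw content (one interpretation of storage limitation, using scikit-learn's TfidfVectorizer as an assumed tool rather than anything prescribed by the article), the sketch below keeps only the tf-idf matrix and its vocabulary.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# 'documents' would hold text scraped from archived pages (see main_text above).
documents = ["example page text one", "example page text two"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# Persist only the derived representation, then discard the raw scraped text.
sparse.save_npz("ukgwa_sample_tfidf.npz", tfidf)
with open("ukgwa_sample_vocabulary.txt", "w") as handle:
    handle.write("\n".join(vectorizer.get_feature_names_out()))
del documents
```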

Notebooks are a great way to explore access and to showcase the possibilities for researchers to go beyond keyword search. The Notebooks that were created serve three main purposes: to highlight the data sources available for computational analysis of the UKGWA; to identify some features unique to the archived web that a user must consider; and to encourage experimentation with this collection as a data source. Notebook 1 used four sources of data to connect the UKGWA to the wider Archive, three of which are publicly available and one sourced internally. While this showed the possibilities available when data sources are brought together, it also showed that each of these data sources tells a different story about the collection. It highlighted that care needs to be taken when joining data sources, where even seemingly similar attributes, such as record dates, have subtly different meanings in the two systems. Notebook 2 went further in extracting links and content from the web archive, using the latter to generate automated summaries of websites. It demonstrated that there is potential for interesting analysis, but it also points to the need for archives to do more to make their collections available as data, reducing the need for researchers to use crawling methods to acquire data and to reinvent the pre-processing wheel.

Notebooks themselves may only be a temporary solution as organisations start to open their collections to computational access and explore its potential, but they can inform the future development of access systems. Creating these Notebooks surfaced a number of ethical issues that memory institutions should take into consideration. It may be that organisations themselves are not yet ready to offer this type of access, but, as the web scraping possibilities show, it is something that terms of use should be drawn up for. Although these Notebooks are a first step towards showcasing computational access to the UKGWA, greater API functionality would expand their capabilities in a sustainable and ethical way. There is enthusiasm within TNA to make research datasets available, to encourage engagement with the collection, and to build a community who can inform the design of the next generation of computation-ready interfaces to the web archive.