1 Introduction

Every year, great collections of historical handwritten manuscripts held by museums, libraries and other organisations are digitised as electronic images. Digitisation makes the manuscripts available to a wider audience and preserves cultural heritage. Automatically recognising text and named entities in medieval and early-modern manuscript sources with high accuracy remains a challenge [2, 20, 22]. To make manuscript images accessible and searchable, they are often processed through keyword spotting or word recognition, as in [4, 8, 14, 17] and [18]. Several papers build search systems for handwritten images, such as [1, 5, 15, 16, 21] and [23]. However, these systems offer only keyword search.

Unlike keyword search, semantic search improves search precision and recall by understanding the user’s intent and the contextual meaning of concepts in documents and queries [3, 12, 19, 24]. This paper proposes a semantic search engine for full-text retrieval of historical handwritten document images based on named entities (NEs), keywords (KWs) and a knowledge graph (KG). Such an engine not only supports automatic processing, storage and indexing of manuscripts, but also allows users to access and retrieve them quickly and efficiently.

2 System Architecture

The Public Record Office of Ireland (PROI) was destroyed on 30 June 1922, resulting in the loss of 700 years of Irish history. The Beyond 2022 Project (https://beyond2022.ie) combines historical research, archival discovery and technical innovation to create a virtual reconstruction of the PROI. More than 300 volumes of handwritten copies of lost documents have survived and been collected, comprising some 100,000 pages and 25 million words of text.

Fig. 1. The system architecture

The architecture of our search engine is illustrated in Fig. 1. It has four separate processing modules: Handwritten Text Recognition, NE Recognition, KW-NE Indexing and the KW-NE-Based IR Model. First, the Handwritten Text Recognition module converts the historical handwritten document images into transcriptions. Then, the NE Recognition module annotates the transcriptions with NEs. This module connects to the Knowledge Graph to extract the classes and identifiers of the NEs. Next, the KW-NE Indexing module represents and indexes the KWs and NEs of the annotated transcriptions, together with the respective original images, and stores them in the KW-NE Annotated Text and Image Repository. The raw text query is also annotated with NEs by the NE Recognition module to become a KW-NE annotated query. Finally, the KW-NE-Based IR Model module compares the annotated query with the annotated documents and returns the ranked transcriptions and images.
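The data flow between the four modules can be sketched as follows. This is only a toy illustration under strong assumptions: every function name is hypothetical, the HTR step is replaced by a lookup of pre-made transcriptions, NE recognition is a dictionary lookup against a toy KG, and ranking is a simple count of shared concepts rather than the retrieval model of the deployed system.

```python
from collections import defaultdict

# All names below are hypothetical stand-ins for the four modules.

def recognise_text(image_id, fake_htr_output):
    """HTR stub: return a pre-made transcription for an image."""
    return fake_htr_output[image_id]

def annotate(text, knowledge_graph):
    """NE Recognition stub: tokens found in the toy KG become identifiers,
    all other tokens are kept as lower-cased keywords."""
    return [knowledge_graph.get(tok, tok) for tok in text.lower().split()]

def build_index(annotated_docs):
    """KW-NE Indexing stub: inverted index from concept to document ids."""
    index = defaultdict(set)
    for doc_id, concepts in annotated_docs.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index

def search(query, index, knowledge_graph):
    """IR stub: annotate the query like a document, then rank documents
    by the number of query concepts they contain (placeholder ranking)."""
    hits = defaultdict(int)
    for concept in annotate(query, knowledge_graph):
        for doc_id in index.get(concept, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))

# Toy data: the identifiers follow the examples given in the paper.
kg = {"meath": "coun_meath", "sheriff": "occu_sheriff"}
htr = {"img1": "sheriff Meath silver", "img2": "clerk silver"}
docs = {img: annotate(recognise_text(img, htr), kg) for img in htr}
index = build_index(docs)
ranked = search("Meath silver", index, kg)
```

The point of the sketch is that queries and documents pass through the same annotation step, so both sides are compared in one shared concept space.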

3 Image Representation and Knowledge Graph

Transkribus [13] is used for training and deploying Handwritten Text Recognition (HTR) models that derive text transcriptions from image scans. Given the rate at which transcriptions can be generated, NE Recognition (NER) and Entity Linking (EL) are required to automatically annotate all instances of entities occurring in the transcription text. We used spaCy [11] for NER and obtained highly accurate results on 18th-century English text. To provide flexibility, the NLP pipeline has been implemented as a thin layer over a number of standard NLP tools. The output of the pipeline is in the NLP Interchange Format (NIF) [10], in which a NER tool has annotated the classes of entities and, where possible, an EL tool has linked the recognised entities to the KG.

The KG collects structured data from various historical sources. Part of the data is manually curated by historians through spreadsheets. Other data sources (e.g. geographical data from OSi [6]) are imported automatically as RDF for direct insertion into the KG. The schema (or ontology) used to structure the KG is mainly based on the popular CIDOC-CRM ontology [7]. A short excerpt of the KG is depicted in Fig. 2. It shows a few main entities and relationships related to a person (of type CIDOC-CRM:E21_Person) named “William Sutton”, who was a member of a few relevant offices in Ireland.

Fig. 2. A portion of our historical KG about “William Sutton”.

4 Information Retrieval Model and Demo

A search engine needs not only to return the best documents, but also to be fast. We implemented the index and search functions on top of Elasticsearch to obtain a real-time search engine [9]. The Okapi BM25 model is used to find and rank the handwritten manuscripts relevant to a query. In the model, documents and queries are represented by sets of concepts, which are NEs or KWs. Figure 3 presents an image of a handwritten medieval historical manuscript, its transcription and its concept set d, as used in the model. In the transcription, there are three kinds of words determined by our NER tool: (1) stop-words, namely the, to, of, we and you; (2) NEs, namely sheriff, Meath, clerk and William Sutton; and (3) KWs, namely king, &c, greeting, direct, pay, shilling and silver. The stop-words are not added to the concept set d.
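To make the ranking step concrete, the following is a minimal sketch of Okapi BM25 scoring over concept lists. It is not the deployed implementation, which relies on Elasticsearch. The function name, the toy documents and the parameter values k1 = 1.2 and b = 0.75 (commonly used defaults) are assumptions, and the IDF uses the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25.

    query: list of concepts (NE identifiers or keywords)
    docs:  dict mapping a document id to its list of concepts
    k1, b: assumed parameter values, not taken from the paper
    """
    n_docs = len(docs)
    avg_len = sum(len(c) for c in docs.values()) / n_docs

    # Document frequency: in how many documents each concept occurs.
    doc_freq = Counter()
    for concepts in docs.values():
        doc_freq.update(set(concepts))

    scores = {}
    for doc_id, concepts in docs.items():
        tf = Counter(concepts)
        # Length normalisation term shared by all concepts of this document.
        norm = k1 * (1 - b + b * len(concepts) / avg_len)
        score = 0.0
        for concept in set(query):
            if concept not in tf:
                continue
            n = doc_freq[concept]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf[concept] * (k1 + 1) / (tf[concept] + norm)
        scores[doc_id] = score
    return scores

# Toy collection; the concept names reuse the examples from the text.
docs = {
    "d1": ["occu_sheriff", "coun_meath", "shilling", "silver"],
    "d2": ["occu_clerk", "silver"],
    "d3": ["king", "greeting"],
}
scores = bm25_scores(["coun_meath", "silver"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy collection, d1 matches both query concepts, d2 matches only the more common concept silver, and d3 matches none, so the ranking is d1, d2, d3.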

Fig. 3. An example of NE and KW annotation of a medieval historical manuscript

Fig. 4. User interface of our deployed search engine

Figure 4 presents the interface of our search engine and the concept sets of \(q_1\) and \(q_2\). Here, coun_meath is the identifier of an entity named Meath and classed Country, which is determined by our NER algorithm, whereas silver and shilling are keywords. To exploit the features of NEs for semantic search, an NE needs to be represented by its most specific meaning in the concept set d. That is, for an NE in the transcription:

  • If our NER can determine its identifier, the NE is represented by its identifier in d. For example, occu_sheriff, coun_meath and occu_clerk are the identifiers of the entities named sheriff, Meath and clerk, and are added to d.

  • If our NER can determine only its most specific class, the NE is represented by a combination of its name and class. For example, the entity named William Sutton does not exist in our historical KG, so its identifier cannot be extracted. However, the NER determines that its most specific class is Person, so william_sutton/person is added to d.
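The two rules above, together with the removal of stop-words, can be sketched as a small mapping function. The input format (a list of dictionaries with text, identifier and cls fields) and the function names are assumptions made for illustration and do not reflect the actual NER output format. The sketch returns a list rather than a set so that the order of the transcription is visible.

```python
# Stop-words listed in the example transcription.
STOP_WORDS = {"the", "to", "of", "we", "you"}

def to_concept(mention):
    """Map one annotated mention to its concept, or None for a stop-word.

    mention is a hypothetical dict with keys:
      'text'       surface form in the transcription
      'identifier' KG identifier, if the NER could determine one
      'cls'        most specific class, if only the class is known
    """
    text = mention["text"].lower()
    if text in STOP_WORDS:
        return None
    if mention.get("identifier"):
        # Rule 1: the NE is represented by its identifier.
        return mention["identifier"]
    if mention.get("cls"):
        # Rule 2: the NE is represented by its name combined with its class.
        name = "_".join(text.split())
        return f"{name}/{mention['cls'].lower()}"
    # Neither identifier nor class: treat the word as a plain keyword.
    return text

def concept_list(mentions):
    """Build the concepts of a document, dropping stop-words."""
    concepts = (to_concept(m) for m in mentions)
    return [c for c in concepts if c is not None]

mentions = [
    {"text": "the"},
    {"text": "sheriff", "identifier": "occu_sheriff"},
    {"text": "of"},
    {"text": "Meath", "identifier": "coun_meath"},
    {"text": "William Sutton", "cls": "Person"},
    {"text": "silver"},
]
d = concept_list(mentions)
```

With these mentions, d contains occu_sheriff, coun_meath, william_sutton/person and silver, in that order.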

5 Conclusion

We proposed a novel semantic full-text search system for images of historical handwritten manuscripts. Unlike existing approaches, which use only KWs extracted from images, we exploited NEs, KWs and a KG to increase search performance. NER and HTR tools were built to recognise transcriptions and NEs from the manuscript images. In addition, the historical KG was designed to increase the precision of our NER tool. We then implemented the index and search functions for transcriptions based on Elasticsearch and Okapi BM25 to search images in real time. Finally, the semantic search engine was implemented and deployed.