1 Introduction

Several research and application fields require annotated datasets to advance the development of intelligent systems. Among many, ImageNet [1] enabled the growth of novel approaches that have guided the creation of some of the most modern learning systems. Document Image Analysis and Recognition (DIAR) is no exception: several well-known benchmark datasets have allowed researchers to advance the state of the art in DIAR and, more generally, in pattern recognition. For instance, in the 1990s the NIST [2] and MNIST [3] datasets of handwritten digits were instrumental in significant advances in pattern recognition techniques [4]. DIAR is not limited to isolated character recognition, but encompasses several tasks, ranging from pre-processing to layout analysis, with the overall aim of achieving document understanding in many application domains [5]. Many application areas deal with proprietary data that cannot be made publicly available due to copyright and privacy issues, as is the case for financial documents or health records. These difficulties, together with the annotation effort required for large quantities of documents, are usually the main challenges faced when creating a new benchmark dataset for DIAR.

Among other tasks, Document Layout Analysis (DLA) research advanced significantly in the 1990s thanks to a new collection of scanned pages of scientific articles. Similar to NIST, the UW datasets [6, 7] set a milestone for evaluating research progress. Since then, scientific articles have been widely used as benchmark sources of data due to their availability, in terms of quantity and accessibility, and their rich semantic structure, which allows researchers to focus on different tasks in the document understanding pipeline, e.g., Table Detection (TD) and Table Recognition (TR). Although some tasks are nowadays essentially solved (e.g., physical layout analysis, that is, the identification of homogeneous regions of text in the page), there is still room for research in the analysis of challenging regions of documents (e.g., tables and graphical illustrations), as well as in the overall understanding of scientific articles published with uncommon styles and layouts.

In this paper, we aim to provide a guide to the different datasets that have been proposed over the past 30 years to support research on DLA of scientific articles. Other recent surveys focused either on historical document collections [8] or on state-of-the-art methods for page object detection [9], and we refer the reader to them for a broader overview of the DIAR field. In addition to a comprehensive inventory of datasets, highlighting their strengths and limitations, we focus our attention on the annotation procedures that have been proposed for such collections, with an analysis of the advantages and disadvantages of the different approaches. We review some of the most important state-of-the-art methods tested on the collections in this survey, but for a complete overview of DLA we refer the reader to a survey of the most important methods proposed to tackle this task [10].

We focus on DLA of scientific articles, for three main reasons:

  • to investigate the most widely used annotation procedures, along with the challenges of creating large, high-quality datasets with reliable annotations;

  • to provide researchers with an overview of available datasets and their details, helping them identify the most suitable benchmarks for developing and testing novel algorithms;

  • to outline how document collections have changed over time, posing questions and open problems that could further advance DIAR research.

The paper is organized as follows. In Sect. 2, we describe DLA and the principal techniques used to tackle it, with a particular focus on the analysis of scientific articles. Then, a detailed description of annotation procedures is given in Sect. 3. After an overview of the collections reported in this survey (Sect. 4), we divide them into three main categories, starting with small-scale, fully annotated collections (Sect. 5), mostly containing scanned documents. Then, in Sect. 6, we describe partially annotated collections, which focus only on challenging parts of scientific documents such as tables and figures. Finally, large-scale, fully annotated datasets are listed in Sect. 7. For completeness, in Sect. 8 we provide a broader overview of significant datasets related to DLA for different types of documents, and in Sect. 9 we discuss and summarize the impact and complexity of each collection presented in this survey, along with the latest state-of-the-art methods tested on the datasets. We then discuss open problems and challenges in the field of DLA for scientific articles in Sect. 10. Finally, we draw our conclusions in Sect. 11.

Table 1 Acronyms of tasks addressed in layout analysis of scientific articles

2 Document layout analysis of scientific articles

In addition to Optical Character Recognition (OCR) of printed or handwritten characters, one of the most investigated tasks in Document Image Analysis and Recognition has been Document Layout Analysis, which aims at finding regions in a page, such as text or figures (physical layout analysis), and at recognizing and classifying them, e.g., discriminating text blocks as titles or paragraphs (logical layout analysis). In physical layout analysis, the aim is to identify homogeneous regions, usually by means of bounding boxes [10, 11]. Since many tasks can be addressed in DLA of scientific articles, we summarize the main ones in Table 1, together with the acronyms used in this paper.

Over the years, several methods have been proposed to solve DLA, following the application of novel techniques and the gathering of larger collections of annotated data. From the early 1990s to the present, the different techniques can be broadly divided into three main groups: heuristic, statistical machine learning, and deep learning methods. The first two groups are described in [12], which divides the different approaches according to two criteria.

The first criterion refers to how the document is analyzed: using bottom-up, top-down, or hybrid techniques. Bottom-up techniques start by gathering information at the pixel level and then iteratively group pixels into larger areas, from connected components (CCs) up to larger meaningful areas of text or non-text (e.g., figures). Representative algorithms in this group are RLSA [13], Docstrum [14], and Voronoi diagrams [15]. Conversely, top-down techniques start from the whole document and recursively split it until basic components are found, as in the X-Y cut algorithm [16]. Finally, hybrid methods combine the two.
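As an illustration of the bottom-up family, the horizontal pass of RLSA can be sketched in a few lines. The function name, the list-of-lists image representation, and the gap threshold below are our own illustrative choices, not taken from [13]:

```python
def rlsa_horizontal(img, c):
    """Horizontal Run-Length Smoothing Algorithm (RLSA) sketch.

    `img` is a binary image as a list of rows (1 = ink, 0 = background).
    Background runs of length <= c that lie between two ink pixels on
    the same row are filled with 1s, merging nearby characters into
    word- or line-level blobs.
    """
    out = [row[:] for row in img]       # do not modify the input image
    for row in out:
        last_fg = -1                    # column of the last ink pixel seen
        for x, v in enumerate(row):
            if v == 1:
                gap = x - last_fg - 1   # length of the background run just crossed
                if last_fg >= 0 and 0 < gap <= c:
                    for i in range(last_fg + 1, x):
                        row[i] = 1      # fill the short gap
                last_fg = x
    return out
```

In the classic formulation, the same smoothing is applied column-wise and the two results are intersected to obtain block-level segments; the threshold `c` is typically chosen relative to the expected character spacing.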

The second criterion discriminates the techniques by what is analyzed: the physical or the logical document layout. The former aims at identifying homogeneous regions in the page, while the latter assigns functional information, a label, to these regions. Methods are categorized in these terms depending on the downstream task they are used for. To cite some examples, Strouthopoulos and Papamarkos [17] propose an Artificial Neural Network (ANN) to classify \(8 \times 8\) document patches as graphics or halftones. Wu et al. [18] segment text regions using a series of split-or-merge operations guided by a binary SVM classifier. Once the page objects are segmented and/or classified, post-processing techniques can be applied to generalize the results over different layouts [10]. It is worth noting that most methods for layout analysis have been demonstrated and tested on collections of digitized scientific articles.
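To make the physical/logical distinction concrete, a deliberately naive heuristic labeler is sketched below: it takes the output of a physical analysis (bounding boxes) and assigns logical labels. The rules, thresholds, and region format (x, y, width, height boxes with y = 0 at the top of the page) are purely illustrative assumptions and do not reproduce the cited methods [17, 18]:

```python
def label_regions(regions, page_height):
    """Toy logical layout analysis: map physical regions to labels.

    `regions` is a list of (x, y, w, h) bounding boxes produced by a
    physical layout step (y = 0 at the top of the page).  A box that
    sits in the top 10% of the page and is wider than it is tall is
    labeled 'title'; every other box falls back to 'paragraph'.
    """
    labels = []
    for (x, y, w, h) in regions:
        if y < 0.1 * page_height and w > h:   # high on the page, landscape-shaped
            labels.append("title")
        else:
            labels.append("paragraph")
    return labels
```

Real logical analysis replaces such hand-written rules with learned classifiers (ANNs, SVMs, and, more recently, deep networks), but the interface is the same: physical regions in, functional labels out.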

More recently, deep learning techniques have also been applied to DLA, taking advantage of larger document collections. In a recent paper summarizing models, tasks, and datasets for document AI [

Table 2 Main tools used to support annotation