1 Introduction

Several research and application fields require annotated datasets to advance the development of intelligent systems. Among many, ImageNet [1] enabled the growth of novel approaches that have guided the creation of some of the most modern learning systems. Document Image Analysis and Recognition (DIAR) is no exception: several well-known benchmark datasets have allowed researchers to advance the state of the art in DIAR and, more generally, in pattern recognition. For instance, in the 1990s the NIST [2] and MNIST [3] datasets of handwritten digits were instrumental in significant advances in pattern recognition techniques [4]. DIAR is not limited to isolated character recognition, but encompasses several tasks, ranging from pre-processing to layout analysis, with the overall aim of achieving document understanding in many application domains [5]. Many application areas deal with proprietary data that cannot be made publicly available due to copyright and privacy issues, as is the case for financial documents or health records. These difficulties, together with the annotation effort required for large quantities of documents, are usually the main challenges faced when creating a new benchmark dataset for DIAR.

Among other tasks, Document Layout Analysis (DLA) research advanced significantly in the 1990s thanks to a new collection of scanned pages of scientific articles. Similar to NIST, the UW datasets [6, 7] set a milestone for evaluating research progress. Since then, scientific articles have been widely used as benchmark sources of data due to their availability, in terms of quantity and accessibility, and their rich semantic structure, which allows researchers to focus on different tasks in the document understanding pipeline, e.g., Table Detection (TD) and Table Recognition (TR). Although some tasks are nowadays essentially solved (e.g., physical layout analysis, that is, the identification of homogeneous regions of text in the page), there is still room for research in the analysis of challenging regions of documents (e.g., tables and graphical illustrations), as well as in the overall understanding of scientific articles published with uncommon styles and layouts.

In this paper, we aim to provide a guide to the different datasets that have been proposed over the past 30 years to support research on DLA of scientific articles. Other recent surveys focused either on historical document collections [8] or on state-of-the-art methods for page object detection [9], and we refer the reader to them for a broader overview of the DIAR field. In addition to a comprehensive inventory of datasets, highlighting their strengths and limitations, we focus our attention on the annotation procedures that have been proposed for such collections, with an analysis of the advantages and disadvantages of the different approaches. We review some of the most important state-of-the-art methods tested on the collections in this survey, but for a complete overview of DLA we refer the reader to a survey of the most important methods proposed to tackle this task [10].

We focus on DLA of scientific articles, for three main reasons:

  • to investigate the most widely used annotation procedures, along with the challenges of creating large, high-quality datasets with reliable annotations;

  • to provide researchers with an overview of available datasets and their details, helping them identify the most suitable benchmarks for developing and testing novel algorithms;

  • to outline how document collections have changed over time, posing questions and open problems that could further advance DIAR research.

The paper is organized as follows. In Sect. 2, we describe DLA and the principal techniques used to tackle it, with a particular focus on the analysis of scientific articles. Then, a detailed description of annotation procedures is given in Sect. 3. After an overview of the collections reported in this survey (Sect. 4), we divide them into three main categories, starting with small-scale, fully annotated collections (Sect. 5), mostly containing scanned documents. Then, in Sect. 6, we describe partially annotated collections, which focus only on challenging parts of scientific documents such as tables and figures. Finally, large-scale, fully annotated datasets are listed in Sect. 7. For completeness, in Sect. 8 we provide a broader overview of significant datasets related to DLA for different types of documents, and in Sect. 9 we discuss and summarize the impact and complexity of each collection presented in this survey, along with the latest state-of-the-art methods tested on the datasets. We then discuss open problems and challenges in the field of DLA for scientific articles in Sect. 10. Finally, we draw our conclusions in Sect. 11.

Table 1 Acronyms of tasks addressed in layout analysis of scientific articles

2 Document layout analysis of scientific articles

In addition to Optical Character Recognition (OCR) of printed or handwritten characters, one of the most investigated tasks in Document Image Analysis and Recognition has been Document Layout Analysis, which aims at finding regions in a page, such as text or figures (physical layout analysis), and at recognizing and classifying them, e.g., discriminating text blocks as titles or paragraphs (logical layout analysis). In physical layout analysis, the aim is to identify homogeneous regions, usually by means of bounding boxes [10, 11]. Since many tasks can be addressed in DLA of scientific articles, we summarize the main ones in Table 1, together with the acronyms used in this paper.

Over the years, several methods have been proposed to solve DLA, following the application of novel techniques and the gathering of larger collections of annotated data. From the early 1990s to the present, the different techniques can be broadly divided into three main groups: heuristic, statistical machine learning, and deep learning methods. The first two groups are described in [12], which divides the different approaches according to two criteria.

The first criterion refers to how the document is analyzed: using bottom-up, top-down, or hybrid techniques. Bottom-up techniques start by gathering information at the pixel level and then iteratively group pixels into larger areas, from connected components (CCs) up to larger meaningful areas of text or non-text (e.g., figures). Representative algorithms in this group are RLSA [13], Docstrum [14], and Voronoi diagrams [15]. Conversely, top-down techniques start from the whole document and recursively split it until basic components are found, as in the X-Y cut algorithm [16]. Finally, hybrid methods combine the two.
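As an illustration of the bottom-up family, the horizontal pass of RLSA can be sketched in a few lines. The function name, the list-of-lists image representation, and the gap threshold below are our own illustrative choices, not taken from [13]:

```python
def rlsa_horizontal(img, c):
    """Horizontal Run-Length Smoothing Algorithm (RLSA) sketch.

    `img` is a binary image as a list of rows (1 = ink, 0 = background).
    Background runs of length <= c that lie between two ink pixels on
    the same row are filled with 1s, merging nearby characters into
    word- or line-level blobs.
    """
    out = [row[:] for row in img]       # do not modify the input image
    for row in out:
        last_fg = -1                    # column of the last ink pixel seen
        for x, v in enumerate(row):
            if v == 1:
                gap = x - last_fg - 1   # length of the background run just crossed
                if last_fg >= 0 and 0 < gap <= c:
                    for i in range(last_fg + 1, x):
                        row[i] = 1      # fill the short gap
                last_fg = x
    return out
```

In the classic formulation, the same smoothing is applied column-wise and the two results are intersected to obtain block-level segments; the threshold `c` is typically chosen relative to the expected character spacing.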

The second criterion discriminates the techniques by what is analyzed: the physical or the logical document layout. The former aims at identifying homogeneous regions in the page, while the latter assigns functional information, a label, to these regions. Methods are categorized in these terms depending on the downstream task they are used for. To cite some examples, Strouthopoulos and Papamarkos [17] propose an Artificial Neural Network (ANN) to classify \(8 \times 8\) document patches as graphics or halftones. Wu et al. [18] segment text regions using a series of split-or-merge operations guided by a binary SVM classifier. Once the page objects are segmented and/or classified, post-processing techniques can be applied to generalize the results over different layouts [10]. It is worth noting that most methods for layout analysis have been demonstrated and tested on collections of digitized scientific articles.
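To make the physical/logical distinction concrete, a deliberately naive heuristic labeler is sketched below: it takes the output of a physical analysis (bounding boxes) and assigns logical labels. The rules, thresholds, and region format (x, y, width, height boxes with y = 0 at the top of the page) are purely illustrative assumptions and do not reproduce the cited methods [17, 18]:

```python
def label_regions(regions, page_height):
    """Toy logical layout analysis: map physical regions to labels.

    `regions` is a list of (x, y, w, h) bounding boxes produced by a
    physical layout step (y = 0 at the top of the page).  A box that
    sits in the top 10% of the page and is wider than it is tall is
    labeled 'title'; every other box falls back to 'paragraph'.
    """
    labels = []
    for (x, y, w, h) in regions:
        if y < 0.1 * page_height and w > h:   # high on the page, landscape-shaped
            labels.append("title")
        else:
            labels.append("paragraph")
    return labels
```

Real logical analysis replaces such hand-written rules with learned classifiers (ANNs, SVMs, and, more recently, deep networks), but the interface is the same: physical regions in, functional labels out.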

More recently, deep learning techniques have also been applied to DLA, taking advantage of larger document collections. In a recent paper summarizing models, tasks, and datasets for document AI [

Table 2 Main tools used to support annotation