VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

  • Conference paper
PRICAI 2021: Trends in Artificial Intelligence (PRICAI 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13031)

Abstract

Documents often have complex physical structures, which makes Document Layout Analysis (DLA) challenging. As a pre-processing step for content extraction, DLA can capture rich information from historical or scientific documents at scale. Although many deep-learning-based methods from computer vision already detect Figure blocks in documents with excellent performance, they remain unsatisfactory at recognizing List, Table, Text, and Title blocks. This paper proposes VTLayout, a model that fuses a document's deep visual, shallow visual, and text features to localize and identify blocks of different categories. The model comprises two stages, with the three feature extractors built into the second stage. In the first stage, the Cascade Mask R-CNN model is applied directly to localize all category blocks in a document. In the second stage, the deep visual, shallow visual, and text features are extracted and fused to identify the category of each block. As a result, the classification power over different block categories is strengthened on top of the existing localization technique. Experimental results show that VTLayout's identification capability is superior to that of the most advanced DLA methods on the PubLayNet dataset, with an F1 score of 0.9599.
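To make the second stage concrete, below is a minimal PyTorch sketch of a three-branch fusion classifier in the spirit of VTLayout, operating on block crops that a first-stage detector such as Cascade Mask R-CNN would produce. The branch architectures, feature dimensions, and the use of a TF-IDF vector for the text branch are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn

NUM_CLASSES = 5  # PubLayNet categories: Text, Title, List, Table, Figure

class FusionClassifier(nn.Module):
    # Illustrative fusion of deep visual, shallow visual, and text features.
    def __init__(self, tfidf_dim=2000, fused_dim=256):
        super().__init__()
        # Deep visual branch: a small CNN over the cropped block image
        # (a pretrained backbone would be a natural substitute).
        self.deep_visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, fused_dim),
        )
        # Shallow visual branch: coarse appearance cues from a heavily
        # downsampled version of the same crop.
        self.shallow_visual = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, fused_dim),
        )
        # Text branch: a TF-IDF vector of the block's text
        # (a hypothetical choice of text representation).
        self.text = nn.Sequential(nn.Linear(tfidf_dim, fused_dim), nn.ReLU())
        # Fusion head: concatenate the three feature vectors and classify.
        self.head = nn.Linear(3 * fused_dim, NUM_CLASSES)

    def forward(self, crop, tfidf):
        fused = torch.cat(
            [self.deep_visual(crop), self.shallow_visual(crop), self.text(tfidf)],
            dim=1,
        )
        return self.head(fused)

# Usage: classify one localized block from the stage-one detections.
model = FusionClassifier()
crop = torch.randn(1, 3, 224, 224)  # cropped, resized block image
tfidf = torch.randn(1, 2000)        # TF-IDF vector of the block's text
logits = model(crop, tfidf)         # shape (1, 5), one score per category

Concatenation is the simplest possible fusion choice here; the paper's gain comes from which features are combined, so any comparable fusion head could stand in.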

References

  1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  2. Asim, M.N., Khan, M.U.G., Malik, M.I., Razzaque, K., Dengel, A., Ahmed, S.: Two stream deep network for document image classification. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1410–1416. IEEE (2019)

  3. Augusto Borges Oliveira, D., Palhares Viana, M.: Fast CNN-based document layout analysis. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1173–1180 (2017)

  4. Binmakhashen, G.M., Mahmoud, S.A.: Document layout analysis: a comprehensive survey. ACM Comput. Surv. (CSUR) 52(6), 1–36 (2019)

  5. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)

  6. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 345–359. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_25

  7. Ha, J., Haralick, R.M., Phillips, I.T.: Recursive X-Y cut using bounding boxes of connected components. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 2, pp. 952–955. IEEE (1995)

  8. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

  9. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  10. Jain, R., Wigington, C.: Multimodal document image classification. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 71–77. IEEE (2019)

  11. Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: 2014 22nd International Conference on Pattern Recognition, pp. 3168–3172. IEEE (2014)

  12. Kavasidis, I., et al.: A saliency-based convolutional neural network for table and chart detection in digitized documents. arXiv preprint arXiv:1804.06236 (2018)

  13. Keskar, N.S., Socher, R.: Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628 (2017)

  14. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  15. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2

  16. Namboodiri, A.M., Jain, A.K.: Document structure and layout analysis. In: Chaudhuri, B.B. (ed.) Digital Document Processing. Advances in Pattern Recognition. Springer, London (2007). https://doi.org/10.1007/978-1-84628-726-8_2

  17. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1162–1173 (1993)

  18. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019)

  19. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  20. Ray, S., Chandra, N.: Domain based ontology and automated text categorization based on improved term frequency-inverse document frequency. Int. J. Mod. Educ. Comput. Sci. 4(4), 28 (2012)

  21. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

  22. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)

  23. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)

  24. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2016)

  25. Siddiqui, S.A., Khan, P.I., Dengel, A., Ahmed, S.: Rethinking semantic segmentation for table structure recognition in documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1397–1402. IEEE (2019)

  26. Soto, C.X.: Visual detection with context for document layout analysis. Technical report, Brookhaven National Lab. (BNL), Upton, NY (United States) (2019)

  27. Sun, N., Zhu, Y., Hu, X.: Faster R-CNN based table detection combining corner locating. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1314–1319. IEEE (2019)

  28. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Lee Giles, C.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5315–5324 (2017)

  29. Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022. IEEE (2019)

  30. Zhou, X., et al.: East: an efficient and accurate scene text detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560 (2017)

Author information

Correspondence to Shoubin Li.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, S., Ma, X., Pan, S., Hu, J., Shi, L., Wang, Q. (2021). VTLayout: Fusion of Visual and Text Features for Document Layout Analysis. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science, vol. 13031. Springer, Cham. https://doi.org/10.1007/978-3-030-89188-6_23

  • DOI: https://doi.org/10.1007/978-3-030-89188-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89187-9

  • Online ISBN: 978-3-030-89188-6

  • eBook Packages: Computer Science (R0)
