Towards an Annotation Standard for STEM Documents

Datasets, Benchmarks, and Spotters

  • Conference paper
Intelligent Computer Mathematics (CICM 2023)

Abstract

When publishing papers, researchers in mathematics and related disciplines typically focus on the presentation (i.e., the typesetting) of their ideas and provide little semantic information. This impedes the development of services that benefit from semantic information, such as semantic search and screen readers for vision-impaired researchers. As a remedy, there have been attempts to infer semantic data from already published papers using small programs that we call spotters. Unfortunately, there is no standardized format for semantic annotations, and spotter authors typically invent their own. This leads to two problems: (i) there is no ecosystem of tools for common tasks such as visualizing results or manually annotating a gold standard, and (ii) re-using, evaluating, and combining results becomes very difficult.

In this paper, we address these issues by describing a standardized, flexible way to represent semantic annotations, using semantic web technologies and, in particular, the Web Annotation standard. Furthermore, we describe SpotterBase, a set of tools to help with processing the annotations and creating new ones.
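To make the idea of a standardized annotation concrete, the following is a minimal sketch of what a single annotation could look like when serialized as JSON-LD. It uses only field names from the W3C Web Annotation Data Model (`type`, `body`, `target`, `source`, `selector`, `XPathSelector`). It is not taken from the paper and does not claim to reproduce the format used by SpotterBase: the helper `make_annotation` and all `example.org` URIs are hypothetical placeholders chosen for illustration.

```python
import json

# JSON-LD context published by the W3C for the Web Annotation Data Model.
ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def make_annotation(anno_id, document_uri, xpath, body_uri, creator_uri):
    """Build a minimal Web Annotation as a plain dict (hypothetical helper).

    The target points into a document via an XPath selector; the body is
    the URI of the concept that a spotter assigns to the selected fragment.
    """
    return {
        "@context": ANNO_CONTEXT,
        "id": anno_id,
        "type": "Annotation",
        "creator": creator_uri,
        "body": body_uri,
        "target": {
            "source": document_uri,
            "selector": {"type": "XPathSelector", "value": xpath},
        },
    }


# All URIs below are made up for the sake of the example.
annotation = make_annotation(
    anno_id="http://example.org/annotations/1",
    document_uri="http://example.org/papers/some-paper.html",
    xpath="/html/body/p[2]/span[3]",
    body_uri="http://example.org/concepts/quantity-expression",
    creator_uri="http://example.org/spotters/demo-spotter",
)

print(json.dumps(annotation, indent=2))
```

Because every spotter would emit the same outer structure, a viewer or evaluation tool only needs to understand `target` and `body` rather than one custom format per spotter.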


Notes

  1. Open source; code at https://github.com/jfschaefer/spotterbase.

  2. https://gl.kwarc.info/SIGMathLing/cicm23-spotterbase.


Author information

Corresponding author

Correspondence to Jan Frederik Schaefer.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Schaefer, J.F., Kohlhase, M. (2023). Towards an Annotation Standard for STEM Documents. In: Dubois, C., Kerber, M. (eds) Intelligent Computer Mathematics. CICM 2023. Lecture Notes in Computer Science, vol 14101. Springer, Cham. https://doi.org/10.1007/978-3-031-42753-4_13

  • DOI: https://doi.org/10.1007/978-3-031-42753-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42752-7

  • Online ISBN: 978-3-031-42753-4

  • eBook Packages: Computer Science, Computer Science (R0)
