Introduction

The development of sensible, shareable and scalable data models has become essential in nearly every area of science and knowledge. Data-driven approaches are rapidly increasing their impact, to the point that the focus is no longer on storage but on the ways in which data can be retrieved, explored and used [1,2,3]. Achieving this requires more sophisticated (“smarter”) approaches to data organization. Among these, we can highlight the Semantic Web, proposed by Tim Berners-Lee [4] in 2001, whose basic building blocks are outlined in Fig. 1. The goal of the Semantic Web is to add logic and structure to the data in the World Wide Web, permitting the application of reasoning schemes. Given that the Web is nothing more than a collection of linked information, these principles may be applied to any other specific body of knowledge. The information in the Semantic Web can be fetched by smart agents capable of making inferences based on the relationships between the entities, potentially answering complex questions about the data. In this context, data is identified by URIs [5] (Uniform Resource Identifiers), which provide a consistent and web-conforming notation scheme for each element. These elements are formatted as tags, using the XML [6] (eXtensible Markup Language) format. The RDF [7] (Resource Description Framework) data model then gives meaning to the tags, structuring the information through the assertion of triples of the form “subject–predicate–object”. Finally, once all entities have been introduced, it remains to define the relationships between them: in other words, to build an ontology.
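As an illustration of the triple-based data model, the following minimal sketch stores a few “subject–predicate–object” statements as plain Python tuples and retrieves them by pattern matching. It does not use any RDF library, and the namespace `EX`, the predicates `hasFormula` and `isA`, and the helper `match` are hypothetical names introduced only for this example.

```python
# Minimal triple store: each fact is a (subject, predicate, object) tuple.
# All URIs below are hypothetical placeholders, not terms from a real vocabulary.
EX = "http://example.org/chem#"

triples = {
    (EX + "water", EX + "hasFormula", "H2O"),
    (EX + "water", EX + "isA", EX + "Molecule"),
    (EX + "hydrogen", EX + "isA", EX + "Molecule"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Query: which subjects are declared to be molecules?
molecules = [s for s, _, _ in match(triples, predicate=EX + "isA", obj=EX + "Molecule")]
```

A query is therefore just a triple with some positions left open, which is the basic idea behind the pattern matching performed by Semantic Web query languages.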

Fig. 1

Base elements for the Semantic Web proposal, from Berners-Lee [4], depicted as stacked layers

Ontologies propose classes to characterize the different elements existing in a certain domain of knowledge, and then define how these classes relate to one another through properties, effectively building a representational vocabulary of the target domain [8], also known as a taxonomy. In this sense, ontologies provide a standardization of knowledge, explicitly defining common terms and structures that can be shared and reused across different communities. In addition, these definitions can be used as templates for the data and metadata required to express the entities in a field of knowledge (e.g., user input forms). The term knowledge graph (KG), sometimes expressed as knowledge base, is used for datasets that have been expressed and categorized under the class structure of an ontology: the corresponding class members are denoted as individuals of the KG. Currently, the OWL (Web Ontology Language) [9] format, an ontology-oriented extension of RDF, is the language of choice for expressing ontologies and knowledge graphs.

Moreover, this kind of well-structured information makes it easy to connect data coming from different sources, following the paradigm of Linked Data [10]. While specific areas of knowledge require specific ontologies, these (and their corresponding KGs) may be bridged by stating equivalences between their common elements. Regarding scientific data, ontologies have been widely adopted in biology and biomedicine [11,12,13], but are not yet as common in other fields such as chemistry. Among the existing chemical ontologies, for which a recent review was carried out by Pachl et al. [

Fig. 2

Schematic depiction of the directed k-representation (above) and the undirected E-representation (below), initially defined in the context of the energy span model and proposed here as a building block for OntoRXN. Here, \(k_{ij}\) symbols represent rate constants, \(E_{i}\) node energies and \(E_{ij}\) edge energies

Along these lines, one area of application where a semantic-based organization could be useful is the study of reaction mechanisms and chemical networks. Recent efforts by our group have been devoted to the development of novel open-source tools for the treatment and processing of reaction networks through graph-based approaches: amk-tools [28] and gTOFfee [29]. The latter is an application of the energy span model (ESM) developed by Sebastian Kozuch and collaborators [30, 31], extended to manage reaction networks as undirected graphs [32]. In this description, no forward or reverse direction is assumed when constructing the network: the chemical flow arises from the exergonicity of the embedded reactions. Consequently, there is a switch from the traditional k-representation, based on forward and reverse rate constants, to the E-representation [33], which is undirected and based only on energies (Fig. 2). This change of paradigm permits a much more natural pairing with computational chemistry results, which provide energies rather than rate constants as their main output, and simplifies the final graph structures.
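A minimal sketch of the E-representation as a data structure is given below, assuming a hypothetical three-node network. The node labels, the energy values (in kcal/mol) and the helpers `barrier` and `downhill` are illustrative assumptions and are not taken from gTOFfee or amk-tools.

```python
# Hypothetical network in the E-representation: node energies E_i and
# edge (transition-state) energies E_ij, in kcal/mol. Values are illustrative.
node_energy = {"A": 0.0, "B": -3.2, "C": -10.5}

# Edges are keyed by unordered pairs, so no direction is stored.
edge_energy = {
    frozenset({"A", "B"}): 12.4,
    frozenset({"B", "C"}): 8.1,
}


def barrier(start, end):
    """Energy barrier for crossing the undirected edge starting from `start`."""
    return edge_energy[frozenset({start, end})] - node_energy[start]


def downhill(a, b):
    """Orient an edge toward its lower-energy node (the exergonic direction)."""
    return (a, b) if node_energy[b] < node_energy[a] else (b, a)


forward = barrier("A", "B")    # barrier seen from A
reverse = barrier("B", "A")    # barrier seen from B, same edge
direction = downhill("B", "A")
```

A single edge energy yields both the forward and the reverse barrier, so the undirected graph needs one value per edge where the k-representation needs two rate constants.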

The concept of reaction mechanism introduced in previous ontologies, such as OntoKin, was built on the classical rate-constant-based representation. This approach, while perfectly valid, lacks the direct correspondence with computational data that the E-representation offers. Therefore, we took this undirected description of reaction networks as the foundation for a new ontology for computationally characterized reaction mechanisms: OntoRXN. We aim to directly connect this ontology with the ioChem-BD database [34, 35], a central piece of our data management workflow. ioChem-BD is a service that parses the outputs obtained from many common computational chemistry codes, such as Gaussian, ADF, VASP, MOLCAS and ORCA, among others, and stores the results in a unified CML [36,37,38,39] (Chemical Markup Language) format. The information contained in those CML files can be visualized and accessed through the ioChem-BD platform, allowing users to easily share information. Reaction networks can also be defined inside the platform, providing meaning and structure to the stored data, in line with the principles of the Semantic Web. While other projects tackling the semantic-based publication of computational chemistry results proposed the definition of new formats (e.g. CSX [40]) to overcome some limitations of CML, we believe that the connection with the already established ioChem-BD database justifies the direct use of CML. In this sense, the development of OntoRXN represents another step forward in the standardization of information, presenting knowledge graphs as a standard format combining all the information for a given reaction mechanism: the computational results from the CML files and the network structure interlinking the calculations, which embeds the chemical knowledge about the system.

Following this idea, the main guidelines for the design of OntoRXN were:

  • Apply the E-representation: networks as fully undirected graphs.

  • Use the information in the CML files from ioChem-BD: readily available and already properly tagged.

  • Aggregate individual calculations into molecule sets: chemical reactions and catalytic cycles do not usually refer to a single molecule per step, but instead group several species that have to be taken into account to preserve the number of atoms across the network.
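The third guideline can be sketched as follows: individual calculations are grouped into sets whose atom counts and energies are summed, so that every stage of the network contains the same atoms. The species names, atom counts and energies below are hypothetical placeholders, and `aggregate` is an illustrative helper rather than part of OntoRXN.

```python
from collections import Counter

# Hypothetical individual calculations; atom counts and energies (hartree)
# are illustrative values, not results of real computations.
calculations = {
    "catalyst": {"atoms": Counter({"Pd": 1, "P": 2}), "energy": -100.0},
    "substrate": {"atoms": Counter({"C": 2, "H": 4}), "energy": -50.0},
    "adduct": {"atoms": Counter({"Pd": 1, "P": 2, "C": 2, "H": 4}), "energy": -150.02},
}


def aggregate(names):
    """Sum atom counts and energies over the calculations in a molecule set."""
    atoms = Counter()
    energy = 0.0
    for name in names:
        atoms += calculations[name]["atoms"]
        energy += calculations[name]["energy"]
    return atoms, energy


# Two stages of a network: separated reactants and the bound adduct.
reactant_atoms, reactant_energy = aggregate(["catalyst", "substrate"])
product_atoms, product_energy = aggregate(["adduct"])

# Energies are only comparable when both sets contain the same atoms.
balanced = reactant_atoms == product_atoms
relative_energy = product_energy - reactant_energy
```

Comparing aggregated sets instead of single molecules keeps the relative energies meaningful, since both sides of each step then refer to the same collection of atoms.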

While there is an evident discrepancy in “directedness” between our OntoRXN proposal and the pre-existing solutions (OntoKin + OntoCompChem [27]), the two descriptions can be linked, as the k- and E-representations are equivalent. Activation free energies can be transformed into rate constants through the Eyring equation, converting our undirected, energy-based graph into a directed, rate-constant-based one. Under the ontology paradigm, this will be achieved through agents tailored to traverse the network encoded in the KG and assign the proper directionality.
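The conversion between both representations can be sketched with the Eyring equation, \(k = (k_{B}T/h)\,\exp(-\Delta G^{\ddagger}/RT)\). The example below assumes a transmission coefficient of one, free energies in kcal/mol, and a hypothetical two-edge network; the function names `eyring` and `to_directed` are illustrative and do not correspond to the agents mentioned above.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
H = 6.62607015e-34      # Planck constant, J*s
R = 8.314462618 / 4184  # gas constant converted to kcal/(mol*K)


def eyring(dg_act, temperature=298.15):
    """Rate constant (1/s) from an activation free energy in kcal/mol.

    Assumes a transmission coefficient of 1.
    """
    return (K_B * temperature / H) * math.exp(-dg_act / (R * temperature))


def to_directed(node_energy, edge_energy, temperature=298.15):
    """Turn an undirected energy graph into directed rate constants k_ij."""
    rates = {}
    for (i, j), e_ij in edge_energy.items():
        rates[(i, j)] = eyring(e_ij - node_energy[i], temperature)
        rates[(j, i)] = eyring(e_ij - node_energy[j], temperature)
    return rates


# Hypothetical network; free energies in kcal/mol are illustrative values.
node_energy = {"A": 0.0, "B": -3.2, "C": -10.5}
edge_energy = {("A", "B"): 12.4, ("B", "C"): 8.1}

rates = to_directed(node_energy, edge_energy)
```

Each undirected edge produces two directed rate constants, and their ratio depends only on the energy difference between the two connected nodes, which is why no information is lost when moving between the two descriptions.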