Background

Funding agencies, international consortia, institutional policies, and publisher requirements have helped promote the adoption of the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles [4, 41] for biomedical research data sharing, with varying degrees of success. While it is now standard to make datasets accessible and potentially reusable by depositing them in a repository, metadata standardization issues (i.e., lack of standardization in how datasets are described) continue to make it challenging for researchers to make datasets findable, interoperable, and reusable. To address these issues, domain experts and data stewards have been inspecting the gap between principle and practice [23]; extending [19], adapting [15], and adopting the principles [12]; and creating their own metadata standards [6] and data schemas [12, 16, 29]. However, a large gap remains between the communities that develop standards and the adoption of these standards by data and resource providers due to issues in communication, education/training, incentives, and the availability of supportive tools [14, 17]. For example, the Dublin Core Metadata Initiative (DCMI) provides a metadata ontology (i.e., a structured vocabulary for classifying and describing metadata) consisting of terms and data elements [9]; two general-use schema classes (i.e., sets of metadata vocabulary used to describe a conceptual entity), core and qualified; and a thorough guide for utilizing their ontology with their model-based framework for creating schemas, the Dublin Core Application Profile (DCAP) guide [8]. The DCAP guide was intended to empower data providers to mix and match Dublin Core (and other) metadata terms/elements (properties) to create new application profiles (schemas) to suit their needs. While the core (data element) schema has been widely adopted, the lack of authoring tools to help create more type/concept-specific schemas and the lack of tools for transforming schemas into working formats for consumption and implementation have hampered the adoption and implementation of DCAP [1]. Even after standardization communities successfully introduce standards, their adoption, modification, and implementation are frequently defined by widely used tools or repositories within a specific community [12].

Schema.org is a metadata vocabulary standardization project founded by major search engine companies (Google, Microsoft, Yahoo, and Yandex). It is an open source, collaborative initiative that develops metadata standards for improved searchability. While domain-neutral, Schema.org welcomes proposals and discussions of new properties and classes from anyone, including domain-specific ontology or schema development groups, via participation in their W3C group [38]. Members of metadata ontology development communities (including the aforementioned DCMI, as well as LRMI and other W3C groups) [3, 39] have been involved with, have influenced, and have successfully integrated some of their vocabulary into Schema.org [2, 32]. Schema.org already includes some biomedically relevant classes (i.e., conceptual entities) like Dataset and MedicalStudy, and applying Schema.org classes to biomedical research resources would improve interoperability, enabling researchers to readily ingest existing resources and to leverage search engine-based solutions (like Google Dataset Search) to find resources of interest. Furthermore, the hierarchical nature of schemas from Schema.org allows for inheritance of vocabulary sets (sets of properties) from parent schemas. Although there have been some efforts to leverage Schema.org to improve findability of scientific research data [20, 29, 31] and many generic repositories (like Figshare and Zenodo) are compliant, Schema.org remains largely underutilized by the biomedical research community. Bioschemas is an open and collaborative effort that has been actively promoting the use of Schema.org in the life sciences by serving as a hub for researchers to create new biomedically relevant classes, with the goal of refining and proposing these classes to Schema.org [11, 30], and by raising awareness about the usefulness of metadata schemas. The Bioschemas community has also identified the need for easy-to-use tools to help improve public accessibility and participation in the schema development process.

Here, we describe the Data Discovery Engine’s (DDE) Schema Playground, a web-based tool that makes it easier to use any registered schema or Schema.org class. Our tool allows users to easily find and visualize relevant classes from Schema.org, Bioschemas, BioLink [5], and others; extend them; create JSON Schema validation rules [22]; and save/share the newly created classes for others to reuse. Our tool expresses schemas in JSON-LD format, improving interoperability of schemas which might otherwise be viewed only as HTML tables. Our tool also includes a framework for building data registries and creating guides for data submission; however, the implementation and integration of these features on our site is restricted to partner organizations. We introduce the features of this tool, review its value to different types of users, demonstrate its application towards the creation of a new schema for COVID-19-related resources, and discuss its adoption by the Bioschemas metadata standardization community.

Implementation

The Data Discovery Engine’s Schema Playground is a browser-based tool built with Vue.js [43], Python/Tornado [35], and the BioThings Software Development Toolkit (Additional file 2: Supplemental Table 1). The Bioschemas community has a few well-documented tools for schema development, but many of those tools were only available as source code and required basic programming experience. We focused our efforts on user-friendly features for schema creation and reuse, resulting in a web-based application that empowers individual data resource providers to utilize and customize existing schemas from Schema.org and other similar efforts. As seen in Table 1, these features include:

Table 1 Comparison of Schema.org, Bioschemas, and DDE Schema Playground
  1. Searching and viewing schemas from Schema.org and other metadata standardization efforts

    The DDE Schema Playground allows for the visualization of JSON-LD-formatted schemas hosted online, either on GitHub or elsewhere (Additional file 1: Supplemental Figure 1A). This allows users who are familiar with Schema.org to review their compliant schema in a more human-readable format. The DDE Schema Playground also has a searchable registry of classes from Schema.org, BioLink, BioThings, Bioschemas, and others. Users may browse and visualize the schemas for various classes from these sources to identify the classes of most interest to them (Additional file 1: Supplemental Figure 1B). If a community like Bioschemas or a consortium like the National COVID Cohort Collaborative (N3C) [13] is interested in making a new schema available for searching and viewing, they can import and register their JSON-LD-formatted schema. The DDE Schema Playground also enables users to compare up to four schemas. For example, there are multiple Dataset schemas available in the registry, and users can compare them to see which properties are unique to each and which properties they share (Additional file 1: Supplemental Figure 1C).

  2. Extending and customizing a pre-existing schema for a particular use

    The ability to browse and inspect pre-existing schemas makes it easier for a user to customize or extend the schema to suit their own purpose. All the properties from the pre-existing schema will be inherited in the extended schema; however, the user may select properties for which validation is desirable. The user can also create new properties to be included in the extended schema. For example, the Dataset class from Schema.org serves as a potential foundation, but a schema focused on COVID-19-related datasets may need additional fields (e.g., infectiousAgent). To tailor the Dataset schema, we find it in the registry and extend it (Additional file 1: Supplemental Figure 2A). After we create a name for our schema (the namespace) and the class, we can customize it. We can choose to include any property that is available from the schema we are extending (Additional file 1: Supplemental Figure 2B), and we can create new properties (e.g., infectiousAgent) that are tailored to our needs (Additional file 1: Supplemental Figure 2C). This feature also serves as an easy way to maintain Bioschemas profiles, as users can update a registered profile by extending from it, making the necessary changes, and pushing them back to Bioschemas. Outside the tool, customizing an existing schema in an interoperable format and making it viewable online in a human-friendly form requires manually writing or editing JSON-LD, YAML, TTL, SHACL, ShEx, or other file types and running command-line tools.

  3. Creating validation for the schema for data quality enforcement

    Marginality (whether a property is required or not) and cardinality (whether a property can have one or multiple values) are two aspects of schema properties that are not expressed well by Schema.org but are desirable to biomedical researchers (Additional file 1: Supplemental Figure 3A and 3C). In the DDE Schema Playground, this is handled via the creation of JSON Schema validation rules, and the DDE’s Schema Validation Editor provides a simple drag-and-drop mechanism to create straightforward validations (Additional file 1: Supplemental Figure 3B). For slightly more complex validations, the user can edit the validation rule before dragging and dropping it onto the property of interest. In our example Dataset schema, we may want to restrict the values for our new property (infectiousAgent) such that they map to and are standardized by an ontology. We edit the example JSON Schema validation rules for an ontology to tailor them to the NCBI Taxon ontology (Additional file 1: Supplemental Figure 3D). Schema development working groups often leverage the work done by ontology groups to ensure that the values of a property are standardized. Once these JSON Schema validation rules have been created, they can be used to test the validity of JSON-LD-formatted metadata using any of the many third-party metadata validation tools and program libraries that are already available. Future features of the DDE will include a built-in metadata validation tool.

  4. Exporting and saving a schema generated by the Schema Playground editor

    The DDE Schema Playground allows users to export/download a newly created schema locally. It is also integrated with GitHub, allowing users to save the schema to their GitHub repository (Additional file 1: Supplemental Figure 4A-C). The integration with GitHub allows edits to the schema to be made by multiple parties and gives the schema owner the option of pulling changes to the schema. Additionally, the schema can be forked and edited/customized, allowing for re-use of the schemas, which in turn improves the findability and reusability of resources that follow the schemas.

  5. Registering a newly created schema in the DDE schema registry to facilitate its extension and re-use

    Once saved in GitHub, users can review their schema with the schema viewer and add it to the registry to enable others to easily re-use it (Additional file 1: Supplemental Figure 1A). This provides a user-friendly interface for editing, customizing, and re-using schemas for those who prefer not to manually edit text and format in JSON-LD.

    The DDE Schema Playground offers any user the ability to reuse and extend existing schemas. This tool is primarily intended to assist in the authoring of schemas for use in other applications. In addition, we have converted three Dataset schemas into "guides", which are web-based forms for annotating resources using schemas authored in the DDE Schema Playground. Annotations created using these guides are stored within a resource registry hosted within the DDE. There are currently three public guides based on the Dataset schemas for the Outbreak.info web application [28], the N3C initiative, and the CD2H consortium [7]. While the creation of guides from schemas is not a fully automated feature that is available to all users, most of the underlying components are reusable, and additional guides can be constructed and hosted within the DDE through collaboration. The Bioschemas community has integrated the DDE Schema Playground as part of its schema creation and update process to improve participation by members who lack the programming expertise needed to participate via their previous pipeline.
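The validation step described in item 3 can be illustrated with a small, self-contained sketch. The rules below are written in JSON Schema style for a hypothetical extended Dataset class: `required` stands in for marginality and `string` versus `array` stands in for cardinality. The property set, the `NCBITaxon:` identifier pattern, and the `check_record` helper are assumptions made for illustration; they are not taken from the DDE itself, and a real project would more likely rely on an existing JSON Schema validation library.

```python
import re

# JSON Schema style rules for a hypothetical extended Dataset class.
# "required" expresses marginality; "string" vs "array" expresses cardinality.
VALIDATION_RULES = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        # Assumed identifier format for NCBI Taxonomy terms (illustrative only).
        "infectiousAgent": {"type": "string", "pattern": "^NCBITaxon:[0-9]+$"},
    },
    "required": ["name", "infectiousAgent"],
}


def check_record(record, rules):
    """Return a list of problems; an empty list means the record passed.

    Only 'required', 'type' (string/array), and 'pattern' are handled here.
    Array items are not inspected.
    """
    problems = []
    for prop in rules.get("required", []):
        if prop not in record:
            problems.append(f"missing required property: {prop}")
    for prop, rule in rules.get("properties", {}).items():
        if prop not in record:
            continue
        value = record[prop]
        expected = rule.get("type")
        if expected == "array" and not isinstance(value, list):
            problems.append(f"{prop}: expected a list of values")
        elif expected == "string":
            if not isinstance(value, str):
                problems.append(f"{prop}: expected a single string")
            elif "pattern" in rule and not re.search(rule["pattern"], value):
                problems.append(f"{prop}: does not match {rule['pattern']}")
    return problems


good = {
    "name": "SARS-CoV-2 genomes",
    "keywords": ["COVID-19"],
    "infectiousAgent": "NCBITaxon:2697049",
}
bad = {"keywords": "COVID-19", "infectiousAgent": "SARS-CoV-2"}

print(check_record(good, VALIDATION_RULES))
print(check_record(bad, VALIDATION_RULES))
```

The second record fails on all three aspects: a required property is absent, a multi-valued property holds a single string, and the ontology-restricted value is free text.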

Creating the COVID-19 Outbreak schema using the Schema Playground

Schema.org classes are often simultaneously too broad (lacking properties needed) and too narrow (including too many irrelevant properties) for a specific research purpose. For this reason, it becomes necessary to adapt schemas to suit the needs of a biomedical research project. Outbreak.info is a project from the Su, Wu, and Andersen labs at Scripps Research to unify COVID-19 and SARS-CoV-2 epidemiology and genomic data, published research, and other resources [10, 37]. The standardization of published research and other resources was accomplished by creating a single, multiclass schema to harmonize the metadata: the COVID-19 Outbreak schema. This schema can be found in the DDE registry at https://discovery.biothings.io/view/outbreak/ and was built via the DDE Schema Playground with some manual editing (for merging all the classes into a single schema). There are six principal classes in the Outbreak schema (Analysis, Dataset, ClinicalTrial, ComputationalTool, Protocol, Publication) and many subclasses to support the principal classes. As seen in Table 2, the classes in the Outbreak schema were extended from related Schema.org classes (whenever available) and were created based on metadata comparisons from a variety of related sources. By extending from existing schemas, we reuse existing metadata properties when appropriate and create new properties only when necessary.

Table 2 Classes in the Outbreak schema and how they were created and used

For example, the level of detail provided by the Protocol Registration System (PRS) schema used by the National Clinical Trial (NCT) registry is more granular than Schema.org’s MedicalStudy class, but broad enough that it encompasses properties from both child classes of MedicalStudy (MedicalTrial and MedicalObservationalStudy). The child classes of MedicalStudy only differ in the property name for the study design (trialDesign vs studyDesign), and this property is not delineated in PRS. Further, the PRS includes many properties not currently available in any of these Schema.org classes. Adopting the PRS directly was also problematic, as we planned to ingest records from other registries like the World Health Organization’s Clinical Trial registry (WHOCT), and the PRS was also more granular than WHOCT. For this reason, the Outbreak.info ClinicalTrial class was created by using the DDE to extend from Schema.org, leveraging the PRS-WHO crosswalk [42], and creating properties that could help with issues previously identified [26].

In addition to adapting Schema.org classes to normalize record data from multiple sources within a class, Outbreak.info needed to normalize common metadata properties between different classes. The hierarchical nature of Schema.org classes simplified this process, as many derivative classes inherit properties from the Thing class. For example, the Protocol class in the Outbreak schema was extended from the HowTo class in Schema.org and was based on properties identified from available metadata in protocols.io and the LabProtocol profile from Bioschemas. Since both the Schema.org classes, MedicalStudy and HowTo, are derivatives of Thing, the Outbreak schema naturally has properties in common across multiple classes and can normalize the metadata across these classes allowing for cleaner query design and improved search functionality. This schema is currently used to harmonize and improve FAIRness of metadata from over 300,000 resource entries in the Outbreak.info research library at https://outbreak.info/resources.
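The way shared properties fall out of the class hierarchy can be sketched in a few lines. The parent links among the Schema.org classes below follow Schema.org, and Protocol extending HowTo follows the text above; the choice of MedicalStudy as the parent of ClinicalTrial, the abbreviated property lists, and the two class-specific property names marked as hypothetical are assumptions for illustration only.

```python
# Toy class hierarchy: child -> parent (None marks the root).
PARENT = {
    "Thing": None,
    "CreativeWork": "Thing",
    "HowTo": "CreativeWork",
    "MedicalEntity": "Thing",
    "MedicalStudy": "MedicalEntity",
    "Protocol": "HowTo",              # extended from HowTo, as described above
    "ClinicalTrial": "MedicalStudy",  # assumed parent, for illustration
}

# Properties declared directly on each class (abbreviated lists).
OWN_PROPERTIES = {
    "Thing": {"name", "description", "identifier", "url"},
    "CreativeWork": {"author", "datePublished"},
    "HowTo": {"step"},
    "MedicalEntity": {"code"},
    "MedicalStudy": {"sponsor", "studyLocation"},
    "Protocol": {"protocolSetting"},      # hypothetical property name
    "ClinicalTrial": {"eligibilityNote"}, # hypothetical property name
}


def all_properties(class_name):
    """Collect a class's own properties plus everything inherited from its ancestors."""
    collected = set()
    current = class_name
    while current is not None:
        collected |= OWN_PROPERTIES.get(current, set())
        current = PARENT[current]
    return collected


def shared_properties(*class_names):
    """Properties available on every one of the given classes."""
    property_sets = [all_properties(name) for name in class_names]
    return set.intersection(*property_sets)


print(sorted(shared_properties("Protocol", "ClinicalTrial")))
```

In this toy hierarchy the only common ancestor of the two classes is Thing, so the shared set is exactly the properties declared on Thing, which is what allows a single query over fields such as name or description to span both record types.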

Adoption of the Schema Playground into the Bioschemas schema development and maintenance pipeline

Previously, the pipeline for updating a Bioschemas specification involved the use of a Google spreadsheet for attaining community consensus, a command-line tool for converting the CSV from the spreadsheet to YAML, cloning the Bioschemas website repository and copying/editing HTML and YAML files, running Jekyll to test the changes, editing example files in the Bioschemas specifications repository, and creating pull requests for the Bioschemas website repository once everything performed as expected in testing. The level of expertise needed to update a specification has been discussed in multiple Bioschemas community calls as a potential barrier to participation. After initial tests during and after Biohackathon 2021, the Bioschemas community decided to adopt the DDE into its schema development and maintenance pipeline. Manuals for using the DDE to create or update Bioschemas specifications have been developed, and automated scripts using GitHub Actions have been developed to more tightly integrate the tool into the pipeline. As seen in Fig. 3, the process for updating a Bioschemas profile requires less technical expertise after the integration of the DDE. While the process both before and after the DDE still requires the ability to edit a YAML/JSON file (brown) and the ability to use GitHub (black), the DDE-based process does not require the user to have the technical knowledge needed to run tools via the command line (green) or to use Jekyll (blue).

Fig. 3 The Bioschemas profile update process before (left) and after (right) the inclusion of the DDE

Discussion

In an effort to make scientific resources more FAIR, communities in the biological sciences (Bioschemas), earth sciences (ESIP’s Science on Schema.org cluster), and more are working diligently to align and influence Schema.org to suit the needs of the scientific research community [11, 33]. These communities serve as an important bridge between domain-specific ontology development groups and the domain-neutral Schema.org by introducing Schema.org to the scientific research resource providers, identifying existing ontologies to leverage, and creating tailored schemas more suitable for the research community. For example, ontology development groups, like PPEO/MIAPPE [24] and EDAM [18], have been consulted or have participated in the development of the Sample and ComputationalWorkflow schemas by the corresponding Bioschemas working groups. A term from PPEO/MIAPPE was included as a property in the Sample schema, while JSON schema validation rules enforce the use of terms from EDAM as values for certain properties in the ComputationalWorkflow schema.

Although communities like Bioschemas have helped to create more relevant classes or improve existing classes, it is difficult to push these suggestions to Schema.org without compelling use cases or widespread adoption of these tailored classes. For example, the Bioschemas community first developed the Gene class (with input from gene resource providers, Gene Ontology proponents and gene resource consumers) in 2018. However, it was not included as a pending class in Schema.org until 2021 due to a lack of widespread adoption. The Bioschemas community spent considerable time and effort on education and training in order to increase the adoption of Bioschemas classes; however, participation in the development of the classes was hampered by the technological expertise needed in order to update a Bioschemas class. The availability of user-friendly tools can make it easier to find and use Schema.org and other community-driven schema classes, and empower data providers and researchers to engage in schema authoring and sharing.

Most tools for utilizing existing Schema.org classes focus on the utilization of an existing schema (such as markup generation) and lack the ability to customize the schema in a Schema.org-compliant way. Tools that do allow customizing/creating a schema (e.g., Bioschemas GoWeb) often require some degree of programming. The DDE Schema Playground is a browser-based tool that enables members of the research community to easily adapt schemas to suit their needs and to enable community re-use of their schemas through the DDE schema registry. This encourages and empowers researchers to structure their data in a Schema.org-compliant fashion earlier in the scientific research process rather than as an afterthought. Schema authoring by the research community, for the research community, will encourage the creation and adoption of new classes and properties which may previously have been neglected due to the absence of representation (e.g., volunteers with subject matter expertise) in data standardization communities. In this fashion, the DDE Schema Playground allows researchers to express and share their data structuring needs with the data standardization community without diverting attention away from their primary research efforts. Data standardization communities also benefit because their volunteer time can be concentrated on classes already in use by researchers (but which could benefit from some standardization) and diverted away from classes which lack interest/support from the research community at large.

There are many ways to express schemas (e.g., SHACL, ShEx), but the DDE only supports the expression of schemas in JSON-LD/JSON Schema format due to the widespread adoption of the JSON-LD format by resource providers and library/tool developers. In addition to this restriction, there are important limitations as to what can or cannot be registered into the DDE schema registry. Schema registration in the DDE is currently limited to class-based schemas (i.e., classes described by sets of properties) rooted in Schema.org, while many well-used, domain-neutral metadata ontologies (such as DCMI) and schemas have properties that are not necessarily tied to any class. These classless metadata vocabularies intentionally do not group the properties into classes in order to encourage the mix-and-match of properties. Although classless metadata vocabularies cannot be registered in their entirety as classless properties in the DDE at this time, the DDE can flexibly ingest properties from any metadata vocabulary (whether or not it is class-based) as long as it is properly formatted (i.e., conforms to JSON-LD/JSON Schema formatting). This means that users can build their schema by extending from Schema.org, Bioschemas, or any registered schema, and incorporate properties from OWL, DCMI, or any other accessible vocabulary as needed. For example, all Bioschemas profile classes also include the conformsTo property from DCMI, and the NIAID Dataset schema [36] also leverages properties from OWL. In theory, classes inheriting just a single property from a Schema.org class, but otherwise built entirely from other metadata ontologies, can be viewed and registered in the DDE.
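As a rough illustration of mixing vocabularies within one class-based schema, the sketch below builds a small JSON-LD-style graph in which a hypothetical class rooted in Schema.org's Dataset draws properties from DCMI, OWL, and its own namespace, and then groups the properties by the vocabulary they come from. The `ex` namespace, its URL, the choice of `owl:sameAs`, and the `properties_by_vocabulary` helper are assumptions for illustration and are not the DDE's actual representation.

```python
# Namespace prefixes; "ex" and its URL are hypothetical.
CONTEXT = {
    "schema": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dct": "http://purl.org/dc/terms/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "ex": "https://example.org/ex/",
}

# One class rooted in Schema.org plus properties borrowed from other vocabularies.
GRAPH = [
    {
        "@id": "ex:Dataset",
        "@type": "rdfs:Class",
        "rdfs:subClassOf": {"@id": "schema:Dataset"},
    },
    {
        "@id": "dct:conformsTo",
        "@type": "rdf:Property",
        "schema:domainIncludes": {"@id": "ex:Dataset"},
    },
    {
        "@id": "owl:sameAs",  # illustrative choice of an OWL property
        "@type": "rdf:Property",
        "schema:domainIncludes": {"@id": "ex:Dataset"},
    },
    {
        "@id": "ex:infectiousAgent",
        "@type": "rdf:Property",
        "schema:domainIncludes": {"@id": "ex:Dataset"},
    },
]


def properties_by_vocabulary(graph, context):
    """Group property local names by the namespace IRI of their prefix."""
    grouped = {}
    for node in graph:
        if node.get("@type") != "rdf:Property":
            continue  # skip class definitions
        prefix, _, local_name = node["@id"].partition(":")
        namespace = context.get(prefix, prefix)
        grouped.setdefault(namespace, []).append(local_name)
    return {namespace: sorted(names) for namespace, names in grouped.items()}


for namespace, names in properties_by_vocabulary(GRAPH, CONTEXT).items():
    print(namespace, names)
```

The class itself remains anchored to Schema.org through its `rdfs:subClassOf` link, while each property keeps the identifier of the vocabulary that defines it.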

We tested the use of the DDE Schema Playground to create customized Schema.org-compliant classes that could be used to normalize metadata between multiple types (datasets, clinical trials, publications, etc.) of COVID-19-related resources and applied these schemas towards a searchable resource site (https://outbreak.info). The Outbreak resource schema is available in the DDE schema registry, which also includes schemas from Schema.org, Bioschemas, BioLink, the National COVID Cohort Collaborative (N3C), the National Institute of Allergy and Infectious Diseases (NIAID), and more. We hope others will join us in making their open data more interpretable, interoperable, and reusable by adding their schemas to the schema registry.

Conclusion

We have created a user-friendly browser-based tool which facilitates the application of Schema.org towards biomedical research outputs. We demonstrate its use with the creation of the Outbreak.info schema and its adoption into the Bioschemas schema development pipeline, and we encourage others to register and reuse Schema.org-compliant schemas. We welcome user feedback, which has helped and continues to help identify desirable new features and tools (e.g., metadata validation tools) which will be added in the near future.

Availability and requirements

Project name: Data Discovery Engine Schema Playground.

Project home page: https://discovery.biothings.io/schema-playground

Project source code: https://github.com/biothings/discovery-app

Operating system(s): Web-based, Platform independent.

Programming language: JavaScript, Python.

Other requirements: GitHub account for schema editing.

License: Creative Commons Attribution 4.0 International license (Content), Apache 2.0 (source code).

Any restrictions to use by non-academics: No.