Background

Curated databases, such as UniProt [1], DrugBank [2], CTD [3], IUPHAR/BPS [4], Reactome [5], OMIM [6], or COSMIC [7], are pivotal to the development of biomedical science. Such databases are usually populated and updated through expensive and time-consuming human effort [8], which slows down the biological knowledge discovery process. To overcome this limitation, Biomedical Information Extraction (BioIE) aims to shift population and curation processes to machines by developing effective computational tools that automatically extract meaningful facts from the vast unstructured scientific literature [9, 10]. Once extracted, machine-readable facts can be fed to downstream tasks to ease biological knowledge discovery. Among the various tasks, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges in advancing precision medicine and drug discovery [11], as it helps to understand the genetic causes of diseases [12]. Thus, the automatic extraction and curation of GDAs is key to advancing precision medicine research and providing knowledge to assist disease diagnostics, drug discovery, and therapeutic decision-making.

Most datasets for GDA extraction are hand-labeled corpora [13,14,15]. Among them, EU-ADR [13] contains only a small number of GDA instances, making it difficult to train robust RE models for GDA extraction. PolySearch [14], on the other hand, focuses on only ten specific diseases, which is not sufficient to develop comprehensive models. Similarly, CoMAGC [15] only comprises gene-cancer associations for prostate, breast, and ovarian cancers. Hence, all of these datasets lack the GDA heterogeneity needed to train effective RE models. Furthermore, hand-labeling data is an expensive process requiring large amounts of time from expert biologists and, therefore, all of these datasets are limited in size.

To address this limitation, distant supervision has been proposed [16]. Under the distant supervision paradigm, all the sentences mentioning the same pair of entities are labeled with the corresponding relation stored within a source database. The assumption is that if two entities participate in a relation, at least one sentence mentioning them conveys that relation. As a consequence, distant supervision generates a large number of false positives, since not all sentences express the relation between the considered entities. To counter false positives, the RE task under distant supervision can be modeled as a Multi-Instance Learning (MIL) problem [17,18,19,20]. With MIL, the sentences containing two entities connected by a given relation are collected into bags labeled with that relation. Grouping sentences into bags reduces noise, as a bag of sentences is more likely to express a relation than a single sentence. Thus, distant supervision alleviates manual annotation efforts, and MIL increases the robustness of RE models to noise.

Since the advent of distant supervision, several datasets for RE have been developed under this paradigm for news and web domains [16, 18,

  • text: sentence from which the GDA was extracted.

  • relation: relation name associated with the given GDA.

  • h: JSON object representing the gene entity, composed of:

    • id: NCBI Entrez ID associated with the gene entity.

    • name: NCBI official gene symbol associated with the gene entity.

    • pos: list consisting of starting position and length of the gene mention within text.

  • t: JSON object representing the disease entity, composed of:

    • id: UMLS Concept Unique Identifier (CUI) associated with the disease entity.

    • name: UMLS preferred term associated with the disease entity.

    • pos: list consisting of starting position and length of the disease mention within text.

  • If a sentence contains multiple gene-disease pairs, the corresponding GDAs are split into separate data records (an illustrative record is sketched below).
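For illustration, a record following this schema might look like the minimal Python sketch below; the sentence, identifiers, offsets, and the exact relation label string are hypothetical and only meant to show the structure (whether identifiers are stored as strings or integers may also differ in the released files).

import json

# Hypothetical TBGA-style record; values are illustrative, not taken from the dataset.
record = {
    "text": "BRCA1 mutations are associated with breast carcinoma.",
    "relation": "genomic_alterations",  # one of the TBGA relations, or NA
    "h": {                              # gene entity
        "id": "672",                    # NCBI Entrez ID
        "name": "BRCA1",                # NCBI official gene symbol
        "pos": [0, 5],                  # [start offset, mention length] within text
    },
    "t": {                              # disease entity
        "id": "C0678222",               # UMLS CUI
        "name": "Breast Carcinoma",     # UMLS preferred term
        "pos": [36, 16],                # [start offset, mention length] within text
    },
}
print(json.dumps(record))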

    Overall, TBGA contains over 200,000 instances and 100,000 bags. Table 1 reports per-relation statistics for the dataset. Notice the large number of Not Associated (NA) instances. Moreover, Fig. 1 depicts the 20 most frequent genes, diseases, and GDAs within TBGA. The most frequent genes are tumor suppressor genes, such as TP53 and CDKN2A, and (proto-)oncogenes, like EGFR and BRAF. Among the most frequent diseases, we have neoplasms such as breast carcinoma, lung adenocarcinoma, and prostate carcinoma. As a consequence, the most frequent GDAs are gene-cancer associations.

Table 1 Per-relation statistics for TBGA

Fig. 1 The 20 most frequent genes, diseases, and GDAs within TBGA

TBGA is two orders of magnitude larger than currently available datasets for GDA extraction [13,14,15, 27, 28]. Moreover, TBGA covers different association types, whereas most current datasets only consider positive, negative, or false GDAs. The only exception is CoMAGC [15], whose relations focus on different aspects of gene expression changes and their association with cancer. Therefore, training and then testing RE models on TBGA allows for a more fine-grained and realistic evaluation that helps build effective solutions for GDA extraction. Table 2 compares global statistics between the TBGA, EU-ADR [13], CoMAGC [15], PolySearch [14], GAD [27], and GDAE [28] datasets.

    Table 2 Global statistics comparison between TBGA, EU-ADR [13], CoMAGC [15], PolySearch [14], GAD [27], and GDAE [28] datasets

On the other hand, TBGA can be compared with current large-scale, fully distantly-supervised BioRE datasets such as DTI [10] and BioRel [24]. Despite its use of expert-curated data, TBGA has a size comparable to that of these fully distantly-supervised BioRE datasets. Besides, with the continuous growth of DisGeNET, the size of TBGA can further increase. Table 3 compares global statistics between the TBGA, DTI [10], and BioRel [24] datasets.

    Table 3 Global statistics comparison between TBGA, BioRel [24], and DTI [10] datasets

    Discussion

    Data validation

In order to validate TBGA, we conducted comprehensive experiments with state-of-the-art RE models under the MIL setting, which is the typical setting used for distantly-supervised RE: sentences are divided into bags based on entity pairs, and relations are predicted at the bag level. For example, the following two instances compose the “ADM-Schizophrenia” bag, whose target relation is Biomarker. Instance 1: “Our data support that ADM may be associated with the pathophysiology of schizophrenia, although the cause of the association needs further study.” Instance 2: “These findings suggest the possible role of ADM and SEPX1 as biomarkers of schizophrenia.”
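As a rough sketch of how such bags can be assembled from TBGA-style records, the snippet below groups sentences by their gene-disease pair; the function and key names are ours and not part of any released code, and the sketch assumes each pair carries a single relation label.

from collections import defaultdict

def group_into_bags(records):
    """Group TBGA-style sentence records into MIL bags keyed by gene-disease pair.

    Returns a mapping (gene_id, disease_id) -> {"relation": ..., "sentences": [...]};
    under MIL, relations are then predicted for whole bags rather than single sentences.
    """
    bags = defaultdict(lambda: {"relation": None, "sentences": []})
    for rec in records:
        key = (rec["h"]["id"], rec["t"]["id"])
        bags[key]["relation"] = rec["relation"]
        bags[key]["sentences"].append(rec["text"])
    return bags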

    Below, we first describe the experimental setup and then present the results.

    Experimental setup

    Datasets

    We performed experiments on three different datasets: TBGA, DTI, and BioRel. We used TBGA as a benchmark to evaluate RE models for GDA extraction under the MIL setting. On the other hand, we used DTI and BioRel only to validate the soundness of our implementation of the baseline models.

    Evaluation measures

We evaluated RE models using the Area Under the Precision-Recall Curve (AUPRC). AUPRC is a popular measure for evaluating distantly-supervised RE models; it has been adopted by OpenNRE [35] and used in several works, such as [10, 24]. For experiments on TBGA, we also computed Precision at k items (P@k) and plotted the precision-recall curves.
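For reference, a minimal sketch of how these measures can be computed is given below; it uses scikit-learn's average precision as an estimator of AUPRC over flattened (bag, relation) scores, which mirrors common practice but is not necessarily the exact protocol implemented in OpenNRE.

import numpy as np
from sklearn.metrics import average_precision_score

def auprc(y_true, y_score):
    """Estimate the area under the precision-recall curve via average precision.

    y_true: 1 if the scored (bag, relation) pair is a true positive, else 0.
    y_score: model confidence for that (bag, relation) pair.
    """
    return average_precision_score(y_true, y_score)

def precision_at_k(y_true, y_score, k):
    """Precision among the k highest-scoring predictions."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Toy usage with made-up scores.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auprc(y_true, y_score), precision_at_k(y_true, y_score, k=3))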

    Aggregation strategies

    We adopted two different sentence aggregation strategies to use RE models under the MIL setting: average-based (AVE) and attention-based (ATT) [38]. The average-based aggregation assumes that all sentences within the same bag contribute equally to the bag-level representation. In other words, the bag representation is the average of all its sentence representations. On the other hand, the attention-based aggregation represents each bag as a weighted sum of its sentence representations, where the attention weights are dynamically adjusted for each sentence.
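The two strategies can be summarised with the following NumPy sketch, where sent_reps are encoder outputs for the sentences of one bag and rel_query stands for a learned relation query vector; this is a schematic dot-product view of selective attention, not the exact formulation of [38].

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate_ave(sent_reps):
    """AVE: the bag representation is the plain mean of its sentence representations."""
    return sent_reps.mean(axis=0)

def aggregate_att(sent_reps, rel_query):
    """ATT (schematic): weight each sentence by its similarity to a relation query,
    so that noisy sentences contribute less to the bag representation."""
    scores = sent_reps @ rel_query   # one scalar score per sentence
    weights = softmax(scores)        # attention weights sum to 1
    return weights @ sent_reps       # weighted sum of sentence vectors

# Toy bag of three 8-dimensional sentence vectors and a made-up relation query.
bag, query = np.random.randn(3, 8), np.random.randn(8)
print(aggregate_ave(bag).shape, aggregate_att(bag, query).shape)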

    Baseline models

    We considered the main state-of-the-art RE models to perform experiments: CNN [39], PCNN [40], BiGRU [10, 24, 41], BiGRU-ATT [10, 42], and BERE [10]. A detailed description of these RE models, along with information on parameter settings and hyper-parameter tuning, can be found in Additional file 1.

    Experimental results

    We report the results for two different experiments. The first experiment aims to validate the soundness of the implementation of the considered RE models. To this end, we trained and tested the RE models on DTI and BioRel datasets, and we compared the AUPRC scores we obtained against those reported in the original works [10, 24]. For this experiment, we only compared the RE models and aggregation strategies that were used in the original works. The results and discussion of the experiment can be found in Additional file 2. The second experiment uses TBGA as a benchmark to evaluate RE models for GDA extraction. In this case, we trained and tested all the considered RE models using both aggregation strategies. For each RE model, we reported the AUPRC and P@k scores, and we plotted the precision-recall curve.

    GDA benchmarking

Table 4 shows the AUPRC and P@k scores of RE models on TBGA, whereas Fig. 2 plots the corresponding precision-recall curves. Given the models' performance and precision-recall curves, we make the following observations.

1. The performance achieved by RE models on TBGA indicates the high complexity of the GDA extraction task. When recall is smaller than 0.1, all RE models have precision greater than 0.7. However, at higher recall values, model performance decreases sharply. In particular, when recall is greater than 0.4, no RE model achieves precision values greater than or equal to 0.5. The task complexity is further supported by the lower performance obtained by top-performing RE models on TBGA compared to DTI and BioRel (cf. Additional file 2: Table S2).

2. CNN, PCNN, BiGRU, and BiGRU-ATT behave similarly. Among them, BiGRU-ATT performs worst, which suggests that replacing the BiGRU max pooling layer with an attention layer is less effective. Overall, the best AUPRC and P@k scores are achieved by BERE when using the attention-based aggregation strategy. This highlights the effectiveness of fully exploiting sentence information from both semantic and syntactic aspects [10]. BERE's top performance can also be observed in its precision-recall curve, which remains consistently above the other curves up to recall 0.4, where it stabilizes with the others. Nevertheless, most RE models, regardless of the considered aggregation strategy, show precision drops at early recall values (below 0.4).

3. In terms of AUPRC, the attention-based aggregation proves less effective than the average-based one, while it provides mixed results on the P@k measures. Although in contrast with the results obtained in general-domain RE [38], this trend is in line with the results found by Xing et al. [24] on BioRel, where RE models using an average-based aggregation strategy achieve performance comparable to or higher than those using an attention-based one. The only exception is BERE, which performs better with the attention-based aggregation than with the average-based one.

    Thus, the obtained results suggest that TBGA is a challenging dataset for GDA extraction and, in general, for BioRE.

Table 4 RE model performance on the TBGA dataset

Fig. 2 Precision-recall curves for RE models on the TBGA dataset. RE models are evaluated using both aggregation strategies, that is, average-based (AVE) and attention-based (ATT); therefore, precision-recall curves are plotted for each aggregation strategy

    Re-use potential

TBGA complies with the format required by OpenNRE [35] to train and test RE models. We chose to structure the dataset in this way to ease its re-use by future researchers. OpenNRE already provides several RE models that can be used directly on TBGA. In addition, we have also used OpenNRE to implement widely-used RE models that were missing from the library.

    We used TBGA as a benchmark to evaluate RE models under the MIL setting—which is the typical setting for the RE task under distant supervision. In other words, we trained and tested RE models at bag-level. However, TBGA contains sentence-level expert-curated annotations in validation and test sets. Thus, researchers can also use TBGA to train RE models at bag-level and evaluate them on sentence-level expert-curated data—which is an emerging setting for distantly-supervised, manually enhanced datasets [25, 26]. To this end, no format changes are required to make TBGA compliant with the alternative setting.

    Conclusions

    We have presented a large-scale, semi-automatically annotated dataset for Gene-Disease Association (GDA) extraction. Automatic GDA extraction is one of the most relevant tasks of BioRE. We have used TBGA as a benchmark to evaluate state-of-the-art BioRE models on GDA extraction. The results suggest that TBGA is a challenging dataset for this task. Besides, the large size of TBGA—along with the presence of expert-curated annotations in its validation and test sets—makes it more realistic than fully distantly-supervised BioRE datasets.

    Methods

    The process to create TBGA consisted of four steps: data acquisition, data cleaning, distant supervision, and dataset generation. Figure 3 illustrates the overall procedure.

Fig. 3 Overview of the TBGA creation process. The process consists of four steps: (1) data acquisition; (2) data cleaning; (3) distant supervision; and (4) dataset generation

    Data acquisition

The data used to generate TBGA comes from DisGeNET [12]. DisGeNET collects data on genotype-phenotype relationships from several resources and covers most human diseases, including Mendelian, complex, environmental, and rare diseases, as well as disease-related traits. According to the type of resource, DisGeNET organizes gene-disease data into one of four categories: Curated, Animal Models, Inferred, and Literature. Curated data contains GDAs provided by expert-curated resources; Animal Models data includes GDAs from resources containing information about rat and mouse models of disease; Inferred data refers to GDAs inferred from the Human Phenotype Ontology (HPO) [43] and from Variant-Disease Associations (VDAs); and Literature data provides GDAs extracted from the scientific literature using text-mining techniques [27, 44, 45]. For a seamless integration of such GDAs, DisGeNET classifies them by different association types, which are defined in the DisGeNET association type ontology. A detailed description of each association type can be found on the DisGeNET platform [46]. Figure 4 depicts the DisGeNET association type ontology, where we also report the Semanticscience Integrated Ontology (SIO) [47] identifiers of the different association types.

Fig. 4 DisGeNET association type ontology. For each association type, we also report its SIO identifier

    We acquired data from DisGeNET v7.0 to build TBGA. This version of DisGeNET contains 1,134,942 GDAs, involving 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes. We accessed DisGeNET data through the web interface [46], where we used the Browse functionality to retrieve GDAs along with supporting evidence. We gathered data from all four resource categories. Moreover, we filtered out data with no PubMed IDentifier (PMID) to avoid retrieving GDAs without a sentence supporting the association.

    Data cleaning

The data acquired from DisGeNET underwent a data cleaning process. First, we filtered the data based on the presence of tags surrounding the gene and disease mentions within sentences; that is, we kept only GDAs with representative sentences in which the gene and the disease are highlighted. Then, we stripped the gene and disease tags from the text and stored the exact location of the gene and disease mentions within sentences. Since DisGeNET integrates data from various resources, there might be duplicate evidence for the same GDA. In this case, we discarded duplicates and prioritized data coming from expert-curated resources.
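As an illustration of the tag-stripping step, the sketch below assumes a hypothetical inline markup (the actual tags used in DisGeNET evidence sentences may differ) and records the mention offsets on the de-tagged text.

import re

# Hypothetical inline markup; the actual tags used in DisGeNET evidence may differ.
TAG = re.compile(r"<(gene|disease)>(.*?)</\1>")

def strip_tags(tagged_sentence):
    """Remove entity tags and return the clean sentence plus (start, length)
    offsets of the gene and disease mentions measured on the clean text."""
    parts, mentions, cursor, last_end = [], {}, 0, 0
    for m in TAG.finditer(tagged_sentence):
        parts.append(tagged_sentence[last_end:m.start()])
        cursor += m.start() - last_end
        mention = m.group(2)
        mentions[m.group(1)] = (cursor, len(mention))
        parts.append(mention)
        cursor += len(mention)
        last_end = m.end()
    parts.append(tagged_sentence[last_end:])
    return "".join(parts), mentions

text, spans = strip_tags("<gene>TP53</gene> is mutated in <disease>lung cancer</disease>.")
print(text, spans)  # offsets refer to the de-tagged sentence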

    From each instance resulting from the data cleaning process, we considered the following attributes: the original source, the publication supporting the association, the representative sentence, the association type, as well as information on the gene and disease involved in the association. Regarding genes, we kept the NCBI Entrez [48] identifiers, the NCBI official gene symbols, and the gene locations within sentences. As for diseases, we stored the UMLS [49] CUIs, the UMLS preferred terms, and the disease locations in text.

    Distant supervision

    To effectively train RE models, false GDAs are also required—i.e., instances where co-occurring genes and diseases are not semantically associated. However, DisGeNET stores only true GDAs. To overcome this limitation, we used distant supervision [16] to obtain false GDAs from the sentences contained within the abstract or title of the PubMed articles that support the GDAs retrieved in the data acquisition process. To this end, we relied on the 3.6.2rc6 version of MetaMapLite [50], a near real-time NER tool that identifies UMLS concepts within biomedical text. MetaMapLite returns, among other information, the CUI, the preferred term, and the location in text of the identified UMLS concepts. Thus, we used MetaMapLite to identify gene and disease UMLS concepts within sentences. For each identified concept, we stored its CUI, preferred term, and location in text. Then, we performed the following steps to generate false GDAs.

1. We restricted to sentences where the co-occurring genes and diseases come from DisGeNET. The search for false GDAs among the genes and diseases of DisGeNET aimed to reduce false negatives and to obtain gene-disease pairs that were more likely not to be semantically associated.

2. We filtered out instances where gene mentions matched common words. For instance, when all letters are in uppercase, the words FOR and TYPE are, by convention [51], aliases for the WWOX and SGCG genes. Therefore, when the gene mentions identified by MetaMapLite matched such (and other) common words, we kept the corresponding instances only if the matched words were in uppercase. As common words, we considered the set of most frequent words provided by Peter Norvig [52], which were derived from the Google Web Trillion Word Corpus [53].

3. We used the 2020AA UMLS MRCONSO file [54] to build a disease dictionary that stored UMLS preferred terms, lexical variants, alternate forms, short forms, and synonyms of the DisGeNET diseases. The MRCONSO file contains one row for each occurrence of each unique string or concept name within each source vocabulary of the UMLS Metathesaurus. Thus, we only kept instances where disease mentions exact-matched dictionary terms. In this way, we removed partial matches identified by MetaMapLite and, as a consequence, we reduced erroneous disease mentions.

4. Of the remaining instances, we only took those whose gene-disease pairs did not belong to any GDA within DisGeNET and we labeled them as NA.

    For each instance generated through distant supervision, we kept the following attributes: the publication and sentence from which the false GDA has been extracted, the NA association type, and information on the co-occurring gene and disease. For genes, we first mapped UMLS CUIs to NCBI Entrez IDs, and then we stored them together with NCBI official gene symbols and gene locations in text. On the other hand, for diseases, we stored UMLS CUIs, UMLS preferred terms, as well as disease locations in text.
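The filtering logic of the four steps above can be sketched as follows; the helper structures (disgenet_genes, disgenet_diseases, disgenet_pairs, disease_dictionary, common_words) stand for the resources described in the text, the function names are ours, and details such as case handling are simplified.

def is_valid_gene_mention(mention, common_words):
    """Step 2: discard gene mentions that coincide with frequent English words,
    unless the mention is fully uppercase (e.g. FOR or TYPE used as gene aliases)."""
    return mention.lower() not in common_words or mention.isupper()

def make_na_instances(cooccurrences, disgenet_genes, disgenet_diseases,
                      disgenet_pairs, disease_dictionary, common_words):
    """Apply steps 1-4 to gene-disease co-occurrences found by the NER tool and
    return the pairs labeled as NA (i.e., false GDAs)."""
    na_instances = []
    for sentence, gene, disease in cooccurrences:
        if gene["id"] not in disgenet_genes or disease["id"] not in disgenet_diseases:
            continue  # step 1: keep only DisGeNET genes and diseases
        if not is_valid_gene_mention(gene["mention"], common_words):
            continue  # step 2: common-word filter on gene mentions
        if disease["mention"].lower() not in disease_dictionary:
            continue  # step 3: exact match against the disease dictionary
        if (gene["id"], disease["id"]) in disgenet_pairs:
            continue  # step 4: pairs already in DisGeNET are not false GDAs
        na_instances.append({"text": sentence, "relation": "NA", "h": gene, "t": disease})
    return na_instances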

    Dataset generation

The sets of true and false instances obtained from the data cleaning and distant supervision processes were used to generate TBGA. We considered different associations from the DisGeNET association type ontology to build the dataset. Specifically, we adopted the Therapeutic, Biomarker, and Genomic Alterations association types as relations. We did not consider the Altered Expression and Post-translational Modification association types, although they sit at the same level as Genomic Alterations, as we lacked curated data for them. In addition to true associations, we also considered the false association NA.

    The steps required to create TBGA were the following:

1. We performed a normalization process to convert DisGeNET association types into TBGA relations. Given the hierarchical structure of the DisGeNET association type ontology, we could normalize finer association types to their coarser ancestors. For instance, a Genetic Variation association is also a Genomic Alterations one, which, in turn, is a Biomarker association (cf. Fig. 4). Thus, we mapped association types finer than Genomic Alterations to Genomic Alterations itself. On the other hand, instances involving the same gene-disease pair from the same sentence can have the Biomarker or the Genomic Alterations association type depending on the considered resource. This situation occurs because instances are generated by different biologists or using different text-mining techniques. In these cases, we removed the instances associated with Biomarker and kept the gene-disease pairs associated with Genomic Alterations, which is a finer, and thus more precise, association type than Biomarker.

2. We divided true instances among training, validation, and test sets based on the resource category. We used Curated data for validation and test, and Animal Models, Inferred, and Literature data for training. The only exception was Therapeutic, for which we lacked enough data for training. In this case, we also used Curated data for training, setting an 80/10/10 ratio among training, validation, and test sets.

3. We balanced the number of true instances among the dataset relations. For Biomarker and Genomic Alterations, we split Curated data evenly between validation and test. Then, we kept in training the same ratio among relations that exists in the validation and test sets. Since we model the BioRE task as a MIL problem, we downsampled over-represented relations (i.e., Biomarker and Genomic Alterations) at the bag level rather than at the sentence level to obtain the desired ratio among relations.

4. We want TBGA to reflect the sparseness of GDAs in biomedical literature. Assuming we randomly sample gene and disease mentions from a sentence of a given scientific article, it is very likely that no association occurs between them. Therefore, similar to previous works [10, 24], we included a large number of false instances into training, validation, and test sets to make TBGA sparse. For each set, we sampled a number of NA bags twice the number of bags associated with true relations.

5. We removed from the training set the bags whose gene-disease pairs also belong to the validation or test sets. This operation avoids introducing bias at inference time, as RE models cannot exploit training knowledge on the gene-disease pair (a minimal sketch of this filtering is given after this list).
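A minimal sketch of the leakage filter in step 5, assuming bags are keyed by their gene-disease pair as in the earlier grouping example:

def remove_overlapping_bags(train_bags, val_bags, test_bags):
    """Drop training bags whose gene-disease pair also appears in the validation
    or test set, so that models cannot exploit pairs seen at training time."""
    held_out_pairs = set(val_bags) | set(test_bags)  # keys are (gene_id, disease_id)
    return {pair: bag for pair, bag in train_bags.items() if pair not in held_out_pairs}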

    We provide statistics regarding the different steps of data cleaning and dataset generation for true instances in Table 5. As for NA statistics, we performed distant supervision on more than 700,000 publications, obtaining 152,963 instances and 70,688 bags—which are associated with 83,501 publications and involve 9167 different genes and 5151 different diseases.

    Table 5 Global and per-relation statistics for data cleaning and dataset generation