Abstract
Building and analyzing knowledge graphs (KGs) to aid drug discovery is a topical area of research. A salient feature of KGs is their ability to combine many heterogeneous data sources in a format that facilitates discovering connections. The utility of KGs has been exemplified in areas such as drug repurposing, with insights made through manual exploration and modeling of the data. In this chapter, we discuss promises and pitfalls of using natural language processing (NLP) to mine “unstructured text”— typically from scientific literature— as a data source for KGs. This draws on our experience of initially parsing “structured” data sources—such as ChEMBL—as the basis for data within a KG, and then enriching or expanding upon them using NLP. The fundamental promise of NLP for KGs is the automated extraction of data from millions of documents—a task practically impossible to do via human curation alone. However, there are many potential pitfalls in NLP-KG pipelines, such as incorrect named entity recognition and ontology linking, all of which could ultimately lead to erroneous inferences and conclusions.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Ehrlinger L, Wöß W (2016) Towards a definition of knowledge graphs. In: CEUR Workshop Proceedings
Bonner S, Barrett IP, Ye C, Swiers R, Engkvist O, Bender A, et al (2021) A review of biomedical datasets relating to drug discovery: a knowledge graph perspective
Nicholson DN, Greene CS (2020) Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 18:1414–1428
Doǧan T, Atas H, Joshi V, Atakan A, Rifaioglu AS, Nalbat E et al (2021) CROssBAR: comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res 49(16):e96–e96
Himmelstein DS, Lizee A, Hessler C, Brueggeman L, Chen SL, Hadley D et al (2017) Systematic integration of biomedical knowledge prioritizes drugs for repurposing. elife 6:e26726
Su C, Hou Y, Guo W, Chaudhry F, Ghahramani G, Zhang H, et al (2021) CBKH: The Cornell Biomedical Knowledge Hub. medRxiv
Maglott D, Ostell J, Pruitt KD, Tatusova T (2007) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 35(Database issue):D26
Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM et al (2022) Ensembl 2022. Nucleic Acids Res 50(D1):D988–D995
Geleta D, Nikolov A, Edwards G, Gogleva A, Jackson R, Jansson E, et al (2021) Biological insights knowledge graph: an integrated knowledge graph to support drug development. bioRxiv. 2021.10.28.466262
Martin B, Jacob HJ, Hajduk P, Wolfe E, Chen L, Crosby H, et al (2022) Leveraging a billion-edge knowledge graph for drug re-purposing and target prioritization using genomically-informed subgraphs. bioRxiv. 2022.12.20.521235
Paliwal S, de Giorgio A, Neil D, Michel JB, Lacoste AM (2020) Preclinical validation of therapeutic targets predicted by tensor factorization on heterogeneous graphs. Sci Rep 10(1):18250
Nicholson DN, Himmelstein DS, Greene CS (2022) Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts. BioData Min 15(1):26
Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC (2012) SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics 28(23):3158
Wei CH, Allot A, Leaman R, Lu Z (2019) PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 47(W1):W587–W593
Braicu C, Buse M, Busuioc C, Drula R, Gulei D, Raduly L et al (2019) A comprehensive review on MAPK: a promising therapeutic target in cancer. Cancers (Basel) 11(10)
Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S et al (2021) The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 49(D1):D605–D612
**ong C, Liu X, Meng A (2015) The kinase activity-deficient isoform of the protein araf antagonizes Ras/Mitogen-activated protein kinase (Ras/MAPK) signaling in the zebrafish embryo. J Biol Chem 290(42):25512
Brandes U (2010) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177
Zhang J-X, Chen D-B, Dong Q, Zhao Z-D (2016) Identifying a set of influential spreaders in complex networks. Sci Rep 6(1):27823
Approved Drug Products with Therapeutic Equivalence Evaluations | Orange Book [Internet]. Available from: https://www.fda.gov/drugs/drug-approvals-and-databases/approved-drug-products-therapeutic-equivalence-evaluations-orange-book
Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR et al (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46(D1):D1074–D1082
Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(suppl_1):D267–D270
Doycheva D, Jägle H, Zierhut M, Deuter C, Blumenstock G, Schiefer U et al (2015) Mycophenolic acid in the treatment of birdshot chorioretinopathy: long-term follow-up. Br J Ophthalmol 99(1):87–91
Finlayson SG, LePendu P, Shah NH (2014) Building the graph of medicine from millions of clinical narratives. Sci Data 1(1):140032
Lowe HJ, Ferris TA, Hernandez PM, Weber SC (2009) STRIDE – An Integrated Standards-Based Translational Research Informatics Platform. AMIA Annu Symp Proc 2009:391
Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O (2018) The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data 5(1):180178
Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M et al (2016) MIMIC-III, a freely accessible critical care database. Sci Data 3(1):160035
Malki MA, Dawed AY, Hayward C, Doney A, Pearson ER (2021) Utilizing large electronic medical record data sets to identify novel drug–gene interactions for commonly used drugs. Clin Pharmacol Ther 110(3):816–825
Kuhn M, Letunic I, Jensen LJ, Bork P (2016) The SIDER database of drugs and side effects. Nucleic Acids Res 44(Database issue):D1075
Koskinen M, Salmi JK, Loukola A, Mäkelä MJ, Sinisalo J, Carpén O et al (2022) Data-driven comorbidity analysis of 100 common disorders reveals patient subgroups with differing mortality risks and laboratory correlates. Sci Rep 12(1):1–9
Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267
Leaman R, Lu Z (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics 32(18):2839–2846
Sosa DN, Derry A, Guo M, Wei E, Brinton C, Altman RB (2020) A literature-based knowledge graph embedding method for identifying drug repurposing opportunities in rare diseases. Pac Symp Biocomput 2020(25):463–474
Percha B, Altman RB (2018) A global network of biomedical relationships derived from text. Bioinformatics 34(15):2614–2624
Fujiyoshi K, Bruford EA, Mroz P, Sims CL, O’Leary TJ, Lo AWI et al (2021) Standardizing gene product nomenclature-a call to action. Proc Natl Acad Sci U S A 118(3):e2025207118
Skreta M, Arbabi A, Wang J, Drysdale E, Kelly J, Singh D et al (2021) Automatically disambiguating medical acronyms with ontology-aware deep learning. Nat Commun 12(1):1–10
Kilicoglu H, Rosemblat G, Fiszman M, Shin D (2020) Broad-coverage biomedical relation extraction with SemRep. BMC Bioinform 21(1):1–28
Mantovani F, Collavin L, Del Sal G (2018) Mutant p53 as a guardian of the cancer cell. Cell Death Differ 26(2):199–212
Kilicoglu H, Rosemblat G, Rindflesch TC (2017) Assigning factuality values to semantic relations extracted from biomedical research literature. PLoS One 12(7):e0179926
Unni DR, Moxon SAT, Bada M, Brush M, Bruskiewich R, Caufield JH et al (2022) Biolink Model: a universal schema for knowledge graphs in clinical, biomedical, and translational science. Clin Transl Sci 15(8):1848–1855
Zeng X, Song X, Ma T, Pan X, Zhou Y, Hou Y et al (2020) Repurpose open data to discover therapeutics for COVID-19 using deep learning. J Proteome Res 19(11):4624–4636
Zhang R, Hristovski D, Schutte D, Kastrin A, Fiszman M, Kilicoglu H (2020) Drug Repurposing for COVID-19 via Knowledge Graph Completion. J Biomed Inform 115:103696
Ratajczak F, Joblin M, Ringsquandl M, Hildebrandt M (2022) Task-driven knowledge graph filtering improves prioritizing drugs for repurposing. BMC Bioinform 23(1):84
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Jeynes, J.C.G., James, T., Corney, M. (2024). Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls. In: Heifetz, A. (eds) High Performance Computing for Drug Discovery and Biomedicine. Methods in Molecular Biology, vol 2716. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3449-3_10
Download citation
DOI: https://doi.org/10.1007/978-1-0716-3449-3_10
Published:
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3448-6
Online ISBN: 978-1-0716-3449-3
eBook Packages: Springer Protocols