Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls

Jeynes, J. Charles G.; James, Tim; Corney, Matthew

doi:10.1007/978-1-0716-3449-3_10

Part of the book series: Methods in Molecular Biology (MIMB, volume 2716)

Abstract

Building and analyzing knowledge graphs (KGs) to aid drug discovery is a topical area of research. A salient feature of KGs is their ability to combine many heterogeneous data sources in a format that facilitates discovering connections. The utility of KGs has been exemplified in areas such as drug repurposing, with insights made through manual exploration and modeling of the data. In this chapter, we discuss promises and pitfalls of using natural language processing (NLP) to mine “unstructured text” (typically from scientific literature) as a data source for KGs. This draws on our experience of initially parsing “structured” data sources, such as ChEMBL, as the basis for data within a KG, and then enriching or expanding upon them using NLP. The fundamental promise of NLP for KGs is the automated extraction of data from millions of documents, a task practically impossible to do via human curation alone. However, there are many potential pitfalls in NLP-KG pipelines, such as incorrect named entity recognition and ontology linking, all of which could ultimately lead to erroneous inferences and conclusions.

Author information

Authors and Affiliations

Evotec (UK) Ltd., in silico Research and Development, Abingdon, Oxfordshire, UK
J. Charles G. Jeynes, Tim James & Matthew Corney

Corresponding authors

Correspondence to J. Charles G. Jeynes or Tim James.

Editor information

Editors and Affiliations

In Silico Research and Development, Evotec UK Ltd, Abingdon, UK
Alexander Heifetz

About this protocol

Cite this protocol

Jeynes, J.C.G., James, T., Corney, M. (2024). Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls. In: Heifetz, A. (eds) High Performance Computing for Drug Discovery and Biomedicine. Methods in Molecular Biology, vol 2716. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3449-3_10

DOI: https://doi.org/10.1007/978-1-0716-3449-3_10
Published: 14 September 2023
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3448-6
Online ISBN: 978-1-0716-3449-3
eBook Packages: Springer Protocols
