Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls

High Performance Computing for Drug Discovery and Biomedicine

Part of the book series: Methods in Molecular Biology (MIMB, volume 2716)

Abstract

Building and analyzing knowledge graphs (KGs) to aid drug discovery is a topical area of research. A salient feature of KGs is their ability to combine many heterogeneous data sources in a format that facilitates discovering connections. The utility of KGs has been exemplified in areas such as drug repurposing, with insights gained through manual exploration and modeling of the data. In this chapter, we discuss the promises and pitfalls of using natural language processing (NLP) to mine “unstructured text,” typically from scientific literature, as a data source for KGs. This draws on our experience of first parsing “structured” data sources, such as ChEMBL, as the basis for data within a KG, and then enriching or expanding upon them using NLP. The fundamental promise of NLP for KGs is the automated extraction of data from millions of documents, a task practically impossible to do via human curation alone. However, there are many potential pitfalls in NLP-KG pipelines, such as incorrect named entity recognition and ontology linking, all of which could ultimately lead to erroneous inferences and conclusions.



Author information

Corresponding authors

Correspondence to J. Charles G. Jeynes or Tim James.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Jeynes, J.C.G., James, T., Corney, M. (2024). Natural Language Processing for Drug Discovery Knowledge Graphs: Promises and Pitfalls. In: Heifetz, A. (eds) High Performance Computing for Drug Discovery and Biomedicine. Methods in Molecular Biology, vol 2716. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3449-3_10

  • DOI: https://doi.org/10.1007/978-1-0716-3449-3_10

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-3448-6

  • Online ISBN: 978-1-0716-3449-3

  • eBook Packages: Springer Protocols
