Abstract
Biomedical databases are a major source of knowledge for research in the life sciences. This knowledge is stored in a network of thousands of databases, repositories and ontologies, which differ substantially in data granularity, storage formats, database systems, supported data models and interfaces. Making full use of the available data resources therefore requires considerable bioinformatics skill, because users must cope with a large number of heterogeneous query methods and frontends. Manual inspection of database entries and citations is consequently a time-consuming task that calls for methods from computer science.
Concepts and algorithms from information retrieval (IR) play a central role in meeting these challenges. Although originally developed to manage and query less structured data, IR techniques are becoming increasingly important for integrating life science data repositories and their associated information. This chapter provides an overview of IR concepts and their current applications in the life sciences. Supported by a large number of selected references to further literature, the following sections successively build a practical guide for biologists and bioinformaticians.
Notes
- 7. The Computational Modeling in Biology Network (COMBINE), http://co.mbine.org/.
- 25. Twenty-fourth release of BioModels Database, December 2012.
Acknowledgements
This work was supported by the European Commission within its 7th Framework Programme, under the thematic area “Infrastructures”, contract number 283496, by the BMBF e:bio programme (University of Rostock) and the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK).
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Lange, M., Henkel, R., Müller, W., Waltemath, D., Weise, S. (2014). Information Retrieval in Life Sciences: A Programmatic Survey. In: Chen, M., Hofestädt, R. (eds) Approaches in Integrative Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41281-3_3
DOI: https://doi.org/10.1007/978-3-642-41281-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41280-6
Online ISBN: 978-3-642-41281-3
eBook Packages: Computer Science, Computer Science (R0)