Abstract
As scientists continue to migrate their work to computational methods, it is important to track not only the steps involved in a computation but also the data consumed and produced. Existing approaches can capture this provenance information, but they often maintain only weak references between data and provenance. When data files or provenance records are moved or modified, it can be difficult to find the data associated with the provenance or the provenance associated with the data. We propose a persistent storage mechanism that manages input, intermediate, and output data files, strengthening the links between provenance and data. This mechanism provides better support for reproducibility because it ensures that the data referenced in provenance information can be readily located. Such management also allows intermediate data to be cached and then shared with other users. We present an implemented infrastructure for managing data in a provenance-aware manner and demonstrate its application in scientific projects.
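To make the idea of a strong link concrete, the following is a minimal sketch of one way a managed store could work. It is not the implementation described in the paper. It assumes that files are identified by a SHA-256 content hash and that intermediate results are cached under a signature built from the step name, its parameters, and the hashes of its inputs. The class `ManagedStore` and its methods are hypothetical names introduced only for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


class ManagedStore:
    """Hypothetical content-addressed store (an assumption, not the paper's code)."""

    def __init__(self, root):
        self.objects = Path(root) / "objects"
        self.objects.mkdir(parents=True, exist_ok=True)
        self.index_path = Path(root) / "cache_index.json"

    def put(self, path):
        """Copy a file into the store and return its SHA-256 content hash."""
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        target = self.objects / digest
        if not target.exists():
            shutil.copyfile(path, target)
        return digest

    def get(self, digest):
        """Resolve a hash recorded in provenance to the managed copy of the file."""
        target = self.objects / digest
        if not target.exists():
            raise KeyError(digest)
        return target

    def _load_index(self):
        if self.index_path.exists():
            return json.loads(self.index_path.read_text())
        return {}

    def cached(self, step_name, params, input_digests, compute):
        """Return (output_digest, cache_hit) for one workflow step.

        `compute` is a zero-argument callable that runs the step and returns
        the path of its output file; it is called only on a cache miss.
        """
        payload = json.dumps([step_name, params, list(input_digests)], sort_keys=True)
        signature = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        index = self._load_index()
        if signature in index:
            return index[signature], True
        digest = self.put(compute())
        index[signature] = digest
        self.index_path.write_text(json.dumps(index))
        return digest, False
```

Because provenance would record the hash rather than a file path, moving or deleting the original file does not break the reference as long as the managed copy remains in the store. Sharing the store directory with another user would also share the cached intermediate results.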
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Koop, D., Santos, E., Bauer, B., Troyer, M., Freire, J., Silva, C.T. (2010). Bridging Workflow and Data Provenance Using Strong Links. In: Gertz, M., Ludäscher, B. (eds) Scientific and Statistical Database Management. SSDBM 2010. Lecture Notes in Computer Science, vol 6187. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13818-8_28
DOI: https://doi.org/10.1007/978-3-642-13818-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13817-1
Online ISBN: 978-3-642-13818-8
eBook Packages: Computer Science (R0)