Abstract
Pharmacogenomics is a research field that studies how patients' genetic variation affects drug response, looking for correlations between single nucleotide polymorphisms (SNPs) in the patient genome and drug toxicity or efficacy. The large number of available samples and the high resolution of the instruments allow microarray platforms to produce huge amounts of SNP data. Analyzing such data and finding correlations in a reasonable time requires high-performance computing solutions. Cloud4SNP is a bioinformatics tool, based on the Data Mining Cloud Framework (DMCF), for parallel preprocessing and statistical analysis of SNP pharmacogenomics microarray data.
This work describes how Cloud4SNP has been extended to execute applications on Apache Spark, which provides faster execution times for iterative and batch processing. The experimental evaluation shows that Cloud4SNP is able to exploit the high-performance features of Apache Spark, obtaining faster execution times and a high level of scalability, with a global speedup that is very close to linear.
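To make the notion of "close to linear" speedup concrete, the sketch below computes speedup and parallel efficiency from run times, using the standard definitions S(p) = T(1) / T(p) and E(p) = S(p) / p. The function names and the timings are hypothetical and chosen only for illustration; they are not measurements from the chapter and not part of Cloud4SNP.

```python
from typing import Dict, List, Tuple


def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Parallel efficiency E(p) = S(p) / p; 1.0 means perfectly linear speedup."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_serial, t_parallel) / workers


def scaling_table(timings: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (workers, speedup, efficiency) rows, sorted by worker count.

    `timings` maps a worker count to a run time in seconds and must
    contain an entry for 1 worker, which serves as the baseline.
    """
    t1 = timings[1]
    return [
        (p, speedup(t1, tp), efficiency(t1, tp, p))
        for p, tp in sorted(timings.items())
    ]


if __name__ == "__main__":
    # Hypothetical run times in seconds (assumption, not data from the chapter).
    example_timings = {1: 1600.0, 2: 820.0, 4: 415.0, 8: 212.0}
    for p, s, e in scaling_table(example_timings):
        print(f"workers={p:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

An efficiency that stays near 1.0 as the number of workers grows is what a speedup "close to linear" means in practice.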
Notes
- 1. DNA is made up of four subunits, or bases, called adenine (A), cytosine (C), guanine (G), and thymine (T).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Marozzo, F., Belcastro, L. (2022). High-Performance Framework to Analyze Microarray Data. In: Agapito, G. (eds) Microarray Data Analysis. Methods in Molecular Biology, vol 2401. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1839-4_2
DOI: https://doi.org/10.1007/978-1-0716-1839-4_2
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1838-7
Online ISBN: 978-1-0716-1839-4
eBook Packages: Springer Protocols