High-Performance Framework to Analyze Microarray Data

Marozzo, Fabrizio; Belcastro, Loris

doi:10.1007/978-1-0716-1839-4_2

Part of the book series: Methods in Molecular Biology (MIMB, volume 2401)

Abstract

Pharmacogenomics is an important research field that studies the impact of patients' genetic variation on drug responses, looking for correlations between single nucleotide polymorphisms (SNPs) in the patient genome and drug toxicity or efficacy. The large number of available samples and the high resolution of the instruments allow microarray platforms to produce huge amounts of SNP data. To analyze such data and find correlations in a reasonable time, high-performance computing solutions must be used. Cloud4SNP is a bioinformatics tool, based on the Data Mining Cloud Framework (DMCF), for parallel preprocessing and statistical analysis of SNP pharmacogenomics microarray data.

This work describes how Cloud4SNP has been extended to execute applications on Apache Spark, which provides faster execution times for iterative and batch processing. The experimental evaluation shows that Cloud4SNP is able to exploit the high-performance features of Apache Spark, obtaining faster execution times and a high level of scalability, with a global speedup that is very close to linear.

Notes

1. DNA is made up of four subunits, or bases, called adenine (A), cytosine (C), guanine (G), and thymine (T).
2. www.affymetrix.com
3. http://hadoop.apache.org/
4. http://spark.apache.org/
5. http://www.bioconductor.org/
6. http://www.ncbi.nlm.nih.gov/projects/SNP
7. http://www.pharmgkb.org
8. https://spark.apache.org
9. https://mesos.apache.org/
10. https://hadoop.apache.org/

Author information

Authors and Affiliations

DIMES, University of Calabria, Rende, Cosenza, Italy
Fabrizio Marozzo & Loris Belcastro

Corresponding authors

Correspondence to Fabrizio Marozzo or Loris Belcastro.

Editor information

Editors and Affiliations

Dipartimento di Giurisprudenza, Economia e Sociologia, Università degli Studi “Magna Graecia” di Catanzaro, Catanzaro, Italy
Giuseppe Agapito

About this protocol

Cite this protocol

Marozzo, F., Belcastro, L. (2022). High-Performance Framework to Analyze Microarray Data. In: Agapito, G. (eds) Microarray Data Analysis. Methods in Molecular Biology, vol 2401. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1839-4_2

DOI: https://doi.org/10.1007/978-1-0716-1839-4_2
Published: 14 December 2021
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1838-7
Online ISBN: 978-1-0716-1839-4
eBook Packages: Springer Protocols
