High-Performance Framework to Analyze Microarray Data

  • Protocol
Microarray Data Analysis

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2401))

Abstract

Pharmacogenomics is an important research field that studies how patients' genetic variation affects drug response, looking for correlations between single nucleotide polymorphisms (SNPs) in the patient genome and drug toxicity or efficacy. Because of the large number of available samples and the high resolution of the instruments, microarray platforms produce huge amounts of SNP data. Analyzing such data and finding correlations in a reasonable time requires high-performance computing solutions. Cloud4SNP is a bioinformatics tool, based on the Data Mining Cloud Framework (DMCF), for parallel preprocessing and statistical analysis of pharmacogenomics SNP microarray data.

This work describes how Cloud4SNP has been extended to execute applications on Apache Spark, which provides faster execution times for iterative and batch processing. The experimental evaluation shows that Cloud4SNP is able to exploit the high-performance features of Apache Spark, obtaining faster execution times and a high level of scalability, with a global speedup that is very close to linear.
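As a reading aid for the "close to linear" claim, the sketch below shows how speedup and parallel efficiency are conventionally computed from execution times. It is not taken from the chapter: the function names are hypothetical, and the timings are made-up placeholders, not results reported by the authors. Speedup on n workers is the single-worker time divided by the n-worker time, and a linear speedup corresponds to an efficiency of 1.0.

```python
from typing import Dict, List, Tuple


def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup = time on one worker / time on n workers."""
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, workers: int) -> float:
    """Efficiency = speedup / n; 1.0 means perfectly linear scaling."""
    return speedup(t_single, t_parallel) / workers


def speedup_table(timings: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (workers, speedup, efficiency) rows, using the 1-worker run as baseline."""
    baseline = timings[1]
    return [
        (n, speedup(baseline, t), efficiency(baseline, t, n))
        for n, t in sorted(timings.items())
    ]


if __name__ == "__main__":
    # Placeholder timings in seconds, invented for illustration only.
    # They are NOT measurements from Cloud4SNP or from the chapter.
    example_timings = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 140.0}
    for n, s, e in speedup_table(example_timings):
        print(f"workers={n:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

An efficiency that stays near 1.0 as workers are added is what "speedup close to linear" means in practice; values that fall well below 1.0 indicate overheads such as data shuffling or scheduling.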


Notes

  1. DNA is made up of four subunits, or bases, called adenine (A), cytosine (C), guanine (G), and thymine (T).

  2. www.affymetrix.com

  3. http://hadoop.apache.org/

  4. http://spark.apache.org/

  5. http://www.bioconductor.org/

  6. http://www.ncbi.nlm.nih.gov/projects/SNP

  7. http://www.pharmgkb.org

  8. https://spark.apache.org

  9. https://mesos.apache.org/

  10. https://hadoop.apache.org/


Author information

Corresponding authors

Correspondence to Fabrizio Marozzo or Loris Belcastro.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol


Cite this protocol

Marozzo, F., Belcastro, L. (2022). High-Performance Framework to Analyze Microarray Data. In: Agapito, G. (eds) Microarray Data Analysis. Methods in Molecular Biology, vol 2401. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1839-4_2


  • DOI: https://doi.org/10.1007/978-1-0716-1839-4_2


  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1838-7

  • Online ISBN: 978-1-0716-1839-4

  • eBook Packages: Springer Protocols
