Data-Driven Job Dispatching in HPC Systems

  • Conference paper
Machine Learning, Optimization, and Big Data (MOD 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10710)

Abstract

As High Performance Computing (HPC) systems get closer to exascale performance, job dispatching strategies become critical for keeping system utilization high while keeping waiting times low for jobs competing for HPC system resources. In this paper, we take a data-driven approach and investigate whether better dispatching decisions can be made by transforming the log data produced by an HPC system into useful knowledge about its workload. In particular, we focus on job duration, develop a data-driven approach to job duration prediction, and analyze the effect of different prediction approaches in making dispatching decisions using a real workload dataset collected from Eurora, a hybrid HPC system. Experiments on various dispatching methods show promising results.

Notes

  1. The Italian Inter University Consortium for High Performance Computing (http://www.cineca.it).

  2. Altair PBS Works (http://www.pbsworks.com/).

  3. SLURM Workload Manager (https://slurm.schedmd.com/).

  4. https://sites.google.com/view/accasim.

Acknowledgments

We thank Dr. A. Bartolini, Prof. L. Benini, Prof. M. Milano and Dr. M. Lombardi for fruitful discussions on the work presented here and for providing access to the Eurora data, together with the SCAI group in Cineca. We acknowledge the Cineca PM-HPC award allowing access to HPC resources. C. Galleguillos has been supported by Postgraduate Grant PUCV 2017. A. Sîrbu has been partially funded by the E.U. project SoBigData Research Infrastructure—Big Data and Social Mining Ecosystem (grant agreement 654024).

Author information

Corresponding author

Correspondence to Cristian Galleguillos.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Galleguillos, C., Sîrbu, A., Kiziltan, Z., Babaoglu, O., Borghesi, A., Bridi, T. (2018). Data-Driven Job Dispatching in HPC Systems. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_37

  • DOI: https://doi.org/10.1007/978-3-319-72926-8_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72925-1

  • Online ISBN: 978-3-319-72926-8

  • eBook Packages: Computer Science, Computer Science (R0)
