Abstract
As High Performance Computing (HPC) systems get closer to exascale performance, job dispatching strategies become critical for keeping system utilization high while keeping waiting times low for jobs competing for HPC system resources. In this paper, we take a data-driven approach and investigate whether better dispatching decisions can be made by transforming the log data produced by an HPC system into useful knowledge about its workload. In particular, we focus on job duration, develop a data-driven approach to job duration prediction, and analyze the effect of different prediction approaches in making dispatching decisions using a real workload dataset collected from Eurora, a hybrid HPC system. Experiments on various dispatching methods show promising results.
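To make the idea of prediction-driven dispatching concrete, the following is a minimal sketch and not the method developed in the paper. It assumes a hypothetical last-value heuristic (a job is predicted to run as long as the same user's most recent job, capped by the requested wall time) and a shortest-job-first ordering of the queue. The names `Job`, `predict_duration`, and `dispatch_order` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    user: str
    requested_walltime: int  # seconds, as declared by the user at submission


def predict_duration(job, history):
    """Predict a job's duration from log-derived history (assumed heuristic).

    history maps a user name to that user's past real durations in seconds,
    oldest first. With no history, fall back to the requested wall time.
    """
    past = history.get(job.user)
    if not past:
        return job.requested_walltime
    # A job cannot outlive its requested wall time, so cap the prediction.
    return min(past[-1], job.requested_walltime)


def dispatch_order(queue, history):
    """Order waiting jobs shortest-predicted-first (ties keep queue order)."""
    return sorted(queue, key=lambda job: predict_duration(job, history))


# Hypothetical data, for illustration only.
history = {"alice": [120, 300], "bob": [5000]}
queue = [
    Job("j1", "bob", 7200),
    Job("j2", "alice", 3600),
    Job("j3", "carol", 600),
]
order = [job.job_id for job in dispatch_order(queue, history)]
print(order)
```

Swapping `predict_duration` for a different predictor while keeping `dispatch_order` fixed is one simple way to compare how prediction quality affects dispatching decisions.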
Notes
1. The Italian Inter University Consortium for High Performance Computing (http://www.cineca.it).
2. Altair PBS Works (http://www.pbsworks.com/).
3. SLURM Workload Manager (https://slurm.schedmd.com/).
4.
Acknowledgments
We thank Dr. A. Bartolini, Prof. L. Benini, Prof. M. Milano and Dr. M. Lombardi for fruitful discussions on the work presented here and for providing access to the Eurora data, together with the SCAI group in Cineca. We acknowledge the Cineca PM-HPC award allowing access to HPC resources. C. Galleguillos has been supported by Postgraduate Grant PUCV 2017. A. Sîrbu has been partially funded by the E.U. project SoBigData Research Infrastructure—Big Data and Social Mining Ecosystem (grant agreement 654024).
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Galleguillos, C., Sîrbu, A., Kiziltan, Z., Babaoglu, O., Borghesi, A., Bridi, T. (2018). Data-Driven Job Dispatching in HPC Systems. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_37
DOI: https://doi.org/10.1007/978-3-319-72926-8_37
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72925-1
Online ISBN: 978-3-319-72926-8
eBook Packages: Computer Science (R0)