Jobs Runtime Forecast for JSCC RAS Supercomputers Using Machine Learning Methods

Lobachevskii Journal of Mathematics

Abstract

The paper is devoted to machine learning methods and algorithms for predicting the execution time of supercomputer jobs. Supercomputer statistics show that the actual runtime of most jobs diverges substantially from the time requested by the user. This reduces the efficiency of job scheduling, since an inaccurate estimate of the job execution time leads to a suboptimal schedule. A job classification based on the difference between the actual and the requested execution time is considered. The forecast is made by assigning a submitted job to one of these classes, using the statistics of the supercomputer multiuser job management system. Statistics of the MVS-100K and MVS-10P supercomputers at the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS) were used. Job flow features were ranked by importance on the basis of the statistical analysis, and the cross-correlation of the most important features was determined. Estimates of the probability of correct prediction were obtained for selected well-known machine learning algorithms: logistic regression, decision trees, k-nearest neighbors, linear discriminant analysis, support vector machine, random forest, gradient boosting, and a feedforward neural network. The best values were obtained with the random forest method.
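
The paper does not publish code, but the general approach described in the abstract can be sketched as follows. The sketch assumes Python with scikit-learn; the feature names, synthetic data, and class boundaries are illustrative assumptions, not values taken from the paper. Jobs are labeled by the ratio of actual to requested runtime, a random forest classifier is trained on job-flow features, and the features are ranked by importance.

# Minimal sketch (not the authors' code): classify jobs by the ratio of
# actual to requested runtime and rank job-flow features by importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical job-flow features drawn at random in place of real
# job management system statistics (user id, requested time, cores, hour).
X = np.column_stack([
    rng.integers(0, 100, n),        # user id
    rng.integers(60, 86400, n),     # requested time, s
    rng.integers(1, 1024, n),       # requested CPU cores
    rng.integers(0, 24, n),         # submission hour
])
actual = X[:, 1] * rng.uniform(0.01, 1.2, n)   # simulated actual runtime, s

# Assign each job to a class by the ratio of actual to requested time.
# The boundaries 0.25, 0.5, 0.75 are an assumption made for illustration.
ratio = actual / X[:, 1]
y = np.digitize(ratio, [0.25, 0.5, 0.75])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Rank features by importance, mirroring the feature-ranking step.
names = ["user_id", "req_time", "req_cores", "submit_hour"]
for name, imp in sorted(zip(names, clf.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")

In the same spirit, the other classifiers listed in the abstract could be compared by swapping the estimator and recording the test accuracy of each.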



ACKNOWLEDGMENTS

The work was carried out at the JSCC RAS as part of the state assignment (project 0065-2019-0016). The MVS-100K and MVS-10P supercomputers were used.

Author information

Corresponding authors

Correspondence to G. I. Savin, B. M. Shabanov, D. S. Nikolaev, A. V. Baranov or P. N. Telegin.

Additional information

(Submitted by A. M. Elizarov)

About this article


Cite this article

Savin, G.I., Shabanov, B.M., Nikolaev, D.S. et al. Jobs Runtime Forecast for JSCC RAS Supercomputers Using Machine Learning Methods. Lobachevskii J Math 41, 2593–2602 (2020). https://doi.org/10.1134/S1995080220120343

