Jobs Runtime Forecast for JSCC RAS Supercomputers Using Machine Learning Methods

Lobachevskii Journal of Mathematics

Abstract

The paper is devoted to machine learning methods and algorithms for predicting the execution time of supercomputer jobs. Supercomputer statistics show that the actual runtime of most jobs diverges substantially from the time requested by the user. This reduces the efficiency of job scheduling, since an inaccurate estimate of the job execution time leads to a suboptimal schedule. A job classification based on the difference between the actual and the requested execution time is considered. The forecast is made by assigning a submitted job to one of these classes, using the statistics of the supercomputer multiuser job management system. Statistics of the MVS-100K and MVS-10P supercomputers at the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS) were used. Job flow features were ranked by importance on the basis of the statistical analysis, and the cross-correlation of the most important features was determined. Estimates of the probability of correct prediction were obtained for selected well-known machine learning algorithms: logistic regression, decision trees, k-nearest neighbors, linear discriminant analysis, support vector machine, random forest, gradient boosting, and a feedforward neural network. The best values were obtained with the random forest method.
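
The paper does not publish code, but the general approach described in the abstract can be sketched as follows. The sketch assumes Python with scikit-learn; the feature names, synthetic data, and class boundaries are illustrative assumptions, not values taken from the paper. Jobs are labeled by the ratio of actual to requested runtime, a random forest classifier is trained on job-flow features, and the features are ranked by importance.

# Minimal sketch (not the authors' code): classify jobs by the ratio of
# actual to requested runtime and rank job-flow features by importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical job-flow features drawn at random in place of real
# job management system statistics (user id, requested time, cores, hour).
X = np.column_stack([
    rng.integers(0, 100, n),        # user id
    rng.integers(60, 86400, n),     # requested time, s
    rng.integers(1, 1024, n),       # requested CPU cores
    rng.integers(0, 24, n),         # submission hour
])
actual = X[:, 1] * rng.uniform(0.01, 1.2, n)   # simulated actual runtime, s

# Assign each job to a class by the ratio of actual to requested time.
# The boundaries 0.25, 0.5, 0.75 are an assumption made for illustration.
ratio = actual / X[:, 1]
y = np.digitize(ratio, [0.25, 0.5, 0.75])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Rank features by importance, mirroring the feature-ranking step.
names = ["user_id", "req_time", "req_cores", "submit_hour"]
for name, imp in sorted(zip(names, clf.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")

In the same spirit, the other classifiers listed in the abstract could be compared by swapping the estimator and recording the test accuracy of each.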



ACKNOWLEDGMENTS

The work was carried out at the JSCC RAS as part of the state assignment (project 0065-2019-0016). The MVS-100K and MVS-10P supercomputers were used.

Author information

Corresponding authors

Correspondence to G. I. Savin, B. M. Shabanov, D. S. Nikolaev, A. V. Baranov or P. N. Telegin.

Additional information

(Submitted by A. M. Elizarov)

About this article


Cite this article

Savin, G.I., Shabanov, B.M., Nikolaev, D.S. et al. Jobs Runtime Forecast for JSCC RAS Supercomputers Using Machine Learning Methods. Lobachevskii J Math 41, 2593–2602 (2020). https://doi.org/10.1134/S1995080220120343

