Design and realization of hybrid resource management system for heterogeneous cluster

Abstract

As user-generated data diversifies, a growing share of big data workloads involves unstructured data such as audio, images, and video. Processing such data often requires the support of GPU devices in the cluster. However, most existing cluster resource management frameworks do not effectively decouple CPU resources from GPU resources, and they manage GPU resources at too coarse a granularity, which leads to poor GPU sharing and low resource utilization. To address the inability of existing frameworks to support integrated batch-stream workloads and joint CPU-GPU scheduling, we design and implement a Hybrid Heterogeneous Resource Management system (H-HRM) for CPU-GPU heterogeneous clusters. A resource queue binding mechanism on each computing node flexibly binds CPU resources to GPU resources, solving the problem that CPU and GPU resources in the cluster are difficult to decouple. Based on whether a job uses CPU or GPU as its main resource, we propose a Hybrid Domain Resource Fairness (HDRF) model that allocates CPU and GPU resources fairly. Queue stacking, together with a mechanism that lets multiple executors run simultaneously on a queue, enables fine-grained sharing of GPU resources and improves GPU utilization. Finally, we integrate H-HRM with the Spark programming framework and evaluate it under real workloads. Compared with Mesos, H-HRM handles mixed batch and stream processing jobs in the cluster, and the HDRF algorithm substantially improves GPU utilization.
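To make the scheduling idea concrete, the following minimal Python sketch shows one way a dominant-resource-fairness-style allocator could be extended with a CPU/GPU domain split in the spirit of HDRF: each framework declares its main resource, and the scheduler serves the framework with the lowest dominant share, breaking ties on the share of its own domain. All names, capacities, and demand vectors here are hypothetical illustrations, not H-HRM's actual implementation.

```python
# Hypothetical sketch of a DRF-style allocator with a CPU/GPU domain split.
# Capacities, demands, and class names are illustrative only; the paper's
# actual HDRF model and H-HRM implementation are not reproduced here.

CAPACITY = {"cpu": 64.0, "gpu": 8.0, "mem": 256.0}  # assumed cluster totals

class Framework:
    def __init__(self, name, demand, domain):
        self.name = name      # job queue / framework name
        self.demand = demand  # per-task demand, e.g. {"cpu": 2, "gpu": 1, "mem": 4}
        self.domain = domain  # "cpu" or "gpu": the job's declared main resource
        self.alloc = {r: 0.0 for r in CAPACITY}

    def key(self):
        # Classic DRF dominant share, with the share of the job's own
        # main-resource domain as a tie-breaker, so CPU-heavy and
        # GPU-heavy jobs are compared on their own terms.
        shares = {r: self.alloc[r] / CAPACITY[r] for r in CAPACITY}
        return (max(shares.values()), shares[self.domain])

def allocate(frameworks, rounds):
    for _ in range(rounds):
        f = min(frameworks, key=lambda fw: fw.key())
        if any(f.alloc[r] + f.demand[r] > CAPACITY[r] for r in CAPACITY):
            return  # the fairest candidate no longer fits; stop (sketch only)
        for r in CAPACITY:
            f.alloc[r] += f.demand[r]

batch = Framework("batch-cpu", {"cpu": 4.0, "gpu": 0.0, "mem": 8.0}, domain="cpu")
stream = Framework("stream-gpu", {"cpu": 1.0, "gpu": 1.0, "mem": 4.0}, domain="gpu")
allocate([batch, stream], rounds=24)
for f in (batch, stream):
    print(f.name, f.alloc)
```

In H-HRM, per-domain fairness of this kind is combined with the queue binding and queue stacking mechanisms described above, so that several executors can share one GPU-bound queue; the sketch conveys only the scheduling-order intuition.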

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Funding

The authors would like to thank the anonymous referees for their invaluable suggestions and comments. This work is supported by the National Natural Science Foundation of China (Grant No. 61872284); the Key Research and Development Program of Shaanxi (Program Nos. 2023-YBGY-203 and 2023-YBGY-021); the Industrialization Project of the Shaanxi Provincial Department of Education (Grant No. 21JC017); the "Thirteenth Five-Year" National Key R&D Program (Project No. 2019YFD1100901); the Natural Science Foundation of Shaanxi Province, China (Grant Nos. 2021JLM-16 and 2023-JC-YB-825); and the Key R&D Plan of Xianyang City (Grant No. L2023-ZDYF-QYCX-021).

Author information

Contributions

Qinlu He contributed significantly to the algorithm design. Zhen Li performed the performance analysis. Genqing Bian and Weiqi Zhang completed the experimental work. Fan Zhang and Qinlu He contributed to manuscript preparation.

Corresponding author

Correspondence to Qinlu He.

Ethics declarations

Competing interests

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Consent for publication

The authors have read and understood the publishing policy and submit this manuscript in accordance with it.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, Q., Zhang, F., Bian, G. et al. Design and realization of hybrid resource management system for heterogeneous cluster. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04267-z
