Abstract
As user-generated data diversifies, a growing number of big data tasks involve unstructured data such as audio, imagery, and video. Processing these data often requires GPU devices in the cluster. However, most existing cluster resource management frameworks do not effectively decouple CPU resources from GPU resources, and their GPU management granularity is too coarse, resulting in poor GPU sharing and low resource utilization. To address the inability of existing cluster resource management frameworks to support unified batch/stream processing and joint CPU-GPU scheduling, we design and implement a Hybrid Heterogeneous Resource Management (H-HRM) system for CPU-GPU heterogeneous clusters. A resource queue binding mechanism on computing nodes flexibly binds CPU resources to GPU resources, solving the problem that CPU and GPU resources in the cluster are difficult to decouple. Based on whether the CPU or the GPU serves as a job's dominant resource, a Hybrid Domain Resource Fairness (HDRF) model is proposed to allocate CPU and GPU resources fairly. Queue stacking, together with a mechanism that allows multiple executors to run simultaneously on a queue, enables fine-grained sharing of GPU resources and improves GPU utilization. Finally, this paper integrates H-HRM with the Spark programming framework and conducts real workload tests. A comparison with Mesos shows that H-HRM can handle mixed batch and stream processing workloads in the cluster, and that the HDRF algorithm substantially improves GPU utilization.
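The HDRF model described above builds on dominant-resource fairness: a job's dominant resource is whichever of CPU or GPU it consumes the largest fractional share of, and the scheduler repeatedly grants a task to the runnable job with the smallest dominant share. The paper's exact algorithm is not reproduced here; the sketch below is a minimal dominant-share allocator under that assumption, with hypothetical job names, task demands, and cluster capacities.

```python
# Minimal dominant-resource-fairness sketch (illustrative, not the paper's
# exact HDRF algorithm): repeatedly launch a task for the job with the
# smallest dominant share until no job's per-task demand still fits.

def dominant_share(alloc, capacity):
    # A job's dominant share is its largest fractional use of any resource.
    return max(alloc[r] / capacity[r] for r in capacity)

def hdrf_schedule(jobs, capacity, rounds=100):
    """jobs: {name: per-task demand dict}. Returns tasks launched per job."""
    used = {r: 0 for r in capacity}
    alloc = {j: {r: 0 for r in capacity} for j in jobs}
    tasks = {j: 0 for j in jobs}
    for _ in range(rounds):
        # Jobs whose next task still fits in the remaining capacity.
        runnable = [j for j in jobs
                    if all(used[r] + jobs[j][r] <= capacity[r]
                           for r in capacity)]
        if not runnable:
            break
        # Favor the job currently holding the smallest dominant share.
        j = min(runnable, key=lambda name: dominant_share(alloc[name], capacity))
        for r in capacity:
            used[r] += jobs[j][r]
            alloc[j][r] += jobs[j][r]
        tasks[j] += 1
    return tasks

# Hypothetical cluster: 9 CPU cores, 3 GPUs.
capacity = {"cpu": 9, "gpu": 3}
jobs = {"batch": {"cpu": 3, "gpu": 0},   # CPU-dominant batch job
        "train": {"cpu": 1, "gpu": 1}}   # GPU-dominant training job
print(hdrf_schedule(jobs, capacity))     # -> {'batch': 2, 'train': 3}
```

In this toy run the CPU-dominant job and the GPU-dominant job converge to equal dominant shares (6/9 of the CPUs versus 3/3 of the GPUs), which is the equalizing behavior a dominant-resource-fair policy targets when CPU-heavy and GPU-heavy workloads are mixed.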
*(Figures 1–23: see the published article.)*
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
The authors would like to thank the anonymous referees for their invaluable suggestions and comments. This work is supported by the National Natural Science Foundation of China (Grant No. 61872284); the Key Research and Development Program of Shaanxi (Program Nos. 2023-YBGY-203 and 2023-YBGY-021); the Industrialization Project of the Shaanxi Provincial Department of Education (Grant No. 21JC017); the "Thirteenth Five-Year" National Key R&D Program (Project No. 2019YFD1100901); the Natural Science Foundation of Shaanxi Province, China (Grant Nos. 2021JLM-16 and 2023-JC-YB-825); and the Key R&D Plan of **anyang City (Grant No. L2023-ZDYF-QYCX-021).
Author information
Authors and Affiliations
Contributions
Qinlu He contributed significantly to the algorithm design. Zhen Li performed the performance analysis. Genqing Bian and Weiqi Zhang completed the experimental part. Fan Zhang and Qinlu He contributed to manuscript preparation.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Consent for publication
The authors have read and understood the publishing policy and submit this manuscript in accordance with it.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, Q., Zhang, F., Bian, G. et al. Design and realization of hybrid resource management system for heterogeneous cluster. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04267-z