Design and realization of hybrid resource management system for heterogeneous cluster

Abstract

As user-generated data diversifies, a growing share of big data workloads involves unstructured data such as audio, images, and video. Processing such data often requires the support of GPU devices in the cluster. However, most existing cluster resource management frameworks do not effectively decouple CPU resources from GPU resources, and they manage GPU resources at too coarse a granularity, which leads to poor GPU sharing and low resource utilization. To address the inability of existing frameworks to support integrated batch-stream workloads and joint CPU-GPU scheduling, we design and implement a Hybrid Heterogeneous Resource Management system (H-HRM) for CPU-GPU heterogeneous clusters. A resource queue binding mechanism on each computing node flexibly binds CPU resources to GPU resources, solving the problem that CPU and GPU resources in the cluster are difficult to decouple. Based on whether a job uses CPU or GPU as its main resource, we propose a Hybrid Domain Resource Fairness (HDRF) model that allocates CPU and GPU resources fairly. Queue stacking, together with a mechanism that lets multiple executors run simultaneously on a queue, enables fine-grained sharing of GPU resources and improves GPU utilization. Finally, we integrate H-HRM with the Spark programming framework and evaluate it under real workloads. Compared with Mesos, H-HRM handles mixed batch and stream processing jobs in the cluster, and the HDRF algorithm substantially improves GPU utilization.
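To make the scheduling idea concrete, the following minimal Python sketch shows one way a dominant-resource-fairness-style allocator could be extended with a CPU/GPU domain split in the spirit of HDRF: each framework declares its main resource, and the scheduler serves the framework with the lowest dominant share, breaking ties on the share of its own domain. All names, capacities, and demand vectors here are hypothetical illustrations, not H-HRM's actual implementation.

```python
# Hypothetical sketch of a DRF-style allocator with a CPU/GPU domain split.
# Capacities, demands, and class names are illustrative only; the paper's
# actual HDRF model and H-HRM implementation are not reproduced here.

CAPACITY = {"cpu": 64.0, "gpu": 8.0, "mem": 256.0}  # assumed cluster totals

class Framework:
    def __init__(self, name, demand, domain):
        self.name = name      # job queue / framework name
        self.demand = demand  # per-task demand, e.g. {"cpu": 2, "gpu": 1, "mem": 4}
        self.domain = domain  # "cpu" or "gpu": the job's declared main resource
        self.alloc = {r: 0.0 for r in CAPACITY}

    def key(self):
        # Classic DRF dominant share, with the share of the job's own
        # main-resource domain as a tie-breaker, so CPU-heavy and
        # GPU-heavy jobs are compared on their own terms.
        shares = {r: self.alloc[r] / CAPACITY[r] for r in CAPACITY}
        return (max(shares.values()), shares[self.domain])

def allocate(frameworks, rounds):
    for _ in range(rounds):
        f = min(frameworks, key=lambda fw: fw.key())
        if any(f.alloc[r] + f.demand[r] > CAPACITY[r] for r in CAPACITY):
            return  # the fairest candidate no longer fits; stop (sketch only)
        for r in CAPACITY:
            f.alloc[r] += f.demand[r]

batch = Framework("batch-cpu", {"cpu": 4.0, "gpu": 0.0, "mem": 8.0}, domain="cpu")
stream = Framework("stream-gpu", {"cpu": 1.0, "gpu": 1.0, "mem": 4.0}, domain="gpu")
allocate([batch, stream], rounds=24)
for f in (batch, stream):
    print(f.name, f.alloc)
```

In H-HRM, per-domain fairness of this kind is combined with the queue binding and queue stacking mechanisms described above, so that several executors can share one GPU-bound queue; the sketch conveys only the scheduling-order intuition.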

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Funding

The authors would like to thank the anonymous referees for their invaluable suggestions and comments. This work is supported by the National Natural Science Foundation of China (Grant No. 61872284); the Key Research and Development Program of Shaanxi (Program Nos. 2023-YBGY-203 and 2023-YBGY-021); the Industrialization Project of the Shaanxi Provincial Department of Education (Grant No. 21JC017); the "Thirteenth Five-Year" National Key R&D Program (Project No. 2019YFD1100901); the Natural Science Foundation of Shaanxi Province, China (Grant Nos. 2021JLM-16 and 2023-JC-YB-825); and the Key R&D Plan of Xianyang City (Grant No. L2023-ZDYF-QYCX-021).

Author information

Contributions

Qinlu He contributed significantly to the algorithm design. Zhen Li performed the performance analysis. Genqing Bian and Weiqi Zhang completed the experimental work. Fan Zhang and Qinlu He contributed to manuscript preparation.

Corresponding author

Correspondence to Qinlu He.

Ethics declarations

Competing interests

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Consent for publication

The authors have read and understood the publishing policy and submit this manuscript in accordance with it.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, Q., Zhang, F., Bian, G. et al. Design and realization of hybrid resource management system for heterogeneous cluster. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04267-z
