
Network states-aware collective communication optimization

Published in: Cluster Computing

Abstract

Message Passing Interface (MPI) is the de facto standard for parallel programming, and MPI collective operations are widely used by scientific applications. The efficiency of these collective operations greatly impacts the performance of parallel applications. With the increasing scale and heterogeneity of HPC systems, the network environment has become more complex: network states vary widely and dynamically between node pairs, which makes it more difficult to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations using real-time measured network states, focusing on the binomial tree algorithm. Our approach employs a low-overhead method to measure network states and constructs a low-latency binomial tree based on the measurement results. Additionally, we account for the differences between the two underlying MPI point-to-point communication protocols, eager and rendezvous, and design a tailored binomial tree construction algorithm for each protocol. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather, and MPI_Scatter, using our network states-aware binomial tree algorithm at the inter-node level. Benchmark results demonstrate that our algorithm effectively enhances performance for small and medium messages compared with the default binomial tree algorithm in Open MPI. Specifically, for MPI_Bcast we observe an average performance improvement of over 15.5% when the message size is less than 64KB; for MPI_Reduce, an average improvement of over 12.1% when the message size is below 2KB; and for MPI_Gather, an average improvement of over 10% for messages ranging from 64B to 512B. For MPI_Scatter, our algorithm achieves performance improvements only for certain message sizes.
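For context, the binomial tree underlying these collectives is the classic logarithmic pattern in which each process that already holds the message forwards it to a new subtree in every round. The sketch below is a minimal C/MPI illustration of the standard rank-ordered binomial-tree broadcast that serves as the baseline; it is not the authors' network states-aware construction, only the textbook algorithm it competes against.

```c
/* Minimal sketch of the classic rank-ordered binomial-tree broadcast
 * (the default baseline), NOT the network states-aware variant
 * proposed in the paper. */
#include <mpi.h>

void binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                    int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Rotate ranks so the root becomes virtual rank 0. */
    int vrank = (rank - root + size) % size;

    /* Receive from the parent: the parent of vrank is obtained by
     * clearing vrank's lowest set bit. */
    int mask = 1;
    while (mask < size) {
        if (vrank & mask) {
            int src = (vrank - mask + root) % size;
            MPI_Recv(buf, count, dtype, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to children: vrank + mask for every mask below the
     * lowest set bit of vrank (all remaining masks for the root). */
    mask >>= 1;
    while (mask > 0) {
        if (vrank + mask < size) {
            int dst = (vrank + mask + root) % size;
            MPI_Send(buf, count, dtype, dst, 0, comm);
        }
        mask >>= 1;
    }
}
```

Each non-root process receives exactly once and forwards to at most log2(size) children, giving the O(log p) round count; the paper's contribution is choosing which physical node occupies each tree position based on measured network states, rather than by rank order.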


Data Availability

Enquiries about data availability should be directed to the authors.


Funding

This work is supported by the National Key R&D Program of China, No. 2022YFB4501701.

Author information

Contributions

JW: methodology, software, validation, investigation, writing—original draft. TZ: software, investigation, supervision, writing—review and editing. YW: conceptualization, resources, investigation, writing—review and editing, supervision.

Corresponding author

Correspondence to Yunlan Wang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Comparison of different binomial trees for MPI_Bcast

We constructed four binomial trees, as depicted in Fig. 13. The experimental environment is the same as in Section 5, with process groups 0–3, 4–7, 8–11, and 12–15 each located within a single rack. The default binomial tree, referred to as bt0, was constructed accordingly. Trees bt1 and bt2 were constructed using depth-first and breadth-first methods, respectively. Finally, bt3 is a binomial tree with deliberately poor communication performance, in which all communication takes place between nodes in different racks (see the sketch after the figure caption below).

Fig. 13 Structure of the four binomial trees
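To make the rack layout concrete, the following sketch (an illustration under the appendix's setup, not code from the paper) enumerates the parent-child edges of the default tree bt0 for 16 ranks rooted at 0, where the parent of rank v is obtained by clearing v's lowest set bit, and flags which edges cross the racks of four described above.

```c
/* Illustrative sketch: list the edges of the default binomial tree
 * (bt0) for 16 ranks and flag inter-rack edges, assuming the rack
 * layout from the appendix (ranks 0-3, 4-7, 8-11, 12-15 per rack). */
#include <stdio.h>

#define SIZE 16
#define RACK(r) ((r) / 4)

int main(void)
{
    for (int v = 1; v < SIZE; v++) {
        int parent = v & (v - 1);   /* clear the lowest set bit */
        printf("edge %2d -> %2d  %s\n", parent, v,
               RACK(parent) == RACK(v) ? "intra-rack" : "inter-rack");
    }
    return 0;
}
```

Mappings such as bt1 and bt2 keep this same tree shape but permute which physical node sits at each virtual position, changing how many of these edges stay inside a rack; bt3 permutes the nodes so that every edge is inter-rack.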

Figure 14 shows the latency of MPI_Bcast for the four binomial trees and compares them with our algorithm. The performance differences among the tree structures emerge primarily when broadcasting small and medium-sized messages.

Fig. 14 Performance comparison of four different binomial trees

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Zhao, T. & Wang, Y. Network states-aware collective communication optimization. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04330-9

