
Network states-aware collective communication optimization

Published in: Cluster Computing

Abstract

Message Passing Interface (MPI) is the de facto standard for parallel programming, and MPI collective operations are widely used by scientific applications. The efficiency of these collective operations greatly impacts the performance of parallel applications. With the increasing scale and heterogeneity of HPC systems, the network environment has become more complex: network states vary widely and dynamically between node pairs, which makes it more difficult to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations using real-time measured network states, focusing on the binomial tree algorithm. Our approach employs a low-overhead method to measure network states and constructs a low-latency binomial tree based on the measurement results. Additionally, we account for the differences between the two underlying MPI point-to-point communication protocols, eager and rendezvous, and design a tailored binomial tree construction algorithm for each protocol. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather, and MPI_Scatter, using our network states-aware binomial tree algorithm at the inter-node level. Benchmark results demonstrate that our algorithm effectively enhances performance for small and medium messages compared with the default binomial tree algorithm in Open MPI. Specifically, for MPI_Bcast we observe an average performance improvement of over 15.5% when the message size is less than 64KB; for MPI_Reduce, an average improvement of over 12.1% when the message size is below 2KB; and for MPI_Gather, an average improvement of over 10% for messages ranging from 64B to 512B. For MPI_Scatter, our algorithm achieves performance improvements only for certain message sizes.
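For context, the binomial tree underlying these collectives is the classic logarithmic pattern in which each process that already holds the message forwards it to a new subtree in every round. The sketch below is a minimal C/MPI illustration of the standard rank-ordered binomial-tree broadcast that serves as the baseline; it is not the authors' network states-aware construction, only the textbook algorithm it competes against.

```c
/* Minimal sketch of the classic rank-ordered binomial-tree broadcast
 * (the default baseline), NOT the network states-aware variant
 * proposed in the paper. */
#include <mpi.h>

void binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                    int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Rotate ranks so the root becomes virtual rank 0. */
    int vrank = (rank - root + size) % size;

    /* Receive from the parent: the parent of vrank is obtained by
     * clearing vrank's lowest set bit. */
    int mask = 1;
    while (mask < size) {
        if (vrank & mask) {
            int src = (vrank - mask + root) % size;
            MPI_Recv(buf, count, dtype, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to children: vrank + mask for every mask below the
     * lowest set bit of vrank (all remaining masks for the root). */
    mask >>= 1;
    while (mask > 0) {
        if (vrank + mask < size) {
            int dst = (vrank + mask + root) % size;
            MPI_Send(buf, count, dtype, dst, 0, comm);
        }
        mask >>= 1;
    }
}
```

Each non-root process receives exactly once and forwards to at most log2(size) children, giving the O(log p) round count; the paper's contribution is choosing which physical node occupies each tree position based on measured network states, rather than by rank order.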


Data Availability

Enquiries about data availability should be directed to the authors.


Funding

This work is supported by the National Key R&D Program of China, No. 2022YFB4501701.

Author information

Contributions

JW: methodology, software, validation, investigation, writing—original draft. TZ: software, investigation, supervision, writing—review and editing. YW: conceptualization, resources, investigation, writing—review and editing, supervision.

Corresponding author

Correspondence to Yunlan Wang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Comparison of different binomial trees for MPI_Bcast

We constructed four binomial trees, as depicted in Fig. 13. The experimental environment is the same as in Section 5, with process groups 0–3, 4–7, 8–11, and 12–15 each located within a single rack. The default binomial tree, referred to as bt0, was constructed accordingly. Trees bt1 and bt2 were constructed using depth-first and breadth-first methods, respectively. Finally, bt3 is a binomial tree with deliberately poor communication performance, in which all communication takes place between nodes in different racks (see the sketch after the figure caption below).

Fig. 13 Structure of the four binomial trees
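To make the rack layout concrete, the following sketch (an illustration under the appendix's setup, not code from the paper) enumerates the parent-child edges of the default tree bt0 for 16 ranks rooted at 0, where the parent of rank v is obtained by clearing v's lowest set bit, and flags which edges cross the racks of four described above.

```c
/* Illustrative sketch: list the edges of the default binomial tree
 * (bt0) for 16 ranks and flag inter-rack edges, assuming the rack
 * layout from the appendix (ranks 0-3, 4-7, 8-11, 12-15 per rack). */
#include <stdio.h>

#define SIZE 16
#define RACK(r) ((r) / 4)

int main(void)
{
    for (int v = 1; v < SIZE; v++) {
        int parent = v & (v - 1);   /* clear the lowest set bit */
        printf("edge %2d -> %2d  %s\n", parent, v,
               RACK(parent) == RACK(v) ? "intra-rack" : "inter-rack");
    }
    return 0;
}
```

Mappings such as bt1 and bt2 keep this same tree shape but permute which physical node sits at each virtual position, changing how many of these edges stay inside a rack; bt3 permutes the nodes so that every edge is inter-rack.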

Figure 14 shows the latency of MPI_Bcast for the four binomial trees and compares them with our algorithm. The performance differences among the tree structures emerge primarily when broadcasting small and medium-sized messages.

Fig. 14 Performance comparison of four different binomial trees

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Zhao, T. & Wang, Y. Network states-aware collective communication optimization. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04330-9

