Optimizing Distributed Tensor Contractions Using Node-Aware Processor Grids

Irmler, Andreas; Kanakagiri, Raghavendra; Ohlmann, Sebastian T.; Solomonik, Edgar; Grüneis, Andreas

doi:10.1007/978-3-031-39698-4_48

Andreas Irmler (ORCID: orcid.org/0000-0003-0525-7772), Raghavendra Kanakagiri, Sebastian T. Ohlmann, Edgar Solomonik & Andreas Grüneis

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14100)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

We propose an algorithm that aims to minimize the inter-node communication volume for distributed, memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize the intra-/inter-node communication volume in the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves significantly improved performance for matrix-matrix multiplication and tensor contractions on up to several hundred modern compute nodes, compared to conventional implementations that do not use node-aware processor grids. Our implementation also performs well in comparison with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). In addition to matrix-matrix multiplication, we investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end, we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm also improves the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.
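To make the key idea concrete, the following minimal sketch compares two ways of placing the ranks of a 2D processor grid onto compute nodes. It is a toy illustration under stated assumptions, not the algorithm of the paper or the CTF implementation: the function names, the 4 x 4 grid, the 4 ranks per node, and the cost proxy (for every row and column communicator, the number of distinct nodes it spans minus one) are all hypothetical choices made for this example.

```python
def row_major_node(i, j, grid, ranks_per_node):
    """Node id when ranks are numbered row-major and packed onto nodes in order."""
    _, cols = grid
    return (i * cols + j) // ranks_per_node


def blocked_node(i, j, grid, block):
    """Node id when each node owns a block[0] x block[1] tile of the grid."""
    _, cols = grid
    block_rows, block_cols = block
    return (i // block_rows) * (cols // block_cols) + (j // block_cols)


def internode_links(grid, node_of):
    """Toy proxy (an assumption of this sketch): sum over all row and column
    communicators of (number of distinct nodes spanned - 1)."""
    rows, cols = grid
    total = 0
    for i in range(rows):  # row communicators
        total += len({node_of(i, j) for j in range(cols)}) - 1
    for j in range(cols):  # column communicators
        total += len({node_of(i, j) for i in range(rows)}) - 1
    return total


grid = (4, 4)  # 16 ranks, assumed 4 ranks per node
default_cost = internode_links(grid, lambda i, j: row_major_node(i, j, grid, 4))
blocked_cost = internode_links(grid, lambda i, j: blocked_node(i, j, grid, (2, 2)))
print(default_cost, blocked_cost)  # 12 8
```

In this toy setting, the row-major placement keeps every grid row on a single node but spreads every column over four nodes, whereas the 2 x 2 tiling spreads both rows and columns over only two nodes each. Which placement is preferable in practice depends on the message sizes along each grid dimension, which this proxy deliberately ignores.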

Notes

1. CTF-def and CTF-na can be run with https://github.com/airmler/ctf, branch node-awareness, commit ID 2f32bd6.
2. https://github.com/eth-cscs/COSMA.git, commit ID fe98d3eb.

Author information

Authors and Affiliations

Institute for Theoretical Physics, TU Wien, Vienna, Austria
Andreas Irmler & Andreas Grüneis
Max Planck Computing and Data Facility, Garching, Germany
Sebastian T. Ohlmann
University of Illinois at Urbana-Champaign, Champaign, USA
Raghavendra Kanakagiri & Edgar Solomonik

Corresponding author

Correspondence to Andreas Irmler.

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, UK
José Cano
University of Cyprus, Nicosia, Cyprus
Marios D. Dikaiakos
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Chalmers University of Technology, Gothenburg, Sweden
Miquel Pericàs
University of Manchester, Manchester, UK
Rizos Sakellariou

About this paper

Cite this paper

Irmler, A., Kanakagiri, R., Ohlmann, S.T., Solomonik, E., Grüneis, A. (2023). Optimizing Distributed Tensor Contractions Using Node-Aware Processor Grids. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_48

DOI: https://doi.org/10.1007/978-3-031-39698-4_48
Published: 24 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4
eBook Packages: Computer Science, Computer Science (R0)
