Optimizing Distributed Tensor Contractions Using Node-Aware Processor Grids

  • Conference paper
Euro-Par 2023: Parallel Processing (Euro-Par 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14100)

Abstract

We propose an algorithm that aims to minimize the inter-node communication volume of distributed, memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize the intra-node and inter-node communication volume of the contraction algorithms employed. We present an implementation of the proposed node-aware communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves significantly better performance for matrix-matrix multiplication and tensor contractions on up to several hundred modern compute nodes than conventional implementations that do not use node-aware processor grids. Our implementation also performs well when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). Beyond matrix-matrix multiplication, we investigate the performance of the node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end, we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm also improves the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.
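The idea of a node-aware processor grid can be illustrated with a toy cost model. The sketch below is not the algorithm implemented in CTF; it only assumes a 2D grid of MPI-like ranks, a SUMMA-style pattern with one broadcast along every grid row and every grid column, and a cost of k - 1 inter-node messages for a broadcast whose communicator spans k nodes. The function names and the cost model are hypothetical and chosen for illustration.

```python
from itertools import product


def row_major_nodes(pr, pc, ranks_per_node):
    """Node id of each grid position when ranks fill the grid row by row."""
    return {(i, j): (i * pc + j) // ranks_per_node
            for i, j in product(range(pr), range(pc))}


def tiled_nodes(pr, pc, tr, tc):
    """Node id of each grid position when every node owns a tr x tc tile."""
    assert pr % tr == 0 and pc % tc == 0, "tiles must divide the grid evenly"
    tiles_per_row = pc // tc
    return {(i, j): (i // tr) * tiles_per_row + (j // tc)
            for i, j in product(range(pr), range(pc))}


def inter_node_messages(nodes, pr, pc):
    """Toy cost: one broadcast per grid row and per grid column.

    Assumption: a broadcast over a communicator spanning k nodes
    needs k - 1 inter-node messages.
    """
    rows = sum(len({nodes[i, j] for j in range(pc)}) - 1 for i in range(pr))
    cols = sum(len({nodes[i, j] for i in range(pr)}) - 1 for j in range(pc))
    return rows + cols


if __name__ == "__main__":
    # 64 ranks on an 8 x 8 grid, 16 ranks per node (4 nodes).
    default = inter_node_messages(row_major_nodes(8, 8, 16), 8, 8)
    node_aware = inter_node_messages(tiled_nodes(8, 8, 4, 4), 8, 8)
    print("row-major placement:", default)
    print("4 x 4 node tiles:   ", node_aware)
```

Under these assumptions, the row-major placement yields 24 inter-node messages and the 4 x 4 tiling yields 16, because the tiling halves the number of nodes spanned by each column communicator at the cost of one extra node per row communicator.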

Notes

  1. CTF-def and CTF-na can be run with https://github.com/airmler/ctf, branch node-awareness, commit ID 2f32bd6.

  2. https://github.com/eth-cscs/COSMA.git, commit ID fe98d3eb.

Corresponding author

Correspondence to Andreas Irmler.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Irmler, A., Kanakagiri, R., Ohlmann, S.T., Solomonik, E., Grüneis, A. (2023). Optimizing Distributed Tensor Contractions Using Node-Aware Processor Grids. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_48

  • DOI: https://doi.org/10.1007/978-3-031-39698-4_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39697-7

  • Online ISBN: 978-3-031-39698-4

  • eBook Packages: Computer Science, Computer Science (R0)
