Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries

Kwack, JaeHyuk; Bertoni, Colleen; Pham, Buu; Larkin, Jeff

doi:10.1007/978-3-030-49943-3_5

JaeHyuk Kwack (ORCID: orcid.org/0000-0002-8272-1201), Colleen Bertoni, Buu Pham & Jeff Larkin

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12017)

Included in the following conference series:

International Workshop on Accelerator Programming Using Directives

Abstract

The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale system in 2020 and three exascale supercomputers in 2021/2022. All of these systems are GPU-enabled, and they plan to provide optimized vendor-promoted programming models for their GPUs, such as CUDA, HIP and SYCL. However, because these models have limited functional portability, it is challenging for HPC application developers to maintain their applications efficiently and productively across all US DOE pre-exascale/exascale systems. Directive-based programming models for accelerators can be one solution for HPC applications on the DOE supercomputers. In this study, we employ OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare the performance of the restructured offloading kernels against the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the NVIDIA V100 GPU in the RI-MP2 kernel. Using the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU becomes more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI+directive-based offloading implementations of the RI-MP2 kernel perform more than 40 times faster than an MPI+OpenMP threading implementation on the same number of Summit nodes. This study demonstrates how directive-based offloading implementations can perform near what we expect based on machine peak ratios.
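
The comparison against peak performance ratios can be illustrated with a small calculation. The sketch below is not taken from the paper: the function names are hypothetical, the V100 value is NVIDIA's published double-precision peak for the SXM2 part, and the dual-socket Power9 value is a rough placeholder assumed only for illustration. Substituting the peak values reported in the full text would give the ratio the authors actually compare against.

```python
def peak_ratio(gpu_peak_tflops: float, cpu_peak_tflops: float) -> float:
    """Return the GPU-to-CPU ratio of peak floating-point throughput."""
    if cpu_peak_tflops <= 0:
        raise ValueError("CPU peak must be positive")
    return gpu_peak_tflops / cpu_peak_tflops


def fraction_of_expected(measured_speedup: float, expected_speedup: float) -> float:
    """Return how much of the peak-ratio speed-up a measurement achieves."""
    if expected_speedup <= 0:
        raise ValueError("Expected speed-up must be positive")
    return measured_speedup / expected_speedup


# Published FP64 peak of one V100 (SXM2), in TFLOP/s.
V100_FP64_PEAK_TFLOPS = 7.8
# ASSUMPTION: placeholder for the dual-socket Power9 FP64 peak, in TFLOP/s.
POWER9_DUAL_SOCKET_PEAK_TFLOPS = 1.0
# Lower bound quoted in the abstract ("more than 7 times faster").
MEASURED_SPEEDUP = 7.0

if __name__ == "__main__":
    expected = peak_ratio(V100_FP64_PEAK_TFLOPS, POWER9_DUAL_SOCKET_PEAK_TFLOPS)
    achieved = fraction_of_expected(MEASURED_SPEEDUP, expected)
    print(f"expected speed-up from peak ratio: {expected:.2f}x")
    print(f"fraction of expected achieved:     {achieved:.2f}")
```

A measured speed-up close to the peak ratio suggests the offloaded kernel is limited mainly by floating-point throughput rather than by data movement, which is the sense in which the abstract calls the result "near the theoretical speed-up".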

Acknowledgment

This work was supported by the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by a grant from the Department of Energy Exascale Computing Project (ECP), administered by the Ames Laboratory. We also gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Last but not least, we would like to thank the Exascale Computing Project (ECP) and Oak Ridge Leadership Computing Facility (OLCF) for organizing the 2019 ECP/OLCF OpenMP Hackathon in Knoxville, TN, and give special thanks to our mentors, Dmytro Bykov from OLCF and Vivek Kale from BNL, for their contributions to this work.

Author information

Authors and Affiliations

Argonne National Laboratory, Lemont, IL, 60439, USA
JaeHyuk Kwack & Colleen Bertoni
Iowa State University, Ames, IA, 50011, USA
Buu Pham
NVIDIA, Santa Clara, USA
Jeff Larkin

Corresponding author

Correspondence to JaeHyuk Kwack.

Editor information

Editors and Affiliations

RWTH Aachen University, Aachen, Germany
Sandra Wienke
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Sridutt Bhalachandra

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 301 KB)

Appendix I

Table 12. Fortran wrapper for cuBLAS and cuBLASXT functions

Cite this paper

Kwack, J., Bertoni, C., Pham, B., Larkin, J. (2020). Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries. In: Wienke, S., Bhalachandra, S. (eds) Accelerator Programming Using Directives. WACCPD 2019. Lecture Notes in Computer Science, vol 12017. Springer, Cham. https://doi.org/10.1007/978-3-030-49943-3_5

DOI: https://doi.org/10.1007/978-3-030-49943-3_5
Published: 09 June 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49942-6
Online ISBN: 978-3-030-49943-3
eBook Packages: Computer Science, Computer Science (R0)
