Improving CUDA performance of an unstructured high-order CFD application under OP2 framework

Huang, Kang**; Che, Yonggang; Xu, Chuanfu; Dai, Zhe; Zhang, Jian

doi:10.1007/s11227-023-05679-1

Published: 07 October 2023

The Journal of Supercomputing, volume 80, pages 5832–5846 (2024)

Kang** Huang¹, Yonggang Che¹, Chuanfu Xu¹, Zhe Dai² & Jian Zhang²

Abstract

OP2 is a programming framework for unstructured mesh applications, built on a domain-specific language. It supports automatic code generation for multiple parallel programming models, including CUDA. However, using OP2 to generate efficient CUDA code for real-world applications is challenging. This paper reports our efforts to optimize CUDA code performance while refactoring an unstructured high-order CFD application, HOUR2D, on top of OP2. We implement a series of novel optimizations, including choosing appropriate execution strategies, using local arrays, and optimizing the OP2 data transfer function. Performance evaluation shows that these optimizations significantly improve the performance of the generated CUDA code. The overall performance of the optimized OP2-CUDA code is 13.2 times that of the unoptimized OP2-CUDA code and 2.4 times that of the hand-written CUDA code. These optimizations do not affect the portability of HOUR2D as an OP2 application.

Data availability

The data can be requested from the authors.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (62272474, 61561146395).

Funding

National Natural Science Foundation of China (62272474, 61561146395).

Author information

Authors and Affiliations

Institute of Quantum Information and State Key Laboratory of High-Performance Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, 410073, Hunan, China
Kang** Huang, Yonggang Che & Chuanfu Xu
Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, China
Zhe Dai & Jian Zhang

Contributions

KH implemented the HOUR2D code with OP2 and wrote the main manuscript text. YC proposed some of the optimization ideas and revised the manuscript. CX proposed some of the optimization ideas. ZD and JZ implemented the manual HOUR2D-CUDA code and provided research suggestions for this work.

Corresponding author

Correspondence to Yonggang Che.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huang, K., Che, Y., Xu, C. et al. Improving CUDA performance of an unstructured high-order CFD application under OP2 framework. J Supercomput 80, 5832–5846 (2024). https://doi.org/10.1007/s11227-023-05679-1

Accepted: 19 September 2023
Published: 07 October 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11227-023-05679-1
