Data Transfer and Reuse Analysis Tool for GPU-Offloading Using OpenMP

Mishra, Alok; Malik, Abid M.; Chapman, Barbara

doi:10.1007/978-3-030-58144-2_18

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12295)

Included in the following conference series:

International Workshop on OpenMP

Abstract

In the high performance computing sector, researchers and application developers expend considerable effort to port their applications to GPU-based clusters in order to take advantage of the massive parallelism and energy efficiency of a GPU. Unfortunately, porting or writing an application for accelerators, such as GPUs, requires extensive knowledge of the underlying architectures, the application/algorithm, and the interfacing programming model, such as CUDA, HIP or OpenMP. Compared to native GPU programming models, OpenMP has a shorter learning curve, is portable, and is potentially also performance portable. To reduce developer effort, OpenMP provides implicit data transfer between CPU and GPU. OpenMP users may control how long a data object remains allocated on the GPU by using target data regions, but they are not required to do so. Unfortunately, unless data mappings are explicitly provided by the user, compilers like Clang move all data accessed by a kernel to the GPU without considering its prior availability on the device. As a result, applications may spend a significant portion of their execution time on data transfer. Exploiting data reuse opportunities in an application, by contrast, has the potential to significantly reduce the overall execution time. In this paper we present a source-to-source tool that automatically identifies data in an OpenMP program which do not need to be transferred between CPU and GPU. The tool capitalizes on any data reuse opportunities to insert the pertinent, optimized OpenMP target data directives. Our experimental results show a considerable reduction in the overall execution time of a set of micro-benchmarks and some benchmark applications from the Rodinia benchmark suite. To the best of our knowledge, no other tool optimizes OpenMP data mappings by identifying and exploiting data reuse opportunities between kernels.
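The kind of data reuse described in the abstract can be illustrated with a minimal sketch in C. It is not output of the authors' tool; the function names `scale_then_shift_per_kernel` and `scale_then_shift_reuse` are hypothetical and chosen only for illustration. The first function maps the array separately for each kernel, so the data may cross the CPU-GPU boundary twice in each direction. The second wraps both kernels in a single `target data` region, so the array is expected to be copied to the device once and back once.

```c
/* Baseline: each kernel carries its own map clause, so a[0:n] is
 * copied to the device and back for every kernel. */
void scale_then_shift_per_kernel(double *a, int n, double s, double d)
{
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= s;

    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += d;
}

/* Reuse: one enclosing target data region keeps a[0:n] resident on the
 * device across both kernels. The inner target regions find the array
 * already present, so no additional transfer is needed between them. */
void scale_then_shift_reuse(double *a, int n, double s, double d)
{
    #pragma omp target data map(tofrom: a[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= s;

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] += d;
    }
}
```

Both functions compute the same result; only the number of transfers differs. When compiled without OpenMP support the pragmas are ignored and the loops run on the host, which makes the sketch easy to check for correctness before measuring it on a device.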

Notes

1. In this paper the term kernel is always used in reference to a GPU kernel.

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University for access to the SeaWulf computing system, which was made possible by a $1.4M National Science Foundation grant (#1531492). Special thanks to our colleague Dr. Chunhua Liao from Lawrence Livermore National Laboratory for his initial feedback and helpful discussions.

Author information

Authors and Affiliations

Stony Brook University, Stony Brook, NY, 11794, USA
Alok Mishra & Barbara Chapman
Brookhaven National Laboratory, Upton, NY, 11973, USA
Abid M. Malik & Barbara Chapman

Corresponding author

Correspondence to Alok Mishra.

Editor information

Editors and Affiliations

Texas Advanced Computing Center (TACC), Austin, TX, USA
Kent Milfeld
Lawrence Livermore National Laboratory, Livermore, CA, USA
Bronis R. de Supinski
Texas Advanced Computing Center (TACC), Austin, TX, USA
Lars Koesterke
RWTH Aachen University, Aachen, Germany
Jannis Klinkenberg

About this paper

Cite this paper

Mishra, A., Malik, A.M., Chapman, B. (2020). Data Transfer and Reuse Analysis Tool for GPU-Offloading Using OpenMP. In: Milfeld, K., de Supinski, B., Koesterke, L., Klinkenberg, J. (eds) OpenMP: Portable Multi-Level Parallelism on Modern Systems. IWOMP 2020. Lecture Notes in Computer Science, vol 12295. Springer, Cham. https://doi.org/10.1007/978-3-030-58144-2_18

DOI: https://doi.org/10.1007/978-3-030-58144-2_18
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58143-5
Online ISBN: 978-3-030-58144-2
eBook Packages: Computer Science, Computer Science (R0)
