Parallel programming models for heterogeneous many-cores: a comprehensive survey

  • Survey Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

Heterogeneous many-cores are now an integral part of modern computing systems, ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient, high-performance computing, that potential can only be unlocked if application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review compiler techniques for improving programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map spanning a wide variety of research areas, and conclude with a discussion of open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
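To make the setting concrete, the sketch below is ours rather than the authors': a minimal CUDA SAXPY in which the problem size, kernel, and launch configuration are arbitrary illustrative choices. It shows the two recurring concerns of the survey: expressing a kernel in a device-specific programming model, and paying for explicit host-device data transfers, the communication overhead that the optimization techniques reviewed here aim to hide or eliminate.

    // Illustrative sketch only (not from the paper): a minimal CUDA program
    // showing a device kernel plus the explicit host<->device copies whose
    // cost motivates the survey's communication-optimization discussion.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) y[i] = a * x[i] + y[i];              // one element per thread
    }

    int main() {
        const int n = 1 << 20;                 // arbitrary problem size
        const size_t bytes = n * sizeof(float);

        float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;
        cudaMalloc((void **)&dx, bytes);
        cudaMalloc((void **)&dy, bytes);

        // Explicit host-to-device communication: on discrete accelerators
        // these transfers can rival the kernel itself in cost.
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);  // launch kernel

        // Device-to-host copy; cudaMemcpy synchronizes with the kernel.
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", hy[0]);          // expect 4.0

        cudaFree(dx); cudaFree(dy); free(hx); free(hy);
        return 0;
    }

The higher-level models surveyed in this article (for example, directive-based offloading, SYCL, or skeleton libraries) differ mainly in how much of this boilerplate, especially the data movement, they automate away.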


Figs. 1–5 (omitted). Fig. 2 reproduced from Intel’s OneAPI (2020).

Notes

  1. Code is available at: https://goo.gl/y7bBdN.

References

  • Abadi, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR (2016)

  • Alfieri, R.A.: An efficient kernel-based implementation of POSIX threads. In: USENIX Summer 1994 Technical Conference. USENIX Association (1994)

  • AMD Brook+ programming.: Tech. rep., AMD Corporation (2007)

  • AMD CAL programming guide v2.0.: Tech. rep., AMD Corporation (2010)

  • AMD’s OpenCL Implementation.: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime (2020)

  • Amini, M., et al.: Static compilation analysis for host-accelerator communication optimization. In: Languages and Compilers for Parallel Computing, 24th International Workshop, LCPC (2011)

  • Andrade, G., et al.: Parallelme: A parallel mobile engine to explore heterogeneity in mobile computing architectures. In: Euro-Par 2016: Parallel Processing—22nd International Conference on Parallel and Distributed Computing (2016)

  • Arevalo, A., et al.: Programming the cell broadband engine: examples and best practices (2007)

  • Ayguadé, E., et al.: An extension of the starss programming model for platforms with multiple GPUs. In: Euro-Par 2009 Parallel Processing (2009)

  • Bader, D.A., Agarwal, V.: FFTC: fastest Fourier transform for the IBM cell broadband engine. In: High Performance Computing, HiPC (2007)

  • Bae, H., et al.: The cetus source-to-source compiler infrastructure: overview and evaluation. Int. J. Parallel Program. 41, 753–767 (2013)

  • Balaprakash, P., et al.: Autotuning in high-performance computing applications. In: Proceedings of the IEEE (2018)

  • Barker, K.J., et al.: Entering the petaflop era: the architecture and performance of roadrunner. In: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC (2008)

  • Baskaran, M.M., et al.: Automatic c-to-cuda code generation for affine programs. In: R. Gupta (ed.) 19th International Conference on Compiler Construction (CC) (2010)

  • Beckingsale, D., et al.: Performance portable C++ programming with RAJA. In: Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP (2019)

  • Beignet OpenCL.: https://www.freedesktop.org/wiki/Software/Beignet/ (2020)

  • Bell, N., Hoberock, J.: Chapter 26 - Thrust: a productivity-oriented library for cuda. In: Hwu, W.W. (ed.) GPU Computing Gems Jade Edition, Applications of GPU Computing Series, pp. 359–371. Morgan Kaufmann, Burlington (2012)

  • Bellens, P., et al.: Cellss: a programming model for the cell BE architecture. In: Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing (2006)

  • Bodin, F., Romain, D., Colin De Verdiere, G.: One OpenCL to Rule Them All? In: International Workshop on Multi-/Many-core Computing Systems, MuCoCoS (2013)

  • Boyer, M., et al.: Improving GPU performance prediction with data transfer modeling. In: 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (2013)

  • Breitbart, J., Fohry, C.: Opencl: an effective programming model for data parallel computations at the cell broadband engine. In: 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS (2010)

  • Brodtkorb, A.R., et al.: State-of-the-art in heterogeneous computing. Sci. Program. 18, 1–33 (2010)

  • Buck, I., et al.: Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph 23, 777–786 (2004)

  • Chandrasekhar, A., et al.: IGC: the open source intel graphics compiler. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019)

  • Che, S., et al.: Rodinia: A benchmark suite for heterogeneous computing. In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization. IEEE Computer Society (2009)

  • Chen, T., et al.: Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015)

  • Chen, T., et al.: Cell broadband engine architecture and its first implementation—a performance view. IBM J. Res. Dev. 51, 559–572 (2007)

  • Chen, D., et al.: Characterizing scalability of sparse matrix-vector multiplications on phytium ft-2000+. Int. J. Parallel Program. 48, 80–97 (2020)

  • Ciechanowicz, P., et al.: The münster skeleton library muesli: a comprehensive overview. Working Papers, ERCIS-European Research Center for Information Systems, No. 7 (2009)

  • Cole, M.I.: Algorithmic skeletons: structured management of parallel computation (1989)

  • Common-Shader Core.: https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-common-core?redirectedfrom=MSDN (2018)

  • Copik, M., Kaiser, H.: Using SYCL as an implementation framework for hpx.compute. In: Proceedings of the 5th International Workshop on OpenCL, IWOCL (2017)

  • Crawford, C.H., et al.: Accelerating computing with the cell broadband engine processor. In: Proceedings of the 5th Conference on Computing Frontiers (2008)

  • Cummins, C., et al.: End-to-end deep learning of optimization heuristics. In: PACT (2017)

  • Cummins, C., et al.: Synthesizing benchmarks for predictive modeling. In: CGO (2017)

  • Dao, T.T., Lee, J.: An auto-tuner for opencl work-group size on GPUs. IEEE Trans. Parallel Distrib. Syst. 29, 283–296 (2018)

  • Dastgeer, U., et al.: Adaptive implementation selection in the skepu skeleton programming library. In: Advanced Parallel Processing Technologies—10th International Symposium, APPT (2013)

  • Davis, N.E., et al.: Paradigmatic shifts for exascale supercomputing. J. Supercomput. 62, 1023–1044 (2012)

  • De Sensi, D., et al.: Bringing parallel patterns out of the corner: the p3arsec benchmark suite. ACM Trans. Archit. Code Optim. (TACO) 14, 1–26 (2017)

  • de Carvalho Moreira, W., et al.: Exploring heterogeneous mobile architectures with a high-level programming model. In: 29th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD (2017)

  • de Fine Licht, J., Hoefler, T.: hlslib: Software engineering for hardware design. CoRR (2019)

  • Demidov, D., et al.: ddemidov/amgcl: 1.2.0 (2018). https://doi.org/10.5281/zenodo.1244532

  • Demidov, D., et al.: ddemidov/vexcl: 1.4.1 (2017). https://doi.org/10.5281/zenodo.571466

  • Demidov, D.: Amgcl: an efficient, flexible, and extensible algebraic multigrid implementation. Lobachevskii J. Math. 40, 535–546 (2019)

  • Dubach, C., et al.: Compiling a high-level language for GPUs (via language support for architectures and compilers). In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI (2012)

  • Diamos, G.F., et al.: Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In: 19th International Conference on Parallel Architectures and Compilation Techniques, PACT (2010)

  • DirectCompute programming guide.: Tech. rep., NVIDIA Corporation (2010)

  • Duran, A., et al.: Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21, 173–193 (2011)

  • Edwards, H.C., et al.: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74, 3202–3216 (2014)

  • Emani, M.K., et al.: Smart, adaptive mapping of parallelism in the presence of external workload. In: CGO (2013)

  • Ernstsson, A., et al.: Skepu 2: flexible and type-safe skeleton programming for heterogeneous parallel systems. Int. J. Parallel Program. 46, 62–80 (2018)

  • Fang, J., et al.: A comprehensive performance comparison of CUDA and opencl. In: ICPP (2011)

  • Fang, J., et al.: Implementing and evaluating opencl on an armv8 multi-core CPU. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC) (2017)

  • Fang, J., et al.: Test-driving intel xeon phi. In: ACM/SPEC International Conference on Performance Engineering (ICPE), pp. 137–148 (2014)

  • Fang, J.: Towards a systematic exploration of the optimization space for many-core processors. Ph.D. thesis, Delft University of Technology, Netherlands (2014)

  • Fang, J., et al.: Evaluating multiple streams on heterogeneous platforms. Parallel Process. Lett. 26(4), 1640002 (2016)

  • FreeOCL.: http://www.zuzuf.net/FreeOCL/ (2020)

  • GalliumCompute.: https://dri.freedesktop.org/wiki/GalliumCompute/ (2020)

  • Intel Graphics Compute Runtime.: https://github.com/intel/compute-runtime (2020)

  • Gardner, M.K., et al.: Characterizing the challenges and evaluating the efficacy of a cuda-to-opencl translator. Parallel Comput. 39, 769–786 (2013)

  • Giles, M.B., et al.: Performance analysis of the OP2 framework on many-core architectures. SIGMETRICS Performance Evaluation Review (2011)

  • Gómez-Luna, J., et al.: Performance models for asynchronous data transfers on consumer graphics processing units. J. Parallel Distrib. Comput. 72, 1117–1126 (2012)

  • Govindaraju, N.K., et al.: High performance discrete Fourier transforms on graphics processors. In: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC (2008)

  • Grasso, I., et al.: Energy efficient HPC on embedded socs: optimization techniques for mali GPU. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS (2014)

  • Green500 Supercomputers.: https://www.top500.org/green500/ (2020)

  • Gregg, C., et al.: Where is the data? why you cannot debate CPU vs. GPU performance without the answer. In: IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS (2011)

  • Gregory, K., Miller, A.: C++ AMP: accelerated massive parallelism with Microsoft Visual C++ (2012)

  • Grewe, D., et al.: Opencl task partitioning in the presence of GPU contention. In: LCPC (2013a)

  • Grewe, D., et al.: Portable mapping of data parallel programs to opencl for heterogeneous systems. In: CGO (2013b)

  • Gschwind, M., et al.: Synergistic processing in cell’s multicore architecture. IEEE Micro 26, 10–24 (2006)

  • Haidl, M., et al.: Pacxxv2 + RV: an llvm-based portable high-performance programming model. In: Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, LLVM-HPC@SC (2017)

  • Haidl, M., Gorlatch, S.: PACXX: towards a unified programming model for programming accelerators using C++14. In: Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, LLVM (2014)

  • Haidl, M., Gorlatch, S.: High-level programming for many-cores using C++14 and the STL. Int. J. Parallel Program. 46, 23–41 (2018)

  • Han, T.D., et al.: hicuda: a high-level directive-based language for GPU programming. In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU, ACM International Conference Proceeding Series (2009)

  • Han, T.D., et al.: hicuda: high-level GPGPU programming. IEEE Trans. Parallel Distrib. Syst. 22, 78–90 (2011)

  • Harris, M.J., et al.: Simulation of cloud dynamics on graphics hardware. In: Proceedings of the 2003 ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware (2003)

  • Harvey, M.J., et al.: Swan: a tool for porting CUDA programs to opencl. Comput. Phys. Commun. 182, 1093–1099 (2011)

  • HCC.: Heterogeneous Compute Compiler. https://gpuopen.com/compute-product/hcc-heterogeneous-compute-compiler/ (2020)

  • He, J., et al.: Openmdsp: Extending openmp to program multi-core DSP. In: 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT (2011)

  • Heller, T., et al.: HPX—an open source C++ standard library for parallelism and concurrency. In: OpenSuCo (2017)

  • Heller, T., et al.: Closing the performance gap with modern C++. In: High Performance Computing - ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P^3MA, VHPC, WOPSSS (2016)

  • Heller, T., et al.: Using HPX and libgeodecomp for scaling HPC applications on heterogeneous supercomputers. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA (2013)

  • HIP.: Heterogeneous-Compute Interface for Portability. https://github.com/RadeonOpenCompute/hcc (2020)

  • High-level abstractions for performance, portability and continuity of scientific software on future computing systems.: Tech. rep., University of Oxford (2014)

  • HLSL.: The High Level Shading Language for DirectX. https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl (2018)

  • Hong, S., et al.: Accelerating CUDA graph algorithms at maximum warp. In: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2011)

  • Hong, S., et al.: Green-marl: a DSL for easy and efficient graph analysis. In: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS (2012)

  • Intel Inc.: hStreams Architecture for MPSS 3.5 (2015)

  • Intel Manycore Platform Software Stack.: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss (2020)

  • Intel’s OneAPI.: https://software.intel.com/en-us/oneapi (2020)

  • Introducing RDNA architecture.: Tech. rep., AMD Corporation (2019)

  • Jääskeläinen, P., et al.: pocl: a performance-portable opencl implementation. Int. J. Parallel Program. 43, 752–785 (2015)

  • Kahle, J.A., et al.: Introduction to the cell multiprocessor. IBM J. Res. Dev. 49, 589–604 (2005)

  • Karp, R.M., et al.: The organization of computations for uniform recurrence equations. J. ACM (JACM) 14, 563–590 (1967)

  • Kim, J., et al.: Bridging opencl and CUDA: a comparative analysis and translation. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2015)

  • Kim, J., et al.: Snucl: an opencl framework for heterogeneous CPU/GPU clusters. In: International Conference on Supercomputing, ICS (2012)

  • Kim, Y., et al.: Translating CUDA to opencl for hardware generation using neural machine translation. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019)

  • Kim, J., et al.: Translating openmp device constructs to opencl using unnecessary data transfer elimination. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2016)

  • Kim, W., Voss, M.: Multicore desktop programming with intel threading building blocks. IEEE Softw. 28, 23–31 (2011)

  • Kistler, M., et al.: Petascale computing with accelerators. In: D.A. Reed, V. Sarkar (eds.) Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2009)

  • Komoda, T., et al.: Integrating multi-gpu execution in an openacc compiler. In: 42nd International Conference on Parallel Processing, ICPP (2013)

  • Komornicki, A., et al.: Roadrunner: hardware and software overview (2009)

  • Krüger, J.H., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graph (2003)

  • Kudlur, M., et al.: Orchestrating the execution of stream programs on multicore platforms. In: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, PLDI (2008)

  • Lee, S., Eigenmann, R.: Openmp to GPGPU: a compiler framework for automatic translation and optimization. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2009)

  • Lee, S., Eigenmann, R.: Openmpc: Extended openmp programming and tuning for GPUs. In: Conference on High Performance Computing Networking, Storage and Analysis, SC (2010)

  • Lee, V.W., et al.: Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: 37th International Symposium on Computer Architecture, ISCA (2010)

  • Lepley, T., et al.: A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory. In: International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES (2013)

  • Leung, A., et al.: A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In: Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU, ACM International Conference Proceeding Series (2010)

  • Li, Z., et al.: Evaluating the performance impact of multiple streams on the mic-based heterogeneous platform. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops (2016a)

  • Li, Z., et al.: Streaming applications on heterogeneous platforms. In: Network and Parallel Computing—13th IFIP WG 10.3 International Conference, NPC (2016b)

  • Liao, X., et al.: Moving from exascale to zettascale computing: challenges and techniques. Front. IT EE 19, 1236–1244 (2018)

  • Lindholm, E., et al.: NVIDIA tesla: a unified graphics and computing architecture. IEEE Micro 28, 39–55 (2008)

  • Liu, B., et al.: Software pipelining for graphic processing unit acceleration: partition, scheduling and granularity. IJHPCA (2016)

  • Marco, V.S., et al.: Improving spark application throughput via memory aware task co-location: a mixture of experts approach. In: Middleware (2017)

  • Mark, W.R., et al.: Cg: a system for programming graphics hardware in a c-like language. ACM Trans. Graph (2003)

  • Marqués, R., et al.: Algorithmic skeleton framework for the orchestration of GPU computations. In: Euro-Par 2013 Parallel Processing, Lecture Notes in Computer Science (2013)

  • Martinez, G., et al.: CU2CL: A cuda-to-opencl translator for multi- and many-core architectures. In: 17th IEEE International Conference on Parallel and Distributed Systems, ICPADS (2011)

  • Membarth, R., et al.: Hipacc: a domain-specific language and compiler for image processing. IEEE Trans. Parallel Distrib. Syst. (2016)

  • Membarth, R., et al.: Generating device-specific GPU code for local operators in medical imaging. In: 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS (2012)

  • Mendonca, G.S.D., et al.: Dawncc: Automatic annotation for data parallelism and offloading. TACO (2017)

  • Merrill, D., et al.: Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2012)

  • Meswani, M.R., et al.: Modeling and predicting performance of high performance computing applications on hardware accelerators. IJHPCA (2013)

  • Mishra, A., et al.: Kernel fusion/decomposition for automatic gpu-offloading. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019)

  • MPI.: Message Passing Interface. https://computing.llnl.gov/tutorials/mpi/ (2020)

  • Muralidharan, S., et al.: Architecture-adaptive code variant tuning. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS (2016)

  • Newburn, C.J., et al.: Heterogeneous streaming. In: IPDPSW (2016)

  • Nomizu, T., et al.: Implementation of xcalablemp device acceleration extension with opencl. In: 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, IPDPSW (2012)

  • Nugteren, C., Corporaal, H.: Introducing ’bones’: a parallelizing source-to-source compiler based on algorithmic skeletons. In: The 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU (2012)

  • NVIDIA CUDA Toolkit.: https://developer.nvidia.com/cuda-toolkit (2020)

  • NVIDIA GeForce GTX 980.: Tech. rep., NVIDIA Corporation (2014)

  • NVIDIA Tesla P100.: Tech. rep., NVIDIA Corporation (2016)

  • NVIDIA Tesla V100 GPU architecture.: Tech. rep., NVIDIA Corporation (2017)

  • NVIDIA Turing GPU architecture.: Tech. rep., NVIDIA Corporation (2018)

  • NVIDIA’s next generation CUDA compute architecture: Fermi.: Tech. rep., NVIDIA Corporation (2009)

  • NVIDIA’s next generation CUDA compute architecture: Kepler GK110/210.: Tech. rep., NVIDIA Corporation (2014)

  • O’Brien, K., et al.: Supporting openmp on cell. Int. J. Parallel Program. 36, 289–311 (2008)

  • Ogilvie, W.F., et al.: Fast automatic heuristic construction using active learning. In: LCPC (2014)

  • OpenCL.: The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/ (2020)

  • Owens, J.D., et al.: GPU computing. Proceedings of the IEEE (2008)

  • Owens, J.D., et al.: A survey of general-purpose computation on graphics hardware. In: Eurographics, pp. 21–51 (2005)

  • Parallel Patterns Library.: https://docs.microsoft.com/en-us/cpp/parallel/concrt/parallel-patterns-library-ppl?view=vs-2019 (2016)

  • Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pp. 8024–8035 (2019)

  • Patterson, D.A.: 50 years of computer architecture: from the mainframe CPU to the domain-specific TPU and the open RISC-V instruction set. In: 2018 IEEE International Solid-State Circuits Conference, ISSCC (2018)

  • PGI CUDA C/C++ for x86.: https://developer.nvidia.com/pgi-cuda-cc-x86 (2020)

  • Pham, D., et al.: The design methodology and implementation of a first-generation CELL processor: a multi-core soc. In: Proceedings of the IEEE 2005 Custom Integrated Circuits Conference, CICC (2005)

  • PIPS.: Automatic Parallelizer and Code Transformation Framework. https://pips4u.org/ (2020)

  • Qualcomm Snapdragon mobile platform opencl general programming and optimization.: Tech. rep., Qualcomm Corporation (2017)

  • Ragan-Kelley, J., et al.: Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph (2012)

  • Ragan-Kelley, J., et al.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI (2013)

  • Ravi, N., et al.: Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors. In: International Conference on Supercomputing, ICS (2012)

  • Ren, J., et al.: Camel: Smart, adaptive energy optimization for mobile web interactions. In: IEEE Conference on Computer Communications (INFOCOM) (2020)

  • Ren, J., et al.: Optimise web browsing on heterogeneous mobile platforms: a machine learning based approach. In: INFOCOM (2017)

  • Ren, J., et al.: Proteus: Network-aware web browsing on heterogeneous mobile systems. In: CoNEXT ’18 (2018)

  • Renderscript Compute.: http://developer.android.com/guide/topics/renderscript/compute.html (2020)

  • ROCm Runtime.: https://github.com/RadeonOpenCompute/ROCR-Runtime (2020)

  • ROCm.: A New Era in Open GPU Computing. https://www.amd.com/en/graphics/servers-solutions-rocm-hpc (2020)

  • Rudy, G., et al.: A programming language interface to describe transformations and code generation. In: Languages and Compilers for Parallel Computing - 23rd International Workshop, LCPC (2010)

  • Sanz Marco, V., et al.: Optimizing deep learning inference on embedded systems through adaptive model selection. ACM Trans. Embed. Comput. Syst. 19, 1–28 (2019)

  • Sathre, P., et al.: On the portability of cpu-accelerated applications via automated source-to-source translation. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia (2019)

  • Scarpazza, D.P., et al.: Efficient breadth-first search on the cell/be processor. IEEE Trans. Parallel Distrib. Syst. 19, 1381–1395 (2008)

  • Seiler, L., et al.: Larrabee: a many-core x86 architecture for visual computing. IEEE Micro 27, 1–15 (2009)

  • Sidelnik, A., et al.: Performance portability with the chapel language. In: 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS, pp. 582–594 (2012)

  • Steinkraus, D., et al.: Using GPUs for machine learning algorithms. In: Eighth International Conference on Document Analysis and Recognition (ICDAR). IEEE Computer Society (2005)

  • Steuwer, M., et al.: Skelcl—a portable skeleton library for high-level GPU programming. In: 25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS (2011)

  • Steuwer, M., Gorlatch, S.: Skelcl: a high-level extension of opencl for multi-gpu systems. J. Supercomput. 69, 23–25 (2014)

  • Stratton, J.A., et al.: MCUDA: an efficient implementation of CUDA kernels for multi-core CPUs. In: Languages and Compilers for Parallel Computing, 21st International Workshop, LCPC (2008)

  • SYCL integrates OpenCL devices with modern C++.: Tech. rep., version 1.2.1 revision 6, The Khronos Group (2019)

  • Szuppe, J.: Boost.compute: A parallel computing library for C++ based on opencl. In: Proceedings of the 4th International Workshop on OpenCL, IWOCL (2016)

  • Taylor, B., et al.: Adaptive optimization for opencl programs on embedded heterogeneous systems. In: LCTES (2017)

  • The Aurora Supercomputer.: https://aurora.alcf.anl.gov/ (2020)

  • The El Capitan Supercomputer.: https://www.cray.com/company/customers/lawrence-livermore-national-lab (2020)

  • The Frontier Supercomputer.: https://www.olcf.ornl.gov/frontier/ (2020)

  • The OpenACC API specification for parallel programming.: https://www.openacc.org/ (2020)

  • The OpenCL Conformance Tests.: https://github.com/KhronosGroup/OpenCL-CTS (2020)

  • The OpenMP API specification for parallel programming.: https://www.openmp.org/ (2020)

  • The Tianhe-2 Supercomputer.: https://top500.org/system/177999 (2020)

  • TI’s OpenCL Implementation.: https://git.ti.com/cgit/opencl (2020)

  • Tomov, S., et al.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput. 36, 232–240 (2010)

  • Top500 Supercomputers.: https://www.top500.org/ (2020)

  • Tournavitis, G., et al.: Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. ACM SIGPLAN Not. 44, 177–187 (2009)

  • Trevett, N.: Opencl, sycl and spir—the next steps. Tech. rep., OpenCL Working Group (2019)

  • Ueng, S., et al.: Cuda-lite: Reducing GPU programming complexity. In: J.N. Amaral (ed.) Languages and Compilers for Parallel Computing, 21st International Workshop, LCPC (2008)

  • Unat, D., et al.: Mint: realizing CUDA performance in 3d stencil methods with annotated C. In: Proceedings of the 25th International Conference on Supercomputing, (2011)

  • van Werkhoven, B., et al.: Performance models for CPU-GPU data transfers. In: 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2014)

  • “Vega” instruction set architecture.: Tech. rep., AMD Corporation (2017)

  • Verdoolaege, S., et al.: Polyhedral parallel code generation for CUDA. ACM TACO (2013)

  • Viñas, M., et al.: Exploiting heterogeneous parallelism with the heterogeneous programming library. J. Parallel Distrib. Comput. 73, 1627–1638 (2013)

  • Viñas, M., et al.: Heterogeneous distributed computing based on high-level abstractions. Concurr. Comput. Pract. Exp. 20, e4664 (2018)

  • Wang, Z., et al.: Automatic and portable mapping of data parallel programs to opencl for gpu-based heterogeneous systems. ACM TACO (2015)

  • Wang, Z., et al.: Exploitation of GPUs for the parallelisation of probably parallel legacy code. In: CC ’14 (2014a)

  • Wang, Z., et al.: Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM TACO (2014b)

  • Wang, Z., O’Boyle, M.: Machine learning in compiler optimisation. In: Proceedings of the IEEE (2018)

  • Wang, Z., O’Boyle, M.F.: Partitioning streaming parallelism for multi-cores: a machine learning based approach. In: PACT (2010)

  • Wang, Z., O’Boyle, M.F.: Using machine learning to partition streaming programs. ACM TACO (2013)

  • Wang, Z.: Machine learning based mapping of data and streaming parallelism to multi-cores. Ph.D. thesis, University of Edinburgh (2011)

  • Wen, Y., et al.: Smart multi-task scheduling for opencl programs on cpu/gpu heterogeneous platforms. In: HiPC (2014)

  • Williams, S., et al.: The potential of the cell processor for scientific computing. In: Proceedings of the Third Conference on Computing Frontiers (2006)

  • Wong, H., et al.: Demystifying GPU microarchitecture through microbenchmarking. In: IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS (2010)

  • Yan, Y., et al.: Supporting multiple accelerators in high-level programming models. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM@PPoPP (2015)

  • Yang, C., et al.: O2render: An opencl-to-renderscript translator for porting across various GPUs or CPUs. In: IEEE 10th Symposium on Embedded Systems for Real-time Multimedia, ESTIMedia (2012)

  • You, Y., et al.: Virtcl: a framework for opencl device abstraction and management. In: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP (2015)

  • Yuan, L., et al.: Using machine learning to optimize web interactions on heterogeneous mobile systems. IEEE Access 7, 139394–139408 (2019)

  • Zenker, E., et al.: Alpaka—an abstraction library for parallel kernel acceleration. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops (2016)

  • Zhang, P., et al.: Auto-tuning streamed applications on intel xeon phi. In: 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS (2018a)

  • Zhang, P., et al.: MOCL: an efficient opencl implementation for the matrix-2000 architecture. In: Proceedings of the 15th ACM International Conference on Computing Frontiers, CF (2018b)

  • Zhang, P., et al.: Optimizing streaming parallelism on heterogeneous many-core architectures. IEEE TPDS (2020)

  • Zhao, J., et al.: Predicting cross-core performance interference on multicore processors with regression analysis. IEEE TPDS (2016)

  • ZiiLABS OpenCL.: http://www.ziilabs.com/products/software/opencl.php (2020)

Acknowledgements

This work was partially funded by the National Key Research and Development Program of China under Grant No. 2018YFB0204301, the National Natural Science Foundation of China under Grant agreements 61972408, 61602501 and 61872294, and a UK Royal Society International Collaboration Grant.

Author information


Corresponding author

Correspondence to Chun Huang.

About this article

Cite this article

Fang, J., Huang, C., Tang, T. et al. Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Trans. HPC 2, 382–400 (2020). https://doi.org/10.1007/s42514-020-00039-4
