1 Introduction

Particle-based methods were developed to mitigate the problems of element distortion and mesh entanglement that occur in large-deformation simulations of structures using the finite element method (FEM) [8, 24, 35, 41, 48, 49]. However, these methods are typically more computationally expensive than FEM. For instance, in Smoothed Particle Hydrodynamics a neighbour search is performed at each time step [24, 35], and in the material point method (MPM) data are exchanged back and forth with the background grid, also at each time step [48, 49]. Serial particle-based codes are therefore easy to develop but slow, which usually limits them to small-scale applications and impedes their use in large-scale simulations.

For the past decade or so, CPU clock speeds have plateaued, and transistors have become so small that the physical limit on how many can fit on a chip is being reached. Overall performance gains are now achieved by increasing the number of threads per CPU. Moreover, many GPUs are now specially dedicated to heavy computing.

The algorithms of particle-based methods are mostly parallel in nature (not entirely so, owing to potential race conditions, especially in GPU implementations). Parallelising such codes can therefore bring substantial performance gains, whether through multithreading on one or more CPUs or by using GPUs. However, many researchers who write scientific codes have no formal computer science training; with time and practice, many of them have become adept at writing serial scientific codes.

However, writing parallel codes is not straightforward, although libraries such as OpenMP and MPI make it relatively easy. Both are used in plenty of open-source codes available online, so many examples and resources are available to learners [11, 20]. These techniques are not without limitations, though. OpenMP is a shared-memory library; the code therefore cannot be executed across different computers or nodes. MPI, in contrast, is a distributed-memory library: the code can be executed on different (connected) computers or nodes, but a substantial amount of code is needed for the communication between CPUs. This creates a substantial overhead in both development time and performance, and as the number of CPUs increases, the time spent on communication grows and the speedup decreases.
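To make the distinction concrete, the snippet below is a minimal sketch (not taken from any of the codes cited here) of shared-memory parallelism with OpenMP: a single pragma distributes the iterations of an independent loop over the threads of one machine, with no explicit communication code.

```cpp
// Minimal OpenMP sketch (illustrative only): updating particle positions in
// parallel on a shared-memory machine. Each iteration writes only to its own
// x[i], so the loop can be marked parallel without further synchronisation.
#include <vector>

void update_positions(std::vector<double> &x, const std::vector<double> &v,
                      double dt) {
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(x.size()); i++)
    x[i] += v[i] * dt;
}
```

With MPI, by contrast, the same update would additionally require the particle data to be partitioned across processes and any boundary information to be exchanged explicitly.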

Porting codes to GPUs is rewarding, as the speedups can reach two orders of magnitude [55]. Moreover, powerful GPUs are often readily available to researchers and have become the hardware of choice in the most powerful supercomputers. But GPU programming is a very different paradigm from CPU programming: owing to the architectural differences between CPUs and GPUs, GPU codes need to be written in so-called ‘kernel’ languages such as CUDA and OpenCL.

CUDA and OpenCL are low-level, hardware-specific languages. They are the state of the art in GPU computing but limit code portability. Furthermore, unlike CPU applications, efficient GPU codes require the user to have an intimate understanding of the specific hardware, for example to specify precisely how and what memory is used; for CPU applications, this is usually handled by the operating system. All this is usually out of reach for researchers who are not computer scientists. Recently, however, high-level libraries that do not require CUDA or OpenCL knowledge have emerged, making porting codes to GPUs much easier.

This paper is about how one can easily port particle-based codes from CPU to GPU. Our serial C++ MPM code, Karamelo (Sect. 2), is used as a case example to showcase how this is done step by step. In Sect. 3, we discuss the types of libraries available and why we have chosen Kokkos [17]. In Sect. 4, we present the porting process in depth, sharing code snippets to help you port your own code. Naturally, we also discuss the performance of our GPU code and compare it to the original CPU code (Sect. 5).

Note that even though this paper focuses on porting a C++ code to GPU, the same techniques apply to other languages, since Kokkos supports Python and Fortran. Also note that this paper is not about comparing CUDA and Kokkos (see Edwards et al. [17] for details), nor about pushing the state of the art in parallel computing. It is rather about giving scientists an understanding of how easily codes for particle-based methods can be ported to GPUs. For clarity and ease of reading, the definitions of most of the specialised terms used are given in Table 1.

2 The material point method

MPM is a family of algorithms for multiphase continuum mechanics simulation first conceptualised by Sulsky et al. [50]. In the MPM, solids are discretised into Lagrangian particles moving over a fixed Eulerian background grid on which the governing equations are solved. The MPM is therefore a hybrid Lagrangian–Eulerian method, allowing it to handle a great breadth of simulation problems. Compared with purely Lagrangian mesh-based methods, the MPM excels at large deformation problems, which can otherwise require prohibitively expensive and frequent remeshing, and it facilitates more robust collision treatment [4]. The MPM’s recent rise in popularity is therefore no surprise; it has found applications in many fields, from geoengineering [1, 19, 22] and mechanical engineering [18, 30, 34, 46] to materials science [12, 31, 45]. It has even been adopted by Walt Disney Animation Studios for challenging animation tasks such as snow simulations [47].

Table 1 Definition of specific parallel computing terminology

Recent years have seen increasing demand for faster and more efficient MPM codes. Many parallel MPM codes have been developed, targeting hardware ranging from multi-threaded single CPUs to multi-GPU computer clusters. Unfortunately, these codes were either easy to modify but slow, or fast but difficult to understand and adapt to our research needs. The need for a portable, efficient, and easy to modify code led to the development of our own code, Karamelo [11].

2.1 The MPM algorithm

This section briefly presents the explicit dynamics MPM formulation. For more details, we refer to the recent book of Nguyen et al. [39].

The MPM is built on two main concepts, namely Lagrangian material points carrying the physical information, and a background Eulerian grid used for the discretisation of continuous fields (e.g. the displacement field). These two concepts are treated in Sect. 2.1.1. A complete MPM algorithm is then provided in Sect. 2.1.2. Throughout this paper, subscripts \(_p\) and \(_I\) refer to quantities at the particle and node positions, respectively, and superscripts \(^t\) and \(^{t+\varDelta t}\) refer to quantities at timesteps \(t\) and \(t+\varDelta t\), respectively.

2.1.1 Lagrangian particles and Eulerian grid

In the MPM, a continuum body is discretised by a finite set of \(n_p\) Lagrangian material points (or particles) that are tracked throughout the deformation process. The terms particle and material point are used interchangeably throughout this paper. In the original MPM, the subregions represented by the particles are not explicitly defined; only their mass and volume are tracked. However, the shape of these subregions is tracked in advanced MPM formulations such as the generalised interpolation material point (GIMP) method [3] and the convected particle domain interpolation (CPDI); see the works of Sadeghirad et al. [43, 44] and Nguyen et al. [37, 38]. Each material point has an associated position \({\textbf{x}}^t_p\; (p=1,2,\ldots ,n_p )\), mass \(m_p\), density \(\rho _p\), velocity \({\textbf{v}}_p\), deformation gradient \({\textbf{F}}_p\), Cauchy stress tensor \(\varvec{\sigma }_p\), temperature \(T_p\), and any other internal state variables necessary for the constitutive model. Collectively, these material points provide a Lagrangian description of the continuum body. Since each material point contains a fixed amount of mass at all times, mass conservation is automatically satisfied.

The original MPM developed by Sulsky is an updated Lagrangian scheme, hereafter called the updated Lagrangian MPM (ULMPM). In this MPM, the space that the simulated body occupies and will occupy during deformation is discretised by a background grid (see Fig. 1) on which the balance of momentum equation is solved. In the total Lagrangian MPM (TLMPM) presented by de Vaucorbeil et al. [10], on the other hand, the background grid covers only the space occupied by the body in its reference configuration, as illustrated in Fig. 1. The use of a grid makes the method quite scalable by eliminating the need to compute particle-particle interactions directly: the particles interact with other particles of the same body, with other solid bodies, or with fluids through the background Eulerian grid. Most often, for efficiency reasons, a fixed regular Cartesian grid is used throughout the simulation.

Fig. 1

The MPM discretisation: the space is discretised by a background grid which can be either a Cartesian grid or an unstructured grid (not shown), while a solid is discretised using particles. a The updated Lagrangian MPM grid covers the entire deformation space, whereas b the total Lagrangian MPM only covers the initial configuration. Thus, if there are two solids, there will be two grids

2.1.2 The basic explicit MPM algorithm

A typical explicit ULMPM computational cycle consists of four steps (see Fig. 2). The first step is to map the information (e.g. mass, momentum, and internal and external forces) from the particles to the grid (P2G), since the grid is reset at every cycle. Next, the discrete momentum equations are solved on the grid nodes (Grid Updating). Then, the particles’ position, velocity, volume, density, deformation gradient, stresses and all relevant internal variables are updated (G2P). These last two steps are equivalent to those of the updated Lagrangian FEM [5]. Finally, the grid is reset to its original state. Owing to this grid resetting, mesh distortion never occurs, making the MPM a good method for large deformation problems.
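The cycle can be summarised by the following skeleton; the types and helper functions are hypothetical stubs used purely to show the ordering of the four steps, not Karamelo's actual API.

```cpp
// Schematic of one explicit ULMPM time step (stubs only, not Karamelo code).
struct Grid {};       // would hold nodal masses, momenta and forces
struct Particles {};  // would hold positions, velocities, stresses, ...

void particles_to_grid(Grid &, const Particles &) {}         // P2G: scatter mass, momentum, forces
void update_grid_momentum(Grid &, double /*dt*/) {}          // solve momentum balance on the nodes
void grid_to_particles(const Grid &, Particles &, double) {} // G2P: update particle state from nodes
void reset_grid(Grid &) {}                                    // reset the grid to its original state

void mpm_step(Grid &grid, Particles &particles, double dt) {
  particles_to_grid(grid, particles);
  update_grid_momentum(grid, dt);
  grid_to_particles(grid, particles, dt);
  reset_grid(grid);
}
```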

Fig. 2

Material point method: a computational cycle consists of four steps: a P2G (Particle to Grid), in which information is mapped from particles to nodes, b Grid Updating, in which momentum equations are solved for the nodes, c G2P (Grid to Particles), where the updated nodal values are mapped back to the particles to update their positions and velocities, and d Grid resetting, where the grid is reset. The operations in dashed boxes are not present in the ULFEM

The flowchart of the TLMPM is quite similar to that of the ULMPM, except that the first Piola–Kirchhoff stress tensor is used in the internal force vector and the spatial derivatives are taken with respect to the original configuration [10], not the current (deformed) one.

2.2 Karamelo

Karamelo is an open-source C++ MPM library developed by de Vaucorbeil et al. [11]. Karamelo’s key design philosophy is to be portable and easy to modify while still being competitively fast. To this end, the structure of Karamelo is based on that of the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) [52]; in particular, Karamelo adopts LAMMPS’s ‘styles’, a design pattern that yields easy to use, modular classes that can be swapped in and out, facilitating customisation of almost any part of Karamelo. Karamelo previously supported only multiple-CPU parallelisation, and it is the aim of this paper to show how we have ported this code to GPUs. The full code is available at: https://github.com/adevaucorbeil/karamelo/.

2.3 The particle to grid (P2G) problem

Whatever the type of grid used, the number of nodes that interact with a given particle p is geometrically limited by the support of the shape functions. For instance, for a 2D problem using linear shape functions and a simple Cartesian background grid, one particle interacts with a maximum of four nodes. This allows particles to keep track of their neighbouring nodes in fixed-size arrays, which in turn renders the G2P step trivially parallelisable using nested loops. The opposite is not true, however, since the number of particles interacting with a given node is not known a priori and evolves over time. In particular, nodes in regions of compression tend to have significantly more neighbouring particles than those in tension. This is the crux of parallelising the P2G step: either nodes must maintain particle adjacency information in jagged arrays, incurring a significant bookkeeping and memory management overhead, or the outer loop of P2G must iterate over particles, incurring a synchronisation overhead to avoid data races caused by threads writing to the same node at the same time.
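The two loop orders can be sketched as follows for a single scattered quantity (the nodal mass); the data layout is illustrative and does not reflect Karamelo's actual structures.

```cpp
// Illustrative contrast of the two P2G loop orders (not Karamelo code).
#include <array>
#include <cstddef>
#include <vector>

// (a) Outer loop over nodes: each iteration writes only to its own node, so it
//     parallelises without data races, but the per-node particle lists are
//     jagged and must be rebuilt whenever particles move between cells.
void p2g_over_nodes(std::vector<double> &node_mass,
                    const std::vector<std::vector<int>> &node_particles,
                    const std::vector<std::vector<double>> &node_weights,
                    const std::vector<double> &particle_mass) {
  for (std::size_t I = 0; I < node_mass.size(); I++)
    for (std::size_t k = 0; k < node_particles[I].size(); k++)
      node_mass[I] += node_weights[I][k] * particle_mass[node_particles[I][k]];
}

// (b) Outer loop over particles: the neighbour lists have a fixed size (four,
//     as for 2D linear shape functions), but two particles may contribute to
//     the same node, so the += must become an atomic update when parallelised.
void p2g_over_particles(std::vector<double> &node_mass,
                        const std::vector<std::array<int, 4>> &particle_nodes,
                        const std::vector<std::array<double, 4>> &particle_weights,
                        const std::vector<double> &particle_mass) {
  for (std::size_t p = 0; p < particle_mass.size(); p++)
    for (int k = 0; k < 4; k++)
      node_mass[particle_nodes[p][k]] += particle_weights[p][k] * particle_mass[p];
}
```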

3 GPU APIs

In the first published GPU implementation of the MPM, Dong et al. [15] used the CUDA application programming interface (API) to implement a single-GPU MPM code. Potential race conditions during the P2G step (see Sect. 2.3) were avoided by adding an extra step that calculates particle-node associations on the CPU and then looping over each node’s neighbours sequentially. This method is very expensive in large deformation scenarios, since every time a particle moves between cells the associations change, in turn necessitating expensive GPU-CPU-GPU deep-copies. To avoid this complexity and obtain a more balanced workload, Gao et al. [21] directly used the single-CPU parallel MPM algorithm without significant modification, using atomics to address the P2G race condition. Dong and Grabe [14] implemented a multiple-GPU MPM using domain decomposition and MPI for the data exchange between the different GPUs, while keeping the particle-node association list. Later, Dong et al. [16] used a single root complex to speed up the communication between GPUs. Finally, the most powerful massively parallel MPM code at the time of writing is Claymore, which uses an advanced algorithm for workload balancing; unlike naive subdomain decomposition, Claymore dynamically distributes particles approximately evenly between devices [56]. In the absence of large deformation, this is essentially equivalent to optimised domain decomposition in the initial reference frame, but Claymore also has strategies to handle drastic topological changes.

However, the problem with Claymore and most efficient open-source MPM codes is that they are written in such a way that it is difficult for people who are not professional programmers to understand, and therefore modify, them. This is unlike Karamelo, which has been developed as a flexible and easy to modify code.

The platform dependence of current MPM codes raises a number of key issues, the most obvious of which is portability. Although arguably superior in terms of speed, CUDA-MPI implementations are restricted to running on CUDA-compatible GPUs. Additionally, portability does not just concern hardware vendors, it also concerns time: features like warp-level CUDA intrinsics are relatively modern, so using them breaks backwards compatibility. By entangling parallelisation hardware logic with algorithmic design, forward compatibility and code maintainability are also hindered: if hardware improves and new parallelisation features become available, updating the code to use them can be very tedious. Lastly, although a minor point, involving low-level code also severely impacts code readability and accessibility.

These observations motivate the development of a hardware-agnostic, massively parallel MPM library. Parallelisation logic should be abstracted away so that backend changes are completely independent of algorithmic code changes. As performance is still a priority, the code should be able to self-optimise depending on the platform on which it is run. And lastly, the code should be written in such a way that modifications can easily be made, and MPM variations and extensions can easily be added. Such a library could be invaluable to future users and researchers, who would be able to write and run efficient MPM programmes on any hardware.

3.1 General Comparison

GPU programming is still largely non-standardised, and porting between different APIs is often quite involved and time consuming. This is especially true because GPU APIs are often intertwined with types, structures and program flow, possibly even requiring different kernel languages. It is therefore important to make a well-informed decision when selecting a GPU API. Fortunately, a fair number of papers compare GPU APIs, mostly focusing on performance and programmer productivity. Hoshino et al. [26] compared the performance of CUDA and OpenACC, finding that OpenACC is typically twice as slow as CUDA, although in certain scenarios, with careful tuning, up to 98% efficiency can be achieved. Memeti et al. [36] compared the performance and programmer productivity of OpenCL, OpenACC and CUDA, finding that programming with OpenCL takes significantly more effort than CUDA, which in turn takes significantly more effort than OpenACC, but that OpenCL and CUDA yield significantly faster codes than OpenACC. These results all agree with Li et al. [32], who found OpenACC to be 37% faster to program with but in some cases up to nine times slower to run than CUDA.

In our particular application, the criteria for API selection are slightly unusual, with the aforementioned emphases on portability, extensibility and abstraction, albeit preferably with minimal sacrifice in performance. Most notably, although generally yielding the fastest programmes, CUDA code can only run on NVIDIA hardware, greatly limiting portability, especially considering that AMD is approaching NVIDIA in GPU market share [42]. OpenCL is generally portable but suffers from poor ease of use, requiring significant management of non-abstracted low-level logic. OpenACC is portable, easy to work with, and has the added benefit that it does not require a separate kernel language; although generally known to be slow, the performance of OpenACC varies with the application, so we shortlisted it for further qualitative testing in an MPM context.

Fig. 3

Snippets of codes showing how the P2G step would be coded using the three different GPU APIs compared here. a With OpenACC, one uses native arrays and decorates parallel loops using pragma directives. b With Julia, one uses specific functions and macros and launches kernels using tags. c With Kokkos, one uses managed data structures and the exposed parallel_for function

Another API we shortlisted was JuliaGPU. Julia is a dynamically typed language designed for high-performance technical computing; it has similar syntax to Python but targets runtime performance on par with C and C++ [2]. To compare the shortlisted APIs in an MPM context, we re-wrote Hu’s 88-line MPM code in OpenACC, Julia and Kokkos; Table 2 reports the resulting line and character counts as a naive measure of code complexity. Note that one reason why all ports are significantly longer than 88 lines is that Hu’s code relies on taichi already providing a singular value decomposition (SVD) function; this had to be reimplemented in both Julia and C++ and is included in our counts. It may be counter-intuitive that the Julia code is significantly longer than the OpenACC and Kokkos ones, even with Julia being dynamically typed. This is in part because GPU programming in Julia requires boilerplate code to configure and launch GPU kernels; in a larger program, this would be less significant. However, it is also because Julia requires parallelised functions to explicitly take all parameters as arguments, quickly leading to code bulk as kernels become increasingly complex; this is in contrast to OpenACC and Kokkos, where variables can simply be captured from outside the loops.

Table 2 A naive measure of code complexity: line and character counts of Hu’s MPM code re-written for OpenACC, Julia and Kokkos

The three APIs differ significantly in various aspects contributing to ease of use. Syntactically, OpenACC is the simplest, with most low-level decisions offloaded to the compiler. However, this was a significant source of frustration, as it was often very difficult to determine exactly what the outcomes of those decisions were. Sometimes the compiler decides not to run decorated loops on the GPU at all, and the root causes are often buried in large quantities of meaningless compiler output. Many errors are not picked up at compile time but only at runtime, and even then often with obfuscated or incorrect error messages. Furthermore, different C++ compilers require different (and in many cases numerous) compiler flags to have OpenACC offload to the GPU properly.

Similarly, Julia was very quick to write but proved very difficult to debug. This was primarily due to four factors. Firstly, being dynamically typed, type information can be difficult to ascertain. Secondly, Julia uses lazy compilation at runtime, blurring the distinction between compilation errors and execution errors. Thirdly, the Julia compiler error messages are often meaningless for GPU code. And lastly, the available debugging tools are simply less advanced and far fewer than those for C++.

Conversely, while Kokkos has a steeper initial learning curve, the library is ultimately very intuitive, and we found the experience of developing and debugging Kokkos code to be significantly smoother than with both OpenACC and Julia. Kokkos compiler warnings tend to be very informative, and with C++ being strongly typed, many sources of errors can be picked up at compile time through Kokkos’ judicious usage of static assertions and C++’s Substitution Failure Is Not An Error (SFINAE) language feature [53]. Kokkos works well with GPU debugging tools such as CUDA-MEMCHECK and NVIDIA Nsight Systems, and by switching the backend to serial CPU, standard C++ debuggers work too. Finally, although low-level decisions are abstracted away by default, almost everything can ultimately be explicitly specified where necessary, which is invaluable for debugging.

To compare the APIs’ performance, we ran and timed 5000 iterations of Hu’s MPM code with varying numbers of particles; the resultant run times are shown in Fig. 4a. The Kokkos and Julia implementations run almost equally fast, with the OpenACC code in most cases around twice as slow. Notably, we also found OpenACC to have a significant initial overhead, being slower than even the serial code for smaller numbers of particles.

Considering all the above factors, we selected Kokkos as the most suitable GPU API for the conversion of Karamelo. Before converting the full Karamelo codebase, we experimented with optimisation of the minimal Kokkos MPM code to determine which properties are likely to have big impacts on performance. Figure 4b shows the results of these optimisations applied sequentially. Compared to the unoptimised code, minimising writes to shared memory was found to yield a speedup of 25%. Specifically, inside for loops with multiple calculation steps, instead of writing multiple times to the Views, which reside in shared memory, it is faster to create a local variable, do all calculations with that local variable, and then perform only one write to shared memory at the end. Conversely, zero speedup was observed in minimising reads from shared memory. The original code had one View each of particle and node structs; we found that by splitting these into Views of each of the components (such as position, velocity and mass), a further 37% speedup could be obtained. Further splitting Views of two-dimensional vectors (such as the positions and velocities) into one View of doubles for each of the \(x\) and \(y\) coordinates gave a 16% speedup, and splitting Views of matrices into four Views of doubles yielded a 27% speedup. Part of this speedup can be attributed to atomics being significantly faster with primitives than with class types; we believe this is a consequence of hardware support for lock-free atomic updates of primitives, in contrast to atomic updates of class types requiring locking. However, splitting Views does appear to increase overhead slightly, making performance slightly worse for smaller problems. Lastly, using floats instead of doubles gave a speedup of 24%.
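The first of these optimisations, minimising writes, follows the pattern sketched below (the kernel and names are made up purely for illustration): all intermediate arithmetic happens on a register-resident local, and the View element is read once and written once.

```cpp
// Sketch of the 'one write per iteration' pattern described above.
#include <Kokkos_Core.hpp>

void scale_and_shift(Kokkos::View<double*> x, double a, double b, int substeps) {
  Kokkos::parallel_for("scale_and_shift", x.extent(0), KOKKOS_LAMBDA(int i) {
    double xi = x(i);               // single read from the View
    for (int s = 0; s < substeps; s++)
      xi = a * xi + b;              // all intermediate work stays in a register
    x(i) = xi;                      // single write back to the View
  });
}

int main(int argc, char *argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  Kokkos::View<double*> x("x", 1000);  // one View per component, not a View of structs
  scale_and_shift(x, 0.5, 1.0, 10);
  return 0;
}
```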

Fig. 4

a Comparison of GPU API runtimes for 5000 iterations and b the effect of optimisations on Kokkos runtime for 750 iterations of Hu’s MPM code

To better understand the distribution of execution cost within the algorithm, we also profiled each of the MPM substeps separately, along with the GPU-to-CPU deep-copies necessitated by file system dumping; the distributions of the runtimes are presented in Fig. 5 (note the logarithmic \(x\)-axis). Grid update and grid reset are both \(\mathcal {O}\left( n_{\textrm{nodes}}\right) \) and, as expected, take roughly the same time. To derive the complexity of the other two substeps, it is necessary to consider what it means to be a given particle’s neighbouring node. When properties are projected from particles to nodes and vice versa, at some stage they are always scaled by the (symmetric) shape functions; in each dimension, let these shape functions have an (integer) support of \(\varDelta x\). Outside the support, the shape functions go to zero; thus, a particle-node pair that is sufficiently far apart does not contribute and is therefore not considered a neighbouring pair. With a uniform background grid, the number of neighbouring nodes of any given particle is thus

$$\begin{aligned} n_\text {neighbors}\left( I\right) \le \left( \varDelta x\right) ^d \quad \forall I \end{aligned}$$
(3.1)

where \(d\) is the spatial dimension, and the complexity of the G2P and P2G substeps is therefore \(\mathcal {O}\left( n_\text {particles}\left( \varDelta x\right) ^d\right) \) for both. For example, with linear shape functions on a 2D Cartesian grid (\(\varDelta x = 2\), \(d = 2\)), this bound recovers the maximum of four neighbouring nodes per particle noted in Sect. 2.3. Note that \(n_\text {nodes}\) and \(n_\text {particles}\) are technically independent, but in practice may be considered proportional; in our case we used a ratio of 0.59 (48 particles on a \(9\times 9\) grid; 192 particles on an \(18\times 18\) grid; etc.). While the G2P and P2G steps have the same computational complexity, P2G is notably more expensive due to its need for atomics, which is reflected in its slower runtime (see Fig. 5a). Lastly, deep-copies are extremely expensive, generally ranging between three and four orders of magnitude slower than all four substeps (see Fig. 5b). It is therefore desirable to keep data on the GPU for as long as possible, keeping deep-copies (and in turn dumping) to a minimum.

Fig. 5

Kernel density estimates for the distributions of runtimes using 200 samples, comparing a the MPM steps. Grid update and reset take nearly the same time. P2G is more expensive than G2P due to atomics. b Deep-copies are three to four orders of magnitude slower than any MPM step

4 Method

The Kokkos port of Karamelo was performed incrementally in a number of steps. As many of these changes were significant and structural, it was important to subdivide this process as much as possible, so that intermediate stages of porting could be compiled and regression tested.

In this section, we present the major Kokkos related aspects of porting, along with specific design changes and library re-implementations for GPU compatibility.

4.1 Parallel P2G

Algorithmic changes were necessary to make Karamelo’s P2G code GPU compatible. Karamelo originally used two neighbour lists: one of every particle’s neighbouring nodes, and one of every node’s neighbouring particles. Thus, in the P2G step, the code would loop over all grid nodes in the outer loop and, for each grid node, loop over all of that node’s neighbouring particles in the inner loop, adding their contributions to the node under consideration. It is theoretically possible to keep this algorithm and simply parallelise the outer loop; this yields Algorithm 1 and is essentially equivalent to the approach of Dong and Grabe [14] presented in Sect. 3. However, the particles are not uniformly distributed, so nodes can in theory have limitless numbers of neighbouring particles. This presents two problems. Firstly, the node neighbour adjacency lists do not form a rectangular two-dimensional array; Karamelo used to use a vector of vectors, and while a View of Views is theoretically achievable, the process is very fragile, requires explicit specification of the memory space (for example the CUDA space), violating hardware agnosticism, and has the potential to be very slow due to requiring many small allocations on the GPU [29]. Secondly, the number of particle neighbours that each node has may change at every timestep, and the population of these variable-length neighbour lists is non-trivial to parallelise (recall that Dong and Grabe [14] resorted to performing these calculations on the CPU and performing deep-copies to get them back onto the GPU).

Instead, we changed Karamelo to use Algorithm 2, which uses the particles’ neighbour adjacency lists. This is the method used by most of the modern parallel codes surveyed in Sect. 3. Note that because the outer loop is now over particles, atomics are necessary to prevent multiple particles writing to the same node at the same time; on the other hand, the inner loop can additionally be parallelised for free (a minimal Kokkos sketch of this particle-wise scatter is given after the algorithm listings below).

Algorithm 1

Node neighbours P2G

Algorithm 2

Particle neighbours P2G
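The following is a minimal Kokkos sketch of the particle-wise scatter in Algorithm 2; the flattened neighbour and weight Views are illustrative rather than Karamelo's actual data structures.

```cpp
// Sketch of Algorithm 2 for the nodal mass only (illustrative layout).
#include <Kokkos_Core.hpp>

void p2g_mass(Kokkos::View<double*> node_mass,
              Kokkos::View<const double*> particle_mass,
              Kokkos::View<const int**> neighbour_nodes,  // [particle][k]
              Kokkos::View<const double**> weights) {     // [particle][k]
  const int nodes_per_particle = neighbour_nodes.extent_int(1);
  Kokkos::parallel_for("p2g_mass", particle_mass.extent(0), KOKKOS_LAMBDA(int p) {
    for (int k = 0; k < nodes_per_particle; k++)
      // Several particles may contribute to the same node concurrently,
      // so the scatter must be an atomic update.
      Kokkos::atomic_add(&node_mass(neighbour_nodes(p, k)),
                         weights(p, k) * particle_mass(p));
  });
}
```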

4.2 Kokkos views and loops

To allow Karamelo to run on the GPU, the first step was to convert all GPU-accessible data structures to Kokkos Views. These were primarily members of the classes that store the node and particle information. At this stage, the loops had not yet been offloaded to the GPU. Therefore, we initially specified all Views to use the CUDA unified virtual memory (UVM) space, which can be accessed by both the CPU and the GPU; this is easily achieved using an additional template argument, namely Kokkos::CudaUVMSpace. However, this was a temporary change, since the CUDA UVM space is limited in capacity, slower than regular device spaces, and obviously not portable.
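The kind of declaration meant here is sketched below; Kokkos::CudaUVMSpace is only available when the CUDA backend is enabled, and once the port is complete the plain, portable declaration is preferred.

```cpp
// Sketch: choosing the memory space via an extra View template argument.
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);

  // Accessible from both host and device while the port is in progress:
  Kokkos::View<double*, Kokkos::CudaUVMSpace> v_uvm("v_uvm", 100);

  // Portable declaration: Kokkos places the data in the default memory
  // space of whichever backend the code was compiled for.
  Kokkos::View<double*> v_device("v_device", 100);
  return 0;
}
```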

Broadly, the next step was to convert parallel loops to Kokkos’s parallel syntax. This was not always trivial, since very often the code inside the loops was not immediately GPU compatible. As shown in Fig. 3, Kokkos exposes parallel_for functions that run a given lambda expression over a range of indices. If no explicit specification is given, Kokkos automatically determines the execution space and optimal iteration order (stride pattern). Converting existing loops required a number of small changes. Firstly, pointer indirection is problematic on the GPU, since the pointed-to memory generally resides on the CPU; it is therefore necessary to cache and pass the resultant values to the GPU instead of pointers. Additionally, due to a deficiency in earlier C++ language standards, most compilers implicitly access member variables in lambdas through the ‘this’ pointer, so members must also be explicitly cached (this has been fixed in C++17, but at the time of writing most GPU compilers are limited to C++14) [28]. To illustrate this, the conversion of a basic serial loop is given in Fig. 6. Note that because Kokkos Views use reference counting, the View copies are not deep-copies but rather just new references.

Fig. 6

Example of a conversion of a a serial loop to b a parallelised loop that runs on GPU using Kokkos
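A sketch of this member-caching pattern, in the spirit of Fig. 6 but with illustrative names, is given below: the member Views and scalar are copied into locals so that the lambda captures them by value instead of dereferencing the host-resident 'this' pointer on the device.

```cpp
// Sketch of caching class members before a Kokkos lambda (illustrative class).
#include <Kokkos_Core.hpp>

class ParticleSet {
public:
  Kokkos::View<double*> x, v;
  double dt = 1.0e-3;

  explicit ParticleSet(int n) : x("x", n), v("v", n) {}

  void advance_positions() {
    auto x_ = x;        // a new reference to the same View, not a deep-copy
    auto v_ = v;
    double dt_ = dt;    // plain value copy
    Kokkos::parallel_for("advance_positions", x.extent(0), KOKKOS_LAMBDA(int i) {
      x_(i) += v_(i) * dt_;   // no 'this' is dereferenced inside the kernel
    });
  }
};
```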

Some loops in the Karamelo code also contain reductions, such as the summation of the total kinetic energy or the calculation of the minimum stable time step size. To this end, Kokkos exposes parallel_reduce functions that are very similar in syntax to parallel_for. The only loops that could not be offloaded to the GPU were the loops for dumping, as writing to the file system must be done serially from the CPU; this necessitated deep-copying from GPU to CPU before each dump. Since memory allocations and deep-copies are extremely expensive, it is desirable to minimise them where possible, and we optimised as follows. Allocating host mirrors was done once, in the dumping class’s constructor, tying the lifetime of the mirrors to the lifetime of the program, and only for the properties that were to be dumped. Deep-copying was then performed before each dump, also only for the dumped properties. The details of the layout used are shown in Listing 7. To increase efficiency, once deep-copying is complete it is actually possible to dump asynchronously on the CPU using the host copy while calculation continues on the GPU using the device copy (Fig. 7).
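The reductions mentioned above follow the pattern sketched here (illustrative code, not Karamelo's): a sum for the kinetic energy (one velocity component for brevity) and a minimum over candidate time step sizes.

```cpp
// Sketch of Kokkos::parallel_reduce for a sum and a minimum.
#include <Kokkos_Core.hpp>

double total_kinetic_energy(Kokkos::View<const double*> m,
                            Kokkos::View<const double*> v) {
  double ke = 0.0;
  Kokkos::parallel_reduce("kinetic_energy", m.extent(0),
    KOKKOS_LAMBDA(int p, double &local) {
      local += 0.5 * m(p) * v(p) * v(p);   // default reduction: sum
    }, ke);
  return ke;
}

double stable_time_step(Kokkos::View<const double*> dt_candidate) {
  double dt_min = 0.0;
  Kokkos::parallel_reduce("stable_dt", dt_candidate.extent(0),
    KOKKOS_LAMBDA(int p, double &local) {
      if (dt_candidate(p) < local) local = dt_candidate(p);
    }, Kokkos::Min<double>(dt_min));       // explicit Min reducer
  return dt_min;
}
```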

Finally, once all code had either been offloaded to the GPU or had the necessary data deep-copied to the CPU, the Views were incrementally taken off the CUDA UVM space, with continuous regression testing.

Fig. 7

The layout adopted for efficient dumping classes: the memory allocation is performed only once, at the time of class construction, and deep-copies are performed only at the time of dumping (if needed)
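Our reading of that layout is sketched below for a single View; the class and member names are illustrative, not Karamelo's. The host mirror is allocated once, in the constructor, and only a deep-copy is paid per dump.

```cpp
// Sketch of a dumping helper following the layout described above.
#include <Kokkos_Core.hpp>

class Dump {
  Kokkos::View<double*> x_device;            // lives on the GPU
  Kokkos::View<double*>::HostMirror x_host;  // allocated once, at construction

public:
  explicit Dump(Kokkos::View<double*> x)
      : x_device(x), x_host(Kokkos::create_mirror_view(x)) {}

  void write() {
    Kokkos::deep_copy(x_host, x_device);     // the only per-dump deep-copy
    // ... serial (or asynchronous) CPU write of x_host to the file system ...
  }
};
```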

4.3 Linear algebra

The CPU version of Karamelo made use of the Eigen linear algebra library throughout, and Eigen is not natively GPU compatible [25]. Instead of Eigen, one can use the Kokkos Kernels library, which provides a full ecosystem for linear algebra. However, this library carries a lot of functionality that we do not use. Therefore, to limit dependencies and Karamelo’s footprint, we decided to implement our own Kokkos-compatible linear algebra library from scratch.

The bulk of this development involved creating a templated matrix class (an excerpt of which is presented in Listing 8). To maximise both usability and utility, this class makes use of a range of template metaprogramming techniques, including parameter packs, universal references, static assertions, idiomatic SFINAE, type inspection in unevaluated contexts, and type aliasing. The class provides a range of constructors and assignment operators, accessors, overloaded arithmetic operators, various products, norms, element-wise operations, transposes, and debugging output (Fig. 8).

To meet current Karamelo demands, we also implemented QR decomposition using Givens rotations, eigendecomposition using the QR algorithm with Wilkinson shifts, and SVD and matrix inversion using eigendecomposition [33, 57].
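A minimal sketch of such a fixed-size matrix type usable inside GPU kernels is shown below; the real Karamelo class (Listing 8) is considerably richer. KOKKOS_INLINE_FUNCTION marks the members as callable from both host and device.

```cpp
// Minimal fixed-size matrix sketch for device code (illustrative only).
#include <Kokkos_Core.hpp>

template <typename T, int M, int N>
class Matrix {
  T data[M * N];

public:
  KOKKOS_INLINE_FUNCTION Matrix() {
    for (int i = 0; i < M * N; i++) data[i] = T(0);
  }
  KOKKOS_INLINE_FUNCTION T &operator()(int i, int j) { return data[i * N + j]; }
  KOKKOS_INLINE_FUNCTION const T &operator()(int i, int j) const {
    return data[i * N + j];
  }

  // Matrix product; the operand sizes are checked at compile time.
  template <int P>
  KOKKOS_INLINE_FUNCTION Matrix<T, M, P> operator*(const Matrix<T, N, P> &rhs) const {
    Matrix<T, M, P> out;
    for (int i = 0; i < M; i++)
      for (int j = 0; j < P; j++)
        for (int k = 0; k < N; k++)
          out(i, j) += (*this)(i, k) * rhs(k, j);
    return out;
  }
};
```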

Fig. 8

Excerpt of the Matrix class which is at the core of the GPU implementation. This class provides a range of constructors and assignment operators, accessors, overloaded arithmetic operators, various products, norms, element-wise operations, transposes, and debugging output. It makes use of a range of template metaprogramming techniques to maximise both usability and utility

4.4 Lazy references

As was found in Sect. 3.2, writes to shared memory are expensive, and minimising writes in our Kokkos MPM test code decreased runtimes by 25%. It is not unusual to write to a variable multiple times within one iteration, for example when there are multiple calculation steps or nested loops. As shown in Fig. 9b, it is always possible to bring the number of writes down to one or zero per iteration by creating a local copy, reading from and writing to that local copy, and only writing the final result back to shared memory. While this is technically sound, it is not particularly safe, as accidentally removing the final write can lead to bugs that are undetectable at compile time and potentially difficult to find. Instead, it is better practice to bind the lifetime of this ad hoc cache to the lifetime of an object, following the resource acquisition is initialisation (RAII) idiom [51]. To this end we introduce a lazy reference class, with equivalent usage shown in Fig. 9c. The class itself is very simple, storing both a local copy and a reference to the wrapped value, with the check-and-write logic moved into the destructor; almost all operators are also trivially overloaded so that lazy references may be used like normal references.
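The idea can be sketched as follows (illustrative; the actual Karamelo class overloads many more operators): the wrapped value is copied once, all work happens on the local copy, and the single write back to shared memory happens automatically in the destructor.

```cpp
// Sketch of a lazy reference wrapper following the RAII idiom.
#include <Kokkos_Core.hpp>

template <typename T>
class LazyReference {
  T &target;  // reference into shared (device) memory
  T local;    // register-resident working copy

public:
  KOKKOS_INLINE_FUNCTION explicit LazyReference(T &t) : target(t), local(t) {}
  KOKKOS_INLINE_FUNCTION ~LazyReference() {
    if (local != target) target = local;  // the check-and-write, done exactly once
  }
  KOKKOS_INLINE_FUNCTION operator T &() { return local; }
  KOKKOS_INLINE_FUNCTION LazyReference &operator=(const T &v) { local = v; return *this; }
  KOKKOS_INLINE_FUNCTION LazyReference &operator+=(const T &v) { local += v; return *this; }
};
```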

Fig. 9

Lazy references can be used to safely minimise writes to shared memory. a Multiple writes slow down the GPU kernel, as writing to shared memory is expensive. b Speed can be gained by using a local variable, but one has to remember to write the result back to shared memory. c Lazy references can be used to automatise this process

4.5 Inheritance

One major obstacle in translating CPU code into GPU code is inheritance and polymorphism, specifically C++’s use of late binding for dynamic dispatch. The problem is that the virtual pointers (VPTRs) in virtual tables (VTables) point to functions that reside in the memory space in which the class instance was created, which is usually the host space. It is technically possible to create GPU-compatible class instances using placement new in a device context, but this process is fragile and brings further complications [54]. The simple solution is to not call virtual functions inside loops at all, but rather to move said loops into the virtual functions themselves, as shown in Fig. 10b. As in Sect. 4.4, this is technically sound but semantically poor, requiring consistent duplication of the control structure across every derived class; if, for instance, a programmer changes the control structure in one derived class but forgets to change it in another, the resultant bugs may be hard to find. Our solution is to use the curiously recurring template pattern (CRTP) [2]. As shown in Fig. 10c, when using the CRTP an intermediate class is introduced that is templated on the derived class; this intermediate class overrides the virtual function with the shared control structure but injects the derived class’s kernel statically using early binding. We believe we are the first to apply the CRTP to this particular problem.
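A sketch of this layout, with illustrative names, is given below. The virtual function is still called once from the host, but inside it the particle loop invokes the derived class's non-virtual kernel, so no VTable lookup happens on the device; the derived object is copied by value into the lambda so that no host pointer is dereferenced in the kernel.

```cpp
// Sketch of CRTP-based 'polymorphic' kernels (illustrative, not Karamelo code).
#include <Kokkos_Core.hpp>

class Fix {                                      // common host-side interface
public:
  virtual void apply(Kokkos::View<double*> x) = 0;
  virtual ~Fix() = default;
};

template <class Derived>
class FixCRTP : public Fix {                     // shared control structure
public:
  void apply(Kokkos::View<double*> x) override {
    Derived kernel = *static_cast<Derived *>(this);  // copy captured by value
    Kokkos::parallel_for("fix_apply", x.extent(0), KOKKOS_LAMBDA(int p) {
      kernel(x, p);                              // early-bound, non-virtual call
    });
  }
};

class FixShift : public FixCRTP<FixShift> {      // one concrete 'style'
public:
  double offset = 1.0;
  KOKKOS_INLINE_FUNCTION void operator()(Kokkos::View<double*> x, int p) const {
    x(p) += offset;
  }
};
```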

Fig. 10

Overcoming a major obstacle in translating CPU codes to GPU. a Virtual functions cannot be used in GPU kernels. b One solution is to move the loops into the virtual functions themselves, but this is semantically poor. c Our solution is to use the CRTP to emulate polymorphic parallel kernels

Fig. 11

Example of expressions that Karamelo allows users to specify

Fig. 12

The CRTP is used for expression operations, as they require looping over all particles or nodes. a Typical pattern of the structure for expression application. b An example operation

4.6 User input expression evaluation

Karamelo allows users to specify expressions in its input files, for example as arguments to a fix. At runtime, these expressions may be evaluated over particles or nodes. An example is given in Fig. 11, highlighting support for particle/node properties (such as the \(x\) and \(y\) coordinates) as operands, various operators, single and multivariate functions, and even composition of expressions. Previously, Karamelo stored expressions as strings and simultaneously parsed and evaluated them as necessary. However, reparsing expressions is very inefficient; furthermore, parsing on the GPU is likely to be both complex and slow due to requiring variable-length character arrays. Instead, it is best to parse expressions on the CPU into some kind of GPU-compatible abstract syntax tree (AST) and evaluate the AST in parallel on the GPU. We found the most suitable AST representation to be reverse Polish notation (RPN), using Dijkstra’s shunting-yard algorithm for parsing [13]. To facilitate parallel evaluation, our Expression class holds a two-dimensional buffer acting as registers; operations then write to and combine registers as per the given RPN expression. To illustrate this, consider the expression from Fig. 11: applying the shunting-yard algorithm yields its RPN form, which is then evaluated register by register.
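How a single RPN operation might then be applied in parallel is sketched below (illustrative only; Karamelo's Expression class is richer): each operation reads its source registers and writes its destination register for every particle.

```cpp
// Sketch of one RPN operation updating the register buffer in parallel.
#include <Kokkos_Core.hpp>

void apply_multiply(Kokkos::View<double**> registers,  // [register][particle]
                    int dst, int lhs, int rhs) {
  Kokkos::parallel_for("expression_multiply", registers.extent(1),
    KOKKOS_LAMBDA(int p) {
      registers(dst, p) = registers(lhs, p) * registers(rhs, p);
    });
}
```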

The application of all expression operations has essentially the same structure of looping over all particles or nodes and updating certain registers; this makes expressions a perfect candidate for the CRTP described in Sect. 4.5. The CRTP control structure is shown in Fig. 12a, and the implementation of one example operation is provided in Fig. 12b. We also expose expression functions as styles so that users may easily add their own.

5 Results

To quantify the asymptotic performance improvements from parallelisation, we ran 1000 steps of two simulations on one CPU (Intel Xeon Platinum 8274), multiple CPUs, and one GPU (Tesla V100-SXM2-32GB) while varying the number of particles. Note that Kokkos also supports AMD GPUs (since version 4.2); however, we did not have any available to us to run comparison tests. The two simulations considered are:

  • Two bouncing balls (Fig. 13a), which represent the most basic 2D simulation problem and use the MPM’s built-in contact algorithm.

  • A twisted column (Fig. 13b) problem inspired by that introduced by Gil et al. [23]. The test consists of a column which is fixed at the bottom while an angular velocity \(\omega _0=2\pi \) rad/ms is applied to its top surface. The column is a \(100\,\textrm{mm} \times 10\,\textrm{mm} \times 10\,\textrm{mm}\) square cuboid. The angular velocity \(\omega _0\) is applied by constraining the velocity of every node on the top surface of the column. Let n be the number of rotations we want to simulate and T the final time; we then define \(\omega =\omega _0 n/T\). The applied velocities as a function of time are therefore given by:

    $$\begin{aligned} v_x(t) = -\omega y(t), \quad v_y(t) = +\omega x(t) \end{aligned}$$
    (5.1)

    where x(t) and y(t) are the positions of the node along the x and y axes. The column is made of a thermo-elasto-plastic material. The flow stress of this material is assumed to obey the Johnson–Cook plasticity model, whose material parameters are omitted for brevity. Note that this is a three-dimensional thermo-mechanical simulation in which the diffusion of the heat generated by plastic work is modelled.

Fig. 13

The two test cases on which the GPU implementation has been tested: a two bouncing balls and b a twisted column

Note that for these initial simulations, we disabled dumping to isolate the cost of the MPM itself.

For the multiple-CPU bouncing balls simulation, we used nine CPUs, subdividing each dimension of the domain into three. The simulation results obtained with multiple CPUs and with one GPU are identical to those obtained with a single CPU and are therefore not reported herein. For this case, running on nine CPUs was 3.34 times faster than on one CPU, and one GPU was 18.18 times faster than one CPU, or 8.43 times faster than nine CPUs (Fig. 14). The modest speedup of just over three times with nine times the number of CPUs demonstrates the poor workload balancing of naive domain decomposition, since a number of subdomains (especially the top left and bottom right) essentially remained idle throughout; this illustrates one scenario where multi-process parallelisation offers very little gain over single-process parallelisation. Additionally, for a small number of particles, one CPU was actually the fastest, owing to its lower overheads.

Fig. 14

Results of the bouncing balls simulations: a runtime and b speed up of 1000 iterations versus the number of particles. A maximum speed up of 18 was achieved with a GPU compared to a single CPU

For the multiple-CPU twisted column simulation, we used eight CPUs, subdividing each dimension into two. On complex problems, the speedup from GPU parallelisation is so significant that the gradient of the GPU curve is not visible on linear axes, so note that both axes in Fig. 15a are logarithmic. Asymptotically, eight CPUs were 7.03 times faster than one CPU, and one GPU was 86.57 times faster than one CPU and 12.31 times faster than eight CPUs. Interestingly, the GPU curve only exhibits linear scaling beyond a fairly large number of particles (around 40,000 in this case). Before that point, the GPU is not yet saturated; since the GPU can launch more threads as necessary, increasing the number of particles has almost no effect on the runtime.

Finally, we compared the costs of dumping on CPU and GPU by running 1000 iterations of the twisted column simulation with a fixed number of 40,960 particles while varying the frequency of dumping. The results are shown in Fig. 16. Although the GPU is by far the fastest without dumps, with more than five dumps per 100 steps it is actually slower than using eight CPUs. The poor scalability of GPU dumping has two causes: firstly, the deep-copies are necessary and costly, and secondly, the GPU code also dumps using only one CPU. Beyond a certain point, it is faster to just do all calculations on the CPU rather than calculating on the GPU and deep-copying. That said, dumping at such a high frequency is rather contrived; in most cases occasional dumping suffices, and especially with asynchronous dumping, beyond a certain point the cost goes almost to zero.

Fig. 15

Results of the twisted column simulations: a runtime and b speed up of 1000 iterations versus the number of particles. A maximum speed up of 84 was achieved with a GPU compared to a single CPU

Fig. 16

Cost of dumping: runtime of 1000 iterations of the bar-in-torsion simulation versus the frequency of dumping. Dumping scales poorly on the GPU because of the necessary deep-copies and the use of a single CPU to write to disc

6 Conclusion and future work

In this paper, we have shown an example of how one can accelerate a C++ scientific code using GPUs. It is important to emphasise that, for someone who knows C++, this process is fairly easy. There is no new language to learn, just some fundamental concepts linked to where data reside in memory and an understanding of how to code for the selected GPU API.

To accelerate our MPM code, we found Kokkos to be the most suitable API. Kokkos is developed by Sandia National Laboratories in the USA and is used in large and popular projects such as LAMMPS (a molecular dynamics package). We are therefore confident that it will be maintained for many years to come.

We have discussed the general process and suggested best-practice solutions for key aspects of CPU-to-GPU porting, using Karamelo as an example; the idea is for you to be able to use this information to port your own code. Our code running in parallel on one GPU was found to be up to 85 times faster than on one CPU. We expect this to be a ballpark indicative figure that could also be achieved in other codes.

What we have shown here is the first, easy step towards accelerating a serial code, but one can go further. A more major step forward would be to parallelise the code on multiple GPUs. The original Karamelo code already uses MPI for multiple-CPU parallelisation, and Kokkos is compatible with MPI, but this might not be the best approach owing to the large amount of code required. Instead, another avenue would be to use existing abstracted libraries for partitioned global address spaces; one such example is OpenSHMEM [9], which incidentally is also already compatible with Kokkos.

Beyond the APIs themselves, a lot of work could be done on the algorithms used to improve the code’s efficiency. In the case of the MPM, current algorithms are still far from achieving optimal workload balancing in general; given the amount of activity in this area, we expect significant advancements in the near future. Going forward, keeping Karamelo up to date will continue to make cutting-edge MPM technologies accessible and useful to researchers and industry alike.