Introduction

Drug discovery and molecular design rely heavily on predicting material properties directly from atomic structure. A property of particular interest for molecular design is the energy gap between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO), known as the HOMO-LUMO gap. The HOMO-LUMO gap is a useful approximation of the lowest excitation energy of a molecule and is used to characterize its chemical reactivity; in particular, more chemically reactive molecules have a lower HOMO-LUMO gap. Many physics-based computational approaches exist to compute the HOMO-LUMO gap of a molecule, such as ab initio molecular dynamics (MD) [1, 2] and density-functional tight-binding (DFTB) [3]. While these methods have been instrumental in predictive materials science, they are extremely computationally expensive. The advent of deep learning (DL) models has provided alternative methodologies that produce fast and accurate predictions of material properties and hence enable rapid screening of large search spaces to select material candidates with desirable properties [4,5,6,7].

In particular, graph convolutional neural network (GCNN) models are extensively used in materials science to predict material properties from atomic information [1].

Table 1 ADIOS schema for graph dataset

We have developed an extensible data loader module in HydraGNN that allows reading data from different storage formats. In this work, we evaluate the following three methods for loading training data.

  • Inline data loading: load SMILES strings stored in CSV format into memory and convert each SMILES string into a graph object every time a batch is loaded. This method has the smallest memory footprint (see the conversion sketch after this list).

  • Object data loading: convert all SMILES strings into graph objects and export them in a serialized format (e.g., Pickle) during a pre-processing phase. Each process then loads data batches directly from the file system and unpacks them into memory.

  • ADIOS data loading: convert SMILES strings into graph objects mapped to ADIOS variables and write them into an ADIOS file in a pre-processing step. Each process then loads its batch data in parallel with the other processes during training.
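To illustrate the first two methods, the snippet below is a minimal sketch of converting a SMILES string into a graph object with RDKit and PyTorch Geometric and serializing it with Pickle. The featurization (atomic number as the only node feature) and the function name smiles_to_graph are simplifying assumptions for illustration, not HydraGNN's exact schema:

# Minimal sketch: SMILES -> graph object (illustrative featurization, not HydraGNN's exact schema).
import pickle

import torch
from rdkit import Chem
from torch_geometric.data import Data

def smiles_to_graph(smiles: str, gap: float) -> Data:
    """Convert a SMILES string into a graph with the HOMO-LUMO gap as the target."""
    mol = Chem.MolFromSmiles(smiles)
    # Node features: one atomic number per atom (real featurizations are richer).
    x = torch.tensor([[atom.GetAtomicNum()] for atom in mol.GetAtoms()], dtype=torch.float)
    # Edges: each bond contributes both directions so the graph is undirected.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index, y=torch.tensor([gap]))

# Inline loading performs this conversion at every batch; object loading serializes once up front.
graph = smiles_to_graph("CCO", gap=7.2)      # hypothetical molecule and label
with open("graph_000.pkl", "wb") as f:       # object loading: pre-serialize to Pickle
    pickle.dump(graph, f)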

We will discuss performance comparisons in the next section.

Numerical results

In this section, we assess our implementation of DDP training in HydraGNN on two state-of-the-art DOE supercomputers, Summit and Perlmutter. We discuss the scalability of our approach and compare the performance of different I/O backends for storing and reading graph data.

Setup

We perform our evaluation on two DOE supercomputers. Both systems provide state-of-the-art GPU-based heterogeneous architectures.

Summit is a supercomputer at OLCF, one of DOE’s Leadership Computing Facilities (LCFs). Summit consists of about 4600 compute nodes. Each node has a hybrid architecture containing two IBM POWER9 CPUs and six NVIDIA Volta GPUs connected with NVIDIA’s high-speed NVLink. Each node contains 512 GB of DDR4 memory for CPUs and 96 GB of High Bandwidth Memory (HBM2) for GPUs. Summit nodes are connected in a non-blocking fat-tree topology using a dual-rail Mellanox EDR InfiniBand interconnection.

Perlmutter is a supercomputer at NERSC. Perlmutter consists of about 3000 CPU-only nodes and 1500 GPU-accelerated nodes. We use only the GPU-accelerated nodes in our work. Each GPU-accelerated node has one AMD EPYC 7763 (Milan) processor and four NVIDIA Ampere A100 GPUs connected to each other with NVLink-3. Each GPU node has 256 GB of DDR4 memory and 40 GB of HBM2 per GPU. All nodes in Perlmutter are connected with the HPE Cray Slingshot interconnect.

We demonstrate the performance of HydraGNN using two large-scale datasets, a previously published benchmark dataset for graph-based learning (PCQM4Mv2) [11, 44] and a custom dataset generated for this work (AISD HOMO-LUMO) [13]. For both datasets, molecule information is provided as SMILES strings. The PCQM4Mv2 dataset consists of HOMO-LUMO gap values for about 3.3 million molecules. In total, 31 different atom types (i.e., H, B, C, N, O, F, Si, P, S, Cl, Ca, Ge, As, Se, Br, I, Mg, Ti, Ga, Zn, Ar, Be, He, Al, Kr, V, Na, Li, Cu, Ne, Ni) appear in the dataset. The custom AISD HOMO-LUMO dataset was generated using molecular structures from previous work [45]. It is a collection of approximately 10.5 million molecules and contains 6 element types (i.e., H, C, N, O, S, and F).

Table 2 Dataset description

For scalability tests, we use HydraGNN with 6 PNA [28] convolutional layers and 55 neurons per PNA layer. The model is trained with the AdamW method [46] using a learning rate of 0.001, a local batch size of 128, and a maximum of 3 epochs. The training set for each neural network represents 94% of the total dataset; the validation and test sets represent one third and two thirds of the remaining data, respectively. For the error convergence tests, the HydraGNN model uses 200 neurons per layer.
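For reference, this configuration maps onto a standard PyTorch DDP setup roughly as sketched below (launched with one process per GPU, e.g. via torchrun). The build_model function and the random placeholder dataset are hypothetical stand-ins; only the optimizer (AdamW, lr = 0.001) and the local batch size of 128 follow the text:

# Sketch of the DDP setup described above; build_model and the toy dataset are placeholders.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DistributedSampler
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

def build_model() -> torch.nn.Module:
    # Placeholder for the HydraGNN architecture (6 PNA layers, 55 neurons each, per the text).
    return torch.nn.Linear(1, 1)

dist.init_process_group(backend="nccl")                      # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # AdamW with lr = 0.001

# Toy dataset of random graphs; in practice this is the pre-processed PCQM4Mv2 or AISD data.
dataset = [Data(x=torch.randn(4, 1),
                edge_index=torch.randint(0, 4, (2, 6)),
                y=torch.randn(1)) for _ in range(1024)]
sampler = DistributedSampler(dataset, shuffle=True)            # shards data across ranks
loader = DataLoader(dataset, batch_size=128, sampler=sampler)  # local batch size of 128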

Scalability of DDP

Fig. 5

Strong scaling performance of HydraGNN training on OLCF’s Summit and NERSC’s Perlmutter (top), and detailed timing (bottom). We perform data-parallel training for the PCQM4Mv2 and AISD HOMO-LUMO datasets with HydraGNN using up to 1500 GPUs and observe linear scaling up to 1024 GPUs

We perform DDP training with HydraGNN for the PCQM4Mv2 and AISD HOMO-LUMO data on Summit at ORNL and Perlmutter at NERSC using multiple CPUs and GPUs. The number of graphs and the size of each dataset are summarized in Table 2.

We measure the total training time for the PCQM4Mv2 and AISD HOMO-LUMO datasets over three epochs. As discussed previously, each training iteration consists of a data loading phase, followed by forward computation, backward computation, and an optimizer update. We test the scalability of DDP by varying the number of nodes on each system, ranging from a single node up to 256 nodes on Summit and 128 nodes on Perlmutter, corresponding to 1536 Volta GPUs and 512 A100 GPUs, respectively. Figure 5 shows the results. The scaling plot (top) shows the averaged training time for PCQM4Mv2 and AISD HOMO-LUMO on each system with a varying number of nodes, and the detailed timings of each sub-function during training on Summit are shown at the bottom.
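The per-phase breakdown at the bottom of Fig. 5 can be reproduced with lightweight timers around each phase of the training loop. The sketch below is an illustration of that kind of instrumentation, not HydraGNN's actual profiling code:

# Sketch: timing the per-batch phases (data loading, forward, backward, optimizer update).
import time
from collections import defaultdict

import torch

def _now(device: torch.device) -> float:
    # Synchronize so asynchronous GPU kernels are charged to the right phase.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return time.perf_counter()

def timed_epoch(model, loader, optimizer, loss_fn, device):
    timers = defaultdict(float)
    t0 = _now(device)
    for batch in loader:                          # data loading (and host-to-device copy)
        batch = batch.to(device)
        timers["dataload"] += _now(device) - t0

        t0 = _now(device)
        loss = loss_fn(model(batch), batch.y)     # forward
        timers["forward"] += _now(device) - t0

        t0 = _now(device)
        optimizer.zero_grad()
        loss.backward()                           # backward (DDP averages gradients here)
        timers["backward"] += _now(device) - t0

        t0 = _now(device)
        optimizer.step()                          # optimizer update
        timers["optimizer"] += _now(device) - t0

        t0 = _now(device)
    return dict(timers)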

We obtain near-linear scaling up to 1024 GPUs for both PCQM4Mv2 and AISD HOMO-LUMO data. As we further scale the workflow on Summit, the number of batches per GPU decreases, leading to sub-optimal utilization of GPU resources. As a result, we see a drop in speedup as we scale beyond 1024 GPUs on Summit. We expect similar scaling behavior on Perlmutter, but we were limited to using 128 nodes (i.e., 512 GPUs) for this work.
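To make the utilization argument concrete, a rough count of training batches per GPU per epoch follows below, using the approximate PCQM4Mv2 size, the 94% training split, and the local batch size of 128; exact numbers depend on the actual split:

# Rough estimate of training batches per GPU per epoch for PCQM4Mv2 (approximate sizes).
import math

n_train = int(0.94 * 3_300_000)     # ~94% of about 3.3 million molecules
local_batch = 128

for n_gpus in (6, 384, 1024, 1536):
    batches = math.ceil(n_train / (n_gpus * local_batch))
    print(f"{n_gpus:5d} GPUs -> ~{batches} batches per GPU per epoch")

At the largest scale each GPU processes only a handful of batches per epoch, so fixed per-batch overheads dominate and utilization drops.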

Comparing different I/O backends

Data loading takes a significant amount of time during training, as shown in Fig. 5, and hence is a crucial step in the overall workflow. We compare the three data loading methods (inline, object, and ADIOS loading), as discussed in the "Distributed data parallel training" section. Figure 6 presents the time taken by the three methods for the PCQM4Mv2 dataset on Summit. As expected, ADIOS outperforms the inline (CSV) loading case, in which SMILES strings are converted into graph objects for every molecule at load time. ADIOS also outperforms Pickle-based object loading by 4.2x on a single Summit node and 1.5x on 32 Summit nodes. To provide a complete picture of HydraGNN's data processing, Fig. 7 shows the pre-processing performance of converting the PCQM4Mv2 data into the ADIOS format on Summit. ADIOS supports parallel writing and shows scalable performance as more CPUs are used in parallel.
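As an illustration of the parallel pre-processing and loading path, the sketch below writes a flattened array of gap values as a global ADIOS variable and reads back a per-rank slice. It assumes the adios2 v2.x high-level Python API (adios2.open; newer releases use adios2.Stream), and the variable name and layout are illustrative rather than the exact schema of Table 1:

# Sketch: parallel ADIOS write during pre-processing and sliced read during training.
import adios2
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Each rank pre-processes its own shard of molecules into a flat array of gap values.
n_total = 1_000                                   # hypothetical number of molecules
counts = [n_total // nprocs + (r < n_total % nprocs) for r in range(nprocs)]
start, count = sum(counts[:rank]), counts[rank]
gaps = np.random.rand(count)                      # stand-in for computed HOMO-LUMO gaps

# Parallel write: every rank writes its block of the global "gap" array.
with adios2.open("dataset.bp", "w", comm) as fh:
    fh.write("gap", gaps, [n_total], [start], [count])

# Parallel read during training: each rank reads only the slice it needs.
with adios2.open("dataset.bp", "r", comm) as fh:
    my_gaps = fh.read("gap", [start], [count])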

Fig. 6

Comparison of different I/O methods in HydraGNN. We measure ADIOS data loading time compared with CSV and Pickle loading for the PCQM4Mv2 dataset on Summit

Fig. 7

The performance of PCQM4Mv2 pre-processing in HydraGNN. We convert PCQM4Mv2 data into ADIOS data by using multiple CPUs on Summit

Accuracy

Next, we perform long-running HydraGNN training for the HOMO-LUMO gap prediction with PCQM4Mv2 and AISD HOMO-LUMO datasets until training converges.

Fig. 8

HydraGNN predicted values against DFT values of HOMO-LUMO Gap for molecules in PCQM4Mv2 training and validation sets

Fig. 9

HydraGNN predicted values against DFT values of HOMO-LUMO Gap for molecules in AISD HOMO-LUMO training, validation and test sets

Figures 8 and 9 show the prediction results for the PCQM4Mv2 and AISD HOMO-LUMO datasets, respectively. With PCQM4Mv2, we achieve prediction errors of around 0.10 and 0.12 eV, measured as mean absolute error (MAE), for the training and validation sets, respectively. We note that PCQM4Mv2 is a public dataset released without test data in order to maintain the OGB-LSC Leaderboards. The validation MAE reported for multiple models on the Leaderboard varies from 0.0857 to 0.1760 eV, so the validation error of 0.12 eV obtained in this work falls within the accepted range. The AISD HOMO-LUMO dataset contains roughly three times as many molecules as the PCQM4Mv2 dataset. Its MAE for the training, validation, and test sets is 0.14 eV, similar to the errors obtained on PCQM4Mv2. Figure 10 shows the accuracy convergence on the AISD HOMO-LUMO dataset using different numbers of Summit GPUs. HydraGNN training with 192 GPUs converges in 0.3 h of wall time to an accuracy level similar to that achieved with 6 GPUs (a single Summit node), which takes about 8.2 h.

We highlight that the convergence of the distributed GCNN training with 192 GPUs degrades compared to the distributed training with only 6 GPUs. This is due to a well-known numerical artifact that destabilizes the training of DL models at large scale and causes a performance drop, because large-scale DDP training is mathematically equivalent to large-batch training. Processing data in large batches significantly reduces the stochastic oscillations of the optimizer used for DL training, making the training more likely to be trapped in sharp local minima, which adversely affect generalization. Although the final accuracy of the GCNN training with 192 GPUs is slightly worse than that obtained using 6 GPUs, we emphasize the significant advantage that HPC resources provide in speeding up the training. Better accuracy can be obtained when training DL models at large scale by adaptively tuning the learning rate [47, 48] or by applying quasi-Newton accelerations [49], but this goes beyond the focus of our current work.
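As one commonly used heuristic from the large-batch training literature (shown purely for illustration; it is not the specific method of refs. [47-49] and was not applied in this work), the learning rate can be scaled linearly with the global batch size and ramped up over a warmup period:

# Sketch: linear learning-rate scaling with warmup, a common large-batch heuristic (illustrative only).
def scaled_lr(base_lr: float, base_batch: int, local_batch: int, world_size: int) -> float:
    """Scale the learning rate in proportion to the global batch size."""
    return base_lr * (local_batch * world_size) / base_batch

def warmup_lr(target_lr: float, step: int, warmup_steps: int) -> float:
    """Ramp linearly from near zero to target_lr over the first warmup_steps steps."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

# Example: a base rate tuned on one Summit node (6 GPUs x 128 graphs) rescaled for 192 GPUs.
target = scaled_lr(base_lr=1e-3, base_batch=6 * 128, local_batch=128, world_size=192)
print(target)                                         # 0.032
print(warmup_lr(target, step=0, warmup_steps=500))    # small rate at the first step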

Fig. 10

Convergence of the training and validation runs for the AISD HOMO-LUMO data on Summit with different GPU counts

Conclusions and future work

In this paper, we present a computational workflow that performs DDP training to predict the HOMO-LUMO gap of molecules. We have implemented DDP in HydraGNN, a GCNN library developed at ORNL, which can utilize heterogeneous computing resources including CPUs and GPUs. For efficient storage and loading of large molecular data, we use the ADIOS high-performance data management framework. ADIOS reduces the storage footprint of large-scale graph structures compared with commonly used methods, and provides an easy way to efficiently load data and distribute them amongst processes. We have conducted studies using two molecular datasets on OLCF’s Summit and NERSC’s Perlmutter supercomputers. Our results show near-linear scaling of HydraGNN for the test datasets up to 1024 GPUs. Additionally, we present the accuracy and convergence behavior of the distributed training with an increasing number of GPUs.

Through efficiently managing large-scale datasets and training in parallel, HydraGNN provides an effective surrogate model for accurate and rapid screening of large chemical spaces for molecular design. Future work will be dedicated to integrating the scalable DDP training of HydraGNN in a computational workflow to perform molecular design.