1 Introduction

Discovering and developing drugs is a challenging and expensive endeavor that requires considerable time and resources. To reduce the computational cost and errors of screening potential compounds, artificial intelligence, especially deep learning models, has been widely used in drug discovery and development. These models can help identify promising molecular candidates faster and more accurately, particularly in virtual screening (VS), a key step in drug discovery and development [1,2,3,21].

The present study proposes a deep learning architecture called GMPP-NN that integrates an MPNN (message-passing neural network) with an MLP classifier, a machine learning model, to predict molecular properties. This architecture was evaluated on four datasets (HIV, BACE, BBBP, and ClinTox) from MoleculeNet, a comprehensive benchmark platform for molecular deep learning and machine learning.

Our work makes several key contributions in the field of graph-based deep learning, particularly for molecular property prediction problems in drug discovery and development. We propose the GMPP-NN architecture, which is designed to be flexible and applicable to a wide range of problems that use graph data as input. The experimental results indicate that GMPP-NN can achieve competitive performance when compared to other advanced methods in the field of drug discovery and development. Additionally, we show that GMPP-NN is a flexible architecture that can be modified to use different variants of GNN, such as graph convolutional networks (GCN), and can also be used for regression problems by changing the classification model used in the final stage. These contributions demonstrate the potential of GMPP-NN as a powerful architecture for solving graph-based problems and show its effectiveness for classification problems.

In the next section, we will detail the experimental setups of our datasets and their respective properties. Next, we will give a description of the MPNN model for molecular embedding and our GMPP-NN architecture, followed by a discussion of the architecture’s performance results compared to the existing methods.

2 Materials and methods

2.1 Define features

The bond and atom features are initialized as shown in Tables 1 and 2, respectively. The initial features of a node in the MPNN are the features of the corresponding atom, and the initial features of an edge vw are the features of the corresponding bond. All features are computed with the RDKit package [22].

Table 1 Atom Features
Table 2 Bond Features

2.2 Data collection

In this research study, our datasets come from the benchmark MoleculeNet. For the classification challenge, four datasets (HIV, BACE, BBBP, and ClinTox) were used, covering two domains (physiology and biophysics). Each dataset was divided into a training set (75%), a validation set (20%), and a test set (5%). The model was trained on the training set, its hyperparameters were tuned based on the validation set's results, and its performance was ultimately evaluated on the test set [23].
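A minimal sketch of such a 75/20/5 split, assuming a simple seeded random shuffle (`split_dataset` is a hypothetical helper for illustration; MoleculeNet's own splitters may differ):

```python
import random

def split_dataset(items, train_frac=0.75, val_frac=0.20, seed=42):
    """Randomly split a sequence into train/validation/test subsets (75/20/5)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remaining 5%
    return train, val, test

train, val, test = split_dataset(range(1000))
```

With 1000 molecules this yields 750/200/50 examples, and every molecule lands in exactly one subset.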

The HIV dataset comes from the Drug Therapeutics Program AIDS Antiviral Screen, which tested approximately 40,000 compounds for their ability to block HIV replication. The results were classified into two categories: confirmed inactive compounds and confirmed active compounds [33].

The BACE dataset provides quantitative and qualitative binding data for a set of human beta-secretase 1 (BACE-1) inhibitors, drawn from experimental values published in the scientific literature. It consists of 1522 chemical compounds with binary classification labels for a single protein target [34].

The BBBP (Blood-Brain Barrier Penetration) dataset addresses the modeling and prediction of barrier permeability, an essential consideration in developing drugs targeting the central nervous system, since the barrier blocks most drugs, neurotransmitters, and hormones. The dataset contains binary classification labels for more than 2000 chemical compounds [35].

The ClinTox dataset differentiates between pharmaceuticals that have received FDA approval and compounds that have experienced failure in clinical trial stages due to toxicity-related problems. The dataset consists of 1491 pharmacological compounds and includes two classification tasks: predicting clinical trial toxicity and determining FDA approval status. The dataset is obtained from the SWEETLEAD database, presenting significant insights into the distinctions between successful and unsuccessful drug candidates [36].

In all four datasets, molecules are represented as SMILES strings. The proposed architecture addresses the classification problems, predicting molecular properties across the datasets summarized in Table 3.

Table 3 Summary of the datasets

2.3 Generate graph

We employed a multi-step process to generate complete graphs from the SMILES representations of chemical compounds, as shown in Fig. 1.

Fig. 1

Schema of converting SMILES dataset to graph

The process of creating a molecular graph consists of two sequential steps. In the initial step, a SMILES string is taken as input, and the result is the creation of a molecule object. Subsequently, in the second step, the generated molecule object serves as input for the creation of a graph, which includes atom features, bond features, and pair indices. This two-step process is fundamental for converting a chemical structure represented by a SMILES string into a structured molecular graph, facilitating further computational analysis and modeling.
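The second step can be illustrated with a hand-rolled toy: given a molecule object (atom symbols plus a bond list), build the atom-feature, bond-feature, and pair-index arrays. The tiny element and bond-type vocabularies here are illustrative assumptions; the paper computes real features with RDKit.

```python
def molecule_to_graph(atoms, bonds):
    """Convert a toy molecule object (atom symbols + bond list) into the graph
    triple described in the text: atom features, bond features, pair indices."""
    elements = ["C", "N", "O"]          # toy vocabulary (assumption)
    bond_types = ["single", "double"]   # toy vocabulary (assumption)

    # one-hot atom features, one row per atom
    atom_features = [
        [1.0 if sym == e else 0.0 for e in elements] for sym in atoms
    ]
    bond_features, pair_indices = [], []
    for v, w, btype in bonds:
        onehot = [1.0 if btype == b else 0.0 for b in bond_types]
        # store both directions so message passing sees edges vw and wv
        bond_features += [onehot, onehot]
        pair_indices += [(v, w), (w, v)]
    return atom_features, bond_features, pair_indices

# ethanol (SMILES "CCO"): step 1 would yield this molecule object
atoms = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]
af, bf, pi = molecule_to_graph(atoms, bonds)
```

For ethanol this produces 3 atom-feature rows and 4 directed edges, the shape of input the MPNN consumes.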

Using attributed molecular graphs, a wide range of deep learning techniques can be applied to learn molecular structures and extract useful information. The ability of models to capture crucial properties of atoms and bonds within molecules is made feasible by these features, which help tasks such as property prediction, reaction prediction, and drug design [30].

2.4 Model architecture

Molecular graph embedding is an important area of study in graph neural networks. It typically involves three main components: atom-level embedding (which represents atoms in a graph as vectors), bond-level embedding (which represents bonds as vectors), and molecule-level embedding (which represents the whole molecule as a vector). In this study, we use the term graph embedding to refer specifically to molecule-level embedding: generating a vector representation for a molecule, which can then be used as input to our GMPP-NN architecture. The embedding model in our architecture is a message-passing network that generates this graph representation vector [39].

2.4.1 Message passing network

MPNN is a neural network model designed for graph data, with variants that handle both undirected and directed graph structures and applications. For simplicity, we describe an MPNN that operates on molecular graph data g with node features \(X_V\) and edge features \(e_{VW}\). As depicted in Fig. 2, the graph is a molecule containing atoms as nodes and bonds as edges.

Fig. 2

Message passing neural network (MPNN) for molecular embedding

The forward pass has two different phases: the message-passing phase and the readout phase. The message passing phase is characterized by the message functions \(M_t\) and vertex update functions \(U_t\), and it covers a period of T time steps. During the message passing phase, the hidden states \(h_V^{t}\) at each atom node in the network are updated using messages \(m_V^{t+1}\) in the following way:

$$\begin{aligned} m_V^{t+1}&= \sum \limits _{W \in N(V)} M_t\left(h_V^t, h_W^t, e_{VW}\right) \end{aligned}$$
(1)
$$\begin{aligned} h_V^{t+1}&= U_t\left( h_V^t, m_V^{t+1}\right) \end{aligned}$$
(2)

where \(h_{V}^{0}\) is a function of the initial atom features \(x_{V}\), and N(V) is the set of neighbors of V in the graph.

The graph readout phase computes:

$$\begin{aligned} g=R\left(H^{T}, H^{0}\right) \end{aligned}$$
(3)

where g is the graph embedding. The readout function R combines and transforms the initial and final atom node states to produce a single graph embedding.
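Equations (1)-(3) can be sketched as a minimal message-passing loop. The implementation below is a toy illustration with random, untrained weights: the message function \(M_t\) is a linear map over the concatenated neighbor state and bond feature, the update \(U_t\) is a simple tanh layer, and the readout R sums the initial and final node states.

```python
import numpy as np

def mpnn_embed(atom_feats, pair_indices, bond_feats, T=3, dim=8, seed=0):
    """Toy MPNN following Eqs. (1)-(3); weights are random, not trained."""
    rng = np.random.default_rng(seed)
    # h_v^0: a linear function of the initial atom features x_v
    h0 = atom_feats @ rng.standard_normal((atom_feats.shape[1], dim))
    Wm = rng.standard_normal((dim + bond_feats.shape[1], dim)) * 0.1  # M_t weights
    Wu = rng.standard_normal((2 * dim, dim)) * 0.1                    # U_t weights
    h = h0
    for _ in range(T):
        m = np.zeros_like(h)
        # Eq. (1): each directed edge (v, w) sends a message from w to v
        for k, (v, w) in enumerate(pair_indices):
            m[v] += np.tanh(np.concatenate([h[w], bond_feats[k]]) @ Wm)
        # Eq. (2): update every node state from its old state and its message
        h = np.tanh(np.concatenate([h, m], axis=1) @ Wu)
    # Eq. (3): readout R sums final and initial node states into one vector g
    return (h + h0).sum(axis=0)

atom_feats = np.eye(3)                      # 3 atoms with one-hot features
bond_feats = np.ones((4, 2))                # 4 directed edges, 2-dim bond features
pair_indices = [(0, 1), (1, 0), (1, 2), (2, 1)]
g = mpnn_embed(atom_feats, pair_indices, bond_feats)
```

Whatever the molecule's size, the readout collapses the node states into a fixed-length vector g, which is what the downstream classifier consumes.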

2.4.2 GMPP-NN architecture

Many deep learning architectures have been proposed for predicting molecular properties. ST (SMILES Transformer) uses pre-training and a text-based molecular representation to improve performance, especially on small datasets [31]. FP2VEC (Fingerprint to Vector) develops a fingerprint-based embedding that, combined with a convolutional neural network, achieves competitive results in various quantitative structure-activity relationship (QSAR) tasks, particularly classification [32]. The deeper graph neural network (Deeper-GCN) stacks very deep layers; because GCNs face vanishing gradients, over-smoothing, and over-fitting as they deepen, it proposes a novel normalization layer called MsgNorm and a pre-activation version of residual connections [25]. Geometry-enhanced molecular representation (GEM) designs a geometry-based GNN that simultaneously models atoms, bonds, and bond angles within a molecule, using double graphs to encode atom-bond and bond-angle relations [26]. The atom-bond transformer-based message-passing neural network (ABT-MPNN) combines the strengths of MPNNs and Transformers, integrating molecular representations at the bond, atom, and molecule levels; attention mechanisms in the message-passing and readout phases capture local and global information effectively [27]. Our architecture pursues the same goal: it predicts molecular properties from graph-structured molecules, using an MPNN model for graph embedding followed by an MLP classifier for property prediction, as illustrated in Fig. 3.

The molecular dataset, represented as SMILES strings, was converted into graph structures using the graph_from_smiles method, as explained in Fig. 1. The GMPP-NN architecture reads the graph representations of the molecular features with the MPNN and returns a graph embedding g. After the readout phase, the partitioned feature vectors were fed into the MLP classifier (multi-layer perceptron classifier) to predict the different molecular properties.

Our architecture involves transforming SMILES datasets into graph datasets as the first stage. We create a batch of sub-graphs (disconnected graphs), where each sub-graph represents a single molecule, and the MPNN model uses the disconnected graph as input to generate the disconnected graph embedding (global feature vector). After that, we partition the global feature vector into sub-vector features that are used for prediction in the MLP classifier.
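Because each batch is one disconnected graph, per-molecule embeddings are recovered by grouping node states by their molecule membership. The sketch below illustrates this partitioning step; `partition_readout` and the sum-pooling choice are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def partition_readout(node_states, molecule_ids):
    """Split the node states of a batched (disconnected) graph back into one
    embedding per molecule by summing each molecule's own nodes."""
    n_mols = max(molecule_ids) + 1
    out = np.zeros((n_mols, node_states.shape[1]))
    for state, mol in zip(node_states, molecule_ids):
        out[mol] += state   # accumulate node states per molecule
    return out

# a batch of two molecules: 3 atoms + 2 atoms stacked into one disconnected graph
node_states = np.arange(10, dtype=float).reshape(5, 2)
molecule_ids = [0, 0, 0, 1, 1]
emb = partition_readout(node_states, molecule_ids)
```

Each row of `emb` is then a per-molecule feature vector ready for the MLP classifier.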

Fig. 3

The proposed GMPP-NN architecture

2.4.3 Outputs insights from diverse datasets

The output of our architecture comes from the last model of our GMPP-NN architecture (the MLP Classifier model), which contains the following predictions:

  • The BBBP dataset is typically used to predict whether a chemical compound can penetrate the blood-brain barrier (BBB). The BBB serves as a protective interface, maintaining a barrier between the central nervous system and the circulatory system. The prediction of BBBP penetration is essential in drug discovery to determine if a compound can reach the brain to treat neurological conditions.

  • The Clintox dataset is used for predicting the clinical toxicity of chemical compounds. It helps determine if a compound is likely to have adverse effects on human health, making it crucial for drug safety assessment and regulatory purposes.

  • The HIV dataset is used to predict whether a chemical compound can inhibit HIV replication. Understanding this property is vital to the development of antiretroviral drugs to combat HIV/AIDS.

  • The BACE dataset is used for predicting the inhibition of the beta-secretase 1 (BACE-1) enzyme. Inhibition of BACE-1 is a target for the development of drugs to treat Alzheimer’s disease, as BACE-1 is involved in the production of beta-amyloid peptides, which play a role in the disease.

2.4.4 Performance metrics

The AUC (area under the curve) metric evaluates a classifier's ability to separate positive and negative samples. Positive samples are classified as true positives (TP) or false negatives (FN), while negative samples are classified as true negatives (TN) or false positives (FP). The performance of our GMPP-NN architecture was evaluated using the ROC curve (receiver operating characteristic curve) and PRC (precision-recall curve) metrics [40].

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). It quantifies the classifier's ability to distinguish between positive and negative instances across a spectrum of classification thresholds. A higher AUC score, which ranges from 0 to 1, indicates better classifier performance; the AUC is computed as the area under the ROC curve. Specifically, TPR is defined as:

$$\begin{aligned} \text {TPR} = \frac{\text {TP}}{\text {TP + FN}} \end{aligned}$$
(4)

while FPR is given by:

$$\begin{aligned} \text {FPR} = \frac{\text {FP}}{\text {FP + TN}} \end{aligned}$$
(5)

These metrics offer insights into the model’s ability to differentiate positive and negative instances accurately.
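The ROC AUC can be computed directly from these two rates: sweep thresholds over the predicted scores, collect the (FPR, TPR) points, and integrate with the trapezoid rule. The sketch below is an illustrative helper, not the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """ROC AUC from Eqs. (4)-(5): threshold sweep + trapezoid integration."""
    P = sum(labels)                 # number of positives
    N = len(labels) - P             # number of negatives
    pts = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        pts.append((fp / N, tp / P))            # (FPR, TPR) at this threshold
    pts = [(0.0, 0.0)] + pts + [(1.0, 1.0)]     # anchor the curve's endpoints
    return sum((x2 - x1) * (y1 + y2) / 2        # trapezoid rule over the curve
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)   # 0.75 for this small example
```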

The PRC-Curve is an alternative performance measure for binary classifiers that assesses the balance between precision and recall. Precision is the proportion of correct positive predictions among all positive predictions, while recall is the fraction of genuine positive samples that are correctly predicted as positive.

The PRC-Curve depicts precision as a function of recall, with the AUC calculated as the area beneath this curve. A higher PRC AUC suggests superior classifier performance when considering precision and recall. The PRC-Curve is especially beneficial when addressing unbalanced datasets, where the number of negative instances greatly surpasses the number of positives. Recall can be determined as follows:

$$\begin{aligned} \text {Recall} = \frac{\text {TP}}{\text {TP + FN}} \end{aligned}$$
(6)

and precision as:

$$\begin{aligned} \text {Precision} = \frac{\text {TP}}{\text {TP + FP}} \end{aligned}$$
(7)

These metrics quantify the relationship between true positive predictions, false negatives, and false positives, providing valuable insight into the model's performance.
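Under the same counting definitions, the PRC AUC can be approximated by step-wise integration of precision over recall (average-precision style). `pr_auc` below is an illustrative helper, not the paper's implementation.

```python
def pr_auc(labels, scores):
    """PRC AUC from Eqs. (6)-(7): recall = TP/(TP+FN), precision = TP/(TP+FP),
    integrated step-wise over descending score thresholds."""
    P = sum(labels)                 # number of positives
    prev_recall, area = 0.0, 0.0
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        recall = tp / P
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision  # step integration
        prev_recall = recall
    return area

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = pr_auc(labels, scores)
```

On the same toy example as above this gives 5/6, higher than its ROC AUC, illustrating that the two metrics weight errors differently.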

3 Results and discussion

3.1 Model training and validation performance

The GMPP-NN (Graph Molecular Property Prediction Neural Network) model was trained on four different datasets (BBBP, HIV, BACE, and ClinTox) with promising results. For each dataset, training showed progressive improvements in both the ROC-Curve and PRC-Curve scores, as shown in Figs. 4 and 5.

3.1.1 ROC-curve training and validation performance

Fig. 4

ROC-curve training and validation performance

After 40 epochs, the GMPP-NN model demonstrated strong discriminative power in predicting molecular properties in various datasets. In the BBBP dataset, the model achieved a remarkable training ROC-Curve of 0.9577 and a validation ROC-Curve of 0.9285, indicating its ability to distinguish between molecules with blood-brain barrier penetration and those without. In the HIV dataset, the model achieved a training ROC-Curve of 0.8413 and a validation ROC-Curve of 0.8175, suggesting its ability to predict HIV-related molecular properties. In the Clintox dataset, the model achieved a training ROC-Curve of 0.9158 and a validation ROC-Curve of 0.8280, indicating its strong discriminative power in classifying toxicological properties. Finally, in the BACE dataset, the model achieved a training ROC-Curve of 0.8917 and a validation ROC-Curve of 0.8599, demonstrating its ability to predict molecular properties related to BACE inhibition. The GMPP-NN architecture is a versatile tool that excels in molecular property prediction tasks due to its consistent performance improvement during training and validation ROC-Curve scores, making it valuable for diverse task prediction.

3.1.2 PRC-Curve training and validation performance

Fig. 5

PRC-curve training and validation performance

After 40 epochs, the GMPP-NN model demonstrated impressive performance across the BBBP, HIV, ClinTox, and BACE datasets. On BBBP, it achieved a training PRC-Curve of 0.9516 and a validation PRC-Curve of 0.8942, indicating its proficiency in distinguishing between molecules with and without blood-brain barrier penetration. The model also performed strongly on HIV, with a training PRC-Curve of 0.8184 and a validation PRC-Curve of 0.7976. On ClinTox, it classified toxicological properties with high precision and recall. On BACE, the model achieved a training PRC-Curve of 0.8590 and a validation PRC-Curve of 0.8527, indicating its ability to predict molecular properties related to BACE inhibition with a consistent balance between precision and recall. These PRC-Curve scores demonstrate the model's versatility and reliability in molecular property prediction: it provides precise predictions and effectively identifies relevant instances, making it a valuable tool for diverse tasks.

Comparing the ROC-Curve and PRC-Curve metrics, we observe that both metrics exhibited similar trends over the training epochs, as demonstrated by the consistent improvement in performance during training and validation. The ROC-Curve reveals how effectively the model separates positive and negative instances across various decision thresholds, while the PRC-Curve focuses on precision and recall, which is particularly important when dealing with imbalanced datasets.

3.2 ROC-Curve and PRC-Curve metrics prediction performance

The evaluation and comparison of ROC-Curve and PRC-Curve metrics for the GMPP-NN model on the four datasets (BACE, BBBP, ClinTox, and HIV) are presented in Fig. 6.

Fig. 6

Prediction performances of the four datasets using both ROC-curve and PRC-curve metrics

The model's performance on the different datasets was evaluated using both ROC-Curve and PRC-Curve measures. The HIV dataset showed strong performance, with an ROC-Curve of 0.8677 and a slightly lower PRC-Curve of 0.8565. The ClinTox dataset achieved exceptional results, with an ROC-Curve of 0.9795 and an impressive PRC-Curve of 0.9257, indicating excellent discrimination between positive and negative instances. The BBBP dataset showed outstanding performance, with an ROC-Curve of 0.9186 and a somewhat lower PRC-Curve of 0.8757, suggesting good detection of positive instances while maintaining reasonable specificity. The BACE dataset performed well, with an ROC-Curve of 0.8608 and a slightly higher PRC-Curve of 0.8615.

For the HIV dataset, both the ROC-Curve and the PRC-Curve are suitable measures of performance. For the ClinTox dataset, the ROC-Curve may be the better measure, while for the BBBP dataset the PRC-Curve may be preferable. For the BACE dataset, both metrics are effective measures of performance.

In the comparative evaluation of GMPP-NN against the other studies, we used the ROC-Curve as the metric for all dataset tasks.

3.3 Comparative analysis of GMPP-NN performance with three studies

We tested our architecture on four different task datasets (HIV, BACE, BBBP, and ClinTox) and evaluated the ROC-Curve score on the test set of each dataset. We compare our results to those of three previous studies.

Table 4 The performances over all datasets

The performance of the GMPP-NN model is compared with the SMILES transformer (ST), FP2VEC (fingerprint to vector), deeper graph neural network (Deeper-GCN), geometry-enhanced molecular representation (GEM), and atom-bond transformer-based message-passing neural network (ABT-MPNN) using the ROC-Curve metric in Table 4. The findings show that GMPP-NN produced the highest ROC-Curve values on three of the four datasets. On the HIV dataset, GMPP-NN achieved an ROC-Curve of 0.8677, while ST, FP2VEC, Deeper-GCN, ABT-MPNN, and GEM achieved 0.683, 0.785, 0.789, 0.809, and 0.769, respectively. On the BACE dataset, GMPP-NN obtained an ROC-Curve of 0.8608, whereas ST, FP2VEC, and GEM achieved 0.719, 0.883, and 0.856, respectively. On the BBBP dataset, GMPP-NN achieved an ROC-Curve of 0.9186, whereas ST, FP2VEC, and GEM obtained 0.900, 0.911, and 0.724, respectively. Lastly, on the ClinTox dataset, GMPP-NN achieved the highest ROC-Curve of 0.9795, while ST, FP2VEC, Deeper-GCN, ABT-MPNN, and GEM achieved 0.963, 0.803, 0.870, 0.904, and 0.825, respectively. These findings suggest that GMPP-NN, based on the MPNN model with the molecular graph as the featurization method, is a promising architecture for molecular property prediction, as it achieved the highest performance on the majority of the datasets evaluated. However, the choice of featurization method and embedding model can considerably influence the architecture's performance; different featurization techniques and models can have varying impacts on the desired outcomes and require careful evaluation.

4 Conclusion

We developed and applied an architecture, GMPP-NN (graph molecular property prediction neural network), to chemical drug classification tasks, where it showed strong performance in the majority of cases. To the best of our knowledge, this work demonstrates how the MPNN-based component of the architecture can extract a variety of chemical properties from molecular graphs. We executed multiple tests on four binary classification benchmark datasets labeled with physiological and biophysical properties. In detail, a molecular graph constructed from the SMILES dataset is used as input to the MPNN model to obtain a graph embedding, and an MLP classifier (multilayer perceptron classifier) performs the binary classification. We evaluated our study on two metrics, ROC AUC and PRC AUC, and compared GMPP-NN to the other models using the ROC AUC metric. We found that our architecture performs better than the SMILES transformer (ST), FP2VEC (fingerprint to vector), deeper graph neural network (Deeper-GCN), geometry-enhanced molecular representation (GEM), and atom-bond transformer-based message-passing neural network (ABT-MPNN). We anticipate that our findings will serve as a new reference for molecular property prediction in the drug discovery process, specifically for classification tasks. Our model can help avoid investing resources in molecules with unfavorable properties, reduce the number of experimental failures, and contribute to a better understanding and profiling of drug safety by helping to identify potential risks early in the discovery and development process.