Hierarchical graph learning for protein–protein interaction

Gao, Ziqi; Jiang, Chenran; Zhang, Jiawen; Jiang, Xiaosen; Li, Lanqing; Zhao, Peilin; Yang, Huanming; Huang, Yong; Li, Jia

doi:10.1038/s41467-023-36736-1

Article
Open access
Published: 25 February 2023

Volume 14, article number 1093, (2023)

Abstract

Protein–protein interactions (PPIs) are fundamental to function and signaling in biological systems. The massive growth in demand and cost associated with experimental PPI studies calls for computational tools for automated prediction and understanding of PPIs. Despite recent progress, in silico methods remain inadequate in modeling the natural PPI hierarchy. Here we present a double-viewed hierarchical graph learning model, HIGH-PPI, to predict PPIs and extrapolate the molecular details involved. In this model, we create a hierarchical graph, in which a node in the PPI network (top outside-of-protein view) is a protein graph (bottom inside-of-protein view). In the bottom view, a group of chemically relevant descriptors, instead of the protein sequences, is used to better capture the structure-function relationship of the protein. HIGH-PPI examines both the outside-of-protein and inside-of-protein views of the human interactome to establish a robust machine understanding of PPIs. This model demonstrates high accuracy and robustness in predicting PPIs. Moreover, HIGH-PPI can interpret the modes of action of PPIs by identifying important binding and catalytic sites precisely. Overall, HIGH-PPI (https://github.com/zqgao22/HIGH-PPI) is a domain-knowledge-driven and interpretable framework for PPI prediction studies.

Introduction

Biological functions are accomplished by interactions and chemical reactions among biomolecules. Among them, protein–protein interactions (PPIs) are arguably one of the most important molecular events in the human body and are an important source of therapeutic interventions against diseases. A comprehensive dictionary of PPIs can help connect the dots in complicated biological pathways and expedite the development of therapeutics^1,2. In biology, hierarchy information has been widely exploited to gain in-depth information about phenotypes of interest, for example, in disease biology^3,4,5, proteomics^6,7,8, and neurobiology^9,10,11. Naturally, PPIs encapsulate a two-view hierarchy: on the top view, proteins interact with each other; on the bottom view, key amino acids or residues assemble to form important local domains. Following this logic, biologists often take hierarchical approaches to understand PPIs^12,13. Experimentally, scientists often employ high-throughput mapping^14,15,16 to pre-build the PPI network at scale, and use bioinformatics clustering methods to identify functional modules of the network (top view). On the individual protein level, isolation methods, such as co-immunoprecipitation¹⁷, pull-down¹⁸, and crosslinking¹⁹, are used to establish the structures of individual proteins, so that surface ‘hotspots’ can be located and analyzed. In short, hierarchical knowledge of structure is important to understand the molecular details of PPIs.

More recently, the massive growth in the demand for and the cost of experimentally validating PPIs has made it impossible to characterize most unknown PPIs in wet laboratories. To map out the human interactome efficiently and inexpensively, computational methods are increasingly being used to predict PPIs automatically. Over the past decade, Deep Learning (DL) methods, among the most revolutionary tools in computation, have been applied to study PPIs. Development in this field has mostly focused on two aspects: learning appropriate protein representations^20,21 and inferring potential PPIs by link prediction^22,23. The former focuses on extracting structural information from protein sequences, in particular with Convolutional Neural Networks (CNNs). For the latter, common neighbors (CN)³³ assigns high probabilities of PPI to protein pairs that are known to share common PPI partners. CN can be generalized to consider neighbors at a greater path length (L3)²², which captures the structural and evolutionary forces that govern biological networks such as the interactome. Additionally, distance-based methods measure the possible distances between protein pairs, such as Euclidean commute time (ECT)³⁴ and random walk with restart (RWR)³⁵. Most traditional link prediction methods focus on known interactions but tend to overlook important network properties such as node degrees and community partitions.

More importantly, these methods perceive only one of the two views, outside-of-protein or inside-of-protein. Few can model the natural PPI hierarchy by connecting both views. To address this issue, we present a hierarchical graph model that applies two Graph Neural Networks (GNNs), one to each view.

Second, we examine the model tolerance when testing with low-quality structure data (see Fig. 3b). This reflects realistic scenarios, where native structure information is not always available for predicting PPIs. We prefer a model whose performance is not seriously limited by structure quality and which is robust to inputs taken directly from computational models (e.g., AlphaFold⁴³). We evaluate the quality of an input protein structure by calculating the root-mean-square deviation (RMSD) between the native structure and the input. Native protein structures (RMSD = 0) are retrieved from the PDB database at the highest resolutions. We compute the best-F1 scores (box plots) of our method on a set of AlphaFold structures with various RMSDs (0.80, 1.59, 2.39, 3.19, 5.36, 7.98), and show the average result of the second-best method (GNN-PPI) as a blue dotted line. As can be seen, our model always performs better than GNN-PPI, even with RMSD up to 8. The comparison with the 3D CNN model²¹ further demonstrates the denoising ability of the hierarchical graph for protein structure errors (Supplementary Fig. 4a). In short, our model performance is not significantly affected by structure errors where powerful pre-trained features are not available.

Further, to interpret decisions made by RNN, CNN and GNN, an experiment is conducted to explore the ability to capture protein functional sites. We apply the 3D-grad-CAM approach⁴⁴ on the trained 3D CNN model named DeepRank²¹, and apply the RNNVis approach⁴⁵ on the trained PIPR²⁶ model with 3D information. All three methods have identified more than one motif, in which we only show the most crucial site. Figure 3c displays the binding site for an isomerase protein’s chain A (PDB id: 1BJP). The binding site is made up of four residues with the sequence numbers 6, 42, 43, and 44. As can be seen, whereas neither CNN nor RNN can identify the His-6 residue, our method can precisely identify the binding site by using graph motif search. It seems to be a challenge for the sequence model (i.e., RNN, CNN) to connect His-6 to the other residues, probably because of their weak connections in a sequence mode. Moreover, 3D CNN performs even worse than RNN as it incorrectly classifies the non-essential Ile-41 residue.

For node features in protein graphs, we select seven important features from twelve residue-level feature options (see Supplementary Table 4) that are easily available. The feature selection process (see Supplementary Method 1 for details) produces the optimal set consisting of seven features to ensure that our model peaks at both AUPR and best-F1 scores. Here, we list the selected seven residue-level physicochemical properties in Fig. 3d and discuss their importance for different types of PPIs to both better interpret our model and discover enlightening biomarkers for PPI interface. The average z-score, which results from deleting each feature dimension and analyzing changes in AUPR before and after, is calculated to determine the importance of a feature. We choose a representative type (i.e., binding) to explain because it is the most prevalent in the STRING database. As a consequence, HIGH-PPI regards topological polar surface area (TPSA) and octanol-water partition coefficient (KOW) as dominant features. This finding supports the conventional wisdom that TPSA and KOW play a key role in drug transport process⁴⁶, protein interface recognition^47,48, and PPI prediction⁴⁹.

Top outside-of-protein view improves the performance

We investigate the role of top outside-of-protein view TGNN from three perspectives, including (1) the importance of degree and community recovery for predicting network structures, (2) comparison results of TGNN and other leading link prediction methods, (3) a real-life example to show the shortcomings of the leading link prediction methods.

Recently, various works have demonstrated the usefulness of structure properties (e.g., degree, community) of networks for predicting missing links. HIGH-PPI is inspired to efficiently recover the degree and community partitions of the PPI network by utilizing the network topology. We show an empirical study in Fig. 4a to illustrate the impact of degree and community recovery for link prediction. We randomly select the test results from the model trained in different epochs and calculate the negative Mean Absolute Error (-MAE) of the predicted degrees and real degrees to represent degree recovery. Similarly, for community recovery, we quantify the community recovery using the normalized mutual information (NMI). As can be seen, we observe a significant correlation ($R=-0.66$) between degree recovery and model performance (i.e., best-F1) as well as a high correlation ($R=0.68$) between community recovery and model performance, which means better recovery of the degree and community of PPI network implies better PPI prediction performance.
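The two recovery measures can be illustrated with a small standard-library sketch. It is not the authors' evaluation code: the helper names are hypothetical, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the text does not state which variant is used.

```python
import math
from collections import Counter


def degrees(edges, nodes):
    """Count how many undirected edges touch each node."""
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_mae(true_edges, pred_edges, nodes):
    """Mean absolute error between real and predicted node degrees."""
    d_true = degrees(true_edges, nodes)
    d_pred = degrees(pred_edges, nodes)
    return sum(abs(d_true[n] - d_pred[n]) for n in nodes) / len(nodes)


def entropy(labels):
    """Shannon entropy (natural log) of a community labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI = 2 * I(A;B) / (H(A) + H(B)); the normalization is an assumption."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    denom = entropy(labels_a) + entropy(labels_b)
    return 1.0 if denom == 0 else 2.0 * mi / denom
```

A lower MAE (equivalently a higher -MAE) indicates better degree recovery, and an NMI closer to 1 indicates that the communities detected in the recovered network match those of the real one.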

**Fig. 4: Performance of top view GNN of HIGH-PPI to learn relational information in PPI network.**

Second, we evaluate the performance of TGNN and leading link prediction methods using the PPI network structure as input. Our method (TGNN) takes interactions as edges and node degrees as node features. We compare HIGH-PPI with six heuristic methods and one DL-based method. Heuristic methods, the simple yet effective ones utilizing heuristic node similarities as the link likelihoods, include common neighbors (CN)³³, Katz index (Katz)⁵⁰, Adamic-Adar (AA)⁵¹, preferential attachment (PA)⁵², SimRank (SR)⁵³ and paths of length three (L3)²². MLP_IP, a DL approach, learns node representations using a multilayer perceptron (MLP) and identifies the node similarity via an inner product (IP) operation. We calculate the MAE and NMI values of recovered networks and highlight those with a high capacity for recovery (NMI ≥ 0.7 and MAE ≤ 0.35) in orange. Results show that link prediction methods that are more adept at recovering network properties typically perform better. This gain validates our findings in Fig. 4a and highlights the need for TGNN in the top view. In addition, a comparison of MLP_IP and L3 shows that pairwise learning is insufficient to capture the network information well. Although L3 can capture the evolutionary principles of PPIs to some extent, our method beats L3 by better recovering the structure of the PPI network.
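To make the heuristic baselines concrete, the sketch below scores a candidate protein pair with three of them (CN, AA and PA) using their standard textbook definitions. The function names and the toy network are illustrative assumptions and are not taken from the HIGH-PPI code base.

```python
import math


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    """CN: number of interaction partners shared by u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """AA: shared partners weighted by 1 / log(degree).

    For u != v, a shared partner has degree >= 2, so the logarithm is positive.
    """
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    """PA: product of the two node degrees."""
    return len(adj[u]) * len(adj[v])


# Toy network with hypothetical protein names.
toy_edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E")]
toy_adj = build_adjacency(toy_edges)
```

All three scores depend only on the immediate neighbourhoods of the two proteins, which is one way to see why such heuristics may fail to recover global properties such as community partitions.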

We provide an example on an SHS27k sub-network. As can be seen, there exist two distinct communities connected by two inter-community edges. We use the original sub-network as inputs and find that non-TGNN link prediction methods (i.e., CN, Katz, SR, AA, PA) tend to give high scores for intercommunity interactions. As an interesting observation, when we apply the Louvain community detection algorithm⁵⁴ to the recovered structure, it cannot produce an accurate community partition as the abundant inter-community interactions disrupt the original community structure. To examine degree recovery ability, we randomly select 50% of interactions as inputs and show each method’s degree recovery result for node KIF22 in Fig. 4c. We find non-TGNN approaches cannot well recover the links connecting the node KIF22 while TGNN approach can. In short, these experiments demonstrate that the structure properties of the PPI network are not always reflected in traditional link prediction methods, and moreover, capturing and learning the network structures in our top view improves the prediction performance.

HIGH-PPI accurately identifies key residues constituting functional sites

Typically, functional sites are spatially clustered sets of residues. They control protein functions and are thus important for PPI prediction. As our proposed model has the capacity to capture spatial-biological arrangements of residues in the bottom view, this characteristic can be used to explain the model’s decision. It is meaningful to notice that HIGH-PPI can automatically learn the residue importance without any residue-level annotations. In this section, we provide (1) a case study of predicting residue importance for the binding surface, (2) two cases of estimating residue importance for catalytic sites, and (3) an explainable ability comparison of precision in predicting binding sites.

First, a binding example between the query protein (PDB id: 2B6H-A) and its partner (PDB id: 2REY-A) is investigated. The ground truth binding surface is retrieved from the PDBePISA database⁵⁵, which is colored in red in Fig. 5a. Subsequently, we apply the GNN explanation approach (see Section 4.5 in “Methods” for details) on the HIGH-PPI model. As can be seen from Fig. 5a, HIGH-PPI can accurately and automatically identify the residues belonging to the binding surface. Another observation is shown in Fig. 5c which indicates our learned residue importance is quite close to the real profiles. We show another six cases of HIGH-PPI for identifying binding surfaces correctly in Supplementary Fig. 7.

**Fig. 5: Automatic explanation for residue importance without supervision.**

Second, in order to evaluate the prediction of catalytic sites for PPIs, we utilize the same GNN explanation approach in our model. The ground truth catalytic site is retrieved from the Catalytic Site Atlas⁵⁶ (CSA), a database for catalytic residue annotation for enzymes. We calculate the residue importance of catalytic sites for query proteins (PDB id: 1S9I-A, 1I0O-A). As seen in Fig. 5b, our proposed HIGH-PPI can correctly predict both residues for 1S9I-A and two out of three for 1I0O-A. We show another nine cases of HIGH-PPI for identifying catalytic sites in Supplementary Fig. 6, where a total of 25 out of 34 catalytic sites are correctly identified.

Additionally, we compare the model interpretability of the CNN, 3D CNN and HIGH-PPI models. We employ the CNN module in GNN-PPI²⁴ and the 3D CNN module in DeepRank²¹, respectively. We apply the grad-CAM⁵⁷ and 3D-grad-CAM⁴⁴ approaches to determine residue importance for the CNN and 3D CNN models, respectively. We use the binding type PPIs from the STRING dataset as the training set, and randomly select 20 binding type PPIs as the test set. We use the ground truth from PDBePISA for each query protein and treat its residues with importance >0 as surface compositions. To gauge the precision of the surface prediction, intersection over union (IoU) is used, and the box plots of the IoU score distributions are shown in Fig. 5d. The results show that HIGH-PPI significantly outperforms the other models in terms of interpretability, with the smallest variance. In addition, 3D CNN outperforms CNN with a smaller variance, showing that 3D information supports the learning of reliable and generalized protein representations.
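The IoU score used here can be written in a few lines. This is a minimal sketch under the assumption that predicted and ground-truth surfaces are both represented as sets of residue indices; the helper names and the example scores are made up for illustration.

```python
def residues_above(importance, threshold=0.0):
    """Residues whose importance score exceeds the threshold."""
    return {idx for idx, score in importance.items() if score > threshold}


def surface_iou(predicted, ground_truth):
    """Intersection over union of two residue sets (0.0 if both are empty)."""
    pred, truth = set(predicted), set(ground_truth)
    union = pred | truth
    if not union:
        return 0.0
    return len(pred & truth) / len(union)


# Hypothetical importance scores keyed by residue sequence number.
example_importance = {6: 1.2, 41: -0.3, 42: 0.8, 43: 0.1, 44: -0.5}
example_truth = {6, 42, 43, 44}
example_iou = surface_iou(residues_above(example_importance), example_truth)
```

In this toy case three of the four true surface residues are recovered with no false positives, so the IoU is 3/4.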

Protein functional site prediction sheds light on the model decisions and how to carry out additional experimental validations for PPI investigation. Excellent model interpretability also shows that our approach can accurately describe biological evidence for proteins.

Discussion

Hierarchical graph learning

In this paper, we study the PPI problem from a hierarchical graph perspective and develop a hierarchical graph learning model named HIGH-PPI to predict PPIs. Empirically, HIGH-PPI for PPI prediction outperforms leading methods by a significant margin. The hierarchical graph exhibits high generalization for recognizing unknown proteins and robustness against protein structure errors and PPI network perturbations.

Even without explicit supervision from binding site information, HIGH-PPI demonstrates its ability to capture residue importance for PPI with the aid of a hierarchical graph, which is a good indicator of excellent interpretability. If HIGH-PPI predicted the presence of a catalytic interaction for a protein pair but identified important sites unrelated to catalysis, we could hardly trust the model’s decision. Moreover, interpretability provides trusted guides for subsequent wet experimental validations. For example, if HIGH-PPI ranks a catalytic site as important, experiments may be designed to target that specific site for validation.

In conclusion, interpretable, end-to-end learning with a hierarchical graph revealing the PPI nature can pave the way to map out human interactome and deepen our understanding of PPI mechanisms.

Limitations and future work

We describe our intuitions in the hierarchical graph learning for PPIs. The world is hierarchical. Humans tend to solve problems or learn knowledge by conceptualizing the world from a hierarchical view⁵⁸. Due to huge semantic gaps between hierarchical views, humans always use a multi-view learning strategy to deepen the understanding of one view from the other one. Given rich hierarchical information, recent machine intelligence methods can effectively learn knowledge in each separate view but are not experts in gaining mutual benefits from both views. This is the challenge that our hierarchical world presents to machine intelligence. Here we connect both views by employing the forward and backward propagation of DL models. The forward propagation benefits the learning for the PPI network in the top view. In turn, the backward propagation optimizes the PPI-appropriate protein representations in the bottom view.

We describe four limitations of HIGH-PPI and outline potential solutions in future work.

(1) We did not explore in depth how to use protein-level annotations. Annotations for protein functions are becoming more available due to the recent growth of protein function databases (e.g., the UniProt Knowledge-base⁵⁹) and computational methods²⁹ for protein function prediction. Some annotations may speed up learning PPIs. For example, two proteins with low scores of the “protein binding” function term hardly interact with each other. We suggest that future work may consider leveraging function annotations to enhance the expressiveness of protein representations. Inspired by the contrastive learning principle, a potentially feasible solution is to enhance the consistency between protein representations and functions.

(2) Protein domain information may be beneficial for hierarchical models. We clarify the core ideas here and provide a detailed description in Supplementary Method 2. Domains are distinct functional or structural units in proteins and are responsible for PPIs and specific protein functions. In terms of both structure and function, the protein domain can represent a crucial middle scale for the PPI hierarchy. However, to our knowledge, true (native) domain annotations are not easily available and predicted ones are usually retrieved from computational tools, which inevitably leads to data unreliability. If we employ the domain scale as a separate view, data unreliability may spread to other views and impair the entire hierarchical model. On this basis, we recommend using domain annotations as supervised information at the residue level. Precisely, a well-designed regularization is required to guarantee that all functional sites discovered by HIGH-PPI belong to the prepared domain database. The domain regularization and the PPI prediction loss form a flexible trade-off of learning objectives, which can appropriately tolerate the domain annotation unreliability.

(3) The memory requirement grows with the number of views in a hierarchical graph. HIGH-PPI employs two views to form the hierarchical graph and treats amino acid residues as microscopic components of proteins. However, we did not further consider one more microscopic view where atoms, the components of residues, provide information for representing residues. It might be beneficial to introduce an atom-level view and develop a memory-efficient way of storing and processing explicit 3D atom-level information.

(4) In future work, model robustness can be further improved. Although our model outperforms the baselines in the robustness evaluation (see Supplementary Table 3), we observe that $\mathrm{FDR}_{\mathrm{pre}}$ is most impacted by unreliable data, mostly because the number of false positives (FP) increases significantly (up to 6 times) from Data 1 to Data 9. A possible explanation for the significant rise in FP is that the model’s “low demand” for a positive sample permits certain controversial samples to be predicted as true. To address this issue, we recommend that future work consider a straightforward method, the voting strategy, which uses the voting outcomes of various independent classifiers to identify true PPIs. Independence makes it unlikely for voting classifiers to commit the same errors. A test pair can only be predicted as true if it is approved by most voting classifiers, which makes the model more demanding for the PPI presence.
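As a rough illustration of the voting strategy suggested in item (4), the sketch below accepts a test pair only when a strict majority of classifiers approve it. How the independent classifiers would be trained is left open in the text, so the boolean votes here are hypothetical inputs.

```python
def majority_vote(votes):
    """Return True only if more than half of the classifiers vote True."""
    return sum(bool(v) for v in votes) > len(votes) / 2
```

With three classifiers, a pair predicted as interacting by two of them is accepted, whereas a tie among an even number of classifiers is rejected, which keeps the rule conservative about false positives.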

Methods

Construction of a hierarchical graph

We denote a set of amino acid residues in a protein as ${Prot}=\{{r}_{1},{r}_{2},\ldots,{r}_{n}\}$. Each residue is described with $\theta$ kinds of physicochemical properties. For the bottom inside-of-protein view, a protein graph ${g}_{b}=({V}_{b},{A}_{b},{X}_{b})$ is constructed to model the relationship between residues in ${Prot}$, where ${V}_{b}\subseteq {Prot}$ is the set of nodes, ${A}_{b}$ is an $n\times n$ adjacency matrix representing the connectivity in ${g}_{b}$, and ${X}_{b}\in {{\mathbb{R}}}^{n\times \theta }$ is a feature matrix containing the properties of all residues.

For the top outside-of-protein view, a set of $m$ protein graphs can be interconnected within a PPI graph ${g}_{t}$, in which each protein graph is a node, denoted as ${g}_{b}\in {V}_{t}$. The connectivity (i.e., interactions) between protein graphs can be denoted as an $m\times m$ adjacency matrix ${A}_{t}$. In addition, ${X}_{t}\in {\mathbb{R}}^{m\times {d}_{2}}$ represents a feature matrix containing the ${d}_{2}$-dimensional representations of all proteins. We model the protein graphs and their connections as a hierarchical graph, in which four key variables (i.e., ${A}_{b}$, ${X}_{b}$, ${A}_{t}$, ${X}_{t}$) need to be clarified.

(1) The adjacency matrix ${A}_{b}\in \{0,1\}^{n\times n}$ of the protein graph is exactly the protein contact map. Contact maps are obtained from the atomic-level 3D coordinates of proteins. First, we retrieve the native protein structures from the Protein Data Bank⁶⁰ and protein structures of various RMSD scores from AlphaFold⁴³. Then we represent the location of each residue by the 3D coordinate of its ${C}_{\alpha }$ atom. The presence or absence of contact between a pair of residues is decided by their ${C}_{\alpha }-{C}_{\alpha }$ physical distance. We perform a sensitivity analysis (see Supplementary Fig. 8) and find that our model produces similar results when trained on contact maps with cutoff distances ranging from 9 Å to 12 Å. Finally, we choose the cutoff distance of 10 Å, at which our model performance peaks.

(2) For a feature matrix ${X}_{b}$, each row represents a set of properties for one amino acid residue. In this work, seven residue-level properties are considered (i.e., $\theta=7$): isoelectric point, polarity, acidity and alkalinity, hydrogen bond acceptor, hydrogen bond donor, octanol-water partition coefficient, and topological polar surface area. Supplementary Data File 3 contains quantitative values of the seven types of properties for each amino acid. All properties can be easily retrieved from the RDKit repository⁶¹.

(3) The PPI network structure determines the adjacency matrix ${A}_{t}\in \{0,1\}^{m\times m}$, in which the element in the $i$-th row and $j$-th column is 1 if the $i$-th and $j$-th proteins interact and 0 otherwise.

(4) The $i$-th row of the feature matrix ${X}_{t}$ is the representation vector of the $i$-th protein graph ${g}_{b}$.
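The contact-map construction in item (1) can be sketched as follows. This is an illustrative sketch rather than the authors' preprocessing code: the function name is hypothetical, the coordinates are assumed to be already extracted from a structure file, and treating a distance exactly equal to the cutoff as a contact is an assumption.

```python
import math


def contact_map(ca_coords, cutoff=10.0):
    """Binary contact map from C-alpha coordinates given in angstroms.

    Residues i and j are in contact when their C-alpha atoms are at most
    `cutoff` apart. The diagonal is left at 0 here; self-loops are added
    separately as (A_b + I_n) in the GCN update.
    """
    n = len(ca_coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


# Three made-up residues: the first two are close, the third is far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_coords)
```

The resulting matrix is symmetric by construction, which matches the undirected protein graph described above.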

BGNN for learning protein representations

We use the bottom view graph neural networks (BGNN) to learn protein representations. Graph convolutional networks (GCNs) have shown great effectiveness for relational data and are suitable for learning graph-structured protein representations. Thus, we propose BGNN based on GCNs.

Given the adjacency matrix ${A}_{b}\in \{0,1\}^{n\times n}$ and the feature matrix ${X}_{b}\in {\mathbb{R}}^{n\times \theta }$ of an arbitrary protein graph ${g}_{b}$, BGNN outputs the residue-level representations in the first GCN block, ${H}^{(1)}\in {\mathbb{R}}^{n\times {d}_{1}}$:

$${H}^{(1)}={GCN}\left({A}_{b},{X}_{b}\right)$$

(1)

where ${d}_{1}$ is the embedding dimension for the first GCN layer.

Formally, we update residue representations with the neighbor aggregations based on the work of Kipf and Welling³⁶:

$${H}^{(1)}=\mathrm{BN}\left(\mathrm{ReLU}\left({\widetilde{D}}^{-1/2}\left({A}_{b}+{I}_{n}\right){\widetilde{D}}^{-1/2}{X}_{b}{W}^{(1)}\right)\right)$$

(2)

where ${I}_{n}\in {\mathbb{R}}^{n\times n}$ is the identity matrix, $\widetilde{D}\in {\mathbb{R}}^{n\times n}$ is the diagonal degree matrix with entries ${\widetilde{D}}_{ii}={\sum }_{j}{\left({A}_{b}+{I}_{n}\right)}_{ij}$, ${W}^{(1)}\in {\mathbb{R}}^{\theta \times {d}_{1}}$ is a learnable weight matrix for the GCN layer, and ReLU and BN denote the ReLU activation function and batch normalization, respectively.
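The propagation rule in Eq. (2) can be traced on a tiny example with plain Python lists. This sketch covers only the symmetric normalization, the linear map and the ReLU; batch normalization is deliberately omitted, and the helper names are hypothetical rather than taken from the released implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, features, weight):
    """One GCN block without batch normalization: ReLU(A_norm X W)."""
    propagated = matmul(matmul(normalized_adjacency(adj), features), weight)
    return [[max(0.0, value) for value in row] for row in propagated]


# Two residues in contact, one input feature, two output channels.
toy_adj = [[0, 1], [1, 0]]
toy_features = [[1.0], [3.0]]
toy_weight = [[1.0, -1.0]]
toy_hidden = gcn_layer(toy_adj, toy_features, toy_weight)
```

In the toy case each residue averages its own feature with its neighbour's, so both rows become 2.0 in the first channel, and the negative second channel is clipped to zero by the ReLU.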

With the learnable weight matrix ${W}^{(2)}\in {\mathbb{R}}^{{d}_{1}\times {d}_{2}}$, the second GCN block produces the output ${H}^{(2)}\in {\mathbb{R}}^{n\times {d}_{2}}$:

$${H}^{(2)}=\mathrm{BN}\left(\mathrm{ReLU}\left({\widetilde{D}}^{-1/2}\left({A}_{b}+{I}_{n}\right){\widetilde{D}}^{-1/2}{H}^{(1)}{W}^{(2)}\right)\right)$$

(3)

Finally, we perform the readout operation with a self-attention graph pooling layer³⁹ and average aggregation to obtain a fixed-size representation of the entire graph, $x\in {\mathbb{R}}^{1\times {d}_{2}}$. To clarify, we use ${x}_{i}\in {\mathbb{R}}^{1\times {d}_{2}}$ to represent the final representation for the $i$-th protein graph.

TGNN for learning PPI network information

We use the top view graph neural networks (TGNN) to learn PPI network information. We are inspired by the graph isomorphism network (GIN)³⁷, which has superb expressive power for capturing graph structures. Formally, we are given the PPI graph ${g}_{t}=({V}_{t},{A}_{t},{X}_{t})$, where ${X}_{t}\in {\mathbb{R}}^{m\times {d}_{2}}$ is defined as the feature matrix whose row vector is a final protein representation from BGNN (i.e., ${X}_{t}^{\left[i,:\right]}={x}_{i}$, $i=1,2,\ldots,m$). TGNN updates the representation of protein $v$ in the $k$-th GIN block:

$${x}_{v}^{(k)}=\mathrm{BN}\left(\mathrm{ReLU}\left({\mathrm{MLP}}^{(k)}\left(\left(1+\epsilon \right)\cdot {x}_{v}^{(k-1)}+{\sum }_{u\in \mathscr{N}(v)}{x}_{u}^{(k-1)}\right)\right)\right)$$

(4)

where ${x}_{v}^{(k)}$ denotes the representation of protein $v$ after the $k$-th GIN block, $\mathscr{N}(v)$ is the set of proteins adjacent to $v$, and $\epsilon$ is a learnable parameter. We denote the inputs of protein representations for the first GIN block as ${x}_{i}^{(0)}={x}_{i}$, $i=1,2,\ldots,m$.
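The inner term of Eq. (4), the sum-based neighbourhood aggregation that is fed into the MLP, can be sketched as below. The MLP, ReLU and batch normalization are left out, $\epsilon$ is passed as a fixed number rather than learned, and the protein identifiers are placeholders.

```python
def gin_aggregate(features, neighbors, eps=0.0):
    """Return (1 + eps) * x_v plus the sum of neighbour vectors for every node v.

    `features` maps a protein id to its vector; `neighbors` maps a protein id
    to an iterable of adjacent protein ids.
    """
    aggregated = {}
    for v, x_v in features.items():
        total = [(1.0 + eps) * value for value in x_v]
        for u in neighbors.get(v, ()):
            total = [a + b for a, b in zip(total, features[u])]
        aggregated[v] = total
    return aggregated


toy_features = {"p1": [1.0, 0.0], "p2": [0.0, 2.0], "p3": [1.0, 1.0]}
toy_neighbors = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p1"]}
toy_aggregated = gin_aggregate(toy_features, toy_neighbors, eps=0.5)
```

Because the neighbour contributions are summed rather than averaged, nodes with different degrees receive aggregates of different magnitude, which is consistent with the emphasis on degree recovery in the top view.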

After three GIN blocks, TGNN produces representations for all proteins. For an arbitrary query pair containing the $i$-th and $j$-th proteins, we use the concatenation operation to combine the representations ${x}_{i}^{(3)}$ and ${x}_{j}^{(3)}$. A fully connected layer (FC) is employed as the classifier. The final vector ${\hat{y}}_{ij}\in {\mathbb{R}}^{1\times c}$ for the presence probability of PPI is denoted as ${\hat{y}}_{ij}=\mathrm{FC}\left({x}_{i}^{(3)}\parallel {x}_{j}^{(3)}\right)$, where $c$ denotes the total number of PPI types involved and $\parallel$ denotes the concatenation operation.

Model training details

Given a training set ${\mathscr{X}}_{train}$ and ground truth labels for multi-type PPIs ${\mathscr{Y}}_{train}$, we train BGNN and TGNN in an end-to-end manner by minimizing the loss function of multi-task binary cross-entropy:

$$\mathcal{L}\left(\Theta \right)=\sum\limits_{k=0}^{c}\left(\sum\limits_{{x}_{ij}\in {\mathscr{X}}_{train}}-{y}_{ij}^{k}\log {\hat{y}}_{ij}^{k}-\left(1-{y}_{ij}^{k}\right)\log \left(1-{\hat{y}}_{ij}^{k}\right)\right)$$

(5)

where $\Theta$ is the set of all learnable parameters, ${y}_{ij}^{k}$ denotes the ground truth of the $k$-th type PPI of the $i$-th and $j$-th proteins, and ${\hat{y}}_{ij}^{k}$ is the corresponding predicted probability.
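A minimal sketch of the summed multi-label binary cross-entropy in Eq. (5) is given below. It assumes that labels and predicted probabilities are supplied as one list per protein pair with one entry per PPI type; the clipping constant `tiny` is an addition for numerical safety and is not part of the equation.

```python
import math


def multilabel_bce(y_true, y_pred, tiny=1e-12):
    """Sum of binary cross-entropy terms over all pairs and all PPI types."""
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        for y, p in zip(labels, probs):
            p = min(max(p, tiny), 1.0 - tiny)  # keep log() finite
            total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total
```

Each PPI type is treated as an independent binary task, so a single protein pair can be positive for several interaction types at once.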

We determine all the hyper-parameters through a grid search based on a 5-fold cross-validation. For BGNN, we set the output dimension ${d}_{1}$, ${d}_{2}$ of weight matrix to 128. For each GIN block in TGNN, we use a two-layer MLP and set the output dimension of each layer to 64. As the STRING dataset contains seven types of PPIs, we set the output dimension of the FC layer to $c=7$. We use the Adam optimizer with a learning rate ${lr}=0.001$, ${\beta }_{1}=0.99$, ${\beta }_{2}=0.99$, a batch size of 128, and the default epoch number of 500. We train all of the model parameters until convergence in each cross-validation.

Residue importance computation

We employ the method called GNNExplainer⁶² to generate explanations for HIGH-PPI. By taking the well-trained GNN model and its predictions as inputs, GNNExplainer returns the most important subgraph by maximizing the mutual information ${MI}$ between the model prediction and possible subgraphs. Motivated by this, we directly formalize the notion of subgraph importance using ${MI}$ and further compute the importance of all nodes (i.e., residues).

Given protein graphs ${G}_{1}$ and ${G}_{2}$ that connect in the PPI network, our goal is to identify the node importance of ${G}_{1}$. According to GNNExplainer, once a random subgraph ${G}_{s}\subseteq {G}_{1}$ is sampled, we obtain the overall importance of ${G}_{s}$ as follows:

$${{MI}}_{s}\left(Y,{G}_{s}\right)=H\left(Y\right)-H\left(Y|G={G}_{s}\right)$$

(6)

where ${{MI}}_{s}$ represents importance of ${G}_{s}$, $Y$ is a variable indicating the probability of PPI presence of ${G}_{1}$ and ${G}_{2}$, and $H\left(\bullet \right)$ is the entropy term.

Assuming that all nodes in the subgraph ${G}_{s}$ contribute equally to the ${MI}$ value, we obtain the batch importance for each node in ${G}_{s}$. The final importance score for a specific node is the average of all its batch importance scores. For example, if a node $v$ contributes 0.4 and 0.6 for two sampled subgraphs respectively, the final importance of node $v$ is 0.5. To facilitate comparison, we compute the z-scores of final residue importance for standardization:

$${z}_{s}=\frac{{z}_{f}-\mu }{\sigma }$$

(7)

where ${z}_{f}\in {{\mathbb{R}}}^{1\times n}$ is the final importance vector for all residues, $\mu$ is the mean of ${z}_{f}$, $\sigma$ is the standard deviation of ${z}_{f}$, and ${z}_{s}\in {{\mathbb{R}}}^{1\times n}$ is the z-score importance after standardization.
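The averaging step and Eq. (7) can be sketched as follows. Several choices here are assumptions rather than details given in the text: each node receives an equal share ${MI}_{s}/|{G}_{s}|$ of its subgraph's score, nodes that are never sampled receive 0, and $\sigma$ is the population standard deviation. The function names `node_importance` and `z_scores` are hypothetical.

```python
import math
from collections import defaultdict


def node_importance(samples, n_nodes):
    """Average per-node batch importance over sampled subgraphs.

    samples: list of (node_ids, mi_value) pairs, one per sampled subgraph.
    Assumption: each node receives an equal share of the subgraph's MI value.
    """
    batch_scores = defaultdict(list)
    for node_ids, mi_value in samples:
        share = mi_value / len(node_ids)
        for v in node_ids:
            batch_scores[v].append(share)
    # Assumption: nodes that never appear in a sample get importance 0.
    return [
        sum(batch_scores[v]) / len(batch_scores[v]) if v in batch_scores else 0.0
        for v in range(n_nodes)
    ]


def z_scores(values):
    """Standardize values as (z_f - mu) / sigma using the population SD."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    if sigma == 0.0:
        raise ValueError("all importance values are identical")
    return [(x - mu) / sigma for x in values]


# Node 0 receives 0.4 from the first subgraph and 0.6 from the second,
# reproducing the worked example in the text (final importance 0.5).
samples = [([0, 1], 0.8), ([0, 2, 3], 1.8)]
final_importance = node_importance(samples, n_nodes=4)
standardized = z_scores(final_importance)
```

After standardization the scores have zero mean and unit variance, so residues can be compared across proteins of different sizes.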

Statistics and reproducibility

As indicated in figure legends, data in bar charts are represented as mean $\pm$ standard deviation (SD). For all boxplots, the center line represents the median, upper and lower edges represent the interquartile range, and the whiskers represent 0.5× interquartile range. The statistical significance between the two groups was obtained by a two-sided t-test with P-value < 0.05 considered significant.
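The summary statistics described above can be computed with a short sketch using only the Python standard library. The use of the sample standard deviation, the inclusive quartile method, and the names `bar_summary` and `box_summary` are assumptions; the text does not state which estimators were used. The whisker factor defaults to the 0.5 stated above.

```python
from statistics import mean, median, quantiles, stdev


def bar_summary(values):
    """Return (mean, SD) as used for bar charts; SD is the sample SD here."""
    return mean(values), stdev(values)


def box_summary(values, whisker_factor=0.5):
    """Return the median, quartiles, and whisker limits for a boxplot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "lower_whisker_limit": q1 - whisker_factor * iqr,
        "upper_whisker_limit": q3 + whisker_factor * iqr,
    }


scores = [1, 2, 3, 4, 5]
avg, sd = bar_summary(scores)
box = box_summary(scores)
```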

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The PPI and protein data used in this study are available in the Zenodo database under “Accession Code 7213401”. They are obtained from the following publicly available databases. Datasets containing protein sequences and their interaction annotations are obtained from https://github.com/muhaochen/seq_ppi. The native protein structures are obtained from PDB: https://www.rcsb.org/. Protein structures with errors are obtained from AlphaFold: https://alphafold.ebi.ac.uk/. The catalytic site information of proteins can be found at CSA: https://www.ebi.ac.uk/thornton-srv/m-csa/. The ground truth of binding site information is obtained from PDBePISA: https://www.ebi.ac.uk/pdbe/pisa/. All other relevant data supporting the key findings of this study are available within the article and its Supplementary Information files or from the corresponding author upon reasonable request. Source data are provided with this paper.

Code availability

An open-source software implementation of HIGH-PPI is available at https://github.com/zqgao22/HIGH-PPI. The source code can be cited by using https://doi.org/10.5281/zenodo.7600622.

References

Petta, I. et al. Modulation of protein–protein interactions for the development of novel therapeutics. Mol. Ther. 24, 707–718 (2016).
Skrabanek, L., Saini, H. K., Bader, G. D. & Enright, A. J. Computational prediction of protein–protein interactions. Mol. Biotechnol. 38, 1–17 (2008).
Hope, K. J., Jin, L. & Dick, J. E. Acute myeloid leukemia originates from a hierarchy of leukemic stem cell classes that differ in self-renewal capacity. Nat. Immunol. 5, 738–743 (2004).
Zeng, A. G. et al. A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia. Nat. Med. 28, 1–12 (2022).
Couturier, C. P. et al. Single-cell RNA-seq reveals that glioblastoma recapitulates a normal neurodevelopmental hierarchy. Nat. Commun. 11, 1–19 (2020).
Engelberg, K., Bechtel, T., Michaud, C., Weerapana, E. & Gubbels, M. J. Proteomic characterization of the Toxoplasma gondii cytokinesis machinery portrays an expanded hierarchy of its assembly and function. Nat. Commun. 13, 1–15 (2022).
Wigbers, M. C. et al. A hierarchy of protein patterns robustly decodes cell shape information. Nat. Phys. 17, 578–584 (2021).
Ho, T. S. Y. et al. A hierarchy of ankyrin-spectrin complexes clusters sodium channels at nodes of Ranvier. Nat. Neurosci. 17, 1664–1672 (2014).
Siegle, J. H. et al. Survey of spiking in the mouse visual system reveals functional hierarchy. Nature 592, 86–92 (2021).
Hendrikx, E., Paul, J. M., van Ackooij, M., van der Stoep, N. & Harvey, B. M. Visual timing-tuned responses in human association cortices and response dynamics in early visual cortex. Nat. Commun. 13, 1–19 (2022).
Zhang, Y. et al. A system hierarchy for brain-inspired computing. Nature 586, 378–384 (2020).
Guharoy, M., Lazar, T., Macossay-Castillo, M. & Tompa, P. Degron masking outlines degronons, co-degrading functional modules in the proteome. Commun. Biol. 5, 1–15 (2022).
Wu, C. H. et al. Identification of lncRNA functions in lung cancer based on associated protein-protein interaction modules. Sci. Rep. 6, 1–11 (2016).
Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).
Hein, M. Y. et al. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell 163, 712–723 (2015).
Huttlin, E. L. et al. Architecture of the human interactome defines protein communities and disease networks. Nature 545, 505–509 (2017).
Kaboord, B. & Perr, M. Isolation of proteins and protein complexes by immunoprecipitation. Methods Mol. Biol. 424, 349–364 (2008).
Aronheim, A., Zandi, E., Hennemann, H., Elledge, S. J. & Karin, M. Isolation of an AP-1 repressor by a novel method for detecting protein-protein interactions. Mol. Cell. Biol. 17, 3094–3102 (1997).
Su, J. F., Huang, Z., Yuan, X. Y., Wang, X. Y. & Li, M. Structure and properties of carboxymethyl cellulose/soy protein isolate blend edible films crosslinked by Maillard reactions. Carbohydr. Polym. 79, 145–153 (2010).
Zhao, L., Wang, J., Hu, Y. & Cheng, L. Conjoint feature representation of GO and protein sequence for PPI prediction based on an inception RNN attention network. Mol. Ther.-Nucleic Acids 22, 198–208 (2020).
Renaud, N. et al. DeepRank: a deep learning framework for data mining 3D protein-protein interfaces. Nat. Commun. 12, 1–8 (2021).
Kovács, I. A. et al. Network-based prediction of protein interactions. Nat. Commun. 10, 1–8 (2019).
Nasiri, E., Berahmand, K., Rostami, M. & Dabiri, M. A novel link prediction algorithm for protein-protein interaction networks by attributed graph embedding. Computers Biol. Med. 137, 104772 (2021).
Lv, G., Hu, Z., Bi, Y. & Zhang, S. Learning unknown from correlations: graph neural network for inter-novel-protein interaction prediction. In 30th International Joint Conference on Artificial Intelligence (IJCAI). https://doi.org/10.48550/arXiv.2105.06709 (2021).
Kulmanov, M., Khan, M. A. & Hoehndorf, R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34, 660–668 (2018).
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314 (2019).
Hsieh, Y. L., Chang, Y. C., Chang, N. W. & Hsu, W. L. In Proc. 8th International Joint Conference on Natural Language Processing Vol. 2 (Short Papers) 240–245 (Asian Federation of Natural Language Processing, 2017).
Saha, S. & Raghava, G. P. S. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins: Struct., Funct., Bioinforma. 65, 40–48 (2006).
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 1–14 (2021).
Jiménez, J. et al. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics 33, 3036–3042 (2017).
Amidi, A. et al. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation. PeerJ 6, e4750 (2018).
Tubiana, J., Schneidman-Duhovny, D. & Wolfson, H. J. ScanNet: An interpretable geometric deep learning model for structure-based protein binding site prediction. Nat. Methods 19, 1–10 (2022).
Goldberg, D. S. & Roth, F. P. Assessing experimentally derived interactions in a small world. Proc. Natl Acad. Sci. 100, 4372–4376 (2003).
Fouss, F., Pirotte, A., Renders, J. M. & Saerens, M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Trans. Knowl. Data Eng. 19, 355–369 (2007).
Tong, H., Faloutsos, C. & Pan, J. Y. 6th International Conference on Data Mining (ICDM) (IEEE, 2006).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR). https://doi.org/10.48550/arXiv.1609.02907 (2017).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations (ICLR). https://doi.org/10.48550/arXiv.1810.00826 (2019).
Szklarczyk, D. et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
Lee, J., Lee, I. & Kang, J. Self-attention graph pooling. In 36th International Conference on Machine Learning (ICML). https://doi.org/10.48550/arXiv.1904.08082 (2019).
Zheng, S., Li, Y., Chen, S., Xu, J. & Yang, Y. Predicting drug–protein interaction using quasi-visual question answering system. Nat. Mach. Intell. 2, 134–140 (2020).
Wong, L., You, Z. H., Li, S., Huang, Y. A. & Liu, G. Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. Adv. Intell. Syst. Comput. https://doi.org/10.1007/978-3-319-22053-6_75 (2015).
Park, Y. & Marcotte, E. M. Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9, 1134–1136 (2012).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Yang, C., Rangarajan, A. & Ranka, S. Visual explanations from deep 3D convolutional neural networks for Alzheimer’s disease classification. AMIA Annu. Symp. Proc. 2018, 1571–1580 (2018).
Ming, Y. et al. Understanding hidden memories of recurrent neural networks. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST). https://doi.org/10.48550/arXiv.1710.10777 (2017).
Fernandes, J. & Gattass, C. R. Topological polar surface area defines substrate transport by multidrug resistance associated protein 1 (MRP1/ABCC1). J. Med. Chem. 52, 1214–1218 (2009).
Hu, Z., Ma, B., Wolfson, H. & Nussinov, R. Conservation of polar residues as hot spots at protein interfaces. Proteins: Struct., Funct., Bioinforma. 39, 331–342 (2000).
Young, L., Jernigan, R. & Covell, D. A role for surface hydrophobicity in protein-protein recognition. Protein Sci. 3, 717–729 (1994).
Korn, A. P. & Burnett, R. M. Distribution and complementarity of hydropathy in multisubunit proteins. Proteins: Struct., Funct., Bioinforma. 9, 37–55 (1991).
Katz, L. A new status index derived from sociometric analysis. Psychometrika 18, 39–43 (1953).
Zhou, T., Lv, L. & Zhang, Y. C. Predicting missing links via local information. Eur. Phys. J. B 71, 623–630 (2009).
Barabási, A. L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
Jeh, G. & Widom, J. Proc. 8th International Conference on Knowledge Discovery and Data Mining (ACM, 2002).
De Meo, P., Ferrara, E., Fiumara, G. & Provetti, A. Generalized Louvain method for community detection in large networks. In 11th International Conference on Intelligent Systems Design and Applications (ISDA). https://doi.org/10.1109/ISDA.2011.6121636 (2011).
Krissinel, E. & Henrick, K. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol. 372, 774–797 (2007).
Porter, C. T., Bartlett, G. J. & Thornton, J. M. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 32, D129–D133 (2004).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1007/s11263-019-01228-7 (2017).
Li, J. et al. Semi-supervised graph classification: a hierarchical graph perspective. In 2019 The World Wide Web Conference (WWW). https://doi.org/10.48550/arXiv.1904.05003 (2019).
Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M. & Bairoch, A. UniProtKB/Swiss-Prot 89–112 (Humana Press, 2007).
Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
Landrum, G., Tosco, P. & Kelley, B. rdkit/rdkit: 2021_09_4 (Q3 2021) 351 Release. https://zenodo.org/record/5835217#.Y_JocB9Bzcs (2022).
Ying, Z., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. GNNExplainer: Generating explanations for graph neural networks. In 33rd Advances in Neural Information Processing Systems (NeurIPS). https://doi.org/10.48550/arXiv.1903.03894 (2019).

Acknowledgements

The research of Li was supported by National Natural Science Foundation of China (Grant No. 62206067), Tencent AI Lab Rhino-Bird Focused Research Program RBFR2022008 and Guangzhou-HKUST(GZ) Joint Funding Scheme. The research of Huang was supported by the National Natural Science Foundation of China (Grant No. 21825101).

Author information

Authors and Affiliations

Data Science and Analytics, The Hong Kong University of Science and Technology, Guangzhou, 511400, China
Ziqi Gao, Jiawen Zhang & Jia Li
Division of Emerging Interdisciplinary Areas, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Ziqi Gao & Jia Li
Pingshan Translational Medicine Center, Shenzhen Bay Laboratory, Shenzhen, 518118, China
Chenran Jiang
The Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital), Chinese Academy of Sciences, Hangzhou, 310022, China
Xiaosen Jiang & Huanming Yang
AI Lab, Tencent, Shenzhen, 518000, China
Lanqing Li & Peilin Zhao
Department of Chemistry, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Yong Huang

Authors

Ziqi Gao
Chenran Jiang
Jiawen Zhang
Xiaosen Jiang
Lanqing Li
Peilin Zhao
Huanming Yang
Yong Huang
Jia Li

Contributions

Z.G. and C.J. wrote the first draft of the manuscript. J.L., Y.H., and L.L. revised the manuscript to the submitted version. J.L., Z.G., Y.H., and H.Y. conceived the study. Z.G. designed all the experiments and wrote the codebase of HIGH-PPI. Z.G., J.Z., and X.J. conducted the benchmarks and ran all of the analyses. X.J. collected and preprocessed protein contact maps. Z.G., L.L., and P.Z. contributed to data analysis and model discussion. J.Z. designed the figure for the overall framework. Z.G., C.J., J.Z., and X.J. completed the visualizations. J.L. and Y.H. supervised the research. All of the authors reviewed the manuscript and approved it for submission.

Corresponding authors

Correspondence to Yong Huang or Jia Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Pufeng Du and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Supplementary Data

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gao, Z., Jiang, C., Zhang, J. et al. Hierarchical graph learning for protein–protein interaction. Nat Commun 14, 1093 (2023). https://doi.org/10.1038/s41467-023-36736-1

Received: 18 October 2022
Accepted: 14 February 2023
Published: 25 February 2023
DOI: https://doi.org/10.1038/s41467-023-36736-1
Springer Nature Limited
