Introduction

Biological functions are accomplished by interactions and chemical reactions among biomolecules. Among them, protein–protein interactions (PPIs) are arguably one of the most important molecular events in the human body and are an important source of therapeutic interventions against diseases. A comprehensive dictionary of PPIs can help connect the dots in complicated biological pathways and expedite the development of therapeutics1,2. In biology, hierarchy information has been widely exploited to gain in-depth information about phenotypes of interest, for example, in disease biology3,4,5, proteomics6,7,8, and neurobiology9,10,11. Naturally, PPIs encapsulate a two-view hierarchy: in the top view, proteins interact with each other; in the bottom view, key amino acids or residues assemble to form important local domains. Following this logic, biologists often take hierarchical approaches to understand PPIs12,13. Experimentally, scientists often employ high-throughput mapping14,15,16 to pre-build the PPI network at scale, and use bioinformatics clustering methods to identify functional modules of the network (top view). At the individual protein level, isolation methods such as co-immunoprecipitation17, pull-down18, and crosslinking19 are used to establish the structures of individual proteins, so that surface ‘hotspots’ can be located and analyzed. In short, hierarchical knowledge of structure information is important for understanding the molecular details of PPIs.

More recently, the massive growth in demand for, and the cost of, experimentally validating PPIs has made it impossible to characterize most unknown PPIs in wet laboratories. To map out the human interactome efficiently and inexpensively, computational methods are increasingly being used to predict PPIs automatically. Over the past decade, Deep Learning (DL) methods, among the most revolutionary tools in computation, have been applied to study PPIs. Development in this field has mostly focused on two aspects: learning appropriate protein representations20,21 and inferring potential PPIs by link prediction22,23. The former focuses on extracting structural information from protein sequences. In particular, Convolutional Neural Networks (CNNs) […] common neighbors (CN)33 assigns high probabilities of PPI to protein pairs that are known to share common PPI partners. CN can be generalized to consider neighbors at a greater path length (L3)22, which captures the structural and evolutionary forces that govern biological networks such as the interactome. Additionally, distance-based methods measure the possible distances between protein pairs, such as Euclidean commute time (ECT)34 and random walk with restart (RWR)35. Most traditional link prediction methods focus on known interactions but tend to overlook important network properties such as node degrees and community partitions.

More importantly, these methods perceive only one of the two views, outside-of-protein or inside-of-protein. Few can model the natural PPI hierarchy by connecting both views. To address this issue, we present a hierarchical graph that applies two Graph Neural Networks (GNNs) […]

Second, we examine the model's tolerance when tested with low-quality structure data (see Fig. 3b). This reflects realistic scenarios, where native structure information is not always available for predicting PPIs. We prefer a model whose performance is not seriously limited by structure quality and that is robust to inputs taken directly from computational models (e.g., AlphaFold43). We evaluate the quality of an input protein structure by calculating the root-mean-square deviation (RMSD) between the native structure and the input. Native protein structures (RMSD = 0) are retrieved from the PDB database at the highest resolutions. We compute the best-F1 scores (box plots) of our method on a set of AlphaFold structures with various RMSDs (0.80, 1.59, 2.39, 3.19, 5.36, 7.98), and show the average result of the second-best method (GNN-PPI) as a blue dotted line. As can be seen, our model's performance is always better than that of GNN-PPI, even with RMSD up to 8. The comparison with the 3D CNN model21 further demonstrates the denoising ability of the hierarchical graph with respect to protein structure errors (Supplementary Fig. 4a). In short, our model's performance is not significantly affected by structure errors where powerful pre-trained features are not available.
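As an illustration of the structure-quality measure used above, the following minimal sketch computes the RMSD between two coordinate sets. It assumes that the two structures have already been superimposed and that atoms are listed in the same order; the text does not state how superposition is handled, so this is an assumption, and the function name `rmsd` and the toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.

    Assumption: both structures are already superimposed and the
    atoms appear in the same order in both lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    # Sum of squared per-atom displacements over all three axes.
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))

# Toy coordinates (illustrative only, not taken from any real structure).
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]

print(rmsd(native, native))  # identical structures give RMSD = 0
print(rmsd(native, model))
```

A native structure compared with itself gives RMSD = 0, which matches the convention used for the PDB structures in this experiment.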

Further, to interpret the decisions made by RNN, CNN and GNN models, we conduct an experiment to explore their ability to capture protein functional sites. We apply the 3D-grad-CAM approach44 to the trained 3D CNN model named DeepRank21, and apply the RNNVis approach45 to the trained PIPR26 model with 3D information. All three methods identify more than one motif, of which we show only the most crucial site. Figure 3c displays the binding site for chain A of an isomerase protein (PDB id: 1BJP). The binding site is made up of four residues with the sequence numbers 6, 42, 43, and 44. As can be seen, whereas neither CNN nor RNN can identify the His-6 residue, our method can precisely identify the binding site by using graph motif search. It seems to be a challenge for the sequence models (i.e., RNN, CNN) to connect His-6 to the other residues, probably because of their weak connections in a sequence representation. Moreover, 3D CNN performs even worse than RNN, as it incorrectly classifies the non-essential Ile-41 residue.

For node features in protein graphs, we select seven important features from twelve easily available residue-level feature options (see Supplementary Table 4). The feature selection process (see Supplementary Method 1 for details) produces the optimal set of seven features, ensuring that our model peaks at both AUPR and best-F1 scores. Here, we list the seven selected residue-level physicochemical properties in Fig. 3d and discuss their importance for different types of PPIs, both to better interpret our model and to discover enlightening biomarkers for PPI interfaces. The importance of a feature is determined by the average z-score, which results from deleting each feature dimension and analyzing the changes in AUPR before and after. We choose a representative type (i.e., binding) to explain because it is the most prevalent in the STRING database. For this type, HIGH-PPI regards topological polar surface area (TPSA) and octanol-water partition coefficient (KOW) as dominant features. This finding supports the conventional wisdom that TPSA and KOW play a key role in drug transport processes46, protein interface recognition47,48, and PPI prediction49.
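The deletion-based importance score described above can be sketched as follows. This is a minimal sketch under stated assumptions: the exact normalization is not given in the text, so here the AUPR drop of each feature is standardized across features; the function name `ablation_z_scores`, the placeholder feature names, and all numbers are hypothetical and are not results from this work.

```python
from statistics import mean, pstdev

def ablation_z_scores(baseline_aupr, ablated_aupr):
    """Standardize the AUPR drop caused by deleting each feature.

    baseline_aupr: AUPR of the model with all features.
    ablated_aupr:  dict mapping feature name -> AUPR without that feature.
    Assumption: z-scores are taken across features (population std).
    """
    drops = {name: baseline_aupr - score for name, score in ablated_aupr.items()}
    mu = mean(drops.values())
    sigma = pstdev(drops.values())
    if sigma == 0:
        # All features cause the same drop, so none stands out.
        return {name: 0.0 for name in drops}
    return {name: (drop - mu) / sigma for name, drop in drops.items()}

# Made-up AUPR values for three placeholder features (illustrative only).
scores = ablation_z_scores(
    baseline_aupr=0.90,
    ablated_aupr={"feature_a": 0.80, "feature_b": 0.88, "feature_c": 0.89},
)
most_important = max(scores, key=scores.get)
print(most_important, scores)
```

In a full pipeline the scores would be averaged over repeated runs; a larger positive z-score means that deleting the feature hurts AUPR more than deleting a typical feature.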

Top outside-of-protein view improves the performance

We investigate the role of the top outside-of-protein view (TGNN) from three perspectives: (1) the importance of degree and community recovery for predicting network structures, (2) a comparison of TGNN with other leading link prediction methods, and (3) a real-life example showing the shortcomings of the leading link prediction methods.

Recently, various works have demonstrated the usefulness of structural properties of networks (e.g., degree, community) for predicting missing links. Inspired by this, HIGH-PPI aims to efficiently recover the degree and community partitions of the PPI network by utilizing the network topology. We show an empirical study in Fig. 4a to illustrate the impact of degree and community recovery on link prediction. We randomly select test results from the model trained for different numbers of epochs and calculate the negative Mean Absolute Error (-MAE) between the predicted degrees and the real degrees to represent degree recovery. Similarly, we quantify community recovery using the normalized mutual information (NMI). We observe a significant correlation (\(R=-0.66\)) between degree recovery and model performance (i.e., best-F1) as well as a high correlation (\(R=0.68\)) between community recovery and model performance, which means that better recovery of the degree and community of the PPI network implies better PPI prediction performance.
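The three quantities used in this study (MAE for degree recovery, NMI for community recovery, and the Pearson correlation R) can be sketched in plain Python as below. This is only a sketch: the degree MAE is computed per node on raw degrees, and the NMI is normalized by the arithmetic mean of the two entropies, both of which are assumptions because the text does not specify these details. The function names are hypothetical.

```python
import math
from collections import Counter

def degree_mae(true_deg, pred_deg):
    """Mean absolute error between true and predicted node degrees."""
    return sum(abs(true_deg[n] - pred_deg.get(n, 0)) for n in true_deg) / len(true_deg)

def nmi(labels_a, labels_b):
    """Normalized mutual information of two community assignments.

    Assumption: normalization by the arithmetic mean of the entropies.
    """
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(
        c / n * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )
    denom = (h_a + h_b) / 2
    return 1.0 if denom == 0 else mi / denom

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy inputs (illustrative only).
print(degree_mae({"a": 2, "b": 1}, {"a": 1, "b": 1}))
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

With one recovery value and one best-F1 value per sampled checkpoint, `pearson_r` applied to the two lists gives the kind of correlation reported in Fig. 4a.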

Fig. 4: Performance of top view GNN of HIGH-PPI to learn relational information in PPI network.

a Pearson correlations (R) between the prediction performance (Best-F1) and degree recovery (left) and community recovery (right). High recovery of the degree and community of the PPI network indicates better performance for PPI prediction. Degree recovery is quantified with the Mean Absolute Error (MAE) between the true and predicted degree distributions. Community recovery is quantified with the normalized mutual information (NMI) of true and predicted communities. The shaded area (error band) represents the 95% confidence interval. b Boxplots (center line, the median; upper and lower edges, the interquartile range; whiskers, \(0.5\times\) interquartile range) showing the Best-F1 distributions (5 runs with independent seeds) using various link prediction methods. Methods (green) that predict PPI networks with NMI < 0.7 and MAE > 0.35 significantly underperform the others (orange). c Left: An example showing a PPI network in which the area of each node represents its degree and only two external edges connect the two detected communities. Middle: Actual computed results showing how other link prediction methods generate mislinks as external edges, which may disrupt the community partitions. Right: Actual computed results showing the inability of other link prediction methods to recover degrees. Source data are provided as a Source Data file.

Second, we evaluate the performance of TGNN and leading link prediction methods using the PPI network structure as input. Our method (TGNN) takes interactions as edges and node degrees as node features. We compare HIGH-PPI with six heuristic methods and one DL-based method. The heuristic methods, which are simple yet effective and use heuristic node similarities as link likelihoods, include common neighbors (CN)33, Katz index (Katz)50, Adamic-Adar (AA)51, preferential attachment (PA)52, SimRank (SR)53 and paths of length three (L3)22. MLP_IP, a DL approach, learns node representations using a multilayer perceptron (MLP) and identifies node similarity via an inner product (IP) operation. We calculate the MAE and NMI values of the recovered networks and highlight those with a high capacity for recovery (NMI ≥ 0.7 and MAE ≤ 0.35) in orange. The results show that link prediction methods that are more adept at recovering network properties typically perform better. This gain validates our findings in Fig. 4a and highlights the need for TGNN in the top view. In addition, a comparison of MLP_IP and L3 shows that pairwise learning is insufficient to capture the network information well. Although L3 can capture the evolutionary principles of PPIs to some extent, our method beats L3 by better recovering the structure of the PPI network.
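For readers unfamiliar with the heuristic baselines, the sketch below scores a candidate (non-adjacent) protein pair with four of them using their standard textbook definitions: CN counts shared neighbors, AA weights each shared neighbor by 1/log(degree), PA multiplies the two degrees, and L3 sums 1/sqrt(k_u k_v) over paths of length three. The toy graph and helper names are hypothetical, and this is not the evaluation code used in this work.

```python
import math

def build_graph(edges):
    """Undirected graph as a dict mapping node -> set of neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def common_neighbors(g, x, y):
    return len(g[x] & g[y])

def adamic_adar(g, x, y):
    # A shared neighbor of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(g[w])) for w in g[x] & g[y])

def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])

def l3_score(g, x, y):
    """Degree-normalized count of paths x-u-v-y (intended for non-adjacent x, y)."""
    total = 0.0
    for u in g[x]:
        for v in g[u] & g[y]:
            total += 1.0 / math.sqrt(len(g[u]) * len(g[v]))
    return total

# Toy network (illustrative only): a triangle a-b-c with a pendant node d on c.
g = build_graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])

print(common_neighbors(g, "a", "d"))
print(adamic_adar(g, "a", "d"))
print(preferential_attachment(g, "a", "d"))
print(l3_score(g, "a", "d"))
```

All four scores depend only on the local neighborhood of the pair, which is one way to see why such methods may miss global properties such as community partitions.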

Third, we provide an example on an SHS27k sub-network. As can be seen, there exist two distinct communities connected by two inter-community edges. We use the original sub-network as input and find that non-TGNN link prediction methods (i.e., CN, Katz, SR, AA, PA) tend to give high scores to inter-community interactions. Interestingly, when we apply the Louvain community detection algorithm54 to the recovered structure, it cannot produce an accurate community partition because the abundant inter-community interactions disrupt the original community structure. To examine degree recovery ability, we randomly select 50% of the interactions as input and show each method’s degree recovery result for node KIF22 in Fig. 4c. We find that non-TGNN approaches cannot recover the links connecting the node KIF22 well, whereas the TGNN approach can. In short, these experiments demonstrate that the structural properties of the PPI network are not always reflected in traditional link prediction methods and, moreover, that capturing and learning the network structure in our top view improves the prediction performance.

HIGH-PPI accurately identifies key residues constituting functional sites

Typically, functional sites are spatially clustered sets of residues. They control protein functions and are thus important for PPI prediction. Because our proposed model can capture the spatial-biological arrangements of residues in the bottom view, this characteristic can be used to explain the model’s decisions. It is worth noting that HIGH-PPI can automatically learn residue importance without any residue-level annotations. In this section, we provide (1) a case study of predicting residue importance for the binding surface, (2) two cases of estimating residue importance for catalytic sites, and (3) a comparison of explanation precision in predicting binding sites.

First, a binding example between the query protein (PDB id: 2B6H-A) and its partner (PDB id: 2REY-A) is investigated. The ground truth binding surface is retrieved from the PDBePISA database55 and is colored in red in Fig. 5a. Subsequently, we apply the GNN explanation approach (see Section 4.5 in “Methods” for details) to the HIGH-PPI model. As can be seen from Fig. 5a, HIGH-PPI can accurately and automatically identify the residues belonging to the binding surface. Fig. 5c further indicates that our learned residue importance is quite close to the real profile. We show another six cases in which HIGH-PPI correctly identifies binding surfaces in Supplementary Fig. 7.

Fig. 5: Automatic explanation for residue importance without supervision.

a Top: Depiction of a protein complex (left, query protein, PDB id: 2B6H-A; right, interacting protein, PDB id: 2REY-A) modeled in surface representation. Residues on the binding surface of the query protein are highlighted in red (important) and others in blue (non-important). Bottom: Residue importance of the query protein learned by HIGH-PPI, with coloring ranging from low (blue) to high (red). Important regions are magnified to show the cartoon representation. b Depiction of two proteins (left, PDB id: 1S9I-A; right, PDB id: 1I0O-A) modeled in cartoon representation. Residues are colored to match the importance scores, with more important residues highlighted in red and unimportant ones in blue. Residues with catalytic functions that are correctly or incorrectly identified are highlighted in red and black, respectively. c Polylines showing the consistency of the highest peaks that represent the learned (gray) and real (red) functional regions for the binding interaction case shown in a. d Boxplots (center line, the median; upper and lower edges, the interquartile range; whiskers, \(0.5\times\) interquartile range) showing the explainable ability for binding PPIs, calculated as the overlap of real and learned functional regions (IoU, Intersection over Union) with 20 PPI pairs and their real interfaces retrieved from the STRING and PDBePISA databases, respectively. HIGH-PPI shows significantly greater explainable ability (two-sided t-test results: HIGH-PPI versus CNN ****\(P=4.4\times {10}^{-6}\), HIGH-PPI versus CNN (+3D) ****\(P=4.4\times {10}^{-8}\)). No information about residue importance was used to train our model. Source data are provided as a Source Data file.

Second, in order to evaluate the prediction of catalytic sites for PPIs, we utilize the same GNN explanation approach in our model. The ground truth catalytic site is retrieved from the Catalytic Site Atlas56 (CSA), a database for catalytic residue annotation for enzymes. We calculate the residue importance of catalytic sites for query proteins (PDB id: 1S9I-A, 1I0O-A). As seen in Fig. 5b, our proposed HIGH-PPI can correctly predict both residues for 1S9I-A and two out of three for 1I0O-A. We show another nine cases of HIGH-PPI for identifying catalytic sites in Supplementary Fig. 6, where a total of 25 out of 34 catalytic sites are correctly identified.

Additionally, we compare the model interpretability of the CNN, 3D CNN and HIGH-PPI models. We employ the CNN module in GNN-PPI24 and the 3D CNN module in DeepRank21, respectively. We apply the grad-CAM57 and 3D-grad-CAM44 approaches to determine residue importance for the CNN and 3D CNN models, respectively. We use the binding type PPIs from the STRING dataset as the training set, and randomly select 20 binding type PPIs as the test set. We use the ground truth from PDBePISA for each query protein and treat its residues with importance >0 as surface compositions. To gauge the precision of the surface prediction, intersection over union (IoU) is used, and the box plots of the IoU score distributions are shown in Fig. 5d. The results show that HIGH-PPI significantly outperforms the other models in terms of interpretability, with the smallest variance. In addition, 3D CNN outperforms CNN with a smaller variance, showing that 3D information supports the learning of reliable and generalized protein representations.
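The IoU measure used for this comparison can be sketched as below, treating the predicted and ground-truth interfaces as sets of residue indices. How learned importance scores are thresholded into a predicted residue set is not specified here, so the sketch starts from the sets themselves; the function name `interface_iou` and the residue indices are hypothetical.

```python
def interface_iou(predicted, truth):
    """Intersection over union of two collections of residue indices."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    if not union:
        # Convention chosen for this sketch: two empty sets give IoU = 0.
        return 0.0
    return len(predicted & truth) / len(union)

# Hypothetical residue indices (illustrative only).
predicted_residues = {10, 11, 12, 30}
true_residues = {11, 12, 13}

iou = interface_iou(predicted_residues, true_residues)
print(iou)  # 2 shared residues out of 5 distinct residues
```

An IoU of 1 means the learned functional region coincides exactly with the annotated interface, and 0 means the two regions share no residue.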

Protein functional site prediction sheds light on the model decisions and how to carry out additional experimental validations for PPI investigation. Excellent model interpretability also shows that our approach can accurately describe biological evidence for proteins.