
1 Introduction

Proteins adopt complex three-dimensional structures in order to carry out cellular functions. Many of these functions are carried out by larger assemblies of protein complexes and regulated through physical contacts between effectors and regulators. Understanding protein–protein interactions (PPIs) is fundamental to understanding cellular processes in healthy and diseased states, and their accurate prediction is a longstanding goal of computational biology. Predicting the interacting residues involved in PPIs is useful for constructing refined PPI networks, understanding the impact of mutations, improving the accuracy of protein–protein docking, and enriching the annotation of protein function [1]. Furthermore, predicting PPIs is desirable for structure-based drug discovery; PPI interfaces offer the potential for highly selective modulation of pathological processes [2].

Experimental methods for characterizing protein–protein interactions include yeast two-hybrid (Y2H) screens [3], mass spectrometry [4], tandem affinity purification [5], and protein chips [6]. However, these methods are time- and resource-intensive [7], and high false-positive rates are prevalent in larger experimental screens [8], limiting the scalability of protein–protein interaction characterization to the proteome level. High-quality structurally annotated databases characterizing PPI sites exist, such as BioLiP [9] (https://zhanglab.ccmb.med.umich.edu/BioLiP/), which collates structural interaction sites for a variety of protein interaction types. However, these databases characterize only a subset of extant proteins, and a significant proportion of the space remains unannotated and not structurally characterized. Furthermore, accurate prediction is made increasingly challenging by the promiscuity of protein interactors; a given protein may have multiple interaction partners over disparate or overlapping regions of its surface. Proteins with multiple binding partners may interact with their partners at different times or feature large interaction sites capable of engaging multiple partners simultaneously. Reporting of interactions may also be biased; it has been shown that the number of reported interactions for a given protein is correlated with its frequency of occurrence in the literature [10]. Efficient and reliable computational prediction of protein–protein interactions is therefore highly desirable, though challenging.

Existing methods for PPI site prediction broadly fall into three categories. Protein-protein docking methods seek to produce structures of the resulting protein complex, and typically produce a number of scored candidate structural models as output [11]. Structure-based methods seek to perform prediction of interaction sites by leveraging protein structural information [12]. Sequence-based methods perform predictions based on protein sequences and form the bulk of the existing body of work due to the relative abundance of protein sequence data. While docking and structure-based methods typically require structural data, sequence-based approaches benefit from greater availability of data. However, structural methods may be limited in their utility for applications involving intrinsically disordered proteins (IDPs) or regions (IDRs), which play important roles in facilitating some PPIs, often allowing concerted folding and binding of sequences with these regions, and are enriched in protein and nucleic acid binding proteins [13]. The difficulty in structurally elucidating IDPs and IDRs means that structural datasets are typically deficient in disorder-mediated interactions. It should be noted, however, that supervised sequence-based methods typically rely on datasets curated from structurally characterized protein–protein interactions (see Subheading 2.2 for details), where residues for which the solvent-accessible area decreases upon binding are considered interacting residues. Many machine learning-based approaches have the advantage that the bulk of the computational cost is incurred during training; inference from a trained model is relatively computationally inexpensive, whereas docking-based methods require large quantities of computational resources for each prediction to score and rank structures in the conformational space. An overview focused on machine learning and deep learning approaches and the requisite data preparation is provided in Fig. 1.

Fig. 1

Overview of machine learning and deep learning approaches to protein–protein interaction site prediction. Input structural or sequence data requires feature engineering or transformation into an appropriate representation for the architecture of the model

An example of a structure-based approach is ProtCHOIR (https://github.com/monteirotorres/ProtCHOIR), a tool for automated, proteome-scale generation of homo-oligomers, providing detailed information on the input protein and output complex (Torres PHM & Blundell TL, Manuscript in preparation). ProtCHOIR requires as input either a sequence or a protomeric structure, which is queried against a pre-constructed local database of homo-oligomeric structures and then extensively analyzed using well-established tools such as PSI-BLAST [14] (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download), MAFFT [15] (https://mafft.cbrc.jp/alignment/software/), TMHMM [16] (http://www.cbs.dtu.dk/services/TMHMM/), PISA [17] (https://www.ccp4.ac.uk/MG/ccp4mg_help/pisa.html), Gesamt [18] (http://ccp4serv7.rc-harwell.ac.uk/gesamt/), and MolProbity [19] (http://molprobity.biochem.duke.edu). Finally, MODELLER [20] (https://salilab.org/modeller/) is employed to construct the homo-oligomers. The output complex is thoroughly analyzed, taking into account its stereochemical quality, interfacial stabilities, hydrophobicity, and conservation profile. The software is easily parallelizable and also outputs a comma-separated-value file of summary statistics that can straightforwardly be concatenated into a spreadsheet-like document for large-scale data analysis.

This chapter focuses on the considerations involved in applying deep learning methods to protein structure data for the prediction of protein–protein interaction sites. The main steps in developing such a project, from data collection and preparation, through featurization and representation, to model design and evaluation, are highlighted. The choice of representation is a key decision in such an undertaking. The literature contains many machine learning-based approaches to this problem covering a range of classical models and representations, including logistic regression [21], Naive Bayes classifiers [22], Support Vector Machines (SVMs) [23,24,25], and random forests [26,27,28]. A full review of existing approaches is beyond the scope of this chapter. Neural network-based approaches have included shallow neural network models of limited width (models with comparatively few hidden layers and limited dimensionality compared to more recent architectures) [29,30,31,32] and, more recently, deeper architectures; larger datasets and GPU acceleration have enabled the training of these deeper networks. Among the more recent deep learning approaches, convolutional neural networks (CNNs) [33], recurrent neural networks (RNNs) including Long Short-Term Memory networks (LSTMs) [34], and hybrids thereof [35] have been applied to the sequence-based interaction site prediction problem. Here the authors note a novel problem framing, recently developed using Message Passing Neural Processes [36], in which models operate directly on graph-structured representations of protein structures (see Note 1). The graphs are composed of constituent amino acid residues and their interactions, and the target labels are binary labels indicating whether or not a particular residue takes part in a protein–protein interaction. The authors believe this is timely due to rapid development and early successes in geometric deep learning and its applications to computational structural biology [37,38,39,40,41,42], including applications to the study of protein–protein interactions [37]. The work of Fout et al. [38], who applied a Graph Neural Network (GNN) model to PPI interface prediction between two interacting proteins, is also acknowledged.

There are a number of problems to be aware of in designing machine learning predictors. For instance, data often suffer from class imbalances, where the number of negative sites (non-interacting residues) is much greater than the number of positive sites (see Note 2). This problem is exacerbated in larger proteins as the fraction of positive sites decreases with size [29, 43] and has been shown to bias predictors that do not account for this [44]. Sequence-based predictors have also been shown to confuse small molecule ligand, DNA, and RNA binding regions with protein-binding regions [45].

2 Materials

2.1 Computing Resources

Most of the development workflow can be performed on a standard UNIX workstation equipped with a GPU suitable for training deep learning models. The exact GPU memory requirements will depend on the model architecture and dataset sizes used. If models are to be trained with local or cluster-based GPU acceleration (strongly recommended for training and running models at scale), an installation of a CUDA (https://developer.nvidia.com/cuda-toolkit) version compatible with the GPU will be required. The user should be comfortable running command-line tools, writing basic Python, and working with a commonly used deep learning framework (such as PyTorch or TensorFlow), as well as with installing Python packages and with machine learning fundamentals.

2.1.1 Software Installations

It is recommended to install the required packages in a virtual environment. A virtual environment can be set up using Conda [46] (https://docs.conda.io/en/latest/), a commonly used package and environment manager.

2.1.2 Machine Learning Frameworks

There are a number of actively developed machine learning frameworks. A popular choice for traditional ML is SciKit-Learn [47] (https://scikit-learn.org/stable/). For deep learning frameworks, popular choices include: PyTorch [48] (https://pytorch.org), TensorFlow [49] (https://www.tensorflow.org), and Theano [50] (http://deeplearning.net/software/theano/). Each framework has an associated ecosystem of community-developed implementations of popular methods and tools. A fuller discussion of popular frameworks can be found in a review by Erickson et al. [51].

2.2 Databases and Datasets

There are a number of databases which collect relevant data for constructing datasets of protein–protein interactions. Zhang et al. [52] collated a large database of protein sequences, annotated with protein, small-molecule ligand, and nucleic acid binding sites at the residue level. Li et al. used this resource to create a large training dataset, removing sequences with >40% sequence similarity to the test datasets and >40% similarity to other training examples to ensure diversity in the training set [35]. The authors of this work make available training (87.5%, n = 9681) and validation (12.5%, n = 1382) splits. This is, to our knowledge, by far the largest dataset collated for PPI site prediction.

In addition, there are four other processed datasets that can be used for training and testing: Dset_72 [53], Dset_164 [54], Dset_186 [22], and Dset_448 [21], where the trailing number indicates the number of sequences in each dataset. Dset_72, Dset_164, and Dset_186 were constructed by curating heterodimeric PDB entries with structural resolution <3 Å and <25% sequence identity. It is suggested that these datasets be used as independent test sets (practitioners should take care to avoid overlap with these datasets in their training data, as discussed in Subheading 3.1.2) to enable benchmarking of novel methods against existing approaches.

2.3 Tools for Computing Features and Representations

There exist many tools for computing both sequence-based and structure-based features for proteins. Previous research by Jones and Thornton [55] into identifying important features for PPI site prediction revealed solvation potential, residue interface propensity, hydrophobicity, planarity, protrusion, and accessible surface area (ASA) as important features for discriminating binding residues. In the intervening years, many more tools for calculating protein and amino acid properties from sequences and structures have been made available. Example tools and featurization options that practitioners may wish to consider in developing a PPI site prediction project are outlined below.

2.3.1 Sequence-Based

Sequence-based tools include PROFEAT [56, 57] (http://bidd.group/cgi-bin/profeat2016/ligand/profnew.cgi), a longstanding web server for calculating structural and physicochemical properties from protein sequences. ProPy [58] (https://pypi.org/project/propy3/1.0.0a2/) is a python package capable of calculating a large number of features from protein sequences (amino acid, dipeptide, and tripeptide composition descriptors; Normalized Moreau-Broto, Moran, and Geary autocorrelation descriptors; Composition, Transition, Distribution (CTD) descriptors; sequence order coupling numbers; quasi-sequence order descriptors; and pseudo and amphiphilic pseudo amino acid composition descriptors); a short sketch of its use is given below. Putative Relative Solvent Accessibility (RSA) can be computed using ASAquick [59] (http://mamiris.com/ASAquick/). Meiler et al. [60] make available a set of low-dimensional embeddings of the physicochemical properties of amino acids that can be used for featurization. Position-specific scoring matrices (PSSMs) are often highly informative features [33], as residues important for facilitating interactions are likely to be evolutionarily conserved; they can be readily computed using PSI-BLAST [14]. Similarly, evolutionary conservation (ECO) can be computed with HHblits [61, 62] (https://github.com/soedinglab/hh-suite). Further to these descriptors, Li et al. [35] make use of several others, including High-Scoring Pairs (HSPs), similar sub-sequences between two proteins scored using substitution matrices such as PAM and BLOSUM [63], computed using SPRINT [64] (https://github.com/lucian-ilie/SPRINT/). ANCHOR [65] (https://iupred2a.elte.hu) can be used to predict putative disordered protein-binding regions. Hydrophobicity information can be encoded using the hydropathy index of [66].
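As an illustration, the snippet below sketches how a set of per-sequence descriptors might be computed with ProPy (propy3). This is a minimal sketch: the method names follow the propy3 documentation and should be verified against the installed version, and the sequence is a fabricated example.

```python
from propy import PyPro

# Fabricated example sequence; any amino acid sequence string will do.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

descriptors = PyPro.GetProDes(sequence)

features = {}
features.update(descriptors.GetAAComp())  # amino acid composition (20 features)
features.update(descriptors.GetDPComp())  # dipeptide composition (400 features)
features.update(descriptors.GetCTD())     # composition/transition/distribution descriptors

print(f"{len(features)} sequence-level features computed")
```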

2.3.2 Sequence Embeddings

Embedding-based methods take sequences as input and return a fixed-length representation of the sequence. The goal of sequence embeddings is to keep similar sequences close in the embedding space, while maintaining distance between dissimilar sequences. There are a number of pre-trained sequence embedding models that can be used with protein sequences. For instance, ProtVec [67] is one such method that has shown good results when applied to protein family classification. ProtVec has been leveraged in PPI site prediction in work by Li et al. [35] in which the authors summed the 100-dimensional representations of each of the sequence 3-mers to produce the fixed embedding and speed up computation. UniRep [68] provides a pre-trained RNN model trained on 24 million sequences from UniRef50 [69]. Rives et al. [70] make available a pre-trained transformer model trained on over 250 million sequences that can be utilized to generate sequence embeddings for PPI tasks.
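To make the ProtVec-style summation concrete, the sketch below sums 100-dimensional 3-mer embeddings over a sequence, following the description of Li et al. [35]. The `embedding_table` is a stand-in for a pre-trained ProtVec lookup (an assumption here); in practice it would be loaded from the published ProtVec vectors rather than randomly initialized.

```python
import numpy as np

EMBED_DIM = 100
rng = np.random.default_rng(0)
embedding_table = {}  # stand-in for a pre-trained ProtVec 3-mer -> vector lookup


def embed_sequence(sequence: str) -> np.ndarray:
    """Sum the embeddings of all overlapping 3-mers to obtain a fixed-length vector."""
    total = np.zeros(EMBED_DIM)
    for i in range(len(sequence) - 2):
        kmer = sequence[i : i + 3]
        if kmer not in embedding_table:  # hypothetical fallback for unseen 3-mers
            embedding_table[kmer] = rng.normal(size=EMBED_DIM)
        total += embedding_table[kmer]
    return total


print(embed_sequence("MKTAYIAKQRQISFVKSHFSRQ").shape)  # (100,)
```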

2.3.3 Structure-Based

DSSP [71, 72] (https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html) is a longstanding tool for calculating secondary structure descriptors of proteins from their structures. Graphein [73] (https://github.com/a-r-j/graphein) is a python library for computing geometric representations of protein structures. It is capable of flexibly creating protein structure graphs at various levels of granularity (amino acid, atomic) and under various construction schemes (based on intramolecular contacts and/or various distance-based methods), as well as mesh representations of protein surfaces. It also includes various featurization schemes, such as the aforementioned DSSP descriptors, Cartesian coordinates, and the low-dimensional embeddings of the physicochemical properties of amino acids. Voxelized representations of 3D atomic structures, in which atoms are fixed as points on a 3D grid, can be featurized using a variety of atomic descriptors as additional channels, such as encodings of atom type, atomic number, atomic mass, explicit and implicit valence, hybridization, ring status, aromaticity, formal charge, and chirality. A sketch of simple voxelization follows.
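The sketch below illustrates the voxelization idea in a minimal form: atoms are binned onto a fixed-size occupancy grid with one channel per atom type. The grid size, resolution, and channel set are illustrative assumptions, not a prescription.

```python
import numpy as np

ATOM_CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}  # illustrative atom-type channels


def voxelize(coords, elements, grid_size=32, resolution=1.0):
    """Place atoms on a (channels, grid, grid, grid) binary occupancy grid."""
    grid = np.zeros((len(ATOM_CHANNELS), grid_size, grid_size, grid_size), dtype=np.float32)
    centred = coords - coords.mean(axis=0)  # centre the structure in the box
    indices = np.floor(centred / resolution).astype(int) + grid_size // 2
    for (x, y, z), element in zip(indices, elements):
        if element in ATOM_CHANNELS and all(0 <= v < grid_size for v in (x, y, z)):
            grid[ATOM_CHANNELS[element], x, y, z] = 1.0
    return grid


# Toy usage with three fabricated atoms.
grid = voxelize(np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 1.0]]), ["C", "N", "O"])
print(grid.shape)  # (4, 32, 32, 32)
```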

3 Methods

3.1 Data

Training and evaluating a supervised deep learning model requires datasets of labeled examples. These data should be partitioned into training, validation, and test datasets. The raw data should be converted into an appropriate representation and pre-processed prior to training and inference. Important commonalities and considerations in preparing a dataset are highlighted in this section. A schematic overview of a standard preparation procedure is shown in Fig. 2.

Fig. 2

Overview of data processing and model development pipeline

3.1.1 Curation

First, a set of protein structures or sequences with protein–protein interaction binding site annotations is required. Some of the available processed and raw sources for such data are discussed in Subheading 2.2.

3.1.2 Train-Test Split Strategies

Data should be split into training, validation, and testing sets. Training data are used to iteratively train the parameters of the model, while validation and testing data are used to evaluate model performance for hyperparameter selection and final evaluation, respectively. The proportions 80/10/10 are commonly used for training/validation/testing data (a minimal sketch of such a split is given below). The experimenter can make a number of choices with respect to splitting data. The most obvious and straightforward solution is to split the data randomly. However, this can produce misleading results if related sequences are over-represented in the data compared to true protein space. Furthermore, highly similar training and testing examples may reward predictions based on homology rather than learning of the data-generating function. Both of these scenarios violate the independent and identically distributed (i.i.d.) assumption, under which the training and test data are drawn from the same underlying joint distribution. In developing a model, one is interested in performing predictions on novel protein structures that may lie outside this distribution (ideally, a dataset with a representative distribution is constructed). A practitioner therefore wishes to examine the generalization of the model, i.e., the extent to which it can make accurate predictions on unseen data, indicating learning of the underlying physical process rather than memorization of the input data points. Thus, a good test set should contain a set of protein structures distinct from those the model was trained on. However, the exact manner by which one achieves this depends on the application domain. For instance, if the experimenter is looking to build a model specific to predicting interaction sites on, say, kinases, then a random partitioning of the data could be considered valid given a sufficiently large and diverse dataset of annotated kinase structures.
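A minimal sketch of a random 80/10/10 split at the protein level follows; the protein identifiers are placeholders, and, as discussed above, a purely random split is only appropriate when homology between partitions is not a concern.

```python
import numpy as np

rng = np.random.default_rng(42)
protein_ids = np.array([f"protein_{i}" for i in range(1000)])  # placeholder identifiers

shuffled = rng.permutation(protein_ids)
n = len(shuffled)
train_ids = shuffled[: int(0.8 * n)]              # 80% training
val_ids = shuffled[int(0.8 * n) : int(0.9 * n)]   # 10% validation
test_ids = shuffled[int(0.9 * n) :]               # 10% testing

print(len(train_ids), len(val_ids), len(test_ids))  # 800 100 100
```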

Although often used, splitting on the basis of clustering or thresholding by sequence similarity alone (e.g., using tools such as BLAST [74, 75]) is insufficient. Instead, the experimenter should make use of resources such as SCOP [76] (http://scop.mrc-lmb.cam.ac.uk) or CATH [77] (https://www.cathdb.info) to obtain information about related structures, in order to prevent leakage of structural homologs from the training set into the test set. This is important as sequences with 35–40% sequence similarity are highly likely to adopt similar structural folds [78, 79]. Furthermore, structural similarity has been found between proteins with sequence similarity in the 8–30% range [80]. Moreover, many PPIs are mediated at the level of domains, which can be thought of as structurally and somewhat evolutionarily independent subunits, and similarity at this level is not captured by simplistic sequence similarity scores. Thus, more nuanced data partitioning based on knowledge-based annotation databases should be standard practice in developing a well-principled project (a sketch of a group-aware split is given below). It should be noted that this presents a much more challenging, but more representative, task for a classifier; obtaining good predictive performance will be difficult, but the resulting model will be more widely applicable.
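One way to realize such a partitioning is to group proteins by a structural annotation (e.g., a CATH superfamily identifier) and split at the group level, so that no group spans both partitions. The sketch below uses scikit-learn's GroupShuffleSplit with fabricated annotations.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: one entry per protein, annotated with a fabricated superfamily ID.
protein_ids = np.array([f"protein_{i}" for i in range(1000)])
superfamilies = np.array([f"3.40.50.{i % 37}" for i in range(1000)])

# Hold out ~20% of proteins such that no superfamily spans both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(protein_ids, groups=superfamilies))

# No structural group leaks between training and test data.
assert not set(superfamilies[train_idx]) & set(superfamilies[test_idx])
```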

3.1.3 Representation

Protein structures are not natively machine-readable data structures, and there is a large amount of flexibility available to the practitioner in deciding how to represent them. Some common choices are outlined in Table 1. As discussed previously, the choice of representation is central to designing a machine learning project: it informs the choice of architecture and the techniques available to the experimenter.

Table 1 Summary of different choices of representations for protein structures

A good representation will infuse the model with a strong inductive bias. An inductive bias allows a model to prioritize one solution to a problem over another and can express assumptions about the underlying data-generating process or the solution space [86]. For instance, it makes sense to perform 1D convolutions over protein sequences and 3D convolutions over grid-structured representations of protein structures as the spatial components of the convolutions are meaningful and valid. Meaningful inductive biases in the case of PPI site prediction may include the ability to account for both long- and short-range interactions between constituent amino acids. For instance, residues that are distant from one another in the sequence may be found in the same interaction interface.

The choice of representation also affects the applicability of the trained model. For instance, a model trained using a sequence-based representation will be able to perform predictions on unseen sequences. However, a model trained on a representation derived from crystal structures will require crystal structures in order to make predictions after training (learning using privileged information offers a framework in which both sequences and structures can be leveraged during training while allowing for sequence-based predictions during inference; see Subheading 3.3.4). The desired use case is therefore an important consideration in the choice of representation and should be balanced against the availability of data and relevant inductive biases in the decision-making process.

3.1.4 Input Features

Featurization of the input data can be performed using various stand-alone programmatic bioinformatics tools and web servers according to the representation selected by the experimenter. A non-exhaustive collection of these is highlighted in Subheading 2.3.

3.1.5 Pre-processing

Data and labels used in a machine learning model should be scaled to enable efficient training (see Note 3). Training features and real-valued labels should be pre-processed using a standard scaler (i.e., to zero mean and unit variance). This scaling transformation should be fitted on the training data and the same transformation then applied to any validation or test data used, rather than scaling each dataset independently. This step ensures that features on different scales are standardized, preventing a large spread in the learned weights of the network. Large weights are often unstable, as they may produce large gradient values, causing large updates to the network, or result in dead neurons (see Note 8). Alternative pre-processing techniques for numeric features include min-max scaling and transformations such as log-transforming features that range across many orders of magnitude. Categorical and ordinal features should be numerically encoded, e.g., through one-hot encoding. An example of fitting a scaler on the training set only is given below.
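For example, with scikit-learn the scaler is fitted on the training features only and the same fitted transformation is reused for the validation and test sets (the arrays here are fabricated):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(800, 30))  # fabricated feature matrices
X_val = rng.normal(5.0, 2.0, size=(100, 30))
X_test = rng.normal(5.0, 2.0, size=(100, 30))

scaler = StandardScaler().fit(X_train)  # statistics estimated from training data only
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)    # the same transformation, not refitted
X_test = scaler.transform(X_test)  # likewise for the test set
```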

3.2 Model Evaluation

3.2.1 Hyperparameter Tuning

The goal during training is to minimize a loss function by iteratively updating the model parameters (see Note 4). During each training epoch, the model is presented with mini-batches (subsets of the training data), between which the weights are updated (see Note 5). A skeletal training loop is sketched below.
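The following PyTorch sketch illustrates epochs, mini-batches, and weight updates; the model, data, and hyperparameter values are all placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 512 residues with 30 features each and binary interaction labels.
X = torch.randn(512, 30)
y = torch.randint(0, 2, (512, 1)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

for epoch in range(10):              # one epoch = one full pass over the training set
    for batch_X, batch_y in loader:  # weights are updated after each mini-batch
        optimizer.zero_grad()
        loss = criterion(model(batch_X), batch_y)
        loss.backward()              # backpropagate the error gradients
        optimizer.step()
```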

Deep learning models are themselves described by hyperparameters, which specify various aspects of the model architecture and the training regime. The set of all possible hyperparameters forms a space, and the objective is to find the combination of hyperparameters that leads to the best predictive performance at test time (see Note 6). A selection of common hyperparameters and commonly used values is described in Table 2. It should be noted that different architectures have additional hyperparameters associated with them: for instance, CNN-based models have convolution filter sizes, padding, stride length, and pooling types; graph-based methods have hyperparameters such as the aggregation and readout functions; and sequence-based architectures have hyperparameters such as the embedding dimension. Furthermore, some hyperparameter choices have additional hyperparameters associated with them, such as the momentum term used in Adam optimization.

Table 2 Common hyperparameters found across neural network architectures with commonly used choices or ranges

3.2.2 Evaluation Metrics

In order to determine the optimal set of hyperparameters, practitioners are required to make comparisons between models and their performance. When is a model considered better than another? What is considered to be the key measure of performance? These questions fall into the categories of model assessment and selection.

Ideally, the true error of the classifier is approximated. Broadly speaking, there are two methods for approximating this and evaluating the performance of a model: cross-validation (CV) based methods and training/validation/testing data splits. Within CV, there are two main paradigms to consider. The first is k-fold CV, where the data are split into k segments; the model is trained on k − 1 of the folds, and performance is assessed on the left-out fold. This is performed k times and the errors averaged to assess the model. The second paradigm is leave-one-out cross-validation (LOOCV), a special case of k-fold CV where k = n (the total number of examples in the dataset). Using CV-based methods for evaluating deep learning models can be difficult in practice, due to the high computational and time requirements associated with model training. Training/validation/test data splits are a useful compromise, where the data are partitioned into three sets. The model is trained on the training data, the hyperparameters are tuned using the performance on the validation data as a guide, and the final assessment of the model is carried out on the test data. It is important to prevent leakage of information between the sets (e.g., having very similar training and testing examples, as discussed in Subheading 3.1.2) as this prevents accurate model assessment. Furthermore, hyperparameters should only be tuned on the basis of training and validation performance; tuning on test-set performance is another form of leakage.

To understand and evaluate model performance, a selection of classification metrics should be examined. To begin, consider the possible outcomes of a prediction with respect to its true label. Table 3 is known as a confusion matrix, and the relevant metrics discussed below depend on the desired types of classification correctness. It is important to note that PPI site prediction is an imbalanced classification problem: positive and negative labels do not occur at similar rates in the data (see Note 2). Thus, simple metrics, such as the fraction of correct predictions, are inappropriate, as the null predictor (a model that predicts 0 for every output) would score according to the frequency of non-interacting residues in the data. A rigorous evaluation would therefore leverage an ensemble of metrics to assess a model and understand its performance on each of the prediction classes outlined in the confusion matrix. These are presented in the definitions below (Fig. 3). Readers should note there are a number of ways of computing these scores, e.g., on a per-residue or per-protein basis (see Note 7).

Table 3 Confusion matrix displaying possible classification outcomes
Fig. 3

Illustration of classification error types

Sensitivity

This is equivalent to the fraction of correctly identified interaction sites. This is also referred to as recall or the true positive rate (TPR).

$$ \mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$

False-Positive Rate (FPR)

This is the fraction of negative sites wrongly classified as positive over the total number of negative sites.

$$ \mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} $$

Specificity

This is equivalent to the fraction of correctly classified non-interacting sites out of all the non-interacting sites.

$$ \mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$

Precision

This is equivalent to the fraction of correctly classified interaction sites out of all sites predicted to be interaction sites.

$$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$

Accuracy

This is equivalent to the fraction of correctly predicted interacting and non-interacting sites.

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{TN}+\mathrm{FP}} $$

ROC and AUC

The Receiver Operating Characteristic (ROC) curve plots the TPR against the FPR at different classification thresholds. The Area Under the ROC Curve (AUC) summarizes performance across all classification thresholds as a single value between 0 and 1.

F1 Score

This is the harmonic mean of precision and sensitivity and ranges between 0 and 1, where 1 indicates perfect precision and recall.

$$ F1=2\times \frac{\mathrm{Sensitivity}\times \mathrm{Precision}}{\mathrm{Sensitivity}+\mathrm{Precision}} $$

Matthews Correlation Coefficient (MCC)

This measure ranges between −1 and 1. A high score is achievable only if a binary classifier performs well on all the entries of the confusion matrix. It has the advantage of being robust to class-imbalanced datasets and provides the most truthful measure of classifier performance [88]. Obtaining a strong score on this metric is therefore the most challenging of those outlined here, and it should form the basis for assessing the performance of a protein–protein interaction site-prediction classifier.

$$ \mathrm{MCC}=\frac{\mathrm{TP}\times \mathrm{TN}-\mathrm{FN}\times \mathrm{FP}}{\sqrt{\left(\mathrm{TP}+\mathrm{FP}\right)\times \left(\mathrm{TP}+\mathrm{FN}\right)\times \left(\mathrm{TN}+\mathrm{FP}\right)\times \left(\mathrm{TN}+\mathrm{FN}\right)}} $$
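All of the above can be computed from true and predicted labels, for example with scikit-learn's metrics module (the labels below are a fabricated, imbalanced toy example):

```python
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]  # toy labels: imbalanced, as in PPI data
y_prob = [0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.8, 0.4, 0.1, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # binarize at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", recall_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))
print("precision:  ", precision_score(y_true, y_pred))
print("F1:         ", f1_score(y_true, y_pred))
print("MCC:        ", matthews_corrcoef(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_prob))  # computed from probabilities
```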

3.2.3 Overfitting

In the development process, it is important to report metrics for both the training and validation datasets in order to identify overfitting. Overfitting arises from a combination of data scarcity and a model that is too flexible, resulting in “memorization” of the training data, rather than learning of the underlying process. Techniques for countering overfitting include adding regularization penalties to the loss function, using dropout or reducing the capacity of the model (e.g., restricting depth or the size of the hidden layers). See Note 8 for more details regarding regularization techniques.

3.2.4 Attribution

Most machine learning and deep learning methods are black-box predictors that do not allow for a clear and interpretable examination of the relationship between the input features and the output classification. However, there are techniques that allow for some exploration of this relationship, such as gradient attribution methods [89] and attention mechanisms (see Subheading 3.3.6), through visualization of the attention weights.

3.3 Alternative Training Regimes for Future Model Development

Deep learning has demonstrated excellent performance in a variety of tasks within computational biology, including computational structural biology. Supervised deep learning involves training an artificial neural network on labeled data to perform regression or classification tasks. The training process iteratively tunes the weights of a series of feedforward feature-extracting layers. These layers learn hierarchical features from their inputs, such that deeper layers in the network learn higher-order features as compositions of those learned by earlier layers. Such methods have already been applied successfully to PPI site prediction. While this has resulted in impressive performance gains, PPI site prediction is far from a solved problem, and the utility of a highly performant predictor demands further development in this area. This requires improvements in both training and benchmark dataset quality, as well as in the modeling process. In this section, alternative training regimes that may provide utility for researchers interested in furthering model development in this field are highlighted.

3.3.1 Multi-modal Input

Combining additional data modalities has proven useful in PPI site prediction. For instance, Zeng et al. [33] make use of a combination of local and global sequence-based features for their predictor: local features are captured using a sliding window over the amino acid sequence, global features through the use of a text CNN, and their contributions are validated through feature-ablation studies. When utilizing multiple data modalities, the question arises as to whether these features should be fused early or late in the model architecture and whether they require individual input encoders to extract features (a sketch of late fusion is given below). Furthermore, the correspondence of the modalities should be taken into account, e.g., is each training example presented as a tuple (xi, xj), or is the second modality repeated across certain examples? Feature ablation studies are useful to understand the relative contributions of each modality (see Note 9).
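A minimal sketch of late fusion in PyTorch, with one encoder per modality and concatenation before the classification head; all dimensions and layer choices are illustrative assumptions, not the architecture of [33].

```python
import torch
from torch import nn


class LateFusionModel(nn.Module):
    """Two modality-specific encoders whose outputs are fused late, before the head."""

    def __init__(self, local_dim=20, global_dim=100, hidden=64):
        super().__init__()
        self.local_encoder = nn.Sequential(nn.Linear(local_dim, hidden), nn.ReLU())
        self.global_encoder = nn.Sequential(nn.Linear(global_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # fusion by concatenation

    def forward(self, x_local, x_global):
        fused = torch.cat([self.local_encoder(x_local), self.global_encoder(x_global)], dim=-1)
        return self.head(fused)


model = LateFusionModel()
logits = model(torch.randn(8, 20), torch.randn(8, 100))  # one (x_i, x_j) tuple per example
print(logits.shape)  # torch.Size([8, 1])
```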

3.3.2 Transfer Learning

Transfer learning involves pre-training a model on a similar task, which can then be fine-tuned on the primary task. Such fine-tuning typically involves replacing or adding to the latter layers of the model and re-training on the primary dataset. The intuition is that the earlier layers of the model have learned representations relevant to the task, which can improve performance and speed up convergence. Practitioners using such a strategy may consider whether or not to freeze the weights of the pre-trained layers during fine-tuning, conceiving of the pre-trained layers as fixed feature extractors.
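In PyTorch, treating the pre-trained layers as a fixed feature extractor might look as follows; the pre-trained network here is an untrained stand-in, and only the new head is optimized.

```python
import torch
from torch import nn

# Stand-in for a network pre-trained on a related task.
pretrained = nn.Sequential(nn.Linear(30, 128), nn.ReLU(), nn.Linear(128, 64), nn.ReLU())

for param in pretrained.parameters():
    param.requires_grad = False  # freeze: use pre-trained layers as a fixed feature extractor

model = nn.Sequential(pretrained, nn.Linear(64, 1))  # new task-specific head

# Only the trainable (head) parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```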

3.3.3 Multi-task Learning

Multi-task learning (MTL) involves training a predictor on multiple related tasks simultaneously, typically optimizing additional auxiliary losses, with the goal of improving predictive performance and generalization. The underlying intuition is that related tasks contain training signals that help the model learn the main task through inductive transfer. For instance, in the case of PPI site prediction, Zhang et al. make use of a multi-task framework to jointly predict interaction sites and solvent-accessible residues, as only solvent-accessible residues are capable of interacting with another protein, and identify this as an effective strategy for countering the class-imbalance problem [34]. Further model development leveraging multi-task learning might examine joint prediction of PPI sites and nucleic acid or small-molecule binding sites, an approach developed by [21] to address cross-prediction of these sites.
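A sketch of the joint objective under illustrative assumptions: a shared encoder feeds two heads, and the total loss is a weighted sum of the main (interaction site) and auxiliary (solvent accessibility) losses, with the weighting as a hyperparameter.

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(30, 64), nn.ReLU())  # shared representation
interaction_head = nn.Linear(64, 1)  # main task: PPI site prediction
solvent_head = nn.Linear(64, 1)      # auxiliary task: solvent accessibility

criterion = nn.BCEWithLogitsLoss()
aux_weight = 0.5  # illustrative auxiliary-loss weight

# Fabricated batch: 64 residues with 30 features and labels for both tasks.
x = torch.randn(64, 30)
y_ppi = torch.randint(0, 2, (64, 1)).float()
y_solv = torch.randint(0, 2, (64, 1)).float()

h = encoder(x)
loss = criterion(interaction_head(h), y_ppi) + aux_weight * criterion(solvent_head(h), y_solv)
loss.backward()  # gradients flow into the shared encoder from both tasks
```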

3.3.4 Learning Using Privileged Information

Learning Using Privileged Information (LUPI) is an approach where a model is trained on multiple sources of data but evaluated and deployed on examples where some of these data sources are unavailable [90]. Formally, training examples are presented as triples \( \left({x}_i,{x}_i^{\ast },{y}_i\right) \) (instead of tuples (xi, yi) as in the classical case), where \( {x}_i^{\ast } \) is the additional information for training example xi and yi is the corresponding label. For instance, a LUPI approach to PPI site prediction could involve training a model on protein structures and sequences, where available, which is then evaluated on and deployed to predict interaction sites from sequences alone. Such an approach has been developed for sequence-based protein–ligand binding affinity prediction, producing results comparable to structure-based approaches [91]. LUPI presents an attractive framework in which multiple data modalities can be used without requiring complete coverage across all examples; this is especially relevant for biological datasets, which are often fragmented in this respect, as they are typically collected without a machine learning strategy in mind.

3.3.5 Uncertainty Modeling and Active Learning

Modeling of uncertainty is of vital importance in computational biology. In data-limited scenarios, as is often the case in computational structural biology, uncertainty modeling can avoid over-confident predictions and enable judicious exploration of space outside of the training distribution [92, 93]. Uncertainties are especially vital in experimental contexts where the acquisition of additional data points is slow, arduous or expensive, as they can allow the experimenter to prioritize hypotheses with a high likelihood of success or experiments of potential greater novelty and higher associated risk [94]. Furthermore, uncertainty modeling opens the possibility for active learning loops, where further exploratory and validation experiments can be prioritized using the model and an associated acquisition function, and the new data incorporated into the model iteratively to explore increasingly distant regions of biological space [95].

Acquisition functions are used to propose which examples in the search space should be considered next by a model. These functions are typically inexpensive to evaluate. Examples include greedy approaches, where data points with the highest predicted response are prioritized; variance-based approaches, where data points with the highest associated prediction variance are prioritized; and expected improvement-based methods. Other commonly used acquisition functions include the Upper Confidence Bound (UCB) and the Maximum Probability of Improvement (MPI). Simple examples are sketched below.
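For illustration, three of the named acquisition functions applied to candidate points with fabricated predicted means and standard deviations:

```python
import numpy as np

mu = np.array([0.2, 0.8, 0.5, 0.6])      # predicted responses for candidate points
sigma = np.array([0.05, 0.1, 0.4, 0.2])  # associated predictive standard deviations

greedy = np.argmax(mu)             # highest predicted response
variance_based = np.argmax(sigma)  # highest predictive uncertainty
ucb = np.argmax(mu + 2.0 * sigma)  # Upper Confidence Bound with kappa = 2

print(greedy, variance_based, ucb)  # 1 2 2
```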

3.3.6 Attention Mechanisms

Attention mechanisms allow models to apply learnable weights to subsets of their inputs, depending on their perceived importance. Co-attention is a mechanism useful in multi-input models, which allows the attention on each input to be conditional on the other input. For instance, when designing a model to predict PPI sites between two proteins, co-attention mechanisms may allow the model to attend to more relevant parts of each protein in a manner specific to that particular interaction. When presenting one of these proteins with a different interaction partner, the model may attend to different parts of the proteins if the interaction occurs in a different region. Attention mechanisms have successfully been applied to paratope prediction [96].

3.3.7 Ensembling

Ensemble models involve constructing multiple models to perform predictions on the same input. The final prediction then results from some form of averaging of all of the individual model predictions. This can take the form of voting (e.g., as in Random Forests), averaging, or training another model that takes these predictions as inputs. It is often recommended to ensemble models with different architectures, as each will have different biases and therefore make different mistakes, making the ensemble more robust. It is also possible to ensemble models of the same architecture, e.g., through snapshot ensembling [97], where "snapshots" of the weights are taken throughout the training of a single model and ensembled to create the final predictor. These approaches can be thought of as ensembling in model space, as several models are constructed and ensembled. This can be computationally costly, due to the requirements of constructing multiple models during training and performing several predictions during inference. Ensembling methods have previously been applied to PPI site prediction [98]. Weight averaging is another technique, which can be considered ensembling in weight space, as the procedure results in a single model.
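Averaging in model space can be as simple as taking the mean of the predicted probabilities of several independently trained models; the models below are untrained placeholders sharing one architecture for brevity.

```python
import torch
from torch import nn

# Placeholder ensemble: independently initialized (in practice, independently trained) models.
ensemble = [nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 1)) for _ in range(5)]

x = torch.randn(16, 30)  # fabricated input batch
with torch.no_grad():
    probs = torch.stack([torch.sigmoid(m(x)) for m in ensemble])  # (5, 16, 1)
    ensemble_prediction = probs.mean(dim=0)  # average over the ensemble members
print(ensemble_prediction.shape)  # torch.Size([16, 1])
```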

4 Notes

  1.

    Graphs, G = (V, E, Xv, Xe), are structures used to model objects consisting of a set of nodes (or vertices), vi ∈ V, with, typically, pairwise relations, ei,j = (vi, vj) ∈ E, between them such that E ⊆ V × V. Node features, Xv ∈ ℝ|V| × d; \( {x}_i^v\in {\mathrm{\mathbb{R}}}^d \), are the d features associated with node vi. Similarly, edge features, Xe ∈ ℝ|E| × c; \( {x}_{i,j}^e={x}_{v_i,{v}_j}^e\in {\mathrm{\mathbb{R}}}^c \), are the c features associated with edge ei,j. Protein structures can be represented as graphs of residues joined by intramolecular interactions or by some distance-based criterion, such as thresholding the Euclidean distance or K-nearest-neighbor clustering of distances. Protein structure graphs have been used in computational biology for prediction of protein–protein interaction interfaces [38] and protein structural classification [41]. One can distinguish between geometric deep learning methods, which operate directly on the graph structure, and machine learning applied to graph-based features. For instance, graph-based signatures have proved effective in a variety of applications relating to the impact of mutations on protein interactions [99, 100]. However, recent developments in graph representation learning offer the potential to exploit the relational structure of the data as an inductive bias in the model.

    In this problem setting, PPI site prediction becomes a node classification task, where the objective is to predict a binary label, \( \hat{y}\in \left\{0,1\right\} \), for each amino acid in the structure graph, indicating whether or not it partakes in a protein–protein interaction. In order to do this, graphs are enriched with information about the protein structure in the form of node and edge features. These features provide the basis for computing the message-passing signals in the model to exploit the relational structure present in the data. Node features can take the form of an encoding of the amino acid residue type, secondary structure information, surface accessibility metrics, Cartesian coordinates of the centroid position, and PSSMs. Additional features highlighted in Subheading 2.3 can be computed and used by practitioners to enhance the modeling. Edge features can be used to specify the intramolecular interaction type or the Euclidean distance between two adjacent nodes. The authors have developed an example of this type of approach using Message Passing Neural Processes [36]. A minimal sketch of distance-based graph construction is given below.
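    The sketch below builds a residue graph from C-alpha coordinates using a distance threshold; the coordinates are fabricated, and a library such as Graphein [73] handles this, and richer construction schemes, in practice.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    ca_coords = rng.uniform(0, 30, size=(50, 3))  # fabricated C-alpha coordinates, 50 residues

    # Pairwise Euclidean distances between all residues.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))

    # Edges join residues closer than a threshold (8 angstroms is a common choice).
    adjacency = (distances < 8.0) & ~np.eye(len(ca_coords), dtype=bool)
    edges = np.argwhere(adjacency)

    print(f"{len(ca_coords)} nodes, {len(edges)} directed edges")
    ```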

  2.

    Protein–protein interaction data and site annotations are heavily class-imbalanced, as negative results are not often reported or collated. Tools such as SMOTE [101], class weighting techniques, or loss functions that account for the imbalance can be used [34]; a class-weighted loss is sketched below.
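    For instance, PyTorch's binary cross-entropy loss accepts a positive-class weight, which can be set from the observed class ratio (the labels below are fabricated):

    ```python
    import torch
    from torch import nn

    y = torch.tensor([0.0] * 90 + [1.0] * 10)  # ~10% positive residues, a typical imbalance

    # Up-weight the positive class by the negative:positive ratio.
    pos_weight = (y == 0).sum() / (y == 1).sum()  # tensor(9.)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    logits = torch.zeros_like(y)  # placeholder model outputs
    loss = criterion(logits, y)
    ```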

  3.

    It is typically best to perform some normalization or scaling of input features; the intuition is to ensure a comparable contribution from each predictor to the model. However, one should be careful when scaling quantities measured in the same units and should consider whether it makes sense to normalize some features at all.

  4.

    Learnable parameters in the model are iteratively tuned during training by backpropagating computed error gradients. Various optimization techniques for this procedure exist, such as Stochastic Gradient Descent (SGD) and Adam [102]. The optimization technique is itself a hyperparameter of the architecture, and there may be additional hyperparameters associated with it, such as the learning rate. The learning rate is a small positive value (typically between 0 and 1) that can be thought of as controlling the "speed" at which a model learns, by scaling the size of the weight updates in response to the computed error gradient. Large learning rates typically allow for fast learning, at the cost of potentially resulting in a sub-optimal set of weights. Smaller values will require more epochs to train but may improve the final set of weights by finding better-quality minima. The learning rate is often considered to be the most important hyperparameter to tune [103]. Typical starting points are 0.001, 0.01, or 0.1. Learning rate schedulers can be used to vary the learning rate over the course of training. A fuller discussion of optimization techniques can be found in [104].

  5.

    Training data should be shuffled between training epochs in order to reduce variance and reduce overfitting. Imagine a dataset where the examples are ordered by class. When a minibatch is selected from this dataset, it is desirable that the minibatch is representative of the true dataset in order to estimate the true error gradient. If the dataset is ordered in some manner, this is not achieved. In the context of regular SGD, shuffling is beneficial as it ensures each example produces an independent change in the model that is not biased by ordering artifacts.

  6.

    Deep learning models have many learnable parameters within the model, whereas the architecture as a whole is determined by hyperparameters. Non-exhaustively, these include the number of epochs, the size and depth of the layers, the size of each training batch, and the learning rate. These hyperparameters form a search space that must be traversed to produce a performant predictor. Various strategies for this include: random search [105], grid search and Bayesian optimization techniques [106].

  7.

    When calculating evaluation metrics for predictions, the practitioner is left with the question of whether to compute the metrics for each protein and average them (macro-averaging) or to compute the metrics across all the individual amino acid predictions in the dataset pooled together (micro-averaging). While pooled per-residue results are typically reported in the literature, it is recommended to monitor both during development in order to gain a fuller understanding of the model. For instance, this allows a better understanding of the effect of protein size on performance.

  8.

    Overfitting occurs when the model fits the training data too closely and generalizes poorly to unseen data points. This can be observed by comparing performance on the training and validation datasets. A variety of techniques is available for combatting overfitting. Early stopping involves monitoring the validation loss during training and halting the training process if the loss starts to increase. As the validation loss can fluctuate, a patience hyperparameter can be used, whereby training is allowed to continue for a number of subsequent epochs and halted only if no improvement is seen (a minimal sketch is given below). Regularization is a family of techniques that includes adding penalties to the loss function based on the size of the model weights. Large weights can result in large updates to the network, a phenomenon known as exploding gradients, which can result in "dead" neurons, whereby they are shifted away from the training manifold. Dropout is another technique, whereby subsets of neurons are "switched off" randomly during each training epoch to encourage the learning of distributed internal representations and prevent over-reliance on individual neurons.
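    A minimal sketch of early stopping with a patience counter; train_one_epoch and validate are placeholder stubs standing in for a real training loop and validation pass.

    ```python
    import random

    def train_one_epoch():  # placeholder: one pass over the training data
        pass

    def validate() -> float:  # placeholder: returns the validation loss
        return random.random()

    best_val_loss = float("inf")
    patience, epochs_without_improvement = 10, 0

    for epoch in range(200):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0  # improvement: reset the patience counter
            # a checkpoint of the best weights would be saved here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # halt: no improvement for `patience` consecutive epochs
    ```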

  9.

    Feature ablation studies involve training a model on restricted subsets of the available features in order to interrogate their usefulness and contribution to the final model performance. This can also be used to perform feature selection. Due to the curse of dimensionality (where the dimensionality of the predictors is greater than n, the number of training examples), which results in the training data space being very sparse, identifying highly predictive features is important to improve prediction accuracy and reduce overfitting to the training data.