Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling

Zhang, Ruibo; Nolte, Daniel; Sanchez-Villalobos, Cesar; Ghosh, Souparno; Pal, Ranadip

doi:10.1038/s41467-024-49372-0

Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling

Article
Open access
Published: 13 June 2024

Volume 15, article number 5072, (2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue

Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling

Download PDF

2234 Accesses
4 Altmetric
Explore all metrics

Abstract

Quantitative structure-activity relationship (QSAR) modeling is a powerful tool for drug discovery, yet the lack of interpretability of commonly used QSAR models hinders their application in molecular design. We propose a similarity-based regression framework, topological regression (TR), that offers a statistically grounded, computationally fast, and interpretable technique to predict drug responses. We compare the predictive performance of TR on 530 ChEMBL human target activity datasets against the predictive performance of deep-learning-based QSAR models. Our results suggest that our sparse TR model can achieve equal, if not better, performance than the deep learning-based QSAR models and provide better intuitive interpretation by extracting an approximate isometry between the chemical space of the drugs and their activity space.

Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR

Article 08 December 2023

QPHAR: quantitative pharmacophore activity relationship: method and validation

Article Open access 09 August 2021

QuanDB: a quantum chemical property database towards enhancing 3D molecular representation learning

Article Open access 29 April 2024

Introduction

Quantitative structure–activity relationship (QSAR) models have become an essential tool in pharmaceutical discovery, especially in the virtual screening for hits and lead optimization stages¹. Experimental characterization of candidate molecules is expensive and time-consuming. As a relatively easy-to-implement alternative, QSAR models could be a valuable tool for assisting chemists by providing design ideas to prioritize their experiments. QSARs are usually supervised machine learning models that describe the connections between chemical structures and their biological activities, such as their potency, physicochemical properties, pharmacokinetic properties, or environmental effects². QSAR models enable in silico structural design by providing property predictions from machine-readable representations of the chemical structure, thereby hel** generate and prioritize design ideas. This technique has been widely applied in virtual screening and lead optimization with a fair amount of success^1,3.

In QSAR methods, chemical substances must first be transformed into machine-comprehensible mathematical representations. Three commonly used representations are (a) vectors such as classical molecular descriptors or molecular fingerprints (FPs), (b) graphs, and (c) strings such as Simplified Molecular Input Line Entry System (SMILES). Classical molecular descriptors⁴ encode a specific computed or measured attribute of the molecule into a single number, for instance, the count of bonds, atoms, functional groups, or physicochemical characteristics, and are often used in combination to form feature vectors. PaDEL⁵, Mordred⁶, and RDKit are examples of popular descriptor-calculation software packages for numerically representing the chemical structure and molecular characteristics. Extended-connectivity fingerprints (ECFPs)⁷ are an example of a topological fingerprint computed using a variant of the Morgan Algorithm that encodes chemical substructures by atom neighborhoods using a high-dimensional sparse bit-string representation. The graph representation, on the other hand, characterizes 2D chemical structures as graphs, with atoms as vertices and bonds as edges. SMILES specify a notation for representing the chemical graphs of molecules as strings of characters.

Once the chemical structures are represented using a suitable protocol, a predictive method is chosen to connect the structural information with the functional properties. For instance, if the chemical structures are represented as strings or graphs, deep-learning methods are often used for prediction due to their ability to perform embedded feature extraction. Chemprop⁸, in particular, has turned out to be a popular method that uses directed message-passing neural networks to learn molecular representations directly from the graphs to predict the properties of molecules. This method has been shown to excel at antibiotic discovery^9,10 and lipophilicity prediction¹¹ indicating its potential as a QSAR model. With the rise in popularity of large language models and the attention mechanism, the use of SMILES strings has been increasingly investigated for their potential embedded feature extraction, predictive performance, and interpretability. For example,¹² pre-trained a transformer-based network through masked SMILES recovery, and offered the pre-trained model for transfer learning onto specific tasks. Similarly, Transformer-Convolutional Neural Network (CNN)¹³ applied the transformer architecture to canonicalize SMILES string inputs and enables transfer learning of the model onto specific activity prediction tasks.

QSAR models are often developed for their predictive performance. However, the effectiveness of QSAR models, as a computational tool assisting molecular discovery and design, could be greatly improved by enhancing their domain-specific interpretability. Model interpretability, usually defined as the ability to explain predictions in a human-understandable way¹⁴, typically consists of computing feature importance scores^15,16,17,18, influence functions to identify training instances most responsible for the prediction¹⁹, develo** locally interpretable models to approximate global black-box algorithms^20,21,22, and generating counterfactuals^23,24. For example, standard shallow learners, like Random Forests (RF) and Support Vector Machines (SVM) are often used in QSAR modeling to offer feature importance scores²⁵. However, molecular interpretability is largely based on the interpretability of the underlying molecular representation. For instance, ALogP can be used as an important classical descriptor that plays a key role in determining the solubility of a molecule. However, a target value of ALogP cannot be mapped back to a precise chemical structure. When using interpretable fingerprints, the foregoing feature importance scores could potentially map prediction contributions onto the molecule to visualize which substructures positively or negatively impacted the prediction^25,26,27. Although feature importance measures increase the explanatory power of machine learning models, caution must be taken when these scores are invoked on molecules outside the applicability domain of the model, as prediction importance does not always translate to biological relevance²⁸. Locally interpretable models can be fitted to explicate predictions of black-box models. For instance, SHapley Additive exPlanations (SHAP) offers a model-agnostic method for calculating prediction-wise feature importance^21,22. Since this technique usually informs which features contributed to the specific test instance’s model prediction the most, it may not always lead to actionable design ideas. Thus²⁴, proposed Molecular Model Agnostic Counterfactual Explanations (MMACE) to generate counterfactual explanations that would help answer the question: what changes will result in an alternate outcome, regardless of the underlying model used. These methods are based on the model’s knowledge and, therefore, may be influenced by chance correlation, rough response surfaces, and overfitted models, leading to disappointing results²⁹. Recent advances in the attention mechanism of deep learners offer some explanatory power³⁷. These tools allow chemists to assess the potential of the selected analogous neighbors to infer properties of the query chemical. Additionally, similarity-based methods allow informative visualizations through network graphs derived from the similarities. Network-like Similarity Graphs (NSG)³⁸ were developed to guide lead optimization in drug discovery and have often been used to display the complex activity landscapes and the relationships between chemicals within a target set in 2D. Expanding this to drug-target interactions, methods like Similarity Ensemble Approach (SEA)³⁹ and Chemical Similarity Network Analysis Pulldown (CSNAP)⁴⁰ enable visualization of drug-target interaction networks and the prediction of off-target drug interactions, which have led to deeper investigations into drug polypharmacology and the discovery of off-target drug interactions^41,42. As we show later, these chemical similarity networks allow the clustering of similar molecules, which enables practitioners to mine regions of desired activity for innovative design ideas and potential leads. In addition to providing prediction-wise training instance importance, these graph structures are directly compatible with Laplacian Scores⁴³ for global feature importance, which have been used in QSAR modeling for feature selection^44,45. Since SHAP and MMACE are model agnostic, they can also be paired with similarity-based QSAR models to allow prediction-wise feature importance and the generation of unseen counterfactuals. Thus, similarity-based methods can provide multiple layers of interpretability on top of the commonly applied chemical similarity interpretation and visualization methods listed above.

However, a problem in similarity-based QSAR is that most QSAR methods assume that similar structures lead to similar activities, which is often violated in chemical structure modeling due to the prevalence of activity cliffs (ACs)⁴⁶, which are pairs of compounds with similar molecular structures, but with a large difference in potency against their target⁴⁷. The existence of ACs often cause QSAR models to fail, especially in the lead optimization stage⁴⁸, and limit the prediction performance across the drug landscape, leading to the use of network-based methods to interpret and analyze their behavior^38,49. One way to use similarity-based methods in the presence of ACs is to learn the similarity metric from the data itself, instead of choosing a similarity metric a priori. Large margin nearest neighbor⁵⁰ is a very popular algorithm for supervised metric learning when the response variable is categorical. For continuous response variables, Metric Learning Kernel Regression (MLKR)⁵¹ is perhaps the most popular algorithm to estimate the similarity metric. Metric learning techniques offer good explanatory power because once the metric is learned, the chemical space of molecules is approximately isometric to the activity space, resulting in smoother structure–activity landscapes as shown in⁵². Consequently, under the learned metric, high-activity molecules are clustered relatively tightly in the chemical space and therefore, that space could be mined for new molecules. Figure 1 depicts this phenomenon using various projection methods Generative Topographic Map** (GTM)⁵³, Multidimensional Scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), to show the interpolated activity landscapes of the protein target Coagulation factor XIII, or CHEMBL4530, in 2 dimensions and compare them with an MLKR-based representation. Observe that, except MLKR, none of the other methods were able to separate two chemically-similar-but-funcationally-different molecules, CHEMBL208650 and CHEMBL2086502, which have a Tanimoto similarity between ECFP4 fingerprints of 0.70 but target difference of 2.61. This is to be expected because, in their original form, GTM, MDS, t-SNE, UMAP are all unsupervised techniques and do not incorporate the activity information in their projections. MLKR, on the other hand, is a supervised metric learning method, which allows it to incorporate the target activity information resulting in smoother activity landscapes.

**Fig. 1: The transformed chemical space and interpolated activity surface of target CHEMBL4530 using various projection methods.**

In this paper, we develop an MLKR-inspired regression-based technique, topological regression (TR), that models the distance in the response space using the distances in the chemical space. TR essentially builds a parametric model to determine how pairwise distances in the chemical space impact the weights of nearest neighbors in the response space. Observe, unlike metric learning techniques, TR does not attempt to learn a metric in the response space, nor does it attempt to provide a lower dimensional projection like MDS or GTM. Rather, TR simply estimates the weights of nearest neighbors. In comparison to traditional modeling methods, like RFs and SVMs, which are dependent on a predefined fingerprint, TR can accommodate non-metric systems and does not crucially require coordinates for each instance. As we will show in the subsequent sections, TR can work on the similarities between training molecules, such as those computed from molecular kernels^54,55, thereby circumventing the problem of featurization of molecules. Since, our primary use-case scenario is QSAR in the lead identification/optimization process, where the contiguity of high-activity molecules plays a significant role, we perform a large-scale comparison on 530 ChEMBL bio targets. We use RF, ChemProp, and Transformer-CNN as baseline models and show that TR matches the performance of Transformer-CNN at a significantly less computational cost. We also observe, empirically, that TR produces numerically superior predictive performance as compared to the other competing methods. Additionally, both MLKR and TR produce reasonably contiguous areas of high activity, thereby identifying a relatively compact high-activity chemical space.

Results

Model performance comparison on ChEMBL datasets

We apply our TR method with Gaussian kernel neighbor weighting on 530 ChEMBL datasets under both random split and scaffold split. As explained earlier, we use the ECFP4 TC distance as input to TR to predict the activity values. We use 80% of all the instances in each dataset for training and the remainder for testing. For the construction in the section “multivariate construction of topological regression”, when I^* ∩ I = ϕ, we use 20% of the training instances as anchor points and the remaining 80% of the training set for neighborhood training. We denote this method as TR* in the results. For the approach described in the section “univariate construction of topological regression” without disjointedness requirement, we use 50% of training instances, with a maximum of 2000 instances to improve computation time, as anchor points, and those results are denoted as TR. Finally, to reduce the sensitivity of results to anchor point selection, and to improve generalization error, different random sets of anchor points were sampled to create an ensemble of TR models(see the section “ensemble topological regression”). We denote this method as Ensemble TR and used t = 15, μ_k = 0.6, and ${\sigma }_{k}^{2}=0.2$ for the subsequent results.

The average Spearman correlation and NRMSE for each method (RF, MLKR with KNN, ChemProp, TCNN, TCNN with augmentation, TR*, TR, and Ensemble TR) on both splitting scenarios are shown in Table 1. Figure 2 compares each method using boxplots showing the distribution of the performances for both random and scaffold splitting. As expected, TR* is unable to achieve performance comparable to the competing methods as the model is being constrained by the disjointedness requirement. When we relax this requirement, we observe that TR’s predictive performance improves considerably and is only numerically inferior to TCNN with augmentation. Finally, when we incorporate an ensemble of TR models, the predictive performance of Ensemble TR is essentially as good as that of TCNN with augmentation. If we invoke the law of parsimony, our conceptually straightforward, and mathematically less complex, topological regression approach appears to be more appealing as compared to competing deep learning techniques.

Table 1 Comparative measurements obtained on each of the competing methods

Full size table

**Fig. 2: Comparative analysis of model performances on the 530 ChEMBL bioactivity datasets.**

Computational comparison on ChEMBL datasets

To illustrate the computational efficiency of TR and Ensemble TR, we report each competing method’s average training time, testing time, and peak RAM consumption across all 530 datasets. These results are shown in Table 2. For fair comparison and to provide the best optimized hardware for each model, we trained the deep learning models on systems with GPUs as the training of deep learning-based models are better optimized in GPU based systems. Since the pre-trained TCNN model was released and used for fine-tuning, the reported TCNN time does not include pre-training time. From the results, we observe that TR and Ensemble TR result in the fastest training times and significantly less peak RAM consumption. For testing, TR takes more time than MLKR since RBF kernels are employed compared to MLKR which simply uses 5-NN predictions after transformation, however TR still results in faster test times than TCNN. These results demonstrate the computational efficiency of TR.

Table 2 Computational complexity comparison of competing methods showing training time, testing time, and peak RAM consumption on the scaffold split

Full size table

Interpreting TR

Inspection of the regression coefficients in B demonstrates how TR offers more flexibility as compared to standard KNN. Recall, W_K,m, K ∈ I^*, m ∈ I quantifies the impact of Y_K on Y_m. Now, in an ordinary KNN inverse distance weighting scheme, as distance between the Kth instance and mth instance increases in the chemical space, W_K,m decreases, i.e., $\frac{\delta }{\delta {d}_{K,m;X}^{2}}{W}_{K,m} \, < \, 0$. However, for TR $\frac{\delta }{\delta {d}_{K,m;X}^{2}}{W}_{K,m}={W}_{K,m}{b}_{KK}$. Now W_K,m > 0 by construction, therefore $sign(\frac{\delta }{\delta {d}_{K,m;X}^{2}}{W}_{K,m})$ depends upon the sign(b_KK). Hence, TR can push molecules closer in chemical space far apart in the response space. What this implies is, the prediction generation process for TR can be interpreted in the same vein as that used by KNN, except, unlike KNN, TR searches for nearest set of anchor points in the response space.

We use the chemical space of the drugs targeting Phospholipase D2 (ChEMBL ID: CHEMBL2734) to demonstrate this phenomenon. In Fig. 3 we seek to predict the response corresponding to the molecule CHEMBL492559 (denoted by a red star, pChEMBL= 6.73) in the test set. Based on similarity in the chemical space, standard KNN finds three molecules, CHEMBL492558, CHEMBL492704, and CHEMBL492588, as nearest neighbors, under a 5-fold cross-validation protocol, and makes predictions based on the average of the activities of these three molecules. However, the target molecule is almost at the edge of a high-activity region. Therefore, naive KNN identifies two neighbors, CHEMBL492704 and CHEMBL492588 from the nearby low activity region (across the cliff) and only one neighbor CHEMBL492558 from the ideal high activity region. This happens because the high-activity region in the neighborhood of the target molecule is sparsely populated. In contrast, since TR directly incorporates Y in the learning, it identifies three cross-cliff molecules, CHEMBL494008, CHEMBL4581260, and CHEMBL1254736, that have greater weights in predicting the response associated with the target molecule as compared to CHEMBL492704 and CHEMBL492588. Observe that all three molecules identified by TR as nearest neighbors (CHEMBL494008, CHEMBL4581260, CHEMBL1254736) are in relatively high-activity regions. By presenting structures from diverse scaffolds that exhibit similar activities, TR not only enhances prediction reliability but also aids in the identification of key spatial structural characteristics influencing the activities. The presented structures can be further validated with structural chemical methods such as structural alignment or docking simulations.

**Fig. 3: Comparative analysis of the neighbors found by a KNN procedure and a TR procedure.**

To further illustrate this point across the entire dataset, rather than for one particular test molecule, we generated KNN-graphs depicting the predictions of the various similarity-based methods with the color indicating the activity elicited by the molecules. To do so, each training and test sample was represented as a node, and the predicted neighbors were considered as the connecting edges. These graphs are synonymous with NSGs, in fact, just like NSGs, the edges were only included if the similarity was greater than a fixed cutoff TC and if the molecules were predicted as one of the nearest neighbors. Therefore, the number of neighbors and the cutoff TC control the connectedness of the network graphs, more connections would be established with a larger number of nearest neighbors and lower cutoff similarities until the graph is complete. We used 5 nearest neighbors and the mean similarity of the entire target dataset as the cutoff TC for each competing method for all subsequent network graphs, meaning at most 5 connections would be established if their similarities were greater than the fixed cutoff TC. An example of these KNN-graphs, depicting the test nearest neighbors of a single CV fold of the dataset CHEMBL2734, is included in Fig. 4. Additional figures depicting the training predictions, testing predictions, and molecules within the most active cluster are included in the supplementary document. Notice that the predicted TR neighbors are similar in response value, leading to more homogeneous activity throughout the clusters, whereas KNN and MLKR both result in clusters containing diverse activity values. To quantify this variability, we included the average within-cluster standard deviations for each method in the figure where a low within-cluster standard deviation denotes a more homogeneous cluster.

**Fig. 4: k-Nearest neighbor graphs visualizing the 5 nearest neighbor predictions of K-nearest neighbors (KNN), metric learning for kernel regression (MLKR), and topological regression (TR).**

To systematically show this behavior across all 530 datasets, we calculated the average within-cluster standard deviation from the foregoing test prediction KNN-graphs for the competing methods. Figure 5 depicts these results in the form of a line graph across all 530 datasets. Clearly, TR systematically produces lower within-cluster standard deviation compared to KNN and MLKR, resulting in higher levels of homogeneous activity within the clusters. If we envision activity cliffs to be a phenomenon that induces a strong outlier within an otherwise homogeneous cluster, then it stands to reason that by measuring within-cluster homogeneity we can infer about the presence of cliffs in that cluster. Higher levels of within-cluster homogeneity essentially smooths out activity cliffs resulting in more relevant similarity-based predictions and providing practitioners with instance-wise similar molecules for lead optimization.

**Fig. 5: Intercluster standard deviation computed accross the 530 ChEMBL datasets.**

Since TR results in more homogeneous clusters, the clusters themselves can be more meaningfully mined by chemists for innovative design ideas, potential target leads, and lead optimization pathways. For example, clustering can be performed on the training data, and the most active cluster may contain molecules with specific features that practitioners can use to guide designs and future experiments. The same can be done with the least active cluster to see which molecular features to avoid and provide further insights. Furthermore, the most active training cluster can be mined for lead molecules that have other desired characteristics, such as low toxicity or ease of production. Analogous to NSGs, the training clusters can also be used to visualize lead optimization pathways. Figure 6 depicts a lead optimization pathway in the most active cluster of target protein complex Integrin alpha-4/beta-7 (CHEMBL278) with (a) the TR KNN-graph obtained from the training data of a single CV fold, (b) the most active cluster depicted as a minimum spanning tree with the minimum spanning path between the most active and least active molecules depicted in red, and (c) 5 example molecules from the lead optimization pathway connecting the most active and least active molecules in the cluster. These pathways can be traversed by chemists to envision what changes resulted in specific behaviors, allowing them to easily analyze the current state of a target dataset and discover potential design ideas (additional figures representing optimization pathways for various target datasets are provided in the Supplementary document). If we envision an untested molecule as an additional node in Fig. 6a, the TR method could directly produce the set of edges radiating from that node (via the model for W) that would enable one to assess how the untested molecule relates with the previously tested molecules. This could enable greater trust in the predictions as the chemist could easily visualize how the new sample relates to known molecules. Additionally, these graphs fit directly with Laplacian Scores for feature selection, allowing global feature importance to be calculated in a routine fashion. Lastly, when paired with SHAP or MMACE, which are model agnostic, TR would be able to efficiently generate instance-wise feature importance and unseen counterfactuals, adding additional layers to TR’s interpretability.

**Fig. 6: Optimization pathway visualization in the most active training cluster of target protein complex Integrin alpha-4/beta-7 (CHEMBL278).**

Discussion

In this paper, we have developed a statistical methodology, topological regression (TR), to perform similarity-based regression and demonstrated how it can be used for QSAR modeling. We tested TR on regression tasks with 530 ChEMBL human targets and compared it with a traditional RF, Nearest Neighbors, a metric learning algorithm (MLKR), and two deep learning methods, ChemProp and Transformer-CNN. Empirically, we observed that TR or ensemble TR compared favorably against all competing methods in terms of predictive accuracy on the scaffold split and achieved comparable performance with TCNN on the random splitting at a much lower computational cost. Most importantly, TR provides explainability, visual interpretability, and theoretical justifiability in the form of testable adequacy and optimal model size.

The performances of RF, TCNN, ChemProp, and MLKR are mostly interpreted in a comparative sense. The usual measures employed to assess the performance of these models - NRMSE, MAE - have unbounded support and hence do not offer information about the goodness-of-fit. TR on the other hand completely relies on multivariate general linear models - geographically weighted regression when extracting W_i,j from the drug response, and standard regression theory when modeling W_i,j. For both of these techniques, rigorous tests for goodness-of-fit exist^56,57. Since the standard coefficient of determination offers an immediate goodness-of-fit statistic for linear models (or transformed linear models), we compute the training R-sq values (using (7)) for all 530 ChEMBL datasets considered in this paper. The average R-sq turns out to be 0.8396. Evidently, our conceptually straightforward parametric linear model has sufficient power to explain variation in W_i,j. Turning to predictive adequacy, we compute the prediction interval for the W’s (using extracted W’s as targets) in the cross-validation set. Once again, the linear model specification allows us to compute the prediction interval analytically. We then compute the coverage of these prediction intervals across all folds. Ideally, we would like to see the coverage of the prediction interval achieve a nominal level. In all the 530 datasets across all the folds, the coverage of 95% prediction interval is 94.3%. Clearly, the model specified in (7) is adequate for prediction purposes as well. These results provide empirical justification for the adequacy of the TR model.

Given the small to moderate sample size in ChEMBL datasets, model complexity has a significant impact on prediction performance. For ChemProp or TCNN-type deep learners, regularization of network weights, drop-out layers, and ablations are standard procedures to control model complexity. However, these measures are adhoc and their theoretical properties are not well established. For standard KNN (or even in MLKR), the number of neighbors determines the model size. However, we need to fix the number of nearest neighbors a-priori and tune that quantity via cross-validation. TR, on the other hand, offers a theoretically appropriate way to choose neighborhood size and hence model complexity. In TR, the anchor points play the role of neighbors and ∣I ^*∣ determines the size of the coefficient matrix B. Consequently, changing ∣I ^*∣ yields sequences of nested models, and hence standard model selection techniques, for instance, AIC or BIC, could be used to identify the appropriate size of I ^* without resorting to cross-validation. Since AIC/BIC automatically penalizes model complexity for a given sample size, we can arrive at an optimal model complexity for TR.

Furthermore, TR provides an intuitive explanation of its predictive mechanism based on nearest neighbors in the response space as shown through KNN graphs in the section “interpreting TR”. This explanation could be gleaned from MLKR as well. However, the computational complexity associated with semi-definite programming, required in MLKR, is considerable if the dimension of the input space is high. TR, on the other hand, directly learns the weights associated with neighboring responses, and, by a suitable transformation, estimates the parameters in an unconstrained fashion. This leads to a significant reduction in computational expense as reported in Table 2.

Finally, the visual representation of TR’s predictive mechanism could provide design ideas and allow fast knowledge-based model validation. We anticipate that our framework will have practical value in drug discovery or other QSAR tasks and assist in designing new molecules more effectively.

Methods

Data description and problem motivation

We begin with a description of the datasets that we use to illustrate the comparative performances of the competing models. We offer a brief description of ChemProp, Transformer-CNN and MLKR methods and then outline the motivation behind develo** the TR framework.

Dataset

Since our focus is on QSAR modeling in the lead optimization phase of drug discovery, we choose to assess the performance of competing models on well-curated datasets with single target bioactivity. For this purpose, we downloaded data from the ChEMBL database⁵⁸ following the extraction protocol of⁵⁹. This included only selecting ‘SINGLE PROTEIN’ or ‘PROTEIN COMPLEX’ human targets with confidence scores of 9 and 7, respectively. Additionally, only pCHEMBL values, which are comparable bioactivity measures of half-maximal response (IC50, XC50, EC50, etc.) on a negative logarithmic scale, were selected. We refer the readers to⁵⁹ for further data extraction details.

In the cleaning phase, we first removed the datasets that were too small to train ChemProp and Transformer-CNN. Within each dataset, we further removed instances with duplicated SMILES and instances with chemically invalid SMILES strings which could not be converted to RDKit molecules. Finally, we had 530 datasets on various human target bio-activities. Sample size ranged from 100 to 7890 with the median sample size being 677. The various target activities, referred to as pChEMBL values, were used as the univariate response variable.

Although several representative descriptors and fingerprints (for example: RDKit descriptors, Mordred⁶, ECFP4⁷) are available, we mainly focus on ECFP4 representation for similarity-based predictive models because, empirically, this representation offered the best predictive performance. We relegate the results demonstrating the superior predictive performance of the ECFP4 representation to the Supplementary Material. We calculate folded ECFP4 fingerprints using RDKit’s implementation of the Morgan algorithm with a radius of 2 atoms and bit-size of 1024. Since the output of this representation system is binary, we use the Tanimoto coefficient (TC) as a measure of similarity and 1 − TC as a measure of distance for TR. No standardization steps were required as RDKit was used to extract ECFP4 fingerprints. The ECFP4 fingerprints were used to train the RF model, whereas Chemprop used the SMILES string inputs to internally extract the graph representations and Transformer-CNN directly used the SMILES strings.

ChemProp

We used ChemProp as a baseline model because of its demonstrated utility in drug discovery. ChemProp is a full-fledged Graph Convolutional Neural Network model that takes 2D representations of molecules as predictors. We employed ChemProp’s Bayesian hyperparameter optimization, which optimizes the hidden size, depth, dropout, and the number of feed-forward layers, and trained the model for 100 epochs for all datasets.

Transformer-CNN

We also used Transformer-CNN (TCNN) as a baseline model as it is self-proclaimed to be a Swiss-army knife for QSAR modeling. TCNN is a pre-trained model on over 17 million pairs of strings for the task of SMILES canonicalization. The output of the transformer encoder is then used to generate model-acquired FPs, which are used for downstream prediction through task-trained Text-CNN and convolutional highway layers. In addition, the architecture enables data augmentation by ensembling the results from multiple non-canonical smiles for each sample. Lastly, the architecture contains practically no hyperparameters and enables learning rate scheduling and early stop**, limiting the need for hyperparameter optimization. This mixture of large pre-training, sample augmentation, and string-size agnostic architecture results in a powerful prediction model. We followed the TCNN instructions and trained the model on the SMILES strings, with and without augmentation, for at most 35 epochs as learning rate scheduling and early stop** were employed.

Metric Learning Kernel Regression

The purpose of metric learning is to find a distance metric for a specific task through supervised learning. The metric found by metric learning could subsequently be used in KNN regression or kernel regression for generating predictions and visualizations. For regression tasks, MLKR⁵¹ finds the Mahalanobis metric that minimizes the cumulative leave-one-out CV error ${{{{{{{\mathcal{L}}}}}}}}={\sum }_{i}{({Y}_{i}-{\hat{Y}}_{i})}^{2}$, where Y_i is the numeric response variable of the i-th training sample and $\hat{Y}=\frac{{\Sigma }_{j\ne i}{Y}_{j}{W}_{ij}}{{\Sigma }_{j\ne i}{W}_{ij}}$ with W_.,. being the weights associated with Gaussian kernels. In particular, the transformation matrix L used to obtain the learned metric can be written as a decomposition of Mahalanobis matrix M = L^TL. After L is learned from the data, the original coordinate system of the predictor space X is transformed into the new coordinate system given by LX. Thus, MLKR learns a global space transformation, which can be used to calculate the distance in the response space. Then KNN-regression or similarity-based kernel regression can be performed to provide predictions and interpretation.

However, in order to compute distances, we first need to characterize the molecules in a fashion such that distances can be computed. As mentioned, we focus on ECFP4 fingerprints, which is thus the initial coordinate system supplied to MLKR to learn the transformation and produce a new coordinate system such that the predictor space is approximately isometric to the response space. Figure 7 illustrates this phenomenon. In the left panel, we computed the pairwise Tanimoto distances among all the molecules targeting Mitogen-Activated Protein Kinase 12 (ChEMBL ID: CHEMBLE1908389) using ECFP4 features and projected them in 2D MDS space. The intensity of the pixels indicate the response each molecule elicited. In the right panel, we used the distance metric learned from MLKR to generate the 2D coordinates. Observe how the two molecules, CHEMBL3727733 and CHEMBL3729567, which appeared to be neighbors in the chemical space, were pushed apart after the MLKR transformation. Additionally, we observe a smoother spatial trend in the image produced after the MLKR transformation which allows us to use KNN or kernel regression - with purely distance-dependent kernel elements - for prediction purposes.

Fig. 7: The 2-D multidimensional scaling (MDS) of the original chemical space and metric learning for kernel regression (MLKR) transformed space of ChEMBL target mitogen-activated protein kinase 12 (CHEMBL1908389).

Comparison procedure

To compare model performances we design two types of data splits: (a) random split and (b) scaffold split. Random split is done with 5-fold cross-validation with 80% training and 20% testing in each fold. The scores of the five folds are averaged as the final score. In drug discovery, new structures are often proposed by editing on the scaffold of a known good candidate. Predictions are more likely to fail across scaffolds due to greater chemical dissimilarities. Scaffold split makes sure the training and test samples belong to different Murcko scaffolds - mimicking scenarios when predictions for a new structure of a different scaffold is sought. Since full-blown cross-validation is not feasible with scaffold splits, we use a single hold-out set comprising approximately 20% of data points for each ChEMBL dataset. We use Spearman ρ and Normalized Root Mean Square Error (NRMSE) to compare the candidate models’ capabilities to generate predictions. In the section “Results”, we compare these two metrics obtained from ChemProp with those obtained from MLKR-KNN under both splitting scenarios for all 530 ChEMBL datasets and observe that MLKR-KNN offers numerically superior performance as compared to ChemProp, even though MLKR is not directly a regression technique.

This empirical observation motivates us to develop TR based on a distance formulation and thereby make the MLKR-type strategy amenable to statistical inference. We observe that in the MLKR procedure, a lot of effort is undertaken to ensure that the transformed space is indeed a metric space. However, for prediction, a weighted averaging of the responses from nearest neighbors is performed. Notice that symmetry and non-negativity are the only two conditions required for those weights (W_ij). Therefore, we contend that we can directly work with W_ijs instead. We then proceed to show that, under suitable distributional specification, an explicit estimator of E(W_ij) could be obtained. Since the estimand is an expectation operator, standard statistical theory (delta method, residual bootstrap) could be brought to bear to assess the statistical properties of this estimator. To the best of our knowledge, such statistical assessment of the estimates produced by vanilla MLKR is not available.

Multivariate construction of topological regression

Topological regression (TR) is a similarity-based regression framework that connects the distances in the chemical space with non-negative weights appearing in nearest neighbor regression defined on the response space. The model is illustrated in Fig. 8. More specifically, we specify a multivariate regression model for the weights W_ijs and derive a closed-form expression for the estimator of E(W_ij) under an inverse distance-weighting scheme. Subsequently, we also offer a discussion on an approximate estimator of the foregoing quantity when the weighting is done using a Gaussian kernel.

**Fig. 8: Overview of the proposed topological regression framework.**

Let ${{{{{{{\mathcal{D}}}}}}}}$ represent the set of all training points. First, we partition ${{{{{{{\mathcal{D}}}}}}}}$ into a set of K anchor points and $N=| {{{{{{{\mathcal{D}}}}}}}}| -K$ neighborhood-training points. Let ${I}^{*}=\{{i}_{1}^{*},{i}_{2}^{*},...,{i}_{K}^{*}\}$ be the set of indices associated with the anchor points and I = {i₁, i₂, . . . , i_N} be the indices associated with the neighborhood-training points, with I ^* ∩ I = ϕ and ∣I ^*∣ < ∣I∣. Let ${Y}_{{i}_{j}},{i}_{j}\in I$ be the response associated with the i_jth instance in the set I. Our goal is to express ${Y}_{{i}_{j}}$ as a linear combination of responses ${Y}_{{i}_{j}^{*}}$ belonging to the set I ^*, i.e.

$${Y}_{{i}_{j}}={\sum}_{{i}_{l}^{*}\in {I}^{*}}{W}_{{i}_{l}^{*}{i}_{j}}{Y}_{{i}_{l}^{*}},\forall {i}_{j}\in I$$

(1)

where ${W}_{{i}_{l}^{*}{i}_{j}}$ is a non-negative weight that determines the contribution of the response associated with the lth point in I ^* towards the response associated with the jth point in I. Such non-negative weights are fairly common in distance-weighted regression, for instance, in geographically weighted spatial regression models, often the weights are specified in terms of Gaussian kernels, i.e., ${W}_{{i}_{l}^{*}{i}_{j}}=\exp (-\beta {d}_{{i}_{l}^{*},{i}_{j}}^{2})$ with d²(. ) being a squared Euclidean distance and β > 0 controlling the smoothness of the random field.

Neighborhood training model

Customarily, the weights are expressed as a deterministic function of the distances in the predictor space. In standard KNN regression, we assume that distance in the predictor space is proportional to the distance in the response space. In metric learning, a transformation of the predictor space is learned such that there is an approximate isometry between the transformed predictor space and the response space. In TR, we instead write a formal statistical model to connect ${W}_{{i}_{j}^{*}{i}_{j}}$ with the squared Euclidean distances in the predictor space in the following fashion:

We define the weights

$${W}_{{i}_{j}^{*}{i}_{j}} \,=\, 0\,\,{{\mbox{if}}}\,\,{i}_{j}^{*}={i}_{j}\\ \, > \, 0\,\,{{\mbox{if}}}\,\,{i}_{j}^{*} \, \ne \, {i}_{j}$$

(2)

and since we have I ^* and I to be disjoint and the responses could be assumed to be absolutely continuous, we can define

$${\tilde{W}}^{N\times K}=\left[\begin{array}{cccc}\log ({W}_{{i}_{1}^{*},{i}_{1}})&\log ({W}_{{i}_{2}^{*},{i}_{1}})&\cdots \,&\log ({W}_{{i}_{K}^{*},{i}_{1}})\\ \log ({W}_{{i}_{1}^{*},{i}_{2}})&\log ({W}_{{i}_{2}^{*},{i}_{2}})&\cdots \,&\log ({W}_{{i}_{K}^{*},{i}_{2}})\\ \cdots \,&\cdots \,&\cdots \,&\cdots \\ \log ({W}_{{i}_{1}^{*},{i}_{N}})&\log ({W}_{{i}_{2}^{*},{i}_{N}})&\cdots \,&\log ({W}_{{i}_{K}^{*},{i}_{N}})\end{array}\right]$$

(3)

with the entries in $\tilde{W}$, i.e., ${(\tilde{W})}_{{i}_{j}^{*}{i}_{j}}$ being real quantities. Define the squared Euclidean distance matrix in the predictor space as

$${D}_{X}^{N\times K}=\left[\begin{array}{cccc}{d}_{{i}_{1}^{*},{i}_{1};X}^{2}&{d}_{{i}_{2}^{*},{i}_{1};X}^{2}&\cdots \,&{d}_{{i}_{K}^{*},{i}_{1};X}^{2}\\ {d}_{{i}_{1}^{*},{i}_{2};X}^{2}&{d}_{{i}_{2}^{*},{i}_{2};X}^{2}&\cdots \,&{d}_{{i}_{K}^{*},{i}_{2};X}^{2}\\ \cdots \,&\cdots \,&\cdots \,&\cdots \\ {d}_{{i}_{1}^{*},{i}_{N};X}^{2}&{d}_{{i}_{2}^{*},{i}_{N};X}^{2}&\cdots \,&{d}_{{i}_{K}^{*},{i}_{N};X}^{2}\end{array}\right]$$

(4)

We define a simple multivariate linear regression model connecting $\tilde{W}$ with D_X. Consider the mth row of $\tilde{W}$. Observe that, this row consists of the weights used to express the mth response in I using all the responses in I ^*. We envision this row to be a set of repeated measurements taken on the mth point in I from the vantage points in I ^*. Thus, denoting the K elements in the mth row of $\tilde{W}$ by ${\tilde{W}}_{.,m}=({\tilde{W}}_{1,m},{\tilde{W}}_{2,m},\cdots \,,{\tilde{W}}_{K,m})$, the corresponding row of predictors in D_X by ${D}_{.,m;X}=({d}_{1,m;X}^{2},{d}_{2,m;X}^{2},\cdots \,,{d}_{K,m;X}^{2})$, and the matrix of regression coefficients by

$${B}^{K+1\times K+1}=\left[\begin{array}{cccc}{b}_{01}&{b}_{02}&\cdots \,&{b}_{0K}\\ {b}_{11}&{b}_{12}&\cdots \,&{b}_{1K}\\ \cdots \,&\cdots \,&\cdots \,&\cdots \\ {b}_{K1}&{b}_{K2}&\cdots \,&{b}_{KK}\end{array}\right]$$

(5)

we arrive at the following regression model

$${\tilde{W}}_{1,m} \, =\, {b}_{01}+{b}_{11}{d}_{1,m;X}^{2}+{b}_{21}{d}_{2,m;X}^{2}+\cdots+{b}_{K1}{d}_{K,m;X}^{2}+{\epsilon }_{1}\\ {\tilde{W}}_{2,m} \, =\, {b}_{02}+{b}_{12}{d}_{1,m;X}^{2}+{b}_{22}{d}_{2,m;X}^{2}+\cdots+{b}_{K2}{d}_{K,m;X}^{2}+{\epsilon }_{2}r\\ \cdots \\ {\tilde{W}}_{K,m} \, =\, {b}_{0K}+{b}_{1K}{d}_{1,m;X}^{2}+{b}_{2K}{d}_{2,m;X}^{2}+\cdots+{b}_{KK}{d}_{K,m;X}^{2}+{\epsilon }_{K}$$

(6)

with ${{{{{{{\boldsymbol{\epsilon }}}}}}}}=({\epsilon }_{1},{\epsilon }_{2},\cdots \,,{\epsilon }_{K}) \sim {{{{{{{{\mathcal{N}}}}}}}}}_{K}(0,\Sigma )$. Now assuming mutual independence across the N rows of $\tilde{W}$ and since N > K (by construction), we can obtain the MLEs of B and Σ. Let $\hat{B}$ and $\hat{\Sigma }$ denote their respective estimates. Then, for a new query point, we can compute $({d}_{1,query;X}^{2},{d}_{2,query;X}^{2},\cdots \,,{d}_{K,query;X}^{2})$ and, using $\hat{B}$, obtain the predictions $({\tilde{W}}_{1,query},{\tilde{W}}_{2,query},\cdots \,,{\tilde{W}}_{K,query})$. However, observe that (1) requires (W_1,query, W_2,query, ⋯ , W_K,query) to generate a prediction for the query point, and simply exponentiating the output, $\hat{\tilde{W}}$, of (6) will yield a biased estimate of W because $E(W)=E({e}^{\tilde{W}}) \, \ne \, {e}^{E(\tilde{W})}$ due to Jensen’s inequality. Therefore, we use the properties of the multivariate log-normal distribution to improve the estimate of W in the following way:

Clearly ${{{{{{{{\boldsymbol{W}}}}}}}}}_{.,m}={e}^{{\tilde{{{{{{{{\boldsymbol{W}}}}}}}}}}_{.,m}}$ where the exponent is taken coordinate-wise with ${\tilde{{{{{{{{\boldsymbol{W}}}}}}}}}}_{.,m} \sim {{{{{{{{\mathcal{N}}}}}}}}}_{K}({{{{{{{{\boldsymbol{\mu }}}}}}}}}_{.,m},\Sigma )$ and ${\mu }_{j,m}={b}_{0j}+{b}_{1j}{d}_{1,m}^{2}+{b}_{2j}{d}_{2,m}^{2}+\cdots+{b}_{Kj}{d}_{K,m}^{2}$. Then the usual relationship between the expectation of a log-normal variate with the moment-generating function of its normal counterpart can be used to show that $E({W}_{j,m})=E({e}^{{\tilde{W}}_{j,m}})=\exp ({\mu }_{j,m}+{\Sigma }_{jj}/2)$. Additionally, it is fairly straightforward to show that the covariance matrix of W_.,m is given by Var(W_.,m) = diag(E(W_.,m))(e^Σ − 11^T)diag(E(W_.,m)). Consequently, an estimator of W_j,query is given by ${\hat{{{{{{{{\boldsymbol{W}}}}}}}}}}_{j,{m}^{*}}=\hat{E}({{{{{{{{\boldsymbol{W}}}}}}}}}_{j,query})=\exp ({\hat{\mu }}_{j,query}+{\hat{\Sigma }}_{jj}/2)$ and the corresponding estimator of the covariance matrix is $\hat{Var}({{{{{{{{\boldsymbol{W}}}}}}}}}_{.,query})=diag({\hat{{{{{{{{\boldsymbol{W}}}}}}}}}}_{.,query})({e}^{\hat{\Sigma }}-{{{{{{{\boldsymbol{1}}}}}}}}{{{{{{{{\boldsymbol{1}}}}}}}}}^{T})diag({\hat{{{{{{{{\boldsymbol{W}}}}}}}}}}_{.,query})$. The estimated covariance matrix is positive definite as long as $\hat{\Sigma }$ is positive definite. Furthermore, since $\hat{B}$ is asymptotically normally distributed, we can obtain a conservative estimate of the pointwise prediction interval of ${{{{{{{{\boldsymbol{W}}}}}}}}}_{.,{m}^{*}}$ using the parametric bootstrap technique outlined in⁶⁰.

Extraction of W

In the above discussion, we have used $\log ({{{{{{{\boldsymbol{W}}}}}}}})$ as the target of the multivariate regression in (6). However, W are not observed, but are parameters that appear in the distance-weighted regression in the response space (1). Hence, we first need to extract these weights. A naive option is to set the weights ${W}_{{i}_{j}^{*},{i}_{j}}$ as the inverse of squared Euclidean distance in the response space between points in I and I ^*, i.e. ${W}_{{i}_{j}^{*},{i}_{j}}=1/{d}_{{i}_{j}^{*},{i}_{j};Y}^{2},{i}_{j}^{*}\in {I}^{*},{i}_{j}\in I$. In this configuration, we can simply supply $1/{d}_{{i}_{j}^{*},{i}_{j};Y}^{2}$ in the LHS of (6). We will still recover a closed form expression for $\hat{E}(W)$ because the log-normal distribution is closed under an inverse transformation.

Univariate construction of topological regression

The requirement I ^* ∩ I = ϕ in the previous section induces a delicate trade-off. If we increase the number of anchor points, the neighborhood training model becomes overparametrized. If, on the other hand, we decrease the number of anchor points there may not be enough anchor points to reliably estimate the response, especially in isolated regions of high activity.

One possible solution is to bring the distances among anchor points themselves in the neighborhood training model. But, that conflicts with the above theoretical development because each point in I ^* can be observed from the remaining K − 1 points in I ^* and hence we do not have a K × K covariance matrix. Additionally, because of the symmetry constraint (W_i,j = W_j,i), we can only work with the triangular matrix of weights associated with points within I ^*. Thus, if we forego the above multivariate log-linear regression construction (6) and view TR purely as a least-square optimization problem we can use K(N − K) + K(K − 1)/2 equations to obtain the least-square estimates of the coefficient matrix B. In this scenario, the first K(N − K) equations are obtained by varying m from 1, 2, . . . N in (6). The remaining K(K − 1)/2 equations connect the ${\tilde{W}}_{{i}_{j}^{*},{i}_{{j}^{{\prime} }}^{*}}$ with the instances in I ^*. More specifically, drop** the subscript i and simply denote the K elements in I ^* as {1^*, 2^*, 3^*, ⋯ , K^*}, then we have the following system of equations:

$${\tilde{W}}_{{2}^{*},{1}^{*}} ={b}_{02}+{b}_{12}{d}_{{1}^{*},{1}^{*};X}^{2}+{b}_{22}{d}_{{2}^{*},{1}^{*};X}^{2}+\cdots+{b}_{K2}{d}_{{K}^{*},{1}^{*};X}^{2}+{\epsilon }_{{2}^{*}{1}^{*}} \\ {\tilde{W}}_{{3}^{*},{1}^{*}} ={b}_{03}+{b}_{13}{d}_{{1}^{*},{1}^{*};X}^{2}+{b}_{23}{d}_{{2}^{*},{1}^{*};X}^{2}+\cdots+{b}_{K3}{d}_{{K}^{*},{1}^{*};X}^{2}+{\epsilon }_{{3}^{*}{1}^{*}}\\ \cdots \, \\ {\tilde{W}}_{{3}^{*},{2}^{*}} ={b}_{03}+{b}_{13}{d}_{{1}^{*},{2}^{*};X}^{2}+{b}_{23}{d}_{{2}^{*},{2}^{*};X}^{2}+{b}_{33}{d}_{{3}^{*},{2}^{*};X}^{2}+{b}_{K3}{d}_{{K}^{*},{2}^{*};X}^{2} \cdots+{\epsilon }_{{3}^{*}{2}^{*}}\\ \cdots \, \\ {\tilde{W}}_{{K}^{*} ,{(K-1)}^{*}} ={b}_{0K}+{b}_{1K}{d}_{{1}^{*},{(K-1)}^{*};X}^{2}+{b}_{2K}{d}_{{2}^{*},{(K-1)}^{*};X}^{2}+{b}_{3K}{d}_{{3}^{*},{(K-1)}^{*};X}^{2}+\cdots \\ +{b}_{KK}{d}_{{K}^{*},{(K-1)}^{*};X}^{2}+{\epsilon }_{{K}^{*}{(K-1)}^{*}}$$

(7)

$\hat{B}$ could be obtained by minimizing the error sum of squares. Additionally, if we assume the error terms are iid ${{{{{{{\mathcal{N}}}}}}}}(0,{\sigma }^{2})$, we can easily obtain ${\hat{\sigma }}^{2}$ from the residuals. Now, when a query instance comes in with known chemical features, we can compute ${{{{{{{{\boldsymbol{d}}}}}}}}}_{.,query}^{2}=[{d}_{{1}^{*},query}^{2},{d}_{{2}^{*},query}^{2},\cdots \,,{d}_{{K}^{*},query}^{2}]$ in the chemical space and obtain ${\hat{\tilde{{{{{{{{\boldsymbol{W}}}}}}}}}}}_{.,query}={{{{{{{{\boldsymbol{d}}}}}}}}}_{.,query}^{2}\hat{B}$. Then an estimator of the neighborhood weights for the query point is given by ${\hat{{{{{{{{\boldsymbol{W}}}}}}}}}}_{.,query}=\exp ({\hat{\tilde{{{{{{{{\boldsymbol{W}}}}}}}}}}}_{.,query}+{\hat{\sigma }}^{2}/2)$.

Additionally, since the W’s in this case are univariate, we have the flexibility to write ${W}_{{i}_{j}^{*},\neg {i}_{j}^{*}}=\exp (-\beta {d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2})$ with β > 0 and replace the W’s in the LHS of (7) by $\log ({d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2})$. Now, each d² has a univariate lognormal distribution. Now, to obtain an estimator of $E({W}_{{i}_{j}^{*},\neg {i}_{j}^{*}})$, we first observe that

$$E({W}_{{i}_{j}^{*},\neg {i}_{j}^{*}})=\int\nolimits_{0}^{\infty }\exp \left(-\beta {d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2}\right)f\left({d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2}\right)d{d}_{{i}^{*},\neg {i}^{*};Y}^{2}$$

(8)

is the Laplace transform of lognormal distribution. Although, there is no closed form solution of (8), but⁶¹ derives a sharp approximator of (8) for β > 0 using Lambert’s W function. Therefore we propose the following Monte Carlo procedure to estimate $E({W}_{{i}_{j}^{*},\neg {i}_{j}^{*}})$ as follows:

a.
Fit a standard geographically weighted regression with Gaussian Kernel in the response space and extract $\hat{\beta }$⁶².
b.
Fit the model (6) with $\log ({d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2})$ in the LHS and obtain ${\hat{\mu }}_{{i}_{j}^{*},\neg {i}_{j}^{*}}$ and ${\hat{\sigma }}^{2}$.
c.
Draw R iid replicates of ${d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2}$ from lognormal$(0,{\hat{\sigma }}^{2})$.
d.
For each realization compute $\exp (-\hat{\beta }{d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2(r)}{e}^{{\hat{\mu }}_{{i}_{j}^{*},\neg {i}_{j}^{*}}})$.
e.
Then the Monte Carlo estimator of the LHS of (8) is given by $\hat{E}({W}_{{i}_{j}^{*},\neg {i}_{j}^{*}})=\frac{1}{R}\mathop{\sum }\nolimits_{r=1}^{R}\exp (-\hat{\beta }{d}_{{i}_{j}^{*},\neg {i}_{j}^{*};Y}^{2(r)}{e}^{{\hat{\mu }}_{{i}_{j}^{*},\neg {i}_{j}^{*}}})$

While this Monte Carlo approximation works well when β and σ are small, it fails to explore the tail region as β → ∞. Hence, if $\hat{\beta }$ is large, an efficient importance sampler, derived in⁶¹, should be used.

Ensemble topological regression

The above construction in (7) allows relaxing the disjointedness requirement I ^* ∩ I = ϕ to include the anchor points as neighborhood training points and allows modeling the ${\tilde{W}}^{{\prime} }s$ through least squares optimization. However, by construction, ∣I ^*∣ < ∣I∣, meaning not all training points can be included as anchor points because the least squares model becomes overparameterized and overfits the training data leading to poor generalization performance. Since a subset of the available training set must be selected as anchor points, the results may be sensitive to the selected anchor points. To average out the effect of anchor points, one can simply randomly sample multiple different sets of anchor points and ensemble the results of each set. In order to achieve this, we introduce Ensemble TR, which samples t sets of anchor points independently and generates average predictions from the resulting t TR-models. The percentage of training instances to include as anchor instances can be viewed as a hyperparameter, so t percentages can be sampled from a Gaussian distribution ${{{{{{{\mathcal{N}}}}}}}}({\mu }_{k},{\sigma }_{k}^{2})$, with μ_k being the mean percentage of training instances to include as anchor instances and ${\sigma }_{k}^{2}$ being the requested variance of the t percentages. To verify percentage values are valid and to prevent over or under-fitting, the sampled percentages are clipped between the range [30%, 90%]. This leaves the user with three parameters: the number of models (t), the mean percentage of training samples to include as anchor instances (μ_k), and the variance of the percentages $({\sigma }_{k}^{2})$. Ensemble TR maintains its computational efficiency considering ${D}_{X}^{N\times N}$ can be initially calculated, and $t\,{D}_{X}^{N\times K}$’s can be easily sampled from ${D}_{X}^{N\times N}$. This means that once distances are calculated, only t multi-task linear regression models must be solved and RBF kernels applied to their outputs to generate predictions, leading to fast run times.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The ChEMBL datasets used in this study are available in the ChEMBL database (https://www.ebi.ac.uk/chembl/)⁵⁸. The code to extract the 530 ChEMBL datasets is provided in the code repository. Source data are provided with this paper.

Code availability

Sample data files and Python code to regenerate the TR figures and results are openly provided at https://github.com/Ribosome25/TopoReg_QSAR, which is archived in Zenodo under the identifier https://doi.org/10.5281/zenodo.10929477⁶³.

References

Neves, B. J. et al. Qsar-based virtual screening: advances and applications in drug discovery. Front. Pharmacol. 9, 1275 (2018).
Article CAS PubMed PubMed Central Google Scholar
Kwon, S., Bae, H., Jo, J. & Yoon, S. Comprehensive ensemble in qsar prediction for drug discovery. BMC Bioinformatics 20, 1–12 (2019).
Article CAS Google Scholar
Cherkasov, A. et al. Qsar modeling: where have you been? where are you going to? J. Medicinal Chem. 57, 4977–5010 (2014).
Article CAS Google Scholar
Grisoni, F., Ballabio, D., Todeschini, R. & Consonni, V. Molecular descriptors for structure–activity applications: a hands-on approach. Methods Mol. Biol. 1800, 3–53 (2018).
Yap, C. W. Padel-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32, 1466–1474 (2011).
Article CAS PubMed Google Scholar
Moriwaki, H., Tian, Y.-S., Kawashita, N. & Takagi, T. Mordred: a molecular descriptor calculator. J. Cheminform. 10, 1–14 (2018).
Article Google Scholar
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inform. Modeling 50, 742–754 (2010).
Article CAS Google Scholar
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inform. Modeling 59, 3370–3388 (2019).
Article CAS Google Scholar
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–70213 (2020).
Article CAS PubMed PubMed Central Google Scholar
Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting acinetobacter baumannii. Nat. Chem. Biol. 19, 1342–1350 (2023).
Isert, C., Kromann, J. C., Stiefl, N., Schneider, G. & Lewis, R. A. Machine learning for fast, quantum mechanics-based approximation of drug lipophilicity. ACS Omega 8, 2046–2056 (2023).
Article CAS PubMed PubMed Central Google Scholar
Wang, S., Guo, Y., Wang, Y., Sun, H. & Huang, J. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In: Proc. 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 429–436 (IEEE, 2019).
Karpov, P., Godin, G. & Tetko, I. V. Transformer-cnn: Swiss knife for qsar modeling and interpretation. Journal of cheminformatics 12, 1–12 (2020).
Article Google Scholar
Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. Preprint at https://arxiv.org/abs/1702.08608 (2017).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In: International Conference on Machine Learning. (eds Precup, D. & The, Y. W.) 3319–3328 (PMLR, 2017).
Nembrini, S., König, I. R. & Wright, M. N. The revival of the gini importance? Bioinformatics 34, 3711–3718 (2018).
Article CAS PubMed PubMed Central Google Scholar
Altmann, A., Toloşi, L., Sander, O. & Lengauer, T. Permutation importance: a corrected feature importance measure. Bioinformatics 26, 1340–1347 (2010).
Article CAS PubMed Google Scholar
Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. Smoothgrad: removing noise by adding noise. Preprint at https://arxiv.org/abs/1706.03825 (2017).
Koh, P.W. & Liang, P. Understanding black-box predictions via influence functions. In: International Conference on Machine Learning (eds Precup, D. & The, Y. W.) 1885–1894 (PMLR, 2017).
Ribeiro, M.T., Singh, S. & Guestrin, C. "why should i trust you?” explaining the predictions of any classifier. In: Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (ed Krishnapuram, B.) 1135–1144 (ACM, Digital Library, 2016).
Lundberg, S.M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inform. Process. Syst. 30 (2017).
Rodríguez-Pérez, R. & Bajorath, J. Interpretation of compound activity predictions from complex machine learning models using local approximations and shapley values. J. Medicinal Chem. 63, 8761–8777 (2019).
Article Google Scholar
Mothilal, R.K., Sharma, A. & Tan, C. Explaining machine learning classifiers through diverse counterfactual explanations. In Proc. 2020 Conference on Fairness, Accountability, and Transparency 607–617 (2020).
Wellawatte, G. P., Seshadri, A. & White, A. D. Model agnostic generation of counterfactual explanations for molecules. Chem. Sci. 13, 3697–3705 (2022).
Article CAS PubMed PubMed Central Google Scholar
Marchese Robinson, R. L., Palczewska, A., Palczewski, J. & Kidley, N. Comparison of the predictive performance and interpretability of random forest and linear models on benchmark data sets. J. Chem. Inform. modeling 57, 1773–1792 (2017).
Article CAS Google Scholar
Polishchuk, P. Interpretation of quantitative structure–activity relationship models: past, present, and future. J. Chem. Inform. Modeling 57, 2618–2639 (2017).
Article CAS Google Scholar
Balfer, J. & Bajorath, J. Visualization and interpretation of support vector machine activity predictions. J. Chem. Inform. Modeling 55, 1136–1147 (2015).
Article CAS Google Scholar
Sheridan, R. P. Interpretation of qsar models by coloring atoms according to changes in predicted activity: how robust is it? J. Chem. Inform. Modeling 59, 1324–1337 (2019).
Article CAS Google Scholar
Shoombuatong, W. et al. Towards the Revival of Interpretable Qsar Models. Advances in Qsar Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences 3–55 (Springer, 2017).
**ong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Medicinal Chem. 63, 8749–8760 (2019).
Article Google Scholar
Baldassarre, F. & Azizpour, H. Explainability techniques for graph convolutional networks. Preprint at https://arxiv.org/abs/1905.13686 (2019).
Weber, J. K. et al. Simplified, interpretable graph convolutional neural networks for small molecule activity prediction. J. Comput.-Aided Mol. Des. 36, 391–404 (2021).
Ding, H., Takigawa, I., Mamitsuka, H. & Zhu, S. Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Briefings Bioinform. 15, 734–747 (2014).
Article Google Scholar
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W. & Kanehisa, M. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, 232–240 (2008).
Article Google Scholar
Gajewicz-Skretna, A., Furuhama, A., Yamamoto, H. & Suzuki, N. Generating accurate in silico predictions of acute aquatic toxicity for a range of organic chemicals: Towards similarity-based machine learning methods. Chemosphere 280, 130681 (2021).
Article CAS PubMed Google Scholar
Jacob, L. & Vert, J.-P. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24, 2149–2156 (2008).
Article CAS PubMed PubMed Central Google Scholar
Patlewicz, G., Helman, G., Pradeep, P. & Shah, I. Navigating through the minefield of read-across tools: a review of in silico tools for grou**. Comput. Toxicol. 3, 1–18 (2017).
Article Google Scholar
Wawer, M., Peltason, L., Weskamp, N., Teckentrup, A. & Bajorath, J. Structure- activity relationship anatomy by network-like similarity graphs and local structure- activity relationship indices. J. Medicinal Chem. 51, 6075–6084 (2008).
Article CAS Google Scholar
Keiser, M. J. et al. Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 25, 197–206 (2007).
Article CAS PubMed Google Scholar
Lo, Y.-C. et al. Large-scale chemical similarity networks for target profiling of compounds identified in cell-based chemical screens. PLoS Comput. Biol. 11, 1004153 (2015).
Article Google Scholar
Lounkine, E. et al. Large-scale prediction and testing of drug activity on side-effect targets. Nature 486, 361–367 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462, 175–181 (2009).
Article ADS CAS PubMed PubMed Central Google Scholar
He, X., Cai, D. & Niyogi, P. Laplacian score for feature selection. Adv. Neural Inform. Process. Syst. 18 (2005).
Sheikhpour, R., Sarram, M. A., Gharaghani, S. & Chahooki, M. A. Z. Feature selection based on graph laplacian by using compounds with known and unknown activities. J. Chemometrics 31, 2899 (2017).
Article Google Scholar
Valizade Hasanloei, M. A., Sheikhpour, R., Sarram, M. A., Sheikhpour, E. & Sharifi, H. A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities. J. Comput.-Aided Mol. Des. 32, 375–384 (2018).
Article ADS CAS PubMed Google Scholar
Cruz-Monteagudo, M. et al. Activity cliffs in drug discovery: Dr jekyll or mr hyde? Drug Discov. Today 19, 1069–1080 (2014).
Article CAS PubMed Google Scholar
Stumpfe, D., Hu, H. & Bajorath, J. Evolving concept of activity cliffs. ACS Omega 4, 14360–14368 (2019).
Article CAS PubMed PubMed Central Google Scholar
Maggiora, G. M. On outliers and activity cliffs why QSAR often disappoints. J. Chem. Inform. Modeling 46, 1535–1535 (2006).
Article CAS Google Scholar
Hu, H. & Bajorath, J. Simplified activity cliff network representations with high interpretability and immediate access to SAR information. J. Comput.-Aided Mol. Des. 34, 943–952 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Weinberger, K.Q., Blitzer, J. & Saul, L. Distance metric learning for large margin nearest neighbor classification. Adv. Neural Inform. Process. Syst. 18 (2005).
Weinberger, K.Q. & Tesauro, G. in Artificial Intelligence and Statistics (eds. Meila, M. & Shen, x) 612–619 (PMLR, 2007).
Kireeva, N. V., Ovchinnikova, S. I., Kuznetsov, S. L., Kazennov, A. M. & Tsivadze, A. Y. Impact of distance-based metric learning on classification and visualization model performance and structure–activity landscapes. J. Comput.-aided Mol. Des. 28, 61–73 (2014).
Article ADS CAS PubMed Google Scholar
Horvath, D., Marcou, G. & Varnek, A. In (ed Roy, K.) Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences 167–199 (Springer Verlag, 2017).
Fröhlich, H., Wegner, J. K., Sieker, F. & Zell, A. Kernel functions for attributed molecular graphs—a new similarity-based approach to ADME prediction in classification and regression. QSAR Combinatorial Sci. 25, 317–326 (2006).
Article Google Scholar
Mohr, J. A., Jain, B. J. & Obermayer, K. Molecule kernels: a descriptor-and alignment-free quantitative structure–activity relationship approach. J. Chem. Inform. Modeling 48, 1868–1881 (2008).
Article CAS Google Scholar
Charlton, M., Fotheringham, S. & Brunsdon, C. Geographically Weighted Regression Vol. 2, White paper (National Centre for Geocomputation, National University of Ireland Maynooth, 2009).
Johnson, R.A. & Dean, W.W. et al. Applied Multivariate Statistical Analysis, 5th edn. (Prentice Hall, NJ, 2002).
Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, 945–954 (2017).
Article Google Scholar
Bosc, N., Atkinson, F., Felix, E., Gaulton, A., Hersey, A. & Leach, A. R. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J. Cheminform. 11, 1–16 (2019).
Google Scholar
Carroll, R. J. & Ruppert, D. Prediction and tolerance intervals with transformation and/or weighting. Technometrics 33, 197–210 (1991).
Article MathSciNet Google Scholar
Asmussen, S., Jensen, J. L. & Rojas-Nandayapa, L. On the Laplace transform of the lognormal distribution. Methodol. Comput. Appl. Probab. 18, 441–458 (2016).
Article MathSciNet Google Scholar
Fotheringham, A.S., Brunsdon, C. & Charlton, M. Geographically Weighted Regression: the Analysis of Spatially Varying Relationships (John Wiley & Sons, 2003).
Zhang, R., Nolte, D., Sanchez-Villalobos, C., Ghosh, S. & Pal, R. Topological Regression as an interpretable and efficient tool for Quantitative Structure-Activity Relationship Modeling. Zenodo https://doi.org/10.5281/zenodo.10929477 (2024).

Download references

Acknowledgements

This work was supported in part by the National Science Foundation under Grants Nos. 2007903 (received by RP) and 2007418 (Received by S.G) and Leidos Biomed/NCI under contract 22X049 Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Leidos Biomed/NCI. The authors acknowledge the High Performance Computing Center (HPCC) at Texas Tech University for providing computational resources that have contributed to the research results reported within this paper. http://www.hpcc.ttu.edu.

Author information

These authors contributed equally: Ruibo Zhang, Daniel Nolte.

Authors and Affiliations

Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409, USA
Ruibo Zhang, Daniel Nolte, Cesar Sanchez-Villalobos & Ranadip Pal
Department of Statistics, University of Nebraska - Lincoln, Lincoln, NB, 68588, USA
Souparno Ghosh

Authors

Ruibo Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Nolte
View author publications
You can also search for this author in PubMed Google Scholar
Cesar Sanchez-Villalobos
View author publications
You can also search for this author in PubMed Google Scholar
Souparno Ghosh
View author publications
You can also search for this author in PubMed Google Scholar
Ranadip Pal
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

R.Z., D.N., S.G., and R.P. formulated the problem and conceived the experiments, R.Z., D.N., C.S., conducted the experiments, R.Z., D.N., C.S., S.G., and R.P. analyzed the results. All authors reviewed the manuscript. R.Z. conducted this work while he was working at Texas Tech University, however, he is currently working at Merck Inc.

Corresponding authors

Correspondence to Souparno Ghosh or Ranadip Pal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Martin Vogt, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, R., Nolte, D., Sanchez-Villalobos, C. et al. Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling. Nat Commun 15, 5072 (2024). https://doi.org/10.1038/s41467-024-49372-0

Download citation

Received: 13 May 2023
Accepted: 04 June 2024
Published: 13 June 2024
DOI: https://doi.org/10.1038/s41467-024-49372-0
Springer Nature Limited

Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling

Abstract

Similar content being viewed by others

Introduction

Results

Model performance comparison on ChEMBL datasets

Computational comparison on ChEMBL datasets

Interpreting TR

Discussion

Methods

Data description and problem motivation

Dataset

ChemProp

Transformer-CNN

Metric Learning Kernel Regression

Comparison procedure

Multivariate construction of topological regression

Neighborhood training model

Extraction of W

Univariate construction of topological regression

Ensemble topological regression

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation