1 Introduction

Target identification and validation is the top priority in drug discovery [1]. Molecules or drugs that interact with a rational target or selected combinations of targets have improved odds of therapeutic success. An analysis of AstraZeneca’s drug research and development programs showed that 82% of program terminations in preclinical studies were due to safety issues, of which 25% were target-related [2]. Meanwhile, 48% of safety failures in clinical trials were target-related. Therefore, guidance on the appropriate selection of candidate targets can help improve the success rate and portfolio value of drug discovery projects while also reducing time and cost [3].

Traditionally, target discovery has relied on wet experiments, a process that is time-consuming, expensive, and low in accuracy. With the development of bioinformatics, chemical informatics, and omics, computer-aided therapeutic target discovery methods or in silico methods have come to the fore [4,5,6]. By integrating big data with computational methods, computer-aided therapeutic target discovery greatly reduces the scope of experimental targets, shortens the drug discovery and development cycle, and reduces the experimental cost. At present, the two main categories of in silico methods for potential therapeutic target identification are comparative genomics [7] and network-based methods [8]. One of many important characteristics differentiating these methods is that comparative genomics is mostly used in infectious diseases, whereas network-based methods can be used not only in infectious diseases, but also in non-infectious diseases. Nonetheless, these categories of methods often complement each other in their advantages and disadvantages.

With the completely sequenced human genome, in addition to the completed genome sequences of many model organisms, there are increasing research-focused efforts to understand the function of a genome and molecular evolution. Finding potential therapeutic targets among cellular functions based on understanding their related biological processes in pathogens and their hosts has become imperative as antimicrobial resistance continues to spread rapidly. To identify therapeutic targets, comparative genomics combines the information contained in genome database resources and software to reveal fatal weaknesses of pathogens that affect their growth and reproduction in the host, such as genes essential for the survival, growth, and important functions of pathogens [9]. In addition, comparative genomics can filter out homologs by comparing genomes of pathogens and hosts, avoiding toxicity and side effects of newly designed drugs on the host and, in turn, increasing the success rate of drug design [9].

With many pathogenic variants associated with disease in non-coding regions or difficult-to-target genes, the number of associations that are candidates for development into drugs is limited. Approaches that combine data from pathway databases or biological networks can broaden the number of potential targets to increase the number of associations that lead to effective treatments. As such, network-based strategies are among the state-of-the-art computational models for target identification and are also an important bridge connecting network pharmacology [10], network medicine [11], network biology [12], and systems biology [29]. Some types of constructed networks are protein–protein interaction (PPI) networks [30], gene interaction networks [31], and miRNA–mRNA interaction networks [89]. To reduce the complexity of a large interaction network, users can create filters based on the attributes as needed and use the Cytoscape built-in function to search [89]. In addition to directly filtering nodes using the built-in topological parameters in Cytoscape, users can also use apps (formerly called plugins), such as stringApp [90], the Biological Networks Gene Ontology (BiNGO) tool [91], Molecular Complex Detection (MCODE) [92], and cytoHubba, a user-friendly interface to explore key nodes and subnetworks [93]. StringApp combines the resources of the STRING database and Cytoscape in the same workflow and facilitates the import of STRING molecular networks into Cytoscape for executing STRING analysis in the script file [90]. BiNGO provides a comprehensive set of annotation tools for Gene Ontology (GO)-level annotations of a variety of organisms. It assesses the overrepresentation of GO categories in biological networks and supports user-defined annotations and ontologies [91]. MCODE enables searches for densely connected regions within large PPI networks that may reflect molecular complexes. The method is based on connectivity data [92]. CytoHubba provides a one-stop calculation of 11 topological analysis methods to help users explore hub objects from complex biological networks [93]. These useful apps are freely available from the Cytoscape App Store (http://apps.cytoscape.org/).

Table 5 Software and tools of network-based methods for identification of potential therapeutic targets

5 Applications

5.1 Comparative Genomics

With the arrival of the post-genome era, target-based drug design strategy has gradually become the focus [102]. Both the improvement of sequencing technology and the exponential growth in the number of fully sequenced genomes have made it possible to select reasonable new therapeutic targets and vaccine candidates throughout the genome. Drug resistance is becoming increasingly widespread due to the continuous evolution of bacterial strains, such as Streptococcus pneumoniae and Mtb. Knowledge of therapeutic targets and drug candidates is useful for enhanced drug discovery and is becoming increasingly reliant on comparative genomics technology [103]. Table 6 lists recent applications of comparative genomics in finding therapeutic targets. We selected some specific examples to describe in this section.

Table 6 Examples of prediction of potential therapeutic targets by comparative genomics in recent years

Determining essential genes of pathogens is a common method to identify potential therapeutic targets. For example, Tilahun et al. [104] retrieved the protein-coding genes of Mtb from the Mtb database and identified the essential genes by a BLAST search of the retrieved protein-coding genes against DEG. Then, the corresponding protein sequences, obtained by searching in DEG, were used to perform a BLASTp search of human protein sequences to avoid host toxicity in the subsequent drug development. Finally, 572 essential genes with no homology to human genes were selected from 3958 genes of Mtb. Discovering potential therapeutic targets from the proteins encoded by essential genes can refine the search scope of therapeutic targets. The existence of homologous genes is a powerful predictor of biological importance [105] and a breakthrough in therapeutic target identification. For example, Satya et al. [48] sequenced the gene encoding 3-deoxy-D-arabinoheptulosonate-7-phosphate synthase (DAHPS) in Pseudomonas fragilis (Pf). Sequence analysis showed high homology (84%) of Pf-DAHPS with other Pseudomonas DAHPS, indicating that it was possible to design a broad-spectrum drug for the genus by targeting the DAHPS sequence. By analyzing the homology between the protein sequence encoded by DAHPS and human protein sequences, DAHPS, which does not exist in humans, was proposed to be an important potential antibacterial target. The predicted three-dimensional structure of Pseudomonas DAHPS may provide an option for reasonable drug design [48].
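
The two-stage filter described for the Mtb example (keep proteins with a hit in an essential-gene database, then discard those with a human homolog) can be sketched as follows. This is a minimal illustration only, not the cited authors' pipeline: it assumes the BLAST results are available as standard tabular output (-outfmt 6), and the e-value cutoffs, protein identifiers, and function names are hypothetical, since the source does not state the thresholds used.

```python
# Hypothetical cutoffs; the reviewed studies may have used different values.
ESSENTIAL_MAX_EVALUE = 1e-10  # DEG hit at or below this => treated as essential
HOST_MAX_EVALUE = 1e-4        # human hit at or below this => treated as host homolog


def best_evalues(blast_tabular_lines):
    """Map each query ID to its smallest e-value.

    Expects BLAST tabular output (-outfmt 6): 12 tab-separated columns,
    with the query ID in column 1 and the e-value in column 11.
    """
    best = {}
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def candidate_targets(all_proteins, deg_hits, human_hits):
    """Keep proteins that look essential and have no close human homolog."""
    deg_best = best_evalues(deg_hits)
    human_best = best_evalues(human_hits)
    selected = []
    for protein in all_proteins:
        essential = deg_best.get(protein, float("inf")) <= ESSENTIAL_MAX_EVALUE
        host_homolog = human_best.get(protein, float("inf")) <= HOST_MAX_EVALUE
        if essential and not host_homolog:
            selected.append(protein)
    return selected


# Toy data with made-up identifiers (not real Mtb or DEG entries).
deg_hits = [
    "protA\tDEG001\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-150\t520",
    "protB\tDEG002\t88.0\t250\t30\t0\t1\t250\t1\t250\t1e-90\t330",
    "protC\tDEG003\t30.0\t80\t56\t2\t1\t80\t5\t84\t0.5\t25",
]
human_hits = [
    "protB\tHUM001\t60.0\t240\t96\t1\t1\t240\t3\t242\t1e-40\t160",
]
proteins = ["protA", "protB", "protC", "protD"]
print(candidate_targets(proteins, deg_hits, human_hits))  # ['protA']
```

In this toy run only protA survives: protB is essential but has a human homolog, protC has only a weak hit in the essential-gene search, and protD has no hits at all.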

Comparative genomics can be used to understand the molecular mechanism of disease and predict targets for new drug design. For example, Zumla et al. [106] discovered that the sequence homology of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome with SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV) was about 82%, and the homology of structural proteins was over 90%. The high sequence homology revealed their common pathogenic mechanism. Therefore, the authors of the study designed and developed direct-acting antiviral drugs that target highly conserved enzymes in SARS-CoV-2, such as the main protease (MPRO) or 3C-like protease (3CLpro), the papain-like protease (PLpro), non-structural protein 12 (Nsp12), and RNA-dependent RNA polymerase (RdRP). Among them, ganciclovir and maraviroc, the drugs against MPRO, were considered effective for the treatment of coronavirus disease 2019 (COVID-19) [107].

Comparative genomics is used to find potential therapeutic targets for the development of human drugs and animal drugs. Damte et al. [108] selected five unique pathways of Mycoplasma hyopneumoniae strains in KEGG. They then used BLASTp in NCBI to compare the only two protein sequences in the unique pathways with the porcine protein sequences. It was found that the two protein sequences in the unique pathways were not homologous to the porcine protein sequences. Therefore, those essential proteins, which exist in M. hyopneumoniae but not in the host (pig), may be useful in drug design and vaccine production against M. hyopneumoniae. For more examples of comparative genomics used to identify potential targets, readers can refer to the list of references provided in Table 6.

5.2 Network-Based Methods

Different types of biological networks can be used to predict potential therapeutic targets by network-based methods, such as PPI networks, gene interaction networks and miRNA–mRNA interaction networks. Table 7 lists almost all applications since 2015 of network-based methods to predict potential therapeutic targets, including the databases, software and tools, network types, related pathogens/diseases/processes, and the identified targets. Some of the targets in Table 7 have been verified or used for drug design. Here, we select several examples of previous studies that have used different network types for further description.

Table 7 Applications of network-based methods for potential therapeutic target identification

PPI networks are the most widely used molecular networks in target discovery. For example, Huo et al. predicted proteins FGG, SLC9A3, MAPK14, FGF1, FGB, F13A1, and CASR as potential therapeutic targets for the treatment of coronary heart disease (CHD) by combining the centrality-based and differentia-based approaches [30]. They extracted PPIs related to Danshensu (one of the main active ingredients of Salvia miltiorrhiza, known as Danshen) from the STRING database, then integrated the data with the CHD gene expression profile and microarray data obtained from the GEO database to construct a non-CHD state co-expression protein interaction network (CePIN) and a CHD state CePIN on Cytoscape [30]. The non-CHD network contained 91 nodes and 98 edges, and the CHD state CePIN contained 99 nodes and 110 edges [30]. Then, topological analysis and network comparison were performed along with the calculation of network connectivity after the removal of candidate nodes. Finally, two bottleneck proteins, FGG and SLC9A3, existing only in the CHD state CePIN, were selected as the targets of Danshensu in the treatment of CHD and as the potential targets for new drug design [30]. In addition, MAPK14, FGF1, FGB, F13A1, and CASR, obtained through the differentia-based approach, also represented potential therapeutic targets for the treatment of CHD and had been confirmed to be related to CHD to some extent [30].
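
The comparison of a disease-state network with a non-disease network, followed by a connectivity check after node removal, can be illustrated with a small sketch. It is an assumption-laden simplification rather than the procedure of the cited study: a "bottleneck" is approximated here as a node that exists only in the disease-state network and whose removal splits that network into more connected components. The node names and the function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def count_components(graph, removed=None):
    """Count connected components, optionally ignoring one removed node."""
    seen = set()
    if removed is not None:
        seen.add(removed)
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


def disease_specific_bottlenecks(control_edges, disease_edges):
    """Nodes unique to the disease network whose removal fragments it."""
    control = build_graph(control_edges)
    disease = build_graph(disease_edges)
    baseline = count_components(disease)
    result = []
    for node in sorted(set(disease) - set(control)):
        if count_components(disease, removed=node) > baseline:
            result.append(node)
    return result


# Toy networks with placeholder node names.
control_edges = [("P1", "P2"), ("P2", "P3")]
disease_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "X1"),
                 ("X1", "X2"), ("X1", "X3"), ("P1", "X4")]
print(disease_specific_bottlenecks(control_edges, disease_edges))  # ['X1']
```

Published analyses typically rank bottlenecks by betweenness centrality; the component count is used here only because it keeps the example short and dependency-free.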

There are also examples of the use of the centrality-based approach alone to identify potential therapeutic targets. For example, Moon et al. generated a list of 1089 differentially expressed genes from patients with diffuse systemic sclerosis by a literature search in Google Scholar and PubMed using specific keywords [125]. Then, using the centrality-based approach to build a PPI network, they identified 1068 interactions of those 1089 genes. Finally, a network centrality analysis identified four hub genes (CTGF, HCK, LYN, PDGFRB) as potential therapeutic targets [125]. In another example, Fathima et al. used non-apoptotic cell death genes of colon adenocarcinoma (COAD), glioblastoma multiforme (GBM), and small cell lung cancer (SCLC) screened from their transcriptome profiles to build three PPI networks [133]. Through centrality analysis, 4 of the top 10 hub proteins, which were not found or only found in one target database, were considered as novel valid therapeutic targets (FANCD2 and NCOA4 for COAD, IKBKB for GBM, and RHOA for GBM and SCLC) [133].
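
As a minimal sketch of a centrality-based ranking, the example below scores nodes by degree centrality, one of the simplest topological measures. It is not the analysis performed in the cited studies, which may have used other or additional centrality measures; the gene labels and the function name are placeholders.

```python
from collections import Counter


def rank_hubs(edges, top_n=3):
    """Rank nodes of an undirected network by degree (number of partners)."""
    # Drop self-loops and duplicate edges so each interaction counts once.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    # Sort by degree (descending), then by name for a deterministic order.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy interaction list with placeholder gene labels; ("G2", "G1") is a
# deliberate duplicate of ("G1", "G2").
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
         ("G2", "G3"), ("G4", "G5"), ("G2", "G1")]
print(rank_hubs(edges, top_n=2))  # [('G1', 3), ('G2', 2)]
```

In practice, hub candidates ranked this way would still need to be checked against target databases and validated experimentally, as the examples above describe.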

As mentioned above, PPI networks, gene interaction networks, and miRNA–mRNA interaction networks have applications in predicting potential therapeutic targets. For example, Miryala et al. [31] identified 337 functional interactions of 60 antimicrobial resistance genes of Pseudomonas aeruginosa PAO1 from the PathoSystems Resource Integration Center (PATRIC) tool, the Antibiotic Resistance Genes Database (ARDB) [142], the Comprehensive Antibiotic Resistance Database (CARD), the National Database of Antibiotic Resistant Organisms (NDARO), and the STRING database. By constructing and analyzing the gene interaction network in Cytoscape, nine hub genes were obtained as potential therapeutic targets for new drug development [31]. Xue et al. [147], eight genes (ASPG, AQP2, CNOT8, CTPS1, IFNAR2, MOCS2, PRSS37, and VCP) were finally identified as potential therapeutic targets [157]. Another issue is that although comparative genomics can reduce the number of experimental targets, making some attractive proteins become potential therapeutic targets, the range of potential targets screened by this method is still very wide and is limited by time and cost. It seems that most of these potential targets screened by comparative genomics will not be used for experimental validation. Therefore, it may be profitable to combine comparative genomics with network-based methods to narrow the scope of experimental targets further and reduce the time and material resources, thereby saving costs in the early stage of drug research and development.

Network-based methods are highly dependent on the accuracy of the source data, potentially requiring a great deal of labor to ensure its accuracy [158]. A promising direction to resolve this problem will be integrating different types and complementary data in the future [6]. Other drawbacks of network-based methods are that they cannot predict proteins or genes without interaction data, and the interactions cannot be quantified [155]. Improved network construction and analysis algorithms or mathematical modeling methods [159] may be required to overcome these issues.

6.3 Comparison and Contrast of the Two Categories of In silico Methods for Target Identification

Comparative genomics and network-based methods have unique advantages and disadvantages in predicting targets. Comparative genomics almost exclusively searches within the range of pathogen-associated sequences, limiting the scope to the proteomes closely related to the pathogen. By contrast, network-based methods can be applied to pathogens and can also construct networks of human disease-related proteins or genes. Network-based methods can also connect long-distance relationships through interactions [160], permitting research into the interplay of evolutionary drivers on a larger scale. On the other hand, comparative genomics is usually more accurate than network-based methods because it directly compares sequences, which are fixed and show almost no deviation. The interaction data used in network-based methods, however, may contain false positives and false negatives [161], and the interactions are only qualitative [160], which may lead to bias. In summary, the combined use of comparative genomics and network-based methods may be more beneficial than either method alone to improve the accuracy and efficiency in target identification.

6.4 Previous Reviews and Prospective Studies on In silico Methods for Target Identification

We have collected five reviews on in silico methods for identifying potential therapeutic targets during 2016–2020, which will be briefly discussed in this section. Sekyere and Asante [7] reviewed comparative genomic analysis and trans-complementation assays in the context of antibiotic resistance research and new drug discovery by describing the emergence of several new drug resistance genes, such as lsa(C), erm(44), VCC-1, mcr-1, mcr-2, mcr-3, mcr-4, blaKLUC-3, and blaKLUC-4. For readers interested in further understanding pathogen protein targets, Saha et al. reviewed the computational work and functional prediction from PPI networks applied to different infectious diseases with Plasmodium falciparum used as an example to analyze the process of protein target identification through the host–pathogen protein interactions [162]. Katsila et al. [5] surveyed chemical informatics and network-based methods for identifying therapeutic targets and introduced some databases and network computing tools for target identification. They also appraised the process of computer-aided drug design (CADD), including ligand-based drug design and structure-based drug design [5]. Readers interested in CADD can peruse their article for further understanding. Reisdorf et al. introduced database resources for identification, prioritization, and validation of disease targets, including emerging integrated bioinformatics platforms, such as Open Targets, and public resources, such as DrugBank and ChEMBL [163]. In comparison, the database resources we described focus more on classic or commonly used databases for applications. We also recommend the review by Agamah et al. [153], which examined current in silico methods for the identification of therapeutic targets and candidate drugs, including network-based analysis approaches, data mining, reverse docking, biospectra analysis, and ligand-based in silico target prediction. That review also compared the different approaches and propounded the benefits of hybrid approaches.

6.5 Related Methods for Target Identification

In silico subtractive genomics (first mentioned in Sect. 2.1), also known as differential proteome mining, is a comparative genomics-based method [164]. Subtractive genomics gradually subtracts proteins from the complete proteome of pathogens to find rational targets [18]. The difference between subtractive genomics and comparative genomics is in the range of application of the two methods. Subtractive genomics has been widely used for developing potential anti-pathogen infection drugs [18], whereas comparative genomics can be used not only to identify potential targets of pathogens but also to understand the molecular basis of disease [106].

For network-based methods, in addition to the centrality-based and differentia-based approaches we reviewed above, there are also studies showing the use of network influence [165], controllability [166], and topological similarity strategy [167] in target identification, but the relevant applications are much fewer. Compared with network centrality, the network influence strategy focuses on the vulnerable nodes close to the central nodes in networks. Acting on these nodes may not be fatal but can have a major impact on the central nodes, so these nodes have the potential to be therapeutic targets [165]. The controllability strategy applies structural controllability theory to determine the minimum set of driver nodes in control of the entire network and identify indispensable nodes as prime targets for disease-causing mutations, viruses, and drugs [166]. The topological similarity strategy focuses on the nodes in the network with similar topological properties to the existing drug targets, which can be potentially developed as therapeutic targets [167].
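
The controllability strategy can be illustrated with a short sketch based on the standard maximum-matching formulation of structural controllability, in which the nodes left unmatched by a maximum matching of the directed network act as driver nodes (with at least one driver always required). The indispensability test used here, that removing the node increases the number of driver nodes, is stated as an assumption about the definition, and the toy network and function names are hypothetical rather than taken from the cited work.

```python
def maximum_matching(nodes, directed_edges):
    """Return {target: source} for a maximum matching of a directed graph.

    Each node may start at most one matched edge and end at most one matched
    edge (bipartite matching between "out" copies and "in" copies of nodes).
    """
    successors = {node: [] for node in nodes}
    for source, target in directed_edges:
        successors[source].append(target)

    matched_source = {}  # target -> source of its matched incoming edge

    def try_augment(source, visited):
        # Kuhn's augmenting-path search from one source node.
        for target in successors[source]:
            if target in visited:
                continue
            visited.add(target)
            if (target not in matched_source
                    or try_augment(matched_source[target], visited)):
                matched_source[target] = source
                return True
        return False

    for source in nodes:
        try_augment(source, set())
    return matched_source


def driver_nodes(nodes, directed_edges):
    """Nodes without a matched incoming edge; at least one driver is needed."""
    matching = maximum_matching(nodes, directed_edges)
    unmatched = [node for node in nodes if node not in matching]
    return unmatched if unmatched else [nodes[0]]


def indispensable_nodes(nodes, directed_edges):
    """Nodes whose removal increases the number of driver nodes (assumed definition)."""
    baseline = len(driver_nodes(nodes, directed_edges))
    result = []
    for node in nodes:
        rest = [n for n in nodes if n != node]
        rest_edges = [(s, t) for s, t in directed_edges if node not in (s, t)]
        if rest and len(driver_nodes(rest, rest_edges)) > baseline:
            result.append(node)
    return result


# Toy directed chain A -> B -> C -> D.
chain_nodes = ["A", "B", "C", "D"]
chain_edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(driver_nodes(chain_nodes, chain_edges))         # ['A']
print(indispensable_nodes(chain_nodes, chain_edges))  # ['B', 'C']
```

In the chain, a single input at A suffices to control the whole path, whereas deleting B or C breaks the path and requires an additional driver, which is why those two nodes are flagged.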

Commonly used experimental methods for potential therapeutic target identification, especially for essential genes, include single-gene knockout, antisense RNA inhibition of gene expression, large-scale transposon mutagenesis, and CRISPR/Cas9 nuclease system knockout screening. The limitations of experimental methods in identifying essential genes are listed in Table 9 [168].

Table 9 Limitations of experimental methods in identifying essential genes

Current computational studies are based on the integration of prior knowledge, the sparseness of which still limits the completeness and accuracy of computational prediction [169]. Data reproducibility of in silico methods is also an essential issue but might be improved by external validation and detailed reports of experimental datasets [153]. It should be emphasized that computational methods complement laboratory-based methods and that the targets identified by in silico methods need to be experimentally validated.

6.6 Review and Prospection of Deep Learning Architecture in Target Identification

Deep learning (DL), a relatively new computational technique that has become a hot research topic, has been rapidly developed and widely used to predict potential therapeutic targets. DL is a subclass of machine learning (ML) algorithms. It uses artificial neural networks with many layers of nonlinear processing units for learning data representations [170]. Therapeutic target identification based on ML or DL is usually used to predict targets for drug repositioning, that is, to predict new targets for existing drugs. There are two steps in the ML method to predict therapeutic targets. First, the compounds are transformed into an effective representation, the input features. Second, the feature vectors are constructed as input for the ML algorithm, which learns the functional relationship between the input features and the target property [171]. Compared with conventional ML methods, DL reconstructs the original input information into a distributed representation through neurons in the hidden layer. Another characteristic of DL models is that they can automatically learn features upon completing classification and other tasks and learn more complex features when the number of layers increases. DL architectures are well-suited for target prediction because they allow for multitask learning and automatically construct complex features, which, for target prediction, are assumed to be pharmacophore descriptors. Multitask learning has the advantage of allowing for multi-label information and can, therefore, utilize relations between targets. It also permits hidden unit representations to be shared among prediction tasks, which is particularly valuable because some targets have very few measurements available, making single-target prediction ineffective. In addition, DL can boost the performance of tasks with few training examples. The other advantage of deep networks is that they provide hierarchical representations of a compound, where higher levels represent more complex properties [172].

Convolutional neural networks (CNNs) are a representative DL architecture in potential target prediction. CNNs contain convolutional layers, pooling layers, and fully connected layers. Convolutional layers and pooling layers are responsible for feature extraction, and fully connected layers are used to construct the nonlinear relationship of the extracted features for obtaining the output [171]. Another DL architecture is the deep neural network (DNN), which contains multiple hidden layers, with each layer comprising hundreds of nonlinear processing units. DNNs can deal with many input features, and the neurons in different layers of a DNN can automatically extract features at different hierarchical levels [173]. The third main DL architecture is the auto-encoder, a neural network used for unsupervised learning. An auto-encoder contains an encoder that transforms the input information into a limited number of hidden units, coupled with a decoder whose output layer has the same number of nodes as the input layer [174].
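
The division of labor among convolutional, pooling, and fully connected layers can be made concrete with a toy forward pass over a protein sequence. This is a sketch under strong simplifying assumptions: the weights are set by hand rather than learned, there is a single filter and a single output unit, and the sequence, motif, and function names are hypothetical. It does not reproduce any published model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-dimensional one-hot vector."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
            for residue in sequence]


def conv1d(encoded, kernel):
    """Slide a (window x 20) filter along the sequence and apply ReLU."""
    window = len(kernel)
    outputs = []
    for start in range(len(encoded) - window + 1):
        total = 0.0
        for offset in range(window):
            for channel in range(len(AMINO_ACIDS)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        outputs.append(max(0.0, total))
    return outputs


def global_max_pool(feature_map):
    """Keep only the strongest filter response along the sequence."""
    return max(feature_map)


def dense_sigmoid(features, weights, bias):
    """Single fully connected unit with a sigmoid activation."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hand-set filter that responds to the local residue pattern "KR".
kernel = one_hot("KR")
feature_map = conv1d(one_hot("MAKRLL"), kernel)
pooled = global_max_pool(feature_map)
score = dense_sigmoid([pooled], weights=[1.0], bias=-1.0)

print(feature_map)  # [0.0, 0.0, 2.0, 0.0, 0.0]
print(pooled)       # 2.0
```

The filter fires only at the position where the pattern occurs, the pooling step discards where along the sequence it occurred, and the dense unit turns the pooled feature into a score between 0 and 1. A trained model would learn many such filters and weights from data.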

Several studies have reported DL for therapeutic target prediction in recent years [175,176,177]. For example, Wang et al. [178] constructed a framework that combines a biased support vector machine and a stacked auto-encoder DL model to identify drug target proteins. The stacked auto-encoders were trained to extract properties from the original protein representations, and the biased support vector machine was used to perform the potential target identification task. The framework identified 23% of the original non-drug target proteins as possible therapeutic target proteins. Zeng et al. [179] developed a DL method, named deepDTnet, for novel target identification. A DNN algorithm was used to learn the relationships between drugs and targets. The model was used to predict a new target for topotecan, an approved topoisomerase inhibitor. Human retinoic-acid-receptor-related orphan receptor-gamma t (ROR-γt) was predicted as the target, and bioassay experiments showed high inhibitory activity (IC50 = 0.43 μM) on ROR-γt. Lee et al. [180] proposed a DL model named DeepConv-DTI (deep learning with convolution on protein sequences for prediction of drug–target interaction), based on CNNs, for drug–target interaction prediction, which can be used for target identification. The training dataset contained 11,950 compounds, 3,675 proteins, and 32,568 drug–target interactions. The CNN model was constructed to capture local residue patterns and concatenate protein features with drug features through the fully connected layers. The hyperparameters were then optimized with an external validation dataset, and the model outputs the possible drug–protein interactions.

Although DL has advantages in recognition, classification, and feature extraction from complex and noisy data, it still has limitations. First of all, DL is a “black box,” which makes it hard to explain the prediction result and inherent principles of why the compound is effectively targeted to the predicted target. Second, it needs a large number of experimental datasets of drug–target relationships for its training. However, there is currently a lack of experimental data of drug–target relationships [181]. Consequently, there is a risk of overfitting when training the model, leading to low accuracy of the prediction result. Third, DL is usually computationally intensive, time-consuming, and often requires access to and programming knowledge for graphics processing units. DL has recently been applied successfully in therapeutic target identification. However, due to the lack of large-scale studies or experimental data and the hyperparameter selection bias that comes with the high number of potential DL architectures, DL still has scope for improvement and development in research to predict potential therapeutic targets [172, 182].

7 Conclusion

In this review, we introduced in detail the two categories of in silico methods for potential therapeutic target identification (comparative genomics and network-based methods) and summarized the databases and software commonly used for these approaches. We also collected and highlighted some previous applications of these methods for therapeutic target identification. Additionally, we analyzed the advantages and disadvantages of the methods and their application prospects. Finally, we accentuated the characteristics of our review in the context of previously published relevant reviews and methods. The purpose of this review was to help readers quickly understand the rationales of in silico methods for potential therapeutic target identification and become familiar with the available tool resources and the applications of these methods, so that they can make full use of the existing tools for target prediction. We strongly believe that more accurate predictions due to users’ familiarity with existing resources will increase the importance of computational methods in the identification of potential therapeutic targets for future research. In turn, the failure rate due to target problems in drug development, the input–output ratio of drug discovery, and the cost of subsequent experiments can be expected to decrease, and the drug development cycle time to shorten.