A New Sequential Forward Feature Selection (SFFS) Algorithm for Mining Best Topological and Biological Features to Predict Protein Complexes from Protein–Protein Interaction Networks (PPINs)

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Protein–protein interactions play an important role in understanding biological processes in the body. A network of dynamic protein complexes within a cell that regulates most biological processes is known as a protein–protein interaction network (PPIN). Predicting complexes from PPINs is a challenging task. Most previous computational approaches mine cliques, stars, linear and hybrid structures as complexes from PPINs by considering topological features, and few of them use the important biological information contained in the protein amino acid sequence. In this study, we compute a wide variety of topological features and integrate them with biological features computed from the protein amino acid sequence, such as bag-of-words, physicochemical and spectral domain features. We propose a new Sequential Forward Feature Selection (SFFS) algorithm, i.e., random forest-based Boruta feature selection, to select the best features from the large computed feature set. Decision tree, linear discriminant analysis and gradient boosting classifiers are used as learners. We conducted experiments on two reference protein complex datasets of yeast, i.e., CYC2008 and MIPS. Human and mouse complex information is taken from the CORUM 3.0 dataset. Protein interaction information is extracted from the Database of Interacting Proteins (DIP). Our proposed SFFS, i.e., random forest-based Boruta feature selection in combination with decision tree, linear discriminant analysis and gradient boosting classifiers, outperforms other state-of-the-art algorithms, achieving precision, recall and F-measure of 94.58%, 94.92% and 94.45% for MIPS; 96.31%, 93.55% and 96.02% for CYC2008; 98.84%, 98.00% and 98.87% for CORUM human; and 96.60%, 96.70% and 96.32% for CORUM mouse complexes, respectively.
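To make the idea of combining topological and sequence-derived features concrete, the following is a minimal sketch and not the authors' feature set. It assumes a candidate complex is given as a set of protein identifiers, the PPIN as a list of undirected edges, and each protein's amino acid sequence as a string. The function names (`topological_features`, `bag_of_words`, `complex_features`), the choice of density and degree statistics, and the use of overlapping 2-mers as "words" are all illustrative assumptions.

```python
from collections import Counter


def topological_features(nodes, edges):
    """Size, density and degree statistics of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    # Keep only interactions with both endpoints inside the candidate complex;
    # self-loops are ignored and duplicate edges collapse via frozenset.
    induced = {frozenset(e) for e in edges
               if len(set(e)) == 2 and set(e) <= nodes}
    degree = Counter()
    for edge in induced:
        for protein in edge:
            degree[protein] += 1
    n = len(nodes)
    possible = n * (n - 1) / 2  # number of edges in a complete graph on n nodes
    degrees = [degree[p] for p in sorted(nodes)]
    return {
        "size": n,
        "density": len(induced) / possible if possible else 0.0,
        "mean_degree": sum(degrees) / n if n else 0.0,
        "max_degree": max(degrees, default=0),
    }


def bag_of_words(sequence, k=2):
    """Relative frequency of overlapping k-mers ("words") in one sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def complex_features(nodes, edges, sequences, k=2):
    """Concatenate topological features with k-mer frequencies averaged
    over the member proteins (an assumed way of pooling per-protein features)."""
    members = sorted(set(nodes))
    features = topological_features(members, edges)
    pooled = Counter()
    for protein in members:
        for word, freq in bag_of_words(sequences[protein], k).items():
            pooled[word] += freq
    for word, total in pooled.items():
        features[f"bow_{word}"] = total / len(members)
    return features
```

A feature dictionary of this kind, computed for known complexes and for random subgraphs, could then be passed to a feature selection step and a classifier. The selection step itself is not shown here.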


Author information

Corresponding author

Correspondence to Muhammad Waqas Anwar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Younis, H., Anwar, M.W., Khan, M.U.G. et al. A New Sequential Forward Feature Selection (SFFS) Algorithm for Mining Best Topological and Biological Features to Predict Protein Complexes from Protein–Protein Interaction Networks (PPINs). Interdiscip Sci Comput Life Sci 13, 371–388 (2021). https://doi.org/10.1007/s12539-021-00433-8

  • DOI: https://doi.org/10.1007/s12539-021-00433-8
