A New Sequential Forward Feature Selection (SFFS) Algorithm for Mining Best Topological and Biological Features to Predict Protein Complexes from Protein–Protein Interaction Networks (PPINs)

Younis, Haseeb; Anwar, Muhammad Waqas; Khan, Muhammad Usman Ghani; Sikandar, Aisha; Bajwa, Usama Ijaz

doi:10.1007/s12539-021-00433-8


Original research article
Published: 06 May 2021

Volume 13, pages 371–388 (2021)

Interdisciplinary Sciences: Computational Life Sciences

Haseeb Younis¹,²,
Muhammad Waqas Anwar² (ORCID: orcid.org/0000-0002-7822-8983),
Muhammad Usman Ghani Khan³,
Aisha Sikandar⁴ &
Usama Ijaz Bajwa²

862 Accesses
9 Citations

Abstract

Protein–protein interactions play an important role in understanding biological processes in the body. A network of dynamic protein complexes within a cell that regulates most biological processes is known as a protein–protein interaction network (PPIN). Predicting complexes from PPINs is a challenging task. Most previous computational approaches mine cliques, stars, linear and hybrid structures as complexes from PPINs by considering topological features, and few of them use the important biological information contained in protein amino acid sequences. In this study, we compute a wide variety of topological features and integrate them with biological features computed from protein amino acid sequences, such as bag-of-words, physicochemical and spectral-domain features. We propose a new Sequential Forward Feature Selection (SFFS) algorithm, i.e., random forest-based Boruta feature selection, for selecting the best features from the large computed feature set. Decision tree, linear discriminant analysis and gradient boosting classifiers are used as learners. We conducted experiments on two reference protein complex datasets of yeast, i.e., CYC2008 and MIPS. Human and mouse complex information is taken from the CORUM 3.0 dataset. Protein interaction information is extracted from the Database of Interacting Proteins (DIP). Our proposed SFFS, i.e., random forest-based Boruta feature selection in combination with decision tree, linear discriminant analysis and gradient boosting classifiers, outperforms other state-of-the-art algorithms, achieving precision, recall and F-measure of 94.58%, 94.92% and 94.45% for MIPS; 96.31%, 93.55% and 96.02% for CYC2008; 98.84%, 98.00% and 98.87% for CORUM human; and 96.60%, 96.70% and 96.32% for CORUM mouse complexes, respectively.
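The shadow-feature comparison that underlies Boruta-style selection can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: Boruta's random-forest importance is replaced here by absolute Pearson correlation so the example needs only the standard library, the statistical test is reduced to a fixed hit-rate threshold, and the names `abs_corr`, `shadow_select`, `rounds`, `min_hit_rate` and the toy data are hypothetical.

```python
import random
from statistics import mean, pstdev


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(cov / (sx * sy))


def shadow_select(columns, labels, rounds=50, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shadow feature in most rounds.

    columns: dict mapping feature name -> list of values (one per sample).
    Importance is abs_corr here (an assumption); Boruta itself uses
    random-forest importance and a statistical test on the hit counts.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(rounds):
        # Shadow features: shuffled copies that keep each column's
        # distribution but break any link with the labels.
        best_shadow = 0.0
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_corr(shadow, labels))
        # A real feature scores a hit when it beats the best shadow.
        for name, values in columns.items():
            if abs_corr(values, labels) > best_shadow:
                hits[name] += 1
    return [n for n, h in hits.items() if h / rounds >= min_hit_rate]


# Toy data: 20 negative and 20 positive candidate subgraphs.
labels = [0] * 20 + [1] * 20
columns = {
    "informative": labels[:],      # tracks the label exactly
    "uninformative": [0, 1] * 20,  # zero correlation with the label
}
selected = shadow_select(columns, labels)
print(selected)
```

In a fuller pipeline, the retained columns would then be passed to the learners named in the abstract (decision tree, linear discriminant analysis, gradient boosting).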



Author information

Authors and Affiliations

School of Professional Advancement, University of Management and Technology, Lahore, Pakistan
Haseeb Younis
Department of Computer Science, COMSATS University Islamabad, Lahore, Pakistan
Haseeb Younis, Muhammad Waqas Anwar & Usama Ijaz Bajwa
Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
Muhammad Usman Ghani Khan
Govt. Girls Post Graduate College No.1 Abbottabad, Abbottabad, Pakistan
Aisha Sikandar


Corresponding author

Correspondence to Muhammad Waqas Anwar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article

Cite this article

Younis, H., Anwar, M.W., Khan, M.U.G. et al. A New Sequential Forward Feature Selection (SFFS) Algorithm for Mining Best Topological and Biological Features to Predict Protein Complexes from Protein–Protein Interaction Networks (PPINs). Interdiscip Sci Comput Life Sci 13, 371–388 (2021). https://doi.org/10.1007/s12539-021-00433-8


Received: 20 June 2020
Revised: 09 April 2021
Accepted: 15 April 2021
Published: 06 May 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s12539-021-00433-8
