Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors

Zhang, Jian; Lv, Lixin; Lu, Donglei; Kong, Denan; Al-Alashaari, Mohammed Abdoh Ali; Zhao, Xudong

doi:10.1186/s12859-020-03826-6

Methodology article
Open access
Published: 27 October 2020

Volume 21, article number 480, (2020)

BMC Bioinformatics

Jian Zhang¹,
Lixin Lv¹,
Donglei Lu¹,
Denan Kong²,
Mohammed Abdoh Ali Al-Alashaari² &
…
Xudong Zhao ORCID: orcid.org/0000-0003-2272-6278²

Abstract

Background

Classification of proteins with specific functions is important for biological research. Encoding approaches that extract features from protein sequences play an important role in protein classification, and many computational methods (namely classifiers) are applied to protein sequences encoded in various ways. Commonly, protein sequences carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. For protein prediction, a kernel set of protein sequences whose labels are certified by biological experiments should exist in advance. However, such a set is hardly ever seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered for prediction. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. In addition, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.

Results

Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset of 1947 protein sequences (399 T4SE and 1548 non-T4SE) as a case, experiments are conducted to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable, quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.

Conclusions

Certain variables, rather than the whole encoded feature they belong to, do work for discriminating between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.

Background

Feature extraction from protein sequences plays an important role in protein classification [1,2,3,4] in many areas, such as identification of plant pentatricopeptide repeat coding protein [5], prediction of bacterial type IV secreted effectors [

Variable selection

Variable selection is accomplished at the sixth step. In each dimension, the established ensemble classifier is applied to the testing samples, and the accuracy (Acc), expressed in Eq. (2), and the area under the curve (AUC) of the receiver operating characteristic (ROC) are calculated. A line chart is then obtained whose horizontal axis gives the variable indices in descending order and whose vertical axis gives the corresponding Accs and AUCs in the different dimensions. A dimension threshold can be set at the point beyond which the Accs and AUCs remain almost unchanged as the dimension increases. Thus, the variables that really help to recognize proteins with specific functions are selected from the encoded feature.
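The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plateau_dimension`, the tolerance `tol` and the rule "all later scores stay within `tol` of the current one" are hypothetical choices, since the text does not specify how "almost the same" is quantified.

```python
from typing import Sequence


def plateau_dimension(acc: Sequence[float], auc: Sequence[float],
                      tol: float = 0.005) -> int:
    """Return the smallest dimension (1-based) after which Acc and AUC stay flat.

    acc[i] and auc[i] are the scores obtained with the top (i + 1) ranked
    variables. "Flat" is an assumed criterion: every later score differs from
    the score at the candidate dimension by at most `tol`.
    """
    if len(acc) != len(auc) or len(acc) == 0:
        raise ValueError("acc and auc must be non-empty and of equal length")
    n = len(acc)
    for d in range(n):
        flat = all(
            abs(acc[j] - acc[d]) <= tol and abs(auc[j] - auc[d]) <= tol
            for j in range(d + 1, n)
        )
        if flat:
            return d + 1  # convert 0-based index to a dimension count
    return n  # not reached in practice: the last dimension is always flat


if __name__ == "__main__":
    # Made-up curves for illustration only, not results from the study.
    acc = [0.80, 0.86, 0.90, 0.93, 0.932, 0.931, 0.933]
    auc = [0.85, 0.90, 0.94, 0.96, 0.961, 0.962, 0.960]
    print(plateau_dimension(acc, auc))
```

A stricter or looser `tol` moves the threshold, so in practice the value would be chosen by inspecting the line chart rather than fixed in advance.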

Measure

Evaluation metrics are used at the seventh step to estimate the effectiveness of the selected variables. The classification error rate is expressed as follows,

$$Err = \frac{FN+FP}{TP+FN+TN+FP}, \tag{1}$$

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. Conversely, Acc, the complement of Err, is given by

$$Acc = \frac{TN+TP}{TP+FN+TN+FP}. \tag{2}$$
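As a worked illustration of Eqs. (1) and (2), consider a hypothetical trivial classifier (an assumption made here for illustration, not a method from this study) that labels every sequence of the benchmark dataset of 399 T4SE and 1548 non-T4SE as non-T4SE. Then TP = 0, FP = 0, FN = 399 and TN = 1548, so

$$\begin{aligned} Err &= \frac{399+0}{0+399+1548+0} = \frac{399}{1947} \approx 0.205, \\ Acc &= \frac{1548+0}{0+399+1548+0} = \frac{1548}{1947} \approx 0.795. \end{aligned}$$

The two values sum to one. The fairly high Acc of a classifier that finds no effector at all suggests why measurements sensitive to the positive class are reported alongside Acc on this imbalanced dataset.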

Besides, we choose four widely used quantitative measurements. The confusion matrix presents TP, TN, FP and FN together, and Precision and Recall are computed as follows,

$$Precision = \frac{TP}{TP+FP}, \tag{3}$$

$$Recall = \frac{TP}{TP+FN}. \tag{4}$$

In addition, the F1-measure is the harmonic mean of Precision and Recall, which is expressed as

$$F1\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision+Recall}. \tag{5}$$
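Eqs. (1) to (5) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch: the class name `Confusion` and the convention of returning 0.0 when a denominator is zero are assumptions for illustration and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary confusion matrix (hypothetical helper class)."""
    tp: int
    tn: int
    fp: int
    fn: int

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.tn + self.fp

    def err(self) -> float:
        # Eq. (1): share of misclassified samples
        return (self.fn + self.fp) / self.total

    def acc(self) -> float:
        # Eq. (2): share of correctly classified samples
        return (self.tn + self.tp) / self.total

    def precision(self) -> float:
        # Eq. (3); 0.0 if nothing was predicted positive (assumed convention)
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        # Eq. (4); 0.0 if there are no positive samples (assumed convention)
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        # Eq. (5): harmonic mean of Precision and Recall
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only, not results from the study.
    cm = Confusion(tp=90, tn=380, fp=20, fn=10)
    print(cm.err(), cm.acc(), cm.precision(), cm.recall(), cm.f1())
```

With these made-up counts the F1-measure equals 2TP / (2TP + FP + FN) = 180/210, which is an equivalent form of Eq. (5).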

Moreover, ROC curves are provided as a qualitative (visual) measurement, together with the corresponding AUC values.
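For readers who want to see what the AUC summarises, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the 1/0 label encoding are assumptions for illustration; the article does not state how its AUC values were computed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. T4SE), 0 for the negative class.
    scores: classifier outputs, where higher means "more likely positive".
    Quadratic in the number of samples, which is acceptable for a sketch.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and scores for illustration only.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.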

Availability of data and materials

The public dataset analysed during the current study is available in reference [51], and can be downloaded from the website https://github.com/LoopGan/Effective-prediction-of-bacterial-type-IV-secreted-effectors.

Abbreviations

Acc: Accuracy
AUC: Area under curve
DTC: Decision tree classifier
GBM: Gradient boosting machine
kNN: k-nearest-neighbor
LDA: Linear discriminant analysis
LR: Logistic regression
MLP: Multi-layer perceptron
NB: Naive Bayesian
PseAAC: Pseudo-amino acid composition
PSI-BLAST: Position-specific iterated BLAST
PSSM: Position-specific scoring matrix
RF: Random forest
ROC: Receiver operating characteristic
SVM: Support vector machine
T4SE: Type IV secreted effectors

References

1. Lv ZB, Jin SS, Ding H, Zou Q. A random forest sub-Golgi protein classifier optimized via dipeptide and amino acid composition features. Front Bioeng Biotechnol. 2019;7:215.
2. Zhu XJ, Feng CQ, Lai HY, Chen W, Lin H. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl-Based Syst. 2019;163:787–93.
3. Ru XQ, Li LH, Zou Q. Incorporating distance-based top-n-gram and random forest to identify electron transport proteins. J Proteome Res. 2019;18:2931–9.
4. Li YJ, Niu MT, Zou Q. ELM-MHC: an improved MHC identification method with extreme learning machine algorithm. J Proteome Res. 2019;18:1392–401.
5. Qu K, Wei L, Yu J, Wang C. Identifying plant pentatricopeptide repeat coding gene/protein using mixed feature extraction methods. Front Plant Sci. 2019;9:1–10.
6. Xiong Y, Wang QK, Yang JC, Zhu XL, Wei DQ. PredT4SE-Stack: prediction of bacterial type IV secreted effectors from protein sequences using a stacked ensemble method. Front Microbiol. 2018;9:2571.
7. Zou LY, Nan CH, Hu FQ. Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics. 2013;29(24):3135–42.
8. Ashari ZE, Dasgupta N, Brayton KA, Broschat SL. An optimal set of features for predicting type IV secretion system effector proteins for a subset of species based on a multi-level feature selection approach. PLoS ONE. 2018;13:e0197041.
9. Yu LZ, Guo YZ, Li YZ, Li GB, Li ML, Luo JS, Xiong WJ, Qin WL. SecretP: identifying bacterial secreted proteins by fusing new features into Chou’s pseudo-amino acid composition. J Theor Biol. 2010;267:1–6.
10. Feng PM, Chen W, Lin H, Chou KC. iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal Biochem. 2013;442(1):118–25.
11. Mirza MT, Khan A, Tahir M, Lee YS. MitProt-Pred: predicting mitochondrial proteins of Plasmodium falciparum parasite using diverse physiochemical properties and ensemble classification. Comput Biol Med. 2013;43(10):1502–11.
12. Ahmad J, Hayat M. MFSC: multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou’s PseAAC components. J Theor Biol. 2019;463:99–109.
13. Zhang SL, Duan X. Prediction of protein subcellular localization with oversampling approach and Chou’s general PseAAC. J Theor Biol. 2018;437:239–50.
14. Srivastava A, Kumar R, Kumar M. BlaPred: predicting and classifying beta-lactamase using a 3-tier prediction system via Chou’s general PseAAC. J Theor Biol. 2018;457:29–36.
15. Sankari ES, Manimegalai D. Predicting membrane protein types by incorporating a novel feature set into Chou’s general PseAAC. J Theor Biol. 2018;455:319–28.
16. Sankari ES, Manimegalai D. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol. 2017;435:208–17.
17. Liang YY, Zhang SL. Predict protein structural class by incorporating two different modes of evolutionary information into Chou’s general pseudo amino acid composition. J Mol Graph Model. 2017;78:110–7.
18. Meher PK, Sahu TK, Banchariya A, Rao AR. DIRProt: a computational approach for discriminating insecticide resistant proteins from non-resistant proteins. BMC Bioinform. 2017;18:190.
19. Tiwari AK. Prediction of G-protein coupled receptors and their subfamilies by incorporating various sequence features into Chou’s general PseAAC. Comput Methods Programs Biomed. 2016;134:197–213.
20. Han GS, Yu ZG, Anh V. A two-stage SVM method to predict membrane protein types by incorporating amino acid classifications and physicochemical properties into a general form of Chou’s PseAAC. J Theor Biol. 2014;344:31–9.
21. Chou K. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21:10–9.
22. Chou K. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43:246–55.
23. Wang JW, Yang BJ, Revote J, Leier A, Marquez-Lago TT, Webb G, Song JN, Chou KC, Lithgow T. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics. 2017;33(17):2756–8.
24. Zhang LC, Zhao XQ, Kong L. Predict protein structural class for low-similarity sequences by evolutionary difference information into the general form of Chou’s pseudo amino acid composition. J Theor Biol. 2014;355:105–10.
25. Paliwal KK, Sharma A, Lyons J, Dehzangi A. A tri-gram based feature extraction technique using linear probabilities of position specific scoring matrix for protein fold recognition. IEEE Trans Nanobiosci. 2014;13(1):44–50.
26. Zahiri J, Yaghoubi O, Mohammad-Noori M, Ebrahimpour R, Masoudi-Nejad A. PPIevo: protein–protein interaction prediction from PSSM based evolutionary information. Genomics. 2013;102(4):237–42.
27. Zhang SL, Ye F, Yuan XG. Using principal component analysis and support vector machine to predict protein structural class for low-similarity sequences via PSSM. J Biomol Struct Dyn. 2012;29(6):634–42.
28. Jeong JC, Lin XT, Chen XW. On position-specific scoring matrix for protein function prediction. IEEE-ACM Trans Comput Biol Bioinform. 2011;8(2):308–15.
29. Jia CZ, Liu T, Chang AK, Zhai YY. Prediction of mitochondrial proteins of malaria parasite using bi-profile Bayes feature extraction. Biochimie. 2011;93(4):778–82.
30. Dong QW, Zhou SG, Guan JH. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics. 2009;25(20):2655–62.
31. Cheng CW, Su ECY, Hwang JK, Sung TY, Hsu WL. Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinform. 2008;9(S12):S6.
32. Chou KC, Shen HB. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun. 2007;360(2):339–45.
33. An JY, You ZH, Chen X, Huang DS, Li ZW, Liu G, Wang Y. Identification of self-interacting proteins by exploring evolutionary information embedded in PSI-BLAST-constructed position specific scoring matrix. Oncotarget. 2016;7(50):82440–9.
34. Qin YF, Zheng XQ, Wang J, Chen M, Zhou CJ. Prediction of protein structural class based on Linear Predictive Coding of PSI-BLAST profiles. Open Life Sci. 2015;10(1):529–36.
35. Ding SY, Li Y, Shi ZX, Yan SJ. A protein structural classes prediction method based on predicted secondary structure and PSI-BLAST profile. Biochimie. 2014;97:60–5.
36. Liu T, Zheng XQ, Wang J. Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie. 2010;92(10):1330–4.
37. Kaur H, Raghava GPS. Prediction of alpha-turns in proteins using PSI-BLAST profiles and secondary structure information. Proteins-Struct Funct Bioinform. 2004;55(1):83–90.
38. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
39. Tan CG, Wang T, Yang WY, Deng L. PredPSD: a gradient tree boosting approach for single-stranded and double-stranded DNA binding protein prediction. Molecules. 2020;25(1):98.
40. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.
41. He ZY, Liu H, Moch H, Simon H. Machine learning with autophagy-related proteins for discriminating renal cell carcinoma subtypes. Sci Rep. 2020;10(1):720.
42. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7.
43. Isopescu RD, Spulber R, Josceanu AM, Mihaiescu DE, Popa O. Romanian bee pollen classification and property modelling. J Apicult Res. 2020.
44. Belhumeur PN, Hespanha JP, Kriegman DJ. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell. 1997;19(7):711–20.
45. Wachters JE, Kop E, Slagter-Menkema L, Mastik M, van der Wal JE, van der Vegt B, de Bock GH, van der Laan BFAM, Schuuring E. Distinct biomarker profiles and clinical characteristics in T1–T2 glottic and supraglottic carcinomas. Laryngoscope. 2020.
46. Zhou Y, Li GQ, Li HQ. Automatic cataract classification using deep neural network with discrete state transition. IEEE Trans Med Imaging. 2020;39(2):436–46.
47. Pal SK, Mitra S. Multilayer perceptron, fuzzy sets, and classification. IEEE Trans Neural Netw. 1992;3(5):683–97.
48. Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn. 1997;29(2–3):103–30.
49. Meng CL, Jin SS, Wang L, Guo F, Zou Q. AOPs-SVM: a sequence-based classifier of antioxidant proteins using a support vector machine. Front Bioeng Biotechnol. 2019.
50. Cortes C, Vapnik VN. Support vector networks. Mach Learn. 1995;20(3):273–97.
51. Wang Y, Guo Y, Pu X, Li M. Effective prediction of bacterial type IV secreted effectors by combined features of both C-termini and N-termini. J Comput Aided Mol Des. 2017;31:1029–38.
52. Zhao XD, Jiao Q, Li HY, Wu YM, Wang HX, Huang S, Wang GH. ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles. BMC Bioinform. 2020;21:43.
53. Liu T, Li HY, Zhao XD. Clustering by search in descending order and automatic find of density peaks. IEEE Access. 2019;7:133772–80.

Acknowledgements

This work is derived from a scientific research project supported by the enterprise Suzhou Dachen Medical Technology Co., Ltd.

Funding

This work has been financially supported by the Natural Science Foundation of Heilongjiang Province (No. LH2020F002). The funding body of Fundamental Research Funds for Natural Science Foundation of Heilongjiang Province played an important role in the design of the study, in the collection, analysis and interpretation of data, and in writing the manuscript.

Author information

Authors and Affiliations

College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
Jian Zhang, Lixin Lv & Donglei Lu
College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
Denan Kong, Mohammed Abdoh Ali Al-Alashaari & Xudong Zhao

Contributions

X.D.Z conceived the general research and supervised it. J.Z performed the research and was the principal developer. D.L.L and L.X.L analyzed the data. D.N.K, M.A.A.A and X.D.Z wrote and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xudong Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhang, J., Lv, L., Lu, D. et al. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 21, 480 (2020). https://doi.org/10.1186/s12859-020-03826-6

Received: 21 July 2020
Accepted: 19 October 2020
Published: 27 October 2020
DOI: https://doi.org/10.1186/s12859-020-03826-6
