Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis

The Protein Journal

Abstract

The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often called the ‘language of life’, have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in Natural Language Processing. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are used for various biological applications. In this study, we investigated the applicability of the popular Skip-gram model to protein sequence analysis and attempted to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram are better suited to modeling and training deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in different types of deep learning applications for protein sequence analysis.
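To make the Skip-gram setting concrete, the following minimal sketch shows how a protein sequence can be split into overlapping k-mers and turned into the (center, context) pairs that a standard Skip-gram model trains on. It illustrates only the generic Skip-gram input, not the Align-gram training scheme itself. The function names, the example sequence, k = 3, and the window size are assumptions chosen for illustration and are not taken from the article.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers of a protein sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair each token with every neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sequence; k and window are illustrative choices.
tokens = kmers("MKTAYIAK", k=3)
pairs = skipgram_pairs(tokens, window=1)
```

Each pair treats a k-mer as a "word" and its neighbouring k-mers as context, which is how off-the-shelf Skip-gram implementations are typically applied to sequences.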

Notes

  1. http://server.malab.cn/Local-DPP/Datasets.html

  2. https://github.com/songlab-cal/tape

  3. https://github.com/bio-ontology-research-group/deepgoplus

  4. http://deepgoplus.bio2vec.net/data/data-cafa.tar.gz

Author information

Corresponding author

Correspondence to M. Sohel Rahman.

About this article

Cite this article

Ibtehaz, N., Sourav, S.M.S.H., Bayzid, M.S. et al. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 42, 135–146 (2023). https://doi.org/10.1007/s10930-023-10096-7

  • DOI: https://doi.org/10.1007/s10930-023-10096-7
