Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis

Ibtehaz, Nabil; Sourav, S. M. Shakhawat Hossain; Bayzid, Md. Shamsuzzoha; Rahman, M. Sohel

doi:10.1007/s10930-023-10096-7

Published: 28 March 2023

Volume 42, pages 135–146, (2023)

The Protein Journal

Nabil Ibtehaz¹,
S. M. Shakhawat Hossain Sourav²,
Md. Shamsuzzoha Bayzid¹ &
M. Sohel Rahman ORCID: orcid.org/0000-0001-9419-6478¹

Abstract

The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the ‘language of life’, have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, there have been a number of breakthroughs in Natural Language Processing in recent years. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are used for various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and attempted to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram better aid the modeling and training of deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
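For readers unfamiliar with how the standard Skip-gram model is applied to protein sequences, the following is a minimal sketch of the usual preprocessing: a sequence is split into overlapping k-mers, and (center, context) pairs are drawn from a sliding window. The function names, the stride of 1, the window size, and the example sequence are illustrative assumptions, not taken from the article. The abstract does not describe how Align-gram modifies this procedure, so the sketch covers only plain Skip-gram pair generation.

```python
from typing import Iterator, List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs as used to train a standard Skip-gram model."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                yield center, tokens[j]


# Hypothetical toy sequence, chosen only for illustration.
tokens = kmers("MKTAYIAK", k=3)
pairs = list(skipgram_pairs(tokens, window=1))
print(tokens)
print(pairs[:3])
```

In a vanilla Skip-gram setup, pairs such as these would be fed to an embedding model so that k-mers sharing contexts end up with similar vectors.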

Author information

Authors and Affiliations

Department of CSE, BUET, ECE Building, West Palasi, Dhaka, 1205, Bangladesh
Nabil Ibtehaz, Md. Shamsuzzoha Bayzid & M. Sohel Rahman
Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh
S. M. Shakhawat Hossain Sourav

Corresponding author

Correspondence to M. Sohel Rahman.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ibtehaz, N., Sourav, S.M.S.H., Bayzid, M.S. et al. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 42, 135–146 (2023). https://doi.org/10.1007/s10930-023-10096-7

Accepted: 13 February 2023
Published: 28 March 2023
Issue Date: April 2023
DOI: https://doi.org/10.1007/s10930-023-10096-7
