BioSequence2Vec: Efficient Embedding Generation for Biological Sequences

Advances in Knowledge Discovery and Data Mining (PAKDD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13936)


Abstract

Representation learning is an important step in the machine learning pipeline. Given the current volume of biological sequencing data, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven, efficient, and useful alternative for several machine learning (ML) tasks such as sequence classification. Three challenges with kernel methods are (i) the computation time, (ii) the memory usage (storing an \(n\times n\) matrix), and (iii) the restriction of kernel matrices to kernel-based ML methods (they are difficult to generalize to non-kernel classifiers). While (i) can be addressed using approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), doing so may result in information loss while being computationally expensive. In this paper, we propose a general-purpose representation learning approach that embodies kernel methods’ qualities while avoiding the computation, memory, and generalizability challenges. This involves computing a low-dimensional embedding of each sequence, using random projections of its k-mer frequency vectors, significantly reducing the computation needed to compute the dot product and the memory needed to store the resulting representation. Our proposed fast and alignment-free embedding method can be used as input to any distance-based (e.g., k-nearest neighbors) and non-distance-based (e.g., decision tree) ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.
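The idea of embedding a sequence by randomly projecting its k-mer frequency vector can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmer_counts`, `kmer_signs`, `embed`), the SHA-256 seeding, the random sign projection, and the defaults `k=3` and `dim=256` are all assumptions chosen for illustration. The sketch accumulates one pseudo-random ±1 vector per observed k-mer, so the full k-mer frequency vector is never materialized, and the dot product of two embeddings approximates the dot product of the underlying k-mer count vectors.

```python
import hashlib
from collections import Counter

import numpy as np


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_signs(kmer: str, dim: int) -> np.ndarray:
    """Hypothetical helper: a fixed pseudo-random +/-1 vector for one k-mer.

    The vector is seeded from a hash of the k-mer, so the same k-mer always
    maps to the same signs without storing a projection matrix.
    """
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.choice(np.array([-1.0, 1.0]), size=dim)


def embed(seq: str, k: int = 3, dim: int = 256) -> np.ndarray:
    """Project the k-mer count vector of seq down to `dim` dimensions."""
    out = np.zeros(dim)
    for kmer, count in kmer_counts(seq, k).items():
        out += count * kmer_signs(kmer, dim)
    # Scaling by 1/sqrt(dim) makes embed(x) @ embed(y) an estimate of the
    # dot product between the two k-mer count vectors.
    return out / np.sqrt(dim)


if __name__ == "__main__":
    a = embed("ACGTACGTTGCAACGT")
    b = embed("ACGTACGATGCAACGT")
    print(a.shape, float(a @ b))
```

Each sequence is stored as a vector of `dim` numbers regardless of the alphabet size or k, and the resulting vectors can be passed to any classifier that accepts fixed-length numeric features.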

M. Patterson and I. U. Khan — Joint Last Authors.


Notes

  1. https://www.gisaid.org/


Author information

Corresponding authors

Correspondence to Murray Patterson or Imdad Ullah Khan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ali, S., Sardar, U., Patterson, M., Khan, I.U. (2023). BioSequence2Vec: Efficient Embedding Generation for Biological Sequences. In: Kashima, H., Ide, T., Peng, WC. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol 13936. Springer, Cham. https://doi.org/10.1007/978-3-031-33377-4_14

  • DOI: https://doi.org/10.1007/978-3-031-33377-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-33376-7

  • Online ISBN: 978-3-031-33377-4

  • eBook Packages: Computer Science, Computer Science (R0)
