Matching Pattern in DNA Sequences Using Machine Learning Approach Based on K-Mer Function

Modern Approaches in Machine Learning & Cognitive Science: A Walkthrough

Part of the book series: Studies in Computational Intelligence (SCI, volume 1027)

Abstract

Information derived from DNA sequences is important for scientists examining the functions of genes. The volume of biological data being generated has recently grown tremendously, so it is essential to match and classify DNA sequences. This study aims to develop effective approaches for pre-processing DNA data and extracting features from it. Feature extraction is performed with a count vectorizer, and the DNA sequences are classified with machine learning (ML) algorithms using the k-mer function. Matching sequences are then retrieved with a pattern-matching algorithm. The results show that the linear SVM classifier performs well.
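The abstract names three steps: splitting sequences into k-mers, turning them into count features, and retrieving matches with a pattern-matching algorithm. The following is a minimal sketch of those ideas in plain Python, not the chapter's implementation. The function names, the toy sequences, the value of `k`, and the use of exact substring search as the matcher are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter


def kmers(sequence, k=6):
    """Return all overlapping k-mers of a DNA sequence, lowercased."""
    seq = sequence.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_vectors(sequences, k=6):
    """Build a sorted k-mer vocabulary and one count vector per sequence.

    This mimics what a count vectorizer does with k-mers as "words";
    it is a stand-in, not the vectorizer used in the chapter.
    """
    counts = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors


def find_pattern(sequence, pattern):
    """Return every start index where pattern occurs (overlaps allowed).

    Exact matching via str.find is an assumed, simplest-possible matcher.
    """
    hits = []
    start = sequence.find(pattern)
    while start != -1:
        hits.append(start)
        start = sequence.find(pattern, start + 1)
    return hits


# Hypothetical toy data, chosen only to keep the example small.
toy_sequences = ["ATGCATGC", "GGGCATGA"]
vocab, vectors = count_vectors(toy_sequences, k=3)
print(vocab)
print(vectors)
print(find_pattern(toy_sequences[0], "ATG"))
```

The count vectors produced here are the kind of feature matrix a classifier would be trained on. With scikit-learn available, a comparable pipeline would typically pass space-joined k-mers to `CountVectorizer` and fit an `SVC` with `kernel="linear"` on the result; the chapter's actual settings are not given in the abstract.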

Author information

Corresponding author

Correspondence to M. C. Prashanth.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Ravikumar, M., Prashanth, M.C., Guru, D.S. (2022). Matching Pattern in DNA Sequences Using Machine Learning Approach Based on K-Mer Function. In: Gunjan, V.K., Zurada, J.M. (eds) Modern Approaches in Machine Learning & Cognitive Science: A Walkthrough. Studies in Computational Intelligence, vol 1027. Springer, Cham. https://doi.org/10.1007/978-3-030-96634-8_14
