Sequence to Graph Alignment Using Gap-Sensitive Co-linear Chaining

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2023)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13976)

Abstract

Co-linear chaining is a widely used technique in sequence alignment tools that follow the seed-filter-extend methodology. It is a mathematically rigorous approach to combining short exact matches. For co-linear chaining between two sequences, efficient subquadratic-time chaining algorithms are well known for linear, concave and convex gap cost functions [Eppstein et al. JACM’92]. However, developing extensions of chaining algorithms for directed acyclic graphs (DAGs) has been challenging. Recently, a new sparse dynamic programming framework was introduced that exploits the small path cover of pangenome reference DAGs and enables efficient chaining [Mäkinen et al. TALG’19, RECOMB’18]. However, the underlying problem formulation did not consider gap cost, which makes chaining less effective in practice. To address this, we develop novel problem formulations and optimal chaining algorithms that support a variety of gap cost functions. We demonstrate empirically the ability of our provably-good chaining implementation to align long reads more precisely in comparison to existing aligners. For mapping simulated long reads from a human genome to a pangenome DAG of 95 human haplotypes, we achieve \(98.7\%\) precision while leaving \(<2\%\) reads unmapped.

Implementation: https://github.com/at-cg/minichain.
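
For readers unfamiliar with the technique, the following is a minimal sketch of gap-sensitive co-linear chaining between two plain sequences, not the DAG algorithm of the paper and not code taken from minichain. It uses a simple quadratic-time dynamic program rather than the subquadratic methods cited above. The `Anchor` type, the function names, and the particular gap cost (the absolute difference between the gap lengths in the two sequences) are assumptions chosen for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Anchor(NamedTuple):
    ref_start: int    # start of the exact match in the reference
    query_start: int  # start of the exact match in the query
    length: int       # match length (> 0), also used as the anchor weight


def gap_cost(a: Anchor, b: Anchor) -> int:
    """Illustrative gap cost (an assumption): difference of the two gap lengths."""
    ref_gap = b.ref_start - (a.ref_start + a.length)
    query_gap = b.query_start - (a.query_start + a.length)
    return abs(ref_gap - query_gap)


def precedes(a: Anchor, b: Anchor) -> bool:
    """True if a ends before b starts in both sequences (no overlap)."""
    return (a.ref_start + a.length <= b.ref_start
            and a.query_start + a.length <= b.query_start)


def best_chain(anchors: List[Anchor]) -> Tuple[int, List[Anchor]]:
    """Return the maximum chain score and one chain achieving it (O(n^2) DP)."""
    if not anchors:
        return 0, []
    # Sorting by reference start guarantees every valid predecessor comes first.
    order = sorted(anchors)
    score = [a.length for a in order]   # best score of a chain ending at each anchor
    parent = [-1] * len(order)          # back-pointers for chain recovery
    for j, b in enumerate(order):
        for i in range(j):
            a = order[i]
            if precedes(a, b):
                candidate = score[i] + b.length - gap_cost(a, b)
                if candidate > score[j]:
                    score[j] = candidate
                    parent[j] = i
    # Trace back from the best-scoring anchor.
    k = max(range(len(order)), key=score.__getitem__)
    best = score[k]
    chain = []
    while k != -1:
        chain.append(order[k])
        k = parent[k]
    chain.reverse()
    return best, chain


anchors = [Anchor(0, 0, 5), Anchor(10, 10, 5), Anchor(7, 30, 4), Anchor(20, 21, 6)]
score, chain = best_chain(anchors)
print(score, chain)
```

In this toy input the anchor at `(7, 30)` is far off the diagonal, so the gap penalty keeps it out of the best chain. A formulation without gap costs could not distinguish it from a well-placed anchor of the same length, which is the weakness the abstract describes.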

References

  1. Abouelhoda, M., Ohlebusch, E.: Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3(2–4), 321–341 (2005)

  2. Baaijens, J.A., et al.: Computational graph pangenomics: a tutorial on data structures and their applications. Nat. Comput. 21, 81–108 (2022). https://doi.org/10.1007/s11047-022-09882-6

  3. Backurs, A., Indyk, P.: Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 51–58 (2015)

  4. de Berg, M., Cheong, O., van Kreveld, M.J., Overmars, M.H.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-77974-2

  5. Cáceres, M., Cairo, M., Mumey, B., Rizzi, R., Tomescu, A.I.: Sparsifying, shrinking and splicing for minimum path cover in parameterized linear time. In: Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 359–376. SIAM (2022)

  6. Chandra, G., Jain, C.: Sequence to graph alignment using gap-sensitive co-linear chaining. BioRxiv (2022). https://doi.org/10.1101/2022.08.29.505691

  7. Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. 19(1), 118–135 (2018)

  8. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2022)

  9. Dvorkina, T., Antipov, D., Korobeynikov, A., Nurk, S.: SPAligner: alignment of long diverged molecular sequences to assembly graphs. BMC Bioinform. 21(12), 1–14 (2020)

  10. Eggertsson, H.P., Jonsson, H., Kristmundsdottir, S., et al.: Graphtyper enables population-scale genotyping using pangenome graphs. Nat. Genet. 49(11), 1654–1660 (2017)

  11. Eizenga, J.M., et al.: Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139 (2020)

  12. Eppstein, D., Galil, Z., Giancarlo, R., Italiano, G.F.: Sparse dynamic programming I: linear cost functions. J. ACM 39(3), 519–545 (1992)

  13. Eppstein, D., Galil, Z., Giancarlo, R., Italiano, G.F.: Sparse dynamic programming II: convex and concave cost functions. J. ACM 39(3), 546–567 (1992)

  14. Garg, S., Rautiainen, M., Novak, A.M., et al.: A graph-based approach to diploid genome assembly. Bioinformatics 34(13), i105–i114 (2018)

  15. Illumina: DRAGEN v3.10.4 software release notes. https://support.illumina.com/content/dam/illumina-support/documents/downloads/software/dragen/200016065_00_DRAGEN-3.10-Customer-Release-Notes.pdf. Accessed 08 Aug 2022

  16. Ivanov, P., Bichsel, B., Vechev, M.: Fast and optimal sequence-to-graph alignment guided by seeds. In: Pe’er, I. (ed.) RECOMB 2022. LNBI, vol. 13278, pp. 306–325. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04749-7_22

  17. Jain, C., Gibney, D., Thankachan, S.V.: Co-linear chaining with overlaps and gap costs. In: Pe’er, I. (ed.) RECOMB 2022. LNBI, vol. 13278, pp. 246–262. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04749-7_15

  18. Jain, C., Misra, S., Zhang, H., Dilthey, A., Aluru, S.: Accelerating sequence alignment to graphs. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 451–461. IEEE (2019)

  19. Jain, C., Rhie, A., Hansen, N.F., Koren, S., Phillippy, A.M.: Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19(6), 705–710 (2022)

  20. Jain, C., et al.: Weighted minimizer sampling improves long read mapping. Bioinformatics 36(Supplement_1), i111–i118 (2020)

  21. Jain, C., Zhang, H., Dilthey, A., Aluru, S.: Validating paired-end read alignments in sequence graphs. In: 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2019)

  22. Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018). https://doi.org/10.1093/bioinformatics/bty191

  23. Li, H., Feng, X., Chu, C.: The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21(1), 265 (2020). https://doi.org/10.1186/s13059-020-02168-z

  24. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11), 1851–1858 (2008)

  25. Liao, W.W., et al.: A draft human pangenome reference. BioRxiv (2022). https://doi.org/10.1101/2022.07.09.499321

  26. Ma, J., Cáceres, M., Salmela, L., Mäkinen, V., Tomescu, A.I.: GraphChainer: co-linear chaining for accurate alignment of long reads to variation graphs. BioRxiv (2022)

  27. Mäkinen, V., Sahlin, K.: Chaining with overlaps revisited. In: 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2020)

  28. Mäkinen, V., Tomescu, A.I., Kuosmanen, A., Paavilainen, T., Gagie, T., Chikhi, R.: Sparse dynamic programming on DAGs with small width. ACM Trans. Algorithms 15(2), 1–21 (2019)

  29. Myers, G., Miller, W.: Chaining multiple-alignment fragments in sub-quadratic time. In: SODA, vol. 95, pp. 38–47 (1995)

  30. Navarro, G.: Improved approximate pattern matching on hypertext. Theor. Comput. Sci. 237(1–2), 455–463 (2000)

  31. Nurk, S., Koren, S., Rhie, A., Rautiainen, M., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (2022). https://doi.org/10.1126/science.abj6987

  32. Ono, Y., Asai, K., Hamada, M.: PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics 37(5), 589–595 (2020). https://doi.org/10.1093/bioinformatics/btaa835

  33. Otto, C., Hoffmann, S., Gorodkin, J., Stadler, P.F.: Fast local fragment chaining using sum-of-pair gap costs. Algorithms Mol. Biol. 6(1), 4 (2011). https://doi.org/10.1186/1748-7188-6-4

  34. Paten, B., Novak, A.M., Eizenga, J.M., Garrison, E.: Genome graphs and the evolution of genome inference. Genome Res. 27(5), 665–676 (2017)

  35. Rautiainen, M., Marschall, T.: GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21(1), 1–28 (2020). https://doi.org/10.1186/s13059-020-02157-2

  36. Ren, J., Chaisson, M.J.: lra: a long read aligner for sequences and contigs. PLoS Comput. Biol. 17(6), e1009078 (2021)

  37. Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004). https://doi.org/10.1093/bioinformatics/bth408

  38. Sahlin, K., Baudeau, T., Cazaux, B., Marchet, C.: A survey of mapping algorithms in the long-reads era. BioRxiv (2022)

  39. Sahlin, K., Mäkinen, V.: Accurate spliced alignment of long RNA sequencing reads. Bioinformatics 37(24), 4643–4651 (2021)

  40. Salmela, L., Rivals, E.: LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24), 3506–3514 (2014)

  41. Sirén, J., Monlong, J., Chang, X., et al.: Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374(6574), abg8871 (2021)

  42. Wang, T., Antonacci-Fulton, L., Howe, K., et al.: The human pangenome project: a global resource to map genomic diversity. Nature 604(7906), 437–446 (2022)

  43. Zhang, H., Wu, S., Aluru, S., Li, H.: Fast sequence to graph alignment using the graph wavefront algorithm. arXiv preprint arXiv:2206.13574 (2022)

Acknowledgements

This work was supported by funding from the National Supercomputing Mission, India under DST/NSM/R&D_HPC_Applications. We used computing resources provided by the C-DAC National PARAM Supercomputing Facility, India, and the National Energy Research Scientific Computing Center, USA.

Author information

Corresponding author

Correspondence to Chirag Jain.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Chandra, G., Jain, C. (2023). Sequence to Graph Alignment Using Gap-Sensitive Co-linear Chaining. In: Tang, H. (ed.) Research in Computational Molecular Biology. RECOMB 2023. Lecture Notes in Computer Science, vol 13976. Springer, Cham. https://doi.org/10.1007/978-3-031-29119-7_4

  • DOI: https://doi.org/10.1007/978-3-031-29119-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-29118-0

  • Online ISBN: 978-3-031-29119-7

  • eBook Packages: Computer Science, Computer Science (R0)
