Abstract
Co-linear chaining is a widely used technique in sequence alignment tools that follow the seed-filter-extend methodology. It is a mathematically rigorous approach to combine short exact matches. For co-linear chaining between two sequences, efficient subquadratic-time chaining algorithms are well known for linear, concave and convex gap cost functions [Eppstein et al. JACM’92]. However, developing extensions of chaining algorithms for directed acyclic graphs (DAGs) has been challenging. Recently, a new sparse dynamic programming framework was introduced that exploits the small path cover of pangenome reference DAGs and enables efficient chaining [Mäkinen et al. TALG’19, RECOMB’18]. However, the underlying problem formulation did not consider gap cost, which makes chaining less effective in practice. To address this, we develop novel problem formulations and optimal chaining algorithms that support a variety of gap cost functions. We demonstrate empirically the ability of our provably-good chaining implementation to align long reads more precisely in comparison to existing aligners. For mapping simulated long reads from the human genome to a pangenome DAG of 95 human haplotypes, we achieve \(98.7\%\) precision while leaving \(<2\%\) reads unmapped.
Implementation: https://github.com/at-cg/minichain.
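To make the notion of gap-sensitive chaining concrete, the following is a minimal sketch of co-linear chaining between two plain sequences, not the DAG algorithm of this paper. It uses a simple quadratic-time dynamic program rather than the subquadratic methods cited above. The anchor layout `(query_start, ref_start, length)`, the function name `chain_anchors`, the `gap_weight` parameter, and the specific linear gap cost (the absolute difference between the query gap and the reference gap) are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (query_start, ref_start, length) of an exact match.
Anchor = Tuple[int, int, int]


def chain_anchors(anchors: List[Anchor],
                  gap_weight: float = 1.0) -> Tuple[float, List[Anchor]]:
    """Return the best chain score and the chain itself.

    Illustrative O(n^2) dynamic program. Anchor i may precede anchor j only
    if i ends before j starts on both sequences (no overlaps allowed here).
    The gap cost is an assumed linear function: gap_weight * |gap_q - gap_r|.
    """
    if not anchors:
        return 0.0, []

    # Sorting by query start guarantees every valid predecessor comes earlier.
    order = sorted(anchors)
    n = len(order)
    score = [float(a[2]) for a in order]  # a chain may start at any anchor
    prev = [-1] * n                       # back-pointers for traceback

    for j in range(n):
        qj, rj, lj = order[j]
        for i in range(j):
            qi, ri, li = order[i]
            gap_q = qj - (qi + li)  # unmatched query bases between i and j
            gap_r = rj - (ri + li)  # unmatched reference bases between i and j
            if gap_q < 0 or gap_r < 0:
                continue  # i does not precede j on both sequences
            cand = score[i] + lj - gap_weight * abs(gap_q - gap_r)
            if cand > score[j]:
                score[j] = cand
                prev[j] = i

    # Trace back from the highest-scoring chain end.
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    chain.reverse()
    return best, chain


if __name__ == "__main__":
    demo = [(0, 0, 5), (5, 5, 5), (12, 10, 4), (3, 40, 6)]
    print(chain_anchors(demo))
```

In the demo input, the anchor `(3, 40, 6)` lies far off the diagonal followed by the other three, so it is left out of the best chain. Without a gap cost, a chaining objective has no way to penalize anchors that are co-linear but far apart, which is the practical shortcoming the abstract attributes to the earlier DAG formulation.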
References
Abouelhoda, M., Ohlebusch, E.: Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3(2–4), 321–341 (2005)
Baaijens, J.A., et al.: Computational graph pangenomics: a tutorial on data structures and their applications. Nat. Comput. 21, 81–108 (2022). https://doi.org/10.1007/s11047-022-09882-6
Backurs, A., Indyk, P.: Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 51–58 (2015)
de Berg, M., Cheong, O., van Kreveld, M.J., Overmars, M.H.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-77974-2
Cáceres, M., Cairo, M., Mumey, B., Rizzi, R., Tomescu, A.I.: Sparsifying, shrinking and splicing for minimum path cover in parameterized linear time. In: Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 359–376. SIAM (2022)
Chandra, G., Jain, C.: Sequence to graph alignment using gap-sensitive co-linear chaining. BioRxiv (2022). https://doi.org/10.1101/2022.08.29.505691
Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. 19(1), 118–135 (2018)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2022)
Dvorkina, T., Antipov, D., Korobeynikov, A., Nurk, S.: SPAligner: alignment of long diverged molecular sequences to assembly graphs. BMC Bioinform. 21(12), 1–14 (2020)
Eggertsson, H.P., Jonsson, H., Kristmundsdottir, S., et al.: Graphtyper enables population-scale genotyping using pangenome graphs. Nat. Genet. 49(11), 1654–1660 (2017)
Eizenga, J.M., et al.: Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139 (2020)
Eppstein, D., Galil, Z., Giancarlo, R., Italiano, G.F.: Sparse dynamic programming I: linear cost functions. J. ACM 39(3), 519–545 (1992)
Eppstein, D., Galil, Z., Giancarlo, R., Italiano, G.F.: Sparse dynamic programming II: convex and concave cost functions. J. ACM 39(3), 546–567 (1992)
Garg, S., Rautiainen, M., Novak, A.M., et al.: A graph-based approach to diploid genome assembly. Bioinformatics 34(13), i105–i114 (2018)
Illumina: DRAGEN v3.10.4 software release notes. https://support.illumina.com/content/dam/illumina-support/documents/downloads/software/dragen/200016065_00_DRAGEN-3.10-Customer-Release-Notes.pdf. Accessed 08 Aug 2022
Ivanov, P., Bichsel, B., Vechev, M.: Fast and optimal sequence-to-graph alignment guided by seeds. In: Pe’er, I. (ed.) RECOMB 2022. LNBI, vol. 13278, pp. 306–325. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04749-7_22
Jain, C., Gibney, D., Thankachan, S.V.: Co-linear chaining with overlaps and gap costs. In: Pe’er, I. (ed.) RECOMB 2022. LNBI, vol. 13278, pp. 246–262. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04749-7_15
Jain, C., Misra, S., Zhang, H., Dilthey, A., Aluru, S.: Accelerating sequence alignment to graphs. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 451–461. IEEE (2019)
Jain, C., Rhie, A., Hansen, N.F., Koren, S., Phillippy, A.M.: Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19(6), 705–710 (2022)
Jain, C., et al.: Weighted minimizer sampling improves long read mapping. Bioinformatics 36(Supplement_1), i111–i118 (2020)
Jain, C., Zhang, H., Dilthey, A., Aluru, S.: Validating paired-end read alignments in sequence graphs. In: 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2019)
Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018). https://doi.org/10.1093/bioinformatics/bty191
Li, H., Feng, X., Chu, C.: The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21(1), 265 (2020). https://doi.org/10.1186/s13059-020-02168-z
Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11), 1851–1858 (2008)
Liao, W.W., et al.: A draft human pangenome reference. BioRxiv (2022). https://doi.org/10.1101/2022.07.09.499321
Ma, J., Cáceres, M., Salmela, L., Mäkinen, V., Tomescu, A.I.: GraphChainer: co-linear chaining for accurate alignment of long reads to variation graphs. BioRxiv (2022)
Mäkinen, V., Sahlin, K.: Chaining with overlaps revisited. In: 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2020)
Mäkinen, V., Tomescu, A.I., Kuosmanen, A., Paavilainen, T., Gagie, T., Chikhi, R.: Sparse dynamic programming on DAGs with small width. ACM Trans. Algorithms 15(2), 1–21 (2019)
Myers, G., Miller, W.: Chaining multiple-alignment fragments in sub-quadratic time. In: SODA, vol. 95, pp. 38–47 (1995)
Navarro, G.: Improved approximate pattern matching on hypertext. Theor. Comput. Sci. 237(1–2), 455–463 (2000)
Nurk, S., Koren, S., Rhie, A., Rautiainen, M., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (2022). https://doi.org/10.1126/science.abj6987
Ono, Y., Asai, K., Hamada, M.: PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics 37(5), 589–595 (2020). https://doi.org/10.1093/bioinformatics/btaa835
Otto, C., Hoffmann, S., Gorodkin, J., Stadler, P.F.: Fast local fragment chaining using sum-of-pair gap costs. Algorithms Mol. Biol. 6(1), 4 (2011). https://doi.org/10.1186/1748-7188-6-4
Paten, B., Novak, A.M., Eizenga, J.M., Garrison, E.: Genome graphs and the evolution of genome inference. Genome Res. 27(5), 665–676 (2017)
Rautiainen, M., Marschall, T.: GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21(1), 1–28 (2020). https://doi.org/10.1186/s13059-020-02157-2
Ren, J., Chaisson, M.J.: lra: a long read aligner for sequences and contigs. PLoS Comput. Biol. 17(6), e1009078 (2021)
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004). https://doi.org/10.1093/bioinformatics/bth408
Sahlin, K., Baudeau, T., Cazaux, B., Marchet, C.: A survey of mapping algorithms in the long-reads era. BioRxiv (2022)
Sahlin, K., Mäkinen, V.: Accurate spliced alignment of long RNA sequencing reads. Bioinformatics 37(24), 4643–4651 (2021)
Salmela, L., Rivals, E.: LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24), 3506–3514 (2014)
Sirén, J., Monlong, J., Chang, X., et al.: Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374(6574), abg8871 (2021)
Wang, T., Antonacci-Fulton, L., Howe, K., et al.: The human pangenome project: a global resource to map genomic diversity. Nature 604(7906), 437–446 (2022)
Zhang, H., Wu, S., Aluru, S., Li, H.: Fast sequence to graph alignment using the graph wavefront algorithm. arXiv preprint arXiv:2206.13574 (2022)
Acknowledgements
This work was supported by funding from the National Supercomputing Mission, India under DST/NSM/R&D_HPC_Applications. We used computing resources provided by the C-DAC National PARAM Supercomputing Facility, India, and the National Energy Research Scientific Computing Center, USA.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chandra, G., Jain, C. (2023). Sequence to Graph Alignment Using Gap-Sensitive Co-linear Chaining. In: Tang, H. (eds) Research in Computational Molecular Biology. RECOMB 2023. Lecture Notes in Computer Science, vol 13976. Springer, Cham. https://doi.org/10.1007/978-3-031-29119-7_4
DOI: https://doi.org/10.1007/978-3-031-29119-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-29118-0
Online ISBN: 978-3-031-29119-7
eBook Packages: Computer Science (R0)