Abstract
Modern pangenome graphs are built using haplotype-resolved genome assemblies. While mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes has been shown to improve genotyping accuracy. However, the existing rigorous formulations for sequence-to-graph co-linear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes.
We present novel formulations and algorithms for haplotype-aware sequence alignment to directed acyclic graphs (DAGs). We consider both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, we extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve haplotype-aware sequence-to-DAG alignment in \(O(|Q||E||\mathcal {H}|)\) time where Q is the query sequence, E is the set of edges, and \(\mathcal {H}\) is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than \(O(|Q||E||\mathcal {H}|)\) is impossible under the Strong Exponential Time Hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in \(O(|\mathcal {H}|N \log {|\mathcal {H}|N})\) time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than \(O(|\mathcal {H}|N)\) is impossible under SETH. As a proof-of-concept of our algorithmic solutions, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). We demonstrate the advantage of the algorithm by aligning sequences sampled from human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes. The proposed algorithm offers better consistency with ground-truth recombinations when compared to a haplotype-agnostic algorithm.
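To make the recombination-penalty idea concrete, the following is a minimal Python sketch of a haplotype-aware edit-distance dynamic program on a small DAG. It is an illustration under simplifying assumptions, not the formulation from the paper or the Minichain implementation: each vertex carries one character, vertices are numbered in topological order, the edge set is the union of consecutive vertex pairs on the haplotype paths, the walk may start and end at any vertex, and entering a vertex costs `switch_penalty` unless the previous haplotype label is kept and that haplotype traverses the edge. The function name `haplotype_aware_align` and all parameter names are hypothetical. Caching the per-vertex minimum over haplotypes (`best_any`) keeps the work at roughly |Q| x |E| x |H| rather than |Q| x |E| x |H|^2.

```python
from math import inf


def haplotype_aware_align(query, labels, haplotypes, switch_penalty):
    """Return min (edits + switch_penalty * switches) over walks in the DAG.

    labels[v]     : character on vertex v (vertex ids are a topological order)
    haplotypes[h] : vertex ids visited by haplotype path h, in increasing order
    """
    n, m, num_haps = len(labels), len(query), len(haplotypes)
    on_vertex = [set() for _ in range(n)]  # haplotypes passing through each vertex
    edge_haps = {}                         # (u, v) -> haplotypes traversing that edge
    for h, path in enumerate(haplotypes):
        for v in path:
            on_vertex[v].add(h)
        for u, v in zip(path, path[1:]):
            edge_haps.setdefault((u, v), set()).add(h)
    preds = [[] for _ in range(n)]
    for u, v in edge_haps:
        preds[v].append(u)

    # D[v][h][i]: best cost of aligning query[:i] to a walk ending at v labelled h
    D = [[[inf] * (m + 1) for _ in range(num_haps)] for _ in range(n)]
    # best_any[v][i]: minimum of D[v][h][i] over all haplotypes h through v
    best_any = [[inf] * (m + 1) for _ in range(n)]

    for v in range(n):
        for h in on_vertex[v]:
            # enter[j]: cheapest way to arrive at (v, h) with query[:j] consumed.
            # Starting the walk at v after inserting query[:j] costs j.
            enter = list(range(m + 1))
            for u in preds[v]:
                same = D[u][h] if h in edge_haps[(u, v)] else None
                for j in range(m + 1):
                    cand = best_any[u][j] + switch_penalty  # haplotype switch
                    if same is not None and same[j] < cand:
                        cand = same[j]                      # stay on haplotype h
                    if cand < enter[j]:
                        enter[j] = cand
            row = D[v][h]
            row[0] = enter[0] + 1  # v is deleted (no query character consumed)
            for i in range(1, m + 1):
                row[i] = min(
                    enter[i] + 1,                                # delete v
                    enter[i - 1] + (query[i - 1] != labels[v]),  # match or substitute
                    row[i - 1] + 1,                              # insert query[i-1]
                )
        for i in range(m + 1):
            best_any[v][i] = min((D[v][h][i] for h in on_vertex[v]), default=inf)

    return min(best_any[v][m] for v in range(n))


# Two haplotypes spelling ACCTCCA and AGGTGGA through two bubbles.
labels = "ACCGGTCCGGA"
haps = [[0, 1, 2, 5, 6, 7, 10], [0, 3, 4, 5, 8, 9, 10]]
for penalty in (0, 1, 5):
    print(penalty, haplotype_aware_align("ACCTGGA", labels, haps, penalty))
```

In this toy graph the query ACCTGGA is a mosaic of the two haplotypes. With a penalty of 0 the sketch behaves like a haplotype-agnostic aligner and accepts the recombinant walk at no cost; with a high penalty it prefers to stay on one haplotype and pay for two substitutions instead.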
A longer version of this paper is available on bioRxiv [3].
References
1. Abouelhoda, M., Ohlebusch, E.: Chaining algorithms for multiple genome comparison. J. Disc. Algor. 3(2–4), 321–341 (2005)
2. Chandra, G., Jain, C.: Gap-sensitive colinear chaining algorithms for acyclic pangenome graphs. J. Comput. Biol. 30(11), 1182–1197 (2023)
3. Chandra, G., Jain, C.: Haplotype-aware sequence-to-graph alignment. In: bioRxiv, pp. 2023–11 (2023). https://doi.org/10.1101/2023.11.15.566493
4. Ebler, J., et al.: Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54(4), 518–525 (2022)
5. Li, H.: Sample graphs and sequences for testing sequence-to-graph alignment (2022). https://doi.org/10.5281/zenodo.6617246
6. Li, N., Stephens, M.: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165(4), 2213–2233 (2003)
7. Liao, W.W., et al.: A draft human pangenome reference. Nature 617(7960), 312–324 (2023)
8. Ma, J., Cáceres, M., Salmela, L., Mäkinen, V., Tomescu, A.I.: Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics 39(8), btad460 (2023)
9. Mäkinen, V., Tomescu, A.I., Kuosmanen, A., Paavilainen, T., Gagie, T., Chikhi, R.: Sparse dynamic programming on DAGs with small width. ACM Trans. Algor. 15(2), 1–21 (2019)
10. Navarro, G.: Improved approximate pattern matching on hypertext. Theoret. Comput. Sci. 237(1–2), 455–463 (2000)
11. Pritt, J., Chen, N.C., Langmead, B.: Forge: prioritizing variants for graph genomes. Genome Biol. 19(1), 1–16 (2018)
12. Williams, V.V.: Hardness of easy problems: basing hardness on popular conjectures such as the strong exponential time hypothesis (invited talk). In: 10th International Symposium on Parameterized and Exact Computation (IPEC 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2015)
Acknowledgements
This work is supported by funding from the National Supercomputing Mission, India under DST/NSM/R&D_HPC_Applications, the Science and Engineering Research Board (SERB) under SRG/2021/000044, and the Intel India Research Fellowship.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chandra, G., Gibney, D., Jain, C. (2024). Haplotype-Aware Sequence Alignment to Pangenome Graphs. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham. https://doi.org/10.1007/978-1-0716-3989-4_36
DOI: https://doi.org/10.1007/978-1-0716-3989-4_36
Publisher Name: Springer, Cham
Print ISBN: 978-1-0716-3988-7
Online ISBN: 978-1-0716-3989-4
eBook Packages: Computer Science, Computer Science (R0)