Abstract
The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. K-spectra have many applications in bioinformatics, including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) was recently introduced as an algorithmic tool to represent and query these objects efficiently. The longest common prefix (\(\textit{LCP}\)) array for a k-spectrum is an array of length n that stores the length of the longest common prefix of adjacent k-mers as they occur in lexicographical order. The \(\textit{LCP}\) array has at least two important applications: accelerating pseudoalignment algorithms that use the SBWT, and allowing simulation of variable-order de Bruijn graphs within the SBWT framework. In this paper we explore algorithms to compute the \(\textit{LCP}\) array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk) time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the \(\textit{LCP}\) array can be computed in optimal O(n) time, where n is the length of the SBWT of the spectrum. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but that for smaller values of k it is often outperformed by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and \(\textit{LCP}\) array construction algorithms for suffix arrays. Our C++ implementations of these algorithms are available at https://github.com/jnalanko/kmer-lcs.
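To make the two definitions concrete, the following is a minimal Python sketch that builds a k-spectrum from a plain string and computes its LCP array by direct character comparison. It works on an explicit sorted list of k-mers rather than on the SBWT, so it illustrates only the definitions and the O(nk) cost of the naive comparison, not the algorithms of the paper. The function names `k_spectrum` and `lcp_array`, the convention that the first entry is 0, and the example string are assumptions made for this illustration.

```python
def k_spectrum(text: str, k: int) -> list[str]:
    """Return the distinct length-k substrings of text, in lexicographic order."""
    return sorted({text[i:i + k] for i in range(len(text) - k + 1)})


def lcp_array(kmers: list[str]) -> list[int]:
    """Naive LCP array of a lexicographically sorted list of k-mers.

    lcp[i] is the length of the longest common prefix of kmers[i - 1] and
    kmers[i]; lcp[0] is set to 0 by convention (an assumption of this sketch).
    Each entry costs at most k character comparisons, so the total is O(nk).
    """
    lcp = [0] * len(kmers)
    for i in range(1, len(kmers)):
        a, b = kmers[i - 1], kmers[i]
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcp[i] = length
    return lcp


spectrum = k_spectrum("ACGTACGA", 3)
print(spectrum)             # ['ACG', 'CGA', 'CGT', 'GTA', 'TAC']
print(lcp_array(spectrum))  # [0, 0, 2, 0, 0]
```

In the example, only CGA and CGT share a non-empty prefix (CG), which gives the single entry 2.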
Supported in part by the Academy of Finland via grants 339070 and 351150.
Notes
1. We remark here that the LCS array of a colexicographically ordered spectrum is equivalent to the longest common prefix (LCP) array of the lexicographically ordered spectrum, and the algorithms we describe in this paper to compute the LCS array are trivially adapted to compute the LCP array.
2. Wheeler graphs are a class of graphs, including de Bruijn graphs, that admit a generalization of the Burrows-Wheeler transform. The SBWT can be seen as a special case of the Wheeler graph indexing framework.
3. A similar but different structure is described in [2].
4. This assumes that the input to the BWT is terminated with a $-symbol, and that there is an added $-edge from the last k-mer of the input to the root of the SBWT graph.
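The equivalence stated in note 1 can be checked on a small example. The sketch below reads the note as follows, which is an interpretation rather than a statement from the paper: sorting k-mers colexicographically is the same as sorting their reversals lexicographically, so the longest-common-suffix (LCS) array of the colex order equals the LCP array of the reversed k-mers in lex order. All function names and the example set are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return length


def lcs_array_colex(kmers: set[str]) -> list[int]:
    """LCS array of the k-mers in colexicographic order (first entry 0)."""
    # Colex order compares strings from their last character backwards,
    # which is the same as sorting by the reversed string.
    ordered = sorted(kmers, key=lambda s: s[::-1])
    # A common suffix of two strings is a common prefix of their reversals.
    return [0] + [
        common_prefix_len(ordered[i - 1][::-1], ordered[i][::-1])
        for i in range(1, len(ordered))
    ]


def lcp_array_lex(kmers: set[str]) -> list[int]:
    """LCP array of the k-mers in lexicographic order (first entry 0)."""
    ordered = sorted(kmers)
    return [0] + [
        common_prefix_len(ordered[i - 1], ordered[i])
        for i in range(1, len(ordered))
    ]


kmers = {"ACG", "CGA", "CGT", "GTA", "TAC"}
reversed_kmers = {s[::-1] for s in kmers}
print(lcs_array_colex(kmers))         # [0, 1, 0, 0, 0]
print(lcp_array_lex(reversed_kmers))  # [0, 1, 0, 0, 0]
```

The two arrays coincide because both are computed over the same sequence of strings, read in opposite directions.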
References
Alanko, J.N., Biagi, E., Puglisi, S.J., Vuohtoniemi, J.: Subset wavelet trees. In: Proceedings of the 21st International Symposium on Experimental Algorithms (SEA), LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
Alanko, J.N., Puglisi, S.J., Vuohtoniemi, J.: Small searchable k-spectra via subset rank queries on the Spectral Burrows-Wheeler Transform. In: Proceedings of SIAM Conference on Applied and Computational Discrete Algorithms (ACDA), pp. 225–236. Society for Industrial and Applied Mathematics (2023)
Alanko, J.N., Vuohtoniemi, J., Mäklin, T., Puglisi, S.J.: Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics (2023)
Beller, T., Gog, S., Ohlebusch, E., Schnattinger, T.: Computing the longest common prefix array based on the Burrows-Wheeler transform. J. Discrete Algorithms 18, 22–31 (2013)
Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-order de Bruijn graphs. In: Proceedings of the 25th Data Compression Conference (DCC), pp. 383–392. IEEE (2015)
Compeau, P.E., Pevzner, P.A., Tesler, G.: Why are de Bruijn graphs useful for genome assembly? Nat. Biotechnol. 29(11), 987 (2011)
Conte, A., Cotumaccio, N., Gagie, T., Manzini, G., Prezza, N., Sciortino, M.: Computing matching statistics on Wheeler DFAs. arXiv preprint arXiv:2301.05338 (2023)
Holley, G., Melsted, P.: Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 21(1), 1–20 (2020)
Jeffery, I.B., et al.: Differences in fecal microbiomes and metabolomes of people with vs without irritable bowel syndrome and bile acid malabsorption. Gastroenterology 158(4), 1016–1028 (2020)
Maillet, N., Lemaitre, C., Chikhi, R., Lavenier, D., Peterlongo, P.: Compareads: comparing huge metagenomic experiments. BMC Bioinf. 13(19), 1–10 (2012)
Marchet, C., Boucher, C., Puglisi, S.J., Medvedev, P., Salson, M., Chikhi, R.: Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31(1), 1–12 (2021)
Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17(1), 1–14 (2016)
Salikhov, K.: Efficient algorithms and data structures for indexing DNA sequence data. PhD thesis, Université Paris-Est; Université Lomonossov (Moscou) (2017)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alanko, J.N., Biagi, E., Puglisi, S.J. (2023). Longest Common Prefix Arrays for Succinct k-Spectra. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_1
DOI: https://doi.org/10.1007/978-3-031-43980-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science, Computer Science (R0)