Abstract
Sparse suffix sorting is the problem of sorting \(b=o(n)\) suffixes of a string of length \(n\). Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in \(\mathcal {O}(n\log b)\) time, in the worst case, or in \(\mathcal {O}(n)\) time, when the total number of suffixes with an LCP value greater than \(2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1\) is in \(\mathcal {O}(b/\log b)\), matching the time of optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only \(8b+o(b)\) machine words. We also show that our second algorithm can be trivially amended to work in \(\mathcal {O}(n)\) time for any uniformly random string. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in \(\mathcal {O}(n\log b)\) time [STACS 2014].
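To make the problem statement concrete, the following is a minimal sketch of a naive baseline that sorts the chosen suffixes by direct comparison and then computes the LCP of each adjacent pair. It is not one of the algorithms described in the abstract and has none of their time or space guarantees. The function name `sparse_suffix_and_lcp`, the 0-based positions, and the convention that the first LCP entry is 0 are assumptions made for illustration.

```python
def sparse_suffix_and_lcp(text, positions):
    """Naive baseline (illustrative only): returns (ssa, slcp).

    ssa  : the given starting positions, ordered by the suffix that starts there.
    slcp : slcp[i] is the longest-common-prefix length of the suffixes
           starting at ssa[i-1] and ssa[i]; slcp[0] is set to 0 by convention.
    """
    n = len(text)
    # Sort positions by comparing whole suffixes; this copies suffixes and is
    # far slower and larger than the bounds quoted in the abstract.
    ssa = sorted(positions, key=lambda p: text[p:])
    slcp = [0] * len(ssa)
    for i in range(1, len(ssa)):
        a, b = ssa[i - 1], ssa[i]
        k = 0
        # Extend the match character by character.
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        slcp[i] = k
    return ssa, slcp


if __name__ == "__main__":
    print(sparse_suffix_and_lcp("banana", [0, 2, 4]))
```

For the string `banana` and positions 0, 2, 4, the selected suffixes are `banana`, `nana`, and `na`, so the sorted order of positions is 0, 4, 2.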
SPP and HV are supported by the PANGAIA project (GA 872539). SPP is supported by the ALPACA project (GA 956229). HV is supported by a Constance van Eeden Fellowship.
Notes
1. I et al. [16] claim \(\mathcal {O}(s)\) space but from their construction it is evident that in fact \(s+\mathcal {O}(1)\) machine words are used.
2. We stress that the pseudocode is complete in the sense that it only assumes the implementation of Lemma 1 (Line 11).
3. We assume that \(A[v_i] + k + 2^j - 1 \le n\); otherwise, the suffix ends at position \(n\).
4. This is generally not true when \(j_\text {start}\) was set to a value less than \(\lfloor \log n \rfloor \); in this case, the LCP values are only correct if they are at most \(2^{j_\text {start} + 1} - 1\); see Sect. 4.
5. If this is not the case, we output incorrect arrays deliberately to ensure that our algorithm is Monte Carlo.
6. If \(i=1\), then the group member ids are \(L[1], \ldots , L[C[i]]\).
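The bound \(2^{j_\text {start} + 1} - 1\) in the notes above equals \(\sum_{j=0}^{j_\text{start}} 2^j\), the longest match that can be accumulated from one block of each length \(2^j\). The following is a minimal sketch of that effect, assuming Karp-Rabin-style fingerprints and 0-based positions. It is not the paper's algorithm: for simplicity it stores fingerprints of all prefixes, which takes \(\mathcal{O}(n)\) words, and the names `prefix_fingerprints`, `fingerprint`, and `lcp_by_powers`, as well as the constants `MOD` and `BASE`, are hypothetical.

```python
MOD = (1 << 61) - 1   # assumed prime modulus for the fingerprints
BASE = 911382323      # fixed for reproducibility; normally chosen at random


def prefix_fingerprints(text):
    """pref[i] is the fingerprint of text[:i]; pw[i] is BASE**i mod MOD."""
    pref, pw = [0], [1]
    for ch in text:
        pref.append((pref[-1] * BASE + ord(ch)) % MOD)
        pw.append(pw[-1] * BASE % MOD)
    return pref, pw


def fingerprint(pref, pw, start, length):
    """Fingerprint of text[start:start+length] from the prefix tables."""
    return (pref[start + length] - pref[start] * pw[length]) % MOD


def lcp_by_powers(text, pref, pw, a, b, j_start):
    """LCP of the suffixes at a and b using blocks of length 2^j, j = j_start..0.

    The result is correct (up to fingerprint collisions) only when the true
    LCP is at most 2^(j_start+1) - 1; longer matches are capped at that value.
    """
    n = len(text)
    k = 0
    for j in range(j_start, -1, -1):
        step = 1 << j
        # A block that would run past the end of the text cannot match fully.
        if a + k + step <= n and b + k + step <= n:
            if fingerprint(pref, pw, a + k, step) == fingerprint(pref, pw, b + k, step):
                k += step
    return k


if __name__ == "__main__":
    text = "abababab"
    pref, pw = prefix_fingerprints(text)
    print(lcp_by_powers(text, pref, pw, 0, 2, 2))  # blocks of length 4, 2, 1
    print(lcp_by_powers(text, pref, pw, 0, 2, 1))  # blocks of length 2, 1 only
```

In the usage example the true LCP of the two suffixes is 6. Starting from \(j_\text{start}=2\) the blocks can cover up to 7 characters, so the full value is found, while starting from \(j_\text{start}=1\) the result cannot exceed \(2^{2}-1=3\).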
References
Arbitman, Y., Naor, M., Segev, G.: Backyard cuckoo hashing: Constant worst-case operations with a succinct representation. In: FOCS, pp. 787–796 (2010)
Ayad, L.A.K., Loukides, G., Pissis, S.P.: Text indexing for long patterns: anchors are all you need. Proc. VLDB Endow. 16(9), 2117–2131 (2023)
Ben-Nun, S., Golan, S., Kociumaka, T., Kraus, M.: Time-space tradeoffs for finding a long common substring. In: CPM. LIPIcs, vol. 161, pp. 5:1–5:14 (2020)
Bender, M.A., Conway, A., Farach-Colton, M., Kuszmaul, W., Tagliavini, G.: Iceberg hashing: optimizing many hash-table criteria at once. J. ACM 70(6) (2023)
Bernardini, G., Fici, G., Gawrychowski, P., Pissis, S.P.: Substring complexity in sublinear space. In: ISAAC. LIPIcs, vol. 283, pp. 12:1–12:19 (2023)
Bille, P., Fischer, J., Gørtz, I.L., Kopelowitz, T., Sach, B., Vildhøj, H.W.: Sparse text indexing in small space. ACM Trans. Algorithms 12(3), 39:1–39:19 (2016)
Birenzwige, O., Golan, S., Porat, E.: Locally consistent parsing for text indexing in small space. In: SODA, pp. 607–626 (2020)
Bollobás, B., Letzter, S.: Longest common extension. Eur. J. Comb. 68, 242–248 (2018)
Chan, T.M., Munro, J.I., Raman, V.: Selection and sorting in the “restore” model. ACM Trans. Algorithms 14(2), 11:1–11:18 (2018)
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 8:1–8:39 (2021)
Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial hash functions are reliable. In: Kuich, W. (ed.) ICALP 1992. LNCS, vol. 623, pp. 235–246. Springer, Heidelberg (1992). https://doi.org/10.1007/3-540-55719-9_77
Fischer, J., I, T., Köppl, D.: Deterministic sparse suffix sorting in the restore model. ACM Trans. Algorithms 16(4), 50:1–50:53 (2020)
Franceschini, G., Muthukrishnan, S., Pǎtraşcu, M.: Radix sorting with no extra space. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 194–205. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75520-3_19
Gawrychowski, P., Kociumaka, T.: Sparse suffix tree construction in optimal time and space. In: SODA, pp. 425–439 (2017)
Grabowski, S., Raniszewski, M.: Sampled suffix array with minimizers. Softw. Pract. Exp. 47(11), 1755–1771 (2017)
I, T., Kärkkäinen, J., Kempa, D.: Faster sparse suffix sorting. In: STACS. LIPIcs, vol. 25, pp. 386–396 (2014)
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006)
Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61332-3_155
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A. (ed.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48194-X_17
Katajainen, J., Pasanen, T., Teuhola, J.: Practical in-place mergesort. Nord. J. Comput. 3(1), 27–40 (1996)
Loukides, G., Pissis, S.P.: Bidirectional string anchors: a new string sampling mechanism. In: ESA. LIPIcs, vol. 204, pp. 64:1–64:21 (2021)
Loukides, G., Pissis, S.P., Sweering, M.: Bidirectional string anchors for improved text indexing and top-K similarity search. IEEE Trans. Knowl. Data Eng. 35(11), 11093–11111 (2023)
Navarro, G., Prezza, N.: Universal compressed text indexing. Theor. Comput. Sci. 762, 41–50 (2019)
Paige, R., Tarjan, R.E.: Three partition refinement algorithms. SIAM J. Comput. 16(6), 973–989 (1987)
Prezza, N.: Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms 17(1), 7:1–7:23 (2021)
Salowe, J.S., Steiger, W.L.: Simplified stable merging tasks. J. Algorithms 8(4), 557–571 (1987)
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ayad, L.A.K., Loukides, G., Pissis, S.P., Verbeek, H. (2024). Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast. In: Soto, J.A., Wiese, A. (eds) LATIN 2024: Theoretical Informatics. LATIN 2024. Lecture Notes in Computer Science, vol 14578. Springer, Cham. https://doi.org/10.1007/978-3-031-55598-5_11
DOI: https://doi.org/10.1007/978-3-031-55598-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-55597-8
Online ISBN: 978-3-031-55598-5
eBook Packages: Computer Science (R0)