Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

  • Conference paper
LATIN 2024: Theoretical Informatics (LATIN 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14578)

Abstract

Sparse suffix sorting is the problem of sorting \(b=o(n)\) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in \(\mathcal {O}(n\log b)\) time, in the worst case, or in \(\mathcal {O}(n)\) time, when the total number of suffixes with an LCP value greater than \(2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1\) is in \(\mathcal {O}(b/\log b)\), matching the time of optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only \(8b+o(b)\) machine words. We also show that our second algorithm can be trivially amended to work in \(\mathcal {O}(n)\) time for any uniformly random string. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in \(\mathcal {O}(n\log b)\) time [STACS 2014].
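To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not either of the algorithms described in the abstract. It sorts the chosen suffixes by direct string comparison and then compares neighbours to obtain the LCP values. The function names, the 0-based positions, the convention that the first LCP entry is 0, and the base-2 logarithm in the threshold are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def lcp_length(text: str, p: int, q: int) -> int:
    """Length of the longest common prefix of text[p:] and text[q:]."""
    k = 0
    while p + k < len(text) and q + k < len(text) and text[p + k] == text[q + k]:
        k += 1
    return k


def naive_sparse_suffix_and_lcp(
    text: str, positions: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Naive baseline (hypothetical helper, not the paper's method).

    Sorts the selected suffixes lexicographically by materialising them,
    then computes the LCP of each suffix with its predecessor in that order.
    """
    ssa = sorted(positions, key=lambda p: text[p:])
    slcp = [
        0 if i == 0 else lcp_length(text, ssa[i - 1], ssa[i])
        for i in range(len(ssa))
    ]
    return ssa, slcp


def lcp_threshold(n: int, b: int) -> int:
    """Evaluate 2^(floor(log(n/b)) + 1) - 1, assuming log base 2 and n >= b >= 1."""
    # floor(log2(n/b)) equals (n // b).bit_length() - 1 when n >= b.
    return (1 << (n // b).bit_length()) - 1


if __name__ == "__main__":
    print(naive_sparse_suffix_and_lcp("banana", [1, 3, 5]))
    print(lcp_threshold(1024, 16))
```

For the text `banana` and positions 1, 3, and 5, the selected suffixes are `anana`, `ana`, and `a`, so the sorted order is 5, 3, 1 with LCP values 0, 1, 3. Materialising suffixes costs far more than the bounds quoted in the abstract; the sketch only fixes what the output arrays contain.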

SPP and HV are supported by the PANGAIA project (GA 872539). SPP is supported by the ALPACA project (GA 956229). HV is supported by a Constance van Eeden Fellowship.

Notes

  1. I et al. [16] claim \(\mathcal {O}(s)\) space but from their construction it is evident that in fact \(s+\mathcal {O}(1)\) machine words are used.

  2. We stress that the pseudocode is complete in the sense that it only assumes the implementation of Lemma 1 (Line 11).

  3. We assume that \(A[v_i] + k + 2^j - 1 \le n\); otherwise, the suffix ends at position n.

  4. This is generally not true when \(j_\text {start}\) was set to a value less than \(\lfloor \log n \rfloor \); in this case, the LCP values are only correct if they are at most \(2^{j_\text {start} + 1} - 1\); see Sect. 4.

  5. If this is not the case, we output incorrect arrays deliberately to ensure that our algorithm is Monte Carlo.

  6. If \(i=1\) then the group member ids are \(L[1], \ldots , L[C[i]]\).
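The bound \(2^{j_\text {start} + 1} - 1\) in Note 4 and the end-of-string assumption in Note 3 can be illustrated with a small sketch. It assumes, as a simplification, that a common prefix is extended by blocks of length \(2^j\) for \(j = j_\text {start}, \ldots , 0\), and it compares substrings directly instead of using fingerprints. The function name `capped_lcp` and the 0-based indexing are hypothetical and not taken from the paper.

```python
def capped_lcp(text: str, p: int, q: int, j_start: int) -> int:
    """Extend the common prefix of text[p:] and text[q:] block by block.

    For j = j_start down to 0, try to extend the current match by a block
    of length 2^j. A block that would run past the end of the text is
    clamped so that the suffix ends at the last position (cf. Note 3).
    The result is at most 2^(j_start + 1) - 1 (cf. Note 4).
    """
    n = len(text)
    k = 0  # length of the common prefix confirmed so far
    for j in range(j_start, -1, -1):
        length = min(1 << j, n - (p + k), n - (q + k))
        if length > 0 and text[p + k : p + k + length] == text[q + k : q + k + length]:
            k += length
    return k


if __name__ == "__main__":
    text = "a" * 20
    print(capped_lcp(text, 0, 1, 2))  # capped by the small j_start
    print(capped_lcp(text, 0, 1, 4))  # large enough j_start: true LCP
```

With `text = "a" * 20`, the suffixes at positions 0 and 1 share 19 characters. Starting from `j_start = 2` the sketch can add at most 4 + 2 + 1 = 7 characters, so it reports 7, while `j_start = 4` recovers the full value 19.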

References

  1. Arbitman, Y., Naor, M., Segev, G.: Backyard cuckoo hashing: Constant worst-case operations with a succinct representation. In: FOCS, pp. 787–796 (2010)

  2. Ayad, L.A.K., Loukides, G., Pissis, S.P.: Text indexing for long patterns: anchors are all you need. Proc. VLDB Endow. 16(9), 2117–2131 (2023)

  3. Ben-Nun, S., Golan, S., Kociumaka, T., Kraus, M.: Time-space tradeoffs for finding a long common substring. In: CPM. LIPIcs, vol. 161, pp. 5:1–5:14 (2020)

  4. Bender, M.A., Conway, A., Farach-Colton, M., Kuszmaul, W., Tagliavini, G.: Iceberg hashing: optimizing many hash-table criteria at once. J. ACM 70(6) (2023)

  5. Bernardini, G., Fici, G., Gawrychowski, P., Pissis, S.P.: Substring complexity in sublinear space. In: ISAAC. LIPIcs, vol. 283, pp. 12:1–12:19 (2023)

  6. Bille, P., Fischer, J., Gørtz, I.L., Kopelowitz, T., Sach, B., Vildhøj, H.W.: Sparse text indexing in small space. ACM Trans. Algorithms 12(3), 39:1–39:19 (2016)

  7. Birenzwige, O., Golan, S., Porat, E.: Locally consistent parsing for text indexing in small space. In: SODA, pp. 607–626 (2020)

  8. Bollobás, B., Letzter, S.: Longest common extension. Eur. J. Comb. 68, 242–248 (2018)

  9. Chan, T.M., Munro, J.I., Raman, V.: Selection and sorting in the “restore” model. ACM Trans. Algorithms 14(2), 11:1–11:18 (2018)

  10. Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 8:1–8:39 (2021)

  11. Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial hash functions are reliable. In: Kuich, W. (ed.) ICALP 1992. LNCS, vol. 623, pp. 235–246. Springer, Heidelberg (1992). https://doi.org/10.1007/3-540-55719-9_77

  12. Fischer, J., I, T., Köppl, D.: Deterministic sparse suffix sorting in the restore model. ACM Trans. Algorithms 16(4), 50:1–50:53 (2020)

  13. Franceschini, G., Muthukrishnan, S., Pǎtraşcu, M.: Radix sorting with no extra space. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 194–205. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75520-3_19

  14. Gawrychowski, P., Kociumaka, T.: Sparse suffix tree construction in optimal time and space. In: SODA, pp. 425–439 (2017)

  15. Grabowski, S., Raniszewski, M.: Sampled suffix array with minimizers. Softw. Pract. Exp. 47(11), 1755–1771 (2017)

  16. I, T., Kärkkäinen, J., Kempa, D.: Faster sparse suffix sorting. In: STACS. LIPIcs, vol. 25, pp. 386–396 (2014)

  17. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006)

  18. Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61332-3_155

  19. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)

  20. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A. (ed.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48194-X_17

  21. Katajainen, J., Pasanen, T., Teuhola, J.: Practical in-place mergesort. Nord. J. Comput. 3(1), 27–40 (1996)

  22. Loukides, G., Pissis, S.P.: Bidirectional string anchors: a new string sampling mechanism. In: ESA. LIPIcs, vol. 204, pp. 64:1–64:21 (2021)

  23. Loukides, G., Pissis, S.P., Sweering, M.: Bidirectional string anchors for improved text indexing and top-K similarity search. IEEE Trans. Knowl. Data Eng. 35(11), 11093–11111 (2023)

  24. Navarro, G., Prezza, N.: Universal compressed text indexing. Theor. Comput. Sci. 762, 41–50 (2019)

  25. Paige, R., Tarjan, R.E.: Three partition refinement algorithms. SIAM J. Comput. 16(6), 973–989 (1987)

  26. Prezza, N.: Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms 17(1), 7:1–7:23 (2021)

  27. Salowe, J.S., Steiger, W.L.: Simplified stable merging tasks. J. Algorithms 8(4), 557–571 (1987)

Corresponding author

Correspondence to Solon P. Pissis.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Ayad, L.A.K., Loukides, G., Pissis, S.P., Verbeek, H. (2024). Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast. In: Soto, J.A., Wiese, A. (eds) LATIN 2024: Theoretical Informatics. LATIN 2024. Lecture Notes in Computer Science, vol 14578. Springer, Cham. https://doi.org/10.1007/978-3-031-55598-5_11

  • DOI: https://doi.org/10.1007/978-3-031-55598-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-55597-8

  • Online ISBN: 978-3-031-55598-5

  • eBook Packages: Computer Science, Computer Science (R0)
