An LMS-Based Grammar Self-index with Local Consistency Properties

  • Conference paper
  • First Online:
String Processing and Information Retrieval (SPIRE 2021)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12944))

Included in the following conference series:

  • 405 Accesses

Abstract

A grammar self-index of a text T (Claude et al. 2012) consists of a grammar \({\mathcal {G}}\) that only produces T and a geometric data structure that indexes the string cuts of the right-hand sides of \({\mathcal {G}}\)’s rules. This representation uses space proportional to G, the size of the grammar, which is small when the text is repetitive. However, the index is slow for matching long patterns; it finds the occ occurrences of a pattern P[1..m] in \(O((m^{2}+occ)\log G)\) time. The most expensive part is a set of binary searches for the different cuts \(P[1..j]P[j+1..m]\) in the geometric data structure. Christiansen et al. 2010 solved this problem by building a locally consistent grammar that only searches for \(O(\log m)\) cuts of P. Their representation, however, requires significant extra space (tough still in O(G)) to store a set of permutations for the nonterminal symbols. In this work, we propose another locally consistent grammar that builds on the idea of LMS substrings (Nong et al. 2009). Our grammar also requires to try \(O(\log m)\) cuts when searching for P, but it does not need to store permutations. As a result, we obtain a self-index that searches in time \(O((m\log m+occ) \log G)\) and is of practical size. Our experiments showed that our index is faster than previous grammar-based indexes at the price of increasing the space by a 1.8x factor on average. Other experimental results showed that our data structure becomes convenient when the patterns to search for are long.

Funded in part by Basal Funds FB0001, Fondecyt Grant 1-200038, Ph.D. Scholarship 21171332, ANID, Chile.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
EUR 29.95
Price includes VAT (Germany)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
EUR 53.49
Price includes VAT (Germany)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
EUR 69.54
Price includes VAT (Germany)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free ship** worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    https://github.com/nicolaprezza/r-index.

  2. 2.

    https://github.com/migumar2/uiHRDC/tree/master/uiHRDC/self-indexes/LZ.

  3. 3.

    https://github.com/apachecom/grammar_improved_index.

  4. 4.

    http://pizzachili.dcc.uchile.cl.

References

  1. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)

    Article  MathSciNet  Google Scholar 

  2. Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun, T.: Prefix-free parsing for building big BWTs. Algorithms Mole. Biol. 14(1), Article 13 (2019)

    Google Scholar 

  3. Chan, T., Larsen, K.G., Pătraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceedings of the 27th Annual Symposium on Computational Geometry (SoCG), pp. 1–10 (2011)

    Google Scholar 

  4. Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)

    Article  MathSciNet  Google Scholar 

  5. Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-Time Dictionary-compressed indexes. ACM Trans. Algorithms 17(1), Article 8 (2020)

    Google Scholar 

  6. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fund. Inform. 111(3), 313–337 (2011)

    MathSciNet  MATH  Google Scholar 

  7. Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Proceedings of the 19th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 180–192 (2012)

    Google Scholar 

  8. Claude, F., Navarro, G., Pacheco, A.: Grammar-compressed indexes with logarithmic search time. J. Comput. Syst. Sci. 118, 53–74 (2021)

    Article  MathSciNet  Google Scholar 

  9. Díaz-Domínguez, D., Navarro, G.: A grammar compressor for collections of reads with applications to the construction of the BWT. In: Proceedings of the 31st Data Compression Conference (DCC) (2021)

    Google Scholar 

  10. Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 1–54 (2020)

    Article  MathSciNet  Google Scholar 

  11. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28

    Chapter  Google Scholar 

  12. Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 827–840 (2018)

    Google Scholar 

  13. Kieffer, J.C., Yang, E.H.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)

    Article  MathSciNet  Google Scholar 

  14. Kreft, S., Navarro, G.: LZ77-Like compression with fast random access. In: Proceedings of the 10th Data Compression Conference (DCC), pp. 239–248 (2010)

    Google Scholar 

  15. Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)

    Article  MathSciNet  Google Scholar 

  16. Larsson, J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000)

    Article  Google Scholar 

  17. Louza, F., Gog, S., Telles, G.P.: Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678(1), 22–39 (2017)

    Article  MathSciNet  Google Scholar 

  18. Mehlhorn, K., Sundar, R., Uhrig, C.: Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica 17(2), 183–198 (1997)

    Article  MathSciNet  Google Scholar 

  19. Navarro, G.: Indexing highly repetitive string collections, Part II : compressed indexes. ACM Comput. Surv. 54(2), Article 26 (2021)

    Google Scholar 

  20. Nong, G.: Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 1–15 (2013)

    Article  MathSciNet  Google Scholar 

  21. Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: Proceedings of the 19th Data Compression Conference (DCC), pp. 193–202 (2009)

    Google Scholar 

  22. Nunes, D.S.N., Louza, F.A., Gog, S., Ayala-Rincón, M., Navarro, G.: A grammar compression algorithm based on induced suffix sorting. In: Proceedings of the 28th Data Compression Conference (DCC), pp. 42–51 (2018)

    Google Scholar 

  23. Okanohara, D., Sadakane, K.: A linear-time Burrows-Wheeler transform using induced sorting. In: Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 90–101 (2009)

    Google Scholar 

  24. Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), Article 43 (2007)

    Google Scholar 

  25. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003)

    Article  MathSciNet  Google Scholar 

  26. Sahinalp, C., Vishkin, U.: Data compression using locally consistent parsing. Technical report, UMIACS Technical report (1995)

    Google Scholar 

  27. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Diego Díaz-Domínguez .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Díaz-Domínguez, D., Navarro, G., Pacheco, A. (2021). An LMS-Based Grammar Self-index with Local Consistency Properties. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science(), vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86692-1_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86691-4

  • Online ISBN: 978-3-030-86692-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Navigation