Abstract
Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors’ performance to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth order compressor. The optimal partitioning has the property that the achieved compression is boosted to k-th order entropy, for any k. The technique has an application to text indexing: Essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a k-th order compressed full-text self-index providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that using explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete Burrows-Wheeler transformed text is, in essence, the sum of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear time algorithm, but fixed to a specific zeroth order compressor (Raman et al., SODA 2002). In addition to having these consequences on compression and static full-text self-indexes, the analysis shows that a recent dynamic zeroth order compressed self-index (Mäkinen & Navarro, CPM 2006) occupies in fact space proportional to k-th order entropy.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words, NATO ISI Series, pp. 85–96. Springer, Heidelberg (1985)
Arroyuelo, D., Navarro, G.: Space-efficient construction of LZ-index. In: Deng, X., Du, D.-Z. (eds.) ISAAC 2005. LNCS, vol. 3827, pp. 1143–1152. Springer, Heidelberg (2005)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report Technical Report 124, Digital Equipment Corporation (1994)
Chan, H.-L., Hon, W.-K., Lam, T.-W.: Compressed index for a dynamic collection of texts. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 445–456. Springer, Heidelberg (2004)
Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 560–571. Springer, Heidelberg (2006)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. FOCS 2000, pp. 390–398 (2000)
Ferragina, P., Manzini, G.: Compression boosting in optimal linear time using the Burrows-Wheeler transform. In: Proc. SODA 2004, pp. 655–663 (2004)
Ferragina, P., Manzini, G.: Indexing compressed texts. J. of the ACM 52(4), 552–581 (2005)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM TALG, 3(2): article 20 (2007)
Gagie, T.: Large alphabets and incompressibility. IPL 99(6), 246–251 (2006)
Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. SODA 2003, pp. 841–850 (2003)
Grossi, R., Gupta, A., Vitter, J.: When indexing equals compression: Experiments with compressing suffix arrays and applications. In: Proc. SODA 2004, pp. 636–645 (2004)
Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indexes. In: Proc. FOCS 2003, pp. 251–260 (2003)
Jacobson, G.: Space-efficient static trees and graphs. In: Proc. FOCS 1989, pp. 549–554 (1989)
Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. In: Proc. DIMACS Working Group on the Burrows-Wheeler Transform (2004)
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic J. of Computing 12(1), 40–66 (2005)
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 307–318. Springer, Heidelberg (2006)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. on Computing, 935–948 (1993)
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. of the ACM 48(3), 407–430 (2001)
Na, J.C.: Linear-time construction of compressed suffix arrays using o(n logn)-bit working space for large alphabets. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 57–67. Springer, Heidelberg (2005)
Pagh, R.: Low redundancy in dictionaries with O(1) worst case lookup time. In: Wiedermann, J., van Emde Boas, P., Nielsen, M. (eds.) ICALP 1999. LNCS, vol. 1644, pp. 595–604. Springer, Heidelberg (1999)
Raman, R., Raman, V., Srinivasa, S.: Succinct dynamic data structures. In: Dehne, F., Sack, J.-R., Tamassia, R. (eds.) WADS 2001. LNCS, vol. 2125, pp. 426–437. Springer, Heidelberg (2001)
Raman, R., Raman, V., Srinivasa Roa, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proc. SODA 2002, pp. 233–242 (2002)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mäkinen, V., Navarro, G. (2007). Implicit Compression Boosting with Applications to Self-indexing. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_21
Download citation
DOI: https://doi.org/10.1007/978-3-540-75530-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75529-6
Online ISBN: 978-3-540-75530-2
eBook Packages: Computer ScienceComputer Science (R0)