Abstract
We propose a scheme for automatically generating compressors for XML documents from Document Type Definition (DTD) specifications. Our algorithm is lossless and adaptive: the model used for compression and decompression is generated automatically from the DTD and is used in conjunction with an arithmetic compressor to produce a compressed version of the document. The structure of the model mirrors the syntactic specification of the document. Our compression scheme is on-line, that is, it can compress the document as it is being read. We have implemented the compressor generator and report the results of experiments on some large XML databases whose DTDs are specified. The average compression is better than that of XMLPPM, the only other on-line tool we are aware of. The tool is also able to compress massive documents on which XMLPPM failed because it ran out of memory. We believe the main appeal of this technique is that the underlying model is so simple and yet so effective.
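To make the idea of a DTD-derived adaptive model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical content model `(title, author+, (year | note)?)` for a `book` element, translated by hand into a small automaton (`DFA`), and it keeps adaptive frequency counts per automaton state (`StateModel`, also a hypothetical name). Instead of running a real arithmetic coder, it sums the ideal code length, -log2(p), that such a coder would approach.

```python
import math

# Hypothetical content model for <book>: (title, author+, (year | note)?)
# Hand-built automaton: state -> {symbol: next_state}.
# "END" stands for the closing tag of the element.
DFA = {
    "start":   {"title": "titled"},
    "titled":  {"author": "authors"},
    "authors": {"author": "authors", "year": "done",
                "note": "done", "END": "accept"},
    "done":    {"END": "accept"},
}


class StateModel:
    """Adaptive frequency counts over the transitions leaving one state."""

    def __init__(self, symbols):
        # Every legal transition starts with count 1 (uniform prior).
        self.counts = {s: 1 for s in symbols}

    def cost_and_update(self, symbol):
        """Return the ideal cost in bits of `symbol`, then adapt the counts."""
        total = sum(self.counts.values())
        bits = -math.log2(self.counts[symbol] / total)
        self.counts[symbol] += 1
        return bits


def ideal_code_length(documents):
    """Total ideal code length (bits) for a list of child-tag sequences.

    The per-state models are shared across documents, so later documents
    benefit from statistics gathered on earlier ones.
    """
    models = {state: StateModel(edges) for state, edges in DFA.items()}
    total_bits = 0.0
    for children in documents:
        state = "start"
        for symbol in list(children) + ["END"]:
            if symbol not in DFA[state]:
                raise ValueError(f"{symbol!r} not allowed in state {state!r}")
            total_bits += models[state].cost_and_update(symbol)
            state = DFA[state][symbol]
    return total_bits


if __name__ == "__main__":
    doc = ["title", "author", "year"]
    print(ideal_code_length([doc]))        # one document
    print(ideal_code_length([doc, doc]))   # second copy costs fewer bits
```

In this sketch, states with a single legal transition (for example `start`, where only `title` may follow) cost zero bits, which illustrates why a model that mirrors the syntactic specification can be compact: only genuine choices in the grammar need to be encoded. The model also processes tags strictly in document order, consistent with an on-line scheme.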
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Subramanian, H., Shankar, P. (2006). Compressing XML Documents Using Recursive Finite State Automata. In: Farré, J., Litovsky, I., Schmitz, S. (eds) Implementation and Application of Automata. CIAA 2005. Lecture Notes in Computer Science, vol 3845. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11605157_24
DOI: https://doi.org/10.1007/11605157_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31023-5
Online ISBN: 978-3-540-33097-4
eBook Packages: Computer Science (R0)