Clustering XML Documents Using Structural Summaries

  • Conference paper
Current Trends in Database Technology - EDBT 2004 Workshops (EDBT 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3268))

Abstract

This work presents a methodology for grouping structurally similar XML documents using clustering algorithms. Modeling XML documents as tree-like structures, we treat the ‘clustering XML documents by structure’ problem as a ‘tree clustering’ problem, exploiting distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We suggest using tree structural summaries to improve the performance of the distance calculation while maintaining or even improving its quality. Experimental results are provided using a prototype testbed.
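The pipeline outlined in the abstract (model documents as trees, summarize their structure, measure distances, cluster) can be illustrated with a minimal Python sketch. It is not the authors' method. The abstract does not define the summaries or the distance, so the sketch assumes a crude summary (the set of distinct root-to-node label paths, which collapses repeated siblings) and swaps in a Jaccard distance over those paths in place of a tree edit distance. The function names `summary_paths`, `path_distance`, and `cluster`, and the `threshold` parameter, are hypothetical.

```python
import xml.etree.ElementTree as ET


def summary_paths(xml_text):
    """Return the set of distinct root-to-node label paths of a document.

    Assumption: this path set is only a stand-in for a structural summary.
    Repeated siblings with the same label path collapse into one entry.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def path_distance(a, b):
    """Jaccard distance between two path sets (0.0 means identical sets).

    Assumption: used here instead of a tree edit distance to keep the
    sketch short.
    """
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold):
    """Single-link grouping: documents closer than `threshold` are merged."""
    summaries = [summary_paths(d) for d in docs]
    labels = list(range(len(docs)))  # each document starts in its own cluster
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if path_distance(summaries[i], summaries[j]) < threshold:
                # Merge the cluster containing j into the cluster containing i.
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels
```

Because the distance is computed on summaries rather than on the full trees, two documents that differ only in how often an element repeats end up at distance zero in this sketch. That loosely mirrors the motivation stated in the abstract, under the assumptions above.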

Work supported in part by DELOS Network of Excellence on Digital Libraries, IST programme of the EC FP6, no G038-507618, and by PYTHAGORAS EPEAEK II programme, EU and Greek Ministry of Education.



Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dalamagas, T., Cheng, T., Winkel, KJ., Sellis, T. (2004). Clustering XML Documents Using Structural Summaries. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_54

  • DOI: https://doi.org/10.1007/978-3-540-30192-9_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23305-3

  • Online ISBN: 978-3-540-30192-9

  • eBook Packages: Computer Science, Computer Science (R0)
