X-HYBRIDJOIN for Near-Real-Time Data Warehousing

  • Conference paper
Advances in Databases (BNCOD 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7051)

Abstract

In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm Mesh Join (MESHJOIN) has been proposed to amortize disk access over a fast stream. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, e.g., certain products are sold more frequently than the remainder of the products. The question arises of how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be exploited by non-adaptive approaches such as MESHJOIN.
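
The intuition behind keeping part of the master data in memory can be illustrated with a toy simulation. The sketch below is neither MESHJOIN nor X-HYBRIDJOIN; it only counts how many simulated disk lookups a naive stream-to-master join needs with and without a pinned in-memory copy of the most frequently hit keys. All names (make_master, make_skewed_stream, join_stream), the Zipf-like weights 1/(k+1), and the sizes used are assumptions made for illustration and do not come from the paper.

```python
import random


def make_master(num_keys):
    # Hypothetical master data: product key -> product name.
    return {k: f"product-{k}" for k in range(num_keys)}


def make_skewed_stream(num_keys, length, seed=0):
    # Assumed Zipf-like skew: key k is drawn with weight 1/(k+1),
    # so a small set of low-numbered keys dominates the stream.
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) for k in range(num_keys)]
    return rng.choices(range(num_keys), weights=weights, k=length)


def join_stream(stream, master, pinned_keys=()):
    """Join each stream key with its master row and count disk lookups.

    Keys in pinned_keys are served from an in-memory copy; every other
    key costs one simulated disk lookup.
    """
    in_memory = {k: master[k] for k in pinned_keys}
    disk_lookups = 0
    output = []
    for key in stream:
        if key in in_memory:
            row = in_memory[key]
        else:
            row = master[key]  # stands in for a disk read
            disk_lookups += 1
        output.append((key, row))
    return output, disk_lookups


master = make_master(1000)
stream = make_skewed_stream(1000, 5000)

# Same join result, different number of simulated disk lookups.
_, lookups_plain = join_stream(stream, master)
_, lookups_pinned = join_stream(stream, master, pinned_keys=range(50))
print("disk lookups without pinning:", lookups_plain)
print("disk lookups with 50 pinned keys:", lookups_pinned)
```

Under a skewed stream, pinning a small fraction of the keys removes a disproportionately large share of the lookups; under a uniform stream the saving would only be proportional to the pinned fraction. The actual algorithms amortize disk access very differently, so the numbers printed here say nothing about the paper's measured results.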

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Naeem, M.A., Dobbie, G., Weber, G. (2011). X-HYBRIDJOIN for Near-Real-Time Data Warehousing. In: Fernandes, A.A.A., Gray, A.J.G., Belhajjame, K. (eds) Advances in Databases. BNCOD 2011. Lecture Notes in Computer Science, vol 7051. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24577-0_5

  • DOI: https://doi.org/10.1007/978-3-642-24577-0_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24576-3

  • Online ISBN: 978-3-642-24577-0

  • eBook Packages: Computer Science, Computer Science (R0)
