Abstract
To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm Mesh Join (MESHJOIN) has been proposed to amortize disk access over a fast stream. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, e.g., certain products are sold more frequently than the remainder of the products. The question arises of how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against the performance of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be exploited by non-adaptive approaches such as MESHJOIN.
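The intuition behind keeping part of the master data in memory can be illustrated with a small simulation. The sketch below is not X-HYBRIDJOIN or MESHJOIN; it is a deliberately simplified model, under the assumption that stream keys follow a Zipf-like distribution, that estimates what share of stream tuples would still need disk-based master data when the most frequent keys are held in memory. The function names, the key count, and the size of the memory-resident set are all hypothetical choices made for illustration.

```python
from collections import Counter
import random


def zipf_stream(num_keys, length, exponent=1.0, seed=0):
    """Draw `length` join keys from a Zipf-like distribution.

    Key 0 is the most frequent, key num_keys - 1 the least frequent.
    The distribution and its parameters are assumptions for illustration.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=length)


def disk_fraction(stream, resident_keys):
    """Fraction of stream tuples whose master record is NOT memory-resident."""
    misses = sum(1 for key in stream if key not in resident_keys)
    return misses / len(stream)


if __name__ == "__main__":
    stream = zipf_stream(num_keys=10_000, length=100_000)

    # Hypothetical memory budget: keep the 500 most frequent keys resident.
    hot_keys = {key for key, _ in Counter(stream).most_common(500)}

    print("no resident master data:", disk_fraction(stream, set()))
    print("hot keys resident:      ", disk_fraction(stream, hot_keys))
```

Under this model, a memory-resident set covering only 5% of the keys serves well over half of the stream tuples, which is the kind of skew a non-adaptive join cannot take advantage of. The actual gains reported in the paper depend on the authors' algorithm and experimental setup, which this sketch does not reproduce.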
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Naeem, M.A., Dobbie, G., Weber, G. (2011). X-HYBRIDJOIN for Near-Real-Time Data Warehousing. In: Fernandes, A.A.A., Gray, A.J.G., Belhajjame, K. (eds) Advances in Databases. BNCOD 2011. Lecture Notes in Computer Science, vol 7051. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24577-0_5
DOI: https://doi.org/10.1007/978-3-642-24577-0_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24576-3
Online ISBN: 978-3-642-24577-0
eBook Packages: Computer Science (R0)