Abstract
The MapReduce (MR) framework has become a standard tool for running large batch computations, usually aggregative in nature, in parallel over a cluster of commodity machines. A significant share of typical MR jobs consists of standard database-style queries, for which specifying map and reduce functions from scratch is cumbersome. To ease this burden, higher-level languages such as HiveQL, PigLatin, and JAQL have been proposed, which generate MR jobs automatically from declarative queries. We identify two major problems with these existing solutions: (i) they introduce new query languages and implement systems from scratch for the sole purpose of expressing MR jobs; and (ii) although they address some of the major limitations of SQL, they still lack the flexibility required by big data applications. We propose BrackitMR, an approach based on the XQuery language with extended JSON support. XQuery is not only an established query language, but also offers a more expressive data model and more powerful language constructs, which enable a much greater degree of flexibility. From a system design perspective, we extend an existing single-node query processor, Brackit, by adding MR as a distributed coordination layer. Such heavy reuse of the standard query processor not only benefits performance, but also allows for a more elegant design that transparently integrates MR processing into a generic query engine.
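To illustrate why hand-writing map and reduce functions for database-style queries is cumbersome, the following is a minimal sketch of what a simple grouped sum (roughly `SELECT dept, SUM(amount) ... GROUP BY dept`) looks like when spelled out manually. It is not taken from BrackitMR or Hadoop: the record fields, the function names `map_fn`, `reduce_fn`, and `run_job`, and the in-memory shuffle are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical input records; the field names are illustrative only.
RECORDS = [
    {"dept": "sales", "amount": 10},
    {"dept": "ops", "amount": 5},
    {"dept": "sales", "amount": 7},
]


def map_fn(record: dict) -> Iterator[tuple[str, int]]:
    """Emit one (grouping key, value) pair per input record."""
    yield record["dept"], record["amount"]


def reduce_fn(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Aggregate all values that share the same grouping key."""
    return key, sum(values)


def run_job(records: Iterable[dict]) -> dict[str, int]:
    """Single-process stand-in for the shuffle phase of an MR framework."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # group mapper output by key
    return dict(reduce_fn(key, values) for key, values in groups.items())


result = run_job(RECORDS)
```

Even for this one-line query, the grouping key, the aggregate, and the data flow between the two phases must be encoded by hand, which is the burden that declarative front ends aim to remove.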
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sauer, C., Bächle, S., Härder, T. (2013). Versatile XQuery Processing in MapReduce. In: Catania, B., Guerrini, G., Pokorný, J. (eds) Advances in Databases and Information Systems. ADBIS 2013. Lecture Notes in Computer Science, vol 8133. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40683-6_16
DOI: https://doi.org/10.1007/978-3-642-40683-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40682-9
Online ISBN: 978-3-642-40683-6
eBook Packages: Computer Science (R0)