Abstract
The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, extensive operator coverage, and the popularity of the R and Python languages have heavily influenced this transformation. However, the most widely used serial Dataframe systems today (R, pandas) hit performance limits even on moderately large data sets. We believe there is ample room for improvement by investigating the generic distributed patterns underlying dataframe operators.
In this paper, we propose a framework that lays the foundation for building high-performance, distributed-memory parallel dataframe systems based on these parallel processing patterns, and we present Cylon as a reference runtime implementation. We demonstrate how this framework has enabled Cylon to achieve scalable high performance, and we underline the flexibility of the proposed API and the extensibility of the framework across different hardware. To the best of our knowledge, Cylon is the first and only distributed-memory parallel dataframe system available today.
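The abstract's "generic distributed patterns of dataframe operators" can be illustrated with the classic shuffle-then-local-compute pattern used for distributed joins. The sketch below is not Cylon's actual code; it simulates the pattern serially in pure Python (the worker count, table layout, and function names are illustrative), where a real runtime would replace the hash partitioning step with an MPI all-to-all shuffle across processes.

```python
# A minimal serial simulation of the distributed hash-join pattern:
# 1) hash-partition both tables on the join key (the "shuffle"),
# 2) run an ordinary local hash join independently on each partition,
# 3) union the partial results.

def hash_partition(rows, key, n_workers):
    """Assign each row to a worker by hashing its join key, so that
    equal keys always land on the same worker."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_join(left_rows, right_rows, key):
    """Plain in-memory hash join, as each worker would run locally."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]

def distributed_join(left, right, key, n_workers=4):
    """Shuffle both tables on the join key, then join each partition
    locally. In a real distributed runtime the shuffle is a network
    all-to-all and each partition lives on a separate process."""
    lp = hash_partition(left, key, n_workers)
    rp = hash_partition(right, key, n_workers)
    return [row
            for w in range(n_workers)
            for row in local_join(lp[w], rp[w], key)]

left = [{"id": 1, "a": 10}, {"id": 2, "a": 20}]
right = [{"id": 2, "b": "x"}, {"id": 3, "b": "y"}]
print(distributed_join(left, right, "id"))  # → [{'id': 2, 'a': 20, 'b': 'x'}]
```

The same skeleton covers other key-based operators (group-by aggregation, distinct, set operations): only the local step changes, which is precisely why a small set of distributed communication patterns suffices to parallelize a large dataframe operator surface.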
References
MPI: A Message-Passing Interface Standard Version 3.0 (2012). http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. Technical Report
Abeykoon, V., et al.: Streaming machine learning algorithms with big data systems. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 5661–5666. IEEE (2019)
Abeykoon, V., et al.: Hptmt parallel operators for high performance data science & data engineering. ar**v preprint ar**v:2108.06001 (2021)
Abeykoon, V., et al.: Data engineering for HPC with python. In: 2020 IEEE/ACM 9th Workshop on Python for High-Performance and Scientific Computing (PyHPC), pp. 13–21. IEEE (2020)
Babuji, Y.N., et al.: Parsl: scalable parallel scripting in python. In: IWSG (2018)
CylonData: cylon (2021). https://github.com/cylondata/cylon
CylonData: cylon experiments (2021). https://github.com/cylondata/cylon_experiments
Fox, G., et al.: Solving problems on concurrent processors, vol. 1: general techniques and regular problems. Comput. Phys. 3(1), 83–84 (1989)
Gao, H., Sakharnykh, N.: Scaling joins to a thousand GPUs. In: 12th International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@ VLDB (2021)
Kamburugamuve, S., Wickramasinghe, P., Ekanayake, S., Fox, G.C.: Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink. Int. J. High Perform. Comput. Appl. 32(1), 61–73 (2018)
Kamburugamuve, S., et al.: Hptmt: operator-based architecture for scalable high-performance data-intensive frameworks. In: 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), pp. 228–239. IEEE (2021)
Li, X., Lu, P., Schaeffer, J., Shillington, J., Wong, P.S., Shi, H.: On the versatility of parallel sorting by regular sampling. Parallel Comput. 19(10), 1079–1103 (1993)
Mattson, T., Sanders, B., Massingill, B.: Patterns for parallel programming (2004)
McKinney, W., et al.: pandas: a foundational python library for data analysis and statistics. Python High Perform. Sci. Comput. 14(9), 1–9 (2011)
Modin: modin scalability issues (2021). https://github.com/modin-project/modin/issues
Moritz, P., et al.: Ray: a distributed framework for emerging \(\{\)AI\(\}\) applications. In: 13th \(\{\)USENIX\(\}\) Symposium on Operating Systems Design and Implementation (\(\{\)OSDI\(\}\) 18), pp. 561–577 (2018)
Perera, N., et al.: A fast, scalable, universal approach for distributed data reductions. In: International Workshop on Big Data Reduction, IEEE Big Data (2020)
Petersohn, D., et al.: Towards scalable dataframe systems. ar**v preprint ar**v:2001.00888 (2020)
Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Proceedings of the 14th Python in Science Conference, 130–136. Citeseer (2015)
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Wickramasinghe, P., et al.: Twister2: tset high-performance iterative dataflow. In: 2019 International Conference on High Performance Big Data and Intelligent Systems (HPBD &IS), pp. 55–60. IEEE (2019)
Widanage, C., et al.: High performance data engineering everywhere. In: 2020 IEEE International Conference on Smart Data Services (SMDS), pp. 122–132. IEEE (2020)
Zaharia, M., et al.: apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
Zheng, Y., Kamil, A., Driscoll, M.B., Shan, H., Yelick, K.: UPC++: a PGAS extension for c++. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1105–1114. IEEE (2014)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Perera, N. et al. (2023). High Performance Dataframes from Parallel Processing Patterns. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13826. Springer, Cham. https://doi.org/10.1007/978-3-031-30442-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30441-5
Online ISBN: 978-3-031-30442-2