Abstract
The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, extensive operator coverage, and the popularity of the R and Python languages have heavily influenced this transformation. However, the most widely used serial Dataframe systems today (R, pandas) hit performance limits even on moderately large data sets. We believe there is ample room for improvement by investigating the generic distributed patterns underlying dataframe operators.
In this paper, we propose a framework that lays the foundation for building high-performance, distributed-memory parallel dataframe systems based on these parallel processing patterns, and we present Cylon as a reference runtime implementation. We demonstrate how this framework has enabled Cylon to achieve scalable high performance, and we underline the flexibility of the proposed API and the extensibility of the framework across different hardware. To the best of our knowledge, Cylon is the first and only distributed-memory parallel dataframe system available today.
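The abstract's "generic distributed patterns of dataframe operators" can be illustrated with the classic shuffle-then-local-compute pattern used for distributed joins. The sketch below is not Cylon's actual code; it simulates the pattern serially in pure Python (the worker count, table layout, and function names are illustrative), where a real runtime would replace the hash partitioning step with an MPI all-to-all shuffle across processes.

```python
# A minimal serial simulation of the distributed hash-join pattern:
# 1) hash-partition both tables on the join key (the "shuffle"),
# 2) run an ordinary local hash join independently on each partition,
# 3) union the partial results.

def hash_partition(rows, key, n_workers):
    """Assign each row to a worker by hashing its join key, so that
    equal keys always land on the same worker."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_join(left_rows, right_rows, key):
    """Plain in-memory hash join, as each worker would run locally."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]

def distributed_join(left, right, key, n_workers=4):
    """Shuffle both tables on the join key, then join each partition
    locally. In a real distributed runtime the shuffle is a network
    all-to-all and each partition lives on a separate process."""
    lp = hash_partition(left, key, n_workers)
    rp = hash_partition(right, key, n_workers)
    return [row
            for w in range(n_workers)
            for row in local_join(lp[w], rp[w], key)]

left = [{"id": 1, "a": 10}, {"id": 2, "a": 20}]
right = [{"id": 2, "b": "x"}, {"id": 3, "b": "y"}]
print(distributed_join(left, right, "id"))  # → [{'id': 2, 'a': 20, 'b': 'x'}]
```

The same skeleton covers other key-based operators (group-by aggregation, distinct, set operations): only the local step changes, which is precisely why a small set of distributed communication patterns suffices to parallelize a large dataframe operator surface.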
References
MPI: A Message-Passing Interface Standard Version 3.0 (2012). http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. Technical Report
Abeykoon, V., et al.: Streaming machine learning algorithms with big data systems. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 5661–5666. IEEE (2019)
Abeykoon, V., et al.: Hptmt parallel operators for high performance data science & data engineering. ar**v preprint ar**v:2108.06001 (2021)
Abeykoon, V., et al.: Data engineering for HPC with python. In: 2020 IEEE/ACM 9th Workshop on Python for High-Performance and Scientific Computing (PyHPC), pp. 13–21. IEEE (2020)
Babuji, Y.N., et al.: Parsl: scalable parallel scripting in python. In: IWSG (2018)
CylonData: cylon (2021). https://github.com/cylondata/cylon
CylonData: cylon experiments (2021). https://github.com/cylondata/cylon_experiments
Fox, G., et al.: Solving problems on concurrent processors, vol. 1: general techniques and regular problems. Comput. Phys. 3(1), 83–84 (1989)
Gao, H., Sakharnykh, N.: Scaling joins to a thousand GPUs. In: 12th International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@ VLDB (2021)
Kamburugamuve, S., Wickramasinghe, P., Ekanayake, S., Fox, G.C.: Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink. Int. J. High Perform. Comput. Appl. 32(1), 61–73 (2018)
Kamburugamuve, S., et al.: Hptmt: operator-based architecture for scalable high-performance data-intensive frameworks. In: 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), pp. 228–239. IEEE (2021)
Li, X., Lu, P., Schaeffer, J., Shillington, J., Wong, P.S., Shi, H.: On the versatility of parallel sorting by regular sampling. Parallel Comput. 19(10), 1079–1103 (1993)
Mattson, T., Sanders, B., Massingill, B.: Patterns for parallel programming (2004)
McKinney, W., et al.: pandas: a foundational python library for data analysis and statistics. Python High Perform. Sci. Comput. 14(9), 1–9 (2011)
Modin: modin scalability issues (2021). https://github.com/modin-project/modin/issues
Moritz, P., et al.: Ray: a distributed framework for emerging \(\{\)AI\(\}\) applications. In: 13th \(\{\)USENIX\(\}\) Symposium on Operating Systems Design and Implementation (\(\{\)OSDI\(\}\) 18), pp. 561–577 (2018)
Perera, N., et al.: A fast, scalable, universal approach for distributed data reductions. In: International Workshop on Big Data Reduction, IEEE Big Data (2020)
Petersohn, D., et al.: Towards scalable dataframe systems. ar**v preprint ar**v:2001.00888 (2020)
Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Proceedings of the 14th Python in Science Conference, 130–136. Citeseer (2015)
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Wickramasinghe, P., et al.: Twister2: tset high-performance iterative dataflow. In: 2019 International Conference on High Performance Big Data and Intelligent Systems (HPBD &IS), pp. 55–60. IEEE (2019)
Widanage, C., et al.: High performance data engineering everywhere. In: 2020 IEEE International Conference on Smart Data Services (SMDS), pp. 122–132. IEEE (2020)
Zaharia, M., et al.: apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
Zheng, Y., Kamil, A., Driscoll, M.B., Shan, H., Yelick, K.: UPC++: a PGAS extension for c++. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1105–1114. IEEE (2014)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Perera, N. et al. (2023). High Performance Dataframes from Parallel Processing Patterns. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13826. Springer, Cham. https://doi.org/10.1007/978-3-031-30442-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30441-5
Online ISBN: 978-3-031-30442-2