Evaluating the Efficiency of OpenMP Tasking for Unbalanced Computation on Diverse CPU Architectures

Olivier, Stephen L.

doi:10.1007/978-3-030-58144-2_2

Stephen L. Olivier, ORCID: orcid.org/0000-0001-6247-8980

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12295)

Included in the following conference series:

International Workshop on OpenMP

Abstract

In the decade since support for task parallelism was incorporated into OpenMP, its use has remained limited in part due to concerns about its performance and scalability. This paper revisits a study from the early days of OpenMP tasking that used the Unbalanced Tree Search (UTS) benchmark as a stress test to gauge implementation efficiency. The present UTS study includes both Clang/LLVM and vendor OpenMP implementations on four different architectures. We measure parallel efficiency to examine each implementation’s performance in response to varying task granularity. We find that most implementations achieve over 90% efficiency using all available cores for tasks of O(100k) instructions, and the best even manage tasks of O(10k) instructions well.
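For orientation, parallel efficiency is commonly defined as speedup divided by the number of cores used. The form below is that standard definition rather than a formula quoted from the paper; the sequential baseline $T_{\mathrm{seq}}$ and the timings in the worked example are illustrative assumptions, not measured results.

```latex
\[
  E(p) \;=\; \frac{T_{\mathrm{seq}}}{p \cdot T_p},
  \qquad
  \text{e.g. } E(32) \;=\; \frac{100\ \mathrm{s}}{32 \times 3.4\ \mathrm{s}} \;\approx\; 0.92
\]
```

Under this definition, the "over 90% efficiency" threshold in the abstract corresponds to $E(p) > 0.9$ with $p$ equal to the number of available cores.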

Notes

1. https://github.com/bsc-pm/bots.
2. An if clause on the task construct would still create a task, though it would be undeferred. The combination of final and mergeable clauses would allow but not require that child tasks be merged, and it would require additional look-ahead since the parent task must also be final to enable merging of the child tasks.
3. The version used in the 2009 UTS OpenMP tasking study [21] also had uniform work per task, but with each task performing the SHA-1 hash for only a single node.
4. Each core has 4, but the BIOS configuration on the test system only has 2 enabled.
5. UTS places relatively low demands on memory, so it can be more amenable to adding threads compared to more memory-hungry applications, which can saturate the memory subsystem with fewer active threads than the total available cores.
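The distinction drawn in note 2 can be sketched in C as follows. This is a minimal illustration, not the UTS benchmark code: the synthetic tree shape (num_children), the function names, and the cutoff parameter are all hypothetical. The first variant uses an if clause, so a task is still created below the cut-off depth but runs undeferred; the second stops creating tasks altogether and falls back to a plain recursive call.

```c
#include <assert.h>

/* Hypothetical synthetic tree: a node at depth d has (d % 3) + 1 children
 * until max_depth is reached. This is NOT the UTS tree generator. */
static int num_children(int depth, int max_depth) {
    return depth < max_depth ? (depth % 3) + 1 : 0;
}

/* Sequential reference traversal: counts the nodes of the subtree. */
long count_serial(int depth, int max_depth) {
    assert(depth >= 0);
    long total = 1;
    int n = num_children(depth, max_depth);
    for (int i = 0; i < n; i++)
        total += count_serial(depth + 1, max_depth);
    return total;
}

/* Variant A: if clause. At depth >= cutoff the task construct is still
 * encountered and a task is still created, but it executes undeferred. */
long count_if_clause(int depth, int max_depth, int cutoff) {
    long total = 1;
    long partial[3] = {0, 0, 0};   /* at most 3 children per node */
    int n = num_children(depth, max_depth);
    for (int i = 0; i < n; i++) {
        #pragma omp task shared(partial) firstprivate(i) if(depth < cutoff)
        partial[i] = count_if_clause(depth + 1, max_depth, cutoff);
    }
    #pragma omp taskwait
    for (int i = 0; i < n; i++)
        total += partial[i];
    return total;
}

/* Variant B: manual cut-off. At depth >= cutoff no task construct is
 * encountered at all; the subtree is traversed by ordinary recursion. */
long count_manual_cutoff(int depth, int max_depth, int cutoff) {
    if (depth >= cutoff)
        return count_serial(depth, max_depth);
    long total = 1;
    long partial[3] = {0, 0, 0};
    int n = num_children(depth, max_depth);
    for (int i = 0; i < n; i++) {
        #pragma omp task shared(partial) firstprivate(i)
        partial[i] = count_manual_cutoff(depth + 1, max_depth, cutoff);
    }
    #pragma omp taskwait
    for (int i = 0; i < n; i++)
        total += partial[i];
    return total;
}

/* Driver: one thread creates the root of the task tree, the team executes it. */
long run_parallel(int max_depth, int cutoff) {
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = count_manual_cutoff(0, max_depth, cutoff);
    return result;
}
```

Both variants return the same count as count_serial for any cutoff; they differ only in how much tasking overhead is paid below the cut-off depth. Compiled without OpenMP support, the pragmas are ignored and all three traversals run sequentially.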

References

1. Adcock, A.B., Sullivan, B.D., Hernandez, O.R., Mahoney, M.W.: Evaluating OpenMP tasking at scale for the computation of graph hyperbolicity. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 71–83. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40698-0_6
2. Atkinson, P., McIntosh-Smith, S.: On the performance of parallel tasking runtimes for an irregular fast multipole method application. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2017. LNCS, vol. 10468, pp. 92–106. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65578-9_7
3. Ayguadé, E., et al.: The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst. 20, 404–418 (2009)
4. Ayguadé, E., Duran, A., Hoeflinger, J., Massaioli, F., Teruel, X.: An experimental evaluation of the new OpenMP tasking model. In: Adve, V., Garzarán, M.J., Petersen, P. (eds.) LCPC 2007. LNCS, vol. 5234, pp. 63–77. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85261-2_5
5. Bull, J.M., Reid, F., McDonnell, N.: A microbenchmark suite for OpenMP tasks. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 271–274. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_24
6. Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: SC 2008: ACM/IEEE Supercomputing 2008, pp. 1–11. IEEE (2008)
7. Duran, A., Corbalán, J., Ayguadé, E.: Evaluation of OpenMP task scheduling strategies. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 100–110. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79561-2_9
8. Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguadé, E.: Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP. In: ICPP 2009: Proceedings of the 38th International Conference on Parallel Processing, pp. 124–131. IEEE, September 2009
9. Eastlake, D., Jones, P.: US Secure Hash Algorithm 1 (SHA-1). RFC 3174, Internet Engineering Task Force, September 2001. http://www.rfc-editor.org/rfc/rfc3174.txt
10. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: PLDI 1998: Proc. ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, pp. 212–223. Association for Computing Machinery, New York (1998)
11. Fürlinger, K., Skinner, D.: Performance profiling for OpenMP tasks. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 132–139. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_11
12. Gautier, T., Perez, C., Richard, J.: On the impact of OpenMP task granularity. In: de Supinski, B.R., Valero-Lara, P., Martorell, X., Mateo Bellido, S., Labarta, J. (eds.) IWOMP 2018. LNCS, vol. 11128, pp. 205–221. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98521-3_14
13. Iwasaki, S., Taura, K.: A static cut-off for task parallel programs. In: PACT 2016: International Conference on Parallel Architecture and Compilation Techniques, pp. 139–150, September 2016
14. Lattner, C., Adve, V.: LLVM: a compilation framework for lifelong program analysis and transformation. In: CGO 2004: International Symposium on Code Generation and Optimization, San Jose, CA, USA, pp. 75–88, March 2004
15. Leiserson, C.E.: The Cilk++ concurrency platform. J. Supercomput. 51(3), 244–257 (2010)
16. Lin, Y., Mazurov, O.: Providing observability for OpenMP 3.0 applications. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 104–117. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_9
17. Lorenz, D., Mohr, B., Rössel, C., Schmidl, D., Wolf, F.: How to reconcile event-based performance analysis with tasking in OpenMP. In: Sato, M., Hanawa, T., Müller, M.S., Chapman, B.M., de Supinski, B.R. (eds.) IWOMP 2010. LNCS, vol. 6132, pp. 109–121. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13217-9_9
18. Lorenz, D., Philippen, P., Schmidl, D., Wolf, F.: Profiling of OpenMP tasks with score-P. In: ICPPW 2012: 41st International Conference on Parallel Processing Workshops, pp. 444–453. IEEE Computer Society (2012)
19. Navarro, A., Mateo, S., Perez, J.M., Beltran, V., Ayguadé, E.: Adaptive and architecture-independent task granularity for recursive applications. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2017. LNCS, vol. 10468, pp. 169–182. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65578-9_12
20. Olivier, S., et al.: UTS: an unbalanced tree search benchmark. In: Almási, G., Caşcaval, C., Wu, P. (eds.) LCPC 2006. LNCS, vol. 4382, pp. 235–250. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72521-3_18
21. Olivier, S.L., Prins, J.F.: Evaluating OpenMP 3.0 run time systems on unbalanced task graphs. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 63–78. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_6
22. Olivier, S.L., Prins, J.F.: Comparison of OpenMP 3.0 and other task parallel frameworks on unbalanced task graphs. Int. J. Parallel Program. 38(5–6), 341–360 (2010)
23. Olivier, S.L., de Supinski, B.R., Schulz, M., Prins, J.F.: Characterizing and mitigating work time inflation in task parallel programs. In: SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 65:1–65:12. IEEE Computer Society Press (2012)
24. OpenMP Architecture Review Board: OpenMP application programming interface, version 3.0, May 2008. https://www.openmp.org/wp-content/uploads/spec30.pdf
25. OpenMP Architecture Review Board: OpenMP application programming interface, version 5.0, November 2018. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf
26. Reinders, J.: Intel Threading Building Blocks: Outfitting C++ For Multi-Core Processor Parallelism. O’Reilly, Beijing (2007)
27. Schmidl, D., et al.: Performance analysis techniques for task-based OpenMP applications. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 196–209. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_15
28. Terboven, C., Schmidl, D., Cramer, T., an Mey, D.: Assessing OpenMP tasking implementations on NUMA architectures. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 182–195. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_14
29. Virouleau, P., et al.: Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite. In: DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2014. LNCS, vol. 8766, pp. 16–29. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11454-5_2

Acknowledgment

This work used advanced architecture testbed systems provided by the National Nuclear Security Administration’s Advanced Simulation and Computing Program. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

Author information

Authors and Affiliations

Center for Computing Research, Sandia National Laboratories, Albuquerque, NM, USA
Stephen L. Olivier

Corresponding author

Correspondence to Stephen L. Olivier.

Editor information

Editors and Affiliations

Texas Advanced Computing Center (TACC), Austin, TX, USA
Kent Milfeld
Lawrence Livermore National Laboratory, Livermore, CA, USA
Bronis R. de Supinski
Texas Advanced Computing Center (TACC), Austin, TX, USA
Lars Koesterke
RWTH Aachen University, Aachen, Germany
Jannis Klinkenberg

About this paper

Cite this paper

Olivier, S.L. (2020). Evaluating the Efficiency of OpenMP Tasking for Unbalanced Computation on Diverse CPU Architectures. In: Milfeld, K., de Supinski, B., Koesterke, L., Klinkenberg, J. (eds) OpenMP: Portable Multi-Level Parallelism on Modern Systems. IWOMP 2020. Lecture Notes in Computer Science, vol 12295. Springer, Cham. https://doi.org/10.1007/978-3-030-58144-2_2

DOI: https://doi.org/10.1007/978-3-030-58144-2_2
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58143-5
Online ISBN: 978-3-030-58144-2
eBook Packages: Computer Science, Computer Science (R0)
