Evaluating the Efficiency of OpenMP Tasking for Unbalanced Computation on Diverse CPU Architectures

  • Conference paper
OpenMP: Portable Multi-Level Parallelism on Modern Systems (IWOMP 2020)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12295)

Abstract

In the decade since support for task parallelism was incorporated into OpenMP, its use has remained limited in part due to concerns about its performance and scalability. This paper revisits a study from the early days of OpenMP tasking that used the Unbalanced Tree Search (UTS) benchmark as a stress test to gauge implementation efficiency. The present UTS study includes both Clang/LLVM and vendor OpenMP implementations on four different architectures. We measure parallel efficiency to examine each implementation’s performance in response to varying task granularity. We find that most implementations achieve over 90% efficiency using all available cores for tasks of O(100k) instructions, and the best even manage tasks of O(10k) instructions well.
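The kind of measurement described above can be illustrated with a minimal sketch in C. It is not the UTS benchmark code: the tree shape, the function names (mix, node_work, count_serial, count_tasks, count_tree, parallel_efficiency), and the cheap integer mixer used in place of a cryptographic hash are all assumptions made for illustration. The rounds parameter stands in for task granularity, and the efficiency formula is the conventional T_serial / (p × T_parallel), which the abstract does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-node hash: a splitmix64-style mixer. */
static uint64_t mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Work done at one node: more rounds means coarser task granularity. */
static uint64_t node_work(uint64_t id, int rounds) {
    uint64_t h = id;
    for (int i = 0; i < rounds; i++) h = mix(h);
    return h;
}

/* Serial reference: count the nodes of the implicit tree rooted at id.
   Each node has 0 to 3 children, chosen from its hash, so subtree sizes
   vary unpredictably; max_depth keeps the tree finite. */
long count_serial(uint64_t id, int depth, int max_depth, int rounds) {
    uint64_t h = node_work(id, rounds);
    int n = (depth < max_depth) ? (int)(h % 4) : 0;
    long total = 1;
    for (int c = 0; c < n; c++)
        total += count_serial(mix(h + (uint64_t)c), depth + 1, max_depth, rounds);
    return total;
}

/* Task version: one OpenMP task per child subtree. */
static long count_tasks(uint64_t id, int depth, int max_depth, int rounds) {
    uint64_t h = node_work(id, rounds);
    int n = (depth < max_depth) ? (int)(h % 4) : 0;
    long partial[3] = {0, 0, 0};
    for (int c = 0; c < n; c++) {
        /* partial must be shared so the child can write its result back. */
        #pragma omp task shared(partial) firstprivate(c, h)
        partial[c] = count_tasks(mix(h + (uint64_t)c), depth + 1, max_depth, rounds);
    }
    /* Wait for the children before reading partial or leaving this frame. */
    #pragma omp taskwait
    return 1 + partial[0] + partial[1] + partial[2];
}

/* Entry point: one thread creates the root task, the team executes tasks. */
long count_tree(uint64_t root, int max_depth, int rounds) {
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = count_tasks(root, 0, max_depth, rounds);
    return total;
}

/* Conventional parallel efficiency: speedup divided by thread count. */
double parallel_efficiency(double t_serial, double t_parallel, int threads) {
    assert(threads > 0 && t_parallel > 0.0);
    return t_serial / (t_parallel * (double)threads);
}
```

Compiled without OpenMP support, the pragmas are ignored and count_tree runs serially, which gives a convenient baseline. Timing calls such as omp_get_wtime would be wrapped around count_serial and count_tree to obtain the two times passed to parallel_efficiency, and sweeping rounds would vary the work per task.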

Notes

  1. https://github.com/bsc-pm/bots.

  2. An if clause on the task construct would still create a task, though it would be undeferred. The combination of final and mergeable clauses would allow but not require that child tasks be merged, and it would require additional look-ahead since the parent task must also be final to enable merging of the child tasks.

  3. The version used in the 2009 UTS OpenMP tasking study [21] also had uniform work per task, but with each task performing the SHA-1 hash for only a single node.

  4. Each core has 4, but the BIOS configuration on the test system only has 2 enabled.

  5. UTS places relatively low demands on memory, so it can be more amenable to adding threads compared to more memory-hungry applications, which can saturate the memory subsystem with fewer active threads than the total available cores.
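The distinction drawn in note 2 can be sketched in C as follows. The Fibonacci recursion, the cutoff parameter, and all function names are hypothetical and chosen only for brevity; they do not come from the paper. The sketch contrasts the two clause-based options mentioned in the note with a manual cut-off that avoids creating tasks altogether below the threshold.

```c
#include <assert.h>

/* Plain recursion with no OpenMP constructs. */
long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Variant A: if clause. When the condition is false a task is still
   created, but it is undeferred: the encountering task is suspended
   until the new task completes. */
long fib_if(int n, int cutoff) {
    if (n < 2) return n;
    long a = 0, b = 0;
    #pragma omp task shared(a) if(n > cutoff)
    a = fib_if(n - 1, cutoff);
    #pragma omp task shared(b) if(n > cutoff)
    b = fib_if(n - 2, cutoff);
    #pragma omp taskwait
    return a + b;
}

/* Variant B: final plus mergeable. Once a task is final, the tasks it
   creates are included tasks, which the implementation may (but need
   not) merge into the parent. */
long fib_final(int n, int cutoff) {
    if (n < 2) return n;
    long a = 0, b = 0;
    #pragma omp task shared(a) final(n <= cutoff) mergeable
    a = fib_final(n - 1, cutoff);
    #pragma omp task shared(b) final(n <= cutoff) mergeable
    b = fib_final(n - 2, cutoff);
    #pragma omp taskwait
    return a + b;
}

/* Variant C: manual cut-off. Below the threshold no task construct is
   encountered at all, so no tasking overhead is incurred. */
long fib_manual(int n, int cutoff) {
    if (n < 2 || n <= cutoff) return fib_serial(n);
    long a = 0, b = 0;
    #pragma omp task shared(a)
    a = fib_manual(n - 1, cutoff);
    #pragma omp task shared(b)
    b = fib_manual(n - 2, cutoff);
    #pragma omp taskwait
    return a + b;
}

/* Run one variant from a single thread inside a parallel region. */
long run_single(long (*f)(int, int), int n, int cutoff) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = f(n, cutoff);
    return r;
}
```

All three variants compute the same result; they differ only in how much tasking machinery the runtime is asked to exercise for small subproblems, which is the concern the note raises.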

References

  1. Adcock, A.B., Sullivan, B.D., Hernandez, O.R., Mahoney, M.W.: Evaluating OpenMP tasking at scale for the computation of graph hyperbolicity. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 71–83. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40698-0_6

  2. Atkinson, P., McIntosh-Smith, S.: On the performance of parallel tasking runtimes for an irregular fast multipole method application. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2017. LNCS, vol. 10468, pp. 92–106. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65578-9_7

  3. Ayguadé, E., et al.: The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst. 20, 404–418 (2009)

  4. Ayguadé, E., Duran, A., Hoeflinger, J., Massaioli, F., Teruel, X.: An experimental evaluation of the new OpenMP tasking model. In: Adve, V., Garzarán, M.J., Petersen, P. (eds.) LCPC 2007. LNCS, vol. 5234, pp. 63–77. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85261-2_5

  5. Bull, J.M., Reid, F., McDonnell, N.: A microbenchmark suite for OpenMP tasks. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 271–274. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_24

  6. Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: SC 2008: ACM/IEEE Supercomputing 2008, pp. 1–11. IEEE (2008)

  7. Duran, A., Corbalán, J., Ayguadé, E.: Evaluation of OpenMP task scheduling strategies. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 100–110. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79561-2_9

  8. Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguadé, E.: Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP. In: ICPP 2009: Proceedings of the 38th International Conference on Parallel Processing, pp. 124–131. IEEE, September 2009

  9. Eastlake, D., Jones, P.: US Secure Hash Algorithm 1 (SHA-1). RFC 3174, Internet Engineering Task Force, September 2001. http://www.rfc-editor.org/rfc/rfc3174.txt

  10. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: PLDI 1998: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, pp. 212–223. Association for Computing Machinery, New York (1998)

  11. Fürlinger, K., Skinner, D.: Performance profiling for OpenMP tasks. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 132–139. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_11

  12. Gautier, T., Perez, C., Richard, J.: On the impact of OpenMP task granularity. In: de Supinski, B.R., Valero-Lara, P., Martorell, X., Mateo Bellido, S., Labarta, J. (eds.) IWOMP 2018. LNCS, vol. 11128, pp. 205–221. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98521-3_14

  13. Iwasaki, S., Taura, K.: A static cut-off for task parallel programs. In: PACT 2016: International Conference on Parallel Architecture and Compilation Techniques, pp. 139–150, September 2016

  14. Lattner, C., Adve, V.: LLVM: a compilation framework for lifelong program analysis and transformation. In: CGO 2004: International Symposium on Code Generation and Optimization, San Jose, CA, USA, pp. 75–88, March 2004

  15. Leiserson, C.E.: The Cilk++ concurrency platform. J. Supercomput. 51(3), 244–257 (2010)

  16. Lin, Y., Mazurov, O.: Providing observability for OpenMP 3.0 applications. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 104–117. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_9

  17. Lorenz, D., Mohr, B., Rössel, C., Schmidl, D., Wolf, F.: How to reconcile event-based performance analysis with tasking in OpenMP. In: Sato, M., Hanawa, T., Müller, M.S., Chapman, B.M., de Supinski, B.R. (eds.) IWOMP 2010. LNCS, vol. 6132, pp. 109–121. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13217-9_9

  18. Lorenz, D., Philippen, P., Schmidl, D., Wolf, F.: Profiling of OpenMP tasks with score-P. In: ICPPW 2012: 41st International Conference on Parallel Processing Workshops, pp. 444–453. IEEE Computer Society (2012)

  19. Navarro, A., Mateo, S., Perez, J.M., Beltran, V., Ayguadé, E.: Adaptive and architecture-independent task granularity for recursive applications. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2017. LNCS, vol. 10468, pp. 169–182. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65578-9_12

  20. Olivier, S., et al.: UTS: an unbalanced tree search benchmark. In: Almási, G., Caşcaval, C., Wu, P. (eds.) LCPC 2006. LNCS, vol. 4382, pp. 235–250. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72521-3_18

  21. Olivier, S.L., Prins, J.F.: Evaluating OpenMP 3.0 run time systems on unbalanced task graphs. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 63–78. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_6

  22. Olivier, S.L., Prins, J.F.: Comparison of OpenMP 3.0 and other task parallel frameworks on unbalanced task graphs. Int. J. Parallel Program. 38(5–6), 341–360 (2010)

  23. Olivier, S.L., de Supinski, B.R., Schulz, M., Prins, J.F.: Characterizing and mitigating work time inflation in task parallel programs. In: SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 65:1–65:12. IEEE Computer Society Press (2012)

  24. OpenMP Architecture Review Board: OpenMP application programming interface, version 3.0, May 2008. https://www.openmp.org/wp-content/uploads/spec30.pdf

  25. OpenMP Architecture Review Board: OpenMP application programming interface, version 5.0, November 2018. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf

  26. Reinders, J.: Intel Threading Building Blocks: Outfitting C++ For Multi-Core Processor Parallelism. O’Reilly, Beijing (2007)

  27. Schmidl, D., et al.: Performance analysis techniques for task-based OpenMP applications. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 196–209. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_15

  28. Terboven, C., Schmidl, D., Cramer, T., an Mey, D.: Assessing OpenMP tasking implementations on NUMA architectures. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 182–195. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_14

  29. Virouleau, P., et al.: Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite. In: DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2014. LNCS, vol. 8766, pp. 16–29. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11454-5_2

Acknowledgment

This work used advanced architecture testbed systems provided by the National Nuclear Security Administration’s Advanced Simulation and Computing Program. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

Author information

Correspondence to Stephen L. Olivier.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Olivier, S.L. (2020). Evaluating the Efficiency of OpenMP Tasking for Unbalanced Computation on Diverse CPU Architectures. In: Milfeld, K., de Supinski, B., Koesterke, L., Klinkenberg, J. (eds) OpenMP: Portable Multi-Level Parallelism on Modern Systems. IWOMP 2020. Lecture Notes in Computer Science, vol 12295. Springer, Cham. https://doi.org/10.1007/978-3-030-58144-2_2

  • DOI: https://doi.org/10.1007/978-3-030-58144-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58143-5

  • Online ISBN: 978-3-030-58144-2

  • eBook Packages: Computer Science, Computer Science (R0)
