Scalability analysis of AVX-512 extensions

The Journal of Supercomputing

Abstract

Energy efficiency within a given thermal design power (TDP) has become the main design goal for microprocessors across all market segments. How best to use the available transistors within that TDP remains an open problem. Parallelism is the foundation for reaching exascale performance. While instruction-level and thread-level parallelism are embraced by developers, data-level parallelism (e.g. single-instruction multiple-data execution) is usually underutilized, despite its huge potential. Vendors are pushing vector register sizes to double every 4 years. Intel’s AVX-512 (512-bit registers) and ARM’s SVE (up to 2048-bit registers) are examples of this trend. In this paper, we perform a scalability and energy efficiency analysis of AVX-512 using the ParVec benchmark suite. ParVec is extended to support AVX-512 as well as the newest versions of the GCC compiler. We use Intel’s Top-Down model to identify the main architectural bottlenecks for each benchmark studied. Results show that the performance and energy improvements depend greatly on the fraction of code that can be vectorized. Energy improvements over scalar code in a single-thread environment range from \(2\times\) for Streamcluster (worst) to \(8\times\) for Blackscholes (best).
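The dependence on the vectorizable fraction can be illustrated with a simple Amdahl-style estimate. The sketch below is not taken from the paper: it assumes that the vectorized part of the run time speeds up by exactly the number of vector lanes (16 for single-precision values in a 512-bit register) and that the rest stays scalar. It ignores memory bottlenecks, frequency changes, and any other effects a real measurement would capture. The names `vector_speedup` and `LANES_FP32_512` are hypothetical.

```python
def vector_speedup(vector_fraction: float, lanes: int) -> float:
    """Idealized Amdahl-style speedup over scalar code.

    vector_fraction: share of scalar run time that can be vectorized (0..1).
    lanes: assumed speedup of the vectorized part (number of vector lanes).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if lanes < 1:
        raise ValueError("lanes must be >= 1")
    # The scalar part is unchanged; the vector part shrinks by a factor of `lanes`.
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)


# Assumption: 32-bit elements in a 512-bit register give 16 lanes.
LANES_FP32_512 = 512 // 32

if __name__ == "__main__":
    for fraction in (0.50, 0.90, 0.99):
        speedup = vector_speedup(fraction, LANES_FP32_512)
        print(f"vectorizable fraction {fraction:.2f}: {speedup:.2f}x")
```

Even under these optimistic assumptions, a code that is only half vectorizable stays below a 2x speedup, which is consistent in spirit with the wide spread between benchmarks reported above.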

Notes

  1. Micro-operations.

  2. Set of macros used to generate the intrinsics code using the C pre-processor.

  3. Model Specific Registers.

Author information

Corresponding author

Correspondence to Juan M. Cebrian.

About this article

Cite this article

Cebrian, J.M., Natvig, L. & Jahre, M. Scalability analysis of AVX-512 extensions. J Supercomput 76, 2082–2097 (2020). https://doi.org/10.1007/s11227-019-02840-7

  • DOI: https://doi.org/10.1007/s11227-019-02840-7
