Abstract
Energy efficiency within a given thermal design power (TDP) has become the main design goal for microprocessors across all market segments. How best to use the available transistors within the TDP remains an open problem. Parallelism is the foundation for reaching exascale performance. While instruction-level and thread-level parallelism are widely embraced by developers, data-level parallelism (e.g. single-instruction multiple-data execution) is usually underutilized despite its huge potential. Companies are pushing the size of vector registers to double every four years. Intel's AVX-512 (512-bit registers) and ARM's SVE (up to 2048-bit registers) are examples of this trend. In this paper, we perform a scalability and energy efficiency analysis of AVX-512 using the ParVec benchmark suite. ParVec is extended to support AVX-512 as well as the newest versions of the GCC compiler. We use Intel's Top-Down model to show the main bottlenecks of the architecture for each studied benchmark. Results show that the performance and energy improvements depend greatly on the fraction of code that can be vectorized. Energy improvements over scalar code in a single-thread environment range from 2\(\times\) for Streamcluster (worst) to 8\(\times\) for Blackscholes (best).
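The dependence on the vectorizable fraction can be illustrated with Amdahl's law. The sketch below is not the model used in the paper; it is a generic upper bound that assumes the vectorized part speeds up by exactly the number of vector lanes and ignores memory bottlenecks, frequency changes and energy. The function name `vector_speedup` and its parameters are hypothetical.

```python
def vector_speedup(vector_fraction: float, lanes: int) -> float:
    """Amdahl-style upper bound on the speedup from vectorization.

    vector_fraction: share of scalar run time that can be vectorized (0..1).
    lanes: elements processed per vector instruction (assumed ideal scaling).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if lanes < 1:
        raise ValueError("lanes must be >= 1")
    # The scalar part is unchanged; the vector part shrinks by the lane count.
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)


if __name__ == "__main__":
    # A 512-bit register holds 16 single-precision (32-bit) values.
    lanes = 512 // 32
    for fraction in (0.5, 0.9, 0.99):
        bound = vector_speedup(fraction, lanes)
        print(f"vectorizable fraction {fraction:.2f}: at most {bound:.2f}x")
```

Under these assumptions, even a small non-vectorizable remainder keeps the bound well below the 16 lanes available, which is consistent with the wide spread between benchmarks reported above.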
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig11_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig12_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig13_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig14_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-019-02840-7/MediaObjects/11227_2019_2840_Fig15_HTML.png)
Notes
Micro-operations.
Set of macros used to generate the intrinsics code using the C pre-processor.
Model Specific Registers.
Cite this article
Cebrian, J.M., Natvig, L. & Jahre, M. Scalability analysis of AVX-512 extensions. J Supercomput 76, 2082–2097 (2020). https://doi.org/10.1007/s11227-019-02840-7
DOI: https://doi.org/10.1007/s11227-019-02840-7