Abstract
LU and Cholesky factorizations of dense matrices are among the most fundamental building blocks in numerical applications. Because of their \(O(n^3)\) complexity, they are often the most time-consuming basic kernels in numerical linear algebra, so accelerating them on modern parallel processors has received much attention. In this paper, we implement LU and Cholesky factorizations on novel massively parallel artificial intelligence (AI) accelerators originally developed for deep neural network applications. We explore the data parallelism of the matrix factorizations, and exploit the neural compute units and on-chip scratchpad memories of modern AI chips to accelerate them. The experimental results show that our optimization methods bring clear performance improvements: on a Cambricon AI accelerator, LU and Cholesky factorizations reach up to 41.54 and 19.77 GFlop/s in single precision, and 78.37 and 33.85 GFlop/s in half precision, respectively.
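For reference, the two kernels the paper accelerates are the textbook right-looking factorizations. The following is a minimal unblocked NumPy sketch of both, not the authors' accelerator implementation; it only illustrates where the \(O(n^3)\) work lives (the trailing-submatrix updates), which is the part mapped to the neural compute units in the paper.

```python
import numpy as np

def cholesky_lower(A):
    """Right-looking Cholesky: A = L L^T for symmetric positive-definite A."""
    n = A.shape[0]
    L = np.tril(A.astype(np.float64))
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        L[k + 1:, k] /= L[k, k]  # scale the panel column
        # Schur-complement update of the trailing submatrix: the O(n^3) bulk
        L[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], L[k + 1:, k])
    return np.tril(L)

def lu_nopivot(A):
    """Right-looking LU without pivoting: A = L U (assumes nonzero pivots)."""
    n = A.shape[0]
    U = A.astype(np.float64).copy()
    L = np.eye(n)
    for k in range(n):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]  # multipliers below the pivot
        # rank-1 update of the trailing rows: the O(n^3) bulk
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    return L, np.triu(U)

# usage: factor a random SPD matrix and check the reconstructions
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)          # SPD, so both factorizations apply
Lc = cholesky_lower(A)
Ll, Uu = lu_nopivot(A)
print(np.allclose(Lc @ Lc.T, A), np.allclose(Ll @ Uu, A))
```

Practical implementations (including the paper's) work on blocks rather than single columns, so the rank-1 updates become matrix-matrix multiplications that the AI chip's compute units handle efficiently.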
*(Figures 1–10 of the original article are omitted here.)*
Acknowledgements
We would like to thank all the reviewers for their invaluable comments. Weifeng Liu is the corresponding author of this paper. This research was supported by the National Natural Science Foundation of China under Grant No. 61972415, and the Science Foundation of China University of Petroleum, Beijing under Grant Nos. 2462019YJRC004 and 2462020XKJS03.
Cite this article
Lu, Y., Luo, Y., Lian, H. et al. Implementing LU and Cholesky factorizations on artificial intelligence accelerators. CCF Trans. HPC 3, 286–297 (2021). https://doi.org/10.1007/s42514-021-00075-8