Abstract
LU and Cholesky factorizations of dense matrices are among the most fundamental building blocks in numerical applications. Because of their \(O(n^3)\) complexity, they are often the most time-consuming basic kernels in numerical linear algebra, so accelerating them on modern parallel processors has received much attention. In this paper, we implement LU and Cholesky factorizations on novel massively parallel artificial intelligence (AI) accelerators originally developed for deep neural network applications. We explore the data parallelism of the matrix factorizations, and exploit the neural compute units and on-chip scratchpad memories of modern AI chips to accelerate them. The experimental results show that our optimization methods bring clear performance improvements: on a Cambricon AI accelerator, LU and Cholesky factorizations reach up to 41.54 and 19.77 GFlop/s in single precision, and 78.37 and 33.85 GFlop/s in half precision, respectively.
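For reference, the two kernels the paper accelerates are the textbook right-looking factorizations. The following is a minimal unblocked NumPy sketch of both, not the authors' accelerator implementation; it only illustrates where the \(O(n^3)\) work lives (the trailing-submatrix updates), which is the part mapped to the neural compute units in the paper.

```python
import numpy as np

def cholesky_lower(A):
    """Right-looking Cholesky: A = L L^T for symmetric positive-definite A."""
    n = A.shape[0]
    L = np.tril(A.astype(np.float64))
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        L[k + 1:, k] /= L[k, k]  # scale the panel column
        # Schur-complement update of the trailing submatrix: the O(n^3) bulk
        L[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], L[k + 1:, k])
    return np.tril(L)

def lu_nopivot(A):
    """Right-looking LU without pivoting: A = L U (assumes nonzero pivots)."""
    n = A.shape[0]
    U = A.astype(np.float64).copy()
    L = np.eye(n)
    for k in range(n):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]  # multipliers below the pivot
        # rank-1 update of the trailing rows: the O(n^3) bulk
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    return L, np.triu(U)

# usage: factor a random SPD matrix and check the reconstructions
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)          # SPD, so both factorizations apply
Lc = cholesky_lower(A)
Ll, Uu = lu_nopivot(A)
print(np.allclose(Lc @ Lc.T, A), np.allclose(Ll @ Uu, A))
```

Practical implementations (including the paper's) work on blocks rather than single columns, so the rank-1 updates become matrix-matrix multiplications that the AI chip's compute units handle efficiently.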
*(Figures 1–10 of the original article are omitted here.)*
Acknowledgements
We would like to thank all the reviewers for their invaluable comments. Weifeng Liu is the corresponding author of this paper. This research was supported by the National Natural Science Foundation of China under Grant No. 61972415, and the Science Foundation of China University of Petroleum, Beijing under Grant Nos. 2462019YJRC004 and 2462020XKJS03.
Cite this article
Lu, Y., Luo, Y., Lian, H. et al. Implementing LU and Cholesky factorizations on artificial intelligence accelerators. CCF Trans. HPC 3, 286–297 (2021). https://doi.org/10.1007/s42514-021-00075-8