Search Results
-
Performance improvement of the triangular matrix product in commodity clusters
There are many works devoted to improving the matrix product computation, as it is used in a wide variety of scientific applications arising from...
-
A parallel structured banded DC algorithm for symmetric eigenvalue problems
In this paper, a novel parallel structured divide-and-conquer (DC) algorithm is proposed for symmetric banded eigenvalue problems, denoted by PBSDC,...
-
Revisiting the performance optimization of QR factorization on Intel KNL and SKL multiprocessors
This study focused on the optimization of double-precision general matrix–matrix multiplication (DGEMM) routine to improve the QR factorization...
-
Optimizing Distributed Tensor Contractions Using Node-Aware Processor Grids
We propose an algorithm that aims at minimizing the inter-node communication volume for distributed and memory-efficient tensor contraction schemes...
-
High-Efficiency Specialized Support for Dense Linear Algebra Arithmetic in LuNA System
Automatic synthesis of efficient scientific parallel programs for supercomputers is in general a complex problem of system parallel programming....
-
Automatic Code Selection for the Dense Symmetric Generalized Eigenvalue Problem Using ATMathCoreLib
Solution of the symmetric definite generalized eigenvalue problem (GEP)...
-
A Task-Parallel Runtime for Heterogeneous Multi-node Vector Systems
In recent years, high-performance computing systems have been equipped not only with host processors but also with accelerators, and are becoming more heterogeneous...
-
Direct Solver Aiming at Elimination of Systematic Errors in 3D Stellar Positions
The determination of three-dimensional positions and velocities of stars based on the observations collected by a space telescope suffers from the...
-
COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling
Communication-avoiding algorithms for Linear Algebra have become increasingly popular, in particular for distributed memory architectures. In...
-
Energy efficiency and performance analysis of a legacy atomic scale materials modeling simulator (VASP)
This work tackles the performance and energy consumption analysis of a legacy scientific application, the VASP (Vienna Ab-initio Simulation Package),...
-
Introduction to StarNEig—A Task-Based Library for Solving Nonsymmetric Eigenvalue Problems
In this paper, we present the StarNEig library for solving dense nonsymmetric (generalized) eigenvalue problems. The library is built on top of the...
-
Parallel Spectral Clustering with FEAST Library
Spectral clustering is one of the most relevant unsupervised learning methods capable of classifying data without any a priori information. At the...
-
Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors
In high-performance computing, the general matrix-matrix multiplication (xGEMM) routine is the core of the Level 3 BLAS kernel for effective...
-
A Current Task-Based Programming Paradigms Analysis
Task-based paradigm models can be an alternative to MPI. The user defines atomic tasks with a defined input and output with the dependencies between...
-
Preconditioned Jacobi SVD Algorithm Outperforms PDGESVD
Recently, we have introduced a new preconditioner for the one-sided block-Jacobi SVD algorithm. In the serial case it outperformed the simple driver...
-
High performance computing for first-principles Kohn-Sham density functional theory towards exascale supercomputers
High performance computing (HPC) plays an essential role in enabling first-principles calculations based on the Kohn–Sham density functional theory...
-
xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor
High-performance extended math libraries are used by many scientific, engineering, and artificial intelligence applications, which usually involve many...
-
swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight
Although matrix multiplication plays an essential role in a wide range of applications, previous works only focus on optimizing dense or sparse...
-
10-Million Atoms Simulation of First-Principle Package LS3DF
The growing demand for semiconductor devices simulation poses a big challenge for large-scale electronic structure calculations. Among various...
-
Efficient Matrix Computation for SGD-Based Algorithms on Apache Spark
With the increase in matrix size in large-scale data analysis, a series of Spark-based distributed matrix computation systems have emerged....