
Benchmarking rocsolver functions

Introduction and motivation

ELPA relies on many BLAS/LAPACK routines, ideally provided by the hardware vendor and tuned for the target CPU/GPU hardware. One of the most important of these routines for ELPA is LAPACK's STEDC: it diagonalizes a symmetric tridiagonal matrix and is required by both the ELPA1 (1-stage) and ELPA2 (2-stage) solver algorithms.

The ELPA 2025.01 release will introduce significant new GPU ports of the "solve" step, which require STEDC on the GPU. For CUDA, a GPU version of STEDC is not available, so ELPA uses SYEVD (cusolverDnDsyevd) instead, which diagonalizes the full matrix without taking its tridiagonal structure into account. For the CUDA version of ELPA, this approach reduces the runtime of the "solve" step by a factor of ~3 compared to the CPU version, for an example similar to the one presented below.

For ROCm, rocsolver conveniently provides GPU versions of both STEDC (rocsolver_dstedc) and SYEVD (rocsolver_dsyevd). Unfortunately, rocsolver_dstedc performs even worse than the CPU version of STEDC (LAPACKE_dstedc), which is a significant issue for ELPA. The table below compares Raven and Viper runtimes of ELPA 2025.01; for symmetry, gpusolver_syevd denotes cusolverDnDsyevd on Raven and rocsolver_dsyevd on Viper.

[Table image: ELPA 2025.01 runtimes, Raven vs Viper]

N=40960, nev=40960, NB=1024, ELPA1 solver, MKL 2024.0

Benchmarks for rocsolver functions

Here we present stand-alone benchmarks for the rocsolver_dstedc (main_stedc.cpp) and rocsolver_dsyevd (main_syevd.cpp) functions. The benchmarks are run on one GPU for matrix size N=10240, which is the same as the local matrix size used by ELPA in the table above. Device workspace allocation/deallocation is excluded here, unlike in the ELPA timings in the table above.
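The timing convention described above (only the solver call is timed, workspace allocation and deallocation are excluded) can be sketched as follows. This is a minimal illustration and not the contents of main_stedc.cpp: the names Tridiagonal, make_laplacian and time_solver_seconds are hypothetical, the Laplacian test matrix is an assumed input, and the actual solver call is left as a callable supplied by the caller.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical container for a symmetric tridiagonal matrix in the
// (diagonal, off-diagonal) form that STEDC-style routines take as input.
struct Tridiagonal {
    std::vector<double> d;  // n diagonal entries
    std::vector<double> e;  // n-1 off-diagonal entries
};

// Build the 1D discrete Laplacian (2 on the diagonal, -1 next to it) as a
// simple deterministic test matrix. The choice of matrix is an assumption.
Tridiagonal make_laplacian(std::size_t n) {
    Tridiagonal t;
    t.d.assign(n, 2.0);
    t.e.assign(n > 0 ? n - 1 : 0, -1.0);
    return t;
}

// Time only the solver call. Work done in `setup` (for example device
// workspace allocation and host-to-device copies) and in `teardown`
// (deallocation) stays outside the timed region.
double time_solver_seconds(const std::function<void()>& setup,
                           const std::function<void()>& solve,
                           const std::function<void()>& teardown) {
    setup();
    const auto t0 = std::chrono::steady_clock::now();
    solve();  // e.g. the GPU solver call followed by a device synchronization
    const auto t1 = std::chrono::steady_clock::now();
    teardown();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

With a GPU solver, the solve callable should end with a device synchronization, since kernel launches return asynchronously and the timer would otherwise stop too early.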

Viper

| function (stedc) | runtime (sec.) |
| --- | --- |
| LAPACKE_dstedc (mkl, 1 core) | 18.2 |
| rocsolver_dstedc (rocm 6.2) | 35.6 |
| rocsolver_dstedc (rocm 6.3) | 27.7 |
| rocsolver_dstedc (rocm-develop*) | 11.3 |

| function (syevd) | runtime (sec.) |
| --- | --- |
| LAPACKE_dsyevd (mkl) | 279 |
| rocsolver_dsyevd (rocm 6.3) | 31.6 |
| rocsolver_dsyevd (rocm-develop*) | 15.1 |

*rocm-develop: 2025-02-21, commit 87c64837163f3ecbbe935caabfe86ef9deab7e6d

Raven

For reference, here are some benchmarks on Raven:

| function | runtime (sec.) |
| --- | --- |
| LAPACKE_dstedc (mkl, 1 core) | 15.3 |
| cusolverDnDsyevd | 2.62 |
| rocsolver_dstedc (converted**, rocm 6.2) | 36.8 |

**converted: rocsolver sources ported to CUDA with a "HIPIFLY"-like approach, using trivial substitutions (rocm --> cuda, hip --> cu, etc.).