Bad performance of ELPA_2STAGE_REAL_NVIDIA_SM80_GPU kernel
For ELPA 2023.11.001, the ELPA_2STAGE_REAL_NVIDIA_SM80_GPU kernel performs much worse than the ELPA_2STAGE_REAL_NVIDIA_GPU kernel (about 4x to 5.6x slower in the timings below).
Reproducer: config.log with modules:
module load autoconf/2.71 cuda/11.4 gcc/11 openmpi/4 mkl/2022.1 nccl/2.11.4
The problem comes from tridi_to_band. Here are the timings for 4 GPUs (1 Raven node) and N=10k, 30k, and 40k matrices:
10k
ELPA_2STAGE_REAL_NVIDIA_GPU: tridi_to_band 2.643782 s
ELPA_2STAGE_REAL_NVIDIA_SM80_GPU: tridi_to_band 10.700442 s
30k
ELPA_2STAGE_REAL_NVIDIA_GPU: tridi_to_band 45.905885 s
ELPA_2STAGE_REAL_NVIDIA_SM80_GPU: tridi_to_band 243.225826 s
40k
ELPA_2STAGE_REAL_NVIDIA_GPU: tridi_to_band 101.433715 s
ELPA_2STAGE_REAL_NVIDIA_SM80_GPU: tridi_to_band 565.902228 s
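To make the size of the regression explicit, the slowdown factors can be computed from the timings above. This is only a small sketch: the numbers are copied from this report, and the names `timings` and `slowdown` are illustrative, not part of ELPA.

```python
# tridi_to_band timings in seconds, copied from the report above:
# (ELPA_2STAGE_REAL_NVIDIA_GPU, ELPA_2STAGE_REAL_NVIDIA_SM80_GPU)
timings = {
    "10k": (2.643782, 10.700442),
    "30k": (45.905885, 243.225826),
    "40k": (101.433715, 565.902228),
}


def slowdown(generic_s, sm80_s):
    """Return how many times slower the SM80 kernel is than the generic one."""
    return sm80_s / generic_s


for size, (generic_s, sm80_s) in timings.items():
    print(f"N={size}: SM80 kernel is {slowdown(generic_s, sm80_s):.2f}x slower")
```

The factor grows with matrix size, from roughly 4x at N=10k to roughly 5.6x at N=40k.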
Here are the run logs for the 40k matrix: slurm-9329866_40k_ELPA_2STAGE_REAL_NVIDIA_GPU.out and slurm-9329868_40k_ELPA_2STAGE_REAL_NVIDIA_SM80_GPU.out