src/elpa2/pack_unpack_gpu.F90 · 6cd5a4f1aeba0b7d7c0d7bfdd72bf558495e8e29 · elpa / elpa

Rewrite compute_hh_trafo CUDA kernels · 6cd5a4f1

Wenzhe Yu authored Jul 23, 2019

* Switch to a simple non-WY algorithm
* Unify real and complex cases
* Update reduction kernel
* Use __shfl_xor_sync for warp reduce (CUDA 9+)
* Support 2^n block size, n = 1,2,...,10
* Use templates when possible
* Clean up unused CUDA functions
* Increase default stripe width when using GPU
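
The reduction-related items above (`__shfl_xor_sync` warp reduce, power-of-two block sizes, templates) can be illustrated with a minimal sketch. This is not the code from this commit: the names `block_reduce_sum`, `dot_kernel` and `dot_on_gpu` are hypothetical, and the dot product is only a stand-in for the kind of reduction a Householder transformation kernel needs. It assumes CUDA 9 or later, a single thread block, and a block size of 2^n with n = 1,...,10 passed as a template parameter.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: sum `val` over all threads of a 1-D block of BLOCK threads.
// The result is valid in every thread. Intended to be called once per kernel;
// repeated calls would need a __syncthreads() before `partial` is reused.
template <typename T, unsigned int BLOCK>
__device__ T block_reduce_sum(T val) {
    static_assert(BLOCK >= 2 && BLOCK <= 1024 && (BLOCK & (BLOCK - 1)) == 0,
                  "BLOCK must be 2^n with n = 1..10");
    constexpr unsigned int WARP = 32;
    constexpr unsigned int LANES = (BLOCK < WARP) ? BLOCK : WARP;
    // Only name lanes that actually exist when the block is smaller than a warp.
    constexpr unsigned int MASK =
        (LANES == WARP) ? 0xffffffffu : ((1u << LANES) - 1u);
    constexpr unsigned int NWARPS = (BLOCK + WARP - 1) / WARP;

    // Stage 1: XOR butterfly inside each warp. Since offset < LANES and LANES is
    // a power of two, lane ^ offset always refers to a participating lane.
    for (unsigned int offset = LANES / 2; offset > 0; offset /= 2)
        val += __shfl_xor_sync(MASK, val, offset);

    if (NWARPS == 1) return val;

    // Stage 2: combine one partial sum per warp through shared memory.
    __shared__ T partial[WARP];
    const unsigned int lane = threadIdx.x % WARP;
    const unsigned int warp = threadIdx.x / WARP;
    if (lane == 0) partial[warp] = val;
    __syncthreads();

    // Every warp loads the partials (zero-padded to 32) and reduces them again,
    // so all threads end up with the block-wide total.
    val = (lane < NWARPS) ? partial[lane] : T(0);
    for (unsigned int offset = WARP / 2; offset > 0; offset /= 2)
        val += __shfl_xor_sync(0xffffffffu, val, offset);
    return val;
}

// Hypothetical kernel: dot product of two length-BLOCK vectors in one block.
template <typename T, unsigned int BLOCK>
__global__ void dot_kernel(const T* x, const T* y, T* out) {
    T total = block_reduce_sum<T, BLOCK>(x[threadIdx.x] * y[threadIdx.x]);
    if (threadIdx.x == 0) *out = total;
}

// Hypothetical host wrapper: uses the first BLOCK entries of x and y.
// Error checking of the CUDA calls is omitted to keep the sketch short.
template <typename T, unsigned int BLOCK>
T dot_on_gpu(const std::vector<T>& x, const std::vector<T>& y) {
    assert(x.size() >= BLOCK && y.size() >= BLOCK);
    T *dx = nullptr, *dy = nullptr, *dout = nullptr;
    T result = T(0);
    cudaMalloc(&dx, BLOCK * sizeof(T));
    cudaMalloc(&dy, BLOCK * sizeof(T));
    cudaMalloc(&dout, sizeof(T));
    cudaMemcpy(dx, x.data(), BLOCK * sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), BLOCK * sizeof(T), cudaMemcpyHostToDevice);
    dot_kernel<T, BLOCK><<<1, BLOCK>>>(dx, dy, dout);
    cudaMemcpy(&result, dout, sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dout);
    return result;
}
```

Passing the block size as a template parameter lets the compiler unroll both shuffle loops and drop the shared-memory stage entirely when the block fits in one warp. A call such as `dot_on_gpu<double, 64>(x, y)` launches a single block of 64 threads.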
