Wenzhe Yu authored
* Switch to a simple non-WY algorithm
* Unify real and complex cases
* Update reduction kernel
* Use __shfl_xor_sync for warp reduce (CUDA 9+)
* Support 2^n block size, n = 1,2,...,10
* Use templates when possible
* Clean up unused CUDA functions
* Increase default stripe width when using GPU
6cd5a4f1