• Rewrite compute_hh_trafo CUDA kernels · 6cd5a4f1
    Wenzhe Yu authored
    * Switch to a simple non-WY algorithm
    * Unify real and complex cases
    * Update reduction kernel
    * Use __shfl_xor_sync for warp reduction (requires CUDA 9+)
    * Support block sizes of 2^n, n = 1,2,...,10
    * Use templates when possible
    * Clean up unused CUDA functions
    * Increase default stripe width when using GPU
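
For illustration, the warp-reduce and block-size items above could look roughly like the sketch below. It is not the code from this commit: the names warp_reduce_sum, block_reduce_sum, sum_kernel and sum_ones_on_gpu are hypothetical, and the sketch assumes a single 1-D block whose size BLOCK is a power of two between 2 and 1024, passed as a template parameter.

```cuda
#include <cuda_runtime.h>

// Butterfly (XOR) reduction over WIDTH lanes of a warp, WIDTH = 2, 4, ..., 32.
// After the loop every participating lane holds the sum of the group.
// Hypothetical helper, not taken from the commit.
template <typename T, int WIDTH>
__device__ T warp_reduce_sum(T val, unsigned mask) {
    for (int offset = WIDTH / 2; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(mask, val, offset);
    }
    return val;
}

// Sum over one block of BLOCK threads (BLOCK = 2^n, n = 1..10).
// The result is valid in thread 0 (and in all of warp 0).
template <typename T, int BLOCK>
__device__ T block_reduce_sum(T val) {
    constexpr int WARP = 32;
    constexpr int LANES = BLOCK < WARP ? BLOCK : WARP;   // lanes in use per warp
    constexpr int NWARPS = (BLOCK + WARP - 1) / WARP;    // 1..32 warps
    constexpr unsigned MASK =
        LANES == WARP ? 0xffffffffu : (1u << LANES) - 1u;
    __shared__ T warp_sums[WARP];

    // Step 1: reduce inside each warp using register shuffles only.
    val = warp_reduce_sum<T, LANES>(val, MASK);
    if (NWARPS == 1) return val;

    // Step 2: lane 0 of each warp publishes its partial sum.
    const int lane = threadIdx.x % WARP;
    const int warp = threadIdx.x / WARP;
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();

    // Step 3: the first warp reduces the per-warp partial sums.
    if (warp == 0) {
        val = lane < NWARPS ? warp_sums[lane] : T(0);
        val = warp_reduce_sum<T, WARP>(val, 0xffffffffu);
    }
    return val;
}

// Test kernel: sums BLOCK input values with a single block.
template <typename T, int BLOCK>
__global__ void sum_kernel(const T* in, T* out) {
    T total = block_reduce_sum<T, BLOCK>(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = total;
}

// Host helper: sums BLOCK ones on the GPU (error checking omitted for brevity).
template <int BLOCK>
float sum_ones_on_gpu() {
    float h_in[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h_in[i] = 1.0f;
    float *d_in = nullptr, *d_out = nullptr, h_out = -1.0f;
    cudaMalloc(&d_in, BLOCK * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, BLOCK * sizeof(float), cudaMemcpyHostToDevice);
    sum_kernel<float, BLOCK><<<1, BLOCK>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With a CUDA 9+ toolkit and a working GPU, sum_ones_on_gpu<64>() is expected to return 64.0f. __shfl_xor_sync has overloads for built-in scalar types such as float and double but not for complex structs, so a unified real/complex version would presumably shuffle the real and imaginary parts separately; that detail is an assumption, not something stated in this commit.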