elpa merge requests
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests
Updated: 2024-03-20T07:35:42Z

Fix print settings
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/172
Updated: 2024-03-20T07:35:42Z
Author: Petr Karpov
- Fix elpa_print_settings for CFLAGS=-D_FORTIFY_SOURCE=2 (as in the OBS GNU installation)
- Move setup_gpu() after setting runtime options in test.c
- Add DeviceSynchronize() after the kernel call in [cuda|hip]_check_device_info_FromC. This fixes a potential failure to catch an error in gpusolver?potrf when info_dev != 0
Assignee: Andreas Marek

Add man page for elpa_setup_gpu
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/171
Updated: 2024-03-14T08:42:40Z
Author: Petr Karpov
Assignee: Andreas Marek

Master pre stage
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/169
Updated: 2024-03-09T07:23:09Z
Author: Andreas Marek

Rccl support
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/168
Updated: 2024-03-02T12:07:40Z
Author: Andreas Marek
Assignee: Andreas Marek

Convert recommendation into possibility
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/167
Updated: 2024-03-04T10:17:37Z
Author: Tobias Melson
Assignee: Andreas Marek

GPU Cholesky optimization, solves #109
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/166
Updated: 2024-03-06T06:09:55Z
Author: Petr Karpov
- Added elpa_gpu_ccl_transpose_vectors in Cholesky-GPU
- Extract memcpy of info outside of cublas?potrf
- Move nccl_group_start out of the loops
- Delete unused vendor_agnostic_layer_template.F90
- Add new cusolverDnXpotrf interface (cusolverDn?potrf is deprecated)
Assignee: Andreas Marek

Master pre stage
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/165
Updated: 2024-02-28T07:07:26Z
Author: Andreas Marek

Add GPU device information to gpu_object
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/164
Updated: 2024-02-23T07:11:26Z
Author: Andreas Marek
- At startup, some GPU device parameters are queried and stored
- For example, the number of SM processors is passed to (some) kernels

Gpu cholesky
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/163
Updated: 2024-02-15T06:18:07Z
Author: Andreas Marek
Assignee: Andreas Marek

Master pre stage
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/162
Updated: 2024-02-07T11:40:13Z
Author: Andreas Marek
Assignee: Andreas Marek

Fix cublas caching for cublasGemv, cublasGemm
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/161
Updated: 2024-02-06T07:21:13Z
Author: Petr Karpov
Fix the problem with cublas caching for cublasGemv, cublasGemm.
The problem was introduced with cublas 11.11.3.6 (https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html):
- Introduced cuBlasLt heuristics cache that stores the mapping of matmul problems to kernels previously selected by heuristics. That helps reduce the host-side overhead for repeating matmul problems. Refer to https://docs.nvidia.com/cuda/cublas/index.html#cublasLt-heuristics-cache.
The problem with caching was resolved by NVIDIA with cublas 12.3.4.1 (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-3-update-1).
For the intermediate cublas versions we have to switch off caching by hand using cublasLtHeuristicsCacheSetCapacity(0).
Assignee: Andreas Marek

Mpi setup object
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/160
Updated: 2024-01-25T11:08:56Z
Author: Andreas Marek
Assignee: Andreas Marek

Async in trans ev
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/159
Updated: 2023-12-21T08:15:53Z
Author: Andreas Marek
Assignee: Andreas Marek

Master pre stage
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/158
Updated: 2023-12-17T09:17:22Z
Author: Andreas Marek
Assignee: Andreas Marek

Fix memory access violation bug in double_complex cuda_store_u_v_in_uv_vu kernel
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/157
Updated: 2023-12-11T15:14:30Z
Author: Petr Karpov
Assignee: Andreas Marek

Master pre stage
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/156
Updated: 2023-12-07T13:06:07Z
Author: Andreas Marek
Assignee: Andreas Marek

elpa1 gpu optimization
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/155
Updated: 2023-12-05T08:44:18Z
Author: Petr Karpov
Fixes for the ELPA1 GPU optimizations concerning the synchronization in dot-product-like kernels. The vulnerability was exposed by SYCL on CPU tests.
Assignee: Andreas Marek

OpenMP Optimizations to Solve routine
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/154
Updated: 2024-01-10T20:49:32Z
Author: Arjun Ramaswami
- OpenMP implementation to solve loops
- MPI ibcast not blocking loop in band_to_full
Assignee: Andreas Marek

optimization_26 nccl: implemented elpa_gpu_reduce_add_vectors, for...
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/153
Updated: 2023-11-24T10:49:09Z
Author: Andreas Marek
optimization_26 nccl: implemented elpa_gpu_reduce_add_vectors, for NCCL-tridiagonalization in ELPA1 everything is on GPU now
Assignee: Andreas Marek

Redistribute and dptr
https://gitlab.mpcdf.mpg.de/elpa/elpa/-/merge_requests/152
Updated: 2023-11-21T18:36:01Z
Author: Andreas Marek
Assignee: Andreas Marek
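
Merge request 161 above describes a workaround that applies only to a window of cublas versions. The following is a minimal sketch of that version check, not ELPA's actual code. The names `parse_version`, `needs_cache_workaround`, `CACHE_INTRODUCED` and `CACHE_FIXED` are hypothetical, and treating the window as inclusive of 11.11.3.6 and exclusive of 12.3.4.1 is one reading of the description. The sketch only decides whether the workaround is needed; the real call, cublasLtHeuristicsCacheSetCapacity(0), belongs to the cuBLASLt C API and is not invoked here.

```python
from typing import Tuple

Version = Tuple[int, int, int, int]

# Version bounds quoted in merge request 161 (assumed: lower bound inclusive,
# upper bound exclusive).
CACHE_INTRODUCED: Version = (11, 11, 3, 6)
CACHE_FIXED: Version = (12, 3, 4, 1)


def parse_version(text: str) -> Version:
    """Parse a dotted four-part version string such as '12.1.0.26'."""
    parts = tuple(int(field) for field in text.split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four version fields, got {text!r}")
    return parts


def needs_cache_workaround(version: Version) -> bool:
    """Return True if the version lies in the affected window.

    Tuples compare lexicographically, which matches the usual ordering of
    dotted version numbers with purely numeric fields.
    """
    return CACHE_INTRODUCED <= version < CACHE_FIXED


if __name__ == "__main__":
    # Example version strings chosen for illustration only.
    for text in ("11.10.3.66", "11.11.3.6", "12.1.0.26", "12.3.4.1"):
        if needs_cache_workaround(parse_version(text)):
            # A CUDA build would call cublasLtHeuristicsCacheSetCapacity(0) here.
            print(text, "-> disable the cuBLASLt heuristics cache")
        else:
            print(text, "-> keep the default cache")
```

Comparing tuples rather than strings avoids the pitfall that "11.9" sorts after "11.11" as text.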