Andreas Marek · 4680e675
--- a/INSTALL.md
+++ b/INSTALL.md
-# Installation guide for the *ELPA* library#
+# Installation guide for the *ELPA* library #

 ## Preamble ##

-This file provides documentation on how to build the *ELPA* library in **version ELPA-2020.11.001**.
+This file provides documentation on how to build the *ELPA* library in **version ELPA-2021.05.001.rc1**.
 With the release of **version ELPA-2017.05.001** the build process has been significantly simplified,
 which makes it easier to install the *ELPA* library.

@@ -10,7 +10,7 @@ The release ELPA 2018.11.001 was the last release, where the legacy API has been
 enabled by default (and can be disabled at build time).
 With the release of ELPA 2019.11.001, the legacy API was deprecated and support for it ended.

-The release of ELPA 2020.11.001 does change the API and ABI compared to the release 2019.11.001, since
+The release of ELPA 2021.05.001.rc1 changes the API and ABI compared to release 2019.11.001, since
 the legacy API has been dropped.

 ## How to install *ELPA* ##
@@ -27,7 +27,7 @@ autotools procedure. This is the **only supported way** how to build and install


 If you obtained *ELPA* from the official git repository, you will not find
-the needed configure script! You will have to create the configure script with autoconf.
+the needed configure script! You will have to create the configure script with autoconf. Alternatively, run the `autogen.sh` script, which performs this step for you.
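
The step above might be scripted as follows. This is a minimal sketch, assuming the given path (default: the current directory) is the top of an *ELPA* git checkout that ships `autogen.sh`; the function name `ensure_configure` is hypothetical and not part of *ELPA*.

```shell
# Hypothetical helper (not part of ELPA): create the configure script
# only when it is missing, as is the case for a fresh git checkout.
ensure_configure() {
    srcdir="${1:-.}"    # top-level source directory, defaults to "."
    if [ -f "$srcdir/configure" ]; then
        echo "configure already present in $srcdir"
    else
        # autogen.sh is expected to run the autotools and create configure
        (cd "$srcdir" && ./autogen.sh)
    fi
}
```

Calling `ensure_configure path/to/elpa` before `./configure` leaves a source tree that already has a configure script untouched and only invokes `autogen.sh` otherwise.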


 ## (A): Installing *ELPA* as library with configure ##
@@ -62,7 +62,10 @@ An excerpt of the most important (*ELPA* specific) options reads as follows:
 |  `--enable-sve128`                     | Experimental feature build ARM SVE128 kernels, default: disabled               |
 |  `--enable-sve256`                     | Experimental feature build ARM SVE256 kernels, default: disabled               |
 |  `--enable-sve512`                     | Experimental feature build ARM SVE512 kernels, default: disabled               |
-|  `--enable-gpu`                        | build GPU kernels, default: disabled                  |
+|  `--enable-nvidia-gpu`                 | build NVIDIA GPU kernels, default: disabled           |
+|  `--enable-gpu`                        | same as `--enable-nvidia-gpu`                         |
+|  `--enable-amd-gpu`                    | EXPERIMENTAL: build AMD GPU kernels, default: disabled |
+|  `--enable-intel-gpu`                  | VERY EXPERIMENTAL: build Intel GPU kernels, default: disabled |
 |  `--enable-bgp`                        | build BGP kernels, default: disabled                  |
 |  `--enable-bgq`                        | build BGQ kernels, default: disabled                  |
 |  `--with-mpi=[yes|no]`                 | compile with MPI. Default: yes                        |
@@ -71,7 +74,9 @@ An excerpt of the most important (*ELPA* specific) options reads as follows:
 |  `--with-GPU-compute-capability=VALUE` | use compute capability VALUE for GPU version, <br> default: "sm_35" |
 |  `--with-fixed-real-kernel=KERNEL`     | compile with only a single specific real kernel.      |
 |  `--with-fixed-complex-kernel=KERNEL`  | compile with only a single specific complex kernel.   |
-|  `--with-gpu-support-only`             | Compile and always use the GPU version                |
+|  `--with-nvidia-gpu-support-only`      | Compile and always use the NVIDIA GPU version         |
+|  `--with-amd-gpu-support-only`         | EXPERIMENTAL: Compile and always use the AMD GPU version |
+|  `--with-intel-gpu-support-only`       | EXPERIMENTAL: Compile and always use the Intel GPU version |
 |  `--with-likwid=[yes|no|PATH]`         | use the likwid tool to measure performance (has a performance impact!), default: no |
 |  `--with-default-real-kernel=KERNEL`   | set the real kernel KERNEL as default                 |
 |  `--with-default-complex-kernel=KERNEL`| set the complex kernel KERNEL as default              |
@@ -384,7 +389,7 @@ Remarks:
 FC=mpi_wrapper_for_gnu_Fortran_compiler CC=mpi_wrapper_for_gnu_C_compiler ./configure FCFLAGS="-O3 -march=native -mavx2 -mfma" CFLAGS="-O3 -march=native -mavx2 -mfma  -funsafe-loop-optimizations -funsafe-math-optimizations -ftree-vect-loop-version -ftree-vectorize" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64"
 ```

-2. Building with Intel Fortran compiler and Intel C compiler:
+3. Building with Intel Fortran compiler and Intel C compiler:

 Remarks:
  - you have to know the name of the Intel Fortran compiler wrapper
@@ -392,13 +397,117 @@ Remarks:
  - you should specify compiler flags for Intel Fortran compiler; in the example only "-O3 -xAVX2" is set
  - you should be careful with the CFLAGS, the example shows typical flags

+```
 FC=mpi_wrapper_for_intel_Fortran_compiler CC=mpi_wrapper_for_intel_C_compiler ./configure FCFLAGS="-O3 -xAVX2" CFLAGS="-O3 -xAVX2" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64"
+```
+
+#### Intel cores supporting AVX-512 (Skylake and newer) ####
+
+We recommend that you build ELPA with the Intel compiler (if available) for the Fortran part, but
+with the GNU compiler for the C part.
+
+1. Building with Intel Fortran compiler and GNU C compiler:
+
+Remarks:
+  - you have to know the name of the Intel Fortran compiler wrapper
+  - you do not have to specify a C compiler (with CC); GNU C compiler is recognized automatically
+  - you should specify compiler flags for Intel Fortran compiler; in the example only `-O3 -xCORE-AVX512` is set
+  - you should be careful with the CFLAGS, the example shows typical flags
+
+```
+FC=mpi_wrapper_for_intel_Fortran_compiler CC=mpi_wrapper_for_gnu_C_compiler ./configure FCFLAGS="-O3 -xCORE-AVX512" CFLAGS="-O3 -march=skylake-avx512 -mfma -funsafe-loop-optimizations -funsafe-math-optimizations -ftree-vect-loop-version -ftree-vectorize" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64" --enable-avx2 --enable-avx512
+```
+
+2. Building with GNU Fortran compiler and GNU C compiler:
+
+Remarks:
+  - you have to know the name of the GNU Fortran compiler wrapper
+  - you do not have to specify a C compiler (with CC); the GNU C compiler is recognized automatically
+  - you should specify compiler flags for GNU Fortran compiler; in the example only `-O3 -march=skylake-avx512 -mfma` is set
+  - you should be careful with the CFLAGS, the example shows typical flags
+
+```
+FC=mpi_wrapper_for_gnu_Fortran_compiler CC=mpi_wrapper_for_gnu_C_compiler ./configure FCFLAGS="-O3 -march=skylake-avx512 -mfma" CFLAGS="-O3 -march=skylake-avx512 -mfma  -funsafe-loop-optimizations -funsafe-math-optimizations -ftree-vect-loop-version -ftree-vectorize" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64" --enable-avx2 --enable-avx512
+```
+
+3. Building with Intel Fortran compiler and Intel C compiler:
+
+Remarks:
+  - you have to know the name of the Intel Fortran compiler wrapper
+  - you have to specify the Intel C compiler
+  - you should specify compiler flags for Intel Fortran compiler; in the example only `-O3 -xCORE-AVX512` is set
+  - you should be careful with the CFLAGS, the example shows typical flags
+
+```
+FC=mpi_wrapper_for_intel_Fortran_compiler CC=mpi_wrapper_for_intel_C_compiler ./configure FCFLAGS="-O3 -xCORE-AVX512" CFLAGS="-O3 -xCORE-AVX512" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64" --enable-avx2 --enable-avx512
+```
+
+
+#### Building for NVIDIA A100 GPUs (and Intel Ice Lake CPUs) ####

+For GPU builds of ELPA you must use a GNU compiler for the C part. The Fortran part can be compiled with any compiler, for example the Intel Fortran compiler.

+1. Building with Intel Fortran compiler and GNU C compiler:

+Remarks:
+  - you have to know the name of the Intel Fortran compiler wrapper
+  - you do not have to specify a C compiler (with CC); GNU C compiler is recognized automatically
+  - you should specify compiler flags for Intel Fortran compiler; in the example only `-O3 -xCORE-AVX512` is set
+  - you should be careful with the CFLAGS, the example shows typical flags

+```
+FC=mpi_wrapper_for_intel_Fortran_compiler CC=mpi_wrapper_for_gnu_C_compiler ./configure FCFLAGS="-O3 -xCORE-AVX512" CFLAGS="-O3 -march=skylake-avx512 -mfma -funsafe-loop-optimizations -funsafe-math-optimizations -ftree-vect-loop-version -ftree-vectorize" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64" --enable-avx2 --enable-avx512 --enable-nvidia-gpu --with-cuda-path=PATH_TO_YOUR_CUDA_INSTALLATION --with-NVIDIA-GPU-compute-capability=sm_80
+```

+2. Building with GNU Fortran compiler and GNU C compiler:

+Remarks:
+  - you have to know the name of the GNU Fortran compiler wrapper
+  - you do not have to specify a C compiler (with CC); the GNU C compiler is recognized automatically
+  - you should specify compiler flags for GNU Fortran compiler; in the example only `-O3 -march=skylake-avx512 -mfma` is set
+  - you should be careful with the CFLAGS, the example shows typical flags
+
+```
+FC=mpi_wrapper_for_gnu_Fortran_compiler CC=mpi_wrapper_for_gnu_C_compiler ./configure FCFLAGS="-O3 -march=skylake-avx512 -mfma" CFLAGS="-O3 -march=skylake-avx512 -mfma  -funsafe-loop-optimizations -funsafe-math-optimizations -ftree-vect-loop-version -ftree-vectorize" --enable-option-checking=fatal SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread " SCALAPACK_FCFLAGS="-I$MKL_HOME/include/intel64/lp64" --enable-avx2 --enable-avx512 --enable-nvidia-gpu --with-cuda-path=PATH_TO_YOUR_CUDA_INSTALLATION --with-NVIDIA-GPU-compute-capability=sm_80
+```
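
The value `sm_80` used in the examples above is the CUDA compute capability 8.0 of the A100 written in `sm_XY` form. A minimal sketch of that conversion, assuming the capability is given as `MAJOR.MINOR`; the function name `cc_to_sm` is hypothetical and not part of *ELPA*.

```shell
# Hypothetical helper (not part of ELPA): turn a compute capability such
# as "8.0" into the "sm_80" form expected by the configure option.
cc_to_sm() {
    # remove the dot (8.0 -> 80) and prepend "sm_"
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}
```

For example, `cc_to_sm 8.0` prints `sm_80`. If the `nvidia-smi` of your driver supports the `compute_cap` query field (an assumption; older drivers do not), the input could come from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.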

+#### Building for IBM SUMMIT HPC system ####

+For more information please have a look at the [ELSI wiki](https://git.elsi-interchange.org/elsi-devel/elsi-interface/-/wikis/install-elpa).
+
+1. Building with GNU Fortran compiler and GNU C compiler:
+
+```
+FC=mpif90 CC=mpicc ./configure --prefix=$(pwd) FCFLAGS="-O2 -mcpu=power9" CFLAGS="-O2 -mcpu=power9" CPP="cpp -E" LDFLAGS="-L${OLCF_NETLIB_SCALAPACK_ROOT}/lib -lscalapack -L${OLCF_ESSL_ROOT}/lib64 -lessl -L${OLCF_NETLIB_LAPACK_ROOT}/lib64 -llapack" --enable-gpu --with-cuda-path=${OLCF_CUDA_ROOT} --with-GPU-compute-capability=sm_70 --disable-sse-assembly --disable-sse --disable-avx --disable-avx2 --disable-avx512 --enable-c-tests=no
+```

+
+2. Building with PGI Fortran compiler and PGI C compiler:
+
+```
+FC=mpif90 CC=mpicc ./configure --prefix=$(pwd) FCFLAGS="-fast -tp=pwr9" CFLAGS="-fast -tp=pwr9" CPP="cpp -E" LDFLAGS="-L${OLCF_NETLIB_SCALAPACK_ROOT}/lib -lscalapack -L${OLCF_ESSL_ROOT}/lib64 -lessl -L${OLCF_NETLIB_LAPACK_ROOT}/lib64 -llapack" --enable-gpu --with-cuda-path=${OLCF_CUDA_ROOT} --with-GPU-compute-capability=sm_70 --disable-sse-assembly --disable-sse --disable-avx --disable-avx2 --disable-avx512 --enable-c-tests=no
+```
+
+3. Building with IBM Fortran compiler and IBM C compiler:
+
+```
+FC=mpixlf CC=mpixlc ../configure --prefix=$(pwd) FCFLAGS="-O2 -qarch=pwr9 -qstrict -WF,-qfpp=linecont" CFLAGS="-O2 -qarch=pwr9 -qstrict" CPP="cpp -E" LDFLAGS="-L${OLCF_NETLIB_SCALAPACK_ROOT}/lib -lscalapack -L${OLCF_ESSL_ROOT}/lib64 -lessl -L${OLCF_NETLIB_LAPACK_ROOT}/lib64 -llapack" --enable-gpu --with-cuda-path=${OLCF_CUDA_ROOT} --with-GPU-compute-capability=sm_70 --disable-sse-assembly --disable-sse --disable-avx --disable-avx2 --disable-avx512 --enable-c-tests=no
+```
+
+
+#### EXPERIMENTAL: Building for AMD GPUs (currently tested only with `--with-mpi=0`) ####
+
+
+In order to build *ELPA* for AMD GPUs, please ensure that you have a working installation of HIP, ROCm, BLAS, and LAPACK.
+
+```
+./configure CXX=hipcc CXXFLAGS="-I/opt/rocm-4.0.0/hip/include/ -I/opt/rocm-4.0.0/rocblas/include -g" CC=hipcc CFLAGS="-I/opt/rocm-4.0.0/hip/include/ -I/opt/rocm-4.0.0/rocblas/include -g" LIBS="-L/opt/rocm-4.0.0/rocblas/lib" --enable-option-checking=fatal --with-mpi=0 FC=gfortran FCFLAGS="-g -LPATH_TO_YOUR_LAPACK_INSTALLATION -lopenblas -llapack" --disable-sse --disable-sse-assembly --disable-avx --disable-avx2 --disable-avx512 --enable-AMD-gpu --enable-single-precision
+```
+
+#### Problems when building with clang-12.0 ####
+The libtool tool adds some flags to the compiler commands (to be used for linking by ld) which are not known
+by the clang-12 compiler. One way to solve this issue is to run the following commands directly after the configure step:
+```
+sed -i 's/\\$wl-soname \\$wl\\$soname/-fuse-ld=ld -Wl,-soname,\\$soname/g' libtool
+sed -i 's/\\$wl--whole-archive\\$convenience \\$wl--no-whole-archive//g' libtool
+```