This file is intended as a guideline for choosing an appropriate
ELPA2 kernel for your installation.

ELPA generally uses BLAS routines for all compute-intensive work,
so the performance of ELPA mainly depends on the quality of
the BLAS implementation used when linking.

The only exception is the backtransformation of the eigenvectors
for the 2-stage solver (ELPA2). In this case BLAS routines cannot
be used effectively due to the nature of the problem.

The compute-intensive part of the backtransformation of ELPA2
has been put into a file of its own (elpa2_kernels.f90) so that
it can be replaced by hand-tailored, optimized code for
specific platforms.

However, we cannot choose the best kernel for you. Read these hints
and, if in doubt, test which kernel works best on your system.

Currently we offer the following alternatives for the ELPA2 kernels:

* elpa2_kernels_{real|complex}.f90       

                             - The generic Fortran version of the ELPA2 kernels,
                               which should be usable on every platform.
                               It contains some hand optimizations (loop unrolling)
                               in the hope of getting optimal code from most Fortran
                               compilers. The configure option "--with-generic"
                               uses these kernels. They are probably a good
                               default if you do not know which kernel
                               to use. Note that in the real version
                               a complex variable is used in
                               order to enforce better compiler
                               optimizations. This produces correct
                               code; however, some compilers might
                               produce a warning.



* elpa2_kernels_{real|complex}_simple.f90  
                           
                             - Plain and simple version of elpa2_kernels.f90.
                               Please note that we observed that some compilers
                               get confused by the hand optimizations done in
                               elpa2_kernels_{real|complex}.f90 and
                               give better performance with this
                               version, so it is worth trying both!
                               The configure option "--with-generic-simple"
                               uses these kernels. 

* elpa2_kernels_real_bgp.f90 
                             - Fortran code enhanced with assembler calls
                               for the IBM BlueGene/P. For the complex 
                               eigenvalue problem the "elpa2_kernels_complex.f90"
                               is recommended. The configure option
                               "--with-generic-bgp" uses these
                               kernels.

* elpa2_kernels_real_bgq.f90 
                             - Fortran code enhanced with assembler calls
                               for the IBM BlueGene/Q. For the complex 
                               eigenvalue problem the "elpa2_kernels_complex.f90"
                               is recommended. The configure option
                               "--with-generic-bgq" uses these
                               kernels.

* elpa2_kernels_asm_x86_64.s
                             - Fortran code enhanced with assembler
                               for SSE vectorization. The configure option
                               "--with-sse-assembler" uses these kernels.
                               They are worth trying on x86_64 without AVX,
                               e.g. Intel Nehalem.

 

* elpa2_kernels_{real|complex}_sse-avx_*.c(pp)
                             - Several files with optimized intrinsic code
                               for x86_64 systems (i.e. Intel/AMD architecture)
                               using SSE2/SSE3 operations.
                               (Use gcc for compiling, as the Intel
                               compiler generates slower code!)

                               Note that you have to pass the following
                               flags to configure
                               CFLAGS="-O3 -mavx -funsafe-loop-optimizations \
                               -funsafe-math-optimizations -ftree-vect-loop-version \
                               -ftree-vectorize"
                               and
                               CXXFLAGS="-O3 -mavx -funsafe-loop-optimizations \
                               -funsafe-math-optimizations -ftree-vect-loop-version \
                               -ftree-vectorize"
                               for best performance results.

                               For convenience the flag
                               "--with-avx-optimization" sets these
                               CFLAGS and CXXFLAGS automatically.

                               On Intel Sandy Bridge architectures the
                               configure option "--with-intel-sandybride"
                               uses the best combination.

                               On AMD Bulldozer architectures the
                               configure option "--with-amd-bulldozer"
                               uses the best combination.

                               Otherwise, you can try out your own
                               combinations with the configure options
                               "--with-avx-complex-block{1|2}" and
                               "--with-avx-real-block{2|4|6}".

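As a rough sketch, the configure options listed above might be used as
follows. This assumes the usual "./configure" workflow run from the ELPA
source directory. The spelled-out block options (e.g. "--with-avx-real-block4")
are our reading of the "{2|4|6}" notation, and combining them with
"--with-avx-optimization" is an assumption that this file does not confirm.

```shell
# Pick ONE of the following invocations.

# Generic Fortran kernels (a reasonable default):
./configure --with-generic

# Plain Fortran kernels without the hand optimizations:
./configure --with-generic-simple

# AVX kernels built with gcc; configure sets the CFLAGS/CXXFLAGS shown above:
./configure CC=gcc CXX=g++ --with-avx-optimization

# Assumption: block options spelled out from the {..|..} notation and
# combined with --with-avx-optimization.
./configure CC=gcc CXX=g++ --with-avx-optimization \
    --with-avx-real-block4 --with-avx-complex-block2
```

Check "./configure --help" of your ELPA version for the exact option names
before relying on any of these lines.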



So which version should be used?
================================

* On the IBM BlueGene/P and BlueGene/Q
  you should get the optimal performance using the assembler-enhanced versions
  elpa2_kernels_real_bgp.f90 and elpa2_kernels_real_bgq.f90, respectively.
  For the complex eigenvalue problem, elpa2_kernels_complex.f90 is recommended.
  

* On x86_64 systems (i.e. almost all Intel/AMD systems) you should get
  the optimal performance using the optimized intrinsics/assembler versions
  in elpa2_kernels_{real|complex}_sse-avx_*.c(pp) or elpa2_kernels_asm_x86_64.s,
  respectively. However, here you have quite a few candidates to try
  when looking for your optimal kernel.

* If you don't compile for one of these systems, or you don't want to use assembler
  for any reason, you are likely best off using elpa2_kernels_{real|complex}.f90.
  However, run a performance test with elpa2_kernels_{real|complex}_simple.f90
  to check whether your compiler gets confused by the hand optimizations.

* If you want to develop your own optimized kernels for your platform, it is
  easier to start with elpa2_kernels_{real|complex}_simple.f90.
  Don't be confused by the huge code in elpa2_kernels_{real|complex}.f90; the
  mathematics done in the kernels is relatively trivial.
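
The performance test suggested above could be scripted roughly as follows.
This is only a sketch under assumptions that this file does not confirm:
that your ELPA version supports building in a separate directory, and that
you provide a test program yourself (BENCH_CMD, with the placeholder name
"./my_elpa2_test"). With DRY_RUN=1, the default, the script only prints what
it would do.

```shell
#!/bin/sh
# Sketch: build ELPA once per candidate kernel option and time one test run.
# SRC_DIR, BENCH_CMD and DRY_RUN are names introduced for this example only.

SRC_DIR="${SRC_DIR:-$PWD}"                  # directory containing configure
BENCH_CMD="${BENCH_CMD:-./my_elpa2_test}"   # hypothetical test program
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

try_kernel() {
  opt="$1"
  build_dir="build-${opt#--with-}"          # e.g. build-generic-simple
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run in $build_dir: $SRC_DIR/configure $opt && make && $BENCH_CMD"
    return 0
  fi
  mkdir -p "$build_dir" || return 1
  (
    cd "$build_dir" || exit 1
    "$SRC_DIR/configure" "$opt" && make || exit 1
    start=$(date +%s)                       # whole seconds only
    $BENCH_CMD || exit 1
    end=$(date +%s)
    echo "$opt: $((end - start)) s"
  )
}

# Compare the two generic kernels; add other options as needed.
for opt in --with-generic --with-generic-simple; do
  try_kernel "$opt"
done
```

Run it with DRY_RUN=0 once the printed commands look right for your setup,
and use a problem size large enough that whole-second timing is meaningful.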