This file is intended as a guideline for choosing an appropriate
ELPA2 kernel for your installation.

ELPA generally uses BLAS routines for all compute-intensive work,
so that the performance of ELPA mainly depends on the quality of
the BLAS implementation used when linking.

The only exception is the backtransformation of the eigenvectors
for the 2-stage solver (ELPA2). In this case BLAS routines cannot
be used effectively due to the nature of the problem.

The compute-intensive part of the backtransformation of ELPA2
has been put into a file of its own (elpa2_kernels.f90) so that
it can be replaced by hand-tailored, optimized code for
specific platforms.

However, we cannot choose the best kernel for you. You should read
these hints and, if possible, test which kernel works best for you.

Currently we offer the following alternatives for the ELPA2 kernels:

* elpa2_kernels_{real|complex}.f90       

                             - The generic Fortran version of the ELPA2 kernels,
                               which should be usable on every platform.
                               It contains some hand optimizations (loop unrolling)
                               in the hope of getting optimal code from most Fortran
                               compilers. The configure option "--with-generic"
                               uses these kernels. They are probably a good
                               default if you do not know which kernel
                               to use. Note that the real version uses
                               a complex variable in order to enforce
                               better compiler optimizations. This
                               produces correct code; however, some
                               compilers might produce a warning.



* elpa2_kernels_{real|complex}_simple.f90  
                           
                             - Plain and simple version of
                               elpa2_kernels_{real|complex}.f90.
                               Please note that we observed that some compilers
                               get confused by the hand optimizations done in
                               elpa2_kernels_{real|complex}.f90 and
                               give better performance with this
                               version, so it is worth trying both!
                               The configure option "--with-generic-simple"
                               uses these kernels.

* elpa2_kernels_real_bgp.f90 
                             - Fortran code enhanced with assembler calls
                               for the IBM BlueGene/P. For the complex 
                               eigenvalue problem the "elpa2_kernels_complex.f90"
                               is recommended. The configure option 
                               "--with-generic-bgp" uses these
                               kernels. Note that the OpenMP functionality of
                               this kernel is not yet tested and thus a
                               preprocessor error is thrown when this kernel
                               is combined with OpenMP. By manually editing
                               the file src/elpa2.F90 one can avoid this and
                               test the OpenMP functionality. The ELPA
                               developers would welcome any feedback on this
                               subject.

* elpa2_kernels_real_bgq.f90 
                             - Fortran code enhanced with assembler calls
                               for the IBM BlueGene/Q. For the complex 
                               eigenvalue problem the "elpa2_kernels_complex.f90"
                               is recommended. The configure option 
                               "--with-generic-bgq" uses these
                               kernels. Note that the OpenMP functionality of
                               this kernel is not yet tested and thus a
                               preprocessor error is thrown when this kernel
                               is combined with OpenMP. By manually editing
                               the file src/elpa2.F90 one can avoid this and
                               test the OpenMP functionality. The ELPA
                               developers would welcome any feedback on this
                               subject.

* elpa2_kernels_asm_x86_64.s
                             - Fortran code enhanced with assembler
                               for SSE vectorization. The configure option
                               "--with-sse-assembler" uses these kernels.
                               They are worth trying on x86_64 without AVX,
                               e.g. Intel Nehalem.

 

Several files match the following pattern:

* elpa2_kernels_{real|complex}_sse-avx_*.c(pp)     
                             - Optimized intrinsic code for x86_64
                               systems (i.e. Intel/AMD architecture)
                               using SSE2/SSE3 operations.
                               (Use gcc for compiling, as the Intel
                               compiler generates slower code!)

                               Note that you have to pass the following
                               flags to configure
                               CFLAGS="-O3 -mavx -funsafe-loop-optimizations \
                               -funsafe-math-optimizations -ftree-vect-loop-version \
                               -ftree-vectorize"
                               and
                               CXXFLAGS="-O3 -mavx -funsafe-loop-optimizations \
                               -funsafe-math-optimizations -ftree-vect-loop-version \
                               -ftree-vectorize"
                               for best performance results.

                               For convenience the flag
                               "--with-avx-optimization" sets these
                               CFLAGS and CXXFLAGS automatically.

                               On Intel Sandy Bridge architectures the
                               configure option "--with-intel-sandybride"
                               uses the best combination.

                               On AMD Bulldozer architectures the
                               configure option "--with-amd-bulldozer"
                               uses the best combination.

                               Otherwise, you can try out your own
                               combinations with the configure options
                               "--with-avx-complex-block{1|2}" and
                               "--with-avx-real-block{2|4|6}".




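As a rough illustration of how the configure options named above might be
compared, the following shell sketch prints one configure line per kernel.
The helper name configure_option_for and the short labels are made up for
this example and are not part of ELPA; only the option strings are taken
from this file. The lines are printed, not executed, so any further
arguments your installation needs must be added by hand.

```shell
#!/bin/sh
# Sketch only: configure_option_for is a hypothetical helper, not part of ELPA.
# It maps a short label (invented for this example) to the configure option
# that this file names for the corresponding kernel.
configure_option_for() {
    case "$1" in
        generic)        echo "--with-generic" ;;
        generic-simple) echo "--with-generic-simple" ;;
        bgp)            echo "--with-generic-bgp" ;;
        bgq)            echo "--with-generic-bgq" ;;
        sse-assembler)  echo "--with-sse-assembler" ;;
        *)              echo "unknown kernel label: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) a configure line for each kernel worth comparing.
for label in generic generic-simple; do
    echo "./configure $(configure_option_for "$label")"
done
```

Building once per printed line and timing the same test problem with each
build is one simple way to find out which kernel your compiler handles best.
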
So which version should be used?
================================

* On the IBM BlueGene/P and BlueGene/Q you should get the optimal
  performance using the optimized assembler-enhanced versions
  elpa2_kernels_real_bgp.f90 and elpa2_kernels_real_bgq.f90, respectively.
  For the complex eigenvalue problem, elpa2_kernels_complex.f90 is
  recommended.
  

* On x86_64 systems (i.e. almost all Intel/AMD systems) you should get
  the optimal performance using the optimized intrinsics/assembler versions
  in elpa2_kernels_{real|complex}_sse-avx_*.c(pp) or
  elpa2_kernels_asm_x86_64.s. However, you have quite a few choices here,
  so some testing is needed to find your optimal kernel.

* If you are not compiling for one of these systems, or you do not want to
  use assembler for any reason, you are likely best off using
  elpa2_kernels_{real|complex}.f90. However, run a performance test with
  elpa2_kernels_{real|complex}_simple.f90 to check whether your compiler
  gets confused by the hand optimizations.

* If you want to develop your own optimized kernels for your platform, it is
  easier to start with elpa2_kernels_{real|complex}_simple.f90.
  Do not let the large amount of code in elpa2_kernels_{real|complex}.f90
  confuse you; the mathematics done in the kernels is relatively trivial.
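
The hints above can be condensed into a rule of thumb for x86_64 and
unknown platforms. The following shell sketch is only that: the helper name
suggest_kernel_file is hypothetical, and the decision rule (AVX present,
else SSE2 present, else generic) is a simplification assumed for this
example. It takes a space-separated list of CPU feature flags and prints
the kernel file pattern from this file that is a reasonable first candidate.

```shell
#!/bin/sh
# Sketch only: suggest_kernel_file is a hypothetical helper, not part of ELPA.
# Argument: a space-separated list of CPU feature flags, for example the
# contents of the "flags" line of /proc/cpuinfo on Linux.
suggest_kernel_file() {
    case " $1 " in
        # AVX available: the intrinsic kernels are the first candidate.
        *" avx "*)  echo "elpa2_kernels_{real|complex}_sse-avx_*.c(pp)" ;;
        # x86_64 without AVX: the SSE assembler kernel is worth trying.
        *" sse2 "*) echo "elpa2_kernels_asm_x86_64.s" ;;
        # Anything else: fall back to the generic Fortran kernels.
        *)          echo "elpa2_kernels_{real|complex}.f90" ;;
    esac
}

# Demo with a literal flag list, so the sketch runs on any system.
suggest_kernel_file "fpu sse sse2 ssse3"
```

On Linux the real flag list could be passed in with something like
suggest_kernel_file "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)".
The result is only a starting point; a performance test on your own
system remains the deciding factor.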