ELPA generally uses BLAS routines for all compute-intensive work,
so the performance of ELPA depends mainly on the quality of
the BLAS implementation it is linked against.

The only exception is the backtransformation of the eigenvectors
for the 2-stage solver (ELPA2). In this case BLAS routines cannot
be used effectively due to the nature of the problem.

The compute-intensive part of the backtransformation of ELPA2
has been put into a file of its own (elpa2_kernels.f90) so that
it can be replaced by hand-tailored, optimized code for
specific platforms.

Currently we offer the following alternatives for the ELPA2 kernels:

* elpa2_kernels.f90          - The generic Fortran version of the ELPA2 kernels,
                               which should be usable on every platform.
                               It contains some hand optimizations (loop unrolling)
                               in the hope of getting optimal code from most Fortran
                               compilers.

* elpa2_kernels_simple.f90   - Plain and simple version of elpa2_kernels.f90.
                               Please note that we observed that some compilers
                               get confused by the hand optimizations done in
                               elpa2_kernels.f90 and give better performance
                               with this version - so it is worth trying both!

* elpa2_kernels_bg.f90       - Fortran code enhanced with assembler calls
                               for the IBM BlueGene/P

* elpa2_tum_kernels_*.c      - Optimized intrinsics code for x86_64
                               systems (i.e. Intel/AMD architecture)
                               using SSE2/SSE3 operations.
                               (Use gcc for compiling, as the Intel compiler generates slower code!)


So which version should be used?
================================

* On x86_64 systems (i.e. almost all Intel/AMD systems) or on the IBM BlueGene/P
  you should get the optimal performance using the optimized intrinsics/assembler versions
  in elpa2_tum_kernels_*.c or elpa2_kernels_bg.f90 respectively.

* If you don't compile for one of these systems or you prefer not to use assembler
  for any reason, it is likely that you are best off using elpa2_kernels.f90.
  Run a performance test with elpa2_kernels_simple.f90, however, to check whether
  your compiler gets confused by the hand optimizations.

* If you want to develop your own optimized kernels for your platform, it is
  easier to start with elpa2_kernels_simple.f90.
  Don't let the sheer amount of code in elpa2_kernels.f90 confuse you; the mathematics
  done in the kernels is relatively trivial.
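
To give an impression of how little mathematics is involved, here is a minimal
sketch in C. It assumes that the core operation of a backtransformation kernel
is the application of a Householder reflector H = I - tau * v * v^T to a block
of vectors. The function name apply_householder, its argument list and the
column-major layout are hypothetical and chosen for illustration only; they are
not the interface of the ELPA2 kernels, and no blocking or unrolling is done.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of ELPA: apply H = I - tau * v * v^T
 * to each of the ncols columns of q.
 * q is stored column-major with leading dimension ldq (ldq >= n),
 * v has length n. */
void apply_householder(double *q, size_t ldq, size_t n, size_t ncols,
                       const double *v, double tau)
{
    for (size_t j = 0; j < ncols; ++j) {
        double *col = q + j * ldq;

        /* dot = v^T * q(:,j) */
        double dot = 0.0;
        for (size_t i = 0; i < n; ++i)
            dot += v[i] * col[i];

        /* q(:,j) = q(:,j) - tau * dot * v */
        double scale = tau * dot;
        for (size_t i = 0; i < n; ++i)
            col[i] -= scale * v[i];
    }
}
```

An optimized kernel performs the same arithmetic but typically handles several
columns and more than one Householder vector per pass to make better use of
registers and cache; how far this pays off depends on the compiler and platform.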