Added some more wrappers for the cuBLAS functions

The GPU initialization is actually quite costly, e.g. on Minsky it takes roughly 0.7s. This hurts performance for small matrices. A check has therefore been added, so that the GPU should now be initialized only the first time.
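The "initialize only the first time" check described above can be sketched as a simple guard flag. This is a minimal illustration in C, not ELPA's actual code: the names ensure_gpu_initialized, expensive_gpu_init and gpu_init_call_count are hypothetical, and the stand-in init function only counts calls instead of touching a device.

```c
#include <stdbool.h>

/* All names here are hypothetical; this is not taken from the ELPA sources. */

static bool gpu_initialized = false;  /* set once the first init has succeeded */
static int  gpu_init_calls  = 0;      /* counts how often the costly path runs */

/* Stand-in for the expensive device/cuBLAS setup (assumption: it can fail,
   so it reports success as a bool). */
static bool expensive_gpu_init(void)
{
    gpu_init_calls++;
    return true;  /* pretend the setup succeeded */
}

/* Returns true if the GPU is ready. The costly setup runs only on the
   first call (or again if a previous attempt failed). */
bool ensure_gpu_initialized(void)
{
    if (!gpu_initialized) {
        gpu_initialized = expensive_gpu_init();
    }
    return gpu_initialized;
}

/* Exposed for illustration only: how many times the costly init ran. */
int gpu_init_call_count(void)
{
    return gpu_init_calls;
}
```

Every solver entry point would call ensure_gpu_initialized() before using the device, so repeated solves of small matrices pay the setup cost once. The sketch is not thread-safe; a concurrent caller would need a lock or a once-primitive around the flag.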

This closes issue #48

Conflicts:
	src/elpa1_auxiliary.F90
	src/elpa1_tridiag_real_template.X90

If the configure option "enablesingleprecision" is specified, the ELPA library will also be built for single precision usage. The double precision and single precision versions will be available at the same time, with names such as "solve_evp_real_1stage_double" and "solve_evp_real_1stage_single".

This change implied some major refactoring of the ELPA code:

1.) Functions/procedures had to be renamed with the suffix "_double".
2.) If necessary, the same functions have to be available with the suffix "_single".
3.) Variable kind definitions have to be consistent with the intended use.

To avoid unnecessary code duplication, this is done (most of the time) with preprocessor string substitution.

The documentation has been updated.

NOT SUPPORTED at the moment are:
- single precision usage of ELPA2 with kernels other than "generic" and "generic_simple"
- single precision usage of GPU
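The preprocessor string substitution mentioned above can be illustrated with a small sketch. It is written in C for brevity, whereas ELPA applies the idea to preprocessed Fortran sources; the routine sum_of_squares and the macro names are hypothetical and only show how a single template body can yield a "_double" and a "_single" variant.

```c
/* Two-level macro so that arguments are fully expanded before pasting. */
#define CONCAT_(a, b) a##b
#define CONCAT(a, b)  CONCAT_(a, b)

/* Template body, written once. REAL is the element type and SUFFIX is
   appended to the function name. In a larger code base this body would
   typically live in its own file that is included once per precision. */
#define DEFINE_SUM_OF_SQUARES(REAL, SUFFIX)                      \
    REAL CONCAT(sum_of_squares, SUFFIX)(const REAL *x, int n)    \
    {                                                            \
        REAL acc = 0;                                            \
        for (int i = 0; i < n; ++i) {                            \
            acc += x[i] * x[i];                                  \
        }                                                        \
        return acc;                                              \
    }

/* Instantiate both precisions; both are available at the same time. */
DEFINE_SUM_OF_SQUARES(double, _double)  /* defines sum_of_squares_double */
DEFINE_SUM_OF_SQUARES(float,  _single)  /* defines sum_of_squares_single */
```

A caller then picks the variant by name, e.g. sum_of_squares_double(xd, n) or sum_of_squares_single(xs, n), which mirrors how "solve_evp_real_1stage_double" and "solve_evp_real_1stage_single" coexist.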

ELPA2 can now be built (like ELPA1) for single precision calculations. The ELPA2 kernels which are implemented in assembler, C, or C++ have NOT yet been ported. Thus, at the moment only the GENERIC and GENERIC_SIMPLE kernels support single precision calculations.

