do work, but with less than optimal performance. In particular, setups that allow only one row (or column) in the 2D MPI grid perform poorly.

This is illustrated in the figure below, which shows the run-time for the solution of a real 10k matrix with the ELPA 1stage solver as the number of MPI processes varies from 2 to 40. Please note that setups which enforce a single process row (or process column), because the total number of MPI tasks is a prime number, should always be avoided.

If you are free to choose the number of MPI tasks, try to use a number that can be split into a (nearly) square process grid, i.e. one where the number of process rows and process columns are as close to each other as possible. If this is not possible, you might want to use fewer MPI tasks within ELPA than in your calling application and try the internal redistribution of ELPA to a new process grid.
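
To check how "square" a given number of MPI tasks can be made, the following minimal sketch may help. It is not part of ELPA: the function names `near_square_grid` and `is_degenerate` are hypothetical and chosen only for illustration. It searches downward from the integer square root of the task count for the largest divisor and flags counts (such as primes) that can only be arranged as a single process row or column.

```python
import math


def near_square_grid(nprocs):
    """Return (np_rows, np_cols) with np_rows * np_cols == nprocs,
    chosen so that the two factors are as close to each other as possible.

    Hypothetical helper for illustration; not an ELPA API.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    # Start at floor(sqrt(nprocs)) and walk down to the first divisor.
    # The loop always terminates at np_cols == 1 at the latest.
    for np_cols in range(math.isqrt(nprocs), 0, -1):
        if nprocs % np_cols == 0:
            return nprocs // np_cols, np_cols


def is_degenerate(nprocs):
    """True if nprocs > 1 can only form a grid with one row or one column,
    which is the case exactly when nprocs is a prime number."""
    _, np_cols = near_square_grid(nprocs)
    return nprocs > 1 and np_cols == 1


if __name__ == "__main__":
    # Same range of task counts as in the figure discussed above.
    for n in range(2, 41):
        rows, cols = near_square_grid(n)
        note = "  <- avoid: single row/column" if is_degenerate(n) else ""
        print(f"{n:3d} tasks -> {rows} x {cols}{note}")
```

For example, 36 tasks yield a 6 x 6 grid and 40 tasks an 8 x 5 grid, whereas 37 tasks can only be arranged as 37 x 1. In the latter case, choosing 36 tasks for ELPA would be one way to obtain a square grid.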