Replace the current implementation of MPI FFTs

The current way of performing MPI-parallel FFTs has several problems:

it relies on a fork of pyfftw that may or may not be merged into the official version in the future, potentially confusing users.
it does not work for all array sizes due to FFTW limitations.

The second point already makes broad regression testing of many different FFT sizes quite tricky.

My suggestion to overcome both of these drawbacks is based on the fact that MPI communication during an FFT is only needed if the first field dimension needs to be transformed, and that a multi-D FFT can be separated into FFTs along individual axes.
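The separability claim is easy to check numerically. A minimal sketch using plain numpy (no MPI involved; the array shape and variable names are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (8, 6, 4)  # arbitrary illustrative shape
field = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Full multi-dimensional transform in a single call.
full = np.fft.fftn(field)

# The same transform, done as 1-D FFTs along one axis at a time.
step = field
for axis in range(field.ndim):
    step = np.fft.fft(step, axis=axis)

# Both routes agree up to floating-point rounding.
separable_ok = np.allclose(full, step)
```

Since the 1-D transforms along different axes commute, only the transform along the distributed (first) axis ever needs data from other tasks.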

If an FFT along the first axis is required, NIFTy (or D2O) could just do the following:

MPI-transpose the field in such a way that the first dimension is no longer distributed across CPUs
perform the FFT along this dimension (no MPI required)
revert the transpose again
(perform FFTs along the other requested axes)
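The steps above can be sketched without MPI by simulating the tasks as a list of local slabs. This is only an illustration under assumptions: `redistribute` and `distributed_fftn` are hypothetical names, not existing NIFTy or D2O functions, the field is assumed to be split along its first axis, and the list comprehension stands in for what would presumably be an all-to-all exchange in a real MPI implementation.

```python
import numpy as np


def redistribute(slabs, split_axis, join_axis):
    """Simulated all-to-all exchange between len(slabs) tasks.

    Every task cuts its slab into one piece per task along split_axis;
    task r then receives piece r from everyone and joins the pieces
    along join_axis.
    """
    ntasks = len(slabs)
    pieces = [np.array_split(s, ntasks, axis=split_axis) for s in slabs]
    return [
        np.concatenate([pieces[src][dst] for src in range(ntasks)],
                       axis=join_axis)
        for dst in range(ntasks)
    ]


def distributed_fftn(slabs):
    """FFT over all axes of a field that is distributed along axis 0."""
    # 1. Transpose: afterwards axis 0 is complete on every task and
    #    axis 1 is the distributed one.
    cols = redistribute(slabs, split_axis=1, join_axis=0)
    # 2. FFT along axis 0, purely local.
    cols = [np.fft.fft(c, axis=0) for c in cols]
    # 3. Revert the transpose: back to the original axis-0 distribution.
    rows = redistribute(cols, split_axis=0, join_axis=1)
    # 4. FFTs along the remaining axes, also purely local.
    return [np.fft.fftn(r, axes=tuple(range(1, r.ndim))) for r in rows]


# Illustrative setup: 3 simulated tasks and sizes that do not divide evenly.
rng = np.random.default_rng(1)
field = rng.standard_normal((10, 7)) + 1j * rng.standard_normal((10, 7))
slabs = np.array_split(field, 3, axis=0)
result = distributed_fftn(slabs)
```

Concatenating `result` along axis 0 reproduces `np.fft.fftn(field)`, and each task ends up with a slab of the same shape it started with. Note that the uneven sizes (10 rows and 7 columns over 3 tasks) need no special treatment in this sketch.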

Internally this is exactly how FFTW handles this problem as well.

The transposition algorithm is not trivial, but certainly implementable without too many difficulties.

Doing this would also get rid of the special "fftw" distribution strategy.

Additional bonus: MPI FFTs would then also work with numpy FFT. The dependence on MPI-enabled FFTW would vanish, making NIFTy configuration simpler.
