Shopping list for performance improvements

I think there are a few different ways to improve the library performance, but I'm not sure which ones to prioritize:

Make sure that every user obtains a version that's compiled for their specific CPU architecture (i386, SSE2, AVX, AVX512, maybe various ARM versions). The impact in multi-D transforms will be very significant. Is there any mechanism for this in the scipy deployment process (different wheels for different x86 CPUs, on-the-fly recompilation of performance-critical code, etc.)?
OpenMP within scipy (@peterbell10 is investigating this).
1D transforms: convert them into 2D transforms using Tukey's 4 step algorithm if they are large enough (https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Variations). This would enable vectorized sub-FFTs. Not sure how much can be gained there.
multi-D transforms: it seems that most time of many transforms is actually not spent in the 1D FFTs proper, but rather in the memory reordering steps called by general_nd. This could probably be improved by doing explicit transposition steps between the FFTs over the individual axes. but that in turn requires significant additional memory (one or two multi-D-arrays), so this can only be optional.
(related to the point above) Big performance improvements can be obtained by avoiding critical strides. Example:
a=np.zeros((4096,4096))+0j
%timeit ppf.c2c(a,out=a)
a=np.zeros((4096,4097))+0j
%timeit ppf.c2c(a[:,0:4096],out=a[:,0:4096])
We may want to make use of this, e.g. by allocating a slighty too large output array and returning a view of that when advantageous.

Edited Aug 04, 2019 by Martin Reinecke