Shopping list for performance improvements
I think there are a few different ways to improve the library's performance, but I'm not sure which ones to prioritize:
- Make sure that every user obtains a version compiled for their specific CPU architecture (i386, SSE2, AVX, AVX512, maybe various ARM variants). The impact on multi-D transforms will be very significant. Is there any mechanism for this in the `scipy` deployment process (different wheels for different x86 CPUs, on-the-fly recompilation of performance-critical code, etc.)? A runtime-dispatch sketch follows this list.
- OpenMP within `scipy` (@peterbell10 is investigating this).
- 1D transforms: convert them into 2D transforms using the four-step variant of the Cooley–Tukey algorithm if they are large enough (https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Variations). This would enable vectorized sub-FFTs; I'm not sure how much can be gained there. A small four-step demo appears below.
- multi-D transforms: it seems that in many transforms most of the time is spent not in the 1D FFTs proper, but in the memory-reordering steps invoked by `general_nd`. This could probably be improved by doing explicit transposition steps between the FFTs over the individual axes, but that in turn requires significant additional memory (one or two multi-D arrays), so it can only be optional. A 2D illustration is included below.
- (related to the point above) Big performance improvements can be obtained by avoiding critical strides. Example:
```python
import numpy as np
import pypocketfft as ppf  # assuming `ppf` refers to pypocketfft

# 4096 complex128 values per row = 64 KiB row stride, a large power of two:
# walking down a column repeatedly hits the same cache sets.
a = np.zeros((4096, 4096)) + 0j
%timeit ppf.c2c(a, out=a)

# one extra column of padding moves the row stride off the power of two
a = np.zeros((4096, 4097)) + 0j
%timeit ppf.c2c(a[:, 0:4096], out=a[:, 0:4096])
```
We may want to make use of this, e.g. by allocating a slightly too large output array and returning a view of it when advantageous; a sketch of such a helper closes this section.
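On the first point: absent per-CPU wheels, one conceivable mechanism is to ship several compiled extensions in a single wheel and select one at import time. This is only a sketch under assumptions: the per-ISA module names are hypothetical (no such builds exist today), and the feature detection is Linux-only via `/proc/cpuinfo`.

```python
import importlib

def _cpu_flags():
    # Linux-only sketch: collect the "flags" field from /proc/cpuinfo.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def _load_best_backend():
    # Try the most capable build first; all module names are hypothetical.
    flags = _cpu_flags()
    candidates = [
        ("avx512f", "_pocketfft_avx512"),
        ("avx2",    "_pocketfft_avx2"),
        ("sse2",    "_pocketfft_sse2"),
        (None,      "_pocketfft_generic"),  # baseline, always acceptable
    ]
    for flag, name in candidates:
        if flag is None or flag in flags:
            try:
                return importlib.import_module(name)
            except ImportError:
                continue
    raise ImportError("no usable pocketfft backend found")
```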
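The four-step factorization itself is easy to demonstrate in NumPy. This sketch uses `np.fft.fft` as a stand-in for the library's 1D kernels; the point is that both inner FFT passes are batched over whole axes and therefore vectorizable.

```python
import numpy as np

def four_step_fft(x, n1, n2):
    # Length n1*n2 FFT computed as two batched small FFTs plus twiddles.
    a = x.reshape(n1, n2)                  # view the vector as a matrix
    b = np.fft.fft(a, axis=0)              # n2 FFTs of length n1
    k1 = np.arange(n1)[:, None]
    j2 = np.arange(n2)[None, :]
    b *= np.exp(-2j * np.pi * k1 * j2 / (n1 * n2))  # twiddle factors
    c = np.fft.fft(b, axis=1)              # n1 FFTs of length n2
    return c.T.ravel()                     # reorder to the usual output layout

x = np.random.rand(4096) + 1j * np.random.rand(4096)
assert np.allclose(four_step_fft(x, 64, 64), np.fft.fft(x))
```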
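The explicit-transposition idea for multi-D transforms, shown for the 2D case (again with `np.fft` standing in for the library's kernels): every FFT pass then runs over unit-stride data, at the price of a full-size temporary array.

```python
import numpy as np

def fft2_with_transpose(a):
    b = np.fft.fft(a, axis=-1)        # rows are unit-stride: fast
    b = np.ascontiguousarray(b.T)     # explicit transposition (extra copy)
    b = np.fft.fft(b, axis=-1)        # former columns, now unit-stride
    return b.T                        # logical transpose back (just a view)

a = np.random.rand(512, 512) + 0j
assert np.allclose(fft2_with_transpose(a), np.fft.fft2(a))
```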
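And a possible shape for the padded-allocation trick; `padded_empty` is a hypothetical helper, not an existing API, and the one-element padding heuristic is an assumption.

```python
import numpy as np

def padded_empty(shape, dtype=np.complex128, pad=1):
    # Over-allocate the last axis by `pad` elements so the row stride is
    # no longer a critical (large power-of-two) value, then return a view
    # with the requested shape.
    buf = np.empty(shape[:-1] + (shape[-1] + pad,), dtype=dtype)
    return buf[..., :shape[-1]]

out = padded_empty((4096, 4096))  # row stride 4097*16 bytes, not 65536
```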