Shopping list for performance improvements
I think there are a few different ways to improve the library performance, but I'm not sure which ones to prioritize:
- Make sure that every user obtains a version that is compiled for their specific CPU architecture (i386, SSE2, AVX, AVX-512, possibly various ARM variants). The impact on multi-D transforms would be very significant. Is there any mechanism for this in the `scipy` deployment process (different wheels for different x86 CPUs, on-the-fly recompilation of performance-critical code, etc.)?
- OpenMP within `scipy` (@peterbell10 is investigating this).
- 1D transforms: convert them into 2D transforms using the four-step variant of the Cooley–Tukey algorithm if they are large enough (https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Variations). This would enable vectorized sub-FFTs. Not sure how much can be gained there.
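  To make the idea concrete, here is a minimal numpy sketch of the four-step decomposition (a hypothetical illustration, not the library's implementation): reshape the length-`n1*n2` input into an `n1 x n2` matrix, FFT the columns, apply twiddle factors, FFT the rows, then transpose. Each pass consists of many same-length sub-FFTs, which is what makes vectorization possible.

  ```python
  import numpy as np

  def four_step_fft(x, n1, n2):
      """Length n1*n2 FFT via the four-step (Cooley-Tukey) decomposition.
      Illustrative sketch only; a real implementation would use in-place,
      vectorized sub-FFTs instead of np.fft calls."""
      assert x.size == n1 * n2
      a = x.reshape(n1, n2)                        # view as n1 x n2 matrix (row-major)
      b = np.fft.fft(a, axis=0)                    # step 1: n2 column FFTs of length n1
      k1 = np.arange(n1)[:, None]
      m2 = np.arange(n2)[None, :]
      b = b * np.exp(-2j * np.pi * k1 * m2 / x.size)  # step 2: twiddle factors
      c = np.fft.fft(b, axis=1)                    # step 3: n1 row FFTs of length n2
      return c.T.ravel()                           # step 4: transpose and flatten
  ```

  The result matches a direct `np.fft.fft` of the same input; the point of the decomposition is that the sub-FFTs operate on many rows/columns at once.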
- multi-D transforms: it seems that in many transforms most of the time is actually spent not in the 1D FFTs proper, but in the memory-reordering steps called by `general_nd`. This could probably be improved by doing explicit transposition steps between the FFTs over the individual axes, but that in turn requires significant additional memory (one or two multi-D arrays), so this can only be optional.
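  A toy version of the explicit-transposition idea, for a 2-D transform (hypothetical sketch, not `general_nd` itself): every 1-D pass runs over a contiguous last axis, with `np.ascontiguousarray` doing the explicit copy that costs the extra array mentioned above.

  ```python
  import numpy as np

  def fft2_with_transpose(a):
      """2-D FFT where each 1-D pass sees unit-stride data.
      The explicit transposed copy trades extra memory (one full-size
      temporary) for contiguous FFT inputs."""
      b = np.fft.fft(a, axis=-1)       # pass 1: FFT along the contiguous rows
      b = np.ascontiguousarray(b.T)    # explicit transposition step (extra array)
      b = np.fft.fft(b, axis=-1)       # pass 2: the other axis, now contiguous
      return b.T                       # cheap view back to the original layout
  ```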
- (related to the point above) Big performance improvements can be obtained by avoiding critical strides, i.e. strides that repeatedly map to the same cache sets (typically large powers of two). We may want to make use of this, e.g. by allocating a slightly too large output array and returning a view of it when advantageous.
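  A minimal sketch of that over-allocation trick (the helper name and pad size are made up for illustration): pad each row by a few elements so the row stride is no longer an exact power of two, and hand the caller a view of the requested shape.

  ```python
  import numpy as np

  def padded_empty_2d(shape, dtype=np.complex128, pad=1):
      """Allocate `pad` extra elements per row so the row stride avoids a
      critical (power-of-two) value, then return a view of the requested
      shape. Hypothetical helper; the padding wastes pad*rows elements."""
      rows, cols = shape
      buf = np.empty((rows, cols + pad), dtype=dtype)
      return buf[:, :cols]
  ```

  The view is non-contiguous, so code consuming it must handle arbitrary strides, which pocketfft-style 1-D transforms already do.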