# pypocketfft issues
https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues

## Odd length DCST-IV is inaccurate on WSL
https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/8 · Peter Bell · 2020-09-16T12:49:12Z

While investigating !22 I eventually realised that the transforms using `cos` and `sin` from `<cmath>` aren't the ones with accuracy problems. It is actually the odd-length type-IV DCT and DST that are inaccurate.

From running `stress.py` I've seen errors as large as ~1e-4 for `float`, which is at least three orders of magnitude worse than the same compiler achieves on native Linux.
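`stress.py` itself isn't reproduced here, but an accuracy check in the same spirit can be sketched by comparing a reduced-precision dense DCT-IV against a `float64` reference (the function names below are illustrative, not pypocketfft API):

```python
import numpy as np

def dct4_matrix(n, dtype):
    # Dense DCT-IV matrix: M[j, k] = cos(pi/n * (j + 0.5) * (k + 0.5))
    j = np.arange(n, dtype=np.float64)
    return np.cos(np.pi / n * np.outer(j + 0.5, j + 0.5)).astype(dtype)

def rel_error(n, dtype):
    # Worst-case relative error of a dtype-precision DCT-IV
    # against a float64 reference on random input.
    rng = np.random.default_rng(42)
    x64 = rng.standard_normal(n)
    ref = dct4_matrix(n, np.float64) @ x64
    got = dct4_matrix(n, dtype) @ x64.astype(dtype)
    return np.max(np.abs(got - ref)) / np.max(np.abs(ref))
```

For a naive dense transform like this, `rel_error(25, np.float32)` stays near float32 rounding (~1e-6), which is why an error of ~1e-4 in the library points at a genuine numerical problem rather than ordinary single-precision noise.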
## Reducing redundancy in cached data?
https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/10 · Martin Reinecke · 2019-08-20T17:00:24Z

Currently, plans are cached at the level of "full" 1D transforms, i.e. there are cached objects of type `pocketfft_c`, `pocketfft_r`, `T_dst1`, etc. However, these objects in turn hold objects of type `cfftp`, `rfftp`, `fftblue`, etc., which are often identical across different "high-level" objects.
Is there a way (and would it be efficient) to avoid this redundancy, e.g. by having the constructors of `pocketfft_c` call `get_plan<cfftp>` or `get_plan<fftblue>`? I'll be happy to try this, but I fear this might need special precautions to avoid deadlocks.
@peterbell10, if you have some time to think about this, I'd appreciate your opinion, but it's definitely not high priority.
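As a sketch of the idea, with Python stand-ins for the C++ templates (`Cfftp` and `PocketfftC` are illustrative names, and a re-entrant lock is one way to sidestep the deadlock concern when a constructor calls back into the cache):

```python
import threading

_cache = {}
_lock = threading.RLock()  # re-entrant: constructors may call get_plan recursively

def get_plan(cls, n):
    # Memoized plan factory keyed by (plan type, length).
    key = (cls, n)
    with _lock:
        plan = _cache.get(key)
        if plan is None:
            # Constructing under the lock is safe here only because the lock
            # is re-entrant; cls(n) may itself call get_plan for sub-plans.
            plan = cls(n)
            _cache[key] = plan
    return plan

class Cfftp:
    def __init__(self, n):
        self.n = n  # stands in for the twiddle-factor tables

class PocketfftC:
    def __init__(self, n):
        # The high-level plan shares the low-level plan via the cache
        # instead of owning a private copy.
        self.sub = get_plan(Cfftp, n)
```

Two `PocketfftC` plans of the same length then share one `Cfftp` instance; the trade-off is that every constructor call now goes through the cache lock.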
## Shopping list for performance improvements
https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/7 · Martin Reinecke · 2019-08-13T12:41:47Z

I think there are a few different ways to improve the library performance, but I'm not sure which ones to prioritize:
- Make sure that every user obtains a version that's compiled for their specific CPU architecture (i386, SSE2, AVX, AVX512, maybe various ARM versions). The impact in multi-D transforms will be very significant. Is there any mechanism for this in the `scipy` deployment process (different wheels for different x86 CPUs, on-the-fly recompilation of performance-critical code, etc.)?
- OpenMP within `scipy` (@peterbell10 is investigating this).
- 1D transforms: convert them into 2D transforms using the four-step Cooley–Tukey algorithm if they are large enough (https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Variations). This would enable vectorized sub-FFTs. Not sure how much can be gained there.
- multi-D transforms: it seems that most of the time in many transforms is actually not spent in the 1D FFTs proper, but rather in the memory-reordering steps called by `general_nd`. This could probably be improved by doing explicit transposition steps between the FFTs over the individual axes, but that in turn requires significant additional memory (one or two multi-D arrays), so this can only be optional.
- (related to the point above) Big performance improvements can be obtained by avoiding critical strides. Example:

  ```python
  a = np.zeros((4096, 4096)) + 0j
  %timeit ppf.c2c(a, out=a)
  a = np.zeros((4096, 4097)) + 0j
  %timeit ppf.c2c(a[:, 0:4096], out=a[:, 0:4096])
  ```
  We may want to make use of this, e.g. by allocating a slightly too large output array and returning a view of it when advantageous.
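The padded-allocation trick from the last bullet can be sketched as follows (the function name and the `pad=1` choice are illustrative; the optimal padding depends on cache geometry):

```python
import numpy as np

def padded_empty_complex(shape, pad=1):
    # Allocate `pad` extra columns so the row stride is not a large power
    # of two (a "critical stride" that maps rows onto the same cache sets),
    # then return a view with the requested shape.
    buf = np.empty((shape[0], shape[1] + pad), dtype=np.complex128)
    return buf[:, :shape[1]]
```

The caller has to be aware that the returned view is non-contiguous; copying it into a contiguous array afterwards would forfeit most of the gain.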
## Reducing plan size to O(sqrt(N))?
https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/6 · Martin Reinecke · 2019-08-04T18:52:25Z

Currently the size of a plan is roughly comparable to the size of the input/output arrays (for 1D transforms). In the case of Bluestein transforms, it's even a few times larger. Given that we cache plans, this could be considered a bit wasteful.
It is possible to store a reduced set of roughly 2*sqrt(N) twiddle factors and to compute the work arrays (WA) from them quite efficiently (one if-statement, two complex loads, and one complex multiplication) at the beginning of every pass. This will cause an overall slowdown (I expect less than 10 percent), but has two advantages:
- much smaller plan size
- only one kind of plan needed for real and complex FFTs
@peterbell10, do you think this is worth trying?
Providing this as an option would most likely cause too much code growth; I guess we should either keep the current state or make the full switch.
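The reconstruction can be sketched in numpy as follows (a stand-in for the C++; with `m = ceil(sqrt(N))` the two tables together hold roughly 2*sqrt(N) entries, and each twiddle costs one complex multiply):

```python
import numpy as np

def make_tables(n):
    # Store ~2*sqrt(n) roots of unity instead of all n.
    m = int(np.ceil(np.sqrt(n)))
    # small[r] = exp(-2*pi*i*r/n) for r < m
    small = np.exp(-2j * np.pi * np.arange(m) / n)
    # big[q] = exp(-2*pi*i*q*m/n) for q < ceil(n/m)
    big = np.exp(-2j * np.pi * np.arange((n + m - 1) // m) * m / n)
    return m, small, big

def twiddle(k, m, small, big):
    # exp(-2*pi*i*k/n) = exp(-2*pi*i*(q*m + r)/n) = big[q] * small[r]
    q, r = divmod(k, m)
    return big[q] * small[r]
```

Since each factor is computed as a product of two exactly rounded exponentials, the extra error per twiddle is on the order of one rounding step, so accuracy should not suffer noticeably.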