pypocketfft issues: https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues

Issue #10: Reducing redundancy in cached data? (https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/10)
Martin Reinecke, last updated 2019-08-20

Currently, plans are cached at the level of "full" 1D transforms, i.e. there are cached objects of type `pocketfft_c`, `pocketfft_r`, `T_dst1`, etc. However, these objects in turn hold objects of type `cfftp`, `rfftp`, `fftblue`, etc., which are often identical across different "high-level" objects.
Is there a way (and would it be efficient) to avoid this redundancy, e.g. by having the constructors of `pocketfft_c` call `get_plan<cfftp>` or `get_plan<fftblue>`? I'll be happy to try this, but I fear this might need special precautions to avoid deadlocks.
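To make the deadlock concern concrete, here is a toy Python model of the idea. The names `get_plan`, `pocketfft_c` and `cfftp` mirror the C++ code, but everything else is an assumption, not the actual implementation; the point it illustrates is that constructing the sub-plan outside the cache lock avoids re-entering a lock that is already held.

```python
import threading

_cache = {}
_lock = threading.Lock()

def get_plan(cls, n):
    """Return a cached plan of type cls for length n, creating it if needed."""
    key = (cls.__name__, n)
    with _lock:
        plan = _cache.get(key)
    if plan is None:
        # Construct OUTSIDE the lock: the constructor may itself call
        # get_plan (e.g. pocketfft_c asking for a cfftp sub-plan), and
        # re-entering a held non-recursive lock would deadlock.
        plan = cls(n)
        with _lock:
            plan = _cache.setdefault(key, plan)
    return plan

class cfftp:                      # toy stand-in for the low-level complex plan
    def __init__(self, n):
        self.n = n

class pocketfft_c:                # toy high-level plan sharing the sub-plan
    def __init__(self, n):
        self.sub = get_plan(cfftp, n)

p1 = pocketfft_c(64)
p2 = pocketfft_c(64)
assert p1.sub is p2.sub           # the cfftp object is shared, not duplicated
```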
@peterbell10, if you have some time to think about this, I'd appreciate your opinion, but it's definitely not high priority.

Issue #8: Odd-length DCST-IV is inaccurate on WSL (https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/8)
Peter Bell, last updated 2020-09-16

While investigating !22 I eventually realised that the transforms using `cos` and `sin` from `<cmath>` aren't the ones with accuracy problems. It is actually the odd-length type-IV DCT and DST that are inaccurate.
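For reference, the scale of a healthy `float32` error can be checked without pypocketfft at all, by evaluating the DCT-IV definition directly in both precisions. This is a hedged sketch; the direct O(N²) evaluation below is mine, not code from `stress.py`:

```python
import numpy as np

def dct4_direct(x):
    """Type-IV DCT from its definition: X_k = sum_n x_n cos(pi/N (n+1/2)(k+1/2))."""
    n = len(x)
    j = np.arange(n)
    m = np.cos(np.pi / n * np.outer(j + 0.5, j + 0.5))
    return m.astype(x.dtype) @ x

rng = np.random.default_rng(0)
x64 = rng.standard_normal(101)          # odd length, as in the failing cases
x32 = x64.astype(np.float32)

# The (unnormalized) DCT-IV matrix satisfies M @ M == (N/2) * I, a cheap self-check:
assert np.allclose(dct4_direct(dct4_direct(x64)), 101 / 2 * x64)

ref = dct4_direct(x64)
err = np.max(np.abs(dct4_direct(x32) - ref)) / np.max(np.abs(ref))
assert err < 1e-4   # float32 rounding alone stays well below the ~1e-4 seen on WSL
```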
From running `stress.py` I've seen errors as large as ~1e-4 for `float`, which is at least three orders of magnitude worse than the same compiler on native Linux.

Issue #7: Shopping list for performance improvements (https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/7)
Martin Reinecke, last updated 2021-02-01

I think there are a few different ways to improve the library's performance, but I'm not sure which ones to prioritize:
- Make sure that every user obtains a version that's compiled for their specific CPU architecture (i386, SSE2, AVX, AVX512, maybe various ARM versions). The impact on multi-D transforms would be very significant. Is there any mechanism for this in the `scipy` deployment process (different wheels for different x86 CPUs, on-the-fly recompilation of performance-critical code, etc.)?
- OpenMP within `scipy` (@peterbell10 is investigating this).
- 1D transforms: convert sufficiently large ones into 2D transforms using the four-step variant of the Cooley–Tukey algorithm (https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Variations). This would enable vectorized sub-FFTs. I'm not sure how much can be gained there.
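  A sketch of what the four-step decomposition looks like, with `np.fft.fft` standing in for the vectorized sub-FFTs a real implementation would use (`fft_four_step` is a hypothetical name, not pypocketfft API):

```python
import numpy as np

def fft_four_step(x, n1, n2):
    """1D DFT of length n1*n2 via two batches of short FFTs (four-step algorithm)."""
    a = x.reshape(n1, n2)
    a = np.fft.fft(a, axis=0)                       # n2 FFTs of length n1
    k1 = np.arange(n1)[:, None]
    j2 = np.arange(n2)[None, :]
    a *= np.exp(-2j * np.pi * k1 * j2 / (n1 * n2))  # twiddle correction
    a = np.fft.fft(a, axis=1)                       # n1 FFTs of length n2
    return a.T.ravel()                              # reorder to natural output order

rng = np.random.default_rng(0)
x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
assert np.allclose(fft_four_step(x, 8, 8), np.fft.fft(x))
```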
- multi-D transforms: it seems that for many transforms most of the time is spent not in the 1D FFTs proper, but in the memory-reordering steps called by `general_nd`. This could probably be improved by performing explicit transposition steps between the FFTs over the individual axes, but that in turn requires significant additional memory (one or two multi-D arrays), so it can only be optional.
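  The explicit-transposition idea, sketched for the 2D case with numpy standing in for pocketfft (`fftn_with_transposes` is a hypothetical helper): each axis transform then runs over contiguous memory, paid for by one transposed copy.

```python
import numpy as np

def fftn_with_transposes(a):
    # FFT over the contiguous last axis, then transpose-copy and repeat,
    # instead of strided FFTs over axis 0. Costs one extra array of a.size.
    b = np.fft.fft(a, axis=1)
    b = np.ascontiguousarray(b.T)   # explicit transposition step (the extra memory)
    b = np.fft.fft(b, axis=1)
    return b.T                      # transposed view of the result; no further copy

a = np.arange(12.0).reshape(3, 4) + 1j
assert np.allclose(fftn_with_transposes(a), np.fft.fft2(a))
```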
- (related to the point above) Big performance improvements can be obtained by avoiding critical strides. Example:

  ```python
  a = np.zeros((4096, 4096)) + 0j
  %timeit ppf.c2c(a, out=a)

  a = np.zeros((4096, 4097)) + 0j
  %timeit ppf.c2c(a[:, 0:4096], out=a[:, 0:4096])
  ```
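  A padded allocation along those lines could look like this (`padded_empty` is a hypothetical helper, not existing pypocketfft behaviour); the caller sees the requested shape, while the underlying buffer's row stride avoids the power-of-two "critical" value:

```python
import numpy as np

def padded_empty(shape, pad=1, dtype=complex):
    # Allocate `pad` extra columns and hand back a view of the requested
    # 2D shape; the view's row stride is (shape[1] + pad) elements.
    buf = np.empty((shape[0], shape[1] + pad), dtype=dtype)
    return buf[:, :shape[1]]

a = padded_empty((1024, 1024))
assert a.shape == (1024, 1024)
assert a.strides[0] == 1025 * a.itemsize   # row stride is no longer a power of two
```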
We may want to make use of this, e.g. by allocating a slightly too large output array and returning a view of it when advantageous.

Issue #6: Reducing plan size to O(sqrt(N))? (https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/6)
Martin Reinecke, last updated 2019-08-04

Currently the size of a plan is roughly comparable to the size of the input/output arrays (for 1D transforms). In the case of Bluestein transforms, it's even a few times larger. Given that we cache plans, this could be considered a bit wasteful.
It is possible to store a reduced set of roughly 2*sqrt(N) twiddle factors and to compute the work arrays (WA) from these quite efficiently (one if-statement, two complex loads, and one complex multiplication) at the beginning of every pass. This would cause an overall slowdown (I expect less than 10 percent), but has two advantages:
- much smaller plan size
- only one kind of plan needed for real and complex FFTs
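A numpy sketch of the O(sqrt(N)) scheme (the function names are mine): w_j = exp(2*pi*i*j/N) is split as w_j = t_lo[j mod s] * t_hi[j div s] with s about sqrt(N), so only roughly 2*sqrt(N) values are stored, and rebuilding each twiddle costs the one if-statement, two loads and one multiplication mentioned above.

```python
import numpy as np

def make_tables(n):
    """Store ~2*sqrt(n) values instead of a full length-n twiddle table."""
    s = int(np.ceil(np.sqrt(n)))
    t_lo = np.exp(2j * np.pi * np.arange(s) / n)        # exp(2*pi*i*(j % s)/n)
    t_hi = np.exp(2j * np.pi * np.arange(0, n, s) / n)  # exp(2*pi*i*(j // s)*s/n)
    return s, t_lo, t_hi

def twiddle(j, s, t_lo, t_hi):
    lo = j % s
    if lo == 0:                      # the "one if-statement": exact value, no rounding
        return t_hi[j // s]
    return t_lo[lo] * t_hi[j // s]   # two complex loads, one complex multiplication

n = 1000
s, t_lo, t_hi = make_tables(n)
w = np.array([twiddle(j, s, t_lo, t_hi) for j in range(n)])
assert np.allclose(w, np.exp(2j * np.pi * np.arange(n) / n))
assert len(t_lo) + len(t_hi) <= 2 * s    # storage is ~2*sqrt(n), not n
```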
@peterbell10, do you think this is worth trying?
Providing this as an option would most likely cause too much code growth; we should either keep the current state or make the full switch, I guess.