# pypocketfft issues
https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues

## Issue #14: More interesting OSX issues
Martin Reinecke · 2020-09-04 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/14

I have a CI setup for tests on OSX for the `ducc` package, which contains an almost verbatim copy of `pypocketfft`'s threading functionality. One thing I learned is that under certain circumstances the C++ compiler defines feature-test macros even though the corresponding functionality is not available (see https://travis-ci.com/github/litebird/ducc/jobs/380712386).
It also appears possible to create deadlocks with the new threading implementation on OSX (see the other recent failed builds at https://travis-ci.com/github/litebird/ducc/builds). It's quite possible that I broke something when I adjusted the threading code for `ducc`, but the changes were quite minimal. @g-peterbell, if you have an idea what could be going wrong and/or think that `pypocketfft` might also be vulnerable to this, please chime in!

## Issue #13: Make thread pool size more flexible?
Martin Reinecke · 2020-08-27 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/13

When running parallel FFTs with, say, four threads on a system with 12 physical CPUs and 24 virtual cores, I'm not observing a set of 4 virtual cores at 100% in `htop` (as would be the case with an OpenMP code), but rather a small, homogeneous load on all 24 virtual cores. If I'm interpreting this correctly, it happens because we allocate a thread pool with as many threads as there are virtual cores and assign tasks to them in round-robin fashion, so the load jumps around very quickly.
This doesn't seem optimal: it invalidates a lot of caches, and it probably confuses the thread scheduler.
Would it be possible to resize the pool on demand, roughly like this:
```cpp
inline thread_pool &get_pool2(size_t nthreads=0)
  {
  // Lazily constructed pool; starts out with a single thread.
  static std::unique_ptr<thread_pool> pool(std::make_unique<thread_pool>(1));
  if ((!pool) || ((nthreads!=0) && (nthreads!=pool->size()))) // resize on demand
    pool = std::make_unique<thread_pool>(nthreads);
#if __has_include(<pthread.h>)
  // Make the pool fork-safe: shut it down before fork() and restart it
  // in both parent and child afterwards.
  static std::once_flag f;
  std::call_once(f,
    []{
      pthread_atfork(
        +[]{ get_pool2().shutdown(); },  // prepare
        +[]{ get_pool2().restart(); },   // parent
        +[]{ get_pool2().restart(); }    // child
        );
    });
#endif
  return *pool;
  }
```
@g-peterbell, do you think this could work, or am I missing some subtle multithreading issue? First tests look OK, but with concurrency I'd like to hear a second opinion :)

## Issue #12: multiply defined symbols
Martin Reinecke · 2020-01-08 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/12

Currently the header contains symbols like `thread_id` which need to be defined exactly once in a linked executable. If the header is included in more than one translation unit, these symbols become multiply defined and the executable won't link.
I have a fix for this, but since it is part of a larger rework, a merge request will take a few more days.
@g-peterbell

## Issue #11: Suspiciously good performance ...
Martin Reinecke · 2019-12-04 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/11

When running `bench.py`, it seems that (at least on my hardware) pypocketfft very often performs better than FFTW with `FFTW_MEASURE` planning for multi-D transforms.
I verified that my libfftw.so was compiled with AVX support, so I'm really confused about what's going on.
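Such comparisons can be sanity-checked on a given machine with a minimal timing harness; the sketch below uses `numpy.fft` as a stand-in, since `bench.py` itself drives pypocketfft and FFTW:

```python
import timeit
import numpy as np

# Sketch: time one multi-D c2c transform, taking the best of several repeats
# to suppress scheduling noise. The real benchmark calls pypocketfft/FFTW.
a = np.random.rand(256, 256) + 1j * np.random.rand(256, 256)
best = min(timeit.repeat(lambda: np.fft.fftn(a), number=10, repeat=3)) / 10
print(f"fftn (256, 256): {best * 1e3:.3f} ms per transform")
```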
@g-peterbell, any ideas? The code cannot be that good :P

## Issue #10: Reducing redundancy in cached data?
Martin Reinecke · 2019-08-20 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/10

Currently, plans are cached at the level of "full" 1D transforms, i.e. there are cached objects of type `pocketfft_c`, `pocketfft_r`, `T_dst1`, etc. However, these objects in turn hold objects of type `cfftp`, `rfftp`, `fftblue`, etc., which are often identical across different "high-level" objects.
Is there a way (and would it be efficient) to avoid this redundancy, e.g. by having the constructors of `pocketfft_c` call `get_plan<cfftp>` or `get_plan<fftblue>`? I'll be happy to try this, but I fear this might need special precautions to avoid deadlocks.
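The idea could look roughly like the following sketch (Python pseudocode for the concept; the names `get_plan`, `cfftp` and `pocketfft_c` mirror the C++ ones, and the reentrant lock is one possible "special precaution" that keeps a plan constructor from deadlocking when it recursively requests a sub-plan):

```python
import threading

_plan_cache = {}
# Reentrant lock: a high-level plan constructor may itself call get_plan()
# for its low-level parts (cfftp, fftblue, ...) without deadlocking.
_plan_lock = threading.RLock()

def get_plan(factory, n):
    """Return a shared plan of type `factory` for length n, creating it once."""
    key = (factory, n)
    with _plan_lock:
        plan = _plan_cache.get(key)
        if plan is None:
            plan = _plan_cache[key] = factory(n)
        return plan

class cfftp:          # stand-in for a low-level complex FFT plan
    def __init__(self, n):
        self.n = n

class pocketfft_c:    # high-level plan sharing its cfftp with other plans
    def __init__(self, n):
        self.packplan = get_plan(cfftp, n)
```

With this, two `pocketfft_c` plans of the same length share one `cfftp` instance, and requesting a `pocketfft_c` through `get_plan` exercises exactly the recursive-locking case the deadlock worry is about.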
@peterbell10, if you have some time to think about this I'd appreciate your opinion, but it's definitely not high priority.

## Issue #9: Segmentation fault with multithreading on Macs?
Martin Reinecke · 2019-08-14 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/9

A colleague just ran `stress.py` with the current master branch on a Mac and got intermittent segmentation faults.
If we switch to the `tweak_dcst4` branch (which has a macro to disable multithreading), and switch all multithreading off, things work again.
Any idea what could be going wrong, @peterbell10?
(I cannot provoke segfaults on Linux, no matter what I try...)

## Issue #8: Odd length DCST-IV is inaccurate on WSL
Peter Bell · 2020-09-16 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/8

While investigating !22 I eventually realised that the transforms using `cos` and `sin` from `<cmath>` aren't the ones with accuracy problems. It is actually the odd-length type-IV DCT and DST that are inaccurate.
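A naive O(N²) reference DCT-IV (a sketch, not pypocketfft's implementation) makes the round-trip error easy to measure for a given length and precision:

```python
import numpy as np

def dct4(x):
    """Naive O(N^2) type-IV DCT: X[k] = sum_n x[n] cos(pi/N (n+1/2)(k+1/2))."""
    n = np.arange(len(x))
    mat = np.cos(np.pi / len(x) * np.outer(n + 0.5, n + 0.5))
    return mat @ x

# DCT-IV is its own inverse up to a factor N/2, so a round trip
# directly measures the transform's accuracy.
rng = np.random.default_rng(0)
x = rng.standard_normal(257)          # odd length, the problematic case
err = np.max(np.abs(dct4(dct4(x)) * 2 / len(x) - x))
print(err)
```

Repeating the same round trip in `float32` gives a sense of the single-precision baseline that the ~1e-4 WSL figure should be compared against.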
From running `stress.py` I've seen errors as large as ~1e-4 for `float`, which is at least 3 orders of magnitude worse than with the same compiler on native Linux.

## Issue #7: Shopping list for performance improvements
Martin Reinecke · 2021-02-01 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/7

I think there are a few different ways to improve the library's performance, but I'm not sure which ones to prioritize:
- Make sure that every user obtains a version that's compiled for their specific CPU architecture (i386, SSE2, AVX, AVX512, maybe various ARM versions). The impact on multi-D transforms would be very significant. Is there any mechanism for this in the `scipy` deployment process (different wheels for different x86 CPUs, on-the-fly recompilation of performance-critical code, etc.)?
- OpenMP within `scipy` (@peterbell10 is investigating this).
- 1D transforms: convert them into 2D transforms using the four-step (Cooley–Tukey) algorithm if they are large enough (https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#Variations). This would enable vectorized sub-FFTs. Not sure how much can be gained there.
- multi-D transforms: it seems that most of the time in many transforms is actually not spent in the 1D FFTs proper, but rather in the memory reordering steps called by `general_nd`. This could probably be improved by doing explicit transposition steps between the FFTs over the individual axes, but that in turn requires significant additional memory (one or two multi-D arrays), so this can only be optional.
- (Related to the point above) Big performance improvements can be obtained by avoiding critical strides. Example:

  ```python
  a = np.zeros((4096, 4096)) + 0j
  %timeit ppf.c2c(a, out=a)
  a = np.zeros((4096, 4097)) + 0j
  %timeit ppf.c2c(a[:, 0:4096], out=a[:, 0:4096])
  ```
  We may want to make use of this, e.g. by allocating a slightly too large output array and returning a view of it when advantageous.

## Issue #6: Reducing plan size to O(sqrt(N))?
Martin Reinecke · 2019-08-04 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/6

Currently the size of a plan is roughly comparable to the size of the input/output arrays (for 1D transforms). In the case of Bluestein transforms, it's even a few times larger. Given that we cache plans, this could be considered a bit wasteful.
It is possible to store a reduced set of roughly 2*sqrt(N) twiddle factors and to compute the work arrays (WA) from these quite efficiently (one if-statement, 2 complex loads and one complex multiplication) at the beginning of every pass. This will cause an overall slowdown (I expect less than 10 percent), but has two advantages:
- much smaller plan size
- only one kind of plan needed for real and complex FFTs
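The reconstruction itself is straightforward; a plain-Python sketch of the idea (not the planned C++ code) splits the twiddle index into a coarse and a fine part:

```python
import cmath

N = 360                      # transform length
s = int(N ** 0.5)            # block size ~ sqrt(N)

# Two tables of combined size ~2*sqrt(N) instead of one table of size N.
coarse = [cmath.exp(2j * cmath.pi * a * s / N) for a in range((N + s - 1) // s)]
fine   = [cmath.exp(2j * cmath.pi * b / N) for b in range(s)]

def twiddle(k):
    """exp(2*pi*i*k/N) from the reduced tables: two loads, one multiplication."""
    return coarse[k // s] * fine[k % s]
```

The per-factor cost matches the estimate above (an index split, two complex loads and one complex multiplication), and the single extra multiplication keeps the reconstructed factors accurate to roundoff.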
@peterbell10, do you think this is worth trying?
Providing this as an option would most likely cause too much code growth; we should either keep the current state or make the full switch, I guess.

## Issue #5: Update author and copyright information
Martin Reinecke · 2019-07-20 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/5

I'm sorry @peterbell10, I completely forgot to update the copyright and author info after you started contributing.
Do you hold the copyright for your contributions yourself, or does it belong to some organization? (Mine goes to Max-Planck-Society automatically for everything work-related.)

## Issue #4: sine and cosine transforms?
Martin Reinecke · 2019-07-20 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/4

I'm thinking about adding sine and cosine transforms when I find some time.
The main reason is that `scipy.fft` offers these transforms but will fall back to `scipy.fftpack` if they are called. I'm not worried about performance but rather about accuracy, which is notoriously bad for `fftpack`'s real transforms (if the transform sizes are near-prime).
Also, at the moment, existing programs using `scipy` that do not switch explicitly to `scipy.fft` won't benefit from the new library and still use `fftpack`, correct? If `pypocketfft` contains `fftpack`'s full functionality, then maybe it could replace `fftpack` entirely (behind `scipy.fftpack`'s original interface, of course).
@peterbell10, do you think this is a useful plan? Or would it be too tedious to put `pypocketfft` behind the `fftpack` interface?

## Issue #3: Python interface cleanup?
Martin Reinecke · 2019-07-17 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/3

I think it might be convenient to:
- merge `fftn` and `ifftn` into one function (maybe `c2c` or `complex_FFT`?)
- provide `r2c` and `c2r` functions instead of `rfftn` and `irfftn`. They would have a direction flag and could therefore also do the job of `numpy`'s `hfft` and `ihfft`.
- maybe merge `rfft_scipy` and `irfft_scipy` in an analogous fashion (and replace `scipy` by `rfftpack`, which expresses more clearly what the function does).
This would make the interface code smaller, and I think it would present the underlying symmetries of the transforms much better.
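To illustrate the first point, a merged entry point might look like the following sketch (built on `numpy.fft` for demonstration; the name `c2c` and the `forward` flag are the proposal, not an existing API):

```python
import numpy as np

def c2c(a, axes=None, forward=True):
    """Single complex-to-complex entry point replacing fftn/ifftn."""
    a = np.asarray(a, dtype=complex)
    return np.fft.fftn(a, axes=axes) if forward else np.fft.ifftn(a, axes=axes)

x = np.random.rand(8, 8) + 1j * np.random.rand(8, 8)
roundtrip = c2c(c2c(x), forward=False)   # inverse undoes the forward transform
```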
@peterbell10, what do you think?

## Issue #2: sym_rfftn for ifftn
Peter Bell · 2019-06-19 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/2

This is an omission from the changes I made in my scipy PR: `ifftn` for real input can be optimised in exactly the same way as `fftn`. You just need to take the complex conjugate of the 1D r2c transform before continuing to the remaining axes.
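The conjugation trick can be checked numerically with `numpy` (a sketch of the identity; pypocketfft's internal axis handling differs):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6))       # real input
n = a.shape[-1]

# For real input, ifft(a) == conj(fft(a)) / n along the last axis, so the
# half-spectrum r2c result can be conjugated (and scaled) before the
# remaining axes are handled with ordinary complex transforms.
half = np.fft.rfft(a, axis=-1).conj() / n
half = np.fft.ifft(half, axis=0)

full = np.fft.ifftn(a)                # reference: full complex ifftn
# `half` matches the first n//2 + 1 entries of `full` along the last axis;
# the rest follows from Hermitian symmetry.
```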
I suspect it's not a very common use case, but it's not difficult to add. Alternatively, this could go into pocketfft itself as a `forwards` argument to `r2c`.

## Issue #1: Testing
Daniel Lenz · 2019-05-27 · https://gitlab.mpcdf.mpg.de/mtr/pypocketfft/-/issues/1