Make thread pool size more flexible?
When running parallel FFTs with, say, four threads on a system with 12 CPUs and 24 virtual cores, I don't see a set of 4 virtual cores at 100% in htop
(as would be the case with OpenMP code), but rather a small, uniform load on all 24 virtual cores. If I'm interpreting this correctly, this happens because we allocate a thread pool with as many threads as there are virtual cores and assign tasks to them in a round-robin fashion, so the load jumps around very quickly.
This doesn't seem optimal: it invalidates a lot of caches, and it probably confuses the thread scheduler.
Would it be possible to resize the pool on demand, roughly like this:
inline thread_pool &get_pool2(size_t nthreads=0)
  {
  static std::unique_ptr<thread_pool> pool(std::make_unique<thread_pool>(1));
  if ((!pool) || ((nthreads!=0) && (nthreads!=pool->size()))) // resize
    {
    pool = std::make_unique<thread_pool>(nthreads);
    }
#if __has_include(<pthread.h>)
  static std::once_flag f;
  call_once(f,
    []{
    pthread_atfork(
      +[]{ get_pool2().shutdown(); }, // prepare
      +[]{ get_pool2().restart(); },  // parent
      +[]{ get_pool2().restart(); }   // child
      );
    });
#endif
  return *pool;
  }
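One thing I'm unsure about: the function returns a reference, so if one thread resizes the pool while another still holds the reference from an earlier call, the old pool is destroyed underneath it. Below is a minimal sketch of one possible way to guard against that, with a mutex around the resize and shared ownership of the pool. `toy_pool` and `get_pool_sketch` are hypothetical names made up for illustration; `toy_pool` only records its size and is not the real `thread_pool`, and the atfork handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real thread_pool: it only remembers its size.
class toy_pool
  {
  std::size_t nthreads_;

  public:
    explicit toy_pool(std::size_t nthreads) : nthreads_(nthreads)
      { assert(nthreads > 0); }
    std::size_t size() const { return nthreads_; }
  };

// Returns a shared handle to a pool of the requested size.
// nthreads==0 means "give me whatever pool currently exists".
inline std::shared_ptr<toy_pool> get_pool_sketch(std::size_t nthreads = 0)
  {
  static std::mutex mut;
  static std::shared_ptr<toy_pool> pool = std::make_shared<toy_pool>(1);

  // Serialize the check and the possible replacement of the pool.
  std::lock_guard<std::mutex> lock(mut);
  if ((nthreads != 0) && (nthreads != pool->size()))
    {
    // Callers that still hold the old shared_ptr keep the old pool alive
    // until they drop it; only new callers see the resized pool.
    pool = std::make_shared<toy_pool>(nthreads);
    }
  return pool;
  }
```

With this, a caller that did `auto p = get_pool_sketch(4);` keeps a valid 4-thread pool even if another thread calls `get_pool_sketch(2)` in the meantime. The price is that two pools can briefly exist side by side, which may or may not be acceptable here.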
@g-peterbell do you think this could work, or am I missing some subtle multithreading issue? First tests look OK, but with concurrency I'd like to hear a second opinion :)