Make thread pool size more flexible?

When running parallel FFTs with, say, four threads on a system with 12 CPU cores and 24 virtual cores, I'm not observing a set of 4 virtual cores at 100% in htop (as would be the case with an OpenMP code), but rather a small, homogeneous load on all 24 virtual cores. If I'm interpreting this correctly, this happens because we allocate a thread pool with as many threads as there are virtual cores and assign tasks to them in a round-robin fashion, so the load jumps from core to core very quickly.

This doesn't seem optimal: it invalidates a lot of caches, and it probably confuses the thread scheduler.
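To make my interpretation concrete, here is a toy model of round-robin task assignment. It is only an illustration under my assumption about how tasks are handed out, not the actual scheduling code; `round_robin_load` and its parameters are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: counts how many tasks each worker receives when every
// parallel call submits `tasks_per_call` tasks and the pool hands them out
// round-robin. This is NOT the real scheduler, just an illustration.
inline std::vector<std::size_t> round_robin_load(std::size_t pool_size,
                                                 std::size_t tasks_per_call,
                                                 std::size_t ncalls)
  {
  assert(pool_size > 0);
  std::vector<std::size_t> load(pool_size, 0);
  std::size_t next = 0;  // index of the worker that gets the next task
  for (std::size_t call = 0; call < ncalls; ++call)
    for (std::size_t t = 0; t < tasks_per_call; ++t)
      {
      ++load[next];
      next = (next + 1) % pool_size;  // move on to the next worker
      }
  return load;
  }
```

In this model, `round_robin_load(24, 4, 6)` gives every one of the 24 workers exactly one task, while `round_robin_load(4, 4, 6)` keeps the same 24 tasks on four workers with six tasks each. The first case matches the thin, uniform load I see in htop.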

Would it be possible to resize the pool on demand, roughly like this:

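// nthreads==0 means "return the pool as it currently is";
// any other value resizes the pool if its size differs.
inline thread_pool &get_pool2(size_t nthreads=0)
  {
  // start with a single-thread pool; it is replaced on the first resize request
  static std::unique_ptr<thread_pool> pool(std::make_unique<thread_pool>(1));
  if ((!pool) || ((nthreads!=0) && (nthreads!=pool->size()))) // resize
    {
    // the old pool is destroyed here, so nobody may still be using it
    pool = std::make_unique<thread_pool>(nthreads);
    }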
#if __has_include(<pthread.h>)
  static std::once_flag f;
  call_once(f,
    []{
    pthread_atfork(
      +[]{ get_pool2().shutdown(); },  // prepare
      +[]{ get_pool2().restart(); },   // parent
      +[]{ get_pool2().restart(); }    // child
      );
    });
#endif

  return *pool;
  }
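To isolate the resize rule from the rest, here is a stripped-down sketch that can be compiled on its own. `stub_pool` and `get_stub_pool` are hypothetical stand-ins for the real `thread_pool` and `get_pool2`, and the mutex is my addition, on the assumption that two threads could otherwise try to replace the pool at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real thread_pool; it only records its size.
class stub_pool
  {
  std::size_t nthreads_;
  public:
    explicit stub_pool(std::size_t nthreads) : nthreads_(nthreads)
      { assert(nthreads > 0); }
    std::size_t size() const { return nthreads_; }
  };

// Same resize rule as get_pool2 above, plus a mutex (my addition) so that
// two concurrent callers cannot replace the pool at the same time.
inline stub_pool &get_stub_pool(std::size_t nthreads=0)
  {
  static std::mutex mut;
  static std::unique_ptr<stub_pool> pool(std::make_unique<stub_pool>(1));
  std::lock_guard<std::mutex> lock(mut);
  if ((nthreads!=0) && (nthreads!=pool->size()))  // resize
    pool = std::make_unique<stub_pool>(nthreads);
  return *pool;  // NOTE: this reference dangles after the next resize
  }
```

The mutex only serializes the replacement itself. A reference returned earlier still becomes invalid as soon as another caller asks for a different size, which is the part I'm least sure about.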

@g-peterbell do you think this could work, or am I missing some subtle multithreading issue? First tests look OK, but with concurrency I'd like to hear a second opinion :)