The code needs to be fast at FFTs, so this issue focuses on FFT performance.
Here are preliminary scaling results for the code itself, obtained for the 1536^3 test cases (attached as scaling.pdf). For this discussion, only the "ftest" line is relevant. My interpretation is that the direct FFTW approach scales quite reasonably.
Procedure for the plot:
- take a snapshot from a 1536^3 DNS, then run four different DNS for 64 time steps with this snapshot as the initial condition (the four runs are summarized in the sketch after this list):
- "ftest": only run Navier Stokes solver
- "ptest-1e5": same as "ftest", but add 10^5 particles, with sampling at every timestep.
- "ptest-2e7": same as "ftest", but add 2 x 10^7 particles, with sampling at every timestep.
- "ptest-2e7-lessiO": same as "ftest", but add 2 x 10^7 particles, with sampling at every 16 timesteps.
The jobs are run using 128, 192, 256, 384 and 512 MPI processes on draco, so whole nodes are always used at full capacity. In fact only "ftest" can run on 512 processes, since the particle code can't run if fewer than 4 z slices per slab are allocated to each MPI process (1536 slices over 512 processes gives only 3 per process).
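To make that constraint concrete, a quick check (assuming the 1536^3 grid is split evenly into z slabs):

```python
# Why 512 processes excludes the particle runs: with the 1536^3 grid
# split into z slabs, each process gets nz // nprocs slices, and the
# particle code needs at least 4 of them.
nz = 1536
for nprocs in [128, 192, 256, 384, 512]:
    slices = nz // nprocs
    status = 'ok' if slices >= 4 else 'too few for particles'
    print(f'{nprocs:4d} processes -> {slices:2d} z slices per slab ({status})')
```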
Afterwards, I read the overall execution time from the output file of each process, average over all processes for each run, and plot the result as a function of the number of processes.
To generate the plot, I am using the file https://gitlab.mpcdf.mpg.de/clalescu/bfps_addons/blob/develop/tests/timing_analyzer.py It's currently set up to work with my peculiar file structure, but I trust the "check_scaling" function is defined clearly enough.
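For reference, here is a minimal sketch of what a check_scaling-style analysis could look like. This is NOT the actual timing_analyzer.py linked above; the file naming and format (one plain-text file per run and process count, one wall time per MPI process) are assumptions for illustration.

```python
# Minimal sketch of a check_scaling-style analysis; NOT the actual
# timing_analyzer.py linked above. The per-run file layout
# ('<run>_<nprocs>.txt', one wall time per MPI process) is an assumption.
import numpy as np
import matplotlib.pyplot as plt

def check_scaling(run_name, proc_counts):
    """Return the mean per-process execution time for each process count."""
    mean_times = []
    for nprocs in proc_counts:
        times = np.loadtxt(f'{run_name}_{nprocs}.txt')  # hypothetical layout
        mean_times.append(times.mean())
    return np.array(mean_times)

base_counts = [128, 192, 256, 384]
for run in ['ftest', 'ptest-1e5', 'ptest-2e7', 'ptest-2e7-lessiO']:
    # only "ftest" has a 512-process data point (see above)
    counts = base_counts + [512] if run == 'ftest' else base_counts
    plt.loglog(counts, check_scaling(run, counts), 'o-', label=run)
plt.xlabel('number of MPI processes')
plt.ylabel('mean execution time [s]')
plt.legend()
plt.savefig('my_scaling_plot.pdf')  # hypothetical output name
```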