Performance Discussion
This page collects the first results, to start the discussion and decide on the direction of the work.
N1536_kMeta2 Using 256 CPUs
Config
./scaling_tests.py --ncpu 256 --src-simname N1536_kMeta2 --src-iteration 8192 --src-wd /hydra/ptmp/khr/bfps/database/ --fluid --p1e5 --p2e7
version 1.7.post45+g6385432.dirty
IO (write)
Note that some parts of the output have been removed. The last process (rank 255) gives us:
[TIMING-255] @code::main_start = 2519.59s
[TIMING-255] @fluid_solver::read = 44.0502s
[TIMING-255] @fluid_solver::compute_velocity = 13.3542s (Min = 0.125834s ; Max = 0.13815s ; Average = 0.134891s ; Occurrence = 99)
[TIMING-255] @fluid_solver::ift_velocity = 125.755s (Min = 1.72731s ; Max = 10.7683s ; Average = 1.90539s ; Occurrence = 66)
[TIMING-255] @particles_io_base::particles_io_base = 2.94317s
[TIMING-255] @rFFTW_distributed_particles::rFFTW_distributed_particles = 0.047429s
[TIMING-255] @rFFTW_distributed_particles::read = 20.0988s
[TIMING-255] @field::compute_stats = 216.914s (Min = 3.08093s ; Max = 3.86351s ; Average = 3.28658s ; Occurrence = 66)
[TIMING-255] @field::cospectrum = 10.5377s (Min = 0.158073s ; Max = 0.180153s ; Average = 0.159663s ; Occurrence = 66)
[TIMING-255] @field::ift = 107.045s (Min = 1.58284s ; Max = 1.73614s ; Average = 1.6219s ; Occurrence = 66)
[TIMING-255] @field::compute_rspace_stats = 99.3301s (Min = 1.30216s ; Max = 1.95447s ; Average = 1.505s ; Occurrence = 66)
[TIMING-255] @MPI_Bcast = 10.2841s (Min = 0.134494s ; Max = 0.247863s ; Average = 0.155819s ; Occurrence = 66)
[TIMING-255] @FIELD_RLOOP = 69.8258s (Min = 1.02012s ; Max = 1.0942s ; Average = 1.05797s ; Occurrence = 66)
[TIMING-255] @MPI_Allreduce = 19.2172s (Min = 0.0614824s ; Max = 0.783903s ; Average = 0.29117s ; Occurrence = 66)
[TIMING-255] @rFFTW_distributed_particles::write2 = 248.374s (Min = 6.31219s ; Max = 33.165s ; Average = 7.30512s ; Occurrence = 34)
[TIMING-255] @MPI_Allreduce = 219.672s (Min = 1.2993e-05s ; Max = 0.345064s ; Average = 6.46095e-05s ; Occurrence = 3400000)
[TIMING-255] @std::copy = 0.00545869s (Min = 2.1e-07s ; Max = 4.514e-06s ; Average = 4.31585e-07s ; Occurrence = 12648)
[TIMING-255] @write_rhs = 26.3912s (Min = 0.000231546s ; Max = 0.164712s ; Average = 0.000263912s ; Occurrence = 100000)
[TIMING-255] @fluid_solver::compute_Lagrangian_acceleration = 376.551s (Min = 11.3228s ; Max = 11.5214s ; Average = 11.4106s ; Occurrence = 33)
[TIMING-255] @fluid_solver::compute_Lagrangian_acceleration = 296.386s (Min = 8.90616s ; Max = 9.05127s ; Average = 8.98139s ; Occurrence = 33)
[TIMING-255] @fluid_solver::compute_velocity = 7.7664s (Min = 0.0877993s ; Max = 0.153871s ; Average = 0.117673s ; Occurrence = 66)
[TIMING-255] @fluid_solver::ift_velocity = 64.1085s (Min = 1.90793s ; Max = 1.95632s ; Average = 1.94268s ; Occurrence = 33)
[TIMING-255] @fluid_solver::compute_pressure = 213.898s (Min = 6.44201s ; Max = 6.55077s ; Average = 6.48176s ; Occurrence = 33)
[TIMING-255] @fluid_solver_base::clean_up_real_space = 0.145752s (Min = 0.00114871s ; Max = 0.00354554s ; Average = 0.00220837s ; Occurrence = 66)
[TIMING-255] @fluid_solver_base::dealias = 72.9218s (Min = 1.09773s ; Max = 1.12227s ; Average = 1.10488s ; Occurrence = 66)
[TIMING-255] @rFFTW_distributed_particles::write = 432.386s (Min = 6.30417s ; Max = 7.74553s ; Average = 6.55131s ; Occurrence = 66)
[TIMING-255] @MPI_Allreduce = 428.053s (Min = 1.54e-05s ; Max = 1.05938s ; Average = 6.48565e-05s ; Occurrence = 6600000)
[TIMING-255] @fluid_solver::step = 758.14s (Min = 23.6557s ; Max = 23.7339s ; Average = 23.6919s ; Occurrence = 32)
[TIMING-255] @fluid_solver::omega_nonlin = 724.025s (Min = 6.99106s ; Max = 7.86799s ; Average = 7.54193s ; Occurrence = 96)
[TIMING-255] @fluid_solver::compute_velocity = 11.5398s (Min = 0.105677s ; Max = 0.136042s ; Average = 0.120206s ; Occurrence = 96)
[TIMING-255] @fluid_solver::omega_nonlin::fftw = 405.633s (Min = 3.66764s ; Max = 4.54959s ; Average = 4.22534s ; Occurrence = 96)
[TIMING-255] @fluid_solver::omega_nonlin::RLOOP = 7.43167s (Min = 0.0762171s ; Max = 0.0804303s ; Average = 0.0774132s ; Occurrence = 96)
[TIMING-255] @fluid_solver::omega_nonlin::fftw-2 = 170.379s (Min = 1.76048s ; Max = 1.78986s ; Average = 1.77478s ; Occurrence = 96)
[TIMING-255] @fluid_solver_base::dealias = 106.051s (Min = 1.10247s ; Max = 1.12183s ; Average = 1.1047s ; Occurrence = 96)
[TIMING-255] @fluid_solver::omega_nonlin::CLOOP = 7.8091s (Min = 0.0796459s ; Max = 0.0867639s ; Average = 0.0813448s ; Occurrence = 96)
[TIMING-255] @fluid_solver::omega_nonlin::add_forcing = 0.000439908s (Min = 3.099e-06s ; Max = 2.6225e-05s ; Average = 4.58237e-06s ; Occurrence = 96)
[TIMING-255] @fluid_solver::write = 44.8473s
A first observation from this result is that the FFTW calls account for a large part of the execution time. But here I would like to focus on the I/O.
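Before moving on to the I/O, the FFTW observation can be checked with a quick back-of-the-envelope calculation, summing the two FFTW timers inside omega_nonlin and comparing them to the total time of fluid_solver::step (numbers taken from the rank-255 output above):

```python
# Timings from the rank-255 output above, in seconds.
step_total = 758.14   # fluid_solver::step
fftw_1 = 405.633      # fluid_solver::omega_nonlin::fftw
fftw_2 = 170.379      # fluid_solver::omega_nonlin::fftw-2

# Fraction of the time stepping spent in FFTW transforms.
fftw_share = (fftw_1 + fftw_2) / step_total
print(f"FFTW share of fluid_solver::step: {fftw_share:.0%}")  # about 76%
```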
So we have:
[TIMING-255] @rFFTW_distributed_particles::write2 = 248.374s (Min = 6.31219s ; Max = 33.165s ; Average = 7.30512s ; Occurrence = 34)
[TIMING-255] @MPI_Allreduce = 219.672s (Min = 1.2993e-05s ; Max = 0.345064s ; Average = 6.46095e-05s ; Occurrence = 3400000)
[TIMING-255] @std::copy = 0.00545869s (Min = 2.1e-07s ; Max = 4.514e-06s ; Average = 4.31585e-07s ; Occurrence = 12648)
[TIMING-255] @write_rhs = 26.3912s (Min = 0.000231546s ; Max = 0.164712s ; Average = 0.000263912s ; Occurrence = 100000)
[TIMING-255] @rFFTW_distributed_particles::write = 432.386s (Min = 6.30417s ; Max = 7.74553s ; Average = 6.55131s ; Occurrence = 66)
[TIMING-255] @MPI_Allreduce = 428.053s (Min = 1.54e-05s ; Max = 1.05938s ; Average = 6.48565e-05s ; Occurrence = 6600000)
We can compare this output with that of the first process (rank 0), because these write functions behave differently on the root process than on the others.
[TIMING-0] @rFFTW_distributed_particles::write2 = 248.148s (Min = 6.3105s ; Max = 33.165s ; Average = 7.29846s ; Occurrence = 34)
[TIMING-0] @MPI_Allreduce = 55.3869s (Min = 1.5148e-05s ; Max = 0.00178043s ; Average = 1.62903e-05s ; Occurrence = 3400000)
[TIMING-0] @write_state_chunk = 158.072s (Min = 4.1521e-05s ; Max = 0.345009s ; Average = 4.64919e-05s ; Occurrence = 3400000)
[TIMING-0] @particles_io_base::write_state_chunk = 154.6s (Min = 4.0557e-05s ; Max = 0.345007s ; Average = 4.54707e-05s ; Occurrence = 3400000)
[TIMING-0] @std::copy = 0.00690804s (Min = 2.6e-07s ; Max = 4.824e-06s ; Average = 5.43256e-07s ; Occurrence = 12716)
[TIMING-0] @write_rhs = 26.5682s (Min = 0.000247992s ; Max = 0.184024s ; Average = 0.000265682s ; Occurrence = 100000)
[TIMING-0] @rFFTW_distributed_particles::write = 432.388s (Min = 6.30419s ; Max = 7.74556s ; Average = 6.55134s ; Occurrence = 66)
[TIMING-0] @MPI_Allreduce = 107.475s (Min = 1.5153e-05s ; Max = 0.00127309s ; Average = 1.62841e-05s ; Occurrence = 6600000)
[TIMING-0] @write_point3D_chunk = 310.065s (Min = 4.1581e-05s ; Max = 1.05933s ; Average = 4.69796e-05s ; Occurrence = 6600000)
So MPI_Allreduce is called a very large number of times (6,600,000 times here). Each individual call is cheap, as the timings on process 0 show, but because the write operations on process 0 are slow, all the other processes end up waiting for it inside MPI_Allreduce. The real problem is therefore that all processes must synchronize 6,600,000 times while process 0 performs a write between consecutive calls.
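This waiting interpretation can be checked from the rFFTW_distributed_particles::write timers above: the extra MPI_Allreduce time on rank 255 compared to rank 0 should roughly match the time rank 0 spends writing in write_point3D_chunk:

```python
# Timings from the logs above (rFFTW_distributed_particles::write), in seconds.
allreduce_rank255 = 428.053   # MPI_Allreduce on rank 255
allreduce_rank0 = 107.475     # MPI_Allreduce on rank 0
write_chunk_rank0 = 310.065   # write_point3D_chunk on rank 0

# Extra time rank 255 spends blocked in MPI_Allreduce compared to rank 0.
waiting = allreduce_rank255 - allreduce_rank0
print(f"Estimated waiting time on rank 255: {waiting:.1f}s")
print(f"Write time on rank 0:               {write_chunk_rank0:.1f}s")
```

The two numbers agree to within a few percent, consistent with rank 255 spending almost all of its MPI_Allreduce time waiting for rank 0 to finish writing.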
Merged time
We also have the following merged information from the output:
[MPI-TIMING] @MPI_Allreduce
[MPI-TIMING] Stack => rFFTW_distributed_particles::write << code::main_start << BFPS <<
[MPI-TIMING] Done by 256 processes
[MPI-TIMING] Total time for all 110289s (average per process 430.817s)
[MPI-TIMING] Min time for a process 107.475s Max time for a process 436.467s
[MPI-TIMING] The same call has been done 1683000107 times by all process (duration min 9.959e-06s max 1.05939s avg 6.55313e-05s)
[MPI-TIMING] @MPI_Allreduce
[MPI-TIMING] Stack => rFFTW_distributed_particles::write2 << code::main_start << BFPS <<
[MPI-TIMING] Done by 256 processes
[MPI-TIMING] Total time for all 56085s (average per process 219.082s)
[MPI-TIMING] Min time for a process 55.3869s Max time for a process 219.854s
[MPI-TIMING] The same call has been done 867000055 times by all process (duration min 1.0033e-05s max 0.345071s avg 6.46886e-05s)
Question
Improving the I/O could reduce the writing time on P0, reduce the synchronization cost, and thus shorten the complete operation. But is it really necessary to merge the results on P0? Why not use parallel output?
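The core of a parallel-output scheme is that each process computes its own byte offset (an exclusive prefix sum over the local particle counts, as MPI_Exscan would give) and writes its chunk directly at that offset, which is what MPI-IO (e.g. MPI_File_write_at_all) or parallel HDF5 provide. The following single-process Python sketch simulates that offset logic with hypothetical per-rank counts; it is an illustration of the idea, not bfps code:

```python
import io

def exclusive_prefix_sum(counts):
    """Offset of each rank = sum of the counts of all lower ranks (like MPI_Exscan)."""
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

# Hypothetical per-rank particle counts; in a real run these come from the solver.
counts = [3, 1, 4, 2]
offsets, total = exclusive_prefix_sum(counts)

# Simulate a shared file: each "rank" writes its chunk at its own offset,
# independently of the others -- no gather onto rank 0 is needed.
shared = io.BytesIO(b"\x00" * total)
for rank, (off, n) in enumerate(zip(offsets, counts)):
    shared.seek(off)
    shared.write(bytes([rank]) * n)  # each rank writes only its own data
```

With such a scheme the per-chunk synchronization disappears: the only collective step is the prefix sum over the counts, done once per write.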