In d6129932 I added back a call to std::unique_ptr::release(). I think that was the problem for NSVEparticles; I will test soon.
The reason I had commented out the call to release() is that at some point in the past, when I was compiling with -fsanitize=address, that line of code generated a "double-free" error, as far as I can remember.
On the other hand, when I use -fsanitize=address now, even symmetrize_test fails, which is... not informative.
So I'm going to leave the release() there. "tests/run_all_tests.sh" runs just fine with or without the release() call (technically it's just bfps.test_NSVEparticles that's affected).
cobra is still showing increasing memory over the duration of the DNS, even with the call to release().
@mjr: do you have any suggestions other than -fsanitize=address? At some point I think the code worked with this flag, but it started failing with long messages involving MPI. Since I know that MPI routinely breaks valgrind, I just gave up on trying to use any sort of memory checker.
I'm not sure if the flag became unusable at a particular update of bfps, or after some upgrade of gcc. There was a lot going on.
My workaround for now is to run the code on more MPI processes, so that the total available memory is enough.
I assume the way forward is a proper git bisect; that should at least tell us whether -fsanitize=address stopped working because of something we did.
Otherwise, we need a way to monitor memory usage that doesn't crash due to weird MPI things, and then we can do the bisect and figure out the bug.
I opened the branch "bugfix/code-slowdown" to work on this.
For tiny 16^3 or 32^3 DNS on my desktop and laptop, not only does memory usage increase, but there is also a significant overall slowdown of the code.
For such small jobs I can comfortably run ~1000 iterations for tests, and I see a linear growth of the execution time per time-step as well as a linear growth of the memory used.
At the moment I assume the two problems are related.
Steps I am taking: on my laptop I am recompiling everything to use MPICH instead of OpenMPI, to see whether this makes a difference to the test. I will also try -fsanitize=address again with MPICH; maybe it's better behaved.
valgrind didn't give me anything useful; it just complained about MPI itself leaking memory.
Found the "problem": I always compile with the timing turned on.
The timing code allocates new memory every time some timed code is called, hence memory usage grows linearly with time.
I assume the slowdown comes from the code then having to traverse a much bigger address space in its loops etc.
If I turn off the timing feature, there is no memory growth or slow-down.
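For the record, the mechanism is something like this (a minimal sketch of the pattern, not the actual bfps timing code; timing_record and record_timing are made-up names):

    #include <string>
    #include <vector>

    // hypothetical event log in the spirit of the bfps timing feature:
    // every timed call appends a record, so the log grows linearly
    // with the number of time-steps
    struct timing_record {
        std::string name;
        double duration;
    };

    static std::vector<timing_record> timing_log;

    void record_timing(const std::string &name, double duration)
    {
        // one allocation per timed call, never freed until program exit
        timing_log.push_back({name, duration});
    }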
The problem is still there for the particle code: memory grows linearly even with the timing functionality turned off.
For future reference: a useful tool to easily see this is psrecord, described in this post: https://unix.stackexchange.com/a/414770
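Usage is simple; for example (12345 is a hypothetical PID):

    psrecord 12345 --interval 1 --plot memory.png

samples the CPU and memory usage of process 12345 once per second and plots the result when recording stops.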
I think I fixed this as of commit 05c44eca, but I'm not sure. My latest test, a 4-process job with 10^4 particles and ~4000 time-steps, shows that 3 out of 4 processes have a more or less fixed memory footprint, while process 0 grows approximately linearly over the first half of the run and then switches to much slower growth.
This looks more like some caching by HDF5, plus the reasonable "increase buffers when needed" behavior of our own output code. By the way, for 10^4 particles only one process does the output, since that's more efficient.
The problem was with using unique_ptr to manage arrays: I was calling release(), but we also needed to delete the result of release(). See the description of unique_ptr on cppreference: https://en.cppreference.com/w/cpp/memory/unique_ptr
At the moment I'm not convinced that our use of unique_ptr is correct (we still use the reset() method, and I don't know if the default deleter of unique_ptr handles arrays properly). I plan to make a minimal failing example, without MPI, where I can use -fsanitize=address and figure out if there's any problem left.
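Something like the following is what I have in mind for that minimal example (just a sketch for now; compiled with g++ -fsanitize=address, the first function should be reported as a leak and the second should be clean):

    #include <memory>

    void leaks()
    {
        std::unique_ptr<double[]> buffer(new double[1024]);
        // release() gives up ownership and returns the raw pointer;
        // discarding the return value means nobody calls delete[],
        // so AddressSanitizer reports this allocation as leaked
        buffer.release();
    }

    void does_not_leak()
    {
        std::unique_ptr<double[]> buffer(new double[1024]);
        // deleting the released pointer explicitly is fine
        delete[] buffer.release();
    }

    int main()
    {
        leaks();
        does_not_leak();
        return 0;
    }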
OK, so I'm confident I now understand how unique_ptr works.
We have two options:
1. allocate stuff in it and then do nothing about it, because its destructor will clean up the memory;
2. allocate stuff in it, then call delete[] obj.release() to make the deallocation explicit.
At the moment bfps does 2. Before, it did a broken version of 2, where it called .release() and did nothing with the result (the compiler did not warn me about this, even though I always use -Wall).
I think what we have now is fine, but we should probably document the details of using unique_ptr, since it is required for particle sampling, and that is one of the main things future users will be doing in their own custom extensions.
@mjr, @bbramas, can you please comment on this?
Once I get your OK, I will merge the "bugfix/psample-memory-leak" branch back into develop.
What happens if you do not call "delete[] release()"? When the unique pointer is destroyed (at the end of the scope) it should call delete automatically: if you created a unique_ptr<Obj>, it calls "delete ptr", and if you created a unique_ptr<Obj[]>, it calls "delete[] ptr" to delete the array.
(You may already have described this, but I am a little lost in all the messages.)
As I said in option 1 of my previous message, I believe that not calling obj.release() at all is fine, and the memory will be deallocated by the unique_ptr destructor. But if we do call it, we need to put the delete[] in front.
I've tested this with a small code that I attach here:
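(A sketch of the idea, not the attachment itself: both variants below should run clean under -fsanitize=address.)

    #include <memory>

    int main()
    {
        // option 1: let the unique_ptr destructor do the deallocation;
        // unique_ptr<double[]> calls delete[] by default
        {
            std::unique_ptr<double[]> a(new double[1 << 20]);
            a[0] = 1.0;
        }   // delete[] happens automatically here

        // option 2: make the deallocation explicit
        {
            std::unique_ptr<double[]> b(new double[1 << 20]);
            b[0] = 2.0;
            delete[] b.release();
        }
        return 0;
    }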