First attempt to fix the bug that was rolled back in 402629d9.
The current fix uses blocking wherever possible, which should be beneficial for both computation and communication. Additionally, a second possible fix was added that simply calls the blocked version when the local matrix is sufficiently large. This might produce more, and smaller, messages at scale.
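
The second option, a size-based dispatch, can be illustrated roughly as follows. This is only a sketch under assumptions: the commit does not name the kernel, so a plain matrix product stands in for it, and `BLOCK`, `MIN_LOCAL_DIM`, and all function names are hypothetical and do not come from the code base.

```python
BLOCK = 2          # hypothetical tile width
MIN_LOCAL_DIM = 4  # hypothetical threshold for "sufficient size"


def multiply_unblocked(a, b):
    """Reference triple loop; stand-in for the unblocked code path."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def multiply_blocked(a, b, block=BLOCK):
    """Same product, computed tile by tile; stand-in for the blocked path."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Accumulate the contribution of one tile pair.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        for p in range(p0, min(p0 + block, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c


def multiply(a, b, min_local_dim=MIN_LOCAL_DIM):
    """Call the blocked version only if the local matrix is large enough."""
    if min(len(a), len(b[0])) >= min_local_dim:
        return multiply_blocked(a, b)
    return multiply_unblocked(a, b)
```

Both paths return the same result; only the loop structure differs. In a distributed setting the threshold would presumably be checked against the local part of the matrix on each rank, which is why small local matrices could lead to more, smaller messages.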
Showing with 86 additions and 78 deletions