    first attempt to fix the bug which was rolled back in 402629d9.
    Alexander Heinecke
    The current fix does as much blocking as possible, which should
    benefit both computation and communication.
    Additionally, a second possible fix was added, which simply calls
    the blocked version if the local matrix is sufficiently large.
    This might produce more, smaller messages at scale.