## WIP: Made _sumup of KL faster by summing up in parallel when possible

Marked as work in progress because

- has to be tested whether it is really faster
- Speed could in theory be even more increased if one uses the numpy versions of communication, i.e. Send and Recv

The basic idea of this change is to sum up the KL in a tree-type structure. This goes as follows. Suppose we have this list:

`1 2 3 4 5 6`

Then we combine the numbers pairwise:

```
1 2 3 4 5 6
| / | / | /
3 7 11
```

We then iterate this procedure:

```
3 7 11
| / |
10 11
```

Here the 11 was at the end of the list, and was thus not summed. Finally we get

```
10 11
| /
21
```

This procedure converges in O(ln(N)) steps, if N is the length of the list. Suppose the objects of the list are distributed over different MPI tasks, then in every step we need at most N/2 communications. However, because the range of numbers is shared consecutively, every task needs to communicate at most 2 times per iteration, so most communications are done in parallel. The method is organized such that the result of a sum is always computed by the higher ranked tasks. i.e. if the list is distributed over 4 tasks as follows:

`1 1 2 2 3 4`

Then the ranks that hold the sums evolve as follows:

```
1 1 2 3 3 4
| / | / | /
1 3 4
```

So far two communication were needed, with task 3 needing to communicate twice. It continues as

```
1 3 4
| / |
3 4
| /
| /
4
```

with two more necessary communications. In the end task 4 broadcasts the result.