diff --git a/docs/source/ift.rst b/docs/source/ift.rst
index c18a93c1a49ecc1d8c3960c3139050e48c91c371..be5cce8eacaa543857f2e1ea46e514ee36a7b6ca 100644
--- a/docs/source/ift.rst
+++ b/docs/source/ift.rst
@@ -200,10 +200,11 @@ The reconstruction of a non-Gaussian signal with unknown covariance from a non-t
 | orange line).                                      |
 +----------------------------------------------------+
 
-Variational Inference
----------------------
+Maximum a Posteriori
+--------------------
+
+One popular field estimation method is Maximum a Posteriori (MAP).
 
-One popular field estimation method is MAP.
 It only requires to minimize the information Hamiltonian, e.g by a gradient descent method that stops when
 
 .. math::
@@ -219,7 +220,12 @@ A MAP estimate, which is only representative for a tiny fraction of the paramete
 
 This causes MAP signal estimates to be more prone to overfitting the noise as well as to perception thresholds than methods that take volume effects into account.
 
-One such method is VI. In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.
+
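+For concreteness, a MAP estimate might be computed in NIFTy5 roughly as follows.
+This is only a minimal sketch: ``likelihood`` stands for an already constructed likelihood energy, the controller settings are illustrative, and exact signatures may differ between NIFTy versions.
+
+.. code-block:: python
+
+    import nifty5 as ift
+
+    # Information Hamiltonian H(d, xi) = likelihood + standardized Gaussian prior
+    H = ift.StandardHamiltonian(likelihood)
+
+    # Gradient-based minimization of H with a fixed iteration budget
+    ic_newton = ift.GradientNormController(name='Newton', iteration_limit=30)
+    minimizer = ift.NewtonCG(ic_newton)
+
+    initial_position = ift.from_random('normal', H.domain)
+    energy = ift.EnergyAdapter(initial_position, H, want_metric=True)
+    energy, convergence = minimizer(energy)
+
+    xi_map = energy.position  # MAP estimate of the latent parameters xi
+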
+Variational Inference
+---------------------
+
+One method that takes volume effects into account is Variational Inference (VI).
+In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.
 The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and its uncertainty dispersion :math:`D` are obtained by minimization of an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.
 As a compromise between being optimal and being computational affordable the (reverse) Kullbach Leiberler (KL) divergence is used in VI:
 
@@ -229,13 +235,14 @@ As a compromise between being optimal and being computational affordable the (re
     \int \mathcal{D}\xi \,\mathcal{Q}(\xi) \log \left( \frac{\mathcal{Q}(\xi)}{\mathcal{P}(\xi)} \right)
 
 Minimizing this with respect to all entries of the covariance :math:`D` is unfeasible for fields.
-Therefore, MGVI makes the Ansatz to approximate the precision matrix :math:`M=D^{-1}` by the Bayesian Fisher information metric,
+Therefore, Metric Gaussian Variational Inference (MGVI) makes the Ansatz of approximating the precision matrix :math:`M=D^{-1}` by the Bayesian Fisher information metric,
 
 .. math::
 
     M \approx \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \, \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}^\dagger \right\rangle_{(d,\xi)},
 
-where  in the MGVI practice the average is performed over :math:`\mathcal{Q}` by evaluating the expression at samples drawn from this Gaussian.
+where, in the practice of MGVI, the average is performed over :math:`\mathcal{P}(d,\xi)\approx \mathcal{P}(d|\xi)\,\mathcal{Q}(\xi)` by evaluating the expression at :math:`\xi` samples drawn from the Gaussian :math:`\mathcal{Q}(\xi)` and corresponding data samples drawn from their generative process :math:`\mathcal{P}(d|\xi)`.
+
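+As a purely illustrative toy example (not NIFTy code; the quadratic Hamiltonian and the unit-noise data model below are arbitrary assumptions), this sample average can be pictured as follows:
+
+.. code-block:: python
+
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+    n_samples, n_dim = 1000, 5
+
+    def grad_H(d, xi):
+        # toy Hamiltonian H(d, xi) = 0.5*|d - xi|^2, hence dH/dxi = xi - d
+        return xi - d
+
+    m = np.zeros(n_dim)       # current mean of Q
+    D = np.eye(n_dim)         # current covariance of Q
+
+    M = np.zeros((n_dim, n_dim))
+    for _ in range(n_samples):
+        xi = rng.multivariate_normal(m, D)  # xi drawn from Q(xi)
+        d = xi + rng.normal(size=n_dim)     # d drawn from P(d|xi)
+        g = grad_H(d, xi)
+        M += np.outer(g, g)
+    M /= n_samples  # M approximates the average of the gradient outer products
+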
 With this approximation, the KL becomes effectively a function of the mean  :math:`m`, as :math:`D= D(m) \approx M^{-1}`. Thus, only the gradient of the KL is needed with respect to this, which can be expressed as
 
 .. math::
@@ -243,10 +250,59 @@ With this approximation, the KL becomes effectively a function of the mean  :mat
     \frac{\partial \mathrm{KL}(m|d)}{\partial m} = \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}  \right\rangle_{\mathcal{G}(\xi-m,D)}.
 
 The advantage of this Ansatz is that the averages can be represented by sample averages, and all the gradients are represented by operators that NIFTy5 can calculate and that do not need the storage of full matrices. Therefore, NIFTy5 is able to draw samples according to a Gaussian with a covariance given by the inverse information metric, and to minimize the KL correspondingly.
-As this requires stochastic optimization, the parameters governing the numerics might need problem specific tuning.
+Setting up a KL for MGVI is done via objects of the class ``MetricGaussianKL``.
+
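+A minimal sketch of such a setup, following the pattern of the NIFTy5 demos (``likelihood`` is again a placeholder for an already constructed likelihood energy, and all numerical settings are illustrative):
+
+.. code-block:: python
+
+    import nifty5 as ift
+
+    # iteration controller used for drawing samples from the metric
+    ic_sampling = ift.GradientNormController(iteration_limit=100)
+    H = ift.StandardHamiltonian(likelihood, ic_sampling)
+
+    mean = ift.MultiField.full(H.domain, 0.)  # initial expansion point m
+    N_samples = 5
+
+    # KL estimate around `mean`, based on N_samples samples xi ~ Q(xi)
+    KL = ift.MetricGaussianKL(mean, H, N_samples, mirror_samples=True)
+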
+It should be noted that ``MetricGaussianKL`` does not estimate the full KL: within the MGVI approximation, the KL is only optimized with respect to the posterior mean, and any additive part of the KL that does not depend on it is dropped. It turns out that a metric Gaussian average of the original information Hamiltonian contains all necessary dependencies:
+
+.. math::
+
+    \mathrm{KL}(m|d) = \left\langle - \mathcal{H}_\mathcal{Q}(\xi|m) + \mathcal{H}(\xi|d) \right\rangle_{\mathcal{Q}(\xi)}
+
+where
+
+.. math::
+
+    \mathcal{H}_\mathcal{Q}(\xi|m) = - \log \mathcal{Q}(\xi) =
+    - \log \mathcal{G}(\xi-m,D)
+
+is the information Hamiltonian of the approximating Gaussian and
+
+.. math::
+
+    \mathcal{H}(\xi|d) = - \log \mathcal{P}(\xi|d) =
+    - \log \left( \frac{\mathcal{P}(d,\xi)}{\mathcal{P}(d)} \right) =
+    \mathcal{H}(d,\xi) - \mathcal{H}(d)
+
+the posterior information Hamiltonian.
+
+Since neither
+
+.. math::
+
+    \left\langle \mathcal{H}_\mathcal{Q}(\xi|m) \right\rangle_{\mathcal{Q}(\xi)} =
+    \frac{1}{2} \log \left| 2\pi e D \right|
+
+nor
+
+.. math::
+
+    \left\langle \mathcal{H}(d) \right\rangle_{\mathcal{Q}(\xi)} =
+    \mathcal{H}(d)
+
+depend directly on :math:`m`, they are dropped from the KL to be minimized. What remains is
+
+.. math::
+
+    \mathrm{KL}(m|d) \;\widehat{=}\;
+    \left\langle \mathcal{H}(d,\xi) \right\rangle_{\mathcal{Q}(\xi)},
+
+where :math:`\widehat{=}` expresses equality up to irrelevant (here not :math:`m`-dependent) terms.
+The fact that the KL also depends on :math:`m` in a second, indirect way through :math:`D=D(m)` is ignored in the MGVI approach. This can often be justified, as the uncertainty is usually mostly determined by the :math:`m`-independent measurement setup, so that the variation of the uncertainty with the posterior mean is expected to be subdominant and of moderate importance for the KL.
+
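+Continuing the sketch above, the MGVI iteration alternates between drawing new samples around the current mean and optimizing the KL with respect to it (the number of iterations and the controller settings are again illustrative assumptions):
+
+.. code-block:: python
+
+    minimizer = ift.NewtonCG(
+        ift.GradientNormController(name='Newton', iteration_limit=35))
+
+    for _ in range(5):
+        KL = ift.MetricGaussianKL(mean, H, N_samples, mirror_samples=True)
+        KL, convergence = minimizer(KL)  # optimize the KL with respect to m
+        mean = KL.position               # new expansion point, D = D(m)
+
+    # posterior statistics estimated from the final set of samples
+    sc = ift.StatCalculator()
+    for sample in KL.samples:
+        sc.add(mean + sample)
+    posterior_mean, posterior_var = sc.mean, sc.var
+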
 The demo getting_started_3.py for example infers this way not only a field, but also the power spectrum of the process that has generated the field.
 The cross-correlation of field and power spectum is taken care of thereby.
 Posterior samples can be obtained to study this cross-correlation.
 
 It should be noted that MGVI as any VI method typically underestimates uncertainties due to the fact that :math:`\mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})`, the reverse KL, is used, whereas :math:`\mathcal{D}_\mathrm{KL}(\mathcal{P}||\mathcal{Q})` would be optimal to approximate  :math:`\mathcal{P}` by  :math:`\mathcal{Q}` from an information theoretical perspective.
-This, however, would require that one is  able to integrate the posterior, in wich case one can calculate the desired posterior mean and its uncertainty covariance directly.
+This, however, would require that one is able to integrate the posterior, in which case one could calculate the desired posterior mean and its uncertainty covariance directly and would therefore not need to perform VI.
diff --git a/nifty5/operators/energy_operators.py b/nifty5/operators/energy_operators.py
index c6dcd5578b8941f98e562adf6376ae91a1abe839..a09cd563f11e7f7ac6447b12bd97457104173ce2 100644
--- a/nifty5/operators/energy_operators.py
+++ b/nifty5/operators/energy_operators.py
@@ -318,49 +318,21 @@ class StandardHamiltonian(EnergyOperator):
 
 
 class AveragedEnergy(EnergyOperator):
-    """Computes Kullback-Leibler (KL) divergence or Gibbs free energies.
-
-    A sample-averaged energy, e.g. a Hamiltonian, approximates the relevant
-    part of a KL to be used in Variational Bayes inference if the samples are
-    drawn from the approximating Gaussian:
-
-    .. math ::
-        \\text{KL}(m) = \\frac1{\\#\\{v_i\\}} \\sum_{v_i} H(m+v_i),
-
-    where :math:`v_i` are the residual samples and :math:`m` is the mean field
-    around which the samples are drawn.
+    """Averages an energy over samples
 
     Parameters
     ----------
     h: Hamiltonian
        The energy to be averaged.
     res_samples : iterable of Fields
-       Set of residual sample points to be added to mean field for approximate
-       estimation of the KL.
+       Set of residual sample points to be added to mean field for
+       approximate estimation of the KL.
 
     Note
     ----
-    Having symmetrized residual samples, with both v_i and -v_i being present
-    ensures that the distribution mean is exactly represented. This reduces
-    sampling noise and helps the numerics of the KL minimization process in the
-    variational Bayes inference.
-
-    See also
-    --------
-    Let :math:`Q(f) = G(f-m,D)` be the Gaussian distribution
-    which is used to approximate the accurate posterior :math:`P(f|d)` with
-    information Hamiltonian
-    :math:`H(d,f) = -\\log P(d,f) = -\\log P(f|d) + \\text{const}`. In
-    Variational Bayes one needs to optimize the KL divergence between those
-    two distributions for m. It is:
-
-    :math:`KL(Q,P) = \\int Df Q(f) \\log Q(f)/P(f)\\\\
-    = \\left< \\log Q(f) \\right>_Q(f) - \\left< \\log P(f) \\right>_Q(f)\\\\
-    = \\text{const} + \\left< H(f) \\right>_G(f-m,D)`
-
-    in essence the information Hamiltonian averaged over a Gaussian
-    distribution centered on the mean m.
-
+    Having symmetrized residual samples, with both v_i and -v_i being
+    present, ensures that the distribution mean is exactly represented.
+
     :class:`AveragedEnergy(h)` approximates
     :math:`\\left< H(f) \\right>_{G(f-m,D)}` if the residuals
     :math:`f-m` are drawn from a Gaussian distribution with covariance