Commit 48e16705 authored by Torsten Ensslin

Merge branch 'docu_fix_te' into 'NIFTy_5'

Docu fix te

See merge request ift/nifty-dev!192
parents 60d0bba7 97f602f4
@@ -200,10 +200,11 @@ The reconstruction of a non-Gaussian signal with unknown covariance from a non-t
| orange line). |
+----------------------------------------------------+
Maximum a Posteriori
--------------------

One popular field estimation method is Maximum a Posteriori (MAP).
It only requires minimizing the information Hamiltonian, e.g. by a gradient descent method that stops when

.. math::

    \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} = 0.
@@ -219,7 +220,12 @@ A MAP estimate, which is only representative for a tiny fraction of the paramete
This causes MAP signal estimates to be more prone to overfitting the noise as well as to perception thresholds than methods that take volume effects into account.
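As an illustration, a MAP estimate can be computed in NIFTy5 roughly as sketched below. This sketch is not part of the original documentation; it follows the pattern of the NIFTy5 demos, ``likelihood`` stands for an assumed likelihood energy operator of a generative model, and the controller settings are placeholders.

.. code-block:: python

    import nifty5 as ift

    # `likelihood` is assumed to be an EnergyOperator for -log P(d|xi);
    # StandardHamiltonian adds the standard Gaussian prior on xi.
    H = ift.StandardHamiltonian(likelihood)

    minimizer = ift.NewtonCG(ift.GradientNormController(iteration_limit=30))

    # Start from a random latent position and descend on H(d, xi).
    position = ift.from_random('normal', H.domain)
    E = ift.EnergyAdapter(position, H, want_metric=True)
    E, convergence = minimizer(E)
    xi_map = E.position  # MAP estimate of the latent parameters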
Variational Inference
---------------------

One method that takes volume effects into account is Variational Inference (VI).
In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.
The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and its uncertainty dispersion :math:`D`, are obtained by minimizing an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.
As a compromise between being optimal and being computationally affordable, the (reverse) Kullback-Leibler (KL) divergence is used in VI:
@@ -229,13 +235,14 @@ As a compromise between being optimal and being computational affordable the (re
\int \mathcal{D}\xi \,\mathcal{Q}(\xi) \log \left( \frac{\mathcal{Q}(\xi)}{\mathcal{P}(\xi)} \right)
Minimizing this with respect to all entries of the covariance :math:`D` is infeasible for fields.
Therefore, Metric Gaussian Variational Inference (MGVI) makes the Ansatz to approximate the precision matrix :math:`M=D^{-1}` by the Bayesian Fisher information metric,
.. math::
M \approx \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \, \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}^\dagger \right\rangle_{(d,\xi)},
where in MGVI practice the average is performed over :math:`\mathcal{P}(d,\xi)\approx \mathcal{P}(d|\xi)\,\mathcal{Q}(\xi)` by evaluating the expression at :math:`\xi` samples drawn from the Gaussian :math:`\mathcal{Q}(\xi)` and corresponding data samples drawn from their generative process :math:`\mathcal{P}(d|\xi)`.
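As a concrete special case (an illustration added here, not spelled out in the original text; the symbols :math:`R`, :math:`N` and :math:`J` are introduced only for this example): for a Gaussian likelihood :math:`d = R(\xi) + n` with noise covariance :math:`N` and the standardized Gaussian prior on :math:`\xi`, the information Hamiltonian is :math:`\mathcal{H}(d,\xi) = \frac{1}{2}(d-R(\xi))^\dagger N^{-1}(d-R(\xi)) + \frac{1}{2}\xi^\dagger\xi + \mathrm{const}`, and the metric employed at a sample position :math:`\xi` then takes the form

.. math::

    M(\xi) \approx J^\dagger N^{-1} J + \mathbb{1},
    \qquad J = \left.\frac{\partial R(\xi')}{\partial \xi'}\right|_{\xi'=\xi},

i.e. the likelihood Fisher metric pulled back through the local Jacobian of the generative model, plus the identity contributed by the standard Gaussian prior.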
With this approximation, the KL becomes effectively a function of the mean :math:`m`, as :math:`D = D(m) \approx M^{-1}`. Thus, only the gradient of the KL with respect to :math:`m` is needed, which can be expressed as
.. math::
@@ -243,10 +250,59 @@ With this approximation, the KL becomes effectively a function of the mean :mat
\frac{\partial \mathrm{KL}(m|d)}{\partial m} = \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \right\rangle_{\mathcal{G}(\xi-m,D)}.
The advantage of this Ansatz is that the averages can be represented by sample averages, and all the gradients are represented by operators that NIFTy5 can calculate and that do not require storing full matrices. Therefore, NIFTy5 is able to draw samples according to a Gaussian with a covariance given by the inverse information metric, and to minimize the KL correspondingly.
As this requires stochastic optimization, the parameters governing the numerics might need problem-specific tuning.
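Written out as a sample average (this explicit form is an illustration added here, with :math:`N` denoting the number of residual samples :math:`v_i` drawn from :math:`\mathcal{G}(v,D)`), the gradient estimate used in the stochastic optimization reads

.. math::

    \frac{\partial \mathrm{KL}(m|d)}{\partial m} \approx
    \frac{1}{N} \sum_{i=1}^{N} \left. \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \right|_{\xi = m + v_i},
    \qquad v_i \sim \mathcal{G}(v,D).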
Setting up a KL for MGVI is done via objects of the class MetricGaussianKL.
It should be noted that MetricGaussianKL does not estimate the full KL: within the MGVI approximation, the KL is only optimized with respect to the posterior mean, and any other additive part of the KL is dropped. It turns out that a metric Gaussian average of the original information Hamiltonian contains all necessary dependencies:
.. math::
\mathrm{KL}(m|d) = \left\langle - \mathcal{H}_\mathcal{Q}(\xi|m) + \mathcal{H}(\xi|d)
\right\rangle_{\mathcal{Q}(\xi)}
where
.. math::
\mathcal{H}_\mathcal{Q}(\xi|m) = - \log \mathcal{Q}(\xi) =
- \log \mathcal{G}(\xi-m,D)
is the information Hamiltonian of the approximating Gaussian and
.. math::
\mathcal{H}(\xi|d) = - \log \mathcal{P}(\xi|d) =
- \log \left( \frac{\mathcal{P}(d,\xi)}{\mathcal{P}(d)} \right) =
\mathcal{H}(d,\xi) - \mathcal{H}(d)
the posterior information Hamiltonian.
Since neither
.. math::
\left\langle \mathcal{H}_\mathcal{Q}(\xi|m) \right\rangle_{\mathcal{Q}(\xi)} =
\frac{1}{2} \log \left| 2\pi e D \right|
nor
.. math::
\left\langle \mathcal{H}(d) \right\rangle_{\mathcal{Q}(\xi)} =
\mathcal{H}(d)
depend directly on :math:`m`, they are dropped from the KL to be minimized. What remains is
.. math::
\mathrm{KL}(m|d) \;\widehat{=}\;
\left\langle \mathcal{H}(d,\xi) \right\rangle_{\mathcal{Q}(\xi)},
where :math:`\widehat{=}` expresses equality up to irrelevant (here not :math:`m`-dependent) terms.
The fact that the KL also depends on :math:`m` in a second, indirect way, through :math:`D=D(m)`, is ignored in the MGVI approach. This can often be justified, as uncertainties are usually mostly determined by the :math:`m`-independent measurement setup, and the variation of the uncertainty with the posterior mean is expected to be subdominant and of moderate importance for the KL.
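A minimal sketch of how such a KL is typically set up and minimized in NIFTy5 is given below. It is not part of the original documentation; it follows the pattern of the getting_started demos, ``likelihood`` again stands for an assumed likelihood energy operator, and iteration limits and sample numbers are placeholders.

.. code-block:: python

    import nifty5 as ift

    ic_sampling = ift.GradientNormController(iteration_limit=100)
    ic_newton = ift.GradientNormController(iteration_limit=30)
    minimizer = ift.NewtonCG(ic_newton)

    # Standard information Hamiltonian: likelihood plus standard Gaussian prior on xi
    H = ift.StandardHamiltonian(likelihood, ic_sampling)

    mean = ift.full(H.domain, 0.)  # initial latent mean m
    N_samples = 5                  # residual samples per KL estimate

    for _ in range(5):             # MGVI iterations
        # KL estimated from samples drawn around the current mean with
        # covariance given by the inverse metric
        KL = ift.MetricGaussianKL(mean, H, N_samples)
        KL, convergence = minimizer(KL)  # minimize the sampled KL w.r.t. m
        mean = KL.position               # updated posterior mean estimate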
The demo getting_started_3.py, for example, infers in this way not only a field, but also the power spectrum of the process that has generated the field.
The cross-correlation of field and power spectrum is thereby taken care of.
Posterior samples can be obtained to study this cross-correlation.
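Continuing the sketch above (again an illustration with assumed names; ``signal`` denotes the generative model operator of the demo), posterior statistics can be collected from the samples of the final KL:

.. code-block:: python

    sc = ift.StatCalculator()
    for sample in KL.samples:                 # residual samples around the mean
        sc.add(signal(sample + KL.position))  # push each posterior sample through the model
    posterior_mean, posterior_var = sc.mean, sc.var  # sample mean and pointwise variance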
It should be noted that MGVI, like any VI method, typically underestimates uncertainties, since the reverse KL :math:`\mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})` is used, whereas :math:`\mathcal{D}_\mathrm{KL}(\mathcal{P}||\mathcal{Q})` would be optimal to approximate :math:`\mathcal{P}` by :math:`\mathcal{Q}` from an information theoretical perspective.
This, however, would require that one is able to integrate the posterior, in which case one could calculate the desired posterior mean and its uncertainty covariance directly and therefore would not have any need to perform VI.
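To illustrate this asymmetry, here is a small self-contained toy example (plain NumPy, not NIFTy, and not part of the original documentation): a bimodal one-dimensional "posterior" is approximated by a single Gaussian, once by minimizing the reverse KL (as in VI/MGVI) and once the forward KL. The grid sizes and mixture parameters are arbitrary choices.

.. code-block:: python

    import numpy as np

    x = np.linspace(-8, 8, 2001)
    dx = x[1] - x[0]

    def normal(x, m, s):
        return np.exp(-0.5 * ((x - m) / s)**2) / (np.sqrt(2.0 * np.pi) * s)

    p = 0.5 * normal(x, -2.0, 0.7) + 0.5 * normal(x, 2.0, 0.7)  # bimodal P

    def kl(a, b):
        # KL(a||b) on the grid; clip b to avoid log(0) from numerical underflow
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx

    # Brute-force search over Gaussian means and standard deviations
    means = np.linspace(-4.0, 4.0, 81)
    sigmas = np.linspace(0.3, 4.0, 75)
    rev = np.array([[kl(normal(x, m, s), p) for s in sigmas] for m in means])
    fwd = np.array([[kl(p, normal(x, m, s)) for s in sigmas] for m in means])

    i_r, j_r = np.unravel_index(rev.argmin(), rev.shape)
    i_f, j_f = np.unravel_index(fwd.argmin(), fwd.shape)
    print("reverse KL optimum: m=%.2f sigma=%.2f" % (means[i_r], sigmas[j_r]))  # locks onto one mode
    print("forward KL optimum: m=%.2f sigma=%.2f" % (means[i_f], sigmas[j_f]))  # broad, covers both modes
    print("true std of P:      %.2f" % np.sqrt(np.sum(p * x**2) * dx))

The reverse-KL fit concentrates on a single mode and reports a width close to that mode's width, i.e. it underestimates the spread of :math:`\mathcal{P}`, while the forward-KL fit matches the overall variance.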
@@ -318,49 +318,21 @@ class StandardHamiltonian(EnergyOperator):
class AveragedEnergy(EnergyOperator):
"""Computes Kullback-Leibler (KL) divergence or Gibbs free energies.
A sample-averaged energy, e.g. a Hamiltonian, approximates the relevant
part of a KL to be used in Variational Bayes inference if the samples are
drawn from the approximating Gaussian:
.. math ::
\\text{KL}(m) = \\frac1{\\#\\{v_i\\}} \\sum_{v_i} H(m+v_i),
where :math:`v_i` are the residual samples and :math:`m` is the mean field
around which the samples are drawn.
"""Averages an energy over samples
Parameters
----------
h: Hamiltonian
The energy to be averaged.
res_samples : iterable of Fields
Set of residual sample points to be added to mean field for approximate
estimation of the KL.
Set of residual sample points to be added to mean field for
approximate estimation of the KL.
Note
----
Having symmetrized residual samples, with both v_i and -v_i being present
ensures that the distribution mean is exactly represented. This reduces
sampling noise and helps the numerics of the KL minimization process in the
variational Bayes inference.
See also
--------
Let :math:`Q(f) = G(f-m,D)` be the Gaussian distribution
which is used to approximate the accurate posterior :math:`P(f|d)` with
information Hamiltonian
:math:`H(d,f) = -\\log P(d,f) = -\\log P(f|d) + \\text{const}`. In
Variational Bayes one needs to optimize the KL divergence between those
two distributions for m. It is:
:math:`KL(Q,P) = \\int Df Q(f) \\log Q(f)/P(f)\\\\
= \\left< \\log Q(f) \\right>_Q(f) - \\left< \\log P(f) \\right>_Q(f)\\\\
= \\text{const} + \\left< H(f) \\right>_G(f-m,D)`
in essence the information Hamiltonian averaged over a Gaussian
distribution centered on the mean m.
Having symmetrized residual samples, with both v_i and -v_i being
present ensures that the distribution mean is exactly represented.
:class:`AveragedEnergy(h)` approximates
:math:`\\left< H(f) \\right>_{G(f-m,D)}` if the residuals
:math:`f-m` are drawn from a Gaussian distribution with covariance
......