IFT is fully Bayesian. How else could infinitely many field degrees of freedom be constrained by finite data?

There is a full toolbox of methods that can be used, like the classical approximation (= Maximum a posteriori = MAP), effective action (= Variational Bayes = VI), Feynman diagrams, renormalization, and more. IFT reproduces many known well working algorithms. This should be reassuring. Also, there were certainly previous works in a similar spirit. Anyhow, in many cases IFT provides novel rigorous ways to extract information from data. NIFTy comes with reimplemented MAP and VI estimators. It also provides a Hamiltonian Monte Carlo sampler for Fields (HMCF). (*FIXME* does it?)

There is a full toolbox of methods that can be used, like the classical approximation (= Maximum a posteriori = MAP), effective action (= Variational Bayes = VI), Feynman diagrams, renormalitation, and more. IFT reproduces many known well working algorithms. This should be reassuring. And, there were certainly previous works in a similar spirit. Anyhow, in many cases IFT provides novel rigorous ways to extract information from data. NIFTy comes with reimplemented MAP and VI estimators.

.. tip:: *In-a-nutshell introductions to information field theory* can be found in [2]_, [3]_, [4]_, and [5]_, with the latter probably being the most didactical.

...

...

@@ -184,14 +184,14 @@ The reconstruction of a non-Gaussian signal with unknown covariance from a non-t

| **Output of tomography demo getting_started_3.py** |

NIFTy5 is able to calculate the necessary gradient from a generative model of the signal and the data and to minimize the Hamiltonian.

However, MAP provides often unsatisfactory result in case a deep hirachical Bayesian networks describes the singal and data generation.

The reason for this is that MAP ignores the volume factors in parameter space, which are not to be neglected in deciding whether a solution is reasonable or not.

In the high dimensional setting of field inference these volume factors can differ by large ratios.

A MAP estimate, which is only representative for a tiny fraction of the parameter space, might be a poorer choice (with respect to an error norm) compared to a slightly worse location with slightly lower posterior probability, which, however, is associated with a much larger volume (of nearby locations with similar probability).

This causes MAP signal estimates to be more prone to overfitting the noise as well as to perception thresholds than methods that take volume effects into account.

Variational Inference

---------------------

One method that takes volume effects into account is Variational Inference (VI).

In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.

The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and its uncertainty dispersion :math:`D` are obtained by minimization of an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.

As a compromise between being optimal and being computational affordable the (reverse) Kullbach Leiberler (KL) divergence is used in VI:

Minimizing this with respect to all entries of the covariance :math:`D` is unfeasible for fields.

Therefore, Metric Gaussian Variational Inference (MGVI) makes the Ansatz to approximate the precision matrix :math:`M=D^{-1}` by the Bayesian Fisher information metric,

where in the MGVI practice the average is performed over :math:`\mathcal{P}(d,\xi)\approx \mathcal{P}(d|\xi)\,\mathcal{Q}(\xi)` by evaluating the expression at :math:`\xi` samples drawn from the Gaussian :math:`\mathcal{Q}(\xi)` and corrsponding data samples dran from their generative process :math:`\mathcal{P}(d|\xi)`.

With this approximation, the KL becomes effectively a function of the mean :math:`m`, as :math:`D= D(m) \approx M^{-1}`. Thus, only the gradient of the KL is needed with respect to this, which can be expressed as

The advantage of this Ansatz is that the averages can be represented by sample averages, and all the gradients are represented by operators that NIFTy5 can calculate and that do not need the storage of full matrices. Therefore, NIFTy5 is able to draw samples according to a Gaussian with a covariance given by the inverse information metric, and to minimize the KL correspondingly.

Setting up a KL for MGVI is done via objects of the class MetricGaussianKL.

It should be noted that MetricGaussianKL does not estimate the full KL, as within the MGVI approximation only the KL is optimized with respect to the posterior mean and any other additive part of the KL is dropped. It turns out that a metric Gaussian average of the original information Hamiltonian contains all necessary dependencies:

where :math:`\widehat{=}` expresses equality up to irrelvant (here not :math:`m`-dependent) terms.

The fact that the KL depends indirectly also through :math:`D=D(m)` in a second way on :math:`m` is ignored in the MGVI approach. This can often be justified by uncertainties usually being mostly determined by the :math:`m`-independent measurment setup and the variation of the uncertainty with the posterior mean is expected to be subdominant and of moderate importance for the KL.

The demo getting_started_3.py for example infers this way not only a field, but also the power spectrum of the process that has generated the field.

The cross-correlation of field and power spectum is taken care of thereby.

Posterior samples can be obtained to study this cross-correlation.

It should be noted that MGVI as any VI method typically underestimates uncertainties due to the fact that :math:`\mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})`, the reverse KL, is used, whereas :math:`\mathcal{D}_\mathrm{KL}(\mathcal{P}||\mathcal{Q})` would be optimal to approximate :math:`\mathcal{P}` by :math:`\mathcal{Q}` from an information theoretical perspective.

This, however, would require that one is able to integrate the posterior, in wich case one could calculate the desired posterior mean and its uncertainty covariance directly and therefore would not have any need to perform VI.