Information Field Theory
========================

Theoretical Background
----------------------

`Information Field Theory <https://www.mpa-garching.mpg.de/ift/>`_ [1]_  (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields.
A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe.
IFT describes how data and knowledge can be used to infer field properties.
Mathematically it is a statistical field theory and exploits many of the tools developed for such theories.
Practically, it is a framework for signal processing and image reconstruction.

IFT is fully Bayesian.
How else could infinitely many field degrees of freedom be constrained by finite data?

There is a full toolbox of methods that can be used, like the classical approximation (= Maximum a posteriori = MAP), effective action (= Variational Bayes = VI), Feynman diagrams, renormalization, and more.
IFT reproduces many known, well-working algorithms, which is reassuring.
Also, there were certainly previous works in a similar spirit.
Nevertheless, in many cases IFT provides novel, rigorous ways to extract information from data.
NIFTy comes with reimplemented MAP and VI estimators.

.. tip:: A didactical introduction to *information field theory* can be found in [2]_. Other resources include the `Wikipedia entry <https://en.wikipedia.org/wiki/Information_field_theory>`_, [3]_ and [4]_.

.. [1] T.A. Enßlin et al. (2009), "Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis", Phys. Rev. D 80, 105005; `[arXiv:0806.3474] <https://www.arxiv.org/abs/0806.3474>`_

.. [2] T.A. Enßlin (2019), "Information theory for fields", Annalen der Physik; `[DOI] <https://doi.org/10.1002/andp.201800127>`_, `[arXiv:1804.03350] <https://arxiv.org/abs/1804.03350>`_

.. [3] T.A. Enßlin (2013), "Information field theory", proceedings of MaxEnt 2012 -- the 32nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering; AIP Conference Proceedings, Volume 1553, Issue 1, p.184; `[arXiv:1301.2556] <https://arxiv.org/abs/1301.2556>`_

.. [4] T.A. Enßlin (2014), "Astrophysical data analysis with information field theory", AIP Conference Proceedings, Volume 1636, Issue 1, p.49; `[arXiv:1405.7701] <https://arxiv.org/abs/1405.7701>`_





Free Theory & Implicit Operators
--------------------------------

A free IFT appears when the signal field :math:`{s}` and the noise :math:`{n}` of the data :math:`{d}` are independent, zero-centered Gaussian processes of known covariances :math:`{S}` and :math:`{N}`, respectively,

.. math::

    \mathcal{P}(s,n) = \mathcal{G}(s,S)\,\mathcal{G}(n,N),

and the measurement equation is linear in both signal and noise,

.. math::

    d= R\, s + n,

with :math:`{R}` being the measurement response, which maps the continuous signal field into the discrete data space.

This is called a free theory, as the information Hamiltonian

.. math::

    \mathcal{H}(d,s)= -\log \mathcal{P}(d,s)= \frac{1}{2} s^\dagger S^{-1} s + \frac{1}{2} (d-R\,s)^\dagger N^{-1} (d-R\,s) + \mathrm{const}

is only of quadratic order in :math:`{s}`, which leads to a linear relation between the data and the posterior mean field.

In this case, the posterior is

.. math::

    \mathcal{P}(s|d) = \mathcal{G}(s-m,D)

with

.. math::

    m = D\, j

the posterior mean field,

.. math::

    D = \left( S^{-1} + R^\dagger N^{-1} R\right)^{-1}

the posterior covariance operator, and

.. math::

    j = R^\dagger N^{-1} d

the information source.
The operation in :math:`{m = D\,R^\dagger N^{-1} d}` is also called the generalized Wiener filter.

NIFTy allows the involved operators :math:`{R}`, :math:`{R^\dagger}`, :math:`{S}`, and :math:`{N}` to be defined implicitly, as routines that can be applied to vectors, without requiring explicit storage of the operators' matrix elements.

Some of these operators are diagonal in the harmonic (Fourier) basis and therefore only require the specification of a (power) spectrum, so that :math:`{S= F\,\widehat{P_s} F^\dagger}`.
Here :math:`{F = \mathrm{HarmonicTransformOperator}}`, :math:`{\widehat{P_s} = \mathrm{DiagonalOperator}(P_s)}`, and :math:`{P_s(k)}` is the power spectrum of the process that generated :math:`{s}` as a function of the (absolute value of the) harmonic (Fourier) space coordinate :math:`{k}`.
For those, NIFTy can also easily provide inverse operators: :math:`{S^{-1}= F\,\widehat{\frac{1}{P_s}} F^\dagger}` in case :math:`{F}` is unitary, i.e. :math:`{F^\dagger=F^{-1}}`.
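
The following sketch shows how such an implicitly defined covariance might be set up; the domain size and power spectrum are illustrative assumptions, not prescriptions of the library.

.. code-block:: python

    import nifty7 as ift

    position_space = ift.RGSpace([256])                     # signal domain
    harmonic_space = position_space.get_default_codomain()  # harmonic partner
    F = ift.HarmonicTransformOperator(harmonic_space, target=position_space)

    # P_hat is diagonal in the harmonic basis; S = F P_hat F^dagger is
    # never stored as a matrix, only applied to fields.
    P_hat = ift.create_power_operator(harmonic_space, lambda k: 1./(1. + k**2))
    S = F @ P_hat @ F.adjoint

    xi = ift.from_random(S.domain, 'normal')
    smooth_field = S(xi)  # applies the whole operator chain implicitly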

These implicit operators can be combined into new operators, e.g. :math:`{D^{-1} = S^{-1} + R^\dagger N^{-1} R}`, and their inverses can be formed as well, e.g. :math:`{D = \left( D^{-1} \right)^{-1}}`.
The invocation of an inverse operator applied to a vector might trigger the execution of a numerical linear algebra solver.

Thus, when NIFTy calculates :math:`{m = D\, j}`, it actually solves :math:`{D^{-1} m = j}` for :math:`{m}` behind the scenes.
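
In the spirit of `demos/getting_started_1.py`, a minimal Wiener filter sketch might read as follows; the power spectrum, noise level, and mock data are assumptions for illustration only.

.. code-block:: python

    import numpy as np
    import nifty7 as ift

    position_space = ift.RGSpace([256])
    harmonic_space = position_space.get_default_codomain()
    HT = ift.HarmonicTransformOperator(harmonic_space, target=position_space)

    S = ift.create_power_operator(harmonic_space, lambda k: 100./(10. + k**3))
    R = HT                                  # response: here just the transform
    N = ift.ScalingOperator(R.target, 0.1)  # homogeneous noise covariance

    # synthetic data d = R s + n
    MOCK_SIGNAL = S.draw_sample_with_dtype(dtype=np.float64)
    MOCK_NOISE = N.draw_sample_with_dtype(dtype=np.float64)
    data = R(MOCK_SIGNAL) + MOCK_NOISE

    j = R.adjoint_times(N.inverse_times(data))     # information source
    D_inv = R.adjoint @ N.inverse @ R + S.inverse  # implicit precision operator
    IC = ift.GradientNormController(iteration_limit=500, tol_abs_gradnorm=1e-3)
    D = ift.InversionEnabler(D_inv, IC, approximation=S.inverse).inverse
    m = D(j)  # triggers a conjugate-gradient solve of D^{-1} m = j

No matrix is ever stored in this sketch; only operator applications are performed.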
The advantage of implicit operators over explicit matrices is the reduced memory consumption;
for the reconstruction of just a megapixel image, the latter would already require several terabytes.
Larger images could not be dealt with at all, due to the quadratic memory requirements of explicit operator representations.

The demo codes `demos/getting_started_1.py` and `demos/Wiener_Filter.ipynb` illustrate this.


Generative Models
-----------------

For more sophisticated measurement situations (involving non-linear measurements, unknown covariances, calibration constants, and the like) it is recommended to formulate them as generative models, for which NIFTy provides powerful inference algorithms.

In a generative model, all known or unknown quantities are described as the results of generative processes, which start with simple probability distributions, like the uniform, the i.i.d. Gaussian, or the delta distribution.

Let us rewrite the above free theory as a generative model:

.. math::

    s = A\,\xi

with :math:`{A}` the amplitude operator, which generates signal field realizations with the correct covariance :math:`{S=A\,A^\dagger}` when applied to a white Gaussian field :math:`{\xi}` with :math:`{\mathcal{P}(\xi)= \mathcal{G}(\xi, 1)}`.

The joint information Hamiltonian for the standardized signal field :math:`{\xi}` reads:

.. math::

    \mathcal{H}(d,\xi)= -\log \mathcal{P}(d,\xi)= \frac{1}{2} \xi^\dagger \xi + \frac{1}{2} (d-R\,A\,\xi)^\dagger N^{-1} (d-R\,A\,\xi) + \mathrm{const}.

NIFTy takes advantage of this formulation in several ways:

1) All prior degrees of freedom have unit covariance, which improves the condition number of operators that need to be inverted.

2) The amplitude operator can be regarded as part of the response, :math:`{R'=R\,A}`.
   In general, more sophisticated responses can be obtained by combining simpler operators.

3) The response can be non-linear, e.g. :math:`{R'(\xi)=R \exp(A\,\xi)}`, see `demos/getting_started_2.py`.

4) The amplitude operator may depend on further parameters, e.g. :math:`A=A(\tau)=F\, \widehat{e^\tau}` represents an amplitude operator with a positive definite, unknown spectrum.
   The log-amplitude field :math:`{\tau}` is modelled with the help of an integrated Wiener process in order to impose some (user-defined degree of) spectral smoothness.

5) NIFTy calculates the gradient of the information Hamiltonian and the Fisher information metric with respect to all unknown parameters, here :math:`{\xi}` and :math:`{\tau}`, by automatic differentiation.
   The gradients are used for MAP estimates, and the Fisher matrix is required in addition to the gradient by Metric Gaussian Variational Inference (MGVI), which is available in NIFTy as well.
   MGVI is an implicit operator extension of Automatic Differentiation Variational Inference (ADVI).
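
Putting these points together, a hedged sketch of such a generative model might read as follows; the domain, spectrum, and noise level are illustrative assumptions (cf. `demos/getting_started_2.py`).

.. code-block:: python

    import numpy as np
    import nifty7 as ift

    position_space = ift.RGSpace([256])
    harmonic_space = position_space.get_default_codomain()
    HT = ift.HarmonicTransformOperator(harmonic_space, target=position_space)

    # fixed amplitude operator A, so that S = A A^dagger (point 2 above)
    A = ift.create_power_operator(harmonic_space, lambda k: 1./(10. + k**2))

    # white-Gaussian latent parameters xi, addressed by name
    xi = ift.ducktape(harmonic_space, None, 'xi')

    # non-linear response R'(xi) = exp(F A xi) (point 3 above)
    signal = (HT @ A @ xi).exp()

    # Gaussian likelihood; gradients and the Fisher metric w.r.t. xi are
    # obtained by automatic differentiation (point 5 above)
    N = ift.ScalingOperator(position_space, 0.1)
    mock_data = signal(ift.from_random(signal.domain)) \
        + N.draw_sample_with_dtype(dtype=np.float64)
    likelihood = ift.GaussianEnergy(mean=mock_data,
                                    inverse_covariance=N.inverse) @ signal
    H = ift.StandardHamiltonian(likelihood)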

The reconstruction of a non-Gaussian signal with unknown covariance from a non-trivial (tomographic) response is demonstrated in `demos/getting_started_3.py`.
Here, the uncertainty of the field and the power spectrum of its generating process are probed via posterior samples provided by the MGVI algorithm.

+----------------------------------------------------+
| **Output of tomography demo getting_started_3.py** |
+----------------------------------------------------+
| .. image:: ../images/getting_started_3_setup.png   |
|                                                    |
+----------------------------------------------------+
| Non-Gaussian signal field,                         |
| data backprojected into the image domain, power    |
| spectrum of underlying Gaussian process.           |
+----------------------------------------------------+
| .. image:: ../images/getting_started_3_results.png |
|                                                    |
+----------------------------------------------------+
| Posterior mean field signal                        |
| reconstruction, its uncertainty, and the power     |
| spectrum of the process for different posterior    |
| samples in comparison to the correct one (thick    |
| orange line).                                      |
+----------------------------------------------------+

Maximum a Posteriori
--------------------

One popular field estimation method is Maximum a Posteriori (MAP).

It only requires minimizing the information Hamiltonian, e.g. by a gradient descent method that stops when

.. math::

    \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} = 0.

NIFTy7 automatically calculates the necessary gradient from a generative model of the signal and the data and uses this to minimize the Hamiltonian.
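
As a minimal sketch, a MAP estimate for a generative model could be computed as follows; ``H`` is the ``StandardHamiltonian`` from the sketch in the previous section, and the controller settings are illustrative assumptions.

.. code-block:: python

    import nifty7 as ift

    IC = ift.GradientNormController(iteration_limit=50, tol_abs_gradnorm=1e-3)
    minimizer = ift.NewtonCG(IC)

    position = ift.from_random(H.domain)     # random starting point
    energy = ift.EnergyAdapter(position, H, want_metric=True)
    energy, convergence = minimizer(energy)  # descends the Hamiltonian
    map_estimate = energy.position           # latent xi at the minimum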

However, MAP often provides unsatisfactory results in cases of deep hierarchical Bayesian networks.
The reason for this is that MAP ignores the volume factors in parameter space, which must not be neglected when deciding whether a solution is reasonable.
In the high-dimensional setting of field inference these volume factors can differ by large ratios.
A MAP estimate, which is representative only for a tiny fraction of the parameter space, can therefore be a poorer choice (with respect to an error norm) than a location with slightly lower posterior probability that is associated with a much larger volume of nearby locations with similar probability.

This causes MAP signal estimates to be more prone to overfitting the noise as well as to perception thresholds than methods that take volume effects into account.


Metric Gaussian Variational Inference
-------------------------------------

One method that takes volume effects into account is Variational Inference (VI).
In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler, parametrized distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.
The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and its covariance :math:`D`, are obtained by minimizing an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.
As a compromise between being optimal and being computationally affordable, the variational Kullback-Leibler (KL) divergence is used:

.. math::

    \mathrm{KL}(m,D|d)= \mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})=
    \int \mathcal{D}\xi \,\mathcal{Q}(\xi) \log \left( \frac{\mathcal{Q}(\xi)}{\mathcal{P}(\xi)} \right)

Minimizing this with respect to all entries of the covariance :math:`D` is infeasible for fields.
Therefore, Metric Gaussian Variational Inference (MGVI) approximates the posterior precision matrix :math:`D^{-1}` at the location of the current mean :math:`m` by the Bayesian Fisher information metric,

.. math::

    M \approx \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \, \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}^\dagger \right\rangle_{(d,\xi)}.

In practice the average is performed over :math:`\mathcal{P}(d,\xi)\approx \mathcal{P}(d|\xi)\,\delta(\xi-m)` by evaluating the expression at the current mean :math:`m`.
This results in a Fisher information metric of the likelihood evaluated at the mean plus the prior information metric.
Therefore, we only have to infer the mean of the approximating distribution.
The only term within the KL-divergence that explicitly depends on it is the Hamiltonian of the true problem averaged over the approximation:

.. math::

    \mathrm{KL}(m|d) \;\widehat{=}\;
    \left\langle  \mathcal{H}(\xi,d)    \right\rangle_{\mathcal{Q}(\xi)},

where :math:`\widehat{=}` expresses equality up to irrelevant (here not :math:`m`-dependent) terms.

Thus, only the gradient of the KL with respect to :math:`m` is needed, which can be expressed as

.. math::

    \frac{\partial \mathrm{KL}(m|d)}{\partial m} = \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}  \right\rangle_{\mathcal{G}(\xi-m,D)}.

We stochastically estimate the KL-divergence and gradients with a set of samples drawn from the approximate posterior distribution.
The particular structure of the covariance allows us to draw independent samples by solving a certain system of equations.
This KL-divergence for MGVI is implemented by
:func:`~nifty7.minimization.kl_energies.MetricGaussianKL` within NIFTy7.
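
Sketched usage, following the pattern of `demos/getting_started_3.py`; ``H`` and ``minimizer`` are assumed to be set up as in the MAP sketch above, and the sample and iteration counts are illustrative.

.. code-block:: python

    mean = ift.from_random(H.domain)
    for _ in range(5):  # MGVI iterations
        kl = ift.MetricGaussianKL(mean, H, n_samples=5, mirror_samples=True)
        kl, convergence = minimizer(kl)  # optimizes the mean
        mean = kl.position               # posterior samples: kl.samples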

Note that MGVI typically provides only a lower bound on the variance.

Geometric Variational Inference
-------------------------------

For non-linear posterior distributions :math:`\mathcal{P}(\xi|d)` an approximation with a Gaussian :math:`\mathcal{Q}(\xi)` in the coordinates :math:`\xi` is sub-optimal, as higher order interactions are ignored.
A better approximation can be achieved by constructing a coordinate system :math:`y = g\left(\xi\right)` in which the posterior is close to a Gaussian, and performing VI with a Gaussian :math:`\mathcal{Q}(y)` in these coordinates.
This approach is called Geometric Variational Inference (geoVI).
It is discussed in detail in [6]_.

One useful coordinate system is obtained if the metric :math:`M` of the posterior can be expressed as the pullback of the Euclidean metric by :math:`g`:

.. math::

    M = \left(\frac{\partial g}{\partial \xi}\right)^T \frac{\partial g}{\partial \xi} \ .

In general, such a transformation exists only locally, i.e. in a neighbourhood of some expansion point :math:`\bar{\xi}`, and is denoted as :math:`g_{\bar{\xi}}\left(\xi\right)`.
In these coordinates, the geoVI scheme uses a zero-mean, unit-covariance Gaussian approximation :math:`\mathcal{Q}(y) = \mathcal{G}(y, 1)`.
It can be expressed in :math:`\xi` coordinates via the pushforward by the inverse transformation :math:`\xi = g_{\bar{\xi}}^{-1}(y)`:

.. math::

    \mathcal{Q}_{\bar{\xi}}(\xi) = \left(g_{\bar{\xi}}^{-1} * \mathcal{Q}\right)(\xi) = \int \delta\left(\xi - g_{\bar{\xi}}^{-1}(y)\right) \ \mathcal{G}(y, 1) \ \mathcal{D}y \ ,

where :math:`\delta` denotes the Dirac delta distribution.

The geoVI scheme obtains the optimal expansion point :math:`\bar{\xi}` such that :math:`\mathcal{Q}_{\bar{\xi}}` matches the posterior as well as possible.
Analogous to the MGVI algorithm, :math:`\bar{\xi}` is obtained by minimization of the KL-divergence between :math:`\mathcal{P}` and :math:`\mathcal{Q}_{\bar{\xi}}` w.r.t. :math:`\bar{\xi}`.
Furthermore, the KL is represented as a stochastic estimate using a set of samples drawn from :math:`\mathcal{Q}_{\bar{\xi}}`, which is implemented in NIFTy7 via :func:`~nifty7.minimization.kl_energies.GeoMetricKL`.
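
A sketch of the corresponding loop, analogous to the MGVI sketch above; the extra sampling minimizer and all settings are illustrative assumptions.

.. code-block:: python

    # an additional minimizer drives the non-linear sampling transformation
    IC_samp = ift.AbsDeltaEnergyController(deltaE=0.5, iteration_limit=100)
    minimizer_samp = ift.NewtonCG(IC_samp)

    mean = ift.from_random(H.domain)
    for _ in range(5):
        kl = ift.GeoMetricKL(mean, H, n_samples=5,
                             minimizer_samp=minimizer_samp, mirror_samples=True)
        kl, convergence = minimizer(kl)
        mean = kl.position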

A visual comparison of the MGVI and geoVI algorithms can be found in `variational_inference_visualized.py <https://gitlab.mpcdf.mpg.de/ift/nifty/-/blob/NIFTy_7/demos/variational_inference_visualized.py>`_.

.. [6] P. Frank, R. Leike, and T.A. Enßlin (2021), "Geometric Variational Inference"; `[arXiv:2105.10470] <https://arxiv.org/abs/2105.10470>`_ `[doi] <https://doi.org/10.3390/e23070853>`_