Planned maintenance on Wednesday, 2021-01-20, 17:00-18:00. Expect some interruptions during that time

ift.rst 18.2 KB
Newer Older
Martin Reinecke's avatar
Martin Reinecke committed
1 2 3 4 5 6 7
IFT -- Information Field Theory
===============================

Theoretical Background
----------------------


8 9 10 11 12
`Information Field Theory <http://www.mpa-garching.mpg.de/ift/>`_ [1]_  (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields.
A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe.
IFT describes how data and knowledge can be used to infer field properties.
Mathematically it is a statistical field theory and exploits many of the tools developed for such.
Practically, it is a framework for signal processing and image reconstruction.
Martin Reinecke's avatar
Martin Reinecke committed
13

14 15
IFT is fully Bayesian.
How else could infinitely many field degrees of freedom be constrained by finite data?
Martin Reinecke's avatar
Martin Reinecke committed
16

17 18 19 20 21
There is a full toolbox of methods that can be used, like the classical approximation (= Maximum a posteriori = MAP), effective action (= Variational Bayes = VI), Feynman diagrams, renormalization, and more.
IFT reproduces many known well working algorithms, which is reassuring.
Also, there were certainly previous works in a similar spirit.
Anyhow, in many cases IFT provides novel rigorous ways to extract information from data.
NIFTy comes with reimplemented MAP and VI estimators.
Martin Reinecke's avatar
Martin Reinecke committed
22

Martin Reinecke's avatar
Martin Reinecke committed
23
.. tip:: *In-a-nutshell introductions to information field theory* can be found in [2]_, [3]_, [4]_, and [5]_, with the latter probably being the most didactical.
Martin Reinecke's avatar
Martin Reinecke committed
24

25
.. [1] T.A. Enßlin et al. (2009), "Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis", PhysRevD.80.105005, 09/2009; `[arXiv:0806.3474] <http://www.arxiv.org/abs/0806.3474>`_
Martin Reinecke's avatar
Martin Reinecke committed
26

27
.. [2] T.A. Enßlin (2013), "Information field theory", proceedings of MaxEnt 2012 -- the 32nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering; AIP Conference Proceedings, Volume 1553, Issue 1, p.184; `[arXiv:1301.2556] <http://arxiv.org/abs/1301.2556>`_
28

29
.. [3] T.A. Enßlin (2014), "Astrophysical data analysis with information field theory", AIP Conference Proceedings, Volume 1636, Issue 1, p.49; `[arXiv:1405.7701] <http://arxiv.org/abs/1405.7701>`_
Martin Reinecke's avatar
Martin Reinecke committed
30

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
31
.. [4] Wikipedia contributors (2018), `"Information field theory" <https://en.wikipedia.org/w/index.php?title=Information_field_theory&oldid=876731720>`_, Wikipedia, The Free Encyclopedia.
Torsten Ensslin's avatar
Torsten Ensslin committed
32

33 34
.. [5] T.A. Enßlin (2019), "Information theory for fields", accepted by Annalen der Physik; `[DOI] <https://doi.org/10.1002/andp.201800127>`_, `[arXiv:1804.03350] <http://arxiv.org/abs/1804.03350>`_

Martin Reinecke's avatar
Martin Reinecke committed
35 36 37 38

Discretized continuum
---------------------

39 40
The representation of fields that are mathematically defined on a continuous space in a finite computer environment is a common necessity.
The goal hereby is to preserve the continuum limit in the calculus in order to ensure a resolution independent discretization.
Martin Reinecke's avatar
Martin Reinecke committed
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55

+-----------------------------+-----------------------------+
| .. image:: images/42vs6.png | .. image:: images/42vs9.png |
|     :width:  100 %          |     :width:  100 %          |
+-----------------------------+-----------------------------+

Any partition of the continuous position space :math:`\Omega` (with volume :math:`V`) into a set of :math:`Q` disjoint, proper subsets :math:`\Omega_q` (with volumes :math:`V_q`) defines a pixelization,

.. math::

    \Omega &\quad=\quad \dot{\bigcup_q} \; \Omega_q \qquad \mathrm{with} \qquad q \in \{1,\dots,Q\} \subset \mathbb{N}
    , \\
    V &\quad=\quad \int_\Omega \mathrm{d}x \quad=\quad \sum_{q=1}^Q \int_{\Omega_q} \mathrm{d}x \quad=\quad \sum_{q=1}^Q V_q
    .

56 57
Here the number :math:`Q` characterizes the resolution of the pixelization and the continuum limit is described by :math:`Q \rightarrow \infty` and :math:`V_q \rightarrow 0` for all :math:`q \in \{1,\dots,Q\}` simultaneously.
Moreover, the above equation defines a discretization of continuous integrals, :math:`\int_\Omega \mathrm{d}x \mapsto \sum_q V_q`.
Martin Reinecke's avatar
Martin Reinecke committed
58 59 60 61 62 63 64 65

Any valid discretization scheme for a field :math:`{s}` can be described by a mapping,

.. math::

    s(x \in \Omega_q) \quad\mapsto\quad s_q \quad=\quad \int_{\Omega_q} \mathrm{d}x \; w_q(x) \; s(x)
    ,

66 67 68
if the weighting function :math:`w_q(x)` is chosen appropriately.
In order for the discretized version of the field to converge to the actual field in the continuum limit, the weighting functions need to be normalized in each subset; i.e., :math:`\forall q: \int_{\Omega_q} \mathrm{d}x \; w_q(x) = 1`.
Choosing such a weighting function that is constant with respect to :math:`x` yields
Martin Reinecke's avatar
Martin Reinecke committed
69 70 71 72 73 74

.. math::

    s_q = \frac{\int_{\Omega_q} \mathrm{d}x \; s(x)}{\int_{\Omega_q} \mathrm{d}x} = \left< s(x) \right>_{\Omega_q}
    ,

75 76
which corresponds to a discretization of the field by spatial averaging.
Another common and equally valid choice is :math:`w_q(x) = \delta(x-x_q)`, which distinguishes some position :math:`x_q \in \Omega_q`, and evaluates the continuous field at this position,
Martin Reinecke's avatar
Martin Reinecke committed
77 78 79 80 81 82

.. math::

    s_q \quad=\quad \int_{\Omega_q} \mathrm{d}x \; \delta(x-x_q) \; s(x) \quad=\quad s(x_q)
    .

83 84 85
In practice, one often makes use of the spatially averaged pixel position, :math:`x_q = \left< x \right>_{\Omega_q}`.
If the resolution is high enough to resolve all features of the signal field :math:`{s}`, both of these discretization schemes approximate each other, :math:`\left< s(x) \right>_{\Omega_q} \approx s(\left< x \right>_{\Omega_q})`, since they approximate the continuum limit by construction.
(The approximation of :math:`\left< s(x) \right>_{\Omega_q} \approx s(x_q \in \Omega_q)` marks a resolution threshold beyond which further refinement of the discretization reveals no new features; i.e., no new information content of the field :math:`{s}`.)
Martin Reinecke's avatar
Martin Reinecke committed
86

87 88
All operations involving position integrals can be normalized in accordance with the above definitions.
For example, the scalar product between two fields :math:`{s}` and :math:`{u}` is defined as
Martin Reinecke's avatar
Martin Reinecke committed
89 90 91 92 93 94

.. math::

    {s}^\dagger {u} \quad=\quad \int_\Omega \mathrm{d}x \; s^*(x) \; u(x) \quad\approx\quad \sum_{q=1}^Q V_q^{\phantom{*}} \; s_q^* \; u_q^{\phantom{*}}
    ,

95 96
where :math:`\dagger` denotes adjunction and :math:`*` complex conjugation.
Since the above approximation becomes an equality in the continuum limit, the scalar product is independent of the pixelization scheme and resolution, if the latter is sufficiently high.
Martin Reinecke's avatar
Martin Reinecke committed
97

98 99
The above line of argumentation analogously applies to the discretization of operators.
For a linear operator :math:`{A}` acting on some field :math:`{s}` as :math:`{A} {s} = \int_\Omega \mathrm{d}y \; A(x,y) \; s(y)`, a matrix representation discretized with constant weighting functions is given by
Martin Reinecke's avatar
Martin Reinecke committed
100 101 102 103 104 105

.. math::

    A(x \in \Omega_p, y \in \Omega_q) \quad\mapsto\quad A_{pq} \quad=\quad \frac{\iint_{\Omega_p \Omega_q} \mathrm{d}x \, \mathrm{d}y \; A(x,y)}{\iint_{\Omega_p \Omega_q} \mathrm{d}x \, \mathrm{d}y} \quad=\quad \big< \big< A(x,y) \big>_{\Omega_p} \big>_{\Omega_q}
    .

106 107
The proper discretization of spaces, fields, and operators, as well as the normalization of position integrals, is essential for the conservation of the continuum limit.
Their consistent implementation in NIFTy allows a pixelization independent coding of algorithms.
Martin Reinecke's avatar
Martin Reinecke committed
108

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
109
Free Theory & Implicit Operators
110 111
--------------------------------

112
A free IFT appears when the signal field :math:`{s}` and the noise :math:`{n}` of the data :math:`{d}` are independent, zero-centered Gaussian processes of kown covariances :math:`{S}` and :math:`{N}`, respectively,
113 114 115 116 117

.. math::

    \mathcal{P}(s,n) = \mathcal{G}(s,S)\,\mathcal{G}(n,N),

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
118
and the measurement equation is linear in both signal and noise,
119 120 121 122 123

.. math::

    d= R\, s + n,

Martin Reinecke's avatar
Martin Reinecke committed
124
with :math:`{R}` being the measurement response, which maps the continous signal field into the discrete data space.
125 126

This is called a free theory, as the information Hamiltonian
Philipp Arras's avatar
Philipp Arras committed
127

128 129 130 131
.. math::

    \mathcal{H}(d,s)= -\log \mathcal{P}(d,s)= \frac{1}{2} s^\dagger S^{-1} s + \frac{1}{2} (d-R\,s)^\dagger N^{-1} (d-R\,s) + \mathrm{const}

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
132
is only of quadratic order in :math:`{s}`, which leads to a linear relation between the data and the posterior mean field.
133

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
134
In this case, the posterior is
135 136 137 138 139

.. math::

    \mathcal{P}(s|d) = \mathcal{G}(s-m,D)

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
140
with
141 142 143 144 145 146 147 148 149 150 151

.. math::

    m = D\, j

the posterior mean field,

.. math::

    D = \left( S^{-1} + R^\dagger N^{-1} R\right)^{-1}

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
152
the posterior covariance operator, and
153 154 155 156 157

.. math::

    j = R^\dagger N^{-1} d

158 159
the information source.
The operation in :math:`{m = D\,R^\dagger N^{-1} d}` is also called the generalized Wiener filter.
160

Martin Reinecke's avatar
Martin Reinecke committed
161
NIFTy permits to define the involved operators :math:`{R}`, :math:`{R^\dagger}`, :math:`{S}`, and :math:`{N}` implicitly, as routines that can be applied to vectors, but which do not require the explicit storage of the matrix elements of the operators.
Torsten Ensslin's avatar
Torsten Ensslin committed
162

163 164 165
Some of these operators are diagonal in harmonic (Fourier) basis, and therefore only require the specification of a (power) spectrum and :math:`{S= F\,\widehat{P_s} F^\dagger}`.
Here :math:`{F = \mathrm{HarmonicTransformOperator}}`, :math:`{\widehat{P_s} = \mathrm{DiagonalOperator}(P_s)}`, and :math:`{P_s(k)}` is the power spectrum of the process that generated :math:`{s}` as a function of the (absolute value of the) harmonic (Fourier) space coordinate :math:`{k}`.
For those, NIFTy can easily also provide inverse operators, as :math:`{S^{-1}= F\,\widehat{\frac{1}{P_s}} F^\dagger}` in case :math:`{F}` is unitary, :math:`{F^\dagger=F^{-1}}`.
166 167

These implicit operators can be combined into new operators, e.g. to :math:`{D^{-1} = S^{-1} + R^\dagger N^{-1} R}`, as well as their inverses, e.g. :math:`{D = \left( D^{-1} \right)^{-1}}`.
168 169
The invocation of an inverse operator applied to a vector might trigger the execution of a numerical linear algebra solver.

Martin Reinecke's avatar
Martin Reinecke committed
170
Thus, when NIFTy calculates :math:`{m = D\, j}`, it actually solves :math:`{D^{-1} m = j}` for :math:`{m}` behind the scenes.
171 172 173
The advantage of implicit operators to explicit matrices is the reduced memory requirements.
The reconstruction of only a Megapixel image would otherwithe require the storage and processing of matrices with sizes of several Terabytes.
Larger images could not be dealt with due to the quadratic memory requirements of explicit operator representations.
174

Martin Reinecke's avatar
Martin Reinecke committed
175
The demo codes `demos/getting_started_1.py` and `demos/Wiener_Filter.ipynb` illustrate this.
176 177 178 179 180


Generative Models
-----------------

181
For more sophisticated measurement situations, involving non-linear measuremnts, unknown covariances, calibration constants and the like, it is recommended to formulate those as generative models for which NIFTy provides powerful inference algorithms.
182

Torsten Ensslin's avatar
Torsten Ensslin committed
183
In a generative model, all known or unknown quantities are described as the results of generative processes, which start with simple probability distributions, like the uniform, the i.i.d. Gaussian, or the delta distribution.
184

185
Let us rewrite the above free theory as a generative model:
Torsten Ensslin's avatar
Torsten Ensslin committed
186

187 188 189
.. math::

    s = A\,\xi
Torsten Ensslin's avatar
Torsten Ensslin committed
190

191
with :math:`{A}` the amplitude operator such that it generates signal field realizations with the correct covariance :math:`{S=A\,A^\dagger}` when being applied to a white Gaussian field :math:`{\xi}` with :math:`{\mathcal{P}(\xi)= \mathcal{G}(\xi, 1)}`.
192

193
The joint information Hamiltonian for the standardized signal field :math:`{\xi}` reads:
Torsten Ensslin's avatar
Torsten Ensslin committed
194

195 196
.. math::

197
    \mathcal{H}(d,\xi)= -\log \mathcal{P}(d,s)= \frac{1}{2} \xi^\dagger \xi + \frac{1}{2} (d-R\,A\,\xi)^\dagger N^{-1} (d-R\,A\,\xi) + \mathrm{const}.
198

Martin Reinecke's avatar
cleanup  
Martin Reinecke committed
199
NIFTy takes advantage of this formulation in several ways:
200

201
1) All prior degrees of freedom have unit covariance, which improves the condition number of operators that need to be inverted.
Martin Reinecke's avatar
Martin Reinecke committed
202

203
2) The amplitude operator can be regarded as part of the response, :math:`{R'=R\,A}`.
Martin Reinecke's avatar
Martin Reinecke committed
204 205
   In general, more sophisticated responses can be constructed out of the composition of simpler operators.

Martin Reinecke's avatar
Martin Reinecke committed
206
3) The response can be non-linear, e.g. :math:`{R'(s)=R \exp(A\,\xi)}`, see `demos/getting_started_2.py`.
Martin Reinecke's avatar
Martin Reinecke committed
207

208
4) The amplitude operator may dependent on further parameters, e.g. :math:`A=A(\tau)= F\, \widehat{e^\tau}` represents an amplitude operator with a positive definite, unknown spectrum defined in the Fourier domain.
Martin Reinecke's avatar
Martin Reinecke committed
209 210
   The amplitude field :math:`{\tau}` would get its own amplitude operator, with a cepstrum (spectrum of a log spectrum) defined in quefrency space (harmonic space of a logarithmically binned harmonic space) to regularize its degrees of freedom by imposing some (user-defined degree of) spectral smoothness.

211
5) NIFTy calculates the gradient of the information Hamiltonian and the Fisher information metric with respect to all unknown parameters, here :math:`{\xi}` and :math:`{\tau}`, by automatic differentiation.
Martin Reinecke's avatar
Martin Reinecke committed
212 213
   The gradients are used for MAP and HMCF estimates, and the Fisher matrix is required in addition to the gradient by Metric Gaussian Variational Inference (MGVI), which is available in NIFTy as well.
   MGVI is an implicit operator extension of Automatic Differentiation Variational Inference (ADVI).
214

215 216
The reconstruction of a non-Gaussian signal with unknown covariance from a non-trivial (tomographic) response is demonstrated in `demos/getting_started_3.py`.
Here, the uncertainty of the field and the power spectrum of its generating process are probed via posterior samples provided by the MGVI algorithm.
217

Philipp Arras's avatar
Philipp Arras committed
218 219 220 221
+----------------------------------------------------+
| **Output of tomography demo getting_started_3.py** |
+----------------------------------------------------+
| .. image:: images/getting_started_3_setup.png      |
222
|     :width:  50 %                                  |
Philipp Arras's avatar
Philipp Arras committed
223 224 225 226 227 228
+----------------------------------------------------+
| Non-Gaussian signal field,                         |
| data backprojected into the image domain, power    |
| spectrum of underlying Gausssian process.          |
+----------------------------------------------------+
| .. image:: images/getting_started_3_results.png    |
229
|     :width:  50 %                                  |
Philipp Arras's avatar
Philipp Arras committed
230 231 232 233 234 235 236
+----------------------------------------------------+
| Posterior mean field signal                        |
| reconstruction, its uncertainty, and the power     |
| spectrum of the process for different posterior    |
| samples in comparison to the correct one (thick    |
| orange line).                                      |
+----------------------------------------------------+
237

Torsten Ensslin's avatar
Torsten Ensslin committed
238 239 240 241
Maximim a Posteriori
--------------------

One popular field estimation method is Maximim a Posteriori (MAP).
242 243 244 245 246 247 248

It only requires to minimize the information Hamiltonian, e.g by a gradient descent method that stops when

.. math::

    \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} = 0.

Martin Reinecke's avatar
Martin Reinecke committed
249
NIFTy5 automatically calculates the necessary gradient from a generative model of the signal and the data and to minimize the Hamiltonian.
250

251
However, MAP often provides unsatisfactory results in cases of deep hirachical Bayesian networks.
252 253 254 255 256 257
The reason for this is that MAP ignores the volume factors in parameter space, which are not to be neglected in deciding whether a solution is reasonable or not.
In the high dimensional setting of field inference these volume factors can differ by large ratios.
A MAP estimate, which is only representative for a tiny fraction of the parameter space, might be a poorer choice (with respect to an error norm) compared to a slightly worse location with slightly lower posterior probability, which, however, is associated with a much larger volume (of nearby locations with similar probability).

This causes MAP signal estimates to be more prone to overfitting the noise as well as to perception thresholds than methods that take volume effects into account.

Torsten Ensslin's avatar
Torsten Ensslin committed
258 259 260 261

Variational Inference
---------------------

Martin Reinecke's avatar
Martin Reinecke committed
262
One method that takes volume effects into account is Variational Inference (VI).
263 264
In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler, parametrized distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.
The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and its covariance :math:`D` are obtained by minimization of an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.
Martin Reinecke's avatar
Martin Reinecke committed
265
As a compromise between being optimal and being computationally affordable, the variational Kullback-Leibler (KL) divergence is used:
266 267 268 269 270 271 272

.. math::

    \mathrm{KL}(m,D|d)= \mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})=
    \int \mathcal{D}\xi \,\mathcal{Q}(\xi) \log \left( \frac{\mathcal{Q}(\xi)}{\mathcal{P}(\xi)} \right)

Minimizing this with respect to all entries of the covariance :math:`D` is unfeasible for fields.
273
Therefore, Metric Gaussian Variational Inference (MGVI) approximates the precision matrix at the location of the current mean :math:`M=D^{-1}` by the Bayesian Fisher information metric,
274 275 276

.. math::

277
    M \approx \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \, \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}^\dagger \right\rangle_{(d,\xi)}.
Torsten Ensslin's avatar
Torsten Ensslin committed
278

279 280
In practice the average is performed over :math:`\mathcal{P}(d,\xi)\approx \mathcal{P}(d|\xi)\,\delta(\xi-m)` by evaluating the expression at the current mean :math:`m`.
This results in a Fisher information metric of the likelihood evaluated at the mean plus the prior information metric.
281 282
Therefore we will only have to infer the mean of the approximate distribution.
The only term within the KL-divergence that explicitly depends on it is the Hamiltonian of the true problem averaged over the approximation:
Martin Reinecke's avatar
Martin Reinecke committed
283

Torsten Ensslin's avatar
Torsten Ensslin committed
284 285
.. math::

286 287
    \mathrm{KL}(m|d) \;\widehat{=}\;
    \left\langle  \mathcal{H}(\xi,d)    \right\rangle_{\mathcal{Q}(\xi)},
Torsten Ensslin's avatar
Torsten Ensslin committed
288

289
where :math:`\widehat{=}` expresses equality up to irrelvant (here not :math:`m`-dependent) terms.
Martin Reinecke's avatar
Martin Reinecke committed
290

Martin Reinecke's avatar
Martin Reinecke committed
291
Thus, only the gradient of the KL is needed with respect to this, which can be expressed as
Torsten Ensslin's avatar
Torsten Ensslin committed
292 293 294

.. math::

295
    \frac{\partial \mathrm{KL}(m|d)}{\partial m} = \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}  \right\rangle_{\mathcal{G}(\xi-m,D)}.
Torsten Ensslin's avatar
Torsten Ensslin committed
296

297 298
We stochastically estimate the KL-divergence and gradients with a set of samples drawn from the approximate posterior distribution.
The particular structure of the covariance allows us to draw independent samples solving a certain system of equations.
299
This KL-divergence for MGVI is implemented in the class MetricGaussianKL within NIFTy5.
Torsten Ensslin's avatar
Torsten Ensslin committed
300 301


Martin Reinecke's avatar
Martin Reinecke committed
302
The demo `getting_started_3.py` for example not only infers a field this way, but also the power spectrum of the process that has generated the field.
Martin Reinecke's avatar
Martin Reinecke committed
303
The cross-correlation of field and power spectrum is taken care of in this process.
304 305
Posterior samples can be obtained to study this cross-correlation.

Martin Reinecke's avatar
Martin Reinecke committed
306
It should be noted that MGVI, as any VI method, can typically only provide a lower bound on the variance.