IFT -- Information Field Theory
===============================

Theoretical Background
----------------------


`Information Field Theory <http://www.mpa-garching.mpg.de/ift/>`_ [1]_  (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields. A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe. IFT describes how data and knowledge can be used to infer field properties. Mathematically it is a statistical field theory and exploits many of the tools developed for such theories. Practically, it is a framework for signal processing and image reconstruction.

IFT is fully Bayesian. How else could infinitely many field degrees of freedom be constrained by finite data?

There is a full toolbox of methods that can be used, like the classical approximation (= Maximum a posteriori = MAP), effective action (= Variational Bayes = VI), Feynman diagrams, renormalization, and more. IFT reproduces many known, well-working algorithms, which should be reassuring, and there were certainly previous works in a similar spirit. In many cases, however, IFT provides novel and rigorous ways to extract information from data. NIFTy comes with reimplemented MAP and VI estimators.

.. tip:: *In-a-nutshell introductions to information field theory* can be found in [2]_, [3]_, [4]_, and [5]_, with the latter probably being the most didactic.

.. [1] T.A. Enßlin et al. (2009), "Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis", PhysRevD.80.105005, 09/2009; `[arXiv:0806.3474] <http://www.arxiv.org/abs/0806.3474>`_

.. [2] T.A. Enßlin (2013), "Information field theory", proceedings of MaxEnt 2012 -- the 32nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering; AIP Conference Proceedings, Volume 1553, Issue 1, p.184; `[arXiv:1301.2556] <http://arxiv.org/abs/1301.2556>`_

.. [3] T.A. Enßlin (2014), "Astrophysical data analysis with information field theory", AIP Conference Proceedings, Volume 1636, Issue 1, p.49; `[arXiv:1405.7701] <http://arxiv.org/abs/1405.7701>`_

.. [4] Wikipedia contributors (2018), `"Information field theory" <https://en.wikipedia.org/w/index.php?title=Information_field_theory&oldid=876731720>`_, Wikipedia, The Free Encyclopedia.

.. [5] T.A. Enßlin (2019), "Information theory for fields", accepted by Annalen der Physik; `[DOI] <https://doi.org/10.1002/andp.201800127>`_, `[arXiv:1804.03350] <http://arxiv.org/abs/1804.03350>`_


Discretized continuum
---------------------

Representing fields that are mathematically defined on a continuous space in a finite computer environment is a common necessity. The goal hereby is to preserve the continuum limit in the calculus in order to ensure a resolution-independent discretization.

+-----------------------------+-----------------------------+
| .. image:: images/42vs6.png | .. image:: images/42vs9.png |
|     :width:  100 %          |     :width:  100 %          |
+-----------------------------+-----------------------------+

Any partition of the continuous position space :math:`\Omega` (with volume :math:`V`) into a set of :math:`Q` disjoint, proper subsets :math:`\Omega_q` (with volumes :math:`V_q`) defines a pixelization,

.. math::

    \Omega &\quad=\quad \dot{\bigcup_q} \; \Omega_q \qquad \mathrm{with} \qquad q \in \{1,\dots,Q\} \subset \mathbb{N}
    , \\
    V &\quad=\quad \int_\Omega \mathrm{d}x \quad=\quad \sum_{q=1}^Q \int_{\Omega_q} \mathrm{d}x \quad=\quad \sum_{q=1}^Q V_q
    .

Here the number :math:`Q` characterizes the resolution of the pixelization and the continuum limit is described by :math:`Q \rightarrow \infty` and :math:`V_q \rightarrow 0` for all :math:`q \in \{1,\dots,Q\}` simultaneously. Moreover, the above equation defines a discretization of continuous integrals, :math:`\int_\Omega \mathrm{d}x \mapsto \sum_q V_q`.

Any valid discretization scheme for a field :math:`{s}` can be described by a mapping,

.. math::

    s(x \in \Omega_q) \quad\mapsto\quad s_q \quad=\quad \int_{\Omega_q} \mathrm{d}x \; w_q(x) \; s(x)
    ,

if the weighting function :math:`w_q(x)` is chosen appropriately. In order for the discretized version of the field to converge to the actual field in the continuum limit, the weighting functions need to be normalized in each subset; i.e., :math:`\forall q: \int_{\Omega_q} \mathrm{d}x \; w_q(x) = 1`. Choosing a weighting function that is constant with respect to :math:`x` yields

.. math::

    s_q = \frac{\int_{\Omega_q} \mathrm{d}x \; s(x)}{\int_{\Omega_q} \mathrm{d}x} = \left< s(x) \right>_{\Omega_q}
    ,

which corresponds to a discretization of the field by spatial averaging. Another common and equally valid choice is :math:`w_q(x) = \delta(x-x_q)`, which distinguishes some position :math:`x_q \in \Omega_q`, and evaluates the continuous field at this position,

.. math::

    s_q \quad=\quad \int_{\Omega_q} \mathrm{d}x \; \delta(x-x_q) \; s(x) \quad=\quad s(x_q)
    .

In practice, one often makes use of the spatially averaged pixel position, :math:`x_q = \left< x \right>_{\Omega_q}`. If the resolution is high enough to resolve all features of the signal field :math:`{s}`, both of these discretization schemes approximate each other, :math:`\left< s(x) \right>_{\Omega_q} \approx s(\left< x \right>_{\Omega_q})`, since they approximate the continuum limit by construction. (The approximation of :math:`\left< s(x) \right>_{\Omega_q} \approx s(x_q \in \Omega_q)` marks a resolution threshold beyond which further refinement of the discretization reveals no new features; i.e., no new information content of the field :math:`{s}`.)

All operations involving position integrals can be normalized in accordance with the above definitions. For example, the scalar product between two fields :math:`{s}` and :math:`{u}` is defined as

.. math::

    {s}^\dagger {u} \quad=\quad \int_\Omega \mathrm{d}x \; s^*(x) \; u(x) \quad\approx\quad \sum_{q=1}^Q V_q^{\phantom{*}} \; s_q^* \; u_q^{\phantom{*}}
    ,

where :math:`\dagger` denotes adjunction and :math:`*` complex conjugation. Since the above approximation becomes an equality in the continuum limit, the scalar product is independent of the pixelization scheme and resolution, if the latter is sufficiently high.
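
For illustration, here is a minimal NumPy sketch (not NIFTy code; the domain :math:`\Omega = [0, 2\pi)`, the resolutions, and the test field are assumptions) of the volume-weighted scalar product and its resolution independence:

.. code-block:: python

    # Volume-weighted scalar product s^dagger u ~ sum_q V_q s_q* u_q on a
    # regular 1D grid; the result is (approximately) independent of Q.
    import numpy as np

    def scalar_product(s, u, V_q):
        return np.sum(V_q * np.conj(s) * u)

    V = 2.0 * np.pi                            # total volume of Omega
    for Q in (64, 256, 1024):                  # increasing resolutions
        x = np.linspace(0.0, V, Q, endpoint=False)
        s = u = np.sin(x)
        print(Q, scalar_product(s, u, V / Q))  # -> pi at every resolution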

The above line of argumentation analogously applies to the discretization of operators. For a linear operator :math:`{A}` acting on some field :math:`{s}` as :math:`{A} {s} = \int_\Omega \mathrm{d}y \; A(x,y) \; s(y)`, a matrix representation discretized with constant weighting functions is given by

.. math::

    A(x \in \Omega_p, y \in \Omega_q) \quad\mapsto\quad A_{pq} \quad=\quad \frac{\iint_{\Omega_p \Omega_q} \mathrm{d}x \, \mathrm{d}y \; A(x,y)}{\iint_{\Omega_p \Omega_q} \mathrm{d}x \, \mathrm{d}y} \quad=\quad \big< \big< A(x,y) \big>_{\Omega_p} \big>_{\Omega_q}
    .

The proper discretization of spaces, fields, and operators, as well as the normalization of position integrals, is essential for the conservation of the continuum limit. Their consistent implementation in NIFTy allows a pixelization-independent coding of algorithms.

Free Theory & Implicit Operators
--------------------------------

A free IFT appears when the signal field :math:`{s}` and the noise :math:`{n}` of the data :math:`{d}` are independent, zero-centered Gaussian processes of known covariances :math:`{S}` and :math:`{N}`, respectively,

.. math::

    \mathcal{P}(s,n) = \mathcal{G}(s,S)\,\mathcal{G}(n,N),

and the measurement equation is linear in both signal and noise,

.. math::

    d= R\, s + n,

with :math:`{R}` being the measurement response, which maps the continuous signal field into the discrete data space.

This is called a free theory, as the information Hamiltonian

.. math::

    \mathcal{H}(d,s)= -\log \mathcal{P}(d,s)= \frac{1}{2} s^\dagger S^{-1} s + \frac{1}{2} (d-R\,s)^\dagger N^{-1} (d-R\,s) + \mathrm{const}

is only of quadratic order in :math:`{s}`, which leads to a linear relation between the data and the posterior mean field.

In this case, the posterior is

.. math::

    \mathcal{P}(s|d) = \mathcal{G}(s-m,D)

with

.. math::

    m = D\, j

the posterior mean field,

.. math::

    D = \left( S^{-1} + R^\dagger N^{-1} R\right)^{-1}

the posterior covariance operator, and

.. math::

    j = R^\dagger N^{-1} d

the information source. The operation in :math:`{m = D\,R^\dagger N^{-1} d}` is also called the generalized Wiener filter.
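
As a toy illustration (plain NumPy rather than NIFTy, with assumed dimensions, covariances, and response, and with all operators kept as explicit matrices for clarity), the generalized Wiener filter solution can be computed as follows:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(42)
    n_s, n_d = 100, 30                               # signal / data dimensions
    S = np.diag(1.0 / (1.0 + np.arange(n_s)) ** 2)   # toy signal covariance
    N = 0.1 * np.eye(n_d)                            # toy noise covariance
    R = rng.standard_normal((n_d, n_s))              # toy linear response

    s = rng.multivariate_normal(np.zeros(n_s), S)    # signal realization
    n = rng.multivariate_normal(np.zeros(n_d), N)    # noise realization
    d = R @ s + n                                    # measurement equation

    j = R.T @ np.linalg.solve(N, d)                  # information source
    D = np.linalg.inv(np.linalg.inv(S) + R.T @ np.linalg.solve(N, R))
    m = D @ j                                        # posterior mean m = D j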

NIFTy permits the definition of the involved operators :math:`{R}`, :math:`{R^\dagger}`, :math:`{S}`, and :math:`{N}` implicitly, as routines that can be applied to vectors without requiring the explicit storage of the operators' matrix elements.

Some of these operators are diagonal in harmonic (Fourier) basis, and therefore only require the specification of a (power) spectrum and :math:`{S= F\,\widehat{P_s} F^\dagger}`. Here :math:`{F = \mathrm{HarmonicTransformOperator}}`, :math:`{\widehat{P_s} = \mathrm{DiagonalOperator}(P_s)}`, and :math:`{P_s(k)}` is the power spectrum of the process that generated :math:`{s}` as a function of the (absolute value of the) harmonic (Fourier) space coordinate :math:`{k}`. For such operators, NIFTy can easily provide the inverse as well, as :math:`{S^{-1}= F\,\widehat{\frac{1}{P_s}} F^\dagger}` in case :math:`{F}` is unitary, :math:`{F^\dagger=F^{-1}}`.
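
Schematically, such an implicit application of :math:`{S}` and :math:`{S^{-1}}` can be sketched in plain NumPy (the unitary FFT convention and the power-law spectrum are illustrative assumptions, not the NIFTy implementation):

.. code-block:: python

    import numpy as np

    def apply_S(x, P_s):
        # F diag(P_s) F^dagger x; norm='ortho' makes the FFT unitary
        return np.fft.ifft(P_s * np.fft.fft(x, norm='ortho'),
                           norm='ortho').real

    Q = 256
    k = np.abs(np.fft.fftfreq(Q, d=1.0 / Q))  # |k| on the harmonic grid
    P_s = 1.0 / (1.0 + k) ** 3                # assumed power-law spectrum
    x = np.random.default_rng(0).standard_normal(Q)
    Sx = apply_S(x, P_s)              # S applied without storing a matrix
    S_inv_x = apply_S(x, 1.0 / P_s)   # S^-1: just invert the spectrum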

These implicit operators can be combined into new operators, e.g. :math:`{D^{-1} = S^{-1} + R^\dagger N^{-1} R}`, as well as their inverses, e.g. :math:`{D = \left( D^{-1} \right)^{-1}}`.
The invocation of an inverse operator applied to a vector might trigger the execution of a numerical linear algebra solver.

Thus, when NIFTy calculates :math:`{m = D\, j}`, it actually solves :math:`{D^{-1} m = j}` for :math:`{m}` behind the scenes. The advantage of implicit operators over explicit matrices is their reduced memory requirement. The reconstruction of only a megapixel image would otherwise require the storage and processing of matrices with sizes of several terabytes. Larger images could not be dealt with due to the quadratic memory requirements of explicit operator representations.
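
A SciPy-based sketch of such an implicit inversion (not the NIFTy implementation; the sizes and the diagonal toy covariances are assumptions) could look like this:

.. code-block:: python

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(1)
    n_s, n_d = 10_000, 300
    s_var = 1.0 / (1.0 + np.arange(n_s)) ** 2      # diagonal toy S
    n_var = 0.1 * np.ones(n_d)                     # diagonal toy N
    R = rng.standard_normal((n_d, n_s)) / np.sqrt(n_s)

    def D_inverse(x):
        # (S^-1 + R^T N^-1 R) x without storing an n_s x n_s matrix
        return x / s_var + R.T @ ((R @ x) / n_var)

    d = rng.standard_normal(n_d)
    j = R.T @ (d / n_var)                          # information source
    D_inv = LinearOperator((n_s, n_s), matvec=D_inverse)
    m, info = cg(D_inv, j)                         # solves D^-1 m = j for m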

The demo codes demos/getting_started_1.py and demos/Wiener_Filter.ipynb illustrate this.


Generative Models
-----------------

For more sophisticated measurement situations, involving non-linear measurements, unknown covariances, calibration constants and the like, it is recommended to formulate those as generative models, for which NIFTy provides powerful inference algorithms.

In a generative model, all known or unknown quantities are described as the results of generative processes, which start with simple probability distributions, like the uniform, the i.i.d. Gaussian, or the delta distribution.

Let us rewrite the above free theory as a generative model:

.. math::

    s = A\,\xi

with :math:`{A}` the amplitude operator, such that it generates signal field realizations with the correct covariance :math:`{S=A\,A^\dagger}` when applied to a white Gaussian field :math:`{\xi}` with :math:`{\mathcal{P}(\xi)= \mathcal{G}(\xi, 1)}`.
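
In a toy NumPy sketch (the spectrum and the construction of :math:`{A}` via a unitary FFT are illustrative assumptions), drawing such a realization reads:

.. code-block:: python

    import numpy as np

    Q = 256
    k = np.abs(np.fft.fftfreq(Q, d=1.0 / Q))
    P_s = 1.0 / (1.0 + k) ** 3                         # assumed power spectrum
    xi = np.random.default_rng(2).standard_normal(Q)   # white Gaussian field
    # s = A xi with A = F sqrt(P_s)^, so that the samples have the
    # covariance S = A A^dagger = F P_s^ F^dagger
    s = np.fft.ifft(np.sqrt(P_s) * np.fft.fft(xi, norm='ortho'),
                    norm='ortho').real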

The joint information Hamiltonian for the whitened signal field :math:`{\xi}` reads:

.. math::

    \mathcal{H}(d,\xi)= -\log \mathcal{P}(d,\xi)= \frac{1}{2} \xi^\dagger \xi + \frac{1}{2} (d-R\,A\,\xi)^\dagger N^{-1} (d-R\,A\,\xi) + \mathrm{const}.

NIFTy takes advantage of this formulation in several ways:

1) All prior degrees of freedom have unit covariance, which improves the condition number of operators that need to be inverted.
2) The amplitude operator can be regarded as part of the response, :math:`{R'=R\,A}`. In general, more sophisticated responses can be constructed out of the composition of simpler operators.
3) The response can be non-linear, e.g. :math:`{R'(s)=R \exp(A\,\xi)}`; see demos/getting_started_2.py and the sketch after this list.
4) The amplitude operator can be made dependent on unknowns as well, e.g. :math:`A=A(\tau)= F\, \widehat{e^\tau}` represents an amplitude operator with a positive definite, unknown spectrum defined in the Fourier domain. The amplitude field :math:`{\tau}` would get its own amplitude operator, with a cepstrum (spectrum of a log spectrum) defined in quefrency space (harmonic space of a logarithmically binned harmonic space) to regularize its degrees of freedom by imposing some (user-defined degree of) spectral smoothness.
5) NIFTy can calculate the gradient of the information Hamiltonian and the Fisher information metric with respect to all unknown parameters, here :math:`{\xi}` and :math:`{\tau}`, by automatic differentiation. The gradients are used for MAP and HMCF estimates, and the Fisher matrix is required in addition to the gradient by Metric Gaussian Variational Inference (MGVI), which is available in NIFTy as well. MGVI is an implicit operator extension of Automatic Differentiation Variational Inference (ADVI).
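
To illustrate point 3, here is a toy NumPy sketch (the amplitude construction and the spectrum are illustrative assumptions, not NIFTy classes) of a log-normal response built by composition:

.. code-block:: python

    import numpy as np

    def amplitude(xi, sqrt_P):
        # A xi with A = F sqrt(P_s)^ (toy construction, cf. the sampling
        # sketch above)
        return np.fft.ifft(sqrt_P * np.fft.fft(xi, norm='ortho'),
                           norm='ortho').real

    def response(xi, R, sqrt_P):
        # non-linear response R'(xi) = R exp(A xi): a log-normal signal model
        return R @ np.exp(amplitude(xi, sqrt_P))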

The reconstruction of a non-Gaussian signal with unknown covariance from a non-trivial (tomographic) response is demonstrated in demos/getting_started_3.py. Here, the uncertainty of the field and the power spectrum of its generating process are probed via posterior samples provided by the MGVI algorithm.

+----------------------------------------------------+
| **Output of tomography demo getting_started_3.py** |
+----------------------------------------------------+
| .. image:: images/getting_started_3_setup.png      |
|                                                    |
+----------------------------------------------------+
| Non-Gaussian signal field,                         |
| data backprojected into the image domain, power    |
| spectrum of underlying Gaussian process.           |
+----------------------------------------------------+
| .. image:: images/getting_started_3_results.png    |
|                                                    |
+----------------------------------------------------+
| Posterior mean field signal                        |
| reconstruction, its uncertainty, and the power     |
| spectrum of the process for different posterior    |
| samples in comparison to the correct one (thick    |
| orange line).                                      |
+----------------------------------------------------+

Maximum a Posteriori
--------------------

One popular field estimation method is Maximum a Posteriori (MAP).

It only requires minimizing the information Hamiltonian, e.g. by a gradient descent method that stops when

.. math::

    \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} = 0.

NIFTy5 is able to calculate the necessary gradient from a generative model of the signal and the data and to minimize the Hamiltonian.
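
For the whitened linear toy model, the gradient is available analytically, and a MAP estimate can be sketched with SciPy (illustrative only; the composed response ``R_A``, the sizes, and the noise level are assumptions):

.. code-block:: python

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    n_s, n_d = 50, 20
    R_A = rng.standard_normal((n_d, n_s)) / np.sqrt(n_s)   # assumed R' = R A
    n_var = 0.1 * np.ones(n_d)                             # noise variances
    d = R_A @ rng.standard_normal(n_s) \
        + np.sqrt(n_var) * rng.standard_normal(n_d)        # simulated data

    def H(xi):
        # H(d, xi) = 1/2 xi^T xi + 1/2 (d - R A xi)^T N^-1 (d - R A xi)
        r = d - R_A @ xi
        return 0.5 * xi @ xi + 0.5 * r @ (r / n_var)

    def grad_H(xi):
        # dH/dxi = xi - (R A)^T N^-1 (d - R A xi); vanishes at the MAP point
        return xi - R_A.T @ ((d - R_A @ xi) / n_var)

    xi_map = minimize(H, np.zeros(n_s), jac=grad_H, method='L-BFGS-B').x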

However, MAP often provides unsatisfactory results in case a deep hierarchical Bayesian network describes the signal and data generation.
The reason for this is that MAP ignores the volume factors in parameter space, which are not to be neglected in deciding whether a solution is reasonable or not.
In the high dimensional setting of field inference these volume factors can differ by large ratios.
A MAP estimate, which is only representative for a tiny fraction of the parameter space, might be a poorer choice (with respect to an error norm) compared to a slightly worse location with slightly lower posterior probability, which, however, is associated with a much larger volume (of nearby locations with similar probability).

This causes MAP signal estimates to be more prone to overfitting the noise as well as to perception thresholds than methods that take volume effects into account.


Variational Inference
---------------------

One method that takes volume effects into account is Variational Inference (VI). 
In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler distribution, often a Gaussian :math:`\mathcal{Q}(\xi)=\mathcal{G}(\xi-m,D)`.
The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and the uncertainty dispersion :math:`D`, are obtained by minimizing an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.
As a compromise between being optimal and being computationally affordable, the (reverse) Kullback-Leibler (KL) divergence is used in VI:

.. math::

    \mathrm{KL}(m,D|d)= \mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})=
    \int \mathcal{D}\xi \,\mathcal{Q}(\xi) \log \left( \frac{\mathcal{Q}(\xi)}{\mathcal{P}(\xi)} \right)

Minimizing this with respect to all entries of the covariance :math:`D` is infeasible for fields.
Therefore, Metric Gaussian Variational Inference (MGVI) makes the Ansatz to approximate the precision matrix :math:`M=D^{-1}` by the Bayesian Fisher information metric,

.. math::

    M \approx \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \, \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}^\dagger \right\rangle_{(d,\xi)},

Torsten Ensslin's avatar
Torsten Ensslin committed
244 245
where in practice the average is performed over :math:`\mathcal{P}(d,\xi)\approx \mathcal{P}(d|\xi)\,\mathcal{Q}(\xi)` by evaluating the expression at :math:`\xi` samples drawn from the Gaussian :math:`\mathcal{Q}(\xi)` and corresponding data samples drawn from their generative process :math:`\mathcal{P}(d|\xi)`.

With this approximation, the KL effectively becomes a function of the mean :math:`m` alone, as :math:`D = D(m) \approx M^{-1}`. Thus, only the gradient of the KL with respect to :math:`m` is needed, which can be expressed as

.. math::

    \frac{\partial \mathrm{KL}(m|d)}{\partial m} = \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}  \right\rangle_{\mathcal{G}(\xi-m,D)}.

The advantage of this Ansatz is that the averages can be represented by sample averages, and all the gradients are represented by operators that NIFTy5 can calculate and that do not need the storage of full matrices. Therefore, NIFTy5 is able to draw samples according to a Gaussian with a covariance given by the inverse information metric, and to minimize the KL correspondingly.
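
A self-contained toy sketch of this scheme in NumPy (same assumed linear model as in the MAP sketch above; the explicit matrices and the Cholesky factor appear only because the toy problem is tiny, NIFTy never builds them):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(4)
    n_s, n_d = 50, 20
    R_A = rng.standard_normal((n_d, n_s)) / np.sqrt(n_s)   # assumed response
    n_var = 0.1 * np.ones(n_d)                             # noise variances
    d = R_A @ rng.standard_normal(n_s) \
        + np.sqrt(n_var) * rng.standard_normal(n_d)        # simulated data

    def grad_H(xi):
        # gradient of H(d, xi) for the whitened linear model
        return xi - R_A.T @ ((d - R_A @ xi) / n_var)

    M = np.eye(n_s) + R_A.T @ (R_A / n_var[:, None])   # Fisher metric approx.
    L = np.linalg.cholesky(np.linalg.inv(M))           # D^(1/2); toy-sized

    def kl_gradient(m, n_samples=20):
        # sample average of dH/dxi over xi ~ G(xi - m, D)
        xis = m + (L @ rng.standard_normal((n_s, n_samples))).T
        return np.mean([grad_H(xi) for xi in xis], axis=0)

    m = np.zeros(n_s)
    for _ in range(300):               # minimize the KL over the mean m
        m -= 0.05 * kl_gradient(m)
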
Setting up a KL for MGVI is done via objects of the class ``MetricGaussianKL``.

It should be noted that ``MetricGaussianKL`` does not estimate the full KL, as within the MGVI approximation the KL is only optimized with respect to the posterior mean, and any other additive part of the KL is dropped. It turns out that a metric Gaussian average of the original information Hamiltonian contains all necessary dependencies:

.. math::

    \mathrm{KL}(m|d) =   \left\langle - \mathcal{H}_\mathcal{Q}(\xi|m) + \mathcal{H}(\xi|d) 
     \right\rangle_{\mathcal{Q}(\xi)}

where

.. math::

    \mathcal{H}_\mathcal{Q}(\xi|m) = - \log \mathcal{Q}(\xi) =
    - \log \mathcal{G}(\xi-m,D)

is the information Hamiltonian of the approximating Gaussian and

.. math::

    \mathcal{H}(\xi|d) = - \log \mathcal{P}(\xi|d) =
    - \log \left( \frac{\mathcal{P}(d,\xi)}{\mathcal{P}(d)} \right) =
    \mathcal{H}(d,\xi) - \mathcal{H}(d)

the posterior information Hamiltonian.

Since neither

.. math::

    \left\langle \mathcal{H}_\mathcal{Q}(\xi|m) \right\rangle_{\mathcal{Q}(\xi)} =
    \frac{1}{2} \log \left| 2\pi e D \right|

nor

.. math::

    \left\langle \mathcal{H}(d) \right\rangle_{\mathcal{Q}(\xi)} =
    \mathcal{H}(d)

depend directly on :math:`m`, they are dropped from the KL to be minimized. What remains is

.. math::

    \mathrm{KL}(m|d) \;\widehat{=}\;
    \left\langle  \mathcal{H}(\xi,d)    \right\rangle_{\mathcal{Q}(\xi)},

where :math:`\widehat{=}` expresses equality up to irrelevant (here not :math:`m`-dependent) terms.
The fact that the KL also depends on :math:`m` in a second, indirect way, through :math:`D = D(m)`, is ignored in the MGVI approach. This can often be justified, as uncertainties are usually mostly determined by the :math:`m`-independent measurement setup, and the variation of the uncertainty with the posterior mean is expected to be subdominant and of moderate importance for the KL.

In this way, the demo getting_started_3.py, for example, infers not only a field, but also the power spectrum of the process that generated the field.
The cross-correlation of field and power spectrum is thereby taken care of.
Posterior samples can be obtained to study this cross-correlation.

It should be noted that MGVI, as any VI method, typically underestimates uncertainties due to the fact that :math:`\mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P})`, the reverse KL, is used, whereas :math:`\mathcal{D}_\mathrm{KL}(\mathcal{P}||\mathcal{Q})` would be optimal to approximate :math:`\mathcal{P}` by :math:`\mathcal{Q}` from an information theoretical perspective.
This, however, would require that one is able to integrate the posterior, in which case one could calculate the desired posterior mean and its uncertainty covariance directly and therefore would have no need to perform VI.