IFT -- Information Field Theory
===============================

Theoretical Background
----------------------

Information Field Theory [1]_ (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields.
A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe.
IFT describes how data and knowledge can be used to infer field properties.
Mathematically it is a statistical field theory and exploits many of the tools developed for such.
Practically, it is a framework for signal processing and image reconstruction.

IFT is fully Bayesian.
How else could infinitely many field degrees of freedom be constrained by finite data?

There is a full toolbox of methods that can be used, like the classical approximation (= Maximum a posteriori = MAP), effective action (= Variational Bayes = VI), Feynman diagrams, renormalization, and more.
IFT reproduces many known, well-working algorithms, which is reassuring.
Also, there were certainly previous works in a similar spirit.
Anyhow, in many cases IFT provides novel rigorous ways to extract information from data.
NIFTy comes with reimplemented MAP and VI estimators.

.. tip:: A didactic introduction to *information field theory* can be found in [2]_. Other resources include the `Wikipedia entry <https://en.wikipedia.org/wiki/Information_field_theory>`_, [3]_ and [4]_.

.. [1] T.A. Enßlin et al. (2009), "Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis", Phys. Rev. D 80, 105005; `arXiv:0806.3474 <https://arxiv.org/abs/0806.3474>`_

.. [2] T.A. Enßlin (2019), "Information theory for fields", Annalen der Physik; `arXiv:1804.03350 <https://arxiv.org/abs/1804.03350>`_

.. [3] T.A. Enßlin (2013), "Information field theory", proceedings of MaxEnt 2012 -- the 32nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering; AIP Conference Proceedings, Volume 1553, Issue 1, p.184; `arXiv:1301.2556 <https://arxiv.org/abs/1301.2556>`_

.. [4] T.A. Enßlin (2014), "Astrophysical data analysis with information field theory", AIP Conference Proceedings, Volume 1636, Issue 1, p.49; `arXiv:1405.7701 <https://arxiv.org/abs/1405.7701>`_


Free Theory & Implicit Operators
--------------------------------

A free IFT appears when the signal field :math:`s` and the noise :math:`n` of the data :math:`d` are independent, zero-centered Gaussian processes of known covariances :math:`S` and :math:`N`, respectively,

.. math::

    \mathcal{P}(s,n) = \mathcal{G}(s,S)\,\mathcal{G}(n,N),
and the measurement equation is linear in both signal and noise,

.. math::

    d = R\, s + n,

with :math:`R` being the measurement response, which maps the continuous signal field into the discrete data space.

This is called a free theory, as the information Hamiltonian

.. math::

    \mathcal{H}(d,s) = -\log \mathcal{P}(d,s) = \frac{1}{2} s^\dagger S^{-1} s + \frac{1}{2} (d-R\,s)^\dagger N^{-1} (d-R\,s) + \mathrm{const}

is only of quadratic order in :math:`s`, which leads to a linear relation between the data and the posterior mean field.

In this case, the posterior is

.. math::

    \mathcal{P}(s|d) = \mathcal{G}(s-m,D)

with

.. math::

    m = D\, j

the posterior mean field,

.. math::

    D = \left( S^{-1} + R^\dagger N^{-1} R \right)^{-1}

the posterior covariance operator, and

.. math::

    j = R^\dagger N^{-1} d

the information source.
The operation in :math:`m = D\,R^\dagger N^{-1} d` is also called the generalized Wiener filter.

NIFTy permits the involved operators :math:`R`, :math:`R^\dagger`, :math:`S`, and :math:`N` to be defined implicitly, as routines that can be applied to vectors, but which do not require the explicit storage of the matrix elements of the operators.

Some of these operators are diagonal in the harmonic (Fourier) basis and therefore only require the specification of a (power) spectrum, so that :math:`S = F\,\widehat{P_s}\,F^\dagger`.
Here :math:`F = \mathrm{HarmonicTransformOperator}`, :math:`\widehat{P_s} = \mathrm{DiagonalOperator}(P_s)`, and :math:`P_s(k)` is the power spectrum of the process that generated :math:`s` as a function of the (absolute value of the) harmonic (Fourier) space coordinate :math:`k`.
For those, NIFTy can easily also provide inverse operators, as :math:`S^{-1} = F\,\widehat{\frac{1}{P_s}}\,F^\dagger` in case :math:`F` is unitary, :math:`F^\dagger = F^{-1}`.

These implicit operators can be combined into new operators, e.g. :math:`D^{-1} = S^{-1} + R^\dagger N^{-1} R`, as well as their inverses, e.g. :math:`D = \left( D^{-1} \right)^{-1}`.
The invocation of an inverse operator applied to a vector might trigger the execution of a numerical linear algebra solver.
Thus, when NIFTy calculates :math:`m = D\, j`, it actually solves :math:`D^{-1} m = j` for :math:`m` behind the scenes.
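In code, the Wiener filter described above can be set up along the following lines.
This is a minimal sketch assuming the nifty7 operator interface (``RGSpace``, ``HarmonicTransformOperator``, ``create_power_operator``, ``SandwichOperator``, ``GeometryRemover``, ``InversionEnabler``); the power spectrum and noise level are made-up numbers, and the demos cited below remain the authoritative reference.

.. code-block:: python

    import nifty7 as ift

    # Signal on a regular grid; its prior covariance S is diagonal in the
    # harmonic basis and thus fully specified by a power spectrum.
    position_space = ift.RGSpace(256)
    harmonic_space = position_space.get_default_codomain()
    F = ift.HarmonicTransformOperator(harmonic_space, target=position_space)

    P_s = ift.create_power_operator(harmonic_space, lambda k: 100./(20. + k**3))
    S = ift.SandwichOperator.make(F.adjoint, P_s)   # S = F P_s F^dagger

    R = ift.GeometryRemover(position_space)         # trivial response
    N = ift.ScalingOperator(R.target, 0.1)          # noise covariance
    d = ift.from_random(R.target)                   # stand-in for real data

    # D^{-1} = S^{-1} + R^dagger N^{-1} R is never stored as a matrix;
    # applying its inverse triggers a conjugate-gradient solve of D^{-1} m = j.
    j = R.adjoint(N.inverse(d))
    D_inv = S.inverse + R.adjoint @ N.inverse @ R
    IC = ift.GradientNormController(iteration_limit=500, tol_abs_gradnorm=1e-3)
    D = ift.InversionEnabler(D_inv, IC).inverse
    m = D(j)                                        # posterior mean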
The advantage of implicit operators compared to explicit matrices is the reduced memory consumption; for the reconstruction of just a megapixel image the latter would already require several terabytes.
Larger images could not be dealt with due to the quadratic memory requirements of explicit operator representations.

The demo codes demos/getting_started_1.py and demos/Wiener_Filter.ipynb illustrate this.


Generative Models
-----------------

For more sophisticated measurement situations (involving non-linear measurements, unknown covariances, calibration constants and the like) it is recommended to formulate those as generative models, for which NIFTy provides powerful inference algorithms.

In a generative model, all known or unknown quantities are described as the results of generative processes, which start with simple probability distributions, like the uniform, the i.i.d. Gaussian, or the delta distribution.

Let us rewrite the above free theory as a generative model:

.. math::

    s = A\,\xi

with :math:`A` the amplitude operator such that it generates signal field realizations with the correct covariance :math:`S = A\,A^\dagger` when being applied to a white Gaussian field :math:`\xi` with :math:`\mathcal{P}(\xi) = \mathcal{G}(\xi, 1)`.

The joint information Hamiltonian for the standardized signal field :math:`\xi` reads:

.. math::

    \mathcal{H}(d,\xi) = -\log \mathcal{P}(d,s) = \frac{1}{2} \xi^\dagger \xi + \frac{1}{2} (d-R\,A\,\xi)^\dagger N^{-1} (d-R\,A\,\xi) + \mathrm{const}.

NIFTy takes advantage of this formulation in several ways:

1) All prior degrees of freedom have unit covariance, which improves the condition number of operators that need to be inverted.

2) The amplitude operator can be regarded as part of the response, :math:`R' = R\,A`.
   In general, more sophisticated responses can be obtained by combining simpler operators.

3) The response can be non-linear, e.g. :math:`R'(\xi) = R\,\exp(A\,\xi)`, see demos/getting_started_2.py.

4) The amplitude operator may depend on further parameters, e.g. :math:`A = A(\tau) = e^{2\tau}` represents an amplitude operator with a positive definite, unknown spectrum.
   The log-amplitude field :math:`\tau` is modelled with the help of an integrated Wiener process in order to impose some (user-defined degree of) spectral smoothness.
   A code sketch of such a model follows this list.

5) NIFTy calculates the gradient of the information Hamiltonian and the Fisher information metric with respect to all unknown parameters, here :math:`\xi` and :math:`\tau`, by automatic differentiation.
   The gradients are used for MAP estimates, and the Fisher matrix is required in addition to the gradient by Metric Gaussian Variational Inference (MGVI), which is available in NIFTy as well.
   MGVI is an implicit operator extension of Automatic Differentiation Variational Inference (ADVI).
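As an illustration of points 3) and 4), the sketch below builds such a non-linear generative model.
``SimpleCorrelatedField`` and ``GeometryRemover`` are taken from the nifty7 interface as the author understands it, and all hyperparameter values are made-up numbers for illustration; the demos remain the authoritative reference.

.. code-block:: python

    import nifty7 as ift

    position_space = ift.RGSpace([128, 128])

    # s = A(tau) xi: a Gaussian field whose power spectrum is inferred
    # along with the field itself; the tuples are (mean, std) priors.
    correlated_field = ift.SimpleCorrelatedField(
        position_space,
        offset_mean=0.,
        offset_std=(1e-3, 1e-6),
        fluctuations=(1., 0.8),     # a priori amplitude of the variations
        flexibility=(1., 0.5),      # deviations from power-law smoothness
        asperity=(0.1, 0.5),        # small-scale roughness of the spectrum
        loglogavgslope=(-3., 1.),   # average slope of the power spectrum
    )

    # Non-linear response R'(xi) = R exp(A xi): the point-wise exponential
    # enforces positivity of the signal before measurement.
    R = ift.GeometryRemover(position_space)
    signal_response = R @ ift.exp(correlated_field)

    # Any generative model can be run forward to produce synthetic data.
    xi = ift.from_random(signal_response.domain)
    synthetic_data = signal_response(xi)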
The reconstruction of a non-Gaussian signal with unknown covariance from a non-trivial (tomographic) response is demonstrated in demos/getting_started_3.py.
Here, the uncertainty of the field and the power spectrum of its generating process are probed via posterior samples provided by the MGVI algorithm.

+----------------------------------------------------+
| **Output of tomography demo getting_started_3.py** |
+----------------------------------------------------+
| .. image:: images/getting_started_3_setup.png      |
|                                                    |
+----------------------------------------------------+
| Non-Gaussian signal field,                         |
| data backprojected into the image domain, power    |
| spectrum of underlying Gaussian process.           |
+----------------------------------------------------+
| .. image:: images/getting_started_3_results.png    |
|                                                    |
+----------------------------------------------------+
| Posterior mean field signal                        |
| reconstruction, its uncertainty, and the power     |
| spectrum of the process for different posterior    |
| samples in comparison to the correct one (thick    |
| orange line).                                      |
+----------------------------------------------------+


Maximum a Posteriori
--------------------

One popular field estimation method is Maximum a Posteriori (MAP).

It only requires minimizing the information Hamiltonian, e.g. by a gradient descent method that stops when

.. math::

    \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} = 0.

NIFTy7 automatically calculates the necessary gradient from a generative model of the signal and the data, and uses this to minimize the Hamiltonian.
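A MAP estimate for a model like the one sketched above might be computed along the following lines.
``GaussianEnergy``, ``StandardHamiltonian``, ``EnergyAdapter``, and ``NewtonCG`` are assumed from the nifty7 interface, and ``data`` and ``N`` (the noise covariance) are taken as given.

.. code-block:: python

    import nifty7 as ift

    # Attach a Gaussian likelihood to the generative model from above.
    likelihood = ift.GaussianEnergy(mean=data, inverse_covariance=N.inverse) @ signal_response
    H = ift.StandardHamiltonian(likelihood)

    # Newton-CG descent on H(d, xi); gradient and metric come from
    # automatic differentiation, no hand-written derivatives are needed.
    position = ift.from_random(H.domain)
    minimizer = ift.NewtonCG(
        ift.GradientNormController(iteration_limit=50, tol_abs_gradnorm=1e-3))
    E = ift.EnergyAdapter(position, H, want_metric=True)
    E, convergence = minimizer(E)
    map_estimate = E.position    # xi at the minimum of the Hamiltonian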
However, MAP often provides unsatisfactory results in the case of deep hierarchical Bayesian networks.
The reason for this is that MAP ignores the volume factors in parameter space, which are not to be neglected in deciding whether a solution is reasonable or not.
In the high-dimensional setting of field inference these volume factors can differ by large ratios.
A MAP estimate, which is only representative of a tiny fraction of the parameter space, might be a poorer choice (with respect to an error norm) than a location with slightly lower posterior probability which, however, is associated with a much larger volume (of nearby locations with similar probability).
This causes MAP signal estimates to be more prone to overfitting the noise, as well as to perception thresholds, than methods that take volume effects into account.


Metric Gaussian Variational Inference
-------------------------------------

One method that takes volume effects into account is Variational Inference (VI).
In VI, the posterior :math:`\mathcal{P}(\xi|d)` is approximated by a simpler, parametrized distribution, often a Gaussian :math:`\mathcal{Q}(\xi) = \mathcal{G}(\xi-m,D)`.
The parameters of :math:`\mathcal{Q}`, the mean :math:`m` and its covariance :math:`D`, are obtained by minimization of an appropriate information distance measure between :math:`\mathcal{Q}` and :math:`\mathcal{P}`.
As a compromise between being optimal and being computationally affordable, the variational Kullback-Leibler (KL) divergence is used:

.. math::

    \mathrm{KL}(m,D|d) = \mathcal{D}_\mathrm{KL}(\mathcal{Q}||\mathcal{P}) = \int \mathcal{D}\xi \,\mathcal{Q}(\xi) \log \left( \frac{\mathcal{Q}(\xi)}{\mathcal{P}(\xi)} \right).

Minimizing this with respect to all entries of the covariance :math:`D` is unfeasible for fields.
Therefore, Metric Gaussian Variational Inference (MGVI) approximates the posterior precision matrix :math:`D^{-1}` at the location of the current mean :math:`m` by the Bayesian Fisher information metric,

.. math::

    M \approx \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \, \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi}^\dagger \right\rangle_{(d,\xi)}.

In practice the average is performed over :math:`\mathcal{P}(d,\xi) \approx \mathcal{P}(d|\xi)\,\delta(\xi-m)` by evaluating the expression at the current mean :math:`m`.
This results in the Fisher information metric of the likelihood evaluated at the mean plus the prior information metric.
Therefore we only have to infer the mean of the approximate distribution.
The only term within the KL-divergence that explicitly depends on it is the Hamiltonian of the true problem averaged over the approximation:
.. math::

    \mathrm{KL}(m|d) \;\widehat{=}\; \left\langle \mathcal{H}(\xi,d) \right\rangle_{\mathcal{Q}(\xi)},

where :math:`\widehat{=}` expresses equality up to irrelevant (here not :math:`m`-dependent) terms.

Thus, only the gradient of the KL with respect to the mean is needed, which can be expressed as

.. math::

    \frac{\partial \mathrm{KL}(m|d)}{\partial m} = \left\langle \frac{\partial \mathcal{H}(d,\xi)}{\partial \xi} \right\rangle_{\mathcal{G}(\xi-m,D)}.

We stochastically estimate the KL-divergence and its gradient with a set of samples drawn from the approximate posterior distribution.
The particular structure of the covariance allows us to draw independent samples by solving a certain system of equations.
This KL-divergence for MGVI is implemented by :func:`~nifty7.minimization.kl_energies.MetricGaussianKL` within NIFTy7.

Note that MGVI typically provides only a lower bound on the variance.
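Schematically, an MGVI run alternates between drawing samples around the current mean and minimizing the stochastic KL estimate over the mean.
The sketch below assumes the ``likelihood`` from the MAP example; the exact ``MetricGaussianKL`` call signature is an assumption based on the nifty7 interface referenced above.

.. code-block:: python

    import nifty7 as ift

    # Drawing approximate posterior samples requires an iteration
    # controller for the internal sampling solver.
    ic_sampling = ift.GradientNormController(iteration_limit=100)
    H = ift.StandardHamiltonian(likelihood, ic_sampling)

    minimizer = ift.NewtonCG(ift.GradientNormController(iteration_limit=20))

    mean = ift.from_random(H.domain)
    for _ in range(5):                 # MGVI iterations
        # Gaussian around `mean`, with the inverse Fisher metric as its
        # covariance, represented stochastically by (mirrored) samples.
        kl = ift.MetricGaussianKL(mean, H, n_samples=4, mirror_samples=True)
        kl, _ = minimizer(kl)          # minimize the sampled KL over m
        mean = kl.position

    # Posterior samples probe the remaining uncertainty.
    posterior_samples = [kl.position + s for s in kl.samples]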
Geometric Variational Inference
-------------------------------

For non-linear posterior distributions :math:`\mathcal{P}(\xi|d)` an approximation with a Gaussian :math:`\mathcal{Q}(\xi)` in the coordinates :math:`\xi` is sub-optimal, as higher-order interactions are ignored.
A better approximation can be achieved by constructing a coordinate system :math:`y = g\left(\xi\right)` in which the posterior is close to a Gaussian, and performing VI with a Gaussian :math:`\mathcal{Q}(y)` in these coordinates.
This approach is called Geometric Variational Inference (geoVI).
It is discussed in detail in [6]_.

.. [6] P. Frank, R. Leike, and T.A. Enßlin (2021), "Geometric Variational Inference"; `arXiv:2105.10470 <https://arxiv.org/abs/2105.10470>`_

One useful coordinate system is obtained in case the metric :math:`M` of the posterior can be expressed as the pullback of the Euclidean metric by :math:`g`:

.. math::

    M = \left(\frac{\partial g}{\partial \xi}\right)^T \frac{\partial g}{\partial \xi} \ .

In general, such a transformation exists only locally, i.e. in a neighbourhood of some expansion point :math:`\bar{\xi}`, denoted as :math:`g_{\bar{\xi}}\left(\xi\right)`.
Using :math:`g_{\bar{\xi}}`, the geoVI scheme uses a zero-mean, unit Gaussian :math:`\mathcal{Q}(y) = \mathcal{G}(y, 1)` approximation.
It can be expressed in :math:`\xi` coordinates via the pushforward by the inverse transformation :math:`\xi = g_{\bar{\xi}}^{-1}(y)`:

.. math::

    \mathcal{Q}_{\bar{\xi}}(\xi) = \left(g_{\bar{\xi}}^{-1} * \mathcal{Q}\right)(\xi) = \int \delta\left(\xi - g_{\bar{\xi}}^{-1}(y)\right) \ \mathcal{G}(y, 1) \ \mathcal{D}y \ ,

where :math:`\delta` denotes the Dirac delta distribution.

geoVI obtains the optimal expansion point :math:`\bar{\xi}` such that :math:`\mathcal{Q}_{\bar{\xi}}` matches the posterior as well as possible.
Analogously to the MGVI algorithm, :math:`\bar{\xi}` is obtained by minimization of the KL-divergence between :math:`\mathcal{P}` and :math:`\mathcal{Q}_{\bar{\xi}}` w.r.t. :math:`\bar{\xi}`.
Furthermore, the KL is represented as a stochastic estimate, using a set of samples drawn from :math:`\mathcal{Q}_{\bar{\xi}}`, which is implemented in NIFTy7 via :func:`~nifty7.minimization.kl_energies.GeoMetricKL`.

A visual comparison of the MGVI and geoVI algorithms can be found in demos/variational_inference_visualized.py.
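A geoVI run looks almost identical to an MGVI run; the sampling step additionally requires a minimizer to invert :math:`g_{\bar{\xi}}` numerically.
The sketch below reuses ``H`` from the MGVI example; the keyword names, in particular ``minimizer_samp``, are assumptions based on the nifty7 interface referenced above.

.. code-block:: python

    import nifty7 as ift

    minimizer = ift.NewtonCG(ift.GradientNormController(iteration_limit=20))
    # Extra minimizer for drawing non-linear (geometric) samples.
    minimizer_sampling = ift.NewtonCG(
        ift.AbsDeltaEnergyController(deltaE=0.5, iteration_limit=10))

    mean = ift.from_random(H.domain)
    for _ in range(5):                 # geoVI iterations
        kl = ift.GeoMetricKL(mean, H, n_samples=4, mirror_samples=True,
                             minimizer_samp=minimizer_sampling)
        kl, _ = minimizer(kl)
        mean = kl.position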