%% Cell type:markdown id: tags:
# Learning atomic charges
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Learning atomic charges</font></b>
</div>
<p>
created by:
Gábor Csányi,
James R. Kermode
<br><p>
gc121@cam.ac.uk, j.r.kermode@warwick.ac.uk
<br><br>
<div>
<img style="float: right;" src="assets/soap_atomic_charges/Logo_NOMAD.png" width="250">
</div>
</div>
_Gábor Csányi (gc121@cam.ac.uk), James R. Kermode (j.r.kermode@warwick.ac.uk)_
%% Cell type:markdown id: tags:
In this tutorial, we will use Gaussian process regression (GPR), or equivalently kernel ridge regression (KRR), to train on and predict the charges of atoms in small organic molecules.
%% Cell type:markdown id: tags:
In GPR, we fit a function in a moderate-dimensional space using basis functions that are typically symmetric "similarity" functions: they describe how similar we expect the function value to be at two different points in the input space. In its simplest form, when we fit a function $f$ using input data $y_i$ that are function values at selected points $x_i$, we have
$$
f(x) = \sum_{i=1}^N \alpha_i K(x_i, x)
$$
where $K(x,x')$ is a positive definite similarity function, with value $1$ when $x=x'$ and lower values for differing arguments. The $\alpha_i$ coefficients are the degrees of freedom of the fit, and we need to determine them from the data. The sum runs over the available data points. (In principle we can choose fewer basis functions than data points, which is called sparsification, or even more if we want to, but that goes beyond the scope of this tutorial.)
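%% Cell type:markdown id: tags:
To make this concrete, here is a minimal sketch of such a similarity function (a squared-exponential kernel on scalars; the functional form and its width are illustrative assumptions, not part of the original notebook):
%% Cell type:code id: tags:
``` python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    # squared-exponential similarity: equals 1 when x == xp, decays with distance
    return np.exp(-0.5 * (x - xp)**2 / sigma**2)

print(rbf_kernel(0.0, 0.0))  # 1.0 -- identical inputs
print(rbf_kernel(0.0, 2.0))  # < 1 -- less similar inputs
```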
%% Cell type:markdown id: tags:
The difference between KRR and GPR is in the interpretation: in GPR we construct a probability distribution for the unknown function values, and $K$ is taken to be the formal covariance between function values,
$$
K(x,x') = \mathrm{Cov}\left(f(x), f(x')\right)
$$
%% Cell type:markdown id: tags:
Regardless of the interpretation, when the $\{x_i, y_i\}$ data are plugged in, we get a system of linear equations:
$$
y_j = \sum_i \alpha_i K(x_i,x_j)
$$
The solution is given by (in vector form)
$$
\boldsymbol{\alpha} = \left({\bf K} + \lambda^2 {\bf I}\right)^{-1} {\bf y}
$$
where the matrix ${\bf K}$ has elements $K(x_i,x_j)$. If the data were perfectly consistent, without any noise, then in principle we could obtain a fit with $\lambda$ set to zero. In practice, however, our data might have noise, or we might _choose_ not to require the interpolant (i.e. the fitted function $f$) to go through each data point _exactly_, preferring smoother functions that just go _close_ to the data points. We can achieve this by choosing a nonzero (but small) $\lambda$. In the GPR interpretation, $\lambda$ should be set to the standard deviation of the noise in our data. More on this below.
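%% Cell type:markdown id: tags:
Before turning to atomic charges, the whole procedure can be sketched in 1-D. All choices here (the target function, the noise level, the kernel and its width) are illustrative assumptions:
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# noisy function values y_i at selected points x_i
x_data = np.linspace(0, 2 * np.pi, 8)
y_data = np.sin(x_data) + 0.05 * rng.standard_normal(len(x_data))

def kernel_matrix(a, b, sigma=1.0):
    # pairwise squared-exponential kernel K(a_i, b_j)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / sigma**2)

lam = 0.05  # assumed noise standard deviation
K = kernel_matrix(x_data, x_data) + lam**2 * np.eye(len(x_data))
alpha = np.linalg.solve(K, y_data)  # alpha = (K + lambda^2 I)^{-1} y

x_plot = np.linspace(0, 2 * np.pi, 200)
y_plot = kernel_matrix(x_plot, x_data) @ alpha  # f(x) = sum_i alpha_i K(x_i, x)

plt.plot(x_data, y_data, "o", label="data")
plt.plot(x_plot, y_plot, label="GPR fit")
plt.legend()
```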
%% Cell type:code id: tags:
``` python
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
from ase.io import read
from quippy.descriptors import Descriptor
from soap_atomic_charges import ViewStructure
```
%% Cell type:markdown id: tags:
## Database
We load a database of small-molecule geometries and precomputed atomic charges. The file is a subset of 2000 molecules from the GDB9 dataset; the molecules contain H, C, N, O, and F atoms.
%% Cell type:code id: tags:
``` python
atAll = read('data/soap_atomic_charges/data_GDB9.xyz', index=':')
```
%% Cell type:code id: tags:
``` python
view = ViewStructure(atAll[2])
view
```
%% Cell type:code id: tags:
``` python
# We can access the atomic numbers of any molecule
atAll[0].numbers
```
%% Cell type:code id: tags:
``` python
# similarly, the positions
atAll[1].positions
```
%% Cell type:code id: tags:
``` python
# and the charges
atAll[1].get_initial_charges()
```
%% Cell type:markdown id: tags:
## SOAP Kernel and descriptor
Next, we need to define a kernel. There are many ways to choose what the atomic charges should be a function of, but somehow we need to describe the environment of each atom and then construct a similarity function that can serve as the kernel.
In this tutorial, we make the atomic charge a function of the near-environment of an atom (within a cutoff); we describe that environment using the SOAP descriptor and compare environments using the SOAP kernel. Note right away that the quantum-mechanically computed atomic charge is not fully determined by the near-environment (far-away atoms can also influence the charge, even if only slightly), so this is an early indication that we will be making use of the "noise" interpretation of the $\lambda$ regularisation parameter: we don't expect (and don't want) our fitted function to go through each data point exactly.
The SOAP descriptor of an atomic environment is based on a spherical harmonic expansion of the neighbour density; truncating this expansion at some maximum number of radial ($n_{\rm max}$) and angular ($l_{\rm max}$) indices gives rise to convergence parameters. We also need to specify the cutoff radius within which we consider the neighbour environment.
Writing the descriptor vector as $p_{ss'nn'l}$, where $s$ and $s'$ are indices that run over the different atomic species in the atom's environment, $n$ and $n'$ are radial and $l$ is an angular index, the kernel between two atomic environments is
$$
K(p,p') = \delta^2 \left| \sum_{ss'nn'l} p_{ss'nn'l} p'_{ss'nn'l}\right|^\zeta \equiv \delta^2 \left| {\bf p} \cdot {\bf p'}\right|^\zeta
$$
The factor of $\delta^2$ allows setting the scale of the fitted function relative to the error specification $\lambda$.
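%% Cell type:markdown id: tags:
As a sketch of this kernel evaluation (a hypothetical helper, not part of the notebook; two random unit vectors stand in for the descriptor vectors, which are computed properly below):
%% Cell type:code id: tags:
``` python
import numpy as np

def soap_kernel(p1, p2, delta=0.2, zeta=2):
    # dot-product kernel raised to the power zeta; delta sets the overall scale
    return delta**2 * np.abs(np.dot(p1, p2))**zeta

rng = np.random.default_rng(1)
p1 = rng.random(10); p1 /= np.linalg.norm(p1)  # stand-in for a normalised descriptor
p2 = rng.random(10); p2 /= np.linalg.norm(p2)
print(soap_kernel(p1, p1))  # delta**2 for identical environments
print(soap_kernel(p1, p2))  # smaller for dissimilar environments
```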
%% Cell type:markdown id: tags:
Below we fit and predict the atomic charge of H atoms.
%% Cell type:code id: tags:
``` python
# First we create the descriptor object
Z = 1 # compute environment around H atoms
# other parameters
# atom_sigma : width of Gaussian smearing around each neighbour atom
# n_max : maximum radial index
# l_max : maximum angular index
# cutoff : environment radius in Å
# Z : environments around this atomic number will be considered
# species_Z : atomic numbers we wish to consider in the environment around atoms of type Z
# n_species : length of species_Z list
desc = Descriptor(
"soap atom_sigma=0.5 n_max=3 l_max=3 cutoff=3.0 Z={:d} n_species=5 species_Z='1 6 7 8 9'".format(Z))
```
%% Cell type:markdown id: tags:
The ```desc``` object can be used to calculate the descriptor for all the atoms in an Atoms object:
%% Cell type:code id: tags:
``` python
Z  # confirm which atomic number we selected above
```
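%% Cell type:markdown id: tags:
As a quick check before looping over the whole database, we can inspect what `calc()` returns for a single molecule (assuming it contains at least one H atom; the `'data'` entry holds one descriptor vector per matching atom):
%% Cell type:code id: tags:
``` python
descAt = desc.calc(atAll[0])  # results for a single molecule
print(descAt.keys())
print(descAt['data'].shape)   # (number of H atoms, descriptor length)
```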
%% Cell type:code id: tags:
``` python
p, c = [], []
for at in atAll:
    descAt = desc.calc(at)  # calc() returns a dictionary, in which we find the descriptor vectors
    if 'data' in descAt:
        desc_vector = descAt['data']
        for pp in desc_vector:
            p.append(pp)
        for atom in at:
            if atom.number == Z:
                c.append(atom.charge)  # pick up the target atomic charge and store it in the same order
p = np.array(p)
c = np.array(c)
```
%% Cell type:markdown id: tags:
We now have a list of charges and a corresponding list of descriptors, collected into a matrix:
%% Cell type:code id: tags:
``` python
p.shape
```
%% Cell type:code id: tags:
``` python
c.shape
```
%% Cell type:code id: tags:
``` python
p
```
%% Cell type:markdown id: tags:
We now fit a function with GPR on a subset of the charge data, using the SOAP descriptor and kernel:
%% Cell type:code id: tags:
``` python
n_fit = 10 # use the first n_fit data points to fit
lambda_fit = 0.02 # assumed noise in the data, due to "model error", i.e. the limitations of the descriptor/kernel
zeta = 2 # kernel parameter
delta = 0.2 # kernel parameter
x_fit = p[:n_fit,:]
y_fit = c[:n_fit]
# compute the covariance matrix, including regularisation
K_fit = delta**2 * np.dot(x_fit, x_fit.T)**zeta + lambda_fit**2 * np.eye(n_fit)
# solve linear system to obtain fit coefficients
alpha = np.linalg.solve(K_fit, y_fit)
```
%% Cell type:code id: tags:
``` python
x_test = p[n_fit:,:]
y_test = c[n_fit:]
k_test = delta**2 * np.dot(x_test, x_fit.T)**zeta
y_test_predict = np.dot(k_test, alpha)
```
%% Cell type:code id: tags:
``` python
plt.plot(y_test, y_test_predict, ".")
plt.plot([-0.6,0.6], [-0.6,0.6])
plt.xlabel("Test data")
plt.ylabel("Predicted data")
print("RMS error:", np.sqrt(np.mean((y_test-y_test_predict)**2)))
```
%% Cell type:markdown id: tags:
### Error estimates
The GPR interpretation allows us to estimate the error of a prediction as the square root of the variance of the posterior distribution of the function. The corresponding formula is
$$
{\rm err}(x)^2 = K(x,x) - {\bf k}(x)^T \left({\bf K} + \lambda^2 {\bf I}\right)^{-1} {\bf k}(x)\qquad\qquad [{\bf k}(x)]_i = K(x,x_i)
$$
Notice how the error does not actually depend on the data values $\{y\}$, only on the locations $\{x\}$ and the kernel function. As you see below, this error estimate can on occasion be quite different from the actual error.
%% Cell type:code id: tags:
``` python
invK = np.linalg.inv(K_fit)
y_test_error_predict = np.zeros(len(x_test))
for i in range(len(x_test)):
    kappa = delta**2 * np.dot(x_test[i], x_test[i])**zeta  # K(x,x) for this test point
    y_test_error_predict[i] = np.sqrt(kappa - np.dot(k_test[i,:], np.dot(invK, k_test[i,:])))
```
%% Cell type:code id: tags:
``` python
plt.errorbar(y_test[::100], y_test_predict[::100], yerr=y_test_error_predict[::100], fmt=".")
plt.plot([-0.,0.4], [-0.0,0.4])
```
%% Cell type:markdown id: tags:
## Tasks
Now you are ready to complete the following exercises:
1. Increase the radial and angular expansions to try to achieve a better fit. Go in small steps, because for large expansions the calculation takes significantly longer. Notice how the predictions and the errors behave if you reduce the radial cutoff of the environment definition; can you explain what you observe?
2. Fit and predict the charge of other species (you will need to create a new descriptor object).
3. Study how the accuracy of the prediction depends on the number of fitting data points (a starting sketch follows this list).
4. For the low-quality fit above, you can see two clearly separated groups of H atoms. Try to identify what characterises these groups; inspect the molecules and the H atoms in each group.
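%% Cell type:markdown id: tags:
For Task 3, one possible starting point is sketched below (it reuses `p`, `c`, `delta`, `zeta` and `lambda_fit` from above; the fitting-set sizes and the fixed held-out test set are illustrative choices):
%% Cell type:code id: tags:
``` python
sizes = [10, 30, 100, 300, 1000]
x_test, y_test = p[1000:, :], c[1000:]  # fixed held-out test set
rms = []
for n in sizes:
    x_fit, y_fit = p[:n, :], c[:n]
    K_fit = delta**2 * np.dot(x_fit, x_fit.T)**zeta + lambda_fit**2 * np.eye(n)
    alpha = np.linalg.solve(K_fit, y_fit)
    k_test = delta**2 * np.dot(x_test, x_fit.T)**zeta
    rms.append(np.sqrt(np.mean((y_test - np.dot(k_test, alpha))**2)))
plt.loglog(sizes, rms, "o-")
plt.xlabel("number of fitting points")
plt.ylabel("RMS error on held-out charges")
```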
%% Cell type:code id: tags:
``` python
```