Commit b45ffa2c by Mohamed, Fawzi Roberto (fawzi)

 %% Cell type:markdown id: tags: # Learning atomic charges _Gábor Csányi (gc121@cam.ac.uk), James R. Kermode (j.r.kermode@warwick.ac.uk)_ In this tutorial, we will use Gaussian process regression, GPR (or equivalently, Kernel Ridge Regression, KRR) to train and predict charges of atoms in small organic molecules. %% Cell type:markdown id: tags: In GPR, we are fitting a function in a moderate dimensional space, using basis functions that are typically symmetric, "similarity" functions, they describe how similar we expect the function value to be at two different points in the input space. In its simplest form, when we fit a function $f$ using input data $y_i$ that are function values at selected points $x_i$, we have $$f(x) = \sum_i^N \alpha_i K(x_i, x)$$ where $K(x,x')$ is the positive definite similarity function, with a value $1$ when $x=x'$, and lower values for different arguments. The $\alpha$ coefficients are the degrees of freedom in the fit, and we need to determine them from the data. The sum runs through the available data points (but in principle we can choose fewer basis functions than datapoints (this is called sparsification), or even more if we want to, but that goes beyond the scope of this tutorial. ) %% Cell type:markdown id: tags: The difference between KRR and GPR is in the interpretation, in GPR we construct a probability distribution for the unknown function values, and the $K$ is taken to be the formal covariance between function values, $$K(x,x') = \rm{Cov}\left((f(x),f(x')\right)$$ %% Cell type:markdown id: tags: Regardless of the interpretation, when the ${x,y}$ data is plugged in, we get a system of linear equations: $$y_j = \sum_i \alpha_i K(x_i,x_j)$$ The solution is given by (in vectorial form) $${\bf \alpha} = \left({\bf K} + \lambda^2 {\bf I}\right)^{-1} {\bf y}$$ where the matrix $K$ has elements $K(x_i,x_j)$. If the data was perfectly consistent, without any noise, then in principle we could get an fit with $lambda$ set to zero. In practice however, our data might have noise, or we might _choose_ to not want the interpolant (ie the fitted function f) to go through each data point _exactly_, but prefer smoother functions that just go _close_ to the datapoints. We can achieve this by choosing a nonzero (but small) $\lambda$. In the GPR interpretation, $lambda$ should be set to the standard deviation of the noise our data has. More on this below. %% Cell type:code id: tags:  python %pylab inline import numpy as np import quippy import matplotlib as mpl import matplotlib.pyplot as plt import os from __future__ import print_function import sys sys.path.append("/data/shared/afekete/tutorial/") from scripts.Visualise import ViewStructure  %% Cell type:markdown id: tags: ## Database We load a database of small molecule geometries, and precomuputed atomic charges. This file is a subset of 2000 molecules from the GDB9 dataset. The molecules contain H, C, N, O, and F atoms. %% Cell type:code id: tags:  python atAll = quippy.AtomsList("/data/shared/afekete/tutorial/data/data_GDB9.xyz")  %% Cell type:code id: tags:  python view = ViewStructure(atAll[0]) view  %% Cell type:code id: tags:  python # We can access the atomic numbers of any molecule atAll[0].Z  %% Cell type:code id: tags:  python # similarly, the positions atAll[1].positions  %% Cell type:code id: tags:  python # and the charges atAll[1].charge  %% Cell type:markdown id: tags: ## SOAP Kernel and descriptor Next, we need to define a kernel. There are many ways to define what the atomic charges would be a function of, but somehow we need to describe the environment of the atom, and then construct a similarity function that can serve as the kernel function. In this tutorial, we are going to make the atomic charge a function of the near-environment of an atom (within a cutoff), and we will describe that environment using the SOAP descriptor and compare them using the SOAP kernel. Note right away that the quantum mechanically computed atomic charge is not fully determined by the near-environment of atoms (far-away atoms can also influence the charge, even if just to a small extent), so this is an early indication that we will be making use of the "noise" interpretation of the $\lambda$ regularization parameter: we don't expect (and don't want) our fitted function to precisely go through each datapoint. The SOAP descriptor of an atomic environment is based on a spherical harmonic expansion of the neighbour density, and truncating this expansion at some maximum numer of radial (n_max) and angular (l_max) indices gives rise to some parameters. We also need to give the cutoff within which we consider the neighbour environment. Writing the descriptor vector as $p_{ss'nn'l}$, where $s$ and $s'$ are indices that run over the different atomic species in the atom's environment, $n$ and $n'$ are radial and $l$ is an angular index, the kernel between two atomic environments is $$K(p,p') = \delta^2 \left| \sum_{ss'nn'l} p_{ss'nn'l} p'_{ss'nn'l}\right|^\zeta \equiv \delta^2 \left| {\bf p} \cdot {\bf p'}\right|^\zeta$$ The factor of $\delta^2$ allows the setting of the scale of fitted function, relative to the error specification $\lambda$. %% Cell type:markdown id: tags: Below we fit and predict the atomic charge of H atoms %% Cell type:code id: tags:  python # First we create the descriptor object Z = 1 # compute environment around H atoms # other parameters # atom_sigma : width of Gaussian smearing around each neighbour atom # n_max : maximum radial index # l_max : maximum angular index # cutoff : environment radius in Å # Z : environments around this atomic number will be considered # species_Z : atomic numbers we wish to consider in the environment around atoms of type Z # n_species : length of species_Z list desc = quippy.Descriptor( "soap atom_sigma=0.5 n_max=3 l_max=3 cutoff=3.0 Z={:d} n_species=5 species_Z='1 6 7 8 9'".format(Z))  %% Cell type:markdown id: tags: The desc object can be used to calculate the desrciptor for all the atoms in an Atoms object: %% Cell type:code id: tags:  python p=[] c=[] for at in atAll: at.set_cutoff(desc.cutoff()) at.calc_connect() descAt = desc.calc(at)["descriptor"] # calc() returns a dictionary, in which we find the descriptor vector for pp in descAt: p.append(pp) for i in quippy.frange(at.n): if at.Z[i] == Z: c.append(at.charge[i]) # pick up the target atomic charge and store it in the same order p = np.array(p) c = np.array(c)  %% Cell type:markdown id: tags: We now have a list of charges, and correspond list of descriptors, collected into a matrix %% Cell type:code id: tags:  python c  %% Cell type:code id: tags:  python p  %% Cell type:markdown id: tags: We now fit a function using GPR on a subset of the charge data, using the SOAP descriptor and kernel %% Cell type:code id: tags:  python n_fit = 200 # use the first n_fit data points to fit lambda_fit = 0.02 # assumed noise in the data, due to "model error", i.e. the limitations of the descriptor/kernel zeta = 2 # kernel parameter delta = 0.2 # kernel parameter x_fit = p[:n_fit,:] y_fit= c[:n_fit] # compute the covariance matrix, including regularisation K_fit = delta**2*np.dot(x_fit,x_fit.T)**zeta + lambda_fit**2 * np.eye(n_fit) # solve linear system to obtain fit coefficients alpha = np.linalg.solve(K_fit,y_fit)  %% Cell type:code id: tags:  python x_test = p[n_fit:,:] y_test = c[n_fit:] k_test = delta**2*np.dot(x_test,x_fit.T)**zeta y_test_predict = np.dot(k_test,alpha)  %% Cell type:code id: tags:  python plt.plot(y_test,y_test_predict,".") plt.plot([-0.6,0.6],[-0.6,0.6]) plt.xlabel("Test data") plt.ylabel("Predicted data") print("RMS error:", np.sqrt(np.mean((y_test-y_test_predict)**2)))  %% Cell type:markdown id: tags: ### Error estimates The GPR interpretation allows the estimation of the error of the prediction, as the square root of the variance of the posterior distribution of the function. The corresponding formula is $${\rm err}(x)^2 = K(x,x) - {\bf k}(x)^T \left({\bf K} + \lambda^2 {\bf I}\right)^{-1} {\bf k}(x)\qquad\qquad [{\bf k}(x)]_i = K(x,x_i)$$ Notice how the error does not actually depend on the data values $\{y\}$, only on the locations $\{x\}$ and the kernel function. As you see below, this error estimate can on occasion be quite different from the actual error. %% Cell type:code id: tags: