"_Gábor Csányi (gc121@cam.ac.uk), James R. Kermode (j.r.kermode@warwick.ac.uk)_\n",
"\n",
"In this tutorial, we will use Gaussian process regression, GPR (or equivalently, Kernel Ridge Regression, KRR) to train and predict charges of atoms in small organic molecules. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In GPR, we are fitting a function in a moderate dimensional space, using basis functions that are typically symmetric, \"similarity\" functions, they describe how similar we expect the function value to be at two different points in the input space. In its simplest form, when we fit a function $f$ using input data $y_i$ that are function values at selected points $x_i$, we have\n",
"\n",
"$$\n",
"f(x) = \\sum_i^N \\alpha_i K(x_i, x)\n",
"$$\n",
"\n",
"where $K(x,x')$ is the positive definite similarity function, with a value $1$ when $x=x'$, and lower values for different arguments. The $\\alpha$ coefficients are the degrees of freedom in the fit, and we need to determine them from the data. The sum runs through the available data points (but in principle we can choose fewer basis functions than datapoints (this is called sparsification), or even more if we want to, but that goes beyond the scope of this tutorial. )"
]
},
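{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete illustration of a kernel with these properties (an assumption chosen for simplicity, not the kernel used later in this tutorial), consider the squared-exponential kernel with a length scale $\\sigma$:\n",
"\n",
"$$\n",
"K(x, x') = \\exp\\left(-\\frac{|x - x'|^2}{2\\sigma^2}\\right)\n",
"$$\n",
"\n",
"It is symmetric and positive definite, equals $1$ when $x=x'$, and decays as the two points move apart. With just two data points, the expansion above would read $f(x) = \\alpha_1 K(x_1, x) + \\alpha_2 K(x_2, x)$."
]
},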
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between KRR and GPR is in the interpretation, in GPR we construct a probability distribution for the unknown function values, and the $K$ is taken to be the formal covariance between function values,\n",
"where the matrix $K$ has elements $K(x_i,x_j)$. If the data was perfectly consistent, without any noise, then in principle we could get an fit with $lambda$ set to zero. In practice however, our data might have noise, or we might _choose_ to not want the interpolant (ie the fitted function f) to go through each data point _exactly_, but prefer smoother functions that just go _close_ to the datapoints. We can achieve this by choosing a nonzero (but small) $\\lambda$. In the GPR interpretation, $lambda$ should be set to the standard deviation of the noise our data has. More on this below. "
]
},
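{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the standard KRR formulation (stated here as an assumption about the convention, using the symbols defined above), the coefficients follow from a regularised linear system:\n",
"\n",
"$$\n",
"\\boldsymbol{\\alpha} = (K + \\lambda I)^{-1} \\mathbf{y}\n",
"$$\n",
"\n",
"where $\\mathbf{y}$ collects the data values $y_i$ and $I$ is the identity matrix. In the usual GPR noise model the term added to the diagonal is the noise variance, that is, the square of the noise standard deviation, so whether $\\lambda$ or $\\lambda^2$ appears depends on the convention chosen. With $\\lambda = 0$ the system reduces to $K\\boldsymbol{\\alpha} = \\mathbf{y}$, so that $f(x_j) = \\sum_i \\alpha_i K(x_i, x_j) = y_j$ and the fit passes through every data point exactly."
]
},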
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pylab inline\n",
"import numpy as np\n",
"import quippy\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"import os\n",
"from __future__ import print_function\n",
"import sys\n",
"from scripts.Visualise import ViewStructure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Database\n",
"\n",
"We load a database of small molecule geometries, and precomuputed atomic charges. This file is a subset of 2000 molecules from the GDB9 dataset. The molecules contain H, C, N, O, and F atoms. "
"# We can access the atomic numbers of any molecule\n",
"atAll[0].Z"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# similarly, the positions\n",
"atAll[1].positions "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# and the charges\n",
"atAll[1].charge"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SOAP Kernel and descriptor\n",
"\n",
"Next, we need to define a kernel. There are many ways to define what the atomic charges would be a function of, but somehow we need to describe the environment of the atom, and then construct a similarity function that can serve as the kernel function. \n",
"\n",
"In this tutorial, we are going to make the atomic charge a function of the near-environment of an atom (within a cutoff), and we will describe that environment using the SOAP descriptor and compare them using the SOAP kernel. Note right away that the quantum mechanically computed atomic charge is not fully determined by the near-environment of atoms (far-away atoms can also influence the charge, even if just to a small extent), so this is an early indication that we will be making use of the \"noise\" interpretation of the $\\lambda$ regularization parameter: we don't expect (and don't want) our fitted function to precisely go through each datapoint.\n",
"\n",
"The SOAP descriptor of an atomic environment is based on a spherical harmonic expansion of the neighbour density, and truncating this expansion at some maximum numer of radial (n_max) and angular (l_max) indices gives rise to some parameters. We also need to give the cutoff within which we consider the neighbour environment.\n",
"\n",
"Writing the descriptor vector as $p_{ss'nn'l}$, where $s$ and $s'$ are indices that run over the different atomic species in the atom's environment, $n$ and $n'$ are radial and $l$ is an angular index, the kernel between two atomic environments is\n",
"The GPR interpretation allows the estimation of the error of the prediction, as the square root of the variance of the posterior distribution of the function. The corresponding formula is\n",
"Notice how the error does not actually depend on the data values $\\{y\\}$, only on the locations $\\{x\\}$ and the kernel function. As you see below, this error estimate can on occasion be quite different from the actual error."
"Now you are ready to complete the following exercises\n",
"\n",
"1. Increase the radial and angular expansions to try and achieve a better fit. Try to go in small steps, because for large expansions, the calculation takes significantly longer. Notice how the predictions and the errors behave if you reduce the radial cutoff of the environment definition, can you explain what you observe? \n",
"\n",
"2. Fit and predict the charge of other species (you will need to create a new descriptor object).\n",
"\n",
"3. Study how the accuracy of prediction depends on the number of fitting data points.\n",
"\n",
"4. For the low-quality fit above, you see that there are two groups of H atoms that are clearly separated. Try to identify what characterises those groups? Inspect the molecules and H atoms in each group. "