parent f510c057
This diff is collapsed.
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning atomic charges\n", "\n", "_Gábor Csányi (gc121@cam.ac.uk), James R. Kermode (j.r.kermode@warwick.ac.uk)_\n", "\n", "In this tutorial, we will use Gaussian process regression, GPR (or equivalently, Kernel Ridge Regression, KRR) to train and predict charges of atoms in small organic molecules. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In GPR, we are fitting a function in a moderate dimensional space, using basis functions that are typically symmetric, \"similarity\" functions, they describe how similar we expect the function value to be at two different points in the input space. In its simplest form, when we fit a function $f$ using input data $y_i$ that are function values at selected points $x_i$, we have\n", "\n", "$$\n", "f(x) = \\sum_i^N \\alpha_i K(x_i, x)\n", "$$\n", "\n", "where $K(x,x')$ is the positive definite similarity function, with a value $1$ when $x=x'$, and lower values for different arguments. The $\\alpha$ coefficients are the degrees of freedom in the fit, and we need to determine them from the data. The sum runs through the available data points (but in principle we can choose fewer basis functions than datapoints (this is called sparsification), or even more if we want to, but that goes beyond the scope of this tutorial. )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference between KRR and GPR is in the interpretation, in GPR we construct a probability distribution for the unknown function values, and the $K$ is taken to be the formal covariance between function values,\n", "\n", "$$\n", "K(x,x') = \\rm{Cov}\\left((f(x),f(x')\\right)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regardless of the interpretation, when the ${x,y}$ data is plugged in, we get a system of linear equations:\n", "\n", "$$\n", "y_j = \\sum_i \\alpha_i K(x_i,x_j)\n", "$$\n", "\n", "The solution is given by (in vectorial form)\n", "\n", "$$\n", "{\\bf \\alpha} = \\left({\\bf K} + \\lambda^2 {\\bf I}\\right)^{-1} {\\bf y}\n", "$$\n", "\n", "where the matrix $K$ has elements $K(x_i,x_j)$. If the data was perfectly consistent, without any noise, then in principle we could get an fit with $lambda$ set to zero. In practice however, our data might have noise, or we might _choose_ to not want the interpolant (ie the fitted function f) to go through each data point _exactly_, but prefer smoother functions that just go _close_ to the datapoints. We can achieve this by choosing a nonzero (but small) $\\lambda$. In the GPR interpretation, $lambda$ should be set to the standard deviation of the noise our data has. More on this below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pylab inline\n", "import numpy as np\n", "import quippy\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import os\n", "from __future__ import print_function\n", "import sys\n", "sys.path.append(\"/data/shared/afekete/tutorial/\")\n", "from scripts.Visualise import ViewStructure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Database\n", "\n", "We load a database of small molecule geometries, and precomuputed atomic charges. This file is a subset of 2000 molecules from the GDB9 dataset. The molecules contain H, C, N, O, and F atoms. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "atAll = quippy.AtomsList(\"/data/shared/afekete/tutorial/data/data_GDB9.xyz\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "view = ViewStructure(atAll)\n", "view" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We can access the atomic numbers of any molecule\n", "atAll.Z" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# similarly, the positions\n", "atAll.positions " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# and the charges\n", "atAll.charge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SOAP Kernel and descriptor\n", "\n", "Next, we need to define a kernel. There are many ways to define what the atomic charges would be a function of, but somehow we need to describe the environment of the atom, and then construct a similarity function that can serve as the kernel function. \n", "\n", "In this tutorial, we are going to make the atomic charge a function of the near-environment of an atom (within a cutoff), and we will describe that environment using the SOAP descriptor and compare them using the SOAP kernel. Note right away that the quantum mechanically computed atomic charge is not fully determined by the near-environment of atoms (far-away atoms can also influence the charge, even if just to a small extent), so this is an early indication that we will be making use of the \"noise\" interpretation of the $\\lambda$ regularization parameter: we don't expect (and don't want) our fitted function to precisely go through each datapoint.\n", "\n", "The SOAP descriptor of an atomic environment is based on a spherical harmonic expansion of the neighbour density, and truncating this expansion at some maximum numer of radial (n_max) and angular (l_max) indices gives rise to some parameters. We also need to give the cutoff within which we consider the neighbour environment.\n", "\n", "Writing the descriptor vector as $p_{ss'nn'l}$, where $s$ and $s'$ are indices that run over the different atomic species in the atom's environment, $n$ and $n'$ are radial and $l$ is an angular index, the kernel between two atomic environments is\n", "\n", "$$\n", "K(p,p') = \\delta^2 \\left| \\sum_{ss'nn'l} p_{ss'nn'l} p'_{ss'nn'l}\\right|^\\zeta \\equiv \\delta^2 \\left| {\\bf p} \\cdot {\\bf p'}\\right|^\\zeta\n", "$$\n", "\n", "The factor of $\\delta^2$ allows the setting of the scale of fitted function, relative to the error specification $\\lambda$. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we fit and predict the atomic charge of H atoms" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# First we create the descriptor object\n", "\n", "Z = 1 # compute environment around H atoms\n", "# other parameters\n", "# atom_sigma : width of Gaussian smearing around each neighbour atom\n", "# n_max : maximum radial index\n", "# l_max : maximum angular index\n", "# cutoff : environment radius in Å\n", "# Z : environments around this atomic number will be considered\n", "# species_Z : atomic numbers we wish to consider in the environment around atoms of type Z\n", "# n_species : length of species_Z list\n", "desc = quippy.Descriptor(\n", " \"soap atom_sigma=0.5 n_max=3 l_max=3 cutoff=3.0 Z={:d} n_species=5 species_Z='1 6 7 8 9'\".format(Z))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The desc object can be used to calculate the desrciptor for all the atoms in an Atoms object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "p=[]\n", "c=[]\n", "for at in atAll:\n", " at.set_cutoff(desc.cutoff())\n", " at.calc_connect()\n", " \n", " descAt = desc.calc(at)[\"descriptor\"] # calc() returns a dictionary, in which we find the descriptor vector\n", " \n", " for pp in descAt:\n", " p.append(pp)\n", " for i in quippy.frange(at.n):\n", " if at.Z[i] == Z:\n", " c.append(at.charge[i]) # pick up the target atomic charge and store it in the same order \n", "p = np.array(p)\n", "c = np.array(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a list of charges, and correspond list of descriptors, collected into a matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now fit a function using GPR on a subset of the charge data, using the SOAP descriptor and kernel" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n_fit = 200 # use the first n_fit data points to fit\n", "lambda_fit = 0.02 # assumed noise in the data, due to \"model error\", i.e. the limitations of the descriptor/kernel\n", "zeta = 2 # kernel parameter\n", "delta = 0.2 # kernel parameter\n", "\n", "x_fit = p[:n_fit,:]\n", "y_fit= c[:n_fit]\n", "\n", "# compute the covariance matrix, including regularisation\n", "K_fit = delta**2*np.dot(x_fit,x_fit.T)**zeta + lambda_fit**2 * np.eye(n_fit)\n", "\n", "# solve linear system to obtain fit coefficients\n", "alpha = np.linalg.solve(K_fit,y_fit)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_test = p[n_fit:,:]\n", "y_test = c[n_fit:]\n", "k_test = delta**2*np.dot(x_test,x_fit.T)**zeta\n", "y_test_predict = np.dot(k_test,alpha)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.plot(y_test,y_test_predict,\".\")\n", "plt.plot([-0.6,0.6],[-0.6,0.6])\n", "plt.xlabel(\"Test data\")\n", "plt.ylabel(\"Predicted data\")\n", "print(\"RMS error:\", np.sqrt(np.mean((y_test-y_test_predict)**2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Error estimates\n", "\n", "The GPR interpretation allows the estimation of the error of the prediction, as the square root of the variance of the posterior distribution of the function. The corresponding formula is\n", "\n", "$$\n", "{\\rm err}(x)^2 = K(x,x) - {\\bf k}(x)^T \\left({\\bf K} + \\lambda^2 {\\bf I}\\right)^{-1} {\\bf k}(x)\\qquad\\qquad [{\\bf k}(x)]_i = K(x,x_i)\n", "$$\n", "\n", "Notice how the error does not actually depend on the data values $\\{y\\}$, only on the locations $\\{x\\}$ and the kernel function. As you see below, this error estimate can on occasion be quite different from the actual error." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "invK = np.linalg.inv(K_fit)\n", "y_test_error_predict = np.zeros(len(x_test))\n", "for i in range(len(x_test)):\n", " kappa = delta**2*np.dot(x_test[i].T,x_test[i])**zeta\n", " y_test_error_predict[i] = sqrt(kappa - np.dot(k_test[i,:].T, np.dot(invK, k_test[i,:])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.errorbar(y_test[::100], y_test_predict[::100], yerr=y_test_error_predict[::100], fmt=\".\")\n", "plt.plot([-0.,0.4],[-0.0,0.4])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tasks\n", "\n", "Now you are ready to complete the following exercises\n", "\n", "1. Increase the radial and angular expansions to try and achieve a better fit. Try to go in small steps, because for large expansions, the calculation takes significantly longer. Notice how the predictions and the errors behave if you reduce the radial cutoff of the environment definition, can you explain what you observe? \n", "\n", "2. Fit and predict the charge of other species (you will need to create a new descriptor object).\n", "\n", "3. Study how the accuracy of prediction depends on the number of fitting data points.\n", "\n", "4. For the low-quality fit above, you see that there are two groups of H atoms that are clearly separated. Try to identify what characterises those groups? Inspect the molecules and H atoms in each group. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This source diff could not be displayed because it is too large. You can view the blob instead.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!