%% Cell type:markdown id: tags:
# Learning atomic charges
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Learning atomic charges</font></b>
</div>
<p>
created by:
Gábor Csányi,
James R. Kermode
<br><p>
gc121@cam.ac.uk, j.r.kermode@warwick.ac.uk
<br><br>
<div>
<img style="float: right;" src="assets/soap_atomic_charges/Logo_NOMAD.png" width="250">
</div>
</div>
_Gábor Csányi (gc121@cam.ac.uk), James R. Kermode (j.r.kermode@warwick.ac.uk)_
%% Cell type:markdown id: tags:
In this tutorial, we will use Gaussian process regression (GPR), or equivalently kernel ridge regression (KRR), to train on and predict the charges of atoms in small organic molecules.
%% Cell type:markdown id: tags:
In GPR, we fit a function in a moderate-dimensional space using basis functions that are typically symmetric "similarity" functions: they describe how similar we expect the function value to be at two different points in the input space. In its simplest form, when we fit a function $f$ using input data $y_i$ that are function values at selected points $x_i$, we have
$$
f(x) = \sum_{i=1}^N \alpha_i K(x_i, x)
$$
where $K(x,x')$ is a positive definite similarity function, with value $1$ when $x=x'$ and lower values for differing arguments. The $\alpha_i$ coefficients are the degrees of freedom of the fit, and we need to determine them from the data. The sum runs over the available data points. (In principle we can choose fewer basis functions than data points, which is called sparsification, or even more if we want to, but that goes beyond the scope of this tutorial.)
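%% Cell type:markdown id: tags:
To make this concrete, here is a minimal sketch of such a similarity function (a squared-exponential kernel on scalars; the functional form and its width are illustrative assumptions, not part of the original notebook):
%% Cell type:code id: tags:
``` python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    # squared-exponential similarity: equals 1 when x == xp, decays with distance
    return np.exp(-0.5 * (x - xp)**2 / sigma**2)

print(rbf_kernel(0.0, 0.0))  # 1.0 -- identical inputs
print(rbf_kernel(0.0, 2.0))  # < 1 -- less similar inputs
```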
%% Cell type:markdown id: tags:
The difference between KRR and GPR is in the interpretation: in GPR we construct a probability distribution for the unknown function values, and $K$ is taken to be the formal covariance between function values,
$$
K(x,x') = \mathrm{Cov}\left(f(x), f(x')\right)
$$
%% Cell type:markdown id: tags:
Regardless of the interpretation, when the $\{x_i, y_i\}$ data are plugged in, we get a system of linear equations:
$$
y_j = \sum_i \alpha_i K(x_i,x_j)
$$
The solution is given by (in vector form)
$$
\boldsymbol{\alpha} = \left({\bf K} + \lambda^2 {\bf I}\right)^{-1} {\bf y}
$$
where the matrix ${\bf K}$ has elements $K(x_i,x_j)$. If the data were perfectly consistent, without any noise, then in principle we could obtain a fit with $\lambda$ set to zero. In practice, however, our data might have noise, or we might _choose_ not to require the interpolant (i.e. the fitted function $f$) to go through each data point _exactly_, preferring smoother functions that just go _close_ to the data points. We can achieve this by choosing a nonzero (but small) $\lambda$. In the GPR interpretation, $\lambda$ should be set to the standard deviation of the noise in our data. More on this below.
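%% Cell type:markdown id: tags:
Before turning to atomic charges, the whole procedure can be sketched in 1-D. All choices here (the target function, the noise level, the kernel and its width) are illustrative assumptions:
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# noisy function values y_i at selected points x_i
x_data = np.linspace(0, 2 * np.pi, 8)
y_data = np.sin(x_data) + 0.05 * rng.standard_normal(len(x_data))

def kernel_matrix(a, b, sigma=1.0):
    # pairwise squared-exponential kernel K(a_i, b_j)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / sigma**2)

lam = 0.05  # assumed noise standard deviation
K = kernel_matrix(x_data, x_data) + lam**2 * np.eye(len(x_data))
alpha = np.linalg.solve(K, y_data)  # alpha = (K + lambda^2 I)^{-1} y

x_plot = np.linspace(0, 2 * np.pi, 200)
y_plot = kernel_matrix(x_plot, x_data) @ alpha  # f(x) = sum_i alpha_i K(x_i, x)

plt.plot(x_data, y_data, "o", label="data")
plt.plot(x_plot, y_plot, label="GPR fit")
plt.legend()
```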
%% Cell type:code id: tags:
``` python
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
from ase.io import read
from quippy.descriptors import Descriptor
from soap_atomic_charges import ViewStructure
```
%% Cell type:markdown id: tags:
## Database
We load a database of small-molecule geometries and precomputed atomic charges. The file is a subset of 2000 molecules from the GDB9 dataset; the molecules contain H, C, N, O, and F atoms.
%% Cell type:code id: tags:
``` python
atAll = read('data/soap_atomic_charges/data_GDB9.xyz', index=':')
```
%% Cell type:code id: tags:
``` python
view = ViewStructure(atAll[2])
view
```
%% Cell type:code id: tags:
``` python
# We can access the atomic numbers of any molecule
atAll[0].numbers
```
%% Cell type:code id: tags:
``` python
# similarly, the positions
atAll[1].positions
```
%% Cell type:code id: tags:
``` python
# and the charges
atAll[1].get_initial_charges()
```
%% Cell type:markdown id: tags:
## SOAP Kernel and descriptor
Next, we need to define a kernel. There are many ways to choose what the atomic charges should be a function of, but somehow we need to describe the environment of each atom and then construct a similarity function that can serve as the kernel.
In this tutorial, we make the atomic charge a function of the near-environment of an atom (within a cutoff); we describe that environment using the SOAP descriptor and compare environments using the SOAP kernel. Note right away that the quantum-mechanically computed atomic charge is not fully determined by the near-environment (far-away atoms can also influence the charge, even if only slightly), so this is an early indication that we will be making use of the "noise" interpretation of the $\lambda$ regularisation parameter: we don't expect (and don't want) our fitted function to go through each data point exactly.
The SOAP descriptor of an atomic environment is based on a spherical harmonic expansion of the neighbour density; truncating this expansion at some maximum number of radial ($n_{\rm max}$) and angular ($l_{\rm max}$) indices gives rise to convergence parameters. We also need to specify the cutoff radius within which we consider the neighbour environment.
Writing the descriptor vector as $p_{ss'nn'l}$, where $s$ and $s'$ are indices that run over the different atomic species in the atom's environment, $n$ and $n'$ are radial and $l$ is an angular index, the kernel between two atomic environments is
$$
K(p,p') = \delta^2 \left| \sum_{ss'nn'l} p_{ss'nn'l} p'_{ss'nn'l}\right|^\zeta \equiv \delta^2 \left| {\bf p} \cdot {\bf p'}\right|^\zeta
$$
The factor of $\delta^2$ allows setting the scale of the fitted function relative to the error specification $\lambda$.
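%% Cell type:markdown id: tags:
As a sketch of this kernel evaluation (a hypothetical helper, not part of the notebook; two random unit vectors stand in for the descriptor vectors, which are computed properly below):
%% Cell type:code id: tags:
``` python
import numpy as np

def soap_kernel(p1, p2, delta=0.2, zeta=2):
    # dot-product kernel raised to the power zeta; delta sets the overall scale
    return delta**2 * np.abs(np.dot(p1, p2))**zeta

rng = np.random.default_rng(1)
p1 = rng.random(10); p1 /= np.linalg.norm(p1)  # stand-in for a normalised descriptor
p2 = rng.random(10); p2 /= np.linalg.norm(p2)
print(soap_kernel(p1, p1))  # delta**2 for identical environments
print(soap_kernel(p1, p2))  # smaller for dissimilar environments
```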
%% Cell type:markdown id: tags:
Below we fit and predict the atomic charge of H atoms.
%% Cell type:code id: tags:
``` python
# First we create the descriptor object
Z = 1 # compute environment around H atoms
# other parameters
# atom_sigma : width of Gaussian smearing around each neighbour atom
# n_max : maximum radial index
# l_max : maximum angular index
# cutoff : environment radius in Å
# Z : environments around this atomic number will be considered
# species_Z : atomic numbers we wish to consider in the environment around atoms of type Z
# n_species : length of species_Z list
desc = Descriptor(
"soap atom_sigma=0.5 n_max=3 l_max=3 cutoff=3.0 Z={:d} n_species=5 species_Z='1 6 7 8 9'".format(Z))
```
%% Cell type:markdown id: tags:
The ```desc``` object can be used to calculate the descriptor for all the atoms in an Atoms object:
%% Cell type:code id: tags:
``` python
Z  # confirm which atomic number we selected above
```
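%% Cell type:markdown id: tags:
As a quick check before looping over the whole database, we can inspect what `calc()` returns for a single molecule (assuming it contains at least one H atom; the `'data'` entry holds one descriptor vector per matching atom):
%% Cell type:code id: tags:
``` python
descAt = desc.calc(atAll[0])  # results for a single molecule
print(descAt.keys())
print(descAt['data'].shape)   # (number of H atoms, descriptor length)
```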
%% Cell type:code id: tags:
``` python
p, c = [], []
for at in atAll:
    descAt = desc.calc(at)  # calc() returns a dictionary, in which we find the descriptor vectors
    if 'data' in descAt:
        desc_vector = descAt['data']
        for pp in desc_vector:
            p.append(pp)
        for atom in at:
            if atom.number == Z:
                c.append(atom.charge)  # pick up the target atomic charge and store it in the same order
p = np.array(p)
c = np.array(c)
```
%% Cell type:markdown id: tags:
We now have a list of charges and a corresponding list of descriptors, collected into a matrix:
%% Cell type:code id: tags:
``` python
p.shape
```
%% Cell type:code id: tags:
``` python
c.shape
```
%% Cell type:code id: tags:
``` python
p
```
%% Cell type:markdown id: tags:
We now fit a function with GPR on a subset of the charge data, using the SOAP descriptor and kernel:
%% Cell type:code id: tags:
``` python
n_fit = 10 # use the first n_fit data points to fit
lambda_fit = 0.02 # assumed noise in the data, due to "model error", i.e. the limitations of the descriptor/kernel
zeta = 2 # kernel parameter
delta = 0.2 # kernel parameter
x_fit = p[:n_fit,:]
y_fit = c[:n_fit]
# compute the covariance matrix, including regularisation
K_fit = delta**2 * np.dot(x_fit, x_fit.T)**zeta + lambda_fit**2 * np.eye(n_fit)
# solve linear system to obtain fit coefficients
alpha = np.linalg.solve(K_fit, y_fit)
```
%% Cell type:code id: tags:
``` python
x_test = p[n_fit:,:]
y_test = c[n_fit:]
k_test = delta**2 * np.dot(x_test, x_fit.T)**zeta
y_test_predict = np.dot(k_test, alpha)
```
%% Cell type:code id: tags:
``` python
plt.plot(y_test, y_test_predict, ".")
plt.plot([-0.6,0.6], [-0.6,0.6])
plt.xlabel("Test data")
plt.ylabel("Predicted data")
print("RMS error:", np.sqrt(np.mean((y_test-y_test_predict)**2)))
```
%% Cell type:markdown id: tags:
### Error estimates
The GPR interpretation allows us to estimate the error of a prediction as the square root of the variance of the posterior distribution of the function. The corresponding formula is
$$
{\rm err}(x)^2 = K(x,x) - {\bf k}(x)^T \left({\bf K} + \lambda^2 {\bf I}\right)^{-1} {\bf k}(x)\qquad\qquad [{\bf k}(x)]_i = K(x,x_i)
$$
Notice how the error does not actually depend on the data values $\{y\}$, only on the locations $\{x\}$ and the kernel function. As you see below, this error estimate can on occasion be quite different from the actual error.
%% Cell type:code id: tags:
``` python
invK = np.linalg.inv(K_fit)
y_test_error_predict = np.zeros(len(x_test))
for i in range(len(x_test)):
    kappa = delta**2 * np.dot(x_test[i], x_test[i])**zeta  # K(x,x) for this test point
    y_test_error_predict[i] = np.sqrt(kappa - np.dot(k_test[i,:], np.dot(invK, k_test[i,:])))
```
%% Cell type:code id: tags:
``` python
plt.errorbar(y_test[::100], y_test_predict[::100], yerr=y_test_error_predict[::100], fmt=".")
plt.plot([-0.,0.4], [-0.0,0.4])
```
%% Cell type:markdown id: tags:
## Tasks
Now you are ready to complete the following exercises:
1. Increase the radial and angular expansions to try to achieve a better fit. Go in small steps, because for large expansions the calculation takes significantly longer. Notice how the predictions and the errors behave if you reduce the radial cutoff of the environment definition; can you explain what you observe?
2. Fit and predict the charge of other species (you will need to create a new descriptor object).
3. Study how the accuracy of the prediction depends on the number of fitting data points (a starting sketch follows this list).
4. For the low-quality fit above, you can see two clearly separated groups of H atoms. Try to identify what characterises these groups; inspect the molecules and the H atoms in each group.
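%% Cell type:markdown id: tags:
For Task 3, one possible starting point is sketched below (it reuses `p`, `c`, `delta`, `zeta` and `lambda_fit` from above; the fitting-set sizes and the fixed held-out test set are illustrative choices):
%% Cell type:code id: tags:
``` python
sizes = [10, 30, 100, 300, 1000]
x_test, y_test = p[1000:, :], c[1000:]  # fixed held-out test set
rms = []
for n in sizes:
    x_fit, y_fit = p[:n, :], c[:n]
    K_fit = delta**2 * np.dot(x_fit, x_fit.T)**zeta + lambda_fit**2 * np.eye(n)
    alpha = np.linalg.solve(K_fit, y_fit)
    k_test = delta**2 * np.dot(x_test, x_fit.T)**zeta
    rms.append(np.sqrt(np.mean((y_test - np.dot(k_test, alpha))**2)))
plt.loglog(sizes, rms, "o-")
plt.xlabel("number of fitting points")
plt.ylabel("RMS error on held-out charges")
```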
%% Cell type:code id: tags:
``` python
```