In this tutorial, we'll explore the application of kernel ridge regression to the prediction of materials properties. We will begin with a largely informal, pragmatic introduction to kernel ridge regression, including a rudimentary implementation, in order to become familiar with the basic terminology and considerations. Then, we'll discuss the challenges of using these models for learning structure-property mappings, encountering the concept of a *representation* and atomic kernels. Finally, we'll re-trace the [NOMAD 2018 Kaggle Challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors)(Paper:[Sutton *et al.* (2019)](https://www.nature.com/articles/s41524-019-0239-3)), and learn how to build a model for the formation energies of these materials using kernel ridge regression. We will do this using the [`cmlkit`](https://marcel.science/cmlkit) framework, but the general approach is largely independent of software.

Feel free to watch [this introductory video](https://www.youtube.com/watch?v=H_MVlljpYHw) before getting started!

### Prerequisites

You'll get the most out of this tutorial if you're reasonably confident with using Python and `numpy`, since a lot of the explanations contain code that you'll need to be able to read. In case you get a bit lost in the code, I'd recommend taking the time to check the [Python documentation](https://docs.python.org/3/) and the [Numpy manual](https://numpy.org/doc/1.18/) for clues. If you're taking this in a supervised setting, you can of course also just ask a tutor!

It's also helpful (but largely optional) to have a working understanding of linear algebra to get the gist of the KRR section.

Additionally, the words "crystal structure", "atom", "formation energy" and related concepts should mean something to you if you want to get the most out of the applications part of the tutorial.

Some passing familiarity with error metrics (RMSE, MAE, R2) and "ML 101" in general will also prove useful, but is optional!

### Administrative notes

In the text, the ✋ emoji will mark tasks that you are expected to complete. 🤔 will mark more open questions that you could think about, and may be answered later in the tutorial. 👏 are strictly optional tasks.

The mathematical notation of this text is kept intentionally somewhat imprecise, in order to not suggest a rigour that is beyond the scope of the tutorial. In general, variables ($x$, $\alpha$, ...) can refer to either scalars or vectors equally. Upper case letters refer to matrices. Latin indices ($i, j, ...$) refer to separate entries in a dataset, for instance training examples. Often, we'll use a variable without index, like $X$, to refer to a $n_{\text{example}} \times \text{dim}$ matrix of data points, each of which is a vector, for instance $x_i$. The hat symbol ($\hat f$) generally denotes quantities that are predicted. We typically restrict our input and output spaces to $\mathbb{R}^N$.

Most sections end with a "further reading" paragraph that points to the sources, and, well, things that are good to read for more information! :)

***

Alright, let's go!

%% Cell type:markdown id: tags:

# Kernel Ridge Regression

## Introduction

Kernel ridge regression (KRR from now on) is an approach to *supervised learning*, where we're trying to infer a relationship between an input and an output space:

$\quad f: \mathcal{X} \rightarrow \mathcal{Y}$,

using $n$ training examples $\{(x_i, f(x_i) = y_i)_{i=1}^n\}$.

As a starting point, setting its mathematical foundations and its wider context within machine learning aside, we can simply view KRR as a basis set expansion. Our estimate of $f$, which we'll call $\hat f$ is simply:

Essentially, our prediction for a new point $x$ is composed from "basis functions" $k$ (the *kernel functions*) centered on each training point, with some yet-to-be-determined *regression coefficients* $\alpha$. These coefficients are determined by solving a convex optimisation problem, as we'll see in a minute.

First, however, let's simply play with this expression for a little bit. First, we choose one particular kernel function. We'll start with the "working horse of KRR", the *Gaussian kernel* (or *Radial Basis Function* (RBF) *kernel*)

$\quad k(x, z) = \exp{ \left(\frac{-|| x - z ||_2^2}{2 \sigma^2} \right)}$,

using $||\, . ||_2$ to denote $l_2$ norm (i.e. the usual vector norm). In the following, to cut down on confusion, we'll call this $\sigma$ *length scale* or `ls` for short. It is our first *hyper-parameter* (HP), a parameter of our model that we can't immediately determine from the data, we have to somehow empirically decide on it. We'll hear much more about this later.

For now, let's just go ahead and look at a toy problem: We'll generate some data for a function `f`, and then try to fit it with KRR.

%% Cell type:code id: tags:

``` python

importnumpyasnp

importmatplotlib.pyplotasplt

%matplotlibinline

plt.rcParams['figure.figsize']=(13.0,9.0)# adjust this so the plot below looks good on your screen

✋: Play with the values of `alpha` to try and fit the data. What's difficult about this? What happens if you change `ls`? What happens if you add more training points? What happens if they're no longer evenly spaced?

🤔: Can you think of a way to automate this process? How would you find the coefficients?

%% Cell type:markdown id: tags:

## Training the model

Alright, time to solve this problem once and for all, starting with some definitions: $K$ denotes a *kernel matrix*. Therefore, the kernel matrix between all points in the training set would be

$\quad K_{ij} = k(x_i, x_j)$,

where $i, j = 1, \ldots, n$. $K$ is therefore a square matrix.

To predict something, we'll need another kernel matrix

$\quad L_{ij} = k(x_i, x_j)$,

where $i = 1...n$ runs over the training set and $j$ over the data points we're trying to predict. $L$ is a rectangular matrix.

With this notation, predictions are simply $L^T \alpha$. $\alpha$ now refers to a vector of *all* coefficients.

The task of finding $\alpha$ can be phrased as an optimisation problem, where we minimise the "squared deviation" or "squared error":

$\quad \text{arg min}_{\alpha} \sum_{i=1}^n (\hat f(x_i) - y_i)^2 $, or in matrix-vector notation:

$\quad \text{arg min}_{\alpha} (K \alpha - y)^T (K \alpha - y) = \text{arg min}_{\alpha} \alpha^T K K \alpha - 2 \alpha^T K y + y^T y $

(The rewrite is straightforward, but slightly tedious. It's easiest to write all products with explicit sums and indices, and then use the fact that $K$ is symmetric, a requirement we have not formally introduced, since this is an *informal* introdution.)

To solve the optimisation problem, we simply set the gradient with respect to $\alpha$ to zero to arrive at:

$

\quad K K \alpha = K y \\\Rightarrow \alpha = K^{-1} y.

$

(This step is, once again, easy but tedious, and done by writing everything out in index notation. It also requires that the inverse exists, which follows from the mathematical requirements a kernel has to fulfil.)

👏: Do the derivation yourself. What are those requirements?

***

Alright! Let's implement it.

%% Cell type:code id: tags:

``` python

X=np.linspace(0,1,100)

n=5

x_train=np.linspace(0,1,n)

y_train=f(x_train)

ls=0.2

K=[[k(z,x,ls=ls)forzinx_train]forxinx_train]

alpha=np.linalg.inv(K)@y_train# @ is matrix multiplication

✋: Play with the value of `ls`. How does the fit change?

🤔: Can you think of a practical way of setting the `ls` hyper-parameter?

## Regularisation and noise

Now, let's make the problem a slight bit more difficult: What happens if our observations of $f$ have *noise* associated with them? For instance, if those were experimental results?

👏: Why are we using more data points? Why did I decide to change `ls`?

***

While our model had no problem fitting this noisy data, we notice an unpleasant feature of the result: It's extremely "wiggly", as opposed to the smooth function that we're trying to reproduce. The reason for this is simple: In the definition of the optimisation problem that we're solving to fit the model, we're essentially forcing $\hat f$ to pass through all the training points as much as possible. In practice, however, this is **not** what we would like to happen -- in case of noise, we don't want to fit to the noise, and in other situations we might prefer smooth interpolation to maximum accuracy at the training points.

In both cases, we should keep in mind what we're aiming for: We would like a good model for the *entire* space $\mathcal{X}$, not just the training points. While our model is great at fitting the training points, it performs very poorly for almost every other point. This behaviour, often called "overfitting", needs to be fixed.

The solution is to add a penalty to the optimisation problem, discouraging "complex models". In the case of KRR, the term is $\lambda \alpha^T K \alpha $, with a new hyper-parameter $\lambda$. (From now on, we will call this "regularisation strength" or "noise level", `nl` for short.) This term is essentially the norm of the coefficients, weighted by the kernel matrix. (A more technical derivation is beyond the scope of this tutorial, please consult the "further reading" section below.)

The optimisation problem then becomes:

$\quad \text{arg min}_{\alpha} \alpha^T K K \alpha - 2 \alpha^T K y + y^T y + \lambda \alpha^T K \alpha $

With the solution:

$

\quad \Rightarrow \alpha = (K + \lambda \mathbb{1} )^{-1} y.

$

With the addition of this *regularisation*, we've now finally arrived at kernel ridge regression!

(It's also worth mentioning that even in the absence of noise, it's usually preferably to add a small amount of regularisation for numerical reasons, and to encourage smooth interpolation.)

👏: How would you make this code more efficient, in particular the kernel function?

Finally, a word of caution: This implementation is **not** production ready. In reality, the use of a matrix inversion to train the model is both inefficient and numerically unstable, and will yield to failure in many real-world applications. For this reason, real implementations use the Cholesky decomposition instead. Rewriting the implementation above is straightforward, but beyond the scope of an introductory tutorial!

## Hyper-parameter tuning

As we've seen already, the choice of `ls` and `nl` significantly impacts the performance of the model:

plt.plot(X,train_and_predict(KRR(ls=0.5,nl=1e-1),x_train,y_train,X),color="#DC3220",linestyle="dashed",label="high ls, high nl")

plt.legend()

```

%% Cell type:markdown id: tags:

So our next task is coming up with a strategy to pick these "hyper-parameters", `nl` and `ls`. HP optimisation is in general a global non-convex optimisation problem with all the problems that entails. There is no perfect solution to this problem, and many different approaches can be taken.

Speaking from personal experience, the approach we'll explore now works reasonably well for the problems at hand, but it is always recommended to experiment. It's also of vital importance to *state what you're doing* if you're publishing any work in this domain. Since the HPs impact model performance significantly, the choice of tuning strategy is an important piece of information.

In order to design a HP tuning strategy, we need to first decide on a measure of performance. Usually, this is further decomposed into two sub-tasks:

1. Picking a quantitative error measure, an "error metric", and

2. Deciding on a set of training and test examples to compute it.

Two common error metric are (adopting the $y$ as shorthand for all test labels, and $\hat y$ as the corresponding predictions of our model):

The RMSE is an upper bound to the MAE, and more sensitive to outliers (i.e. single points with large error). Since the KRR training amounts to minimising the RMSE on the training set, the RMSE can be regarded as the "natural" error measure for KRR -- however, the choice of error metric ultimately depends on the problem being investigated.

Let's start our explorations with a very simple strategy: We'll further split our training set into a "validation" train/test split. In other words, we'll pick some parameters, train the model on the "validation training set", predict on the "validation test set", and compute the error. Then, we'll attempt to minimise this error.

In order to make this a bit more viable, let's choose a slightly more exciting toy problem, and a bit more data. For illustrative purposes, we'll use regularly spaced training points and sub-splits. In practice, the training set is often randomly drawn, as is the validation split.

(In fact, mathematical derivations for ML models often assume "identically and independently distributed" data points. This is often violated in practice, with somewhat unclear consequences. For instance, the error on the "validation test set" is only a reasonable estimate of the "real error" if it is drawn from the same distribution as the actual test data.)

Next, we'll build some infrastructure. We introduce the term "loss function" to refer to the function we're trying to minimise in HP optimisation. The loss function and be chosen to correspond to any of the error metrics above. In practice, the term "loss" is often used instead of "error" to describe the minimsation target; we'll follow this convention.

%% Cell type:code id: tags:

``` python

defrmse(true,pred):

"""Root mean squared error."""

returnnp.sqrt(np.mean((true-pred)**2))

defmae(true,pred):

"""Mean absolute error."""

returnnp.mean(np.fabs(true-pred))

defcompute_loss(ls,nl,lossf=rmse):

krr=KRR(ls,nl)

krr.train(x_valid_train,f(x_valid_train))

y_hat=krr.predict(x_valid_test)

returnlossf(f(x_valid_test),y_hat)

```

%% Cell type:markdown id: tags:

Before we try to optimise this loss, however, let's try to contextualise the numbers a little bit.

%% Cell type:code id: tags:

``` python

fromcollectionsimportdeque

guesses=deque([],10)# will store 10 guesses, deleting the oldest one when more are added

print(f"RMSE if we predict mean = {rmse(Y,mean*np.ones_like(Y)):.3f}")

print(f"MAE if we predict mean = {mae(Y,mean*np.ones_like(Y)):.3f}")

print()

foriinrange(10):

ls=2*np.random.random()

nl=0.01*np.random.random()

guess(ls,nl,quiet=True)

guess(1,1)

```

%% Cell type:markdown id: tags:

✋: Spend a few minutes in the cell below, optimising the values by hand.

%% Cell type:code id: tags:

``` python

guess(0.1,0.1)

```

%% Cell type:markdown id: tags:

First, let's try to make some educated guesses, as a baseline. For example, we could demand the exponent in the kernel to be approximately one, and simply set a small value for `nl`, i.e. assume that there is little noise in the data.

Alright, we could do worse. But we could also do better. As a rule of thumb, an RMSE of around 1% of the standard deviation is a good error to aim for. (Note: If your dataset is noisy, or there is no enough data, this might be impossible to achieve. For example, for the NOMAD2018 dataset we'll investigate later in the tutorial, models typically reach around 20-25%!)

Let's now try to approach this problem more systematically. A simple approach we could take is to simply *try all the possible combinations* (in some domain), also called a "grid search".

print(f"min loss: {compute_loss(nl=best_nl,ls=best_ls):.4f} at nl={best_nl:.4f} and ls={best_ls:.4f}.")

```

%% Cell type:markdown id: tags:

Clearly, `nl` should be very small in this case -- reasonable, as there is little noise in our observations. `ls` should be somewhere below `1.0` as well. We'll now re-do the search with a more traditional $\log_2$ grid, which allows for an exploration of a larger range of values.