diff --git a/krr4mat.ipynb b/krr4mat.ipynb
index d23a7ddcd3c4cf99f56c05bfa95e847766efb44c..b3dfcbcacc228ce08fbb5b6770f239868cc691a0 100644
--- a/krr4mat.ipynb
+++ b/krr4mat.ipynb
@@ -20,1649 +20,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Hello! 👋 \n",
-    "\n",
-    "In this tutorial, we'll explore the application of kernel ridge regression to the prediction of materials properties. We will begin with a largely informal, pragmatic introduction to kernel ridge regression, including a rudimentary implementation, in order to become familiar with the basic terminology and considerations. Then, we'll discuss the challenges of using these models for learning structure-property mappings, encountering the concept of a *representation* and atomic kernels. Finally, we'll re-trace the [NOMAD 2018 Kaggle Challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors) (Paper: [Sutton *et al.* (2019)](https://www.nature.com/articles/s41524-019-0239-3)), and learn how to build a model for the formation energies of these materials using kernel ridge regression. We will do this using the [`cmlkit`](https://marcel.science/cmlkit) framework, but the general approach is largely independent of software.\n",
-    "\n",
-    "Feel free to watch [this introductory video](https://www.youtube.com/watch?v=H_MVlljpYHw) before getting started!\n",
-    "\n",
-    "### Prerequisites\n",
-    "\n",
-    "You'll get the most out of this tutorial if you're reasonably confident with using Python and `numpy`, since a lot of the explanations contain code that you'll need to be able to read. In case you get a bit lost in the code, I'd recommend taking the time to check the [Python documentation](https://docs.python.org/3/) and the [Numpy manual](https://numpy.org/doc/1.18/) for clues. If you're taking this in a supervised setting, you can of course also just ask a tutor!\n",
-    "\n",
-    "It's also helpful (but largely optional) to have a working understanding of linear algebra to get the gist of the KRR section. \n",
-    "\n",
-    "Additionally, the words \"crystal structure\", \"atom\", \"formation energy\" and related concepts should mean something to you if you want to get the most out of the applications part of the tutorial.\n",
-    "\n",
-    "Some passing familiarity with error metrics (RMSE, MAE, R2) and \"ML 101\" in general will also prove useful, but is optional!\n",
-    "\n",
-    "### Administrative notes\n",
-    "\n",
-    "In the text, the ✋ emoji will mark tasks that you are expected to complete. 🤔 will mark more open questions that you could think about, and may be answered later in the tutorial. 👏 are strictly optional tasks.\n",
-    "\n",
-    "The mathematical notation of this text is kept intentionally somewhat imprecise, in order to not suggest a rigour that is beyond the scope of the tutorial. In general, variables ($x$, $\\alpha$, ...) can refer to either scalars or vectors equally. Upper case letters refer to matrices. Latin indices ($i, j, ...$) refer to separate entries in a dataset, for instance training examples. Often, we'll use a variable without index, like $X$, to refer to a $n_{\\text{example}} \\times \\text{dim}$ matrix of data points, each of which is a vector, for instance $x_i$. The hat symbol ($\\hat f$) generally denotes quantities that are predicted. We typically restrict our input and output spaces to $\\mathbb{R}^N$.\n",
-    "\n",
-    "Most sections end with a \"further reading\" paragraph that points to the sources, and, well, things that are good to read for more information! :)\n",
-    "\n",
-    "***\n",
-    "\n",
-    "Alright, let's go! "
+    "# We are currently working to restore this notebook as soon as possible, we apologize for the inconvenience."
    ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Kernel Ridge Regression\n",
-    "\n",
-    "## Introduction\n",
-    "\n",
-    "Kernel ridge regression (KRR from now on) is an approach to *supervised learning*, where we're trying to infer a relationship between an input and an output space: \n",
-    "\n",
-    "$\\quad f: \\mathcal{X} \\rightarrow \\mathcal{Y}$, \n",
-    "\n",
-    "using $n$ training examples $\\{(x_i, f(x_i) = y_i)_{i=1}^n\\}$.\n",
-    "\n",
-    "As a starting point, setting its mathematical foundations and its wider context within machine learning aside, we can simply view KRR as a basis set expansion. Our estimate of $f$, which we'll call $\\hat f$ is simply:\n",
-    "\n",
-    "$\\quad \\hat f(x) = \\sum_{i=1}^n \\, \\alpha_i k(x_i, x)$. \n",
-    "\n",
-    "Essentially, our prediction for a new point $x$ is composed from \"basis functions\" $k$ (the *kernel functions*) centered on each training point, with some yet-to-be-determined *regression coefficients* $\\alpha$. These coefficients are determined by solving a convex optimisation problem, as we'll see in a minute.\n",
-    "\n",
-    "First, however, let's simply play with this expression for a little bit. First, we choose one particular kernel function. We'll start with the \"working horse of KRR\", the *Gaussian kernel* (or *Radial Basis Function* (RBF) *kernel*)\n",
-    "\n",
-    "$\\quad k(x, z) = \\exp{ \\left(\\frac{-|| x - z ||_2^2}{2 \\sigma^2} \\right)}$,\n",
-    "\n",
-    "using $||\\, . ||_2$ to denote $l_2$ norm (i.e. the usual vector norm). In the following, to cut down on confusion, we'll call this $\\sigma$ *length scale* or `ls` for short. It is our first *hyper-parameter* (HP), a parameter of our model that we can't immediately determine from the data, we have to somehow empirically decide on it. We'll hear much more about this later.\n",
-    "\n",
-    "For now, let's just go ahead and look at a toy problem: We'll generate some data for a function `f`, and then try to fit it with KRR."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:26.209775Z",
-     "start_time": "2020-06-05T14:19:25.919105Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "import matplotlib.pyplot as plt\n",
-    "%matplotlib inline\n",
-    "plt.rcParams['figure.figsize'] = (13.0, 9.0)  # adjust this so the plot below looks good on your screen\n",
-    "\n",
-    "def f(x):\n",
-    "    return np.exp(-x**2/0.5) + 0.5*np.exp(-(x-0.5)**2/0.1)\n",
-    "\n",
-    "X = np.linspace(0, 1, 100)\n",
-    "n = 5\n",
-    "x_train = np.linspace(0, 1, n)\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_train, f(x_train), color=\"red\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We'll stick to this color code: Red dots are training samples, a dashed red line is the function we're trying to learn.\n",
-    "\n",
-    "Alright, let's *just do it* and guess some values for `alpha` and `ls`:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:26.376913Z",
-     "start_time": "2020-06-05T14:19:26.211444Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "alpha = [0.5, 0.6, 0.6, 0.2, 0.2]\n",
-    "ls = 0.2\n",
-    "\n",
-    "def k(x, z, ls=1):\n",
-    "    return np.exp(-(x-z)**2 / (2 * ls ** 2))\n",
-    "\n",
-    "def basis_function(x, i):\n",
-    "    # function centered on i-th training point\n",
-    "    return alpha[i]*np.array([k(this_x, x_train[i], ls=ls) for this_x in x])\n",
-    "\n",
-    "def f_hat(x):\n",
-    "    # total prediction\n",
-    "    return np.sum([basis_function(x, i) for i in range(n)], axis=0)\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_train, f(x_train), color=\"red\")\n",
-    "\n",
-    "plt.plot(X, f_hat(X), color=\"black\")\n",
-    "\n",
-    "# plot the basis functions separately, so we can see them\n",
-    "for i in range(n):\n",
-    "    plt.plot(X, basis_function(X, i), alpha=0.3, color=\"black\")\n",
-    "\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "✋: Play with the values of `alpha` to try and fit the data. What's difficult about this? What happens if you change `ls`? What happens if you add more training points? What happens if they're no longer evenly spaced?\n",
-    "\n",
-    "\n",
-    "🤔: Can you think of a way to automate this process? How would you find the coefficients?"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Training the model\n",
-    "\n",
-    "Alright, time to solve this problem once and for all, starting with some definitions: $K$ denotes a *kernel matrix*. Therefore, the kernel matrix between all points in the training set would be \n",
-    "\n",
-    "$\\quad K_{ij} = k(x_i, x_j)$,\n",
-    "\n",
-    "where $i, j = 1, \\ldots, n$. $K$ is therefore a square matrix.\n",
-    "\n",
-    "To predict something, we'll need another kernel matrix\n",
-    "\n",
-    "$\\quad L_{ij} = k(x_i, x_j)$,\n",
-    "\n",
-    "where $i = 1...n$ runs over the training set and $j$ over the data points we're trying to predict. $L$ is a rectangular matrix.\n",
-    "\n",
-    "With this notation, predictions are simply $L^T \\alpha$. $\\alpha$ now refers to a vector of *all* coefficients.\n",
-    "\n",
-    "The task of finding $\\alpha$ can be phrased as an optimisation problem, where we minimise the \"squared deviation\" or \"squared error\":\n",
-    "\n",
-    "$\\quad \\text{arg min}_{\\alpha} \\sum_{i=1}^n (\\hat f(x_i) - y_i)^2 $, or in matrix-vector notation:\n",
-    "\n",
-    "$\\quad \\text{arg min}_{\\alpha} (K \\alpha - y)^T (K \\alpha - y) = \\text{arg min}_{\\alpha} \\alpha^T K K \\alpha - 2 \\alpha^T K y + y^T y $ \n",
-    "\n",
-    "(The rewrite is straightforward, but slightly tedious. It's easiest to write all products with explicit sums and indices, and then use the fact that $K$ is symmetric, a requirement we have not formally introduced, since this is an *informal* introdution.)\n",
-    "\n",
-    "To solve the optimisation problem, we simply set the gradient with respect to $\\alpha$ to zero to arrive at:\n",
-    "\n",
-    "$\n",
-    "\\quad K K \\alpha = K y \\\\ \\Rightarrow \\alpha = K^{-1} y.\n",
-    "$\n",
-    "\n",
-    "(This step is, once again, easy but tedious, and done by writing everything out in index notation. It also requires that the inverse exists, which follows from the mathematical requirements a kernel has to fulfil.)\n",
-    "\n",
-    "👏: Do the derivation yourself. What are those requirements?\n",
-    "\n",
-    "***\n",
-    "\n",
-    "Alright! Let's implement it.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:26.527418Z",
-     "start_time": "2020-06-05T14:19:26.379361Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "X = np.linspace(0, 1, 100)\n",
-    "n = 5\n",
-    "x_train = np.linspace(0, 1, n)\n",
-    "y_train = f(x_train)\n",
-    "\n",
-    "ls = 0.2\n",
-    "\n",
-    "K = [[k(z, x, ls=ls) for z in x_train] for x in x_train]\n",
-    "alpha = np.linalg.inv(K) @ y_train  # @ is matrix multiplication\n",
-    "\n",
-    "def basis_function(x, i):\n",
-    "    # function centered on i-th training point\n",
-    "    return alpha[i]*np.array([k(this_x, x_train[i], ls=ls) for this_x in x])\n",
-    "\n",
-    "def f_hat(x):\n",
-    "    # total prediction\n",
-    "    return np.sum([basis_function(x, i) for i in range(n)], axis=0)\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_train, f(x_train), color=\"red\")\n",
-    "\n",
-    "plt.plot(X, f_hat(X), color=\"black\")\n",
-    "\n",
-    "# plot the basis functions separately, so we can see them\n",
-    "for i in range(n):\n",
-    "    plt.plot(X, basis_function(X, i), alpha=0.3, color=\"black\")\n",
-    "\n",
-    "print(f\"alpha: {np.round(alpha, 2)}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "✋: Play with the value of `ls`. How does the fit change?\n",
-    "\n",
-    "🤔: Can you think of a practical way of setting the `ls` hyper-parameter?\n",
-    "\n",
-    "## Regularisation and noise\n",
-    "\n",
-    "Now, let's make the problem a slight bit more difficult: What happens if our observations of $f$ have *noise* associated with them? For instance, if those were experimental results? \n",
-    "\n",
-    "Let's take a look:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:26.736096Z",
-     "start_time": "2020-06-05T14:19:26.530123Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "def f(x):\n",
-    "    return np.exp(-x**2/0.3) + 0.5*np.exp(-(x-0.5)**2/0.1)\n",
-    "\n",
-    "X = np.linspace(0, 1, 100)\n",
-    "n = 15\n",
-    "x_train = np.linspace(0, 1, n)\n",
-    "y_train = f(x_train) + 0.1*np.random.normal(scale=0.6, size=len(x_train))  # add gaussian noise\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_train, y_train, color=\"red\")\n",
-    "\n",
-    "ls = 0.05\n",
-    "\n",
-    "K = [[k(z, x, ls=ls) for z in x_train] for x in x_train]\n",
-    "alpha = np.linalg.inv(K) @ y_train\n",
-    "\n",
-    "def basis_function(x, i):\n",
-    "    # function centered on i-th training point\n",
-    "    return alpha[i]*np.array([k(this_x, x_train[i], ls=ls) for this_x in x])\n",
-    "\n",
-    "def f_hat(x):\n",
-    "    # total prediction\n",
-    "    return np.sum([basis_function(x, i) for i in range(n)], axis=0)\n",
-    "\n",
-    "\n",
-    "plt.plot(X, f_hat(X), color=\"black\")\n",
-    "\n",
-    "# plot the basis functions separately, so we can see them\n",
-    "for i in range(n):\n",
-    "    plt.plot(X, basis_function(X, i), alpha=0.3, color=\"black\")\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "🤔: What is the problem here?\n",
-    "\n",
-    "👏: Why are we using more data points? Why did I decide to change `ls`?\n",
-    "\n",
-    "***\n",
-    "\n",
-    "While our model had no problem fitting this noisy data, we notice an unpleasant feature of the result: It's extremely \"wiggly\", as opposed to the smooth function that we're trying to reproduce. The reason for this is simple: In the definition of the optimisation problem that we're solving to fit the model, we're essentially forcing $\\hat f$ to pass through all the training points as much as possible. In practice, however, this is **not** what we would like to happen -- in case of noise, we don't want to fit to the noise, and in other situations we might prefer smooth interpolation to maximum accuracy at the training points. \n",
-    "\n",
-    "In both cases, we should keep in mind what we're aiming for: We would like a good model for the *entire* space $\\mathcal{X}$, not just the training points. While our model is great at fitting the training points, it performs very poorly for almost every other point. This behaviour, often called \"overfitting\", needs to be fixed.\n",
-    "\n",
-    "The solution is to add a penalty to the optimisation problem, discouraging \"complex models\". In the case of KRR, the term is $\\lambda \\alpha^T K \\alpha $, with a new hyper-parameter $\\lambda$. (From now on, we will call this \"regularisation strength\" or \"noise level\", `nl` for short.) This term is essentially the norm of the coefficients, weighted by the kernel matrix. (A more technical derivation is beyond the scope of this tutorial, please consult the \"further reading\" section below.)\n",
-    "\n",
-    "The optimisation problem then becomes:\n",
-    "\n",
-    "$\\quad \\text{arg min}_{\\alpha} \\alpha^T K K \\alpha - 2 \\alpha^T K y + y^T y + \\lambda \\alpha^T K \\alpha $ \n",
-    "\n",
-    "With the solution:\n",
-    "\n",
-    "$\n",
-    "\\quad \\Rightarrow \\alpha = (K + \\lambda \\mathbb{1} )^{-1} y.\n",
-    "$\n",
-    "\n",
-    "With the addition of this *regularisation*, we've now finally arrived at kernel ridge regression!\n",
-    "\n",
-    "(It's also worth mentioning that even in the absence of noise, it's usually preferably to add a small amount of regularisation for numerical reasons, and to encourage smooth interpolation.)\n",
-    "\n",
-    "So, how does it look?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:26.933325Z",
-     "start_time": "2020-06-05T14:19:26.737776Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_train, y_train, color=\"red\")\n",
-    "\n",
-    "\n",
-    "ls = 0.15\n",
-    "nl = 0.1\n",
-    "\n",
-    "K = np.array([[k(z, x, ls=ls) for z in x_train] for x in x_train])\n",
-    "alpha = np.linalg.inv(K + nl*np.identity(K.shape[0])) @ y_train\n",
-    "\n",
-    "def basis_function(x, i):\n",
-    "    # function centered on i-th training point\n",
-    "    return alpha[i]*np.array([k(this_x, x_train[i], ls=ls) for this_x in x])\n",
-    "\n",
-    "def f_hat(x):\n",
-    "    # total prediction\n",
-    "    return np.sum([basis_function(x, i) for i in range(n)], axis=0)\n",
-    "\n",
-    "\n",
-    "plt.plot(X, f_hat(X), color=\"black\")\n",
-    "\n",
-    "# plot the basis functions separately, so we can see them\n",
-    "for i in range(n):\n",
-    "    plt.plot(X, basis_function(X, i), alpha=0.3, color=\"black\")\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Looking good!\n",
-    "\n",
-    "✋: Play with the value of `ls` and `nl`. How do they interact?\n",
-    "\n",
-    "## Implementing KRR\n",
-    "\n",
-    "Before moving on, let's spend a few minutes writing a more useable implementation of KRR:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.095327Z",
-     "start_time": "2020-06-05T14:19:26.935055Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "def kernelf_gaussian(x, z=None, ls=1.0):\n",
-    "    \"\"\"Compute kernel matrix with Gaussian kernel.\n",
-    "    \n",
-    "    Args:\n",
-    "        x: iterable, row of kernel matrix\n",
-    "        z: (optional) iterable, column of kernel matrix. if None, use x\n",
-    "        ls: length scale\n",
-    "    \n",
-    "    \"\"\"\n",
-    "    if z is None:\n",
-    "        z = x\n",
-    "\n",
-    "    return np.array([[np.exp(-(this_x-this_z)**2 / (2 * ls ** 2)) for this_z in z] for this_x in x])\n",
-    "\n",
-    "class KRR():\n",
-    "    def __init__(self, ls, nl, kernelf=kernelf_gaussian):\n",
-    "        self.ls = ls\n",
-    "        self.nl = nl\n",
-    "        self.kernelf = kernelf\n",
-    "        \n",
-    "    def train(self, x, y):\n",
-    "        self.x_train = x\n",
-    "        K = self.kernelf(x, ls=self.ls)\n",
-    "        self.alpha = np.linalg.inv(K + self.nl*np.identity(K.shape[0])) @ y\n",
-    "        \n",
-    "    def predict(self, z):\n",
-    "        L = self.kernelf(self.x_train, ls=self.ls, z=z)\n",
-    "        return L.T @ self.alpha\n",
-    "\n",
-    "krr = KRR(ls=0.15, nl=0.1)\n",
-    "\n",
-    "krr.train(x_train, y_train)\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_train, y_train, color=\"red\")\n",
-    "\n",
-    "plt.plot(X, krr.predict(X), color=\"black\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "👏: How would you make this code more efficient, in particular the kernel function?\n",
-    "\n",
-    "Finally, a word of caution: This implementation is **not** production ready. In reality, the use of a matrix inversion to train the model is both inefficient and numerically unstable, and will yield to failure in many real-world applications. For this reason, real implementations use the Cholesky decomposition instead. Rewriting the implementation above is straightforward, but beyond the scope of an introductory tutorial!\n",
-    "\n",
-    "## Hyper-parameter tuning\n",
-    "\n",
-    "As we've seen already, the choice of `ls` and `nl` significantly impacts the performance of the model:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.289309Z",
-     "start_time": "2020-06-05T14:19:27.097234Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "def train_and_predict(krr, x_train, y_train, x_predict):\n",
-    "    krr.train(x_train, y_train)\n",
-    "    return krr.predict(x_predict)\n",
-    "    \n",
-    "\n",
-    "def f(x):\n",
-    "    return np.exp(-x**2/0.3) + 0.5*np.exp(-(x-0.5)**2/0.1)\n",
-    "\n",
-    "X = np.linspace(0, 1, 100)\n",
-    "n = 5\n",
-    "x_train = np.linspace(0, 1, n)\n",
-    "y_train = f(x_train)\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"black\", linestyle=\"solid\", label=\"ground truth\")\n",
-    "plt.scatter(x_train, y_train, color=\"red\")\n",
-    "\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.05, nl=1e-3), x_train, y_train, X), color=\"#005AB5\", linestyle=\"dotted\", label=\"low ls, low nl\")\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.05, nl=1e-1), x_train, y_train, X), color=\"#005AB5\", linestyle=\"dashed\", label=\"low ls, high nl\")\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.5, nl=1e-3), x_train, y_train, X), color=\"#DC3220\", linestyle=\"dotted\", label=\"high ls, low nl\")\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.5, nl=1e-1), x_train, y_train, X), color=\"#DC3220\", linestyle=\"dashed\", label=\"high ls, high nl\")\n",
-    "plt.legend()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "So our next task is coming up with a strategy to pick these \"hyper-parameters\", `nl` and `ls`. HP optimisation is in general a global non-convex optimisation problem with all the problems that entails. There is no perfect solution to this problem, and many different approaches can be taken.\n",
-    "\n",
-    "Speaking from personal experience, the approach we'll explore now works reasonably well for the problems at hand, but it is always recommended to experiment. It's also of vital importance to *state what you're doing* if you're publishing any work in this domain. Since the HPs impact model performance significantly, the choice of tuning strategy is an important piece of information.\n",
-    "\n",
-    "In order to design a HP tuning strategy, we need to first decide on a measure of performance. Usually, this is further decomposed into two sub-tasks: \n",
-    "\n",
-    "1. Picking a quantitative error measure, an \"error metric\", and \n",
-    "2. Deciding on a set of training and test examples to compute it. \n",
-    "\n",
-    "Two common error metric are (adopting the $y$ as shorthand for all test labels, and $\\hat y$ as the corresponding predictions of our model):\n",
-    "\n",
-    "The root mean squared error (RMSE):\n",
-    "\n",
-    "$\\quad \\text{rmse}(y, \\hat y) =  \\sqrt{ \\left( \\frac{1}{n_{\\text{test}}} \\sum_{i=1}^{n_{\\text{test}}} (y_i - \\hat y_i)^2  \\right)}$\n",
-    "\n",
-    "The mean absolute error (MAE):\n",
-    "\n",
-    "$\\quad \\text{mae}(y, \\hat y) = \\frac{1}{n_{\\text{test}}} \\sum_{i=1}^{n_{\\text{test}}} |y_i - \\hat y_i | $\n",
-    "\n",
-    "The RMSE is an upper bound to the MAE, and more sensitive to outliers (i.e. single points with large error). Since the KRR training amounts to minimising the RMSE on the training set, the RMSE can be regarded as the \"natural\" error measure for KRR -- however, the choice of error metric ultimately depends on the problem being investigated.\n",
-    "\n",
-    "Let's start our explorations with a very simple strategy: We'll further split our training set into a \"validation\" train/test split. In other words, we'll pick some parameters, train the model on the \"validation training set\", predict on the \"validation test set\", and compute the error. Then, we'll attempt to minimise this error.\n",
-    "\n",
-    "In order to make this a bit more viable, let's choose a slightly more exciting toy problem, and a bit more data. For illustrative purposes, we'll use regularly spaced training points and sub-splits. In practice, the training set is often randomly drawn, as is the validation split.\n",
-    "\n",
-    "(In fact, mathematical derivations for ML models often assume \"identically and independently distributed\" data points. This is often violated in practice, with somewhat unclear consequences. For instance, the error on the \"validation test set\" is only a reasonable estimate of the \"real error\" if it is drawn from the same distribution as the actual test data.)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.446548Z",
-     "start_time": "2020-06-05T14:19:27.291956Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "def f(x):\n",
-    "    return np.exp(-x**2/0.3) + 0.5*np.exp(-(x-0.5)**2/0.1) + 0.5*np.exp(-(x-2.5)**2/0.5) \n",
-    "\n",
-    "\n",
-    "def f_with_noise(x):\n",
-    "    return f(x) + np.random.normal(scale=0.05, size=len(x))\n",
-    "\n",
-    "np.random.seed(1)\n",
-    "\n",
-    "X = np.linspace(0, 5, 100)\n",
-    "Y = f(X)\n",
-    "\n",
-    "x_train = np.linspace(0, 4, 40)\n",
-    "\n",
-    "idx_train = np.arange(len(x_train))\n",
-    "idx_valid_test = idx_train[::4]\n",
-    "idx_valid_train = np.setdiff1d(idx_train, idx_valid_test)\n",
-    "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "\n",
-    "y_train = f_with_noise(x_train)\n",
-    "y_valid_test = y_train[idx_valid_test]\n",
-    "y_valid_train = y_train[idx_valid_train]\n",
-    "\n",
-    "x_valid_train = x_train[idx_valid_train]\n",
-    "x_valid_test = x_train[idx_valid_test]\n",
-    "\n",
-    "plt.scatter(x_valid_train, y_valid_train, color=\"red\", marker=\"x\")\n",
-    "plt.scatter(x_valid_test, y_valid_test, color=\"red\", marker=\"o\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Next, we'll build some infrastructure. We introduce the term \"loss function\" to refer to the function we're trying to minimise in HP optimisation. The loss function and be chosen to correspond to any of the error metrics above. In practice, the term \"loss\" is often used instead of \"error\" to describe the minimsation target; we'll follow this convention."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.455366Z",
-     "start_time": "2020-06-05T14:19:27.449252Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "def rmse(true, pred):\n",
-    "    \"\"\"Root mean squared error.\"\"\"\n",
-    "    return np.sqrt(np.mean((true - pred) ** 2))\n",
-    "\n",
-    "def mae(true, pred):\n",
-    "    \"\"\"Mean absolute error.\"\"\"\n",
-    "    return np.mean(np.fabs(true - pred))\n",
-    "\n",
-    "\n",
-    "def compute_loss(ls, nl, lossf=rmse):\n",
-    "    krr = KRR(ls, nl)\n",
-    "    krr.train(x_valid_train, f(x_valid_train))\n",
-    "    y_hat = krr.predict(x_valid_test)\n",
-    "    \n",
-    "    return lossf(f(x_valid_test), y_hat)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Before we try to optimise this loss, however, let's try to contextualise the numbers a little bit."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "from collections import deque\n",
-    "guesses = deque([], 10)  # will store 10 guesses, deleting the oldest one when more are added\n",
-    "\n",
-    "def guess(ls, nl, quiet=False):\n",
-    "    guesses.append((ls, nl, compute_loss(ls, nl, lossf=rmse), compute_loss(ls, nl, lossf=mae)))\n",
-    "    \n",
-    "    if not quiet:\n",
-    "        print(\"guesses  (last is current)\")\n",
-    "        print(f\"{'ls':<10}| {'nl':<10}| {'rmse':<10}| {'mae':<10}\")\n",
-    "        for ls, nl, l_rmse, l_mae in guesses:\n",
-    "            print(f\"{ls:<10.5f}| {nl:<10.5f}| {l_rmse:<10.5f}| {l_mae:<10.5f}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.586190Z",
-     "start_time": "2020-06-05T14:19:27.457670Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "mean = np.mean(Y)\n",
-    "std = np.std(Y)\n",
-    "\n",
-    "print(f\"std = {std:.3f}, mean = {mean:.3f}\")\n",
-    "print(f\"RMSE if we predict mean = {rmse(Y, mean * np.ones_like(Y)):.3f}\")\n",
-    "print(f\"MAE if we predict mean = {mae(Y, mean * np.ones_like(Y)):.3f}\")\n",
-    "print()\n",
-    "\n",
-    "for i in range(10):\n",
-    "    ls = 2*np.random.random()\n",
-    "    nl = 0.01*np.random.random()\n",
-    "    guess(ls, nl, quiet=True)\n",
-    "guess(1, 1)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "✋: Spend a few minutes in the cell below, optimising the values by hand."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.607191Z",
-     "start_time": "2020-06-05T14:19:27.588246Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "guess(0.1, 0.1)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "First, let's try to make some educated guesses, as a baseline. For example, we could demand the exponent in the kernel to be approximately one, and simply set a small value for `nl`, i.e. assume that there is little noise in the data."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:27.640790Z",
-     "start_time": "2020-06-05T14:19:27.609345Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "distances = [[np.abs(x-z) for z in x_train] for x in x_train]\n",
-    "guess_ls = np.median(distances) / np.sqrt(2.)\n",
-    "guess_nl = 1e-7\n",
-    "\n",
-    "guess(guess_ls, guess_nl)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Alright, we could do worse. But we could also do better. As a rule of thumb, an RMSE of around 1% of the standard deviation is a good error to aim for. (Note: If your dataset is noisy, or there is no enough data, this might be impossible to achieve. For example, for the NOMAD2018 dataset we'll investigate later in the tutorial, models typically reach around 20-25%!)\n",
-    "\n",
-    "Let's now try to approach this problem more systematically. A simple approach we could take is to simply *try all the possible combinations* (in some domain), also called a \"grid search\"."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:33.492665Z",
-     "start_time": "2020-06-05T14:19:27.642668Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "ls_grid = np.linspace(0.0001, 1.0, num=30)\n",
-    "nl_grid = np.linspace(0.0001, 1.0, num=30)\n",
-    "\n",
-    "results = np.zeros((ls_grid.shape[0], nl_grid.shape[0]))\n",
-    "\n",
-    "for i, ls in enumerate(ls_grid):\n",
-    "    for j, nl in enumerate(nl_grid):\n",
-    "        results[i, j] = compute_loss(ls=ls, nl=nl)\n",
-    "\n",
-    "ls_range = (ls_grid[0], ls_grid[-1])\n",
-    "nl_range = (nl_grid[0], nl_grid[-1])\n",
-    "\n",
-    "plt.imshow(results,\n",
-    "        extent=[nl_grid[0], nl_grid[-1], ls_grid[-1], ls_grid[0]],\n",
-    "        aspect=(nl_grid[-1] - nl_grid[0])/(ls_grid[-1] - ls_grid[0]),\n",
-    "        cmap=plt.get_cmap(\"inferno\"))\n",
-    "\n",
-    "plt.clim(0.0, std)\n",
-    "plt.colorbar()\n",
-    "\n",
-    "# find minimum, draw as big red X\n",
-    "i_min, j_min = np.unravel_index(np.argmin(results), results.shape)\n",
-    "plt.plot(nl_grid[j_min], ls_grid[i_min], color=\"red\", marker=\"x\", ms=20)\n",
-    "\n",
-    "plt.xlabel(\"nl\")\n",
-    "plt.ylabel(\"ls\")\n",
-    "\n",
-    "plt.xlim((nl_grid[0], nl_grid[-1]))\n",
-    "plt.ylim((ls_grid[-1], ls_grid[0]))\n",
-    "\n",
-    "best_nl = nl_grid[j_min]\n",
-    "best_ls = ls_grid[i_min]\n",
-    "\n",
-    "print(f\"min loss: {compute_loss(nl=best_nl, ls=best_ls):.4f} at nl={best_nl:.4f} and ls={best_ls:.4f}.\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Clearly, `nl` should be very small in this case -- reasonable, as there is little noise in our observations. `ls` should be somewhere below `1.0` as well. We'll now re-do the search with a more traditional $\\log_2$ grid, which allows for an exploration of a larger range of values."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:39.186103Z",
-     "start_time": "2020-06-05T14:19:33.494961Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "ls_grid = np.linspace(-5, 0, num=30)\n",
-    "nl_grid = np.linspace(-12, -2, num=30)\n",
-    "\n",
-    "results = np.zeros((ls_grid.shape[0], nl_grid.shape[0]))\n",
-    "\n",
-    "for i, ls in enumerate(ls_grid):\n",
-    "    for j, nl in enumerate(nl_grid):\n",
-    "        results[i, j] = compute_loss(ls=2**ls, nl=2**nl)\n",
-    "\n",
-    "ls_range = (ls_grid[0], ls_grid[-1])\n",
-    "nl_range = (nl_grid[0], nl_grid[-1])\n",
-    "\n",
-    "plt.imshow(results,\n",
-    "        extent=[nl_grid[0], nl_grid[-1], ls_grid[-1], ls_grid[0]],\n",
-    "        aspect=(nl_grid[-1] - nl_grid[0])/(ls_grid[-1] - ls_grid[0]),\n",
-    "        cmap=plt.get_cmap(\"inferno\"))\n",
-    "\n",
-    "plt.clim(0.0, std)\n",
-    "plt.colorbar()\n",
-    "\n",
-    "# find minimum, draw as big red X\n",
-    "i_min, j_min = np.unravel_index(np.argmin(results), results.shape)\n",
-    "plt.plot(nl_grid[j_min], ls_grid[i_min], color=\"red\", marker=\"x\", ms=20)\n",
-    "\n",
-    "\n",
-    "plt.xlabel(\"nl\")\n",
-    "plt.ylabel(\"ls\")\n",
-    "\n",
-    "plt.xlim((nl_grid[0], nl_grid[-1]))\n",
-    "plt.ylim((ls_grid[-1], ls_grid[0]))\n",
-    "\n",
-    "best_nl = nl_grid[j_min]\n",
-    "best_ls = ls_grid[i_min]\n",
-    "\n",
-    "print(f\"min loss: {compute_loss(nl=2**best_nl, ls=2**best_ls):.4f} at nl=2^{best_nl:.4f} and ls=2^{best_ls:.4f}.\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Alright. Let's have a look at our newly optimised model:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:39.373255Z",
-     "start_time": "2020-06-05T14:19:39.188528Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
-    "plt.scatter(x_valid_train, y_valid_train, color=\"red\", marker=\"x\")\n",
-    "plt.scatter(x_valid_test, y_valid_test, color=\"red\", marker=\"o\")\n",
-    "\n",
-    "krr = KRR(ls=2**best_ls, nl=2**best_nl)\n",
-    "krr.train(x=x_train, y=y_train)\n",
-    "\n",
-    "plt.plot(X, krr.predict(X))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "It looks like we've succeeded! \n",
-    "\n",
-    "We can also observe one salient feature of KRR: Outside of the span of the training data, the model performs very poorly. *It cannot extrapolate,* since it's essentially centered on the training data. \n",
-    "\n",
-    "***\n",
-    "\n",
-    "Using a single holdout validation set, as we've done here, is not always sufficient. For instance, if the choice is random, a somehow non-representative split might get chosen, and the performance estimate used during HP optimisation does not reflect the \"real\" performance very well. A common strategy to avoid this case is to use some form of *cross-validation* (CV), where the training set is divided into multiple validation splits, and the optimisation target is an average over the losses obtained on these splits. Often, these splits are chosen by partitioning the training set into $k$ non-overlapping subsets, and then, $k$ times, training on the combination of $k-1$ of these splits and predicting on the remaining one. This is called \"k-fold cross-validation\". Alternatively, the CV splits can simply be chosen at random. An extreme case is \"leave one out\" cross-validation, where the model is trained on all except *one* point, for each point in the training set.\n",
-    "\n",
-    "Apart from that, we can also vary the optimisation approach: While there are no guarantees on the convexity of the optimisation problem, we might be happy with a *local* minimum, so gradient descent or related approaches could be used. Additionally, a stochastic global optimisation, or a \"local\" grid search, which searches for local minima within the local environment of a given starting point, can be considered.\n",
-    "\n",
-    "In the case of KRR, there exists an alternative approach, from the domain of *Gaussian process regression* (GPR). GPR is a statistical approach to regression that essentially predicts a probability distribution over possible functions that could feasibly have generated the training data. The *mean* over this (infinite) distribution of functions, which corresponds to the \"maximum likelihood estimate\" is identical to a prediction using KRR given the same HPs and training data. Therefore, in some sense, KRR and GPR are *equivalent*. However, due to the development of GPR in a different community, the practices associated with HP tuning and interpretation of results are often quite different, and therefore, some care has to be taken to distinguish between the two.\n",
-    "\n",
-    "In GPR, HPs are often optimised by, roughly speaking, maximising the likelihood of the training data to be observed given the model. Anecdotally, this does not produce superior results compared to a holdout/cross-validation strategy, but in some sense \"makes better use of the training data\", and it can be done using gradient descent. As far as I'm aware, there are no firm theoretical grounds to privileging one approach over the other in a real-world setting.\n",
-    "\n",
-    "## Further reading and references\n",
-    "\n",
-    "The introduction to KRR in this section was largely modelled on [*Machine Learning for Quantum Mechanics in a Nutshell* by Matthias Rupp (2015)]( https://doi.org/10.1002/qua.24954 ), with additional tidbits from [*Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2017)]( https://web.stanford.edu/~hastie/ElemStatLearn/ ) and [*Gaussian Processes for Machine Learning* by Carl Edward Rasmussen and Christopher K.I. Williams (2006)]( http://www.gaussianprocess.org/gpml/ ). \n",
-    "\n",
-    "I would highly recommend the \"... in a nutshell\" article as additional reading on the background of KRR, and Chapter 2 of \"Ramussen & Williams\" as an in-depth discussion of GPR. Chapter 5 of the same book offers a good overview of the different approaches to HP tuning briefly mentioned above.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Materials, Representations and KRR\n",
-    "\n",
-    "Let us now turn our attention to our particular problem domain: Predicting the properties of materials (and molecules). In particular, in this tutorial, we're aiming to predict properties that can be directly computed from the structure (atomic positions, atomic charges, atomic numbers, and unit cell basis vectors) with a first-principles method, for instance density functional theory. We're therefore trying to develop *surrogate* models to the first-principles methods, with the aim of using the model as a less accurate, but faster, alternative in the context of screening. Since this general approach includes both materials (periodic boundary conditions) and molecules (no boundary conditions), we'll largely use the abstract *structure* to refer to either.\n",
-    "\n",
-    "Speaking slightly more formally, we wish to learn a functional structure-property relationship:\n",
-    "\n",
-    "$\n",
-    "\\quad \\mathcal{S_i} = \\{R_i, B_i, Z_i\\} \\rightarrow \\mathbb{R}\n",
-    "$\n",
-    "\n",
-    "where, for a structure $i$, $R_i, B_i, Z_i$ are the positions of each atom, the basis vectors of the unit cell, and the corresponding atomic numbers for each atom. We'll focus on scalar quantities, for example the potential energy or bandgap. Vector properties often pose additional challenges, for instance covariance.\n",
-    "\n",
-    "### Representations\n",
-    "\n",
-    "Considering what we've learned about kernel ridge regression, one immediate challenge presents itself: How exactly are we supposed to compute a kernel value between two structures? \n",
-    "\n",
-    "So far, it has been clear: We simply have two input vectors, which uniquely identify two points, and we compute a single, scalar \"kernel distance\" between them. But now, we have three matrices, not even of the same dimension -- and also not unique, as any permutation of the order in which we write down the atoms, or a rotation and/or translation of the positions, or choice of unit cell, leaves the properties we're investigating here invariant!\n",
-    "\n",
-    "There are essentially two, not quite separable, approaches to dealing with this problem: Either by defining a kernel directly on the space of structures, or by introducing an intermediate step that maps a structure first to a vector with the required invariances, and then using that vector as input in an \"off-the-shelf\" KRR approach. Clearly, the second approach can be regarded as a subset of the first, with the advantage of separating the \"physics specific\" parts from the ML approach.\n",
-    "\n",
-    "Here, we will focus on this approach, and therefore introduce the notion of a *representation*.\n",
-    "\n",
-    "We'll need to distinguish between *global* representations:\n",
-    "\n",
-    "$\n",
-    "\\quad \\mathcal{S_i} \\rightarrow \\mathbb{R}^d \\, ,\n",
-    "$\n",
-    "\n",
-    "which represent the entire structure at once, and *local* (or *atomic*) representations\n",
-    "\n",
-    "$\n",
-    "\\quad \\mathcal{S_i} \\rightarrow \\mathbb{R}^{n_{\\text{atoms}} \\times d} \\, ,\n",
-    "$\n",
-    "\n",
-    "which represents each atom separately. We won't dwell too much on how to implement such representations, but should keep in mind that they should fulfil some basic requirements: They must be *invariant* to anything that doesn't change the target property (rotations, translations, permutation), but *variant* otherwise, to avoid two distinct structures sharing the same representation. A large number of different representations exist.\n",
-    "\n",
-    "🤔: Spend a minute thinking about how to build a representation. Why can't you simply concatenate all the positions, basis vectors, charges into one big vector?\n",
-    "\n",
-    "For this tutorial, we'll use the Smooth Overlap of Atomic Positions (SOAP) representation by [Bartók, Kondor, Csányi (2013)](https://doi.org/10.1103/PhysRevB.87.184115). In a nutshell, this *atomic* representation does the following: For every atom within a certain distance (the `cutoff`) of the central atom for which we're computing SOAP, a Gaussian is placed on the location. The resulting density is then expanded in spherical harmonics and a radial basis, and invariants (the \"power spectrum\") of the resulting coefficients form the representation.\n",
-    "\n",
-    "### Kernels\n",
-    "\n",
-    "Before we can get started, we need to tackle one last challenge: How do we compute kernel values between structures for a local representation, which gives us one value per *atom*, not structure? This question, as always, has many answers, but here we'll choose a simple way to treat the problem. Given a kernel function $k$, which acts between two vectors, we define a \"sum kernel\" $k'$:\n",
-    "\n",
-    "Let $x$ and $y$ be two structures with $n$ and $m$ atoms each, represented by a local representation, so $x=\\{x_1, x_2, ... x_n\\}$ and $y=\\{y_1, y_2, ..., y_m\\}$. Then:\n",
-    "\n",
-    "$\n",
-    "\\quad k'(x, y) = \\sum_{i, j} k(x_i, y_j) \\, .\n",
-    "$\n",
-    "\n",
-    "In other words, we simply pretend $x$ and $y$ are entire sets of inputs, compute the full kernel matrix between them, and then take the sum of that \"atom-atom\" kernel matrix. From a theoretical standpoint, we can motivate this choice by assuming that our target quantity, if it is extensive, can be decomposed into (unknown) atomic contributions, and then re-deriving the equations defining KRR under that assumption. (See the \"further reading\" section for details.)\n",
-    "\n",
-    "🤔: How would you change this approach to model *intensive* properties?\n",
-    "\n",
-    "### Further reading\n",
-    "\n",
-    "Some representations are:\n",
-    "\n",
-    "- Many-Body Tensor Representation (MBTR) by [Huo, Rupp (2017)](https://arxiv.org/abs/1704.06439)\n",
-    "- Smooth Overlap of Atomic Positions (SOAP) representaton by [Bartók, Kondor, Csányi (2013)](https://doi.org/10.1103/PhysRevB.87.184115)\n",
-    "- Symmetry Functions (SF) representation by [Behler (2011)](https://doi.org/10.1063/1.3553717)\n",
-    "\n",
-    "Multiple implementations of each of these exist, for instance, MBTR is implemented in [`qmmlpack`](https://gitlab.com/qmml/qmmlpack), and SOAP in [`QUIP`](https://github.com/libAtoms/QUIP). Additionally, the [`dscribe`](https://singroup.github.io/dscribe/) package offers alternative, unified implementations of many representations. \n",
-    "\n",
-    "For an overview and comparison of these representations, you might be interested in [Langer *et al.* (2020)](https://arxiv.org/abs/2003.12081). The references of that pre-print will also point you to many more things to read.\n",
-    "\n",
-    "Here is a small selection of additional publications to have a look at:\n",
-    "\n",
-    "- FCHL representation and kernel [Faber *et al.* (2018)](https://doi.org/10.1063/1.5020710)\n",
-    "- Comparison of models for energy prediction in binary alloys [Nyshadham *et al.* (2019)](https://www.nature.com/articles/s41524-019-0189-9)\n",
-    "- (Slightly older) comparison of representations and regression methods for molecular properties [Faber *et al.* (2017)](https://doi.org/10.1021/acs.jctc.7b00577)\n",
-    "- `dscribe` paper [Himanen *et al.* (2020)](https://doi.org/10.1016/j.cpc.2019.106949)\n",
-    "- Additional information about \"atomic\" kernels [De *et al.* (2016)](https://doi.org/10.1039/C6CP00415F)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## NOMAD2018 Kaggle Challenge\n",
-    "\n",
-    "At this point, we're ready to tackle a *real* regression problem. In particular, we'll recapitulate a portion of  the [\"Nomad2018 Predicting Transparent Conductors\" Kaggle challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors). The challenge was to create a regression model to predict formation energies and bandgaps for Al<sub>x</sub>Ga<sub>y</sub>In<sub>z</sub> sesquioxides, which are believed to be candidate materials for transparent conductors.  The training set consisted of 2400 such materials, with Vegard's law \"projected geometries\" and formation and bandgap energies for the corresponding relaxed structures, and the test set was 600 additional materials unseen during training. (The \"projected geometries\" are estimates of the relaxed crystal structures obtained by re-scaling the lattice parameters based on the composition.)\n",
-    "\n",
-    "For the full results of the challenge, and an in-depth discussion of the winning methods, please see [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3).\n",
-    "\n",
-    "In this tutorial, we'll focus on the formation energy, which appears to be a less challening learning problem. We will also do most of our work with significantly smaller subsets, in order to not have to wait too long for calculations to finish. \n",
-    "\n",
-    "This dataset presents a rather *challenging* problem. The structures are quite large and periodic, and therefore computationally expensive to deal with, so HP optimisation is a resource-intensive task. The fact that we don't deal with relaxed structures, but with \"projected\" geometries additionally complicates the regression task, since the structure-property mapping implicitly now also contains relaxation.\n",
-    "\n",
-    "### Technical Interlude\n",
-    "\n",
-    "Before we go ahead and get our hands dirty, let me briefly interject with a few words on the technical infrastructure we'll be using for the rest of this tutorial. Our hand-crafted KRR implementation won't quite cut it, and we'll also need state-of-the-art representations to aid us.\n",
-    "\n",
-    "For KRR, which is not particularly difficult to implement, we in principle have many options, for example [`scikit-learn`](https://scikit-learn.org/stable/) or [`GPyTorch`](https://github.com/cornellius-gp/gpytorch), which both implement general-purpose GPR, with many kernels and variations. However, for the present purposes, a more focused tool is better: We will use [`qmmlpack`](https://gitlab.com/qmml/qmmlpack), which implements both KRR and the MBTR represention, and is specialised for regression models in the \"materials and molecules\" domain.\n",
-    "\n",
-    "For the representations, there are also many options, depending on which we choose. In order to keep things simple, we will use [`dscribe`](https://singroup.github.io/dscribe/), which implements a wide range of representations in one package.\n",
-    "\n",
-    "In order to tie these two components of our models together we'll be using [`cmlkit`](https://marcel.science/cmlkit), a modular package designed for constructing \"representation+KRR\"-style models. It wraps the underlying implementations in a unified and consistent interface, so we don't have to worry too much about the technical details.\n",
-    "\n",
-    "However, it is important to keep in mind that fundamentally, nothing we'll do in the rest of the tutorial is particularly difficult to achieve with any KRR/GPR implementation, plain `dscribe` and a bit of time, so please don't feel too intimidated to select tools that suit your working style. `cmlkit` is designed to be quite minimal, expects you to build your own infrastructure on top of it, and is experimental software, so it's understandable if it's not your cup of tea.\n",
-    "\n",
-    "To learn more about `cmlkit`, please consult the [`cmlkit` tutorial]( https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/cmlkit.ipynb ), also available as part of the [NOMAD Analytics Toolkit](https://nomad-lab.eu/AItutorials).\n",
-    "\n",
-    "### Inspecting the dataset\n",
-    "\n",
-    "<p style=\"\"><span style=\"font-style: italic;\">\"Time looking at your data is always well spent.\"</span><br /><br />Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal:<br /> Data Mining. Practical Machine Learning Tools and Techniques,<br /> 4th edition,  Morgan Kaufmann, 2017.</p>\n",
-    "\n",
-    "Let's start by importing the dataset. (Which can be obtained in a generally readable format from [qmml.org](https://qmml.org).)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:40.674247Z",
-     "start_time": "2020-06-05T14:19:39.375092Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "import cmlkit\n",
-    "import seaborn as sbn\n",
-    "sbn.set_context('notebook')\n",
-    "\n",
-    "cmlkit.caches.location = \"data/krr4mat/cache\" # usually, this would be set using an environment variable\n",
-    "\n",
-    "# load the dataset as a cmlkit Dataset\n",
-    "train = cmlkit.load_dataset(\"data/krr4mat/nmd18_train\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Next, let's check the basic composition of our dataset. First: how big are the structures we're dealing with?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:40.841632Z",
-     "start_time": "2020-06-05T14:19:40.676081Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "# histogram of number of atoms per unit cell\n",
-    "sbn.distplot(train.info[\"atoms_by_system\"], norm_hist=False, kde=False, bins=[5, 15, 25,  35, 45, 65, 75, 85])"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "🤔: Do you see a problem here?\n",
-    "\n",
-    "Clearly, the distribution of `n_atoms` is skewed towards large unit cells, and there are a wide range of unit cell sizes present. This might lead to over-representation of large structures and it may also pose a difficulty for models that are not intrinsically extensive, as they have to deal with a large variation in unit cell size.\n",
-    "\n",
-    "***\n",
-    "\n",
-    "Next question: How are the different elements distributed? \n",
-    "\n",
-    "First: In the 2400 structures, which elements do occur in which?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:40.847558Z",
-     "start_time": "2020-06-05T14:19:40.843304Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "for count, element in zip(*np.unique(train.info[\"systems_per_element\"], return_index=True)):\n",
-    "    if count != 0:\n",
-    "        print(f\"#structures with Z={element}: {count}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Alright -- that looks evenly distributed. (Note that all structures, being oxides, have oxygen in them.)\n",
-    "\n",
-    "Let's also check the composition:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:41.364286Z",
-     "start_time": "2020-06-05T14:19:40.849678Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(7, 14))\n",
-    "compositions = np.array([np.array([(z == 13).sum(), (z == 31).sum(), (z == 49).sum()])/len(z) for z in train.z])\n",
-    "sbn.distplot(compositions[:, 0], ax=axs[0])\n",
-    "sbn.distplot(compositions[:, 1], ax=axs[1])\n",
-    "sbn.distplot(compositions[:, 2], ax=axs[2])\n",
-    "\n",
-    "axs[0].set_xlabel(\"Fraction of Al\")\n",
-    "axs[1].set_xlabel(\"Fraction of Ga\")\n",
-    "axs[2].set_xlabel(\"Fraction of In\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "These distributions also look reasonably similar.\n",
-    "\n",
-    "To conclude investigating the structures, let's check the spacegroups we're dealing with:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:41.371035Z",
-     "start_time": "2020-06-05T14:19:41.366357Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "for sg, count in zip(*np.unique(train.p[\"sg\"], return_counts=True)):\n",
-    "    if count != 0:\n",
-    "        print(f\"#structures with spacegroup={sg:>3}: {count}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Alright, they're also reasonably similarly distributed, with a slightly surplus of 206 and 33.\n",
-    "\n",
-    "Finally, let's briefly check on the quantity we're trying to predict: The formation energy *per substitution site*, which in this case means \"the formation energy per atom in the structure that is not Oxygen\". We use this quantity because it was used in the Kaggle challenge, more typical choices would be the intensive \"formation energy per atom\" or the extensive \"formation energy per unit cell\". `cmlkit` allows easy conversion between these using the `per` argument in the function below.\n",
-    "\n",
-    "The formation energy is given in eV here.\n",
-    "\n",
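-    "As a quick sketch of how these quantities relate (not `cmlkit` internals): the extensive per-cell energy is simply the intensive per-atom energy multiplied by the number of atoms in the unit cell. Assuming again that `train.info[\"atoms_by_system\"]` holds the atom counts:\n",
-    "\n",
-    "```python\n",
-    "fe_per_atom = train.pp(\"fe\", per=\"atom\")\n",
-    "n_atoms = train.info[\"atoms_by_system\"]      # atoms per unit cell (assumed field)\n",
-    "fe_per_cell = fe_per_atom * n_atoms          # should match train.pp(\"fe\", per=\"cell\")\n",
-    "```\n",
-    "\n",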
-    "🤔: Why would we be interested in converting between `per=\"atom\"` and `per=\"cell\"`?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:41.573644Z",
-     "start_time": "2020-06-05T14:19:41.373195Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "sbn.distplot(train.pp(\"fe\", per=\"non_O\"))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "sbn.distplot(train.pp(\"fe\", per=\"cell\"))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "sbn.distplot(train.pp(\"fe\", per=\"atom\"))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The energy distribution is not quite normal, and it is shifted such that no negative energies occur. Interestingly, there is a small number of high-energy structures. Converting to energy per unit cell slightly shifts the distribution, while there appears to be little difference between energy per atom and energy per non-O atom.\n",
-    "\n",
-    "Alright! Now we know what we're dealing with. What do we do now?\n",
-    "\n",
-    "### Our Plan\n",
-    "\n",
-    "1. Define what we're trying to achieve\n",
-    "2. Decide on a general class of models\n",
-    "    1. Choose representation\n",
-    "    2. Choose regression method\n",
-    "    3. Choose ranges/allowed choice of HPs\n",
-    "3. Decide on a HP optimisation scheme\n",
-    "    1. Choose a search space\n",
-    "    2. Choose a target to optimise\n",
-    "        1. Choose a loss function\n",
-    "        2. Choose a train/test split and/or cross-validation\n",
-    "    3. Choose an optimisation method\n",
-    "4. Run this\n",
-    "5. Evaluate the resulting model or models according to what we decided to achieve in step 1\n",
-    "\n",
-    "***\n",
-    "\n",
-    "While the time and computational cost constraints of this tutorial keep us from exploring most of these steps in detail, it's important to briefly consider all of them, especially as practice for future projects beyond this tutorial.\n",
-    "\n",
-    "#### Goal\n",
-    "\n",
-    "First, what are we trying to do? Let's share the (partial) goal of the Kaggle challenge and declare:\n",
-    "\n",
-    "*We will create a model that, when trained on the \"training\" portion of the Kaggle challenge dataset, predicts the formation energy per substitution site on the \"test\" portion of the same dataset with an MAE in the range of 13-15 meV/cation, which is the range of results reported for the original challenge in Sutton et al. (2019).*\n",
-    "\n",
-    "#### Model Class\n",
-    "\n",
-    "For our model class, I'll choose for us: We'll be using the SOAP representation with KRR. For KRR, we'll restrict ourselves to the plain Gaussian kernel (with the \"atomic\" modification), and the following ranges of parameters:\n",
-    "\n",
-    "```\n",
-    "ls in [2**-10, 2**2]\n",
-    "nl in [2**-20, 2**0]\n",
-    "```\n",
-    "\n",
-    "These ranges are chosen based on previous empirical experience; if more computing power is available, they can also be chosen more broadly, for instance from `[2**-20, 2**+20]`.\n",
-    "\n",
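-    "For orientation, such a base-2 log grid is easy to write down in plain numpy; the `hp_loggrid` entries in the search space defined further below presumably encode grids of this form (start exponent, stop exponent, number of points):\n",
-    "\n",
-    "```python\n",
-    "import numpy as np\n",
-    "\n",
-    "# base-2 log grid: 25 points from 2**-10 to 2**2 (the `ls` range above)\n",
-    "ls_grid = np.logspace(-10, 2, num=25, base=2.0)\n",
-    "print(ls_grid[0], ls_grid[-1])  # 2**-10 and 2**2\n",
-    "```\n",
-    "\n",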
-    "For the SOAP representation, we will only optimise one parameter:\n",
-    "\n",
-    "```\n",
-    "cutoff = 5 # (Angstrom), size of the environment around each atom to take into account\n",
-    "sigma in [2**-4, 2**1] # the broadening of the environment\n",
-    "n_max = 6 # radial basis set expansion size\n",
-    "l_max = 6 # angular basis set expansion size\n",
-    "```\n",
-    "\n",
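-    "As a refresher, here is a minimal numpy sketch of such an \"atomic\" Gaussian kernel between two structures, assuming the common convention $k(x, x') = \\exp(-\\|x - x'\\|^2 / 2\\ell^2)$ with length scale $\\ell$ (here `ls`), and representing each structure as an $n_{\\text{atoms}} \\times \\text{dim}$ array of per-atom features. This is a conceptual sketch, not `cmlkit`'s implementation:\n",
-    "\n",
-    "```python\n",
-    "import numpy as np\n",
-    "\n",
-    "def atomic_gaussian_kernel(A, B, ls):\n",
-    "    \"\"\"Sum of Gaussian kernels over all pairs of atomic environments (sketch).\n",
-    "\n",
-    "    A, B: arrays of shape (n_atoms_A, dim) and (n_atoms_B, dim).\n",
-    "    ls:   length scale of the Gaussian kernel.\n",
-    "    \"\"\"\n",
-    "    # squared distances between all pairs of atomic feature vectors\n",
-    "    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)\n",
-    "    return np.exp(-d2 / (2 * ls ** 2)).sum()\n",
-    "```\n",
-    "\n",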
-    "#### HP Optimisation\n",
-    "\n",
-    "Since we're computationally constrained in this tutorial, we can't use the full training set for our HP optimisation scheme. We'll therefore use a single train/test split with `n_train=800` and `n_test=200` (the \"usual\" 80%/20% split). This is a rather basic choice, as we've discussed in the KRR section, but we'll see that it is sufficient for our purposes here. We'll use the `rmse` as loss function.\n",
-    "\n",
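-    "As a reminder, the two error metrics we care about here are straightforward to write down in plain numpy (standard definitions, for reference):\n",
-    "\n",
-    "```python\n",
-    "import numpy as np\n",
-    "\n",
-    "def rmse(true, pred):\n",
-    "    \"\"\"Root mean squared error -- the loss we optimise during the HP search.\"\"\"\n",
-    "    return np.sqrt(np.mean((true - pred) ** 2))\n",
-    "\n",
-    "def mae(true, pred):\n",
-    "    \"\"\"Mean absolute error -- the metric in which we stated our goal above.\"\"\"\n",
-    "    return np.mean(np.abs(true - pred))\n",
-    "```\n",
-    "\n",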
-    "🤔: What would you do in this situation?\n",
-    "\n",
-    "I've already pre-rolled our splits, using the following script:\n",
-    "\n",
-    "```python\n",
-    "import numpy as np\n",
-    "\n",
-    "np.random.seed(42)\n",
-    "import cmlkit\n",
-    "data = cmlkit.load_dataset(\"data/krr4mat/nmd18u\")  # obtainable from qmml.org\n",
-    "rest, train, test = cmlkit.utility.threeway_split(2400, 800, 200)\n",
-    "train = cmlkit.dataset.Subset.from_dataset(data, idx=train, name=\"nmd18_hpo_train\")\n",
-    "test = cmlkit.dataset.Subset.from_dataset(data, idx=test, name=\"nmd18_hpo_test\")\n",
-    "train.save()\n",
-    "test.save()\n",
-    "```\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now, we'll have to decide on an optimisation method. We have three free parameters, and since we'd like to both cover a large range of possible values and not do too much computation, we'll refrain from using a grid search. Instead, we'll use a stochastic search that samples from all possible values and focuses the sampling on promising regions of parameter space.\n",
-    "\n",
-    "We'll set up this HP optimisation task using `cmlkit`. You don't have to understand every line in this in detail, as this is somewhat beyond the focus of this tutorial. If you'd like to learn more, take a look at the [`cmlkit` tutorial]( https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/cmlkit.ipynb ). That tutorial will also explain, in detail, how the stochastic optimiser works conceptually.\n",
-    "\n",
-    "In simple terms, below, we're:\n",
-    "\n",
-    "- Translating our search space into a machine-readable form\n",
-    "- Defining a way to compute the loss using an \"evaluator\"\n",
-    "- Combining these components into a run"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:49.213145Z",
-     "start_time": "2020-06-05T14:19:48.858443Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "space = {\n",
-    "    \"model\": {\n",
-    "        \"per\": \"cell\",\n",
-    "        \"regression\": {\n",
-    "            \"krr\": {\n",
-    "                \"kernel\": {\n",
-    "                    \"kernel_atomic\": {\n",
-    "                        \"norm\": False,\n",
-    "                        \"kernelf\": {\n",
-    "                            \"gaussian\": {\n",
-    "                                \"ls\": [  # base-2 log grid over the [2**-10, 2**2] range stated above\n",
-    "                                    \"hp_loggrid\",\n",
-    "                                    \"ls_start\",\n",
-    "                                    -10,\n",
-    "                                    2,\n",
-    "                                    25,\n",
-    "                                ]\n",
-    "                            }\n",
-    "                        },\n",
-    "                    }\n",
-    "                },\n",
-    "                \"nl\": [\"hp_loggrid\", \"nl_start\", -20, -10, 21],\n",
-    "            }\n",
-    "        },\n",
-    "        \"representation\": {\n",
-    "            \"ds_soap\": {\n",
-    "                \"elems\": [8, 13, 31, 49],\n",
-    "                \"n_max\": 6,\n",
-    "                \"l_max\": 6,\n",
-    "                \"cutoff\": 5,\n",
-    "                \"sigma\": [\"hp_loggrid\", \"sigma\", -4, 1, 11],  # base-2 log grid over the [2**-4, 2**1] range stated above\n",
-    "                \"rbf\": \"gto\",\n",
-    "            }\n",
-    "        },\n",
-    "    }\n",
-    "}\n",
-    "search = cmlkit.tune.Hyperopt(space, method=\"tpe\")  # stochastic search (Tree-structured Parzen Estimator via hyperopt)\n",
-    "\n",
-    "evaluator = cmlkit.tune.TuneEvaluatorHoldout(\n",
-    "    train=\"data/krr4mat/nmd18_hpo_train\", test=\"data/krr4mat/nmd18_hpo_test\", target=\"fe\", lossf=\"rmse\"\n",
-    ")\n",
-    "\n",
-    "run = cmlkit.tune.Run(\n",
-    "    search=search,\n",
-    "    evaluator=evaluator,\n",
-    "    stop={\"stop_max\": {\"count\": 200}},\n",
-    "    context={\"max_workers\": 40},\n",
-    "    name=\"nmd18_hpo\",\n",
-    "    caught_exceptions=[\"QMMLException\"]\n",
-    ")\n",
-    "# run.prepare()\n",
-    "# run.run()\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "At this point, I have to admit that we won't actually let this optimisation run, for two reasons. One is purely technical: the `cmlkit` optimisation infrastructure is not built for interactive use; it expects to be run from the command line. The other is performance-related: as we've discussed in the beginning, the `nmd18` dataset is quite challenging, and this optimisation would take awkwardly long to run. (The above takes about an hour on 40 cores.)\n",
-    "\n",
-    "We'll therefore simply jump ahead and load the result of executing the above."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:53.302672Z",
-     "start_time": "2020-06-05T14:19:50.154876Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "run = cmlkit.tune.Run.checkout(\"data/krr4mat/run_nmd18_hpo\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:19:53.310413Z",
-     "start_time": "2020-06-05T14:19:53.304757Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "model = run.state.evals.top_suggestions(n=1)[0]  # get the model with the lowest error on the validation set\n",
-    "model"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "👏: Find the [`cmlkit` tutorial](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/cmlkit.ipynb) on the NOMAD Analytics Toolkit and try to figure out what precisely the `space` defined above means and how the optimisation actually worked under the hood.\n",
-    "\n",
-    "🤔: Can you think of possible improvements to our HP optimisation strategy here? How would you evaluate your better strategy against this one? Can you think of any problem using, for instance, the final test set loss as comparison?\n",
-    "\n",
-    "### Evaluating the results\n",
-    "\n",
-    "Let's now see how this model performs on the original splits! We'll load the data, instantiate the model from the dictionary description we've obtained above, and then train it.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.939786Z",
-     "start_time": "2020-06-05T14:19:53.312984Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "train = cmlkit.load_dataset(\"data/krr4mat/nmd18_train\")\n",
-    "test = cmlkit.load_dataset(\"data/krr4mat/nmd18_test\")\n",
-    "\n",
-    "model = cmlkit.from_config(model, context={\"cache\": \"disk\"})  # we're using cached representations/kernel matrices\n",
-    "model.train(train, target=\"fe\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.943951Z",
-     "start_time": "2020-06-05T14:19:50.943Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "pred = model.predict(test, per=\"non_O\")\n",
-    "true = test.pp(\"fe\", per=\"non_O\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.945308Z",
-     "start_time": "2020-06-05T14:19:51.159Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "loss = cmlkit.evaluation.get_loss([\"rmse\", \"mae\", \"r2\", \"rmsle\"])\n",
-    "losses = loss(true, pred)  # compute all losses once\n",
-    "for k, v in losses.items():\n",
-    "    print(f\"{k}: {v:.5f}\")\n",
-    "\n",
-    "print(\"\")\n",
-    "print(f\"SOAP + KRR: RMSLE={losses['rmsle']:.3f}, MAE={1000*losses['mae']:.0f} meV/cation\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "From [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3) table 1, we have the following results for comparison:\n",
-    "\n",
-    "- *n*-gram + KRR: RMSLE=0.020, MAE=14 meV/cation\n",
-    "- c/BOP + LGBM: RMSLE=0.022, MAE=15 meV/cation\n",
-    "- SOAP + NN: RMSLE=0.021, MAE=13 meV/cation\n",
-    "\n",
-    "Congratulations! 🎉 We've just won (the formation energy portion of) the Kaggle challenge! Nice.\n",
-    "\n",
-    "Let's have a look at the results we've obtained."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.946465Z",
-     "start_time": "2020-06-05T14:19:51.944Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "lim_min, lim_max = 0, 1000*true.max()\n",
-    "\n",
-    "sbn.jointplot(x=1000*true, y=1000*pred, kind=\"reg\", xlim=(lim_min, lim_max), ylim=(lim_min, lim_max))\n",
-    "plt.xlabel(\"DFT FE in meV/cation\")\n",
-    "plt.ylabel(\"ML FE in meV/cation\")\n",
-    "plt.plot([lim_min, lim_max], [lim_min, lim_max], color=\"red\", linestyle=\"dotted\")\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This is looking pretty reasonable, but there appear to be some outliers. Let's check them out:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.947587Z",
-     "start_time": "2020-06-05T14:19:52.665Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "outliers = np.flatnonzero(np.abs(true - pred) > 0.1)  # indices of points with >100 meV absolute error\n",
-    "for i in outliers:\n",
-    "    print(f\"N_atoms: {test.info['atoms_by_system'][i]} / sg: {test.p['sg'][i]}\")\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.948738Z",
-     "start_time": "2020-06-05T14:19:53.013Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "mask = np.abs(true-pred) <= 0.1\n",
-    "sbn.jointplot(x=1000*true[mask], y=1000*pred[mask], kind=\"reg\", xlim=(lim_min, lim_max), ylim=(lim_min, lim_max))\n",
-    "plt.xlabel(\"DFT FE in meV/cation\")\n",
-    "plt.ylabel(\"ML FE in meV/cation\")\n",
-    "plt.plot([lim_min, lim_max], [lim_min, lim_max], color=\"red\", linestyle=\"dotted\")\n",
-    "\n",
-    "\n",
-    "for k, v in loss(true[mask], pred[mask]).items():\n",
-    "    print(f\"{k}: {v:.5f}\")\n",
-    "    "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Without these outliers, things look a lot more orderly -- but we don't yet have a clear reason why they occur, so we can't just exclude them. One hint might be the fact that `N_atoms=10` is clearly under-represented in the training set. Let's see what happens if we exclude these points:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.949904Z",
-     "start_time": "2020-06-05T14:19:53.837Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "mask = test.info[\"atoms_by_system\"] > 10\n",
-    "\n",
-    "sbn.jointplot(x=1000*true[mask], y=1000*pred[mask], kind=\"reg\", xlim=(lim_min, lim_max), ylim=(lim_min, lim_max))\n",
-    "plt.xlabel(\"DFT FE in meV/cation\")\n",
-    "plt.ylabel(\"ML FE in meV/cation\")\n",
-    "plt.plot([lim_min, lim_max], [lim_min, lim_max], color=\"red\", linestyle=\"dotted\")\n",
-    "\n",
-    "\n",
-    "for k, v in loss(true[mask], pred[mask]).items():\n",
-    "    print(f\"{k}: {v:.5f}\")\n",
-    "    "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Or maybe we should simply focus on the most common unit cell size, 80?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.950931Z",
-     "start_time": "2020-06-05T14:19:54.707Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "mask = test.info[\"atoms_by_system\"] == 80\n",
-    "\n",
-    "sbn.jointplot(x=1000*true[mask], y=1000*pred[mask], kind=\"reg\", xlim=(lim_min, lim_max), ylim=(lim_min, lim_max))\n",
-    "plt.xlabel(\"DFT FE in meV/cation\")\n",
-    "plt.ylabel(\"ML FE in meV/cation\")\n",
-    "plt.plot([lim_min, lim_max], [lim_min, lim_max], color=\"red\", linestyle=\"dotted\")\n",
-    "\n",
-    "\n",
-    "for k, v in loss(true[mask], pred[mask]).items():\n",
-    "    print(f\"{k}: {v:.5f}\")\n",
-    "    "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Or what if we exclude spacegroups 194 and 227?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2020-06-05T14:33:11.952252Z",
-     "start_time": "2020-06-05T14:19:55.868Z"
-    },
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "mask = np.logical_and(test.p[\"sg\"] != 194, test.p[\"sg\"] != 227)\n",
-    "\n",
-    "sbn.jointplot(x=1000*true[mask], y=1000*pred[mask], kind=\"reg\", xlim=(lim_min, lim_max), ylim=(lim_min, lim_max))\n",
-    "plt.xlabel(\"DFT FE in meV/cation\")\n",
-    "plt.ylabel(\"ML FE in meV/cation\")\n",
-    "plt.plot([lim_min, lim_max], [lim_min, lim_max], color=\"red\", linestyle=\"dotted\")\n",
-    "\n",
-    "\n",
-    "for k, v in loss(true[mask], pred[mask]).items():\n",
-    "    print(f\"{k}: {v:.5f}\")\n",
-    "    "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "It appears that relatively simple modifications of the kinds of data points we're willing to predict can make quite a significant difference in the losses. \n",
-    "\n",
-    "✋: Play around with the plots above. How else can you slice the test set? How many points are you excluding?\n",
-    "\n",
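-    "When slicing, it helps to keep track of how much of the test set a given boolean `mask` retains, for instance:\n",
-    "\n",
-    "```python\n",
-    "# fraction of the test set kept by the current mask\n",
-    "print(f\"kept {mask.sum()} of {len(mask)} test points ({100 * mask.mean():.1f}%)\")\n",
-    "```\n",
-    "\n",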
-    "👏: Have a look at [Sutton *et al.* (2020)](https://doi.org/10.1038/s41467-020-17112-9), a paper that introduces a methodology for finding these \"domains of applicability\" in a programmatic way.\n",
-    "\n",
-    "🤔: Consider how you would have to modify this approach to fit the bandgap energy."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Caveats\n",
-    "\n",
-    "For completeness, we should discuss a few caveats to these results. Re-running the optimisation with a different random training set of the same size can lead to quite different results, which would not \"win\" the Kaggle challenge. This nicely illustrates that everything we're doing here is stochastic: single results should not be taken too seriously. Also, I've picked the search space and HP optimisation strategy having quite a bit of experience with this dataset, including the test portion, so strictly speaking, the \"test\" set is not actually unseen. Our \"win\" is therefore purely for demonstration purposes.\n",
-    "\n",
-    "The same reasoning applies to our \"domain of applicability\" attempts above: if we use the test set error to find it, we can't reasonably report the results as obtained on that test set. We should have done this analysis on a split of the training set, and only then applied it to the test set, *without looking at the test results first*.\n",
-    "\n",
-    "A more thorough way of comparing models or model classes against each other is to generate learning curves (plotting errors against training set size), and to re-run the analysis with multiple randomised splits in order to assess the statistical fluctuations in the results. This is, however, computationally quite challenging for expensive models and datasets, so we don't do it here. For an impression of how this can look, have a peek at the papers mentioned in the \"Representation\" section above.\n",
-    "\n",
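-    "For reference, here is a rough sketch of what generating such a learning curve could look like with the pieces we've already used (the model configuration dictionary from the HP search is assumed to be available as `model_config`; this is expensive, so we don't run it here):\n",
-    "\n",
-    "```python\n",
-    "import numpy as np\n",
-    "import cmlkit\n",
-    "\n",
-    "full_train = cmlkit.load_dataset(\"data/krr4mat/nmd18_train\")\n",
-    "test = cmlkit.load_dataset(\"data/krr4mat/nmd18_test\")\n",
-    "loss = cmlkit.evaluation.get_loss([\"mae\"])\n",
-    "true = test.pp(\"fe\", per=\"non_O\")\n",
-    "\n",
-    "for n in [200, 400, 800, 1600]:\n",
-    "    # random subset of the training data (repeat with different seeds to estimate fluctuations)\n",
-    "    idx = np.random.choice(len(full_train.z), size=n, replace=False)\n",
-    "    subset = cmlkit.dataset.Subset.from_dataset(full_train, idx=idx, name=f\"nmd18_train_{n}\")\n",
-    "    m = cmlkit.from_config(model_config, context={\"cache\": \"disk\"})  # model_config: dict from the HP search above\n",
-    "    m.train(subset, target=\"fe\")\n",
-    "    pred = m.predict(test, per=\"non_O\")\n",
-    "    print(f\"n_train={n}: MAE={loss(true, pred)['mae']:.5f} eV/cation\")\n",
-    "```\n",
-    "\n",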
-    "\n",
-    "## Conclusion\n",
-    "\n",
-    "This brings us, finally, to the end of this tutorial. Well done! You now have a working knowledge of KRR and its application to learning structure-property mappings in materials science. Use it wisely.\n",
-    "\n",
-    "If this has piqued your interest, I would recommend browsing the [NOMAD Analytics Toolkit](https://analytics-toolkit.nomad-coe.eu/home/) for additional tutorials on other techniques and applications. If you have further questions, please feel free to drop me a line at langer@fhi-berlin.mpg.de!\n",
-    "\n",
-    "Have a good day! 👋\n",
-    "\n",
-    "### Acknowledgements\n",
-    "\n",
-    "I'd like to thank Tom Purcell, Daniel Speckhard, Matthias Rupp, Luca Ghiringhelli, and Luigi Sbailò for their assistance and feedback."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {