From a0fc2d43bfe64db58cbbfd42ae72437c6dd2b192 Mon Sep 17 00:00:00 2001
From: Marcel Langer <me@sirmarcel.com>
Date: Fri, 30 Oct 2020 12:11:49 +0100
Subject: [PATCH] Correct typos, general tweaks.

---
 README.md     |  11 +--
 krr4mat.ipynb | 243 +++++++++++++++++++++++++++++---------------------
 2 files changed, 148 insertions(+), 106 deletions(-)

diff --git a/README.md b/README.md
index 807606f..650c60d 100644
--- a/README.md
+++ b/README.md
@@ -1,25 +1,26 @@
 # Kernel Ridge Regression for Materials
 
-Tutorial on kernel ridge regression for the block course "Big data and artificial intelligence in materials science" as part of the Max Planck Graduate Center for Quantum Materials.
+Tutorial on kernel ridge regression for materials science, created for the block course "Big data and artificial intelligence in materials science" as part of the [Max Planck Graduate Center for Quantum Materials](https://www.quantummaterials.mpg.de).
 
 Topics covered:
 - Kernel ridge regression
 	- Fundamentals
-	- HP Optimisation
+	- Hyper-Parameter Optimisation
 	- Toy implementation
 - Representations
-- Application to the NOMAD2018 challenge dataset
-	- Strategy
+- Application to the [NOMAD2018 challenge dataset](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors)
 	- Dataset exploration
 	- HP Optimisation
 	- Discussion of results
 
 ## Technicalities
 
-In addition to the prerequisites in `setup.py`, this tutorial requires [`qmmlpack`](https://gitlab.com/qmml/qmmlpack/-/tree/development) on the DEVELOPMENT branch.
+In addition to the prerequisites in `setup.py`, this tutorial requires [`qmmlpack`](https://gitlab.com/qmml/qmmlpack/-/tree/development) on the DEVELOPMENT branch, as well as [`cmlkit`](https://github.com/sirmarcel/cmlkit) and [`cscribe`](https://github.com/sirmarcel/cscribe).
 
 It also requires the environment variable `CML_PLUGINS=cscribe` to be set.
 
+Installing `qmmlpack` by hand and then running `pip install cmlkit cscribe` should suffice.
+
 ## Todos/Future plans
 
 - Once `cmlkit` supports caching, expand the result analysis section
diff --git a/krr4mat.ipynb b/krr4mat.ipynb
index 447e4f3..975e06e 100644
--- a/krr4mat.ipynb
+++ b/krr4mat.ipynb
@@ -27,13 +27,13 @@
     "\n",
     "Additionally, the words \"crystal structure\", \"atom\", \"formation energy\" and related concepts should mean something to you if you want to get the most out of the applications part of the tutorial.\n",
     "\n",
-    "Some passing familiarity with loss functions and \"ML 101\" in general will also prove useful, but is optional!\n",
+    "Some passing familiarity with error metrics (RMSE, MAE, R2) and \"ML 101\" in general will also prove useful, but is optional!\n",
     "\n",
     "### Administrative notes\n",
     "\n",
     "In the text, the βœ‹ emoji will mark tasks that you are expected to complete. πŸ€” will mark more open questions that you could think about, and may be answered later in the tutorial. πŸ‘ are strictly optional tasks.\n",
     "\n",
-    "The mathematical notation of this text is kept intentionally somewhat imprecise, in order to not suggest a rigour that is beyond the scope of the tutorial. In general, variables ($x$, $\\alpha$, ...) can refer to either scalars or vectors equally, their nature should be possible to infer from context. Upper case letters refer to matrices. Latin indices ($i, j, ...$) refer to separate entries in a dataset, for instance training examples. Often, we'll use a lower-case variable without index, like $x$, to refer to a $n_{\\text{example}} \\times \\text{dim}$ matrix of data points, each of which is a vector , for instance $x_i$. The hat symbol ($\\hat f$) generally denotes quantities that are predicted. We typically restrict our input and output spaces to $\\mathbb{R}^N$.\n",
+    "The mathematical notation of this text is kept intentionally somewhat imprecise, in order to not suggest a rigour that is beyond the scope of the tutorial. In general, variables ($x$, $\\alpha$, ...) can refer to either scalars or vectors equally. Upper case letters refer to matrices. Latin indices ($i, j, ...$) refer to separate entries in a dataset, for instance training examples. Often, we'll use a variable without index, like $X$, to refer to a $n_{\\text{example}} \\times \\text{dim}$ matrix of data points, each of which is a vector, for instance $x_i$. The hat symbol ($\\hat f$) generally denotes quantities that are predicted. We typically restrict our input and output spaces to $\\mathbb{R}^N$.\n",
     "\n",
     "Most sections end with a \"further reading\" paragraph that points to the sources, and, well, things that are good to read for more information! :)\n",
     "\n",
@@ -50,7 +50,7 @@
     "\n",
     "## Introduction\n",
     "\n",
-    "Kernel ridge regression (KRR from now on) is an approach to *supervised learning*, where we're trying to infer a relationship between an input and output space: \n",
+    "Kernel ridge regression (KRR from now on) is an approach to *supervised learning*, where we're trying to infer a relationship between an input and an output space: \n",
     "\n",
     "$\\quad f: \\mathcal{X} \\rightarrow \\mathcal{Y}$, \n",
     "\n",
@@ -62,11 +62,11 @@
     "\n",
     "Essentially, our prediction for a new point $x$ is composed from \"basis functions\" $k$ (the *kernel functions*) centered on each training point, with some yet-to-be-determined *regression coefficients* $\\alpha$. These coefficients are determined by solving a convex optimisation problem, as we'll see in a minute.\n",
     "\n",
-    "First, however, let's simply play with this expression for a little bit. First, we choose one particular kernel function. We'll start with the \"working horse of KRR\", the *Gaussian kernel* (or *Radial basis function kernel*)\n",
+    "First, however, let's simply play with this expression for a little bit. First, we choose one particular kernel function. We'll start with the \"working horse of KRR\", the *Gaussian kernel* (or *Radial Basis Function* (RBF) *kernel*)\n",
     "\n",
-    "$\\quad k(x, z) = \\exp{ (-|| x - z ||_2^2 / 2 \\sigma^2) }$,\n",
+    "$\\quad k(x, z) = \\exp{ \\left(\\frac{-|| x - z ||_2^2}{2 \\sigma^2} \\right)}$,\n",
     "\n",
-    "using $||\\, . ||_2$ to denote $l_2$ norm (i.e. the usual vector norm). In the following, to cut down on confusion, we'll call this $\\sigma$ *length scale* or `ls` for short. It is our first *hyper-parameter*, a parameter of our model that we can't immediately determine from the data, we have to somehow empirically decide on it. We'll hear much more about this later.\n",
+    "using $||\\, . ||_2$ to denote $l_2$ norm (i.e. the usual vector norm). In the following, to cut down on confusion, we'll call this $\\sigma$ *length scale* or `ls` for short. It is our first *hyper-parameter* (HP), a parameter of our model that we can't immediately determine from the data, we have to somehow empirically decide on it. We'll hear much more about this later.\n",
     "\n",
     "For now, let's just go ahead and look at a toy problem:"
    ]
@@ -100,6 +100,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "We'll stick to this color code: Red dots are training samples, a dashed red line is the function we're trying to learn.\n",
+    "\n",
     "Alright, let's *just do it* and guess some values for `alpha` and `ls`:"
    ]
   },
@@ -156,15 +158,19 @@
    "source": [
     "## Training the model\n",
     "\n",
-    "Alright, time to solve this problem once and for all, starting with some definitions: $K$ denotes a *kernel matrix*, with rows corresponding to $x$ and columns to $z$. Therefore, the kernel matrix between all points in the training set would be \n",
+    "Alright, time to solve this problem once and for all, starting with some definitions: $K$ denotes a *kernel matrix*. Therefore, the kernel matrix between all points in the training set would be \n",
     "\n",
     "$\\quad K_{ij} = k(x_i, x_j)$,\n",
     "\n",
-    "and to predict something, we'll need another kernel matrix\n",
+    "where $i, j = 1, \\ldots, n$. $K$ is therefore a square matrix.\n",
+    "\n",
+    "To predict something, we'll need another kernel matrix\n",
     "\n",
-    "$\\quad L_{ij} = k(x_i, x_j)$ where $i = 1...n$ runs over the training set and $j$ over the data points we're trying to predict.\n",
+    "$\\quad L_{ij} = k(x_i, x_j)$,\n",
     "\n",
-    "With this notation, predictions are simply $L^T \\alpha$, promoting $\\alpha$ to a vector of all coefficients.\n",
+    "where $i = 1...n$ runs over the training set and $j$ over the data points we're trying to predict. $L$ is a rectangular matrix.\n",
+    "\n",
+    "With this notation, predictions are simply $L^T \\alpha$. $\\alpha$ now refers to a vector of *all* coefficients.\n",
     "\n",
     "The task of finding $\\alpha$ can be phrased as an optimisation problem, where we minimise the \"squared deviation\" or \"squared error\":\n",
     "\n",
@@ -172,7 +178,7 @@
     "\n",
     "$\\quad \\text{arg min}_{\\alpha} (K \\alpha - y)^T (K \\alpha - y) = \\text{arg min}_{\\alpha} \\alpha^T K K \\alpha - 2 \\alpha^T K y + y^T y $ \n",
     "\n",
-    "(The rewrite is straightforward, but slightly tedious. It's easiest to write all products with explicit sums and indices, and then use the fact that $K$ is symmetric. (We have not formally introduced this requirement, since this is an *informal* introdution.))\n",
+    "(The rewrite is straightforward, but slightly tedious. It's easiest to write all products with explicit sums and indices, and then use the fact that $K$ is symmetric, a requirement we have not formally introduced, since this is an *informal* introdution.)\n",
     "\n",
     "To solve the optimisation problem, we simply set the gradient with respect to $\\alpha$ to zero to arrive at:\n",
     "\n",
@@ -180,9 +186,9 @@
     "\\quad K K \\alpha = K y \\\\ \\Rightarrow \\alpha = K^{-1} y.\n",
     "$\n",
     "\n",
-    "(This step is, once again, easy but tedious, to do by writing everything out in index notation. It also requires that the inverse exists, which follows from the requirements a kernel has to fulfil.)\n",
+    "(This step is, once again, easy but tedious, and done by writing everything out in index notation. It also requires that the inverse exists, which follows from the mathematical requirements a kernel has to fulfil.)\n",
     "\n",
-    "πŸ‘: Do the derivation yourself. What requirements do you need the kernel to fulfil?\n",
+    "πŸ‘: Do the derivation yourself. What are those requirements?\n",
     "\n",
     "***\n",
     "\n",
@@ -226,7 +232,8 @@
     "# plot the basis functions separately, so we can see them\n",
     "for i in range(n):\n",
     "    plt.plot(X, basis_function(X, i), alpha=0.3, color=\"black\")\n",
-    "\n"
+    "\n",
+    "print(f\"alpha: {np.round(alpha, 2)}\")"
    ]
   },
   {
@@ -261,7 +268,7 @@
     "X = np.linspace(0, 1, 100)\n",
     "n = 15\n",
     "x_train = np.linspace(0, 1, n)\n",
-    "y_train = f(x_train) + 0.1*np.random.normal(scale=0.4, size=len(x_train))\n",
+    "y_train = f(x_train) + 0.1*np.random.normal(scale=0.4, size=len(x_train))  # add gaussian noise\n",
     "\n",
     "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
     "plt.scatter(x_train, y_train, color=\"red\")\n",
@@ -294,13 +301,13 @@
    "source": [
     "πŸ€”: What is the problem here?\n",
     "\n",
-    "πŸ‘: Why are we using more data points?\n",
+    "πŸ‘: Why are we using more data points? Why did I decide to change `ls`?\n",
     "\n",
     "***\n",
     "\n",
-    "While our model had no problem fitting this noisy data, we notice an unpleasant feature of the result: It's extremely \"wiggly\", as opposed to the smooth function that we're trying to reproduce. The reason for this is simple: In the definition of the optimisation problem that we're solving to fit the model, we're essentially forcing $\\hat f$ to passt through all the training points as much as possible. In practice, however, this is **not** what we would like to happen -- in case of noise, we don't want to fit to the noise, and in other situations we might prefer smooth interpolation to maximum accuracy at the training points. \n",
+    "While our model had no problem fitting this noisy data, we notice an unpleasant feature of the result: It's extremely \"wiggly\", as opposed to the smooth function that we're trying to reproduce. The reason for this is simple: In the definition of the optimisation problem that we're solving to fit the model, we're essentially forcing $\\hat f$ to pass through all the training points as much as possible. In practice, however, this is **not** what we would like to happen -- in case of noise, we don't want to fit to the noise, and in other situations we might prefer smooth interpolation to maximum accuracy at the training points. \n",
     "\n",
-    "In both cases, we should keep in mind what we're aiming for: We would like a good model for the *entire* space $\\mathcal{X}$, not just the training points. While our model is great at fitting the training points, it performs very poorly for almost every other point. This needs to be fixed.\n",
+    "In both cases, we should keep in mind what we're aiming for: We would like a good model for the *entire* space $\\mathcal{X}$, not just the training points. While our model is great at fitting the training points, it performs very poorly for almost every other point. This behaviour, often called \"overfitting\", needs to be fixed.\n",
     "\n",
     "The solution is to add a penalty to the optimisation problem, discouraging \"complex models\". In the case of KRR, the term is $\\lambda \\alpha^T K \\alpha $, with a new hyper-parameter $\\lambda$. (From now on, we will call this \"regularisation strength\" or \"noise level\", `nl` for short.) This term is essentially the norm of the coefficients, weighted by the kernel matrix. (A more technical derivation is beyond the scope of this tutorial, please consult the \"further reading\" section below.)\n",
     "\n",
@@ -316,7 +323,7 @@
     "\n",
     "With the addition of this *regularisation*, we've now finally arrived at kernel ridge regression!\n",
     "\n",
-    "(It's also worth mentioning that even in the absence of noise, it's usually preferably to add a small amount of regularisation for numerical reasons, and to encourage a smooth interpolation.)\n",
+    "(It's also worth mentioning that even in the absence of noise, it's usually preferably to add a small amount of regularisation for numerical reasons, and to encourage smooth interpolation.)\n",
     "\n",
     "So, how does it look?"
    ]
@@ -426,7 +433,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "πŸ‘: How would you make this code more efficient? How would you re-structure to support different kernel functions?\n",
+    "πŸ‘: How would you make this code more efficient, in particular the kernel function?\n",
     "\n",
     "Finally, a word of caution: This implementation is **not** production ready. In reality, the use of a matrix inversion to train the model is both inefficient and numerically unstable, and will yield to failure in many real-world applications. For this reason, real implementations use the Cholesky decomposition instead. Rewriting the implementation above is straightforward, but beyond the scope of an introductory tutorial!\n",
     "\n",
@@ -459,12 +466,13 @@
     "x_train = np.linspace(0, 1, n)\n",
     "y_train = f(x_train)\n",
     "\n",
-    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\")\n",
+    "plt.plot(X, f(X), color=\"red\", linestyle=\"dashed\", label=\"ground truth\")\n",
     "plt.scatter(x_train, y_train, color=\"red\")\n",
     "\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.05, nl=1e-3), x_train, y_train, X), color=\"black\", label=\"low ls\")\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.5, nl=1e-3), x_train, y_train, X), color=\"blue\", linestyle=\"dotted\", label=\"high ls, low nl\")\n",
-    "plt.plot(X, train_and_predict(KRR(ls=0.5, nl=1e-1), x_train, y_train, X), color=\"pink\", label=\"high ls, high nl\")\n",
+    "plt.plot(X, train_and_predict(KRR(ls=0.05, nl=1e-3), x_train, y_train, X), color=\"green\", linestyle=\"dotted\", label=\"low ls, low nl\")\n",
+    "plt.plot(X, train_and_predict(KRR(ls=0.05, nl=1e-1), x_train, y_train, X), color=\"green\", linestyle=\"dashed\", label=\"low ls, high nl\")\n",
+    "plt.plot(X, train_and_predict(KRR(ls=0.5, nl=1e-3), x_train, y_train, X), color=\"orange\", linestyle=\"dotted\", label=\"high ls, low nl\")\n",
+    "plt.plot(X, train_and_predict(KRR(ls=0.5, nl=1e-1), x_train, y_train, X), color=\"orange\", linestyle=\"dashed\", label=\"high ls, high nl\")\n",
     "plt.legend()"
    ]
   },
@@ -472,14 +480,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "So our next task is coming up with a strategy to pick these \"hyper-parameters\" (HPs). There are different approaches to this problem, and as far as I'm aware, no clearly optimal strategy, or firm mathematical grounds for one. Speaking from personal experience, the approach we'll explore now works reasonably well for the problems at hand, but it is always recommended to experiment. It's also of vital importance to *state what you're doing* if you're publishing any work in this domain. Since the HPs impact model performance significantly, the choice of tuning strategy is an important piece of information.\n",
+    "So our next task is coming up with a strategy to pick these \"hyper-parameters\", `nl` and `ls`. HP optimisation is in general a global non-convex optimisation problem with all the problems that entails. There is no perfect solution to this problem, and many different approaches can be taken.\n",
+    "\n",
+    "Speaking from personal experience, the approach we'll explore now works reasonably well for the problems at hand, but it is always recommended to experiment. It's also of vital importance to *state what you're doing* if you're publishing any work in this domain. Since the HPs impact model performance significantly, the choice of tuning strategy is an important piece of information.\n",
     "\n",
     "In order to design a HP tuning strategy, we need to first decide on a measure of performance. Usually, this is further decomposed into two sub-tasks: \n",
     "\n",
-    "1. Picking a quantitative error measure, a \"loss function\", and \n",
+    "1. Picking a quantitative error measure, an \"error metric\", and \n",
     "2. Deciding on a set of training and test examples to compute it. \n",
     "\n",
-    "Two common loss functions are (adopting the $y$ as shorthand for all test labels, and $\\hat y$ as the corresponding predictions of our model):\n",
+    "Two common error metric are (adopting the $y$ as shorthand for all test labels, and $\\hat y$ as the corresponding predictions of our model):\n",
     "\n",
     "The root mean squared error (RMSE):\n",
     "\n",
@@ -489,13 +499,13 @@
     "\n",
     "$\\quad \\text{mae}(y, \\hat y) = \\frac{1}{n_{\\text{test}}} \\sum_{i=1}^{n_{\\text{test}}} |y_i - \\hat y_i | $\n",
     "\n",
-    "The RMSE is an upper bound to the MAE, and more sensitive to outliers (i.e. single points with large error). Since the KRR training amounts to minimising the RMSE on the training set, the RMSE can be regarded as the \"natural\" error measure for KRR -- however, the choice of loss function ultimately depends on the problem being investigated.\n",
+    "The RMSE is an upper bound to the MAE, and more sensitive to outliers (i.e. single points with large error). Since the KRR training amounts to minimising the RMSE on the training set, the RMSE can be regarded as the \"natural\" error measure for KRR -- however, the choice of error metric ultimately depends on the problem being investigated.\n",
     "\n",
-    "Let's start our explorations with a very simple strategy: We'll further split our training set into a \"validation\" train/test split. In other words, we'll pick some parameters, train the model on the \"validation training set\", predict on the \"validation test set\", and compute the loss. Then, we'll attempt to minimise this loss.\n",
+    "Let's start our explorations with a very simple strategy: We'll further split our training set into a \"validation\" train/test split. In other words, we'll pick some parameters, train the model on the \"validation training set\", predict on the \"validation test set\", and compute the error. Then, we'll attempt to minimise this error.\n",
     "\n",
     "In order to make this a bit more viable, let's choose a slightly more exciting toy problem, and a bit more data. For illustrative purposes, we'll use regularly spaced training points and sub-splits. In practice, the training set is often randomly drawn, as is the validation split.\n",
     "\n",
-    "(In fact, the underlying mathematical derivations for ML models often assume \"identically and independently distributed\" data points. This is often violated in practice, with somewhat unclear consequences.)"
+    "(In fact, mathematical derivations for ML models often assume \"identically and independently distributed\" data points. This is often violated in practice, with somewhat unclear consequences. For instance, the error on the \"validation test set\" is only a reasonable estimate of the \"real error\" if it is drawn from the same distribution as the actual test data.)"
    ]
   },
   {
@@ -544,7 +554,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next, we'll build some infrastructure...   "
+    "Next, we'll build some infrastructure. We introduce the term \"loss function\" to refer to the function we're trying to minimise in HP optimisation. The loss function and be chosen to correspond to any of the error metrics above. In practice, the term \"loss\" is often used instead of \"error\" to describe the minimsation target; we'll follow this convention."
    ]
   },
   {
@@ -605,7 +615,7 @@
     "    ls = 2*np.random.random()\n",
     "    nl = 0.01*np.random.random()\n",
     "    print(f\"ls={ls:.3f}, nl={nl:.3f}\")\n",
-    "    print(f\"RMSE: {compute_loss(ls, nl, lossf=rmse)}, MAE={compute_loss(ls, nl, lossf=mae)}\")\n",
+    "    print(f\"RMSE: {compute_loss(ls, nl, lossf=rmse):.3f}, MAE={compute_loss(ls, nl, lossf=mae):.3f}\")\n",
     "    print()\n"
    ]
   },
@@ -613,9 +623,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "βœ‹: Spend a few minutes in the cell below, optimising the values by hand.\n",
-    "\n",
-    "πŸ‘: Does anything about the value of the RMSE above strike you as peculiar?"
+    "βœ‹: Spend a few minutes in the cell below, optimising the values by hand."
    ]
   },
   {
@@ -629,14 +637,14 @@
    },
    "outputs": [],
    "source": [
-    "print(compute_loss(0.05, 0.0001))"
+    "print(f\"{compute_loss(0.05, 0.0001):.3f}\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "First, let's try to make some educated guesses, as a baseline. For example, we could demand the exponent in the kernel to be approximately one, and simply set a lot value for `nl`."
+    "First, let's try to make some educated guesses, as a baseline. For example, we could demand the exponent in the kernel to be approximately one, and simply set a small value for `nl`, i.e. assume that there is little noise in the data."
    ]
   },
   {
@@ -662,9 +670,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Alright, we could do worse! But let's do it systematically now!\n",
+    "Alright, we could do worse. But we could also do better. As a rule of thumb, an RMSE of around 1% of the standard deviation is a good error to aim for. \n",
     "\n",
-    "As a simple approach we could take to this task is to simply *try all the possible combinations* (in some domain), also called a \"grid search\"."
+    "Let's now try to approach this problem more systematically. A simple approach we could take is to simply *try all the possible combinations* (in some domain), also called a \"grid search\"."
    ]
   },
   {
@@ -692,13 +700,15 @@
     "\n",
     "plt.imshow(results,\n",
     "        extent=[nl_grid[0], nl_grid[-1], ls_grid[-1], ls_grid[0]],\n",
-    "        aspect=(nl_grid[-1] - nl_grid[0])/(ls_grid[-1] - ls_grid[0]))\n",
+    "        aspect=(nl_grid[-1] - nl_grid[0])/(ls_grid[-1] - ls_grid[0]),\n",
+    "        cmap=plt.get_cmap(\"inferno\"))\n",
     "\n",
     "plt.clim(0.0, std)\n",
     "plt.colorbar()\n",
     "\n",
+    "# find minimum, draw as big red X\n",
     "i_min, j_min = np.unravel_index(np.argmin(results), results.shape)\n",
-    "plt.scatter(nl_grid[j_min], ls_grid[i_min], color=\"red\", marker=\"x\")\n",
+    "plt.plot(nl_grid[j_min], ls_grid[i_min], color=\"red\", marker=\"x\", ms=20)\n",
     "\n",
     "plt.xlabel(\"nl\")\n",
     "plt.ylabel(\"ls\")\n",
@@ -709,7 +719,7 @@
     "best_nl = nl_grid[j_min]\n",
     "best_ls = ls_grid[i_min]\n",
     "\n",
-    "print(f\"min loss: {compute_loss(nl=best_nl, ls=best_ls)} at nl={best_nl} and ls={best_ls}.\")"
+    "print(f\"min loss: {compute_loss(nl=best_nl, ls=best_ls):.4f} at nl={best_nl:.4f} and ls={best_ls:.4f}.\")"
    ]
   },
   {
@@ -744,13 +754,16 @@
     "\n",
     "plt.imshow(results,\n",
     "        extent=[nl_grid[0], nl_grid[-1], ls_grid[-1], ls_grid[0]],\n",
-    "        aspect=(nl_grid[-1] - nl_grid[0])/(ls_grid[-1] - ls_grid[0]))\n",
+    "        aspect=(nl_grid[-1] - nl_grid[0])/(ls_grid[-1] - ls_grid[0]),\n",
+    "        cmap=plt.get_cmap(\"inferno\"))\n",
     "\n",
     "plt.clim(0.0, std)\n",
     "plt.colorbar()\n",
     "\n",
+    "# find minimum, draw as big red X\n",
     "i_min, j_min = np.unravel_index(np.argmin(results), results.shape)\n",
-    "plt.scatter(nl_grid[j_min], ls_grid[i_min], color=\"red\", marker=\"x\")\n",
+    "plt.plot(nl_grid[j_min], ls_grid[i_min], color=\"red\", marker=\"x\", ms=20)\n",
+    "\n",
     "\n",
     "plt.xlabel(\"nl\")\n",
     "plt.ylabel(\"ls\")\n",
@@ -761,7 +774,7 @@
     "best_nl = nl_grid[j_min]\n",
     "best_ls = ls_grid[i_min]\n",
     "\n",
-    "print(f\"min loss: {compute_loss(nl=2**best_nl, ls=2**best_ls)} at nl=2^{best_nl} and ls=2^{best_ls}.\")"
+    "print(f\"min loss: {compute_loss(nl=2**best_nl, ls=2**best_ls):.4f} at nl=2^{best_nl:.4f} and ls=2^{best_ls:.4f}.\")"
    ]
   },
   {
@@ -802,17 +815,17 @@
     "\n",
     "***\n",
     "\n",
-    "Using a single holdout validation set, as we've done here, is not always sufficient. For instance, if the choice is random, a somehow non-representative split might get chosen, and the performance estimate used during HP optimisation does not reflect the \"real\" performance very well. A common strategy to avoid this case is to use some for of *cross-validation* (CV), where the training set is divided into multiple validation splits, and the optimisation target is an average over the losses obtained on these splits. Often, these splits are chosen by partitioning the training set into $k$ non-overlapping subsets, and then, $k$ times, training on the combination of $k-1$ of these splits and predicting on the remaining one. Alternatively, the CV splits can simply be chosen at random. An extreme case is \"leave one out\" cross-validation, where the model is trained on all except *one* point, for each point in the training set.\n",
+    "Using a single holdout validation set, as we've done here, is not always sufficient. For instance, if the choice is random, a somehow non-representative split might get chosen, and the performance estimate used during HP optimisation does not reflect the \"real\" performance very well. A common strategy to avoid this case is to use some form of *cross-validation* (CV), where the training set is divided into multiple validation splits, and the optimisation target is an average over the losses obtained on these splits. Often, these splits are chosen by partitioning the training set into $k$ non-overlapping subsets, and then, $k$ times, training on the combination of $k-1$ of these splits and predicting on the remaining one. This is called \"k-fold cross-validation\". Alternatively, the CV splits can simply be chosen at random. An extreme case is \"leave one out\" cross-validation, where the model is trained on all except *one* point, for each point in the training set.\n",
     "\n",
-    "Apart from that, we can also vary the optimisation approach: While there are no guarantees on the convexity of the optimisation problem, we might be happy with a *local* loss minimum, so gradient descent or related approaches could be used. Additionally, a stochastic global optimisation, or a \"local\" grid search, which searches for local minima within the local environment of a given starting point, can be considered.\n",
+    "Apart from that, we can also vary the optimisation approach: While there are no guarantees on the convexity of the optimisation problem, we might be happy with a *local* minimum, so gradient descent or related approaches could be used. Additionally, a stochastic global optimisation, or a \"local\" grid search, which searches for local minima within the local environment of a given starting point, can be considered.\n",
     "\n",
     "In the case of KRR, there exists an alternative approach, from the domain of *Gaussian process regression* (GPR). GPR is a statistical approach to regression that essentially predicts a probability distribution over possible functions that could feasibly have generated the training data. The *mean* over this (infinite) distribution of functions, which corresponds to the \"maximum likelihood estimate\" is identical to a prediction using KRR given the same HPs and training data. Therefore, in some sense, KRR and GPR are *equivalent*. However, due to the development of GPR in a different community, the practices associated with HP tuning and interpretation of results are often quite different, and therefore, some care has to be taken to distinguish between the two.\n",
     "\n",
-    "In GPR, HPs are often optimised by, roughly speaking, maximising the likelihood of the training data to be observed with the model. Anecdotally, this does not produce superior results compared to a holdout/cross-validation strategy, but in some sense \"makes better use of the training data\", and it can be done using gradient descent. As far as I'm aware, there are no firm theoretical grounds to privileging one approach over the other in a real-world setting.\n",
+    "In GPR, HPs are often optimised by, roughly speaking, maximising the likelihood of the training data to be observed given the model. Anecdotally, this does not produce superior results compared to a holdout/cross-validation strategy, but in some sense \"makes better use of the training data\", and it can be done using gradient descent. As far as I'm aware, there are no firm theoretical grounds to privileging one approach over the other in a real-world setting.\n",
     "\n",
     "## Further reading and references\n",
     "\n",
-    "The introduction to KRR in this section was largely modelled on [*Machine Learning for Quantum Mechanics in a Nutshell* by Matthias Rupp (2015)]( ... ), with additional tidbits from [*Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2017)](...) and [*Gaussian Processes for Machine Learning* by Carl Edward Rasmussen and Christopher K.I. Williams (2006)]( ... ). \n",
+    "The introduction to KRR in this section was largely modelled on [*Machine Learning for Quantum Mechanics in a Nutshell* by Matthias Rupp (2015)]( https://doi.org/10.1002/qua.24954 ), with additional tidbits from [*Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2017)]( https://web.stanford.edu/~hastie/ElemStatLearn/ ) and [*Gaussian Processes for Machine Learning* by Carl Edward Rasmussen and Christopher K.I. Williams (2006)]( http://www.gaussianprocess.org/gpml/ ). \n",
     "\n",
     "I would highly recommend the \"... in a nutshell\" article as additional reading on the background of KRR, and Chapter 2 of \"Ramussen & Williams\" as an in-depth discussion of GPR. Chapter 5 of the same book offers a good overview of the different approaches to HP tuning briefly mentioned above.\n"
    ]
@@ -823,7 +836,7 @@
    "source": [
     "## Materials, Representations and KRR\n",
     "\n",
-    "Let us now turn our attention to our particular problem domain: Predicting the properties of materials (and molecules). In particular, in this tutorial, we're aiming to predict properties that can be directly computed from the structure (atomic positions, atomic charges, atomic numbers unit cell basis vectors) with a first principles method, for instance density functional theory. We're therefore trying to develop models als *surrogate* models to the first principles methods, with the aim of using these models as a less accurate, but faster, alternative in the context of screening. Since this general approach includes both materials (periodic boundary conditions) and molecules (no boundary conditions), we'll largely use the abstract *structure* to refer to either.\n",
+    "Let us now turn our attention to our particular problem domain: Predicting the properties of materials (and molecules). In particular, in this tutorial, we're aiming to predict properties that can be directly computed from the structure (atomic positions, atomic charges, atomic numbers, and unit cell basis vectors) with a first-principles method, for instance density functional theory. We're therefore trying to develop *surrogate* models to the first-principles methods, with the aim of using the model as a less accurate, but faster, alternative in the context of screening. Since this general approach includes both materials (periodic boundary conditions) and molecules (no boundary conditions), we'll largely use the abstract *structure* to refer to either.\n",
     "\n",
     "Speaking slightly more formally, we wish to learn a functional structure-property relationship:\n",
     "\n",
@@ -831,11 +844,13 @@
     "\\quad \\mathcal{S_i} = \\{R_i, B_i, Z_i\\} \\rightarrow \\mathbb{R}\n",
     "$\n",
     "\n",
-    "where, for a structure $i$, $R_i, B_i, Z_i$ are the positions of each atom, the basis vectors of the unit cell, and the corresponding atomic numbers for each atom. We'll focus on scalar quantities, for example the potential energy or bandgap. Vector properties often pose additional challenges!\n",
+    "where, for a structure $i$, $R_i, B_i, Z_i$ are the positions of each atom, the basis vectors of the unit cell, and the corresponding atomic numbers for each atom. We'll focus on scalar quantities, for example the potential energy or bandgap. Vector properties often pose additional challenges, for instance covariance.\n",
     "\n",
     "### Representations\n",
     "\n",
-    "Considering what we've now learned about kernel ridge regression, one immediate challenge presents itself: How exactly are we supposed to compute a kernel value between two structures? So far, it has been clear: We simply have two input vectors, which uniquely identify two points, and we compute a single, scalar \"kernel distance\" between them. But now, we have three matrices, not even of the same dimension -- and also not unique, as any permutation of the order in which we write down the atoms, or a rotation and/or translation of the positions, or choice of unit cell, leaves the properties we're investigating here invariant!\n",
+    "Considering what we've learned about kernel ridge regression, one immediate challenge presents itself: How exactly are we supposed to compute a kernel value between two structures? \n",
+    "\n",
+    "So far, it has been clear: We simply have two input vectors, which uniquely identify two points, and we compute a single, scalar \"kernel distance\" between them. But now, we have three matrices, not even of the same dimension -- and also not unique, as any permutation of the order in which we write down the atoms, or a rotation and/or translation of the positions, or choice of unit cell, leaves the properties we're investigating here invariant!\n",
     "\n",
     "There are essentially two, not quite separable, approaches to dealing with this problem: Either by defining a kernel directly on the space of structures, or by introducing an intermediate step that maps a structure first to a vector with the required invariances, and then using that vector as input in an \"off-the-shelf\" KRR approach. Clearly, the second approach can be regarded as a subset of the first, with the advantage of separating the \"physics specific\" parts from the ML approach.\n",
     "\n",
@@ -857,7 +872,7 @@
     "\n",
     "πŸ€”: Spend a minute thinking about how to build a representation. Why can't you simply concatenate all the positions, basis vectors, charges into one big vector?\n",
     "\n",
-    "For this tutorial, we'll focus use Smooth Overlap of Atomic Positions (SOAP) representaton by [BartΓ³k, Kondor, CsΓ‘nyi (2013)](https://doi.org/10.1103/PhysRevB.87.184115). In a nutshell, this *atomic* representation does the following: For every atom within a certain distance (the `cutoff`) of the central atom for which we're computing SOAP, a Gaussian is placed on the location. The resulting density is then expanded in spherical harmonics and a radial basis, and invariants (the \"power spectrum\") of the resulting coefficients form the representation.\n",
+    "For this tutorial, we'll use the Smooth Overlap of Atomic Positions (SOAP) representation by [BartΓ³k, Kondor, CsΓ‘nyi (2013)](https://doi.org/10.1103/PhysRevB.87.184115). In a nutshell, this *atomic* representation does the following: For every atom within a certain distance (the `cutoff`) of the central atom for which we're computing SOAP, a Gaussian is placed on the location. The resulting density is then expanded in spherical harmonics and a radial basis, and invariants (the \"power spectrum\") of the resulting coefficients form the representation.\n",
     "\n",
     "### Kernels\n",
     "\n",
@@ -871,11 +886,11 @@
     "\n",
     "In other words, we simply pretend $x$ and $y$ are entire sets of inputs, compute the full kernel matrix between them, and then take the sum of that \"atom-atom\" kernel matrix. From a theoretical standpoint, we can motivate this choice by assuming that our target quantity, if it is extensive, can be decomposed into (unknown) atomic contributions, and then re-deriving the equations defining KRR under that assumption. (See the \"further reading\" section for details.)\n",
     "\n",
-    "πŸ€”: Do you see any problems with this definition? How would you change it to model *intensive* properties?\n",
+    "πŸ€”: How would you change this approach to model *intensive* properties?\n",
     "\n",
     "### Further reading\n",
     "\n",
-    "Some popular representations are:\n",
+    "Some representations are:\n",
     "\n",
     "- Many-Body Tensor Representation (MBTR) by [Huo, Rupp (2017)](https://arxiv.org/abs/1704.06439)\n",
     "- Smooth Overlap of Atomic Positions (SOAP) representaton by [BartΓ³k, Kondor, CsΓ‘nyi (2013)](https://doi.org/10.1103/PhysRevB.87.184115)\n",
@@ -883,7 +898,7 @@
     "\n",
     "Multiple implementations of each of these exist, for instance, MBTR is implemented in [`qmmlpack`](https://gitlab.com/qmml/qmmlpack), and SOAP in [`QUIP`](https://github.com/libAtoms/QUIP). Additionally, the [`dscribe`](https://singroup.github.io/dscribe/) package offers alternative, unified implementations of many representations. \n",
     "\n",
-    "For an overview and comparison of these representations, you might be interested in [Langer *et al.* (2020)](https://arxiv.org/abs/2003.12081). The references of that pre-print will also point you to many more things to read, if this topic catches your fancy! (Please excuse the self-promotion, dear reader!)\n",
+    "For an overview and comparison of these representations, you might be interested in [Langer *et al.* (2020)](https://arxiv.org/abs/2003.12081). The references of that pre-print will also point you to many more things to read.\n",
     "\n",
     "Here is a small selection of additional publications to have a look at:\n",
     "\n",
@@ -900,23 +915,27 @@
    "source": [
     "## NOMAD2018 Kaggle Challenge\n",
     "\n",
-    "At this point, we're reading to tackle a *real* regression problem. In particular, we'll recapitulate a portion of  the [\"Nomad2018 Predicting Transparent Conductors\" Kaggle challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors). The challenge was to create a regression model to predict formation energies and bandgaps for Al<sub>x</sub>Ga<sub>y</sub>In<sub>z</sub> sesquioxides, which are believed to be candidate materials for transparent conductors.  The training set consisted of 2400 such materials, with Vegard's law \"projected geometries\" and formation and bandgap energies for the corresponding relaxed structures, and the test set was 600 additional unseen materials. For the full results of the challenge, and an in-depth discussion of the winning methods, please see [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3).\n",
+    "At this point, we're ready to tackle a *real* regression problem. In particular, we'll recapitulate a portion of  the [\"Nomad2018 Predicting Transparent Conductors\" Kaggle challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors). The challenge was to create a regression model to predict formation energies and bandgaps for Al<sub>x</sub>Ga<sub>y</sub>In<sub>z</sub> sesquioxides, which are believed to be candidate materials for transparent conductors.  The training set consisted of 2400 such materials, with Vegard's law \"projected geometries\" and formation and bandgap energies for the corresponding relaxed structures, and the test set was 600 additional materials unseen during training. (The \"projected geometries\" are estimates of the relaxed crystal structures obtained by re-scaling the lattice parameters based on the composition.)\n",
+    "\n",
+    "For the full results of the challenge, and an in-depth discussion of the winning methods, please see [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3).\n",
     "\n",
-    "For this tutorial, we'll focus on the formation energy, which appears to be a less challening learning problem. We will also do most of our work with significantly smaller subsets, in order to not have to wait too long for calculations to finish. Perhaps it should be noted that, by most reasonable measures, this dataset presents a rather *challenging* problem! The structures are quite large and periodic, and therefore computationally expensive to deal with, so HP optimisation is a resource-intensive task. The fact that we don't deal with relaxed structures, but with \"synthetic\" geometries make the regression task quite challenging by introducing ambiguity into the structure-property mapping.\n",
+    "In this tutorial, we'll focus on the formation energy, which appears to be a less challening learning problem. We will also do most of our work with significantly smaller subsets, in order to not have to wait too long for calculations to finish. \n",
+    "\n",
+    "This dataset presents a rather *challenging* problem. The structures are quite large and periodic, and therefore computationally expensive to deal with, so HP optimisation is a resource-intensive task. The fact that we don't deal with relaxed structures, but with \"projected\" geometries additionally complicates the regression task, since the structure-property mapping implicitly now also contains relaxation.\n",
     "\n",
     "### Technical Interlude\n",
     "\n",
-    "Before we go ahead and get our hands dirty, let me briefly interject with a few words on the technical infrastructure we'll be using for the rest of this tutorial. Clearly, our hand-crafted KRR implementation won't quite cut it, and we'll also need state-of-the-art representations to aid us.\n",
+    "Before we go ahead and get our hands dirty, let me briefly interject with a few words on the technical infrastructure we'll be using for the rest of this tutorial. Our hand-crafted KRR implementation won't quite cut it, and we'll also need state-of-the-art representations to aid us.\n",
     "\n",
-    "For KRR, which is not particularly difficult to implement, we in principle have many options, for example [`scikit-learn`](https://scikit-learn.org/stable/) or [`GPyTorch`](https://scikit-learn.org/stable/), which both implement general-purpose GPR, with many kernels and variations. However, for the present purposes, a more focused tool is better: We will use [`qmmlpack`](https://gitlab.com/qmml/qmmlpack), which implements both KRR and the MBTR represention, and is specialised for regression models in the \"materials and molecules\" domain.\n",
+    "For KRR, which is not particularly difficult to implement, we in principle have many options, for example [`scikit-learn`](https://scikit-learn.org/stable/) or [`GPyTorch`](https://github.com/cornellius-gp/gpytorch), which both implement general-purpose GPR, with many kernels and variations. However, for the present purposes, a more focused tool is better: We will use [`qmmlpack`](https://gitlab.com/qmml/qmmlpack), which implements both KRR and the MBTR represention, and is specialised for regression models in the \"materials and molecules\" domain.\n",
     "\n",
     "For the representations, there are also many options, depending on which we choose. In order to keep things simple, we will use [`dscribe`](https://singroup.github.io/dscribe/), which implements a wide range of representations in one package.\n",
     "\n",
-    "In order to tie these two components of our models together we'll be using [`cmlkit`](https://marcel.science/cmlkit), a modular (and quite experimental) package designed for constructing \"representation+KRR\"-style models. It wraps the underlying implementations in a unified and consistent interface, so we don't have to worry too much about the technical details.\n",
+    "In order to tie these two components of our models together we'll be using [`cmlkit`](https://marcel.science/cmlkit), a modular package designed for constructing \"representation+KRR\"-style models. It wraps the underlying implementations in a unified and consistent interface, so we don't have to worry too much about the technical details.\n",
     "\n",
-    "However, it is important to keep in mind that fundamentally, nothing we'll do in the rest of the tutorial is particularly difficult to achieve with any KRR/GPR implementation, plain `dscribe` and a bit of time, so please don't feel too intimidated to select tools that suit your working style. `cmlkit` is designed to be quite minimal, pointy, and opinionated, and it is experimental software, so it's understandable if it's not your cup of tea!\n",
+    "However, it is important to keep in mind that fundamentally, nothing we'll do in the rest of the tutorial is particularly difficult to achieve with any KRR/GPR implementation, plain `dscribe` and a bit of time, so please don't feel too intimidated to select tools that suit your working style. `cmlkit` is designed to be quite minimal, expects you to build your own infrastructure on top of it, and is experimental software, so it's understandable if it's not your cup of tea.\n",
     "\n",
-    "To learn more about `cmlkit`, please consult the [`cmlkit` tutorial TODO]( TODO ), also available on the NOMAD Analytics Toolkit.\n",
+    "To learn more about `cmlkit`, please consult the [`cmlkit` tutorial]( https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/cmlkit.ipynb ), also available as part of the [NOMAD Analytics Toolkit](https://nomad-lab.eu/AItutorials).\n",
     "\n",
     "### Inspecting the dataset\n",
     "\n",
@@ -939,6 +958,8 @@
     "import cmlkit\n",
     "import seaborn as sbn\n",
     "\n",
+    "cmlkit.caches.location = \"data/krr4mat/cache\" # usually, this would be set using an environment variable\n",
+    "\n",
     "# load the dataset as a cmlkit Dataset\n",
     "train = cmlkit.load_dataset(\"data/krr4mat/nmd18_train\")"
    ]
@@ -1031,6 +1052,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "These distributions also look reasonably similar.\n",
+    "\n",
     "To conclude investigating the structures, let's check the spacegroups we're dealing with:"
    ]
   },
@@ -1058,7 +1081,7 @@
     "\n",
     "Finally, let's briefly check on the quantity we're trying to predict: The formation energy *per substitution site*, which in this case means \"the formation energy per atom in the structure that is not Oxygen\". We use this quantity because it was used in the Kaggle challenge, more typical choices would be the intensive \"formation energy per atom\" or the extensive \"formation energy per unit cell\". `cmlkit` allows easy conversion between these using the `per` argument in the function below.\n",
     "\n",
-    "The formation energy is given in eV here. But, since this a machine learning tutorial, units don't mean anything to us! πŸ˜‰\n",
+    "The formation energy is given in eV here.\n",
     "\n",
     "πŸ€”: Why would we be interested in converting between `per=\"atom\"` and `per=\"cell\"`?"
    ]
@@ -1077,10 +1100,30 @@
     "sbn.distplot(train.pp(\"fe\", per=\"non_O\"))"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sbn.distplot(train.pp(\"fe\", per=\"cell\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sbn.distplot(train.pp(\"fe\", per=\"atom\"))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "The energy distribution is not quite a normal distribution, and shifted so no negative energies occur. Interestingly, there seems to be a small number of high-energy structures. Converting to energy per unit cell slightly shifts the distribution, while there appears to be little difference between energy per atom and energy per non-O atoms. \n",
+    "\n",
     "Alright! Now we know what we're dealing with. What do we do now?\n",
     "\n",
     "### Our Plan\n",
@@ -1097,15 +1140,15 @@
     "        2. Choose a train/test split and/or cross-validation\n",
     "    3. Choose an optimisation method\n",
     "4. Run this\n",
-    "5. Evaluate the resulting model or models according to what we decided to do in step 1\n",
+    "5. Evaluate the resulting model or models according to what we decided to achieve in step 1\n",
     "\n",
     "***\n",
     "\n",
-    "While the various constraint of this tutorial keep us from exploring most of these steps in detail, it's important to briefly consider all of them, especially as practice for future projects beyond this tutorial.\n",
+    "While the time and computational cost constraints of this tutorial keep us from exploring most of these steps in detail, it's important to briefly consider all of them, especially as practice for future projects beyond this tutorial.\n",
     "\n",
     "#### Goal\n",
     "\n",
-    "First, what are we trying to do? Let's share the (partial) goal of the kaggle challenge and declare:\n",
+    "First, what are we trying to do? Let's share the (partial) goal of the Kaggle challenge and declare:\n",
     "\n",
     "*We will create a model that, when trained on the \"training\" portion of the Kaggle challenge dataset, predicts the formation energy per substitution site on the \"test\" portion of the same dataset with a MAE in the range of 13-15 meV/cation. (Which is what is reported in Sutton et al. (2019) as the range of results in the original challenge.)*\n",
     "\n",
@@ -1118,6 +1161,8 @@
     "nl in [2**-20, 2**0]\n",
     "```\n",
     "\n",
+    "These ranges are chosen based on previous empirical experience, if more computing power is available, they can also be chosen more generally, for instance from `[2**-20, 2**+20]`.\n",
+    "\n",
     "For the SOAP representation, we will only optimise one parameter:\n",
     "\n",
     "```\n",
@@ -1129,7 +1174,7 @@
     "\n",
     "#### HP Optimisation\n",
     "\n",
-    "Since we're computationally constrained in this tutorial, we can't use the full training set for our HP optimisation scheme, it is simply too expensive. We'll therefore use a single train/test split with `n_train=800` and `n_test=200` (the \"usual\" 80%/20% split). This is a rather basic choice, as we've discussed in the KRR section, but we'll see that it is sufficient for our purposes here. We'll use the `rmse` loss function.\n",
+    "Since we're computationally constrained in this tutorial, we can't use the full training set for our HP optimisation scheme. We'll therefore use a single train/test split with `n_train=800` and `n_test=200` (the \"usual\" 80%/20% split). This is a rather basic choice, as we've discussed in the KRR section, but we'll see that it is sufficient for our purposes here. We'll use the `rmse` as loss function.\n",
     "\n",
     "πŸ€”: What would you do in this situation?\n",
     "\n",
@@ -1142,9 +1187,7 @@
     "import cmlkit\n",
     "data = cmlkit.load_dataset(\"data/krr4mat/nmd18u\")  # obtainable from qmml.org\n",
     "rest, train, test = cmlkit.utility.threeway_split(2400, 800, 200)\n",
-    "train = cmlkit.dataset.Subset.from_dataset(\n",
-    "    data, idx=train, name=\"nmd18_hpo_train\"\n",
-    ")\n",
+    "train = cmlkit.dataset.Subset.from_dataset(data, idx=train, name=\"nmd18_hpo_train\")\n",
     "test = cmlkit.dataset.Subset.from_dataset(data, idx=test, name=\"nmd18_hpo_test\")\n",
     "train.save()\n",
     "test.save()\n",
@@ -1155,9 +1198,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now, we'll have to decide on an optimisation method. We have three free parameters, and since we'd like to both, cover a large range of possible values, and not do too much computation, we'll refrain from using a grid search. Instead, we'll use a stochastic search that samples from all possible values and focuses the sampling on promising reasons of parameter space.\n",
+    "Now, we'll have to decide on an optimisation method. We have three free parameters, and since we'd like to both cover a large range of possible values, and not do too much computation, we'll refrain from using a grid search. Instead, we'll use a stochastic search that samples from all possible values and focuses the sampling on promising reasons of parameter space.\n",
     "\n",
-    "We'll set up this HP optimisation task using `cmlkit`. You don't have to understand every line in this in detail, as this is somewhat beyond the focus of this tutorial. If you'd like to learn more, take a look at the [`cmlkit` tutorial TODO]( TODO ). That tutorial will also explain, in detail, how the stochastic optimiser works conceptually.\n",
+    "We'll set up this HP optimisation task using `cmlkit`. You don't have to understand every line in this in detail, as this is somewhat beyond the focus of this tutorial. If you'd like to learn more, take a look at the [`cmlkit` tutorial]( https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/cmlkit.ipynb ). That tutorial will also explain, in detail, how the stochastic optimiser works conceptually.\n",
     "\n",
     "In simple terms, below, we're:\n",
     "\n",
@@ -1214,10 +1257,11 @@
     "    }\n",
     "}\n",
     "search = cmlkit.tune.Hyperopt(space, method=\"tpe\")\n",
+    "\n",
     "evaluator = cmlkit.tune.TuneEvaluatorHoldout(\n",
     "    train=\"data/krr4mat/nmd18_hpo_train\", test=\"data/krr4mat/nmd18_hpo_test\", target=\"fe\", lossf=\"rmse\"\n",
     ")\n",
-    "search = cmlkit.tune.Hyperopt(space, method=\"tpe\")\n",
+    "\n",
     "run = cmlkit.tune.Run(\n",
     "    search=search,\n",
     "    evaluator=evaluator,\n",
@@ -1234,7 +1278,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "At this point, I have to make a confession: We will not actually let this optimisation run, for two reasons. One is purely technical: The `cmlkit` optimisation infrastructure is not built for interactive use, it expects to be run from the command line. The other one is performance related: As we've discussed in the beginning, the `nmd18` dataset is quite challenging, and this optimisation would take awkwardly long to run. (The above takes about an hour on 40 cores.)\n",
+    "At this point, I have to admit that we won't actually let this optimisation run, for two reasons. One is purely technical: The `cmlkit` optimisation infrastructure is not built for interactive use, it expects to be run from the command line. The other one is performance related: As we've discussed in the beginning, the `nmd18` dataset is quite challenging, and this optimisation would take awkwardly long to run. (The above takes about an hour on 40 cores.)\n",
     "\n",
     "We'll therefore simply jump ahead and load the result of executing the above."
    ]
@@ -1264,7 +1308,7 @@
    },
    "outputs": [],
    "source": [
-    "model = run.state.evals.top_suggestions(n=1)[0]\n",
+    "model = run.state.evals.top_suggestions(n=1)[0]  # get the model with the lowest error on the validation set\n",
     "model"
    ]
   },
@@ -1272,16 +1316,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's now see how this model performs on the original splits! We'll load the data, instantiate the model from the dictionary description we've obtained above, and then train it. This will take some time, execute the cells below, and then have a look at these tasks:\n",
-    "\n",
-    "\n",
-    "πŸ‘: Find the `cmlkit` tutorial on the NOMAD Analytics Toolkit and try to figure out what precisely the `space` defined above means and how the optimisation actually worked under the hood.\n",
-    "\n",
-    "πŸ‘: Take a look at the [repository](https://gitlab.mpcdf.mpg.de/nomad-lab/krr4mat) for this tutorial and locate the `data/krr4mat/` which contains the scripts used to prepare the datasets and execute the HP optimisation run. Can you figure out what is going on? Have a look at the folder containing the actual optimisation run.\n",
+    "πŸ‘: Find the [`cmlkit` tutorial](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/cmlkit.ipynb) on the NOMAD Analytics Toolkit and try to figure out what precisely the `space` defined above means and how the optimisation actually worked under the hood.\n",
     "\n",
     "πŸ€”: Can you think of possible improvements to our HP optimisation strategy here? How would you evaluate your better strategy against this one? Can you think of any problem using, for instance, the final test set loss as comparison?\n",
     "\n",
-    "### Evaluating the results\n"
+    "### Evaluating the results\n",
+    "\n",
+    "Let's now see how this model performs on the original splits! We'll load the data, instantiate the model from the dictionary description we've obtained above, and then train it.\n"
    ]
   },
   {
@@ -1297,7 +1338,8 @@
    "source": [
     "train = cmlkit.load_dataset(\"data/krr4mat/nmd18_train\")\n",
     "test = cmlkit.load_dataset(\"data/krr4mat/nmd18_test\")\n",
-    "model = cmlkit.from_config(model)\n",
+    "\n",
+    "model = cmlkit.from_config(model, context={\"cache\": \"disk\"})  # we're using cached representations/kernel matrices\n",
     "model.train(train, target=\"fe\")"
    ]
   },
@@ -1327,7 +1369,7 @@
    },
    "outputs": [],
    "source": [
-    "loss = cmlkit.evaluation.get_loss(\"all_non_pv\")\n",
+    "loss = cmlkit.evaluation.get_loss([\"rmse\", \"mae\", \"r2\", \"rmsle\"])\n",
     "for k, v in loss(true, pred).items():\n",
     "    print(f\"{k}: {v:.5f}\")\n",
     "    \n",
@@ -1373,7 +1415,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This is looking pretty reasonable, but there appear to be some extreme outliers. Let's check them out:"
+    "This is looking pretty reasonable, but there appear to be some outliers. Let's check them out:"
    ]
   },
   {
@@ -1387,8 +1429,7 @@
    },
    "outputs": [],
    "source": [
-    "outliers = np.argwhere(np.abs(true-pred) > 0.1)\n",
-    "mask = np.abs(true-pred) <= 0.1\n",
+    "outliers = np.argwhere(np.abs(true-pred) > 0.1)  # find points with >100 meV absolute error\n",
     "for i in outliers:\n",
     "    print(f\"N_atoms: {test.info['atoms_by_system'][i][0]} / sg: {test.p['sg'][i][0]}\")\n"
    ]
@@ -1404,6 +1445,7 @@
    },
    "outputs": [],
    "source": [
+    "mask = np.abs(true-pred) <= 0.1\n",
     "sbn.jointplot(x=1000*true[mask], y=1000*pred[mask], kind=\"reg\", xlim=(lim_min, lim_max), ylim=(lim_min, lim_max))\n",
     "plt.xlabel(\"DFT FE in meV/cation\")\n",
     "plt.ylabel(\"ML FE in meV/cation\")\n",
@@ -1419,7 +1461,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Without these outliers, things look a lot more orderly -- but we don't yet have a clear reason why they occur, so we can't just exlude them. One hint might be the the fact that `N_atoms=10` is clearly under-represented in the training set. Let's see what happens if we exlude these points:"
+    "Without these outliers, things look a lot more orderly -- but we don't yet have a clear reason why they occur, so we can't just exclude them. One hint might be the the fact that `N_atoms=10` is clearly under-represented in the training set. Let's see what happens if we exclude these points:"
    ]
   },
   {
@@ -1515,7 +1557,9 @@
     "It appears that relatively simple modifications of the kinds of data points we're willing to predict can make quite a significant difference in the losses. \n",
     "\n",
     "βœ‹: Play around with the plots above. How else can you slice the test set? How many points are you excluding?\n",
-    "πŸ‘: Have a look at [Sutton *et al.*  (2020)](https://pure.mpg.de/pubman/faces/ViewItemOverviewPage.jsp?itemId=item_3164134) a pre-print that introduces a methodology for finding these \"domains of applicability\" in a programmatic way.\n",
+    "\n",
+    "πŸ‘: Have a look at [Sutton *et al.*  (2020)](https://doi.org/10.1038/s41467-020-17112-9) a paper that introduces a methodology for finding these \"domains of applicability\" in a programmatic way.\n",
+    "\n",
     "πŸ€”: Consider how you would have to modify this approach to fit the bandgap energy."
    ]
   },
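+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a starting point for the βœ‹ task above, here is a minimal sketch that slices the test set by system size and spacegroup. It only assumes that `true`, `pred` and `test` from the cells above are still in scope; the fields it uses mirror the outlier analysis earlier."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# rough starting point for the βœ‹ task above -- slice the test set and recompute the error\n",
+    "import numpy as np\n",
+    "\n",
+    "n_atoms = np.array(test.info[\"atoms_by_system\"])  # number of atoms per test structure\n",
+    "sg = np.array(test.p[\"sg\"])  # spacegroup per test structure\n",
+    "\n",
+    "def rmse(mask):\n",
+    "    return np.sqrt(np.mean((true[mask] - pred[mask]) ** 2))\n",
+    "\n",
+    "print(f\"all points: n={len(true)}, rmse={rmse(np.ones(len(true), dtype=bool)):.5f}\")\n",
+    "print(f\"without N_atoms=10: n={np.sum(n_atoms != 10)}, rmse={rmse(n_atoms != 10):.5f}\")\n",
+    "for s in np.unique(sg):\n",
+    "    print(f\"only sg={s}: n={np.sum(sg == s)}, rmse={rmse(sg == s):.5f}\")"
+   ]
+  },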
@@ -1523,11 +1567,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "πŸ€”: Consider how you would have to modify this approach to fit the bandgap energy.\n",
-    "\n",
     "### Caveats\n",
     "\n",
-    "For completeness, we should discuss a few caveats to these, somewhat haphazardly obtained, results. Re-running the optimisation with a different random training set of the same size can lead to quite different results, which would not \"win\" the Kaggle challenge. This nicely illustrates that everything we're doing here is stochastic: Single results should not be taken too seriously. Also, I've picked the search space and HP optimisation strategy having quite a bit of experience with this dataset, including the test portion, so rigorously speaking, the \"test\" set is not actually unseen. Our \"win\" is therefore purely for demonstration purposes. 😜 The same reasoning applies to our \"domain of applicability\" attempts above: If we use the test set error to find it, we can't reasonably report the results as obtained on that test set. We should have done this analysis on a split of the training set, and then used it, *without looking at the results first* on the final test set.\n",
+    "For completeness, we should discuss a few caveats to these results. Re-running the optimisation with a different random training set of the same size can lead to quite different results, which would not \"win\" the Kaggle challenge. This nicely illustrates that everything we're doing here is stochastic: Single results should not be taken too seriously. Also, I've picked the search space and HP optimisation strategy having quite a bit of experience with this dataset, including the test portion, so rigorously speaking, the \"test\" set is not actually unseen. Our \"win\" is therefore purely for demonstration purposes.\n",
+    "\n",
+    "The same reasoning applies to our \"domain of applicability\" attempts above: If we use the test set error to find it, we can't reasonably report the results as obtained on that test set. We should have done this analysis on a split of the training set, and then used it, *without looking at the results first* on the test set.\n",
     "\n",
     "A more thorough way of comparing models or model classes against each is to generate learning curves (plotting errors vs trainin set size), and re-run the analysis with multiple randomised splits in order to assess the statistical fluctuations in the results. This is, however, computationally quite challenging for expensive models and datasets, and we therefore don't do it here. For an impression of how this can look, have a peek a the papers mentioned in the \"Representation\" section above.\n",
     "\n",
@@ -1538,16 +1582,13 @@
     "\n",
     "If this has piqued your interest, I would recommend browsing the [NOMAD Analytics Toolkit](https://analytics-toolkit.nomad-coe.eu/home/) for additional tutorials on other techniques and applications. If you have further questions, please feel free to drop me a line at langer@fhi-berlin.mpg.de!\n",
     "\n",
-    "Have a good day! πŸ‘‹"
+    "Have a good day! πŸ‘‹\n",
+    "\n",
+    "### Acknowledgements\n",
+    "\n",
+    "I'd like to thank Matthias Rupp, Luca Ghiringelli, and Luigi SbailΓ² for their assistance and feedback."
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "code",
    "execution_count": null,
-- 
GitLab