%% Cell type:markdown id: tags:
# `cmlkit`: <br />Toolkit for Machine Learning in Computational Condensed Matter Physics and Quantum Chemistry
# *A Tutorial Introduction*
by Marcel F. Langer, langer@fhi-berlin.mpg.de
<i>Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany</i>
**Related publication: [Langer, Goeßmann, Rupp (2020)](http://marcel.science/repbench/)**
***
Hello! 👋 Welcome to the [`cmlkit` 🐫🧰](https://marcel.science/cmlkit) tutorial. This tutorial will introduce you to the `cmlkit` python package, from its conceptual foundations and architecture to its hyper-parameter tuning module, and finally to an application: we will develop a machine learning model to [predict formation energies of candidate materials for transparent conducting oxides](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors) (Paper: [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3)). After completing this tutorial, you should be able to use `cmlkit` as a basis for your own experiments, and you will have a solid understanding of stochastic, parallel hyper-parameter optimisation as implemented in `cmlkit.tune`.
### Prerequisites
You will get the most out of this tutorial if you:
- Have some familiarity with Python 3,
- Know a little bit about chemistry and/or physics,
- Know roughly how kernel ridge regression works, and know a bit about machine learning in general.
For an introduction to kernel ridge regression, you can take a look at [this tutorial](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/krr4mat.ipynb), which contains everything you need to know!
The contents of this tutorial will mostly be of interest to people researching the application of machine learning models to computational chemistry and computational condensed matter physics, and in particular to those interested in building computational experiments and toolkits in that domain.
I'll use the following abbreviations and terms throughout:
- "Representation": A way to transform the coordinates and atomic numbers (i.e. the "structure") of a molecule or material into a vector.
- "ML": Machine learning, here, we can basically substitute it with "fitting" or "interpolation".
- "KRR": Kernel ridge regression, a regression (i.e. "fitting") method.
- "HP": Hyper-parameters. These are the "free" parameters of a ML model, which aren't directly determined by training.
- "SOAP": Smooth Overlap of Atomic Positions representaton ([Bartók, Kondor, Csányi (2013)](https://doi.org/10.1103/PhysRevB.87.184115)).
- "MBTR": Many-Body Tensor Representation [(Huo, Rupp (2017))](https://arxiv.org/abs/1704.06439).
- "System" or "structure": Either a molecule, other finite system, or a periodic system.
🏁 Let's get started. 🏁
## What is `cmlkit`?
According to the [Github repository](https://github.com/sirmarcel/cmlkit/):
> `cmlkit` is an extensible `python` package providing clean and concise infrastructure to specify, tune, and evaluate machine learning models for computational chemistry and condensed matter physics. Intended as a common foundation for more specialised systems, not a monolithic user-facing tool, it wants to help you build your own tools! ✨
In this tutorial, I will give an overview of the things `cmlkit` can do, and hopefully make a case for why it is useful to you as a base to build your own projects on. (You can find a more compact feature list on the Github page, if that's more your kind of thing. This is the scenic route. 🌆)
At this point, it's most useful if we go on a little `cmlkit` tour. Let's saddle up! 🐫
## A Tour of `cmlkit`
Conceptually, the "core" of `cmlkit` consists of:
1. A `Component` class and sub-classes that define how parts of ML models should be implemented,
2. An engine that defines a way to specify parametrisations of these components in a special format ("configs"), and
3. A plugin system to easily extend this architecture with custom components.
On top of this, `cmlkit` then provides:
1. Reference interfaces to commonly used representations of molecules and materials,
2. A reference interface to `qmmlpack` for fast kernel ridge regression, and
3. Tools to combine these into ML models.
Additionally, `cmlkit` also contains a hyper-parameter tuning engine based on `hyperopt`.
### Core 🌱
#### Components
In `cmlkit`, a `Component` is an object that:
- Needs to be instantiated with a lot of complex parameters,
- Has little to no internal state (i.e. once you create it, it doesn't change), and
- Does one thing, deterministically (i.e. given the same input, it always produces the same output).
In most cases, we can simply think of a `Component` as a function with a lot of parameters. Something along the lines of:
```python
f(data, lots, of, other, parameters, ...)
```
Instead of always passing around these parameters, we create the component `c` with these parameters, and then simply call `c(data)` instead of an extremely long, unwieldy expression. (For the technically-minded: This is basically a way of creating [partials](https://docs.python.org/3.7/library/functools.html). Components are mostly "fancy functions".)
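To make the "fancy function" idea concrete, here is a tiny, self-contained sketch of the partial pattern in plain Python (no `cmlkit` involved; the function and parameter names are made up):
```python
from functools import partial

def transform(data, scale, offset):
    # stand-in for an expensive transformation with many parameters
    return [scale * x + offset for x in data]

# fix the parameters once...
c = partial(transform, scale=2.0, offset=1.0)

# ...and from now on, only the data needs to be passed around
print(c([0.0, 1.0, 2.0]))  # -> [1.0, 3.0, 5.0]
```
A `Component` works the same way, but additionally knows how to describe its own parameters as a config, which is what the next section is about.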
This is useful because the parts of a ML model can often be described in this way. For example, a *representation* is simply a transformation of a set of coordinates into a vector, with a lot of parameters. Or a *kernel* in KRR is simply a function acting on representations. If we write down all the parameters for all the components of a model, we can reconstruct it easily. And if there is no state, we can reconstruct it *exactly*!
#### Configs
`Components` can always be described as specially formatted dictionaries, called *configs*. Having a canonical, *non-code* format forces the parametrisation to be separated from the code implementing the models. It also means that everything can be easily serialised, and remains human-readable. This is surprisingly useful, as we'll see later.
The config format is:
```
{"kind": { ... }}
```
Where `kind` is a unique name for the type of component being instantiated (essentially, the class name) and the "inner" dictionary contains all arguments that would normally be passed into the `__init__` call.
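For instance, the Gaussian kernel function that shows up in the model below would be described by a config like this (a minimal sketch; `cmlkit.from_config` is the same call we use on full models later in this tutorial):
```python
import cmlkit

# "gaussian" is the kind; the inner dict holds the __init__ arguments
config = {"gaussian": {"ls": 90.50966799187809}}

# look up the class registered under "gaussian" and instantiate it with these arguments
kernelf = cmlkit.from_config(config)
```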
This format is designed to print nicely to [`yaml`](https://en.wikipedia.org/wiki/YAML), a simple data-serialization language. Here is, for instance, the entire `config` of a SOAP+KRR model:
```yaml
model:
  per: cell
  regression:
    krr:
      kernel:
        kernel_atomic:
          kernelf:
            gaussian:
              ls: 90.50966799187809
      nl: 9.5367431640625e-07
  representation:
    ds_soap:
      cutoff: 3
      elems: [8, 13, 31, 49]
      l_max: 8
      n_max: 2
      rbf: gto
      sigma: 0.5
```
Calling `cmlkit.from_yaml()` on this produces an instance of `Model` with KRR as the regression method and the SOAP representation (as implemented in `dscribe`), with these parameters. We'll return to this later.
#### Plugins
This deserialisation scheme lends itself to a simple plugin system: Since at some point, `cmlkit` has to look up the classes associated with the `kind` string in the config dict, we can simply add external classes to this lookup table.
In practice, `cmlkit` plugins are python packages that, at module level, have a `components` attribute which is a list of classes (sub-classes of `Component`) to be included in the registry. `cmlkit` knows about these through an environment variable, `CML_PLUGINS`. It's rudimentary -- but it works!
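As a rough sketch of what such a plugin package could look like (the package and class names here are made up; the `cscribe` plugin used later in this tutorial is a real example):
```python
# my_plugin/__init__.py  (hypothetical plugin package)
from cmlkit.engine import Component

class MyRepresentation(Component):
    kind = "my_representation"  # the name used in configs

    def __call__(self, data):
        ...  # compute the representation here

    def _get_config(self):
        return {}

# cmlkit collects everything listed here when CML_PLUGINS=my_plugin is set
components = [MyRepresentation]
```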
#### Caching
The `Component` concept also enables easy caching: since `Components` produce deterministic outputs, we can store cached results keyed on a) the hash of the input and b) the hash of the `Component`'s config.
In practice, to avoid computing costly hashes of large amounts of data, we go one step further: a `Dataset` is only hashed at the very beginning, and then we simply store hashes for all `Components` applied to it, sequentially. Since this is entirely deterministic, this "history" can serve as a hash for the input at any point in the pipeline!
At the moment, caching support is quite experimental, but it can be enabled by passing `{"cache": "disk"}` into the context of a `Component`. Please keep in mind that this can quickly lead to large amounts of data being stored to disk, so proceed carefully! You'll see the caching system in action later.
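In practice, that looks like the call used at the end of this tutorial, where the context is passed along when instantiating a `Component` from its config:
```python
# enable experimental disk caching by passing a context when deserialising the config
model = cmlkit.from_config(config, context={"cache": "disk"})
```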
#### Data
The "history" tracking for data as it is transformed by `Components` is implemented in a `cmlkit`-internal `Data` parent class, which complements the `Component` concept: `Components` are (mostly) pure functions acting on `Data` objects. Check out the `data` module in `cmlkit` for details!
### Parts 🌳
`cmlkit` implements `Components` for ML models that follow the representation + regressor pattern. Currently supported representations and regression methods are:
#### Representations
- Many-Body Tensor Representation (MBTR) by [Huo, Rupp (2017)](https://arxiv.org/abs/1704.06439) (`qmmlpack` and `dscribe` implementations)
- Smooth Overlap of Atomic Positions (SOAP) representation by [Bartók, Kondor, Csányi (2013)](https://doi.org/10.1103/PhysRevB.87.184115) (`quippy` and `dscribe` implementations)
- Symmetry Functions (SF) representation by [Behler (2011)](https://doi.org/10.1063/1.3553717) (`RuNNer` and `dscribe` implementations), with a semi-automatic parametrisation scheme taken from [Gastegger *et al.* (2018)](https://doi.org/10.1063/1.5019667)
#### Regression methods
- Kernel Ridge Regression (KRR) as implemented in [`qmmlpack`](https://gitlab.com/qmml/qmmlpack) (supporting both global and local/atomic representations)
***
And this brings us to the end of the tour of the `cmlkit` core. Let's get a bit more *practical*!
%% Cell type:markdown id: tags:
## Loading Data and Model Basics
In this chapter, we'll take a look at how to load datasets, how to make subsets, and how to define, train and evaluate models. We'll do all of this on a small toy subset of the `nmd18u` dataset, which was created for the ["Nomad2018 Predicting Transparent Conductors" Kaggle challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors) (Paper: [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3)). It consists of 3000 Al<sub>x</sub>Ga<sub>y</sub>In<sub>z</sub> oxides, and we'll predict their formation energy. Having learned these basics, we'll then move on to hyper-parameter optimisation for the same problem!
Let's start by finally importing `cmlkit`!
%% Cell type:code id: tags:
``` python
import cmlkit
cmlkit.caches.location = "data/cmlkit/cache" # usually, this would be set using an environment variable
```
%% Cell type:markdown id: tags:
Nice!
### Datasets
Let's now load some training data. `cmlkit` allows you to set a path (using the `CML_DATASET_PATH` environment variable) where it'll automatically look for named datasets, so you can load a dataset by simply giving its name. A `Dataset` is simply a container that collects the geometries of many systems (molecules or crystals) and the properties we're trying to interpolate in one place. (If you're interested in creating your own `Datasets`, please have a look at the appendix of this tutorial!)
We'll now load the training set for the Kaggle challenge, and look at what `cmlkit` can tell us about the dataset.
%% Cell type:code id: tags:
``` python
nmd18_train = cmlkit.load_dataset("data/cmlkit/nmd18_train")
print(nmd18_train.report)
# "fe" is short for formation energy, "bg" for bandgap, "sg" for spacegroup number
```
%% Cell type:markdown id: tags:
Nice. We'll now generate our toy datasets for this chapter. `cmlkit.utility.threeway_split` is a convenience function to randomly draw two subsets, and also return the rest. (`_` is a python idiom that just means "we don't care about this variable".)
%% Cell type:code id: tags:
``` python
_, idx_toy_train, idx_toy_test = cmlkit.utility.threeway_split(nmd18_train.n, 80, 20)
toy_train = cmlkit.Subset.from_dataset(nmd18_train, idx=idx_toy_train)
toy_test = cmlkit.Subset.from_dataset(nmd18_train, idx=idx_toy_test)
print(f"N_train = {toy_train.n}")
print(f"N_test = {toy_test.n}")
```
%% Cell type:markdown id: tags:
Now that we have some data, let's see how we can get a model trained!
### Loading and Training a Model
Here is a model config with all hyper-parameters already tuned. It consists of the SOAP representation, which transforms the atomic positions in the neighbourhood of a central atom into a vector that is used as a *feature* for a kernel ridge regression model. The main component of the KRR model is the *kernel*, which computes a "distance" between two structures. In this case, we use an *atomic* kernel, which first compares all atoms in the two structures, and then sums up the results to obtain a system-system kernel matrix.
(This particular model comes from the [repbench](https://marcel.science/repbench) paper, its HPs were tuned for the `nmd18u` dataset at `Ntrain=1600`.)
%% Cell type:code id: tags:
``` python
config = {
    "model": {
        "per": "cell",  # the model will internally be trained with energies per unit cell;
                        # this is needed because SOAP is a local representation, so the KRR model is extensive
        "regression": {
            "krr": {
                "centering": False,  # we do not subtract the mean from labels or kernel matrices
                "kernel": {
                    "kernel_atomic": {
                        "kernelf": {"gaussian": {"ls": 22.627416997969522}},  # gaussian kernel
                        "norm": False,  # we don't normalise the kernel by atom counts
                                        # (needed to predict intensive properties)
                    }
                },
                "nl": 9.5367431640625e-07,  # regularisation parameter for KRR
            }
        },
        "representation": {
            "ds_soap": {  # we use the SOAP implemented in dscribe
                "cutoff": 3,  # 3 angstrom cutoff radius
                "elems": [8, 13, 31, 49],  # elements in the dataset
                "l_max": 5,  # number of angular basis functions
                "n_max": 3,  # number of radial basis functions
                "rbf": "gto",  # radial basis: Gaussians
                "sigma": 0.3535533905932738,  # broadening
            }
        },
    }
}
model = cmlkit.from_config(config)
print(model)
```
%% Cell type:markdown id: tags:
This is the `config` mechanism at work! We've turned a dictionary into a whole hierarchy of objects, without having to import anything or write any additional code. We're even using a class from outside `cmlkit`!
%% Cell type:code id: tags:
``` python
print(model.representation) # <- not implemented in cmlkit, but in the cscribe plugin!
print(model.regression)
print(model.regression.kernel)
print(model.regression.kernel.kernelf)
```
%% Cell type:markdown id: tags:
Observe that `model.representation` comes from a module outside of `cmlkit` itself: it is provided by the [`cscribe`](https://github.com/sirmarcel/cscribe) plugin, which offers an interface to [`dscribe`](https://github.com/SINGROUP/dscribe/), which in turn provides alternative implementations of common representations.
So! To see how it works, we'll now train the model and predict some energies on our toy dataset.
We train the model on the formation energy ("`fe`"):
%% Cell type:code id: tags:
``` python
model.train(toy_train, target="fe")
```
%% Cell type:markdown id: tags:
Well, that was not too difficult! Let's predict something.
%% Cell type:code id: tags:
``` python
pred = model.predict(toy_test, per="cation")
print(pred)
```
%% Cell type:markdown id: tags:
Okay... what is the ground truth?
%% Cell type:code id: tags:
``` python
true = toy_test.pp("fe", per="cation") # pp means "property per"
print(true)
```
%% Cell type:markdown id: tags:
Whelp! We did ... not too well? But, given that we've trained on only 80 points, it's not too bad.
Let's put this into quantitative terms and compute some popular loss functions:
%% Cell type:code id: tags:
``` python
loss = cmlkit.evaluation.get_loss("rmse", "mae", "maxae", "r2", "rmsle")
print(loss(true, pred))
```
%% Cell type:markdown id: tags:
(RMSE=root mean squared error, MAE=mean absolute error, MAXAE=maximum absolute error, R2=[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) squared, RMSLE=root mean squared log error, the error used in the Sutton *et al.* (2019) paper. RMSE is an upper bound for MAE, and more sensitive to outliers.)
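For reference, here is a small `numpy` sketch of the usual definitions of these losses (just the plain formulas for illustration, not how `cmlkit.evaluation` implements them):
```python
import numpy as np

def losses(true, pred):
    true, pred = np.asarray(true), np.asarray(pred)
    err = pred - true
    return {
        "rmse": np.sqrt(np.mean(err ** 2)),          # root mean squared error
        "mae": np.mean(np.abs(err)),                 # mean absolute error
        "maxae": np.max(np.abs(err)),                # maximum absolute error
        "r2": np.corrcoef(true, pred)[0, 1] ** 2,    # squared Pearson correlation
        "rmsle": np.sqrt(np.mean((np.log1p(pred) - np.log1p(true)) ** 2)),  # root mean squared log error
    }
```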
For context, let's check the standard deviation of the `true` energy values:
%% Cell type:code id: tags:
``` python
print(true.std())
```
%% Cell type:markdown id: tags:
Very roughly speaking, we should aim for ~10% or less of the standard deviation for our MAE, or even better, the RMSE. There is still room for improvement!
***
In essence, this concludes the "core" tutorial -- you now know the basics of how `cmlkit` is supposed to work, you can load models, and you know how to train and predict. At this point, you're in an excellent position to take a look at the [repository](https://github.com/sirmarcel/cmlkit) and take it from there!
(But of course, then you'd be missing...)
## Hyper-parameter Optimisation
We'll conclude the tutorial with the topic of hyper-parameter optimisation. Most state-of-the-art ML methods, representations and regression methods alike, have a number of free parameters that have a large impact on their performance, and no immediate way of setting these parameters. One approach, which we're exploring here, is to do it empirically: We pick some loss function (i.e. a quantitative measure of model quality) and minimise it using a global optimisation method. In this part of the tutorial, you'll first learn how such an optimiser is implemented in `cmlkit`, and finally, we'll explore the results of running such a search for `nmd18u`.
### Exploring the Optimiser
At its core, the `cmlkit.tune` module is an asynchronous, stochastic optimiser based on [`hyperopt`](https://github.com/hyperopt/hyperopt). It essentially allows you to specify a probability distribution over possible parameter choices, draws from that distribution, finds out which combinations work best, and then updates the distribution to focus further on promising areas (if the "Tree-Structured Parzen Estimators" method is used; otherwise it's a random search). This type of method is most appropriate for high-dimensional search spaces, and can also deal with "tree structured" problems, where some parameters are only needed depending on a prior choice. (For a more formal introduction, please see [Bergstra *et al.* (2011)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization).)
Let's first explore the mechanics of the `cmlkit.tune` module. We'll start by building a toy problem:
```
Let f(x, y, z) = 2*(x*y - 2.0)**2 + (z - 1)**2 ;
Find x, y, z in [0, 2] to minimise f.
```
In this case, `f` is our loss function, and `x, y, z in [0, 2]` is our parameter search space.
The search space is simply a dictionary with special "tokens" that tell the system to use a "uniform" prior for "variable name" from "start value" to "end value":
%% Cell type:code id: tags:
``` python
space = {"x": ["hp_uniform", "var_x", 0, 2],
"y": ["hp_uniform", "var_y", 0, 2],
"z": ["hp_uniform", "var_z", 0, 2],}
```
%% Cell type:markdown id: tags:
Therefore, this `space` means that we would like to uniformly sample x, y and z from zero to two. (If you happen to be familiar with `hyperopt`, we are literally just calling the `hp.` functions here!)
We implement `f`, our loss function, as a `Component` that follows the `Evaluator` interface:
%% Cell type:code id: tags:
``` python
class F(cmlkit.engine.Component):
    kind = "f"

    def __call__(self, model):
        x, y, z = model["x"], model["y"], model["z"]
        loss = 2*(x*y - 2.0)**2 + (z - 1)**2
        return {"loss": loss}

    def _get_config(self):
        return {}


cmlkit.register(F)  # registering the Component

evaluator = cmlkit.from_config({"f": {}})
print(evaluator)
```
%% Cell type:markdown id: tags:
Now we're basically all set to get going. Let's instantiate a `Search`, which will take care of the optimisation part of our task. Without any additional information, the search will simply return random samples from our search space:
%% Cell type:code id: tags:
``` python
search = cmlkit.tune.Hyperopt(space, method="tpe")
suggestions = [search.suggest() for i in range(50)]
for i in range(5):
    print(suggestions[i])
```
%% Cell type:markdown id: tags:
(The first part of each of these is simply a unique identifier for this particular "suggestion", which we can later use to inform the search of the loss we have obtained for this suggestion.)
Let's have a quick look at what the search has actually suggested by making a histogram of each of the variable suggestions:
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline
seaborn.set_context('notebook')
fig, axs = plt.subplots(nrows=1, ncols=3)
fig.tight_layout(pad=1.0)
seaborn.distplot([s[1]["x"] for s in suggestions], ax=axs[0], kde=False, bins=8)
seaborn.distplot([s[1]["y"] for s in suggestions], ax=axs[1], kde=False, bins=8)
seaborn.distplot([s[1]["z"] for s in suggestions], ax=axs[2], kde=False, bins=8)
```
%% Cell type:markdown id: tags:
So we see that the samples are (somewhat) uniformly spread across the search space. Let's now see how this changes if we return some information back to the search:
%% Cell type:code id: tags:
``` python
for i in range(50):
    if suggestions[i][1]["x"] < 0.5:
        loss = 0.0
    else:
        loss = 100.0

    search.submit(i, loss=loss)
```
%% Cell type:markdown id: tags:
We're now pretending that only "x" matters, passing a low loss only if `x < 0.5`. This will cause the search to focus on this region of the search space. Let's see it for ourselves!
%% Cell type:code id: tags:
``` python
new_suggestions = [search.suggest() for i in range(200)]
fig, axs = plt.subplots(nrows=1, ncols=3)
fig.tight_layout(pad=1.0)
seaborn.distplot([s[1]["x"] for s in new_suggestions], ax=axs[0], kde=False, bins=8)
seaborn.distplot([s[1]["y"] for s in new_suggestions], ax=axs[1], kde=False, bins=8)
seaborn.distplot([s[1]["z"] for s in new_suggestions], ax=axs[2], kde=False, bins=8)
```
%% Cell type:markdown id: tags:
Clearly, we have managed to shift the distribution towards `x < 0.5`.
### Running an Optimisation
With this basic understanding under our belt, we can now run an actual optimisation:
(Don't be alarmed by the red output background.)
%% Cell type:code id: tags:
``` python
search = cmlkit.tune.Hyperopt(space, method="tpe")  # reset the search

run = cmlkit.tune.Run(
    search=search,
    evaluator=F(),
    stop={"stop_max": {"count": 100}},  # specify the stopping mechanism: simply run until 100 trials
    context={"max_workers": 1},  # number of parallel workers
)

run.prepare()  # get ready...
run.run()  # let's go!
```
%% Cell type:markdown id: tags:
On the backend, a couple of notable things happened just now:
- We made a new `Run`, which gave itself a randomly assigned, human-readable name,
- It made a folder for itself, called `run_{name}` (that's what `.prepare` does),
- It ran the optimisation (essentially, getting suggestions, dispatching their evaluation to the workers, and submitting the results back to the `Search`),
- It stored the results in an internal database, and
- It wrote every step of the run into a linear record, the `tape`.
It also wrote out this `tape` into the folder as a starting point for continuing the run, and the top 5 results as `.yaml` files. (And a periodically updated `status.txt` file that you can easily `watch cat` on a remote system.)
First, let's have a look at the results:
%% Cell type:code id: tags:
``` python
run.state.evals.top_suggestions()
```
%% Cell type:code id: tags:
``` python
# or alternatively, we can just read from disk:
print(cmlkit.read_yaml(run.work_directory / "suggestion-0"))
# this is the current best result.
```
%% Cell type:markdown id: tags:
As you can see, we've done *alright* on our toy optimisation problem. `hyperopt` is really not built for such low-dimensional problems! But, as you may have suspected, winning the toy problem was not the real point...
Let's have a quick peek behind the curtain at the `tape`, the record of the optimisation:
%% Cell type:code id: tags:
``` python
run.state.tape.raw[0:4] # raw just gets us something subscriptable
```
%% Cell type:markdown id: tags:
We see that there are two types of entries:
- `"suggest"`: We get a suggestion from the `Search`.
- `"submit"`: We return a result to the `Search`.
If we replay them in order, we arrive at the exact same state of the optimisation. This is how we, for instance, restart a `Run`:
%% Cell type:code id: tags:
``` python
new = cmlkit.tune.Run.restore(
    run.work_directory,
    new_stop={"stop_max": {"count": 200}},  # we overwrite the stopping directive to run another 100 steps!
)
new.run() # let's go
```
%% Cell type:markdown id: tags:
Alright! At this point, we've seen most of what the `cmlkit.tune` module does. To end this tutorial, we will optimise an actual model, not a toy `Evaluator`.
%% Cell type:markdown id: tags:
### A Real Problem
We will use a search space for SOAP:
%% Cell type:code id: tags:
``` python
space = {
    "model": {
        "per": "cell",
        "regression": {
            "krr": {
                "kernel": {
                    "kernel_atomic": {
                        "norm": False,
                        "kernelf": {
                            "gaussian": {
                                "ls": ["hp_loggrid", "ls_start", -13, 13, 27]  # base-2 loggrid, i.e. 2**-13 to 2**13
                            }
                        },
                    }
                },
                "nl": ["hp_loggrid", "nl_start", -18, 0, 19],
            }
        },
        "representation": {
            "ds_soap": {
                "elems": [8, 13, 31, 49],
                "n_max": ["hp_choice", "n_max", [2, 3, 4, 5, 6, 7, 8]],
                "l_max": ["hp_choice", "l_max", [2, 3, 4, 5, 6, 7, 8]],
                "cutoff": ["hp_choice", "cutoff", [2, 3, 4, 5, 6, 7, 8, 9, 10]],
                "sigma": ["hp_loggrid", "sigma", -20, 6, 27],
                "rbf": "gto",
            }
        },
    }
}
```
%% Cell type:markdown id: tags:
We'll use a "simple" evaluator, `cmlkit.tune.TuneEvaluatorHoldout`, which uses predefined training and test sets to compute a loss, without cross-validation.
For parallelisation, these sets need to be saved to disk. In order to save us some time, I've already rolled these splits (from the original Kaggle challenge training set) and saved them:
%% Cell type:code id: tags:
``` python
print(cmlkit.load_dataset("data/cmlkit/nmd18_hpo_train").n)
print(cmlkit.load_dataset("data/cmlkit/nmd18_hpo_test").n)
evaluator = cmlkit.tune.TuneEvaluatorHoldout(
    train="data/cmlkit/nmd18_hpo_train", test="data/cmlkit/nmd18_hpo_test", target="fe", lossf="rmse"
)
```
%% Cell type:markdown id: tags:
We can also instantiate our Search!
%% Cell type:code id: tags:
``` python
search = cmlkit.tune.Hyperopt(space, method="tpe")
s = search.suggest()[1]
print(s)
```
%% Cell type:markdown id: tags:
Now, every sample from the `Search` is an entire `Model` config! We're ready to roll.
%% Cell type:code id: tags:
``` python
search = cmlkit.tune.Hyperopt(space, method="tpe")

run = cmlkit.tune.Run(
    search=search,
    evaluator=evaluator,
    stop={"stop_max": {"count": 400}},
    context={"max_workers": 2},
    name="nmd18_hpo",
)
```
%% Cell type:markdown id: tags:
Ok, I'll come clean -- we will not actually run this optimisation, to spare you the experience of waiting for a couple of hours. The `nmd18` dataset contains large structures with periodic boundary conditions, which makes calculations for any sizeable non-toy training/test set impractically slow for a tutorial like this. So, let's skip straight to the results!
%% Cell type:code id: tags:
``` python
run = cmlkit.tune.Run.checkout("data/cmlkit/run_nmd18_hpo")
```
%% Cell type:markdown id: tags:
By the way, this is how you access the results of already finished runs! You simply run `checkout`.
To end, let's check how our model would have done in the 2018 Kaggle challenge.
%% Cell type:code id: tags:
``` python
model_config = run.state.evals.top_suggestions()[0]
model = cmlkit.from_config(model_config, context={"cache": "disk"}) # we use pre-computed cached values for speed!
```
%% Cell type:code id: tags:
``` python
train = cmlkit.load_dataset("data/cmlkit/nmd18_train")
test = cmlkit.load_dataset("data/cmlkit/nmd18_test")
model.train(train, target="fe")
pred = model.predict(test, per="cation")
```
%% Cell type:code id: tags:
``` python
true = test.pp("fe", per="cation")
loss = cmlkit.evaluation.get_loss("rmse", "rmsle", "mae", "r2")
print(loss(true, pred))
print(f"\n\nFor context: std of true values is {true.std():.3f}!")
```
%% Cell type:markdown id: tags:
Congratulations! We've gotten close to the top of the Kaggle 2018 challenge. (For the full results and discussion, please see [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3).)
(Also, remember our results from the beginning? This time, we didn't rely on a pre-tuned model, we tuned all the parameters from scratch, and substantially *improved* on all the metrics. Nice!)
## Next steps ☀️
And with this, we're at the end of this tutorial. Well done!
As a next step, I would recommend taking a good look at the [repository](https://github.com/sirmarcel/cmlkit/). The readme essentially reiterates this tutorial, so you can skip it; instead, it is recommended to simply [browse the source code](https://github.com/sirmarcel/cmlkit/tree/master/cmlkit). Every submodule has a detailed explanation of its mechanics and interfaces.
If you're interested in how `cmlkit` is used in production, you can also take a look at [`repbench`]( https://marcel.science/repbench ), which is a benchmarking effort for multiple representations, and the origin of `cmlkit`. The repositories associated with the project will provide many examples of how to use `cmlkit` and what you can do with it.
Start building! 🌱
*festina lente*
%% Cell type:markdown id: tags:
<br /><br /><br />
Thanks to Luca Ghiringhelli, Xiaojuan Hu, Daniel Speckhard and Thomas Purcell for feedback on this tutorial.
The technical requirements for running this tutorial are:
- [`cmlkit`](https://marcel.science/cmlkit) with its dependencies (most notably `qmmlpack`)
- `cscribe` plugin for `cmlkit`
- `seaborn` (and `matplotlib`, etc.)
- `jupyter`
- For the `Dataset` appendix, also `nglview` and `ipywidgets` (have to be enabled for `jupyter`)
You also need the tutorial repository, including the data, and the environment variable `CML_PLUGINS=cscribe` to be set.
%% Cell type:markdown id: tags:
## Appendix: Datasets
Since you probably won't only be using pre-made datasets, here is a quick introduction to working with `cmlkit` `Datasets`. This is, however, only a short overview -- for details, please consult the `cmlkit` source code, which explains things in depth!
If you mostly want to know how to generate `Datasets`, you might also want to check out the [`repbench-datasets`](https://gitlab.com/repbench/repbench-datasets) repository, which contains many examples of how to convert different dataset formats to `cmlkit`.
### Overview
In essence, a `Dataset` collects two kinds of information: the *geometry* of a set of systems (molecules or periodic systems, i.e. materials or crystals), and scalar *properties* associated with each of these systems, such as formation energies, or bandgaps.
Under the hood, this information is stored as `numpy` arrays: The geometric information as arrays `z`, for atomic numbers, `r` for positions, and `b` for basis vectors (if periodic systems are used). Properties are also stored in `numpy` arrays, as values in a dictionary called `p`. Let's have a look at our toy dataset from above:
%% Cell type:code id: tags:
``` python
print(toy_test.z[0])
print(toy_test.r[0])
print(toy_test.b[0])
```
%% Cell type:markdown id: tags:
This gives us, in order: The elements in that structure, then the positions of each atom in the unit cell, then the three basis vectors. The properties here are bandgap (`bg`) and formation energy (`fe`) (as well as the spacegroup (`sg`), which we ignore for now):
%% Cell type:code id: tags:
``` python
print(toy_test.p.keys())
```
%% Cell type:markdown id: tags:
You can access these properties, normalised to various things, using the `pp` (property per) method:
%% Cell type:code id: tags:
``` python
print(toy_test.pp("fe", per="atom")) # fe per *atom*
print(toy_test.pp("fe", per="cell")) # fe per *unit cell*
print(toy_test.pp("fe", per="non_O")) # fe per *atom which is not Oxygen*
```
%% Cell type:markdown id: tags:
In addition, the `Dataset` provides various associated information:
%% Cell type:code id: tags:
``` python
print(toy_test.info.keys())
```
%% Cell type:markdown id: tags:
A `Dataset` can also have a *name*, which is used to load it from the file system via `cmlkit.load_dataset`.
%% Cell type:code id: tags:
``` python
print(toy_test.name)
print(nmd18_train.name)
```
%% Cell type:markdown id: tags:
You can generate your own `Dataset` by simply instantiating it with `z` (a `numpy` object array wrapping a `list`, each entry being a `numpy` array), `r` (same) and `b` (either `None` or a `numpy` array of basis vectors), and, if you want to train, also at least one property in `p`:
```python
my_dataset = cmlkit.dataset.Dataset(z=my_z, r=my_r, b=my_b, p={"my_prop": my_property}, name="my_name")
```
***
**One very important detail: `cmlkit` expects you to provide all properties normalised PER ATOM.** This is due to the built-in conversion facilities: you need one common basis to start from. You can also (though this is not encouraged) pass un-normalised properties, but then you must make sure that all models use `per=None` and that you always request properties with `per=None`, so that no conversion ever takes place. Otherwise, things will GO WRONG.
***
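As a sketch of what that means in practice (the `my_*` arrays are the same placeholders as above, and `my_total_fe` is a hypothetical array of *total* formation energies per structure):
```python
import numpy as np

# normalise a total (per-structure) property to per-atom before building the Dataset
n_atoms = np.array([len(z) for z in my_z])
fe_per_atom = my_total_fe / n_atoms

my_dataset = cmlkit.dataset.Dataset(z=my_z, r=my_r, b=my_b, p={"fe": fe_per_atom}, name="my_name")
```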
Then, you will usually want to save it:
```python
my_dataset.save(directory="/my/dataset/storage/path")
```
Currently, `cmlkit` also provides the `Subset` class which can be used to generate subsets of a "parent" dataset. We have used it above to make the train and test splits, for example. (In the future, this class will probably be deprecated in favour of only using `Dataset`, with a convenience function to generate subsets.) It can be used like this (`idx` being the indices you want to include in the `Subset`):
```python
my_subset = cmlkit.dataset.Subset.from_dataset(my_dataset, idx=[1, 2, 4, ...], name="my_subset")
```
### `ase`
The `Dataset` class also features a (very minimal, at the moment) interface to `ase`, consisting of two methods:
```
dataset.as_Atoms()  # => return geometries in the dataset as a list of Atoms objects
dataset.from_Atoms(list_of_Atoms, p={...})  # => use geometries in Atoms objects to make a dataset
```
You can use this, for example, to visualise structures using `ase`, or to use the many file formats supported by `ase`!
%% Cell type:code id: tags:
``` python
from ase.visualize import view
atoms = toy_test.as_Atoms()[0]
view(atoms, viewer="nglview")
```
%% Cell type:markdown id: tags:
Please note that currently, the `ase` interface does not know about `Calculators`, and so will not, for example, automatically extract energies from the `Atoms`. You have to pre-process these things yourself!
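For instance, a sketch of going from an `ase`-readable file to a `Dataset` might look like this (assuming `from_Atoms` is reachable on the `Dataset` class as the listing above suggests, and that per-atom energies can be derived from `Atoms.info["energy"]` -- adapt this to wherever your data actually lives):
```python
import numpy as np
import ase.io

# read structures with ase (any format ase.io supports)
frames = ase.io.read("my_structures.extxyz", index=":")

# extract and normalise the target property ourselves -- no Calculator is consulted
fe_per_atom = np.array([atoms.info["energy"] / len(atoms) for atoms in frames])

my_dataset = cmlkit.dataset.Dataset.from_Atoms(frames, p={"fe": fe_per_atom})
```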
%% Cell type:code id: tags:
``` python
```