diff --git a/cmlkit.ipynb b/cmlkit.ipynb
index 5731ce3f7887ca0897a92dcca42dd5c34dcf5205..beabb2ec5a8391191d5449515b3007e6fa3f8bdd 100644
--- a/cmlkit.ipynb
+++ b/cmlkit.ipynb
@@ -15,7 +15,7 @@
     "\n",
     "***\n",
     "\n",
-    "Hello! 👋 Welcome to the [`cmlkit` 🐫🧰](https://marcel.science/cmlkit) tutorial. This tutorial will introduce you to the `cmlkit` python package, from its conceptual foundations and architecture, to its hyper-parameter tuning module, and concluding with an application: We will develop a machine learning model to [predict formation energies of candidate materials for transparent conducting oxides](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors) (Paper: [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3)). After completing this tutorial, you should be able to use `cmlkit` as a basis for your own experiments, and will have a solid understanding of stochastic, parallel hyper-parameter optimisation as implemented in `cmlkit.tune`.\n",
+    "Hello! 👋 Welcome to the [`cmlkit` 🐫🧰](https://marcel.science/cmlkit) tutorial. This tutorial will introduce you to the `cmlkit` Python package, from its conceptual foundations and architecture, to its hyper-parameter tuning module, and finally an application: We will develop a machine learning model to [predict formation energies of candidate materials for transparent conducting oxides](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors) (Paper: [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3)). After completing this tutorial, you should be able to use `cmlkit` as a basis for your own experiments, and will have a solid understanding of stochastic, parallel hyper-parameter optimisation as implemented in `cmlkit.tune`.\n",
     "\n",
     "### Prerequisites\n",
     "\n",
@@ -23,7 +23,7 @@
     "\n",
     "- Have some familiarity with Python 3,\n",
     "- Know a little bit about chemistry and/or physics,\n",
-    "- Know roughly how kernel ridge regression works.\n",
+    "- Know roughly how kernel ridge regression works, and a bit about machine learning in general.\n",
     "\n",
     "The contents of this tutorial will mostly be of interest to people researching the application of machine learning models to computational chemistry and computational condensed matter physics, and particularly those interested in building computational experiments and toolkits in that domain.\n",
     "\n",
@@ -35,7 +35,7 @@
     "- \"HP\": Hyper-parameters. These are the \"free\" parameters of a ML model, which aren't directly determined by training.\n",
     "- \"SOAP\": Smooth Overlap of Atomic Positions representation ([Bartók, Kondor, Csányi (2013)](https://doi.org/10.1103/PhysRevB.87.184115)).\n",
     "- \"MBTR\": Many-Body Tensor Representation [(Huo, Rupp (2017))](https://arxiv.org/abs/1704.06439).\n",
-    "- \"System\" or \"structure\": Either a molecule or a periodic system, i.e. \"molecule or material\".\n",
+    "- \"System\" or \"structure\": Either a molecule, some other finite system, or a periodic system.\n",
     "\n",
     "🏁 Let's get started. 🏁\n",
     "\n",
@@ -83,7 +83,7 @@
     "f(data, lots, of, other, parameters, ...)\n",
     "```\n",
     "\n",
-    "Instead of always passing around these parameters, we create the component `c` with these parameters, and then simply call `c(data)` instead of an extremely long, unwield expression. (For the technically-minded: This is basically a way of creating [partials](https://docs.python.org/3.7/library/functools.html). Components are mostly \"fancy functions\".)\n",
+    "Instead of always passing around these parameters, we create the component `c` with these parameters, and then simply call `c(data)` instead of an extremely long, unwieldy expression. (For the technically-minded: This is basically a way of creating [partials](https://docs.python.org/3.7/library/functools.html). Components are mostly \"fancy functions\".)\n",
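+    "\n",
+    "For the curious, here is a minimal sketch of this idea in plain Python, using `functools.partial` (an illustration of the concept only, not the actual `cmlkit` `Component` machinery):\n",
+    "\n",
+    "```\n",
+    "from functools import partial\n",
+    "\n",
+    "def f(data, sigma, cutoff):\n",
+    "    # stand-in for an expensive transformation with many parameters\n",
+    "    return [x * sigma for x in data if x < cutoff]\n",
+    "\n",
+    "c = partial(f, sigma=1.0, cutoff=5.0)  # a \"component\": parameters baked in\n",
+    "c([0.5, 2.0, 7.0])  # same as f([0.5, 2.0, 7.0], sigma=1.0, cutoff=5.0)\n",
+    "```\n",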
Components are mostly \"fancy functions\".)\n", + "Instead of always passing around these parameters, we create the component `c` with these parameters, and then simply call `c(data)` instead of an extremely long, unwieldy expression. (For the technically-minded: This is basically a way of creating [partials](https://docs.python.org/3.7/library/functools.html). Components are mostly \"fancy functions\".)\n", "\n", "This is useful because the parts of a ML model can often be described in this way. For example, a *representation* is simply transformation of a set of coordinates into a vector, with a lot of parameters. Or a *kernel* in KRR is simply a function acting on representations. If we write down all the parameters for all the components of a model we can reconstruct it easily. And if there is no state, we can reconstruct it *exactly*!\n", "\n", @@ -132,7 +132,7 @@ "\n", "#### Caching\n", "\n", - "The `Component` concept also enables easy caching: Since `Components` produce output that is deterministic, we can store cached results with: a) the hash of the input, and b) the hash of the `Component`'s config. \n", + "The `Component` concept also enables easy caching: Since `Components` produce outputs that are deterministic, we can store cached results with: a) the hash of the input, and b) the hash of the `Component`'s config. \n", "\n", "In practice, to avoid computing costly hashes of large amounts of data, we go one step further: A `Dataset` is only hashed at the very beginning, and then we simply store hashes for all `Components` applied to it, sequentially. Since this is entirely deterministic, this \"history\" can serve as hash for the input at any point in the pipeline!\n", "\n", @@ -144,12 +144,12 @@ "\n", "### Parts 🌳\n", "\n", - "`cmlkit` implements `Components` for ML models that follow the representation + regressor pattern. Currently supported are:\n", + "`cmlkit` implements `Components` for ML models that follow the representation + regressor pattern. 
     "\n",
     "In practice, to avoid computing costly hashes of large amounts of data, we go one step further: A `Dataset` is only hashed at the very beginning, and then we simply store hashes for all `Components` applied to it, sequentially. Since this is entirely deterministic, this \"history\" can serve as a hash for the input at any point in the pipeline!\n",
     "\n",
@@ -144,12 +144,12 @@
     "\n",
     "### Parts 🌳\n",
     "\n",
-    "`cmlkit` implements `Components` for ML models that follow the representation + regressor pattern. Currently supported are:\n",
+    "`cmlkit` implements `Components` for ML models that follow the representation + regressor pattern. Currently supported representations and regression methods are:\n",
     "\n",
     "#### Representations\n",
     "- Many-Body Tensor Representation (MBTR) by [Huo, Rupp (2017)](https://arxiv.org/abs/1704.06439) (`qmmlpack` and `dscribe` implementations)\n",
     "- Smooth Overlap of Atomic Positions (SOAP) representation by [Bartók, Kondor, Csányi (2013)](https://doi.org/10.1103/PhysRevB.87.184115) (`quippy` and `dscribe` implementations)\n",
-    "- Symmetry Functions (SF) representation by [Behler (2011)](https://doi.org/10.1063/1.3553717) (`RuNNer` and `dscribe` implementation), with a semi-automatic parametrisation scheme taken from [Gastegger *et al.* (2018)](https://doi.org/10.1063/1.5019667).\n",
+    "- Symmetry Functions (SF) representation by [Behler (2011)](https://doi.org/10.1063/1.3553717) (`RuNNer` and `dscribe` implementations), with a semi-automatic parametrisation scheme taken from [Gastegger *et al.* (2018)](https://doi.org/10.1063/1.5019667)\n",
     "\n",
     "#### Regression methods\n",
     "- Kernel Ridge Regression (KRR) as implemented in [`qmmlpack`](https://gitlab.com/qmml/qmmlpack) (supporting both global and local/atomic representations)\n",
@@ -331,6 +331,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "Observe that `model.representation` comes from a module outside of `cmlkit` itself: it comes from the [`cscribe`](https://github.com/sirmarcel/cscribe) plugin, which provides an interface to [`dscribe`](https://github.com/SINGROUP/dscribe/), which in turn provides alternative implementations of common representations.\n",
+    "\n",
     "So! To see how it works, we'll now train the model and predict some energies on our toy dataset.\n",
     "\n",
     "We train the model on the formation energy (\"`fe`\"):"
@@ -398,7 +400,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Whelp! We did not do too well! But this is not surprising -- we've trained on only 80 points. Let's compute some popular loss functions in any case, just so we know how to do it."
+    "Whelp! We did ... not too well? But, given that we've trained on only 80 points, it's not too bad. \n",
+    "\n",
+    "Let's put this into quantitative terms and compute some popular loss functions:"
    ]
   },
   {
@@ -420,6 +424,28 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "(RMSE=root mean squared error, MAE=mean absolute error, MAXAE=maximum absolute error, R2=[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) squared, RMSLE=root mean squared log error, the error used in the Sutton *et al.* (2019) paper. RMSE is an upper bound for MAE, and more sensitive to outliers.)\n",
+    "\n",
+    "For context, let's check the standard deviation of the `true` energy values:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(true.std())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Very roughly speaking, we should aim for an MAE, or even better an RMSE, of ~10% or less of the standard deviation. There is still room for improvement!\n",
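+    "\n",
+    "As a quick check of this rule of thumb (a sketch: it assumes that `loss(true, pred)` from the evaluation above returns a dict keyed by loss name):\n",
+    "\n",
+    "```\n",
+    "losses = loss(true, pred)  # assumed to be a dict like {\"mae\": ..., \"rmse\": ...}\n",
+    "print(losses[\"mae\"] / true.std())   # aim for roughly 0.1 or below\n",
+    "print(losses[\"rmse\"] / true.std())  # the same check with the stricter metric\n",
+    "```\n",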
+    "\n",
+    "***\n",
+    "\n",
     "In essence, this concludes the \"core\" tutorial -- you now know the basics of how `cmlkit` is supposed to work, you can load models, and you know how to train and predict. At this point, you're in an excellent position to take a look at the [repository](https://github.com/sirmarcel/cmlkit) and take it from there!\n",
     "\n",
     "(But of course, then you'd be missing...)\n",
@@ -614,7 +640,9 @@
     "\n",
     "### Running an Optimisation\n",
     "\n",
-    "With this basic understanding under our belt, we can now run an actual optimisation:"
+    "With this basic understanding under our belt, we can now run an actual optimisation:\n",
+    "\n",
+    "(Don't be alarmed by the red output background.)"
    ]
   },
   {
@@ -624,7 +652,8 @@
    "ExecuteTime": {
     "end_time": "2020-03-04T23:15:18.636243Z",
     "start_time": "2020-03-04T23:15:16.119707Z"
-    }
+    },
+    "scrolled": false
   },
   "outputs": [],
   "source": [
@@ -954,7 +983,8 @@
   "source": [
    "true = test.pp(\"fe\", per=\"cation\")\n",
    "loss = cmlkit.evaluation.get_loss(\"rmse\", \"rmsle\", \"mae\", \"r2\")\n",
-    "loss(true, pred)"
+    "print(loss(true, pred))\n",
+    "print(f\"\\n\\nFor context: std of true values is {true.std():.3f}!\")"
    ]
   },
   {
@@ -963,6 +993,8 @@
   "source": [
    "Congratulations! We've gotten close to the top of the Kaggle 2018 challenge. (For the full results and discussion, please see [Sutton *et al.* (2019)](https://doi.org/10.1038/s41524-019-0239-3).)\n",
    "\n",
+    "(Also, remember our results from the beginning? This time, we didn't rely on a pre-tuned model; we tuned all the parameters from scratch and substantially *improved* on all the metrics. Nice!)\n",
+    "\n",
    "## Next steps ☀️\n",
    "\n",
    "And with this, we're at the end of this tutorial. Well done!\n",
diff --git a/metainfo.json b/metainfo.json
index 5293463c3265289e36e48bd83cdb912b9ec6c9cc..8cd86ac80930bb3e268d3f50f13ccba552d8b79d 100644
--- a/metainfo.json
+++ b/metainfo.json
@@ -1,6 +1,6 @@
 {
   "authors": [
-    "Langer, Marcel"
+    "Langer, Marcel F."
   ],
   "email": "langer@fhi-berlin.mpg.de",
   "title": "cmlkit: Toolkit for Machine Learning in Computational Condensed Matter Physics and Quantum Chemistry",