Commit 52ce3593 authored by Luca Massimiliano Ghiringhelli's avatar Luca Massimiliano Ghiringhelli

Update domain_of_applicability.ipynb

parent ca4465a0
@@ -1335,10 +1335,10 @@
"When using the linear kernel, the resulting linear model is globally incapable to trace the variation of the 3rd order polynomial except for a small stripe around $x_1$ values close to 0 where it can be approximated well by a linear function. Consequently, there is a very high error globally that is substantially reduced in the applicability domain described by: \n",
"$\\sigma_{lin}(x_1, x_2) \\equiv -0.3 \\le x_1 \\le 0.3$\n",
"\n",
"When using the Gaussian kernel, the resulting radial basis function model is able to represent the target property well locally unless (a) the noise component is too large and (b) the variation of the target property is too high relative to the number of training points. The second restriction is because the radial basis functions (rbf) have non-negligible values only within a small region around the training examples. Consequently, the DA is not only restricted in $x2$-direction but also excludes high absolute $x1$-values: \n",
"When using the Gaussian kernel, the resulting radial basis function model is able to represent the target property locally well, unless (a) the noise component is too large and (b) the variation of the target property is too high relative to the number of training points. The second restriction is because the radial basis functions (rbf) have non-negligible values only within a small region around the training examples. Consequently, the DA is not only restricted in $x_2$-direction but also excludes high absolute $x_1$-values: \n",
"$\\sigma_{rbf}(x_1,x_2) \\equiv -3.3 \\le x_1 \\le 3.1 \\wedge x_2 \\le 0.1$\n",
"\n",
"In contrast, when using the non-local 3rd degree polynomial kernel, data sparsity does not prevent an accurate modeling of the target property along the $x1$-axis. However, this non-locality is counter productive along the $x2$-axis where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low $x2$-values. This is reflected in the identified applicability domain, which contains no restriction in $x1$-direction, but excludes both high and low $x2$-values.This highlights an important structural difference between the rbf and the polynomial model that is not reflected in their similar average errors: \n",
"In contrast, when using the non-local 3rd degree polynomial kernel, data sparsity does not prevent an accurate modeling of the target property along the $x_1$-axis. However, this non-locality is counterproductive along the $x_2$-axis, where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low $x_2$-values. This is reflected in the identified applicability domain, which contains no restriction in the $x_1$-direction, but excludes both high and low $x_2$-values. This highlights an important structural difference between the rbf and the polynomial model that is not reflected in their similar average errors: \n",
"$\\sigma_{ply}(x_1,x_2) \\equiv -3.5 \\le x_2 \\le 0.1$"
]
},
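The kernel comparison above can be reproduced numerically. The following is a minimal, self-contained sketch (not the notebook's actual code): it assumes a synthetic target that is a 3rd-order polynomial in $x_1$ plus a noise term whose magnitude grows with $x_2$, fits kernel ridge regression with the three kernels by hand, and compares the global MAE with the MAE inside each of the applicability domains quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic setup (illustrative, not the notebook's exact data):
# a 3rd-order polynomial in x1 plus noise whose scale grows with x2.
X = rng.uniform(-5, 5, size=(400, 2))
y = X[:, 0] ** 3 + rng.normal(scale=np.exp(X[:, 1]))

def krr_predict(K, y, alpha=1.0):
    """Kernel ridge regression in-sample predictions: K (K + alpha I)^-1 y."""
    return K @ np.linalg.solve(K + alpha * np.eye(len(y)), y)

sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
kernels = {
    "lin": X @ X.T,                 # linear kernel
    "rbf": np.exp(-0.5 * sqdist),   # Gaussian kernel (illustrative width)
    "ply": (X @ X.T + 1.0) ** 3,    # 3rd degree polynomial kernel
}

# Applicability domains as described in the text above.
domains = {
    "lin": np.abs(X[:, 0]) <= 0.3,
    "rbf": (X[:, 0] >= -3.3) & (X[:, 0] <= 3.1) & (X[:, 1] <= 0.1),
    "ply": (X[:, 1] >= -3.5) & (X[:, 1] <= 0.1),
}

results = {}
for name, K in kernels.items():
    err = np.abs(krr_predict(K, y) - y)
    results[name] = (err.mean(), err[domains[name]].mean(), domains[name].mean())
    print(f"{name}: MAE global = {results[name][0]:9.2f}, "
          f"MAE in DA = {results[name][1]:9.2f}, "
          f"coverage = {results[name][2]:.2f}")
```

The exact numbers depend on the synthetic sample; the qualitative pattern is the point: each model's error drops markedly inside its own domain.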
@@ -1361,7 +1361,7 @@
"## Comparing the errors\n",
"\n",
"For each model, the mean absolute error (MAE) is calculated both globally and in its respective domain of applicability. Additionally, the coverage of the DA as a fraction of the total data is presented. \n",
"It is apparent that the accuracy for different models is drastically improved in their respective domains, which differ with the strengths and weaknesses of each model."
"It is apparent that the accuracy for different models is drastically improved in their respective domains, which differ according to the strengths and weaknesses of each model."
]
},
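The comparison described here can be packaged as a small helper. A hypothetical sketch (the function name and interface are my own, not the notebook's):

```python
import numpy as np

def da_report(y_true, y_pred, in_da):
    """MAE globally, MAE inside the applicability domain, and DA coverage.

    in_da is a boolean mask marking the points that satisfy the DA description.
    """
    abs_err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    mask = np.asarray(in_da, bool)
    return {
        "MAE_global": abs_err.mean(),
        "MAE_DA": abs_err[mask].mean(),
        "coverage": mask.mean(),
    }

# Toy example: the DA keeps the two well-predicted points out of four.
report = da_report([1.0, 2.0, 3.0, 4.0],
                   [1.1, 1.9, 5.0, 0.0],
                   [True, True, False, False])
print(report)  # MAE_global ~ 1.55, MAE_DA ~ 0.1, coverage = 0.5
```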
{
@@ -1454,9 +1454,9 @@
"source": [
"# Domains of applicability for TCO models\n",
"\n",
"Equipped with the concept of applicability domains, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of TCOs have been realized because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Centre of Excellence (NOMAD) and hosted by the on-line platform Kaggle using a dataset of 3,000 $(Al_x Ga_y In_z)_2 O_3$ sesquioxides, spanning six different spacegroups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment that is defined by the specific lattice structure.\n",
"Equipped with the concept of applicability domains, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of TCOs have been realized because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Centre of Excellence (NOMAD) and hosted by the online platform Kaggle using a dataset of 3,000 $(Al_x Ga_y In_z)_2 O_3$ sesquioxides, spanning six different spacegroups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment that is defined by the specific lattice structure. The details of the NOMAD-Kaggle competition are given in <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">Sutton <i>et al.</i>, npj Comput. Mater. (2019)</a>. \n",
"\n",
"Our aim is to demonstrate the ability of the proposed DA analysis to (i) differentiate the performance of models based on different representations of the local atomic information of each structure and (ii) to identify sub-domains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art representations of MBTR, SOAP, and the n-gram representation. As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by the compositions. Since this representation is oblivious to configurational disorder (i.e., many distinct structures that are possibleat a given composition), it is expected to perform poorly across all spacegroups and concentrations.\n",
"Our aim is to demonstrate the ability of the proposed DA analysis to (a) differentiate the performance of models based on different representations of the local atomic information of each structure and (b) to identify sub-domains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art MBTR, SOAP, and $n$-gram representations. As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by the compositions. Since this representation is oblivious to configurational disorder (i.e., many distinct structures that are possible at a given composition), it is expected to perform poorly across all spacegroups and concentrations.\n",
"\n"
]
},
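The composition-averaged benchmark representation can be made concrete. A hypothetical sketch (the function is my own illustration, not the notebook's featurization; the property values are standard atomic numbers and Pauling electronegativities):

```python
import numpy as np

# Per-element properties: (atomic number, Pauling electronegativity).
PROPS = {"Al": (13, 1.61), "Ga": (31, 1.81), "In": (49, 1.78), "O": (8, 3.44)}

def averaged_representation(x, y, z):
    """Composition-averaged features for (Al_x Ga_y In_z)2 O3 with x + y + z = 1.

    Every structure at the same composition maps to the same vector, so this
    representation cannot distinguish configurational disorder.
    """
    counts = {"Al": 2 * x, "Ga": 2 * y, "In": 2 * z, "O": 3}
    n_atoms = sum(counts.values())  # 5 atoms per formula unit
    return sum(
        (n / n_atoms) * np.asarray(PROPS[el], float) for el, n in counts.items()
    )

print(averaged_representation(1.0, 0.0, 0.0))  # pure Al2O3 -> [10.0, 2.708]
```

Because all structures at a given $(x, y, z)$ collapse to one feature vector, this representation can at best predict the mean formation energy per composition, which is exactly why it serves as a lower-bound benchmark.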
@@ -1868,11 +1868,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gamma value\n",
"### Setting the relative weight between coverage and error reduction\n",
"\n",
"As a reminder, the impact function, which the SGD maximizes to find DA is given by: \n",
"$\\mathrm{impact}(\\sigma) = \\left( \\frac{s}{k} \\right)^\\gamma \\left( \\frac{1}{k} \\sum\\limits^k_{i=1} l_i(f) - \\frac{1}{s} \\sum\\limits_{i \\in I(\\sigma)} l_i(f) \\right)^{1-\\gamma}$ \n",
"where $\\gamma$ determines the weight between the coverage and the error reduction terms. This value is 0.5 by default and can be changed with the slider above and calling ``update_gamma()`` or directly by setting the value as a function parameter, i.e. ``update_gamma(0.4)``. This sets the corresponding value in the file ``neg_mean_shift_abs_norm_error.json``, which serves as a settings file for the SGD."
"With the slider above, the value of $\\gamma$, which determines the relative weight between coverage and error reduction, can be set. \n",
"The impact function, which the SGD maximizes to find the DA, is given by: \n",
"$\\mathrm{impact}(\\sigma) = \\left( \\frac{s}{k} \\right)^\\gamma \\left( \\frac{1}{k} \\sum\\limits^k_{i=1} l_i(f) - \\frac{1}{s} \\sum\\limits_{i \\in I(\\sigma)} l_i(f) \\right)^{1-\\gamma}$ \n",
"The value of $\\gamma$ is 0.5 by default, but it can be changed with the slider above and a call to ``update_gamma()`` (next cell), or directly by passing the value as a function parameter, i.e., ``update_gamma(0.4)``. This sets the corresponding value in the file ``neg_mean_shift_abs_norm_error.json``, which serves as a settings file for the SGD."
]
},
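The impact function can be written directly in code. A minimal sketch (the function name and array-based interface are my own, not part of the SGD settings machinery):

```python
import numpy as np

def impact(losses, in_da, gamma=0.5):
    """impact(sigma) = coverage**gamma * error_reduction**(1 - gamma).

    losses -- per-sample losses l_i(f) over all k samples
    in_da  -- boolean mask of the s samples selected by sigma, i.e. I(sigma)
    """
    l = np.asarray(losses, float)
    mask = np.asarray(in_da, bool)
    coverage = mask.mean()                 # s / k
    reduction = l.mean() - l[mask].mean()  # global mean loss minus in-DA mean loss
    return coverage ** gamma * reduction ** (1 - gamma)

# Selecting the low-loss half of the data: coverage 0.5, error reduction 1.0.
print(impact([4.0, 2.0, 2.0, 0.0], [False, False, True, True]))  # sqrt(0.5) ~ 0.707
```

Note the trade-off $\gamma$ controls: at $\gamma \to 1$ the SGD favors large domains regardless of error, while at $\gamma \to 0$ it favors any small subgroup with low error.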
{
......