"To show the trade off between coverage of the DA and the error reduction within, controlled by the $\\gamma$ value, a DA analysis is performed for $\\gamma$ values ranging from 0.2 to 0.66. The following code generates all these files. \n",
"To show the trade off between coverage of the DA and the error reduction within, controlled by the $\\gamma$ value, a DA analysis is performed for $\\gamma$ values ranging from 0.2 to 0.66. The following code generates all these files. \n",
"\n",
"\n",
"**WARNING: The next cell will take a long time to run (more than 2h). Therefore the results have been pre-calculated and stored, so that running with the default setting can be skipped.**"
"**WARNING: The next cell will take a long time to run (more than 2h). Therefore the results have been pre-calculated and stored, so that running with the default setting can be skipped. Please uncomment the following cell only if you wish to repeat the whole calculation.**"
Although machine learning (ML) models promise to substantially accelerate the discovery of novel materials, their performance is often still insufficient to draw reliable conclusions. Improved ML models are therefore actively researched, but their design is currently guided mainly by monitoring the average model test error. This can render different models indistinguishable although their performance differs substantially across materials, or it can make a model appear generally insufficient while it actually works well in specific sub-domains. Here we present a method, based on subgroup discovery, for detecting domains of applicability (DA) of models within a materials class. The utility of this approach is demonstrated by analyzing three state-of-the-art ML models for predicting the formation energy of transparent conducting oxides. We find that, despite having a mutually indistinguishable and unsatisfactory average error, the models have DAs with distinctive features and notably improved performance.
%% Cell type:markdown id: tags:
The materials of interest are represented as vectors in a vector space $X$ according to some chosen representation. The coordinates $x_i$ could represent features of the material, e.g., bond distances, lattice parameters, or elemental composition. The ML models try to predict a target property $y$ with minimal error according to some loss function. In our case, $y$ is the formation energy. For this example, three ML models have been used. Specifically, kernel-ridge-regression models were trained using three different descriptors of the atomic structures: <a href="https://arxiv.org/abs/1704.06439" target="_blank">MBTR</a>, <a href="https://arxiv.org/abs/1502.01366" target="_blank">SOAP</a>, and <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">$n$-gram</a>. Additionally, calculations were performed with a simple representation containing only atomic properties, which is expected to produce much larger errors. All of this data is compiled into ``data.csv``.
%% Cell type:markdown id: tags:
A DA is defined by a selector function $\sigma: X \rightarrow \{true, false\}$, which describes a conjunction of inequality constraints on the coordinates $x_i$. Thus, these selectors describe intersections of axis-parallel half-spaces, resulting in simple convex regions in $X$. This makes it possible to systematically reason about the described sub-domains (e.g., it is easy to determine their differences and overlap) and also to sample novel points from them. These domains are found through subgroup discovery (SGD), maximizing the impact on the model error. This impact is defined by the product of selector coverage and the error reduction within, i.e.:
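One plausible form of this objective, consistent with the description above (the exact definition used by the SGD code is specified in its settings file and in the paper), is

$$
\mathrm{impact}(\sigma) \;=\; \left(\frac{n_\sigma}{n}\right)^{\gamma}\left(1-\frac{\bar{\varepsilon}_\sigma}{\bar{\varepsilon}}\right),
$$

where $n_\sigma/n$ is the coverage (the fraction of data points selected by $\sigma$), $\bar{\varepsilon}$ and $\bar{\varepsilon}_\sigma$ are the model errors on all data and within the selected sub-domain, respectively, and the exponent $\gamma$ weights coverage against error reduction (see the Settings section below).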
Let us first demonstrate the concept of DA with a synthetic example. We consider a simple two-dimensional representation consisting of independent features $x_1$ and $x_2$ that are each distributed according to a normal distribution with mean 0 and variance 2 ($N(0,2)$) and a target property $y$ that is a 3rd degree polynomial in $x_1$ with an additive noise component that scales exponentially in $x_2$:
That is, the $y$ values are almost determined by the third-degree polynomial for low $x_2$ values but are almost completely random for high $x_2$ values. Discovering domains of applicability reveals how differently models cope with this setting even if they have a comparable average error. To show this, let us examine the average error obtained from three different kernelized regression models.
First, the data for $n$ points is generated in the form of numpy arrays.
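A minimal sketch of such a data-generation step is shown below; the specific cubic polynomial, noise scale, and number of points are illustrative assumptions and may differ from the values actually used in the notebook.

%% Cell type:code id: tags:

``` python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# two independent features, each drawn from N(0, 2)
x1 = rng.normal(0.0, np.sqrt(2.0), n)
x2 = rng.normal(0.0, np.sqrt(2.0), n)

# target: a cubic in x1 plus noise that grows exponentially with x2
# (coefficients and noise scale are illustrative assumptions)
y = x1**3 - 2.0 * x1 + 0.1 * np.exp(x2) * rng.normal(0.0, 1.0, n)

X = np.column_stack([x1, x2])
```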
Then, we use the sklearn library to fit our data with a linear, a Gaussian (radial basis function, RBF), and a polynomial kernel. Our original data as well as the predicted values for each kernel are stored in the ``example_df`` data frame.
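A sketch of this fitting step, assuming scikit-learn's ``KernelRidge`` with illustrative hyperparameters and the arrays from the data-generation sketch above (the estimator settings actually used in the notebook may differ):

%% Cell type:code id: tags:

``` python
import pandas as pd
from sklearn.kernel_ridge import KernelRidge

# three kernelized regression models; alpha/gamma/degree values are illustrative
kernels = {
    "linear": KernelRidge(kernel="linear", alpha=1.0),
    "rbf": KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5),
    "poly": KernelRidge(kernel="poly", degree=3, alpha=1.0),
}

example_df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
for name, model in kernels.items():
    model.fit(X, y)
    example_df[f"y_pred_{name}"] = model.predict(X)
```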
To demonstrate the concept of DA, we will choose the domains for these models ourselves and compare the reduction of average error by restricting the models to those domains.
When using the linear kernel, the resulting linear model is globally incapable of tracing the variation of the 3rd-order polynomial except for a small stripe of $x_1$ values close to 0, where it can be approximated well by a linear function. Consequently, there is a very high error globally that is substantially reduced in the applicability domain described by:
When using the Gaussian kernel, the resulting radial basis function model is able to represent the target property well locally unless (a) the noise component is too large and (b) the variation of the target property is too high relative to the number of training points. The second restriction arises because the radial basis functions (RBF) have non-negligible values only within a small region around the training examples. Consequently, the DA is not only restricted in the $x_2$-direction but also excludes high absolute $x_1$-values:
In contrast, when using the non-local 3rd-degree polynomial kernel, data sparsity does not prevent an accurate modeling of the target property along the $x_1$-axis. However, this non-locality is counterproductive along the $x_2$-axis, where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low $x_2$-values. This is reflected in the identified applicability domain, which contains no restriction in the $x_1$-direction but excludes both high and low $x_2$-values. This highlights an important structural difference between the RBF and the polynomial model that is not reflected in their similar average errors:
For each model, the mean absolute error (MAE) is calculated globally and within its respective domain of applicability. Additionally, the coverage of the DA as a fraction of the total data is presented.
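The bookkeeping behind such a comparison can be sketched as follows; the boolean selectors below are hypothetical domain boundaries chosen for illustration, not the ones used in the notebook.

%% Cell type:code id: tags:

``` python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# hypothetical DA selectors for each model, expressed as boolean masks on example_df
selectors = {
    "linear": (example_df["x1"].abs() < 0.5) & (example_df["x2"] < 1.0),
    "rbf": (example_df["x1"].abs() < 2.0) & (example_df["x2"] < 1.0),
    "poly": example_df["x2"].abs() < 1.0,
}

for name, mask in selectors.items():
    pred = example_df[f"y_pred_{name}"]
    global_mae = mae(example_df["y"], pred)
    da_mae = mae(example_df.loc[mask, "y"], pred[mask])
    coverage = mask.mean()
    print(f"{name}: global MAE = {global_mae:.3f}, DA MAE = {da_mae:.3f}, coverage = {coverage:.2f}")
```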
It is apparent that the accuracy for different models is drastically improved in their respective domains, which differ with the strengths and weaknesses of each model.
Equipped with the concept of applicability domains, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of TCOs have been realized, because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Centre of Excellence (NOMAD) and hosted by the online platform Kaggle, using a dataset of 3,000 $(Al_x Ga_y In_z)_2 O_3$ sesquioxides spanning six different spacegroups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment defined by the specific lattice structure. <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">Sutton <i>et al.</i>, npj Comput. Mater. (2019)</a>
Our aim is to demonstrate the ability of the proposed DA analysis (i) to differentiate the performance of models based on different representations of the local atomic information of each structure and (ii) to identify sub-domains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art representations MBTR, SOAP, and the $n$-gram representation. As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by the compositions. Since this representation is oblivious to configurational disorder (i.e., the many distinct structures that are possible at a given composition), it is expected to perform poorly across all spacegroups and concentrations.
%% Cell type:markdown id: tags:
## Settings
First, some global variables for the DA analysis need to be established. The data for this analysis is given in ``data.csv``.
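A sketch of what such a settings cell could contain is shown below. The directory layout, the model names, and the helper ``get_dirs_glob`` are assumptions for illustration; only the names ``base_path``, ``data_path``, and ``models`` are taken from the code that follows.

%% Cell type:code id: tags:

``` python
import glob
import os
import shutil

import pandas as pd

base_path = os.getcwd()                       # directory of this notebook
data_path = os.path.join(base_path, "data")   # location of data.csv and per-model folders (assumed)
models = ["mbtr", "soap", "ngram", "atomic"]  # assumed model/directory names

# full data set with features, target and per-model errors (column layout assumed)
df = pd.read_csv(os.path.join(data_path, "data.csv"))

def get_dirs_glob(model):
    # all cross-validation split directories of a given model (assumed layout)
    return sorted(glob.glob(os.path.join(model, "split_*")))
```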
The following functions are used for pre- and postprocessing of the data and do not need to be studied in detail; compact ways of using them are explained further below.
Here, $\gamma$ is the parameter that determines the relative importance of coverage and error reduction in the impact objective introduced above.
The value of $\gamma$ is set to 0.5 by default, but it can be changed with the slider below followed by a call to ``update_gamma()``, or directly by passing the value as a function argument, e.g. ``update_gamma(0.4)``. This sets the corresponding value in the file ``neg_mean_shift_abs_norm_error.json``, which serves as the settings file for the SGD.
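A minimal sketch of what ``update_gamma`` could look like, assuming the settings file is plain JSON and the trade-off parameter is stored under a top-level key (the actual key name expected by realKD is not shown here and is an assumption):

%% Cell type:code id: tags:

``` python
import json

def update_gamma(gamma=0.5):
    # write the coverage/error trade-off parameter into the SGD settings file
    settings_file = os.path.join(data_path, "neg_mean_shift_abs_norm_error.json")
    with open(settings_file) as f:
        settings = json.load(f)
    settings["gamma"] = gamma  # hypothetical key name
    with open(settings_file, "w") as f:
        json.dump(settings, f, indent=2)
```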
The feature space can be reduced with the function ``set_features(features)`` where ``features`` is the list of features from the data set that should be included in the calculation. To use all features, call ``set_features()``, i.e. the function without a parameter.
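One way such a helper could be implemented, assuming the selected features are kept in a module-level list that is consulted when the split files are written (the notebook's actual mechanism is not shown and may differ):

%% Cell type:code id: tags:

``` python
selected_features = None  # None means: use all features

def set_features(features=None):
    # restrict the feature space used in the DA search; call without arguments to use all features
    global selected_features
    if features is None:
        # assumed non-feature columns; adjust to the actual layout of data.csv
        selected_features = [c for c in df.columns if c not in ("target", "error")]
    else:
        selected_features = list(features)
```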
To prevent using the wrong data, old data should be removed before starting a new calculation. This can be done with ``rm_old_files(model)``, where ``model`` is the name of the ML model and also the name of the corresponding directory.
%% Cell type:code id: tags:
``` python
def rm_old_files(model):
    # remove the "output" directory of every cross-validation split of the given model
    os.chdir(data_path)
    splits = get_dirs_glob(model)
    for split in splits:
        try:
            shutil.rmtree(os.path.join(split, "output"))
        except FileNotFoundError:
            pass
    os.chdir(base_path)
```
%% Cell type:markdown id: tags:
### Splitting data into folds
The DA analysis uses $k$-fold cross-validation. In our case ($k=6$), the data is split into 6 folds and the calculation is repeated 6 times, with each fold serving once as the test set while the rest is used for training. Afterwards, the data of all runs is combined. ``split_data(model)`` generates the files and directories used for the actual analysis.
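A sketch of what ``split_data`` might do, assuming scikit-learn's ``KFold``, the ``df`` and ``data_path`` names from the settings sketch above, and a simple per-split directory layout with ``train.csv`` and ``test.csv`` files (the file names and directory structure are assumptions):

%% Cell type:code id: tags:

``` python
from sklearn.model_selection import KFold

def split_data(model, k=6):
    # write one directory per fold containing train/test CSV files for the SGD run
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for i, (train_idx, test_idx) in enumerate(kf.split(df)):
        split_dir = os.path.join(data_path, model, f"split_{i}")
        os.makedirs(split_dir, exist_ok=True)
        df.iloc[train_idx].to_csv(os.path.join(split_dir, "train.csv"), index=False)
        df.iloc[test_idx].to_csv(os.path.join(split_dir, "test.csv"), index=False)
```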
``run_analysis(model)`` runs the executable of the SGD code (namely, <a href="http://www.realkd.org/realkd-library/" target="_blank">realKD</a>, by Mario Boley) for each split to determine the domains of applicability using subgroup discovery.
Finally, the results for each split and each model need to be summarized by ``summarize_data()``. This returns a dictionary containing the global and DA errors, the coverage, and the R values.
Using the functions defined above, we can easily run the complete analysis for a $\gamma$ value of 0.5. This recreates the table in the <a href="https://th.fhi-berlin.mpg.de/site/uploads/Publications/s41467-020-17112-9.pdf" target="_top">paper</a>.
**Warning: running the next cell takes several minutes.**
%% Cell type:code id: tags:
``` python
# run the complete DA analysis for gamma = 0.5 and collect the results in a table
gamma = 0.5
update_gamma(gamma)
set_features()
for model in models:
    rm_old_files(model)
    split_data(model)
    run_analysis(model)
data_summary = summarize_data()
results_df = generate_table(data_summary, gamma)
results_df
```
%% Cell type:markdown id: tags:
## Coverage-effect trade-off
To show the trade-off between the coverage of the DA and the error reduction within it, controlled by the $\gamma$ value, a DA analysis is performed for $\gamma$ values ranging from 0.2 to 0.66. The following code generates all of these files.
**WARNING: The next cell takes a long time to run (more than 2 h). The results have therefore been pre-calculated and stored, so that this step can be skipped with the default settings. Uncomment the following cell only if you wish to repeat the whole calculation.**
%% Cell type:code id: tags:
``` python
# Uncomment to repeat the full gamma scan (runtime > 2 h); the results are pre-calculated.
# for gamma in np.linspace(0.2, 0.66, 16):
#     update_gamma(gamma)
#     for model in models:
#         rm_old_files(model)
#         run_analysis(model)
#     data_summary = summarize_data()
#     generate_table(data_summary, gamma)
```
%% Cell type:markdown id: tags:
Each analysis yields, for each model, one data point of coverage and relative error reduction. These data are compiled and displayed in a graph. It is apparent that bigger domains with a broader applicability show a less pronounced reduction in the mean absolute error.
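A plot of this kind could be assembled as sketched below, where ``trade_off_df`` is a hypothetical data frame with one row per model and $\gamma$ value, and the column names ``model``, ``coverage``, and ``rel_error_reduction`` are assumptions:

%% Cell type:code id: tags:

``` python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for model in models:
    # one point per gamma value for this model; column names are assumptions
    pts = trade_off_df[trade_off_df["model"] == model].sort_values("coverage")
    ax.plot(pts["coverage"], pts["rel_error_reduction"], "o-", label=model)
ax.set_xlabel("DA coverage")
ax.set_ylabel("relative reduction of the MAE inside the DA")
ax.legend()
plt.show()
```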