<div id="teaser" style=' background-position: right center; background-size: 0px; background-repeat: no-repeat;

padding-top: 20px;

padding-right: 10px;

padding-bottom: 170px;

padding-left: 10px;

border-bottom: 14px double #333;

border-top: 14px double #333;' >

<div style="text-align:center">

<b><font size="6.4">Identifying Domains of Applicability of Machine Learning Models for Materials Science</font></b>

</div>

<p>

<b> Notebook designed and created by: </b> Mohammad-Yasin Arif, Luigi Sbailò, and Luca Ghiringhelli. <i>Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, D-14195 Berlin, Germany</i><br>

<p>

<b> This notebook allows you to reproduce results from the paper:</b>

C. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, and M. Scheffler, Identifying domains of applicability of machine learning models for materials science. Nat. Commun. 11, 4428 (2020) [<a href="https://th.fhi-berlin.mpg.de/site/uploads/Publications/s41467-020-17112-9.pdf" target="_top">PDF</a>]

<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: January 27, 2021]</span>

Although machine learning (ML) models promise to substantially accelerate the discovery of novel materials, their performance is often still insufficient to draw reliable conclusions. Improved ML models are therefore actively researched, but their design is currently guided mainly by monitoring the average model test error. This can render different models indistinguishable although their performance differs substantially across materials, or it can make a model appear generally insufficient while it actually works well in specific sub-domains. Here we present a method, based on subgroup discovery, for detecting domains of applicability (DA) of models within a materials class. The utility of this approach is demonstrated by analyzing three state-of-the-art ML models for predicting the formation energy of transparent conducting oxides. We find that, despite having a mutually indistinguishable and unsatisfactory average error, the models have DAs with distinctive features and notably improved performance.

%% Cell type:markdown id: tags:

The materials of interest are represented as vectors in a vector space $X$ according to some chosen representation. The coordinates $x_i$ could represent features of the material, e.g., bond distances, lattice parameters, elemental composition, etc. The ML models try to predict a target property $y$ with minimal error according to some loss function. In our case, $y$ is the formation energy. For this example, three ML models have been used. Specifically, kernel-ridge-regression models were trained by using three different descriptors of the atomic structures: <a href="https://arxiv.org/abs/1704.06439" target="_blank">MBTR</a>, <a href="https://arxiv.org/abs/1502.01366" target="_blank">SOAP</a>, and <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">$n$-gram</a>. Additionally, calculations were performed on a simple representation containing just atomic properties, which is expected to produce much larger errors. All of this data is compiled into ``data.csv``.

%% Cell type:markdown id: tags:

A DA is defined by a function $\sigma: X \rightarrow \{true, false\}$, which describes a series of inequality constraints on the coordinates $x_i$. Thus, these selectors describe intersections of axis-parallel half-spaces resulting in simple convex regions in $X$. This makes it possible to reason systematically about the described sub-domains (e.g., it is easy to determine their differences and overlap) and also to sample novel points from them. These domains are found through subgroup discovery (SGD), maximizing the impact on the model error. This impact is defined by the product of selector coverage and the error reduction within, i.e.:
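A formula consistent with this description is the following sketch; the normalization of the error reduction by the global error $\epsilon$ is an assumption here (see the paper for the exact objective function used):

$$\mathrm{impact}(\sigma) \;=\; \underbrace{\frac{n_\sigma}{n}}_{\text{coverage}} \times \underbrace{\frac{\epsilon - \epsilon_\sigma}{\epsilon}}_{\text{error reduction}}\,,$$

where $n_\sigma$ is the number of data points selected by $\sigma$, $n$ the total number of points, $\epsilon$ the global model error, and $\epsilon_\sigma$ the error restricted to the subgroup.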

Let us first demonstrate the concept of DA with a synthetic example. We consider a simple two-dimensional representation consisting of independent features $x_1$ and $x_2$ that are each distributed according to a normal distribution with mean 0 and variance 2 ($N(0,2)$) and a target property $y$ that is a 3rd degree polynomial in $x_1$ with an additive noise component that scales exponentially in $x_2$:
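One concrete instance matching this description (the exact polynomial coefficients used in the notebook may differ) is:

$$x_1, x_2 \sim N(0, 2), \qquad y = x_1^3 + \varepsilon\, e^{x_2}, \quad \varepsilon \sim N(0, 1).$$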

That is, the $y$ values are almost determined by the third degree polynomial for low $x_2$ values but are almost completely random for high $x_2$ values. Discovering applicable domains reveals how different models cope differently with this setting even if they have a comparable average error. To show this, let us examine the average error obtained from three different kernelized regression models.

First, the data for $n$ points is generated in the form of numpy arrays.
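The generation step can be sketched as follows; the cubic target $y = x_1^3$, the unit-variance noise, and the choice of $n$ are illustrative assumptions:

```python
# Sketch of the synthetic data generation described above.
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# x1, x2 ~ N(0, 2): numpy takes the standard deviation,
# so variance 2 corresponds to scale sqrt(2)
x1 = rng.normal(0.0, np.sqrt(2.0), n)
x2 = rng.normal(0.0, np.sqrt(2.0), n)

# target: 3rd-degree polynomial in x1 plus noise scaling exponentially in x2
y = x1**3 + rng.normal(0.0, 1.0, n) * np.exp(x2)
```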

Then, we use the sklearn library to fit our data with a linear, a Gaussian (radial basis function, RBF), and a polynomial kernel. Our original data as well as the predicted values for each kernel are stored in the ``example_df`` data frame.
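A minimal sketch of this fitting step with scikit-learn's `KernelRidge` is shown below; the data is regenerated to keep the block self-contained, and the hyperparameters (`alpha`, `gamma`, `degree`) are illustrative guesses rather than the notebook's actual settings:

```python
# Fit kernel ridge regression with three kernels and collect predictions.
import numpy as np
import pandas as pd
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, np.sqrt(2.0), n)
x2 = rng.normal(0.0, np.sqrt(2.0), n)
y = x1**3 + rng.normal(0.0, 1.0, n) * np.exp(x2)
X = np.column_stack([x1, x2])

example_df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
models = {
    "linear": KernelRidge(kernel="linear", alpha=1.0),
    "rbf": KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5),
    "poly": KernelRidge(kernel="polynomial", degree=3, alpha=1.0),
}
for name, model in models.items():
    model.fit(X, y)
    example_df[f"y_pred_{name}"] = model.predict(X)
```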

To demonstrate the concept of DA, we will choose the domains for these models ourselves and compare the reduction of average error by restricting the models to those domains.

When using the linear kernel, the resulting linear model is globally incapable of tracing the variation of the 3rd-order polynomial, except for a small stripe around $x_1$ values close to 0, where it can be approximated well by a linear function. Consequently, there is a very high error globally that is substantially reduced in the applicability domain described by:

When using the Gaussian kernel, the resulting radial basis function model is able to represent the target property well locally unless (a) the noise component is too large and (b) the variation of the target property is too high relative to the number of training points. The second restriction is because the radial basis functions (rbf) have non-negligible values only within a small region around the training examples. Consequently, the DA is not only restricted in $x_2$-direction but also excludes high absolute $x_1$-values:

In contrast, when using the non-local 3rd degree polynomial kernel, data sparsity does not prevent an accurate modeling of the target property along the $x_1$-axis. However, this non-locality is counterproductive along the $x_2$-axis, where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low $x_2$-values. This is reflected in the identified applicability domain, which contains no restriction in $x_1$-direction, but excludes both high and low $x_2$-values. This highlights an important structural difference between the rbf and the polynomial model that is not reflected in their similar average errors:

For each model, the mean absolute error (MAE) is calculated both globally and in its respective domain of applicability. Additionally, the coverage of the DA as a fraction of the total data is presented.
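This evaluation can be sketched as follows; the selector thresholds, the stand-in predictions, and all column names are assumptions for illustration only:

```python
# Compare global MAE against MAE inside a hand-chosen domain of
# applicability, and report the DA's coverage.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"x1": rng.normal(0, np.sqrt(2), n),
                   "x2": rng.normal(0, np.sqrt(2), n)})
df["y"] = df["x1"]**3 + rng.normal(0, 1, n) * np.exp(df["x2"])
df["y_pred"] = df["x1"]**3          # stand-in for a model's predictions

abs_err = (df["y"] - df["y_pred"]).abs()
global_mae = abs_err.mean()

# example selector: low-noise region with moderate |x1|
in_da = (df["x2"] < 0.5) & (df["x1"].abs() < 2.0)
da_mae = abs_err[in_da].mean()
coverage = in_da.mean()             # fraction of all points inside the DA

print(f"global MAE = {global_mae:.3f}, DA MAE = {da_mae:.3f}, "
      f"coverage = {coverage:.2f}")
```

Because the noise scale grows with $x_2$, restricting to low $x_2$ values yields a markedly smaller MAE than the global one, at the cost of reduced coverage.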

It is apparent that the accuracy for different models is drastically improved in their respective domains, which differ with the strengths and weaknesses of each model.

Equipped with the concept of applicability domains, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of TCOs have been realized because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Centre of Excellence (NOMAD) and hosted by the on-line platform Kaggle using a dataset of 3,000 $(Al_x Ga_y In_z)_2 O_3$ sesquioxides, spanning six different spacegroups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment that is defined by the specific lattice structure. <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">Sutton <i>et al.</i>, npj Comput. Mater. (2019)</a>

Our aim is to demonstrate the ability of the proposed DA analysis (i) to differentiate the performance of models based on different representations of the local atomic information of each structure and (ii) to identify sub-domains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art MBTR, SOAP, and $n$-gram representations. As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by the compositions. Since this representation is oblivious to configurational disorder (i.e., the many distinct structures that are possible at a given composition), it is expected to perform poorly across all spacegroups and concentrations.

%% Cell type:markdown id: tags:

## Settings

First, some global variables for the DA analysis need to be established. The data for this analysis is given in ``data.csv``.

The following are functions used for pre- and postprocessing of the data and don't need to be studied in detail. Compact methods to use them are explained further below.