%% Cell type:markdown id: tags:
<img src="assets/domain_of_applicability/header.jpg" width="900">
%% Cell type:markdown id: tags:
<img style="float: left;" src="assets/domain_of_applicability/logo_MPG.png" width=150>
<img style="float: left; margin-top: -10px" src="assets/domain_of_applicability/logo_NOMAD.png" width=250>
<img style="float: left; margin-top: -5px" src="assets/domain_of_applicability/logo_HU.png" width=130>
%% Cell type:markdown id: tags:
# Introduction
Although machine learning (ML) models promise to substantially accelerate the discovery of novel materials, their performance is often still insufficient to draw reliable conclusions. Improved ML models are therefore actively researched, but their design is currently guided mainly by monitoring the average model test error. This can render different models indistinguishable although their performance differs substantially across materials, or it can make a model appear generally insufficient while it actually works well in specific sub-domains. Here we present a method, based on subgroup discovery, for detecting domains of applicability (DA) of models within a materials class. The utility of this approach is demonstrated by analyzing three state-of-the-art ML models for predicting the formation energy of transparent conducting oxides. We find that, despite having a mutually indistinguishable and unsatisfactory average error, the models have DAs with distinctive features and notably improved performance.
%% Cell type:markdown id: tags:
The materials of interest are represented as vectors in a vector space $X$ according to some chosen representation. The coordinates $x_i$ could represent features of the material, e.g., bond distances, lattice parameters, elemental composition, etc. The ML models try to predict a target property $y$ with minimal error according to some loss function. In our case, $y$ is the formation energy. For this example, three ML models have been used. Specifically, kernel-ridge-regression models were trained using three different descriptors of the atomic structures: <a href="https://arxiv.org/abs/1704.06439" target="_blank">MBTR</a>, <a href="https://arxiv.org/abs/1502.01366" target="_blank">SOAP</a>, and <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">$n$-gram</a>. Additionally, calculations were performed with a simple representation containing just atomic properties, which is expected to produce much larger errors. All of this data is compiled into ``data.csv``.
%% Cell type:markdown id: tags:
A DA is defined by a selector function $\sigma: X \rightarrow \{\mathrm{true}, \mathrm{false}\}$, which describes a series of inequality constraints on the coordinates $x_i$. Thus, these selectors describe intersections of axis-parallel half-spaces, resulting in simple convex regions in $X$. This makes it possible to systematically reason about the described sub-domains (e.g., it is easy to determine their differences and overlap) and also to sample novel points from them. These domains are found through subgroup discovery (SGD), maximizing the impact on the model error. This impact is defined as the product of the selector coverage and the error reduction within, i.e.:
$\mathrm{impact}(\sigma) = \left( \frac{s}{k} \right)^\gamma \left( \frac{1}{k} \sum\limits^k_{i=1} l_i(f) - \frac{1}{s} \sum\limits_{i \in I(\sigma)} l_i(f) \right)^{1-\gamma}$
where:
$k$... number of points in the data set
$s$... number of points in the DA set
$I(\sigma)$... set of selected indices ($I(\sigma) = \{i: 1\le i\le k, \sigma(x_i)= \mathrm{true}\}$)
$l_i$... loss function (e.g., squared error)
$f$... ML model (i.e., $f: X\rightarrow \mathbb{R}$)
$\gamma$... parameter governing the relative importance of coverage vs. error reduction, $0 < \gamma < 1$
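%% Cell type:markdown id: tags:
As a minimal illustration of this definition (not part of the analysis below), the following sketch evaluates the impact of a candidate selector given per-sample losses; the names ``losses``, ``selected``, and the toy numbers are purely hypothetical.
%% Cell type:code id: tags:
``` python
import numpy as np

def impact(losses, selected, gamma=0.5):
    # coverage**gamma * (global mean loss - mean loss inside the DA)**(1-gamma)
    losses = np.asarray(losses, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    k, s = len(losses), selected.sum()
    if s == 0:
        return 0.0
    error_reduction = losses.mean() - losses[selected].mean()
    if error_reduction <= 0:
        return 0.0  # the selector does not reduce the error
    return (s / k)**gamma * error_reduction**(1 - gamma)

# toy check: a selector covering the low-loss half of the data has a positive impact
toy_losses = np.array([0.1, 0.2, 0.1, 1.5, 2.0, 1.8])
impact(toy_losses, toy_losses < 1.0, gamma=0.5)
```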
%% Cell type:markdown id: tags:
# An illustrative example
Let us first demonstrate the concept of DA with a synthetic example. We consider a simple two-dimensional representation consisting of independent features $x_1$ and $x_2$ that are each distributed according to a normal distribution with mean 0 and variance 2 ($N(0,2)$) and a target property $y$ that is a 3rd degree polynomial in $x_1$ with an additive noise component that scales exponentially in $x_2$:
$y \sim x_1^3 - x_1 + N\left(0, \exp \left( \frac{x_2}{2}\right) \right)$
That is, the $y$ values are almost determined by the third degree polynomial for low $x_2$ values but are almost completely random for high $x_2$ values. Discovering applicable domains reveals how different models cope differently with this setting, even if they have a comparable average error. To show this, let us examine the average error obtained from three different kernelized regression models.
%% Cell type:markdown id: tags:
## Code initialization
%% Cell type:code id: tags:startup
``` python
import sys, os, shutil
import random
import pandas as pd
import numpy as np
import json
import math
import itertools
import glob
import ipywidgets as widgets
from copy import deepcopy
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:startup
``` python
base_path = os.getcwd()
data_path = base_path+"/data/domain_of_applicability"
if not os.path.exists(data_path+'/results'):
    os.makedirs(data_path+'/results')
#print(base_path)
#print(data_path)
```
%% Cell type:markdown id: tags:
## Data generation
First, the data for $n$ points is generated in the form of numpy arrays.
Then, we use the sklearn library to fit our data with a linear, a Gaussian (radial basis function, rbf), and a polynomial kernel. Our original data as well as the predicted values for each kernel are stored in the ``example_df`` data frame.
%% Cell type:code id: tags:
``` python
np.random.seed(7)
n = 500
index = np.arange(1, n+1)
x1 = np.random.normal(0, math.sqrt(2), n)
x2 = np.random.normal(0, math.sqrt(2), n)
X = np.matrix([x1, x2])
y = x1**3 - x1
for i, val in enumerate(y):
    y[i] = val + np.random.normal(0, math.exp(x2[i]/2), 1)
```
%% Cell type:code id: tags:
``` python
from sklearn.kernel_ridge import KernelRidge
lin_kern = KernelRidge(alpha = 1, kernel = 'linear')
rbf_kern = KernelRidge(alpha = 1, kernel = 'rbf')
poly_kern = KernelRidge(alpha = 1, kernel = 'polynomial')
lin_kern.fit(X.T, y)
rbf_kern.fit(X.T, y)
poly_kern.fit(X.T, y)
lin_pred = lin_kern.predict(X.T)
rbf_pred = rbf_kern.predict(X.T)
poly_pred = poly_kern.predict(X.T)
example_df = pd.DataFrame(data = {'id' : index, 'x1' : x1, 'x2': x2, 'y': y, 'lin_pred' : lin_pred, 'rbf_pred' : rbf_pred, 'poly_pred' : poly_pred}, columns = ['id', 'x1', 'x2', 'y', 'lin_pred', 'rbf_pred', 'poly_pred'])
example_df.to_csv(data_path+'/example_data.csv', index = False)
example_df
```
%% Cell type:markdown id: tags:
## Plotting the data
The following code plots the given data as well as the three models in 3D-space.
%% Cell type:code id: tags:
``` python
def prediction_mesh(X, Y, kernel):
    XX, YY = np.meshgrid(X, Y)
    samples = [(x, y) for y in Y for x in X]
    Z = kernel.predict(samples)
    ZZ = np.array([[Z[len(X)*iy+ix] for ix, x in enumerate(X)] for iy, y in enumerate(Y)])
    return XX, YY, ZZ

XXlin, YYlin, ZZlin = prediction_mesh(np.linspace(-5,5,10), np.linspace(-5,5,10), lin_kern)
XXrbf, YYrbf, ZZrbf = prediction_mesh(np.linspace(-5,5,20), np.linspace(-5,5,20), rbf_kern)
XXpoly, YYpoly, ZZpoly = prediction_mesh(np.linspace(-5,5,20), np.linspace(-5,5,10), poly_kern)
```
%% Cell type:code id: tags:
``` python
%matplotlib notebook
fig = plt.figure()
ax_lin = fig.add_subplot(221, projection='3d')
ax_rbf = fig.add_subplot(222, projection='3d')
ax_poly = fig.add_subplot(223, projection='3d')
ax_lin.scatter(x1, x2, y, color = 'r')
ax_lin.plot_wireframe(XXlin, YYlin, ZZlin, color = 'k')
ax_lin.set_zlim3d(-30,30)
ax_rbf.scatter(x1, x2, y, color = 'r')
ax_rbf.plot_wireframe(XXrbf, YYrbf, ZZrbf, color = 'k')
ax_rbf.set_zlim3d(-30,30)
ax_poly.scatter(x1, x2, y, color = 'r')
ax_poly.plot_wireframe(XXpoly, YYpoly, ZZpoly, color = 'k')
ax_poly.set_zlim3d(-30,30)
ax_lin.set_xlabel('x1')
ax_lin.set_ylabel('x2')
ax_lin.set_zlabel('y')
ax_rbf.set_xlabel('x1')
ax_rbf.set_ylabel('x2')
ax_rbf.set_zlabel('y')
ax_poly.set_xlabel('x1')
ax_poly.set_ylabel('x2')
ax_poly.set_zlabel('y')
```
%% Cell type:markdown id: tags:
## Setting the DA
To demonstrate the concept of DA, we will choose the domains for these models ourselves and compare the reduction of the average error when restricting the models to those domains.
When using the linear kernel, the resulting linear model is globally incapable of tracing the variation of the 3rd order polynomial, except for a small stripe around $x_1$ values close to 0, where it can be approximated well by a linear function. Consequently, there is a very high error globally that is substantially reduced in the applicability domain described by:
$\sigma_{lin}(x_1, x_2) \equiv -0.3 \le x_1 \le 0.3$
When using the Gaussian kernel, the resulting radial basis function model is able to represent the target property well locally unless (a) the noise component is too large or (b) the variation of the target property is too high relative to the number of training points. The second restriction arises because the radial basis functions (rbf) have non-negligible values only within a small region around the training examples. Consequently, the DA is not only restricted in the $x_2$-direction but also excludes high absolute $x_1$-values:
$\sigma_{rbf}(x_1,x_2) \equiv -3.3 \le x_1 \le 3.1 \wedge x_2 \le 0.1$
In contrast, when using the non-local 3rd degree polynomial kernel, data sparsity does not prevent an accurate modeling of the target property along the $x_1$-axis. However, this non-locality is counterproductive along the $x_2$-axis, where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low $x_2$-values. This is reflected in the identified applicability domain, which contains no restriction in the $x_1$-direction but excludes both high and low $x_2$-values. This highlights an important structural difference between the rbf and the polynomial model that is not reflected in their similar average errors:
$\sigma_{poly}(x_1,x_2) \equiv -3.5 \le x_2 \le 0.1$
%% Cell type:code id: tags:
``` python
doa = {}
doa['lin'] = (example_df['x1'] > -0.3) & (example_df['x1'] < 0.3)
doa['rbf'] = (example_df['x1'] > -3.3) & (example_df['x1'] < 3.1) & (example_df['x2'] < 0.1)
doa['poly'] = (example_df['x2'] > -3.5) & (example_df['x2'] < 0.1)
```
%% Cell type:markdown id: tags:
## Comparing the errors
For each model, the mean absolute error (MAE) is calculated both globally and in its respective domain of applicability. Additionally, the coverage of the DA as a fraction of the total data is presented.
It is apparent that the accuracy of the different models is drastically improved in their respective domains, which differ according to the strengths and weaknesses of each model.
%% Cell type:code id: tags:
``` python
error_df_data = {}
error_df_columns = ['model', 'Global MAE', 'DA MAE', 'coverage']
for model in ['lin', 'rbf', 'poly']:
    error_df_data[model] = []
    error_df_data[model].append(model)
    error_df_data[model].append(abs(example_df['y'] - example_df[model+'_pred']).mean())
    error_df_data[model].append(abs(example_df[doa[model]]['y'] - example_df[doa[model]][model+'_pred']).mean())
    error_df_data[model].append(len(example_df[doa[model]].index) / len(example_df.index))
error_df = pd.DataFrame.from_dict(data = error_df_data, columns = error_df_columns, orient = 'index')
error_df
```
%% Cell type:markdown id: tags:
# Domains of applicability for TCO models
Equipped with the concept of applicability domains, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of TCOs have been realized, because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Centre of Excellence (NOMAD) and hosted by the online platform Kaggle, using a dataset of 3,000 $(Al_x Ga_y In_z)_2 O_3$ sesquioxides spanning six different space groups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment that is defined by the specific lattice structure. <a href="https://www.nature.com/articles/s41524-019-0239-3" target="_blank">Sutton <i>et al.</i>, npj Comput. Mater. (2019)</a>
Our aim is to demonstrate the ability of the proposed DA analysis (i) to differentiate the performance of models based on different representations of the local atomic information of each structure and (ii) to identify sub-domains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art MBTR, SOAP, and $n$-gram representations. As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by composition. Since this representation is oblivious to configurational disorder (i.e., the many distinct structures that are possible at a given composition), it is expected to perform poorly across all space groups and concentrations.
%% Cell type:markdown id: tags:
## Settings
First, some global variables for the DA analysis need to be established. The data for this analysis is given in ``data.csv``.
%% Cell type:code id: tags:
``` python
root_path_dir = "."
exchange_value = 1.0 #exchange_value = 100.0
random_state_dict = {"mbtr": 4, "soap": 4, "ngram": 14, "atomic":2}
random_state = None
skip_dict = {"mbtr": {"norm_abs_error":[4,6]},
             "soap": {"norm_abs_error":[2, 4, 5]},
             "ngram":{"norm_abs_error":[1, 4, 5]},
             "atomic": {"norm_abs_error":[1, 6]}}
n_splits = 6
models = ["mbtr","soap","ngram","atomic"]
```
%% Cell type:markdown id: tags:
## Functions
The following functions are used for pre- and post-processing of the data and do not need to be studied in detail. Compact ways to use them are explained further below.
%% Cell type:code id: tags:
``` python
def gen_sgd_inputs(target, model=None, random_state=None, end_label="_predE", filename = 'data.csv', prop = "Ef"):
    # generate the per-split train/test data and the xarf input files for the subgroup discovery
    random.seed(random_state)
    final_df = get_df(model, end_label=end_label, filename = filename, prop = prop)
    split_total_df_and_write(target, model, n_splits, final_df, random_state)

def split_total_df_and_write(target, model, n_splits, final_df, random_state):
    # draw n_splits disjoint test sets of 100 structures each and write train/test CSVs plus an xarf file per split
    initial_list = final_df.id.tolist()
    master_list = {i:[] for i in range(0, n_splits) }
    for n_split in range(0, n_splits):
        idxs = random.sample(initial_list, 100)
        master_list[n_split] = idxs
        for i in idxs:
            del initial_list[initial_list.index(i)]
    for n_split in master_list:
        if not os.path.exists(model):
            os.mkdir(model)
        tmp_dir = os.path.join(model,"random_state_"+str(random_state))
        if not os.path.exists(tmp_dir):
            os.mkdir(tmp_dir)
        dirname = os.path.join(model,"random_state_"+str(random_state), "split"+"_"+str(n_split+1))
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        idxs = master_list[n_split]
        test_df = final_df[final_df.id.isin(idxs)]
        train_df = final_df[~final_df.id.isin(idxs)]
        train_df.to_csv(os.path.join(dirname, "train.csv"), index=False)
        test_df.to_csv(os.path.join(dirname, "test.csv"), index=False)
        xarf_write_name = os.path.join(dirname, "xarf.txt")
        write_xarf_file(train_df, target, xarf_write_name, write_dir = ".")

def get_df(model, end_label="_predE", filename = 'data.csv', prop = "Ef"):
    # load the data file, add the error columns for the given model, and keep only the relevant columns
    test_infile = os.path.join(root_path_dir,filename)
    df = pd.read_csv(test_infile)
    label = model+end_label
    keep_cols = [ i for i in df.columns.tolist() if end_label not in i and i != prop ]
    df = calc_sum_predE_abs_error(df, label, prop)
    keep_cols.extend( [prop, label, "abs_error", "error", "sq_error", "norm_abs_error", "sum_Ef_and_normalized_error", "sum_pred_Ef_and_abs_error"] )
    final_df = df[keep_cols]
    return final_df

def calc_sum_predE_abs_error(df, label, prop = "Ef"):
    # add error, absolute error, squared error, and normalized absolute error columns for the model predictions
    df["error"] = df[label].values - df[prop].values
    df["abs_error"] = abs(df[label].values - df[prop].values)
    df["sq_error"] = (df[label].values - df[prop].values)**2
    df["norm_abs_error"] = [ abs(i-j)/i if round(i - 0.00, 3) != 0.000 else 0.000 for i, j in zip(df[prop].tolist(),df[label].tolist()) ]
    df["sum_pred_Ef_and_abs_error"] = df[label] + df["abs_error"]
    if "sum_Ef_and_normalized_error" not in df.columns.tolist():
        df["sum_Ef_and_normalized_error"] = df[label] + df["norm_abs_error"]
    return df

def write_xarf_file(write_df, target, write_name, write_dir="."):
    # write the training data in xarf format: attribute declarations followed by the data rows
    feature_list = get_feature_txt(write_df.columns.tolist(), write_name.strip('_xarf_file.txt'))
    outFile2=write_name+'.tmp'
    if os.path.isfile(outFile2):
        os.remove(outFile2)
    write_df.to_csv(outFile2, header=None, sep=',', mode='a', index=False)
    with open(outFile2) as f:
        lines = f.readlines()
    content = [ line.strip() for line in lines ]
    feature_list.extend(content)
    outFile3=os.path.join(write_dir, write_name)
    with open(outFile3, 'w') as f:
        f.write('\n'.join(feature_list))
    os.remove(outFile2)

def get_feature_txt(features, header_name):
    # build the xarf header declaring each feature as numeric (or categoric for 'label')
    feature_list = []
    header = ('@relation wpo_1_0_0 caption="WPO 1.0.0" description="generated from %s.csv"' %header_name)
    feature_list.append( "%s" %header )
    for i in features:
        if i == "label":
            name_type = "categoric"
        else:
            name_type = "numeric"
        c = " ".join(["@attribute", i, name_type])
        feature_list.append( "%s" %c )
    feature_list.append( "%s" %"@data" )
    return feature_list
```
%% Cell type:code id: tags:
``` python
def get_all_values(rep, target, target_label, skip=None, end_label="_predE", prop="Ef", filename="data.csv"):
    # collect global and DA error statistics for a representation over all splits and write them to results/<rep>.dat
    global_dict = {}
    da_dict = {}
    dirs = get_dirs_glob(rep)
    results_file = open("results/"+rep+".dat",'w')
    for d in dirs:
        try:
            tmp_selectors, values, relations = get_selectors_from_subfolder(d)
            selectors = [ rename(s) for s in tmp_selectors ]
        except UnboundLocalError:
            # throws an error if there are no sgd outputs in the subfolder
            continue
        idx = int(d.split("/")[-2].strip("split_"))
        if idx in skip:
            continue
        else:
            global_df = get_df(rep, end_label=end_label, filename = filename, prop = prop)
            global_dict[idx] = get_global_summary(global_df, rep, target, target_label, selectors, values, relations, results_file, end_label=end_label, prop = prop)
            da_dict[idx] = get_subset_summary(d, rep, target, target_label, selectors, values, relations, results_file, end_label=end_label, prop = prop)
    result_dict = print_summary(rep, da_dict, global_dict , results_file)
    results_file.close()
    return result_dict

def calc_r2(values1, values2):
    # coefficient of determination R^2 of predictions values2 against reference values1
    avg_e = np.mean(values1)
    sum_of_square_error = [ ((values2[i] - values1[i])**2) for i in range(len(values1)) ]
    var = [ (values1[i] - avg_e)**2 for i in range(len(values1)) ]
    return 1-np.sum(sum_of_square_error)/np.sum(var)

def calc_l1(values1, values2):
    # L1 analogue of R^2: 1 - (sum of absolute errors) / (sum of absolute deviations from the median)
    median_e = np.median(values1)
    sum_of_abs_error = [ abs(values2[i] - values1[i]) for i in range(len(values1)) ]
    var = [ abs(values1[i] - median_e) for i in range(len(values1)) ]
    return 1-np.sum(sum_of_abs_error)/np.sum(var)

def partition_and_print(bigdf, target):
    # split a data frame into the DA ("reliable") part and the rest, and print the DA vs. global averages
    rr_df = bigdf[bigdf["is_reliable"] == 1]
    other_df = bigdf[bigdf["is_reliable"] == 0]
    print("avg. rr --> %s, cov = %s, vs. glob --> %s" %(np.mean(rr_df[target].values), len(rr_df)/float(len(bigdf)), np.mean(bigdf[target].values)))
    return rr_df, other_df

def selector2df(tmp_df, in_selectors, in_values, in_relations):
    # flag with is_reliable = 1 the rows that satisfy all selector constraints
    tmp_df["is_reliable"] = [ 1 for i in range(0, len(tmp_df)) ]
    for s, v, r in zip(in_selectors, in_values, in_relations):
        if r == "greaterOrEquals":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] >= v else 0 for i in range(0, len(tmp_df)) ]
        elif r == "greaterThan":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] > v else 0 for i in range(0, len(tmp_df)) ]
        elif r == "lessThan":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] < v else 0 for i in range(0, len(tmp_df)) ]
        elif r == "lessOrEquals":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] <= v else 0 for i in range(0, len(tmp_df)) ]
    return tmp_df

def get_dirs_glob(rep):
    subfolders = glob.glob(rep+"/*/split_*/")
    return subfolders

def get_global_summary(global_df, rep, target, target_label, selectors, values, relations, file, end_label="_predE", prop="Ef"):
    # compute global error statistics on the full data set and, via is_reliable, within the DA
    global_df[target] = global_df[target].values
    if target != target_label:
        global_df[target_label] = exchange_value*global_df[target_label].values
    global_df[rep+end_label] = exchange_value*global_df[rep+end_label].values
    global_df[prop] = exchange_value*global_df[prop].values
    for s, v, r in zip(selectors, values, relations):
        print(s, v, r, file=file)
    out_global_df = selector2df(deepcopy(global_df), selectors, values, relations)
    global_dict = { "error": np.mean(out_global_df[target_label].values),
                    "l1": calc_l1(out_global_df[prop].values, out_global_df[rep+end_label].values),
                    "l1_disp": np.mean([ abs(out_global_df[prop].values[i] - np.median(out_global_df[prop].values)) for i in range(len(out_global_df[prop].values)) ]),
                    "l2_disp": math.sqrt(np.mean([ (out_global_df[prop].values[i] - np.mean(out_global_df[prop].values))**2 for i in range(len(out_global_df[prop].values)) ])),
                    "rsquared": calc_r2(out_global_df[prop].values, out_global_df[rep+end_label].values),
                    "95per" : np.percentile(out_global_df[target_label].values, 95),
                    "DA_rsquared": calc_r2(out_global_df[out_global_df["is_reliable"] == 1][prop].values, out_global_df[out_global_df["is_reliable"] == 1][rep+end_label].values),
                    "DA_error": np.mean(out_global_df[out_global_df["is_reliable"] == 1][target_label].values),
                    "samples": deepcopy( out_global_df[target_label].values ),
                    prop: deepcopy( out_global_df[prop].values ) }
    return global_dict

def get_subset_summary(d, rep, target, target_label, selectors, values, relations, file, end_label="_predE", prop="Ef"):
    # compute error statistics and DA coverage for the train and test data of one split subfolder
    tmp_dict = {}
    for t in ["test", "train"]:
        in_df = pd.read_csv(os.path.join(d, t+".csv"))
        out_df = selector2df(deepcopy(in_df), selectors, values, relations)
        pred_e_value = deepcopy(exchange_value*out_df[out_df["is_reliable"] == 1][rep+end_label].values)
        e_value = deepcopy(exchange_value*out_df[out_df["is_reliable"] == 1][prop].values)
        target_e_value = deepcopy(exchange_value*out_df[out_df["is_reliable"] == 1][target_label].values)
        all_target_e_value = deepcopy(exchange_value*out_df[target_label].values)
        #print("for model %s split %s, avg. rr vs. glob error --> %s vs. %s" %(rep, split, target, t, np.mean(df[target].values), np.mean(df[df["is_reliable"] == 1][target].values)), file=file)
        tmp_dict["DA"+"_"+t] = {"error": np.mean(target_e_value),
                                "cov": len(out_df[out_df["is_reliable"] == 1])/float(len(out_df)),
                                "l1": calc_l1(e_value, pred_e_value),
                                "l1_disp": np.mean([ abs(e_value[i] - np.median(e_value)) for i in range(len(e_value)) ]),
                                "l2_disp": math.sqrt(np.mean([ (e_value[i] - np.mean(e_value))**2 for i in range(len(e_value)) ])),
                                "rsquared": calc_r2(e_value, pred_e_value),
                                "95per" : np.percentile(target_e_value, 95),
                                "samples": deepcopy(target_e_value), #abs(e_value - pred_e_value),
                                "all_error": np.mean(all_target_e_value),
                                "ids": deepcopy(out_df[out_df["is_reliable"] == 1]["id"].values),
                                }
    return tmp_dict

def print_summary(rep, final_rep_dict, final_global_dict, file):
    # average the statistics over the splits and print a global / DA summary to the results file
    result_dict = {}
    root_strg = "Global"
    for l, p in zip(["MAE", "r^2", "l1", "95per" ], ["error", "rsquared", "l1", "95per"]):
        tmp_list = [ final_global_dict[i][p] for i in final_global_dict.keys() ]
        result_dict[(root_strg, l)] = np.mean(tmp_list)
        print("%s: %s --> avg. = %s" %(root_strg, l, np.mean(tmp_list)),file=file)
    root_strg = "DA of Global"
    for l, p in zip(["MAE","r^2", "l1", "95per"], ["DA_error", "DA_rsquared", "l1", "95per"]):
        tmp_list = [ final_global_dict[i][p] for i in final_global_dict.keys() ]
        result_dict[(root_strg, l)] = np.mean(tmp_list)
        print("%s: %s --> avg. = %s" %(root_strg, l, np.mean(tmp_list)),file=file)
    for t in ["train", "test"]:
        if t == "train":
            root_strg = "DA identification"
        elif t == "test":
            root_strg = "DA validation"
        for l, p in zip(["MAE", "l1", "95per", "cov"], ["error", "l1", "95per", "cov"]):
            tmp_list = [ final_rep_dict[i]["DA"+"_"+t][p] for i in final_rep_dict.keys() ]
            result_dict[(root_strg, l)] = [np.mean(tmp_list),np.std(tmp_list)]
            print("%s: %s --> avg. (std) DA = %s (%s) " %(root_strg, l, np.mean(tmp_list), np.std(tmp_list)),file=file)
    return result_dict

def rename(in_key):
    # map the underscore variants of the bond-distance features to the hyphenated column names
    rename_dict = {'ecn_bond_dist_Al_Ga': 'ecn_bond_dist_Al-Ga',
                   'ecn_bond_dist_Al_In' : 'ecn_bond_dist_Al-In',
                   'ecn_bond_dist_Al_O': 'ecn_bond_dist_Al-O',
                   'ecn_bond_dist_Ga_Al' : 'ecn_bond_dist_Ga-Al',
                   'ecn_bond_dist_Ga_In' : 'ecn_bond_dist_Ga-In',
                   'ecn_bond_dist_Ga_O': 'ecn_bond_dist_Ga-O',
                   'ecn_bond_dist_In_Al' : 'ecn_bond_dist_In-Al',
                   'ecn_bond_dist_In_Ga': 'ecn_bond_dist_In-Ga',
                   'ecn_bond_dist_In_O': 'ecn_bond_dist_In-O',
                   'ecn_bond_dist_O_Al': 'ecn_bond_dist_O-Al',
                   'ecn_bond_dist_O_Ga': 'ecn_bond_dist_O-Ga',
                   'ecn_bond_dist_O_In': 'ecn_bond_dist_O-In',
                   }
    try:
        return rename_dict[in_key]
    except KeyError:
        return in_key

def renameback(in_key):
    # inverse of rename(): map the hyphenated names back to the underscore variants
    rename_dict = {'ecn_bond_dist_Al-Ga': 'ecn_bond_dist_Al_Ga',
                   'ecn_bond_dist_Al-In' : 'ecn_bond_dist_Al_In',
                   'ecn_bond_dist_Al-O': 'ecn_bond_dist_Al_O',
                   'ecn_bond_dist_Ga-Al' : 'ecn_bond_dist_Ga_Al',
                   'ecn_bond_dist_Ga-In' : 'ecn_bond_dist_Ga_In',
                   'ecn_bond_dist_Ga-O': 'ecn_bond_dist_Ga_O',
                   'ecn_bond_dist_In-Al' : 'ecn_bond_dist_In_Al',
                   'ecn_bond_dist_In-Ga': 'ecn_bond_dist_In_Ga',
                   'ecn_bond_dist_In-O': 'ecn_bond_dist_In_O',
                   'ecn_bond_dist_O-Al': 'ecn_bond_dist_O_Al',
                   'ecn_bond_dist_O-Ga': 'ecn_bond_dist_O_Ga',
                   'ecn_bond_dist_O-In': 'ecn_bond_dist_O_In',
                   }
    try:
        return rename_dict[in_key]
    except KeyError:
        return in_key

def read_json(fjson):
    with open(fjson) as f:
        return json.load(f)

def get_selectors_from_subfolder(in_dir):
    # need to set folder to perform the walk and need to check based on the string
    # read the selector (attributes, constraint values, relations) from the sgd json output of a split folder
    for root, dirs, files in os.walk(in_dir):
        for f in files:
            if ".json" in f:
                res = read_json(os.path.join(root, f))
                for r in res:
                    try:
                        r.keys()
                    except AttributeError:
                        continue
                    res_dict = r["descriptor"]['selector']
                    selectors = res_dict["attributes"]
                    values = [ i["value"] for i in res_dict["constraints"] ]
                    relations = [ i["type"] for i in res_dict["constraints"] ]
    return selectors, values, relations
```
%% Cell type:markdown id: tags:
## DA Analysis
%% Cell type:markdown id: tags:
### Relative weight between coverage and error reduction
The impact function, which the SGD maximizes to find the DA, is given by:
$\mathrm{impact}(\sigma) = \left( \frac{s}{k} \right)^\gamma \left( \frac{1}{k} \sum\limits^k_{i=1} l_i(f) - \frac{1}{s} \sum\limits_{i \in I(\sigma)} l_i(f) \right)^{1-\gamma}$
where $\gamma$ is a parameter that determines the relative importance of coverage and error reduction.
The value of $\gamma$ is set to 0.5 by default, but it can be changed with the slider below followed by a call to ``update_gamma()``, or directly by passing the value as a function parameter, i.e. ``update_gamma(0.4)``. This sets the corresponding value in the file ``neg_mean_shift_abs_norm_error.json``, which serves as a settings file for the SGD.
%% Cell type:code id: tags:
``` python
gamma_slider = widgets.FloatSlider(
    value=0.5,
    min=0,
    max=1.0,
    step=0.01,
    description='Coverage weight:',
    style={'description_width': 'initial'}
)
gamma_slider
```
%% Cell type:code id: tags:
``` python
def update_gamma(g = None):
    if g is None:
        gamma = gamma_slider.value
    else:
        gamma = g
    # the settings file stores the coverage weight as gamma/(1-gamma)
    gamma_prime = gamma/(1.0-gamma)
    os.chdir(data_path)
    with open('neg_mean_shift_abs_norm_error.json') as json_settings:
        settings = json.load(json_settings)
    settings['computations'][0]['parameters']['cov_weight'] = gamma_prime
    with open('neg_mean_shift_abs_norm_error.json', 'w') as json_settings:
        json.dump(settings, json_settings)
    os.chdir(base_path)
```
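%% Cell type:markdown id: tags:
As a short usage sketch (assuming ``neg_mean_shift_abs_norm_error.json`` is present in ``data_path``, as described above), $\gamma$ can be set either from the slider or directly:
%% Cell type:code id: tags:
``` python
# take the value currently selected with the slider above
update_gamma()
# or pass a value explicitly, e.g. weighting error reduction slightly more than coverage
update_gamma(0.4)
```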
%% Cell type:markdown id: tags:
### Feature selection
The feature space can be reduced with the function ``set_features(features)``, where ``features`` is the list of features from the data set that should be included in the calculation. To use all features, call ``set_features()``, i.e. the function without a parameter.
%% Cell type:code id: tags:
``` python
def list2str(List):
    string = '[' + ', '.join(List) + ']'
    return string

def set_features(features=None):
    full_features = ['Natoms', 'percent_atom_al', 'percent_atom_ga', 'percent_atom_in', 'vol_per_atom', 'a',
                     'b', 'c', 'alpha', 'gamma', 'beta', 'a_b', 'b_c', 'a_c', 'ecn_bond_dist_Al-Ga',
                     'ecn_bond_dist_Al-In', 'ecn_bond_dist_Al-O', 'ecn_bond_dist_Ga-Al', 'ecn_bond_dist_Ga-In',
                     'ecn_bond_dist_Ga-O', 'ecn_bond_dist_In-Al', 'ecn_bond_dist_In-Ga', 'ecn_bond_dist_In-O',
                     'ecn_bond_dist_O-Al', 'ecn_bond_dist_O-Ga', 'ecn_bond_dist_O-In']
    # Features that are not requested are put on the attribute filter, i.e. excluded from the SGD search
    if features is None:
        removal_filter = []
    else:
        removal_filter = [renameback(feat) for feat in full_features if feat not in features]
    # Target and error columns are always excluded from the search space
    removal_filter = ['Ef', 'atomic_predE', 'soap_predE', 'ngram_predE', 'mbtr_predE', 'abs_error',
                      'sum_pred_Ef_and_abs_error', 'sum_Ef_and_normalized_error', 'norm_abs_error', 'norm_error',
                      'error', 'abs_error', 'sq_error', 'abs_sq_error', 'norm_abs_error_bystd',
                      'norm_sq_error_bystd'] + removal_filter
    os.chdir(data_path)
    with open('neg_mean_shift_abs_norm_error.json') as json_settings:
        settings = json.load(json_settings)
    settings['computations'][0]['parameters']['attr_filter'] = list2str(removal_filter)
    with open('neg_mean_shift_abs_norm_error.json', 'w') as json_settings:
        json.dump(settings, json_settings)
    os.chdir(base_path)
```
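%% Cell type:markdown id: tags:
For instance, to restrict the subgroup search to the composition and unit-cell features (an arbitrary subset, purely for illustration), one could call:
%% Cell type:code id: tags:
``` python
# Illustrative subset: composition fractions and lattice parameters only
set_features(['percent_atom_al', 'percent_atom_ga', 'percent_atom_in', 'vol_per_atom', 'a', 'b', 'c'])

# Revert to the full feature set before running the complete analysis below
set_features()
```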
%% Cell type:markdown id: tags:
### Removing old files
To prevent using the wrong data, old data should be removed before starting a new calculation. This can be done with ``rm_old_files(model)``, where ``model`` is the name of the ML model and also the name of the corresponding directory.
%% Cell type:code id: tags:
``` python
def rm_old_files(model):
    os.chdir(data_path)
    splits = get_dirs_glob(model)
    for split in splits:
        try:
            shutil.rmtree(os.path.join(split, "output"))
        except FileNotFoundError:
            pass
    os.chdir(base_path)
```
%% Cell type:markdown id: tags:
### Splitting data into folds
The DA analysis uses $k$-fold cross validation. In our case ($k=6$) this means that the data is split into six folds and the calculation is run six times, with each fold used once as the test set while the remaining folds are used for training. Afterwards, the results of all runs are combined. ``split_data(model)`` generates the files and directories used for the actual analysis.
%% Cell type:code id: tags:
``` python
def split_data(model):
    os.chdir(data_path)
    target = "norm_abs_error"
    target_label = "abs_error"
    gen_sgd_inputs(target, model=model, random_state=random_state_dict[model])
    os.chdir(base_path)
```
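%% Cell type:markdown id: tags:
The splitting itself is handled by the helper ``gen_sgd_inputs`` defined earlier in this notebook. Purely for illustration, the standalone sketch below shows what a 6-fold split of a small dummy index set looks like using scikit-learn's ``KFold``; it is not the code used by ``split_data``.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import KFold

# Dummy stand-in for the materials indices; the real data set is much larger
samples = np.arange(12)

kf = KFold(n_splits=6, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    # In each split one fold serves as the test set and the remaining folds as the training set
    print(f"split_{i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```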
%% Cell type:markdown id: tags:
### Running the analysis
``run_analysis(model)`` runs the executable of the SGD code (namely, <a href="http://www.realkd.org/realkd-library/" target="_blank">realKD</a>, by Mario Boley) for each split to determine the domains of applicability using subgroup discovery.
%% Cell type:code id: tags:
``` python
def run_analysis(model):
    os.chdir(data_path)
    for i in range(1, 7):
        print("Calculating: "+model+" split "+str(i))
        os.chdir(model+"/random_state_"+str(random_state_dict[model])+"/split_"+str(i))
        # Call the realKD jar with the SGD settings file for this split
        os.system("java -jar ../../../software/realkd-0.7.2/bin/realkd-0.7.2-jar-with-dependencies.jar ../../../neg_mean_shift_abs_norm_error.json")
        os.chdir(data_path)
    os.chdir(base_path)
```
%% Cell type:markdown id: tags:
### Summarizing data
Finally, the results for each split and each model need to be summarized by ``summarize_data()``. This returns a dictionary containing the global and DA errors, coverages, and R values.
%% Cell type:code id: tags:
``` python
def summarize_data():
    target = "norm_abs_error"
    target_label = "abs_error"
    os.chdir(data_path)
    data_summary = {}
    for model in models:
        data_summary[model] = get_all_values(model, target, target_label, skip=skip_dict[model][target])
    os.chdir(base_path)
    return data_summary
```
%% Cell type:markdown id: tags:
### Displaying data
``generate_table(data_summary, gamma)`` compiles the summarized data into a table and also writes it to a CSV file.
%% Cell type:code id: tags:
``` python
def generate_table(data_summary, gamma):
    columns = pd.MultiIndex.from_product([["Global", "DA validation", "DA identification"], ["cov", "MAE", "95AE", "R"]])
    columns = columns.drop([("Global", "cov")])
    columns = columns.insert(0, ("Model", ""))
    data = {}
    # Per-column scale factors applied before formatting
    factor = {"cov": 1, "MAE": 1000, "95per": 1000, "l1": 1}
    for model in models:
        data[model] = []
        data[model].append(model)
        root_strg = "Global"
        for col in ["MAE", "95per", "l1"]:
            data[model].append("{:.2f}".format(factor[col]*data_summary[model][(root_strg, col)]))
        # DA columns are reported as "mean (spread)" over the splits
        for root_strg in ["DA validation", "DA identification"]:
            for col in ["cov", "MAE", "95per", "l1"]:
                data[model].append("{:.2f} ({:.2f})".format(factor[col]*data_summary[model][(root_strg, col)][0], factor[col]*data_summary[model][(root_strg, col)][1]))
    results_df = pd.DataFrame.from_dict(data=data, columns=columns, orient="index")
    results_df.to_csv(data_path+'/cov'+str(gamma)+"_results.csv", index=False)
    return results_df
```
%% Cell type:markdown id: tags:
### Analysis in action
Using the functions defined above, we can run the complete analysis for a $\gamma$ value of 0.5. This recreates the table in the <a href="https://th.fhi-berlin.mpg.de/site/uploads/Publications/s41467-020-17112-9.pdf" target="_top">paper</a>.
**Warning: running the next cell takes several minutes.**
%% Cell type:code id: tags:
``` python
gamma = 0.5
update_gamma(gamma)
set_features()
for model in models:
    rm_old_files(model)
    split_data(model)
    run_analysis(model)
data_summary = summarize_data()
results_df = generate_table(data_summary, gamma)
results_df
```
%% Cell type:markdown id: tags:
## Coverage - effect trade-off
To show the trade-off between the coverage of the DA and the error reduction within it, which is controlled by the $\gamma$ value, a DA analysis is performed for $\gamma$ values ranging from 0.2 to 0.66. The following code generates all of these files.
**WARNING: The next cell takes a long time to run (more than 2 h). The results have therefore been pre-calculated and stored, so running it with the default settings can be skipped. Please uncomment the following cell only if you wish to repeat the whole calculation.**
%% Cell type:code id: tags:
``` python
# for gamma in np.linspace(0.2, 0.66, 16):
#     update_gamma(gamma)
#     for model in models:
#         rm_old_files(model)
#         run_analysis(model)
#     data_summary = summarize_data()
#     generate_table(data_summary, gamma)
```
%% Cell type:markdown id: tags:
Each analysis creates, for each model, one data point of coverage and relative error reduction. This data is compiled and displayed in a graph. It is apparent that larger domains, i.e. broader applicability, come with a smaller reduction in the mean absolute error.
%% Cell type:code id: tags:
``` python
cov_error_data = {
    'mbtr': {'x': [], 'y': [], 'x_err': [], 'y_err': []},
    'soap': {'x': [], 'y': [], 'x_err': [], 'y_err': []},
    'ngram': {'x': [], 'y': [], 'x_err': [], 'y_err': []},
    'atomic': {'x': [], 'y': [], 'x_err': [], 'y_err': []}
}
for file in glob.glob(data_path+'/cov*_results.csv'):
    df = pd.read_csv(file)
    for i, model in enumerate(models):
        global_error = float(df['Global'][i+1])  # i+1 to skip the row "MAE, 95AE, ..."
        # Table entries are formatted as "value (spread)", so split off the value and the bracketed part
        da_error = float(df['DA validation.1'][i+1].split(" ")[0])
        da_delta = float(df['DA validation.1'][i+1].split(" ")[1].replace('(', '').replace(')', ''))
        reduction = (global_error - da_error)/global_error
        reduction_delta = da_delta/global_error
        cov = float(df['DA validation'][i+1].split(" ")[0])
        cov_delta = float(df['DA validation'][i+1].split(" ")[1].replace('(', '').replace(')', ''))
        cov_error_data[model]['x'].append(cov)
        cov_error_data[model]['y'].append(reduction)
        cov_error_data[model]['x_err'].append(cov_delta)
        cov_error_data[model]['y_err'].append(reduction_delta)

fig2 = plt.figure()
for model in models:
    # Sort the points by coverage so the line plot runs monotonically along the x axis
    cov_error_data[model]['x'], cov_error_data[model]['y'] = zip(*sorted(zip(cov_error_data[model]['x'], cov_error_data[model]['y'])))
    plt.plot(cov_error_data[model]['x'], cov_error_data[model]['y'], '-o', label=model)
    #plt.errorbar(cov_error_data[model]['x'], cov_error_data[model]['y'], xerr=cov_error_data[model]['x_err'], yerr=cov_error_data[model]['y_err'], fmt='-o', label=model)
plt.legend(loc="upper right")
plt.xlabel('coverage')
plt.ylabel('relative error reduction')
print()
```