Commit d4ef2c41 authored by Mohammad-Yasin Arif's avatar Mohammad-Yasin Arif

Added feature selection

parent abd4ae93
%% Cell type:code id: tags:
``` python
%%HTML
<script>
window.findCellIndicesByTag = function findCellIndicesByTag(tagName) {
return (Jupyter.notebook.get_cells()
.filter(
({metadata: {tags}}) => tags && tags.includes(tagName)
)
.map((cell) => Jupyter.notebook.find_cell_index(cell))
);
};
window.runCells = function runCells(tagName) {
var c = window.findCellIndicesByTag(tagName);
Jupyter.notebook.execute_cells(c);
};
</script>
```
%%%% Output: display_data
%% Cell type:code id: tags:
``` python
%%HTML
<script>
code_show=true;
function code_toggle() {
if (code_show)
{
$('div.input').hide();
}
else
{
$('div.input').show();
}
code_show = !code_show
}
// $( document ).ready(code_toggle);
window.runCells("startup");
</script>
<!-- The raw code for this notebook is by default hidden for easier reading. -->
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.
```
%%%% Output: display_data
%% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Identifying Domains of Applicability of Machine Learning Models for Materials Science</font></b>
</div>
<p>
<b> Notebook designed and created by:</b> Mohammad-Yasin Arif, Luigi Sbailò, and Matthias Scheffler. <i>Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, D-14195 Berlin, Germany</i><br> <br>
<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: July 30, 2020]</span>
<div>
<img style="float: left;" src="assets/domain_of_applicability/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/domain_of_applicability/Logo_NOMAD.png" width="250">
</div>
</div>
%% Cell type:markdown id: tags:
# Introduction
Although machine learning (ML) models promise to substantially accelerate the discovery of novel materials, their performance is often still insufficient to draw reliable conclusions. Improved ML models are therefore actively researched, but their design is currently guided mainly by monitoring the average model test error. This can render different models indistinguishable although their performance differs substantially across materials, or it can make a model appear generally insufficient while it actually works well in specific sub-domains. Here we present a method, based on subgroup discovery, for detecting domains of applicability (DA) of models within a materials class. The utility of this approach is demonstrated by analyzing three state-of-the-art ML models for predicting the formation energy of transparent conducting oxides. We find that, despite having a mutually indistinguishable and unsatisfactory average error, the models have DAs with distinctive features and notably improved performance.
%% Cell type:markdown id: tags:
The materials of interest are represented as vectors in a vector space $X$ according to some chosen representation. The coordinates $x_i$ could represent features of the material, e.g., bond distances, lattice parameters, or the composition of elements. The ML models try to predict a target property $y$ with minimal error according to some loss function. In our case, $y$ is the formation energy. For this example, three ML models have been used: MBTR, SOAP, and n-gram. Additionally, calculations were performed with a simple representation containing just atomic properties, which is expected to produce much larger errors. All of this data is compiled into ``data.csv``.
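
As a quick orientation, the prediction errors of each model can be read directly from this file. The following sketch is illustrative only; it assumes the column naming convention (``Ef`` for the reference formation energy and ``<model>_predE`` for the predictions) that the pre-processing functions later in this notebook rely on.

``` python
import pandas as pd

df = pd.read_csv("data.csv")
for model in ["mbtr", "soap", "ngram", "atomic"]:
    # mean absolute error of this model's predicted formation energy
    mae = (df[model + "_predE"] - df["Ef"]).abs().mean()
    print(model, "MAE:", mae)
```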
%% Cell type:markdown id: tags:
A DA is defined by a function $\sigma: X \rightarrow \{true, false\}$, which describes a series of inequality constraints on the coordinates $x_i$. Thus, these selectors describe intersections of axis-parallel half-spaces, resulting in simple convex regions in $X$. This allows one to systematically reason about the described sub-domains (e.g., it is easy to determine their differences and overlap) and also to sample novel points from them. These domains are found through subgroup discovery (SGD), maximizing the impact on the model error. This impact is defined as the product of the selector coverage and the error reduction within it, i.e.:
$\mathrm{impact}(\sigma) = \left( \frac{s}{k} \right)^\gamma \left( \frac{1}{k} \sum\limits^k_{i=1} l_i(f) - \frac{1}{s} \sum\limits_{i \in I(\sigma)} l_i(f) \right)^{1-\gamma}$
where:

- $k$ ... number of points in the data set
- $s$ ... number of points in the DA set
- $I(\sigma)$ ... set of selected indices ($I(\sigma) = \{i: 1\le i\le k, \sigma(x_i)= \mathrm{true}\}$)
- $l_i$ ... loss function (e.g. squared error)
- $f$ ... ML model (i.e. $f: X\rightarrow \mathbb{R}$)
- $\gamma$ ... coverage weight, $0 < \gamma < 1$
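
The impact can also be evaluated directly from per-point losses. The sketch below is only illustrative and is not part of the analysis pipeline used later; the function name, the boolean-mask interface, the default $\gamma = 0.5$, and the guard against negative error reduction are our own choices.

``` python
import numpy as np

def impact(losses, in_da, gamma=0.5):
    # losses: per-point loss values l_i(f) for the whole data set (length k)
    # in_da:  boolean mask, True where the selector sigma is satisfied
    losses = np.asarray(losses, dtype=float)
    in_da = np.asarray(in_da, dtype=bool)
    k = losses.size                  # number of points in the data set
    s = in_da.sum()                  # number of points in the DA
    coverage = s / k
    error_reduction = losses.mean() - losses[in_da].mean()
    # a candidate domain that is worse than the global average gets zero impact
    if error_reduction <= 0:
        return 0.0
    return coverage**gamma * error_reduction**(1 - gamma)
```

For instance, once the synthetic data and the hand-picked domains below are defined, `impact(abs(example_df['y'] - example_df['lin_pred']), doa['lin'])` would score the domain chosen for the linear model.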
%% Cell type:markdown id: tags:
# An illustrative example
Let us first demonstrate the concept of DA with a synthetic example. We consider a simple two-dimensional representation consisting of independent features $x_1$ and $x_2$ that are each distributed according to a normal distribution with mean 0 and variance 2 ($N(0,2)$) and a target property $y$ that is a 3rd degree polynomial in $x_1$ with an additive noise component that scales exponentially in $x_2$:
$y \sim x_1^3 - x_1 + N\left(0, \exp \left( \frac{x_2}{2}\right) \right)$
That is, the $y$ values are almost determined by the third degree polynomial for low $x_2$ values but are almost completely random for high $x_2$ values. Discovering applicable domains reveals how different models cope differently with this setting even if they have a comparable average error. To show this, let us examine the average error obtained from three different kernelized regression models.
%% Cell type:code id: tags:startup
``` python
import sys, os, shutil
import random
import pandas as pd
import numpy as np
import json
import math
import itertools
import glob
import ipywidgets as widgets
from copy import deepcopy
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:startup
``` python
base_path = os.getcwd()
data_path = base_path+"/data/domain_of_applicability"
#print(base_path)
#print(data_path)
```
%% Cell type:markdown id: tags:
## Data generation
First, the data for $n$ points is generated in the form of numpy arrays.
%% Cell type:code id: tags:
``` python
np.random.seed(7)
n = 500
index = np.arange(1, n+1)
x1 = np.random.normal(0, math.sqrt(2), n)
x2 = np.random.normal(0, math.sqrt(2), n)
X = np.matrix([x1, x2])
# target: 3rd degree polynomial in x1 plus noise whose scale grows exponentially with x2
y = x1**3 - x1
for i, val in enumerate(y):
    y[i] = val + np.random.normal(0, math.exp(x2[i]/2), 1)
```
%% Cell type:markdown id: tags:
Then, we use the scikit-learn library to fit our data with a linear, a Gaussian (radial basis function, rbf), and a polynomial kernel. The original data as well as the predicted values for each kernel are stored in the ``example_df`` data frame.
%% Cell type:code id: tags:
``` python
from sklearn.kernel_ridge import KernelRidge
lin_kern = KernelRidge(alpha = 1, kernel = 'linear')
rbf_kern = KernelRidge(alpha = 1, kernel = 'rbf')
poly_kern = KernelRidge(alpha = 1, kernel = 'polynomial')
lin_kern.fit(X.T, y)
rbf_kern.fit(X.T, y)
poly_kern.fit(X.T, y)
lin_pred = lin_kern.predict(X.T)
rbf_pred = rbf_kern.predict(X.T)
poly_pred = poly_kern.predict(X.T)
example_df = pd.DataFrame(data = {'id' : index, 'x1' : x1, 'x2': x2, 'y': y, 'lin_pred' : lin_pred, 'rbf_pred' : rbf_pred, 'poly_pred' : poly_pred}, columns = ['id', 'x1', 'x2', 'y', 'lin_pred', 'rbf_pred', 'poly_pred'])
example_df.to_csv(data_path+'/example_data.csv', index = False)
example_df
```
%%%% Output: execute_result
id x1 x2 y lin_pred rbf_pred poly_pred
0 1 2.390764 2.156206 13.875031 11.520824 11.162903 11.704958
1 2 -0.658935 -2.653527 0.131608 -3.465452 0.190073 0.296761
2 3 0.046415 0.675889 1.545471 0.312993 -0.056601 0.023133
3 4 0.576315 -0.443546 -0.315322 2.641479 -0.627006 -0.442774
4 5 -1.115706 -1.210105 0.137321 -5.405181 -0.080986 -0.208178
.. ... ... ... ... ... ... ...
495 496 1.023711 0.054674 -0.958879 4.810776 -0.002086 -0.020592
496 497 1.610162 0.787967 3.591573 7.665611 2.803856 2.544301
497 498 -0.230868 -0.690782 0.646929 -1.180514 0.137436 0.226905
498 499 -1.819410 -0.888864 -2.872168 -8.661579 -4.087120 -4.040424
499 500 0.377257 -0.958044 -0.459578 1.635048 -0.414392 -0.378469
[500 rows x 7 columns]
%% Cell type:markdown id: tags:
## Plotting the data
The following code plots the given data as well as the three models in 3D-space.
%% Cell type:code id: tags:
``` python
def prediction_mesh(X, Y, kernel):
    # evaluate the fitted kernel model on a regular (X, Y) grid for wireframe plotting
    XX, YY = np.meshgrid(X, Y)
    samples = [(x, y) for y in Y for x in X]
    Z = kernel.predict(samples)
    ZZ = np.array([[Z[len(X)*iy+ix] for ix, x in enumerate(X)] for iy, y in enumerate(Y)])
    return XX, YY, ZZ
XXlin, YYlin, ZZlin = prediction_mesh(np.linspace(-5,5,10), np.linspace(-5,5,10), lin_kern)
XXrbf, YYrbf, ZZrbf = prediction_mesh(np.linspace(-5,5,20), np.linspace(-5,5,20), rbf_kern)
XXpoly, YYpoly, ZZpoly = prediction_mesh(np.linspace(-5,5,20), np.linspace(-5,5,10), poly_kern)
```
%% Cell type:code id: tags:
``` python
%matplotlib notebook
fig = plt.figure()
ax_lin = fig.add_subplot(221, projection='3d')
ax_rbf = fig.add_subplot(222, projection='3d')
ax_poly = fig.add_subplot(223, projection='3d')
ax_lin.scatter(x1, x2, y, color = 'r')
ax_lin.plot_wireframe(XXlin, YYlin, ZZlin, color = 'k')
ax_lin.set_zlim3d(-30,30)
ax_rbf.scatter(x1, x2, y, color = 'r')
ax_rbf.plot_wireframe(XXrbf, YYrbf, ZZrbf, color = 'k')
ax_rbf.set_zlim3d(-30,30)
ax_poly.scatter(x1, x2, y, color = 'r')
ax_poly.plot_wireframe(XXpoly, YYpoly, ZZpoly, color = 'k')
ax_poly.set_zlim3d(-30,30)
ax_lin.set_xlabel('x1')
ax_lin.set_ylabel('x2')
ax_lin.set_zlabel('y')
ax_rbf.set_xlabel('x1')
ax_rbf.set_ylabel('x2')
ax_rbf.set_zlabel('y')
ax_poly.set_xlabel('x1')
ax_poly.set_ylabel('x2')
ax_poly.set_zlabel('y')
```
%%%% Output: display_data
%%%% Output: display_data
%%%% Output: execute_result
Text(0.5, 0, 'y')
%% Cell type:markdown id: tags:
## Setting the DA
To demonstrate the concept of DA, we will choose the domains for these models ourselves and compare the reduction of average error by restricting the models to those domains.
When using the linear kernel, the resulting linear model is globally incapable of tracing the variation of the 3rd order polynomial, except for a small stripe of $x_1$ values close to 0 where the polynomial can be approximated well by a linear function. Consequently, there is a very high error globally that is substantially reduced in the applicability domain described by:
$\sigma_{lin}(x_1, x_2) \equiv -0.3 \le x_1 \le 0.3$
When using the Gaussian kernel, the resulting radial basis function model is able to represent the target property well locally unless (a) the noise component is too large or (b) the variation of the target property is too high relative to the number of training points. The second restriction is because the radial basis functions (rbf) have non-negligible values only within a small region around the training examples. Consequently, the DA is not only restricted in $x_2$-direction but also excludes high absolute $x_1$-values:
$\sigma_{rbf}(x_1,x_2) \equiv -3.3 \le x_1 \le 3.1 \wedge x_2 \le 0.1$
In contrast, when using the non-local 3rd degree polynomial kernel, data sparsity does not prevent an accurate modeling of the target property along the $x_1$-axis. However, this non-locality is counterproductive along the $x_2$-axis, where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low $x_2$-values. This is reflected in the identified applicability domain, which contains no restriction in $x_1$-direction but excludes both high and low $x_2$-values. This highlights an important structural difference between the rbf and the polynomial model that is not reflected in their similar average errors:
$\sigma_{poly}(x_1,x_2) \equiv -3.5 \le x_2 \le 0.1$
%% Cell type:code id: tags:
``` python
doa = {}
doa['lin'] = (example_df['x1'] > -0.3) & (example_df['x1'] < 0.3)
doa['rbf'] = (example_df['x1'] > -3.3) & (example_df['x1'] < 3.1) & (example_df['x2'] < 0.1)
doa['poly'] = (example_df['x2'] > -3.5) & (example_df['x2'] < 0.1)
```
%% Cell type:markdown id: tags:
## Comparing the errors
For each model, the mean absolute error (MAE) is calculated both globally and in its respective domain of applicability. Additionally, the coverage of the DA as a fraction of the total data is presented.
It is apparent that the accuracy of each model is drastically improved in its respective domain, and that the domains reflect the strengths and weaknesses of each model.
%% Cell type:code id: tags:
``` python
error_df_data = {}
error_df_columns = ['model', 'Global MAE', 'DA MAE', 'coverage']
for model in ['lin', 'rbf', 'poly']:
    error_df_data[model] = []
    error_df_data[model].append(model)
    # global MAE over all points
    error_df_data[model].append(abs(example_df['y'] - example_df[model+'_pred']).mean())
    # MAE restricted to the model's domain of applicability
    error_df_data[model].append(abs(example_df[doa[model]]['y'] - example_df[doa[model]][model+'_pred']).mean())
    # coverage: fraction of all points inside the DA
    error_df_data[model].append(len(example_df[doa[model]].index) / len(example_df.index))
error_df = pd.DataFrame.from_dict(data = error_df_data, columns = error_df_columns, orient = 'index')
error_df
```
%%%% Output: execute_result
model Global MAE DA MAE coverage
lin lin 4.119203 1.374833 0.174
rbf rbf 1.318289 0.637111 0.528
poly poly 1.027286 0.513449 0.532
%% Cell type:markdown id: tags:
# Domains of applicability for TCO models
Equipped with the concept of applicability domains, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of TCOs have been realized because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Centre of Excellence (NOMAD) and hosted by the on-line platform Kaggle using a dataset of 3,000 $(Al_x Ga_y In_z)_2 O_3$ sesquioxides, spanning six different spacegroups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment that is defined by the specific lattice structure.
Our aim is to demonstrate the ability of the proposed DA analysis to (i) differentiate the performance of models based on different representations of the local atomic information of each structure and (ii) identify sub-domains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art MBTR, SOAP, and n-gram representations. As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by the compositions. Since this representation is oblivious to configurational disorder (i.e., the many distinct structures that are possible at a given composition), it is expected to perform poorly across all spacegroups and concentrations.
%% Cell type:markdown id: tags:
## Settings
First, some global variables for the DA analysis need to be established. The data for this analysis is given in ``data.csv``.
%% Cell type:code id: tags:
``` python
root_path_dir = "."
exchange_value = 1.0 #exchange_value = 100.0
random_state_dict = {"mbtr": 4, "soap": 4, "ngram": 14, "atomic":2}
random_state = None
skip_dict = {"mbtr": {"norm_abs_error":[4,6]},
"soap": {"norm_abs_error":[2, 4, 5]},
"ngram":{"norm_abs_error":[1, 4, 5]},
"atomic": {"norm_abs_error":[1, 6]}}
n_splits = 6
models = ["mbtr","soap","ngram","atomic"]
```
%% Cell type:markdown id: tags:
## Functions
The following functions are used for pre- and post-processing of the data and do not need to be studied in detail. Compact ways of using them are explained further below.
%% Cell type:code id: tags:
``` python
def gen_sgd_inputs(target, model=None, random_state=None, end_label="_predE", filename = 'data.csv', prop = "Ef"):
    # read the data for one model/representation and write the SGD input files
    random.seed(random_state)
    final_df = get_df(model, end_label=end_label, filename = filename, prop = prop)
    split_total_df_and_write(target, model, n_splits, final_df, random_state)

def split_total_df_and_write(target, model, n_splits, final_df, random_state):
    # draw n_splits disjoint test sets of 100 points each and write train/test csv
    # files plus the corresponding xarf input file for every split
    initial_list = final_df.id.tolist()
    master_list = {i:[] for i in range(0, n_splits) }
    for n_split in range(0, n_splits):
        idxs = random.sample(initial_list, 100)
        master_list[n_split] = idxs
        for i in idxs:
            del initial_list[initial_list.index(i)]
    for n_split in master_list:
        if not os.path.exists(model):
            os.mkdir(model)
        tmp_dir = os.path.join(model,"random_state_"+str(random_state))
        if not os.path.exists(tmp_dir):
            os.mkdir(tmp_dir)
        dirname = os.path.join(model,"random_state_"+str(random_state), "split"+"_"+str(n_split+1))
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        idxs = master_list[n_split]
        test_df = final_df[final_df.id.isin(idxs)]
        train_df = final_df[~final_df.id.isin(idxs)]
        train_df.to_csv(os.path.join(dirname, "train.csv"), index=False)
        test_df.to_csv(os.path.join(dirname, "test.csv"), index=False)
        xarf_write_name = os.path.join(dirname, "xarf.txt")
        write_xarf_file(train_df, target, xarf_write_name, write_dir = ".")

def get_df(model, end_label="_predE", filename = 'data.csv', prop = "Ef"):
    # load data.csv and keep the feature columns plus the error columns of one model
    test_infile = os.path.join(root_path_dir,filename)
    df = pd.read_csv(test_infile)
    label = model+end_label
    keep_cols = [ i for i in df.columns.tolist() if end_label not in i and i != prop ]
    df = calc_sum_predE_abs_error(df, label, prop)
    keep_cols.extend( [prop, label, "abs_error", "error", "sq_error", "norm_abs_error", "sum_Ef_and_normalized_error", "sum_pred_Ef_and_abs_error"] )
    final_df = df[keep_cols]
    return final_df

def calc_sum_predE_abs_error(df, label, prop = "Ef"):
    # add the different error measures (signed, absolute, squared, normalized) as columns
    df["error"] = df[label].values - df[prop].values
    df["abs_error"] = abs(df[label].values - df[prop].values)
    df["sq_error"] = (df[label].values - df[prop].values)**2
    df["norm_abs_error"] = [ abs(i-j)/i if round(i - 0.00, 3) != 0.000 else 0.000 for i, j in zip(df[prop].tolist(),df[label].tolist()) ]
    df["sum_pred_Ef_and_abs_error"] = df[label] + df["abs_error"]
    if "sum_Ef_and_normalized_error" not in df.columns.tolist():
        df["sum_Ef_and_normalized_error"] = df[label] + df["norm_abs_error"]
    return df

def write_xarf_file(write_df, target, write_name, write_dir="."):
    # write the training data in the xarf format expected by the SGD code
    feature_list = get_feature_txt(write_df.columns.tolist(), write_name.strip('_xarf_file.txt'))
    outFile2=write_name+'.tmp'
    if os.path.isfile(outFile2):
        os.remove(outFile2)
    write_df.to_csv(outFile2, header=None, sep=',', mode='a', index=False)
    with open(outFile2) as f:
        lines = f.readlines()
    content = [ line.strip() for line in lines ]
    feature_list.extend(content)
    outFile3=os.path.join(write_dir, write_name)
    with open(outFile3, 'w') as f:
        f.write('\n'.join(feature_list))
    os.remove(outFile2)

def get_feature_txt(features, header_name):
    # build the xarf header (attribute declarations) for the given feature names
    feature_list = []
    header = ('@relation wpo_1_0_0 caption="WPO 1.0.0" description="generated from %s.csv"' %header_name)
    feature_list.append( "%s" %header )
    for i in features:
        if i == "label":
            name_type = "categoric"
        else:
            name_type = "numeric"
        c = " ".join(["@attribute", i, name_type])
        feature_list.append( "%s" %c )
    feature_list.append( "%s" %"@data" )
    return feature_list
```
%% Cell type:code id: tags:
``` python
def get_all_values(rep, target, target_label, skip=None, end_label="_predE", prop="Ef", filename="data.csv"):
    # collect global and DA error statistics over all SGD splits of one representation
    global_dict = {}
    da_dict = {}
    dirs = get_dirs_glob(rep)
    results_file = open("results/"+rep+".dat",'w')
    for d in dirs:
        try:
            tmp_selectors, values, relations = get_selectors_from_subfolder(d)
            selectors = [ rename(s) for s in tmp_selectors ]
        except UnboundLocalError:
            # raised if there are no sgd outputs in the subfolder
            continue
        idx = int(d.split("/")[-2].strip("split_"))
        if idx in skip:
            continue
        else:
            global_df = get_df(rep, end_label=end_label, filename = filename, prop = prop)
            global_dict[idx] = get_global_summary(global_df, rep, target, target_label, selectors, values, relations, results_file, end_label=end_label, prop = prop)
            da_dict[idx] = get_subset_summary(d, rep, target, target_label, selectors, values, relations, results_file, end_label=end_label, prop = prop)
    result_dict = print_summary(rep, da_dict, global_dict , results_file)
    results_file.close()
    return result_dict

def calc_r2(values1, values2):
    # coefficient of determination of the predictions (values2) w.r.t. the reference (values1)
    avg_e = np.mean(values1)
    sum_of_square_error = [ ((values2[i] - values1[i])**2) for i in range(len(values1)) ]
    var = [ (values1[i] - avg_e)**2 for i in range(len(values1)) ]
    return 1-np.sum(sum_of_square_error)/np.sum(var)

def calc_l1(values1, values2):
    # analogous L1-based score using absolute deviations from the median
    median_e = np.median(values1)
    sum_of_abs_error = [ abs(values2[i] - values1[i]) for i in range(len(values1)) ]
    var = [ abs(values1[i] - median_e) for i in range(len(values1)) ]
    return 1-np.sum(sum_of_abs_error)/np.sum(var)

def partition_and_print(bigdf, target):
    # split a data frame into points inside ("is_reliable" == 1) and outside the DA
    rr_df = bigdf[bigdf["is_reliable"] == 1]
    other_df = bigdf[bigdf["is_reliable"] == 0]
    print("avg. rr --> %s, cov = %s, vs. glob --> %s" %(np.mean(rr_df[target].values), len(rr_df)/float(len(bigdf)), np.mean(bigdf[target].values)))
    return rr_df, other_df

def selector2df(tmp_df, in_selectors, in_values, in_relations):
    # mark each row with "is_reliable" = 1 if it satisfies all selector inequalities
    tmp_df["is_reliable"] = [ 1 for i in range(0, len(tmp_df)) ]
    for s, v, r in zip(in_selectors, in_values, in_relations):
        if r == "greaterOrEquals":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] >= v else 0 for i in range(0, len(tmp_df)) ]
        elif r == "greaterThan":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] > v else 0 for i in range(0, len(tmp_df)) ]
        elif r == "lessThan":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] < v else 0 for i in range(0, len(tmp_df)) ]
        elif r == "lessOrEquals":
            tmp_df["is_reliable"] = [ 1*tmp_df["is_reliable"].values[i] if tmp_df[s].values[i] <= v else 0 for i in range(0, len(tmp_df)) ]
    return tmp_df

def get_dirs_glob(rep):
    subfolders = glob.glob(rep+"/*/split_*/")
    return subfolders

def get_global_summary(global_df, rep, target, target_label, selectors, values, relations, file, end_label="_predE", prop="Ef"):
    # evaluate the selectors on the full data set and collect global and DA statistics
    global_df[target] = global_df[target].values
    if target != target_label:
        global_df[target_label] = exchange_value*global_df[target_label].values
        global_df[rep+end_label] = exchange_value*global_df[rep+end_label].values
        global_df[prop] = exchange_value*global_df[prop].values
    for s, v, r in zip(selectors, values, relations):
        print(s, v, r, file=file)
    out_global_df = selector2df(deepcopy(global_df), selectors, values, relations)
    global_dict = { "error": np.mean(out_global_df[target_label].values),
                    "l1": calc_l1(out_global_df[prop].values, out_global_df[rep+end_label].values),
                    "l1_disp": np.mean([ abs(out_global_df[prop].values[i] - np.median(out_global_df[prop].values)) for i in range(len(out_global_df[prop].values)) ]),
                    "l2_disp": math.sqrt(np.mean([ (out_global_df[prop].values[i] - np.mean(out_global_df[prop].values))**2 for i in range(len(out_global_df[prop].values)) ])),
                    "rsquared": calc_r2(out_global_df[prop].values, out_global_df[rep+end_label].values),
                    "95per" : np.percentile(out_global_df[target_label].values, 95),
                    "DA_rsquared": calc_r2(out_global_df[out_global_df["is_reliable"] == 1][prop].values, out_global_df[out_global_df["is_reliable"] == 1][rep+end_label].values),
                    "DA_error": np.mean(out_global_df[out_global_df["is_reliable"] == 1][target_label].values),
                    "samples": deepcopy( out_global_df[target_label].values ),
                    prop: deepcopy( out_global_df[prop].values ) }
    return global_dict
def get_subset_summary(d, rep, target, target_label, selectors, values, relations, file, end_label="_predE", prop="Ef"):
tmp_dict = {}
for t in ["test", "train"]:
in_df = pd.read_csv(os.path.join(d, t+".csv"))
out_df = selector2df(deepcopy(in_df), selectors, values, relations)
pred_e_value = deepcopy(exchange_value*out_df[out_df["is_reliable"] == 1][rep+end_label].values)
e_value = deepcopy(exchange_value*out_df[out_df["is_reliable"] == 1][prop].values)
target_e_value = deepcopy(exchange_value*out_df[out_df["is_reliable"] == 1][target_label].values)