Commit a5d0fbda authored by Luigi Sbailo's avatar Luigi Sbailo
Browse files

Update logos

parent f968bd76
%% Cell type:markdown id: tags:
<img src="assets/perovskites_tolerance_factor/header.jpg" width="900">
%% Cell type:markdown id: tags:
<img style="float: left;" src="assets/perovskites_tolerance_factor/logo_NOMAD.png" width=300>
<img style="float: right;" src="assets/perovskites_tolerance_factor/logo_MPG.png" width=170>
<img style="float: left;" src="assets/perovskites_tolerance_factor/logo_MPG.png" width=150>
<img style="float: left; margin-top: -10px" src="assets/perovskites_tolerance_factor/logo_NOMAD.png" width=250>
<img style="float: left; margin-top: -5px" src="assets/perovskites_tolerance_factor/logo_HU.png" width=130>
%% Cell type:markdown id: tags:
This tutorial shows how a tolerance factor for predicting perovskite stability can be learned from data with the sure-independece-screening-and-sparsifying-operator (SISSO) descriptor-identification approach.
The SISSO method is described in detail in:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: <span style="font-style: italic;">SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates</span>, Phys. Rev. Materials 2, 083802 (2018) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802" target="_blank">[PDF]</a> .
</div>
This tutorial is based on the following publication:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
C. Bartel, C. Sutton, B. R. Goldsmith, R. Ouyang, C. B. Musgrave, Luca M. Ghiringhelli, M. Scheffler: <span style="font-style: italic;">New tolerance factor to predict the stability of perovskite oxides and halides</span>, Sci. Adv. 5, eaav0693 (2019) <a href="https://advances.sciencemag.org/content/advances/5/2/eaav0693.full.pdf" target="_blank">[PDF]</a> .
</div>
Some scripts and data tables for this notebook have been adapted from the content of:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
<a href="https://github.com/CJBartel/perovskite-stability" target="_blank">https://github.com/CJBartel/perovskite-stability</a>
</div>
# Perovskites and the Goldschmidt tolerance factor
Perovskites constitute a class of materials having the basic formula $ABX_3$ and displaying a common structure in which a smaller metal cation $B$ (e.g. a transition metal) resides in corner-sharing octahedra of $X$ anions (e.g. $O^{2-}$, $Cl^-$, $Br^-$) and a larger $A$ metal cation (e.g. alkali, alkaline earth or lanthanide) has a 12-fold coordination with the $X$ anions. This class of compounds has a remarkable variety of electronic, magnetic, optical, mechanical, and transport properties. Such variety is derived from the possibility of tuning the materials propertites by the composition. About 90% of the metallic chemical elements of the periodic table can be stabilized in a perovskite structure. Therefore, perovskites are versatile materials suitable for a number of applications including photovoltaics, thermoelectrics and catalysis.
The first step to design new perovskites is to assess their stability. For this purpose, the Goldschmidt tolerance factor, $t$, has been extensively used to predict the stability of a material in the perovskite structure based on the (Shannon) ionic radii,$r_i$, of each ion $(A,B,X)$ in the chemical formula $ABX_3$:
$$ t=\frac{r_A+r_X}{\sqrt2(r_B+r_X)} $$
$t$ measures how much the $A$-site cation fits into the corner-sharing octahedral network in a cubic crystal structure. It indicates the compatibilty of a given set of ions with the ideal, cubic perovskite structure ($t\approx1$). Distortions from the cubic structure arise from size mismatch between cations and anions, which results in perovskite structures other than cubic (e.g. orthorhombic, rhombohedral). However, when these distortions are too large, the perovskite structure may be unstable and non-perovskites structures are rather formed.
The accuracy of the Goldschmidt factor is, however, often insufficient to screen for new potential materials and several modification have been proposed to overcome this issue. In this tutorial, we show how data can be used to derive tolerance factors for perovskite stability.
# The SISSO method for descriptor identifcation
A crucial step in data-driven materials science is the identification of descriptors, functions of physical parameters that characterize the phenomena governing a certain property. More specifically, a descriptor $\mathbf{d}$ (in general, an array) is mapped into a target property $P$ of interest by means of a mathematical relationship: $P=P(\mathbf{d})$. Descriptors numerically encode the different materials and, crucially, should be obtained (measured or calculated) more easily than the target property, so that they can be evaluated for large sets of still unknown materials to search for new ones.
The-sure-independence-screening-and-sparsifying-operator (SISSO) method combines a symbolic-regression-based feature construction with compressed sensing for the (data-driven) identification of the best low-dimensional descriptors. Within SISSO's feature construction, an initial set of input features (the primary features) offered by the user are systematically combined via the application of mathematical operators (e.g. addition, multiplication, exponential, square root), generating a large space of candidate features. The candidate features are then ranked according to their fit to the target property (number of materials in the overlap of convex-hull regions, for the case of classification problems) and the top-ranked features are further used for descriptor selection.
For futher details on compressed-sensing methods (including SISSO) for descriptor identification, a <a href="https://analytics-toolkit.nomad-coe.eu/hub/user-redirect/notebooks/tutorials/compressed_sensing.ipynb" target="_blank">dedicated notebook</a> is available in the NOMAD toolkit.
For navigation convenience, the following sections are collapsed. You can uncollapse one by one and run the cells as usual or reach to the section you are interested and before running the first cell in the selected section, click "Run All Above" in the "Cell" menu.
%% Cell type:markdown id: tags:
# Import required modules
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker
import pandas as pd
import seaborn as sns
import warnings
import itertools
import os
from math import log
import re
import numexpr as ne
import ipywidgets as widgets
from IPython.display import Markdown, display, clear_output
from sissopp import Inputs, Unit, FeatureNode
from perovskites_tolerance_factor.PredictPerovskites import PredictABX3, PredictAABBXX6
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.calibration import CalibratedClassifierCV
warnings.filterwarnings('ignore')
```
%% Cell type:markdown id: tags:
# Get the data
The data consists of a list of 576 $ABX_3$ solids experimentally-characterized at ambient conditions, classified as stable or unstable at the perovskite structure, together with the following features:
<div >
<ul>
<li>$exp$_$label$ : a Boolean variable, for which the value 1.0 (0.0) means that the materials is stable (unstable) as perovskite</li>
<li>$r_A, r_B, r_X$: <a href="https://mipt.ru/upload/medialibrary/df1/revised-effective-ionic-radii-and-systematic-studies-of-interatomic-distances-in-halides-and-chalcogenides_shannon.pdf" target="_blank">Shannon ionic radii</a> of each ion, with $r_A>r_B$ </li>
<li>$n_A, n_B, n_X$: oxidation satates of each ion </li>
<li>$\frac{r_A}{r_B}, \frac{r_A}{r_X}, \frac{r_B}{r_x}$: ionic radii ratios. Note: this features can be obtained from $r_A, r_B, r_X$ at the first rung of SISSO feature constructions. However, since in Goldschmidt's $t$ factor, ratios of radii appear prominently, they are offered directly as candidate features, aiming at simpler models. </li>
<li>$Z_A, Z_B, Z_X$: nuclear charges </li>
<li>$r_{s,A}, r_{s,B}, r_{s,X}$: <i>calculated</i> radius where the radial distribution of the $s$ orbital has its maximum </li>
<li>$r_{p,A}, r_{p,B}, r_{p,X}$: <i>calculated</i> radius where the radial distribution of the $p$ orbital has its maximum</li>
<li>$HOMO_A, HOMO_B, HOMO_X$: <i>calculated</i> energy of the highest occupied atomic orbital</li>
<li>$LUMO_A, LUMO_B, LUMO_X$: <i>calculated</i> energy of the lowest unoccupied atomic orbital</li>
<li>$EA_A, EA_B, EA_X$: <i>calculated</i> electron affinity</li>
<li>$IP_A, IP_B, IP_X$: <i>calculated</i> ionization potential</li>
</div>
The <i>calculated</i> features were obtained with DFT-PBE using the FHI-aims all-electron full-potential code and correspond to properties of isolated atoms.
%% Cell type:code id: tags:
``` python
#load data
df = pd.read_csv("data/perovskites_tolerance_factor/data_perovskite.csv", index_col=0)
#add rX_rB_ratio feature to the dataframe
rX_rB_ratio="rB_rX_ratio**-1"
df.eval('rX_rB_ratio =' + rX_rB_ratio, inplace=True)
df
```
%% Cell type:code id: tags:
``` python
#count the number of material in each class in the whole dataset
print('In the whole dataset, %s compositions are unstable and %s are stable in the perovskite structure.' % (df['exp_label'].value_counts().values[0], df['exp_label'].value_counts().values[1]))
```
%% Cell type:code id: tags:
``` python
# make lists of random numbers for train and test sets
inds = np.arange(576)
np.random.shuffle(inds)
task_sizes_train = [460]
task_sizes_test = [116]
test_inds = [int(ii) for ii in np.sort(inds[:task_sizes_test[0]])]
train_inds = [int(ii) for ii in np.sort(inds[task_sizes_test[0]:])]
# test data frame
index_id_test=df.index[test_inds]
df_units_columns= df.columns.values[:]
test_vals = df.loc[index_id_test, :].values
test_df= pd.DataFrame(index=index_id_test,data=np.array(test_vals),columns=df_units_columns)
display(test_df)
# train data frame
index_id_train=df.index[train_inds]
train_vals = df.loc[index_id_train,:].values
train_df= pd.DataFrame(index=index_id_train,data=np.array(train_vals),columns=df_units_columns)
display(train_df)
```
%% Cell type:code id: tags:
``` python
#count the number of material in each class in the training/test sets
print('In the training set, %s compositions are unstable and %s are stable.' % (train_df['exp_label'].value_counts().values[0], train_df['exp_label'].value_counts().values[1]))
print('In the test set, %s compositions are unstable and %s are stable.' % (test_df['exp_label'].value_counts().values[0], test_df['exp_label'].value_counts().values[1]))
```
%% Cell type:markdown id: tags:
# Generate the candidate features space from the primary features and operators
The two sets of elements needed to create the feature space with SISSO are the features to be used (i.e. the primary features) and the set of mathematical operators to be applied. Another input from the user is the number of times the operators are applied, the so-called rung (max_phi).
%% Cell type:code id: tags:
``` python
#add the primary features units to data set
df_units=df.rename(columns={'rA':'rA (AA)', 'rB':'rB (AA)', 'rX':'rX (AA)',
'nA':'nA (Unitless)', 'nB':'nB (Unitless)', 'nX':'nX (Unitless)',
'rA_rB_ratio':'rA_rB_ratio (Unitless)', 'rA_rX_ratio':'rA_rX_ratio (Unitless)', 'rB_rX_ratio':'rB_rX_ratio (Unitless)',
'rS_A':'rS_A (AA)', 'rP_A':'rP_A (AA)',
'Z_A':'Z_A (elem_charge)',
'HOMO_A':'HOMO_A (eV)', 'LUMO_A':'LUMO_A (eV)',
'EA_A':'EA_A (eV)', 'IP_A':'IP_A (eV)',
'rS_B':'rS_B (AA)', 'rP_B':'rP_B (AA)',
'Z_B':'Z_B (elem_charge)',
'HOMO_B':'HOMO_B (eV)', 'LUMO_B':'LUMO_B (eV)',
'EA_B':'EA_B (eV)', 'IP_B':'IP_B (eV)',
'rS_X':'rS_X (AA)', 'rP_X':'rP_X (AA)',
'Z_X':'Z_X (elem_charge)',
'HOMO_X':'HOMO_X (eV)','LUMO_X':'LUMO_X (eV)',
'EA_X':'EA_X (eV)', 'IP_X':'IP_X (eV)','rX_rB_ratio':'rX_rB_ratio (Unitless)',
})
df_units
```
%% Cell type:code id: tags:
``` python
#define list of primary features - user has to choose
cols = [
# 'rA (AA)',
# 'rB (AA)',
# 'rX (AA)',
'nA (Unitless)',
# 'nA2 (Unitless)',
# 'nB (Unitless)',
# 'nX(Unitless)',
'rA_rB_ratio (Unitless)',
# 'rA_rX_ratio (Unitless)',
# 'rB_rX_ratio (Unitless)',
'rX_rB_ratio (Unitless)',
# 'log_rA_rB_ratio (Unitless)',
# 'rS_A (AA)',
# 'rP_A (AA)',
# 'Z_A (elem_charge)',
# 'HOMO_A (eV)',
# 'LUMO_A (eV)',
# 'EA_A (eV)',
# 'IP_A (eV)',
# 'rS_B (AA)',
# 'rP_B (AA)',
# 'Z_B (elem_charge)',
# 'HOMO_B (eV)',
# 'LUMO_B (eV)',
# 'EA_B (eV)',
# 'IP_B (eV)',
# 'rS_X (AA)',
# 'rP_X (AA)',
# 'Z_X (elem_charge)',
# 'HOMO_X (eV)',
# 'LUMO_X (eV)',
# 'EA_X (eV)',
# 'IP_X (eV)'
]
```
%% Cell type:code id: tags:
``` python
#define list of operators - user has to choose
ops = [
# "add",
# "two_power",
"sub",
# "abs_diff",
"mult",
"div",
# "exp",
# "neg_exp",
# "inv",
"sq",
# "cb",
# "sixth_power",
# "sqrt",
# "cbrt",
"log",
# "abs",
# "sin",
# "cos",
]
```
%% Cell type:code id: tags:
``` python
# making input files for SISSO++
# note: together with the pre-uncommented features and operators above, these settings ensure to find tau as
# in the paper. All settings can be modified, but note that adding features and/or operators,
# increasing n_sis_select, max_rung, and n_dim increases the computational time, typically exponentially!
inputs = Inputs()
inputs.allowed_ops = ops
inputs.n_sis_select = 2000
inputs.max_rung = 3
inputs.n_dim = 3
inputs.calc_type = "classification"
inputs.n_residual = 200
inputs.n_model_store = 1
inputs.min_abs_feat_val = 0
inputs.max_abs_feat_val = 1e5
inputs.task_names = ["all_mats"]
inputs.task_sizes_train = [460]
inputs.task_sizes_test = [116]
inputs.leave_out_inds = list(test_inds)
#list_object[index] = new_value
# exp_label values according to the previous list of random number for train and test sets
inputs.prop_train = df_units.loc[df_units.index[train_inds], "exp_label"].values
inputs.prop_test = df_units.loc[df_units.index[test_inds], "exp_label"].values
inputs.prop_label = "exp_label"
inputs.prop_unit = Unit("eV")
inputs.global_param_optfalse = True
inputs.fix_intercept= True
# name of materials according to the previous list of random number for train and test sets
inputs.sample_ids_train = df_units.index[train_inds].tolist()
inputs.sample_ids_test = df_units.index[test_inds].tolist()
#primary features - deletes the unit part of the features and then uses node to make the input
phi_0 = []
for cc, col in enumerate(cols):
expr = col.split("(")[0].strip()
if len(col.split("(")) == 2:
unit = Unit(col.split("(")[1].split(")")[0].strip())
else:
unit = Unit()
phi_0.append(FeatureNode(cc, expr, df_units.loc[df_units.index[train_inds], col].tolist(), df_units.loc[df_units.index[test_inds], col].tolist(), unit))
inputs.phi_0 = phi_0
```
%% Cell type:markdown id: tags:
# Select the best candidate features and classification thresholds
Next, the generated candidate features are selected in two steps. In a first step (SIS), they are ranked according to the number of materials $N$ that fall in overlapping regions of stable and unstable domains and only the top-ranked features are kept. The domain is defined as the range between the maximum and minimum values of the feature for each of the classes (stable and unstable). The best candidate features are those that present lower $N$. The lenght of the overlap domain, $S$, is used to rank features presenting the same $N$. $N$ and $S$ correspond to equations 2 and 3, respectively, in Phys. Rev. Materials 2, 083802 (2018).
%% Cell type:code id: tags:
``` python
#Perform SISSO++ to create a feature space and select the features
from sissopp import FeatureSpace, SISSOClassifier
feature_space = FeatureSpace(inputs)
feature_space.sis(df_units.loc[df_units.index[train_inds], "exp_label"].values)
```
%% Cell type:code id: tags:
``` python
features=[feat.expr for feat in feature_space.phi_selected] # features
vals = [feat.value for feat in feature_space.phi_selected] # the values of the features
vals=np.transpose(vals) # transpose of the matrix to make it compatible with the indices
index_id=df_units.index[train_inds] # indices of materials
feature_df=pd.DataFrame(index=index_id,data=np.array(vals),columns=features) # make a data frame of the features and values of the features for different materials
feature_df
```
%% Cell type:markdown id: tags:
In a second step (SO), classification trees are used to choose the best candidate feature among those selected by the overlaps (above). For each of the selected candidate features, a classification tree is trained, providing a threshold for the classification and its accuracy. The selected candidate features get ranked based on their accuracy.
%% Cell type:code id: tags:
``` python
#train classification trees for the selected descriptors
#depth of the classification tree - user has to choose
def rank_tree(labels, feature_space, depth): #rank features according to the classification-tree accuracy
score = []
for i in list(range(0,feature_space.shape[1])): # 'i' is a column and 'for' is from the first column to the last one
x=np.array(feature_space)[:,i] # take the first column values
clf = tree.DecisionTreeClassifier(max_depth=depth)
clf = clf.fit(x.reshape(-1,1), labels) # Build a decision-tree classifier from the training set (X, y). X is the values of features (for each for iteration on column) and Y is the target value, here exp_label
score.append([feature_space.columns.values[i],clf.score(x.reshape(-1,1),labels)]) # make a list of the feature and the mean accuracy of the all values of that feaure (for different materials)
score_sorted=sorted(score,reverse=True,key=lambda x: x[1]) # sort the features based on the accuracy
return score_sorted
labels_train=train_df["exp_label"].to_numpy()
rank_list=rank_tree(labels_train, feature_df, 1)
tree_pd =pd.DataFrame(rank_list, columns=['feature','tree accuracy']) # make a new data frame of the feature and the accuracies
tree_pd.style.set_properties(subset=['feature'], **{'width': '300px'})
```
%% Cell type:markdown id: tags:
The top-ranked candidate feature corresponds to the the SISSO-derived tolerance factor. We call this descriptor $t_{sisso}$.
%% Cell type:code id: tags:
``` python
# the first ranked feature is the t_sisso
t_sisso_expression=str(rank_list[0][0])
t_sisso_expression = t_sisso_expression.replace("ln","log")
t_sisso_expression = t_sisso_expression.replace("^","**")
print('Identified expression for t_sisso: %s' % t_sisso_expression)
```
%% Cell type:markdown id: tags:
$t_{sisso}$ can be now evaluated for all the materials, including those in the test set. For comparison, we also evaluate the Goldschmidt factor ($t$) and the $\tau$ descriptor found in Sci. Adv. 5, eaav0693 (2019).
%% Cell type:code id: tags:
``` python
# evaluate t_sisso, t and tau for all materials in train and test sets and add them as extra columns to the train and test data frame
#make a dictionary for t_sisso,t, tau
tolerance_factor_dict = {
"t_sisso": [t_sisso_expression],
"t": ["(rA+rX)/(1.41421*(rB+rX))"],
"tau": ["rX/rB-nA*(nA-rA_rB_ratio/log(rA_rB_ratio))"]
}
#Add tau threshold
tolerance_factor_dict["tau"].append(4.18)
train_df.eval('t_sisso = ' + tolerance_factor_dict['t_sisso'][0], inplace=True)
train_df.eval('t = '+ tolerance_factor_dict['t'][0],inplace=True)
train_df.eval('tau = '+ tolerance_factor_dict['tau'][0], inplace=True)
test_df.eval('t_sisso = ' + tolerance_factor_dict['t_sisso'][0], inplace=True)
test_df.eval('t = '+ tolerance_factor_dict['t'][0], inplace=True)
test_df.eval('tau = '+ tolerance_factor_dict['tau'][0], inplace=True)
train_df
```
%% Cell type:markdown id: tags:
# Evalute the descriptor performance
The accuracy of the classification tree for $t_{sisso}$ can be now evaluated for train and test sets. For the classification tree, a maximum depth of one is used here.
%% Cell type:code id: tags:
``` python
#train classification tree with t_sisso
x_train_t_sisso=train_df['t_sisso'].to_numpy()
x_test_t_sisso=test_df['t_sisso'].to_numpy()
labels_test=test_df['exp_label'].to_numpy()
clf = tree.DecisionTreeClassifier(max_depth=1)
clf = clf.fit(x_train_t_sisso.reshape(-1,1),labels_train)
labels_pred_t_sisso=clf.predict(x_test_t_sisso.reshape(-1,1))
tree.plot_tree(clf)
print('Classification tree accuracy (for t_sisso) on the train set: %f.' % clf.score(x_train_t_sisso.reshape(-1,1),labels_train))
print('Classification tree accuracy (for t_sisso) on the test set: %f.' % metrics.accuracy_score(labels_test, labels_pred_t_sisso))
```
%% Cell type:code id: tags:
``` python
threshold_t_sisso=clf.tree_.threshold[0]
#Add threshold to the dictionary
tolerance_factor_dict["t_sisso"].append(threshold_t_sisso)
print('t_sisso < %f indicates stable perovskites.' % tolerance_factor_dict["t_sisso"][1])
```
%% Cell type:markdown id: tags:
The accuracy of the classification tree for $t$ can be also evaluated. In order to mimic the original calibration of $t$ performed by Goldschmidt, we adopt here a maximum depth of two for the classification tree, corresponding to a classification based on two thresholds.
%% Cell type:code id: tags:
``` python
#train classification tree with t
x_train_t=train_df['t'].to_numpy()
x_test_t=test_df['t'].to_numpy()
clf1 = tree.DecisionTreeClassifier(max_depth=2)
clf1 = clf1.fit(x_train_t.reshape(-1,1),labels_train)
labels_pred_t=clf1.predict(x_test_t.reshape(-1,1))
tree.plot_tree(clf1)
print('Classification tree accuracy (for t) on the train set: %f.' % clf1.score(x_train_t.reshape(-1,1),labels_train))
print('Classification tree accuracy (for t) on the test set: %f.' % metrics.accuracy_score(labels_test, labels_pred_t))
```
%% Cell type:code id: tags:
``` python
threshold_t=[clf1.tree_.threshold[0],clf1.tree_.threshold[4]]
# add t threshold to dictionary
tolerance_factor_dict["t"].append(threshold_t)
print('%f < t < %f indicates stable perovskites.' % (tolerance_factor_dict["t"][1][0],tolerance_factor_dict["t"][1][1]) )
```
%% Cell type:markdown id: tags:
Platt-scaled classification probabilities $P(t_{sisso})$ are also computed based on the $t_{sisso}$ values and the labels. Such probabilities indicate the likelihood that a material is stable in the perovskite structure (as opposed to the stable vs. non-stable classification provided by the threshold alone).
%% Cell type:code id: tags:
``` python
#evaluation of P(t_sisso)
labels_platt=clf.predict(x_train_t_sisso.reshape(-1,1))
clf2 = CalibratedClassifierCV(cv=3)
clf2 = clf2.fit(x_train_t_sisso.reshape(-1,1), labels_platt)
p_t_sisso_train=clf2.predict_proba(x_train_t_sisso.reshape(-1,1))[:,1]
p_t_sisso_test=clf2.predict_proba(x_test_t_sisso.reshape(-1,1))[:,1]
train_df['p_t_sisso'] = p_t_sisso_train # add p_t_sisso to the train and test data frame
test_df['p_t_sisso'] = p_t_sisso_test
```
%% Cell type:code id: tags:
``` python
#plotting P(t_sisso) vs t_sisso
plt.figure(figsize=(8,8))
concat=pd.concat([train_df.assign(dataset='train'), test_df.assign(dataset='test')])
plot1=sns.scatterplot(x='t_sisso', y='p_t_sisso', data=concat,hue='exp_label', style='dataset',
palette=['red','blue'], markers=['s','o'], s=80)
plot1.set_xlabel("$t_{sisso}$", fontsize=20)
plot1.set_ylabel("$P(t_{sisso})$", fontsize=20)
plot1.tick_params(labelsize=20)
plt.xlim(threshold_t_sisso-0.5, threshold_t_sisso+0.5)
plt.axvline(tolerance_factor_dict["t_sisso"][1])
plt.axhline(0.5,linestyle='--')
```
%% Cell type:markdown id: tags:
Finally, the performance of $t_{sisso}$ and $t$ can be compared by plotting $t_{sisso}$ vs. $t$ and $P(t_{sisso})$ vs. $t$. In thes plot, each datapoint is colored according to the (true) experimental label (e.g. whether a materials is a perovskite (blue) or not (red)) for both train and test sets.
%% Cell type:code id: tags:
``` python
plt.figure(figsize=(8,8))
plot2=sns.scatterplot(x='t', y='t_sisso', data=concat, hue='exp_label', style='dataset',
palette=['red','blue'], markers=['s','o'], s=80)
plot2.set_xlabel("$t$", fontsize=20)
plot2.set_ylabel("$t_{sisso}$", fontsize=20)
plot2.tick_params(labelsize=20)
plt.ylim(threshold_t_sisso-4, threshold_t_sisso+4)
plt.axvline(tolerance_factor_dict["t"][1][0])
plt.axvline(tolerance_factor_dict["t"][1][1])
plt.axhline(threshold_t_sisso)
```
%% Cell type:code id: tags:
``` python
plt.figure(figsize=(8,8))
plot3=sns.scatterplot(x='t', y='p_t_sisso', data=concat, hue='exp_label', style='dataset',
palette=['red','blue'], markers=['s','o'], s=80)
plot3.set_xlabel("$t$", fontsize=20)
plot3.set_ylabel("$P(t_{sisso})$", fontsize=20)
plot3.tick_params(labelsize=20)
plt.axvline(tolerance_factor_dict["t"][1][0])
plt.axvline(tolerance_factor_dict["t"][1][1])
```
%% Cell type:markdown id: tags:
# Predict stability of unseen materials (exploitation) using $t_{sisso}$
With $t_{sisso}$ in hand, one can explore large composition spaces in search for (new) compositions which are likely to form perovskites. We start by creating a list of candidate materials to be tested.
%% Cell type:code id: tags:
``` python
potential_A=['Li','Na','K','Rb','Cs','Be','Mg','Ca','Sr','Ba','Sc','Y','La','Ce','Pr','Nd','Pm','Sm']
potential_B=['Al','Ti', 'V', 'Cr','Mn','Fe','Co','Ni','Cu', 'Zn','Ga','Ge','Zr','Nb','Mo','Tc','Ru','Rh','Pd','Ag','Cd','Sn','Sb','Ta','W','Pt','Pb','Bi']
potential_X=['O3']
candidates=list(map("".join, itertools.product(potential_A, potential_B, potential_X))) # combine the elements
print('A list of %i candidate compositions was created.' %len(candidates))
```
%% Cell type:markdown id: tags:
For each of the compositions, one has to predict which cations are at the $A$ and $B$-sites of the perovskite structure (based on their relative sizes) and assign oxidation states for each cation (based on their periodic table group and common oxidation states). Oxidation states are not only needed for the evaluation of the tolerance factors, but also for the determination of the Shannon ionic radii, which are themselves oxidation-state dependant. Furthermore, based on oxidations states, one can assess if a given composition is charge-balanced or not. In order to perform these tasks, we use PredictABX3.py, provided as SI of Sci. Adv. 5, eaav0693 (2019). From the list of given compositions, we determine A and B and exclude the compounds that cannot be charge-balanced.
%% Cell type:code id: tags:
``` python
for i in candidates:
abx=PredictABX3(i) # PredictABX3 library predict that the given candidate is perovskite or not
if isinstance(abx.pred_A,float):
candidates.remove(i)
# else:
# print('For the compound %s, A = %s and B = %s' % (i, abx.pred_A, abx.pred_B))
print('A list of %i charge-balanced compositions was selected for further analysis:' %len(candidates))
#print(candidates)
```
%% Cell type:markdown id: tags:
We then collect all the input features for each of the given materials (including the DFT-calculated atomic feature) and evalute $t_{sisso}$, $t$ and $\tau$ for each of the materials.
%% Cell type:code id: tags:
``` python
def get_calc_feat(element): # For a given element gets the DFT calculated features from 'atomic_features.csv' file
calc_feat=[]
with open('data/perovskites_tolerance_factor/atomic_features.csv') as f:
for line in f:
line = line.split(',')
if line[2] == element:
calc_feat.append(float(line[12])*1E10) # rS (Angsrom)
calc_feat.append(float(line[14])*1E10) # rP (Angstrom)
calc_feat.append(float(line[1])) # Z_A
calc_feat.append(float(line[5])*6.241509074461E18) # HOMO (eV)
calc_feat.append(float(line[6])*6.241509074461E18) # LUMO (eV)
calc_feat.append(float(line[8])*6.241509074461E18) # EA (eV)
calc_feat.append(float(line[9])*6.241509074461E18) # IP (eV)
return calc_feat
exp_features=pd.read_csv('data/perovskites_tolerance_factor/TableS1.csv',index_col=0)
features=[]
for i in candidates:
abx=PredictABX3(i)
if i in exp_features.index: # If the predicted material is in the Table consider it
features.append([i, 0, abx.rA, abx.rB, abx.rX, abx.nA, abx.nB, abx.nX, abx.rA/abx.rB, abx.rA/abx.rX, abx.rB/abx.rX, abx.rX/abx.rB, *get_calc_feat(abx.pred_A), *get_calc_feat(abx.pred_B), *get_calc_feat(abx.X)])
#make a data frame of the new materials with the chosen features
exploit=pd.DataFrame(features, columns=['material', 'tau_pred_label',
'rA','rB','rX','nA','nB','nX','rA_rB_ratio','rA_rX_ratio','rB_rX_ratio','rX_rB_ratio',
'rS_A','rP_A','Z_A','HOMO_A','LUMO_A','EA_A','IP_A',
'rS_B','rP_B','Z_B','HOMO_B','LUMO_B','EA_B','IP_B',