This tutorial shows how tolerance factors for perovskite stability can be derived from data with the sure independece screening and sparsifying operator (SISSO) descriptor-identification approach.
R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: <spanstyle="font-style: italic;">SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates</span>, Phys. Rev. Materials 2, 083802 (2018) <ahref="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802"target="_blank">[PDF]</a> .
</div>
This tutorial is based on the following publication:
C. Bartel, C. Sutton, B. R. Goldsmith, R. Ouyang, C. B. Musgrave, Luca M. Ghiringhelli, M. Scheffler: <spanstyle="font-style: italic;">New tolerance factor to predict the stabilityof perovskite oxides and halides</span>, Sci. Adv. 5, eaav0693 (2019) <ahref="https://advances.sciencemag.org/content/advances/5/2/eaav0693.full.pdf"target="_blank">[PDF]</a> .
</div>
# Perovskites and the Goldschmidt tolerance factor
Perovskites are a class of materials having the basic formula $ABX_3$ and displaying a common structure in which a smaller metal cation $B$ (e.g. a transition metal) resides in corner-sharing octahedra of $X$ anions (e.g. $O^2-$, $Cl^-$, $Br^-$) and a larger A metal cation (e.g. alkali, alkaline earth or lanthanide) has a 12-fold coordination with the $X$ anions. This class of compounds has a remarkable variety of electronic, magnetic, optical, mechanical and transport properties, which is derived from the possibility of tuning the materials propertites by the composition. In fact, ca. 90% of the metallic natural elements of the periodic table can be stabilized in a perovskite structure. Therefore, perovskites are versatile materials suitable for a number of applications including photovoltaics, thermoelectrics and catalysis.
The first step to design new perovskites is to assess their stability. For this purpose, the Goldschmidt tolerance factor, $t$, has been extensively used to predict the stability of a material in the perovskite structure based on the (Shannon) ionic radii,$r_i$, of each ion on the chemical formula $(A,B,X)$:
$$ t=\frac{r_A+r_X}{\sqrt2(r_B+r_X)} $$
$t$ measures how much the $A$-site cation fits into the corner-sharing octahedral network in a cubic crystal structure. It indicates the compatibilty of a given set of ions with the ideal, cubic perovskite structure ($t\approx1$). Distortions from the cubic structure arise from size mismatch between cations and anions, which results in perovskite structures other than cubic (e.g. orthorhombic, rhombohedral). However, when these distortions are too large (e.g. $t<0.8$or$t>1.05$), the perovskite structure may be unstable and non-perovskites structures are formed.
The accuracy of the Goldschmidt factor is, however, often insufficient to screen for new potential materials and several modification have been proposed to overcome this issue. For instance, the input radii have been refined and the dimensionality of the factor has been increased. In this tutorial, we show how data can be used to derive tolerance factors for perovskite stability.
# The SISSO method for descriptor identifcation
A crucial step in data-driven materials science is the identification of descriptors, functions of parameters characterizing the phenomena governing a certain property. Descriptors allow distinguishing materials and, crucially, should be obtained (measured or calculated) more easily than the property itself, so that they can be evaluated for large sets of still unknown materials to search for new ones.
The sure independence screening and sparsifying operator (SISSO) method combines a symbolic-regression-based feature construction with compressed sensing for the identification of the best low-dimensional descriptors based on data. Within SISSO feature construction, an initial set of input features (the primary features) offered by the user are systematically combined by the application of mathematical operators (e.g. addition, multiplication, exponential, square root), generating a large space (containing up to a billion elements) of candidate features. The candidate features are then ranked according to their fit to the target property (number of materials in the overlap of convex-hull regions, for the case of classification problems) and the top-ranked features are further used for descriptor selection.
For futher details on compressed sensing methods (including SISSO) for descriptor identification, a dedicated notebook is available in the NOMAD toolkit.
The data consists of a list of 576 $ABX_3$ solids experimentally-characterized at ambient conditions, classified as stable or unstable at the perovskite structure, together with the following features:
<div>
<ul>
<li>$r_A, r_B, r_X$: Shannon ionic radii of each ion, with $r_A>r_B$ </li>
<li>$n_A, n_B, n_X$: oxidation satates of each ion </li>
#count the number of material in each class in the whole dataset
print('In the whole dataset, %s compositions are unstable and %s are stable.'%(df['exp_label'].value_counts().values[0],df['exp_label'].value_counts().values[1]))
```
%%%% Output: stream
In the whole dataset, 313 compositions are unstable and 263 are stable.
%% Cell type:code id: tags:
``` python
#split the data in 80% training and 20% testing
train,test=train_test_split(df,test_size=0.2)
```
%% Cell type:code id: tags:
``` python
#count the number of material in each class in the training/test sets
print('In the training set, %s compositions are unstable and %s are stable.'%(train['exp_label'].value_counts().values[0],train['exp_label'].value_counts().values[1]))
print('In the test set, %s compositions are unstable and %s are stable.'%(test['exp_label'].value_counts().values[0],test['exp_label'].value_counts().values[1]))
```
%%%% Output: stream
In the training set, 237 compositions are unstable and 223 are stable.
In the test set, 76 compositions are unstable and 40 are stable.
%% Cell type:code id: tags:
``` python
#sort the training data by the labels (stable/unstable)
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
train.sort_values(by=['exp_label'],inplace=True)
%% Cell type:markdown id: tags:
# Generate the candidate features space from the primary features and operators
The two ingredients to create the feature space with SISSO are the features to be used (i.e. the primary features) and the set of mathematical operators to be applied. Another input from the user is the number of times the operators are applied, the so-called rung (max_phi).
%% Cell type:code id: tags:
``` python
#define list of primary features - user has to choose
cols=[
# 'rA (AA)',
# 'rB (AA)',
# 'rX (AA)',
'nA (Unitless)',
# 'nB (Unitless)',
# 'nX(Unitless)',
'rA_rB_ratio (Unitless)',
# 'rA_rX_ratio (Unitless)',
'rB_rX_ratio (Unitless)',
# 'rS_A (AA)',
# 'rP_A (AA)',
# 'Z_A (elem_charge)',
# 'HOMO_A (eV)',
# 'LUMO_A (eV)',
# 'EA_A (eV)',
# 'IP_A (eV)',
# 'rS_B (AA)',
# 'rP_B (AA)',
# 'Z_B (elem_charge)',
# 'HOMO_B (eV)',
# 'LUMO_B (eV)',
# 'EA_B (eV)',
# 'IP_B (eV)',
# 'rS_X (AA)',
# 'rP_X (AA)',
# 'Z_X (elem_charge)',
# 'HOMO_X (eV)',
# 'LUMO_X (eV)',
# 'EA_X (eV)',
# 'IP_X (eV)'
]
```
%% Cell type:code id: tags:
``` python
#define list of operators - user has to choose
ops=[
# "add",
"sub",
# "abs_diff",
"mult",
"div",
# "exp",
# "neg_exp",
"inv",
# "sq",
# "cb",
# "sixth_power",
# "sqrt",
# "cbrt",
"log",
# "abs",
# "sin",
# "cos",
]
```
%% Cell type:code id: tags:
``` python
#feature space creation - user has to choose rung
fs=FeatureSpace.from_df(
train,
"exp_label",
ops,
cols,
max_phi=3,# rung
n_sis_select=100,
parameterize=False,
fix_c_0=False,
)
```
%%%% Output: stream
/home/sbailo/anaconda3/envs/sisso/lib/python3.8/site-packages/sisso/feature_creation/nodes/functions.py:41: RuntimeWarning: divide by zero encountered in true_divide
return np.divide(1.0, alpha * x + a) + c
/home/sbailo/anaconda3/envs/sisso/lib/python3.8/site-packages/sisso/feature_creation/nodes/functions.py:71: RuntimeWarning: divide by zero encountered in log
return alpha * np.log(x + a) + c
/home/sbailo/anaconda3/envs/sisso/lib/python3.8/site-packages/sisso/feature_creation/nodes/functions.py:71: RuntimeWarning: invalid value encountered in log
return alpha * np.log(x + a) + c
/home/sbailo/anaconda3/envs/sisso/lib/python3.8/site-packages/sisso/feature_creation/nodes/functions.py:26: RuntimeWarning: divide by zero encountered in true_divide
return np.divide(x[pvt:], x[:pvt] * alpha + a) + c
/home/sbailo/anaconda3/envs/sisso/lib/python3.8/site-packages/sisso/feature_creation/nodes/functions.py:26: RuntimeWarning: invalid value encountered in true_divide
return np.divide(x[pvt:], x[:pvt] * alpha + a) + c
%% Cell type:code id: tags:
``` python
#visualize the feature space created
fs.all_df
```
%%%% Output: execute_result
nA () rA_rB_ratio () rB_rX_ratio () nA - rA_rB_ratio () \
material
FeGaO3 3.0 1.25806 0.442857 -1.74194
TlBeCl3 1.0 3.77778 0.248619 2.77778
BaTeO3 2.0 1.65979 0.692857 -0.34021
AgCdBr3 1.0 1.34737 0.484694 0.34737
AgVO3 1.0 2.37037 0.385714 1.37037
... ... ... ... ...
EuCrO3 3.0 1.80645 0.442857 -1.19355
YAlO3 3.0 2.00000 0.385714 -1.00000
LuFeO3 3.0 1.60938 0.457143 -1.39062
YGaO3 3.0 1.74194 0.442857 -1.25806
TlMnCl3 1.0 2.04819 0.458564 1.04819
nA*rA_rB_ratio () nA - rB_rX_ratio () nA*rB_rX_ratio () \
print('From %s primary features and %s operators, SISSO generated a feature space containing %s candidate features.'%(len(cols),len(ops),fs.all_df.shape[1]))
```
%%%% Output: stream
From 3 primary features and 5 operators, SISSO generated a feature space containing 823 candidate features.
%% Cell type:markdown id: tags:
# Select the best candidate features
Next, the generated candidate features are selected in two steps. In a first step, they are ranked according to the number of materials $N$ that fall in overlapping regions of stable and unstable domains and only the top-ranked features are kept. The domain is defined as the range between the maximum and minimum values of the feature for each of the classes (stable and unstable). The best candidate features are those that present lower $N$. The lenght of the overlap domain, $S$, is used to rank features with similar $N$. $N$ and $S$ correspond to equations 2 and 3, respectively, in the original SISSO publication (Phys. Rev. Materials 2, 083802 (2018)).