%% Cell type:markdown id: tags:
# Symbolic regression: A tutorial
__Tom Purcell__, ___Emre Ahmetcik,___ _Angelo Ziletti, Runhai Ouyang, Sbailò Luigi, Matthias Scheffler, Luca M. Ghiringhelli \<luca@fhi-berlin.mpg.de\>_
<img src="assets/logo_MPG.png" width="80px"> <img src="assets/logo_MPG.png" width="80px">
<img src="assets/logo_NOMAD.png" width="80px"> <img src="assets/logo_NOMAD.png" width="80px">
<img src="assets/logo_HU.png" width="80px"> <img src="assets/logo_HU.png" width="80px">
<blockquote> <blockquote>
<br> <br>
__Abstract__ __Abstract__
This tutorial shows how to find descriptive parameters (short formulas) to predict materials properties using compressed sensing tools. As an example, we address the prediction of the zincblende (ZB) versus rocksalt (RS) relative stability of 82 octet binary materials.
The idea behind using compressed-sensing tools is the following: starting from simple physical quantities ("building blocks", here properties of the constituent free atoms such as orbital radii), millions (or billions) of candidate formulas are generated by applying arithmetic operations that combine the building blocks, for example by forming sums and products of them. These candidate formulas constitute the so-called "feature space". A compressed-sensing-based method is then used to select only the few of these formulas that explain the data, as described in
- L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler: _Big Data of Materials Science: Critical Role of the Descriptor_, Phys. Rev. Lett. 114, 105503 (2015) [[PDF]](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.105503)
In this tutorial we use the Sure Independence Screening and Sparsifying Operator (SISSO) as introduced in
- R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: _SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates_, Phys. Rev. Materials 2, 083802 (2018) [[PDF]](https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802)
<br>
</blockquote>
<br>
<div style="text-align: right">
<a href="https://gitlab.mpcdf.mpg.de/nomad-lab/analytics-compressed-sensing" target="_blank">[Gitlab]</a>, Last updated: 2023-03-12
</div>
%% Cell type:markdown id: tags:
## Introduction to the compressed sensing methods
The feature space is generated by creating a list of analytical expressions (the derived features), obtained by combining <i> primary features </i> and arithmetic operations. In this example, the primary features are properties of isolated atoms, such as electron affinity or the radial extension of valence orbitals. We put all $m$ derived features into a descriptor matrix $\mathbf{D} \in \mathbb{R}^{82 \times m}$ where each column stands for a derived feature and each row for a compound. An $\ell_0$-regularization
$\text{argmin}_{\mathbf{c} \in \mathbb{R}^{m}} \{\|\mathbf{P} - \mathbf{D}\mathbf{c}\|^2_2 +\lambda \|\mathbf{c}\|_0\}$
determines those few feature columns which best approximate a property vector $\mathbf{P} \in \mathbb{R}^{82}$ (i.e., the RS vs. ZB energy differences). The subscript 0 stands for the $\ell_0$-quasinorm, which counts the number of non-zero elements of $\mathbf{c}$, and $\lambda > 0$ is called the regularization parameter. Performing the $\ell_0$-regularization quickly becomes computationally infeasible, and approximations (e.g., LASSO, orthogonal matching pursuit) are often needed, since in practice the $\ell_0$-regularization has to be solved combinatorially: all singletons, pairs, triplets, ... $n$-tuples (up to the selected maximum dimension of the descriptor) are listed and for each set a least-squares regression is performed. The $n$-tuple that gives the lowest mean square error in the least-squares regression fit is selected as the resulting $n$-dimensional descriptor.
### The LASSO method
A convex relaxation of this problem is provided by the Least Absolute Shrinkage and Selection Operator (LASSO):
$\text{argmin}_{\mathbf{c} \in \mathbb{R}^{m}} \{\|\mathbf{P} - \mathbf{D}\mathbf{c}\|^2_2 +\lambda \|\mathbf{c}\|_1\}$.
Under certain conditions on the matrix $\mathbf{D}$, it can find the exact solution of, or a good approximation to, the $\ell_0$-regularization problem.
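As a minimal illustration of this sparse-recovery behavior on synthetic data (not the materials data used below; the array sizes, random seed, and `alpha` value are arbitrary choices), the following sketch generates a target from only three features and lets scikit-learn's `Lasso` try to identify them:
``` python
import numpy as np
from sklearn.linear_model import Lasso

# toy sensing matrix: 50 samples, 200 candidate features
rng = np.random.default_rng(0)
D_toy = rng.normal(size=(50, 200))
c_true = np.zeros(200)
c_true[[3, 17, 42]] = [1.5, -2.0, 0.7]          # only three non-zero coefficients
P_toy = D_toy @ c_true + 0.01 * rng.normal(size=50)

lasso_toy = Lasso(alpha=0.1)
lasso_toy.fit(D_toy, P_toy)
print("non-zero coefficients found at indices:", np.flatnonzero(lasso_toy.coef_))
```
With a well-chosen $\lambda$ (here `alpha`), the dominant non-zero entries are typically recovered; too small a value yields many spurious non-zero coefficients, too large a value shrinks everything to zero.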
### The SISSO method
SISSO works iteratively. In the first iteration, a number $k$ of features is collected that have the largest correlation (scalar product) with $\mathbf{P}$. The feature with the largest correlation is simply the 1D descriptor. Next, a residual is constructed as the error made in the first iteration. A new set of $k$ features is then selected as those having the largest correlation with the residual. The 2D descriptor is the pair of features that yields the smallest fitting error upon least-squares regression, among all possible pairs contained in the union of the sets selected in this and the first iteration. In each subsequent iteration a new residual is constructed as the error made in the previous iteration, and a new set of $k$ features is extracted as those that have the largest correlation with the new residual. The $n$D descriptor is the $n$-tuple of features that yields the smallest fitting error upon least-squares regression, among all possible $n$-tuples contained in the union of the sets obtained in this and all previous iterations. If $k=1$ the method collapses to the so-called orthogonal matching pursuit.
<img src="assets/SISSO.png" width="800"> <img src="assets/SISSO.png" width="800">
%% Cell type:markdown id: tags:
# Import required modules
%% Cell type:code id: tags:
``` python
import contextlib
import numpy as np
import os
import pandas as pd
import shutil
from itertools import combinations, product
from time import time
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
import scipy.stats as ss
import warnings
from collections import Counter
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from IPython.display import HTML
from jupyter_jsmol import JsmolView
import pathlib
from pathlib import Path
from compressed_sensing.sisso import SissoRegressor
from compressed_sensing.combine_features import combine_features
from compressed_sensing.scatter_plot import show_scatter_plot
from compressed_sensing.visualizer import Visualizer
from sissopp import Inputs, FeatureSpace, SISSORegressor, FeatureNode, Unit
from sissopp.py_interface import read_csv
from sissopp.py_interface.import_dataframe import get_unit
from sissopp.sklearn import SISSORegressor as SISSORegressor_skl
from pysr import PySRRegressor
from sympy import Abs, root, sign
warnings.filterwarnings('ignore')
from bokeh.io import export_png
```
%% Cell type:code id: tags:
``` python
@contextlib.contextmanager
def cwd(path, mkdir=False, debug=False, clear=True):
    # context manager that temporarily changes the working directory to `path`
    CWD = os.getcwd()
    if os.path.exists(path) is False and mkdir:
        os.makedirs(path)
    if debug:
        # in debug mode, keep the directory (and the files written into it) afterwards
        os.chdir(path)
        yield
        os.chdir(CWD)
        return
    os.chdir(path)
    yield
    os.chdir(CWD)
    if clear:
        # remove the temporary directory and everything written into it
        shutil.rmtree(path)
```
%% Cell type:markdown id: tags:
# Get the data
Let us load the data from the NOMAD Archive and the atomicfeaturespackage. It consists of RS-ZB energy differences (in eV/atom) of the 82 octet binary compounds, structural information (e.g. atomic positions) of the materials, and properties of the atomic constituents. The following atomic features are considered:
<div >
<ul>
<li>Z: atomic number</li>
<li>period: period in the periodic table</li>
<li>IP: ionization potential</li>
<li>EA: electron affinity</li>
<li>E_HOMO: energy of the highest occupied atomic orbital</li>
<li>E_LUMO: energy of the lowest unoccupied atomic orbital</li>
<li>r_(s, p, d): radius where the radial distribution of the s, p or d orbital has its maximum.</li>
</ul>
</div>
%% Cell type:code id: tags:
``` python
def write_xyz(df):
    # write a 3x3x3 supercell of each compound to an .xyz file for later visualization
    n_atoms = 3 * 3 * 3 * 2
    for compound, (A, B, pos, lat) in df[['A', 'B', 'atomic_positions', 'lattice_vectors']].iterrows():
        lat_x, lat_y, lat_z = lat
        atoms = [A, B]
        file = open("data/compressed_sensing/structures/"+compound+".xyz", "w")
        file.write("%d\n\n" % n_atoms)
        for i, j, k, n in product(range(3), range(3), range(3), range(2)):
            # translate the two basis atoms by integer multiples of the lattice vectors
            xyz = pos[n].copy()
            xyz += i*lat_x
            xyz += j*lat_y
            xyz += k*lat_z
            file.write(atoms[n])
            file.write("\t%f\t%f\t%f\n" % (xyz[0],
                                           xyz[1],
                                           xyz[2]))
        file.close()
df_target = pd.read_hdf('compressed_sensing.hdf5', 'target')
path_structure = './data/compressed_sensing/structures/'
pathlib.Path(path_structure).mkdir(parents=True, exist_ok=True)
write_xyz(df_target)
df_target = df_target[['A', 'B', 'energy_diff', 'min_struc_type']]
```
%% Cell type:code id: tags:
``` python
df_features = pd.read_hdf('compressed_sensing.hdf5', 'features')
df_features
```
%% Cell type:code id: tags:
``` python
def sort_AB_wrt_electronegativity(df_target, df_features):
    df_features['EN'] = -(df_features['IP'] + df_features['EA'])/2
    df_en = df_target[['A', 'B']].merge(df_features['EN'], left_on='A', right_index=True)
    df_en = df_en.merge(df_features['EN'], left_on='B', right_index=True, suffixes=('_A', '_B'),)
    df_en = df_en.sort_index()
    AB = np.where(df_en['EN_A'] < df_en['EN_B'], [df_en['A'], df_en['B']], [df_en['B'], df_en['A']])
    df_target.index = np.where(AB[0]==AB[1], AB[0]+'2', AB.sum(0),)
    df_target[['A', 'B',]] = AB.T
    df_features.drop('EN', axis=1, inplace=True)
    return df_target

# sort A and B in chemical formula AB with respect to electronegativity of A and B
df_target = sort_AB_wrt_electronegativity(df_target, df_features)
```
%% Cell type:code id: tags:
``` python
def merge_target_feature(df_target, df_features, suffixes=('(A)', '(B)')):
    df = df_target.merge(df_features, left_on='A', right_index=True)
    df = df.merge(df_features, left_on='B', right_index=True, suffixes=suffixes)
    return df

# merge target and feature data frame
df = merge_target_feature(df_target, df_features)
df
```
%% Cell type:markdown id: tags:
Let us look at the distribution of the energy differences.
%% Cell type:code id: tags:
``` python
plt.hist(df['energy_diff'].tolist(), bins=30)
plt.xlabel(r'$\Delta E$')
plt.ylabel('Counts')
plt.show()
print('Standard deviation: %.3f eV/atom' % df['energy_diff'].values.std())
```
%% Cell type:markdown id: tags:
With the SISSO method, we are able to predict the energy differences from the atomic features with an accuracy of 0.035 eV/atom. However, due to computational limits we will target an accuracy of around 0.1 eV/atom in this tutorial.
Now let us define a function `get_data` that uses the data frame `df` to define the target vector $\mathbf{P}$ of energy differences and constructs the descriptor matrix $\mathbf{D}$ of combined features. The arguments `selected_feature_list` and `allowed_operations` specify which primary features and which arithmetic operations should be used to build the new derived features.
%% Cell type:code id: tags:
``` python
def get_data(selected_feature_list, allowed_operations):
    # add both '(A)', '(B)' to each feature
    selected_featureAB_list = [f+A_or_B for f in selected_feature_list for A_or_B in ['(A)', '(B)']]
    # extract energy differences and selected features from df
    P = df['energy_diff'].values
    df_features = df[selected_featureAB_list]
    # derive new features using allowed_operations
    df_combined = combine_features(df=df_features, allowed_operations=allowed_operations)
    return P, df_combined
```
%% Cell type:code id: tags:
``` python
# selected_feature_list = ['IP', 'EA', 'E_HOMO', 'E_LUMO', 'r_s', 'r_p', 'r_d', 'Z', 'period']
selected_feature_list = ['r_s', 'r_p']
# allowed_operations = ['+', '-', '|-|', '/', '^2', '^3', 'exp']
allowed_operations = ['+']
P, df_D = get_data(selected_feature_list, allowed_operations)
# print derived features
df_D
```
%% Cell type:markdown id: tags:
# Determining low-dimensional descriptors with the $\ell_0$ method
<div style="list-style:disc; margin: 2px;padding: 10px;border: 0px;border:8px double green; font-size:16px;padding-left: 32px;padding-right: 22px; width:89%">
<li> Perform an $\ell_0$-regularization to identify the best low dimensional descriptors using the primary features.</li>
<li> Show that non-linear functions of the primary features improve the models significantly. </li>
<li> See that the $\ell_0$-regularization can rapidly become computationally infeasible.</li>
</div>
Our target is to find the best low-dimensional descriptor for a linear model $\mathbf{P} = \mathbf{D^\ast}\mathbf{c^\ast}$, where $\mathbf{c^\ast}$ is the vector of nonzero elements of the solution vector $\mathbf{c}$ and $\mathbf{D^\ast}$ is the matrix of the columns of $\mathbf{D}$ corresponding to the nonzero elements of $\mathbf{c}$. The $\ell_0$ regularization
$\text{argmin}_{\mathbf{c} \in \mathbb{R}^{m}} \{\|\mathbf{P} - \mathbf{D}\mathbf{c}\|^2_2 +\lambda \|\mathbf{c}\|_0\}$
provides exactly what we want. It is defined in the following function and solved combinatorially:
%% Cell type:code id: tags:
``` python
def L0(P, D, dimension):
    n_rows, n_columns = D.shape
    # append a column of ones for the intercept
    D = np.column_stack((D, np.ones(n_rows)))
    SE_min = np.inner(P, P)
    coef_min, permu_min = None, None
    # exhaustive search: a least-squares fit for every `dimension`-tuple of features
    for permu in combinations(range(n_columns), dimension):
        D_ls = D[:, permu + (-1,)]
        coef, SE, __1, __2 = np.linalg.lstsq(D_ls, P, rcond=-1)
        try:
            if SE[0] < SE_min:
                SE_min = SE[0]
                coef_min, permu_min = coef, permu
        except IndexError:
            # lstsq returns an empty residual array for rank-deficient fits; skip those
            pass
    RMSE = np.sqrt(SE_min/n_rows)
    return RMSE, coef_min, permu_min
```
%% Cell type:markdown id: tags:
Perform the $\ell_0$-regularization for different dimensions (numbers of non-zero coefficients in the model) and see the root mean square errors (RMSE) and the selected features.
%% Cell type:code id: tags:
``` python
df_D
```
%% Cell type:code id: tags:
``` python
selected_feature_list = ['r_s', 'r_p', 'r_d', 'EA', 'IP']
allowed_operations = []
P, df_D = get_data(selected_feature_list, allowed_operations)
features_list = df_D.columns.tolist()
D = df_D.values
print(" RMSE Best descriptor")
for dim in range(1,11):
    RMSE, coefficients, selected_indices = L0(P,D,dim)
    print('%2sD: %.5f' % (dim, RMSE), [features_list[i] for i in selected_indices])
```
%% Cell type:markdown id: tags:
The result of performing the $\ell_0$-regularization shows that the accuracy converges fast, i.e., we could leave out some components of the linear model without decreasing the accuracy. The second observation is that a linear model of the atomic features is not enough to describe the RS-ZB energy differences. A way out could be to use a non-linear machine-learning model, e.g. kernel ridge regression or a neural network, instead of linear regression. Another way is to put the non-linearity into the descriptors by building algebraic combinations of the atomic features and mapping the few best of these more complex features onto the target, again with a linear model.
Run the following script to build larger feature spaces of more complex features and select the best 1D, 2D and 3D descriptor for a linear model using $\ell_0$-regularization. Plot the results afterwards. How does the accuracy of the models change? How do the feature-space size and the dimension of the descriptors affect the time needed to solve the $\ell_0$-problem?
%% Cell type:code id: tags:
``` python
selected_feature_list = ['r_s', 'r_p', 'r_d', 'EA', 'IP']
op_lists = [[], ['+','|-|'], ['+','|-|','exp'], ['+','|-|','exp', '^2']]
X = []
Errors, Time = np.empty([3, len(op_lists)]), np.empty([3, len(op_lists)])
for n_op, allowed_operations in enumerate(op_lists):
    P, df_D = get_data(selected_feature_list, allowed_operations)
    features_list = df_D.columns.tolist()
    D = df_D.values
    number_of_features = len(features_list)
    X.append(number_of_features)
    for dim in range(1,4):
        t1 = time()
        RMSE, coefficients, selected_indices = L0(P,D,dim)
        t2 = time()-t1
        Time[dim-1][n_op] = t2
        Errors[dim-1][n_op] = RMSE
        print("n_features: %s; %sD RMSE: %.3f best features: %s"
              %(len(features_list), dim, RMSE, [features_list[i] for i in selected_indices]))
```
%% Cell type:code id: tags:
``` python
# plot
f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, figsize=(12,8))
ax1.set_xlabel('Feature space size')
ax2.set_xlabel('Feature space size')
ax1.set_ylabel('RMSE [eV/atom]')
ax2.set_ylabel('Time [s]')
#ax2.set_yscale('log')
for dim in range(1,4):
    ax1.plot(X, Errors[dim-1], 's-', label='%sD' %dim)
    ax2.plot(X, Time[dim-1], 's-', label='%sD' %dim)
ax2.legend(loc='best')
plt.show()
```
%% Cell type:markdown id: tags:
Assume now that we would like to include thousands or millions of (more) complex features to obtain more accurate models...
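To get a feeling for why the brute-force $\ell_0$ approach then becomes intractable, we can simply count the number of least-squares fits it would require; the feature-space sizes below are round numbers chosen only for illustration.
``` python
from math import comb

# number of n-tuples (and hence least-squares fits) the exhaustive L0 search needs
for m in (115, 10_000, 1_000_000):
    for dim in (1, 2, 3):
        print(f"m = {m:>9,}   {dim}D: {comb(m, dim):,} fits")
```
Already for a million candidate features, the 3D search would require on the order of $10^{17}$ least-squares fits, which is why screening strategies such as SISSO are needed.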
%% Cell type:markdown id: tags:
# Approximations to the $\ell_0$ method
<div style="list-style:disc; margin: 2px;padding: 10px;border: 0px;border:8px double green; font-size:16px;padding-left: 32px;padding-right: 22px; width:89%">
<li >Perform a LASSO minimization and the SISSO method.</li>
<li >Compare the solutions with the ones from the $\ell_0$ method.</li>
</div>
### The LASSO
One state-of-the-art approximation to the $\ell_0$ method is the LASSO:
$\text{argmin}_{\mathbf{c} \in \mathbb{R}^{m}} \{\|\mathbf{P} - \mathbf{D}\mathbf{c}\|^2_2 +\lambda \|\mathbf{c}\|_1\}$.
Before performing the LASSO regression we standardize the data to have mean 0 and variance 1, since otherwise the $\ell_2$-norm of a column would bias its contribution to the model. <br>
Note that we can also use the LASSO only for feature selection: we then fit a least-squares model with the selected features afterwards, instead of using the LASSO model directly.
%% Cell type:code id: tags:
``` python
def lasso_fit(lam, P, D, feature_list):
    # LASSO
    D_standardized = ss.zscore(D)
    lasso = Lasso(alpha=lam)
    lasso.fit(D_standardized, P)
    coef = lasso.coef_
    # get strings of selected features
    selected_indices = coef.nonzero()[0]
    selected_features = [feature_list[i] for i in selected_indices]
    # get RMSE of LASSO model
    P_predict = lasso.predict(D_standardized)
    RMSE_LASSO = np.linalg.norm(P-P_predict) / np.sqrt(82.)
    # get RMSE for least-squares fit on the selected features
    D_new = D[:, selected_indices]
    D_new = np.column_stack((D_new, np.ones(82)))
    RMSE_LS = np.sqrt(np.linalg.lstsq(D_new, P, rcond=-1)[1][0]/82.)
    return RMSE_LASSO, RMSE_LS, coef, selected_features
```
%% Cell type:markdown id: tags:
$\lambda$ regulates the sparsity of the coefficient vector of the model. Get the data and try different $\lambda$ by adjusting the variable `lam`. How well does LASSO (directly, or with a least-squares fit afterwards) approximate the L0 method (when the same feature space is used for both)?
%% Cell type:code id: tags:
``` python
# import data
selected_feature_list = ['r_s', 'r_p', 'r_d', 'EA', 'IP']
allowed_operations = ['+','|-|','exp', '^2']
P, df_D = get_data(selected_feature_list, allowed_operations)
D = df_D.values
features_list = df_D.columns.tolist()
```
%% Cell type:code id: tags:
``` python
# change lam between 0.02 and 0.34, e.g. 0.34, 0.30, 0.20, 0.13, 0.10, 0.02
lam = 0.2
RMSE_LASSO, RMSE_LS, coef, selected_features = lasso_fit(lam, P, D, features_list)
plt.bar(range(len(coef)), np.abs(coef))
plt.xlabel("Coefficient index $i$")
plt.ylabel("$|c_i|$")
print("lambda: %.3f\t dimension of descriptor: %s\t RMSE_LASSO: %.3f\t RMSE_LS: %.3f"
      %(lam, len(selected_features), RMSE_LASSO, RMSE_LS))
print(pd.DataFrame({'features':np.array(selected_features), 'abs(nonzero_coefs_LASSO)': np.abs(coef[coef.nonzero()])}))
plt.show()
```
%% Cell type:markdown id: tags:
Hint:
Compare these results to the L0 results you have obtained before from the same feature space, copied and pasted in here:<br>
"Number of total features generated: 115 <br>
features: 115; 1D RMSE: 0.296667841349 best features: ['(r_p(A)+r_d(B))'] <br>
features: 115; 2D RMSE: 0.194137970112 best features: ['(r_s(B)+r_p(A))', '(r_s(B)+r_p(A))^2'] <br>
features: 115; 3D RMSE: 0.170545592998 best features: ['(r_s(B)+r_p(A))', '(r_s(B)+r_p(A))^2', 'exp(r_s(B)+r_p(A))']"
%% Cell type:markdown id: tags:
# Run the SISSO method with a (relatively) big feature space
<div style="list-style:disc; margin: 2px;padding: 10px;border: 0px;border:8px double green; font-size:16px;padding-left: 32px;padding-right: 22px; width:89%">
<li>Reproduce the results from the <a href="http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.105503" target="_blank">reference publication</a> by including further features.</li>
<li>Visualize the 2D descriptors in a structure map.</li>
<li>Experiment with different settings and investigate the influence of the input parameters on the results. (OPTIONAL)</li>
</div>
Note the size of the feature space, the time needed to run the code, and the accuracy (using the default settings)!
To deal with larger feature spaces, note that in this chapter an optimized C++ implementation of the SISSO code will be used.
%% Cell type:code id: tags:
``` python
# here we define a different dataframe to make it compatible with the C++ implementation of SISSO
# merge target and feature data frame
df_plus = merge_target_feature(df_target, df_features, suffixes=('_A', '_B'))
# add Zunger's r_pi and r_sigma
df_plus['r_pi'] = abs(df_plus['r_p_A'] - df_plus['r_s_A']) + abs(df_plus['r_p_B'] - df_plus['r_s_B'])
df_plus['r_sigma'] = abs(df_plus['r_p_A'] + df_plus['r_s_A'] - (df_plus['r_p_B'] + df_plus['r_s_B']))
df_plus = df_plus.rename(columns={'Z_A': 'Z_A (nuc_charge)',
                                  'Z_B': 'Z_B (nuc_charge)',
                                  'period_A': 'period_A (unitless)',
                                  'period_B': 'period_B (unitless)',
                                  'IP_A': 'IP_A (eV_IP)',
                                  'IP_B': 'IP_B (eV_IP)',
                                  'EA_A': 'EA_A (eV_IP)',
                                  'EA_B': 'EA_B (eV_IP)',
                                  'E_HOMO_A': 'E_HOMO_A (eV)',
                                  'E_HOMO_B': 'E_HOMO_B (eV)',
                                  'E_LUMO_A': 'E_LUMO_A (eV)',
                                  'E_LUMO_B': 'E_LUMO_B (eV)',
                                  })
df_plus_reduced = df_plus.drop(['A', 'B', 'min_struc_type'], axis=1)
```
%% Cell type:code id: tags:
``` python
n_nonzero_coefs = 3
n_features_per_sis_iter = 50
selected_features = ['r_s_A', 'r_p_A', 'r_d_A', 'EA_A', 'IP_A', 'r_s_B', 'r_p_B', 'r_d_B', 'EA_B', 'IP_B']
selected_ops = ['add', 'abs_diff', 'exp', 'sq', 'div']
inputs = read_csv(
    df_plus_reduced,
    prop_key="energy_diff",
    cols=selected_features,
    max_rung=2,
    leave_out_frac=0.0
)
inputs.allowed_ops = selected_ops
inputs.n_sis_select = n_features_per_sis_iter
inputs.n_dim = 3
inputs.max_rung = 2
inputs.n_residual = 1
# inputs.n_model_store = 1
inputs.calc_type = "regression"
inputs.leave_out_inds = []
inputs.task_sizes_train = [82]
inputs.task_sizes_test = [0]
inputs.sample_ids_train = df_plus_reduced.index.tolist()
inputs.prop_train = df_plus_reduced["energy_diff"].to_numpy()
inputs.prop_test = np.array([])
inputs.prop_label = "energy_diff"
inputs.prop_unit = Unit("eV")
inputs.task_names = ["all_mats"]
feat_space = FeatureSpace(inputs)
sisso = SISSORegressor(inputs, feat_space)
with cwd(Path("SISSO_training"), mkdir=True):
    sisso.fit()
for i in range(n_nonzero_coefs):
    print(str(i+1)+'D model')
    print("RMSE: {:.4} | Descriptor: {}".format(sisso.models[i][0].rmse, sisso.models[i][0]))
    string = "c0:{:.4}".format(sisso.models[i][0].coefs[0][-1])
    for j in range(i+1):
        string = string + str(" | a"+str(j)+":{:.4}".format(sisso.models[i][0].coefs[0][j]))
    print(string + '\n')
```
%% Cell type:markdown id: tags:
Next, plot an interactive 2D structure map using the 2D descriptor. Run the SISSO method for two non-zero coefficients.
%% Cell type:code id: tags:
``` python
n_nonzero_coefs = 2
n_features_per_sis_iter = 50
selected_features = ['r_s_A', 'r_p_A', 'r_d_A', 'EA_A', 'IP_A', 'r_s_B', 'r_p_B', 'r_d_B', 'EA_B', 'IP_B']
selected_ops = ['add', 'abs_diff', 'exp', 'sq', 'div']
inputs = read_csv(
    df_plus_reduced,
    prop_key="energy_diff",
    cols=selected_features,
    max_rung=2,
    leave_out_frac=0.0
)
inputs.allowed_ops = selected_ops
inputs.n_sis_select = n_features_per_sis_iter
inputs.n_dim = 2
inputs.max_rung = 2
inputs.n_residual = 1
# inputs.n_model_store = 1
inputs.calc_type = "regression"
inputs.leave_out_inds = []
inputs.task_sizes_train = [82]
inputs.task_sizes_test = [0]
inputs.sample_ids_train = df_plus_reduced.index.tolist()
inputs.prop_train = df_plus_reduced["energy_diff"].to_numpy()
inputs.prop_test = np.array([])
inputs.prop_label = "energy_diff"
inputs.prop_unit = Unit("eV")
inputs.task_names = ["all_mats"]
feat_space = FeatureSpace(inputs)
sisso = SISSORegressor(inputs, feat_space)
with cwd(Path("SISSO_training"), mkdir=True):
    sisso.fit()
for i in range(n_nonzero_coefs):
    print(str(i+1)+'D model')
    print("RMSE: {:.4} | Descriptor: {}".format(sisso.models[i][0].rmse, sisso.models[i][0]))
    string = "c0:{:.4}".format(sisso.models[i][0].coefs[0][-1])
    for j in range(i+1):
        string = string + str(" | a"+str(j)+":{:.4}".format(sisso.models[i][0].coefs[0][j]))
    print(string + '\n')
    print(sisso.models[i][0].latex_str)
```
%% Cell type:markdown id: tags:
Below is a plot containing all octet binary materials, shown in the coordinate system given by the SISSO descriptor. Hovering over a point in the plot displays information about that compound. Clicking a point opens an interactive 3D visualization of its most stable structure below the plot. To facilitate the comparison between different materials, two separate viewers are available. The viewer to be used can be selected with a check box placed at the top right of each visualizer. Once a compound is selected from the plot, its marker changes to a viewer-specific shape, which is shown next to the checkbox. The name of the displayed compound is shown on top of the visualizer; it is also possible to type the name of another compound into the text box and display it by clicking the "Display" button. The new compound is then also marked on the plot, which makes it easy to identify the position of each compound. Note that the structures of all compounds were stored as .xyz files at the beginning of the tutorial, using the function 'write_xyz'.
The markers represent the compounds and their colors the reference energy differences. How well does the descriptor separate the compounds with respect to their crystal structure?
%% Cell type:code id: tags:
``` python
visualizer = Visualizer(df, sisso, feat_space)
visualizer.show()
```
%% Cell type:markdown id: tags:
# Predicting new materials (extrapolation)
<div style="list-style:disc; margin: 2px;padding: 10px;border: 0px;border:8px double green; font-size:16px;padding-left: 32px;padding-right: 22px; width:89%">
<li>Perform a leave-one-out cross-validation (LOOCV) using SISSO.</li>
<li>Analyze the prediction accuracy and how often the same descriptor is selected.</li>
</div>
We have seen that we can fit the energy differences of materials accurately. But what about predicting the energy difference of a 'new' material (one that was not included when determining the model)? We test the prediction performance via LOOCV. In a LOOCV, the following procedure is performed for each material: the selected material is excluded, the model is built on the remaining materials, and the model accuracy is tested on the excluded material. This means that we need to run the SISSO fit 82 times. <br>
Get the data in the next cell and run the LOOCV one cell after. Note that running the LOOCV could take up to ten minutes. Use the remaining two cells of this chapter to analyse the results.<br>
How does the prediction error compare to the fitting error? How often is the same descriptor selected? Are there materials with an unusually high or low error?
%% Cell type:code id: tags:
``` python
# Leave-one-out cross-validation
n_compounds = len(P)
dimensions = range(1, 4)
features_count = [[] for i in range(3)]
P_predict = np.empty([len(dimensions), n_compounds])
chemical_formulas = df_target.index.tolist()
sisso = SISSORegressor_skl(
    allowed_ops = selected_ops,
    n_sis_select = 100,
    n_dim = 3,
    max_rung = 2,
    n_residual = 1,
    n_models_store = 1,
    verbose = False,
)
loo = LeaveOneOut()
df_sisso_cv = df_plus_reduced.drop(columns=["energy_diff"])
with cwd(Path("SISSO_CV"), mkdir=True):
    for indices_train, index_test in loo.split(P):
        i_cv = index_test[0]
        print('%2s) Leave out %s: Ediff_ref = %.3f eV/atom'
              % (index_test[0]+1, chemical_formulas[i_cv], P[i_cv]))
        sisso.fit(df_sisso_cv.iloc[indices_train, :], P[indices_train])
        data_dct = {col.split("(")[0].strip(): df_sisso_cv.iloc[i_cv, cc] for cc, col in enumerate(df_sisso_cv.columns)}
        predicted_value = [models[0].eval(data_dct) for models in sisso.solver.models[:]]
        P_predict[:, i_cv] = predicted_value
        for dim in dimensions:
            print('Ediff_predicted(%sD) = %.3f eV/atom' %(dim, predicted_value[dim - 1]))
        print('-----')
```
%% Cell type:code id: tags:
``` python
# plot prediction errors
prediction_errors = np.linalg.norm(P-P_predict, axis=1)/np.sqrt(n_compounds)
xs = [P]*3
ys = P_predict
legend = ['%sD, RMSE = %.3f eV/atom' %(dim, prediction_errors[dim-1]) for dim in dimensions]
data_point_labels = [df.index.tolist()]*3
fig = show_scatter_plot(xs, ys, data_point_labels=data_point_labels,
                        x_label='E_diff_DFT', y_label='E_diff_predicted', legend=legend, unit='eV/atom')
```
%% Cell type:markdown id: tags:
## Kernel ridge regression
It is instructive to compare the performance of the just-identified model with a model trained with the popular kernel ridge regression (KRR) approach, using the same list of atomic features as input.
KRR solves an $\ell_2$-regularized linear regression problem with (typically) a nonlinear kernel.
This can be described in the following manner. The $\ell_2$-regularized linear regression problem is:
$\text{argmin}_{\mathbf{c} \in \mathbb{R}^{m}} \{\|\mathbf{P} - \mathbf{D}\mathbf{c}\|^2_2 +\lambda \|\mathbf{c}\|_2^2\}$.
Its solution is:
$\mathbf{c} = \left( \mathbf{D}^\top \mathbf{D} + \lambda \mathbf{I} \right)^{-1} \mathbf{D}^\top \mathbf{P}$
According to the Hilbert-space representer theorem, we can write the solution vector $\mathbf{c}$ as a linear expansion over the data points represented according to the chosen descriptor. In other words, $\mathbf{c}$ can be represented as a linear expansion over the rows $\mathbf{d}_j$ of the matrix $\mathbf{D}$:
$\mathbf{c} = \sum_j \alpha_j \mathbf{d}_j $
This rewriting leads to the equivalent problem:
$\text{argmin}_{\boldsymbol{\alpha} \in \mathbb{R}^{N}} \{\|\mathbf{P} - \mathbf{K}\boldsymbol{\alpha}\|^2_2 +\lambda \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} \} \quad (1)$,
where the kernel $\mathbf{K}$ is: $K_{ij} = \langle \mathbf{d}_i, \mathbf{d}_j \rangle$.
The rewritten problem has the solution $\boldsymbol{\alpha} = \left( \mathbf{K} + \lambda \mathbf{I} \right)^{-1} \mathbf{P} \quad (2)$
So far, this rewriting did not add anything to the regularized linear regression approach. If, however, we now expand the vector $\mathbf{c}$ in terms of nonlinear functions $\Phi()$ of the vectors $\mathbf{d}_j$:
$\mathbf{c} = \sum_j \alpha_j \Phi(\mathbf{d}_j) $
and define the kernel as $K_{ij} = \langle \Phi(\mathbf{d}_i), \Phi(\mathbf{d}_j) \rangle$, one can prove that the same problem (1) with the same solution (2) holds. Normally, one does not specify the function $\Phi()$ but rather the kernel;
in fact, $\Phi()$ does not need to be known at all. We refer to the specialized literature on KRR for more details on this method.
Here, we use the Gaussian kernel, i.e., the most popular one:
$K(x, y) = \exp(-\gamma \|x-y\|^2)$.
The parameters $\lambda$ and $\gamma$ are called <i>hyperparameters</i> and are set via cross-validation (CV).
Specifically, at each LOOCV step, the hyperparameters (the $\ell_2$-regularization parameter $\lambda$ and the inverse Gaussian width $\gamma$) are optimized via a grid search with a so-called 5-fold cross-validation on the training set. This means splitting the training set into 5 subsets: 1 subset is used for evaluating the performance (RMSE) and the other 4 for training. The procedure is repeated 5 times by changing the test subset, and the overall performance is the RMSE over all repetitions.
A minimal numerical sketch of the closed-form solutions (1)-(2) is given right below. What can one observe by comparing the SISSO and KRR results?
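The following sketch uses hypothetical toy data (with `gamma` and `lam` chosen arbitrarily) to build a Gaussian kernel by hand, solve the dual problem as in Eq. (2), and check that the result agrees with scikit-learn's `KernelRidge`:
``` python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
D_toy = rng.normal(size=(20, 5))   # 20 toy "materials", 5 atomic features
P_toy = rng.normal(size=20)        # toy target values
gamma, lam = 0.1, 1e-2             # arbitrary hyperparameters for this sketch

# Gaussian kernel K_ij = exp(-gamma * ||d_i - d_j||^2)
sq_dists = np.sum((D_toy[:, None, :] - D_toy[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Dual solution alpha = (K + lambda I)^(-1) P, cf. Eq. (2)
alpha = np.linalg.solve(K + lam * np.eye(len(P_toy)), P_toy)
pred_manual = K @ alpha

# Same model via scikit-learn
pred_skl = KernelRidge(kernel='rbf', alpha=lam, gamma=gamma).fit(D_toy, P_toy).predict(D_toy)
print(np.allclose(pred_manual, pred_skl))  # expected: True
```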
%% Cell type:code id: tags:
``` python
selected_feature_list = ['IP', 'EA', 'r_s', 'r_p', 'r_d']
allowed_operations = []
P, df_D = get_data(selected_feature_list, allowed_operations)
features_list = df_D.columns.tolist()
D = df_D.values
```
%% Cell type:code id: tags:
``` python
kr = GridSearchCV(KernelRidge(kernel='rbf'), cv=5,
                  param_grid={"alpha": np.logspace(-3, 0, 5),
                              "gamma": np.logspace(-2, 1, 5)})
P_predict_kr = []
loo = LeaveOneOut()
for indices_train, index_test in loo.split(P):
    kr.fit(D[indices_train], P[indices_train])
    print("%2i Ediff_ref: %.3f, Ediff_pred: %.3f, hyperparameters: {'lambda': %.3f, 'gamma': %.3f}"
          % (index_test[0], P[index_test][0], kr.predict(D[index_test])[0],
             kr.best_params_['alpha'], kr.best_params_['gamma']))
    P_predict_kr.append(kr.predict(D[index_test])[0])
```
%% Cell type:code id: tags:
``` python
prediction_rmse_kr = np.linalg.norm(np.array(P_predict_kr) - P) / np.sqrt(P.size)
xs = [P, P]
ys = [P_predict[-1], P_predict_kr]
legend = ['SISSO 3D, RMSE = %.3f eV/atom' % prediction_errors[-1],
          'KRR, RMSE = %.3f eV/atom' % prediction_rmse_kr]
data_point_labels = [df.index.tolist()] * 2
fig = show_scatter_plot(xs, ys, data_point_labels=data_point_labels,
                        x_label='E_diff_DFT', y_label='E_diff_predicted', legend=legend, unit='eV/atom')
```
%% Cell type:markdown id: tags:
# Symbolic Regression Through Genetic Programming
It is instructive to compare the results of the previous workflows against other methods for performing symbolic regression, in this case genetic programming. Historically, genetic programming has been the most popular tool for symbolic regression: it applies genetic operations (e.g., mutation and crossover) to the binary expression trees of the formulas in order to generate new candidate expressions (a minimal sketch of such a tree is given below). PySR is a recent package implementing this approach; it uses a Python frontend and a Julia backend to search for expressions efficiently. For more details of the approach see: https://github.com/MilesCranmer/PySR
In this example we use a "deterministic" version of the code by turning off all parallelism and fixing the seed via the `procs`, `multithreading`, `random_state`, and `deterministic` keywords. While this slows down the run time, it makes the results reproducible.
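To make the expression-tree picture concrete, here is a minimal, hypothetical sketch (not part of PySR, and with made-up feature names) of how a formula can be stored as a binary tree and modified by a random mutation, the kind of operation a genetic-programming search applies many times:
``` python
import random

# A tiny expression-tree representation: a node is either a feature name (leaf)
# or a tuple (operator, left_subtree, right_subtree); right is None for unary operators.
expr = ("/", ("-", "IP_B", "EA_B"), ("square", "r_p_A", None))

def to_string(node):
    if isinstance(node, str):
        return node
    op, left, right = node
    if right is None:                         # unary operator
        return f"{op}({to_string(left)})"
    return f"({to_string(left)} {op} {to_string(right)})"

def mutate(node, features, rng):
    """Replace a random leaf with a random feature (one possible mutation)."""
    if isinstance(node, str):
        return rng.choice(features)
    op, left, right = node
    if right is None or rng.random() < 0.5:
        return (op, mutate(left, features, rng), right)
    return (op, left, mutate(right, features, rng))

rng = random.Random(0)
print("parent:", to_string(expr))
print("child :", to_string(mutate(expr, ["r_s_A", "r_p_B", "E_HOMO_A"], rng)))
```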
%% Cell type:code id: tags:
``` python
features = ['Z', 'r_s', 'r_p', 'r_d', 'period', 'EA', 'IP', 'E_HOMO', 'E_LUMO']
pysr_df = df_plus_reduced.copy()
# Strip everything after "(" (e.g. units) from the column names
col_rename = {col: col.split("(")[0].strip() for col in pysr_df.columns}
pysr_df.rename(columns=col_rename, inplace=True)
pysr_df.drop(columns=["energy_diff"], inplace=True)
```
%% Cell type:code id: tags:
``` python
pysr = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "/"],
    unary_operators=[
        "square",
        "exp",
        "abs",
    ],
    loss="loss(prediction, target) = (prediction - target)^2",
    verbosity=0,
    procs=0,
    multithreading=False,
    random_state=1,
    deterministic=True,
)

P_predict_pysr = []
loo = LeaveOneOut()
with cwd(Path("pySR"), mkdir=True):
    for indices_train, index_test in loo.split(P):
        pysr.fit(pysr_df.iloc[indices_train, :], P[indices_train])
        P_predict_pysr.append(pysr.predict(pysr_df.iloc[index_test, :])[0])
        print("%2i Ediff_ref: %.3f, Ediff_pred: %.3f, equation: {%s}"
              % (index_test[0], P[index_test][0], P_predict_pysr[-1],
                 pysr.get_best().sympy_format))
```
%% Cell type:code id: tags:
``` python
xs = [P, P]
ys = [P_predict[-1], P_predict_pysr]
prediction_rmse_pysr = np.sqrt(np.mean((np.array(P_predict_pysr) - P) ** 2.0))
legend = ['SISSO 3D, RMSE = %.3f eV/atom' % prediction_errors[-1],
          'pySR, RMSE = %.3f eV/atom' % prediction_rmse_pysr]
data_point_labels = [df.index.tolist()] * 2
fig = show_scatter_plot(xs, ys, data_point_labels=data_point_labels,
                        x_label='E_diff_DFT', y_label='E_diff_predicted', legend=legend, unit='eV/atom')
```
%% Cell type:code id: tags:
``` python
pysr_path = Path("pySR/all_pts")
with cwd(pysr_path, mkdir=True):
    pysr.fit(pysr_df, P)
    print(pysr.get_best().sympy_format)
    # Pareto front: training error versus expression complexity
    error_df = pysr.equations.copy()
    error_df["RMSE"] = np.sqrt(error_df["loss"])
    plt.plot(error_df["complexity"], error_df["RMSE"])
    plt.plot(pysr.get_best()["complexity"], np.sqrt(pysr.get_best()["loss"]), "*", ms=10)
    plt.xlabel("Number of Nodes")
    plt.ylabel("RMSE (eV / atom)")
    plt.show()
```
%% Cell type:markdown id: tags:
# FFX: Fast Function Extraction
FFX was one of the first popular attempts to remove genetic programming from symbolic regression. The approach builds a large set of basis functions by raising each input variable to a group of powers and applying a predetermined set of unary and binary operators to them. It then uses an `ElasticNet` fit to select the best sparse combination of basis functions for a given model complexity (a minimal sketch of this idea is given below).
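The following is a minimal, hypothetical sketch of the FFX idea (not FFX's actual internals, and using made-up toy data): generate a handful of simple basis functions and let an `ElasticNet` pick a sparse linear combination of them:
``` python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = pd.DataFrame({"r_s": rng.uniform(0.5, 2.0, 50), "IP": rng.uniform(5.0, 15.0, 50)})
y = 0.3 * X["r_s"] ** 2 - 0.1 / X["IP"] + rng.normal(0, 0.01, 50)   # toy target

# Build a small (incomplete) basis of transformed features, in the spirit of FFX
basis = pd.DataFrame(index=X.index)
for col in X.columns:
    for power in (0.5, 1.0, 2.0):
        basis[f"{col}^{power}"] = X[col] ** power
    basis[f"1/{col}"] = 1.0 / X[col]
    basis[f"abs({col})"] = np.abs(X[col])

# Sparse linear fit over the basis functions
model = ElasticNetCV(cv=5, l1_ratio=0.9).fit(basis, y)
selected = {name: coef for name, coef in zip(basis.columns, model.coef_) if abs(coef) > 1e-6}
print(selected)
```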
%% Cell type:code id: tags:
``` python
import ffx

P_predict_ffx = []
n_basis = []
loo = LeaveOneOut()
FFX = ffx.FFXRegressor()
for indices_train, index_test in loo.split(P):
    FFX.fit(
        pysr_df.iloc[indices_train, :],
        P[indices_train],
    )
    P_predict_ffx.append(
        FFX.predict(pysr_df.iloc[index_test, :].values)[0]
    )
    n_basis.append(FFX.complexity())
    print("%2i Ediff_ref: %.3f, Ediff_pred: %.3f, complexity: %d"
          % (index_test[0], P[index_test][0], P_predict_ffx[-1], n_basis[-1]))
```
%% Cell type:code id: tags:
``` python
xs = [P, P]
ys = [P_predict[-1], P_predict_ffx]
prediction_rmse_ffx = np.sqrt(np.mean((np.array(P_predict_ffx) - P) ** 2.0))
legend = ['SISSO 3D, RMSE = %.3f eV/atom' % prediction_errors[-1],
          'FFX, RMSE = %.3f eV/atom' % prediction_rmse_ffx]
data_point_labels = [df.index.tolist()] * 2
fig = show_scatter_plot(xs, ys, data_point_labels=data_point_labels,
                        x_label='E_diff_DFT', y_label='E_diff_predicted', legend=legend, unit='eV/atom')
```
%% Cell type:code id: tags:
``` python
plt.hist(n_basis, bins=15)
plt.xlabel("Number of Basis Functions")
plt.ylabel("Count")
plt.show()
```
%% Cell type:code id: tags:
``` python
```