Commit 93ced703 authored by Luigi Sbailo

Update metainfo

parent 1c00e9aa
{ {
"authors": [ "authors": [
"Sbailo, Luigi", "Sbailò, Luigi",
"Ghiringhelli, Luca M." "Purcell, Thomas A. R.",
"Ghiringhelli, Luca M.",
"Scheffler, Matthias"
], ],
"email": "ghiringhelli@fhi-berlin.mpg.de", "email": "ghiringhelli@fhi-berlin.mpg.de",
"title": "Artificial intelligence for high-throughput discovery of topological insulators", "title": "Discovery of new topological insulators in alloyed tetradymites",
"description": "In this tutorial... ", "description": "Learn how to find descriptive parameters (short formulas) that predict whether alloyed materials are topological or trivial insulators, using the example of tetradymites. This notebook is based on the algorithm 'sure independence screening and sparsifying operator' (SISSO) that enables to search for optimal descriptor by scanning huge feature spaces.",
"url": "https://gitlab.mpcdf.mpg.de/nomad-lab/analytics-tetradymite-PRM2020", "url": "https://gitlab.mpcdf.mpg.de/nomad-lab/analytics-tetradymite-PRM2020",
"link": "https://analytics-toolkit.nomad-coe.eu/hub/user-redirect/notebooks/tutorials/tetradymite_PRM2020.ipynb", "link": "https://analytics-toolkit.nomad-coe.eu/hub/user-redirect/notebooks/tutorials/tetradymite_PRM2020.ipynb",
"link_public": "https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/tetradymite_PRM2020.ipynb", "link_public": "https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/tetradymite_PRM2020.ipynb",
"updated": "2020-26-08", "updated": "2020-12-09",
"flags":{ "flags":{
"featured": true, "featured": true,
"top_of_list": false "top_of_list": false
}, },
"labels": { "labels": {
"application_keyword": [ "application_keyword": [
"Something" "Tetradymites",
"Topological insulators"
], ],
"application_section": [ "application_section": [
"Something" "Timely artificial-intelligence applications to Materials Science"
], ],
"application_system": [ "application_system": [
"Something" "Tetradymites"
], ],
"category": [ "category": [
"Something" "Tutorial"
], ],
"data_analytics_method": [ "data_analytics_method": [
"Something" "SISSO",
"Classification"
], ],
"platform": [ "platform": [
"jupyter" "jupyter"
......
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat; <div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px; padding-top: 20px;
padding-right: 10px; padding-right: 10px;
padding-bottom: 170px; padding-bottom: 170px;
padding-left: 10px; padding-left: 10px;
border-bottom: 14px double #333; border-bottom: 14px double #333;
border-top: 14px double #333;' > border-top: 14px double #333;' >
<div style="text-align:center"> <div style="text-align:center">
<b><font size="6.4">Discovery of new topological insulators in alloyed tetradymites with symbolic regression combined with compressed sensing (SISSO). <b><font size="6.4">Discovery of new topological insulators in alloyed tetradymites
</font></b> </font></b>
</div> </div>
<p> <p>
created by: Luigi Sbailò, Thomas A. R. Purcell, Luca M. Ghiringhelli, and Matthias Scheffler created by: Luigi Sbailò, Thomas A. R. Purcell, Luca M. Ghiringhelli, and Matthias Scheffler
Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: Sep 29, 2020]</span> <span class="nomad--last-updated" data-version="v1.0.0">[Last updated: Sep 29, 2020]</span>
<div> <div>
<img style="float: left;" src="assets/tetradymite_PRM2020/Logo_MPG.png" width="200"> <img style="float: left;" src="assets/tetradymite_PRM2020/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/tetradymite_PRM2020/Logo_NOMAD.png" width="250"> <img style="float: right;" src="assets/tetradymite_PRM2020/Logo_NOMAD.png" width="250">
</div> </div>
</div> </div>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Introduction ### Introduction
This tutorial shows how to find descriptive parameters (short formulas) that predict whether alloyed materials are topological or trivial insulators, using the example of tetradymites. It is based on the sure independence screening and sparsifying operator (SISSO) algorithm, which searches for an optimal descriptor by scanning huge feature spaces. This tutorial shows how to find descriptive parameters (short formulas) that predict whether alloyed materials are topological or trivial insulators, using the example of tetradymites. It is based on the sure independence screening and sparsifying operator (SISSO) algorithm, which searches for an optimal descriptor by scanning huge feature spaces.
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: <span style="font-style: italic;">SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates</span>, Phys. Rev. Materials 2, 083802 (2018) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802" target="_blank">[PDF]</a>.</div> <div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: <span style="font-style: italic;">SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates</span>, Phys. Rev. Materials 2, 083802 (2018) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802" target="_blank">[PDF]</a>.</div>
With the default settings, the method reproduces the results of: With the default settings, the method reproduces the results of:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">G. Cao, R. Ouyang, L. M. Ghiringhelli, M. Scheffler, H. Liu, C. Carbogno, and Z. Zhang: <span style="font-style: italic;">Artificial intelligence for high-throughput discovery of topological insulators: The example of alloyed tetradymites</span>, Phys. Rev. Materials 4, 034204 (2020) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.034204">[PDF]</a>,</div> <div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">G. Cao, R. Ouyang, L. M. Ghiringhelli, M. Scheffler, H. Liu, C. Carbogno, and Z. Zhang: <span style="font-style: italic;">Artificial intelligence for high-throughput discovery of topological insulators: The example of alloyed tetradymites</span>, Phys. Rev. Materials 4, 034204 (2020) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.034204">[PDF]</a>,</div>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<details> <details>
<summary> <summary>
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;"><b>Explanation of the method (click to expand/collapse)</b></div></summary> <div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;"><b>Explanation of the method (click to expand/collapse)</b></div></summary>
We present a tool for predicting whether alloyed tetradymites are topological or trivial insulators, using a set of descriptive parameters (a descriptor) based on free-atom data of the atomic species constituting the $AB-LMN$ materials, where $A,B \in \{ \textrm{As, Sb, Bi} \}$ and $L,M,N \in \{ \textrm{S, Se, Te} \}$. We apply a recently developed method, sure independence screening and sparsifying operator (SISSO), which finds an optimal descriptor in a huge feature space containing billions of features. In this tutorial, an $\ell_0$-optimization is used as the sparsifying operator. We present a tool for predicting whether alloyed tetradymites are topological or trivial insulators, using a set of descriptive parameters (a descriptor) based on free-atom data of the atomic species constituting the $AB-LMN$ materials, where $A,B \in \{ \textrm{As, Sb, Bi} \}$ and $L,M,N \in \{ \textrm{S, Se, Te} \}$. We apply a recently developed method, sure independence screening and sparsifying operator (SISSO), which finds an optimal descriptor in a huge feature space containing billions of features. In this tutorial, an $\ell_0$-optimization is used as the sparsifying operator.
The method is described in: The method is described in:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;"> <div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: <span style="font-style: italic;">SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates</span>, Phys. Rev. Materials 2, 083802 (2018) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802" target="_blank">[PDF]</a>. <br> </div> R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L. M. Ghiringhelli: <span style="font-style: italic;">SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates</span>, Phys. Rev. Materials 2, 083802 (2018) <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.2.083802" target="_blank">[PDF]</a>. <br> </div>
In this tutorial, we focus on the classification flavor of SISSO($\ell_0$). In this tutorial, we focus on the classification flavor of SISSO($\ell_0$).
In the space of descriptors, each category’s domain (here, topological vs trivial insulator) is approximated as In the space of descriptors, each category’s domain (here, topological vs trivial insulator) is approximated as
the region of space within the convex hull of the corresponding training data. SISSO finds the low-dimensional descriptor yielding the minimum overlap between these convex regions. In practice, the algorithm is iterative. At the first iteration, in the SIS step, it selects the $k$ features which yield the smallest overlap when convex regions (segments encompassing all the data in one category) over the training data are constructed. In the first iteration, the feature giving the smallest overlap is already the 1D model. At each subsequent iteration $i$, the SIS step selects $k$ new features that do the same for those training points which were in the overlap regions at the previous steps (i.e., the residuals). Then, in the SO step, all $i$-tuples of features, formed by combining in all possible ways the features selected in the SIS steps, are ranked by the size of the overlap. The $i$-tuple with the smallest overlap is the $i$D model. the region of space within the convex hull of the corresponding training data. SISSO finds the low-dimensional descriptor yielding the minimum overlap between these convex regions. In practice, the algorithm is iterative. At the first iteration, in the SIS step, it selects the $k$ features which yield the smallest overlap when convex regions (segments encompassing all the data in one category) over the training data are constructed. In the first iteration, the feature giving the smallest overlap is already the 1D model. At each subsequent iteration $i$, the SIS step selects $k$ new features that do the same for those training points which were in the overlap regions at the previous steps (i.e., the residuals). Then, in the SO step, all $i$-tuples of features, formed by combining in all possible ways the features selected in the SIS steps, are ranked by the size of the overlap. The $i$-tuple with the smallest overlap is the $i$D model.
In order to better identify a predictive model to classify unseen data points, at each dimension (iteration) a soft-margin support-vector machine <a href="https://link.springer.com/article/10.1007%252FBF00994018" target="_blank">[C. Cortes & V. Vapnik, Machine learning 20, 273 (1995)]</a> is trained to define the separating hyperplanes. The resulting model is identified by the coefficients and intercept of the hyperplanes. In order to better identify a predictive model to classify unseen data points, at each dimension (iteration) a soft-margin support-vector machine <a href="https://link.springer.com/article/10.1007%252FBF00994018" target="_blank">[C. Cortes & V. Vapnik, Machine learning 20, 273 (1995)]</a> is trained to define the separating hyperplanes. The resulting model is identified by the coefficients and intercept of the hyperplanes.
</details> </details>
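The overlap criterion described in the method explanation can be illustrated in one dimension, where each class's convex hull is simply an interval. The following is a minimal sketch under stated assumptions, not the `cpp_sisso` implementation: the function names `interval_overlap` and `rank_features_1d` are hypothetical, the toy data are invented, and the overlap is measured here only by the number of training points that fall inside both class intervals.

```python
def interval_overlap(values, labels):
    """Count training points lying inside both class intervals (1D convex hulls)."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    lo = max(min(a), min(b))  # start of the region shared by both intervals
    hi = min(max(a), max(b))  # end of the region shared by both intervals
    if lo > hi:
        return 0  # the intervals are disjoint: perfect 1D separation
    return sum(1 for v in values if lo <= v <= hi)


def rank_features_1d(features, labels, k):
    """SIS-like screening (assumption: rank by overlap count only), keeping k names."""
    scores = {name: interval_overlap(vals, labels) for name, vals in features.items()}
    return sorted(scores, key=scores.get)[:k]


# Invented toy data: six samples, two classes, three candidate features.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "f1": [1, 2, 3, 4, 5, 6],  # classes fully separated
    "f2": [1, 4, 2, 3, 5, 6],  # two points in the overlap
    "f3": [1, 5, 3, 2, 4, 6],  # four points in the overlap
}
best_two = rank_features_1d(features, labels, k=2)
```

In this sketch the feature with the smallest overlap would play the role of the 1D model, and the points left inside the overlap region are the residuals that a later iteration would try to separate.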
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
The idea demonstrated in this tutorial is to start from simple physical quantities ("primary features", here properties of the constituent free atoms such as Pauling electronegativity), to generate millions (or billions) of candidate formulas by applying arithmetic operations combining primary features. These candidate formulas constitute the so-called "feature space". Then, SISSO is used to select only a few of these formulas that explain the data. The idea demonstrated in this tutorial is to start from simple physical quantities ("primary features", here properties of the constituent free atoms such as Pauling electronegativity), to generate millions (or billions) of candidate formulas by applying arithmetic operations combining primary features. These candidate formulas constitute the so-called "feature space". Then, SISSO is used to select only a few of these formulas that explain the data.
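To make the construction of the feature space concrete, here is a minimal sketch of a single expansion step (one rung) for one material. It is an illustration under assumptions, not the `generate_fs` routine: the helper `next_rung` is hypothetical, only a handful of operators are applied, and unit checks and duplicate removal are omitted. The numerical values are the Bi and Te entries of the electronegativity and spin-orbit tables used later in the notebook.

```python
from itertools import combinations


def next_rung(features):
    """Apply a few binary and unary operators to every feature (one rung, simplified)."""
    new = dict(features)  # keep the features of the previous rung
    for (na, a), (nb, b) in combinations(features.items(), 2):
        new[f"({na}+{nb})"] = a + b
        new[f"|{na}-{nb}|"] = abs(a - b)
        new[f"({na}*{nb})"] = a * b
        if b != 0:  # skip divisions by zero
            new[f"({na}/{nb})"] = a / b
    for name, v in features.items():
        new[f"({name})^2"] = v ** 2
    return new


# Three primary features for a single hypothetical material (Bi cation, Te anion).
primary = {"x_A": 2.02, "x_L": 2.12, "l_A": 1.25}
rung1 = next_rung(primary)
rung2 = next_rung(rung1)
print(len(primary), len(rung1), len(rung2))
```

Even with three primary features and five operators, the number of candidates grows from 3 to 18 after one rung and to several hundred after two, which is why the full operator set quickly reaches millions of formulas.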
By clicking directly on "Run" below, i.e., with the default selection, you can reproduce the 2D map as published in <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.034204" target="_blank">PRM 2020</a>. You can also select primary features and allowed operations (by clicking the check-boxes), as well as the SISSO rung (i.e., the number of iterations in the construction of the feature space), the number of features that are selected at each iteration of the SIS step, and the maximum number of dimensions of the model. The materials considered here have up to 5 different atomic species in the unit cell, with the prototype formula $AB-LMN$, where the cations $A,B \in \{ \textrm{As, Sb, Bi} \}$ and the anions $L,M,N \in \{ \textrm{S, Se, Te} \}$. We have therefore grouped the features to be selected into those for cations and anions. This means that selecting, e.g., a cation feature adds that feature to the primary feature set for both the $A$ and $B$ elements, but each is treated separately in the feature construction and SISSO optimization. After selecting the features and other settings, press "Run". \ By clicking directly on "Run" below, i.e., with the default selection, you can reproduce the 2D map as published in <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.034204" target="_blank">PRM 2020</a>. You can also select primary features and allowed operations (by clicking the check-boxes), as well as the SISSO rung (i.e., the number of iterations in the construction of the feature space), the number of features that are selected at each iteration of the SIS step, and the maximum number of dimensions of the model. The materials considered here have up to 5 different atomic species in the unit cell, with the prototype formula $AB-LMN$, where the cations $A,B \in \{ \textrm{As, Sb, Bi} \}$ and the anions $L,M,N \in \{ \textrm{S, Se, Te} \}$.
We have therefore grouped the features to be selected into those for cations and anions. This means that selecting, e.g., a cation feature adds that feature to the primary feature set for both the $A$ and $B$ elements, but each is treated separately in the feature construction and SISSO optimization. After selecting the features and other settings, press "Run". \
After the results are shown for all models, from one-dimensional up to the maximum chosen dimension, you can press "Plot interactive map" to reveal a map of topological vs trivial tetradymite insulators for the highest-dimensional model. If the highest-dimensional model is 2D, the support-vector-machine separation line between the two phases is shown. For higher-dimensional models, the 3rd and 4th dimensions can be visualized via the size or the color of the data-point markers. Drop-down menus let you assign axes, markers, and colors to the descriptor components of choice. After the results are shown for all models, from one-dimensional up to the maximum chosen dimension, you can press "Plot interactive map" to reveal a map of topological vs trivial tetradymite insulators for the highest-dimensional model. If the highest-dimensional model is 2D, the support-vector-machine separation line between the two phases is shown. For higher-dimensional models, the 3rd and 4th dimensions can be visualized via the size or the color of the data-point markers. Drop-down menus let you assign axes, markers, and colors to the descriptor components of choice.
With the selection of "PRM2020" (the default) as SISSO rung, a special feature space is loaded, which contains far fewer features than the production calculation used in <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.034204" target="_blank">PRM 2020</a>. This makes it possible to reproduce the same result in the notebook in a reasonable time. Still, the provided feature space contains thousands of the top-ranked features, and SISSO finds the best nD model. With the selection of "PRM2020" (the default) as SISSO rung, a special feature space is loaded, which contains far fewer features than the production calculation used in <a href="https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.034204" target="_blank">PRM 2020</a>. This makes it possible to reproduce the same result in the notebook in a reasonable time. Still, the provided feature space contains thousands of the top-ranked features, and SISSO finds the best nD model.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%HTML %%HTML
<script> <script>
code_show=true; code_show=true;
function code_toggle() { function code_toggle() {
if (code_show) if (code_show)
{ {
$('div.input').hide(); $('div.input').hide();
} }
else else
{ {
$('div.input').show(); $('div.input').show();
} }
code_show = !code_show code_show = !code_show
} }
$( document ).ready(code_toggle); $( document ).ready(code_toggle);
window.runCells("startup"); window.runCells("startup");
</script> </script>
The Python code for this notebook is by default hidden for easier reading. The Python code for this notebook is by default hidden for easier reading.
To toggle on/off the code, click <a href="javascript:code_toggle()">here</a>. To toggle on/off the code, click <a href="javascript:code_toggle()">here</a>.
``` ```
%% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%load_ext autoreload %load_ext autoreload
%autoreload 2 %autoreload 2
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from cpp_sisso import get_max_number_feats, get_estimate_n_feat_next_rung, generate_fs, SISSOClassifier, generate_phi_0_from_csv, FeatureSpace from cpp_sisso import get_max_number_feats, get_estimate_n_feat_next_rung, generate_fs, SISSOClassifier, generate_phi_0_from_csv, FeatureSpace
from tetradymite_PRM2020.visualizer import Visualizer from tetradymite_PRM2020.visualizer import Visualizer
import numpy as np import numpy as np
import pandas as pd import pandas as pd
import os import os
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from nomad import client, config from nomad import client, config
config.client.url = 'http://nomad-lab.eu/prod/rae/api' config.client.url = 'http://nomad-lab.eu/prod/rae/api'
query = client.query_archive(query={ query = client.query_archive(query={
'dataset_id': ['BjT-NFK0QdOx81_z5TmyeQ']}, 'dataset_id': ['BjT-NFK0QdOx81_z5TmyeQ']},
per_page=100, per_page=100,
) )
print(query) print(query)
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df_train = pd.read_pickle('./data/tetradymite_PRM2020/training_set') df_train = pd.read_pickle('./data/tetradymite_PRM2020/training_set')
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
path_structure = './data/tetradymite_PRM2020/structures/' path_structure = './data/tetradymite_PRM2020/structures/'
try: try:
os.mkdir(path_structure) os.mkdir(path_structure)
except OSError: except OSError:
!rm ./data/tetradymite_PRM2020/structures/* !rm ./data/tetradymite_PRM2020/structures/*
compounds=df_train.index.to_list() compounds=df_train.index.to_list()
scale_factor = 10**10 scale_factor = 10**10
alist = [] alist = []
for compound in compounds: for compound in compounds:
for entry in range (1581): for entry in range (1581):
labels = query[entry].section_run[0].section_system[-1].atom_labels labels = query[entry].section_run[0].section_system[-1].atom_labels
if (len(labels)>5): if (len(labels)>5):
continue continue
labels_1 = str(labels[0])+'_'+str(labels[1])+'_'+str(labels[3])+'_'+str(labels[4])+'_'+str(labels[2]) labels_1 = str(labels[0])+'_'+str(labels[1])+'_'+str(labels[3])+'_'+str(labels[4])+'_'+str(labels[2])
labels_2 = str(labels[0])+'_'+str(labels[1])+'_'+str(labels[4])+'_'+str(labels[3])+'_'+str(labels[2]) labels_2 = str(labels[0])+'_'+str(labels[1])+'_'+str(labels[4])+'_'+str(labels[3])+'_'+str(labels[2])
labels_3 = str(labels[1])+'_'+str(labels[0])+'_'+str(labels[3])+'_'+str(labels[4])+'_'+str(labels[2]) labels_3 = str(labels[1])+'_'+str(labels[0])+'_'+str(labels[3])+'_'+str(labels[4])+'_'+str(labels[2])
labels_4 = str(labels[1])+'_'+str(labels[0])+'_'+str(labels[4])+'_'+str(labels[3])+'_'+str(labels[2]) labels_4 = str(labels[1])+'_'+str(labels[0])+'_'+str(labels[4])+'_'+str(labels[3])+'_'+str(labels[2])
if compound in list([labels_1, labels_2, labels_3, labels_4]): if compound in list([labels_1, labels_2, labels_3, labels_4]):
n_atoms = len (labels) n_atoms = len (labels)
lat_x, lat_y, lat_z = query[entry].section_run[0].section_system[-1].lattice_vectors.magnitude * scale_factor lat_x, lat_y, lat_z = query[entry].section_run[0].section_system[-1].lattice_vectors.magnitude * scale_factor
file = open(path_structure + str(compound) +".xyz","w") file = open(path_structure + str(compound) +".xyz","w")
file.write("%d\n\n"%(n_atoms*8)) file.write("%d\n\n"%(n_atoms*8))
for i in [0,1,2]: for i in [0,1,2]:
for j in [0,1,2]: for j in [0,1,2]:
for k in [0,1,2]: for k in [0,1,2]:
for n in range (n_atoms): for n in range (n_atoms):
el = query[entry].section_run[0].section_system[-1].atom_labels[n] el = query[entry].section_run[0].section_system[-1].atom_labels[n]
xyz = query[entry].section_run[0].section_system[-1].atom_positions[n].magnitude * scale_factor xyz = query[entry].section_run[0].section_system[-1].atom_positions[n].magnitude * scale_factor
xyz += i*lat_x xyz += i*lat_x
xyz += j*lat_y xyz += j*lat_y
xyz += k*lat_z xyz += k*lat_z
file.write (el) file.write (el)
file.write ("\t%f\t%f\t%f\n"%(xyz[0],xyz[1],xyz[2])) file.write ("\t%f\t%f\t%f\n"%(xyz[0],xyz[1],xyz[2]))
file.close() file.close()
alist.append(compound) alist.append(compound)
break break
``` ```
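The cell above writes each matched structure as an XYZ file made of periodic replicas of the unit cell. As a minimal sketch of the same idea, the hypothetical helper `supercell_xyz` below builds the XYZ text from plain Python lists and derives the atom count in the header from the atoms actually written. The function name, its `reps` parameter, and the two-atom cubic cell are assumptions for illustration and are not part of the notebook or of the NOMAD client API.

```python
from itertools import product


def supercell_xyz(labels, positions, lattice, reps=(2, 2, 2)):
    """Return XYZ-format text for a supercell built by repeating the unit cell."""
    lines = []
    for i, j, k in product(range(reps[0]), range(reps[1]), range(reps[2])):
        # Cartesian shift of this replica: i*a1 + j*a2 + k*a3
        shift = [i * lattice[0][c] + j * lattice[1][c] + k * lattice[2][c]
                 for c in range(3)]
        for el, pos in zip(labels, positions):
            x, y, z = (p + s for p, s in zip(pos, shift))
            lines.append(f"{el}\t{x:f}\t{y:f}\t{z:f}")
    # XYZ header: total number of atoms, then an empty comment line.
    return f"{len(lines)}\n\n" + "\n".join(lines) + "\n"


# Invented two-atom cubic cell (lengths in arbitrary units).
text = supercell_xyz(
    labels=["Bi", "Te"],
    positions=[[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
    lattice=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
print(text.splitlines()[0])
```

Computing the header from the number of lines written means the atom count never has to be kept in sync with the replication ranges by hand.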
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
zeta = {'S':16, 'As':33, 'Se':34, 'Sb':51, 'Te':52, 'Bi':83} zeta = {'S':16, 'As':33, 'Se':34, 'Sb':51, 'Te':52, 'Bi':83}
chi = {'S':2.58, 'As':2.18, 'Se':2.55, 'Sb':2.05, 'Te':2.12, 'Bi':2.02} chi = {'S':2.58, 'As':2.18, 'Se':2.55, 'Sb':2.05, 'Te':2.12, 'Bi':2.02}
lambd = {'S':0.05, 'As':0.19, 'Se':0.22, 'Sb':0.4, 'Te':0.49, 'Bi':1.25} lambd = {'S':0.05, 'As':0.19, 'Se':0.22, 'Sb':0.4, 'Te':0.49, 'Bi':1.25}
df_feat = pd.DataFrame(index=df_train.index, columns=[ df_feat = pd.DataFrame(index=df_train.index, columns=[
'z_A','z_B','z_L','z_M','z_N', 'z_A','z_B','z_L','z_M','z_N',
'x_A','x_B','x_L','x_M','x_N', 'x_A','x_B','x_L','x_M','x_N',
'l_A','l_B','l_L','l_M','l_N', 'l_A','l_B','l_L','l_M','l_N',
]) ])
for comp in df_train.index: for comp in df_train.index:
ablmn = comp.split('_') ablmn = comp.split('_')
df_feat.loc[comp] = pd.Series({ df_feat.loc[comp] = pd.Series({
'z_A':zeta[ablmn[0]], 'z_A':zeta[ablmn[0]],
'z_B':zeta[ablmn[1]], 'z_B':zeta[ablmn[1]],
'z_L':zeta[ablmn[2]], 'z_L':zeta[ablmn[2]],
'z_M':zeta[ablmn[3]], 'z_M':zeta[ablmn[3]],
'z_N':zeta[ablmn[4]], 'z_N':zeta[ablmn[4]],
'x_A':chi[ablmn[0]], 'x_A':chi[ablmn[0]],
'x_B':chi[ablmn[1]], 'x_B':chi[ablmn[1]],
'x_L':chi[ablmn[2]], 'x_L':chi[ablmn[2]],
'x_M':chi[ablmn[3]], 'x_M':chi[ablmn[3]],
'x_N':chi[ablmn[4]], 'x_N':chi[ablmn[4]],
'l_A':lambd[ablmn[0]], 'l_A':lambd[ablmn[0]],
'l_B':lambd[ablmn[1]], 'l_B':lambd[ablmn[1]],
'l_L':lambd[ablmn[2]], 'l_L':lambd[ablmn[2]],
'l_M':lambd[ablmn[3]], 'l_M':lambd[ablmn[3]],
'l_N':lambd[ablmn[4]], 'l_N':lambd[ablmn[4]],
}) })
df_feat['Class'] = df_train['Class'] df_feat['Class'] = df_train['Class']
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
def get_feat_space_and_sr( def get_feat_space_and_sr(
df, df,
ops= ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb', ops= ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb',
'sqrt', 'cbrt', 'log', 'abs'], 'sqrt', 'cbrt', 'log', 'abs'],
cols="all", cols="all",
max_phi=2, max_phi=2,
n_sis_select=50, n_sis_select=50,
remove_double_divison=True, remove_double_divison=True,
max_dim=3, max_dim=3,
n_residual=1, n_residual=1,
default=True, default=True,
): ):
if default: if default:
phi_0, prop_unit, prop, prop_test, task_sizes_train, task_sizes_test, leave_out_inds = generate_phi_0_from_csv( phi_0, prop_unit, prop, prop_test, task_sizes_train, task_sizes_test, leave_out_inds = generate_phi_0_from_csv(
df_train, "Class", cols='all', task_key=None, leave_out_frac=0.0 df_train, "Class", cols='all', task_key=None, leave_out_frac=0.0
) )
feat_space = generate_fs( feat_space = generate_fs(
phi_0, phi_0,
prop, prop,
task_sizes_train, task_sizes_train,
["add", "sub", "mult", "div", "abs_diff", "sq", "cb", "sqrt", "cbrt", "inv", "abs"], ["add", "sub", "mult", "div", "abs_diff", "sq", "cb", "sqrt", "cbrt", "inv", "abs"],
"classification", "classification",
0, 0,
n_sis_select n_sis_select
) )
else: else:
phi_0, prop_unit, prop, prop_test, task_sizes_train, task_sizes_test, leave_out_inds = generate_phi_0_from_csv( phi_0, prop_unit, prop, prop_test, task_sizes_train, task_sizes_test, leave_out_inds = generate_phi_0_from_csv(
df_feat, "Class", cols=cols, task_key=None, leave_out_frac=0.0, leave_out_inds=None df_feat, "Class", cols=cols, task_key=None, leave_out_frac=0.0, leave_out_inds=None
) )
feat_space = generate_fs( feat_space = generate_fs(
phi_0, phi_0,
prop, prop,
task_sizes_train, task_sizes_train,
ops, ops,
"classification", "classification",
max_phi, max_phi,
n_sis_select n_sis_select
) )
sisso = SISSOClassifier( sisso = SISSOClassifier(
feat_space, feat_space,
prop_unit, prop_unit,
prop, prop,
prop_test, prop_test,
task_sizes_train, task_sizes_train,
task_sizes_test, task_sizes_test,
leave_out_inds, leave_out_inds,
max_dim, max_dim,
10, 10,
10 10
) )
return feat_space, sisso return feat_space, sisso
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from ipywidgets import widgets, interactive from ipywidgets import widgets, interactive
from IPython.display import HTML, clear_output from IPython.display import HTML, clear_output
def plot_2d_solution(b): def plot_2d_solution(b):
with out2: with out2:
model = sisso.models[1][0] model = sisso.models[1][0]
classified=model.prop_train classified=model.prop_train
compounds = df_train.index.to_list() compounds = df_train.index.to_list()
df=pd.DataFrame(data={ df=pd.DataFrame(data={
"Compound":compounds, "Compound":compounds,
"Classification":classified}) "Classification":classified})
for feat in sisso.models[sisso.n_dim-1][0].feats: for feat in sisso.models[sisso.n_dim-1][0].feats:
df[str(feat)]=feat.value df[str(feat)]=feat.value
classes = ['Topological insulators', 'Trivial insulators'] classes = ['Topological insulators', 'Trivial insulators']
visualizer=Visualizer(df, sisso, classes) visualizer=Visualizer(df, sisso, classes)
visualizer.show() visualizer.show()
def prm_select(change): def prm_select(change):
if change['new'] == 'PRM2020': if change['new'] == 'PRM2020':
default_operations = ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb', default_operations = ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb',
'sqrt', 'cbrt', 'log', 'abs'] 'sqrt', 'cbrt', 'log', 'abs']
default_features = ['z_cations','chi_cations','lambda_cations','z_anions','chi_anions','lambda_anions'] default_features = ['z_cations','chi_cations','lambda_cations','z_anions','chi_anions','lambda_anions']
for op, widget in zip(possible_operations, op_list): for op, widget in zip(possible_operations, op_list):
widget.value = op in default_operations widget.value = op in default_operations
widget.disabled = True widget.disabled = True
for feat, widget in zip(possible_features, feat_list): for feat, widget in zip(possible_features, feat_list):
widget.value = feat in default_features widget.value = feat in default_features
widget.disabled = True widget.disabled = True
tier_selection.value = 'PRM2020' tier_selection.value = 'PRM2020'
feat_per_iter_selection.value = 50 feat_per_iter_selection.value = 50
dimension_selection.value = 2 dimension_selection.value = 2
else: else:
for widget in op_list+feat_list: for widget in op_list+feat_list:
widget.disabled = False widget.disabled = False
def default_selection(b): def default_selection(b):
default_operations = ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb', default_operations = ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb',
'sqrt', 'cbrt', 'log', 'abs'] 'sqrt', 'cbrt', 'log', 'abs']
default_features = ['z_cations','chi_cations','lambda_cations','z_anions','chi_anions','lambda_anions'] default_features = ['z_cations','chi_cations','lambda_cations','z_anions','chi_anions','lambda_anions']
for op, widget in zip(possible_operations, op_list): for op, widget in zip(possible_operations, op_list):
widget.value = op in default_operations widget.value = op in default_operations
widget.disabled = True widget.disabled = True
for feat, widget in zip(possible_features, feat_list): for feat, widget in zip(possible_features, feat_list):
widget.value = feat in default_features widget.value = feat in default_features
widget.disabled = True widget.disabled = True
tier_selection.value = 'PRM2020' tier_selection.value = 'PRM2020'
feat_per_iter_selection.value = 50 feat_per_iter_selection.value = 50
dimension_selection.value = 2 dimension_selection.value = 2
def find_descriptors(b):
    # The feature space and the solver are shared with the plotting callback.
    global feat_space, sisso
    with out2:
        clear_output()
    with out1:
        clear_output()
        print('Calculating...', flush=True)
        selected_features = []
        allowed_operations = []
        for op, widget in zip(possible_operations, op_list):
            if widget.value:
                allowed_operations.append(op)
        # Expand each GUI feature into its per-site columns:
        # cations occupy sites A and B, anions occupy sites L, M and N.
        for sel_feat, widget in zip(possible_features, feat_list):
            if widget.value:
                feat = sel_feat.split('_')[0]
                typ = sel_feat.split('_')[1]
                if (typ=='cations'):
                    selected_features.append(feat + '_'+ 'A')
                    selected_features.append(feat + '_'+ 'B')
                if (typ=='anions'):
                    selected_features.append(feat + '_'+ 'L')
                    selected_features.append(feat + '_'+ "M")
                    selected_features.append(feat + '_'+ "N")
        if tier_selection.value == 'PRM2020':
            # The PRM2020 preset ignores the checkboxes and uses all features.
            selected_features = "all"
            tier = 0
            default = True
        else:
            tier = tier_selection.value
            default = False
        try:
            feat_space, sisso = get_feat_space_and_sr(
                df = df_train,
                ops = allowed_operations,
                cols = selected_features,
                max_phi = tier,
                n_sis_select = feat_per_iter_selection.value,
                remove_double_divison=True,
                max_dim = dimension_selection.value,
                n_residual = 1,
                default = default)
            clear_output()
            # The interactive map needs at least a 2D model.
            if (dimension_selection.value>1):
                plot_button.disabled=False
            else:
                plot_button.disabled=True
            print("Number of features generated: " + str(feat_space.n_feat))
            print("")
            try:
                sisso.fit()
                for i in range(dimension_selection.value):
                    print(str(i+1)+'D model')
                    print("# misclassified: {} ".format(int(sisso.models[i][0].n_convex_overlap_train)))
                    string = "SVM dividing line: c0"
                    for nf, feat in enumerate(sisso.models[i][0].feats):
                        string = string + str(' + a'+str(nf)+'*'+str(feat))
                    string = string + " = 0"
                    print(string)
                    # The intercept c0 is stored last in the coefficient list.
                    string = "c0:{:.4}".format(sisso.models[i][0].coefs[0][-1])
                    for j in range(i+1):
                        string = string + str(" | a"+str(j)+":{:.4}".format(sisso.models[i][0].coefs[0][j]))
                    print(string + '\n')
            except RuntimeError:
                print("\nThe number of selected features per SIS iteration is larger than the number of features available. Please reduce the number of selected features per SIS iteration (at most: number of features generated / max number of dimensions), or increase the number of selected features and operations.")
        except Exception:
            print('The present selection does not create any derived features in the highest selected rung. Please select at least one binary or power operator, or reduce the maximum rung.')
```
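%% Cell type:markdown id: tags:
The `find_descriptors` callback above expands each selected GUI feature into per-site column names and prints a symbolic dividing line for every model. The cell below is a minimal sketch of those two steps in isolation, with no widget or SISSO state. The names `SITES`, `expand_features` and `dividing_line` are hypothetical helpers introduced only for illustration; they are not part of the notebook.
%% Cell type:code id: tags:
```python
# Hypothetical helpers (not part of the notebook) that mirror the string
# handling inside find_descriptors without any widget or SISSO objects.

# Assumption taken from the callback: cations sit on sites A and B,
# anions on sites L, M and N.
SITES = {"cations": ("A", "B"), "anions": ("L", "M", "N")}


def expand_features(selected):
    """Map GUI names such as 'z_cations' to per-site column names."""
    columns = []
    for name in selected:
        prop, kind = name.split("_")
        columns.extend(f"{prop}_{site}" for site in SITES[kind])
    return columns


def dividing_line(feature_exprs):
    """Build the symbolic 'c0 + a0*f0 + ... = 0' line printed per model."""
    terms = "".join(f" + a{i}*{expr}" for i, expr in enumerate(feature_exprs))
    return "SVM dividing line: c0" + terms + " = 0"


columns = expand_features(["z_cations", "chi_anions"])
print(columns)
print(dividing_line(["(z_A + z_B)", "chi_L"]))
```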
%% Cell type:code id: tags:
``` python
cb_layout = widgets.Layout(width = '15px')
thin_layout = widgets.Layout(width = '100px')
mid_layout = widgets.Layout(width = '200px')
wide_layout = widgets.Layout(width = '300px')
possible_operations = ['add', 'sub', 'abs_diff', 'mult', 'div', 'exp', 'neg_exp', 'inv', 'sq', 'cb',
                       'sqrt', 'cbrt', 'log', 'abs']
possible_features = ['z_cations','chi_cations','lambda_cations','z_anions','chi_anions','lambda_anions']
tooltips = {
    "z_cations" : "Atomic number",
    "chi_cations" : "Pauling electronegativity",
    "lambda_cations" : "Spin orbit coupling",
    "z_anions" : "Atomic number",
    "chi_anions" : "Pauling electronegativity",
    "lambda_anions" : "Spin orbit coupling",
}
# Raw strings keep the LaTeX backslashes from being read as escape sequences.
labels = {
    'add' : r'$x + y$', 'sub' : r'$x - y$', 'abs_diff' : r'$|x - y|$', 'mult' : r'$x \cdot y$', 'div' : r'$x / y$',
    'exp' : r'$\exp(x)$', 'neg_exp' : r'$\exp(-x)$', 'inv' : r'$1/x$', 'sq' : r'$x^2$', 'cb' : r'$x^3$',
    'six_pow' : r'$x^6$', 'sqrt' : r'$\sqrt{x}$', 'cbrt' : r'$\sqrt[3]{x}$', 'log' : r'$\log(x)$',
    'abs' : r'$|x|$', 'sin' : r'$\sin(x)$', 'cos' : r'$\cos(x)$',
    'z_cations' : r'$Z_{cations}$', 'chi_cations' : r'$\chi_{cations}$', 'lambda_cations' : r'$\lambda_{cations}$',
    'z_anions' : r'$Z_{anions}$', 'chi_anions' : r'$\chi_{anions}$', 'lambda_anions' : r'$\lambda_{anions}$'
}
op_list = []
op_labels = []
feat_list = []
feat_labels = []
# One checkbox plus one LaTeX label per operation and per feature.
for operation in possible_operations:
    op_list.append(widgets.Checkbox(description='', value=True, indent=False, layout=cb_layout))
    op_labels.append(widgets.Label(value=labels[operation]))
for feature in possible_features:
    feat_list.append(widgets.Checkbox(description=tooltips[feature], value=True, indent=False, layout=cb_layout))
    feat_labels.append(widgets.Label(value=labels[feature]))
op_box = widgets.VBox([widgets.Label()]+op_list)
op_label_box = widgets.VBox([widgets.Label(value='Operations:', layout=thin_layout)]+op_labels)
feat_box = widgets.VBox([widgets.Label()]+feat_list)
feat_label_box = widgets.VBox([widgets.Label(value='Features:', layout=thin_layout)]+feat_labels)
tier_selection = widgets.Dropdown(options=['PRM2020', 1,2,3], layout=thin_layout)
feat_per_iter_selection = widgets.BoundedIntText(value=26, min=10, max=100, step=1, layout=thin_layout)
dimension_selection = widgets.BoundedIntText(value = 3, min=1, max=4, step=1, layout = thin_layout)
settings_box = widgets.VBox([
    widgets.Label(value='Settings:', layout=wide_layout),
    widgets.Label(value='SISSO rung:', layout=wide_layout),
    tier_selection,
    widgets.Label(value='To unfreeze the feature selection,' , layout=wide_layout),
    widgets.Label(value='please select any rung other than PRM2020.', layout=widgets.Layout(width = '300px', bottom='10px') ),
    widgets.Label(value='Number of selected features per SIS iteration:', layout=wide_layout),
    feat_per_iter_selection,
    widgets.Label(value='Maximum number of dimensions:', layout=wide_layout),
    dimension_selection])
default_button = widgets.Button(description = 'Default selection', layout=mid_layout)
descriptor_button = widgets.Button(description = 'Run', layout=mid_layout)
plot_button = widgets.Button(description = 'Plot interactive map', disabled=True, layout=mid_layout)
default_button.on_click(default_selection)
descriptor_button.on_click(find_descriptors)
plot_button.on_click(plot_2d_solution)
button_box = widgets.VBox([default_button, descriptor_button, plot_button])
out1 = widgets.Output()
out2 = widgets.Output()
gui_box = widgets.HBox([op_box, op_label_box, feat_box, feat_label_box, settings_box, button_box])
out_box = widgets.VBox([gui_box, out1, out2])
# Selecting the PRM2020 rung freezes the checkboxes; other rungs unfreeze them.
tier_selection.observe(prm_select, names='value')
default_selection('')
display(out_box)
```
%% Output
%% Cell type:code id: tags:
``` python
```
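%% Cell type:markdown id: tags:
The settings widgets allow between 10 and 100 selected features per SIS iteration and up to 4 dimensions, and the `RuntimeError` message in `find_descriptors` suggests an upper bound of (number of features generated / max number of dimensions). The cell below is a minimal sketch of checking that bound before running a fit. It assumes the bound is exactly the integer quotient, which the notebook does not state explicitly; `max_sis_select` and `check_settings` are hypothetical helper names.
%% Cell type:code id: tags:
```python
# Hypothetical pre-check (not part of the notebook). Assumption: the largest
# usable value is n_feat // max_dim, as hinted by the error message
# "(number of features generated / max number of dimensions)".

def max_sis_select(n_feat, max_dim):
    """Return the assumed upper bound on features selected per SIS iteration."""
    if n_feat < 1 or max_dim < 1:
        raise ValueError("n_feat and max_dim must both be positive")
    return n_feat // max_dim


def check_settings(n_feat, n_sis_select, max_dim):
    """Return (ok, limit), where ok is True if n_sis_select fits the bound."""
    limit = max_sis_select(n_feat, max_dim)
    return n_sis_select <= limit, limit


# Example with made-up numbers: 120 generated features, 50 per iteration.
ok_2d, limit_2d = check_settings(n_feat=120, n_sis_select=50, max_dim=2)
ok_3d, limit_3d = check_settings(n_feat=120, n_sis_select=50, max_dim=3)
print(ok_2d, limit_2d)
print(ok_3d, limit_3d)
```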