Updated query nomad archive

Merged Aakash Ashok Naik requested to merge updated-query-nomad-archive into master
1 file
+ 787
0
%% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Querying the Archive and performing Artificial Intelligence modeling</font></b>
</div>
<p>
created by:
Luigi Sbailò,<sup>1</sup>
Matthias Scheffler,<sup>1</sup>
and Luca Ghiringhelli<sup>1</sup> <br><br>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br><p>
ghiringhelli@fhi-berlin.mpg.de,
sbailo@fhi-berlin.mpg.de
<br><br>
<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: Jan 5, 2021]</span>
<div>
<img style="float: left;" src="assets/query_nomad_archive/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/query_nomad_archive/Logo_NOMAD.png" width="250">
</div>
</div>
%% Cell type:markdown id: tags:
In this tutorial, we show how to query the NOMAD Archive (https://www.nomad-coe.eu/index.php?page=nomad-repository) and perform data analysis on the retrieved data.
%% Cell type:markdown id: tags:
We load the following packages, all of which are available in the virtual environment that contains the Jupyter notebooks of the NOMAD AI Toolkit. Among them are scikit-learn, one of the most popular machine learning packages, and pandas, a popular tool for data analysis.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import math
import plotly.express as px
from tqdm import tqdm
# from mendeleev import element
from query_nomad_archive import atomic_properties_dft as ap
from query_nomad_archive import atomic_properties_pymat as pymat
from sklearn import preprocessing, tree
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from query_nomad_archive.visualizer import Visualizer
from sklearn import decomposition
from IPython.display import display, Markdown
import re
```
%% Cell type:code id: tags:
``` python
from nomad import client, config
```
%% Cell type:code id: tags:
``` python
from nomad.client import ArchiveQuery
from nomad.metainfo import units
```
%% Cell type:code id: tags:
``` python
from mendeleev import element
```
%% Cell type:code id: tags:
``` python
ap.method(method = 'HSE06', Spin = 'False')
```
%% Cell type:markdown id: tags:
We maintain a "nomad" package that can be imported in all notebooks of the AI Toolkit.
%% Cell type:code id: tags:
``` python
# Z_A = ap.symbol("K")
# ap.K.atomic_ip*6.242e+18
```
%% Cell type:code id: tags:
``` python
# Z_B = element("K")
# Z_B.covalent_radius
```
%% Cell type:code id: tags:
``` python
# pymat.symbol("K").atomic_mass
```
%% Cell type:markdown id: tags:
The nomad package allows us to retrieve data from the NOMAD Archive by means of a script, as shown below. In this script, we insert the metadata that characterize the materials we aim to retrieve. In this case, we select ternary materials that contain oxygen and have a cubic crystal system. We also request that the simulations were carried out with the VASP code using GGA exchange-correlation (xc) functionals. Values are retrieved from the simulation run that reached geometric convergence within a threshold value of 1e-20.
%% Cell type:code id: tags:
``` python
max_entries = 23000
query = ArchiveQuery(
    query={
        '$and': [
            {'dft.code_name': 'VASP'},
            {'dft.crystal_system': 'cubic'},
            {'atoms': ['O']},
            {'dft.xc_functional': "GGA"},
            {'dft.compound_type': 'ternary'},
            {'$lte': {'dft.workflow.section_geometry_optimization.final_energy_difference': 1e-20}},
        ]
    },
    required={
        'section_workflow': {
            'calculation_result_ref': {
                'single_configuration_calculation_to_system_ref': {
                    'chemical_composition_reduced': '*',
                    'chemical_composition': '*',
                    'section_symmetry': '*',
                    'simulation_cell': '*',
                    'lattice_vectors': '*',
                    'atom_species': '*',
                    'atom_labels': '*',
                    'atom_positions': '*',
                }
            }
        },
    },
    per_page=1000,
    parallel=10,
    max=max_entries)
```
%% Cell type:markdown id: tags:
We have defined the variable 'query', which allows us to perform our query.
The 'required' argument ensures that all quantities of the simulation run that we are interested in are fetched during the query. Examples of such quantities are 'chemical_composition', which gives the composition of the material, and 'atom_positions', which contains the positions of all atoms after geometric convergence.
We notice that the variable 'query' takes a number of other arguments: the 'max' value sets the maximum number of entries that can be retrieved; the 'per_page' value indicates the number of entries fetched at each API call; the 'parallel' value gives the number of parallel calls that are performed at each iteration.
Printing the variable shows the number of all entries that are accessible using this variable. Please note that the maximum number of retrievable materials is given by the value of 'max_entries' inserted in the query, even if the printed value of 'Number queried entries: ' is larger.
%% Cell type:code id: tags:
``` python
print(query)
```
%% Output
Number queried entries: 23462
Number of entries loaded in the last api call: 6699
Bytes loaded in the last api call: 286440478
Bytes loaded from this query: 286440478
Number of downloaded entries: 6699
Number of made api calls: 1
%% Cell type:markdown id: tags:
In this tutorial, we use machine learning tools to investigate properties of materials. In particular, we aim to predict the atomic density of the materials as a function of some primary features such as the atomic number.
The atomic density of the material is derived from the volume of the simulation cell, whose dimensions are given in meters. Thus, we define a scale factor that converts these dimensions into angstroms, which improves numerical stability during the machine learning analysis.
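
As a minimal sketch of this conversion, assume a hypothetical cubic cell with a lattice constant of 4 angstroms that contains five atoms (values chosen for illustration only, not taken from the Archive). The cell vectors are given in meters, rescaled to angstroms, and the volume is computed as the determinant of the matrix formed by the three vectors.

```python
import numpy as np

# Hypothetical cubic cell with a lattice constant of 4 angstroms, expressed in meters.
cell_in_meters = np.array([
    [4e-10, 0.0, 0.0],
    [0.0, 4e-10, 0.0],
    [0.0, 0.0, 4e-10],
])
n_atoms = 5  # assumed number of atoms in the cell

scale_factor = 10**10  # meters -> angstroms
cell_in_angstroms = cell_in_meters * scale_factor

# Volume as scalar triple product of the cell vectors;
# abs() guards against a negative sign for left-handed cells.
volume = abs(np.linalg.det(cell_in_angstroms))
density = n_atoms / volume

print(f"volume = {volume:.1f} A^3, density = {density:.6f} atoms/A^3")
```

Expressed in meters, the same volume would be of the order of 1e-29 cubic meters, so the resulting densities would be of the order of 1e28; working in angstroms keeps the target values close to unity.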
%% Cell type:code id: tags:
``` python
scale_factor = 10**10
```
%% Cell type:markdown id: tags:
To retrieve the data and place it in a dataframe, we use a 'for' loop that iteratively fetches all entries up to the maximum value, which is given by 'max_entries'. As some entries in the query might be broken or incomplete, the resulting 'AttributeError' exception is handled within the 'for' loop, which skips the affected entry. In addition, we make sure that the entry contains the simulation cell that we are interested in, and that all elements in the material have an admissible atomic number.
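
The skipping logic can be sketched on toy data as follows. The `entries` list and its attributes are hypothetical stand-ins for archive entries, not the actual NOMAD objects; the sketch also collects rows in a list and builds the dataframe once at the end, which is one possible alternative to appending row by row.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for archive entries; the second one lacks 'atom_species'.
entries = [
    SimpleNamespace(formula="InNiO3", atom_species=[8, 8, 8, 28, 49]),
    SimpleNamespace(formula="broken"),
    SimpleNamespace(formula="CdGdO3", atom_species=[8, 8, 8, 48, 64]),
    SimpleNamespace(formula="invalid", atom_species=[0, 8, 8]),
]

rows = []
for index, entry in enumerate(entries):
    try:
        species = entry.atom_species
    except AttributeError:
        # The entry does not provide the requested quantity: skip it.
        continue
    if min(species) < 1 or max(species) > 118:
        # Skip entries with inadmissible atomic numbers.
        continue
    rows.append({"File-id": index, "Formula": entry.formula, "n_atoms": len(species)})

# Build the dataframe once from the collected rows.
df_toy = pd.DataFrame(rows)
print(df_toy)
```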
%% Cell type:markdown id: tags:
In the next cell, we query and fetch data from the NOMAD Archive. As we query a large number of entries, this operation can be time-consuming. Hence, we have cached the results of the following query, and the data can be loaded with the command given in the cell after the query.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
# path_structure = (r'\data\query_nomad_archive\structures\')
# try:
# os.mkdir(path_structure)
# except OSError:
# !rm -r "$path_structure"
# os.mkdir(path_structure)
# We define a 'Pandas' dataframe that contains all fetched data.
df = pd.DataFrame()
for entry in tqdm(range(max_entries)):
    try:
        calc = query[entry].section_workflow.calculation_result_ref
        formula_red = calc.single_configuration_calculation_to_system_ref.chemical_composition_reduced
        crystal = calc.single_configuration_calculation_to_system_ref.section_symmetry[0].crystal_system
        space_group = calc.single_configuration_calculation_to_system_ref.section_symmetry[0].space_group_number
        elements = np.sort(calc.single_configuration_calculation_to_system_ref.atom_species)
        labels = calc.single_configuration_calculation_to_system_ref.atom_labels
        # Dimensions of the cell are rescaled to angstroms.
        x, y, z = calc.single_configuration_calculation_to_system_ref.simulation_cell.magnitude * scale_factor
        lat_x, lat_y, lat_z = calc.single_configuration_calculation_to_system_ref.lattice_vectors.magnitude * scale_factor
    except AttributeError:
        # Entries that lack one of the requested quantities are skipped.
        continue
    if (min(elements) < 1 or max(elements) > 118):
        continue
    # The length of the array 'elements' gives the total number of atoms.
    n_atoms = len(elements)
    # Structures of materials are stored for being viewed using the Visualizer.
    # file = open(path_structure + str(entry) +".xyz","w")
    # file.write("%d\n\n"%(n_atoms*8))
    # for i in [0,1]:
    #     for j in [0,1]:
    #         for k in [0,1]:
    #             for n in range (n_atoms):
    #                 el = calc.single_configuration_calculation_to_system_ref.atom_labels[n]
    #                 xyz = calc.single_configuration_calculation_to_system_ref.atom_positions[n].magnitude * scale_factor
    #                 xyz += i*lat_x
    #                 xyz += j*lat_y
    #                 xyz += k*lat_z
    #                 file.write (el)
    #                 file.write ("\t%f\t%f\t%f\n"%(xyz[0],xyz[1],xyz[2]))
    # file.close()
    # The volume of the cell is obtained as scalar triple product of the three base vectors.
    # The triple scalar product is obtained as determinant of the matrix composed of the three vectors.
    cell_volume = np.linalg.det([x, y, z])
    # The atomic density is given by the number of atoms in the cell divided by the cell volume.
    density = n_atoms / cell_volume
    # The ternary materials are composed of oxygen and two other elements labeled as A,B.
    # Variables 'Z_A','Z_B' contain the atomic number of the elements A,B.
    # Note: atomic numbers are sorted numerically, whereas labels are sorted alphabetically,
    # so 'Z_A' and 'lab_A' do not necessarily refer to the same element.
    Z_A = int(np.delete(np.unique(elements), np.where(np.unique(elements) == 8))[0])
    Z_B = int(np.delete(np.unique(elements), np.where(np.unique(elements) == 8))[1])
    lab_A = np.delete(np.unique(labels), np.where(np.unique(labels) == 'O'))[0]
    lab_B = re.sub(r"\d+", "", np.delete(np.unique(labels), np.where(np.unique(labels) == 'O'))[1])
    # We retrieve the atomic properties of the elements A,B from their labels.
    # The commented lines show the alternative based on the Mendeleev 'element' class.
    #A = element(Z_A)
    #B = element(Z_B)
    A = ap.symbol(lab_A)
    B = ap.symbol(lab_B)
    # The fraction of atoms of a specific element within the material, which is also given by the stoichiometric ratio.
    fraction_O = np.sum(np.where(elements == 8, 1, 0)) / len(elements)
    fraction_A = np.sum(np.where(elements == A.atomic_number, 1, 0)) / len(elements)
    fraction_B = np.sum(np.where(elements == B.atomic_number, 1, 0)) / len(elements)
    # At each iteration, we add to the dataframe one row that contains the A,B elements in the material and a number of other material properties.
    df = df.append({
        'Element_A_name': pymat.symbol(lab_A).atomic_element_name,
        'Element_B_name': pymat.symbol(lab_B).atomic_element_name,
        'Atomic_number_A': Z_A,
        'Atomic_number_B': Z_B,
        'Fraction_A': fraction_A,
        'Fraction_B': fraction_B,
        'Fraction_O': fraction_O,
        'Covalent_radius_A': A.atomic_r_val[0]*100,
        'Covalent_radius_B': B.atomic_r_val[0]*100,
        'Ionenergy_A': A.atomic_ip*6.242e+18,
        'Ionenergy_B': B.atomic_ip*6.242e+18,
        'Weight_A': pymat.symbol(lab_A).atomic_mass,
        'Weight_B': pymat.symbol(lab_B).atomic_mass,
        'Space_group_number': int(space_group),
        'Atomic_density': density,
        'Formula': formula_red,
        'File-id': int(entry),
    }, ignore_index=True)
```
%% Output
100%|████████████████████████████████████████████████████████████████████████████| 23000/23000 [07:47<00:00, 49.20it/s]
%% Cell type:markdown id: tags:
Here we load the dataframe that contains the data retrieved from the NOMAD Archive using the conditions defined above. The following cell should be run only if the query above was skipped.
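
The caching pattern itself can be sketched as a pickle round trip. The toy dataframe and the file name `toy_cache.pkl` are assumptions made for illustration; a temporary directory stands in for the data folder of the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the result of a time-consuming query.
df_toy = pd.DataFrame({
    "Formula": ["InNiO3", "CdGdO3"],
    "Atomic_density": [0.0698, 0.0615],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "toy_cache.pkl")
    # Store the dataframe on disk once ...
    df_toy.to_pickle(cache_path)
    # ... and reload it later without repeating the query.
    df_loaded = pd.read_pickle(cache_path)

print(df_loaded.equals(df_toy))
```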
%% Cell type:code id: tags:
``` python
#df = pd.read_pickle('./data/query_nomad_archive/ternary_O_cubic')
```
%% Cell type:markdown id: tags:
Displaying the dataframe gives an overview of the composition of the dataset. Pandas dataframes also include the 'describe' method, which returns summary statistics of each column.
%% Cell type:code id: tags:
``` python
df
```
%% Output
Atomic_density Atomic_number_A Atomic_number_B Covalent_radius_A \
0 0.069760 28.0 49.0 154.635
1 0.072930 28.0 49.0 154.635
2 0.049210 37.0 41.0 77.725
3 0.058399 47.0 61.0 134.805
4 0.067951 70.0 82.0 142.345
... ... ... ... ...
22308 0.070226 50.0 93.0 54.845
22309 0.061514 48.0 64.0 124.345
22310 0.079527 42.0 69.0 72.935
22311 0.057326 41.0 61.0 77.725
22312 0.066790 69.0 90.0 130.435
Covalent_radius_B Element_A_name Element_B_name File-id Formula \
0 34.135 Indium Nickel 0.0 InNiO3
1 34.135 Indium Nickel 1.0 InNiO3
2 235.115 Niobium Rubidium 2.0 Nb2O8Rb4
3 33.765 Silver Promethium 3.0 Ag2O8Pm4
4 176.325 Lead Ytterbium 4.0 O8Pb2Yb4
... ... ... ... ... ...
22308 137.255 Neptunium Tin 22995.0 NpO3Sn
22309 30.575 Cadmium Gadolinium 22996.0 CdGdO3
22310 26.255 Molybdenum Thulium 22997.0 MoO3Tm
22311 33.765 Niobium Promethium 22998.0 NbO3Pm
22312 26.255 Thorium Thulium 22999.0 O3ThTm
Fraction_A Fraction_B Fraction_O Ionenergy_A Ionenergy_B \
0 0.200000 0.200000 0.600000 5.094165 8.055879
1 0.200000 0.200000 0.600000 5.094165 8.055879
2 0.142857 0.285714 0.571429 7.221556 3.992932
3 0.142857 0.285714 0.571429 7.409658 5.775166
4 0.142857 0.285714 0.571429 6.443283 6.480560
... ... ... ... ... ...
22308 0.200000 0.200000 0.600000 5.500209 6.748155
22309 0.200000 0.200000 0.600000 9.395269 4.172374
22310 0.200000 0.200000 0.600000 6.742170 5.902252
22311 0.200000 0.200000 0.600000 7.221556 5.775166
22312 0.200000 0.200000 0.600000 5.734792 5.902252
Space_group_number Weight_A Weight_B
0 221.0 114.81800 58.69340
1 221.0 114.81800 58.69340
2 227.0 92.90638 85.46780
3 227.0 107.86820 145.00000
4 227.0 207.20000 173.04000
... ... ... ...
22308 221.0 237.00000 118.71000
22309 221.0 112.41100 157.25000
22310 221.0 95.94000 168.93421
22311 221.0 92.90638 145.00000
22312 221.0 232.03806 168.93421
[22313 rows x 17 columns]
%% Cell type:markdown id: tags:
We might have different entries with the same chemical composition, because e.g. simulations were performed for the same material with different settings that were not included among the filters of our query. Each of these simulations might have produced a slightly different value of the resulting atomic density of the material. As data is taken from heterogeneous simulations which were carried out in different laboratories, we do not aim to evaluate all possible parameters of each simulation. Hence, we average the atomic density value over all materials with the same chemical composition.
After averaging, or grouping, data is placed in a different dataframe 'df_grouped', where each entry represents a different compound.
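
A minimal sketch of this grouping step on toy data (the values are made up for illustration): entries sharing the same keys are averaged, and the number of entries contributing to each average is stored as 'Replicas'.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Formula": ["InNiO3", "InNiO3", "CdGdO3"],
    "Space_group_number": [221, 221, 221],
    "Atomic_density": [0.070, 0.074, 0.062],
})

keys = ["Formula", "Space_group_number"]
grouped = df_toy.groupby(keys)

# Average the target over all entries sharing the same keys.
df_toy_grouped = grouped["Atomic_density"].mean().reset_index()
# Count how many original entries contributed to each average.
df_toy_grouped["Replicas"] = grouped.size().values

print(df_toy_grouped)
```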
%% Cell type:code id: tags:
``` python
df_grouped=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).mean()
df_grouped=df_grouped.reset_index(level=['Element_A_name','Element_B_name','Space_group_number'])
df_grouped['Replicas']=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).count()['Atomic_density'].values
```
%% Cell type:markdown id: tags:
With data placed in a dataframe, we can carry out our machine learning analysis.
%% Cell type:markdown id: tags:
# Example of unsupervised machine learning: Clustering and dimension reduction
---
%% Cell type:markdown id: tags:
First, we perform an exploratory analysis to understand how the data is composed and organized. To this end, we use unsupervised learning to extract from the dataset clusters of materials with similar properties.
We define the list of features that are used for clustering, including only the stoichiometric ratios and the atomic numbers. Our aim is to use unsupervised learning to understand whether the defined descriptors are sufficient for structuring the dataset.
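
Before clustering, the features are standardized so that fractions (of order 0.1) and atomic numbers (of order 10 to 100) contribute on a comparable scale. The sketch below reproduces with plain NumPy what `preprocessing.scale` does with its default settings, namely centering each column and dividing by its population standard deviation; the feature values are made up for illustration.

```python
import numpy as np

# Toy feature matrix: columns are a stoichiometric fraction and an atomic number.
features = np.array([
    [0.20, 28.0],
    [0.20, 49.0],
    [0.14, 37.0],
    [0.29, 82.0],
])

# Standardize each column: subtract the mean, divide by the population standard deviation.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```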
%% Cell type:code id: tags:
``` python
clustering_features = []
clustering_features.append('Fraction_A')
clustering_features.append('Fraction_B')
clustering_features.append('Fraction_O')
clustering_features.append('Atomic_number_A')
clustering_features.append('Atomic_number_B')
# clustering_features.append('Ionenergy_A')
# clustering_features.append('Ionenergy_B')
# clustering_features.append('Weight_A')
# clustering_features.append('Weight_B')
# clustering_features.append('Covalent_radius_A')
# clustering_features.append('Covalent_radius_B')
df_clustering=preprocessing.scale(df_grouped[clustering_features])
```
%% Cell type:markdown id: tags:
As the clustering algorithm we use HDBSCAN, which is described in:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
R.J.G.B. Campello, D. Moulavi, J. Sander: <span style="font-style: italic;">Density-Based Clustering Based on Hierarchical Density Estimates</span>, Springer Berlin Heidelberg, (2013).</div>
%% Cell type:markdown id: tags:
The only input parameter that this algorithm requires is the minimum size of each cluster.
To achieve a more accurate cluster definition, HDBSCAN detects all points that cannot be reliably assigned to a specific cluster and labels them as 'outliers'.
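
The label convention can be illustrated without running the clustering itself. In the sketch below, the label array is hypothetical: non-negative integers identify clusters and -1 marks outliers, which is the convention relied upon later in this notebook when outliers are filtered with 'Cluster_label' equal to -1.

```python
import numpy as np

# Hypothetical label array: non-negative integers identify clusters, -1 marks outliers.
labels = np.array([0, 0, 1, -1, 2, 1, -1, 0, 2, 2])

# Cluster labels start at 0, so the largest label plus one gives the number of clusters.
n_clusters = int(labels.max()) + 1
n_outliers = int(np.sum(labels == -1))
cluster_sizes = {c: int(np.sum(labels == c)) for c in range(n_clusters)}

print(n_clusters, n_outliers, cluster_sizes)
```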
%% Cell type:markdown id: tags:
The implementation of the algorithm that we use is taken from https://pypi.org/project/hdbscan/.
%% Cell type:code id: tags:
``` python
import hdbscan
```
%% Cell type:code id: tags:
``` python
clusterer=hdbscan.HDBSCAN(min_cluster_size=60)
clusterer.fit(df_clustering)
display(Markdown('The algorithm finds ' + str(clusterer.labels_.max()+1) + ' clusters.'))
```
%% Output
The algorithm finds 3 clusters.
%% Cell type:code id: tags:
``` python
cluster_labels = clusterer.labels_
df_grouped['Cluster_label']=cluster_labels
```
%% Cell type:markdown id: tags:
To visualize our multidimensional data, we need to project it onto a two-dimensional manifold.
Hence, we use the UMAP embedding algorithm.
%% Cell type:code id: tags:
``` python
import umap
```
%% Cell type:code id: tags:
``` python
reducer = umap.UMAP(min_dist=0.5, n_neighbors=100)
embedding = reducer.fit(df_clustering)
embedding = reducer.transform(df_clustering)
df_grouped['x_emb']=embedding[:,0]
df_grouped['y_emb']=embedding[:,1]
```
%% Output
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-33-263ed499c165> in <module>
----> 1 reducer = umap.UMAP(min_dist=0.5, n_neighbors=100)
2 embedding = reducer.fit(df_clustering)
3 embedding = reducer.transform(df_clustering)
4 df_grouped['x_emb']=embedding[:,0]
5 df_grouped['y_emb']=embedding[:,1]
AttributeError: module 'umap' has no attribute 'UMAP'
%% Cell type:markdown id: tags:
We maintain the Visualizer, a dedicated tool for the interactive visualization of data retrieved from the Archive.
The Visualizer shows compounds belonging to different clusters in different colors.
By default, outliers are not shown on the map, but their visualization can be activated by clicking the 'Outliers' label on the right of the map.
The color of the markers can also represent a specific target property of the material, which can be selected from the 'Marker colors' dropdown menu.
Target properties that can be displayed with different colors are passed as a list of 'color_features'.
After a specific property is selected, a new menu appears for choosing the color scale used to visualize the different values of that target property.
To prevent data overloading, only a fraction of the whole dataset is initially visualized.
This parameter can be adjusted by modifying the fraction value on top of the map, up to selecting all entries in the dataset.
When hovering over the map, the Visualizer shows which compound corresponds to each point, and its number of 'Replicas'.
'Replicas' is the number of entries from the original dataset (before grouping) that have the same chemical composition.
It is also possible to select additional features in the 'hover_features' list, which are displayed while hovering.
Clicking on any of the points in the map automatically shows the 3D chemical structure of the material in one of the windows below.
Note that each time the 'Display' button is clicked, a different structure with the same chemical composition is visualized.
A new structure is shown up to the 'Replicas' number.
This makes it possible to inspect all structures in the dataset.
Furthermore, the chemical formula of a specific compound can be typed manually in the 'Compound' textbox, and clicking the 'Display' button will both show the 3D chemical structure and mark the exact position of the compound on the map with a cross.
The 'Compound' textbox includes autocompletion, which makes it possible to find materials in the dataset by entering partial formulae.
Lastly, the Visualizer offers a number of utilities for producing high-quality plots of the map, which are displayed after clicking the button just below the map.
%% Cell type:code id: tags:
``` python
hover_features = []
hover_features.append('Atomic_number_A')
hover_features.append('Atomic_number_B')
hover_features.append('Space_group_number')
hover_features.append('Atomic_density')
hover_features.append('Replicas')
hover_features.append('Cluster_label')
color_features = []
color_features.append('Atomic_density')
color_features.append('Space_group_number')
Visualizer(df, df_grouped, hover_features, color_features).view()
```
%% Output
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
c:\users\aakas\appdata\local\programs\python\python38\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'x_emb'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-32-329093dad86f> in <module>
11 color_features.append('Space_group_number')
12
---> 13 Visualizer(df, df_grouped, hover_features, color_features).view()
E:\Material Science\WS 2020-21\Work-SHK\Gitlab\analytics-query-nomad-archive\query_nomad_archive\visualizer.py in __init__(self, df, df_grouped, hover_features, color_features)
118 name=self.name_trace[cl],
119 mode='markers',
--> 120 x=self.df_entries_onmap[cl]['x_emb'],
121 y=self.df_entries_onmap[cl]['y_emb'],
122 marker_color=next(self.palette),
c:\users\aakas\appdata\local\programs\python\python38\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]
c:\users\aakas\appdata\local\programs\python\python38\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'x_emb'
%% Cell type:markdown id: tags:
Using the Visualizer, we can analyse the composition of the extracted clusters.
We have selected the atomic density and the space group number as color features.
This makes it possible to inspect how these values vary within clusters.
The values of these features show ordered structures within clusters, which is particularly interesting because these features were not used by the clustering algorithm.
This suggests that the atomic features used for clustering are sufficient to describe certain properties of the whole material, such as the most stable structure or the atomic density.
Therefore, we might imagine that there is a functional form capable of inferring the defined materials properties from our atomic features, and we can train a supervised machine learning model to find such a relationship.
The space group number of the compounds in each cluster is shown below, first as a list of values and then as a pie chart.
We can clearly notice that in each cluster one space group number is predominant with respect to all the others.
This means that we can label the different clusters with a characteristic space group number.
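
The identification of a characteristic space group per cluster can be sketched on toy data as follows. The cluster labels and space group numbers are made up for illustration; the column names mirror those of 'df_grouped'.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Cluster_label":      [0, 0, 0, 1, 1, 1, -1],
    "Space_group_number": [221, 221, 227, 227, 227, 225, 216],
})

# Discard outliers, then count compounds per (cluster, space group) pair.
counts = (
    df_toy[df_toy["Cluster_label"] != -1]
    .groupby(["Cluster_label", "Space_group_number"])
    .size()
    .rename("count")
    .reset_index()
)

# For each cluster, keep the row with the largest count.
dominant = counts.loc[counts.groupby("Cluster_label")["count"].idxmax()]
print(dominant)
```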
%% Cell type:code id: tags:
``` python
df_count_groups=df_grouped.loc[df_grouped['Cluster_label']!=-1].groupby(['Cluster_label','Space_group_number']).describe()['Atomic_density'][['count']]
# 'print' returns None, so the dataframe is displayed directly.
display(df_count_groups)
df_count_groups=df_count_groups.reset_index();
n_clusters = df_grouped['Cluster_label'].max() +1
```
%% Cell type:code id: tags:
``` python
for cl in range(n_clusters):
    df_cluster = df_count_groups.loc[df_count_groups['Cluster_label'] == cl]
    fig = px.pie(df_cluster, values='count', names='Space_group_number')
    fig.show()
```
%% Cell type:markdown id: tags:
The pie chart below includes the space group numbers of all compounds classified as outliers. We can clearly see that the outliers cover many different space group numbers, and it is not possible to identify a predominant space group number that is characteristic of the group of outliers.
%% Cell type:code id: tags:
``` python
df_cluster=df_grouped.loc[df_grouped['Cluster_label']==-1].groupby(['Space_group_number']).describe()['Atomic_density'][['count']].reset_index()
plt.figure(figsize=(25,25))
fig = px.pie(df_cluster, values='count', names='Space_group_number')
fig.show()
```
%% Cell type:markdown id: tags:
# Example of supervised machine learning: Random forest
---
%% Cell type:markdown id: tags:
Finally, we aim to use the large dataset of materials that we have retrieved from the NOMAD Archive to train an AI model.
The findings obtained with the unsupervised analysis suggest that it is possible to train a model to predict materials properties using only atomic features.
The trained model can then be used to predict properties of yet unknown materials from the knowledge of their constituent atoms.
In this specific case, we aim to predict the average atomic density.
We use the random forest method, which is available in the scikit-learn package.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestRegressor
```
%% Cell type:markdown id: tags:
We select the stoichiometric fractions and the atomic numbers as primary features, and the atomic density of the material as target feature.
%% Cell type:code id: tags:
``` python
ML_primary_features = []
ML_primary_features.append('Fraction_A')
ML_primary_features.append('Fraction_B')
ML_primary_features.append('Fraction_O')
ML_primary_features.append('Atomic_number_A')
ML_primary_features.append('Atomic_number_B')
ML_target_features = []
ML_target_features.append('Atomic_density')
```
%% Cell type:markdown id: tags:
Our dataset is divided into a training set and a test set. The model is trained only on the training set, while the values in the test set are ignored. This makes it possible to test the prediction capability of the model on data that it has not seen.
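
A minimal sketch of such a split on toy arrays is given below. The arrays are made up for illustration, and the fixed 'random_state' is an assumption added here to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten toy samples with two features each
y = np.arange(10)                 # toy target values

# Hold out 20% of the samples; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

With ten samples and 'test_size=0.2', eight samples end up in the training set and two in the test set, and no sample appears in both.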
%% Cell type:code id: tags:
``` python
X_train, X_test, y_train, y_test = train_test_split (df[ML_primary_features], df[ML_target_features], test_size=0.2, )
```
%% Cell type:markdown id: tags:
Here we train the model.
%% Cell type:code id: tags:
``` python
random_regressor = RandomForestRegressor(
n_estimators= 100,
max_depth = 100,
max_features = 5,
min_samples_split = 5,
random_state=0
)
random_regressor.fit(X_train, y_train.to_numpy().ravel())
```
%% Cell type:markdown id: tags:
After training, we check the accuracy of the model.
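
The accuracy measure used in the following cells is the mean absolute error, that is, the average magnitude of the difference between predicted and reference values. A minimal sketch with made-up densities:

```python
import numpy as np

# Hypothetical reference and predicted atomic densities.
y_true = np.array([0.070, 0.062, 0.058, 0.080])
y_pred = np.array([0.068, 0.065, 0.058, 0.076])

# Mean absolute error: average magnitude of the prediction errors.
mae = np.mean(np.abs(y_pred - y_true))
print(f"mean absolute error = {mae:.5f}")
```

Comparing this value on the training set and on the test set gives a first indication of overfitting: a much smaller error on the training set suggests that the model has partly memorized the training data.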
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_test)
display(Markdown(r'The AI model predicts the atomic density on the test set with an average error of '+
str(int(10000*np.mean(np.abs(y_predict-y_test.to_numpy().flatten())))/10000) +
' Angstroms$^{-3}$.' ))
```
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_train)
display(Markdown(r'The AI model predicts the atomic density on the training set with an average error of '+
str(int(10000*np.mean(np.abs(y_predict-y_train.to_numpy().flatten())))/10000) +
' Angstroms$^{-3}$.' ))
```
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_test)
X_test = X_test.assign(Atomic_density=y_predict)
df_A_pred = X_test[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
df_B_pred = X_test[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
df_AB_pred = pd.concat([df_A_pred,df_B_pred], ignore_index=True)
df_A = df[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
df_B = df[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
df_AB = pd.concat([df_A,df_B], ignore_index=True)
```
%% Cell type:code id: tags:
``` python
df_AB['Atomic_number']=df_AB['Atomic_number'].astype('int')
```
%% Cell type:code id: tags:
``` python
xaxis = df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,0]
yaxis = df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,1]
```
%% Cell type:markdown id: tags:
In the following plot, we see the average atomic density of ternary compounds composed of oxygen and two other elements, one of which has the atomic number given on the x-axis. All values are therefore averaged over the third element.
The plot shows the predictions of the trained model on the test set, together with the known values of the training set, which are taken as reference values. Each point also shows the standard deviation. We emphasize that, since each value on the plot is an average over all elements in the periodic table, the standard deviation cannot go to zero by construction, even in the limit of taking all possible combinations. Our aim is therefore that the averages and standard deviations predicted by our model are comparable to the reference ones.
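
The averaging over the third element can be sketched on toy data as follows (values made up for illustration). Each compound contributes once for each of its two non-oxygen elements, and the mean and standard deviation are then computed per atomic number.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Atomic_number_A": [28, 28, 37],
    "Atomic_number_B": [49, 37, 49],
    "Atomic_density":  [0.070, 0.060, 0.050],
})

# Stack the A and B columns so that every compound contributes once per constituent element.
df_a = df_toy[["Atomic_number_A", "Atomic_density"]].rename(columns={"Atomic_number_A": "Atomic_number"})
df_b = df_toy[["Atomic_number_B", "Atomic_density"]].rename(columns={"Atomic_number_B": "Atomic_number"})
df_ab = pd.concat([df_a, df_b], ignore_index=True)

# Mean and standard deviation of the density for each atomic number.
stats = df_ab.groupby("Atomic_number")["Atomic_density"].agg(["mean", "std"])
print(stats)
```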
%% Cell type:code id: tags:
``` python
plt.style.use('ggplot')
plt.figure(figsize=(15,10))
x=df_AB_pred.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
y=df_AB_pred.groupby(['Atomic_number']).mean().to_numpy().flatten()
std=df_AB_pred.groupby(['Atomic_number']).std().to_numpy().flatten()
plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='AI prediction' )
x=df_AB.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
y=df_AB.groupby(['Atomic_number']).mean().to_numpy().flatten()
std=df_AB.groupby(['Atomic_number']).std().to_numpy().flatten()
plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='Reference value' );
plt.legend(loc='upper right', fontsize='x-large')
plt.xlabel('Atomic number', fontsize='x-large')
plt.ylabel(r'Atomic density [$\AA^{-3}$]', fontsize='x-large')
plt.xticks([3,11,19,37,55,89],fontsize='x-large');
plt.yticks(fontsize='x-large');
```
%% Cell type:markdown id: tags:
In the plot above, we can see that the values predicted by the AI model are comparable to the reference values. In addition, we observe that the atomic density follows periodic trends, as values tend to be lower at the beginning and at the end of each row of the periodic table.