Commits (1)
%% Cell type:markdown id: tags:
<img src="assets/query_nomad_archive/header.jpg" width="900">
%% Cell type:markdown id: tags:
<img style="float: left;" src="assets/query_nomad_archive/logo_MPG.png" width=150>
<img style="float: left; margin-top: -10px" src="assets/query_nomad_archive/logo_NOMAD.png" width=250>
<img style="float: left; margin-top: -5px" src="assets/query_nomad_archive/logo_HU.png" width=130>
%% Cell type:markdown id: tags:
In this tutorial, we show how to query the NOMAD Archive (https://www.nomad-coe.eu/index.php?page=nomad-repository) and perform data analysis on the retrieved data.
%% Cell type:markdown id: tags:
## Preliminary operations
We load the following packages that are all available in the virtual environment containing the Jupyter notebooks in the NOMAD AI toolkit. Among the loaded packages, we highlight ``sklearn``, i.e., scikit-learn, one of the most popular machine-learning packages, and ``pandas``, a popular tool for data handling and analysis.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import math
import plotly.express as px
from tqdm import tqdm
from sklearn import preprocessing, tree
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from query_nomad_archive.visualizer import Visualizer
from sklearn import decomposition
from IPython.display import display, Markdown
import re
```
%% Cell type:code id: tags:
``` python
from nomad import client, config
```
%% Cell type:code id: tags:
``` python
from nomad.client.archive import ArchiveQuery
from nomad.metainfo import units
```
%% Cell type:code id: tags:
``` python
import nest_asyncio
nest_asyncio.apply()
```
%% Cell type:markdown id: tags:
We maintain a ``nomad`` package that can be imported in all notebooks of the AI Toolkit.
%% Cell type:markdown id: tags:
The ``nomad`` package allows to retrieve data from the NOMAD Archive by means of a script, as shown below. In this script, we insert metadata characterizing the materials that we aim to retrieve. In this case, we select ternary materials containing oxygen. We also request that simulations were carried out using the VASP code using GGA exchange-correlation (xc) functionals. Values are retrieved from the simulation run that found geometrically convergence wihin a desird threshold value of.
%% Cell type:code id: tags:
``` python
from atomicfeaturespackage.atomicproperties import atomic_properties_dft as ap
from atomicfeaturespackage.atomicproperties import atomic_properties_pymat as pymat
from atomicfeaturespackage.atomicproperties import periodictable
```
%% Cell type:markdown id: tags:
The modules ``atomic_properties_dft`` and ``atomic_properties_pymat`` make available atomic properites of the elements of the periodic table, to be used as features in data analytics. The sources of these atomic proporties are DFT calculations performed by the NOMAD team and <a href="https://pymatgen.org/" target="_blank">pymatgen</a>, respectively.
%% Cell type:markdown id: tags:
We have developed a visualization tool that allows us to visualize atomic properites of all elements accross periodic table as a heatmap. Currently, this tool is able to visualize atomic properties acessible from ``atomic_properties_dft`` module. Below is an example for the data calculated via the HSE06 functional and spinless settings.
This module can be used as follows. From the dropdown menu, one can select which property one is interested to check and the table is updated automatically to show the corresponding heatmap.
%% Cell type:code id: tags:
``` python
periodictable.heatmap(Spin = 'False', method = 'pbe')
```
%% Cell type:markdown id: tags:
## Querying the NOMAD Archive
%% Cell type:code id: tags:
``` python
max_entries=9000
query={
'results.method.simulation.program_name': 'VASP',
'results.material.elements': ['O'],
"results.material.symmetry.crystal_system:any": [
"cubic"
],
'results.material.n_elements': {
"lte": 3,
"gte": 3
},
"results.material.symmetry.crystal_system:any": [
"cubic"
],
"results.method.simulation.dft.xc_functional_type:any": [
"GGA"
],
"results.properties.geometry_optimization": {
"final_energy_difference": {
"lte": 2e-22,
"gte": 1e-30
}
},
}
required={
'results':{
'material':{
'elements': '*',
'symmetry': {
'space_group_number': '*'
}
}
},
'workflow': {
'geometry_optimization': {
'final_energy_difference': '*',
},
'calculation_result_ref': {
'system_ref': {
'chemical_composition_reduced': '*',
'chemical_composition': '*',
'atoms': {
'labels': '*',
'species': '*',
'lattice_vectors': '*',
'positions': '*',
}
}
}
}
}
query = ArchiveQuery(query=query, required=required, page_size=100, results_max=max_entries)
query.fetch()
```
%% Cell type:markdown id: tags:
We have defined the variable ``query``, which allows to perform our query.
All quantities defined in the ``required`` field are fetched during the query. For example, we can see quantities as ``chemical_composition`` which gives the composition of the material or ``atoms.positions`` that contains the positon of all atoms after geometric convergence.
We notice that the variable ``query`` contains a number of other variables: the ``max_entries`` value sets the maximum number of entries that can be retrieved; the ``per_page`` value indicates the number of entries fetched at each API call.
%% Cell type:markdown id: tags:
In this tutorial, we use machine learning tools to investigate properties of materials. In particular, we aim to predict the atomic density of the materials as a function of some primary features such as the atomic number.
The atomic density of the material is derived from the volume of the simulation cell, whose dimensions are inserted in meters. Thus, we define a scale factor to convert dimensions into angstroms for a higher numerical stability during the machine learning analysis.
%% Cell type:code id: tags:
``` python
scale_factor = 10**10
```
%% Cell type:markdown id: tags:
To retrieve data and place it within a framework, we use a 'for' loop that iteratively fetches all entries up to the maximum value, which is given by 'max_entries'. Taking into account that some links in the query might be broken, the resulting 'IndexError' exception is handled within the 'for' loop, that skips over the broken entry. In addition, we also make sure the entry contains the simulation cell value which we are interested in, and that all elements in the material have an admissible atomic number.
%% Cell type:markdown id: tags:
In the next cell, we download data from the NOMAD Archive. **As we query a large number of elements, this operation can be time consuming. Hence, we have cached the results of the following query, and data can be loaded with a command given in the subsequent cell.**
%% Cell type:code id: tags:
``` python
results=query.download()
path_structure = './data/query_nomad_archive/structures/'
try:
os.mkdir(path_structure)
except OSError:
!rm -r "$path_structure"
os.mkdir(path_structure)
# We define a 'Pandas' dataframe that contains all fetched data.
df = pd.DataFrame()
for entry in range(max_entries):
try:
calc = results[entry].workflow[0].calculation_result_ref
convergence_energy_diff = results[entry].workflow[0].geometry_optimization.final_energy_difference.m
formula_red = calc.system_ref.chemical_composition_reduced
space_group = results[entry].results.material.symmetry.space_group_number
elements = np.sort(calc.system_ref.atoms.species)
labels = calc.system_ref.atoms.labels
# Dimensions of the cell are rescaled to angstroms.
lat_x, lat_y, lat_z = calc.system_ref.atoms.lattice_vectors.m * scale_factor
except IndexError:
continue
if (min(elements)<1 or max(elements)>118):
continue
# The total number in the array 'elements' gives the total number of atoms.
n_atoms = len (elements)
# Structures of materials are stored for being viewed using the Visualizer.
file = open(path_structure + str(entry) +".xyz","w")
file.write("%d\n\n"%(n_atoms*8))
for i in [0,1]:
for j in [0,1]:
for k in [0,1]:
for n in range (n_atoms):
el = calc.system_ref.atoms.labels[n]
xyz = calc.system_ref.atoms.positions[n].m * scale_factor
xyz += i*lat_x
xyz += j*lat_y
xyz += k*lat_z
file.write (el)
file.write ("\t%f\t%f\t%f\n"%(xyz[0],xyz[1],xyz[2]))
file.close()
# The volume of the cell is obtained as scalar triple product of the three base vectors.
# The triple scalar product is obtained as determinant of the matrix composed with the three vectors.
cell_volume = np.linalg.det ([lat_x,lat_y,lat_z])
# The atomic density is given by the number of atoms in a unit cell.
density = n_atoms / cell_volume
# The ternary materials are composed by Oxygen and two other elements labeled as A,B.
# Variables 'Z_A','Z_B' contain the atomic number of the elements A,B.
Z_A = int(np.delete(np.unique(elements), np.where(np.unique(elements)==8))[0])
Z_B = int(np.delete(np.unique(elements), np.where(np.unique(elements)==8))[1])
lab_A = np.delete(np.unique(labels), np.where(np.unique(labels)== 'O'))[0]
lab_B = re.sub("\d+", "", np.delete(np.unique(labels), np.where(np.unique(labels)== 'O'))[1])
# We instantiate the atomic_properties_dft(imported as ap here) classes with the atomic symbol by first specifying
# DFT spin setting employed and fucntional for evaluation of atomic properties.
# Functional for which this properties accessible are 'hse06', 'pbe','pbe0','pbesol','pw-lda','revpbe'
# Spin setting could be either 'True' or 'False'
ap.method(method = 'pbe', Spin = 'False')
# We instantiate the atomic_properties_dft(imported as ap here) classes with the atomic symbol.
# Similarly we instantiate the atomic_properties_pymat(imported as pymat here) classes with the atomic symbol
# These classes allow to retrieve atoms properties, in this example we fetch the element ionization energy,atomic radius
# from atomic_properties_dft module and element name,element weight from atomic_properties_pymat module
try:
A = ap.symbol(lab_A)
B = ap.symbol(lab_B)
except KeyError:
continue
# The fraction of atoms of a specific element within the material, that is also given by the stochiometric ratio.
fraction_O = np.sum(np.where (elements==8,1,0)) / len(elements)
fraction_A = np.sum(np.where (elements==A.atomic_number,1,0)) / len(elements)
fraction_B = np.sum(np.where (elements==B.atomic_number,1,0)) / len(elements)
# At each iteration, we add to the datafram one row that contains the A,B elements in the material and a number of other material properties.
df=df.append({
'Convergence_energy_diff': convergence_energy_diff,
'Element_A_name': pymat.symbol(lab_A).atomic_element_name,
'Element_B_name': pymat.symbol(lab_B).atomic_element_name,
'Atomic_number_A': Z_A,
'Atomic_number_B': Z_B,
'Fraction_A':fraction_A,
'Fraction_B':fraction_B,
'Fraction_O':fraction_O,
'Atomic_radius_A':A.atomic_r_val[0]*100 , #distance values are in angstroms so we convert these to pm in this eg
'Atomic_radius_B':B.atomic_r_val[0]*100 , #distance values are in angstroms so we convert these to pm in this eg
'Ionenergy_A':A.atomic_ip*6.242e+18, # energy value obtained is in joules so we convert to eV
'Ionenergy_B':B.atomic_ip*6.242e+18, # energy value obtained is in joules so we convert to eV
'El_affinity_A':A.atomic_ea*6.242e+18, # energy value obtained is in joules so we convert to eV
'El_affinity_B':B.atomic_ea*6.242e+18, # energy value obtained is in joules so we convert to eV
'Homo_A':A.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Homo_B':B.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Lumo_A':A.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Lumo_B':B.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Weight_A':pymat.symbol(lab_A).atomic_mass,
'Weight_B':pymat.symbol(lab_B).atomic_mass,
'Space_group_number':int(space_group),
'Atomic_density':density,
'Formula':formula_red,
'File-id':int(entry),
},ignore_index=True)
```
%% Cell type:markdown id: tags:
Here we load the dataframe which contains the data retrieved from the NOMAD Archive using the conditions defined above. The activation of the following cell should be performed only if the query above was skipped.
%% Cell type:code id: tags:
``` python
# df = pd.read_pickle('./data/query_nomad_archive/ternary')
```
%% Cell type:markdown id: tags:
Pandas dataframes include the 'describe' method which gives an overview about the composition of the dataset.
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Cell type:markdown id: tags:
We are particularly interested in materials properties and how they can be inferred from the solely atomic composition of a specific material. In our query, we have retrieved two materials properties, namely the _'Atomic_density'_ and the _'Space_group_number'_. Before performing any machine learning analysis, it is interesting to visualize the distribution of these values using a histogram. Pandas allows for a straighforward visualization of histograms constructed with columns dataframe values.
%% Cell type:code id: tags:
``` python
df.hist(bins=np.arange(200,230), figsize=(16,4), xlabelsize=10, ylabelsize=10, column='Space_group_number')
plt.xticks(range(200,230));
```
%% Cell type:markdown id: tags:
We notice above that retrieved materials can be mainly classified into two distinct space group numbers, i.e. the sapce group number 221 and 227. It would be interesting to see whether such distinction involves that materials belonging in the same space group share similar atomistic properties, while the ones belonging in different space groups are distinct also on an atomistic level. This is the scope of clustering, and it will be the object of an in-depth analysis below.
Now, we keep inspecting the dataframe values and plot a histogram containing the values of the atomic density.
%% Cell type:code id: tags:
``` python
df.hist(bins=50,figsize=(16,4), xlabelsize=10, ylabelsize=10, column='Atomic_density');
```
%% Cell type:markdown id: tags:
The histogram above shows that the atomic density is mainly distributed around a value of 0.07 Angstroms$^{-1}$. Such distribution might seem the result of a random extraction, but we aim to find an AI model that is able to make high-resolution predictions for the atomic density based only on the atomic composition of the material.
%% Cell type:markdown id: tags:
In order to build an AI model that makes reliable predictions, we should make sure that each entry has a different representation. In this case, as we are interested in predicting material properties from the atomic composition, the chemical composition of the material is an ideal representation for the dataframe entries. However, we might have different entries with the same chemical composition, because e.g. simulations were performed for the same material with different settings that were not included among the filters of our query. Each of these simulations might have produced a slightly different value of the resulting atomic density of the material. As data is taken from heterogeneous simulations which were carried out in different laboratories, we do not aim to evaluate all possible parameters for each simulation. Hence, we average the atomic density value over all materials with the same chemical composition.
After averaging (or _grouping_) data is placed in a different dataframe '_df_grouped_', where each entry represents a different compound.
%% Cell type:code id: tags:
``` python
df_grouped=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).mean()
df_grouped=df_grouped.reset_index(level=['Element_A_name','Element_B_name','Space_group_number'])
df_grouped['Replicas']=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).count()['Atomic_density'].values
```
%% Cell type:markdown id: tags:
With data placed in a dataframe, we can carry out our machine learning analysis.
%% Cell type:markdown id: tags:
# Example of unsupervised machine learning: Clustering and dimension reduction
---
%% Cell type:markdown id: tags:
Firstly, we perform an explorative analysis to understand how data is composed and organized. Hence, we use
unsupervised learning to extract from the dataset clusters of materials with similar properties.
We define the list of features that are used for clustering including only the stochiometric ratio and the atomic number. Our aim is to use unsupervised learning for understanding whether the defined descriptors are sufficient for structuring the dataset.
%% Cell type:code id: tags:
``` python
clustering_features = []
clustering_features.append('Fraction_A')
clustering_features.append('Fraction_B')
clustering_features.append('Fraction_O')
clustering_features.append('Atomic_number_A')
clustering_features.append('Atomic_number_B')
clustering_features.append('El_affinity_A')
clustering_features.append('El_affinity_B')
# clustering_features.append('Ionenergy_A')
# clustering_features.append('Ionenergy_B')
# clustering_features.append('Homo_A')
# clustering_features.append('Homo_B')
# clustering_features.append('Lumo_A')
# clustering_features.append('Lumo_B')
# clustering_features.append('Weight_A')
# clustering_features.append('Weight_B')
# clustering_features.append('Atomic_radius_A')
# clustering_features.append('Atomic_radius_B')
df_clustering=preprocessing.scale(df_grouped[clustering_features])
```
%% Cell type:markdown id: tags:
As clustering algorithm we use HDBSCAN, that is described in:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
R.J.G.B. Campello, D. Moulavi, J. Sander: <span style="font-style: italic;">Density-Based Clustering Based on Hierarchical Density Estimates</span>, Springer Berlin Heidelberg, (2013).</div>
The only input parameter that this algorithm requires is the minimum size of each cluster.
To achieve a more accurate cluster definition, HDBSCAN is able to detect all those points that are hardly classified into a specific cluster, which are labeled as 'outliers'.
The implementation of the algorithm that we use is taken from https://pypi.org/project/hdbscan/.
%% Cell type:code id: tags:
``` python
import hdbscan
```
%% Cell type:code id: tags:
``` python
clusterer=hdbscan.HDBSCAN(min_cluster_size=60, min_samples=50)
clusterer.fit(df_clustering)
display(Markdown('The algorithm finds ' + str(clusterer.labels_.max()+1) + ' clusters.'))
```
%% Cell type:code id: tags:
``` python
cluster_labels = clusterer.labels_
df_grouped['Cluster_label']=cluster_labels
```
%% Cell type:markdown id: tags:
To visualize our multidimensional data, we need to project it onto a two-dimensional manifold.
Hence, we use the UMAP embedding algorithm.
Hence, we use the TSNE embedding algorithm.
%% Cell type:code id: tags:
``` python
import umap
from sklearn.manifold import TSNE
```
%% Cell type:code id: tags:
``` python
reducer = umap.UMAP(min_dist=0.5, n_neighbors=100)
embedding = reducer.fit(df_clustering)
embedding = reducer.transform(df_clustering)
embedding = TSNE(n_components=2).fit_transform(df_clustering)
df_grouped['x_emb']=embedding[:,0]
df_grouped['y_emb']=embedding[:,1]
```
%% Cell type:markdown id: tags:
We maintain the Visualizer, a dedicated tool for an interactive visualization of the data retrieved from the Archive.
The Visualizer shows compounds belonging in different clusters with different colors.
Outliers by default are not shown on the map, but outliers visualization can be activated by clicking the 'Outliers' label on the right of the map.
The color of the markers can also represent a specific target property of the material that can be selected from the 'Marker colors' dropdown menu.
Target properties which can be displayed with different colors are inserted as a list of 'color_features'.
After selecting a specific property, a new menu will appear to choose the color scale to be used for visualizing the different values of that target property.
To prevent data overloading, only a fraction of the whole dataset is initially visualized.
This parameter can be adjusted modifing the fraction value on top of the map, up to select all entries in the dataset.
By hovering over the map, the Visualizer shows which compound corresponds to each point, and its number of 'Replicas'.
Replicas represents the number of entries from the original dataset (before grouping) which have the same chemical composition.
It is also possible to select an additional number of features in the 'hover_features' list which are displayed while hovering.
Clicking on any of the points in the map automatically shows the 3D chemical structure of the material in one of the windows below.
Note that at each time the 'Display' button is clicked, a different structure with the same chemical composition is visualized.
A new structure is shown up to the 'Replicas' number.
This allows to inspect all possible structures in the dataset.
Furthermore, the chemical formula of a specific compound can be manually written in the 'Compound' textbox, and clicking the 'Display' button will both show the 3D chemical structure and mark the exact position of the compound on the map with a cross.
The Compound textbox includes autocompletion, which allows to inspect all materials in the dataset inserting partial formulae.
Lastly, the Visualizer offers a number of utils for producing high-quality plots of the map, which are displayed after clicking the button just below the map.
%% Cell type:code id: tags:
``` python
hover_features = []
hover_features.append('Atomic_number_A')
hover_features.append('Atomic_number_B')
hover_features.append('Space_group_number')
hover_features.append('Atomic_density')
hover_features.append('Replicas')
hover_features.append('Cluster_label')
color_features = []
color_features.append('Atomic_density')
color_features.append('Space_group_number')
Visualizer(df, df_grouped, hover_features, color_features).view()
```
%% Cell type:markdown id: tags:
Using the visualizer we can analyse the composition of the different clusters extracted.
We have selected the atomic density and the space group number as color features.
This allows to inspect how these values vary within clusters.
The values of these features show ordered structures within clusters, that is particularly interesting because these features were not used by the clustering algorithm.
This suggests that the atomic features used for clustering are sufficient to describe certain properties of the whole material such as the most stable structure or the atomic density.
Therefore, we might imagine that there is a functional form capable to infer the defined materials properties from our atomic features, and we can train a supervised machine learning model to find such relationship.
The space group number of the elements in each cluster is shown below firstly as a list of values, then using a pie chart.
We can clearly notice that in each cluster there is one space group number that is predominant respect to all the others.
This means that we can label the different clusters with a characteristic space group number.
%% Cell type:code id: tags:
``` python
df_count_groups=df_grouped.loc[df_grouped['Cluster_label']!=-1].groupby(['Cluster_label','Space_group_number']).describe()['Atomic_density'][['count']]
display(Markdown(print(df_count_groups)))
df_count_groups=df_count_groups.reset_index();
n_clusters = df_grouped['Cluster_label'].max() +1
```
%% Cell type:code id: tags:
``` python
for cl in range (n_clusters):
df_cluster=df_count_groups.loc[df_count_groups['Cluster_label']==cl]
fig = px.pie(df_cluster, values='count', names='Space_group_number')
fig.show()
```
%% Cell type:markdown id: tags:
The pie chart below includes the space group number of all elements classified as outliers. We can clearly see that the outliers include all different space group numbers, and it is not possible to identify a predominant space group number characteristic of the group of outliers.
%% Cell type:code id: tags:
``` python
df_cluster=df_grouped.loc[df_grouped['Cluster_label']==-1].groupby(['Space_group_number']).describe()['Atomic_density'][['count']].reset_index()
plt.figure(figsize=(25,25))
fig = px.pie(df_cluster, values='count', names='Space_group_number')
fig.show()
```
%% Cell type:markdown id: tags:
# Example of supervised machine learning: Random forest
---
%% Cell type:markdown id: tags:
Finally, we aim to use the large data set of materials that we have retrieved from the NOMAD Archive to train an AI model.
Previous findings obtained with the unsupervised analysis suggest that it is possible to train a model to predict materials properties using only atomic features.
The trained model can then be used to predict properties of yet unknown materials from the knowledge of its constituent atoms.
In this specific case, we aim to predict the average atomic density.
We use the Random forest method. Random forest is available from the scikit-learn package.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestRegressor
```
%% Cell type:markdown id: tags:
We select all atomic properties as primary features and the atomic density of the material as target feature.
%% Cell type:code id: tags:
``` python
ML_primary_features = []
ML_primary_features.append('Fraction_A')
ML_primary_features.append('Fraction_B')
ML_primary_features.append('Fraction_O')
ML_primary_features.append('Atomic_number_A')
ML_primary_features.append('Atomic_number_B')
ML_primary_features.append('El_affinity_A')
ML_primary_features.append('El_affinity_B')
ML_primary_features.append('Ionenergy_A')
ML_primary_features.append('Ionenergy_B')
ML_primary_features.append('Atomic_radius_A')
ML_primary_features.append('Atomic_radius_B')
# ML_primary_features.append('Homo_A')
# ML_primary_features.append('Homo_B')
# ML_primary_features.append('Lumo_A')
# ML_primary_features.append('Lumo_B')
# ML_primary_features.append('Weight_A')
# ML_primary_features.append('Weight_B')
ML_target_features = []
ML_target_features.append('Atomic_density')
```
%% Cell type:markdown id: tags:
Our dataset is divided into a train set and a test set. The model is trained only with the train set, while ignoring the values in the test set. This allows to test the prediction capability of the model on data that have not been seen.
%% Cell type:code id: tags:
``` python
X_train, X_test, y_train, y_test = train_test_split (df[ML_primary_features], df[ML_target_features], test_size=0.2, )
```
%% Cell type:markdown id: tags:
We train here the model.
%% Cell type:code id: tags:
``` python
random_regressor = RandomForestRegressor(
n_estimators= 100,
max_depth = 100,
max_features = 5,
min_samples_split = 5,
random_state=0
)
random_regressor.fit(X_train, y_train.to_numpy().ravel())
```
%% Cell type:markdown id: tags:
After training, we check the accuracy of the model.
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_test)
display(Markdown(r'The Ai model predicts the atomic density on a test set with an average error of '+
str(int(10000*np.mean(np.abs(y_predict-y_test.to_numpy().flatten())))/10000) +
' Angstroms$^{-1}$.' ))
```
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_train)
display(Markdown(r'The Ai model predicts the atomic density on a training set with an average error of '+
str(int(10000*np.mean(np.abs(y_predict-y_train.to_numpy().flatten())))/10000) +
' Angstroms$^{-1}$.' ))
```
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_test)
X_test = X_test.assign(Atomic_density=y_predict)
df_A_pred = X_test[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
df_B_pred = X_test[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
df_AB_pred = pd.concat([df_A_pred,df_B_pred], ignore_index=True)
df_A = df[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
df_B = df[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
df_AB = pd.concat([df_A,df_B], ignore_index=True)
```
%% Cell type:code id: tags:
``` python
df_AB['Atomic_number']=df_AB['Atomic_number'].astype('int')
```
%% Cell type:code id: tags:
``` python
xaxis = df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,0]
yaxis = df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,1]
```
%% Cell type:markdown id: tags:
In the following plot, we see the average atomic density of ternary elements composed of Oxygen and another element, whose atomic number is given by the x-axis. Therefore, all values are averaged over the third element.
Plots show the predictions of the trained model on the test set, and on the kwnon values of the training set that are taken as reference values. Each point shows also the standard deviation. We emphasize that, considering that each value on the plot is given by an average over all elements in the periodic table, the standard deviation cannot go to zero by construction, even in the limit of taking all possible combinations. We then aim that averages and standard deviations predicted by our model are comparable to the ones of the reference model.
%% Cell type:code id: tags:
``` python
plt.style.use('ggplot')
plt.figure(figsize=(15,10))
x=df_AB_pred.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
y=df_AB_pred.groupby(['Atomic_number']).mean().to_numpy().flatten()
std=df_AB_pred.groupby(['Atomic_number']).std().to_numpy().flatten()
plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='AI prediction' )
x=df_AB.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
y=df_AB.groupby(['Atomic_number']).mean().to_numpy().flatten()
std=df_AB.groupby(['Atomic_number']).std().to_numpy().flatten()
plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='Reference value' );
plt.legend(loc='upper right', fontsize='x-large')
plt.xlabel('Atomic number', fontsize='x-large')
plt.ylabel(r'Atomic density [$\AA^{-1}$]', fontsize='x-large')
plt.xticks([3,11,19,37,55,89],fontsize='x-large');
plt.yticks(fontsize='x-large');
```
%% Cell type:markdown id: tags:
In the plot above, we can see that values predicted by the AI model are comparable to the reference values. In addition, we observe that the atomic density follows periodic trends, as values tend to be lower in the beginning and in the end of each row of the periodic table.
......
%% Cell type:markdown id: tags:
<img src="assets/query_nomad_archive/header.jpg" width="900">
%% Cell type:markdown id: tags:
<img style="float: left;" src="assets/query_nomad_archive/logo_MPG.png" width=150>
<img style="float: left; margin-top: -10px" src="assets/query_nomad_archive/logo_NOMAD.png" width=250>
<img style="float: left; margin-top: -5px" src="assets/query_nomad_archive/logo_HU.png" width=130>
%% Cell type:markdown id: tags:
In this tutorial, we show how to query the NOMAD Archive (https://www.nomad-coe.eu/index.php?page=nomad-repository) and perform data analysis on the retrieved data.
%% Cell type:markdown id: tags:
## Preliminary operations
We load the following packages that are all available in the virtual environment containing the Jupyter notebooks in the NOMAD AI toolkit. Among the loaded packages, we highlight ``sklearn``, i.e., scikit-learn, one of the most popular machine-learning packages, and ``pandas``, a popular tool for data handling and analysis.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import math
import plotly.express as px
from tqdm import tqdm
from sklearn import preprocessing, tree
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from query_nomad_archive.visualizer import Visualizer
from sklearn import decomposition
from IPython.display import display, Markdown
import re
```
%% Cell type:code id: tags:
``` python
from nomad import client, config
```
%% Cell type:code id: tags:
``` python
from nomad.client.archive import ArchiveQuery
from nomad.metainfo import units
```
%% Cell type:code id: tags:
``` python
import nest_asyncio
nest_asyncio.apply()
```
%% Cell type:markdown id: tags:
We maintain a ``nomad`` package that can be imported in all notebooks of the AI Toolkit.
%% Cell type:markdown id: tags:
The ``nomad`` package allows to retrieve data from the NOMAD Archive by means of a script, as shown below. In this script, we insert metadata characterizing the materials that we aim to retrieve. In this case, we select ternary materials containing oxygen. We also request that simulations were carried out using the VASP code using GGA exchange-correlation (xc) functionals. Values are retrieved from the simulation run that found geometrically convergence wihin a desird threshold value of.
%% Cell type:code id: tags:
``` python
from atomicfeaturespackage.atomicproperties import atomic_properties_dft as ap
from atomicfeaturespackage.atomicproperties import atomic_properties_pymat as pymat
from atomicfeaturespackage.atomicproperties import periodictable
```
%% Cell type:markdown id: tags:
The modules ``atomic_properties_dft`` and ``atomic_properties_pymat`` make available atomic properites of the elements of the periodic table, to be used as features in data analytics. The sources of these atomic proporties are DFT calculations performed by the NOMAD team and <a href="https://pymatgen.org/" target="_blank">pymatgen</a>, respectively.
%% Cell type:markdown id: tags:
We have developed a visualization tool that allows us to visualize atomic properites of all elements accross periodic table as a heatmap. Currently, this tool is able to visualize atomic properties acessible from ``atomic_properties_dft`` module. Below is an example for the data calculated via the HSE06 functional and spinless settings.
This module can be used as follows. From the dropdown menu, one can select which property one is interested to check and the table is updated automatically to show the corresponding heatmap.
%% Cell type:code id: tags:
``` python
periodictable.heatmap(Spin = 'False', method = 'pbe')
```
%% Cell type:markdown id: tags:
## Querying the NOMAD Archive
%% Cell type:code id: tags:
``` python
max_entries=9000
query={
'results.method.simulation.program_name': 'VASP',
'results.material.elements': ['O'],
"results.material.symmetry.crystal_system:any": [
"cubic"
],
'results.material.n_elements': {
"lte": 3,
"gte": 3
},
"results.material.symmetry.crystal_system:any": [
"cubic"
],
"results.method.simulation.dft.xc_functional_type:any": [
"GGA"
],
"results.properties.geometry_optimization": {
"final_energy_difference": {
"lte": 2e-22,
"gte": 1e-30
}
},
}
required={
'results':{
'material':{
'elements': '*',
'symmetry': {
'space_group_number': '*'
}
}
},
'workflow': {
'geometry_optimization': {
'final_energy_difference': '*',
},
'calculation_result_ref': {
'system_ref': {
'chemical_composition_reduced': '*',
'chemical_composition': '*',
'atoms': {
'labels': '*',
'species': '*',
'lattice_vectors': '*',
'positions': '*',
}
}
}
}
}
query = ArchiveQuery(query=query, required=required, page_size=100, results_max=max_entries)
query.fetch()
```
%% Cell type:markdown id: tags:
We have defined the variable ``query``, which allows to perform our query.
All quantities defined in the ``required`` field are fetched during the query. For example, we can see quantities as ``chemical_composition`` which gives the composition of the material or ``atoms.positions`` that contains the positon of all atoms after geometric convergence.
We notice that the variable ``query`` contains a number of other variables: the ``max_entries`` value sets the maximum number of entries that can be retrieved; the ``per_page`` value indicates the number of entries fetched at each API call.
%% Cell type:markdown id: tags:
In this tutorial, we use machine learning tools to investigate properties of materials. In particular, we aim to predict the atomic density of the materials as a function of some primary features such as the atomic number.
The atomic density of the material is derived from the volume of the simulation cell, whose dimensions are inserted in meters. Thus, we define a scale factor to convert dimensions into angstroms for a higher numerical stability during the machine learning analysis.
%% Cell type:code id: tags:
``` python
scale_factor = 10**10
```
%% Cell type:markdown id: tags:
To retrieve data and place it within a framework, we use a 'for' loop that iteratively fetches all entries up to the maximum value, which is given by 'max_entries'. Taking into account that some links in the query might be broken, the resulting 'IndexError' exception is handled within the 'for' loop, that skips over the broken entry. In addition, we also make sure the entry contains the simulation cell value which we are interested in, and that all elements in the material have an admissible atomic number.
%% Cell type:markdown id: tags:
In the next cell, we download data from the NOMAD Archive. **As we query a large number of elements, this operation can be time consuming. Hence, we have cached the results of the following query, and data can be loaded with a command given in the subsequent cell.**
%% Cell type:code id: tags:
``` python
results=query.download()
path_structure = './data/query_nomad_archive/structures/'
try:
os.mkdir(path_structure)
except OSError:
!rm -r "$path_structure"
os.mkdir(path_structure)
# We define a 'Pandas' dataframe that contains all fetched data.
df = pd.DataFrame()
for entry in range(max_entries):
try:
calc = results[entry].workflow[0].calculation_result_ref
convergence_energy_diff = results[entry].workflow[0].geometry_optimization.final_energy_difference.m
formula_red = calc.system_ref.chemical_composition_reduced
space_group = results[entry].results.material.symmetry.space_group_number
elements = np.sort(calc.system_ref.atoms.species)
labels = calc.system_ref.atoms.labels
# Dimensions of the cell are rescaled to angstroms.
lat_x, lat_y, lat_z = calc.system_ref.atoms.lattice_vectors.m * scale_factor
except IndexError:
continue
if (min(elements)<1 or max(elements)>118):
continue
# The total number in the array 'elements' gives the total number of atoms.
n_atoms = len (elements)
# Structures of materials are stored for being viewed using the Visualizer.
file = open(path_structure + str(entry) +".xyz","w")
file.write("%d\n\n"%(n_atoms*8))
for i in [0,1]:
for j in [0,1]:
for k in [0,1]:
for n in range (n_atoms):
el = calc.system_ref.atoms.labels[n]
xyz = calc.system_ref.atoms.positions[n].m * scale_factor
xyz += i*lat_x
xyz += j*lat_y
xyz += k*lat_z
file.write (el)
file.write ("\t%f\t%f\t%f\n"%(xyz[0],xyz[1],xyz[2]))
file.close()
# The volume of the cell is obtained as scalar triple product of the three base vectors.
# The triple scalar product is obtained as determinant of the matrix composed with the three vectors.
cell_volume = np.linalg.det ([lat_x,lat_y,lat_z])
# The atomic density is given by the number of atoms in a unit cell.
density = n_atoms / cell_volume
# The ternary materials are composed by Oxygen and two other elements labeled as A,B.
# Variables 'Z_A','Z_B' contain the atomic number of the elements A,B.
Z_A = int(np.delete(np.unique(elements), np.where(np.unique(elements)==8))[0])
Z_B = int(np.delete(np.unique(elements), np.where(np.unique(elements)==8))[1])
lab_A = np.delete(np.unique(labels), np.where(np.unique(labels)== 'O'))[0]
lab_B = re.sub("\d+", "", np.delete(np.unique(labels), np.where(np.unique(labels)== 'O'))[1])
# We instantiate the atomic_properties_dft(imported as ap here) classes with the atomic symbol by first specifying
# DFT spin setting employed and fucntional for evaluation of atomic properties.
# Functional for which this properties accessible are 'hse06', 'pbe','pbe0','pbesol','pw-lda','revpbe'
# Spin setting could be either 'True' or 'False'
ap.method(method = 'pbe', Spin = 'False')
# We instantiate the atomic_properties_dft(imported as ap here) classes with the atomic symbol.
# Similarly we instantiate the atomic_properties_pymat(imported as pymat here) classes with the atomic symbol
# These classes allow to retrieve atoms properties, in this example we fetch the element ionization energy,atomic radius
# from atomic_properties_dft module and element name,element weight from atomic_properties_pymat module
try:
A = ap.symbol(lab_A)
B = ap.symbol(lab_B)
except KeyError:
continue
# The fraction of atoms of a specific element within the material, that is also given by the stochiometric ratio.
fraction_O = np.sum(np.where (elements==8,1,0)) / len(elements)
fraction_A = np.sum(np.where (elements==A.atomic_number,1,0)) / len(elements)
fraction_B = np.sum(np.where (elements==B.atomic_number,1,0)) / len(elements)
# At each iteration, we add to the datafram one row that contains the A,B elements in the material and a number of other material properties.
df=df.append({
'Convergence_energy_diff': convergence_energy_diff,
'Element_A_name': pymat.symbol(lab_A).atomic_element_name,
'Element_B_name': pymat.symbol(lab_B).atomic_element_name,
'Atomic_number_A': Z_A,
'Atomic_number_B': Z_B,
'Fraction_A':fraction_A,
'Fraction_B':fraction_B,
'Fraction_O':fraction_O,
'Atomic_radius_A':A.atomic_r_val[0]*100 , #distance values are in angstroms so we convert these to pm in this eg
'Atomic_radius_B':B.atomic_r_val[0]*100 , #distance values are in angstroms so we convert these to pm in this eg
'Ionenergy_A':A.atomic_ip*6.242e+18, # energy value obtained is in joules so we convert to eV
'Ionenergy_B':B.atomic_ip*6.242e+18, # energy value obtained is in joules so we convert to eV
'El_affinity_A':A.atomic_ea*6.242e+18, # energy value obtained is in joules so we convert to eV
'El_affinity_B':B.atomic_ea*6.242e+18, # energy value obtained is in joules so we convert to eV
'Homo_A':A.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Homo_B':B.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Lumo_A':A.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Lumo_B':B.atomic_hfomo*6.242e+18, # energy value obtained is in joules so we convert to eV
'Weight_A':pymat.symbol(lab_A).atomic_mass,
'Weight_B':pymat.symbol(lab_B).atomic_mass,
'Space_group_number':int(space_group),
'Atomic_density':density,
'Formula':formula_red,
'File-id':int(entry),
},ignore_index=True)
```
%% Cell type:markdown id: tags:
Here we load the dataframe which contains the data retrieved from the NOMAD Archive using the conditions defined above. The activation of the following cell should be performed only if the query above was skipped.
%% Cell type:code id: tags:
``` python
# df = pd.read_pickle('./data/query_nomad_archive/ternary')
```
%% Cell type:markdown id: tags:
Pandas dataframes include the 'describe' method which gives an overview about the composition of the dataset.
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Cell type:markdown id: tags:
We are particularly interested in materials properties and how they can be inferred from the solely atomic composition of a specific material. In our query, we have retrieved two materials properties, namely the _'Atomic_density'_ and the _'Space_group_number'_. Before performing any machine learning analysis, it is interesting to visualize the distribution of these values using a histogram. Pandas allows for a straighforward visualization of histograms constructed with columns dataframe values.
%% Cell type:code id: tags:
``` python
df.hist(bins=np.arange(200,230), figsize=(16,4), xlabelsize=10, ylabelsize=10, column='Space_group_number')
plt.xticks(range(200,230));
```
%% Cell type:markdown id: tags:
We notice above that retrieved materials can be mainly classified into two distinct space group numbers, i.e. the sapce group number 221 and 227. It would be interesting to see whether such distinction involves that materials belonging in the same space group share similar atomistic properties, while the ones belonging in different space groups are distinct also on an atomistic level. This is the scope of clustering, and it will be the object of an in-depth analysis below.
Now, we keep inspecting the dataframe values and plot a histogram containing the values of the atomic density.
%% Cell type:code id: tags:
``` python
df.hist(bins=50,figsize=(16,4), xlabelsize=10, ylabelsize=10, column='Atomic_density');
```
%% Cell type:markdown id: tags:
The histogram above shows that the atomic density is mainly distributed around a value of 0.07 Angstroms$^{-1}$. Such distribution might seem the result of a random extraction, but we aim to find an AI model that is able to make high-resolution predictions for the atomic density based only on the atomic composition of the material.
%% Cell type:markdown id: tags:
In order to build an AI model that makes reliable predictions, we should make sure that each entry has a different representation. In this case, as we are interested in predicting material properties from the atomic composition, the chemical composition of the material is an ideal representation for the dataframe entries. However, we might have different entries with the same chemical composition, because e.g. simulations were performed for the same material with different settings that were not included among the filters of our query. Each of these simulations might have produced a slightly different value of the resulting atomic density of the material. As data is taken from heterogeneous simulations which were carried out in different laboratories, we do not aim to evaluate all possible parameters for each simulation. Hence, we average the atomic density value over all materials with the same chemical composition.
After averaging (or _grouping_) data is placed in a different dataframe '_df_grouped_', where each entry represents a different compound.
%% Cell type:code id: tags:
``` python
df_grouped=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).mean()
df_grouped=df_grouped.reset_index(level=['Element_A_name','Element_B_name','Space_group_number'])
df_grouped['Replicas']=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).count()['Atomic_density'].values
```
%% Cell type:markdown id: tags:
With data placed in a dataframe, we can carry out our machine learning analysis.
%% Cell type:markdown id: tags:
# Example of unsupervised machine learning: Clustering and dimension reduction
---
%% Cell type:markdown id: tags:
Firstly, we perform an explorative analysis to understand how data is composed and organized. Hence, we use
unsupervised learning to extract from the dataset clusters of materials with similar properties.
We define the list of features that are used for clustering including only the stochiometric ratio and the atomic number. Our aim is to use unsupervised learning for understanding whether the defined descriptors are sufficient for structuring the dataset.
%% Cell type:code id: tags:
``` python
clustering_features = []
clustering_features.append('Fraction_A')
clustering_features.append('Fraction_B')
clustering_features.append('Fraction_O')
clustering_features.append('Atomic_number_A')
clustering_features.append('Atomic_number_B')
clustering_features.append('El_affinity_A')
clustering_features.append('El_affinity_B')
# clustering_features.append('Ionenergy_A')
# clustering_features.append('Ionenergy_B')
# clustering_features.append('Homo_A')
# clustering_features.append('Homo_B')
# clustering_features.append('Lumo_A')
# clustering_features.append('Lumo_B')
# clustering_features.append('Weight_A')
# clustering_features.append('Weight_B')
# clustering_features.append('Atomic_radius_A')
# clustering_features.append('Atomic_radius_B')
df_clustering=preprocessing.scale(df_grouped[clustering_features])
```
%% Cell type:markdown id: tags:
As clustering algorithm we use HDBSCAN, that is described in:
<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
R.J.G.B. Campello, D. Moulavi, J. Sander: <span style="font-style: italic;">Density-Based Clustering Based on Hierarchical Density Estimates</span>, Springer Berlin Heidelberg, (2013).</div>
The only input parameter that this algorithm requires is the minimum size of each cluster.
To achieve a more accurate cluster definition, HDBSCAN is able to detect all those points that are hardly classified into a specific cluster, which are labeled as 'outliers'.
The implementation of the algorithm that we use is taken from https://pypi.org/project/hdbscan/.
%% Cell type:code id: tags:
``` python
import hdbscan
```
%% Cell type:code id: tags:
``` python
clusterer=hdbscan.HDBSCAN(min_cluster_size=60, min_samples=50)
clusterer.fit(df_clustering)
display(Markdown('The algorithm finds ' + str(clusterer.labels_.max()+1) + ' clusters.'))
```
%% Cell type:code id: tags:
``` python
cluster_labels = clusterer.labels_
df_grouped['Cluster_label']=cluster_labels
```
%% Cell type:markdown id: tags:
To visualize our multidimensional data, we need to project it onto a two-dimensional manifold.
Hence, we use the UMAP embedding algorithm.
Hence, we use the TSNE embedding algorithm.
%% Cell type:code id: tags:
``` python
import umap
from sklearn.manifold import TSNE
```
%% Cell type:code id: tags:
``` python
reducer = umap.UMAP(min_dist=0.5, n_neighbors=100)
embedding = reducer.fit(df_clustering)
embedding = reducer.transform(df_clustering)
embedding = TSNE(n_components=2).fit_transform(df_clustering)
df_grouped['x_emb']=embedding[:,0]
df_grouped['y_emb']=embedding[:,1]
```
%% Cell type:markdown id: tags:
We maintain the Visualizer, a dedicated tool for an interactive visualization of the data retrieved from the Archive.
The Visualizer shows compounds belonging in different clusters with different colors.
Outliers by default are not shown on the map, but outliers visualization can be activated by clicking the 'Outliers' label on the right of the map.
The color of the markers can also represent a specific target property of the material that can be selected from the 'Marker colors' dropdown menu.
Target properties which can be displayed with different colors are inserted as a list of 'color_features'.
After selecting a specific property, a new menu will appear to choose the color scale to be used for visualizing the different values of that target property.
To prevent data overloading, only a fraction of the whole dataset is initially visualized.
This parameter can be adjusted modifing the fraction value on top of the map, up to select all entries in the dataset.
By hovering over the map, the Visualizer shows which compound corresponds to each point, and its number of 'Replicas'.
Replicas represents the number of entries from the original dataset (before grouping) which have the same chemical composition.
It is also possible to select an additional number of features in the 'hover_features' list which are displayed while hovering.
Clicking on any of the points in the map automatically shows the 3D chemical structure of the material in one of the windows below.
Note that at each time the 'Display' button is clicked, a different structure with the same chemical composition is visualized.
A new structure is shown up to the 'Replicas' number.
This allows to inspect all possible structures in the dataset.
Furthermore, the chemical formula of a specific compound can be manually written in the 'Compound' textbox, and clicking the 'Display' button will both show the 3D chemical structure and mark the exact position of the compound on the map with a cross.
The Compound textbox includes autocompletion, which allows to inspect all materials in the dataset inserting partial formulae.
Lastly, the Visualizer offers a number of utils for producing high-quality plots of the map, which are displayed after clicking the button just below the map.
%% Cell type:code id: tags:
``` python
hover_features = []
hover_features.append('Atomic_number_A')
hover_features.append('Atomic_number_B')
hover_features.append('Space_group_number')
hover_features.append('Atomic_density')
hover_features.append('Replicas')
hover_features.append('Cluster_label')
color_features = []
color_features.append('Atomic_density')
color_features.append('Space_group_number')
Visualizer(df, df_grouped, hover_features, color_features).view()
```
%% Cell type:markdown id: tags:
Using the visualizer we can analyse the composition of the different clusters extracted.
We have selected the atomic density and the space group number as color features.
This allows to inspect how these values vary within clusters.
The values of these features show ordered structures within clusters, that is particularly interesting because these features were not used by the clustering algorithm.
This suggests that the atomic features used for clustering are sufficient to describe certain properties of the whole material such as the most stable structure or the atomic density.
Therefore, we might imagine that there is a functional form capable to infer the defined materials properties from our atomic features, and we can train a supervised machine learning model to find such relationship.
The space group number of the elements in each cluster is shown below firstly as a list of values, then using a pie chart.
We can clearly notice that in each cluster there is one space group number that is predominant respect to all the others.
This means that we can label the different clusters with a characteristic space group number.
%% Cell type:code id: tags:
``` python
df_count_groups=df_grouped.loc[df_grouped['Cluster_label']!=-1].groupby(['Cluster_label','Space_group_number']).describe()['Atomic_density'][['count']]
display(Markdown(print(df_count_groups)))
df_count_groups=df_count_groups.reset_index();
n_clusters = df_grouped['Cluster_label'].max() +1
```
%% Cell type:code id: tags:
``` python
for cl in range (n_clusters):
df_cluster=df_count_groups.loc[df_count_groups['Cluster_label']==cl]
fig = px.pie(df_cluster, values='count', names='Space_group_number')
fig.show()
```
%% Cell type:markdown id: tags:
The pie chart below includes the space group number of all elements classified as outliers. We can clearly see that the outliers include all different space group numbers, and it is not possible to identify a predominant space group number characteristic of the group of outliers.
%% Cell type:code id: tags:
``` python
df_cluster=df_grouped.loc[df_grouped['Cluster_label']==-1].groupby(['Space_group_number']).describe()['Atomic_density'][['count']].reset_index()
plt.figure(figsize=(25,25))
fig = px.pie(df_cluster, values='count', names='Space_group_number')
fig.show()
```
%% Cell type:markdown id: tags:
# Example of supervised machine learning: Random forest
---
%% Cell type:markdown id: tags:
Finally, we aim to use the large data set of materials that we have retrieved from the NOMAD Archive to train an AI model.
Previous findings obtained with the unsupervised analysis suggest that it is possible to train a model to predict materials properties using only atomic features.
The trained model can then be used to predict properties of yet unknown materials from the knowledge of its constituent atoms.
In this specific case, we aim to predict the average atomic density.
We use the Random forest method. Random forest is available from the scikit-learn package.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestRegressor
```
%% Cell type:markdown id: tags:
We select all atomic properties as primary features and the atomic density of the material as target feature.
%% Cell type:code id: tags:
``` python
ML_primary_features = []
ML_primary_features.append('Fraction_A')
ML_primary_features.append('Fraction_B')
ML_primary_features.append('Fraction_O')
ML_primary_features.append('Atomic_number_A')
ML_primary_features.append('Atomic_number_B')
ML_primary_features.append('El_affinity_A')
ML_primary_features.append('El_affinity_B')
ML_primary_features.append('Ionenergy_A')
ML_primary_features.append('Ionenergy_B')
ML_primary_features.append('Atomic_radius_A')
ML_primary_features.append('Atomic_radius_B')
# ML_primary_features.append('Homo_A')
# ML_primary_features.append('Homo_B')
# ML_primary_features.append('Lumo_A')
# ML_primary_features.append('Lumo_B')
# ML_primary_features.append('Weight_A')
# ML_primary_features.append('Weight_B')
ML_target_features = []
ML_target_features.append('Atomic_density')
```
%% Cell type:markdown id: tags:
Our dataset is divided into a train set and a test set. The model is trained only with the train set, while ignoring the values in the test set. This allows to test the prediction capability of the model on data that have not been seen.
%% Cell type:code id: tags:
``` python
X_train, X_test, y_train, y_test = train_test_split (df[ML_primary_features], df[ML_target_features], test_size=0.2, )
```
%% Cell type:markdown id: tags:
We train here the model.
%% Cell type:code id: tags:
``` python
random_regressor = RandomForestRegressor(
n_estimators= 100,
max_depth = 100,
max_features = 5,
min_samples_split = 5,
random_state=0
)
random_regressor.fit(X_train, y_train.to_numpy().ravel())
```
%% Cell type:markdown id: tags:
After training, we check the accuracy of the model.
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_test)
display(Markdown(r'The Ai model predicts the atomic density on a test set with an average error of '+
str(int(10000*np.mean(np.abs(y_predict-y_test.to_numpy().flatten())))/10000) +
' Angstroms$^{-1}$.' ))
```
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_train)
display(Markdown(r'The Ai model predicts the atomic density on a training set with an average error of '+
str(int(10000*np.mean(np.abs(y_predict-y_train.to_numpy().flatten())))/10000) +
' Angstroms$^{-1}$.' ))
```
%% Cell type:code id: tags:
``` python
y_predict= random_regressor.predict(X_test)
X_test = X_test.assign(Atomic_density=y_predict)
df_A_pred = X_test[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
df_B_pred = X_test[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
df_AB_pred = pd.concat([df_A_pred,df_B_pred], ignore_index=True)
df_A = df[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
df_B = df[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
df_AB = pd.concat([df_A,df_B], ignore_index=True)
```
%% Cell type:code id: tags:
``` python
df_AB['Atomic_number']=df_AB['Atomic_number'].astype('int')
```
%% Cell type:code id: tags:
``` python
xaxis = df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,0]
yaxis = df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,1]
```
%% Cell type:markdown id: tags:
In the following plot, we see the average atomic density of ternary elements composed of Oxygen and another element, whose atomic number is given by the x-axis. Therefore, all values are averaged over the third element.
Plots show the predictions of the trained model on the test set, and on the kwnon values of the training set that are taken as reference values. Each point shows also the standard deviation. We emphasize that, considering that each value on the plot is given by an average over all elements in the periodic table, the standard deviation cannot go to zero by construction, even in the limit of taking all possible combinations. We then aim that averages and standard deviations predicted by our model are comparable to the ones of the reference model.
%% Cell type:code id: tags:
``` python
plt.style.use('ggplot')
plt.figure(figsize=(15,10))
x=df_AB_pred.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
y=df_AB_pred.groupby(['Atomic_number']).mean().to_numpy().flatten()
std=df_AB_pred.groupby(['Atomic_number']).std().to_numpy().flatten()
plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='AI prediction' )
x=df_AB.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
y=df_AB.groupby(['Atomic_number']).mean().to_numpy().flatten()
std=df_AB.groupby(['Atomic_number']).std().to_numpy().flatten()
plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='Reference value' );
plt.legend(loc='upper right', fontsize='x-large')
plt.xlabel('Atomic number', fontsize='x-large')
plt.ylabel(r'Atomic density [$\AA^{-1}$]', fontsize='x-large')
plt.xticks([3,11,19,37,55,89],fontsize='x-large');
plt.yticks(fontsize='x-large');
```
%% Cell type:markdown id: tags:
In the plot above, we can see that values predicted by the AI model are comparable to the reference values. In addition, we observe that the atomic density follows periodic trends, as values tend to be lower in the beginning and in the end of each row of the periodic table.
......