Aakash Ashok Naik
--- a/query_nomad_archive.ipynb 0 → 100644

+ 787

− 0
+++ b/query_nomad_archive.ipynb 0 → 100644

+ 787

− 0
+%% Cell type:markdown id: tags:
+
+<div id="teaser" style=' background-position:  right center; background-size: 00px; background-repeat: no-repeat;
+    padding-top: 20px;
+    padding-right: 10px;
+    padding-bottom: 170px;
+    padding-left: 10px;
+    border-bottom: 14px double #333;
+    border-top: 14px double #333;' >
+
+
+   <div style="text-align:center">
+    <b><font size="6.4">Querying the Archive and performing Artificial Intelligence modeling</font></b>
+  </div>
+
+<p>
+ created by:
+ Luigi Sbailò,<sup>1</sup>
+ Matthias Scheffler,<sup>1</sup>
+ and Luca Ghiringhelli<sup>1</sup> <br><br>
+
+<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br><p>
+
+ghiringhelli@fhi-berlin.mpg.de,
+sbailo@fhi-berlin.mpg.de
+<br><br>
+<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: Jan 5, 2021]</span>
+
+<div>
+<img  style="float: left;" src="assets/query_nomad_archive/Logo_MPG.png" width="200">
+<img  style="float: right;" src="assets/query_nomad_archive/Logo_NOMAD.png" width="250">
+</div>
+</div>
+
+%% Cell type:markdown id: tags:
+
+In this tutorial, we show how to query the NOMAD Archive (https://www.nomad-coe.eu/index.php?page=nomad-repository) and perform data analysis on the retrieved data.
+
+%% Cell type:markdown id: tags:
+
+We load the following packages that are all available in the virtual environment containing the Jupyter notebooks in the NOMAD AI toolkit. Among the various packages we notice Sklearn, one of the most popular machine learning packages, and pandas, another popular tool for data analysis.
+
+%% Cell type:code id: tags:
+
+``` python
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import os
+import math
+import plotly.express as px
+from tqdm import tqdm
+# from mendeleev import element
+from query_nomad_archive import atomic_properties_dft as ap
+from query_nomad_archive import atomic_properties_pymat as pymat
+from sklearn import preprocessing, tree
+from sklearn.manifold import TSNE
+from sklearn.model_selection import train_test_split
+from query_nomad_archive.visualizer import Visualizer
+from sklearn import decomposition
+from IPython.display import display, Markdown
+import re
+```
+
+%% Cell type:code id: tags:
+
+``` python
+from nomad import client, config
+```
+
+%% Cell type:code id: tags:
+
+``` python
+from nomad.client import ArchiveQuery
+from nomad.metainfo import units
+```
+
+%% Cell type:code id: tags:
+
+``` python
+from mendeleev import element
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ap.method(method = 'HSE06', Spin = 'False')
+```
+
+%% Cell type:markdown id: tags:
+
+We maintain a "nomad" package that can be imported in all notebooks of the AI Toolkit.
+
+%% Cell type:code id: tags:
+
+``` python
+# Z_A = ap.symbol("K")
+
+# ap.K.atomic_ip*6.242e+18
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Z_B = element("K")
+# Z_B.covalent_radius
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# pymat.symbol("K").atomic_mass
+```
+
+%% Cell type:markdown id: tags:
+
+Nomad package allows to retrieve data from the NOMAD Archive with means of a script, as shown below. In this script we insert metadata characterizing the materials that we aim to retrieve. In this case, we select ternary materials containing Oxygen. We also request that simulations were carried out using the VASP code using GGA exchange-correlation (xc) functionals. Values are retrieved from the simulation run that found geometrically convergence wihin a threshold value of 1e-20.
+
+%% Cell type:code id: tags:
+
+``` python
+max_entries = 23000
+
+query = ArchiveQuery(
+    query={
+        '$and': [
+            {'dft.code_name': 'VASP'},
+            {'dft.crystal_system':'cubic'},
+            {'atoms': ['O']},
+            {'dft.xc_functional':"GGA"},
+            {'dft.compound_type': 'ternary'},
+            {'$lte': {'dft.workflow.section_geometry_optimization.final_energy_difference': 1e-20}},
+
+        ]
+    },
+    required={
+        'section_workflow': {
+            'calculation_result_ref': {
+                'single_configuration_calculation_to_system_ref': {
+                    'chemical_composition_reduced': '*',
+                    'chemical_composition': '*',
+                    'section_symmetry': '*',
+                    'simulation_cell': '*',
+                    'lattice_vectors': '*',
+                    'atom_species': '*',
+                    'atom_labels': '*',
+                    'atom_positions': '*',
+                }
+            }
+        },
+    },
+    per_page=1000,
+    parallel=10,
+    max=max_entries)
+```
+
+%% Cell type:markdown id: tags:
+
+We have defined the variable 'query', which allows to perform our query.
+The required condition ensures that all quantities in the simulation run that we are interested in are fetched during the query. For example, we can see quantities as 'chemical_composition' which gives the composition of the material or 'atom_positions' that contains the positon of all atoms after geometric convergence.
+We notice that the variable 'query' contains a number of other variables: the 'max' value sets the maximum number of entries that can be retrieved; the 'per_page' value indicates the number of entries fetched at each API call; the 'parallel' value gives the number of parallel calls that are performed at each iteration.
+
+Printing the variable shows the number of all entries that are accessible using this variable. Please note that the maximum number of retrievable materials is given by the value of 'max_entries' inserted in the query, even if the printed value of 'Number queried entries: ' is larger.
+
+%% Cell type:code id: tags:
+
+``` python
+print(query)
+```
+
+%% Output
+
+    Number queried entries: 23462
+    Number of entries loaded in the last api call: 6699
+    Bytes loaded in the last api call: 286440478
+    Bytes loaded from this query: 286440478
+    Number of downloaded entries: 6699
+    Number of made api calls: 1
+    
+
+%% Cell type:markdown id: tags:
+
+In this tutorial, we use machine learning tools to investigate properties of materials. In particular, we aim to predict the atomic density of the materials as a function of some primary features such as the atomic number.
+
+The atomic density of the material is derived from the volume of the simulation cell, whose dimensions are inserted in meters. Thus, we define a scale factor to convert dimensions into angstroms for a higher numerical stability during the machine learning analysis.
+
+%% Cell type:code id: tags:
+
+``` python
+scale_factor = 10**10
+```
+
+%% Cell type:markdown id: tags:
+
+To retrieve data and place it within a framework, we use a 'for' loop that iteratively fetch all entries up to the maximum value, which is  given by 'max_entries'. Taking into account that some links in the query might be broken, the resulting 'IndexError' exception is handled within the 'for' loop, that skips over the broken entry. In addition, we also make sure the entry contains the simulation cell value which we are interested in, and that all elements in the material have an admissible atomic number.
+
+%% Cell type:markdown id: tags:
+
+In the next cell, we query and fetch data from the NOMAD Archive. As we query a large number of elements, this operation can be time consuming. Hence, we have cached the results of the following query, and data can be loaded with a command given in the subsequent cell.
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# path_structure = (r'\data\query_nomad_archive\structures\')
+# try:
+#     os.mkdir(path_structure)
+# except OSError:
+#     !rm -r "$path_structure"
+#     os.mkdir(path_structure)
+
+# We define a 'Pandas' dataframe that contains all fetched data.
+df = pd.DataFrame()
+
+for entry in tqdm(range (max_entries)):
+
+    try:
+        calc = query[entry].section_workflow.calculation_result_ref
+        formula_red = calc.single_configuration_calculation_to_system_ref.chemical_composition_reduced
+        crystal = calc.single_configuration_calculation_to_system_ref.section_symmetry[0].crystal_system
+        space_group = calc.single_configuration_calculation_to_system_ref.section_symmetry[0].space_group_number
+        elements = np.sort(calc.single_configuration_calculation_to_system_ref.atom_species)
+        labels = calc.single_configuration_calculation_to_system_ref.atom_labels
+        # Dimensions of the cell are rescaled to angstroms.
+        x,y,z = calc.single_configuration_calculation_to_system_ref.simulation_cell.magnitude * scale_factor
+        lat_x, lat_y, lat_z = calc.single_configuration_calculation_to_system_ref.lattice_vectors.magnitude * scale_factor
+    except AttributeError:
+        continue
+
+    if (min(elements)<1 or max(elements)>118):
+        continue
+
+    # The total number in the array 'elements' gives the total number of atoms.
+    n_atoms = len (elements)
+
+    # Structures of materials are stored for being viewed using the Visualizer.
+#     file = open(path_structure + str(entry) +".xyz","w")
+#     file.write("%d\n\n"%(n_atoms*8))
+#     for i in [0,1]:
+#         for j in [0,1]:
+#             for k in [0,1]:
+#                 for n in range (n_atoms):
+#                     el = calc.single_configuration_calculation_to_system_ref.atom_labels[n]
+#                     xyz = calc.single_configuration_calculation_to_system_ref.atom_positions[n].magnitude * scale_factor
+#                     xyz += i*lat_x
+#                     xyz += j*lat_y
+#                     xyz += k*lat_z
+#                     file.write (el)
+#                     file.write ("\t%f\t%f\t%f\n"%(xyz[0],xyz[1],xyz[2]))
+#     file.close()
+
+    # The volume of the cell is obtained as scalar triple product of the three base vectors.
+    # The triple scalar product is obtained as determinant of the matrix composed with the three vectors.
+    cell_volume =  np.linalg.det ([x,y,z])
+
+    # The atomic density is given by the number of atoms in a unit cell.
+    density = n_atoms /  cell_volume
+
+    # The ternary materials are composed by Oxygen and two other elements labeled as A,B.
+    # Variables 'Z_A','Z_B' contain the atomic number of the elements A,B.
+    Z_A = int(np.delete(np.unique(elements), np.where(np.unique(elements)==8))[0])
+    Z_B = int(np.delete(np.unique(elements), np.where(np.unique(elements)==8))[1])
+    lab_A = np.delete(np.unique(labels), np.where(np.unique(labels)== 'O'))[0]
+    lab_B = re.sub("\d+", "", np.delete(np.unique(labels), np.where(np.unique(labels)== 'O'))[1])
+
+    # We instantiate the Mendeleev Element classes with the atomic numbers.
+    # These classes allow to retrieve atoms properties, in this example we only fetch the element name
+    #A = element(Z_A)
+    #B = element(Z_B)
+    A = ap.symbol(lab_A)
+    B = ap.symbol(lab_B)
+
+    # The fraction of atoms of a specific element within the material, that is also given by the stochiometric ratio.
+    fraction_O = np.sum(np.where (elements==8,1,0)) / len(elements)
+    fraction_A = np.sum(np.where (elements==A.atomic_number,1,0)) / len(elements)
+    fraction_B = np.sum(np.where (elements==B.atomic_number,1,0)) / len(elements)
+
+    # At each iteration, we add to the datafram one row that contains the A,B elements in the material and a number of other material properties.
+    df=df.append({
+
+        'Element_A_name': pymat.symbol(lab_A).atomic_element_name,
+        'Element_B_name': pymat.symbol(lab_B).atomic_element_name,
+        'Atomic_number_A': Z_A,
+        'Atomic_number_B': Z_B,
+        'Fraction_A':fraction_A,
+        'Fraction_B':fraction_B,
+        'Fraction_O':fraction_O,
+        'Covalent_radius_A':A.atomic_r_val[0]*100 ,
+        'Covalent_radius_B':B.atomic_r_val[0]*100 ,
+        'Ionenergy_A':A.atomic_ip*6.242e+18,
+        'Ionenergy_B':B.atomic_ip*6.242e+18,
+        'Weight_A':pymat.symbol(lab_A).atomic_mass,
+        'Weight_B':pymat.symbol(lab_B).atomic_mass,
+        'Space_group_number':int(space_group),
+        'Atomic_density':density,
+        'Formula':formula_red,
+        'File-id':int(entry),
+
+    },ignore_index=True)
+```
+
+%% Output
+
+    100%|████████████████████████████████████████████████████████████████████████████| 23000/23000 [07:47<00:00, 49.20it/s]
+
+%% Cell type:markdown id: tags:
+
+Here we load the dataframe which contains the data retrieved from the NOMAD Archive using the conditions defined above. The activation of the following cell should be performed only if the query above was skipped.
+
+%% Cell type:code id: tags:
+
+``` python
+#df = pd.read_pickle('./data/query_nomad_archive/ternary_O_cubic')
+```
+
+%% Cell type:markdown id: tags:
+
+Pandas dataframes include the 'describe' method which gives an overview about the composition of the dataset.
+
+%% Cell type:code id: tags:
+
+``` python
+df
+```
+
+%% Output
+
+           Atomic_density  Atomic_number_A  Atomic_number_B  Covalent_radius_A  \
+    0            0.069760             28.0             49.0            154.635
+    1            0.072930             28.0             49.0            154.635
+    2            0.049210             37.0             41.0             77.725
+    3            0.058399             47.0             61.0            134.805
+    4            0.067951             70.0             82.0            142.345
+    ...               ...              ...              ...                ...
+    22308        0.070226             50.0             93.0             54.845
+    22309        0.061514             48.0             64.0            124.345
+    22310        0.079527             42.0             69.0             72.935
+    22311        0.057326             41.0             61.0             77.725
+    22312        0.066790             69.0             90.0            130.435
+    
+           Covalent_radius_B Element_A_name Element_B_name  File-id   Formula  \
+    0                 34.135         Indium         Nickel      0.0    InNiO3
+    1                 34.135         Indium         Nickel      1.0    InNiO3
+    2                235.115        Niobium       Rubidium      2.0  Nb2O8Rb4
+    3                 33.765         Silver     Promethium      3.0  Ag2O8Pm4
+    4                176.325           Lead      Ytterbium      4.0  O8Pb2Yb4
+    ...                  ...            ...            ...      ...       ...
+    22308            137.255      Neptunium            Tin  22995.0    NpO3Sn
+    22309             30.575        Cadmium     Gadolinium  22996.0    CdGdO3
+    22310             26.255     Molybdenum        Thulium  22997.0    MoO3Tm
+    22311             33.765        Niobium     Promethium  22998.0    NbO3Pm
+    22312             26.255        Thorium        Thulium  22999.0    O3ThTm
+    
+           Fraction_A  Fraction_B  Fraction_O  Ionenergy_A  Ionenergy_B  \
+    0        0.200000    0.200000    0.600000     5.094165     8.055879
+    1        0.200000    0.200000    0.600000     5.094165     8.055879
+    2        0.142857    0.285714    0.571429     7.221556     3.992932
+    3        0.142857    0.285714    0.571429     7.409658     5.775166
+    4        0.142857    0.285714    0.571429     6.443283     6.480560
+    ...           ...         ...         ...          ...          ...
+    22308    0.200000    0.200000    0.600000     5.500209     6.748155
+    22309    0.200000    0.200000    0.600000     9.395269     4.172374
+    22310    0.200000    0.200000    0.600000     6.742170     5.902252
+    22311    0.200000    0.200000    0.600000     7.221556     5.775166
+    22312    0.200000    0.200000    0.600000     5.734792     5.902252
+    
+           Space_group_number   Weight_A   Weight_B
+    0                   221.0  114.81800   58.69340
+    1                   221.0  114.81800   58.69340
+    2                   227.0   92.90638   85.46780
+    3                   227.0  107.86820  145.00000
+    4                   227.0  207.20000  173.04000
+    ...                   ...        ...        ...
+    22308               221.0  237.00000  118.71000
+    22309               221.0  112.41100  157.25000
+    22310               221.0   95.94000  168.93421
+    22311               221.0   92.90638  145.00000
+    22312               221.0  232.03806  168.93421
+    
+    [22313 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+We might have different entries with the same chemical composition, because e.g. simulations were performed  for the same material with different settings that were not included among the filters of our query. Each of these simulations might have produced a slightly different value of the resulting atomic density of the material. As data is taken from heterogeneous simulations which were carried out in different laboratories, we do not aim to evaluate all possible parameters of each simulation. Hence, we average the atomic density value over all materials with the same chemical composition.
+
+After averaging, or grouping, data is placed in a different dataframe 'df_grouped', where each entry represents a different compound.
+
+%% Cell type:code id: tags:
+
+``` python
+df_grouped=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).mean()
+df_grouped=df_grouped.reset_index(level=['Element_A_name','Element_B_name','Space_group_number'])
+df_grouped['Replicas']=df.groupby(['Formula','Element_A_name','Element_B_name','Space_group_number']).count()['Atomic_density'].values
+```
+
+%% Cell type:markdown id: tags:
+
+With data placed in a dataframe, we can carry out our machine learning analysis.
+
+%% Cell type:markdown id: tags:
+
+# Example of unsupervised machine learning: Clustering and dimension reduction
+---
+
+%% Cell type:markdown id: tags:
+
+Firstly, we perform an explorative analysis to understand how data is composed and organized. Hence, we use
+unsupervised learning to extract from the dataset clusters of materials with similar properties.
+
+
+We define the list of features that are used for clustering including only the stochiometric ratio and the  atomic number. Our aim is to use unsupervised learning for understanding whether the defined descriptors are sufficient for structuring the dataset.
+
+%% Cell type:code id: tags:
+
+``` python
+clustering_features = []
+clustering_features.append('Fraction_A')
+clustering_features.append('Fraction_B')
+clustering_features.append('Fraction_O')
+clustering_features.append('Atomic_number_A')
+clustering_features.append('Atomic_number_B')
+# clustering_features.append('Ionenergy_A')
+# clustering_features.append('Ionenergy_B')
+# clustering_features.append('Weight_A')
+# clustering_features.append('Weight_B')
+# clustering_features.append('Covalent_radius_A')
+# clustering_features.append('Covalent_radius_B')
+
+df_clustering=preprocessing.scale(df_grouped[clustering_features])
+```
+
+%% Cell type:markdown id: tags:
+
+As clustering algorithm we use HDBSCAN, that is described in:
+
+<div style="padding: 1ex; margin-top: 1ex; margin-bottom: 1ex; border-style: dotted; border-width: 1pt; border-color: blue; border-radius: 3px;">
+R.J.G.B. Campello, D. Moulavi, J. Sander: <span style="font-style: italic;">Density-Based Clustering Based on Hierarchical Density Estimates</span>,  Springer Berlin Heidelberg, (2013).</div>
+
+%% Cell type:markdown id: tags:
+
+The only input parameter that this algorithm requires is the minimum size of each cluster.
+
+To achieve a more accurate cluster definition, HDBSCAN is able to detect all those points that are hardly classified into a specific cluster, which are labeled as 'outliers'.
+
+%% Cell type:markdown id: tags:
+
+The implementation of the algorithm that we use is taken from https://pypi.org/project/hdbscan/.
+
+%% Cell type:code id: tags:
+
+``` python
+import hdbscan
+```
+
+%% Cell type:code id: tags:
+
+``` python
+clusterer=hdbscan.HDBSCAN(min_cluster_size=60)
+clusterer.fit(df_clustering)
+display(Markdown('The algorithm finds ' + str(clusterer.labels_.max()+1) + ' clusters.'))
+```
+
+%% Output
+
+    The algorithm finds 3 clusters.
+
+%% Cell type:code id: tags:
+
+``` python
+cluster_labels = clusterer.labels_
+df_grouped['Cluster_label']=cluster_labels
+```
+
+%% Cell type:markdown id: tags:
+
+To visualize our multidimensional data, we need to project it onto a two-dimensional manifold.
+Hence, we use the UMAP embedding algorithm.
+
+%% Cell type:code id: tags:
+
+``` python
+import umap
+```
+
+%% Cell type:code id: tags:
+
+``` python
+reducer = umap.UMAP(min_dist=0.5, n_neighbors=100)
+embedding = reducer.fit(df_clustering)
+embedding = reducer.transform(df_clustering)
+df_grouped['x_emb']=embedding[:,0]
+df_grouped['y_emb']=embedding[:,1]
+```
+
+%% Output
+
+    ---------------------------------------------------------------------------
+    AttributeError                            Traceback (most recent call last)
+    <ipython-input-33-263ed499c165> in <module>
+    ----> 1 reducer = umap.UMAP(min_dist=0.5, n_neighbors=100)
+          2 embedding = reducer.fit(df_clustering)
+          3 embedding = reducer.transform(df_clustering)
+          4 df_grouped['x_emb']=embedding[:,0]
+          5 df_grouped['y_emb']=embedding[:,1]
+    AttributeError: module 'umap' has no attribute 'UMAP'
+
+%% Cell type:markdown id: tags:
+
+We maintain the Visualizer, a dedicated tool for an interactive visualization of the data retrieved from the Archive.
+
+The Visualizer shows compounds belonging in different clusters with different colors.
+Outliers by default are not shown on the map, but outliers visualization can be activated by clicking the 'Outliers' label on the right of the map.
+The color of the markers can also represent a specific target property of the material that can be selected from the 'Marker colors' dropdown menu.
+Target properties which can be displayed with different colors are inserted as a list of 'color_features'.
+After selecting a specific property, a new menu will appear to choose the color scale to be used for visualizing the different values of that target property.
+
+To prevent data overloading, only a fraction of the whole dataset is initially visualized.
+This parameter can be adjusted modifing the fraction value on top of the map, up to select all entries in the dataset.
+
+By hovering over the map, the Visualizer shows which compound corresponds to each point, and its number of 'Replicas'.
+Replicas represents the number of entries from the original dataset (before grouping) which have the same chemical composition.
+It is also possible to select an additional number of features in the 'hover_features' list which are displayed while hovering.
+
+Clicking on any of the points in the map automatically shows the 3D chemical structure of the material in one of the windows below.
+Note that at each time the 'Display' button is clicked, a different structure with the same chemical composition is visualized.
+A new structure is shown up to the 'Replicas' number.
+This allows to inspect all possible structures in the dataset.
+
+Furthermore, the chemical formula of a specific compound can be manually written in the 'Compound' textbox, and clicking the 'Display' button will both show the 3D chemical structure and mark the exact position of the compound on the map with a cross.
+The Compound textbox includes autocompletion, which allows to inspect all materials in the dataset inserting partial formulae.
+
+Lastly, the Visualizer offers a number of utils for producing high-quality plots of the map, which are displayed after clicking the button just below the map.
+
+%% Cell type:code id: tags:
+
+``` python
+hover_features = []
+hover_features.append('Atomic_number_A')
+hover_features.append('Atomic_number_B')
+hover_features.append('Space_group_number')
+hover_features.append('Atomic_density')
+hover_features.append('Replicas')
+hover_features.append('Cluster_label')
+
+color_features = []
+color_features.append('Atomic_density')
+color_features.append('Space_group_number')
+
+Visualizer(df, df_grouped, hover_features, color_features).view()
+```
+
+%% Output
+
+    ---------------------------------------------------------------------------
+    KeyError                                  Traceback (most recent call last)
+    c:\users\aakas\appdata\local\programs\python\python38\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
+       2645             try:
+    -> 2646                 return self._engine.get_loc(key)
+       2647             except KeyError:
+    pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
+    pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
+    pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
+    pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
+    KeyError: 'x_emb'
+
+During handling of the above exception, another exception occurred:
+    KeyError                                  Traceback (most recent call last)
+    <ipython-input-32-329093dad86f> in <module>
+         11 color_features.append('Space_group_number')
+         12
+    ---> 13 Visualizer(df, df_grouped, hover_features, color_features).view()
+
+    E:\Material Science\WS 2020-21\Work-SHK\Gitlab\analytics-query-nomad-archive\query_nomad_archive\visualizer.py in __init__(self, df, df_grouped, hover_features, color_features)
+        118                         name=self.name_trace[cl],
+        119                         mode='markers',
+    --> 120                         x=self.df_entries_onmap[cl]['x_emb'],
+        121                         y=self.df_entries_onmap[cl]['y_emb'],
+        122                         marker_color=next(self.palette),
+    c:\users\aakas\appdata\local\programs\python\python38\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
+       2798             if self.columns.nlevels > 1:
+       2799                 return self._getitem_multilevel(key)
+    -> 2800             indexer = self.columns.get_loc(key)
+       2801             if is_integer(indexer):
+       2802                 indexer = [indexer]
+    c:\users\aakas\appdata\local\programs\python\python38\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
+       2646                 return self._engine.get_loc(key)
+       2647             except KeyError:
+    -> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
+       2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
+       2650         if indexer.ndim > 1 or indexer.size > 1:
+    pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
+    pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
+    pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
+    pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
+    KeyError: 'x_emb'
+
+%% Cell type:markdown id: tags:
+
+Using the visualizer we can analyse the composition of the 4 clusters extracted.
+We have selected the atomic density and the space group number as color features.
+This allows to inspect how these values vary within clusters.
+The values of these features show ordered structures within clusters, that is particularly interesting because these features were not used by the clustering algorithm.
+This suggests that the atomic features used for clustering are sufficient to describe certain properties of the whole material such as the most stable structure or the atomic density.
+Therefore, we might imagine that there is a functional form capable to infer the defined materials properties from our atomic features, and we can train a supervised machine learning model to find such relationship.
+
+The space group number of the elements in each cluster is shown below firstly as a list of values, then using a pie chart.
+We can clearly notice that in each cluster there is one space group number that is predominant respect to all the others.
+This means that we can label the different clusters with a characteristic space group number.
+
+%% Cell type:code id: tags:
+
+``` python
+df_count_groups=df_grouped.loc[df_grouped['Cluster_label']!=-1].groupby(['Cluster_label','Space_group_number']).describe()['Atomic_density'][['count']]
+display(Markdown(print(df_count_groups)))
+df_count_groups=df_count_groups.reset_index();
+n_clusters = df_grouped['Cluster_label'].max() +1
+```
+
+%% Cell type:code id: tags:
+
+``` python
+for cl in range (n_clusters):
+    df_cluster=df_count_groups.loc[df_count_groups['Cluster_label']==cl]
+    fig = px.pie(df_cluster, values='count', names='Space_group_number')
+    fig.show()
+```
+
+%% Cell type:markdown id: tags:
+
+The pie chart below includes the space group number of all elements classified as outliers. We can clearly see that the outliers include all different space group numbers, and it is not possible to identify a predominant space group number characteristic of the group of outliers.
+
+%% Cell type:code id: tags:
+
+``` python
+df_cluster=df_grouped.loc[df_grouped['Cluster_label']==-1].groupby(['Space_group_number']).describe()['Atomic_density'][['count']].reset_index()
+plt.figure(figsize=(25,25))
+fig = px.pie(df_cluster, values='count', names='Space_group_number')
+fig.show()
+```
+
+%% Cell type:markdown id: tags:
+
+#  Example of supervised machine learning: Random forest
+---
+
+%% Cell type:markdown id: tags:
+
+Finally, we aim to use the large data set of materials that we have retrieved from the NOMAD Archive to train an AI model.
+Previous findings obtained with the unsupervised analysis suggest that it is possible to train a model to predict materials properties using only atomic features.
+The trained model can then be used to predict properties of yet unknown materials from the knowledge of its constituent atoms.
+In this specific case, we aim to predict the average atomic density.
+
+We use the Random forest method. Random forest is available from the scikit-learn package.
+
+%% Cell type:code id: tags:
+
+``` python
+from sklearn.ensemble import RandomForestRegressor
+```
+
+%% Cell type:markdown id: tags:
+
+We select all atomic properties as primary features and the atomic density of the material as target feature.
+
+%% Cell type:code id: tags:
+
+``` python
+ML_primary_features = []
+ML_primary_features.append('Fraction_A')
+ML_primary_features.append('Fraction_B')
+ML_primary_features.append('Fraction_O')
+ML_primary_features.append('Atomic_number_A')
+ML_primary_features.append('Atomic_number_B')
+
+ML_target_features = []
+ML_target_features.append('Atomic_density')
+```
+
+%% Cell type:markdown id: tags:
+
+Our dataset is divided into a train set and a test set. The model is trained only with the train set, while ignoring the values in the test set. This allows to test the prediction capability of the model on data that have not been seen.
+
+%% Cell type:code id: tags:
+
+``` python
+X_train, X_test, y_train, y_test = train_test_split (df[ML_primary_features], df[ML_target_features], test_size=0.2, )
+```
+
+%% Cell type:markdown id: tags:
+
+We train here the model.
+
+%% Cell type:code id: tags:
+
+``` python
+random_regressor = RandomForestRegressor(
+    n_estimators= 100,
+    max_depth = 100,
+    max_features = 5,
+    min_samples_split = 5,
+    random_state=0
+)
+random_regressor.fit(X_train, y_train.to_numpy().ravel())
+```
+
+%% Cell type:markdown id: tags:
+
+After training, we check the accuracy of the model.
+
+%% Cell type:code id: tags:
+
+``` python
+y_predict= random_regressor.predict(X_test)
+display(Markdown(r'The Ai model predicts the atomic density on a test set with an average error of '+
+                 str(int(10000*np.mean(np.abs(y_predict-y_test.to_numpy().flatten())))/10000) +
+                ' Angstroms$^{-1}$.' ))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+y_predict= random_regressor.predict(X_train)
+display(Markdown(r'The Ai model predicts the atomic density on a training set with an average error of '+
+                 str(int(10000*np.mean(np.abs(y_predict-y_train.to_numpy().flatten())))/10000) +
+                ' Angstroms$^{-1}$.' ))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+y_predict= random_regressor.predict(X_test)
+X_test = X_test.assign(Atomic_density=y_predict)
+df_A_pred = X_test[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
+df_B_pred = X_test[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
+df_AB_pred = pd.concat([df_A_pred,df_B_pred], ignore_index=True)
+df_A = df[['Atomic_number_A','Atomic_density']].rename(columns={'Atomic_number_A':'Atomic_number'})
+df_B = df[['Atomic_number_B','Atomic_density']].rename(columns={'Atomic_number_B':'Atomic_number'})
+df_AB = pd.concat([df_A,df_B], ignore_index=True)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df_AB['Atomic_number']=df_AB['Atomic_number'].astype('int')
+```
+
+%% Cell type:code id: tags:
+
+``` python
+xaxis =  df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,0]
+yaxis =  df_AB.groupby('Atomic_number').mean().reset_index().to_numpy()[:,1]
+```
+
+%% Cell type:markdown id: tags:
+
+In the following plot, we see the average atomic density of ternary elements composed of Oxygen and another element, whose atomic number is given by the x-axis. Therefore, all values are averaged over the third element.
+
+Plots show the predictions of the trained model on the test set, and on the kwnon values of the training set that are taken as reference values. Each point shows also the standard deviation. We emphasize that, considering that each value on the plot is given by an average over all elements in the periodic table, the standard deviation cannot go to zero by construction, even in the limit of taking all possible combinations. We then aim that averages and standard deviations predicted by our model are comparable to the ones of the reference model.
+
+%% Cell type:code id: tags:
+
+``` python
+plt.style.use('ggplot')
+plt.figure(figsize=(15,10))
+
+x=df_AB_pred.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
+y=df_AB_pred.groupby(['Atomic_number']).mean().to_numpy().flatten()
+std=df_AB_pred.groupby(['Atomic_number']).std().to_numpy().flatten()
+plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='AI prediction' )
+x=df_AB.groupby(['Atomic_number']).mean().index.to_numpy().flatten()
+y=df_AB.groupby(['Atomic_number']).mean().to_numpy().flatten()
+std=df_AB.groupby(['Atomic_number']).std().to_numpy().flatten()
+plt.errorbar(x,y,yerr=std.T, ls='', marker='s', label='Reference value' );
+plt.legend(loc='upper right', fontsize='x-large')
+plt.xlabel('Atomic number', fontsize='x-large')
+plt.ylabel(r'Atomic density [$\AA^{-1}$]', fontsize='x-large')
+plt.xticks([3,11,19,37,55,89],fontsize='x-large');
+plt.yticks(fontsize='x-large');
+```
+
+%% Cell type:markdown id: tags:
+
+In the plot above, we can see that values predicted by the AI model are comparable to the reference values. In addition, we observe that the atomic density follows periodic trends, as values tend to be lower in the beginning and in the end of each row of the periodic table.