In this tutorial, we show how to query the NOMAD Archive (https://www.nomad-coe.eu/index.php?page=nomad-repository) and perform data analysis on the retrieved data.
%% Cell type:markdown id: tags:
## Preliminary operations
We load the following packages that are all available in the virtual environment containing the Jupyter notebooks in the NOMAD AI toolkit. Among the loaded packages, we highlight ``sklearn``, i.e., scikit-learn, one of the most popular machine-learning packages, and ``pandas``, a popular tool for data handling and analysis.
We maintain a ``nomad`` package that can be imported in all notebooks of the AI Toolkit.
%% Cell type:markdown id: tags:
The ``nomad`` package allows one to retrieve data from the NOMAD Archive by means of a script, as shown below. In this script, we insert metadata characterizing the materials that we aim to retrieve. In this case, we select ternary materials containing oxygen. We also request that the simulations were carried out with the VASP code using GGA exchange-correlation (xc) functionals. Values are retrieved from simulation runs that reached geometric convergence within the desired threshold.
The modules ``atomic_properties_dft`` and ``atomic_properties_pymat`` make available atomic properties of the elements of the periodic table, to be used as features in data analytics. The sources of these atomic properties are DFT calculations performed by the NOMAD team and <a href="https://pymatgen.org/" target="_blank">pymatgen</a>, respectively.
%% Cell type:markdown id: tags:
We have developed a visualization tool that displays atomic properties of all elements across the periodic table as a heatmap. Currently, this tool can visualize the atomic properties accessible from the ``atomic_properties_dft`` module. Below is an example for data calculated with the HSE06 functional and spinless settings.
This module can be used as follows: from the dropdown menu, one selects the property of interest, and the table is updated automatically to show the corresponding heatmap.
We have defined the variable ``query``, which allows us to perform our query.
All quantities defined in the ``required`` field are fetched during the query. For example, we request quantities such as ``chemical_composition``, which gives the composition of the material, or ``atoms.positions``, which contains the positions of all atoms after geometric convergence.
We notice that the variable ``query`` contains a number of other variables: the ``max_entries`` value sets the maximum number of entries that can be retrieved, and the ``per_page`` value indicates the number of entries fetched at each API call.
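As an illustration, such a ``query`` variable might be assembled as below. This is a sketch only: the field names are illustrative stand-ins and the exact NOMAD metainfo keys may differ; no call to the Archive is made here.

%% Cell type:code id: tags:
``` python
# Illustrative sketch of the query variable (key names are assumptions,
# not the verbatim NOMAD API): filters for ternary oxides computed with
# VASP, plus the quantities to fetch and the pagination settings.
query = {
    'query': {
        'elements': ['O'],          # must contain oxygen
        'n_elements': 3,            # ternary materials
        'program_name': 'VASP',     # VASP simulations only
    },
    'required': {
        'chemical_composition': '*',
        'atoms.positions': '*',
    },
    'max_entries': 5000,            # maximum number of entries to retrieve
    'per_page': 100,                # entries fetched per API call
}
print(query['max_entries'])         # → 5000
```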
%% Cell type:markdown id: tags:
In this tutorial, we use machine learning tools to investigate properties of materials. In particular, we aim to predict the atomic density of the materials as a function of some primary features such as the atomic number.
The atomic density of the material is derived from the volume of the simulation cell, whose dimensions are given in meters. Thus, we define a scale factor to convert dimensions into angstroms for higher numerical stability during the machine learning analysis.
%% Cell type:code id: tags:
``` python
scale_factor = 10**10  # conversion factor from meters to angstroms
```
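%% Cell type:markdown id: tags:
To illustrate how the scale factor enters, here is a small sketch (with made-up numbers) computing the atomic density from a simulation cell whose dimensions come in meters:

%% Cell type:code id: tags:
``` python
import numpy as np

scale_factor = 10**10  # meters -> angstroms

# Hypothetical 5 A cubic cell containing 8 atoms, with lattice vectors
# in meters as they would come from the Archive.
cell = np.eye(3) * 5e-10
volume_A3 = abs(np.linalg.det(cell * scale_factor))  # cell volume in A^3
n_atoms = 8
density = n_atoms / volume_A3                        # atoms per A^3
print(round(density, 3))                             # → 0.064
```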
%% Cell type:markdown id: tags:
To retrieve data and place it in a dataframe, we use a ``for`` loop that iteratively fetches entries up to the maximum number given by ``max_entries``. Because some links in the query might be broken, the resulting ``IndexError`` exception is handled within the loop, which skips over the broken entry. In addition, we make sure that each entry contains the simulation cell value we are interested in, and that all elements in the material have an admissible atomic number.
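The retrieval loop described above can be sketched as follows; ``fetch_entry`` is a hypothetical stand-in for the actual API call:

%% Cell type:code id: tags:
``` python
# Skeleton of the retrieval loop. `fetch_entry` is a hypothetical stand-in
# for the actual API call; a broken link raises IndexError and is skipped.
max_entries = 5

def fetch_entry(i):
    if i == 2:  # simulate one broken link
        raise IndexError
    return {'cell': [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
            'atomic_numbers': [8, 22, 38]}

rows = []
for i in range(max_entries):
    try:
        entry = fetch_entry(i)
    except IndexError:          # broken entry: skip it
        continue
    if 'cell' not in entry:     # require the simulation cell
        continue
    if not all(0 < z <= 118 for z in entry['atomic_numbers']):
        continue                # require admissible atomic numbers
    rows.append(entry)

print(len(rows))                # → 4 (one broken entry was skipped)
```

%% Cell type:markdown id: tags: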
%% Cell type:markdown id: tags:
In the next cell, we download data from the NOMAD Archive. **As we query a large number of elements, this operation can be time consuming. Hence, we have cached the results of the following query, and data can be loaded with a command given in the subsequent cell.**
%% Cell type:code id: tags:
``` python
                'Atomic_radius_A': A.atomic_r_val[0]*100,  # distances are in angstroms; convert to pm
                'Atomic_radius_B': B.atomic_r_val[0]*100,  # distances are in angstroms; convert to pm
                'Ionenergy_A': A.atomic_ip*6.242e+18,      # energies are in joules; convert to eV
                'Ionenergy_B': B.atomic_ip*6.242e+18,      # energies are in joules; convert to eV
                'El_affinity_A': A.atomic_ea*6.242e+18,    # energies are in joules; convert to eV
                'El_affinity_B': B.atomic_ea*6.242e+18,    # energies are in joules; convert to eV
                'Homo_A': A.atomic_hfomo*6.242e+18,        # energies are in joules; convert to eV
                'Homo_B': B.atomic_hfomo*6.242e+18,        # energies are in joules; convert to eV
                'Lumo_A': A.atomic_hfomo*6.242e+18,        # NB: reuses the HOMO attribute; the LUMO value is likely intended here
                'Lumo_B': B.atomic_hfomo*6.242e+18,        # NB: reuses the HOMO attribute; the LUMO value is likely intended here
                'Weight_A': pymat.symbol(lab_A).atomic_mass,
                'Weight_B': pymat.symbol(lab_B).atomic_mass,
                'Space_group_number': int(space_group),
                'Atomic_density': density,
                'Formula': formula_red,
                'File-id': int(entry),
            }, ignore_index=True)
```
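%% Cell type:markdown id: tags:
The conversion factors in the cell above can be checked against the SI definition of the electronvolt (a quick sanity check, not part of the original workflow):

%% Cell type:code id: tags:
``` python
# 1 eV = 1.602176634e-19 J (exact, 2019 SI), so 1 J = 1/1.602176634e-19 eV,
# which matches the 6.242e+18 factor used above. 1 angstrom = 100 pm.
J_TO_EV = 1 / 1.602176634e-19
ANG_TO_PM = 100.0

print(round(J_TO_EV / 1e18, 4))  # → 6.2415
```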
%% Cell type:markdown id: tags:
Here we load the dataframe containing the data retrieved from the NOMAD Archive under the conditions defined above. Run the following cell only if the query above was skipped.
Pandas dataframes include the ``describe`` method, which gives an overview of the composition of the dataset.
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Cell type:markdown id: tags:
We are particularly interested in materials properties and how they can be inferred solely from the atomic composition of a specific material. In our query, we have retrieved two materials properties, namely the _'Atomic_density'_ and the _'Space_group_number'_. Before performing any machine learning analysis, it is instructive to visualize the distribution of these values using a histogram. Pandas allows for a straightforward visualization of histograms constructed from dataframe column values.
We notice above that the retrieved materials fall mainly into two distinct space groups, i.e., space group numbers 221 and 227. It would be interesting to see whether this distinction implies that materials belonging to the same space group share similar atomistic properties, while those belonging to different space groups are distinct also at the atomistic level. This is the scope of clustering, and it will be the object of an in-depth analysis below.
Now, we keep inspecting the dataframe values and plot a histogram containing the values of the atomic density.
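A minimal pandas sketch of this kind of inspection (all values below are made up): ``hist`` draws the histogram directly, while ``value_counts`` returns the same distribution information as numbers.

%% Cell type:code id: tags:
``` python
import pandas as pd

# Toy dataframe standing in for the queried data (values are made up).
df_demo = pd.DataFrame({'Atomic_density': [0.065, 0.071, 0.068, 0.074, 0.070],
                        'Space_group_number': [221, 227, 221, 221, 227]})

# df_demo['Atomic_density'].hist() would draw the histogram;
# value_counts gives the same distribution without plotting.
sg_counts = df_demo['Space_group_number'].value_counts()
print(sg_counts.to_dict())  # → {221: 3, 227: 2}
```

%% Cell type:markdown id: tags: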
The histogram above shows that the atomic density is mainly distributed around a value of 0.07 Å$^{-3}$. Such a distribution might seem the result of a random extraction, but we aim to find an AI model that can make high-resolution predictions of the atomic density based only on the atomic composition of the material.
%% Cell type:markdown id: tags:
In order to build an AI model that makes reliable predictions, we should make sure that each entry has a unique representation. In this case, as we are interested in predicting material properties from the atomic composition, the chemical composition of the material is an ideal representation for the dataframe entries. However, we might have several entries with the same chemical composition, because, e.g., simulations were performed for the same material with different settings that were not included among the filters of our query. Each of these simulations might have produced a slightly different value of the resulting atomic density. As the data come from heterogeneous simulations carried out in different laboratories, we do not aim to evaluate all possible parameters of each simulation. Hence, we average the atomic density over all entries with the same chemical composition.
After averaging (or _grouping_), the data is placed in a new dataframe, _'df_grouped'_, where each entry represents a different compound.
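The grouping step can be sketched as follows, assuming the column names used above and made-up density values:

%% Cell type:code id: tags:
``` python
import pandas as pd

# Duplicate entries for the same compound are averaged into one row each.
df = pd.DataFrame({'Formula': ['SrTiO3', 'SrTiO3', 'BaTiO3'],
                   'Atomic_density': [0.084, 0.086, 0.078]})

df_grouped = df.groupby('Formula', as_index=False).mean(numeric_only=True)
print(len(df_grouped))  # → 2 (one row per distinct compound)
```

%% Cell type:markdown id: tags: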
With data placed in a dataframe, we can carry out our machine learning analysis.
%% Cell type:markdown id: tags:
# Example of unsupervised machine learning: Clustering and dimension reduction
---
%% Cell type:markdown id: tags:
Firstly, we perform an exploratory analysis to understand how the data is composed and organized. Hence, we use unsupervised learning to extract from the dataset clusters of materials with similar properties.
We define the list of features used for clustering, including only the stoichiometric ratios and the atomic numbers. Our aim is to use unsupervised learning to understand whether the defined descriptors are sufficient for structuring the dataset.
For clustering, we use the HDBSCAN algorithm: R.J.G.B. Campello, D. Moulavi, J. Sander, <span style="font-style: italic;">Density-Based Clustering Based on Hierarchical Density Estimates</span>, Springer Berlin Heidelberg (2013).</div>
The only input parameter that this algorithm requires is the minimum size of each cluster.
To achieve a more accurate cluster definition, HDBSCAN detects all points that cannot be confidently assigned to a specific cluster and labels them as 'outliers'.
The implementation of the algorithm that we use is taken from https://pypi.org/project/hdbscan/.
We maintain the Visualizer, a dedicated tool for the interactive visualization of data retrieved from the Archive.
The Visualizer shows compounds belonging to different clusters with different colors.
Outliers are not shown on the map by default, but their visualization can be activated by clicking the 'Outliers' label on the right of the map.
The color of the markers can also represent a specific target property of the material, selectable from the 'Marker colors' dropdown menu.
Target properties that can be displayed with different colors are inserted as a list of 'color_features'.
After selecting a specific property, a new menu appears to choose the color scale used for visualizing the different values of that property.
To prevent data overloading, only a fraction of the whole dataset is initially visualized.
This fraction can be adjusted by modifying the value on top of the map, up to selecting all entries in the dataset.
By hovering over the map, the Visualizer shows which compound corresponds to each point, together with its number of 'Replicas'.
'Replicas' is the number of entries in the original dataset (before grouping) that have the same chemical composition.
It is also possible to select additional features in the 'hover_features' list, which are displayed while hovering.
Clicking on any point in the map automatically shows the 3D chemical structure of the material in one of the windows below.
Note that each time the 'Display' button is clicked, a different structure with the same chemical composition is visualized.
New structures are shown up to the 'Replicas' number, which allows one to inspect all structures in the dataset.
Furthermore, the chemical formula of a specific compound can be typed into the 'Compound' textbox, and clicking the 'Display' button will both show the 3D chemical structure and mark the exact position of the compound on the map with a cross.
The 'Compound' textbox includes autocompletion, which allows one to inspect all materials in the dataset by inserting partial formulae.
Lastly, the Visualizer offers a number of utilities for producing high-quality plots of the map, which are displayed after clicking the button just below the map.
Using the Visualizer, we can analyse the composition of the different clusters extracted.
We have selected the atomic density and the space group number as color features.
This allows us to inspect how these values vary within clusters.
The values of these features show ordered structures within clusters, which is particularly interesting because these features were not used by the clustering algorithm.
This suggests that the atomic features used for clustering are sufficient to describe certain properties of the whole material, such as the most stable structure or the atomic density.
Therefore, we might imagine that there is a functional form capable of inferring the defined materials properties from our atomic features, and we can train a supervised machine learning model to find such a relationship.
The space group number of the elements in each cluster is shown below, first as a list of values and then using a pie chart.
We can clearly notice that in each cluster one space group number is predominant with respect to all the others.
This means that we can label each cluster with a characteristic space group number.
The pie chart below includes the space group numbers of all elements classified as outliers. We can clearly see that the outliers span all the different space group numbers, and it is not possible to identify a predominant space group number characteristic of the group of outliers.
# Example of supervised machine learning: Random forest
---
%% Cell type:markdown id: tags:
Finally, we aim to use the large dataset of materials retrieved from the NOMAD Archive to train an AI model.
The findings of the unsupervised analysis suggest that it is possible to train a model to predict materials properties using only atomic features.
The trained model can then be used to predict properties of yet unknown materials from the knowledge of their constituent atoms.
In this specific case, we aim to predict the average atomic density.
We use the random forest method, which is available in the scikit-learn package.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestRegressor
```
%% Cell type:markdown id: tags:
We select the atomic properties as primary features and the atomic density of the material as the target feature.
%% Cell type:code id: tags:
``` python
# Primary (input) features
ML_primary_features = [
    'Fraction_A',
    'Fraction_B',
    'Fraction_O',
    'Atomic_number_A',
    'Atomic_number_B',
    'El_affinity_A',
    'El_affinity_B',
    'Ionenergy_A',
    'Ionenergy_B',
    'Atomic_radius_A',
    'Atomic_radius_B',
    # 'Homo_A',
    # 'Homo_B',
    # 'Lumo_A',
    # 'Lumo_B',
    # 'Weight_A',
    # 'Weight_B',
]

# Target feature
ML_target_features = ['Atomic_density']
```
%% Cell type:markdown id: tags:
Our dataset is divided into a train set and a test set. The model is trained only on the train set, ignoring the values in the test set. This allows us to test the predictive capability of the model on data it has not seen.
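This train/test workflow can be sketched with scikit-learn on synthetic data; the feature values below are made up and merely stand in for our atomic features:

%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the primary features and the target property.
X = rng.uniform(size=(200, 5))
y = X @ np.array([0.02, -0.01, 0.03, 0.01, -0.02]) + 0.07

# Hold out 20% of the data; the model never sees the test set during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)          # trained only on the train set
score = model.score(X_test, y_test)  # R^2 on unseen data
print(round(score, 2))
```

%% Cell type:markdown id: tags: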
In the following plot, we show the average atomic density of ternary materials composed of oxygen and another element, whose atomic number is given on the x-axis. All values are therefore averaged over the third element.
The plots show the predictions of the trained model on the test set, together with the known values of the training set, which are taken as reference values. Each point also shows the standard deviation. We emphasize that, since each value in the plot is an average over all elements of the periodic table, the standard deviation cannot go to zero by construction, even in the limit of taking all possible combinations. We therefore expect the averages and standard deviations predicted by our model to be comparable to those of the reference data.
In the plot above, we can see that the values predicted by the AI model are comparable to the reference values. In addition, we observe that the atomic density follows periodic trends, as values tend to be lower at the beginning and at the end of each row of the periodic table.