Merge branch 'hotfix_removing_empty_fingerprints' into 'main'

HOTFIX: fix cell removing empty fingerprints

See merge request !1

30d664a3 · Adam Fekete · bf282d25 · bd16ff2a · 30d664a3
Commit 30d664a3 authored Feb 28, 2024 by Adam Fekete
--- a/dos_similarity_search.ipynb
+++ b/dos_similarity_search.ipynb
@@ -225,10 +225,11 @@
    "# removing entries with empty fingerprints\n",
    "index = 0\n",
    "while index < len(GaAs_alloy):\n",
-    "    if GaAs_alloy[index].run[0].calculation[-1].dos_electronic[0].fingerprint.bins == '':\n",
-    "        del GaAs_alloy[index]\n",
-    "    else:\n",
-    "        index += 1"
+    "    try:\n",
+    "        assert GaAs_alloy[index].run[0].calculation[-1].dos_electronic[0].fingerprint.bins != ''\n",
+    "        index += 1\n",
+    "    except (IndexError, AssertionError):\n",
+    "        del GaAs_alloy[index]"
   ]
  },
  {

 %% Cell type:markdown id: tags:

 <img  src="assets/dos_similarity_search/header.jpg" width="900">

 %% Cell type:markdown id: tags:


 <img style="float: left;" src="assets/dos_similarity_search/logo_MPG.png" width=150>
 <img style="float: left; margin-top: -10px" src="assets/dos_similarity_search/logo_NOMAD.png" width=250>
 <img style="float: left; margin-top: -5px" src="assets/dos_similarity_search/logo_HU.png" width=130>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 ## Introduction

 This notebook shows how to compute the similarity of materials in terms of their electronic density-of-states (DOS) from data retrieved from the [NOMAD Archive](https://nomad-lab.eu/prod/v1/gui/search/entries).

 For this purpose, a _DOS fingerprint_ is used, which encodes the DOS obtained from density-functional theory (DFT) calculations into a binary-valued descriptor. A detailed description of the fingerprint can be found in Ref. [1].

 The DOS fingerprints in this notebook are precomputed and available in the NOMAD Archive.
 We first download the respective data from the NOMAD Archive and use the fingerprint to find materials that are similar to a given reference material.

 **In this notebook we demonstrate how to find GaAs-based binary and ternary compounds from the NOMAD Archive that have the most similar electronic structure to GaAs.**

 ### Contents:
 - [Import modules](#Import-modules)
 - [Downloading data from the NOMAD Archive](#Downloading-data-from-the-NOMAD-Archive)
  - [Downloading a reference material](#Downloading-a-reference-material)
  - [Downloading calculations using search queries](#Downloading-calculations-using-search-queries)
 - [The DOS fingerprint as a descriptor](#The-DOS-fingerprint-as-a-descriptor)
 - [Calculation of similarity coefficients](#Calculation-of-similarity-coefficients)
 - [Visualizing results](#Visualizing-results)
 - [References](#References)

 </span>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 ## Import modules

 To interact with the NOMAD Archive API, we use the Python package `nomad-lab`. To learn more about its usage, please refer to the [documentation](https://nomad-lab.eu/prod/rae/docs/client/client.html). </span>

 %% Cell type:code id: tags:

 ``` python
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 import nest_asyncio
 import copy

 from nomad.client import ArchiveQuery

 # Load plot parameters
 plt.style.use('./data/dos_similarity_search/dos_similarity.mplstyle')

 nest_asyncio.apply()
 ```

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 # Downloading data from the NOMAD Archive

 **For a detailed overview of how to query the NOMAD Archive using the `nomad-lab` package, see the tutorial 'Query the Archive' on the ['AI toolkit tutorials'](https://nomad-lab.eu/services/aitoolkit) page.** Here, we download all data necessary to perform a similarity search using DOS fingerprints. This is achieved using an instance of `ArchiveQuery`, which allows querying the NOMAD Archive with only a few commands.
 </span>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 ## Downloading a reference material

 First, we download a reference calculation for [GaAs](https://nomad-lab.eu/prod/v1/gui/entry/id/SobHx8fSRuC0eC3fZuRPLA/Ya3jm8nB0Gb_8VFZs2j1dNYAg7h_) from the Archive. To download a specific calculation we construct the `query` dictionary only from the calculation ID. The calculation ID is a unique, static identifier for each calculation.

 For the analysis presented here, not all data of a calculation are required. We therefore select only the paths to the needed data in the NOMAD Archive entry, which avoids unnecessary downloads. These paths are stored in the variable `reference_query_required_sections` in the cell below. The paths to all data of a calculation can be found on the NOMAD [Metainfo](https://nomad-lab.eu/prod/v1/gui/analyze/metainfo) page.</span>

 %% Cell type:code id: tags:

 ``` python
 reference_calc_id = 'zkkMIAPyn4OCbdEdW21DZTeretQ3'
 reference_query_parameters = {
    'entry_id': reference_calc_id # ID of the reference calculation
 }
 reference_query_required_sections = {
    # DOS fingerprint
    'workflow': {
        'calculation_result_ref': {
            'dos_electronic': {
                'fingerprint': '*'
            }
        }
    },
    # Upload and calculation id
    'metadata': "*",
    # chemical formula, material id, and space group number
    'results':{
        'material':{
            'chemical_formula_reduced': '*',
            'material_id': '*',
            'symmetry': {
                'space_group_number': '*'
            }
        }
    }
 }
 ```

 %% Cell type:code id: tags:

 ``` python
 # compile the query
 reference_GaAs_query = ArchiveQuery(query = reference_query_parameters,
                                    required = reference_query_required_sections)
 # download
 reference_GaAs = reference_GaAs_query.download()[0]
 ```

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 This calculation, stored in the variable `reference_GaAs`, will be used as the reference for our similarity search.
 </span>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 ## Downloading calculations using search queries

 To perform a similarity search, we compare the fingerprint of the reference to the fingerprints of a large data set.
 In the following, we query the NOMAD Archive for GaAs-based binary and ternary compounds. As a starting point, we restrict the search to calculations performed with the DFT code 'VASP' using a GGA exchange-correlation functional. This information is written to the `query` dictionary that is passed to `ArchiveQuery`. Note that the `required` argument of `ArchiveQuery` is unchanged.
 </span>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 **Queries can be generated in the GUI of the [NOMAD Archive](https://nomad-lab.eu/prod/v1/gui/search) in the python dictionary format.** They can be found under the `<>` symbol at the top of the search menu. From there, they can be directly copied into the `query` dictionary of the `ArchiveQuery` function.
 </span>

 %% Cell type:code id: tags:

 ``` python
 search_query_parameters = {
    'results.method.simulation.program_name': 'VASP',
    'results.material.elements': {'all': ['Ga', 'As']},
    'results.properties.available_properties': ['dos_electronic'],
    'results.method.simulation.dft.xc_functional_type': ['GGA'],
    'results.material.n_elements': {'gte': 2, 'lte': 3}
 }

 # the required parameters are the same as for the reference
 GaAs_alloy_query = ArchiveQuery(query =  search_query_parameters,
                                 required = reference_query_required_sections,
                                 page_size=1000, results_max=10000)


 GaAs_alloy = GaAs_alloy_query.download()
 ```

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 Next, we remove entries without DOS fingerprint data.
 </span>

 %% Cell type:code id: tags:

 ``` python
 # removing entries with empty fingerprints
 index = 0
 while index < len(GaAs_alloy):
-    if GaAs_alloy[index].run[0].calculation[-1].dos_electronic[0].fingerprint.bins == '':
-        del GaAs_alloy[index]
-    else:
+    try:
+        assert GaAs_alloy[index].run[0].calculation[-1].dos_electronic[0].fingerprint.bins != ''
        index += 1
+    except (IndexError, AssertionError):
+        del GaAs_alloy[index]
 ```
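
The loop above deletes items from the list while stepping through it and relies on `assert` for control flow, which is skipped when Python runs with `-O`. As a possible alternative, the same filter can be sketched as a list comprehension. The helper `has_fingerprint` is hypothetical and not part of `nomad-lab`, and the stand-in entries built with `SimpleNamespace` only mimic the attribute path used in the notebook. Catching `AttributeError` in addition to `IndexError` is an assumption about how missing sections might surface.

```python
from types import SimpleNamespace


def has_fingerprint(entry) -> bool:
    """Return True if the entry carries a non-empty DOS fingerprint."""
    try:
        bins = entry.run[0].calculation[-1].dos_electronic[0].fingerprint.bins
    except (IndexError, AttributeError):
        # Assumption: a missing section shows up as an empty list
        # or as a missing attribute.
        return False
    return bins != ''


def make_entry(bins=None, with_dos=True):
    """Build a stand-in object that mimics the attribute path of an Archive entry."""
    dos = [SimpleNamespace(fingerprint=SimpleNamespace(bins=bins))] if with_dos else []
    calculation = SimpleNamespace(dos_electronic=dos)
    return SimpleNamespace(run=[SimpleNamespace(calculation=[calculation])])


# One valid entry, one with an empty fingerprint, one without any DOS section.
entries = [make_entry('a1b2'), make_entry(''), make_entry(with_dos=False)]
kept = [entry for entry in entries if has_fingerprint(entry)]
```

Assuming `GaAs_alloy` behaves like a list, the notebook cell would then reduce to `GaAs_alloy = [entry for entry in GaAs_alloy if has_fingerprint(entry)]`.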

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 # The DOS fingerprint as a descriptor

 In order to quantitatively evaluate materials similarity, we encode the electronic DOS in a so-called _DOS fingerprint_. The DOS fingerprint is a two-dimensional, binary-valued representation of the electronic DOS. An in-depth description can be found in Ref. [1].

 To make use of the fingerprint, the data stored in the NOMAD Archive must be loaded into `DOSFingerprint` objects. Therefore, we scan through the Archive contents that we downloaded previously and extract all data related to the fingerprint, as well as identifiers for presenting the results. To do so in a systematic manner, we define functions that each collect one piece of information from an Archive entry. An example of such a function, `formula`, is given below. These functions are passed in a list, `extract_properties`, to the function `get_fingerprint_data`, which extracts the relevant data from an entry returned by `ArchiveQuery`. Each extracted value is saved using the name of its function as the key.

 For convenience, the extracted data are collected in a dictionary, which will allow us to efficiently search the results.
 </span>

 %% Cell type:code id: tags:

 ``` python
 from nomad.datamodel.datamodel import EntryArchive
 from dos_similarity_search.extract_data import *
 from dos_similarity_search.tools import *

 def formula(db_entry: EntryArchive) -> str:
    '''
    Retrieve the chemical formula.
    '''
    return db_entry.results.material.chemical_formula_reduced

 # Extract data and apply filters
 extract_properties = [calc_id, upload_id, url_endpoint, formula, material_id, space_group_number, dos_fingerprint]

 materials_data = {}
 for calculation in GaAs_alloy:
    materials_data[calc_id(calculation)] = get_fingerprint_data(calculation, extract_properties)

 reference = {calc_id(reference_GaAs): get_fingerprint_data(reference_GaAs, extract_properties)}
 ```
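
The helper `get_fingerprint_data` comes from the notebook's own `dos_similarity_search` package, and its source is not shown here. As a rough sketch of the idea described above, where the name of each extractor function becomes the key for its result, one could write something like the following. The function `collect_properties`, the two extractors as defined here, and the stand-in entry with its values are illustrative assumptions, not the package's actual implementation.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def collect_properties(entry: Any, extractors: List[Callable[[Any], Any]]) -> Dict[str, Any]:
    """Apply each extractor to the entry and key the result by the function name."""
    return {extractor.__name__: extractor(entry) for extractor in extractors}


def formula(entry: Any) -> str:
    return entry.results.material.chemical_formula_reduced


def space_group_number(entry: Any) -> int:
    return entry.results.material.symmetry.space_group_number


# Stand-in for a downloaded Archive entry (illustrative values only).
toy_entry = SimpleNamespace(
    results=SimpleNamespace(
        material=SimpleNamespace(
            chemical_formula_reduced='AsGa',
            symmetry=SimpleNamespace(space_group_number=216),
        )
    )
)

properties = collect_properties(toy_entry, [formula, space_group_number])
```

Keying by `__name__` means that adding a new property only requires writing one more small function and appending it to the list.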

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 The Archive API returns all calculations that match the query. Therefore, **for a single material, multiple calculations (e.g. from different authors) are downloaded.**

 To simplify the analysis presented here, **we select a representative calculation for each material**. To do so, we define a function called `select_representative`. The cell below shows an example of this function that takes the first encountered calculation of a material as the representative. However, different approaches can be used, e.g., based on computational parameters employed in the DFT calculations.
 </span>

 %% Cell type:code id: tags:

 ``` python
 def select_representative(materials_data: dict) -> dict:
    '''
    Example of a `select_representative` function.
    Returns the first calculation of a material it finds.

    Inputs:
        materials_data: dict, mapping calculation IDs to the property dictionaries returned by `get_fingerprint_data`
    '''
    material_ids = []
    output = copy.deepcopy(materials_data)

    for calc_id, properties in materials_data.items():
        material_id = properties['material_id']
        if material_id in material_ids:
            del output[calc_id]
        else:
            material_ids.append(material_id)

    return output
 ```
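
An equivalent single-pass variant can be sketched under the assumption that `materials_data` is a dictionary mapping calculation IDs to property dictionaries with a `material_id` key, as the loop above suggests. It tracks the materials already seen in a set and builds the output directly instead of deleting from a deep copy. The function name `select_first_per_material` and the toy data are made up for illustration.

```python
def select_first_per_material(materials_data: dict) -> dict:
    """Keep only the first calculation encountered for each material_id."""
    seen = set()
    output = {}
    for calc_id, properties in materials_data.items():
        material_id = properties['material_id']
        if material_id not in seen:
            seen.add(material_id)
            output[calc_id] = properties
    return output


# Two calculations of the same material and one of a different material.
toy_data = {
    'calc_a': {'material_id': 'mat_1', 'formula': 'AsGa'},
    'calc_b': {'material_id': 'mat_1', 'formula': 'AsGa'},
    'calc_c': {'material_id': 'mat_2', 'formula': 'AlAsGa'},
}
representatives = select_first_per_material(toy_data)
```

Unlike the version with `copy.deepcopy`, this sketch shares the property dictionaries with its input, so later in-place changes are visible in both.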

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 # Calculation of similarity coefficients

 Now we compute the similarity between two DOS spectra.

 A DOS fingerprint represents the electronic DOS as a binary vector [1]. In order to compute the similarity of two fingerprints we use the **Tanimoto coefficient** [2]. The Tanimoto coefficient, $T_c$, between two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as:


 $$\begin{eqnarray}
 T_c(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}||^2 + ||\mathbf{b}||^2 - \mathbf{a} \cdot \mathbf{b}}.
 \end{eqnarray}$$


 It is restricted to values $T_c \in [0,1]$. 1 means that the DOS of two materials are identical, 0 means no overlap at all. **The Tanimoto coefficient can be interpreted as the ratio between the number of shared features and the total number of features of two fingerprints.** For dichotomous vectors, the complement of the Tanimoto coefficient ($1 - T_c$), also known as Jaccard distance, is a metric. The Tanimoto coefficient is implemented as the function `tanimoto_similarity` in the `nomad_dos_fingerprints` package.

 The arguments of the function `tanimoto_similarity` are two `DOSFingerprint` objects. Using this, the similarity between the reference material and one of the candidate materials can be calculated, as shown in the following example:
 </span>

 %% Cell type:code id: tags:

 ``` python
 from nomad_dos_fingerprints import tanimoto_similarity

 reference_values = list(reference.values())[0]
 candidate_values = list(materials_data.values())[0]

 print(f"Similarity between {reference_values['formula']} and {candidate_values['formula']}:\n")
 print(f"Tc = {tanimoto_similarity(reference_values['dos_fingerprint'], candidate_values['dos_fingerprint'])}")
 ```
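
To make the definition of $T_c$ concrete, here is a minimal pure-Python sketch that evaluates the formula on plain binary lists. It is not the `tanimoto_similarity` function of `nomad_dos_fingerprints`, which operates on `DOSFingerprint` objects; the function name `tanimoto` and the example vectors are assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length binary vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denominator = norm_a + norm_b - dot
    # Two all-zero vectors share no features; return 0.0 by convention here.
    return dot / denominator if denominator else 0.0


a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]

tc = tanimoto(a, b)                   # 2 shared bits, 4 bits set in either vector
identical = tanimoto(a, a)            # every set bit is shared
disjoint = tanimoto([1, 0], [0, 1])   # no set bit is shared
```

For binary vectors the numerator counts the bits set in both vectors and the denominator counts the bits set in at least one of them, which is the "shared features over total features" reading given above.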

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 Now we **use the function `calculate_similarity` to calculate the similarities of the materials in `materials_data` to our reference**, `reference`. The function `calculate_similarity` returns a dictionary of the calculation, where `similarity to <reference_formula>` is the value of the Tanimoto coefficient between the reference and the current calculation.
 </span>

 %% Cell type:code id: tags:

 ``` python
 materials_data = select_representative(materials_data)
 # apply `calculate_similarity` to all entries in `materials_data`
 for properties in materials_data.values():
    calculate_similarity(properties, reference)
 # sort `materials_data` from highest similarity to lowest
 materials_data = dict(sorted(materials_data.items(), key = lambda x: x[1][f'similarity to {reference_values["formula"]}'], reverse = True))
 ```
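
The sorting step relies on dictionaries preserving insertion order (guaranteed since Python 3.7), so rebuilding the dictionary from the sorted items yields a ranking. A toy illustration, with made-up calculation IDs and similarity values:

```python
reference_formula = 'AsGa'
key_name = f'similarity to {reference_formula}'

# Made-up similarity values for illustration only.
toy_results = {
    'calc_a': {'formula': 'AlAsGa', key_name: 0.42},
    'calc_b': {'formula': 'AsGa', key_name: 0.97},
    'calc_c': {'formula': 'AsGaIn', key_name: 0.65},
}

# Sort the (calc_id, properties) pairs by similarity, highest first.
ranked = dict(sorted(toy_results.items(), key=lambda item: item[1][key_name], reverse=True))
top_calc_id = next(iter(ranked))
```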

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 Now we have computed the similarities of the `reference` to all the materials in `materials_data`.
 </span>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 # Visualizing results

 We want to look at the results of the similarity search to identify the most similar materials to the reference. For an overview, we first visualize the found similarity coefficients in a histogram.
 </span>

 %% Cell type:code id: tags:

 ``` python
 reference_formula = str(reference_values['formula'])

 plt.figure(figsize = (13,5))
 formulas = [value['formula'] for key, value in materials_data.items()]
 similarities = [value['similarity to ' +str(reference_formula)] for key,value in materials_data.items()]
 plt.hist(similarities, bins = 20, range = [0,1], label = f'Reference: {reference_formula}')
 plt.xticks(np.arange(0, 1.1, 0.1))
 plt.xlabel('Tc')
 plt.xlim(0,1)
 plt.ylabel('Counts')
 plt.yscale('log')
 #plt.title('Frequency of similarity coefficients')
 plt.legend(fontsize = 20)
 plt.show()
 ```

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 Note the logarithmic scale of the histogram. The vast majority of materials have a low similarity to our reference. The most similar materials, with exceptionally high similarity scores, are found on the right side of the histogram ($T_c > 0.7$).

 We construct a ranking table which shows the similarity of materials to our reference from the most similar to the least similar.
 </span>

 %% Cell type:code id: tags:

 ``` python
 for key, value in materials_data.items():
    value["formula (link)"] = (value["formula"], value["url_endpoint"])

 def make_clickable(x):
    return f'<a target="{x[1]}" href="https://nomad-lab.eu/prod/v1/gui/entry/id/{x[1]}">{x[0]}</a>'

 ranking_table = pd.DataFrame([value for key, value in materials_data.items()])
 ranking_table[['formula (link)', 'space_group_number', f'similarity to {reference_formula}']].style.format({'formula (link)' : make_clickable})
 ```

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 By clicking the link, you will land on the Archive page of the respective calculation, where you can find more information about your material.

 Now we plot the DOS of the materials most similar to our reference. The variable `ranks_to_download` lists the ranks, taken from the table above, of the materials whose DOS we want to plot. To avoid unnecessary downloads, we first check whether the spectrum is already stored in `materials_data` under the key `dos`. If it is not, we download it.
 </span>

 %% Cell type:code id: tags:

 ``` python
 from dos_similarity_search.tools import download_DOS, DOS_downloaded

 ranks_to_download = list(range(3))

 #get calc_id from table
 calc_ids = []
 for rank in ranks_to_download:
    calc_ids.append(ranking_table.iloc[rank]['calc_id'])

 # check if already downloaded
 calc_ids_to_download = [calculation_id for calculation_id in calc_ids if not DOS_downloaded(materials_data, calculation_id)]

 #download DOS spectrum
 dos_spectra = download_DOS(calc_ids_to_download)
 for calculation_id, dos_spectrum in dos_spectra.items():
    materials_data[calculation_id]['dos'] = dos_spectrum

 if not DOS_downloaded(reference, reference_calc_id):
    reference_dos_spectrum = download_DOS(reference_calc_id)
    for calculation_id, dos_spectrum in reference_dos_spectrum.items():
        reference[calculation_id]['dos'] = dos_spectrum
 ```
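
The download-only-what-is-missing pattern can be sketched independently of the NOMAD API. Here `fetch_dos` is a hypothetical stand-in for a network call and `ensure_dos` is a made-up helper; neither is `download_DOS` or `DOS_downloaded`, whose behavior is defined in the notebook's `dos_similarity_search.tools` module.

```python
def fetch_dos(calc_ids):
    """Hypothetical stand-in for a download; returns a dummy spectrum per ID."""
    return {calc_id: {'energies': [], 'values': []} for calc_id in calc_ids}


def ensure_dos(data: dict, calc_ids: list) -> list:
    """Fetch spectra only for entries lacking a 'dos' key; return the fetched IDs."""
    missing = [calc_id for calc_id in calc_ids if 'dos' not in data[calc_id]]
    for calc_id, spectrum in fetch_dos(missing).items():
        data[calc_id]['dos'] = spectrum
    return missing


# 'calc_a' already has a spectrum, 'calc_b' does not.
toy_data = {'calc_a': {'dos': {'energies': [0.0], 'values': [1.0]}}, 'calc_b': {}}
first_pass = ensure_dos(toy_data, ['calc_a', 'calc_b'])
second_pass = ensure_dos(toy_data, ['calc_a', 'calc_b'])
```

Running the helper twice shows the intent: the second call finds nothing left to fetch.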

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>
 And finally we plot the spectra.
 </span>

 %% Cell type:code id: tags:

 ``` python
 plt.figure(figsize = (13,5))
 dos_energies_reference = reference[reference_calc_id]['dos']['energies']
 dos_values_reference = reference[reference_calc_id]['dos']['values']
 chem_formula_reference = reference[reference_calc_id]['formula']

 plt.plot(dos_energies_reference.magnitude, dos_values_reference.magnitude, label = 'reference', c = 'r')
 plt.fill_between(dos_energies_reference.magnitude, dos_values_reference.magnitude, color = 'r', alpha = 0.1)

 for calculation_id in calc_ids:
    dos_energies = materials_data[calculation_id]['dos']['energies']
    dos_values = materials_data[calculation_id]['dos']['values']
    chem_formula = materials_data[calculation_id]['formula']
    Tc = materials_data[calculation_id][f'similarity to {reference_formula}']

    plt.plot(dos_energies.magnitude, dos_values.magnitude, label = f"{chem_formula}, Tc = {Tc: .2f}")
    plt.fill_between(dos_energies.magnitude, dos_values.magnitude, alpha = 0.1)

 # axis labels only need to be set once, outside the loop
 plt.ylabel(r'DOS [$\frac{1}{eV}$]')
 plt.xlabel(r'Energy [$eV$]')
 plt.xlim(-10, 5)
 plt.ylim(0,2)
 plt.legend(prop={'size': 15})
 plt.show()
 ```

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 # References

 [1] M. Kuban, S. Rigamonti, M. Scheidgen, and C. Draxl: [Density-of-states similarity descriptor for unsupervised learning from materials data](https://arxiv.org/abs/2201.02187)

 [2] P. Willett, J. M. Barnard, and G. M. Downs: [Chemical Similarity Searching](https://pubs.acs.org/doi/abs/10.1021/ci9800211), $\textit{J. Chem. Inf. Comput. Sci.}$, $\textbf{38}$, 983, (1998)
 </span>

 %% Cell type:markdown id: tags:

 <span style='font-family:sans-serif'>

 # Acknowledgements

 We thank Luca Ghiringhelli for help in preparing this notebook. This work received partial funding from the European Union’s Horizon 2020 research and innovation program under the grant agreement Nº 951786 (NOMAD CoE), from the NFDI consortium FAIRmat, and from the German Research Foundation (DFG) through the CRC 1404 (FONDA).
 </span>