%% Cell type:markdown id: tags:
***
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">ARISE - Robust recognition and exploratory analysis of crystal structures via Bayesian-deep-learning</font></b>
</div>
<p>
created by:
Andreas Leitherer<sup>1</sup>,
Angelo Ziletti<sup>1</sup>,
and Luca Ghiringhelli<sup>1</sup>
***
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany
<div>
<img style="float: left;" src="assets/ARISE/Logo_MPG.png" width="200">
<img style="float: right;" src=https://nomad-coe.eu/uploads/nomad/images/NOMAD_Logo2.png width="250">
</div>
</div>
***
%% Cell type:markdown id: tags:
In this tutorial, we introduce the computational framework ARISE (<ins>Ar</ins>tificial <ins>I</ins>ntelligence based <ins>S</ins>tructure <ins>E</ins>valuation) for crystal-structure recognition in single- and polycrystalline systems [1]. ARISE can treat more than 100 crystal structures, including one-dimensional, two-dimensional, and bulk materials, in a robust and threshold-independent fashion. The Bayesian-deep-learning model yields not only a classification but also uncertainty estimates, which are principled (i.e., they approximate the uncertainty estimates of a Gaussian process) [2,3]. These uncertainty estimates correlate with crystal order.
![](./assets/ARISE/ARISE_logo.svg)
For additional details, please refer to
[1] A. Leitherer, A. Ziletti, and L. M. Ghiringhelli, arXiv:2103.09777, (2021)
ARISE is part of the code framework *ai4materials* (https://github.com/angeloziletti/ai4materials).
The outline of this tutorial is as follows:
* Quickstart (jump here if you want to directly use ARISE)
* Single-crystal classification
* Polycrystal classification
* Unsupervised learning / exploratory analysis of the trained model
%% Cell type:markdown id: tags:
## Import packages
%% Cell type:code id: tags:
``` python
import ase
from ase.io import read, write
from ase.visualize import view
import ipywidgets as widgets
from ipywidgets import interactive
from timeit import default_timer as timer
from collections import defaultdict
from ai4materials.utils.utils_crystals import get_nn_distance, scale_structure
from ai4materials.descriptors.quippy_soap_descriptor import quippy_SOAP_descriptor
from ai4materials.utils.utils_config import set_configs, setup_logger
from ai4materials.models.cnn_polycrystals import predict_with_uncertainty
from ai4materials.models import ARISE
import quippy
from quippy import descriptors
import matplotlib
import matplotlib.pyplot as plt
import os
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # Suppress TF warnings
from keras.models import load_model
import json
import pickle
# filepaths
prototype_path = './data/ARISE/single_crystals/PROTOTYPES' # prototypes used in [1]
bcc_prototype_path = os.path.join(prototype_path,
'Elemental_solids',
'bcc_W_A_cI2_229_a',
'bcc_W_A_cI2_229_a.in' )
fcc_prototype_path = os.path.join(prototype_path,
'Elemental_solids',
'fcc_Cu_A_cF4_225_a',
'fcc_Cu_A_cF4_225_a.in' )
NaCl_prototype_path = os.path.join(prototype_path,
'Binaries',
'RockSalt_NaCl',
'RockSalt_NaCl.in' )
nn_model_path = './data/ARISE/nn_model/AI_SYM_Leitherer_et_al_2021.h5'
nn_model_class_info_path = './data/ARISE/nn_model'
calculations_path = './data/ARISE/calculations'
# saved desc folder for desc calculation
desc_folder_4grains = './data/ARISE/four_grains_desc_file/four_grains_stride_15.0_box_16.0_.tar.gz'
fcc_bcc_desc = np.load('./data/ARISE/saved_soap_descriptors/fcc_Cu_A_cF4_225_a_bcc_W_A_cI2_229_a.npy')
fcc_desc = fcc_bcc_desc[:1]
bcc_desc = fcc_bcc_desc[1:]
nacl_desc = np.load('./data/ARISE/saved_soap_descriptors/RockSalt_NaCl.npy')
version = 'py3' # or 'py2' -> defines python version of quippy
```
%% Cell type:markdown id: tags:
## Quickstart
%% Cell type:markdown id: tags:
Please specify the geometry files that you want to analyze in the list 'geometry_files'.
ARISE can be used in two modes: local and global (controlled via the keyword 'mode'). If mode = 'local', the strided pattern matching (SPM) framework introduced in [1] is employed, requiring the definition of stride and box size (standard setting: $4 \overset{\circ}{\mathrm {A}}$ stride in all directions and $16 \overset{\circ}{\mathrm {A}}$ box size).
%% Cell type:markdown id: tags:
```python
from ai4materials.models import ARISE
geometry_files = [ ... ]
predictions, uncertainty = ARISE.analyze(geometry_files, mode='global')
predictions, uncertainty = ARISE.analyze(geometry_files, mode='local')
```
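%% Cell type:markdown id: tags:
For the local mode, stride and box size can also be passed explicitly. A minimal sketch (the keywords reappear in the polycrystal section below; the values correspond to the standard settings mentioned above):
```python
# local (SPM) analysis with explicit stride and box size; one list entry per geometry file
predictions, uncertainty = ARISE.analyze(geometry_files, mode='local',
                                         stride=[[4.0, 4.0, 4.0]], box_size=[16.0])
```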
%% Cell type:markdown id: tags:
Typically, one uses the global mode for single crystals and the local mode for large (polycrystalline) samples (to zoom into a given structure and detect structural defects such as grain boundaries). However, one can also mix these modes and investigate, for instance, the global assignments for polycrystals (see, for instance, the analysis of STEM graphene images with grain boundaries in Fig. 3 of Ref. [1]).
The following sections provide more details on the internal workings of ARISE and SPM.
%% Cell type:markdown id: tags:
## The Bayesian-deep-learning model
%% Cell type:markdown id: tags:
Given an unknown atomic structure, the goal of crystal-structure recognition - in general terms - is to find the most similar prototype (e.g., among those currently known to occur in nature). In the figure below, this list includes fcc, bcc, diamond, and hcp symmetry, but as we will see later, our framework can treat a much broader range of materials.
![](./assets/ARISE/cs_classification_first_example.svg)
The initial representation of a given atomic structure (single- or polycrystalline) consists of atomic positions and chemical species symbols (as well as lattice vectors for periodic systems).
The first step is to define how this input is treated in the ARISE framework, i.e., how we arrive at a meaningful prediction.
%% Cell type:markdown id: tags:
### Classification of mono-species single crystals
%% Cell type:markdown id: tags:
As an example, consider the body-centered cubic (bcc) system (see http://www.aflowlib.org/prototype-encyclopedia/A_cI2_229_a.html and also [4]); the prediction pipeline for single crystals can be sketched as follows:
![](./assets/ARISE/concept_figure_global.svg)
%% Cell type:markdown id: tags:
Each of these steps will be explained in the following.
Before that, we briefly discuss the logging feature of *ai4materials* (based on https://docs.python.org/3/howto/logging.html). We define a configuration object that introduces a specific folder structure; three folders are created automatically:
* 'desc_folder': the descriptors that we calculate later are saved here (as file_name.tar.gz)
* 'tmp_folder': temporary folder in which individually calculated files are saved before being merged into .tar.gz files and moved to 'desc_folder'
* 'results': folder where the results are saved (e.g., the neural network predictions).
In practice, you only need to define the main folder in which the calculations should be saved. Here we choose 'calculations_path', defined at the beginning of this tutorial:
%% Cell type:code id: tags:
``` python
# set config file
main_folder = calculations_path
configs = set_configs(main_folder=main_folder)
logger = setup_logger(configs, level='INFO', display_configs=False)
logger.setLevel('INFO')
```
%% Cell type:markdown id: tags:
We continue with the explanation of the prediction pipeline.
We first load the geometry file using ASE (https://wiki.fysik.dtu.dk/ase/index.html). We typically use the FHI-aims format for geometry files (while ASE's geometry file parser provides compatibility with many other formats).
%% Cell type:code id: tags:
``` python
structure = read(bcc_prototype_path, ':', 'aims')[0]
view(structure, viewer='ngl')
```
%% Cell type:markdown id: tags:
To avoid dependence on the lattice parameters, we isotropically scale a given structure using the function 'get_nn_distance' from the *ai4materials* package. By default, we compute the radial distribution function (as approximated by the histogram of nearest-neighbor distances) and then choose the center of the maximally populated bin (i.e., the mode) as the nearest-neighbor distance:
%% Cell type:code id: tags:
``` python
scale_factor = get_nn_distance(structure)
print('Scale factor for bcc structure = {}'.format(scale_factor))
```
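%% Cell type:markdown id: tags:
To make the mode-of-histogram idea concrete, the following minimal NumPy sketch estimates the nearest-neighbor distance (an illustration only, not the actual ai4materials implementation, which, e.g., works with several cutoffs as shown below):
%% Cell type:code id: tags:
``` python
# Illustrative sketch: nearest-neighbor distance as the mode of the NN-distance histogram
distances = structure.get_all_distances(mic=True)  # pairwise distances (minimum image convention)
np.fill_diagonal(distances, np.inf)                # exclude self-distances
nn_distances = distances.min(axis=1)               # nearest-neighbor distance per atom
counts, bin_edges = np.histogram(nn_distances, bins=100)
i_max = counts.argmax()
print('Center of maximally populated bin = {:.3f}'.format(0.5 * (bin_edges[i_max] + bin_edges[i_max + 1])))
```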
%% Cell type:markdown id: tags:
Given an atomic structure, scaling of atomic positions and unit cell is summarized in the function 'scale_structure':
%% Cell type:code id: tags:
``` python
scaled_structure = scale_structure(structure, scaling_type='quantile_nn',
atoms_scaling_cutoffs=[10., 20., 30., 40.])
```
%% Cell type:markdown id: tags:
Using atomic positions $\mathbf{r}_i$ labeled by chemical species $c_i$ ($i = 1, ..., \text{N}_{\text{atoms}}$) directly as input to a classification model introduces several issues. Most importantly, physically meaningful invariances are not respected (translational and rotational invariance as well as invariance under permutations of identical atoms). Also, the input dimension would depend on the number of atoms (although this could in principle be addressed with convolutional neural networks).
A well-known and well-tested materials-science descriptor that is by definition invariant to the above-mentioned symmetries is the so-called smooth-overlap-of-atomic-positions (SOAP) descriptor [5,6,8].
In this tutorial (and in [1]) we employ the SOAP implementation made available via the quippy package (https://github.com/libAtoms/QUIP) for non-commercial use [7].
In short, given an atomic structure with N atoms, we obtain N SOAP vectors (respecting the mentioned invariances) that represent the local atomic environments of the atoms. The local atomic environment of an atom is defined as a sphere (centered at that particular atom) with a certain (tunable) cutoff radius $R_C$. Each atom within that sphere is represented by a Gaussian function (with a certain, tunable width $\sigma$), and the sum of these Gaussians defines the local density:
$$ \rho_\mathscr{X}(\vec{r})=\sum_{i\in \mathscr{X}} \exp{\left(-\frac{(\vec{r}-\vec{r}_i)^2}{2\sigma^2}\right)}$$
The local density is expanded in terms of radial basis functions and spherical harmonics, giving a set of basis set expansion coefficients $c_{blm}$:
$$ \rho_\mathscr{X}(\vec{r})=\sum_{i\in \mathscr{X}} \exp{\left(-\frac{(\vec{r}-\vec{r}_i)^2}{2\sigma^2}\right)} =\sum_{blm}c_{blm}u_b(r) Y_{lm}(\hat{\vec{r}})$$
A rotational average of these coefficients yields the SOAP representation of the local atomic environment:
$$ p(\mathscr{X})_{b_1b_2l}=\pi\sqrt{\frac{8}{2l+1}}\sum_{m}(c_{b_1 lm})^{\dagger} c_{b_2 lm}.$$
Applying this procedure to every atom gives a collection of SOAP vectors, which one may average. Note that the default behavior in quippy (employed in [1]) is to average the expansion coefficients first and then perform the rotational average.
The output of quippy is adapted using a typical averaging procedure (additional details, such as the treatment of "cross-correlation terms", are discussed in the next section and in the supplementary material of [1]).
First, we have to define some parameters of the SOAP descriptor; these are summarized in the string descriptor_options, which is used to create the SOAP descriptor object:
%% Cell type:code id: tags:
``` python
cutoff = 4.0
central_weight = 0.0 # relative weight of central atom in atom density
atom_sigma = 0.1
p_b_c = scaled_structure.get_pbc()
l_max = 6
n_max = 9
n_Z = 1
n_species = 1
species_Z = structure.get_atomic_numbers()[0]
Z = structure.get_atomic_numbers()[0]
average = True
descriptor_options = 'soap '+'cutoff='+str(cutoff)+' l_max='+str(l_max)+' n_max='+str(n_max) + \
' atom_sigma='+str(atom_sigma)+ ' n_Z='+str(n_Z) + \
' Z={'+str(Z)+'} n_species='+str(n_species)+' species_Z={'+str(species_Z) + \
'} central_weight='+str(central_weight)+' average='+str(average)
desc = descriptors.Descriptor(descriptor_options)
if version == 'py2':
from quippy import Atoms as quippy_Atoms
struct = quippy_Atoms(scaled_structure)
elif version == 'py3':
struct = scaled_structure
struct.set_pbc(p_b_c)
```
%% Cell type:markdown id: tags:
We can calculate the SOAP descriptor and plot its components:
%% Cell type:code id: tags:
``` python
#Compute SOAP descriptor
if version == 'py2':
struct.set_cutoff(desc.cutoff())
struct.calc_connect()
SOAP_descriptor = desc.calc(struct)['descriptor']
elif version == 'py3':
SOAP_descriptor = desc.calc(struct)['data']
# visualize descriptor
def vis_desc(SOAP_descriptor):
fig, ax = plt.subplots(1,2, figsize=(15,2))
ax[0].plot(SOAP_descriptor[0])
ax[0].set_xlabel('SOAP component #')
ax[0].set_ylabel('SOAP component value (normalized)')
m = ax[1].imshow(SOAP_descriptor, aspect=150, cmap="viridis", extent=[0, 315, 0, 1])
ax[1].set_yticks([])
ax[1].set_xticks([])
fig.colorbar(m, ax=ax)
plt.show()
vis_desc(SOAP_descriptor)
```
%% Cell type:markdown id: tags:
Scaling, descriptor definition and calculation are summarized in the following cell (using *ai4materials*):
%% Cell type:code id: tags:
``` python
# Define the descriptor
descriptor = quippy_SOAP_descriptor()
soap_desc = descriptor.calculate(structure).info['descriptor']['SOAP_descriptor']
# visualize descriptor
vis_desc(soap_desc)
```
%% Cell type:markdown id: tags:
The object quippy_SOAP_descriptor above has standard settings that allow reproducing the results of [1], but in principle one may choose different values - in particular for the SOAP parameters:
%% Cell type:markdown id: tags:
```python
descriptor = quippy_SOAP_descriptor(configs=configs,
cutoff=cutoff, l_max=l_max,
n_max=n_max, atom_sigma=atom_sigma)
```
%% Cell type:markdown id: tags:
Given the descriptor, the next step in the pipeline is to apply a classification model. We use the Bayesian neural network introduced in [1]:
%% Cell type:code id: tags:
``` python
model = load_model(nn_model_path)
model.summary()
```
%% Cell type:markdown id: tags:
Each neuron in the final layer corresponds to a specific class. This information is saved in a json file that we load below:
%% Cell type:code id: tags:
``` python
with open(os.path.join(nn_model_class_info_path, 'class_info.json')) as file_name:
data = json.load(file_name)
class_labels = data["data"][0]['classes']
numerical_to_text_label = dict(zip(range(len(class_labels)), class_labels))
text_to_numerical_label = dict(zip(class_labels, range(len(class_labels))))
```
%% Cell type:markdown id: tags:
In [1], ARISE employs a Bayesian neural network as a classification model, which not only yields classification probabilities but also allows quantifying the predictive uncertainty.
Specifically, we employ so-called Monte Carlo (MC) dropout, introduced in [2,3], which builds on the stochastic regularization technique dropout. In dropout, individual neurons are randomly dropped in each layer, usually only during training, to avoid over-specialization of individual units and thus control overfitting. One can show that when dropout is also employed at test time, the resulting probabilistic model provides uncertainty estimates that approximate those of a Gaussian process.
In practice, for a given input, the model output is computed for several iterations (100-1000 typically suffice), yielding a collection of differing probability vectors - whereas in a standard, deterministic neural network, all predictions would be identical. Sampling the output layer of the Bayesian neural network corresponds to sampling from (an approximated version of) the true probability distribution of outputs. Averaging these forward passes gives an average classification probability vector, and the predicted class label can be inferred by selecting the most likely class (i.e., computing the argmax). Additional statistical information on the predictive uncertainty is contained in the classification-probability samples. In [1], we employ the mutual information (as implemented in *ai4materials*) to obtain a single uncertainty-quantifying number from the output samples.
For the bcc structure from above, prediction and uncertainty quantification are performed in the following cell:
%% Cell type:code id: tags:
``` python
# reshape data array such that it fits model input shape
input_shape_from_model = model.layers[0].get_input_at(0).get_shape().as_list()[1:]
target_shape = tuple([-1] + input_shape_from_model)
data = np.reshape(soap_desc, target_shape)
# Obtain classification probabilities (numpy array 'prediction')
# and uncertainty quantifiers (dictionary of numpy arrays, with three keys
# 'mutual_information', 'predictive_entropy', and 'variation_ratio' corresponding to
# three uncertainty quantifiers - see also L. Smith & Y. Gal, arXiv:1803.08533v1, 2018 for further
# reading).
prediction, uncertainty = predict_with_uncertainty(data,
model=model,
model_type='classification',
n_iter=1000)
```
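%% Cell type:markdown id: tags:
To make the uncertainty quantification concrete, the following sketch shows how the mutual information can be estimated from a set of sampled probability vectors (a minimal illustration of the quantity returned by predict_with_uncertainty; the samples below are hypothetical):
%% Cell type:code id: tags:
``` python
# Sketch: mutual information from T stochastic forward passes
def mutual_information_sketch(prob_samples, eps=1e-12):
    # prob_samples: array of shape (T, n_classes), one probability vector per forward pass
    mean_probs = prob_samples.mean(axis=0)
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps))
    expected_entropy = -np.mean(np.sum(prob_samples * np.log(prob_samples + eps), axis=1))
    return predictive_entropy - expected_entropy  # large when the samples disagree

# fully consistent (hypothetical) samples yield (near-)zero mutual information
consistent_samples = np.tile([0.9, 0.1], (1000, 1))
print('Mutual information for consistent samples = {:.4f}'.format(mutual_information_sketch(consistent_samples)))
```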
%% Cell type:markdown id: tags:
The array 'prediction' gives the classification probabilities, which are clearly peaked at the entry corresponding to bcc:
%% Cell type:code id: tags:
``` python
argmax_prediction = prediction.argmax(axis=-1)[0]
predicted_label = numerical_to_text_label[argmax_prediction].split('_')[:2]
predicted_label = '_'.join(predicted_label)
print('Maximal classification probability with value {:.4f} for prototype {}'.format(prediction[0][argmax_prediction],
predicted_label))
plt.plot(prediction[0])
plt.xlabel('Class (integer encoding)')
plt.ylabel('Probability')
plt.show()
```
%% Cell type:markdown id: tags:
The dictionary "uncertainty" contains quantifiers of uncertainty, in particular the mutual information employed in [1]:
%% Cell type:code id: tags:
``` python
print('Mutual information = {:.4f}'.format(uncertainty["mutual_information"][0]))
```
%% Cell type:markdown id: tags:
***In summary***: all of these steps - isotropic scaling, descriptor calculation, neural network predictions (classification probabilities + uncertainty) - are combined in the following function to provide quick and easy usage. In particular, you may simply pass a list of geometry files; here we choose the fcc (Cu) and bcc (W) prototypes from AFLOW:
%% Cell type:code id: tags:
``` python
from ai4materials.models import ARISE
geometry_files = [fcc_prototype_path, bcc_prototype_path]
predictions, uncertainty = ARISE.analyze(geometry_files, mode='global')
```
%% Cell type:markdown id: tags:
To analyze the predictions, in particular the ranking by classification probability and the uncertainty quantifiers, we can use the function 'analyze_predictions':
%% Cell type:code id: tags:
``` python
ARISE.analyze_predictions(geometry_files, predictions, uncertainty)
```
%% Cell type:markdown id: tags:
For the fcc prediction, we see that non-zero probability is also assigned to a tetragonal prototype (identifier 'bctIn_In_A_tI2_139_a', where the AFLOW identifier is specified in this case as well; see [4] for more details). This assignment is justified since this prototype is a slightly distorted (tetragonal) version of fcc (http://aflowlib.org/prototype-encyclopedia/A_tI2_139_a.In.html). We also see that the uncertainty is non-zero, indicating that there is more than one candidate for the most similar prototype.
Interestingly, even the low-probability candidates are thus still meaningful. Further analysis of this classification-probability-induced ranking is provided in [1], specifically Fig. 3 (for experimental scanning transmission electron microscopy data of graphene) and Supplementary Note 2.3 (for out-of-sample single-crystal structures from AFLOW and NOMAD). This analysis becomes particularly important if you want to investigate atomic structures that are not included in the training set. In the mentioned figures of [1], we provide guidelines on how one can still use ARISE and interpret its predictions. In the next section, we will first extend to multi-species systems and then provide the code for these out-of-sample studies, where you may also try different structures (upload the geometry files to the main calculation folder 'calculations_path' defined at the beginning of this tutorial).
%% Cell type:markdown id: tags:
### Classification of multi-species single crystals
%% Cell type:markdown id: tags:
A great advantage of our method is that it can treat many more classes than available methods.
So far, we have only considered mono-species systems. We now consider the rock-salt structure (NaCl) as an example for the extension of ARISE to multi-component systems.
The usual way of treating multiple chemical species is to calculate [8]
$$p(\mathscr{X})_{b_1b_2l}^{\alpha\beta}=\pi\sqrt{\frac{8}{2l+1}}\sum_{m}(c_{b_1 lm}^{\alpha})^{\dagger} c_{b_2 lm}^\beta,$$
where the coefficients $c_{b_1 lm}^{\alpha}, c_{b_2 lm}^{\beta}$ are obtained from separate densities for the species $\alpha,\beta$. Then one may simply append all of these components, or average over certain combinations. We take the latter route, where it turns out that it suffices (for our purposes) to select only the coefficients for $Z_1 = Z_2$ and ignore the cross terms for $Z_1 \neq Z_2$. Then, we consider all substructures, i.e., we compute SOAP according to
$$ p(\mathscr{X})_{b_1b_2l}=\pi\sqrt{\frac{8}{2l+1}}\sum_{m}(c_{b_1 lm})^{\dagger} c_{b_2 lm}, $$
where we sit on all Na atoms and consider only Na or only Cl atoms, and sit on all Cl atoms and consider only Cl or only Na atoms. This gives us four vectors, which we average. Generalization to structures with an arbitrary number of chemical species is straightforward - and so we end up with a descriptor that is independent of (1) the number of atoms and (2) the number of chemical species, and that (3) incorporates all relevant symmetries (rotational invariance, translational invariance, and invariance with respect to permutations of identical atoms).
Coming back to our implementation, an important keyword is 'scale_element_sensitive', which is by default set to True - this way, we isotropically scale each substructure separately (one may choose not to do this, but the model we provide here is trained for scale_element_sensitive=True):
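%% Cell type:markdown id: tags:
The substructure averaging described above can be sketched with the quippy options introduced earlier (an illustration under simplifying assumptions - the actual quippy_SOAP_descriptor implementation additionally handles, e.g., the element-sensitive scaling):
%% Cell type:code id: tags:
``` python
# Sketch: average SOAP over all (center species, environment species) substructure pairs
def multi_species_soap_sketch(structure, cutoff=4.0, l_max=6, n_max=9,
                              atom_sigma=0.1, central_weight=0.0):
    species = sorted(set(structure.get_atomic_numbers()))
    soap_vectors = []
    for Z_center in species:      # sit on all atoms of one species ...
        for Z_env in species:     # ... and consider only atoms of one species
            options = 'soap cutoff=' + str(cutoff) + ' l_max=' + str(l_max) + \
                      ' n_max=' + str(n_max) + ' atom_sigma=' + str(atom_sigma) + \
                      ' n_Z=1 Z={' + str(Z_center) + '} n_species=1 species_Z={' + \
                      str(Z_env) + '} central_weight=' + str(central_weight) + ' average=True'
            desc = descriptors.Descriptor(options)
            soap_vectors.append(desc.calc(structure)['data'])
    # e.g., four vectors for NaCl (Na|Na, Na|Cl, Cl|Cl, Cl|Na), which are averaged
    return np.mean(soap_vectors, axis=0)
```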
%% Cell type:code id: tags:
``` python
geometry_files = [NaCl_prototype_path]
structure = read(geometry_files[0], ':', 'aims')[0] # load the NaCl prototype before viewing
view(structure, viewer='ngl')
```
%% Cell type:code id: tags:
``` python
predictions, uncertainty = ARISE.analyze(geometry_files, mode='global', save_descriptors=True)
```
%% Cell type:code id: tags:
``` python
ARISE.analyze_predictions(geometry_files, predictions, uncertainty)
```
%% Cell type:markdown id: tags:
One can see that NaCl is correctly predicted with low mutual information.
%% Cell type:markdown id: tags:
You may try a different prototype and analyze its predictions in the following cells. You can choose one of the prototypes of [1] (see also Supplementary Tables 4-6 for a complete list), or other examples from AFLOW (including those explored in Supplementary Figure 2).
First, we define some functions for loading (no need to understand this in detail, so you may skip it for now):
%% Cell type:code id: tags:
``` python
# Load all prototypes
from ai4materials.models.spm_prediction_tools import load_all_prototypes
#########################################################
# PARAMETERS WHICH YOU MAY WANT TO CHANGE
#########################################################
# For the training prototypes:
periodic = True # if False, take supercell structure with at least min_atoms atoms.
min_atoms = 100
# Exception: nanotubes, which are here of fixed length. For quippy, the SOAP descriptor
# does not change significantly such that change of length does not matter (at least for
# Carbon nanotubes).
# Change this value to smooth the fluctuating predictions in case of high-uncertainty points.
# Please also have a look at the remark in the next cell.
n_iter = 1000
#########################################################
prototypes = load_all_prototypes(periodic_boundary_conditions=[False])
for prototype in prototypes:
if not prototype.info['dataset'] == 'Nanotubes':
if periodic:
prototype.set_pbc(True)
else:
prototype.set_pbc(False) # non-periodic case: build a finite supercell below
replica = 0
third_dimension = replica
while (prototype*(replica,replica,third_dimension)).get_number_of_atoms()<min_atoms:
replica += 1
third_dimension += 1
if prototype.info['dataset'] == '2D_materials':
third_dimension = 1
prototype *= (replica, replica, third_dimension) # use the (possibly reduced) third dimension, e.g., for 2D materials
training_prototypes = [_.info['crystal_structure'] for _ in prototypes]
name_to_material_type_dict = {_.info['crystal_structure'] : _.info['dataset'] for _ in prototypes}
#prototypes_geo_files = [os.path.join(prototype_path,
# prototype.info['crystal_structure'],
# prototype.info['crystal_structure'] + '.in')]
AFLOW_other_protos_path = './data/ARISE/single_crystals/exploration_examples'
AFLOW_geo_files = os.listdir(AFLOW_other_protos_path)
out_of_sample_prototypes = [_.split('.')[0] for _ in AFLOW_geo_files] # careful if filenames contain additional dots!
#AFLOW_geo_files = [os.path.join(AFLOW_other_protos_path, _) for _ in AFLOW_geo_files]
#other_AFLOW_prototypes = [read(_, ':', format='aims') for _ in AFLOW_geo_files]
def save_pred_and_u(predictions, uncertainty, file_name):
np.save(file_name+'_predictions.npy', predictions)
pickle.dump(uncertainty, open(file_name+'_uncertainty.pkl', "wb"))
```
%% Cell type:markdown id: tags:
<span style="color:red">Some practical tips:</span>
* When executing the code below, don't be surprised if the predictions are not identical between different runs. Due to the stochastic nature of the model, this may very well happen, especially for out-of-sample predictions. Still, the final class prediction (via the argmax operation) remains constant for the parameters being used here.
* To arrive at more constant classification probabilties (and uncertainty quantifications), one may increase the stochastic forward passes (see 'n_iter' argument in the above cell). The increase of computational cost for larger values of n_iter is typicall mild (unless you go to extreme values with billions of forward passes).
%% Cell type:markdown id: tags:
Run the following cell to investigate the training prototypes used in [1]. Click 'Run Interact' to start a calculation:
%% Cell type:code id: tags:
``` python
@widgets.interact_manual(prototype_name=training_prototypes)
def given_geo_predict_and_analyze_training_prototypes(prototype_name):
print('Start calculation.')
geometry_file_path = os.path.join(prototype_path,
name_to_material_type_dict[prototype_name],
prototype_name)
geometry_file = [_ for _ in os.listdir(geometry_file_path) if _[-3:]=='.in']
if not len(geometry_file)==1:
raise ValueError("Poluted prototype path. In [1] we only have one representative for each \
structural class")
geometry_file = os.path.join(geometry_file_path, geometry_file[0])
logger_level = logger.getEffectiveLevel()
logger.setLevel(0)
predictions, uncertainty = ARISE.analyze([geometry_file], mode='global', n_iter=n_iter)
save_pred_and_u(predictions, uncertainty, os.path.join(calculations_path,
prototype_name))
logger.setLevel(logger_level)
print('Finished calculation.')
top_n_labels_list = ARISE.analyze_predictions([geometry_file], predictions,
uncertainty, return_predicted_prototypes=True)
#np.save(os.path.join(calculations_path, 'top_n_labels.npy'), np.array(top_n_labels_list[0], dtype=object))
pickle.dump(top_n_labels_list[0], open(os.path.join(calculations_path, 'top_n_labels.pkl'), "wb"))
```
%% Cell type:markdown id: tags:
You may also want to have a look at the top predicted prototypes (an interactive window will open where you can select via "Ranking_index" which of the top predictions to inspect and via "Supercell" which replicas to display - the exception being nanotubes, which have fixed length; see also the comment three cells above):
%% Cell type:code id: tags:
``` python
@widgets.interact_manual(Ranking_index=[1,2,3], Supercell=[(1,1,1),(2,2,2), (3,3,3), (4,4,4)])
def vis_predicted_structures(Ranking_index, Supercell):
#top_n_labels = np.load(os.path.join(calculations_path, 'top_n_labels.npy'))
top_n_labels = pickle.load(open(os.path.join(calculations_path, 'top_n_labels.pkl'), "rb"))
structure_label = top_n_labels[Ranking_index-1] # Ranking_index = 1,2,3,... -> subtract 1
geometry_file_path = os.path.join(prototype_path,
name_to_material_type_dict[structure_label],
structure_label)
geometry_file = [_ for _ in os.listdir(geometry_file_path) if _[-3:]=='.in']
if not len(geometry_file)==1:
raise ValueError("Poluted prototype path. In [1] we only have one representative for each \
structural class")
geometry_file = os.path.join(geometry_file_path, geometry_file[0])
structure = read(geometry_file, ':', 'aims')[0]
if structure.cell.flatten().any():
structure *= Supercell # only if a lattice is defined, create a supercell
view(structure, viewer='ngl')
```
%% Cell type:markdown id: tags:
Please run the following cell if you want to study out-of-sample structures (including the prototypes investigated in [1]). If you want to inspect your own, new structures, just upload them to the main calculation folder ('calculations_path', defined in the package-loading part at the very beginning of this tutorial).
%% Cell type:code id: tags:
``` python
#@widgets.interact(prototype_name=out_of_sample_prototypes)
@widgets.interact_manual(prototype_name=out_of_sample_prototypes)
def given_geo_predict_and_analyze_out_of_sample(prototype_name):
print('Start calculation.')
geometry_file = os.path.join(AFLOW_other_protos_path, prototype_name + '.in')
logger_level = logger.getEffectiveLevel()
logger.setLevel(0)
predictions, uncertainty = ARISE.analyze([geometry_file], mode='global', n_iter=n_iter)
save_pred_and_u(predictions, uncertainty, os.path.join(calculations_path,
prototype_name))
logger.setLevel(logger_level)
print('Finished calculation.')
top_n_labels_list = ARISE.analyze_predictions([geometry_file], predictions,
uncertainty, return_predicted_prototypes=True)
#np.save(os.path.join(calculations_path, 'top_n_labels.npy'), np.array(top_n_labels_list[0], dtype=object))
pickle.dump(top_n_labels_list[0], open(os.path.join(calculations_path, 'top_n_labels.pkl'), "wb"))
```
%% Cell type:markdown id: tags:
You may also want to have a look at the top predicted prototypes:
%% Cell type:code id: tags:
``` python
@widgets.interact_manual(Ranking_index=[1,2,3], Supercell=[(1,1,1),(2,2,2), (3,3,3), (4,4,4)])
def vis_predicted_structures(Ranking_index, Supercell):
#top_n_labels = np.load(os.path.join(calculations_path, 'top_n_labels.npy'))
top_n_labels = pickle.load(open(os.path.join(calculations_path, 'top_n_labels.pkl'), "rb"))
structure_label = top_n_labels[Ranking_index-1] # Ranking_index = 1,2,3,... -> subtract 1
geometry_file_path = os.path.join(prototype_path,
name_to_material_type_dict[structure_label],
structure_label)
geometry_file = [_ for _ in os.listdir(geometry_file_path) if _[-3:]=='.in']
if not len(geometry_file)==1:
raise ValueError("Poluted prototype path. In [1] we only have one representative for each \
structural class")
geometry_file = os.path.join(geometry_file_path, geometry_file[0])
structure = read(geometry_file, ':', 'aims')[0]
if structure.cell.flatten().any():
structure *= Supercell # only if a lattice is defined, create a supercell
view(structure, viewer='ngl')
```
%% Cell type:markdown id: tags:
# Polycrystal classification
%% Cell type:markdown id: tags:
To classify polycrystals, we introduce the strided pattern matching (SPM) framework, sketched in the following figure:
![](./assets/ARISE/concept_figure_local.svg)
***The first line*** depicts this procedure for slab-like systems and ***the second line*** for bulk materials.
In both scenarios, a box of a certain size is scanned across the whole crystal volume with a certain stride. For each stride, the classification model is applied, yielding both predictions and uncertainty values. These values are then rearranged into 2D or 3D maps. One can construct such maps for each of the classification probabilities and for the uncertainty (here quantified by the mutual information), or for the most similar prototypes (i.e., the argmax prediction).
This way, one can discover the most prevalent symmetries (via the classification maps) and detect defective regions (in particular, the uncertainty can indicate that a particular region is defective or far outside the known training examples, thus allowing - for instance - the discovery of grain boundaries or, as depicted above, a cubic-shaped precipitate).
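%% Cell type:markdown id: tags:
The core of SPM can be sketched in a few lines (a simplified 2D illustration under the assumption of a slab geometry; the actual ai4materials implementation handles, e.g., 3D strides and padding):
%% Cell type:code id: tags:
``` python
# Sketch: scan a box across the x-y plane and classify the atoms inside each box
def spm_sketch(structure, box_size=16.0, stride=4.0, classify=None):
    # classify: any (user-supplied) function mapping an ase.Atoms box to (probabilities, uncertainty)
    positions = structure.get_positions()
    x_starts = np.arange(positions[:, 0].min(), positions[:, 0].max(), stride)
    y_starts = np.arange(positions[:, 1].min(), positions[:, 1].max(), stride)
    maps = {}
    for ix, x0 in enumerate(x_starts):
        for iy, y0 in enumerate(y_starts):
            in_box = ((positions[:, 0] >= x0) & (positions[:, 0] < x0 + box_size) &
                      (positions[:, 1] >= y0) & (positions[:, 1] < y0 + box_size))
            box_atoms = structure[np.where(in_box)[0]]  # atoms inside the current box
            maps[(iy, ix)] = classify(box_atoms)        # rearranged into 2D maps afterwards
    return maps
```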
%% Cell type:markdown id: tags:
As introduced in the 'Quickstart' section, the local analysis can be obtained by simply changing the mode to 'local', while one also has to provide stride and box size (standard settings: stride=1.0, box_size=12.0).
As an example, we investigate the mono-species structure shown in the first line of the above figure:
%% Cell type:code id: tags:
``` python
geometry_files = ['./data/ARISE/polycrystals/four_grains_elemental_solid.xyz']
view(read(geometry_files[0], ':', 'xyz')[0], viewer='ngl')
```
%% Cell type:markdown id: tags:
We choose a box size that is roughly equal to the slab thickness and perform the SPM analysis. Here we choose a very coarse stride; the smaller the stride, the better the resolution. Note that we choose a stride in the z direction that is larger than the slab thickness, a setting for which no stride in z will be made.
A folder in which all relevant information is saved is created automatically (within the main calculation folder 'calculations_path'). The name of the folder is derived from the structure file being passed and the SPM options (box size and stride).
Note that if you run the cell below multiple times, a new folder (with "_run_i", i=1,... appended to the name) will be created, so no data will be lost.
%% Cell type:code id: tags:
``` python
box_size = [16.0]
# you may also pass a list for both stride and box size if you have more than one geometry file.
stride = [[15.0, 15.0, 20.0]]
# choosing the stride in z larger than the extension of the material (here 20.0, exceeding
# the slab thickness by about 4 angstrom) prevents any stride in the z direction.
previous_level = logger.getEffectiveLevel()
logger.setLevel(0) # switch off logging, you may change this.
predictions, uncertainty = ARISE.analyze(geometry_files, mode='local',
stride=stride, box_size=box_size, configs=configs)
logger.setLevel(previous_level)
```
%% Cell type:markdown id: tags:
Within the automatically created folder, the results are saved in the subfolder 'results_folder' (in this case, the probabilities as 'four_grains_elemental_solid.xyz_probabilities.npy' and the uncertainty as 'four_grains_elemental_solid.xyz_mutual_information.npy'). Each of the numpy arrays has in general the shape (#classes, z_coordinate, y_coordinate, x_coordinate), where here #classes=108 and the coordinates refer to the spatial positions of the box. In this case of a slab structure, the shape is only (#classes, y_coordinate, x_coordinate).
%% Cell type:code id: tags:
``` python
print('Uncertainty (mutual information) array shape = {}'.format(uncertainty[0]['mutual_information'].shape))
print('Predictions (i.e., classification probabilities) array shape {}'.format(predictions[0].shape))
```
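%% Cell type:markdown id: tags:
The map of most similar prototypes (i.e., the argmax prediction mentioned above) can be obtained directly from this array, for instance as follows (note that entries may be NaN where no box was placed; this simple sketch ignores that detail):
%% Cell type:code id: tags:
``` python
# Sketch: most similar prototype per box via argmax over the class axis
argmax_map = predictions[0].argmax(axis=0)  # shape (y stride #, x stride #)
most_frequent = np.bincount(argmax_map.flatten()).argmax()
print('Most frequent prototype in this map: {}'.format(numerical_to_text_label[most_frequent]))
```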
%% Cell type:markdown id: tags:
As an example, we visualize the four classes fcc, bcc, hcp, and diamond alongside the mutual information in the following cell:
%% Cell type:code id: tags:
``` python
import matplotlib
classes_of_interest = ['fcc_Cu_A_cF4_225_a', 'bcc_W_A_cI2_229_a', 'hcp_Mg_A_hP2_194_c', 'diam_C_A_cF8_227_a']
uncertainty_quantifier = ['mutual_information']
fig_classes, ax_classes = plt.subplots(1, len(classes_of_interest), figsize=(20,5))
for idx, class_of_interest in enumerate(classes_of_interest):
class_idx = text_to_numerical_label[class_of_interest]
classification_probabilities = predictions[0][class_idx]
ax_class = ax_classes[idx]
ax_class.imshow(classification_probabilities)
cmap = matplotlib.cm.get_cmap(name='viridis')
# set the color for NaN values
cmap.set_bad(color='lightgrey')
cax = ax_class.imshow(classification_probabilities, interpolation='none', vmin=0.0, vmax=1.0, cmap=cmap,
origin='lower')
ax_class.set_title('Class {}'.format(class_of_interest))
ax_class.set_xlabel(u'x stride #')
ax_class.set_ylabel(u'y stride #')
fig_classes.colorbar(cax)
fig_u, ax_u = plt.subplots(1,1, figsize=(20,5))
for u in uncertainty_quantifier:
uncertainty_values = uncertainty[0][u]
cmap = matplotlib.cm.get_cmap(name='hot')
# set the color for NaN values
cmap.set_bad(color='lightgrey')
uax = ax_u.imshow(uncertainty_values, cmap=cmap, origin='lower', vmin=0.0)
ax_u.set_title('Mutual_information')
fig_u.colorbar(uax)
plt.show()
```
%% Cell type:markdown id: tags:
The finer the stride, the better the resolution. In [1], we employ a stride of $1 \overset{\circ}{\mathrm {A}}$, resulting in the following high-resolution picture:
![](./assets/ARISE/synthetic_polycrystals_figure_results_mono_species.svg)
%% Cell type:markdown id: tags:
# Unsupervised analysis
%% Cell type:markdown id: tags:
One important aspect of [1] is the application of unsupervised learning techniques to analyze the internal neural-network representations. This allows one to *explain* the trained model by showing that regularities in the neural-network representation space correspond to physical characteristics. Using this strategy, one can conduct exploratory analysis, with the goal of discovering patterns in the high-dimensional representation space and relating them to the crystal structure - leading to the discovery of defective regions (e.g., grain boundaries or regions that are distorted in similar/distinct fashion, cf. Figs. 2 and 4 of [1]).
Specifically, we apply clustering (HDBSCAN, https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) to identify groups in the high-dimensional neural-network representations. Furthermore, the manifold-learning method UMAP (https://umap-learn.readthedocs.io/en/latest/index.html) is employed to obtain a low-dimensional embedding (here 2D) that can be interpreted by the human eye.
The most important parameters of the unsupervised techniques employed in this work are the following (see also the sketch after this list):
* HDBSCAN: min_cluster_size (int) - determines the minimum number of points a cluster must contain. In more detail, HDBSCAN employs (agglomerative) hierarchical clustering (https://www.displayr.com/what-is-hierarchical-clustering/) to construct a hierarchical cluster tree, where only those clusters are considered that contain at least as many data points as defined via min_cluster_size. Clusters are finally extracted by computing a cluster-stability value, and the most stable clusters are used to construct a final, so-called flat clustering.
* UMAP: n_neighbors (int) - determines the trade-off between capturing local (small n_neighbors) vs. global (large n_neighbors) relationships in the data. Specifically, UMAP constructs a topological representation (in practice corresponding to a specific type of k-neighbor graph) of the data, under certain assumptions (e.g., the data being uniformly distributed on a (Riemannian) manifold). Then a low-dimensional representation is found by matching its topological representation with that of the original data (where the matching procedure amounts to the definition of a cross-entropy loss function and the application of stochastic gradient descent). The parameter n_neighbors influences the high-dimensional topological representation, while the distances in the low-dimensional representation can also be tuned (via 'min_dist', although this parameter is typically considered as tuning only the visualization, i.e., the distances in the low-dimensional embedding - see also https://pair-code.github.io/understanding-umap/ for a discussion with hands-on examples).
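%% Cell type:markdown id: tags:
A minimal sketch of this workflow, assuming the hdbscan and umap-learn packages and a (hypothetical) matrix 'representations' of shape (n_points, n_features) extracted from a network layer:
%% Cell type:code id: tags:
``` python
# Sketch: cluster neural-network representations and embed them in 2D
import hdbscan
import umap

representations = np.random.randn(2000, 64)  # placeholder for real NN-layer activations

clusterer = hdbscan.HDBSCAN(min_cluster_size=250,
                            prediction_data=True).fit(representations)
flat_labels = clusterer.labels_  # flat clustering; -1 marks outliers

embedding = umap.UMAP(n_neighbors=250, n_components=2).fit_transform(representations)
```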
%% Cell type:markdown id: tags:
In the cell below, we analyze, as an example, a mono-species polycrystal with four crystalline grains (corresponding to the slab structure depicted above and discussed in Fig. 2 of [1], where we chose a stride of 1 $ \overset{\circ}{\mathrm {A}}$ in x,y and a box size of 16 $ \overset{\circ}{\mathrm {A}}$).
You can change the parameters to get a feeling for the importance of min_cluster_size and n_neighbors.
You may also choose the color scale according to ARISE's predictions or uncertainty (mutual information), or the HDBSCAN cluster labels.
***Remark:*** For the HDBSCAN clustering, there are two options: either you choose 'cluster_labels', the cluster assignment that arises from the standard (flat) clustering procedure, or the 'argmax_clustering' color scale, which is determined from the soft-clustering feature of HDBSCAN. Specifically, HDBSCAN allows calculating a vector for each point, with component i capturing the "probability" that the point is a member of cluster i. Then, we can infer a cluster assignment for points that would normally be considered outliers by taking the cluster whose membership probability is maximal (while we consider the point an outlier if none of its cluster probabilities exceeds a certain threshold - here, we choose 10%). This procedure allows obtaining reasonable clusterings also in the presence of high levels of noise (see Fig. 4 of [1]). Here, it will not make a qualitative difference, i.e., the basic structure of the polycrystal (four grains + grain boundaries) can be recovered by both clustering procedures.
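%% Cell type:markdown id: tags:
The soft-clustering ('argmax_clustering') assignment with the 10% threshold can be sketched as follows, reusing the fitted clusterer from the sketch above:
%% Cell type:code id: tags:
``` python
# Sketch: soft-clustering assignment via HDBSCAN membership vectors
membership = hdbscan.all_points_membership_vectors(clusterer)  # shape (n_points, n_clusters)
argmax_labels = membership.argmax(axis=1)
argmax_labels[membership.max(axis=1) < 0.1] = -1  # below-10%-membership points remain outliers
```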
%% Cell type:code id: tags:
``` python
%matplotlib inline
from ipywidgets import interactive
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from collections import Counter
n_neighbors_list = [25, 50, 100, 250, 1000]
min_cluster_sizes = n_neighbors_list
analysis_folder_path = './data/ARISE/unsupervised_learning'
file_name = 'four_grains_elemental_solid.xyz_stride_1.0_1.0_16.0_box_size_16.0_.tar.gz_'
layer_name = 'dense_1'
# other possible values: 'soap', 'dense_0', 'dense_2', 'final_dense', corresponding to input, hidden layers, and output
s = 32
edgecolors = 'face'
color_scales = ['cluster_labels', 'argmax_clustering', 'ARISE_predictions', 'ARISE_mutual_information']
df_dict = {}
for min_cluster_size in min_cluster_sizes:
for n_neighbors in n_neighbors_list:
quantities_to_plot = {#'cluster_probs': [], # soft clustering probabilties
#'outlier_scores': [], # GLOSH -> see HDBSCAN docs
'cluster_labels': [],
'embedding' : [],
'argmax_clustering': [],
}
for q_to_plot in quantities_to_plot:
q = quantities_to_plot[q_to_plot]
q_filename = os.path.join(analysis_folder_path,
file_name
+ '_'
+ layer_name
+ '_mincsize_'
+ str(min_cluster_size)
+ '_nneighbors_'
+ str(n_neighbors)
+ '_' + q_to_plot + '.npy')
quantities_to_plot[q_to_plot] = np.load(q_filename)
embedding = quantities_to_plot['embedding']
# save embedding
df = pd.DataFrame({'Dim_1': embedding[:, 0], 'Dim_2': embedding[:, 1]})
# save possible color scales
df['cluster_labels'] = quantities_to_plot['cluster_labels']
df['argmax_clustering'] = quantities_to_plot['argmax_clustering']
#df['ARISE_predictions'] = pickle.load(open(os.path.join(analysis_folder_path,
# 'ARISE_predictions.pkl'), "rb"))
#df['ARISE_mutual_information'] = pickle.load(open(os.path.join(analysis_folder_path,
# 'uncertainty_filtered_mutual_information.pkl'), "rb"))
with open(os.path.join(analysis_folder_path, 'ARISE_predictions.json')) as f:
ARISE_predictions = json.load(f)
df['ARISE_predictions'] = ARISE_predictions
df_dict[(min_cluster_size, n_neighbors)] = df
df['ARISE_mutual_information'] = np.load(os.path.join(analysis_folder_path,
'uncertainty_filtered_mutual_information.npy'))
@widgets.interact_manual(min_cluster_size=min_cluster_sizes,
n_neighbors=n_neighbors_list,
color_scale=color_scales)
def f(min_cluster_size, n_neighbors, color_scale):
df = df_dict[(min_cluster_size, n_neighbors)]
df['target'] = df[color_scale]
#if color_scale == 'ARISE_predictions':
# df['target'] = [text_to_numerical_label[_] for _ in df['target'].values]
vmin = min(df['target'].values)
vmax = max(df['target'].values)
if color_scale == 'ARISE_mutual_information':
palette = 'hot'
elif color_scale == 'ARISE_predictions':
palette = None
vmin = None
vmax = None
unique_targets = np.unique(df['target'].values)
target_to_int = {t : i for t,i in zip(unique_targets, range(len(unique_targets)))}
int_to_target = {i : t for t,i in zip(unique_targets, range(len(unique_targets)))}
#palette = plt.get_cmap('tab10', len(unique_targets))
#palette = sns.color_palette(palette='tab10', n_colors=len(unique_targets))
palette_cmap = plt.get_cmap('tab10', np.max(range(len(unique_targets)))-np.min(range(len(unique_targets)))+1)
palette = sns.color_palette(palette_cmap.colors)
df['target'] = [target_to_int[_] for _ in df['target']]
else:
palette = 'viridis'
print('Frequency per label : {}'.format(dict(Counter(df['target'].values))))
print('Remark: -1 = outlier')
fig, axs = plt.subplots(1, 2, figsize=(15,10))
fig.suptitle('min_cluster_size (HDBSCAN) = {}, n_neighbors (UMAP) = {}, color scale = {}'.format(min_cluster_size, n_neighbors, color_scale))
ax = axs[0]
ax_ = sns.scatterplot(ax=ax, x='Dim_1', y='Dim_2',
data=df, s=s, hue='target', edgecolor=edgecolors,
palette=palette, vmin=vmin, vmax=vmax)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('UMAP 2D embedding of NN representations')
if color_scale == 'ARISE_predictions':
#plt.legend()
ax.get_legend().remove()
#ax.legend(range(len(unique_targets)), labels = [int_to_target[_] for _ in range(len(unique_targets))])
pass
else:
norm = plt.Normalize(df['target'].min(), df['target'].max())
sm = plt.cm.ScalarMappable(cmap=palette, norm=norm)
sm.set_array([])
# Remove the legend and add a colorbar
ax.get_legend().remove()
ax.figure.colorbar(sm)
#plt.legend()
# second part of figure: crystal maps
color_scale_ = color_scale
path_to_crystal_maps = os.path.join(analysis_folder_path,
layer_name+'_neighbors_'+str(n_neighbors)
+'_cluster_analysis_'+color_scale_
+'_min_csize_'+str(min_cluster_size)+'.npy')
if color_scale == 'ARISE_predictions':
color_scale_ = 'argmax_preds'
path_to_crystal_maps = os.path.join(analysis_folder_path, 'ARISE_argmax_predictions_crystal_map.npy')
elif color_scale == 'ARISE_mutual_information':
color_scale_ = 'mutual_information'
path_to_crystal_maps = os.path.join(analysis_folder_path, 'ARISE_mutual_information_crystal_map.npy')
crystal_map = np.load(path_to_crystal_maps)
vmin = None
vmax = None
if color_scale == 'ARISE_predictions':
palette = palette_cmap
vmin = np.min(range(len(unique_targets)))-.5
vmax = np.max(range(len(unique_targets)))+.5
crystal_map_shape = crystal_map.shape
crystal_map = crystal_map.flatten()
for idx, c in enumerate(crystal_map):
if not crystal_map[idx] in list(numerical_to_text_label.keys()):
continue # nan number
else:
if numerical_to_text_label[crystal_map[idx]] in target_to_int.keys():
crystal_map[idx] = target_to_int[numerical_to_text_label[crystal_map[idx]]]
else:
crystal_map[idx] = target_to_int['other']
#crystal_map = np.full(crystal_map_shape, np.nan)
crystal_map = np.reshape(crystal_map, crystal_map_shape)
#print(Counter(crystal_map.flatten()))
ax = axs[1]
cax = ax.imshow(crystal_map, cmap=palette, origin='lower', vmin=vmin, vmax=vmax)
#ax.set_xticks([])
#ax.set_yticks([])
if color_scale == 'ARISE_predictions':
cbar = fig.colorbar(cax, ticks = range(len(unique_targets)))
cbar.ax.set_yticklabels([int_to_target[_] for _ in range(len(unique_targets))])
ax.axis('off')
if not color_scale == 'ARISE_predictions':
ax.set_title('Crystal map (real space)')
plt.show()
```
%% Cell type:markdown id: tags:
A consistent picture, with four main clusters corresponding to the four crystalline grains separated by grain boundaries, arises for min_cluster_size>250 (and also n_neighbors>250; reasonable embeddings, i.e., with points corresponding to different grains separated, are obtained at lower values as well). For more details on the interpretation of these figures, we refer to [1].
%% Cell type:markdown id: tags:
# Conclusion
%% Cell type:markdown id: tags:
You have reached the end of this tutorial.
*Please let us know if you have any questions, wishes, or suggestions for improvement. Feel free to reach out to us, e.g., via email (leitherer@fhi-berlin.mpg.de, andreas.leitherer@gmail.com).*
%% Cell type:markdown id: tags:
# References
%% Cell type:markdown id: tags:
[1] A. Leitherer, A. Ziletti, and L. M. Ghiringhelli, arXiv:2103.09777 (2021)
[2] Y. Gal and Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning, in International Conference on Machine Learning, 1050-1059 (2016)
[3] Y. Gal, Uncertainty in deep learning, Ph.D. thesis, University of Cambridge (2016)
[4] M. J. Mehl et al., The AFLOW library of crystallographic prototypes: part 1, Comput. Mater. Sci. 136, S1-S828 (2017)
[5] A. P. Bartók et al., Phys. Rev. Lett. 104, 136403 (2010)
[6] A. P. Bartók et al., Phys. Rev. B 87, 184115 (2013)
[7] The quippy software is available for non-commercial use from www.libatoms.org (or https://github.com/libAtoms/QUIP)
[8] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Comparing molecules and solids across structural and alchemical space, Phys. Chem. Chem. Phys. 18, 13754-13769 (2016)
%% Cell type:markdown id: tags:
Please refer to [1] for additional information and references.
%% Cell type:code id: tags:
``` python
```