{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"teaser\" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat; \n",
" padding-top: 20px;\n",
" padding-right: 10px;\n",
" padding-bottom: 170px;\n",
" padding-left: 10px;\n",
" border-bottom: 14px double #333;\n",
" border-top: 14px double #333;' > \n",
"\n",
"\n",
"\n",
" <div style=\"text-align:center\">\n",
" <b><font size=\"6.4\">ARISE - Robust recognition and exploratory analysis of crystal structures via Bayesian-deep-learning</font></b> \n",
" </div>\n",
" \n",
"<p>\n",
"\n",
"\n",
"created by: \n",
"Andreas Leitherer<sup>1</sup>,\n",
"Angelo Ziletti<sup>1</sup>,\n",
"and Luca Ghiringhelli<sup> 1</sup>\n",
"***\n",
"\n",
"<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany\n",
"\n",
"<div> \n",
"<img style=\"float: left;\" src=\"assets/ARISE/Logo_MPG.png\" width=\"200\"> \n",
"<img style=\"float: right;\" src=https://nomad-coe.eu/uploads/nomad/images/NOMAD_Logo2.png width=\"250\">\n",
"</div>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial , we introduce the computational framework ARISE (<ins>Ar</ins>tificial <ins>I</ins>ntelligence based <ins>S</ins>tructure <ins>E</ins>valuation) for crystal-structure recognition in single- and polycyrstalline systems[1]. ARISE can treat more than 100 crystal structures, including one-, two-dimensional, and bulk materials - in robust and threshold-independent fashion. The Bayesian-deep-learning model yields not only a classification but also uncertainty estimates, which are principled (i.e., they approximate the uncertainty estimates of a Gaussian process) [2,3]. These uncertainty estimates correlate with crystal order.\n",
"\n",
"![](./assets/ARISE/ARISE_logo.svg)\n",
"\n",
"For additional details, please refer to \n",
"\n",
"[1] A. Leitherer, A. Ziletti, and L. M. Ghiringhelli, arXiv:2103.09777, (2021)\n",
"\n",
"ARISE is part of the code framework *ai4materials* (https://github.com/angeloziletti/ai4materials).\n",
"\n",
"\n",
"The outline of this tutorial is as follows:\n",
"\n",
"* Quickstart (jump here if you want to directly use ARISE)\n",
"* Single-crystal classification\n",
"* Polycyrstal classification\n",
"* Unsupervised learning / explanatory analysis of the trained model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ase\n",
"from ase.io import read, write\n",
"from ase.visualize import view\n",
"\n",
"import ipywidgets as widgets\n",
"from ipywidgets import interactive\n",
"\n",
"\n",
"from timeit import default_timer as timer\n",
"\n",
"from collections import defaultdict\n",
"\n",
"\n",
"from ai4materials.utils.utils_crystals import get_nn_distance, scale_structure\n",
"from ai4materials.descriptors.quippy_soap_descriptor import quippy_SOAP_descriptor\n",
"from ai4materials.utils.utils_config import set_configs, setup_logger\n",
"from ai4materials.models.cnn_polycrystals import predict_with_uncertainty\n",
"from ai4materials.models import ARISE\n",
"\n",
"import quippy\n",
"from quippy import descriptors\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import os\n",
"\n",
"import numpy as np\n",
"\n",
"import tensorflow as tf\n",
"tf.logging.set_verbosity(tf.logging.ERROR) # Suppress TF warnings\n",
"\n",
"from keras.models import load_model\n",
"\n",
"import json\n",
"import pickle\n",
"\n",
"# filepaths\n",
"prototype_path = './data/ARISE/single_crystals/PROTOTYPES' # prototypes used in [1]\n",
"bcc_prototype_path = os.path.join(prototype_path, \n",
" 'Elemental_solids',\n",
" 'bcc_W_A_cI2_229_a',\n",
" 'bcc_W_A_cI2_229_a.in' )\n",
"\n",
"fcc_prototype_path = os.path.join(prototype_path, \n",
" 'Elemental_solids',\n",
" 'fcc_Cu_A_cF4_225_a',\n",
" 'fcc_Cu_A_cF4_225_a.in' )\n",
"\n",
"NaCl_prototype_path = os.path.join(prototype_path, \n",
" 'Binaries',\n",
" 'RockSalt_NaCl',\n",
" 'RockSalt_NaCl.in' )\n",
"\n",
"nn_model_path = './data/ARISE/nn_model/AI_SYM_Leitherer_et_al_2021.h5'\n",
"nn_model_class_info_path = './data/ARISE/nn_model'\n",
"\n",
"calculations_path = './data/ARISE/calculations'\n",
"\n",
"fcc_bcc_desc = np.load('./data/ARISE/saved_soap_descriptors/fcc_Cu_A_cF4_225_a_bcc_W_A_cI2_229_a.npy')\n",
"fcc_desc = fcc_bcc_desc[:1]\n",
"bcc_desc = fcc_bcc_desc[1:]\n",
"nacl_desc = np.load('./data/ARISE/saved_soap_descriptors/RockSalt_NaCl.npy')\n",
"\n",
"\n",
"\n",
"version = 'py3' # or 'py2' -> defines python version of quippy\n",
"\n",
"\n",
"\n",
"# set config file\n",
"main_folder = calculations_path\n",
"configs = set_configs(main_folder=main_folder)\n",
"\n",
"logger = setup_logger(configs, level='INFO', display_configs=False)\n",
"logger.setLevel('INFO')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quickstart"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please specify the geometry files that you want to analyze in the list 'geometry_files'.\n",
"\n",
"ARISE can be used in two modes: Local and global (controlled via the keyword 'mode'). If mode = 'local', then the strided pattern matching (SPM) framework introduced in [1] is employed, requiring the definition of stride and box size (standard setting: $4 \\overset{\\circ}{\\mathrm {A}}$ stride in all directions and $16 \\overset{\\circ}{\\mathrm {A}}$ box size)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from ai4materials.models import ARISE\n",
"\n",
"geometry_files = [ ... ]\n",
"\n",
"predictions, uncertainty = ARISE.analyze(geometry_files, mode='global') \n",
"\n",
"predictions, uncertainty = ARISE.analyze(geometry_files, mode='local')\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Typically, one uses the global mode for single crystals an the local mode for large (polycrystalline) samples (to zoom into a given structure and detect structural defects such as grain boundaries). However, one can also mix this up and investigate, for instance, the global assignments for polycrystals (see for instance the analysis of STEM graphene images with grain boundaries in Fig. 3 of Ref. [1]).\n",
"\n",
"The following sections provide more details on the internal workings of ARISE and SPM. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Bayesian-deep-learning model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given an unknown atomic structure, the goal of crystal-strucutre recognition is - in general terms - to find the most similar prototype (that is currently known to be found in nature, for example). In the figure below, this list includes fcc, bcc, diamond, and hcp symmetry but as we will see later, our framework allows to treat a much broader range of materials.\n",
"\n",
"![](./assets/ARISE/cs_classification_first_example.svg)\n",
"\n",
"The initial representation of a given atomic structure (single- or polycrystalline) are atomic positions and chemical species symbols (as well as lattice vectors for periodic systems).\n",
"\n",
"The first step is to define how this input will be treated in the ARISE framework, i.e., how we arrive at a meaningful prediction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification of mono-species single crystals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Considering exemplarily the body-centered cubic system (see http://www.aflowlib.org/prototype-encyclopedia/A_cI2_229_a.html and also [4]), the prediction pipeline for single crystals is sketched as follows:\n",
"![](./assets/ARISE/concept_figure_global.svg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each of these steps will be explained in the following.\n",
"\n",
"We first load the geometry file using ASE (https://wiki.fysik.dtu.dk/ase/index.html). We typically use the FHI-aims format for the geometry files (while via ASE's geometry file parser, compatibility with many other formats is provided). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"structure = read(bcc_prototype_path, ':', 'aims')[0]\n",
"view(structure, viewer='ngl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid dependence on the lattice parameters, we isotropically scale a given structure using the function 'get_nn_distance' from the *ai4materials* package. By default, we compute the radial distribution function (as approximated by the histogram of nearest neighbor distances) and then choose the center of the maximally populated bin (i.e., the mode) as the nearest neighbor distance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scale_factor = get_nn_distance(structure)\n",
"print('Scale factor for fcc structure = {}'.format(scale_factor))"
]
},
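{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the following minimal sketch mimics the heuristic just described (histogram of per-atom nearest-neighbor distances, center of the fullest bin). It is only a sketch of the idea; the actual implementation, with further refinements such as the quantile-based variant used below, is 'get_nn_distance' in *ai4materials*:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from ase.neighborlist import neighbor_list\n",
"\n",
"def nn_distance_mode(atoms, cutoff=10.0, n_bins=100):\n",
"    # nearest-neighbor distance of every atom (periodic images included)\n",
"    i_idx, d = neighbor_list('id', atoms, cutoff)\n",
"    nn_d = np.array([d[i_idx == i].min() for i in range(len(atoms))])\n",
"    # histogram the distances and return the center of the fullest bin\n",
"    counts, edges = np.histogram(nn_d, bins=n_bins)\n",
"    i = counts.argmax()\n",
"    return 0.5 * (edges[i] + edges[i + 1])\n",
"```"
]
},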
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given an atomic structure, scaling of atomic positions and unit cell is summarized in the function 'scale_structure': "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scaled_structure = scale_structure(structure, scaling_type='quantile_nn',\n",
" atoms_scaling_cutoffs=[10., 20., 30., 40.])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using atomic positions $\\mathbf{r}_i$ labeled by chemical species $c_i$ ($i = 1, ..., \\text{N}_{\\text{atoms}}$) directly as input to a classification model introduces several issues. Most importantly, physically meaningful invariants that we know to be true are not respected (invariance with respect to translations, rotations and permutations of idential atoms). Also the input dimension would depend on the number of atoms.\n",
"\n",
"A well-known and -tested materials science descriptor that is by definition invariant to the above mentioned symmetries is the so-called smooth-overlap-of-atomic-positions (SOAP) descriptor [5-7]. \n",
"\n",
"In this tutorial (and in [1]) we employ the SOAP implementation that is being made available via the quippy package (https://github.com/libAtoms/QUIP) for non-commercial use.\n",
"\n",
"In short, given an atomic structure with N atoms, we obtain N SOAP vectors (respecting the mentioned invariants) that represent the local atomic environments of each atom. The local atomic environments of an atom is defined as a sphere (centered at that particular atom) that has a certain tunable cutoff radius $R_C$. Each atom within that sphere is represented by a Gaussian function (with tunable width $\\sigma$). The sum of these Gaussians defines the local density:\n",
"\n",
"$$ \\rho_\\mathscr{X}(\\vec{r})=\\sum_{i\\in \\mathscr{X}} \\exp{\\left(-\\frac{(\\vec{r}-\\vec{r}_i)^2}{2\\sigma^2}\\right)}$$\n",
"\n",
"This density is expanded in terms of radial basis functions and spherical harmonics \n",
"\n",
"$$ \\rho_\\mathscr{X}(\\vec{r})=\\sum_{i\\in \\mathscr{X}} \\exp{\\left(-\\frac{(\\vec{r}-\\vec{r}_i)^2}{2\\sigma^2}\\right)} =\\sum_{blm}c_{blm}u_b(r) Y_{lm}(\\hat{\\vec{r}}),$$\n",
"\n",
"yielding a set of expansion coefficients $c_{blm}$.\n",
"\n",
"A rotational average of these coefficients yields the SOAP representation of the local atomic environment:\n",
"\n",
"$$ p(\\mathscr{X})_{b_1b_2l}=\\pi\\sqrt{\\frac{8}{2l+1}}\\sum_{m}(c_{b_1 lm})^{\\dagger} c_{b_2 lm}.$$\n",
"\n",
"Applying this procedure to every atom, gives a collection of SOAP vectors, which one may average. The default behavior in quippy (that is employed in [1]) is to first average the coefficients and then perform the rotational average.\n",
"\n",
"The output of quippy is adapted using a typical averaging procedure (while additional details such as the treatment of \"cross-correlation terms\" is discussed in the next section and also the supplementary material of [1]).\n",
"\n",
"First, we have to define some parameters necessary for the calculation of SOAP, summarized in the string \n",
"descriptor_options that is used to create the SOAP descriptor object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cutoff = 4.0\n",
"central_weight = 0.0 # relative weight of central atom in atom density\n",
"atom_sigma = 0.1\n",
"p_b_c = scaled_structure.get_pbc()\n",
"\n",
"l_max = 6\n",
"n_max = 9\n",
"n_Z = 1\n",
"n_species = 1\n",
"species_Z = structure.get_atomic_numbers()[0]\n",
"Z = structure.get_atomic_numbers()[0]\n",
"average = True\n",
"\n",
"descriptor_options = 'soap '+'cutoff='+str(cutoff)+' l_max='+str(l_max)+' n_max='+str(n_max) + \\\n",
" ' atom_sigma='+str(atom_sigma)+ ' n_Z='+str(n_Z) + \\\n",
" ' Z={'+str(Z)+'} n_species='+str(n_species)+' species_Z={'+str(species_Z) + \\\n",
" '} central_weight='+str(central_weight)+' average='+str(average)\n",
" \n",
"desc = descriptors.Descriptor(descriptor_options)\n",
"if version == 'py2':\n",
" from quippy import Atoms as quippy_Atoms\n",
" struct = quippy_Atoms(scaled_structure)\n",
"elif version == 'py3':\n",
" struct = scaled_structure\n",
"struct.set_pbc(p_b_c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given this, we can calculate the SOAP descriptor and visualize its components:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Compute SOAP descriptor\n",
"if version == 'py2':\n",
" struct.set_cutoff(desc.cutoff())\n",
" struct.calc_connect()\n",
" SOAP_descriptor = desc.calc(struct)['descriptor']\n",
"elif version == 'py3':\n",
" SOAP_descriptor = desc.calc(struct)['data']\n",
"\n",
"# visualize descriptor\n",
"def vis_desc(SOAP_descriptor):\n",
" fig, ax = plt.subplots(1,2, figsize=(15,2))\n",
" ax[0].plot(SOAP_descriptor[0])\n",
" ax[0].set_xlabel('SOAP component #')\n",
" ax[0].set_ylabel('SOAP component value (normalized)')\n",
"\n",
"\n",
" m = ax[1].imshow(SOAP_descriptor, aspect=150, cmap=\"viridis\", extent=[0, 315, 0, 1])\n",
" ax[1].set_yticks([])\n",
" ax[1].set_xticks([])\n",
" fig.colorbar(m, ax=ax)\n",
"\n",
"\n",
" plt.show()\n",
"vis_desc(SOAP_descriptor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scaling, descriptor definition and calculation are summarized in the following cell (using *ai4materials*):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define the descriptor\n",
"descriptor = quippy_SOAP_descriptor()\n",
"\n",
"soap_desc = descriptor.calculate(structure).info['descriptor']['SOAP_descriptor']\n",
"\n",
"\n",
"# visualize descriptor\n",
"vis_desc(SOAP_descriptor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The object quippy_SOAP_descriptor above has standard settings that allow to reproduce the results of [1], but in principle one may choose different values - in particular for the SOAP parameters:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python \n",
"descriptor = quippy_SOAP_descriptor(configs=configs, \n",
" cutoff=cutoff, l_max=l_max,\n",
" n_max=n_max, atom_sigma=atom_sigma)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the descriptor, the next step in the pipeline is to apply a classification model. We use the Bayesian neural network introduced in [1]: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = load_model(nn_model_path)\n",
"model.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The neurons in the final layer each correspond to a specific class. This information is saved in a json file that we are loading below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(os.path.join(nn_model_class_info_path, 'class_info.json')) as file_name:\n",
" data = json.load(file_name) \n",
"\n",
"class_labels = data[\"data\"][0]['classes']\n",
"numerical_to_text_label = dict(zip(range(len(class_labels)), class_labels)) \n",
"text_to_numerical_label = dict(zip(class_labels, range(len(class_labels))))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [1], ARISE employs a Bayesian neural network as a classification model, which it does not only yield classification probabilities, but also allows to quantify the predictive uncertainty. \n",
"\n",
"Specifically, we employ so-called Monte Carlo (MC) Dropout introduced in [2,3], which employs the stochastic regularization technique dropout. In dropout, individual neurons are randomly dropped in each layer, usually only during training to avoid over-specialization of individual units and this way control overfitting. One can show that employing dropout also at test time yields a probabilistic model whose uncertainty estimates approximate those of a Gaussian process. In particular, neural-network optimization with dropout is equivalent to performing variational inference (see [3] for more details, especially sections 1, 2, and 3.3). \n",
"\n",
"In practice, for a given input, the model output is computed for several iterations (100-1000 typically suffice), yielding a collection of differing probability vectors. In contrast, standard neural networks are deterministic and thus all predictions are identical. Sampling the output layer of the Bayesian neural network actually corresponds to sampling from (an approximated version of) the true probability distribution of outputs. Deterministic networks only provide a point estimate of this distribution, leading to overconfident predictions (see section 1.5 of [3]). Averaging the forward-passes gives an average classification probability vector, and the predicted class label can be inferred by selecting the most likely class (i.e., computing the argmax). Additional statistical information is contained in the classification probability samples. In [1], we employ mutual information (as implemented in *ai4materials*) to obtain a single uncertainty-quantifying number from the output samples. \n",
"\n",
"For the bcc structure from above, prediction and uncertainty quantification is performed in the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# reshape data array such that it fits model input shape\n",
"input_shape_from_model = model.layers[0].get_input_at(0).get_shape().as_list()[1:]\n",
"target_shape = tuple([-1] + input_shape_from_model)\n",
"data = np.reshape(soap_desc, target_shape)\n",
"\n",
"# Obtain classifcation probabilities (numpy array 'prediction')\n",
"# and uncertainty quantifiers (dictionary of numpy arrays, with three keys\n",
"# 'mutual_information', 'predictive_entropy', and 'variation_ratio' corresponding to\n",
"# three uncertainty quantifiers - see also L. Smith & Y. Gal, arXiv:1803.08533v1, 2018 for further\n",
"# reading).\n",
"prediction, uncertainty = predict_with_uncertainty(data, \n",
" model=model, \n",
" model_type='classification', \n",
" n_iter=1000)"
]
},
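{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the sampling procedure explicit, here is a minimal sketch of what such a routine computes (the callable 'stochastic_forward_pass' is a hypothetical stand-in for the network evaluated with dropout active; the actual implementation is 'predict_with_uncertainty' in *ai4materials*):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def mc_dropout_predict(stochastic_forward_pass, x, n_iter=1000, eps=1e-12):\n",
"    # n_iter stochastic forward passes -> samples of the class probabilities\n",
"    samples = np.stack([stochastic_forward_pass(x) for _ in range(n_iter)])\n",
"    p_mean = samples.mean(axis=0)  # average classification probabilities\n",
"    # mutual information = predictive entropy - expected entropy of the samples\n",
"    predictive_entropy = -np.sum(p_mean * np.log(p_mean + eps), axis=-1)\n",
"    expected_entropy = -np.mean(np.sum(samples * np.log(samples + eps), axis=-1), axis=0)\n",
"    mutual_information = predictive_entropy - expected_entropy\n",
"    return p_mean, mutual_information\n",
"```"
]
},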
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The array 'prediction' gives the classification probabilities, which are clearly peaked at the entry corresponding to bcc:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"argmax_prediction = prediction.argmax(axis=-1)[0]\n",
"predicted_label = numerical_to_text_label[argmax_prediction].split('_')[:2]\n",
"predicted_label = '_'.join(predicted_label)\n",
"print('Maximal classifcation probability with value {:.4f} for prototype {}'.format(prediction[0][argmax_prediction],\n",
" predicted_label))\n",
"plt.plot(prediction[0])\n",
"plt.xlabel('Class (integer encoding)')\n",
"plt.ylabel('Probability')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dictionary \"uncertainty\" contains quantifiers of uncertainty, in particular the mutual information employed in [1]:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Mutual information = {:.4f}'.format(uncertainty[\"mutual_information\"][0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***In summary***: All of these steps - isotropic scaling, descriptor calculation, neural network predictions (classification probabilities + uncertainty) - are summarized in the following function, to provide a quick and easy usage. In particular, you may simply pass a list of geometry files, here we choose the fcc (Cu) and bcc (W) prototypes from AFLOW:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ai4materials.models import ARISE\n",
"\n",
"geometry_files = [fcc_prototype_path, bcc_prototype_path]\n",
"\n",
"predictions, uncertainty = ARISE.analyze(geometry_files, mode='global')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To analyze the predictions, in particular the ranking by classification probability and the uncertainty quantifiers, we can use the function 'analyze_predictions':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ARISE.analyze_predictions(geometry_files, predictions, uncertainty)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the predictions in the fcc, non-zero probability is also assigned to a tetragonal prototype (identifier 'bctIn_In_A_tI2_139_a', where the AFLOW identifier is specified in this case as well, see [4] for more details). This assignment is justified since it is a slightly distorted (tetragonal) version of fcc (http://aflowlib.org/prototype-encyclopedia/A_tI2_139_a.In.html). Moreover, the uncertainty is non-zero, indicating that there is more than one candidate for the most similar prototype. Interestingly, also the low probability candidates are still meaningful. Further analysis of this classification-probability induced ranking is provided in [1], specifically Fig. 3 (for experimental scanning transmission electron microscopy data of graphene) and Supplementary note 2.3 (for out-of-sample single-crystal structures from AFLOW and NOMAD). This study becomes particularly important if one wants to investigate atomic structures that are not include in the training set. In the mentioned figures of [1], we provide guidelines on how one can still use ARISE in this situation and interpret its predictions. \n",
"\n",
"In the next section, we will first extend to multi-species systems and then provide the code for this out-of-sample studies, where you may also try different structures (upload the geometry files to './data/ARISE/calculations_path'). "
]
},
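{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, such a ranking can also be reproduced directly from a probability vector, e.g., for the variable 'prediction' obtained for the bcc structure above (a small sketch, assuming 'prediction[0]' holds the classification probabilities):\n",
"\n",
"```python\n",
"top_n = 3\n",
"ranking = np.argsort(prediction[0])[::-1][:top_n]  # class indices, most likely first\n",
"for rank, class_idx in enumerate(ranking, start=1):\n",
"    print('{}. {} (p = {:.4f})'.format(rank, numerical_to_text_label[class_idx],\n",
"                                       prediction[0][class_idx]))\n",
"```"
]
},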
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification of multi-species single crystals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most important characteristics of ARISE is that it can treat much more classes than available methods (see [1] for a benchmarking study).\n",
"\n",
"So far we only considered mono-species systems. We consider the rock salt structure (NaCl) as an example for the extension of ARISE to multi-component systems.\n",
"\n",
"The usual way in treating multiple chemical species is to calculate [8]\n",
"\n",
"$$p(\\mathscr{X})_{b_1b_2l}^{\\alpha\\beta}=\\pi\\sqrt{\\frac{8}{2l+1}}\\sum_{m}(c_{b_1 lm}^{\\alpha})^{\\dagger} c_{b_2 lm}^\\beta,$$\n",
"\n",
"\n",
"where the coefficients $c_{b_1 lm}^{\\alpha}, c_{b_1 lm}^{\\beta}$ are obtained from separate densities for species $\\alpha,\\beta$. Then one may simply append all of these components, or average over certain combinations. We take the latter route, where it turns out that it suffices (for our purposes) to only select coefficients for $Z_1 = Z_2$ and ignore cross terms for $Z_1 \\neq Z_2$. Then, we consider all substructures, i.e., we compute SOAP according to \n",
"\n",
"$$ p(\\mathscr{X})_{b_1b_2l}=\\pi\\sqrt{\\frac{8}{2l+1}}\\sum_{m}(c_{b_1 lm})^{\\dagger} c_{b_2 lm}, $$\n",
"\n",
"where we sit on all Na atoms and consider only Na atoms, sit on all Na atoms and consider only Ca atoms etc. This procedure yields 4 vectors of equal length, which are averaged. Generalization to structures with arbitrary number of chemical species is straightforward - and so we end up with a descriptor that is independent of 1. the number of atoms, 2. number of chemical species, and 3. incorporates all relevant symmetries (rotational invariance, translational invariance, and invariance with respect to permutations of identical atoms).\n",
"\n",
"Coming back to our implementation, an important keyword is 'scale_elment_sensitive' which is by default set to 'True' - this way, we isotropically scale each substructure separately (one may also not do this, but the model we provide her is trained for 'scale_element_sensitive'==True):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"geometry_files = [NaCl_prototype_path]\n",
"NaCl_structure = read(NaCl_prototype_path, ':', format='aims')[0]\n",
"view(NaCl_structure, viewer='ngl')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictions, uncertainty = ARISE.analyze(geometry_files, mode='global', save_descriptors=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ARISE.analyze_predictions(geometry_files, predictions, uncertainty)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One can see that NaCl is correctly predicted with low mutual information."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may try to choose a different prototype and analyze its predictions in the following cells. You can choose one of the prototypes of [1] (see also Supplementary Table 4-6 for a complete list), or other examples from AFLOW (including those explored in Supplementary Figure 2).\n",
"\n",
"First we define some functions for loading (no need to understand this in detail, so you can skip for now):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load all prototypes\n",
"from ai4materials.models.spm_prediction_tools import load_all_prototypes\n",
"\n",
"\n",
"\n",
"#########################################################\n",
"# PARAMETERS WHICH YOU MAY WANT TO CHANGE \n",
"#########################################################\n",
"\n",
"# For the training prototypes:\n",
"periodic = True # if False, take supercell structure with at least min_atoms atoms.\n",
"min_atoms = 100\n",
"# Exception: nanotubes, which are here of fixed length. For quippy, the SOAP descriptor\n",
"# does not change significantly such that change of length does not matter (at least for \n",
"# Carbon nanotubes).\n",
"\n",
"\n",
"# Change this value to smooth the fluctuating predictions in case of high-uncertainty points. \n",
"# Please also have a look at the remark in the next cell.\n",
"n_iter = 1000\n",
"\n",
"#########################################################\n",
"\n",
"\n",
"prototypes = load_all_prototypes(periodic_boundary_conditions=[False])\n",
"for prototype in prototypes:\n",
" if not prototype.info['dataset'] == 'Nanotubes':\n",
" if periodic:\n",
" prototype.set_pbc(True)\n",
" else:\n",
" prototype.set_pbc(True)\n",
" replica = 0\n",
" third_dimension = replica\n",
" while (prototype*(replica,replica,third_dimension)).get_number_of_atoms()<min_atoms:\n",
" replica += 1\n",
" third_dimension += 1\n",
" if prototype.info['dataset'] == '2D_materials':\n",
" third_dimension = 1\n",
" \n",
" prototype *= (replica, replica, replica)\n",
" \n",
"\n",
"training_prototypes = [_.info['crystal_structure'] for _ in prototypes]\n",
"name_to_material_type_dict = {_.info['crystal_structure'] : _.info['dataset'] for _ in prototypes}\n",
"#prototypes_geo_files = [os.path.join(prototype_path,\n",
"# prototype.info['crystal_structure'],\n",
"# prototype.info['crystal_structure'] + '.in')]\n",
"\n",
"\n",
"AFLOW_other_protos_path = './data/ARISE/single_crystals/exploration_examples'\n",
"AFLOW_geo_files = os.listdir(AFLOW_other_protos_path)\n",
"out_of_sample_prototypes = [_.split('.')[0] for _ in AFLOW_geo_files] # careful if have points in filename!!\n",
"#AFLOW_geo_files = [os.path.join(AFLOW_other_protos_path, _) for _ in AFLOW_geo_files]\n",
"#other_AFLOW_prototypes = [read(_, ':', format='aims') for _ in AFLOW_geo_files]\n",
" \n",
" \n",
"def save_pred_and_u(predictions, uncertainty, file_name):\n",
" np.save(file_name+'_predictions.npy', predictions)\n",
" pickle.dump(np.array(predictions), open(file_name+'_uncertainty.pkl', \"wb\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"color:red\">Some practical tips:</span> \n",
"* When executing the code below, don't be surprised if the predictions are not identical between different runs. Due to the stochastic nature of the model, this may very well happen, especially for out-of-sample predictions. Still, the final class prediction (via the argmax operation) remains constant for the parameters being used here. \n",
"* To arrive at more constant classification probabilties (and uncertainty quantifications), one may increase the stochastic forward passes (see 'n_iter' argument in the above cell). The increase of computational cost for larger values of n_iter is typicall mild (unless you go to extreme values with billions/trillions of forward passes). \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to investigate the training prototypes used in [1]. Click 'Run Interact' to start a calculation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@widgets.interact_manual(prototype_name=training_prototypes)\n",
"def given_geo_predict_and_analyze_training_prototypes(prototype_name):\n",
" print('Start calculation.')\n",
" geometry_file_path = os.path.join(prototype_path,\n",
" name_to_material_type_dict[prototype_name], \n",
" prototype_name)\n",
" geometry_file = [_ for _ in os.listdir(geometry_file_path) if _[-3:]=='.in']\n",
" if not len(geometry_file)==1:\n",
" raise ValueError(\"Poluted prototype path. In [1] we only have one representative for each \\\n",
" structural class\")\n",
" geometry_file = os.path.join(geometry_file_path, geometry_file[0])\n",
" logger_level = logger.getEffectiveLevel()\n",
" logger.setLevel(0)\n",
" predictions, uncertainty = ARISE.analyze([geometry_file], mode='global', n_iter=n_iter)\n",
" save_pred_and_u(predictions, uncertainty, os.path.join(calculations_path,\n",
" prototype_name))\n",
" logger.setLevel(logger_level)\n",
" print('Finished calculation.')\n",
" top_n_labels_list = ARISE.analyze_predictions([geometry_file], predictions,\n",
" uncertainty, return_predicted_prototypes=True)\n",
" #np.save(os.path.join(calculations_path, 'top_n_labels.npy'), np.array(top_n_labels_list[0], dtype=object))\n",
" pickle.dump(top_n_labels_list[0], open(os.path.join(calculations_path, 'top_n_labels.pkl'), \"wb\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may also want to have a look at the top predicted prototypes. When executing the next cell, an interactive menu will open up where you can change via \"Ranking_idx\" which of the top predictions you want to have a look at and via \"Supercell\" which replicas you want to inspect - exception being nanotubes which have fixed length. Executing the cell after the following one visualizes the selected structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@widgets.interact(Rankind_index=[1,2,3], Supercell=[(1,1,1),(2,2,2), (3,3,3), (4,4,4)])\n",
"def vis_predicted_structures(Rankind_index, Supercell):\n",
" #top_n_labels = np.load(os.path.join(calculations_path, 'top_n_labels.npy'))\n",
" top_n_labels = pickle.load(open(os.path.join(calculations_path, 'top_n_labels.pkl'), \"rb\"))\n",
" structure_label = top_n_labels[Rankind_index-1] # Rankind_idx = 1,2,3,.. -> subtract 1\n",
" geometry_file_path = os.path.join(prototype_path, \n",
" name_to_material_type_dict[structure_label],\n",
" structure_label)\n",
" \n",
" geometry_file = [_ for _ in os.listdir(geometry_file_path) if _[-3:]=='.in']\n",
" if not len(geometry_file)==1:\n",
" raise ValueError(\"Poluted prototype path. In [1] we only have one representative for each \\\n",
" structural class\")\n",
" geometry_file = os.path.join(geometry_file_path, geometry_file[0])\n",
" structure = read(geometry_file, ':', 'aims')[0]\n",
" if not structure.cell.flatten().any()==0.0:\n",
" structure *= Supercell # only if lattice defined, create supercell\n",
" write(os.path.join(calculations_path, 'ranking_structure.in'), structure, format='aims')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"view(read(os.path.join(calculations_path, 'ranking_structure.in'),':', 'aims'), viewer='ngl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please run the following cell if you want to study out-of-sample structures (including the prototypes investigated in [1]). If you want to inspect your own, new structures, just upload the to './data/ARISE/calculations_path' and rerun the code cell before 'Some practical tips'). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#@widgets.interact(prototype_name=out_of_sample_prototypes)\n",
"@widgets.interact_manual(prototype_name=out_of_sample_prototypes)\n",
"def given_geo_predict_and_analyze_out_of_sample(prototype_name):\n",
" print('Start calculation.')\n",
" geometry_file = os.path.join(AFLOW_other_protos_path, prototype_name + '.in')\n",
" logger_level = logger.getEffectiveLevel()\n",
" logger.setLevel(0)\n",
" predictions, uncertainty = ARISE.analyze([geometry_file], mode='global', n_iter=n_iter)\n",
" save_pred_and_u(predictions, uncertainty, os.path.join(calculations_path,\n",
" prototype_name))\n",
" logger.setLevel(logger_level)\n",
" print('Finished calculation.')\n",
" top_n_labels_list = ARISE.analyze_predictions([geometry_file], predictions, \n",
" uncertainty, return_predicted_prototypes=True) \n",
" #np.save(os.path.join(calculations_path, 'top_n_labels.npy'), np.array(top_n_labels_list[0], dtype=object))\n",
" pickle.dump(top_n_labels_list[0], open(os.path.join(calculations_path, 'top_n_labels.pkl'), \"wb\"))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may also want to have a look at the top predicted prototypes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@widgets.interact(Rankind_index=[1,2,3], Supercell=[(1,1,1),(2,2,2), (3,3,3), (4,4,4)])\n",
"def vis_predicted_structures(Rankind_index, Supercell):\n",
" #top_n_labels = np.load(os.path.join(calculations_path, 'top_n_labels.npy'))\n",
" top_n_labels = pickle.load(open(os.path.join(calculations_path, 'top_n_labels.pkl'), \"rb\"))\n",
" structure_label = top_n_labels[Rankind_index-1] # Rankind_idx = 1,2,3,.. -> subtract 1\n",
" geometry_file_path = os.path.join(prototype_path, \n",
" name_to_material_type_dict[structure_label],\n",
" structure_label)\n",
" \n",
" geometry_file = [_ for _ in os.listdir(geometry_file_path) if _[-3:]=='.in']\n",
" if not len(geometry_file)==1:\n",
" raise ValueError(\"Poluted prototype path. In [1] we only have one representative for each \\\n",
" structural class\")\n",
" geometry_file = os.path.join(geometry_file_path, geometry_file[0])\n",
" structure = read(geometry_file, ':', 'aims')[0]\n",
" if not structure.cell.flatten().any()==0.0:\n",
" structure *= Supercell # only if lattice defined, create supercell\n",
" write(os.path.join(calculations_path, 'ranking_structure.in'), structure, format='aims')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"view(read(os.path.join(calculations_path, 'ranking_structure.in'),':', 'aims'), viewer='ngl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Polycyrstal classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To classify polycrystals, we introduce the strided pattern matching (SPM) framework, sketched in the following figure:\n",
"\n",
"![](./assets/ARISE/concept_figure_local.svg)\n",
"\n",
"***The first line*** depicts this procedure for slab-like systems and ***the second line*** for bulk materials. \n",
"\n",
"In both scenarios, a box of certain size is scanned across the whole crystal volume with a certain stride. For each stride, the classification model is applied, yielding both predictions and uncertainty values. These values are then rearranged in 2D or 3D maps. One can construct these for each of the classification probabilities and the uncertainty (here as quantified by mutual information), or also for the most similar prototypes (i.e., the argmax prediction). \n",
"\n",
"\n",
"This way one can discover the most prevalent symmetries (via classification maps) and detect defective regions. In particular, the uncertainty can indicate when a particular region is defective or far outside the known training examples, thus allowing for instance to discover grain boundaries, or, as it is depicted above, a cubic-shaped precipitate."
]
},
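{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, SPM amounts to the following loop (a schematic sketch with a hypothetical 'classify' callable; the production implementation is invoked via mode='local' below):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def strided_pattern_matching(atoms, classify, box_size, stride):\n",
"    # classify(box_atoms) -> (classification probabilities, mutual information)\n",
"    lower = atoms.positions.min(axis=0)\n",
"    upper = atoms.positions.max(axis=0)\n",
"    grids = [np.arange(lower[k], upper[k], stride[k]) for k in range(3)]\n",
"    maps = {}\n",
"    for x in grids[0]:\n",
"        for y in grids[1]:\n",
"            for z in grids[2]:\n",
"                origin = np.array([x, y, z])\n",
"                # select all atoms inside the box anchored at this stride position\n",
"                mask = np.all((atoms.positions >= origin) &\n",
"                              (atoms.positions < origin + box_size), axis=1)\n",
"                maps[(x, y, z)] = classify(atoms[np.where(mask)[0]])\n",
"    return maps  # rearranged into 2D/3D maps downstream\n",
"```"
]
},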
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As introduced in the 'Quickstart' section, the local analysis can be obtained by simply changing the mode to 'local', while one also has to provide stride and box size (standard settings stride=1.0, box_size=12.0). \n",
"\n",
"We investigate exemplarily the mono-species structure shown in the first line of the above plot (and discussed in more detail in Fig. 2 of [1]):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"geometry_files = ['./data/ARISE/polycrystals/four_grains_elemental_solid.xyz']\n",
"view(read(geometry_files[0], ':', 'xyz')[0], viewer='ngl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We choose a box size that is roughly equal to the slab thickness and perform the SPM analysis. We choose here a very coarse stride and the smaller we will choose it the better the resolution will get. We select a stride along the z-axis that is larger than the slab thickness, a setting for which no stride in this direction will be made.\n",
"\n",
"A folder (in './data/ARISE/calculations_path') in which all relevant information is saved is created automatically. The name of the folder is given by the structure file being passed and the options for SPM (box size and stride).\n",
"Note that if you run the cell below multiple times, a new folder (with \"_run_i\", i=1,... appended to the file name) will be created, so no data will be lost."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"box_size = [16.0]\n",
"# you may also pass a list for both stride and box size if you have more than one geometry file.\n",
"stride = [[15.0, 15.0, 20.0]] \n",
"# choosing the stride in z larger as the extension of the material (here as 20.0, exceeding \n",
"# the slab thickness by about 4 angstrom) will prohibit that a stride is made in z direction. \n",
"\n",
"previous_level = logger.getEffectiveLevel()\n",
"logger.setLevel(0) # switch off logging, you may change this.\n",
"predictions, uncertainty = ARISE.analyze(geometry_files, mode='local',\n",
" stride=stride, box_size=box_size, configs=configs)\n",
"logger.setLevel(previous_level)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Within the automatically created folder, the results are saved in the folder 'results_folder' (in this case, the probabilities as 'four_grains_elemental_solid.xyz_probabilities.npy' and the uncertainty as 'four_grains_elemental_solid.xyz_mutual_information.npy'), where each of the numpy arrays has in general the shape (#classes, z_coordinate, y_coordinate, x_coordinate), where here #classes=108 and the coordinates refer to the spatial positions of the box. In this case of a slab structure, the array shape is (#classes, y_coordinate, x_coordinate) while in general it is (#classes, z_coordinate, y_coordinate, x_coordinate)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Uncertainty (mutual information) array shape = {}'.format(uncertainty[0]['mutual_information'].shape))\n",
"print('Predictions (i.e., classification probabilities) array shape {}'.format(predictions[0].shape))"
]
},
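{
"cell_type": "markdown",
"metadata": {},
"source": [
"From these arrays, the map of most similar prototypes (the argmax prediction mentioned above) can be obtained directly, for instance:\n",
"\n",
"```python\n",
"argmax_map = predictions[0].argmax(axis=0)  # here: shape (y_coordinate, x_coordinate)\n",
"```"
]
},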
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We exemplarily visualize the four classes fcc, bcc, hcp, and diamond alongside the mutual information in the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib\n",
"classes_of_interest = ['fcc_Cu_A_cF4_225_a', 'bcc_W_A_cI2_229_a', 'hcp_Mg_A_hP2_194_c', 'diam_C_A_cF8_227_a']\n",
"uncertainty_quantifier = ['mutual_information']\n",
"\n",
"\n",
"\n",
"fig_classes, ax_classes = plt.subplots(1, len(classes_of_interest), figsize=(20,5))\n",
"for idx, class_of_interest in enumerate(classes_of_interest):\n",
" class_idx = text_to_numerical_label[class_of_interest]\n",
" classification_probabilities = predictions[0][class_idx]\n",
" \n",
" \n",
" ax_class = ax_classes[idx]\n",
" ax_class.imshow(classification_probabilities)\n",
" \n",
" cmap = matplotlib.cm.get_cmap(name='viridis')\n",
" # set the color for NaN values\n",
" cmap.set_bad(color='lightgrey')\n",
"\n",
" cax = ax_class.imshow(classification_probabilities, interpolation='none', vmin=0.0, vmax=1.0, cmap=cmap,\n",
" origin='lower')\n",
"\n",
"\n",
" ax_class.set_title('Class {}'.format(class_of_interest))\n",
"\n",
" ax_class.set_xlabel(u'x stride #')\n",
" ax_class.set_ylabel(u'y stride #')\n",
"fig_classes.colorbar(cax)\n",
" \n",
" \n",
" \n",
"fig_u, ax_u = plt.subplots(1,1, figsize=(20,5))\n",
"for u in uncertainty_quantifier:\n",
" uncertainty_values = uncertainty[0][u]\n",
" \n",
" cmap = matplotlib.cm.get_cmap(name='hot')\n",
" # set the color for NaN values\n",
" cmap.set_bad(color='lightgrey')\n",
" \n",
" uax = ax_u.imshow(uncertainty_values, cmap=cmap, origin='lower', vmin=0.0)\n",
" ax_u.set_title('Mutual_information')\n",
"fig_u.colorbar(uax)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The finer you make the stride, the better the resolution will get. In [1], we employ a stride of $1 \\overset{\\circ}{\\mathrm {A}}$, resulting in the following high-resolution picture, revealing further details at the grain boundary (see also supplementary figure 7 and 8 of [1]):\n",
"\n",
"\n",
"![](./assets/ARISE/synthetic_polycrystals_figure_results_mono_species.svg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Unsupervised analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One important aspect of [1] is the application of unsupervised learning techniques to analyze the internal neural network representations. This allows to *explain* the trained model by showing that regularities in the neural-network representation space correspond to physical characteristics. Using this strategy, one can conduct exploratory analysis, with the goal to discover patterns in high-dimensional representation space and to relate them to the crystal structure - leading to the discovery of defective regions (e.g., grain boundaries or regions that are distorted in similar/distinct fashion, cf. Fig. 2 and 4 of [1]).\n",
"\n",
"Specifically, we apply clustering (HDBSCAN https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) to identify groups in the high-dimensional neural-network representations. Furthermore the manifold-learning method UMAP (https://umap-learn.readthedocs.io/en/latest/index.html) is employed to obtain a low-dimensional embedding (here 2D) that can be interpreted by human eye. \n",
"\n",
"The most important parameters for the unsupervised techniques employed in this work are the following:\n",
"\n",
"* HDBSCAN: min_cluster_size (int) - determines the minimum number of points a cluster must contain. In more detail, HDBSCAN employs (agglomerative) hierarchical clustering (https://www.displayr.com/what-is-hierarchical-clustering/) to construct a hierarchical cluster tree, where only those clusters are considered that contain as many data points as defined via min_cluster_size. Cluster assignments are extracted by computing a cluster stability value and the most stable clusters are used to construct a final, so-called flat clustering.\n",
"\n",
"\n",
"* UMAP: n_neighbors (int) - determines the trade-off between capturing local (small n_neighbors) vs global (large n_neighbors) relationships in the data. Specifically, UMAP constructs a topological representation of the data (in practice corresponding to a specific type of k-neighbor graph), while some assumptions are made (e.g., the data being uniformly distributed on a Riemannian manifold). Then a low-dimensional representation is found by matching its topological representation with the one of the original data (where the matching procedure amounts to the definition of a cross-entropy loss function and the application of stochastic gradient descent). The parameter n_neighbors influences the high-dimensional topological representation. The distances in the low-dimensional representation can also be tuned via 'min_dist', while this parameter is typically considered more as only tuning the visualization / the distances in the low-dimensional embedding - see also https://pair-code.github.io/understanding-umap/ for a discussion with hands-on examples."
]
},
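{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal usage sketch for this step (assuming the flattened neural-network representations are available as an array 'nn_representations' of shape (n_points, n_features); the parameter values are placeholders):\n",
"\n",
"```python\n",
"import hdbscan\n",
"import umap\n",
"\n",
"# group the high-dimensional representations; label -1 marks outliers\n",
"clusterer = hdbscan.HDBSCAN(min_cluster_size=500)\n",
"cluster_labels = clusterer.fit_predict(nn_representations)\n",
"\n",
"# two-dimensional embedding for visualization\n",
"reducer = umap.UMAP(n_neighbors=500, n_components=2, min_dist=0.1)\n",
"embedding = reducer.fit_transform(nn_representations)  # shape (n_points, 2)\n",
"```"
]
},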
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell below, we consider again the mono-species polycrystal with four crystalline grains (the slab structure depicted above and discussed in Fig. 2 of [1], where we chose a stride of 1 $ \\overset{\\circ}{\\mathrm {A}}$ in x,y and a box size of 16 $ \\overset{\\circ}{\\mathrm {A}}$). \n",
"\n",
"You can change the parameters to get a feeling for the importance of min_clsuter_size, n_neighbors. \n",
"\n",
"Also you may choose the color scale according to ARISE's predictions or uncertainty (mutual information), or the HDBSCAN cluster labels. \n",
"\n",
"\n",
"***Remark:*** For the HDBSCAN clustering, there are two options: Either you can choose the 'cluster_labels', which is the cluster assignment that arises from the standard (flat) clustering procedure, or the 'argmax_clustering' color scale, which is determined from the soft-clustering feature of HDBSCAN. Specifically, HDBSCAN allows to calculate a vector for each point, with component i capturing the \"probability\" that the point is a member of cluster i. Then, we can infer a cluster assignment for points that would normally be considered outliers by taking the cluster whose membership probability is maximal (while we consider the point an outlier if none of its cluster probabilities exceeds a certain threshold - here, we choose 10 %). This procedure allows to obtain reasonable clusterings also in the presence of higher levels of noise (as encountered in atomic electron tomography data[9-12], which is analyzed using ARISE in Fig. 4 of [1]). Here, it will not make a qualitative difference in performance, i.e., the basic structure of the polycrystal (four grains + grain boundaries) can be recovered by both clustering procedures."
]
},
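{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of this soft-clustering ('argmax_clustering') assignment, using HDBSCAN's membership vectors (again assuming the representations in 'nn_representations'; note that prediction_data=True is required):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import hdbscan\n",
"\n",
"clusterer = hdbscan.HDBSCAN(min_cluster_size=500,\n",
"                            prediction_data=True).fit(nn_representations)\n",
"# membership[i, j] ~ \"probability\" that point i belongs to cluster j\n",
"membership = hdbscan.all_points_membership_vectors(clusterer)\n",
"argmax_labels = membership.argmax(axis=1)\n",
"argmax_labels[membership.max(axis=1) < 0.1] = -1  # below 10% -> outlier\n",
"```"
]
},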
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from ipywidgets import interactive\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"from collections import Counter\n",
"\n",
"n_neighbors_list = [25, 50, 100, 250, 1000] \n",
"min_cluster_sizes = n_neighbors_list\n",
"\n",
"analysis_folder_path = './data/ARISE/unsupervised_learning'\n",
"file_name = 'four_grains_elemental_solid.xyz_stride_1.0_1.0_16.0_box_size_16.0_.tar.gz_'\n",
"\n",
"layer_name = 'dense_1' \n",
"# other possible values : 'soap', dense_0, dense_2, 'final_dense' corresponding to input, hidden layers and output\n",
"\n",
"s = 32\n",
"edgecolors = 'face'\n",
"\n",
"color_scales = ['cluster_labels', 'argmax_clustering', 'ARISE_predictions', 'ARISE_mutual_information']\n",
"\n",
"\n",
"df_dict = {}\n",
"\n",
"for min_cluster_size in min_cluster_sizes:\n",
" for n_neighbors in n_neighbors_list:\n",
"\n",
" quantities_to_plot = {#'cluster_probs': [], # soft clustering probabilties\n",
" #'outlier_scores': [], # GLOSH -> see HDBSCAN docs\n",
" 'cluster_labels': [], \n",
" 'embedding' : [],\n",
" 'argmax_clustering': [], \n",
" }\n",
" for q_to_plot in quantities_to_plot:\n",
" q = quantities_to_plot[q_to_plot]\n",
" q_filename = os.path.join(analysis_folder_path,\n",
" file_name \n",
" + '_' \n",
" + layer_name \n",
" + '_mincsize_'\n",
" + str(min_cluster_size) \n",
" + '_nneighbors_'\n",
" + str(n_neighbors)\n",
" + '_' + q_to_plot + '.npy')\n",
" quantities_to_plot[q_to_plot] = np.load(q_filename)\n",
" embedding = quantities_to_plot['embedding']\n",
"\n",
" # save embedding\n",
" df = pd.DataFrame({'Dim_1': embedding[:, 0], 'Dim_2': embedding[:, 1]})\n",
" \n",
" # save possible color scales\n",
" df['cluster_labels'] = quantities_to_plot['cluster_labels']\n",
" df['argmax_clustering'] = quantities_to_plot['argmax_clustering']\n",
" #df['ARISE_predictions'] = pickle.load(open(os.path.join(analysis_folder_path,\n",
" # 'ARISE_predictions.pkl'), \"rb\"))\n",
" #df['ARISE_mutual_information'] = pickle.load(open(os.path.join(analysis_folder_path,\n",
" # 'uncertainty_filtered_mutual_information.pkl'), \"rb\"))\n",
" \n",
" with open(os.path.join(analysis_folder_path, 'ARISE_predictions.json')) as f:\n",
" ARISE_predictions = json.load(f)\n",
" df['ARISE_predictions'] = ARISE_predictions\n",
" df_dict[(min_cluster_size, n_neighbors)] = df\n",
" df['ARISE_mutual_information'] = np.load(os.path.join(analysis_folder_path,\n",
" 'uncertainty_filtered_mutual_information.npy'))\n",
" \n",
"@widgets.interact_manual(min_cluster_size=min_cluster_sizes,\n",
" n_neighbors=n_neighbors_list,\n",
" color_scale=color_scales)\n",
"def f(min_cluster_size, n_neighbors, color_scale):\n",
" df = df_dict[(min_cluster_size, n_neighbors)]\n",
" df['target'] = df[color_scale]\n",
"\n",
" #if color_scale == 'ARISE_predictions':\n",
" # df['target'] = [text_to_numerical_label[_] for _ in df['target'].values]\n",
"\n",
" vmin = min(df['target'].values)\n",
" vmax = max(df['target'].values)\n",
"\n",
" if color_scale == 'ARISE_mutual_information':\n",
" palette = 'hot'\n",
" elif color_scale == 'ARISE_predictions':\n",
" palette = None\n",
" vmin = None\n",
" vmax = None\n",
" unique_targets = np.unique(df['target'].values)\n",
" target_to_int = {t : i for t,i in zip(unique_targets, range(len(unique_targets)))}\n",
" int_to_target = {i : t for t,i in zip(unique_targets, range(len(unique_targets)))}\n",
" #palette = plt.get_cmap('tab10', len(unique_targets))\n",
" #palette = sns.color_palette(palette='tab10', n_colors=len(unique_targets))\n",
" palette_cmap = plt.get_cmap('tab10', np.max(range(len(unique_targets)))-np.min(range(len(unique_targets)))+1)\n",
" palette = sns.color_palette(palette_cmap.colors)\n",
"\n",
" df['target'] = [target_to_int[_] for _ in df['target']]\n",
" else:\n",
" palette = 'viridis'\n",
" print('Frequency per label : {}'.format(dict(Counter(df['target'].values))))\n",
" print('Remark: -1 = outlier')\n",
"\n",
" fig, axs = plt.subplots(1, 2, figsize=(15,10))\n",
" fig.suptitle('min_cluster_size (HDBSCAN) = {}, n_neighbors (UMAP) = {}, color scale = {}'.format(min_cluster_size, n_neighbors, color_scale))\n",
"\n",
" ax = axs[0]\n",
" ax_ = sns.scatterplot(ax=ax, x='Dim_1', y='Dim_2',\n",
" data=df, s=s, hue='target', edgecolor=edgecolors, \n",
" palette=palette, vmin=vmin, vmax=vmax)\n",
" ax.set_aspect('equal')\n",
" ax.axis('off')\n",
" ax.set_title('UMAP 2D embedding of NN representations')\n",
"\n",
" if color_scale == 'ARISE_predictions':\n",
" #plt.legend()\n",
" ax.get_legend().remove()\n",
" #ax.legend(range(len(unique_targets)), labels = [int_to_target[_] for _ in range(len(unique_targets))])\n",
" pass\n",
" else:\n",
" norm = plt.Normalize(df['target'].min(), df['target'].max())\n",
" sm = plt.cm.ScalarMappable(cmap=palette, norm=norm)\n",
" sm.set_array([])\n",
"\n",
" # Remove the legend and add a colorbar\n",
" ax.get_legend().remove()\n",
" ax.figure.colorbar(sm)\n",
"\n",
" #plt.legend()\n",
"\n",
" # second part of figure: crystal maps\n",
" color_scale_ = color_scale\n",
" path_to_crystal_maps = os.path.join(analysis_folder_path, \n",
" layer_name+'_neighbors_'+str(n_neighbors)\n",
" +'_cluster_analysis_'+color_scale_\n",
" +'_min_csize_'+str(min_cluster_size)+'.npy')\n",
" if color_scale == 'ARISE_predictions':\n",
" color_scale_ = 'argmax_preds'\n",
" path_to_crystal_maps = os.path.join(analysis_folder_path, 'ARISE_argmax_predictions_crystal_map.npy')\n",
" elif color_scale == 'ARISE_mutual_information':\n",
" color_scale_ = 'mutual_information'\n",
" path_to_crystal_maps = os.path.join(analysis_folder_path, 'ARISE_mutual_information_crystal_map.npy')\n",
" crystal_map = np.load(path_to_crystal_maps)\n",
"\n",
" vmin = None\n",
" vmax = None\n",
" if color_scale == 'ARISE_predictions': \n",
" palette = palette_cmap\n",
" vmin = np.min(range(len(unique_targets)))-.5\n",
" vmax = np.max(range(len(unique_targets)))+.5\n",
" crystal_map_shape = crystal_map.shape\n",
" crystal_map = crystal_map.flatten()\n",
"\n",
" for idx, c in enumerate(crystal_map):\n",
" if not crystal_map[idx] in list(numerical_to_text_label.keys()):\n",
" continue # nan number\n",
" else:\n",
" if numerical_to_text_label[crystal_map[idx]] in target_to_int.keys():\n",
" crystal_map[idx] = target_to_int[numerical_to_text_label[crystal_map[idx]]]\n",
" else:\n",
" crystal_map[idx] = target_to_int['other']\n",
"\n",
" #crystal_map = np.full(crystal_map_shape, np.nan)\n",
" crystal_map = np.reshape(crystal_map, crystal_map_shape)\n",
"\n",
"\n",
"\n",
" #print(Counter(crystal_map.flatten()))\n",
" ax = axs[1]\n",
" cax = ax.imshow(crystal_map, cmap=palette, origin='lower', vmin=vmin, vmax=vmax)\n",
" #ax.set_xticks([])\n",
" #ax.set_yticks([])\n",
"\n",
" if color_scale == 'ARISE_predictions':\n",
" cbar = fig.colorbar(cax, ticks = range(len(unique_targets)))\n",
" cbar.ax.set_yticklabels([int_to_target[_] for _ in range(len(unique_targets))]) \n",
"\n",
"\n",
" ax.axis('off')\n",
" if not color_scale == 'ARISE_predictions':\n",
" ax.set_title('Crystal map (real space)')\n",
"\n",
"\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A consistent picture with four main clusters corresponding to the four crystalline grains separated by grain boundaries arises for min_cluster_size > 250 (and also n_neighbors > 250, while also at lower values reasonable embeddings are obtained, i.e., points corresponding to different grains are separated). For more details on the interpretation of these figures, we refer to [1]."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You have reached the end of this tutorial. \n",
"\n",
"*Please let us know if you have any questions, wishes, or suggestions for improvement. Feel free to reach out to us, e.g., via mail (leitherer@fhi-berlin.mpg.de, andreas.leitherer@gmail.com).*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[1] ARISE: A. Leitherer, A. Ziletti, and L. M. Ghiringhelli, arXiv:2103.09777, (2021)\n",
"\n",
"[2] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, 1050-1059 (2016)\n",
"\n",
"[3] Gal, Y. Uncertainty in deep learning. Ph.D. thesis, University of Cambridge (2016)\n",
"\n",
"[4] Mehl, M. J. *et al.* The AFLOW library of crystallographic prototypes: part 1. Comput. Mater. Sci. 136, S1–S828 (2017).\n",
"\n",
"[5] A. P. Bartok *et al.* Physical Review Letters 104, 136403 (2010)\n",
"\n",
"[6] A. P. Bartok *et al.* Physical Review B 87, 184115 (2013)\n",
"\n",
"[7] The quippy software is available for non-commercial use from www.libatoms.org (or https://github.com/libAtoms/QUIP)\n",
"\n",
"[8] De, S., Bartók, A. P., Csányi, G. & Ceriotti, M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 18, 13754–13769 (2016)\n",
"\n",
"[9] Chen, C.-C. *et al.* Three-dimensional imaging of dislocations in a nanoparticle at atomic resolution. Nature 496, 74–77 (2013)\n",
"\n",
"[10] Miao, J., Ercius, P. & Billinge, S. J. Atomic electron tomography: 3D structures without crystals. Science 353, aaf2157–aaf2157 (2016).\n",
"\n",
"[11] Yang, Y. *et al.* Deciphering chemical order/disorder and material properties at the single-atom level. Nature 542, 75–79 (2017).\n",
"\n",
"[12] Zhou, J. *et al.* Observing crystal nucleation in four dimensions using atomic electron tomography. Nature 570, 500–503 (2019)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please refer to [1] for additional information and references."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}