In this tutorial, we introduce the computational framework ARISE (<ins>Ar</ins>tificial <ins>I</ins>ntelligence based <ins>S</ins>tructure <ins>E</ins>valuation) for crystal-structure recognition in single- and polycrystalline systems [1]. ARISE can treat more than 100 crystal structures, including one-dimensional, two-dimensional, and bulk materials, in a robust and threshold-independent fashion. The Bayesian-deep-learning model yields not only a classification but also uncertainty estimates, which are principled (i.e., they approximate the uncertainty estimates of a Gaussian process) [6,7]. These uncertainty estimates correlate with crystal order.

For additional details, please refer to
[1] A. Leitherer, A. Ziletti, and L. M. Ghiringhelli, arXiv:2103.09777 (2021)
ARISE is part of the code framework *ai4materials* (https://github.com/angeloziletti/ai4materials).
The outline of this tutorial is as follows:
* Quickstart (jump here if you want to directly use ARISE)
* Single-crystal classification
* Polycrystal classification
* Unsupervised learning / exploratory analysis of the trained model
%% Cell type:code id: tags:
``` python
version = 'py3'  # or 'py2' -> defines the Python version used by quippy
```
%% Cell type:markdown id: tags:
## Quickstart
%% Cell type:markdown id: tags:
Please specify the geometry files that you want to analyze in the list 'geometry_files'.
ARISE can be used in two modes: global and local (controlled via the keyword 'mode'). If mode = 'local', the strided pattern matching (SPM) framework introduced in [1] is employed, which requires the definition of stride and box size (standard setting: $4 \overset{\circ}{\mathrm {A}}$ stride in all directions and $16 \overset{\circ}{\mathrm {A}}$ box size).
Typically, one uses the global mode for single crystals and the local mode for large (polycrystalline) samples (to zoom into a given structure and detect structural defects such as grain boundaries). However, one can also mix the modes and investigate, for instance, the global assignments for polycrystals (see, for instance, the analysis of STEM graphene images with grain boundaries in Fig. 3 of Ref. [1]).
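As a minimal sketch of such a setup (the geometry file path is a placeholder, and the values are illustrative assumptions):
%% Cell type:code id: tags:
``` python
# Minimal quickstart sketch; the geometry file path below is a placeholder.
geometry_files = ['./data/ARISE/my_structure.xyz']

mode = 'local'  # 'global' for single crystals, 'local' for polycrystals (SPM)

# Only needed for mode='local' (values in angstrom, one entry per geometry file)
stride = [[4.0, 4.0, 4.0]]
box_size = [16.0]
```
%% Cell type:markdown id: tags: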
The following sections provide more details on the internal workings of ARISE and SPM.
%% Cell type:markdown id: tags:
## The Bayesian-deep-learning model
%% Cell type:markdown id: tags:
Given an unknown atomic structure, the goal of crystal-structure recognition - in general terms - is to find the most similar prototype (for example, among those currently known to occur in nature). In the figure below, this list includes fcc, bcc, diamond, and hcp symmetry, but as we will see later, our framework allows us to treat a much broader range of materials.
The initial representation of a given atomic structure (single- or polycrystalline) consists of the atomic positions and chemical species symbols (as well as the lattice vectors for periodic systems).
The first step is to define how this input will be treated in the ARISE framework, i.e., how we arrive at a meaningful prediction.
%% Cell type:markdown id: tags:
### Classification of mono-species single crystals
%% Cell type:markdown id: tags:
Considering as an example the body-centered cubic (bcc) structure (see http://www.aflowlib.org/prototype-encyclopedia/A_cI2_229_a.html and also [2]), the prediction pipeline for single crystals can be sketched as follows:

%% Cell type:markdown id: tags:
Each of these steps will be explained in the following.
Before that, we want to briefly discuss the logging feature of *ai4materials* (based on https://docs.python.org/3/howto/logging.html). Specifically, we define a configuration object that introduces a specific folder structure. Three different folders are created automatically:
* 'desc_folder': here the descriptors that we are going to calculate later will be saved (as file_name.tar.gz)
* 'tmp_folder': temporary folder in which individual files are saved during the calculation; they are later merged into .tar.gz files and moved to 'desc_folder'
* 'results': folder where results are saved (e.g., the neural network predictions).
In practice, you only need to define the main folder in which the calculations should be saved. Here we choose 'calculations_path' defined at the beginning of this tutorial:
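A minimal sketch of this step, assuming the *ai4materials* utilities 'set_configs' and 'setup_logger' (the exact call may differ in your version):
%% Cell type:code id: tags:
``` python
from ai4materials.utils.utils_config import set_configs, setup_logger

# Assumed usage: create the folder structure described above and a logger.
# 'calculations_path' is defined at the beginning of this tutorial.
configs = set_configs(main_folder=calculations_path)
logger = setup_logger(configs, level='INFO', display_configs=False)
```
%% Cell type:markdown id: tags: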
We continue with the explanations of the prediction pipeline.
We first load the geometry file using ASE (https://wiki.fysik.dtu.dk/ase/index.html). We typically use the FHI-aims format for the geometry files (while ASE's geometry file parser provides compatibility with many other formats).
%% Cell type:code id: tags:
``` python
structure = read(bcc_prototype_path, ':', 'aims')[0]
view(structure, viewer='ngl')
```
%% Cell type:markdown id: tags:
To avoid dependence on the lattice parameters, we isotropically scale a given structure using the function 'get_nn_distance' from the *ai4materials* package. By default, we compute the radial distribution function (as approximated by the histogram of nearest-neighbor distances) and then choose the center of the maximally populated bin (i.e., the mode) as the nearest-neighbor distance:
%% Cell type:code id: tags:
``` python
scale_factor = get_nn_distance(structure)
print('Scale factor for bcc structure = {}'.format(scale_factor))
```
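%% Cell type:markdown id: tags:
Conceptually, the scale-factor computation can be sketched as follows (a minimal NumPy version that ignores periodic boundary conditions; not the actual *ai4materials* implementation):
%% Cell type:code id: tags:
``` python
import numpy as np
from scipy.spatial.distance import cdist

def nn_distance_mode(positions, n_bins=100):
    """Nearest-neighbor distance as the center of the most populated bin
    (i.e., the mode) of the nearest-neighbor-distance histogram."""
    distances = cdist(positions, positions)
    np.fill_diagonal(distances, np.inf)      # exclude self-distances
    nn_distances = distances.min(axis=1)     # one NN distance per atom
    counts, edges = np.histogram(nn_distances, bins=n_bins)
    i = counts.argmax()
    return 0.5 * (edges[i] + edges[i + 1])   # center of the modal bin
```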
%% Cell type:markdown id: tags:
Given an atomic structure, scaling of atomic positions and unit cell is summarized in the function 'scale_structure':
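A minimal sketch of the idea behind 'scale_structure' (assuming an ASE Atoms object; the actual *ai4materials* function may differ in its details):
%% Cell type:code id: tags:
``` python
def scale_structure_sketch(atoms):
    """Isotropically scale atomic positions and unit cell by the NN distance."""
    factor = get_nn_distance(atoms)
    scaled = atoms.copy()
    scaled.set_cell(atoms.get_cell() / factor, scale_atoms=False)
    scaled.set_positions(atoms.get_positions() / factor)
    return scaled
```
%% Cell type:markdown id: tags: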
Using the atomic positions $\mathbf{r}_i$ labeled by chemical species $c_i$ ($i = 1, ..., \text{N}_{\text{atoms}}$) directly as input to a classification model introduces several issues. Most importantly, physically meaningful invariances that we know to hold would not be respected (translational and rotational invariance as well as invariance under permutations of identical atoms). Also, the input dimension would depend on the number of atoms (although this could in principle be addressed with convolutional neural networks).
A well-known and well-tested materials-science descriptor that is by construction invariant to the above-mentioned symmetries is the Smooth Overlap of Atomic Positions (SOAP) descriptor [3-5].
In this tutorial (and in [1]), we employ the SOAP implementation made available via the quippy package (https://github.com/libAtoms/QUIP) for non-commercial use.
In short, given an atomic structure with N atoms, we obtain N SOAP vectors (respecting the mentioned invariances) that represent the local atomic environments of the atoms. The local atomic environment of an atom is defined as a sphere (centered at that particular atom) with a certain (tunable) cutoff radius $R_C$. Each atom within that sphere is represented by a Gaussian function (with a certain, tunable width $\sigma$), and the sum of these Gaussians defines the local density:
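In its simplest form (neglecting the optional weighting of the central atom and the smooth cutoff), this local density around atom $i$ can be written as

$$ \rho_i(\mathbf{r}) = \sum_{j:\, |\mathbf{r}_j - \mathbf{r}_i| \leq R_C} \exp\left( -\frac{|\mathbf{r} - \mathbf{r}_j|^2}{2 \sigma^2} \right). $$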
Applying this procedure to every atom gives a collection of SOAP vectors, which one may average. The default behavior in quippy (employed in [1]), however, is to average the expansion coefficients first and then perform the rotational average. The output of quippy is adapted using a typical averaging procedure (additional details, such as the treatment of "cross-correlation terms", are discussed in the next section and in the supplementary material of [1]).
First, we have to define some parameters of the SOAP descriptor, which are collected in the string descriptor_options that is used to create the SOAP descriptor object:
%% Cell type:code id: tags:
``` python
cutoff = 4.0
central_weight = 0.0  # relative weight of the central atom in the atomic density
```
%% Cell type:markdown id: tags:
The object quippy_SOAP_descriptor above has standard settings that allow one to reproduce the results of [1], but in principle one may choose different values - in particular for the SOAP parameters:
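As an illustration (assuming the py3 quippy bindings; the options string, parameter values, and the 'Z' specification here are assumptions, not the exact ARISE settings):
%% Cell type:code id: tags:
``` python
from quippy.descriptors import Descriptor

# Illustrative SOAP options string; values are assumptions, adapt as needed.
# Z={74} refers to the species of the bcc W prototype used above.
descriptor_options = ('soap cutoff={} l_max={} n_max={} atom_sigma={} '
                      'central_weight={} average=T n_Z=1 Z={{74}}').format(
    cutoff, 6, 9, 0.1, central_weight)

quippy_SOAP_descriptor = Descriptor(descriptor_options)
soap_vector = quippy_SOAP_descriptor.calc(structure)['data']  # averaged SOAP vector
```
%% Cell type:markdown id: tags: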
In [1], ARISE employs a Bayesian neural network as classification model, which not only yields classification probabilities but also allows us to quantify the predictive uncertainty.
Specifically, we employ so-called Monte Carlo (MC) dropout, introduced in [6,7], which builds on the stochastic regularization technique dropout. In dropout, individual neurons are randomly dropped in each layer, usually only during training, to avoid over-specialization of individual units and thus control overfitting. One can show that when dropout is also applied at test time, the resulting probabilistic model provides uncertainty estimates that approximate those of a Gaussian process.
In practice, for a given input, the model output is computed for several iterations (100-1000 typically suffice), yielding a collection of differing probability vectors - whereas in a standard, deterministic neural network, all predictions would be identical. Sampling the output layer of the Bayesian neural network corresponds to sampling from (an approximated version of) the true probability distribution of outputs. Averaging these forward passes gives an average classification probability vector, and the predicted class label can be inferred by selecting the most likely class (i.e., computing the argmax). Additional statistical information on the predictive uncertainty is contained in the classification probability samples. In [1], we employ the mutual information (as implemented in *ai4materials*) to obtain a single uncertainty-quantifying number from the output samples.
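The sampling and the mutual-information computation can be sketched as follows (a minimal NumPy version; 'predict_fn' is a hypothetical stand-in for a model call with dropout active at test time, not the actual *ai4materials* interface):
%% Cell type:code id: tags:
``` python
import numpy as np

def mc_dropout_predict(predict_fn, x, n_iter=100):
    """Collect n_iter stochastic forward passes and summarize them."""
    # samples has shape (n_iter, n_classes); each row is one probability vector
    samples = np.stack([predict_fn(x) for _ in range(n_iter)])
    mean_prediction = samples.mean(axis=0)
    predicted_class = mean_prediction.argmax()

    # Mutual information = entropy of the mean prediction
    #                      minus the mean entropy of the individual samples
    eps = 1e-12  # numerical safeguard for log(0)
    entropy_of_mean = -(mean_prediction * np.log(mean_prediction + eps)).sum()
    mean_entropy = -(samples * np.log(samples + eps)).sum(axis=1).mean()
    mutual_information = entropy_of_mean - mean_entropy

    return predicted_class, mean_prediction, mutual_information
```
%% Cell type:markdown id: tags: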
For the bcc structure from above, prediction and uncertainty quantification are performed in the following cell:
%% Cell type:code id: tags:
``` python
# reshape the data array such that it fits the model input shape;
# 'prediction' and 'predicted_label' are assumed to come from the model call above
argmax_prediction = np.argmax(prediction[0])
print('Maximal classification probability with value {:.4f} for prototype {}'.format(
    prediction[0][argmax_prediction], predicted_label))
plt.plot(prediction[0])
plt.xlabel('Class (integer encoding)')
plt.ylabel('Probability')
plt.show()
```
%% Cell type:markdown id: tags:
The dictionary "uncertainty" contains quantifiers of uncertainty, in particular the mutual information employed in [1]:
%% Cell type:code id: tags:
``` python
print('Mutual information = {:.4f}'.format(uncertainty["mutual_information"][0]))
```
%% Cell type:markdown id: tags:
***In summary***: All of these steps - isotropic scaling, descriptor calculation, neural-network prediction (classification probabilities + uncertainty) - are combined in the following function for quick and easy usage. In particular, you may simply pass a list of geometry files; here we choose the fcc (Cu) and bcc (W) prototypes from AFLOW:
To analyze the predictions, in particular the ranking by classification probability and the uncertainty quantifiers, we can use the function 'analyze_predictions':
For the predictions on the fcc structure, we see that non-zero probability is also assigned to a tetragonal prototype (identifier 'bctIn_In_A_tI2_139_a'; the AFLOW identifier is specified in this case as well, see [2] for more details). This assignment is justified since this prototype is a slightly distorted (tetragonal) version of fcc (http://aflowlib.org/prototype-encyclopedia/A_tI2_139_a.In.html). We also see that the uncertainty is non-zero, indicating that there is more than one candidate for the most similar prototype.
So we see that, interestingly, also the low-probability candidates are meaningful. Further analysis of this ranking induced by the classification probabilities is provided in [1], specifically Fig. 3 (for experimental scanning transmission electron microscopy data of graphene) and Supplementary Note 2.3 (for out-of-sample single-crystal structures from AFLOW and NOMAD). This kind of study becomes particularly important if you want to investigate atomic structures that are not included in the training set. In the mentioned figures of [1], we provide guidelines on how one can still use ARISE and interpret its predictions. In the next section, we will first extend to multi-species systems and then provide the code for these out-of-sample studies, where you may also try different structures (upload the geometry files to './data/ARISE/calculations_path').
%% Cell type:markdown id: tags:
### Classification of multi-species single crystals
%% Cell type:markdown id: tags:
A great advantage of our method is that it can treat many more classes than existing methods.
So far we only considered mono-species systems. We now consider the rock-salt structure (NaCl) as an example for the extension of ARISE to multi-component systems.
The usual way of treating multiple chemical species is to calculate the partial power spectra [8]

$$ p_{b_1 b_2 l}^{\alpha \beta} = \pi \sqrt{\frac{8}{2l+1}} \sum_m \left( c_{b_1 l m}^{\alpha} \right)^{*} c_{b_2 l m}^{\beta}, $$

where the coefficients $c_{b_1 lm}^{\alpha}, c_{b_2 lm}^{\beta}$ are obtained from separate densities for the species $\alpha, \beta$. Then one may simply append all of these components, or average over certain combinations. We take the latter route, where it turns out that it suffices (for our purposes) to only select coefficients for $Z_1 = Z_2$ and to ignore the cross terms for $Z_1 \neq Z_2$. Then, we consider all substructures, i.e., we compute SOAP for every ordered pair of chemical species: for NaCl, we sit on all Na atoms and consider only Na atoms or only Cl atoms, and we sit on all Cl atoms and consider only Cl atoms or only Na atoms. This gives us 4 vectors, which we average. The generalization to structures with an arbitrary number of chemical species is straightforward - and so we end up with a descriptor that is 1. independent of the number of atoms, 2. independent of the number of chemical species, and 3. incorporates all relevant symmetries (rotational invariance, translational invariance, and invariance with respect to permutations of identical atoms).
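The substructure averaging can be sketched as follows ('soap_for_pair' is a hypothetical callable computing an averaged SOAP vector for one ordered species pair; the names here are illustrative):
%% Cell type:code id: tags:
``` python
import itertools
import numpy as np

def multispecies_soap_sketch(atoms, soap_for_pair):
    """Average the SOAP vectors over all ordered species pairs (alpha, beta):
    sit on atoms of species alpha, consider only atoms of species beta."""
    species = sorted(set(atoms.get_chemical_symbols()))
    vectors = [soap_for_pair(atoms, alpha, beta)
               for alpha, beta in itertools.product(species, repeat=2)]
    return np.mean(vectors, axis=0)  # e.g., 4 vectors for NaCl
```
%% Cell type:markdown id: tags: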
Coming back to our implementation, an important keyword is 'scale_element_sensitive', which is set to True by default - this way, we isotropically scale each substructure separately (one may choose not to do this, but the model we provide here is trained for scale_element_sensitive=True):
One can see that NaCl is correctly predicted with low mutual information.
%% Cell type:markdown id: tags:
You may try to choose a different prototype and analyze its predictions in the following cells. You can choose one of the prototypes of [1] (see also Supplementary Tables 4-6 for a complete list), or other examples from AFLOW (including those explored in Supplementary Figure 2).
First, we define some functions for loading (no need to understand these in detail, so you can skip them for now):
* When executing the code below, don't be surprised if the predictions are not identical between different runs. Due to the stochastic nature of the model, this may very well happen, especially for out-of-sample predictions. Still, the final class prediction (via the argmax operation) remains constant for the parameters used here.
* To arrive at more stable classification probabilities (and uncertainty quantifications), one may increase the number of stochastic forward passes (see the 'n_iter' argument in the above cell). The increase in computational cost for larger values of n_iter is typically mild (unless you choose extremely large values).
%% Cell type:markdown id: tags:
Run the following cell to investigate the training prototypes used in [1]. Click 'Run Interact' to start a calculation:
You may also want to have a look at the top-predicted prototypes (an interactive window will open where you can select via "Ranking_idx" which of the top predictions you want to inspect and via "Supercell" which replicas to build - the exception being nanotubes, which have a fixed length; see also the comment three cells above):
%% Cell type:code id: tags:
``` python
structure *= Supercell  # only if a lattice is defined, create a supercell
view(structure, viewer='ngl')
```
%% Cell type:markdown id: tags:
Please run the following cell if you want to study out-of-sample structures (including the prototypes investigated in [1]). If you want to inspect your own, new structures, just upload them to './data/ARISE/calculations_path', defined in the package-loading part at the very beginning of this tutorial.
%% Cell type:code id: tags:
``` python
structure *= Supercell  # only if a lattice is defined, create a supercell
view(structure, viewer='ngl')
```
%% Cell type:markdown id: tags:
# Polycrystal classification
%% Cell type:markdown id: tags:
To classify polycrystals, we introduce the strided pattern matching (SPM) framework, sketched in the following figure:

***The first line*** depicts this procedure for slab-like systems and ***the second line*** for bulk materials.
In both scenarios, a box of a certain size is scanned across the whole crystal volume with a certain stride. For each stride, the classification model is applied, yielding both predictions and uncertainty values. These values are then rearranged into 2D or 3D maps. One can construct such maps for each of the classification probabilities and the uncertainty (here quantified by the mutual information), or for the most similar prototypes (i.e., the argmax prediction).
This way, one can discover the most prevalent symmetries (via classification maps) and detect defective regions (in particular, the uncertainty can indicate when a particular region is defective or far outside the known training examples, thus allowing one - for instance - to discover grain boundaries or, as depicted above, a cubic-shaped precipitate).
%% Cell type:markdown id: tags:
As introduced in the 'Quickstart' section, the local analysis can be obtained by simply changing the mode to 'local', while one also has to provide stride and box size (standard settings stride=1.0, box_size=12.0).
As an example, we investigate the mono-species structure shown in the first line of the figure above:
We choose a box size that is roughly equal to the slab thickness and perform the SPM analysis. We choose here a very coarse stride; the smaller the stride, the better the resolution. Note that we choose a stride in the z direction that is larger than the slab thickness, a setting for which no stride in z will be made.
A folder (in './data/ARISE/calculations_path') in which all relevant information is saved is created automatically. The name of the folder is given by the structure file being passed and the options for SPM (box size and stride).
Note that if you run the cell below multiple times, a new folder (with "_run_i", i=1,... appended to the file name) will be created, so no data will be lost.
%% Cell type:code id: tags:
``` python
box_size = [16.0]
# you may also pass a list for both stride and box size if you have more than one geometry file.
stride = [[15.0, 15.0, 20.0]]
# choosing the stride in z larger than the extent of the material (here 20.0, exceeding
# the slab thickness by about 4 angstrom) prevents any stride from being made in the z direction.
previous_level = logger.getEffectiveLevel()
logger.setLevel(0)  # switch off logging, you may change this.
```
%% Cell type:markdown id: tags:
Within the automatically created folder, the results are saved in the folder 'results_folder' (in this case, the probabilities as 'four_grains_elemental_solid.xyz_probabilities.npy' and the uncertainty as 'four_grains_elemental_solid.xyz_mutual_information.npy'). Each of the numpy arrays has in general the shape (#classes, z_coordinate, y_coordinate, x_coordinate), where here #classes=108 and the coordinates refer to the spatial positions of the box. For this slab structure, the arrays have the shape (#classes, y_coordinate, x_coordinate).
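The saved arrays can then be loaded for further analysis, for instance (file names follow the pattern described above; adjust the path to the automatically created results folder):
%% Cell type:code id: tags:
``` python
import numpy as np

probabilities = np.load('four_grains_elemental_solid.xyz_probabilities.npy')
mutual_information = np.load('four_grains_elemental_solid.xyz_mutual_information.npy')
print(probabilities.shape)  # here: (#classes, y_coordinate, x_coordinate)
```
%% Cell type:markdown id: tags: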
The finer the stride, the better the resolution. In [1], we employ a stride of $1 \overset{\circ}{\mathrm {A}}$, resulting in the following high-resolution picture:
%% Cell type:markdown id: tags:
# Unsupervised learning / exploratory analysis of the trained model
%% Cell type:markdown id: tags:
One important aspect of [1] is the application of unsupervised learning techniques to analyze the internal neural-network representations. This allows one to *explain* the trained model by showing that regularities in the representation space correspond to physical characteristics. Using this strategy, one can conduct exploratory analysis, with the goal of discovering patterns in the high-dimensional representation space and relating them to the crystal structure - leading to the discovery of defective regions (e.g., grain boundaries or regions that are distorted in similar/distinct fashion, cf. Figs. 2 and 4 of [1]).
Specifically, we apply clustering (HDBSCAN, https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) to identify groups in the high-dimensional neural-network representations. Furthermore, the manifold-learning method UMAP (https://umap-learn.readthedocs.io/en/latest/index.html) is employed to obtain a low-dimensional embedding (here 2D) that can be interpreted by the human eye.
The most important parameters of the unsupervised techniques employed in this work are the following (a minimal usage sketch follows after the list):
* HDBSCAN: min_cluster_size (int) - determines the minimum number of points a cluster must contain. In more detail, HDBSCAN employs (agglomerative) hierarchical clustering (https://www.displayr.com/what-is-hierarchical-clustering/) to construct a hierarchical cluster tree, where only those clusters are considered that contain at least as many data points as defined via min_cluster_size. Clusters are finally extracted by computing a cluster-stability value, and the most stable clusters are used to construct a final, so-called flat clustering.
* UMAP: n_neighbors (int) - determines the trade-off between capturing local (small n_neighbors) and global (large n_neighbors) relationships in the data. Specifically, UMAP constructs a topological representation of the data (in practice corresponding to a specific type of k-neighbor graph), under some assumptions (e.g., the data being uniformly distributed on a Riemannian manifold). Then a low-dimensional representation is found by matching its topological representation with that of the original data (where the matching procedure amounts to the definition of a cross-entropy loss function and the application of stochastic gradient descent). The parameter n_neighbors influences the high-dimensional topological representation, while the distances in the low-dimensional embedding can be tuned via 'min_dist' (this parameter is typically considered as tuning only the visualization, i.e., the distances in the low-dimensional embedding - see also https://pair-code.github.io/understanding-umap/ for a discussion with hands-on examples).
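A minimal usage sketch of the two methods (assuming the 'umap-learn' and 'hdbscan' packages; the representation array is a random stand-in for the neural-network representations of the SPM boxes):
%% Cell type:code id: tags:
``` python
import numpy as np
import umap
import hdbscan

# Stand-in for the high-dimensional neural-network representations
representations = np.random.rand(1000, 256)

# 2D embedding; n_neighbors trades off local vs. global structure
embedding = umap.UMAP(n_neighbors=250, min_dist=0.1,
                      n_components=2).fit_transform(representations)

# Cluster the high-dimensional representations (not the 2D embedding)
clusterer = hdbscan.HDBSCAN(min_cluster_size=250).fit(representations)
cluster_labels = clusterer.labels_  # -1 marks outliers
```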
%% Cell type:markdown id: tags:
In the cell below, we analyze, as an example, a mono-species polycrystal with four crystalline grains (corresponding to the slab structure depicted above and discussed in Fig. 2 of [1], where we chose a stride of 1 $ \overset{\circ}{\mathrm {A}}$ in x, y and a box size of 16 $ \overset{\circ}{\mathrm {A}}$).
You can change the parameters to get a feeling for the importance of min_cluster_size and n_neighbors.
You may also choose the color scale according to ARISE's predictions, the uncertainty (mutual information), or the HDBSCAN cluster labels.
***Remark:*** For the HDBSCAN clustering, there are two options: either you choose 'cluster_labels', the cluster assignment that arises from the standard (flat) clustering procedure, or the 'argmax_clustering' color scale, which is determined from the soft-clustering feature of HDBSCAN. Specifically, HDBSCAN allows one to calculate a vector for each point, with component i capturing the "probability" that the point is a member of cluster i. We can then infer a cluster assignment for points that would normally be considered outliers by taking the cluster whose membership probability is maximal (while we consider the point an outlier if none of its cluster probabilities exceeds a certain threshold - here, we choose 10%). This procedure allows one to obtain reasonable clusterings also in the presence of high levels of noise (see Fig. 4 of [1]). Here, it will not make a qualitative difference, i.e., the basic structure of the polycrystal (four grains + grain boundaries) can be recovered by both clustering procedures.
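The soft-clustering variant can be sketched as follows (using hdbscan's membership vectors; the 10% threshold follows the remark above):
%% Cell type:code id: tags:
``` python
# Soft clustering: per-point membership "probabilities" for each cluster
clusterer = hdbscan.HDBSCAN(min_cluster_size=250,
                            prediction_data=True).fit(representations)
membership = hdbscan.all_points_membership_vectors(clusterer)  # (n_points, n_clusters)

argmax_clustering = membership.argmax(axis=1)
argmax_clustering[membership.max(axis=1) < 0.1] = -1  # below threshold -> outlier
```
%% Cell type:markdown id: tags: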
A consistent picture with four main clusters corresponding to the four crystalline grains separated by grain boundaries arises for min_cluster_size>250 (and also n_neighbors>250, while reasonable embeddings, i.e., points corresponding to different grains being separated, are obtained at lower values as well). For more details on the interpretation of these figures, we refer to [1].
%% Cell type:markdown id: tags:
# Conclusion
%% Cell type:markdown id: tags:
You have reached the end of this tutorial.
*Please let us know if you have any questions, wishes, or suggestions for improvement. Feel free to reach out to us, e.g., via mail (leitherer@fhi-berlin.mpg.de, andreas.leitherer@gmail.com).*
%% Cell type:markdown id: tags:
# References
%% Cell type:markdown id: tags:
[1] A. Leitherer, A. Ziletti, and L. M. Ghiringhelli, arXiv:2103.09777 (2021)
[2] M. J. Mehl et al., The AFLOW library of crystallographic prototypes: part 1. Comput. Mater. Sci. 136, S1–S828 (2017)
[3] A. P. Bartók et al., Phys. Rev. Lett. 104, 136403 (2010)
[4] A. P. Bartók et al., Phys. Rev. B 87, 184115 (2013)
[5] The quippy software, available for non-commercial use from www.libatoms.org (or https://github.com/libAtoms/QUIP)
[6] Y. Gal and Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050-1059 (2016)
[7] Y. Gal, Uncertainty in deep learning. Ph.D. thesis, University of Cambridge (2016)
[8] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 18, 13754–13769 (2016)
%% Cell type:markdown id: tags:
Please refer to [1] for additional information and references.