Commit d85c9974 authored by Luigi Sbailo
Reinsert fast search and find of density peaks in tutorial

%% Cell type:markdown id: tags:
<img src="assets/exploratory_analysis/header.jpg" width="900">
%% Cell type:markdown id: tags:
<img style="float: left;" src="assets/exploratory_analysis/logo_MPG.png" width=150>
<img style="float: left; margin-top: -10px" src="assets/exploratory_analysis/logo_NOMAD.png" width=250>
<img style="float: left; margin-top: -5px" src="assets/exploratory_analysis/logo_HU.png" width=130>
%% Cell type:markdown id: tags:
In this tutorial, we use unsupervised learning for an exploratory analysis of materials-science data. More specifically, we analyze 82 octet binary materials known to crystallize in the zinc blende (ZB) and rocksalt (RS) structures. Our aim is to show how to visualize a multidimensional dataset and gain an understanding of its relevant inner structure. As a first step in our data analysis, we would like to detect whether the data points can be grouped into clusters, where each cluster collects objects that share similar features. With an exploratory analysis we would like to visualize the structure and spatial arrangement of these clusters, but when the feature space is highly multidimensional such a visualization is not directly possible. Hence, we project the feature space onto a two-dimensional manifold, which can be visualized. To avoid losing relevant information, the embedding into a lower-dimensional manifold must preserve the most informative features of the original space. Below we introduce different clustering and embedding methods, which can be combined to obtain different visualizations of our dataset.
%% Cell type:markdown id: tags:
# Introduction to clustering
%% Cell type:markdown id: tags:
Cluster analysis groups together data points that are more similar to each other than to points belonging to other clusters. Clustering can be achieved by means of many different algorithms, each with its own characteristics and input parameters. The choice of clustering algorithm depends on the specific dataset analyzed, and once an algorithm has been chosen it is often necessary to iteratively adjust its input parameters until the results reach the desired resolution. We focus on the five algorithms described below; a minimal usage sketch follows the list.
- __*k*-means__ partitions the dataset into _k_ clusters, where each datapoint belongs to the cluster with the nearest mean. This partition minimizes the within-cluster variance to find the most compact partitioning of the dataset. _k_-means uses an iterative refinement technique that is fast and scalable, but it can get stuck in local minima. The algorithm is therefore run multiple times with different initial conditions, and the best outcome is kept. Drawbacks of this algorithm are that the number of clusters _k_ is an input parameter that must be known in advance, and that clusters are assumed to be convex.
- __Hierarchical clustering__ builds a hierarchy of clusters with a bottom-up (__agglomerative__) or top-down (__divisive__) approach. In this tutorial we deploy a bottom-up approach. In a bottom-up hierarchical clustering algorithm, each datapoint is initially placed in its own cluster, so the number of clusters initially equals the number of datapoints. Pairs of clusters are then iteratively merged, where the decision of which clusters to merge is made according to a specific linkage criterion. Merging is iterated until all points are grouped into a single supercluster, and the resulting hierarchy of clusters can be shown by means of a dendrogram. If a distance threshold is given, clusters that are farther apart than the threshold value are not merged, which stops the algorithm once no more mergings are possible. The algorithm then returns a number of clusters that depends on the threshold distance. An advantage of this algorithm is that the construction of dendrograms allows for a visual inspection of the clustering, but hierarchical clustering is rather slow and not well suited for big data.
- Density-based spatial clustering of applications with noise (__DBSCAN__) groups points that are close to each other without requiring the number of clusters in advance, leaving outliers marked as noise and not assigned to any cluster. In this algorithm, a neighborhood distance $\epsilon$ and a number of points _min_samples_ are used to determine whether a point belongs to a cluster: if a point has at least _min_samples_ other points within the distance $\epsilon$, it is marked as a core point and belongs to a cluster; otherwise, the point is marked as noise. This algorithm is fast and clusters can assume any shape, but the choice of the distance $\epsilon$ might be nontrivial.
- __HDBSCAN__ is a hierarchical extension of DBSCAN. This algorithm deploys the mutual reachability distance as its distance metric to push outliers away from high-density regions, thus facilitating their detection. The mutual reachability distance increases the distance of all points that are not close to at least _min_samples_ points. Using this metric, the algorithm builds a hierarchy tree, from which it extracts clusters that contain at least _min_cluster_size_ elements.
- The fast search and find of density peaks (__DenPeak__) algorithm is a density-based algorithm that makes use of a two-dimensional decision plot to select which clusters are extracted. Cluster centers are assumed to be density peaks, i.e. points of high local density that are surrounded by lower-density regions and lie far from any point of higher density. Each point of the dataset appears in the decision plot according to its local density and its distance from the nearest point of higher density, and the cluster centers are then selected directly from this plot.
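%% Cell type:markdown id: tags:
For orientation, the following sketch (not part of the original notebook) shows the common scikit-learn interface shared by the first three algorithms, applied to synthetic toy data; HDBSCAN and DenPeak are provided by the separate `hdbscan` and `pydpc` packages used later in the tutorial.
%% Cell type:code id: tags:
``` python
# Minimal sketch on synthetic data: the scikit-learn clustering estimators
# expose a fit_predict method that returns one integer label per point.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X_toy, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)
labels_hier = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X_toy)
labels_dbscan = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)  # label -1 marks outliers

print(np.unique(labels_kmeans), np.unique(labels_hier), np.unique(labels_dbscan))
```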
%% Cell type:markdown id: tags:
# Introduction to embedding
%% Cell type:markdown id: tags:
Visualization of a dataset is not possible when it is defined in a highly multidimensional space, yet a visual analysis can help detect inner structures in the dataset. Hence, in order to make such a visualization possible, we reduce the dimensionality of the data using an embedding algorithm.
These methods are specifically developed to avoid losing critical information during the embedding into a lower-dimensional space. In this tutorial, we use the three embedding algorithms summarized below; a short usage sketch follows the list.
- Principal component analysis (__PCA__) is a linear projection method that seeks an orthogonal transformation of the dataset that renders its variables uncorrelated. The dimensionality reduction then keeps the directions with the highest variance, so as to preserve as much information as possible. The method is deterministic but linear, and it fails to capture nonlinear correlations.
- Multidimensional scaling (__MDS__) constructs a pairwise distance matrix in the original space and seeks a low-dimensional representation that preserves these distances as much as possible. Because all pairwise distances enter the objective, the global arrangement of the data tends to be preserved, but the method scales poorly with the number of data points.
- t-distributed stochastic neighbor embedding (__t-SNE__) is a non-linear dimensionality-reduction method that converts similarities between data points into joint probabilities and minimizes the Kullback-Leibler divergence between the joint probabilities of the embedding and of the original space. The cost function is not convex, and the results depend on the initialization. The non-linearity of this method can occasionally produce misleading results, so running it several times is recommended.
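%% Cell type:markdown id: tags:
The sketch below (again not part of the original notebook) shows that the three methods share scikit-learn's `fit_transform` interface and return an array with one 2-dimensional point per sample.
%% Cell type:code id: tags:
``` python
# Minimal sketch on synthetic data: each embedding returns an (n_samples, 2) array.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

X_toy, _ = make_blobs(n_samples=200, centers=3, n_features=10, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X_toy)                    # deterministic, linear
X_mds = MDS(n_components=2).fit_transform(X_toy)                    # stochastic, distance-preserving
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_toy)   # stochastic, non-linear

print(X_pca.shape, X_mds.shape, X_tsne.shape)
```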
%% Cell type:markdown id: tags:
# Import required modules
%% Cell type:markdown id: tags:
Below we load the packages required for the tutorial. Most of the clustering and embedding algorithms are contained in the scikit-learn and SciPy packages. We use pandas dataframes to manipulate our dataset.
%% Cell type:code id: tags:
``` python
from ase.io import read
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree
from sklearn import preprocessing
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS
import hdbscan
import plotly.graph_objects as go
import ipywidgets as widgets
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
# Get the data
We load the data and place it into a pandas dataframe. The data has been downloaded from the NOMAD Archive and the NOMAD atomic data collection. It consists of the RS-ZB energy differences (in eV/atom) of the 82 octet binary compounds, structure objects containing the atomic positions of the materials, and properties of the atomic constituents. The following atomic features are included:
- Z: atomic number
- period: period in the periodic table
- IP: ionization potential
- EA: electron affinity
- E_HOMO: energy of the highest occupied atomic orbital
- E_LUMO: energy of the lowest unoccupied atomic orbital
- r_(s, p, d): radius at which the radial distribution of the s, p, or d orbital has its maximum.
%% Cell type:code id: tags:
``` python
# load data
RS_structures = read("data/exploratory_analysis/octet_binaries/RS_structures.xyz", index=':')
ZB_structures = read("data/exploratory_analysis/octet_binaries/ZB_structures.xyz", index=':')

def generate_table(RS_structures, ZB_structures):
    for RS, ZB in zip(RS_structures, ZB_structures):
        energy_diff = RS.info['energy'] - ZB.info['energy']
        min_struc_type = 'RS' if energy_diff < 0 else 'ZB'
        struc_obj_min = RS if energy_diff < 0 else ZB
        yield [RS.info['energy'], ZB.info['energy'],
               energy_diff, min_struc_type,
               RS.info['Z'], ZB.info['Z'],
               RS.info['period'], ZB.info['period'],
               RS.info['IP'], ZB.info['IP'],
               RS.info['EA'], ZB.info['EA'],
               RS.info['E_HOMO'], ZB.info['E_HOMO'],
               RS.info['E_LUMO'], ZB.info['E_LUMO'],
               RS.info['r_s'], ZB.info['r_s'],
               RS.info['r_p'], ZB.info['r_p'],
               RS.info['r_d'], ZB.info['r_d']]

df = pd.DataFrame(
    generate_table(RS_structures, ZB_structures),
    columns=['energy_RS', 'energy_ZB',
             'energy_diff', 'min_struc_type',
             'Z(A)', 'Z(B)',
             'period(A)', 'period(B)',
             'IP(A)', 'IP(B)',
             'EA(A)', 'EA(B)',
             'E_HOMO(A)', 'E_HOMO(B)',
             'E_LUMO(A)', 'E_LUMO(B)',
             'r_s(A)', 'r_s(B)',
             'r_p(A)', 'r_p(B)',
             'r_d(A)', 'r_d(B)'],
    index=[RS.get_chemical_formula() for RS in RS_structures]
)
```
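%% Cell type:markdown id: tags:
As an optional quick check (not in the original notebook), we can display the first rows of the assembled dataframe.
%% Cell type:code id: tags:
``` python
# Optional: a quick look at the first few materials and their features.
df.head()
```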
%% Cell type:markdown id: tags:
We add to the dataframe a column that contains a different marker symbol for each most stable structure type. These markers will be used when visualizing the datapoints in the 2-dimensional embedding.
%% Cell type:code id: tags:
``` python
df['marker_symbol']= np.where(df['min_struc_type']=='RS','square-open','hexagram')
```
%% Cell type:markdown id: tags:
A 'Clustering' class is defined below that includes all clustering algorithms covered in the tutorial. Before creating an instance of this class, a dataframe variable 'df' must have been defined. Each clustering method of this class labels the entries in the dataframe according to the outcome of the cluster assignment.
%% Cell type:code id: tags:
``` python
class Clustering:

    def __init__(self):
        # df_flag is set to True when no dataframe 'df' has been defined,
        # so that the clustering methods below return without doing anything.
        self.df_flag = False
        try:
            df
        except NameError:
            print("Please define a dataframe 'df' and a features list")
            self.df_flag = True

    def kmeans(self, n_clusters, max_iter):
        if self.df_flag:
            return
        cluster_labels = KMeans(n_clusters=n_clusters, max_iter=max_iter).fit_predict(df[features])
        print(max(cluster_labels) + 1, ' clusters were extracted.')
        df['clustering'] = 'k-means'
        df['cluster_label'] = cluster_labels

    def hierarchical(self, distance_threshold):
        if self.df_flag:
            return
        linkage_criterion = 'ward'
        Z = linkage(df[features], linkage_criterion)
        cluster_labels = cut_tree(Z, height=distance_threshold).flatten()
        print(int(max(cluster_labels)) + 1, ' clusters were extracted.')
        df['clustering'] = 'Hierarchical - ' + linkage_criterion + ' criterion'
        df['cluster_label'] = cluster_labels

    def dbscan(self, eps, min_samples):
        if self.df_flag:
            return
        cluster_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(df[features])
        print(max(cluster_labels) + 1, ' clusters were extracted.')
        df['clustering'] = 'DBSCAN'
        df['cluster_label'] = cluster_labels

    def hdbscan(self, min_cluster_size, min_samples):
        if self.df_flag:
            return
        clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
        clusterer.fit(df[features])
        cluster_labels = clusterer.labels_
        print(max(cluster_labels) + 1, ' clusters were extracted.')
        df['clustering'] = 'HDBSCAN'
        df['cluster_label'] = cluster_labels

    def dpc(self, density=0, delta=0):
        if self.df_flag:
            return
        if density > 0 and delta > 0:
            # assign points to clusters using the given density and delta thresholds
            clu = DPCClustering(np.ascontiguousarray(df[features].to_numpy()), autoplot=False)
            clu.autoplot = True
            clu.assign(density, delta)
            cluster_labels = clu.membership
            print(max(cluster_labels) + 1, ' clusters were extracted.')
            df['clustering'] = 'DPC'
            df['cluster_label'] = cluster_labels
        else:
            # only draw the decision graph, from which the thresholds can be chosen
            clu = DPCClustering(np.ascontiguousarray(df[features].to_numpy()))
```
%% Cell type:markdown id: tags:
The embedding algorithms are handled with a graphical interface generated using Jupyter widgets. This allows plots to be created with the desired embedding algorithm by clicking a button. Before plotting the data with any of the embedding algorithms, a dataframe 'df' must have been defined, and cluster labels must have been assigned to each datapoint.
%% Cell type:code id: tags:
``` python
def show_embedding():
    btn_PCA = widgets.Button(description='PCA')
    btn_MDS = widgets.Button(description='MDS')
    btn_tSNE = widgets.Button(description='t-SNE')

    def btn_eventhandler_embedding(obj):
        # the button label tells us which embedding method to apply
        method = str(obj.description)
        try:
            df['clustering'].iloc[0]
        except KeyError:
            print("Please assign labels with a clustering algorithm")
            return
        if method == 'PCA':
            transformed_data = PCA(n_components=2).fit_transform(df[features])
            df['x_emb'] = transformed_data[:, 0]
            df['y_emb'] = transformed_data[:, 1]
            df['embedding'] = 'PCA'
        elif method == 'MDS':
            transformed_data = MDS(n_components=2).fit_transform(df[features])
            df['x_emb'] = transformed_data[:, 0]
            df['y_emb'] = transformed_data[:, 1]
            df['embedding'] = 'MDS'
        elif method == 't-SNE':
            transformed_data = TSNE(n_components=2).fit_transform(df[features])
            df['x_emb'] = transformed_data[:, 0]
            df['y_emb'] = transformed_data[:, 1]
            df['embedding'] = 't-SNE'
        plot_embedding()

    def plot_embedding():
        # update each scatter trace (one per cluster) with the embedded coordinates
        with fig.batch_update():
            for scatter in fig['data']:
                cl = scatter.meta
                scatter['x'] = df[df['cluster_label'] == cl]['x_emb']
                scatter['y'] = df[df['cluster_label'] == cl]['y_emb']
                scatter['customdata'] = np.dstack((df[df['cluster_label'] == cl]['min_struc_type'].to_numpy(),
                                                   df[df['cluster_label'] == cl]['cluster_label'].to_numpy(),
                                                   ))[0]
                scatter['hovertemplate'] = r"<b>%{text}</b><br><br>Low energy structure: %{customdata[0]}<br>Cluster label: %{customdata[1]}<br>"
                scatter['marker'].symbol = df[df['cluster_label'] == cl]['marker_symbol'].to_numpy()
                scatter['text'] = df[df['cluster_label'] == cl].index.to_list()
            fig.update_layout(
                plot_bgcolor='rgba(229,236,246, 0.5)',
                xaxis=dict(visible=True),
                yaxis=dict(visible=True),
                legend_title_text='List of clusters',
                showlegend=True,)
        label_b.value = "Embedding method used: " + str(df['embedding'].iloc[0])

    btn_PCA.on_click(btn_eventhandler_embedding)
    btn_MDS.on_click(btn_eventhandler_embedding)
    btn_tSNE.on_click(btn_eventhandler_embedding)
    label_t = widgets.Label(value="Clustering algorithm used: " + str(df['clustering'].iloc[0]))
    label_b = widgets.Label(value='Select a dimension reduction method to visualize the 2-dimensional embedding')
    fig = go.FigureWidget()
    # one scatter trace per cluster label; -1 denotes outliers
    for cl in np.unique(df['cluster_label'].to_numpy()):
        if cl == -1:
            name = 'Outliers'
        else:
            name = 'Cluster ' + str(cl)
        fig.add_trace(go.Scatter(
            name=name,
            mode='markers',
            meta=cl
        ))
    fig.update_layout(plot_bgcolor='rgba(229,236,246, 0.5)',
                      width=800,
                      height=600,
                      xaxis=dict(visible=False, title='x_emb'),
                      yaxis=dict(visible=False, title='y_emb'))
    return widgets.VBox([widgets.HBox([btn_PCA, btn_MDS, btn_tSNE]), label_t, label_b, fig])
```
%% Cell type:markdown id: tags:
We select which features will be used by the clustering and embedding algorithms. The complexity of the problem decreases as the number of features is reduced, and a careful selection of the features to be processed can improve the quality of the results. To find the most meaningful results, it is sometimes necessary to repeat the analysis, considering a different set of features at each iteration.
%% Cell type:code id: tags:
``` python
features = ['IP(A)', 'IP(B)',
            'EA(A)', 'EA(B)',
            'Z(A)', 'Z(B)',
            'E_HOMO(A)', 'E_HOMO(B)',
            'E_LUMO(A)', 'E_LUMO(B)',
            'r_s(A)', 'r_s(B)',
            'r_p(A)', 'r_p(B)',
            'r_d(A)', 'r_d(B)']
```
%% Cell type:markdown id: tags:
Feature standardization is the operation of rescaling each feature so that it has zero mean and unit variance, and it is a common requirement for machine-learning algorithms. In fact, estimators can be biased towards dimensions with larger absolute values, and outliers can undermine the learning capabilities of the algorithm. Hence, we standardize the dataset by subtracting from each variable its mean value and dividing by its standard deviation.
%% Cell type:code id: tags:
``` python
df[features]=preprocessing.scale(df[features])
```
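%% Cell type:markdown id: tags:
As a quick sanity check (not in the original notebook), we can verify that each standardized feature now has zero mean and unit variance.
%% Cell type:code id: tags:
``` python
# Each standardized feature should have (approximately) zero mean and unit variance;
# ddof=0 matches the population standard deviation used by preprocessing.scale.
print(np.allclose(df[features].mean().to_numpy(), 0.0))
print(np.allclose(df[features].std(ddof=0).to_numpy(), 1.0))
```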
%% Cell type:markdown id: tags:
Pandas dataframes offer a number of useful tools to visualize datasets. For example, here we show histograms of all 'features' for all entries in the dataframe by calling the 'hist' function. Below we can see that the dataset has been standardized.
%% Cell type:code id: tags:
``` python
hist = df[features].hist(bins=10, figsize=(20, 15));
```
%% Cell type:markdown id: tags:
---
# $k$-means
%% Cell type:markdown id: tags:
$k$-means requires the number of clusters as input, and the outcome depends on the initial conditions; the algorithm is therefore restarted several times with different initializations, and each run performs at most _max\_iter_ refinement iterations. As we know that our octet binary materials crystallize in the RS and ZB structures, a natural distinction in this dataset is between materials whose most stable conformation is the RS vs the ZB structure. Hence we look for two clusters, aiming to find clusters of materials with the same most stable structure.
%% Cell type:markdown id: tags:
From the class 'Clustering', we call the 'kmeans' function with the desired number of clusters and maximum number of iterations as parameters. The function then assigns to each material in the dataframe 'df' the label of the cluster it belongs to.
%% Cell type:code id: tags:
``` python
n_clusters = 2
max_iter =100
Clustering().kmeans(n_clusters, max_iter)
```
%% Cell type:code id: tags:
``` python
print(df['cluster_label'][:10])
```
%% Cell type:markdown id: tags:
We can see that the dataframe now contains the column 'cluster_label', which can assume the values 0 or 1, because the algorithm finds two clusters.
Now we deploy the graphical interface defined above to visualize the datapoints using a two-dimensional embedding of our choice.
The function 'show_embedding' displays three buttons labeled with the names of the dimension-reduction methods used in this tutorial.
Clicking any of the buttons shows a plot of the dataset using the corresponding embedding.
%% Cell type:code id: tags:
``` python
show_embedding()
```
%% Cell type:markdown id: tags:
In the plot, different clusters are visualized with different colors, and by hovering over a point it is possible to see the name of the corresponding material, its most stable structure and the cluster it was assigned to.
We can see open squares and hexagrams used as markers in the plot. Open squares indicate materials whose most stable structure is rocksalt, while hexagrams are used for zinc blende structures. Can you modify the code so as to visualize rocksalt as a diamond and zinc blende as an open circle? A more difficult task is to modify the hovering features. Can you add the atomic numbers of the two elements to the hover text? A hint is that the text shown in the box that appears while hovering is defined as 'hovertemplate' in the 'show_embedding' function; a few other modifications are then required. Now inspect the atomic-number values for the different materials. Are these values what you would expect, i.e. natural numbers? If not, can you explain why they are not?
Now let us focus on the results of the clustering algorithm. Could you identify and visualize two distinct clusters in the dataset? You can also run the $k$-means clustering again with _max\_iter_ set to 1, which stops each run after a single refinement step. Try this several times and compare the results: does the output change at each run? What happens instead if the number is much larger?
To compare different outcomes of the algorithm, it might be convenient to copy-paste the cell containing 'show_embedding()' and update only one of the two visualizers at each iteration. Also, only the PCA embedding allows a straightforward comparison, because MDS and t-SNE are stochastic algorithms and can give different results at each call.
We are interested in understanding whether the clustering groups together materials that have the same most stable structure.
Therefore, we define a function that prints, for each cluster, the percentage of materials that are more stable in the RS vs the ZB structure.
%% Cell type:code id: tags:
``` python
def composition_RS_ZB(df):
    df_cm = pd.DataFrame(columns=['RS', 'ZB', 'Materials in cluster'], dtype=object)
    n_clusters = df['cluster_label'].max() + 1
    for i in range(n_clusters):
        Tot = len(df.loc[df['cluster_label'] == i])
        if Tot == 0:
            continue
        # percentage of materials in cluster i whose most stable structure is RS or ZB
        RS = int(100 * len(df.loc[(df['cluster_label'] == i) & (df['min_struc_type'] == 'RS')]) / Tot)
        ZB = int(100 * len(df.loc[(df['cluster_label'] == i) & (df['min_struc_type'] == 'ZB')]) / Tot)
        df_cm.loc[len(df_cm)] = [RS, ZB, Tot]
    return df_cm
```
%% Cell type:code id: tags:
``` python
composition_RS_ZB(df)
```
%% Cell type:markdown id: tags:
We can see that $k$-means finds two distinct clusters: in one of them there are more 'RS'-stable materials, while in the other there are more 'ZB'-stable materials. This is a hint that, in the space described by the atomic features, materials with the same most stable structure are close to each other, which can also be visualized using the different embedding algorithms.
Observing the linear and deterministic embedding given by PCA, we can clearly see that RS and ZB structures are placed in different regions of the embedding space. However, there is an overlapping area where RS and ZB materials are close to each other, and the region spanned by RS structures appears to be larger than that spanned by ZB structures. On the other hand, we know that $k$-means can only detect convex clusters of comparable size, so we can argue that it might not be able to find the desired two clusters.
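%% Cell type:markdown id: tags:
As a possible follow-up (not in the original notebook), we can check how much of the total variance is captured by the two principal components used in the PCA embedding.
%% Cell type:code id: tags:
``` python
# Fraction of the total variance captured by the first two principal components.
pca = PCA(n_components=2).fit(df[features])
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())
```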
%% Cell type:markdown id: tags:
# Hierarchical agglomerative clustering
---
%% Cell type:markdown id: tags:
In hierarchical agglomerative clustering, clusters are iteratively merged if their distance is lower than a _distance\_threshold_. The number of clusters obtained is a function of this threshold.
%% Cell type:code id: tags:
``` python
distance_threshold=20
Clustering().hierarchical(distance_threshold=distance_threshold)
```
%% Cell type:code id: tags:
``` python
show_embedding()
```
%% Cell type:code id: tags:
``` python
composition_RS_ZB(df)
```
%% Cell type:markdown id: tags:
Several different linkage criteria can be used. As a default option, we have used the Ward criterion, which minimizes the sum of squared differences within all clusters and therefore has some analogies with the objective function of $k$-means. By tuning the parameters, can you find the same clusters as the ones obtained with $k$-means? Now we would like to use a different linkage method. Can you modify the code to use single linkage instead of Ward linkage (see the sketch at the end of this section for a possible starting point)? Typical of the single-linkage criterion is a rich-get-richer dynamics, where already large clusters tend to become even larger during linkage. Can you adjust the distance threshold so as to find only two clusters? Do these clusters have similar shapes?
One advantage of hierarchical methods is that they allow us to decompose and understand the clustering process. Indeed, below we plot a dendrogram that shows all agglomeration steps, from each object being its own cluster to all objects being grouped into a single supercluster. On the y-axis there is the distance threshold, and the number of branches crossed by a horizontal line at a given height corresponds to the number of clusters generated when that value is chosen as the distance threshold. Hence, from the dendrogram we can read off the distance threshold needed to obtain a certain number of clusters.
%% Cell type:code id: tags:
``` python
Z = linkage(df[features], 'ward' )
dendrogram(Z, truncate_mode='lastp',p=11);
```
%% Cell type:markdown id: tags:
The dendrogram function above requires the parameter $p$, which indicates the maximum number of leaves, i.e. final clusters, shown in the plot. Values in parentheses on the x-axis represent the number of objects in each cluster.
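%% Cell type:markdown id: tags:
As a possible starting point for the single-linkage exercise above (not part of the original notebook; the threshold value is only illustrative), one can call SciPy directly, since the 'Clustering' class hard-codes the Ward criterion.
%% Cell type:code id: tags:
``` python
# Single-linkage hierarchy built directly with SciPy; tune the threshold as needed.
Z_single = linkage(df[features], 'single')
dendrogram(Z_single, truncate_mode='lastp', p=11);
labels_single = cut_tree(Z_single, height=2.0).flatten()
print(len(np.unique(labels_single)), 'clusters below a threshold of 2.0')
```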
%% Cell type:markdown id: tags:
---
# DBSCAN
%% Cell type:markdown id: tags:
DBSCAN is a density-based clustering algorithm that detects outliers and is able to extract clusters of different sizes and shapes.
The algorithm requires two parameters: the distance $\epsilon$ is the maximum distance for considering two points as neighbours, and _min_samples_ gives the minimum number of neighbors required to define a core point.
Core points form the core of the clusters, and all points that are neither core points nor neighbors of core points are labeled as outliers.
%% Cell type:code id: tags:
``` python
eps = 3
min_samples= 8
Clustering().dbscan(eps,min_samples)
```
%% Cell type:code id: tags:
``` python
show_embedding()
```
%% Cell type:code id: tags:
``` python
composition_RS_ZB(df)
```
%% Cell type:markdown id: tags:
We can see that the algorithm has found two clusters, each representative of either the RS or the ZB structure. However, this comes at the cost of neglecting many points, which have been classified as outliers (see the count below).
Now tune the parameters and observe the effect of each parameter on the number of outliers.
Since MDS seeks an embedding that preserves the original pairwise distances, we would expect outliers to be placed far from the clusters in an MDS embedding. t-SNE, in contrast, preserves local neighborhoods at the expense of global distances, so outliers can end up close to other clusters. Indeed, in a t-SNE embedding points tend to be rather evenly spaced, yet the clusters remain clearly distinguishable. Pairwise distances between clusters are not meaningful in a t-SNE embedding, which only aims to depict the overall arrangement of the clusters. MDS, on the other hand, attempting to preserve all pairwise distances, sometimes fails to arrange the different clusters clearly.
Can you confirm that, in this case, the outliers are better isolated in an MDS embedding than in a t-SNE embedding? Try to decrease the number of outliers by tuning down the parameters for an easier visualization.
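%% Cell type:markdown id: tags:
The following one-liner (not in the original notebook) counts how many materials are currently labeled as outliers.
%% Cell type:code id: tags:
``` python
# Outliers are assigned the cluster label -1 by DBSCAN (and by HDBSCAN below).
print((df['cluster_label'] == -1).sum(), 'materials are labeled as outliers')
```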
%% Cell type:markdown id: tags:
# HDBSCAN
---
%% Cell type:markdown id: tags:
The HDBSCAN clustering algorithm is introduced in:
R.J.G.B. Campello, D. Moulavi, J. Sander: <span style="font-style: italic;">Density-Based Clustering Based on Hierarchical Density Estimates</span>, Springer Berlin Heidelberg, (2013).
The implementation of the algorithm that we use is taken from https://pypi.org/project/hdbscan/.
%% Cell type:code id: tags:
``` python
import hdbscan
```
%% Cell type:markdown id: tags:
HDBSCAN can be seen as a hierarchical extension of DBSCAN, over which it has a number of advantages.
One advantage is that there is only one essential parameter to tune, namely the minimum size of the clusters.
This parameter is more intuitive to set than, e.g., the $\epsilon$ threshold in DBSCAN.
In the HDBSCAN library that we deploy, the minimum number of samples used for the mutual reachability distance is by default set to the same value as the minimum cluster size, as the two parameters essentially serve the same goal, i.e. avoiding the detection of clusters that contain fewer than a certain number of objects.
In this tutorial we explicitly define both values.
%% Cell type:code id: tags:
``` python
min_cluster_size = 10
min_samples = 10
Clustering().hdbscan(min_cluster_size=min_cluster_size, min_samples=min_samples)
```
%% Cell type:code id: tags:
``` python
show_embedding()
```
%% Cell type:code id: tags:
``` python
composition_RS_ZB(df)
```
%% Cell type:markdown id: tags:
We obtain two clusters, each with a high percentage of a single most stable structure. However, the number of materials classified as outliers is considerably large.
The effect of _min_samples_ is to set how conservative the algorithm is with respect to outlier detection: increasing its value makes the distortion effect of the mutual reachability distance more evident, while decreasing it reduces the number of points classified as outliers.
Can you obtain more meaningful results by decreasing the value of this parameter? A possible starting point is sketched below.
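%% Cell type:markdown id: tags:
A possible re-run with a smaller _min_samples_ (not part of the original notebook; the values are only an illustrative starting point for the question above).
%% Cell type:code id: tags:
``` python
# Re-run HDBSCAN with a less conservative outlier detection and inspect the clusters.
Clustering().hdbscan(min_cluster_size=10, min_samples=3)
composition_RS_ZB(df)
```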
%% Cell type:markdown id: tags:
# Fast search and find of density peaks
---
%% Cell type:markdown id: tags:
The fast search and find of density peaks algorithm is introduced in:
A. Rodriguez, A. Laio: <span style="font-style: italic;">Clustering by fast search and find of density peaks</span>, Science, (2014).
The implementation of the algorithm that we use is taken from https://pypi.org/project/pydpc/.
%% Cell type:code id: tags:
``` python
from pydpc import Cluster as DPCClustering
```
%% Cell type:markdown id: tags:
The fast search and find of density peaks algorithm allows the cluster selection to be made on the basis of a graphical decision plot.
%% Cell type:code id: tags:
``` python
Clustering().dpc()
```
%% Cell type:markdown id: tags:
In the plot above, each point represents a possible density peak that becomes the core of a cluster if selected.
All points of the dataset appear in the plot, and in the top-right corner there is always one point representing the peak in the highest-density region.
The other peaks are placed in the plot according to their local density and their distance ('delta' in the graph) from the nearest point of higher density.
By choosing threshold values on the x- and y-axes, it is possible to select the clusters that the algorithm returns.
Here, we select the three peaks closest to the top-right corner.
%% Cell type:code id: tags:
``` python
Clustering().dpc(2.4,3.8)
```
%% Cell type:code id: tags:
``` python
show_embedding()
```
%% Cell type:code id: tags:
``` python
composition_RS_ZB(df)
```
%% Cell type:markdown id: tags:
We have found two clusters containing only materials with the same most stable structure, and a mixed cluster containing both most stable structures.
It is interesting to visualize this result with MDS, where we can see that the mixed cluster is placed between the two pure clusters, like a transition zone.
These clustering results suggest that the atomic features we have used are sufficient for classifying materials according to their most stable structure.
Even though the RS and ZB clusters are not fully separated, since a mixed cluster is also found, a supervised machine-learning model might be able to learn a classification of the 82 octet binary materials.
We may also expect such a model to face challenges especially when classifying materials in the transition area.
A supervised learning algorithm, namely SISSO, has been used for this classification, and we refer to other tutorials in the AI toolkit for this application (see https://nomad-lab.eu/prod/analytics/public/user-redirect/notebooks/tutorials/compressed_sensing.ipynb and https://nomad-lab.eu/prod/analytics/public/user-redirect/notebooks/tutorials/descriptor_role.ipynb).
In this tutorial, we have seen an exemplary application of unsupervised learning, deployed to explore the structure of a multidimensional dataset.
We have performed a clustering analysis, which led us to find clusters representative of an external label, i.e. the most stable structure.
Such a clustering gives clear evidence that the set of features used for clustering should be sufficient to determine the value of the external label.
A subsequent step of such an analysis would be the deployment of a supervised learning algorithm to find an interpretable relationship between the input features and the labels.
%% Cell type:code id: tags:
``` python
```