From cb84316ecbbd305d65c0ed31cc4da3f0d8fc772a Mon Sep 17 00:00:00 2001 From: Luigi <luigi.sbailo@gmail.com> Date: Tue, 26 May 2020 18:12:52 +0200 Subject: [PATCH] Add explanations --- exploratory_analysis.ipynb | 459 ++++++++++++++++++------------------- 1 file changed, 218 insertions(+), 241 deletions(-) diff --git a/exploratory_analysis.ipynb b/exploratory_analysis.ipynb index de2b4ae..45cd401 100644 --- a/exploratory_analysis.ipynb +++ b/exploratory_analysis.ipynb @@ -36,7 +36,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this tutorial we use unsupervised learning for a preliminary exploration of materials science data. More specifically, we analyze 82 octet binary materials known to crystallize in zinc blende (ZB) and rocksalst (RS) structures. Our aim is to identify the right strategy to facilitate the visualization and characterization of unlabeled data. As a first step in our data analysis, we would like to detect whether data points can be classified into different clusters, where each cluster is aimed to group together objects that share similar features. With an explorative analysis we would like to visualize the structure and spatial displacement of the clusters, but when the feature space is higlhly multidimensional such visualization is directly not possible. Hence, we project the feature space into a two-dimensional manifold that can be visualized. To avoid losing relevant information, the embedding into a lower dimensional manifold must be performed while preserving the most informative features in the original space. Below we introduce into different clustering and embedding methods, which can be combined to obtain different visualizations of our dataset." + "In this tutorial, we use unsupervised learning for a preliminary exploration of materials science data. More specifically, we analyze 82 octet binary materials known to crystallize in the zinc blende (ZB) and rocksalt (RS) structures. Our aim is to show how to facilitate the visualization of unlabeled data and to gain an understanding of the relevant inner structures of the dataset. As a first step in our data analysis, we would like to detect whether the data points can be classified into different clusters, where each cluster is meant to group together objects that share similar features. With an explorative analysis, we would like to visualize the structure and spatial arrangement of the clusters, but when the feature space is highly multidimensional such a visualization is not directly possible. Hence, we project the feature space into a two-dimensional manifold which, instead, can be visualized. To avoid losing relevant information, the embedding into a lower-dimensional manifold must be performed while preserving the most informative features in the original space. Below, we introduce different clustering and embedding methods, which can be combined to obtain different visualizations of our dataset." ] }, { @@ -50,10 +50,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Cluster analysis is performed to group together data points that are more similar to each other in comparison with points belonging to other clusters. Clustering can be achieved by means of many different algorithms, each with proper characteristics and input parameters. 
The choice of the specific clustering algorithms to be used depends on the individual data set analyzed, and, once an optimal algorithm has been chosen, it is often necessary to iteratively modify the input parameters until results achieve the desired properties. We focus on three distinct algorithms as described below.\n", + "Cluster analysis is performed to group together data points that are more similar to each other in comparison with points belonging to other clusters. Clustering can be achieved by means of many different algorithms, each with its own characteristics and input parameters. The choice of the clustering algorithm to be used depends on the specific data set analyzed, and, once an optimal algorithm has been chosen, it is often necessary to iteratively modify the input parameters until the results achieve the desired resolution. We focus on four different algorithms, as described below.\n", "- ___k_-means__ partitions the data set into _k_ clusters, where each data point belongs to the cluster with the nearest mean. This partition ultimately minimizes the within-cluster variance to find the most compact partitioning of the data set. _K_-means uses an iterative refinement technique that is fast and scalable, but it can fall into local minima. Thus, the algorithm is iterated multiple times with different initial conditions and the best outcome is finally chosen. Drawbacks of this algorithm are that the number of clusters _k_ is an input parameter which must be known in advance and that clusters are assumed to be convex shaped.\n", "- Density-based spatial clustering of applications with noise (__DBSCAN__) is an algorithm that, without knowing the exact number of clusters, groups points that are close to each other leaving outliers marked as noise and not defined in any cluster. In this algorithm a neighborood distance _$\epsilon$_ and a number of points _min-samples_ are used to determine if a point belongs to a cluster: if the point has a number _min-samples_ of other points within the distance _$\epsilon$_ is marked as core point and belongs to a cluster; otherwise, the point is marked as noise. This algorithm is fast and clusters can assume any shape, but the outcome depends on the initial order of the data points.\n", "- __Hierarchical clustering__ builds a hierarchy of clusters with a bottom-up (__agglomerative__) or top-down (__divisive__) approach. In a bottom-up approach, that we deploy below, starting with all data points placed in its own cluster, different pairs of clusters are iteratively merged together where the decision of the clusters to be merged is determined in a greedy manner. This is iterated until all points are grouped within one cluster, and the resulting hierarchy of clusters is presentend in a dendogram. Given a distance thereshold it is possible to avoid merging of clusters when outside this distance, this stops the algorithm when no more mergings are possible. The algorithm then returns a certain number of clusters as a function of the threshold distance . An advantage of this algorithm is that the construction of dendroids allows for a visual inspection of the clustering, but hierarchical clustering is considerably slower than the other algorithms discussed above and not well suited for big data.\n" + "- __Hierarchical clustering__ builds a hierarchy of clusters with a bottom-up (__agglomerative__) or top-down (__divisive__) approach. 
In a bottom-up approach, which we deploy below, each data point starts in its own cluster, and pairs of clusters are iteratively merged together, where the decision of which clusters to merge is made in a greedy manner. This is iterated until all points are grouped within one cluster, and the resulting hierarchy of clusters is presented in a dendrogram. If a distance threshold is given, clusters that are farther apart than this distance are not merged, and the algorithm stops when no more mergers are possible. The algorithm then returns a certain number of clusters as a function of the threshold distance. An advantage of this algorithm is that the construction of dendrograms allows for a visual inspection of the clustering, but hierarchical clustering is considerably slower than the other algorithms discussed above and not well suited for big data.\n", + "- Density-based spatial clustering of applications with noise (__DBSCAN__) is an algorithm that, without knowing the exact number of clusters, groups points that are close to each other, leaving outliers marked as noise and not assigned to any cluster. In this algorithm, a neighborhood distance _$\epsilon$_ and a number of points _min-samples_ are used to determine whether a point belongs to a cluster: if the point has at least _min-samples_ other points within the distance _$\epsilon$_, it is marked as a core point and belongs to a cluster; otherwise, the point is marked as noise. This algorithm is fast and clusters can assume any shape, but the choice of the distance _$\epsilon$_ might be non-trivial.\n", + "- The fast search and find of density peaks (__DenPeak__) algorithm is a density-based algorithm that is able to automatically locate non-spherical clusters. Density peaks are assumed to be surrounded by lower-density regions. Based on the position of the highest density peak, the peaks can be visualized on a graph that shows their surrounding density and their distance from the first peak. It is then possible to choose from this plot the peaks to include, where each peak represents a different cluster." ] }, { @@ -67,10 +68,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Visualization of a dataset is not possible when it is defined in a highly multidimensional space. To facilitate visualization of inner structures in the dataset, we reduce the dimensionality of the system with methodologies specifically developed to avoid losing critical information, which are introduced below.\n", + "Visualization of a dataset is not possible when it is defined in a highly multidimensional space, but a visual analysis can help to detect inner structures in the dataset. Hence, in order to make such a visualization possible, we reduce the dimensionality of the system with methodologies specifically developed to avoid losing critical information during the embedding into a lower-dimensional space. In this tutorial, we use three different embedding methods, which are summarized below.\n", "- Principal component analysis (__PCA__) is a linear projection method that seeks an orthogonal transformation of the dataset so as to render the variables of the dataset uncorrelated. The dimensionality reduction is then performed along the directions of highest variance to preserve as much information as possible. 
This is a deterministic but linear method that fails to capture non-linear correlations.\n", - "- Multi-dimensional scaling (__MDS__) constructs a pairwise distance matrix in the original space, and seeks a low-dimensional representation that preserves the original distances as much as possible. This method tends to preserve local structures better than global structures and scales badly with the number of the data points. \n", - "- T-distributed Stochastic Neighbor Embedding (__t-SNE__) is a non-linear dimensionality reduction method that converts similarities between data points to joint probabilities and minimizes the Kullback-Leibler divergence between the joint probabilities of the embedding and the original space. The cost function is not convex and results depend on the inizialization. Non linear effects in this method might occasionally produce misleading results, a fine parameter tuning and several iterations of the method are then recommended.\n" + "- Multi-dimensional scaling (__MDS__) constructs a pairwise distance matrix in the original space, and seeks a low-dimensional representation that preserves the original distances as much as possible. This method tends to preserve local structures better than global structures and scales badly with the number of data points. \n", + "- T-distributed Stochastic Neighbor Embedding (__t-SNE__) is a non-linear dimensionality reduction method that converts similarities between data points to joint probabilities and minimizes the Kullback-Leibler divergence between the joint probabilities of the embedding and the original space. The cost function is not convex and results depend on the initialization. Non-linear effects in this method might occasionally produce misleading results; therefore, several iterations of the method are recommended.\n" ] }, { @@ -80,6 +81,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ "# Import required modules" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "Below we load the packages required for the tutorial. Most of the clustering and embedding algorithms are contained in the scikit-learn package. We use pandas dataframes to manipulate our dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -112,6 +120,78 @@ "pd.options.mode.chained_assignment = None" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the data\n", "We load the data and place it into a pandas dataframe. The data has been downloaded from the NOMAD archive and the NOMAD atomic data collection. It consists of RS-ZB energy differences (in eV/atom) of the 82 octet binary compounds, structure objects containing the atomic positions of the materials, and properties of the atomic constituents. The following atomic features are included:\n", "\n", "- Z: atomic number\n", "- period: period in the periodic table\n", "- IP: ionization potential\n", "- EA: electron affinity\n", "- E_HOMO: energy of the highest occupied atomic orbital\n", "- E_LUMO: energy of the lowest unoccupied atomic orbital\n", "- r_(s, p, d): radius where the radial distribution of the s, p or d orbital has its maximum."
+ ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# load data\n", "RS_structures = read(\"data/exploratory_analysis/octet_binaries/RS_structures.xyz\", index=':')\n", "ZB_structures = read(\"data/exploratory_analysis/octet_binaries/ZB_structures.xyz\", index=':')\n", "\n", "def generate_table(RS_structures, ZB_structures):\n", "\n", " for RS, ZB in zip(RS_structures, ZB_structures):\n", " energy_diff = RS.info['energy'] - ZB.info['energy']\n", " min_struc_type = 'RS' if energy_diff < 0 else 'ZB'\n", " struc_obj_min = RS if energy_diff < 0 else ZB\n", "\n", " yield [RS.info['energy'], ZB.info['energy'],\n", " energy_diff, min_struc_type,\n", " RS.info['Z'], ZB.info['Z'],\n", " RS.info['period'], ZB.info['period'],\n", " RS.info['IP'], ZB.info['IP'],\n", " RS.info['EA'], ZB.info['EA'],\n", " RS.info['E_HOMO'], ZB.info['E_HOMO'],\n", " RS.info['E_LUMO'], ZB.info['E_LUMO'],\n", " RS.info['r_s'], ZB.info['r_s'],\n", " RS.info['r_p'], ZB.info['r_p'],\n", " RS.info['r_d'], ZB.info['r_d']]\n", " \n", " \n", "df = pd.DataFrame(\n", " generate_table(RS_structures, ZB_structures),\n", " columns=['energy_RS', 'energy_ZB', \n", " 'energy_diff', 'min_struc_type', \n", " 'Z(A)', 'Z(B)', \n", " 'period(A)', 'period(B)', \n", " 'IP(A)', 'IP(B)', \n", " 'EA(A)', 'EA(B)', \n", " 'E_HOMO(A)', 'E_HOMO(B)', \n", " 'E_LUMO(A)', 'E_LUMO(B)', \n", " 'r_s(A)', 'r_s(B)', \n", " 'r_p(A)', 'r_p(B)', \n", " 'r_d(A)', 'r_d(B)',],\n", " index=list(RS.get_chemical_formula() for RS in RS_structures)\n", ")\n" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "A 'Clustering' class is defined that includes all clustering algorithms covered in the tutorial. Before creating an instance of this class, a dataframe variable 'df' must have been defined. The clustering functions in the class assign labels to the entries in the dataframe according to the results of the clustering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -164,6 +244,13 @@ " " ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "The embedding algorithms are handled with a graphical interface that is generated using Jupyter Widgets and that allows one to generate a plot with the desired embedding algorithm by pushing a button. Before plotting data with any embedding algorithm, a dataframe 'df' must have been defined and cluster labels assigned to each data point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -180,7 +267,9 @@ "\n", "\n", "def btn_eventhandler_embedding (obj):\n", + "\n", " method = str (obj.description)\n", + " \n", " try:\n", " df \n", " except NameError:\n", @@ -191,6 +280,12 @@ " except KeyError:\n", " print(\"Please assign labels with a clustering algorithm\")\n", " return\n", + " try:\n", + " hover_features\n", + " except NameError:\n", + " print(\"Please create a list 'hover_features' containing all hover features\")\n", + " return\n", + " \n", " if (method == 'PCA'):\n", " transformed_data = PCA(n_components=2).fit_transform(df[features])\n", " df['x_emb']=transformed_data[:,0]\n", " df['y_emb']=transformed_data[:,1]\n", @@ -226,79 +321,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Get the data\n", - "Let us load the data from the file data/data.pkl into a data frame. The data was downloaded from the NOMAD archive and the NOMAD atomic data collection. 
It consists of RS-ZB energy differences (in eV/atom) of the 82 octet binary compounds, structure objects containing the atomic positions of the materials and properties of the atomic constituents. The following atomic features are considered:\n", "\n", "- Z: atomic number\n", "- period: period in the periodic table\n", "- IP: ionization potential\n", "- EA: electron affinity\n", "- E_HOMO: energy of the highest occupied atomic orbital\n", "- E_LUMO: energy of the lowest unoccupied atomic orbital\n", "- r_(s, p, d): radius where the radial distribution of s, p or d orbital has its maximum." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Elementi non in ordine alfabetico" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# load data\n", "RS_structures = read(\"data/exploratory_analysis/octet_binaries/RS_structures.xyz\", index=':')\n", "ZB_structures = read(\"data/exploratory_analysis/octet_binaries/ZB_structures.xyz\", index=':')\n", "\n", "def generate_table(RS_structures, ZB_structures):\n", "\n", " for RS, ZB in zip(RS_structures, ZB_structures):\n", " energy_diff = RS.info['energy'] - ZB.info['energy']\n", " min_struc_type = 'RS' if energy_diff < 0 else 'ZB'\n", " struc_obj_min = RS if energy_diff < 0 else ZB\n", "\n", " yield [RS.info['energy'], ZB.info['energy'],\n", " energy_diff, min_struc_type,\n", " RS.info['Z'], ZB.info['Z'],\n", " RS.info['period'], ZB.info['period'],\n", " RS.info['IP'], ZB.info['IP'],\n", " RS.info['EA'], ZB.info['EA'],\n", " RS.info['E_HOMO'], ZB.info['E_HOMO'],\n", " RS.info['E_LUMO'], ZB.info['E_LUMO'],\n", " RS.info['r_s'], ZB.info['r_s'],\n", " RS.info['r_p'], ZB.info['r_p'],\n", " RS.info['r_d'], ZB.info['r_d']]\n", " \n", " \n", "df = pd.DataFrame(\n", " generate_table(RS_structures, ZB_structures),\n", " columns=['energy_RS', 'energy_ZB', \n", " 'energy_diff', 'min_struc_type', \n", " 'Z(A)', 'Z(B)', \n", " 'period(A)', 'period(B)', \n", " 'IP(A)', 'IP(B)', \n", " 'EA(A)', 'EA(B)', \n", " 'E_HOMO(A)', 'E_HOMO(B)', \n", " 'E_LUMO(A)', 'E_LUMO(B)', \n", " 'r_s(A)', 'r_s(B)', \n", " 'r_p(A)', 'r_p(B)', \n", " 'r_d(A)', 'r_d(B)',],\n", " index=list(RS.get_chemical_formula() for RS in RS_structures)\n", ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We select which features will be used for the clustering and embedding methods. The complexity of the problem clearly is reduced with lowering the number of features that are considered, and an accurate selection of the features to be processed can imporove the quality of the results. To find the most meaningful results it is sometimes necessary to iterate training while considering different features at each iteration. " + "We select which features will be used for the clustering and embedding methods. The complexity of the problem clearly decreases as the number of features is reduced, and an accurate selection of the features to be processed can improve the quality of the results. To find the most meaningful results, it is sometimes necessary to iterate training while considering different features at each iteration. 
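As an aside (not part of the original notebook), one way to follow this advice is to compare a few candidate feature subsets with a simple quality measure; the subsets and the use of the silhouette score below are illustrative assumptions, not choices made in the tutorial.

```python
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical feature subsets built from the atomic features loaded above.
candidate_subsets = {
    'orbital_radii': ['r_s(A)', 'r_s(B)', 'r_p(A)', 'r_p(B)', 'r_d(A)', 'r_d(B)'],
    'orbital_energies': ['E_HOMO(A)', 'E_HOMO(B)', 'E_LUMO(A)', 'E_LUMO(B)',
                         'IP(A)', 'IP(B)', 'EA(A)', 'EA(B)'],
}

for name, subset in candidate_subsets.items():
    X = preprocessing.scale(df[subset])                      # standardize the selected columns
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # quick 2-cluster partition
    # A higher silhouette score indicates more compact, better separated clusters.
    print(name, round(silhouette_score(X, labels), 3))
```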
" ] }, { @@ -323,16 +346,14 @@ "features.append('r_p(A)')\n", "features.append('r_p(B)')\n", "features.append('r_d(A)')\n", - "features.append('r_d(B)')\n", - "\n", - "hover_features = ['min_struc_type']" + "features.append('r_d(B)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Machine learning algorithms can improve their performance if data is standardized. In fact, training can be biased towards dimensions presenting higher absolute values, or outliers can undermine the learning capabilites of the algorithm. Hence, we standardize our dataset by subtracting the mean value and dividing it by the standard deviation for each variable. " + "Feature standardization if the operation of rescaling data so as to be shaped as a Gaussian with zero mean and unit variance, and it is a common requirement for machine learning algorithms. In fact, estimators can be biased towards dimensions presenting higher absolute values, or outliers can undermine the learning capabilites of the algorithm. Hence, we standardize our dataset by subtracting the mean value and dividing it by the standard deviation for each variable." ] }, { @@ -346,15 +367,22 @@ "df[features]=preprocessing.scale(df[features])" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Panda's dataframes offer a number of useful tools to visualize datasets. For example, here we show histograms of all 'features' for all entries in the dataframe by calling the 'hist' function. Below we see that the dataset has been normalized." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": { - "scrolled": false + "scrolled": true }, "outputs": [], "source": [ - "hist = df[features].hist( bins=10, figsize = (20,15))\n" + "hist = df[features].hist( bins=10, figsize = (20,15))" ] }, { @@ -369,13 +397,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "K-means requires the knowledge of the number of clusters and clustering depends on the initial conditions, hence the algorithm is iterated, up to _max\\_iter_ times, with different initial conditions until convergence. As initial guess we seek for 2 clusters and run the algorithm up to 200 iterations. " + "K-means requires the knowledge of the number of clusters and clustering depends on the initial conditions, hence the algorithm is iterated, up to _max\\_iter_ times, with different initial conditions until convergence. As we know that our octet binary materials crystallize in the RS and ZB structures, a natural distinction in this dataset is between materials with the most stable conformationzin in the RS vs ZB structure. Hence we seek for two clusters, aiming to find clusters of materials with the same most stable structure. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the class clustering, we call the 'kmeans' function with the desired values of clusters and maximum iterations as parameters. The function will then assign to the materials stored in the dataframe 'df' the label of the cluster they belong to." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "scrolled": true + }, "outputs": [], "source": [ "n_clusters = 2\n", "max_iter = 200\n", "Clustering().kmeans(n_clusters, max_iter)" ] }, + { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df['labels'][:10])" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the dataframe now contains the column 'labels', which can assume the values 0 or 1, because the algorithm finds two clusters." ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "Clicking any of the buttons below will display the dataset embedding according to the label placed on the button. Different clusters are visualized with different colors, and by hovering over points it is possible to see the material they represent and some selected features. In this case we are interested in seeing which is the lowest-energy structure of the materials, so we select only 'min_struc_type' as hovering feature. Please note that any other feature can be added to the 'hover_features' list." ] }, + { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hover_features = ['min_struc_type']" ] }, { "cell_type": "code", "execution_count": null, @@ -398,9 +467,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Could you identify and visualize two distinct clusters within your data? If not the number of clusters must be changed among the input parameters. You can also run the k-means clustering again and select only 1 as _max\_iter_ , which means that the first output is taken as optimal result. Try this again and compare the results, does the output change at each iteration? What happens instead if the number is much larger?\n", - "\n", - "Note that also MDS and t-SNE are stochastich algorithms, so it might be worth iterate also the embedding to find a more satisfying result." + "Could you identify and visualize two distinct clusters within your data? You can also run the k-means clustering again and select only 1 as _max\_iter_ , which means that the first output is taken as the optimal result. Try this again and compare the results: does the output change at each iteration? What happens instead if the number is much larger?\n", "\n", "Note that MDS and t-SNE are also stochastic algorithms, so it might be worth iterating the embedding as well." ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "We define a function that, for each cluster, prints the percentage of materials that are more stable in the RS vs the ZB structure. " ] }, { @@ -436,6 +512,13 @@ "composition_RS_ZB(df)" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that K-means finds two distinct clusters, and in one of these clusters there are more 'RS' stable structures while in the other there are more 'ZB' stable structures. This is a hint that, in the space described by the atomic features, materials with the same most stable structure are close to each other. On the other hand, we know that K-means is only able to detect spherically shaped clusters, therefore delimiting clusters containing only one specific stable structure is difficult under this assumption."
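A quick way to verify this statement, complementary to composition_RS_ZB and not part of the original notebook, is to cross-tabulate the cluster labels against the known most stable structure:

```python
# Rows: k-means cluster labels; columns: most stable structure (RS or ZB).
print(pd.crosstab(df['labels'], df['min_struc_type']))
```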
+ ] }, { "cell_type": "markdown", "metadata": {}, "source": [ @@ -448,7 +531,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The most relevant parameter of DBSCAN is the maximum distance $\epsilon$ that determines the extent of the cluster and whether a point is considered as noise." + "The most relevant parameter of DBSCAN is the maximum distance $\epsilon$ that determines the extent of the cluster and whether a point is considered as noise, in which case it is labeled with -1." ] }, { @@ -457,8 +540,6 @@ "metadata": {}, "outputs": [], "source": [ - "# eps = 2.4\n", - "# min_samples= 2\n", "eps = 3\n", "min_samples= 8\n", "Clustering().dbscan(eps,min_samples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { @@ -468,29 +549,34 @@ - "scrolled": false + "scrolled": true }, "outputs": [], "source": [ "display(box)" ] }, + { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "composition_RS_ZB(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Is the number of clusters that you have just found the same as the one that you used in k-means? Can you spot the noise in the visualization? What happens lowering the maximal distance $\epsilon$?\n", - "\n", - "MDS seeks for an embedding that tries to preserve pairwise distances. Can you notice that noise is distant from other clusters? Does the same happen with t-SNE? " + "We can see that the algorithm found two different clusters, and we notice that each cluster is more representative of the RS vs ZB stable structure than with K-means. However, this happens at the cost of neglecting many points, which have been classified as noise." ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "composition_RS_ZB(df)" + "Now tune the parameters and see the effect of each parameter on the amount of noise. Considering that MDS seeks an embedding that tries to preserve pairwise distances, we would expect that in an MDS embedding the noise is placed far from the defined clusters, while t-SNE tends to distort relative distances. Can you notice this in the plots above?" ] }, { @@ -512,7 +598,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "scrolled": true + }, "outputs": [], "source": [ "distance_threshold=15\n", @@ -534,25 +622,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Several different possible metrics can be used for the linkage criterium. As a default option, we have used the Ward distance, which minimizes the sum of squared differences within all clusters. This has some similarities with the objective function of _k-means_, but tackled differently. By tuning the parameters, can you find the same results using the two different methodologies?" + "Several different metrics can be used for the linkage criterion. As a default option, we used the Ward distance, which minimizes the sum of squared differences within all clusters. This has some analogies with the objective function of _k-means_. By tuning the parameters, can you find the same clusters as the ones obtained with k-means?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now try again including different features. Is the classification obtained identical?\n", - "You can hover over the different plots and explore the classification of the materials.\n", - "Can you identify meaningful clusters that group together materials that share similar properties? 
What clustering and embedding methods provide the most meaningful visualization of the data set?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "composition_RS_ZB(df)" - ] - }, + "One advantage of hierarchical methods is that they allow one to decompose and understand the clustering process. Indeed, below we plot a dendrogram that shows all agglomeration steps, from every single object forming its own cluster up to all objects grouped into a unique cluster. On the y-axis there is the distance threshold, and the number of branches in the dendrogram at a certain value on the y-axis represents the number of clusters that are generated by choosing that value as distance threshold. Hence, from the dendrogram we can select the value of the distance threshold that we need in order to obtain a certain number of clusters. " ] }, { @@ -602,6 +679,14 @@ "---" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "The fast search and find of density peaks algorithm allows for a graphical decision of which clusters to select. \n", "In the plot below, each point represents a different density peak that can define a different cluster. In the top right position of the plot we always have one point, which represents the highest-density point. The other peaks are then placed in the plot according to their surrounding density and their distance from the first peak. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Clustering().dpc()" ] }, + { "cell_type": "markdown", "metadata": {}, "source": [ "We choose threshold values on the x- and y-axes, and the algorithm will return the clusters given by the peaks that we selected. In this case, we select the 3 peaks that are closest to the top right vertex." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "Clustering().dpc(2,3.5)" + "Clustering().dpc(2,4.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "display(box)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "composition_RS_ZB(df)" ] }, { @@ -644,59 +736,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Predictions" + "We can see that we found two pure clusters containing materials with the same stable structure and a mixed cluster containing both stable structures. It is interesting to visualize this result with MDS, where we can see that the mixed cluster is placed in between the pure clusters as a transition zone." 
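For reference, the MDS view mentioned above can also be produced outside the widget interface; the following stand-alone sketch (matplotlib is used here only for illustration, and is an assumption rather than the notebook's plotting backend) colors the embedded points by the DenPeak labels:

```python
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# Two-dimensional MDS embedding of the standardized features, colored by cluster label.
embedding = MDS(n_components=2).fit_transform(df[features])
plt.scatter(embedding[:, 0], embedding[:, 1], c=df['labels'], cmap='viridis')
plt.colorbar(label='cluster label')
plt.show()
```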
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "df_temp = pd.DataFrame()\n", "\n", "for i in range (100):\n", " df_train, df_test = train_test_split ( df, test_size=0.1)\n", " data_train = df_train.loc[(df_train['labels']==0) | (df_train['labels']==2)]\n", "\n", " clf=svm.SVC(probability=True)\n", " clf.fit(data_train[features].to_numpy(),data_train['labels'].to_numpy())\n", " labels_svm = clf.predict(df_test[features].to_numpy())\n", " df_test['labels']= labels_svm\n", " df_temp=df_temp.append(composition_RS_ZB(df_test))\n", "print(df_temp.loc[0].mean(),'\\n\\n',df_temp.loc[1].mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df_temp = pd.DataFrame()\n", "\n", "for i in range (100) :\n", " df_train, df_test = train_test_split ( df, test_size=0.2)\n", "\n", " clf=svm.SVC(probability=True)\n", " labels_RS_ZB = np.where(df_train['min_struc_type'].to_numpy()=='RS',1,0)\n", " clf.fit(df_train[features].to_numpy(), labels_RS_ZB)\n", " labels_svm = clf.predict(df_test[features].to_numpy())\n", " df_test['labels']= labels_svm\n", " df_temp=df_temp.append(composition_RS_ZB(df_test))\n", "print(df_temp.loc[0].mean(),'\\n\\n',df_temp.loc[1].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Perovskite\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a last step in this tutorial, we invite you to use the tools that we have introduced on a new dataset composed of perovskite materials, where an interesting property to be analysed is the clustering of metallic vs non-metallic materials." 
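One possible workflow for this exercise, mirroring the steps used for the octet binaries, is sketched below; it assumes the loading cell that follows has already been run, and the thresholds passed to dpc() are placeholders to be read off the decision graph.

```python
Clustering().dpc()                            # inspect the decision graph first
Clustering().dpc(10, 15)                      # then select peaks with x/y thresholds (placeholder values)
display(box)                                  # embed and color the materials by cluster label
print(pd.crosstab(df['labels'], df['type']))  # metallic vs non-metallic content of each cluster
```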
] }, { @@ -712,94 +767,16 @@ "df['type'] = df.apply(lambda x : 'metallic' if x['band_gap'] < 0.2 else 'non metallic' , axis=1)\n", "\n", "features = df.drop(['material','type', 'lattice_constant', 'bul_modulus','band_gap'],axis=1).columns.tolist()\n", + "# features = df.drop(['material','type'],axis=1).columns.tolist()\n", + "df[features]=preprocessing.scale(df[features])\n", "hover_features = ['type','material','lattice_constant', 'bul_modulus','band_gap']" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "Clustering().dpc()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "Clustering().dpc(10,15)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "display(box)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df=df.sort_values(by=['labels'])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false - }, - "outputs": [], - "source": [ - "clusters=df[['material','labels']].to_numpy()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "np.savetxt('perovskites_clustering.txt', clusters, fmt='%s', delimiter='\\t')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def composition_metallic (df):\n", - " df_cm = pd.DataFrame (columns=['metallic','non metallic','Materials in cluster'])\n", - "\n", - " n_clusters = df['labels'].max() + 1\n", - " \n", - " for i in range (n_clusters):\n", - " met = int(100*len(df.loc[(df['labels']==i) & (df['type']=='metallic')])/len(df.loc[df['labels']==i]))\n", - " nomet = int(100*len(df.loc[(df['labels']==i) & (df['type']=='non metallic')])/len(df.loc[df['labels']==i]))\n", - " Tot = len(df.loc[df['labels']==i])\n", - " df_cm = df_cm.append({'metallic':met, 'non metallic':nomet, \"Materials in cluster\":Tot},ignore_index=True)\n", - " \n", - " display(df_cm)" - ] - }, - { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "composition_metallic(df)" + "Enjoy the exploration of this new data set!" ] } ], -- GitLab