diff --git a/exploratory_analysis.ipynb b/exploratory_analysis.ipynb index 0fa07c2b49e73d64de3463b22b75c30eeb1118a2..983533ea8e8b5168a15d002ccd766243552980c9 100644 --- a/exploratory_analysis.ipynb +++ b/exploratory_analysis.ipynb @@ -52,8 +52,9 @@ "source": [ "Cluster analysis is performed to group together data points that are more similar to each other in comparison with points belonging in other clusters. Clustering can be achieved by means of many different algorithms, each with proper characteristics and input parameters. The choice of the clustering algorithms to be used depends on the specific dataset analyzed, and, once an optimal algorithm has been chosen, it is often necessary to iteratively modify the input parameters until results achieve the desired resolution. We focus on four different algorithms as described below.\n", "- ___k_-means__ partitions the data set into _k_ clusters, where each datapoint belongs in the cluster with the nearest mean. This partition ultimately minimizes the within-cluster variance to find the most compact partitioning of the data set. _k_-means uses an iterative refinement technique that is fast and scalable, but if falls in local minima. Thus, the algorithm is iterated multiple times with different initial conditions and the best outcome is finally chosen. Drawbacks of this algorithm are that the number of clusters _k_ is an input parameter which must be known in advance and clusters are convex shaped.\n", - "- __Hierarchical clustering__ builds a hierarchy of clusters with a bottom-up (__agglomerative__) or top-down (__divisive__) approach. In a bottom-up approach, that we deploy below, starting with all datapoints placed in its own cluster, different pairs of clusters are iteratively merged together where the decision of the clusters to be merged is determined in a greedy manner. This is iterated until all points are grouped within one cluster, and the resulting hierarchy of clusters is presentend in a dendrogram. If a distance threshold is given, clusters are not merged if they are more distant than the threshold value, and this stops the algorithm when no more mergings are possible. The algorithm then returns a certain number of clusters as a function of the threshold distance. An advantage of this algorithm is that the construction of dendroids allows for a visual inspection of the clustering, but hierarchical clustering is a rather slow algorithm and not well suited for big data.\n", - "- Density-based spatial clustering of applications with noise (__DBSCAN__) is an algorithm that, without knowing the exact number of clusters, groups points that are close to each other leaving outliers marked as noise and not defined in any clusters. In this algorithm a neighborood distance _$\\epsilon$_ and a number of points _min-samples_ are used to determine whether a point belongs to a cluster: in case the point has a number _min-samples_ of other points within the distance _$\\epsilon$_ is marked as core point and belongs to a cluster; otherwise, the point is marked as noise. This algorithm is fast and clusters can assume any shapes, but the choice of the distance _$\\epsilon$_ migth be non trivial.\n", + "- __Hierarchical clustering__ builds a hierarchy of clusters with a bottom-up (__agglomerative__) or top-down (__divisive__) approach. In a bottom-up approach, that we deploy in this tutorial, starting with all datapoints placed in its own cluster, different pairs of clusters are iteratively merged together where the decision of the clusters to be merged is determined in a greedy manner. This is iterated until all points are grouped within one cluster, and the resulting hierarchy of clusters is presentend in a dendrogram. If a distance threshold is given, clusters are not merged if they are more distant than the threshold value, and this stops the algorithm when no more mergings are possible. The algorithm then returns a certain number of clusters as a function of the threshold distance. An advantage of this algorithm is that the construction of dendroids allows for a visual inspection of the clustering, but hierarchical clustering is a rather slow algorithm and not well suited for big data.\n", + "- Density-based spatial clustering of applications with noise (__DBSCAN__) is an algorithm that, without knowing the exact number of clusters, groups points that are close to each other leaving outliers marked as noise and not defined in any clusters. In this algorithm, a neighborood distance _$\\epsilon$_ and a number of points _min-samples_ are used to determine whether a point belongs in a cluster: in case the point has a number _min-samples_ of other points within the distance _$\\epsilon$_ is marked as core point and belongs to a cluster; otherwise, the point is marked as noise. This algorithm is fast and clusters can assume any shapes, but the choice of the distance _$\\epsilon$_ migth be non trivial.\n", + "-__HDBSCAN__ is a hierarchical extension of DBSCAN. This algorithm deploys the mutual reachability distance as distance metric to push outliers away from high density regions, thus facilitating noise detection. The mutual reachability distance acts by increasing the distance of all points that are not close to at least _min_samples_ points. Using this metric, the algorithm builds a hierarchy tree, where it extracts clusters which contain at least _min_cluster_size_ elements.\n", "- The fast search and find of density peaks (__DenPeak__) algorithm is a density-based algorithm that is able to automatically locate non-spherical clusters. Density peaks are assumed to be sourrounded by lower density regions. Based on the position of the highest density peak, the peaks can be visualized on a graph that shows their sourrounding density and the distance from the first peak. It is then possible to choose the peaks to include from this plot, where each peak represents a different cluster." ] }, @@ -90,11 +91,11 @@ }, { "cell_type": "code", - "execution_count": 88, + "execution_count": 303, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T16:07:03.608283Z", - "start_time": "2021-01-01T16:07:03.598933Z" + "end_time": "2021-01-03T18:58:58.492480Z", + "start_time": "2021-01-03T18:58:58.487049Z" } }, "outputs": [], @@ -102,6 +103,7 @@ "from ase.io import read\n", "import pandas as pd\n", "import numpy as np\n", + "from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree\n", "from scipy.cluster.hierarchy import dendrogram\n", "from sklearn import preprocessing, svm\n", "from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering\n", @@ -109,6 +111,7 @@ "from sklearn.manifold import TSNE, MDS\n", "from sklearn.svm import SVC\n", "from sklearn.model_selection import train_test_split\n", + "import hdbscan\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "import ipywidgets as widgets\n", @@ -119,18 +122,18 @@ }, { "cell_type": "code", - "execution_count": 89, + "execution_count": 231, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T16:07:04.219755Z", - "start_time": "2021-01-01T16:07:04.215903Z" + "end_time": "2021-01-03T17:13:20.488851Z", + "start_time": "2021-01-03T17:13:20.485901Z" } }, "outputs": [], "source": [ - "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) \n", - "pd.options.mode.chained_assignment = None" + "# import warnings\n", + "# warnings.filterwarnings(\"ignore\", category=DeprecationWarning) \n", + "# pd.options.mode.chained_assignment = None" ] }, { @@ -151,11 +154,11 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 232, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T15:08:43.334067Z", - "start_time": "2021-01-01T15:08:43.240012Z" + "end_time": "2021-01-03T17:13:37.256376Z", + "start_time": "2021-01-03T17:13:37.174087Z" }, "scrolled": true }, @@ -206,16 +209,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A 'Clustering' class is defined that includes all clustering algorithms that are covered during the tutorial. Before creating an instance of this class, a dataframe variable 'df' must have been defined. The clustering functions in the class assign labels to the entries in the dataframe according to the outcome of the clustering." + "We insert in the dataframe a column that contains different marker symbols for different most stable structure types. These markers will be used while visualizing the datapoints on the 2-dimensional embedding." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 264, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T14:50:23.468443Z", - "start_time": "2021-01-01T14:50:23.458779Z" + "end_time": "2021-01-03T17:45:05.445766Z", + "start_time": "2021-01-03T17:45:05.439960Z" + } + }, + "outputs": [], + "source": [ + "df['marker_symbol']= np.where(df['min_struc_type']=='RS','square-open','hexagram')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A 'Clustering' class is defined that includes all clustering algorithms that are covered during the tutorial. Before creating an instance of this class, a dataframe variable 'df' must have been defined. In this class, the clustering functions gives labels to the entries in the dataframe according to the outcome of the clustering assignments." + ] + }, + { + "cell_type": "code", + "execution_count": 370, + "metadata": { + "ExecuteTime": { + "end_time": "2021-01-03T19:23:14.144040Z", + "start_time": "2021-01-03T19:23:14.132942Z" } }, "outputs": [], @@ -235,22 +259,31 @@ " return \n", " cluster_labels = KMeans (n_clusters=n_clusters, max_iter=max_iter).fit_predict(df[features])\n", " df['clustering'] = 'k-means'\n", - " df['labels']=cluster_labels\n", + " df['cluster_label']=cluster_labels\n", "\n", " def hierarchical (self, distance_threshold):\n", " if self.df_flag: \n", " return \n", - " cluster_labels = AgglomerativeClustering(n_clusters=None, distance_threshold=distance_threshold).fit_predict(df[features])\n", - " df['clustering'] = 'Hierarchical'\n", - " df['labels']=cluster_labels\n", + " linkage_criterion = 'ward'\n", + " Z = linkage(df[features], linkage_criterion )\n", + " cluster_labels = cut_tree(Z, height=distance_threshold)\n", + " df['clustering'] = 'Hierarchical - ' + linkage_criterion + ' criterion' \n", + " df['cluster_label']=cluster_labels\n", "\n", " def dbscan (self, eps, min_samples):\n", " if self.df_flag: \n", " return \n", " cluster_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(df[features])\n", " df['clustering'] = 'DBSCAN'\n", - " df['labels']=cluster_labels\n", + " df['cluster_label']=cluster_labels\n", " \n", + " def hdbscan (self, min_cluster_size, min_samples):\n", + " clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)\n", + " clusterer.fit(df[features])\n", + " cluster_labels=clusterer.labels_\n", + " df['clustering']= 'HDBSCAN'\n", + " df['cluster_label']=cluster_labels\n", + "\n", " def dpc (self, density = 0, delta = 0 ):\n", " if self.df_flag:\n", " return\n", @@ -260,7 +293,7 @@ " clu.assign(density,delta)\n", " cluster_labels = clu.membership\n", " df['clustering'] = 'DPC'\n", - " df['labels']=cluster_labels\n", + " df['cluster_label']=cluster_labels\n", " else: \n", " clu=DPCClustering(np.ascontiguousarray(df[features].to_numpy()))\n", " " @@ -270,16 +303,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The embedding algorithms are handled with a graphical interface that is generated using Jupyter Widgets, that allows to generate a plot with the desiered embedding algorithm by pushing a button. Before plotting data with any of the embedding algorithms, a dataframe 'df' must have been defined, and cluster labels assigned to each data point." + "The embedding algorithms are handled with a graphical interface that is generated using Jupyter Widgets, that allows to generate a plot with the desired embedding algorithm by pushing a button. Before plotting data with any of the embedding algorithms, a dataframe 'df' must have been defined, and cluster labels assigned to each data point." ] }, { "cell_type": "code", - "execution_count": 320, + "execution_count": 280, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T20:10:15.901128Z", - "start_time": "2021-01-01T20:10:15.888977Z" + "end_time": "2021-01-03T18:03:38.688564Z", + "start_time": "2021-01-03T18:03:38.669764Z" } }, "outputs": [], @@ -289,10 +322,6 @@ " btn_PCA = widgets.Button(description='PCA')\n", " btn_MDS = widgets.Button(description='MDS')\n", " btn_tSNE = widgets.Button(description='t-SNE')\n", - " btn_kmeans = widgets.Button(description='k-means')\n", - " btn_hierarchical = widgets.Button(description='hierarchical')\n", - " btn_dbscan = widgets.Button(description='DBSCAN')\n", - " btn_plot = widgets.Button (description='plot')\n", "\n", " def btn_eventhandler_embedding (obj):\n", "\n", @@ -308,12 +337,7 @@ " except KeyError:\n", " print(\"Please assign labels with a clustering algorithm\")\n", " return\n", - " try:\n", - " hover_features\n", - " except NameError:\n", - " print(\"Please create a list 'hover_features' containing all hover features\")\n", - " return\n", - "\n", + " \n", " if (method == 'PCA'):\n", " transformed_data = PCA(n_components=2).fit_transform(df[features])\n", " df['x_emb']=transformed_data[:,0]\n", @@ -333,78 +357,68 @@ "\n", " def plot_embedding():\n", " with fig.batch_update():\n", - " fig['data'][0]['x']=df['x_emb']\n", - " fig['data'][0]['y']=df['y_emb']\n", - " fig['data'][0]['customdata']=np.expand_dims(df['min_struc_type'].to_numpy(),axis=1)\n", - " fig['data'][0]['hovertemplate']=r\"<b>%{text}</b><br><br> Low energy structure: %{customdata[0]}<br>\"\n", - " fig['data'][0]['marker'].color=df['labels'].to_numpy()\n", - " fig['data'][0]['marker'].colorscale=tuple([tuple([0,'#14213d']),tuple([1,'#fca311'])])\n", - " fig['data'][0]['text']=df.index\n", - " fig.update_layout(plot_bgcolor='rgba(229,236,246, 0.5)',\n", - " xaxis=dict(visible=True),\n", - " yaxis=dict(visible=True))\n", - " label.value = \"Clustering algorithm used: \" + str(df['clustering'][0]) + \"\\t Embedding method used: \" + str(df['embedding'][0]) \n", + " \n", + " for scatter in fig['data']:\n", + " cl = scatter.meta\n", + " scatter['x']=df[df['cluster_label']==cl]['x_emb']\n", + " scatter['y']=df[df['cluster_label']==cl]['y_emb']\n", + " scatter['customdata']=np.dstack((df[df['cluster_label']==cl]['min_struc_type'].to_numpy(),\n", + " df[df['cluster_label']==cl]['cluster_label'].to_numpy(),\n", + " ))[0]\n", + " scatter['hovertemplate']=r\"<b>%{text}</b><br><br> Low energy structure: %{customdata[0]}<br>Cluster label: %{customdata[1]}<br>\"\n", + " scatter['marker'].symbol=df[df['cluster_label']==cl]['marker_symbol'].to_numpy()\n", + " scatter['text']=df.index.to_list()\n", + " \n", + " fig.update_layout(\n", + " plot_bgcolor='rgba(229,236,246, 0.5)',\n", + " xaxis=dict(visible=True),\n", + " yaxis=dict(visible=True),\n", + " legend_title_text='List of clusters',\n", + " showlegend=True,)\n", + " label_b.value = \"Embedding method used: \" + str(df['embedding'][0]) \n", "\n", " btn_PCA.on_click(btn_eventhandler_embedding)\n", " btn_MDS.on_click(btn_eventhandler_embedding)\n", " btn_tSNE.on_click(btn_eventhandler_embedding)\n", - " label = widgets.Label(value='Select a dimension reduction method to visualize the 2-dimensional embedding')\n", + " label_t = widgets.Label(value=\"Clustering algorithm used: \" + str(df['clustering'][0]))\n", + " label_b = widgets.Label(value='Select a dimension reduction method to visualize the 2-dimensional embedding')\n", + "\n", " fig = go.FigureWidget()\n", - " fig.add_trace(go.Scatter(\n", - " name='embedding',\n", - " mode='markers',\n", - " ))\n", + " \n", + " for cl in np.unique(df['cluster_label'].to_numpy()):\n", + " if cl == -1:\n", + " name = 'Noise'\n", + " else:\n", + " name = 'Cluster ' + str(cl)\n", + " fig.add_trace(go.Scatter(\n", + " name=name,\n", + " mode='markers',\n", + " meta=cl\n", + " ))\n", + "\n", " fig.update_layout(plot_bgcolor='rgba(229,236,246, 0.5)',\n", + " width=800,\n", + " height=600,\n", " xaxis=dict(visible=False, title='x_emb'),\n", " yaxis=dict(visible=False, title='y_emb'))\n", "\n", - " return widgets.VBox([widgets.HBox ([btn_PCA,btn_MDS,btn_tSNE]),label, fig])" - ] - }, - { - "cell_type": "code", - "execution_count": 322, - "metadata": { - "ExecuteTime": { - "end_time": "2021-01-01T20:11:05.452861Z", - "start_time": "2021-01-01T20:11:05.304006Z" - }, - "scrolled": false - }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "fae6d0da3c25423791f763044233393e", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "VBox(children=(HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=…" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "show_embedding()" + " return widgets.VBox([widgets.HBox ([btn_PCA,btn_MDS,btn_tSNE]),label_t, label_b, fig])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We select which features will be used for the clustering and embedding methods. The complexity of the problem clearly decreases as the number of features is reduced, and an accurate selection of the features to be processed can imporove the quality of the results. To find the most meaningful results, it is sometimes necessary to iterate training while considering different features at each iteration. " + "We select which features will be used for the clustering and embedding methods. The complexity of the problem clearly decreases as the number of features is reduced, and an accurate selection of the features to be processed can improve the quality of the results. To find the most meaningful results, it is sometimes necessary to iterate training while considering different features at each iteration. " ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 281, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T15:13:02.811407Z", - "start_time": "2021-01-01T15:13:02.806201Z" + "end_time": "2021-01-03T18:03:39.335323Z", + "start_time": "2021-01-03T18:03:39.331734Z" } }, "outputs": [], @@ -437,11 +451,11 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 282, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T15:13:03.146889Z", - "start_time": "2021-01-01T15:13:03.133137Z" + "end_time": "2021-01-03T18:03:39.759950Z", + "start_time": "2021-01-03T18:03:39.749501Z" }, "scrolled": true }, @@ -459,13 +473,13 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 283, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T15:13:05.302852Z", - "start_time": "2021-01-01T15:13:03.661340Z" + "end_time": "2021-01-03T18:03:41.897594Z", + "start_time": "2021-01-03T18:03:40.216149Z" }, - "scrolled": true + "scrolled": false }, "outputs": [ { @@ -490,47 +504,47 @@ "metadata": {}, "source": [ "---\n", - "# K-Means" + "# $k$-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "$k$-means requires the knowledge of the number of clusters and clustering depends on the initial condition. Therefore the algorithm is iterated, up to _max\\_iter_ times, with different initial conditions until convergence. As we know that our octet binary materials crystallize in the RS and ZB structures, a natural distinction in this dataset is between materials with the most stable conformation in the RS vs ZB structure. Hence we seek for two clusters, aiming to find clusters of materials with the same most stable structure. " + "$k$-means requires the knowledge of the number of clusters and clustering depends on the initial conditions. Therefore the algorithm is iterated, up to _max\\_iter_ times, with different initial conditions until convergence. As we know that our octet binary materials crystallize in the RS and ZB structures, a natural distinction in this dataset is between materials with the most stable conformation in the RS vs ZB structure. Hence we seek for two clusters, aiming to find clusters of materials with the same most stable structure. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "From the class clustering, we call the 'kmeans' function with the desired values of clusters and maximum iterations as parameters. The function will then assign to the materials stored in the dataframe 'df' the label of the cluster they belong to." + "From the class clustering, we call the 'kmeans' function with the desired values of clusters and maximum iterations as parameters. The function will then assign to the materials stored in the dataframe 'df' the label of the cluster they belong in." ] }, { "cell_type": "code", - "execution_count": 94, + "execution_count": 313, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T16:08:09.067069Z", - "start_time": "2021-01-01T16:08:09.040255Z" + "end_time": "2021-01-03T19:02:53.287520Z", + "start_time": "2021-01-03T19:02:53.255273Z" }, "scrolled": true }, "outputs": [], "source": [ "n_clusters = 2\n", - "max_iter = 200\n", + "max_iter =100\n", "Clustering().kmeans(n_clusters, max_iter)" ] }, { "cell_type": "code", - "execution_count": 95, + "execution_count": 314, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T16:08:09.336832Z", - "start_time": "2021-01-01T16:08:09.331988Z" + "end_time": "2021-01-03T19:02:53.413468Z", + "start_time": "2021-01-03T19:02:53.406593Z" } }, "outputs": [ @@ -538,7 +552,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "AgBr 0\n", + "AgBr 1\n", "AgCl 1\n", "AgF 1\n", "AgI 0\n", @@ -548,145 +562,45 @@ "AlSb 0\n", "AsGa 0\n", "AsB 0\n", - "Name: labels, dtype: int32\n" + "Name: cluster_label, dtype: int32\n" ] } ], "source": [ - "print(df['labels'][:10])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can see that the dataframe contains the domain 'labels', which can assume the values 0 or 1, because the algorithm finds two clusters." + "print(df['cluster_label'][:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Clicking any of the buttons below will display the dataset embedding according to the label placed on the button. Different clusters are visualized with different colors, and by hovering over points it is possible to see the material they represent and some defined features. In this case we are interested to see which is the lowest energy structure of the materials, then we select only the 'min_struc_type' as hovering feature. Please note that any other feature can be added to the 'hover_features' list." - ] - }, - { - "cell_type": "code", - "execution_count": 101, - "metadata": { - "ExecuteTime": { - "end_time": "2021-01-01T16:10:25.790100Z", - "start_time": "2021-01-01T16:10:25.732129Z" - } - }, - "outputs": [], - "source": [ - "btn_PCA = widgets.Button(description='PCA')\n", - "btn_MDS = widgets.Button(description='MDS')\n", - "btn_tSNE = widgets.Button(description='t-SNE')\n", - "btn_kmeans = widgets.Button(description='k-means')\n", - "btn_hierarchical = widgets.Button(description='hierarchical')\n", - "btn_dbscan = widgets.Button(description='DBSCAN')\n", - "btn_plot = widgets.Button (description='plot')\n", - "\n", - "\n", - "def btn_eventhandler_embedding (obj):\n", - "\n", - " method = str (obj.description)\n", - " \n", - " try:\n", - " df \n", - " except NameError:\n", - " print(\"Please define a dataframe 'df'\")\n", - " return\n", - " try:\n", - " df['clustering'][0]\n", - " except KeyError:\n", - " print(\"Please assign labels with a clustering algorithm\")\n", - " return\n", - " try:\n", - " hover_features\n", - " except NameError:\n", - " print(\"Please create a list 'hover_features' containing all hover features\")\n", - " return\n", - " \n", - " if (method == 'PCA'):\n", - " transformed_data = PCA(n_components=2).fit_transform(df[features])\n", - " df['x_emb']=transformed_data[:,0]\n", - " df['y_emb']=transformed_data[:,1]\n", - " df['embedding'] = 'PCA'\n", - " elif (method == 'MDS'):\n", - " transformed_data = MDS (n_components=2).fit_transform(df[features])\n", - " df['x_emb']=transformed_data[:,0]\n", - " df['y_emb']=transformed_data[:,1]\n", - " df['embedding'] = 'MDS'\n", - " elif (method == 't-SNE'):\n", - " transformed_data = TSNE (n_components=2).fit_transform(df[features])\n", - " df['x_emb']=transformed_data[:,0]\n", - " df['y_emb']=transformed_data[:,1]\n", - " df['embedding'] = 't-SNE'\n", - " plot_embedding()\n", - "\n", - "def plot_embedding():\n", - " clear_output() \n", - "\n", - " fig = go.FigureWidget()\n", - "\n", - " fig.add_trace(go.Scatter(x=df['x_emb'],y=df['y_emb'],\n", - " mode='markers',\n", - " marker=dict(color=df['labels']),\n", - " customdata=np.expand_dims(df['min_struc_type'].to_numpy(),axis=1),\n", - " text=df.index,\n", - " hovertemplate=\n", - " r\"<b>%{text}</b><br><br>\" +\n", - " \"x axis: %{x:,.2f}<br>\" +\n", - " \"y axis: %{y:,.2f}<br>\" +\n", - " \"Low energy structure: %{customdata[0]}<br>\"\n", - " ))\n", - " print (\"Clustering algorithm used: \",df['clustering'][0], \"\\t Embedding method used: \", df['embedding'][0]) \n", - "# df[\"labels\"]=df[\"labels\"].astype(str)\n", - "# display(px.scatter(df,x='x_emb',y='y_emb',color=df['labels'].astype(str),hover_data=df[hover_features], hover_name=df.index ))\n", - "# with output:\n", - " display(fig)\n", - " \n", - "btn_PCA.on_click(btn_eventhandler_embedding)\n", - "btn_MDS.on_click(btn_eventhandler_embedding)\n", - "btn_tSNE.on_click(btn_eventhandler_embedding)\n", - "\n", + "We can see that the dataframe contains the domain 'cluster_label', which can assume the values 0 or 1, because the algorithm finds two clusters.\n", "\n", - "box = widgets.HBox ([btn_PCA,btn_MDS,btn_tSNE])" + "Now we deploy the graphical interface defined above to visualize the datapoints using a two dimensional embedding of our choice.\n", + "The function 'show_embedding' displays three buttons labeled with the name of the dimension reduction methods that are deployed in this tutorial.\n", + "Clicking any of the buttons will show a plot of the dataset that uses the relative two dimensional embedding. " ] }, { "cell_type": "code", - "execution_count": 104, + "execution_count": 315, "metadata": { "ExecuteTime": { - "end_time": "2021-01-01T16:10:36.562307Z", - "start_time": "2021-01-01T16:10:36.553397Z" + "end_time": "2021-01-03T19:02:53.762957Z", + "start_time": "2021-01-03T19:02:53.685207Z" }, "scrolled": false }, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Clustering algorithm used: k-means \t Embedding method used: t-SNE\n" - ] - }, { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "372ac81158154de79dae475e8db9b9d6", + "model_id": "b264423d80e64d9c8b11bfbf0d31ae36", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "FigureWidget({\n", - " 'data': [{'customdata': array([['RS'],\n", - " ['RS'],\n", - " …" + "VBox(children=(HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=…" ] }, "metadata": {}, @@ -694,32 +608,32 @@ } ], "source": [ - "display(box)" + "show_embedding()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Could you identify and visualize two distinct clusters within your data? You can also run the k-means clustering again and select only 1 as _max\\_iter_ , which means that the first output is taken as optimal result. Try this again and compare the results, does the output change at each iteration? What happens instead if the number is much larger?\n", + "In the plot, different clusters are visualized with different colors, and by hovering over points it is possible to see the name of the relative material, its most stable structure and the cluster number it was assigned to.\n", "\n", - "Note that also MDS and t-SNE are stochastich algorithms, so it might be worth iterate also the embedding." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We define a function that for each cluster prints the percentage of materials that is more stable in the RS vs ZB structure. " + "We can see open squares and hexagrams used as markers in the plot. Open squares indicate materials whose most stable structure is rocksalt, while hexograms are used for zinc blende structures. Can you modify the code so as to visualize rocksalt as diamond and zinc blende as open circle? A more difficult task is to modify the hovering features. Can you add the atomic number of the two elements to the hovering features? A hint is that text visualized in the cell appearing while hovering is defined as 'hovertemplate' in the 'show_embedding' function. Then few other modifications are required. Now inspect the atomic number values for the different materials. Are such values as you would expect, i.e. natural numbers? If not, can you explain why they are not?\n", + "\n", + "Now let's focus on the results of the clustering algorithm. Could you identify and visualize two distinct clusters in the dataset? You can also run the $k$-means clustering again and select only 1 as _max\\_iter_ , which means that the first outcome is taken as final result. Try this again and compare the results, does the output change at each iteration? What happens instead if the number is much larger? \n", + "\n", + "To compare different outcomes of the algorithm, it might be convenient to copy paste the cell containing 'show_embedding()', and updating only one of the two visualizers at each iteration. Also, only the usage of the PCA embedding allows a straightforward comparison, because MDS and t-SNE are stochastics algorithms, thus they can give different results at each call.\n", + "\n", + "We are interested in understanding whether clustering groups togheter materials which have the same most stable structure. \n", + "Therefore, we define a function that prints for each cluster the percentage of materials that is more stable in the RS vs ZB structure. " ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 293, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:43:38.603954Z", - "start_time": "2020-12-18T18:43:38.597727Z" + "end_time": "2021-01-03T18:50:06.115626Z", + "start_time": "2021-01-03T18:50:06.106963Z" } }, "outputs": [], @@ -727,14 +641,14 @@ "def composition_RS_ZB (df):\n", " df_cm = pd.DataFrame (columns=['RS','ZB','Materials in cluster'])\n", "\n", - " n_clusters = df['labels'].max() + 1\n", + " n_clusters = df['cluster_label'].max() + 1\n", " \n", " for i in range (n_clusters):\n", - " Tot = len(df.loc[df['labels']==i])\n", + " Tot = len(df.loc[df['cluster_label']==i])\n", " if (Tot == 0):\n", " continue\n", - " RS = int(100*len(df.loc[(df['labels']==i) & (df['min_struc_type']=='RS')])/len(df.loc[df['labels']==i]))\n", - " ZB = int(100*len(df.loc[(df['labels']==i) & (df['min_struc_type']=='ZB')])/len(df.loc[df['labels']==i]))\n", + " RS = int(100*len(df.loc[(df['cluster_label']==i) & (df['min_struc_type']=='RS')])/len(df.loc[df['cluster_label']==i]))\n", + " ZB = int(100*len(df.loc[(df['cluster_label']==i) & (df['min_struc_type']=='ZB')])/len(df.loc[df['cluster_label']==i]))\n", " df_cm = df_cm.append({'RS':RS, 'ZB':ZB, \"Materials in cluster\":Tot},ignore_index=True)\n", " \n", " return df_cm" @@ -742,11 +656,11 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 302, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:43:39.273928Z", - "start_time": "2020-12-18T18:43:39.238298Z" + "end_time": "2021-01-03T18:53:25.093724Z", + "start_time": "2021-01-03T18:53:25.061959Z" }, "scrolled": true }, @@ -780,15 +694,15 @@ " <tbody>\n", " <tr>\n", " <th>0</th>\n", - " <td>85</td>\n", - " <td>15</td>\n", - " <td>40</td>\n", + " <td>34</td>\n", + " <td>65</td>\n", + " <td>52</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", - " <td>16</td>\n", - " <td>83</td>\n", - " <td>42</td>\n", + " <td>76</td>\n", + " <td>23</td>\n", + " <td>30</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", @@ -796,11 +710,11 @@ ], "text/plain": [ " RS ZB Materials in cluster\n", - "0 85 15 40\n", - "1 16 83 42" + "0 34 65 52\n", + "1 76 23 30" ] }, - "execution_count": 15, + "execution_count": 302, "metadata": {}, "output_type": "execute_result" } @@ -813,60 +727,64 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that $k$-means finds two distinct clusters, and in one of these clusters there are more 'RS' stable structures while in the other there are more 'ZB' stable structures. This is a hint that in the space described by the atomic features, materials with the same most stable structure are close to each other. On the other hand, we know that $k$-means is only able to detect spherically shaped clusters, and the delimitation of clusters containing only one specific stable structure can be difficult under this assumption." + "We can see that $k$-means finds two distinct clusters, and in one of the two clusters there are more 'RS' stable structures while in the other there are more 'ZB' stable structures. This is a hint that in the space described by the atomic features, materials with the same most stable structure are close to each other, that is also possible to visualize using the different embedding algorithms. \n", + "\n", + "Observing the linear and deterministic embedding given by PCA we can clearly notice that RS and ZB structures are placed in different regions of the embedding space. But we notice that there is an overlapping area where RS and ZB are close to each other, where assessing the most stable structure is difficult if assessment is solely based on the specific location in the PCA embedding. We can also notice that the volume spanned by RS structures seems to be larger with respect to ZB structrues. On the other hand, we know that $k$-means is only able to detect convex clusters of comparable shapes, hence we can argue that it might not be able to find the desired two clusters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "---\n", - "# DBSCAN" + "\n", + "# Hierarchical agglomerative clustering\n", + "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The most relevant parameter of DBSCAN is the maximum distance $\\epsilon$ that determines the extent of the cluster and whether a point is considered as noise, that is labeled with -1." + "In a hierarchical agglomerative clustering different clusters are iteratively merged if their distance is lower than a _distance\\_threshold_. The number of clusters obtained is a function of this threshold." ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 385, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:50:54.589458Z", - "start_time": "2020-12-18T18:50:54.569294Z" - } + "end_time": "2021-01-03T19:43:07.632638Z", + "start_time": "2021-01-03T19:43:07.613870Z" + }, + "scrolled": true }, "outputs": [], "source": [ - "eps = 3\n", - "min_samples= 8\n", - "Clustering().dbscan(eps,min_samples)" + "distance_threshold=20\n", + "\n", + "Clustering().hierarchical(distance_threshold=distance_threshold)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 386, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:50:55.527429Z", - "start_time": "2020-12-18T18:50:55.518975Z" + "end_time": "2021-01-03T19:43:07.838054Z", + "start_time": "2021-01-03T19:43:07.763636Z" }, - "scrolled": true + "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "1ac9898689124574aef1ccc5775edf6f", + "model_id": "7495e8896a1d4f56b8050bbd329288da", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=ButtonStyle()),…" + "VBox(children=(HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=…" ] }, "metadata": {}, @@ -874,17 +792,18 @@ } ], "source": [ - "display(box)" + "show_embedding()" ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 387, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:50:57.413977Z", - "start_time": "2020-12-18T18:50:57.382729Z" - } + "end_time": "2021-01-03T19:43:08.416812Z", + "start_time": "2021-01-03T19:43:08.386876Z" + }, + "scrolled": true }, "outputs": [ { @@ -916,27 +835,27 @@ " <tbody>\n", " <tr>\n", " <th>0</th>\n", - " <td>0</td>\n", - " <td>100</td>\n", - " <td>17</td>\n", + " <td>86</td>\n", + " <td>13</td>\n", + " <td>44</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", - " <td>70</td>\n", - " <td>30</td>\n", - " <td>30</td>\n", + " <td>7</td>\n", + " <td>92</td>\n", + " <td>38</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ - " RS ZB Materials in cluster\n", - "0 0 100 17\n", - "1 70 30 30" + " RS ZB Materials in cluster\n", + "0 86 13 44\n", + "1 7 92 38" ] }, - "execution_count": 18, + "execution_count": 387, "metadata": {}, "output_type": "execute_result" } @@ -949,55 +868,86 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that the algorithm found two different clusters, and we notice that each cluster is more representative of the RS vs ZB stable structure compared to $k$-means. However, this happens at the cost of neglecting many points that have been classified as noise." + "Several different possible choices can be used as linkage criterion. As a default option, we have used the Ward distance, which minimizes the sum of squared differences within all clusters. This has some analogies with the objective function of $k$-means. By tuning the parameters, can you find the same clusters as the ones obtained with $k$-means? Now we would like to use a different linkage method. Can you modify the code to use single linkage instead of ward linkage? Typical of the single linkage criterion is the rich-get-richer dynamics where already large clusters tend to become even larger during linkage. Can you adjust the distance threshold so as to find only two clusters? Have these clusters similar shape?\n", + "\n", + "One advantage of hierarchical methods is that they allow to decompose and understand the clustering process. Indeed, below we plot a dendogram that shows all agglomeration steps that from having all single objects as individual clusters group objects into a unique supercluster. On the y-axys there is the distance threshold, and the number of biforcations in the dendogram for a certain value on the y-axis represents the number of clusters that are generated choosing that value as distance threshold. Hence, from the dendogram we can select the value of distance threshold that we need for having a certain number of clusters. " + ] + }, + { + "cell_type": "code", + "execution_count": 365, + "metadata": { + "ExecuteTime": { + "end_time": "2021-01-03T19:19:11.636111Z", + "start_time": "2021-01-03T19:19:11.503934Z" + } + }, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD8CAYAAABuHP8oAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAARD0lEQVR4nO3dfZCdZXnH8e+lCKlClJhVY1SiVBPiW2gzoMOoOIgTkRGcGhAtSalt6FSm2jpW6j9qx7GOU1BHKxoLESxRXtSJjRFrGZUyQxkXmhIkidoYKJiQZQINWpECV/94njDL2bN7ds95zj652e9nZmf3PG/XdfPyO/c+e59zIjORJJXnKW03IEnqjwEuSYUywCWpUAa4JBXKAJekQhngklSow3odEBEvBK4Angc8BqzPzM9GxEeBPwXG6kM/nJlbprrWwoULc8mSJQM1LElzzS233HJfZo50bu8Z4MAjwAcy89aIOAq4JSK+X+/7dGb+/XSbWLJkCaOjo9M9XJIERMSd3bb3DPDM3APsqX9+MCK2A4ubbU+SNFMzugceEUuA44Gb600XRMRtEXFZRBw9yTnrImI0IkbHxsa6HSJJ6sO0AzwijgS+Abw/Mw8AlwDHAiuoZugXdTsvM9dn5srMXDkyMuEWjiSpT9MK8Ih4GlV4X5mZ3wTIzHsz89HMfAz4MnDC8NqUJHXqGeAREcClwPbMvHjc9kXjDns7cHvz7UmSJjOdVSgnAecC2yJia73tw8A5EbECSGA3cP5QOpQkdTWdVSg3AtFl15RrviVJw+UrMSWpUNO5hfKktvHmu9i09Z6229AcdMaKxbzrxBe13YYKNudn4Ju23sMdew603YbmmDv2HHDioIHN+Rk4wPJF87nq/Ne23YbmkLO/dFPbLehJYM7PwCWpVAa4JBXKAJekQhngklQoA1ySCmWAS1KhDHBJKpQBLkmFMsAlqVAGuCQVygCXpEIZ4JJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQhngklQoA1ySCmWAS1KhegZ4RLwwIn4QEdsj4icR8b56+4KI+H5E/Kz+fvTw25UkHTSdGfgjwAcy8zjgNcB7I2I5cCFwfWa+FLi+fixJmiU9Azwz92TmrfXPDwLbgcXAGcDl9WGXA2cOq0lJ0kQzugceEUuA44Gbgedm5h6oQh54ziTnrIuI0YgYHRsbG6xbSdLjph3gEXEk8A3g/Zl5YLrnZeb6zFyZmStHRkb66VGS1MW0AjwinkYV3ldm5jfrzfdGxKJ6/yJg33BalCR1M51VKAFcCmzPzIvH7fo2sLb+eS2wqfn2JEmTOWwax5wEnAtsi4it9bYPA58Ero6I9wB3AauH06IkqZueAZ6ZNwIxye5Tmm1HkjRdvhJTkgplgEtSoaZzD1xqxMab72LT1nvabuOQcMeeaiXu2V+6qeVO2nfGisW868QXtd1GkZyBa9Zs2nrP48E11y1fNJ/li+a33Ubr7thzwCf1ATgD16xavmg+V53/2rbb0CHC30AG4wxckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQhngklQoA1ySCmWAS1KhDHBJKlQZn4k5ugG2XTuca+89o/q+4ePDuf4r3wErzxvOtSXNaWUE+LZrYe82eN4rG7/0VS/a1Pg1H7d3W/XdAJc0BGUEOFThfd532u5iZja8te0OJD2JeQ9ckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RC9QzwiLgsIvZFxO3jtn00Iu6JiK3112nDbVOS1Gk6M/CvAKu6bP90Zq6ov7Y025YkqZeeAZ6ZNwD7Z6EXSdIMDHIP/IKIuK2+xXL0ZAdFxLqIGI2I0bGxsQHKSZLG6zfALwGOBVYAe4CLJjswM9dn5srMXDkyMtJnOUlSp74CPDPvzcxHM/Mx4MvACc22JUnqpa93I4yIRZm5p374duD2qY4v2iDvRb73tup7v+9K6HuJS5pCzwCPiK8BJwMLI+Ju4CPAyRGxAkhgN3D+EHts1yDvRf68V/Vf1/cSl9RDzwDPzHO6bL50CL0cutp4L3LfS1xSD74SU5IKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBWqrxfySDr03X/V1RzYvLntNqb024VvBODOcy9puZPe5p9+OkeffVbbbTyBAS49SR3YvJmHduxg3rJlbbcyqc/e94O2W5iWh3bsADDAJc2eecuWccxXr2i7jeLdee6atlvoynvgklQoA1ySCuUtFD3BNT+9hi27hvMJeTv3vwGA865bP5Trn/aS01j9stVDubZ0KDLA9QRbdm1h5/6dLF2wtPFrH3/8jxq/5kE79+8EMMA1pxjgmmDpgqVsWLWh7TZm5LzrfNtdzT3eA5ekQhngklQoA1ySCmWAS1KhDHBJKpQBLkmFMsAlqVAGuCQVygCXpEIZ4JJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCg/Uk3SnHD/VVdzYPPmvs59aMcOAO48d03f9eeffjpHn31W3+d34wxc0pxwYPPmx4N4puYtW8a8Zcv6rv3Qjh19P3lMpecMPCIuA04H9mXmK+ptC4CrgCXAbuCszLy/8e4kqUHzli3jmK9eMet1B5m5T2U6M/CvAKs6tl0IXJ+ZLwWurx9LkmZRzwDPzBuA/R2bzwAur3++HDiz4b4kST30ew/8uZm5B6D+/pzJDoyIdRExGhGjY2NjfZaTJHUa+h8xM3N9Zq7MzJUjIyPDLidJc0a/AX5vRCwCqL/va64lSdJ09Bvg3wbW1j+vBTY1044kabp6BnhEfA24CVgaEXdHxHuATwKnRsTPgFPrx5KkWdRzHXhmnjPJrlMa7kWSNAO+ElOSCmWAS1KhDHBJKpQBLkmFMsAlqVAGuCQVygCXpEIZ4JJUKANckgrlZ2JKLRrkcxp7aeJzHCczjM931Mw5A5daNMjnNPYy6Oc4TmZYn++omXMGLrWsrc9p7NewPt9RM+cMXJIKZYBLUqEMcEkqlAEuSYUywCWpUK5CkeagQdafN7G+3HXkzXAGLs1Bg6w/H3R9uevIm+MMXJqj2lp/7jry5jgDl6RCGeCSVChvoeiQcc1Pr2HLri19nbtjf3U/97zrzuvr/NNechqrX7a6r3OltjgD1yFjy64t7Ny/s69zly1YxrIF/f1hbef+nX0/cUhtcgauQ8rSBUvZsGrDrNbsd9Yutc0ZuCQVygCXpEIZ4JJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFWqgV2JGxG7gQeBR4JHMXNlEU5Kk3pp4Kf0bM/O+Bq4jSZoBb6FIUqEGDfAE/iUibomIdd0OiIh1ETEaEaNjY2MDlpMkHTRogJ+Umb8HvAV4b0S8vvOAzFyfmSszc+XIyMiA5SRJBw0U4Jn5y/r7PuBbwAlNNCVJ6q3vAI+IZ0TEUQd/Bt4M3N5UY5KkqQ2yCuW5wLci4uB1NmbmdY10JUnqqe8Az8xdwKsb7EWSNAMuI5SkQhngklQoA1ySCmWAS1KhDHBJKpQBLkmFMsAlqVAGuCQVygCXpEIZ4JJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQhngklQoA1ySCmWAS1KhDHBJKpQBLkmFMsAlqVAGuCQVygCXpEIZ4JJUqIECPCJWRcTOiPh5RFzYVFOSpN76DvCIeCrwD8BbgOXAORGxvKnGJElTG2QGfgLw88zclZkPA18HzmimLUlSL5GZ/Z0Y8Q5gVWb+Sf34XODEzLyg47h1wLr64VJgZ//tStKcdExmjnRuPGyAC0aXbROeDTJzPbB+gDqSpC4GuYVyN/DCcY9fAPxysHYkSdM1SID/GHhpRLw4Ig4H3gl8u5m2JEm99H0LJTMfiYgLgO8BTwUuy8yfNNaZJGlKff8RU5LULl+JKUmFMsAlqVCtB3hE/F1EvL+B6xwRETsi4jnD7iMiLo6IP+unzkxr9bhOMWM+FOq2Wdsxz17dNmvPet3MbO0LGAHuAX6nfnw4cC2wm2pN+ckdx38QuB14EPgF8MGO/X8NXDRoH/W2s4Dtda07gDPH7VsE/Ddw+CyM+Qjgi8C9wH7gn4HFTY8ZeDfwq3Ff/1v38/uDjrlH3eXAKHB//fWvwPIm/llP59/zuH0fqcf7pqZrT/Lf19OBLwD3Af8D3DBbY56N2o55duq2PQP/I2BLZv5m3LYbgT8E9nY5PoA1wNHAKuCCiHjnuP0bgbURccQgfUTEYuCfgL8C5lM9cWw8ONPNzD3ADuBtM6wzoVZtqjG/D3gt8Crg+cADwOfG7W9kzJl5ZWYeefAL+HNgF3BrvX+QMU9al+q1A+8AFgALqZaifv3gwQ3W7VYbgIg4tu5hz/jtQxwzVC9uWwAcV3//yyHUbbN2W3XbrD3rddsO8LcAPzr4IDMfzszPZOaNwKOdB2fmpzLz1sx8JDN3ApuAk8btv5tqFveaQfqgelHSA5n53ax8B/g1cOy4Y34IvHWGdSbU6jVm4MXA9zLz3sx8iCrcXj7u/KbG3GktcEXWU4XaD+lvzJPWzcwHMnN3XSeo/hn8bsc5TdSdUHuczwMfAh7usq+J2k+oGxFLqf6nXZeZY5n5aGbeMoS6bdZ2zLNQt+0AfyV9vjdKRATwOqBz7fl24NUD9jEKbI+It0XEUyPiTOC3wG0D1ulWq5dLgZMi4vkR8XSqWx3f7TimiTE/LiKOAV4PXNFAnWnVjYgHgIeofrv4xBDqdq0dEauBhzNzyyTnDGPMJwJ3Ah+LiPsiYltE/MEQ6rZZ2zHPQt22A/xZVPeY+/FRqv43dGx/sL5u331k5qNU4bWRKrg3Audn5q8HrDOh1jT8FLiL6t7aAapfxf6245iBx9xhDfBvmfmLBupMq25mPgt4JnAB8B9DqDuhdkQcSfVkMdUfnYYx5hcAr6C6J/p8qjFfHhHHNVy3zdqOeRbqth3g9wNHzfSk+hWga4C3ZuZvO3YfRXWfuO8+IuJNwKeAk6n+yPgG4B8jYsWAdSbUmoZLgHnAs4FnAN9k4gx84DF3WANc3mV7v2OeVt36CfKLwBUdK2uaqNut9seAr3Z5ohpvGGP+DfB/wMfrW2g/An4AvLnhum3WdsyzULftAL8NeNlMToiIPwYuBE6p7/92Og74zwH7WEH11+LRzHwsM38M3Ay8acA63Wr18mrgK5m5v36y+hxwQkQsHLCXrn1ExElUs4Vru5zT75h71h3nKVR/uV/ccN1utU8B/iIi9kbEXqo3Z7s6Ij7UcO3OurdNdmDDddus7ZhnoW7bAb6Fanb7uHpt87z64eERMa++301EvJvqV95TM3NX58Xq1SMLgH8fsI8fA687OOOOiOOp7reP/xfyBibOhPupNeWY617WRMQzI+JpVKtDfpmZ99XnNjXmg9YC38jMbrdX+h3zpHUj4tSIOL7+W8N84GKqmcz2hutOqE0V4K+gesJeQbUi5nyqT5pqsnZn3Ruobov9TUQcVj9pnkz1vkJN1m2ztmOejbqDrHsc9Itq2djdPHHd5G6q9bjjv5bU+35B9SvJ+PXKXxx37geBixvq4wLg51T3qHYBHxi3b1F9fD/rwGc65mcDVwL7qH7VuhE4YUhjnlfXOKXL8X2Peaq6wGqqpVS/Asao/id4VdN1Jxtzx/7dTFwH3viY620vB26iWt10B/D22RrzbNR2zLNTt/U3s4qITwD7MvMzA17nCKpfRV6fmfuG2UdEXAT8V2Z+Yeadzs0xHwp126ztmGevbpu1Z7tu6wEuSepP2/fAJUl9MsAlqVAGuCQVygCXpEIZ4JJUKANckgplgEtSof4fKteuzCjMa5EAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "Z = linkage(df[features], 'ward' )\n", + "dendrogram(Z, truncate_mode='lastp',p=11);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now tune the parameters and see the effect of each parameter on the amount of noise. Considering that MDS seeks for an embedding that tries to preserve pairwise distances, we would expect that in a MDS embedding noise is placed far from the defined clusters, while t-SNE tends to twist relative distances. Can you notice it in the plots above?" + "The dendrogram function above requires the parameter $p$ which indicates the maximum number of biforcations, i.e. final clusters, which are shown in the plot. Values in parenthesis in the x-axis represent the number of objects in each cluster. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "# Hierarchical agglomerative clustering\n", - "---" + "---\n", + "# DBSCAN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In a hierarchical agglomerative clustering different clusters are iteratively merged if their distance is lower than a _distance\\_threshold_. The number of clusters obtained is a function of this threshold." + "DBSCAN is a density-based spatial clustering algorithm that detects noise and is able to extract clusters of different size and shape.\n", + "This algorithm requires two parameters: the distance $\\epsilon$ is the maximum distance for considering two points neighbours; _min_samples_ gives the minimum number of neighbor required to define a core point. \n", + "Core points are the core component of clusters, and all those points that are neither core points nor neighbor of core points are labeled as noise.\n" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 416, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:51:42.316557Z", - "start_time": "2020-12-18T18:51:42.305922Z" - }, - "scrolled": true + "end_time": "2021-01-03T19:48:28.447927Z", + "start_time": "2021-01-03T19:48:28.432013Z" + } }, "outputs": [], "source": [ - "distance_threshold=15\n", - "Clustering().hierarchical(distance_threshold=distance_threshold)" + "eps = 3\n", + "min_samples= 8\n", + "Clustering().dbscan(eps,min_samples)" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 417, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:51:42.933612Z", - "start_time": "2020-12-18T18:51:42.924728Z" + "end_time": "2021-01-03T19:48:28.641161Z", + "start_time": "2021-01-03T19:48:28.572840Z" }, "scrolled": false }, @@ -1005,12 +955,12 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "1ac9898689124574aef1ccc5775edf6f", + "model_id": "819d75a896b14216a17b3d3588f7663d", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=ButtonStyle()),…" + "VBox(children=(HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=…" ] }, "metadata": {}, @@ -1018,97 +968,94 @@ } ], "source": [ - "display(box)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Several different possible metrics can be used for the linkage criterium. As a default option, we used the Ward distance, which minimizes the sum of squared differences within all clusters. This has some analogies with the objective function of _k-means_. By tuning the parameters, can you find the same clusters as the ones obtained with k-means?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "One advantage of hierarchical methods is that they allow to decompose and understand the clustering process. Indeed, below we plot a dendogram that shows all agglomeration steps that from having all single objects as individual clusters group objects into a unique supercluster. On the y-axys there is the distance threshold, and the number of biforcations in the dendogram for a certain value on the y-axis represents the number of clusters that are generated choosing that value as distance threshold. Hence, from the dendogram we can select the value of distance threshold that we need to have a certain number of clusters. " + "show_embedding()" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 418, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:52:40.950213Z", - "start_time": "2020-12-18T18:52:40.723508Z" + "end_time": "2021-01-03T19:48:30.831414Z", + "start_time": "2021-01-03T19:48:30.798968Z" } }, "outputs": [ - { - "ename": "AttributeError", - "evalue": "'AgglomerativeClustering' object has no attribute 'distances_'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m<ipython-input-21-4f7faa6e95b9>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtitle\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Hierarchical Clustering Dendrogram'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 28\u001b[0m \u001b[0;31m# plot the top three levels of the dendrogram\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 29\u001b[0;31m \u001b[0mplot_dendrogram\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtruncate_mode\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'level'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 30\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mxlabel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Number of points in node (or index of point if no parenthesis).\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 31\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshow\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m<ipython-input-21-4f7faa6e95b9>\u001b[0m in \u001b[0;36mplot_dendrogram\u001b[0;34m(model, **kwargs)\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mcounts\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcurrent_count\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m linkage_matrix = np.column_stack([model.children_, model.distances_,\n\u001b[0m\u001b[1;32m 17\u001b[0m counts]).astype(float)\n\u001b[1;32m 18\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mAttributeError\u001b[0m: 'AgglomerativeClustering' object has no attribute 'distances_'" - ] - }, { "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAEICAYAAABcVE8dAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVZklEQVR4nO3cf7BkZZ3f8feHGVFABJTx18wgKMgPE7HgirpZBYMGBktnTUgCqAirIWQXYypuBTZbq27pJrqVrbgG3MnIIjFunNWVVTAgq2VYNMDKTILIQDAjKAw/ZEBAHXYlA9/8cc44Pc2duT339r13uM/7VdVVfc55uvt7nu7+9NPP6dOpKiRJC98e812AJGluGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8HcjSdYnOWE3qOOsJN/eyfarkrx7Nh9jhNtfk+S9M6lhHJK8Psnt813HOCQ5IcnG+a5Ds8fAnyNJfpjkTUPrtgu9qnpFVV0z58XtoqpaUVX/ZTYfI8meST6c5P8m2dz33yVJDh7jY8zoQwegqr5VVYePq6ZB/Yfa3yb5WZKfJlmX5IIkz5yNx9PCZ+AvAEkWT+M2i2ajljH6c+BtwBnAfsDRwDrgxPksatB0+n0azquqfYEXAR8ATgOuTJI5eOxfGve+zlHfaYiBvxsZ/BaQZI9+NPeDJA8l+UKS5/bbDk5SSd6T5C7gm/36Lya5P8mjSa5N8oqB+740yR8nuTLJZuCNSZYnuSzJpv4xLhyq5z8keTjJnUlWDKzfbjolyT9Lcls/Er01yTH9+q31b13/9hH74U3Am4GVVXVjVW2pqker6qKq+pNJ2n84yecGlrf2z+J++awkd/R13JnkHUmOBFYBr0vy8ySP9G2f2e/3XUl+nGRVkr36bSck2Zjk/CT3A58Zngbpn8PfSnJz/zz8WZJnDWz/N0nuS3Jvkvf2dR46VZ9U1eb+29/bgNcBb+nvb5TXybv7/Xkwye8M1LJX/7p4OMmtwKuH+vWH/b7eDGxOsjjJ29JNPT7Svw6OHGh/TJL/3ffzF/t9/+hO+u6AJF/tX38P99eXDdzfNUk+muS6/jm6Isnzkvxpum88N2aM3/haYODvvv4l8GvA8cCLgYeBi4baHA8cCZzUL18FHAY8H/hfwJ8OtT8D+H1gX+B64KvAj4CDgaXAmoG2rwFuBw4E/gD4k+Spo8ok/xj4MHAm8By6QHqo3/wD4PV0I/TfAz6X5EUj7PubgO9U1d0jtN2pJPsAnwRW9CPlXwFuqqrbgHOB66vq2VW1f3+TjwMvB14FHErXLx8cuMsXAs8FXgKcs4OH/SfAycAhwCuBs/paTgb+db9/h9I9f7ukqu4C1tL1K4z2OvlV4HC6b0cfHAjpDwEv6y8nAZMdlzmd7sNlf+ClwOeBfwUsAa4Erkg3/bYn8BfApXT983lg+AN+uO/2AD7TLx8E/A1w4dBtTgPeRfc8vIzudfuZ/n5u6/dBo6oqL3NwAX4I/Bx4ZODyGPDtoTZv6q/fBpw4sO1FwP8DFtMFdAEv3cnj7d+32a9fvhT47MD21wGbgMWT3PYsYMPA8t79fb2wX74GeG9//Wrg/SP2wU10o/atj/HtHbT7NLBmivsarOHDwOcGtm3tn8XAPn1f/yNgr0n2c7D/A2wGXjbUT3f2108AHgeeNbD9BGDj0HP4zoHlPwBW9dcvAf79wLZD+zoPnWofh9avAT69C6+TZQPbvwOc1l+/Azh5YNs5k+zLrw8s/y7whYHlPYB7+j54Q389A9u/DXx0R303yX69Cnh4aP9/Z2D5D4GrBpbfSvfhPe/v76fLxRH+3Pq1qtp/6wX4jZ20fQnwF/1X50fo3thPAC8YaPPLEXCSRUk+1n+1/yndmxW6EfpT2gPLgR9V1ZYdPP79W69U1WP91WdP0m453Uj+KZKcmeSmgX34O0P17MhDdME1Y1W1GfindKP5+5L89yRH7KD5EroPt3UDNX+tX7/Vpqr62yke9v6B64+xrd9ezPbPwXS/wSwFftJfH+V1Mmo9P5rksQa3v3iwTVU92W9f2m+7p/oknuS2MNR3SfZO8p+T/Kh/zV4L7J/tjy/9eOD630yyPNlrUjtg4O++7qabhth/4PKsqrpnoM3gm+sMYCXddMF+dKM76Eatk7W/GzgoMz94djfdV+3tJHkJ3Uj9POB5/QfcLUP17Mg3gOMG53OnsJkuqLd64eDGqrq6qt5M9yHyf/q6YPv+AHiQLkReMdDn+1XVYKjM5O9l7wMG92n5rt5BkuXAscC3+lWjvE52Vs9gDQdN0mZwf++l+4DZWkv629/T39fSoWm/4f0b7rsP0E01vaaqnkP3LQFGe41oGgz83dcq4Pf74CTJkiQrd9J+X+AXdKPjvYF/N8X9f4fuTfqxJPskeVaSvzeNOi8GfivJsekc2te8D90bfFNf/9l0I/wpVdU3gK/TjVyP7Q8W7pvk3CS/PslNbgLekOSgJPsBv711Q5IX9Aca96Hrn5/TjYChGy0u6+eft45YPw38xyTP72+/NMlJjMcXgLOTHJlkb7Y/NrBT/Wj4eOArdM/dlf2mXX2dDNfz2/3B02XA+0Zo/5YkJyZ5Bl1g/wK4jm5u/QngvP75WgkcN8X97Uv3AftIugPNzsfPMgN/9/VHwOXAXyb5GXAD3YHUHfks3dfte4Bb+/Y7VFVP0M2BHgrcBWykm/rYJVX1RboDwf8N+BnwZeC5VXUr3Zzr9XTB+neB/7kLd30qXaj9GfAo3beDCbrR/3ANX+/b3Uz3082vDmzegy6Y7qWbBjmebVNp3wTWA/cnebBfdz6wAbihn2b4Bt0odMaq6iq6A8j/o3+M6/tNv9jJzS7sn/8fA58AvkQ37/5kv31XXyeDfo/uNXMn8JfAf52i/tuBdwL/ie7b0FuBt1bV41X1OPAPgffQHTN5J93zsLN9+wSwV39fN9BNn2kWZfspN0lzpf+1zC3AM3dyLOVpK8lf0x2w/sx816KOI3xpDiV5e/8zxgPofgJ6xUIJ+yTHJ3lhP6XzbrqfpDpq341MGfjpTmd/IMktO9ieJJ9MsiHdySbHjL9MacH453THNX5AN+f9L+a3nLE6HPgu3RTcB4BTq+q++S1Jg6ac0knyBroDXZ+tqqccdEtyCt3BnlPo5g7/qKpGnUOUJM2RKUf4VXUt237zO5mVdB8GVVU30P2Odiy/oZYkjc84/sBoKdufYLGxX/eUr3JJzqE/HX2fffY59ogjdnT+iyRpMuvWrXuwqpZM3fKpxhH4k50kMek8UVWtBlYDTExM1Nq1a8fw8JLUjiSTnRE9knH8Smcj259Rt4zuN8+SpN3IOAL/cuDM/tc6rwUe9ci8JO1+ppzSSfJ5un+6OzDd/35/CHgGQFWtojsb8hS6MwcfA86erWIlSdM3ZeBX1elTbC/gN8dWkSRpVnimrSQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1IiRAj/JyUluT7IhyQWTbN8vyRVJvptkfZKzx1+qJGkmpgz8JIuAi4AVwFHA6UmOGmr2m8CtVXU0cALwh0n2HHOtkqQZGGWEfxywoaruqKrHgTXAyqE2BeybJMCzgZ8AW8ZaqSRpRkYJ/KXA3QPLG/t1gy4EjgTuBb4HvL+qnhy+oyTnJFmbZO2mTZumWbIkaTpGCfxMsq6Glk8CbgJeDLwKuDDJc55yo6rVVTVRVRNLlizZxVIlSTMxSuBvBJYPLC+jG8kPOhu4rDobgDuBI8ZToiRpHEYJ/BuBw5Ic0h+IPQ24fKjNXcCJAEleABwO3DHOQiVJM7N4qgZVtSXJecDVwCLgkqpan+Tcfvsq4CPApUm+RzcFdH5VPTiLdUuSdtGUgQ9QVVcCVw6tWzVw/V7gH4y3NEnSOHmmrSQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGjBT4SU5OcnuSDUku2EGbE5LclGR9kr8ab5mSpJlaPFWDJIuAi4A3AxuBG5NcXlW3DrTZH/gUcHJV3ZXk+bNUryRpmkYZ4R8HbKiqO6rqcWANsHKozRnAZVV1F0BVPTDeMiVJMzVK4C8F7h5Y3tivG/Ry4IAk1yRZl+TMye4oyTlJ1iZZu2nTpulVLEmallECP5Osq6HlxcCxwFuAk4DfTfLyp9yoanVVTVTVxJIlS3a5WEnS9E05h083ol8+sLwMuHeSNg9W1WZgc5JrgaOB74+lSknSjI0ywr8ROCzJIUn2BE4DLh9q8xXg9UkWJ9kbeA1w23hLlSTNxJQj/KrakuQ84GpgEXBJVa1Pcm6/fVVV3Zbka8DNwJPAxVV1y2wWLknaNakano6fGxMTE7V27dp5eWxJerpKsq6qJqZzW8+0laRGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGjFS4Cc5OcntSTYkuWAn7V6d5Ikkp46vREnSOEwZ+EkWARcBK4CjgNOTHLWDdh8Hrh53kZKkmRtlhH8csKGq7qiqx4E1wMpJ2r0P+BLwwBjrkySNySiBvxS4e2B5Y7/ul5IsBd4OrNrZHSU5J8naJGs3bdq0q7VKkmZglMDPJOtqaPkTwPlV9cTO7qiqVlfVRFVNLFmyZMQSJUnjsHiENhuB5QPLy4B7h9pMAGuSABwInJJkS1V9eRxFSpJmbpTAvxE4LMkhwD3AacAZgw2q6pCt15NcCnzVsJek3cuUgV9VW5KcR/frm0XAJVW1Psm5/fadzttLknYPo4zwqaorgSuH1k0a9FV11szLkiSNm2faSlIjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWrESIGf5OQktyfZkOSCSba/I8nN/eW6JEePv1RJ0kxMGfhJFgEXASuAo4DTkxw11OxO4PiqeiXwEWD1uAuVJM3MKCP844ANVXVHVT0OrAFWDjaoquuq6uF+8QZg2XjLlCTN1CiBvxS4e2B5Y79uR94DXDXZhiTnJFmbZO2mTZtGr1KSNGOjBH4mWVeTNkzeSBf450+2vapWV9VEVU0sWbJk9ColSTO2eIQ2G4HlA8vLgHuHGyV5JXAxsKKqHhpPeZKkcRllhH8jcFiSQ5LsCZwGXD7YIMlBwGXAu6rq++MvU5I0U1OO8KtqS5LzgKuBRcAlVbU+ybn99lXAB4HnAZ9KArClqiZmr2xJ0q5K1aTT8bNuYmKi1q5dOy+PLUlPV0nWTXdA7Zm2ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSI0YK/CQnJ7k9yYYkF0yyPUk+2W+/Ockx4y9VkjQTUwZ+kkXARcAK4Cjg9CRHDTVbARzWX84B/njMdUqSZmiUEf5xwIaquqOqHgfWACuH2qwEPludG4D9k7xozLVKkmZg8QhtlgJ3DyxvBF4zQpulwH2DjZKcQ/cNAOAXSW7ZpWoXrgOBB+e7iN2EfbGNfbGNfbHN4dO94SiBn0nW1TTaUFWrgdUASdZW1cQIj7/g2Rfb2Bfb2Bfb2BfbJFk73duOMqWzEVg+sLwMuHcabSRJ82iUwL8ROCzJIUn2BE4DLh9qczlwZv9rndcCj1bVfcN3JEmaP1NO6VTVliTnAVcDi4BLqmp9knP77auAK4FTgA3AY8DZIzz26mlXvfDYF9vYF9vYF9vYF9tMuy9S9ZSpdknSAuSZtpLUCANfkhox64Hv3zJsM0JfvKPvg5uTXJfk6Pmocy5M1RcD7V6d5Ikkp85lfXNplL5IckKSm5KsT/JXc13jXBnhPbJfkiuSfLfvi1GOFz7tJLkkyQM7Oldp2rlZVbN2oTvI+wPgpcCewHeBo4banAJcRfdb/tcCfz2bNc3XZcS++BXggP76ipb7YqDdN+l+FHDqfNc9j6+L/YFbgYP65efPd93z2Bf/Fvh4f30J8BNgz/mufRb64g3AMcAtO9g+rdyc7RG+f8uwzZR9UVXXVdXD/eINdOczLESjvC4A3gd8CXhgLoubY6P0xRnAZVV1F0BVLdT+GKUvCtg3SYBn0wX+lrktc/ZV1bV0+7Yj08rN2Q78Hf3lwq62WQh2dT/fQ/cJvhBN2RdJlgJvB1bNYV3zYZTXxcuBA5Jck2RdkjPnrLq5NUpfXAgcSXdi5/eA91fVk3NT3m5lWrk5yl8rzMTY/pZhARh5P5O8kS7wf3VWK5o/o/TFJ4Dzq+qJbjC3YI3SF4uBY4ETgb2A65PcUFXfn+3i5tgofXEScBPw94GXAV9P8q2q+uks17a7mVZuznbg+7cM24y0n0leCVwMrKiqh+aotrk2Sl9MAGv6sD8QOCXJlqr68pxUOHdGfY88WFWbgc1JrgWOBhZa4I/SF2cDH6tuIntDkjuBI4DvzE2Ju41p5eZsT+n4twzbTNkXSQ4CLgPetQBHb4Om7IuqOqSqDq6qg4E/B35jAYY9jPYe+Qrw+iSLk+xN92+1t81xnXNhlL64i+6bDkleQPfPkXfMaZW7h2nl5qyO8Gv2/pbhaWfEvvgg8DzgU/3IdkstwH8IHLEvmjBKX1TVbUm+BtwMPAlcXFUL7q/FR3xdfAS4NMn36KY1zq+qBfe3yUk+D5wAHJhkI/Ah4Bkws9z0rxUkqRGeaStJjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiP+P03AEu5791vuAAAAAElFTkSuQmCC\n", + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>RS</th>\n", + " <th>ZB</th>\n", + " <th>Materials in cluster</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>0</td>\n", + " <td>100</td>\n", + " <td>17</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>70</td>\n", + " <td>30</td>\n", + " <td>30</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], "text/plain": [ - "<Figure size 432x288 with 1 Axes>" + " RS ZB Materials in cluster\n", + "0 0 100 17\n", + "1 70 30 30" ] }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" + "execution_count": 418, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "def plot_dendrogram(model, **kwargs):\n", - " # Create linkage matrix and then plot the dendrogram\n", - "\n", - " # create the counts of samples under each node\n", - " counts = np.zeros(model.children_.shape[0])\n", - " n_samples = len(model.labels_)\n", - " for i, merge in enumerate(model.children_):\n", - " current_count = 0\n", - " for child_idx in merge:\n", - " if child_idx < n_samples:\n", - " current_count += 1 # leaf node\n", - " else:\n", - " current_count += counts[child_idx - n_samples]\n", - " counts[i] = current_count\n", - "\n", - " linkage_matrix = np.column_stack([model.children_, model.distances_,\n", - " counts]).astype(float)\n", - "\n", - " # Plot the corresponding dendrogram\n", - " dendrogram(linkage_matrix, **kwargs)\n", - "\n", + "composition_RS_ZB(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the algorithm has found two different clusters, and we notice that each cluster is representative of the RS vs ZB structure. However, this happens at the cost of neglecting many points that have been classified as noise.\n", + "Now tune the parameters and see the effect of each parameter on the amount of noise.\n", "\n", - "# setting distance_threshold=0 ensures we compute the full tree.\n", - "model = AgglomerativeClustering(distance_threshold=15, n_clusters=None)\n", + "Considering that MDS seeks for an embedding that tries to preserve local pairwise distances, we would expect that in a MDS embedding noise is placed far from the defined clusters. Differently t-SNE tends to privilege global structures at the expenses of losing local definition, hence noise can be placed closed to other clusters. In fact, it is possible to notice that using t-SNE points tend to be equally distanced from each other, but clusters are quite distinguishable. Pairwise distances are not meaningful in a t-SNE embedding because it aims to depict global arrangements of clusters. On the other hand, MDS sometimes fails to arrange the different clusters.\n", "\n", - "model = model.fit(df[features].to_numpy())\n", - "plt.title('Hierarchical Clustering Dendrogram')\n", - "# plot the top three levels of the dendrogram\n", - "plot_dendrogram(model, truncate_mode='level', p=3)\n", - "plt.xlabel(\"Number of points in node (or index of point if no parenthesis).\")\n", - "plt.show()\n" + "Can you notice in this case that noise is better isolated in a MDS embedding rather than using a t-SNE embedding? Try using a small amount of noise by tuning down the parameters for an easier visualization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Fast search and find of density peaks\n", + "# HDBSCAN\n", "---" ] }, @@ -1116,19 +1063,66 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The Fast search and find of density peaks allows for a graphical decision of the clusters to select. \n", - "In the plot below, each point represents a different peak that can represent a different cluster. On the top right positon of the plot we always have one point that represents the highest density point. The other peaks are then placed in the plot according to their sourrounding density and distance from the first peak. " + "HDBSCAN can be defined as a hierarchical extension of DBSCAN, with respect to which it has a number of advantages. \n", + "One advantage is that there is only one relevant parameter to be tuned, i.e. the minimum size of clusters. \n", + "This parameter is more intuitive to assess in comparison to e.g. the $\\epsilon$ threshold in DBSCAN.\n", + "In the HDBSCAN library we deploy, the minimum number of samples that is used for the mutual reachability distance is by default fixed to the same value of the minimum cluster size, as they both have the same objective. \n", + "In this tutorial we explicitly define the two values. " ] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 425, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T18:55:01.700344Z", - "start_time": "2020-12-18T18:55:01.658371Z" - }, - "scrolled": true + "end_time": "2021-01-03T20:23:27.932727Z", + "start_time": "2021-01-03T20:23:27.916036Z" + } + }, + "outputs": [], + "source": [ + "min_cluster_size = 10\n", + "min_samples = 10\n", + "Clustering().hdbscan(min_cluster_size=min_cluster_size, min_samples=min_samples)" + ] + }, + { + "cell_type": "code", + "execution_count": 426, + "metadata": { + "ExecuteTime": { + "end_time": "2021-01-03T20:23:28.187432Z", + "start_time": "2021-01-03T20:23:28.113866Z" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f1c6cdf4e83a4a15b17adcd5b2783dc3", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox(children=(HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "show_embedding()" + ] + }, + { + "cell_type": "code", + "execution_count": 427, + "metadata": { + "ExecuteTime": { + "end_time": "2021-01-03T20:23:28.334834Z", + "start_time": "2021-01-03T20:23:28.313888Z" + } }, "outputs": [ { @@ -1152,433 +1146,139 @@ " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", - " <th>energy_RS</th>\n", - " <th>energy_ZB</th>\n", - " <th>energy_diff</th>\n", - " <th>min_struc_type</th>\n", - " <th>Z(A)</th>\n", - " <th>Z(B)</th>\n", - " <th>period(A)</th>\n", - " <th>period(B)</th>\n", - " <th>IP(A)</th>\n", - " <th>IP(B)</th>\n", - " <th>...</th>\n", - " <th>r_s(B)</th>\n", - " <th>r_p(A)</th>\n", - " <th>r_p(B)</th>\n", - " <th>r_d(A)</th>\n", - " <th>r_d(B)</th>\n", - " <th>clustering</th>\n", - " <th>labels</th>\n", - " <th>x_emb</th>\n", - " <th>y_emb</th>\n", - " <th>embedding</th>\n", + " <th>RS</th>\n", + " <th>ZB</th>\n", + " <th>Materials in cluster</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", - " <th>AgBr</th>\n", - " <td>-108781.333959</td>\n", - " <td>-108781.303925</td>\n", - " <td>-0.030033</td>\n", - " <td>RS</td>\n", - " <td>1.044160</td>\n", - " <td>0.500776</td>\n", - " <td>5.0</td>\n", - " <td>4.0</td>\n", - " <td>-0.682185</td>\n", - " <td>-0.155241</td>\n", - " <td>...</td>\n", - " <td>0.039278</td>\n", - " <td>-0.022042</td>\n", - " <td>0.109901</td>\n", - " <td>0.249586</td>\n", - " <td>-0.095787</td>\n", - " <td>Hierarchical</td>\n", - " <td>1</td>\n", - " <td>-39.526848</td>\n", - " <td>22.154827</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>AgCl</th>\n", - " <td>-79397.451083</td>\n", - " <td>-79397.408285</td>\n", - " <td>-0.042797</td>\n", - " <td>RS</td>\n", - " <td>1.044160</td>\n", - " <td>-0.558171</td>\n", - " <td>5.0</td>\n", - " <td>3.0</td>\n", - " <td>-0.682185</td>\n", - " <td>-0.550095</td>\n", - " <td>...</td>\n", - " <td>-0.349442</td>\n", - " <td>-0.022042</td>\n", - " <td>-0.333307</td>\n", - " <td>0.249586</td>\n", - " <td>-0.806601</td>\n", - " <td>Hierarchical</td>\n", - " <td>1</td>\n", - " <td>-50.796947</td>\n", - " <td>28.824591</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>AgF</th>\n", - " <td>-74477.428165</td>\n", - " <td>-74477.274407</td>\n", - " <td>-0.153758</td>\n", - " <td>RS</td>\n", - " <td>1.044160</td>\n", - " <td>-1.028814</td>\n", - " <td>5.0</td>\n", - " <td>2.0</td>\n", - " <td>-0.682185</td>\n", - " <td>-2.285188</td>\n", - " <td>...</td>\n", - " <td>-1.848794</td>\n", - " <td>-0.022042</td>\n", - " <td>-1.773734</td>\n", - " <td>0.249586</td>\n", - " <td>-1.659579</td>\n", - " <td>Hierarchical</td>\n", - " <td>2</td>\n", - " <td>38.285912</td>\n", - " <td>95.895142</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>AgI</th>\n", - " <td>-171339.208181</td>\n", - " <td>-171339.245107</td>\n", - " <td>0.036925</td>\n", - " <td>ZB</td>\n", - " <td>1.044160</td>\n", - " <td>1.559722</td>\n", - " <td>5.0</td>\n", - " <td>5.0</td>\n", - " <td>-0.682185</td>\n", - " <td>0.283853</td>\n", - " <td>...</td>\n", - " <td>0.872252</td>\n", - " <td>-0.022042</td>\n", - " <td>0.811648</td>\n", - " <td>0.249586</td>\n", - " <td>-0.628898</td>\n", - " <td>Hierarchical</td>\n", - " <td>1</td>\n", - " <td>-50.215115</td>\n", - " <td>13.525675</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>AlAs</th>\n", - " <td>-34200.077513</td>\n", - " <td>-34200.290775</td>\n", - " <td>0.213262</td>\n", - " <td>ZB</td>\n", - " <td>-0.901775</td>\n", - " <td>0.383115</td>\n", - " <td>3.0</td>\n", - " <td>4.0</td>\n", - " <td>0.560131</td>\n", - " <td>0.912996</td>\n", - " <td>...</td>\n", - " <td>0.594594</td>\n", - " <td>-0.753988</td>\n", - " <td>0.700845</td>\n", - " <td>-0.437955</td>\n", - " <td>0.437324</td>\n", - " <td>Hierarchical</td>\n", + " <th>0</th>\n", " <td>0</td>\n", - " <td>-7.230611</td>\n", - " <td>-354.558777</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>...</th>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " <td>...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>SrTe</th>\n", - " <td>-137269.487147</td>\n", - " <td>-137269.107853</td>\n", - " <td>-0.379295</td>\n", - " <td>RS</td>\n", - " <td>0.529060</td>\n", - " <td>1.500892</td>\n", - " <td>5.0</td>\n", - " <td>5.0</td>\n", - " <td>0.423168</td>\n", - " <td>0.722286</td>\n", - " <td>...</td>\n", - " <td>1.094378</td>\n", - " <td>0.978781</td>\n", - " <td>1.070186</td>\n", - " <td>-0.931917</td>\n", - " <td>-0.237949</td>\n", - " <td>Hierarchical</td>\n", - " <td>1</td>\n", - " <td>-5.772508</td>\n", - " <td>10.822165</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>OZn</th>\n", - " <td>-25540.809205</td>\n", - " <td>-25540.911173</td>\n", - " <td>0.101968</td>\n", - " <td>ZB</td>\n", - " <td>0.071193</td>\n", - " <td>-1.087644</td>\n", - " <td>4.0</td>\n", - " <td>2.0</td>\n", - " <td>-1.815302</td>\n", - " <td>-1.348317</td>\n", - " <td>...</td>\n", - " <td>-1.571137</td>\n", - " <td>-0.514985</td>\n", - " <td>-1.552129</td>\n", - " <td>-0.231026</td>\n", - " <td>1.148139</td>\n", - " <td>Hierarchical</td>\n", - " <td>2</td>\n", - " <td>71.758156</td>\n", - " <td>-28.533104</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>SZn</th>\n", - " <td>-29945.889373</td>\n", - " <td>-29946.165186</td>\n", - " <td>0.275813</td>\n", - " <td>ZB</td>\n", - " <td>0.071193</td>\n", - " <td>-0.617001</td>\n", - " <td>4.0</td>\n", - " <td>3.0</td>\n", - " <td>-1.815302</td>\n", - " <td>0.114207</td>\n", - " <td>...</td>\n", - " <td>-0.016253</td>\n", - " <td>-0.514985</td>\n", - " <td>-0.000901</td>\n", - " <td>-0.231026</td>\n", - " <td>1.681250</td>\n", - " <td>Hierarchical</td>\n", - " <td>3</td>\n", - " <td>35.325409</td>\n", - " <td>-33.065041</td>\n", - " <td>t-SNE</td>\n", - " </tr>\n", - " <tr>\n", - " <th>SeZn</th>\n", - " <td>-57752.319875</td>\n", - " <td>-57752.583012</td>\n", - " <td>0.263137</td>\n", - " <td>ZB</td>\n", - " <td>0.071193</td>\n", - " <td>0.441945</td>\n", - " <td>4.0</td>\n", - " <td>4.0</td>\n", - " <td>-1.815302</td>\n", - " <td>0.381952</td>\n", - " <td>...</td>\n", - " <td>0.316936</td>\n", - " <td>-0.514985</td>\n", - " <td>0.368439</td>\n", - " <td>-0.231026</td>\n", - " <td>1.005977</td>\n", - " <td>Hierarchical</td>\n", - " <td>3</td>\n", - " <td>28.319963</td>\n", - " <td>-44.431747</td>\n", - " <td>t-SNE</td>\n", + " <td>100</td>\n", + " <td>12</td>\n", " </tr>\n", " <tr>\n", - " <th>TeZn</th>\n", - " <td>-118239.807676</td>\n", - " <td>-118240.052677</td>\n", - " <td>0.245001</td>\n", - " <td>ZB</td>\n", - " <td>0.071193</td>\n", - " <td>1.500892</td>\n", - " <td>4.0</td>\n", - " <td>5.0</td>\n", - " <td>-1.815302</td>\n", - " <td>0.722286</td>\n", - " <td>...</td>\n", - " <td>1.094378</td>\n", - " <td>-0.514985</td>\n", - " <td>1.070186</td>\n", - " <td>-0.231026</td>\n", - " <td>-0.237949</td>\n", - " <td>Hierarchical</td>\n", - " <td>3</td>\n", - " <td>4.945252</td>\n", - " <td>-34.944271</td>\n", - " <td>t-SNE</td>\n", + " <th>1</th>\n", + " <td>82</td>\n", + " <td>17</td>\n", + " <td>17</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", - "<p>82 rows × 27 columns</p>\n", "</div>" ], "text/plain": [ - " energy_RS energy_ZB energy_diff min_struc_type Z(A) \\\n", - "AgBr -108781.333959 -108781.303925 -0.030033 RS 1.044160 \n", - "AgCl -79397.451083 -79397.408285 -0.042797 RS 1.044160 \n", - "AgF -74477.428165 -74477.274407 -0.153758 RS 1.044160 \n", - "AgI -171339.208181 -171339.245107 0.036925 ZB 1.044160 \n", - "AlAs -34200.077513 -34200.290775 0.213262 ZB -0.901775 \n", - "... ... ... ... ... ... \n", - "SrTe -137269.487147 -137269.107853 -0.379295 RS 0.529060 \n", - "OZn -25540.809205 -25540.911173 0.101968 ZB 0.071193 \n", - "SZn -29945.889373 -29946.165186 0.275813 ZB 0.071193 \n", - "SeZn -57752.319875 -57752.583012 0.263137 ZB 0.071193 \n", - "TeZn -118239.807676 -118240.052677 0.245001 ZB 0.071193 \n", - "\n", - " Z(B) period(A) period(B) IP(A) IP(B) ... r_s(B) \\\n", - "AgBr 0.500776 5.0 4.0 -0.682185 -0.155241 ... 0.039278 \n", - "AgCl -0.558171 5.0 3.0 -0.682185 -0.550095 ... -0.349442 \n", - "AgF -1.028814 5.0 2.0 -0.682185 -2.285188 ... -1.848794 \n", - "AgI 1.559722 5.0 5.0 -0.682185 0.283853 ... 0.872252 \n", - "AlAs 0.383115 3.0 4.0 0.560131 0.912996 ... 0.594594 \n", - "... ... ... ... ... ... ... ... \n", - "SrTe 1.500892 5.0 5.0 0.423168 0.722286 ... 1.094378 \n", - "OZn -1.087644 4.0 2.0 -1.815302 -1.348317 ... -1.571137 \n", - "SZn -0.617001 4.0 3.0 -1.815302 0.114207 ... -0.016253 \n", - "SeZn 0.441945 4.0 4.0 -1.815302 0.381952 ... 0.316936 \n", - "TeZn 1.500892 4.0 5.0 -1.815302 0.722286 ... 1.094378 \n", - "\n", - " r_p(A) r_p(B) r_d(A) r_d(B) clustering labels x_emb \\\n", - "AgBr -0.022042 0.109901 0.249586 -0.095787 Hierarchical 1 -39.526848 \n", - "AgCl -0.022042 -0.333307 0.249586 -0.806601 Hierarchical 1 -50.796947 \n", - "AgF -0.022042 -1.773734 0.249586 -1.659579 Hierarchical 2 38.285912 \n", - "AgI -0.022042 0.811648 0.249586 -0.628898 Hierarchical 1 -50.215115 \n", - "AlAs -0.753988 0.700845 -0.437955 0.437324 Hierarchical 0 -7.230611 \n", - "... ... ... ... ... ... ... ... \n", - "SrTe 0.978781 1.070186 -0.931917 -0.237949 Hierarchical 1 -5.772508 \n", - "OZn -0.514985 -1.552129 -0.231026 1.148139 Hierarchical 2 71.758156 \n", - "SZn -0.514985 -0.000901 -0.231026 1.681250 Hierarchical 3 35.325409 \n", - "SeZn -0.514985 0.368439 -0.231026 1.005977 Hierarchical 3 28.319963 \n", - "TeZn -0.514985 1.070186 -0.231026 -0.237949 Hierarchical 3 4.945252 \n", - "\n", - " y_emb embedding \n", - "AgBr 22.154827 t-SNE \n", - "AgCl 28.824591 t-SNE \n", - "AgF 95.895142 t-SNE \n", - "AgI 13.525675 t-SNE \n", - "AlAs -354.558777 t-SNE \n", - "... ... ... \n", - "SrTe 10.822165 t-SNE \n", - "OZn -28.533104 t-SNE \n", - "SZn -33.065041 t-SNE \n", - "SeZn -44.431747 t-SNE \n", - "TeZn -34.944271 t-SNE \n", - "\n", - "[82 rows x 27 columns]" + " RS ZB Materials in cluster\n", + "0 0 100 12\n", + "1 82 17 17" ] }, - "execution_count": 22, + "execution_count": 427, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df" + "composition_RS_ZB(df)" ] }, { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "ExecuteTime": { - "end_time": "2020-12-18T18:55:39.726794Z", - "start_time": "2020-12-18T18:55:39.689441Z" - } - }, - "outputs": [], + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We obtain two clusters with high percentage of only one most stable structure. However, the number of materials classified as noise is considerably large.\n", + "The effect of _min_samples_ is to fix how conservative respect to noise detection the algorithm should be. Increasing its value the distorsion effects of the mutual reachability distance become more evident, while decreasing it less points are classified as noise. \n", + "Can you obtain more meaningful results by decreasing the value of this parameter?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, "source": [ - "import hdbscan" + "# Fast search and find of density peaks\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The fast search and find of density peaks algorithm allows to make the clusters selection based on a graphical interpretation. " ] }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 295, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T19:02:33.653988Z", - "start_time": "2020-12-18T19:02:33.633756Z" + "end_time": "2021-01-03T18:50:38.962692Z", + "start_time": "2021-01-03T18:50:38.839100Z" } }, "outputs": [ { "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVsAAAFLCAYAAAB4GS92AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3de5xcdZ3n/9e7u0NuyjURHEymQR0RmJ0RGwdQUEdQcHTwgoLKjigu6nhbFcd1hBHMA2cdfqPjoqj81KijDMErY5aAsCMIDLcE1AEMI5dAQIgJBtg0pEl3f/aPUwWVSlV1nepT55yqej8fj35U+tzqc05XPnXO96qIwMzMumuo6ADMzAaBk62ZWQ6cbM3McuBka2aWAydbM7McONmameVgpOgAirBo0aIYHR0tOgwz6zNr1qzZFBGLG60byGQ7OjrK6tWriw7DzPqMpHuarXMxgplZDgpNtpJOkhQNft4zw367SFouabOkRyR9V9IeecVtZpZWWYoR/hx4vOb3u2bYfgXwPOBdwDTwWeDHwOFdic7MbJbKkmxvjIgt7Wwo6VDgVcBLI+LnlWX3A9dLOjIiLu9inGZmHenFMttjgA3VRAsQETcAd1fWmZmVTlmS7Z2SJiXdLundM2y7H7C2wfJfV9aZmZVO0cUIDwCnAzcAw8BbgK9IWhARn2+yz27Aww2Wbwb27UqUZmazVGiyjYhLgUtrFq2SNBc4TdIXImK62a4NlqnJ8mSldApwCsDSpUs7jNjMAMYnJtnw6Fb23HkeC+cWfc+WvW6cXxmv0veBNwOjNG6VsBlo1ENjVxrf8QIQEecB5wGMjY15xHSzDkxOTbNs5W1ccON6RobE5HRwwsFLOP01+zMyXJZSyc518/zKfHWaJcS1NC6bbVaWa2YZWbbyNi5cvZ6JyWnGn5hiYnKaC1evZ9nK24oOLRPdPL8yJts3ApuAZt3eVgF7SXpJdYGkMZLy2lXdD89sMI1PTHLBjet5fNv2pXuPb5tmxer1jE9MFhRZNrp9foUWI0j6AUnl2K9IKsiOr/x8sFpeK+kO4MqIOBkgIq6VdCnwbUmn8lSnhqvdxtasezY8upWRITHRYN2wxIZHt7Lv4qflHldWun1+RZfZ3g68E1hCUsF1G/BXEfHPNduMkCTiWicAnwe+QXJ3vhL4YNejNRtge+48j8npxqV7UxHsufO8nCPKVrfPr9BihIj424h4XkQsiIj5EfHCukRLRIxGxEl1yx6OiHdExK4RsXNEvDUiNuUavNmAWTh3hBMOXsL8Odunjflzhjh+bEnPt0ro9vn19tUxs1yd/pr9AVixej3DElMRvHlsyZPLe103z08Rg9cKamxsLDyerVnn3M62MUlrImKs0br+u0pm1nUL5470dGXYTLpxfmVs+mVm1necbM3McuBka2aWAydbM7McONmameXAydbMLAdOtmZmOXCyNTPLgZOtmVkOnGzNzHLgZGtmlgMnWzOzHDjZmpnlwMnWzCwHTrZmZjlwsjUzy0Gpkq2kvSVtkRSSmo7cK2m0sk39zwV5xmtm1q6yzdRwNrAFWNjm9qcC19T87kkfzayUSpNsJR0OHA18hiTptuP2iLiue1GZmWWjFMlW0jBwDvBp4OGCwzEzy1xZymzfA8wDvpRyv+WSpiQ9IOlzkuZ3ITYzs1kr/M5W0h7AMuDEiNgmqZ3dJkgS80+BR4GXAR8Hng0c2+R9TgFOAVi6dOms4zYzS6PwZAucBVwfERe3u0NEPAC8v2bRFZI2AOdK+tOI+EWDfc4DzgMYGxuLWcZsZpZKocUIkg4A3gmcKWlXSbsCCyqrd0lZLPD9yutBWcZoZpaFou9snwvMAa5tsO4+4OvAu9o8VtS9mpmVRtHJ9mrg5XXLjiYpf301cFeKYx1XeV2TQVxmZpkqNNlGxCbgitplkkYr/7wqIrZUlt0BXBkRJ1d+PwN4OkmHhkeBI4CPAT+MiF/lELqZWSpF39m2awQYrvl9LUnvsXcB84F7STpCnJV/aNbM+MQkGx7dyp47z2Ph3F75qJl1R+n+B0TEN4Fv1i0brfv9AsDjIJTU5NQ0y1bexgU3rmdkSExOByccvITTX7M/I8Nladptlq/SJVvrfctW3saFq9czMTnNRGXZhavXA3DmsQcWF5hZgXybYZkan5jkghvX8/i26e2WP75tmhWr1zM+MVlQZGbFcrK1TG14dCsjQ417AQ5LbHh0a84RmZWDk61las+d5zE53bip81QEe+48L+eIzMrBydYytXDuCCccvIT5c7b/aM2fM8TxY0vcKsEGlj/5lrnTX7M/ACtWr2dYYiqCN48teXK52SBSxOD1bh0bG4vVq1cXHUbfcztbGzSS1kTEWKN1/h9gXbNw7gj7Lm46lZzZQHGZrZlZDpxszcxy4GRrZpYDJ1szsxw42ZqZ5cDJ1swsB062ZmY5cLI1M8uBk62ZWQ6cbM3MclCqZCtpb0lbJIWklv08Je0iabmkzZIekfRdSXvkFavZbIxPTHLXxi0eTH2AlG1shLOBLcDCNrZdATyPZNLHaeCzwI+Bw7sWndkseX62wVWaZCvpcOBo4DMkSbfVtocCrwJeGhE/ryy7H7he0pERcXm34zXrhOdnG1yl+CqVNAycA3wa2NTGLscAG6qJFiAibgDurqwzKx3PzzbYSpFsgfcA84Avtbn9fsDaBst/XVlnVjqen22wFV6MUKnUWgacGBHbpMYfxjq7AQ83WL4Z2DfD8Mwy4/nZBlsZ7mzPAq6PiItT7tfoU6smy5F0iqTVklZv3LgxbYxms+b52QZboX9dSQcA7wSOkLRrZfGCyusukqYi4vEGu24GFjdYviuN73iJiPOA8yCZFmdWgZt1yPOzDa6iv0qfC8wBrm2w7j7g6yRNu+qtpXETr/1Imn+ZldLI8BBnHnsgf3P0fp6fbcAU/Ve+Gnh53bKjgY8DrwbuarLfKuB0SS+JiKsBJI2RlNeu6lKsZpnx/GyDp9BkGxGbgCtql0karfzzqojYUll2B3BlRJxc2e9aSZcC35Z0Kk91arjabWzNrIzKUEHWjhFguG7ZCcCVwDeAbwNrgNfnHJeZWVsUMXh1RWNjY7F69eqiw7CSG5+YdLmqpSJpTUSMNVrnT5BZHY9fYN3gZGtWx+MXWDf4a9qshscvsG5xsjWr4fELrFucbM1qePwC6xYnW7MaHr/AuiWzT46kfYHLgYiIZ2d1XLO8efwC64Ysv6bnAKM0GXXLrFd4/ALrhiw/QXcC+2R4PLNCefwCy1JmyTYiJoF7sjqemVk/cQWZmVkOnGzNzHLQdjGCpGZjy9Zza4Q+0GwQFg/OYtaZNP9bhmjc0mAXkuloAH4LbJttUFacZoOwfOKY/fj7VWs9OItZh9pOthEx2mydpOcA/wtYCLxq9mFZUZoNwnL93Q9xz0OPeXAWsw5lcksSEXcAbwD2Bj6VxTEtf60GYVn74BYPzmI2C5k9/0XEVuAy4C1ZHdPy1WoQlmY8OItZe7IubJsE9sr4mJaTVoOwNOPBWczak1mylbSIZA6w9Vkd0/LVahCW/fZ6mgdnMZuFNE2//q7FMZYAx5K0TPhEimMeB3wEeB5J5do9wD8D/xARTzTZZxS4u8GqFRFxQrvvbY01G4Sl2hrBg7OYdabtCR8lTc+wyaPAFyKi7QoySe8GlgKrgYeBFwFnAF+PiPc32WeUJNmeClxTs2pTpaJuRp7wcWZuZ2uWXlYTPr68yfJpYDOwtjI+Qtsi4qt1i34maWfgfZI+EK2/CW6PiOvSvJ+1r9kgLB6cxawzadrZXtnNQGo8BOyU03uZmeWiFF1/JA1LWiDpJcAHgS/PcFcLsFzSlKQHJH1O0vwcQjUz60hZCt3GgbmVf38b+FiLbSeALwE/JSknfhnwceDZJJV0ZmalU5ZpcQ4DFpBUkP0d8EXgrxttGBEPALWVZ1dI2gCcK+lPI+IXTeI7BTgFYOnSpSnDMzObnSyLEarT4oym3TEiboqIqyPicyTFCO+VlCZhf7/yelCL9zgvIsYiYmzx4sVpQzQzm5Usk211Wpx9Z3mcmyqvaabYibpXM7NSKeO0OC+uvDbquNDMcZXXNRm8v1lDbmNss1HoJ0bSJSTlvLcCUySJ9qMkvcHurGxzB3BlRJxc+f0M4OkkHRoeBY4gqVD7YUT8Ku9zsP7XbIxfj+VraRT99XwjcBJJOe8kcBdJd9+v1GwzAgzX/L6WpPfYu4D5wL3A2cBZXY/WBlKzMX7BY/la+9rurgsgaSFJK4FXkYxdO7fBZqWfFsfdda1d4xOTHLTsMiYmd+ytPm/OEGtOO8pFCvakTLrrStoVuBrYn+TxfWfgEZLeXtUOBZ4Wx/pKdYzfiQbrqmP5uvuytSNNgdNpJIn2ZGC3yrLPA08jaSd7E0mLhOdnGaBZkVqN8euxfC2NNMn2L4GfR8Ty2q60kbgOeDWwH/DJjGMcWOMTk9y1cct20840WpZ3DIOk1Ri/HsvX0kjzSVkCrKz5fZqaMtuI+J2kVcAJwOnZhDeYGtV+v3nsWYC4cHU+NeKugX9KszF+PZavpZEm2T5G0jyr6hF2nAJnA0nFmc1Co9rv86+/FySmpiOXGnHXwD9lZHiIM489kL85ej+3s7WOpblFWU9yd1t1G3CEpNpmWS8BHswisEHVbIbbqYCpurLDbs1u22qW3UGeTbc6lq8TrXUiTbK9EnippOr0qytIRtr635LeJ+l7wCHAxRnHOFDSznDbjdltW8Xg2XTNOpPmK/pbJM28nkVyl/sV4M+B1wGvrGxzDUmrBetQ2hluu1Ej7hp4s+y1fWdbGZnrvRGxvvL7ZES8ATgYeAtwKPDSiHi4O6EOhma138OC4bq7zW7ViLsG3ix7s/5fExFr8AAwmWpU+/2mFyatEb63Jp8acdfAm2UrVXfdftEr3XUbjTKV98hTHunKrH1Zza5rOWs0k23es9t6Nl2zbAxW63Qzs4I42ZqZ5cDJ1swsB062ZmY5aJlsJf1C0t9J+i95BWRm1o9murOdA5wB3CzpDkn/IOmw7odlZtZfWibbiDgAeB7wt8AmkskYr5L0gKQvSzpKkpuPmZnNYMYy24j4TUR8NiIOIRn164Mks+GeDFwCbJT0z5LeIGlBmjeXdJykf5f0kKStkm6XdJqknWbYbxdJyyVtlvSIpO9K2iPNe5sNkkEfBL4MUt2VRsRvgS8BX5K0G8nsDa8H3gC8DXhc0k+BHwErI+L3MxxyD+BnJLPjPgy8iKTYYi/g/S32W0Fyx/0ukkHMPwv8GDg8zfmY9TsPAl8eHRcBRMRmkpHAviVpPsm0OK+vvB5LMvFjo9l3a4/x1bpFP5O0M/A+SR+IBn2JJR1KMrvvSyPi55Vl9wPXSzoyIi7v9JzM+o0HgS+PTL7aIuLxiPhBRJwIPAM4Gvh6h4d7iGQox2aOATZUE23l/W8A7q6sMzM8CHzZZP4cURl68acR8dft7iNpWNICSS8hKRP+cqO72or9gLUNlv+6ss7M8CDwZVOWlgTjPFXk8G3gYy223Y2kfLfeZmDfjOMy61keBL5cylJCfhhJ5dZHScp7vzjD9o0+QWqyPFkpnSJptaTVGzdu7DhQs17hQeDLpRRXOyJuqvzzakmbSCrd/jEi7myw+WZgcYPlu9L4jrf6HucB50Eynu0sQzbrCR4EvjxKkWzrVBPvPkCjZLuWxk289iNp/mVmFZ6GvTzKUoxQ68WV17ubrF8F7FWpTANA0hhJee2qLsdm1pM8DXvxCr3yki4BLifpkTZFkmg/CqyoFiFIugO4MiJOBoiIayVdCnxb0qk81anharexnZmnuTErRtH/224ETgJGgUngLuATJNOkV40Aw3X7nQB8HvgGyd35SpImY9aEexKZFSvVhI+SngmcRtKDa28adz6IiCg6ibfUKxM+ZulTF93Chau3b+A+f84Qbx5b4p5EZhlpNeFj27c0kvYGVgPv5ql2sfcCvyEpAhDwS+Cq2QZs2XJPIrPipXl+/DuSAWKOjog/qSxbHhH7kVROXQrMJxmUxkrEPYnMipcm2b4KuKRRJVRE3Ae8iSTZnplRbH2nqGHu3JPIrHhpylb3Ai6s+X2KJLkCEBFbJF1G0gPMlVU1iq6cqvYkalZm61YJZt2X5n/Zo2xfIbaZpJKs1iM07t010MowzJ17EpkVK02yvYdkpoaqXwJ/LmlBRDwmaQh4JXBflgH2umrl1MRk48qpvzl6v1zuLN2TyKxYaZ5h/w/wcklzKr9/C/gD4N8lnQ1cAxxAMouCVZStcso9icyKkeZ/3NdJig4WAQ9ExHckvRD4AFCd6vwC4KxsQ+xtrpwyM0hxZ1sz8eMDNcs+DDwTOBR4ZkS8NSLcjqiGh7kzM8igu25EbAQ8QGwLrpwys7a760qaAs6IiGUttvkkcKa76zbmQWDM+lur7rpp/ser8tPOdtZAtXLKzAZP1i3qdwNcZmtmVqflna2kI+oWjTZYBskQiEuBtwG3ZxRbafjx38xma6bMcQVPTaIYwNsrP42IZCDvj2YSWQkU3c3WzPrHTMn20yRJViSjfl0BXNlguyngIeBnEbE2ywCLVIZutmbWH1om24g4o/pvSW8HfhwR/6vbQZVBWbrZmll/aDtbRMQ+3QykbKrdbCcarKt2s3XLAjNrlwsem3A3WzPLUtM7W0n/1uExIyJe0c6Gkt4E/FfghcAuJC0Z/r+I+JcW+4zSeJrzFRFxQupom/AYsN3j1h02iFp90l/W4THbn0ESPkKSOD8MbAJeDZwvaVFEnDPDvqeSjDRWtSlVlG1wN9tsuXWHDbJUs+tm/uZJUt1Ut+x84NBmZcQ1d7avjYiVnbxv2u66vhPLhmf4tX6Xyey63VCfaCtuBp6RdyyteAzY2fMMvzboyvjsdhhwWxvbLZc0JekBSZ+TNH/mXawoZRtE3SxvqZKtpCFJH5B0naRHJE3WrHuBpHMl/VGnwUh6BcmEkV9qsdlEZf3JwCuArwLvJRm43ErKrTts0LX9XCxpJ2AVScXZ74H/C9Q2NL0beCfJ2LafShtIpSz2fOCiiPhms+0qg5e/v2bRFZI2AOdK+tOI+EWT458CnAKwdOnStOHZLLl1hw26NHe2HwNeDpwJ7Al8rXZlRDwM/Bx4VdogJO1OksjvBU5Muz/w/crrQc02iIjzImIsIsYWL/YEwEU4/TX78+axJcybM8TCnYaZN2eI179gb0485A9LV2Y7PjHJXRu3lC4u611pbifeBlwTEZ8GkNTomfBu4LVpApC0AFhJMk36X0TEeJr9K6Lu1Uqodobf3z78OMuvWccPbrqPi37x29I0A3PzNOuWNMl2H+B/z7DN74Hd2z2gpBHge8BzgRdHxO9SxFPruMrrmg73txwtnDvCd667hx/dfF/pBvnx4EPWLWm+qh8Hdp1hm6XAwymOeS5JR4ZlwO6SDqn5mQsg6Q5JX6/uIOkMSf8o6Q2SjpT0aeDzwA8j4lcp3tsKUtZmYGWNy/pDmjvbXwCvlLRTRDxRv1LSLiTltf+e4pivrLx+ocG6fYB1lRiHa5avJek99i5gPkk579l4CvWeUdZBfsoal/WHNMn2/we+C3xX0sm1KyTtCiwnmRbnK+0eMCJG024TERfgZl49razNwMoal/WHtosRKoPDLAfeSNK8670AklYDD5C0jz03Ii7uQpw9xTXZrVWbgc2fs/3Hb/6cIY5P0Qws6+ucVVxmjaT69ETEyZKuAj4E/BeSGRwOAm4FPhcRy7MPsXe4Jrt9sxnkp5vX2YMPWbd0PBBNpXvsbsAjHTbXKkzagWja5YFW0utkkJ88rrMHH7JOdGUgmoh4PCJ+22uJtltck92ZtIP85HWdPfiQZc3PthnxQCv58HW2XtVqpoa7OjxmRMSzO9y3Z7kmOx++ztarWt3ZDpFUgNX+zAVGKz/PImnn+qyaZXNnOGbf6pWa7DK1lOgkljJf5zJdWyufpp/M+vatknYGLgfuAT4BXBUR05KGgCOAvydJtEd2LdqSK2NNdrWiZ4+FO/G5y/6zFC0lZtuaoGzX2a1QrB1tt0aQdA5JD7EDm/Qgmwf8B7AqIj6YaZQZ61ZrhKoy1GTXJ4Ct26aBYKrmz11US4msWhOU4TqDW6HYU7JqjfB6krFmd0i0ABGxFbgIeEP6EPtLbU12Vo+WaY9TO6DK+BNTTMX2iRbybykxPjHJrfc/wr/ccG8mrQnK0GLArVCsXWk+pXsAc2bYZk5lu4GX1aNlJ8epJoCJyemG62vl0ee/9hyGJJ6oz/o5xpI1j6dg7UpzZ3sncFxlwJkdSNqNZKjDTlsx9JX6O8uJyWkuXL2eZSvbmV5tdsdp1TyqXh41+LXn8Pi2qUJjyZpbR1i70iTbrwB/ANwg6a8kjUqaX3l9O3A9sBet5w8bCFk9WnZ6nFYJoFYeNfjNzqGIWLqhzK0jrFzSDETzReAckoG+l5Pc6W6pvH4DeA7wxYg4twtx9pSsGt53epxmCWB4SAyLJ6ekyaMGf6a77PlzhnKLpVsaTffTy+dj3ZF2IJoPSbqAZGLHFwC7AI8ANwHfjIg0Y9n2raweLWdznEbNo44fW8JHjvojHhp/Irca/FbnMHdkiO+/5zBGFy3s6TvA2ul+ytA6wsop9SciIq4Fru1CLH0jq5lkZ3OcVglglwU7dXhm6c10Dgfs3bAKoCdVW0eYNeKv3y7JquH9bI9ThgRQtk4IZkXoeIjFXtbtTg21smp4X5YG/LPRD+dg1kqrTg3+xHdZVneWZbhDna1+OAezThXacVvSmyT9q6T7JW2RtEbSW9rYbxdJyyVtlvSIpO9KcmcKMyutokfJ+AhJ87EPA38J/Aw4X9IHZthvBfAykhl2TwIOBn7ctSjNzGap6GKE10bEpprf/03SH5Ak4XMa7SDpUJIBcV4aET+vLLsfuF7SkRFxebeDNjNLq9A727pEW3Uz8IwWux0DbKgm2spxbgDurqwzMyudoosRGjkMaDWAwH7A2gbLf11ZZ2ZWOqVKtpJeARxL6/EVdgMebrB8c2Vds2OfImm1pNUbN26cXaBmZimVJtlKGgXOJxkz95szbN6ocbCaLE92iDgvIsYiYmzx4sWdhmlm1pFSJFtJuwOrgHuBE2fYfDOwa4Plu9L4jtfMrHCFJ1tJC4CVwE7AX0TE+Ay7rKVx2Wyzslwzs8IV3alhBPgeybCNx0TE79rYbRWwl6SX1BxnDNi3ss7MrHSKbmd7LvBq4EPA7pIOqVl3c0RMSLoDuDIiToZk1DFJlwLflnQqMA18FrjabWzNrKyKTravrLx+ocG6fYB1JDEO1607Afg8yaDlQyTFEKWe0dfMBluhyTYiRjvZJiIeBt5R+bEGihxhy6N7me3I/xP6TFaz+mb93hOT06zblNR99vrMDK34i8aa8aehz9TOZFudXvvC1esBOPPYA3N/7xWr13PdXZv4ze/Gqc6OMyzx1j9byqde2/0vgLwU+SVnvcGfgj6S1ay+Wb731m3T3L7hqUQLyfxp599wb+pp3cssq6nrrX852faRrGb1zfq9G5maDi64sbtfAHkp8kvOeoeTbR/JalbfrN+7maEufwHkpcgvOesdTrY9bHxikrs2bnnyzqk6k+38Odv/WefPGeL4FLP61h+3Hc3eu5XpLn8BVHVyPmkU+SVnvcMVZD2oVWXMbGaynW0lT6P3Ht1jAb/ZsIWpulw0PCROOLj9L4BO5FVpldXU9dbfPLtuD/rURbc0/Y9dbXHQSROkdo7bjtr3njsyxJk/uZXzr7/3yYQ7PCTe+qLut0bI6nzaUU3stV80x4+5NcKgaTW7rpNtjxmfmOSgZZcxMTm9w7p5c4ZYc9pRHd1Jdeu4tcfPs51tt8+n1fu6ne3gapVs/ZXbY7pVGdPtSp6Fc0c4YO9dOGDvXXJJQkVVWlWna3eitXpOtj2mW5Ux/VbJ02/nY73PybbHZNXiIK/jFqXfzsd6nz9xPWg2LQ6KOG5R+u18rLe5gqyHdasypt8qefrtfKy8XEHWp7pVGVO2Sp7ZdkpodD6dHDOrzhHd7mRh5VSO/01mDXSjU0Inx8wqDo8MNticbK20ujFcZCfHzCqOIoe/tOL567QHdPuxs4yPta1G0vqXG+/ldx20k+1kdK7ZjOhVe13LODJYGf/u/azwO1tJzwE+BhwCHAhcFREvm2GfUeDuBqtWRMQJGYdYmG4/dpb1sXZ8YpLV637PcJMRG5+YDF782X/jrS9amirWakeHiQbrqh0d9l38tFnv0+i6HvX8ZzCk1p0s6o/TLWX9u/e7wpMtcADJDLvXATul3PdU4Jqa3zdlFVQZdPuxs2yPtbVJYHhIPLZtx662Vdumgu9cdw/TESx73R+3dfxOOjp0sk+j67ryPx5sGlfenSzK9ncfFGX4GvtJRCyJiDcBt6bc9/aIuK7m545uBFiEbj92lvGxtjYJPPbE1IzbTwWcf/29bcfaSUeHtPs0u67N5N3Joox/90FReLKNiPY+lQOm2337yzbgddokVTUVPDnATTtOf83+vHlsCfPmDLFwp2HmVUYBa9XRIc0+aWasmDsy83tnrWx/90FShmKE2VguaXfgd8C/AJ+MiMcLjikT3e7bX7axA1qVjWZpZHiIM489kL85er+2Ozqk2afdGSvmzxni++85jAP23qWj8+hU2f7ug6TwO9sOTQBfAk4GXgF8FXgvcEGRQWWp2337yzZ2QCfT6kAyNu7oooWp9+uk40Y7+7Q7Y0VAR3HPVtn+7oOkJ69sRDwAvL9m0RWSNgDnSvrTiPhF/T6STgFOAVi6dGk+gc5St/v2Z3382XSLbTXbwR82me1BwFtftLR0CaL2uk5OTVM/pG7RMzik/bu7u3M2SjU2gqTvA4tmavrVZN/FJMUJJ0fEN1pt22tjI3T7wz7b42fdw6p+toNPHLMfZ13860Jme5iN8YlJfvvw4yy/Zh0/vPm+0s3gMNPf3U3E0uuZmRpmmWwXARuBd0bE8lbb9lqyLbusp59plgTynu0hS714d5jntEL9YlAGojmu8rqm0CgGTDeaEjUrG817tocslW1wn5m4iVj2Cv/LS1pA0qkBYG9gZ0nVxHlxRDwm6Q7gyog4uSBcBoUAAA+FSURBVLLPGcDTSTo0PAocQdIL7YcR8as84y+7bt9RddLDKo+48ni/XrxbbVenf1drrgyfkGcA36tbVv19H2AdSZzDNevXkvQeexcwH7gXOBs4q5uBlkG7/8HzKm9r1Ypg2/Q0T6uLsdO4Ok1sM71fJ8cdhLJMNxHLXuHJNiLWkVQst9pmtO73C+ijZl7tSPsfPK8umc1aEQBMT8Ph//Cz7eJMG9dsE1uz95sOGBIdHXcQuru2ah1SZEuKXtYfX8MDoPY/+PgTU0xMTnPh6vUsW3nbDtvmXd5W28NqpOYTNTkd28XZSVxpzrteq/c7/4Z7WXHjvamPO0hlmZ30trPmnGx7QNr/4Bse3dp0xKxudMms9rD6+cdeztDQjh+papzrNo2n6io628TWqmvq1HSwdXL7x+R2jrtu0zjNeuP2YnfXVsMsVv+ua047ip984CWsOe0ozjz2wL4pKsmbnwV6QJrKismpab521d1NR8zqZnnblolJ5gyJJ5rECaQqB5xtJU0nvdLaGTZxor6XQsXWbVPssTDtwHXFSFM8U21JYbPjr6gekKayYtnK2/jRzfc13LbbXTJninN00cJUXUVnW0nTrGvqvDlDTe9O2xk2sSmJz132ny1jKovZFM9YZ5xse0C7/dlnGjnr9S/Yu6vlbe3EmaYcMIt+/I3e7/ixJbztz5ZmPmzi1HT0RLntIJU7l4mLEXpEO/3ZWz12L9hpmHcdvm/Xy9tmijPtqFuzHb+h2ftNTk0zJLV13DQjkvVCG1S3oS1Gqbrr5qWXu+u2ahc6PjHJQcsua/ioO2/OEGtOOyrTIoSZYsmywX+7x0v7vo22r1/W6rrW68Z1zlren5NB0qq7rq9oj2lVWbFw7ghvHlvC+Tfcy1RNWefwkHjTC7Mrq22nciXrSpWZjtdpe9za47Y6RrO2xLV6pQ2q29AWw1e17wTUP61EJMtTaHWHWMZG/VnE1OoYOxRnTAejixZw96ZxRoaGchmmMMunhW4P3zmTfu7q3IyLEfpIFo+H7XRvLdsjaBYxtXuMRkUM3R6msJvdg/NOev3e1XlQRv0aeFnMLzVTk6AyzmGVRUztHqN+9K6ZRvPKoolVN5tp5T0a2SA3OXOyLbFWvXsabfv4E1Nsm+q8M0M7TYI6afua5jzSbF/d7mlzR1INhtNINwZeaed6znSu/dRMq5/OpRODUVjSY9I8atVvOzUNw2K7KWTarfhot0lQu5UraR8Z292+0Xb7LFrAuk3jO3TBbTQYTiPVSqMVq9eztS4ZjO6xgLkj6e9LWl3PIeCTP/oPVt3yYMtz7admWv10Lp3wnW0JpXnUqt92KgIkhkXqwUPavbtrt2NC2kfGdrdvtN26hx5jdNHCGQfDaeX01+zP6B4Ldli+btN4R4+5ra7nxOQ0l9z64Izn2k9DHfbTuXTCybZk0jxqNdt2ajoYGR7iwncfmmrwkHZ7bLUzQEnaR8Z2t2+23dZtScK95EOHtxwMp9Wj6sTkNHdvemyH5VsnO+sZ1rS78IgA7XAH3SjGfpoNt5/OpRNOtiWTprKn1bYjQ2L+TsOpP8Bpu9M2q1xJW2nV7vYzbXfPQ48xp8PKsm5U/jW6nkcf+EzmNZnqvNH79NNQh/10Lmn191dJD0rzqNWNx7K03WmbSRtbu9vPtN3zn7lzx9ckr+sJsOqWB9t+n6z+JmXQT+eSlu9sS6bVo+fRB+zV1rZZPJbNtknQwrkjvPGgvZk7sv2dYrPYWp3LG16wNxse3cr4xOSM5/yMned1fE1aXfssr2enf7demzSylX46l3YNzpn2kNrePUMkZYnbpuCy2zaw6pYHt6u1LronUCPV1gI/uOl+qi3RRoaSbsOtYqs/l8npaf5wjwX84Kb7uegXv32yxv4Tx+y33Xb15zyba3L6a/ZnOoLzr7/3yRYd26ZgOpLzyqrhfRn/btZdhfcgk/QckplxDwEOBK6KiJe1sd8uwD8BryO5Q18JfDAiHppp317pQTY+Mcknf/QfXHLrg9tVplSbWdV2Qy1T98dPXXTLDk3D5o4M8caDnsVn3vDHM+5fPZevXXU3P7r5voZNzM489sAZz7nTa/Kpi27ZoQlYo2uehTL93Wz2yt6D7ACSqcz/s/LTrhXAy0hm2D0JOBj4ccaxFW7VLQ+2XWtdhseyZq0FJian+eHN97VVo79w7gh77jyPH9x0X8vWCTOdcyfXpBp/O9c8C2X5u1n3lSHZ/iQilkTEm4Bb29lB0qHAq4C3R8QPIuJHwInASyQd2cVYc1XGrrEzySrmos69F6+59YbCk21EzDxI6I6OATZExM9rjnMDcHdlXV/oxUbgWcVc1Ln34jW33lB4su3QfsDaBst/XVnXF3qxEXhWMRd17r14za039OonZzfg4QbLNwP75hxLV/VirXVWMRd17r14za38Cm+NUEvS94FFM7VGkHQZsCUiXl+3/LvAaES8uME+pwCnACxduvSF99xzT2Zx56EXa62zirmoc+/Fa27FKntrhE5sBnZtsHxXGt/xEhHnRcRYRIwtXry4q8F1Qy/WWmcVc1Hn3ovX3MqrV5PtWhqXzTYryzUzK1SvJttVwF6SXlJdIGmMpLx2VWFRmZk1UfjzkaQFJJ0aAPYGdpZ0XOX3iyPiMUl3AFdGxMkAEXGtpEuBb0s6FZgGPgtcHRGX53wKZmYzKjzZAs8Avle3rPr7PsA6kjiH67Y5Afg88A1quut2LUozs1koPNlGxDqgcZedp7YZbbDsYeAdlR8zs1Lr1TJbM7Oe4mRrZpaDUnVqyIukjUBv9Wro3CJgU9FBFMzXwNcA8rkGfxgRDRvyD2SyHSSSVjfr0TIofA18DaD4a+BiBDOzHDjZmpnlwMm2/51XdAAl4GvgawAFXwOX2ZqZ5cB3tmZmOXCy7UOSniPpq5J+KWlK0hVFx5QnSW+S9K+S7pe0RdIaSW8pOq48STpO0r9LekjSVkm3SzpN0k5Fx1YUSXtXPg8h6Wl5v3/h3XWtK6ozFl8HDOJ/ro+QzEf3YZJ2la8Gzpe0KCLOKTSy/OwB/Aw4m2SM5xcBZwB7Ae8vLqxCnQ1sARYW8eYus+1DkoaqE2m2O/tFP6kk1U11y84HDo2IfQoKq3CSzgLeB+wWA/YfX9LhwEXAZ0iS7tMjYkueMbgYoQ91OGNx36hPtBU3k4wwN8geYgCfdCQNA+cAn6bAXnROtjYoDgNuKzqIvEkalrSgMtD+B4EvD9pdLfAeYB7wpSKDcJmt9T1JrwCOBd5ZdCwFGAfmVv79beBjBcaSO0l7AMuAEyNim9RyNNeu8p2t9TVJo8D5wEUR8c1CgynGYcDhwEdJvnC+WGw4uTsLuD4iLi46EN/ZWt+StDvJnHT3AicWHE4hIuKmyj+vlrQJ+Jakf4yIO4uMKw+SDiB5mjlCUnU27gWV110kTUXE43nF42Rrfakyt91Kkgqhv4iI8YJDKoNq4t0H6PtkCzwXmANc22DdfcDXgXflFYyTrfUdSSMk89g9F3hxRPyu4JDK4sWV17sLjSI/VwMvr1t2NPBxkrbXd+UZjJNtH2pnxuJiIsvNuSTn/yFgd0mH1Ky7OSImigkrP5IuAS4HbgWmSBLtR4EVg1CEAE82AbyidlmlDB/gqrzb2TrZ9qd2ZizuZ6+svH6hwbpBOH+AG4GTgFFgkuQu7hPAV4oLabC5B5mZWQ7c9MvMLAdOtmZmOXCyNTPLgZOtmVkOnGzNzHLgZGtmlgMnW+s5ktZJWld0HM2UPT4rhpOtWQ4knVSZ++qkomOxYrgHmVn2XlF0AFY+TrZmGRuUsQcsHRcjWCkp8X5Jt1am4r5f0hcl7dJin7dI+pmkzZV9fl2Zvntug21D0hWSFkk6T9IDkiYq7/eOJvG8vTI9+MbK8ddLulTS8XXbbldmW5lKfnnl1+WV967+jEr6n5V//1WT83phZf1P2rt6VkYeG8FKSdIXSObMegD4PrCNZKaBzSQjmT0REaM123+dZKDo+4CfkkzffQjJTAVXAEdFxGTN9gH8EpgPPFHZZh5wHLArcFJEfKtm+8+QDORyN8mA5I8AzwQOBtZGxHE1264DqMZXKad9XSX+i4Bf1JzqP1Xe707guoh4MXUknQf8N+C1EbFyxotn5RQR/vFPqX5IEmQAdwC71yyfRzIQdADrapafVFn2Q2B+3bHOqKz7UN3yqPx8DRiuWb4/yShZt9Vt/xBJIl/QIN5Fdb+vq42vLsaTmpzzysr6P65b/jTg/5LMNjHcaF//9MaPixGsjKqP8WdFxO+rCyNiK8ndZb0PkSTId8aO05wsI0mUb2uw32PARyJiquY9bgOuAZ4v6el1228jGRt2O9F46vS0vlx5PaVu+dtIEu7XauO03uMKMiujgyqvVzZYdxVJYgWeHCj9T4BNwH9vMnvqBPD8Bst/ExGPNli+vvK6K8ldJcB3gQ8At0r6XiW2ayPikdan0rZVJEUU/1XSx+OpAd5PIUnwX8vofawgTrZWRtVKsA31KyJiStJDNYt2AwQsBj6V8n0ebrK8msyHa5Z9mKRc9Z3A/6j8TEq6GPhoRNyR8r23ExHTkr4K/E/geJKKtBeSfPH8OCJ+O5vjW/FcjGBlVL1b3LN+haRhYI8G294cEWr1M5uAImIqIr4QEX9SieuNwI+AvwQuadTioQPfILkLf3fl9+rrVzM4thXMydbKqDoL7EsbrDucmieySOaRuhU4oDJ1eddFxO8i4ocR8Wbg34BnAwfOsFu1vHW42QYRsZGk5cWfSXox8BaSyrafzjpoK5yTrZXRNyuvn6xNoJLmAX/fYPvPkUxZ/g1Ju9avlLSbpIN23K09kuZKeoXqCoQlzQGq8c00iWa16GPpDNtVK8pWkFSMnRcR02nitXJyma2VTkRcI+kckgqpWyTVt7N9oG77b1TKN/8auFPSpSRNpXYnmeDxCJJOBe/pMKT5JDPVrpN0PXAPSTO0o0gq3v41In49wzGuJUnI/73yBVItjz6ntpKtcu6/JKn020ZStGB9wMnWyupDwH8C7yMpu3yIpIz0b0k6I2wnIt4naRVJQj2SpCXB70mS7tnAd2YRyzjwceDlJG2AX0fSSuFO4L20kRAjYrOkN5JU4r0DWFhZ9R2eKneuWk7S2eGiiNihktB6k3uQmZWMpG8CbweOjIj/U3A4lhEnW7MSkbQE+A1wF3BA+D9o33AxglkJSHor8EfACcBc4HQn2v7iO1uzEqiMDHYESe+1z0fEPxUbkWXNydbMLAduZ2tmlgMnWzOzHDjZmpnlwMnWzCwHTrZmZjlwsjUzy8H/A8XdpFq0Q/++AAAAAElFTkSuQmCC\n", "text/plain": [ - "HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,\n", - " approx_min_span_tree=True, cluster_selection_epsilon=0.0,\n", - " cluster_selection_method='eom', core_dist_n_jobs=4,\n", - " gen_min_span_tree=False, leaf_size=40,\n", - " match_reference_implementation=False, memory=Memory(location=None),\n", - " metric='euclidean', min_cluster_size=9, min_samples=1, p=None,\n", - " prediction_data=False)" + "<Figure size 360x360 with 1 Axes>" ] }, - "execution_count": 67, - "metadata": {}, - "output_type": "execute_result" + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" } ], "source": [ - "min_cluster_size = 9\n", - "clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=1)\n", - "clusterer.fit(df[features])" + "Clustering().dpc()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the plot above, each point represents a different peak that could be the core of a specific cluster if selected.\n", + "All points of the dataset are placed in the plot, and in the top right positon of the plot we always have one point that represents the highest density point. \n", + "The other peaks are then placed in the plot according to their local density and distance ('delta' in the graph) from the first peak. \n", + "We choose the values on the x,y-axis, and the algorithm will return the clusters that are given by the peaks that are selected. \n", + "Here, we select the 3 peaks closest to the vertex." ] }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 430, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T19:02:34.026348Z", - "start_time": "2020-12-18T19:02:34.022206Z" + "end_time": "2021-01-03T20:40:20.761084Z", + "start_time": "2021-01-03T20:40:20.634588Z" } }, - "outputs": [], + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVsAAAFLCAYAAAB4GS92AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3de5xcdX3/8ddnL+SGSSCJYGPigloj0FZhsdyVAgoWiyKXqLQi4Re1XqiK+rOCgHlga/kVahHU/IQglUgQQWpKQKjc5bYBVC5BgYSEW0hiAs2SLNndT/84Z8hkcuZyZs6cc2bm/Xw85jG75zafc3b2M2e+V3N3RESkubqyDkBEpBMo2YqIpEDJVkQkBUq2IiIpULIVEUmBkq2ISAp6sg4gC1OnTvW+vr6swxCRNrN06dK17j4tal1HJtu+vj4GBgayDkNE2oyZPV1unYoRRERSkGmyNbOTzcwjHp+qst8kM1tgZuvN7CUzu8LMpqQVt4hIXHkpRvgrYFPR709V2X4R8DbgVGAU+Dbwc+DgpkQnItKgvCTb+919Yy0bmtn+wPuAd7v77eGyZ4F7zexwd7+5iXGKiNSlFctsjwJWFxItgLvfBywP14mI5E5eku2TZjZsZo+b2SerbDsLWBax/LFwnYhI7mRdjPA8cCZwH9ANfAT4vpmNd/cLyuyzE7AhYvl6YPemRCki0qBMk6273wjcWLRoiZmNAc4ws++4+2i5XSOWWZnlwUqzucBcgJkzZ9YZsYgADA4Ns/rlzewycSwTxmR9z5a8ZpxfHq/S1cAJQB/RrRLWA1E9NCYTfccLgLvPB+YD9Pf3a8R0kToMj4wyb/GjXHn/Knq6jOFRZ/a+Mzjz6D3o6c5LqWT9mnl+eb465RLiMqLLZsuV5YpIQuYtfpSrBlYxNDzK4KsjDA2PctXAKuYtfjTr0BLRzPPLY7L9MLAWKNftbQmwq5kdVFhgZv0E5bVLmh+eSASz4NHGBoeGufL+VWzasm3p3qYtoywaWMXg0HBGkSWj2eeXaTGCmf2MoHLstwQVZCeGj88XymvN7AngNnefA+Dud5vZjcDlZnY6Wzs13Kk2tiLNs/rlzfR0GUMR67rNWP3yZnaftmPqcSWl2eeXdZnt48ApwAyCCq5Hgb9z9/8o2qaHIBEXmw1cAFxKcHe+GPh806MV6WC7TBzL8Gh06d6IO7tMHJtyRMlq9vllWozg7v/o7m9z9/HuPs7d9ylJtLh7n7ufXLJsg7t/wt0nu/tEd/+ou69NNXiRDjNhTA+z953BuN5t08a43i5O7J/R8q0Smn1+rX11RCRVZx69BwCLBlbRbcaIOyf0z3hteatr5vmZe+e1gurv73eNZyuJKlSOdcj/k9rZRjOzpe7eH7Wu/a6SiDTdhDE9LV0ZVk0zzi+PTb9ERNqOkq2ISAqUbEVEUqBkKyKSAiVbEZEUKNmKiKRAyVZEJAVKtiIiKVCyFRFJgZKtiEgKlGxFRFKgZCsikgIlWxGRFCjZioikQMlWRCQFSrYiIinIVbI1s+lmttHM3MzKjtxrZn3hNqWPK9OMV0SkVnmbqeE8YCMwocbtTwfuKvpdkz6KSC7lJtma2cHAkcC3CJJuLR5393uaF5WISDJykWzNrBu4EPgmsCHjcEREEpeXMttPAWOBi2Lut8DMRszseTM738zGNSE2EZGGZX5na2ZTgHnASe6+xQpTQlc2RJCYfwm8DLwH+CrwZuCYMq8zF5gLMHPmzIbjFhGJI/NkC5wL3Ovu19e6g7s/D3y2aNGtZrYauNjM3uHuD0XsMx+YD9Df3+8NxiwiEkumxQhmtidwCnCOmU02s8nA+HD1pJjFAleHz3snGaOISBKyvrN9K9AL3B2x7hngEuDUGo/lJc8iIrmRdbK9Ezi0ZNmRBOWv7weeinGs48LnpQnEJSKSqEyTrbuvBW4tXmZmfeGPd7j7xnDZE8Bt7j4n/P1s4HUEHRpeBg4Bvgxc4+6/TSF0EZFYsr6zrVUP0F30+zKC3mOnAuOAlQQdIc5NPzQpZ3BomNUvb2aXiWOZMKZV3moizZG7/wB3vwy4rGRZX8nvVwIaByGnhkdGmbf4Ua68fxU9XcbwqDN73xmcefQe9HTnpWm3SLpyl2yl9c1b/ChXDaxiaHiUoXDZVQOrADjnmL2yC0wkQ7rNkEQNDg1z5f2r2LRldJvlm7aMsmhgFYNDwxlFJpItJVtJ1OqXN9PTFd0LsNuM1S9vTjkikXxQspVE7TJxLMOj0U2dR9zZZeLYlCMSyQclW0nUhDE9zN53BuN6t31rjevt4sT+GWqVIB1L73xJ3JlH7wHAooFVdJsx4s4J/TNeWy7Sicy983q39vf3+8DAQNZhtL2OamdbGK2uA/+fZCszW+ru/VHr2vw/QLI0YUwPu08rO5WcSEdRma2ISAqUbEVEUqBkKyKSAiVbEZEUKNmKiKRAyVZEJAVKtiIiKVCyFRFJgZKtiEgKlGxFRFKQq2RrZtPNbKOZuZlV7OdpZpPMbIGZrTezl8zsCjObklasIo0YHBrmqTUbNZh6B8nb2AjnARuBCTVsuwh4G8Gkj6PAt4GfAwc3LTqRBml+ts6Vm2RrZgcDRwLfIki6lbbdH3gf8G53vz1c9ixwr5kd7u43NztekXpofrbOlYuPUjPrBi4EvgmsrWGXo4DVhUQL4O73AcvDdSK5o/nZOlsuki3wKWAscFGN288ClkUsfyxcJ5I7mp+ts2VejBBWas0DTnL3LWbRb8YSOwEbIpavB3ZPMDyRxGh+ts6Whzvbc4F73f36mPtFvWutzHLMbK6ZDZjZwJo1a+LGKNIwzc/W2TL965rZnsApwCFmNjlcPD58nmRmI+6+KWLX9cC0iOWTib7jxd3nA/MhmBanocBF6qT52TpX1h+lbwV6gbsj1j0DXELQtKvUMqKbeM0iaP4lkks93V2cc8xefOXIWZ0zP5sA2SfbO4FDS5YdCXwVeD/wVJn9lgBnmtlB7n4ngJn1E5TXLmlSrCKJ0fxsnSfTZOvua4Fbi5eZWV/44x3uvjFc9gRwm7vPCfe728xuBC43s9PZ2qnhTrWxFZE8ykMFWS16gO6SZbOB24BLgcuBpcCHUo5LRKQm5h04z31/f78PDAxkHYbk3ODQcO3lqoUmix34/yRbmdlSd++PWpd1ma1I7mj8AmkGJVuREhq/QJpBH9MiRTR+gTSLkq1IEY1fIM2iZCtSROMXSLOozFakSGH8gqsGti1KGNfbxQm1jF9Q20BK0ioSbF2SWLI1s92BmwF39zcndVyRtGn8AmmGJO9se4E+yoy6JdIq6hq/QO1rpYokk+2TwG4JHk8kUxq/QJKUWLJ192Hg6aSOJyLSTtQaQUQkBUq2IiIpqLkYwczKjS1bSq0R2kC5QVhiDc4iIq+J89/SRXRLg0kE09EAPAdsaTQoyU65QVi+dtQs/mnJMg3OIlKnmpOtu/eVW2dmbwH+HZgAvK/xsCQr5QZhuXf5Op5e94oGZxGpUyK3JO7+BHAsMB04K4ljSvoqDcKy7IWNGpxFpAGJff9z983ATcBHkjqmpKvSICzlaHAWkdokXdg2DOya8DElJZUGYSlHg7OI1CaxZGtmUwnmAFuV1DElXYVBWMb1bvu2GNfbxaxdd4xcfmItg7OISKymX9+ocIwZwDEELRO+FuOYxwFfBN5GULn2NPAfwL+4+6tl9ukDlkesWuTus2t9bYlWbhCWQmsEDc4iUp+aJ3w0s9Eqm7wMfMfda64gM7NPAjOBAWAD8C7gbOASd/9smX36CJLt6cBdRavWhhV1VWnCx+rUzlYkvqQmfDy0zPJRYD2wLBwfoWbu/oOSRbeY2UTgM2b2Oa/8SfC4u98T5/WkduUGYdHgLCL1idPO9rZmBlJkHbBDSq8lIpKKXHT9MbNuMxtvZgcBnwe+V+WuFmCBmY2Y2fNmdr6ZjUshVBGRuuSl0G0QGBP+fDnw5QrbDgEXAb8kKCd+D/BV4M0ElXQiIrmTl2lxDgDGE1SQfQP4LvD3URu6+/NAceXZrWa2GrjYzN7h7g+ViW8uMBdg5syZMcMTEWlMksUIhWlx+uLu6O4PuPud7n4+QTHCp80sTsK+Onzeu8JrzHf3fnfvnzZtWtwQRUQakmSyLUyLs3uDx3kgfI4zxY6XPIuI5Eoep8U5MHyO6rhQznHh89IEXl8kktoYSyMyfceY2Q0E5byPACMEifZLBL3Bngy3eQK4zd3nhL+fDbyOoEPDy8AhBBVq17j7b9M+B2l/5cb41Vi+EkfWH8/3AycTlPMOA08RdPf9ftE2PUB30e/LCHqPnQqMA1YC5wHnNj1a6UjlxvgFjeUrtau5uy6AmU0gaCXwPoKxa8dEbJb7aXHUXVdqNTg0zN7zbmJoePve6mN7u1h6xhEqUpDXJNJd18wmA3cCexB8fZ8IvETQ26vQoUDT4khbKYzxOxSxrjCWr7ovSy3iFDidQZBo5wA7hcsuAHYkaCf7AEGLhLcnGaBIliqN8auxfCWOOMn2b4Db3X1BcVdaD9wDvB+YBXw94Rg71uDQME+t2bjNtDNRy9KOoZNUGuNXY/lKHHHeKTOAxUW/j1JUZuvuL5rZEmA2cGYy4XWmqNrvE/rfCBhXDaRTI64a+K3KjfGrsXwljjjJ9hWC5lkFL7H9FDirCSrOpAFRtd8L710JZoyMeio14qqB36qnu4tzjtmLrxw5S+1spW5xblFWEdzdFjwKHGJmxc2yDgJeSCKwTlVuhtsRh5GSssNmzW5baZbdTp5NtzCWrxKt1CNOsr0NeLeZFaZfXUQw0tZ/mdlnzOynwH7A9QnH2FHiznDbjNltK8Wg2XRF6hPnI/pHBM283khwl/t94K+ADwLvDbe5i6DVgtQp7gy3zagRVw28SPJqvrMNR+b6tLuvCn8fdvdjgX2BjwD7A+929w3NCbUzlKv97jboLrnbbFaNuGrgRZLX8H+Nuy9FA8AkKqr2+/h9gtYIP12aTo24auBFkhWru267aJXuulGjTKU98pRGuhKpXVKz60rKomayTXt2W82mK5KMzmqdLiKSESVbEZEUKNmKiKRAyVZEJAUVk62ZPWRm3zCzP08rIBGRdlTtzrYXOBt40MyeMLN/MbMDmh+WiEh7qZhs3X1P4G3APwJrCSZjvMPMnjez75nZEWam5mMiIlVULbN19z+4+7fdfT+CUb8+TzAb7hzgBmCNmf2HmR1rZuPjvLiZHWdmvzazdWa22cweN7MzzGyHKvtNMrMFZrbezF4ysyvMbEqc1xbpJJ0+CHwexLordffngIuAi8xsJ4LZGz4EHAt8DNhkZr8ErgUWu/sfqxxyCnALwey4G4B3ERRb7Ap8tsJ+iwjuuE8lGMT828DPgYPjnI9Iu9Mg8PlRdxGAu68nGAnsR2Y2jmBanA+Fz8cQTPwYNftu8TF+ULLoFjObCHzGzD7nEX2JzWx/gtl93+3ut4fLngXuNbPD3f3mes9JpN1oEPj8SOSjzd03ufvP3P0k4PXAkcAldR5uHcFQjuUcBawuJNrw9e8DlofrRAQNAp83iX+PCIde/KW7/32t+5hZt5mNN7ODCMqEvxd1VxuaBSyLWP5YuE5E0CDweZOXlgSDbC1yuBz4coVtdyIo3y21Htg94bhEWpYGgc+XvJSQH0BQufUlgvLe71bZPuodZGWWByvN5prZgJkNrFmzpu5ARVqFBoHPl1xcbXd/IPzxTjNbS1Dp9q/u/mTE5uuBaRHLJxN9x1t4jfnAfAjGs20wZJGWoEHg8yMXybZEIfHuBkQl22VEN/GaRdD8S0RCmoY9P/JSjFDswPB5eZn1S4Bdw8o0AMysn6C8dkmTYxNpSZqGPXuZXnkzuwG4maBH2ghBov0SsKhQhGBmTwC3ufscAHe/28xuBC43s9PZ2qnhTrWxrU7T3IhkI+v/tvuBk4E+YBh4CvgawTTpBT1Ad8l+s4ELgEsJ7s4XEzQZkzLUk0gkW7EmfDSzNwBnEPTgmk505wN396yTeEWtMuFjks667mGuGti2gfu43i5O6J+hnkQiCak04WPNtzRmNh0YAD7J1naxK4E/EBQBGPAb4I5GA5ZkqSeRSPbifH/8BsEAMUe6+1+Eyxa4+yyCyqkbgXEEg9JIjqgnkUj24iTb9wE3RFVCufszwPEEyfachGJrO1kNc6eeRCLZi1O2uitwVdHvIwTJFQB332hmNxH0AFNlVZGsK6cKPYnKldmqVYJI88X5L3uZbSvE1hNUkhV7iejeXR0tD8PcqSeRSLbiJNunCWZqKPgN8FdmNt7dXzGzLuC9wDNJBtjqCpVTQ8PRlVNfOXJWKneW6kkkkq0432H/GzjUzHrD338E/AnwazM7D7gL2JNgFgUJ5a1ySj2JRLIR5z/uEoKig6nA8+7+YzPbB/gcUJjq/Erg3GRDbG2qnBIRiHFnWzTx4/NFy74AvAHYH3iDu3/U3dWOqIiGuRMRSKC7rruvATRAbAWqnBKRmrvrmtkIcLa7z6uwzdeBc9RdN5oGgRFpb5W668b5j7fwUct2EqFQOSUinSfpFvU7ASqzFREpUfHO1swOKVnUF7EMgiEQZwIfAx5PKLbc0Nd/EWlUtcxxK1snUXTg4+EjihEM5P2lRCLLgay72YpI+6iWbL9JkGSNYNSvW4HbIrYbAdYBt7j7siQDzFIeutmKSHuomGzd/ezCz2b2ceDn7v7vzQ4qD/LSzVZE2kPN2cLdd2tmIHlT6GY7FLGu0M1WLQtEpFYqeCxD3WxFJEll72zN7Fd1HtPd/bBaNjSz44G/BfYBJhG0ZPh/7v6TCvv0ET3N+SJ3nx072jI0BmzzqHWHdKJK7/T31HnM2meQhC8SJM4vAGuB9wMLzWyqu19YZd/TCUYaK1gbK8oaqJttstS6QzpZrNl1E3/xIKmuLVm2ENi/XBlx0Z3tB9x9cT2vG7e7ru7EkqEZfqXdJTK7bjOUJtrQg8Dr046lEo0B2zjN8CudLo/f3Q4AHq1huwVmNmJmz5vZ+WY2rvoukpW8DaIukrZYydbMuszsc2Z2j5m9ZGbDReveaWYXm9mf1huMmR1GMGHkRRU2GwrXzwEOA34AfJpg4HLJKbXukE5X8/diM9sBWEJQcfZH4H+A4oamy4FTCMa2PStuIGFZ7ELgOne/rNx24eDlny1adKuZrQYuNrN3uPtDZY4/F5gLMHPmzLjhSYPUukM6XZw72y8DhwLnALsAPyxe6e4bgNuB98UNwsx2JkjkK4GT4u4PXB0+711uA3ef7+797t4/bZomAM7CmUfvwQn9Mxjb28WEHboZ29vFh945nZP2e1PuymwHh4Z5as3G3MUlrSvO7cTHgLvc/ZsAZhb1nXA58IE4AZjZeGAxwTTpf+3ug3H2D3nJs+RQ8Qy/z23YxIK7VvCzB57huoeey00zMDVPk2aJk2x3A/6ryjZ/BHau9YBm1gP8FHgrcKC7vxgjnmLHhc9L69xfUjRhTA8/vudprn3wmdwN8qPBh6RZ4nxUbwImV9lmJrAhxjEvJujIMA/Y2cz2K3qMATCzJ8zsksIOZna2mf2rmR1rZoeb2TeBC4Br3P23MV5bMpLXZmB5jUvaQ5w724eA95rZDu7+aulKM5tEUF776xjHfG/4/J2IdbsBK8IYu4uWLyPoPXYqMI6gnPc8NIV6y8jrID95jUvaQ5xk+/+BK4ArzGxO8QozmwwsIJgW5/u1HtDd++Ju4+5XomZeLS2vzcDyGpe0h5qLEcLBYRYAHyZo3vVpADMbAJ4naB97sbtf34Q4W4pqsisrNAMb17vt229cbxcnxmgGlvR1TioukSix3j3uPsfM7gBOA/6cYAaHvYFHgPPdfUHyIbYO1WTXrpFBfpp5nTX4kDRL3QPRhN1jdwJeqrO5VmbiDkRTKw20El89g/ykcZ01+JDUoykD0bj7Jnd/rtUSbbOoJrs+cQf5Ses6a/AhSZq+2yZEA62kQ9dZWlWlmRqeqvOY7u5vrnPflqWa7HToOkurqnRn20VQAVb8GAP0hY83ErRzfWPRsjFVjtm2WqUmO08tJeqJJc/XOU/XVvKn7DuztH2rmU0EbgaeBr4G3OHuo2bWBRwC/BNBoj28adHmXB5rsgsVPVMm7MD5N/0+Fy0lGm1NkLfrrFYoUouaWyOY2YUEPcT2KtODbCzwO2CJu38+0SgT1qzWCAV5qMkuTQCbt4wCzkjRnzurlhJJtSbIw3UGtUKRrZJqjfAhgrFmt0u0AO6+GbgOODZ+iO2luCY7qa+WcY9TPKDK4KsjjPi2iRbSbykxODTMI8++xE/uW5lIa4I8tBhQKxSpVZx36RSgt8o2veF2HS+pr5b1HKeQAIaGRyPXF0ujz3/xOXSZ8Wpp1k8xlqRpPAWpVZw72yeB48IBZ7ZjZjsRDHVYbyuGtlJ6Zzk0PMpVA6uYt7iW6dUaO06l5lGl0qjBLz6HTVtGMo0laWodIbWKk2y/D/wJcJ+Z/Z2Z9ZnZuPD548C9wK5Unj+sIyT11bLe41RKAMXSqMEvdw5ZxNIMeW4dIfkSZyCa7wIXEgz0vYDgTndj+Hwp8Bbgu+5+cRPibClJNbyv9zjlEkB3l9FtvDYlTRo1+NXussf1dqUWS7NETffTyucjzRF3IJrTzOxKgokd3wlMAl4CHgAuc/c4Y9m2raS+WjZynKjmUSf2z+CLR/wp6wZfTa0Gv9I5jOnp4upPHUDf1AktfQdYPN1PHlpHSD7Ffke4+93A3U2IpW0kNZNsI8eplAAmjd+hzjOLr9o57Dk9sgqgJRVaR4hE0cdvkyTV8L7R4+QhAeStE4JIFuoeYrGVNbtTQ7GkGt7npQF/I9rhHEQqqdSpQe/4JkvqzjIPd6iNaodzEKlXph23zex4M/tPM3vWzDaa2VIz+0gN+00yswVmtt7MXjKzK8xMnSlEJLeyHiXjiwTNx74A/A1wC7DQzD5XZb9FwHsIZtg9GdgX+HnTohQRaVDWxQgfcPe1Rb//ysz+hCAJXxi1g5ntTzAgzrvd/fZw2bPAvWZ2uLvf3OygRUTiyvTOtiTRFjwIvL7CbkcBqwuJNjzOfcDycJ2ISO5kXYwQ5QCg0gACs4BlEcsfC9eJiOROrpKtmR0GHEPl8RV2AjZELF8frit37LlmNmBmA2vWrGksUBGRmHKTbM2sD1hIMGbuZVU2j2ocbGWWBzu4z3f3fnfvnzZtWr1hiojUJRfJ1sx2BpYAK4GTqmy+HpgcsXwy0Xe8IiKZyzzZmtl4YDGwA/DX7j5YZZdlRJfNlivLFRHJXNadGnqAnxIM23iUu79Yw25LgF3N7KCi4/QDu4frRERyJ+t2thcD7wdOA3Y2s/2K1j3o7kNm9gRwm7vPgWDUMTO7EbjczE4HRoFvA3eqja2I5FXWyfa94fN3ItbtBqwgiLG7ZN1s4AKCQcu7CIohcj2jr4h0tkyTrbv31bONu28APhE+JEKWI2xpdC+R7ek/oc0kNatv0q89NDzKirVB3Werz8xQiT5opBy9G9pM8Uy2hem1rxpYBcA5x+yV+msvGljFPU+t5Q8vDlKYHafbjI/+5UzO+kDzPwDSkuWHnLQGvQvaSFKz+ib52pu3jPL46q2JFoL50xbetzL2tO55ltTU9dK+lGzbSFKz+ib92lFGRp0r72/uB0BasvyQk9ahZNtGkprVN+nXLqeryR8AacnyQ05ah5JtCxscGuapNRtfu3MqzGQ7rnfbP+u43i5OjDGrb+lxa1HutSsZbfIHQEE95xNHlh9y0jpUQdaCKlXGNDKTbaOVPFGv3TdlPH9YvZGRklzU3WXM3rf2D4B6pFVpldTU9dLeNLtuCzrruofL/mMXWhzU0wSpluPWovi1x/R0cc4vHmHhvStfS7jdXcZH39X81ghJnU8tCom9+IPmxH61Rug0lWbXVbJtMYNDw+w97yaGhke3Wze2t4ulZxxR151Us45bfPw029k2+3wqva7a2XauSslWH7ktplmVMc2u5Jkwpoc9p09iz+mTUklCWVVaFaZrV6KVUkq2LaZZlTHtVsnTbucjrU/JtsUk1eIgreNmpd3OR1qf3nEtqJEWB1kcNyvtdj7S2lRB1sKaVRnTbpU87XY+kl+qIGtTzaqMyVslT6OdEqLOp55jJtU5otmdLCSf8vHfJBKhGZ0S6jlmUnFoZLDOpmQrudWM4SLrOWZScWQ5/KVkTx+nLaDZXzvz+LW20khaP7l/JS/W0U62ntG5GhnRq/i65nFksDz+3dtZ5ne2ZvYW4MvAfsBewB3u/p4q+/QByyNWLXL32QmHmJlmf+3M69fawaFhBlb8ke4yIza+Ouwc+O1f8dF3zYwVa6Gjw1DEukJHh92n7djwPlHX9Yi3v54uq9zJovQ4zZLXv3u7yzzZAnsSzLB7D7BDzH1PB+4q+n1tUkHlQbO/dubta21xEujuMl7Zsn1X24ItI86P73maUXfmffDPajp+PR0d6tkn6rou/t0LZeNKu5NF3v7unSIPH2O/cPcZ7n488EjMfR9393uKHk80I8AsNPtrZx6/1hYngVdeHam6/YjDwntX1hxrPR0d4u5T7rqWk3Ynizz+3TtF5snW3Wt7V3aYZvftz9uA13GTVMGI89oAN7U48+g9OKF/BmN7u5iwQzdjw1HAKnV0iLNPnBkrxvRUf+2k5e3v3knyUIzQiAVmtjPwIvAT4OvuvinjmBLR7L79eRs7oFLZaJJ6urs455i9+MqRs2ru6BBnn1pnrBjX28XVnzqAPadPqus86pW3v3snyfzOtk5DwEXAHOAw4AfAp4ErswwqSc3u25+3sQPqmVYHgrFx+6ZOiL1fPR03atmn1hkrHOqKu1F5+7t3kpa8su7+PPDZokW3mtlq4GIze4e7P1S6j5nNBeYCzJw5M51AG9Tsvv1JH7+RbrGVZjt4U5nZHgz46Ltm5i5BFF/X4ZFRSofUzXoGh7h/d3V3TkauxkYws6uBqdWafpXZdxpBccIcd7+00ratNjZCs9/sjR4/6R5WpbMdfO2oWZx7/WOZzPbQiMGhYZ7bsIkFd63gmgefyd0MDtX+7moiFl/LzNTQYLKdCqwBTnH3BZW2bbVkm3dJTz9TLgmkPdtDklrx7jDNadrYIO4AAA/DSURBVIXaRacMRHNc+Lw00yg6TDOaEpUrG017tock5W1wn2rURCx5mf/lzWw8QacGgOnARDMrJM7r3f0VM3sCuM3d54T7nA28jqBDw8vAIQS90K5x99+mGX/eNfuOqp4eVmnElcbrteLdaq3q/btKeXl4h7we+GnJssLvuwErCOLsLlq/jKD32KnAOGAlcB5wbjMDzYNa/8HTKm+r1Ipgy+goO5bEWG9c9Sa2aq9Xz3E7oSxTTcSSl3mydfcVBBXLlbbpK/n9StqomVct4v6Dp9Uls1wrAoDRUTj4X27ZJs64cTWa2Mq93qhDl1HXcTuhu2ul1iFZtqRoZe3xMdwBiv/BB18dYWh4lKsGVjFv8aPbbZt2eVtxD6ueonfU8KhvE2c9ccU571KVXm/hfStZdP/K2MftpLLMenrbSXlKti0g7j/46pc3lx0xqxldMgs9rG7/8qF0dW3/lirEuWLtYKyuoo0mtkpdU0dGnc3D235NruW4K9YOUq43bit2d600zGLh77r0jCP4xecOYukZR3DOMXu1TVFJ2vRdoAXEqawYHhnlh3csLztiVjPL2zYODdPbZbxaJk4gVjlgo5U09fRKq2XYxKHSXgqhzVtGmDIh7sB12YhTPFNoSSGN0UdUC4hTWTFv8aNc++Azkds2u0tmtTj7pk6I1VW00Uqacl1Tx/Z2lb07rWXYxLLMOP+m31eMKS8aKZ6R+ijZtoBa+7NXGznrQ++c3tTytlrijFMOmEQ//qjXO7F/Bh/7y5mJD5s4MuotUW7bSeXOeaJihBZRS3/2Sl+7x+/QzakH79708rZqccYddavR8RvKvd7wyChdZjUdN86IZK3QBlVtaLORq+66aWnl7rqV2oUODg2z97ybIr/qju3tYukZRyRahFAtliQb/Nd6vLivG7V96bJK17VUM65z0tJ+n3SSSt11dUVbTKXKigljejihfwYL71vJSFFZZ3eXcfw+yZXV1lK5knSlSrXj1dset/i4lY5Rri1xsVZpg6o2tNnQVW07DqXfVtyD5TFUukPMY6P+JGKqdIztijNGnb6p41m+dpCerq5UhilM8ttCs4fvrKaduzqXo2KENpLE18Naurfm7StoEjHVeoyoIoZmD1PYzO7BaSe9du/q3CmjfnW8JOaXqtYkKI9zWCURU63HKB29q9poXkk0sWpmM620RyPr5CZnSrY5Vql3T9S2m14dYctI/Z0ZamkSVE/b1zjnEWf7wnY7jumJNRhOlGYMvFLL9ax2ru3UTKudzqUenVFY0mLifNUq3XZkFLqNbaaQqbXio9YmQbVWrsT9yljr9lHb7TZ1PCvWDm7XBTdqMJwohUqjRQOr2FySDPqmjGdMT/z7kkrXswv4+rW/Y8nDL1Q813ZqptVO51IP3dnmUJyvWqXbjriDGd1G7MFDar27q7VjQtyvjLVuH7XdinWv0Dd1QtXBcCo58+g96JsyfrvlK9YO1vU1t9L1HBoe5YZHXqh6ru001GE7nUs9lGxzJs5XrXLbjow6Pd1dXPXJ/WMNHlJrj61aBiiJ+5Wx1u3Lbbd5S5Bwbzjt4IqD4VT6qjo0PMryta9st3zzcH09w8p2F+4xwLa7g46KsZ1mw22nc6mHkm3OxKnsqbRtT5cxbofu2G/guN1py1WuxK20qnX7ats9ve4VeuusLGtG5V/U9TxyrzcwtsxU51Gv005DHbbTucTV3h8lLSjOV61mfC2L2522nLix1bp9te3e/oaJdV+TtK4nwJKHX6j5dZL6m+RBO51LXLqzzZlKXz2P3HPXmrZN4mtZo02CJozp4cN7T2dMz7Z3iuViq3Qux75zOqtf3szg0HDVc379xLF1X5NK1z7J61nv363VJo2spJ3OpVadc6YtpLh3TxdBWeKWEbjp0dUsefiFbWqts+4JFKXQWuBnDzxLoSVaT1fQbbhSbKXnMjw6ypumjOdnDzzLdQ8991qN/deOmrXNdqXn3Mg1OfPoPRh1Z+G9K19r0bFlBEY9OK+kGt7n8e8mzZV5DzIzewvBzLj7AXsBd7j7e2rYbxLwb8AHCe7QFwOfd/d11fZtlR5kg0PDfP3a33HDIy9sU5lSaGZV3A01T90fz7ru4e2aho3p6eLDe7+Rbx37Z1X3L5zLD+9YzrUPPhPZxOycY/aqes71XpOzrnt4uyZgUdc8CXn6u0nj8t6DbE+Cqcx/Hz5qtQh4D8EMuycD+wI/Tzi2zC15+IWaa63z8LWsXGuBoeFRrnnwmZpq9CeM6WGXiWP52QPPVGydUO2c67kmhfhrueZJyMvfTZovD8n2F+4+w92PBx6pZQcz2x94H/Bxd/+Zu18LnAQcZGaHNzHWVOWxa2w1ScWc1bm34jWX1pB5snX36oOEbu8oYLW73150nPuA5eG6ttCKjcCTijmrc2/Fay6tIfNkW6dZwLKI5Y+F69pCKzYCTyrmrM69Fa+5tIZWfefsBGyIWL4e2D3lWJqqFWutk4o5q3NvxWsu+Zd5a4RiZnY1MLVaawQzuwnY6O4fKll+BdDn7gdG7DMXmAswc+bMfZ5++unE4k5DK9ZaJxVzVufeitdcspX31gj1WA9Mjlg+meg7Xtx9vrv3u3v/tGnTmhpcM7RirXVSMWd17q14zSW/WjXZLiO6bLZcWa6ISKZaNdkuAXY1s4MKC8ysn6C8dklmUYmIlJH59yMzG0/QqQFgOjDRzI4Lf7/e3V8xsyeA29x9DoC7321mNwKXm9npwCjwbeBOd7855VMQEakq82QLvB74acmywu+7ASsI4uwu2WY2cAFwKUXddZsWpYhIAzJPtu6+AojusrN1m76IZRuAT4QPEZFca9UyWxGRlqJkKyKSglx1akiLma0BWqtXQ/2mAmuzDiJjuga6BpDONXiTu0c25O/IZNtJzGygXI+WTqFroGsA2V8DFSOIiKRAyVZEJAVKtu1vftYB5ICuga4BZHwNVGYrIpIC3dmKiKRAybYNmdlbzOwHZvYbMxsxs1uzjilNZna8mf2nmT1rZhvNbKmZfSTruNJkZseZ2a/NbJ2ZbTazx83sDDPbIevYsmJm08P3g5vZjmm/fubddaUpCjMW3wN04j/XFwnmo/sCQbvK9wMLzWyqu1+YaWTpmQLcApxHMMbzu4CzgV2Bz2YXVqbOAzYCE7J4cZXZtiEz6ypMpFnr7BftJEyqa0uWLQT2d/fdMgorc2Z2LvAZYCfvsH98MzsYuA74FkHSfZ27b0wzBhUjtKE6ZyxuG6WJNvQgwQhznWwdHfhNx8y6gQuBb5JhLzolW+kUBwCPZh1E2sys28zGhwPtfx74Xqfd1QKfAsYCF2UZhMpspe2Z2WHAMcApWceSgUFgTPjz5cCXM4wldWY2BZgHnOTuW8wqjubaVLqzlbZmZn3AQuA6d78s02CycQBwMPAlgg+c72YbTurOBe519+uzDkR3ttK2zGxngjnpVgInZRxOJtz9gfDHO81sLfAjM/tXd38yy7jSYGZ7EnybOcTMCrNxjw+fJ5nZiLtvSiseJVtpS+HcdosJKoT+2t0HMw4pDwqJdzeg7ZMt8FagF7g7Yt0zwCXAqWkFo2QrbcfMegjmsXsrcKC7v5hxSHlxYPi8PNMo0nMncGjJsiOBrxK0vX4qzWCUbNtQLTMWZxNZai4mOP/TgJ3NbL+idQ+6+1A2YaXHzG4AbgYeAUYIEu2XgEWdUIQArzUBvLV4WViGD3BH2u1slWzbUy0zFrez94bP34lY1wnnD3A/cDLQBwwT3MV9Dfh+diF1NvUgExFJgZp+iYikQMlWRCQFSrYiIilQshURSYGSrYhICpRsRURSoGQrLcfMVpjZiqzjKCfv8Uk2lGxFUmBmJ4dzX52cdSySDfUgE0neYVkHIPmjZCuSsE4Ze0DiUTGC5JIFPmtmj4RTcT9rZt81s0kV9vmImd1iZuvDfR4Lp+8eE7Gtm9mtZjbVzOab2fNmNhS+3ifKxPPxcHrwNeHxV5nZjWZ2Ysm225TZhlPJLwh/XRC+duHRZ2b/HP78d2XOa59w/S9qu3qSRxobQXLJzL5DMGfW88DVwBaCmQbWE4xk9qq79xVtfwnBQNHPAL8kmL57P4KZCm4FjnD34aLtHfgNMA54NdxmLHAcMBk42d1/VLT9twgGcllOMCD5S8AbgH2BZe5+XNG2KwAK8YXltB8M478OeKjoVP8tfL0ngXvc/UBKmNl84P8AH3D3xVUvnuSTu+uhR64eBAnSgSeAnYuWjyUYCNqBFUXLTw6XXQOMKznW2eG600qWe/j4IdBdtHwPglGyHi3Zfh1BIh8fEe/Ukt9XFMdXEuPJZc55cbj+z0qW7wj8D8FsE91R++rRGg8VI0geFb7Gn+vufywsdPfNBHeXpU4jSJCn+PbTnMwjSJQfi9jvFeCL7j5S9BqPAncBbzez15Vsv4VgbNhtePTU6XF9L3yeW7L8YwQJ94fFcUrrUQWZ5NHe4fNtEevuIEiswGsDpf8FsBb4hzKzpw4Bb49Y/gd3fzli+arweTLBXSXAFcDngEfM7KdhbHe7+0uVT6VmSwiKKP7WzL7qWwd4n0uQ4H+Y0OtIRpRsJY8KlWCrS1e4+4iZrStatBNgwDTgrJivs6HM8kIy7y5a9gWCctVTgP8bPobN7HrgS+7+RMzX3oa7j5rZD4B/Bk4kqEjbh+CD5+fu/lwjx5fsqRhB8qhwt7hL6Qoz6wamRGz7oLtbpUcjAbn7iLt/x93/Iozrw8C1wN8AN0S1eKjDpQR34Z8Mfy88/yCBY0vGlGwljwqzwL47Yt3BFH0j82AeqUeAPcOpy5vO3V9092vc/QTgV8Cbgb2q7FYob+0ut4G7ryFoefGXZnYg8BGCyrZfNhy0ZE7JVvLosvD568UJ1MzGAv8Usf35BFOWX2pmk0tXmtlOZrb39rvVxszGmNlhVlIgbGa9QCG+apNoFoo+ZlbZrlBRtoigYmy+u4/GiVfySWW2kjvufpeZXUhQIfWwmZW2s32+ZPtLw/LNvweeNLMbCZpK7UwwweMhBJ0KPlVnSOMIZqpdYWb3Ak8TNEM7gqDi7T/d/bEqx7ibICH/Q/gBUiiPvrC4ki08998QVPptIShakDagZCt5dRrwe+AzBGWX6wjKSP+RoDPCNtz9M2a2hCChHk7QkuCPBEn3PODHDcQyCHwVOJSgDfAHCVopPAl8mhoSoruvN7MPE1TifQKYEK76MVvLnQsWEHR2uM7dt6sklNakHmQiOWNmlwEfBw539//OOBxJiJKtSI6Y2QzgD8BTwJ6uf9C2oWIEkRwws48CfwrMBsYAZyrRthfd2YrkQDgy2CEEvdcucPd/yzYiSZqSrYhICtTOVkQkBUq2IiIpULIVEUmBkq2ISAqUbEVEUqBkKyKSgv8FvCSzln4825oAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 360x360 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], "source": [ - "cluster_labels=clusterer.labels_\n", - "df['labels']=cluster_labels\n" + "Clustering().dpc(2.4,3.8)" ] }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 431, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T19:02:34.365919Z", - "start_time": "2020-12-18T19:02:34.356374Z" + "end_time": "2021-01-03T20:40:29.271435Z", + "start_time": "2021-01-03T20:40:29.186173Z" }, "scrolled": false }, @@ -1586,12 +1286,12 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "1ac9898689124574aef1ccc5775edf6f", + "model_id": "7539aab6e8914dd399f92e6478218b64", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=ButtonStyle()),…" + "VBox(children=(HBox(children=(Button(description='PCA', style=ButtonStyle()), Button(description='MDS', style=…" ] }, "metadata": {}, @@ -1599,16 +1299,16 @@ } ], "source": [ - "display(box)" + "show_embedding()" ] }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 432, "metadata": { "ExecuteTime": { - "end_time": "2020-12-18T19:02:34.791129Z", - "start_time": "2020-12-18T19:02:34.759139Z" + "end_time": "2021-01-03T20:40:54.079754Z", + "start_time": "2021-01-03T20:40:54.041534Z" } }, "outputs": [ @@ -1641,27 +1341,34 @@ " <tbody>\n", " <tr>\n", " <th>0</th>\n", - " <td>78</td>\n", - " <td>21</td>\n", - " <td>41</td>\n", + " <td>0</td>\n", + " <td>100</td>\n", + " <td>26</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", - " <td>0</td>\n", " <td>100</td>\n", - " <td>26</td>\n", + " <td>0</td>\n", + " <td>32</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>37</td>\n", + " <td>62</td>\n", + " <td>24</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ - " RS ZB Materials in cluster\n", - "0 78 21 41\n", - "1 0 100 26" + " RS ZB Materials in cluster\n", + "0 0 100 26\n", + "1 100 0 32\n", + "2 37 62 24" ] }, - "execution_count": 70, + "execution_count": 432, "metadata": {}, "output_type": "execute_result" } @@ -1674,47 +1381,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that we found two pure clusters containing materials with the same stable structure and a mixed cluster containing both stable structure. It is interesting to visualzie this result with MDS, where we can see that the mixed cluster is placed in between the pure clusters as a transition zone." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Perovskite\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As last step in this tutorial, we invite you to use the tools that we have introduced on a new dataset composed of perovskite materials, where an interesting property to be analysed is the clustering of metallic vs non metallic materials." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "df = pd.read_table(\"data/exploratory_analysis/perovskites/cubic_perosvkistes_data.asc\", sep=' ')\n", - "df = df.dropna(axis=1)\n", - "df['type'] = df.apply(lambda x : 'metallic' if x['band_gap'] < 0.2 else 'non metallic' , axis=1)\n", + "We have found two clusters containing only materials with the same most stable structure and a mixed cluster containing both most stable structures. \n", + "It is interesting to visualzie this result with MDS, where we can see that the mixed cluster is placed in between the pure clusters as a transition zone.\n", "\n", - "features = df.drop(['material','type', 'lattice_constant', 'bul_modulus','band_gap'],axis=1).columns.tolist()\n", - "# features = df.drop(['material','type'],axis=1).columns.tolist()\n", - "df[features]=preprocessing.scale(df[features])\n", - "hover_features = ['type','material','lattice_constant', 'bul_modulus','band_gap']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Enjoy the exploration of this new data set!" + "Results of this clustering suggest that the features we have used are sufficient for classifying materials according to their most stable structure.\n", + "Even though the RS and ZB clusters are not clearly separated as a mixed cluster transitioning between the two clusters is found, a supervised machine learning model might be able to learn classification of the 82 octet binary materials.\n", + "We might also expect that such model faces challenges especially when classifing materials in the transition area.\n", + "A supervised learning algorithm, namely SISSO, has been used for such classification, and we resort to other tutorials in the AI toolkit to study this application.\n", + "\n", + "In this tutorial, we have seen an exemplary application of unsupervised learning that has been deployed for explorying the structure of a multi-dimensional dataset.\n", + "We have performed a clustering analysis, that led us finding clusters representative of different external labels, i.e. the most stable structures.\n", + "Such clustering gave us a clear evidence that the set of features used for clustering should be enough for determining the value of the external labels.\n", + "A subsequent step of such analysis would be the deployment of a supervised learing algorithm to find an interpretable relationship between the input features and the labels. " ] } ],