From 480ccc6ba841af377fea7a480a6cd889ede1f5ef Mon Sep 17 00:00:00 2001
From: Luigi Sbailo <luigi.sbailo@gmail.com>
Date: Tue, 5 Jan 2021 15:29:03 +0100
Subject: [PATCH] Minor fixes

---
 exploratory_analysis.ipynb | 149 +++++++++++++++++++------------------
 1 file changed, 77 insertions(+), 72 deletions(-)

diff --git a/exploratory_analysis.ipynb b/exploratory_analysis.ipynb
index 9598019..b5341e8 100644
--- a/exploratory_analysis.ipynb
+++ b/exploratory_analysis.ipynb
@@ -19,10 +19,11 @@
     " \n",
     "<p>\n",
     " created by:\n",
-    " Luigi Sbailo<sup>1</sup> \n",
+    " Luigi Sbailò<sup>1</sup> \n",
     " and Luca Ghiringhelli<sup>1</sup> <br><br>\n",
     " \n",
     "<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>\n",
+    "<span class=\"nomad--last-updated\" data-version=\"v1.0.0\">[Last updated: Jan 5, 2021]</span>\n",
     "\n",
     " \n",
     "<div> \n",
@@ -36,7 +37,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this tutorial, we use unsupervised learning for an exploratory analysis of materials science data. More specifically, we analyze 82 octet binary materials known to crystallize in zinc blende (ZB) and rocksalt (RS) structures. Our aim is to show how to visualize a multidimensional dataset and gain an understanding of its relevant inner structures. As a first step in our data analysis, we would like to detect whether data points can be classified into different clusters, where each cluster is aimed to group together objects that share similar features. With an explorative analysis we would like to visualize the structure and spatial arrangement of the clusters, but when the feature space is highly multidimensional such visualization is directly not possible. Hence, we project the feature space onto a two-dimensional manifold which, instead, can be visualized. To avoid losing relevant information, embedding into a lower dimensional manifold must be performed while preserving the most informative features in the original space. Below we introduce into different clustering and embedding methods, which can be combined to obtain different visualizations of our dataset."
+    "In this tutorial, we use unsupervised learning for an exploratory analysis of materials science data. More specifically, we analyze 82 octet binary materials known to crystallize in zinc blende (ZB) and rocksalt (RS) structures. Our aim is to show how to visualize a multidimensional dataset and gain an understanding of its relevant inner structures. As a first step in our data analysis, we would like to detect whether data points can be classified into different clusters, where each cluster is meant to group together objects that share similar features. With an exploratory analysis we would like to visualize the structure and spatial arrangement of the clusters, but when the feature space is highly multidimensional such a visualization is not directly possible. Hence, we project the feature space onto a two-dimensional manifold which, instead, can be visualized. To avoid losing relevant information, embedding into a lower-dimensional manifold must be performed while preserving the most informative features in the original space. Below we introduce different clustering and embedding methods, which can be combined to obtain different visualizations of our dataset."
    ]
   },
   {
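The cluster-then-embed workflow that the revised paragraph above describes can be sketched in a few lines of Python. The toy example below uses synthetic Gaussian blobs as a stand-in for the 82 octet binaries; all data, names, and parameter values are illustrative and are not the notebook's actual pipeline.

```python
# Minimal sketch: cluster in the full feature space, then embed in 2D to
# visualize the result. Synthetic data stands in for the real dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 10)),   # toy group A
               rng.normal(3.0, 1.0, size=(42, 10))])  # toy group B

X_std = StandardScaler().fit_transform(X)                    # common feature scale
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_std)  # step 1: clustering
X_2d = PCA(n_components=2).fit_transform(X_std)              # step 2: 2D embedding

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
```

Standardizing before both steps matters here: k-means and PCA are distance- and variance-based, so features on larger scales would otherwise dominate the result.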
@@ -95,8 +96,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.755442Z",
-     "start_time": "2021-01-04T16:28:07.095624Z"
+     "end_time": "2021-01-05T14:25:23.999902Z",
+     "start_time": "2021-01-05T14:25:23.354783Z"
     }
    },
    "outputs": [],
@@ -138,8 +139,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.870608Z",
-     "start_time": "2021-01-04T16:28:07.760577Z"
+     "end_time": "2021-01-05T14:25:24.088808Z",
+     "start_time": "2021-01-05T14:25:24.001958Z"
     },
     "scrolled": true
    },
@@ -198,8 +199,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.876974Z",
-     "start_time": "2021-01-04T16:28:07.872057Z"
+     "end_time": "2021-01-05T14:25:24.095166Z",
+     "start_time": "2021-01-05T14:25:24.090901Z"
     }
    },
    "outputs": [],
@@ -219,8 +220,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.896229Z",
-     "start_time": "2021-01-04T16:28:07.878762Z"
+     "end_time": "2021-01-05T14:25:24.107786Z",
+     "start_time": "2021-01-05T14:25:24.096821Z"
     }
    },
    "outputs": [],
@@ -239,6 +240,7 @@
     "        if self.df_flag: \n",
     "            return \n",
     "        cluster_labels = KMeans (n_clusters=n_clusters, max_iter=max_iter).fit_predict(df[features])\n",
+    "        print(max(cluster_labels)+1,' clusters were extracted.') \n",
     "        df['clustering'] = 'k-means'\n",
     "        df['cluster_label']=cluster_labels\n",
     "\n",
@@ -248,6 +250,7 @@
     "        linkage_criterion = 'ward'\n",
     "        Z = linkage(df[features], linkage_criterion )\n",
     "        cluster_labels = cut_tree(Z, height=distance_threshold)\n",
+    "        print(int(max(cluster_labels))+1,' clusters were extracted.') \n",
     "        df['clustering'] = 'Hierarchical - ' + linkage_criterion + ' criterion' \n",
     "        df['cluster_label']=cluster_labels\n",
     "\n",
@@ -255,6 +258,7 @@
     "        if self.df_flag: \n",
     "            return \n",
     "        cluster_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(df[features])\n",
+    "        print(max(cluster_labels)+1,' clusters were extracted.') \n",
     "        df['clustering'] = 'DBSCAN'\n",
     "        df['cluster_label']=cluster_labels\n",
     "    \n",
@@ -262,6 +266,7 @@
     "        clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)\n",
     "        clusterer.fit(df[features])\n",
     "        cluster_labels=clusterer.labels_\n",
+    "        print(max(cluster_labels)+1,' clusters were extracted.') \n",
     "        df['clustering']= 'HDBSCAN'\n",
     "        df['cluster_label']=cluster_labels\n",
     "\n",
@@ -273,6 +278,7 @@
     "            clu.autoplot = True\n",
     "            clu.assign(density,delta)\n",
     "            cluster_labels = clu.membership\n",
+    "            print(max(cluster_labels)+1,' clusters were extracted.') \n",
     "            df['clustering'] = 'DPC'\n",
     "            df['cluster_label']=cluster_labels\n",
     "        else: \n",
@@ -284,7 +290,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The embedding algorithms are handled with a graphical interface that is generated using Jupyter Widgets, that allows to create plots using the desired embedding algorithm. Before plotting data with any of the embedding algorithms, a dataframe 'df' must have been defined, and cluster labels must have been assigned to each data point."
+    "The embedding algorithms are handled with a graphical interface that is generated using Jupyter Widgets. This makes it possible to create plots with the desired embedding algorithm by clicking a button. Before plotting data with any of the embedding algorithms, a dataframe 'df' must have been defined, and cluster labels must have been assigned to each data point."
    ]
   },
   {
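For readers who want the gist of such a widget-driven interface without the full visualizer code, a minimal, self-contained sketch follows; the button label, dataframe, and feature names are invented for illustration and do not reproduce the notebook's actual implementation.

```python
# Sketch of a button that (re)draws a PCA embedding colored by cluster label.
# Assumes a Jupyter environment with ipywidgets enabled.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
features = ['f0', 'f1', 'f2', 'f3']                         # hypothetical features
demo_df = pd.DataFrame(rng.normal(size=(50, 4)), columns=features)
demo_df['cluster_label'] = (demo_df['f0'] > 0).astype(int)  # stand-in labels

button = widgets.Button(description='Plot PCA embedding')
output = widgets.Output()

def on_click(_):
    # Re-run the embedding and redraw the scatter plot on every click.
    with output:
        output.clear_output()
        coords = PCA(n_components=2).fit_transform(demo_df[features])
        plt.scatter(coords[:, 0], coords[:, 1], c=demo_df['cluster_label'])
        plt.show()

button.on_click(on_click)
display(button, output)
```

As in the notebook, the plot is only meaningful once the dataframe exists and every data point carries a cluster label; clicking the button then re-renders the embedding inside the Output widget.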
@@ -292,8 +298,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.911638Z",
-     "start_time": "2021-01-04T16:28:07.898218Z"
+     "end_time": "2021-01-05T14:25:24.121703Z",
+     "start_time": "2021-01-05T14:25:24.109367Z"
     }
    },
    "outputs": [],
@@ -363,7 +369,7 @@
     "    \n",
     "    for cl in np.unique(df['cluster_label'].to_numpy()):\n",
     "        if cl == -1:\n",
-    "            name = 'Noise'\n",
+    "            name = 'Outliers'\n",
     "        else:\n",
     "            name = 'Cluster ' + str(cl)\n",
     "        fig.add_trace(go.Scatter(\n",
@@ -393,8 +399,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.926241Z",
-     "start_time": "2021-01-04T16:28:07.913288Z"
+     "end_time": "2021-01-05T14:25:24.141593Z",
+     "start_time": "2021-01-05T14:25:24.122900Z"
     }
    },
    "outputs": [],
@@ -430,8 +436,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:07.946990Z",
-     "start_time": "2021-01-04T16:28:07.928697Z"
+     "end_time": "2021-01-05T14:25:24.157453Z",
+     "start_time": "2021-01-05T14:25:24.142956Z"
     },
     "scrolled": true
    },
@@ -452,8 +458,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:09.611115Z",
-     "start_time": "2021-01-04T16:28:07.948604Z"
+     "end_time": "2021-01-05T14:25:25.703372Z",
+     "start_time": "2021-01-05T14:25:24.159666Z"
     },
     "scrolled": false
    },
@@ -489,8 +495,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:09.633401Z",
-     "start_time": "2021-01-04T16:28:09.612488Z"
+     "end_time": "2021-01-05T14:25:25.726629Z",
+     "start_time": "2021-01-05T14:25:25.705254Z"
     },
     "scrolled": true
    },
@@ -506,8 +512,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:09.638358Z",
-     "start_time": "2021-01-04T16:28:09.634718Z"
+     "end_time": "2021-01-05T14:25:25.731276Z",
+     "start_time": "2021-01-05T14:25:25.728009Z"
     }
    },
    "outputs": [],
@@ -531,8 +537,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:09.891273Z",
-     "start_time": "2021-01-04T16:28:09.639781Z"
+     "end_time": "2021-01-05T14:25:25.966762Z",
+     "start_time": "2021-01-05T14:25:25.732453Z"
     },
     "scrolled": false
    },
@@ -551,7 +557,7 @@
     "\n",
     "Now let's focus on the results of the clustering algorithm. Could you identify and visualize two distinct clusters in the dataset? You can also run the $k$-means clustering again and select only 1 as _max\\_iter_ , which means that the first outcome is taken as the final result. Try this again and compare the results: does the output change at each iteration? What happens instead if the number is much larger? \n",
     "\n",
-    "To compare different outcomes of the algorithm, it might be convenient to copy paste the cell containing 'show_embedding()', and updating only one of the two visualizers at each iteration. Also, only the usage of the PCA embedding allows a straightforward comparison, because MDS and t-SNE are stochastics algorithms, thus they can give different results at each call.\n",
+    "To compare different outcomes of the algorithm, it might be convenient to copy-paste the cell containing 'show_embedding()' and update only one of the two visualizers at each iteration. Also, only the PCA embedding allows a straightforward comparison, because MDS and t-SNE are stochastic algorithms and can thus give different results at each call.\n",
     "\n",
     "We are interested in understanding whether clustering groups together materials which have the same most stable structure. \n",
\n", "Therefore, we define a function that prints for each cluster the percentage of materials that is more stable in the RS vs ZB structure. " @@ -562,8 +568,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:09.898070Z", - "start_time": "2021-01-04T16:28:09.892737Z" + "end_time": "2021-01-05T14:25:25.975637Z", + "start_time": "2021-01-05T14:25:25.968619Z" } }, "outputs": [], @@ -589,8 +595,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:09.932800Z", - "start_time": "2021-01-04T16:28:09.899348Z" + "end_time": "2021-01-05T14:25:26.013983Z", + "start_time": "2021-01-05T14:25:25.977220Z" }, "scrolled": true }, @@ -629,15 +635,14 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:09.945953Z", - "start_time": "2021-01-04T16:28:09.934125Z" + "end_time": "2021-01-05T14:25:26.028804Z", + "start_time": "2021-01-05T14:25:26.015145Z" }, "scrolled": true }, "outputs": [], "source": [ "distance_threshold=20\n", - "\n", "Clustering().hierarchical(distance_threshold=distance_threshold)" ] }, @@ -646,8 +651,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:10.020006Z", - "start_time": "2021-01-04T16:28:09.947670Z" + "end_time": "2021-01-05T14:25:26.099022Z", + "start_time": "2021-01-05T14:25:26.029990Z" }, "scrolled": false }, @@ -661,8 +666,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:10.049290Z", - "start_time": "2021-01-04T16:28:10.021782Z" + "end_time": "2021-01-05T14:25:26.127167Z", + "start_time": "2021-01-05T14:25:26.100850Z" }, "scrolled": true }, @@ -685,8 +690,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:10.213386Z", - "start_time": "2021-01-04T16:28:10.051024Z" + "end_time": "2021-01-05T14:25:26.283016Z", + "start_time": "2021-01-05T14:25:26.128369Z" } }, "outputs": [], @@ -714,9 +719,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "DBSCAN is a density-based clustering algorithm that detects noise and is able to extract clusters of different size and shape.\n", + "DBSCAN is a density-based clustering algorithm that detects outliers and is able to extract clusters of different size and shape.\n", "This algorithm requires two parameters: the distance $\\epsilon$ is the maximum distance for considering two points as neighbours; _min_samples_ gives the minimum number of neighbors required to define a core point. 
\n", - "Core points are the core component of clusters, and all those points that are neither core points nor neighbor of core points are labeled as noise.\n" + "Core points are the core component of clusters, and all those points that are neither core points nor neighbor of core points are labeled as outliers.\n" ] }, { @@ -724,8 +729,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:10.219907Z", - "start_time": "2021-01-04T16:28:10.214805Z" + "end_time": "2021-01-05T14:25:26.291099Z", + "start_time": "2021-01-05T14:25:26.284621Z" } }, "outputs": [], @@ -740,8 +745,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:10.300990Z", - "start_time": "2021-01-04T16:28:10.221245Z" + "end_time": "2021-01-05T14:25:26.372635Z", + "start_time": "2021-01-05T14:25:26.292763Z" }, "scrolled": false }, @@ -755,8 +760,8 @@ "execution_count": null, "metadata": { "ExecuteTime": { - "end_time": "2021-01-04T16:28:10.328403Z", - "start_time": "2021-01-04T16:28:10.302447Z" + "end_time": "2021-01-05T14:25:26.401536Z", + "start_time": "2021-01-05T14:25:26.374424Z" } }, "outputs": [], @@ -768,12 +773,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that the algorithm has found two different clusters, and we notice that each cluster is representative of the RS vs ZB structure. However, this happens at the cost of neglecting many points that have been classified as noise.\n", - "Now tune the parameters and see the effects of each parameter on the amount of noise.\n", + "We can see that the algorithm has found two different clusters, and we notice that each cluster is representative of the RS vs ZB structure. However, this happens at the cost of neglecting many points that have been classified as outliers.\n", + "Now tune the parameters and see the effects of each parameter on the number of outliers.\n", "\n", - "Considering that MDS seeks for an embedding that tries to preserve local pairwise distances, we would expect that in a MDS embedding noise is placed far from the defined clusters. Differently t-SNE tends to privilege global structures at the expenses of losing local definition, hence noise can be placed closed to other clusters. In fact, it is possible to notice that using t-SNE points tend to be equally distanced from each other, but clusters are quite distinguishable. Pairwise distances are not meaningful in a t-SNE embedding because it aims to depict global arrangements of clusters. On the other hand, MDS attemmpting to preserve all pairwise distances sometimes fails to arrange the different clusters.\n", + "Considering that MDS seeks for an embedding that tries to preserve local pairwise distances, we would expect that in a MDS embedding outliers are placed far from the defined clusters. Differently t-SNE tends to privilege global structures at the expenses of losing local definition, hence outliers can be placed closed to other clusters. In fact, it is possible to notice that using t-SNE points tend to be equally distanced from each other, but clusters are quite distinguishable. Pairwise distances are not meaningful in a t-SNE embedding because it aims to depict global arrangements of clusters. On the other hand, MDS attemmpting to preserve all pairwise distances sometimes fails to arrange the different clusters.\n", "\n", - "Can you notice in this case that noise is better isolated in a MDS embedding rather than using a t-SNE embedding? 
@@ -791,7 +796,7 @@
     "HDBSCAN can be defined as a hierarchical extension of DBSCAN, over which it has a number of advantages. \n",
     "One advantage is that there is only one relevant parameter to be tuned, i.e. the minimum size of clusters. \n",
     "This parameter is more intuitive to assess in comparison to e.g. the $\\epsilon$ threshold in DBSCAN.\n",
-    "In the HDBSCAN library that we we deploy, the minimum number of samples that is used for the mutual reachability distance is by default fixed to the same value of the minimum cluster size, as they essentiallt have the same goal, i.e. avoid the detection of clusters that contain less than a certain number of objects. \n",
+    "In the HDBSCAN library that we deploy, the minimum number of samples that is used for the mutual reachability distance is by default fixed to the same value as the minimum cluster size, as they essentially have the same goal, i.e. avoiding the detection of clusters that contain fewer than a certain number of objects. \n",
     "In this tutorial we explicitly define the two values. "
    ]
   },
@@ -800,8 +805,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.349230Z",
-     "start_time": "2021-01-04T16:28:10.329658Z"
+     "end_time": "2021-01-05T14:25:26.422572Z",
+     "start_time": "2021-01-05T14:25:26.403215Z"
     }
    },
    "outputs": [],
@@ -816,8 +821,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.427573Z",
-     "start_time": "2021-01-04T16:28:10.351211Z"
+     "end_time": "2021-01-05T14:25:26.499772Z",
+     "start_time": "2021-01-05T14:25:26.423842Z"
     }
    },
    "outputs": [],
@@ -830,8 +835,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.447426Z",
-     "start_time": "2021-01-04T16:28:10.429033Z"
+     "end_time": "2021-01-05T14:25:26.531074Z",
+     "start_time": "2021-01-05T14:25:26.503236Z"
     }
    },
    "outputs": [],
@@ -843,8 +848,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We obtain two clusters with high percentage of only one most stable structure. However, the number of materials classified as noise is considerably large.\n",
-    "The effect of _min_samples_ is to fix how conservative respect to noise detection the algorithm should be. Increasing its value the distorsion effects of the mutual reachability distance become more evident, while decreasing it less points are classified as noise. \n",
+    "We obtain two clusters, each with a high percentage of a single most stable structure. However, the number of materials classified as outliers is considerably large.\n",
+    "The effect of _min_samples_ is to fix how conservative the algorithm is with respect to outlier detection. Increasing its value makes the distortion effects of the mutual reachability distance more evident, while decreasing it means that fewer points are classified as outliers. \n",
     "Can you obtain more meaningful results by decreasing the value of this parameter?"
    ]
   },
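The conservativeness of _min_samples_ discussed in this hunk can likewise be checked on synthetic data; the sketch below makes the same illustrative assumptions as the previous ones and requires the hdbscan package.

```python
# Sweep min_samples at a fixed min_cluster_size and report the outlier count.
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=82, centers=2, cluster_std=1.5, random_state=0)

for min_samples in (15, 5, 2):
    labels = hdbscan.HDBSCAN(min_cluster_size=10,
                             min_samples=min_samples).fit_predict(X)
    print(f'min_samples={min_samples}: {labels.max() + 1} clusters, '
          f'{int(np.sum(labels == -1))} outliers')
```

Lowering _min_samples_ softens the mutual reachability distortion, so fewer points should end up labeled -1, which is exactly the knob the question above asks the reader to turn.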
@@ -868,8 +873,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.588857Z",
-     "start_time": "2021-01-04T16:28:10.448746Z"
+     "end_time": "2021-01-05T14:25:26.669427Z",
+     "start_time": "2021-01-05T14:25:26.533056Z"
     }
    },
    "outputs": [],
@@ -893,8 +898,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.705792Z",
-     "start_time": "2021-01-04T16:28:10.590212Z"
+     "end_time": "2021-01-05T14:25:26.798561Z",
+     "start_time": "2021-01-05T14:25:26.671018Z"
     }
    },
    "outputs": [],
@@ -907,8 +912,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.772478Z",
-     "start_time": "2021-01-04T16:28:10.707090Z"
+     "end_time": "2021-01-05T14:25:26.865309Z",
+     "start_time": "2021-01-05T14:25:26.800153Z"
     },
     "scrolled": false
    },
@@ -922,8 +927,8 @@
    "execution_count": null,
    "metadata": {
     "ExecuteTime": {
-     "end_time": "2021-01-04T16:28:10.803036Z",
-     "start_time": "2021-01-04T16:28:10.773905Z"
+     "end_time": "2021-01-05T14:25:26.897739Z",
+     "start_time": "2021-01-05T14:25:26.866609Z"
     }
    },
    "outputs": [],
     "We have found two clusters containing only materials with the same most stable structure and a mixed cluster containing both most stable structures. \n",
     "It is interesting to visualize this result with MDS, where we can see that the mixed cluster is placed in between the pure clusters as a transition zone.\n",
     "\n",
-    "Results of this clustering suggest that the atominc features we have used are sufficient for classifying materials according to their most stable structure.\n",
+    "These clustering results suggest that the atomic features we have used are sufficient for classifying materials according to their most stable structure.\n",
     "Even though the RS and ZB clusters are not clearly separated, as a mixed cluster is also found, a supervised machine learning model might be able to learn a classification of the 82 octet binary materials.\n",
     "We might also expect that such a model faces challenges, especially when classifying materials in the transition area.\n",
-    "A supervised learning algorithm, namely SISSO, has been used for such classification, and we resort to other tutorials in the AI toolkit to study this application.\n",
+    "A supervised learning algorithm, namely SISSO, has been used for such a classification, and we refer to other tutorials in the AI toolkit to study this application (see https://nomad-lab.eu/prod/analytics/public/user-redirect/notebooks/tutorials/compressed_sensing.ipynb and https://nomad-lab.eu/prod/analytics/public/user-redirect/notebooks/tutorials/descriptor_role.ipynb).\n",
     "\n",
     "In this tutorial, we have seen an exemplary application of unsupervised learning that has been deployed for exploring the structure of a multi-dimensional dataset.\n",
-    "We have performed a clustering analysis, that led us finding clusters representative of different external labels, i.e. the most stable structures.\n",
+    "We have performed a clustering analysis that led us to find clusters representative of an external label, i.e. the most stable structure.\n",
     "Such clustering gave us clear evidence that the set of features used for clustering should be enough to determine the value of the external labels.\n",
     "A subsequent step of such analysis would be the deployment of a supervised learning algorithm to find an interpretable relationship between the input features and the labels. "
    ]
--
GitLab