" <h4 class=\"modal-title\" id=\"lasso-motivation-modal-label\">Introduction and motivation</h4>",
" </div>",
" <div class=\"modal-body lasso_instructions\">",
" <p> In this tutorial notebook, we present a tool that produces two-dimensional structure maps for octet binary compounds, by starting from a high-dimensional rotational, translation, and permutational invariant representation (a descriptor) of the spatial structure (the geometry) that identifies each data point (material).",
" ",
" <p> The low-dimensional embedding methods (here, two-dimensional for the sake of visualization) are <i>unsupervised</i> machine-learning algorithms; so, in our example, the algorithm processes only the similarity (the distance) between the points in the high-dimensional representation. </p>",
" ",
" <p> In the linear method, <b>principal component analysis (<a href=\"https://en.wikipedia.org/wiki/Principal_component_analysis\" target=\"_blank\">PCA</a>)</b>, the direction (linear combination of the input coordinates) with the maximum variance is identified as the first principal component (PC). The direction perpendicular to the first PC with the largest variance is the second PC.",
" The process can be iterated up to as many dimensions as the initial dimensionality of the data, but here we stop at the second dimension and give the amount of total variance recovered by the first two principal components. </p>",
" <p> In the two popular non-linear methods we chose, <b>multidimensional scaling (<a href=\"https://en.wikipedia.org/wiki/Multidimensional_scaling\" target=\"_blank\">MDS</a>) </b> tries to preserve the distances from the given high-dimensional to the two-dimensional representation, ",
" whereas <b>t-Distributed Stochastic Neighbor Embedding (<a href=\"https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding\" target=\"_blank\">t-SNE</a>) </b> tries to preserve the local shape for groups of neighboring points. Both methods use a notion of distance that in our example is the Euclidean norm, even if in principle it could be any proper norm. </p>",
"",
" <p> In the results, we show the data points colored according to the classification (zincblende or rocksalt) in Part 1 and Part 2, while in Part 3 the data points are colored by a property which is retrieved from the database by means of a query. The labeling and consequent coloring is independent of the embedding method used, therefore the labeling is an <i>a posteriori</i> check that the high-dimensional representation could contain information about the labeling itself. In practice, if the coloring identifies clearly distinct areas, then the two dimensional representation is a map for the prediction of the labels, so that a new data point of unknown labeling, that lands in the two-dimensional map in an area of points with known labeling, is expected to belong to that same labeling. </p>",
" ",
"<p>The merit of the embedding methods is to provide relatively inexpensive tools to visually test whether a given set of features contains information about an investigated property (label). For this reason, they are widely used as preliminary tools for discovering structures in the data. </p>",
"object": "<script>\nvar beaker = bkHelper.getBeakerObject().beakerObj;\n</script>\n<style type=\"text/css\">\n .lasso_instructions{\n font-size: 15px;\n } \n</style>\n<!-- Button trigger modal -->\n<button type=\"button\" class=\"btn btn-default\" data-toggle=\"modal\" data-target=\"#lasso-motivation-modal\">\n Introduction and motivation\n</button>\n\n<!-- Modal -->\n<div class=\"modal fade\" id=\"lasso-motivation-modal\" tabindex=\"-1\" role=\"dialog\" aria-labelledby=\"lasso-motivation-modal-label\">\n <div class=\"modal-dialog modal-lg\" role=\"document\">\n <div class=\"modal-content\">\n <div class=\"modal-header\">\n <button type=\"button\" class=\"close\" data-dismiss=\"modal\" aria-label=\"Close\"><span aria-hidden=\"true\">×</span></button>\n <h4 class=\"modal-title\" id=\"lasso-motivation-modal-label\">Introduction and motivation</h4>\n </div>\n <div class=\"modal-body lasso_instructions\">\n <p> In this tutorial notebook, we present a tool that produces two-dimensional structure maps for octet binary compounds, by starting from a high-dimensional rotational, translation, and permutational invariant representation (a descriptor) of the spatial structure (the geometry) that identifies each data point (material).\n \n </p><p> The low-dimensional embedding methods (here, two-dimensional for the sake of visualization) are <i>unsupervised</i> machine-learning algorithms; so, in our example, the algorithm processes only the similarity (the distance) between the points in the high-dimensional representation. </p>\n \n <p> In the linear method, <b>principal component analysis (<a href=\"https://en.wikipedia.org/wiki/Principal_component_analysis\" target=\"_blank\">PCA</a>)</b>, the direction (linear combination of the input coordinates) with the maximum variance is identified as the first principal component (PC). The direction perpendicular to the first PC with the largest variance is the second PC.\n The process can be iterated up to as many dimensions as the initial dimensionality of the data, but here we stop at the second dimension and give the amount of total variance recovered by the first two principal components. </p>\n <p> In the two popular non-linear methods we chose, <b>multidimensional scaling (<a href=\"https://en.wikipedia.org/wiki/Multidimensional_scaling\" target=\"_blank\">MDS</a>) </b> tries to preserve the distances from the given high-dimensional to the two-dimensional representation, \n whereas <b>t-Distributed Stochastic Neighbor Embedding (<a href=\"https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding\" target=\"_blank\">t-SNE</a>) </b> tries to preserve the local shape for groups of neighboring points. Both methods use a notion of distance that in our example is the Euclidean norm, even if in principle it could be any proper norm. </p>\n\n <p> In the results, we show the data points colored according to the classification (zincblende or rocksalt) in Part 1 and Part 2, while in Part 3 the data points are colored by a property which is retrieved from the database by means of a query. The labeling and consequent coloring is independent of the embedding method used, therefore the labeling is an <i>a posteriori</i> check that the high-dimensional representation could contain information about the labeling itself. 
In practice, if the coloring identifies clearly distinct areas, then the two dimensional representation is a map for the prediction of the labels, so that a new data point of unknown labeling, that lands in the two-dimensional map in an area of points with known labeling, is expected to belong to that same labeling. </p>\n \n<p>The merit of the embedding methods is to provide relatively inexpensive tools to visually test whether a given set of features contains information about an investigated property (label). For this reason, they are widely used as preliminary tools for discovering structures in the data. </p>\n </div>\n <div class=\"modal-footer\">\n <button type=\"button\" class=\"btn btn-default\" data-dismiss=\"modal\">Close</button>\n<!-- <button type=\"button\" class=\"btn btn-primary\">Save changes</button> -->\n </div>\n </div>\n </div>\n</div>"
"value": "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n \"This module will be removed in 0.20.\", DeprecationWarning)\nUsing TensorFlow backend.\n"
}
]
}
},
"evaluatorReader": true,
"lineCount": 16
},
{
"id": "sectionGehWM5",
"type": "section",
"title": "<p style=\"color: #20335d; font-size: 15pt;font-weight: 900;\">1.2 Represent the system: the Partial Radial Distribution Function (PRDF) as\"descriptor\"</p>",
"level": 2,
"evaluatorReader": false,
"collapsed": false
},
{
"id": "markdownb4dAuG",
"type": "markdown",
"body": [
"<div class=\"modal-body lasso_instructions\">",
"The Partial Radial DIstribution Function (PRDF) considers distributions of pairwise distances $d_{\\alpha \\beta}$ <br>",
"between two atom type $\\alpha$ and $\\beta$.",
"[<a href=\"http://journals.aps.org/prb/abstract/10.1103/PhysRevB.89.205118\" target=\"blank\">K. T. Schütt et al., Phys. Rev. B 89, 205118 (2014)</a>] ",
"The goal of this section is to reduce the dimensionality of a dataset to two dimensions for visualization purposes <br>",
"",
" <p> Here, we use <b>multidimensional scaling </b>, a non-linear method that tries to preserve the distances from the given high-dimensional to the two-dimensional representation. <br>",
"In \"Part2: Guided Exercise\" you will be able to use different methods\". [<a href=\"https://en.wikipedia.org/wiki/Multidimensional_scaling\" target=\"_blank\">more info</a>]<br>",
"value": "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2699: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.\n VisibleDeprecationWarning)\nDEBUG: Processing configuration 1/10\n"
},
{
"type": "err",
"value": "DEBUG: Actual feature matrix needs 4104 bytes\n"
"value": "INFO: Click on the button 'View interactive 2D scatter plot' to see the plot.\n"
}
],
"payload": "<div class=\"output_subarea output_html rendered_html\"><a target=\"_blank\" href=\"/user/tmp/c560d2423b2a837c.html\">Click here to open the Viewer</a></div>"
"We first load the libraries, set up the paths to the data, and calculate the Partial Radial Distribution Function (PRDF) descriptor exactly as in Part 1. <br>",
"value": "INFO: Click on the button 'View interactive 2D scatter plot' to see the plot.\n"
}
],
"payload": "<div class=\"output_subarea output_html rendered_html\"><a target=\"_blank\" href=\"/user/tmp/068a9c5fe3101642.html\">Click here to open the Viewer</a></div>"
}
},
"evaluatorReader": true,
"lineCount": 49
},
{
"id": "sectionIhW1xp",
"type": "section",
"title": "<p style=\"color: #20335d; font-weight: 900;\">Part 3: Interactive query of the database and two-dimensional embedding </p>",
"level": 1,
"evaluatorReader": false,
"collapsed": false
},
{
"id": "markdownkdhnI9",
"type": "markdown",
"body": [
"<div class=\"modal-body lasso_instructions\">",
"In this part, you will interactively query the database, then calculate the partial radial distribution function (as done in Part 1), visualize the results in a high-dimensional data in two-dimensions, ",
"and finally generate an interactive Viewer with the results.<br>",
"",
"<br>",
"The difference with Part 1 and Part 2 is that you will query the database, and perform the data-analytics operations (seen in Part 1 and Part 2) on the result of your query.",
"title": "<p style=\"color: #20335d; font-size: 15pt;font-weight: 900;\">3.3 Result visualization in the interactive NOMAD Viewer</p>",
"level": 2,
"evaluatorReader": false,
"collapsed": false
},
{
"id": "markdowndvKLyf",
"type": "markdown",
"body": [
"<div class=\"modal-body lasso_instructions\">",
"We will now plot the results. Contrarily to what we have done in Part 1 and Part 2, here you can change the property that you want to use to color code the results.",
"<br>",
"In particular, given the Query results at point 3.1, you can decide to color code according to: ",
"<ul>",
"",
"<li> band gap (keyword: <font color=\"blue\"> band_gap</font>) <br>",
"value": "INFO: The color in the plot is given by the target value.\n"
},
{
"type": "err",
"value": "INFO: Click on the button 'View interactive 2D scatter plot' to see the plot.\n"
}
],
"payload": "<div class=\"output_subarea output_html rendered_html\"><a target=\"_blank\" href=\"/user/tmp/e918a6abeebd3442.html\">Click here to open the Viewer</a></div>"