%% Cell type:code id: tags:
``` python
import warnings
warnings.filterwarnings('ignore')
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">NOMAD 2018 Kaggle competition</font></b>
</div>
<p>
created by:
Xiangyue Liu<sup>1</sup> (<a href="mailto:xyliu@fhi-berlin.mpg.de">email</a>),
Christopher Sutton<sup>1</sup> (<a href="mailto:sutton@fhi-berlin.mpg.de">email</a>),
Luca M. Ghiringhelli<sup>1</sup> (<a href="mailto:ghiringhelli@fhi-berlin.mpg.de">email</a>),
Takenori Yamamoto<sup>2</sup>,
Yury Lysogorskiy<sup>3</sup>,
Lars Blumenthal<sup>4,5</sup>,
Thomas Hammerschmidt<sup>3</sup>,
Jacek Golebiowski<sup>4,5</sup>,
Angelo Ziletti<sup>1</sup>,
and Matthias Scheffler<sup>1</sup>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<sup>2</sup> Research Institute for Mathematical and Computational Sciences (RIMCS), LLC, Yokohama, Japan <br>
<sup>3</sup> ICAMS, Ruhr-Universität Bochum, Germany <br>
<sup>4</sup> EPSRC Centre for Doctoral Training on Theory and Simulation of Materials Department of Physics, Imperial College London, London, U.K. <br>
<sup>5</sup> Thomas Young Centre for Theory and Simulation of Materials, Department of Materials, Imperial College London, London, U.K. <br>
<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: March 21, 2019]</span>
<div>
<img style="float: left;" src="assets/kaggle_competition/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/kaggle_competition/Logo_NOMAD.png" width="250">
</div>
</div>
%% Cell type:markdown id: tags:
<font size=4em>[Skip introduction](#make_predictions)</font>
%% Cell type:markdown id: tags:
# NOMAD 2018 Kaggle research competition
### A paradigm shift in solving materials science grand challenges by crowdsourcing solutions through an open and global big-data competition
Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials.
To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence organized a crowd-sourced data analytics competition with Kaggle, one of the best-known online platforms for hosting big-data competitions. Kaggle has a community of over half a million users from around the world with backgrounds in computer science, statistics, biology, and medicine. The competition ran from December 18, 2017 to February 15, 2018 and involved nearly 900 participants. The goal of this competition was to develop or apply data analytics models for the prediction of two target properties: the formation energy (an indicator of the stability of a material) and the bandgap energy (an indicator of the potential for transparency over the visible range), in order to facilitate the discovery of new transparent conductors and allow for advancements in (opto)electronic technologies. A total of 5,000 euros in prizes was awarded to the top three participants with the best-performing models (i.e., the lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as:
$\mathrm{RMSLE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \log(\hat{y}_i + 1) - \log(y_i + 1) \right)^2 }$
where $N$ is the total number of observations, $\hat{y}_i$ is the predicted value, and $y_i$ is the reference value of either the formation or the bandgap energy.
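As a minimal sketch of this metric (assuming only NumPy; the notebook defines an equivalent `RMSLE` helper in the plotting code further down):
```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean square log error, as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# toy example: three reference values and three predictions
print(rmsle([0.10, 0.25, 0.40], [0.12, 0.20, 0.45]))  # ~0.033
```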
The dataset consists of 3,000 materials: 2,400 made up the training set, and the remaining 600 were used as the test set (i.e., only structures and input features were provided, with the target properties kept secret). Of that test set, 100 materials were used to determine the public leaderboard score, so that participants could assess their model performance on the fly (the exact values used in this assessment were also kept secret). The top three winners of the competition were determined by the private leaderboard score, based on the remaining 500 test materials.
Because only 100 values were used for assessing performance on the public leaderboard, participants had to ensure the predictive accuracy of their model for unseen data, even if a disagreement was found with the public leaderboard score. This is evident in the summary of the average RMSLE for all participants with scores below 0.25 in Figure 1, where a large shift in the values between the public leaderboard (100 compounds) and the private leaderboard (500 compounds) is apparent. The winning entry has an RMSLE of 0.0509, while the 2nd- and 3rd-place winners were closely stacked together with RMSLEs of 0.0521 and 0.0523, respectively. Within the first bin, there were a total of four participants with an RMSLE below 0.053 (i.e., 0.45% of participants).
%% Cell type:code id: tags:
``` python
%%HTML
<script>
code_show=true;
function code_toggle() {
if (code_show)
{
$('div.input').hide();
}
else
{
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The Python code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.
```
%% Cell type:markdown id: tags:
# More about the dataset: Group-III transparent conductors
Group-III oxides provide promising candidates for wide band-gap transparent conductors. A wide range of experimental band gap energies from 3.6 to 7.5 eV have been reported from alloying In$_2$O$_3$/Ga$_2$O$_3$ or Ga$_2$O$_3$/Al$_2$O$_3$,
which suggests that alloying group-III oxides is a viable strategy for designing new wide-band-gap semiconductors. However, Al$_2$O$_3$, Ga$_2$O$_3$, and In$_2$O$_3$ all display very different
ground-state structures. Therefore, it is unclear which structure will be stable for various compositions. Within the dataset, there are six different lattice symmetries: $C2/m$, $Pna2_1$, $R\bar{3}c$, $P6_3/mmc$, $Ia\bar{3}$, $Fd\bar{3}m$.
%% Cell type:code id: tags:
``` python
%%html
<!-- CSS Style Inline: -->
<style type="text/css">
/* all six Jmol viewer divs share the same size and float */
#jmol_div227, #jmol_div12, #jmol_div206, #jmol_div33, #jmol_div194, #jmol_div167 {
    height: 350px;
    width: 350px;
    float: left;
}
</style>
<!-- Load Jmol javascript library -->
<script type="text/javascript" src="assets/kaggle_competition/jsmol/JSmol.min.js"></script>
<!-- calls to jQuery and Jmol (inline) -->
<script type="text/javascript">
// Jmol readyFunction
jmol_isReady = function(applet) {
document.title = (applet._id + " - Jmol " + Jmol.___JmolVersion)
Jmol._getElement(applet, "appletdiv").style.border="0px solid blue"
}
// initialize the Jmol applets: the six viewers differ only in the geometry file to load
var jmol_geometries = {
    "12":  "geometry8-spacegroup12.xyz",
    "33":  "geometry1-spacegroup33.xyz",
    "167": "geometry4-spacegroup167.xyz",
    "194": "geometry2-spacegroup194.xyz",
    "206": "geometry7-spacegroup206.xyz",
    "227": "geometry3-spacegroup227.xyz"
};
function make_jmol_info(xyz_file) {
    return {
        width: "100%",
        height: "100%",
        color: "#ffffff",
        use: "HTML5",
        j2sPath: "assets/kaggle_competition/jsmol/j2s",
        jarPath: "assets/kaggle_competition/jsmol/java",
        jarFile: "JmolAppletSigned.jar",
        debug: false,
        readyFunction: jmol_isReady,
        script: 'load "data/kaggle_competition/xyz/' + xyz_file + '" ;',
        allowJavaScript: false,
        disableJ2SLoadMonitor: true
    };
}
// jQuery ready function: insert one applet per spacegroup div once the page has loaded
$(document).ready(function() {
    for (var sg in jmol_geometries) {
        $("#jmol_div" + sg).html(
            Jmol.getAppletHtml("myJmol" + sg, make_jmol_info(jmol_geometries[sg]))
        );
    }
});
var lastPrompt=0;
function show_hide_structures()
{
var x = document.getElementById("geometry_jmol");
if (x.style.display === "none")
{
x.style.display = "block";
}
else
{
x.style.display = "none";
}
}
</script>
<div>
<button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_hide_structures()">Show/hide structures</button>
<br><br>
</div>
<div id="geometry_jmol">
<table>
<tr>
<th><center>Spacegroup 12 (C2/m)</center></th>
<th><center>Spacegroup 33 (Pna2<sub>1</sub>)</center></th>
<th><center>Spacegroup 167 (R<font style="text-decoration: overline">3</font>c)</center></th>
</tr>
<tr>
<th><div id='jmol_div12'></div></th>
<th><div id='jmol_div33'></div></th>
<th><div id='jmol_div167'></div></th>
</tr>
<tr>
<th><center>Spacegroup 194 (P6<sub>3</sub>/mmc)</center></th>
<th><center>Spacegroup 206 (Ia<font style="text-decoration: overline">3</font>)</center></th>
<th><center>Spacegroup 227 (Fd<font style="text-decoration: overline">3</font>m)</center></th>
</tr>
<tr>
<th><div id='jmol_div194'></div></th>
<th><div id='jmol_div206'></div></th>
<th><div id='jmol_div227'></div></th>
</tr>
</table>
</div>
```
%% Cell type:markdown id: tags:
# Winning solutions
- __1st place winning solution:__ crystal graph n-grams + Kernel Ridge Regression
- __3rd place winning solution:__ SOAP-based descriptor + Neural network
## - Representations: crystal graph n-grams, SOAP-based descriptor
### 1. Crystal graph n-grams
The 1st place winning solution was obtained using the metal-oxygen coordination numbers derived from the number of bonds that fall within the sum of the ionic Shannon experimental radii (enlarged by 30-50%, depending on the crystal-structure type). These ionic bonds are then used to build a crystal graph, in which each atom is a node and the edges between nodes are defined by the ionic bonds; in Figure 1, the coordination numbers are shown for each atom in a sequence of six atoms.
<img src="assets/kaggle_competition/ngram.png" alt="Drawing" style="width: 700px;"/>
<center> Figure 1: Depiction of a crystal graph representation of In<sub>3</sub>Ga<sub>1</sub>O<sub>6</sub>, showing the connections between each atom (node) that are defined by the ionic bonds. </center>
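To make the n-gram counting concrete, here is a toy sketch (the graph and labels below are invented for illustration; the actual implementation is the `get_bigrams` function further down, which counts metal-oxygen label pairs along the graph edges):
```python
from collections import Counter

# hypothetical toy crystal graph: node -> list of bonded neighbours, where each
# node is labelled by its element and coordination number (e.g. 'Ga4')
toy_graph = {
    'Ga4': ['O3', 'O3', 'O2'],
    'In6': ['O3', 'O2'],
}

# bigrams are metal-oxygen label pairs along each ionic bond (graph edge)
bigrams = Counter(f"{metal}_{oxygen}"
                  for metal, neighbours in toy_graph.items()
                  for oxygen in neighbours)
print(bigrams)  # Counter({'Ga4_O3': 2, 'Ga4_O2': 1, 'In6_O3': 1, 'In6_O2': 1})
```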
### 2. SOAP-based descriptor
In the 3rd place winning solution, the smooth overlap of atomic positions (SOAP) kernel developed by Bartók et al. was used, which incorporates information on the local atomic environment through a rotationally integrated overlap of neighbor densities$^{20,21}$. The SOAP kernel describes the local environment of a given atom ($i$) through a sum of Gaussians centered on each of its atomic neighbors ($j$) within a specified cutoff radius, where $r_{ij}$ is the distance between atoms $i$ and $j$:
$\rho_{i}(r) = \sum_{j} \exp\left(-\frac{(r-r_{ij})^2}{2\sigma^2_{\mathrm{atom}}}\right) f_{\mathrm{cut}}(r_{ij})$
where $\sigma_{\mathrm{atom}}$ is a smoothing parameter and the switching function $f_{\mathrm{cut}}$ goes smoothly to zero beyond a specified radial value. This local atomic neighbor density can be expanded in terms of spherical harmonics and orthogonal radial functions, and the expansion coefficients are then combined to form the rotationally invariant power spectrum corresponding to the neighbor density of each atom.
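A minimal numerical sketch of this neighbor density (the distances and parameters are toy values, and the cosine cutoff form is an assumption for illustration; the full SOAP power spectrum used later in this notebook is computed by quippy):
```python
import numpy as np

sigma_atom, r_cut = 0.5, 5.0             # smoothing width and cutoff (toy values)
r_neighbors = np.array([1.9, 2.1, 3.5])  # toy neighbour distances r_ij (Angstrom)

def f_cut(r):
    # a common smooth cutoff choice (assumption; quippy's exact form may differ)
    return np.where(r < r_cut, 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

def rho_i(r):
    """Neighbour density rho_i(r) from the formula above."""
    return sum(np.exp(-(r - rij) ** 2 / (2 * sigma_atom ** 2)) * f_cut(rij)
               for rij in r_neighbors)

r_grid = np.linspace(0.0, r_cut, 6)      # evaluate the density on a radial grid
print(np.round(rho_i(r_grid), 3))
```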
## - Regression methods: Kernel ridge regression, Neural network
### 1. Kernel ridge regression (KRR) $^1$
Kernel ridge regression (KRR) is a generalization of ridge regression: a linear function is learned in the space induced by the chosen kernel and the data, by minimizing the squared loss with a squared-norm (L2) regularization term.
The model learned by KRR has the same form as that of support vector regression (SVR). Both KRR and SVR use L2 regularization, but KRR uses a squared-error loss, whereas SVR uses an $\epsilon$-insensitive loss. The model learned by KRR is non-sparse, so prediction with KRR is typically slower than with SVR.
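For reference, the predictions below use scikit-learn's `KernelRidge` with an RBF kernel; here is a minimal standalone sketch on toy data (the hyperparameter values are placeholders, not the grid-searched ones used later):
```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# toy 1-D regression problem
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

# RBF-kernel ridge regression; alpha (regularization strength) and
# gamma (kernel width) are illustrative placeholder values
krr = KernelRidge(kernel='rbf', alpha=1e-3, gamma=0.5)
krr.fit(X, y)
print(krr.predict([[1.0], [2.5]]))
```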
### 2. Neural network $^2$
In the 3rd place winning solution, a multi-layer perceptron (MLP) is employed. An MLP is composed of more than one perceptron: typically an input layer, several hidden layers, and an output layer. The input of the network is transformed using a learned non-linear transformation. As a supervised learning algorithm, the MLP learns a function $f(\cdot): \mathbb{R}^{I} \rightarrow \mathbb{R}^{O}$, where $I$ is the number of input dimensions and $O$ is the number of output dimensions. Several parameters are adjusted during training, including the weights and biases of the neurons. Gradient-based optimization algorithms, such as stochastic gradient descent, are employed to minimize the loss function. There are different choices of loss function, for example the root mean squared error (RMSE) or the cross entropy, and one can even switch loss functions during training to obtain better gradients. Figure 2 shows a one-hidden-layer MLP with scalar output as an example.
<img src="assets/kaggle_competition/multilayerperceptron_network.png" alt="Drawing" style="width: 400px;"/>
<center> Figure 2 : One hidden layer MLP $^2$. </center>
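As a hedged sketch of such a network (toy sizes and random data; the actual architectures used in this notebook are defined in the `NeuralNetwork` class further down), a one-hidden-layer MLP trained with Adam on an MSE loss looks like this:
```python
import torch

# minimal one-hidden-layer MLP matching Figure 2 (layer sizes are illustrative)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),   # input layer -> hidden layer
    torch.nn.LeakyReLU(),      # non-linear activation (as in the code below)
    torch.nn.Linear(32, 1),    # hidden layer -> scalar output
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(8, 10), torch.randn(8, 1)
for _ in range(5):             # a few gradient-descent steps
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())
```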
%% Cell type:markdown id: tags:
<a id='make_predictions'></a>
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
%%HTML
<br><br><br>
<font size="6.5em"><b>Make predictions on formation energies and bandgaps</b></font>
<br><hr><br>
<font size="5em"><b>Winning representations combined with different regression methods</b></font>
<br><br>
<font size = "3.5em"> To understand the relative importance of the representation vs. regression model, one can examine the performance of each representation combined with different regression models.
The hyperparameters are optimized for each representation/regressor combination. </font>
<br>
<font size = "3.5em" color="009FC2"><br>Warning: the learning algorithm employed in this study (e.g. grid-search) can not guarantee deterministic results. The actual predictions can divergent from the published data.
</font>
<br><br><br>
<form>
<font size="4em">Select a representation and a regression method:</font>
<select id="representation" onclick="show_options()">
<option value="ngram">n-gram</option>
<option value="soap">SOAP</option>
</select>
<select id="regression" onclick="show_options()">
<option value="krr">KRR</option>
<option value="nn">Neural network</option>
</select>
<button type="button" id="button_show_options" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px" onclick="show_options()">More options</button>
<button type="button" id="button_hide_options" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px; display: none" onclick="hide_options()">Less options</button>
</form>
<script type="text/Javascript">
function show_options()
{
representation_value = document.getElementById("representation").value;
regression_value = document.getElementById("regression").value;
if(representation_value == "ngram")
{
document.getElementById("div_options_soap").style.display="none";
document.getElementById("div_options_ngram").style.display="";
document.getElementById("button_show_options").style.display="none";
document.getElementById("button_hide_options").style.display="";
}
else if(representation_value == "soap")
{
document.getElementById("div_options_ngram").style.display="none";
document.getElementById("div_options_soap").style.display="";
document.getElementById("button_show_options").style.display="none";
document.getElementById("button_hide_options").style.display="";
}
if(regression_value == "nn")
{
document.getElementById("div_options_nn").style.display="";
set_nn_default();
}
else if(regression_value == "krr")
{
document.getElementById("div_options_nn").style.display="none";
}
}
function hide_options()
{
document.getElementById("button_show_options").style.display="";
document.getElementById("button_hide_options").style.display="none";
document.getElementById("div_options_ngram").style.display="none";
document.getElementById("div_options_soap").style.display="none";
document.getElementById("div_options_nn").style.display="none";
}
function set_nn_n_neurons()
{
document.getElementById("div_options_nn_n_neurons").style.display="";
n_layers = document.getElementById("input_n_nn").value;
contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>"
for(i=0; i<n_layers; i++)
{
content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input\ type=\"number\"\ id=\"input_n_neurons_" + String(i) + "\"\ value=\"256\"\ min=\"1\"\ max=\"1024\"\ >";
contents += content_add;
if((i+1)%6 ==0)
{
contents += "<br>"
}
}
//alert(contents);
document.getElementById("div_options_nn_n_neurons").innerHTML = contents;
}
function set_nn_default()
{
document.getElementById("div_options_nn_n_neurons").style.display="";
representation_value = document.getElementById("representation").value;
if(representation_value == "ngram")
{
n_layers = 11;
document.getElementById("input_n_nn").value = n_layers;
contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>"
for(i=0; i<7; i++)
{
content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input\ type=\"number\"\ id=\"input_n_neurons_" + String(i) + "\"\ value=\"100\"\ min=\"1\"\ max=\"1024\"\ >";
contents += content_add;
if((i+1)%6 ==0)
{
contents += "<br>"
}
}
for(i=7; i<11; i++)
{
content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input\ type=\"number\"\ id=\"input_n_neurons_" + String(i) + "\"\ value=\"50\"\ min=\"1\"\ max=\"1024\"\ >";
contents += content_add;
if((i+1)%6 ==0)
{
contents += "<br>"
}
}
}
else if(representation_value == "soap")
{
n_layers = 3;
document.getElementById("input_n_nn").value = n_layers;
contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>"
for(i=0; i<1; i++)
{
content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input\ type=\"number\"\ id=\"input_n_neurons_" + String(i) + "\"\ value=\"512\"\ min=\"1\"\ max=\"1024\"\ >";
contents += content_add;
}
for(i=1; i<3; i++)
{
content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input\ type=\"number\"\ id=\"input_n_neurons_" + String(i) + "\"\ value=\"256\"\ min=\"1\"\ max=\"1024\"\ >";
contents += content_add;
}
}
document.getElementById("div_options_nn_n_neurons").innerHTML = contents;
save_options_nn();
}
function save_options_nn()
{
n_layers = document.getElementById("input_n_nn").value;
var n_neurons = [];
for(i=0; i<n_layers; i++)
{
id_i = "input_n_neurons_" + String(i);
n_neurons_i = document.getElementById(id_i).value;
n_neurons.push(n_neurons_i);
}
var command = "N_nn_neurons = " + n_neurons + ";";
command += "N_nn_layers = int(" + n_layers + ");"
var kernel = IPython.notebook.kernel;
kernel.execute(command);
}
function save_options()
{
representation_value = document.getElementById("representation").value;
regression_value = document.getElementById("regression").value;
if(representation_value == "ngram")
{
ngram_N = document.getElementById("input_ngram_N").value;
n_threads = document.getElementById("input_n_threads").value;
var command = "ngram_N = int(" + ngram_N + "); n_threads = int(" + n_threads + ");"
}
if(representation_value == "soap")
{
soap_cutoff = document.getElementById("input_soap_cutoff").value;
soap_lmax = document.getElementById("input_soap_lmax").value;
soap_nmax = document.getElementById("input_soap_nmax").value;
n_threads = document.getElementById("input_n_threads").value;
var command = "soap_cutoff = int(" + soap_cutoff + ");";
command += "soap_lmax = int(" + soap_lmax + ");";
command += "soap_nmax = int(" + soap_nmax + ");";
command += "n_threads = int(" + n_threads + ");";
}
console.log("Executing Command: " + command);
var kernel = IPython.notebook.kernel;
kernel.execute(command);
}
</script>
<div id="div_options_ngram" style="display: none; position:relative; left:5%">
<br>
<font size="3em">N-grams size (N = 1 ~ 4)</font> <input type="number" id="input_ngram_N" value="3" min="1" max="4" >
<br>
<font size="3em">Number of threads used in regression (N = 1 ~ 4)</font> <input type="number" id="input_n_threads" value="4" min="1" max="4" >
<br>
<button type="button" id="save_options_ngram" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options()">Save options</button>
</div>
<div id="div_options_soap" style="display: none; position:relative; left:5%">
<br>
<font size="3em">SOAP cutoff (Ang) </font> <input type="number" id="input_soap_cutoff" value="10" min="1" max="25">
<font size="3em">Max. l </font><input type="number" id="input_soap_lmax" value="4" min="1" max="8" >
<font size="3em">Max. n </font><input type="number" id="input_soap_nmax" value="4" min="1" max="8" >
<br>
<font size="3em">Number of threads used in regression (N = 1 ~ 4)</font> <input type="number" id="input_n_threads" value="4" min="1" max="4" >
<br>
<button type="button" id="save_options_soap" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options()">Save options</button>
</div>
<br>
<div id="div_options_nn" style="display: none; position:relative; left:5%">
<br>
<font size="3em">Number of (linear) hidden layers </font> <input type="number" id="input_n_nn" value="2" min="1" max="20">
<button type="button" id="set_n_neurons" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:250px; height:30px;" onclick="set_nn_n_neurons()">Set the number of neurons</button>
<div id="div_options_nn_n_neurons" style="display: none; position:relative; left:0%">
<br>
<font size="3em">Number of neurons in each hidden layer</font>
</div>
<br>
<button type="button" id="set_nn_default" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px;" onclick="set_nn_default()">Use default configuration</button>
<button type="button" id="save_options_nn" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options_nn()">Save options</button>
</div>
<div id="demoa"></div>
<br><br>
<button type="button" id="button_show_sample_output" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_sample_output()">Show sample outputs</button>
<button type="button" id="button_hide_sample_output" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px; display: none" onclick="hide_sample_output()">Hide sample outputs</button>
<button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px" onclick="get_repr_regr_combination()">Get predictions</button>
<input type="checkbox" id="if_learning_curve" value="learning_curve" onclick=show_warning_learning_curve()> Show learning curve
<input type="checkbox" id="if_use_prestored" value="use_prestored" onclick=show_warning_use_prestored_model() checked=true> Use prestored models
<div id="warning_learning_curve" style="display: none">
<br>
<font color="009FC2">
Warning: It can be very time-consuming (5~20 min/point, depending on method, model size, and number of threads employed) when getting a learning curve.
</font>
<p><font size="3em">Number of points in learning curve: <input type="number" id="N_learning_curve" value="4" min="1" max="25" style="display:none"></font></p>
</div>
<div id="warning_use_prestored" style="display: none">
<br>
<font color="009FC2">
Warning: It can be very time-consuming (10~20 min, depending on method, model size, and number of threads employed) to generate models on the fly.
</font>
</div>
<div id="sample_output" style="display: none">
<br><br>
<font size="4em"><b> Sample outputs of predictions on formation energy and bandgap</b></font>
<br><br>
<font size="3em">Representation: N-grams<br><br>Regression method: Kernel-ridge regression (KRR)</font>
<br><br>
<img src="assets/kaggle_competition/results-ngram-krr.png" width="60%">
<br><br>
</div>
<style>
* {
box-sizing: border-box;
}
.column {
float: left;
width: 33.33%;
padding: 5px;
}
/* Clearfix (clear floats) */
.row::after {
content: "";
clear: both;
display: table;
}
</style>
<div id="sample_learning_curve" style="display: none">
<font size="4em"><b> Sample outputs of learning curves for formation energy and bandgap predictions</b></font>
<br><br>
<font size="3em">Representation: N-gram<br><br>Regression method: Kernel-ridge regression (KRR)</font>
<br><br>
<div style="display: table; width: 95%">
<div class="row">
<div class="column">
<img src="./imgs/learning_curve-formation-ngram-krr.png" style="width:100%">
</div>
<div class="column">
<img src="./imgs/learning_curve-bandgap-ngram-krr.png" style="width:100%">
</div>
</div>
</div>
<br><br>
</div>
<script type="text/Javascript">
window.findCellIndicesByTag = function findCellIndicesByTag(tagName) {
return (Jupyter.notebook.get_cells()
.filter(
({metadata: {tags}}) => tags && tags.includes(tagName)
)
.map((cell) => Jupyter.notebook.find_cell_index(cell))
);
};
window.runCells = function runPlotCells(tags) {
var c = window.findCellIndicesByTag(tags);
Jupyter.notebook.execute_cells(c);
};
function show_sample_output()
{
document.getElementById("button_show_sample_output").style.display="none";
document.getElementById("button_hide_sample_output").style.display="";
var check_learning_curve = document.getElementById("if_learning_curve");
var learning_curve_output = document.getElementById("sample_learning_curve");
var output = document.getElementById("sample_output");
//var representation_value = document.getElementById("representation").value;
//var regression_value = document.getElementById("regression").value;
output.style.display = "block";
if (check_learning_curve.checked == true)
{
learning_curve_output.style.display = "block";
}
else
{
learning_curve_output.style.display = "none";
}
}
function hide_sample_output()
{
document.getElementById("button_show_sample_output").style.display="";
document.getElementById("button_hide_sample_output").style.display="none";
var learning_curve_output = document.getElementById("sample_learning_curve");
var output = document.getElementById("sample_output");
output.style.display = "none"
learning_curve_output.style.display = "none";
}
function show_warning_learning_curve()
{
var checkBox = document.getElementById("if_learning_curve");
learning_curve_output = document.getElementById("sample_learning_curve");
var warning = document.getElementById("warning_learning_curve");
if (checkBox.checked == true)
{
warning.style.display = "block";
document.getElementById("N_learning_curve").style.display="";
var N_learning_curve = document.getElementById("N_learning_curve").value;
var command = " N_learning_curve = int(" + N_learning_curve + ");";
var kernel = IPython.notebook.kernel;
kernel.execute(command);
}
else
{
warning.style.display = "none";
document.getElementById("N_learning_curve").style.display="none";
}
}
function show_warning_use_prestored_model()
{
var checkBox = document.getElementById("if_use_prestored");
var warning = document.getElementById("warning_use_prestored");
if (checkBox.checked == true)
{
warning.style.display = "none";
}
else
{
warning.style.display = "block";
}
if_use_prestored = document.getElementById("if_use_prestored").checked;
var command = " if_use_prestored_models = '" + if_use_prestored.toString() + "';";
var kernel = IPython.notebook.kernel;
kernel.execute(command);
}
function get_repr_regr_combination()
{
ngram_N = document.getElementById("input_ngram_N").value;
n_threads = document.getElementById("input_n_threads").value;
soap_cutoff = document.getElementById("input_soap_cutoff").value;
soap_lmax = document.getElementById("input_soap_lmax").value;
soap_nmax = document.getElementById("input_soap_nmax").value;
var command = "ngram_N = int(" + ngram_N + "); n_threads = int(" + n_threads + ");"
+ " soap_cutoff = int(" + soap_cutoff + ");"
+ " soap_lmax = int(" + soap_lmax + ");"
+ " soap_nmax = int(" + soap_nmax + ");" ;
console.log("Executing Command: " + command);
var kernel = IPython.notebook.kernel;
kernel.execute(command);
representation_value = document.getElementById("representation").value;
regression_value = document.getElementById("regression").value;
if_learning_curve = document.getElementById("if_learning_curve").checked;
if_use_prestored = document.getElementById("if_use_prestored").checked;
N_learning_curve = document.getElementById("N_learning_curve").value;
var command = " representation = '" + representation_value +
"'; regression = '" + regression_value + "';" + " if_learningcurve = '" + if_learning_curve.toString() + "';"
+ " if_use_prestored_models = '" + if_use_prestored.toString() + "';"
+ " N_learning_curve = int(" + N_learning_curve + ");";
console.log("Executing Command: " + command);
var kernel = IPython.notebook.kernel;
kernel.execute(command);
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('ngram'));//# N-gram descriptor
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('soap'));//# SOAP descriptors
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('krr-ngram'));//# KRR
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init_read_data'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('plot'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('main'));//# Read and predict
}
</script>
```
%% Cell type:code id: tags:init
``` python
import os
import numpy as np
import copy
import pandas as pd
import time
#from tqdm import tqdm, trange
from tqdm import tnrange, tqdm_notebook
import sys
#print(sys.version)
from IPython.display import HTML
# n-gram
import kaggle_competition.crystal_graph as cg
# SOAP
from ase.atoms import Atoms as ASEAtoms
from ase.io import read, write
import quippy as qp
from quippy import descriptors
from IPython.core.display import Javascript
from IPython.display import display
def run_cell_by_tag(tags):
display(Javascript("window.runCells('"+tags+"')"))
```
%% Cell type:code id: tags:init_read_data
``` python
# Read data
INPUT_DIR = "data/kaggle_competition/original_Kaggle/"
df_train = pd.read_csv(INPUT_DIR+"train.csv")
df_train["dataset"] = "train"
df_test = pd.read_csv(INPUT_DIR+"test.csv")
df_test["dataset"] = "test"
df_crystals = pd.concat([df_train, df_test], ignore_index=True)
train_data = 'data/kaggle_competition/original_Kaggle/train'
test_data = 'data/kaggle_competition/original_Kaggle/test'
csv_path = 'data/kaggle_competition/original_Kaggle/'
train_csv = np.loadtxt(csv_path+'/train.csv', skiprows=1, delimiter=',')
test_csv = np.loadtxt(csv_path+'/test.csv', skiprows=1, delimiter=',')
test_true_csv = pd.read_csv("data/kaggle_competition/labeled_test.csv")
bandgap_test_true = test_true_csv["bandgap_energy_ev"]
formation_test_true = test_true_csv["formation_energy_ev_natom"]
formation_train_true = df_train["formation_energy_ev_natom"]
bandgap_train_true = df_train["bandgap_energy_ev"]
# Initialize arrays
unigrams = []
bigrams = []
trigrams = []
quadgrams = []
train_mean_atomic_descriptors = []
test_mean_atomic_descriptors = []
```
%% Cell type:code id: tags:plot
``` python
from IPython.display import HTML
%matplotlib inline
import matplotlib.pyplot as plt
fontsize = 15
def RMSE(y_true, y_pred):
return np.sqrt(np.mean((np.array(y_true)-np.array(y_pred))**2))
def RMSLE(true, pred):
return np.sqrt(np.mean((np.log(np.array(true)+1)-np.log(np.array(pred)+1))**2))
def test_plot(formation_pred, formation_true, bandgap_pred, bandgap_true):
RMSE_formation = RMSE(formation_pred, formation_true)
RMSE_bandgap = RMSE(bandgap_pred, bandgap_true)
RMSLE_formation = RMSLE(formation_pred, formation_true)
RMSLE_bandgap = RMSLE(bandgap_pred, bandgap_true)
label_formation = "test RMSLE " + str(RMSLE_formation)[:5]
label_bandgap = "test RMSLE " + str(RMSLE_bandgap)[:5]
fig = plt.figure(figsize=(6,4))
plt.scatter(formation_true, formation_pred - formation_true, s = 3, alpha = 0.3, label = label_formation)
#plt.plot(np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), '--', c = 'grey')
plt.xlabel('Actual formation energy (eV/cation)', fontsize = fontsize)
    plt.ylabel(r'$y-\hat{y}$ (eV/cation)', fontsize = fontsize)
plt.legend(loc='upper right', fontsize = fontsize)
plt.axhline(y=0, linewidth=1, color="#000000")
plt.xticks(fontsize = fontsize)
plt.yticks(fontsize = fontsize)
plt.show()
fig = plt.figure(figsize=(6,4))
plt.scatter(bandgap_true, bandgap_pred - bandgap_true, s = 3, alpha = 0.3, label = label_bandgap)
#plt.plot(np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), '--', c = 'grey')
plt.xlabel('Actual bandgap energy (eV)', fontsize = fontsize)
    plt.ylabel(r'$y-\hat{y}$ (eV)', fontsize = fontsize)
plt.legend(loc='upper right', fontsize = fontsize)
plt.axhline(y=0, linewidth=1, color="#000000")
plt.xticks(fontsize = fontsize)
plt.yticks(fontsize = fontsize)
plt.show()
def plot_learning_curve(learning_curve):
RMSLE_formation_train = []
RMSLE_formation_test = []
RMSLE_bandgap_train = []
RMSLE_bandgap_test = []
for i in range(len(learning_curve["N_train"])):
RMSLE_formation_train.append(RMSLE(formation_train_true[learning_curve["indtrain"][i]], learning_curve["formation_train"][i]))
RMSLE_formation_test.append(RMSLE(formation_test_true, learning_curve["formation_test"][i]))
RMSLE_bandgap_train.append(RMSLE(bandgap_train_true[learning_curve["indtrain"][i]], learning_curve["bandgap_train"][i]))
RMSLE_bandgap_test.append(RMSLE(bandgap_test_true, learning_curve["bandgap_test"][i]))
display(HTML("""<div><font size=3.5em><b> - Formation energy</b></font><br></div>"""))
fig = plt.figure(figsize = (6,4))
plt.scatter(learning_curve["N_train"], RMSLE_formation_train, s=3, alpha=1, label = "Training")
plt.scatter(learning_curve["N_train"], RMSLE_formation_test, s=3, alpha=1, label = "Test")
plt.xlabel('Size of training set', fontsize = fontsize)
plt.ylabel('RMSLE (eV/cation)', fontsize = fontsize)
plt.legend(loc='upper right', fontsize = fontsize)
plt.xticks(fontsize = fontsize)
plt.yticks(fontsize = fontsize)
plt.show()
display(HTML("""<div><font size=3.5em><b> - Bandgap</b></font><br></div>"""))
plt.scatter(learning_curve["N_train"], RMSLE_bandgap_train, s=3, alpha=1, label = "Training")
plt.scatter(learning_curve["N_train"], RMSLE_bandgap_test, s=3, alpha=1, label = "Test")
plt.xlabel('Size of training set', fontsize = fontsize)
plt.ylabel('RMSLE (eV)', fontsize = fontsize)
plt.legend(loc='upper right', fontsize = fontsize)
plt.xticks(fontsize = fontsize)
plt.yticks(fontsize = fontsize)
plt.show()
def plot_hyperparameter_contour(x_grid, y_grid, z_grid, grid):
fontsize = 15
plt.figure()
c = plt.contourf(x_grid, y_grid, abs(z_grid), grid, alpha=.75)
    CS = plt.contour(x_grid, y_grid, abs(z_grid), grid, colors='black', linewidths=.5)
plt.clabel(CS, inline=1, fontsize=12)
plt.xlabel('alpha', fontsize = fontsize)
plt.ylabel('gamma', fontsize = fontsize)
plt.title('RMSE', fontsize = fontsize)
plt.xticks(fontsize = fontsize)
plt.yticks(fontsize = fontsize)
plt.show()
```
%% Cell type:code id: tags:main
``` python
#main
import sys
# Make predictions
if representation == "ngram":
run_cell_by_tag('get_ngram_descriptor')
if regression == "nn":
run_cell_by_tag('ngram-nn')
elif regression == "krr":
run_cell_by_tag('predic_plot_ngram_krr')
elif representation == "soap":
run_cell_by_tag('get_soap_descriptor')
if regression == "nn":
run_cell_by_tag('nn-soap')
elif regression == "krr":
run_cell_by_tag('soap-krr')
else:
print ("Representation/regression method not found:", representation, regression)
```
%% Cell type:code id: tags:ngram
``` python
# N-gram descriptor
def check_crdn(crdn, symbols, filename):
metals = [i for i, s in enumerate(symbols) if s != 'O']
oxygens = [i for i, s in enumerate(symbols) if s == 'O']
metal_cn = [len(crdn[i]) for i in metals]
oxygen_cn = [len(crdn[i]) for i in oxygens]
if np.max(metal_cn) > m_cn_max or np.min(metal_cn) < m_cn_min:
print(filename)
print('metal_cn: {}'.format(metal_cn))
if np.max(oxygen_cn) > o_cn_max or np.min(oxygen_cn) < o_cn_min:
print(filename)
print('oxygen_cn: {}'.format(oxygen_cn))
def get_cg(fn, factor):
crystal_xyz, crystal_sym, crystal_lat = cg.get_xyz_data(fn) #fn: file_name
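    # convert Cartesian positions to fractional (reduced) coordinates:
    # A holds the lattice vectors as columns, so crystal_red = crystal_xyz @ inv(A).T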
A = np.transpose(crystal_lat)
B = np.linalg.inv(A)
crystal_red = np.matmul(crystal_xyz, np.transpose(B))
R_max = cg.get_Rmax(crystal_sym) * factor
crystal_dist, crystal_Rij, crystal_crdn = cg.get_shortest_distances(crystal_red, A, R_max)
check_crdn(crystal_crdn, crystal_sym, fn)
return crystal_crdn, crystal_sym
def get_symbols(G):
return G[1]
def coord_number(G, node):
return len(G[0][node])
def neighbors(G, node):
return [c[0] for c in G[0][node]]
def neighbors_with_R(G, node):
return [(c[0], c[2]) for c in G[0][node]]
def get_unigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
symbols = get_symbols(G)
symbols = [token_dict[s + str(coord_number(G, i))] for i, s in enumerate(symbols)]
uniques, counts = np.unique(symbols, return_counts=True)
counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(tokens), dtype=int)
for key, value in counts.items():
v[key] = value
return v
def get_bigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
symbols = get_symbols(G)
metals = [i for i, s in enumerate(symbols) if s != 'O']
symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
pair_list = []
for j in metals:
sym_j = symbols[j]
for i in neighbors(G, j):
sym_i = symbols[i]
pair = sym_j + '_' + sym_i
pair_list.append(pair_dict[pair])
uniques, counts = np.unique(pair_list, return_counts=True)
counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(pairs), dtype=int)
for key, value in counts.items():
v[key] = value
return v
def get_trigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
symbols = get_symbols(G)
oxygens = [i for i, s in enumerate(symbols) if s == 'O']
metals = [i for i, s in enumerate(symbols) if s != 'O']
symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
triple_list = []
for o in oxygens:
sym_o = symbols[o]
nbr = neighbors(G, o)
for j in range(len(nbr)):
sym_j = symbols[nbr[j]]
for i in range(j):
sym_i = symbols[nbr[i]]
if token_dict[sym_i] >= token_dict[sym_j]:
triple = sym_i + '_' + sym_o + '_' + sym_j
else:
triple = sym_j + '_' + sym_o + '_' + sym_i
triple_list.append(triples_dict[triple])
for m in metals:
sym_m = symbols[m]
nbr = neighbors(G, m)
for j in range(len(nbr)):
sym_j = symbols[nbr[j]]
for i in range(j):
sym_i = symbols[nbr[i]]
if token_dict[sym_i] >= token_dict[sym_j]:
triple = sym_i + '_' + sym_m + '_' + sym_j
else:
triple = sym_j + '_' + sym_m + '_' + sym_i
triple_list.append(triples_dict[triple])
uniques, counts = np.unique(triple_list, return_counts=True)
counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(triples), dtype=int)
for key, value in counts.items():
v[key] = value
return v
def get_quadgrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
symbols = get_symbols(G)
oxygens = [i for i, s in enumerate(symbols) if s == 'O']
metals = [i for i, s in enumerate(symbols) if s != 'O']
symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
quad_list = []
for o in oxygens:
sym_o = symbols[o]
nbr = neighbors_with_R(G, o)
for j in range(len(nbr)):
nbr_j = neighbors_with_R(G, nbr[j][0])
sym_j = symbols[nbr[j][0]]
for i in range(j):
nbr_i = neighbors_with_R(G, nbr[i][0])
sym_i = symbols[nbr[i][0]]
triple1 = sym_i + '_' + sym_o + '_' + sym_j
triple2 = sym_j + '_' + sym_o + '_' + sym_i
for kj in nbr_j:
if kj[0] == o and np.sum(np.abs(nbr[j][1]+kj[1])) < 1e-8: continue
quad1 = triple1 + '_' + symbols[kj[0]]
quad_list.append(quad_dict[quad1])
for ki in nbr_i:
if ki[0] == o and np.sum(np.abs(nbr[i][1]+ki[1])) < 1e-8: continue
quad2 = triple2 + '_' + symbols[ki[0]]
quad_list.append(quad_dict[quad2])
uniques, counts = np.unique(quad_list, return_counts=True)
counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(quads), dtype=int)
for key, value in counts.items():
v[key] = value
return v
def get_ngram_descriptors():
print ("\nGenerating n-grams descriptors...\n")
m_elements = ['Al', 'Ga', 'In']
m_tokens = []
print ("Generating tokens...")
for i in range(m_cn_min, m_cn_max+1):
m_tokens += [ s+str(i) for s in m_elements]
o_tokens = ['O'+str(i) for i in range(o_cn_min, o_cn_max+1)]
tokens = m_tokens + o_tokens
print ("Tokens:", tokens[:10], "...")
token_dict = dict([(tokens[i], i) for i in range(len(tokens))])
pairs = []
print ("Generating pairs...")
for m in m_tokens:
pairs += [ m + '_' + o for o in o_tokens]
print ("Pairs:", pairs[:10], "...")
pair_dict = dict([(pairs[i], i) for i in range(len(pairs))])
print ("Generating triples...")
mom_triples = []
for m1 in range(len(m_tokens)):
for m2 in range(m1+1):
mom_triples += [m_tokens[m1] + '_' + o + '_' + m_tokens[m2] for o in o_tokens]
omo_triples = []
for o1 in range(len(o_tokens)):
for o2 in range(o1+1):
omo_triples += [o_tokens[o1] + '_' + m + '_' + o_tokens[o2] for m in m_tokens]
triples = mom_triples + omo_triples
print ("Triples:", triples[:10], "...")
triples_dict = dict([(triples[i], i) for i in range(len(triples))])
print ("Generating quads...")
quads = []
for m1 in m_tokens:
for o1 in o_tokens:
for m2 in m_tokens:
quads += [m1 + '_' + o1 + '_' + m2 + '_' + o for o in o_tokens]
print ("Quads:", quads[:10], "...")
quad_dict = dict([(quads[i], i) for i in range(len(quads))])
print ("Generating grams...")
unigrams = []
bigrams = []
trigrams = []
quadgrams = []
count = 0
t0 = time.time()
for crystal in df_crystals.itertuples():
fn = INPUT_DIR + "{}/{}/geometry.xyz".format(crystal.dataset, crystal.id)
factor = cg.get_factor(crystal.spacegroup, crystal.lattice_angle_gamma_degree)
G = get_cg(fn, factor)
unigrams.append(get_unigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
bigrams.append(get_bigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
trigrams.append(get_trigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
quadgrams.append(get_quadgrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
count += 1
#if count % 500 == 0:
#print(count, time.time()-t0)
print ("Tokens:", len(tokens))
print ("Pairs:", len(pairs))
print ("Triples:", len(triples))
print ("Quads:", len(quads))
#np.savez("./data/kaggle_competition/models/ngram.npz", tokens=tokens, pairs=pairs, triples=triples, quads=quads, unigrams=unigrams, bigrams=bigrams, trigrams=trigrams, quadgrams=quadgrams)
return tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams
```
%% Cell type:code id: tags:get_ngram_descriptor
``` python
# Get N-gram descriptor
from IPython.display import HTML
display(HTML("""<br><div><font size=6em><b>Predictions with N-grams representation</b></font><hr></div>"""))
m_cn_min = 4
m_cn_max = 9
o_cn_min = 2
o_cn_max = 6
if if_use_prestored_models == "true":
ngram_model_saved = np.load('./data/kaggle_competition/models/ngram.npz')
display(HTML("""Read pre-stored N-grams model."""))
tokens = ngram_model_saved['tokens']
pairs = ngram_model_saved['pairs']
triples = ngram_model_saved['triples']
quads = ngram_model_saved['quads']
unigrams = ngram_model_saved['unigrams']
bigrams = ngram_model_saved['bigrams']
trigrams = ngram_model_saved['trigrams']
quadgrams = ngram_model_saved['quadgrams']
else:
tokens = []
pairs = []
triples = []
quads = []
unigrams = []
bigrams = []
trigrams = []
quadgrams = []
tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams = get_ngram_descriptors()
```
%% Cell type:code id: tags:krr-ngram
``` python
#KRR-ngram
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.preprocessing import MinMaxScaler
from joblib import parallel_backend  # sklearn.externals.joblib was removed in newer scikit-learn versions
def train_and_pred_KRR(X_train, y_train, X_test, nfold, nthread):
neg_root_mean_squared_error = make_scorer(RMSE, greater_is_better=False)
t0 = time.time()
clf = GridSearchCV(KernelRidge(kernel='rbf'),
cv=nfold, n_jobs=nthread, verbose=1,
scoring=neg_root_mean_squared_error,
param_grid={"alpha": np.logspace(-15, 5, 21, base=2),
"gamma": np.logspace(-15, 3, 19, base=2)})
with parallel_backend('threading'):
res = clf.fit(X_train, y_train)
#print("clf best_score:", -res.best_score_)
#print("clf best_params:", res.best_params_)
#print("clf fit:", time.time() - t0)
out_y_train = clf.predict(X_train)
#for i,j in zip(out_y_train, y_train):
# print(i, j)
t0 = time.time()
y_test = clf.predict(X_test)
#print("clf predict:", time.time() - t0)
return clf, y_test, out_y_train
def get_ngram_KRR_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size, n_threads):
#ngram_size = 3
#nfold = 5
#nthread = 4
print ("Size of N-gram:", ngram_size)
ngrams = {}
ngrams["tokens"]=tokens
ngrams["pairs"]=pairs
ngrams["triples"]=triples
ngrams["quads"]=quads
ngrams["unigrams"]=unigrams
ngrams["bigrams"]=bigrams
ngrams["trigrams"]=trigrams
ngrams["quadgrams"]=quadgrams
#print 'ngram_size: {}'.format(ngram_size)
if ngram_size == 4:
features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"], ngrams["quadgrams"]])
elif ngram_size == 3:
features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"]])
elif ngram_size == 2:
features = np.hstack([ngrams["unigrams"], ngrams["bigrams"]])
elif ngram_size == 1:
features = ngrams["unigrams"]
else:
        raise ValueError('ngram_size should be within [1, 4].')
print ('Ngram feature matrix shape: {}'.format(features.shape))
# Select column
count_per_column = np.sum(features, axis=0)
nonzero_columns = np.where(count_per_column > 0)[0]
print( 'Total column count: {}'.format(len(count_per_column)))
print( 'Nonzero-column count: {}'.format(len(nonzero_columns)))
features = features[:,nonzero_columns]
    print('Nonzero ngram feature matrix shape: {}'.format(np.shape(features)))
    # Cell volume from the lattice parameters (general triclinic formula):
    # V = a*b*c * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma) + 2*cos(alpha)*cos(beta)*cos(gamma))
lat_a = df_crystals.lattice_vector_1_ang.values
lat_b = df_crystals.lattice_vector_2_ang.values
lat_c = df_crystals.lattice_vector_3_ang.values
lat_cosa = np.cos(np.radians(df_crystals.lattice_angle_alpha_degree.values))
lat_cosb = np.cos(np.radians(df_crystals.lattice_angle_beta_degree.values))
lat_cosg = np.cos(np.radians(df_crystals.lattice_angle_gamma_degree.values))
volume = lat_a * lat_b * lat_c * np.sqrt(1. - lat_cosa**2 - lat_cosb**2 - lat_cosg**2 + 2. * lat_cosa * lat_cosb * lat_cosg)
# Scale
    features = np.array(features, dtype=float)
features = features / volume[:, np.newaxis]
scaler = MinMaxScaler()
features = scaler.fit_transform(features)
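    # fit on log1p-transformed targets so that the RMSE loss used in the grid search
    # matches the competition RMSLE; predictions are mapped back with np.expm1 below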
y1 = np.log1p(df_crystals.formation_energy_ev_natom.values[:len(df_train)])
y2 = np.log1p(df_crystals.bandgap_energy_ev.values[:len(df_train)])
X_train = features[:len(df_train)]
X_test = features[len(df_train):]
learning_curve = {}
if(if_learningcurve == "true"):
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
learning_curve_N_train = []
learning_curve_indtrain = []
learning_curve_formation_train = []
learning_curve_formation_test = []
learning_curve_bandgap_train = []
learning_curve_bandgap_test = []
for N_train in range_learning_curve:
indtrain = np.random.choice(len(X_train), N_train, replace=False)
clf1, y1_test, y1_train = train_and_pred_KRR(X_train[indtrain], y1[indtrain], X_test, nfold = 5, nthread = n_threads)
clf2, y2_test, y2_train = train_and_pred_KRR(X_train[indtrain], y2[indtrain], X_test, nfold = 5, nthread = n_threads)
learning_curve_formation_train.append(y1_train)
learning_curve_formation_test.append(y1_test)
learning_curve_bandgap_train.append(y2_train)
learning_curve_bandgap_test.append(y2_test)
learning_curve_N_train.append(N_train)
learning_curve_indtrain.append(indtrain)
learning_curve["formation_train"] = learning_curve_formation_train
learning_curve["formation_test"] = learning_curve_formation_test
learning_curve["bandgap_train"] = learning_curve_bandgap_train
learning_curve["bandgap_test"] = learning_curve_bandgap_test
learning_curve["N_train"] = learning_curve_N_train
learning_curve["indtrain"] = learning_curve_indtrain
clf1, y1_test, y1_train = train_and_pred_KRR(X_train, y1, X_test, nfold = 5, nthread = n_threads)
clf2, y2_test, y2_train = train_and_pred_KRR(X_train, y2, X_test, nfold = 5, nthread = n_threads)
# Find negative values
print('negative values of y1_test:', np.where(y1_test < 0))
print('negative values of y2_test:', np.where(y2_test < 0))
y1_test = np.clip(np.expm1(y1_test), 1e-8, None)
y2_test = np.clip(np.expm1(y2_test), 1e-8, None)
y1_train = np.clip(np.expm1(y1_train), 1e-8, None)
y2_train = np.clip(np.expm1(y2_train), 1e-8, None)
#print clf1.cv_results_
#print clf2.cv_results_
gridsearch_results = {}
gridsearch_results['formation'] = clf1.cv_results_
gridsearch_results['bandgap'] = clf2.cv_results_
return y1_train, y1_test, y2_train, y2_test, gridsearch_results, learning_curve
```
%% Cell type:code id: tags:predic_plot_ngram_krr
``` python
# Predict with N-gram + KRR and plot
from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Kernel ridge regression </b> </font><br></div>"""))
#print "\nTrain and predict with n-gram + KRR.\n"
formation_train_pred_ngram_krr, formation_test_pred_ngram_krr, bandgap_train_pred_ngram_krr, bandgap_test_pred_ngram_krr, gridsearch_results_ngram_krr, learning_curve_ngram_krr = get_ngram_KRR_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size = ngram_N, n_threads = n_threads)
x_grid_formation_ngram_krr = np.unique(gridsearch_results_ngram_krr['formation']['param_alpha'].data)
y_grid_formation_ngram_krr = np.unique(gridsearch_results_ngram_krr['formation']['param_gamma'].data)
z_grid_formation_ngram_krr = np.reshape(gridsearch_results_ngram_krr['formation']['mean_test_score'], [len(y_grid_formation_ngram_krr), -1])
x_grid_bandgap_ngram_krr = np.unique(gridsearch_results_ngram_krr['bandgap']['param_alpha'].data)
y_grid_bandgap_ngram_krr = np.unique(gridsearch_results_ngram_krr['bandgap']['param_gamma'].data)
z_grid_bandgap_ngram_krr = np.reshape(gridsearch_results_ngram_krr['bandgap']['mean_test_score'], [len(y_grid_bandgap_ngram_krr), -1])
run_cell_by_tag('plot_ngram_krr')
```
%% Cell type:code id: tags:plot_ngram_krr
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_ngram_krr, formation_test_true, bandgap_test_pred_ngram_krr, bandgap_test_true)
display(HTML("""<div><hr><font size=4em><b>Hyperparameter grid search</b></font></div>"""))
display(HTML("""<div><font size=3.5em><b>- Formation energy</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_formation_ngram_krr, y_grid_formation_ngram_krr, z_grid_formation_ngram_krr, grid = 8)
display(HTML("""<div><font size=3.5em><b>- Bandgap</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_bandgap_ngram_krr, y_grid_bandgap_ngram_krr, z_grid_bandgap_ngram_krr, grid = 10)
if(if_learningcurve == "true"):
display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
plot_learning_curve(learning_curve_ngram_krr)
display(HTML("""<br><hr><br><br>"""))
```
%% Cell type:code id: tags:ngram-nn
``` python
# N-gram + NN
import pandas as pd
import torch
import sys
def mean_free(model):
means = np.average(model, axis = 0)
#print("Shape of means:", np.shape(means))
model_mean_free = model - means
return model_mean_free
class NeuralNetwork(torch.nn.Module):
'''
def __init__(self,inplen, outlen):
super(NeuralNetwork, self).__init__()
self.layer1 = torch.nn.Linear(inplen, 100)
self.layer2 = torch.nn.Linear(100, 100)
self.layer3 = torch.nn.Linear(100, 100)
self.layer4 = torch.nn.Linear(100, 100)
self.layer5 = torch.nn.Linear(100, 100)
self.layer6 = torch.nn.Linear(100, 100)
self.layer7 = torch.nn.Linear(100, 100)
self.layer8 = torch.nn.Linear(100, 50)
self.layer9 = torch.nn.Linear(50, 50)
self.layer10 = torch.nn.Linear(50, 50)
self.layer11 = torch.nn.Linear(50, 50)
self.layer12 = torch.nn.Linear(50, outlen)
self.activation = torch.nn.LeakyReLU()
#self.dropout = torch.nn.Dropout(p=0.05)
def forward(self, inp):
x = self.activation(self.layer1(inp))
x = self.activation(self.layer2(x))
x = self.activation(self.layer3(x))
x = self.activation(self.layer4(x))
x = self.activation(self.layer5(x))
x = self.activation(self.layer6(x))
x = self.activation(self.layer7(x))
x = self.activation(self.layer8(x))
x = self.activation(self.layer9(x))
x = self.activation(self.layer10(x))
x = self.activation(self.layer11(x))
return self.layer12(x)#self.dropout(self.layer12(x))
'''
def __init__(self,inplen, outlen):
super(NeuralNetwork, self).__init__()
hidden_layers = []
batchnorms = []
hidden_layers.append(torch.nn.Linear(inplen, N_nn_neurons[0]))
batchnorms.append(torch.nn.BatchNorm1d(N_nn_neurons[0], affine=True))
for i in range(N_nn_layers-1):
hidden_layers.append(torch.nn.Linear(N_nn_neurons[i], N_nn_neurons[i+1], bias=True))
batchnorms.append(torch.nn.BatchNorm1d(N_nn_neurons[i+1], affine=True))
self.layer_hidden = torch.nn.ModuleList(hidden_layers)
self.batchnorms = torch.nn.ModuleList(batchnorms)
self.layer_out = torch.nn.Linear(N_nn_neurons[-1], outlen)
        self.activation = torch.nn.LeakyReLU()
    def forward(self, inp):
        x = self.activation(self.layer_hidden[0](inp))
        # apply all remaining hidden layers (the original loop stopped at
        # N_nn_layers-1 and silently skipped the last hidden layer)
        for i in range(1, N_nn_layers):
            x = self.activation(self.layer_hidden[i](x))
        return self.layer_out(x)
def ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train, y_train, x_test, y_test_true, learning_rate, n_epoch):
inp_trn = torch.autograd.Variable(
torch.Tensor(x_train.reshape(-1, nn_inp_len)))
out_trn = torch.autograd.Variable(
torch.Tensor(y_train.reshape(-1, nn_out_len)))
inp_test = torch.autograd.Variable(
torch.Tensor(x_test.reshape(-1, nn_inp_len)),
volatile=True)
out_test = torch.autograd.Variable(
torch.Tensor(y_test_true.reshape(-1, nn_out_len)),
volatile=True)
deep_fit = NeuralNetwork(nn_inp_len, nn_out_len)
loss_function = torch.nn.MSELoss()
optimizer = torch.optim.Adam(deep_fit.parameters(), lr=learning_rate)
sys.stdout.flush()
loss_trn, loss_test = [], []
    for _ in tnrange(n_epoch):
deep_fit.train()
loss = loss_function(deep_fit(inp_trn), out_trn)
loss.backward()
optimizer.step()
optimizer.zero_grad()
loss_trn.append(loss.data.float())
deep_fit.eval()
loss = loss_function(deep_fit(inp_test), out_test)
loss_test.append(loss.data.float())
    # final predictions on the training and test sets (still in log1p space)
    deep_fit_trn = deep_fit(inp_trn).data
    deep_fit_test = deep_fit(inp_test).data
sys.stdout.flush()
    # undo the log1p transform and clip away non-physical negative values
    return (np.clip(np.expm1(np.array(deep_fit_trn.flatten())), 1e-8, None),
            np.clip(np.expm1(np.array(deep_fit_test.flatten())), 1e-8, None))
def get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size):
print ("Size of N-gram:", ngram_size)
ngrams = {}
ngrams["tokens"]=tokens
ngrams["pairs"]=pairs
ngrams["triples"]=triples
ngrams["quads"]=quads
ngrams["unigrams"]=unigrams
ngrams["bigrams"]=bigrams
ngrams["trigrams"]=trigrams
ngrams["quadgrams"]=quadgrams
if ngram_size == 4:
features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"], ngrams["quadgrams"]])
elif ngram_size == 3:
features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"]])
elif ngram_size == 2:
features = np.hstack([ngrams["unigrams"], ngrams["bigrams"]])
elif ngram_size == 1:
features = ngrams["unigrams"]
    else:
        raise ValueError('ngram_size must be within [1, 4].')
print ('Ngram feature matrix shape: {}'.format(features.shape))
# Select column
count_per_column = np.sum(features, axis=0)
nonzero_columns = np.where(count_per_column > 0)[0]
print ('Total column count: {}'.format(len(count_per_column)) )
print ('Nonzero-column count: {}'.format(len(nonzero_columns)) )
features = features[:,nonzero_columns]
    print ('Nonzero ngram feature matrix shape: {}'.format(features.shape) )
# Volume
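    # For a triclinic cell with edge lengths a, b, c and angles alpha, beta,
    # gamma, the volume is
    #   V = a*b*c*sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
    #                    + 2*cos(alpha)*cos(beta)*cos(gamma))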
lat_a = df_crystals.lattice_vector_1_ang.values
lat_b = df_crystals.lattice_vector_2_ang.values
lat_c = df_crystals.lattice_vector_3_ang.values
lat_cosa = np.cos(np.radians(df_crystals.lattice_angle_alpha_degree.values))
lat_cosb = np.cos(np.radians(df_crystals.lattice_angle_beta_degree.values))
lat_cosg = np.cos(np.radians(df_crystals.lattice_angle_gamma_degree.values))
volume = lat_a * lat_b * lat_c * np.sqrt(1. - lat_cosa**2 - lat_cosb**2 - lat_cosg**2 + 2. * lat_cosa * lat_cosb * lat_cosg)
# Scale
    features = np.array(features, dtype=float)  # np.float is removed in newer numpy
features = features / volume[:, np.newaxis]
scaler = MinMaxScaler()
features = scaler.fit_transform(features)
#print 'Scaled features : \n',features
sys.stdout.flush()
y_train_Eform = np.log1p(df_crystals.formation_energy_ev_natom.values[:len(df_train)])
y_train_Egap = np.log1p(df_crystals.bandgap_energy_ev.values[:len(df_train)])
y_test_Eform_true = np.array(np.log1p(np.array(formation_test_true)))
y_test_Egap_true = np.array(np.log1p(np.array(bandgap_test_true)))
x_train = features[:len(df_train)]
x_test = features[len(df_train):]
x_train_meanfree = mean_free(x_train)
x_test_meanfree = mean_free(x_test)
nn_inp_len = len(nonzero_columns)
    nn_out_len = 1  # single regression target per compound
formation_train_pred_ngram_nn, formation_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree, y_train_Eform, x_test_meanfree, y_test_Eform_true, learning_rate=0.00012, n_epoch=600)
bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree, y_train_Egap, x_test_meanfree, y_test_Egap_true, learning_rate=0.00015, n_epoch=700)
learning_curve = {}
if(if_learningcurve == "true"):
learning_curve_N_train = []
learning_curve_indtrain = []
learning_curve_formation_train = []
learning_curve_formation_test = []
learning_curve_bandgap_train = []
learning_curve_bandgap_test = []
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
for N_train in range_learning_curve:
indtrain = np.random.choice(len(x_train_meanfree), N_train, replace=False)
formation_train_pred_ngram_nn, formation_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree[indtrain], y_train_Eform[indtrain], x_test_meanfree, y_test_Eform_true, learning_rate=0.00012, n_epoch=600)
bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree[indtrain], y_train_Egap[indtrain], x_test_meanfree, y_test_Egap_true, learning_rate=0.00015, n_epoch=700)
learning_curve_formation_train.append(formation_train_pred_ngram_nn)
learning_curve_formation_test.append(formation_test_pred_ngram_nn)
learning_curve_bandgap_train.append(bandgap_train_pred_ngram_nn)
learning_curve_bandgap_test.append(bandgap_test_pred_ngram_nn)
learning_curve_N_train.append(N_train)
learning_curve_indtrain.append(indtrain)
learning_curve["formation_train"] = learning_curve_formation_train
learning_curve["formation_test"] = learning_curve_formation_test
learning_curve["bandgap_train"] = learning_curve_bandgap_train
learning_curve["bandgap_test"] = learning_curve_bandgap_test
learning_curve["N_train"] = learning_curve_N_train
learning_curve["indtrain"] = learning_curve_indtrain
return formation_train_pred_ngram_nn , formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve
from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Neural network </b> </font><br></div>"""))
formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve_ngram_nn = get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size = ngram_N)
run_cell_by_tag('plot_ngram_nn')
```
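%% Cell type:markdown id: tags:
The network above is trained on `log1p`-transformed targets, so minimising the MSE on that scale corresponds to the competition's RMSLE metric; predictions are mapped back with `expm1` and clipped at a small positive value. A minimal sketch (synthetic numbers only) of this round trip:
%% Cell type:code id: tags:
``` python
import numpy as np

y = np.array([0.0, 0.05, 0.4, 3.2])       # synthetic target values (eV)
y_log = np.log1p(y)                        # scale the model is trained on
print(np.allclose(y, np.expm1(y_log)))     # True: the transform inverts exactly

raw_pred = np.array([-0.02, 0.1, 0.35, 3.0])    # hypothetical network outputs
pred = np.clip(np.expm1(raw_pred), 1e-8, None)  # clip non-physical negatives
print(pred)
```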
%% Cell type:code id: tags:plot_ngram_nn
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_ngram_nn, formation_test_true, bandgap_test_pred_ngram_nn, bandgap_test_true)
if(if_learningcurve == "true"):
display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
plot_learning_curve(learning_curve_ngram_nn)
display(HTML("""<br><hr><br><br>"""))
```
%% Cell type:code id: tags:soap
``` python
# SOAP descriptors
def get_atomic_soap_descriptors(geometry_xyz, soap_cutoff, soap_lmax, soap_nmax):
# get cell vectors from *.xyz file
cell_ = np.loadtxt(geometry_xyz, skiprows=3, usecols=(1,2,3))[:3]
# get atom positions and types from *.xyz file
atoms_ = pd.read_table(geometry_xyz, skiprows=6,
delim_whitespace=True, header=None, names=['x', 'y', 'z', 'type'],
usecols=[1,2,3,4])
# construct ase Atoms object
compound = ASEAtoms(atoms_.type.values,
positions = atoms_.values[:, :3],
cell = cell_,
pbc=[1, 1, 1])
    ###########################################################################
    # Which atoms serve as SOAP centers is specified separately from which
    # atoms are included in the environment: n_Z=4 Z={8 13 31 49} takes O,
    # Al, Ga and In as SOAP centers, and n_species=4 species_Z={8 13 31 49}
    # includes the same four species in the environment.
soap_command = "soap cutoff="
soap_command += str(soap_cutoff)
soap_command += " l_max="
soap_command += str(soap_lmax)
soap_command += " n_max="
soap_command += str(soap_nmax)
soap_command += " atom_sigma=0.5 n_Z=4 Z={8 13 31 49} n_species=4 species_Z={8 13 31 49}"
#print(soap_command)
desc = descriptors.Descriptor(soap_command) # for formation energies
#"soap cutoff=20 l_max=6 n_max=10 atom_sigma=0.5 n_Z=1 Z={8} n_species=4 species_Z={8 13 31 49}") # for bandgaps
compound.set_cutoff(desc.cutoff())
compound.calc_connect()
return desc.calc(compound)['descriptor']
def get_SOAP(soap_cutoff, soap_lmax, soap_nmax):
    # tnrange expects an integer, so iterate the id array with tqdm_notebook
    # (assumed to be imported alongside tnrange) to keep the progress bar
    for compound_id in tqdm_notebook(train_csv[:, 0].astype(int)):
compound_xyz = train_data+'/'+str(compound_id)+'/geometry.xyz'
train_mean_atomic_descriptor = np.mean(get_atomic_soap_descriptors(compound_xyz, soap_cutoff, soap_lmax, soap_nmax),
axis=0)
if 'train_mean_atomic_descriptors' in locals():
train_mean_atomic_descriptors = np.append(train_mean_atomic_descriptors,
train_mean_atomic_descriptor.reshape(1, -1), axis=0)
else:
train_mean_atomic_descriptors = train_mean_atomic_descriptor.reshape(1, -1)
print('mean_atomic_descriptor.shape: {}'.format(train_mean_atomic_descriptor.shape))
    for compound_id in tqdm_notebook(test_csv[:, 0].astype(int)):
compound_xyz = test_data+'/'+str(compound_id)+'/geometry.xyz'
test_mean_atomic_descriptor = np.mean(get_atomic_soap_descriptors(compound_xyz, soap_cutoff, soap_lmax, soap_nmax),
axis=0)
if 'test_mean_atomic_descriptors' in locals():
test_mean_atomic_descriptors = np.append(test_mean_atomic_descriptors,
test_mean_atomic_descriptor.reshape(1, -1), axis=0)
else:
test_mean_atomic_descriptors = test_mean_atomic_descriptor.reshape(1, -1)
#print('mean_atomic_descriptor.shape: {}'.format(test_mean_atomic_descriptor.shape))
#print(test_mean_atomic_descriptors)
#np.savez("./data/kaggle_competition/models/soap.npz", train_mean_atomic_descriptors = train_mean_atomic_descriptors, test_mean_atomic_descriptors = test_mean_atomic_descriptors)
return train_mean_atomic_descriptors, test_mean_atomic_descriptors
```
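%% Cell type:markdown id: tags:
`get_SOAP` reduces the variable number of per-atom SOAP vectors in each structure to a single fixed-length vector by averaging over atoms, so that structures with different atom counts share one feature matrix. A minimal numpy sketch of that step, with hypothetical shapes (quippy is not needed here):
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.RandomState(1)
# hypothetical per-atom SOAP matrices: one row per atom, one column per
# SOAP component; the atom count differs between structures
structure_descriptors = [rng.rand(n_atoms, 8) for n_atoms in (10, 16, 20)]

# averaging over atoms gives one fixed-length vector per structure
mean_descriptors = np.vstack([d.mean(axis=0) for d in structure_descriptors])
print(mean_descriptors.shape)  # (3, 8)
```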
%% Cell type:code id: tags:get_soap_descriptor
``` python
# Get SOAP descriptor
from IPython.display import HTML
display(HTML("""<br><div><font size=6em><b>Predictions with SOAP representation</b></font></div><hr>"""))
#"soap cutoff=10 l_max=4 n_max=4 atom_sigma=0.5 n_Z=4 Z={8 13 31 49} n_species=4 species_Z={8 13 31 49}") # for formation energies
#"soap cutoff=20 l_max=6 n_max=10 atom_sigma=0.5 n_Z=1 Z={8} n_species=4 species_Z={8 13 31 49}") # for bandgaps
if if_use_prestored_models == "true":
soap_model_saved = np.load('./data/kaggle_competition/models/soap.npz')
train_mean_atomic_descriptors = soap_model_saved['train_mean_atomic_descriptors']
test_mean_atomic_descriptors = soap_model_saved['test_mean_atomic_descriptors']
display(HTML("""Read pre-stored SOAP model."""))
else:
display(HTML("""Generating SOAP descriptors..."""))
train_mean_atomic_descriptors, test_mean_atomic_descriptors = get_SOAP(soap_cutoff, soap_lmax, soap_nmax)
#print "test_mean_atomic_descriptors:", np.shape(test_mean_atomic_descriptors), test_mean_atomic_descriptors[:10]
```
%% Cell type:code id: tags:soap-krr
``` python
# SOAP + KRR
# Requires: test_csv, train_csv, test_mean_atomic_descriptors, train_mean_atomic_descriptors
# Depends on the cell tagged 'krr-ngram', which defines train_and_pred_KRR
from IPython.display import HTML
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=True, with_std=True)
formation_train_true = train_csv[:, -2]
bandgap_train_true = train_csv[:, -1]
combined_mean_atomic_descriptors = scaler.fit_transform(np.append(
test_mean_atomic_descriptors,
train_mean_atomic_descriptors,
axis=0))
X_test = np.round(combined_mean_atomic_descriptors[:test_csv.shape[0]],
decimals=5)
X_train = np.round(combined_mean_atomic_descriptors[test_csv.shape[0]:],
decimals=5)
#display(HTML("""<br><br><hr<br><div><font size=4em><b>Representation:</b> SOAP descriptor</font><br></div>"""))
#display(HTML("""<div><font size=4em><b>Regression method:</b> Kernel ridge regression</font><br></div>"""))
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Kernel ridge regression </b> </font><br></div>"""))
def get_soap_KRR_predictions(X_train, formation_train_true, bandgap_train_true, n_threads):
formation_train_true_10times = formation_train_true * 10.0
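    # the formation energies are scaled by 10 before fitting and the
    # predictions are divided by 10 again below (presumably to put both
    # targets on a comparable numeric range)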
clf1_soap_krr, formation_test_pred_soap_krr, formation_train_pred_soap_krr = train_and_pred_KRR(X_train, formation_train_true_10times, X_test, nfold = 5, nthread = n_threads)
clf2_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr = train_and_pred_KRR(X_train, bandgap_train_true, X_test, nfold = 5, nthread = n_threads)
learning_curve = {}
if(if_learningcurve == "true"):
learning_curve_N_train = []
learning_curve_indtrain = []
learning_curve_formation_train = []
learning_curve_formation_test = []
learning_curve_bandgap_train = []
learning_curve_bandgap_test = []
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
print (range_learning_curve)
for N_train in range_learning_curve:
indtrain = np.random.choice(len(X_train), N_train, replace=False)
clf1, y1_test, y1_train = train_and_pred_KRR(X_train[indtrain], formation_train_true_10times[indtrain], X_test, nfold = 5, nthread = n_threads)
clf2, y2_test, y2_train = train_and_pred_KRR(X_train[indtrain], bandgap_train_true[indtrain], X_test, nfold = 5, nthread = n_threads)
learning_curve_formation_train.append(y1_train/10.0)
learning_curve_formation_test.append(y1_test/10.0)
learning_curve_bandgap_train.append(y2_train)
learning_curve_bandgap_test.append(y2_test)
learning_curve_N_train.append(N_train)
learning_curve_indtrain.append(indtrain)
learning_curve["formation_train"] = learning_curve_formation_train
learning_curve["formation_test"] = learning_curve_formation_test
learning_curve["bandgap_train"] = learning_curve_bandgap_train
learning_curve["bandgap_test"] = learning_curve_bandgap_test
learning_curve["N_train"] = learning_curve_N_train
learning_curve["indtrain"] = learning_curve_indtrain
formation_test_pred_soap_krr = formation_test_pred_soap_krr / 10.0
formation_train_pred_soap_krr = formation_train_pred_soap_krr / 10.0
gridsearch_results = {}
gridsearch_results['formation'] = clf1_soap_krr.cv_results_
gridsearch_results['bandgap'] = clf2_soap_krr.cv_results_
return formation_test_pred_soap_krr, formation_train_pred_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr, gridsearch_results, learning_curve
formation_test_pred_soap_krr, formation_train_pred_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr, gridsearch_results_soap_krr, learning_curve_soap_krr = get_soap_KRR_predictions(X_train, formation_train_true, bandgap_train_true, n_threads = n_threads)
x_grid_formation_soap_krr = np.unique(gridsearch_results_soap_krr['formation']['param_alpha'].data)
y_grid_formation_soap_krr = np.unique(gridsearch_results_soap_krr['formation']['param_gamma'].data)
z_grid_formation_soap_krr = np.reshape(gridsearch_results_soap_krr['formation']['mean_test_score']/10.0, [len(y_grid_formation_soap_krr), -1])
x_grid_bandgap_soap_krr = np.unique(gridsearch_results_soap_krr['bandgap']['param_alpha'].data)
y_grid_bandgap_soap_krr = np.unique(gridsearch_results_soap_krr['bandgap']['param_gamma'].data)
z_grid_bandgap_soap_krr = np.reshape(gridsearch_results_soap_krr['bandgap']['mean_test_score'], [len(y_grid_bandgap_soap_krr), -1])
run_cell_by_tag('plot_soap_krr')
```
%% Cell type:code id: tags:plot_soap_krr
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_soap_krr, formation_test_true, bandgap_test_pred_soap_krr, bandgap_test_true)
display(HTML("""<div><hr><font size=4em><b>Hyperparameter grid search</b></font></div>"""))
display(HTML("""<div><font size=3.5em><b>- Formation energy</b></font><br></div>"""))
# z_grid_formation_soap_krr was already rescaled when it was built, and the
# bandgap target was never multiplied by 10, so no further /10 is applied here
plot_hyperparameter_contour(x_grid_formation_soap_krr, y_grid_formation_soap_krr, z_grid_formation_soap_krr, grid = 3)
display(HTML("""<div><font size=3.5em><b>- Bandgap</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_bandgap_soap_krr, y_grid_bandgap_soap_krr, z_grid_bandgap_soap_krr, grid = 3)
if(if_learningcurve == "true"):
display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
plot_learning_curve(learning_curve_soap_krr)
```
%% Cell type:code id: tags:nn-soap
``` python
# SOAP + Neural network
from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#009FC2;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Neural network </b> </font><br></div>"""))
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(1)
class myNNRegressor(torch.nn.Module):
    '''Feed-forward regressor with batch normalisation and dropout; depth and
    layer widths are taken from the globals N_nn_layers and N_nn_neurons.'''
def __init__(self, feature_size):
super(myNNRegressor, self).__init__()
self.feature_size = feature_size
hiddens = []
batchnorms = []
hiddens.append(nn.Linear(self.feature_size, N_nn_neurons[0], bias=False))
batchnorms.append(nn.BatchNorm1d(N_nn_neurons[0], affine=True))
for i in range(N_nn_layers-1):
hiddens.append(nn.Linear(N_nn_neurons[i], N_nn_neurons[i+1], bias=False))
batchnorms.append(nn.BatchNorm1d(N_nn_neurons[i+1], affine=True))
self.h2s = nn.Linear(N_nn_neurons[-1], 1, bias = False)
self.hidden_layers = nn.ModuleList(hiddens)
self.batchnorm = nn.ModuleList(batchnorms)
self.activation = nn.ReLU()
self.dropout = nn.Dropout(p=0.20)
def forward(self, X):
h = self.activation(self.dropout(self.batchnorm[0](self.hidden_layers[0](X))))
for i in range(1, N_nn_layers):
h = self.activation(self.dropout(self.batchnorm[i](self.hidden_layers[i](h))))
predictions = self.h2s(h)
return predictions
def rmsle(real, predicted):
    # root mean squared logarithmic error; negative predictions are skipped
    # because log(x+1) is undefined for x <= -1
    total = 0.0
    values = predicted.cpu().data.numpy()
    for idx, x in enumerate(values):
        if x < 0:
            continue
        p = torch.log(predicted[idx] + 1)
        r = torch.log(real[idx] + 1)
        total += (p - r)**2
    return torch.sqrt(total / float(len(predicted)))
def get_predictions(nEpochs, X_train, y_train, X_val=None, y_val=None, evaluation=False):
    try:
        os.remove('./models/myBestModel.pt')  # same path as torch.save below
    except OSError:
        pass
feature_size = X_train.shape[1]
learning_rate = 1e-3
X_train = Variable(torch.from_numpy(X_train).float(),
requires_grad = False).cpu()
y_train = Variable(torch.from_numpy(y_train).float(),
requires_grad = False).cpu()
if evaluation:
X_val = Variable(torch.from_numpy(X_val).float(),
requires_grad = False, volatile = True).cpu()
y_val = Variable(torch.from_numpy(y_val).float(),
requires_grad = False, volatile = True).cpu()
NNRegressor = myNNRegressor(feature_size).cpu()
optimiser = torch.optim.Adam(NNRegressor.parameters(), lr=learning_rate)
loss_fn = nn.MSELoss()
#y_train = y_train.reshape(-1, 1)
final_train_loss = 99999
#for epoch in tqdm(range(nEpochs)):
for epoch in tnrange(nEpochs):
# Train
NNRegressor.train(True)
t0 = time.time()
predictions = NNRegressor(X_train)
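        # warm up with plain MSE for the first 100 epochs, then switch to
        # RMSLE, the competition's evaluation metric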
if epoch <= 100:
loss = loss_fn(predictions.reshape(-1), y_train)
else:
loss = rmsle(y_train, predictions.reshape(-1))
#print("loss:", loss)
NNRegressor.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(NNRegressor.parameters(), 1.0)
optimiser.step()
t1 = time.time()
# Evaluate
NNRegressor.train(False)
predictions_train = NNRegressor(X_train)
train_loss = float(rmsle(y_train, predictions_train.reshape(-1)))
#print("train_loss:", float(train_loss))
if evaluation:
predictions = NNRegressor(X_val)
val_loss = float(rmsle(y_val, predictions.reshape(-1)))
NNRegressor.train(True)
final_train_loss = float(train_loss)
if evaluation:
if 'best_val_loss' in locals():
if val_loss < best_val_loss:
best_val_loss = val_loss
with open('./models/myBestModel.pt', 'wb') as f:
torch.save(NNRegressor, f)
else:
best_val_loss = val_loss
with open('./models/myBestModel.pt', 'wb') as f:
torch.save(NNRegressor, f)
        if epoch == 149:
            # decay the learning rate; without touching the optimiser's
            # param_groups the new rate would never take effect
            learning_rate = 1e-5
            if evaluation:
                # restart from the best checkpoint with the lower rate
                del NNRegressor
                with open('./models/myBestModel.pt', 'rb') as f:
                    NNRegressor = torch.load(f)
                optimiser = torch.optim.Adam(NNRegressor.parameters(),
                                             lr=learning_rate)
            else:
                for group in optimiser.param_groups:
                    group['lr'] = learning_rate
if evaluation:
with open('./models/myBestModel.pt', 'rb') as f:
NNRegressor = torch.load(f)
NNRegressor.train(False)
predictions = NNRegressor(X_val)
        val_loss = float(rmsle(y_val, predictions.reshape(-1)))
print ('-----\nbest val. loss: {}\n_____'.format(val_loss))
else:
NNRegressor.train(False)
predictions = NNRegressor(X_train)
print("Final training loss(RMSE):", final_train_loss)
return predictions, NNRegressor
def train_test_NN_soap(x_trains, y_trains, x_tests):
ensemble_size = 1
all_test_predictions = []
all_train_predictions = []
    for regressor in range(ensemble_size):
predictions, NNRegressor = get_predictions(nEpochs=250,
X_train=x_trains, y_train=y_trains)
NNRegressor.train(False)
predictions = NNRegressor(x_tests).cpu().data.numpy()
train_predictions = NNRegressor(
Variable(torch.from_numpy(x_trains).float(),
requires_grad=False, volatile=True)
).cpu().data.numpy()
all_test_predictions.append(predictions.reshape(-1))
all_train_predictions.append(train_predictions.reshape(-1))
final_test_prediction = np.mean(all_test_predictions, axis = 0)
final_train_prediction = np.mean(all_train_predictions, axis = 0)
return final_test_prediction, final_train_prediction
def get_soap_NN_predictions(X_train, X_test, formation_train_true, bandgap_train_true):
formation_train_true_10times = formation_train_true * 10.0
formation_test_pred_soap_nn, formation_train_pred_soap_nn = train_test_NN_soap(X_train, formation_train_true_10times, X_test)
bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn = train_test_NN_soap(X_train, bandgap_train_true, X_test)
learning_curve = {}
if(if_learningcurve == "true"):
learning_curve_N_train = []
learning_curve_indtrain = []
learning_curve_formation_train = []
learning_curve_formation_test = []
learning_curve_bandgap_train = []
learning_curve_bandgap_test = []
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
print (range_learning_curve)
for N_train in range_learning_curve:
indtrain = np.random.choice(len(X_train), N_train, replace=False)
formation_test_pred_soap_nn, formation_train_pred_soap_nn = train_test_NN_soap(X_train[indtrain], formation_train_true_10times[indtrain], X_test)
bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn = train_test_NN_soap(X_train[indtrain], bandgap_train_true[indtrain], X_test)
learning_curve_formation_train.append(formation_train_pred_soap_nn / 10.0)
learning_curve_formation_test.append(formation_test_pred_soap_nn / 10.0)
learning_curve_bandgap_train.append(bandgap_train_pred_soap_nn)
learning_curve_bandgap_test.append(bandgap_test_pred_soap_nn)
learning_curve_N_train.append(N_train)
learning_curve_indtrain.append(indtrain)
learning_curve["formation_train"] = learning_curve_formation_train
learning_curve["formation_test"] = learning_curve_formation_test
learning_curve["bandgap_train"] = learning_curve_bandgap_train
learning_curve["bandgap_test"] = learning_curve_bandgap_test
learning_curve["N_train"] = learning_curve_N_train
learning_curve["indtrain"] = learning_curve_indtrain
formation_test_pred_soap_nn = formation_test_pred_soap_nn / 10.0
formation_train_pred_soap_nn = formation_train_pred_soap_nn / 10.0
return formation_test_pred_soap_nn, formation_train_pred_soap_nn, bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn, learning_curve
# Scale
scaler = StandardScaler(with_mean=True, with_std=True)
formation_train_true = train_csv[:, -2]
bandgap_train_true = train_csv[:, -1]
# mean atomic descriptor
combined_mean_atomic_descriptors = scaler.fit_transform(np.append(
test_mean_atomic_descriptors,
train_mean_atomic_descriptors,
axis=0))
X_test_soap = np.round(combined_mean_atomic_descriptors[:test_csv.shape[0]],
decimals=5)
X_test_soap = Variable(torch.from_numpy(X_test_soap).float(),
requires_grad = False, volatile = True).cpu()
#X_test = X_test.reshape(-1, 1)
X_train_soap = np.round(combined_mean_atomic_descriptors[test_csv.shape[0]:],
decimals=5)
#X_train = X_train.reshape(-1, 1)
formation_test_pred_soap_nn, formation_train_pred_soap_nn, bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn, learning_curve_soap_nn = get_soap_NN_predictions(X_train_soap, X_test_soap, formation_train_true, bandgap_train_true)
run_cell_by_tag('plot_soap_nn')
```
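%% Cell type:markdown id: tags:
The `volatile=True` flag used for the inference tensors above is a pre-0.4 PyTorch idiom; on PyTorch 0.4 and later it is ignored with a warning, and the equivalent is to run inference under `torch.no_grad()`. A minimal sketch of the modern pattern, with a hypothetical stand-in model and data:
%% Cell type:code id: tags:
``` python
import torch

model = torch.nn.Linear(4, 1)   # stand-in for a trained regressor
x = torch.randn(5, 4)           # stand-in for test features

model.eval()                    # disable dropout, use batch-norm running stats
with torch.no_grad():           # no autograd graph: saves memory and time
    preds = model(x)
print(preds.shape)              # torch.Size([5, 1])
```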
%% Cell type:code id: tags:plot_soap_nn
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_soap_nn, formation_test_true, bandgap_test_pred_soap_nn, bandgap_test_true)
if(if_learningcurve == "true"):
display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
plot_learning_curve(learning_curve_soap_nn)
```
%% Cell type:markdown id: tags:
## References
1. https://scikit-learn.org/stable/modules/kernel_ridge.html
2. https://scikit-learn.org/stable/modules/neural_networks_supervised.html
3. Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/
%% Cell type:code id: tags:
``` python
```