Skip to content
Snippets Groups Projects
Commit dc4b9f58 authored by Xiangyue Liu's avatar Xiangyue Liu
Browse files

Update kaggle_competition.ipynb

parent b71443e3
No related branches found
No related tags found
No related merge requests found
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%HTML %%HTML
<script> <script>
code_show=true; code_show=true;
function code_toggle() { function code_toggle() {
if (code_show) if (code_show)
{ {
$('div.input').hide(); $('div.input').hide();
} }
else else
{ {
$('div.input').show(); $('div.input').show();
} }
code_show = !code_show code_show = !code_show
} }
$( document ).ready(code_toggle); $( document ).ready(code_toggle);
</script> </script>
The raw code for this notebook is by default hidden for easier reading. The raw code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>. To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
import warnings import warnings
warnings.filterwarnings('ignore') warnings.filterwarnings('ignore')
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
from IPython.core.display import display, HTML from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>")) display(HTML("<style>.container { width:100% !important; }</style>"))
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%HTML %%HTML
<div align="left" style="background-color: rgba(149,170,79, 1.0); width: 100%; height: 470px"> <div align="left" style="background-color: rgba(149,170,79, 1.0); width: 100%; height: 470px">
<div > <div >
<table> <table>
<tr></tr> <tr></tr>
<tr> <tr>
<td><img id="nomad" align="right" src="https://nomad-coe.eu/uploads/nomad/images/NOMAD_Logo2.png" width="70%" alt="NOMAD Logo"></td> <td><img id="nomad" align="right" src="https://nomad-coe.eu/uploads/nomad/images/NOMAD_Logo2.png" width="70%" alt="NOMAD Logo"></td>
<td><font size=14em color="#20335d" align="left"><b>NOMAD Analytics Toolkit</b></font></td> <td><font size=14em color="#20335d" align="left"><b>NOMAD Analytics Toolkit</b></font></td>
<td><img height="85px" width="80px" src="https://www.nomad-coe.eu/uploads/nomad/backgrounds/head_big-data_analytics_2.png"></td> <td><img height="85px" width="80px" src="https://www.nomad-coe.eu/uploads/nomad/backgrounds/head_big-data_analytics_2.png"></td>
</tr> </tr>
</table> </table>
</div> </div>
<br><br> <br><br>
<div style="position:relative; left:3%"><font size=6em color="#20335d" ><b> - NOMAD 2018 Kaggle competition</b></font></div> <div style="position:relative; left:3%"><font size=6em color="#20335d" ><b> - NOMAD 2018 Kaggle competition</b></font></div>
<p style="position:relative;left:10%; "> <p style="position:relative;left:10%; ">
<br> <br>
Created by: Created by:
Xiangyue Liu<sup>1</sup> (<a href="mailto:xyliu@fhi-berlin.mpg.de">email</a>), Xiangyue Liu<sup>1</sup> (<a href="mailto:xyliu@fhi-berlin.mpg.de">email</a>),
Christopher Sutton<sup>1</sup> (<a href="mailto:sutton@fhi-berlin.mpg.de">email</a>), Christopher Sutton<sup>1</sup> (<a href="mailto:sutton@fhi-berlin.mpg.de">email</a>),
Luca M. Ghiringhelli<sup>1</sup>(<a href="mailto:ghiringhelli@fhi-berlin.mpg.de">email</a>), <br> Luca M. Ghiringhelli<sup>1</sup>(<a href="mailto:ghiringhelli@fhi-berlin.mpg.de">email</a>), <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Takenori Yamamoto<sup>2</sup>, &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Takenori Yamamoto<sup>2</sup>,
Yury Lysogorskiy<sup>3</sup>, Yury Lysogorskiy<sup>3</sup>,
Lars Blumenthal<sup>4,5</sup>, Lars Blumenthal<sup>4,5</sup>,
Thomas Hammerschmidt<sup>3</sup>, Thomas Hammerschmidt<sup>3</sup>,
Jacek Golebiowski<sup>4,5</sup>, <br> Jacek Golebiowski<sup>4,5</sup>, <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Angelo Ziletti<sup>1</sup>, Matthias Scheffler<sup>1</sup> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Angelo Ziletti<sup>1</sup>, Matthias Scheffler<sup>1</sup>
<br><br> <br><br>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br> <sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<sup>2</sup> Research Institute for Mathematical and Computational Sciences (RIMCS), LLC, Yokohama, Japan <br> <sup>2</sup> Research Institute for Mathematical and Computational Sciences (RIMCS), LLC, Yokohama, Japan <br>
<sup>3</sup> ICAMS, Ruhr-Universität Bochum, Germany <br> <sup>3</sup> ICAMS, Ruhr-Universität Bochum, Germany <br>
<sup>4</sup> EPSRC Centre for Doctoral Training on Theory and Simulation of Materials Department of Physics, Imperial College London, London, U.K. <br> <sup>4</sup> EPSRC Centre for Doctoral Training on Theory and Simulation of Materials Department of Physics, Imperial College London, London, U.K. <br>
<sup>5</sup> Thomas Young Centre for Theory and Simulation of Materials, Department of Materials, Imperial College London, London, U.K <br> <sup>5</sup> Thomas Young Centre for Theory and Simulation of Materials, Department of Materials, Imperial College London, London, U.K <br>
<br> <br>
</p> </p>
<br> <br>
<div style="position:relative;bottom:3%"> <div style="position:relative;bottom:3%">
<div style="position:absolute;right:5%;bottom: 0%;"><font color="#999999" size="10em">v1.0.0</font></div> <div style="position:absolute;right:5%;bottom: 0%;"><font color="#999999" size="10em">v1.0.0</font></div>
<div style="position:absolute;right:5%;bottom: 0%;"><font color="#666666" size="2.7em">[Last updated: March 21, 2019]</font></div> <div style="position:absolute;right:5%;bottom: 0%;"><font color="#666666" size="2.7em">[Last updated: March 21, 2019]</font></div>
</div> </div>
</div> </div>
<div style='text-align: right;'> <div style='text-align: right;'>
<a href="https://analytics-toolkit.nomad-coe.eu/home/" class="btn btn-primary" style="font-size:larger;">Back to Analytics Home</a> <a href="https://analytics-toolkit.nomad-coe.eu/home/" class="btn btn-primary" style="font-size:larger;">Back to Analytics Home</a>
<a href="https://www.nomad-coe.eu/" class="btn btn-primary" style="font-size:larger; ">Back to nomad-coe</a> <a href="https://www.nomad-coe.eu/" class="btn btn-primary" style="font-size:larger; ">Back to nomad-coe</a>
</div> </div>
``` ```
%% Output %% Output
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<font size=4em>[Skip introduction](#make_predictions)</font> <font size=4em>[Skip introduction](#make_predictions)</font>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# NOMAD 2018 Kaggle research competition # NOMAD 2018 Kaggle research competition
### A paradigm shift in solving materials science grand challenges by crowd sourcing solutions through an open and global big-data competition ### A paradigm shift in solving materials science grand challenges by crowd sourcing solutions through an open and global big-data competition
Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials. Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials.
To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence has organized a crowd-sourced data analytics competition with Kaggle, which is one of the most well known online platforms for hosting big-data competitions. Kaggle has a community of over a half of a million users from around the world with various backgrounds in computer science, statistics, biology, and medicine. The competition occurred from December 18th 2017 to February 15th, 2018 and involved nearly 900 participants. The goal of this competition was to develop or apply data analytics models for the prediction of two target properties: the formation energy (which is an indication of the stability of a material) and the bandgap energy (which is an indicator for the potential for transparency over the visible range) to facilitate the discovery of new transparent conductors and allow for advancements in (opto)electronic technologies. A total of 5,000.00 euros in prizes were awarded to the top-three participants with the best performing models (i.e., lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as: To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence has organized a crowd-sourced data analytics competition with Kaggle, which is one of the most well known online platforms for hosting big-data competitions. Kaggle has a community of over a half of a million users from around the world with various backgrounds in computer science, statistics, biology, and medicine. The competition occurred from December 18th 2017 to February 15th, 2018 and involved nearly 900 participants. 
The goal of this competition was to develop or apply data analytics models for the prediction of two target properties: the formation energy (which is an indication of the stability of a material) and the bandgap energy (which is an indicator for the potential for transparency over the visible range) to facilitate the discovery of new transparent conductors and allow for advancements in (opto)electronic technologies. A total of 5,000.00 euros in prizes were awarded to the top-three participants with the best performing models (i.e., lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as:
$RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N}{(log(\hat{y}_i + 1) - log(y_i + 1))^2} }$ $RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N}{(log(\hat{y}_i + 1) - log(y_i + 1))^2} }$
where $N$ is the total number of observations, $\hat{y}_i$ is the predicted value, $y_i$ is the reference value for either the formation or bandgap energies. where $N$ is the total number of observations, $\hat{y}_i$ is the predicted value, $y_i$ is the reference value for either the formation or bandgap energies.
The dataset consists of 3,000 materials, 2,400 of which made the training set and the remaining 600 were used as the test set (i.e., only structures and input features were provided), with the target properties kept secret. Of that test set, 100 materials were used to determine the public leaderboard score so that the participants could assess their model performance on the fly (but the exact values used in this assessment were kept secret). The top three winners of the competition were determined the private leaderboard score based on the test set of 500 materials with the target properties kept secret. The dataset consists of 3,000 materials, 2,400 of which made the training set and the remaining 600 were used as the test set (i.e., only structures and input features were provided), with the target properties kept secret. Of that test set, 100 materials were used to determine the public leaderboard score so that the participants could assess their model performance on the fly (but the exact values used in this assessment were kept secret). The top three winners of the competition were determined the private leaderboard score based on the test set of 500 materials with the target properties kept secret.
Because only 100 values were used for assessing the performance on the leaderboard, participants had to ensure the predictive accuracy of their model for unseen data, even if a disagreement was found with the public leaderboard score. This is evident in the summary of the average RMSLE for all of the participants with scores below 0.25 in Figure 1, where a large shift in the values between the public leaderboard (100 compounds) and private leaderboard (500 compounds). The winning score has a RMSLE of 0.0509, while the 2nd and 3rd places winners were closely stacked together with a RMSLE of 0.0521 and 0.0523. However, within the first bin, there were a total of four participants with an RMLSE 0.053 (i.e., 0.45% of participants). Because only 100 values were used for assessing the performance on the leaderboard, participants had to ensure the predictive accuracy of their model for unseen data, even if a disagreement was found with the public leaderboard score. This is evident in the summary of the average RMSLE for all of the participants with scores below 0.25 in Figure 1, where a large shift in the values between the public leaderboard (100 compounds) and private leaderboard (500 compounds). The winning score has a RMSLE of 0.0509, while the 2nd and 3rd places winners were closely stacked together with a RMSLE of 0.0521 and 0.0523. However, within the first bin, there were a total of four participants with an RMLSE 0.053 (i.e., 0.45% of participants).
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# More about the dataset: Group-III transparent conductors # More about the dataset: Group-III transparent conductors
Group-III oxides provide promising candidates for wide band-gap transparent conductors. A wide range of experimental band gap energies from 3.6 to 7.5 eV have been reported from alloying In$_2$O$_3$/Ga$_2$O$_3$ or Ga$_2$O$_3$/Al$_2$O$_3$, Group-III oxides provide promising candidates for wide band-gap transparent conductors. A wide range of experimental band gap energies from 3.6 to 7.5 eV have been reported from alloying In$_2$O$_3$/Ga$_2$O$_3$ or Ga$_2$O$_3$/Al$_2$O$_3$,
which suggest that alloying of group-III oxides is a viable strategy for designing new wide band gap semiconductors. However, Al$_2$O$_3$, Ga$_2$O$_3$, and In$_2$O$_3$ all display very different which suggest that alloying of group-III oxides is a viable strategy for designing new wide band gap semiconductors. However, Al$_2$O$_3$, Ga$_2$O$_3$, and In$_2$O$_3$ all display very different
ground-state structures. Therefore, it is unclear which structure will be stable for various compositions. Within the dataset, there are six different lattice symmetries: $C2/m$, $Pna2_1$, $R\bar{3}c$, $P6_3/mmc$, $Ia\bar{3}$, $Fd\bar{3}m$. ground-state structures. Therefore, it is unclear which structure will be stable for various compositions. Within the dataset, there are six different lattice symmetries: $C2/m$, $Pna2_1$, $R\bar{3}c$, $P6_3/mmc$, $Ia\bar{3}$, $Fd\bar{3}m$.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%html %%html
<!-- CSS Style Inline: --> <!-- CSS Style Inline: -->
<style type="text/css"> <style type="text/css">
#jmol_div227{ #jmol_div227{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div12{ #jmol_div12{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div206{ #jmol_div206{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div33{ #jmol_div33{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div194{ #jmol_div194{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div167{ #jmol_div167{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
</style> </style>
<!-- Load Jmol javascript library --> <!-- Load Jmol javascript library -->
<script type="text/javascript" src="assets/kaggle_competition/jsmol/JSmol.min.js"></script> <script type="text/javascript" src="assets/kaggle_competition/jsmol/JSmol.min.js"></script>
<!-- calls to jQuery and Jmol (inline) --> <!-- calls to jQuery and Jmol (inline) -->
<script type="text/javascript"> <script type="text/javascript">
// Jmol readyFunction // Jmol readyFunction
jmol_isReady = function(applet) { jmol_isReady = function(applet) {
document.title = (applet._id + " - Jmol " + Jmol.___JmolVersion) document.title = (applet._id + " - Jmol " + Jmol.___JmolVersion)
Jmol._getElement(applet, "appletdiv").style.border="0px solid blue" Jmol._getElement(applet, "appletdiv").style.border="0px solid blue"
} }
// initialize Jmol Applet // initialize Jmol Applet
var myJmol227 = "myJmol227"; var myJmol227 = "myJmol227";
var Info227 = { var Info227 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry3-spacegroup227.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry3-spacegroup227.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol12 = "myJmol12"; var myJmol12 = "myJmol12";
var Info12 = { var Info12 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry8-spacegroup12.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry8-spacegroup12.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol206 = "myJmol206"; var myJmol206 = "myJmol206";
var Info206 = { var Info206 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry7-spacegroup206.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry7-spacegroup206.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol33 = "myJmol33"; var myJmol33 = "myJmol33";
var Info33 = { var Info33 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry1-spacegroup33.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry1-spacegroup33.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol194 = "myJmol194"; var myJmol194 = "myJmol194";
var Info194 = { var Info194 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry2-spacegroup194.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry2-spacegroup194.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol167 = "myJmol167"; var myJmol167 = "myJmol167";
var Info167 = { var Info167 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry4-spacegroup167.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry4-spacegroup167.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
// jQuery ready functions // jQuery ready functions
// is called when page has been completely loaded // is called when page has been completely loaded
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div227").html(Jmol.getAppletHtml(myJmol227, Info227)) $("#jmol_div227").html(Jmol.getAppletHtml(myJmol227, Info227))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div12").html(Jmol.getAppletHtml(myJmol12, Info12)) $("#jmol_div12").html(Jmol.getAppletHtml(myJmol12, Info12))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div206").html(Jmol.getAppletHtml(myJmol206, Info206)) $("#jmol_div206").html(Jmol.getAppletHtml(myJmol206, Info206))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div33").html(Jmol.getAppletHtml(myJmol33, Info33)) $("#jmol_div33").html(Jmol.getAppletHtml(myJmol33, Info33))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div194").html(Jmol.getAppletHtml(myJmol194, Info194)) $("#jmol_div194").html(Jmol.getAppletHtml(myJmol194, Info194))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div167").html(Jmol.getAppletHtml(myJmol167, Info167)) $("#jmol_div167").html(Jmol.getAppletHtml(myJmol167, Info167))
}) })
var lastPrompt=0; var lastPrompt=0;
function show_hide_structures() function show_hide_structures()
{ {
var x = document.getElementById("geometry_jmol"); var x = document.getElementById("geometry_jmol");
if (x.style.display === "none") if (x.style.display === "none")
{ {
x.style.display = "block"; x.style.display = "block";
} }
else else
{ {
x.style.display = "none"; x.style.display = "none";
} }
} }
</script> </script>
<div> <div>
<button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_hide_structures()">Show/hide structures</button> <button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_hide_structures()">Show/hide structures</button>
<br><br> <br><br>
</div> </div>
<div id="geometry_jmol"> <div id="geometry_jmol">
<table> <table>
<tr> <tr>
<th><center>Spacegroup 12 (C2/m)</center></th> <th><center>Spacegroup 12 (C2/m)</center></th>
<th><center>Spacegroup 33 (Pna2<sub>1</sub>)</center></th> <th><center>Spacegroup 33 (Pna2<sub>1</sub>)</center></th>
<th><center>Spacegroup 167 (R<font style="text-decoration: overline">3</font>c)</center></th> <th><center>Spacegroup 167 (R<font style="text-decoration: overline">3</font>c)</center></th>
</tr> </tr>
<tr> <tr>
<th><div id='jmol_div12'></div></th> <th><div id='jmol_div12'></div></th>
<th><div id='jmol_div33'></div></th> <th><div id='jmol_div33'></div></th>
<th><div id='jmol_div167'></div></th> <th><div id='jmol_div167'></div></th>
</tr> </tr>
<tr> <tr>
<th><center>Spacegroup 194 (P6<sub>3</sub>/mmc)</center></th> <th><center>Spacegroup 194 (P6<sub>3</sub>/mmc)</center></th>
<th><center>Spacegroup 206 (Ia<font style="text-decoration: overline">3</font>)</center></th> <th><center>Spacegroup 206 (Ia<font style="text-decoration: overline">3</font>)</center></th>
<th><center>Spacegroup 227 (Fd<font style="text-decoration: overline">3</font>m)</center></th> <th><center>Spacegroup 227 (Fd<font style="text-decoration: overline">3</font>m)</center></th>
</tr> </tr>
<tr> <tr>
<th><div id='jmol_div194'></div></th> <th><div id='jmol_div194'></div></th>
<th><div id='jmol_div206'></div></th> <th><div id='jmol_div206'></div></th>
<th><div id='jmol_div227'></div></th> <th><div id='jmol_div227'></div></th>
</tr> </tr>
</table> </table>
</div> </div>
``` ```
%% Output %% Output
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# Winning solutions # Winning solutions
- __1st place winning solution:__ crystal graph n-grams + Kernel Ridge Regression - __1st place winning solution:__ crystal graph n-grams + Kernel Ridge Regression
- __3rd place winning solution:__ SOAP-based descriptor + Neural network - __3rd place winning solution:__ SOAP-based descriptor + Neural network
## - Representations: crystal graph n-grams, SOAP-based descriptor ## - Representations: crystal graph n-grams, SOAP-based descriptor
### 1. Crystal graph n-grams ### 1. Crystal graph n-grams
The 1st place winning solution was obtained using the metal-oxygen coordination number derived from the number of bonds that are within the sum of the ionic Shannon experimental radii (which were enlarged by 30-50% depending on the crystal structure type). These ionic bonds are then used for building a crystal graph, where each atom is a node in the graph and the corresponding edges between nodes are defined by the ionic bond, which are shown as coordination numbers for each atom for a sequence of 6 atoms. The 1st place winning solution was obtained using the metal-oxygen coordination number derived from the number of bonds that are within the sum of the ionic Shannon experimental radii (which were enlarged by 30-50% depending on the crystal structure type). These ionic bonds are then used for building a crystal graph, where each atom is a node in the graph and the corresponding edges between nodes are defined by the ionic bond, which are shown as coordination numbers for each atom for a sequence of 6 atoms.
<img src="assets/kaggle_competition/ngram.png" alt="Drawing" style="width: 700px;"/> <img src="assets/kaggle_competition/ngram.png" alt="Drawing" style="width: 700px;"/>
<center> Figure 1 : Depiction of a crystal graph representation of In3Ga1O6 showing the connections between each atom (node) that are defined by the ionic bonds. </center> <center> Figure 1 : Depiction of a crystal graph representation of In3Ga1O6 showing the connections between each atom (node) that are defined by the ionic bonds. </center>
### 2. SOAP-based descriptor ### 2. SOAP-based descriptor
In the 3rd place winning solution, the smooth overlap of atomic positions (SOAP) kernel developed by Bartók et al. that incorporates information on the local atomic environment through a rotationally integrated overlap of neighbor densities.20-21 The SOAP kernel describes the local environment for a given atom ($i$) through the sum of Gaussians centered on each of the atomic neighbors ($j$) within a specific cutoff radius ($r_{ij}$): In the 3rd place winning solution, the smooth overlap of atomic positions (SOAP) kernel developed by Bartók et al. that incorporates information on the local atomic environment through a rotationally integrated overlap of neighbor densities.20-21 The SOAP kernel describes the local environment for a given atom ($i$) through the sum of Gaussians centered on each of the atomic neighbors ($j$) within a specific cutoff radius ($r_{ij}$):
$\rho_{i}(r) = \sum_{j} exp(\frac{-(r-r_{ij})^2}{2\sigma^2_{atom}}) f_{cut}(r_{ij})$ $\rho_{i}(r) = \sum_{j} exp(\frac{-(r-r_{ij})^2}{2\sigma^2_{atom}}) f_{cut}(r_{ij})$
where $\sigma_{atom}$ is a smoothing parameter and the switching function $f_{cut}$ goes smoothly to zero beyond a specified radial value. This local atomic neighbor density can be expanded in terms of spherical harmonics and orthogonal radial functions, while the expansion coefficients are then combined to form the rotationally invariant power spectrum corresponding to the neighbor density for each atom. where $\sigma_{atom}$ is a smoothing parameter and the switching function $f_{cut}$ goes smoothly to zero beyond a specified radial value. This local atomic neighbor density can be expanded in terms of spherical harmonics and orthogonal radial functions, while the expansion coefficients are then combined to form the rotationally invariant power spectrum corresponding to the neighbor density for each atom.
## - Regression methods: Kernel ridge regression, Neural network ## - Regression methods: Kernel ridge regression, Neural network
### 1. Kernel ridge regression (KRR) $^1$ ### 1. Kernel ridge regression (KRR) $^1$
Kernel ridge regression (KRR) is a generalization of ridge regression, where a linear function is learned in the space induced by the respective kernel and the data. The squared loss is minimized with a squared norm L2 regularization term. Kernel ridge regression (KRR) is a generalization of ridge regression, where a linear function is learned in the space induced by the respective kernel and the data. The squared loss is minimized with a squared norm L2 regularization term.
The model learned by KRR has the same form with support vector regression (SVR). In both KRR and SVR, the L2 regularization are used, but KRR uses squared error loss function which is different than SVR. The learned model by KRR is non-sparse, therefore the fitting of KRR is typically slower than SVR. The model learned by KRR has the same form with support vector regression (SVR). In both KRR and SVR, the L2 regularization are used, but KRR uses squared error loss function which is different than SVR. The learned model by KRR is non-sparse, therefore the fitting of KRR is typically slower than SVR.
### 2. Neural network $^2$
In the 3rd-place winning solution, a multi-layer perceptron (MLP) is employed. An MLP is composed of more than one perceptron: typically it has an input layer, several hidden layers, and an output layer. The input of the network is transformed using a learnt non-linear transformation. As a supervised learning algorithm, an MLP learns a function $f(\cdot):R^{I} \rightarrow R^{O}$, where $I$ is the number of input dimensions and $O$ is the number of output dimensions. Several parameters are adjusted during training, including the weights of the neurons and the biases. Gradient-based optimisation algorithms, such as stochastic gradient descent, are employed to minimize the loss function. There are different choices of loss function, for example the root mean squared error (RMSE) or the cross entropy; one can also switch loss functions during training to obtain better gradients. Figure 2 shows a one-hidden-layer MLP with scalar output as an example.
<img src="assets/kaggle_competition/multilayerperceptron_network.png" alt="Drawing" style="width: 400px;"/>
<center> Figure 2 : One hidden layer MLP $^2$. </center>
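The MLP described above can be sketched with scikit-learn's `MLPRegressor` (again on illustrative synthetic data; the layer sizes are placeholders, not the winning architecture):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X).sum(axis=1)          # scalar target, as in Figure 2

# Two hidden layers; the weights and biases are adjusted by a
# gradient-based optimizer (here "adam") minimizing a squared-error loss
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mlp.fit(X, y)
pred = mlp.predict(X)
```

The interactive form below exposes the same knobs (number of hidden layers and neurons per layer) for the actual models used in this notebook.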
%% Cell type:markdown id: tags:
<a id='make_predictions'></a>
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
%%HTML
<br><br><br>
<font size="6.5em"><b>Make predictions on formation energies and bandgaps</b></font>
<br><hr><br>
<font size="5em"><b>Winning representations combined with different regression methods</b></font>
<br><br>
<font size = "3.5em"> To understand the relative importance of the representation vs. the regression model, one can examine the performance of each representation combined with different regression models.
The hyperparameters are optimized for each representation/regressor combination. </font>
<br>
<font size = "3.5em" color="009FC2"><br>Warning: the learning algorithms employed in this study (e.g. grid search) cannot guarantee deterministic results. The actual predictions may diverge from the published data.
</font>
<br><br><br>
<form>
<font size="4em">Select a representation and a regression method:</font>
<select id="representation" onclick="show_options()">
<option value="ngram">n-gram</option>
<option value="soap">SOAP</option>
</select>
<select id="regression" onclick="show_options()">
<option value="krr">KRR</option>
<option value="nn">Neural network</option>
</select>
<button type="button" id="button_show_options" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px" onclick="show_options()">More options</button>
<button type="button" id="button_hide_options" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px; display: none" onclick="hide_options()">Fewer options</button>
</form>
<script type="text/Javascript">
function show_options()
{
    representation_value = document.getElementById("representation").value;
    regression_value = document.getElementById("regression").value;
    if(representation_value == "ngram")
    {
        document.getElementById("div_options_soap").style.display="none";
        document.getElementById("div_options_ngram").style.display="";
        document.getElementById("button_show_options").style.display="none";
        document.getElementById("button_hide_options").style.display="";
    }
    else if(representation_value == "soap")
    {
        document.getElementById("div_options_ngram").style.display="none";
        document.getElementById("div_options_soap").style.display="";
        document.getElementById("button_show_options").style.display="none";
        document.getElementById("button_hide_options").style.display="";
    }
    if(regression_value == "nn")
    {
        document.getElementById("div_options_nn").style.display="";
        set_nn_default();
    }
    else if(regression_value == "krr")
    {
        document.getElementById("div_options_nn").style.display="none";
    }
}
function hide_options()
{
    document.getElementById("button_show_options").style.display="";
    document.getElementById("button_hide_options").style.display="none";
    document.getElementById("div_options_ngram").style.display="none";
    document.getElementById("div_options_soap").style.display="none";
    document.getElementById("div_options_nn").style.display="none";
}
function set_nn_n_neurons()
{
    document.getElementById("div_options_nn_n_neurons").style.display="";
    n_layers = document.getElementById("input_n_nn").value;
    contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>";
    for(i=0; i<n_layers; i++)
    {
        content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"256\" min=\"1\" max=\"1024\" >";
        contents += content_add;
        if((i+1)%6 == 0)
        {
            contents += "<br>";
        }
    }
    document.getElementById("div_options_nn_n_neurons").innerHTML = contents;
}
function set_nn_default()
{
    document.getElementById("div_options_nn_n_neurons").style.display="";
    representation_value = document.getElementById("representation").value;
    if(representation_value == "ngram")
    {
        n_layers = 11;
        document.getElementById("input_n_nn").value = n_layers;
        contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>";
        for(i=0; i<7; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"100\" min=\"1\" max=\"1024\" >";
            contents += content_add;
            if((i+1)%6 == 0)
            {
                contents += "<br>";
            }
        }
        for(i=7; i<11; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"50\" min=\"1\" max=\"1024\" >";
            contents += content_add;
            if((i+1)%6 == 0)
            {
                contents += "<br>";
            }
        }
    }
    else if(representation_value == "soap")
    {
        n_layers = 3;
        document.getElementById("input_n_nn").value = n_layers;
        contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>";
        for(i=0; i<1; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"512\" min=\"1\" max=\"1024\" >";
            contents += content_add;
        }
        for(i=1; i<3; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"256\" min=\"1\" max=\"1024\" >";
            contents += content_add;
        }
    }
    document.getElementById("div_options_nn_n_neurons").innerHTML = contents;
    save_options_nn();
}
function save_options_nn()
{
    n_layers = document.getElementById("input_n_nn").value;
    var n_neurons = [];
    for(i=0; i<n_layers; i++)
    {
        id_i = "input_n_neurons_" + String(i);
        n_neurons_i = document.getElementById(id_i).value;
        n_neurons.push(n_neurons_i);
    }
    var command = "N_nn_neurons = " + n_neurons + ";";
    command += "N_nn_layers = int(" + n_layers + ");";
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
}
function save_options()
{
    representation_value = document.getElementById("representation").value;
    regression_value = document.getElementById("regression").value;
    if(representation_value == "ngram")
    {
        ngram_N = document.getElementById("input_ngram_N").value;
        n_threads = document.getElementById("input_n_threads").value;
        var command = "ngram_N = int(" + ngram_N + "); n_threads = int(" + n_threads + ");";
    }
    if(representation_value == "soap")
    {
        soap_cutoff = document.getElementById("input_soap_cutoff").value;
        soap_lmax = document.getElementById("input_soap_lmax").value;
        soap_nmax = document.getElementById("input_soap_nmax").value;
        n_threads = document.getElementById("input_n_threads").value;
        var command = "soap_cutoff = int(" + soap_cutoff + ");";
        command += "soap_lmax = int(" + soap_lmax + ");";
        command += "soap_nmax = int(" + soap_nmax + ");";
        command += "n_threads = int(" + n_threads + ");";
    }
    console.log("Executing Command: " + command);
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
}
</script>
<div id="div_options_ngram" style="display: none; position:relative; left:5%">
<br>
<font size="3em">N-gram size (N = 1 ~ 4)</font> <input type="number" id="input_ngram_N" value="3" min="1" max="4" >
<br>
<font size="3em">Number of threads used in regression (N = 1 ~ 4)</font> <input type="number" id="input_n_threads" value="4" min="1" max="4" >
<br>
<button type="button" id="save_options_ngram" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options()">Save options</button>
</div>
<div id="div_options_soap" style="display: none; position:relative; left:5%">
<br>
<font size="3em">SOAP cutoff (Ang) </font> <input type="number" id="input_soap_cutoff" value="10" min="1" max="25">
<font size="3em">Max. l </font><input type="number" id="input_soap_lmax" value="4" min="1" max="8" >
<font size="3em">Max. n </font><input type="number" id="input_soap_nmax" value="4" min="1" max="8" >
<br>
<font size="3em">Number of threads used in regression (N = 1 ~ 4)</font> <input type="number" id="input_n_threads" value="4" min="1" max="4" >
<br>
<button type="button" id="save_options_soap" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options()">Save options</button>
</div>
<br>
<div id="div_options_nn" style="display: none; position:relative; left:5%">
<br>
<font size="3em">Number of (linear) hidden layers </font> <input type="number" id="input_n_nn" value="2" min="1" max="20">
<button type="button" id="set_n_neurons" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:250px; height:30px;" onclick="set_nn_n_neurons()">Set the number of neurons</button>
<div id="div_options_nn_n_neurons" style="display: none; position:relative; left:0%">
<br>
<font size="3em">Number of neurons in each hidden layer</font>
</div>
<br>
<button type="button" id="set_nn_default" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px;" onclick="set_nn_default()">Use default configuration</button>
<button type="button" id="save_options_nn" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options_nn()">Save options</button>
</div>
<div id="demoa"></div>
<br><br>
<button type="button" id="button_show_sample_output" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_sample_output()">Show sample outputs</button>
<button type="button" id="button_hide_sample_output" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px; display: none" onclick="hide_sample_output()">Hide sample outputs</button>
<button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px" onclick="get_repr_regr_combination()">Get predictions</button>
<input type="checkbox" id="if_learning_curve" value="learning_curve" onclick=show_warning_learning_curve()> Show learning curve
<input type="checkbox" id="if_use_prestored" value="use_prestored" onclick=show_warning_use_prestored_model() checked=true> Use prestored models
<div id="warning_learning_curve" style="display: none">
<br>
<font color="009FC2">
Warning: generating a learning curve can be very time-consuming (5~20 min/point, depending on the method, model size, and number of threads employed).
</font>
<p><font size="3em">Number of points in learning curve: <input type="number" id="N_learning_curve" value="4" min="1" max="25" style="display:none"></font></p>
</div>
<div id="warning_use_prestored" style="display: none">
<br>
<font color="009FC2">
Warning: generating models on the fly can be very time-consuming (10~20 min, depending on the method, model size, and number of threads employed).
</font>
</div>
<div id="sample_output" style="display: none">
<br><br>
<font size="4em"><b> Sample outputs of predictions on formation energy and bandgap</b></font>
<br><br>
<font size="3em">Representation: N-gram<br><br>Regression method: Kernel ridge regression (KRR)</font>
<br><br>
<img src="assets/kaggle_competition/results-ngram-krr.png" width="60%">
<br><br>
</div>
<style>
* {
    box-sizing: border-box;
}
.column {
    float: left;
    width: 33.33%;
    padding: 5px;
}
/* Clearfix (clear floats) */
.row::after {
    content: "";
    clear: both;
    display: table;
}
</style>
<div id="sample_learning_curve" style="display: none">
<font size="4em"><b> Sample outputs of learning curves for formation energy and bandgap predictions</b></font>
<br><br>
<font size="3em">Representation: N-gram<br><br>Regression method: Kernel ridge regression (KRR)</font>
<br><br>
<div style="display: table; width: 95%">
<div class="row">
<div class="column">
<img src="./imgs/learning_curve-formation-ngram-krr.png" style="width:100%">
</div>
<div class="column">
<img src="./imgs/learning_curve-bandgap-ngram-krr.png" style="width:100%">
</div>
</div>
</div>
<br><br>
</div>
<script type="text/Javascript">
window.findCellIndicesByTag = function findCellIndicesByTag(tagName) {
    return (Jupyter.notebook.get_cells()
        .filter(
            ({metadata: {tags}}) => tags && tags.includes(tagName)
        )
        .map((cell) => Jupyter.notebook.find_cell_index(cell))
    );
};
window.runCells = function runPlotCells(tags) {
    var c = window.findCellIndicesByTag(tags);
    Jupyter.notebook.execute_cells(c);
};
function show_sample_output()
{
    document.getElementById("button_show_sample_output").style.display="none";
    document.getElementById("button_hide_sample_output").style.display="";
    var check_learning_curve = document.getElementById("if_learning_curve");
    var learning_curve_output = document.getElementById("sample_learning_curve");
    var output = document.getElementById("sample_output");
    output.style.display = "block";
    if (check_learning_curve.checked == true)
    {
        learning_curve_output.style.display = "block";
    }
    else
    {
        learning_curve_output.style.display = "none";
    }
}
function hide_sample_output()
{
    document.getElementById("button_show_sample_output").style.display="";
    document.getElementById("button_hide_sample_output").style.display="none";
    var learning_curve_output = document.getElementById("sample_learning_curve");
    var output = document.getElementById("sample_output");
    output.style.display = "none";
    learning_curve_output.style.display = "none";
}
function show_warning_learning_curve()
{
    var checkBox = document.getElementById("if_learning_curve");
    learning_curve_output = document.getElementById("sample_learning_curve");
    var warning = document.getElementById("warning_learning_curve");
    if (checkBox.checked == true)
    {
        warning.style.display = "block";
        document.getElementById("N_learning_curve").style.display="";
        var N_learning_curve = document.getElementById("N_learning_curve").value;
        var command = " N_learning_curve = int(" + N_learning_curve + ");";
        var kernel = IPython.notebook.kernel;
        kernel.execute(command);
    }
    else
    {
        warning.style.display = "none";
        document.getElementById("N_learning_curve").style.display="none";
    }
}
function show_warning_use_prestored_model()
{
    var checkBox = document.getElementById("if_use_prestored");
    var warning = document.getElementById("warning_use_prestored");
    if (checkBox.checked == true)
    {
        warning.style.display = "none";
    }
    else
    {
        warning.style.display = "block";
    }
    if_use_prestored = document.getElementById("if_use_prestored").checked;
    var command = " if_use_prestored_models = '" + if_use_prestored.toString() + "';";
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
}
function get_repr_regr_combination()
{
    ngram_N = document.getElementById("input_ngram_N").value;
    n_threads = document.getElementById("input_n_threads").value;
    soap_cutoff = document.getElementById("input_soap_cutoff").value;
    soap_lmax = document.getElementById("input_soap_lmax").value;
    soap_nmax = document.getElementById("input_soap_nmax").value;
    var command = "ngram_N = int(" + ngram_N + "); n_threads = int(" + n_threads + ");"
        + " soap_cutoff = int(" + soap_cutoff + ");"
        + " soap_lmax = int(" + soap_lmax + ");"
        + " soap_nmax = int(" + soap_nmax + ");";
    console.log("Executing Command: " + command);
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
    representation_value = document.getElementById("representation").value;
    regression_value = document.getElementById("regression").value;
    if_learning_curve = document.getElementById("if_learning_curve").checked;
    if_use_prestored = document.getElementById("if_use_prestored").checked;
    N_learning_curve = document.getElementById("N_learning_curve").value;
    var command = " representation = '" + representation_value +
        "'; regression = '" + regression_value + "';" + " if_learningcurve = '" + if_learning_curve.toString() + "';"
        + " if_use_prestored_models = '" + if_use_prestored.toString() + "';"
        + " N_learning_curve = int(" + N_learning_curve + ");";
    console.log("Executing Command: " + command);
var kernel = IPython.notebook.kernel; var kernel = IPython.notebook.kernel;
kernel.execute(command); kernel.execute(command);
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('ngram'));//# N-gram descriptor Jupyter.notebook.execute_cells(window.findCellIndicesByTag('ngram'));//# N-gram descriptor
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('soap'));//# SOAP descriptors Jupyter.notebook.execute_cells(window.findCellIndicesByTag('soap'));//# SOAP descriptors
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('krr-ngram'));//# KRR Jupyter.notebook.execute_cells(window.findCellIndicesByTag('krr-ngram'));//# KRR
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init')); Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init_read_data')); Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init_read_data'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('plot')); Jupyter.notebook.execute_cells(window.findCellIndicesByTag('plot'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('main'));//# Read and predict Jupyter.notebook.execute_cells(window.findCellIndicesByTag('main'));//# Read and predict
} }
</script> </script>
``` ```
%% Output %% Output
%% Cell type:code id: tags:init
``` python
import os
import numpy as np
import copy
import pandas as pd
import time
#from tqdm import tqdm, trange
from tqdm import tnrange, tqdm_notebook
import sys
#print(sys.version)
from IPython.display import HTML
# n-gram
import kaggle_competition.crystal_graph as cg
# SOAP
from ase.atoms import Atoms as ASEAtoms
from ase.io import read, write
import quippy as qp
from quippy import descriptors
from IPython.core.display import Javascript
from IPython.display import display
def run_cell_by_tag(tags):
    display(Javascript("window.runCells('"+tags+"')"))
```
%% Cell type:code id: tags:init_read_data
``` python
# Read data
INPUT_DIR = "data/kaggle_competition/original_Kaggle/"
df_train = pd.read_csv(INPUT_DIR+"train.csv")
df_train["dataset"] = "train"
df_test = pd.read_csv(INPUT_DIR+"test.csv")
df_test["dataset"] = "test"
df_crystals = pd.concat([df_train, df_test], ignore_index=True)
train_data = 'data/kaggle_competition/original_Kaggle/train'
test_data = 'data/kaggle_competition/original_Kaggle/test'
csv_path = 'data/kaggle_competition/original_Kaggle/'
train_csv = np.loadtxt(csv_path+'train.csv', skiprows=1, delimiter=',')
test_csv = np.loadtxt(csv_path+'test.csv', skiprows=1, delimiter=',')
test_true_csv = pd.read_csv("data/kaggle_competition/labeled_test.csv")
bandgap_test_true = test_true_csv["bandgap_energy_ev"]
formation_test_true = test_true_csv["formation_energy_ev_natom"]
formation_train_true = df_train["formation_energy_ev_natom"]
bandgap_train_true = df_train["bandgap_energy_ev"]
# Initialize arrays
unigrams = []
bigrams = []
trigrams = []
quadgrams = []
train_mean_atomic_descriptors = []
test_mean_atomic_descriptors = []
```
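%% Cell type:markdown id: tags:
The cell above tags each frame with a `dataset` column before concatenating, so the descriptors can be generated over all crystals in one pass and the rows split back apart afterwards. A minimal sketch of this tag-and-concat pattern on hypothetical toy frames (demo names, not the Kaggle files):
%% Cell type:code id: tags:
``` python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
demo_train = pd.DataFrame({"id": [1, 2], "bandgap_energy_ev": [1.5, 2.1]})
demo_test = pd.DataFrame({"id": [3], "bandgap_energy_ev": [float("nan")]})
demo_train["dataset"] = "train"
demo_test["dataset"] = "test"

# One concatenated frame; the tag column recovers the original split
demo_all = pd.concat([demo_train, demo_test], ignore_index=True)
train_rows = demo_all[demo_all["dataset"] == "train"]
test_rows = demo_all[demo_all["dataset"] == "test"]
print(len(train_rows), len(test_rows))
```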
%% Cell type:code id: tags:plot
``` python
from IPython.display import HTML
%matplotlib inline
import matplotlib.pyplot as plt
fontsize = 15
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true)-np.array(y_pred))**2))
def RMSLE(true, pred):
    return np.sqrt(np.mean((np.log(np.array(true)+1)-np.log(np.array(pred)+1))**2))
def test_plot(formation_pred, formation_true, bandgap_pred, bandgap_true):
    RMSE_formation = RMSE(formation_pred, formation_true)
    RMSE_bandgap = RMSE(bandgap_pred, bandgap_true)
    RMSLE_formation = RMSLE(formation_pred, formation_true)
    RMSLE_bandgap = RMSLE(bandgap_pred, bandgap_true)
    label_formation = "test RMSLE " + str(RMSLE_formation)[:5]
    label_bandgap = "test RMSLE " + str(RMSLE_bandgap)[:5]
    fig = plt.figure(figsize=(6,4))
    plt.scatter(formation_true, formation_pred - formation_true, s = 3, alpha = 0.3, label = label_formation)
    #plt.plot(np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), '--', c = 'grey')
    plt.xlabel('Actual formation energy (eV/cation)', fontsize = fontsize)
    plt.ylabel(r'$y-\hat{y}$ (eV/cation)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.axhline(y=0, linewidth=1, color="#000000")
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
    fig = plt.figure(figsize=(6,4))
    plt.scatter(bandgap_true, bandgap_pred - bandgap_true, s = 3, alpha = 0.3, label = label_bandgap)
    #plt.plot(np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), '--', c = 'grey')
    plt.xlabel('Actual bandgap energy (eV)', fontsize = fontsize)
    plt.ylabel(r'$y-\hat{y}$ (eV)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.axhline(y=0, linewidth=1, color="#000000")
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
def plot_learning_curve(learning_curve):
    RMSLE_formation_train = []
    RMSLE_formation_test = []
    RMSLE_bandgap_train = []
    RMSLE_bandgap_test = []
    for i in range(len(learning_curve["N_train"])):
        RMSLE_formation_train.append(RMSLE(formation_train_true[learning_curve["indtrain"][i]], learning_curve["formation_train"][i]))
        RMSLE_formation_test.append(RMSLE(formation_test_true, learning_curve["formation_test"][i]))
        RMSLE_bandgap_train.append(RMSLE(bandgap_train_true[learning_curve["indtrain"][i]], learning_curve["bandgap_train"][i]))
        RMSLE_bandgap_test.append(RMSLE(bandgap_test_true, learning_curve["bandgap_test"][i]))
    display(HTML("""<div><font size=3.5em><b> - Formation energy</b></font><br></div>"""))
    fig = plt.figure(figsize = (6,4))
    plt.scatter(learning_curve["N_train"], RMSLE_formation_train, s=3, alpha=1, label = "Training")
    plt.scatter(learning_curve["N_train"], RMSLE_formation_test, s=3, alpha=1, label = "Test")
    plt.xlabel('Size of training set', fontsize = fontsize)
    plt.ylabel('RMSLE (eV/cation)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
    display(HTML("""<div><font size=3.5em><b> - Bandgap</b></font><br></div>"""))
    fig = plt.figure(figsize = (6,4))
    plt.scatter(learning_curve["N_train"], RMSLE_bandgap_train, s=3, alpha=1, label = "Training")
    plt.scatter(learning_curve["N_train"], RMSLE_bandgap_test, s=3, alpha=1, label = "Test")
    plt.xlabel('Size of training set', fontsize = fontsize)
    plt.ylabel('RMSLE (eV)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
def plot_hyperparameter_contour(x_grid, y_grid, z_grid, grid):
    fontsize = 15
    plt.figure()
    c = plt.contourf(x_grid, y_grid, abs(z_grid), grid, alpha=.75)
    CS = plt.contour(x_grid, y_grid, abs(z_grid), grid, colors='black', linewidths=.5)
    plt.clabel(CS, inline=1, fontsize=12)
    plt.xlabel('alpha', fontsize = fontsize)
    plt.ylabel('gamma', fontsize = fontsize)
    plt.title('RMSE', fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
```
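%% Cell type:markdown id: tags:
The competition metric is RMSLE rather than RMSE: because it compares `log(1 + y)`, an error is weighted relative to the magnitude of the target instead of absolutely. A quick check of the two metrics on toy values (standalone copies of the definitions above, so this cell runs on its own):
%% Cell type:code id: tags:
``` python
import numpy as np

# Standalone copies of the metrics defined in the plot cell
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log(np.array(y_true) + 1) - np.log(np.array(y_pred) + 1)) ** 2))

y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 2.0, 8.0]
# The last sample is off by 4 eV in absolute terms, but only by a
# factor of 9/5 after the log(1 + y) transform, so RMSLE is much smaller.
print(rmse(y_true, y_pred), rmsle(y_true, y_pred))
```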
%% Cell type:code id: tags:main
``` python
#main
import sys
# Make predictions
if representation == "ngram":
    run_cell_by_tag('get_ngram_descriptor')
    if regression == "nn":
        run_cell_by_tag('ngram-nn')
    elif regression == "krr":
        run_cell_by_tag('predic_plot_ngram_krr')
elif representation == "soap":
    run_cell_by_tag('get_soap_descriptor')
    if regression == "nn":
        run_cell_by_tag('nn-soap')
    elif regression == "krr":
        run_cell_by_tag('soap-krr')
else:
    print ("Representation/regression method not found:", representation, regression)
```
%% Cell type:code id: tags:ngram
``` python
# N-gram descriptor
def check_crdn(crdn, symbols, filename):
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    oxygens = [i for i, s in enumerate(symbols) if s == 'O']
    metal_cn = [len(crdn[i]) for i in metals]
    oxygen_cn = [len(crdn[i]) for i in oxygens]
    if np.max(metal_cn) > m_cn_max or np.min(metal_cn) < m_cn_min:
        print(filename)
        print('metal_cn: {}'.format(metal_cn))
    if np.max(oxygen_cn) > o_cn_max or np.min(oxygen_cn) < o_cn_min:
        print(filename)
        print('oxygen_cn: {}'.format(oxygen_cn))
def get_cg(fn, factor):
    crystal_xyz, crystal_sym, crystal_lat = cg.get_xyz_data(fn) #fn: file_name
    A = np.transpose(crystal_lat)
    B = np.linalg.inv(A)
    crystal_red = np.matmul(crystal_xyz, np.transpose(B))
    R_max = cg.get_Rmax(crystal_sym) * factor
    crystal_dist, crystal_Rij, crystal_crdn = cg.get_shortest_distances(crystal_red, A, R_max)
    check_crdn(crystal_crdn, crystal_sym, fn)
    return crystal_crdn, crystal_sym
def get_symbols(G):
    return G[1]
def coord_number(G, node):
    return len(G[0][node])
def neighbors(G, node):
    return [c[0] for c in G[0][node]]
def neighbors_with_R(G, node):
    return [(c[0], c[2]) for c in G[0][node]]
def get_unigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    symbols = [token_dict[s + str(coord_number(G, i))] for i, s in enumerate(symbols)]
    uniques, counts = np.unique(symbols, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(tokens), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_bigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
    pair_list = []
    for j in metals:
        sym_j = symbols[j]
        for i in neighbors(G, j):
            sym_i = symbols[i]
            pair = sym_j + '_' + sym_i
            pair_list.append(pair_dict[pair])
    uniques, counts = np.unique(pair_list, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(pairs), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_trigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    oxygens = [i for i, s in enumerate(symbols) if s == 'O']
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
    triple_list = []
    for o in oxygens:
        sym_o = symbols[o]
        nbr = neighbors(G, o)
        for j in range(len(nbr)):
            sym_j = symbols[nbr[j]]
            for i in range(j):
                sym_i = symbols[nbr[i]]
                if token_dict[sym_i] >= token_dict[sym_j]:
                    triple = sym_i + '_' + sym_o + '_' + sym_j
                else:
                    triple = sym_j + '_' + sym_o + '_' + sym_i
                triple_list.append(triples_dict[triple])
    for m in metals:
        sym_m = symbols[m]
        nbr = neighbors(G, m)
        for j in range(len(nbr)):
            sym_j = symbols[nbr[j]]
            for i in range(j):
                sym_i = symbols[nbr[i]]
                if token_dict[sym_i] >= token_dict[sym_j]:
                    triple = sym_i + '_' + sym_m + '_' + sym_j
                else:
                    triple = sym_j + '_' + sym_m + '_' + sym_i
                triple_list.append(triples_dict[triple])
    uniques, counts = np.unique(triple_list, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(triples), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_quadgrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    oxygens = [i for i, s in enumerate(symbols) if s == 'O']
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
    quad_list = []
    for o in oxygens:
        sym_o = symbols[o]
        nbr = neighbors_with_R(G, o)
        for j in range(len(nbr)):
            nbr_j = neighbors_with_R(G, nbr[j][0])
            sym_j = symbols[nbr[j][0]]
            for i in range(j):
                nbr_i = neighbors_with_R(G, nbr[i][0])
                sym_i = symbols[nbr[i][0]]
                triple1 = sym_i + '_' + sym_o + '_' + sym_j
                triple2 = sym_j + '_' + sym_o + '_' + sym_i
                for kj in nbr_j:
                    if kj[0] == o and np.sum(np.abs(nbr[j][1]+kj[1])) < 1e-8: continue
                    quad1 = triple1 + '_' + symbols[kj[0]]
                    quad_list.append(quad_dict[quad1])
                for ki in nbr_i:
                    if ki[0] == o and np.sum(np.abs(nbr[i][1]+ki[1])) < 1e-8: continue
                    quad2 = triple2 + '_' + symbols[ki[0]]
                    quad_list.append(quad_dict[quad2])
    uniques, counts = np.unique(quad_list, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(quads), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_ngram_descriptors():
    print ("\nGenerating n-grams descriptors...\n")
    m_elements = ['Al', 'Ga', 'In']
    m_tokens = []
    print ("Generating tokens...")
    for i in range(m_cn_min, m_cn_max+1):
        m_tokens += [ s+str(i) for s in m_elements]
    o_tokens = ['O'+str(i) for i in range(o_cn_min, o_cn_max+1)]
    tokens = m_tokens + o_tokens
    print ("Tokens:", tokens[:10], "...")
    token_dict = dict([(tokens[i], i) for i in range(len(tokens))])
    pairs = []
    print ("Generating pairs...")
    for m in m_tokens:
        pairs += [ m + '_' + o for o in o_tokens]
    print ("Pairs:", pairs[:10], "...")
    pair_dict = dict([(pairs[i], i) for i in range(len(pairs))])
    print ("Generating triples...")
    mom_triples = []
    for m1 in range(len(m_tokens)):
        for m2 in range(m1+1):
            mom_triples += [m_tokens[m1] + '_' + o + '_' + m_tokens[m2] for o in o_tokens]
    omo_triples = []
    for o1 in range(len(o_tokens)):
        for o2 in range(o1+1):
            omo_triples += [o_tokens[o1] + '_' + m + '_' + o_tokens[o2] for m in m_tokens]
    triples = mom_triples + omo_triples
    print ("Triples:", triples[:10], "...")
    triples_dict = dict([(triples[i], i) for i in range(len(triples))])
    print ("Generating quads...")
    quads = []
    for m1 in m_tokens:
        for o1 in o_tokens:
            for m2 in m_tokens:
                quads += [m1 + '_' + o1 + '_' + m2 + '_' + o for o in o_tokens]
    print ("Quads:", quads[:10], "...")
    quad_dict = dict([(quads[i], i) for i in range(len(quads))])
    print ("Generating grams...")
    unigrams = []
    bigrams = []
    trigrams = []
    quadgrams = []
    count = 0
    t0 = time.time()
    for crystal in df_crystals.itertuples():
        fn = INPUT_DIR + "{}/{}/geometry.xyz".format(crystal.dataset, crystal.id)
        factor = cg.get_factor(crystal.spacegroup, crystal.lattice_angle_gamma_degree)
        G = get_cg(fn, factor)
        unigrams.append(get_unigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        bigrams.append(get_bigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        trigrams.append(get_trigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        quadgrams.append(get_quadgrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        count += 1
        #if count % 500 == 0:
        #    print(count, time.time()-t0)
    print ("Tokens:", len(tokens))
    print ("Pairs:", len(pairs))
    print ("Triples:", len(triples))
    print ("Quads:", len(quads))
    #np.savez("./data/kaggle_competition/models/ngram.npz", tokens=tokens, pairs=pairs, triples=triples, quads=quads, unigrams=unigrams, bigrams=bigrams, trigrams=trigrams, quadgrams=quadgrams)
    return tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams
```
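%% Cell type:markdown id: tags:
Each `get_*grams` function above reduces a list of vocabulary indices to a fixed-length count vector via `np.unique`. The idiom in isolation, on a hypothetical four-token vocabulary (toy names, not the real token set):
%% Cell type:code id: tags:
``` python
import numpy as np

# Hypothetical mini-vocabulary of coordination-tagged tokens
# ("Al4" = an Al atom with 4 neighbours, etc.)
tokens = ["Al4", "Ga4", "In4", "O2"]
token_dict = {t: i for i, t in enumerate(tokens)}

# Tokens observed in one toy structure
observed = ["Al4", "Al4", "O2", "In4"]
indices = [token_dict[t] for t in observed]

# Unique indices + counts -> count vector aligned with the vocabulary
uniques, counts = np.unique(indices, return_counts=True)
v = np.zeros(len(tokens), dtype=int)
v[uniques] = counts  # vectorized form of the per-key loop in get_unigrams()
print(v)
```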
%% Cell type:code id: tags:get_ngram_descriptor %% Cell type:code id: tags:get_ngram_descriptor
``` python ``` python
# Get N-gram descriptor # Get N-gram descriptor
from IPython.display import HTML from IPython.display import HTML
display(HTML("""<br><div><font size=6em><b>Predictions with N-grams representation</b></font><hr></div>""")) display(HTML("""<br><div><font size=6em><b>Predictions with N-grams representation</b></font><hr></div>"""))
m_cn_min = 4 m_cn_min = 4
m_cn_max = 9 m_cn_max = 9
o_cn_min = 2 o_cn_min = 2
o_cn_max = 6 o_cn_max = 6
if if_use_prestored_models == "true": if if_use_prestored_models == "true":
ngram_model_saved = np.load('./data/kaggle_competition/models/ngram.npz') ngram_model_saved = np.load('./data/kaggle_competition/models/ngram.npz')
display(HTML("""Read pre-stored N-grams model.""")) display(HTML("""Read pre-stored N-grams model."""))
tokens = ngram_model_saved['tokens'] tokens = ngram_model_saved['tokens']
pairs = ngram_model_saved['pairs'] pairs = ngram_model_saved['pairs']
triples = ngram_model_saved['triples'] triples = ngram_model_saved['triples']
quads = ngram_model_saved['quads'] quads = ngram_model_saved['quads']
unigrams = ngram_model_saved['unigrams'] unigrams = ngram_model_saved['unigrams']
bigrams = ngram_model_saved['bigrams'] bigrams = ngram_model_saved['bigrams']
trigrams = ngram_model_saved['trigrams'] trigrams = ngram_model_saved['trigrams']
quadgrams = ngram_model_saved['quadgrams'] quadgrams = ngram_model_saved['quadgrams']
else: else:
tokens = [] tokens = []
pairs = [] pairs = []
triples = [] triples = []
quads = [] quads = []
unigrams = [] unigrams = []
bigrams = [] bigrams = []
trigrams = [] trigrams = []
quadgrams = [] quadgrams = []
tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams = get_ngram_descriptors() tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams = get_ngram_descriptors()
``` ```
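%% Cell type:markdown id: tags:
Not part of the original notebook: a minimal sketch of how a pre-stored archive like the `ngram.npz` loaded above could be produced with `np.savez`. The tiny placeholder arrays stand in for the real descriptors; what matters is that the keyword names passed to `np.savez` must match the keys read back by `np.load` (the notebook stores all eight: `tokens`, `pairs`, `triples`, `quads`, `unigrams`, `bigrams`, `trigrams`, `quadgrams`).
%% Cell type:code id: tags:
``` python
import os
import tempfile
import numpy as np

# Placeholder arrays standing in for the real descriptor matrices
tokens = np.zeros((3, 2))
pairs = np.ones((3, 4))

# Save under the same key names that np.load(...)['tokens'] etc. expects
path = os.path.join(tempfile.mkdtemp(), "ngram.npz")
np.savez(path, tokens=tokens, pairs=pairs)

saved = np.load(path)
assert set(saved.files) == {"tokens", "pairs"}
assert np.array_equal(saved["pairs"], pairs)
```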
%% Cell type:code id: tags:krr-ngram
``` python
# KRR + N-grams
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.preprocessing import MinMaxScaler
from joblib import parallel_backend  # sklearn.externals.joblib was removed in scikit-learn >= 0.23

def train_and_pred_KRR(X_train, y_train, X_test, nfold, nthread):
    # RMSE is negated so that GridSearchCV can maximize the score
    neg_root_mean_squared_error = make_scorer(RMSE, greater_is_better=False)
    clf = GridSearchCV(KernelRidge(kernel='rbf'),
                       cv=nfold, n_jobs=nthread, verbose=1,
                       scoring=neg_root_mean_squared_error,
                       param_grid={"alpha": np.logspace(-15, 5, 21, base=2),
                                   "gamma": np.logspace(-15, 3, 19, base=2)})
    with parallel_backend('threading'):
        clf.fit(X_train, y_train)
    out_y_train = clf.predict(X_train)
    y_test = clf.predict(X_test)
    return clf, y_test, out_y_train

def get_ngram_KRR_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size, n_threads):
    print("Size of N-gram:", ngram_size)
    ngrams = {}
    ngrams["tokens"] = tokens
    ngrams["pairs"] = pairs
    ngrams["triples"] = triples
    ngrams["quads"] = quads
    ngrams["unigrams"] = unigrams
    ngrams["bigrams"] = bigrams
    ngrams["trigrams"] = trigrams
    ngrams["quadgrams"] = quadgrams
    if ngram_size == 4:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"], ngrams["quadgrams"]])
    elif ngram_size == 3:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"]])
    elif ngram_size == 2:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"]])
    elif ngram_size == 1:
        features = ngrams["unigrams"]
    else:
        raise ValueError('ngram_size should be within [1, 4].')
    print('Ngram feature matrix shape: {}'.format(features.shape))
    # Keep only columns that are nonzero for at least one structure
    count_per_column = np.sum(features, axis=0)
    nonzero_columns = np.where(count_per_column > 0)[0]
    print('Total column count: {}'.format(len(count_per_column)))
    print('Nonzero-column count: {}'.format(len(nonzero_columns)))
    features = features[:, nonzero_columns]
    print('Nonzero ngram feature matrix shape: {}'.format(features.shape))
    # Unit-cell volume of the (generally triclinic) lattice
    lat_a = df_crystals.lattice_vector_1_ang.values
    lat_b = df_crystals.lattice_vector_2_ang.values
    lat_c = df_crystals.lattice_vector_3_ang.values
    lat_cosa = np.cos(np.radians(df_crystals.lattice_angle_alpha_degree.values))
    lat_cosb = np.cos(np.radians(df_crystals.lattice_angle_beta_degree.values))
    lat_cosg = np.cos(np.radians(df_crystals.lattice_angle_gamma_degree.values))
    volume = lat_a * lat_b * lat_c * np.sqrt(1. - lat_cosa**2 - lat_cosb**2 - lat_cosg**2
                                             + 2. * lat_cosa * lat_cosb * lat_cosg)
    # Normalize counts by cell volume, then scale each column to [0, 1]
    features = np.array(features, dtype=float)  # np.float was removed in NumPy 1.24
    features = features / volume[:, np.newaxis]
    scaler = MinMaxScaler()
    features = scaler.fit_transform(features)
    # Targets are trained on a log1p scale
    y1 = np.log1p(df_crystals.formation_energy_ev_natom.values[:len(df_train)])
    y2 = np.log1p(df_crystals.bandgap_energy_ev.values[:len(df_train)])
    X_train = features[:len(df_train)]
    X_test = features[len(df_train):]
    learning_curve = {}
    if if_learningcurve == "true":
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))  # a range object does not support item assignment
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(X_train), N_train, replace=False)
            clf1, y1_test, y1_train = train_and_pred_KRR(X_train[indtrain], y1[indtrain], X_test, nfold=5, nthread=n_threads)
            clf2, y2_test, y2_train = train_and_pred_KRR(X_train[indtrain], y2[indtrain], X_test, nfold=5, nthread=n_threads)
            learning_curve_formation_train.append(y1_train)
            learning_curve_formation_test.append(y1_test)
            learning_curve_bandgap_train.append(y2_train)
            learning_curve_bandgap_test.append(y2_test)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    clf1, y1_test, y1_train = train_and_pred_KRR(X_train, y1, X_test, nfold=5, nthread=n_threads)
    clf2, y2_test, y2_train = train_and_pred_KRR(X_train, y2, X_test, nfold=5, nthread=n_threads)
    # Report negative predictions before undoing the log1p transform
    print('negative values of y1_test:', np.where(y1_test < 0))
    print('negative values of y2_test:', np.where(y2_test < 0))
    y1_test = np.clip(np.expm1(y1_test), 1e-8, None)
    y2_test = np.clip(np.expm1(y2_test), 1e-8, None)
    y1_train = np.clip(np.expm1(y1_train), 1e-8, None)
    y2_train = np.clip(np.expm1(y2_train), 1e-8, None)
    gridsearch_results = {}
    gridsearch_results['formation'] = clf1.cv_results_
    gridsearch_results['bandgap'] = clf2.cv_results_
    return y1_train, y1_test, y2_train, y2_test, gridsearch_results, learning_curve
```
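%% Cell type:markdown id: tags:
The cell above folds the unit-cell volume into the descriptor scaling using the general triclinic formula V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ). A quick self-contained sanity check (not part of the original notebook): for an orthorhombic cell (all angles 90°) this reduces to abc, and for γ = 120° with α = β = 90° it reduces to abc·sin γ.
%% Cell type:code id: tags:
``` python
import numpy as np

def cell_volume(a, b, c, alpha, beta, gamma):
    """Triclinic unit-cell volume; angles in degrees (same formula as in the cell above)."""
    ca, cb, cg = (np.cos(np.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * np.sqrt(1. - ca**2 - cb**2 - cg**2 + 2. * ca * cb * cg)

# Orthorhombic cell: volume is just a*b*c
assert np.isclose(cell_volume(3.0, 4.0, 5.0, 90, 90, 90), 60.0)
# Hexagonal-type cell (gamma = 120 deg): V = a*b*c*sin(gamma)
assert np.isclose(cell_volume(3.0, 3.0, 5.0, 90, 90, 120), 3 * 3 * 5 * np.sin(np.radians(120)))
```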
%% Cell type:code id: tags:predic_plot_ngram_krr
``` python
# Predict with N-grams + KRR and plot
from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Kernel ridge regression </b> </font><br></div>"""))
formation_train_pred_ngram_krr, formation_test_pred_ngram_krr, bandgap_train_pred_ngram_krr, bandgap_test_pred_ngram_krr, gridsearch_results_ngram_krr, learning_curve_ngram_krr = get_ngram_KRR_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size=ngram_N, n_threads=n_threads)
# Fold the flat grid-search scores onto the (gamma, alpha) hyperparameter grid
x_grid_formation_ngram_krr = np.unique(gridsearch_results_ngram_krr['formation']['param_alpha'].data)
y_grid_formation_ngram_krr = np.unique(gridsearch_results_ngram_krr['formation']['param_gamma'].data)
z_grid_formation_ngram_krr = np.reshape(gridsearch_results_ngram_krr['formation']['mean_test_score'], [len(y_grid_formation_ngram_krr), -1])
x_grid_bandgap_ngram_krr = np.unique(gridsearch_results_ngram_krr['bandgap']['param_alpha'].data)
y_grid_bandgap_ngram_krr = np.unique(gridsearch_results_ngram_krr['bandgap']['param_gamma'].data)
z_grid_bandgap_ngram_krr = np.reshape(gridsearch_results_ngram_krr['bandgap']['mean_test_score'], [len(y_grid_bandgap_ngram_krr), -1])
run_cell_by_tag('plot_ngram_krr')
```
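%% Cell type:markdown id: tags:
The `mean_test_score` values reshaped above are negated RMSEs: `make_scorer(RMSE, greater_is_better=False)` flips the sign so that `GridSearchCV` can always maximize its scoring function. A minimal self-contained demonstration of that sign convention (not from the notebook; the `RMSE` here is a standard definition that may differ in detail from the one defined earlier, and the tiny linear dataset is illustrative only):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression

def RMSE(y_true, y_pred):
    # Standard root-mean-squared error
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# greater_is_better=False makes the scorer return -RMSE
scorer = make_scorer(RMSE, greater_is_better=False)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

# A near-perfect fit: RMSE ~ 0, and the scorer reports it negated (<= 0)
assert np.isclose(RMSE(y, model.predict(X)), 0.0, atol=1e-8)
assert scorer(model, X, y) <= 0.0
```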
%% Cell type:code id: tags:plot_ngram_krr
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_ngram_krr, formation_test_true, bandgap_test_pred_ngram_krr, bandgap_test_true)
display(HTML("""<div><hr><font size=4em><b>Hyperparameter grid search</b></font></div>"""))
display(HTML("""<div><font size=3.5em><b>- Formation energy</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_formation_ngram_krr, y_grid_formation_ngram_krr, z_grid_formation_ngram_krr, grid=8)
display(HTML("""<div><font size=3.5em><b>- Bandgap</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_bandgap_ngram_krr, y_grid_bandgap_ngram_krr, z_grid_bandgap_ngram_krr, grid=10)
if if_learningcurve == "true":
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_ngram_krr)
display(HTML("""<br><hr><br><br>"""))
```
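%% Cell type:markdown id: tags:
Throughout this notebook, both regression targets are trained on a `log1p` scale and predictions are mapped back with `expm1`, clipped at 1e-8 so that no prediction is ever non-positive. A small round-trip illustration of that transform (not part of the original pipeline; the energy values below are made up):
%% Cell type:code id: tags:
``` python
import numpy as np

y = np.array([0.0, 0.05, 0.4, 3.2])            # illustrative target values (eV)
y_log = np.log1p(y)                            # forward transform used for training targets
y_back = np.clip(np.expm1(y_log), 1e-8, None)  # inverse transform with the notebook's clip

# Nonzero values round-trip exactly (to float precision) ...
assert np.allclose(y_back[1:], y[1:])
# ... while exact zeros are floored at the clip value
assert y_back[0] == 1e-8
```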
%% Cell type:code id: tags:ngram-nn
``` python
# N-grams + NN
import sys
import pandas as pd
import torch

def mean_free(model):
    # Center every feature column on its mean
    means = np.average(model, axis=0)
    return model - means

class NeuralNetwork(torch.nn.Module):
    def __init__(self, inplen, outlen):
        super(NeuralNetwork, self).__init__()
        hidden_layers = []
        batchnorms = []
        hidden_layers.append(torch.nn.Linear(inplen, N_nn_neurons[0]))
        batchnorms.append(torch.nn.BatchNorm1d(N_nn_neurons[0], affine=True))
        for i in range(N_nn_layers - 1):
            hidden_layers.append(torch.nn.Linear(N_nn_neurons[i], N_nn_neurons[i + 1], bias=True))
            batchnorms.append(torch.nn.BatchNorm1d(N_nn_neurons[i + 1], affine=True))
        self.layer_hidden = torch.nn.ModuleList(hidden_layers)
        self.batchnorms = torch.nn.ModuleList(batchnorms)
        self.layer_out = torch.nn.Linear(N_nn_neurons[-1], outlen)
        self.activation = torch.nn.LeakyReLU()

    def forward(self, inp):
        x = self.activation(self.layer_hidden[0](inp))
        # batch-norm variant (disabled): x = self.activation(self.batchnorms[0](self.layer_hidden[0](inp)))
        for i in range(1, N_nn_layers - 1):
            x = self.activation(self.layer_hidden[i](x))
            # batch-norm variant (disabled): x = self.activation(self.batchnorms[i](self.layer_hidden[i](x)))
        return self.layer_out(x)

def ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train, y_train, x_test, y_test_true, learning_rate, n_epoch):
    # torch.autograd.Variable and the volatile flag are deprecated; plain tensors
    # plus torch.no_grad() for evaluation give the same behavior in current PyTorch
    inp_trn = torch.Tensor(x_train.reshape(-1, nn_inp_len))
    out_trn = torch.Tensor(y_train.reshape(-1, nn_out_len))
    inp_test = torch.Tensor(x_test.reshape(-1, nn_inp_len))
    out_test = torch.Tensor(y_test_true.reshape(-1, nn_out_len))
    deep_fit = NeuralNetwork(nn_inp_len, nn_out_len)
    loss_function = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(deep_fit.parameters(), lr=learning_rate)
    sys.stdout.flush()
    loss_trn, loss_test = [], []
    for _ in tnrange(n_epoch):
        deep_fit.train()
        loss = loss_function(deep_fit(inp_trn), out_trn)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_trn.append(loss.item())
        deep_fit.eval()
        with torch.no_grad():
            loss = loss_function(deep_fit(inp_test), out_test)
        loss_test.append(loss.item())
    deep_fit.eval()
    with torch.no_grad():
        deep_fit_trn = deep_fit(inp_trn)
        deep_fit_test = deep_fit(inp_test)
    sys.stdout.flush()
    # Undo the log1p target transform and clip non-positive predictions
    return (np.clip(np.expm1(deep_fit_trn.numpy().flatten()), 1e-8, None),
            np.clip(np.expm1(deep_fit_test.numpy().flatten()), 1e-8, None))

def get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size):
    print("Size of N-gram:", ngram_size)
    ngrams = {}
    ngrams["tokens"] = tokens
    ngrams["pairs"] = pairs
    ngrams["triples"] = triples
    ngrams["quads"] = quads
    ngrams["unigrams"] = unigrams
    ngrams["bigrams"] = bigrams
    ngrams["trigrams"] = trigrams
    ngrams["quadgrams"] = quadgrams
    if ngram_size == 4:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"], ngrams["quadgrams"]])
    elif ngram_size == 3:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"]])
    elif ngram_size == 2:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"]])
    elif ngram_size == 1:
        features = ngrams["unigrams"]
    else:
        raise ValueError('ngram_size should be within [1, 4].')
    print('Ngram feature matrix shape: {}'.format(features.shape))
    # Keep only columns that are nonzero for at least one structure
    count_per_column = np.sum(features, axis=0)
    nonzero_columns = np.where(count_per_column > 0)[0]
    print('Total column count: {}'.format(len(count_per_column)))
    print('Nonzero-column count: {}'.format(len(nonzero_columns)))
    features = features[:, nonzero_columns]
    print('Nonzero ngram feature matrix shape: {}'.format(features.shape))
    # Unit-cell volume of the (generally triclinic) lattice
    lat_a = df_crystals.lattice_vector_1_ang.values
    lat_b = df_crystals.lattice_vector_2_ang.values
    lat_c = df_crystals.lattice_vector_3_ang.values
    lat_cosa = np.cos(np.radians(df_crystals.lattice_angle_alpha_degree.values))
    lat_cosb = np.cos(np.radians(df_crystals.lattice_angle_beta_degree.values))
    lat_cosg = np.cos(np.radians(df_crystals.lattice_angle_gamma_degree.values))
    volume = lat_a * lat_b * lat_c * np.sqrt(1. - lat_cosa**2 - lat_cosb**2 - lat_cosg**2
                                             + 2. * lat_cosa * lat_cosb * lat_cosg)
    # Normalize counts by cell volume, then scale each column to [0, 1]
    features = np.array(features, dtype=float)  # np.float was removed in NumPy 1.24
    features = features / volume[:, np.newaxis]
    scaler = MinMaxScaler()
    features = scaler.fit_transform(features)
    sys.stdout.flush()
    # Targets are trained on a log1p scale
    y_train_Eform = np.log1p(df_crystals.formation_energy_ev_natom.values[:len(df_train)])
    y_train_Egap = np.log1p(df_crystals.bandgap_energy_ev.values[:len(df_train)])
    y_test_Eform_true = np.log1p(np.array(formation_test_true))
    y_test_Egap_true = np.log1p(np.array(bandgap_test_true))
    x_train = features[:len(df_train)]
    x_test = features[len(df_train):]
    x_train_meanfree = mean_free(x_train)
    x_test_meanfree = mean_free(x_test)
    nn_inp_len = len(nonzero_columns)
    nn_out_len = 1  # one regression target per network
    formation_train_pred_ngram_nn, formation_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree, y_train_Eform, x_test_meanfree, y_test_Eform_true, learning_rate=0.00012, n_epoch=600)
    bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree, y_train_Egap, x_test_meanfree, y_test_Egap_true, learning_rate=0.00015, n_epoch=700)
    learning_curve = {}
    if if_learningcurve == "true":
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))  # a range object does not support item assignment
        range_learning_curve[-1] = 2400
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(x_train_meanfree), N_train, replace=False)
            formation_train_pred_ngram_nn, formation_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree[indtrain], y_train_Eform[indtrain], x_test_meanfree, y_test_Eform_true, learning_rate=0.00012, n_epoch=600)
            bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree[indtrain], y_train_Egap[indtrain], x_test_meanfree, y_test_Egap_true, learning_rate=0.00015, n_epoch=700)
            learning_curve_formation_train.append(formation_train_pred_ngram_nn)
            learning_curve_formation_test.append(formation_test_pred_ngram_nn)
            learning_curve_bandgap_train.append(bandgap_train_pred_ngram_nn)
            learning_curve_bandgap_test.append(bandgap_test_pred_ngram_nn)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    return formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve
from IPython.display import HTML from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Neural network </b> </font><br></div>""")) display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Neural network </b> </font><br></div>"""))
formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve_ngram_nn = get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size = ngram_N) formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve_ngram_nn = get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size = ngram_N)
run_cell_by_tag('plot_ngram_nn') run_cell_by_tag('plot_ngram_nn')
``` ```
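The learning-curve loop above draws random training subsets of increasing size with `np.random.choice` and retrains the model at each size. A minimal self-contained sketch of the same subsampling pattern, with an ordinary least-squares fit standing in for the neural network (the toy data and model are assumptions for illustration only):

``` python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data standing in for the descriptor matrices
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(2400, 5))
y = X @ w + 0.1 * rng.normal(size=2400)
X_test = rng.normal(size=(500, 5))
y_test = X_test @ w

# integer step sizes, with the last point pinned to the full training set
N_points = 5
step = (2400 - 100) // N_points
sizes = list(range(100, 2401, step))
sizes[-1] = 2400

test_rmse = []
for n in sizes:
    ind = rng.choice(len(X), n, replace=False)   # random subsample, no repeats
    coef, *_ = np.linalg.lstsq(X[ind], y[ind], rcond=None)
    pred = X_test @ coef
    test_rmse.append(float(np.sqrt(np.mean((pred - y_test) ** 2))))

print(sizes, [round(e, 3) for e in test_rmse])
```

Plotting `test_rmse` against `sizes` (often on log-log axes) gives the learning curve; the notebook's `plot_learning_curve` does the analogous plot for the stored predictions.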
%% Cell type:code id: tags:plot_ngram_nn
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_ngram_nn, formation_test_true, bandgap_test_pred_ngram_nn, bandgap_test_true)
if(if_learningcurve == "true"):
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_ngram_nn)
display(HTML("""<br><hr><br><br>"""))
```
%% Cell type:code id: tags:soap
``` python
# SOAP descriptors
def get_atomic_soap_descriptors(geometry_xyz, soap_cutoff, soap_lmax, soap_nmax):
    # get cell vectors from *.xyz file
    cell_ = np.loadtxt(geometry_xyz, skiprows=3, usecols=(1,2,3))[:3]
    # get atom positions and types from *.xyz file
    atoms_ = pd.read_table(geometry_xyz, skiprows=6,
                           delim_whitespace=True, header=None, names=['x', 'y', 'z', 'type'],
                           usecols=[1,2,3,4])
    # construct ase Atoms object
    compound = ASEAtoms(atoms_.type.values,
                        positions = atoms_.values[:, :3],
                        cell = cell_,
                        pbc=[1, 1, 1])
    ###########################################################################
    # The specification of which atoms are used as SOAP centers is separate
    # from the specification of which atoms are taken into account in the
    # environment. The n_Z=4 Z={8 13 31 49} options specify that O, Al, Ga
    # and In are all taken as SOAP centers; the n_species=4
    # species_Z={8 13 31 49} options specify the atom types to be taken into
    # account in the environment.
    soap_command = "soap cutoff="
    soap_command += str(soap_cutoff)
    soap_command += " l_max="
    soap_command += str(soap_lmax)
    soap_command += " n_max="
    soap_command += str(soap_nmax)
    soap_command += " atom_sigma=0.5 n_Z=4 Z={8 13 31 49} n_species=4 species_Z={8 13 31 49}"
    #print(soap_command)
    desc = descriptors.Descriptor(soap_command) # for formation energies
    #"soap cutoff=20 l_max=6 n_max=10 atom_sigma=0.5 n_Z=1 Z={8} n_species=4 species_Z={8 13 31 49}") # for bandgaps
    compound.set_cutoff(desc.cutoff())
    compound.calc_connect()
    return desc.calc(compound)['descriptor']

def get_SOAP(soap_cutoff, soap_lmax, soap_nmax):
    # tnrange expects an integer; iterate over the id array with tqdm instead
    for compound_id in tqdm(train_csv[:, 0].astype(int)):
        #print('compound_id: {}.'.format(compound_id))
        compound_xyz = train_data+'/'+str(compound_id)+'/geometry.xyz'
        train_mean_atomic_descriptor = np.mean(get_atomic_soap_descriptors(compound_xyz, soap_cutoff, soap_lmax, soap_nmax),
                                               axis=0)
        if 'train_mean_atomic_descriptors' in locals():
            train_mean_atomic_descriptors = np.append(train_mean_atomic_descriptors,
                                                      train_mean_atomic_descriptor.reshape(1, -1), axis=0)
        else:
            train_mean_atomic_descriptors = train_mean_atomic_descriptor.reshape(1, -1)
            print('mean_atomic_descriptor.shape: {}'.format(train_mean_atomic_descriptor.shape))
    for compound_id in tqdm(test_csv[:, 0].astype(int)):
        #print('compound_id: {}.'.format(compound_id))
        compound_xyz = test_data+'/'+str(compound_id)+'/geometry.xyz'
        test_mean_atomic_descriptor = np.mean(get_atomic_soap_descriptors(compound_xyz, soap_cutoff, soap_lmax, soap_nmax),
                                              axis=0)
        if 'test_mean_atomic_descriptors' in locals():
            test_mean_atomic_descriptors = np.append(test_mean_atomic_descriptors,
                                                     test_mean_atomic_descriptor.reshape(1, -1), axis=0)
        else:
            test_mean_atomic_descriptors = test_mean_atomic_descriptor.reshape(1, -1)
            #print('mean_atomic_descriptor.shape: {}'.format(test_mean_atomic_descriptor.shape))
    #print(test_mean_atomic_descriptors)
    #np.savez("./data/kaggle_competition/models/soap.npz", train_mean_atomic_descriptors = train_mean_atomic_descriptors, test_mean_atomic_descriptors = test_mean_atomic_descriptors)
    return train_mean_atomic_descriptors, test_mean_atomic_descriptors
```
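`get_SOAP` turns a variable number of per-atom SOAP vectors into one fixed-length vector per compound by averaging over atoms; that mean pooling is what makes differently sized structures comparable in a single descriptor matrix. A small numpy sketch of the idea (the descriptor length of 8 and the atom counts are arbitrary stand-ins):

``` python
import numpy as np

rng = np.random.default_rng(1)

# per-atom descriptors for three compounds with different atom counts,
# all sharing the same descriptor length (8 here, hypothetical)
compounds = [rng.normal(size=(n_atoms, 8)) for n_atoms in (4, 10, 7)]

# mean-pool each compound to one fixed-length row, as get_SOAP does
rows = [atomic.mean(axis=0) for atomic in compounds]

# stacking once at the end avoids the repeated-copy cost of calling
# np.append inside the loop
descriptors = np.vstack(rows)
print(descriptors.shape)  # → (3, 8)
```

Collecting rows in a list and stacking once is the cheaper variant of the incremental `np.append` pattern used in the cell above; both produce the same `(n_compounds, descriptor_length)` matrix.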
%% Cell type:code id: tags:get_soap_descriptor
``` python
# Get SOAP descriptor
from IPython.display import HTML
display(HTML("""<br><div><font size=6em><b>Predictions with SOAP representation</b></font></div><hr>"""))
#"soap cutoff=10 l_max=4 n_max=4 atom_sigma=0.5 n_Z=4 Z={8 13 31 49} n_species=4 species_Z={8 13 31 49}") # for formation energies
#"soap cutoff=20 l_max=6 n_max=10 atom_sigma=0.5 n_Z=1 Z={8} n_species=4 species_Z={8 13 31 49}") # for bandgaps
if if_use_prestored_models == "true":
    soap_model_saved = np.load('./data/kaggle_competition/models/soap.npz')
    train_mean_atomic_descriptors = soap_model_saved['train_mean_atomic_descriptors']
    test_mean_atomic_descriptors = soap_model_saved['test_mean_atomic_descriptors']
    display(HTML("""Read pre-stored SOAP model."""))
else:
    display(HTML("""Generating SOAP descriptors..."""))
    train_mean_atomic_descriptors, test_mean_atomic_descriptors = get_SOAP(soap_cutoff, soap_lmax, soap_nmax)
#print("test_mean_atomic_descriptors:", np.shape(test_mean_atomic_descriptors), test_mean_atomic_descriptors[:10])
```
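The prestored-models branch above reloads the descriptor matrices from an `.npz` archive by key, so the expensive SOAP computation only runs once. A sketch of that save/load round-trip with the same key names (the temporary path and toy arrays are illustrative, not the notebook's actual data):

``` python
import numpy as np
import tempfile, os

train_desc = np.arange(12, dtype=float).reshape(3, 4)
test_desc = np.arange(8, dtype=float).reshape(2, 4)

# write both arrays under named keys, as the notebook's soap.npz does
path = os.path.join(tempfile.mkdtemp(), 'soap.npz')
np.savez(path, train_mean_atomic_descriptors=train_desc,
         test_mean_atomic_descriptors=test_desc)

# reload by key, matching the prestored-models branch
saved = np.load(path)
train_back = saved['train_mean_atomic_descriptors']
test_back = saved['test_mean_atomic_descriptors']
print(train_back.shape, test_back.shape)  # → (3, 4) (2, 4)
```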
%% Cell type:code id: tags:soap-krr
``` python
# SOAP + KRR
# The following arrays must be read/prepared: test_csv, train_csv, test_mean_atomic_descriptors, train_mean_atomic_descriptors
# Depends on cell 'krr-ngram'
from IPython.display import HTML
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=True, with_std=True)
formation_train_true = train_csv[:, -2]
bandgap_train_true = train_csv[:, -1]
combined_mean_atomic_descriptors = scaler.fit_transform(np.append(
    test_mean_atomic_descriptors,
    train_mean_atomic_descriptors,
    axis=0))
X_test = np.round(combined_mean_atomic_descriptors[:test_csv.shape[0]],
                  decimals=5)
X_train = np.round(combined_mean_atomic_descriptors[test_csv.shape[0]:],
                   decimals=5)
#display(HTML("""<br><br><hr><br><div><font size=4em><b>Representation:</b> SOAP descriptor</font><br></div>"""))
#display(HTML("""<div><font size=4em><b>Regression method:</b> Kernel ridge regression</font><br></div>"""))
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Kernel ridge regression </b> </font><br></div>"""))

def get_soap_KRR_predictions(X_train, formation_train_true, bandgap_train_true, n_threads):
    formation_train_true_10times = formation_train_true * 10.0
    clf1_soap_krr, formation_test_pred_soap_krr, formation_train_pred_soap_krr = train_and_pred_KRR(X_train, formation_train_true_10times, X_test, nfold = 5, nthread = n_threads)
    clf2_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr = train_and_pred_KRR(X_train, bandgap_train_true, X_test, nfold = 5, nthread = n_threads)
    learning_curve = {}
    if(if_learningcurve == "true"):
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        # use an integer step and a list so the final point can be pinned to 2400
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(X_train), N_train, replace=False)
            clf1, y1_test, y1_train = train_and_pred_KRR(X_train[indtrain], formation_train_true_10times[indtrain], X_test, nfold = 5, nthread = n_threads)
            clf2, y2_test, y2_train = train_and_pred_KRR(X_train[indtrain], bandgap_train_true[indtrain], X_test, nfold = 5, nthread = n_threads)
            learning_curve_formation_train.append(y1_train/10.0)
            learning_curve_formation_test.append(y1_test/10.0)
            learning_curve_bandgap_train.append(y2_train)
            learning_curve_bandgap_test.append(y2_test)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    # undo the factor of 10 applied to the formation-energy targets
    formation_test_pred_soap_krr = formation_test_pred_soap_krr / 10.0
    formation_train_pred_soap_krr = formation_train_pred_soap_krr / 10.0
    gridsearch_results = {}
    gridsearch_results['formation'] = clf1_soap_krr.cv_results_
    gridsearch_results['bandgap'] = clf2_soap_krr.cv_results_
    return formation_test_pred_soap_krr, formation_train_pred_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr, gridsearch_results, learning_curve

formation_test_pred_soap_krr, formation_train_pred_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr, gridsearch_results_soap_krr, learning_curve_soap_krr = get_soap_KRR_predictions(X_train, formation_train_true, bandgap_train_true, n_threads = n_threads)
x_grid_formation_soap_krr = np.unique(gridsearch_results_soap_krr['formation']['param_alpha'].data)
y_grid_formation_soap_krr = np.unique(gridsearch_results_soap_krr['formation']['param_gamma'].data)
z_grid_formation_soap_krr = np.reshape(gridsearch_results_soap_krr['formation']['mean_test_score']/10.0, [len(y_grid_formation_soap_krr), -1])
x_grid_bandgap_soap_krr = np.unique(gridsearch_results_soap_krr['bandgap']['param_alpha'].data)
y_grid_bandgap_soap_krr = np.unique(gridsearch_results_soap_krr['bandgap']['param_gamma'].data)
z_grid_bandgap_soap_krr = np.reshape(gridsearch_results_soap_krr['bandgap']['mean_test_score'], [len(y_grid_bandgap_soap_krr), -1])
run_cell_by_tag('plot_soap_krr')
```
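`train_and_pred_KRR` is defined earlier in the notebook (in the 'krr-ngram' cell) and grid-searches the `alpha` and `gamma` hyperparameters whose contours are plotted below. As a rough illustration of what kernel ridge regression itself computes, here is a numpy-only closed-form sketch with an RBF kernel; the helper's actual implementation, and the specific `alpha`/`gamma` values, are assumptions for this example:

``` python
import numpy as np

def rbf_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit_predict(X_train, y_train, X_test, alpha=1e-3, gamma=0.5):
    # closed-form kernel ridge: dual coefficients from (K + alpha*I) c = y
    K = rbf_kernel(X_train, X_train, gamma)
    coef = np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ coef

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 3))
y = np.sin(X.sum(axis=1))
pred = krr_fit_predict(X, y, X)   # predict back on the training set
print(float(np.abs(pred - y).max()))
```

Larger `alpha` trades training accuracy for smoothness and `gamma` sets the kernel width, which is exactly the trade-off the grid search over `param_alpha` and `param_gamma` explores.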
%% Cell type:code id: tags:plot_soap_krr
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_soap_krr, formation_test_true, bandgap_test_pred_soap_krr, bandgap_test_true)
display(HTML("""<div><hr><font size=4em><b>Hyperparameter grid search</b></font></div>"""))
display(HTML("""<div><font size=3.5em><b>- Formation energy</b></font><br></div>"""))
# z_grid_formation_soap_krr was already rescaled by 1/10 when it was built,
# and the bandgap scores were never scaled, so neither grid is divided again here
plot_hyperparameter_contour(x_grid_formation_soap_krr, y_grid_formation_soap_krr, z_grid_formation_soap_krr, grid = 3)
display(HTML("""<div><font size=3.5em><b>- Bandgap</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_bandgap_soap_krr, y_grid_bandgap_soap_krr, z_grid_bandgap_soap_krr, grid = 3)
if(if_learningcurve == "true"):
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_soap_krr)
```
%% Cell type:code id: tags:nn-soap %% Cell type:code id: tags:nn-soap
``` python ``` python
#cell tag: nn-soap #cell tag: nn-soap
#Neural network for SOAP #Neural network for SOAP
from IPython.display import HTML from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#009FC2;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Neural network </b> </font><br></div>""")) display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#009FC2;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Neural network </b> </font><br></div>"""))
from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split from sklearn.model_selection import train_test_split
import torch import torch
from torch.autograd import Variable from torch.autograd import Variable
import torch.nn as nn import torch.nn as nn
import torch.nn.functional as F import torch.nn.functional as F
torch.manual_seed(1) torch.manual_seed(1)
class myNNRegressor(torch.nn.Module): class myNNRegressor(torch.nn.Module):
''' '''
def __init__(self, feature_size): def __init__(self, feature_size):
super(myNNRegressor, self).__init__() super(myNNRegressor, self).__init__()
self.feature_size = feature_size self.feature_size = feature_size
self.hidden_size1 = 512 self.hidden_size1 = 512
self.hidden_size2 = 256 self.hidden_size2 = 256
self.linear1 = nn.Linear(self.feature_size, self.hidden_size1, bias=False) self.linear1 = nn.Linear(self.feature_size, self.hidden_size1, bias=False)
self.batchnorm1 = nn.BatchNorm1d(self.hidden_size1, affine=True) self.batchnorm1 = nn.BatchNorm1d(self.hidden_size1, affine=True)
self.linear2 = nn.Linear(self.hidden_size1, self.hidden_size2, bias=False) self.linear2 = nn.Linear(self.hidden_size1, self.hidden_size2, bias=False)
self.batchnorm2 = nn.BatchNorm1d(self.hidden_size2, affine=True) self.batchnorm2 = nn.BatchNorm1d(self.hidden_size2, affine=True)
self.linear3 = nn.Linear(self.hidden_size2, self.hidden_size2, bias=False) self.linear3 = nn.Linear(self.hidden_size2, self.hidden_size2, bias=False)
self.batchnorm3 = nn.BatchNorm1d(self.hidden_size2, affine=True) self.batchnorm3 = nn.BatchNorm1d(self.hidden_size2, affine=True)
self.h2s = nn.Linear(self.hidden_size2, 1, bias=False) self.h2s = nn.Linear(self.hidden_size2, 1, bias=False)
self.activation = nn.ReLU() self.activation = nn.ReLU()
self.dropout = nn.Dropout(p=0.20) self.dropout = nn.Dropout(p=0.20)
def forward(self, X): def forward(self, X):
h1 = self.activation(self.dropout(self.batchnorm1(self.linear1(X)))) h1 = self.activation(self.dropout(self.batchnorm1(self.linear1(X))))
h2 = self.activation(self.dropout(self.batchnorm2(self.linear2(h1)))) h2 = self.activation(self.dropout(self.batchnorm2(self.linear2(h1))))
h3 = self.activation(self.dropout(self.batchnorm3(self.linear3(h2)))) h3 = self.activation(self.dropout(self.batchnorm3(self.linear3(h2))))
predictions = self.h2s(h3) predictions = self.h2s(h3)
return predictions return predictions
''' '''
def __init__(self, feature_size): def __init__(self, feature_size):
super(myNNRegressor, self).__init__() super(myNNRegressor, self).__init__()
self.feature_size = feature_size self.feature_size = feature_size
hiddens = [] hiddens = []
batchnorms = [] batchnorms = []
hiddens.append(nn.Linear(self.feature_size, N_nn_neurons[0], bias=False)) hiddens.append(nn.Linear(self.feature_size, N_nn_neurons[0], bias=False))
batchnorms.append(nn.BatchNorm1d(N_nn_neurons[0], affine=True)) batchnorms.append(nn.BatchNorm1d(N_nn_neurons[0], affine=True))
for i in range(N_nn_layers-1): for i in range(N_nn_layers-1):
hiddens.append(nn.Linear(N_nn_neurons[i], N_nn_neurons[i+1], bias=False)) hiddens.append(nn.Linear(N_nn_neurons[i], N_nn_neurons[i+1], bias=False))
batchnorms.append(nn.BatchNorm1d(N_nn_neurons[i+1], affine=True)) batchnorms.append(nn.BatchNorm1d(N_nn_neurons[i+1], affine=True))
self.h2s = nn.Linear(N_nn_neurons[-1], 1, bias = False) self.h2s = nn.Linear(N_nn_neurons[-1], 1, bias = False)
''' '''
hiddens.append(nn.Linear(self.feature_size, 512, bias=False)) hiddens.append(nn.Linear(self.feature_size, 512, bias=False))
batchnorms.append(nn.BatchNorm1d(512, affine=True)) batchnorms.append(nn.BatchNorm1d(512, affine=True))
hiddens.append(nn.Linear(512, 256, bias=False)) hiddens.append(nn.Linear(512, 256, bias=False))
batchnorms.append(nn.BatchNorm1d(256, affine=True)) batchnorms.append(nn.BatchNorm1d(256, affine=True))
hiddens.append(nn.Linear(256, 256, bias=False)) hiddens.append(nn.Linear(256, 256, bias=False))
batchnorms.append(nn.BatchNorm1d(256, affine=True)) batchnorms.append(nn.BatchNorm1d(256, affine=True))
self.h2s = nn.Linear(256, 1, bias = False) self.h2s = nn.Linear(256, 1, bias = False)
self.batchnorm1 = nn.BatchNorm1d(512, affine=True) self.batchnorm1 = nn.BatchNorm1d(512, affine=True)
self.batchnorm2 = nn.BatchNorm1d(256, affine=True) self.batchnorm2 = nn.BatchNorm1d(256, affine=True)
self.batchnorm3 = nn.BatchNorm1d(256, affine=True) self.batchnorm3 = nn.BatchNorm1d(256, affine=True)
self.hiddenlayers = [nn.Linear(self.feature_size, 512, bias=False), nn.Linear(512, 256, bias=False), nn.Linear(256, 256, bias=False)] self.hiddenlayers = [nn.Linear(self.feature_size, 512, bias=False), nn.Linear(512, 256, bias=False), nn.Linear(256, 256, bias=False)]
''' '''
self.hidden_layers = nn.ModuleList(hiddens) self.hidden_layers = nn.ModuleList(hiddens)
self.batchnorm = nn.ModuleList(batchnorms) self.batchnorm = nn.ModuleList(batchnorms)
self.activation = nn.ReLU() self.activation = nn.ReLU()
self.dropout = nn.Dropout(p=0.20) self.dropout = nn.Dropout(p=0.20)
def forward(self, X): def forward(self, X):
h = self.activation(self.dropout(self.batchnorm[0](self.hidden_layers[0](X)))) h = self.activation(self.dropout(self.batchnorm[0](self.hidden_layers[0](X))))
for i in range(1, N_nn_layers): for i in range(1, N_nn_layers):
h = self.activation(self.dropout(self.batchnorm[i](self.hidden_layers[i](h)))) h = self.activation(self.dropout(self.batchnorm[i](self.hidden_layers[i](h))))
''' '''
h = self.activation(self.dropout(self.batchnorm2(self.hiddenlayers[1](h)))) h = self.activation(self.dropout(self.batchnorm2(self.hiddenlayers[1](h))))
h = self.activation(self.dropout(self.batchnorm3(self.hiddenlayers[2](h)))) h = self.activation(self.dropout(self.batchnorm3(self.hiddenlayers[2](h))))
''' '''
predictions = self.h2s(h) predictions = self.h2s(h)
return predictions return predictions
def rmsle(real, predicted): def rmsle(real, predicted):
sum=0.0 sum=0.0
_ = predicted.cpu().data.numpy() _ = predicted.cpu().data.numpy()
for idx, x in enumerate(_): for idx, x in enumerate(_):
if x<0: if x<0:
continue continue
p = torch.log(predicted[idx]+1) p = torch.log(predicted[idx]+1)
r = torch.log(real[idx]+1) r = torch.log(real[idx]+1)
sum += (p - r)**2 sum += (p - r)**2
return torch.sqrt(sum/float(len(predicted))) return torch.sqrt(sum/float(len(predicted)))
def get_predictions(nEpochs, X_train, y_train, X_val=None, y_val=None, evaluation=False):
    # Remove any stale checkpoint from a previous run
    try:
        os.remove('./models/myBestModel.pt')
    except OSError:
        pass
    feature_size = X_train.shape[1]
    learning_rate = 1e-3
    X_train = Variable(torch.from_numpy(X_train).float(),
                       requires_grad=False).cpu()
    y_train = Variable(torch.from_numpy(y_train).float(),
                       requires_grad=False).cpu()
    if evaluation:
        X_val = Variable(torch.from_numpy(X_val).float(),
                         requires_grad=False, volatile=True).cpu()
        y_val = Variable(torch.from_numpy(y_val).float(),
                         requires_grad=False, volatile=True).cpu()
    NNRegressor = myNNRegressor(feature_size).cpu()
    optimiser = torch.optim.Adam(NNRegressor.parameters(), lr=learning_rate)
    loss_fn = nn.MSELoss()
    final_train_loss = 99999
    for epoch in tnrange(nEpochs):
        # Train
        NNRegressor.train(True)
        t0 = time.time()
        predictions = NNRegressor(X_train)
        # Warm up with MSE for the first 100 epochs, then switch to the
        # competition metric (RMSLE)
        if epoch <= 100:
            loss = loss_fn(predictions.reshape(-1), y_train)
        else:
            loss = rmsle(y_train, predictions.reshape(-1))
        NNRegressor.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(NNRegressor.parameters(), 1.0)
        optimiser.step()
        t1 = time.time()
        # Evaluate
        NNRegressor.train(False)
        predictions_train = NNRegressor(X_train)
        train_loss = float(rmsle(y_train, predictions_train.reshape(-1)))
        if evaluation:
            predictions = NNRegressor(X_val)
            val_loss = float(rmsle(y_val, predictions.reshape(-1)))
        NNRegressor.train(True)
        '''
        if evaluation:
            msg = ('Epoch ' + str(epoch) + ' - training loss: ' + str(float(train_loss))
                   + ' validation loss: ' + str(float(val_loss)))
            tqdm.write(msg)
        else:
            msg = 'Epoch ' + str(epoch) + ' - training loss: ' + str(float(train_loss))
            tqdm.write(msg)
        '''
        final_train_loss = float(train_loss)
        if evaluation:
            # Checkpoint the best model seen so far
            if 'best_val_loss' in locals():
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    with open('./models/myBestModel.pt', 'wb') as f:
                        torch.save(NNRegressor, f)
            else:
                best_val_loss = val_loss
                with open('./models/myBestModel.pt', 'wb') as f:
                    torch.save(NNRegressor, f)
        if epoch == 149:
            # Drop the learning rate for the remaining epochs
            learning_rate = 1e-5
            if evaluation:
                del NNRegressor
                with open('./models/myBestModel.pt', 'rb') as f:
                    NNRegressor = torch.load(f)
            optimiser = torch.optim.Adam(NNRegressor.parameters(),
                                         lr=learning_rate)
    if evaluation:
        with open('./models/myBestModel.pt', 'rb') as f:
            NNRegressor = torch.load(f)
        NNRegressor.train(False)
        predictions = NNRegressor(X_val)
        val_loss = float(rmsle(y_val, predictions))
        print('-----\nbest val. loss: {}\n_____'.format(val_loss))
    else:
        NNRegressor.train(False)
        predictions = NNRegressor(X_train)
    print("Final training loss (RMSLE):", final_train_loss)
    return predictions, NNRegressor
def train_test_NN_soap(x_trains, y_trains, x_tests):
    ensemble_size = 1
    all_test_predictions = []
    all_train_predictions = []
    for regressor in range(ensemble_size):
        predictions, NNRegressor = get_predictions(nEpochs=250,
                                                   X_train=x_trains, y_train=y_trains)
        NNRegressor.train(False)
        predictions = NNRegressor(x_tests).cpu().data.numpy()
        train_predictions = NNRegressor(
            Variable(torch.from_numpy(x_trains).float(),
                     requires_grad=False, volatile=True)
        ).cpu().data.numpy()
        all_test_predictions.append(predictions.reshape(-1))
        all_train_predictions.append(train_predictions.reshape(-1))
    # Average the predictions over the ensemble
    final_test_prediction = np.mean(all_test_predictions, axis=0)
    final_train_prediction = np.mean(all_train_predictions, axis=0)
    return final_test_prediction, final_train_prediction
def get_soap_NN_predictions(X_train, X_test, formation_train_true, bandgap_train_true):
    # Formation energies are scaled up by 10 for training and scaled back below
    formation_train_true_10times = formation_train_true * 10.0
    formation_test_pred_soap_nn, formation_train_pred_soap_nn = train_test_NN_soap(
        X_train, formation_train_true_10times, X_test)
    bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn = train_test_NN_soap(
        X_train, bandgap_train_true, X_test)
    learning_curve = {}
    if if_learningcurve == "true":
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        # Integer division so the step is a valid range() argument in Python 3
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(X_train), N_train, replace=False)
            formation_test_pred_soap_nn, formation_train_pred_soap_nn = train_test_NN_soap(
                X_train[indtrain], formation_train_true_10times[indtrain], X_test)
            bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn = train_test_NN_soap(
                X_train[indtrain], bandgap_train_true[indtrain], X_test)
            learning_curve_formation_train.append(formation_train_pred_soap_nn / 10.0)
            learning_curve_formation_test.append(formation_test_pred_soap_nn / 10.0)
            learning_curve_bandgap_train.append(bandgap_train_pred_soap_nn)
            learning_curve_bandgap_test.append(bandgap_test_pred_soap_nn)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    formation_test_pred_soap_nn = formation_test_pred_soap_nn / 10.0
    formation_train_pred_soap_nn = formation_train_pred_soap_nn / 10.0
    return formation_test_pred_soap_nn, formation_train_pred_soap_nn, bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn, learning_curve
# Scale
scaler = StandardScaler(with_mean=True, with_std=True)
formation_train_true = train_csv[:, -2]
bandgap_train_true = train_csv[:, -1]
# Mean atomic descriptor: scale test and train together, then split again
combined_mean_atomic_descriptors = scaler.fit_transform(np.append(
    test_mean_atomic_descriptors,
    train_mean_atomic_descriptors,
    axis=0))
X_test_soap = np.round(combined_mean_atomic_descriptors[:test_csv.shape[0]],
                       decimals=5)
X_test_soap = Variable(torch.from_numpy(X_test_soap).float(),
                       requires_grad=False, volatile=True).cpu()
X_train_soap = np.round(combined_mean_atomic_descriptors[test_csv.shape[0]:],
                        decimals=5)
formation_test_pred_soap_nn, formation_train_pred_soap_nn, bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn, learning_curve_soap_nn = get_soap_NN_predictions(
    X_train_soap, X_test_soap, formation_train_true, bandgap_train_true)
run_cell_by_tag('plot_soap_nn')
```
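%% Cell type:markdown id: tags:
The `rmsle` helper above implements the root mean squared logarithmic error used to score this challenge, skipping negative predictions while still dividing by the total number of predictions. As an illustrative sanity check (not part of the original pipeline; `rmsle_np` is a name introduced here), the same metric can be written in plain NumPy:
%% Cell type:code id: tags:
``` python
import numpy as np

def rmsle_np(real, predicted):
    """RMSLE, mirroring the PyTorch helper: negative predictions are
    skipped, but the denominator counts all predictions."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = predicted >= 0  # skip negative predictions, as above
    log_diff = np.log1p(predicted[mask]) - np.log1p(real[mask])
    return np.sqrt(np.sum(log_diff ** 2) / len(predicted))

print(rmsle_np([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction -> 0.0
```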
%% Cell type:code id: tags:plot_soap_nn
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_soap_nn, formation_test_true, bandgap_test_pred_soap_nn, bandgap_test_true)
if if_learningcurve == "true":
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_soap_nn)
```
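%% Cell type:markdown id: tags:
The learning-curve branch in `get_soap_NN_predictions` draws progressively larger random training subsets with `np.random.choice(..., replace=False)`. A minimal self-contained sketch of that sampling schedule (the seed and the number of curve points are illustrative, not values from the notebook):
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n_total = 2400                  # full training-set size, as in the notebook
n_points = 5                    # number of learning-curve points (illustrative)

step = (n_total - 100) // n_points
sizes = list(range(100, n_total + 1, step))
sizes[-1] = n_total  # ensure the last point uses the full training set

# One index subset per learning-curve point, drawn without replacement
subsets = [rng.choice(n_total, size=n, replace=False) for n in sizes]
print(sizes)
```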
%% Cell type:markdown id: tags:
## References
1. https://scikit-learn.org/stable/modules/kernel_ridge.html
2. https://scikit-learn.org/stable/modules/neural_networks_supervised.html
3. Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/
%% Cell type:code id: tags:
``` python
```