Skip to content
Snippets Groups Projects
Commit dc4b9f58 authored by Xiangyue Liu's avatar Xiangyue Liu
Browse files

Update kaggle_competition.ipynb

parent b71443e3
No related branches found
No related tags found
No related merge requests found
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%HTML %%HTML
<script> <script>
code_show=true; code_show=true;
function code_toggle() { function code_toggle() {
if (code_show) if (code_show)
{ {
$('div.input').hide(); $('div.input').hide();
} }
else else
{ {
$('div.input').show(); $('div.input').show();
} }
code_show = !code_show code_show = !code_show
} }
$( document ).ready(code_toggle); $( document ).ready(code_toggle);
</script> </script>
The raw code for this notebook is by default hidden for easier reading. The raw code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>. To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
import warnings import warnings
warnings.filterwarnings('ignore') warnings.filterwarnings('ignore')
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
from IPython.core.display import display, HTML from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>")) display(HTML("<style>.container { width:100% !important; }</style>"))
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%HTML %%HTML
<div align="left" style="background-color: rgba(149,170,79, 1.0); width: 100%; height: 470px"> <div align="left" style="background-color: rgba(149,170,79, 1.0); width: 100%; height: 470px">
<div > <div >
<table> <table>
<tr></tr> <tr></tr>
<tr> <tr>
<td><img id="nomad" align="right" src="https://nomad-coe.eu/uploads/nomad/images/NOMAD_Logo2.png" width="70%" alt="NOMAD Logo"></td> <td><img id="nomad" align="right" src="https://nomad-coe.eu/uploads/nomad/images/NOMAD_Logo2.png" width="70%" alt="NOMAD Logo"></td>
<td><font size=14em color="#20335d" align="left"><b>NOMAD Analytics Toolkit</b></font></td> <td><font size=14em color="#20335d" align="left"><b>NOMAD Analytics Toolkit</b></font></td>
<td><img height="85px" width="80px" src="https://www.nomad-coe.eu/uploads/nomad/backgrounds/head_big-data_analytics_2.png"></td> <td><img height="85px" width="80px" src="https://www.nomad-coe.eu/uploads/nomad/backgrounds/head_big-data_analytics_2.png"></td>
</tr> </tr>
</table> </table>
</div> </div>
<br><br> <br><br>
<div style="position:relative; left:3%"><font size=6em color="#20335d" ><b> - NOMAD 2018 Kaggle competition</b></font></div> <div style="position:relative; left:3%"><font size=6em color="#20335d" ><b> - NOMAD 2018 Kaggle competition</b></font></div>
<p style="position:relative;left:10%; "> <p style="position:relative;left:10%; ">
<br> <br>
Created by: Created by:
Xiangyue Liu<sup>1</sup> (<a href="mailto:xyliu@fhi-berlin.mpg.de">email</a>), Xiangyue Liu<sup>1</sup> (<a href="mailto:xyliu@fhi-berlin.mpg.de">email</a>),
Christopher Sutton<sup>1</sup> (<a href="mailto:sutton@fhi-berlin.mpg.de">email</a>), Christopher Sutton<sup>1</sup> (<a href="mailto:sutton@fhi-berlin.mpg.de">email</a>),
Luca M. Ghiringhelli<sup>1</sup>(<a href="mailto:ghiringhelli@fhi-berlin.mpg.de">email</a>), <br> Luca M. Ghiringhelli<sup>1</sup>(<a href="mailto:ghiringhelli@fhi-berlin.mpg.de">email</a>), <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Takenori Yamamoto<sup>2</sup>, &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Takenori Yamamoto<sup>2</sup>,
Yury Lysogorskiy<sup>3</sup>, Yury Lysogorskiy<sup>3</sup>,
Lars Blumenthal<sup>4,5</sup>, Lars Blumenthal<sup>4,5</sup>,
Thomas Hammerschmidt<sup>3</sup>, Thomas Hammerschmidt<sup>3</sup>,
Jacek Golebiowski<sup>4,5</sup>, <br> Jacek Golebiowski<sup>4,5</sup>, <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Angelo Ziletti<sup>1</sup>, Matthias Scheffler<sup>1</sup> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Angelo Ziletti<sup>1</sup>, Matthias Scheffler<sup>1</sup>
<br><br> <br><br>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br> <sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<sup>2</sup> Research Institute for Mathematical and Computational Sciences (RIMCS), LLC, Yokohama, Japan <br> <sup>2</sup> Research Institute for Mathematical and Computational Sciences (RIMCS), LLC, Yokohama, Japan <br>
<sup>3</sup> ICAMS, Ruhr-Universität Bochum, Germany <br> <sup>3</sup> ICAMS, Ruhr-Universität Bochum, Germany <br>
<sup>4</sup> EPSRC Centre for Doctoral Training on Theory and Simulation of Materials Department of Physics, Imperial College London, London, U.K. <br> <sup>4</sup> EPSRC Centre for Doctoral Training on Theory and Simulation of Materials Department of Physics, Imperial College London, London, U.K. <br>
<sup>5</sup> Thomas Young Centre for Theory and Simulation of Materials, Department of Materials, Imperial College London, London, U.K <br> <sup>5</sup> Thomas Young Centre for Theory and Simulation of Materials, Department of Materials, Imperial College London, London, U.K <br>
<br> <br>
</p> </p>
<br> <br>
<div style="position:relative;bottom:3%"> <div style="position:relative;bottom:3%">
<div style="position:absolute;right:5%;bottom: 0%;"><font color="#999999" size="10em">v1.0.0</font></div> <div style="position:absolute;right:5%;bottom: 0%;"><font color="#999999" size="10em">v1.0.0</font></div>
<div style="position:absolute;right:5%;bottom: 0%;"><font color="#666666" size="2.7em">[Last updated: March 21, 2019]</font></div> <div style="position:absolute;right:5%;bottom: 0%;"><font color="#666666" size="2.7em">[Last updated: March 21, 2019]</font></div>
</div> </div>
</div> </div>
<div style='text-align: right;'> <div style='text-align: right;'>
<a href="https://analytics-toolkit.nomad-coe.eu/home/" class="btn btn-primary" style="font-size:larger;">Back to Analytics Home</a> <a href="https://analytics-toolkit.nomad-coe.eu/home/" class="btn btn-primary" style="font-size:larger;">Back to Analytics Home</a>
<a href="https://www.nomad-coe.eu/" class="btn btn-primary" style="font-size:larger; ">Back to nomad-coe</a> <a href="https://www.nomad-coe.eu/" class="btn btn-primary" style="font-size:larger; ">Back to nomad-coe</a>
</div> </div>
``` ```
%% Output %% Output
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<font size=4em>[Skip introduction](#make_predictions)</font> <font size=4em>[Skip introduction](#make_predictions)</font>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# NOMAD 2018 Kaggle research competition # NOMAD 2018 Kaggle research competition
### A paradigm shift in solving materials science grand challenges by crowd sourcing solutions through an open and global big-data competition ### A paradigm shift in solving materials science grand challenges by crowd sourcing solutions through an open and global big-data competition
Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials. Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials.
To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence has organized a crowd-sourced data analytics competition with Kaggle, which is one of the most well known online platforms for hosting big-data competitions. Kaggle has a community of over a half of a million users from around the world with various backgrounds in computer science, statistics, biology, and medicine. The competition occurred from December 18th 2017 to February 15th, 2018 and involved nearly 900 participants. The goal of this competition was to develop or apply data analytics models for the prediction of two target properties: the formation energy (which is an indication of the stability of a material) and the bandgap energy (which is an indicator for the potential for transparency over the visible range) to facilitate the discovery of new transparent conductors and allow for advancements in (opto)electronic technologies. A total of 5,000.00 euros in prizes were awarded to the top-three participants with the best performing models (i.e., lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as: To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence has organized a crowd-sourced data analytics competition with Kaggle, which is one of the most well known online platforms for hosting big-data competitions. Kaggle has a community of over a half of a million users from around the world with various backgrounds in computer science, statistics, biology, and medicine. The competition occurred from December 18th 2017 to February 15th, 2018 and involved nearly 900 participants. 
The goal of this competition was to develop or apply data analytics models for the prediction of two target properties: the formation energy (which is an indication of the stability of a material) and the bandgap energy (which is an indicator for the potential for transparency over the visible range) to facilitate the discovery of new transparent conductors and allow for advancements in (opto)electronic technologies. A total of 5,000.00 euros in prizes were awarded to the top-three participants with the best performing models (i.e., lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as:
$RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N}{(log(\hat{y}_i + 1) - log(y_i + 1))^2} }$ $RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N}{(log(\hat{y}_i + 1) - log(y_i + 1))^2} }$
where $N$ is the total number of observations, $\hat{y}_i$ is the predicted value, $y_i$ is the reference value for either the formation or bandgap energies. where $N$ is the total number of observations, $\hat{y}_i$ is the predicted value, $y_i$ is the reference value for either the formation or bandgap energies.
The dataset consists of 3,000 materials, 2,400 of which made the training set and the remaining 600 were used as the test set (i.e., only structures and input features were provided), with the target properties kept secret. Of that test set, 100 materials were used to determine the public leaderboard score so that the participants could assess their model performance on the fly (but the exact values used in this assessment were kept secret). The top three winners of the competition were determined the private leaderboard score based on the test set of 500 materials with the target properties kept secret. The dataset consists of 3,000 materials, 2,400 of which made the training set and the remaining 600 were used as the test set (i.e., only structures and input features were provided), with the target properties kept secret. Of that test set, 100 materials were used to determine the public leaderboard score so that the participants could assess their model performance on the fly (but the exact values used in this assessment were kept secret). The top three winners of the competition were determined the private leaderboard score based on the test set of 500 materials with the target properties kept secret.
Because only 100 values were used for assessing the performance on the leaderboard, participants had to ensure the predictive accuracy of their model for unseen data, even if a disagreement was found with the public leaderboard score. This is evident in the summary of the average RMSLE for all of the participants with scores below 0.25 in Figure 1, where a large shift in the values between the public leaderboard (100 compounds) and private leaderboard (500 compounds). The winning score has a RMSLE of 0.0509, while the 2nd and 3rd places winners were closely stacked together with a RMSLE of 0.0521 and 0.0523. However, within the first bin, there were a total of four participants with an RMLSE 0.053 (i.e., 0.45% of participants). Because only 100 values were used for assessing the performance on the leaderboard, participants had to ensure the predictive accuracy of their model for unseen data, even if a disagreement was found with the public leaderboard score. This is evident in the summary of the average RMSLE for all of the participants with scores below 0.25 in Figure 1, where a large shift in the values between the public leaderboard (100 compounds) and private leaderboard (500 compounds). The winning score has a RMSLE of 0.0509, while the 2nd and 3rd places winners were closely stacked together with a RMSLE of 0.0521 and 0.0523. However, within the first bin, there were a total of four participants with an RMLSE 0.053 (i.e., 0.45% of participants).
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# More about the dataset: Group-III transparent conductors # More about the dataset: Group-III transparent conductors
Group-III oxides provide promising candidates for wide band-gap transparent conductors. A wide range of experimental band gap energies from 3.6 to 7.5 eV have been reported from alloying In$_2$O$_3$/Ga$_2$O$_3$ or Ga$_2$O$_3$/Al$_2$O$_3$, Group-III oxides provide promising candidates for wide band-gap transparent conductors. A wide range of experimental band gap energies from 3.6 to 7.5 eV have been reported from alloying In$_2$O$_3$/Ga$_2$O$_3$ or Ga$_2$O$_3$/Al$_2$O$_3$,
which suggest that alloying of group-III oxides is a viable strategy for designing new wide band gap semiconductors. However, Al$_2$O$_3$, Ga$_2$O$_3$, and In$_2$O$_3$ all display very different which suggest that alloying of group-III oxides is a viable strategy for designing new wide band gap semiconductors. However, Al$_2$O$_3$, Ga$_2$O$_3$, and In$_2$O$_3$ all display very different
ground-state structures. Therefore, it is unclear which structure will be stable for various compositions. Within the dataset, there are six different lattice symmetries: $C2/m$, $Pna2_1$, $R\bar{3}c$, $P6_3/mmc$, $Ia\bar{3}$, $Fd\bar{3}m$. ground-state structures. Therefore, it is unclear which structure will be stable for various compositions. Within the dataset, there are six different lattice symmetries: $C2/m$, $Pna2_1$, $R\bar{3}c$, $P6_3/mmc$, $Ia\bar{3}$, $Fd\bar{3}m$.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%%html %%html
<!-- CSS Style Inline: --> <!-- CSS Style Inline: -->
<style type="text/css"> <style type="text/css">
#jmol_div227{ #jmol_div227{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div12{ #jmol_div12{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div206{ #jmol_div206{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div33{ #jmol_div33{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div194{ #jmol_div194{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
#jmol_div167{ #jmol_div167{
height: 350px; height: 350px;
width: 350px; width: 350px;
float: left; float: left;
} }
</style> </style>
<!-- Load Jmol javascript library --> <!-- Load Jmol javascript library -->
<script type="text/javascript" src="assets/kaggle_competition/jsmol/JSmol.min.js"></script> <script type="text/javascript" src="assets/kaggle_competition/jsmol/JSmol.min.js"></script>
<!-- calls to jQuery and Jmol (inline) --> <!-- calls to jQuery and Jmol (inline) -->
<script type="text/javascript"> <script type="text/javascript">
// Jmol readyFunction // Jmol readyFunction
jmol_isReady = function(applet) { jmol_isReady = function(applet) {
document.title = (applet._id + " - Jmol " + Jmol.___JmolVersion) document.title = (applet._id + " - Jmol " + Jmol.___JmolVersion)
Jmol._getElement(applet, "appletdiv").style.border="0px solid blue" Jmol._getElement(applet, "appletdiv").style.border="0px solid blue"
} }
// initialize Jmol Applet // initialize Jmol Applet
var myJmol227 = "myJmol227"; var myJmol227 = "myJmol227";
var Info227 = { var Info227 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry3-spacegroup227.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry3-spacegroup227.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol12 = "myJmol12"; var myJmol12 = "myJmol12";
var Info12 = { var Info12 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry8-spacegroup12.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry8-spacegroup12.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol206 = "myJmol206"; var myJmol206 = "myJmol206";
var Info206 = { var Info206 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry7-spacegroup206.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry7-spacegroup206.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol33 = "myJmol33"; var myJmol33 = "myJmol33";
var Info33 = { var Info33 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry1-spacegroup33.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry1-spacegroup33.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol194 = "myJmol194"; var myJmol194 = "myJmol194";
var Info194 = { var Info194 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry2-spacegroup194.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry2-spacegroup194.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
var myJmol167 = "myJmol167"; var myJmol167 = "myJmol167";
var Info167 = { var Info167 = {
width: "100%", width: "100%",
height: "100%", height: "100%",
color: "#ffffff", color: "#ffffff",
use: "HTML5", use: "HTML5",
j2sPath: "assets/kaggle_competition/jsmol/j2s", j2sPath: "assets/kaggle_competition/jsmol/j2s",
jarPath: "assets/kaggle_competition/jsmol/java", jarPath: "assets/kaggle_competition/jsmol/java",
jarFile: "JmolAppletSigned.jar", jarFile: "JmolAppletSigned.jar",
debug: false, debug: false,
readyFunction: jmol_isReady, readyFunction: jmol_isReady,
script: 'load "data/kaggle_competition/xyz/geometry4-spacegroup167.xyz" ;', script: 'load "data/kaggle_competition/xyz/geometry4-spacegroup167.xyz" ;',
allowJavaScript: false, allowJavaScript: false,
disableJ2SLoadMonitor: true, disableJ2SLoadMonitor: true,
} }
// jQuery ready functions // jQuery ready functions
// is called when page has been completely loaded // is called when page has been completely loaded
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div227").html(Jmol.getAppletHtml(myJmol227, Info227)) $("#jmol_div227").html(Jmol.getAppletHtml(myJmol227, Info227))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div12").html(Jmol.getAppletHtml(myJmol12, Info12)) $("#jmol_div12").html(Jmol.getAppletHtml(myJmol12, Info12))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div206").html(Jmol.getAppletHtml(myJmol206, Info206)) $("#jmol_div206").html(Jmol.getAppletHtml(myJmol206, Info206))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div33").html(Jmol.getAppletHtml(myJmol33, Info33)) $("#jmol_div33").html(Jmol.getAppletHtml(myJmol33, Info33))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div194").html(Jmol.getAppletHtml(myJmol194, Info194)) $("#jmol_div194").html(Jmol.getAppletHtml(myJmol194, Info194))
}) })
$(document).ready(function() { $(document).ready(function() {
$("#jmol_div167").html(Jmol.getAppletHtml(myJmol167, Info167)) $("#jmol_div167").html(Jmol.getAppletHtml(myJmol167, Info167))
}) })
var lastPrompt=0; var lastPrompt=0;
function show_hide_structures() function show_hide_structures()
{ {
var x = document.getElementById("geometry_jmol"); var x = document.getElementById("geometry_jmol");
if (x.style.display === "none") if (x.style.display === "none")
{ {
x.style.display = "block"; x.style.display = "block";
} }
else else
{ {
x.style.display = "none"; x.style.display = "none";
} }
} }
</script> </script>
<div> <div>
<button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_hide_structures()">Show/hide structures</button> <button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_hide_structures()">Show/hide structures</button>
<br><br> <br><br>
</div> </div>
<div id="geometry_jmol"> <div id="geometry_jmol">
<table> <table>
<tr> <tr>
<th><center>Spacegroup 12 (C2/m)</center></th> <th><center>Spacegroup 12 (C2/m)</center></th>
<th><center>Spacegroup 33 (Pna2<sub>1</sub>)</center></th> <th><center>Spacegroup 33 (Pna2<sub>1</sub>)</center></th>
<th><center>Spacegroup 167 (R<font style="text-decoration: overline">3</font>c)</center></th> <th><center>Spacegroup 167 (R<font style="text-decoration: overline">3</font>c)</center></th>
</tr> </tr>
<tr> <tr>
<th><div id='jmol_div12'></div></th> <th><div id='jmol_div12'></div></th>
<th><div id='jmol_div33'></div></th> <th><div id='jmol_div33'></div></th>
<th><div id='jmol_div167'></div></th> <th><div id='jmol_div167'></div></th>
</tr> </tr>
<tr> <tr>
<th><center>Spacegroup 194 (P6<sub>3</sub>/mmc)</center></th> <th><center>Spacegroup 194 (P6<sub>3</sub>/mmc)</center></th>
<th><center>Spacegroup 206 (Ia<font style="text-decoration: overline">3</font>)</center></th> <th><center>Spacegroup 206 (Ia<font style="text-decoration: overline">3</font>)</center></th>
<th><center>Spacegroup 227 (Fd<font style="text-decoration: overline">3</font>m)</center></th> <th><center>Spacegroup 227 (Fd<font style="text-decoration: overline">3</font>m)</center></th>
</tr> </tr>
<tr> <tr>
<th><div id='jmol_div194'></div></th> <th><div id='jmol_div194'></div></th>
<th><div id='jmol_div206'></div></th> <th><div id='jmol_div206'></div></th>
<th><div id='jmol_div227'></div></th> <th><div id='jmol_div227'></div></th>
</tr> </tr>
</table> </table>
</div> </div>
``` ```
%% Output %% Output
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# Winning solutions # Winning solutions
- __1st place winning solution:__ crystal graph n-grams + Kernel Ridge Regression - __1st place winning solution:__ crystal graph n-grams + Kernel Ridge Regression
- __3rd place winning solution:__ SOAP-based descriptor + Neural network - __3rd place winning solution:__ SOAP-based descriptor + Neural network
## - Representations: crystal graph n-grams, SOAP-based descriptor ## - Representations: crystal graph n-grams, SOAP-based descriptor
### 1. Crystal graph n-grams ### 1. Crystal graph n-grams
The 1st place winning solution was obtained using the metal-oxygen coordination number derived from the number of bonds that are within the sum of the ionic Shannon experimental radii (which were enlarged by 30-50% depending on the crystal structure type). These ionic bonds are then used for building a crystal graph, where each atom is a node in the graph and the corresponding edges between nodes are defined by the ionic bond, which are shown as coordination numbers for each atom for a sequence of 6 atoms. The 1st place winning solution was obtained using the metal-oxygen coordination number derived from the number of bonds that are within the sum of the ionic Shannon experimental radii (which were enlarged by 30-50% depending on the crystal structure type). These ionic bonds are then used for building a crystal graph, where each atom is a node in the graph and the corresponding edges between nodes are defined by the ionic bond, which are shown as coordination numbers for each atom for a sequence of 6 atoms.
<img src="assets/kaggle_competition/ngram.png" alt="Drawing" style="width: 700px;"/> <img src="assets/kaggle_competition/ngram.png" alt="Drawing" style="width: 700px;"/>
<center> Figure 1 : Depiction of a crystal graph representation of In3Ga1O6 showing the connections between each atom (node) that are defined by the ionic bonds. </center> <center> Figure 1 : Depiction of a crystal graph representation of In3Ga1O6 showing the connections between each atom (node) that are defined by the ionic bonds. </center>
### 2. SOAP-based descriptor ### 2. SOAP-based descriptor
In the 3rd place winning solution, the smooth overlap of atomic positions (SOAP) kernel developed by Bartók et al. that incorporates information on the local atomic environment through a rotationally integrated overlap of neighbor densities.20-21 The SOAP kernel describes the local environment for a given atom ($i$) through the sum of Gaussians centered on each of the atomic neighbors ($j$) within a specific cutoff radius ($r_{ij}$): In the 3rd place winning solution, the smooth overlap of atomic positions (SOAP) kernel developed by Bartók et al. that incorporates information on the local atomic environment through a rotationally integrated overlap of neighbor densities.20-21 The SOAP kernel describes the local environment for a given atom ($i$) through the sum of Gaussians centered on each of the atomic neighbors ($j$) within a specific cutoff radius ($r_{ij}$):
$\rho_{i}(r) = \sum_{j} exp(\frac{-(r-r_{ij})^2}{2\sigma^2_{atom}}) f_{cut}(r_{ij})$ $\rho_{i}(r) = \sum_{j} exp(\frac{-(r-r_{ij})^2}{2\sigma^2_{atom}}) f_{cut}(r_{ij})$
where $\sigma_{atom}$ is a smoothing parameter and the switching function $f_{cut}$ goes smoothly to zero beyond a specified radial value. This local atomic neighbor density can be expanded in terms of spherical harmonics and orthogonal radial functions, while the expansion coefficients are then combined to form the rotationally invariant power spectrum corresponding to the neighbor density for each atom. where $\sigma_{atom}$ is a smoothing parameter and the switching function $f_{cut}$ goes smoothly to zero beyond a specified radial value. This local atomic neighbor density can be expanded in terms of spherical harmonics and orthogonal radial functions, while the expansion coefficients are then combined to form the rotationally invariant power spectrum corresponding to the neighbor density for each atom.
## - Regression methods: Kernel ridge regression, Neural network ## - Regression methods: Kernel ridge regression, Neural network
### 1. Kernel ridge regression (KRR) $^1$ ### 1. Kernel ridge regression (KRR) $^1$
Kernel ridge regression (KRR) is a generalization of ridge regression, where a linear function is learned in the space induced by the respective kernel and the data. The squared loss is minimized with a squared norm L2 regularization term. Kernel ridge regression (KRR) is a generalization of ridge regression, where a linear function is learned in the space induced by the respective kernel and the data. The squared loss is minimized with a squared norm L2 regularization term.
The model learned by KRR has the same form with support vector regression (SVR). In both KRR and SVR, the L2 regularization are used, but KRR uses squared error loss function which is different than SVR. The learned model by KRR is non-sparse, therefore the fitting of KRR is typically slower than SVR. The model learned by KRR has the same form with support vector regression (SVR). In both KRR and SVR, the L2 regularization are used, but KRR uses squared error loss function which is different than SVR. The learned model by KRR is non-sparse, therefore the fitting of KRR is typically slower than SVR.
### 2. Neural network $^2$
In the 3rd-place winning solution, a multi-layer perceptron (MLP) is employed. An MLP is composed of more than one perceptron: typically it has an input layer, several hidden layers, and an output layer. The input of the network is transformed using a learnt non-linear transformation. As a supervised learning algorithm, an MLP learns a function $f(\cdot):R^{I} \rightarrow R^{O}$, where $I$ is the number of input dimensions and $O$ is the number of output dimensions. Several parameters are adjusted during training, including the weights of the neurons and the biases. Gradient-based optimisation algorithms, such as stochastic gradient descent, are employed to minimize the loss function. There are different choices of loss function, for example the root mean squared error (RMSE) or the cross entropy; one can also switch loss functions during training to obtain better gradients. Figure 2 shows a one-hidden-layer MLP with scalar output as an example.
<img src="assets/kaggle_competition/multilayerperceptron_network.png" alt="Drawing" style="width: 400px;"/>
<center> Figure 2 : One hidden layer MLP $^2$. </center>
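The MLP described above can be sketched with scikit-learn's `MLPRegressor` (again on illustrative synthetic data; the layer sizes are placeholders, not the winning architecture):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X).sum(axis=1)          # scalar target, as in Figure 2

# Two hidden layers; the weights and biases are adjusted by a
# gradient-based optimizer (here "adam") minimizing a squared-error loss
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mlp.fit(X, y)
pred = mlp.predict(X)
```

The interactive form below exposes the same knobs (number of hidden layers and neurons per layer) for the actual models used in this notebook.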
%% Cell type:markdown id: tags:
<a id='make_predictions'></a>
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
%%HTML
<br><br><br>
<font size="6.5em"><b>Make predictions on formation energies and bandgaps</b></font>
<br><hr><br>
<font size="5em"><b>Winning representations combined with different regression methods</b></font>
<br><br>
<font size = "3.5em"> To understand the relative importance of the representation vs. the regression model, one can examine the performance of each representation combined with different regression models.
The hyperparameters are optimized for each representation/regressor combination. </font>
<br>
<font size = "3.5em" color="009FC2"><br>Warning: the learning algorithms employed in this study (e.g. grid search) cannot guarantee deterministic results. The actual predictions may diverge from the published data.
</font>
<br><br><br>
<form>
<font size="4em">Select a representation and a regression method:</font>
<select id="representation" onclick="show_options()">
<option value="ngram">n-gram</option>
<option value="soap">SOAP</option>
</select>
<select id="regression" onclick="show_options()">
<option value="krr">KRR</option>
<option value="nn">Neural network</option>
</select>
<button type="button" id="button_show_options" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px" onclick="show_options()">More options</button>
<button type="button" id="button_hide_options" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px; display: none" onclick="hide_options()">Fewer options</button>
</form>
<script type="text/Javascript">
function show_options()
{
    representation_value = document.getElementById("representation").value;
    regression_value = document.getElementById("regression").value;
    if(representation_value == "ngram")
    {
        document.getElementById("div_options_soap").style.display="none";
        document.getElementById("div_options_ngram").style.display="";
        document.getElementById("button_show_options").style.display="none";
        document.getElementById("button_hide_options").style.display="";
    }
    else if(representation_value == "soap")
    {
        document.getElementById("div_options_ngram").style.display="none";
        document.getElementById("div_options_soap").style.display="";
        document.getElementById("button_show_options").style.display="none";
        document.getElementById("button_hide_options").style.display="";
    }
    if(regression_value == "nn")
    {
        document.getElementById("div_options_nn").style.display="";
        set_nn_default();
    }
    else if(regression_value == "krr")
    {
        document.getElementById("div_options_nn").style.display="none";
    }
}
function hide_options()
{
    document.getElementById("button_show_options").style.display="";
    document.getElementById("button_hide_options").style.display="none";
    document.getElementById("div_options_ngram").style.display="none";
    document.getElementById("div_options_soap").style.display="none";
    document.getElementById("div_options_nn").style.display="none";
}
function set_nn_n_neurons()
{
    document.getElementById("div_options_nn_n_neurons").style.display="";
    n_layers = document.getElementById("input_n_nn").value;
    contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>";
    for(i=0; i<n_layers; i++)
    {
        content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"256\" min=\"1\" max=\"1024\" >";
        contents += content_add;
        if((i+1)%6 == 0)
        {
            contents += "<br>";
        }
    }
    document.getElementById("div_options_nn_n_neurons").innerHTML = contents;
}
function set_nn_default()
{
    document.getElementById("div_options_nn_n_neurons").style.display="";
    representation_value = document.getElementById("representation").value;
    if(representation_value == "ngram")
    {
        n_layers = 11;
        document.getElementById("input_n_nn").value = n_layers;
        contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>";
        for(i=0; i<7; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"100\" min=\"1\" max=\"1024\" >";
            contents += content_add;
            if((i+1)%6 == 0)
            {
                contents += "<br>";
            }
        }
        for(i=7; i<11; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"50\" min=\"1\" max=\"1024\" >";
            contents += content_add;
            if((i+1)%6 == 0)
            {
                contents += "<br>";
            }
        }
    }
    else if(representation_value == "soap")
    {
        n_layers = 3;
        document.getElementById("input_n_nn").value = n_layers;
        contents = "<br><font size=\"3em\">Number of neurons in each hidden layer</font><br>";
        for(i=0; i<1; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"512\" min=\"1\" max=\"1024\" >";
            contents += content_add;
        }
        for(i=1; i<3; i++)
        {
            content_add = "<font size=\"3em\"> Layer " + String(i+1) + ": </font><input type=\"number\" id=\"input_n_neurons_" + String(i) + "\" value=\"256\" min=\"1\" max=\"1024\" >";
            contents += content_add;
        }
    }
    document.getElementById("div_options_nn_n_neurons").innerHTML = contents;
    save_options_nn();
}
function save_options_nn()
{
    n_layers = document.getElementById("input_n_nn").value;
    var n_neurons = [];
    for(i=0; i<n_layers; i++)
    {
        id_i = "input_n_neurons_" + String(i);
        n_neurons_i = document.getElementById(id_i).value;
        n_neurons.push(n_neurons_i);
    }
    var command = "N_nn_neurons = " + n_neurons + ";";
    command += "N_nn_layers = int(" + n_layers + ");";
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
}
function save_options()
{
    representation_value = document.getElementById("representation").value;
    regression_value = document.getElementById("regression").value;
    if(representation_value == "ngram")
    {
        ngram_N = document.getElementById("input_ngram_N").value;
        n_threads = document.getElementById("input_n_threads").value;
        var command = "ngram_N = int(" + ngram_N + "); n_threads = int(" + n_threads + ");";
    }
    if(representation_value == "soap")
    {
        soap_cutoff = document.getElementById("input_soap_cutoff").value;
        soap_lmax = document.getElementById("input_soap_lmax").value;
        soap_nmax = document.getElementById("input_soap_nmax").value;
        n_threads = document.getElementById("input_n_threads").value;
        var command = "soap_cutoff = int(" + soap_cutoff + ");";
        command += "soap_lmax = int(" + soap_lmax + ");";
        command += "soap_nmax = int(" + soap_nmax + ");";
        command += "n_threads = int(" + n_threads + ");";
    }
    console.log("Executing Command: " + command);
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
}
</script>
<div id="div_options_ngram" style="display: none; position:relative; left:5%">
<br>
<font size="3em">N-gram size (N = 1 ~ 4)</font> <input type="number" id="input_ngram_N" value="3" min="1" max="4" >
<br>
<font size="3em">Number of threads used in regression (N = 1 ~ 4)</font> <input type="number" id="input_n_threads" value="4" min="1" max="4" >
<br>
<button type="button" id="save_options_ngram" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options()">Save options</button>
</div>
<div id="div_options_soap" style="display: none; position:relative; left:5%">
<br>
<font size="3em">SOAP cutoff (Ang) </font> <input type="number" id="input_soap_cutoff" value="10" min="1" max="25">
<font size="3em">Max. l </font><input type="number" id="input_soap_lmax" value="4" min="1" max="8" >
<font size="3em">Max. n </font><input type="number" id="input_soap_nmax" value="4" min="1" max="8" >
<br>
<font size="3em">Number of threads used in regression (N = 1 ~ 4)</font> <input type="number" id="input_n_threads" value="4" min="1" max="4" >
<br>
<button type="button" id="save_options_soap" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options()">Save options</button>
</div>
<br>
<div id="div_options_nn" style="display: none; position:relative; left:5%">
<br>
<font size="3em">Number of (linear) hidden layers </font> <input type="number" id="input_n_nn" value="2" min="1" max="20">
<button type="button" id="set_n_neurons" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:250px; height:30px;" onclick="set_nn_n_neurons()">Set the number of neurons</button>
<div id="div_options_nn_n_neurons" style="display: none; position:relative; left:0%">
<br>
<font size="3em">Number of neurons in each hidden layer</font>
</div>
<br>
<button type="button" id="set_nn_default" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px;" onclick="set_nn_default()">Use default configuration</button>
<button type="button" id="save_options_nn" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px;" onclick="save_options_nn()">Save options</button>
</div>
<div id="demoa"></div>
<br><br>
<button type="button" id="button_show_sample_output" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px" onclick="show_sample_output()">Show sample outputs</button>
<button type="button" id="button_hide_sample_output" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:200px; height:30px; display: none" onclick="hide_sample_output()">Hide sample outputs</button>
<button type="button" style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:150px; height:30px" onclick="get_repr_regr_combination()">Get predictions</button>
<input type="checkbox" id="if_learning_curve" value="learning_curve" onclick=show_warning_learning_curve()> Show learning curve
<input type="checkbox" id="if_use_prestored" value="use_prestored" onclick=show_warning_use_prestored_model() checked=true> Use prestored models
<div id="warning_learning_curve" style="display: none">
<br>
<font color="009FC2">
Warning: generating a learning curve can be very time-consuming (5~20 min/point, depending on the method, model size, and number of threads employed).
</font>
<p><font size="3em">Number of points in learning curve: <input type="number" id="N_learning_curve" value="4" min="1" max="25" style="display:none"></font></p>
</div>
<div id="warning_use_prestored" style="display: none">
<br>
<font color="009FC2">
Warning: generating models on the fly can be very time-consuming (10~20 min, depending on the method, model size, and number of threads employed).
</font>
</div>
<div id="sample_output" style="display: none">
<br><br>
<font size="4em"><b> Sample outputs of predictions on formation energy and bandgap</b></font>
<br><br>
<font size="3em">Representation: N-gram<br><br>Regression method: Kernel ridge regression (KRR)</font>
<br><br>
<img src="assets/kaggle_competition/results-ngram-krr.png" width="60%">
<br><br>
</div>
<style>
* {
    box-sizing: border-box;
}
.column {
    float: left;
    width: 33.33%;
    padding: 5px;
}
/* Clearfix (clear floats) */
.row::after {
    content: "";
    clear: both;
    display: table;
}
</style>
<div id="sample_learning_curve" style="display: none">
<font size="4em"><b> Sample outputs of learning curves for formation energy and bandgap predictions</b></font>
<br><br>
<font size="3em">Representation: N-gram<br><br>Regression method: Kernel ridge regression (KRR)</font>
<br><br>
<div style="display: table; width: 95%">
<div class="row">
<div class="column">
<img src="./imgs/learning_curve-formation-ngram-krr.png" style="width:100%">
</div>
<div class="column">
<img src="./imgs/learning_curve-bandgap-ngram-krr.png" style="width:100%">
</div>
</div>
</div>
<br><br>
</div>
<script type="text/Javascript">
window.findCellIndicesByTag = function findCellIndicesByTag(tagName) {
    return (Jupyter.notebook.get_cells()
        .filter(
            ({metadata: {tags}}) => tags && tags.includes(tagName)
        )
        .map((cell) => Jupyter.notebook.find_cell_index(cell))
    );
};
window.runCells = function runPlotCells(tags) {
    var c = window.findCellIndicesByTag(tags);
    Jupyter.notebook.execute_cells(c);
};
function show_sample_output()
{
    document.getElementById("button_show_sample_output").style.display="none";
    document.getElementById("button_hide_sample_output").style.display="";
    var check_learning_curve = document.getElementById("if_learning_curve");
    var learning_curve_output = document.getElementById("sample_learning_curve");
    var output = document.getElementById("sample_output");
    output.style.display = "block";
    if (check_learning_curve.checked == true)
    {
        learning_curve_output.style.display = "block";
    }
    else
    {
        learning_curve_output.style.display = "none";
    }
}
function hide_sample_output()
{
    document.getElementById("button_show_sample_output").style.display="";
    document.getElementById("button_hide_sample_output").style.display="none";
    var learning_curve_output = document.getElementById("sample_learning_curve");
    var output = document.getElementById("sample_output");
    output.style.display = "none";
    learning_curve_output.style.display = "none";
}
function show_warning_learning_curve()
{
    var checkBox = document.getElementById("if_learning_curve");
    learning_curve_output = document.getElementById("sample_learning_curve");
    var warning = document.getElementById("warning_learning_curve");
    if (checkBox.checked == true)
    {
        warning.style.display = "block";
        document.getElementById("N_learning_curve").style.display="";
        var N_learning_curve = document.getElementById("N_learning_curve").value;
        var command = " N_learning_curve = int(" + N_learning_curve + ");";
        var kernel = IPython.notebook.kernel;
        kernel.execute(command);
    }
    else
    {
        warning.style.display = "none";
        document.getElementById("N_learning_curve").style.display="none";
    }
}
function show_warning_use_prestored_model()
{
    var checkBox = document.getElementById("if_use_prestored");
    var warning = document.getElementById("warning_use_prestored");
    if (checkBox.checked == true)
    {
        warning.style.display = "none";
    }
    else
    {
        warning.style.display = "block";
    }
    if_use_prestored = document.getElementById("if_use_prestored").checked;
    var command = " if_use_prestored_models = '" + if_use_prestored.toString() + "';";
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
}
function get_repr_regr_combination()
{
    ngram_N = document.getElementById("input_ngram_N").value;
    n_threads = document.getElementById("input_n_threads").value;
    soap_cutoff = document.getElementById("input_soap_cutoff").value;
    soap_lmax = document.getElementById("input_soap_lmax").value;
    soap_nmax = document.getElementById("input_soap_nmax").value;
    var command = "ngram_N = int(" + ngram_N + "); n_threads = int(" + n_threads + ");"
        + " soap_cutoff = int(" + soap_cutoff + ");"
        + " soap_lmax = int(" + soap_lmax + ");"
        + " soap_nmax = int(" + soap_nmax + ");";
    console.log("Executing Command: " + command);
    var kernel = IPython.notebook.kernel;
    kernel.execute(command);
    representation_value = document.getElementById("representation").value;
    regression_value = document.getElementById("regression").value;
    if_learning_curve = document.getElementById("if_learning_curve").checked;
    if_use_prestored = document.getElementById("if_use_prestored").checked;
    N_learning_curve = document.getElementById("N_learning_curve").value;
    var command = " representation = '" + representation_value +
        "'; regression = '" + regression_value + "';" + " if_learningcurve = '" + if_learning_curve.toString() + "';"
        + " if_use_prestored_models = '" + if_use_prestored.toString() + "';"
        + " N_learning_curve = int(" + N_learning_curve + ");";
    console.log("Executing Command: " + command);
var kernel = IPython.notebook.kernel; var kernel = IPython.notebook.kernel;
kernel.execute(command); kernel.execute(command);
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('ngram'));//# N-gram descriptor Jupyter.notebook.execute_cells(window.findCellIndicesByTag('ngram'));//# N-gram descriptor
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('soap'));//# SOAP descriptors Jupyter.notebook.execute_cells(window.findCellIndicesByTag('soap'));//# SOAP descriptors
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('krr-ngram'));//# KRR Jupyter.notebook.execute_cells(window.findCellIndicesByTag('krr-ngram'));//# KRR
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init')); Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init_read_data')); Jupyter.notebook.execute_cells(window.findCellIndicesByTag('init_read_data'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('plot')); Jupyter.notebook.execute_cells(window.findCellIndicesByTag('plot'));
Jupyter.notebook.execute_cells(window.findCellIndicesByTag('main'));//# Read and predict Jupyter.notebook.execute_cells(window.findCellIndicesByTag('main'));//# Read and predict
} }
</script> </script>
``` ```
%% Output %% Output
%% Cell type:code id: tags:init
``` python
import os
import numpy as np
import copy
import pandas as pd
import time
#from tqdm import tqdm, trange
from tqdm import tnrange, tqdm_notebook
import sys
#print(sys.version)
from IPython.display import HTML
# n-gram
import kaggle_competition.crystal_graph as cg
# SOAP
from ase.atoms import Atoms as ASEAtoms
from ase.io import read, write
import quippy as qp
from quippy import descriptors
from IPython.core.display import Javascript
from IPython.display import display
def run_cell_by_tag(tags):
    display(Javascript("window.runCells('"+tags+"')"))
```
%% Cell type:code id: tags:init_read_data
``` python
# Read data
INPUT_DIR = "data/kaggle_competition/original_Kaggle/"
df_train = pd.read_csv(INPUT_DIR+"train.csv")
df_train["dataset"] = "train"
df_test = pd.read_csv(INPUT_DIR+"test.csv")
df_test["dataset"] = "test"
df_crystals = pd.concat([df_train, df_test], ignore_index=True)
train_data = 'data/kaggle_competition/original_Kaggle/train'
test_data = 'data/kaggle_competition/original_Kaggle/test'
csv_path = 'data/kaggle_competition/original_Kaggle/'
train_csv = np.loadtxt(csv_path+'train.csv', skiprows=1, delimiter=',')
test_csv = np.loadtxt(csv_path+'test.csv', skiprows=1, delimiter=',')
test_true_csv = pd.read_csv("data/kaggle_competition/labeled_test.csv")
bandgap_test_true = test_true_csv["bandgap_energy_ev"]
formation_test_true = test_true_csv["formation_energy_ev_natom"]
formation_train_true = df_train["formation_energy_ev_natom"]
bandgap_train_true = df_train["bandgap_energy_ev"]
# Initialize arrays
unigrams = []
bigrams = []
trigrams = []
quadgrams = []
train_mean_atomic_descriptors = []
test_mean_atomic_descriptors = []
```
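%% Cell type:markdown id: tags:
The cell above tags each frame with a `dataset` column before concatenating, so the descriptors can be generated over all crystals in one pass and the rows split back apart afterwards. A minimal sketch of this tag-and-concat pattern on hypothetical toy frames (demo names, not the Kaggle files):
%% Cell type:code id: tags:
``` python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
demo_train = pd.DataFrame({"id": [1, 2], "bandgap_energy_ev": [1.5, 2.1]})
demo_test = pd.DataFrame({"id": [3], "bandgap_energy_ev": [float("nan")]})
demo_train["dataset"] = "train"
demo_test["dataset"] = "test"

# One concatenated frame; the tag column recovers the original split
demo_all = pd.concat([demo_train, demo_test], ignore_index=True)
train_rows = demo_all[demo_all["dataset"] == "train"]
test_rows = demo_all[demo_all["dataset"] == "test"]
print(len(train_rows), len(test_rows))
```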
%% Cell type:code id: tags:plot
``` python
from IPython.display import HTML
%matplotlib inline
import matplotlib.pyplot as plt
fontsize = 15
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true)-np.array(y_pred))**2))
def RMSLE(true, pred):
    return np.sqrt(np.mean((np.log(np.array(true)+1)-np.log(np.array(pred)+1))**2))
def test_plot(formation_pred, formation_true, bandgap_pred, bandgap_true):
    RMSE_formation = RMSE(formation_pred, formation_true)
    RMSE_bandgap = RMSE(bandgap_pred, bandgap_true)
    RMSLE_formation = RMSLE(formation_pred, formation_true)
    RMSLE_bandgap = RMSLE(bandgap_pred, bandgap_true)
    label_formation = "test RMSLE " + str(RMSLE_formation)[:5]
    label_bandgap = "test RMSLE " + str(RMSLE_bandgap)[:5]
    fig = plt.figure(figsize=(6,4))
    plt.scatter(formation_true, formation_pred - formation_true, s = 3, alpha = 0.3, label = label_formation)
    #plt.plot(np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), '--', c = 'grey')
    plt.xlabel('Actual formation energy (eV/cation)', fontsize = fontsize)
    plt.ylabel(r'$y-\hat{y}$ (eV/cation)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.axhline(y=0, linewidth=1, color="#000000")
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
    fig = plt.figure(figsize=(6,4))
    plt.scatter(bandgap_true, bandgap_pred - bandgap_true, s = 3, alpha = 0.3, label = label_bandgap)
    #plt.plot(np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), np.arange(min(test_vegards_true), max(test_vegards_true), 0.01), '--', c = 'grey')
    plt.xlabel('Actual bandgap energy (eV)', fontsize = fontsize)
    plt.ylabel(r'$y-\hat{y}$ (eV)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.axhline(y=0, linewidth=1, color="#000000")
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
def plot_learning_curve(learning_curve):
    RMSLE_formation_train = []
    RMSLE_formation_test = []
    RMSLE_bandgap_train = []
    RMSLE_bandgap_test = []
    for i in range(len(learning_curve["N_train"])):
        RMSLE_formation_train.append(RMSLE(formation_train_true[learning_curve["indtrain"][i]], learning_curve["formation_train"][i]))
        RMSLE_formation_test.append(RMSLE(formation_test_true, learning_curve["formation_test"][i]))
        RMSLE_bandgap_train.append(RMSLE(bandgap_train_true[learning_curve["indtrain"][i]], learning_curve["bandgap_train"][i]))
        RMSLE_bandgap_test.append(RMSLE(bandgap_test_true, learning_curve["bandgap_test"][i]))
    display(HTML("""<div><font size=3.5em><b> - Formation energy</b></font><br></div>"""))
    fig = plt.figure(figsize = (6,4))
    plt.scatter(learning_curve["N_train"], RMSLE_formation_train, s=3, alpha=1, label = "Training")
    plt.scatter(learning_curve["N_train"], RMSLE_formation_test, s=3, alpha=1, label = "Test")
    plt.xlabel('Size of training set', fontsize = fontsize)
    plt.ylabel('RMSLE (eV/cation)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
    display(HTML("""<div><font size=3.5em><b> - Bandgap</b></font><br></div>"""))
    fig = plt.figure(figsize = (6,4))
    plt.scatter(learning_curve["N_train"], RMSLE_bandgap_train, s=3, alpha=1, label = "Training")
    plt.scatter(learning_curve["N_train"], RMSLE_bandgap_test, s=3, alpha=1, label = "Test")
    plt.xlabel('Size of training set', fontsize = fontsize)
    plt.ylabel('RMSLE (eV)', fontsize = fontsize)
    plt.legend(loc='upper right', fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
def plot_hyperparameter_contour(x_grid, y_grid, z_grid, grid):
    fontsize = 15
    plt.figure()
    c = plt.contourf(x_grid, y_grid, abs(z_grid), grid, alpha=.75)
    CS = plt.contour(x_grid, y_grid, abs(z_grid), grid, colors='black', linewidths=.5)
    plt.clabel(CS, inline=1, fontsize=12)
    plt.xlabel('alpha', fontsize = fontsize)
    plt.ylabel('gamma', fontsize = fontsize)
    plt.title('RMSE', fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    plt.show()
```
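%% Cell type:markdown id: tags:
The competition metric is RMSLE rather than RMSE: because it compares `log(1 + y)`, an error is weighted relative to the magnitude of the target instead of absolutely. A quick check of the two metrics on toy values (standalone copies of the definitions above, so this cell runs on its own):
%% Cell type:code id: tags:
``` python
import numpy as np

# Standalone copies of the metrics defined in the plot cell
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log(np.array(y_true) + 1) - np.log(np.array(y_pred) + 1)) ** 2))

y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 2.0, 8.0]
# The last sample is off by 4 eV in absolute terms, but only by a
# factor of 9/5 after the log(1 + y) transform, so RMSLE is much smaller.
print(rmse(y_true, y_pred), rmsle(y_true, y_pred))
```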
%% Cell type:code id: tags:main
``` python
#main
import sys
# Make predictions
if representation == "ngram":
    run_cell_by_tag('get_ngram_descriptor')
    if regression == "nn":
        run_cell_by_tag('ngram-nn')
    elif regression == "krr":
        run_cell_by_tag('predic_plot_ngram_krr')
elif representation == "soap":
    run_cell_by_tag('get_soap_descriptor')
    if regression == "nn":
        run_cell_by_tag('nn-soap')
    elif regression == "krr":
        run_cell_by_tag('soap-krr')
else:
    print ("Representation/regression method not found:", representation, regression)
```
%% Cell type:code id: tags:ngram
``` python
# N-gram descriptor
def check_crdn(crdn, symbols, filename):
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    oxygens = [i for i, s in enumerate(symbols) if s == 'O']
    metal_cn = [len(crdn[i]) for i in metals]
    oxygen_cn = [len(crdn[i]) for i in oxygens]
    if np.max(metal_cn) > m_cn_max or np.min(metal_cn) < m_cn_min:
        print(filename)
        print('metal_cn: {}'.format(metal_cn))
    if np.max(oxygen_cn) > o_cn_max or np.min(oxygen_cn) < o_cn_min:
        print(filename)
        print('oxygen_cn: {}'.format(oxygen_cn))
def get_cg(fn, factor):
    crystal_xyz, crystal_sym, crystal_lat = cg.get_xyz_data(fn) #fn: file_name
    A = np.transpose(crystal_lat)
    B = np.linalg.inv(A)
    crystal_red = np.matmul(crystal_xyz, np.transpose(B))
    R_max = cg.get_Rmax(crystal_sym) * factor
    crystal_dist, crystal_Rij, crystal_crdn = cg.get_shortest_distances(crystal_red, A, R_max)
    check_crdn(crystal_crdn, crystal_sym, fn)
    return crystal_crdn, crystal_sym
def get_symbols(G):
    return G[1]
def coord_number(G, node):
    return len(G[0][node])
def neighbors(G, node):
    return [c[0] for c in G[0][node]]
def neighbors_with_R(G, node):
    return [(c[0], c[2]) for c in G[0][node]]
def get_unigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    symbols = [token_dict[s + str(coord_number(G, i))] for i, s in enumerate(symbols)]
    uniques, counts = np.unique(symbols, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(tokens), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_bigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
    pair_list = []
    for j in metals:
        sym_j = symbols[j]
        for i in neighbors(G, j):
            sym_i = symbols[i]
            pair = sym_j + '_' + sym_i
            pair_list.append(pair_dict[pair])
    uniques, counts = np.unique(pair_list, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(pairs), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_trigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    oxygens = [i for i, s in enumerate(symbols) if s == 'O']
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
    triple_list = []
    for o in oxygens:
        sym_o = symbols[o]
        nbr = neighbors(G, o)
        for j in range(len(nbr)):
            sym_j = symbols[nbr[j]]
            for i in range(j):
                sym_i = symbols[nbr[i]]
                if token_dict[sym_i] >= token_dict[sym_j]:
                    triple = sym_i + '_' + sym_o + '_' + sym_j
                else:
                    triple = sym_j + '_' + sym_o + '_' + sym_i
                triple_list.append(triples_dict[triple])
    for m in metals:
        sym_m = symbols[m]
        nbr = neighbors(G, m)
        for j in range(len(nbr)):
            sym_j = symbols[nbr[j]]
            for i in range(j):
                sym_i = symbols[nbr[i]]
                if token_dict[sym_i] >= token_dict[sym_j]:
                    triple = sym_i + '_' + sym_m + '_' + sym_j
                else:
                    triple = sym_j + '_' + sym_m + '_' + sym_i
                triple_list.append(triples_dict[triple])
    uniques, counts = np.unique(triple_list, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(triples), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_quadgrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict):
    symbols = get_symbols(G)
    oxygens = [i for i, s in enumerate(symbols) if s == 'O']
    metals = [i for i, s in enumerate(symbols) if s != 'O']
    symbols = [s + str(coord_number(G, i)) for i, s in enumerate(symbols)]
    quad_list = []
    for o in oxygens:
        sym_o = symbols[o]
        nbr = neighbors_with_R(G, o)
        for j in range(len(nbr)):
            nbr_j = neighbors_with_R(G, nbr[j][0])
            sym_j = symbols[nbr[j][0]]
            for i in range(j):
                nbr_i = neighbors_with_R(G, nbr[i][0])
                sym_i = symbols[nbr[i][0]]
                triple1 = sym_i + '_' + sym_o + '_' + sym_j
                triple2 = sym_j + '_' + sym_o + '_' + sym_i
                for kj in nbr_j:
                    if kj[0] == o and np.sum(np.abs(nbr[j][1]+kj[1])) < 1e-8: continue
                    quad1 = triple1 + '_' + symbols[kj[0]]
                    quad_list.append(quad_dict[quad1])
                for ki in nbr_i:
                    if ki[0] == o and np.sum(np.abs(nbr[i][1]+ki[1])) < 1e-8: continue
                    quad2 = triple2 + '_' + symbols[ki[0]]
                    quad_list.append(quad_dict[quad2])
    uniques, counts = np.unique(quad_list, return_counts=True)
    counts = dict(zip(list(uniques), list(counts)))
    v = np.zeros(len(quads), dtype=int)
    for key, value in counts.items():
        v[key] = value
    return v
def get_ngram_descriptors():
    print ("\nGenerating n-grams descriptors...\n")
    m_elements = ['Al', 'Ga', 'In']
    m_tokens = []
    print ("Generating tokens...")
    for i in range(m_cn_min, m_cn_max+1):
        m_tokens += [ s+str(i) for s in m_elements]
    o_tokens = ['O'+str(i) for i in range(o_cn_min, o_cn_max+1)]
    tokens = m_tokens + o_tokens
    print ("Tokens:", tokens[:10], "...")
    token_dict = dict([(tokens[i], i) for i in range(len(tokens))])
    pairs = []
    print ("Generating pairs...")
    for m in m_tokens:
        pairs += [ m + '_' + o for o in o_tokens]
    print ("Pairs:", pairs[:10], "...")
    pair_dict = dict([(pairs[i], i) for i in range(len(pairs))])
    print ("Generating triples...")
    mom_triples = []
    for m1 in range(len(m_tokens)):
        for m2 in range(m1+1):
            mom_triples += [m_tokens[m1] + '_' + o + '_' + m_tokens[m2] for o in o_tokens]
    omo_triples = []
    for o1 in range(len(o_tokens)):
        for o2 in range(o1+1):
            omo_triples += [o_tokens[o1] + '_' + m + '_' + o_tokens[o2] for m in m_tokens]
    triples = mom_triples + omo_triples
    print ("Triples:", triples[:10], "...")
    triples_dict = dict([(triples[i], i) for i in range(len(triples))])
    print ("Generating quads...")
    quads = []
    for m1 in m_tokens:
        for o1 in o_tokens:
            for m2 in m_tokens:
                quads += [m1 + '_' + o1 + '_' + m2 + '_' + o for o in o_tokens]
    print ("Quads:", quads[:10], "...")
    quad_dict = dict([(quads[i], i) for i in range(len(quads))])
    print ("Generating grams...")
    unigrams = []
    bigrams = []
    trigrams = []
    quadgrams = []
    count = 0
    t0 = time.time()
    for crystal in df_crystals.itertuples():
        fn = INPUT_DIR + "{}/{}/geometry.xyz".format(crystal.dataset, crystal.id)
        factor = cg.get_factor(crystal.spacegroup, crystal.lattice_angle_gamma_degree)
        G = get_cg(fn, factor)
        unigrams.append(get_unigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        bigrams.append(get_bigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        trigrams.append(get_trigrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        quadgrams.append(get_quadgrams(G,tokens,pairs,triples,quads,token_dict,pair_dict,triples_dict,quad_dict))
        count += 1
        #if count % 500 == 0:
        #    print(count, time.time()-t0)
    print ("Tokens:", len(tokens))
    print ("Pairs:", len(pairs))
    print ("Triples:", len(triples))
    print ("Quads:", len(quads))
    #np.savez("./data/kaggle_competition/models/ngram.npz", tokens=tokens, pairs=pairs, triples=triples, quads=quads, unigrams=unigrams, bigrams=bigrams, trigrams=trigrams, quadgrams=quadgrams)
    return tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams
```
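%% Cell type:markdown id: tags:
Each `get_*grams` function above reduces a list of vocabulary indices to a fixed-length count vector via `np.unique`. The idiom in isolation, on a hypothetical four-token vocabulary (toy names, not the real token set):
%% Cell type:code id: tags:
``` python
import numpy as np

# Hypothetical mini-vocabulary of coordination-tagged tokens
# ("Al4" = an Al atom with 4 neighbours, etc.)
tokens = ["Al4", "Ga4", "In4", "O2"]
token_dict = {t: i for i, t in enumerate(tokens)}

# Tokens observed in one toy structure
observed = ["Al4", "Al4", "O2", "In4"]
indices = [token_dict[t] for t in observed]

# Unique indices + counts -> count vector aligned with the vocabulary
uniques, counts = np.unique(indices, return_counts=True)
v = np.zeros(len(tokens), dtype=int)
v[uniques] = counts  # vectorized form of the per-key loop in get_unigrams()
print(v)
```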
%% Cell type:code id: tags:get_ngram_descriptor %% Cell type:code id: tags:get_ngram_descriptor
``` python ``` python
# Get N-gram descriptor # Get N-gram descriptor
from IPython.display import HTML from IPython.display import HTML
display(HTML("""<br><div><font size=6em><b>Predictions with N-grams representation</b></font><hr></div>""")) display(HTML("""<br><div><font size=6em><b>Predictions with N-grams representation</b></font><hr></div>"""))
m_cn_min = 4 m_cn_min = 4
m_cn_max = 9 m_cn_max = 9
o_cn_min = 2 o_cn_min = 2
o_cn_max = 6 o_cn_max = 6
if if_use_prestored_models == "true": if if_use_prestored_models == "true":
ngram_model_saved = np.load('./data/kaggle_competition/models/ngram.npz') ngram_model_saved = np.load('./data/kaggle_competition/models/ngram.npz')
display(HTML("""Read pre-stored N-grams model.""")) display(HTML("""Read pre-stored N-grams model."""))
tokens = ngram_model_saved['tokens'] tokens = ngram_model_saved['tokens']
pairs = ngram_model_saved['pairs'] pairs = ngram_model_saved['pairs']
triples = ngram_model_saved['triples'] triples = ngram_model_saved['triples']
quads = ngram_model_saved['quads'] quads = ngram_model_saved['quads']
unigrams = ngram_model_saved['unigrams'] unigrams = ngram_model_saved['unigrams']
bigrams = ngram_model_saved['bigrams'] bigrams = ngram_model_saved['bigrams']
trigrams = ngram_model_saved['trigrams'] trigrams = ngram_model_saved['trigrams']
quadgrams = ngram_model_saved['quadgrams'] quadgrams = ngram_model_saved['quadgrams']
else: else:
tokens = [] tokens = []
pairs = [] pairs = []
triples = [] triples = []
quads = [] quads = []
unigrams = [] unigrams = []
bigrams = [] bigrams = []
trigrams = [] trigrams = []
quadgrams = [] quadgrams = []
tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams = get_ngram_descriptors() tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams = get_ngram_descriptors()
``` ```
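%% Cell type:markdown id: tags:
Not part of the original notebook: a minimal sketch of how a pre-stored archive like the `ngram.npz` loaded above could be produced with `np.savez`. The tiny placeholder arrays stand in for the real descriptors; what matters is that the keyword names passed to `np.savez` must match the keys read back by `np.load` (the notebook stores all eight: `tokens`, `pairs`, `triples`, `quads`, `unigrams`, `bigrams`, `trigrams`, `quadgrams`).
%% Cell type:code id: tags:
``` python
import os
import tempfile
import numpy as np

# Placeholder arrays standing in for the real descriptor matrices
tokens = np.zeros((3, 2))
pairs = np.ones((3, 4))

# Save under the same key names that np.load(...)['tokens'] etc. expects
path = os.path.join(tempfile.mkdtemp(), "ngram.npz")
np.savez(path, tokens=tokens, pairs=pairs)

saved = np.load(path)
assert set(saved.files) == {"tokens", "pairs"}
assert np.array_equal(saved["pairs"], pairs)
```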
%% Cell type:code id: tags:krr-ngram
``` python
# KRR + N-grams
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.preprocessing import MinMaxScaler
from joblib import parallel_backend  # sklearn.externals.joblib was removed in scikit-learn >= 0.23

def train_and_pred_KRR(X_train, y_train, X_test, nfold, nthread):
    # RMSE is negated so that GridSearchCV can maximize the score
    neg_root_mean_squared_error = make_scorer(RMSE, greater_is_better=False)
    clf = GridSearchCV(KernelRidge(kernel='rbf'),
                       cv=nfold, n_jobs=nthread, verbose=1,
                       scoring=neg_root_mean_squared_error,
                       param_grid={"alpha": np.logspace(-15, 5, 21, base=2),
                                   "gamma": np.logspace(-15, 3, 19, base=2)})
    with parallel_backend('threading'):
        clf.fit(X_train, y_train)
    out_y_train = clf.predict(X_train)
    y_test = clf.predict(X_test)
    return clf, y_test, out_y_train

def get_ngram_KRR_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size, n_threads):
    print("Size of N-gram:", ngram_size)
    ngrams = {}
    ngrams["tokens"] = tokens
    ngrams["pairs"] = pairs
    ngrams["triples"] = triples
    ngrams["quads"] = quads
    ngrams["unigrams"] = unigrams
    ngrams["bigrams"] = bigrams
    ngrams["trigrams"] = trigrams
    ngrams["quadgrams"] = quadgrams
    if ngram_size == 4:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"], ngrams["quadgrams"]])
    elif ngram_size == 3:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"]])
    elif ngram_size == 2:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"]])
    elif ngram_size == 1:
        features = ngrams["unigrams"]
    else:
        raise ValueError('ngram_size should be within [1, 4].')
    print('Ngram feature matrix shape: {}'.format(features.shape))
    # Keep only columns that are nonzero for at least one structure
    count_per_column = np.sum(features, axis=0)
    nonzero_columns = np.where(count_per_column > 0)[0]
    print('Total column count: {}'.format(len(count_per_column)))
    print('Nonzero-column count: {}'.format(len(nonzero_columns)))
    features = features[:, nonzero_columns]
    print('Nonzero ngram feature matrix shape: {}'.format(features.shape))
    # Unit-cell volume of the (generally triclinic) lattice
    lat_a = df_crystals.lattice_vector_1_ang.values
    lat_b = df_crystals.lattice_vector_2_ang.values
    lat_c = df_crystals.lattice_vector_3_ang.values
    lat_cosa = np.cos(np.radians(df_crystals.lattice_angle_alpha_degree.values))
    lat_cosb = np.cos(np.radians(df_crystals.lattice_angle_beta_degree.values))
    lat_cosg = np.cos(np.radians(df_crystals.lattice_angle_gamma_degree.values))
    volume = lat_a * lat_b * lat_c * np.sqrt(1. - lat_cosa**2 - lat_cosb**2 - lat_cosg**2
                                             + 2. * lat_cosa * lat_cosb * lat_cosg)
    # Normalize counts by cell volume, then scale each column to [0, 1]
    features = np.array(features, dtype=float)  # np.float was removed in NumPy 1.24
    features = features / volume[:, np.newaxis]
    scaler = MinMaxScaler()
    features = scaler.fit_transform(features)
    # Targets are trained on a log1p scale
    y1 = np.log1p(df_crystals.formation_energy_ev_natom.values[:len(df_train)])
    y2 = np.log1p(df_crystals.bandgap_energy_ev.values[:len(df_train)])
    X_train = features[:len(df_train)]
    X_test = features[len(df_train):]
    learning_curve = {}
    if if_learningcurve == "true":
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))  # a range object does not support item assignment
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(X_train), N_train, replace=False)
            clf1, y1_test, y1_train = train_and_pred_KRR(X_train[indtrain], y1[indtrain], X_test, nfold=5, nthread=n_threads)
            clf2, y2_test, y2_train = train_and_pred_KRR(X_train[indtrain], y2[indtrain], X_test, nfold=5, nthread=n_threads)
            learning_curve_formation_train.append(y1_train)
            learning_curve_formation_test.append(y1_test)
            learning_curve_bandgap_train.append(y2_train)
            learning_curve_bandgap_test.append(y2_test)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    clf1, y1_test, y1_train = train_and_pred_KRR(X_train, y1, X_test, nfold=5, nthread=n_threads)
    clf2, y2_test, y2_train = train_and_pred_KRR(X_train, y2, X_test, nfold=5, nthread=n_threads)
    # Report negative predictions before undoing the log1p transform
    print('negative values of y1_test:', np.where(y1_test < 0))
    print('negative values of y2_test:', np.where(y2_test < 0))
    y1_test = np.clip(np.expm1(y1_test), 1e-8, None)
    y2_test = np.clip(np.expm1(y2_test), 1e-8, None)
    y1_train = np.clip(np.expm1(y1_train), 1e-8, None)
    y2_train = np.clip(np.expm1(y2_train), 1e-8, None)
    gridsearch_results = {}
    gridsearch_results['formation'] = clf1.cv_results_
    gridsearch_results['bandgap'] = clf2.cv_results_
    return y1_train, y1_test, y2_train, y2_test, gridsearch_results, learning_curve
```
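%% Cell type:markdown id: tags:
The cell above folds the unit-cell volume into the descriptor scaling using the general triclinic formula V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ). A quick self-contained sanity check (not part of the original notebook): for an orthorhombic cell (all angles 90°) this reduces to abc, and for γ = 120° with α = β = 90° it reduces to abc·sin γ.
%% Cell type:code id: tags:
``` python
import numpy as np

def cell_volume(a, b, c, alpha, beta, gamma):
    """Triclinic unit-cell volume; angles in degrees (same formula as in the cell above)."""
    ca, cb, cg = (np.cos(np.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * np.sqrt(1. - ca**2 - cb**2 - cg**2 + 2. * ca * cb * cg)

# Orthorhombic cell: volume is just a*b*c
assert np.isclose(cell_volume(3.0, 4.0, 5.0, 90, 90, 90), 60.0)
# Hexagonal-type cell (gamma = 120 deg): V = a*b*c*sin(gamma)
assert np.isclose(cell_volume(3.0, 3.0, 5.0, 90, 90, 120), 3 * 3 * 5 * np.sin(np.radians(120)))
```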
%% Cell type:code id: tags:predic_plot_ngram_krr
``` python
# Predict with N-grams + KRR and plot
from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Kernel ridge regression </b> </font><br></div>"""))
formation_train_pred_ngram_krr, formation_test_pred_ngram_krr, bandgap_train_pred_ngram_krr, bandgap_test_pred_ngram_krr, gridsearch_results_ngram_krr, learning_curve_ngram_krr = get_ngram_KRR_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size=ngram_N, n_threads=n_threads)
# Fold the flat grid-search scores onto the (gamma, alpha) hyperparameter grid
x_grid_formation_ngram_krr = np.unique(gridsearch_results_ngram_krr['formation']['param_alpha'].data)
y_grid_formation_ngram_krr = np.unique(gridsearch_results_ngram_krr['formation']['param_gamma'].data)
z_grid_formation_ngram_krr = np.reshape(gridsearch_results_ngram_krr['formation']['mean_test_score'], [len(y_grid_formation_ngram_krr), -1])
x_grid_bandgap_ngram_krr = np.unique(gridsearch_results_ngram_krr['bandgap']['param_alpha'].data)
y_grid_bandgap_ngram_krr = np.unique(gridsearch_results_ngram_krr['bandgap']['param_gamma'].data)
z_grid_bandgap_ngram_krr = np.reshape(gridsearch_results_ngram_krr['bandgap']['mean_test_score'], [len(y_grid_bandgap_ngram_krr), -1])
run_cell_by_tag('plot_ngram_krr')
```
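%% Cell type:markdown id: tags:
The `mean_test_score` values reshaped above are negated RMSEs: `make_scorer(RMSE, greater_is_better=False)` flips the sign so that `GridSearchCV` can always maximize its scoring function. A minimal self-contained demonstration of that sign convention (not from the notebook; the `RMSE` here is a standard definition that may differ in detail from the one defined earlier, and the tiny linear dataset is illustrative only):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression

def RMSE(y_true, y_pred):
    # Standard root-mean-squared error
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# greater_is_better=False makes the scorer return -RMSE
scorer = make_scorer(RMSE, greater_is_better=False)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

# A near-perfect fit: RMSE ~ 0, and the scorer reports it negated (<= 0)
assert np.isclose(RMSE(y, model.predict(X)), 0.0, atol=1e-8)
assert scorer(model, X, y) <= 0.0
```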
%% Cell type:code id: tags:plot_ngram_krr
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_ngram_krr, formation_test_true, bandgap_test_pred_ngram_krr, bandgap_test_true)
display(HTML("""<div><hr><font size=4em><b>Hyperparameter grid search</b></font></div>"""))
display(HTML("""<div><font size=3.5em><b>- Formation energy</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_formation_ngram_krr, y_grid_formation_ngram_krr, z_grid_formation_ngram_krr, grid=8)
display(HTML("""<div><font size=3.5em><b>- Bandgap</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_bandgap_ngram_krr, y_grid_bandgap_ngram_krr, z_grid_bandgap_ngram_krr, grid=10)
if if_learningcurve == "true":
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_ngram_krr)
display(HTML("""<br><hr><br><br>"""))
```
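%% Cell type:markdown id: tags:
Throughout this notebook, both regression targets are trained on a `log1p` scale and predictions are mapped back with `expm1`, clipped at 1e-8 so that no prediction is ever non-positive. A small round-trip illustration of that transform (not part of the original pipeline; the energy values below are made up):
%% Cell type:code id: tags:
``` python
import numpy as np

y = np.array([0.0, 0.05, 0.4, 3.2])            # illustrative target values (eV)
y_log = np.log1p(y)                            # forward transform used for training targets
y_back = np.clip(np.expm1(y_log), 1e-8, None)  # inverse transform with the notebook's clip

# Nonzero values round-trip exactly (to float precision) ...
assert np.allclose(y_back[1:], y[1:])
# ... while exact zeros are floored at the clip value
assert y_back[0] == 1e-8
```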
%% Cell type:code id: tags:ngram-nn
``` python
# N-grams + NN
import sys
import pandas as pd
import torch

def mean_free(model):
    # Center every feature column on its mean
    means = np.average(model, axis=0)
    return model - means

class NeuralNetwork(torch.nn.Module):
    def __init__(self, inplen, outlen):
        super(NeuralNetwork, self).__init__()
        hidden_layers = []
        batchnorms = []
        hidden_layers.append(torch.nn.Linear(inplen, N_nn_neurons[0]))
        batchnorms.append(torch.nn.BatchNorm1d(N_nn_neurons[0], affine=True))
        for i in range(N_nn_layers - 1):
            hidden_layers.append(torch.nn.Linear(N_nn_neurons[i], N_nn_neurons[i + 1], bias=True))
            batchnorms.append(torch.nn.BatchNorm1d(N_nn_neurons[i + 1], affine=True))
        self.layer_hidden = torch.nn.ModuleList(hidden_layers)
        self.batchnorms = torch.nn.ModuleList(batchnorms)
        self.layer_out = torch.nn.Linear(N_nn_neurons[-1], outlen)
        self.activation = torch.nn.LeakyReLU()

    def forward(self, inp):
        x = self.activation(self.layer_hidden[0](inp))
        # batch-norm variant (disabled): x = self.activation(self.batchnorms[0](self.layer_hidden[0](inp)))
        for i in range(1, N_nn_layers - 1):
            x = self.activation(self.layer_hidden[i](x))
            # batch-norm variant (disabled): x = self.activation(self.batchnorms[i](self.layer_hidden[i](x)))
        return self.layer_out(x)

def ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train, y_train, x_test, y_test_true, learning_rate, n_epoch):
    # torch.autograd.Variable and the volatile flag are deprecated; plain tensors
    # plus torch.no_grad() for evaluation give the same behavior in current PyTorch
    inp_trn = torch.Tensor(x_train.reshape(-1, nn_inp_len))
    out_trn = torch.Tensor(y_train.reshape(-1, nn_out_len))
    inp_test = torch.Tensor(x_test.reshape(-1, nn_inp_len))
    out_test = torch.Tensor(y_test_true.reshape(-1, nn_out_len))
    deep_fit = NeuralNetwork(nn_inp_len, nn_out_len)
    loss_function = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(deep_fit.parameters(), lr=learning_rate)
    sys.stdout.flush()
    loss_trn, loss_test = [], []
    for _ in tnrange(n_epoch):
        deep_fit.train()
        loss = loss_function(deep_fit(inp_trn), out_trn)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_trn.append(loss.item())
        deep_fit.eval()
        with torch.no_grad():
            loss = loss_function(deep_fit(inp_test), out_test)
        loss_test.append(loss.item())
    deep_fit.eval()
    with torch.no_grad():
        deep_fit_trn = deep_fit(inp_trn)
        deep_fit_test = deep_fit(inp_test)
    sys.stdout.flush()
    # Undo the log1p target transform and clip non-positive predictions
    return (np.clip(np.expm1(deep_fit_trn.numpy().flatten()), 1e-8, None),
            np.clip(np.expm1(deep_fit_test.numpy().flatten()), 1e-8, None))

def get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size):
    print("Size of N-gram:", ngram_size)
    ngrams = {}
    ngrams["tokens"] = tokens
    ngrams["pairs"] = pairs
    ngrams["triples"] = triples
    ngrams["quads"] = quads
    ngrams["unigrams"] = unigrams
    ngrams["bigrams"] = bigrams
    ngrams["trigrams"] = trigrams
    ngrams["quadgrams"] = quadgrams
    if ngram_size == 4:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"], ngrams["quadgrams"]])
    elif ngram_size == 3:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"], ngrams["trigrams"]])
    elif ngram_size == 2:
        features = np.hstack([ngrams["unigrams"], ngrams["bigrams"]])
    elif ngram_size == 1:
        features = ngrams["unigrams"]
    else:
        raise ValueError('ngram_size should be within [1, 4].')
    print('Ngram feature matrix shape: {}'.format(features.shape))
    # Keep only columns that are nonzero for at least one structure
    count_per_column = np.sum(features, axis=0)
    nonzero_columns = np.where(count_per_column > 0)[0]
    print('Total column count: {}'.format(len(count_per_column)))
    print('Nonzero-column count: {}'.format(len(nonzero_columns)))
    features = features[:, nonzero_columns]
    print('Nonzero ngram feature matrix shape: {}'.format(features.shape))
    # Unit-cell volume of the (generally triclinic) lattice
    lat_a = df_crystals.lattice_vector_1_ang.values
    lat_b = df_crystals.lattice_vector_2_ang.values
    lat_c = df_crystals.lattice_vector_3_ang.values
    lat_cosa = np.cos(np.radians(df_crystals.lattice_angle_alpha_degree.values))
    lat_cosb = np.cos(np.radians(df_crystals.lattice_angle_beta_degree.values))
    lat_cosg = np.cos(np.radians(df_crystals.lattice_angle_gamma_degree.values))
    volume = lat_a * lat_b * lat_c * np.sqrt(1. - lat_cosa**2 - lat_cosb**2 - lat_cosg**2
                                             + 2. * lat_cosa * lat_cosb * lat_cosg)
    # Normalize counts by cell volume, then scale each column to [0, 1]
    features = np.array(features, dtype=float)  # np.float was removed in NumPy 1.24
    features = features / volume[:, np.newaxis]
    scaler = MinMaxScaler()
    features = scaler.fit_transform(features)
    sys.stdout.flush()
    # Targets are trained on a log1p scale
    y_train_Eform = np.log1p(df_crystals.formation_energy_ev_natom.values[:len(df_train)])
    y_train_Egap = np.log1p(df_crystals.bandgap_energy_ev.values[:len(df_train)])
    y_test_Eform_true = np.log1p(np.array(formation_test_true))
    y_test_Egap_true = np.log1p(np.array(bandgap_test_true))
    x_train = features[:len(df_train)]
    x_test = features[len(df_train):]
    x_train_meanfree = mean_free(x_train)
    x_test_meanfree = mean_free(x_test)
    nn_inp_len = len(nonzero_columns)
    nn_out_len = 1  # one regression target per network
    formation_train_pred_ngram_nn, formation_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree, y_train_Eform, x_test_meanfree, y_test_Eform_true, learning_rate=0.00012, n_epoch=600)
    bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree, y_train_Egap, x_test_meanfree, y_test_Egap_true, learning_rate=0.00015, n_epoch=700)
    learning_curve = {}
    if if_learningcurve == "true":
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))  # a range object does not support item assignment
        range_learning_curve[-1] = 2400
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(x_train_meanfree), N_train, replace=False)
            formation_train_pred_ngram_nn, formation_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree[indtrain], y_train_Eform[indtrain], x_test_meanfree, y_test_Eform_true, learning_rate=0.00012, n_epoch=600)
            bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn = ngram_nn_train_pred(nn_inp_len, nn_out_len, x_train_meanfree[indtrain], y_train_Egap[indtrain], x_test_meanfree, y_test_Egap_true, learning_rate=0.00015, n_epoch=700)
            learning_curve_formation_train.append(formation_train_pred_ngram_nn)
            learning_curve_formation_test.append(formation_test_pred_ngram_nn)
            learning_curve_bandgap_train.append(bandgap_train_pred_ngram_nn)
            learning_curve_bandgap_test.append(bandgap_test_pred_ngram_nn)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    return formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve
from IPython.display import HTML from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Neural network </b> </font><br></div>""")) display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> N-grams + Neural network </b> </font><br></div>"""))
formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve_ngram_nn = get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size = ngram_N) formation_train_pred_ngram_nn, formation_test_pred_ngram_nn, bandgap_train_pred_ngram_nn, bandgap_test_pred_ngram_nn, learning_curve_ngram_nn = get_ngram_NN_predictions(tokens, pairs, triples, quads, unigrams, bigrams, trigrams, quadgrams, ngram_size = ngram_N)
run_cell_by_tag('plot_ngram_nn') run_cell_by_tag('plot_ngram_nn')
``` ```
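The learning-curve loop above draws random training subsets of increasing size with `np.random.choice` and retrains the model at each size. A minimal self-contained sketch of the same subsampling pattern, with an ordinary least-squares fit standing in for the neural network (the toy data and model are assumptions for illustration only):

``` python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data standing in for the descriptor matrices
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(2400, 5))
y = X @ w + 0.1 * rng.normal(size=2400)
X_test = rng.normal(size=(500, 5))
y_test = X_test @ w

# integer step sizes, with the last point pinned to the full training set
N_points = 5
step = (2400 - 100) // N_points
sizes = list(range(100, 2401, step))
sizes[-1] = 2400

test_rmse = []
for n in sizes:
    ind = rng.choice(len(X), n, replace=False)   # random subsample, no repeats
    coef, *_ = np.linalg.lstsq(X[ind], y[ind], rcond=None)
    pred = X_test @ coef
    test_rmse.append(float(np.sqrt(np.mean((pred - y_test) ** 2))))

print(sizes, [round(e, 3) for e in test_rmse])
```

Plotting `test_rmse` against `sizes` (often on log-log axes) gives the learning curve; the notebook's `plot_learning_curve` does the analogous plot for the stored predictions.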
%% Cell type:code id: tags:plot_ngram_nn
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_ngram_nn, formation_test_true, bandgap_test_pred_ngram_nn, bandgap_test_true)
if(if_learningcurve == "true"):
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_ngram_nn)
display(HTML("""<br><hr><br><br>"""))
```
%% Cell type:code id: tags:soap
``` python
# SOAP descriptors
def get_atomic_soap_descriptors(geometry_xyz, soap_cutoff, soap_lmax, soap_nmax):
    # get cell vectors from *.xyz file
    cell_ = np.loadtxt(geometry_xyz, skiprows=3, usecols=(1,2,3))[:3]
    # get atom positions and types from *.xyz file
    atoms_ = pd.read_table(geometry_xyz, skiprows=6,
                           delim_whitespace=True, header=None, names=['x', 'y', 'z', 'type'],
                           usecols=[1,2,3,4])
    # construct ase Atoms object
    compound = ASEAtoms(atoms_.type.values,
                        positions = atoms_.values[:, :3],
                        cell = cell_,
                        pbc=[1, 1, 1])
    ###########################################################################
    # The specification of which atoms are used as SOAP centers is separate
    # from the specification of which atoms are taken into account in the
    # environment. The n_Z=4 Z={8 13 31 49} options specify that O, Al, Ga
    # and In are all taken as SOAP centers; the n_species=4
    # species_Z={8 13 31 49} options specify the atom types to be taken into
    # account in the environment.
    soap_command = "soap cutoff="
    soap_command += str(soap_cutoff)
    soap_command += " l_max="
    soap_command += str(soap_lmax)
    soap_command += " n_max="
    soap_command += str(soap_nmax)
    soap_command += " atom_sigma=0.5 n_Z=4 Z={8 13 31 49} n_species=4 species_Z={8 13 31 49}"
    #print(soap_command)
    desc = descriptors.Descriptor(soap_command) # for formation energies
    #"soap cutoff=20 l_max=6 n_max=10 atom_sigma=0.5 n_Z=1 Z={8} n_species=4 species_Z={8 13 31 49}") # for bandgaps
    compound.set_cutoff(desc.cutoff())
    compound.calc_connect()
    return desc.calc(compound)['descriptor']

def get_SOAP(soap_cutoff, soap_lmax, soap_nmax):
    # tnrange expects an integer; iterate over the id array with tqdm instead
    for compound_id in tqdm(train_csv[:, 0].astype(int)):
        #print('compound_id: {}.'.format(compound_id))
        compound_xyz = train_data+'/'+str(compound_id)+'/geometry.xyz'
        train_mean_atomic_descriptor = np.mean(get_atomic_soap_descriptors(compound_xyz, soap_cutoff, soap_lmax, soap_nmax),
                                               axis=0)
        if 'train_mean_atomic_descriptors' in locals():
            train_mean_atomic_descriptors = np.append(train_mean_atomic_descriptors,
                                                      train_mean_atomic_descriptor.reshape(1, -1), axis=0)
        else:
            train_mean_atomic_descriptors = train_mean_atomic_descriptor.reshape(1, -1)
            print('mean_atomic_descriptor.shape: {}'.format(train_mean_atomic_descriptor.shape))
    for compound_id in tqdm(test_csv[:, 0].astype(int)):
        #print('compound_id: {}.'.format(compound_id))
        compound_xyz = test_data+'/'+str(compound_id)+'/geometry.xyz'
        test_mean_atomic_descriptor = np.mean(get_atomic_soap_descriptors(compound_xyz, soap_cutoff, soap_lmax, soap_nmax),
                                              axis=0)
        if 'test_mean_atomic_descriptors' in locals():
            test_mean_atomic_descriptors = np.append(test_mean_atomic_descriptors,
                                                     test_mean_atomic_descriptor.reshape(1, -1), axis=0)
        else:
            test_mean_atomic_descriptors = test_mean_atomic_descriptor.reshape(1, -1)
            #print('mean_atomic_descriptor.shape: {}'.format(test_mean_atomic_descriptor.shape))
    #print(test_mean_atomic_descriptors)
    #np.savez("./data/kaggle_competition/models/soap.npz", train_mean_atomic_descriptors = train_mean_atomic_descriptors, test_mean_atomic_descriptors = test_mean_atomic_descriptors)
    return train_mean_atomic_descriptors, test_mean_atomic_descriptors
```
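`get_SOAP` turns a variable number of per-atom SOAP vectors into one fixed-length vector per compound by averaging over atoms; that mean pooling is what makes differently sized structures comparable in a single descriptor matrix. A small numpy sketch of the idea (the descriptor length of 8 and the atom counts are arbitrary stand-ins):

``` python
import numpy as np

rng = np.random.default_rng(1)

# per-atom descriptors for three compounds with different atom counts,
# all sharing the same descriptor length (8 here, hypothetical)
compounds = [rng.normal(size=(n_atoms, 8)) for n_atoms in (4, 10, 7)]

# mean-pool each compound to one fixed-length row, as get_SOAP does
rows = [atomic.mean(axis=0) for atomic in compounds]

# stacking once at the end avoids the repeated-copy cost of calling
# np.append inside the loop
descriptors = np.vstack(rows)
print(descriptors.shape)  # → (3, 8)
```

Collecting rows in a list and stacking once is the cheaper variant of the incremental `np.append` pattern used in the cell above; both produce the same `(n_compounds, descriptor_length)` matrix.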
%% Cell type:code id: tags:get_soap_descriptor
``` python
# Get SOAP descriptor
from IPython.display import HTML
display(HTML("""<br><div><font size=6em><b>Predictions with SOAP representation</b></font></div><hr>"""))
#"soap cutoff=10 l_max=4 n_max=4 atom_sigma=0.5 n_Z=4 Z={8 13 31 49} n_species=4 species_Z={8 13 31 49}") # for formation energies
#"soap cutoff=20 l_max=6 n_max=10 atom_sigma=0.5 n_Z=1 Z={8} n_species=4 species_Z={8 13 31 49}") # for bandgaps
if if_use_prestored_models == "true":
    soap_model_saved = np.load('./data/kaggle_competition/models/soap.npz')
    train_mean_atomic_descriptors = soap_model_saved['train_mean_atomic_descriptors']
    test_mean_atomic_descriptors = soap_model_saved['test_mean_atomic_descriptors']
    display(HTML("""Read pre-stored SOAP model."""))
else:
    display(HTML("""Generating SOAP descriptors..."""))
    train_mean_atomic_descriptors, test_mean_atomic_descriptors = get_SOAP(soap_cutoff, soap_lmax, soap_nmax)
#print("test_mean_atomic_descriptors:", np.shape(test_mean_atomic_descriptors), test_mean_atomic_descriptors[:10])
```
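The prestored-models branch above reloads the descriptor matrices from an `.npz` archive by key, so the expensive SOAP computation only runs once. A sketch of that save/load round-trip with the same key names (the temporary path and toy arrays are illustrative, not the notebook's actual data):

``` python
import numpy as np
import tempfile, os

train_desc = np.arange(12, dtype=float).reshape(3, 4)
test_desc = np.arange(8, dtype=float).reshape(2, 4)

# write both arrays under named keys, as the notebook's soap.npz does
path = os.path.join(tempfile.mkdtemp(), 'soap.npz')
np.savez(path, train_mean_atomic_descriptors=train_desc,
         test_mean_atomic_descriptors=test_desc)

# reload by key, matching the prestored-models branch
saved = np.load(path)
train_back = saved['train_mean_atomic_descriptors']
test_back = saved['test_mean_atomic_descriptors']
print(train_back.shape, test_back.shape)  # → (3, 4) (2, 4)
```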
%% Cell type:code id: tags:soap-krr
``` python
# SOAP + KRR
# The following arrays must be read/prepared: test_csv, train_csv, test_mean_atomic_descriptors, train_mean_atomic_descriptors
# Depends on cell 'krr-ngram'
from IPython.display import HTML
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=True, with_std=True)
formation_train_true = train_csv[:, -2]
bandgap_train_true = train_csv[:, -1]
combined_mean_atomic_descriptors = scaler.fit_transform(np.append(
    test_mean_atomic_descriptors,
    train_mean_atomic_descriptors,
    axis=0))
X_test = np.round(combined_mean_atomic_descriptors[:test_csv.shape[0]],
                  decimals=5)
X_train = np.round(combined_mean_atomic_descriptors[test_csv.shape[0]:],
                   decimals=5)
#display(HTML("""<br><br><hr><br><div><font size=4em><b>Representation:</b> SOAP descriptor</font><br></div>"""))
#display(HTML("""<div><font size=4em><b>Regression method:</b> Kernel ridge regression</font><br></div>"""))
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#000000;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Kernel ridge regression </b> </font><br></div>"""))

def get_soap_KRR_predictions(X_train, formation_train_true, bandgap_train_true, n_threads):
    formation_train_true_10times = formation_train_true * 10.0
    clf1_soap_krr, formation_test_pred_soap_krr, formation_train_pred_soap_krr = train_and_pred_KRR(X_train, formation_train_true_10times, X_test, nfold = 5, nthread = n_threads)
    clf2_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr = train_and_pred_KRR(X_train, bandgap_train_true, X_test, nfold = 5, nthread = n_threads)
    learning_curve = {}
    if(if_learningcurve == "true"):
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        # use an integer step and a list so the final point can be pinned to 2400
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(X_train), N_train, replace=False)
            clf1, y1_test, y1_train = train_and_pred_KRR(X_train[indtrain], formation_train_true_10times[indtrain], X_test, nfold = 5, nthread = n_threads)
            clf2, y2_test, y2_train = train_and_pred_KRR(X_train[indtrain], bandgap_train_true[indtrain], X_test, nfold = 5, nthread = n_threads)
            learning_curve_formation_train.append(y1_train/10.0)
            learning_curve_formation_test.append(y1_test/10.0)
            learning_curve_bandgap_train.append(y2_train)
            learning_curve_bandgap_test.append(y2_test)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    # undo the factor of 10 applied to the formation-energy targets
    formation_test_pred_soap_krr = formation_test_pred_soap_krr / 10.0
    formation_train_pred_soap_krr = formation_train_pred_soap_krr / 10.0
    gridsearch_results = {}
    gridsearch_results['formation'] = clf1_soap_krr.cv_results_
    gridsearch_results['bandgap'] = clf2_soap_krr.cv_results_
    return formation_test_pred_soap_krr, formation_train_pred_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr, gridsearch_results, learning_curve

formation_test_pred_soap_krr, formation_train_pred_soap_krr, bandgap_test_pred_soap_krr, bandgap_train_pred_soap_krr, gridsearch_results_soap_krr, learning_curve_soap_krr = get_soap_KRR_predictions(X_train, formation_train_true, bandgap_train_true, n_threads = n_threads)
x_grid_formation_soap_krr = np.unique(gridsearch_results_soap_krr['formation']['param_alpha'].data)
y_grid_formation_soap_krr = np.unique(gridsearch_results_soap_krr['formation']['param_gamma'].data)
z_grid_formation_soap_krr = np.reshape(gridsearch_results_soap_krr['formation']['mean_test_score']/10.0, [len(y_grid_formation_soap_krr), -1])
x_grid_bandgap_soap_krr = np.unique(gridsearch_results_soap_krr['bandgap']['param_alpha'].data)
y_grid_bandgap_soap_krr = np.unique(gridsearch_results_soap_krr['bandgap']['param_gamma'].data)
z_grid_bandgap_soap_krr = np.reshape(gridsearch_results_soap_krr['bandgap']['mean_test_score'], [len(y_grid_bandgap_soap_krr), -1])
run_cell_by_tag('plot_soap_krr')
```
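`train_and_pred_KRR` is defined earlier in the notebook (in the 'krr-ngram' cell) and grid-searches the `alpha` and `gamma` hyperparameters whose contours are plotted below. As a rough illustration of what kernel ridge regression itself computes, here is a numpy-only closed-form sketch with an RBF kernel; the helper's actual implementation, and the specific `alpha`/`gamma` values, are assumptions for this example:

``` python
import numpy as np

def rbf_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit_predict(X_train, y_train, X_test, alpha=1e-3, gamma=0.5):
    # closed-form kernel ridge: dual coefficients from (K + alpha*I) c = y
    K = rbf_kernel(X_train, X_train, gamma)
    coef = np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ coef

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 3))
y = np.sin(X.sum(axis=1))
pred = krr_fit_predict(X, y, X)   # predict back on the training set
print(float(np.abs(pred - y).max()))
```

Larger `alpha` trades training accuracy for smoothness and `gamma` sets the kernel width, which is exactly the trade-off the grid search over `param_alpha` and `param_gamma` explores.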
%% Cell type:code id: tags:plot_soap_krr
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_soap_krr, formation_test_true, bandgap_test_pred_soap_krr, bandgap_test_true)
display(HTML("""<div><hr><font size=4em><b>Hyperparameter grid search</b></font></div>"""))
display(HTML("""<div><font size=3.5em><b>- Formation energy</b></font><br></div>"""))
# z_grid_formation_soap_krr was already rescaled by 1/10 when it was built,
# and the bandgap scores were never scaled, so neither grid is divided again here
plot_hyperparameter_contour(x_grid_formation_soap_krr, y_grid_formation_soap_krr, z_grid_formation_soap_krr, grid = 3)
display(HTML("""<div><font size=3.5em><b>- Bandgap</b></font><br></div>"""))
plot_hyperparameter_contour(x_grid_bandgap_soap_krr, y_grid_bandgap_soap_krr, z_grid_bandgap_soap_krr, grid = 3)
if(if_learningcurve == "true"):
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_soap_krr)
```
%% Cell type:code id: tags:nn-soap %% Cell type:code id: tags:nn-soap
``` python ``` python
#cell tag: nn-soap #cell tag: nn-soap
#Neural network for SOAP #Neural network for SOAP
from IPython.display import HTML from IPython.display import HTML
display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#009FC2;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Neural network </b> </font><br></div>""")) display(HTML("""<br><div><span style="height:15px;width:15px;background-color:#009FC2;border-radius:50%;display:inline-block;"></span><font size=5em><b> SOAP + Neural network </b> </font><br></div>"""))
from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split from sklearn.model_selection import train_test_split
import torch import torch
from torch.autograd import Variable from torch.autograd import Variable
import torch.nn as nn import torch.nn as nn
import torch.nn.functional as F import torch.nn.functional as F
torch.manual_seed(1) torch.manual_seed(1)
class myNNRegressor(torch.nn.Module): class myNNRegressor(torch.nn.Module):
''' '''
def __init__(self, feature_size): def __init__(self, feature_size):
super(myNNRegressor, self).__init__() super(myNNRegressor, self).__init__()
self.feature_size = feature_size self.feature_size = feature_size
self.hidden_size1 = 512 self.hidden_size1 = 512
self.hidden_size2 = 256 self.hidden_size2 = 256
self.linear1 = nn.Linear(self.feature_size, self.hidden_size1, bias=False) self.linear1 = nn.Linear(self.feature_size, self.hidden_size1, bias=False)
self.batchnorm1 = nn.BatchNorm1d(self.hidden_size1, affine=True) self.batchnorm1 = nn.BatchNorm1d(self.hidden_size1, affine=True)
self.linear2 = nn.Linear(self.hidden_size1, self.hidden_size2, bias=False) self.linear2 = nn.Linear(self.hidden_size1, self.hidden_size2, bias=False)
self.batchnorm2 = nn.BatchNorm1d(self.hidden_size2, affine=True) self.batchnorm2 = nn.BatchNorm1d(self.hidden_size2, affine=True)
self.linear3 = nn.Linear(self.hidden_size2, self.hidden_size2, bias=False) self.linear3 = nn.Linear(self.hidden_size2, self.hidden_size2, bias=False)
self.batchnorm3 = nn.BatchNorm1d(self.hidden_size2, affine=True) self.batchnorm3 = nn.BatchNorm1d(self.hidden_size2, affine=True)
self.h2s = nn.Linear(self.hidden_size2, 1, bias=False) self.h2s = nn.Linear(self.hidden_size2, 1, bias=False)
self.activation = nn.ReLU() self.activation = nn.ReLU()
self.dropout = nn.Dropout(p=0.20) self.dropout = nn.Dropout(p=0.20)
def forward(self, X): def forward(self, X):
h1 = self.activation(self.dropout(self.batchnorm1(self.linear1(X)))) h1 = self.activation(self.dropout(self.batchnorm1(self.linear1(X))))
h2 = self.activation(self.dropout(self.batchnorm2(self.linear2(h1)))) h2 = self.activation(self.dropout(self.batchnorm2(self.linear2(h1))))
h3 = self.activation(self.dropout(self.batchnorm3(self.linear3(h2)))) h3 = self.activation(self.dropout(self.batchnorm3(self.linear3(h2))))
predictions = self.h2s(h3) predictions = self.h2s(h3)
return predictions return predictions
''' '''
def __init__(self, feature_size): def __init__(self, feature_size):
super(myNNRegressor, self).__init__() super(myNNRegressor, self).__init__()
self.feature_size = feature_size self.feature_size = feature_size
hiddens = [] hiddens = []
batchnorms = [] batchnorms = []
hiddens.append(nn.Linear(self.feature_size, N_nn_neurons[0], bias=False)) hiddens.append(nn.Linear(self.feature_size, N_nn_neurons[0], bias=False))
batchnorms.append(nn.BatchNorm1d(N_nn_neurons[0], affine=True)) batchnorms.append(nn.BatchNorm1d(N_nn_neurons[0], affine=True))
for i in range(N_nn_layers-1): for i in range(N_nn_layers-1):
hiddens.append(nn.Linear(N_nn_neurons[i], N_nn_neurons[i+1], bias=False)) hiddens.append(nn.Linear(N_nn_neurons[i], N_nn_neurons[i+1], bias=False))
batchnorms.append(nn.BatchNorm1d(N_nn_neurons[i+1], affine=True)) batchnorms.append(nn.BatchNorm1d(N_nn_neurons[i+1], affine=True))
self.h2s = nn.Linear(N_nn_neurons[-1], 1, bias = False) self.h2s = nn.Linear(N_nn_neurons[-1], 1, bias = False)
''' '''
hiddens.append(nn.Linear(self.feature_size, 512, bias=False)) hiddens.append(nn.Linear(self.feature_size, 512, bias=False))
batchnorms.append(nn.BatchNorm1d(512, affine=True)) batchnorms.append(nn.BatchNorm1d(512, affine=True))
hiddens.append(nn.Linear(512, 256, bias=False)) hiddens.append(nn.Linear(512, 256, bias=False))
batchnorms.append(nn.BatchNorm1d(256, affine=True)) batchnorms.append(nn.BatchNorm1d(256, affine=True))
hiddens.append(nn.Linear(256, 256, bias=False)) hiddens.append(nn.Linear(256, 256, bias=False))
batchnorms.append(nn.BatchNorm1d(256, affine=True)) batchnorms.append(nn.BatchNorm1d(256, affine=True))
self.h2s = nn.Linear(256, 1, bias = False) self.h2s = nn.Linear(256, 1, bias = False)
self.batchnorm1 = nn.BatchNorm1d(512, affine=True) self.batchnorm1 = nn.BatchNorm1d(512, affine=True)
self.batchnorm2 = nn.BatchNorm1d(256, affine=True) self.batchnorm2 = nn.BatchNorm1d(256, affine=True)
self.batchnorm3 = nn.BatchNorm1d(256, affine=True) self.batchnorm3 = nn.BatchNorm1d(256, affine=True)
self.hiddenlayers = [nn.Linear(self.feature_size, 512, bias=False), nn.Linear(512, 256, bias=False), nn.Linear(256, 256, bias=False)] self.hiddenlayers = [nn.Linear(self.feature_size, 512, bias=False), nn.Linear(512, 256, bias=False), nn.Linear(256, 256, bias=False)]
''' '''
self.hidden_layers = nn.ModuleList(hiddens) self.hidden_layers = nn.ModuleList(hiddens)
self.batchnorm = nn.ModuleList(batchnorms) self.batchnorm = nn.ModuleList(batchnorms)
self.activation = nn.ReLU() self.activation = nn.ReLU()
self.dropout = nn.Dropout(p=0.20) self.dropout = nn.Dropout(p=0.20)
def forward(self, X): def forward(self, X):
h = self.activation(self.dropout(self.batchnorm[0](self.hidden_layers[0](X)))) h = self.activation(self.dropout(self.batchnorm[0](self.hidden_layers[0](X))))
for i in range(1, N_nn_layers): for i in range(1, N_nn_layers):
h = self.activation(self.dropout(self.batchnorm[i](self.hidden_layers[i](h)))) h = self.activation(self.dropout(self.batchnorm[i](self.hidden_layers[i](h))))
''' '''
h = self.activation(self.dropout(self.batchnorm2(self.hiddenlayers[1](h)))) h = self.activation(self.dropout(self.batchnorm2(self.hiddenlayers[1](h))))
h = self.activation(self.dropout(self.batchnorm3(self.hiddenlayers[2](h)))) h = self.activation(self.dropout(self.batchnorm3(self.hiddenlayers[2](h))))
''' '''
predictions = self.h2s(h) predictions = self.h2s(h)
return predictions return predictions
def rmsle(real, predicted): def rmsle(real, predicted):
sum=0.0 sum=0.0
_ = predicted.cpu().data.numpy() _ = predicted.cpu().data.numpy()
for idx, x in enumerate(_): for idx, x in enumerate(_):
if x<0: if x<0:
continue continue
p = torch.log(predicted[idx]+1) p = torch.log(predicted[idx]+1)
r = torch.log(real[idx]+1) r = torch.log(real[idx]+1)
sum += (p - r)**2 sum += (p - r)**2
return torch.sqrt(sum/float(len(predicted))) return torch.sqrt(sum/float(len(predicted)))
def get_predictions(nEpochs, X_train, y_train, X_val=None, y_val=None, evaluation=False):
    # Remove any stale checkpoint from a previous run
    try:
        os.remove('./models/myBestModel.pt')
    except OSError:
        pass
    feature_size = X_train.shape[1]
    learning_rate = 1e-3
    X_train = Variable(torch.from_numpy(X_train).float(),
                       requires_grad=False).cpu()
    y_train = Variable(torch.from_numpy(y_train).float(),
                       requires_grad=False).cpu()
    if evaluation:
        X_val = Variable(torch.from_numpy(X_val).float(),
                         requires_grad=False, volatile=True).cpu()
        y_val = Variable(torch.from_numpy(y_val).float(),
                         requires_grad=False, volatile=True).cpu()
    NNRegressor = myNNRegressor(feature_size).cpu()
    optimiser = torch.optim.Adam(NNRegressor.parameters(), lr=learning_rate)
    loss_fn = nn.MSELoss()
    final_train_loss = 99999
    for epoch in tnrange(nEpochs):
        # Train
        NNRegressor.train(True)
        t0 = time.time()
        predictions = NNRegressor(X_train)
        # Warm up with MSE for the first 100 epochs, then switch to the
        # competition metric (RMSLE)
        if epoch <= 100:
            loss = loss_fn(predictions.reshape(-1), y_train)
        else:
            loss = rmsle(y_train, predictions.reshape(-1))
        NNRegressor.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(NNRegressor.parameters(), 1.0)
        optimiser.step()
        t1 = time.time()
        # Evaluate
        NNRegressor.train(False)
        predictions_train = NNRegressor(X_train)
        train_loss = float(rmsle(y_train, predictions_train.reshape(-1)))
        if evaluation:
            predictions = NNRegressor(X_val)
            val_loss = float(rmsle(y_val, predictions.reshape(-1)))
        NNRegressor.train(True)
        '''
        if evaluation:
            msg = ('Epoch ' + str(epoch) + ' - training loss: ' + str(float(train_loss))
                   + ' validation loss: ' + str(float(val_loss)))
            tqdm.write(msg)
        else:
            msg = 'Epoch ' + str(epoch) + ' - training loss: ' + str(float(train_loss))
            tqdm.write(msg)
        '''
        final_train_loss = float(train_loss)
        if evaluation:
            # Checkpoint the best model seen so far
            if 'best_val_loss' in locals():
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    with open('./models/myBestModel.pt', 'wb') as f:
                        torch.save(NNRegressor, f)
            else:
                best_val_loss = val_loss
                with open('./models/myBestModel.pt', 'wb') as f:
                    torch.save(NNRegressor, f)
        if epoch == 149:
            # Drop the learning rate for the remaining epochs
            learning_rate = 1e-5
            if evaluation:
                del NNRegressor
                with open('./models/myBestModel.pt', 'rb') as f:
                    NNRegressor = torch.load(f)
            optimiser = torch.optim.Adam(NNRegressor.parameters(),
                                         lr=learning_rate)
    if evaluation:
        with open('./models/myBestModel.pt', 'rb') as f:
            NNRegressor = torch.load(f)
        NNRegressor.train(False)
        predictions = NNRegressor(X_val)
        val_loss = float(rmsle(y_val, predictions))
        print('-----\nbest val. loss: {}\n_____'.format(val_loss))
    else:
        NNRegressor.train(False)
        predictions = NNRegressor(X_train)
    print("Final training loss (RMSLE):", final_train_loss)
    return predictions, NNRegressor
def train_test_NN_soap(x_trains, y_trains, x_tests):
    ensemble_size = 1
    all_test_predictions = []
    all_train_predictions = []
    for regressor in range(ensemble_size):
        predictions, NNRegressor = get_predictions(nEpochs=250,
                                                   X_train=x_trains, y_train=y_trains)
        NNRegressor.train(False)
        predictions = NNRegressor(x_tests).cpu().data.numpy()
        train_predictions = NNRegressor(
            Variable(torch.from_numpy(x_trains).float(),
                     requires_grad=False, volatile=True)
        ).cpu().data.numpy()
        all_test_predictions.append(predictions.reshape(-1))
        all_train_predictions.append(train_predictions.reshape(-1))
    # Average the predictions over the ensemble
    final_test_prediction = np.mean(all_test_predictions, axis=0)
    final_train_prediction = np.mean(all_train_predictions, axis=0)
    return final_test_prediction, final_train_prediction
def get_soap_NN_predictions(X_train, X_test, formation_train_true, bandgap_train_true):
    # Formation energies are scaled up by 10 for training and scaled back below
    formation_train_true_10times = formation_train_true * 10.0
    formation_test_pred_soap_nn, formation_train_pred_soap_nn = train_test_NN_soap(
        X_train, formation_train_true_10times, X_test)
    bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn = train_test_NN_soap(
        X_train, bandgap_train_true, X_test)
    learning_curve = {}
    if if_learningcurve == "true":
        learning_curve_N_train = []
        learning_curve_indtrain = []
        learning_curve_formation_train = []
        learning_curve_formation_test = []
        learning_curve_bandgap_train = []
        learning_curve_bandgap_test = []
        # Integer division so the step is a valid range() argument in Python 3
        step = (2400 - 100) // N_learning_curve
        range_learning_curve = list(range(100, 2401, step))
        range_learning_curve[-1] = 2400
        print(range_learning_curve)
        for N_train in range_learning_curve:
            indtrain = np.random.choice(len(X_train), N_train, replace=False)
            formation_test_pred_soap_nn, formation_train_pred_soap_nn = train_test_NN_soap(
                X_train[indtrain], formation_train_true_10times[indtrain], X_test)
            bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn = train_test_NN_soap(
                X_train[indtrain], bandgap_train_true[indtrain], X_test)
            learning_curve_formation_train.append(formation_train_pred_soap_nn / 10.0)
            learning_curve_formation_test.append(formation_test_pred_soap_nn / 10.0)
            learning_curve_bandgap_train.append(bandgap_train_pred_soap_nn)
            learning_curve_bandgap_test.append(bandgap_test_pred_soap_nn)
            learning_curve_N_train.append(N_train)
            learning_curve_indtrain.append(indtrain)
        learning_curve["formation_train"] = learning_curve_formation_train
        learning_curve["formation_test"] = learning_curve_formation_test
        learning_curve["bandgap_train"] = learning_curve_bandgap_train
        learning_curve["bandgap_test"] = learning_curve_bandgap_test
        learning_curve["N_train"] = learning_curve_N_train
        learning_curve["indtrain"] = learning_curve_indtrain
    formation_test_pred_soap_nn = formation_test_pred_soap_nn / 10.0
    formation_train_pred_soap_nn = formation_train_pred_soap_nn / 10.0
    return formation_test_pred_soap_nn, formation_train_pred_soap_nn, bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn, learning_curve
# Scale
scaler = StandardScaler(with_mean=True, with_std=True)
formation_train_true = train_csv[:, -2]
bandgap_train_true = train_csv[:, -1]
# Mean atomic descriptor: scale test and train together, then split again
combined_mean_atomic_descriptors = scaler.fit_transform(np.append(
    test_mean_atomic_descriptors,
    train_mean_atomic_descriptors,
    axis=0))
X_test_soap = np.round(combined_mean_atomic_descriptors[:test_csv.shape[0]],
                       decimals=5)
X_test_soap = Variable(torch.from_numpy(X_test_soap).float(),
                       requires_grad=False, volatile=True).cpu()
X_train_soap = np.round(combined_mean_atomic_descriptors[test_csv.shape[0]:],
                        decimals=5)
formation_test_pred_soap_nn, formation_train_pred_soap_nn, bandgap_test_pred_soap_nn, bandgap_train_pred_soap_nn, learning_curve_soap_nn = get_soap_NN_predictions(
    X_train_soap, X_test_soap, formation_train_true, bandgap_train_true)
run_cell_by_tag('plot_soap_nn')
```
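%% Cell type:markdown id: tags:
The `rmsle` helper above implements the root mean squared logarithmic error used to score this challenge, skipping negative predictions while still dividing by the total number of predictions. As an illustrative sanity check (not part of the original pipeline; `rmsle_np` is a name introduced here), the same metric can be written in plain NumPy:
%% Cell type:code id: tags:
``` python
import numpy as np

def rmsle_np(real, predicted):
    """RMSLE, mirroring the PyTorch helper: negative predictions are
    skipped, but the denominator counts all predictions."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = predicted >= 0  # skip negative predictions, as above
    log_diff = np.log1p(predicted[mask]) - np.log1p(real[mask])
    return np.sqrt(np.sum(log_diff ** 2) / len(predicted))

print(rmsle_np([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction -> 0.0
```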
%% Cell type:code id: tags:plot_soap_nn
``` python
from IPython.display import HTML
display(HTML("""<div><hr><font size=4em><b>Formation energy and bandgap predictions</b></font><br></div>"""))
test_plot(formation_test_pred_soap_nn, formation_test_true, bandgap_test_pred_soap_nn, bandgap_test_true)
if if_learningcurve == "true":
    display(HTML("""<div><hr><font size=4em><b>Learning curves</b></font><br></div>"""))
    plot_learning_curve(learning_curve_soap_nn)
```
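%% Cell type:markdown id: tags:
The learning-curve branch in `get_soap_NN_predictions` draws progressively larger random training subsets with `np.random.choice(..., replace=False)`. A minimal self-contained sketch of that sampling schedule (the seed and the number of curve points are illustrative, not values from the notebook):
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
n_total = 2400                  # full training-set size, as in the notebook
n_points = 5                    # number of learning-curve points (illustrative)

step = (n_total - 100) // n_points
sizes = list(range(100, n_total + 1, step))
sizes[-1] = n_total  # ensure the last point uses the full training set

# One index subset per learning-curve point, drawn without replacement
subsets = [rng.choice(n_total, size=n, replace=False) for n in sizes]
print(sizes)
```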
%% Cell type:markdown id: tags:
## References
1. https://scikit-learn.org/stable/modules/kernel_ridge.html
2. https://scikit-learn.org/stable/modules/neural_networks_supervised.html
3. Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/
%% Cell type:code id: tags:
``` python
```