%% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Regression using multilayer perceptrons</font></b>
</div>
<p>
created by:
Andreas Leitherer,<sup>1</sup>
Angelo Ziletti,<sup>1</sup>
Luigi Sbailò,<sup>1</sup>
Matthias Scheffler,<sup>1</sup>
and Luca Ghiringhelli<sup>1</sup> <br><br>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<div>
<img style="float: left;" src="assets/nn_regression/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/nn_regression/Logo_NOMAD.png" width="250">
</div>
</div>
%% Cell type:markdown id: tags:
In this tutorial, the standard architecture for neural networks (multilayer perceptrons, i.e., fully-connected neural networks) is introduced and applied to a regression task: the prediction of a material property of inorganic compounds. Neural networks for classification are briefly explained as well; more details on this topic can be found in the tutorial on convolutional neural networks.
After explaining the basic concepts, a fully connected neural network is set up using the python library Keras (https://keras.io/), with the input representation constructed in the spirit of
D. Jha, L. Ward, A. Paul, W.-k. Liao, A. Choudhary, C. Wolverton and A. Agrawal. ElemNet: Deep learning the chemistry of materials from only elemental composition. Scientific Reports 8, 17593 (2018).
The goal is then to predict the volume per atom for inorganic solids from the open quantum materials database (OQMD). Only information on the chemical composition is used (in particular, no structural information). The results are analyzed using typical performance measures such as mean absolute error, mean squared error, root mean square error, and the Pearson correlation coefficient. Visualization techniques and advanced optimization methods are discussed at the end.
%% Cell type:markdown id: tags:
Side remark: The documentation of Keras at https://keras.io/ refers to the newest version of Keras (>2.4), which only supports Tensorflow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the tutorial on convolutional neural networks) is compatible with versions <=2.3, which allow multiple backends (CNTK, Tensorflow, Theano). There are only slight differences in syntax, and you can find archived documentation at https://github.com/faroit/keras-docs, e.g., for version 2.1.5 at https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use Tensorflow as the backend (version <2.0).
%% Cell type:markdown id: tags:
## Load packages
%% Cell type:markdown id: tags:
The following packages are required for this tutorial:
%% Cell type:code id: tags:
``` python
# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Tensorflow as backend for keras (see below)
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # Suppress TF warnings
# Keras for neural networks
from keras.models import load_model
from keras.layers import Input, Dense, Dropout
from keras.models import Model
# pymatgen
from pymatgen.core.periodic_table import Element
# other packages
import os
import numpy as np
from collections import Counter
# sklearn
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# pandas
import pandas as pd
```
%% Cell type:markdown id: tags:
## 1. Introduction
%% Cell type:markdown id: tags:
### 1.1 Biological motivation, the perceptron, and typical activation functions
The origin of *artificial neural networks* (ANNs) dates back to the early 1940s. The simplest form of an ANN is the *perceptron*, which was developed by Frank Rosenblatt in 1957 (the interested reader can find the original report [here](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf)) and is biologically motivated (see the simplifying sketch of a biological neuron below).
In a perceptron, the output $y$ is computed by taking the linear combination of the input with weights $w_1, w_2$ and *bias* $b$, yielding the value $z$ (see the right part of the figure below), to which a non-linear function $f$ (the *activation function*) is applied at the end.
<img src="./assets/nn_regression/biological_neuron_analogy.png" width="700">
%% Cell type:markdown id: tags:
As a first example, let us consider binary classification, i.e., two classes (marked by circles and crosses in the right part of the figure below), which we try to distinguish using two input features $x_1, x_2$. This could be, for instance, a medical problem, where we want to predict whether a patient has a certain disease based on two factors (e.g., age and height). A very simple activation function is the Heaviside activation, for which the perceptron yields the following output:
$ y = f(z) =
\begin{cases}
1 \quad \text{if } z = w_1x_1+w_2x_2+b > 0 \\
0 \quad \text{otherwise.}
\end{cases}
$
Here, the two possible values 0 and 1 represent the two distinct classes. Given fixed weights $w_1, w_2$, the bias term $b$ determines whether the neuron fires (i.e., $f(z)=1$) or not for given input features $x_1, x_2$, which represent, for instance, a patient.
<img src="./assets/nn_regression/perceptron_binary_classification.png" width="500">
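As a small, self-contained illustration (not part of the original tutorial code), the following sketch implements this perceptron with a Heaviside activation in NumPy; the weights, bias, and input values are arbitrary choices for demonstration.
``` python
import numpy as np

def heaviside_perceptron(x, w, b):
    # Perceptron with Heaviside activation: fires (returns 1) if w.x + b > 0, else 0
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

w = np.array([0.4, -0.7])   # weights w_1, w_2 (illustrative values)
b = 0.1                     # bias
x = np.array([1.5, 0.3])    # input features x_1, x_2, e.g., scaled age and height

print(heaviside_perceptron(x, w, b))   # z = 0.49 > 0, so the neuron fires and 1 is printed
```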
%% Cell type:markdown id: tags:
In general, in supervised learning, we have a set of m training examples, which may be written as tuples consisting of input features (in this example summarized in a two-dimensional vector $\mathbf{x}$ with components $x_1$ and $x_2$) and the correct labels $y^{(i)}$, with $i = 1, ..., m$:
$(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), ..., (\mathbf{x}^{(m)}, y^{(m)})$.
The weights and bias term of the model (here: $w_1, w_2, b$) are optimized by minimizing a *loss function* $L(w_1, w_2, b)$ that quantifies how the predicted values $\hat{y}^{(i)}$ deviate from the true values $y^{(i)}$ - as a function of the model parameters. We will see an explicit form for $L(w_1, w_2, b)$ in the context of regression in section 2 of this tutorial. Usually, gradient descent is used to find the parameters minimizing $L$ (with modifications of gradient descent enabling faster convergence; see, for instance, chapters 5 and 8 of [this](https://www.deeplearningbook.org/) standard reference). After the optimization is finished, in the case of classification, model performance can be assessed via the classification accuracy (the number of correct predictions divided by the total number of predictions).
We will explain the optimization procedure in more detail in section 2 of the tutorial. In this example, one can think of the training / optimization phase as changing the model parameters until the optimal position of a straight line (see the figure above, right) is found, which serves as a decision boundary between the two classes.
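To make the optimization step more concrete, here is a minimal sketch of (batch) gradient descent for a single linear neuron with a mean-squared-error loss; the toy data, learning rate, and number of iterations are illustrative assumptions and are unrelated to the materials dataset used later.
``` python
import numpy as np

# Toy data: 100 examples with two features and targets from a known linear relation
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 2)
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.5

w = np.zeros(2)   # weights w_1, w_2
b = 0.0           # bias
lr = 0.1          # learning rate

for step in range(2000):
    y_hat = X_toy.dot(w) + b                        # model predictions
    error = y_hat - y_toy
    grad_w = 2.0 / len(y_toy) * X_toy.T.dot(error)  # gradient of the MSE w.r.t. w
    grad_b = 2.0 / len(y_toy) * error.sum()         # gradient of the MSE w.r.t. b
    w -= lr * grad_w                                # gradient-descent update
    b -= lr * grad_b

print(w, b)   # should approach (3.0, -2.0) and 0.5
```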
%% Cell type:markdown id: tags:
To conclude this section on perceptrons, note that the Heaviside activation function is not used in modern deep-learning applications, but rather one of the following ones:
* Sigmoid: $f(x) = \frac{1}{1+ \exp{(-x)}}$
* Tangens hyperbolicus: $f(x) = \tanh{(x)} = \frac{\sinh{(x)}}{\cosh{(x)}} = \frac{\exp{(x)}-\exp{(-x)}}{\exp{(x)}+\exp{(-x)}}$
* Rectified linear unit (ReLU): $f(x) = \max(0, x)$
The ReLU activation function is the most frequently used. Non-linear functions are essential to increase the space of possible (complex) functions that the model can learn. If no activation function were used, i.e., the identity - also called the *linear activation function* - the class of possible functions that the model can represent would be drastically reduced.
![activation_functions.png](./assets/nn_regression/activation_functions.png)
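The three activation functions can be written and plotted in a few lines; the following is a small self-contained sketch whose curves correspond to the figure above.
``` python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-5, 5, 200)
for f, label in [(sigmoid, 'Sigmoid'), (np.tanh, 'Tanh'), (relu, 'ReLU')]:
    plt.plot(x, f(x), label=label)
plt.legend()
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```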
%% Cell type:markdown id: tags:
### 1.2 Multilayer perceptron
Extending the idea of simple perceptrons, one can construct multilayer perceptrons as sequences of layers.
Each layer consists of a predefined number of neurons, where the neurons of the first layer (the *input layer*) correspond to the input features $\mathbf{x} = (x_1, x_2, ...)$. The subsequent layers are called *hidden layers*. Each neuron in a hidden layer is computed from a linear combination of the neurons in the previous layer. For instance, the *activation value* $a_1$ highlighted in the figure below is computed in the following way:
$\begin{equation*}
a_1 = f(w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6 + b_1),
\end{equation*}$
where an activation function $f$ is applied to yield the final activation value. This process, which is called *forward propagation*, is repeated for all neurons and layers until the output $\textbf{o} = (o_1, o_2, ...)$ in the final *output layer* is obtained. Already at this level, one of the main characteristics of (multilayer) neural networks becomes clear - namely, the ability to learn complex features from the initial input. This ability to learn new representations, which become more abstract the deeper the network is (i.e., the more hidden layers it has), distinguishes neural networks from other, "standard" machine learning algorithms (e.g., decision trees).
Note that the architecture shown in the figure below has only one hidden layer (a *shallow neural network*), while modern deep-learning models can reach more than 50 layers (see for instance [here](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)).
<img src="./assets/nn_regression/mlp_example.png" width="500">
%% Cell type:markdown id: tags:
More formally, given the vectorial input $\mathbf{x} = (x_1, x_2, ...)^T$, the activations $\textbf{a} = (a_1, a_2, ...)^T$ of the next layer are obtained via the transformation
$\begin{equation*}
\mathbf{a} = f ( A\mathbf{x} + \mathbf{b} ),
\end{equation*}$
which is essentially an affine transformation (defined by a matrix $A$ and vector $\mathbf{b}$) followed by element-wise application of the (non-linear) activation function $f$. The weights of the linear combinations are collected in the matrix $A$ and the offsets in the *bias vector* $\mathbf{b} = (b_1, b_2, ...)$. The output activations $\mathbf{o}$ are obtained by applying a further affine transformation (matrix $A^\prime$, bias $\mathbf{b}^\prime$) and activation function $f^\prime$:
$\begin{equation*}
\mathbf{o} = f^\prime (A^\prime \mathbf{a} + \mathbf{b}^\prime)
\end{equation*}$
The final activation function $f^\prime$ is chosen in a specific way, usually depending on whether the task is regression or classification - we will come back to this later.
To simplify the above expression for $\mathbf{o}$, one can change the definition of the input vector and weight matrices such that the bias terms can be omitted. We denote the input vector as before and introduce weight matrices $W, W^\prime$, which yields a more compact expression for the output:
$\begin{equation*}
\mathbf{o} = f^\prime (W^\prime \mathbf{a}) = f^\prime (W^\prime f(W \mathbf{x})).
\end{equation*}$
This equation makes it clearer that a multilayer perceptron is a composition of affine and non-linear transformations, parametrized by a set of weights and biases which are optimized to fit the task at hand. The equation can be generalized to an arbitrary number of layers $N$:
$\begin{equation*}
\mathbf{o} = f^{N} ( W^N f^{N-1} ( W^{N-1} \cdots f^1 ( W^1 \mathbf{x}) \cdots ))
\end{equation*}$
An example for the case of $N=3$ is shown in the following figure:
<img src="./assets/nn_regression/mlp_more_layers_example.png" width="200">
The larger and deeper the network, the more computationally costly it becomes to calculate the gradient of the loss function (which is needed to optimize the parameters via gradient descent). A key invention is the backpropagation algorithm (cf. the [original publication](https://www.nature.com/articles/323533a0)), which is an efficient way to calculate the gradient of the loss function with respect to the neural-network parameters.
Note that if we used only linear (i.e., identity) activation functions, the output would essentially be just a linear combination of the input. Using non-linear activation functions (nowadays mostly the ReLU activation function) enriches the function space that can be represented. In particular, the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any continuous function - while architecture specifics and generalization guarantees are not provided by this theorem. Due to the use of non-linearities, optimizing a deep neural network is a non-convex optimization problem with multiple local minima. Techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) help to avoid getting stuck in non-optimal local minima.
Coming now to the choice of activation function in the final layer, the softmax activation function is the usual choice for classification tasks, yielding the following expression for the $j$th component of the output vector:
$\begin{equation*}
[\mathbf{o}]_j = [f(\mathbf{x})]_j = \frac{\exp{\left([\mathbf{x}]_j\right)}}{\sum_k \exp{\left([\mathbf{x}]_k\right)}}
\end{equation*}$
This way, the output vector components are normalized to the range $[0, 1]$ and sum up to one, i.e., the components can be interpreted as probabilities.
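The forward-propagation and softmax equations above can be made concrete with a few lines of NumPy: the sketch below pushes one input vector through a single, randomly initialized hidden layer (ReLU) and a softmax output layer and verifies that the output components sum to one. All layer sizes and weights are arbitrary illustrative choices.
``` python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the maximum for numerical stability
    return e / e.sum()

rng = np.random.RandomState(0)
x = rng.rand(6)            # input vector with 6 features
W = rng.randn(4, 6)        # hidden-layer weights (4 neurons)
b = rng.randn(4)           # hidden-layer biases
W_out = rng.randn(3, 4)    # output-layer weights (3 classes)
b_out = rng.randn(3)       # output-layer biases

a = relu(W.dot(x) + b)                # a = f(W x + b)
o = softmax(W_out.dot(a) + b_out)     # o = f'(W' a + b')

print(o, o.sum())   # the components of o lie in [0, 1] and sum to 1
```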
%% Cell type:markdown id: tags:
To illustrate the usefulness of the softmax activation function, let us consider the case of crystal-structure classification. The task is to assign the correct (symmetry) label to a given, unknown crystal structure as defined by atomic positions and chemical species. For instance, possible assignments could be face-centered cubic, body-centered cubic, diamond, or hexagonal close-packed - a collection of structures that covers more than 80% of the elemental solids. More thorough explanations of deep learning applied to crystal-structure classification can be found [here](https://www.nature.com/articles/s41467-018-05169-6). When applying the multilayer perceptron architecture which we introduced above, each of the four output neurons corresponds to a specific crystal structure. The use of the softmax activation function guarantees that all output activations sum to one, which is why the output vector $\mathbf{o}$ can be considered as a vector of classification probabilities. For instance, if $\mathbf{o} = (1, 0, 0, 0)$, the input structure is predicted to have fcc symmetry with 100\% probability (see figure below). This is also called "one-hot encoding" and corresponds to representing a given number N of classes in the standard basis of $\mathbb{R}^\text{N}$, i.e., by N vectors $e_i = (0, ..., 0, 1, 0, ..., 0)$, for $i=1, ..., N$, where all components of $e_i$ are zero except for the $i$th entry.
<img src="./assets/nn_regression/cs_classification_first_example.png" width="1700">
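The one-hot encoding mentioned above can also be generated directly with Keras; in the sketch below, the integer-to-class assignment (0=fcc, 1=bcc, 2=diamond, 3=hcp) is an arbitrary illustrative choice.
``` python
from keras.utils import to_categorical

labels = [0, 2, 1, 3, 0]                           # integer class labels for five structures
one_hot = to_categorical(labels, num_classes=4)    # each row has a single 1 at the class position
print(one_hot)
```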
%% Cell type:markdown id: tags:
For regression, the above multilayer network architecture can easily be adapted by using only one output neuron in the final layer together with a *linear* activation function (i.e., the identity function) to predict a specific target property $\text{P}\in\mathbb{R}$. A similar architecture (with additional hidden layers) will be used in this tutorial, and the target property will be a specific property of inorganic compounds.
<img src="./assets/nn_regression/regression_first_example.png" width="350">
%% Cell type:markdown id: tags:
## 2. Neural network regression example - "ElemNet"
%% Cell type:markdown id: tags:
In the following, we will consider a specific application of multilayer perceptrons. The idea of this section is that you first read through / run the cells and then return to them later (in particular, changing the neural-network parameter settings) when answering the questions at the end.
%% Cell type:markdown id: tags:
"ElemNet" is a deep-learning approach to predict material properties using information on the chemical composition only. In particular, no structural information is used, i.e., the atomic positions and the corresponding symmetries are not considered (which is frequently employed in other machine and deep learning projects in materials science). More information on ElemNet can be found in the following references: "ElemNet" is a deep-learning approach to predict material properties using information on the chemical composition only. In particular, no structural information is used, i.e., the atomic positions and the corresponding symmetries are not considered (which is frequently employed in other machine and deep learning projects in materials science). More information on ElemNet can be found in the following references:
* Dipendra Jha, Logan Ward, Arindam Paul, Wei-keng Liao, Alok Choudhary, Chris Wolverton, and Ankit Agrawal, “ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition,” Scientific Reports, 8, Article number: 17593 (2018) [DOI:10.1038/s41598-018-35934-y] * Dipendra Jha, Logan Ward, Arindam Paul, Wei-keng Liao, Alok Choudhary, Chris Wolverton, and Ankit Agrawal, “ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition,” Scientific Reports, 8, Article number: 17593 (2018) [DOI:10.1038/s41598-018-35934-y]
* Dipendra Jha, Kamal Choudhary, Francesca Tavazza, Wei-keng Liao, Alok Choudhary, Carelyn Campbell, Ankit Agrawal, "Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning," Nature Communications, 10, Article number: 5316 (2019) [DOI: https:10.1038/s41467-019-13297-w] * Dipendra Jha, Kamal Choudhary, Francesca Tavazza, Wei-keng Liao, Alok Choudhary, Carelyn Campbell, Ankit Agrawal, "Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning," Nature Communications, 10, Article number: 5316 (2019) [DOI: https:10.1038/s41467-019-13297-w]
* https://github.com/NU-CUCIS/ElemNet * https://github.com/NU-CUCIS/ElemNet
%% Cell type:markdown id: tags:
The deep-learning model behind "ElemNet" is essentially a multilayer perceptron with the input vector (the representation, i.e., the descriptor of inorganic compounds) being chosen in a very specific way:
Each compound is represented by a feature vector $\mathbf{f}$ of fixed length, whose components correspond to the elements of the periodic table. They are sorted according to the atomic number Z in ascending order (i.e., the first component of $\mathbf{f}$ corresponds to hydrogen, the second to helium, etc.).
For instance, given a binary compound $\text{A}_x \text{B}_y$ with $x+y=1$, all entries of $\mathbf{f}$ are zero except those corresponding to elements A and B. These entries are assigned the relative stoichiometric fractions x and y. In the case of NaCl (rock salt), the representation would be
$\begin{equation*}
\mathbf{f} = (0.0,\ 0.0,\ ...,\ 0.0,\ 0.5,\ 0.0,\ ..., 0.0,\ 0.5,\ 0.0,\ ...),
\end{equation*}$
where only the two entries corresponding to Na and Cl are assigned non-zero values (0.5 in this case).
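To make the construction of $\mathbf{f}$ explicit, the following sketch builds such a vector for NaCl over the full periodic table (elements 1-118), using pymatgen's `Element` class (imported above) to map element symbols to atomic numbers. Note that this is only an illustration: the precomputed features used later in this tutorial keep only the elements that actually occur in the dataset (see below).
``` python
import numpy as np
from pymatgen.core.periodic_table import Element

def elemnet_vector(stoichiometry, n_elements=118):
    # Build an ElemNet-style feature vector from a dict of fractional compositions
    f = np.zeros(n_elements)
    for symbol, fraction in stoichiometry.items():
        f[Element(symbol).Z - 1] = fraction   # component index = atomic number - 1
    return f

f_NaCl = elemnet_vector({'Na': 0.5, 'Cl': 0.5})
print(f_NaCl[Element('Na').Z - 1], f_NaCl[Element('Cl').Z - 1])   # both entries are 0.5
print(f_NaCl.sum())                                               # the fractions sum to 1
```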
%% Cell type:markdown id: tags:
### 2.1 The task and dataset creation
Given only the chemical composition of an inorganic compound, the goal is to predict the volume per atom, which is the total volume of the unit cell divided by the number of atoms in the unit cell. This property thus provides an average characterization of the atomic arrangement and is therefore interesting in the context of crystal-structure prediction (we refer the interested reader to [this review](https://www.nature.com/articles/s41578-019-0101-8) on the topic). Note, however, that this quantity is only an average assessment of the unit-cell geometry, and thus further properties have to be known (for instance, total volume and space group) to get a geometrical understanding of the crystal that is sufficient to actually predict the atomic arrangement.
%% Cell type:markdown id: tags:
This target property has been predicted with several other machine learning algorithms (specifically, decision trees) before, in the following reference:
L. Ward, A. Agrawal, A. Choudhary and C. Wolverton. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials 2, 16028 (2016).
While in this reference (and also in the "ElemNet"-related works listed above) energetic targets (such as the formation energy) were the main focus, we will use "ElemNet" to predict the volume per atom.
For this purpose, we need to create the dataset in a first step. Information on stoichiometry and volume per atom can be extracted from the dataset published in Ward et al. (2016) (which is based on the open quantum materials database (OQMD), see the [original paper](https://link.springer.com/article/10.1007/s11837-013-0755-4) and the [documentation](http://oqmd.org/)). This step has already been done and we provide the prepared dataset. We also eliminated some data points with very large volumes (>100 $Å^3$ per atom). Furthermore, we computed the input representation required by "ElemNet". This is a common situation: the data is available in some format, but additional preprocessing (data cleaning, feature engineering) is necessary.
Let us first load the dataset containing information on the chemical composition (in particular the stoichiometry), for which we use pandas (https://pandas.pydata.org/, a python library providing convenient ways to store, load and process data):
%% Cell type:code id: tags:
``` python
df = pd.read_pickle('./data/nn_regression/OQMD_Ward_et_al_2016_df.pkl')
y_vol_per_atom = df['vol_per_atom'].values
number_of_elements = df['number_of_elements'].values
stoichiometry_dicts = df['stoichiometry_dicts']
elements_of_interest = []
for stoichiometry_dict in stoichiometry_dicts:
    for species in stoichiometry_dict:
        elements_of_interest.append(species)
df.head()
```
%% Cell type:markdown id: tags:
Now we can have a look at the statistics of this dataset, in particular the distribution of the target property:
%% Cell type:code id: tags:
``` python
df.hist()
plt.show()
print('\nStatistics of the target property:\n')
print(df['vol_per_atom'].describe())
```
%% Cell type:markdown id: tags:
We already see at this point that most of the compounds contain three chemical species. To provide more insight, the code below prints the total number of data points and the kinds of materials that appear (specifically, how many chemical species they contain), and furthermore which elements of the periodic table occur and how often (visualized in a bar plot):
%% Cell type:code id: tags:
``` python
print("Total number of datapoints: {}\n".format(len(y_vol_per_atom)))
from collections import Counter
dict_elements = dict(Counter(number_of_elements))
for key in dict_elements:
    print("Compounds with {} element(s) appear {} times in the dataset".format(key, dict_elements[key]))
# Order elements according to Z
elements_of_interest_unique = np.array(list(set(elements_of_interest)), dtype=object)
Z_values = []
for element in elements_of_interest_unique:
    Z_values.append(Element(element).Z)
sorted_indices = np.argsort(Z_values)
elements_of_interest_sorted = elements_of_interest_unique[sorted_indices]
print("\nThe following elements (in total {}) appear in the dataset:\n\n {}".format(len(elements_of_interest_sorted), elements_of_interest_sorted))
elements_appearance = dict(Counter(elements_of_interest))
elements_appearance_sorted = [elements_appearance[_] for _ in elements_of_interest_sorted]
fig = plt.figure(figsize=(25,10))
plt.bar(elements_of_interest_sorted, elements_appearance_sorted, width=.2, color='g')
plt.xlabel('Elements appearing in the dataset')
plt.ylabel('Counts')
plt.show()
```
%% Cell type:markdown id: tags:
From the bar plot, we see that most of the materials contain oxygen. This also reflects the bias of researchers towards investigating specific systems (i.e., those that are either well-known or currently of interest).
%% Cell type:markdown id: tags:
We now have the target property ready, but also need the input feature vectors in "ElemNet"-style. This is also prepared and saved in a numpy array, which we load in the subsequent cell:
%% Cell type:code id: tags:
``` python
X_ElemNet = np.load('./data/nn_regression/X_ElemNet.npy')
```
%% Cell type:markdown id: tags:
Note that elements that do not appear in the dataset are not included in the feature vector (otherwise, some input neurons would be zero for every data point, making them useless and, even worse, potentially disturbing the learning process).
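A minimal sketch of how such all-zero columns could be removed from a full periodic-table representation is given below; the provided `X_ElemNet` array has already been prepared this way, so the small matrix here is purely hypothetical.
``` python
import numpy as np

# Hypothetical feature matrix in which the 2nd and 4th columns are zero for every data point
X_full = np.array([[0.5, 0.0, 0.5, 0.0],
                   [0.3, 0.0, 0.7, 0.0]])

nonzero_columns = X_full.any(axis=0)    # True for columns with at least one non-zero entry
X_reduced = X_full[:, nonzero_columns]
print(X_reduced.shape)                  # (2, 2): the two all-zero columns have been dropped
```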
Now the dataset is ready for training. For completeness, let us print the shapes of the input features and targets:
%% Cell type:code id: tags:
``` python
print("Input shape = {}, target shape = {}".format(X_ElemNet.shape, y_vol_per_atom.shape))
```
%% Cell type:markdown id: tags:
### 2.2 Dataset training / test split
%% Cell type:markdown id: tags:
The first step is to split the data into a training and a test set; the latter is not touched during optimization and is used only at the end for the final model test. Note that it is essential to report the split ratio and random state to enable reproducibility:
%% Cell type:code id: tags:
``` python
# Very important for reproducibility!
RANDOM_STATE = 42
split_ratio = 0.2
# 80/20 split for creating training and test set
X, X_test, y, y_test = train_test_split(X_ElemNet, y_vol_per_atom,
                                        test_size=split_ratio, random_state=RANDOM_STATE)
```
%% Cell type:markdown id: tags:
Another typical preprocessing step is to scale the input features (for instance using mean and standard deviation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
This can improve model performance; it is not used for the moment, but the interested reader can experiment with this preprocessing step by uncommenting the following code block:
%% Cell type:code id: tags:
``` python
"""
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)
"""
```
%% Cell type:markdown id: tags:
Now we perform an additional 80/20 split such that we use the 80% portion for training the model and the 20% portion for validation:
%% Cell type:code id: tags:
``` python
X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=split_ratio, random_state=RANDOM_STATE)
```
%% Cell type:markdown id: tags:
If we are not satisfied with the validation performance, we can take a step back, change the hyperparameters (such as the number of layers or the number of neurons in each layer), and restart the training process. This way, we can optimize the generalization ability of the model before the final evaluation on the test set. Ideally, one should also consider different splits (for instance via *cross-validation*, see chapter 7.10 of this [book](https://link.springer.com/book/10.1007/978-0-387-84858-7)), which would be too time-consuming for this tutorial. Furthermore, we provide code only for hand-tuning the hyperparameters, while references to more advanced tuning methods (e.g., Bayesian optimization) are given at the end.
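For completeness, here is a minimal sketch of how $k$-fold cross-validation splits could be generated with the `KFold` class imported above (5 folds as an illustrative choice); the actual model training and evaluation inside the loop is omitted.
``` python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    X_train_cv, X_val_cv = X[train_idx], X[val_idx]
    y_train_cv, y_val_cv = y[train_idx], y[val_idx]
    # ... train and evaluate a freshly initialized model on this split ...
    print("Fold {}: {} training / {} validation points".format(fold, len(train_idx), len(val_idx)))
```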
%% Cell type:markdown id: tags:
### 2.3 Model definition and training phase
%% Cell type:markdown id: tags:
We use the python library Keras to create the multilayer perceptron architecture (the online documentation is an excellent resource for more details - please refer to the side remark at the beginning of the tutorial for the URLs). The function defined below generates a model object from a dictionary "params", which will be defined later and contains the specifications of the model architecture:
%% Cell type:code id: tags:
``` python
n_feat = X_train.shape[1]

def create_model(params):
    # Input layer with one neuron per feature
    x_input = Input(shape=(n_feat,), name="x_input")
    dropout_values = params["dropout"]
    # Stack fully connected (Dense) hidden layers with ReLU activations,
    # each followed by a dropout layer for regularization
    for i, n_nodes in enumerate(params["arch"]):
        if i == 0:
            x = Dense(n_nodes, activation="relu")(x_input)
        else:
            x = Dense(n_nodes, activation="relu")(x)
        dropout = dropout_values[i]
        x = Dropout(dropout)(x)
    # Single output neuron with linear activation for regression
    output = Dense(1)(x)
    model = Model(x_input, output)
    model.compile(optimizer="adam", loss=["mse"])
    return model
```
%% Cell type:markdown id: tags:
In this function, a particular loss function is chosen (see "model.compile(...)"), the mean squared error (MSE):
$\begin{equation*}
MSE = \frac{1}{m}\sum_{i=1}^m (y_i - \widehat{y}_i)^2,
\end{equation*}$
where m is the number of training examples, $y_i$ denotes the $i$th ground-truth target and $\widehat{y}_i$ the $i$th predicted target.
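The performance measures mentioned in the introduction (MAE, MSE, RMSE, and the Pearson correlation coefficient) can be computed with NumPy and scikit-learn as in the following sketch; `y_true` and `y_pred` are small placeholder arrays standing in for ground-truth and predicted targets.
``` python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.2, 15.1, 20.7, 12.3])   # placeholder ground-truth targets
y_pred = np.array([11.0, 14.2, 19.9, 12.8])   # placeholder predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]

print(mae, mse, rmse, pearson_r)
```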
Several hyperparameters can be optimized, most notably the number of layers, the number of neurons in each layer, and regularization parameters. Let us discuss the choice of regularization in more detail in the following:
To have a valuable machine learning model, it is of paramount importance to make sure that the model generalizes well to unseen samples.
Goodfellow *et al.* (cf. chapter [5.2.2](https://www.deeplearningbook.org/contents/ml.html)) define regularization as *any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.* In practice, additional terms are added to the training optimization objective to prevent overfitting or to help the optimization.
There are numerous ways to regularize a neural network; we refer the interested reader to Chapter [7](https://www.deeplearningbook.org/contents/regularization.html).
Here, we use **dropout layers** to regularize our neural network. To explain in a few words what dropout is, we report below the abstract of the article that introduced dropout (Srivastava *et al.*, J. Mach. Learn. Res. 15, 1929 (2014)):
*Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.*
*The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.*
*This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.*
For the full article, "Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting", please visit http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf.
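To see what a dropout layer does in isolation, here is a minimal sketch (assuming the TensorFlow implementation of Keras, `tf.keras`, is available): with a rate of 0.5, roughly half of the activations are set to zero during training and the surviving ones are rescaled by 1/(1 - rate), while at inference time the input passes through unchanged.
%% Cell type:code id: tags:
``` python
import tensorflow as tf

# A standalone Dropout layer applied to a constant input, purely for illustration
dropout_demo = tf.keras.layers.Dropout(rate=0.5, seed=0)
ones = tf.ones((1, 10))

print(dropout_demo(ones, training=True).numpy())   # some entries are zeroed, the rest are scaled to 2.0
print(dropout_demo(ones, training=False).numpy())  # unchanged when not training
```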
%% Cell type:markdown id: tags:
One may also try to change the training time, which is determined by the number of epochs. During training, the training set is partitioned into batches - introducing another tunable parameter, the batch size - and one epoch corresponds to one full pass through the training set, i.e., with $m$ training examples and batch size $B$, one epoch amounts to $\lceil m/B \rceil$ parameter updates.
We provide some reasonable choices for hyperparameters below and print a summary of the neural network model:
%% Cell type:code id: tags:
``` python
batch_size = 64
epochs = 30
number_of_neurons_per_layer = [512, 256, 128, 64, 32, 18, 8, 4]
dropout_values = [0.1, 0.05, 0.025, 0.0, 0.0, 0.0, 0.0, 0.0]

params = {
    'arch': number_of_neurons_per_layer,
    'batch_size': batch_size,
    'epochs': epochs,
    'dropout': dropout_values
}

model = create_model(params)
model.summary()
```
%% Cell type:markdown id: tags:
Now we fit the model to the data, while computing the mean squared error on the validation set for each epoch:
%% Cell type:code id: tags:
``` python
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=params["epochs"],
                    batch_size=params["batch_size"], verbose=True)
```
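%% Cell type:markdown id: tags:
Training for a fixed number of epochs is the simplest choice. As an optional variation (not used in the rest of this tutorial), one could stop training automatically once the validation loss stops improving. Below is a minimal sketch assuming the TensorFlow implementation of Keras; the patience value is an arbitrary illustrative choice, and the fit call is left commented out so that the model trained above is not modified:
%% Cell type:code id: tags:
``` python
import tensorflow as tf

# Stop when the validation loss has not improved for 5 consecutive epochs,
# and restore the weights of the best epoch seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                                  restore_best_weights=True)

# history = model.fit(X_train, y_train,
#                     validation_data=(X_val, y_val),
#                     epochs=params["epochs"],
#                     batch_size=params["batch_size"],
#                     callbacks=[early_stopping], verbose=True)
```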
%% Cell type:markdown id: tags:
We can visualize the training and validation loss for each epoch:
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt

# Summarize history for loss: a plot of the loss on the training and validation datasets over training epochs
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
```
%% Cell type:markdown id: tags:
In addition, we can compute several performance metrics to evaluate our model:
* Mean absolute error: $\begin{equation*}
MAE = \frac{1}{m} \sum_{i=1}^m |y_i - \widehat{y}_i|,
\end{equation*}$
* Root mean squared error: $\begin{equation*}
RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^m (y_i - \widehat{y}_i)^2},
\end{equation*}$
* Pearson correlation coefficient (0 for no correlation, 1 for perfect positive linear correlation, and -1 for perfect negative linear correlation), see for instance [wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
***Important note:*** We always have to compare these quantities with the statistics of the dataset, i.e., we have to provide at least the range (minimum, maximum) as well as the mean and standard deviation. For example, a model that always predicts the mean of the targets has an RMSE equal to their standard deviation, so a useful model should achieve clearly smaller errors.
We compute the above performance metrics for both the training and the validation set, while also stating the dataset statistics:
%% Cell type:code id: tags:
``` python
def rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true) ** 2).mean())

# Performance metrics on the training set
y_pred_train = model.predict(X_train).flatten()
y_true_train = y_train
mae_train = mean_absolute_error(y_true_train, y_pred_train)
mse_train = rmse(y_pred_train, y_true_train)**2
rmse_train = rmse(y_pred_train, y_true_train)
pearson_train = np.corrcoef(y_pred_train, y_true_train)[0, 1]
print("Training: MAE: {:.3f} | MSE: {:.3f} | RMSE: {:.3f} | Pearson: {:.3f}".format(mae_train, mse_train,
                                                                                    rmse_train, pearson_train))

# Statistics of the training targets
dataset = y_train
min_data = min(dataset)
max_data = max(dataset)
mean_data = np.mean(dataset)
sd_data = np.sqrt(np.var(dataset))
print("Dataset: Min: {:.3f} | Max: {:.3f} | Mean: {:.3f} | Std.dev.: {:.3f}".format(min_data, max_data,
                                                                                    mean_data, sd_data))
print('\n')

# Performance metrics on the validation set
y_pred_val = model.predict(X_val).flatten()
y_true_val = y_val
mae_val = mean_absolute_error(y_true_val, y_pred_val)
mse_val = rmse(y_pred_val, y_true_val)**2
rmse_val = rmse(y_pred_val, y_true_val)
pearson_val = np.corrcoef(y_pred_val, y_true_val)[0, 1]
print("Validation: MAE: {:.3f} | MSE: {:.3f} | RMSE: {:.3f} | Pearson: {:.3f}".format(mae_val, mse_val,
                                                                                      rmse_val, pearson_val))

# Statistics of the validation targets
dataset = y_val
min_data = min(dataset)
max_data = max(dataset)
mean_data = np.mean(dataset)
sd_data = np.sqrt(np.var(dataset))
print("Dataset: Min: {:.3f} | Max: {:.3f} | Mean: {:.3f} | Std.dev.: {:.3f}".format(min_data, max_data,
                                                                                    mean_data, sd_data))
```
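%% Cell type:markdown id: tags:
To put these numbers into perspective, one can compute the same error metrics for a trivial baseline. Below is a minimal sketch (using only variables defined above) for a "model" that always predicts the mean of the training targets; the Pearson coefficient is omitted because it is undefined for a constant prediction:
%% Cell type:code id: tags:
``` python
# Naive baseline: always predict the mean of the training targets
baseline_value = np.mean(y_train)
baseline_pred_val = np.full(len(y_val), baseline_value)

mae_baseline = mean_absolute_error(y_val, baseline_pred_val)
rmse_baseline = rmse(baseline_pred_val, y_val)
print("Mean-predictor baseline (validation): MAE: {:.3f} | RMSE: {:.3f}".format(mae_baseline, rmse_baseline))
```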
%% Cell type:markdown id: tags:
### 2.4 Visualization
Seaborn (https://seaborn.pydata.org/) provides some nice visualization tools and is compatible with the pandas library. For instance, one can visualize the model performance on the validation set by plotting the true vs. the predicted target in a scatter plot, while showing the respective densities as projections:
%% Cell type:code id: tags:
``` python
y_true = y_true_val
y_pred = y_pred_val

# Convert validation predictions to a pandas dataframe
df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})

# Create a scatter plot with histograms for visualizing the density
sns.set(style="white", color_codes=True)
g = sns.jointplot(x='y_true', y='y_pred', data=df, kind="reg")
```
%% Cell type:markdown id: tags:
### 2.5 Questions / Further tasks
* Evaluate the results: What can you infer from the plot showing training/validation MSE vs. epoch number (e.g., is overfitting observed)? What do the individual performance measures tell you - especially in comparison with the dataset statistics?
* Feel free to change hyperparameters by hand to get a feeling for them and possibly to improve model performance (a minimal sketch of such a by-hand search is given after this list). In case training takes too long, reduce the neural-network size and/or the number of epochs.
* Compare the results of this tutorial to the ones in the [original reference](https://www.nature.com/articles/npjcompumats201628?report=reader) (Table 1). If they are worse, try to explain what could be improved (tip: think about how the information on inorganic compounds is encoded in "ElemNet" and compare it to that of the reference. Also have a look at [this documentation](https://scikit-learn.org/stable/modules/cross_validation.html)). Furthermore, even if the performance (as, for instance, indicated by the MSE) is worse, why are neural networks still useful / why do they provide an advantage compared to other "standard" machine learning methods (hint: Figure 6 of the original ElemNet reference and/or google "representation learning")?
* (Optional) Inform yourself about other ways to visualize the data, such as box and violin plots. The seaborn documentation is a useful resource for this. For materials-science-related visualizations, have a look at the figures in [this paper](https://iopscience.iop.org/article/10.1088/2515-7639/ab077b/meta), especially Figs. 3-6.
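As a starting point for such experiments, here is a minimal sketch of a by-hand search over a few candidate architectures. It assumes the `create_model` function and the variables defined above; the specific architectures and the reduced number of epochs are arbitrary choices for illustration, and training several models can take a while:
%% Cell type:code id: tags:
``` python
# Hypothetical candidate architectures, each no longer than the dropout list defined above
candidate_archs = [[256, 128, 64], [512, 256, 128, 64], [128, 64, 32, 16]]

search_results = {}
for arch in candidate_archs:
    trial_params = {'arch': arch,
                    'batch_size': 64,
                    'epochs': 10,
                    'dropout': dropout_values[:len(arch)]}
    trial_model = create_model(trial_params)
    trial_history = trial_model.fit(X_train, y_train,
                                    validation_data=(X_val, y_val),
                                    epochs=trial_params['epochs'],
                                    batch_size=trial_params['batch_size'],
                                    verbose=False)
    # Keep the best (lowest) validation MSE reached during this trial
    search_results[str(arch)] = min(trial_history.history['val_loss'])

for arch, val_mse in search_results.items():
    print("{}: best validation MSE = {:.3f}".format(arch, val_mse))
```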
%% Cell type:markdown id: tags:
### 2.6 Test
Once model optimization is finished, we are ready to investigate model performance on the test set - set the following keyword to True and then run the subsequent cell. Evaluate and interpret your final results.
%% Cell type:code id: tags:
``` python
investigate_test_set = True
```
%% Cell type:code id: tags:
``` python
if investigate_test_set:
    y_pred_test = model.predict(X_test).flatten()
    y_true_test = y_test

    mae_test = mean_absolute_error(y_true_test, y_pred_test)
    mse_test = rmse(y_pred_test, y_true_test)**2
    rmse_test = rmse(y_pred_test, y_true_test)
    pearson_test = np.corrcoef(y_pred_test, y_true_test)[0, 1]
    print("Test: MAE: {:.3f} | MSE: {:.3f} | RMSE: {:.3f} | Pearson: {:.3f}".format(mae_test, mse_test,
                                                                                    rmse_test, pearson_test))

    y_true = y_test
    y_pred = y_pred_test
    # Convert test predictions to a pandas dataframe
    df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    # Create a scatter plot with histograms for visualizing the density
    sns.set(style="white", color_codes=True)
    g = sns.jointplot(x='y_true', y='y_pred', data=df, kind="reg")
```
%% Cell type:markdown id: tags:
## 3. Further reading
%% Cell type:markdown id: tags:
Other representations based on atomic properties can be found in the following two references:
* L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl and M. Scheffler. Big data of materials science: critical role of the descriptor. Physical Review Letters 114, 105503 (2015)
* L. Ward, A. Agrawal, A. Choudhary and C. Wolverton. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials 2, 16028 (2016)

You may also have a look at this review:
* J. Schmidt, M. R. Marques, S. Botti and M. A. Marques. Recent advances and applications of machine learning in solid-state materials science. npj Computational Materials 5, 1–36 (2019)

Furthermore, more information on (advanced) hyperparameter tuning techniques (for instance grid search, random search or Bayesian optimization for tuning the number of layers, neurons, and dropout ratios) can be found in the following references:
* As a start: https://scikit-learn.org/stable/modules/grid_search.html
* More advanced: https://github.com/hyperopt/hyperopt - J. Bergstra, D. Yamins and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, ICML’13, I–115–I–123 (JMLR.org, 2013)
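To make the last reference more concrete, here is a minimal sketch of how a search with hyperopt could look for the model defined in this tutorial. It assumes that the hyperopt package is installed and that `create_model`, `dropout_values`, and the training/validation arrays from above are available; the search space and the small number of evaluations are arbitrary illustrative choices:
%% Cell type:code id: tags:
``` python
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# Hypothetical search space: depth of the network and width of its first layer
space = {
    'n_layers': hp.choice('n_layers', [2, 3, 4]),
    'first_width': hp.choice('first_width', [128, 256, 512]),
}

def objective(sample):
    # Halve the width from layer to layer, starting from the sampled first width
    arch = [max(sample['first_width'] // (2 ** i), 4) for i in range(sample['n_layers'])]
    trial_params = {'arch': arch, 'batch_size': 64, 'epochs': 10,
                    'dropout': dropout_values[:len(arch)]}
    trial_model = create_model(trial_params)
    trial_history = trial_model.fit(X_train, y_train,
                                    validation_data=(X_val, y_val),
                                    epochs=trial_params['epochs'],
                                    batch_size=trial_params['batch_size'],
                                    verbose=False)
    return {'loss': min(trial_history.history['val_loss']), 'status': STATUS_OK}

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=10, trials=trials)
print(best)
```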
%% Cell type:code id: tags:
``` python
```