In this tutorial, the standard architecture for neural networks (multilayer perceptrons or rather fully-connected neural networks) is introduced and applied to a regression task (the prediction of a material property of inorganic compounds). Neural networks for classification are briefly explained as well, while more details on this topic can be found in the tutorial on convolutional neural networks.

After explaining the basic concepts, a fully connected neural network is set up using the python library Keras (https://keras.io/) with the input representation being constructed in the spirit of

D. Jha, L. Ward, A. Paul, W.-k. Liao, A. Choudhary, C. Wolverton and A. Agrawal. Elemnet: Deep learning the

chemistry of materials from only elemental composition. Scientific reports 8, 1 (2018).

The goal is then to predict the volume per atom for inorganic solids from the open quantum materials database (OQMD). Only information on the chemical composition is used (in particular, no structural information). The results are analyzed using typical performance measures such as mean absolute error, mean squared error, root mean square error, and the Pearson correlation coefficient. Visualization techniques and advanced optimization methods are discussed at the end.

%% Cell type:markdown id: tags:

Side remark: The documentation of Keras at https://keras.io/ refers to the newest version of Keras (>2.4), which only supports Tensorflow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the tutorial on convolutional neural networks) is compatible with versions <=2.3 which allows multiple backends (CNTK, Tensorflow, Theano). There are only slight differences in syntax and you can find archived documentations at https://github.com/faroit/keras-docs, e.g., for

version 2.1.5 https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use tensorflow as backend (version <2.0).

%% Cell type:markdown id: tags:

## Load packages

%% Cell type:markdown id: tags:

The following packages are required for this tutorial:

### 1.1 Biological motivation, the perceptron, and typical activation functions

The origin of *artificial neural networks* (ANNs) dates back to the early 1940's. The most simple form of an ANN is the *perceptron*, which was developed by Frank Rosenblatt in 1957 (the interested reader can find the original report [here](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf)) and is biologically motivated (see the simplifying sketch of a biological neuron below).

In a perceptron, the output y is computed by taking the linear combination of the input with weights $w_1, w_2$ and *bias* b, yielding the value z (see right part of below figure) to which a non-linear function f (the *activation function*) is applied at the end.

As a first example, let us consider binary classification, i.e., two classes (marked by circles and crosses in the right part of the figure below), which we try to distinguish using two input features $x_1, x_2$. This could be for instance a medical problem setting, where we want to predict if a patient has a certain disease based on two factors (e.g., age and height). A very simple form of activation function is the Heaviside activation, for which the perceptron yields the following output:

$ y= f(z)=

\begin{cases}

1 \quad \text{if } z = w_1x_1+w_2x_2+b > 0 \\

0 \quad \text{otherwise.}

\end{cases}

$

Here, the two possible values 0 and 1 represent the two distinct classes. Given fixed weights $w_1, w_2$, the bias term $b$ defines when the neuron is firing (i.e., f(z)=1) or not - for given input features $x_1, x_2$ which represent, for instance, a patient.

In general, in supervised learning, we have a set of m training examples, which may be written as tuples consisting of input features (in this example summarized in a two dimensional vector $\mathbf{x}$ with components $x_1$ and $x_2$) and the correct labels $y^{(i)}$, with $i = 1, ..., m$:

Weights and bias term of the model (here: $w_1, w_2, b$) are optimized by minimizing a *loss function* $L(w_1, w_2, b)$ that quantifies how the predicted values $\hat{y}^{(i)}$ deviate from the true values $y^{(i)}$ - as a function of the model parameters. We will see an explicit form for $L(w_1, w_2, b)$ in context of regression in section 2 of this tutorial. Usually gradient decent is used to find the parameters minimizing $L$ (with modifications of gradient decent enabling for instance faster convergence - see chapter 5 and 8 of [this](https://www.deeplearningbook.org/) standard reference). After finishing optimization, in case of classification, model performance can be assessed via the classification accuracy (# of correct predictions divided by # of total predictions).

We will explain the optimization procedure in more detail in section 2 of the tutorial. In this example, one can think of the training / optimization phase as changing the model parameters such that the optimal position of a straight line (see above figure, right) is found, which serves as a decision boundary between the two classes.

%% Cell type:markdown id: tags:

To conclude this section on perceptrons, note that the Heaviside activation function is not used in modern deep learning applications but rather one of the following ones:

The ReLU activation function is most frequently used. Note that the use of non-linear functions is essential: if no activation function would be used, i.e., the identity - also called *linear activation function*- the class of possible functions that the model can represent would be drastically reduced.

Extending the idea of simple perceptrons, one can construct multilayer perceptrons as a sequence of layers.

Each layer consists of a predefined number of neurons, where the neurons of the first layer (the *input layer*) correspond to the input features $\mathbf{x} = (x_1, x_2, ...)$. The subsequent layers are called *hidden layers*. The individual neurons in each hidden layer are a linear combination of neurons from the previous layer. For instance, the *activation value* $a_1$ highlighted in the figure below is computed the following way:

where an activation function $f$ is applied to yield the final activation value. This process, which is called *forward propagation*, is repeated for all neurons and layers until the output $\textbf{o} = (o_1, o_2, ...)$ in the final *output layer* is obtained. Already at this level one of the main characteristics of (multi-layer) neural networks becomes clear - namely, the ability to learn complex features from the initial input. This ability to learn new representations that become more abstract the deeper the network is, i.e., the more hidden layers it has, distinguishes neural networks from other, "standard" machine learning algorithms (e.g., decision trees).

Note that the architecture shown in the figure below has only one hidden layer (a *shallow neural network*), while modern deep-learning models can reach more than 50 layers (see for instance [here](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)).

More formally, given vectorial input $\mathbf{x} = (x_1, x_2, ...)^T$, the

activations $\textbf{a} = (a_1, a_2, ...)^T$ of the next layer are obtained via the transformation

$\begin{equation*}

\mathbf{a} = f ( A\mathbf{x} + \mathbf{b} ),

\end{equation*}$

which is essentially an affine transformation (defined by a matrix A and vector b) followed by element-wise application of the (non-linear) activation function f. The weights of the linear combinations are collected in the matrix A and the offsets in the *bias vector* $\mathbf{b} = (b_1, b_2, ...)$. The output activations $\mathbf{o}$ are obtained by applying a further affine transformation (matrix A$^\prime$, bias $b^\prime$) and activation function $f^\prime$:

The final activation function $f^\prime$ is chosen in a specific way, usually depending on the task being either regression or classification - we will come back to this later.

To simplify the above expression for $\mathbf{o}$, it is common to change the definition of input vector and weight matrices such that the bias terms can be omitted. To illustrate this, we consider the simplified case of two input features and two activations $a_1 = w_{11}x_1 + w_{12}x_2 + b_1$ and $a_2 = w_{21}x_1 + w_{22}x_2 + b_2$. Then, A and b are defined as

$\begin{equation*}

A = \begin{bmatrix}w_{11} & w_{12}\\w_{21} & w_{22}\end{bmatrix}, b = \begin{bmatrix}b_1 \\ b_2 \end{bmatrix}.

This equation makes it more clear that a multilayer perceptron is a concatenation of affine and non-linear transformations, parametrized by a set of weights and biases which are optimized to fit the task at hand. This equation can be generalized to an arbitrary number of layers $N$:

The larger and deeper the network, the more computationally costly it becomes to calculate the gradient of the loss function (which is needed to optimize the parameters via gradient descent). A key invention is the backpropagation algorithm (cf. the [original publication]([https://www.nature.com/articles/323533a0])), which is an efficient way to calculate the gradient of the loss function with respect to the neural-network parameters.

Note that if we would use only linear (i.e., identity) activation functions, the output would essentially be just a linear combination of the input. Using non-linear activation functions (nowadays mostly the ReLU activation function)

enriches the function space that can be represented. In particular, the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any continuous function - while architecture specifics and generalization guarantees are not provided by this theorem. Due to the use of non-linearities, optimizing a deep neural network is a non-convex optimization problem with multiple local minima. Techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) allow to avoid getting stuck in non-optimal local minima.

Coming now to the choice of activation function in the final layer, the softmax activation function is the usual choice for classification tasks, yielding the following expression for the $j$th component

This way, the output vector components are normalized to the range $[0, 1]$ and sum up to one, i.e., the components can be interpreted as probabilities.

%% Cell type:markdown id: tags:

To illustrate the usefulness of softmax activation functions, let us consider the case of crystal-structure classification. The task is to assign the correct (symmetry) label to a given, unknown crystal structure as defined by atomic positions and chemical species. For instance, possible assignments could be face-centered-cubic, body-centered-cubic, diamond or hexagonal closed packed - a collection of structures that covers more than 80% of the elemental solids. More thorough explanations on deep learning applied to crystal-structure classification can be found [here](https://www.nature.com/articles/s41467-018-05169-6). When applying the multilayer perceptron architecture that we introduced above, each of the four output neurons correspond to a specific crystal structure. The use of the softmax activation function guarantees that all output activations sum to one, which is why the output vector $\mathbf{o}$ can be considered as a vector of classification probabilities. For instance, if $\mathbf{o} = (1, 0, 0, 0)$, the input structure is predicted to have fcc symmetry with 100\% probability (see figure below). This is also called "one-hot-encoding" and corresponds to representing a given number of classes N in the standard basis in $\mathbb{R}^\text{N}$, i.e., by N vectors $e_i = (0, ...0, 1, 0, ..., 0)$, for $i=1, ..., N$ and all components of $e_i$ being zero except for the $i$th entry.

For regression, the above multi-layer network architecture can easily be adapted - by using only one output neuron in the final layer together with a *linear* activation function (i.e., the identity function) to predict a specific target property $\text{P}\in\mathbb{R}$. A similar architecture (with additional hidden layers) will be used in this tutorial and the target property will be a specific property of inorganic compounds.

## 2. Neural network regression example - "ElemNet"

%% Cell type:markdown id: tags:

In the following, we will consider a specific application of multilayer perceptrons. The idea of this section is that you first read through / run the cells and then return to them later (in particular, changing the neural-network parameter settings) when answering the questions in section 2.5.

%% Cell type:markdown id: tags:

"ElemNet" is a deep-learning approach to predict material properties using information on the chemical composition only. In particular, no structural information is used, i.e., the atomic positions and the corresponding symmetries are not considered (which is frequently employed in other machine and deep learning projects in materials science). More information on ElemNet can be found in the following references:

* Dipendra Jha, Logan Ward, Arindam Paul, Wei-keng Liao, Alok Choudhary, Chris Wolverton, and Ankit Agrawal, “ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition,” Scientific Reports, 8, Article number: 17593 (2018) [DOI:10.1038/s41598-018-35934-y]

* Dipendra Jha, Kamal Choudhary, Francesca Tavazza, Wei-keng Liao, Alok Choudhary, Carelyn Campbell, Ankit Agrawal, "Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning," Nature Communications, 10, Article number: 5316 (2019) [DOI: https:10.1038/s41467-019-13297-w]

* https://github.com/NU-CUCIS/ElemNet

%% Cell type:markdown id: tags:

The deep-learning model behind "ElemNet" is essentially a multilayer perceptron with the input vector being chosen in a very specific way:

Each compound is represented by a feature vector $\mathbf{f}$ of fixed length, whose components correspond to the elements of the periodic table. They are sorted according to the atomic number Z in ascending order (i.e., the first component of $\mathbf{f}$ corresponds to hydrogen, the second to Helium etc.).

For instance, given a binary compound $\text{A}_x \text{B}_y$ with $x+y=1$, all entries of $\mathbf{f}$ are zero except those corresponding to element A and B. For these entries, the relative stoichiometric attributes x and y are assigned. In case of NaCl (rock salt), the representation would be

where only the two entries corresponding to Na and Cl are assigned non-zero values (0.5 in this case).

%% Cell type:markdown id: tags:

### 2.1 The task and dataset creation

Given only the chemical composition of an inorganic compound, the goal is to predict the volume per atom, which is the total volume of the unit cell divided by the number of atoms in the unit cell. This property provides an average characterization of the atomic arrangement and therefore becomes interesting in context of crystal-structure prediction (we refer the interested reader to [this paper](https://www.nature.com/articles/s41578-019-0101-8) for a review on this topic). Note, however, that this quantity is only an average characterization of the unit cell geometry and thus further properties have to be known (for instance total volume, space group) to get a geometrical understanding of the crystal that is sufficient to actually predict the atomic arrangement.

%% Cell type:markdown id: tags:

This target property has been predicted using several other machine learning algorithms before (specifically, decision trees) in the following reference:

L. Ward, A. Agrawal, A. Choudhary and C. Wolverton. A general-purpose machine learning framework for predicting

properties of inorganic materials. npj Computational Materials 2, 16028 (2016).

While in this reference (and also the "ElemNet"-related works listed above) energetic targets (such as formation energy) were the main focus, we will use "ElemNet" to predict the volume per atom.

For this purpose, we need to create the dataset in the first step. Information on stoichiometry and volume per atom can be extracted from the dataset published in Ward et al. (2016) (which is based on the open quantum materials database (OQMD), see the [original paper](https://link.springer.com/article/10.1007/s11837-013-0755-4) and the [documentation](http://oqmd.org/)). This step has already been done and we provide the prepared dataset. We also eliminated some data points with very large volumes (>100$Å^3$ per atom). Furthermore, we computed the input representation required by "ElemNet". This is a common situation: the data is available in some format, but additional preprocessing (data cleaning, feature engineering) is necessary.

Let us first load the dataset containing information on the chemical composition (in particular the stoichiometry), for which we use pandas (https://pandas.pydata.org/, a python library providing convenient ways to store, load and process data):

Now we can have a look at the statistics of this dataset, in particular the distribution of the target property:

%% Cell type:code id: tags:

``` python

df.hist()

plt.show()

print('\nStatistics of the target property:\n')

print(df['vol_per_atom'].describe())

```

%% Cell type:markdown id: tags:

We already see at this point that most of the compounds have three chemical species. To provide more insight, the code below will print out the total number of data points, what kind of materials are actually appearing (specifically the amount of species being present) and furthermore, which elements of the periodic table occur and how often (which is visualized in a bar plot):

%% Cell type:code id: tags:

``` python

print("Total number of datapoints: {}\n".format(len(y_vol_per_atom)))

fromcollectionsimportCounter

dict_elements=dict(Counter(number_of_elements))

forkeyindict_elements:

print("Compounds with {} element(s) appear {} times in the dataset".format(key,dict_elements[key]))

From the bar plot, we see that most of the materials contain oxygen. This reflects the bias of researchers towards investigating specific systems (i.e., those that are well-known or interesting for a specific application).

%% Cell type:markdown id: tags:

We now have the target property ready, but also need the input feature vectors in "ElemNet"-style. This is also prepared and saved in a numpy array, which we load in the subsequent cell:

Note that elements that do not appear in the dataset are not included in the feature vector (since otherwise some input neurons would be zero for every data point, thus making them useless and even worse, they might disturb the learning process).

Now the dataset is ready for training. For completeness, let us print the shapes of input features and targets:

The first step is to split the data into training and test set, the latter not being touched during optimization and only at the end for the final model test. Note that it is essential to report split ratio and random state to enable reproducibility:

Another typical preprocessing step is to scale the input features (for instance using mean and standard deviation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

This can improve model performance, while for the moment it is not used - still, the interested reader can experiment with this preprocessing procedure by uncommenting the following code block:

%% Cell type:code id: tags:

``` python

"""

scaler = StandardScaler()

X = scaler.fit_transform(X)

X_test = scaler.transform(X_test)

"""

```

%% Cell type:markdown id: tags:

Now we perform an additional 80/20 split and use the 80% portion for training the model and the 20% portion for validation:

%% Cell type:code id: tags:

``` python

X_train,X_val,y_train,y_val=train_test_split(X,y,

test_size=split_ratio,random_state=RANDOM_STATE)

```

%% Cell type:markdown id: tags:

If we are not satisfied with the validation performance, we can take a step back, change the hyperparameters (such as the number of layers or number of neurons in each layer), and restart the training process. This way, we can optimize the generalization ability of the model, before the final evaluation on the test set. Ideally, one should also consider different splits (for instance via *cross-validation*, see capter 7.10 of this [book](https://link.springer.com/book/10.1007/978-0-387-84858-7)), which would be too time-consuming for this tutorial. The code provided in this tutorial is only for hand-tuning the hyperparameters while references on more advanced methods can be found in section 3 (e.g., Bayesian optimization).

%% Cell type:markdown id: tags:

### 2.3 Model definition and training phase

%% Cell type:markdown id: tags:

We use the python library Keras to create the multilayer perceptron architecture (the online documentation is an excellent resource for more details - please refer to the side remark at the beginning of the tutorial for the URLs). The function defined below will generate a model object from a dictionary "params", which will be defined later and contains specifications about the model architecture:

%% Cell type:code id: tags:

``` python

n_feat=X_train.shape[1]

defcreate_model(params):

x_input=Input(shape=(n_feat,),name="x_input")

dropout_values=params["dropout"]

fori,n_nodesinenumerate(params["arch"]):

ifi==0:

x=Dense(n_nodes,activation="relu")(x_input)

else:

x=Dense(n_nodes,activation="relu")(x)

dropout=dropout_values[i]

x=Dropout(dropout)(x)

output=Dense(1)(x)

model=Model(x_input,output)

model.compile(optimizer="adam",loss=["mse"])

returnmodel

```

%% Cell type:markdown id: tags:

In this function, a particular loss function is chosen (see "model.compile(...)"), the mean squared error (MSE):

$\begin{equation*}

MSE = \sum_{i=1}^m (y_i - \widehat{y}_i)^2,

\end{equation*}$

where m is the number of training examples, $y_i$ denotes the $i$th ground truth and $\widehat{y}_i$ the $i$th predicted target.

Several hyperparameters can be optimized, most notably the number of layers, the number of neurons in each layer, and regularization parameters. Let us discuss the choice of regularization in more detail in the following:

To have a valuable machine learning model, it is of paramount importance to make sure that the model generalizes well to unseen samples.

Goodfellow *et.al.* (cf. chapter [5.2.2](https://www.deeplearningbook.org/contents/ml.html)) define regularization as *any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.* In practice, additional terms are added to the training optimization objective to prevent overfitting or help the optimization.

There are numerous ways to regularize a neural network; we refer the interested reader to Chapter [7](https://www.deeplearningbook.org/contents/regularization.html).

Here, we use **dropout layers** to regularize our neural network. To explain in a few word what dropout is, we report below the abstract from the article that introduced dropout (Srivastava *et al.*, J. Mach. Learn. Res. 15 1929 (2014)):

*Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.*

*The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.*

*This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.*

For the full article "Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting", please visit http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf.

%% Cell type:markdown id: tags:

One may also try to change the training time, which is determined by the number of epochs. Specifically, during one epoch, the training set is partitioned into batches - introducing another tunable parameter, the batch size (cf. chapter [8.1.3](https://www.deeplearningbook.org/) or [this blog](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/)) - and after one epoch a full pass through the training set is completed.

We provide some reasonable choices for hyperparameters below and print a summary of the neural network model:

* Pearson correlation coefficient (0 for no correlation, 1 for positive linear correlation, and -1 for negative linear correlation), see for instance [wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).

***Important note:*** We always have to compare these quantities with the statistics of the dataset, i.e., we have to provide at least the range (minimum, maximum) as well as mean and standard deviation.

We compute the above performance metrics for both training and validation set, while also stating the dataset statistics:

Seaborn (https://seaborn.pydata.org/) provides some nice visualization tools and is compatible with the pandas library. For instance, one can visualize the model performance on the validation set by plotting true vs predicted target in a scatter plot, while showing the respective densities as projections:

%% Cell type:code id: tags:

``` python

y_true=y_true_val

y_pred=y_pred_val

#Convert validation predictions to pandas dataframe

* Evaluate the results: What can you infer from the plot showing training/validation MSE vs. epoch number (e.g., is overfitting observed)? What do the individual performance measures tell you - especially in comparison with the dataset statistics?

* Feel free to change hyperparameters by hand to get a feeling for them and possibly to improve model performance. In case training takes too long, reduce the neural-network size and/or the number of epochs.

* Compare the results of this tutorial to the ones in the [original reference](https://www.nature.com/articles/npjcompumats201628?report=reader)(Table 1). If they are worse, try to explain what could be improved (tip: think about how the information on inorganic compounds is encoded in "ElemNet" and compare it to the one of the reference. Also have a look at [this documentation](https://scikit-learn.org/stable/modules/cross_validation.html)). Furthermore, even if the performance (as, for instance, indicated by MSE) is worse, why are neural networks still useful / why do they provide an advantage compared to other "standard" machine learning methods (hint: Figure 6 of the original ElemNet reference and/or google "Representation learning")?

* (Optional) Inform yourself about other ways to visualize the data such as box and violin plots. The seaborn documentation is a useful resource for this. For materials science related visualizations, have a look at the figures in [this paper](https://iopscience.iop.org/article/10.1088/2515-7639/ab077b/meta), especially Fig. 3-6.

%% Cell type:markdown id: tags:

### 2.6 Test

Once model optimization is finished, we are ready to investigate model performance on the test set - set the following keyword to True and then run the subsequent cell. Evaluate and interpret your final results.

Furthermore, more information on (advanced) hyperparameter tuning techniques (for instance grid search, random search or Bayesian optimization for tuning the number of layers, neurons, and dropout ratios) can be found in the following references:

* As a start: https://scikit-learn.org/stable/modules/grid_search.html

* More advanced: https://github.com/hyperopt/hyperopt - Bergstra, J., Yamins, D. & Cox, D. D. Making a science of model search: Hyperparameter optimization in hundreds of

dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on

In this tutorial, the standard architecture for neural networks (multilayer perceptrons or rather fully-connected neural networks) is introduced and applied to a regression task (the prediction of a material property of inorganic compounds). Neural networks for classification are briefly explained as well, while more details on this topic can be found in the tutorial on convolutional neural networks.

After explaining the basic concepts, a fully connected neural network is set up using the python library Keras (https://keras.io/) with the input representation being constructed in the spirit of

D. Jha, L. Ward, A. Paul, W.-k. Liao, A. Choudhary, C. Wolverton and A. Agrawal. Elemnet: Deep learning the

chemistry of materials from only elemental composition. Scientific reports 8, 1 (2018).

The goal is then to predict the volume per atom for inorganic solids from the open quantum materials database (OQMD). Only information on the chemical composition is used (in particular, no structural information). The results are analyzed using typical performance measures such as mean absolute error, mean squared error, root mean square error, and the Pearson correlation coefficient. Visualization techniques and advanced optimization methods are discussed at the end.

%% Cell type:markdown id: tags:

Side remark: The documentation of Keras at https://keras.io/ refers to the newest version of Keras (>2.4), which only supports Tensorflow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the tutorial on convolutional neural networks) is compatible with versions <=2.3 which allows multiple backends (CNTK, Tensorflow, Theano). There are only slight differences in syntax and you can find archived documentations at https://github.com/faroit/keras-docs, e.g., for

version 2.1.5 https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use tensorflow as backend (version <2.0).

%% Cell type:markdown id: tags:

## Load packages

%% Cell type:markdown id: tags:

The following packages are required for this tutorial:

### 1.1 Biological motivation, the perceptron, and typical activation functions

The origin of *artificial neural networks* (ANNs) dates back to the early 1940's. The most simple form of an ANN is the *perceptron*, which was developed by Frank Rosenblatt in 1957 (the interested reader can find the original report [here](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf)) and is biologically motivated (see the simplifying sketch of a biological neuron below).

In a perceptron, the output y is computed by taking the linear combination of the input with weights $w_1, w_2$ and *bias* b, yielding the value z (see right part of below figure) to which a non-linear function f (the *activation function*) is applied at the end.

As a first example, let us consider binary classification, i.e., two classes (marked by circles and crosses in the right part of the figure below), which we try to distinguish using two input features $x_1, x_2$. This could be for instance a medical problem setting, where we want to predict if a patient has a certain disease based on two factors (e.g., age and height). A very simple form of activation function is the Heaviside activation, for which the perceptron yields the following output:

$ y= f(z)=

\begin{cases}

1 \quad \text{if } z = w_1x_1+w_2x_2+b > 0 \\

0 \quad \text{otherwise.}

\end{cases}

$

Here, the two possible values 0 and 1 represent the two distinct classes. Given fixed weights $w_1, w_2$, the bias term $b$ defines when the neuron is firing (i.e., f(z)=1) or not - for given input features $x_1, x_2$ which represent, for instance, a patient.

In general, in supervised learning, we have a set of m training examples, which may be written as tuples consisting of input features (in this example summarized in a two dimensional vector $\mathbf{x}$ with components $x_1$ and $x_2$) and the correct labels $y^{(i)}$, with $i = 1, ..., m$:

Weights and bias term of the model (here: $w_1, w_2, b$) are optimized by minimizing a *loss function* $L(w_1, w_2, b)$ that quantifies how the predicted values $\hat{y}^{(i)}$ deviate from the true values $y^{(i)}$ - as a function of the model parameters. We will see an explicit form for $L(w_1, w_2, b)$ in context of regression in section 2 of this tutorial. Usually gradient decent is used to find the parameters minimizing $L$ (with modifications of gradient decent enabling for instance faster convergence - see chapter 5 and 8 of [this](https://www.deeplearningbook.org/) standard reference). After finishing optimization, in case of classification, model performance can be assessed via the classification accuracy (# of correct predictions divided by # of total predictions).

We will explain the optimization procedure in more detail in section 2 of the tutorial. In this example, one can think of the training / optimization phase as changing the model parameters such that the optimal position of a straight line (see above figure, right) is found, which serves as a decision boundary between the two classes.

%% Cell type:markdown id: tags:

To conclude this section on perceptrons, note that the Heaviside activation function is not used in modern deep learning applications but rather one of the following ones:

The ReLU activation function is most frequently used. Note that the use of non-linear functions is essential: if no activation function would be used, i.e., the identity - also called *linear activation function*- the class of possible functions that the model can represent would be drastically reduced.

Extending the idea of simple perceptrons, one can construct multilayer perceptrons as a sequence of layers.

Each layer consists of a predefined number of neurons, where the neurons of the first layer (the *input layer*) correspond to the input features $\mathbf{x} = (x_1, x_2, ...)$. The subsequent layers are called *hidden layers*. The individual neurons in each hidden layer are a linear combination of neurons from the previous layer. For instance, the *activation value* $a_1$ highlighted in the figure below is computed the following way:

where an activation function $f$ is applied to yield the final activation value. This process, which is called *forward propagation*, is repeated for all neurons and layers until the output $\textbf{o} = (o_1, o_2, ...)$ in the final *output layer* is obtained. Already at this level one of the main characteristics of (multi-layer) neural networks becomes clear - namely, the ability to learn complex features from the initial input. This ability to learn new representations that become more abstract the deeper the network is, i.e., the more hidden layers it has, distinguishes neural networks from other, "standard" machine learning algorithms (e.g., decision trees).

Note that the architecture shown in the figure below has only one hidden layer (a *shallow neural network*), while modern deep-learning models can reach more than 50 layers (see for instance [here](http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)).

More formally, given vectorial input $\mathbf{x} = (x_1, x_2, ...)^T$, the

activations $\textbf{a} = (a_1, a_2, ...)^T$ of the next layer are obtained via the transformation

$\begin{equation*}

\mathbf{a} = f ( A\mathbf{x} + \mathbf{b} ),

\end{equation*}$

which is essentially an affine transformation (defined by a matrix A and vector b) followed by element-wise application of the (non-linear) activation function f. The weights of the linear combinations are collected in the matrix A and the offsets in the *bias vector* $\mathbf{b} = (b_1, b_2, ...)$. The output activations $\mathbf{o}$ are obtained by applying a further affine transformation (matrix A$^\prime$, bias $b^\prime$) and activation function $f^\prime$:

The final activation function $f^\prime$ is chosen in a specific way, usually depending on the task being either regression or classification - we will come back to this later.

To simplify the above expression for $\mathbf{o}$, it is common to change the definition of input vector and weight matrices such that the bias terms can be omitted. To illustrate this, we consider the simplified case of two input features and two activations $a_1 = w_{11}x_1 + w_{12}x_2 + b_1$ and $a_2 = w_{21}x_1 + w_{22}x_2 + b_2$. Then, A and b are defined as

$\begin{equation*}

A = \begin{bmatrix}w_{11} & w_{12}\\w_{21} & w_{22}\end{bmatrix}, b = \begin{bmatrix}b_1 \\ b_2 \end{bmatrix}.

This equation makes it more clear that a multilayer perceptron is a concatenation of affine and non-linear transformations, parametrized by a set of weights and biases which are optimized to fit the task at hand. This equation can be generalized to an arbitrary number of layers $N$:

The larger and deeper the network, the more computationally costly it becomes to calculate the gradient of the loss function (which is needed to optimize the parameters via gradient descent). A key invention is the backpropagation algorithm (cf. the [original publication]([https://www.nature.com/articles/323533a0])), which is an efficient way to calculate the gradient of the loss function with respect to the neural-network parameters.

Note that if we would use only linear (i.e., identity) activation functions, the output would essentially be just a linear combination of the input. Using non-linear activation functions (nowadays mostly the ReLU activation function)

enriches the function space that can be represented. In particular, the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any continuous function - while architecture specifics and generalization guarantees are not provided by this theorem. Due to the use of non-linearities, optimizing a deep neural network is a non-convex optimization problem with multiple local minima. Techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) allow to avoid getting stuck in non-optimal local minima.

Coming now to the choice of activation function in the final layer, the softmax activation function is the usual choice for classification tasks, yielding the following expression for the $j$th component

This way, the output vector components are normalized to the range $[0, 1]$ and sum up to one, i.e., the components can be interpreted as probabilities.

%% Cell type:markdown id: tags:

To illustrate the usefulness of softmax activation functions, let us consider the case of crystal-structure classification. The task is to assign the correct (symmetry) label to a given, unknown crystal structure as defined by atomic positions and chemical species. For instance, possible assignments could be face-centered-cubic, body-centered-cubic, diamond or hexagonal closed packed - a collection of structures that covers more than 80% of the elemental solids. More thorough explanations on deep learning applied to crystal-structure classification can be found [here](https://www.nature.com/articles/s41467-018-05169-6). When applying the multilayer perceptron architecture that we introduced above, each of the four output neurons correspond to a specific crystal structure. The use of the softmax activation function guarantees that all output activations sum to one, which is why the output vector $\mathbf{o}$ can be considered as a vector of classification probabilities. For instance, if $\mathbf{o} = (1, 0, 0, 0)$, the input structure is predicted to have fcc symmetry with 100\% probability (see figure below). This is also called "one-hot-encoding" and corresponds to representing a given number of classes N in the standard basis in $\mathbb{R}^\text{N}$, i.e., by N vectors $e_i = (0, ...0, 1, 0, ..., 0)$, for $i=1, ..., N$ and all components of $e_i$ being zero except for the $i$th entry.

For regression, the above multi-layer network architecture can easily be adapted - by using only one output neuron in the final layer together with a *linear* activation function (i.e., the identity function) to predict a specific target property $\text{P}\in\mathbb{R}$. A similar architecture (with additional hidden layers) will be used in this tutorial and the target property will be a specific property of inorganic compounds.

## 2. Neural network regression example - "ElemNet"

%% Cell type:markdown id: tags:

In the following, we will consider a specific application of multilayer perceptrons. The idea of this section is that you first read through / run the cells and then return to them later (in particular, changing the neural-network parameter settings) when answering the questions in section 2.5.

%% Cell type:markdown id: tags:

"ElemNet" is a deep-learning approach to predict material properties using information on the chemical composition only. In particular, no structural information is used, i.e., the atomic positions and the corresponding symmetries are not considered (which is frequently employed in other machine and deep learning projects in materials science). More information on ElemNet can be found in the following references:

* Dipendra Jha, Logan Ward, Arindam Paul, Wei-keng Liao, Alok Choudhary, Chris Wolverton, and Ankit Agrawal, “ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition,” Scientific Reports, 8, Article number: 17593 (2018) [DOI:10.1038/s41598-018-35934-y]

* Dipendra Jha, Kamal Choudhary, Francesca Tavazza, Wei-keng Liao, Alok Choudhary, Carelyn Campbell, Ankit Agrawal, "Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning," Nature Communications, 10, Article number: 5316 (2019) [DOI: https:10.1038/s41467-019-13297-w]

* https://github.com/NU-CUCIS/ElemNet

%% Cell type:markdown id: tags:

The deep-learning model behind "ElemNet" is essentially a multilayer perceptron with the input vector being chosen in a very specific way:

Each compound is represented by a feature vector $\mathbf{f}$ of fixed length, whose components correspond to the elements of the periodic table. They are sorted according to the atomic number Z in ascending order (i.e., the first component of $\mathbf{f}$ corresponds to hydrogen, the second to Helium etc.).

For instance, given a binary compound $\text{A}_x \text{B}_y$ with $x+y=1$, all entries of $\mathbf{f}$ are zero except those corresponding to element A and B. For these entries, the relative stoichiometric attributes x and y are assigned. In case of NaCl (rock salt), the representation would be

where only the two entries corresponding to Na and Cl are assigned non-zero values (0.5 in this case).

%% Cell type:markdown id: tags:

### 2.1 The task and dataset creation

Given only the chemical composition of an inorganic compound, the goal is to predict the volume per atom, which is the total volume of the unit cell divided by the number of atoms in the unit cell. This property provides an average characterization of the atomic arrangement and therefore becomes interesting in context of crystal-structure prediction (we refer the interested reader to [this paper](https://www.nature.com/articles/s41578-019-0101-8) for a review on this topic). Note, however, that this quantity is only an average characterization of the unit cell geometry and thus further properties have to be known (for instance total volume, space group) to get a geometrical understanding of the crystal that is sufficient to actually predict the atomic arrangement.

%% Cell type:markdown id: tags:

This target property has been predicted using several other machine learning algorithms before (specifically, decision trees) in the following reference:

L. Ward, A. Agrawal, A. Choudhary and C. Wolverton. A general-purpose machine learning framework for predicting

properties of inorganic materials. npj Computational Materials 2, 16028 (2016).

While in this reference (and also the "ElemNet"-related works listed above) energetic targets (such as formation energy) were the main focus, we will use "ElemNet" to predict the volume per atom.

For this purpose, we need to create the dataset in the first step. Information on stoichiometry and volume per atom can be extracted from the dataset published in Ward et al. (2016) (which is based on the open quantum materials database (OQMD), see the [original paper](https://link.springer.com/article/10.1007/s11837-013-0755-4) and the [documentation](http://oqmd.org/)). This step has already been done and we provide the prepared dataset. We also eliminated some data points with very large volumes (>100$Å^3$ per atom). Furthermore, we computed the input representation required by "ElemNet". This is a common situation: the data is available in some format, but additional preprocessing (data cleaning, feature engineering) is necessary.

Let us first load the dataset containing information on the chemical composition (in particular the stoichiometry), for which we use pandas (https://pandas.pydata.org/, a python library providing convenient ways to store, load and process data):

Now we can have a look at the statistics of this dataset, in particular the distribution of the target property:

%% Cell type:code id: tags:

``` python

df.hist()

plt.show()

print('\nStatistics of the target property:\n')

print(df['vol_per_atom'].describe())

```

%% Cell type:markdown id: tags:

We already see at this point that most of the compounds have three chemical species. To provide more insight, the code below will print out the total number of data points, what kind of materials are actually appearing (specifically the amount of species being present) and furthermore, which elements of the periodic table occur and how often (which is visualized in a bar plot):

%% Cell type:code id: tags:

``` python

print("Total number of datapoints: {}\n".format(len(y_vol_per_atom)))

fromcollectionsimportCounter

dict_elements=dict(Counter(number_of_elements))

forkeyindict_elements:

print("Compounds with {} element(s) appear {} times in the dataset".format(key,dict_elements[key]))

From the bar plot, we see that most of the materials contain oxygen. This reflects the bias of researchers towards investigating specific systems (i.e., those that are well-known or interesting for a specific application).

%% Cell type:markdown id: tags:

We now have the target property ready, but also need the input feature vectors in "ElemNet"-style. This is also prepared and saved in a numpy array, which we load in the subsequent cell:

Note that elements that do not appear in the dataset are not included in the feature vector (since otherwise some input neurons would be zero for every data point, thus making them useless and even worse, they might disturb the learning process).

Now the dataset is ready for training. For completeness, let us print the shapes of input features and targets:

The first step is to split the data into training and test set, the latter not being touched during optimization and only at the end for the final model test. Note that it is essential to report split ratio and random state to enable reproducibility:

Another typical preprocessing step is to scale the input features (for instance using mean and standard deviation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

This can improve model performance, while for the moment it is not used - still, the interested reader can experiment with this preprocessing procedure by uncommenting the following code block:

%% Cell type:code id: tags:

``` python

"""

scaler = StandardScaler()

X = scaler.fit_transform(X)

X_test = scaler.transform(X_test)

"""

```

%% Cell type:markdown id: tags:

Now we perform an additional 80/20 split and use the 80% portion for training the model and the 20% portion for validation:

%% Cell type:code id: tags:

``` python

X_train,X_val,y_train,y_val=train_test_split(X,y,

test_size=split_ratio,random_state=RANDOM_STATE)

```

%% Cell type:markdown id: tags:

If we are not satisfied with the validation performance, we can take a step back, change the hyperparameters (such as the number of layers or number of neurons in each layer), and restart the training process. This way, we can optimize the generalization ability of the model, before the final evaluation on the test set. Ideally, one should also consider different splits (for instance via *cross-validation*, see capter 7.10 of this [book](https://link.springer.com/book/10.1007/978-0-387-84858-7)), which would be too time-consuming for this tutorial. The code provided in this tutorial is only for hand-tuning the hyperparameters while references on more advanced methods can be found in section 3 (e.g., Bayesian optimization).

%% Cell type:markdown id: tags:

### 2.3 Model definition and training phase

%% Cell type:markdown id: tags:

We use the python library Keras to create the multilayer perceptron architecture (the online documentation is an excellent resource for more details - please refer to the side remark at the beginning of the tutorial for the URLs). The function defined below will generate a model object from a dictionary "params", which will be defined later and contains specifications about the model architecture:

%% Cell type:code id: tags:

``` python

n_feat=X_train.shape[1]

defcreate_model(params):

x_input=Input(shape=(n_feat,),name="x_input")

dropout_values=params["dropout"]

fori,n_nodesinenumerate(params["arch"]):

ifi==0:

x=Dense(n_nodes,activation="relu")(x_input)

else:

x=Dense(n_nodes,activation="relu")(x)

dropout=dropout_values[i]

x=Dropout(dropout)(x)

output=Dense(1)(x)

model=Model(x_input,output)

model.compile(optimizer="adam",loss=["mse"])

returnmodel

```

%% Cell type:markdown id: tags:

In this function, a particular loss function is chosen (see "model.compile(...)"), the mean squared error (MSE):

$\begin{equation*}

MSE = \sum_{i=1}^m (y_i - \widehat{y}_i)^2,

\end{equation*}$

where m is the number of training examples, $y_i$ denotes the $i$th ground truth and $\widehat{y}_i$ the $i$th predicted target.

Several hyperparameters can be optimized, most notably the number of layers, the number of neurons in each layer, and regularization parameters. Let us discuss the choice of regularization in more detail in the following:

To have a valuable machine learning model, it is of paramount importance to make sure that the model generalizes well to unseen samples.

Goodfellow *et.al.* (cf. chapter [5.2.2](https://www.deeplearningbook.org/contents/ml.html)) define regularization as *any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.* In practice, additional terms are added to the training optimization objective to prevent overfitting or help the optimization.

There are numerous ways to regularize a neural network; we refer the interested reader to Chapter [7](https://www.deeplearningbook.org/contents/regularization.html).

Here, we use **dropout layers** to regularize our neural network. To explain in a few word what dropout is, we report below the abstract from the article that introduced dropout (Srivastava *et al.*, J. Mach. Learn. Res. 15 1929 (2014)):

*Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.*

*The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.*

*This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.*

For the full article "Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting", please visit http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf.

%% Cell type:markdown id: tags:

One may also try to change the training time, which is determined by the number of epochs. Specifically, during one epoch, the training set is partitioned into batches - introducing another tunable parameter, the batch size (cf. chapter [8.1.3](https://www.deeplearningbook.org/) or [this blog](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/)) - and after one epoch a full pass through the training set is completed.

We provide some reasonable choices for hyperparameters below and print a summary of the neural network model:

* Pearson correlation coefficient (0 for no correlation, 1 for positive linear correlation, and -1 for negative linear correlation), see for instance [wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).

***Important note:*** We always have to compare these quantities with the statistics of the dataset, i.e., we have to provide at least the range (minimum, maximum) as well as mean and standard deviation.

We compute the above performance metrics for both training and validation set, while also stating the dataset statistics:

Seaborn (https://seaborn.pydata.org/) provides some nice visualization tools and is compatible with the pandas library. For instance, one can visualize the model performance on the validation set by plotting true vs predicted target in a scatter plot, while showing the respective densities as projections:

%% Cell type:code id: tags:

``` python

y_true=y_true_val

y_pred=y_pred_val

#Convert validation predictions to pandas dataframe

* Evaluate the results: What can you infer from the plot showing training/validation MSE vs. epoch number (e.g., is overfitting observed)? What do the individual performance measures tell you - especially in comparison with the dataset statistics?

* Feel free to change hyperparameters by hand to get a feeling for them and possibly to improve model performance. In case training takes too long, reduce the neural-network size and/or the number of epochs.

* Compare the results of this tutorial to the ones in the [original reference](https://www.nature.com/articles/npjcompumats201628?report=reader)(Table 1). If they are worse, try to explain what could be improved (tip: think about how the information on inorganic compounds is encoded in "ElemNet" and compare it to the one of the reference. Also have a look at [this documentation](https://scikit-learn.org/stable/modules/cross_validation.html)). Furthermore, even if the performance (as, for instance, indicated by MSE) is worse, why are neural networks still useful / why do they provide an advantage compared to other "standard" machine learning methods (hint: Figure 6 of the original ElemNet reference and/or google "Representation learning")?

* (Optional) Inform yourself about other ways to visualize the data such as box and violin plots. The seaborn documentation is a useful resource for this. For materials science related visualizations, have a look at the figures in [this paper](https://iopscience.iop.org/article/10.1088/2515-7639/ab077b/meta), especially Fig. 3-6.

%% Cell type:markdown id: tags:

### 2.6 Test

Once model optimization is finished, we are ready to investigate model performance on the test set - set the following keyword to True and then run the subsequent cell. Evaluate and interpret your final results.

Furthermore, more information on (advanced) hyperparameter tuning techniques (for instance grid search, random search or Bayesian optimization for tuning the number of layers, neurons, and dropout ratios) can be found in the following references:

* As a start: https://scikit-learn.org/stable/modules/grid_search.html

* More advanced: https://github.com/hyperopt/hyperopt - Bergstra, J., Yamins, D. & Cox, D. D. Making a science of model search: Hyperparameter optimization in hundreds of

dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on