diff --git a/nn_regression.ipynb b/nn_regression.ipynb
index faa838985bcbc9db5f6a03902d9412e6cac020f9..afd6bb4c182407627a1f1b23e6704bdf78634508 100644
--- a/nn_regression.ipynb
+++ b/nn_regression.ipynb
@@ -38,7 +38,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this tutorial, the standard architecture for neural networks (multilayer perceptrons or rather fully-connected neural networks) is introduced and applied to a regression task (the prediction of material properties of inorganic compounds). Neural networks for classification are briefly explained as well, while more details on this topic can be found in the tutorial on convolutional neural networks.\n",
+    "In this tutorial, the standard architecture for neural networks (multilayer perceptrons or rather fully-connected neural networks) is introduced and applied to a regression task (the prediction of a material property of inorganic compounds). Neural networks for classification are briefly explained as well, while more details on this topic can be found in the tutorial on convolutional neural networks.\n",
     "\n",
     "After explaining the basic concepts, a fully connected neural network is set up using the python library Keras (https://keras.io/) with the input representation being constructed in the spirit of \n",
     "\n",
@@ -48,6 +48,14 @@
     "The goal is then to predict the volume per atom for inorganic solids from the open quantum materials database (OQMD). Only information on the chemical composition is used  (in particular, no structural information). The results are analyzed using typical performance measures such as mean absolute error, mean squared error, root mean square error, and the Pearson correlation coefficient. Visualization techniques and advanced optimization methods are discussed at the end."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Side remark: The documentation of Keras at https://keras.io/ refers to the newest version of Keras (>2.4), which only supports Tensorflow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the tutorial on convolutional neural networks) is compatible with versions <=2.3 which allows multiple backends (CNTK, Tensorflow, Theano). There are only slight differences in syntax and you can find archieved documentations at (https://github.com/faroit/keras-docs), e.g., for \n",
+    "version 2.1.5 https://faroit.com/keras-docs/2.1.5/. We use tensorflow as backend (version <2.0)."
+   ]
+  },
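+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The following cell is a minimal, optional check of the installed versions (assuming Keras and the TensorFlow backend are already installed); the printed version numbers should match the requirements stated above. For multi-backend Keras (<=2.3), the backend can also be switched via the `KERAS_BACKEND` environment variable or the `~/.keras/keras.json` configuration file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sanity check of the installed Keras/TensorFlow versions and the active backend\n",
+    "import keras\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "print('Keras version:', keras.__version__)        # expected: <=2.3\n",
+    "print('TensorFlow version:', tf.__version__)      # expected: <2.0\n",
+    "print('Keras backend:', keras.backend.backend())  # expected: 'tensorflow'"
+   ]
+  },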
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -207,7 +215,7 @@
     "\\mathbf{a} = f ( A\\mathbf{x} + \\mathbf{b} ),\n",
     "\\end{equation*}$\n",
     "\n",
-    "which is essentially an affine transformation (mediated by a matrix A and vector b) followed by element-wise application of the (non-linear) activation function f. The weights of the linear combinations are collected in the matrix A and the offsets in the *bias vector* $\\mathbf{b} = (b_1, b_2, ...)$. The output activations $\\mathbf{o}$ are obtained by applying a further affine transformation (matrix A$^\\prime$, bias $b^\\prime$) and activation function $f^\\prime$:\n",
+    "which is essentially an affine transformation (defined by a matrix A and vector b) followed by element-wise application of the (non-linear) activation function f. The weights of the linear combinations are collected in the matrix A and the offsets in the *bias vector* $\\mathbf{b} = (b_1, b_2, ...)$. The output activations $\\mathbf{o}$ are obtained by applying a further affine transformation (matrix A$^\\prime$, bias $b^\\prime$) and activation function $f^\\prime$:\n",
     "\n",
     "$\\begin{equation*}\n",
     "\\mathbf{o} = f^\\prime (A^\\prime \\mathbf{a} + \\mathbf{b}^\\prime)\n",
@@ -215,7 +223,7 @@
     "\n",
     "The final activation function $f^\\prime$ is chosen in a specific way, usually depending on the task being either  regression or classification - we will come back to this later.\n",
     "\n",
-    "To simplify the above expression for $\\mathbf{o}$, one can change the definition of input vector and weight matrices such that the bias terms can be omitted. We denote the input vector as before and introduce weight matrices W, W$^\\prime$, which yields us a more compact expression for the output:  \n",
+    "To simplify the above expression for $\\mathbf{o}$, one can change the definition of input vector and weight matrices such that the bias terms can be omitted. We denote the input vector as before and introduce weight matrices W, W$^\\prime$, which yields a more compact expression for the output:  \n",
     "\n",
     "$\\begin{equation*}\n",
     "\\mathbf{o} = f^\\prime (W^\\prime \\mathbf{a}) = f^\\prime (W^\\prime f(W \\mathbf{x})).\n",
@@ -231,10 +239,10 @@
     "\n",
     "<img src=\"./assets/nn_regression/mlp_more_layers_example.png\" width=\"200\">\n",
     "\n",
-    "The larger and deeper the network, the more computationally costly it becomes to calculate the gradient (which is needed to optimize the parameters via gradient descent) of the loss function. A key invention is the [backpropagation algorithm]([https://www.nature.com/articles/323533a0]) which is an efficient way to calculate the gradient of the loss function with respect to the neural-network parameters.\n",
+    "The larger and deeper the network, the more computationally costly it becomes to calculate the gradient (which is needed to optimize the parameters via gradient descent) of the loss function. A key invention is the backpropagation algorithm (cf. the [original publication]([https://www.nature.com/articles/323533a0])) which is an efficient way to calculate the gradient of the loss function with respect to the neural-network parameters.\n",
     "\n",
     "Note that if we would use only linear activation functions, the output would essentially be just a linear combination of the input. Using non-linear activation functions (nowadays mostly the ReLU activation function) \n",
-    "increases the functions that can be represented - in particular the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any function arbitrarily well - while details of the architecture and generalization guarantees are not provided by this theorem. In particular, due to the non-linearities, optimizing a deep neural network is a non-convex optimization problem where one has to deal with multiple local minima. Finding the most optimal one is the key task and techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) allow to avoid getting stuck in non-optimal local minima. \n",
+    "enriches the function space that can be represented. In particular, the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any function arbitrarily well - while details of the architecture and generalization guarantees are not provided by this theorem. In particular, due to the non-linearities, optimizing a deep neural network is a non-convex optimization problem where one has to deal with multiple local minima. Finding the most optimal one is the key task and techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) allow to avoid getting stuck in non-optimal local minima. \n",
     "\n",
     "\n",
     "Coming now to the choice of activation function in the final layer, in case of classification,\n",
@@ -252,7 +260,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To illustrate the usefulness of softmax activation functions, let us consider the case of crystal-structure classification. The task is to assign the correct (symmetry) label to a given, unkown crystal structure, i.e., to predict the correct class, e.g., face-centered-cubic, body-centered-cubic, diamond or hexagonal closed packed (note that this collection of structures covers more than 80% of the elemental solids). More thorough explanations of deep learning applied to crystal-structure recognition can be found  [here](https://www.nature.com/articles/s41467-018-05169-6). When applying the multilayer perceptron architecture which we introduced above, each of the four output neurons correspond to a specific crystal structure. The use of the softmax activation function guarantees that all output activations sum to one, which is why the output vector $\\mathbf{o}$ can be considered as a vector of classification probabiltites. For instance, if $\\mathbf{o} = (1, 0, 0, 0)$, the input structure is predicted to have fcc symmetry with 100\\% probability (see figure below). This is also called \"one-hot-encoding\" and corresponds to representing a given number N of classes in the standard basis in $\\mathbb{R}^\\text{N}$, i.e., by N vectors $e_i = (0, ...0, 1, 0, ..., 0)$, for $i=1, ..., N$ and all components of $e_i$ being zero except for the $i$th entry. \n",
+    "To illustrate the usefulness of softmax activation functions, let us consider the case of crystal-structure classification. The task is to assign the correct (symmetry) label to a given, unkown crystal structure, i.e., to predict the correct class, e.g., face-centered-cubic, body-centered-cubic, diamond or hexagonal closed packed (note that this collection of structures covers more than 80% of the elemental solids). More thorough explanations on deep learning applied to crystal-structure recognition can be found  [here](https://www.nature.com/articles/s41467-018-05169-6). When applying the multilayer perceptron architecture which we introduced above, each of the four output neurons correspond to a specific crystal structure. The use of the softmax activation function guarantees that all output activations sum to one, which is why the output vector $\\mathbf{o}$ can be considered as a vector of classification probabiltites. For instance, if $\\mathbf{o} = (1, 0, 0, 0)$, the input structure is predicted to have fcc symmetry with 100\\% probability (see figure below). This is also called \"one-hot-encoding\" and corresponds to representing a given number N of classes in the standard basis in $\\mathbb{R}^\\text{N}$, i.e., by N vectors $e_i = (0, ...0, 1, 0, ..., 0)$, for $i=1, ..., N$ and all components of $e_i$ being zero except for the $i$th entry. \n",
     "\n",
     "<img src=\"./assets/nn_regression/cs_classification_first_example.png\" width=\"1700\">\n",
     "\n"
@@ -724,7 +732,7 @@
     "* Pearson correlation coefficient (0 for no correlation, 1 for positive linear correlation, and -1 for negative linear correlation), see for instance [wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).\n",
     "\n",
     "\n",
-    "***Important note:*** We always have to put these quantities into perspective with the statistics of the dataset, i.e., we have to provide at least the range (minimum, maximum) as well as mean and standard deviation.\n",
+    "***Important note:*** We always have to compare these quantities with the statistics of the dataset, i.e., we have to provide at least the range (minimum, maximum) as well as mean and standard deviation.\n",
     "\n",
     "We compute the above performance metrics for both training and validation set, while also stating the dataset statistics:"
    ]