From 3efc84d5bad2ab2ecefa37de771508c19d677c52 Mon Sep 17 00:00:00 2001
From: Andreas Leitherer <leitherer@fhi-berlin.mpg.de>
Date: Thu, 17 Dec 2020 11:51:05 +0100
Subject: [PATCH] Addressing Luigi's comments: adding explanations on universal
 approx. thm., stochastic grad. desc., etc.

---
 nn_regression.ipynb | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/nn_regression.ipynb b/nn_regression.ipynb
index 0e34b1b..faa8389 100644
--- a/nn_regression.ipynb
+++ b/nn_regression.ipynb
@@ -231,6 +231,58 @@
     "\n",
     "<img src=\"./assets/nn_regression/mlp_more_layers_example.png\" width=\"200\">\n",
     "\n",
+    "The larger and deeper the network, the more computationally costly it becomes to calculate the gradient (which is needed to optimize the parameters via gradient descent) of the loss function. A key invention is the [backpropagation algorithm]([https://www.nature.com/articles/323533a0]) which is an efficient way to calculate the gradient of the loss function with respect to the neural-network parameters.\n",
+    "\n",
+    "Note that if we would use only linear activation functions, the output would essentially be just a linear combination of the input. Using non-linear activation functions (nowadays mostly the ReLU activation function) \n",
+    "increases the functions that can be represented - in particular the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any function arbitrarily well - while details of the architecture and generalization guarantees are not provided by this theorem. In particular, due to the non-linearities, optimizing a deep neural network is a non-convex optimization problem where one has to deal with multiple local minima. Finding the most optimal one is the key task and techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) allow to avoid getting stuck in non-optimal local minima. \n",
+    "\n",
     "\n",
     "Coming now to the choice of activation function in the final layer, in case of classification,\n",
     "the softmax activation function is the usual choice, yielding the following expression for the $j$th component \n",
-- 
GitLab