From 3efc84d5bad2ab2ecefa37de771508c19d677c52 Mon Sep 17 00:00:00 2001
From: Andreas Leitherer <leitherer@fhi-berlin.mpg.de>
Date: Thu, 17 Dec 2020 11:51:05 +0100
Subject: [PATCH] Addressing Luigi's comments: adding explanations on universal approx.thm., stochastic grad.desc. etc.

---
 nn_regression.ipynb | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/nn_regression.ipynb b/nn_regression.ipynb
index 0e34b1b..faa8389 100644
--- a/nn_regression.ipynb
+++ b/nn_regression.ipynb
@@ -231,6 +231,11 @@
  "\n",
  "<img src=\"./assets/nn_regression/mlp_more_layers_example.png\" width=\"200\">\n",
  "\n",
+ "The larger and deeper the network, the more computationally costly it becomes to calculate the gradient of the loss function with respect to the parameters, which is needed to optimize them via gradient descent. A key development is the [backpropagation algorithm](https://www.nature.com/articles/323533a0), which computes this gradient efficiently.\n",
+ "\n",
+ "Note that if we used only linear activation functions, the output would simply be a linear combination of the input. Using non-linear activation functions (nowadays mostly the ReLU activation function) \n",
+ "enlarges the class of functions that can be represented: the universal approximation theorem (see [chapter 6.4.1](https://www.deeplearningbook.org/) and references therein) guarantees that multilayer perceptrons can approximate any continuous function arbitrarily well, although the theorem specifies neither a suitable architecture nor any generalization guarantees. Moreover, due to the non-linearities, training a deep neural network is a non-convex optimization problem with many local minima. Finding a good minimum is the key task, and techniques such as stochastic gradient descent (see [chapter 8](https://www.deeplearningbook.org/)) help to avoid getting stuck in poor local minima.\n",
+ "\n",
  "\n",
  "Coming now to the choice of activation function in the final layer, in case of classification,\n",
  "the softmax activation function is the usual choice, yielding the following expression for the $j$th component \n",
--
GitLab
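
As an illustration of the concepts added above (backpropagation, the ReLU non-linearity, and stochastic gradient descent), a minimal NumPy sketch, not taken from the notebook and with all layer sizes, the toy target function, and the hyperparameters chosen arbitrarily, could look as follows:

# Illustrative sketch only: a one-hidden-layer ReLU network fitted by
# mini-batch stochastic gradient descent, with the backpropagation
# (chain-rule) gradients written out by hand. All sizes/hyperparameters
# are arbitrary assumptions for this example.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) plus noise (assumed target, illustration only)
X = rng.uniform(-3.0, 3.0, size=(512, 1))
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Parameters of a multilayer perceptron with one hidden ReLU layer
n_hidden = 32
W1 = rng.normal(scale=1.0, size=(1, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=(n_hidden, 1))
b2 = np.zeros(1)

lr, batch_size = 0.05, 32
for epoch in range(200):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        # Forward pass
        z1 = xb @ W1 + b1          # hidden-layer pre-activation
        a1 = np.maximum(z1, 0.0)   # ReLU non-linearity
        pred = a1 @ W2 + b2        # linear output layer (regression)

        # Mean-squared-error loss; gradient with respect to the prediction
        grad_pred = 2.0 * (pred - yb) / len(xb)

        # Backpropagation: apply the chain rule layer by layer
        grad_W2 = a1.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_a1 = grad_pred @ W2.T
        grad_z1 = grad_a1 * (z1 > 0.0)   # derivative of ReLU
        grad_W1 = xb.T @ grad_z1
        grad_b1 = grad_z1.sum(axis=0)

        # Stochastic gradient descent update on the current mini-batch
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        W2 -= lr * grad_W2; b2 -= lr * grad_b2

mse = np.mean((np.maximum(X @ W1 + b1, 0.0) @ W2 + b2 - y) ** 2)
print(f"final training MSE: {mse:.4f}")

The hand-written backward pass is exactly the chain rule that backpropagation organizes efficiently; in practice a framework with automatic differentiation would compute these gradients instead.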