# (Convolutional) Neural network tutorial - BigMax workshop - Dresden, April 2019

##### Authors: Angelo Ziletti, Andreas Leitherer, and Luca M. Ghiringhelli - Fritz Haber Institute of the Max Planck Society, Berlin

In this tutorial, we briefly introduce the main ideas behind convolutional neural networks, build a neural network model, and finally explain the classification decision process using attentive response maps.

## 0. Install packages needed

We first install the packages needed for this tutorial and then load the necessary Python libraries. This tutorial has been tested on Python 3.5.

```
# packages to build convolutional neural networks (and not only)
! pip install --user tensorflow
! pip install --user keras

# to visualize images
! pip install matplotlib

# to calculate convolution
! pip install scipy
! pip install numpy

# package for neural network attention map visualization
! pip install git+https://github.com/raghakot/keras-vis.git -U
```

```
from __future__ import print_function
%matplotlib inline

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
import scipy.misc
import urllib.request
```

## 1. Introduction to Convolutional Neural Networks

This introduction is mainly taken from Ref. [1], to which we refer the interested reader for more details.

Convolutional networks are a specialized kind of neural network for processing data that has a known **grid-like topology**; they are networks that use convolution in place of general matrix multiplication in at least one of their layers.

Examples of such data include time-series data (1-D grid with samples at regular time intervals) and image data (2-D grid of pixels).

Convolutional networks have been tremendously successful in practical applications, especially in computer vision.

The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.

A typical layer of a convolutional network consists of three stages:

- **Convolution** stage: the layer performs several convolutions in parallel to produce a set of linear activations.
- **Detector** stage: each linear activation is run through a nonlinear activation function (e.g. the rectified linear activation function).
- **Pooling** stage: a pooling function is used to modify (downsample) the output of the layer. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.

#### Max pooling example

Figure from http://cs231n.github.io/convolutional-networks/

#### Average pooling example

Figure from https://github.com/vdumoulin/conv_arithmetic
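The detector and pooling stages can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper `pool2d`, the 4x4 array of linear activations, and the 2x2 non-overlapping window are hypothetical choices made for this example and are not part of the original tutorial.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling of a 2-D array with a (size x size) window.

    Hypothetical helper for illustration only.
    """
    h, w = feature_map.shape
    # crop so that height and width are multiples of the window size
    h_crop, w_crop = h - h % size, w - w % size
    blocks = feature_map[:h_crop, :w_crop].reshape(
        h_crop // size, size, w_crop // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")


# made-up linear activations, as they might come out of the convolution stage
linear_activations = np.array([[ 1., -2.,  3.,  0.],
                               [-1.,  5., -3.,  2.],
                               [ 4., -4.,  1.,  1.],
                               [ 0.,  2., -6.,  3.]])

# detector stage: rectified linear activation applied element-wise
detected = np.maximum(linear_activations, 0.0)

# pooling stage: summarize each 2x2 neighborhood by its maximum or its average
max_pooled = pool2d(detected, size=2, mode="max")
avg_pooled = pool2d(detected, size=2, mode="average")

print(max_pooled)  # values: 5, 3 / 4, 3
print(avg_pooled)  # values: 1.5, 1.25 / 1.5, 1.25
```

Both pooled maps are 2x2, i.e. each spatial dimension is halved; only the summary statistic differs.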

### 2. Motivation

Why should one use convolutional neural networks instead of simple (fully connected) neural networks?

Convolution leverages three important ideas that can help improve a machine learning system:

- **sparse interactions**
- **parameter sharing**
- **equivariant representations**

Moreover, convolution provides a means for working with inputs of variable size, which is not possible with fully connected neural networks (also called multi-layer perceptrons).

#### 2.1 Sparse interactions

##### Fully connected NN

A fully connected layer uses matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. This does not scale well to full images. For example, an image of size 200x200x3 would lead to neurons that each have 200x200x3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons. Clearly, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.

##### CNN

A convolutional layer achieves sparse interactions (sparse connectivity) by making the kernel smaller than the input. When processing an image, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels (*see Sec. 3.3.2 for two concrete examples*).

This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m \times n$ parameters, and the algorithms used in practice have $O(m \times n)$ runtime (per example). If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k \times n$ parameters and $O(k \times n)$ runtime. For many practical applications, $k$ is several orders of magnitude smaller than $m$.

#### 2.2 Parameter sharing

Parameter sharing refers to using the same parameter for more than one function in a model.

##### Fully connected NN

Each element of the weight matrix is used exactly once when computing the output of a layer.

##### CNN

Each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This further reduces the storage requirements of the model to $k$ parameters. Recall that $k$ is usually several orders of magnitude smaller than $m$. Since $m$ and $n$ are usually roughly the same size, $k$ is practically insignificant compared to $m \times n$. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of memory requirements and statistical efficiency.

#### 2.3 Equivariant representations

Parameter sharing causes the layer to have **equivariance to translation**. To say a function is equivariant means that if the input changes, the output changes in the same way.

When processing time-series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move by the same amount in the output. This is useful when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.
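To put numbers on the arguments of this section, the sketch below counts parameters for the 200x200x3 image mentioned above and then checks translation equivariance on a toy time series. The choices $n = m$ and $k = 5 \times 5 \times 3$, as well as the signal `x` and the kernel `w`, are assumptions made only for this illustration.

```python
import numpy as np

# --- parameter counts (Sec. 2.1 and 2.2) ---
m = 200 * 200 * 3      # number of input units (the 200x200x3 image from the text)
n = m                  # assumption: as many output units as input units
k = 5 * 5 * 3          # assumption: each output unit is connected to one 5x5x3 patch

dense_params = m * n   # fully connected: one weight per input-output pair
sparse_params = k * n  # sparse connectivity, no sharing
shared_params = k      # sparse connectivity with parameter sharing (a single kernel)

print(dense_params, sparse_params, shared_params)

# --- translation equivariance (Sec. 2.3) ---
x = np.array([0., 0., 1., 2., 0., 0., 0., 0.])  # a short "event" in a time series
w = np.array([1., -1., 0.5])                     # arbitrary kernel

shift = 2
y = np.convolve(x, w, mode='full')
# np.roll wraps around, which is harmless here because the trailing entries are zero
y_from_shifted_input = np.convolve(np.roll(x, shift), w, mode='full')

# moving the event later in the input moves its representation later in the output
is_equivariant = np.allclose(y_from_shifted_input, np.roll(y, shift))
print(is_equivariant)
```

Under these assumptions the dense layer needs about $1.4 \times 10^{10}$ weights, the sparse layer $9 \times 10^{6}$, and the shared kernel only 75.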

## 3. The convolution operation

### 3.1 Summary and intuition

The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on the first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels in width and height, and 3 because images have depth 3, the color channels).

- During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. Intuitively, a convolution can be thought of as a sliding (weighted) average.
- As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.
- At this stage, we have an entire set of filters in each convolutional layer (e.g. 12 filters), and each of them produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension and produce the output volume.

Below, you can see a representation of how the convolution operation is performed.

Animation from: https://github.com/vdumoulin/conv_arithmetic/blob/master/gif/padding_strides.gif
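The forward pass described in this section can be written out with explicit loops. The sketch below is deliberately naive and rests on assumptions that are not in the original tutorial: a random 32x32x3 input volume, 12 random 5x5x3 filters, no padding, and stride 1.

```python
import numpy as np

rng = np.random.RandomState(0)
input_volume = rng.rand(32, 32, 3)   # height x width x depth (hypothetical RGB image)
filters = rng.randn(12, 5, 5, 3)     # 12 filters, each 5x5 and spanning the full depth

n_filters, fh, fw, _ = filters.shape
out_h = input_volume.shape[0] - fh + 1   # no padding, stride 1
out_w = input_volume.shape[1] - fw + 1
output_volume = np.zeros((out_h, out_w, n_filters))

for f in range(n_filters):
    for i in range(out_h):
        for j in range(out_w):
            # dot product between the filter and the input patch under it
            patch = input_volume[i:i + fh, j:j + fw, :]
            output_volume[i, j, f] = np.sum(patch * filters[f])

# one 2-D activation map per filter, stacked along the depth dimension
print(output_volume.shape)  # (28, 28, 12)
```

Each slice `output_volume[:, :, f]` is the activation map of one filter. Deep learning libraries compute the same quantity far more efficiently than these loops.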

### 3.2 Mathematical formulation - from Ref. [1]

#### Main idea

Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement.

If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship:

$s(t) = \int x(a)\,w(t - a)\,da$

This operation is called **convolution**.

The convolution operation is typically denoted with an asterisk:

$s(t) = (x \ast w)(t)$

In convolutional network terminology, the first argument to the convolution (in this example, the function $x$) is often referred to as the **input**, and the second argument (in this example, the function $w$) as the **kernel**. The output is sometimes referred to as the **feature map**.

#### Discrete version - 1D [optional]

Let us now assume that the time index $t$ can take only integer values. If $x$ and $w$ are defined only on integer $t$, we can define the discrete convolution:

$s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{+\infty} x(a)\,w(t - a)$

#### Discrete version - 2D [optional]

If we use a two-dimensional image $I$ as input, we typically also use a two-dimensional kernel $K$:

$S(i,j) = (I \ast K)(i,j) = \sum_{m}\sum_{n} I(m,n)\,K(i-m,j-n)$

Convolution is commutative, so we can write:

$S(i,j) = (K \ast I)(i,j) = \sum_{m}\sum_{n} I(i-m,j-n)\,K(m,n)$

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of $m$ and $n$. The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as $m$ increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation.

Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

$S(i,j) = (I \ast K)(i,j) = \sum_{m}\sum_{n} I(i+m,j+n)\,K(m,n)$

Many machine learning libraries implement cross-correlation but call it *convolution*. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping.
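The formulas above can be checked numerically. The following sketch implements the 1-D discrete convolution directly from its definition and then compares convolution with cross-correlation in 2-D using SciPy. The helper `conv1d` and all input arrays are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import signal


def conv1d(x, w):
    """Discrete convolution s(t) = sum_a x(a) w(t - a) for finite sequences."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):   # w is zero outside its support
                s[t] += x[a] * w[t - a]
    return s


x = np.array([1., 2., 3., 4.])   # input
w = np.array([0.5, 0.3, 0.2])    # kernel

matches_numpy = np.allclose(conv1d(x, w), np.convolve(x, w))
is_commutative = np.allclose(np.convolve(x, w), np.convolve(w, x))

# 2-D: convolution equals cross-correlation with a kernel flipped along both axes
I = np.arange(16.).reshape(4, 4)
K = np.array([[1., 2.],
              [3., 4.]])
conv = signal.convolve2d(I, K, mode='valid')
xcorr = signal.correlate2d(I, K, mode='valid')
xcorr_flipped = signal.correlate2d(I, K[::-1, ::-1], mode='valid')

print(matches_numpy, is_commutative)    # True True
print(np.allclose(conv, xcorr_flipped)) # True
print(np.allclose(conv, xcorr))         # False: K is not symmetric under flipping
```

For a kernel that is unchanged by flipping (such as a symmetric blur kernel), convolution and cross-correlation give identical results.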

## 3.3 Examples

### 3.3.1 Example: computing the output values of a discrete convolution (from Ref. [3])

We present below the calculation of the discrete convolution with a 3x3 kernel $K_{\rm ex}$ (with no padding and stride 1):

$K_{\rm ex} = \begin{pmatrix}
0 & 1 & 2 \\
2 & 2 & 0 \\
0 & 1 & 2
\end{pmatrix}$
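As a worked version of this calculation, the sketch below applies $K_{\rm ex}$ to a small input with no padding and stride 1. The 5x5 input array is an assumption made for this illustration, since no input is given in the text. The kernel is applied without flipping, i.e. as a cross-correlation, following the machine learning convention discussed in Sec. 3.2.

```python
import numpy as np
from scipy import signal

k_ex = np.array([[0., 1., 2.],
                 [2., 2., 0.],
                 [0., 1., 2.]])

# hypothetical 5x5 input, chosen only for illustration
input_ex = np.array([[3., 3., 2., 1., 0.],
                     [0., 0., 1., 3., 1.],
                     [3., 1., 2., 2., 3.],
                     [2., 0., 0., 2., 2.],
                     [2., 0., 0., 0., 1.]])

# explicit sliding window: no padding, stride 1 -> output is (5-3+1) x (5-3+1)
output_manual = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        output_manual[i, j] = np.sum(input_ex[i:i + 3, j:j + 3] * k_ex)

# the same result with SciPy ('valid' means no padding)
output_scipy = signal.correlate2d(input_ex, k_ex, mode='valid')

print(output_manual)
# values: 12, 12, 17 / 10, 17, 19 / 9, 6, 14
```

For instance, the top-left output value is $0\cdot3 + 1\cdot3 + 2\cdot2 + 2\cdot0 + 2\cdot0 + 0\cdot1 + 0\cdot3 + 1\cdot1 + 2\cdot2 = 12$.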

### 3.3.2 Example: convolution in practice on real images

We now perform a convolution operation on real images. We use a photo of Max Planck and a Berlin landscape.

```
# this can be skipped because the images are already saved on the server

# retrieve image of Max Planck from wikipedia
#print("Retrieving picture of Max Planck. Saving image to './img_max_planck.jpg'.")
#urllib.request.urlretrieve("https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Max_Planck_1933.jpg/220px-Max_Planck_1933.jpg", "./img_max_planck.jpg")

# retrieve a picture of Berlin
#print("Retrieving picture of Berlin landscape. Saving image to './img_berlin_landscape.jpg'.")
#urllib.request.urlretrieve("http://vivalifestyleandtravel.com/images/cache/c-1509326560-44562570.jpg", "./img_berlin_landscape.jpg")

#print("Done.")
```

We define a function to display several images in a single figure; understanding its implementation is not important for the purpose of this tutorial.

```
# function to display multiple images in a single figure
def show_images(images, cols=1, titles=None, cmap='viridis', filename_out=None):
    """Display a list of images in a single figure with matplotlib.

    Taken from https://stackoverflow.com/questions/11159436/multiple-figures-in-a-single-window

    Parameters:

    images: list of np.arrays
        Images to be plotted. It must be compatible with plt.imshow.

    cols: int, optional, (default = 1)
        Number of columns in figure (number of rows is
        set to np.ceil(n_images/float(cols))).

    titles: list of strings
        List of titles corresponding to each image.

    cmap: string, optional, (default = 'viridis')
        Colormap passed to plt.imshow.

    filename_out: string, optional, (default = None)
        If given, the figure is saved to this file in png format.

    """
    plt.clf()
    assert ((titles is None) or (len(images) == len(titles)))
    n_images = len(images)
    if titles is None:
        titles = ['Image (%d)' % i for i in range(1, n_images + 1)]
    fig = plt.figure()
    for n, (image, title) in enumerate(zip(images, titles)):
        # subplot grid dimensions must be integers; np.ceil returns a float
        a = fig.add_subplot(cols, int(np.ceil(n_images / float(cols))), n + 1)
        plt.imshow(image, cmap=cmap)
        a.set_title(title, fontsize=40)
        a.axis('off')  # clear x- and y-axes
    fig.set_size_inches(np.array(fig.get_size_inches()) * n_images)
    if filename_out is not None:
        plt.savefig(filename_out, dpi=100, format='png')
```

```
# read jpg files as numpy arrays
# [:, :, 0] keeps only the first color channel, so that each image is a 2-D array
img_max_planck = plt.imread('./img_max_planck.jpg')[:, :, 0]
img_berlin_landscape = plt.imread('./img_berlin_landscape.jpg')[:, :, 0]
```

#### Types of kernels

In Sec. 3.3.1, we used a randomly chosen matrix to perform our convolution; it turns out that there are some "special" kernel matrices that perform specific (and useful) transformations when convolved with an image. Below, we present some examples of these kernels.

Please visit this page for more details: https://en.wikipedia.org/wiki/Kernel_(image_processing)

$K_{\rm identity} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$

$K_{\rm boxblur} = \dfrac{1}{9}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

$K_{\rm gaussianblur3x3} = \dfrac{1}{16}\begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}$

$K_{\rm gaussianblur5x5} = \dfrac{1}{256}\begin{pmatrix} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \end{pmatrix}$

$K_{\rm vlines} = \begin{pmatrix} -1 & 2 & -1 \\ -1 & 2 & -1 \\ -1 & 2 & -1 \end{pmatrix}$

$K_{\rm hlines} = \begin{pmatrix} -1 & -1 & -1 \\ 2 & 2 & 2 \\ -1 & -1 & -1 \end{pmatrix}$

$K_{\rm edges} = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{pmatrix}$

$K_{\rm emboss} = \begin{pmatrix} -2 & -1 & 0 \\ -1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}$
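Two properties of these kernels are easy to verify numerically. The blur kernels sum to one, so they preserve the average brightness of the image, whereas the line and edge kernels sum to zero, so they give no response on uniform regions. In addition, the two Gaussian blur kernels are outer products of 1-D binomial weights. The variable names in this sketch are hypothetical and used only for illustration.

```python
import numpy as np

# 1-D binomial weights (rows of Pascal's triangle)
binomial_3 = np.array([1., 2., 1.])
binomial_5 = np.array([1., 4., 6., 4., 1.])

# outer products reproduce the 3x3 and 5x5 Gaussian blur kernels listed above
gauss_3x3_from_binomial = np.outer(binomial_3, binomial_3) / 16.
gauss_5x5_from_binomial = np.outer(binomial_5, binomial_5) / 256.

edges_kernel = np.array([[-1., -1., -1.],
                         [-1.,  8., -1.],
                         [-1., -1., -1.]])

print(gauss_3x3_from_binomial * 16.)     # 1 2 1 / 2 4 2 / 1 2 1
print(gauss_3x3_from_binomial.sum())     # 1.0: blurring preserves mean brightness
print(gauss_5x5_from_binomial.sum())     # 1.0
print(edges_kernel.sum())                # 0.0: no response on uniform regions
```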

Now we apply the convolution operation to both images (the photo of Max Planck and the Berlin landscape) using each of the kernels above. In particular, we use the SciPy function `signal.convolve2d` to perform the convolution.

Please refer to the SciPy documentation for more details on this function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html

```
k_identity = np.array([[0., 0., 0.],
                       [0., 1., 0.],
                       [0., 0., 0.]])

k_box_blur = 1./9. * np.array([[1., 1., 1.],
                               [1., 1., 1.],
                               [1., 1., 1.]])

k_gauss_blur_3x3 = 1./16. * np.array([[1., 2., 1.],
                                      [2., 4., 2.],
                                      [1., 2., 1.]])

k_gauss_blur_5x5 = 1./256. * np.array([[1., 4., 6., 4., 1.],
                                       [4., 16., 24., 16., 4.],
                                       [6., 24., 36., 24., 6.],
                                       [4., 16., 24., 16., 4.],
                                       [1., 4., 6., 4., 1.]])

k_vlines = np.array([[-1., 2., -1.],
                     [-1., 2., -1.],
                     [-1., 2., -1.]])

k_hlines = np.array([[-1., -1., -1.],
                     [ 2.,  2.,  2.],
                     [-1., -1., -1.]])

k_edges = np.array([[-1., -1., -1.],
                    [-1.,  8., -1.],
                    [-1., -1., -1.]])

# the emboss kernel gives the illusion of depth by emphasizing the differences of pixels in a given direction
# in this case, in a direction along a line from the top left to the bottom right.
k_emboss = np.array([[-2., -1., 0.],
                     [-1.,  1., 1.],
                     [ 0.,  1., 2.]])
```

```
kernels = [k_identity, k_box_blur, k_vlines, k_hlines, k_edges, k_emboss]
titles = ['original', 'box blur', 'vertical lines', 'horizontal lines', 'edges', 'emboss']

# now apply the convolution for each kernel above
max_planck_feature_maps = []
berlin_landscape_feature_maps = []
for kernel in kernels:
    max_planck_feature_maps.append(signal.convolve2d(img_max_planck, kernel, boundary='symm', mode='same'))
    berlin_landscape_feature_maps.append(signal.convolve2d(img_berlin_landscape, kernel, boundary='symm', mode='same'))
```
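To see what these kernels respond to without relying on the downloaded photos, one can run the same `signal.convolve2d` call on a synthetic image. In the sketch below, the 7x7 test image with a single bright vertical line and the helper `apply_kernel` are assumptions made for illustration only.

```python
import numpy as np
from scipy import signal

# synthetic 7x7 image: dark background with one bright vertical line
test_image = np.zeros((7, 7))
test_image[:, 3] = 1.

identity_kernel = np.array([[0., 0., 0.],
                            [0., 1., 0.],
                            [0., 0., 0.]])
vlines_kernel = np.array([[-1., 2., -1.],
                          [-1., 2., -1.],
                          [-1., 2., -1.]])
hlines_kernel = np.array([[-1., -1., -1.],
                          [ 2.,  2.,  2.],
                          [-1., -1., -1.]])


def apply_kernel(image, kernel):
    # same settings as in the tutorial: symmetric boundary, output as large as the input
    return signal.convolve2d(image, kernel, boundary='symm', mode='same')


identity_response = apply_kernel(test_image, identity_kernel)
vlines_response = apply_kernel(test_image, vlines_kernel)
hlines_response = apply_kernel(test_image, hlines_kernel)

print(np.array_equal(identity_response, test_image))  # True: the image is unchanged
print(vlines_response[0])        # each row reads 0, 0, -3, 6, -3, 0, 0
print(np.abs(hlines_response).max())  # 0.0: no response to a vertical line
```

The vertical-line kernel responds strongly on the line and negatively right next to it, while the horizontal-line kernel does not respond at all. With `mode='same'`, every feature map has the same shape as the input image.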

```
show_images(images=max_planck_feature_maps, cols=2, titles=titles, cmap='gray')
```

```
show_images(images=berlin_landscape_feature_maps, cols=2, titles=titles, cmap='gray')
```
