%% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Convolutional neural networks</font></b>
</div>
<p>
created by:
Angelo Ziletti, <sup>1</sup>
Andreas Leitherer,<sup>1</sup>
Matthias Scheffler,<sup>1</sup>
and Luca Ghiringhelli<sup>1</sup> <br><br>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<div>
<img style="float: left;" src="assets/convolutional_nn/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/convolutional_nn/Logo_NOMAD.png" width="250">
</div>
</div>
%% Cell type:markdown id: tags:
In this tutorial, we briefly introduce the main ideas behind convolutional neural networks, build a neural network model with Keras, and explain the classification decision process using attentive response maps.
%% Cell type:markdown id: tags:
## Load packages needed
%% Cell type:markdown id: tags:
We first load the packages that we will need for this tutorial.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # Suppress TF warnings
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
from scipy import signal
import scipy.misc
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
```
%% Cell type:markdown id: tags:
## 1. Introduction to Convolutional Neural Networks
This introduction is mainly taken from Ref. [1], to which we refer the interested reader for more details.
%% Cell type:markdown id: tags:
Convolutional networks are a specialized kind of neural network for processing data that has a known **grid-like topology**; they are networks that use convolution in place of general matrix multiplication in at least one of their layers.
Examples of such data include time-series data (1-D grid with samples at regular time intervals) and image data (2-D grid of pixels).
Convolutional networks have been tremendously successful in practical applications, especially in computer vision.
The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
A typical layer of a convolutional network consists of three stages:
1. **Convolution** stage: the layer performs several convolutions in parallel to produce a set of linear activations (see Sec. 3 for more details).
2. **Detector** stage: each linear activation is run through a nonlinear activation function (e.g., the rectified linear activation function (ReLU), the sigmoid, or the tanh function).
3. **Pooling** stage: a pooling function is used to modify (downsample) the output of the layer. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
#### Max pooling example
![maxpool.jpg](assets/convolutional_nn/maxpool.jpeg)
Figure from http://cs231n.github.io/convolutional-networks/
#### Average pooling example
![avg_pooling_example.png](assets/convolutional_nn/avg_pooling_example.png)
Figure from https://github.com/vdumoulin/conv_arithmetic
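%% Cell type:markdown id: tags:
Below is a minimal NumPy sketch of the two pooling operations illustrated above, applied to an arbitrary 4x4 input (the values are chosen for illustration only); it uses the packages imported at the beginning of the tutorial.
%% Cell type:code id: tags:
``` python
# 2x2 max pooling and average pooling with stride 2 on an arbitrary 4x4 input
x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
# reshape into 2x2 blocks and take a summary statistic over each block
blocks = x.reshape(2, 2, 2, 2)
print("max pooling:\n", blocks.max(axis=(1, 3)))
print("average pooling:\n", blocks.mean(axis=(1, 3)))
```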
%% Cell type:markdown id: tags:
### 2. Motivation
Why should one use convolutional neural networks instead of simple (fully connected) neural networks?
Convolution leverages three important ideas that can help improve a machine learning system:
- **sparse interactions**
- **parameter sharing**
- **equivariant representations**
Moreover, convolution provides a means for working with inputs of variable size - while this is not possible with fully connected neural networks (also called multi-layer perceptrons).
#### 2.1 Sparse interactions
##### Fully connected NN
It uses matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. This does not scale well to full images. For example, an image of size 200x200x3 would lead to neurons that have 200x200x3 = 120,000 weights each. Moreover, we would almost certainly want to have several such neurons. Clearly, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
##### CNN
It achieves sparse interactions (sparse connectivity) by making the kernel smaller than the input. When processing an image, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. (*see Sec. 3.3.2 for two concrete examples*).
This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m \times n$ parameters, and the algorithms used in practice have $O(m \times n)$ runtime (per example). If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k \times n$ parameters and $O(k \times n)$ runtime. For many practical applications, $k$ is several orders of magnitude smaller than $m$.
#### 2.2 Parameter sharing
It refers to using the same parameter for more than one function in a model.
##### Fully connected NN
Each element of the weight matrix is used exactly once when computing the output of a layer.
##### CNN
Each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This further reduces the storage requirements of the model to $k$ parameters. Recall that $k$ is usually several orders of magnitude smaller than $m$. Since $m$ and $n$ are usually roughly the same size, $k$ is practically insignificant compared to $m \times n$. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of memory requirements and statistical efficiency.
#### 2.3 Equivariant representations
Parameter sharing causes the layer to have **equivariance to translation**. To say a function is equivariant means that if the input changes, the output changes in the same way.
When processing time-series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.
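%% Cell type:markdown id: tags:
To make the parameter counts above concrete, the small sketch below compares the three cases for the 200x200x3 image and a 5x5x3 kernel mentioned in this section (the numbers are purely illustrative).
%% Cell type:code id: tags:
``` python
# Illustrative parameter counts: fully connected vs. sparsely connected vs. shared kernel
m = 200 * 200 * 3          # number of inputs (a 200x200 RGB image, flattened)
n = 200 * 200              # number of outputs (one per spatial position)
k = 5 * 5 * 3              # connections per output unit for a 5x5x3 kernel
print("fully connected (m x n):      %d parameters" % (m * n))
print("sparse connectivity (k x n):  %d parameters" % (k * n))
print("with parameter sharing (k):   %d parameters" % k)
```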
%% Cell type:markdown id: tags:
## 3. The convolution operation
### 3.1 Summary and intuition
The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).
* During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. Intuitively, a convolution can be thought of as a sliding (weighted) average.
* As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.
* At this stage, we have an entire set of filters in each convolutional layer (e.g. 12 filters), and each of them produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension to produce the output volume.
Below, you can see a representation of how the convolution operation is performed.
![AnimationConvolution](assets/convolutional_nn/padding_strides.gif "convolution")
Animation from: https://github.com/vdumoulin/conv_arithmetic/blob/master/gif/padding_strides.gif
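%% Cell type:markdown id: tags:
As a side note, the spatial size of the resulting activation map depends on the input size, the kernel size, the amount of zero padding, and the stride. Below is a small illustrative helper (not part of the original tutorial code) implementing the standard output-size formula.
%% Cell type:code id: tags:
``` python
# Output spatial size of a convolution: (n + 2*p - k) // s + 1
def conv_output_size(n, k, p=0, s=1):
    """Return the output size for input size n, kernel size k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(n=28, k=3, p=0, s=1))   # a 28x28 image with a 3x3 kernel -> 26
print(conv_output_size(n=5, k=3, p=1, s=2))    # a padded, strided convolution on a 5x5 input -> 3
```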
%% Cell type:markdown id: tags:
### 3.2 Mathematical formulation - from Ref. [1]
#### Main idea
Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement.
If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship:
$s(t) = \int x(a)\,w(t-a)\,da$
This operation is called **convolution**.
The convolution operation is typically denoted with an asterisk:
$s(t) = (x \ast w)(t)$
In convolutional network terminology, the first argument (in this example, the function $x$) to the convolution is often referred to as the **input**, and the second argument (in this example, the function $w$) as the **kernel**. The output is sometimes referred to as the **feature map**.
#### Discrete version - 1D [optional]
Let us now assume that the time index $t$ can take on only integer values. If $x$ and $w$ are defined only for integer $t$, we can define the discrete convolution:
$s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{+\infty} x(a)\,w(t-a)$
#### Discrete version - 2D [optional]
$S(i,j) = (I \ast K)(i,j) = \sum_{m}\sum_{n} I(m,n)\,K(i-m,j-n)$
Convolution is commutative, so we can write:
$S(i,j) = (K \ast I)(i,j) = \sum_{m}\sum_{n} I(i-m,j-n)\,K(m,n)$
Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of $m$ and $n$. The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as $m$ increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation.
Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:
$S(i,j) = (I \ast K)(i,j) = \sum_{m}\sum_{n} I(i+m,j+n)\,K(m,n)$
Many machine learning libraries implement cross-correlation but call it *convolution*. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping.
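%% Cell type:markdown id: tags:
The short sketch below illustrates this relation with SciPy, using an arbitrary input and kernel (chosen only for illustration): cross-correlation is identical to convolution with a flipped kernel.
%% Cell type:code id: tags:
``` python
# Convolution vs. cross-correlation on an arbitrary 4x4 input and 2x2 kernel
I = np.arange(16, dtype=float).reshape(4, 4)     # arbitrary input
K_small = np.array([[1., 2.],
                    [0., -1.]])                  # arbitrary kernel
conv = signal.convolve2d(I, K_small, mode='valid')       # convolution (kernel is flipped)
xcorr = signal.correlate2d(I, K_small, mode='valid')     # cross-correlation (no flipping)
conv_flipped_kernel = signal.convolve2d(I, K_small[::-1, ::-1], mode='valid')
print(np.allclose(xcorr, conv_flipped_kernel))   # True: cross-correlation == convolution with flipped kernel
```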
%% Cell type:markdown id: tags:
## 3.3 Examples
### 3.3.1 Example: computing output value of a discrete convolution (from Ref. [3])
Below, we present the calculation of the discrete convolution of an input with a 3x3 kernel $K_{\rm ex}$ (with no padding and stride 1):
$K_{\rm ex} = \begin{pmatrix}
0 & 1 & 2 \\
2 & 2 & 0 \\
0 & 1 & 2
\end{pmatrix}$
![SegmentLocal](assets/convolutional_nn/output_discrete_convolution.png "segment")
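%% Cell type:markdown id: tags:
You can reproduce this kind of calculation yourself with the short sketch below, which slides $K_{\rm ex}$ over an arbitrary 5x5 input (not the one shown in the figure) with no padding and stride 1. Note that `signal.correlate2d` computes the sliding dot product as drawn in such figures, whereas `signal.convolve2d` would first flip the kernel (see Sec. 3.2).
%% Cell type:code id: tags:
``` python
# Apply the 3x3 kernel K_ex of Sec. 3.3.1 to an arbitrary 5x5 input (no padding, stride 1)
k_ex = np.array([[0., 1., 2.],
                 [2., 2., 0.],
                 [0., 1., 2.]])
input_5x5 = np.arange(25, dtype=float).reshape(5, 5)   # arbitrary example input
feature_map = signal.correlate2d(input_5x5, k_ex, mode='valid')   # 3x3 output
print(feature_map)
```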
%% Cell type:markdown id: tags:
### 3.3.2 Example: convolution in practice on real images
%% Cell type:markdown id: tags:
We now perform a convolution operation on real images. We use a photo of Max Planck, and a Berlin landscape.
%% Cell type:code id: tags:
``` python
# this can be skipped because the images are already saved on the server
# retrieve image of Max Planck from wikipedia
print("Retrieving picture of Max Planck. Saving image to './img_max_planck.jpg'.")
urllib.request.urlretrieve("https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Max_Planck_1933.jpg/220px-Max_Planck_1933.jpg", "./img_max_planck.jpg")
# retrieve a picture of Berlin
print("Retrieving picture of Berlin landscape. Saving image to './img_berlin_landscape.jpg'.")
urllib.request.urlretrieve("http://vivalifestyleandtravel.com/images/cache/c-1509326560-44562570.jpg", "./img_berlin_landscape.jpg")
print("Done.")
```
%% Cell type:markdown id: tags:
We define a function to display several images in a single figure; for the purposes of this tutorial, it is not important to understand its implementation.
%% Cell type:code id: tags:
``` python
# function to display multiple images in a single figure
def show_images(images, cols=1, titles=None, cmap='viridis', filename_out=None):
"""Display a list of images in a single figure with matplotlib.
Taken from https://stackoverflow.com/questions/11159436/multiple-figures-in-a-single-window
Parameters:
images: list of np.arrays
Images to be plotted. It must be compatible with plt.imshow.
cols: int, optional, (default = 1)
Number of columns in figure (number of rows is
set to np.ceil(n_images/float(cols))).
titles: list of strings
List of titles corresponding to each image.
"""
plt.clf()
assert ((titles is None) or (len(images) == len(titles)))
n_images = len(images)
if titles is None:
titles = ['Image (%d)' % i for i in range(1, n_images + 1)]
fig = plt.figure()
for n, (image, title) in enumerate(zip(images, titles)):
a = fig.add_subplot(cols, np.ceil(n_images / float(cols)), n + 1)
plt.imshow(image, cmap=cmap)
a.set_title(title, fontsize=40)
a.axis('off') # clear x- and y-axes
fig.set_size_inches(np.array(fig.get_size_inches()) * n_images)
if filename_out is not None:
plt.savefig(filename_out, dpi=100, format='png')
```
%% Cell type:code id: tags:
``` python
# read jpg files as numpy arrays
img_max_planck = plt.imread('./img_max_planck.jpg')[:, :, 0]
img_berlin_landscape = plt.imread('./img_berlin_landscape.jpg')[:, :, 0]
```
%% Cell type:markdown id: tags:
#### Type of Kernels
In Sec. 3.3.1, we used an arbitrarily chosen matrix to perform our convolution; it turns out that there are some "special" kernel matrices that perform specific (and useful) transformations when convolved with an image.
Below, we present some examples of these kernels.
Please visit this page for more details: https://en.wikipedia.org/wiki/Kernel_(image_processing)
$K_{\rm identity } = \begin{pmatrix}
0 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 0
\end{pmatrix}$
$K_{ \rm boxblur} = \dfrac{1}{9}\begin{pmatrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1
\end{pmatrix}$
$K_{\rm gaussianblur3x3} = \dfrac{1}{16}\begin{pmatrix}
1 & 2 & 1 \\
2 & 4 & 2 \\
1 & 2 & 1
\end{pmatrix}$
$K_{\rm gaussianblur5x5} = \dfrac{1}{256}\begin{pmatrix}
1 & 4 & 6 & 4 & 1 \\
4 & 16 & 24 & 16 & 4 \\
6 & 24 & 36 & 24 & 6 \\
4 & 16 & 24 & 16 & 4 \\
1 & 4 & 6 & 4 & 1
\end{pmatrix}$
$K_{\rm vlines} = \begin{pmatrix}
-1 & 2 & -1 \\
-1 & 2 & -1 \\
-1 & 2 & -1
\end{pmatrix}$
$K_{\rm hlines} = \begin{pmatrix}
-1 & -1 & -1 \\
2 & 2 & 2 \\
-1 & -1 & -1
\end{pmatrix}$
$K_{\rm edges} = \begin{pmatrix}
-1 & -1 & -1 \\
-1 & 8 & -1 \\
-1 & -1 & -1
\end{pmatrix}$
$K_{\rm emboss} = \begin{pmatrix}
-2 & -1 & 0 \\
-1 & 1 & 1 \\
0 & 1 & 2
\end{pmatrix}$
%% Cell type:markdown id: tags:
Now we apply the convolution operation to both images (the photo of Max Planck and the Berlin landscape) using each of the kernels above.
In particular, we use the SciPy function `signal.convolve2d` to perform the convolution.
Please refer to the Scipy documentation for more details on this function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html
%% Cell type:code id: tags:
``` python
k_identity = np.array([[0., 0., 0.],
[0., 1., 0.],
[0., 0., 0.]])
k_box_blur = 1./9. * np.array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
# 3x3 Gaussian blur kernel, as defined in the markdown cell above
k_gauss_blur_3x3 = 1./16. * np.array([[1., 2., 1.],
                                      [2., 4., 2.],
                                      [1., 2., 1.]])
k_gauss_blur_5x5 = 1./256.* np.array([[1., 4., 6., 4., 1.],
[4., 16., 24., 16., 4.],
[6., 24., 36., 24., 6.],
[4., 16., 24., 16., 4.],
[1., 4., 6., 4., 1.]])
k_vlines = np.array([[-1., 2., -1.],
[-1., 2., -1.],
[-1., 2., -1.]])
k_hlines = np.array([[-1., -1., -1.],
[ 2., 2., 2.],
[-1., -1., -1.]])
k_edges = np.array([[-1., -1., -1.],
[-1., 8., -1.],
[-1., -1., -1.]])
# the emboss kernel gives the illusion of depth by emphasizing the differences of pixels in a given direction
# in this case, in a direction along a line from the top left to the bottom right.
k_emboss = np.array([[-2., -1., 0.],
[-1., 1., 1.],
[ 0., 1., 2.]])
```
%% Cell type:code id: tags:
``` python
kernels = [k_identity, k_box_blur, k_vlines, k_hlines, k_edges, k_emboss]
titles = ['original', 'box blur', 'vertical lines', 'horizontal lines', 'edges', 'emboss']
# now apply the convolution for each kernel above
max_planck_feature_maps = []
berlin_landscape_feature_maps = []
for kernel in kernels:
max_planck_feature_maps.append(signal.convolve2d(img_max_planck, kernel, boundary='symm', mode='same'))
berlin_landscape_feature_maps.append(signal.convolve2d(img_berlin_landscape, kernel, boundary='symm', mode='same'))
```
%% Cell type:code id: tags:
``` python
show_images(images=max_planck_feature_maps, cols=2, titles=titles, cmap='gray')
```
%% Cell type:code id: tags:
``` python
show_images(images=berlin_landscape_feature_maps, cols=2, titles=titles, cmap='gray')
```
%% Cell type:markdown id: tags:
Looking at the pictures above, we notice that each kernel performed a pre-determined modification:
1. blurring the picture
2. highlighting vertical lines
3. highlighting horizontal lines
4. highlighting edges
5. embossing (i.e. raising the pattern against the background)
As you can see above, the effects are similar for both pictures and are determined by the kernel with which the image is convolved.
In the case of **convolutional neural networks**, the **kernels** are not the ones reported above; instead, they are **learned by the network** from the data (by minimizing the classification error).
%% Cell type:markdown id: tags:
## 4. Convolutional neural network model with Keras
%% Cell type:markdown id: tags:
Now, we build and train a convolutional neural network.
As an example, we use the well-known MNIST dataset, a database of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples.
This is a sample of the hand-written digits present in the database:
![mnist_examples.png](assets/convolutional_nn/MnistExamples.png)
Figure from https://en.wikipedia.org/wiki/File:MnistExamples.png
If you are interested in learning more about this database, please refer to https://en.wikipedia.org/wiki/MNIST_database
%% Cell type:markdown id: tags:
### 4.1 Get the MNIST dataset and load it
%% Cell type:markdown id: tags:
With the code below, we download the MNIST dataset using the built-in Keras module and then reshape and normalize the resulting NumPy arrays.
%% Cell type:code id: tags:
``` python
# this is the code from https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
batch_size = 128
num_classes = 10
# input image dimensions
img_rows, img_cols = 28, 28
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
if K.image_data_format() == 'channels_first':
x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
input_shape = (1, img_rows, img_cols)
else:
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
```
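%% Cell type:markdown id: tags:
To see what the "binary class matrices" (one-hot encoding) produced by `keras.utils.to_categorical` look like, you can inspect a single label (a small optional check):
%% Cell type:code id: tags:
``` python
# The digit 3, one-hot encoded over 10 classes
print(keras.utils.to_categorical(3, num_classes))
# First training label after the conversion above
print(y_train[0])
```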
%% Cell type:markdown id: tags:
### 4.2 Convolutional neural network (without regularization)
We now build a convolutional neural network using Keras, a simple and intuitive high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano; it also runs seamlessly on CPU and GPU.
For more information on Keras, please visit https://keras.io/. Note that this link refers to the newest version of Keras (>2.4), which only supports TensorFlow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the one on multilayer perceptrons) is compatible with versions <=2.3, which allow multiple backends (CNTK, TensorFlow, Theano). There are only slight differences in syntax, and you can find archived documentation at https://github.com/faroit/keras-docs, e.g., for version 2.1.5 at https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use TensorFlow (version <2.0) as the backend.
We start by defining the architecture (i.e. the shape) of the network. We use two convolutional layers, one max-pooling layer, and one fully connected layer. There is no particular reason behind this choice, and other - better performing - choices are possible.
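%% Cell type:markdown id: tags:
Before building the model, you can optionally check which Keras and TensorFlow versions are installed in your environment and which image data format is configured (the exact version strings will depend on your installation).
%% Cell type:code id: tags:
``` python
# Optional sanity check of the installed versions and the Keras backend configuration
print("Keras version:     ", keras.__version__)
print("TensorFlow version:", tf.__version__)
print("Image data format: ", K.image_data_format())
```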
%% Cell type:code id: tags:
``` python
model_no_reg = Sequential()
model_no_reg.add(Conv2D(32, kernel_size=(3, 3),
activation='relu',
input_shape=input_shape))
model_no_reg.add(Conv2D(64, (3, 3), activation='relu'))
model_no_reg.add(MaxPooling2D(pool_size=(2, 2)))
model_no_reg.add(Flatten())
model_no_reg.add(Dense(128, activation='relu'))
model_no_reg.add(Dense(num_classes, activation='softmax', name='preds'))
# compile the model before starting training
model_no_reg.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
# print a model summary
model_no_reg.summary()
```
%% Cell type:markdown id: tags:
Now, we train the neural network; you can decide the number of epochs you want to use. The more epochs, the more times the network will see the training samples, but this will result in an increase in computational time (proportional to `nb_epochs`).
An **epoch** is a single step in training a neural network; one epoch is completed when the neural network has seen every training sample once.
<span style="color:red"> **Run the cell below to start training your first convolutional neural network. For the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please take this time to read carefully the materials above, and maybe check out some external references.**</span>.
%% Cell type:code id: tags:
``` python
nb_epochs=5
# train the model for the specified nb_epochs
history_no_reg = model_no_reg.fit(x_train, y_train,
batch_size=batch_size,
epochs=nb_epochs,
verbose=1,
validation_data=(x_test, y_test))
# evaluate the model performance on the test set
score = model_no_reg.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```
%% Cell type:markdown id: tags:
#### Training history visualization
%% Cell type:code id: tags:
``` python
# Plot training & validation accuracy values
plt.plot(history_no_reg.history['acc'])
plt.plot(history_no_reg.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:code id: tags:
``` python
# Plot training & validation loss values
plt.plot(history_no_reg.history['loss'])
plt.plot(history_no_reg.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:markdown id: tags:
#### Questions
1. Look at the *Accuracy plot* above. Which information can you gather from it? What is happening to the *training* and *test* accuracy over epochs?
2. Is the behavior of *training* and *test* accuracy that you are observing desirable? Is *overfitting* occurring?
3. Should you look at the *training* or at the *test* accuracy to have an estimate of the generalization ability of the model?
4. Compare the *Accuracy plot* with the *Model loss plot* below. Do they show the same qualitative behavior?
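%% Cell type:markdown id: tags:
Before adding regularization, we can also take a quick look at the kernels learned by the first convolutional layer of the network trained above. This is a minimal sketch (it assumes the training cell for `model_no_reg` has been run); the patterns you obtain will differ from run to run.
%% Cell type:code id: tags:
``` python
# Visualize a few of the 3x3 kernels learned by the first Conv2D layer of model_no_reg
first_conv_weights = model_no_reg.layers[0].get_weights()[0]   # shape: (3, 3, 1, 32)
print("First-layer kernel tensor shape:", first_conv_weights.shape)
fig, axes = plt.subplots(2, 4, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(first_conv_weights[:, :, 0, i], cmap='gray')
    ax.set_title('learned kernel %d' % i)
    ax.axis('off')
plt.show()
```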
%% Cell type:markdown id: tags:
### 4.3 Adding regularization (using dropout layers)
%% Cell type:markdown id: tags:
As we discussed in the tutorial on multilayer perceptrons, regularization techniques are extremely useful to improve the generalization ability of machine learning models. We will again use dropout, now in the context of convolutional neural networks, and investigate its influence on model performance.
%% Cell type:code id: tags:
``` python
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
activation='relu',
input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25)) # dropout layer to regularize
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5)) # dropout layer to regularize
model.add(Dense(num_classes, activation='softmax', name='preds'))
# compile the model before starting training
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
# print a model summary
model.summary()
```
%% Cell type:markdown id: tags:
<span style="color:red"> **Run the cell below to start training your *regularized* convolutional neural network. For the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please take this time to read carefully the materials above, and maybe check out some external references.**</span>.
%% Cell type:code id: tags:
``` python
nb_epochs=5
# train the model for the specified nb_epochs
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=nb_epochs,
verbose=1,
validation_data=(x_test, y_test))
# evaluate the model performance on the test set
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```
%% Cell type:code id: tags:
``` python
# Plot training & validation accuracy values
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:code id: tags:
``` python
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:markdown id: tags:
#### Questions
1. Which changes do you see between the training history of the previous (unregularized) and the current (regularized) neural network?
2. If you had to pick one neural network out of the two, which one would you choose and why?
3. How important is regularization? Is it optional or a must for having generalizable models?
%% Cell type:markdown id: tags:
## 5. Opening the black box with attentive response maps
In the previous section, we built a model which classifies the handwritten digits of the MNIST dataset with satisfactory accuracy. But how can we assess which parts of a given image the network utilizes to arrive at its classification decision?
To answer this question, in this tutorial we will compute **attentive response maps**.
The main idea is to invert the data flow of a convolutional neural network, going from the activations of the last layers back to image space. A heatmap is then constructed that shows which parts of the input image activate the network most strongly when a classification decision is made - and thus are the most discriminative.
Specifically, in this tutorial we will use **guided back-propagation**, as introduced in J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, *Striving for Simplicity: The All Convolutional Net*, https://arxiv.org/pdf/1412.6806.pdf (2015), and implemented in the Keras-vis package (https://raghakot.github.io/keras-vis/, https://github.com/raghakot/keras-vis).
This is not the only technique to explain the classification decisions made by convolutional neural networks; some useful references are listed below:
1. M.D. Zeiler, and R. Fergus, *Visualizing and Understanding Convolutional Networks* 818-833, https://doi.org/10.1007/978-3-319-10590-1_53 (2014).
2. K. Simonyan, A. Vedaldi, and A. Zisserman, *Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps* (2014) (https://arxiv.org/pdf/1312.6034v2.pdf)
3. S. Bach, et al. *On pixel-wise explanations for nonlinear classifier decisions by layer-wise relevance propagation*, PLoS ONE 10, e0130140 (2015).
4. G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.R. Müller, *Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit.* 65, 211–222 (2017).
5. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, *Visual Explanations from Deep Networks via Gradient-based Localization*, https://arxiv.org/pdf/1610.02391.pdf (2017)
6. D. Kumar, A. Wong, and G. W. Taylor, *Explaining the Unexplained: A Class-Enhanced Attentive Response (CLEAR) Approach to Understanding Deep Neural Networks*, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1686–1694 (2017).
For an application of convolutional neural network interpretation to a materials science problem:
7. A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, *Insightful classification of crystal structures using deep learning*, Nature Communications 9, 2775 (2018)
%% Cell type:code id: tags:
``` python
class_idx = 0
indices = np.where(y_test[:, class_idx] == 1.)[0]
# pick some random input from here.
idx = indices[0]
# Let's sanity-check the picked image.
plt.rcParams['figure.figsize'] = (18, 6)
plt.imshow(x_test[idx][..., 0])
```
%% Cell type:code id: tags:
``` python
from vis.visualization import visualize_saliency
from vis.utils import utils
from keras import activations
# Utility to search for layer index by name.
# Alternatively we can specify this as -1 since it corresponds to the last layer.
layer_idx = utils.find_layer_idx(model, 'preds')
# Swap softmax with linear
model.layers[layer_idx].activation = activations.linear
model = utils.apply_modifications(model)
```
%% Cell type:code id: tags:
``` python
for modifier in ['guided']:
grads = visualize_saliency(model, layer_idx, filter_indices=class_idx,
seed_input=x_test[idx], backprop_modifier=modifier)
plt.figure()
plt.title(modifier)
plt.imshow(grads, cmap='jet')
```
%% Cell type:code id: tags:
``` python
from vis.visualization import visualize_cam
for class_idx in np.arange(10):
indices = np.where(y_test[:, class_idx] == 1.)[0]
idx = indices[0]
f, ax = plt.subplots(1, 2)
ax[0].imshow(x_test[idx][..., 0])
for i, modifier in enumerate(['guided']):
grads = visualize_cam(model, layer_idx, filter_indices=class_idx,
seed_input=x_test[idx], backprop_modifier=modifier)
ax[i+1].set_title(modifier)
ax[i+1].imshow(grads, cmap='jet')
```
%% Cell type:markdown id: tags:
#### Questions
1. Have a look at the attentive response maps. Do they look reasonable? How do you evaluate their quality?
2. Try to train another neural network - even a bad one - and calculate its attentive response maps. Are they similar to those of this network?
3. Are there materials-science-oriented applications of this technique that can be useful in your research?
%% Cell type:markdown id: tags:
## References:
1. I. Goodfellow, Y. Bengio, and A. Courville, *Deep Learning*, MIT Press (2016), Chap. 9: http://www.deeplearningbook.org/contents/convnets.html
2. V. Dumoulin and F. Visin, *A guide to convolution arithmetic for deep learning*. Article: https://arxiv.org/abs/1603.07285; GitHub: https://github.com/vdumoulin/conv_arithmetic
3. CS231n Convolutional Neural Networks for Visual Recognition (Stanford University): http://cs231n.github.io/convolutional-networks/
4. Code from MNIST example: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
5. Keras-vis MNIST example: https://github.com/raghakot/keras-vis/blob/master/examples/mnist/attention.ipynb
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:markdown id: tags:
<div id="teaser" style=' background-position: right center; background-size: 00px; background-repeat: no-repeat;
padding-top: 20px;
padding-right: 10px;
padding-bottom: 170px;
padding-left: 10px;
border-bottom: 14px double #333;
border-top: 14px double #333;' >
<div style="text-align:center">
<b><font size="6.4">Convolutional neural networks</font></b>
</div>
<p>
created by:
Angelo Ziletti, <sup>1</sup>
Andreas Leitherer,<sup>1</sup>
Matthias Scheffler,<sup>1</sup>
and Luca Ghiringhelli<sup>1</sup> <br><br>
<sup>1</sup> Fritz Haber Institute of the Max Planck Society, Faradayweg 4-6, D-14195 Berlin, Germany <br>
<div>
<img style="float: left;" src="assets/convolutional_nn/Logo_MPG.png" width="200">
<img style="float: right;" src="assets/convolutional_nn/Logo_NOMAD.png" width="250">
</div>
</div>
%% Cell type:markdown id: tags:
In this tutorial, we briefly introduce the main ideas behind convolutional neural networks, build a neural network model with Keras, and explain the classification decision process using attentive response maps.
%% Cell type:markdown id: tags:
## Load packages needed
%% Cell type:markdown id: tags:
We first load the packages that we will need to perform this tutorial.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # Suppress TF warnings
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
from scipy import signal
import scipy.misc
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
```
%% Cell type:markdown id: tags:
## 1. Introduction to Convolutional Neural Networks
This introduction is mainly taken from Ref. [1], to which we refer the interested reader for more details.
%% Cell type:markdown id: tags:
Convolutional networks are a specialized kind of neural network for processing data that has a known **grid-like topology**; they are networks that use convolution in place of general matrix multiplication in at least one of their layers.
Examples of such data include time-series data (1-D grid with samples at regular time intervals) and image data (2-D grid of pixels).
Convolutional networks have been tremendously successful in practical applications, especially in computer vision.
The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
A typical layer of a convolutional network consists of three stages:
1. **Convolution** stage: the layer performs several convolutions in parallel to produce a set of linear activations (see Sec. 3 for more details).
2. **Detector** stage: each linear activation is run through a nonlinear activation function (e.g. rectified linear
activation function, sigmoid or tanh function)
3. **Pooling** stage: a pooling function is used to modify (downsample) the output of the layer. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
#### Max pooling example
![maxpool.jpg](assets/convolutional_nn/maxpool.jpeg)
Figure from http://cs231n.github.io/convolutional-networks/
#### Average pooling example
![avg_pooling_example.png](assets/convolutional_nn/avg_pooling_example.png)
Figure from https://github.com/vdumoulin/conv_arithmetic
%% Cell type:markdown id: tags:
### 2. Motivation
Why one should use convolutional neural networks instead of simple (fully connected) neural networks?
Convolution leverages three important ideas that can help improve a machine learning system:
- **sparse interactions**
- **parameter sharing**
- **equivariant representations**
Moreover, convolution provides a means for working with inputs of variable size - while this is not possible with fully connected neural networks (also called multi-layer perceptrons).
#### 2.1 Sparse interactions
##### Fully connected NN
It uses matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. This do not scale well to full images. For example, an image of 200x200x3 would lead to neurons that have 200x200x3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons. Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
##### CNN
It achieves sparse interactions (sparse connectivity) by making the kernel smaller than the input. When processing an image, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. (*see Sec. 3.3.2 for two concrete examples*).
This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m \times n$ parameters, and the algorithms used in practice have $O(m \times n)$ runtime (per example). If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k \times n$ parameters and $O(k \times n)$ runtime. For many practical applications, $k$ is several orders of magnitude smaller than $m$.
#### 2.2 Parameter sharing
It refers to using the same parameter for more than one function in a model.
##### Fully connected NN
Each element of the weight matrix is used exactly once when computing the output of a layer.
##### CNN
Each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This further reduce the storage requirements of the model to $k$ parameters. Recall that $k$ is usually several orders of magnitude smaller than $m$. Since $m$ and $n$ are usually roughly the same size, $k$ is practically insignificant compared to $m \times n$. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency.
#### 2.3 Equivariant representations
Parameter sharing causes the layer to have **equivariance to translation**. To say a function is equivariant means that if the input changes, the output changes in the same way.
When processing time-series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.
%% Cell type:markdown id: tags:
## 3. The convolution operation
### 3.1 Summary and intuition
The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).
* During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. Intuitively, a convolution can be thought as a sliding (weighted) average.
* As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.
* At this stage, we have an entire set of filters in each convolutional layer (e.g. 12 filters), and each of them produce a separate 2-dimensional activation map. We stack these activation maps along the depth dimension and produce the output volume.
Below, you can see a representation on how the convolution operation is performed.
![AnimationConvolution](assets/convolutional_nn/padding_strides.gif "convolution")
Animation from: https://github.com/vdumoulin/conv_arithmetic/blob/master/gif/padding_strides.gif
%% Cell type:markdown id: tags:
### 3.2 Mathematical formulation - from Ref. [1]
#### Main idea
Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement.
If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship:
$s(t) = \int x(a)w(t− a)da$
This operation is called **convolution**.
The convolution operation is typically denoted with an asterisk:
$s(t) = ( x ∗ w )( t )$
In convolutional network terminology, the first argument (in this example, the function $x$) to the convolution is often referred to as the **input**, and the second argument (in this example, the function $w$) as the **kernel**. The output is sometimes referred to as the **feature map**.
#### Discrete version - 1D [optional]
Let us assume that time index $t$ can then take on only integer values. If we now assume that $x$ and $w$ are defined only on integer $t$, we can define the discrete convolution:
$s(t) = ( x ∗ w )( t ) = \sum_{a=-\infty}^{+\infty} x(a)w(t− a)$
#### Discrete version - 2D [optional]
$S(i,j) = (I ∗ k)(i,j) = \sum_{m}\sum_{n} I(m,n)K(i-m,j-n)$
Convolution is commutative, so we can write:
$S(i,j) = (K ∗ I)(i,j) = \sum_{m}\sum_{n} I(i-m,j-n)K(m,n)$
Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of $m$ and $n$. The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as $m$ increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation.
Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:
$S(i,j) = (I ∗ K)(i,j) = \sum_{m}\sum_{n} I(i+m,j+n)K(m,n)$
Many machine learning libraries implement cross-correlation but call it *convolution*. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping.
%% Cell type:markdown id: tags:
## 3.3 Examples
### 3.3.1 Example: computing output value of a discrete convolution (from Ref. [3])
We present below the calculation of the discrete convolution of a 3x3 kernel $K_{\rm ex}$ (with no padding and stride 1):
$K_{\rm ex} = \begin{pmatrix}
0 & 1 & 2 \\
2 & 2 & 0 \\
0 & 1 & 2
\end{pmatrix}$
![SegmentLocal](assets/convolutional_nn/output_discrete_convolution.png "segment")
%% Cell type:markdown id: tags:
### 3.3.2 Example: convolution in practice on real images
%% Cell type:markdown id: tags:
We now perform a convolution operation on real images. We use a photo of Max Planck, and a Berlin landscape.
%% Cell type:code id: tags:
``` python
# this can be skipped because the images are already saved on the server
# retrieve image of Max Planck from wikipedia
print("Retrieving picture of Max Planck. Saving image to './img_max_planck.jpg'.")
urllib.request.urlretrieve("https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Max_Planck_1933.jpg/220px-Max_Planck_1933.jpg", "./img_max_planck.jpg")
# retrieve a picture of Berlin
print("Retrieving picture of Berlin landscape. Saving image to './img_berlin_landscape.jpg'.")
urllib.request.urlretrieve("http://vivalifestyleandtravel.com/images/cache/c-1509326560-44562570.jpg", "./img_berlin_landscape.jpg")
print("Done.")
```
%% Cell type:markdown id: tags:
We define a function to display images in a single figure; it is not important for the purpose of this tutorial to understand this function implementation.
%% Cell type:code id: tags:
``` python
# function to display multiple images in a single figure
def show_images(images, cols=1, titles=None, cmap='viridis', filename_out=None):
"""Display a list of images in a single figure with matplotlib.
Taken from https://stackoverflow.com/questions/11159436/multiple-figures-in-a-single-window
Parameters:
images: list of np.arrays
Images to be plotted. It must be compatible with plt.imshow.
cols: int, optional, (default = 1)
Number of columns in figure (number of rows is
set to np.ceil(n_images/float(cols))).
titles: list of strings
List of titles corresponding to each image.
"""
plt.clf()
assert ((titles is None) or (len(images) == len(titles)))
n_images = len(images)
if titles is None:
titles = ['Image (%d)' % i for i in range(1, n_images + 1)]
fig = plt.figure()
for n, (image, title) in enumerate(zip(images, titles)):
a = fig.add_subplot(cols, np.ceil(n_images / float(cols)), n + 1)
plt.imshow(image, cmap=cmap)
a.set_title(title, fontsize=40)
a.axis('off') # clear x- and y-axes
fig.set_size_inches(np.array(fig.get_size_inches()) * n_images)
if filename_out is not None:
plt.savefig(filename_out, dpi=100, format='png')
```
%% Cell type:code id: tags:
``` python
# read jpg files as numpy arrays
img_max_planck = plt.imread('./img_max_planck.jpg')[:, :, 0]
img_berlin_landscape = plt.imread('./img_berlin_landscape.jpg')[:, :, 0]
```
%% Cell type:markdown id: tags:
#### Type of Kernels
In Sec. 3.3.1, we used a randomly chosen matrix to perform our convolution; it turns out that there are some "special" kernel matrices that perform specific (and useful) transformation when convoluted with an image.
Below, we present some example of these kernels.
Please visit this page for more details: https://en.wikipedia.org/wiki/Kernel_(image_processing)
$K_{\rm identity } = \begin{pmatrix}
0 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 0
\end{pmatrix}$
$K_{ \rm boxblur} = \dfrac{1}{9}\begin{pmatrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1
\end{pmatrix}$
$K_{\rm gaussianblur3x3} = \dfrac{1}{16}\begin{pmatrix}
1 & 2 & 1 \\
2 & 4 & 2 \\
1 & 2 & 1
\end{pmatrix}$
$K_{\rm gaussianblur5x5} = \dfrac{1}{256}\begin{pmatrix}
1 & 4 & 6 & 4 & 1 \\
4 & 16 & 24 & 16 & 4 \\
6 & 24 & 36 & 24 & 6 \\
4 & 16 & 24 & 16 & 4 \\
1 & 4 & 6 & 4 & 1
\end{pmatrix}$
$K_{\rm vlines} = \begin{pmatrix}
-1 & 2 & -1 \\
-1 & 2 & -1 \\
-1 & 2 & -1
\end{pmatrix}$
$K_{\rm hlines} = \begin{pmatrix}
-1 & -1 & -1 \\
2 & 2 & 2 \\
-1 & -1 & -1
\end{pmatrix}$
$K_{\rm edges} = \begin{pmatrix}
-1 & -1 & -1 \\
-1 & 8 & -1 \\
-1 & -1 & -1
\end{pmatrix}$
$K_{\rm emboss} = \begin{pmatrix}
-2 & -1 & 0 \\
-1 & 1 & 1 \\
0 & 1 & 2
\end{pmatrix}$
%% Cell type:markdown id: tags:
Now we apply the convolution operation on both images (photo of Max Planck and the Berlin landscape) using each of the kernel above.
In particular, we use the Scipy function `signal.convolve2d` to perform the convolution.
Please refer to the Scipy documentation for more details on this function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html
%% Cell type:code id: tags:
``` python
k_identity = np.array([[0., 0., 0.],
[0., 1., 0.],
[0., 0., 0.]])
k_box_blur = 1./9. * np.array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
k_gauss_blur_3x3 = 1./16.* np.array([[0., 0., 0., 5., 0., 0., 0.],
[0., 0., 18., 32., 18., 5., 0.],
[0., 18., 64., 100., 64., 18., 0.],
[5., 32., 100., 100., 100., 32., 5.],
[0., 18., 64., 100., 64., 18., 0.],
[0., 5., 18., 32., 18., 5., 0.],
[0., 0., 0., 5., 0., 0., 0.]])
k_gauss_blur_5x5 = 1./256.* np.array([[1., 4., 6., 4., 1.],
[4., 16., 24., 16., 4.],
[6., 24., 36., 24., 6.],
[4., 16., 24., 16., 4.],
[1., 4., 6., 4., 1.]])
k_vlines = np.array([[-1., 2., -1.],
[-1., 2., -1.],
[-1., 2., -1.]])
k_hlines = np.array([[-1., -1., -1.],
[ 2., 2., 2.],
[-1., -1., -1.]])
k_edges = np.array([[-1., -1., -1.],
[-1., 8., -1.],
[-1., -1., -1.]])
# the emboss kernel gives the illusion of depth by emphasizing the differences of pixels in a given direction
# in this case, in a direction along a line from the top left to the bottom right.
k_emboss = np.array([[-2., -1., 0.],
[-1., 1., 1.],
[ 0., 1., 2.]])
```
%% Cell type:code id: tags:
``` python
kernels = [k_identity, k_box_blur, k_vlines, k_hlines, k_edges, k_emboss]
titles = ['original', 'box blur', 'vertical lines', 'horizontal lines', 'edges', 'emboss']
# now apply the convolution for each kernel above
max_planck_feature_maps = []
berlin_landscape_feature_maps = []
for kernel in kernels:
max_planck_feature_maps.append(signal.convolve2d(img_max_planck, kernel, boundary='symm', mode='same'))
berlin_landscape_feature_maps.append(signal.convolve2d(img_berlin_landscape, kernel, boundary='symm', mode='same'))
```
%% Cell type:code id: tags:
``` python
show_images(images=max_planck_feature_maps, cols=2, titles=titles, cmap='gray')
```
%% Cell type:code id: tags:
``` python
show_images(images=berlin_landscape_feature_maps, cols=2, titles=titles, cmap='gray')
```
%% Cell type:markdown id: tags:
Looking at the pictures above, we notice that each kernel performed a pre-determined modification:
1. blurring the picture
2. highlighting vertical lines
3. highlighting horizontal lines
4. highlighting edges
5. embossing (i.e. raising the pattern against the background)
As you can see above, the effect are similar for both pictures, and it is defined by the kernel with which the image is convolved.
In the case of **convolutional neural networks**, the **kernels** will not be the one reported above, but they are going to be **learned by the network** from the data (by minimizing the classification error).
%% Cell type:markdown id: tags:
## 4. Convolutional neural network model with Keras
%% Cell type:markdown id: tags:
Now, we build and train a convolutional neural network.
As an example, we use the well-known MNIST dataset, a database of handwritten digits with training set of 60,000 examples, and a test set of 10,000 examples.
This is a sample of the hand-written digits present in the database:
![mnist_examples.png](assets/convolutional_nn/MnistExamples.png)
Figure from https://en.wikipedia.org/wiki/File:MnistExamples.png
If you are interested to know more about this database, please refer to https://en.wikipedia.org/wiki/MNIST_database
%% Cell type:markdown id: tags:
### 4.1 Get the MNIST dataset and load it
%% Cell type:markdown id: tags:
With the code below, we download the MNIST dataset using the built-in Keras module, and then recast it in numpy arrays.
%% Cell type:code id: tags:
``` python
# this is the code from https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
batch_size = 128
num_classes = 10
# input image dimensions
img_rows, img_cols = 28, 28
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
if K.image_data_format() == 'channels_first':
x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
input_shape = (1, img_rows, img_cols)
else:
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
```
%% Cell type:markdown id: tags:
### 4.2 Convolutional neural network (without regularization)
We now build a convolutional neural network using Keras, a simple and intuitive high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano; it also runs seamlessly on CPU and GPU.
For more information on Keras, please visit https://keras.io/. Note that this link refers to the newest version of Keras (>2.4), which only supports Tensorflow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the one on multilayer perceptrons) is compatible with versions <=2.3 which allows multiple backends (CNTK, Tensorflow, Theano). There are only slight differences in syntax and you can find archived documentations at https://github.com/faroit/keras-docs, e.g., for
version 2.1.5 https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use tensorflow as backend (version <2.0).
We start by defining the architecture (i.e. the shape) of the network. We use two convolutional layers, one max pooling, and one fully connected layer. There is no particular reason behind this choice, and other - better performing - choices are possible.
%% Cell type:code id: tags:
``` python
model_no_reg = Sequential()
model_no_reg.add(Conv2D(32, kernel_size=(3, 3),
activation='relu',
input_shape=input_shape))
model_no_reg.add(Conv2D(64, (3, 3), activation='relu'))
model_no_reg.add(MaxPooling2D(pool_size=(2, 2)))
model_no_reg.add(Flatten())
model_no_reg.add(Dense(128, activation='relu'))
model_no_reg.add(Dense(num_classes, activation='softmax', name='preds'))
# compile the model before starting training
model_no_reg.compile(loss=keras.losses.categorical_crossentropy,
                     optimizer=keras.optimizers.Adadelta(),
                     metrics=['accuracy'])
# print a model summary
model_no_reg.summary()
```
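%% Cell type:markdown id: tags:
To see where the parameter counts printed by `model_no_reg.summary()` come from: a `Conv2D` layer with $n$ filters of size $k \times k$ acting on $c$ input channels has $n\,(k^2 c + 1)$ trainable parameters (the $+1$ is the bias of each filter). The cell below is only a cross-check, reproducing by hand the numbers reported in the summary for the architecture we just defined.
%% Cell type:code id: tags:
``` python
# Cross-check of the parameter counts reported by model_no_reg.summary()
conv1_params = 32 * (3 * 3 * 1 + 1)        # 32 filters, 3x3 kernel, 1 input channel (+ bias)   -> 320
conv2_params = 64 * (3 * 3 * 32 + 1)       # 64 filters, 3x3 kernel, 32 input channels (+ bias) -> 18496
# after two 3x3 convolutions (28 -> 26 -> 24) and 2x2 max pooling (24 -> 12),
# the flattened feature maps have 12 * 12 * 64 values feeding the 128-unit dense layer
dense_params = (12 * 12 * 64) * 128 + 128  # weights + biases -> 1179776
print(conv1_params, conv2_params, dense_params)
```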
%% Cell type:markdown id: tags:
Now, we train the neural network; you can decide the number of epochs you want to use. The more epochs, the more times the network will see the training samples, but this will also increase the computational time (proportionally to `nb_epochs`).
An **epoch** is a single pass over the training data; one epoch is completed when the neural network has seen every training sample once.
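As a quick back-of-the-envelope check, the cell below computes how many weight updates (mini-batches) one epoch corresponds to, using the `batch_size` and training set defined above.
%% Cell type:code id: tags:
``` python
# Number of mini-batch weight updates per epoch for the settings above
steps_per_epoch = int(np.ceil(x_train.shape[0] / batch_size))
print(steps_per_epoch)  # 60000 samples / batch size 128 -> 469 batches per epoch
```
%% Cell type:markdown id: tags: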
<span style="color:red"> **Run the cell below to start training your first convolutional neural network. With the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please use this time to carefully read the materials above, and maybe check out some external references.**</span>
%% Cell type:code id: tags:
``` python
nb_epochs=5
# train the model for the specified nb_epochs
history_no_reg = model_no_reg.fit(x_train, y_train,
                                  batch_size=batch_size,
                                  epochs=nb_epochs,
                                  verbose=1,
                                  validation_data=(x_test, y_test))
# evaluate the model performance on the test set
score = model_no_reg.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```
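%% Cell type:markdown id: tags:
As a quick sanity check of the trained (unregularized) model, the cell below predicts the class probabilities for a single test image and recovers the predicted digit with `np.argmax`. The choice of the first test image is arbitrary and only meant as an illustration.
%% Cell type:code id: tags:
``` python
# Predict class probabilities for a single test image with the unregularized model
probs = model_no_reg.predict(x_test[:1])
print('predicted digit:', np.argmax(probs[0]))
print('true digit:     ', np.argmax(y_test[0]))
plt.imshow(x_test[0][..., 0])
```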
%% Cell type:markdown id: tags:
#### Training history visualization
%% Cell type:code id: tags:
``` python
# Plot training & validation accuracy values
plt.plot(history_no_reg.history['acc'])
plt.plot(history_no_reg.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:code id: tags:
``` python
# Plot training & validation loss values
plt.plot(history_no_reg.history['loss'])
plt.plot(history_no_reg.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:markdown id: tags:
#### Questions
1. Look at the *Accuracy plot* above. Which information can you gather from it? What is happening to the *training* and *test* accuracy over epochs?
2. Is the behavior of *training* and *test* accuracy that you are observing desirable? Is *overfitting* occurring?
3. Should you look at the *training* or at the *test* accuracy to have an estimate of the generalization ability of the model?
4. Compare the *Accuracy plot* with the *Model loss plot* below. Do they show the same qualitative behavior?
%% Cell type:markdown id: tags:
### 4.3 Adding regularization (using dropout layers)
%% Cell type:markdown id: tags:
As we discussed in the tutorial on multilayer perceptrons, regularization techniques are extremely useful to improve the generalization ability of machine learning models. We will again use dropout, now in the context of convolutional neural networks, and investigate its influence on model performance.
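Before defining the regularized model, the cell below gives a minimal numpy sketch of what (inverted) dropout does during training: each activation is set to zero with probability $p$, and the surviving activations are rescaled by $1/(1-p)$ so that their expected value is unchanged. This is only a conceptual illustration, not the exact implementation used inside Keras' `Dropout` layer.
%% Cell type:code id: tags:
``` python
# Conceptual numpy sketch of inverted dropout (illustration only)
rng = np.random.RandomState(0)
activations = rng.rand(1, 8)                  # some fake layer activations
p = 0.5                                       # dropout rate
mask = rng.rand(*activations.shape) > p       # keep each unit with probability 1 - p
dropped = activations * mask / (1.0 - p)      # rescale so the expected activation is unchanged
print(activations)
print(dropped)
```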
%% Cell type:code id: tags:
``` python
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25)) # dropout layer to regularize
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5)) # dropout layer to regularize
model.add(Dense(num_classes, activation='softmax', name='preds'))
# compile the model before starting training
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
# print a model summary
model.summary()
```
%% Cell type:markdown id: tags:
<span style="color:red"> **Run the cell below to start training your *regularized* convolutional neural network. With the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please use this time to carefully read the materials above, and maybe check out some external references.**</span>
%% Cell type:code id: tags:
``` python
nb_epochs=5
# train the model for the specified nb_epochs
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=nb_epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
# evaluate the model performance on the test set
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```
%% Cell type:code id: tags:
``` python
# Plot training & validation accuracy values
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:code id: tags:
``` python
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
%% Cell type:markdown id: tags:
#### Questions
1. Which changes do you see between the training history of the previous (unregularized) and the current (regularized) neural network?
2. If you had to pick one neural network out of the two, which one would you choose and why?
3. How important is regularization? Is it optional or a must for having generalizable models?
%% Cell type:markdown id: tags:
## 5. Opening the black box with attentive response maps
In the previous section, we built a model which classifies the handwritten digits of the MNIST dataset with satisfactory accuracy. But how can we assess which parts of a given image the network utilizes to arrive at its classification decision?
To answer this question, in this tutorial we will compute **attentive response maps**.
The main idea is to invert the data flow of a convolutional neural network, going from the activations of the last layers back to image space. A heatmap is then constructed showing which parts of the input image contribute most strongly to the classification decision, and are thus the most discriminative.
Specifically, in this tutorial we will use **guided back-propagation**, as introduced in J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, *Striving for Simplicity: The All Convolutional Net*, https://arxiv.org/pdf/1412.6806.pdf (2015), and implemented in the Keras-vis package (https://raghakot.github.io/keras-vis/, https://github.com/raghakot/keras-vis).
This is not the only technique to explain the classification decisions made by convolutional neural networks; some useful references are listed below:
1. M.D. Zeiler, and R. Fergus, *Visualizing and Understanding Convolutional Networks* 818-833, https://doi.org/10.1007/978-3-319-10590-1_53 (2014).
2. K. Simonyan, A. Vedaldi, and A. Zisserman, *Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps* (2014) (https://arxiv.org/pdf/1312.6034v2.pdf)
3. S. Bach, et al. *On pixel-wise explanations for nonlinear classifier decisions by layer-wise relevance propagation*, PLoS ONE 10, e0130140 (2015).
4. G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.R. Müller, *Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit.* 65, 211–222 (2017).
5. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, *Visual Explanations from Deep Networks via Gradient-based Localization*, https://arxiv.org/pdf/1610.02391.pdf (2017)
6. D. Kumar, A. Wong, and G. W. Taylor, *Explaining the unexplained: a class-enhanced attentive response (CLEAR) approach to understanding deep neural networks*, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1686-1694 (2017).
For an application of convolutional neural network interpretation to a materials science problem:
7. A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, *Insightful classification of crystal structures using deep learning*, Nature Communications 9, 2775 (2018)
%% Cell type:code id: tags:
``` python
class_idx = 0
indices = np.where(y_test[:, class_idx] == 1.)[0]
# pick the first test image belonging to this class
idx = indices[0]
# let's sanity-check the picked image by plotting it
plt.rcParams['figure.figsize'] = (18, 6)
plt.imshow(x_test[idx][..., 0])
```
%% Cell type:code id: tags:
``` python
from vis.visualization import visualize_saliency
from vis.utils import utils
from keras import activations
# Utility to search for layer index by name.
# Alternatively we can specify this as -1 since it corresponds to the last layer.
layer_idx = utils.find_layer_idx(model, 'preds')
# Swap softmax with linear
model.layers[layer_idx].activation = activations.linear
model = utils.apply_modifications(model)
```
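%% Cell type:markdown id: tags:
Before applying guided back-propagation via keras-vis, the cell below sketches the simplest possible attentive response map: the plain gradient of the class score with respect to the input pixels (the saliency map of Simonyan et al., Ref. 2 in the list above). Guided back-propagation refines this idea by additionally discarding negative gradients at every ReLU during the backward pass. The sketch reuses `class_idx` and `idx` from the cells above and assumes the TensorFlow 1.x / Keras 2.2 backend used in this tutorial.
%% Cell type:code id: tags:
``` python
# Plain (unguided) gradient saliency map, for comparison with guided back-propagation below
score = model.output[:, class_idx]                           # (now linear) score of the chosen class
grad_tensor = K.gradients(score, model.input)[0]             # d(score) / d(input pixels)
grad_fn = K.function([model.input, K.learning_phase()], [grad_tensor])
raw_grads = grad_fn([x_test[idx:idx + 1], 0])[0][0, ..., 0]  # gradients for the selected test image
plt.figure()
plt.title('plain gradient saliency (sketch)')
plt.imshow(np.abs(raw_grads), cmap='jet')
```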
%% Cell type:code id: tags:
``` python
for modifier in ['guided']:
    grads = visualize_saliency(model, layer_idx, filter_indices=class_idx,
                               seed_input=x_test[idx], backprop_modifier=modifier)
    plt.figure()
    plt.title(modifier)
    plt.imshow(grads, cmap='jet')
```
%% Cell type:code id: tags:
``` python
from vis.visualization import visualize_cam
for class_idx in np.arange(10):
    indices = np.where(y_test[:, class_idx] == 1.)[0]
    idx = indices[0]
    f, ax = plt.subplots(1, 2)
    ax[0].imshow(x_test[idx][..., 0])
    for i, modifier in enumerate(['guided']):
        grads = visualize_cam(model, layer_idx, filter_indices=class_idx,
                              seed_input=x_test[idx], backprop_modifier=modifier)
        ax[i+1].set_title(modifier)
        ax[i+1].imshow(grads, cmap='jet')
```
%% Cell type:markdown id: tags:
#### Questions
1. Have a look at the attentive response maps. Do they look reasonable? How do you evaluate their quality?
2. Try to train another neural network, even a bad one, and calculate its attentive response maps. Are they similar to those of this network?
3. Are there materials-science-oriented applications of this technique that can be useful in your research?
%% Cell type:markdown id: tags:
## References:
1. *Deep learning*, Goodfellow, Bengio, Courville, MIT Press 2016, Chap. 9 http://www.deeplearningbook.org/contents/convnets.html
2. *A guide to convolution arithmetic for deep learning*, Article: https://arxiv.org/abs/1603.07285; Github: https://github.com/vdumoulin/conv_arithmetic
3. CS231n Convolutional Neural Networks for Visual Recognition (Stanford University): http://cs231n.github.io/convolutional-networks/
4. Code from MNIST example: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
5. Keras-vis MNIST example: https://github.com/raghakot/keras-vis/blob/master/examples/mnist/attention.ipynb
%% Cell type:code id: tags:
``` python
```