In this tutorial, we briefly introduce the main ideas behind convolutional neural networks, build a neural network model with Keras, and explain the classification decision process using attentive response maps.

%% Cell type:markdown id: tags:

## Load packages needed

%% Cell type:markdown id: tags:

We first load the packages that we will need to perform this tutorial.

## 1. Introduction to Convolutional Neural Networks

This introduction is mainly taken from Ref. [1], to which we refer the interested reader for more details.

%% Cell type:markdown id: tags:

Convolutional networks are a specialized kind of neural network for processing data that has a known **grid-like topology**; they are networks that use convolution in place of general matrix multiplication in at least one of their layers.

Examples of such data include time-series data (1-D grid with samples at regular time intervals) and image data (2-D grid of pixels).

Convolutional networks have been tremendously successful in practical applications, especially in computer vision.

The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.

A typical layer of a convolutional network consists of three stages:

1. **Convolution** stage: the layer performs several convolutions in parallel to produce a set of linear activations (see Sec. 3 for more details).

2. **Detector** stage: each linear activation is run through a nonlinear activation function (e.g. the rectified linear activation function, sigmoid, or tanh).

3. **Pooling** stage: a pooling function is used to modify (downsample) the output of the layer. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
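The detector and pooling stages can be sketched in a few lines of NumPy. This is only an illustration: the 4x4 array of linear activations is a hypothetical stand-in for the output of the convolution stage, and the helper names `relu` and `max_pool_2x2` are not part of the tutorial.

```python
import numpy as np

def relu(x):
    """Detector stage: rectified linear activation."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """Pooling stage: maximum over non-overlapping 2x2 neighborhoods."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]  # drop a trailing odd row/column, if any
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical linear activations produced by the convolution stage.
linear_activations = np.array([[ 1., -2.,  3.,  0.],
                               [-1.,  5., -3.,  2.],
                               [ 0., -4.,  1.,  1.],
                               [ 2.,  2., -6., -1.]])

detected = relu(linear_activations)   # negative values are set to zero
pooled = max_pool_2x2(detected)       # 4x4 map is downsampled to 2x2
print(pooled)
```

Each entry of `pooled` summarizes one 2x2 neighborhood, so the spatial resolution is halved along both axes.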

Figure from https://github.com/vdumoulin/conv_arithmetic

%% Cell type:markdown id: tags:

## 2. Motivation

Why should one use convolutional neural networks instead of simple (fully connected) neural networks?

Convolution leverages three important ideas that can help improve a machine learning system:

- **sparse interactions**

- **parameter sharing**

- **equivariant representations**

Moreover, convolution provides a means for working with inputs of variable size, which is not possible with fully connected neural networks (also called multi-layer perceptrons).

#### 2.1 Sparse interactions

##### Fully connected NN

It uses matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. This does not scale well to full images. For example, for an image of size 200x200x3, each neuron in the first fully connected layer would have 200x200x3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons. Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.

##### CNN

It achieves sparse interactions (sparse connectivity) by making the kernel smaller than the input. When processing an image, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels (*see Sec. 3.3.2 for two concrete examples*).

This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m \times n$ parameters, and the algorithms used in practice have $O(m \times n)$ runtime (per example). If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k \times n$ parameters and $O(k \times n)$ runtime. For many practical applications, $k$ is several orders of magnitude smaller than $m$.

#### 2.2 Parameter sharing

It refers to using the same parameter for more than one function in a model.

##### Fully connected NN

Each element of the weight matrix is used exactly once when computing the output of a layer.

##### CNN

Each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This further reduces the storage requirements of the model to $k$ parameters. Recall that $k$ is usually several orders of magnitude smaller than $m$. Since $m$ and $n$ are usually roughly the same size, $k$ is practically insignificant compared to $m \times n$. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of memory requirements and statistical efficiency.
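To make these counts concrete, the following sketch compares the number of parameters in the three settings. The sizes are assumptions chosen for illustration: $m = n = 200 \times 200 \times 3$ (the image size used above) and a 5x5x3 kernel, i.e. $k = 75$; the function names are hypothetical.

```python
def dense_params(m, n):
    """Fully connected: one parameter per (input, output) pair."""
    return m * n

def sparse_params(k, n):
    """Sparse connectivity without sharing: k parameters per output."""
    return k * n

def shared_params(k):
    """Sparse connectivity with parameter sharing: one kernel of k parameters."""
    return k

m = 200 * 200 * 3   # number of inputs (assumed image size)
n = m               # number of outputs, assumed to be the same size as the input
k = 5 * 5 * 3       # connections per output (assumed kernel size)

dense = dense_params(m, n)
sparse = sparse_params(k, n)
shared = shared_params(k)

print(f"dense:  {dense:,}")
print(f"sparse: {sparse:,}")
print(f"shared: {shared:,}")
```

Under these assumptions, sparse connectivity alone reduces the count by a factor of $m/k = 1600$, and parameter sharing reduces it by a further factor of $n$.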

#### 2.3 Equivariant representations

Parameter sharing causes the layer to have **equivariance to translation**. To say a function is equivariant means that if the input changes, the output changes in the same way.

When processing time-series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.
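Translation equivariance can be checked numerically on a toy time series. In this sketch the signal, the kernel, and the shift of 4 time steps are hypothetical values; `np.convolve` and `np.roll` are standard NumPy functions.

```python
import numpy as np

kernel = np.array([1.0, 0.0, -1.0])   # hypothetical kernel

signal = np.zeros(12)
signal[3:5] = 1.0                      # an "event" at time steps 3 and 4
shifted = np.roll(signal, 4)           # the same event, 4 time steps later

out = np.convolve(signal, kernel, mode="full")
out_shifted = np.convolve(shifted, kernel, mode="full")

# Shifting the input and then convolving gives the same result as
# convolving first and then shifting the output by the same amount.
equivariant = np.allclose(np.roll(out, 4), out_shifted)
print(equivariant)
```

The shift is chosen small enough that nothing wraps around the array boundary, so `np.roll` acts as a pure translation here.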

%% Cell type:markdown id: tags:

## 3. The convolution operation

### 3.1 Summary and intuition

The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).

* During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. Intuitively, a convolution can be thought of as a sliding (weighted) average.

* As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

* At this stage, we have an entire set of filters in each convolutional layer (e.g. 12 filters), and each of them produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension and produce the output volume.
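The forward pass described in the bullets above can be sketched with explicit loops. This is a deliberately slow, illustrative implementation, not how Keras computes it: the function name `conv_layer_forward`, the 32x32x3 input, and the random filter values are assumptions, and the sliding dot product is computed without padding and with stride 1.

```python
import numpy as np

def conv_layer_forward(volume, filters):
    """Slide every filter over the input volume (no padding, stride 1).

    volume:  array of shape (H, W, C)
    filters: array of shape (F, kh, kw, C)
    returns: output volume of shape (H - kh + 1, W - kw + 1, F)
    """
    H, W, C = volume.shape
    F, kh, kw, Cf = filters.shape
    assert C == Cf, "each filter must extend through the full input depth"
    out = np.zeros((H - kh + 1, W - kw + 1, F))
    for f in range(F):                      # one activation map per filter
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = volume[i:i + kh, j:j + kw, :]
                out[i, j, f] = np.sum(patch * filters[f])   # dot product
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                 # hypothetical RGB input
filters = rng.standard_normal((12, 5, 5, 3))    # 12 filters of size 5x5x3

output_volume = conv_layer_forward(image, filters)
print(output_volume.shape)
```

The last axis of `output_volume` indexes the filters: the 12 activation maps are stacked along the depth dimension.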

Below, you can see a representation of how the convolution operation is performed.

Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement.

If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship:

$s(t) = \int x(a)\,w(t-a)\,da$

This operation is called **convolution**.

The convolution operation is typically denoted with an asterisk:

$s(t) = (x \ast w)(t)$

In convolutional network terminology, the first argument (in this example, the function $x$) to the convolution is often referred to as the **input**, and the second argument (in this example, the function $w$) as the **kernel**. The output is sometimes referred to as the **feature map**.

#### Discrete version - 1D [optional]

Let us assume that the time index $t$ can take on only integer values. If we now assume that $x$ and $w$ are defined only on integer $t$, we can define the discrete convolution:

$s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{+\infty} x(a)\,w(t-a)$
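For finite sequences the sum has only finitely many non-zero terms and can be written out directly. The sketch below assumes that $x$ and $w$ are zero outside the stored samples; the values of `x` and `w` and the function name `discrete_convolution` are hypothetical.

```python
import numpy as np

def discrete_convolution(x, w):
    """s(t) = sum_a x(a) w(t - a), with x and w zero outside their stored samples."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):        # w(t - a) is zero outside this range
                s[t] += x[a] * w[t - a]
    return s

x = np.array([1.0, 2.0, 4.0, 3.0])   # hypothetical noisy measurements
w = np.array([0.5, 0.3, 0.2])        # hypothetical weights; w[0] is applied to the most recent sample

s = discrete_convolution(x, w)
print(s)
print(np.convolve(x, w))             # NumPy's implementation of the same sum
```

The explicit double loop and `np.convolve` compute the same quantity; the loop is shown only to mirror the formula term by term.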

#### Discrete version - 2D [optional]

For a two-dimensional image $I$ as input and a two-dimensional kernel $K$, the discrete convolution reads:

$S(i,j) = (I \ast K)(i,j) = \sum_{m}\sum_{n} I(m,n)K(i-m,j-n)$

Convolution is commutative, so we can equivalently write:

$S(i,j) = (K \ast I)(i,j) = \sum_{m}\sum_{n} I(i-m,j-n)K(m,n)$

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of $m$ and $n$. The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as $m$ increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation.

Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

$S(i,j) = (I \ast K)(i,j) = \sum_{m}\sum_{n} I(i+m,j+n)K(m,n)$

Many machine learning libraries implement cross-correlation but call it *convolution*. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping.
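The relation between the two operations can be verified with SciPy, which provides both `signal.convolve2d` and `signal.correlate2d`. In this sketch the image is random and the (deliberately asymmetric) kernel is a hypothetical choice.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
image = rng.random((6, 6))                 # hypothetical input
kernel = np.array([[1.,  2.,  0.],
                   [0., -1.,  3.],
                   [4.,  0., -2.]])        # hypothetical asymmetric kernel

conv = signal.convolve2d(image, kernel, mode="valid")
xcorr = signal.correlate2d(image, kernel, mode="valid")

# Cross-correlation with the kernel flipped along both axes equals convolution.
xcorr_flipped = signal.correlate2d(image, kernel[::-1, ::-1], mode="valid")

same_after_flip = np.allclose(conv, xcorr_flipped)
differ_without_flip = not np.allclose(conv, xcorr)
print(same_after_flip, differ_without_flip)
```

For a kernel that is symmetric under this flip, the two operations coincide; this is why the distinction is invisible for many hand-designed kernels and irrelevant for learned ones.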

%% Cell type:markdown id: tags:

### 3.3 Examples

#### 3.3.1 Example: computing output value of a discrete convolution (from Ref. [3])

We present below the calculation of the discrete convolution of a 3x3 kernel $K_{\rm ex}$ (with no padding and stride 1):
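A minimal sketch of such a calculation is given here. The 5x5 input and the 3x3 kernel are hypothetical values, not the ones from Ref. [3], and the function name `conv2d_valid` is an assumption. The output is computed as a sliding dot product without kernel flipping, which is the convention most deep learning libraries use.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Sliding dot product of a kernel over an image, without padding."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical input: entry (i, j) is 5*i + j
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                 # hypothetical kernel

result = conv2d_valid(image, kernel, stride=1)
print(result.shape)
print(result)
```

With no padding and stride 1, a 5x5 input and a 3x3 kernel give a 3x3 output. Since the input increases by 1 per column, each row of the kernel contributes $-2$, so every output value is $-6$.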

We define a function to display images in a single figure; it is not important for the purpose of this tutorial to understand this function implementation.

%% Cell type:code id: tags:

``` python

# function to display multiple images in a single figure

```

In Sec. 3.3.1, we used a randomly chosen matrix to perform our convolution; it turns out that there are some "special" kernel matrices that perform specific (and useful) transformations when convolved with an image.

Below, we present some examples of these kernels.

Please visit this page for more details: https://en.wikipedia.org/wiki/Kernel_(image_processing)

Now we apply the convolution operation to both images (the photo of Max Planck and the Berlin landscape) using each of the kernels above.

In particular, we use the SciPy function `signal.convolve2d` to perform the convolution.

Please refer to the SciPy documentation for more details on this function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html
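As a rough illustration of how `signal.convolve2d` is used with such kernels, the sketch below applies five commonly used 3x3 kernels to a synthetic image containing a single vertical line. The kernel values are common textbook choices and are an assumption: they are not necessarily the ones used in this tutorial, and the synthetic image stands in for the photographs.

```python
import numpy as np
from scipy import signal

# Commonly used 3x3 kernels (assumed values, for illustration only).
kernels = {
    "blur": np.ones((3, 3)) / 9.0,
    "vertical lines": np.array([[-1, 2, -1]] * 3, dtype=float),
    "horizontal lines": np.array([[-1, -1, -1],
                                  [ 2,  2,  2],
                                  [-1, -1, -1]], dtype=float),
    "edges": np.array([[-1, -1, -1],
                       [-1,  8, -1],
                       [-1, -1, -1]], dtype=float),
    "emboss": np.array([[-2, -1, 0],
                        [-1,  1, 1],
                        [ 0,  1, 2]], dtype=float),
}

# Synthetic 9x9 image with one bright vertical line in column 4.
image = np.zeros((9, 9))
image[:, 4] = 1.0

# mode="same" keeps the image size; boundary="symm" mirrors the image at its borders.
responses = {name: signal.convolve2d(image, k, mode="same", boundary="symm")
             for name, k in kernels.items()}

print(responses["vertical lines"][4])     # strong response on the line
print(responses["horizontal lines"][4])   # no response: there is no horizontal structure
```

The vertical-line kernel responds strongly along column 4, whereas the horizontal-line kernel gives zero everywhere for this input.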

Looking at the pictures above, we notice that each kernel performed a pre-determined modification:

1. blurring the picture

2. highlighting vertical lines

3. highlighting horizontal lines

4. highlighting edges

5. embossing (i.e. raising the pattern against the background)

As you can see above, the effects are similar for both pictures; they are determined by the kernel with which the image is convolved.

In the case of **convolutional neural networks**, the **kernels** are not the ones reported above; instead, they are **learned by the network** from the data (by minimizing the classification error).

%% Cell type:markdown id: tags:

## 4. Convolutional neural network model with Keras

%% Cell type:markdown id: tags:

Now, we build and train a convolutional neural network.

As an example, we use the well-known MNIST dataset, a database of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples.
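Before the images can be fed to a convolutional network, they are typically scaled and reshaped, and the labels are one-hot encoded. The sketch below shows one plausible preprocessing under stated assumptions: a channels-last layout of shape (28, 28, 1), pixel values scaled to [0, 1], and 10 classes. The helper `prepare_mnist_arrays` is hypothetical, and random stand-in arrays are used so that the example runs without downloading the dataset.

```python
import numpy as np

def prepare_mnist_arrays(images, labels, num_classes=10):
    """Scale pixels to [0, 1], add a channel axis, and one-hot encode the labels."""
    x = images.astype("float32") / 255.0
    x = x.reshape(-1, 28, 28, 1)                        # channels-last layout
    y = np.eye(num_classes, dtype="float32")[labels]    # one-hot encoding
    return x, y

# With Keras installed, the raw arrays can be obtained with
#   from keras.datasets import mnist
#   (x_train, y_train), (x_test, y_test) = mnist.load_data()
# Here, random stand-in data with the same layout is used instead.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=8)

x, y = prepare_mnist_arrays(fake_images, fake_labels)
input_shape = x.shape[1:]
print(x.shape, y.shape, input_shape)
```

The resulting `input_shape` is the kind of value passed to the first convolutional layer of the model.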

This is a sample of the hand-written digits present in the database:

We now build a convolutional neural network using Keras, a simple and intuitive high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano; it also runs seamlessly on CPU and GPU.

For more information on Keras, please visit https://keras.io/. Note that this link refers to the newest version of Keras (>2.4), which only supports TensorFlow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the one on multilayer perceptrons) is compatible with versions <=2.3, which allow multiple backends (CNTK, TensorFlow, Theano). There are only slight differences in syntax, and you can find archived documentation at https://github.com/faroit/keras-docs, e.g., for version 2.1.5 at https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use TensorFlow as backend (version <2.0).

We start by defining the architecture (i.e. the shape) of the network. We use two convolutional layers, one max pooling layer, and one fully connected layer. There is no particular reason behind this choice, and other, better performing, choices are possible.

Now, we train the neural network; you can choose the number of epochs. The more epochs, the more times the network sees the training samples, but the computational time increases accordingly (it is proportional to `nb_epochs`).

An **epoch** is a single step in training a neural network; one epoch is completed when the neural network has seen every training sample once.
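The number of weight updates performed during training follows directly from this definition. In the sketch below, the batch size of 128 is a hypothetical value (the tutorial's `batch_size` is defined elsewhere), and the helper `updates_per_epoch` is not part of the tutorial.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every training sample once."""
    return math.ceil(n_samples / batch_size)   # the last batch may be smaller

n_train = 60000     # size of the MNIST training set
batch_size = 128    # hypothetical value
nb_epochs = 5

steps = updates_per_epoch(n_train, batch_size)
total_updates = steps * nb_epochs
print(steps, total_updates)
```

This also makes explicit why the training time grows linearly with `nb_epochs`: each additional epoch adds the same number of mini-batch updates.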

<span style="color:red">**Run the cell below to start training your first convolutional neural network. For the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please take this time to carefully read the material above, and maybe check out some external references.**</span>

As we discussed in the tutorial on multilayer perceptrons, regularization techniques are extremely useful to improve the generalization ability of machine learning models. We will again use dropout, now in the context of convolutional neural networks, and investigate its influence on model performance.
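As a reminder of what a dropout layer does, here is a minimal NumPy sketch of so-called inverted dropout, where the surviving activations are rescaled during training so that no rescaling is needed at test time. The function name `dropout` and the array sizes are assumptions made for illustration; this is not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations                    # dropout is inactive at test time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob     # rescale to preserve the expected value

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                      # hypothetical layer activations

dropped = dropout(a, rate=0.25, rng=rng)
print((dropped == 0).mean())                  # fraction of dropped units, close to 0.25
print(dropped.mean())                         # close to 1.0, the mean before dropout
```

Because a different random subset of units is silenced at every update, the network cannot rely on any single unit, which tends to reduce overfitting.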

%% Cell type:code id: tags:

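``` python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# input_shape is assumed to be defined earlier, e.g. (28, 28, 1) for MNIST
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))  # dropout layer to regularize
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))  # dropout layer to regularize
```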

<span style="color:red">**Run the cell below to start training your *regularized* convolutional neural network. For the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please take this time to carefully read the material above, and maybe check out some external references.**</span>

%% Cell type:code id: tags:

``` python

nb_epochs = 5

# train the model for the specified number of epochs
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=nb_epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

# evaluate the model performance on the test set
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

```

%% Cell type:code id: tags:

``` python

# Plot training & validation accuracy values
# (the key is 'acc' or 'accuracy', depending on the Keras version
# and on the metric name passed to compile())
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

```

%% Cell type:code id: tags:

``` python

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

```

%% Cell type:markdown id: tags:

#### Questions

1. Which changes do you see between the training history of the previous (unregularized) and the current (regularized) neural network?

2. If you had to pick one neural network out of the two, which one would you choose and why?

3. How important is regularization? Is it optional or a must for having generalizable models?

%% Cell type:markdown id: tags:

## 5. Opening the black box with attentive response maps

In the previous section, we built a model which classifies the handwritten digits of the MNIST dataset with satisfactory accuracy. But how can we assess which parts of a given image the network utilizes to arrive at its classification decision?

To answer this question, in this tutorial we will compute **attentive response maps**.

The main idea is to invert the data flow of a convolutional neural network, going from the activations of the last layer back to image space. Then, a heatmap is constructed to show which parts of the input image are most strongly activated when a classification decision is made, and thus are the most discriminative.

Specifically, in this tutorial we will use **guided back-propagation**, as introduced in J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, *Striving for Simplicity: The All Convolutional Net*, https://arxiv.org/pdf/1412.6806.pdf (2015), and implemented in the Keras-vis package (https://raghakot.github.io/keras-vis/, https://github.com/raghakot/keras-vis).
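The key ingredient of guided back-propagation is a modified backward rule for the ReLU activation: a gradient is passed back only where the forward input was positive (as in standard back-propagation) *and* the incoming gradient is positive. The following NumPy sketch contrasts the two rules on hypothetical values; it illustrates the rule only and is not the Keras-vis implementation.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward_standard(x, grad_out):
    """Standard back-propagation: pass gradients where the forward input was positive."""
    return grad_out * (x > 0)

def relu_backward_guided(x, grad_out):
    """Guided back-propagation: additionally discard negative incoming gradients."""
    return grad_out * (x > 0) * (grad_out > 0)

x = np.array([-1.0, 2.0, 3.0, -4.0])          # hypothetical inputs to a ReLU
grad_out = np.array([0.5, -0.7, 0.9, 1.2])    # hypothetical gradients from the layer above

standard = relu_backward_standard(x, grad_out)
guided = relu_backward_guided(x, grad_out)
print(standard)
print(guided)
```

Only the third entry survives under the guided rule: its forward input and its incoming gradient are both positive. Discarding negative gradients in this way is what yields sharper, less noisy maps in image space.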

This is not the only technique to explain the classification decisions made by convolutional neural networks; some useful references are listed below:

1. M. D. Zeiler and R. Fergus, *Visualizing and Understanding Convolutional Networks*, in Computer Vision – ECCV 2014, pp. 818–833, https://doi.org/10.1007/978-3-319-10590-1_53 (2014).

2. K. Simonyan, A. Vedaldi, and A. Zisserman, *Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps*, https://arxiv.org/pdf/1312.6034v2.pdf (2014).

3. S. Bach et al., *On pixel-wise explanations for nonlinear classifier decisions by layer-wise relevance propagation*, PLoS ONE 10, e0130140 (2015).

4. G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.R. Müller, *Explaining nonlinear classification decisions with deep Taylor decomposition*, Pattern Recognit. 65, 211–222 (2017).

5. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, *Visual Explanations from Deep Networks via Gradient-based Localization*, https://arxiv.org/pdf/1610.02391.pdf (2017).

6. D. Kumar, A. Wong, and G. W. Taylor, *Explaining the unexplained: a class-enhanced attentive response (CLEAR) approach to understanding deep neural networks*, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1686–1694 (2017).

For an application of convolutional neural network interpretation to a materials science problem:

7. A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, *Insightful classification of crystal structures using deep learning*, Nature Communications 9, 2775 (2018)

%% Cell type:code id: tags:

``` python

class_idx = 0
indices = np.where(y_test[:, class_idx] == 1.)[0]

# pick the first test image of this class
idx = indices[0]

# Let's sanity check the picked image.
plt.rcParams['figure.figsize'] = (18, 6)
plt.imshow(x_test[idx][..., 0])

```

%% Cell type:code id: tags:

``` python
from vis.visualization import visualize_saliency
from vis.utils import utils
from keras import activations

# Utility to search for layer index by name.
# Alternatively we can specify this as -1 since it corresponds to the last layer.
```

2. *A guide to convolution arithmetic for deep learning*, Article: https://arxiv.org/abs/1603.07285; GitHub: https://github.com/vdumoulin/conv_arithmetic

In this tutorial, we briefly introduce the main ideas behind convolutional neural networks, build a neural network model with Keras, and explain the classification decision process using attentive response maps.

%% Cell type:markdown id: tags:

## Load packages needed

%% Cell type:markdown id: tags:

We first load the packages that we will need to perform this tutorial.

## 1. Introduction to Convolutional Neural Networks

This introduction is mainly taken from Ref. [1], to which we refer the interested reader for more details.

%% Cell type:markdown id: tags:

Convolutional networks are a specialized kind of neural network for processing data that has a known **grid-like topology**; they are networks that use convolution in place of general matrix multiplication in at least one of their layers.

Examples of such data include time-series data (1-D grid with samples at regular time intervals) and image data (2-D grid of pixels).

Convolutional networks have been tremendously successful in practical applications, especially in computer vision.

The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.

A typical layer of a convolutional network consists of three stages:

1.**Convolution** stage: the layer performs several convolutions in parallel to produce a set of linear activations (see Sec. 3 for more details).

2.**Detector** stage: each linear activation is run through a nonlinear activation function (e.g. rectified linear

activation function, sigmoid or tanh function)

3.**Pooling** stage: a pooling function is used to modify (downsample) the output of the layer. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.

Figure from https://github.com/vdumoulin/conv_arithmetic

%% Cell type:markdown id: tags:

### 2. Motivation

Why one should use convolutional neural networks instead of simple (fully connected) neural networks?

Convolution leverages three important ideas that can help improve a machine learning system:

-**sparse interactions**

-**parameter sharing**

-**equivariant representations**

Moreover, convolution provides a means for working with inputs of variable size - while this is not possible with fully connected neural networks (also called multi-layer perceptrons).

#### 2.1 Sparse interactions

##### Fully connected NN

It uses matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. This do not scale well to full images. For example, an image of 200x200x3 would lead to neurons that have 200x200x3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons. Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.

##### CNN

It achieves sparse interactions (sparse connectivity) by making the kernel smaller than the input. When processing an image, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. (*see Sec. 3.3.2 for two concrete examples*).

This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m \times n$ parameters, and the algorithms used in practice have $O(m \times n)$ runtime (per example). If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k \times n$ parameters and $O(k \times n)$ runtime. For many practical applications, $k$ is several orders of magnitude smaller than $m$.

#### 2.2 Parameter sharing

It refers to using the same parameter for more than one function in a model.

##### Fully connected NN

Each element of the weight matrix is used exactly once when computing the output of a layer.

##### CNN

Each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This further reduce the storage requirements of the model to $k$ parameters. Recall that $k$ is usually several orders of magnitude smaller than $m$. Since $m$ and $n$ are usually roughly the same size, $k$ is practically insignificant compared to $m \times n$. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency.

#### 2.3 Equivariant representations

Parameter sharing causes the layer to have **equivariance to translation**. To say a function is equivariant means that if the input changes, the output changes in the same way.

When processing time-series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.

%% Cell type:markdown id: tags:

## 3. The convolution operation

### 3.1 Summary and intuition

The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).

* During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. Intuitively, a convolution can be thought as a sliding (weighted) average.

* As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

* At this stage, we have an entire set of filters in each convolutional layer (e.g. 12 filters), and each of them produce a separate 2-dimensional activation map. We stack these activation maps along the depth dimension and produce the output volume.

Below, you can see a representation on how the convolution operation is performed.

Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement.

If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship:

$s(t) = \int x(a)w(t− a)da$

This operation is called **convolution**.

The convolution operation is typically denoted with an asterisk:

$s(t) = ( x ∗ w )( t )$

In convolutional network terminology, the first argument (in this example, the function $x$) to the convolution is often referred to as the **input**, and the second argument (in this example, the function $w$) as the **kernel**. The output is sometimes referred to as the **feature map**.

#### Discrete version - 1D [optional]

Let us assume that time index $t$ can then take on only integer values. If we now assume that $x$ and $w$ are defined only on integer $t$, we can define the discrete convolution:

$s(t) = ( x ∗ w )( t ) = \sum_{a=-\infty}^{+\infty} x(a)w(t− a)$

#### Discrete version - 2D [optional]

$S(i,j) = (I ∗ k)(i,j) = \sum_{m}\sum_{n} I(m,n)K(i-m,j-n)$

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of $m$ and $n$. The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as $m$ increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation.

Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

$S(i,j) = (I ∗ K)(i,j) = \sum_{m}\sum_{n} I(i+m,j+n)K(m,n)$

Many machine learning libraries implement cross-correlation but call it *convolution*. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping.

%% Cell type:markdown id: tags:

## 3.3 Examples

### 3.3.1 Example: computing output value of a discrete convolution (from Ref. [3])

We present below the calculation of the discrete convolution of a 3x3 kernel $K_{\rm ex}$ (with no padding and stride 1):

We define a function to display images in a single figure; it is not important for the purpose of this tutorial to understand this function implementation.

%% Cell type:code id: tags:

``` python

# function to display multiple images in a single figure

In Sec. 3.3.1, we used a randomly chosen matrix to perform our convolution; it turns out that there are some "special" kernel matrices that perform specific (and useful) transformation when convoluted with an image.

Below, we present some example of these kernels.

Please visit this page for more details: https://en.wikipedia.org/wiki/Kernel_(image_processing)

Now we apply the convolution operation on both images (photo of Max Planck and the Berlin landscape) using each of the kernel above.

In particular, we use the Scipy function `signal.convolve2d` to perform the convolution.

Please refer to the Scipy documentation for more details on this function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html

Looking at the pictures above, we notice that each kernel performed a pre-determined modification:

1. blurring the picture

2. highlighting vertical lines

3. highlighting horizontal lines

4. highlighting edges

5. embossing (i.e. raising the pattern against the background)

As you can see above, the effects are similar for both pictures, and they are determined by the kernel with which the image is convolved.
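To make the effect of a fixed kernel concrete without the photographs, here is a minimal plain-Python sketch on a toy 5x5 image with a single vertical edge. The helper `convolve2d_valid` is written here for illustration and is meant to mimic what `signal.convolve2d(image, kernel, mode='valid')` computes; the box-blur and Sobel kernels are assumed stand-ins and are not necessarily the exact kernels used in this tutorial.

```python
def convolve2d_valid(image, kernel):
    """2-D convolution without padding and with stride 1 (the kernel is flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    return [
        [
            sum(image[i + m][j + n] * flipped[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


# Toy "image": dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Assumed example kernels: a 3x3 box blur and a Sobel kernel for vertical edges.
box_blur = [[1 / 9] * 3 for _ in range(3)]
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

blurred = convolve2d_valid(image, box_blur)
edges = convolve2d_valid(image, sobel_x)

print("blurred row:", [round(v, 3) for v in blurred[0]])
print("edge row:   ", edges[0])
```

The blur kernel smooths the sharp step into a gradual ramp, while the edge kernel responds only where the window straddles the step and returns zero in the uniform region.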

In the case of **convolutional neural networks**, the **kernels** are not the ones reported above; instead, they are **learned by the network** from the data (by minimizing the classification error).

%% Cell type:markdown id: tags:

## 4. Convolutional neural network model with Keras

%% Cell type:markdown id: tags:

Now, we build and train a convolutional neural network.

As an example, we use the well-known MNIST dataset, a database of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples.

This is a sample of the handwritten digits present in the database:

We now build a convolutional neural network using Keras, a simple and intuitive high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano; it also runs seamlessly on CPU and GPU.

For more information on Keras, please visit https://keras.io/. Note that this link refers to the newest version of Keras (>=2.4), which only supports TensorFlow (https://www.tensorflow.org/) as a backend. This tutorial (as well as the one on multilayer perceptrons) is compatible with versions <=2.3, which allow multiple backends (CNTK, TensorFlow, Theano). There are only slight differences in syntax, and you can find archived documentation at https://github.com/faroit/keras-docs, e.g., for version 2.1.5 at https://faroit.com/keras-docs/2.1.5/. In both tutorials, we use TensorFlow as the backend (version <2.0).

We start by defining the architecture (i.e. the shape) of the network. We use two convolutional layers, one max pooling layer, and one fully connected layer. There is no particular reason behind this choice, and other (better performing) choices are possible.
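It can help to work out by hand how the tensor shape and the number of trainable parameters change from layer to layer. The sketch below does this arithmetic in plain Python under stated assumptions: a 28x28x1 MNIST input, 3x3 kernels with 32 and 64 filters, no padding and stride 1 (the Keras `Conv2D` defaults), 2x2 max pooling, and 128 dense units, as in the model definition of this tutorial. The final 10-unit output layer is an assumption made here (one unit per digit class); the helper functions are hypothetical and not part of Keras.

```python
def conv2d(shape, filters, kernel=3):
    """Output shape and parameter count of a conv layer (no padding, stride 1)."""
    h, w, channels = shape
    out_shape = (h - kernel + 1, w - kernel + 1, filters)
    params = kernel * kernel * channels * filters + filters  # weights + biases
    return out_shape, params


def max_pool(shape, pool=2):
    """Non-overlapping max pooling: no trainable parameters."""
    h, w, channels = shape
    return (h // pool, w // pool, channels), 0


def dense(n_inputs, units):
    """Fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return units, n_inputs * units + units


summary = []
shape = (28, 28, 1)                      # assumed MNIST input: 28x28 grayscale

shape, p = conv2d(shape, 32)
summary.append(("conv 3x3, 32 filters", shape, p))
shape, p = conv2d(shape, 64)
summary.append(("conv 3x3, 64 filters", shape, p))
shape, p = max_pool(shape)
summary.append(("max pool 2x2", shape, p))

flat = shape[0] * shape[1] * shape[2]    # Flatten has no parameters
units, p = dense(flat, 128)
summary.append(("dense 128", (units,), p))
units, p = dense(units, 10)              # assumed output layer (10 digit classes)
summary.append(("dense 10 (assumed)", (units,), p))

total_params = sum(p for _, _, p in summary)
for name, out_shape, p in summary:
    print(f"{name:22s} output {str(out_shape):14s} params {p}")
print("total trainable parameters:", total_params)
```

Under these assumptions almost all parameters sit in the first dense layer, because flattening the pooled feature maps produces 9216 inputs.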

Now, we train the neural network; you can decide the number of epochs you want to use. The more epochs, the more times the network will see the training samples, but this will result in an increase of computational time (proportional to `nb_epochs`).

An **epoch** is one full pass over the training set; one epoch is completed when the neural network has seen every training sample once.
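The relation between epochs, batches, and weight updates can be made explicit with a little arithmetic. In the sketch below, the training-set size (60,000) and `nb_epochs = 5` come from this tutorial, and the 4 minutes per epoch is the rough estimate quoted in the text; `batch_size = 128` is only an assumed value, since the actual setting is defined elsewhere in the notebook.

```python
import math

n_train = 60000          # MNIST training samples
nb_epochs = 5            # value used in this tutorial
batch_size = 128         # assumption: the real value is set elsewhere in the notebook
minutes_per_epoch = 4    # rough estimate quoted in the tutorial text

# Each epoch is split into mini-batches; the last one may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

# Each mini-batch triggers one weight update.
total_updates = steps_per_epoch * nb_epochs
estimated_minutes = nb_epochs * minutes_per_epoch

print("weight updates per epoch:", steps_per_epoch)
print("size of the last batch:  ", last_batch_size)
print("total weight updates:    ", total_updates)
print("estimated training time: ", estimated_minutes, "min")
```

Doubling `nb_epochs` doubles both the number of updates and the expected training time, which is the proportionality mentioned above.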

<span style="color:red">**Run the cell below to start training your first convolutional neural network. For the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please take this time to read carefully the materials above, and maybe check out some external references.**</span>

As we discussed in the tutorial on multilayer perceptrons, regularization techniques are extremely useful to improve the generalization ability of machine learning models. We will again use dropout, now in the context of convolutional neural networks, and investigate its influence on model performance.
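As a reminder of what a dropout layer does, here is a minimal plain-Python sketch of "inverted" dropout, the variant in which kept activations are rescaled by `1 / (1 - rate)` during training so that the layer can be left untouched at test time. The function `dropout` and its arguments are hypothetical names written for this illustration; it is a sketch of the idea, not the Keras implementation.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` and rescale the survivors.

    At inference time (training=False) the input is returned unchanged.
    Assumes 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)               # fixed seed for reproducibility
activations = [1.0] * 10000          # toy activations

dropped = dropout(activations, rate=0.5, training=True, rng=rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print("fraction of units kept:", kept_fraction)
print("mean activation after dropout:", mean_after)
print("unchanged at inference:",
      dropout(activations, 0.5, training=False, rng=rng) == activations)
```

Because the surviving activations are scaled up, the expected activation stays the same during training and inference; only the noise differs.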

%% Cell type:code id: tags:

``` python

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))  # dropout layer to regularize
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))  # dropout layer to regularize

```

%% Cell type:markdown id: tags:

<span style="color:red">**Run the cell below to start training your *regularized* convolutional neural network. For the current setting for the number of epochs, the full optimization should take approximately 20 minutes (~4 min per epoch); please take this time to read carefully the materials above, and maybe check out some external references.**</span>

%% Cell type:code id: tags:

``` python

nb_epochs = 5

# train the model for the specified nb_epochs
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=nb_epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

# evaluate the model performance on the test set
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

```

%% Cell type:code id: tags:

``` python

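# Plot training & validation accuracy values.
# The history key is 'acc' in older Keras versions and may be 'accuracy' from
# Keras 2.3 on (depending on how the metric was named), so use whichever exists.
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()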

```

%% Cell type:code id: tags:

``` python

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

```

%% Cell type:markdown id: tags:

#### Questions

1. Which changes do you see between the training history of the previous (unregularized) and the current (regularized) neural network?

2. If you had to pick one neural network out of the two, which one would you choose and why?

3. How important is regularization? Is it optional or a must for having generalizable models?

%% Cell type:markdown id: tags:

## 5. Opening the black box with attentive response maps

In the previous section, we built a model which classifies the handwritten digits of the MNIST dataset with satisfactory accuracy. But how can we assess which parts of a given image the network utilizes to arrive at its classification decision?

To answer this question, in this tutorial we will compute **attentive response maps**.

The main idea is to invert the data flow of a convolutional neural network, going from the activations of the last layers back to image space. A heatmap is then constructed to show which parts of the input image are most strongly activated when a classification decision is made, and are thus the most discriminative.

Specifically, in this tutorial we will use **guided back-propagation**, as introduced in J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, *Striving for Simplicity: The All Convolutional Net*, https://arxiv.org/pdf/1412.6806.pdf (2015), and implemented in the Keras-vis package (https://raghakot.github.io/keras-vis/, https://github.com/raghakot/keras-vis).
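The key ingredient of guided back-propagation is a modified backward rule at each ReLU: a gradient entry is passed on only if the unit was active in the forward pass and the incoming gradient is positive. The plain-Python sketch below contrasts this rule with ordinary back-propagation and with the "deconvnet" rule for a single ReLU layer; the function names and the toy numbers are chosen here for illustration and do not correspond to the Keras-vis API.

```python
def relu_backward_plain(forward_input, grad_output):
    """Ordinary backprop: pass the gradient where the forward input was positive."""
    return [g if x > 0 else 0.0 for x, g in zip(forward_input, grad_output)]


def relu_backward_deconvnet(grad_output):
    """Deconvnet rule: pass only positive gradients, ignoring the forward input."""
    return [g if g > 0 else 0.0 for g in grad_output]


def relu_backward_guided(forward_input, grad_output):
    """Guided backprop: require a positive forward input AND a positive gradient."""
    return [g if (x > 0 and g > 0) else 0.0
            for x, g in zip(forward_input, grad_output)]


# Hypothetical pre-activation values of four units and the gradients reaching them.
forward_input = [1.0, -1.0, 2.0, -3.0]
grad_output = [-2.0, 3.0, 1.0, -1.0]

plain = relu_backward_plain(forward_input, grad_output)
deconv = relu_backward_deconvnet(grad_output)
guided = relu_backward_guided(forward_input, grad_output)

print("plain backprop: ", plain)
print("deconvnet:      ", deconv)
print("guided backprop:", guided)
```

Only the third unit survives the guided rule, since it is the only one that was both active in the forward pass and received a positive gradient; suppressing the remaining entries is what tends to make the resulting maps sharper.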

This is not the only technique to explain the classification decisions made by convolutional neural networks; some useful references are listed below:

1. M. D. Zeiler and R. Fergus, *Visualizing and Understanding Convolutional Networks*, pp. 818-833, https://doi.org/10.1007/978-3-319-10590-1_53 (2014).

2. K. Simonyan, A. Vedaldi, and A. Zisserman, *Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps*, https://arxiv.org/pdf/1312.6034v2.pdf (2014).

3. S. Bach et al., *On pixel-wise explanations for nonlinear classifier decisions by layer-wise relevance propagation*, PLoS ONE 10, e0130140 (2015).

4. G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. R. Müller, *Explaining nonlinear classification decisions with deep Taylor decomposition*, Pattern Recognit. 65, 211–222 (2017).

5. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, *Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization*, https://arxiv.org/pdf/1610.02391.pdf (2017).

6. D. Kumar, A. Wong, and G. W. Taylor, *Explaining the unexplained: a class-enhanced attentive response (CLEAR) approach to understanding deep neural networks*, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1686–1694 (2017).

For an application of convolutional neural network interpretation to a materials science problem:

7. A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, *Insightful classification of crystal structures using deep learning*, Nature Communications 9, 2775 (2018)

%% Cell type:code id: tags:

``` python

class_idx = 0
indices = np.where(y_test[:, class_idx] == 1.)[0]

# pick the first test image of this class (any entry of `indices` would do)
idx = indices[0]

# Let's sanity check the picked image.
plt.rcParams['figure.figsize'] = (18, 6)
plt.imshow(x_test[idx][..., 0])

```

%% Cell type:code id: tags:

``` python

from vis.visualization import visualize_saliency
from vis.utils import utils
from keras import activations

# Utility to search for layer index by name.

# Alternatively we can specify this as -1 since it corresponds to the last layer.

```

%% Cell type:markdown id: tags:

2. *A guide to convolution arithmetic for deep learning*, Article: https://arxiv.org/abs/1603.07285; GitHub: https://github.com/vdumoulin/conv_arithmetic