We use convolutional neural networks ("CNNs") mainly for image classification, which is a multi-class classification task. The network's main characteristics are *local connectivity* and *shared weights*. These properties are achieved through the methods of *convolution* and *pooling*.

## Key Components

**Convolution:**

- We only look at a small patch $x$ of the whole image (e.g. $11 \times 11$) and we learn a weight matrix $W$ of the same size (i.e. 121 parameters). We then compute the inner product of $x$ and $W$ and pass it through a $\text{ReLU}$ activation function to return a scalar.
- By sliding over the image, we repeat this operation at every position: we take the closest $11 \times 11$ patch to the current position as input and multiply it with the same, already learned weights $W$.

![[convolutional-nn.png|center|450]]

**Pooling:**

- We want to identify patterns in the image irrespective of their location (e.g. in the center or the left corner). Therefore we take the $\max$ over a small patch ("pooling region"). Again we slide over the whole image to apply this.

**Stride:**

- The sliding window can also skip some pixels of the input. Moving 2 pixels at a time horizontally and vertically corresponds to a stride of $2$.
- This reduces the size of the output, and thereby the number of parameters in subsequent layers, with a potentially small loss of information, since two very close patches share the majority of their pixel values anyway.

## Comparison to Feed Forward Networks

[[Feed Forward Neural Network|Feed Forward Neural Networks]] are not well-suited for image classification, due to the following reasons:

- *High parameter count:* A FFNN has too many parameters to learn for this problem. An input image of size $1000 \times 1000$ has $10^6$ inputs, so a fully connected hidden layer of the same size already has $10^6 \times 10^6 = 10^{12}$ parameters.
- *Loss of spatial information:* FFNN weights would learn the exact position of pixel values instead of general patterns.

## Example: CNN Architecture

Assume the inputs of our CNN model are $28 \times 28$ images. We set the batch size to 32, which means that we split the whole training data into sets of 32 images each. During an epoch we iterate over all batches. In each batch iteration, we compute the gradient for the batch and update the weights accordingly (a minimal training-loop sketch appears at the end of this note).

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, (3, 3)),   # 1 input channel -> 32 feature maps, 3x3 filters
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),       # halves the width and height
    nn.Conv2d(32, 64, (3, 3)),  # 32 channels -> 64 feature maps, 3x3 filters
    nn.ReLU(),
    nn.Flatten(),               # 64 * 11 * 11 = 7744 values per image
    nn.Linear(7744, 128),
    nn.Dropout(0.5),
    nn.Linear(128, 10),         # 10 output class scores
)
```
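To double-check the sizes before walking through the layers, we can push one dummy batch through the model and print the shape after every layer. A minimal sketch, assuming the `model` defined above and a random batch of 32 grayscale images:

```python
import torch

# A dummy batch of 32 grayscale 28x28 images: (batch, channels, height, width).
x = torch.randn(32, 1, 28, 28)

# Pass the batch through each layer in turn and print the resulting shape.
for layer in model:
    x = layer(x)
    print(f"{layer.__class__.__name__:>9}: {tuple(x.shape)}")
```

The printed widths and heights ($26$, $13$, $11$) match the layer-by-layer walkthrough below.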
We create the following layers in our network:

- *Convolution:* `nn.Conv2d(1, 32, (3, 3))` takes a single input channel and creates 32 convolution filters. Each filter is a $3 \times 3$ matrix. This reduces the width and height of the input by 2 each, to $26 \times 26$, because of the default parameter settings `stride=1` and `padding=0`.
- *ReLU:* We apply a ReLU activation function. This has no parameters.
- *Pooling:* `nn.MaxPool2d((2, 2))` takes the maximum value of each $2 \times 2$ region. This further reduces the width and height by half, to $13 \times 13$, because with `kernel_size=(2, 2)` the default `stride=None` means the stride equals the kernel size, i.e. $2$.
- *Convolution:* `nn.Conv2d(32, 64, (3, 3))` takes in the 32 channels (pooling does not change the number of channels) and creates 64 feature maps out of them. Each filter is again $3 \times 3$ (applied across all 32 input channels), which reduces each feature map to $11 \times 11$.
- *Flatten:* `nn.Flatten()` concatenates the 64 separate feature maps into a single vector of size 7,744 ($64 \cdot 11 \cdot 11$).
- *Linear:* `nn.Linear(7744, 128)` creates a fully connected layer with 7744 input nodes and 128 output nodes. It is a [[Linear Classifier]].
- *Dropout:* `nn.Dropout(0.5)` randomly sets half of the input nodes to 0 during training, which acts as regularization.
- *Linear:* `nn.Linear(128, 10)` maps the 128 features to the 10 output class scores.
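The batch-and-epoch procedure described above can be made concrete with a short training loop. This is only a sketch: the MNIST dataset via `torchvision`, the cross-entropy loss, and plain SGD with `lr=0.01` are assumptions for illustration, not part of the note.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed data source: MNIST provides 28x28 grayscale images with 10 classes.
train_data = datasets.MNIST("data", train=True, download=True,
                            transform=transforms.ToTensor())
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)  # batches of 32

loss_fn = nn.CrossEntropyLoss()  # takes the raw class scores of the last layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()  # enable dropout during training
for epoch in range(5):                         # one epoch = one pass over all batches
    for images, labels in train_loader:        # one batch of 32 images per iteration
        optimizer.zero_grad()                  # clear gradients of the previous batch
        loss = loss_fn(model(images), labels)  # forward pass and loss for this batch
        loss.backward()                        # compute the gradient for the batch
        optimizer.step()                       # update the weights accordingly
```

Note that `nn.CrossEntropyLoss` applies the softmax internally, which is why the model ends with a plain linear layer rather than an explicit activation.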