Image Processing
Image Datasets and Tasks
- Various datasets:
- MNIST: handwritten digits, 10 classes, 6,000 training images per class
- CIFAR-10: small colour images, 10 classes, 5,000 training images per class
- ImageNet ILSVRC: 1,000 classes and over a million training images
- Tasks:
- image captioning
- image classification
- object detection
- object segmentation
- generating images
- generating art
- style transfer
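As a concrete starting point, here is a minimal sketch of loading two of these datasets with torchvision; the root directory and the bare ToTensor transform are illustrative assumptions.

```python
# Minimal sketch: loading MNIST and CIFAR-10 with torchvision.
# ImageNet is not downloadable this way and needs a manual download.
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # convert PIL images to [0, 1] tensors

mnist = torchvision.datasets.MNIST(root="data", train=True,
                                   download=True, transform=transform)
cifar = torchvision.datasets.CIFAR10(root="data", train=True,
                                     download=True, transform=transform)

print(len(mnist), len(mnist.classes))  # 60000 training images, 10 classes
print(len(cifar), len(cifar.classes))  # 50000 training images, 10 classes
```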
AlexNet
- Enhanced using: ReLUs, overlapping pooling, stochastic gradient descent with momentum and weight decay, 50% dropout in fully connected layers
- Data augmentation
- Cropping out 10 patches of the original image
- Horizontal reflection of each patch included
- The intensities of the RGB channels were altered
- Predictions over the 10 patches averaged at test time (see the ten-crop sketch after this list)
- Convolution Kernels
- Filters on GPU-1 are colour agnostic
- Filters on GPU-2 are colour specific
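Below is a hedged sketch of the ten-crop test-time augmentation described above, using torchvision's TenCrop transform (four corners plus centre, each with its horizontal reflection). The 256/224 sizes are the standard AlexNet ones, and the model interface is an assumption.

```python
# Sketch of AlexNet-style test-time augmentation: 10 patches per image,
# predictions averaged over the patches.
import torch
import torchvision.transforms as T

ten_crop = T.Compose([
    T.Resize(256),
    T.TenCrop(224),  # 5 crops + their horizontal reflections
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])

def predict_with_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)          # shape: (10, 3, 224, 224)
    with torch.no_grad():
        logits = model(crops)            # shape: (10, num_classes)
    return logits.mean(dim=0)            # average the 10 predictions
```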
Deep Networks
- > 10 layers: weight initialisation and batch normalization
- > 30 layers: skip connections
- > 100 layers: identity skip connections
Weight Initialization
- We want changes in the weights between layers to be around the same size.
- This is affected by the variance of the input and output, the number of outputs of each layer ($n_i^{\text{out}}$), and a constant that accounts for the transfer function ($G_0$)
- We know that for D layers, the variance is:
$Var\left[\frac{\partial}{\partial x}\right] = \left(\prod_{i=1}^D G_0 \times n_i^{\text{out}}\times Var[w^{(i)}]\right)\times Var\left[\frac{\partial}{\partial z}\right]$
- We want the variance to be around the same between input and output (left and right) - so therefore we choose weights such that
$\left(\prod_{i=1}^D G_0 \times n_i^{\text{out}}\times Var[w^{(i)}]\right) = 1$
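A small numpy sketch of this idea, choosing $Var[w]$ so that each per-layer factor $G_0 \times n_i^{\text{out}} \times Var[w^{(i)}]$ is 1; the normal distribution and the gain value are assumptions.

```python
# Sketch: pick Var[w] = 1 / (G_0 * n_out) so the per-layer factor in the
# product above equals 1 and gradient variance stays roughly constant.
import numpy as np

def init_layer_weights(n_in, n_out, gain=1.0, seed=0):
    rng = np.random.default_rng(seed)
    std = np.sqrt(1.0 / (gain * n_out))          # Var[w] = 1 / (gain * n_out)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = init_layer_weights(n_in=512, n_out=256)
print(256 * W.var())  # ~1, i.e. n_out * Var[w] is approximately 1
```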
Batch Normalization
- We want to normalise the activations of the nodes of a particular layer. To do that, we normalise the activations $x_k^{(i)}$:
$\hat{x}_k^{(i)} = \frac{x_k^{(i)} - Mean[x_k^{(i)}]}{\sqrt{Var[x_k^{(i)}]}}$
- Then we shift and rescale it with learnable parameters $\beta$ and $\gamma$ (a custom mean and scale), which are trained by backprop along with the other parameters/weights.
$y_k^{(i)} = \beta_k^{(i)} + \gamma_k^{(i)}\times \hat{x}_k^{(i)}$
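A minimal numpy sketch of these two equations; the small epsilon inside the square root is a standard numerical-stability term that is not part of the formula above.

```python
# Sketch of the batch-normalisation forward pass: normalise each feature
# over the batch, then shift and rescale with learnable beta and gamma.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features); gamma, beta: (num_features,) learnable
    mean = x.mean(axis=0)                      # Mean[x_k] over the batch
    var = x.var(axis=0)                        # Var[x_k] over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalised activations
    return beta + gamma * x_hat                # y_k = beta_k + gamma_k * x_hat_k

x = 3.0 * np.random.randn(32, 4) + 5.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))  # ~0 and ~1
```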
Residual Networks
- Add a skip connection around a layer (or block of layers), so the input x is added to the block's output: H(x) = F(x) + x
- F(x) is called a residual component
- Corrects errors from previous layers, or provides additional details
- So instead of learning the full mapping H(x) directly, we refine x with the residual F(x)
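A minimal PyTorch sketch of a residual block; the particular layers inside F(x) (two convolutions with batch normalization) are illustrative choices.

```python
# Sketch of a residual block: output H(x) = F(x) + x, where F(x) is the
# residual component and "+ x" is the identity skip connection.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                       # F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)              # refine x with F(x)

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))               # same shape in and out
```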
Dense Networks
- In a densely connected block, each layer is connected by shortcut connections to all preceding layers
- Blocks are separated by convolution and pooling layers
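A minimal PyTorch sketch of a densely connected block; the layer composition and the growth rate are illustrative assumptions.

```python
# Sketch of a densely connected block: each layer receives the concatenation
# of the block input and all preceding layers' feature maps.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # shortcut to all previous layers
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)
out = block(torch.randn(1, 16, 32, 32))  # 16 + 4*12 = 64 output channels
```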
Texture Synthesis
1. Pretrain a CNN; its hidden layers learn general-purpose feature representations
2. Pass the input texture through the CNN and compute the feature map $F_{ik}^l$ for the $i^{th}$ filter at spatial location $k$ in layer $l$
3. Compute the Gram matrix for each pair of features: $G_{ij}^l = \sum_k F_{ik}^l \times F_{jk}^l$
4. Feed a random image into the CNN and compute the L2 distance between its Gram matrices and those of the original texture
5. Backprop to get the gradient on the image pixels, update the image and go to step 4
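A sketch of the Gram-matrix loss used in steps 3 and 4; the cnn_features helper in the commented loop is hypothetical and stands in for a pretrained, frozen CNN returning feature maps at the chosen layers.

```python
import torch

def gram_matrix(feature_map):
    # feature_map F^l: (channels, height, width) -> flatten spatial locations k
    c, h, w = feature_map.shape
    F = feature_map.reshape(c, h * w)
    return F @ F.t()                  # G_ij^l = sum_k F_ik^l * F_jk^l

def texture_loss(gen_features, tex_features):
    # L2 distance between the Gram matrices, summed over the chosen layers
    return sum(((gram_matrix(g) - gram_matrix(t)) ** 2).sum()
               for g, t in zip(gen_features, tex_features))

# Illustrative optimisation loop; cnn_features is a *hypothetical* helper:
#
#   tex_feats = [f.detach() for f in cnn_features(texture)]
#   image = torch.rand_like(texture, requires_grad=True)
#   opt = torch.optim.Adam([image], lr=0.01)
#   for step in range(2000):
#       opt.zero_grad()
#       loss = texture_loss(cnn_features(image), tex_feats)
#       loss.backward()               # gradient on the image pixels
#       opt.step()                    # update the image, then repeat (step 4)
```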
Neural Style Transfer
- Take the content of one image and the style of another, and combine them to produce a new image
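Building on the texture-synthesis sketch above (it reuses gram_matrix and texture_loss), here is a minimal sketch of the combined objective: a content term (feature-map distance to the content image) plus a style term (Gram-matrix distance to the style image). The weights alpha and beta and the choice of a single content layer are illustrative assumptions.

```python
# Sketch of the style-transfer objective: weighted sum of a content loss
# and a Gram-matrix style loss; the generated image is optimised as before.
def content_loss(gen_feat, content_feat):
    # L2 distance between feature maps at a chosen content layer
    return ((gen_feat - content_feat) ** 2).sum()

def style_transfer_loss(gen_feats, content_feats, style_feats,
                        alpha=1.0, beta=1e3):
    c_loss = content_loss(gen_feats[-1], content_feats[-1])  # one content layer
    s_loss = texture_loss(gen_feats, style_feats)            # Gram-matrix style term
    return alpha * c_loss + beta * s_loss
```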