Autoencoders

  • Unsupervised pretraining: pretrain using an unsupervised method to get initial weights, then refine using backpropagation (at which point training becomes supervised)
    • Often gives better results than backpropagation from random initial weights alone
  • Encoder computes $z=f(x)$, decoder computes $g(f(x))$; we want to minimise the distance between x and g(f(x)):

$E = L(x, g(f(x)))$

Autoencoder Networks

  • Input to bottleneck hidden layer = encoding
  • Hidden to output layer = decoding
  • After the autoencoder is trained, the decoder part is removed and replaced with, for example, a classification layer (see the sketch after this list)
  • Greedy Layerwise Pretraining: an alternative to RBM pretraining - each hidden layer is trained in turn to reconstruct its input, and the first encoder becomes the first layer of the deep network
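
A minimal sketch of this encoder/decoder structure, assuming PyTorch (the notes do not name a framework); the 784-dimensional input and 32-unit bottleneck are illustrative choices only.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=32):
        super().__init__()
        # encoder: input -> bottleneck, z = f(x)
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        # decoder: bottleneck -> reconstruction, g(f(x))
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # encoding
        return self.decoder(z)       # decoding

model = Autoencoder()
loss_fn = nn.MSELoss()               # E = L(x, g(f(x)))
x = torch.rand(64, 784)              # dummy batch standing in for real data
loss = loss_fn(model(x), x)
loss.backward()

After training, model.decoder would be discarded and model.encoder kept as the first layer of a larger (e.g. classification) network.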

Regularized Autoencoders

  • Trivial identity: often the number of hidden nodes exceeds the number of input nodes, so the network might overfit (although the number of free parameters may still be smaller)
    • Usually don’t pool, as you lose information needed when decoding/reconstructing
    • There is a risk that the autoencoder just reproduces the identity function - copying the input exactly - when we want it to learn a useful new representation
      • Avoid this by using a very small number of weights/hidden nodes
      • Or introduce some form of regularisation - a penalty that forces the weights to be restricted

Sparse autoencoders

  • Add a penalty term based on the hidden unit activations - recall weight decay
  • Instead of penalising the size of the weights, penalise the hidden unit activations

$E = L(x, g(f(x))) + \lambda \sum_i|h_i|$

  • This is L1 regularisation - absolute value rather than square - which encourages some hidden unit activations to go to 0, hence ‘sparse’: features are either present or absent (sketched in code below)
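
A hedged sketch of the sparse penalty, reusing the model and loss_fn from the sketch above; the value of lam (λ) is an illustrative hyperparameter.

lam = 1e-3                                     # sparsity weight (lambda), illustrative
z = model.encoder(x)                           # hidden unit activations h_i
recon = model.decoder(z)
loss = loss_fn(recon, x) + lam * z.abs().sum(dim=1).mean()   # add L1 penalty on activations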

Contractive autoencoders

  • Take the derivative of the hidden units with respect to the input and penalise its L2 norm

$E = L(x, g(f(x))) + \lambda \sum_i\|\nabla_x h_i\|^2$

  • The network is forced to learn hidden features that don’t change much with the input - a small change in the input is mapped to a very close point in the hidden unit space (see the sketch below)
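
A sketch of the contractive penalty under the same assumptions; looping over hidden units and calling autograd for each is a simple (if slow) way to obtain the ∇_x h_i terms.

lam = 1e-3
x = torch.rand(64, 784).requires_grad_(True)   # need gradients with respect to the input
z = model.encoder(x)                           # hidden unit activations h_i
recon = model.decoder(z)
penalty = 0.0
for i in range(z.shape[1]):                    # one autograd call per hidden unit
    grad_i = torch.autograd.grad(z[:, i].sum(), x, create_graph=True)[0]
    penalty = penalty + (grad_i ** 2).sum(dim=1)   # ||grad_x h_i||^2 for each example
loss = loss_fn(recon, x.detach()) + lam * penalty.mean()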

Denoising autoencoders

Add noise to inputs, train to recover original input

repeat:
    sample a training item x(i)
    generate a corrupted version x̃ of x(i)
    train to reduce E = L(x(i), g(f(x̃)))
end
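
The same loop as a hedged PyTorch sketch, reusing the model and loss_fn from the earlier sketches; random data and additive Gaussian noise stand in for a real dataset and corruption process (zeroing random inputs is another common choice).

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    x = torch.rand(64, 784)                     # stands in for sampling training items x(i)
    x_noisy = x + 0.3 * torch.randn_like(x)     # corrupted version of x(i)
    loss = loss_fn(model(x_noisy), x)           # E = L(x(i), g(f(x_noisy)))
    opt.zero_grad()
    loss.backward()
    opt.step()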

Cost Functions and Probability

  • Cost function can be seen as defining a probability distribution over the outputs - we train to maximise the log of the probability of the target values
    • squared error assumes underlying Gaussian distribution, mean is the output of the network
    • cross entropy assumes Bernoulli distribution, probability is the output of the network
    • softmax assumes Boltzmann distribution
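
For the squared-error case this correspondence can be made explicit (a standard derivation, not spelled out in the notes): if the network output $\hat{x} = g(f(x))$ is the mean of a Gaussian with fixed variance $\sigma^2$, then

$-\log p(x|\hat{x}) = \frac{1}{2\sigma^2}\|x - \hat{x}\|^2 + \text{const}$

so maximising the log-probability of the targets is the same as minimising the squared error.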

Stochastic Encoders and Decoders

  • Decoder is a conditional probability distribution $p_\theta(x|z)$ over outputs x
  • Encoder is a conditional probability distribution $q_\theta(z|x)$ over latent values z
  • This is what we saw with the RBM

Generative Models

  • We want to make new images of bedrooms
  • Explicit density models (Variational Autoencoders) or implicit density models (GANs)

Variational Autoencoders

  • Instead of producing a single z, we want to produce a probability distribution with a mean and standard deviation - we train the system to maximise:

$E_{z \sim q_\theta(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\theta(z|x) \| p(z))$

  • The first term is the (log-)probability of the output being similar to the input - the reconstruction term
  • The second term is the KL divergence between the encoder’s distribution and the prior over z (a standard normal)
  • There is tension between the two terms - the KL divergence wants to stretch the encoded points out to match the standard normal distribution, but in practice it stretches certain dimensions out more than others (see the sketch below)
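
A hedged sketch of a variational autoencoder trained with this objective, again assuming PyTorch; the layer sizes are illustrative and inputs are assumed to lie in [0, 1] so that cross entropy can be used for the reconstruction term.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n_in=784, n_z=20):
        super().__init__()
        self.enc = nn.Linear(n_in, 400)
        self.mu = nn.Linear(400, n_z)          # mean of q(z|x)
        self.logvar = nn.Linear(400, n_z)      # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(n_z, 400), nn.ReLU(),
                                 nn.Linear(400, n_in), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # first term: reconstruction (Bernoulli negative log-likelihood)
    recon_term = nn.functional.binary_cross_entropy(recon, x, reduction='sum')
    # second term: KL divergence from q(z|x) to the standard normal prior (closed form)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl                     # minimising this maximises the objective above

Minimising vae_loss is equivalent to maximising the objective above; the closed-form KL term assumes a diagonal Gaussian encoder and a standard normal prior.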