Geometry
Encoder Networks
- Visualising hidden unit dynamics
- The N-M-N encoder task - e.g. 5-3-5 means 5 inputs and 5 outputs, with the data squashed through a bottleneck of 3 hidden units
For N-2-N:
- Example: a 4-2-4 encoder.
- The 4 possible inputs (identical to the target outputs) can be plotted as points in hidden-unit space.
- The two hidden units define a 2-D hidden-unit space; each output unit computes a function of H1, H2 and a bias, so it can be drawn as a line in that space.
- The aim is for these lines to separate the points - the data is 'squashed' through the two hidden units.
- When there are too many inputs for the bottleneck, the network becomes over-constrained and training is unstable.
- During training, the points start near the middle of the space and move out to the corners as the error decreases.
- Similar to the autoencoders that compress images, covered later in the course.
- With 3 hidden units, the hidden units correspond to planes rather than lines, and the points again move to the corners of the space (e.g. for an 8-3-8 encoder, the corners of a cube). A minimal training sketch follows below.
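A minimal numpy sketch of training a 4-2-4 encoder (the architecture is from the notes; the learning rate, step count and weight scale are assumptions chosen to make it converge):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.eye(4)                       # the 4 one-hot inputs; targets equal inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small random weights break the hidden-unit symmetry (see Symmetries below).
W1, b1 = rng.normal(0, 0.1, (4, 2)), np.zeros(2)
W2, b2 = rng.normal(0, 0.1, (2, 4)), np.zeros(4)

lr = 1.0
for step in range(20000):
    H = sigmoid(X @ W1 + b1)        # hidden activations: one 2-D point per input
    Y = sigmoid(H @ W2 + b2)
    dY = (Y - X) * Y * (1 - Y)      # backprop sum-of-squares error through sigmoid
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

# Early in training the 4 points sit near (0.5, 0.5); after training they
# should have moved out towards the corners of the unit square.
print(np.round(H, 2))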
Hinton Diagrams
- White is positive, black is negative; the larger the square, the greater the magnitude.
- Each diagram typically shows the incoming weights of one hidden unit.
- The weight pattern can be read as a superposition of the input patterns that the unit responds to.
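A common way to draw a Hinton diagram with matplotlib (a sketch; the random matrix stands in for real weights):

```python
import numpy as np
import matplotlib.pyplot as plt

def hinton(matrix, ax=None):
    """Draw a Hinton diagram: square area ~ |weight|, colour = sign."""
    ax = ax if ax is not None else plt.gca()
    ax.set_facecolor('gray')
    ax.set_aspect('equal')
    max_w = np.abs(matrix).max()
    for (i, j), w in np.ndenumerate(matrix):
        colour = 'white' if w > 0 else 'black'      # white positive, black negative
        size = np.sqrt(abs(w) / max_w)              # area proportional to magnitude
        ax.add_patch(plt.Rectangle((j - size / 2, i - size / 2),
                                   size, size, facecolor=colour))
    ax.autoscale_view()
    ax.invert_yaxis()

# Random stand-in for a weight matrix (rows: hidden units, cols: inputs).
hinton(np.random.default_rng(1).normal(size=(5, 8)))
plt.show()
```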
Symmetries
- Swapping two hidden units (together with their incoming and outgoing weights) leaves the overall function unchanged.
- So does reversing the sign of all weights into and out of a hidden unit, for odd activations such as tanh (both symmetries are demonstrated in the sketch below).
- If two hidden units have identical weights, they receive identical errors and identical weight updates, so they never differentiate - this is why initial weights should be randomised.
- Hidden units start out doing a similar job and then specialise.
- With small initial weights, each layer implements an approximately linear function; two layers together introduce a little non-linearity. The initial function should not be too non-linear.
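A small sketch demonstrating both symmetries on a tanh network (the network shape and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)   # 3 inputs -> 4 hidden
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)   # 4 hidden -> 2 outputs

def net(x, W1, b1, W2, b2):
    return np.tanh(x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(5, 3))
y = net(x, W1, b1, W2, b2)

# 1) Swap hidden units 0 and 1 (columns of W1/b1, rows of W2): same function.
perm = [1, 0, 2, 3]
y_swap = net(x, W1[:, perm], b1[perm], W2[perm, :], b2)

# 2) Negate all weights into and out of one hidden unit: tanh is odd,
#    so tanh(-z) = -tanh(z) and the two sign flips cancel.
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[:, 0] *= -1; b1f[0] *= -1; W2f[0, :] *= -1
y_flip = net(x, W1f, b1f, W2f, b2)

print(np.allclose(y, y_swap), np.allclose(y, y_flip))  # True True
```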
Limitations of 2-Layer NNs
- Example: the Twin Spirals task - two interleaved spiral classes (a dataset sketch follows after this list).
- The first layer learns lines separating positive from negative regions.
- The second layer can combine these into convex features.
- A third layer can learn concave features, such as holes.
- The task is hard to learn: it needs a very low learning rate and small initial weight values, otherwise the network converges to a poor local optimum.
- https://www.cs.cmu.edu/~dst/pubs/byte-hiddenlayer-1989.pdf
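A sketch of generating the twin-spirals data, assuming the classic Lang & Witbrock parameterisation (97 points per spiral, 3 turns; the exact formula is an assumption):

```python
import numpy as np

def twin_spirals(n=97, noise=0.0, seed=3):
    rng = np.random.default_rng(seed)
    i = np.arange(n)
    phi = i * np.pi / 16              # angle grows along the spiral (3 turns)
    r = 6.5 * (104 - i) / 104         # radius shrinks towards the centre
    a = np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)   # class 0
    b = -a                            # class 1 is the point-reflection of class 0
    pts = np.concatenate([a, b]) + noise * rng.normal(size=(2 * n, 2))
    labels = np.concatenate([np.zeros(n), np.ones(n)])
    return pts, labels

pts, labels = twin_spirals()
print(pts.shape, labels.shape)        # (194, 2) (194,)
```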
Vanishing and Exploding Gradients
- When weights are too small, the differentials shrink exponentially (by the chain rule) as you backpropagate through the layers - vanishing gradients.
- When weights are too large, activations in higher layers saturate at extreme values, where the transfer function is flat - gradients again become very small.
- With intermediate weight values, the differentials can be multiplied many times in regions where the transfer function is steep, and blow up - exploding gradients. (A numeric sketch of the vanishing case follows after the list below.)
- Solved by:
- Layerwise unsupervised training
- Learn useful features first, before worrying about the output
- LSTM for recurrent neural networks
- New activation functions
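A toy numpy demonstration of vanishing gradients - the depth, width and weight scale are assumptions chosen to make the shrinkage visible:

```python
import numpy as np

rng = np.random.default_rng(4)
depth, width = 20, 50
scale = 0.5                            # small weights -> per-layer shrink factor

# Forward pass through `depth` tanh layers, remembering pre-activations.
Ws = [scale * rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]
h, zs = rng.normal(size=width), []
for W in Ws:
    z = W @ h
    zs.append(z)
    h = np.tanh(z)

# Backward pass: start from a unit gradient at the output and apply the
# chain rule layer by layer; the gradient norm shrinks roughly geometrically.
g = np.ones(width)
for W, z in zip(reversed(Ws), reversed(zs)):
    g = W.T @ (g * (1 - np.tanh(z) ** 2))   # through tanh, then through W
    print(f"gradient norm: {np.linalg.norm(g):.3e}")
```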
Dropout
- During training, hidden and visible units are randomly dropped (ignored) on each forward/backward pass.
- Each unit is dropped with probability 1-p, i.e. kept with probability p.
- After training, activations are multiplied by p, so each unit outputs the average of what it would have produced during training.
- This forces redundancy: the network stays robust when features are missing.
- It simulates ensembling - like training several different classifiers on the same task to get diversity.
- In bagging, diversity comes from training each classifier on a different subset of the data sampled with replacement - some examples are never chosen, others are chosen multiple times.
- In dropout, diversity comes from the architecture instead: each pass trains a different sub-network (a different subset of units), and the result is later averaged over all these models (see the sketch below).
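A minimal sketch of the dropout scheme described above (p = 0.5 is an assumed keep probability):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.5                                  # probability a unit is KEPT (assumed)

def dropout_forward(h, train):
    if train:
        mask = rng.random(h.shape) < p   # each unit kept with probability p
        return h * mask                  # dropped units contribute 0
    # At test time every unit is active, so scale by p to match each unit's
    # expected (average) output during training.
    return h * p

h = rng.normal(size=(4, 8))
print(dropout_forward(h, train=True))    # random units zeroed
print(dropout_forward(h, train=False))   # all units active, scaled by p
```

Many modern implementations use the equivalent "inverted dropout", dividing by p during training instead, so that no scaling is needed at test time.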