Boltzmann Machines

  • Motivation: Retrieve something from memory when presented with only part of it.
    • E.g. when given a corrupted image, reconstruct the original image

Hopfield Network

  • The energy function is a cost function - local minima of the energy function correspond to the stored items
  • For black and white images with $d$ pixels, each image is a configuration $x = \{x_j\}_{1 \leq j \leq d}$ with $x_j \in \{-1, +1\}$ and energy:

$E(x) = -\frac{1}{2}\sum_{i \neq j} w_{ij} x_i x_j$

  • If we have p images, we have p local minima for E(x).
  • Make $w_{ij}$ be positive if two pixels tend to have the same colour, $w_{ij}$ is negative if they are opposite colours.
    • The 1/2 comes from assuming $w_{ij} = w_{ji}$ for all $i, j$ (each pair is counted twice in the sum)
    • The states are discrete - each $x_i$ is only -1 or +1
  • To retrieve an image, we start with an initial state $x$, flip one $x_i$ at a time, and try to reach a lower energy state. Factoring out $x_i$ from the $E(x)$ expression, reducing $E(x)$ means updating it as (see the sketch after this list):

$x_i \leftarrow \mathrm{sign}\Big(\sum_j w_{ij} x_j\Big)$

  • Hebbian learning: update the weights according to the formula (for stored patterns $x^{(1)}, \dots, x^{(p)}$):

$w_{ij} = \frac{1}{p}\sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$

  • For any image x:

  • Patterns are reliably stored if $p/d < 0.138$, where $p$ is the number of patterns and $d$ the number of neurons in the network
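
A minimal NumPy sketch of the above - Hebbian training plus sign-update retrieval. The function names, the $1/p$ normalisation and the random update schedule are illustrative choices, not the course's exact implementation:

```python
import numpy as np

def train_hebbian(patterns):
    """Hebbian learning: w_ij = (1/p) * sum over stored patterns of x_i * x_j."""
    p, d = patterns.shape                  # p patterns of d pixels, values in {-1, +1}
    W = patterns.T @ patterns / p          # symmetric by construction, so w_ij = w_ji
    np.fill_diagonal(W, 0)                 # no self-connections
    return W

def retrieve(W, x, n_steps=1000):
    """Flip one unit at a time towards lower energy: x_i <- sign(sum_j w_ij x_j)."""
    x = x.copy()
    for _ in range(n_steps):
        i = np.random.randint(len(x))      # pick a random unit to update
        x[i] = 1 if W[i] @ x >= 0 else -1  # E(x) = -1/2 x^T W x never increases
    return x
```

Starting `retrieve` from a corrupted copy of a stored pattern should settle into the nearest local minimum, i.e. the clean stored image.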

Generative Models

  • What if we want to generate new items, not just store and retrieve past items? Generative model

Boltzmann Machine

  • Differences to Hopfield
    • Energy function is the same
    • Neurons $x_i$ take values 0 or 1, not -1/+1 (breaks the inverted-colour symmetry) - think of them as features
    • Generate new states rather than retrieving stored state
    • Update is not deterministic but stochastic

$p(x) = \frac{e^{-E(x)/T}}{Z}$

  • E(x) is energy function, T is temperature, Z is partition function
  • Similar to softmax, except lower energy means larger probability
  • Gibbs Sampling: given an image $x$, change only one pixel at a time - compute the energy change $\Delta E$ for setting $x_i = 1$ rather than 0; the conditional probability of $x_i$ taking the value 1 is then $p(x_i = 1 \mid x_{\text{rest}}) = \frac{1}{1 + e^{\Delta E / T}}$ (see the sketch after this list)

  • Similar to Hopfield but there is randomness to updates.
    • Boltzmann - we could have a change that moves the image to a higher energy state
    • If repeated, eventually obtain a sample from Boltzmann distribution
    • $T \rightarrow \infty$: $p(x_i = 1) = 1/2$ - updates become purely random
    • $T \rightarrow 0$: $p(x_i = 1)$ goes to 0 or 1 depending on the sign of $\Delta E$ - behaviour becomes similar to Hopfield - energy never increases
    • Temperature T is usually fixed at 1
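
A sketch of one stochastic Gibbs update for a fully connected Boltzmann machine with 0/1 units and the energy $E(x) = -\frac{1}{2}\sum_{i \neq j} w_{ij} x_i x_j$ (biases omitted); the variable names are assumptions:

```python
import numpy as np

def gibbs_update(W, x, T=1.0):
    """Resample one unit x_i from its conditional distribution.

    Setting x_i = 1 rather than 0 changes the energy by delta_E = -sum_{j != i} w_ij x_j,
    so p(x_i = 1 | rest) = 1 / (1 + exp(delta_E / T)).
    """
    i = np.random.randint(len(x))
    delta_E = -(W[i] @ x - W[i, i] * x[i])    # exclude the self-term
    p_on = 1.0 / (1.0 + np.exp(delta_E / T))  # T -> infinity gives 1/2; T -> 0 gives 0 or 1
    x[i] = 1.0 if np.random.rand() < p_on else 0.0
    return x
```

Unlike the Hopfield update, the unit can move to a higher energy state with nonzero probability; repeating the update many times eventually yields samples from $p(x) = e^{-E(x)/T}/Z$.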

Restricted Boltzmann Machine

  • Original Boltzmann - everything connected to everything - every pixel connected to every other one.
  • RBM - introduce hidden units in addition to the visible pixel units
    • No vis-to-vis or hidden-to-hidden connections
    • Updates become much faster - with the visible nodes fixed, all hidden nodes can be updated in parallel (they are conditionally independent), and then vice versa
    • Inputs are binary vectors
    • Train to maximise expected log probability of the data

  • Alternating Gibbs Sampling: choose $v_0$ randomly, sample $h_0$ from $p(h \mid v_0)$, sample $v_1$ from $p(v \mid h_0)$, etc. This is called ‘dreaming’ (see the sketch below)
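
A sketch of alternating Gibbs sampling for an RBM with binary units; the weight shape (visible × hidden) and the bias names `b` (visible) and `c` (hidden) are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h(W, c, v):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij); no hidden-to-hidden connections."""
    p = sigmoid(c + v @ W)                    # W: (n_visible, n_hidden)
    return (np.random.rand(*p.shape) < p).astype(float), p

def sample_v(W, b, h):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j); no visible-to-visible connections."""
    p = sigmoid(b + h @ W.T)
    return (np.random.rand(*p.shape) < p).astype(float), p

def dream(W, b, c, n_steps=1000):
    """Alternating Gibbs sampling: random v0 -> h0 -> v1 -> h1 -> ..."""
    v = (np.random.rand(W.shape[0]) < 0.5).astype(float)   # random visible start v0
    for _ in range(n_steps):
        h, _ = sample_h(W, c, v)              # all hidden units updated in parallel
        v, _ = sample_v(W, b, h)              # then all visible units in parallel
    return v
```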

Training RBM

  • Contrastive Divergence: train by comparing real and fake (model-generated) images, raising the probability of the real ones and lowering that of the fake ones
  • Quick Contrastive Divergence: start with a real digit, generate the hidden units, generate the fake (reconstructed) digit, and train using those as the positive and negative samples respectively (see the sketch below)
    • Not mathematically rigorous but good enough
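
A sketch of the quick (one-step) contrastive divergence update described above, reusing the hypothetical `sample_h` / `sample_v` helpers from the previous sketch; the learning rate and the use of hidden probabilities rather than samples in the gradient are assumptions:

```python
import numpy as np

def cd1_update(W, b, c, v0, lr=0.01):
    """One step of contrastive divergence (CD-1) on a single real image v0."""
    h0, p_h0 = sample_h(W, c, v0)   # hidden units driven by the real image (positive sample)
    v1, _    = sample_v(W, b, h0)   # reconstructed 'fake' image (negative sample)
    _,  p_h1 = sample_h(W, c, v1)   # hidden probabilities for the reconstruction

    # Move towards the statistics of the real image, away from the reconstruction
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (v0 - v1)             # visible biases
    c += lr * (p_h0 - p_h1)         # hidden biases
    return W, b, c
```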

Deep Boltzmann Machine

  • Same approach but applied iteratively to a multi-layer network
  • Weights from the input to the first layer are trained first; those are then kept fixed while the weights from the first to the second layer are trained, and so on

Greedy Layerwise Pretraining

  • Each pair of adjacent layers is trained as an RBM; those values are used as the initial weights and biases of a feedforward NN, which is then trained using backprop and used for another task (see the sketch below)
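
A sketch of the greedy layerwise idea, assuming a hypothetical `train_rbm(data, n_hidden)` that trains one RBM (e.g. with CD-1) and returns its weights and biases:

```python
import numpy as np

def pretrain_stack(data, layer_sizes, train_rbm):
    """Train each adjacent pair of layers as an RBM, feeding each layer's hidden
    probabilities upwards as the next RBM's training data. The returned (W, c)
    pairs can then initialise a feedforward NN for backprop fine-tuning."""
    weights, v = [], data                       # data: (n_samples, layer_sizes[0])
    for n_hidden in layer_sizes[1:]:
        W, b, c = train_rbm(v, n_hidden)        # lower-layer weights stay fixed afterwards
        weights.append((W, c))                  # weight matrix + hidden bias per layer
        v = 1.0 / (1.0 + np.exp(-(v @ W + c)))  # propagate the data up one layer
    return weights
```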

https://www.doc.ic.ac.uk/~sd4215/hopfield.html