Boltzmann Machines
- Motivation: Retrieve something from memory when presented only part of it.
- E.g. when given a set of images that are corrupted, reconstruct the original image
Hopfield Network
- Energy function is a cost function - local minima of energy function corresponds to the stored items
- For black and white images with $d$ pixels, each image has a configuration $x = \{x_j\}_{1 \leq j \leq d}$ with $x_j \in \{-1, +1\}$ and energy $E(x) = -\frac{1}{2}\sum_{i,j} w_{ij} x_i x_j$
- If we have p images, we have p local minima for E(x).
- Make $w_{ij}$ positive if two pixels tend to have the same colour, and negative if they tend to have opposite colours.
- The 1/2 comes from assuming $w_{ij} = w_{ji}$ for all $i, j$
- Neuron states in a Hopfield network are discrete - only -1 or +1
- To generate an image, we start with an initial state x and flip individual $x_i$ to try to reach a lower energy state. Factoring $x_i$ out of the E(x) expression, the energy is reduced by setting $x_i = \mathrm{sign}\big(\sum_j w_{ij} x_j\big)$ (see the sketch after this list)
- Hebbian learning: update the weights according to the formula:
- For any image x being stored: $w_{ij} \leftarrow w_{ij} + x_i x_j$, so after presenting all $p$ images $w_{ij} \propto \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$
- Patterns are reliably stored if $p/d < 0.138$, where p is the number of patterns and d the number of neurons in the network
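A minimal numpy sketch of the Hopfield storage/retrieval loop above; the $1/d$ normalisation, the random asynchronous update order and the step count are illustrative choices rather than anything fixed by these notes:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: w_ij accumulates x_i * x_j over all stored patterns."""
    p, d = patterns.shape                  # p patterns, each a vector of d values in {-1, +1}
    W = patterns.T @ patterns / d          # sum of outer products, normalised by d
    np.fill_diagonal(W, 0)                 # no self-connections
    return W

def energy(W, x):
    """E(x) = -1/2 * sum_ij w_ij x_i x_j (the 1/2 comes from the symmetric weights)."""
    return -0.5 * x @ W @ x

def recall(W, x, n_steps=1000):
    """Asynchronous updates: change one neuron at a time so the energy never increases."""
    x = x.copy()
    for _ in range(n_steps):
        i = np.random.randint(len(x))
        x[i] = 1 if W[i] @ x >= 0 else -1  # set x_i to the sign of its local field
    return x
```

For example, store a few random $\pm 1$ patterns, flip a handful of pixels in one of them, and `recall` should return the uncorrupted pattern as long as $p/d$ stays below the 0.138 limit.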
Generative Models
- What if we want to generate new items, not just store and retrieve past items? Generative model
Boltzmann Machine
- Differences to Hopfield
- Energy function is the same
- Neurons $x_i$ take the values 0 or 1, not -1/+1 (breaks the inverted-colour symmetry) - think of them as features
- Generate new states rather than retrieving stored state
- Update is not deterministic but stochastic
$p(x) = \frac{e^{-E(x)/T}}{Z}$
- E(x) is the energy function, T is the temperature, Z is the partition function ($Z = \sum_x e^{-E(x)/T}$, summing over all configurations)
- Similar to softmax but you want low energy = larger probability
- Gibbs Sampling: given an image x, change only one pixel $x_i$ at a time - compute the energy change the flip would cause, then use it to get the conditional probability of $x_i$ taking the value 1 or 0 (sketched in code after this list)
- Similar to Hopfield but there is randomness to updates.
- Boltzmann - we could have a change that moves the image to a higher energy state
- If repeated, eventually obtain a sample from Boltzmann distribution
- $T\rightarrow \infty$: $p(x_i = 1) \rightarrow 1/2$, i.e. updates become purely random
- $T\rightarrow 0$: the probability of an energy-increasing flip goes to 0 - behaviour becomes similar to Hopfield - energy never increases
- Temperature T is usually fixed at 1
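A small sketch of single-site Gibbs sampling for a Boltzmann machine, assuming the energy $E(x) = -\frac{1}{2}\sum_{i,j} w_{ij} x_i x_j$ with symmetric weights and no bias terms (the step count is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(W, x, T=1.0):
    """Resample one randomly chosen unit from its conditional distribution.

    Setting x_i = 1 rather than 0 changes the energy by dE = -sum_{j != i} w_ij x_j,
    so p(x_i = 1 | rest) = sigmoid(-dE / T). Unlike Hopfield, a move to a higher
    energy state can be accepted.
    """
    i = np.random.randint(len(x))
    local_field = W[i] @ x - W[i, i] * x[i]   # exclude any self-connection
    x[i] = 1.0 if np.random.rand() < sigmoid(local_field / T) else 0.0
    return x

def sample(W, d, T=1.0, n_steps=10_000):
    """Repeated Gibbs steps eventually yield a sample from p(x) = exp(-E(x)/T) / Z."""
    x = np.random.randint(0, 2, size=d).astype(float)
    for _ in range(n_steps):
        x = gibbs_step(W, x, T)
    return x
```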
Restricted Boltzmann Machine
- Original Boltzmann - everything connected to everything - every pixel connected to each other one.
- RBM - introduce hidden units in addition to the visible pixel units
- No vis-to-vis or hidden-to-hidden connections
- Updates become much faster - fix the values of the visible nodes and update all the hidden nodes at once, then vice versa
- Inputs are binary vectors
- Train to maximise expected log probability of the data
- Alternating Gibbs Sampling: Choose v0 randomly, sample h0 from p(h|v0), sample v1 from p(v|h0), etc. This is called ‘dreaming’
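A hedged sketch of the RBM conditionals and the alternating Gibbs ('dreaming') loop; the bias vectors `b_v`, `b_h` and the visible-by-hidden weight matrix `W` are the conventional parametrisation, not something these notes spell out:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h(W, b_h, v):
    """No hidden-to-hidden connections, so hidden units are independent given v."""
    p_h = sigmoid(v @ W + b_h)
    return (np.random.rand(*p_h.shape) < p_h).astype(float), p_h

def sample_v(W, b_v, h):
    """Likewise, visible units are independent given h."""
    p_v = sigmoid(h @ W.T + b_v)
    return (np.random.rand(*p_v.shape) < p_v).astype(float), p_v

def dream(W, b_v, b_h, n_steps=1000):
    """Alternating Gibbs sampling: v0 -> h0 -> v1 -> h1 -> ..."""
    v = np.random.randint(0, 2, size=len(b_v)).astype(float)
    for _ in range(n_steps):
        h, _ = sample_h(W, b_h, v)
        v, _ = sample_v(W, b_v, h)
    return v
```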
Training RBM
- Contrastive Divergence: train by comparing real and fake (model-generated) images, raising the probability of the real ones and lowering that of the fake ones
- Quick Contrastive Divergence: start with a real digit, generate the hidden units from it, generate the fake (reconstructed) digit from those, and train using these as the positive and negative samples respectively (see the sketch after this list)
- Not mathematically rigorous but good enough in practice
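A sketch of one quick contrastive-divergence (CD-1) update, reusing `sample_h` and `sample_v` from the RBM sketch above; the learning rate and the use of hidden probabilities (rather than binary samples) in the gradient are common conventions, not prescribed by the notes:

```python
import numpy as np

def cd1_update(W, b_v, b_h, v0, lr=0.01):
    """One contrastive-divergence step on a single binary input v0.

    Positive phase: hidden activity driven by the real image.
    Negative phase: hidden activity driven by the reconstructed (fake) image.
    The update raises the probability of the real pair and lowers the fake one.
    """
    h0, p_h0 = sample_h(W, b_h, v0)   # hidden sample from the data
    v1, _    = sample_v(W, b_v, h0)   # reconstruction ("fake" image)
    _,  p_h1 = sample_h(W, b_h, v1)   # hidden probabilities from the reconstruction

    W   += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)
    return W, b_v, b_h
```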
Deep Boltzmann Machine
- Same approach but apply iteratively to multi-layer network
- Weights from the input to the first hidden layer are trained first; these are then kept fixed while the weights to the second layer are trained
Greedy Layerwise Pretraining
- Each pair of layers is trained as an RBM; the learned values are used as the initial weights and biases of a feedforward NN, which is then trained with backprop and used for another task (sketched below)
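A rough sketch of the greedy layerwise loop, reusing `sample_h` and `cd1_update` from the sketches above; the layer sizes, epoch count and learning rate are placeholder values:

```python
import numpy as np

def pretrain_stack(data, layer_sizes, n_epochs=10, lr=0.01):
    """Train each pair of layers as an RBM, then feed its hidden probabilities
    forward as the training data for the next RBM, keeping earlier weights fixed.
    The resulting (W, b_h) pairs can initialise a feedforward net for backprop."""
    weights, hidden_biases = [], []
    layer_input = data                                  # rows are binary input vectors
    for n_hidden in layer_sizes:
        n_visible = layer_input.shape[1]
        W = 0.01 * np.random.randn(n_visible, n_hidden)
        b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(n_epochs):
            for v0 in layer_input:
                W, b_v, b_h = cd1_update(W, b_v, b_h, v0, lr)
        weights.append(W)
        hidden_biases.append(b_h)
        _, layer_input = sample_h(W, b_h, layer_input)  # probabilities become the next layer's input
    return weights, hidden_biases
```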
https://www.doc.ic.ac.uk/~sd4215/hopfield.html