Reinforcement Learning

  • Use reinforcement learning when it is hard to provide a training set
    • E.g. situation-action pairs

Framework

  • An agent interacts with its environment.
  • There is a set $S$ of states and a set $A$ of actions
  • At each time $t$, the agent is in state $s_t$ and must choose an action $a_t$
    • Goes into state $s_{t+1}=\delta(s_t,a_t)$
    • Receives a reward $r_t = R(s_t,a_t)$
  • Agent has a policy $\pi:S\rightarrow A$, and the aim is to find an optimal policy $\pi^*$ that maximises the cumulative reward
  • $\delta, R, \pi$ can be multi-valued with a random element (i.e. they can be treated as probability distributions)
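A minimal sketch of this framework in Python; the three-state environment, its dynamics and the random policy are assumptions for illustration, not part of the notes:

```python
import random

# Illustrative deterministic environment with a set S of states and a set A of actions.
S = ["s0", "s1", "s2"]
A = ["left", "right"]

def delta(s, a):
    """Transition function: next state s_{t+1} = delta(s_t, a_t)."""
    i = S.index(s)
    i = max(0, i - 1) if a == "left" else min(len(S) - 1, i + 1)
    return S[i]

def R(s, a):
    """Reward function: r_t = R(s_t, a_t) (reward for reaching s2, assumed)."""
    return 1.0 if delta(s, a) == "s2" else 0.0

def pi(s):
    """A (random) policy pi: S -> A; the aim is to find an optimal pi*."""
    return random.choice(A)

# One step of agent-environment interaction at time t.
s_t = "s0"
a_t = pi(s_t)
r_t, s_next = R(s_t, a_t), delta(s_t, a_t)
print(s_t, a_t, r_t, s_next)
```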

Models of Optimality

  • Finite horizon reward: $\sum_{i=0}^{h-1} r_{t+i}$ (computationally simple)
  • Infinite discounted reward: $\sum_{i=0}^\infty\gamma^ir_{t+i}$ (easier for proving theorems)
  • Average reward: $\lim_{h\rightarrow \infty}\frac{1}{h}\sum_{i=0}^{h-1}r_{t+i}$ (cannot sensibly choose between a small reward soon and a large reward later)
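As a small worked sketch, the first two models can be computed directly from a reward sequence; the rewards, horizon $h$ and $\gamma$ below are assumed for illustration:

```python
# Cumulative reward under the finite-horizon and discounted models,
# for an example reward sequence r_t, r_{t+1}, ... (values assumed).
rewards = [1.0, 0.0, 0.0, 5.0, 1.0]
gamma = 0.9   # discount factor
h = 3         # finite horizon

finite_horizon = sum(rewards[i] for i in range(h))
discounted = sum(gamma**i * r for i, r in enumerate(rewards))

print(finite_horizon)   # r_t + r_{t+1} + r_{t+2}
print(discounted)       # sum_i gamma^i * r_{t+i}, truncated to the rewards available
```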
  • Various ways of learning, based on:
    • Value Function Learning, e.g. TD-, Q-Learning
    • Policy Learning, e.g. Hill Climbing, Policy Gradients
    • Actor-Critic, a mix of both

Value Function Learning

Each policy $\pi$ has a value function $V^\pi:S\rightarrow \mathbb{R}$, where $V^\pi(s)$ is the average discounted reward received by an agent that starts at $s$ and chooses actions according to $\pi$.

In the discounted reward, rewards arriving later are worth less; the discount factor $\gamma$ controls how far ahead the agent looks when making predictions.

The agent retains its own estimate of $V^*$, the maximum (expected) discounted reward. The aim is to start with a random $V$ and improve it until it is close to $V^*$ (bootstrapping).

Exploration vs. Exploitation

  • Either choose a random action (exploration) or choose what we think is the best action (exploitation), e.g.:
    • $\epsilon$-greedy: choose a random action a small fraction of the time (e.g. 5%), otherwise the best-known action, or
    • use Softmax (the Boltzmann distribution) to choose the next action:

$P(a) = \frac{e^{R(a)/T}}{\sum_{b\in A}e^{R(b)/T}}$
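A sketch of both selection rules; the 5% exploration rate, the temperature $T$ and the estimated rewards are assumptions for illustration:

```python
import math
import random

def epsilon_greedy(est_reward, actions, epsilon=0.05):
    """Choose a random action epsilon of the time, otherwise the best-looking action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: est_reward[a])

def softmax_choice(est_reward, actions, T=1.0):
    """Boltzmann/softmax choice: P(a) = exp(R(a)/T) / sum_b exp(R(b)/T)."""
    weights = [math.exp(est_reward[a] / T) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights], k=1)[0]

est_reward = {"left": 0.2, "right": 1.0}   # assumed reward estimates
print(epsilon_greedy(est_reward, ["left", "right"]))
print(softmax_choice(est_reward, ["left", "right"], T=0.5))
```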

TD-Learning

  • The true value $V^*(s)$ is the immediate reward plus the discounted value of the next state (for the action chosen by the optimal policy):

$V^*(s)=R(s,a)+\gamma V^*(\delta(s,a))$

  • For the estimated value $V$, we want to learn, so we incorporate the error and a learning rate $\eta$:

$V(s_t)\leftarrow V(s_t)+\eta\,[r_t+\gamma V(s_{t+1})-V(s_t)]$
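A tabular TD(0) sketch of this update; the states, the observed transition and the constants $\gamma$, $\eta$ are assumed for illustration:

```python
# Tabular TD(0): V(s) <- V(s) + eta * [r + gamma * V(s') - V(s)]
gamma, eta = 0.9, 0.1
V = {s: 0.0 for s in ["s0", "s1", "s2"]}   # start from an arbitrary estimate of V*

def td_update(V, s, r, s_next):
    """One update of V(s) towards the bootstrapped target r + gamma * V(s')."""
    target = r + gamma * V[s_next]
    V[s] += eta * (target - V[s])

# Example: one observed transition (reward and next state assumed).
td_update(V, "s0", r=0.0, s_next="s1")
print(V)
```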

Q-Learning

  • Q-Learning is similar, but works with state-action pairs rather than states alone, using a function $Q^\pi : S \times A \rightarrow \mathbb{R}$ in place of the value function (see the tabular sketch after the limitations below).

  • Limitations:
    • Delayed reinforcement: rewards might not be received until several time steps after the responsible action, which slows down learning
    • The search space must be finite: convergence relies on visiting every state infinitely often
    • Real-world problems cannot rely on a lookup table (one entry per state or state-action pair), so we need to generalise
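A tabular Q-learning sketch using the standard update $Q(s,a)\leftarrow Q(s,a)+\eta\,[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$; the toy environment and constants are assumptions, and the dictionary below is exactly the lookup table that real-world problems outgrow:

```python
import random

# Tabular Q-learning with an epsilon-greedy behaviour policy (toy environment assumed).
S = ["s0", "s1", "s2"]
A = ["left", "right"]
gamma, eta, epsilon = 0.9, 0.1, 0.05
Q = {(s, a): 0.0 for s in S for a in A}     # the lookup table

def delta(s, a):                             # assumed deterministic dynamics
    i = S.index(s)
    return S[max(0, i - 1)] if a == "left" else S[min(len(S) - 1, i + 1)]

def R(s, a):                                 # assumed reward: 1 for moving into s2
    return 1.0 if delta(s, a) == "s2" else 0.0

def choose(s):
    if random.random() < epsilon:
        return random.choice(A)
    return max(A, key=lambda a: Q[(s, a)])

for episode in range(500):
    s = "s0"
    for step in range(20):
        a = choose(s)
        r, s_next = R(s, a), delta(s, a)
        # Q-learning update: bootstrap on the best action in the next state.
        Q[(s, a)] += eta * (r + gamma * max(Q[(s_next, b)] for b in A) - Q[(s, a)])
        s = s_next

print(max(A, key=lambda a: Q[("s0", a)]))    # greedy action learned at s0
```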

Backprop

  • Supervised learning:
    • Learn from human games
  • TD-Learning
    • Use subsequent positions to refine evaluation of current position
    • General method that does not rely on world model, e.g. the rules of the game
  • TD-Learning and backprop: train a neural network evaluation function by backpropagation, using the TD target from the next position as the training signal
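A minimal sketch of the combination, assuming PyTorch and an invented 8-feature position encoding: a small network evaluates a position, and backprop moves its output towards the TD target built from the next position.

```python
import torch
import torch.nn as nn

# Assumed: positions are encoded as feature vectors of size 8.
value_net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1), nn.Tanh())
opt = torch.optim.SGD(value_net.parameters(), lr=0.01)
gamma = 1.0   # board games are often undiscounted within a game

def td_backprop_step(position, next_position, reward):
    """Move V(position) towards the TD target reward + gamma * V(next_position)."""
    with torch.no_grad():
        target = reward + gamma * value_net(next_position)
    loss = nn.functional.mse_loss(value_net(position), target)
    opt.zero_grad()
    loss.backward()   # backprop distributes the TD error over the network weights
    opt.step()

# Example call with random stand-in positions (assumed encoding).
td_backprop_step(torch.randn(1, 8), torch.randn(1, 8), reward=0.0)
```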

Policy Learning

  • Attempt to optimise the policy $\pi_\theta : S\rightarrow A$ directly, with parameters $\theta$
    • We consider the fitness of a policy as the value function of initial state under that policy

$\text{fitness}(\pi_\theta) = V^{\pi_\theta}(s_0)=E_{\pi_\theta}[r_{\text{total}}]$

Evolution Strategy

  • Hill climbing
    1. Initialise a ‘champ’ policy $\theta_{\text{champ}} = 0$
    2. For each trial, generate a mutant policy $\theta_{\text{mutant}} = \theta_{\text{champ}} + \text{Gaussian noise}$
    3. Evaluate both on the same task; if the mutant is better, then $\theta_{\text{champ}}\leftarrow (1-\alpha)\theta_{\text{champ}} + \alpha\theta_{\text{mutant}}$
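A sketch of this hill-climbing loop; the policy parameterisation, the toy fitness function standing in for running the policy, and the noise scale are assumptions for illustration:

```python
import numpy as np

def fitness(theta):
    """Toy stand-in for evaluating the policy, i.e. estimating V^{pi_theta}(s0)."""
    return -np.sum((theta - 1.0) ** 2)      # maximised at theta = 1

theta_champ = np.zeros(5)                    # 1. initialise the champ policy at 0
alpha, sigma = 0.1, 0.5

for trial in range(1000):
    # 2. generate a mutant policy by adding Gaussian noise
    theta_mutant = theta_champ + sigma * np.random.randn(*theta_champ.shape)
    # 3. evaluate both on the same task; if the mutant is better, blend it in
    if fitness(theta_mutant) > fitness(theta_champ):
        theta_champ = (1 - alpha) * theta_champ + alpha * theta_mutant

print(theta_champ)   # approaches the optimum of the toy fitness
```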

Policy Gradients

  • Use gradient ascent instead of random updates. For episodic:
    • After a sequence of actions, we receive a total reward; distribute this reward to each action equally
    • If the total reward is high, change the parameters to make the agent more likely to take those actions in similar situations, i.e. increase $\log\prod_{t=1}^m \pi_\theta (a_t|s_t) = \sum_{t=1}^m \log\pi_\theta(a_t|s_t)$ in proportion to the total reward
    • This is the REINFORCE algorithm: $\theta \leftarrow \theta + \eta\, r_{\text{total}} \nabla_\theta \sum_{t=1}^m \log\pi_\theta(a_t|s_t)$

  • For non-episodic tasks, rewards are received incrementally throughout the game; the policy is a probability distribution over actions
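A sketch of episodic REINFORCE for the update above, assuming PyTorch, 4-dimensional states and 2 actions (all invented for illustration):

```python
import torch
import torch.nn as nn

# Assumed: a tiny softmax policy network pi_theta(a|s).
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, total_reward):
    """One episodic update: scale sum_t log pi_theta(a_t|s_t) by the total reward."""
    log_probs = torch.log_softmax(policy(torch.stack(states)), dim=-1)
    chosen = log_probs[torch.arange(len(actions)), torch.tensor(actions)]
    loss = -total_reward * chosen.sum()      # minimising this ascends the objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# Example with stand-in episode data (states, actions, total reward assumed).
states = [torch.randn(4) for _ in range(3)]
actions = [0, 1, 1]
reinforce_update(states, actions, total_reward=2.0)
```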

Actor-Critic

  • In the fitness gradient above, the $Q^{\pi_\theta}$ value is hard to estimate. Actor-Critic addresses this:
  • Critic NN measures how good the action taken is (value-based)
    • Have a family of Q functions and learn different parameters to approximate $Q^{\pi_\theta}$
  • Actor NN controls how our agent behaves (policy-based)
    • At the same time as approximating $Q^{\pi_\theta}$, we are learning the best policy
  • Instead of updating every episode as in REINFORCE, we update every step (as in TD-Learning)
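A per-step Actor-Critic sketch, assuming PyTorch; this variant uses a state-value critic and the TD error in place of the Q estimate (a common simplification), with invented state and action sizes:

```python
import torch
import torch.nn as nn

# Assumed sizes: 4-dimensional states, 2 actions.
actor = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))    # policy network
critic = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))   # value network
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-2)
gamma = 0.9

def actor_critic_step(s, a, r, s_next):
    """One per-step update: the TD error trains the critic and scores the actor."""
    td_target = r + gamma * critic(s_next).detach()
    td_error = td_target - critic(s)                     # critic's error / advantage signal
    critic_loss = td_error.pow(2).mean()
    log_prob = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -(td_error.detach() * log_prob).sum()   # policy gradient weighted by TD error
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

# Example call with stand-in transition data (assumed).
actor_critic_step(torch.randn(4), a=1, r=0.5, s_next=torch.randn(4))
```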