Temporal-Difference (TD) learning

TD (specifically, TD(0)) allows us to learn value functions without full episodes via the following estimate: \(V(S_t) \leftarrow V(S_t) + \alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_t))\)

In the above, \(R_{t+1}+\gamma V(S_{t+1})\) is called the target, \(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\) is referred to as the TD error.

This is called the TD(0) algorithm, and \(0<\alpha<1\).

On a high-level, TD learning is learning a guess from a guess, the so called TD error is the difference between two predictions.

Recall that given a list \([a_1,...,a_n]\), the running average can be computed recursively as \(\mu_k = \frac{(k-1)\mu_{k-1} + a_k}{k} = \mu_{k-1} + \frac{1}{k}(a_k - \mu_{k-1})\).

In TD learning we apriori do not know how many steps are in the episode, or whether the task is even episodic at all, so the \(\frac{1}{k}\) is subtituted with \(\alpha\), where \(0<\alpha <1\).

However, information could take a long time to propagate to the actual states in TD learning, so instead of using 1 step, use \(n\) steps, such that the target is now \(R_{t+1}+\sum_{i=2}^{n-1} \gamma^{i-1} R_{t+i}+ \gamma^n V(S_{t+n})\). Namely, we only bootstrap the previous value function estimate after taking \(n\) steps forward rather than 1.

Here, taking \(n \rightarrow \infty\) returns the full Monte-Carlo approach, where the value functions are updated once we finish one whole episode, giving the following update equation: \(V(S_t) = V(S_t)+\alpha(G_t - V(S_t))\).

Note that the target updates for TD and \(n\) -step TD are as follows:

\(G_t^{(1)} = R_{t+1}+\gamma V(S_t)\) for TD learning.
\(G_t^{(n)} = R_{t+1}+\gamma G_{t+1}^{(n-1)}\) for \(n\) -step TD learning.

TD(\(\lambda\)) aims for a compromise between the TD and Monte-Carlo approach above, by taking an average of the two targets above using the following target: \(G^{\lambda}_t = R_{t+1}+\gamma ((1-\lambda)V(S_{t+1})+\lambda G^{\lambda}_{t+1})\)

In the above equation, setting \(\lambda=0,1\) recovers the TD learning and Monte-Carlo approach respectively. Intuitively, \(\frac{1}{1-\lambda}\) is the horizon, i.e. the \(n\) in the \(n\) -step TD learning algorithm [1].

TD(\(\lambda\)) an be written in the following manner: \(G_t^\lambda= (1-\lambda)\sum_{k=1}^\infty \lambda^{k-1}G_t^{(k)}\) \(G^{(k)}_t = \sum_{i=1}^k \gamma^{i-1}R_{t+i} + \gamma^kV(S_{t+k})\) \(V(S_t) = V(S_t) +\alpha(G_t^\lambda -V(S_t))\)

Note however TD(\(\lambda\)) requires full episodes to learn, hence eligibility traces can be used.