Why is there a Deadly Triad issue and how to handle it ?
This post is the following of an introductory one that you can find here.
Why is there a Deadly Triad issue ?
The deadly triad issue states that when one combines:
- Off-policy learning
- Function approximations (as Neural Nets)
The Value function may suffer from instabilities and/or divergence.
Let’s say the agent is in state S, where he can take the action a leading him to the similar state S’, where he can choose between several actions:
At the beginning of the ε-greedy policy, if there are a lot of actions available, it is very likely to select any of them, and thus, not to select a1 that often, whose Q-value will thus not be updated.
Let’s say that the reward is always 1 for this first action a, and the best Q-value of the next state S’ is evaluated to be 2.
The Q-learning equation (off-policy+bootstrapping) updates the Q-value of this state/action pair such that:
As γ ≈ 1, we obtain, after some updates (depending on the value of α):
Q(S, a) ≈ 3.
So far every thing seems consistent. However, as we use a function approximations to compute the Q-values here, with state S and S’ being similar, the update of this Q-value might change the Q-value of Q(S’,a1). If a and a1 are similar enough, the function approximation might implicitly embed Q(S, a) ≈ Q(S’, a1).
This value might also converge to Q(S’,a1) ≈ 3.
At the next run, following the same updates rules, we should obtain:
Q(S, a) ≈ 4 and then Q(S’,a1) ≈ 4 … etc, following this scheme:
We clearly see here why divergence is likely to happen when we combine bootstrapping, off policy learning, and function approximations.
Of course, the smaller α is, the more you have to use the off-policy update rule to have the Q-value rising, but the main idea is that both value can rise up infinitely.
I will now briefly explain why this 3 components are necessarily combined to obtain the divergence, or why if we change one of them, this problem does not occur anymore.
How is each component responsible for it ?
Bootstrapping allows us to estimate the Q-value (the return of a state/action pair) using the estimation of the next state/action pair.
If we would not use this technique and rather use Monte Carlo approximation, we would rely on the real rewards obtained at each step, and therefore avoid having two similar states, whose Q-value diverge together.
We could use a mixed approach (e.g. TD( λ)) to reduce this impact.
With off-policy learning method, we use the best Q-value of the next state to update the policy. If we visited this state each time we use this approximation (what on policy learning does, e.g SARSA), we would update this Q-value to have it converge to its real value. The problem with off-policy is that we might not update this value sufficiently often to prevent it from diverging.
If you look at the scheme above, it’s clear why function approximations are essential to produce this divergence. They are responsible for the approximation of the Q-values (for Q(S, a) ≈ Q(S’, a1)). Without this approximation, we would not have this problem arising.
The deadly triad issue in Mario
Last year, I tried to create a neural net using PyTorch and have it learn how to play the NES version of Super Mario Bros.
I followed some post to understand how to implement Deep Q learning, without making use of any trick necessary to avoid the Deadly Triad issue (that I discuss in the next post). After I let the neural net learn for a night, I faced the fact that Mario couldn’t do anything else but run forward and die touching the first Goomba.
Going forward was rewarded with a +1, while running anywhere else was rewarded with a -1, and dying with -10.
If you look at the two next states, you will see that there are pretty similar.
We are here in the situation described above. The two states are similar enough for the neural network to confuse them (or at least, have the update on the first one highly changing the value of the second one).
During the training, it happened that Mario eventually jumped over the Goomba, or did not run into it, stopping before, but the best evaluated value was always to go on forward, thus leading to a great overestimate of the real Q-value of this state/action.
This shows that there is no (not yet ?) any magical algorithm that would be able to solve anything…