How to account for rare rewards


Andreas Forslöw

Jul 31, 2015, 5:13:57 PM
to Deep Q-Learning
In Deep Q-learning, the AI learns the 'optimal' path from positive and negative rewards. However, as I understand it, the rewards need to come relatively frequently, because otherwise the learning will be too influenced by neutral rewards (reward = 0). Am I right about this, and if so: is there any way to make the algorithm account for it?

rp...@ualberta.ca

Aug 7, 2015, 4:20:47 PM
to Deep Q-Learning
Sparse reward (and its credit assignment) is a major problem in reinforcement learning. However, the issue is not so much that you need to see a reward consistently, but rather that learning can't begin until you see a nonzero reward. Because of the Bellman equation used in the update rules of RL algorithms, seeing a reward once is enough to get things started, but many update passes would be required to reach the final state or action value (as when using dynamic programming to solve an MDP). The learning isn't dominated by neutral rewards, because the reward is only one part of the Q-value update; the other part is the expected total discounted reward for the best possible action.
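As a rough illustration (the state and action numbers here are made up, not from any of the games discussed), a tabular Q-learning update shows how the reward is only one term of the target, and how a zero-reward transition still learns by bootstrapping:

```python
import numpy as np

# Toy tabular Q-learning update (hypothetical 5-state, 2-action setup).
n_states, n_actions = 5, 2
alpha, gamma = 0.5, 0.9            # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Bellman-style target: reward plus discounted best value of next state.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# One nonzero reward gets things started...
q_update(3, 0, 1.0, 4)   # Q[3, 0] -> 0.5
# ...and a later zero-reward transition into state 3 still learns,
# because its target bootstraps from Q[3].max(), not from the reward alone.
q_update(2, 1, 0.0, 3)   # Q[2, 1] -> 0.225
```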

Now in the DQN, things are different because we are randomly sampling from an experience replay memory and not using eligibility traces. Once you see a first reward, there will be a single experience in the memory that will eventually (hopefully) be sampled, causing an update to the parameters of the network. From then on, whenever we take a sample from the experience replay whose next state is the state (or a very similar one, as we are using function approximation) that previously led us to a reward, that sample's action values will be updated to reflect which action led to that reward. In this way a reward is propagated back through time along the trajectory of states that led to it. Furthermore, once you see that reward it becomes self-fulfilling: you will begin to see it more often as the agent realizes that new parts of the state space can reach known parts that give positive rewards. This is the whole reason we use eligibility traces in traditional RL, but we can't use them in the DQN because we are not learning in an online manner.
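To make that propagation concrete, here is a minimal sketch assuming a made-up 5-state chain MDP (not one of the Atari games): only the last transition carries a reward, and repeated uniform sampling from the buffer gradually drags the value back along the chain, one state per "pass":

```python
import random
import numpy as np

random.seed(0)
alpha, gamma = 0.5, 0.9
Q = np.zeros((5, 1))   # 5 states, a single action

# Chain 0 -> 1 -> 2 -> 3 -> 4; only the transition out of state 3 rewards.
# Each tuple is (s, a, r, s_next, done).
replay = [(s, 0, 1.0 if s == 3 else 0.0, s + 1, s == 3) for s in range(4)]

for _ in range(1000):  # uniform sampling, as in the basic DQN replay
    s, a, r, s_next, done = random.choice(replay)
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# State values approach gamma**(3 - s): roughly [0.73, 0.81, 0.90, 1.00, 0]
print(np.round(Q.ravel(), 2))
```

Early in the loop, sampling a zero-reward transition does nothing useful; it only starts helping once its successor's Q value is nonzero, which is exactly the slow cascade described above.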

An example of this is the game Freeway, in which you get a single point for crossing the road. Once you have crossed the road once, you find that in the position just below the top you can press 'up' and get a reward. Then, in the position below that, you can press 'up' again (as it will have the highest Q value) to lead you to the newly updated state.

Of course it's much more complicated with function approximation, as we can't explicitly separate our states, and a Q-value update will affect many other possible states that share features.

So yes, reward frequency (and magnitude!) is a problem, and no, we don't really have a solution.

Ehud Ben-Reuven

Aug 11, 2015, 9:21:49 AM
to Deep Q-Learning
Does sampling with higher weights for the reward samples make any sense?

rp...@ualberta.ca

Aug 12, 2015, 4:47:19 PM
to Deep Q-Learning
It's certainly worth a shot; if you attempt it I would be interested in the results. However, even samples with zero reward are important to update: the reward is only half of the update rule, the other half being the highest Q value of the resulting state.

Here is a video of RL in the Gridworld domain using Dyna updates: https://www.youtube.com/watch?v=NLOWqilDT3g

This demonstrates the cascading updates to the states that lead to the goal state (the only transition that gives a reward); hopefully that gives some intuition about what's happening.

So perhaps biasing the sampling not only toward experiences that result in a reward, but toward short trajectories that lead to a reward (not just the rewarding samples, but a few samples prior) would be useful.
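A rough sketch of that idea (the function and parameter names here are purely illustrative, not from any DQN codebase): give each rewarding transition and a short window of its predecessors a decaying weight boost, then sample replay indices with those weights:

```python
import random

def trajectory_weights(rewards, window=4, boost=4.0):
    """rewards[i] is the reward of replay transition i, in time order."""
    weights = [1.0] * len(rewards)   # baseline: uniform sampling
    for i, r in enumerate(rewards):
        if r != 0.0:
            # Boost the rewarding step and the `window` steps before it,
            # with the boost halving at each step back in time.
            for k in range(window + 1):
                j = i - k
                if j >= 0:
                    weights[j] += boost * (0.5 ** k)
    return weights

rewards = [0, 0, 0, 0, 0, 1, 0, 0]
w = trajectory_weights(rewards, window=2)
print(w)  # -> [1.0, 1.0, 1.0, 2.0, 3.0, 5.0, 1.0, 1.0]

# random.choices can then draw replay indices with these weights:
idx = random.choices(range(len(rewards)), weights=w, k=1)[0]
```

Keeping the baseline weight at 1.0 (rather than 0) preserves the point above that zero-reward samples still matter; the bias only tilts the distribution, it doesn't exclude anything.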

Willem H. de Boer

Sep 2, 2015, 5:32:42 AM
to Deep Q-Learning
I have experimented with biased sampling from experience replay memory over the past few days. 

I tried several strategies and found (informally, as I have only had the time to test on one game so far) that it tends to be susceptible to getting trapped in local optima. That is, it will start chasing early successes and then reach a plateau from which it seemingly can't recover.

The strategy that has worked best so far -- and this is on one game only -- is biasing towards experiences with non-zero reward and the N experiences preceding each of them. In this case, average Q tends to rise quicker than with uniform sampling, and the average episode reward over an epoch is spikier: it produces really good runs several epochs ahead of the vanilla method. The downside of this faster learning is that it cuts both ways, so it will learn equally "well" from bad episodes.

Willem

rp...@ualberta.ca

Sep 4, 2015, 3:36:52 PM
to Deep Q-Learning
That is unfortunate it doesn't work as well as hoped. Which game have you tested it on? Do you intend to try it out on more? How big was N (number of experiences preceding a non-zero reward)?

Willem H. de Boer

Sep 4, 2015, 3:48:29 PM
to Deep Q-Learning
I have tested it on Seaquest so far, and only just the one run of about 50 epochs. I compared average episode score and average max Q over time with 50 epochs from a uniform sampling run. Average max Q does tend to rise quicker compared to uniform sampling, which I guess is to be expected.

N was 15, which is 1 second with a frame skip of 4. 

I intend to revisit this once I have implemented an idea to prune large parts of the state-space tree using a risk measure, in the hope of speeding up learning (at the expense of optimality). Best to try it out on smaller toy problems first, because waiting 3 days for a result tends to kill the workflow a bit!