Reproducing Deepmind - Observations


Juan Leni

Aug 18, 2015, 6:16:02 PM
to Deep Q-Learning
Hi,
I am trying to understand the differences between deep_q_rl and DeepMind.
At the moment I am running Breakout on both, plus some additional logging. In both cases I am using the parameters from the Nature paper.

I thought you might find the data interesting, so I have uploaded the charts to this wiki:
The DeepMind run will take a few more days to finish. I will update the charts when it is done.

Some observations so far:
- As early as epoch 10, the DeepMind results are already very different. This is good: it will speed up the analysis.
- I have to disagree with the idea that deep_q_rl learns well at the beginning and then gets worse with more training. Something else is going on here. A big difference is that deep_q_rl never manages to reach a good minimum score during testing; if you look at the charts, the minimum score is almost flat. DeepMind, on the contrary, achieves a very good minimum test score after only 10 epochs.
- DeepMind scores are spread much more widely than deep_q_rl's. During training, deep_q_rl scores cluster into a few groups. Is this a kind of overfitting? What do you think about this?
- During training an episode restarts on every lost life, but during testing all lives are used. If we get scores of ~200 in training, then with 5 lives shouldn't we easily get scores >700? Yet training and testing scores look quite similar, even though testing has 5 times as many lives. What is going on here?
Somehow it seems the policy for the first life is good and then things go wrong, right?

Regarding differences, the only "important" thing I've found so far is that the DNNs are not initialised in the same way.
I am running another experiment with the corrected initialisation. I will post the results in the wiki soon, but it does not look promising...

Some code in Torch that surprises me (extracted from nn/SpatialConvolution.lua):
...
      stdv = 1/math.sqrt(self.kW*self.kH*self.nInputPlane)
...
      self.weight:uniform(-stdv, stdv)
      self.bias:uniform(-stdv, stdv)

They apply the same uniform range to both the weights and the biases? I understood the usual recommendation is to initialise the biases to some small positive constant. In deep_q_rl we are using 0.1, which sounds normal to me. Why is Torch different?
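Just to make the comparison concrete, here is a rough Lasagne sketch of the two schemes (only a sketch: the shapes are made up for illustration and I am not claiming this is exactly what either codebase does):

import numpy as np
import lasagne

# First DQN conv layer shape from the Nature paper: 32 filters of 8x8 over 4 input planes
num_input_planes, kW, kH = 4, 8, 8

# Torch's SpatialConvolution default: U(-stdv, stdv) for BOTH weights and biases
stdv = 1.0 / np.sqrt(kW * kH * num_input_planes)
torch_like_W = lasagne.init.Uniform(range=stdv)
torch_like_b = lasagne.init.Uniform(range=stdv)

# The "usual recommendation" I had in mind: a small positive constant bias
small_positive_b = lasagne.init.Constant(0.1)

l_in = lasagne.layers.InputLayer((None, num_input_planes, 84, 84))
l_conv = lasagne.layers.Conv2DLayer(l_in, num_filters=32, filter_size=(kW, kH),
                                    stride=(4, 4), W=torch_like_W, b=torch_like_b)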
Sorry if I am new to the topic and missing something. I would love to hear your opinions/comments.

-- Juan

Alejandro Dubrovsky

Aug 19, 2015, 1:59:13 AM
to deep-q-...@googlegroups.com
Nice analysis.

Is the deep_q_rl code you are running Nathan's latest? I haven't run Breakout on the latest code, but I've run it over twenty times before, and I don't usually see test-time average scores as low as yours for that long. They usually look similar to what you are seeing from the DeepMind code. I am currently running something else (Seaquest) but will try Breakout after this finishes.

Regarding getting 200 for the first life vs the overall score: the environment changes a lot once you tunnel through, so the network gets confused. Also, to hit 700 you'd have to clear the first level, which would mean starting to aim for the remaining bricks instead of just hitting the ball at random. Note that, from the scores you posted and the scores DeepMind reported in the Gorilla paper, the DeepMind code does not clear the first level either (they report scores of 401 and 402 for their Nature and Gorilla networks; clearing would probably get you around 410-420, and then you'd get a whole bunch more points from the easy start on the next level).

I haven't run the DeepMind code yet, but from the scores I've seen in the Gorilla paper, it isn't obvious to me that deep_q_rl is doing worse than it does.

Juan Leni

Aug 19, 2015, 4:18:25 AM
to deep-q-...@googlegroups.com
I will try an older version to see if there is a big difference. I was running the latest Theano, ALE, etc. Maybe there are changes there? Some "configuration management" may be necessary. :)
Anyway, next time I will try to be more precise and post GitHub commit hashes for each library.

Regarding first life vs overall score, I didn't think it was deep_q_rl's fault; I was just wondering how it could be possible. Your explanation makes sense. I see there is a lot of room for improvement there... It happens in other games like Pac-Man, where it can't finish the level either.

As I discussed before, I was doing some refactoring that might get integrated. I can't promise, but if I have time in the next few weeks I can clean up and share the code for logging/plotting. It would be nice to have some kind of script to upload results to this wiki after each run (including hashes/parameters). In addition, I think it gives a clearer picture when you can see the score for each episode instead of only the aggregated data.

Nathan Sprague

Aug 19, 2015, 8:57:02 AM
to Deep Q-Learning
Your analysis is nice! However, I agree with Alejandro that something funny is happening with those results. I think either you are using an older version of one of the base packages (Lasagne, Theano, ALE, etc.), or you are using the latest version and a recent change has broken something. Providing the commit hashes should help clarify.

I'm pretty sure the issue I'm seeing /isn't/ the difference in weight initialization. I've tried weight initialization that is consistent with the DeepMind code, and it doesn't seem to make much difference. I don't know why Torch uses those defaults.

Juan Leni

Aug 19, 2015, 9:27:25 AM
to Nathan Sprague, Deep Q-Learning
I know these results are unusual. 
I have added one more chart with a completely clean checkout (not even the changes for logging). Same result...

I installed Lasagne/Theano using pip (as described in dep_script.sh).
Unfortunately, as far as I know, I can't get the git SHA from those folders. (Is there any trick I don't know?)
The best I can say is that I installed both on Aug 06 at ~17:00 GMT.

Later today I will check out all of them (Lasagne/Theano/ALE) from GitHub and try one more time (at least the first 20 epochs).
It would be great if any of you could run Breakout for at least the first 20 epochs and share results + SHAs.




Juan Leni

Aug 20, 2015, 10:16:37 AM
to Nathan Sprague, Deep Q-Learning
Ok, I think this is good news.
I have updated the charts and started a run with known SHAs (latest from GitHub).
So far it looks good. Much better than before.

Now I am wondering whether changes in other packages were to blame, or whether the results are just very sensitive to initial conditions.
@Alejandro: DeepMind managed to get a few episodes over 500 (max = 763). It seems there were a few lucky shots there...

Nathan Sprague

Aug 20, 2015, 11:11:04 AM
to Deep Q-Learning, nathan.r...@gmail.com
That is good news.  There could have been a short-lived bug in some package (that's a danger of always using the bleeding edge sources), but I think it's more likely that this was just an unlucky run.  I've seen it happen before.

The real issue is the performance gap between the two implementations. In the results you've posted, DeepMind has found policies that average around 300 after 50 epochs, while deep_q_rl is averaging around 200.

Ivan Lobov

Aug 24, 2015, 5:09:34 PM
to Deep Q-Learning, nathan.r...@gmail.com
I've launched a clean version of deep_q_rl to run 200 epochs on Gopher. I'll be able to compare the results afterwards; I'll share them in a few days.

Juan Leni

Aug 24, 2015, 5:17:22 PM
to Deep Q-Learning
Hi Ivan,
That’s great. I started the same a few hours ago but without setting CUDNN to deterministic. Maybe we can compare results later.
By the way, if possible, write down ALE/Theano/Lasagne SHAs.

Ivan Lobov

Aug 25, 2015, 1:39:21 PM
to Deep Q-Learning
I'm not familiar with how to get SHAs of packages. Any reference on that would help :)

Juan Leni

Aug 25, 2015, 1:57:50 PM
to Deep Q-Learning
Hi Ivan,
You get the SHA when you run git show inside the folder where you checked out the repository.

For instance (see the attached screenshot):
The SHA identifies the exact commit, so it is easier to track.
If you installed Theano/Lasagne with pip instead of cloning, as far as I know you can't get the SHA.
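If it helps, here is a small Python sketch that records the HEAD SHA of every checkout in one go (the paths are just placeholders for wherever you cloned each repository):

import os
import subprocess

checkouts = {
    'deep_q_rl': '~/src/deep_q_rl',
    'ale': '~/src/Arcade-Learning-Environment',
    'theano': '~/src/Theano',
    'lasagne': '~/src/Lasagne',
}

for name, path in sorted(checkouts.items()):
    # git rev-parse HEAD prints the SHA of the currently checked-out commit
    sha = subprocess.check_output(['git', 'rev-parse', 'HEAD'],
                                  cwd=os.path.expanduser(path)).strip()
    print(name, sha)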



Ivan Lobov

Aug 25, 2015, 2:09:33 PM
to Juan Leni, Deep Q-Learning
Here are my results for the first 70 epochs: https://github.com/Ivanopolo/deep_q_rl/wiki
Rewards seem good; the action values are inconsistent with the Nature illustration, but they are typical of every test I have run before.


Juan Leni

Aug 25, 2015, 2:14:09 PM
to Deep Q-Learning
Great! Do you have the plot for "average score per episode"?


Ivan Lobov

Aug 25, 2015, 2:53:51 PM
to Deep Q-Learning
I believe that's exactly what the 'rewards per epoch' plot is. I just made my own plots and didn't use plot_results.py, but I've checked that it does the same.

Alejandro Dubrovsky

Aug 26, 2015, 1:08:04 AM
to deep-q-...@googlegroups.com
Attached is the plot from seaquest.

deep_q_rl b56bfb252320302f498a5b88ad95b55ad1b73e53
ale ff2995407e42b1f359478604d01177425065a5b4
theano 15c90dd36c95ea2989a2c0085399147d15a6a477

From watching its best version play, it plays quite well, but it only seems to surface once its load of divers is full, regardless of its oxygen level. This mostly leads to it dying from running out of oxygen before collecting all the divers.

seaquest.png

Juan Leni

Aug 26, 2015, 7:02:47 AM
to Deep Q-Learning
I know reward_per_epoch (in results.csv) is a bit of a misnomer.
In fact that value is really something like "average reward per episode within this epoch".
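Something like this, roughly (just a sketch; the column names are taken from my results.csv and may differ in yours):

import pandas as pd

df = pd.read_csv('results.csv')
# total reward collected during the test epoch, divided by the number of
# test episodes finished in that epoch
avg_reward_per_episode = df['total_reward'] / df['num_episodes']
print(avg_reward_per_episode)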

Were you running Breakout? Unless you are plotting something different, those values seem too high.


Ivan Lobov

Aug 26, 2015, 9:02:57 AM
to Juan Leni, Deep Q-Learning
I ran Gopher since I'm running all my tests on it. And yes, I agree that the label was misleading.


Ivan Lobov

Aug 27, 2015, 4:45:03 AM
to Deep Q-Learning, leni...@gmail.com
And here are the final results from running 195 epochs:

Everything was fine until epoch 116. It managed to hit a 3500 average score per episode (the Nature paper reports 8520). But afterwards everything went south: it stopped improving and completely fell apart after epoch 180.

I wonder what could make the algorithm not just learn worse policies, but actually deteriorate after 116 epochs of normal learning?

My ideas:
- Something wrong with the update rule: it might be diverging at some special point, probably when the gradients become really small? (See the toy sketch at the end of this message.)
- Something wrong with how the gradients are calculated: perhaps rounding gets in the way and the algorithm diverges?

What do you think is going on here? What could possibly go wrong after that many successful epochs?
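To make the first idea a bit more concrete, here is a toy sketch with plain (uncentred) RMSProp on a single scalar. I am not claiming either implementation uses exactly this variant (the Nature paper only says RMSProp), but it shows how the normalisation keeps the step size from shrinking when the gradients do:

import numpy as np

def rmsprop_steps(grads, lr=0.00025, rho=0.95, eps=1e-8):
    # Plain RMSProp on one scalar parameter: step = lr * g / sqrt(E[g^2] + eps)
    # eps here is the common library default; the Nature hyperparameters use a
    # much larger constant, which damps this effect for very small gradients.
    ms, steps = 0.0, []
    for g in grads:
        ms = rho * ms + (1 - rho) * g * g
        steps.append(lr * g / np.sqrt(ms + eps))
    return steps

# Gradients 1000x smaller still produce steps of roughly the same size,
# because the update is normalised by the running RMS of the gradient.
print(rmsprop_steps([0.5] * 50)[-1])     # ~ lr
print(rmsprop_steps([0.0005] * 50)[-1])  # ~ lr as well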

Peter Z

Sep 9, 2015, 1:44:45 AM
to Deep Q-Learning, leni...@gmail.com
My results from running deep_q_rl on Breakout look like this:


I'm also running the original Torch7 version, and it seems more stable.
Maybe the problem is related to target network freezing or the experience replay memory.


Juan Leni

Sep 14, 2015, 6:45:20 AM
to Peter Z, Deep Q-Learning
Well... yes, quite similar to what we've been seeing.
I would love to be able to use deep_q_rl as a baseline...

A few additional findings / ideas:

1) ALE vs Xitari

I recently found some differences between ALE and Xitari (they affect Breakout and a few other games).
In Breakout, ALE declares 6 minimal actions while Xitari only declares 4.
This clearly gives the Xitari version a smaller search space.
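For reference, this is roughly how I checked it (a sketch; the ROM path is a placeholder for wherever your ROMs live):

from ale_python_interface import ALEInterface

ale = ALEInterface()
ale.loadROM('roms/breakout.bin')
print(len(ale.getMinimalActionSet()))  # 6 with the ALE build I tested
print(len(ale.getLegalActionSet()))    # 18, the full Atari action set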

deep_q_rl seemed to perform better on my machine when running with only 4 actions.
Unfortunately, not well enough to be on par... there is something else going on.

BUT another interesting point:
I ran DQN for 60 epochs (with 4 and with 6 actions) and it didn't seem to affect it much!
You can find the charts in the PR.

The changes have already been integrated into ALE. More information here:

2) CUDNN

I've got better results when running cuDNN v3 in deterministic mode (same code / RNG seed / etc.).
Maybe someone can confirm with another experiment; I am not completely sure this always happens.
This would be bad news, because using deterministic=True is definitely slower...
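For reference, this is roughly how I force it (a sketch; the exact flag names depend on the Theano version, so treat them as assumptions — older releases had a single dnn.conv.algo_bwd flag instead of the split ones):

import os
os.environ['THEANO_FLAGS'] = ','.join([
    'device=gpu',
    'floatX=float32',
    'dnn.conv.algo_bwd_filter=deterministic',
    'dnn.conv.algo_bwd_data=deterministic',
])
import theano  # must be imported after the flags are set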

3) Python/Torch

Lately I've been trying to integrate deep_q_rl (Python) and DQN (Lua).

I have forked Lupa and patched it to be able to use Torch from Python (even if you keep DQN in a non-standard folder).
I will try to see if I can get deep_q_rl to use the Torch NN. I am already aware of differences in the way the NN is initialised, etc.

It is still just a hack on top of Lupa, but maybe someone else wants to give it a try.
The fork is here: https://github.com/jleni/lupa
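A minimal example of what the patched Lupa lets you do (this assumes a Lupa build linked against Torch's LuaJIT, as in the fork above; the stock PyPI Lupa will generally not find the torch/nn modules):

from lupa import LuaRuntime

lua = LuaRuntime(unpack_returned_tuples=True)
lua.execute('require "torch"; require "nn"')

# Build the first DQN-style conv layer on the Lua side and poke at it from Python
conv = lua.eval('nn.SpatialConvolution(4, 32, 8, 8, 4, 4)')
print(lua.eval('torch.type')(conv))                                # nn.SpatialConvolution
print(lua.eval('function (m) return m.weight:size(1) end')(conv))  # 32 output planes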

Peter Z

Sep 14, 2015, 9:02:33 AM
to Deep Q-Learning, peter...@gmail.com
Hi Juan,

Cool! I would also prefer deep_q_rl; let's keep in touch and dig deeper to find the bugs.
I haven't looked into the details, but my guess is that there are either some critical bugs in the deep_q_rl implementation, or the issue is the lack of robustness of the DQN algorithm itself. I think Nathan's advice may be helpful: we could set the states to be identical and compare the gradient updates between the Torch7 and Theano implementations.

Thanks and best wishes,
Peter Z


Nathan Sprague

Sep 16, 2015, 9:59:24 AM
to Deep Q-Learning, peter...@gmail.com
Keep us posted on your progress.  I'll be mostly offline until mid-October, but I would love to see this resolved.

-N

Peter Z

Sep 21, 2015, 10:35:50 AM
to Deep Q-Learning
I've run several experiments increasing the experience replay memory to 80000, the minibatch size to 128, and the target network update period to 4000. None of them improved upon the original parameters when playing Breakout. However, the Torch version achieved an average reward of 200 quite early, at the 7,500,000th step. I suspect the problem is something critical, possibly related to the gradient descent algorithm... I'll investigate further; I believe there must be some issue in this implementation.

Peter Z


David Schneider-Joseph

Sep 26, 2015, 8:31:49 PM
to Deep Q-Learning
It appears that deep_q_rl unintentionally zeroes the gradient whenever the error is clipped. See https://github.com/spragunr/deep_q_rl/issues/46
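To illustrate with a toy Theano snippet (not the actual deep_q_rl code): squaring a clipped TD error makes the gradient vanish as soon as the error leaves the clip range, whereas the intended behaviour is Huber-like, quadratic inside [-1, 1] and linear (constant gradient) outside:

import theano
import theano.tensor as T

diff = T.vector('diff')  # TD error: target - Q(s, a)

# Problematic form: d(loss)/d(diff) is zero wherever |diff| > 1
loss_clipped = T.sum(0.5 * T.sqr(T.clip(diff, -1.0, 1.0)))

# Intended behaviour: quadratic inside [-1, 1], linear (gradient +/- 1) outside
quadratic = T.minimum(abs(diff), 1.0)
linear = abs(diff) - quadratic
loss_huber = T.sum(0.5 * T.sqr(quadratic) + linear)

g_clip = theano.grad(loss_clipped, diff)
g_huber = theano.grad(loss_huber, diff)
f = theano.function([diff], [g_clip, g_huber], allow_input_downcast=True)
print(f([0.5, 2.0, -3.0]))  # clipped: [0.5, 0, 0]   huber: [0.5, 1, -1]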

I'm testing a patch (results promising so far), which I'll be submitting shortly.

Peter Z

Sep 27, 2015, 12:52:00 AM
to Deep Q-Learning
Thanks a lot! I think this kind of bug could well be the major issue. I've also started testing and will share the results with you. I think the problem should largely be solved once the gradient descent works correctly.
