Several small changes to the deep_q_rl code


Nathan Sprague

unread,
Feb 3, 2015, 2:28:12 PM2/3/15
to deep-q-...@googlegroups.com
I finally got around to playing around with some parameters to see if I could improve the learning results.  The performance on Breakout still isn't as good as reported in the Deepmind paper, but much better than before.  I just pushed the following changes:

* Cropping game images instead of using OpenCV to rescale them.  (This makes the code more consistent with the original paper.)
* Setting "-disable_color_averaging" to true.  (Thanks to Alejandro for pointing out this issue.)
* Turning off momentum and adjusting the other RMSProp parameters.

I have reason to believe that each of these changes helps a little bit.  I still haven't carefully tuned the learning rate parameters, so there is probably room for improvement there.
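For anyone who wants to see the difference concretely, here's a rough sketch of the two approaches. This is not the actual deep_q_rl code; the sizes and offsets are just illustrative.

```python
# Rough sketch only -- not the exact deep_q_rl preprocessing. Assumes a
# 210x160 greyscale ALE frame as a numpy array; offsets are illustrative.
import cv2
import numpy as np

def preprocess_rescale(frame, size=80):
    """Old behaviour: squash the whole frame to size x size with OpenCV."""
    return cv2.resize(frame, (size, size), interpolation=cv2.INTER_LINEAR)

def preprocess_crop(frame, top=18):
    """Closer to the paper: resize preserving the aspect ratio, then crop an
    84x84 window over the playing area. The vertical offset is a tunable
    detail (it comes up again later in this thread)."""
    resized = cv2.resize(frame, (84, 110), interpolation=cv2.INTER_LINEAR)
    return resized[top:top + 84, :]
```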
figure_1.png

Ajay Talati

unread,
Feb 3, 2015, 5:27:57 PM2/3/15
to deep-q-...@googlegroups.com
I'm not sure how easy it would be to substitute another solver, but there's an improved version of RMSProp (Adam), which is supposed to give better performance towards the end of training.


I think the code's available here,


I want to try and give it a go - I just wondered whether all you have to do to replace the solver method in `cnn_q_learner.py` is to change lines 209-216 to something like,

`self._updates = adam(self._loss, self._parameters, self._adam_tuning_parameters)`

Is that how you would do it?

Alejandro Dubrovsky

unread,
Feb 4, 2015, 4:25:42 AM2/4/15
to deep-q-...@googlegroups.com
Excellent! That graph looks almost there with DeepMind's. Testing it
now. I suspect that DeepMind used a higher discount rate, going by the
comparison of the Q-value graphs. Or maybe a variable (ascending)
discount rate.

BTW, the full RGB output patch for ALE has been merged upstream. If you
pass -send_rgb, it will send RGB values for each pixel (it will still
send the RAM first, but it's only 128 bytes, easy to ignore). I've
submitted the restricted_action_set patch too. Marc Bellemare seems to
merge things on weekends only, so maybe this coming weekend it will get
merged. I'll let you know.

Nathan Sprague

unread,
Feb 4, 2015, 8:25:33 AM2/4/15
to deep-q-...@googlegroups.com

It will probably be more complicated than that.

This code:

self._updates = layers.gen_updates_rmsprop(...)

just puts together appropriate Theano updates.  The update operations are performed later.  It would be a coincidence if the adam code were organized in the same way.

Given a numpy or Theano implementation, it shouldn't be too complicated to substitute a new update rule, but it probably won't work out of the box.
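To make that concrete, here's a rough, untested sketch of what an Adam-style generator written in the same style might look like. The name gen_updates_adam and the default values are made up for illustration, not anything in the repository; the point is just that it only builds the list of Theano update pairs, which the caller later hands to theano.function(..., updates=...).

```python
# Hypothetical sketch -- not repository code. Like gen_updates_rmsprop, this
# only BUILDS (shared_variable, new_expression) pairs; nothing is computed
# until the compiled training function is actually called.
import numpy as np
import theano
import theano.tensor as T

def gen_updates_adam(loss, all_parameters, learning_rate=0.0002,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    updates = []
    t = theano.shared(np.float32(0.0))   # step counter, shared across parameters
    t_new = t + 1.0
    updates.append((t, t_new))
    for param in all_parameters:
        grad = T.grad(loss, param)
        value = param.get_value(borrow=True)
        m = theano.shared(np.zeros(value.shape, dtype=value.dtype))  # 1st moment
        v = theano.shared(np.zeros(value.shape, dtype=value.dtype))  # 2nd moment
        m_new = beta1 * m + (1.0 - beta1) * grad
        v_new = beta2 * v + (1.0 - beta2) * grad ** 2
        # bias-corrected estimates, as in the Adam paper
        m_hat = m_new / (1.0 - beta1 ** t_new)
        v_hat = v_new / (1.0 - beta2 ** t_new)
        updates.append((m, m_new))
        updates.append((v, v_new))
        updates.append((param, param - learning_rate * m_hat / (T.sqrt(v_hat) + eps)))
    return updates
```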

Ajay Talati

unread,
Feb 4, 2015, 12:55:05 PM2/4/15
to deep-q-...@googlegroups.com
Thanks for the advice Nathan :) I appreciate it.

No - you're right - I couldn't get it to work 'out of the box'; there was no straightforward way to hack it!

Next plan is to copy and paste the existing layers.gen_updates_rmsprop_and_nesterov_momentum code into a new function, and try to add the bias correction from Durk Kingma and Jimmy Ba's Adam paper. That should be consistent.

Trying to do this with pen and paper first though :)
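For reference, the pen-and-paper version I'm after is roughly this, written as plain numpy - my own sketch, not repo code:

```python
# Hypothetical sketch: RMSProp's squared-gradient accumulator with the
# bias correction from the Adam paper folded in.
def rmsprop_bias_corrected_step(param, grad, acc, t, lr=0.0002, rho=0.95, eps=1e-6):
    acc = rho * acc + (1.0 - rho) * grad ** 2
    acc_hat = acc / (1.0 - rho ** t)   # corrects the bias from initialising acc at zero
    param = param - lr * grad / (acc_hat ** 0.5 + eps)
    return param, acc
```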

Nathan Sprague

unread,
Feb 5, 2015, 10:16:02 AM2/5/15
to deep-q-...@googlegroups.com
I see this patch was just merged.  This is great!  I'll update to use the new stock ALE as soon as possible.  Report back if you have luck with the higher discount rate.

Alejandro Dubrovsky

unread,
Feb 5, 2015, 7:35:19 PM2/5/15
to deep-q-...@googlegroups.com
On 06/02/15 02:16, Nathan Sprague wrote:
> I see this patch was just merged. This is great! I'll update to use
> the new stock ALE as soon as possible. Report back if you have luck
> with the higher discount rate.
>

I'm still testing the 0.95 discount rate. With the new code I am getting
results that to me look as good as or better than DeepMind's, although
taking a bit longer to get there. Current graph attached (won't kill it
till it starts going down)

Only differences between my code and yours:

1) I'm using 84 x 84 like the original paper
2) I'm initialising the network values with a stdev of 0.02 instead of
0.01 (No good reason for this. I had just read a blog post by Ilya
Sutskever where he mentioned 0.02 and I changed it on impulse)
3) Greyscaling using OpenCV (so, using the luminosity formula instead of
the RGB average - see the sketch after this list)
4) Our cropping Y-cutoff is two pixels different (you are taking 16
pixels off the bottom, I'm taking 14)
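Regarding 3), all I mean is the difference between something like these two conversions (illustrative only, not the exact code in either fork):

```python
import cv2
import numpy as np

def grey_luminosity(rgb_frame):
    # what cv2 does: 0.299*R + 0.587*G + 0.114*B
    return cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)

def grey_average(rgb_frame):
    # plain mean of the three colour channels
    return rgb_frame.mean(axis=2).astype(np.uint8)
```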

It could, of course, just be a flukey run.

Thanks again for this magical code!
breakoutrmsnomomentum.png

Alejandro Dubrovsky

unread,
Feb 5, 2015, 7:44:32 PM2/5/15
to deep-q-...@googlegroups.com
Actually, I forgot to mention one change that most probably invalidates my results. I've got the epsilon during testing set to 0.01 instead of 0.05, so the curves aren't really comparable. I'll test later how one of the learnt networks does with it set to 0.05.
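(For context, the testing epsilon only enters the epsilon-greedy action choice, roughly as below - the names are made up, not the actual agent code - so a higher value means more random actions and generally lower scores.)

```python
import numpy as np

def choose_action(q_values, epsilon, rng=np.random):
    # epsilon-greedy: random action with probability epsilon, greedy otherwise.
    # Only this epsilon differs between the 0.01 and 0.05 curves.
    if rng.rand() < epsilon:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))
```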

Nathan Sprague

unread,
Feb 5, 2015, 9:23:58 PM2/5/15
to deep-q-...@googlegroups.com
Too bad! Those results were looking great. I'll be curious to hear if
they hold up with the higher epsilon.

Ajay Talati

unread,
Feb 5, 2015, 10:47:26 PM2/5/15
to deep-q-...@googlegroups.com
Hey Alejandro!

That's really interesting:

2) I'm initialising the network values with a stdev of 0.02 instead of 
0.01 (No good reason for this. I had just read a blog post by Ilya 
Sutskever where he mentioned 0.02 and I changed it on impulse) 

weights_std=.02,
init_bias_value=0.1

I'll try running your code too, with epsilon 0.05 during testing. Hopefully that should help :)

Alejandro Dubrovsky

unread,
Feb 8, 2015, 8:20:26 AM2/8/15
to deep-q-...@googlegroups.com
Here is the graph for the breakout run with testing epsilon set at 0.05.
It isn't quite as good as the previous one I sent, but it is probably
still as good as DeepMind's, or almost there, although slower to get to
the 150s.

(btw, to create this, I wrote a wrapper to just go through the networks
saved on the original run. It's just a cut-n-paste-n-delete job on
rl_glue_ale_agent. If anyone wants it, the "driver" is ale_retest.py and
the agent is in agent_tester.py on my fork. You pass it the directory of
saved networks, and a testing epsilon with --testing-epsilon. It's much
faster than rerunning the whole thing)
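(If you'd rather roll your own, the whole wrapper boils down to a loop like this - hypothetical names, not the actual ale_retest.py / agent_tester.py code:)

```python
# Hypothetical sketch, not the actual ale_retest.py / agent_tester.py code.
import argparse
import glob
import os
import pickle

def run_test_epoch(network, epsilon):
    """Placeholder: plug in the agent's own testing loop here."""
    raise NotImplementedError

def retest(network_dir, testing_epsilon):
    # adjust the glob pattern to however the saved networks are named
    for path in sorted(glob.glob(os.path.join(network_dir, '*.pkl'))):
        with open(path, 'rb') as f:
            network = pickle.load(f)          # previously pickled Q-network
        print(path, run_test_epoch(network, testing_epsilon))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('network_dir')
    parser.add_argument('--testing-epsilon', type=float, default=0.05)
    args = parser.parse_args()
    retest(args.network_dir, args.testing_epsilon)
```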

I'm now planning to rerun it with a discount rate of 0.97.
breakout005.png

sridhar.i...@gmail.com

unread,
Feb 8, 2015, 8:50:16 AM2/8/15
to deep-q-...@googlegroups.com
To me, this looks pretty bad. The learning is quite unstable with lots of oscillations. 

- Sridhar

Ajay Talati

unread,
Feb 8, 2015, 9:23:46 AM2/8/15
to deep-q-...@googlegroups.com
Hey Alejandro,

sorry for the late reply - my computer's running slow - here's my run up to 102 epochs - with training epsilon = 0.05.

It seems initializing the network with sd 0.02 gives a small performance improvement, especially in the early learning epochs?

The wrapper's a great idea/tool !!! I'm looking forward to trying it out later on - thanks a lot for making it available to us :)

What are your current interests? 

a) fiddling with the screen/observations/network inputs, and network initialization, and the stochastic optimizer? Slow to experiment with !!!
b) fiddling with the RL parameters, discount and epsilon stuff?

Fiddling with a) is what I'll try and make some progress on :)))) 
Alejandros_screen_processing_with_weights_sd_initialized_02_training_epsilon_05.png

Alejandro Dubrovsky

unread,
Feb 8, 2015, 9:55:46 AM2/8/15
to deep-q-...@googlegroups.com
On 09/02/15 01:23, Ajay Talati wrote:
> Hey Alejandro,
>
> sorry for the late reply - my computers running slow - here's my run
> upto 102 epochs - with training epsilon = 0.05.
>
Looks good, learning faster than my run. Thanks for testing it!

> It seems initializing the network with sd 0.02, gives a small
> performance improvement, especially in the early learning epochs?
>
I don't know really, I haven't done any comparisons.

> The wrapper's a great idea/tool !!! I'm looking forward to trying it out
> later on - thanks a lot for making it available to us :)
>
> What are your current interests?
>
> a) fiddling with the screen/observations/network inputs, and network
> initialization, and the stochastic optimizer? Slow to experiment with !!!
> b) fiddling with the RL parameters, discount and epsilon stuff?
>
> Fiddling with a) is what I'll try and make some progress on :))))
>
I don't currently have much time for proper thinking, which is why I'm
doing just enough to keep the GPU busy. I'm going to tweak the discount
rate and try some other games for now. I haven't seen it learn much on
anything other than pong and breakout, although there was definitely
some learning going on in enduro. Mainly I'm trying to give Nathan data
so he can tell what's working and what's not. (I also just enjoy
watching it play)

Once all the results from the DeepMind paper are replicated, there are
some simple ideas that I'd like to test (variable discount rate, deeper
network, context-sensitive epsilon). Also, getting it to use the cuDNN
code might be helpful and shouldn't be too hard.



Nathan Sprague

unread,
Feb 8, 2015, 6:08:26 PM2/8/15
to deep-q-...@googlegroups.com
The results in the original paper show fluctuations that look just like
this. It may be bad, but it is consistent with the published results.

Ajay Talati

unread,
Feb 8, 2015, 7:10:06 PM2/8/15
to deep-q-...@googlegroups.com
If there were a way to calculate a log-likelihood, to my mind that's the only measure of training for which I would reasonably expect to see a smooth epoch-on-epoch improvement.

This is a really hard problem, feature learning + prediction + optimal action. I get confused on anything more complicated than binarized MNIST :(

Considering this is model free, and we don't have mystical knowledge as to the internal state of the environment/emulator - it's almost miraculous what it's doing? I guess fluctuations in the realized reward shouldn’t be too surprising?

Ajay Talati

unread,
Feb 8, 2015, 7:29:38 PM2/8/15
to deep-q-...@googlegroups.com
I'd be really interested in any variance reduction or baseline type techniques that could be used here?

You guys are the experts :)

Alejandro Dubrovsky

unread,
Feb 9, 2015, 4:02:29 AM2/9/15
to deep-q-...@googlegroups.com


On Monday, February 9, 2015 at 12:20:26 AM UTC+11, Alejandro Dubrovsky wrote:

I'm now planning to rerun it with a discount rate of 0.97.


Interestingly, it didn't learn at all with a discount rate of 0.97 with this new code, even though it did with the old code.  Trying 0.96.

Nathan Sprague

unread,
Feb 9, 2015, 11:57:22 AM2/9/15
to deep-q-...@googlegroups.com
I'm currently running a bunch of tests with different parameter settings for both breakout and seaquest.  The general impression I'm getting is that .97 works sometimes for breakout, but it is less stable.  It never seems to work for seaquest.  I haven't tried .96 on either game.

Sridhar Mahadevan

unread,
Feb 10, 2015, 9:10:00 AM2/10/15
to deep-q-...@googlegroups.com
I'm concerned about this issue of brittleness with respect to the discount factor, e.g. 0.97 or 0.96 or whatever. If the learning is that brittle, it's not worth it.
History will discard this as a flaky method. In order for deep Q RL to make a long-term impact, learning has to be stable
across a range of discount factors.

I'm still in two minds about deep Q RL. I can't decide if it's just a flash in the pan, to be forgotten 5 years from now (as the earlier generation of
neural nets in RL was - which most of you are probably too young to know about).

As I always tell my students when I teach the graduate ML class at UMass, there's a reason simple methods, like least squares, k-nearest neighbors,
and k-means clustering, have survived through the decades. They are simple to understand, simple to implement, and give more or less guaranteed
results. They may not be the best-performing methods, but they generally work well and can be easily tuned.

Right now, deep Q RL has all the drawbacks of deep learning and nonlinear RL combined. We don't know why deep learning works, and when/if it will work,
and on top of that, we know theoretically that nonlinear RL is plagued by instabilities (which is why a large portion of the community went back to linear methods).

Nathan Sprague

unread,
Feb 10, 2015, 9:30:04 AM2/10/15
to deep-q-...@googlegroups.com
I completely agree with these concerns. I'm currently putting together
an abstract for RLDM that is just about parameter selection for DQN. Of
the 24 reasonable-seeming parameter combinations of learning rate, rho,
and discount, I'm finding that only two result in meaningful learning
for both breakout and seaquest.

That said, deep feature learning doesn't seem to be a flash in the pan.
The general approach is having a big impact in many other areas. It
is worth finding out if it can have a similar impact in RL. Also, the
DQN paper does represent an impressive result.

I don't know the answer. Is this another Neurogammon: an impressive
one-off result that never really goes anywhere? Or is it the start of
something more interesting? Even in the best case, the approach is so
data-intensive that it will take some thought to figure out how to apply
it to more "real-world" domains. Especially if each new domain
requires finicky parameter tuning.

Alejandro Dubrovsky

unread,
Feb 10, 2015, 11:16:53 PM2/10/15
to deep-q-...@googlegroups.com
On 10/02/15 03:57, Nathan Sprague wrote:
> I'm currently running a bunch of tests with different parameter settings
> for both breakout and seaquest. The general impression I'm getting is
> that .97 works sometimes for breakout, but it is less stable. It never
> seems to work for seaquest. I haven't tried .96 on either game.

0.96 worked, but probably not as well as 0.95. Graph attached (caveat:
epsilon of 0.01).

breakout096.png

Michael Partheil

unread,
Feb 11, 2015, 3:23:53 AM2/11/15
to deep-q-...@googlegroups.com
Concerning parameters: Did any of you also experiment with momentum? In your (Nathan's) current implementation it seems that momentum is set to zero; did you also try the "classical" 0.9? At least in supervised deep learning this typically works much better.

Nathan Sprague

unread,
Feb 11, 2015, 7:23:23 AM2/11/15
to deep-q-...@googlegroups.com
Interesting... Unfortunately, it is hard to know whether run-to-run
variations are a result of parameter changes, or just the result of random
differences in initialization and training. In my playing around I
haven't found anything better than the current default parameters, but
the only discount values I've looked at are .9, .95 and .97. On
seaquest, I actually see results that are substantially better than the
results from the paper.

Nathan Sprague

unread,
Feb 11, 2015, 7:26:04 AM2/11/15
to deep-q-...@googlegroups.com
My original implementation had non-zero momentum, but the current
parameter settings seem to work better. It's possible that momentum
could be helpful, but it requires a lot of effort to find good parameter
settings. Each new parameter makes the job harder.
Alejandro Dubrovsky

unread,
Feb 11, 2015, 7:36:15 AM2/11/15
to deep-q-...@googlegroups.com
On 11/02/15 23:23, Nathan Sprague wrote:
> Interesting... Unfortunately, it is hard to know whether run-to-run
> variations are a result parameter changes, or just the result of random
> differences in initialization and training. In my playing around I
> haven't found anything better than the current default parameters, but
> the only discount values I've looked at are .9, .95 and .97. On
> seaquest, I actually see results that are substantially better than the
> results from the paper.

Awesome! I haven't retried seaquest since the change to 0 momentum. I
had thought before it would be the hardest due to the full action set.
Which discount value worked for it?

(Currently running enduro at 0.95)


Nathan Sprague

unread,
Feb 11, 2015, 7:40:09 AM2/11/15
to deep-q-...@googlegroups.com
The figure title shows the parameter settings: learning rate:
.0002, discount: .95, decay: .99.
seaquest.png

Alejandro Dubrovsky

unread,
Feb 11, 2015, 9:14:09 AM2/11/15
to deep-q-...@googlegroups.com
Default. Excellent.

I'm going to bed but I can tell you it is learning on enduro with those
settings too, although with a very strange graph (stuck at 0 for the
first 25 epochs before taking off).

Alejandro Dubrovsky

unread,
Feb 12, 2015, 9:21:07 PM2/12/15
to deep-q-...@googlegroups.com
I think you've nailed it Nathan! Every game is working. 

That's the enduro graph (at epsilon 0.05). It looks lower than what DeepMind shows, but that's just because of the way that the current code calculates the average. It divides the total score in the 10000 frames by the number of episodes, _including_ the final episode, in which the agent hasn't lost - it just got clipped by the 10000-frame mark. Usually this doesn't make much difference, but in enduro you only get one or two games in 10000 frames, so what's happening is that it's scoring 450-700 in the first episode, then running out of time in the second, and the average comes out in the high 300s. The 800 peak is just a single run that runs out of time (and it's awesome). When running at 0.01, there are lots of those 800 runs. It plays like someone with super-human reflexes, who refuses to drive in the snow, and gets very upset when it runs into something, so it has to spend ten seconds smashing things in the room. It's great.

I'm going to change the average calculation to not include the frame-clipped episode, unless it's the only one in the whole testing period. I think that will also pump up the breakout scores to be in line with DeepMind's. I am definitely getting higher scores for breakout and enduro than what is reported as the DQN best, but I'm not sure if what they refer to as an "episode" in the paper ("single best performing episode") is just a single game.
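(The change itself is tiny - something like this, with made-up names:)

```python
# Sketch with made-up names: drop the final, frame-clipped episode from the
# per-epoch average unless it was the only episode in the testing period.
def average_score(episode_rewards, last_episode_clipped):
    if last_episode_clipped and len(episode_rewards) > 1:
        episode_rewards = episode_rewards[:-1]
    return sum(episode_rewards) / float(len(episode_rewards))
```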

I'm currently running qbert, and it's already scored a 4000.

Nathan Sprague

unread,
Feb 13, 2015, 11:22:38 AM2/13/15
to deep-q-...@googlegroups.com
Very nice! I've updated the README to remove the caveat about
less-than-great performance :)

On 02/12/2015 09:21 PM, Alejandro Dubrovsky wrote:
> I think you've nailed it Nathan! Every game is working.
>
> <https://lh3.googleusercontent.com/-XikAfbTFdpk/VN1dUm4HgLI/AAAAAAAAAi8/s4nFBkYmThw/s1600/enduro005.png>

Ajay Talati

unread,
Feb 13, 2015, 12:49:34 PM2/13/15
to deep-q-...@googlegroups.com
COOL !!! Really happy it's working so well :))

Are the default parameters set on Nathan's master fork the ones you are both using to get the high performance results?

Also, Nathan and Sridhar were curious about network weight transfer between games - https://groups.google.com/forum/?hl=en#!topic/deep-q-learning/d7FuHrzJxX8

Can you transfer learning from one Atari problem to another? Since the same network is being used to learn any game, the bulk of the learning involves extracting features using the convolutional networks. Can the features learned, say while playing breakout, transfer to playing pong?

More generally, would it be possible to facilitate learning across games so that learning the N+1th game would be much quicker if the first N games have already been learned?

Now that you've got a solid set of hyper-parameters, I could have a go at this transfer learning experiment? That would help us understand a bit more how deep learning is working in this context? I have a hand-wavy idea that some lower-level features are simply 'computer vision' related, and the high-level features are game/strategy/action-set specific?

If so, in the master fork you could simply deposit a semi-trained network that has some vision features already pre-trained.
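What I have in mind is only something like the following - a hypothetical sketch; the all_parameters attribute and the pickle format are assumptions on my part, not the package's actual interface:

```python
# Hypothetical sketch of the transfer idea: copy the lower (convolutional)
# layer weights from a network trained on one game into a fresh network for
# another game, leaving the upper layers at their random initialisation.
# Assumes the networks expose their parameters as ordered lists of Theano
# shared variables via an "all_parameters" attribute (an assumption here).
import pickle

def transfer_lower_layers(source_path, fresh_network, n_transfer=4):
    with open(source_path, 'rb') as f:
        source_network = pickle.load(f)
    src_params = source_network.all_parameters   # placeholder attribute name
    dst_params = fresh_network.all_parameters
    # copy e.g. the first two conv layers (weights + biases = 4 shared variables)
    for src, dst in zip(src_params[:n_transfer], dst_params[:n_transfer]):
        dst.set_value(src.get_value())
    return fresh_network
```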

I just want to understand now how the "magic" works?

Sridhar Mahadevan

unread,
Feb 13, 2015, 5:28:49 PM2/13/15
to deep-q-...@googlegroups.com
Unfortunately, the deep Q RL code has stopped working for me. Ironic, since after a month, my Tesla-based GPU machine
is finally working well :-). Temperatures of the passively cooled GPU card are now in the mid-40s even after a day-long run.

What happens is that all the processes start, and then it simply hangs. No further output, nada... any clues, anyone?
I should mention all other packages work fine, including Caffe, DQN, Torch, etc. I suspect our old friend, Theano, is the
problem, but I'm not sure. Theano itself seems to run fine when I run theano.test().

mahadeva@manifold:~/Documents/code/deep_rl/deep_q_rl$ python ale_run.py 
RL-Glue Version 3.04, Build 909
RL-Glue is listening for connections on port=4096
A.L.E: Arcade Learning Environment (version 0.4.4)
[Powered by Stella]
Use -help for help screen.
Game console created:
  ROM file:  /home/mahadeva/Documents/code/deep_rl/roms/breakout.bin
  Cart Name: Breakout - Breakaway IV (1978) (Atari)
  Cart MD5:  f34f08e5eb96e500e851a80be3277a56
  Display Format:  AUTO-DETECT ==> NTSC
  ROM Size:        2048
  Bankswitch Type: AUTO-DETECT ==> 2K

Running ROM file...
Random Seed: Time
Game will be controlled through RL-Glue.
RL-Glue Python Experiment Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4096...
RL-Glue :: Experiment connected.
Initializing ALE RL-Glue ...
RL-Glue :: Environment connected.
Using gpu device 0: Tesla K80
RL-Glue Python Agent Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4096...
Agent Codec Connected
RL-Glue :: Agent connected.
(32, 4, 80, 80)
(4, 80, 80, 32)
(16, 19.0, 19.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 18)

Sridhar Mahadevan

unread,
Feb 13, 2015, 5:33:00 PM2/13/15
to deep-q-...@googlegroups.com
Never mind, it's working again. I forgot about the 5-minute delay that it takes for Theano to start rumbling along...

Alejandro Dubrovsky

unread,
Feb 13, 2015, 9:40:43 PM2/13/15
to deep-q-...@googlegroups.com
On 14/02/15 04:49, Ajay Talati wrote:
> COOL !!! Really happy it's working so well :))
>
> Are the default parameters set on Nathan's master fork the ones you are
> both using to get the high performance results?
>
Yep, default parameters are working brilliantly.


Ajay Talati

unread,
Feb 13, 2015, 10:25:31 PM2/13/15
to deep-q-...@googlegroups.com
Thanks :) I'm testing Qbert now.

The more I think about it the more impressed and amazed I am that this thing works!

There's got to be some deep (no pun intended) mathematical/physics reasoning behind deep networks?

Alejandro Dubrovsky

unread,
Feb 22, 2015, 9:07:57 AM2/22/15
to deep-q-...@googlegroups.com
Hey Nathan, do you remember how long you ran seaquest for? Did the
scores get hammered after a while? I'm seeing what used to happen before
you got rid of the momentum, where the scores just went to the floor as
the Q-values skyrocketed - except this time the Q-values aren't
skyrocketing. Graph attached.



seaquest.png

Nathan Sprague

unread,
Feb 22, 2015, 10:57:22 AM2/22/15
to deep-q-...@googlegroups.com
Hi Alejandro,

I just sent in an abstract to the RLDM conference that looks at the
effect of the different parameters on learning across three different
games. I'll attach the main figure here. I think this "crashing"
effect that you are seeing tends to happen at higher learning rates. I
haven't run any of these experiments more than 100 epochs, so I don't
know what would happen with longer runs.

I'm guessing your Seaquest run must be configured differently from
mine somehow. I've never seen the reward go that high. When I train for
Seaquest the learner gets to the point where it is more or less perfect
at shooting sharks, but never figures out that it needs to surface for
air. Have you watched your top policies play? Do they surface?
many_runs.png

Alejandro Dubrovsky

unread,
Feb 23, 2015, 3:37:29 AM2/23/15
to deep-q-...@googlegroups.com
I was using the standard defaults (0.95/0.0002/0 momentum). It might
have just been a fluky run. My y-cutoff for the cropping for that game
is two pixels higher than yours (that's the only difference I can think of)

But good call on the point about surfacing for air. The best single-episode
score that it reaches during the whole run is during the crash
(i.e. in the couple of epochs it takes to drop from averaging 3000 to
averaging 500). Those are the epochs during which it evidently learns to
surface. At least I never saw it surfacing before; it just behaved like
an ultimate shark-killing machine (which gets it to 3000), but during
those episodes it surfaces a bit (and even more after the crash). The
discovery of surfacing obviously sends it mad.

After the crash, it actually doesn't play that badly. The problem is
that surfacing means that the sharks then get transformed into red
sharks (which go faster) and submarines (which shoot), and it doesn't
know what to do with those. All makes sense now.

BTW, what does the p (rho?) that gets set to 0.95 and 0.99 in your
graphs mean? Momentum?

Nathan Sprague

unread,
Feb 23, 2015, 1:44:55 PM2/23/15
to deep-q-...@googlegroups.com
The rho value is the decay rate used by RMSProp to determine how quickly
the averaged gradient magnitudes are updated. I originally didn't think
it would matter much, but increasing this value (from .9) turned out to
be key to getting everything working as well as it is now.
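To see what rho controls, here's a tiny standalone illustration (not code from the repository): the accumulator is a running average of squared gradients that effectively spans roughly the last 1/(1-rho) steps, so 0.9 averages over about 10 gradients and 0.99 over about 100.

```python
# Illustrative only: how quickly the squared-gradient average "fills up"
# for different values of rho, with a constant gradient of magnitude 1.
import numpy as np

def run_accumulator(rho, grads):
    acc = 0.0
    for g in grads:
        acc = rho * acc + (1.0 - rho) * g ** 2
    return acc

grads = np.ones(50)                      # 50 steps of gradient magnitude 1
print(run_accumulator(0.9, grads))       # ~0.99: already close to the true average of 1
print(run_accumulator(0.99, grads))      # ~0.39: still warming up after 50 steps
```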

Ajay Talati

unread,
Feb 23, 2015, 2:54:44 PM2/23/15
to deep-q-...@googlegroups.com
Aha - so it's the adaptive step rates that do the magic!

That explains the patched versions of RMSProp which have sprouted in other packages?

Thank you very much for mentioning this, and also for keeping the 'useful links' in the code you wrote:

# http://climin.readthedocs.org/en/latest/rmsprop.html
# https://github.com/lisa-lab/pylearn2/pull/136

Best,

Aj

Alejandro Dubrovsky

unread,
Feb 24, 2015, 11:55:24 PM2/24/15
to deep-q-...@googlegroups.com
Ah, I completely missed the change in the decay rate.

Reading the code for gen_updates_rmsprop makes me remember that I don't
understand Theano (doesn't that code mean that the accumulators get
reinitialised to zero on every call!?? Obviously not unless
all_parameters is actually a list of all weights in its history, but I
don't understand how. But maybe all_parameters is all the weights ever.
Doesn't matter, I'll sit down and read about it properly when I get
time). It does look like the initial updates would be huge (10 times
bigger than usual) but you didn't like the initialisation to 1.0. It
didn't work?

The suggested link to the youtube video in the comment doesn't work for
me; it tells me that the video is private. Is it meant to be Hinton's
coursera lecture that everyone references or something else?

I reran training on seaquest, same result, although the crash came a bit
later. Maybe the reason I am hitting it earlier than you is due to my
initialisation of the network to 0.02 instead of 0.01.
seaquestagain.png

Nathan Sprague

unread,
Feb 25, 2015, 4:09:53 PM2/25/15
to deep-q-...@googlegroups.com
Theano code is often confusing to read because it isn't apparent on
casual inspection that the code is building a computation graph, not
actually performing any computations. That's what's happening in
gen_updates_rmsprop. The code is building the update rules; it isn't
actually performing the updates.
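A toy example might make that clearer. This isn't from the library, just a minimal demonstration that the zeros are only the initial value of a shared variable, created once when the updates are built, and that the compiled function keeps accumulating into it afterwards:

```python
import numpy as np
import theano
import theano.tensor as T

x = T.scalar('x')
acc = theano.shared(0.0)              # initialised to zero exactly once, right here
step = theano.function([x], acc,      # returns the accumulator's value before the update
                       updates=[(acc, 0.9 * acc + 0.1 * x ** 2)])

for value in [1.0, 2.0, 3.0]:
    step(value)
print(acc.get_value())                # the accumulator has kept its state across calls
```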

I assume that the link is intended to lead to Hinton's Coursera video.
I didn't write it so I don't know for sure. I believe the video is
still available if you google for it.

The difference in weight initialization might be the issue. You could
try a lower learning rate to see if that helps. Of course, the new
information in the Nature paper/code will impact any plans now.

On that subject... I'll probably continue to support and update this
library, even though the DeepMind code has been published. I'm
depending on it for other projects. My feelings won't be hurt if
everyone else jumps ship :)

Sridhar Mahadevan

unread,
Feb 25, 2015, 6:16:00 PM2/25/15
to deep-q-...@googlegroups.com
No fear, Nathan. It will take all of us a while to sort through the new DeepMind code.  I have a feeling that
this is not going to be plug-and-play, as is almost always the case. So, we'll be using your code for a long
time to come, I expect. It's always good to have multiple implementations, in any case.

Ajay Talati

unread,
Feb 25, 2015, 9:01:30 PM2/25/15
to deep-q-...@googlegroups.com
Yeah I'll second that. Anyone who's got the insight to put together a package like this is definitely an original thinker.

I learnt a lot from trying to understand this package - the main thing being that simple is good :)

I just think that convnets and reinforcement learning are not that interesting anymore?

Generative models, the Neural Turing Machine, long short-term memory, variational inference and sampling methods just seem to be what's working best at the moment?