Unable to run deep_q_rl due to segmentation faults


sridhar.i...@gmail.com

Feb 5, 2015, 8:52:02 PM
to deep-q-...@googlegroups.com
Ever since I upgraded my GPU machine to a Tesla GPU, I've been unable
to run Nathan's code. I can successfully run Theano, Muupan's deep Q-learner,
caffe, etc.

Here's the error I'm getting: it starts everything up, and then segfaults. Does anyone know
what's going on?  Thanks!

RL-Glue Version 3.04, Build 909
A.L.E: Arcade Learning Environment (version 0.4.4)
[Powered by Stella]
Use -help for help screen.
Warning: couldn't load settings file: ./stellarc
Game console created:
  ROM file:  ../../../roms/breakout.bin
  Cart Name: Breakout - Breakaway IV (1978) (Atari)
  Cart MD5:  f34f08e5eb96e500e851a80be3277a56
  Display Format:  AUTO-DETECT ==> NTSC
  ROM Size:        2048
  Bankswitch Type: AUTO-DETECT ==> 2K

Running ROM file...
Random Seed: Time
Game will be controlled through RL-Glue.
RL-Glue Python Experiment Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4096...
Initializing ALE RL-Glue ...
Using gpu device 0: Tesla K80
RL-Glue Python Agent Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4096...
/home/mahadeva/Downloads/Theano/theano/gof/cmodule.py:289: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  rval = __import__(module_name, {}, {}, [module_name])
mod.cu(5511) : cutilCheckMsg() CUTIL CUDA error : filterActs: kernel execution failed : (8) invalid device function .
Agent Codec Connected
(32, 4, 80, 80)
(4, 80, 80, 32)
(16, 19.0, 19.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 6)
OPENING  breakout_02-06-01-48_0p0002_0p95/results.csv
Segmentation fault (core dumped)
training epoch:  1 steps_left:  50000
training epoch:  1 steps_left:  49959
training epoch:  1 steps_left:  49957
training epoch:  1 steps_left:  49955
training epoch:  1 steps_left:  49953
training epoch:  1 steps_left:  49951
training epoch:  1 steps_left:  49949
training epoch:  1 steps_left:  49947
training epoch:  1 steps_left:  49945
Traceback (most recent call last):
  File "./rl_glue_ale_experiment.py", line 69, in <module>
    main()
  File "./rl_glue_ale_experiment.py", line 59, in main
    run_epoch(epoch, args.epoch_length, "training")
  File "./rl_glue_ale_experiment.py", line 34, in run_epoch
    print prefix + " epoch: ", epoch, "steps_left: ", steps_left
IOError: [Errno 32] Broken pipe
mahadeva@manifold:~/Documents/code/deep_rl/deepq_rl_feb_5/deep_q_rl/deep_q_rl$ 

Ajay Talati

Feb 5, 2015, 9:02:24 PM
to deep-q-...@googlegroups.com
Use a separate install of ALE for this implementation, one built with the

USE_RLGLUE := 1

flag on.

It's best to pull Nathan's or Alejandro's code fresh, as the RL-Glue controller file has changed.

Nathan Sprague

Feb 5, 2015, 9:22:35 PM
to deep-q-...@googlegroups.com
I haven't seen that issue. Have you tried clearing the Theano compile
cache?

Theano/bin/theano-cache clear

Or just delete everything in the .theano folder.
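
If the theano-cache script isn't handy, here is a minimal sketch for finding and wiping the cache from Python itself (assuming only that Theano imports cleanly):

import shutil
import theano

# theano.config.compiledir points at the per-platform compile cache,
# e.g. ~/.theano/compiledir_<platform string>
print theano.config.compiledir
shutil.rmtree(theano.config.compiledir)  # remove the stale compiled modules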

Alejandro Dubrovsky

Feb 5, 2015, 9:29:38 PM
to deep-q-...@googlegroups.com
On 06/02/15 12:52, sridhar.i...@gmail.com wrote:
> Ever since I upgraded my GPU machine to using a Tesla GPU, I've been unable
> to run Nathan's code. I am successfully able to run Theano, Muupan's
> deep Q learner,
> and caffe etc.
>
> Here's the error I am getting. It starts everything up, and then seg
> faults. Any one knows
> what's going on? Thanks!

Which program is segfaulting? i.e., what does 'file core' say? Also, what
version of Python are you using?

(Just to be sure: to fix the warning about the missing ./stellarc file, you can
create a 'stellarc' file. I don't know if it needs to have anything in
it. Mine has:

cpu=low

in it.)

sridhar.i...@gmail.com

Feb 6, 2015, 3:00:51 PM
to deep-q-...@googlegroups.com
Deleting the cache seems to have done the trick! It's working again...great, thanks! 

Johnny

Feb 14, 2015, 4:29:36 AM
to deep-q-...@googlegroups.com
Hi guys,

First of all, a big shout-out for managing to reproduce the results from the paper. I'm taking part in an AI master's programme and starting to do research on reinforcement learning and deep learning, and this paper is one of my starting points. I'm also working on my own implementation in Torch; I want to get it as clean as possible so it's easy to bring in improvements afterwards. (Also, I have reason to think that DeepMind implemented their agent in Torch.)

Right now I'm having problems running Nathan's code. The files 'results.csv' and 'learning.csv' are created after training starts, but they only contain the headers, and the network_file_xx.pkl files are missing completely. When I run the code I get a segmentation fault, after which the training process starts, but it ends without writing the network files, leaving results.csv and learning.csv empty. The output is below. Any ideas? Thank you very much!
P.S. I already cleared the Theano cache.

...
(32, 4, 84, 84)
(4, 84, 84, 32)
(16, 20.0, 20.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 18)
/home/johnny/Theano/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  rval = __import__(module_name, {}, {}, [module_name])
OPENING  data_02-14-09-09_0p0002_0p95/results.csv
training epoch:  1 steps_left:  50000
Traceback (most recent call last):
  File "./rl_glue_ale_agent.py", line 454, in <module>
    main()
  File "./rl_glue_ale_agent.py", line 450, in main
    AgentLoader.loadAgent(NeuralAgent())
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/AgentLoader.py", line 58, in loadAgent
    client.runAgentEventLoop()
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 144, in runAgentEventLoop
    switch[agentState](self)
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 138, in <lambda>
    Network.kAgentStart: lambda self: self.onAgentStart(),
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 51, in onAgentStart
    action = self.agent.agent_start(observation)
  File "./rl_glue_ale_agent.py", line 256, in agent_start
    self.last_img = self._resize_observation(observation.intArray)
  File "./rl_glue_ale_agent.py", line 274, in _resize_observation
    image = observation[128:].reshape(IMAGE_HEIGHT, IMAGE_WIDTH, 3)
ValueError: total size of new array must be unchanged
Segmentation fault (core dumped)
training epoch:  1 steps_left:  49995
training epoch:  1 steps_left:  49993
training epoch:  1 steps_left:  49991
...

Alejandro Dubrovsky

Feb 14, 2015, 5:27:48 AM
to deep-q-...@googlegroups.com
Hi Johnny,

You are most likely not using the latest ALE code from GitHub. If that's
the case, try updating to it.

Johnny

Feb 14, 2015, 5:57:19 AM
to deep-q-...@googlegroups.com
Hi Alejandro,

I confirm that I'm using the latest ALE code, taken directly from github.com/mgbellemare/Arcade-Learning-Environment, compiled with:

cp makefile.unix makefile
make

after setting USE_SDL := 1 and USE_RLGLUE := 1 in makefile.unix.

Johnny

Feb 14, 2015, 8:02:38 AM
to deep-q-...@googlegroups.com
Any other ideas would be much appreciated.

Alejandro Dubrovsky

Feb 14, 2015, 9:16:34 AM
to deep-q-...@googlegroups.com
Strange. Could you add the line:

print "Observation length ", len(observation.intArray)

at the start of agent_start in rl_glue_ale_agent.py and see what that
says when it runs?

Johnny

Feb 14, 2015, 10:12:18 AM
to deep-q-...@googlegroups.com
I ran it and it prints:

Observation length  33728

Alejandro Dubrovsky

Feb 14, 2015, 10:45:16 AM
to deep-q-...@googlegroups.com
On 15/02/15 02:12, Johnny wrote:
> I ran it and it prints:
>
> Observation length 33728

If you are definitely running the latest ALE version, this means
you are not passing -send_rgb true in ale_run.py, which means you
are not running the latest deep_q_rl code.

If you are definitely running the latest deep_q_rl, then check
your path; you might have some other ALE install that is getting run
instead. You should be seeing 100928.
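
For reference, both numbers line up with ALE's 160x210 screen prefixed by the 128 bytes of console RAM. A quick sketch of the arithmetic (the layout is inferred from the reshape call in the traceback above):

RAM_SIZE, WIDTH, HEIGHT = 128, 160, 210
print RAM_SIZE + WIDTH * HEIGHT      # 33728  -> one byte per pixel (no -send_rgb)
print RAM_SIZE + WIDTH * HEIGHT * 3  # 100928 -> three bytes per pixel (-send_rgb true)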

Ajay Talati

Feb 14, 2015, 10:55:54 AM
to deep-q-...@googlegroups.com
I had the path problem when I had two versions of ALE installed. It's easily fixed with 

# Start ALE
command = ['/home/ajay/bin/ale_alito/ale_0_4/ale', '-game_controller', 'rlglue', '-send_rgb', 'true',

on line 72 of ale_run.py
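
A quick way to check which ALE binary actually gets picked up from your PATH (a sketch; the executable name 'ale' is an assumption, use whatever name your install has):

import distutils.spawn
print distutils.spawn.find_executable('ale')  # full path of the binary that would run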

Johnny

Feb 14, 2015, 12:27:59 PM
to deep-q-...@googlegroups.com
I was indeed leaving the -send_rgb true parameter out by mistake when fixing the path problem Ajay mentioned. Now I'm only left with this warning:

...
(32, 4, 84, 84)
(4, 84, 84, 32)
(16, 20.0, 20.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 6)
/home/johnny/Theano/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  rval = __import__(module_name, {}, {}, [module_name])
OPENING  data_02-14-17-12_0p0002_0p95/results.csv
training epoch:  1 steps_left:  50000
Simulated at a rate of 28.9300208794/s
Average loss: 0.0179085704343
...

but judging by the speed, and the fact that the *.pkl files are created, I think the training process is working as intended now! :)

Thank you very much for your help, Alejandro and Ajay!

Ajay Talati

Feb 14, 2015, 1:03:04 PM
to deep-q-...@googlegroups.com
Cool :))))

You can safely ignore the numpy warnings; I think everyone gets them.

Have you seen DeepMind's GitHub repos? They have some packages that should help speed up building a Torch implementation.

https://github.com/deepmind/alewrap

https://github.com/deepmind/xitari

https://github.com/fidlej/aledataset

I tried building a Torch implementation too, but I'm pretty crap at coding, and I got distracted by other stuff. 

Johnny

Feb 14, 2015, 2:09:28 PM
to deep-q-...@googlegroups.com
@Ajay: Yes, I saw DeepMind's repos ;).

I got past the training error, but now replaying is the one giving me a headache. When running ale_run_watch.py like this:

python ale_run_watch.py data_09-29-15-46_0p0001_0p9/network_file_99.pkl

I receive this:

...

RL-Glue Python Agent Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4097...
Agent Codec Connected
RL-Glue :: Agent connected.

Traceback (most recent call last):
  File "./rl_glue_ale_agent.py", line 454, in <module>
    main()
  File "./rl_glue_ale_agent.py", line 450, in main
    AgentLoader.loadAgent(NeuralAgent())
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/AgentLoader.py", line 58, in loadAgent
    client.runAgentEventLoop()
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 144, in runAgentEventLoop
    switch[agentState](self)
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 137, in <lambda>
    Network.kAgentInit: lambda self: self.onAgentInit(),
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 43, in onAgentInit
    self.agent.agent_init(taskSpec)
  File "./rl_glue_ale_agent.py", line 150, in agent_init
    phi_length=self.phi_length)
  File "/home/johnny/ale_0_4/deep_q_rl/ale_data_set.py", line 38, in __init__
    self.states = np.zeros((self.capacity, height, width), dtype='uint8')
MemoryError
testing epoch:  1 steps_left:  10000
testing epoch:  1 steps_left:  9995
testing epoch:  1 steps_left:  9993
...

The ALE display screen briefly appears and disappears after less than a second, and the testing finishes really fast. Did anyone else bump into this by any chance? Sorry if it's something obvious that I'm missing; I've basically spent my whole day trying to make this work :).

Ajay Talati

Feb 14, 2015, 3:02:16 PM
to deep-q-...@googlegroups.com
Hi Johnny!

data_09-29-15-46_0p0001_0p9/network_file_99.pkl

Hasn't that folder and pickle file been created?

When you train a network with this package, a folder is created, named with today's date and the time you started training. In that folder you will see a whole load of .pkl files, plus results.csv.

To watch a trained network, use

python ale_run_watch.py path_to_folder_with_training_data/a_pickled_network_file_in_the_folder.pkl

Johnny

Feb 14, 2015, 3:10:02 PM
to deep-q-...@googlegroups.com
Ajay,

I ran ale_run_watch with a .pkl file created in the training process by running ale_run.py. The line:

python ale_run_watch.py data_09-29-15-46_0p0001_0p9/network_file_99.pkl

was just an example to show how I ran it; it's actually the example from Nathan's GitHub readme.

Nathan Sprague

Feb 14, 2015, 3:36:58 PM
to deep-q-...@googlegroups.com
I see that there is a "MemoryError" in your output. How much memory do
you have on your testing machine? The training code requires 10-12 gigs
of RAM, so if you have anything less than 16 you are likely to run into
problems.
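
For reference, the allocation that fails in the traceback is the replay-history buffer from ale_data_set.py. A rough sketch of its footprint (the 1,000,000-frame capacity is an assumed default, not taken from the code):

import numpy as np

capacity, height, width = 1000000, 84, 84
states = np.zeros((capacity, height, width), dtype='uint8')
print states.nbytes / 1e9  # ~7.1 GB for the screen history alone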

Alejandro Dubrovsky

Feb 14, 2015, 8:37:25 PM
to deep-q-...@googlegroups.com
Johnny, like Nathan mentioned, you are running out of memory. You are
most likely trying to watch it while leaving the training running.

The issue is that watching still allocates space for the full image
history used in training, even though it isn't going to do any training.

Try adding '--max_history', '4000' to the command in ale_run_watch.py and
see if that works.
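
For scale, assuming the same uint8 buffer as above, that shrinks the screen history from roughly 7 GB to about 28 MB (4000 * 84 * 84 bytes), which should fit comfortably alongside a running training job.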

Johnny

Feb 15, 2015, 12:55:09 AM
to deep-q-...@googlegroups.com
Hi,

Curiously enough, the error disappeared after logging into my computer this morning. Yesterday I also restarted it before training/playing sessions, but with no results. I have only 8 GB of RAM and 2 GB of video memory. This morning it worked like a charm, without modifying a thing. The only difference between yesterday and today is that, instead of restarting it, the computer was turned off overnight. Thank you for the help though :).