Unable to run deep_q_rl due to segmentation faults


sridhar.i...@gmail.com

Feb 5, 2015, 8:52:02 PM
to deep-q-...@googlegroups.com
Ever since I upgraded my GPU machine to a Tesla GPU, I've been unable
to run Nathan's code. I can successfully run Theano, Muupan's deep Q-learner,
caffe, etc.

Here's the error I'm getting: it starts everything up, and then segfaults. Does anyone know
what's going on?  Thanks!

RL-Glue Version 3.04, Build 909
A.L.E: Arcade Learning Environment (version 0.4.4)
[Powered by Stella]
Use -help for help screen.
Warning: couldn't load settings file: ./stellarc
Game console created:
  ROM file:  ../../../roms/breakout.bin
  Cart Name: Breakout - Breakaway IV (1978) (Atari)
  Cart MD5:  f34f08e5eb96e500e851a80be3277a56
  Display Format:  AUTO-DETECT ==> NTSC
  ROM Size:        2048
  Bankswitch Type: AUTO-DETECT ==> 2K

Running ROM file...
Random Seed: Time
Game will be controlled through RL-Glue.
RL-Glue Python Experiment Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4096...
Initializing ALE RL-Glue ...
Using gpu device 0: Tesla K80
RL-Glue Python Agent Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4096...
/home/mahadeva/Downloads/Theano/theano/gof/cmodule.py:289: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  rval = __import__(module_name, {}, {}, [module_name])
mod.cu(5511) : cutilCheckMsg() CUTIL CUDA error : filterActs: kernel execution failed : (8) invalid device function .
Agent Codec Connected
(32, 4, 80, 80)
(4, 80, 80, 32)
(16, 19.0, 19.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 6)
OPENING  breakout_02-06-01-48_0p0002_0p95/results.csv
Segmentation fault (core dumped)
training epoch:  1 steps_left:  50000
training epoch:  1 steps_left:  49959
training epoch:  1 steps_left:  49957
training epoch:  1 steps_left:  49955
training epoch:  1 steps_left:  49953
training epoch:  1 steps_left:  49951
training epoch:  1 steps_left:  49949
training epoch:  1 steps_left:  49947
training epoch:  1 steps_left:  49945
Traceback (most recent call last):
  File "./rl_glue_ale_experiment.py", line 69, in <module>
    main()
  File "./rl_glue_ale_experiment.py", line 59, in main
    run_epoch(epoch, args.epoch_length, "training")
  File "./rl_glue_ale_experiment.py", line 34, in run_epoch
    print prefix + " epoch: ", epoch, "steps_left: ", steps_left
IOError: [Errno 32] Broken pipe
mahadeva@manifold:~/Documents/code/deep_rl/deepq_rl_feb_5/deep_q_rl/deep_q_rl$ 

Ajay Talati

Feb 5, 2015, 9:02:24 PM
to deep-q-...@googlegroups.com
Use a separate install of ALE for this implementation, one built with the

USE_RLGLUE := 1

flag on.

It's best to pull Nathan's or Alejandro's code fresh, as the RL-Glue controller file has changed.

Nathan Sprague

Feb 5, 2015, 9:22:35 PM
to deep-q-...@googlegroups.com
I haven't seen that issue. Have you tried clearing the Theano compile
cache?

Theano/bin/theano-cache clear

Or just delete everything in the .theano folder.
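
If the theano-cache script isn't handy, here is a minimal sketch for finding and wiping the cache from Python itself (assuming only that Theano imports cleanly):

import shutil
import theano

# theano.config.compiledir points at the per-platform compile cache,
# e.g. ~/.theano/compiledir_<platform string>
print theano.config.compiledir
shutil.rmtree(theano.config.compiledir)  # remove the stale compiled modules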

Alejandro Dubrovsky

Feb 5, 2015, 9:29:38 PM
to deep-q-...@googlegroups.com
On 06/02/15 12:52, sridhar.i...@gmail.com wrote:
> Ever since I upgraded my GPU machine to using a Tesla GPU, I've been unable
> to run Nathan's code. I am successfully able to run Theano, Muupan's
> deep Q learner,
> and caffe etc.
>
> Here's the error I am getting. It starts everything up, and then seg
> faults. Any one knows
> what's going on? Thanks!

Which program is segfaulting? i.e., what does 'file core' say? Also, what
version of Python are you using?

(Just to be sure: to fix the warning about the missing ./stellarc file, you can
create a 'stellarc' file. I don't know if it needs to have anything in
it. Mine has:

cpu=low

in it.)

sridhar.i...@gmail.com

Feb 6, 2015, 3:00:51 PM
to deep-q-...@googlegroups.com
Deleting the cache seems to have done the trick! It's working again...great, thanks! 

Johnny

Feb 14, 2015, 4:29:36 AM
to deep-q-...@googlegroups.com
Hi guys,

First of all, a big shout-out for managing to reproduce the results from the paper. I'm taking part in an AI master's programme and starting to do research on reinforcement learning and deep learning, and this paper is one of my starting points. I'm also working on my own implementation in Torch; I want to get it as clean as possible so it's easy to bring in improvements afterwards. (Also, I have reason to think that DeepMind implemented their agent in Torch.)

Right now I'm having problems running Nathan's code. The files 'results.csv' and 'learning.csv' are created after training starts, but they only contain the headers, and the network_file_xx.pkl files are missing completely. When I run the code I get a segmentation fault, after which the training process starts, but it ends without writing the network files, leaving results.csv and learning.csv empty. The output is below. Any ideas? Thank you very much!
P.S. I already cleared the Theano cache.

...
(32, 4, 84, 84)
(4, 84, 84, 32)
(16, 20.0, 20.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 18)
/home/johnny/Theano/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  rval = __import__(module_name, {}, {}, [module_name])
OPENING  data_02-14-09-09_0p0002_0p95/results.csv
training epoch:  1 steps_left:  50000
Traceback (most recent call last):
  File "./rl_glue_ale_agent.py", line 454, in <module>
    main()
  File "./rl_glue_ale_agent.py", line 450, in main
    AgentLoader.loadAgent(NeuralAgent())
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/AgentLoader.py", line 58, in loadAgent
    client.runAgentEventLoop()
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 144, in runAgentEventLoop
    switch[agentState](self)
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 138, in <lambda>
    Network.kAgentStart: lambda self: self.onAgentStart(),
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 51, in onAgentStart
    action = self.agent.agent_start(observation)
  File "./rl_glue_ale_agent.py", line 256, in agent_start
    self.last_img = self._resize_observation(observation.intArray)
  File "./rl_glue_ale_agent.py", line 274, in _resize_observation
    image = observation[128:].reshape(IMAGE_HEIGHT, IMAGE_WIDTH, 3)
ValueError: total size of new array must be unchanged
Segmentation fault (core dumped)
training epoch:  1 steps_left:  49995
training epoch:  1 steps_left:  49993
training epoch:  1 steps_left:  49991
...

Alejandro Dubrovsky

Feb 14, 2015, 5:27:48 AM
to deep-q-...@googlegroups.com
Hi Johnny,

You are most likely not using the latest ALE code from GitHub. If that's
the case, try updating to it.

Johnny

Feb 14, 2015, 5:57:19 AM
to deep-q-...@googlegroups.com
Hi Alejandro,

I confirm that I'm using the latest ALE code, taken directly from github.com/mgbellemare/Arcade-Learning-Environment, compiled with:

cp makefile.unix makefile
make

after setting USE_SDL := 1 and USE_RLGLUE := 1 in makefile.unix.

Johnny

Feb 14, 2015, 8:02:38 AM
to deep-q-...@googlegroups.com
Any other ideas would be much appreciated.

Alejandro Dubrovsky

Feb 14, 2015, 9:16:34 AM
to deep-q-...@googlegroups.com
Strange. Could you add the line:

print "Observation length ", len(observation.intArray)

at the start of agent_start in rl_glue_ale_agent.py and see what that
says when it runs?

Johnny

Feb 14, 2015, 10:12:18 AM
to deep-q-...@googlegroups.com
I ran it and it prints:

Observation length  33728

Alejandro Dubrovsky

Feb 14, 2015, 10:45:16 AM
to deep-q-...@googlegroups.com
On 15/02/15 02:12, Johnny wrote:
> I ran it and it prints:
>
> Observation length 33728

If you are definitely running the latest ALE version, this means
you are not passing -send_rgb true in ale_run.py, which means you
are not running the latest deep_q_rl code.

If you are definitely running the latest deep_q_rl, then check
your path; you might have some other ALE install that is getting run
instead. You should be seeing 100928.
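
For reference, both numbers line up with ALE's 160x210 screen prefixed by the 128 bytes of console RAM. A quick sketch of the arithmetic (the layout is inferred from the reshape call in the traceback above):

RAM_SIZE, WIDTH, HEIGHT = 128, 160, 210
print RAM_SIZE + WIDTH * HEIGHT      # 33728  -> one byte per pixel (no -send_rgb)
print RAM_SIZE + WIDTH * HEIGHT * 3  # 100928 -> three bytes per pixel (-send_rgb true)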

Ajay Talati

Feb 14, 2015, 10:55:54 AM
to deep-q-...@googlegroups.com
I had the path problem when I had two versions of ALE installed. It's easily fixed with 

# Start ALE
command = ['/home/ajay/bin/ale_alito/ale_0_4/ale', '-game_controller', 'rlglue', '-send_rgb', 'true',

on line 72 of ale_run.py
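
A quick way to check which ALE binary actually gets picked up from your PATH (a sketch; the executable name 'ale' is an assumption, use whatever name your install has):

import distutils.spawn
print distutils.spawn.find_executable('ale')  # full path of the binary that would run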

Johnny

Feb 14, 2015, 12:27:59 PM
to deep-q-...@googlegroups.com
I was indeed leaving the -send_rgb true parameter out by mistake when fixing the path problem Ajay mentioned. Now I'm only left with this warning:

...
(32, 4, 84, 84)
(4, 84, 84, 32)
(16, 20.0, 20.0, 32)
(32, 9.0, 9.0, 32)
(32, 32, 9.0, 9.0)
(32, 256)
(32, 6)
/home/johnny/Theano/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  rval = __import__(module_name, {}, {}, [module_name])
OPENING  data_02-14-17-12_0p0002_0p95/results.csv
training epoch:  1 steps_left:  50000
Simulated at a rate of 28.9300208794/s
Average loss: 0.0179085704343
...

but judging by the speed, and the fact that the *.pkl files are created, I think the training process is working as intended now! :)

Thank you very much for your help, Alejandro and Ajay!

Ajay Talati

Feb 14, 2015, 1:03:04 PM
to deep-q-...@googlegroups.com
Cool :))))

You can safely ignore the numpy warnings; I think everyone gets them.

Have you seen DeepMind's GitHub repos? They have some packages that should help speed up building a Torch implementation.

https://github.com/deepmind/alewrap

https://github.com/deepmind/xitari

https://github.com/fidlej/aledataset

I tried building a Torch implementation too, but I'm pretty crap at coding, and I got distracted by other stuff. 

Johnny

Feb 14, 2015, 2:09:28 PM
to deep-q-...@googlegroups.com
@Ajay: Yes, I saw DeepMind's repos ;).

I got past the training error, but now replaying is the one giving me a headache. When running ale_run_watch.py like this:

python ale_run_watch.py data_09-29-15-46_0p0001_0p9/network_file_99.pkl

I receive this:

...

RL-Glue Python Agent Codec Version: 2.02 (Build 738)
Connecting to 127.0.0.1 on port 4097...
Agent Codec Connected
RL-Glue :: Agent connected.

Traceback (most recent call last):
  File "./rl_glue_ale_agent.py", line 454, in <module>
    main()
  File "./rl_glue_ale_agent.py", line 450, in main
    AgentLoader.loadAgent(NeuralAgent())
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/AgentLoader.py", line 58, in loadAgent
    client.runAgentEventLoop()
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 144, in runAgentEventLoop
    switch[agentState](self)
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 137, in <lambda>
    Network.kAgentInit: lambda self: self.onAgentInit(),
  File "/usr/local/lib/python2.7/dist-packages/rlglue/agent/ClientAgent.py", line 43, in onAgentInit
    self.agent.agent_init(taskSpec)
  File "./rl_glue_ale_agent.py", line 150, in agent_init
    phi_length=self.phi_length)
  File "/home/johnny/ale_0_4/deep_q_rl/ale_data_set.py", line 38, in __init__
    self.states = np.zeros((self.capacity, height, width), dtype='uint8')
MemoryError
testing epoch:  1 steps_left:  10000
testing epoch:  1 steps_left:  9995
testing epoch:  1 steps_left:  9993
...

The ALE display screen briefly appears and disappears after less than a second, and the testing finishes really fast. Did anyone else bump into this by any chance? Sorry if it's something obvious that I'm missing; I've basically spent my whole day trying to make this work :).

Ajay Talati

Feb 14, 2015, 3:02:16 PM
to deep-q-...@googlegroups.com
Hi Johnny!

data_09-29-15-46_0p0001_0p9/network_file_99.pkl

Hasn't that folder and pickle file been created?

When you train a network with this package, a folder is created, named with today's date and the time you started training. In that folder you will see a whole load of .pkl files, plus results.csv.

To watch a trained network, use

python ale_run_watch.py path_to_folder_with_training_data/a_pickled_network_file_in_the_folder.pkl

Johnny

Feb 14, 2015, 3:10:02 PM
to deep-q-...@googlegroups.com
Ajay,

I ran ale_run_watch with a .pkl file created in the training process by running ale_run.py. The line:

python ale_run_watch.py data_09-29-15-46_0p0001_0p9/network_file_99.pkl

was just an example to show how I ran it; it's actually the example from Nathan's GitHub readme.

Nathan Sprague

Feb 14, 2015, 3:36:58 PM
to deep-q-...@googlegroups.com
I see that there is a "MemoryError" in your output. How much memory do
you have on your testing machine? The training code requires 10-12 gigs
of RAM, so if you have anything less than 16 you are likely to run into
problems.
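
For reference, the allocation that fails in the traceback is the replay-history buffer from ale_data_set.py. A rough sketch of its footprint (the 1,000,000-frame capacity is an assumed default, not taken from the code):

import numpy as np

capacity, height, width = 1000000, 84, 84
states = np.zeros((capacity, height, width), dtype='uint8')
print states.nbytes / 1e9  # ~7.1 GB for the screen history alone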

Alejandro Dubrovsky

Feb 14, 2015, 8:37:25 PM
to deep-q-...@googlegroups.com
Johnny, like Nathan mentioned, you are running out of memory. You are
most likely trying to watch it while leaving the training running.

The issue is that watching still allocates space for the full image
history used in training, even though it isn't going to do any training.

Try adding '--max_history', '4000' to the command in ale_run_watch.py and
see if that works.
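
For scale, assuming the same uint8 buffer as above, that shrinks the screen history from roughly 7 GB to about 28 MB (4000 * 84 * 84 bytes), which should fit comfortably alongside a running training job.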

Johnny

Feb 15, 2015, 12:55:09 AM
to deep-q-...@googlegroups.com
Hi,

Curiously enough, the error disappeared after logging into my computer this morning. Yesterday I also restarted it before training/playing sessions, but with no results. I have only 8 GB of RAM and 2 GB of video memory. This morning it worked like a charm, without modifying a thing. The only difference between yesterday and today is that, instead of restarting it, the computer was turned off overnight. Thank you for the help though :).