Getting "pygpu.gpuarray.GpuArrayException: Out of memory" for a small application

Daniel Seita

Jul 2, 2017, 3:59:31 PM
to theano-users
I am attempting to run some reinforcement learning code on the GPU. (The code is https://github.com/openai/imitation if it matters, running `scripts/run_rl_mj.py`.)

I converted the code to run on float32 by changing how the data is supplied via numpy. Unfortunately, with the new GPU backend, I am getting an out-of-memory error despite having 12GB of memory on my Titan X Pascal GPU. Here are my settings:

$ cat ~/.theanorc
[global]
device = cuda
floatX = float32

[gpuarray]
preallocate = 1

[cuda]
root = /usr/local/cuda-8.0
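
For reference, the same settings can also be passed per run through the THEANO_FLAGS environment variable (which overrides ~/.theanorc); the flag names are the dotted forms of the config options above:

$ THEANO_FLAGS='device=cuda,floatX=float32,gpuarray.preallocate=1' python scripts/run_rl_mj.py --env_name CartPole-v0 --log trpo_logs/CartPole-v0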


Theano seems to be importing correctly:

$ ipython
Python 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)  
Type "copyright", "credits" or "license" for more information.
IPython 5.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import theano
Using cuDNN version 5105 on context None
Preallocating 11576/12186 Mb (0.950000) on cuda
Mapped name None to device cuda: TITAN X (Pascal) (0000:01:00.0)

In [2]:



Unfortunately, running `python scripts/run_rl_mj.py --env_name CartPole-v0 --log trpo_logs/CartPole-v0` on the very low-dimensional CartPole setting (the state is just four numbers and the action a single number) gives me, after a bit of setup:


Traceback (most recent call last):
  File "scripts/run_rl_mj.py", line 116, in <module>
    main()
  File "scripts/run_rl_mj.py", line 109, in main
    iter_info = opt.step()
  File "/home/daniel/imitation_noise/policyopt/rl.py", line 280, in step
    cfg=self.sim_cfg)
  File "/home/daniel/imitation_noise/policyopt/__init__.py", line 411, in sim_mp
    traj = job.get()
  File "/home/daniel/anaconda2/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pygpu.gpuarray.GpuArrayException: Out of memory
Apply node that caused the error: GpuFromHost<None>(obsfeat_B_Df)
Toposort index: 4
Inputs types: [TensorType(float32, matrix)]
Inputs shapes: [(1, 4)]
Inputs strides: [(16, 4)]
Inputs values: [array([[ 0.04058,  0.00428,  0.03311, -0.02898]], dtype=float32)]
Outputs clients: [[GpuElemwise{Composite{((i0 - i1) / i2)}}[]<gpuarray>(GpuFromHost<None>.0, /GibbsPolicy/obsnorm/Standardizer/mean_1_D, GpuElemwise{Composite{(i0 + sqrt((i1 * (Composite{(i0 - sqr(i1))}(i2, i3) + Abs(Composite{(i0 - sqr(i1))}(i2, i3))))))}}[]<gpuarray>.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

Closing remaining open files:trpo_logs/CartPole-v0...done



What I'm confused about is that:
  • This happens right at the beginning of the reinforcement learning, so it's not as if the algorithm had been running for a long time and then ran out of memory.
  • The input is tiny: shape (1, 4), with strides (16, 4). The output is only supposed to do normalization and a few other element-wise operations. None of this suggests high memory usage (see the quick check below).
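
As a quick sanity check (plain numpy, nothing specific to the script), the observation being transferred by that GpuFromHost node is only 16 bytes:

$ python -c "import numpy as np; print(np.zeros((1, 4), dtype=np.float32).nbytes)"
16

So the host-to-GPU transfer itself is negligible.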
I tried `optimizer = fast_compile` and re-ran this, but the error message was actually less informative (it contained only a subset of the message above). Running with `exception_verbosity = high` results in a different error message:


Max traj len: 200
Traceback (most recent call last):
  File "scripts/run_rl_mj.py", line 116, in <module>
    main()
  File "scripts/run_rl_mj.py", line 109, in main
    iter_info = opt.step()
  File "/home/daniel/imitation_noise/policyopt/rl.py", line 280, in step
    cfg=self.sim_cfg)
  File "/home/daniel/imitation_noise/policyopt/__init__.py", line 411, in sim_mp
    traj = job.get()
  File "/home/daniel/anaconda2/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pygpu.gpuarray.GpuArrayException: initialization error
Closing remaining open files:trpo_logs/CartPole-v0...done


It somehow didn't even reach the same point in the code?

I noticed a similar issue here: https://github.com/costapt/vess2ret/issues/5, which suggests the problem is not limited to this script. What do you suggest I do? Thanks.

Pascal Lamblin

Jul 3, 2017, 6:08:44 PM
to theano-users
What happens if you set gpuarray.preallocate to something much smaller, or even to -1?
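
For example, in ~/.theanorc (a value between 0 and 1 preallocates that fraction of GPU memory; -1 disables the allocation cache entirely):

[gpuarray]
preallocate = 0.1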

Also, I see the script uses multiprocessing. Weird things happen if new Python processes are spawned after the GPU has been initialized; this is a limitation of how CUDA handles GPU contexts, I believe.
The solution would be not to use `device=cuda` but `device=cpu`, and to call `theano.gpuarray.use('cuda')` manually in the subprocess, or only after all processes have been launched.
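
A minimal sketch of that pattern, assuming `device=cpu` (or no device flag at all) in .theanorc; the worker function, queues, and toy standardization graph are illustrative, not code from the imitation repository:

import multiprocessing

def worker(task_q, result_q):
    # Bind Theano to the GPU *inside* the child process, after the fork,
    # so the CUDA context belongs to this process.
    import theano
    import theano.gpuarray
    theano.gpuarray.use('cuda')
    import theano.tensor as T

    # Toy stand-in for the policy's standardization graph.
    x = T.matrix('x')
    standardize = theano.function([x], (x - x.mean()) / x.std())

    while True:
        obs = task_q.get()
        if obs is None:          # sentinel telling the worker to shut down
            break
        result_q.put(standardize(obs))

if __name__ == '__main__':
    import numpy as np
    # The parent process never touches the GPU (it does not even import
    # theano), so spawning the worker is safe.
    task_q, result_q = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(task_q, result_q))
    p.start()
    task_q.put(np.arange(4, dtype='float32').reshape(1, 4))
    print(result_q.get())
    task_q.put(None)
    p.join()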

Daniel Seita

Jul 4, 2017, 12:51:14 AM
to theano-users

Thanks Pascal.


I tried setting gpuarray.preallocate to 0.01 and 0.1. The run with 0.1, for instance, starts like this:


$ python scripts/run_rl_mj.py --env_name CartPole-v0 --log trpo_logs/CartPole-v0
Using cuDNN version 5105 on context None
Preallocating 1218/12186 Mb (0.100000) on cuda
Mapped name None to device cuda: TITAN X (Pascal) (0000:01:00.0)


But the same error message results:

Traceback (most recent call last):
  File "scripts/run_rl_mj.py", line 116, in <module>
    main()
  File "scripts/run_rl_mj.py", line 109, in main
    iter_info = opt.step()
  File "/home/daniel/imitation_noise/policyopt/rl.py", line 283, in step
    cfg=self.sim_cfg)
  File "/home/daniel/imitation_noise/policyopt/__init__.py", line 425, in sim_mp
    traj = job.get()
  File "/home/daniel/anaconda2/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pygpu.gpuarray.GpuArrayException: Out of memory
Apply node that caused the error: GpuFromHost<None>(obsfeat_B_Df)
Toposort index: 1
Inputs types: [TensorType(float32, matrix)]
Inputs shapes: [(1, 4)]
Inputs strides: [(16, 4)]
Inputs values: [array([[ 0.02563, -0.03082,  0.01663, -0.00558]], dtype=float32)]
Outputs clients: [[GpuElemwise{Composite{((i0 - i1) / i2)}}[(0, 0)]<gpuarray>(GpuFromHost<None>.0, /GibbsPolicy/obsnorm/Standardizer/mean_1_D, GpuElemwise{Composite{(i0 + sqrt((i1 * (Composite{(i0 - sqr(i1))}(i2, i3) + Abs(Composite{(i0 - sqr(i1))}(i2, i3))))))}}[]<gpuarray>.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.


With preallocate set to -1, the run starts with:


$ python scripts/run_rl_mj.py --env_name CartPole-v0 --log trpo_logs/CartPole-v0
Using cuDNN version 5105 on context None
Disabling allocation cache on cuda


I get a similar error message, except it now reports an initialization error; the same part of the code runs into problems:


Traceback (most recent call last):
  File "scripts/run_rl_mj.py", line 116, in <module>
    main()
  File "scripts/run_rl_mj.py", line 109, in main
    iter_info = opt.step()
  File "/home/daniel/imitation_noise/policyopt/rl.py", line 283, in step
    cfg=self.sim_cfg)
  File "/home/daniel/imitation_noise/policyopt/__init__.py", line 425, in sim_mp
    traj = job.get()
  File "/home/daniel/anaconda2/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pygpu.gpuarray.GpuArrayException: initialization error
Apply node that caused the error: GpuFromHost<None>(obsfeat_B_Df)
Toposort index: 1
Inputs types: [TensorType(float32, matrix)]
Inputs shapes: [(1, 4)]
Inputs strides: [(16, 4)]
Inputs values: [array([[ 0.01357, -0.02611,  0.0341 ,  0.0162 ]], dtype=float32)]
Outputs clients: [[GpuElemwise{Composite{((i0 - i1) / i2)}}[(0, 0)]<gpuarray>(GpuFromHost<None>.0, /GibbsPolicy/obsnorm/Standardizer/mean_1_D, GpuElemwise{Composite{(i0 + sqrt((i1 * (Composite{(i0 - sqr(i1))}(i2, i3) + Abs(Composite{(i0 - sqr(i1))}(i2, i3))))))}}[]<gpuarray>.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.


Yes, the code does use multiprocessing. I will see if I can figure out how to deal with the multiprocessing, or perhaps just disable it.
