I just got my new hardware set up and was terribly excited to see how
fast my theano code ran on it. Alas, I've encountered a minor bump
along the way -- when using a gpu as the device, python segfaults on
exit. (Thus it doesn't *really* block getting
work done, but it is a bummer.) I seem to attract these things, don't
I?
$ THEANO_FLAGS=device=cpu python -c "import theano; print 'ok'"
ok
$ THEANO_FLAGS=device=gpu0 python -c "import theano; print 'ok'"
Using gpu device 0: GeForce GTX 480
ok
Segmentation fault
I attached gdb to python and got a (symbol-less) stack trace that
appears to confirm that gpu-related cleanup code is to blame:
Program received signal SIGSEGV, Segmentation fault.
0x00007f8bf6cd9d50 in ?? ()
(gdb) bt
#0 0x00007f8bf6cd9d50 in ?? ()
#1 0x00007f8bf77efa93 in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#2 0x00007f8bf77f2e8d in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#3 0x00007f8bf781da4b in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#4 0x00007f8bf7806a30 in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#5 0x00007f8bf77e3cdf in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#6 0x000000000000003a in ?? ()
#7 0x0000000000000000 in ?? ()
I'm running Ubuntu 10.04.1, 64bit. All the various supporting
libraries were installed via apt-get. I tried with both the Cuda 3.1
and the 3.2rc2 releases. I'd be happy to help track it down further,
but I'm unsure where to look next. Any pointers as to where I should
dig / what I should try to get this resolved?
Thanks, and sorry,
Josh
Glad to know it's not just me. I also just saw while googling that
Dmitry reported this a bit over a month ago on this list. Dmitry, did
switching to CUDA 3.0 fix things for you?
Just to be sure, I've ruled out atexit handlers (commented them all
out) and gc (running python -vv shows the crash at the *very* end,
after float cleanup). It's definitely in the cuda cleanup code
somewhere. I might just let this one go, since it doesn't impact
usability at all...
Cheers,
Josh
I (accidentally) discovered a useful data point and a workaround:
pycuda's cleanup code prevents this segfault. For example:
$ python -c "import theano; import pycuda.autoinit; print 'ok'"
Using gpu device 0: GeForce GTX 480
ok
One can thus use pycuda as a workaround. However, if you do this, you
need to explicitly trigger gc prior to the process exiting, otherwise
theano tries to clean up device memory after the gpu context has been
destroyed, which triggers its own set of (non-destructive) errors. The
full workaround to prevent the segfault is thus:
# order matters! theano must be before pycuda.autoinit, and
# pycuda.autoinit must be before the atexit.register call
import theano
import pycuda.autoinit  # prevent segfault-on-exit caused by theano
import atexit, gc
# clean up theano's device memory pointers prior to pycuda
# tearing down the gpu context
atexit.register(gc.collect)
Antoine and Dmitry, I'd love to know whether this works for you as well...
Cheers,
Josh
I have a similar problem with some non-theano code that uses the GPU
from python. I would love to figure out what is going on. It may be
easier to get help from nvidia or others with a minimal example. Here
is how to trigger this without theano or my code, using just ctypes in
python:
import ctypes

# initialize and immediately shut down cublas; the segfault then
# occurs at interpreter exit
libcublas = ctypes.cdll.LoadLibrary('libcublas.so')
libcublas.cublasInit()
libcublas.cublasShutdown()
Note that installing the 3.0 runtime fixes this bug.
Let me know if anyone has any more ideas or if this sheds some light
on something.
Best,
Jon.
On Sun, Sep 26, 2010 at 11:07 AM, Josh Bleecher Snyder wrote:
Thanks for reminding me to take another look at this. It looks like it
was indeed as simple as calling cudaThreadExit() ourselves. (I wonder
if the atexit call happens on the correct thread vs an automatic
teardown.) Anyway, I believe it should now be fixed in
http://hg.assembla.com/theano/rev/aa06fd5a3347
Would you mind updating and checking to see that fixes it for you?
-josh
Ever since http://xkcd.com/371/, I have an odd feeling of guilt about
segfaults. :)
-josh
> Thanks for reminding me to take another look at this. It looks like it
> was indeed as simple as calling cudaThreadExit() ourselves. (I wonder
> if the atexit call happens on the correct thread vs an automatic
> teardown.) Anyway, I believe it should now be fixed in
> http://hg.assembla.com/theano/rev/aa06fd5a3347
> Would you mind updating and checking to see that fixes it for you?
Yep -- thanks. Please update and try again now...
-josh
> Yep -- thanks. Please update and try again now...
It looks like you got only the first half of my fix (I forgot that
atexit makes its calls in LIFO order); sorry that there was a delay
between the two changesets. Hopefully it is no longer broken after
http://hg.assembla.com/theano/rev/3a70d5b71f04. Anyway, I'm off to the
airport now and will be flying all day, but will take a closer look at
your diff tomorrow...
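Since the LIFO detail is easy to trip over, here is a minimal sketch of it (the handler names are made up for illustration, and `atexit._run_exitfuncs` is a CPython-internal helper used here only to trigger the handlers eagerly): the handler registered *last* runs *first*, which is why the teardown step must be registered before the freeing step.

```python
import atexit

order = []

def teardown_context():
    # stand-in for cudaThreadExit(): must run *last*
    order.append('teardown')

def free_buffers():
    # stand-in for freeing device memory: must run *first*
    order.append('free')

# LIFO: register teardown first so that it runs after free_buffers
atexit.register(teardown_context)
atexit.register(free_buffers)

# run the handlers eagerly just to demonstrate the order
# (normally the interpreter does this at exit)
atexit._run_exitfuncs()
print(order)  # ['free', 'teardown']
```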
Thanks for your help working through this.
-josh
> It looks like you got only the first half of my fix (I forgot that
> atexit makes its calls in LIFO order); sorry that there was a delay
> between the two changesets. Hopefully it is no longer broken after
> http://hg.assembla.com/theano/rev/3a70d5b71f04. Anyway, I'm off to the
> airport now and will be flying all day, but will take a closer look at
> your diff tomorrow...
It feels a little bit like a hack to me...but I can't say I have any
better ideas. :) Will cobble together something along these lines to
see what it looks like...
-josh
Hmmmm. Unfortunately, making the call to cudaFree re-initializes the
gpu context, and thus resuscitates the segfault. Doh! Short-circuiting
immediately in device_free seems to do the trick, though. It should be
safe so long as the only call to gpu_shutdown occurs at exit...and I
can't think of any legitimate reason to call it earlier than that
unless one was trying to multithread their gpu work, and theano is so
very far from being threadsafe that I don't think it is worth handling
that case right now. (And anyone trying to make theano threadsafe will
be immediately tipped off that this needs attention by the static
variable.)
Here it is for you to try out and review:
http://bitbucket.org/josharian/theano/changeset/b426d0b6ef8a
One question: device_free does some bookkeeping if
COMPUTE_GPU_MEM_USED is enabled. I don't currently handle that, since
it wasn't obvious to me what the correct thing to do was...
Here's another idea; not sure whether this is any better.
In gpu_shutdown, just set a flag that we're done with the gpu. Then at
device_free time, after the free call, check whether that was the last
malloc'd bit of memory to be freed. (Looks like this is already
tracked in _outstanding_mallocs, albeit with a scary comment that
indicates that it currently doesn't work.) If that was the last free
call to be made, and the flag is set indicating we're done with the
gpu, then call cudaThreadExit. That should make gpu_shutdown
idempotent (although still not multi-thread friendly) and avoid weird
short-circuiting.
I've implemented this in
http://bitbucket.org/josharian/theano/changeset/ebbc8fab2c8f -- seems
to work ok as well. Not sure which approach is preferable...
-josh
...and third time's a charm, right? One last iteration on this in
http://bitbucket.org/josharian/theano/changeset/ab866c953440. Uses a
refcounting scheme for cuda contexts, in which device_malloc and
device_free participate. Makes the code cleaner and clearer, and
handles the case in which gpu_shutdown is called *after* the last bit
of device memory is freed.
I'm pretty happy with this now; my lingering uncertainties about the
last two attempts are gone.
So if you're going to code review just one of my three attempts (sorry
about that), code review this one. :)
Glad to hear that the things that fix the problem for me also fix it
for you, Ian; thanks for confirming that.
-josh
I don't completely understand why we need to call cudaThreadExit. But
if it fixes the problems, I'm pretty confident that it doesn't add any
new ones.
thanks for the work.
Fred
Argh. Yes, I do.
However, I believe at this point the question to answer is why there
are more device_malloc calls being made than device_free calls. (I'm
loath to return to the solution of shutting down the gpu context
atexit and swallowing the errors, since it strikes me as masking
rather than fixing the root problem.) I've opened a ticket for this:
http://trac-hg.assembla.com/theano/ticket/591
-josh
> However, I believe at this point the question to answer is why there
> are more device_malloc calls being made than device_free calls. (I'm
> loath to return to the solution of shutting down the gpu context
> atexit and swallowing the errors, since it strikes me as masking
> rather than fixing the root problem.)
I don't think there is a memory leak. From memory, in Python there is
no guarantee that objects will be destroyed. As we call device_free
only when we destroy an object, if there are objects that are never
destroyed, there will be more calls to device_malloc than to device_free.
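A tiny illustration of this point (DeviceBuf is a made-up stand-in for an object owning device memory): any object still referenced at shutdown, for example from a module-level cache, never runs its destructor, so frees lag behind mallocs.

```python
class DeviceBuf:
    mallocs = 0
    frees = 0
    def __init__(self):
        DeviceBuf.mallocs += 1   # stand-in for device_malloc
    def __del__(self):
        DeviceBuf.frees += 1     # stand-in for device_free

cache = [DeviceBuf() for _ in range(3)]  # e.g. an internal cache
tmp = DeviceBuf()
del tmp  # in CPython this drops the last reference, so __del__ runs now...

# ...but the cached buffers are still referenced, so their __del__ has
# not run: as an atexit handler would see it, mallocs outnumber frees
print(DeviceBuf.mallocs, DeviceBuf.frees)  # 4 1
```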
This also means that functions registered with atexit will be called
while not all objects are destroyed. So that is probably why your
previous fix worked: we call cudaThreadExit ourselves in an atexit
function, and all subsequent device_free calls don't call cudaFree.
So it seems we don't have a choice and must take the less well
structured approach.
Josh, do you agree with this reasoning, if Python objects are not
always destroyed?
Thanks
Fred
Sorry about the slow response (I just saw your ticket response from
some time ago); I just returned from Thanksgiving holidays.
> I don't think there is a memory leak. From memory, in Python there is
> no guarantee that objects will be destroyed. As we call device_free
> only when we destroy an object, if there are objects that are never
> destroyed, there will be more calls to device_malloc than to device_free.
Bummer. I didn't realize that.
> This also means that functions registered with atexit will be called
> while not all objects are destroyed. So that is probably why your
> previous fix worked: we call cudaThreadExit ourselves in an atexit
> function, and all subsequent device_free calls don't call cudaFree.
>
> So it seems we don't have a choice and must take the less well structured approach.
Yep, I think you're right (and were right all along!). It's a bummer,
but short of, say, using C++ destructors (which *are* guaranteed to be
called at process exit), which would involve major codebase upheaval
for a minor problem, I don't see any other way.
So...it looks like reverting ebbc8fab2c8f and ab866c953440 should do
the trick. Any objections? If not, I'll do that shortly.
-josh
Backed out in d8d551435117 and 0ae23f7375cc. Hopefully that'll be the
end of this particular bug. :)
thanks
Fred
Ian Kuntz, do you still have the problem since the last fix?