Segfault during gpu cleanup


Josh Bleecher Snyder

Sep 23, 2010, 4:28:54 PM
to theano...@googlegroups.com
Hi,

I just got my new hardware set up and was terribly excited to see how
fast my theano code ran on it. Alas, I've encountered a minor bump
along the way -- when using a gpu as the device, python segfaults when
exiting the python process. (Thus it doesn't *really* block getting
work done, but it is a bummer.) I seem to attract these things, don't
I?

$ THEANO_FLAGS=device=cpu python -c "import theano; print 'ok'"
ok
$ THEANO_FLAGS=device=gpu0 python -c "import theano; print 'ok'"
Using gpu device 0: GeForce GTX 480
ok
Segmentation fault


I attached gdb to python, and got a (symbol-less) stack trace that
appears to confirm that gpu-related cleanup code is to blame:

Program received signal SIGSEGV, Segmentation fault.
0x00007f8bf6cd9d50 in ?? ()
(gdb) bt
#0 0x00007f8bf6cd9d50 in ?? ()
#1 0x00007f8bf77efa93 in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#2 0x00007f8bf77f2e8d in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#3 0x00007f8bf781da4b in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#4 0x00007f8bf7806a30 in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#5 0x00007f8bf77e3cdf in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#6 0x000000000000003a in ?? ()
#7 0x0000000000000000 in ?? ()


I'm running Ubuntu 10.04.1, 64bit. All the various supporting
libraries were installed via apt-get. I tried with both the Cuda 3.1
and the 3.2rc2 releases. I'd be happy to help track it down further,
but I'm unsure where to look next. Any pointers as to where I should
dig / what I should try to get this resolved?

Thanks, and sorry,
Josh

Antoine

Sep 24, 2010, 6:14:55 AM
to theano-users
Hi,

I can confirm the segfault using Mandriva 2010.0 and CUDA 3.1 on a GeForce
GTX 260.
This information is probably not very useful for you, but now you know
that you're not the only one getting that segfault! ;-)
I don't know what to do about it, but I'd be happy to help track it down
if needed...

Antoine

Josh Bleecher Snyder

Sep 24, 2010, 8:00:46 PM
to theano...@googlegroups.com
Hi Antoine,

Glad to know it's not just me. I also just saw while googling that
Dmitry reported this a bit over a month ago on this list. Dmitry, did
switching to CUDA 3.0 fix things for you?

Just to be sure, I've ruled out atexit handlers (commented them all
out) and gc (running python -vv shows the crash at the *very* end,
after float cleanup). It's definitely in the cuda cleanup code
somewhere. I might just let this one go, since it doesn't impact
usability at all...

Cheers,
Josh

Dmitry Chichkov

Sep 24, 2010, 8:23:34 PM
to theano...@googlegroups.com
Hi Josh,

I haven't tried it. As you said, it doesn't impact usability.

-- Dmitry

Josh Bleecher Snyder

Sep 26, 2010, 11:07:33 AM
to theano...@googlegroups.com
Hi,

I (accidentally) discovered a useful data point and a workaround:
pycuda's cleanup code prevents this segfault. For example:

$ python -c "import theano; import pycuda.autoinit; print 'ok'"


Using gpu device 0: GeForce GTX 480
ok

One can thus use pycuda for a workaround. However, if you do this, you
need to explicitly trigger gc prior to the process exiting, otherwise
theano tries to clean up device memory after the gpu context has been
destroyed, which triggers its own set of (non-destructive) errors. The
full workaround to prevent the segfault is thus:


# order matters! theano must be imported before pycuda.autoinit, and
# pycuda.autoinit must be imported before the atexit.register call
import theano
import pycuda.autoinit  # prevent segfault-on-exit caused by theano
import atexit, gc
# clean up theano's device memory pointers before pycuda tears down the gpu context
atexit.register(gc.collect)


Antoine and Dmitry, I'd love to know whether this works for you as well...

Cheers,
Josh

Jonathan Taylor

Nov 10, 2010, 3:34:13 PM
to theano...@googlegroups.com
Hi,

I have a similar problem with some non-theano code that uses the GPU
from python. I would love to figure out what is going on. It may be
easier to get help from nvidia or others with a minimal example. Here
is how to trigger this without theano or my code, using nothing but
ctypes in python.

import ctypes
libcublas = ctypes.cdll.LoadLibrary('libcublas.so')
libcublas.cublasInit()
libcublas.cublasShutdown()
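
For reporting it upstream, the same two calls in a standalone C program should
reproduce it as well. This is an untested sketch of the equivalent (legacy CUBLAS
API; build with something like gcc repro.c -lcublas):

/* Minimal C equivalent of the ctypes snippet above (sketch only). */
#include <cublas.h>

int main(void)
{
    cublasInit();        /* creates the CUDA context via the legacy CUBLAS API */
    cublasShutdown();    /* releases it again */
    return 0;            /* the crash, when present, happens during process teardown */
}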

Note that installing the 3.0 runtime fixes this bug.

Let me know if anyone has any more ideas or if this sheds some light
on something.

Best,
Jon.


Ian Kuntz

Nov 17, 2010, 5:50:57 PM
to theano-users
Hi all,

I can reproduce exactly the same issue as Josh in my setup (Ubuntu
10.04/64bit, Cuda 3.2-12).
As you said, it seems to be a cleanup issue.

I have checked the examples included in NVIDIA_GPU_Computing_SDK (C/src)
and most of them do the following:
 1 - free all cuda and local vars
 2 - call cudaThreadExit()
 3 - cutilExit(argc, argv)
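
In skeleton form, that pattern looks roughly like this (a sketch from memory of
the 3.x-era SDK samples; cutilExit() is an SDK cutil helper, not part of the
CUDA runtime, so it only appears here as a comment):

#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, 1024 * sizeof(float));

    /* ... kernel launches, memcpys, host-side work ... */

    cudaFree(d_data);    /* 1 - free all cuda and local vars */
    cudaThreadExit();    /* 2 - tear down the CUDA context explicitly */
    /* 3 - cutilExit(argc, argv); SDK-only convenience, omitted here */
    return 0;
}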

I am not a Cuda expert (nor a Theano one), but I am an experienced C
developer... so with your input, we should be able to kill this bug,
which is important for newcomers like me.

Bye,

Ian

Josh Bleecher Snyder

Nov 17, 2010, 7:22:10 PM
to theano...@googlegroups.com
> As you said, it seems to be a cleanup issue.
>
> I have checked the examples included in NVIDIA_GPU_Computing_SDK (C/src)
> and most of them do the following:
>  1 - free all cuda and local vars
>  2 - call cudaThreadExit()
>  3 - cutilExit(argc, argv)
>
> I am not a Cuda expert (nor a Theano one), but I am an experienced C
> developer... so with your input, we should be able to kill this bug,
> which is important for newcomers like me.

Thanks for reminding me to take another look at this. It looks like it
was indeed as simple as calling cudaThreadExit() ourselves. (I wonder
if the atexit call happens on the correct thread vs an automatic
teardown.) Anyway, I believe it should now be fixed in
http://hg.assembla.com/theano/rev/aa06fd5a3347

Would you mind updating and checking to see that fixes it for you?

-josh

Frédéric Bastien

Nov 18, 2010, 9:14:22 AM
to theano...@googlegroups.com
Thanks to all of you for looking into and fixing that.

Fred

Josh Bleecher Snyder

Nov 18, 2010, 9:23:03 AM
to theano...@googlegroups.com
> Thanks to all of you for looking into and fixing that.

Ever since http://xkcd.com/371/, I have an odd feeling of guilt about
segfaults. :)

-josh

Ian Kuntz

Nov 19, 2010, 6:55:00 AM
to theano...@googlegroups.com
> Thanks for reminding me to take another look at this. It looks like it
> was indeed as simple as calling cudaThreadExit() ourselves. (I wonder
> if the atexit call happens on the correct thread vs an automatic
> teardown.) Anyway, I believe it should now be fixed in
> http://hg.assembla.com/theano/rev/aa06fd5a3347

Thanks for that quick fix.

> Would you mind updating and checking to see that fixes it for you?

Well, partially.

I have saved the three code snippets from the GPU page of the tutorial as "gpu-1.py", "gpu-2.py", "gpu-3.py".

As you suspected, I no longer get a segfault for the first one, which uses numpy objects, but now I get odd error messages and, after that, still a crash.

Hope this report and the logs below will help!

Ian

---
 (theano+cuda)$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python gpu-1.py
Using gpu device 0: GeForce 9500 GT
Looping 1000 times took 2.12884688377 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
(theano+cuda)$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python gpu-2.py
Using gpu device 0: GeForce 9500 GT
Looping 1000 times took 1.33140897751 seconds
Result is <CudaNdarray object at 0x1f84130>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
Error freeing device pointer 0x300000 (invalid device pointer).
!!!! error freeing device memory
Error freeing device pointer 0x100100 (invalid device pointer).
!!!! error freeing device memory
Erreur de segmentation
(theano+cuda)$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python gpu-3.py
Using gpu device 0: GeForce 9500 GT
Looping 1000 times took 1.33468914032 seconds
Result is <CudaNdarray object at 0x38b91b0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
Error freeing device pointer 0x300000 (invalid device pointer).
!!!! error freeing device memory
Error freeing device pointer 0x100100 (invalid device pointer).
!!!! error freeing device memory
Erreur de segmentation


Josh Bleecher Snyder

Nov 19, 2010, 7:05:21 AM
to theano...@googlegroups.com
>> Would you mind updating and checking to see that fixes it for you?
>
> Well, partially.
>
> I have saved the three code snippets from the GPU page of the tutorial as
> "gpu-1.py", "gpu-2.py", "gpu-3.py".
>
> As you suspected, I no longer get a segfault for the first one, which uses
> numpy objects, but now I get odd error messages and, after that, still a crash.
>
> Hope this report and the logs below will help!

Yep -- thanks. Please update and try again now...

-josh

Ian Kuntz

Nov 19, 2010, 7:47:05 AM
to theano...@googlegroups.com
> Yep -- thanks. Please update and try again now...

Sorry, but I still have the same errors for gpu-2.py and gpu-3.py.

I would think that Python's GC does not trigger the CudaNdarray dealloc function, but cudaThreadExit() deallocates
the memory anyway.

To understand why, I added "r = None" at the end of the test, and (surprise!) I no longer get any segfault or error messages. My understanding is that the GC does not free the object because, when a module's code runs to the end of the program, python still holds references to its local variables.

To be sure of the workflow, I added my own little verbose exit function (see the attachment), and I can see clearly that the error messages come long after the atexit() call.


$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python gpu-3.py
Using gpu device 0: GeForce 9500 GT
Looping 1000 times took 1.34656310081 seconds
Result is <CudaNdarray object at 0x24000f0>

Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
before gc
after gc,sleeping 10
before gpu_shutdown
after gpu_shutdown, sleeping 10
end of myexit

Error freeing device pointer 0x300000 (invalid device pointer).
!!!! error freeing device memory
Error freeing device pointer 0x100100 (invalid device pointer).
!!!! error freeing device memory
Erreur de segmentation

Any idea how to fix that cleanup workflow?

Ian
atexit-debug.diff

Josh Bleecher Snyder

Nov 19, 2010, 7:55:16 AM
to theano...@googlegroups.com

It looks like you got only the first half of my fix (I forgot that
atexit makes its calls in LIFO order); sorry that there was a delay
between the two changesets. Hopefully it is fixed as of
http://hg.assembla.com/theano/rev/3a70d5b71f04. Anyway, I'm off to the
airport now and will be flying all day, but I will take a closer look at
your diff tomorrow...

Thanks for your help working through this.

-josh

Ian Kuntz

Nov 19, 2010, 8:09:36 AM
to theano...@googlegroups.com
> It looks like you got only the first half of my fix (I forgot that
> atexit makes its calls in LIFO order); sorry that there was a delay
> between the two changesets. Hopefully it is fixed as of
> http://hg.assembla.com/theano/rev/3a70d5b71f04. Anyway, I'm off to the
> airport now and will be flying all day, but I will take a closer look at
> your diff tomorrow...

Well, you are right, but it changes nothing.

The only way to avoid the segfault is to reset all CudaNdarray objects by hand
with "r=None" in the GPU example.

My patch is not a fix, just a dirty hack to understand the call workflow.
Ian

Josh Bleecher Snyder

Nov 19, 2010, 9:17:47 AM
to theano...@googlegroups.com
I see...ugh. Thanks for your patience with me. :) Will poke at this more tomorrow. If you want to take a crack at a proper fix, the relevant c file is most likely cuda_ndarray.cu(h). It occurs to me it might also be helpful to look at PyCuda's cleanup techniques, since they don't have this problem.

Josh

Sent from my iPhone

Frédéric Bastien

Nov 19, 2010, 9:35:52 AM
to theano...@googlegroups.com
Hi,

I see one way to fix this.

1) In cuda_ndarray.cu, we create a variable called shutdown_called that defaults to false.
2) When CudaNdarray_gpu_shutdown() is called, we set it to true.
3) When device_free() is called and shutdown_called is true, we still call cudaFree, but we treat it as if it always succeeds. This is because CudaNdarray_gpu_shutdown() frees all gpu memory, so the cudaFree will fail.
3.1) I prefer to call cudaFree even when it is not needed, in case we allocate new gpu memory after CudaNdarray_gpu_shutdown() is called. But to my understanding that should not happen.
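
In rough C, the idea would look something like this (a sketch only, with
illustrative error handling, not the actual cuda_ndarray.cu code):

#include <stdio.h>
#include <cuda_runtime.h>

static int shutdown_called = 0;   /* set to true by CudaNdarray_gpu_shutdown() */

int device_free(void *ptr)
{
    cudaError_t err = cudaFree(ptr);   /* still attempt the free ... */
    if (err != cudaSuccess)
    {
        if (shutdown_called)
            return 0;   /* ... but after shutdown, pretend it succeeded */
        fprintf(stderr, "Error freeing device pointer %p (%s).\n",
                ptr, cudaGetErrorString(err));
        return -1;
    }
    return 0;
}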

What do you think about this?

Fred

Josh Bleecher Snyder

Nov 20, 2010, 4:07:01 PM
to theano...@googlegroups.com
> I see one way to fix this.
>
> 1) In cuda_ndarray.cu, we create a variable called shutdown_called
> that defaults to false.
> 2) When CudaNdarray_gpu_shutdown() is called, we set it to true.
> 3) When device_free() is called and shutdown_called is true, we still call
> cudaFree, but we treat it as if it always succeeds. This is because
> CudaNdarray_gpu_shutdown() frees all gpu memory, so the cudaFree will fail.
> 3.1) I prefer to call cudaFree even when it is not needed, in case we
> allocate new gpu memory after CudaNdarray_gpu_shutdown() is called. But to
> my understanding that should not happen.
>
> What do you think about this?

It feels a little bit like a hack to me...but I can't say I have any
better ideas. :) Will cobble together something along these lines to
see what it looks like...

-josh

Josh Bleecher Snyder

Nov 20, 2010, 4:58:41 PM
to theano...@googlegroups.com
>> I see one way to fix this.
>>
>> 1) In cuda_ndarray.cu, we create a variable called shutdown_called
>> that defaults to false.
>> 2) When CudaNdarray_gpu_shutdown() is called, we set it to true.
>> 3) When device_free() is called and shutdown_called is true, we still call
>> cudaFree, but we treat it as if it always succeeds. This is because
>> CudaNdarray_gpu_shutdown() frees all gpu memory, so the cudaFree will fail.
>> 3.1) I prefer to call cudaFree even when it is not needed, in case we
>> allocate new gpu memory after CudaNdarray_gpu_shutdown() is called. But to
>> my understanding that should not happen.
>>
>> What do you think about this?
>
> It feels a little bit like a hack to me...but I can't say I have any
> better ideas. :) Will cobble together something along these lines to
> see what it looks like...


Hmmmm. Unfortunately, making the call to cudaFree re-initializes the
gpu context, and thus resuscitates the segfault. Doh! Short-circuiting
immediately in device_free seems to do the trick, though. It should be
safe so long as the only call to gpu_shutdown occurs at exit...and I
can't think of any legitimate reason to call it earlier than that
unless one was trying to multithread their gpu work, and theano is so
very far from being threadsafe that I don't think it is worth handling
that case right now. (And anyone trying to make theano threadsafe will
be immediately tipped off that this needs attention by the static
variable.)
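
In sketch form, the short-circuit amounts to the following (illustrative only;
the real change is in the changeset below):

#include <cuda_runtime.h>

static int gpu_shutdown_called = 0;   /* set once gpu_shutdown runs at exit */

int device_free(void *ptr)
{
    /* After shutdown, even calling cudaFree would lazily re-create the gpu
       context and bring the exit-time segfault back, so do nothing at all. */
    if (gpu_shutdown_called)
        return 0;
    return (cudaFree(ptr) == cudaSuccess) ? 0 : -1;
}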

Here it is for you to try out and review:

http://bitbucket.org/josharian/theano/changeset/b426d0b6ef8a

One question: device_free does some bookkeeping if
COMPUTE_GPU_MEM_USED is enabled. I don't handle that at all right now,
since it wasn't obvious to me what the correct thing to do was...

Ian Kuntz

Nov 20, 2010, 6:26:47 PM
to theano...@googlegroups.com
Hi Josh,

I cannot help you much, because I do not know theano or cuda internals.
But I ran some tests with your patch, and it removes the segfault and the error messages.

Maybe having a look at the pycuda code could help; they seem to use more "gpu contexts", but I am
not sure it's related.

Ian.

Josh Bleecher Snyder

Nov 20, 2010, 6:28:49 PM
to theano...@googlegroups.com
>>> I see one way to fix this.
>>>
>>> 1) In cuda_ndarray.cu, we create a variable called shutdown_called
>>> that defaults to false.
>>> 2) When CudaNdarray_gpu_shutdown() is called, we set it to true.
>>> 3) When device_free() is called and shutdown_called is true, we still call
>>> cudaFree, but we treat it as if it always succeeds. This is because
>>> CudaNdarray_gpu_shutdown() frees all gpu memory, so the cudaFree will fail.
>>> 3.1) I prefer to call cudaFree even when it is not needed, in case we
>>> allocate new gpu memory after CudaNdarray_gpu_shutdown() is called. But to
>>> my understanding that should not happen.
>>>
>>> What do you think about this?
>>
>> It feels a little bit like a hack to me...but I can't say I have any
>> better ideas. :) Will cobble together something along these lines to
>> see what it looks like...

Here's another idea; not sure whether this is any better.

In gpu_shutdown, just set a flag that we're done with the gpu. Then at
device_free time, after the free call, check whether that was the last
malloc'd bit of memory to be freed. (Looks like this is already
tracked in _outstanding_mallocs, albeit with a scary comment that
indicates that it currently doesn't work.) If that was the last free
call to be made, and the flag is set indicating we're done with the
gpu, then call cudaThreadExit. That should make gpu_shutdown
idempotent (although still not multi-thread friendly) and avoid weird
short-circuiting.

I've implemented this in
http://bitbucket.org/josharian/theano/changeset/ebbc8fab2c8f -- seems
to work ok as well. Not sure which approach is preferable...

-josh

Josh Bleecher Snyder

Nov 20, 2010, 6:51:55 PM
to theano...@googlegroups.com
> I've implemented this in
> http://bitbucket.org/josharian/theano/changeset/ebbc8fab2c8f -- seems
> to work ok as well. Not sure which approach is preferable...

...and third time's a charm, right? One last iteration on this in
http://bitbucket.org/josharian/theano/changeset/ab866c953440. It uses a
refcounting scheme for cuda contexts, in which device_malloc and
device_free participate. That makes the code cleaner and clearer, and
handles the case in which gpu_shutdown is called *after* the last bit
of device memory is freed.
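
As an illustration of the refcounting idea (made-up names and simplified error
handling, not the changeset itself):

#include <stddef.h>
#include <cuda_runtime.h>

/* The module holds one reference for the active gpu context, and every
   outstanding device allocation holds another. */
static int context_refcount = 0;

static void context_incref(void) { ++context_refcount; }

static void context_decref(void)
{
    if (--context_refcount == 0)
        cudaThreadExit();   /* last user is gone: tear the context down exactly once */
}

void *device_malloc(size_t size)
{
    void *ptr = NULL;
    if (cudaMalloc(&ptr, size) != cudaSuccess)
        return NULL;
    context_incref();       /* this allocation keeps the context alive */
    return ptr;
}

int device_free(void *ptr)
{
    cudaError_t err = cudaFree(ptr);
    context_decref();       /* fine whether gpu_shutdown has already run or not */
    return (err == cudaSuccess) ? 0 : -1;
}

void gpu_init(void)     { context_incref(); }   /* the module's own reference */
void gpu_shutdown(void) { context_decref(); }   /* registered via atexit */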

I'm pretty happy with this now; my lingering uncertainties about the
last two attempts are gone.

So if you're going to code review just one of my three attempts (sorry
about that), code review this one. :)

Glad to hear that the things that fix the problem for me also fix it for
you, Ian; thanks for confirming that.

-josh

Frédéric Bastien

Nov 22, 2010, 3:40:11 PM
to theano...@googlegroups.com
It looks good. I integrated it into theano.

I don't fully understand why we need to call cudaThreadExit. But
if it fixes the problems, I'm pretty confident that it doesn't add
new ones.

Thanks for the work.

Fred

Ian Kuntz

Nov 23, 2010, 5:15:11 PM
to theano...@googlegroups.com
Hi all,

Sorry, but it still seems buggy.
Please find the logs in the attachment.

Tested with :
(theano+cuda)$ hg summary
parent: 4773:473e7f185d23 tip
 white space fix.
branch: default
commit: 2 unknown (clean)
update: (current)

And with DEBUG_GPU_CONTEXT_REFCOUNT enabled.

gpu-1.py is the first code snippet at http://deeplearning.net/software/theano/tutorial/using_gpu.html

Do you also see the refcount remaining at 2 in your setup?

Regards,

Ian
log.4773:473e7f185d23

Josh Bleecher Snyder

Nov 23, 2010, 8:02:58 PM
to theano...@googlegroups.com

Argh. Yes, I do.

However, I believe at this point the question to answer is why there
are more device_malloc calls being made than device_free calls. (I'm
loath to return to the solution of shutting down the gpu context
atexit and swallowing the errors, since it strikes me as masking
rather than fixing the root problem.) I've opened a ticket for this:
http://trac-hg.assembla.com/theano/ticket/591

-josh

Ian Kuntz

Nov 25, 2010, 5:00:44 PM
to theano...@googlegroups.com
> However, I believe at this point the question to answer is why there
> are more device_malloc calls being made than device_free calls. (I'm
> loath to return to the solution of shutting down the gpu context
> atexit and swallowing the errors, since it strikes me as masking
> rather than fixing the root problem.)

You are right about that.

Ian.

Frédéric Bastien

Nov 29, 2010, 9:56:57 AM
to theano...@googlegroups.com
Hi,

I don't think there is a memory leak. If I remember correctly, in Python there is
no guarantee that objects will be destroyed. As we call device_free
only when we destroy an object, if there are objects that never get destroyed,
there will be more calls to device_malloc than to device_free.

This also means that functions registered with atexit will be called while
not all objects are destroyed. So we probably need your previous fix: call
cudaThreadExit ourselves in an atexit function, and have all subsequent
device_free calls skip cudaFree.

So it seems we don't have a choice and must take a less well structured approach.

Josh, do you agree with this reasoning, given that Python objects are not
always destroyed?

Thanks

Fred

Josh Bleecher Snyder

Nov 29, 2010, 3:29:28 PM
to theano...@googlegroups.com
Hi Fred,

Sorry about the slow response (I just saw your ticket response from
some time ago); I just returned from Thanksgiving holidays.


> I don't think there is a memory leak. If I remember correctly, in Python there is
> no guarantee that objects will be destroyed. As we call device_free
> only when we destroy an object, if there are objects that never get destroyed,
> there will be more calls to device_malloc than to device_free.

Bummer. I didn't realize that.


> This also means that functions registered with atexit will be called while
> not all objects are destroyed. So we probably need your previous fix: call
> cudaThreadExit ourselves in an atexit function, and have all subsequent
> device_free calls skip cudaFree.
>
> So it seems we don't have a choice and must take a less well structured approach.

Yep, I think you're right (and were right all along!). It's a bummer,
but short of, say, using C++ destructors (which *are* guaranteed to get
called at process exit), which would involve major codebase upheaval
for a minor problem, I don't see any other way.

So...it looks like reverting ebbc8fab2c8f and ab866c953440 should do
the trick. Any objections? If not, I'll do that shortly.

-josh

Josh Bleecher Snyder

Nov 29, 2010, 4:16:38 PM
to theano...@googlegroups.com
> So...it looks like reverting ebbc8fab2c8f and ab866c953440 should do
> the trick. Any objections? If not, I'll do that shortly.

Backed out in d8d551435117 and 0ae23f7375cc. Hopefully that'll be the
end of this particular bug. :)

Frédéric Bastien

Dec 8, 2010, 1:26:52 PM
to theano...@googlegroups.com
Ian Kuntz, do you still have the problem since the last fix?

thanks

Fred

Ian Kuntz

Dec 8, 2010, 5:02:36 PM
to theano...@googlegroups.com
> Ian Kuntz, do you still have the problem since the last fix?

I will try by the end of the week.

Regards,

Ian

Ian Kuntz

Dec 9, 2010, 5:41:16 PM
to theano...@googlegroups.com
Dear all,

Everything - all three snippets from the GPU page of the tutorial - now works.
Thanks for that.

Regards,

Ian
