help with a memory leak


Ian Goodfellow

Feb 1, 2012, 4:46:00 PM
to thean...@googlegroups.com
Using the following script, I see two problems:
1) allocating W results in too much memory on the gpu getting lost
2) calling f results in still more memory getting lost

I haven't investigated issue #1 much. For issue #2, I know it's size
dependent. It doesn't happen if I shrink either dimension of W at all
(that's why the values in s are so weird). Issue #2 also goes away if
I set f to output updates.values() rather than updating grad.

Any ideas?
Thanks,
Ian


import numpy as np
from pylearn2.utils import sharedX
from theano import function
import theano
import gc

s = [399,219]
before = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
W = sharedX(np.zeros((s[0],s[1])))
gc.collect()
gc.collect()
gc.collect()
after = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
print before[0] - after[0]
print s[0]*s[1]*4

grad = sharedX(np.zeros(W.get_value().shape))

updates = { grad : W}

f = function([], updates = updates)

before = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
f()
gc.collect(); gc.collect(); gc.collect()
after = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
assert after[0] >= before[0]

James Bergstra

Feb 1, 2012, 5:05:34 PM
to thean...@googlegroups.com
You could try running with cuda-memcheck - it might catch an arithmetic bound error that only occurs with some argument shapes / strides.

Ian Goodfellow

Feb 1, 2012, 10:34:40 PM
to thean...@googlegroups.com
cuda-memcheck finds no errors

Ian Goodfellow

Feb 1, 2012, 10:47:24 PM
to thean...@googlegroups.com
Further information on issue #1:
It appears not to be very size dependent. About 1 megabyte of space is
wasted regardless of the size of the shared variable. The actual amount
fluctuates from just under 1 MB to around 1.6 MB, with no clear
relationship to the size of the shared value.
Interestingly, the second shared variable allocation does not waste any memory.


Ian Goodfellow

Feb 2, 2012, 1:19:54 PM
to thean...@googlegroups.com
Here's an updated version of the file that demonstrates that the
memory getting leaked is the original buffer allocated for grad. At
the end of the script grad points to a buffer allocated by the call to
f. Nothing is ever freed.

#script to demonstrate that theano leaks memory on the gpu

import numpy as np
from pylearn2.utils import sharedX
from theano import function
import theano
import gc

s = [400,8000]
print 'first shared'


before = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
W = sharedX(np.zeros((s[0],s[1])))
gc.collect()
gc.collect()
gc.collect()
after = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()

diff = before[0] - after[0]
expected_diff = s[0]*s[1]*4

if diff > expected_diff:
    print "W uses ",str(float(diff)/float(expected_diff))," times more memory than needed."
    print "(",str(float(diff-expected_diff)/(1024. ** 2))," megabytes)"

print 'second shared'
grad =sharedX(np.zeros(W.get_value().shape))
gc.collect()
gc.collect()
gc.collect()
after_after = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
# free memory before the allocation minus free memory after it, as for W above
diff = after[0] - after_after[0]

if diff > expected_diff:
    print "grad uses ",str(float(diff)/float(expected_diff))," times more memory than needed."


updates = { grad : W}


f = function([], updates = updates)


from theano.printing import debugprint
debugprint(f)

print 'call'

before = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()
f()
gc.collect(); gc.collect(); gc.collect()
after = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray.mem_info()

cuda_array = grad.get_value(borrow=True, return_internal_type = True)

addr = cuda_array.gpudata

print 'final storage address: %x' % addr


assert after[0] >= before[0]



Ian Goodfellow

Feb 2, 2012, 3:05:00 PM
to thean...@googlegroups.com
Here's my current summary of the problem.

A CudaNdarray which I will call A is allocated and used as the value
for grad. During the call to f, a second CudaNdarray which I will call
B is allocated and used as the updated value for grad.

If I make a reference to A in my script right after initializing grad,
I can then call sys.getrefcount and gc.get_referrers on it at the end
of the script. The expected result is that get_referrers should list
only locals() and getrefcount should return 1. The actual result is
that get_referrers returns only locals() as expected, but getrefcount
returns 3.

I have some questions about how to proceed now:
- Is it possible that if I crawl through all the sub-fields of f or of
  theano I will find some reference to A that was not found by
  get_referrers? Or does this necessarily indicate that we have failed
  to call a Python C API decref macro somewhere?
  I don't really understand the gc documentation on the subject of what
  get_referrers will find or not find: "This function will only locate
  those containers which support garbage collection; extension types
  which do refer to other objects but do not support garbage collection
  will not be found." What does it mean exactly to "support garbage
  collection"?
- If it is possible that I might find some referrers by crawling
  through all the sub-fields of f and/or theano, how exactly do I do
  this? Iterating through all the elements of dir(obj) and calling
  getattr(obj, field) will evaluate properties, so I get stuck following
  infinite loops of .T called on .T.
- Is there any good way of instrumenting the Python C API incref/decref
  macros so I can trace where they are getting called?
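
One way to crawl for hidden references without calling getattr (and so without evaluating properties such as .T) is to walk the object graph with gc.get_referents. The sketch below is illustrative only: find_reference_path, Holder, and the max_objects limit are hypothetical names, not part of Theano. It can only follow references that a type reports through its GC traversal hook, which is what the gc documentation means by "support garbage collection"; an extension type that holds references without implementing that hook is a dead end for this search as well.

```python
import gc
import types
from collections import deque


def find_reference_path(root, target, max_objects=100000):
    """Return a list [root, ..., target] of objects linking root to target.

    Only edges reported by gc.get_referents are followed, so no attribute
    lookup happens and properties are never evaluated. Returns None if the
    target is not reached within max_objects visited objects.
    """
    # Map id(obj) -> (obj, parent id). Holding obj keeps ids from being reused.
    visited = {id(root): (root, None)}
    queue = deque([root])
    while queue and len(visited) < max_objects:
        obj = queue.popleft()
        for child in gc.get_referents(obj):
            if child is target:
                # Walk the parent links back to the root.
                path = [target]
                key = id(obj)
                while key is not None:
                    node, key = visited[key]
                    path.append(node)
                path.reverse()
                return path
            # Skip classes and modules so the search stays near the root.
            if isinstance(child, (type, types.ModuleType)):
                continue
            if id(child) not in visited:
                visited[id(child)] = (child, id(obj))
                queue.append(child)
    return None


class Holder(object):
    """Hypothetical stand-in for an object that hides a reference."""

    def __init__(self, value):
        self.defaults = [(False, False, value)]

    @property
    def T(self):
        # Crawling with dir()/getattr would recurse through this forever.
        return Holder(None)


leaked = object()
holder = Holder(leaked)
path = find_reference_path(holder, leaked)
print([type(node).__name__ for node in path])
```

The printed type names show which containers sit between the root and the target; the exact intermediate steps depend on the interpreter version.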

James Bergstra

Feb 2, 2012, 3:28:57 PM
to thean...@googlegroups.com
Does the behaviour change in cvm vs. c|py linkers?  These different linkers would use different code paths to manage refcounts on updated shared vars.

Ian Goodfellow

Feb 2, 2012, 3:37:51 PM
to thean...@googlegroups.com
No, it ends with a ref count of 3 either way.

Ian Goodfellow

Feb 2, 2012, 5:03:19 PM
to thean...@googlegroups.com
I've found that by deleting f I can get it to end with a ref count of 2

Ian Goodfellow

Feb 2, 2012, 5:20:08 PM
to thean...@googlegroups.com
ok, I can get the ref count to drop to 2 either by deleting f.__dict__
or by deleting f.defaults. Shouldn't I have to delete both to get the
effect?

James Bergstra

Feb 2, 2012, 5:21:32 PM
to thean...@googlegroups.com
Each one holds a ref, so each deletion should drop the count by 1.

Ian Goodfellow

Feb 2, 2012, 5:31:24 PM
to thean...@googlegroups.com
Deleting both only results in a drop of 1. Also, weirdly, if I delete
__dict__ first I get an exception saying I can't delete defaults
because it is read only.

What exactly is defaults for? In this case it is a list containing two
tuples. Each tuple contains (False, False, CudaNdarray). One of the
CudaNdarrays is the one that is getting leaked.

James Bergstra

Feb 2, 2012, 5:35:52 PM
to thean...@googlegroups.com
That's getting into the default argument mechanism that was invented long ago... I'm not sure if the shared-variable calling protocol is using that mechanism or bypassing it. It's tricky code to pick through in function_module, and I just have to go back and pick through it again every time I want to [briefly] understand it.

Ian Goodfellow

Feb 2, 2012, 6:31:35 PM
to thean...@googlegroups.com
It appears that most of this code has no concept of shared variables
(some of the comments refer to shared variables being in a different
repository); it thinks that either an input must have its value passed
in by the caller or it must have a default value 'refed' by the
function itself. In the case of shared variables, this means that
their original value gets stored for no reason.
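
The effect described above can be reproduced with a toy model in plain Python. This is only a sketch of the mechanism, not Theano's FunctionMaker code: Buffer, ToyShared, and ToyFunction are hypothetical stand-ins, and the layout of defaults simply copies the (False, False, value) tuples reported earlier in the thread.

```python
import gc
import weakref


class Buffer(object):
    """Hypothetical stand-in for a CudaNdarray."""

    def __init__(self, name):
        self.name = name


class ToyShared(object):
    """Hypothetical shared variable: a one-slot container for its value."""

    def __init__(self, value):
        self.container = [value]


class ToyFunction(object):
    """Hypothetical compiled function that updates one shared variable."""

    def __init__(self, shared, compute):
        self.shared = shared
        self.compute = compute
        # Assumed behaviour under discussion: the value the shared variable
        # held at compile time is also stored as a "default".
        self.defaults = [(False, False, shared.container[0])]

    def __call__(self):
        # The update replaces the container's value with a new buffer.
        self.shared.container[0] = self.compute()


grad = ToyShared(Buffer("A"))
a_ref = weakref.ref(grad.container[0])  # watch A without keeping it alive

f = ToyFunction(grad, lambda: Buffer("B"))
f()
gc.collect()
alive_after_call = a_ref() is not None  # A is still held by f.defaults

f.defaults = []
gc.collect()
alive_after_clear = a_ref() is not None  # nothing refers to A any more

print(grad.container[0].name, alive_after_call, alive_after_clear)
```

In this model the shared variable already points at B after the call, yet A stays alive until the function's defaults are cleared.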

David Warde-Farley

Feb 2, 2012, 6:37:31 PM
to thean...@googlegroups.com
On Thu, Feb 02, 2012 at 05:35:52PM -0500, James Bergstra wrote:
> That's getting into the default argument mechanism that was invented long
> ago... I'm not sure if the shared-variable calling protocol is using that
> mechanism or bypassing it. It's tricky code to pick through in
> function_module, and I just have to go back and pick through it again every
> time I want to [briefly] understand it.

Such strolls down memory lane are ripe opportunities for writing comments...
*ahem* :P

David

Ian Goodfellow

Feb 2, 2012, 6:46:58 PM
to thean...@googlegroups.com
OK, I've attempted to remedy the situation by modifying compile.io.In
to have a 'shared' flag, and changing pfunc to set shared inputs to
have value=None and shared=True.

The shared flag is necessary because, previously, any input with
value=None was flagged as required to be passed in by the client.

Unfortunately, my test script now only executes successfully if I run
it in 'FAST_COMPILE' mode.

If I run it in any other mode, it has an error inside run_cthunk so I
can't tell where the error actually originates.

If I run it on CPU, I get the error:
ValueError: ('expected an ndarray, not None', <TensorType(float32, matrix)>)
which makes me think some piece of code somewhere is trying to read
the value out of defaults, or the In.value field.

If I run it on GPU, I get the more inexplicable:
TypeError: ('Argument not a CudaNdarray', <CudaNdarrayType(float32, matrix)>)


Can anyone give me a high-level explanation of what is going on?
It seems to me that shared variables should get initialized once, when
you initialize them, and that the call to the theano function
shouldn't be trying to read their initial value. Moreover, I can't
think of any reason why different modes should read the initial value
from different locations.

Also, if anyone can give me any tips for figuring out the actual code
location of the failure inside run_cthunk that would be very helpful.

James Bergstra

Feb 2, 2012, 7:18:57 PM
to thean...@googlegroups.com
On Thu, Feb 2, 2012 at 6:46 PM, Ian Goodfellow <goodfel...@gmail.com> wrote:
> Can anyone give me a high-level explanation of what is going on?
> It seems to me that shared variables should get initialized once, when
> you initialize them, and that the call to the theano function
> shouldn't be trying to read their initial value. Moreover, I can't
> think of any reason why different modes should read the initial value
> from different locations.


You might try to simplify the situation by cutting Function totally out of the story.

By this point here:

everything you need to run your computations has been set up, and you can start diagnosing this memory leak. If you've set mode=Mode, and linker='VM', then that call to make_thunk on line 1118 of function_module will go here after one intermediate relay:


When make_thunk returns, the `_fn` variable is the VM you just allocated, so you can inspect its internals.  You'll be most interested in the vm.storage_map, which is a dictionary that maps variable -> [value] for variables in the post-optimized Env.

You can do function calls "by hand" by inserting values for inputs straight into the storage_map, calling the vm (with no arguments), and then reading output values back out of the storage map. There is no such thing as shared vs. non-shared variables at this level; a shared-variable update simply means reading an output value out of the storage map, assigning None to the storage where it came from, and replacing some input storage container with that value. If you're curious what calling a vm does, then you can look at the __call__ implementations in vm.py.

The reason to do function calls by hand in this way is to isolate the source of the memory leak: if it still happens when you do things this way, then you can forget about compile.io.In, and all of Function for that matter, and focus on the parts of the process that are left. Conversely, if the memory leak disappears, then there is something wrong with the parts you just short-circuited.

Either way, it's sometimes good to think about running compiled functions in this low-level way because you can do some neat things. For example, (a) it's always a faster calling strategy, and (b) by manipulating the compute_map (similar to storage_map, but it contains a bool for each variable -- True means it's valid, False means it's invalid/dirty) you can implement partial function computation. It's also a simpler lower-level interface that can help debug things, like this case in point.

- James
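
The by-hand calling protocol described in the message above can be mimicked in plain Python. This is a toy model only: the dictionary layout follows the description (variable -> one-element list), but the string variable names, run_thunk, and call_by_hand are hypothetical, and none of this is Theano's VM code.

```python
# Each variable maps to a one-element list holding its current value.
storage_map = {"W": [[1.0, 2.0]], "grad": [[0.0, 0.0]], "grad_new": [None]}
# True means the stored value is valid; False means it must be recomputed.
compute_map = {"W": [True], "grad": [True], "grad_new": [False]}


def run_thunk():
    """Hypothetical thunk: compute grad_new as a fresh copy of W."""
    storage_map["grad_new"][0] = list(storage_map["W"][0])
    compute_map["grad_new"][0] = True


def call_by_hand():
    """Run the computation, then apply the update grad <- grad_new."""
    if not compute_map["grad_new"][0]:
        run_thunk()
    new_value = storage_map["grad_new"][0]  # read the output
    storage_map["grad_new"][0] = None       # clear the storage it came from
    compute_map["grad_new"][0] = False
    storage_map["grad"][0] = new_value      # replace the input storage
    return new_value


old_grad = storage_map["grad"][0]  # the buffer the update will displace
call_by_hand()
print(storage_map["grad"][0], storage_map["grad_new"][0])
```

After the call, the only remaining reference to the displaced buffer is old_grad, which is the situation in which the original buffer would be expected to be freed once that name goes away.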

Ian Goodfellow

Feb 2, 2012, 10:32:37 PM
to thean...@googlegroups.com
I admit I don't fully understand your suggestion but I think you're
misunderstanding my current problem. At the moment I am not trying to
find a memory leak. I have found the memory leak and am trying to fix
it.

The memory leak was caused by FunctionMaker always saving the initial
value of all shared variables in the function's defaults field. I need
to remove this redundant storage, but it appears that some linkers
depend on it in ways that are unclear for reasons that are unclear.

It is true that there is an additional memory leak which I may be able
to find using your suggestion, but I want to fix this one before I
look for the second one.

Ian Goodfellow

Feb 3, 2012, 9:31:17 AM
to thean...@googlegroups.com
I changed my solution to have In.value store the value of the shared
variable and made function.defaults not store this value. That seems
to work now. I'm running the tests.


Ian Goodfellow

Feb 3, 2012, 10:34:08 AM
to thean...@googlegroups.com
I am now hunting down the second leak. It turns out not to require
making any functions. It is just plain a memory leak in the shared
variable functionality. The following script prints out 2. It should
print out 1.

from theano import shared
import numpy as np
import sys

shape = (400,8000)

x = shared(np.zeros(shape))
val = x.get_value(borrow=True,return_internal_type=True)
del x

print sys.getrefcount(val)

Ian Goodfellow

Feb 3, 2012, 10:52:42 AM
to thean...@googlegroups.com
Never mind. After discussing this with Fred and Guillaume, I've decided
there's only one memory leak. sys.getrefcount reports one more than
you'd expect because the object passed to it is itself a temporary
reference, so a result of 2 here means a single real reference.
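
A quick way to check this behaviour is to compare counts rather than read them as absolute numbers. The snippet below is a minimal sketch using only the standard library; the names val and holder are illustrative, and the exact baseline value can vary between interpreter versions.

```python
import sys

val = object()
# The result generally includes the temporary reference created by passing
# val as an argument, so treat it as a baseline rather than a true count.
baseline = sys.getrefcount(val)

holder = [val]                      # add one real reference
with_holder = sys.getrefcount(val)

del holder                          # drop it again
after_del = sys.getrefcount(val)

print(baseline, with_holder - baseline, after_del - baseline)
```

Comparing against a baseline taken the same way cancels out the extra reference, which makes a genuine leak easier to tell apart from this bookkeeping.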

