About memory usage of GpuContiguous and cached memory allocation

Wong Hang

Jan 23, 2019, 4:24:23 AM
to theano-dev

I have two questions here:


(1) When I profile the memory usage of my Theano program, I find that most of the memory is consumed by GpuContiguous:

     183169800B  [(4785, 4785)] v GpuContiguous(GpuContiguous.0)
     183169800B  [(4785, 4785)] v GpuContiguous(GpuElemwise{Composite{((i0 * exp((i1 * i2))) + (i3 * i4))}}[(0, 2)]<gpuarray>.0)
     183169800B  [(4785, 4785)] i GpuCholesky{lower=True, inplace=True}(GpuContiguous.0)
     183169800B  [(4785, 4785)] v GpuContiguous(GpuCholesky{lower=True, inplace=True}.0)

It looks like Theano caches the output memory of each operator, and each GpuContiguous keeps its own copy of my input. Am I correct?

Does Theano's optimizer NOT remove an extra GpuContiguous even when it can prove that the input is contiguous?

If I am sure that the inputs to my own Op are C-contiguous, can I skip gpu_contiguous in make_node():

        def make_node(self, X1, X2):
            ctx = infer_context_name(X1, X2)
            X1 = as_gpuarray_variable(X1, ctx)
            X2 = as_gpuarray_variable(X2, ctx)
            # X1 = gpu_contiguous(X1)
            # X2 = gpu_contiguous(X2)

and then simply add a check in c_code(...):

if (!GpuArray_IS_C_CONTIGUOUS(&%(X1)s->ga)) {
  PyErr_Format(PyExc_RuntimeError, "X1 must be C contiguous");
  %(fail)s;
}
if (!GpuArray_IS_C_CONTIGUOUS(&%(X2)s->ga)) {
  PyErr_Format(PyExc_RuntimeError, "X2 must be C contiguous");
  %(fail)s;
}


(2) Is there any way to free all the memory cached in a computational graph?

My code compiles the same combination of Ops repeatedly, but with different dimensions, say 4000x4000, 3000x3000, 1234x1234.
It looks (I am not sure) like whenever I compile and run the computational graph for a different dimension, the buffers for 4000x4000, 3000x3000, 1234x1234 (created by many GPU ops) all stay cached and ultimately consume all of my memory.
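One thing that may help here (a sketch, not something I have verified on this exact setup): Theano's compiled Function objects expose a free() method that drops the intermediate storage held inside the function, and deleting the function plus forcing garbage collection should release the rest. The shapes below are just the examples from above.

```python
import gc

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
f = theano.function([x], T.dot(x, x.T))

for n in (4000, 3000, 1234):
    f(np.ones((n, n), dtype=theano.config.floatX))
    f.free()  # drop the intermediate buffers cached inside the function

del f         # drop the compiled function itself ...
gc.collect()  # ... and make sure its buffers are actually released
```

Whether the GPU allocator then returns the memory to the driver depends on the gpuarray allocator settings, so this only bounds the per-function caching.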

Frédéric Bastien

Jan 23, 2019, 9:12:15 AM
to thean...@googlegroups.com
You missed one important column: the one that prints just "c", "i", or "v".

The c means real memory was allocated.
The i means the op is tagged inplace, so there is a very high probability that the output is just a pointer to the input, whose value we may have changed.
The v means view: the output is a pointer to the full input or a subset of it, without any modification.

GpuContiguous does nothing most of the time: if the input is already contiguous, it just returns a pointer to the input.
But it is needed, because the following ops only support inputs that are contiguous (most of the time these are cuDNN ops). Doing that conversion inside the op itself is possible (and some ops do it), but it has other inconveniences.
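This view-versus-copy behaviour can be illustrated with NumPy's analogous np.ascontiguousarray (an analogy only, not Theano's actual GPU implementation): an already C-contiguous input comes back unchanged, and only a non-contiguous input triggers a real copy.

```python
import numpy as np

a = np.ones((4, 4))          # C-contiguous from the start
b = np.ascontiguousarray(a)  # no copy: the very same array comes back
print(b is a)                # True

t = a.T                      # transpose is a non-contiguous view
c = np.ascontiguousarray(t)  # here a real, newly allocated copy is made
print(c is t)                # False
print(c.flags['C_CONTIGUOUS'])  # True
```

In the same way, a GpuContiguous whose input is already contiguous shows up in the profiler with a "v" and costs no extra memory.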

So in reality, they probably do not take any memory at all.
Analyzing memory usage is hard. Sadly, I wasn't able to make the memory profiler's output as nice as I wanted, but at least the information is there.




Wong Hang

Jan 23, 2019, 9:46:38 AM
to thean...@googlegroups.com
Thanks, your explanation is very clear.

Frédéric Bastien <frederic...@gmail.com> wrote on Wed, Jan 23, 2019 at 10:12 PM: