[ANN] GPU speed up merged in development version (need manual activation for now)


Frédéric Bastien

23 Jul 2015, 14:29:16
to theano-users, theano-dev
Hi,

this is just to tell you that we just merged into Theano a new GPU memory allocator (CNMeM) recently released by NVIDIA.

If you don't change the Theano flag allow_gc, you can expect a 20% speed up on the GPU. In some cases (small models), we saw a 50% speed up.

If you set allow_gc=False, the speed up will be much smaller, but there is still one.

To use it:
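(The flag values were omitted from the archived post; based on the syntax used later in this thread, enabling it looks like the following, where lib.cnmem is the share of GPU memory to reserve up front, 0 disables it, and your_script.py is a placeholder:)

```
# as an environment variable:
THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.45 python your_script.py

# or in ~/.theanorc:
[lib]
cnmem = 0.45
```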

Happy speed up!

Fred

p.s. We will enable it by default in 1 or 2 weeks if we don't get reports of problems. We aren't sure what default % of GPU memory to allocate. We thought of using 45% by default. What do you think? This would allow 2 jobs per GPU by default (it needs to leave some memory for the driver).
p.p.s. We didn't test it on Mac and Windows, but we think it should work. What we did here isn't different from other libs, so this isn't new there. But if you try it there, please report your results.

Jeffrey De Fauw

24 Jul 2015, 04:11:08
to theano-users, thean...@googlegroups.com, frederic...@gmail.com
When I use 0.45 or 0.87 (which is the max free memory I still have, since the GPU also drives my displays) I run out of memory (CNMEM_STATUS_OUT_OF_MEMORY). Normally I run with gc enabled and it runs well (I can't run with gc disabled). This is on Linux with a GTX 980. At its peak it normally occupies about 3400 MB while running.

I can give some more specific information in a few days but thought it might already be useful to report.

Frédéric Bastien

24 Jul 2015, 09:16:14
to Jeffrey De Fauw, theano-users, theano-dev
If you run the exact same program with the same other programs running (so the same amount of available memory on the GPU), but without cnmem, does it work?

The memory specified by the flag is the starting pool for cnmem. If the program needs more memory than that, cnmem will try to grab it. But it won't be handled exactly the same, as it won't be one big contiguous block. So starting with the right amount of memory could help avoid fragmentation of GPU memory, I think. But otherwise, it should be the same speed.
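The pool behavior Fred describes can be sketched in a few lines (an illustrative toy, not the real CNMeM code; all names here are hypothetical): requests that fit in the initial reservation come from one contiguous block, while overflow requests land in separate blocks, which is where fragmentation can come from.

```python
# Toy pool allocator: one big block reserved up front, with fallback
# grabs when the pool runs out (illustration only, not CNMeM itself).

class PoolAllocator:
    def __init__(self, initial_mb):
        self.pool_free = initial_mb   # one contiguous starting block
        self.extra_blocks = []        # later grabs: separate, not contiguous

    def alloc(self, mb):
        if mb <= self.pool_free:
            self.pool_free -= mb      # served from the big block
            return "pool"
        self.extra_blocks.append(mb)  # pool exhausted: grab a new block
        return "extra"

# Starting with too little memory pushes later requests into
# separate blocks:
a = PoolAllocator(initial_mb=1000)
print(a.alloc(800))   # served from the initial pool
print(a.alloc(500))   # does not fit the remainder -> separate block
```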

Fred

Jeffrey De Fauw

24 Jul 2015, 09:32:17
to Frédéric Bastien, theano-dev, theano-users

The situation was: normally I ran with gc enabled, and I decided to pull the latest master to test the new improvements. I kept gc on and enabled cnmem at 0.45 (since you mentioned you might use that as a default), which gave the cnmem out-of-memory error. The same happened when I set it to 0.87 (the max free memory available on my GPU). When I disabled cnmem it worked again. It never worked without gc, so I kept gc on as the default. Can't really comment on speed ups at the moment.

Hopefully that helps a bit. I thought I would quickly try it out, and since your proposed 0.45 failed for me, I thought I should mention it. I might try to put together a minimal example next week if needed.

Frédéric Bastien

24 Jul 2015, 09:33:48
to Jeffr...@gmail.com, theano-dev, theano-users
Can you try with cnmem and without allow_gc?

thanks

Fred

Pascal Lamblin

24 Jul 2015, 11:41:00
to Frédéric Bastien, Jeffr...@gmail.com, theano-dev, theano-users
On Fri, Jul 24, 2015, Frédéric Bastien wrote:
> Can you try with cnmem and without allow_gc?

I'm pretty sure that is what he tried.

--
Pascal

Jeffrey De Fauw

24 Jul 2015, 11:47:55
to Pascal Lamblin, theano-dev, Frédéric Bastien, theano-users

No, I have put gc on by default because my recent experiments didn't run without it. I have not tested trying to disable it recently. I should be able to try it later tonight and I will report back.

Julien Demouth

24 Jul 2015, 15:05:51
to theano-users, lamb...@iro.umontreal.ca, thean...@googlegroups.com, frederic...@gmail.com, Jeffr...@gmail.com, Jeff...@gmail.com
Hi Jeffrey,

If you still see the issue with allow_gc=False, I'm interested in getting a repro case. If that's out of the question, I can explain how to generate a text trace that I could study, to identify whether something could be done to make the library more robust. I wrote cnmem, and I would like to collect cases to help me improve the internal allocation strategy.

Thanks,
Julien

Jeffrey De Fauw

24 Jul 2015, 15:08:04
to theano-dev, Jeffr...@gmail.com, theano...@googlegroups.com, frederic...@gmail.com, julien....@gmail.com
Hi all,

I just tested it quickly. It is indeed the case that turning gc off (with cnmem disabled) gives the same out-of-memory error. I also tried turning gc off with cnmem at 0.45 and 0.87, but still no luck.

I want to release some code for https://www.kaggle.com/c/diabetic-retinopathy-detection soon, and I was a little panicky that it might suddenly no longer run on a 4 GB card with the latest Theano dev (if cnmem were enabled by default). I actually had bigger models than this one which also ran well with gc.

I'll get back to this in a few days when I have some more time to test it and I can give you access to a repo. :-) Thanks for the quick responses!

Best,
Jeffrey


On Friday, 24 July 2015 19:54:24 UTC+1, Julien Demouth wrote:
Hi Jeffrey,

If you still see the problem with cnmem and you can share a repro with me, I'm interested (I wrote CNMEM and helped Frédéric integrate it into Theano). I can tweak the internal policy of cnmem to deal with "hard" cases (assuming your case is hard). We also have the freedom to add new strategies and let the user choose the best strategy to claim/reclaim memory.

Thanks,
Julien

Frédéric Bastien

26 Jul 2015, 23:09:20
to Jeffrey De Fauw, theano-dev, theano-users, julien....@gmail.com
Hi,

just to be sure: by default, allow_gc is True. When I wrote not to set allow_gc, I meant keep the default. I think you understood it as setting it to False.

Can you confirm that everything I wrote below is true?

allow_gc=True, lib.cnmem=False (default): it works
allow_gc=True, lib.cnmem=True: it works
allow_gc=False, lib.cnmem=True: it crashes
allow_gc=False, lib.cnmem=False: it crashes

If so, that is what I expect. Mostly, lib.cnmem won't change how much memory Theano uses (so it shouldn't introduce or remove out-of-memory errors). But it should bring most of the allow_gc=False speed up to Theano!

You can still get a very small additional speed up by using allow_gc=False with lib.cnmem=True, but it would be minimal on the GPU.

allow_gc=False can still give a speed up on the CPU.
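(For reference, the four combinations above expressed as Theano flags, using the same syntax that appears elsewhere in this thread:)

```
THEANO_FLAGS=allow_gc=True,lib.cnmem=0      # default: gc on, cnmem off
THEANO_FLAGS=allow_gc=True,lib.cnmem=0.45   # gc on,  cnmem on
THEANO_FLAGS=allow_gc=False,lib.cnmem=0.45  # gc off, cnmem on
THEANO_FLAGS=allow_gc=False,lib.cnmem=0     # gc off, cnmem off
```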

thanks

Fred

Jeffrey De Fauw

26 Jul 2015, 23:15:12
to Frédéric Bastien, Julien Demouth, theano-dev, theano-users

Hi Frédéric,

No, only with allow_gc True and cnmem False (=0) does it work. Hence, part of my concern.

Jeffrey

Daniel Renshaw

27 Jul 2015, 04:56:28
to theano...@googlegroups.com
I've just updated to the latest Theano GitHub version and now I can't import theano without getting a compilation failure (see below for details). Everything worked fine until I updated to the latest version, and the problem seems to be with the recent cnmem update.

Git commit: 354d7b576357c11b41565086437ef972cb8b4768 [354d7b5] (Merge pull request #2629 from abergeron/fix_merge_opts Don't apply alpha_merge and output_merge when the proc node has more than one client.); no local changes.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

GPU: GeForce GT 640

python
Python 2.7.9 (default, Feb  4 2015, 08:18:43)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2

The problem is the same for all of the following configurations:

THEANO_FLAGS="device=gpu,openmp=True,floatX=float32,nvcc.fastmath=True"
THEANO_FLAGS="device=gpu,openmp=True,floatX=float32,nvcc.fastmath=True,allow_gc=True,lib.cnmem=0"
THEANO_FLAGS="device=gpu,openmp=True,floatX=float32,nvcc.fastmath=True,allow_gc=False,lib.cnmem=0.5"

The commit that causes the problem is ca465be02404c2b697dd75e1253d069cdd99a13b [ca465be] (Merge pull request #3198 from nouiz/cumem3 Add CNMeM in Theano to speed up CUDA allocation.)

If I go back to the commit prior to that, 2ddaca0654abdf224d1945c9b89b5e362a57f464 [2ddaca0] (Merge pull request #3117 from ChienliMa/infer_shape OpFromGraph.infer_shape()), then everything works fine again.

Is there anything I can do to fix the problem?

Daniel


In file included from /usr/include/python2.7/Python.h:8,
                 from mod.cu:3:
/usr/include/python2.7/pyconfig.h:1182:1: warning: "_POSIX_C_SOURCE" redefined
In file included from /opt/cuda-5.0.35/bin/../include/host_config.h:114,
                 from /opt/cuda-5.0.35/bin/../include/cuda_runtime.h:59,
                 from <command-line>:0:
/usr/include/features.h:162:1: warning: this is the location of the previous definition
In file included from /usr/include/python2.7/Python.h:8,
                 from mod.cu:3:
/usr/include/python2.7/pyconfig.h:1204:1: warning: "_XOPEN_SOURCE" redefined
In file included from /opt/cuda-5.0.35/bin/../include/host_config.h:114,
                 from /opt/cuda-5.0.35/bin/../include/cuda_runtime.h:59,
                 from <command-line>:0:
/usr/include/features.h:164:1: warning: this is the location of the previous definition
<snip_home_dir>/source/theano/theano/sandbox/cuda/cnmem.cpp(386): error: identifier "cudaStreamGetFlags" is undefined
mod.cu(938): warning: pointless comparison of unsigned integer with zero
1 error detected in the compilation of "/tmp/tmpxft_0000a408_00000000-6_mod.cpp1.ii".

['nvcc', '-shared', '-O3', '-use_fast_math', '-m64', '-Xcompiler', '-DCUDA_NDARRAY_CUH=5355c4a61aea6cdfb944298e802983a7,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden', '-Xlinker', '-rpath,<snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray', '-I<snip_home_dir>/source/theano/theano/sandbox/cuda', '-I<snip_home_dir>/python/vetinari/lib/python2.7/site-packages/numpy/core/include', '-I/usr/include/python2.7', '-I<snip_home_dir>/source/theano/theano/gof', '-o', '<snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray/cuda_ndarray.so', 'mod.cu', '-L/usr/lib', '-lpython2.7', '-lcublas', '-lcudart']
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: ('nvcc return status', 2, 'for cmd', 'nvcc -shared -O3 -use_fast_math -m64 -Xcompiler -DCUDA_NDARRAY_CUH=5355c4a61aea6cdfb944298e802983a7,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,<snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray -I<snip_home_dir>/source/theano/theano/sandbox/cuda -I<snip_home_dir>/python/vetinari/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -I<snip_home_dir>/source/theano/theano/gof -o <snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray/cuda_ndarray.so mod.cu -L/usr/lib -lpython2.7 -lcublas -lcudart')
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: cuda unavilable)



Frédéric Bastien

27 Jul 2015, 12:57:27
to theano-users
It seems that this update needs CUDA 5.5 or higher. Can you update CUDA?

Should we simply require a more recent CUDA version, or do we need to support CUDA 5.0?

Fred

Frédéric Bastien

27 Jul 2015, 12:59:08
to theano-users
I think I found another workaround.

Can you go into the file sandbox/cuda/cuda_ndarray.cu and add this at the top:

#define CUDA_API_PER_THREAD_DEFAULT_STREAM

Tell me if this works. If so, that would be the right fix.

thanks

Fred

Daniel Renshaw

27 Jul 2015, 13:20:54
to theano...@googlegroups.com
Thanks for looking into this, Fred. Unfortunately I won't be in a position to test your suggestion until tomorrow. I'll get back to you then.

P.S. Upgrading CUDA may be possible, but it won't be quick because I don't manage this server myself; hopefully the #define change will work around the problem.

Frédéric Bastien

27 Jul 2015, 14:03:57
to theano-users
I think this diff is better:


diff --git a/theano/sandbox/cuda/cnmem.cpp b/theano/sandbox/cuda/cnmem.cpp
index 4a081cf..8e6d999 100644
--- a/theano/sandbox/cuda/cnmem.cpp
+++ b/theano/sandbox/cuda/cnmem.cpp
@@ -380,6 +380,8 @@ public:
     inline cnmemStatus_t setStream(cudaStream_t stream) {
         mStream = stream;
 #ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
+
+#if defined(CUDA_API_PER_THREAD_DEFAULT_STREAM) || (CUDART_VERSION < 5050)
         mIsStreamBlocking = false;
 #else
         unsigned flags = 0;

Daniel Renshaw

28 Jul 2015, 04:52:21
to theano...@googlegroups.com
I had to alter the diff very slightly, but I can now import theano using the most up-to-date revision.

diff --git a/theano/sandbox/cuda/cnmem.cpp b/theano/sandbox/cuda/cnmem.cpp
index 4a081cf..373bc67 100644
--- a/theano/sandbox/cuda/cnmem.cpp
+++ b/theano/sandbox/cuda/cnmem.cpp
@@ -379,7 +379,7 @@ public:
     /// Define the stream.
     inline cnmemStatus_t setStream(cudaStream_t stream) {
         mStream = stream;
-#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
+#if defined(CUDA_API_PER_THREAD_DEFAULT_STREAM) || (CUDART_VERSION < 5050)
         mIsStreamBlocking = false;
 #else
         unsigned flags = 0;

Note the removal of the original #ifdef line, replaced by the new #if line.
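The corrected condition can be mirrored in Python (my sketch, not the real build logic): the fallback branch is taken either when per-thread default streams are requested, or when the toolkit is too old to have cudaStreamGetFlags. CUDART_VERSION encodes CUDA x.y as x*1000 + y*10, so CUDA 5.5 is 5050.

```python
# Python mirror of the corrected preprocessor condition (illustration).
# cudaStreamGetFlags() only exists from CUDA 5.5 (CUDART_VERSION 5050),
# so older toolkits must skip the query and use a hard-coded default.

def takes_fallback_branch(cudart_version, per_thread_default_stream=False):
    return per_thread_default_stream or cudart_version < 5050

print(takes_fallback_branch(5000))        # CUDA 5.0: no query available -> True
print(takes_fallback_branch(7050))        # CUDA 7.5: can query the stream -> False
print(takes_fallback_branch(7050, True))  # per-thread streams requested -> True
```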

Are there any significant performance differences between CUDA versions? Should I be trying to upgrade CUDA anyway?

Daniel

Frédéric Bastien

28 Jul 2015, 09:22:14
to theano-users
Thanks. I merged a PR from Julien that does the right fix: mIsStreamBlocking should be true for old CUDA, not false. So be sure to update Theano.

thanks

Fred

Frédéric Bastien

11 Aug 2015, 02:50:44
to Jeffrey De Fauw, Julien Demouth, theano-dev, theano-users
Jeffrey, do you have time to put together a way to run this code? If that is hard, Julien can tell you how to generate a trace of the memory allocations. He just needs that to investigate what is happening.

thanks

Frédéric

Jeffrey De Fauw

11 Aug 2015, 02:57:30
to Frédéric Bastien, theano-users, Julien Demouth, theano-dev

Hi Frédéric,

I already emailed Julien a few weeks ago (soon after I posted here) with a minimal example replicating the issue. I haven't heard back, but he might be on holiday. I can post it here as well if you're interested.

Best,
Jeffrey

Francesco Visin

11 Aug 2015, 14:54:49
to theano-users, thean...@googlegroups.com
I am also getting this error:
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device 0 failed:  
initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1

with gc.allow either True or False and cnmem enabled:
[lib]
cnmem = 1

I get no error if I disable cnmem instead.
My code is really convoluted though, so I am not sure I can create a minimal example.
Please let me know if I can do something else to help.

Frédéric Bastien

1 Sept 2015, 17:44:25
to theano-users, theano-dev

Julien has an example that causes this type of behavior. If you can keep a way to test it on your side, please do. It would be good if you could test again when Julien updates cnmem.

Fred


Doug

3 Sept 2015, 14:11:59
to theano-users, thean...@googlegroups.com
I just tested enabling cnmem on Windows 7 x64, and it seems to work as intended. I got a pretty good speed boost on the script I tested with (a convolutional ladder network), so I'm a fan.

allow_gc = True
cnmem = 0

Time in 100 calls to Function.__call__: 5.962800e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  40.0%    40.0%      21.753s       3.27e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  19.4%    59.4%      10.540s       6.43e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.6%    70.0%       5.775s       3.04e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   9.1%    79.1%       4.956s       2.75e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   8.5%    87.6%       4.613s       2.88e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   4.3%    91.9%       2.317s       2.41e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   
   
   
allow_gc = False
cnmem = 0  

Time in 100 calls to Function.__call__: 6.032000e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.3%    38.3%      22.981s       3.45e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  20.8%    59.0%      12.477s       7.61e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.6%    69.6%       6.343s       3.34e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   9.0%    78.6%       5.429s       3.02e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   8.4%    87.0%       5.052s       3.16e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   6.0%    93.0%       3.602s       3.75e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor   
   
   
   
allow_gc = True
cnmem = .75

Time in 100 calls to Function.__call__: 3.973400e+01s   
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  29.8%    29.8%      11.653s       1.75e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  25.0%    54.8%       9.759s       5.95e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  12.4%    67.1%       4.843s       2.69e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
  11.6%    78.7%       4.517s       2.82e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   8.9%    87.6%       3.488s       1.84e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   5.3%    92.9%       2.067s       2.15e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor   
   
   
   
allow_gc = False
cnmem = .75   

Time in 100 calls to Function.__call__: 3.858000e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  29.3%    29.3%      11.271s       1.69e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  25.0%    54.3%       9.611s       5.86e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  12.6%    66.9%       4.823s       2.68e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
  11.7%    78.6%       4.510s       2.82e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   9.0%    87.6%       3.458s       1.82e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   5.4%    93.0%       2.062s       2.15e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
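As a quick sanity check on Doug's numbers (my arithmetic, not from the thread): enabling cnmem=.75 versus the cnmem=0 baseline cuts wall time by roughly a third, i.e. about a 1.5x throughput improvement.

```python
# Relative improvement from the "Time in 100 calls" figures above.
baseline_s = 59.628   # allow_gc = True, cnmem = 0
cnmem_s    = 39.734   # allow_gc = True, cnmem = .75

reduction  = (baseline_s - cnmem_s) / baseline_s   # fraction of time saved
throughput = baseline_s / cnmem_s                  # relative speed-up factor
print(f"{reduction:.1%} less time, {throughput:.2f}x throughput")
```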

goo...@jan-schlueter.de

16 Sept 2015, 13:35:32
to theano-users
with gc.allow either True or False and cnmem enabled:
[lib]
cnmem = 1

If you set cnmem to 1, it will ask CNMeM to initially use 100% of GPU memory, which is impossible (even an unused K40c already has something like 23 MiB in use). You should either set it to a fraction (such as 0.5) or a number of megabytes (such as 500).
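That two-form interpretation can be sketched as follows (the exact parsing rules are Theano's; this only illustrates "fraction vs. megabytes"):

```python
# Illustrative only: how a cnmem-style setting could be interpreted.
def initial_pool_mib(cnmem, total_mib):
    if 0 < cnmem < 1:                 # a fraction of total GPU memory
        return int(cnmem * total_mib)
    return int(cnmem)                 # an absolute size in MiB

print(initial_pool_mib(0.5, 4096))    # fraction of a 4 GiB card -> 2048
print(initial_pool_mib(500, 4096))    # absolute size -> 500
```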


p.s. We will enable it by default in 1 or 2 weeks if we don't get reports of problems. We aren't sure what default % of GPU memory to allocate. We thought of using 45% by default. What do you think? This would allow 2 jobs per GPU by default (it needs to leave some memory for the driver).

Sounds like a plausible default, but maybe you should set an upper limit as well? Users may be confused if even their simplest models already take up close to 6 GiB on their Tesla or Titan X. Maybe use cnmem=1 for a default heuristic: 45%, but at most 2 GiB? (cnmem=1 is not a useful value otherwise, as seen above.)
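Jan's proposed heuristic as arithmetic (the card sizes are my examples, not from the thread):

```python
# 45% of the card, but at most 2 GiB.
def default_pool_mib(total_mib, fraction=0.45, cap_mib=2048):
    return min(int(total_mib * fraction), cap_mib)

print(default_pool_mib(4096))    # 4 GiB card -> 1843 (45% applies)
print(default_pool_mib(12288))   # 12 GiB card -> 2048 (cap applies)
```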

By the way, similar to Doug, I also observe a significant speedup even if combined with allow_gc=False. Good job!

Frédéric Bastien

16 Sept 2015, 13:53:22
to theano-users
On Wed, Sep 16, 2015 at 1:35 PM, <goo...@jan-schlueter.de> wrote:
with gc.allow either True or False and cnmem enabled:
[lib]
cnmem = 1

If you set cnmem to 1, it will ask CNMeM to initially use 100% of GPU memory, which is impossible (even an unused K40c already has something like 23 MiB in use). You should either set it to a fraction (such as 0.5) or a number of megabytes (such as 500).

We convert it to 0.98. In older versions it was 0.985, but that failed in some cases. Can you tell us the highest number that works for you?

The driver doesn't use more than 1%. Then there are some libs that make static allocations for optimization. So we don't know exactly a safe max.
 
p.s. We will enable it by default in 1 or 2 weeks if we don't get reports of problems. We aren't sure what default % of GPU memory to allocate. We thought of using 45% by default. What do you think? This would allow 2 jobs per GPU by default (it needs to leave some memory for the driver).

Sounds like a plausible default, but maybe you should set an upper limit as well? Users may be confused if even their simplest models already take up close to 6 GiB on their Tesla or Titan X. Maybe use cnmem=1 for a default heuristic: 45%, but at most 2 GiB? (cnmem=1 is not a useful value otherwise, as seen above.)

I'm not sure; what do other people think of this?

We didn't change the default because someone had a crash with it. After investigation, it was because he was very close to the memory limit of the card. In that case the driver allocator can move memory around to allocate a big enough region, but cnmem cannot.

So we are checking whether we can detect if the cause is genuinely missing memory, or that cnmem can't move allocated memory around like the driver can. Then we could give that information to the user, to suggest trying without cnmem.
 

By the way, similar to Doug, I also observe a significant speedup even if combined with allow_gc=False. Good job!

I don't understand how we can get such a good speed up with cnmem compared to allow_gc=False. Do you have an idea? Combining them should give a small additional speed up, and we saw this in benchmarks. If someone has an idea, I'm interested to know.

Fred

goo...@jan-schlueter.de

16 Sept 2015, 14:10:04
to theano-users
So we don't know exactly a safe max.

What about querying the amount of free memory at the time cnmem is instantiated?

I don't understand how we can get such a good speed up with cnmem compared to allow_gc=False. Do you have an idea?

Maybe even with allow_gc=False some Ops allocate new memory in between, because not all Ops were written to check whether their outputs can be reused? That would cause synchronizations and limit parallelism. I don't know if there's an easy way to track down such allocations.

Frédéric Bastien

16 Sept 2015, 14:37:52
to theano-users
In Doug's profile, I'm pretty sure that all the ops shown respect that.

If the input shape changes, as happens frequently with RNNs, this triggers new allocations. That could explain it.

To check when allocations happen, one would have to put a gdb breakpoint in the code. But anyway, I don't have the time for this, and in the future we should enable cnmem by default.
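Fred's point about shape changes can be illustrated with a toy op (my sketch, not Theano code) that caches its output buffer and can only reuse it when the input shape matches the previous call:

```python
import numpy as np

class AddOne:
    """Toy op: caches its output buffer, reallocating on shape change."""
    def __init__(self):
        self._buf = None
        self.allocations = 0

    def __call__(self, x):
        if self._buf is None or self._buf.shape != x.shape:
            self._buf = np.empty_like(x)   # fresh allocation needed
            self.allocations += 1
        np.add(x, 1, out=self._buf)        # otherwise the buffer is reused
        return self._buf

op = AddOne()
op(np.zeros((8, 3)))   # first call: allocates
op(np.zeros((8, 3)))   # same shape: buffer reused, no allocation
op(np.zeros((5, 3)))   # RNN-style shape change: allocates again
print(op.allocations)  # 2
```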

thanks

Fred