[ANN] GPU speed up merged in development version (need manual activation for now)

Frédéric Bastien

Jul 23, 2015, 2:29:16 PM
to theano-users, theano-dev
Hi,

This is just to let you know that we have merged into Theano a new GPU memory allocator, CNMeM, recently released by NVIDIA.

If you leave the Theano flag allow_gc at its default, you can expect a speed up of about 20% on the GPU. In some cases (small models), we saw a 50% speed up.

If you set allow_gc=False, the speed up will be much smaller, but you can still get one.

To use it:
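
A minimal sketch of enabling it from Python, assuming the allocator is controlled by the lib.cnmem flag used later in this thread (0 disables it, a fraction such as 0.45 sets the initial share of GPU memory) and that flags are passed through the THEANO_FLAGS environment variable. The helper name build_theano_flags is hypothetical and not part of Theano:

```python
import os


def build_theano_flags(cnmem=0.45, allow_gc=True, device="gpu"):
    """Build a THEANO_FLAGS string (hypothetical helper, not part of Theano)."""
    return "device={},floatX=float32,allow_gc={},lib.cnmem={}".format(
        device, allow_gc, cnmem)


# Theano reads THEANO_FLAGS when it is imported, so set it before the import.
os.environ["THEANO_FLAGS"] = build_theano_flags()
# import theano  # uncomment on a machine with the development version installed
```

The same string can be exported in the shell instead; the only requirement is that it is set before Theano is imported.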

Happy speed up!

Fred

p.s. We will enable it by default in 1 or 2 weeks if we don't get reports of problems. We aren't sure what default percentage of GPU memory to allocate. We thought of using 45% by default. What do you think? This would allow 2 jobs per GPU by default (some memory is needed for the driver).
p.p.s. We haven't tested it on Mac or Windows, but we think it should work. What we do here isn't different from other libs, so this isn't new there. But if you try it there, please report your results.
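
To make the 45% arithmetic concrete, here is a small sketch for a hypothetical card with 4096 MiB of memory. The function name and the card size are illustrative assumptions, not Theano code:

```python
def pool_budget_mib(total_mib, fraction=0.45, jobs=2):
    """Return (initial pool per job, memory left over) in MiB."""
    per_job = total_mib * fraction
    left_over = total_mib - jobs * per_job
    return per_job, left_over


per_job, left_over = pool_budget_mib(4096)  # hypothetical 4 GiB card
print("per job: %.1f MiB, left for the driver and other users: %.1f MiB"
      % (per_job, left_over))
```

Two jobs at 45% each reserve 90% of the card up front, so only about 10% remains for the driver and anything else on the GPU, such as a display.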

Jeffrey De Fauw

Jul 24, 2015, 4:11:08 AM
to theano-users, thean...@googlegroups.com, frederic...@gmail.com
When I use 0.45 or 0.87 (which is the max free memory I still have due to the GPU also driving my displays) I run out of memory (CNMEM_STATUS_OUT_OF_MEMORY). Normally I run it with gc enabled and it runs well (can't run with gc disabled). This is on linux with a 980. Normally I think it occupies about 3400MB at its max while running.

I can give some more specific information in a few days but thought it might already be useful to report.

Frédéric Bastien

Jul 24, 2015, 9:16:14 AM
to Jeffrey De Fauw, theano-users, theano-dev
If you run the exact same program, with the same other programs running so that the same amount of GPU memory is available, but without cnmem, does it work?

The memory specified by the flag is the amount cnmem starts with. If the program needs more than that, cnmem will try to grab it. But that memory won't be handled exactly the same way, as it won't be one big contiguous block. So starting with the right amount of memory could help against fragmentation of GPU memory, I think. Otherwise, I think the speed should be the same.

Fred

Jeffrey De Fauw

Jul 24, 2015, 9:32:17 AM
to Frédéric Bastien, theano-dev, theano-users

The situation was: normally I ran it with gc enabled and I decided to pull the latest master to test the new improvements. I kept gc and enabled cnmem at 0.45 (since you mentioned you might use that as a default). Which gave the cnmem out of memory error. Same happened when I set it at 0.87 (max free memory available on my GPU). When I disabled cnmem it worked again. It never worked without gc so I kept gc on as default. Can't really comment on speed ups at the moment.

Hopefully that helps a bit. Thought I would quickly try it out and since your proposed 0.45 failed for me,  thought I should probably mention it. I might try to find some minimal example next week if needed.

Frédéric Bastien

Jul 24, 2015, 9:33:48 AM
to Jeffr...@gmail.com, theano-dev, theano-users
Can you try with cnmem and without allow_gc?

thanks

Fred

Pascal Lamblin

Jul 24, 2015, 11:41:00 AM
to Frédéric Bastien, Jeffr...@gmail.com, theano-dev, theano-users
On Fri, Jul 24, 2015, Frédéric Bastien wrote:
> Can you try with cnmem and without allow_gc?

I'm pretty sure that is what he tried.

--
Pascal

Jeffrey De Fauw

Jul 24, 2015, 11:47:55 AM
to Pascal Lamblin, theano-dev, Frédéric Bastien, theano-users

No, I have put gc on by default because my recent experiments didn't run without it. I have not tested trying to disable it recently. I should be able to try it later tonight and I will report back.

Julien Demouth

Jul 24, 2015, 3:05:51 PM
to theano-users, lamb...@iro.umontreal.ca, thean...@googlegroups.com, frederic...@gmail.com, Jeffr...@gmail.com, Jeff...@gmail.com
Hi Jeffrey,

If you still see the issue with allow_gc=False, I'm interested in getting a repro case. If that's out of question, I can explain how to generate a text trace that I could study to identify if something could be done to make the library more robust. I wrote cnmem and I would like to collect cases to help me improve the internal allocation strategy.

Thanks,
Julien

Jeffrey De Fauw

Jul 24, 2015, 3:08:04 PM
to theano-dev, Jeffr...@gmail.com, theano...@googlegroups.com, frederic...@gmail.com, julien....@gmail.com
Hi all,

I just tested it quickly. It is indeed so that turning gc off (and cnmem disabled) will give the same out of memory error. I tried turning gc off and cnmem 0.45 and 0.87 but still no luck.

I want to release some code for https://www.kaggle.com/c/diabetic-retinopathy-detection soon and was a little panicky that suddenly it wouldn't be able to run anymore on a 4GB card with the latest theano dev (if cnmem would be enabled by default). I actually had bigger models than this which also ran well with gc.

I'll get back to this in a few days when I have some more time to test it and I can give you access to a repo. :-) Thanks for the quick responses!

Best,
Jeffrey


On Friday, 24 July 2015 19:54:24 UTC+1, Julien Demouth wrote:
> Hi Jeffrey,
>
> If you still see the problem with cnmem and you can share a repro with me, I'm interested (I wrote CNMEM and helped Frédéric integrate it into Theano). I can tweak the internal policy of cnmem to deal with "hard" cases (assuming your case is hard). We also have the freedom to add new strategies and let the user choose the best strategy to claim/reclaim memory.
>
> Thanks,
> Julien

Frédéric Bastien

Jul 26, 2015, 11:09:20 PM
to Jeffrey De Fauw, theano-dev, theano-users, julien....@gmail.com
Hi,

just to be sure: by default, allow_gc is True. When I wrote not to change allow_gc, I meant to keep the default. I think you understood that you should set it to False.

Can you confirm that everything I wrote below is true?

allow_gc=True, lib.cnmem=False (default): it works
allow_gc=True, lib.cnmem=True: it works
allow_gc=False, lib.cnmem=True: it crashes
allow_gc=False, lib.cnmem=False: it crashes

If so, that is what I expect. Mostly, lib.cnmem won't change how much memory Theano uses (so it doesn't introduce or remove out-of-memory errors). But it brings most of the allow_gc=False speed up to Theano!

You can still get a very small speed up by using allow_gc=False with lib.cnmem=True, but it would be minimal on the GPU.

allow_gc=False can still give a speed up on the CPU.

thanks

Fred

Jeffrey De Fauw

Jul 26, 2015, 11:15:12 PM
to Frédéric Bastien, Julien Demouth, theano-dev, theano-users

Hi Frédéric,

No, only with allow_gc True and cnmem False (=0) does it work. Hence, part of my concern.

Jeffrey

Daniel Renshaw

Jul 27, 2015, 4:56:28 AM
to theano...@googlegroups.com
I've just updated to the latest Theano Github version and now I can't import theano without getting a compilation failure (see below for detail). Everything worked fine until I updated to the latest version and the problem seems to be with the recent cnmem update.

Git commit: 354d7b576357c11b41565086437ef972cb8b4768 [354d7b5] (Merge pull request #2629 from abergeron/fix_merge_opts Don't apply alpha_merge and output_merge when the proc node has more than one client.); no local changes.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

GPU: GeForce GT 640

python
Python 2.7.9 (default, Feb  4 2015, 08:18:43)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2

Problem same for all of the following configurations:

THEANO_FLAGS="device=gpu,openmp=True,floatX=float32,nvcc.fastmath=True"
THEANO_FLAGS="device=gpu,openmp=True,floatX=float32,nvcc.fastmath=True,allow_gc=True,lib.cnmem=0"
THEANO_FLAGS="device=gpu,openmp=True,floatX=float32,nvcc.fastmath=True,allow_gc=False,lib.cnmem=0.5"

The commit that causes the problem is ca465be02404c2b697dd75e1253d069cdd99a13b [ca465be] (Merge pull request #3198 from nouiz/cumem3 Add CNMeM in Theano to speed up CUDA allocation.)

If I go back to the commit prior to that, 2ddaca0654abdf224d1945c9b89b5e362a57f464 [2ddaca0] (Merge pull request #3117 from ChienliMa/infer_shape OpFromGraph.infer_shape()), then everything works fine again.

Is there anything I can do to fix the problem?

Daniel


In file included from /usr/include/python2.7/Python.h:8,
                 from mod.cu:3:
/usr/include/python2.7/pyconfig.h:1182:1: warning: "_POSIX_C_SOURCE" redefined
In file included from /opt/cuda-5.0.35/bin/../include/host_config.h:114,
                 from /opt/cuda-5.0.35/bin/../include/cuda_runtime.h:59,
                 from <command-line>:0:
/usr/include/features.h:162:1: warning: this is the location of the previous definition
In file included from /usr/include/python2.7/Python.h:8,
                 from mod.cu:3:
/usr/include/python2.7/pyconfig.h:1204:1: warning: "_XOPEN_SOURCE" redefined
In file included from /opt/cuda-5.0.35/bin/../include/host_config.h:114,
                 from /opt/cuda-5.0.35/bin/../include/cuda_runtime.h:59,
                 from <command-line>:0:
/usr/include/features.h:164:1: warning: this is the location of the previous definition
<snip_home_dir>/source/theano/theano/sandbox/cuda/cnmem.cpp(386): error: identifier "cudaStreamGetFlags" is undefined
mod.cu(938): warning: pointless comparison of unsigned integer with zero
1 error detected in the compilation of "/tmp/tmpxft_0000a408_00000000-6_mod.cpp1.ii".

['nvcc', '-shared', '-O3', '-use_fast_math', '-m64', '-Xcompiler', '-DCUDA_NDARRAY_CUH=5355c4a61aea6cdfb944298e802983a7,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden', '-Xlinker', '-rpath,<snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray', '-I<snip_home_dir>/source/theano/theano/sandbox/cuda', '-I<snip_home_dir>/python/vetinari/lib/python2.7/site-packages/numpy/core/include', '-I/usr/include/python2.7', '-I<snip_home_dir>/source/theano/theano/gof', '-o', '<snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray/cuda_ndarray.so', 'mod.cu', '-L/usr/lib', '-lpython2.7', '-lcublas', '-lcudart']
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: ('nvcc return status', 2, 'for cmd', 'nvcc -shared -O3 -use_fast_math -m64 -Xcompiler -DCUDA_NDARRAY_CUH=5355c4a61aea6cdfb944298e802983a7,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,<snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray -I<snip_home_dir>/source/theano/theano/sandbox/cuda -I<snip_home_dir>/python/vetinari/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -I<snip_home_dir>/source/theano/theano/gof -o <snip_home_dir>/.theano/vetinari/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.9-64/cuda_ndarray/cuda_ndarray.so mod.cu -L/usr/lib -lpython2.7 -lcublas -lcudart')
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: cuda unavilable)


Frédéric Bastien

Jul 27, 2015, 12:57:27 PM
to theano-users
It seems that this update needs CUDA 5.5 or higher. Can you update CUDA?

Should we simply require a more recent CUDA version, or do we need to support CUDA 5.0?
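
A minimal sketch of checking an installed toolkit against that 5.5 minimum, by parsing the text that `nvcc --version` prints. The function name is hypothetical; the sample line is copied from the report earlier in this thread:

```python
import re


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))


# Output line copied from the report earlier in this thread.
sample = "Cuda compilation tools, release 5.0, V0.2.1221"
release = parse_cuda_release(sample)
needs_upgrade = release < (5, 5)  # 5.5 is the minimum mentioned in this message
print(release, needs_upgrade)
```

On a real machine the text would come from running nvcc itself, for example through the subprocess module.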

Fred

Frédéric Bastien

Jul 27, 2015, 12:59:08 PM
to theano-users
I think I found another workaround.

Can you go into the file sandbox/cuda/cuda_ndarray.cu and add this at the top:

#define CUDA_API_PER_THREAD_DEFAULT_STREAM

Tell me if this works. If so, that would be the right fix.

thanks

Fred

Daniel Renshaw

Jul 27, 2015, 1:20:54 PM
to theano...@googlegroups.com
Thanks for looking into this Fred. Unfortunately I won't be in a position to test your suggestion until tomorrow. I'll get back to you then.

P.S. Upgrading CUDA may be possible but won't be quick because I don't manage this server myself; hopefully the #define change will work around the problem.

Frédéric Bastien

Jul 27, 2015, 2:03:57 PM
to theano-users
I think this diff is better:


diff --git a/theano/sandbox/cuda/cnmem.cpp b/theano/sandbox/cuda/cnmem.cpp
index 4a081cf..8e6d999 100644
--- a/theano/sandbox/cuda/cnmem.cpp
+++ b/theano/sandbox/cuda/cnmem.cpp
@@ -380,6 +380,8 @@ public:
     inline cnmemStatus_t setStream(cudaStream_t stream) {
         mStream = stream;
 #ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
+
+#if defined(CUDA_API_PER_THREAD_DEFAULT_STREAM) || (CUDART_VERSION < 5050)
         mIsStreamBlocking = false;
 #else
         unsigned flags = 0;

Daniel Renshaw

Jul 28, 2015, 4:52:21 AM
to theano...@googlegroups.com
I had to alter the diff very slightly, but I can now import theano using the most up-to-date revision.

diff --git a/theano/sandbox/cuda/cnmem.cpp b/theano/sandbox/cuda/cnmem.cpp
index 4a081cf..373bc67 100644
--- a/theano/sandbox/cuda/cnmem.cpp
+++ b/theano/sandbox/cuda/cnmem.cpp
@@ -379,7 +379,7 @@ public:
     /// Define the stream.
     inline cnmemStatus_t setStream(cudaStream_t stream) {
         mStream = stream;
-#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
+#if defined(CUDA_API_PER_THREAD_DEFAULT_STREAM) || (CUDART_VERSION < 5050)
         mIsStreamBlocking = false;
 #else
         unsigned flags = 0;

Note the removal of the original #ifdef line, to be replaced by the new #if line.

Are there any significant performance differences between cuda versions? Should I be trying to upgrade cuda anyway?

Daniel

Frédéric Bastien

Jul 28, 2015, 9:22:14 AM
to theano-users
Thanks. I merged a PR from Julien that has the right fix: mIsStreamBlocking should be true for old CUDA versions, not false. So be sure to update Theano.

thanks

Fred

Frédéric Bastien

Aug 11, 2015, 2:50:44 AM
to Jeffrey De Fauw, Julien Demouth, theano-dev, theano-users
Jeffrey, do you have time to put together a way for us to run this code? If that is hard, Julien can tell you how to generate a trace of the memory allocations. He just needs that to investigate what is happening.

thanks

Frédéric

Jeffrey De Fauw

Aug 11, 2015, 2:57:30 AM
to Frédéric Bastien, theano-users, Julien Demouth, theano-dev

Hi Frédéric,

I already emailed Julien a few weeks ago (soon after I posted here) with a minimal example replicating the issue. Haven't heard back but he might be on holiday. I can put it here as well if you're interested.

Best,
Jeffrey

Francesco Visin

Aug 11, 2015, 2:54:49 PM
to theano-users, thean...@googlegroups.com
I am also getting this error:
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device 0 failed:  
initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1

with allow_gc either True or False and cnmem enabled:
[lib]
cnmem = 1

I get no error if I disable cnmem instead.
My code is really convoluted though, so I am not sure I can create a minimal example.
Please let me know if I can do something else to help.

Frédéric Bastien

Sep 1, 2015, 5:44:25 PM
to theano-users, theano-dev

Julien has an example that causes this type of behavior. If you can keep a way to test it, please do. It would be good if you could test again when Julien updates cnmem.

Fred

Doug

Sep 3, 2015, 2:11:59 PM
to theano-users, thean...@googlegroups.com
I just tested enabling cnmem on windows 7x64, it seems to work as intended. I got a pretty good speed boost on the script I tested with (convolutional ladder network), so I'm a fan.

allow_gc = True
cnmem = 0

Time in 100 calls to Function.__call__: 5.962800e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  40.0%    40.0%      21.753s       3.27e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  19.4%    59.4%      10.540s       6.43e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.6%    70.0%       5.775s       3.04e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   9.1%    79.1%       4.956s       2.75e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   8.5%    87.6%       4.613s       2.88e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   4.3%    91.9%       2.317s       2.41e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   
   
   
allow_gc = False
cnmem = 0  

Time in 100 calls to Function.__call__: 6.032000e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.3%    38.3%      22.981s       3.45e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  20.8%    59.0%      12.477s       7.61e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.6%    69.6%       6.343s       3.34e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   9.0%    78.6%       5.429s       3.02e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   8.4%    87.0%       5.052s       3.16e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   6.0%    93.0%       3.602s       3.75e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor   
   
   
   
allow_gc = True
cnmem = .75

Time in 100 calls to Function.__call__: 3.973400e+01s   
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  29.8%    29.8%      11.653s       1.75e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  25.0%    54.8%       9.759s       5.95e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  12.4%    67.1%       4.843s       2.69e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
  11.6%    78.7%       4.517s       2.82e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   8.9%    87.6%       3.488s       1.84e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   5.3%    92.9%       2.067s       2.15e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor   
   
   
   
allow_gc = False
cnmem = .75   

Time in 100 calls to Function.__call__: 3.858000e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  29.3%    29.3%      11.271s       1.69e-04s     C    66600     666   theano.sandbox.cuda.basic_ops.GpuElemwise
  25.0%    54.3%       9.611s       5.86e-04s     C    16400     164   theano.sandbox.cuda.basic_ops.GpuCAReduce
  12.6%    66.9%       4.823s       2.68e-03s     C     1800      18   theano.sandbox.cuda.dnn.GpuDnnConvGradI
  11.7%    78.6%       4.510s       2.82e-03s     C     1600      16   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   9.0%    87.6%       3.458s       1.82e-03s     C     1900      19   theano.sandbox.cuda.dnn.GpuDnnConv
   5.4%    93.0%       2.062s       2.15e-04s     C     9600      96   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
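
For reference, a short sketch that turns the four totals above into speed-up factors relative to the default configuration (allow_gc = True, cnmem = 0). The timings are copied from the profiles; the variable names are illustrative:

```python
# Seconds for 100 calls to Function.__call__, copied from the profiles above.
timings = {
    ("allow_gc=True", "cnmem=0"): 59.628,
    ("allow_gc=False", "cnmem=0"): 60.320,
    ("allow_gc=True", "cnmem=.75"): 39.734,
    ("allow_gc=False", "cnmem=.75"): 38.580,
}

baseline = timings[("allow_gc=True", "cnmem=0")]
speedups = {config: baseline / seconds for config, seconds in timings.items()}

# Print the configurations from slowest to fastest.
for config, factor in sorted(speedups.items(), key=lambda item: item[1]):
    print("%-32s %.2fx" % (", ".join(config), factor))
```

In this run cnmem gives roughly a 1.5x speed up, while allow_gc=False on its own makes no measurable improvement.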

goo...@jan-schlueter.de

Sep 16, 2015, 1:35:32 PM
to theano-users
> with allow_gc either True or False and cnmem enabled:
> [lib]
> cnmem = 1

If you set cnmem to 1, it will ask CNMeM to initially use 100% of GPU memory, which is impossible (even an unused K40c has like 23 MiB already in use). You should either set it to a fraction (such as 0.5) or a number of megabytes (such as 500).


> p.s. We will enable it by default in 1 or 2 weeks if we don't get reports of problems. We aren't sure what default percentage of GPU memory to allocate. We thought of using 45% by default. What do you think? This would allow 2 jobs per GPU by default (some memory is needed for the driver).

Sounds like a plausible default, but maybe you should set an upper limit as well? Users may be confused if even their simplest models already take up close to 6 GiB on their Tesla or Titan X. Maybe use cnmem=1 for a default heuristic: 45%, but at most 2 GiB? (cnmem=1 is not a useful value otherwise, as seen above.)
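
A minimal sketch of that heuristic, assuming sizes in MiB. The function name and defaults are illustrative only and do not describe what Theano implements:

```python
def default_pool_mib(total_mib, fraction=0.45, cap_mib=2048):
    """45% of the card's memory, but never more than cap_mib (2 GiB here)."""
    return min(total_mib * fraction, cap_mib)


# Hypothetical cards: 4 GiB stays under the cap, 12 GiB is limited by it.
for total in (4096, 12288):
    print(total, default_pool_mib(total))
```

On small cards the 45% rule decides the size; on large cards the cap keeps a simple model from reserving several GiB up front.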

By the way, similar to Doug, I also observe a significant speedup even if combined with allow_gc=False. Good job!

Frédéric Bastien

Sep 16, 2015, 1:53:22 PM
to theano-users
On Wed, Sep 16, 2015 at 1:35 PM, <goo...@jan-schlueter.de> wrote:
>> with allow_gc either True or False and cnmem enabled:
>> [lib]
>> cnmem = 1
>
> If you set cnmem to 1, it will ask CNMeM to initially use 100% of GPU memory, which is impossible (even an unused K40c has like 23 MiB already in use). You should either set it to a fraction (such as 0.5) or a number of megabytes (such as 500).

We convert it to 0.98. In older versions it was 0.985, but that failed in some cases. Can you tell us the highest value that works for you?

The driver doesn't use more than 1%. Then there are some libs that make static allocations for optimization. So we don't know exactly what a safe max is.
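
Putting the descriptions in this thread together, the value of the flag can be read roughly as in the sketch below. This is a toy interpretation based only on what is said here (0 disables, a fraction is a share of GPU memory, larger numbers are megabytes, 1 is converted to 0.98); the function name is hypothetical and Theano's actual parsing may differ:

```python
def initial_pool_mib(cnmem, total_mib):
    """Toy reading of the lib.cnmem value as described in this thread."""
    if cnmem < 0:
        raise ValueError("cnmem must be non-negative")
    if cnmem == 0:
        return 0.0                   # allocator disabled
    if cnmem == 1:
        return 0.98 * total_mib      # 1 is converted to 0.98
    if cnmem < 1:
        return cnmem * total_mib     # fraction of the card's memory
    return float(cnmem)              # larger values: a size in megabytes


# Hypothetical 4096 MiB card.
for value in (0, 0.5, 1, 500):
    print(value, initial_pool_mib(value, 4096))
```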
 
>> p.s. We will enable it by default in 1 or 2 weeks if we don't get reports of problems. We aren't sure what default percentage of GPU memory to allocate. We thought of using 45% by default. What do you think? This would allow 2 jobs per GPU by default (some memory is needed for the driver).
>
> Sounds like a plausible default, but maybe you should set an upper limit as well? Users may be confused if even their simplest models already take up close to 6 GiB on their Tesla or Titan X. Maybe use cnmem=1 for a default heuristic: 45%, but at most 2 GiB? (cnmem=1 is not a useful value otherwise, as seen above.)

I'm not sure. What do other people think of this?

We didn't change the default because someone had a crash with it. After investigation, it is because he is very close to the memory limit of the card. In that case the driver allocator can move memory around to free up a big enough region, but cnmem can't.

So we are checking whether we can detect if the cause is a real lack of memory or that cnmem can't move the allocated memory around like the driver does. Then we could give that information to the user, so they can try without cnmem if possible.
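
A toy illustration of the distinction being described: a pool can have enough free memory in total and still fail a request, because no single free region is large enough. The block sizes are made up and this is not how cnmem is implemented internally:

```python
def can_allocate(free_blocks, request):
    """free_blocks: sizes (MiB) of the separate free regions in a pool."""
    return any(size >= request for size in free_blocks)


free_blocks = [300, 250, 200]   # hypothetical fragmented pool
request = 400                   # MiB needed as one contiguous region

print("total free:", sum(free_blocks))
print("fits without moving data:", can_allocate(free_blocks, request))
print("would fit if regions could be merged:", sum(free_blocks) >= request)
```

Comparing the total free size with the largest free region is one possible way to tell a real shortage from a fragmentation failure.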
 

> By the way, similar to Doug, I also observe a significant speedup even if combined with allow_gc=False. Good job!

I don't understand how we can get such a good speed up with cnmem compared to allow_gc=False. Do you have an idea? Combining them could speed things up a little, and we saw this in benchmarks. If someone has an idea, I'm interested to know.

Fred

goo...@jan-schlueter.de

Sep 16, 2015, 2:10:04 PM
to theano-users
> So we don't know exactly what a safe max is.

What about querying the amount free memory at the time cnmem is instantiated?
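
A minimal sketch of that idea, assuming the free and total byte counts have already been obtained (the CUDA runtime call cudaMemGetInfo reports both). The function name and the 2% safety margin are hypothetical:

```python
def safe_start_fraction(free_bytes, total_bytes, margin=0.02):
    """Largest initial fraction that fits in the currently free memory,
    minus a safety margin; never negative."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return max(0.0, free_bytes / float(total_bytes) - margin)


# Example: 87% of the card is free, as in the report earlier in this thread.
print(safe_start_fraction(87, 100))
```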

> I don't understand how we can get such a good speed up with cnmem compared to allow_gc=False. Do you have an idea?

Maybe even with allow_gc=False some Ops allocate new memory in between, because not all Ops were written to check if their outputs can be reused? That would cause synchronizations and limit parallelism. I don't know if there's an easy way to track down such allocations.

Frédéric Bastien

Sep 16, 2015, 2:37:52 PM
to theano-users
In Doug's profile, I'm pretty sure that all the ops shown respect that.

If the input shape changes, as it frequently does with RNNs, this triggers new allocations. This could explain that.
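
A toy model of that effect, assuming an op keeps its previous output buffer and can reuse it only when the requested shape is unchanged. The class and the shapes are illustrative and are not Theano code:

```python
class OutputBufferCache(object):
    """Toy model: keep one output buffer, reallocate when the shape changes."""

    def __init__(self):
        self.shape = None
        self.allocations = 0

    def get(self, shape):
        if shape != self.shape:
            self.shape = shape        # stands in for a fresh GPU allocation
            self.allocations += 1
        return self.shape


def count_allocations(shapes):
    cache = OutputBufferCache()
    for shape in shapes:
        cache.get(shape)
    return cache.allocations


fixed = [(64, 100)] * 4                               # same shape on every call
varying = [(64, 80), (64, 95), (64, 80), (64, 120)]   # variable-length sequences
print(count_allocations(fixed), count_allocations(varying))
```

With fixed shapes only the first call allocates; with varying shapes every call does, which is where a faster allocator would still help even with allow_gc=False.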

To check when allocations are done, one would have to put a gdb breakpoint in the code. But anyway, I don't have the time for this, and in the future we should enable cnmem by default.

thanks

Fred