Fwd: [theano-users] PROFILE_MODE is Faster than FAST_RUN ??

Olivier Delalleau

May 18, 2013, 11:15:16 AM

to thean...@googlegroups.com

Wouldn't it be better to use cvm_nogc by default?

So far, in all Theano code I've personally been involved with, the memory used during function execution has never been a bottleneck (since typically the function is run on a relatively small minibatch, or a single example).
Maybe it's just me though?

-=- Olivier

Begin forwarded message:

From: Karthik Narayan <karthik....@gmail.com>
Date: May 17, 2013 09:13:11 EDT
To: theano...@googlegroups.com
Subject: Re: [theano-users] PROFILE_MODE is Faster than FAST_RUN ??
Reply-To: theano...@googlegroups.com

Ah, that makes more sense now. Thanks for the detailed description, Frédéric. Really appreciate it!

Cheers,

Karthik

On Fri, May 17, 2013 at 5:55 AM, Frédéric Bastien <no...@nouiz.org> wrote:

Normally, when we talk about FAST_RUN, it means using linker=cvm, which is the default. Using linker=cvm_nogc disables Theano's "GC" of intermediate results. When the GC is enabled, we use Python's GC to get rid of intermediate results as soon as they aren't needed, so we need to reallocate memory for them at each function call. When you disable Theano's GC with linker=cvm_nogc, Theano will keep those intermediate values even when they are no longer needed. So the Theano function will use more memory, but on the 2nd and following calls it won't reallocate as much memory. The speedup you see is due to that. In some cases, I have even seen speedups bigger than 20% when disabling Theano's GC.
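
To make the trade-off described above concrete, here is a toy model in plain Python. It does not use Theano and is not how the cvm linker is implemented; `ToyFunction`, `keep_intermediates` and `run` are hypothetical names invented for this illustration. It only shows why keeping an intermediate buffer alive between calls saves allocations at the cost of holding memory.

```python
import time


class ToyFunction:
    """Toy stand-in for a compiled function with one intermediate buffer.

    Hypothetical illustration only; this is not Theano's implementation.
    """

    def __init__(self, size, keep_intermediates):
        self.size = size
        # True mimics the "nogc" idea: intermediates survive between calls.
        self.keep_intermediates = keep_intermediates
        self._buffer = None
        self.allocations = 0

    def __call__(self, value):
        if self._buffer is None:
            # No buffer left over from a previous call: allocate a fresh one.
            self._buffer = bytearray(self.size)
            self.allocations += 1
        buf = self._buffer
        buf[0] = value & 0xFF  # pretend to compute something into the buffer
        result = buf[0]
        if not self.keep_intermediates:
            # Mimic the gc-enabled behaviour: drop the intermediate right away.
            self._buffer = None
        return result


def run(fn, calls):
    """Call fn repeatedly and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    for i in range(calls):
        fn(i)
    return time.perf_counter() - start


with_gc = ToyFunction(size=1 << 20, keep_intermediates=False)
without_gc = ToyFunction(size=1 << 20, keep_intermediates=True)
t_gc = run(with_gc, 200)
t_nogc = run(without_gc, 200)
print("gc-style:   allocations =", with_gc.allocations, "time =", t_gc)
print("nogc-style: allocations =", without_gc.allocations, "time =", t_nogc)
```

The gc-style object allocates once per call, while the nogc-style object allocates once and then holds its 1 MiB buffer for as long as the object lives. Actual timings depend on the machine and allocator, so no particular speedup should be read into this toy.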

If you have enough RAM, it is safe to use linker=cvm_nogc by default. At worst, your computer will swap, which will make it slower, or crash. Just keep that in mind in case you see your computer using more memory.
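
For reference, a minimal sketch of the two usual ways to select this linker, assuming the standard Theano configuration mechanisms (a `[global]` section in `.theanorc`, or the `THEANO_FLAGS` environment variable). The snippet only builds the text with Python's standard library and does not import Theano; the variable names are just for illustration.

```python
import configparser
import io

# Build the text that would go in ~/.theanorc to make cvm_nogc the default.
config = configparser.ConfigParser()
config["global"] = {"linker": "cvm_nogc"}

buffer = io.StringIO()
config.write(buffer)
theanorc_text = buffer.getvalue()
print(theanorc_text)

# Per-run alternative: set the flag in the environment of a single process,
# e.g. by passing this mapping (merged with os.environ) to subprocess.run.
env_override = {"THEANO_FLAGS": "linker=cvm_nogc"}
print(env_override["THEANO_FLAGS"])
```

Setting the flag per run is the more cautious choice while you check memory use; putting it in `.theanorc` applies it to every Theano program for that user.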

Fred

P.S. In the development version, ProfileMode is deprecated and the new profiler supports the GC. This allows profiling more graphs, especially on the GPU, where memory is more limited.

On Fri, May 17, 2013 at 6:22 AM, Karthik Narayan <karthik....@gmail.com> wrote:

You're right! Using cvm_nogc brings the runtimes for FAST_RUN and PROFILE_MODE very close together. The difference appears to get smaller as job times increase, with FAST_RUN having a very small edge over PROFILE_MODE. As mentioned earlier, however, enabling cvm_nogc makes a noticeable difference (~10-20%).

Could you discuss when it's okay to use cvm_nogc? In particular, can I always use this linker (e.g. set this in my .theanorc), or are there certain times when it's unsafe to?

Thanks,

Karthik

On Thursday, May 16, 2013 6:17:39 AM UTC-7, Pascal Lamblin wrote:

On Thu, May 16, 2013, Karthik Narayan wrote:
> Sorry for reopening this thread... I'm encountering the same problem with
> PROFILE_MODE being faster than FAST_RUN. This happens when trying to run
> mlp.py from the tutorials without modifications on the GPU. In my case,
> each epoch under FAST_RUN takes ~ 13 seconds while each epoch under
> PROFILE_MODE takes ~ 11.5 seconds, around a 10% difference, which seems a
> bit significant given that the task isn't tiny. Thoughts?

FAST_RUN uses garbage collection for the intermediate values, something that
PROFILE_MODE does not do, and that can slow down things a bit.
Can you try with "linker=cvm_nogc"? It should be even faster.

>
> Thanks!
>
> Cheers,
>
> Karthik
>
> On Tuesday, July 31, 2012 2:57:42 PM UTC-7, Manavender Reddy wrote:
> >
> > Thanks for reply. This helps :)
> >
> > On Tuesday, July 31, 2012 3:33:09 PM UTC-4, Frédéric Bastien wrote:
> >>
> >> This can happen when the graph does not do much computation, as in this
> >> example. The ProfileMode does not implement a garbage collector for
> >> intermediate computation results. This is what makes it faster in this
> >> case.
> >> Also, it generates worse error messages, and this can make it a little
> >> faster in some cases.
> >>
> >> But in regular code, I never saw this behavior being significant.
> >>
> >> Fred
> >>
> >> On Mon, Jul 30, 2012 at 6:34 PM, manavender reddy
> >> <manaven...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I am running thing.py example and i am noticing that PROFILE mode is
> >> faster
> >> > than FAST_RUN. Is this expected ??
> >> > I am using Win 7 64bit + python 64bit
> >> >
> >> > Forcing DISTUTILS_USE_SDK=1
> >> > Using gpu device 0: GeForce GT 430
> >> > Looping 1000 times took 0.524999856949 seconds
> >> > Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813
> >> 2.29967761
> >> > 1.62323296]
> >> >
> >> > ProfileMode.print_summary()
> >> > ---------------------------
> >> >
> >> > Time since import 1.463s
> >> > Theano compile time: 0.000s (0.0% since import)
> >> > Optimization time: 0.000s
> >> > Linker time: 0.000s
> >> > Theano fct call 0.520s (35.5% since import)
> >> > Theano Op time 0.490s 33.5%(since import) 94.2%(of fct call)
> >> > Theano function overhead in ProfileMode 0.030s 2.1%(since import)
> >> 5.8%(of
> >> > fct call)
> >> > 1000 Theano fct call, 0.001s per call
> >> > Rest of the time since import 0.943s 64.5%
> >> >
> >> > Theano fct summary:
> >> > <% total fct time> <total time> <time per call> <nb call> <fct name>
> >> > 100.0% 0.520s 5.20e-04s 1000 None
> >> >
> >> > Single Op-wise summary:
> >> > <% of local_time spent on this kind of Op> <cumulative %> <self
> >> seconds>
> >> > <cumulative seconds> <time per call> [*] <nb_call> <nb_op> <nb_apply>
> >> <Op
> >> > name>
> >> > 72.7% 72.7% 0.356s 0.356s 3.56e-04s 1000 1 1 <class
> >> > 'theano.sandbox.cuda.basic_ops.HostFromGpu'>
> >> > 27.3% 100.0% 0.134s 0.490s 1.34e-04s * 1000 1 1 <class
> >> > 'theano.sandbox.cuda.basic_ops.GpuElemwise'>
> >> > ... (remaining 0 single Op account for 0.00%(0.00s) of the runtime)
> >> > (*) Op is running a c implementation
> >> >
> >> > Op-wise summary:
> >> > <% of local_time spent on this kind of Op> <cumulative %> <self
> >> seconds>
> >> > <cumulative seconds> <time per call> [*] <nb_call> <nb apply> <Op
> >> name>
> >> > 72.7% 72.7% 0.356s 0.356s 3.56e-04s 1000 1 HostFromGpu
> >> > 27.3% 100.0% 0.134s 0.490s 1.34e-04s * 1000 1
> >> > GpuElemwise{exp,no_inplace}
> >> > ... (remaining 0 Op account for 0.00%(0.00s) of the runtime)
> >> > (*) Op is running a c implementation
> >> >
> >> > Apply-wise summary:
> >> > <% of local_time spent at this position> <cumulative %%> <apply time>
> >> > <cumulative seconds> <time per call> [*] <nb_call> <Apply position>
> >> <Apply
> >> > Op name>
> >> > 72.7% 72.7% 0.356s 0.356s 3.56e-04s 1000 1
> >> > HostFromGpu(GpuElemwise{exp,no_inplace}.0)
> >> > 27.3% 100.0% 0.134s 0.490s 1.34e-04s * 1000 0
> >> > GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)
> >> > ... (remaining 0 Apply instances account for 0.00%(0.00s) of the
> >> runtime)
> >> > (*) Op is running a c implementation
> >> >
> >> > Some info useful for gpu:
> >> >
> >> > Spent 0.356s(72.653%) in cpu Op, 0.134s(27.347%) in gpu Op and
> >> > 0.000s(0.000%) transfert Op
> >> >
> >> > Theano function input that are float64
> >> > <fct name> <input name> <input type> <str input>
> >> >
> >> > List of apply that don't have float64 as input but have float64 in
> >> > outputs
> >> > (Useful to know if we forgot some cast when using floatX=float32 or
> >> gpu
> >> > code)
> >> > <Apply> <Apply position> <fct name> <inputs type> <outputs type>
> >> >
> >> > Profile of Theano functions memory:
> >> > (This check only the output of each apply node. It don't check the
> >> temporary
> >> > memory used by the op in the apply node.)
> >> > We skipped 1 theano function(s). Each of them used less then
> >> 1024B(theano
> >> > flags ProfileMode.min_memory_size) of total intermediate memory size
> >> >
> >> > Here are tips to potentially make your code run faster
> >> > (if you think of new ones, suggest them on the mailing list).
> >> > Test them first, as they are not guaranteed to always provide a
> >> speedup.
> >> > Sorry, no tip for today.
> >> >
> >> >
> >> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FAST_RUN %%%%%%%%%%%%%%%%
> >> >
> >> > Forcing DISTUTILS_USE_SDK=1
> >> > Using gpu device 0: GeForce GT 430
> >> > Looping 1000 times took 0.889999866486 seconds
> >> > Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813
> >> 2.29967761
> >> > 1.62323296]
> >> >
> >>
> >
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups "theano-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to theano-users...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

--
Pascal

--
Email: karthik....@gmail.com
Mobile: +1 (404) 645-4240

Frédéric Bastien

May 21, 2013, 9:54:41 AM

to theano-dev

Hi,

In the past it was the default, but we switched it a long time ago when a few users had their computers start swapping with the default. I also found a few other users in the lab that needed memory optimization. There is also the GPU, which has limited memory.

I don't think we should change to disable it completely by default. But we could make memory management smarter. For example, we could keep 100M of allocated objects by default and reuse them between Theano functions. If someone is interested in working on this, tell us; I can guide you.
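
One way to picture the "keep 100M and reuse" idea is a size-capped pool of idle buffers. The sketch below is plain Python and purely hypothetical: `BoundedBufferPool`, `acquire` and `release` are invented names, not an existing or planned Theano API, and the 100 MiB default is only taken from the figure mentioned above.

```python
class BoundedBufferPool:
    """Hypothetical cache of idle buffers, capped at max_bytes in total."""

    def __init__(self, max_bytes=100 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.cached_bytes = 0
        self._free = {}  # buffer size -> list of idle buffers of that size

    def acquire(self, size):
        """Return an idle buffer of this size if one is cached, else allocate."""
        idle = self._free.get(size)
        if idle:
            self.cached_bytes -= size
            return idle.pop()  # reused buffer; contents are stale, not zeroed
        return bytearray(size)

    def release(self, buf):
        """Offer a buffer back to the pool; return True if it was kept."""
        size = len(buf)
        if self.cached_bytes + size > self.max_bytes:
            return False  # over budget: drop it and let Python free the memory
        self._free.setdefault(size, []).append(buf)
        self.cached_bytes += size
        return True


# Small demonstration with a 1 KiB budget so the cap is easy to hit.
pool = BoundedBufferPool(max_bytes=1024)
a = pool.acquire(512)
b = pool.acquire(512)
c = pool.acquire(512)
kept = [pool.release(a), pool.release(b), pool.release(c)]
reused = pool.acquire(512)
print(kept, reused is b, pool.cached_bytes)
```

A pool like this sits between the two existing behaviours: below the cap it reuses memory much as cvm_nogc does, and above the cap it falls back to freeing intermediates as the default linker does. Matching buffers only by exact size is a simplification chosen for brevity.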

Fred

--

---
You received this message because you are subscribed to the Google Groups "theano-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-dev+...@googlegroups.com.
