Theano OpenMP multicore support


sai rajeshwar

unread,
Jan 5, 2015, 1:31:20 PM1/5/15
to theano...@googlegroups.com

 Theano supports OpenMP, but does it still not support MIC (Xeon Phi)? Now that Python (NumPy) can be offloaded onto the MIC: https://portal.tacc.utexas.edu/tutorials/automatic-offload#python.
 The only OpenMP pragmas I could find were https://github.com/Theano/Theano/search?utf8=%E2%9C%93&q=%23pragma+omp
 I feel that if the scale-up with OpenMP is good, then MIC should do even better. Even if Theano does not support MIC, is there a way we can change some (relevant) parts of Theano to port it to MIC? Fred, Pascal, Olivier?
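For reference, the automatic offload described in the TACC tutorial linked above is controlled by Intel MKL environment variables. A minimal sketch, assuming an Intel MKL build with Automatic Offload support; the variable names are MKL's, the values are illustrative:

```python
# Sketch only: enable Intel MKL Automatic Offload to the MIC. These must
# be set before any MKL-backed computation (NumPy/Theano import) starts.
import os

os.environ["MKL_MIC_ENABLE"] = "1"   # turn on automatic offload
os.environ["OFFLOAD_REPORT"] = "2"   # report which calls were offloaded
```

With these set, sufficiently large BLAS calls may be offloaded automatically, while small ones stay on the host.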

Pascal Lamblin

unread,
Jan 6, 2015, 12:01:09 PM1/6/15
to theano...@googlegroups.com
Hi,

I'm not sure what kind of speed-up we could get, and Xeon Phi is really
not our priority target at the moment...

If you are interested, I think the first step may be to make the code
generation and compilation work with the Intel compiler (creating a new
class, like for gcc and nvcc).

Actually, maybe it would be worth figuring out whether we can link with
the parallel MKL running on the Xeon Phi for BLAS calls even from gcc
(they only say it works with icc), since BLAS calls are the usual
bottleneck.

On Mon, Jan 05, 2015, sai rajeshwar wrote:
> Theano supports OpenMP, but does it still not support MIC (Xeon Phi)? Now
> that Python (NumPy) can be offloaded onto the MIC:
> https://portal.tacc.utexas.edu/tutorials/automatic-offload#python
> The only OpenMP pragmas I could find were
> https://github.com/Theano/Theano/search?utf8=%E2%9C%93&q=%23pragma+omp
> I feel that if the scale-up with OpenMP is good, then MIC should do even
> better. Even if Theano does not support MIC, is there a way we can change
> some (relevant) parts of Theano to port it to MIC? Fred, Pascal, Olivier?
>


--
Pascal

Sai Rajeshwar

unread,
Jan 7, 2015, 12:44:33 AM1/7/15
to theano...@googlegroups.com, lamb...@iro.umontreal.ca
Hi, thanks for the reply.

OK, I shall make use of the icc compiler,
and while using icc, I will also try MIC-specific offload compiler options so that the code (the MKL part of it) is offloaded.

One doubt: do you already have something like code generation using the GNU toolchain for Theano, and do I need to add code generation using the Intel one?

Can you guide me briefly on how to do the code generation and compilation? That would be great.

with regards..

M. Sai Rajeswar
IIT Delhi
----------------------------------Cogito Ergo Sum---------



Pascal Lamblin

unread,
Jan 7, 2015, 10:15:29 AM1/7/15
to theano...@googlegroups.com
On Wed, Jan 07, 2015, Sai Rajeshwar wrote:
> OK, I shall make use of the icc compiler, and while using icc, I will
> also try MIC-specific offload compiler options so that the code (the MKL
> part of it) is offloaded.
>
> One doubt: do you already have something like code generation using the
> GNU toolchain for Theano, and do I need to add code generation using the Intel one?

We have code generation that generates .cpp files, which are then
compiled (currently by g++ or nvcc) into shared libraries, which are
then loaded as Python modules.

The generation part itself does not depend on the GNU toolchain. All
of that code is in theano/gof/. In particular, GCC_compiler is in
theano/gof/cmodule.py. You would probably need to write a similar class
for icc. NVCC_compiler is in theano/sandbox/cuda/nvcc_compiler.py;
maybe it can be another reference for you.
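To make the suggestion concrete, here is a hypothetical sketch of what such a class could look like. Everything in it (class name, methods, flags) is illustrative, not Theano's actual API; the real interface to mirror is GCC_compiler in theano/gof/cmodule.py:

```python
# Hypothetical sketch of an icc-based compiler class, loosely modeled on
# the GCC_compiler/NVCC_compiler pattern described above. Not Theano code.

class ICC_compiler(object):
    # Illustrative icpc flags; -mkl links Intel MKL for BLAS calls.
    base_flags = ["-O2", "-fPIC", "-shared", "-mkl", "-openmp"]

    @staticmethod
    def compile_args():
        # Flags used when compiling a generated .cpp module.
        return list(ICC_compiler.base_flags)

    @staticmethod
    def compile_command(src_filename, out_filename):
        # Command line that would turn a generated .cpp file into a
        # shared library loadable as a Python module.
        return (["icpc"] + ICC_compiler.compile_args()
                + [src_filename, "-o", out_filename])
```

The real class would also need the compile_str-style method that actually writes the source, runs the command, and imports the result, as GCC_compiler does.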

Currently, the logic for selecting the compiler is that if a specific
node (for instance, a GPU operation) requests a specific compiler
(nvcc), that compiler will be used to compile its code. By default, it
is GCC. See theano/gof/cc.py:900 for that.
Also, some generic utility functions (the CVM and cutils_ext) are always
compiled with GCC for the moment, so you may have to adjust that.

If this is worth it in terms of performance, we can then help you
refactor that into more flexible code.

> Can you guide me briefly on how to do the code generation and
> compilation? That would be great.

Please let us know if you have any more specific questions.

--
Pascal

Sai Rajeshwar

unread,
Jan 7, 2015, 2:08:54 PM1/7/15
to theano...@googlegroups.com
Thanks, I'll follow the steps you suggested.
I will add the class for icc in cmodule.py, make the necessary changes in cc.py in my fork, and let you know.

A few other queries I wanted to confirm, to see if there is improvement with OpenMP on the CPU itself:

i. Will using icc make performance better?
ii. Will using OpenMP threads make it better?
iii. Will using the MIC be helpful; does it have that amount of parallelism?
If much of the time is spent in BLAS and similar libraries, then BLAS is offloaded onto the MIC very well and it will take care of it.


with regards..

M. Sai Rajeswar
IIT Delhi
----------------------------------Cogito Ergo Sum---------



Frédéric Bastien

unread,
Jan 7, 2015, 2:50:51 PM1/7/15
to theano-users
Hi,

Something very recent in Theano is that you can just swap the "g++" call for something else. If that something else understands g++ parameters, it will work. We do that on Mac to use the clang compiler. Maybe icpc understands g++ parameters. If that is the case, you just have to use a recent Theano (dev version; I don't remember when it was merged) and use this Theano flag:

cxx=icpc

Tell us the result. If that does not work, maybe you can just convert/parametrize the parameters passed to g++ in the GCC_compiler class instead of making a completely new class.
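One way to try this is to set the flag through the THEANO_FLAGS environment variable before Theano is imported. A minimal sketch, assuming a dev version of Theano that honours the cxx flag as described:

```python
# Sketch: select icpc as Theano's C++ compiler via THEANO_FLAGS.
# Must be set before `import theano`, and works only if icpc accepts
# g++-style options on your system.
import os

os.environ["THEANO_FLAGS"] = "cxx=icpc"
# An `import theano` at this point would compile generated modules
# with icpc instead of g++.
```

The same flag can also be put in the [global] section of ~/.theanorc.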

icc/icpc isn't always faster than gcc. I didn't time it, so I don't know what this will do. It will depend on the model. If the model spends most of its time in the BLAS library, it won't make a difference.

I do not know if the OpenMP implementation from icpc is faster than the one from gcc.

MIC could be useful. I didn't test it. It could give speed-ups similar to a GPU; I do not expect more speed-up than a GPU. But for this, someone will probably need to re-optimize/OpenMP-ize some parts of Theano. If they have a MIC-optimized BLAS, that can help.

In any case, keep us updated. I think many people have questions about it.

I do not know how it will handle the transfers between the host and the device. We do not have any such mechanism in the CPU code.

Fred


Sai Rajeshwar

unread,
Jan 8, 2015, 2:27:40 AM1/8/15
to theano...@googlegroups.com
hi,

I just came across this error when I used the Theano flag cxx=icpc:
----------------------------------------------

Traceback (most recent call last):
  File "lenet_ufc_my.py", line 303, in <module>
    evaluate_lenet5()
  File "lenet_ufc_my.py", line 223, in evaluate_lenet5
    y: test_set_y[index * batch_size: (index + 1) * batch_size]})
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/compile/function.py", line 265, in function
    profile=profile)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/compile/pfunc.py", line 511, in pfunc
    on_unused_input=on_unused_input)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/compile/function_module.py", line 1546, in orig_function
    defaults)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/compile/function_module.py", line 1409, in create
    _fn, _i, _o = self.linker.make_thunk(input_storage=input_storage_lists)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/link.py", line 531, in make_thunk
    output_storage=output_storage)[:3]
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/vm.py", line 897, in make_all
    no_recycling))
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/op.py", line 722, in make_thunk
    output_storage=node_output_storage)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/cc.py", line 1043, in make_thunk
    keep_lock=keep_lock)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/cc.py", line 985, in __compile__
    keep_lock=keep_lock)
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/cc.py", line 1415, in cthunk_factory
    key = self.cmodule_key()
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/cc.py", line 1124, in cmodule_key
    compile_args=self.compile_args(),
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/cc.py", line 842, in compile_args
    ret += c_compiler.compile_args()
  File "/home1/02246/srm89/theano_cuda/Theano-master/theano/gof/cmodule.py", line 1752, in compile_args
    cxxflags.extend(GCC_compiler.march_flags)
TypeError: ('The following error happened while compiling the node', InplaceDimShuffle{}(TensorConstant{0}), '\n', "'NoneType' object is not iterable")

any suggestions?

with regards..

M. Sai Rajeswar
IIT Delhi
----------------------------------Cogito Ergo Sum---------

Frédéric Bastien

unread,
Jan 8, 2015, 4:33:57 PM1/8/15
to theano-users
Can you test it?

Fred

Sai Rajeshwar

unread,
Jan 9, 2015, 2:26:25 PM1/9/15
to theano...@googlegroups.com
Sure, I will keep you updated.

thanks

with regards..

M. Sai Rajeswar
IIT Delhi
----------------------------------Cogito Ergo Sum---------

Pascal Lamblin

unread,
Jan 14, 2015, 4:07:07 PM1/14/15
to theano...@googlegroups.com
On Wed, Jan 07, 2015, Frédéric Bastien wrote:
> icc/icpc isn't always faster than gcc. I didn't time it, so I don't know
> what this will do. It will depend on the model. If the model spends most
> of its time in the BLAS library, it won't make a difference.

I'm not sure that is correct in this case, as I expect icc to be able to
link with the MIC version of BLAS, which gcc may not be able to do.

--
Pascal

Sai Rajeshwar

unread,
Jan 19, 2015, 12:12:13 PM1/19/15
to theano...@googlegroups.com
Hi Fred, Pascal,

   I was testing the PR you've created; here are the observations.

  I was able to compile successfully with the Intel compilers, with a bit of improvement in training speed, using the flags -O2 and -mkl.
However, I could not offload it onto the MIC (Xeon Phi) as I expected; I tried many things but somehow could not. But it should be possible to automatically offload Python code onto the MIC.

   The only thing I can see now is that there is a lower bound on the matrix size before offloading happens. So if our matrices are too small, MKL will not offload, because the cost of moving the data outweighs the performance gains. This bound is something like a matrix size of 4096. How can I see the sizes of the various matrices as the code runs in Theano? What parameters could increase them (larger datasets, batch size, etc.)?

Any suggestions would be helpful. Thanks.
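As a rough way to reason about the question above: the two operands MKL sees in a dense layer's dot product have shapes (batch_size, n_in) and (n_in, n_out), so batch size and layer widths directly set the GEMM size. A small illustrative sketch; the 4096 threshold is just the figure quoted in this thread, not an official MKL value, and the helper names are made up:

```python
# Illustrative only: estimate whether a dense layer's GEMM is big enough
# that MKL automatic offload would bother moving it to the MIC.

def gemm_shapes(batch_size, n_in, n_out):
    # Shapes of the two operands of the layer's dot product.
    return (batch_size, n_in), (n_in, n_out)

def likely_offloaded(batch_size, n_in, n_out, threshold=4096):
    # Crude heuristic: offload only pays off if every dimension is large,
    # since the data-transfer cost must be amortized by the computation.
    return min(batch_size, n_in, n_out) >= threshold
```

By this heuristic, typical LeNet-scale layers (batch 256, a few hundred to a few thousand units) would stay on the host.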
 

with regards..

M. Sai Rajeswar
IIT Delhi
----------------------------------Cogito Ergo Sum---------


--
Pascal

Frédéric Bastien

unread,
Jan 19, 2015, 9:11:57 PM1/19/15
to theano-users
A bigger batch size and bigger layer sizes would make bigger inputs to the dot product.

Thanks for confirming that it compiles with icpc.


But I think that getting a good speed-up from the Xeon Phi will require storing the weights on it and not transferring them each time. This is what we do with GPUs.

For this, I think the best direction to go is the new GPU back-end with OpenCL. I think the Xeon Phi supports OpenCL, but we haven't ported much code to the OpenCL side, so much work is needed.

I do not think the core of Theano will advance fast on OpenCL, as we do not have the hardware, so it won't help us. But if people with C programming skills want to help, tell us.

Fred


Sai Rajeshwar

unread,
Jan 21, 2015, 11:50:51 PM1/21/15
to theano...@googlegroups.com

Hi, I was playing with the batch-size parameter just to see if offload to the MIC happens, but apparently it's not working. I attach my code; it is similar to the LeNet code, just 3D-CNN based. Can you suggest how to view the matrix sizes present in the code?
Thank you

Pascal Lamblin

unread,
Jan 22, 2015, 11:08:23 AM1/22/15
to theano...@googlegroups.com
On Thu, Jan 22, 2015, Sai Rajeshwar wrote:
> Hi, I was playing with the batch-size parameter just to see if offload to
> the MIC happens, but apparently it's not working. I attach my code; it is
> similar to the LeNet code, just 3D-CNN based. Can you suggest how to view
> the matrix sizes present in the code?
> Thank you
> lenet_ufc_my_100.py
> <https://docs.google.com/file/d/0B02fGs-cS7CeNUx3R3ZQckVZVUpnb2d1M1hGLW1mXzZDU0xR/edit?usp=drivesdk>
> bigger batch size and bigger size of layers would make bigger input to
> dot product.

I think there has been a misunderstanding. Convolutional nets on CPU do
not use BLAS (except for the final densely-connected layers), so they
would not benefit from MKL's optimizations for the MIC.

If you use a densely-connected net, then BLAS gets used for the weight
multiplications, and it can benefit from the optimized MKL, presumably
more so with larger batches and layers.
Pascal

Johann Hauswald

unread,
Jul 20, 2015, 6:42:22 PM7/20/15
to theano...@googlegroups.com
Hi Sai and everyone,

I was wondering if you have made progress offloading to the MIC since this discussion? I have a densely connected network and it does not appear to be offloading to the MIC, even with larger batch sizes. I'd be interested to see where you left off.

Thanks,

martin.de...@gmail.com

unread,
Sep 10, 2016, 8:49:39 PM9/10/16
to theano-users
Hi Sai,
Sorry for the disturbance. I was reading your post, and I tried to compile theano.test() with icpc and the flags you mentioned, -O2 -mkl, but I kept getting errors and warnings about compiler optimization being disabled. Did you get any warnings or errors when you compiled with icc/icpc? I tried the above on OS X.

Thanks!

Sai Rajeshwar

unread,
Sep 10, 2016, 8:58:30 PM9/10/16
to theano...@googlegroups.com
Hi Martin,

I do not think I ran into any warnings or errors when I compiled with icpc. It ran properly.

with regards..

M. Sai Rajeswar
IIT Delhi
----------------------------------Cogito Ergo Sum---------


martin.de...@gmail.com

unread,
Sep 10, 2016, 9:51:12 PM9/10/16
to theano-users
Thanks Sai,

I've been trying all day to run the Theano tests, and I get errors every time I try to use a different compiler, either g++-6 or icpc. Whenever I try g++-6, it gives me a conflict error about math.h from the Intel library not being compatible with g++-6.

When I tried your flags and switched to icpc, I again could not run the tests; I was getting errors about flags not being properly recognized, and some of them conflicted with icpc.

Now I'm running the tests with the default clang compiler, and it seems to be working so far, apart from all the user warnings and deprecation warnings. Still, it is the worst solution, since it doesn't support OpenMP.

Frédéric Bastien

unread,
Sep 12, 2016, 4:06:05 PM9/12/16
to theano-users
I never tried g++ 6, so I can't comment on that. But I know it works with many versions of g++ 4. Can you give the errors? That would help us find the solution.
