Does nn automatically use multiple CPUs?


David C Cohen

Sep 14, 2014, 11:11:50 PM
to tor...@googlegroups.com
Hi,

I've compiled torch and nn using gcc on OS X, and it seems OpenMP works fine:

torch.setnumthreads(4)
torch.getnumthreads()
4

However, my neural network code still uses only 1 CPU. Does nn need specific instructions to use multiple CPUs, or is it supposed to parallelize out of the box?

soumith

Sep 15, 2014, 1:03:10 AM
to torch7 on behalf of David C Cohen
The neural net code uses OpenMP for parallelization. If it is still using only 1 CPU, then either you are using a module that relies on BLAS optimizations (and the BLAS you are using is not multi-CPU enabled), OpenMP's thread count is being overridden, perhaps by an environment variable (OMP_NUM_THREADS), or the nn module you are using doesn't have any parallelization code.
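For example, a quick sanity check of the thread settings from the Torch prompt (a minimal sketch; it only uses standard Lua and the torch calls you already have) could be:

require 'torch'
print(os.getenv('OMP_NUM_THREADS'))  -- nil unless the environment variable is set
print(torch.getnumthreads())         -- should match what you set with torch.setnumthreads()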


David C Cohen

Sep 15, 2014, 9:54:41 AM
to tor...@googlegroups.com
I guess my training code does not have any parallelization. I'm simply testing the training code given in the docs; is either of those two examples parallelized?


I couldn't find anything in the nn docs that explicitly talks about multiple CPUs. Should the nn.Parallel module be used to divide the job between multiple CPUs?



smth chntla

Sep 15, 2014, 11:37:35 AM
to tor...@googlegroups.com
I think there's a slight confusion here. nn.Parallel doesn't make a module parallel with respect to CPUs; it is used to build a certain kind of nn topology.

All nn modules that have OpenMP constructs are parallelized layer-wise, i.e. each layer, while it is being computed (like nn.SpatialConvolution or nn.Tanh), will use as many CPUs as it can via OpenMP.

nn.Parallel, however, runs all the layers in its topology serially (and each layer contained in nn.Parallel might itself use multiple CPUs).
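To illustrate the distinction, here is a minimal nn.Parallel sketch (the sizes are arbitrary): it splits the input along one dimension and feeds each slice to a different member module, but the members still run one after another.

require 'nn'
-- nn.Parallel(inputDim, outputDim): slice the input along inputDim,
-- apply module i to slice i, then join the outputs along outputDim
mlp = nn.Parallel(1, 1)
mlp:add(nn.Linear(10, 3))
mlp:add(nn.Linear(10, 2))
out = mlp:forward(torch.randn(2, 10))  -- 5-element output; the two Linears ran serially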

David C Cohen

Sep 15, 2014, 12:02:48 PM
to tor...@googlegroups.com
I've reinstalled torch and nn, making sure OpenMP is supported:

-- The C compiler identification is GNU 4.9.1
-- The CXX compiler identification is GNU 4.9.1

...

-- Check for working CXX compiler: /usr/local/bin/g++-4.9
-- Check for working CXX compiler: /usr/local/bin/g++-4.9 -- works

...

-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp  
-- Compiling with OpenMP support


I'm using this simple multi-layer model for classification:

torch.setnumthreads(4)

model = nn.Sequential()
model:add(nn.Linear(200, 10))
model:add(nn.Tanh())
model:add(nn.Linear(10, 2))
model:add(nn.LogSoftMax())

My BLAS library is the default OS X Accelerate Framework.

Given the above config, nn still uses only 1 CPU. I understand that using nn.Tanh() alone should be enough to engage multiple CPUs, given that numthreads is set to 4 and nn is compiled with OpenMP.

What am I missing here? Must the BLAS library also support OpenMP?

smth chntla

Sep 15, 2014, 12:39:12 PM
to tor...@googlegroups.com
David,

That does sound like OpenMP is being compiled in. I have no idea why multiple CPUs aren't being used (maybe the layer is not large enough for it to be noticeable?).

Can you try this:
require 'nn'
m = nn.Tanh()
x = torch.randn(10000, 10000)  -- a large input, so there is enough work to split across threads
for i = 1, 100 do o = m:forward(x) end

David C Cohen

Sep 15, 2014, 1:36:41 PM
to tor...@googlegroups.com
Thanks, your example code did trigger multi-CPU usage.

Maybe I'm not using nn efficiently. My model's input size is only about 100, but I have a large number of samples, about 0.5 million. How can nn use all available resources to speed up training? Should a different topology be used for this?

smth chntla

Sep 15, 2014, 5:04:37 PM
to tor...@googlegroups.com
If it's a small model, try using mini-batches so that there is enough work to parallelize over. Most nn modules accept inputs with dim+1 dimensions (where dim is the expected number of input dimensions) and compute things in batches.
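Concretely, with the model you posted above (the batch size of 128 here is arbitrary), batch mode just means passing a 2D tensor instead of a 1D one:

require 'nn'
torch.setnumthreads(4)

model = nn.Sequential()
model:add(nn.Linear(200, 10))
model:add(nn.Tanh())
model:add(nn.Linear(10, 2))
model:add(nn.LogSoftMax())

y  = model:forward(torch.randn(200))       -- single sample: 1D input of size 200
yb = model:forward(torch.randn(128, 200))  -- mini-batch: 2D input of size 128 x 200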

David C Cohen

Sep 15, 2014, 8:15:23 PM
to tor...@googlegroups.com



I've tried the mini-batch approach suggested here:

https://github.com/clementfarabet/torch7-demos/blob/master/train-on-cifar/train-on-cifar.lua

But the luajit process gets stuck at 99.9% CPU (while your code pushes luajit above 300% CPU). I've tried increasing the batch size with no luck. Is that code out of date?
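For reference, here is a minimal self-contained mini-batch training loop in the same spirit as that demo, but not taken from it: it uses optim.sgd, and the random data, sizes, and learning rate below are all illustrative placeholders.

require 'nn'
require 'optim'   -- assumes the optim package is installed

torch.setnumthreads(4)

-- toy data standing in for the real ~0.5M-sample set
local nSamples, inputSize, nClasses, batchSize = 10000, 200, 2, 256
local data   = torch.randn(nSamples, inputSize)
local labels = torch.Tensor(nSamples):apply(function() return torch.random(1, nClasses) end)

local model = nn.Sequential()
model:add(nn.Linear(inputSize, 10))
model:add(nn.Tanh())
model:add(nn.Linear(10, nClasses))
model:add(nn.LogSoftMax())

local criterion = nn.ClassNLLCriterion()
local params, gradParams = model:getParameters()
local optimState = { learningRate = 0.01 }

for i = 1, nSamples, batchSize do
   local size    = math.min(batchSize, nSamples - i + 1)
   local inputs  = data:narrow(1, i, size)
   local targets = labels:narrow(1, i, size)

   -- closure returning the loss and gradient for this mini-batch
   local feval = function(x)
      if x ~= params then params:copy(x) end
      gradParams:zero()
      local outputs = model:forward(inputs)
      local loss    = criterion:forward(outputs, targets)
      model:backward(inputs, criterion:backward(outputs, targets))
      return loss, gradParams
   end
   optim.sgd(feval, params, optimState)
end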
 

r.

May 4, 2016, 8:32:47 AM
to torch7
Thanks for the code snippet, smth chntla.

(I wonder, was a solution ever found?)