GPU minibatch training


Rami Al-Rfou'

Oct 11, 2012, 5:44:31 PM
to theano...@googlegroups.com, Bryan Perozzi
Hi All,

I am training a model with mini-batches. The GPU execution is slower than the CPU execution; the speed is around half of the CPU's.

I decided to use shared datasets, as explained in the MLP SGD example in the tutorials, to copy the data to GPU memory and avoid the data-transfer penalty.
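
For reference, the pattern I am following is roughly the one below; the shapes, names and cost are just illustrative placeholders, not my real model:

import numpy
import theano
import theano.tensor as T

# Dataset stored in a shared variable so it lives in GPU memory (when device=gpu).
data_x = numpy.random.randn(50000, 784).astype(theano.config.floatX)
shared_x = theano.shared(data_x, borrow=True)

index = T.lscalar('index')        # minibatch index
batch_size = 128
x = T.matrix('x')
cost = (x ** 2).sum()             # placeholder for the real cost

# Each call slices one minibatch out of the shared dataset on the device,
# instead of transferring it from host memory at every call.
train = theano.function(
    [index], cost,
    givens={x: shared_x[index * batch_size:(index + 1) * batch_size]})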

Two things I observed:
  1- The GPU memory usage does not change (I monitor it using nvidia-smi).
  2- The execution did not get faster.

I attached below my profiling logs of the CPU/GPU executions. Bear in mind that the code executes for a fixed amount of time; however, the CPU processes twice as many batches as the GPU.

Your help is really appreciated :).

Regards.
--
Rami Al-Rfou
cpu_log
gpu_log

James Bergstra

Oct 11, 2012, 5:50:58 PM
to theano...@googlegroups.com, Bryan Perozzi
What timing results do you get when you run the actual deep learning
tutorial code?

رامي الرفوع

Oct 11, 2012, 6:09:56 PM
to theano...@googlegroups.com, Bryan Perozzi
In case the attachments did not appear.
@James
For the Theano test offered on the website, I get better results when running on the GPU. Here is the table of timings; these numbers are for the first variation of the code, and the other variations show the expected speedups.

Experiment                   GPU (sec)   CPU (sec)
allow_gc=True;  float32      0.43        15.44
allow_gc=False; float32      0.295       15.43
allow_gc=True;  float64      3.85        3.86
allow_gc=False; float64      3.87        3.87

James Bergstra

Oct 11, 2012, 9:16:50 PM
to theano...@googlegroups.com, Bryan Perozzi
Hi Rami,

The pattern among the runtimes in your table is expected for code that maps well onto the GPU. The GPU only works in float32, and garbage collection slows things down somewhat.
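
For reference, those settings are controlled by Theano flags, e.g. in ~/.theanorc; a minimal sketch, assuming a standard CUDA setup:

[global]
device = gpu
floatX = float32
allow_gc = False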

HTH,
- James

رامي الرفوع

Oct 11, 2012, 10:13:18 PM
to theano...@googlegroups.com, Bryan Perozzi
@James
The test program runs as expected, but my code does not behave the same. Profiling does not show that I am using float64 or that any operation is running on the CPU. The GPU log shows that I am using the GPU for all operations, yet I am not getting any speedup.

I am trying to figure out what I am missing; any hints are appreciated.

Pascal Lamblin

Oct 11, 2012, 10:43:12 PM
to theano...@googlegroups.com
Hi Rami,
Apparently, some operations have a GPU implementation that is slower than
the CPU one, as they have not been optimized yet: in your case,
GpuAdvancedIncSubtensor1 is 5 to 10 times slower per call than its CPU
equivalent (AdvancedIncSubtensor1).

This operation happens in particular when taking the gradient of
expressions like a[v], where v is a vector of integers. Maybe there is
a way to express your graph that does not involve advanced indexing,
and that could be faster with the current implementation.
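
For illustration, here is a minimal sketch (the names are made up, not taken from your code) of a graph where this Op shows up:

import numpy
import theano
import theano.tensor as T

# Embedding lookup: the gradient of a[v]-style indexing is accumulated
# through (Gpu)AdvancedIncSubtensor1.
emb = theano.shared(numpy.random.randn(10000, 64).astype('float32'), name='emb')
idx = T.ivector('idx')             # word indices of one minibatch
cost = emb[idx].sum()              # advanced indexing a[v]
g = T.grad(cost, emb)              # gradient graph contains AdvancedIncSubtensor1
f = theano.function([idx], g)
theano.printing.debugprint(f)      # inspect which Ops ended up in the compiled graph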

--
Pascal

رامي الرفوع

Oct 12, 2012, 5:49:20 AM
to theano...@googlegroups.com
Hi Pascal,

Thanks for the insight, I can see what you are saying in the logs. Indeed, I index a large matrix to look up the word embeddings.

I think I can implement the indexing as a multiplication with a sparse matrix; will that speed things up? As far as I understand, sparse operations will be executed on the CPU.
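
For what it's worth, here is a small numpy/scipy check of the equivalence I have in mind (sizes are arbitrary): each row of the sparse matrix is a one-hot vector selecting one row of C, so the product reproduces the indexed lookup.

import numpy as np
import scipy.sparse as sp

C = np.random.randn(9, 64).astype('float32')   # embedding matrix
v = np.array([4, 3, 7, 7], dtype='int32')      # word indices
data = np.ones(len(v), dtype='float32')        # one nonzero entry per row
indptr = np.arange(len(v) + 1)
M = sp.csr_matrix((data, v, indptr), shape=(len(v), C.shape[0]))
assert np.allclose(M.dot(C), C[v])             # same result as advanced indexing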

Pascal Lamblin

Oct 12, 2012, 2:31:57 PM
to theano...@googlegroups.com
Hi,

On Fri, Oct 12, 2012, رامي الرفوع wrote:
> I think I can implement indexing as a multiplication with a sparse
> matrix, will that speed up things? As far as I understand sparse
> operations will be executed at the CPU.

That's correct, sparse operations will be executed only on the CPU.

Multiplying by a sparse matrix is worth trying, since the implementation
of AdvancedIncSubtensor1 on the CPU is also quite slow. I'm not sure it
will help, but I think it is worth a try.

Hope this helps,
--
Pascal

رامي الرفوع

Oct 13, 2012, 7:03:45 PM
to theano...@googlegroups.com
I am trying to use sparse matrix multiplication instead of the indexing solution; here is my code:


from theano import function
from theano import shared
from theano import tensor as T
from theano import sparse
import scipy.sparse as sp
from numpy import array, asarray, random

from theano.tensor.tests.test_basic import get_numeric_types
print [(str(x), x.num) for x in get_numeric_types()]


size = 9

intX = 'int32'

C = T.matrix('C', dtype=intX)
I = T.matrix('I', dtype=intX)

fI = I.flatten()
data = T.ones_like(fI)                     # one nonzero entry per index
indptr = T.arange(data.shape[0] + 1)       # one entry per row of the CSR matrix

m1 = sparse.CSR(data, fI, indptr, (8, size))   # one-hot selection matrix (8 rows)
m2 = sparse.dot(m1, C)                         # equivalent to indexing C with fI
y = m2.reshape(shape=(2, 4, 9), ndim=3)

f = function(inputs=[I, C], outputs=y)
i = asarray([[4, 3, 7, 7], [2, 8, 4, 5]], dtype=intX)
a = asarray(random.randint(0, 100, (size, size)), dtype=intX)
print a
print '================================='

result = f(i, a)
print result


I get the following error

[('float32', 11), ('float64', 12), ('int16', 3), ('int32', 5), ('int32', 7), ('int64', 9), ('int8', 1), ('uint16', 4), ('uint32', 6), ('uint32', 8), ('uint64', 10), ('uint8', 2)]
(False, False)
<class 'theano.tensor.basic.TensorVariable'>
DimShuffle{1,0}.0
2
[[29 16 54 40 75 40  5 39 39]
 [50  6 59 45 47 92 86 18 68]
 [ 1 25 72 81 69 22 46 21 51]
 [53 97 64 26 47 74 13 20 78]
 [34 40 28 16 13 32 89  7 27]
 [95 82  4 25 49 57 99 50  7]
 [ 4 55  6  6 67 53 69  0  1]
 [71 20 39 55 53 92 67 11 35]
 [ 6 75 10 92 48 51 73 43 82]]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-137-8377cf2e4203> in <module>()
     49 print '================================='
     50 
---> 51 result = f(i, a)
     52 print result
     53 

/usr/local/lib/python2.7/dist-packages/Theano-0.6.0rc1-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    672                 # the C VM needs this because the exception manipulation
    673                 # done by raise_with_op is not implemented in C.
--> 674                 gof.vm.raise_with_op(self.fn.nodes[self.fn.position_of_error])
    675             else:
    676                 # old-style linkers raise their own exceptions

/usr/local/lib/python2.7/dist-packages/Theano-0.6.0rc1-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    666         t0_fn = time.time()
    667         try:
--> 668             outputs = self.fn()
    669         except Exception:
    670             if hasattr(self.fn, 'position_of_error'):

TypeError: expected type_num 7 (NPY_INT32) got 5

=================================

رامي الرفوع

Oct 14, 2012, 12:21:34 AM
to theano...@googlegroups.com
OK, I fixed this issue by assigning
intX = 'int64'
Actually, I do not understand why that works.

I am now getting a different error when I try to run the code on the GPU; this exception is thrown when I execute the sparse dot operation:
NotImplementedError: ('this function should only be called on *variables* (of type sparse.SparseType or tensor.TensorType), not,', Weights_C)

Weights_C is <class 'theano.sandbox.cuda.var.CudaNdarraySharedVariable'> when I use the GPU
Weights_C is <class 'theano.tensor.sharedvar.TensorSharedVariable'> when I use the CPU

Shouldn't the sparse dot operation be done by the CPU?

Pascal Lamblin

Oct 15, 2012, 2:32:18 PM
to theano...@googlegroups.com
On Sat, Oct 13, 2012, رامي الرفوع wrote:
> I am trying to use the sparse matrices multiplication instead of the
> indexing solution, here is my code
>
> [...]
>
> I get the following error
> TypeError: expected type_num 7 (NPY_INT32) got 5

Thanks for reporting that bug, I just submitted a fix:
https://github.com/Theano/Theano/pull/1015

When it's accepted, you can update Theano to the latest development
version, and that problem should be solved. Please let us know if it is
not the case.
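
For reference, one common way to do that, assuming a pip-based install, is something like:

pip install --upgrade --no-deps git+https://github.com/Theano/Theano.git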

Now I'll try to fix the other ones you reported :)
--
Pascal

رامي الرفوع

Oct 22, 2012, 11:19:26 AM
to theano...@googlegroups.com
@Pascal

I noticed that when I increase the size of the matrix I am indexing from 10000*64 to 100000*64, I get a slowdown by a factor of 10! Even if the operation is slow, I do not understand why its cost is a function of the matrix size. Isn't this supposed to be random memory access? I would expect some slowdown because of caching effects, but not such a dramatic one!
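
To make the comparison concrete, here is roughly the kind of micro-benchmark I mean (a simplified sketch with made-up sizes, not my actual training code):

import time
import numpy
import theano
import theano.tensor as T

def time_index_grad(n_rows, n_cols=64, n_idx=128, n_calls=100):
    # Time the gradient of an indexed lookup for a given matrix size.
    W = theano.shared(numpy.zeros((n_rows, n_cols), dtype='float32'), name='W')
    idx = T.ivector('idx')
    f = theano.function([idx], T.grad(W[idx].sum(), W))
    batch = numpy.random.randint(0, n_rows, size=n_idx).astype('int32')
    t0 = time.time()
    for _ in range(n_calls):
        f(batch)
    return (time.time() - t0) / n_calls

for n in (10000, 100000):
    print n, time_index_grad(n)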


Regards,