GPU minibatch training


Rami Al-Rfou'

Oct 11, 2012, 5:44:31 PM
to theano...@googlegroups.com, Bryan Perozzi
Hi All,

I am training a model with mini-batches. The GPU execution is slower than the CPU execution; the speed is around half of the CPU's.

I decided to use shared datasets, as explained in the MLP SGD example in the tutorials, to copy the data to GPU memory and avoid the data-transfer penalty.
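
For reference, the pattern I am following is roughly the one below; the shapes, names and cost are just illustrative placeholders, not my real model:

import numpy
import theano
import theano.tensor as T

# Dataset stored in a shared variable so it lives in GPU memory (when device=gpu).
data_x = numpy.random.randn(50000, 784).astype(theano.config.floatX)
shared_x = theano.shared(data_x, borrow=True)

index = T.lscalar('index')        # minibatch index
batch_size = 128
x = T.matrix('x')
cost = (x ** 2).sum()             # placeholder for the real cost

# Each call slices one minibatch out of the shared dataset on the device,
# instead of transferring it from host memory at every call.
train = theano.function(
    [index], cost,
    givens={x: shared_x[index * batch_size:(index + 1) * batch_size]})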

Two things I observed:
  1- The GPU memory usage does not change (I monitor it using nvidia-smi).
  2- The execution did not get faster.

I attached below my profiling logs of the CPU/GPU executions. Bear in mind that the code executes for a fixed amount of time; however, the CPU processes twice as many batches as the GPU.

Your help is really appreciated :).

Regards.
--
Rami Al-Rfou
cpu_log
gpu_log

James Bergstra

Oct 11, 2012, 5:50:58 PM
to theano...@googlegroups.com, Bryan Perozzi
What timing results do you get when you run the actual deep learning
tutorial code?

رامي الرفوع

Oct 11, 2012, 6:09:56 PM
to theano...@googlegroups.com, Bryan Perozzi
In case the attachments did not appear.
@James
For the Theano test offered on the website, I get better results when running on the GPU. Here is the table of timings; these numbers are for the first variation of the code, and the other variations show the expected speedups.

Experiment                   GPU (sec)   CPU (sec)
allow_gc=True;  float32      0.43        15.44
allow_gc=False; float32      0.295       15.43
allow_gc=True;  float64      3.85        3.86
allow_gc=False; float64      3.87        3.87

James Bergstra

Oct 11, 2012, 9:16:50 PM
to theano...@googlegroups.com, Bryan Perozzi
Hi Rami,

The pattern among the runtimes in your table is expected for code that maps well onto the GPU. The GPU only works in float32, and garbage collection slows things down somewhat.
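
For reference, those settings are controlled by Theano flags, e.g. in ~/.theanorc; a minimal sketch, assuming a standard CUDA setup:

[global]
device = gpu
floatX = float32
allow_gc = False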

HTH,
- James

رامي الرفوع

Oct 11, 2012, 10:13:18 PM
to theano...@googlegroups.com, Bryan Perozzi
@James
The test program runs as expected, but my code does not behave the same. Profiling does not show that I am using float64 or that any operation is running on the CPU. The GPU log shows that I am using the GPU for all operations, yet I am not getting any speedup.

I am trying to figure out what I am missing; any hints are appreciated.

Pascal Lamblin

Oct 11, 2012, 10:43:12 PM
to theano...@googlegroups.com
Hi Rami,
Apparently, some operations have a GPU implementation that is slower than
the CPU one, as they have not been optimized yet: in your case,
GpuAdvancedIncSubtensor1 is 5 to 10 times slower per call than its CPU
equivalent (AdvancedIncSubtensor1).

This operation happens in particular when taking the gradient of
expressions like a[v], where v is a vector of integers. Maybe there is
a way to express your graph that does not involve advanced indexing,
and that could be faster with the current implementation.
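
For illustration, here is a minimal sketch (the names are made up, not taken from your code) of a graph where this Op shows up:

import numpy
import theano
import theano.tensor as T

# Embedding lookup: the gradient of a[v]-style indexing is accumulated
# through (Gpu)AdvancedIncSubtensor1.
emb = theano.shared(numpy.random.randn(10000, 64).astype('float32'), name='emb')
idx = T.ivector('idx')             # word indices of one minibatch
cost = emb[idx].sum()              # advanced indexing a[v]
g = T.grad(cost, emb)              # gradient graph contains AdvancedIncSubtensor1
f = theano.function([idx], g)
theano.printing.debugprint(f)      # inspect which Ops ended up in the compiled graph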

--
Pascal

رامي الرفوع

Oct 12, 2012, 5:49:20 AM
to theano...@googlegroups.com
Hi Pascal,

Thanks for the insight, I can see what you are saying in the logs. Indeed, I index a large matrix to look up the word embeddings.

I think I can implement the indexing as a multiplication with a sparse matrix; will that speed things up? As far as I understand, sparse operations will be executed on the CPU.
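
For what it's worth, here is a small numpy/scipy check of the equivalence I have in mind (sizes are arbitrary): each row of the sparse matrix is a one-hot vector selecting one row of C, so the product reproduces the indexed lookup.

import numpy as np
import scipy.sparse as sp

C = np.random.randn(9, 64).astype('float32')   # embedding matrix
v = np.array([4, 3, 7, 7], dtype='int32')      # word indices
data = np.ones(len(v), dtype='float32')        # one nonzero entry per row
indptr = np.arange(len(v) + 1)
M = sp.csr_matrix((data, v, indptr), shape=(len(v), C.shape[0]))
assert np.allclose(M.dot(C), C[v])             # same result as advanced indexing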

Pascal Lamblin

Oct 12, 2012, 2:31:57 PM
to theano...@googlegroups.com
Hi,

On Fri, Oct 12, 2012, رامي الرفوع wrote:
> I think I can implement indexing as a multiplication with a sparse
> matrix, will that speed up things? As far as I understand sparse
> operations will be executed at the CPU.

That's correct, sparse operations will be executed only on the CPU.

Multiplying by a sparse matrix is worth trying, since the implementation
of AdvancedIncSubtensor1 on the CPU is also quite slow. I'm not sure it
will help, but I think it is worth a try.

Hope this helps,
--
Pascal

رامي الرفوع

Oct 13, 2012, 7:03:45 PM
to theano...@googlegroups.com
I am trying to use sparse matrix multiplication instead of the indexing solution; here is my code:


from theano import function
from theano import shared
from theano import tensor as T
from theano import sparse
import scipy.sparse as sp
from numpy import array, asarray, random

from theano.tensor.tests.test_basic import get_numeric_types
print [(str(x), x.num) for x in get_numeric_types()]


size = 9

intX = 'int32'

C = T.matrix('C', dtype=intX)
I = T.matrix('I', dtype=intX)

fI = I.flatten()
data = T.ones_like(fI)                     # one nonzero entry per index
indptr = T.arange(data.shape[0] + 1)       # one entry per row of the CSR matrix

m1 = sparse.CSR(data, fI, indptr, (8, size))   # one-hot selection matrix (8 rows)
m2 = sparse.dot(m1, C)                         # equivalent to indexing C with fI
y = m2.reshape(shape=(2, 4, 9), ndim=3)

f = function(inputs=[I, C], outputs=y)
i = asarray([[4, 3, 7, 7], [2, 8, 4, 5]], dtype=intX)
a = asarray(random.randint(0, 100, (size, size)), dtype=intX)
print a
print '================================='

result = f(i, a)
print result


I get the following error

[('float32', 11), ('float64', 12), ('int16', 3), ('int32', 5), ('int32', 7), ('int64', 9), ('int8', 1), ('uint16', 4), ('uint32', 6), ('uint32', 8), ('uint64', 10), ('uint8', 2)]
(False, False)
<class 'theano.tensor.basic.TensorVariable'>
DimShuffle{1,0}.0
2
[[29 16 54 40 75 40  5 39 39]
 [50  6 59 45 47 92 86 18 68]
 [ 1 25 72 81 69 22 46 21 51]
 [53 97 64 26 47 74 13 20 78]
 [34 40 28 16 13 32 89  7 27]
 [95 82  4 25 49 57 99 50  7]
 [ 4 55  6  6 67 53 69  0  1]
 [71 20 39 55 53 92 67 11 35]
 [ 6 75 10 92 48 51 73 43 82]]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-137-8377cf2e4203> in <module>()
     49 print '================================='
     50 
---> 51 result = f(i, a)
     52 print result
     53 

/usr/local/lib/python2.7/dist-packages/Theano-0.6.0rc1-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    672                 # the C VM needs this because the exception manipulation
    673                 # done by raise_with_op is not implemented in C.
--> 674                 gof.vm.raise_with_op(self.fn.nodes[self.fn.position_of_error])
    675             else:
    676                 # old-style linkers raise their own exceptions

/usr/local/lib/python2.7/dist-packages/Theano-0.6.0rc1-py2.7.egg/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    666         t0_fn = time.time()
    667         try:
--> 668             outputs = self.fn()
    669         except Exception:
    670             if hasattr(self.fn, 'position_of_error'):

TypeError: expected type_num 7 (NPY_INT32) got 5

=================================

رامي الرفوع

Oct 14, 2012, 12:21:34 AM
to theano...@googlegroups.com
OK, I fixed this issue by assigning
intX = 'int64'
Actually, I do not understand why that works.

I am now getting a different error when I try to run the code on the GPU; this exception is thrown when I execute the sparse dot operation:
NotImplementedError: ('this function should only be called on *variables* (of type sparse.SparseType or tensor.TensorType), not,', Weights_C)

Weights_C is <class 'theano.sandbox.cuda.var.CudaNdarraySharedVariable'> when I use the GPU
Weights_C is <class 'theano.tensor.sharedvar.TensorSharedVariable'> when I use the CPU

Shouldn't the sparse dot operation be done by the CPU?

Pascal Lamblin

Oct 15, 2012, 2:32:18 PM
to theano...@googlegroups.com
On Sat, Oct 13, 2012, رامي الرفوع wrote:
> I am trying to use the sparse matrices multiplication instead of the
> indexing solution, here is my code
>
> [...]
>
> I get the following error
> TypeError: expected type_num 7 (NPY_INT32) got 5

Thanks for reporting that bug, I just submitted a fix:
https://github.com/Theano/Theano/pull/1015

When it's accepted, you can update Theano to the latest development
version, and that problem should be solved. Please let us know if it is
not the case.
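
For reference, one common way to do that, assuming a pip-based install, is something like:

pip install --upgrade --no-deps git+https://github.com/Theano/Theano.git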

Now I'll try to fix the other ones you reported :)
--
Pascal

رامي الرفوع

Oct 22, 2012, 11:19:26 AM
to theano...@googlegroups.com
@Pascal

I noticed that when I increase the size of the matrix I am indexing from 10000*64 to 100000*64, I get a slowdown by a factor of 10! Even if the operation is slow, I do not understand why its cost is a function of the matrix size. Isn't this supposed to be random memory access? I would expect some slowdown because of caching effects, but not such a dramatic one!
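
To make the comparison concrete, here is roughly the kind of micro-benchmark I mean (a simplified sketch with made-up sizes, not my actual training code):

import time
import numpy
import theano
import theano.tensor as T

def time_index_grad(n_rows, n_cols=64, n_idx=128, n_calls=100):
    # Time the gradient of an indexed lookup for a given matrix size.
    W = theano.shared(numpy.zeros((n_rows, n_cols), dtype='float32'), name='W')
    idx = T.ivector('idx')
    f = theano.function([idx], T.grad(W[idx].sum(), W))
    batch = numpy.random.randint(0, n_rows, size=n_idx).astype('int32')
    t0 = time.time()
    for _ in range(n_calls):
        f(batch)
    return (time.time() - t0) / n_calls

for n in (10000, 100000):
    print n, time_index_grad(n)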


Regards,