Re: [theano-dev] User-defined GPU OP for dot product shows unexpected DtoH transfers even during compilation in Theano and takes more time than the T.dot() op

Pascal Lamblin

Nov 15, 2017, 3:44:06 PM
to thean...@googlegroups.com
Hi,

I did not understand all the details, but it looks like the difference
could come from the type of the output variable: if you use your class,
d1_out would be on the GPU; if you use T.dot, T.nnet.sigmoid, and so on,
the output of the function will actually be on the CPU and need to be
transferred back.
You can print(d1_out.type) to see whether it is a TensorType or a GpuArrayType.
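
For illustration, a minimal way to check this (the graph below is only a
stand-in for whatever d1_out is in your scripts):

import numpy as np
import theano
import theano.tensor as T

# Stand-in graph; replace with the d1_out built in DenseGPU.py / DenseGraph.py.
x = T.matrix('x')
W = theano.shared(np.random.randn(1024, 1024).astype('float32'), name='W')
d1_out = T.nnet.sigmoid(T.dot(x, W))

# TensorType: the function will return a numpy array (copied back to host);
# GpuArrayType: the result stays on the GPU.
print(d1_out.type)

forward = theano.function([x], d1_out)
# Every GpuFromHost / HostFromGpu node in the printed graph is one
# HtoD / DtoH copy per call.
theano.printing.debugprint(forward)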

On 2017-11-15 12:52 AM, Adit Bhargav wrote:
> Hello All,
>
> I have written a GPU OP for the new backend (gpuarray) that does the
> dot product in dense layers.
> I am comparing this OP with the T.dot() operation from Theano.
>
> Setup:
>
> 1. Created a dense layer in Theano together with the GPU OP, in file DenseGPU.py
> 2. Compile the file up to the line forward = theano.function([x], d1_out)
> and comment out the lines after it.
> 3. I see several DtoH and HtoD transfers during compilation, as shown in
> the nvprof log below:
>     ==4996== Profiling result:
> Time(%)      Time     Calls       Avg       Min       Max  Name
>  78.81%  493.41ms         3  164.47ms  97.138ms  201.56ms  [CUDA memcpy
> DtoH]
>  21.19%  132.63ms         5  26.526ms     960ns  96.613ms  [CUDA memcpy
> HtoD]
>   0.00%  3.0080us         1  3.0080us  3.0080us  3.0080us  void
> gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8,
> int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const
> *, float const *, float, float, int)
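>
> (To cross-check the nvprof numbers, Theano's own profiler can also break
> the time down per Op, including the transfer Ops. A minimal sketch with a
> toy graph, not the real DenseGPU.py graph:)
>
> import numpy as np
> import theano
> import theano.tensor as T
>
> x = T.matrix('x')
> W = theano.shared(np.random.randn(1024, 1024).astype('float32'))
>
> forward = theano.function([x], T.dot(x, W), profile=True)
> forward(np.random.randn(1024, 1024).astype('float32'))
> # Prints a per-Op timing table; the GpuFromHost / HostFromGpu rows are the
> # HtoD / DtoH copies done on each call.
> forward.profile.summary()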
>
> 4. Now build a similar dense layer in Theano, but using the T.dot()
> operation instead of my GPU OP. This is in file DenseGraph.py.
> 5. Compile this file up to the line forward = theano.function([x], d1_out)
> and comment out the lines below:
> '''
> data = np.random.randint(-127,127,(8192,8192)).astype('float32')
> start = timeit.default_timer()
> a = forward(data)
> end = timeit.default_timer()
> print a.shape
> print (end-start)
> np.save('Forward_Default_API.npy', a)
> '''
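>
> (As an aside, when timing the forward pass it helps to do one warm-up call
> first and to force the result back to host memory before stopping the
> timer, so one-off setup is not counted. A sketch that reuses the forward
> function compiled above:)
>
> import timeit
> import numpy as np
>
> data = np.random.randint(-127, 127, (8192, 8192)).astype('float32')
> forward(data)                      # warm-up call, not timed
> start = timeit.default_timer()
> # np.asarray copies the result to host memory if it is still on the GPU
> # (it is a no-op when the output is already a numpy array).
> a = np.asarray(forward(data))
> end = timeit.default_timer()
> print(end - start)
>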
> 6. You will see the nvprof logs with only HtoD transfers and no DtoH
> transfers:
>
> ==5456== Profiling result:
> Time(%)      Time     Calls       Avg       Min       Max  Name
>  99.99%  32.711ms         4  8.1776ms     960ns  32.705ms  [CUDA memcpy
> HtoD]
>   0.01%  3.2640us         1  3.2640us  3.2640us  3.2640us  void
> gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8,
> int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const
> *, float const *, float, float, int)
>
> 7. Now uncomment the lines above and call the forward pass WITHOUT my
> GPU OP, in file DenseGraph.py.
> You will see that T.dot() performs the dot product: DtoH increases from 0
> to 1 (which I think is my output) and HtoD from 4 to 5 (which is my input
> to forward() being copied to the device).
> This works as expected, since HtoD and DtoH each increase by exactly 1
> for one dot-product operation.
>
> ==5662== Profiling result:
> Time(%)      Time     Calls       Avg       Min       Max  Name
>  49.20%  189.14ms         1  189.14ms  189.14ms  189.14ms  [CUDA memcpy
> DtoH]
>  30.89%  118.74ms         1  118.74ms  118.74ms  118.74ms
> maxwell_sgemm_128x128_raggedMn_nn
>  19.50%  74.966ms         5  14.993ms     960ns  37.633ms  [CUDA memcpy
> HtoD]
>   0.41%  1.5731ms         1  1.5731ms  1.5731ms  1.5731ms  elem
>   0.00%  3.1680us         1  3.1680us  3.1680us  3.1680us  void
> gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8,
> int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const
> *, float const *, float, float, int)
>
> 8. Now uncomment the same lines above and call the forward pass WITH my
> GPU OP, in file DenseGPU.py.
> You will see that my GPU OP performs the dot product: DtoH increases from
> 3 (see the topmost nvprof log) to 4 (which I think is my output) and HtoD
> from 5 to 7.
> I don't know why there are 2 more HtoD transfers after I call the forward()
> function, although the output DtoH count increased by only 1.
>
> But in the first place, I don't understand why my GPU OP shows 3 DtoH
> transfers, compared to none in the T.dot() case (see points 3 and 6 above
> for the nvprof logs).
>
> ==5881== Profiling result:
> Time(%)      Time     Calls       Avg       Min       Max  Name
>  59.48%  595.02ms         4  148.76ms  96.881ms  194.85ms  [CUDA memcpy
> DtoH]
>  37.51%  375.22ms         7  53.602ms     960ns  304.96ms  [CUDA memcpy
> HtoD]
>   2.86%  28.587ms         1  28.587ms  28.587ms  28.587ms
> maxwell_igemm_int8_128x128_ldg4_nn
>   0.16%  1.5613ms         1  1.5613ms  1.5613ms  1.5613ms  elem
>   0.00%  7.1680us         3  2.3890us  1.8560us  3.4240us  [CUDA memset]
>   0.00%  3.0400us         1  3.0400us  3.0400us  3.0400us  void
> gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8,
> int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const
> *, float const *, float, float, int)
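>
> (One way to pin down where the extra copies come from is to count the
> transfer nodes in each compiled graph. A small helper sketch; the class
> names are taken from the gpuarray backend as I understand it:)
>
> from theano.gpuarray.basic_ops import GpuFromHost, HostFromGpu
>
> def count_transfers(f):
>     # Count the transfer nodes that run on every call of a compiled function.
>     nodes = f.maker.fgraph.toposort()
>     h2d = sum(isinstance(n.op, GpuFromHost) for n in nodes)
>     d2h = sum(isinstance(n.op, HostFromGpu) for n in nodes)
>     return h2d, d2h
>
> # e.g. print(count_transfers(forward)) for the DenseGPU.py and DenseGraph.py
> # versions. Copies made during compilation or inside an Op's own C/CUDA code
> # do not show up here, only the ones baked into the graph.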
>
>
> Because of this extra transfer time, the forward pass is slower with my
> GPU OP than with the T.dot() operation.
> The aim of this GPU OP is to do 8-bit dot products with the DP4A
> instruction on NVIDIA Pascal architectures.
>
> My problem is that I am not able to get any speed-up in the forward pass
> (inference) for the dense layer output.
>
> Can anyone please tell me what mistake I am making? And how can I get only
> 1 HtoD and 1 DtoH transfer per dot-product operation?
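>
> (The transfer pattern I am after, sketched with plain Theano pieces and
> made-up sizes: the weights live on the GPU as a shared variable, so each
> call should need only one HtoD copy for the input and one DtoH copy for
> the returned result.)
>
> import numpy as np
> import theano
> import theano.tensor as T
>
> x = T.matrix('x')
> # A float32 shared variable is stored on the GPU (with device=cuda*), so it
> # is uploaded once at creation time, not on every call.
> W = theano.shared(np.random.randn(8192, 8192).astype('float32'), name='W')
>
> forward = theano.function([x], T.dot(x, W))
>
> data = np.random.randint(-127, 127, (8192, 8192)).astype('float32')
> out = forward(data)   # expected: one HtoD for data, one DtoH for the result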
>
> I attach my Python, C, and CUDA files:
>
> 1. DenseGraph.py
> 2. DenseGPU.py
> 3. cublas.c (needed to run my GPU OP)
> 4. cublasKernel.cu (a PTX file must be generated from it to run cublas.c;
> use the command nvcc --ptx cublasKernel.cu)
>

--
Pascal Lamblin

Adit Bhargav

Nov 17, 2017, 1:32:11 PM
to theano-dev
Hello
Thanks for the reply.
I have solved the issue now; it was caused by device driver calls I was
making in the CUDA file.

Thanks anyway for your time.

BR
Adit