Hello,
I wrote a GPU Op that computes the dot product of the inputs and weights of a Dense layer in 8 bits and returns the result.
I compared my GPU Op against the T.dot() Op implemented in Theano.
I am getting an improvement in the time spent inside the CUDA kernel that computes the dot products.
However, I am losing this improvement to Theano/Python call overhead.
When I use T.dot() to compute the dot products of the inputs and kernels, the time spent calling the T.dot() GPU Op is as follows (profiled with cProfile):
it takes 0.004 s, i.e. 4 ms, to call the GPU Op behind Theano's T.dot(). [See the cProfile log below.]
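For reference, the numbers above come from profiling a single call into the compiled Theano function. DenseGPU.py is not shown here, so the following is only a minimal sketch of that kind of benchmark; the shapes, dtypes and variable names are placeholders, not my actual script:

# Minimal sketch of the T.dot() benchmark (placeholder shapes and names).
import cProfile
import numpy as np
import theano
import theano.tensor as T

x = T.fmatrix('x')          # inputs of the Dense layer
w = T.fmatrix('w')          # weights (kernel) of the Dense layer
f = theano.function([x, w], T.dot(x, w))   # graph using Theano's own dot Op

a = np.random.rand(1024, 1024).astype('float32')   # placeholder data
b = np.random.rand(1024, 1024).astype('float32')

# cProfile times the Python-side call into the compiled Theano function
# (function_module.py:725(__call__) in the log above).
cProfile.run('f(a, b)')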
>>>>>>>> nvprof python DenseGPU.py
Couldn't import dot_parser, loading of dot files will not be possible.
==32564== NVPROF is profiling process 32564, command: python DenseGPU.py
Using cuDNN version 5110 on context None
Preallocating 3351/11172 Mb (0.300000) on cuda0
Mapped name None to device cuda0: Graphics Device (0000:02:00.0)
25 function calls in 0.004 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.004 0.004 <string>:1(<module>)
1 0.004 0.004 0.004 0.004 function_module.py:725(__call__)
1 0.000 0.000 0.000 0.000 function_module.py:753(restore_defaults)
1 0.000 0.000 0.000 0.000 type.py:332(<lambda>)
1 0.000 0.000 0.000 0.000 type.py:74(filter)
3 0.000 0.000 0.000 0.000 {getattr}
1 0.000 0.000 0.000 0.000 {isinstance}
9 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {time.time}
1 0.000 0.000 0.000 0.000 {zip}
==32564== Profiling application: python DenseGPU.py
==32564== Profiling result:
Time(%) Time Calls Avg Min Max Name
62.39% 1.6711ms 5 334.22us 960ns 1.2515ms [CUDA memcpy HtoD]
25.74% 689.34us 1 689.34us 689.34us 689.34us [CUDA memcpy DtoH]
10.88% 291.44us 1 291.44us 291.44us 291.44us maxwell_sgemm_128x64_raggedMn_nn_splitK
0.83% 22.337us 1 22.337us 22.337us 22.337us elem
0.13% 3.4880us 1 3.4880us 3.4880us 3.4880us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.02% 608ns 1 608ns 608ns 608ns [CUDA memset]
==32564== API calls:
Time(%) Time Calls Avg Min Max Name
36.04% 667.35ms 8 83.418ms 14.778us 667.23ms cudaStreamCreateWithFlags
24.80% 459.16ms 3 153.05ms 829ns 459.16ms cudaFree
19.14% 354.32ms 361 981.48us 3.3530us 9.0190ms cuModuleUnload
16.22% 300.28ms 1 300.28ms 300.28ms 300.28ms cuDevicePrimaryCtxRetain
1.55% 28.606ms 11 2.6006ms 2.3203ms 2.9228ms cuLinkAddData
0.78% 14.458ms 1088 13.288us 151ns 3.7597ms cuDeviceGetAttribute
0.45% 8.4092ms 12 700.77us 236.48us 4.6565ms cuDeviceTotalMem
0.13% 2.4803ms 13 190.79us 38.138us 1.7543ms cuDeviceGetName
0.13% 2.4603ms 1 2.4603ms 2.4603ms 2.4603ms cuMemAlloc
0.12% 2.1670ms 3 722.33us 22.926us 1.6050ms cuMemcpyHtoDAsync
0.11% 1.9854ms 11 180.49us 136.27us 259.72us cuModuleLoadData
0.09% 1.5859ms 11 144.18us 45.212us 1.0008ms cuLinkCreate
0.08% 1.5308ms 1 1.5308ms 1.5308ms 1.5308ms cuMemcpyDtoHAsync
0.08% 1.4489ms 11 131.72us 113.31us 155.69us cuLinkComplete
0.08% 1.4176ms 6 236.27us 8.4930us 718.30us cudaMalloc
0.07% 1.3875ms 1 1.3875ms 1.3875ms 1.3875ms cuMemAllocHost
0.07% 1.3661ms 4 341.52us 260.87us 408.33us cuMemGetInfo
0.03% 524.74us 1 524.74us 524.74us 524.74us cudaGetDeviceProperties
0.01% 103.02us 2 51.511us 29.444us 73.578us cudaLaunch
0.00% 81.883us 1 81.883us 81.883us 81.883us cudaMemsetAsync
0.00% 36.982us 2 18.491us 15.203us 21.779us cudaMemcpy
0.00% 34.965us 40 874ns 513ns 6.4420us cudaEventCreateWithFlags
0.00% 33.551us 35 958ns 200ns 5.8240us cuCtxPushCurrent
0.00% 31.642us 1 31.642us 31.642us 31.642us cuStreamCreate
0.00% 25.204us 11 2.2910us 639ns 15.817us cuModuleGetFunction
0.00% 23.593us 21 1.1230us 145ns 15.139us cuDeviceGet
0.00% 18.320us 1 18.320us 18.320us 18.320us cuLaunchKernel
0.00% 18.051us 14 1.2890us 323ns 3.5380us cuEventCreate
0.00% 17.808us 77 231ns 173ns 806ns cudaGetErrorString
0.00% 17.784us 35 508ns 210ns 1.3470us cuCtxPopCurrent
0.00% 15.543us 32 485ns 319ns 1.6580us cudaDeviceGetAttribute
0.00% 13.003us 3 4.3340us 1.9130us 9.0430us cudaGetDevice
0.00% 11.221us 30 374ns 190ns 3.3590us cudaSetupArgument
0.00% 8.4200us 3 2.8060us 1.0800us 5.0740us cuInit
0.00% 8.1550us 11 741ns 596ns 920ns cuLinkDestroy
0.00% 6.7750us 1 6.7750us 6.7750us 6.7750us cudaEventQuery
0.00% 5.9020us 17 347ns 187ns 559ns cuCtxGetDevice
0.00% 4.4640us 6 744ns 360ns 1.3950us cuEventDestroy
0.00% 4.2440us 1 4.2440us 4.2440us 4.2440us cudaGetDeviceCount
0.00% 4.1700us 1 4.1700us 4.1700us 4.1700us cudaEventRecord
0.00% 4.0720us 5 814ns 200ns 2.7980us cuDeviceGetCount
0.00% 3.6730us 2 1.8360us 982ns 2.6910us cudaConfigureCall
0.00% 2.4470us 1 2.4470us 2.4470us 2.4470us cuDevicePrimaryCtxGetState
0.00% 2.2840us 2 1.1420us 991ns 1.2930us cuDevicePrimaryCtxRelease
0.00% 2.0130us 1 2.0130us 2.0130us 2.0130us cuDeviceGetPCIBusId
0.00% 1.8440us 3 614ns 452ns 769ns cuDriverGetVersion
0.00% 1.3970us 1 1.3970us 1.3970us 1.3970us cuDevicePrimaryCtxSetFlags
0.00% 1.2740us 1 1.2740us 1.2740us 1.2740us cuFuncGetAttribute
0.00% 1.1250us 2 562ns 483ns 642ns cudaGetLastError
0.00% 430ns 1 430ns 430ns 430ns cudaDriverGetVersion
Then I ran my own GPU Op performing the same dot products: it took 19 ms to call the Op. [See the cProfile log below.]
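The second run uses the same kind of script with my Op swapped in for T.dot(). Again only a sketch: `my_gpu_ops` and `dense_dot_int8` are hypothetical names standing in for my own module and Op. Internally the Op quantizes the float inputs to int8 (Float2Int), runs the int8 GEMM kernel (maxwell_igemm_int8_128x128_ldg4_nn) and converts the result back to float (Int2Float), as visible in the nvprof kernel list below.

# Same benchmark, with the custom 8-bit Op in place of T.dot().
import cProfile
import theano
import theano.tensor as T
from my_gpu_ops import dense_dot_int8   # hypothetical module/Op name

x = T.fmatrix('x')
w = T.fmatrix('w')
f_int8 = theano.function([x, w], dense_dot_int8(x, w))

cProfile.run('f_int8(a, b)')   # a, b: same arrays as in the T.dot() run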
d1230@linse3:~/no_backup/d1230/Keras_Theano_GPUop/cudnn_conv_int8-master/src/NEW_GPUARRAY_OP> nvprof python DenseGPU.py
Couldn't import dot_parser, loading of dot files will not be possible.
==32695== NVPROF is profiling process 32695, command: python DenseGPU.py
Using cuDNN version 5110 on context None
Preallocating 3351/11172 Mb (0.300000) on cuda0
Mapped name None to device cuda0: Graphics Device (0000:02:00.0)
25 function calls in 0.019 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
1 0.019 0.019 0.019 0.019 function_module.py:725(__call__)
1 0.000 0.000 0.000 0.000 function_module.py:753(restore_defaults)
1 0.000 0.000 0.000 0.000 type.py:332(<lambda>)
1 0.000 0.000 0.000 0.000 type.py:74(filter)
3 0.000 0.000 0.000 0.000 {getattr}
1 0.000 0.000 0.000 0.000 {isinstance}
9 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {time.time}
1 0.000 0.000 0.000 0.000 {zip}
==32695== Profiling application: python DenseGPU.py
==32695== Profiling result:
Time(%) Time Calls Avg Min Max Name
52.25% 810.60us 6 135.10us 704ns 413.24us [CUDA memcpy HtoD]
36.11% 560.15us 1 560.15us 560.15us 560.15us [CUDA memcpy DtoH]
6.20% 96.197us 1 96.197us 96.197us 96.197us maxwell_igemm_int8_128x128_ldg4_nn
2.31% 35.873us 2 17.936us 17.856us 18.017us Float2Int(float*, char*, int)
1.46% 22.721us 1 22.721us 22.721us 22.721us Int2Float(int*, float*, int)
1.46% 22.626us 1 22.626us 22.626us 22.626us elem
0.20% 3.1690us 1 3.1690us 3.1690us 3.1690us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
==32695== API calls:
Time(%) Time Calls Avg Min Max Name
37.59% 705.71ms 8 88.214ms 17.189us 705.55ms cudaStreamCreateWithFlags
24.17% 453.83ms 7 64.833ms 894ns 453.29ms cudaFree
16.52% 310.09ms 1 310.09ms 310.09ms 310.09ms cuDevicePrimaryCtxRetain
16.01% 300.52ms 362 830.16us 3.8400us 11.332ms cuModuleUnload
1.67% 31.409ms 11 2.8554ms 45.297us 30.890ms cuLinkCreate
1.43% 26.840ms 11 2.4400ms 1.9647ms 2.6699ms cuLinkAddData
0.82% 15.310ms 1436 10.661us 144ns 1.9169ms cuDeviceGetAttribute
0.37% 7.0063ms 12 583.86us 6.6600us 4.3589ms cudaMalloc
0.33% 6.2594ms 16 391.21us 225.79us 661.53us cuDeviceTotalMem
0.23% 4.3428ms 5 868.55us 8.3080us 4.2541ms cudaLaunch
0.21% 3.9707ms 17 233.57us 37.674us 2.3135ms cuDeviceGetName
0.14% 2.6292ms 1 2.6292ms 2.6292ms 2.6292ms cuMemAlloc
0.10% 1.9037ms 11 173.07us 144.70us 250.80us cuModuleLoadData
0.09% 1.6658ms 1 1.6658ms 1.6658ms 1.6658ms cuMemAllocHost
0.08% 1.5449ms 4 386.24us 318.45us 462.59us cuMemGetInfo
0.08% 1.4213ms 11 129.21us 102.51us 167.79us cuLinkComplete
0.05% 962.41us 3 320.80us 15.468us 500.46us cuMemcpyHtoDAsync
0.04% 836.91us 1 836.91us 836.91us 836.91us cuMemcpyDtoHAsync
0.03% 561.03us 1 561.03us 561.03us 561.03us cudaGetDeviceProperties
0.00% 88.696us 3 29.565us 19.673us 38.608us cudaMemcpy
0.00% 53.644us 56 957ns 552ns 7.5940us cudaEventCreateWithFlags
0.00% 36.819us 34 1.0820us 195ns 7.6330us cuCtxPushCurrent
0.00% 32.182us 43 748ns 344ns 7.7040us cudaDeviceGetAttribute
0.00% 30.169us 39 773ns 173ns 16.499us cudaSetupArgument
0.00% 29.547us 1 29.547us 29.547us 29.547us cuStreamCreate
0.00% 26.006us 4 6.5010us 1.0840us 20.852us cuInit
0.00% 21.522us 1 21.522us 21.522us 21.522us cuLaunchKernel
0.00% 19.461us 4 4.8650us 1.2290us 14.798us cudaGetDevice
0.00% 18.946us 77 246ns 184ns 884ns cudaGetErrorString
0.00% 18.367us 34 540ns 199ns 1.4680us cuCtxPopCurrent
0.00% 17.212us 14 1.2290us 543ns 3.0890us cuEventCreate
0.00% 10.336us 11 939ns 820ns 1.1500us cuModuleGetFunction
0.00% 10.125us 25 405ns 168ns 1.2270us cuDeviceGet
0.00% 8.9540us 5 1.7900us 379ns 3.1970us cudaConfigureCall
0.00% 7.7130us 11 701ns 517ns 900ns cuLinkDestroy
0.00% 5.6440us 17 332ns 188ns 576ns cuCtxGetDevice
0.00% 4.5380us 6 756ns 367ns 1.3760us cuEventDestroy
0.00% 4.2510us 6 708ns 327ns 2.2490us cuDeviceGetCount
0.00% 3.7870us 1 3.7870us 3.7870us 3.7870us cudaGetDeviceCount
0.00% 3.1310us 3 1.0430us 900ns 1.3270us cuDevicePrimaryCtxRelease
0.00% 2.5870us 4 646ns 540ns 892ns cuDriverGetVersion
0.00% 2.5400us 1 2.5400us 2.5400us 2.5400us cudaDriverGetVersion
0.00% 2.0750us 1 2.0750us 2.0750us 2.0750us cuDevicePrimaryCtxGetState
0.00% 1.9600us 1 1.9600us 1.9600us 1.9600us cuDeviceGetPCIBusId
0.00% 1.4150us 1 1.4150us 1.4150us 1.4150us cuDevicePrimaryCtxSetFlags
0.00% 1.1340us 1 1.1340us 1.1340us 1.1340us cuFuncGetAttribute
0.00% 855ns 2 427ns 368ns 487ns cudaGetLastError
If you look at the GPU time in the nvprof logs, you can see that the 8-bit kernel takes about one third of the time of the float kernel (96 us vs 291 us).
However, since my GPU Op takes 19 ms to be called from Python, compared to about 4 ms for Theano's T.dot(), this improvement is not reflected in the Python script where I record the forward-pass time.
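To separate the per-call Python/Theano overhead from the kernel time, I also average over many calls rather than timing a single one. A sketch, assuming f and f_int8 are the compiled functions from the snippets above (Theano's own profiler, THEANO_FLAGS=profile=True, gives a per-Op breakdown as well):

import time

def per_call_ms(fn, a, b, n_iter=1000):
    fn(a, b)                      # warm-up: exclude first-call/compilation cost
    t0 = time.time()
    for _ in range(n_iter):
        fn(a, b)                  # the result is copied back to the host,
                                  # so each call is effectively synchronized
    return 1e3 * (time.time() - t0) / n_iter

print('T.dot() : %.3f ms per call' % per_call_ms(f, a, b))
print('int8 Op : %.3f ms per call' % per_call_ms(f_int8, a, b))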
Can anyone please let me know how I can change my Op so that calling it from Theano takes the same time as the normal T.dot()?
If the call overhead comes down to that level, I will see the improvement end to end.
Thanks in advance.
Cheers!