Hello,
I wrote a GPU Op that computes the dot product of the inputs and weights of a Dense layer in 8 bits and returns the result.
I compared my GPU Op against the T.dot() Op implemented in Theano.
I am getting an improvement in the time spent inside the CUDA kernel that computes the dot products.
However, I am losing this improvement to Theano/Python call overhead.
When I use T.dot() to compute the dot products of the inputs and kernels, the time spent calling the T.dot() GPU Op is as follows (profiled with cProfile):
it takes 0.004 s, i.e. 4 ms, to call the GPU Op behind Theano's T.dot(). [See the cProfile log below.]
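For reference, the numbers above come from profiling a single call into the compiled Theano function. DenseGPU.py is not shown here, so the following is only a minimal sketch of that kind of benchmark; the shapes, dtypes and variable names are placeholders, not my actual script:

# Minimal sketch of the T.dot() benchmark (placeholder shapes and names).
import cProfile
import numpy as np
import theano
import theano.tensor as T

x = T.fmatrix('x')          # inputs of the Dense layer
w = T.fmatrix('w')          # weights (kernel) of the Dense layer
f = theano.function([x, w], T.dot(x, w))   # graph using Theano's own dot Op

a = np.random.rand(1024, 1024).astype('float32')   # placeholder data
b = np.random.rand(1024, 1024).astype('float32')

# cProfile times the Python-side call into the compiled Theano function
# (function_module.py:725(__call__) in the log above).
cProfile.run('f(a, b)')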
>>>>>>>> nvprof python DenseGPU.py
Couldn't import dot_parser, loading of dot files will not be possible.
==32564== NVPROF is profiling process 32564, command: python DenseGPU.py
Using cuDNN version 5110 on context None
Preallocating 3351/11172 Mb (0.300000) on cuda0
Mapped name None to device cuda0: Graphics Device (0000:02:00.0)
25 function calls in 0.004 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.004 0.004 <string>:1(<module>)
1 0.004 0.004 0.004 0.004 function_module.py:725(__call__)
1 0.000 0.000 0.000 0.000 function_module.py:753(restore_defaults)
1 0.000 0.000 0.000 0.000 type.py:332(<lambda>)
1 0.000 0.000 0.000 0.000 type.py:74(filter)
3 0.000 0.000 0.000 0.000 {getattr}
1 0.000 0.000 0.000 0.000 {isinstance}
9 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {time.time}
1 0.000 0.000 0.000 0.000 {zip}
==32564== Profiling application: python DenseGPU.py
==32564== Profiling result:
Time(%) Time Calls Avg Min Max Name
62.39% 1.6711ms 5 334.22us 960ns 1.2515ms [CUDA memcpy HtoD]
25.74% 689.34us 1 689.34us 689.34us 689.34us [CUDA memcpy DtoH]
10.88% 291.44us 1 291.44us 291.44us 291.44us maxwell_sgemm_128x64_raggedMn_nn_splitK
0.83% 22.337us 1 22.337us 22.337us 22.337us elem
0.13% 3.4880us 1 3.4880us 3.4880us 3.4880us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.02% 608ns 1 608ns 608ns 608ns [CUDA memset]
==32564== API calls:
Time(%) Time Calls Avg Min Max Name
36.04% 667.35ms 8 83.418ms 14.778us 667.23ms cudaStreamCreateWithFlags
24.80% 459.16ms 3 153.05ms 829ns 459.16ms cudaFree
19.14% 354.32ms 361 981.48us 3.3530us 9.0190ms cuModuleUnload
16.22% 300.28ms 1 300.28ms 300.28ms 300.28ms cuDevicePrimaryCtxRetain
1.55% 28.606ms 11 2.6006ms 2.3203ms 2.9228ms cuLinkAddData
0.78% 14.458ms 1088 13.288us 151ns 3.7597ms cuDeviceGetAttribute
0.45% 8.4092ms 12 700.77us 236.48us 4.6565ms cuDeviceTotalMem
0.13% 2.4803ms 13 190.79us 38.138us 1.7543ms cuDeviceGetName
0.13% 2.4603ms 1 2.4603ms 2.4603ms 2.4603ms cuMemAlloc
0.12% 2.1670ms 3 722.33us 22.926us 1.6050ms cuMemcpyHtoDAsync
0.11% 1.9854ms 11 180.49us 136.27us 259.72us cuModuleLoadData
0.09% 1.5859ms 11 144.18us 45.212us 1.0008ms cuLinkCreate
0.08% 1.5308ms 1 1.5308ms 1.5308ms 1.5308ms cuMemcpyDtoHAsync
0.08% 1.4489ms 11 131.72us 113.31us 155.69us cuLinkComplete
0.08% 1.4176ms 6 236.27us 8.4930us 718.30us cudaMalloc
0.07% 1.3875ms 1 1.3875ms 1.3875ms 1.3875ms cuMemAllocHost
0.07% 1.3661ms 4 341.52us 260.87us 408.33us cuMemGetInfo
0.03% 524.74us 1 524.74us 524.74us 524.74us cudaGetDeviceProperties
0.01% 103.02us 2 51.511us 29.444us 73.578us cudaLaunch
0.00% 81.883us 1 81.883us 81.883us 81.883us cudaMemsetAsync
0.00% 36.982us 2 18.491us 15.203us 21.779us cudaMemcpy
0.00% 34.965us 40 874ns 513ns 6.4420us cudaEventCreateWithFlags
0.00% 33.551us 35 958ns 200ns 5.8240us cuCtxPushCurrent
0.00% 31.642us 1 31.642us 31.642us 31.642us cuStreamCreate
0.00% 25.204us 11 2.2910us 639ns 15.817us cuModuleGetFunction
0.00% 23.593us 21 1.1230us 145ns 15.139us cuDeviceGet
0.00% 18.320us 1 18.320us 18.320us 18.320us cuLaunchKernel
0.00% 18.051us 14 1.2890us 323ns 3.5380us cuEventCreate
0.00% 17.808us 77 231ns 173ns 806ns cudaGetErrorString
0.00% 17.784us 35 508ns 210ns 1.3470us cuCtxPopCurrent
0.00% 15.543us 32 485ns 319ns 1.6580us cudaDeviceGetAttribute
0.00% 13.003us 3 4.3340us 1.9130us 9.0430us cudaGetDevice
0.00% 11.221us 30 374ns 190ns 3.3590us cudaSetupArgument
0.00% 8.4200us 3 2.8060us 1.0800us 5.0740us cuInit
0.00% 8.1550us 11 741ns 596ns 920ns cuLinkDestroy
0.00% 6.7750us 1 6.7750us 6.7750us 6.7750us cudaEventQuery
0.00% 5.9020us 17 347ns 187ns 559ns cuCtxGetDevice
0.00% 4.4640us 6 744ns 360ns 1.3950us cuEventDestroy
0.00% 4.2440us 1 4.2440us 4.2440us 4.2440us cudaGetDeviceCount
0.00% 4.1700us 1 4.1700us 4.1700us 4.1700us cudaEventRecord
0.00% 4.0720us 5 814ns 200ns 2.7980us cuDeviceGetCount
0.00% 3.6730us 2 1.8360us 982ns 2.6910us cudaConfigureCall
0.00% 2.4470us 1 2.4470us 2.4470us 2.4470us cuDevicePrimaryCtxGetState
0.00% 2.2840us 2 1.1420us 991ns 1.2930us cuDevicePrimaryCtxRelease
0.00% 2.0130us 1 2.0130us 2.0130us 2.0130us cuDeviceGetPCIBusId
0.00% 1.8440us 3 614ns 452ns 769ns cuDriverGetVersion
0.00% 1.3970us 1 1.3970us 1.3970us 1.3970us cuDevicePrimaryCtxSetFlags
0.00% 1.2740us 1 1.2740us 1.2740us 1.2740us cuFuncGetAttribute
0.00% 1.1250us 2 562ns 483ns 642ns cudaGetLastError
0.00% 430ns 1 430ns 430ns 430ns cudaDriverGetVersion
Then I ran my own GPU Op performing the same dot products: it took 19 ms to call the Op. [See the cProfile log below.]
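The second run uses the same kind of script with my Op swapped in for T.dot(). Again only a sketch: `my_gpu_ops` and `dense_dot_int8` are hypothetical names standing in for my own module and Op. Internally the Op quantizes the float inputs to int8 (Float2Int), runs the int8 GEMM kernel (maxwell_igemm_int8_128x128_ldg4_nn) and converts the result back to float (Int2Float), as visible in the nvprof kernel list below.

# Same benchmark, with the custom 8-bit Op in place of T.dot().
import cProfile
import theano
import theano.tensor as T
from my_gpu_ops import dense_dot_int8   # hypothetical module/Op name

x = T.fmatrix('x')
w = T.fmatrix('w')
f_int8 = theano.function([x, w], dense_dot_int8(x, w))

cProfile.run('f_int8(a, b)')   # a, b: same arrays as in the T.dot() run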
d1230@linse3:~/no_backup/d1230/Keras_Theano_GPUop/cudnn_conv_int8-master/src/NEW_GPUARRAY_OP> nvprof python DenseGPU.py
Couldn't import dot_parser, loading of dot files will not be possible.
==32695== NVPROF is profiling process 32695, command: python DenseGPU.py
Using cuDNN version 5110 on context None
Preallocating 3351/11172 Mb (0.300000) on cuda0
Mapped name None to device cuda0: Graphics Device (0000:02:00.0)
25 function calls in 0.019 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
1 0.019 0.019 0.019 0.019 function_module.py:725(__call__)
1 0.000 0.000 0.000 0.000 function_module.py:753(restore_defaults)
1 0.000 0.000 0.000 0.000 type.py:332(<lambda>)
1 0.000 0.000 0.000 0.000 type.py:74(filter)
3 0.000 0.000 0.000 0.000 {getattr}
1 0.000 0.000 0.000 0.000 {isinstance}
9 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {time.time}
1 0.000 0.000 0.000 0.000 {zip}
==32695== Profiling application: python DenseGPU.py
==32695== Profiling result:
Time(%) Time Calls Avg Min Max Name
52.25% 810.60us 6 135.10us 704ns 413.24us [CUDA memcpy HtoD]
36.11% 560.15us 1 560.15us 560.15us 560.15us [CUDA memcpy DtoH]
6.20% 96.197us 1 96.197us 96.197us 96.197us maxwell_igemm_int8_128x128_ldg4_nn
2.31% 35.873us 2 17.936us 17.856us 18.017us Float2Int(float*, char*, int)
1.46% 22.721us 1 22.721us 22.721us 22.721us Int2Float(int*, float*, int)
1.46% 22.626us 1 22.626us 22.626us 22.626us elem
0.20% 3.1690us 1 3.1690us 3.1690us 3.1690us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
==32695== API calls:
Time(%) Time Calls Avg Min Max Name
37.59% 705.71ms 8 88.214ms 17.189us 705.55ms cudaStreamCreateWithFlags
24.17% 453.83ms 7 64.833ms 894ns 453.29ms cudaFree
16.52% 310.09ms 1 310.09ms 310.09ms 310.09ms cuDevicePrimaryCtxRetain
16.01% 300.52ms 362 830.16us 3.8400us 11.332ms cuModuleUnload
1.67% 31.409ms 11 2.8554ms 45.297us 30.890ms cuLinkCreate
1.43% 26.840ms 11 2.4400ms 1.9647ms 2.6699ms cuLinkAddData
0.82% 15.310ms 1436 10.661us 144ns 1.9169ms cuDeviceGetAttribute
0.37% 7.0063ms 12 583.86us 6.6600us 4.3589ms cudaMalloc
0.33% 6.2594ms 16 391.21us 225.79us 661.53us cuDeviceTotalMem
0.23% 4.3428ms 5 868.55us 8.3080us 4.2541ms cudaLaunch
0.21% 3.9707ms 17 233.57us 37.674us 2.3135ms cuDeviceGetName
0.14% 2.6292ms 1 2.6292ms 2.6292ms 2.6292ms cuMemAlloc
0.10% 1.9037ms 11 173.07us 144.70us 250.80us cuModuleLoadData
0.09% 1.6658ms 1 1.6658ms 1.6658ms 1.6658ms cuMemAllocHost
0.08% 1.5449ms 4 386.24us 318.45us 462.59us cuMemGetInfo
0.08% 1.4213ms 11 129.21us 102.51us 167.79us cuLinkComplete
0.05% 962.41us 3 320.80us 15.468us 500.46us cuMemcpyHtoDAsync
0.04% 836.91us 1 836.91us 836.91us 836.91us cuMemcpyDtoHAsync
0.03% 561.03us 1 561.03us 561.03us 561.03us cudaGetDeviceProperties
0.00% 88.696us 3 29.565us 19.673us 38.608us cudaMemcpy
0.00% 53.644us 56 957ns 552ns 7.5940us cudaEventCreateWithFlags
0.00% 36.819us 34 1.0820us 195ns 7.6330us cuCtxPushCurrent
0.00% 32.182us 43 748ns 344ns 7.7040us cudaDeviceGetAttribute
0.00% 30.169us 39 773ns 173ns 16.499us cudaSetupArgument
0.00% 29.547us 1 29.547us 29.547us 29.547us cuStreamCreate
0.00% 26.006us 4 6.5010us 1.0840us 20.852us cuInit
0.00% 21.522us 1 21.522us 21.522us 21.522us cuLaunchKernel
0.00% 19.461us 4 4.8650us 1.2290us 14.798us cudaGetDevice
0.00% 18.946us 77 246ns 184ns 884ns cudaGetErrorString
0.00% 18.367us 34 540ns 199ns 1.4680us cuCtxPopCurrent
0.00% 17.212us 14 1.2290us 543ns 3.0890us cuEventCreate
0.00% 10.336us 11 939ns 820ns 1.1500us cuModuleGetFunction
0.00% 10.125us 25 405ns 168ns 1.2270us cuDeviceGet
0.00% 8.9540us 5 1.7900us 379ns 3.1970us cudaConfigureCall
0.00% 7.7130us 11 701ns 517ns 900ns cuLinkDestroy
0.00% 5.6440us 17 332ns 188ns 576ns cuCtxGetDevice
0.00% 4.5380us 6 756ns 367ns 1.3760us cuEventDestroy
0.00% 4.2510us 6 708ns 327ns 2.2490us cuDeviceGetCount
0.00% 3.7870us 1 3.7870us 3.7870us 3.7870us cudaGetDeviceCount
0.00% 3.1310us 3 1.0430us 900ns 1.3270us cuDevicePrimaryCtxRelease
0.00% 2.5870us 4 646ns 540ns 892ns cuDriverGetVersion
0.00% 2.5400us 1 2.5400us 2.5400us 2.5400us cudaDriverGetVersion
0.00% 2.0750us 1 2.0750us 2.0750us 2.0750us cuDevicePrimaryCtxGetState
0.00% 1.9600us 1 1.9600us 1.9600us 1.9600us cuDeviceGetPCIBusId
0.00% 1.4150us 1 1.4150us 1.4150us 1.4150us cuDevicePrimaryCtxSetFlags
0.00% 1.1340us 1 1.1340us 1.1340us 1.1340us cuFuncGetAttribute
0.00% 855ns 2 427ns 368ns 487ns cudaGetLastError
If you look at the GPU time in the nvprof logs, you can see that the 8-bit kernel takes about one third of the time of the float kernel (96 us vs 291 us).
However, since my GPU Op takes 19 ms to be called from Python, compared to about 4 ms for Theano's T.dot(), this improvement is not reflected in the Python script where I record the forward-pass time.
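To separate the per-call Python/Theano overhead from the kernel time, I also average over many calls rather than timing a single one. A sketch, assuming f and f_int8 are the compiled functions from the snippets above (Theano's own profiler, THEANO_FLAGS=profile=True, gives a per-Op breakdown as well):

import time

def per_call_ms(fn, a, b, n_iter=1000):
    fn(a, b)                      # warm-up: exclude first-call/compilation cost
    t0 = time.time()
    for _ in range(n_iter):
        fn(a, b)                  # the result is copied back to the host,
                                  # so each call is effectively synchronized
    return 1e3 * (time.time() - t0) / n_iter

print('T.dot() : %.3f ms per call' % per_call_ms(f, a, b))
print('int8 Op : %.3f ms per call' % per_call_ms(f_int8, a, b))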
Can anyone please let me know how I can change my Op so that calling it from Theano takes the same time as the normal T.dot()?
If the call overhead comes down to that level, I will see the improvement end to end.
Thanks in advance.
Cheers!