Hello community,
I am trying to use the memcheck tool from Nsight (Visual Studio 2015) to debug
the CUB function "cub::DeviceScan::InclusiveSum()", and it returns the
following error (GTX 1070, Nsight 5.3, CUDA 8.0.61,
NVIDIA driver 384.94):
CUDA module loaded: 1ca9e7ba940 kernel.cu.obj
Internal debugger error occurred while attempting to launch
_ZN3cub16DeviceScanKernelINS_12DispatchScanIPiS2_NS_3SumENS_8NullTypeEiE18PtxAgentScanPolicyES2_S2_NS_13ScanTileStateIiLb1EEES3_S4_iEEvT0_T1_T2_iT3_T4_T5_
in CUcontext 0x1ca8f1433c0,
CUmodule 0x1ca9e7ba940:
code patching failed for unknown reason.
All breakpoints for function
_ZN3cub16DeviceScanKernelINS_12DispatchScanIPiS2_NS_3SumENS_8NullTypeEiE18PtxAgentScanPolicyES2_S2_NS_13ScanTileStateIiLb1EEES3_S4_iEEvT0_T1_T2_iT3_T4_T5_
have been removed.
I have also tested it on a laptop with a 960M, with the same results. However, if I run the memcheck utility from cmd, the program works. Even so, I still get some memory leak errors:
========= Leaked 767 bytes at 0x500400400
========= Saved host backtrace up to driver entry point at cudaMalloc time
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2D + 0x1ad677) [0x1babbb]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaGetExportTable + 0xb5ce) [0x1adde]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (_cudaInitManagedRuntime + 0x42ce) [0x59ae]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaMalloc + 0xde) [0x22b0e]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (main + 0x4e1) [0x1e21]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (__scrt_common_main_seh + 0x11d) [0x25f1]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Leaked 484 bytes at 0x500400200
========= Saved host backtrace up to driver entry point at cudaMalloc time
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2D + 0x1ad677) [0x1babbb]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaGetExportTable + 0xb5ce) [0x1adde]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (_cudaInitManagedRuntime + 0x42ce) [0x59ae]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaMalloc + 0xde) [0x22b0e]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (main + 0x475) [0x1db5]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (__scrt_common_main_seh + 0x11d) [0x25f1]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Leaked 484 bytes at 0x500400000
========= Saved host backtrace up to driver entry point at cudaMalloc time
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2D + 0x1ad677) [0x1babbb]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaGetExportTable + 0xb5ce) [0x1adde]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (_cudaInitManagedRuntime + 0x42ce) [0x59ae]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaMalloc + 0xde) [0x22b0e]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (main + 0x463) [0x1da3]
========= Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (__scrt_common_main_seh + 0x11d) [0x25f1]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= LEAK SUMMARY: 1735 bytes leaked in 3 allocations
The code is quite simple, and I can paste all of it if you want to have a look. It just creates an array of 121 elements and tries to execute InclusiveSum on it:
cudaMalloc(&d_in, sizeof(int)*NUM_ITEMS);
cudaMalloc(&d_out, sizeof(int)*NUM_ITEMS);
cudaMemcpy(d_in, h_in, sizeof(int)*NUM_ITEMS, cudaMemcpyHostToDevice);
// This first call determines the size of the temporary storage needed by the algorithm
CubDebugExit(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, NUM_ITEMS));
// Allocate temporary storage for inclusive prefix sum
CubDebugExit(cudaMalloc(&d_temp_storage, temp_storage_bytes));
// Run inclusive prefix sum
CubDebugExit(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, NUM_ITEMS));
cudaMemcpy(h_out, d_out, sizeof(int)*NUM_ITEMS, cudaMemcpyDeviceToHost);
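For reference, a self-contained sketch of the snippet above could look like the following. It is an assumption-laden reconstruction, not the original program: the input is filled with ones, `CHECK_CUDA` and `inclusive_sum_on_device` are hypothetical helper names, and the `cudaFree` and `cudaDeviceReset` calls at the end are additions that do not appear in the snippet. If the three buffers are never freed, that would be consistent with the three leaked allocations in the report (484 bytes is 121 * sizeof(int), twice, plus 767 bytes of temporary storage).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical helper (not part of CUB): abort with a message on any CUDA error.
#define CHECK_CUDA(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical wrapper: runs cub::DeviceScan::InclusiveSum on a host vector
// and releases every device allocation before returning.
std::vector<int> inclusive_sum_on_device(const std::vector<int>& h_in)
{
    const int num_items = static_cast<int>(h_in.size());
    const size_t bytes = sizeof(int) * h_in.size();
    std::vector<int> h_out(h_in.size());

    int* d_in = nullptr;
    int* d_out = nullptr;
    CHECK_CUDA(cudaMalloc(&d_in, bytes));
    CHECK_CUDA(cudaMalloc(&d_out, bytes));
    CHECK_CUDA(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));

    // With d_temp_storage == nullptr, this call only writes the required
    // size into temp_storage_bytes; no scan is performed yet.
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    CHECK_CUDA(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                             d_in, d_out, num_items));

    // Allocate the temporary storage and run the actual scan.
    CHECK_CUDA(cudaMalloc(&d_temp_storage, temp_storage_bytes));
    CHECK_CUDA(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                             d_in, d_out, num_items));
    CHECK_CUDA(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));

    // Added for this sketch: free all three device buffers.
    CHECK_CUDA(cudaFree(d_temp_storage));
    CHECK_CUDA(cudaFree(d_out));
    CHECK_CUDA(cudaFree(d_in));
    return h_out;
}

int main()
{
    const int NUM_ITEMS = 121;           // array size mentioned in the post
    std::vector<int> h_in(NUM_ITEMS, 1); // assumed input: all ones
    std::vector<int> h_out = inclusive_sum_on_device(h_in);

    // The inclusive sum of all ones is 1, 2, ..., NUM_ITEMS.
    int status = EXIT_SUCCESS;
    for (int i = 0; i < NUM_ITEMS; ++i) {
        if (h_out[i] != i + 1) {
            std::fprintf(stderr, "mismatch at %d: got %d\n", i, h_out[i]);
            status = EXIT_FAILURE;
            break;
        }
    }

    // Added for this sketch: tear down the context before exit so that
    // leak checking can report on a fully cleaned-up process.
    CHECK_CUDA(cudaDeviceReset());
    return status;
}
```

This sketch only targets the leak report from the command-line run. It is not expected to change the "code patching failed" error raised by the Nsight debugger.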
Any ideas on why it fails in the Nsight debugger when memcheck is enabled? I need this to work because I use InclusiveSum in a much bigger application, and this is slowing me down since I can't use memcheck on other parts of the algorithm.
Best regards to all of you and thanks in advance for your help :)