CUB InclusiveSum() fails when debugging with VS2015 Nsight Memcheck enabled

Kar

Jul 26, 2017, 5:03:50 AM
to cub-...@googlegroups.com
Hello community,

I am trying to debug the CUB function "cub::DeviceScan::InclusiveSum()" with the memcheck tool in Nsight for Visual Studio 2015, and it returns the following error (GTX 1070, Nsight 5.3, CUDA 8.0.61, NVIDIA driver 384.94):

CUDA module loaded: 1ca9e7ba940 kernel.cu.obj Internal debugger error occurred while attempting to launch _ZN3cub16DeviceScanKernelINS_12DispatchScanIPiS2_NS_3SumENS_8NullTypeEiE18PtxAgentScanPolicyES2_S2_NS_13ScanTileStateIiLb1EEES3_S4_iEEvT0_T1_T2_iT3_T4_T5_ in CUcontext 0x1ca8f1433c0,
CUmodule 0x1ca9e7ba940: code patching failed for unknown reason. All breakpoints for function _ZN3cub16DeviceScanKernelINS_12DispatchScanIPiS2_NS_3SumENS_8NullTypeEiE18PtxAgentScanPolicyES2_S2_NS_13ScanTileStateIiLb1EEES3_S4_iEEvT0_T1_T2_iT3_T4_T5_ have been removed.

I have also tested it on a laptop with a 960M, with the same results. However, if I run the memcheck utility from cmd, the program works. Even so, I still get some memory leak errors:

========= Leaked 767 bytes at 0x500400400
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2D + 0x1ad677) [0x1babbb]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaGetExportTable + 0xb5ce) [0x1adde]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (_cudaInitManagedRuntime + 0x42ce) [0x59ae]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaMalloc + 0xde) [0x22b0e]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (main + 0x4e1) [0x1e21]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (__scrt_common_main_seh + 0x11d) [0x25f1]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Leaked 484 bytes at 0x500400200
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2D + 0x1ad677) [0x1babbb]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaGetExportTable + 0xb5ce) [0x1adde]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (_cudaInitManagedRuntime + 0x42ce) [0x59ae]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaMalloc + 0xde) [0x22b0e]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (main + 0x475) [0x1db5]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (__scrt_common_main_seh + 0x11d) [0x25f1]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Leaked 484 bytes at 0x500400000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2D + 0x1ad677) [0x1babbb]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaGetExportTable + 0xb5ce) [0x1adde]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (_cudaInitManagedRuntime + 0x42ce) [0x59ae]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\cudart64_80.dll (cudaMalloc + 0xde) [0x22b0e]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (main + 0x463) [0x1da3]
=========     Host Frame:C:\Users\User\Dropbox\PhD\CUDA\2015Version\CUDA_Scan_Cub\x64\Release\CUDA_Scan_Cub.exe (__scrt_common_main_seh + 0x11d) [0x25f1]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= LEAK SUMMARY: 1735 bytes leaked in 3 allocations


The code is quite simple (I can paste the full program if you want to have a look). It just creates an array of 121 elements and tries to execute InclusiveSum on it:

    cudaMalloc(&d_in, sizeof(int)*NUM_ITEMS);
    cudaMalloc(&d_out, sizeof(int)*NUM_ITEMS);
    cudaMemcpy(d_in, h_in, sizeof(int)*NUM_ITEMS, cudaMemcpyHostToDevice);

    // First call: with d_temp_storage set to NULL, this only writes the required size to temp_storage_bytes
    CubDebugExit(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, NUM_ITEMS));
    // Allocate temporary storage for inclusive prefix sum
    CubDebugExit(cudaMalloc(&d_temp_storage, temp_storage_bytes));
    // Run inclusive prefix sum
    CubDebugExit(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, NUM_ITEMS));

    cudaMemcpy(h_out, d_out, sizeof(int)*NUM_ITEMS, cudaMemcpyDeviceToHost);

Any ideas on why it fails in the Nsight debugger when memcheck is enabled? I need this to work because I use InclusiveSum in a much bigger application, and it is slowing me down because I can't use memcheck on other parts of the algorithm.

Best regards to all of you and thanks in advance for your help :)

Carlos.

Robert Remen

Jul 26, 2017, 5:14:16 AM
to cub-users
Hi Carlos!

According to the error message "LEAK SUMMARY: 1735 bytes leaked in 3 allocations", you are leaking memory.
You are allocating device memory with cudaMalloc() and I see no corresponding calls to cudaFree() in your code to free the allocated memory, hence the reported memory leak.

cheers
Robert
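
For reference, here is a minimal sketch of how the snippet from the first post might look with the missing cudaFree() calls added. The function name inclusive_sum_on_device and its error-handling style are assumptions made for illustration, not code from the original program; only the CUDA runtime and cub::DeviceScan::InclusiveSum calls come from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical helper (not from the original post): computes an inclusive
// prefix sum of n ints on the device and frees every allocation it made.
cudaError_t inclusive_sum_on_device(const int* h_in, int* h_out, int n)
{
    int*   d_in = NULL;
    int*   d_out = NULL;
    void*  d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    cudaError_t err = cudaMalloc(&d_in, sizeof(int) * n);
    if (err == cudaSuccess)
        err = cudaMalloc(&d_out, sizeof(int) * n);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, h_in, sizeof(int) * n, cudaMemcpyHostToDevice);

    // With d_temp_storage == NULL, this call does no scan work; it only
    // writes the required temporary storage size to temp_storage_bytes.
    if (err == cudaSuccess)
        err = cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, n);
    if (err == cudaSuccess)
        err = cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: runs the inclusive prefix sum.
    if (err == cudaSuccess)
        err = cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, n);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, sizeof(int) * n, cudaMemcpyDeviceToHost);

    // Release all three allocations, matching the three leaks in the report.
    // cudaFree(NULL) performs no operation, so this is safe on error paths.
    cudaFree(d_temp_storage);
    cudaFree(d_out);
    cudaFree(d_in);
    return err;
}
```

Calling it with h_in = {1, 2, 3, 4, 5} should fill h_out with {1, 3, 6, 10, 15}. Freeing unconditionally at the end keeps the cleanup in one place, so an early failure does not leave a buffer behind.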

Kar

Jul 26, 2017, 5:24:56 AM
to cub-...@googlegroups.com
Robert,

Thanks for answering. Indeed, I forgot to call cudaFree; adding it fixes the leak, my bad. However, the main issue with NVIDIA Nsight persists. Any idea about that?

Cheers!


Robert Crovella

Jul 26, 2017, 11:08:11 PM
to Kar, cub-users
It's an internal error. I would suggest filing a bug at developer.nvidia.com with sufficient information for NVIDIA to reproduce the issue.
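
As a rough sketch of the kind of environment information that could accompany such a report, the snippet below prints the CUDA runtime version, the CUDA version supported by the installed driver, and the device name. The helper name print_cuda_environment is made up for illustration; the runtime calls themselves are standard CUDA runtime API functions. The Nsight version and display driver version (5.3 and 384.94 in this thread) are not reported by these calls and would still need to be added by hand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (an assumption, not part of CUB or Nsight): prints
// basic CUDA environment details for the given device index.
cudaError_t print_cuda_environment(int device)
{
    int runtime_version = 0;
    int driver_version = 0;
    cudaDeviceProp prop;

    cudaError_t err = cudaRuntimeGetVersion(&runtime_version);
    if (err == cudaSuccess)
        err = cudaDriverGetVersion(&driver_version);
    if (err == cudaSuccess)
        err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Both versions are encoded as 1000 * major + 10 * minor.
    printf("CUDA runtime: %d.%d\n",
           runtime_version / 1000, (runtime_version % 1000) / 10);
    // This is the latest CUDA version the installed driver supports,
    // not the display driver version number.
    printf("CUDA version supported by driver: %d.%d\n",
           driver_version / 1000, (driver_version % 1000) / 10);
    printf("Device %d: %s (compute capability %d.%d)\n",
           device, prop.name, prop.major, prop.minor);
    return cudaSuccess;
}
```

Calling print_cuda_environment(0) at the start of a small reproducer puts these details in the same output as the failure itself.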




Kar

Jul 27, 2017, 3:51:16 AM
to cub-users, solra...@hotmail.com, crov...@yahoo.com
Yeah, I filed it yesterday and got confirmation that it's a bug. They will let me know when it is fixed.