Compile failure with CUB 1.6.4 WarpScan

Jeremy Buhler

Feb 9, 2017, 6:56:59 AM
to cub-...@googlegroups.com
I am using CUB's WarpScan template with logical warp size < physical warp size to do scans across sub-warp ranges of threads, like this:

typedef cub::WarpScan<T, SCAN_WIDTH> WarpScan;

where SCAN_WIDTH is a power of 2 <= the physical warp size (a fuller sketch of my setup appears at the end of this message).  Ideally, my code should work even in the trivial case that SCAN_WIDTH = 1.  However, setting SCAN_WIDTH to 1 and compiling results in the following error from nvcc:

cub/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/warp_scan_smem.cuh(70): error: shift count is negative
          detected during:
            instantiation of class "cub::WarpScanSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH> [with T=unsigned int, LOGICAL_WARP_THREADS=1, PTX_ARCH=0]"
cub/cub/block/specializations/../../block/../block/specializations/../../warp/warp_scan.cuh(172): here

The problem arises when nvcc does the host pass of compilation and hence does not set __CUDA_ARCH__, so that PTX_ARCH == 0.  Instantiating WarpScanSmem with LOGICAL_WARP_THREADS = 1 then fails because STEPS = Log2(LOGICAL_WARP_THREADS) = 0, which makes the shift count negative in this line:

  /// The number of threads in half a warp
  HALF_WARP_THREADS = 1 << (STEPS - 1),

To be clear, I'm coding for a GPU with compute capability 5.2, so my actual device code always uses WarpScanShfl, which works just fine with LOGICAL_WARP_THREADS = 1. The problem is purely the expansion of the WarpScanSmem template when compiling for the host (hence with __CUDA_ARCH__ undefined), even though nothing is ever codegen'd from this class in my application.
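
(For anyone digging into this: as far as I can tell, the internal dispatch in warp_scan.cuh looks roughly like the sketch below. This is a paraphrase from memory, not the exact 1.6.4 source, but it shows why PTX_ARCH == 0 on the host pass selects, and therefore instantiates, WarpScanSmem even though only WarpScanShfl is ever used on the device.)

// Rough paraphrase of the internal specialization choice in warp_scan.cuh:
// the shuffle-based path is chosen only for PTX_ARCH >= 300, so the host
// pass (PTX_ARCH == 0) falls back to the shared-memory specialization.
typedef typename cub::If<(PTX_ARCH >= 300),
    cub::WarpScanShfl<T, LOGICAL_WARP_THREADS, PTX_ARCH>,
    cub::WarpScanSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH> >::Type InternalWarpScan;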

As a workaround, I can explicitly set the PTX_ARCH parameter of the WarpScan template to a value >= 300 so that the host compilation does not expand WarpScanSmem.  However, this is clearly not the "right" way to fix the problem.  Rather, I'd like to see a minimal patch to WarpScanSmem to make it build and work correctly when LOGICAL_WARP_THREADS == 1.  Is it as simple as saying

HALF_WARP_THREADS = (STEPS == 0) ? 0 : (1 << (STEPS - 1)),
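
(For reference, the explicit-PTX_ARCH workaround I mentioned above is just the third template parameter; 520 matches my sm_52 target, and any value >= 300 keeps the host pass away from WarpScanSmem.)

// Workaround sketch: pin PTX_ARCH explicitly so the host compilation
// never expands WarpScanSmem.  520 is my target arch; any value >= 300 suffices.
typedef cub::WarpScan<T, SCAN_WIDTH, 520> WarpScan;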

(BTW, this is with CUDA 8.5 on Linux with gcc 4.8.5.)
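
For completeness, here is roughly what my setup looks like. This is a simplified stand-in for my real kernel; names like SubWarpScanKernel and BLOCK_THREADS are placeholders, and InclusiveSum stands in for the scan I actually run.

#include <cub/cub.cuh>

// Simplified sketch of the sub-warp scan setup described above.
template <typename T, int SCAN_WIDTH, int BLOCK_THREADS>
__global__ void SubWarpScanKernel(const T *d_in, T *d_out)
{
    // One logical "warp" per SCAN_WIDTH consecutive threads
    typedef cub::WarpScan<T, SCAN_WIDTH> WarpScan;

    // One TempStorage slot per logical warp in the block
    __shared__ typename WarpScan::TempStorage temp_storage[BLOCK_THREADS / SCAN_WIDTH];

    int tid          = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int logical_warp = threadIdx.x / SCAN_WIDTH;

    T item = d_in[tid];
    WarpScan(temp_storage[logical_warp]).InclusiveSum(item, item);
    d_out[tid] = item;
}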

Thanks,
Jeremy

RaulPPelaez

Sep 15, 2017, 1:11:37 PM
to cub-users
I noticed this behavior too; it appears that it is not meant to be used with a logical warp size of 1 for compute capability <= 210 (the shared-memory version).
I also noticed something strange in the sm_52 version (which uses __shfl_down, etc.). I describe the problem here: https://github.com/NVlabs/cub/issues/112
Maybe you are seeing something like this?

Good luck!