I am using CUB's WarpScan template with logical warp size < physical warp size to do scans across sub-warp ranges of threads, like this:
typedef cub::WarpScan<T, SCAN_WIDTH> WarpScan;
where SCAN_WIDTH is a power of 2 that is <= the warp size. Ideally, my code should work even in the trivial case that SCAN_WIDTH = 1. However, setting SCAN_WIDTH to 1 and compiling results in the following error from nvcc:
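For reference, a minimal repro is below (a sketch: the kernel name, the unsigned int element type, and the 32-thread launch are placeholders from my application, and it assumes CUB is on the include path). Compiling this with SCAN_WIDTH defined as 1 is enough to trigger the error:

```cuda
#include <cub/cub.cuh>

#define SCAN_WIDTH 1  // any power of 2 up to 32; 1 triggers the error

__global__ void ScanKernel(unsigned int *d_out)
{
    typedef cub::WarpScan<unsigned int, SCAN_WIDTH> WarpScan;

    // One TempStorage per logical warp of SCAN_WIDTH threads.
    __shared__ typename WarpScan::TempStorage temp_storage[32 / SCAN_WIDTH];

    unsigned int thread_data = threadIdx.x;
    WarpScan(temp_storage[threadIdx.x / SCAN_WIDTH])
        .InclusiveSum(thread_data, thread_data);
    d_out[threadIdx.x] = thread_data;
}
```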
cub/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/warp_scan_smem.cuh(70): error: shift count is negative
detected during:
instantiation of class "cub::WarpScanSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH> [with T=unsigned int, LOGICAL_WARP_THREADS=1, PTX_ARCH=0]"
cub/cub/block/specializations/../../block/../block/specializations/../../warp/warp_scan.cuh(172): here
The problem arises when nvcc does the host pass of compilation and hence does not set __CUDA_ARCH__, so that PTX_ARCH == 0. Instantiating WarpScanSmem with LOGICAL_WARP_THREADS = 1 fails because of this line:
/// The number of threads in half a warp
HALF_WARP_THREADS = 1 << (STEPS - 1),
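The failing arithmetic can be reproduced in isolation on the host (a sketch: these are runtime functions I wrote to mirror the compile-time enums, with scan_steps standing in for cub::Log2 / STEPS and half_warp_shift for the shift count in HALF_WARP_THREADS):

```cpp
#include <cassert>

// Mirrors STEPS = Log2<LOGICAL_WARP_THREADS>: number of scan steps
// for a power-of-two logical warp width.
inline int scan_steps(int logical_warp_threads)
{
    int steps = 0;
    while ((1 << steps) < logical_warp_threads)
        ++steps;
    return steps;
}

// The shift count used to form HALF_WARP_THREADS = 1 << (STEPS - 1).
// For logical_warp_threads == 1, STEPS == 0 and the shift count is -1,
// which is exactly the "shift count is negative" nvcc complains about.
inline int half_warp_shift(int logical_warp_threads)
{
    return scan_steps(logical_warp_threads) - 1;
}
```

For every width from 2 through 32 the shift count is non-negative and HALF_WARP_THREADS comes out as half the logical warp; width 1 is the only case that breaks.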
To be clear, I'm coding for a GPU with compute capability 5.2, so my actual device code always uses WarpScanShfl, which works just fine with LOGICAL_WARP_THREADS = 1. The problem is purely the expansion of the WarpScanSmem template when compiling for the host (hence with __CUDA_ARCH__ undefined), even though nothing is ever codegen'd from this class in my application.
As a workaround, I can explicitly set the PTX_ARCH parameter of the WarpScan template to a value >= 300 so that the host compilation does not expand WarpScanSmem. However, this is clearly not the "right" way to fix the problem. Rather, I'd like to see a minimal patch to WarpScanSmem to make it build and work correctly when LOGICAL_WARP_THREADS == 1. Is it as simple as saying
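Concretely, the workaround looks like this (a sketch; 520 matches my sm_52 target, but any value >= 300 makes the host pass select WarpScanShfl instead of WarpScanSmem):

```cuda
// Pin the PTX_ARCH template parameter so that even the host pass
// (where __CUDA_ARCH__ is undefined and CUB_PTX_ARCH == 0) never
// instantiates WarpScanSmem.
typedef cub::WarpScan<T, SCAN_WIDTH, 520> WarpScan;
```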
HALF_WARP_THREADS = (STEPS == 0) ? 0 : (1 << (STEPS - 1)),
(BTW, this is with CUDA 8.5 on Linux with gcc 4.8.5.)
Thanks,
Jeremy