For my application I need to do various reductions (sum, sum of squares, etc.) where the input data type (typically unsigned short) has a much smaller range than the reduction output. For instance, summing a large 3-dimensional image (e.g., 300x300x300, or 27 million voxels) needs a 64-bit data type for the sum or sum of squares: with 16-bit inputs the sum can reach roughly 1.8e12 and the sum of squares roughly 1.2e17, both well beyond 32 bits. As far as I can tell, CUB doesn't currently support this. If I try to calculate the sum of an array of unsigned short values using cub::DeviceReduce::Sum, the sum is truncated to a 16-bit value, even if the OutputIterator's value type is long long int.
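To make the problem concrete, here is a minimal host-side sketch in plain C++ (standard library only, no CUB). It assumes the truncation comes from accumulating in the input's own value type, which is what the behavior described above suggests but is not confirmed here. The helper names sum_narrow, sum_wide and sum_squares_wide are hypothetical and exist only for this illustration. The first helper reproduces the 16-bit wrap-around; the other two widen each element to 64 bits before it is combined, which is the behavior I am after.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: accumulates in the input's own 16-bit type,
// so the running total wraps modulo 2^16 (the truncation described above).
inline std::uint16_t sum_narrow(const std::vector<std::uint16_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint16_t{0},
        [](std::uint16_t acc, std::uint16_t x) {
            return static_cast<std::uint16_t>(acc + x);
        });
}

// Hypothetical helper: widens each element to 64 bits before adding,
// so the accumulator never overflows for realistic image sizes.
inline std::int64_t sum_wide(const std::vector<std::uint16_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::int64_t{0},
        [](std::int64_t acc, std::uint16_t x) {
            return acc + static_cast<std::int64_t>(x);
        });
}

// Hypothetical helper: same widening, applied before squaring so the
// product itself is computed in 64 bits.
inline std::int64_t sum_squares_wide(const std::vector<std::uint16_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::int64_t{0},
        [](std::int64_t acc, std::uint16_t x) {
            const std::int64_t w = x;
            return acc + w * w;
        });
}
```

On the device, the same widening could presumably be done by wrapping the input pointer in a converting iterator (CUB provides cub::TransformInputIterator for element-wise conversion on load) so that the reduction sees 64-bit values. That is an untested assumption here, not a confirmed fix.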