Hi Qingpeng,
warp level sort
We don't have warp-level sorting primitives (yet), but that's something we could add fairly easily. Out of curiosity, what's the application? (If what you're working on fits a general pattern that shows up in many kinds of problems that would benefit from WarpSort, that helps motivate adding WarpSort as a feature soon.) Also, do you envision each warp thread contributing one item or many items to the WarpSort?
warp level reduce-by-key
Although we don't have it explicitly, it's easy to construct using cub::WarpReduce::TailSegmentedSum or cub::WarpReduce::TailSegmentedReduce. You just need to compute the tail flags marking which threads hold the last item in each run of consecutive equal keys, then feed the tail flags and the values into either of the segmented WarpReduce methods.
Computing the tail flags requires each thread to look at a neighboring thread's item. (Strictly, a tail flag comes from comparing with the successor's key; comparing with the predecessor's key yields head flags, which the HeadSegmentedSum and HeadSegmentedReduce variants of cub::WarpReduce take instead.) On Fermi (and older) you'll need an array of 32 items in shared memory (one array per warp) for threads to exchange items through. On Kepler, you can simply use "shuffle up" to get the predecessor's item (either using the __shfl() intrinsic or CUB's generic cub::ShuffleUp() operation, which works on any data type).
Again, requests for new primitives are best prioritized if we have a good idea of what kinds of things they would be used for.
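To make the reduce-by-key recipe concrete, here is a minimal sketch rather than a definitive implementation. It assumes a single warp of 32 threads, int keys and values, keys already grouped into runs, and CUDA 9 or later, where the shuffle intrinsics carry a `_sync` suffix and take a lane mask. The kernel name, the host helper `RunWarpReduceByKeyDemo`, and the output layout (one key/sum pair per run, compacted with a ballot) are hypothetical choices for illustration and are not part of CUB. Because a tail flag marks the last item of a run, the sketch compares each lane's key with its successor's key via shuffle-down.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: one warp reduces 32 (key, value) pairs by key.
// Assumes equal keys are already adjacent (grouped into runs).
__global__ void WarpReduceByKeyKernel(const int *keys, const int *vals,
                                      int *out_keys, int *out_sums, int *num_runs)
{
    typedef cub::WarpReduce<int> WarpReduce;
    __shared__ WarpReduce::TempStorage temp_storage;

    const unsigned kFullMask = 0xffffffffu;  // all 32 lanes participate
    const int lane = threadIdx.x;            // launch uses exactly one warp
    const int key = keys[lane];
    const int val = vals[lane];

    // Exchange keys with both neighbors through warp shuffles.
    const int next_key = __shfl_down_sync(kFullMask, key, 1);
    const int prev_key = __shfl_up_sync(kFullMask, key, 1);

    // A lane holds a tail if it is the last lane or its successor's key differs.
    const int tail_flag = (lane == 31) || (next_key != key);
    // A lane holds a head if it is the first lane or its predecessor's key differs.
    const bool is_head = (lane == 0) || (prev_key != key);

    // Segmented sum: per the CUB documentation, each run's total is returned
    // to the first lane of that run.
    const int sum = WarpReduce(temp_storage).TailSegmentedSum(val, tail_flag);

    // Compact the per-run results: each head lane writes to the slot equal to
    // the number of head lanes below it.
    const unsigned head_mask = __ballot_sync(kFullMask, is_head);
    if (is_head) {
        const int slot = __popc(head_mask & ((1u << lane) - 1u));
        out_keys[slot] = key;
        out_sums[slot] = sum;
    }
    if (lane == 0) *num_runs = __popc(head_mask);
}

// Hypothetical host-side driver: 8 runs of 4 equal keys, every value 1.
// Returns true if the device reports 8 runs, each with key r and sum 4.
bool RunWarpReduceByKeyDemo()
{
    const int kWarp = 32;
    int h_keys[kWarp], h_vals[kWarp];
    for (int i = 0; i < kWarp; ++i) { h_keys[i] = i / 4; h_vals[i] = 1; }

    int *d_keys, *d_vals, *d_out_keys, *d_out_sums, *d_num_runs;
    cudaMalloc(&d_keys, sizeof(h_keys));
    cudaMalloc(&d_vals, sizeof(h_vals));
    cudaMalloc(&d_out_keys, sizeof(h_keys));
    cudaMalloc(&d_out_sums, sizeof(h_vals));
    cudaMalloc(&d_num_runs, sizeof(int));
    cudaMemcpy(d_keys, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);

    WarpReduceByKeyKernel<<<1, kWarp>>>(d_keys, d_vals, d_out_keys, d_out_sums, d_num_runs);

    int h_out_keys[kWarp], h_out_sums[kWarp], h_num_runs = 0;
    cudaError_t err = cudaMemcpy(&h_num_runs, d_num_runs, sizeof(int), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out_keys, d_out_keys, sizeof(h_out_keys), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out_sums, d_out_sums, sizeof(h_out_sums), cudaMemcpyDeviceToHost);

    cudaFree(d_keys); cudaFree(d_vals);
    cudaFree(d_out_keys); cudaFree(d_out_sums); cudaFree(d_num_runs);

    bool ok = (err == cudaSuccess) && (h_num_runs == 8);
    for (int r = 0; ok && r < h_num_runs; ++r)
        ok = (h_out_keys[r] == r) && (h_out_sums[r] == 4);
    return ok;
}
```

Swapping TailSegmentedSum for TailSegmentedReduce with a custom functor should give other reductions (max, min, and so on) over the same flags. On pre-Kepler hardware the two shuffles would be replaced by a 32-item shared-memory array per warp, as described above.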
Hope this helps, cheers,
Duane