Learn how to use the CUB library of "collective" SIMT primitives to simplify CUDA kernel development, maintenance, and tuning. Constructing, tuning, and maintaining kernel code is perhaps the most challenging and time-consuming aspect of CUDA programming. Kernel code is where the complexity of parallelism is expressed: programmers must reason about deadlock, livelock, synchronization, race conditions, shared memory layout, plurality of state, granularity, throughput, latency, memory bottlenecks, and more. Yet, with the exception of CUB, there are few (if any) software libraries of reusable kernel primitives, which makes CUB unique in the CUDA ecosystem. CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:
- Device-wide primitives (sort, prefix scan, reduction, histogram, etc.)
- Block-wide "collective" primitives (I/O, sort, prefix scan, reduction, histogram, etc.)
- Warp-wide "collective" primitives (prefix scan, reduction, etc.)
Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Programming Languages & Compilers