That parameter forces a cudaDeviceSynchronize after every kernel launch inside the library routine, of which there may be many. It is intended for debugging, when you want to determine where a ULF (unspecified launch failure) is coming from. Setting it would only make your code run slower than necessary. In the next release of CUB, that parameter will be renamed "debug_synchronous" to reduce any confusion.
CUB's DeviceXXX methods are asynchronous with respect to the calling thread (regardless of whether that thread is a host thread or a device thread that is starting new work with nested/dynamic parallelism). The calling thread may (and will likely) return before the work is done. However, you don't need to do any explicit synchronization if you are going to either (a) copy the results back to host memory using cudaMemcpy; or (b) use those buffers in another kernel (in the same stream). The cudaMemcpy (device to host) will block the host thread until both the sorting and the copy are done. And the driver won't launch your secondary kernel until the sorting is done.
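To illustrate case (a), here is a minimal sketch of a sort followed directly by a blocking device-to-host copy, with no explicit synchronization in between. It assumes everything runs in the default stream and uses cub::DeviceRadixSort::SortKeys with its default optional arguments; the buffer names and input values are made up for the example, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

int main() {
    // Illustrative input; names and values are assumptions for this sketch.
    const int num_items = 8;
    std::vector<int> h_keys = {7, 3, 5, 1, 8, 2, 6, 4};
    std::vector<int> h_sorted(num_items);

    int *d_keys_in = nullptr;
    int *d_keys_out = nullptr;
    cudaMalloc(&d_keys_in, num_items * sizeof(int));
    cudaMalloc(&d_keys_out, num_items * sizeof(int));
    cudaMemcpy(d_keys_in, h_keys.data(), num_items * sizeof(int),
               cudaMemcpyHostToDevice);

    // First call with a null temp-storage pointer only reports the size needed.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call enqueues the sort; the host may return before it finishes.
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes,
                                   d_keys_in, d_keys_out, num_items);

    // No explicit synchronization: this blocking copy waits for the sort
    // (same default stream) and for the transfer itself.
    cudaMemcpy(h_sorted.data(), d_keys_out, num_items * sizeof(int),
               cudaMemcpyDeviceToHost);

    for (int i = 0; i < num_items; ++i) printf("%d ", h_sorted[i]);
    printf("\n");

    cudaFree(d_temp);
    cudaFree(d_keys_out);
    cudaFree(d_keys_in);
    return 0;
}
```

Case (b) works the same way: a kernel launched into the same stream after the second SortKeys call would simply be ordered behind the sort.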
You would only need to synchronize the calling thread (with cudaStreamSynchronize or cudaDeviceSynchronize) if your buffers were actually mapped pinned (zero-copy) host memory and you wanted the calling thread to make sure they were coherent before reading them. And even then, you would be better off making your own call to cudaDeviceSynchronize rather than setting the last boolean "stream_synchronous" parameter to true.
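As a rough sketch of that one exception: suppose the output lives in mapped pinned host memory, so there is no cudaMemcpy to block on and the host has to synchronize before reading. The example below uses cub::DeviceReduce::Sum purely as a stand-in for any DeviceXXX routine; the variable names are hypothetical and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

int main() {
    // Allow mapped pinned allocations; must be set before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int num_items = 4;
    const int h_in[num_items] = {1, 2, 3, 4};

    int *d_in = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMemcpy(d_in, h_in, num_items * sizeof(int), cudaMemcpyHostToDevice);

    // The result lives in mapped pinned host memory that the device writes directly.
    int *h_sum = nullptr;  // host view of the buffer
    int *d_sum = nullptr;  // device view of the same buffer
    cudaHostAlloc((void **)&h_sum, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_sum, h_sum, 0);

    // Size query, then the actual (asynchronous) reduction.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_sum, num_items);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_sum, num_items);

    // No copy will block for us here, so synchronize explicitly
    // before the host dereferences the mapped buffer.
    cudaDeviceSynchronize();
    printf("sum = %d\n", *h_sum);

    cudaFreeHost(h_sum);
    cudaFree(d_temp);
    cudaFree(d_in);
    return 0;
}
```

Without the cudaDeviceSynchronize call, the host could read *h_sum before the reduction has written it.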
Basically, CUB's DeviceXXX methods have the same concurrency semantics as a regular CUDA kernel launch (which you can read more about in the CUDA programming guide).
Hope that helps!
Duane