Do I need to sync the kernel before I proceed?

brbs

Dec 3, 2013, 3:34:03 AM
to cub-...@googlegroups.com
I just want to understand the behavior of CUB functions, to ensure the correctness of my code.

Are CUB functions asynchronous with respect to the host thread by default or not?

For instance, considering the following code:

bool syncSignal = false;


DeviceRadixSort::SortPairs<double, int>(buffer, storage_bytes, d_keys, d_values, len, 0, sizeof(double)*8, streamIdx, syncSignal);

SomeHostFunction(d_keys, d_values);

Could the CUB function return to the host thread before it finishes its task? My main concern is whether the values of d_keys/d_values are correct if syncSignal is set to false.

To ensure the correctness of the returned values, especially d_keys/d_values, do I need to set syncSignal to true or add a CUDA stream barrier before the host thread can work with the data (d_keys/d_values)?

Duane Merrill

Dec 5, 2013, 10:04:18 AM
to cub-...@googlegroups.com
That parameter forces a cudaDeviceSynchronize after every kernel launch inside the library routine, of which there may be many. It is intended for debugging, when you want to determine where a ULF (unspecified launch failure) is coming from. Setting it would make your code run unnecessarily slowly. In the next release of CUB, that parameter will be renamed "debug_synchronous" to reduce any confusion.

CUB's DeviceXXX methods are asynchronous with respect to the calling thread (regardless of whether that thread is a host thread or a device thread that is starting new work with nested/dynamic parallelism).  The calling thread may (and will likely) return before the work is done.  However, you don't need to do any explicit synchronization if you are going to either (a) copy the results back to host memory using cudaMemcpy; or (b) use those buffers in another kernel (in the same stream).  The cudaMemcpy (device to host) will block the host thread until both the sorting and the copy are done.  And the driver won't launch your secondary kernel until the sorting is done. 
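
Case (a) can be sketched roughly as follows. This is only an illustration under some assumptions: it uses the current CUB interface (separate input and output buffers, no trailing debug parameter) rather than the older signature in the question, everything runs on the default stream, the host buffers are ordinary pageable memory, and error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 8;
    std::vector<double> h_keys = {5.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0};
    std::vector<int>    h_vals = {0, 1, 2, 3, 4, 5, 6, 7};

    double *d_keys_in = nullptr, *d_keys_out = nullptr;
    int    *d_vals_in = nullptr, *d_vals_out = nullptr;
    cudaMalloc((void**)&d_keys_in,  n * sizeof(double));
    cudaMalloc((void**)&d_keys_out, n * sizeof(double));
    cudaMalloc((void**)&d_vals_in,  n * sizeof(int));
    cudaMalloc((void**)&d_vals_out, n * sizeof(int));
    cudaMemcpy(d_keys_in, h_keys.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals_in, h_vals.data(), n * sizeof(int),    cudaMemcpyHostToDevice);

    // First call only queries the required temporary storage size.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call enqueues the sort; it may return before the sort finishes.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);

    // No explicit synchronization here: these device-to-host copies are
    // ordered after the sort and block the host until the data has arrived.
    cudaMemcpy(h_keys.data(), d_keys_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_vals.data(), d_vals_out, n * sizeof(int),    cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%.1f -> %d\n", h_keys[i], h_vals[i]);

    cudaFree(d_temp);
    cudaFree(d_keys_in); cudaFree(d_keys_out);
    cudaFree(d_vals_in); cudaFree(d_vals_out);
    return 0;
}
```

Passing raw device pointers to a host function, as SomeHostFunction(d_keys, d_values) does in the question, would not work for reading the results on the host in any case; the data has to be copied back (or live in mapped memory) first.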

You would only need to synchronize the calling thread (with cudaStreamSynchronize or cudaDeviceSynchronize) if your buffers were actually memory-mapped pinned memory and you wanted the calling thread to make sure they were coherent before reading them.  And even so, you would preferably make your own call to cudaDeviceSynchronize, rather than setting the last boolean "stream_synchronous" parameter to true.  
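
The mapped pinned memory case might look like the sketch below. The assumptions are again the current CUB interface, a device that supports mapped host memory, and no error checking; the stream and variable names are made up for the example. The only point is the cudaStreamSynchronize before the host touches the output buffer.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 4;
    const double h_in[n] = {4.0, 2.0, 3.0, 1.0};

    // Allow mapped host allocations (needed on some older setups, harmless otherwise).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    double* d_keys_in = nullptr;
    cudaMalloc((void**)&d_keys_in, n * sizeof(double));
    cudaMemcpy(d_keys_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);

    // Output buffer is pinned host memory mapped into the device address space.
    double* h_keys_out = nullptr;
    cudaHostAlloc((void**)&h_keys_out, n * sizeof(double), cudaHostAllocMapped);
    double* d_keys_out = nullptr;
    cudaHostGetDevicePointer((void**)&d_keys_out, h_keys_out, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Size query, then the actual (asynchronous) sort on the stream.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys_in, d_keys_out, n,
                                   0, int(sizeof(double) * 8), stream);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys_in, d_keys_out, n,
                                   0, int(sizeof(double) * 8), stream);

    // There is no cudaMemcpy to act as a barrier, so the host must wait
    // explicitly before reading the mapped buffer.
    cudaStreamSynchronize(stream);

    for (int i = 0; i < n; ++i)
        printf("%.1f\n", h_keys_out[i]);

    cudaFree(d_temp);
    cudaFree(d_keys_in);
    cudaFreeHost(h_keys_out);
    cudaStreamDestroy(stream);
    return 0;
}
```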

Basically CUB's DeviceXXX methods have the same concurrency semantics as a regular CUDA kernel launch (which you can read more about in the CUDA programming guide).

Hope that helps!

Duane



