When is __syncthreads() needed?

119 views
Skip to first unread message

Slava Pollak

unread,
Feb 13, 2017, 1:03:58 PM2/13/17
to cub-users
Hello,

I want to use the BlockScan<...>::ExclusiveScan function, and it says the following in the detailed description of all of the flavors of this function:
  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.

Ok, it makes sense. But what throws me off is that the sample code provided for several of these functions invokes a __syncthreads() where there is no smem reuse. For example:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
   // Running prefix
   int running_total;
   // Constructor
   __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
   // Callback operator to be entered by the first warp of threads in the block.
   // Thread-0 is responsible for returning a value for seeding the block-wide scan.
   __device__ int operator()(int block_aggregate)
   {
       int old_prefix = running_total;
       running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
       return old_prefix;
   }
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
   // Specialize BlockScan for a 1D block of 128 threads
   // Allocate shared memory for BlockScan
   __shared__ typename BlockScan::TempStorage temp_storage;
   // Initialize running total
   BlockPrefixCallbackOp prefix_op(INT_MIN);
   // Have the block iterate over segments of items
   for (int block_offset = 0; block_offset < num_items; block_offset += 128)
   {
       // Load a segment of consecutive items that are blocked across threads
       int thread_data = d_data[block_offset];
       // Collectively compute the block-wide exclusive prefix max scan
       BlockScan(temp_storage).ExclusiveScan(
           thread_data, thread_data, INT_MIN, cub::Max(), prefix_op);
       __syncthreads();
       // Store scanned items to output segment
       d_data[block_offset] = thread_data;
   }

As far as I can tell that last __syncthreads() is redundant. However, I wanted to make sure since this example repeats several times throughout the BlockScan documentation page.

Kind regards,
Slava

Duane Merrill III

unread,
Feb 14, 2017, 10:41:11 PM2/14/17
to Slava Pollak, cub-users
Hi Slava, 

You're right, the example snippets have a redundant __syncthreads().  (Likely a copy-paste error from other code that was using a BlockStore after the scan where the storage was being re-purposed.) 

I'll fix the example snippets for the next release.

--
http://nvlabs.github.com/cub
---
You received this message because you are subscribed to the Google Groups "cub-users" group.
To post to this group, send email to cub-...@googlegroups.com.
Visit this group at https://groups.google.com/group/cub-users.

RoBiK

unread,
Feb 15, 2017, 3:32:16 AM2/15/17
to Duane Merrill III, Slava Pollak, cub-users
Hi Duane & Slava,

IMHO that __syncthreads() call in the example below is needed.
The BlockScan call is inside a loop so the temporary storage is indeed reused - by the BlockScan in the next iteration of the loop.
Without the __syncthreads() call you would be risking a situation where one warp has already started to execute a BlockScan while another warp is still running a BlockScan from a previous loop iteration.

cheers,

Robert

Duane Merrill III

unread,
Feb 16, 2017, 8:51:40 AM2/16/17
to RoBiK, Slava Pollak, cub-users
Ah, you're totally right Robert.  (Sorry, I only glanced at the snippet and assumed it was the one relating to the simple non-callback entrypoint.)  It does need to synchronize the shared memory between subsequent loop iterations.
Reply all
Reply to author
Forward
0 new messages