When is __syncthreads() needed?

Slava Pollak

unread,

Feb 13, 2017, 1:03:58 PM2/13/17

to cub-users

Hello,

I want to use the BlockScan<...>::ExclusiveScan function, and it says the following in the detailed description of all of the flavors of this function:

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.

Ok, it makes sense. But what throws me off is that the sample code provided for several of these functions invokes a __syncthreads() where there is no smem reuse. For example:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;
    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}
    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};
__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockScan for a 1D block of 128 threads
    typedef cub::BlockScan<int, 128> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Initialize running total
    BlockPrefixCallbackOp prefix_op(INT_MIN);
    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data = d_data[block_offset];
        // Collectively compute the block-wide exclusive prefix max scan
        BlockScan(temp_storage).ExclusiveScan(
            thread_data, thread_data, INT_MIN, cub::Max(), prefix_op);
        __syncthreads();
        // Store scanned items to output segment
        d_data[block_offset] = thread_data;
    }

As far as I can tell that last __syncthreads() is redundant. However, I wanted to make sure since this example repeats several times throughout the BlockScan documentation page.

Kind regards,

Slava

Duane Merrill III

unread,

Feb 14, 2017, 10:41:11 PM2/14/17

to Slava Pollak, cub-users

Hi Slava,

You're right, the example snippets have a redundant __syncthreads(). (Likely a copy-paste error from other code that was using a BlockStore after the scan where the storage was being re-purposed.)

I'll fix the example snippets for the next release.

--
http://nvlabs.github.com/cub
---
You received this message because you are subscribed to the Google Groups "cub-users" group.
To post to this group, send email to cub-...@googlegroups.com.
Visit this group at https://groups.google.com/group/cub-users.

RoBiK

unread,

Feb 15, 2017, 3:32:16 AM2/15/17

to Duane Merrill III, Slava Pollak, cub-users

Hi Duane & Slava,

IMHO that __syncthreads() call in the example below is needed.

The BlockScan call is inside a loop so the temporary storage is indeed reused - by the BlockScan in the next iteration of the loop.

Without the __syncthreads() call you would be risking a situation where one warp has already started to execute a BlockScan while another warp is still running a BlockScan from a previous loop iteration.

cheers,

Robert

Duane Merrill III

unread,

Feb 16, 2017, 8:51:40 AM2/16/17

to RoBiK, Slava Pollak, cub-users

Ah, you're totally right Robert. (Sorry, I only glanced at the snippet and assumed it was the one relating to the simple non-callback entrypoint.) It does need to synchronize the shared memory between subsequent loop iterations.

Reply all

Reply to author

Forward