Work-stealing with LIFO local pop and FIFO steal


bdt...@gmail.com

Aug 1, 2017, 2:03:07 AM8/1/17
to Scalable Synchronization Algorithms
Hi all,

I've been using the following lock-free work-stealing deque for a while: https://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/
I'm not using an MPMC queue because the owning worker thread should pop from the same end it pushes to (the opposite end from where thieves steal), since LIFO makes it more likely the working set is still in cache.
I'm posting here with a couple of questions:
1) Is there a more efficient algorithm? The pop in this implementation has both a full barrier and a CAS, so it's expensive for the second most common operation (after the push, assuming good initial task distribution that makes steals less frequent).
2) What's the most efficient way to prevent busy spinning when there's no work left? This occurs often since my typical workloads come from real-time data streams. Right now, I have workers doing exponential back-off after failed steal attempts (with victims chosen randomly); a simplified sketch of that idle path is below, just before the queue code. I was thinking of maybe using SPSC queues from each worker to periodically send its status (sleeping or not) so that new work is sent first to these threads (rather than the simple round-robin I'm doing now; I already use an SPSC queue per worker for sending new work).
3) In general, is there any benefit to strided indexing of the buffer in this sort of data structure, so that threads operating on different ends of the deque tend not to read/write the same cache line when the buffer holds only a few items? Related: given that head and tail are both accessed by the different operations in this algorithm, is there any benefit to putting padding between them?
4) Is there a performance benefit to making the size a template parameter as well, given that this class is already templated on item type?
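For reference, the idle path in each worker currently looks roughly like this (simplified sketch; tryPopOrSteal() stands in for "pop my own queue, else steal from a randomly chosen victim", and the back-off constants are made up):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

// Simplified idle path (question 2): tryPopOrSteal() stands in for "pop my own
// queue, else steal from a randomly chosen victim"; the constants are made up.
template<class TryPopOrSteal>
void IdleLoop(TryPopOrSteal tryPopOrSteal)
{
    uint32_t backoffUs{1};
    for (;;)
    {
        if (tryPopOrSteal()) // Got a task and ran it
        {
            backoffUs = 1; // Reset the back-off after useful work
            continue;
        }
        std::this_thread::sleep_for(std::chrono::microseconds{backoffUs});
        backoffUs = std::min<uint32_t>(backoffUs * 2, 1000); // Exponential back-off, capped at 1 ms
    }
}

And here's the deque itself: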

#include <atomic>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <utility>
#include <vector>

class Task;

// Restricted deque for work-stealing parallelism
// Push and LIFO pop at one end by a single thread and FIFO pop from the other end by other threads

template<class T>
class WorkStealQ
{
public:
    WorkStealQ(void) = delete;
    inline WorkStealQ(uint32_t bufferSize); // Must be a positive power-of-two
    WorkStealQ(WorkStealQ const &) = delete;
    inline WorkStealQ(WorkStealQ &&) noexcept = default;
    WorkStealQ &operator=(WorkStealQ const &) = delete;
    inline WorkStealQ &operator=(WorkStealQ &&) = default;
    // PushUnsafe() will not check for a full queue, for use where task counts are externally constrained
    // This is more efficient because it never needs to read the atomic tail variable
    inline void PushUnsafe(T &&task);
    inline void PushUnsafe(T const &task);
    inline bool Push(T &&task);
    inline bool Push(T const &task);
    inline T Pop(void);
    inline T Steal(void);
    inline uint32_t Space(void) const noexcept; // Slow as it reads both atomic head and tail
private:
    const uint32_t _mask;
    std::vector<T> _tasks;
    std::atomic<uint32_t> _head, _tail; /// TODO: Try padding
};

template<class T>
inline WorkStealQ<T>::WorkStealQ(uint32_t bufferSize) : _mask(bufferSize - 1), _tasks(bufferSize), _head{0}, _tail{0}
{
    if (bufferSize < 2 || bufferSize > static_cast<uint32_t>(std::numeric_limits<int32_t>::max()) // Limit is needed for signed comparison cast in Push()
        || (bufferSize & _mask)) throw std::invalid_argument("bufferSize must be a positive power-of-two that fits in a signed 32-bit integer");
}

template<class T>
inline void WorkStealQ<T>::PushUnsafe(T &&task)
{
    auto head{_head.load(std::memory_order_relaxed)};
    _tasks[head & _mask] = std::move(task);
    _head.store(head + 1, std::memory_order_release);
}

template<class T>
inline void WorkStealQ<T>::PushUnsafe(T const &task)
{
    auto head{_head.load(std::memory_order_relaxed)};
    _tasks[head & _mask] = task;
    _head.store(head + 1, std::memory_order_release);
}

template<class T>
inline bool WorkStealQ<T>::Push(T &&task)
{
    auto head{_head.load(std::memory_order_relaxed)};
    // Full when the count of unconsumed items reaches the buffer size; signed cast of the difference handles index wrap
    if (static_cast<int32_t>(head - _tail.load(std::memory_order_relaxed)) >= static_cast<int32_t>(_tasks.size())) return false;
    _tasks[head & _mask] = std::move(task);
    _head.store(head + 1, std::memory_order_release);
    return true;
}

template<class T>
inline bool WorkStealQ<T>::Push(T const &task)
{
    auto head{_head.load(std::memory_order_relaxed)};
    if (static_cast<int32_t>(head - _tail.load(std::memory_order_relaxed)) >= static_cast<int32_t>(_tasks.size())) return false;
    _tasks[head & _mask] = task;
    _head.store(head + 1, std::memory_order_release);
    return true;
}

template<class T>
inline T WorkStealQ<T>::Pop(void)
{
    auto head{_head.load(std::memory_order_relaxed) - 1};
    _head.exchange(head, std::memory_order_seq_cst); // Full barrier
    auto tail{_tail.load(std::memory_order_acquire)};

    if (static_cast<int32_t>(head - tail) < 0) // Empty: signed difference handles head == tail - 1 and index wrap
    {
        _head.store(tail, std::memory_order_release);
        return T();
    }
    auto task{_tasks[head & _mask]};
    if (head != tail) return task; // More than one item left: no race with thieves for this slot
    // Exactly one item left: race the thieves for it with a CAS on tail
    auto tailNew{tail + 1};
    if (_tail.compare_exchange_weak(tail, tailNew, std::memory_order_release, std::memory_order_relaxed))
    {
        _head.store(tailNew);
        return task;
    }
    else
    {
        _head.store(tailNew); // CAS failed: a thief took the item (or the weak CAS failed spuriously and the item stays); reset head either way
        return T();
    }
}

template<class T>
inline T WorkStealQ<T>::Steal(void)
{
    auto tail{_tail.load(std::memory_order_acquire)}; // Tail must be loaded before head
    if (static_cast<int32_t>(_head.load(std::memory_order_acquire) - tail) <= 0) return T(); // Empty: signed difference handles index wrap
    auto task{_tasks[tail & _mask]};
    if (_tail.compare_exchange_weak(tail, tail + 1, std::memory_order_release, std::memory_order_relaxed)) return task;
    return T(); // Lost the race with the owner or another thief
}

template<class T>
inline uint32_t WorkStealQ<T>::Space(void) const noexcept
{
    return static_cast<uint32_t>(_tasks.size()) - (_head.load(std::memory_order_relaxed) - _tail.load(std::memory_order_relaxed)); // Free slots remaining
}


Dmitry Vyukov

Aug 4, 2017, 2:13:37 AM8/4/17
to lock...@googlegroups.com
On Mon, Jul 31, 2017 at 9:14 PM, <bdt...@gmail.com> wrote:
> Hi all,
>
> I've been using the following lock-free work-stealing deque for a while:
> https://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/
> I'm not using an MPMC queue since pops by the owning worker thread should be
> from the end of the queue it's pushing from (opposite from the stealing
> end), as LIFO makes it more likely the working set will be in cache.
> I'm posting here with a couple of questions:
> 1) To inquire about whether there's a more efficient algorithm, as the pop
> in this implementation has both a full barrier and a CAS, and so it's
> expensive on the 2nd most common operation (after the push, assuming good
> initial task distribution that make steals less frequent).

Hi,

This one looks good.
Note that pop contains only the full barrier in the common case (when the deque
is not close to empty). And that's as cheap as you can get, because pop
needs to achieve consensus with steal as to who gets which elements.
Just don't schedule a single integer addition as a parallel task.


> 2) What's the most efficient way to prevent busy spinning when there's no
> work left? This occurs often since my typical workloads come from real-time
> data streams. Right now, I have workers doing exponential back-off after
> failed steal attempts (with victims chosen randomly). I was thinking of
> maybe using SPSC queues from each worker to periodically send its status
> (sleeping or not) so that new work is sent first to these threads (rather
> than simple round-robin as I'm doing now, using SPSC per worker already for
> sending new work).

Search this group and comp.programming.threads archives for
"eventcount". And also this:
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/299245
Eventcounts are the generic way to wait for arbitrary predicates in
lock-free algorithms.
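In its simplest form the waiting pattern looks like this (a minimal mutex/condvar sketch with made-up names, just to show the protocol; the implementations discussed in those threads avoid the lock on the fast path):

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal illustrative eventcount: the waiter takes a ticket, re-checks its
// predicate, and only then blocks; the producer publishes work before bumping
// the epoch, so no wakeup is lost.
class EventCount
{
public:
    uint32_t PrepareWait() const noexcept { return _epoch.load(std::memory_order_acquire); }

    void Wait(uint32_t ticket)
    {
        std::unique_lock<std::mutex> lock{_mutex};
        _cv.wait(lock, [&] { return _epoch.load(std::memory_order_relaxed) != ticket; });
    }

    void Notify()
    {
        {
            std::lock_guard<std::mutex> lock{_mutex};
            _epoch.fetch_add(1, std::memory_order_release);
        }
        _cv.notify_all();
    }

private:
    std::atomic<uint32_t> _epoch{0};
    std::mutex _mutex;
    std::condition_variable _cv;
};

// Worker:   auto ticket = ec.PrepareWait();
//           if (!AnyQueueHasWork()) ec.Wait(ticket); // AnyQueueHasWork() is the predicate
// Producer: push the task, then call ec.Notify().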


> 3) To ask whether in general there's any benefit of doing strided indexing
> of a buffer in this sort of data structure, so that operations by different
> threads on different ends of the (de)que would tend to not read/write in the
> same cache line when the buffer has only a few items. Related is whether,
> given both head and tail are accessed by the different oprations in this
> algorithm, there's any benefit of putting padding between them.

Yes.
Try benchmarking it. That will also help you answer the other questions and
compare with alternatives.
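For the padding part, the variant worth benchmarking could look something like this (illustrative only; 64 bytes is a common x86 cache-line size, not a guarantee):

#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical padded layout for the two indices (question 3)
constexpr std::size_t kCacheLine = 64;

struct PaddedIndices
{
    alignas(kCacheLine) std::atomic<uint32_t> head{0}; // Owner end: push/pop
    alignas(kCacheLine) std::atomic<uint32_t> tail{0}; // Thief end: steal
};

static_assert(sizeof(PaddedIndices) >= 2 * kCacheLine, "head and tail should land on separate cache lines");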



> 4) Is there a performance benefit to making the size a template parameter as
> well, given that this class is already templated on item type?

See above.

> _head.exchange(head, std::memory_order_seq_cst); // Full barrier
  /\/\/\/\/\/\

This makes no sense.
Exchange with unused return value is store.

bdt...@gmail.com

Aug 4, 2017, 2:46:44 PM8/4/17
to Scalable Synchronization Algorithms

On Thursday, August 3, 2017 at 11:13:37 PM UTC-7, Dmitry Vyukov wrote:
On Mon, Jul 31, 2017 at 9:14 PM,  <bdt...@gmail.com> wrote:
> 2) What's the most efficient way to prevent busy spinning when there's no
> work left? This occurs often since my typical workloads come from real-time
> data streams. Right now, I have workers doing exponential back-off after
> failed steal attempts (with victims chosen randomly). I was thinking of
> maybe using SPSC queues from each worker to periodically send its status
> (sleeping or not) so that new work is sent first to these threads (rather
> than simple round-robin as I'm doing now, using SPSC per worker already for
> sending new work).

Search this group and comp.programming.threads archives for
"eventcount". And also this:
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/299245
Eventcounts are the generic way to wait for arbitrary predicates in
lock-free algorithms.

Looks like there's an efficient eventcount implementation at https://locklessinc.com/articles/obscure_synch/ about two-thirds of the way down the page. It uses futexes, but on Windows (8+) the same can be done with WaitOnAddress() etc. If converting it to C++11 atomics, there's no standard library operation that will issue "lock bts"; I think atomic_fetch_or() would work, but I wonder about any difference in efficiency.
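Roughly what I have in mind for that bit-set step (sketch only; the bit layout here is made up, and which instruction the fetch_or compiles down to would need checking):

#include <atomic>
#include <cstdint>

// Made-up layout for illustration: top bit = "a waiter is parked", low bits = count
constexpr uint32_t kWaitersBit = 1u << 31;

std::atomic<uint32_t> key{0};

// Sets the waiters bit and returns the previously observed value, so the caller
// can tell whether the count advanced since it last checked for work
inline uint32_t SetWaitersBit()
{
    return key.fetch_or(kWaitersBit, std::memory_order_acq_rel);
}

// Whether this lowers to lock or, lock bts, or a lock cmpxchg loop depends on the
// compiler and on how the result is used; worth checking the generated code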
 