Jul 24, 2010, 9:34:32 AM7/24/10
to Scalable Synchronization Algorithms
Here is another excerpt from my Intel TC2010 write-up.
There are 4 main strategies for a fine-grained distributed dynamic
Work-stealing. That's a reactive asynchronous strategy. The essence:
when a thread is out work, it randomly chooses a victim thread and
asynchronously tries to steal some work from it.
Work-requesting. That's a reactive synchronous strategy. The essence:
when a thread is out of work, it randomly chooses a victim thread and
sends a synchronous request to it; the victim receives the request,
and sends some work back (if any).
Work-distribution. That's a proactive synchronous strategy. The
essence: during submission of a new work, it's divided and proactively
distributed to some threads (idle or lightly loaded).
Work-balancing. That's a proactive asynchronous strategy. The essence:
dedicated thread (or potentially one of the worker threads)
periodically collects information about load of all worker thread,
then calculates optimal distribution of work, and then re-distributes
work among them.
It's worth noting that a scheduler may employ several (or even all of
the) above strategies. Reactive strategies (stealing and requesting)
deal with inevitable dynamic load imbalance; but usually have very
limited local information about a system's state, so make sub-optimal
decisions. Proactive strategies (distribution and balancing), on the
other hand, have information about a system's state, so make one-shot
optimal scheduling decisions; but unable to cope with inevitable
dynamic load imbalance.
A scheduler must employ at least one of the reactive strategies in
order to cope with continuous and inevitable dynamic load imbalance,
and optionally include one or both proactive strategies in order to
cut down stealing/requesting costs. So, the general recipe for a
SCHEDULER = (STEALING ^ REQUESTING) [+DISTRIBUTION] [+BALANCING]
Work-Stealing vs. Work-Requesting
So we have to choose between work-stealing and work-requesting. Work-
stealing has a serious advantage over work-requesting due to it's
asynchronous nature: a thief thread is able to get some work, even if
a victim thread is busy processing a user task or even de-scheduled by
an OS. With work-requesting a thief thread is able to get some work
only if a victim thread condescends to send it (which it is just
unable to do if it is de-scheduled by an OS). However, in our case
it's not a problem, because tasks are small and limited in size, plus
we are not going to over-subscribe processors with threads.
There are also 2 problems with work-stealing due to it's asynchronous
nature. First, it inherently incurs some observable per-task overhead,
because every pop operation from a thread's work deque must be
synchronized with a potential asynchronous steal operation from
another thread. Stealing is rare, but one still has to pay that price
on every pop operation. The price is at least a single store-load
style memory fence (MFENCE instruction for x86 architecture), or a
single atomic RMW operation (LOCK prefixed instruction on x86).
Work-requesting is free of the problem. Task deque is completely local
to a thread and requires no synchronization.
The second problem has similar nature and relates to a join phase of
parallel algorithms (game state backtracking in our case). Traditional
handling of task completion involves decrement of a pending child
counter of a parent task. Due to asynchronous nature of work-stealing,
the decrement has to be synchronized with other potential concurrent
Work-requesting is free of the problem. During execution of work-
requesting protocol a victim thread can mark a parent task with a
'has_stolen_children' flag, and synchronization will be used only for
such tasks. While great bulk of tasks will proceed without any
Due to afore-mentioned problems I've chosen a work-requesting protocol
for task scheduling. Basically, it allows us to cut 2 atomic RMW
operations per task.
Work-Distribution and Work-Balancing
Now, we must decide as to whether to use work-distribution and/or
balancing in addition to work-requesting. Work-balancing is too
cumbersome to implement and my educated guess is that it's not going
to provide any significant speedup (it's more suitable for work DAGs
with no “busy leaves” property).
Work distribution is a different story. It's easier to implement, and
may provide some speedup. I decided to implement work distribution
only on initial stage. Namely, all initial game states are evenly
distributed across all worker threads in a round-robin fashion. So
each worker thread has quite a lot of work to begin with, and work-
requesting comes into play towards an end of computation.
Here is a bit simplified code of my scheduler (some functions are
omitted for brevity). Full source code is presented in the Appendix B.
// global index of current thread
// local tasks
// 1-element queue for work requests from other threads
// 1-element queue for work request acknowledges from other
// main thread loop
// while we have local work process it
while (tasks.size() != 0)
// pop task in LIFO order
task_t task = tasks.back();
// user processing
// check for steal requests from other threads
if (steal_req.load(memory_order_relaxed) !=
// try to get some work from other threads
if (false == send_steal_request())
// get thief descriptor
size_t thief_idx = steal_req.load(memory_order_relaxed);
worker_thread& thief = get_thread(thief_idx);
// pop task in FIFO order
task_t task = tasks.back();
// synchronous user processing
// give it to the thift
// notify the thief that the operation is completed
// choose a victim
size_t victim_idx = choose_randomly();
worker_thread& victim = get_thread(victim_idx);
// send a request to it (if it's not busy processing
size_t cmp = victim.steal_req.load(memory_order_relaxed);
if (cmp != (size_t)-1)
// request is sent?
if (cmp == no_requests)
// wait for ack
while (steal_ack.load(memory_order_acquire) == 0)
// check as to whether we got some work or not
// termination condition