Task Scheduling Strategies


Dmitriy Vyukov

Jul 24, 2010, 9:34:32 AM
to Scalable Synchronization Algorithms
Here is another excerpt from my Intel TC2010 write-up.
-------------------------------------------------------

There are 4 main strategies for fine-grained distributed dynamic
task scheduling:

Work-stealing. That's a reactive asynchronous strategy. The essence:
when a thread is out of work, it randomly chooses a victim thread and
asynchronously tries to steal some work from it.

Work-requesting. That's a reactive synchronous strategy. The essence:
when a thread is out of work, it randomly chooses a victim thread and
sends a synchronous request to it; the victim receives the request,
and sends some work back (if any).

Work-distribution. That's a proactive synchronous strategy. The
essence: when new work is submitted, it is divided and proactively
distributed to some threads (idle or lightly loaded ones).

Work-balancing. That's a proactive asynchronous strategy. The essence:
a dedicated thread (or potentially one of the worker threads)
periodically collects information about the load of all worker
threads, calculates an optimal distribution of work, and then
re-distributes work among them.

It's worth noting that a scheduler may employ several (or even all) of
the above strategies. Reactive strategies (stealing and requesting)
deal with inevitable dynamic load imbalance, but they usually have
very limited local information about the system's state, so they make
sub-optimal decisions. Proactive strategies (distribution and
balancing), on the other hand, have global information about the
system's state, so they make one-shot optimal scheduling decisions;
but they are unable to cope with inevitable dynamic load imbalance.

A scheduler must employ at least one of the reactive strategies in
order to cope with continuous and inevitable dynamic load imbalance,
and may optionally include one or both proactive strategies in order
to cut down stealing/requesting costs. So the general recipe for a
scheduler is:

SCHEDULER = (STEALING ^ REQUESTING) [+DISTRIBUTION] [+BALANCING]

Work-Stealing vs. Work-Requesting
So we have to choose between work-stealing and work-requesting.
Work-stealing has a serious advantage over work-requesting due to its
asynchronous nature: a thief thread is able to get some work even if
the victim thread is busy processing a user task, or even de-scheduled
by the OS. With work-requesting, a thief thread is able to get some
work only if the victim thread condescends to send it (which it is
simply unable to do while de-scheduled by the OS). However, in our
case that is not a problem, because tasks are small and limited in
size, and we are not going to over-subscribe processors with threads.
There are also 2 problems with work-stealing due to its asynchronous
nature. First, it inherently incurs some observable per-task overhead,
because every pop operation on a thread's work deque must be
synchronized with a potential asynchronous steal operation from
another thread. Stealing is rare, but one still has to pay that price
on every pop operation. The price is at least a single store-load
style memory fence (an MFENCE instruction on the x86 architecture), or
a single atomic RMW operation (a LOCK-prefixed instruction on x86).
Work-requesting is free of this problem: the task deque is completely
local to a thread and requires no synchronization.
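For illustration, here is a minimal sketch of the owner-side pop of a
Chase-Lev-style work-stealing deque (not part of my scheduler; the
fixed-size buffer and index layout are assumed for brevity). The
unconditional store-load fence is exactly the per-pop price mentioned
above:

#include <atomic>

// task_t is the user task type (as in the scheduler code below)
struct ws_deque
{
    static long const buffer_size = 1024;
    task_t buffer[buffer_size];
    std::atomic<long> top;     // thieves steal from here
    std::atomic<long> bottom;  // the owner pushes/pops here

    bool pop(task_t& out)
    {
        long b = bottom.load(std::memory_order_relaxed) - 1;
        bottom.store(b, std::memory_order_relaxed);
        // store-load fence (MFENCE on x86): the store to bottom
        // must be globally visible before we read top; paid on
        // every pop even though actual steals are rare
        std::atomic_thread_fence(std::memory_order_seq_cst);
        long t = top.load(std::memory_order_relaxed);
        if (t > b)
        {
            // the deque is empty, undo the speculative decrement
            bottom.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        out = buffer[b % buffer_size];
        if (t == b)
        {
            // last element: resolve the race with concurrent
            // steals via CAS
            bool won = top.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst);
            bottom.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }
};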

The second problem has a similar nature and relates to the join phase
of parallel algorithms (game state backtracking in our case).
Traditional handling of task completion involves decrementing a
pending-children counter in the parent task. Due to the asynchronous
nature of work-stealing, the decrement has to be synchronized with
other potential concurrent decrement operations.
Work-requesting is free of this problem, too. During execution of the
work-requesting protocol, the victim thread can mark the parent task
with a 'has_stolen_children' flag, and synchronization is then used
only for such tasks, while the great bulk of tasks proceeds without
any synchronization.
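To make the difference concrete, here is an illustrative sketch of
such conditional join synchronization (the type, field and function
names here are mine, not from the full source):

#include <atomic>

struct join_task
{
    join_task* parent;
    std::atomic<int> pending_children;
    bool has_stolen_children;  // set by the victim inside the
                               // work-requesting protocol
};

void resume(join_task* t);  // continue the parent (provided elsewhere)

void on_child_completed(join_task* child)
{
    join_task* p = child->parent;
    if (!p->has_stolen_children)
    {
        // common case: all children ran on the parent's thread,
        // so a plain load/store pair (no LOCK prefix) suffices
        int n = p->pending_children.load(std::memory_order_relaxed) - 1;
        p->pending_children.store(n, std::memory_order_relaxed);
        if (n == 0)
            resume(p);
    }
    else
    {
        // rare case: a child may complete on another thread,
        // so the decrement must be an atomic RMW
        if (p->pending_children.fetch_sub(1,
                std::memory_order_acq_rel) == 1)
            resume(p);
    }
}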
Due to the aforementioned problems, I've chosen a work-requesting
protocol for task scheduling. Basically, it allows us to cut 2 atomic
RMW operations per task.

Work-Distribution and Work-Balancing
Now we must decide whether to use work-distribution and/or
work-balancing in addition to work-requesting. Work-balancing is too
cumbersome to implement, and my educated guess is that it's not going
to provide any significant speedup (it's more suitable for work DAGs
without the "busy leaves" property).
Work-distribution is a different story. It's easier to implement, and
it may provide some speedup. I decided to implement work-distribution
only at the initial stage. Namely, all initial game states are evenly
distributed across all worker threads in a round-robin fashion. So
each worker thread has quite a lot of work to begin with, and
work-requesting comes into play only towards the end of the
computation.
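A minimal sketch of that initial distribution (it assumes direct
access to the workers' deques, which is safe here only because the
worker threads are not running yet):

#include <vector>

void distribute_initial_work(std::vector<task_t> const& initial_states,
                             size_t thread_count)
{
    for (size_t i = 0; i != initial_states.size(); i += 1)
    {
        // get_thread() is the same lookup used by the scheduler
        // below; no synchronization is needed before the workers
        // start
        get_thread(i % thread_count).tasks.push_back(initial_states[i]);
    }
}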

Scheduler Algorithm
Here is slightly simplified code of my scheduler (some functions are
omitted for brevity). The full source code is presented in Appendix B.
#include <atomic>
#include <cstddef>
#include <deque>
#include <windows.h>  // for Sleep()

using namespace std;

// sentinel value meaning "no pending work request"
size_t const no_requests = (size_t)-1;

class worker_thread
{
    // global index of the current thread
    size_t my_index;
    // local tasks
    deque<task_t> tasks;
    // 1-element queue for work requests from other threads
    atomic<size_t> steal_req;
    // 1-element queue for work request acknowledgements from other threads
    atomic<size_t> steal_ack;

    // main thread loop
    void loop()
    {
        for (;;)
        {
            // while we have local work, process it
            while (tasks.size() != 0)
            {
                // pop a task in LIFO order
                task_t task = tasks.back();
                tasks.pop_back();

                // user processing
                execute(task);

                // check for steal requests from other threads
                if (steal_req.load(memory_order_relaxed) != no_requests)
                    process_work_request();
            }

            // ack any pending request before requesting work
            // ourselves (we have nothing to give at this point)
            if (steal_req.load(memory_order_relaxed) != no_requests)
                process_work_request();

            // try to get some work from other threads
            if (false == send_steal_request())
                break;
        }
    }

    void process_work_request()
    {
        // get the thief descriptor
        size_t thief_idx = steal_req.load(memory_order_relaxed);
        worker_thread& thief = get_thread(thief_idx);
        if (tasks.size())
        {
            // pop a task in FIFO order, i.e. from the opposite
            // end of the deque than the owner pops from
            task_t task = tasks.front();
            tasks.pop_front();

            // synchronous user processing
            steal(task);

            // give it to the thief
            thief.tasks.push_back(task);
        }
        // reopen our request queue for subsequent thieves
        steal_req.store(no_requests, memory_order_relaxed);
        // notify the thief that the operation is completed
        thief.steal_ack.store(1, memory_order_release);
    }

    bool send_steal_request()
    {
        for (;;)
        {
            // choose a victim
            size_t victim_idx = choose_randomly();
            worker_thread& victim = get_thread(victim_idx);

            // send a request to it (unless it's busy processing
            // another request)
            steal_ack.store(0, memory_order_relaxed);
            size_t cmp = victim.steal_req.load(memory_order_relaxed);
            for (;;)
            {
                // somebody else's request is already pending
                if (cmp != no_requests)
                    break;
                // try to install our index as the pending request
                if (victim.steal_req.compare_exchange_strong(
                        cmp, my_index, memory_order_acq_rel))
                    break;
            }
            // was the request sent?
            if (cmp == no_requests)
            {
                // wait for the ack
                while (steal_ack.load(memory_order_acquire) == 0)
                    Sleep(0);
                // check whether we got some work
                if (tasks.size())
                    return true;
            }
            // termination condition
            if (no_work_in_the_system())
                return false;
        }
    }
};
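For completeness, here is a hypothetical driver (not from the
write-up) showing how the pieces might be tied together;
create_threads(), initial_game_states() and access to loop() are
assumptions for illustration:

#include <thread>
#include <vector>

int main()
{
    // one worker per hardware thread; no over-subscription
    size_t const n = std::thread::hardware_concurrency();
    create_threads(n);

    // pre-distribute the initial states before any worker runs
    distribute_initial_work(initial_game_states(), n);

    // run one scheduler loop per thread until global termination
    std::vector<std::thread> pool;
    for (size_t i = 0; i != n; i += 1)
        pool.emplace_back([i] { get_thread(i).loop(); });
    for (size_t i = 0; i != n; i += 1)
        pool[i].join();
    return 0;
}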

--
Dmitriy V'jukov

Jedd Haberstro

Aug 28, 2010, 2:38:04 AM
to Scalable Synchronization Algorithms
Do you have a link to your whole Intel TC2010 write-up? I'd love to
read the rest! :)

Dmitriy Vyukov

Sep 1, 2010, 3:28:20 AM
to Scalable Synchronization Algorithms
On Aug 28, 9:38 am, Jedd Haberstro <jhabers...@gmail.com> wrote:
> Do you have a link to your whole Intel TC2010 write-up? I'd love to
> read the rest! :)

Hi Jedd,

Here it is:
https://sites.google.com/site/1024cores/skiplist_vyukov.zip

--
Dmitriy V'jukov

Jedd Haberstro

Sep 9, 2010, 12:26:49 PM
to Scalable Synchronization Algorithms
Thanks Dmitriy!