Love to see some comments on this paper


Kimo Crossman

Feb 17, 2014, 2:59:25 PM
to lock-free
"Fence-Free Work Stealing on Bounded TSO Processors" ASPLOS '14

Abstract
Work stealing is the method of choice for load balancing
in task parallel programming languages and frameworks.
Yet despite considerable effort invested in optimizing work
stealing task queues, existing algorithms issue a costly memory
fence when removing a task, and these fences are believed
to be necessary for correctness.
This paper refutes this belief, demonstrating work stealing
algorithms in which a worker does not issue a memory
fence for microarchitectures with a bounded total store ordering
(TSO) memory model. Bounded TSO is a novel restriction
of TSO – capturing mainstream x86 and SPARC
TSO processors – that bounds the number of stores a load
can be reordered with.
Our algorithms eliminate the memory fence penalty, improving
the running time of a suite of parallel benchmarks
on modern x86 multicore processors by 7%–11% on average
(and up to 23%), compared to the Cilk and Chase-Lev
work stealing queues.


Dmitriy V'jukov

Mar 11, 2014, 1:28:23 PM
to lock...@googlegroups.com
Thanks, Kimo, for posting it here!


This looks like a very interesting and novel approach. I think I can
see how and why it works.

Some random comments:
The approach does not seem to be extensible to other synchronization
algorithms (in particular mutexes), because it inherently requires
state distribution (N items available at the same time). However,
it is extensible to, say, a distributed object (which is, well,
essentially the same as a work-stealing deque).

If I want to use it in library code, I am not sure how I could
reliably determine the store buffer bound.

When the size of the deque drops below the bound, they effectively
switch to work-requesting. (1) I suspect that their benchmarks
effectively benchmark work-requesting. (2) Work-requesting works
well for the compute-bound workloads they used, but is not
applicable in other domains; e.g. if a task spawns 2 other tasks, one
generally wants the tasks to execute in parallel, while with their
algorithm both tasks will execute sequentially. (3) It would be
interesting to try to combine their approach with traditional
work-stealing, i.e. size<bound = pay for the barrier, size>bound = don't
pay for the barrier.

It seems to be well suited for the parallel for construct. Potentially it
would allow reducing per-iteration overhead to basically zero.
However, you would need some fake stores, because parallel for usually
stores to only 1 memory location (the iteration space), so their logic
won't work as is.

The idea of using static code analysis to determine the minimum number of
stores per task looks too... academic to me.