Thanks, Kimo, for posting it here!
This looks like a very interesting and novel approach. I think I can
see how and why it works.
Some random comments:
The approach does not seem to be extensible to other synchronization
algorithms (in particular mutexes) because it inherently requires
state distribution (N items available at the same time). However,
it should be extensible to, say, distributed objects (which are, well,
essentially the same thing as a ws deque).
If I want to use it in library code, I am not sure how I can
reliably determine the store-buffer bound.
When the size of the deque drops below the bound, they effectively
switch to work-requesting. (1) I suspect that their benchmarks
effectively benchmark work-requesting. (2) Work-requesting works
well for the good computational workloads that they used, but is not
applicable in other domains; e.g. if a task spawns 2 other tasks, one
generally wants the tasks to execute in parallel, while with their
algorithm both tasks will execute sequentially. (3) It would be
interesting to try to combine their approach with traditional
work-stealing, i.e. size < bound => pay for the barrier,
size > bound => don't pay for the barrier.
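To make (3) a bit more concrete, here is a rough sketch of a Chase-Lev-style owner-side take() that pays for the store-load barrier only when the deque size is below the bound. Everything here is illustrative: BOUND, the capacity, and the fence-free fast path are assumptions, and the fast path is only sound if the paper's store-buffer argument actually covers that regime.

```cpp
#include <atomic>
#include <cstdint>

// Hybrid sketch: fenced take() below the store-buffer bound,
// fence-free above it. All names/constants are made up.
class HybridDeque {
    static const long BOUND = 16;   // assumed store-buffer bound
    static const long CAP = 1024;   // fixed capacity for the sketch
    std::atomic<long> top_;
    std::atomic<long> bottom_;
    intptr_t tasks_[CAP];
public:
    HybridDeque() : top_(0), bottom_(0) {}

    void push(intptr_t task) {
        long b = bottom_.load(std::memory_order_relaxed);
        tasks_[b % CAP] = task;
        // release so thieves observe the task before the new bottom
        bottom_.store(b + 1, std::memory_order_release);
    }

    bool take(intptr_t* out) {
        long b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        long t = top_.load(std::memory_order_relaxed);
        if (b - t < BOUND) {
            // size < bound: a thief may race for the same item,
            // so pay for the full barrier and re-read top
            std::atomic_thread_fence(std::memory_order_seq_cst);
            t = top_.load(std::memory_order_relaxed);
        }
        if (t > b) {                // deque is empty
            bottom_.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        *out = tasks_[b % CAP];
        if (t == b) {               // last element: resolve race via CAS
            bool won = top_.compare_exchange_strong(t, t + 1);
            bottom_.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }
};
```

The point of the sketch is just the branch in take(): large deques keep the cheap relaxed path, small ones fall back to the classic fenced protocol.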
It seems to be well suited for the parallel for construct. Potentially
it would allow reducing per-iteration overhead to basically zero.
However, you would need some fake stores, because a parallel for
usually stores to only 1 memory location (the iteration index), so
their logic won't work as is.
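One way the "fake stores" padding might look, purely as an illustration of the comment above (not the paper's actual requirement): the owner advances the iteration index with its one real store, then pads with dummy stores so the per-task store count matches whatever the store-buffer-bound argument needs. BOUND, pad, and next_iteration are all hypothetical names.

```cpp
#include <atomic>

static const int BOUND = 16;      // assumed store-buffer bound
static volatile long pad[BOUND];  // dummy locations for fake stores

// Claim the next iteration of a parallel for: one real store to the
// shared index, then BOUND-1 fake stores as padding.
long next_iteration(std::atomic<long>& index) {
    long i = index.load(std::memory_order_relaxed);
    index.store(i + 1, std::memory_order_relaxed);  // the one real store
    for (int k = 0; k < BOUND - 1; ++k)
        pad[k] = i;               // fake stores to fill the store buffer
    return i;
}
```

Whether the dummy stores can be kept from being optimized away or coalesced (hence the volatile) is exactly the kind of detail that makes this non-trivial in practice.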
The idea of using static code analysis to determine the minimum number
of stores per task looks too... academic to me.