Thanks, Kimo, for posting it here!
This looks like a very interesting and novel approach. I think I can
see how and why it works.
Some random comments:
The approach does not seem to be extensible to other synchronization
algorithms (in particular mutexes) because it inherently requires
state distribution (N items available at the same time). However,
it should be extensible to, say, distributed objects (which are, well,
essentially the same thing as a ws deque).
If I want to use it in library code, I am not sure how I can
reliably determine the store-buffer bound.
When the size of the deque drops below the bound, they effectively
switch to work-requesting. (1) I suspect that their benchmarks
effectively benchmark work-requesting. (2) Work-requesting works
well for the good computational workloads that they used, but is not
applicable in other domains; e.g. if a task spawns 2 other tasks, one
generally wants the tasks to execute in parallel, while with their
algorithm both tasks will execute sequentially. (3) It would be
interesting to try to combine their approach with traditional
work-stealing, i.e. size < bound => pay for the barrier,
size > bound => don't pay for the barrier.
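To make (3) a bit more concrete, here is a rough sketch of a Chase-Lev-style owner-side take() that pays for the store-load barrier only when the deque size is below the bound. Everything here is illustrative: BOUND, the capacity, and the fence-free fast path are assumptions, and the fast path is only sound if the paper's store-buffer argument actually covers that regime.

```cpp
#include <atomic>
#include <cstdint>

// Hybrid sketch: fenced take() below the store-buffer bound,
// fence-free above it. All names/constants are made up.
class HybridDeque {
    static const long BOUND = 16;   // assumed store-buffer bound
    static const long CAP = 1024;   // fixed capacity for the sketch
    std::atomic<long> top_;
    std::atomic<long> bottom_;
    intptr_t tasks_[CAP];
public:
    HybridDeque() : top_(0), bottom_(0) {}

    void push(intptr_t task) {
        long b = bottom_.load(std::memory_order_relaxed);
        tasks_[b % CAP] = task;
        // release so thieves observe the task before the new bottom
        bottom_.store(b + 1, std::memory_order_release);
    }

    bool take(intptr_t* out) {
        long b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        long t = top_.load(std::memory_order_relaxed);
        if (b - t < BOUND) {
            // size < bound: a thief may race for the same item,
            // so pay for the full barrier and re-read top
            std::atomic_thread_fence(std::memory_order_seq_cst);
            t = top_.load(std::memory_order_relaxed);
        }
        if (t > b) {                // deque is empty
            bottom_.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        *out = tasks_[b % CAP];
        if (t == b) {               // last element: resolve race via CAS
            bool won = top_.compare_exchange_strong(t, t + 1);
            bottom_.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }
};
```

The point of the sketch is just the branch in take(): large deques keep the cheap relaxed path, small ones fall back to the classic fenced protocol.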
It seems to be well suited for the parallel for construct. Potentially
it would allow reducing per-iteration overhead to basically zero.
However, you would need some fake stores, because a parallel for
usually stores to only 1 memory location (the iteration index), so
their logic won't work as is.
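One way the "fake stores" padding might look, purely as an illustration of the comment above (not the paper's actual requirement): the owner advances the iteration index with its one real store, then pads with dummy stores so the per-task store count matches whatever the store-buffer-bound argument needs. BOUND, pad, and next_iteration are all hypothetical names.

```cpp
#include <atomic>

static const int BOUND = 16;      // assumed store-buffer bound
static volatile long pad[BOUND];  // dummy locations for fake stores

// Claim the next iteration of a parallel for: one real store to the
// shared index, then BOUND-1 fake stores as padding.
long next_iteration(std::atomic<long>& index) {
    long i = index.load(std::memory_order_relaxed);
    index.store(i + 1, std::memory_order_relaxed);  // the one real store
    for (int k = 0; k < BOUND - 1; ++k)
        pad[k] = i;               // fake stores to fill the store buffer
    return i;
}
```

Whether the dummy stores can be kept from being optimized away or coalesced (hence the volatile) is exactly the kind of detail that makes this non-trivial in practice.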
The idea of using static code analysis to determine the minimum number
of stores per task looks too... academic to me.