SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-Consumer Pools

204 views
Skip to first unread message

Andriy Plokhotnyuk

unread,
Jan 6, 2014, 10:19:38 AM1/6/14
to mechanica...@googlegroups.com
Dear mechanical sympathizers,

Did you see an algorithm for scalable task pool, proposed in the following paper:

It looks very promising.

Has anybody tried to implement it for JVM or other manageable runtime?

Martin Thompson

unread,
Jan 7, 2014, 7:40:29 AM1/7/14
to mechanica...@googlegroups.com
Work stealing algorithms are a well researched and effective approach. A few might have sneaked into JUC and the parallel young gen collector :-)  

The problem with this implementation is that it requires sequential consistency and no main stream processors offer this without memory fences which greatly impacts throughput. The technique they use to overcome this, as described by Dice et al, requires some low level platform specific coding to make it work so Java is out of the question :-)

There are other approaches that scale linearly when FIFO is not a requirement. The SALSA algorithm as proposed here is not FIFO BTW.

Very interesting to see more research into this. For me a more interesting area is not outright throughput but how to deal with bursty traffic so that consumer threads are scaled effectively in relation to workload. Real world traffic is virtually never at a consistent flow rate.

Nitsan Wakart

unread,
Jan 8, 2014, 10:36:31 AM1/8/14
to mechanica...@googlegroups.com
For lazy people who dont' want to reda the whole thing to find out what Martin meant ;-)
"The problem with this implementation is that it requires sequential consistency and no main stream processors offer this without memory fences which greatly impacts throughput. The technique they use to overcome this, as described by Dice et al, requires some low level platform specific coding to make it work so Java is out of the question :-)"
Martin is referring to the bit at the end of page 13:" In our implementation, we adopt the synchronization technique described by Dice et al. [11], where the slow thread (namely, the stealer) binds directly to the processor on which the fast thread (namely, the con- sumer) is currently running, preempting it from the processor, and then returns to run on its own processor.
Thread displacement serves as a full memory fence, hence, a stealer that invokes the displacement binding right after updating the ownership (before line 119 in Algorithm 5) observes the updated consumer’s index. On the other hand, the steal-free fast path is not affected by this change. "
Interesting technique, could be done with some JNI, but would kill the performance.


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Reply all
Reply to author
Forward
0 new messages