Work stealing algorithms are a well researched and effective approach. A few might have sneaked into JUC and the parallel young gen collector :-)
The problem with this implementation is that it requires sequential consistency and no main stream processors offer this without memory fences which greatly impacts throughput. The technique they use to overcome this, as described by Dice et al, requires some low level platform specific coding to make it work so Java is out of the question :-)
There are other approaches that scale linearly when FIFO is not a requirement. The SALSA algorithm as proposed here is not FIFO BTW.
Very interesting to see more research into this. For me a more interesting area is not outright throughput but how to deal with bursty traffic so that consumer threads are scaled effectively in relation to workload. Real world traffic is virtually never at a consistent flow rate.