#include <stdint.h>
#include <sched.h>

struct lock_state
{
    uint8_t flag0;
    uint8_t flag1;
    uint8_t turn;
    uint8_t padding;
};

union petersons
{
    struct lock_state state;
    uint32_t value; /* = 0 */
};

void thread_1_lock(union petersons volatile* p)
{
    p->state.flag0 = 1;
    p->state.turn = 1;
    /* no explicit store/load barrier: the single 32-bit load below
       overlaps the byte stores above (the collocation discussed below) */
    for (;;)
    {
        union petersons s;
        s.value = p->value; /* collocated 32-bit load of the whole word */
        if (!(s.state.flag1 && s.state.turn))
            break;
        sched_yield();
    }
}

void thread_1_unlock(union petersons volatile* p)
{
    p->state.flag0 = 0;
}

void thread_2_lock(union petersons volatile* p)
{
    p->state.flag1 = 1;
    p->state.turn = 0;
    /* no explicit store/load barrier: same collocated load as above */
    for (;;)
    {
        union petersons s;
        s.value = p->value; /* collocated 32-bit load of the whole word */
        if (!(s.state.flag0 && !s.state.turn))
            break;
        sched_yield();
    }
}

void thread_2_unlock(union petersons volatile* p)
{
    p->state.flag1 = 0;
}
It's called the 'collocation trick'.
It will work *and* will be faster than #StoreLoad on SPARC. See:
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
(search for 'collocation')
It seems that it works on x86 too; however, AFAIK it's not documented
and it's actually *not* faster than MFENCE.
--
Dmitriy V'jukov
[...]
> It's called the 'collocation trick'.
> It will work *and* will be faster than #StoreLoad on SPARC. See:
> http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
> (search for 'collocation')
Don't you still need a store/load anyway? I mean, can loads that occur
within the critical section rise above the acquisition logic in the
Petersons algorithm I posted?
> It seems that it works on x86 too; however, AFAIK it's not documented
> and it's actually *not* faster than MFENCE.
Have you profiled it? It seems like:
STORE16(&state, 1);
LOAD32(&state)
should be faster than
STORE32(&state1, 1);
LOAD32(&state2);
right? Is the collocation trick equal to a store/load wrt other variables?
If each thread had its own Peterson's lock with the collocation trick, could you
get a read-write lock without using explicit membars on a TSO mem-model?
The writer would take all locks, and a reader would take its own lock. Like:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/baf0ab7be19262f2
I think one need only #LoadLoad | #LoadStore:
STORE16(&state, 1);
LOAD32(&state);
if (...) ...
membar #LoadLoad | #LoadStore; //no-op for SPARC TSO and x86
// nothing from here can hoist above mutex acquisition
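A small sketch (mine, not from the original post) of that acquisition tail in C with a C11 fence; `state' stands in for the collocated lock word, and the availability check is reduced to a trivial zero test:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the collocated lock word (flags + turn packed into 32
   bits).  Zero means "free" in this simplified sketch. */
static volatile uint32_t state = 0;

/* After STORE16 plus the collocated LOAD32, only #LoadLoad | #LoadStore
   is still needed, i.e. acquire semantics: nothing from the critical
   section may hoist above the load that decided we own the lock.  On
   x86 and SPARC TSO this fence emits no instruction; it only restrains
   the compiler. */
static int try_enter(void)
{
    uint32_t v = state;  /* the collocated 32-bit load (LOAD32) */
    if (v != 0)          /* other thread's flag/turn visible: must wait */
        return 0;
    atomic_thread_fence(memory_order_acquire);
    return 1;            /* critical section may start here */
}
```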
> > It seems that it works on x86 too; however, AFAIK it's not documented
> > and it's actually *not* faster than MFENCE.
>
> Have you profiled it? It seems like:
>
> STORE16(&state, 1);
> LOAD32(&state)
>
> should be faster than
>
> STORE32(&state1, 1);
> LOAD32(&state2);
>
> right? Is the collocation trick equal to a store/load wrt other variables?
STORE16(&state, 1);
LOAD32(&state)
is a substitution for:
STORE32(&state1, 1);
MFENCE;
LOAD32(&state2);
I do not remember whether it was me who measured this, but AFAIR the
above variants have equal performance on Intel x86, i.e. the processor
effectively emits an MFENCE when it detects a RAW (read-after-write)
conflict with mismatching operand addresses/sizes.
--
Dmitriy V'jukov
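The substitution above can be sketched concretely (an illustration only; the union matches the Peterson code at the top of the thread, and the helper name is invented):

```c
#include <stdint.h>

struct lock_state
{
    uint8_t flag0;
    uint8_t flag1;
    uint8_t turn;
    uint8_t padding;
};

union petersons
{
    struct lock_state state;
    uint32_t value;
};

/* Instead of
       STORE32(&state1, 1); MFENCE; LOAD32(&state2);
   the collocation trick stores into a word and immediately loads the
   SAME word.  The load partially overlaps the pending store
   (mismatching address/size), so the processor cannot satisfy it early
   from the store buffer and must drain the store first -- the ordering
   MFENCE would otherwise provide. */
static uint32_t store_flag0_then_load_all(union petersons volatile* p)
{
    p->state.flag0 = 1; /* narrow store into the 32-bit word (the
                           thread's STORE16 plays the same role) */
    return p->value;    /* overlapping 32-bit load (LOAD32) */
}
```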
> If each thread had its own Peterson's lock with the collocation trick, could you
> get a read-write lock without using explicit membars on a TSO mem-model?
That's impossible. Strong contention resolution can't be cheap by its
nature. Period.
It requires an expensive membar or something equivalent. If you do not
write it explicitly then the processor will emit it implicitly (or your
mutex just won't work). In either case strong synchronization won't be
cheap.
To the best of my knowledge the only way to remove that nasty
#StoreLoad is to employ asymmetric synchronization:
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization.txt
http://groups.google.ru/group/lock-free/browse_frm/thread/1efdc652571c6137
http://groups.google.ru/group/lock-free/browse_frm/thread/31f07e15df7f988e
http://groups.google.ru/group/comp.programming.threads/msg/3406433437fffe1e
--
Dmitriy V'jukov
> > Is the collocation trick equal to a store/load wrt other variables?
> STORE16(&state, 1);
> LOAD32(&state)
> is a substitution for:
> STORE32(&state1, 1);
> MFENCE;
> LOAD32(&state2);
> I do not remember whether it was me who measured this, but AFAIR the
> above variants have equal performance on Intel x86, i.e. the processor
> effectively emits an MFENCE when it detects a RAW (read-after-write)
> conflict with mismatching operand addresses/sizes.
Here is a very simple little test I wrote for MSVC++:
#include <cstdio>

#define USE_MFENCE

int main()
{
    unsigned int volatile a = 0xFFFF;
    unsigned int volatile b = 0;
#if defined (USE_MFENCE)
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        a = 1;                          /* 32-bit store */
        _asm { MFENCE };                /* explicit #StoreLoad */
        unsigned int volatile c = b;    /* 32-bit load from another location */
    }
#else
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        unsigned short* volatile p16 = (unsigned short*)&a;
        *p16 = 1;                       /* 16-bit store into a */
        unsigned int volatile c = a;    /* overlapping 32-bit load from a */
    }
#endif
    std::puts("Complete!");
    return 0;
}
On my old 3.06 GHz P4 it takes around 51 seconds for the MFENCE version to
complete. On the other hand, it takes around 13 seconds for the collocation
trick store/load emulation to complete. Do you know why MFENCE is taking so
much longer? It seems like the collocation trick might be faster?
Thanks.
STORE16(&state1, 1);
LOAD32(&state1);
LOAD32(&state2);
Can the load from state2 rise above the 16-bit store? It seems like it
should be able to. Where am I going wrong?
[...]
But does it (the Peterson algorithm based on the collocation trick)
provide mutual exclusion then?
--
Dmitriy V'jukov
Yes, it can. But there is no such code in the Peterson algorithm; there
is only one load.
--
Dmitriy V'jukov
I was referring to a load that occurred within the critical section. Can
loads from the critical section rise above the 16-bit store? Can that cause
any problems?
[...]
> > On my old 3.06 GHz P4 it takes around 51 seconds for the MFENCE version
> > to complete. On the other hand, it takes around 13 seconds for the
> > collocation trick store/load emulation to complete. Do you know why
> > MFENCE is taking so much longer? It seems like the collocation trick
> > might be faster?
> But does it (the Peterson algorithm based on the collocation trick)
> provide mutual exclusion then?
The test code does not. I was just testing the performance in a
single-threaded loop. It seems like this should model the best
non-contended scenario. If you are indeed right that:
STORE16(&state, 1);
LOAD32(&state)
is a substitution for:
STORE32(&state1, 1);
MFENCE;
LOAD32(&state2);
then the Peterson algorithm based on the collocation trick should
provide proper mutual exclusion on TSO systems.
I can only cite myself:
I think one need only #LoadLoad | #LoadStore:
STORE16(&state, 1);
LOAD32(&state);
if (...) ...
membar #LoadLoad | #LoadStore; //no-op for SPARC TSO and x86
// nothing from here can hoist above mutex acquisition
--
Dmitriy V'jukov
> [...]
>
> > > On my old 3.06 GHz P4 it takes around 51 seconds for the MFENCE
> > > version to complete. On the other hand, it takes around 13 seconds for
> > > the collocation trick store/load emulation to complete. Do you know why
> > > MFENCE is taking so much longer? It seems like the collocation trick
> > > might be faster?
> > But does it (the Peterson algorithm based on the collocation trick)
> > provide mutual exclusion then?
>
> The test code does not. I was just testing the performance in a
> single-threaded loop. It seems like this should model the best
> non-contended scenario. If you are indeed right that:
>
> STORE16(&state, 1);
> LOAD32(&state)
>
> is a substitution for:
>
> STORE32(&state1, 1);
> MFENCE;
> LOAD32(&state2);
I did not mean that it's a substitution. I meant that if you test one
variant against another then you must test against:
STORE32(&state1, 1);
MFENCE;
LOAD32(&state2);
and not against:
STORE32(&state1, 1);
LOAD32(&state2);
I think that if collocation is observably faster than MFENCE then that
means that it just does not work.
--
Dmitriy V'jukov
In the relevant portion of the test code:
#if defined (USE_MFENCE)
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        a = 1;
        _asm { MFENCE };
        unsigned int volatile c = b;
    }
#else
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        unsigned short* volatile p16 = (unsigned short*)&a;
        *p16 = 1;
        unsigned int volatile c = a;
    }
#endif
For the MFENCE version I am performing a 32-bit store to location a, an
MFENCE, then a 32-bit load from location b.
For the collocation version I am performing a 16-bit store to location a,
then a 32-bit load from location a.
That should be fair?
> I think that if collocation is observably faster than MFENCE then that
> means that it just does not work.
The MFENCE version of the test is much slower than the collocation version.
So, does the collocation trick not work on an x86?
Yes.
> > I think that if collocation is observably faster than MFENCE then that
> > means that it just does not work.
>
> The MFENCE version of the test is much slower than the collocation version.
> So, does the collocation trick not work on an x86?
The first question I would answer is: does the Peterson algorithm with
collocation actually provide mutual exclusion on x86 (is it guaranteed,
or does it just seem to work on some processors)? If the answer is No,
then the performance of collocation is irrelevant for me.
--
Dmitriy V'jukov
> I can only cite myself:
> I think one need only #LoadLoad | #LoadStore:
> STORE16(&state, 1);
> LOAD32(&state);
> if (...) ...
> membar #LoadLoad | #LoadStore; //no-op for SPARC TSO and x86
> // nothing from here can hoist above mutex acquisition
But the:
STORE16(&state, 1);
LOAD32(&state);
will not prevent a store from within the critical section from rising
above it. This would mess things up, right?
> Yes.
> > > I think that if collocation is observably faster than MFENCE then that
> > > means that it just does not work.
> >
> > The MFENCE version of the test is much slower than the collocation
> > version.
> > So, does the collocation trick not work on an x86?
> The first question I would answer is: does the Peterson algorithm with
> collocation actually provide mutual exclusion on x86 (is it guaranteed,
> or does it just seem to work on some processors)? If the answer is No,
> then the performance of collocation is irrelevant for me.
I simply don't know if it works on x86. I don't know who to ask in order to
verify this. I could create a test which does not prove anything even if it
passes a hundred trillion times. Can you think of a way to verify that
collocation works on x86?
It depends on the hardware.
On IA-32, Intel64, and SPARC TSO it won't mess things up.
On IA-64 and SPARC RMO it can.
--
Dmitriy V'jukov
> > > For the MFENCE version I am performing a 32-bit store to location a, an
> > > MFENCE then a 32-bit load from location b.
>
> > > For the collocation version I am performing a 16-bit store to location a
> > > then a 32-bit load from location a.
>
> > > That should be fair?
> > Yes.
> > > > I think that if collocation is observably faster than MFENCE then that
> > > > means that it just does not work.
>
> > > The MFENCE version of the test is much slower than the collocation
> > > version.
> > > So, does the collocation trick not work on an x86?
> > The first question I would answer is: does the Peterson algorithm with
> > collocation actually provide mutual exclusion on x86 (is it guaranteed,
> > or does it just seem to work on some processors)? If the answer is No,
> > then the performance of collocation is irrelevant for me.
>
> I simply don't know if it works on x86. I don't know who to ask in order to
> verify this. I could create a test which does not prove anything even if it
> passes a hundred trillion times. Can you think of a way to verify that
> collocation works on x86?
Well, what about consulting those fine x86 manuals?
--
Dmitriy V'jukov
The performance of the collocation trick seems to be a little bit better
than the `LOCK' prefix:
http://cpt.pastebin.com/f5a2d7337
I get 71.015 seconds for collocation, and 79.578 seconds for the `LOCK'
prefix on an old P4 (look at the `PLOCK_MEMBAR' macro). Humm, I wonder
whether mutating a semaphore with different-sized atomic operations
violates anything? Also, the overhead of collocation seems to kick in on
contention, while the overhead of the LOCK prefix kicks in on a per-call
basis.
:^o
> The performance of the collocation trick seems to be a little bit better
> than the `LOCK' prefix:
>
> http://cpt.pastebin.com/f5a2d7337
>
> I get 71.015 seconds for collocation, and 79.578 seconds for the `LOCK'
> prefix on an old P4 (look at the `PLOCK_MEMBAR' macro). Humm, I wonder
> whether mutating a semaphore with different-sized atomic operations
> violates anything? Also, the overhead of collocation seems to kick in on
> contention, while the overhead of the LOCK prefix kicks in on a per-call
> basis.
>
> :^o
And what are the numbers w/o contention? Is collocation
significantly faster than LOCK w/o contention?
--
Dmitriy V'jukov
I removed contention by only creating a single thread. Simply comment out a
`pthread_create()' `pthread_join()' pair in the `main()' function. E.g.,
pthread_create(&tid[0], NULL, thread_1, NULL);
/* pthread_create(&tid[1], NULL, thread_2, NULL); */
/* pthread_join(tid[1], NULL); */
pthread_join(tid[0], NULL);
I am getting 35.125 seconds for collocation, and 40.954 seconds for `LOCK'
prefix. I should also test this against an actual `MFENCE' instruction.
Also, please note that this LOCK prefix is not operating on shared state:
__________________________________________________________
__declspec(naked)
void
atomic_membar(void)
{
    _asm
    {
        MOV EAX, 0
        LOCK XADD [ESP], EAX
        RET
    }
}
__________________________________________________________
The collocation trick, on the other hand, is actually operating on shared
state. Therefore, it might be interesting to check the performance against
a `LOCK' instruction on a piece of shared state:
__________________________________________________________
__declspec(naked)
void
atomic_membar(void)
{
    static uword32 g_dummy = 0;
    _asm
    {
        MOV EAX, 0
        LOCK XADD [g_dummy], EAX
        RET
    }
}
__________________________________________________________
Humm.
It has similar results: 40.012 seconds; `MFENCE' is basically equal to
`LOCK' at 38.125 seconds. I need to run the test presented by James to see
if I can recreate numbers similar to the ones he reported.
WRT the test code, on average, I would say that collocation is getting
around a 7 to 9 second improvement over `LOCK' and `MFENCE' in a
high-contention environment. Collocation gets around a 4 to 6 second
improvement in the zero-contention scenario on a P4. Not sure if it's
worth it. However, those seconds do add up...
;^)
I believe that one could use collocation to get around the explicit
`#StoreLoad' in the work-stealing deque algorithm. Instead of storing to
the tail and then loading the head, you can do a 32-bit store to the tail
and a single 64-bit load to get the head and tail together. That should
eliminate the `#StoreLoad' in between.
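That idea can be sketched with a collocated anchor word (names and layout are mine, purely illustrative; a real deque would need the full owner/stealer protocol around it):

```c
#include <stdint.h>

/* Hypothetical deque anchor: the stealers' head and the owner's tail
   collocated in one 64-bit word. */
union deque_anchor
{
    struct
    {
        uint32_t head;
        uint32_t tail;
    } pos;
    uint64_t both;
};

/* Owner side: publish the new tail with a 32-bit store, then read head
   AND tail back with a single 64-bit load.  The load overlaps the store
   (mismatching address/size), so the processor must drain the store
   before satisfying it -- replacing the explicit #StoreLoad between
   "store tail" and "load head". */
static uint64_t store_tail_then_snapshot(union deque_anchor volatile* d,
                                         uint32_t new_tail)
{
    d->pos.tail = new_tail; /* 32-bit store */
    return d->both;         /* overlapping 64-bit load of head+tail */
}
```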