#include <stdint.h>
#include <sched.h>

struct lock_state
{
    uint8_t flag0;
    uint8_t flag1;
    uint8_t turn;
    uint8_t padding;
};

union petersons
{
    struct lock_state state;
    uint32_t value; /* = 0 */
};

void thread_1_lock(union petersons volatile* p)
{
    p->state.flag0 = 1;
    p->state.turn = 1;
    /* no explicit store/load barrier: the single 32-bit load below
       overlaps the byte stores above (the collocation discussed below) */
    for (;;)
    {
        union petersons s;
        s.value = p->value; /* collocated 32-bit load of the whole word */
        if (!(s.state.flag1 && s.state.turn))
            break;
        sched_yield();
    }
}

void thread_1_unlock(union petersons volatile* p)
{
    p->state.flag0 = 0;
}

void thread_2_lock(union petersons volatile* p)
{
    p->state.flag1 = 1;
    p->state.turn = 0;
    /* no explicit store/load barrier: same collocated load as above */
    for (;;)
    {
        union petersons s;
        s.value = p->value; /* collocated 32-bit load of the whole word */
        if (!(s.state.flag0 && !s.state.turn))
            break;
        sched_yield();
    }
}

void thread_2_unlock(union petersons volatile* p)
{
    p->state.flag1 = 0;
}
It's called the 'collocation trick'.
It will work *and* will be faster than #StoreLoad on SPARC. See:
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
(search for 'collocation')
It seems that it works on x86 too; however, AFAIK it's not documented
and it's actually *not* faster than MFENCE.
--
Dmitriy V'jukov
[...]
> It's called the 'collocation trick'.
> It will work *and* will be faster than #StoreLoad on SPARC. See:
> http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
> (search for 'collocation')
Don't you still need a store/load anyway? I mean, can loads that occur
within the critical section rise above the acquisition logic in the
Petersons algorithm I posted?
> It seems that it works on x86 too; however, AFAIK it's not documented
> and it's actually *not* faster than MFENCE.
Have you profiled it? It seems like:
STORE16(&state, 1);
LOAD32(&state)
should be faster than
STORE32(&state1, 1);
LOAD32(&state2);
right? Is the collocation trick equal to a store/load wrt other variables?
If each thread had its own Peterson's lock with the collocation trick, could you
get a read-write lock without using explicit membars on a TSO mem-model?
The writer would take all locks, and a reader would take its own lock. Like:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/baf0ab7be19262f2
I think one need only #LoadLoad | #LoadStore:
STORE16(&state, 1);
LOAD32(&state);
if (...) ...
membar #LoadLoad | #LoadStore; //no-op for SPARC TSO and x86
// nothing from here can hoist above mutex acquisition
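A small sketch (mine, not from the original post) of that acquisition tail in C with a C11 fence; `state' stands in for the collocated lock word, and the availability check is reduced to a trivial zero test:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the collocated lock word (flags + turn packed into 32
   bits).  Zero means "free" in this simplified sketch. */
static volatile uint32_t state = 0;

/* After STORE16 plus the collocated LOAD32, only #LoadLoad | #LoadStore
   is still needed, i.e. acquire semantics: nothing from the critical
   section may hoist above the load that decided we own the lock.  On
   x86 and SPARC TSO this fence emits no instruction; it only restrains
   the compiler. */
static int try_enter(void)
{
    uint32_t v = state;  /* the collocated 32-bit load (LOAD32) */
    if (v != 0)          /* other thread's flag/turn visible: must wait */
        return 0;
    atomic_thread_fence(memory_order_acquire);
    return 1;            /* critical section may start here */
}
```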
> > It seems that it works on x86 too; however, AFAIK it's not documented
> > and it's actually *not* faster than MFENCE.
>
> Have you profiled it? It seems like:
>
> STORE16(&state, 1);
> LOAD32(&state)
>
> should be faster than
>
> STORE32(&state1, 1);
> LOAD32(&state2);
>
> right? Is the collocation trick equal to a store/load wrt other variables?
STORE16(&state, 1);
LOAD32(&state)
is a substitution for:
STORE32(&state1, 1);
MFENCE;
LOAD32(&state2);
I do not remember whether it was me who measured this, but AFAIR the
above variants have equal performance on Intel x86, i.e. the processor
effectively emits an MFENCE when it detects a RAW (read-after-write)
conflict with mismatching operand addresses/sizes.
--
Dmitriy V'jukov
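The substitution above can be sketched concretely (an illustration only; the union matches the Peterson code at the top of the thread, and the helper name is invented):

```c
#include <stdint.h>

struct lock_state
{
    uint8_t flag0;
    uint8_t flag1;
    uint8_t turn;
    uint8_t padding;
};

union petersons
{
    struct lock_state state;
    uint32_t value;
};

/* Instead of
       STORE32(&state1, 1); MFENCE; LOAD32(&state2);
   the collocation trick stores into a word and immediately loads the
   SAME word.  The load partially overlaps the pending store
   (mismatching address/size), so the processor cannot satisfy it early
   from the store buffer and must drain the store first -- the ordering
   MFENCE would otherwise provide. */
static uint32_t store_flag0_then_load_all(union petersons volatile* p)
{
    p->state.flag0 = 1; /* narrow store into the 32-bit word (the
                           thread's STORE16 plays the same role) */
    return p->value;    /* overlapping 32-bit load (LOAD32) */
}
```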
> If each thread had its own Peterson's lock with the collocation trick, could you
> get a read-write lock without using explicit membars on a TSO mem-model?
That's impossible. Strong contention resolution can't be cheap by its
nature. Period.
It requires an expensive membar or something equivalent. If you do not
write it explicitly then the processor will emit it implicitly (or your
mutex just won't work). In either case strong synchronization won't be
cheap.
To the best of my knowledge the only way to remove that nasty
#StoreLoad is to employ asymmetric synchronization:
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization.txt
http://groups.google.ru/group/lock-free/browse_frm/thread/1efdc652571c6137
http://groups.google.ru/group/lock-free/browse_frm/thread/31f07e15df7f988e
http://groups.google.ru/group/comp.programming.threads/msg/3406433437fffe1e
--
Dmitriy V'jukov
> > Is the collocation trick equal to a store/load wrt other variables?
> STORE16(&state, 1);
> LOAD32(&state)
> is a substitution for:
> STORE32(&state1, 1);
> MFENCE;
> LOAD32(&state2);
> I do not remember whether it was me who measured this, but AFAIR the
> above variants have equal performance on Intel x86, i.e. the processor
> effectively emits an MFENCE when it detects a RAW (read-after-write)
> conflict with mismatching operand addresses/sizes.
Here is a very simple little test I wrote for MSVC++:
#include <cstdio>

#define USE_MFENCE

int main()
{
    unsigned int volatile a = 0xFFFF;
    unsigned int volatile b = 0;
#if defined (USE_MFENCE)
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        a = 1;                          /* 32-bit store */
        _asm { MFENCE };                /* explicit #StoreLoad */
        unsigned int volatile c = b;    /* 32-bit load from another location */
    }
#else
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        unsigned short* volatile p16 = (unsigned short*)&a;
        *p16 = 1;                       /* 16-bit store into a */
        unsigned int volatile c = a;    /* overlapping 32-bit load from a */
    }
#endif
    std::puts("Complete!");
    return 0;
}
On my old 3.06 GHz P4 it takes around 51 seconds for the MFENCE version to
complete. On the other hand, it takes around 13 seconds for the collocation
trick store/load emulation to complete. Do you know why MFENCE is taking so
much longer? It seems like the collocation trick might be faster?
Thanks.
STORE16(&state1, 1);
LOAD32(&state1);
LOAD32(&state2);
Can the load from state2 rise above the 16-bit store? It seems like it
should be able to. Where am I going wrong?
[...]
But does it (the Peterson algorithm based on the collocation trick)
provide mutual exclusion then?
--
Dmitriy V'jukov
Yes, it can. But there is no such code in the Peterson algorithm; there
is only one load.
--
Dmitriy V'jukov
I was referring to a load that occurred within the critical section. Can
loads from the critical section rise above the 16-bit store? Can that cause
any problems?
[...]
> > On my old 3.06 GHz P4 it takes around 51 seconds for the MFENCE version
> > to complete. On the other hand, it takes around 13 seconds for the
> > collocation trick store/load emulation to complete. Do you know why
> > MFENCE is taking so much longer? It seems like the collocation trick
> > might be faster?
> But does it (the Peterson algorithm based on the collocation trick)
> provide mutual exclusion then?
The test code does not. I was just testing the performance in a
single-threaded loop. It seems like this should model the best
non-contended scenario. If you are indeed right that:
STORE16(&state, 1);
LOAD32(&state)
is a substitution for:
STORE32(&state1, 1);
MFENCE;
LOAD32(&state2);
then the Peterson algorithm based on the collocation trick should
provide proper mutual exclusion on TSO systems.
I can only cite myself:
I think one need only #LoadLoad | #LoadStore:
STORE16(&state, 1);
LOAD32(&state);
if (...) ...
membar #LoadLoad | #LoadStore; //no-op for SPARC TSO and x86
// nothing from here can hoist above mutex acquisition
--
Dmitriy V'jukov
> [...]
>
> > > On my old 3.06 GHz P4 it takes around 51 seconds for the MFENCE
> > > version to complete. On the other hand, it takes around 13 seconds for
> > > the collocation trick store/load emulation to complete. Do you know why
> > > MFENCE is taking so much longer? It seems like the collocation trick
> > > might be faster?
> > But does it (the Peterson algorithm based on the collocation trick)
> > provide mutual exclusion then?
>
> The test code does not. I was just testing the performance in a
> single-threaded loop. It seems like this should model the best
> non-contended scenario. If you are indeed right that:
>
> STORE16(&state, 1);
> LOAD32(&state)
>
> is a substitution for:
>
> STORE32(&state1, 1);
> MFENCE;
> LOAD32(&state2);
I did not mean that it's a substitution. I meant that if you test one
variant against another then you must test against:
STORE32(&state1, 1);
MFENCE;
LOAD32(&state2);
and not against:
STORE32(&state1, 1);
LOAD32(&state2);
I think that if collocation is observably faster than MFENCE then that
means that it just does not work.
--
Dmitriy V'jukov
In the relevant portion of the test code:
#if defined (USE_MFENCE)
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        a = 1;
        _asm { MFENCE };
        unsigned int volatile c = b;
    }
#else
    for (unsigned int i = 0; i < 1000000000; ++i)
    {
        unsigned short* volatile p16 = (unsigned short*)&a;
        *p16 = 1;
        unsigned int volatile c = a;
    }
#endif
For the MFENCE version I am performing a 32-bit store to location a, an
MFENCE, then a 32-bit load from location b.
For the collocation version I am performing a 16-bit store to location a,
then a 32-bit load from location a.
That should be fair?
> I think that if collocation is observably faster than MFENCE then that
> means that it just does not work.
The MFENCE version of the test is much slower than the collocation version.
So, does the collocation trick not work on an x86?
Yes.
> > I think that if collocation is observably faster than MFENCE then that
> > means that it just does not work.
>
> The MFENCE version of the test is much slower than the collocation version.
> So, does the collocation trick not work on an x86?
The first question I would answer is: does the Peterson algorithm with
collocation actually provide mutual exclusion on x86 (is it guaranteed,
or does it just seem to work on some processors)? If the answer is No,
then the performance of collocation is irrelevant for me.
--
Dmitriy V'jukov
> I can only cite myself:
> I think one need only #LoadLoad | #LoadStore:
> STORE16(&state, 1);
> LOAD32(&state);
> if (...) ...
> membar #LoadLoad | #LoadStore; //no-op for SPARC TSO and x86
> // nothing from here can hoist above mutex acquisition
But the:
STORE16(&state, 1);
LOAD32(&state);
will not prevent a store from within the critical section from rising
above it. This would mess things up, right?
> Yes.
> > > I think that if collocation is observably faster than MFENCE then that
> > > means that it just does not work.
> >
> > The MFENCE version of the test is much slower than the collocation
> > version.
> > So, does the collocation trick not work on an x86?
> The first question I would answer is: does the Peterson algorithm with
> collocation actually provide mutual exclusion on x86 (is it guaranteed,
> or does it just seem to work on some processors)? If the answer is No,
> then the performance of collocation is irrelevant for me.
I simply don't know if it works on x86. I don't know who to ask in order to
verify this. I could create a test which does not prove anything even if it
passes a hundred trillion times. Can you think of a way to verify that
collocation works on x86?
It depends on the hardware.
On IA-32, Intel64, and SPARC TSO it won't mess things up.
On IA-64 and SPARC RMO it can.
--
Dmitriy V'jukov
> > > For the MFENCE version I am performing a 32-bit store to location a, an
> > > MFENCE then a 32-bit load from location b.
>
> > > For the collocation version I am performing a 16-bit store to location a
> > > then a 32-bit load from location a.
>
> > > That should be fair?
> > Yes.
> > > > I think that if collocation is observably faster than MFENCE then that
> > > > means that it just does not work.
>
> > > The MFENCE version of the test is much slower than the collocation
> > > version.
> > > So, does the collocation trick not work on an x86?
> > The first question I would answer is: does the Peterson algorithm with
> > collocation actually provide mutual exclusion on x86 (is it guaranteed,
> > or does it just seem to work on some processors)? If the answer is No,
> > then the performance of collocation is irrelevant for me.
>
> I simply don't know if it works on x86. I don't know who to ask in order to
> verify this. I could create a test which does not prove anything even if it
> passes a hundred trillion times. Can you think of a way to verify that
> collocation works on x86?
Well, what about consulting those fine x86 manuals?
--
Dmitriy V'jukov
The performance of the collocation trick seems to be a little bit better
than the `LOCK' prefix:
http://cpt.pastebin.com/f5a2d7337
I get 71.015 seconds for collocation, and 79.578 seconds for the `LOCK'
prefix on an old P4 (look at the `PLOCK_MEMBAR' macro). Humm, I wonder
whether mutating a semaphore with different-sized atomic operations
violates anything? Also, the overhead of collocation seems to kick in on
contention, while the overhead of the LOCK prefix kicks in on a per-call
basis.
:^o
> The performance of the collocation trick seems to be a little bit better
> than the `LOCK' prefix:
>
> http://cpt.pastebin.com/f5a2d7337
>
> I get 71.015 seconds for collocation, and 79.578 seconds for the `LOCK'
> prefix on an old P4 (look at the `PLOCK_MEMBAR' macro). Humm, I wonder
> whether mutating a semaphore with different-sized atomic operations
> violates anything? Also, the overhead of collocation seems to kick in on
> contention, while the overhead of the LOCK prefix kicks in on a per-call
> basis.
>
> :^o
And what are the numbers w/o contention? Is collocation
significantly faster than LOCK w/o contention?
--
Dmitriy V'jukov
I removed contention by only creating a single thread. Simply comment out a
`pthread_create()' `pthread_join()' pair in the `main()' function. E.g.,
pthread_create(&tid[0], NULL, thread_1, NULL);
/* pthread_create(&tid[1], NULL, thread_2, NULL); */
/* pthread_join(tid[1], NULL); */
pthread_join(tid[0], NULL);
I am getting 35.125 seconds for collocation, and 40.954 seconds for `LOCK'
prefix. I should also test this against an actual `MFENCE' instruction.
Also, please note that this LOCK prefix is not operating on shared state:
__________________________________________________________
__declspec(naked)
void
atomic_membar(void)
{
    _asm
    {
        MOV EAX, 0
        LOCK XADD [ESP], EAX
        RET
    }
}
__________________________________________________________
The collocation trick, on the other hand, is actually operating on shared
state. Therefore, it might be interesting to check the performance against
a `LOCK' instruction on a piece of shared state:
__________________________________________________________
__declspec(naked)
void
atomic_membar(void)
{
    static uword32 g_dummy = 0;
    _asm
    {
        MOV EAX, 0
        LOCK XADD [g_dummy], EAX
        RET
    }
}
__________________________________________________________
Humm.
It has similar results: 40.012 seconds; `MFENCE' is basically equal to
`LOCK' at 38.125 seconds. I need to run the test presented by James to see
if I can recreate numbers similar to the ones he reported.
WRT the test code, on average, I would say that collocation is getting
around a 7 to 9 second improvement over `LOCK' and `MFENCE' in a
high-contention environment. Collocation gets around a 4 to 6 second
improvement in the zero-contention scenario on a P4. Not sure if it's
worth it. However, those seconds do add up...
;^)
I believe that one could use collocation to get around the explicit
`#StoreLoad' in the work-stealing deque algorithm. Instead of storing to
the tail and then loading the head, you can do a 32-bit store to the tail
and a single 64-bit load to get the head and tail together. That should
eliminate the `#StoreLoad' in between.
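That idea can be sketched with a collocated anchor word (names and layout are mine, purely illustrative; a real deque would need the full owner/stealer protocol around it):

```c
#include <stdint.h>

/* Hypothetical deque anchor: the stealers' head and the owner's tail
   collocated in one 64-bit word. */
union deque_anchor
{
    struct
    {
        uint32_t head;
        uint32_t tail;
    } pos;
    uint64_t both;
};

/* Owner side: publish the new tail with a 32-bit store, then read head
   AND tail back with a single 64-bit load.  The load overlaps the store
   (mismatching address/size), so the processor must drain the store
   before satisfying it -- replacing the explicit #StoreLoad between
   "store tail" and "load head". */
static uint64_t store_tail_then_snapshot(union deque_anchor volatile* d,
                                         uint32_t new_tail)
{
    d->pos.tail = new_tail; /* 32-bit store */
    return d->both;         /* overlapping 64-bit load of head+tail */
}
```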