Sequential consitency on X86_64 - implementation of a bust wait algo not working inspite of using mfences

jon wayne

unread,

Nov 16, 2009, 8:33:25 AM11/16/09

to

Hi

I was trying to implement a flavor of barrier wait to synchronize
pthreads( the machine's Xeon X5, running linux 2.6 and gcc 4.1.1).
To ensure sequential conistency - i sprinkled my code with mfences -
unfortunately i find the memory consistency being broken when i run
this code. I was wondering if i've missed something?

I'm using 3 booleans as the synchronization flags; have spread 'em
across 3 cache lines to prevent false sharing. Am synchronizing 3
threads (taking 3 as a small example)

uint8_t X[3 * cachelinesize]; (where cache line size is 64 in this
case)

my mem barrier is " __asm__ __volatile__ ("mfence" : : : "memory");"
(pardon the overkill - i've used mem barriers instead of lfences/
sfences - just wanted to make sure this works, before i
optimize it)

ThreadA

while (1) {
mem barrier #1;
while ( X[0 * cacheline_size] != 0) ; // i tried inserting
pause too ------------> POINT A
mem barrier #2;
do some work;

mem barrrier #3;
X[0 *cacheline_size] = 1;
mem barrier # 4;

while (X[0 * cacheline size] != 2) ; // spin

mem barrier #5;
do something ;

mem barrier #6;
X[0 * cacheline size] = 3;------------> POINT C
mem barrier # 7;
}
............
(similar code in the other threads)

Main synchronizing thread
------------------------------------
while (1) {
mem barrier # 8;
while( all X[ 0...n] not 1) ; / => spin until (X[ 0 *
cacheline size] == 1) || (X[1 * cacheline size == 1]) etc
mem barrier # 9;

X[0...n] = 2; // tried doing this in a CAS
too - didn't help;
mem barrier # 10;

while(X[0...n] != 3) ; // spin ------------> POINT B
mem barrier # 11

do somthing
mem barrier #12

X[0...n ] = 0;
mem barrier;
}

Problem/Issue

I get into a deadlock at times - thread A is waiting on (X[0] ! = 0)
( point A) and the main thread is
waiting at point B with X[0] = 2- there's no way that could have
happened unless thread A bypassed point C.

Could someone tell me if i've screwed up someplace?

Thanks
-Addy

}

Noob

unread,

Nov 16, 2009, 8:50:00 AM11/16/09

to

jon wayne wrote:

> I was trying to implement a flavor of barrier wait to synchronize
> pthreads( the machine's Xeon X5, running linux 2.6 and gcc 4.1.1).

> To ensure sequential consistency - I sprinkled my code with mfences -
> unfortunately I find the memory consistency being broken when I run
> this code. I was wondering if I've missed something?

You might find the following document somewhat relevant.

Intel 64 Architecture Memory Ordering White Paper
http://dl.free.fr/pLg3bHtbF/intel-memory-ordering.pdf

EricP

unread,

Nov 16, 2009, 11:43:34 AM11/16/09

to

jon wayne wrote:
> Hi
>
> I was trying to implement a flavor of barrier wait to synchronize
> pthreads( the machine's Xeon X5, running linux 2.6 and gcc 4.1.1).
> To ensure sequential conistency - i sprinkled my code with mfences -
> unfortunately i find the memory consistency being broken when i run
> this code. I was wondering if i've missed something?
>
> I'm using 3 booleans as the synchronization flags; have spread 'em
> across 3 cache lines to prevent false sharing. Am synchronizing 3
> threads (taking 3 as a small example)

> <snip>

What you are trying to do does not require fences on an x64
using normally allocated memory. When just doing reads or writes
to the shared variables as you are, as opposed to
read-modify-write, then you do not need atomic ops.

So if it doesn't work, you've buggered something up :-)
(e.g. what was X init'ed to before you started the worker threads?)

comp.programming.threads is a good place for threading questions.

Eric

jon wayne

unread,

Nov 17, 2009, 4:50:43 AM11/17/09

to

Noob - i've already seen the doc - my implementation seems to be
correct - but for some reason it ain't working

Eric - thanks for the info, but it really won't help much; but I think
you could probably go through this; this is well known
issue in multicore programming - and I thought i'd tackled it
correctly.

http://www.cs.utexas.edu/users/dburger/teaching/cs382m-f06/papers/16paper.pdf

EricP

unread,

Nov 17, 2009, 11:22:24 AM11/17/09

to

jon wayne wrote:
> Noob - i've already seen the doc - my implementation seems to be
> correct - but for some reason it ain't working
>
> Eric - thanks for the info, but it really won't help much; but I think
> you could probably go through this; this is well known
> issue in multicore programming - and I thought i'd tackled it
> correctly.
>
> http://www.cs.utexas.edu/users/dburger/teaching/cs382m-f06/papers/16paper.pdf

I was not endeavoring to solve your problem. I was just pointing
out that your futzing with membars was chasing a red herring as
on an x86/x64 the style of spin barrier handshake you are
attempting does not require either fences or atomic ops.

Eric

Chris M. Thomasson

unread,

Nov 17, 2009, 12:00:45 PM11/17/09

to

"jon wayne" <addy....@gmail.com> wrote in message
news:00e8d57e-a640-4960...@u25g2000prh.googlegroups.com...
> Hi
[...]

You have a race-condition. I don't have time to find it for you. However,
here is a very simple distributed spin-wait barrier synchronization
algorithm I have whipped up:

http://relacy.pastebin.com/f5d57b700

It's coded in Relacy race-detector, and passes it's tests. Here is the
algorithm in pseudo-code:
______________________________________________________________________
atomic_word gates[THREADS] = { 0 };

/*

waits for all other worker threads
to synchronize with the main sync thread

*/
void wait(size_t i)
{
ATOMIC_STORE_RELEASE(&gates[i], 1);

while (ATOMIC_LOAD_ACQUIRE(&gates[i])) pause();
}

/*

waits for all worker threads to hit the barrier

*/
void acquire()
{
retry:
for (size_t i = 0; i < THREADS; ++i)
{
if (! ATOMIC_LOAD_ACQUIRE(&gates[i]))
{
pause();
goto retry;
}
}
}

/*

releases all worker threads from the barrier

*/
void release()
{
for (size_t i = 0; i < THREADS; ++i)
{
ATOMIC_STORE_RELEASE(&gates[i], 0);
}
}
______________________________________________________________________

Is this type of synchronization primitive sufficient for you're specific
needs?

Chris M. Thomasson

unread,

Nov 17, 2009, 12:06:03 PM11/17/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:iEAMm.12230$Sw5....@newsfe16.iad...

> "jon wayne" <addy....@gmail.com> wrote in message
> news:00e8d57e-a640-4960...@u25g2000prh.googlegroups.com...
>> Hi
> [...]
>
> You have a race-condition. I don't have time to find it for you. However,
> here is a very simple distributed spin-wait barrier synchronization
> algorithm I have whipped up:
>
>
> http://relacy.pastebin.com/f5d57b700
>
>
> It's coded in Relacy race-detector, and passes it's tests. Here is the
> algorithm in pseudo-code:
> ______________________________________________________________________

[...]

> ______________________________________________________________________
>
> Is this type of synchronization primitive sufficient for you're specific
> needs?

BTW, on an x86/x64 the acquire/release semantics are implicit.

EricP

unread,

Nov 17, 2009, 1:26:43 PM11/17/09

to

Chris M. Thomasson wrote:
> "jon wayne" <addy....@gmail.com> wrote in message
> news:00e8d57e-a640-4960...@u25g2000prh.googlegroups.com...
>> Hi
> [...]
>
> You have a race-condition. I don't have time to find it for you.

Not necessarily. There are at least 3 other possible reasons.
However without seeing the actual source code,
and as it appeared to be homework, I thought finding
the solution himself would be more instructive.
Knowing that that style of handshake should work on x86/x64
should be sufficient hint.

Eric

jon wayne

unread,

Nov 17, 2009, 9:45:09 PM11/17/09

to

Oh ok, I didn't realize I'd hit a race condition in this case;

Just one question - would doing a CAS on all the stores on the sync
var (CAS(X[0],0,1) etc) instead of the assignment operation help to
alleviate
the race condition?

Thanks a lot.

On Nov 17, 10:06 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Chris M. Thomasson" <n...@spam.invalid> wrote in messagenews:iEAMm.12230$Sw5....@newsfe16.iad...
>
>
>

> > "jon wayne" <addy.va...@gmail.com> wrote in message

Chris M. Thomasson

unread,

Nov 18, 2009, 2:04:54 AM11/18/09

to

"jon wayne" <addy....@gmail.com> wrote in message

news:fc34e2d6-a149-495a...@b36g2000prf.googlegroups.com...

> Oh ok, I didn't realize I'd hit a race condition in this case;

> Just one question - would doing a CAS on all the stores on the sync
> var (CAS(X[0],0,1) etc) instead of the assignment operation help to
> alleviate
> the race condition?

You don't need CAS for this type of algorithm.

> Thanks a lot.

BTW, did you take a look at the algorithm I posted? Spin-wait aside for a
moment, the actual syncronization scheme is fairly efficent. BTW, the
membars in the pseudo-code can be simplified; here is how it would look on a
SPARC in RMO mode:

______________________________________________________________________
atomic_word gates[THREADS] = { 0 };

/*

waits for all other worker threads
to synchronize with the main sync thread

*/
void wait(size_t i)
{
MEMBAR #LoadStore | #StoreStore;

ATOMIC_STORE(&gates[i], 1);

while (ATOMIC_LOAD(&gates[i])) pause();

MEMBAR #LoadStore | #LoadLoad;
}

/*

waits for all worker threads to hit the barrier

*/
void acquire()
{
retry:
for (size_t i = 0; i < THREADS; ++i)
{

if (! ATOMIC_LOAD(&gates[i]))
{
pause();
goto retry;
}
}

MEMBAR #LoadStore | #LoadLoad;
}

/*

releases all worker threads from the barrier

*/
void release()
{
MEMBAR #LoadStore | #StoreStore;

for (size_t i = 0; i < THREADS; ++i)
{

ATOMIC_STORE(&gates[i], 0);
}
}
______________________________________________________________________

[...]

jon wayne

unread,

Nov 18, 2009, 2:55:51 AM11/18/09

to

Actually I needed a very lightweight sync method - my threads need to
sync every 5 -10 microsecs; and that's why I was trying to avoid using
inherent locks/barriers provided by pthread.

Am just going thru your algo - just trying to figure out if it'd gimme
the kinda perf am looking for.

I really appreciate your enthusiasm in helping out.

Thanks a lot once again

On Nov 18, 12:04 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "jon wayne" <addy.va...@gmail.com> wrote in message

Chris M. Thomasson

unread,

Nov 18, 2009, 3:32:35 AM11/18/09

to

"jon wayne" <addy....@gmail.com> wrote in message

news:43a99c37-8a24-469e...@o9g2000prg.googlegroups.com...

> Actually I needed a very lightweight sync method - my threads need to
> sync every 5 -10 microsecs; and that's why I was trying to avoid using
> inherent locks/barriers provided by pthread.

> Am just going thru your algo - just trying to figure out if it'd gimme
> the kinda perf am looking for.

Keep in mind that it's SCHED_OTHER because the primitive itself is making
scheduling decisions. The algorithm is always fair as it hands off ownership
on release; a thread cannot starve it. Not sure if this inherent fairness
guarantee is required in your case.

> I really appreciate your enthusiasm in helping out.

> Thanks a lot once again

Check this out:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c460ed59e5f586f7

AFAICT, the algorithm would be a good fit you're worker threads are
constantly performing synchronization operations (eg., acquiring/releasing
access to the barrier)...

jon wayne

unread,

Nov 18, 2009, 8:09:54 AM11/18/09

to

I just wanted to confirm one thing, the atomic loads and stores - am
doing the following for gcc

inline void atomically_load(volatile uint8_t *loadptr, uint8_t newval)
{
while(__sync_fetch_and_add(loadptr, 0) != newval) ;
}
and
inline atomically_store(volatile uint8_t *store, uint8_t newval,
uint8_t oldval)
{
while(__sync_lock_test_and_set(store, newval) != oldval) ; //spin
__sync_lock_release(store);

}

was wondering if that's the intended implementation?

On Nov 18, 1:32 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "jon wayne" <addy.va...@gmail.com> wrote in message

>
> news:43a99c37-8a24-469e...@o9g2000prg.googlegroups.com...
>
> > Actually I needed a very lightweight sync method - my threads need to
> > sync every 5 -10 microsecs; and that's why I was trying to avoid using
> > inherent locks/barriers provided by pthread.
> > Am just going thru your algo - just trying to figure out if it'd gimme
> > the kinda perf am looking for.
>
> Keep in mind that it's SCHED_OTHER because the primitive itself is making
> scheduling decisions. The algorithm is always fair as it hands off ownership
> on release; a thread cannot starve it. Not sure if this inherent fairness
> guarantee is required in your case.
>
> > I really appreciate your enthusiasm in helping out.
> > Thanks a lot once again
>
> Check this out:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

Chris M. Thomasson

unread,

Nov 18, 2009, 9:25:36 AM11/18/09

to

"jon wayne" <addy....@gmail.com> wrote in message

news:388476d8-4923-4d78...@z35g2000prh.googlegroups.com...

>
> I just wanted to confirm one thing, the atomic loads and stores - am
> doing the following for gcc
>
> inline void atomically_load(volatile uint8_t *loadptr, uint8_t newval)
> {
> while(__sync_fetch_and_add(loadptr, 0) != newval) ;
> }
> and
> inline atomically_store(volatile uint8_t *store, uint8_t newval,
> uint8_t oldval)
> {
> while(__sync_lock_test_and_set(store, newval) != oldval) ; //spin
> __sync_lock_release(store);
>
> }
>
> was wondering if that's the intended implementation?

[...]

Only if you wanted sequential consistency, which is not required for this
algorithm. On x86/x64 you can use normal `MOV' instructions for the atomic
load/store operations.

jon wayne

unread,

Nov 18, 2009, 10:41:00 AM11/18/09

to

I got it :)
I'll work on your implementation, and see if does gimme the kinda low
overhead I need for synchronization.

( though I'm still trying to figure out the race condition in my
flavor of busy wait :( ).

Thanks Chris.

On Nov 18, 7:25 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "jon wayne" <addy.va...@gmail.com> wrote in message

jon wayne

unread,

Nov 18, 2009, 12:15:09 PM11/18/09

to

I just went thru yr algorithm - i think mine's pretty similar, except
it's quad stated and that i've encapsulated wait/acquire/release
into
functions....
I was wondering if the above algo is architecture/framework
independent ?

jon wayne

unread,

Nov 18, 2009, 12:18:52 PM11/18/09

to

I meant that I haven't encapsulated my code into wait/release/
acquire...

EricP

unread,

Nov 18, 2009, 1:01:58 PM11/18/09

to

jon wayne wrote:
>
> ( though I'm still trying to figure out the race condition in my
> flavor of busy wait :( ).

Did you _try_ just testing your handshake with just 2 threads,
a main and a worker, a dummy DoSomthing() that just counts,
a single volatile int syncState variable, and _nothing_ else
(no cachelinesize, membars, atomics, fancy-pants whatevers)?

Cuz I'll bet you didn't.

Eric

jon wayne

unread,

Nov 18, 2009, 1:34:15 PM11/18/09

to

well, i did :)

in fact i even went to the extent of implementing the above
atomic_load/store to ensure sequential consistency(have modified 'em
too, have used fences with 'em too. ) - but of no avail;
I see non-deterministic behavior still :(.

Am using these within c++ code. The atomic load/store functions are
defined as static functions. and so is the volatile synchronized
variables.

Cachelineline... ermmm... well, initially I did ensure that each of
the synchronous vars are on a different cache line. The current
version
has false sharing - the variables are on the same line; it should
degrade performance for sure - but functionality ?

True the sync vars aren't defined as atomics - but then i made 'em
volatile...

inline bool atomically_load(volatile uint8_t *loadptr, uint8_t newval)
{

__asm__ __volatile__ ("mfence" : : : "memory");

while(__sync_fetch_and_add(loadptr, 0) != newval) {
__asm__ __volatile__(" rep; nop \n");

}
__asm__ __volatile__ ("mfence" : : : "memory");

return true;
}

//
inline bool atomically_store(volatile uint8_t *store, uint8_t newval,
uint8_t oldval)
{

__asm__ __volatile__ ("mfence" : : : "memory");

while(__sync_lock_test_and_set(store, newval) != oldval) {
__asm__ __volatile__(" rep; nop \n");

}
__asm__ __volatile__ ("mfence" : : : "memory");

return true;

EricP

unread,

Nov 18, 2009, 2:51:34 PM11/18/09

to

Hmmm... perhaps you misunderstood how simple a test I meant.
You said "yes" but then go on to talk about atomic loads, etc.
I just want to be absolutely clear....

Are you are saying that you have code that looks almost identical
to the simple handshake code below, and you say it jams up?
Because if this doesn't work I'd look at the compilers
assembler output.

static volatile int appSyncState = 0;
static volatile int appDone = 0;

#define CpuPause __asm { pause }

static void WorkerRtn (void)
{
while (!appDone)
{
while (appSyncState != 0)
{ CpuPause }

appSyncState = 1;
while (appSyncState != 2)
{ CpuPause }

appSyncState = 3;
}
}

#define MAIN_LOOP_COUNT (1000000)

static void MainRtn (void)
{ int i;

for (i = 1; i <= MAIN_LOOP_COUNT; i++)
{
while (appSyncState != 1)
{ CpuPause }

if (i == MAIN_LOOP_COUNT)
appDone = 1;

appSyncState = 2;
while (appSyncState != 3)
{ CpuPause }

appSyncState = 0;
}
}

Eric

jon wayne

unread,

Nov 19, 2009, 5:14:32 AM11/19/09

to

yup - that's pretty similar to what am doing, except that I do a lotta
function calls in betw.

before i do a appSyncState = 1, I do some function calls ; my code's
in c++ and the worker thread's a static member function within a
class.

something akin to this:

while (!appDone)
{
while (appSyncState != 0)
{ CpuPause }

i = 0;
while (i++ < N) {
call_some_func();
}
appSyncState = 1;
......
The call_some_func in turns does a lotta work - the whole lotta c++
code's in there

I realised that the sync algo is working just fine - except that I
reach appsyncstate =1 even before I've completed all the calls to
caal_some_func.

Wondering what could have gone wrong...

jon wayne

unread,

Nov 19, 2009, 8:38:24 AM11/19/09

to

another update - the algo seems to work fine; but when I touch very
large iterations(say around 50 million, it does break) :(

will update further on experiments

it'd be nice, if someone can point out the race condition in my code .

Terje Mathisen

unread,

Nov 20, 2009, 5:58:59 AM11/20/09

to

jon wayne wrote:
> Actually I needed a very lightweight sync method - my threads need to
> sync every 5 -10 microsecs; and that's why I was trying to avoid using

Aha! Here's your problem!

If you _require_ multiple threads to sync every 5-10 us, then you've
simply selected the wrong algorithm.

(I assume, perhaps wrongly, that the idea is to do some useful work done
and get speedups from multiple threads/cores?)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"