I was trying to implement a flavor of barrier wait to synchronize
pthreads( the machine's Xeon X5, running linux 2.6 and gcc 4.1.1).
To ensure sequential conistency - i sprinkled my code with mfences -
unfortunately i find the memory consistency being broken when i run
this code. I was wondering if i've missed something?
I'm using 3 booleans as the synchronization flags; have spread 'em
across 3 cache lines to prevent false sharing. Am synchronizing 3
threads (taking 3 as a small example)
uint8_t X[3 * cachelinesize]; (where cache line size is 64 in this
case)
my mem barrier is " __asm__ __volatile__ ("mfence" : : : "memory");"
(pardon the overkill - i've used mem barriers instead of lfences/
sfences - just wanted to make sure this works, before i
optimize it)
ThreadA
while (1) {
mem barrier #1;
while ( X[0 * cacheline_size] != 0) ; // i tried inserting
pause too ------------> POINT A
mem barrier #2;
do some work;
mem barrrier #3;
X[0 *cacheline_size] = 1;
mem barrier # 4;
while (X[0 * cacheline size] != 2) ; // spin
mem barrier #5;
do something ;
mem barrier #6;
X[0 * cacheline size] = 3;------------> POINT C
mem barrier # 7;
}
............
(similar code in the other threads)
Main synchronizing thread
------------------------------------
while (1) {
mem barrier # 8;
while( all X[ 0...n] not 1) ; / => spin until (X[ 0 *
cacheline size] == 1) || (X[1 * cacheline size == 1]) etc
mem barrier # 9;
X[0...n] = 2; // tried doing this in a CAS
too - didn't help;
mem barrier # 10;
while(X[0...n] != 3) ; // spin ------------> POINT B
mem barrier # 11
do somthing
mem barrier #12
X[0...n ] = 0;
mem barrier;
}
Problem/Issue
I get into a deadlock at times - thread A is waiting on (X[0] ! = 0)
( point A) and the main thread is
waiting at point B with X[0] = 2- there's no way that could have
happened unless thread A bypassed point C.
Could someone tell me if i've screwed up someplace?
Thanks
-Addy
}
> I was trying to implement a flavor of barrier wait to synchronize
> pthreads( the machine's Xeon X5, running linux 2.6 and gcc 4.1.1).
> To ensure sequential consistency - I sprinkled my code with mfences -
> unfortunately I find the memory consistency being broken when I run
> this code. I was wondering if I've missed something?
You might find the following document somewhat relevant.
Intel 64 Architecture Memory Ordering White Paper
http://dl.free.fr/pLg3bHtbF/intel-memory-ordering.pdf
What you are trying to do does not require fences on an x64
using normally allocated memory. When just doing reads or writes
to the shared variables as you are, as opposed to
read-modify-write, then you do not need atomic ops.
So if it doesn't work, you've buggered something up :-)
(e.g. what was X init'ed to before you started the worker threads?)
comp.programming.threads is a good place for threading questions.
Eric
Eric - thanks for the info, but it really won't help much; but I think
you could probably go through this; this is well known
issue in multicore programming - and I thought i'd tackled it
correctly.
http://www.cs.utexas.edu/users/dburger/teaching/cs382m-f06/papers/16paper.pdf
I was not endeavoring to solve your problem. I was just pointing
out that your futzing with membars was chasing a red herring as
on an x86/x64 the style of spin barrier handshake you are
attempting does not require either fences or atomic ops.
Eric
You have a race-condition. I don't have time to find it for you. However,
here is a very simple distributed spin-wait barrier synchronization
algorithm I have whipped up:
http://relacy.pastebin.com/f5d57b700
It's coded in Relacy race-detector, and passes it's tests. Here is the
algorithm in pseudo-code:
______________________________________________________________________
atomic_word gates[THREADS] = { 0 };
/*
waits for all other worker threads
to synchronize with the main sync thread
*/
void wait(size_t i)
{
ATOMIC_STORE_RELEASE(&gates[i], 1);
while (ATOMIC_LOAD_ACQUIRE(&gates[i])) pause();
}
/*
waits for all worker threads to hit the barrier
*/
void acquire()
{
retry:
for (size_t i = 0; i < THREADS; ++i)
{
if (! ATOMIC_LOAD_ACQUIRE(&gates[i]))
{
pause();
goto retry;
}
}
}
/*
releases all worker threads from the barrier
*/
void release()
{
for (size_t i = 0; i < THREADS; ++i)
{
ATOMIC_STORE_RELEASE(&gates[i], 0);
}
}
______________________________________________________________________
Is this type of synchronization primitive sufficient for you're specific
needs?
BTW, on an x86/x64 the acquire/release semantics are implicit.
Not necessarily. There are at least 3 other possible reasons.
However without seeing the actual source code,
and as it appeared to be homework, I thought finding
the solution himself would be more instructive.
Knowing that that style of handshake should work on x86/x64
should be sufficient hint.
Eric
Just one question - would doing a CAS on all the stores on the sync
var (CAS(X[0],0,1) etc) instead of the assignment operation help to
alleviate
the race condition?
Thanks a lot.
On Nov 17, 10:06 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Chris M. Thomasson" <n...@spam.invalid> wrote in messagenews:iEAMm.12230$Sw5....@newsfe16.iad...
>
>
>
> > "jon wayne" <addy.va...@gmail.com> wrote in message
> Just one question - would doing a CAS on all the stores on the sync
> var (CAS(X[0],0,1) etc) instead of the assignment operation help to
> alleviate
> the race condition?
You don't need CAS for this type of algorithm.
> Thanks a lot.
BTW, did you take a look at the algorithm I posted? Spin-wait aside for a
moment, the actual syncronization scheme is fairly efficent. BTW, the
membars in the pseudo-code can be simplified; here is how it would look on a
SPARC in RMO mode:
______________________________________________________________________
atomic_word gates[THREADS] = { 0 };
/*
waits for all other worker threads
to synchronize with the main sync thread
*/
void wait(size_t i)
{
MEMBAR #LoadStore | #StoreStore;
ATOMIC_STORE(&gates[i], 1);
while (ATOMIC_LOAD(&gates[i])) pause();
MEMBAR #LoadStore | #LoadLoad;
}
/*
waits for all worker threads to hit the barrier
*/
void acquire()
{
retry:
for (size_t i = 0; i < THREADS; ++i)
{
if (! ATOMIC_LOAD(&gates[i]))
{
pause();
goto retry;
}
}
MEMBAR #LoadStore | #LoadLoad;
}
/*
releases all worker threads from the barrier
*/
void release()
{
MEMBAR #LoadStore | #StoreStore;
for (size_t i = 0; i < THREADS; ++i)
{
ATOMIC_STORE(&gates[i], 0);
}
}
______________________________________________________________________
[...]
Am just going thru your algo - just trying to figure out if it'd gimme
the kinda perf am looking for.
I really appreciate your enthusiasm in helping out.
Thanks a lot once again
On Nov 18, 12:04 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "jon wayne" <addy.va...@gmail.com> wrote in message
> Am just going thru your algo - just trying to figure out if it'd gimme
> the kinda perf am looking for.
Keep in mind that it's SCHED_OTHER because the primitive itself is making
scheduling decisions. The algorithm is always fair as it hands off ownership
on release; a thread cannot starve it. Not sure if this inherent fairness
guarantee is required in your case.
> I really appreciate your enthusiasm in helping out.
> Thanks a lot once again
Check this out:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c460ed59e5f586f7
AFAICT, the algorithm would be a good fit you're worker threads are
constantly performing synchronization operations (eg., acquiring/releasing
access to the barrier)...
inline void atomically_load(volatile uint8_t *loadptr, uint8_t newval)
{
while(__sync_fetch_and_add(loadptr, 0) != newval) ;
}
and
inline atomically_store(volatile uint8_t *store, uint8_t newval,
uint8_t oldval)
{
while(__sync_lock_test_and_set(store, newval) != oldval) ; //spin
__sync_lock_release(store);
}
was wondering if that's the intended implementation?
On Nov 18, 1:32 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "jon wayne" <addy.va...@gmail.com> wrote in message
>
> news:43a99c37-8a24-469e...@o9g2000prg.googlegroups.com...
>
> > Actually I needed a very lightweight sync method - my threads need to
> > sync every 5 -10 microsecs; and that's why I was trying to avoid using
> > inherent locks/barriers provided by pthread.
> > Am just going thru your algo - just trying to figure out if it'd gimme
> > the kinda perf am looking for.
>
> Keep in mind that it's SCHED_OTHER because the primitive itself is making
> scheduling decisions. The algorithm is always fair as it hands off ownership
> on release; a thread cannot starve it. Not sure if this inherent fairness
> guarantee is required in your case.
>
> > I really appreciate your enthusiasm in helping out.
> > Thanks a lot once again
>
> Check this out:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/th...
[...]
Only if you wanted sequential consistency, which is not required for this
algorithm. On x86/x64 you can use normal `MOV' instructions for the atomic
load/store operations.
( though I'm still trying to figure out the race condition in my
flavor of busy wait :( ).
Thanks Chris.
On Nov 18, 7:25 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "jon wayne" <addy.va...@gmail.com> wrote in message
I just went thru yr algorithm - i think mine's pretty similar, except
it's quad stated and that i've encapsulated wait/acquire/release
into
functions....
I was wondering if the above algo is architecture/framework
independent ?
Did you _try_ just testing your handshake with just 2 threads,
a main and a worker, a dummy DoSomthing() that just counts,
a single volatile int syncState variable, and _nothing_ else
(no cachelinesize, membars, atomics, fancy-pants whatevers)?
Cuz I'll bet you didn't.
Eric
in fact i even went to the extent of implementing the above
atomic_load/store to ensure sequential consistency(have modified 'em
too, have used fences with 'em too. ) - but of no avail;
I see non-deterministic behavior still :(.
Am using these within c++ code. The atomic load/store functions are
defined as static functions. and so is the volatile synchronized
variables.
Cachelineline... ermmm... well, initially I did ensure that each of
the synchronous vars are on a different cache line. The current
version
has false sharing - the variables are on the same line; it should
degrade performance for sure - but functionality ?
True the sync vars aren't defined as atomics - but then i made 'em
volatile...
inline bool atomically_load(volatile uint8_t *loadptr, uint8_t newval)
{
__asm__ __volatile__ ("mfence" : : : "memory");
while(__sync_fetch_and_add(loadptr, 0) != newval) {
__asm__ __volatile__(" rep; nop \n");
}
__asm__ __volatile__ ("mfence" : : : "memory");
return true;
}
//
inline bool atomically_store(volatile uint8_t *store, uint8_t newval,
uint8_t oldval)
{
__asm__ __volatile__ ("mfence" : : : "memory");
while(__sync_lock_test_and_set(store, newval) != oldval) {
__asm__ __volatile__(" rep; nop \n");
}
__asm__ __volatile__ ("mfence" : : : "memory");
return true;
Hmmm... perhaps you misunderstood how simple a test I meant.
You said "yes" but then go on to talk about atomic loads, etc.
I just want to be absolutely clear....
Are you are saying that you have code that looks almost identical
to the simple handshake code below, and you say it jams up?
Because if this doesn't work I'd look at the compilers
assembler output.
static volatile int appSyncState = 0;
static volatile int appDone = 0;
#define CpuPause __asm { pause }
static void WorkerRtn (void)
{
while (!appDone)
{
while (appSyncState != 0)
{ CpuPause }
appSyncState = 1;
while (appSyncState != 2)
{ CpuPause }
appSyncState = 3;
}
}
#define MAIN_LOOP_COUNT (1000000)
static void MainRtn (void)
{ int i;
for (i = 1; i <= MAIN_LOOP_COUNT; i++)
{
while (appSyncState != 1)
{ CpuPause }
if (i == MAIN_LOOP_COUNT)
appDone = 1;
appSyncState = 2;
while (appSyncState != 3)
{ CpuPause }
appSyncState = 0;
}
}
Eric
before i do a appSyncState = 1, I do some function calls ; my code's
in c++ and the worker thread's a static member function within a
class.
something akin to this:
while (!appDone)
{
while (appSyncState != 0)
{ CpuPause }
i = 0;
while (i++ < N) {
call_some_func();
}
appSyncState = 1;
......
The call_some_func in turns does a lotta work - the whole lotta c++
code's in there
I realised that the sync algo is working just fine - except that I
reach appsyncstate =1 even before I've completed all the calls to
caal_some_func.
Wondering what could have gone wrong...
will update further on experiments
it'd be nice, if someone can point out the race condition in my code .
Aha! Here's your problem!
If you _require_ multiple threads to sync every 5-10 us, then you've
simply selected the wrong algorithm.
(I assume, perhaps wrongly, that the idea is to do some useful work done
and get speedups from multiple threads/cores?)
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"