
Memory barrier + volatile (C++)


JC

Jan 8, 2009, 4:19:23 PM
Say I have some data, in C++, on Windows:

int a;
int b;
volatile int valid = 0;
CRITICAL_SECTION mtx;

I have multiple threads that acquire that mutex and perform a few
actions, and the first thread to acquire the mutex also initializes a
and b:

void function1 () { // called from many threads
  EnterCriticalSection(&mtx);
  if (!valid) {
    a = something;
    b = something;
    valid = 1;
  }
  do_other_stuff();
  LeaveCriticalSection(&mtx);
}

I then have other threads that need to read a and b. They don't have
to wait for a and b to become valid, but they have to either read
valid data from both or know that the values may be invalid:

int function2 () { // called from many threads
  if (valid) {
    return a + b;
  }
}

Do I need any memory barriers anywhere? Will function2() be guaranteed
to see the new values of a and b after 'valid' is set to 1?

If not, what kind of memory barrier (read/write) do I need to put
where? A write barrier just before setting valid=1 and a read barrier
just before returning a+b? Does it make a difference that "valid" is
declared volatile?

Thanks,
Jason

David Schwartz

Jan 8, 2009, 6:34:20 PM
On Jan 8, 1:19 pm, JC <jason.cipri...@gmail.com> wrote:

>   EnterCriticalSection(&mtx);
>   if (!valid) {
>     a = something;
>     b = something;
>     valid = 1;
>   }
>   do_other_stuff();
>   LeaveCriticalSection(&mtx);
> }

Note that this forces *no* ordering between the assignments to 'a',
'b', and 'valid'.

> int function2 () { // called from many threads
>   if (valid) {
>     return a + b;
>   }
> }

No good. There is no guarantee that the write to 'valid' will occur
after the writes to 'a' or 'b'.

> Do I need any memory barriers anywhere? Will function2() be guaranteed
> to see the new values of a and b after 'valid' is set to 1?

No guarantee.

> If not, what kind of memory barrier (read/write) do I need to put
> where? A write barrier just before setting valid=1 and a read barrier
> just before returning a+b? Does it make a difference that "valid" is
> declared volatile?

On Windows, I'm pretty sure that all you need to do is to use
'InterlockedIncrement' to set 'valid' from 0 to 1.

DS

David Barrett-Lennard

Jan 8, 2009, 7:39:10 PM

My understanding is that the following will work correctly on IA-32
under Windows...

int a;
int b;
int valid = 0;

// Called by one thread
void f1()
{
  if (!valid)
  {
    a = something;
    b = something;

    _WriteBarrier();
    valid = 1;
  }
}

// Called by many threads
int f2()
{
  if (valid)
  {
    _ReadBarrier();
    return a + b;
  }
  ...
}

_WriteBarrier() and _ReadBarrier() only prevent compiler reorderings,
but on IA-32 that is all that is required for release/acquire barriers
for shared memory.

Chris M. Thomasson

Jan 8, 2009, 9:29:39 PM
"JC" <jason.c...@gmail.com> wrote in message
news:8e05ee1f-92b2-4162...@q36g2000vbn.googlegroups.com...

David S. and David B.L. are both technically correct. However, David
Schwartz's approach covers _all_ platforms that modern Windows currently
runs on. This of course assumes that Microsoft implemented the Interlocked
API correctly! IIRC, there was one post around 4-5 years ago that suggested
there was an error in the PPC implementation of the Interlocked API. There
were no `sync' or `lwsync' instructions in the disassembled code...

David Barrett-Lennard

Jan 8, 2009, 9:51:05 PM
On Jan 9, 8:34 am, David Schwartz <dav...@webmaster.com> wrote:

> On Windows, I'm pretty sure that all you need to do is to use
> 'InterlockedIncrement' to set 'valid' from 0 to 1.

Is that sufficient? Is it possible that the compiler reorders the
reads in function2() causing an error?

David Barrett-Lennard

Jan 8, 2009, 9:57:51 PM
On Jan 9, 11:29 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
> David S. and David B.L. are both technically correct. However, David
> Schwartz's approach covers _all_ platforms that modern Windows currently
> runs on. This of course assumes that Microsoft implemented the Interlocked
> API correctly! IIRC, there was one post around 4-5 years ago that suggested
> there was on error in the PPC impl of Interlock API. There was no `sync' or
> `lwsync' instructions in the disassembled code...

It would seem useful to use macros for release and acquire barriers
that do the right thing without unnecessary overheads. Is that
possible on all platforms that modern Windows currently runs on?


Chris M. Thomasson

Jan 8, 2009, 10:19:56 PM
"David Barrett-Lennard" <dav...@iinet.net.au> wrote in message
news:5d17003f-c430-4308...@s9g2000prm.googlegroups.com...

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/527cb992c30093ef
(read response from Marcel Müller)


On SPARC in RMO mode, you're correct. The reads can be reordered. An
explicit `#StoreStore' membar is required before the store to `valid', and
a `#LoadLoad' membar is needed after the load from `valid'. Current x86 and
TSO SPARC don't need explicit membars in this case.

David Schwartz

Jan 8, 2009, 10:28:19 PM

On Windows, InterlockedIncrement is a full barrier.

DS

David Barrett-Lennard

Jan 8, 2009, 10:28:41 PM
On Jan 9, 12:19 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "David Barrett-Lennard" <davi...@iinet.net.au> wrote in message

>
> news:5d17003f-c430-4308...@s9g2000prm.googlegroups.com...
>
> > On Jan 9, 8:34 am, David Schwartz <dav...@webmaster.com> wrote:
>
> >> On Windows, I'm pretty sure that all you need to do is to use
> >> 'InterlockedIncrement' to set 'valid' from 0 to 1.
>
> > Is that sufficient?  Is it possible that the compiler reorders the
> > reads in function2() causing an error?
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

> (read response from Marcel Müller)
>
> On SPARC in RMO mode, your correct. The reads can be reordered. An explicit
> `#StoreStore' membar is required before the store to `valid', and a
> `#LoadLoad' membar is needed after the load from `valid'. Current x86 and
> TSO SPARC don't need explicit membars in this case.

Wouldn't *compiler* reorderings upset any platform?

Chris M. Thomasson

Jan 8, 2009, 10:36:11 PM
"David Barrett-Lennard" <dav...@iinet.net.au> wrote in message
news:8de844c9-65d9-466a...@w1g2000prm.googlegroups.com...

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/527cb992c30093ef


MSVC won't reorder anything across calls to the `Interlocked' API.


http://webpages.charter.net/appcore/vzdoc/atomic/static-init
(refer to section 2)

David Barrett-Lennard

Jan 8, 2009, 10:48:06 PM
On Jan 9, 12:36 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> MSVC won't reorder anything across calls to the `Interlocked' API.

Understood. But that is not called in function2...

int function2 () { // called from many threads
  if (valid) {
    return a + b;
  }
}

Is it guaranteed that the instructions emitted by the compiler read
valid before a and b?

David Schwartz

Jan 8, 2009, 11:25:16 PM
On Jan 8, 7:48 pm, David Barrett-Lennard <davi...@iinet.net.au> wrote:

> Understood.    But that is not called in function2...
>
> int function2 () { // called from many threads
>   if (valid) {
>     return a + b;
>   }
>
> }
>
> Is it guaranteed that the instructions emitted by the compiler read
> valid before a and b?

He made 'valid' volatile. So the compiler will emit a read for 'valid'
before 'a' and 'b'. The CPU may pre-fetch 'a' or 'b', but that's not a
problem, since the write will invalidate the cache line and the pre-
fetch will be invalidated along with it.

DS

David Barrett-Lennard

Jan 8, 2009, 11:30:53 PM

Ok volatile is needed. That makes sense. Thanks.

Marcel Müller

Jan 9, 2009, 9:27:13 AM
Hi!

David Barrett-Lennard wrote:
> My understanding is that the following will work correctly on IA-32
> under Windows...
>
> int a;
> int b;
> int valid = 0;
>
> // Called by one thread
> void f1()
> {
>   if (!valid)
>   {
>     a = something;
>     b = something;
>     _WriteBarrier();
>     valid = 1;
>   }
> }

I don't think so because two threads may enter the initialization block
simultaneously. JC said that the first thread initializes a and b, not a
dedicated one.


Marcel

David Barrett-Lennard

Jan 9, 2009, 10:16:42 AM
On Jan 9, 11:27 pm, Marcel Müller <news.5.ma...@spamgourmet.com>
wrote:

Yes, sorry, I intentionally simplified the example (because the mutex
wasn't directly relevant).

David Barrett-Lennard

Jan 9, 2009, 10:51:26 AM

On Jan 9, 8:34 am, David Schwartz <dav...@webmaster.com> wrote:
> On Jan 8, 1:19 pm, JC <jason.cipri...@gmail.com> wrote:
>
> >   EnterCriticalSection(&mtx);
> >   if (!valid) {
> >     a = something;
> >     b = something;
> >     valid = 1;
> >   }
> >   do_other_stuff();
> >   LeaveCriticalSection(&mtx);
> > }
>
> Note that this forces *no* ordering between the assignments to 'a',
> 'b', and 'valid'.
>
> > int function2 () { // called from many threads
> >   if (valid) {
> >     return a + b;
> >   }
> > }
>
> No good. There is no guarantee that the write to 'valid' will occur
> after the writes to 'a' or 'b'.
>
> > Do I need any memory barriers anywhere? Will function2() be guaranteed
> > to see the new values of a and b after 'valid' is set to 1?
>
> No guarantee.

The following article in MSDN discusses the MS specific semantics of
'volatile' in C++

http://msdn.microsoft.com/en-us/library/12a04hfd(VS.80).aspx

The article states that volatile write has release semantics, and
volatile read has acquire semantics. In fact it contains an example
where a volatile bool called 'Sentinel' is used to coordinate a
producer-consumer relationship between a pair of threads on some
shared memory.

Doesn't this show that JC's original example was correct after all?

Alexander Terekhov

Jan 9, 2009, 11:05:44 AM

David Barrett-Lennard wrote:
[...]

> Ok volatile is needed. That makes sense. Thanks.

Under the current C++0x working draft volatile is NOT needed but you
need std::atomic<> for 'valid'.

void function1 () { // called from many threads
  mtx.lock();
  if (!valid.load(memory_order_relaxed)) {
    a = something;
    b = something;
    valid.store(1, memory_order_release);
  }
  do_other_stuff();
  mtx.unlock();
}

int function2 () { // called from many threads
  if (valid.load(memory_order_acquire)) {
    return a + b;
  }
}

or

void function1 () { // called from many threads
  mtx.lock();
  if (!valid.load(memory_order_relaxed)) {
    a = something;
    b = something;
    atomic_thread_fence(memory_order_release);
    valid.store(1, memory_order_relaxed);
  }
  do_other_stuff();
  mtx.unlock();
}

int function2 () { // called from many threads
  if (valid.load(memory_order_relaxed)) {
    atomic_thread_fence(memory_order_acquire);
    return a + b;
  }
}

or

void function1 () { // called from many threads
  mtx.lock();
  if (!valid.load(memory_order_relaxed)) {
    a = something;
    b = something;
    valid.store(1, memory_order_release);
  }
  do_other_stuff();
  mtx.unlock();
}

int function2 () { // called from many threads
  if (valid.load(memory_order_relaxed)) {
    atomic_thread_fence(memory_order_acquire);
    return a + b;
  }
}

or

void function1 () { // called from many threads
  mtx.lock();
  if (!valid.load(memory_order_relaxed)) {
    a = something;
    b = something;
    atomic_thread_fence(memory_order_release);
    valid.store(1, memory_order_relaxed);
  }
  do_other_stuff();
  mtx.unlock();
}

int function2 () { // called from many threads
  if (valid.load(memory_order_acquire)) {
    return a + b;
  }
}

but all versions above are actually way too much constrained.

Ideally, you would write something along the lines

void function1 () { // called from many threads
  mtx.lock();
  if (!valid.load(memory_order_relaxed)) {
    a = something;
    b = something;
    valid.store(1, memory_order_ssb); // sink-store barrier (release - sink-load barrier)
  }
  do_other_stuff();
  mtx.unlock();
}

int function2 () { // called from many threads
  if (valid.load(memory_order_hlb)) { // hoist-load barrier (acquire - hoist-store barrier)
    return a + b;
  }
}

or

void function1 () { // called from many threads
  mtx.lock();
  if (!valid.load(memory_order_relaxed)) {
    a = something;
    b = something;
    atomic_thread_fence(memory_order_release_not_affecting_loads);
    valid.store(1, memory_order_relaxed);
  }
  do_other_stuff();
  mtx.unlock();
}

int function2 () { // called from many threads
  if (valid.load(memory_order_relaxed)) {
    atomic_thread_fence(memory_order_acquire_not_affecting_stores);
    return a + b;
  }
}

or

[... combinations with one atomic_thread_fence() and one
unidirectional constraint for valid's load or store ...]

but the current C++0x working draft doesn't recognize such more
fine-grained reordering constraints.

regards,
alexander.

Giancarlo Niccolai

Jan 9, 2009, 7:50:44 PM
Chris M. Thomasson ha scritto:

> This of course assumes that Microsoft implemented the
> Interlocked API correctly! IIRC, there was one post around 4-5 years ago
> that suggested there was on error in the PPC impl of Interlock API.
> There was no `sync' or `lwsync' instructions in the disassembled code...

I had problems with Interlocked* too. I first tried to implement some
very simple lock-free algos (such as one-way lists), and when I saw that
they did not work for me, being afraid I had messed up something in the
algorithm transcriptions, I even downloaded a couple of ready-made samples
from authoritative sites. The tests they came with worked, but once put in
a production environment with massive activity I had crashes, hangs, or
double-loops (two CPUs both going 100% without progress on 8-CPU or 16-CPU
machines) every approx 1-3 hours (exactly the same as my own algos). That
was with VS6 and no SDK...

Anyone else experienced similar problems with old Interlocked* ?

GN.

JC

Jan 9, 2009, 9:47:27 PM
On Jan 9, 10:51 am, David Barrett-Lennard <davi...@iinet.net.au>
wrote:

Hmm, the example does imply that's the case; but the way I read the
description is that it places guarantees on compiler ordering, not
"cache" ordering (or whatever you call that):

== BEGIN QUOTE ==

Also, when optimizing, the compiler must maintain ordering among
references to volatile objects as well as references to other global
objects. In particular,

* A write to a volatile object (volatile write) has Release
semantics; a reference to a global or static object that occurs before
a write to a volatile object in the instruction sequence will occur
before that volatile write in the compiled binary.
* A read of a volatile object (volatile read) has Acquire
semantics; a reference to a global or static object that occurs after
a read of volatile memory in the instruction sequence will occur after
that volatile read in the compiled binary.

== END QUOTE ==


So that guarantees the reads and writes occur in the expected order in
the binary, but does that also guarantee expected visibility?


Jason

JC

Jan 9, 2009, 9:51:45 PM


If that's all that is required for memory barriers then I guess that
does imply the original example works, since Windows makes compiler-
ordering guarantees for volatile access as well.

Except, I found this, a description of the MemoryBarrier() macro:

http://msdn.microsoft.com/en-us/library/ms684208.aspx

Which states:

"Use this macro or the interlocked functions when the order of memory
read and write operations is critical for program operation... The
_ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics
prevent compiler re-ordering only."

I may be misinterpreting this but it seems to imply that "compiler re-
ordering only" is insufficient for certain tasks? I don't know what to
make of it.

Jason

JC

Jan 9, 2009, 9:58:12 PM
On Jan 9, 7:50 pm, Giancarlo Niccolai <gc-remov...@removeme-


I've never experienced the problem personally, because I didn't start
getting into this stuff until recently. It does seem that Microsoft
has changed semantics of certain things over the years; e.g. in the
MSDN docs for MemoryBarrier() I found this:

"With Visual Studio 2003, volatile to volatile references are ordered;
the compiler will not re-order volatile variable access. With Visual
Studio 2005, the compiler also uses acquire semantics for read
operations on volatile variables and release semantics for write
operations on volatile variables (when supported by the CPU)."

In the docs for _ReadBarrier() and _WriteBarrier() I found this:

"Note: In past versions of the Visual C++ compiler, the
_ReadWriteBarrier and _WriteBarrier functions were enforced only
locally and did not affect functions up the call tree. In Visual C++
2005 and later, these functions are enforced all the way up the call
tree."

Maybe the same weird internal issues were the cause for your
Interlocked* problems, too. I don't know, not sure if any of that
stuff is related, but it is evidence that things haven't stayed very
consistent in the shared data department at least.

Jason

JC

Jan 9, 2009, 10:07:18 PM

You mean without declaring it volatile? So the documentation for
interlocked functions here:

http://msdn.microsoft.com/en-us/library/ms684122.aspx

States:

"Simple reads and writes to properly-aligned 32-bit variables are
atomic operations. In other words, you will not end up with only one
portion of the variable updated; all bits are updated in an atomic
fashion. However, access is not guaranteed to be synchronized."

So the only reason I'd use InterlockedAdd is to keep the compiler from
reordering it... but then what about reading its value? There's no
InterlockedGet, so I guess that means as long as it's volatile, and as
long as it's 32 bits, then the read will be atomic; and MS also
guarantees that the compiler won't move non-volatile reads after
volatile reads. But... in this case what does InterlockedAdd() get you
(where "InterlockedAdd" is being used to set a variable to 1, not so
much to add values atomically) that a simple write to a volatile
variable doesn't? I'm starting to confuse myself again...

Thanks,
Jason

rani_s...@hotmail.com

Jan 10, 2009, 1:00:53 AM
On Jan 9, 4:50 pm, Giancarlo Niccolai <gc-remov...@removeme-

niccolai.cc> wrote:
> Chris M. Thomasson ha scritto:
> > IIRC, there was one post around 4-5 years ago
> > that suggested there was on error in the PPC impl of Interlock API.
> > There was no `sync' or `lwsync' instructions in the disassembled code...
>
> I had problems with Interlocked* too.

Did you write software for PPC?

Anthony Williams

Jan 10, 2009, 3:04:20 AM
JC <jason.c...@gmail.com> writes:

It does on x86, since stores to memory have release semantics and
loads have acquire semantics at the processor level anyway.

Of course, the MSDN article only applies if you're using the Microsoft
compiler to which it refers --- older versions (e.g. MSVC 7.1) don't
do that, and neither do compilers from other vendors.

Anthony
--
Anthony Williams
Author of C++ Concurrency in Action | http://www.manning.com/williams
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Just Software Solutions Ltd, Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK

Anthony Williams

Jan 10, 2009, 3:07:50 AM
JC <jason.c...@gmail.com> writes:

> You mean without declaring it volatile? So the documentation for
> interlocked functions here:
>
> http://msdn.microsoft.com/en-us/library/ms684122.aspx
>
> States:
>
> "Simple reads and writes to properly-aligned 32-bit variables are
> atomic operations. In other words, you will not end up with only one
> portion of the variable updated; all bits are updated in an atomic
> fashion. However, access is not guaranteed to be synchronized."
>
> So the only reason I'd use InterlockedAdd is to keep the compiler from
> reordering it... but then what about for reading it's value? There's
> no InterlockedGet, so I guess that means as long as it's volatile, and
> as long as it's 32 bits, then the read will be atomic; and MS also
> guarantees that the compiler won't move non-volatile reads after
> volatile reads. But... in this case what does InterlockedAdd() get you
> (where "InterlockedAdd" is being used to set a variable to 1, not so
> much to add values atomically) that a simple write to a volatile
> variable doesn't? I'm starting to confuse myself again...

InterlockedAdd is an RMW operation. The load and store are guaranteed
to be indivisible so concurrent calls are serialized and actually do
increment the counter (rather than incrementing a stale value).

If all you need is to store "1", then you can just use a plain store
with a memory barrier (which can be achieved with a volatile store on
recent MS compilers).

David Schwartz

Jan 10, 2009, 8:39:28 PM
On Jan 10, 12:04 am, Anthony Williams <anthony....@gmail.com> wrote:

> It does on x86, since stores to memory have release semantics and
> loads have acquire semantics at the processor level anyway.

They "happen to". Intel has specifically said that you cannot rely on
future processors invalidating pre-fetched reads if the cache line is
invalidated. And AMD, as far as I know, has not said that you can rely
on it.

DS

JC

Jan 11, 2009, 11:07:28 AM

All right, so just making sure, this (your earlier suggestion) way
with the interlocked function is most "portable" (as in, portable
between Windows systems running on Intel-compatible CPUs -- that's the
only target platform) way to do it, right?

int a;
int b;
volatile LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads
  EnterCriticalSection(&mtx);
  if (!valid) {
    a = something;
    b = something;
    InterlockedExchange(&valid, 1);
  }
  do_other_stuff();
  LeaveCriticalSection(&mtx);
}

int function2 () { // called from many threads
  if (valid) {
    return a + b;
  } else {
    return 0;
  }
}


And simply using InterlockedExchange for the write, and having the
variable volatile, guarantees correct visibility for the reads, and
should continue guaranteeing that unless something major changes in
the future?

Can I replace InterlockedExchange with InterlockedExchangeAcquire for
the same effect?

Thanks,
Jason

Giancarlo Niccolai

Jan 11, 2009, 11:08:42 AM
rani_s...@hotmail.com ha scritto:

Not in this specific case.

GN.

Chris M. Thomasson

Jan 11, 2009, 1:23:31 PM
"JC" <jason.c...@gmail.com> wrote in message
news:65269250-3fb7-4176...@k1g2000prb.googlegroups.com...

On Jan 10, 8:39 pm, David Schwartz <dav...@webmaster.com> wrote:
> > On Jan 10, 12:04 am, Anthony Williams <anthony....@gmail.com> wrote:
> >
> > > It does on x86, since stores to memory have release semantics and
> > > loads have acquire semantics at the processor level anyway.
> >
> > They "happen to". Intel has specifically said that you cannot rely on
> > future processors invalidating pre-fetched reads if the cache line is
> > invalidated. And AMD, as far as I know, has not said that you can rely
> > on it.

> All right, so just making sure, this (your earlier suggestion) way
> with the interlocked function is most "portable" (as in, portable
> between Windows systems running on Intel-compatible CPUs -- that's the
> only target platform) way to do it, right?

> [...]


> And simply using InterlockedExchange for the write, and having the
> variable volatile, guarantees correct visibility for the reads, and
> should continue guaranteeing that unless something major changes in
> the future?

If you're using recent versions of MSVC (e.g., 8 or higher), then yes. The
load from the volatile variable will generate a subsequent membar
`#LoadStore | #LoadLoad'. If you're using another compiler, well, you're
going to need to consult its documentation. BTW, if you're using MSVC, you
don't even need the `InterlockedExchange' in there because a store to a
volatile will generate a preceding membar `#LoadStore | #StoreStore'.


> Can I replace InterlockedExchange with InterlockedExchangeAcquire for
> the same effect?

You would want to use release semantics instead; it's weird that MS does
not seem to have a release version of `InterlockedExchange'. See, the act
of storing to `valid' is a producer. The act of loading from `valid' is the
consumer. The producer needs release semantics, and the consumer needs
acquire. A 100% portable solution across all platforms that Windows runs
on, and basically all C/C++ compilers that are compatible with Windows,
would be something like:
_______________________________________________________________________
int a;
int b;
LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads
  EnterCriticalSection(&mtx);
  if (! valid) {
    a = something;
    b = something;
    InterlockedExchangeAddRelease(&valid, 1);
  }
  do_other_stuff();
  LeaveCriticalSection(&mtx);
}

int function2 () { // called from many threads
  if (InterlockedExchangeAddAcquire(&valid, 0)) {
    return a + b;
  } else {
    return 0;
  }
}
_______________________________________________________________________


rani_s...@hotmail.com

Jan 11, 2009, 1:54:49 PM
On Jan 11, 8:08 am, Giancarlo Niccolai <gc-remov...@removeme-
niccolai.cc> wrote:
> rani_shar...@hotmail.com ha scritto:

What interlocked functions did you use? Do you have the algorithm?
I find it extremely hard to believe that the x86 implementation was
broken.
OTOH, writing lock free code correctly is quite challenging.

Rani

rani_s...@hotmail.com

Jan 11, 2009, 2:03:31 PM

Don't you think that Intel's hands are tied by now?
Backward compatibility was always their top priority, so even if they
relax the memory model they will make it configurable so that Windows
(and probably others) will avoid it.

AFAIR, Windows runs in such a conservative mode on IA64 (all 5
installations).

Rani

Chris M. Thomasson

Jan 11, 2009, 2:25:58 PM
"Giancarlo Niccolai" <gc-re...@removeme-niccolai.cc> wrote in message
news:EhS9l.37074$J84....@tornado.fastwebnet.it...

I bet that you were not managing the memory of the nodes correctly. You
need some type of lifetime management scheme (e.g., GC, PDR) to ensure
that a node is in a persistent quiescent state _before_ its memory is
reclaimed/reused... What types of algorithms did you try to implement?

Anthony Williams

Jan 11, 2009, 3:37:36 PM
David Schwartz <dav...@webmaster.com> writes:

Both Intel and AMD have stated that stores have release semantics and
loads have acquire semantics in their x86/x86-64 processor
architecture documents. For any processors that implement these
specifications you can therefore rely on it.

Of course, they are free to change the specs for future
*architectures*, but if it claims to be x86 or x86-64 then it would be
breaking backwards compatibility with their own specs, which would be
rather a bad move.

Anthony
--
Anthony Williams | Just Software Solutions Ltd


Custom Software Development | http://www.justsoftwaresolutions.co.uk

JC

Jan 11, 2009, 4:46:28 PM
On Jan 11, 1:23 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "JC" <jason.cipri...@gmail.com> wrote in message

> > All right, so just making sure, this (your earlier suggestion) way
> > with the interlocked function is most "portable" (as in, portable
> > between Windows systems running on Intel-compatible CPUs -- that's the
> > only target platform) way to do it, right?
> > [...]
> > And simply using InterlockedExchange for the write, and having the
> > variable volatile, guarantees correct visibility for the reads, and
> > should continue guaranteeing that unless something major changes in
> > the future?
>
> If your using recent versions of MSVC (e.g., 8 or higher), then yes. The
> load from the volatile variable will generate subsequent membar `#LoadStore
> | #LoadLoad'. If your using another compiler, well, your going to need to
> consult its documentation. BTW, if your using MSVC, you don't even need the
> `InterlockedExchange' in there because store to volatile will generate
> preceding membar `#LoadStore | #StoreStore'.

I see, so in that case, using MSVC (2008), the original example I had
posted would also work correctly (just volatile access, no interlocked
functions); it's just not necessarily portable to older versions of
MSVC or non-MSVC compilers (I know David Schwartz and David Barrett-
Lennard were discussing this; I've just got my head around it now).

> > Can I replace InterlockedExchange with InterlockedExchangeAcquire for
> > the same effect?
>
> You would want to use release semantics instead; its weird that MS does not
> seem to have a release version of `InterlockedExchange'. See, the act of
> storing to `valid' is a producer. The act of loading from `valid' is the
> consumer. The producer needs release semantics, and the consumer would need
> acquire.

Ah, ok, thanks for the good explanation. And that is weird that there
is no InterlockedExchangeRelease (I thought there was), I wonder why
that is. InterlockedExchange loads the old value, maybe that somehow
doesn't fit the model?

> For 100% portable solution across all platforms that Windows runs
> on, and basically all C/C++ compilers that are compatible with Windows,
> would be something like:

Great, thanks! I had felt a little weird about using
InterlockedExchangeAdd(..., 0) as a sort of "InterlockedGet" but I
guess it makes sense. This is what I was looking for.

Thanks again for the good explanation,
Jason

David Schwartz

Jan 11, 2009, 5:51:04 PM
On Jan 11, 11:03 am, rani_shar...@hotmail.com wrote:

> On Jan 10, 5:39 pm, David Schwartz <dav...@webmaster.com> wrote:

> > They "happen to". Intel has specifically said that you cannot rely on
> > future processors invalidating pre-fetched reads if the cache line is
> > invalidated. And AMD, as far as I know, has not said that you can rely
> > on it.

> Don't you think that Intel's hands are tight by now?

Yes, I do.

> Backward compatibility was always their top priority so even if they
> will relax the memory model they will make it configurable so that
> windows (and probably others) will avoid it.
>
> AFAIR, windows is running in such conservative mode on IA64 (all 5
> installations).

I agree with you, I'm just not sure that justifies adding to the
problem.

DS

David Barrett-Lennard

Jan 11, 2009, 8:14:24 PM

That seems to conflict with the following Intel white paper which
presumably applies to all IA-32 and Intel 64 processors:

http://www.intel.com/products/processor/manuals/318147.pdf

David Barrett-Lennard

Jan 11, 2009, 8:52:25 PM

Is this mandatory for a C++ conforming compiler? I've read claims
that as far as the standard is concerned it's permissible for a
compiler to reorder loads/stores between volatile and non-volatile
variables.

rani_s...@hotmail.com

Jan 11, 2009, 10:49:47 PM

I totally agree that relying on permissive memory models is bad.
The common premature-optimizing programmer, armed with such liberal
assumptions, will often end up with broken code (race conditions).

OTOH, VC in practice disallows any HW freedom of implementation
(e.g. to improve multi-core scalability) by generating code that is
guaranteed to work given the same assumptions that we want programmers
to avoid.

Rani

Chris M. Thomasson

Jan 12, 2009, 9:24:34 PM
[OT]

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:zOqal.6952$tD1....@newsfe07.iad...


> "JC" <jason.c...@gmail.com> wrote in message
> news:65269250-3fb7-4176...@k1g2000prb.googlegroups.com...
> On Jan 10, 8:39 pm, David Schwartz <dav...@webmaster.com> wrote:
>> > On Jan 10, 12:04 am, Anthony Williams <anthony....@gmail.com> wrote:
>> >
>> > > It does on x86, since stores to memory have release semantics and
>> > > loads have acquire semantics at the processor level anyway.
>> >
>> > They "happen to". Intel has specifically said that you cannot rely on
>> > future processors invalidating pre-fetched reads if the cache line is
>> > invalidated. And AMD, as far as I know, has not said that you can rely
>> > on it.
>
>> All right, so just making sure, this (your earlier suggestion) way
>> with the interlocked function is most "portable" (as in, portable
>> between Windows systems running on Intel-compatible CPUs -- that's the
>> only target platform) way to do it, right?
>
>> [...]
>
>
>> And simply using InterlockedExchange for the write, and having the
>> variable volatile, guarantees correct visibility for the reads, and
>> should continue guaranteeing that unless something major changes in
>> the future?

^^^^^^^^
> [...]

No shi%! Major changes; humm... There is some hard core stuff in the works
indeed:

http://groups.google.com/group/comp.arch/msg/4d211aea3166fe65

http://groups.google.com/group/comp.arch/browse_frm/thread/a985196455d0fd26

Any thoughts? :^o

Giancarlo Niccolai

unread,
Jan 13, 2009, 2:24:30 PM1/13/09
to
rani_s...@hotmail.com wrote:

I implemented this:
http://www.research.ibm.com/people/m/michael/podc-1996.pdf

Then, as it didn't work as expected, I downloaded several
implementations, for example this:

http://www.codeproject.com/KB/cpp/lockfreeq.aspx

and different other re-implementations of the "book version" of the
canonical lock free queue (the first link above).

The algorithm was working in Java and C#, but didn't work for me with
InterlockedCompareExchange & company on the old VS6. All the
implementations I tried on VS6 were defective and presented the same
problems, while working correctly on MS-Windows in languages giving
guarantees about memory access ordering, so I started to wonder about
some deficiency in the (old) Interlocked functions.

> OTOH, writing lock free code correctly is quite challenging.
>

Writing challenging code is quite a normal activity here.

GN.

Chris M. Thomasson

unread,
Jan 13, 2009, 3:21:23 PM1/13/09
to
"Giancarlo Niccolai" <gc-re...@removeme-niccolai.cc> wrote in message
news:OT5bl.40812$J84....@tornado.fastwebnet.it...

> rani_s...@hotmail.com wrote:
>> On Jan 11, 8:08 am, Giancarlo Niccolai <gc-remov...@removeme-
>> niccolai.cc> wrote:
>>> rani_shar...@hotmail.com wrote:
>>>
>>>> On Jan 9, 4:50 pm, Giancarlo Niccolai <gc-remov...@removeme-
>>>> niccolai.cc> wrote:
>>>>> Chris M. Thomasson wrote:
>>>>>> IIRC, there was one post around 4-5 years ago
>>>>>> that suggested there was on error in the PPC impl of Interlock API.
>>>>>> There was no `sync' or `lwsync' instructions in the disassembled
>>>>>> code...
>>>>> I had problems with Interlocked* too.
>>>> Did you write software for PPC?
>>> Not in this specific case.
>>
>> What interlocked functions did you use? Do you have the algorithm?
>> I find it extremely hard to believe that the x86 implementation was
>> broken.
>
> I implemented this:
> http://www.research.ibm.com/people/m/michael/podc-1996.pdf
>
> Then, as it didn't work as expected,

What did you do for memory management? How did you solve ABA? With DWCAS, or
something else?


> I downloaded several implementations, in example this:
>
> http://www.codeproject.com/KB/cpp/lockfreeq.aspx

Yikes! That code is so fu%king busted it's ridiculous. It's basically
infested with a shi%load of race-conditions. Wow.


> and different other re-implementations of the "book version" of the
> canonical lock free queue (the first link above).
>
> The algorithm was working in Java and C#, but didn't work for me with
> InterlockedCompareExchange & company on the old VS6. All the
> implementations I tried on VS6 were defective and presented the same
> problems, while working correctly on MS-Windows in languages giving
> guarantees about memory access ordering, so I started wonder about some
> deficiency in the (old) Interlocked functions.

There is nothing wrong with the Interlocked functions wrt the example.
AFAICT, the author of that code has no idea what he is doing. The thing that
first jumps out at me is that there is no apparent attempt to solve ABA and
no apparent memory management scheme. I bet the code "seemed" to run on Java
and C# because they both have a GC.


>> OTOH, writing lock free code correctly is quite challenging.
>>
>
> Writing challenging code is quite a normal activity here.

Indeed. FWIW, the queue code by Michael and Scott is very expensive, and
basically useless for several reasons; use the two-lock queue instead.
Should I go into some of them?

;^)

Giancarlo Niccolai

unread,
Jan 17, 2009, 6:56:31 AM1/17/09
to
Chris M. Thomasson wrote:

>>


>> I implemented this:
>> http://www.research.ibm.com/people/m/michael/podc-1996.pdf
>>
>> Then, as it didn't work as expected,
>
> What did you do for memory management? How did you solve ABA? With
> DWCAS, or something else?
>

ABA resolution is part of the algorithm in the link; there's a long
dissertation on that in the paper.

Also, memory management was not part of my concern, as the link pointers
were part of the visible structure, and there was no data sharing between
the producer and consumers of the data.

>
>
>
>> I downloaded several implementations, in example this:
>>
>> http://www.codeproject.com/KB/cpp/lockfreeq.aspx
>
> Yikes! That code is so fu%king busted it's ridiculous. It's basically
> infested with a shi%load of race-conditions. Wow.
>

The code in this link is certainly deficient, as it deals with ABA
prevention in an unsafe way.

However, I forgot to say, I was not concerned about this deficiency
because both the tests and the real-world program using the code
had a FIFO one-producer/one-consumer scheme.

I am sorry I can't recall the other implementations I checked out, but
the interesting part is that all these implementations (mine and others')
of a well-known algorithm were plainly failing in the simplest
scenario, and in the same way.

I also tried some singly linked lists, at least to solve the problem at
hand of a FIFO single-producer/single-consumer scenario, and experienced
the same corruptions.

>
> There is nothing wrong with the Interlocked functions wrt the example.
> AFAICT, the author of that code has no idea what he is doing. The thing
> that first jumps out at me is that there is no apparent attempt to solve
> ABA and no apparent memory management scheme. I bet the code "seemed" to
> run on Java and C# because they both have a GC.
>

What does GC have to do with the functionality of a queue?

Moreover, how can GC save an ill-written algorithm? Even if a
language such as Java could prevent (and in my experience, not always) an
ill-written lock-free or concurrent algorithm from crashing, it can't save
you from structure corruption that would lead to deadlocks, data-loss
errors, endless loops or language-level exceptions; or in simpler words,
from just failing.

However, the problem is no longer relevant, as I can't see any problem
with Interlocked* on VS200x.

Giancarlo.

Chris M. Thomasson

unread,
Jan 18, 2009, 10:58:13 PM1/18/09
to
"Giancarlo Niccolai" <gc-re...@removeme-niccolai.cc> wrote in message
news:PHjcl.45237$J84....@tornado.fastwebnet.it...

> Chris M. Thomasson wrote:
>
>>>
>>> I implemented this:
>>> http://www.research.ibm.com/people/m/michael/podc-1996.pdf
>>>
>>> Then, as it didn't work as expected,
>>
>> What did you do for memory management? How did you solve ABA? With
>> DWCAS, or something else?
>>
>
> ABA resolution is part of the algorithm in the link; there's a long
> dissertation on that in the paper.

So, you either used an offset trick or DWCAS. Which one was it? Both can
map exactly to the algorithm described in the paper.


> Also, memory management was not part of my concern, as the link pointers
> was part of the visible structure, and there was no data sharing between
> the producer and consumers of the data.

You have to ensure that ANY node which gets passed through the queue
described in the paper is in a _persistent_ quiescent state _before_ it's
destroyed. You simply cannot reclaim nodes in the presented algorithm as-is;
you _need_ to manage the lifetime of nodes. The paper describes using a
static node cache, which is fine. However, it means that you still need ABA
version counters embedded in the queue anchor.


>>> I downloaded several implementations, in example this:
>>>
>>> http://www.codeproject.com/KB/cpp/lockfreeq.aspx
>>
>> Yikes! That code is so fu%king busted it's ridiculous. It's basically
>> infested with a shi%load of race-conditions. Wow.
>>
>
> The code in this link is certainly deficient, as it deals with ABA
> prevention in an unsafe way.
>
> However, I forgot to say, I was not concerned about this deficiency
> because both the tests and the real world program using the code I
> needed had a FIFO one-producer-one-consumer scheme.

IMHO, the queue in the paper from M. Michael, and the totally busted one
presented on codeproject.com, are "overkill" for a simple single
producer/consumer FIFO. You can take a look at an example implementation of
an unbounded wait-free queue here:


http://webpages.charter.net/appcore


Feel free to post questions about AppCore in this newsgroup.


> I am sorry I can't recall the other implementations I checked out, but
> the interesting part is that all this implementations (mine and other's)
> of a well known algorithm was plainly failing in the most simple
> scenario and in the same way.
>
> I also tried some single linked lists, at least to solve the problem at
> hand of a FIFO single-producer-single-consumer scenario, and experienced
> the same corruptions.

AFAICT, an unbounded FIFO single-producer/consumer queue is extremely simple
and requires no atomics and no explicit membars on a TSO memory model; an RMO
model would require a single `#StoreStore' barrier in the producer. For
example, here is example code in x86 ASM for the producer and consumer
portion of an SPSC queue:

http://webpages.charter.net/appcore/appcore/src/cpu/i686/ac_i686_gcc_asm.html
(search for `ac_i686_queue_spsc_push' and `ac_i686_queue_spsc_pop'
functions, located near middle of page)

Here is where the C API and data-structures are defined:

http://webpages.charter.net/appcore/appcore/include/cpu/i686/ac_i686_h.html
(search for `ac_i686_queue_spsc_t' and `ac_i686_node_t' types)

The library uses a simple three-level internal cache to manage the nodes
because the queue is not an intrusive collection. I also include an
eventcount to take care of conditional blocking. One can use an eventcount
to conditionally block on the "boundary" conditions of a wide variety of
non-blocking algorithms. IMVHO, the eventcount synchronization primitive
represents a "perfect marriage" between non-blocking and lock-based
algorithms.

The SPSC FIFO queue is nice because it's wait-free, and it requires no
explicit memory management for any of its nodes. This is perfect for
real-time systems.


>> There is nothing wrong with the Interlocked functions wrt the example.
>> AFAICT, the author of that code has no idea what he is doing. The thing
>> that first jumps out at me is that there is no apparent attempt to solve
>> ABA and no apparent memory management scheme. I bet the code "seemed" to
>> run on Java and C# because they both have a GC.
>>
>
> What has GC to do with the functionality of a queue?

Most of the non-blocking algorithms in general, queues included, presented
in papers require explicit lifetime management of the nodes that flow
through them. GC handles all of this by default. If you're using C/C++, well,
you need to handle this case with great care.


> Moreover, how can GC save an ill-written algorithm?

By correctly identifying that it's okay to reclaim memory which makes up parts
of the data structures (e.g., nodes). Most non-blocking algorithms can be in
the middle of accessing a data structure when you're not expecting it. For the
queue presented in the IBM paper, you cannot just pop a node and free it. No
way. There could still be threads accessing it, or just about to access it.
There are multiple solutions to this problem, traditional GC being one of
them indeed.

Humm, well, a GC environment can still bite inexperienced programmers. One
might think that they can reuse nodes as soon as they pop them from a
non-blocking algorithm in a full GC environment. They also design the
non-blocking algorithm with no consideration for the ABA problem because
they think GC automatically cures ABA in _all_ cases. So, they reuse a node
as soon as they pop it, and BAM! The ABA problem sends their program off into
random undefined states. You cannot automatically reuse nodes in a GC language
unless the non-blocking algorithm can handle ABA all by itself. Otherwise,
you need to send the nodes through the GC and only reuse them when they come
out on the other side.


> -- even if a
> language as Java would prevent (and in my experience, not always) an ill
> written lock-free or concurrent algorithm from crashing, it can't save
> you from structure corruptions that would lead to deadlocks, data lost
> errors, endless loops or language level exceptions; or in simpler words,
> from just failing.

I am focusing directly on the memory management aspect of the problem.


> However, the problem is not relevant anymore, as I can't see any problem
> with Interlocked* on VS200x

Ummm, I have implemented various non-blocking synchronization schemes using
VC 6.0 on NT4. IMVHO, your problem was never with the `Interlocked' API; it
was a bug in the implementation of the algorithm...

Giancarlo Niccolai

unread,
Jan 20, 2009, 3:21:05 PM1/20/09
to
Chris M. Thomasson wrote:
> "Giancarlo Niccolai" <gc-re...@removeme-niccolai.cc> wrote in message
> news:PHjcl.45237$J84....@tornado.fastwebnet.it...
>> Chris M. Thomasson wrote:
>>
>>>>
>>>> I implemented this:
>>>> http://www.research.ibm.com/people/m/michael/podc-1996.pdf
>>>>
>>>> Then, as it didn't work as expected,
>>>
>>> What did you do for memory management? How did you solve ABA? With
>>> DWCAS, or something else?
>>>
>>
>> ABA resolution is part of the algorithm in the link; there's a long
>> dissertation on that in the paper.
>
> So, you either used an offset trick or DWCAS. Which one was it? Both
> can map exactly to the algorithm described in the paper.
>

I tried only the DWCAS. I also took care of proper alignment. But then
again, the thing was also failing with a singly linked list with one
producer and one consumer, where the ABA problem is obviously not possible.

GN.
