
What does "memory barrier" mean?


Major

Jun 2, 2004, 3:25:59 AM
hi

What is a memory barrier?

Is it enough to use a mutex (such as a Win32 mutex or POSIX mutex) to protect
some shared object(s) on SMP systems?


thanks
Major

Joe Seigh

Jun 2, 2004, 7:12:56 AM

The term is used informally here. They're really hardware platform-dependent
and used to implement the semantics of the synchronization functions discussed
here. Their use here is shorthand for memory visibility. You really have to
learn the threaded programming conventions until you get a feel for what those
memory visibility rules are. Nobody can actually tell you what those rules
are. When you do "get it" then you will understand what "memory barrier" means
in the context of c.p.t.

Joe Seigh

SenderX

Jun 2, 2004, 5:17:50 PM
> What is a memory barrier?
>
> Is it enough to use a mutex (such as a Win32 mutex or POSIX mutex) to protect
> some shared object(s) on SMP systems?

Yes. The mutex code will have the proper barriers.


Ross Bencina

Jun 5, 2004, 11:55:04 PM
Joe Seigh wrote:
> The term is used informally here. They're really hardware platform-dependent
> and used to implement the semantics of the synchronization functions discussed
> here. Their use here is shorthand for memory visibility. You really have to
> learn the threaded programming conventions until you get a feel for what those
> memory visibility rules are. Nobody can actually tell you what those rules
> are. When you do "get it" then you will understand what "memory barrier" means
> in the context of c.p.t.

No offence Joe, but this sounds like a total cop-out. You make it sound like
knowledge of memory barriers is some kind of zen enlightenment only
comprehensible to the initiated. I contend that it must be possible to
formalise your (our) knowledge of memory barriers, and to account for any
potential differences in hardware implementations...

How about we thrash out a definition and I'll post it to wikipedia?

Here's a draft, please point out any errors and suggest any omissions...

---

"Memory barrier" (a.k.a memory fence, [any others?]) is a general term used
to describe CPU instructions which affect memory visibility. Memory
visibility refers to the observed state of a shared memory by different
devices on a bus, for example, multiple CPUs on an SMP system, or a CPU and
prehipheral devices on a uniprocessor system. Modern processors and memory
controllers implement a variety of optimisation techniques which may result
in physical memory stores and loads being executed in a different order from
the order in which they appear in the program. For a program running on a
single processor, the hardware ensures that the effects of memory access
reordering are not noticable. For efficiency reasons the hardware may not
provide such strong guarantees for stores and loads executed by different
processors.

Consider the following program running on two processors:

>>
(initial state: x = 0, f = 0)

processor #1:
x = 42;
f = 1;

processor #2:
while( f == 0 ) /* repeat */;
print( x );
<<

You might expect the print statement to always print the number "42",
however if the memory hardware had reordered processor #1's store
operations, it is also possible that f would be updated _before_ x, and the
print statement might print "0". For some programs this situation is not
acceptable. Memory barrier instructions provide a method to avoid this
situation by enforcing a total or partial ordering on read and write
operations which appear on either side of the barrier instruction. In the
example above, a barrier could be inserted before processor #1's assignment
to f to ensure that the new value of x was visible to other processors prior
to the change in the value of f.
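
To make this concrete, here is the example as a C-like sketch. It is
illustrative only: memory_barrier() is a placeholder name, not a real
portable API, and the empty GCC asm shown only stops compiler reordering;
a real SMP build must substitute the platform's hardware fence instruction.
As discussed in the last section below, real C code also has to worry about
compiler reordering.

#include <stdio.h>

/* Placeholder: stops *compiler* reordering only; substitute the
 * platform's real fence instruction for SMP hardware ordering. */
#define memory_barrier() __asm__ __volatile__("" ::: "memory")

int x = 0;
int f = 0;

void processor1(void)           /* the writer */
{
    x = 42;
    memory_barrier();           /* make the store to x visible before f */
    f = 1;
}

void processor2(void)           /* the reader */
{
    while (f == 0)
        ;                       /* spin until f is set */
    memory_barrier();           /* order the load of f before the load of x */
    printf("%d\n", x);          /* expected to print 42 */
}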

Memory barrier instructions come in a number of different flavours, which
are always dependent on the hardware architecture. For this reason the usual
advice regarding memory barrier usage is to consult the technical
documentation for the hardware which you are programming. That said, it is
useful to consider the types of memory barrier instructions and the related
terminology which may be encountered in practice.

Some architectures provide only a single memory barrier instruction,
sometimes called "full fence". A full fence ensures that all read and write
operations prior to the fence will have been committed prior to any dependent
reads and writes executed following the fence. Other architectures provide
separate "acquire" and "release" memory barriers which address the visibility
of read-after-write operations from the point of view of a reader (sink) or
writer (source) respectively [is this correct?]. [where does Alexander's
"hoist" terminology fit in?] Some architectures provide separate memory
barriers to address memory visibility between combinations of system memory
and i/o memory. When more than one memory barrier instruction is available
it is important to consider that the cost of different instructions may vary
considerably.
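
As a sketch of how the acquire/release flavour might be used (the function
names here are invented for illustration; Itanium, for example, spells the
underlying operations ld.acq and st.rel):

int x, f;

int  load_acquire(volatile int *p);          /* later accesses cannot move
                                                above this load */
void store_release(volatile int *p, int v);  /* earlier accesses cannot move
                                                below this store */

void writer(void)
{
    x = 42;
    store_release(&f, 1);   /* the store to x cannot sink below this */
}

void reader(void)
{
    while (load_acquire(&f) == 0)
        ;                   /* the load of x cannot hoist above this */
    /* x is guaranteed to be 42 here */
}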

Memory barriers in "threaded" programming
--------------------------------------------

Threaded programs usually use synchronisation objects such as mutexes and
semaphores to synchronise access to data from parallel threads of execution.
In general, operations on such objects (e.g. lock()/unlock(), Microsoft's
Interlocked*() API, etc.) are implemented with memory barriers to provide
the expected memory visibility semantics. Thus the explicit use of memory
barriers in most threaded programs is not necessary. Explicit use of memory
barriers is usually only necessary when implementing synchronisation
operations such as those just mentioned, when programming lock-free
algorithms for multiprocessor systems, or when communicating with hardware
devices.

Hardware barriers vs. compiler reordering optimisations
--------------------------------------------------------

Memory barrier instructions only address reordering effects at the hardware
level. Many compilers may also reorder memory access instructions as part of
the program optimization process. Some compilers may provide sufficient
facilities to implement "memory barrier functions" which address both the
compiler reordering and machine reordering issues; however, it is usually
advisable to be very careful about this, either by writing in assembler, or
by carefully inspecting compiler generated code.
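
For example, with GCC (a toolchain-specific assumption) an empty asm with a
"memory" clobber acts as a pure compiler barrier, and a combined "memory
barrier function" pairs it with the hardware fence; the x86 sfence shown
here is just one illustrative choice:

/* No instruction is emitted, but GCC will not move memory accesses
 * across this point (the "memory" clobber). */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Compiler barrier plus a hardware store fence (x86 with SSE). */
#define store_barrier()    __asm__ __volatile__("sfence" ::: "memory")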

---

Comments? Suggestions?


Ross.


David Schwartz

Jun 6, 2004, 3:43:03 AM
Ross Bencina wrote:

> (initial state: x = 0, f = 0)
>
> processor #1:
> x = 42;
> f = 1;
>
> processor #2:
> while( f == 0 ) /* repeat */;
> print( x );
>

Do not make this look like C code, because this is pseudo-code. Someone
looking at this might think that you meant real C code. The problem is that
C permits the compiler to reorder the operations, so a hardware memory
barrier alone is insufficient to prevent the problem. In other words, even
with a memory barrier, as you've defined it, between the two stores, the
compiler could still reorder the stores.

DS


Ross Bencina

Jun 6, 2004, 5:36:20 AM
David Schwartz wrote:
> Ross Bencina wrote:
>
> > (initial state: x = 0, f = 0)
> >
> > processor #1:
> > x = 42;
> > f = 1;
> >
> > processor #2:
> > while( f == 0 ) /* repeat */;
> > print( x );
> >
>
> Do not make this look like C code, because this is pseudo-code. Someone
> looking at this might think that you meant real C code.

Thanks David.

How about this for the pseudo code then?

---
Initially, memory locations x and f both hold the value 0. The program
running on processor #1 loops until the value of f is non-zero, then it
prints the value of x. The program running on processor #2 stores the value
42 into x and then stores the value 1 into f. Pseudo code for the two
program fragments is shown below. The sequence of instructions for each
program is shown in the left hand column.

Processor #1:
loop:
(1) if the value in location f is 0 goto loop
(2) print the value in location x

Processor #2:
(1) store the value 42 into location x
(2) store the value 1 into location f
---


> The problem is that C permits the compiler to reorder the operations, so a
> hardware memory barrier alone is insufficient to prevent the problem. In
> other words, even with a memory barrier, as you've defined it, between the
> two stores, the compiler could still reorder the stores.

Yes, I mentioned that later in the text, but I agree that it is problematic
to use C for the example. I imagine that most high-level languages permit
this kind of reordering, since so many compiler optimisations are based on
data flow and value dependence.

One thing I'm wondering -- does anyone use the term "memory barrier" to
refer to something which also inhibits the compiler from performing
reordering around the barrier function? Is it ever correct to discuss memory
barriers in the context of a high level language? Or does the term strictly
refer to hardware operations to synchronise memories across a bus?

Ross.


David Schwartz

Jun 6, 2004, 6:45:43 AM
Ross Bencina wrote:
> David Schwartz wrote:

>> Do not make this look like C code, because this is pseudo-code.
>> Someone looking at this might think that you meant real C code.
>
> Thanks David.
>
> How about this for the pseudo code then?

> Processor #1:


> loop:
> (1) if the value in location f is 0 goto loop
> (2) print the value in location x
>
> Processor #2:
> (1) store the value 42 into location x
> (2) store the value 1 into location f

I think that's much less likely to confuse people about what a memory
barrier really is.

>> The problem is that C permits the compiler to reorder the
>> operations, so a hardware memory barrier alone is insufficient to
>> prevent the problem. In other words, even with a memory barrier, as
>> you've defined it, between the two stores, the compiler could still
>> reorder the stores.

> Yes, I mentioned that later in the text, but I agree that it is
> problematic to use C for the example. I imagine that most high-level
> languages permit this kind of reordering, since so many compiler
> optimisations are based on data flow and value dependence.

> One thing I'm wondering -- does anyone use the term "memory barrier"
> to refer to something which also inhibits the compiler from performing
> reordering around the barrier function? Is it ever correct to discuss
> memory barriers in the context of a high level language? Or does the
> term strictly refer to hardware operations to synchronise memories
> across a bus?

I suppose it's possible, but I would argue that such usage is at worst
incorrect and at best should be discouraged. A 'memory barrier' is a very
low-level concept, generally used in low-level languages. In higher-level
languages, you typically use constructs that are more suitable precisely
because they have semantics that are defined with respect to compiler
optimizations. This is why, for example, POSIX pthreads doesn't talk
about memory barriers but about memory visibility, and doesn't talk about
guarantees having to do with multiple processors and hardware but about what
the C code will actually see.

DS


Joe Seigh

Jun 6, 2004, 12:18:50 PM

Ross Bencina wrote:
>
> Joe Seigh wrote:
> > The term is used informally here. They're really hardware platform-dependent
> > and used to implement the semantics of the synchronization functions discussed
> > here. Their use here is shorthand for memory visibility. You really have to
> > learn the threaded programming conventions until you get a feel for what those
> > memory visibility rules are. Nobody can actually tell you what those rules
> > are. When you do "get it" then you will understand what "memory barrier" means
> > in the context of c.p.t.
>
> No offence Joe, but this sounds like a total cop-out. You make it sound like
> knowledge of memory barriers is some kind of zen enlightenment only
> comprehensible to the initiated. I contend that it must be possible to
> formalise your (our) knowledge of memory barriers, and to account for any
> potential differences in hardware implementations...
>
> How about we thrash out a definition and I'll post it to wikipedia?
>
> Here's a draft, please point out any errors and suggest any omissions...

(snip)

Memory barriers are not the same as memory visibility but the term is used
loosely to imply memory visibility rules. Memory barriers don't exist in
POSIX and are not defined there. They are defined for various hardware
platforms in the context of the respective platform memory models. And
they are used in some implementations of POSIX synchronization functions to
effect the required memory visibility. The implementations are most likely
overly restrictive given that hw memory barriers don't know about locks per
se. This means that you should not use the fact that a particular implementation
uses a memory barrier to infer POSIX lock semantics as you will be making
invalid assumptions. A lock implementation is not required to use a memory
barrier. It just has to ensure that the unspecified memory visibility rules
are adhered to somehow.

I suppose you could define generic portable memory barrier semantics and refer
to those when you talk about things not defined by POSIX, such as double checked
locking so people know what you mean.

I've done a formal definition or at least know how to do it now. But it's not
of POSIX since only the POSIX committee can do that. It's basically my
reverse engineering from conventional POSIX usage adjusted for consistency.
And it's on the conservative side. I may restrict something that POSIX
would actually allow if we knew what the rules were, but I'd rather be
safe than sorry.

But don't listen to me. There are plenty who will tell you they know what
the semantics are. Apparently they got this from divine revelation from the
thread gods and thus their knowledge is absolute. Trust them.

But don't complain to me. I don't know what the rules are. I think I have
a pretty good idea. But I've thought that before and it turned out I didn't
really have as good an understanding as I thought I did. And this has happened
to me more than once.

Joe Seigh

Ross Bencina

Jun 6, 2004, 11:23:18 PM
Hi Joe

Sorry if I seem to be repeating what you've said. I'm just trying to get my
head around this...

Joe Seigh wrote:
> Memory barriers are not the same as memory visibility but the term is used
> loosely to imply memory visibility rules. Memory barriers don't exist in
> POSIX and are not defined there.

It seems that both you and David Schwartz hold the opinion that the concept
of "memory visibility" is somehow tied to POSIX. Perhaps I shouldn't be
giving this term precedence in my definition. I don't know much about POSIX,
but my impression was that the concept of memory visibility is broader than
POSIX. Am I wrong? I was assuming an intuitive interpretation where memory
visibility means something like "the state of memory as visible by each
device on a bus" or "the observed ordering of operations on memory by each
device on a bus". Perhaps there is a more generic term that doesn't have
such strong POSIX connotations?

My understanding was that a memory barrier is a specific hardware operation
which can be used to _implement_ or _effect_ either:
(a) formally defined high-level memory visibility semantics (ie POSIX)
(b) some cache synchronisation operation on the memory hardware which is
defined specifically for a particular hardware platform.

Obviously (a) requires a mapping from the specific hardware semantics to
"POSIX memory visibility semantics". A similar example that was given
recently here would be a mapping to the forthcoming Java Memory Model.

(b) speaks directly to what a memory barrier is, so perhaps this is closer
to the definition I should be using.


> They are defined for various hardware
> platforms in the context of the respective platform memory models.

O.K. so a memory model doesn't define "memory visibility" per se at all? If
so, how is a platform memory model defined? What kind of semantic entities
are used?


> And they are used in some implementations of POSIX synchronization functions
> to effect the required memory visibility. The implementations are most likely
> overly restrictive given that hw memory barriers don't know about locks per
> se.

Overly restrictive, meaning the barriers provide more memory synchronisation
than required by POSIX?


> This means that you should not use the fact that a particular implementation
> uses a memory barrier to infer POSIX lock semantics as you will be making
> invalid assumptions. A lock implementation is not required to use a memory
> barrier. It just has to ensure that the unspecified memory visibility rules
> are adhered to somehow.

"unspecified memory visibility" -- are you implying that POSIX memory
visibility rules are un(der)specified?


> I suppose you could define generic portable memory barrier semantics and refer
> to those when you talk about things not defined by POSIX, such as double checked
> locking so people know what you mean.

Well, I was assuming that people already have their own private "portable
memory barrier semantics" terminology.. Sender-X talks about Aquire/Release
and Source/Sink, Alexander talks about "hoisting".. I'm just trying to sort
things into some kind of understandable conceptual framework. I believe that
the Linux kernel guys had to do something like this for their memory barrier
functions.. I'd better go back and do some more research about that.


> I've done a formal definition or at least know how to do it now. But it's not
> of POSIX since only the POSIX committee can do that. It's basically my
> reverse engineering from conventional POSIX usage adjusted for consistency.
> And it's on the conservative side. I may restrict something that POSIX
> would actually allow if we knew what the rules were, but I'd rather be
> safe than sorry.
>
> But don't listen to me. There are plenty who will tell you they know what
> the semantics are. Apparently they got this from divine revelation from the
> thread gods and thus their knowledge is absolute. Trust them.

Thankfully I'm only trying to define what a memory barrier is, not the
explicit visibility semantics of POSIX. I'd like to provide some background
for people who need to look at the memory model of a specific platform,
(such as the O.P.), and of course it's a great vehicle for improving my own
understanding.


> But don't complain to me. I don't know what the rules are. I think I have
> a pretty good idea. But I've thought that before and it turned out I didn't
> really have as good an understanding as I thought I did. And this has happened
> to me more than once.
>
> Joe Seigh

Thanks!

Ross.


Thomas Pornin

Jun 7, 2004, 4:40:07 AM
According to Ross Bencina <ro...@audiomulch.com>:

> Yes, I mentioned that later in the text, but I agree that it is problematic
> to use C for the example. I imagine that most high-level languages permit
> this kind of reordering, since so many compiler optimisations are based on
> data flow and value dependence.

Maybe you should add a word about the "volatile" C keyword: although
it looks like a way to specify absolutely the memory access order, it
is no substitute for a proper memory barrier. In a way, "volatile" is
a barrier with regards to interruptions handled on the same CPU (e.g.,
signal handlers) but this does not work for SMP systems.
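
A sketch of the single-CPU case where "volatile" is adequate (standard C
and POSIX signals):

#include <signal.h>

/* The handler runs on the same CPU as the interrupted code, so no
 * hardware reordering is observable; volatile only has to stop the
 * compiler from caching the flag in a register. */
static volatile sig_atomic_t done = 0;

static void handler(int sig)
{
    (void)sig;
    done = 1;
}

int main(void)
{
    signal(SIGINT, handler);
    while (!done)
        ;   /* spin until the signal arrives */
    /* If a thread on a second CPU set the flag instead, volatile alone
     * would not order the surrounding data accesses; a memory barrier
     * (or a lock) would also be needed. */
    return 0;
}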


--Thomas Pornin

Joe Seigh

Jun 7, 2004, 8:38:29 AM

Ross Bencina wrote:
>
> Hi Joe
>
> Sorry if I seem to be repeating what you've said. I'm just trying to get my
> head around this...
>
> Joe Seigh wrote:
> > Memory barriers are not the same as memory visibility but the term is used
> > loosely to imply memory visibility rules. Memory barriers don't exist in
> > POSIX and are not defined there.
>
> It seems that both you and David Schwartz hold the opinion that the concept
> of "memory visibility" is somehow tied to POSIX. Perhaps I shouldn't be
> giving this term precedence in my definition. I don't know much about POSIX,
> but my impression was that the concept of memory visibility is broader than
> POSIX. Am I wrong? I was assuming an intuitive interpretation where memory
> visibility means something like "the state of memory as visible by each
> device on a bus" or "the observed ordering of operations on memory by each
> device on a bus". Perhaps there is a more generic term that doesn't have
> such strong POSIX connotations?

Java has a formal definition. POSIX pthreads and windows threads do not.

>
> My understanding was that a memory barrier is a specific hardware operation
> which can be used to _implement_ or _effect_ either:
> (a) formally defined high-level memory visibility semantics (ie POSIX)
> (b) some cache synchronisation operation on the memory hardware which is
> defined specifically for a particular hardware platform.

Cache in most cases is strongly coherent (i.e. processor order is maintained).
The visibility issues are due to compiler reordering and out-of-order execution
in pipelined processors. You'd have these issues even on a system with no
cache at all. Mentioning cache when discussing visibility is usually an
indication that the person doesn't know what they're talking about.

>
> Obviously (a) requires a mapping from the specific hardware semantics to
> "POSIX memory visibility semantics". A similar example that was given
> recently here would be a mapping to the forthcoming Java Memory Model.
>
> (b) speaks directly to what a memory barrier is, so perhaps this is closer
> to the definition I should be using.
>
> > They are defined for various hardware
> > platforms in the context of the respective platform memory models.
>
> O.K. so a memory model doesn't define "memory visibility" per-se at all? If
> so, how is a platform memory model defined? What kind of semantic entities
> are used?

Depends on the vendor. IBM's architecture manuals were usually pretty good.
Intel is getting better as of Itanium; the IA32 stuff is so-so. But they're
not defining POSIX synchronization constructs, just their memory models which
are all different. Using a particular memory model makes about as much
sense as using Intel machine instructions to define C language semantics.

>
> > And they are used in some implementations of POSIX synchronization functions
> > to effect the required memory visibility. The implementations are most likely
> > overly restrictive given that hw memory barriers don't know about locks per
> > se.
>
> Overly restrictive, meaning the barriers provide more memory synchronisation
> than required by POSIX?

No. Meaning that you can put in more synchronization than POSIX requires
and the implementation will still be correct. But you cannot imply
from such an implementation what the specification may have been.

>
> > This means that you should not use the fact that a particular implementation
> > uses a memory barrier to infer POSIX lock semantics as you will be making
> > invalid assumptions. A lock implementation is not required to use a memory
> > barrier. It just has to ensure that the unspecified memory visibility rules
> > are adhered to somehow.
>
> "unspecified memory visibility" -- are you implying that POSIX memory
> visibility rules are un(der)specified?

No. Just not specified or at least incomplete with the blanks being filled
in differently in some cases.

>
> > I suppose you could define generic portable memory barrier semantics and refer
> > to those when you talk about things not defined by POSIX, such as double checked
> > locking so people know what you mean.
>
> Well, I was assuming that people already have their own private "portable
> memory barrier semantics" terminology.. Sender-X talks about Acquire/Release
> and Source/Sink, Alexander talks about "hoisting".. I'm just trying to sort
> things into some kind of understandable conceptual framework. I believe that
> the Linux kernel guys had to do something like this for their memory barrier
> functions.. I'd better go back and do some more research about that.

But the Linux memory barriers are not formally defined either. The kernel developers
mostly know what they're doing. But what it means is that on architectures
with more exotic types of memory barriers, such as Sparc, the Linux memory barriers
have to be over-implemented so as to be on the safe side.


>
> > I've done a formal definition or at least know how to do it now. But it's not
> > of POSIX since only the POSIX committee can do that. It's basically my
> > reverse engineering from conventional POSIX usage adjusted for consistency.
> > And it's on the conservative side. I may restrict something that POSIX
> > would actually allow if we knew what the rules were, but I'd rather be
> > safe than sorry.
> >
> > But don't listen to me. There are plenty who will tell you they know what
> > the semantics are. Apparently they got this from divine revelation from the
> > thread gods and thus their knowledge is absolute. Trust them.
>
> Thankfully I'm only trying to define what a memory barrier is, not the
> explicit visibility semantics of POSIX. I'd like to provide some background
> for people who need to look at the memory model of a specific platform,
> (such as the O.P.), and of course it's a great vehicle for improving my own
> understanding.

The OP was assuming that memory barriers had something to do with POSIX
semantics. There's no direct correlation.

Joe Seigh

Joe Seigh

Jun 7, 2004, 9:11:27 AM

> But the Linux memory barriers are not formally defined either. The kernel developers
> mostly know what they're doing. But what it means is that on architectures
> with more exotic types of memory barriers, such as Sparc, the Linux memory barriers
> have to be over-implemented so as to be on the safe side.

And I should mention that Linux has memory barriers that don't correspond to
hardware memory barriers AFAIK. That being the dependent load barrier used
for RCU. Most architectures don't require a memory barrier except for
Alpha, which is one of the few architectures without a strongly coherent cache;
it's mostly coherent. The membar in that case actually does some synchronization
with the cache. Dependent load barriers aren't something hardware architects
are aware of, so there is a lot of research into deciding if dependent load
barriers need to be implemented with a read memory barrier. Linux can get
away with this because the kernel has to be ported any time a new processor
with new features appears, even if the architecture does not change significantly.
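
For reference, the reader side of that RCU pattern looks roughly like this
in Linux kernel style; smp_read_barrier_depends() is the actual Linux macro,
while the surrounding struct and variable names are made up for the sketch:

struct item {
    int data;
};

struct item *shared_item;   /* published by a writer after a store barrier */

int read_item(void)
{
    struct item *p = shared_item;   /* load #1 */
    smp_read_barrier_depends();     /* expands to a barrier only on Alpha */
    return p->data;                 /* load #2: its address depends on #1 */
}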

Joe Seigh

Ross Bencina

Jun 10, 2004, 11:57:27 AM
[see below for revised Memory Barrier definition which I intend to post to
wikipedia]

I wasn't sure where to put this post, but I've revised the definition
considerably to include everyone's comments including those of the parent
(thanks Joe).

I've avoided confusing "memory visibility semantics" and "memory barriers
and hardware memory models" and tried to make it clear that these are
different levels of abstraction. This also allows me to discuss the hazards
of depending on a particular mapping between them.

I've made it clear that out-of-order execution is the main reason memory
barriers are needed. I haven't mentioned cache coherency at all, in spite of
Joe's mention of the Alpha processor.

There is now a small mention of C's "volatile" keyword.

I would have liked to include a discussion of "portable" memory barrier
instructions such as the Linux kernel functions.. perhaps in a later
revision this can be covered.

---

Memory Barrier
a.k.a. membar or memory fence

Modern CPUs employ mechanisms which can result in operations being executed
out-of-order, including memory loads and stores. A Memory Barrier is a
general term used to refer to instructions which cause the CPU to enforce an
ordering constraint on memory operations issued before and after the barrier
instruction. The exact nature of the ordering constraint is hardware
dependent, and is defined by the architecture's memory model. Some
architectures provide multiple barriers for enforcing different ordering
constraints.

Memory barriers are typically used when implementing low-level code which
operates on memory shared by multiple devices. Such code includes
synchronisation primitives and lock-free data structures on multiprocessor
systems, and drivers which communicate with hardware.


An Illustrative Example
-----------------------

When a program runs on a single CPU, the hardware performs the necessary
book-keeping to ensure that programs execute as if all memory operations
were performed in program order, hence memory barriers are not necessary.
However, when the memory is shared with multiple devices, such as other CPUs
in a multiprocessor system, or memory-mapped peripherals, out-of-order
access may affect program behavior. For example, a second CPU may see memory
changes made by the first CPU in a sequence which differs from program
order.

The following two-processor program gives a concrete example of how such
out-of-order execution can affect program behavior:

>>>

Initially, memory locations x and f both hold the value 0. The program
running on processor #1 loops until the value of f is non-zero, then it
prints the value of x. The program running on processor #2 stores the value
42 into x and then stores the value 1 into f. Pseudo code for the two
program fragments is shown below. The steps of the program correspond to
individual processor instructions.

Processor #1:
loop:
  load the value in location f, if it is 0 goto loop
  print the value in location x

Processor #2:
  store the value 42 into location x
  store the value 1 into location f
<<<

You might expect the print statement to always print the number "42",
however if processor #2's store operations are executed out-of-order it is
possible that f would be updated _before_ x, and the print statement might
print "0". For some programs this situation is not acceptable. A memory
barrier can be inserted before processor #2's assignment to f to ensure that
the new value of x is visible to other processors at or prior to the change
in the value of f.


Low Level Architecture-Specific Primitives
------------------------------------------

Memory barriers are low-level primitives which are part of the definition of
an architecture's memory model. Like instruction sets, memory models vary
considerably between architectures, so it is not appropriate to generalise
about memory barrier behavior. The received wisdom is that to use memory
barriers correctly you should study the architecture manuals for the
hardware which you are programming. That said, the following paragraph
offers a glimpse of some memory barriers which exist in the wild.

Some architectures provide only a single memory barrier instruction,
sometimes called "full fence". A full fence ensures that all load and store
operations prior to the fence will have been committed prior to any loads and
stores issued following the fence. Other architectures provide separate
"acquire" and "release" memory barriers which address the visibility of
read-after-write operations from the point of view of a reader (sink) or
writer (source) respectively. Some architectures provide separate memory
barriers to control ordering between different combinations of system memory
and i/o memory. When more than one memory barrier instruction is available
it is important to consider that the cost of different instructions may vary
considerably.


"Threaded" Programming and Memory Visibility
--------------------------------------------

Threaded programs usually use synchronisation primitives provided by a
high-level programming environment such as Java, or a C API such as POSIX
pthreads or Win32. Primitives such as mutexes and semaphores are provided to
synchronise access to resources from parallel threads of execution. These
primitives are usually implemented with the memory barriers required to
provide the expected memory visibility semantics. When using such
environments explicit use of memory barriers is not generally necessary.

Each API or programming environment has its own high-level memory model
which defines its memory visibility semantics. Although you don't usually
need to use memory barriers in such high-level environments, it's important
to understand their memory visibility semantics. Such understanding is not
necessarily easy to achieve because memory visibility semantics are not
always consistently specified or documented.

Just as programming language semantics are defined at a different level of
abstraction to machine language opcodes, a programming environment's memory
model is defined at a different level of abstraction to that of a hardware
memory model. It's important to understand this distinction and realise that
there is not always a simple mapping between low-level hardware memory
barrier semantics and the high-level memory visibility semantics of a
particular programming environment. As a result, a particular platform's
implementation of (say) pthreads may employ stronger barriers than required
by the specification. Programs which take advantage of memory visibility
as-implemented rather than as-specified may not be portable.
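
For example, a minimal sketch of the portable style using POSIX mutexes,
relying only on the specified guarantee that memory written before an unlock
is visible after the next lock of the same mutex:

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int x = 0;
static int f = 0;

void publish(void)
{
    pthread_mutex_lock(&m);
    x = 42;
    f = 1;
    pthread_mutex_unlock(&m);
}

int try_read(int *out)
{
    int ready;
    pthread_mutex_lock(&m);
    if ((ready = f) != 0)
        *out = x;   /* sees 42 whenever f was observed as 1 */
    pthread_mutex_unlock(&m);
    return ready;
}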


Out-of-order Execution vs. Compiler Reordering Optimisations
------------------------------------------------------------

Memory barrier instructions only address reordering effects at the hardware
level. Compilers may also reorder instructions as part of the program
optimization process. Although the effects on parallel program behavior can
be similar in both cases, in general it is necessary to take separate
measures to inhibit compiler reordering optimisations for data that may be
shared by multiple threads of execution. Note that such measures are usually
only necessary for data which is not protected by synchronisation primitives
such as those discussed in the previous section.

In C, the "volatile" keyword is provided to inhibit optimisations which
remove or reorder memory operations on a variable marked as volatile. This
will provide a kind of barrier for interruptions which occur on a single
CPU, such as signal handlers or concurrent threads on a uniprocessor system.
However, the use of "volatile" is insufficient to guarantee correct ordering
for multiprocessor systems because it only impacts reorderings performed by
the compiler, not those which may be performed by the CPU during execution.

Some languages and compilers may provide sufficient facilities to implement
functions which address both the compiler reordering and machine reordering
issues, however it is usually advisable to be very careful about this, for
example by carefully inspecting compiler generated code. Some developers
advocate coding in assembly language to avoid compiler reordering issues.

---

Comments anyone?

Ross.

David Schwartz

Jun 10, 2004, 12:42:01 PM

"Ross Bencina" <ro...@audiomulch.com> wrote in message
news:40c88...@news.iprimus.com.au...

> In C, the "volatile" keyword is provided to inhibit optimisations which
> remove or reorder memory operations on a variable marked as volatile. This
> will provide a kind of barrier for interruptions which occur on a single
> CPU, such as signal handlers or concurrent threads on a uniprocessor
> system.
> However, the use of "volatile" is insufficient to guarantee correct
> ordering
> for multiprocessor systems because it only impacts reorderings performed
> by
> the compiler, not those which may be performed by the CPU during
> execution.

Or, for that matter, by anything else that has the capability to
re-order memory accesses. At the C language level, you have no idea what is
between the CPU and memory. There is no guarantee in the C standard that
only the CPU and the compiler can re-order memory accesses.

> Some languages and compilers may provide sufficient facilities to implement
> functions which address both the compiler reordering and machine reordering
> issues, however it is usually advisable to be very careful about this, for
> example by carefully inspecting compiler generated code. Some developers
> advocate coding in assembly language to avoid compiler reordering issues.

Yeah. You have to know every possible thing that could cause you a
problem on the particular platform you are using and ensure it won't bite
you.

DS


Min-Koo Seo

Jun 12, 2004, 11:18:48 AM
<snip>

> In C, the "volatile" keyword is provided to inhibit optimisations which
> remove or reorder memory operations on a variable marked as volatile. This
> will provide a kind of barrier for interruptions which occur on a single
> CPU, such as signal handlers or concurrent threads on a uniprocessor system.
> However, the use of "volatile" is insufficient to guarantee correct ordering
> for multiprocessor systems because it only impacts reorderings performed by
> the compiler, not those which may be performed by the CPU during execution.
>
> Some languages and compilers may provide sufficient facilities to implement
> functions which address both the compiler reordering and machine reordering
> issues, however it is usually advisable to be very careful about this, for
> example by carefully inspecting compiler generated code. Some developers
> advocate coding in assembly language to avoid compiler reordering issues.
</snip>

How about Java and .NET? They also have a volatile keyword and AFAIK, the
volatile keyword has the same effect as synchronized/lock. (Though JDK
1.4 and lower do not properly implement volatile, let's
concentrate on the intended meaning of volatile.)

For example,

int a;
Object lock;
...

void someMethod() {
    synchronized (lock) {
        a++;
    }
}

has the same effect as

volatile int a;

void someMethod() {
    a++;
}

In other words, volatile in Java and .NET is also a construct of
*locking* as well as preventing reordering. Am I missing something
here?

Regards,
Minkoo Seo

Alexander Terekhov

Jun 12, 2004, 12:02:55 PM

Min-Koo Seo wrote:
[...]

> void someMethod() {
>     synchronized (lock) {
>         a++;
>     }
> }
>
> has the same effect as
>
> volatile int a;
>
> void someMethod() {
>     a++;
> }
>
> In other words, volatile in Java [... snip ...] is also a construct
> of *locking* as well as preventing reordering.

No. In {revised} Java, volatile reads and writes are atomic (and
they also seem to prevent reordering in a somewhat "stronger" way
than locks -- StoreLoad barrier), but there's no guarantee that
volatile read-modify-write is atomic. Volatiles are braindead.

regards,
alexander.

SenderX

Jun 12, 2004, 3:50:52 PM
> No. In {revised} Java, volatile reads and writes are atomic (and
> they also seem to prevent reordering in a somewhat "stronger" way
> than locks -- StoreLoad barrier), but there's no guarantee that
> volatile read-modify-write is atomic. Volatiles are braindead.

What about the Itanium ABI? Doesn't volatile mean st.rel and ld.acq's?


Alexander Terekhov

Jun 12, 2004, 4:13:30 PM

On Itanium, their intent is to make < different volatile variables >

volatile-write
volatile-read

mean

st.rel
mf
ld.acq

(to have StoreLoad barrier in between). IIUC, of course.

regards,
alexander.

SenderX

Jun 12, 2004, 5:00:49 PM
> > but there's no guarantee that
> > volatile read-modify-write is atomic.

That's what Java's atomic ops are for...

;)


SenderX

Jun 12, 2004, 6:43:06 PM
> On Itanium, their intent is to make < different volatile variables >
>
> volatile-write
> volatile-read
>
> mean
>
> st.rel
> mf
> ld.acq
>
> (to have StoreLoad barrier in between). IIUC, of course.

I would assume all compilers for IA64 would have to conform to the Itanium
ABI. The volatile keyword can "possibly" be transformed into some meaningful
C/C++ standard, wrt threads and memory visibility. I think acquire and
release semantics are well known enough ( lock(); unlock(); ), and could be
standardized...


SenderX

Jun 12, 2004, 6:48:19 PM
> > > Volatiles are braindead.

The C/C++ std should rip its dead brain out, and replace it with a new one
that can actually comprehend threads and memory visibility...

;)


Min-Koo Seo

Jun 13, 2004, 8:50:23 AM
Alexander Terekhov <tere...@web.de> wrote in message news:<40CB292F...@web.de>...

I'm sorry that I am a novice in threading.

Please elaborate on the differences between reads/writes vs
read-modify-write. According to
http://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html#volatile,
"Under the new memory model, writing to a volatile field has the same
memory effect as a monitor release, and reading from a volatile field
has the same memory effect as a monitor acquire."

What do you mean by read-modify-write not being atomic? So, are you
suggesting *DO NOT USE VOLATILE*? In what circumstances is
volatile useful?

Regards,
Minkoo Seo

Joe Seigh

Jun 13, 2004, 9:31:11 AM

Min-Koo Seo wrote:
>
> Alexander Terekhov <tere...@web.de> wrote in message news:<40CB292F...@web.de>...

> > No. In {revised} Java, volatile reads and writes are atomic (and
> > they also seem to prevent reordering in a somewhat "stronger" way
> > than locks -- StoreLoad barrier), but there's no guarantee that
> > volatile read-modify-write is atomic. Volatiles are braindead.
> >
> > regards,
> > alexander.
>
> I'm sorry that I am a novice in threading.
>
> Please elaborate on the differences between reads/writes vs
> read-modify-write. According to
> http://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html#volatile,
> "Under the new memory model, writing to a volatile field has the same
> memory effect as a monitor release, and reading from a volatile field
> has the same memory effect as a monitor acquire."
>
> What do you mean by read-modify-write not being atomic? So, are you
> suggesting *DO NOT USE VOLATILE*? In what circumstances is
> volatile useful?
>

It means that if "a" is volatile then

a++

will be roughly equivalent to

synchronized (a) {
temp = a; // read
}

temp++; // modify

synchronized (a) {
a = temp; // write
}

where temp is a thread local variable.

Joe Seigh

Alexander Terekhov

Jun 14, 2004, 3:43:05 AM

Min-Koo Seo wrote:
[...]

> http://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html#volatile,
> "Under the new memory model, writing to a volatile field has the same
> memory effect as a monitor release, and reading from a volatile field
> has the same memory effect as a monitor acquire."

Not quite. The exchange below took place in April this year.

---------------------- Forwarded by Alexander Terekhov/Germany/IBM on 06/14/2004 09:42 AM ---------------------------

To: Jeremy Manson <snip>
cc: dl@<snip>
From: Alexander Terekhov/Germany/IBM@IBMDE
Subject: Re: JavaMemoryModel: JavaOne 2004 Presentation


Slides say:

<quote>

A volatile write is a release

- that is acquired by a later read of the same variable

All accesses before the volatile write

- are ordered before and visible to all accesses after
the volatile read

</quote>

Now,

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/main/java/util/concurrent/locks/ReentrantLock.java?rev=1.43&content-type=text/vnd.viewcvs-markup
(note: old version)

<quote>

* Unlock:
* owner = null; // volatile assignment
* h = first node on queue;
* if (h != null) unpark(h's successor's thread);
*
* * The fast path uses one atomic CAS operation, plus one
* StoreLoad barrier (i.e., volatile-write-barrier) per
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
</quote>

AFAIK,

http://www.ai.mit.edu/projects/aries/papers/consistency/ISCA90_ps.ps

"classic release" doesn't incur the overhead of "StoreLoad"
constraint/barrier.

<quote source=ISCA90_ps.ps:5>

while the release is being delayed for writes [prior accesses]
to complete, the processor is free to move on to the next
record to acquire the lock and start the reads [subsequent
accesses]. Thus, there is overlap between the writes [accesses]
of one critical section and reads [accesses] of the next
section.

</quote>

If volatile write is a "release", then I guess the following
reordering must be allowed.

Original:

* owner = null; // volatile assignment
* h = first node on queue;

Transformation/reordering:

* h = first node on queue;
* owner = null; // volatile assignment

But it isn't OK here, AFAICS. So, I'm a bit confused.

regards,
alexander.
Sent by: owner-java...@cs.umd.edu
To: javamem...@cs.umd.edu
cc:
Subject: JavaMemoryModel: JavaOne 2004 Presentation


Hi folk,

Bill and I are making a presentation on the JMM at this year's JavaOne,
and we would like to get feedback on the slides and presentation. We
want to know if there is anything that seems glaringly wrong, or
anything that we absolutely can't forget to say to the developers, and
so on.

The slides are available at:

http://www.cs.umd.edu/users/pugh/java/memoryModel/TS-2331.pdf

Thanks!

Jeremy
-------------------------------
JavaMemoryModel mailing list - http://www.cs.umd.edu/~pugh/java/memoryModel


---------------------- Forwarded by Alexander Terekhov/Germany/IBM on 06/14/2004 09:42 AM ---------------------------

To: Alexander Terekhov/Germany/IBM@IBMDE
cc: Jeremy Manson <snip>, dl@<snip>
Subject: Re: JavaMemoryModel: JavaOne 2004 Presentation


> If volatile write is a "release", then I guess the following
> reordering must be allowed.
>
> Original:
>
> * owner = null; // volatile assignment
> * h = first node on queue;
>
> Transformation/reordering:
>
> * h = first node on queue;
> * owner = null; // volatile assignment
>
> But it isn't OK here, AFAICS. So, I'm a bit confused.

I believe that the answer to this is that the assignment to h is a read of a
volatile (the head node). All volatile accesses have to occur in a total
order that reflects program order, so these won't be reordered.

I didn't write the class, of course, so I could be wrong about the way it is
implemented.

Might be worth mentioning that total order in the talk, though.

Jeremy

---------------------- Forwarded by Alexander Terekhov/Germany/IBM on 06/14/2004 09:42 AM ---------------------------

To: Jeremy Manson <snip>
cc: Alexander Terekhov/Germany/IBM@IBMDE, dl@<snip>
Subject: Re: JavaMemoryModel: JavaOne 2004 Presentation


> I believe that the answer to this is that the assignment to h is a read of a
> volatile (the head node).

Exactly so.

A fun exercise is to try to make a lock where you let this read float,
which might mean you miss doing a wakeup, which means that you need to
compensate elsewhere. People have tried to make this work, but it
never works well enough to actually use in any real lock implementation.

-Doug


---------------------- Forwarded by Alexander Terekhov/Germany/IBM on 06/14/2004 09:42 AM ---------------------------

To: Doug Lea <snip>
cc: Jeremy Manson <snip>, dl@<snip>
From: Alexander Terekhov/Germany/IBM@IBMDE
Subject: Re: JavaMemoryModel: JavaOne 2004 Presentation


Doug Lea wrote:
[...]
> > I believe that the answer to this is that the assignment to h is a
> > read of a volatile (the head node).
>
> Exactly so.

Ok. But on Itanic, for example, "store.rel load.acq" need
not be done in the program order, right? IIUC, "classic"
definitions of acquire and release do allow "processor
consistent" reordering with respect to acquire and release
accesses.

> A fun exercise is to try to make a lock where you let this read
> float, which might mean you miss doing a wakeup, which means that
> you need to compensate elsewhere. People have tried to make this
> work, but it never works well enough to actually use in any real
> lock implementation.

Take a look at:

http://groups.google.com/groups?threadm=3FD4687C.CD66C516%40web.de
http://groups.google.com/groups?selm=99d37172.0312091705.35ef738f%40posting.google.com
http://groups.google.com/groups?selm=99d37172.0312111525.722dcc2c%40posting.google.com

regards,
alexander.

---------------------- Forwarded by Alexander Terekhov/Germany/IBM on 06/14/2004 09:42 AM ---------------------------

To: Alexander Terekhov/Germany/IBM@IBMDE
cc: Doug Lea <snip>, Jeremy Manson <snip>
Subject: Re: JavaMemoryModel: JavaOne 2004 Presentation


> Ok. But on Itanic, for example, "store.rel load.acq" need
> not be done in the program order, right? IIUC, "classic"
> definitions of acquire and release do allow "processor
> consistent" reordering with respect to acquire and release
> accesses.

Right. On Itanic, you'd generally need a full "mf" here,
although there are some special cases where you don't.

>
> > A fun exercise is to try to make a lock where you let this read
> > float, which might mean you miss doing a wakeup, which means that
> > you need to compensate elsewhere. People have tried to make this
> > work, but it never works well enough to actually use in any real
> > lock implementation.
>
> Take a look at:
> ...

Using a waiter-bit is a little different mostly in that you need to
CAS rather than write to unlock. On Itanic this probably wins, since,
like on some other intel chips, barriers can take longer than
CAS. Chips aren't too rational or predictable these days...

What I meant was to consider instead not using any kind of barrier or
atomic. Somehow a waiting thread must know that wakeup could be
missed. One possibility is just for it to spin. Doing better than this
is hard.

-Doug

---------------------- Forwarded by Alexander Terekhov/Germany/IBM on 06/14/2004 09:42 AM ---------------------------

To: Doug Lea <snip>
cc: Doug Lea <snip>, Jeremy Manson <snip>
From: Alexander Terekhov/Germany/IBM@IBMDE
Subject: Re: JavaMemoryModel: JavaOne 2004 Presentation


Doug Lea wrote:
[...]
> Using a waiter-bit is a little different mostly in that you need to
> CAS rather than write to unlock. On Itanic this probably wins, since,
> like on some other intel chips, barriers can take longer than
> CAS.

Or unconditional exchange/swap (works quite well on i386 ;-) ).

class swap_based_mutex_for_windows { // noncopyable

    atomic<int> m_lock_status;       // 0: free, 1/-1: locked/contention
    auto_reset_event m_retry_event;

public:

    // ctor/dtor [w/o lazy event init]

    void lock() throw() {
        if (m_lock_status.swap(1, msync::acq))
            while (m_lock_status.swap(-1, msync::acq))
                m_retry_event.wait();
    }

    bool trylock() throw() {
        return !m_lock_status.swap(1, msync::acq) ?
            true : !m_lock_status.swap(-1, msync::acq);
    }

    bool timedlock(absolute_timeout const & timeout) throw() {
        if (m_lock_status.swap(1, msync::acq)) {
            while (m_lock_status.swap(-1, msync::acq))
                if (!m_retry_event.timedwait(timeout))
                    return false;
        }
        return true;
    }

    void unlock() throw() {
        if (m_lock_status.swap(0, msync::rel) < 0)
            m_retry_event.set();
    }

};

> Chips aren't too rational or predictable these days...

Yeah. Itanic has xchg.acq but no xchg.rel. How rational. ;-)

Well, but I like the idea of allowing the processor to overlap critical
regions. Sounds reasonable to me.

> What I meant was to consider instead not using any kind of
> barrier or atomic. Somehow a waiting thread must know that
> wakeup could be missed. One possibility is just for it to
> spin.

That's what they (SUN) do. Adaptive spinning. They call them
"adaptive mutexes" but I think that "adaptive spinlocks" is
more proper. They are highly optimized for the uncontended case
and little contention between threads running on different
processors. No store-load barrier and no atomic/interlocked
read-modify-write in unlock(). They have a sort of "critical
part" in unlock() that they re-run if it was preempted.

> Doing better than this is hard.

Agreed.

regards,
alexander.

Joe Seigh

Jun 14, 2004, 7:10:24 AM

Alexander Terekhov wrote:
>
> Min-Koo Seo wrote:
> [...]
> > http://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html#volatile,
> > "Under the new memory model, writing to a volatile field has the same
> > memory effect as a monitor release, and reading from a volatile field
> > has the same memory effect as a monitor acquire."
>
> Not quite. The exchange below took place in April this year.

I don't know. You could get a stack overflow trying to recurse all the back
links in your posts. Especially if you try to use conclusion as the exit
criterion. It's kind of hard to follow up on these kinds of posts since it's
hard to determine what the point was or who made it even.

Joe Seigh

Joe Seigh

Jun 14, 2004, 8:34:00 AM

Alexander Terekhov recalls that he wrote somewhere else once:

We never did get a complete and coherent explanation of what Sun was up to there.
There was nothing in release semantics that would imply that a store/load like
membar was required anyway. The Sun code could be construed as an example of that
except we don't really know what it does. I have "wake up" logic that doesn't require
store/load either. I've never posted it but it doesn't matter. You don't have to
prove it's not needed because "release" doesn't imply "wake up" even if you somehow
thought that "wake up" required store/load. Locks do "wake up" but it's in addition
to "release" because locks can block. atomic<T> and Java volatile do not block.

Joe Seigh

Alexander Terekhov

Jun 14, 2004, 9:34:32 AM

Joe Seigh wrote:
[...]

> There was nothing in release semantics that would imply that a store/load like
> membar was required anyway.

Store/load is not required as part of release semantics.



> The Sun code could be construed as an example of that
> except we don't really know what it does. I have "wake up" logic that doesn't require
> store/load either. I've never posted it but it doesn't matter. You don't have to
> prove it's not needed because "release" doesn't imply "wake up" even if you somehow
> thought that "wake up" required store/load. Locks do "wake up" but it's in addition
> to "release" because locks can block. atomic<T> and Java volatile do not block.

JSR-166 reference implementation uses JSR-133 revised volatiles to
build all sorts of blocking sync objects. It needs store/load in
order to not miss waiter(s) to wake. Store/load is not needed with
"waiters bit" CAS/SWAP with release (or something ala Sun's hack
with just atomic STORE.release... your unpublished work aside for
a moment ;-) ).

regards,
alexander.

Min-Koo Seo

Jun 14, 2004, 11:09:51 AM
> No. In {revised} Java, volatile reads and writes are atomic (and
> they also seem to prevent reordering in a somewhat "stronger" way
> than locks -- StoreLoad barrier), but there's no guarantee that
> volatile read-modify-write is atomic. Volatiles are braindead.

I think I have posted the below before. But it does not appear on this
newsgroup. So I am posting the same question again:

What is the difference between read/write and read-modify-write?

I just guess that read/write is:

volatile int a;
a = 3;
System.out.println(a);

And

read-modify-write is:

volatile int a = 3;
a += 2;

And

read-modify-write is not thread-safe by the definition of volatile in
(revised) Java.

Am I right?

TIA.
Minkoo Seo

Joe Seigh

Jun 14, 2004, 11:38:01 AM

Note that spin waiting (polling) works with ordinary acquire and release. Not just
for spin locks but for ordinary lock-free collections as well. E.g.

while ((item = lockFreeQueue.pop()) != NULL) {}

works fine. The store/load is a consequence of a particular wait/notify implementation.

Joe Seigh

Alexander Terekhov

Jun 14, 2004, 11:50:20 AM

Min-Koo Seo wrote:
[...]
> volatile int a = 3;
> a += 2;

is equivalent to

// non-volatile temp

temp = a;    // read
temp += 2;   // modify
a = temp;    // write

It isn't atomic read-modify-write.

regards,
alexander.

Jonathan Adams

Jun 14, 2004, 4:18:45 PM
Joe Seigh <jsei...@xemaps.com> wrote in message news:<40CD9CDD...@xemaps.com>...

Well, I'm not sure I saw requests for clarifications at the time. The
standard mutex memory barriers we use are:

mutex_enter:
grab lock
membar #StoreLoad|#StoreStore

(That is, we make sure that the store which grabbed the lock is
visible before any following loads or stores happen)

mutex_exit:
load lock value
membar #LoadStore|#StoreStore
drop lock

(that is, we make sure all previous loads and stores have completed
before the store that drops the lock is visible)

These are simply there to make sure the locks protect the data they're
supposed to be protecting.

- jonathan

(aside: the actual code for mutex_enter() has a much weaker membar,
since we take advantage of some properties of branches and CAS
instructions on Sparc processors)

Alexander Terekhov

unread,
Jun 14, 2004, 4:48:04 PM6/14/04
to

Jonathan Adams wrote:
[...]

> (aside: the actual code for mutex_enter() has a much weaker membar,
> since we take advantage of some properties of branches and CAS
> instructions on Sparc processors)

membar #LoadLoad|#LoadStore?

Because

<quote>

Atomic operations (LDSTUB(A), SWAP(A), CASA, and CASXA) are ordered
by MEMBAR as if they were both a load and a store, since they share
the semantics of both.

[...]

In specifying the effect of MEMBAR, instructions are considered to
be executed as if they were processed in a strictly sequential
fashion, with each instruction completed before the next has begun.

</quote>

Right?

regards,
alexander.

Joe Seigh

unread,
Jun 14, 2004, 4:51:09 PM6/14/04
to

Jonathan Adams wrote:
>
> Joe Seigh <jsei...@xemaps.com> wrote in message news:<40CD9CDD...@xemaps.com>...
> > Alexander Terekhov recalls that he wrote somewhere else once:
> > >
> > > Take a look at:
> > >
> > > http://groups.google.com/groups?threadm=3FD4687C.CD66C516%40web.de
> > > http://groups.google.com/groups?selm=99d37172.0312091705.35ef738f%40posting.google.com
> > > http://groups.google.com/groups?selm=99d37172.0312111525.722dcc2c%40posting.google.com
> > >
> >
> > We never did get a complete and coherent explanation of what Sun was up to there.
> > There was nothing in release semantics that would imply that a store/load like
> > membar was required anyway. The Sun code could be construed as an example of that
> > except we don't really know what it does. I have "wake up" logic that doesn't require
> > store/load either. I've never posted it but it doesn't matter. You don't have to
> > prove it's not needed because "release" doesn't imply "wake up" even if you somehow
> > thought that "wake up" required store/load. Locks do "wake up" but it's in addition
> > to "release" because locks can block. atomic<T> and Java volatile do not block.
>
> Well, I'm not sure I saw requests for clarifications at the time. The
> standard mutex memory barriers we use are:

(snip)

It was more like "20 questions". I don't mind assuming that Solaris locks
do the right thing. I do mind playing games where the explanation is only
revealed a bit at a time with no guarantee that we would get the whole story
even if we had the patience to play the game.

Joe Seigh

Jonathan Adams

unread,
Jun 14, 2004, 10:20:10 PM6/14/04
to
Alexander Terekhov <tere...@web.de> wrote in message news:<40CE0F04...@web.de>...

> Jonathan Adams wrote:
> [...]
> > (aside: the actual code for mutex_enter() has a much weaker membar,
> > since we take advantage of some properties of branches and CAS
> > instructions on Sparc processors)
>
> membar #LoadLoad|#LoadStore?
>
> Right?

Exactly -- and the #LoadStore can be dropped, since the branch (to
test if you got the lock) makes any later stores dependent on the
loaded value from the CAS. (SparcV9 D.3.3 (1) establishes dependent
order, D.4.4 (1) guarantees memory order).

(the change in membars gave about a 7% performance improvement on
UltraSparc IIs...)

- jonathan

Joe Seigh

unread,
Jun 15, 2004, 6:18:43 AM6/15/04
to

Alexander Terekhov wrote:
> JSR-166 reference implementation uses JSR-133 revised volatiles to
> build all sorts of blocking sync objects. It needs store/load in
> order to not miss waiter(s) to wake. Store/load is not needed with
> "waiters bit" CAS/SWAP with release (or something ala Sun's hack
> with just atomic STORE.release... your unpublished work aside for
> a moment ;-) ).
>

Speaking of JSR-166, why did they do that lock object with condvars? You
always could do multiple condition variable implementations that worked
with regular Java objects from the very beginning.

Joe Seigh

Alexander Terekhov

unread,
Jun 15, 2004, 6:50:08 AM6/15/04
to

Jonathan Adams wrote:

[... Sparc's CAS.ACQ ...]

> > membar #LoadLoad|#LoadStore?
> >
> > Right?
>
> Exactly -- and the #LoadStore can be dropped, since the branch (to
> test if you got the lock) makes any later stores dependent on the
> loaded value from the CAS. (SparcV9 D.3.3 (1) establishes dependent
> order, D.4.4 (1) guarantees memory order).

Interesting.

class swap_based_mutex { // noncopyable

  atomic<int> m_lock_status;      // 0: free, 1/-1: locked/contention
  auto_reset_event m_retry_event; // bin.sema/gate

public:

  // ctor/dtor [w/o lazy event init]

  void lock() throw() {
    if (m_lock_status.swap(1, msync::ddacq))
      while (m_lock_status.swap(-1, msync::ddacq))
        m_retry_event.wait();
  }

  bool trylock() throw() {
    return !m_lock_status.swap(1, msync::ddacq) ?
        true : !m_lock_status.swap(-1, msync::ddacq);
  }

  bool timedlock(absolute_timeout const & timeout) throw() {
    if (m_lock_status.swap(1, msync::ddacq)) {
      while (m_lock_status.swap(-1, msync::ddacq))
        if (!m_retry_event.timedwait(timeout))
          return false;
    }
    return true;
  }

  void unlock() throw() {
    if (m_lock_status.swap(0, msync::rel) < 0)
      m_retry_event.set();
  }

};

"msync::ddacq" (acquire with data dependency) would inject only
subsequent #LoadLoad (after swap) on Sparc. "msync::rel" would
inject preceding #LoadStore|#StoreStore or #LoadLoad|#StoreLoad
but, unfortunately, would also imply subsequent #LoadStore due
to the branch (if).

All korrect?

regards,
alexander.

Joe Seigh

unread,
Jun 15, 2004, 7:35:24 AM6/15/04
to

Alexander Terekhov wrote:
>
> Jonathan Adams wrote:
>
> [... Sparc's CAS.ACQ ...]
>
> > > membar #LoadLoad|#LoadStore?
> > >
> > > Right?
> >
> > Exactly -- and the #LoadStore can be dropped, since the branch (to
> > test if you got the lock) makes any later stores dependent on the
> > loaded value from the CAS. (SparcV9 D.3.3 (1) establishes dependent
> > order, D.4.4 (1) guarantees memory order).
>
> Interesting.

...


>
> "msync::ddacq" (acquire with data dependency) would inject only
> subsequent #LoadLoad (after swap) on Sparc. "msync::rel" would
> inject preceding #LoadStore|#StoreStore or #LoadLoad|#StoreLoad
> but, unfortunately, would also imply subsequent #LoadStore due
> to the branch (if).
>
> All korrect?
>

You might want to be careful about abstracting that particular form
of membar. There's no guarantee that other architectures or implementations
with speculative execution will handle branch dependencies the same way.

Dependent memory barriers are defined at the ISA level. You'd have to
verify every single processor model before running on it.

Joe Seigh

Alexander Terekhov

unread,
Jun 15, 2004, 7:55:42 AM6/15/04
to

Joe Seigh wrote:
[...]

> Speaking of JSR-166, why did they do that lock object

<quote>

A Lock class can also provide behavior and semantics that is quite
different from that of the implicit monitor lock, such as guaranteed
ordering, non-reentrant usage, or deadlock detection. If an
implementation provides such specialized semantics then the
implementation must document those semantics.

Note that Lock instances are just normal objects and can themselves
be used as the target in a synchronized statement. Acquiring the
monitor lock of a Lock instance has no specified relationship with
invoking any of the lock() methods of that instance. It is
recommended that to avoid confusion you never use Lock instances in
this way, except within their own implementation.

</quote>

> with condvars?

<quote>

newCondition

Condition newCondition() Returns a new Condition instance that is
bound to this Lock instance.

Before waiting on the condition the lock must be held by the current
thread. A call to Condition.await() will atomically release the lock
before waiting and re-acquire the lock before the wait returns.

Implementation Considerations

The exact operation of the Condition instance depends on the Lock
implementation and must be documented by that implementation.

</quote>

In JSR-166 you can have condvars bound to ReentrantReadWriteLock
(to its WriteLock, but not to its ReadLock).

regards,
alexander.

Joe Seigh

unread,
Jun 15, 2004, 8:06:40 AM6/15/04
to
Yes, but they didn't need to do a JSR to do all that.

Joe Seigh

David Holmes

unread,
Jul 13, 2004, 1:19:49 AM7/13/04
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:40CECEAC...@xemaps.com...

> Speaking of JSR-166, why did they do that lock object with condvars? You
> always could do multiple condition variable implementations that worked
> with regular Java objects from the very beginning.

What are you referring to Joe? Each object you use in Java as a "condition"
is associated with its own monitor-lock, not the common-lock you want when
having multiple conditions per-lock. This makes designs involving multiple
condition objects (multiple wait-sets in our normal usage) very awkward to
code. I know there was some attempted bytecode hackery to define a special
"wait" method that did a monitorexit on both its own monitor and the common
"lock" object, but those hacks only worked on a couple of early 1.0,
perhaps 1.1, VMs and were outlawed by the second edition of the Java Virtual
Machine Specification (for JDK 1.2) which required that monitorenter and
monitorexit bytecodes must be properly paired within a Java stack frame.

And if you have queries just ask us, or read the archives:

http://gee.cs.oswego.edu/dl/concurrency-interest/index.html (see mailing
list)

I only hang out on c.p.t occasionally and I don't think any of the other EG
members venture here at all.

Cheers,
David Holmes


Joe Seigh

unread,
Jul 13, 2004, 7:10:00 AM7/13/04
to

David Holmes wrote:
>
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:40CECEAC...@xemaps.com...
> > Speaking of JSR-166, why did they do that lock object with condvars? You
> > always could do multiple condition variable implementations that worked
> > with regular Java objects from the very beginning.
>
> What are you referring to Joe? Each object you use in Java as a "condition"
> is associated with its own monitor-lock, not the common-lock you want when
> having multiple conditions per-lock. This makes designs involving multiple
> condition objects (multiple wait-sets in our normal usage) very awkward to
> code. I know there was some attempted bytecode hackery to define a special
> "wait" method that did a monitorexit on both its own monitor and the common
> "lock" object, but those hacks only worked on a couple of early 1.0,
> perhaps 1.1, VMs and were outlawed by the second edition of the Java Virtual
> Machine Specification (for JDK 1.2) which required that monitorenter and
> monitorexit bytecodes must be properly paired within a Java stack frame.

Interesting. What was the problem that was fixed by that change? Wait, I
see it. "properly".


>
> And if you have queries just ask us, or read the archives:
>
> http://gee.cs.oswego.edu/dl/concurrency-interest/index.html (see mailing
> list)
>
> I only hang out on c.p.t occasionally and I don't think any of the other EG
> members venture here at all.

It was more idle curiosity. I don't mess with Java too much. Kind of hard
to experiment in a language expressly designed to prevent experimentation.
I'd have to write my own jvm and/or compiler or start a JSR just to try something
out to see if it's worth doing. I find it highly ironic that an "advanced"
language such as Java has to get its enhancements from outside Java.

Joe Seigh

Alexander Terekhov

unread,
Oct 21, 2004, 7:44:53 AM10/21/04
to

Alexander Terekhov wrote:

[... forwarded stuff ...]

> > Using a waiter-bit is a little different mostly in that you need to
> > CAS rather than write to unlock. On Itanic this probably wins, since,
> > like on some other intel chips, barriers can take longer than
> > CAS.
>
> Or unconditional exchange/swap (works quite well on i386 ;-) ).
>
> class swap_based_mutex_for_windows { // noncopyable
>
> atomic<int> m_lock_status; // 0: free, 1/-1: locked/contention
> auto_reset_event m_retry_event;

Here's an LR/SC emPOWERed thing. (For the upcoming XBOX ;-) )

template<typename T>
int attempt_update(int old, int next, T msync) {
  while (!m_lock_status.store_conditional(next, msync)) {
    T fresh = m_lock_status.load_reserved(msync::none);
    if (fresh != old)
      return fresh;
  }
  return old;
}

>
> public:
>
> // ctor/dtor [w/o lazy event init]

(try/timed operations omitted for brevity)

void lock() throw() {
  int old = m_lock_status.load_reserved(msync::none);
  if (old || (old = attempt_update(0, 1, msync::acq))) {
    do {
      while (old < 0 || (old = attempt_update(1, -1, msync::none))) {
        m_retry_event.wait();
        old = m_lock_status.load_reserved(msync::none);
        if (!old) break;
      }
    } while ((old = attempt_update(0, -1, msync::acq)));
  }
}

void unlock() throw() {
  int old = m_lock_status.load_reserved(msync::none);
  if (old < 0 || !m_lock_status.store_conditional(0, msync::rel)) {
    m_lock_status.store(0, msync::rel);
    m_retry_event.set();
  }
}
}

Oder (POSIX-safety with respect to destruction aside for a moment)?

regards,
alexander.

Alexander Terekhov

unread,
Oct 21, 2004, 7:50:59 AM10/21/04
to

Alexander Terekhov wrote:
[...]

> template<typename T>
> int attempt_update(int old, int next, T msync) {
> while (!m_lock_status.store_conditional(next, msync)) {
> T fresh = m_lock_status.load_reserved(msync::none);

Woops. Compile error. ;-)

int fresh = m_lock_status.load_reserved(msync::none);

regards,
alexander.

Alexander Terekhov

unread,
Oct 26, 2004, 8:33:22 AM10/26/04
to

< 2 x Forward Inline >

Sorta rebuttal invitation for DRB. ;-)

-------- Original Message --------
Message-ID: <417A7A11...@web.de>
Date: Sat, 23 Oct 2004 17:34:41 +0200
Newsgroups: comp.lang.c++.moderated
Subject: Re: Information source for acquire/release
References: ... <opsf9kv0...@localhost.localdomain>

Maxim Yegorushkin wrote:
[...]
> >> fair enough guarantees for POSIX threads.
> >
> > DRB doesn't quite grok release consistency. He, unfortunately,
> > is stuck in the Alpha world with its bidirectional fences.
> >
> > http://groups.google.com/groups?selm=3f7ac4fe%40usenet01.boi.hp.com
>
> []
>
> Sorry, Alexander, I do not follow you.

Read the entire referenced thread. Read also

http://groups.google.com/groups?threadm=40ae0044%40usenet01.boi.hp.com

I mean

----
POSIX doesn't define, nor depend on, anything like "acquire" or
"release" memory semantics: which happens to be a good thing because
they are not supported by most hardware. Each POSIX synchronization
operation is defined to synchronize memory. Period. POSIX
synchronization is defined as a full barrier, for each operation.
While implementation of pthread_mutex_lock on Itanium using "acq"
and pthread_mutex_unlock using "rel" makes logical sense, it does
not strictly conform to the memory synchronization requirements of
POSIX.
----

>
> I was referring to POSIX memory visibility guarantees. Here you are quoting
> David Butenhof at http://groups.yahoo.com/group/boost/message/15526
[...]
> 4. Whatever memory values a thread can see when it signals or broadcasts a
> condition variable (calling
> notify in Java) can also be seen by any thread that is awakened by that
> signal or broadcast. And, one
> more time, data written after the signal or broadcast may not necessarily
> be seen by the thread that
> wakes up, even if it occurs before it awakens.
> </quote>
>
> What's wrong with it?

Ah, that. Well, number 4. He doesn't quite grok it either. BTW,
I've filed a DR seeking removal of cond_signal and cond_broadcast
from the XBD 4.10. It was rejected, thanks to DRB.

Here's an illustration just to make it clear: <pseudo code>

cond_wait: mutex::release_guard guard(mutex); sleep(random());

cond_signal: nop

cond_broadcast: nop

is conforming, I say (apart from realtime scheduling, of course).

regards,
alexander.

-------- Original Message --------
Message-ID: <417E3263...@web.de>
Date: Tue, 26 Oct 2004 13:17:55 +0200
Newsgroups: comp.std.c
Subject: Re: volatile problem
References: ... <L9CdnW6w9Yd...@comcast.com>

"Douglas A. Gwyn" wrote:
>
> Read the actual spec yourself. It starts off by saying
> that applications need to do something to synchronize
> access to memory among the threads, and gives a list of
> functions that help do just that, ...

The actual spec doesn't say that functions "help". It says that listed
functions "synchronize thread execution and also synchronize memory
with respect to other threads". And just to make it clear, the spec
also says that "The following functions synchronize memory with
respect to other threads: < functions >" followed by "Unless
explicitly stated otherwise, if one of the above functions returns an
error, it is unspecified whether the invocation causes memory to be
synchronized." The last statement means that, for example,

// doesn't provide "POSIX-safety" with respect to destruction
class mutex_for_XBOX_NEXT { // noncopyable

  atomic<int> m_lock_status;      // 0: free, 1/-1: locked/contention
  auto_reset_event m_retry_event; // bin.sema/gate

  template<typename T>
  int attempt_update(int old, int next, T msync) {
    while (!m_lock_status.store_conditional(next, msync)) {
      int fresh = m_lock_status.load_reserved(msync::none);
      if (fresh != old)
        return fresh;
    }
    return old;
  }

public:

  // ctor/dtor [w/o lazy event init]

  bool trylock() throw() {
    return !(m_lock_status.load_reserved(msync::none) ||
             attempt_update(0, 1, msync::acq));
  }

memory synchronization (acquire semantics in this case) is caused
by conditional invocation of attempt_update() inside trylock(),
to begin with.

Here's the rest (apart from timedlock).

  void lock() throw() {
    int old = m_lock_status.load_reserved(msync::none);
    if (old || (old = attempt_update(0, 1, msync::acq))) {
      do {
        while (old < 0 ||
               (old = attempt_update(1, -1, msync::none))) {
          m_retry_event.wait();
          old = m_lock_status.load_reserved(msync::none);
          if (!old) break;
        }
      } while ((old = attempt_update(0, -1, msync::acq)));
    }
  }

  void unlock() throw() {
    if (m_lock_status.load_reserved(msync::none) < 0 ||
        attempt_update(1, 0, msync::rel) < 0) { // or just !SC
      m_lock_status.store(0, msync::rel);
      m_retry_event.set();
    }
  }

};

Now, contrary to what I said earlier, the release semantics of
"m_retry_event.set()" are important and must be respected by the
compiler to prevent

m_retry_event.set();
m_lock_status.store(0, msync::rel);

reordering.

http://groups.google.com/groups?selm=3EE0CA46.593F938B%40web.de

-----
The informal semantics of acquire/release memory synchronization
operations can be defined as: (1) before an acquire access or an
ordinary access is allowed to perform with respect to any other
processor, all preceding acquire accesses must be performed (in the
program order), and (2) before a release access is allowed to perform
with respect to any other processor, all preceding acquire accesses,
release accesses, and all ordinary accesses must be performed.

An act of acquiring mutex ownership can be viewed as performing an
acquire operation. An act of releasing mutex ownership can be viewed
as performing a release operation.

An acquire operation prevents all subsequent load and store accesses
from being performed before the acquire operation -- it can be viewed
as a store or load operation that is performed in conjunction with the
"hoist-load" and the "hoist-store" reordering constraints (barriers)
applied with respect to the subsequent load and store accesses.

A release operation prevents all preceding load and store accesses
from being performed after the release operation -- it can be viewed
as a store or load operation that is performed in conjunction with the
"sink-load" and the "sink-store" reordering constraints/barriers
applied with respect to the preceding load and store accesses.

An act of acquiring a read lock on a read-write lock can be viewed as
an operation that is performed in conjunction with the "hoist-load"
barrier only; without the "hoist-store" barrier that is needed for an
acquire operation on a mutex. An act of releasing a read lock on a
read-write lock can be viewed as an operation that is performed in
conjunction with the "sink-load" barrier only; without the "sink-store"
barrier that is needed for a release operation on a mutex.
-----
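In code, those hoist/sink constraints map onto the usual
publish-and-consume idiom (a sketch in today's C++ atomics, my mapping):

#include <atomic>
#include <cassert>

std::atomic<bool> published{false};
int payload = 0; // ordinary (non-atomic) data

void producer() {
    payload = 42; // ordinary store
    // Release: the "sink" constraints -- no preceding access may be
    // performed after this store.
    published.store(true, std::memory_order_release);
}

void consumer() {
    // Acquire: the "hoist" constraints -- no subsequent access may be
    // performed before this load.
    while (!published.load(std::memory_order_acquire))
        ; // spin
    assert(payload == 42); // guaranteed visible here
}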

regards,
alexander.

David Butenhof

unread,
Oct 28, 2004, 10:58:39 AM10/28/04
to
Alexander Terekhov wrote:
> < 2 x Forward Inline >
>
> Sorta rebuttal invitation for DRB. ;-)

> [...] (Lots clipped.)

>>4. Whatever memory values a thread can see when it signals or broadcasts a
>>condition variable (calling
>>notify in Java) can also be seen by any thread that is awakened by that
>>signal or broadcast. And, one
>>more time, data written after the signal or broadcast may not necessarily
>>be seen by the thread that
>>wakes up, even if it occurs before it awakens.
>></quote>
>>
>>What's wrong with it?
>
> Ah, that. Well, number 4. He doesn't quite grok it either.

Sometimes, Alexander, your attitude seems to border on deliberate malice.

You're actually arguing for something completely different from POSIX
under the guise of discussing changes within POSIX. You continue to
profess to see no difference, but I have to wonder whether anyone
(perhaps even including you) actually believes that?

You don't need to depend on POSIX. While I admire your wish to build
what you see as high-efficiency applications using strictly portable
interfaces, you're saying the wrong things in the wrong place. My
experience with current real-world threaded applications suggests
strongly that they simply will not benefit enough to justify the
upheaval in the interfaces and models that would be required to support
your changes.

Ten years ago when the POSIX threads model was still in flux, we weren't
able to get most of the POSIX people in the working group (much less the
balloting group) to understand what all this even meant. We were forced
to either keep it simple or allow the whole standardization effort to
get blown away because people were terrified of the complexity. (There
was enough of that anyway.) If we were at that same point in the process
of standardization, but with the general level of multiprocessing and
threading sophistication available now, there'd be room to consider at
least some of the model refinements you'd like to see. However, with
many widely used implementations of the POSIX thread model "in the
wild", we now need to be far more concerned about stability and
compatibility. Changing the low-level memory model, especially at the
level you desire, would be an extremely risky project.

That doesn't mean it can't be done. It does mean, however, that to even
consider doing it by "adding a little here" and "removing a little
there" as if it were a straightforward editorial change would be
tremendously irresponsible. More pragmatically, it simply ain't gonna
happen.

If you'd content yourself with arguing theory OR current practice,
perhaps we could find some common ground. We certainly have in the past.
If you'd like to propose and guide some forum of "futures" discussion,
be my guest. The C++ group might be a good place, as much of this would
really require language support to work well anyway. But as long as you
continue to act as if there's no difference between theory and practice,
I choose to stick with the pragmatic and within the bounds of the
current POSIX standard -- just because it's a lot easier to keep track
of. If you want to delude yourself that this is because I don't
understand the issues, you're welcome to your sheltered little refuge
from the real world; but I won't be joining you there.

> BTW,
> I've filed a DR seeking removal of cond_signal and cond_broadcast
> from the XBD 4.10. It was rejected thanks DRB.

Your "suggestion" was rejected because the working group was concerned
that the risk of breaking existing code depending on the synchronization
requirement outweighed any possible benefit to new code that might
actually be able to exploit a changed requirement. I did not initiate
that objection, though I also could not see any basis to disagree.

However, if you insist on thanking me, who am I to refuse? So, "you're
welcome".

--
Dave Butenhof, David.B...@hp.com
iCOD/PPU, thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062

Alexander Terekhov

unread,
Oct 28, 2004, 11:21:43 AM10/28/04
to

David Butenhof wrote:
[...]

> > I've filed a DR seeking removal of cond_signal and cond_broadcast
> > from the XBD 4.10. It was rejected thanks DRB.
>
> Your "suggestion" was rejected because the working group was concerned
> that the risk of breaking existing code depending on the synchronization
> requirement

Such code (I doubt that it exists but feel free to come up with an
example) is totally broken because it would violate the "synchronization
requirement" imposed on applications with respect to restricting
access to thread-shared memory.

cond_wait: mutex::release_guard guard(mutex); sleep(random());

cond_signal: nop

cond_broadcast: nop

Is clearly conforming (and nops don't synchronize anything).


> outweighed any possible benefit to new code that might
> actually be able to exploit a changed requirement.

http://tinyurl.com/68jav

The idea is to allow compilers to take advantage of "#pragma
isolated_call"-like semantics for pthread_cond_signal() and
pthread_cond_broadcast().

regards,
alexander.

Alexander Terekhov

unread,
Oct 28, 2004, 2:51:03 PM10/28/04
to

I got a bunch of replies in private (@web.de and "at work" -- in
reply to the renewed DR posting). They are all barking up the
wrong tree, so to say. This one is typical:

: ???? What you wrote seems circular. A no-op implementation
: can't be conforming, precisely because it does not "synchronize memory".
: Instead of a no-op you should have an instruction that synchronizes
: per-CPU memory with CPU-shared memory. The application requirement
: here is that if thread A updates a variable and does pthread_cond_signal
: then if thread B is waiting for a change in that variable using
: pthread_cond_wait thread B will see the new value when it wakes up

Change in a variable (predicate) is synchronized via operations
on mutex (including cv_wait() of course), not condvar signaling.

lock(mutex)
<update>
unlock(mutex)
signal(cv)

can be reordered to

lock(mutex)
signal(cv)
<update>
unlock(mutex)

and everything will be OK (ignoring number of trips to scheduler
on an implementation without "wait-morphing"). Hoisting signal(cv)
above lock(mutex) is not allowed due to "acquire" semantics of
lock() operation... and, BTW,

lock(mutex)
signal(cv)
<update>
unlock(mutex)

can't be reordered to (sinking signal() below unlock())

lock(mutex)
<update>
unlock(mutex)
signal(cv)

due to "release" semantics of unlock() operation -- as long as
signaling operation involves access to CV memory object;
"predictable" realtime scheduling and control dependencies aside
for a moment.

But it still can be reordered to

lock(mutex)
<update>
signal(cv)
unlock(mutex)

and conforming application won't notice the difference.

cv_signal() and cv_broadcast() are pseudo-nops because by the time
you call them, all waiting threads (if any) could have already
consumed spurious wakes and "now" are on the way to acquire the
mutex and {re-}check the predicate (thread cancellation aside for
a moment).
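Concretely, the canonical pattern the above argument assumes looks like
this (a sketch; the predicate is only ever touched under the mutex):

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int ready = 0; /* the predicate */

void signaller(void) {
    pthread_mutex_lock(&m);
    ready = 1;                /* <update>, under the mutex */
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&cv); /* placement relative to unlock is
                                 flexible, per the reorderings above */
}

void waiter(void) {
    pthread_mutex_lock(&m);
    while (!ready)            /* the loop absorbs spurious wakeups */
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
}

Every visibility guarantee for "ready" comes from the mutex (and
cv_wait's release/re-acquire of it); none comes from the signal.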

regards,
alexander.

P.S. < Forward Inline >

-------- Original Message --------
Message-ID: <4180E0BC...@web.de>


Newsgroups: comp.std.c
Subject: Re: volatile problem

References: ... <bP-dnalYbfg...@comcast.com>

"Douglas A. Gwyn" wrote:
>
> Alexander Terekhov wrote:
> > They also synchronize memory. pthread_create() has release
> > semantics and pthread_join() has acquire semantics.
>
> You're hallucinating. Neither the specs nor the Rationale
> for those functions makes *any* mention of this.

Both the specs and the Rationale say that those functions
"synchronize memory". Precise semantics of memory synchronization
operations are defined by a conforming MT memory model. Release
consistency is conforming and it distinguishes between "acquire"
and "release" msync.

regards,
alexander.

--
http://www.opengroup.org/austin/mailarchives/ag-review/msg01865.html

Jonathan Adams

unread,
Oct 28, 2004, 4:15:31 PM10/28/04
to
In article <41810E87...@web.de>,
Alexander Terekhov <tere...@web.de> wrote:

> David Butenhof wrote:
> [...]


> > outweighed any possible benefit to new code that might
> > actually be able to exploit a changed requirement.
>
> http://tinyurl.com/68jav
>
> The idea is to allow compilers to take advantage of "#pragma
> isolated_call"-like semantics for pthread_cond_signal() and
> pthread_cond_broadcast().

From that link:

Marking a function as isolated_call indicates to the optimizer that
external and static variables cannot be changed by the called function
and that pessimistic references to storage can be deleted from the
calling function where appropriate. Instructions can be reordered with
more freedom, resulting in fewer pipeline delays and faster execution
in the processor. Multiple calls to the same function with identical
parameters can be combined, calls can be deleted if their results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
are not needed, and the order of calls can be changed.
^^^^^^^^^^^^^^

Seems like that might break things.

Even if there was a "#pragma isolated_call_but_dont_delete_it",
broadcasts and signals almost always happen on one side or the other of
a pthread_mutex_exit(), so I doubt there's much benefit to be had here.

Cheers,
- jonathan

Alexander Terekhov

unread,
Oct 28, 2004, 4:59:56 PM10/28/04
to

Jonathan Adams wrote:
[...]

> parameters can be combined, calls can be deleted if their results
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> are not needed, ...
> ^^^^^^^^^^^^^^

C'mon, a CV is a globally exposed memory object. I have yet to see
some "real" implementation that doesn't change its internal
state. That state change will preclude call elimination...
unless you have some super-duper JIT global optimizer with no
way to tune its aggressiveness, so to say; unlikely, oder?

>
> Seems like that might break things.

Anyway, I said "like", not "exactly like". ;-)

>
> Even if there was a "#pragma isolated_call_but_dont_delete_it",
> broadcasts and signals almost always happen on one side or the other of
> a pthread_mutex_exit(), so I doubt there's much benefit to be had here.

You mean mutex unlock, I suppose. Well, but is there any
benefit at all in treating pthread_cond_signal/broadcast as
a full fence or even a release-only operation? What for?

regards,
alexander.

Jonathan Adams

unread,
Oct 28, 2004, 5:09:33 PM10/28/04
to
In article <41815DCC...@web.de>,
Alexander Terekhov <tere...@web.de> wrote:

> Jonathan Adams wrote:
> [...]
> > parameters can be combined, calls can be deleted if their results
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > are not needed, ...
> > ^^^^^^^^^^^^^^
>
> C'mon, CV is globaly exposed memory object. I have yet to see
> some "real" implementation that doesn't change its internal
> state. That state change will preclude call elimination...
> unless you have some super-duper JIT global optimizer with no
> way to tune its aggressiveness, so to says; unlikely, oder?

But that's a *side effect*, and "isolated calls" must not have *any side
effects*. If you do the (perhaps too typical)

(void) pthread_cond_broadcast(&cv);

and mark pthread_cond_broadcast as an isolated call, the compiler will
elide it completely. (or, at least, the Sun compiler will do the same
with #pragma no_side_effects -- I don't have an IBM compiler handy)

But we already agree that you were just talking about something
different anyway. =]

> > Even if there was a "#pragma isolated_call_but_dont_delete_it",
> > broadcasts and signals almost always happen on one side or the other of
> > a pthread_mutex_exit(), so I doubt there's much benefit to be had here.
>
> You mean mutex unlock, I suppose. Well, but is there any
> benefit at all in treating pthread_cond_signal/broadcast as
> a full fence or even a release-only operation? What for?

Probably not, but is there any benefit to *not* treating them as such?
Changing standards for little to no gain seems like a waste.

Cheers,
- jonathan

Alexander Terekhov

unread,
Oct 28, 2004, 5:23:10 PM10/28/04
to

Jonathan Adams wrote:
[...]

> But that's a *side effect*, and "isolated calls" must not have *any side
> effects*. ...

"Modifying function arguments passed by pointer or by reference is
the only side effect that is allowed."

[...]


> > You mean mutex unlock, I suppose. Well, but is there any
> > benefit at all in treating pthread_cond_signal/broadcast as
> > a full fence or even a release-only operations? What for?
>
> Probably not, but is there any benefit to *not* treating them as such?
> Changing standards for little to no gain seems like a waste.

Conforming applications won't notice that "change". Conforming
applications can only gain.

regards,
alexander.

Jonathan Adams

unread,
Oct 28, 2004, 5:46:42 PM10/28/04
to
In article <4181633E...@web.de>,
Alexander Terekhov <tere...@web.de> wrote:

> Jonathan Adams wrote:
> [...]
> > But that's a *side effect*, and "isolated calls" must not have *any side
> > effects*. ...
>
> "Modifying function arguments passed by pointer or by reference is
> the only side effect that is allowed."

Yup -- sorry, I missed that part.

> [...]
> > > You mean mutex unlock, I suppose. Well, but is there any
> > > benefit at all in treating pthread_cond_signal/broadcast as
> > > a full fence or even a release-only operation? What for?
> >
> > Probably not, but is there any benefit to *not* treating them as such?
> > Changing standards for little to no gain seems like a waste.
>
> Conforming applications won't notice that "change". Conforming
> applications can only gain.

Again, since almost any call to pthread_cond_signal()/broadcast() will
have a pthread_mutex_unlock() very close to it, I doubt there is much
benefit in practice to making such a change.

Cheers,
- jonathan

Alexander Terekhov

unread,
Oct 29, 2004, 6:07:09 AM10/29/04
to

Jonathan Adams wrote:
[...]

> > Conforming applications won't notice that "change". Conforming
> > applications can only gain.
>
> Again, since almost any call to pthread_cond_signal()/broadcast() will
> have a pthread_mutex_unlock() very close to it, I doubt there is much
> benefit in practice to making such a change.

Again, what the heck "a change"? XBD 4.10 begins with the
requirement imposed on applications to "ensure that access to
any memory location by more than one thread of control
(threads or processes) is restricted such that no thread of
control can read or modify a memory location while another
thread of control may be modifying it." This requirement
coupled with the semantics of condition variables (spurious
wakes, etc.) makes the presence of pthread_cond_signal() and
pthread_cond_broadcast() among functions that "synchronize
memory" totally nonsensical. You just can't have a conforming
application that would rely on CV signaling also doing "memory
synchronization" (whatever you happen to believe it actually
means). Compilers are already allowed, but not required, to
treat CV signaling operations, but not *wait(), as "#pragma
isolated_call"-like operations -- it won't break any
conforming application. I'm talking about nothing but
editorial bug fixing in the codification of a "blueprint",
so to say, for POSIX-conforming MT memory model (such as eager
or lazy, but not entry, release consistency) stated in the XBD
4.10.

If you think that I'm wrong, please come up with some
conforming example that would illustrate the use of that silly
"existing requirement"... and I'll openly admit that I'm an
idiot.

TIA.

regards,
alexander.

David Hopwood

unread,
Jan 6, 2005, 8:33:45 PM1/6/05
to
SenderX wrote:
>>>>Volatiles are braindead.
>
> The C/C++ std should rip its dead brain out, and replace it with a new one
> that can actually comprehend threads and memory visibility...
>
> ;)

No ;) required. <http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html>.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Alexander Terekhov

unread,
Jan 7, 2005, 5:07:36 AM1/7/05
to

David Hopwood wrote:
>
> SenderX wrote:
> >>>>Volatiles are braindead.
> >
> > The C/C++ std should rip its dead brain out, and replace it with a new one
> > that can actually comprehend threads and memory visibility...
> >
> > ;)
>
> No ;) required. <http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html>.

FWIW, here's what I wrote in November to a 'couple' among "Other
participants in this effort" from section 6. <copy&paste>

----
I think that POSIX MT memory model should be tackled first. C++ would
"simply extend it" with things like atomic<> and "isolated<>".

http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-group-l&id=7675
http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-group-l&id=7682
http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-group-l&id=7688

I just hate volatiles. Given that they already have some defined
"single threaded" semantics (e.g. longjmp() and automatic storage)
in C/POSIX, I'd leave them alone and wouldn't go the Java route trying
to atomic<>-ize them and impose implicit (and needlessly heavy)
msync semantics on top of all that mess.
----

regards,
alexander.

Joseph Seigh

unread,
Jan 7, 2005, 7:52:02 AM1/7/05
to
On Fri, 07 Jan 2005 01:33:45 GMT, David Hopwood <david.nosp...@blueyonder.co.uk> wrote:

> SenderX wrote:
>>>>> Volatiles are braindead.
>>
>> The C/C++ std should rip its dead brain out, and replace it with a new one
>> that can actually comprehend threads and memory visibility...
>>
>> ;)
>
> No ;) required. <http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html>.
>

I don't have any expectations of that crowd solving the problem. All you're
going to get is C++ hard coded for the problems that they see. It will probably
have little effect on the stuff I work on.

For instance, pointer abstraction. C++ sucks. The problem is that you can almost
abstract them in C++. If it had done a worse job, they would have been forced to
do it right. On top of that, they're way too vested in the highly contrived solutions
they've put into their template libraries.

Simplify C++ so you don't need to put up with all that nonsense? It will never happen.

--
Joe Seigh

SenderX

unread,
Jan 7, 2005, 8:42:25 PM1/7/05
to
>> The C/C++ std should rip its dead brain out, and replace it with a new one
>> that can actually comprehend threads and memory visibility...
>>
>> ;)
>
> No ;) required.
> <http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html>.

I have had some compiler reordering issues a while ago when I was testing
and developing the RCU algorithm that my library uses. I disassembled some
key stuff and found that a hazard reference was being performed before the
RCU quiescent state changed to active. MSVC++ did not reorder the
instructions, but GCC on Linux did.

I had to create most of the RCU code in an externally compiled i686 assembly
library just to ease my mind. Magically, the reordering of "critical"
instructions never happened again.


P.S.

Here is a question on compiler reordering wrt C...


extern void assembly_routine_1( void** );
extern void assembly_routine_2( void** );


/* code */

void *p1, *p2;

1: assembly_routine_1( &p1 );
2: assembly_routine_2( &p2 );


Since call 2 doesn't rely on anything from call 1, could the compiler
execute call 2 before call 1?

How about this one...


/* code */

void *p1, *p2;

1: assembly_routine_1( &p1 );
2: p2 = p1;
3: assembly_routine_2( &p2 );


The compiler knows that step 2 is reading a value that might have been
changed by the external function at step 1. The compiler should execute steps
1, 2, 3 in order?


Joseph Seigh

unread,
Jan 7, 2005, 10:00:32 PM1/7/05
to
On Fri, 7 Jan 2005 17:42:25 -0800, SenderX <x...@xxx.com> wrote:

>>> The C/C++ std should rip its dead brain out, and replace it with a new one
>>> that can actually comprehend threads and memory visibility...
>>>
>>> ;)
>>
>> No ;) required.
>> <http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html>.
>
> I have had some compiler reordering issues a while ago when I was testing
> and developing the RCU algorithm that my library uses. I disassembled some
> key stuff and found that a hazard reference was being performed before the
> RCU quiescent state changed to active. MSVC++ did not reorder the
> instructions, but GCC on Linux did.
>
> I had to create most of the RCU code in an externally compiled i686 assembly
> library just to ease my mind. Magically, the reordering of "critical"
> instructions never happened again.
>

In gcc, you should be able to specify "memory" in the clobber list for an
inline asm memory barrier and gcc shouldn't optimize across it. vc++
has no such provision for its inline asm unless you use all the regs, so
that vc++ can't leave anything in them. So for vc++ you have to use eax, ebx,
ecx, edx, esi, and edi to force optimization off.
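The usual spelling of that gcc idiom is an empty asm with a "memory"
clobber (note: it constrains only the compiler and emits no hardware
barrier instruction):

#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

With that in place, gcc will neither cache memory values in registers
across a COMPILER_BARRIER() nor move memory accesses past it, though
see the caveats about volatile asm later in the thread.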

>
>
>
> P.S.
>
> Here is a question on compiler reordering wrt C...
>
>
> extern void assembly_routine_1( void** );
> extern void assembly_routine_2( void** );
>
>
> /* code */
>
> void *p1, *p2;
>
> 1: assembly_routine_1( &p1 );
> 2: assembly_routine_2( &p2 );
>
>
> Since call 2 doesn't rely on anything from call 1, could the compiler
> execute call 2 before call 1?
>

The compiler has to know that there are no data dependencies in order to
optimize, and that's a lot harder than you think. So in your example the
compiler has to assume that assembly_routine_1 could store the address of
p1 in global memory to be accessed later by assembly_routine_2, so it
can't reorder the calls. It's called the aliasing problem, I believe, and
most compilers don't bother to optimize across calls to separately compiled
external functions.

About the only thing a compiler can optimize across a call are local stack
variables that have never had their addresses taken.

Most shared data can't be optimized since it's referenced globally or
by explicit references.
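A small example of the distinction (my illustration):

extern void external_call(void);
int global_counter;

int demo(void) {
    int cached = 0;     /* address never taken: may live in a register
                           across the call */
    global_counter = 1; /* must be stored before the call, since
                           external_call() might read it */
    external_call();
    return cached + global_counter; /* reloaded after the call, since
                                       external_call() might have
                                       changed it */
}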

--
Joe Seigh

Marcin 'Qrczak' Kowalczyk

unread,
Jan 8, 2005, 5:01:14 AM1/8/05
to
"SenderX" <x...@xxx.com> writes:

> I have had some compiler reordering issues a while ago when I was testing
> and developing the RCU algorithm that my library uses. I disassembled some
> key stuff and found that a hazard reference was being performed before the
> RCU quiescent state changed to active. MSVC++ did not reorder the
> instructions, but GCC on Linux did.

I bet the error was on your side.

> extern void assembly_routine_1( void** );
> extern void assembly_routine_2( void** );
>
>
> /* code */
>
> void *p1, *p2;
>
> 1: assembly_routine_1( &p1 );
> 2: assembly_routine_2( &p2 );
>
>
> Since call 2 doesn't rely on anything from call 1, could the compiler
> execute call 2 before call 1?

No. The functions may e.g. global variables or do I/O, so the compiler
will not reorder them.

> 1: assembly_routine_1( &p1 );
> 2: p2 = p1;
> 3: assembly_routine_2( &p2 );

Same here. 1 must be executed before 2, because 1 may change p1.
2 must be executed before 3, because 3 depends on 2.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

SenderX

unread,
Jan 8, 2005, 12:38:10 PM1/8/05
to
>> I have had some compiler reordering issues a while ago when I was testing
>> and developing the RCU algorithm that my library uses. I disassembled
>> some
>> key stuff and found that a hazard reference was being performed before
>> the
>> RCU quiescent state changed to active. MSVC++ did not reorder the
>> instructions, but GCC on linux did.
>
> I bet the error was on your side.

Yes, this was a problem with trying to use pure C with threads....


My RCU algorithm requires that the quiescent state change to "active"
happens before "any" hazard reference can be attempted. GCC on Linux
reordered the instructions so a hazard was executed before the GC was able
to protect it. Kind of like a hazard pointer being set after the hazard
reference was executed/accessed. You go boom after that....


;)


David Schwartz

unread,
Jan 8, 2005, 7:07:13 PM1/8/05
to

"SenderX" <x...@xxx.com> wrote in message
news:PZidnaoZxdy...@comcast.com...

> Here is a question on compiler reordering wrt C...
>
>
> extern void assembly_routine_1( void** );
> extern void assembly_routine_2( void** );
>
>
> /* code */
>
> void *p1, *p2;
>
> 1: assembly_routine_1( &p1 );
> 2: assembly_routine_2( &p2 );
>
>
> Since call 2 doesn't rely on anything from call 1, could the compiler
> execute call 2 before call 1?

The answer is: Assuming you are following the relevant standards, the
compiler can reorder the functions if and only if you cannot tell the
difference. If you can tell the difference, and the compiler reorders them,
then it is a bug in the compiler, assuming you didn't lie to the compiler or
fail to comply with the relevant standards.

DS


David Hopwood

unread,
Jan 9, 2005, 2:33:36 PM1/9/05
to
David Schwartz wrote:

> "SenderX" <x...@xxx.com> wrote:
>
>>Here is a question on compiler reordering wrt C...
>>
>>extern void assembly_routine_1( void** );
>>extern void assembly_routine_2( void** );
>>
>>/* code */
>>
>>void *p1, *p2;
>>
>>1: assembly_routine_1( &p1 );
>>2: assembly_routine_2( &p2 );
>>
>>
>>Since call 2 doesn't rely on anything from call 1, could the compiler
>>execute call 2 before call 1?
>
> The answer is: Assuming you are following the relevant standards, the
> compiler can reorder the functions if and only if you cannot tell the
> difference.

No. Calling an assembly routine, unless it is a standard library function,
already causes undefined behaviour according to the standards. In practice
you're at the mercy of whatever your compiler happens to do at the
selected optimization level.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

unread,
Jan 9, 2005, 7:06:34 PM1/9/05
to

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:kGfEd.47491$C8.1...@fe3.news.blueyonder.co.uk...

>>>Since call 2 doesn't rely on anything from call 1, could the compiler
>>>execute call 2 before call 1?

>> The answer is: Assuming you are following the relevant standards, the
>> compiler can reorder the functions if and only if you cannot tell the
>> difference.

> No. Calling an assembly routine, unless it is a standard library function,
> already causes undefined behaviour according to the standards.

You are misunderstanding what I mean by "standards". I don't just mean
things like POSIX. I mean any defined interface.

> In practice
> you're at the mercy of whatever your compiler happens to do at the
> selected optimization level.

Nonsense. Few people would willingly use such a compiler, and compilers
like VC++ and gcc are not compilers like that. If your code/compiler works
at one optimization level and not another, there's either a bug in your code
or a bug in the compiler.

DS


Alexander Terekhov

unread,
Jan 10, 2005, 8:15:15 AM1/10/05
to

David Schwartz wrote:
[...]

> Nonsense. Few people would willingly use such a compiler, and compilers
> like VC++ and gcc are not compilers like that. If your code/compiler works
> at one optimization level and not another, there's either a bug in your code
> or a bug in the compiler.

Single-threaded "observable behavior" aside for a moment, neither
gcc not VC++ have any meaningful MT memory model that would allow
programmers express memory isolation and reordering constrains
imposed on compiler and/or "hardware." So in practice, it all
"works" by accident, so to say.

regards,
alexander.

Joseph Seigh

unread,
Jan 10, 2005, 9:14:04 AM1/10/05
to

And the ones that work by accident get to be certified as POSIX compliant.
I'd be surprised if any compilers worked on purpose.


Joe Seigh

SenderX

unread,
Jan 11, 2005, 2:39:20 AM1/11/05
to
> About the only thing a compiler can optimize across a call are local stack
> variables that have never had their addresses taken.

I wonder if this could happen...

static int global = 0;

1. int *local = &global;
2. pthread_mutex_lock( ... );
3. *local = 5;
4. pthread_mutex_unlock( ... );

Since local has never had its address taken, could step 3 possibly be moved
out of the critical section?


SenderX

unread,
Jan 11, 2005, 2:42:37 AM1/11/05
to
> The answer is: Assuming you are following the relevant standards, the
> compiler can reorder the functions if and only if you cannot tell the
> difference.

It seems that the compiler could never really know for sure if you could
tell the difference...


SenderX

unread,
Jan 11, 2005, 2:43:45 AM1/11/05
to

> No. Calling an assembly routine, unless it is a standard library function,
> already causes undefined behaviour according to the standards.

Yes. The calling convention is an issue.


Marcin 'Qrczak' Kowalczyk

unread,
Jan 11, 2005, 4:11:28 AM1/11/05
to
"SenderX" <x...@xxx.com> writes:

> static int global = 0;
>
> 1. int *local = &global;
> 2. pthread_mutex_lock( ... );
> 3. *local = 5;
> 4. pthread_mutex_unlock( ... );
>
> Since local has never had its address taken, could step 3 possibly
> be moved out of the critical section?

No, because it's not local which is written but *local; the store targets
global, which the compiler must assume the external calls can observe.

David Schwartz

unread,
Jan 11, 2005, 4:07:46 AM1/11/05
to

"SenderX" <x...@xxx.com> wrote in message
news:6oydnQh3m8v...@comcast.com...

You guys are construing "standards" very narrowly. When I say
"standards", I mean all the relevant documentation that accompanies all the
interfaces and extensions you are using. If you use GCC's inline assembly
correctly, according to the documentation, your code will work perfectly at
every level of optimization without you having to do anything other than
correctly follow the documentation. Black magic is not needed.

DS


Alexander Terekhov

unread,
Jan 11, 2005, 6:28:48 AM1/11/05
to

David Schwartz wrote:
[...]

> You guys are construing "standards" very narrowly. When I say
> "standards", I mean all the relevant documentation that accompanies all the
> interfaces and extensions you are using. If you use GCC's inline assembly
> correctly, according to the documentation, your code will work perfectly at

According to the GCC's documentation, "Even a volatile asm
instruction can be moved in ways that appear insignificant to the
compiler". GCC's notion of full-stop barrier via memory clobbers
and "asm volatile" sucks miserably... even if we assume that it
works "as expected" (i.e. as a real bidirectional compiler fence).

> every level of optimization without you having to do anything other than
> correctly follow the documentation. Black magic is not needed.

Yeah, just a bit of luck.

regards,
alexander.

David Hopwood

unread,
Jan 11, 2005, 5:54:03 PM1/11/05
to
David Schwartz wrote:
> "SenderX" <x...@xxx.com> wrote:
>
>>>No. Calling an assembly routine, unless it is a standard library
>>>function, already causes undefined behaviour according to the standards.
>
>>Yes. The calling convention is an issue.
>
> You guys are construing "standards" very narrowly. When I say
> "standards", I mean all the relevant documentation that accompanies all the
> interfaces and extensions you are using. If you use GCC's inline assembly
> correctly, according to the documentation, your code will work perfectly at
> every level of optimization without you having to do anything other than
> correctly follow the documentation. Black magic is not needed.

If you mean that, in principle, implementations *should* provide sufficient
and accurate documentation of all their calling conventions and how
optimizations interact with them in the case of multithreaded code,
then who could disagree? If you mean that they are in fact so documented
(for example in the case of VC++ and gcc), then I definitely disagree.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

unread,
Jan 13, 2005, 12:50:39 PM1/13/05
to

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:fOYEd.104229$48.7...@fe1.news.blueyonder.co.uk...

> If you mean that, in principle, implementations *should* provide
> sufficient
> and accurate documentation of all their calling conventions and how
> optimizations interact with them in the case of multithreaded code,
> then who could disagree? If you mean that they are in fact so documented
> (for example in the case of VC++ and gcc), then I definitely disagree.

I am saying they are sufficiently documented such that unless you're
trying to do something that you really have no business doing, you can be
assured that your code will work at every optimization level. Have you ever
known this not to work:

functionmutex.Lock();
call_assembly_function();
functionmutex.Unlock();

DS
