
fast semaphore implementation requires acquire/release?


Paul Pedriana

Oct 17, 2005, 5:22:12 PM
Should a semaphore implementation execute a memory acquire upon sem_wait
and a memory release upon sem_post, or is the user expected to do such
things himself?

Somebody posted a "fast semaphore" implementation here some time ago and
it was implemented using atomic operations. However, it had no memory
barrier calls in it, which implies that if the user of it wants to
synchronize memory, the user would have to do that himself. This seems
to me to be out of line with user expectations and out of line with
current Posix and WinNT OS semaphore implementations.

Thanks.


Joe Seigh

Oct 17, 2005, 6:12:04 PM

The Windows interlocked operations do have memory barrier properties,
so that "fast semaphore" implementation does have memory synchronization.
I assume you're talking about
http://groups.google.com/groups?selm=412F5D8D...@xemaps.com

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Paul Pedriana

Oct 18, 2005, 3:22:56 AM
> The windows interlocked operations do have memory
> barrier properties so that "fast semaphore"
> implementation does have memory synchronization.
> I assume you're talking about
> http://groups.google.com/groups?selm=412F5D8D...@xemaps.com

Yes, that is the fast semaphore implementation. However, I can find no
Microsoft documentation that the InterlockedIncrement function generates
a memory barrier of some sort. The best I can find is this, which states
that the Itanium InterlockedIncrement generates a read/write barrier:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedincrement.asp
(also available as at http://tinyurl.com/293tz)

I've looked at disassemblies of InterlockedIncrement on non-Windows
platforms on processors other than x86, x86-64 and Itanium and they
produce no memory barrier. Yes, Microsoft produces such platforms and
compilers for them.

Peter Dimov

Oct 18, 2005, 8:10:55 AM
Paul Pedriana wrote:

> I've looked at disassemblies of InterlockedIncrement on non-Windows
> platforms on processors other than x86, x86-64 and Itanium and they
> produce no memory barrier. Yes, Microsoft produces such platforms and
> compilers for them.

It's possible that these processors don't need an explicit memory
barrier, which is why one isn't being produced.

Alexander Terekhov

Oct 18, 2005, 8:20:58 AM

Paul Pedriana wrote:
>
> Should a semaphore implementation execute a memory acquire upon sem_wait
> and a memory release upon sem_post

Apart from blocking, semas are basically atomic flags or counts. Given
that the least surprising behavior is sequential consistency, semaphore
operations (lock/unlock/getvalue) should better be fully-fenced and
provide the illusion of so-called "remote write atomicity". Note that
mutexes (vs semas) are much less heavy in this respect.

regards,
alexander.

Joe Seigh

Oct 18, 2005, 8:25:32 AM
You should look at the disassemblies of EnterCriticalSection and
LeaveCriticalSection. They're basically fast-pathed mutexes (actually
Events IIRC) using interlocked instructions also. Make sure you
get the correct library meant to run on a multiprocessor.

Chris Friesen

Oct 18, 2005, 12:07:17 PM
Alexander Terekhov wrote:

> Apart from blocking, semas are basically atomic flags or counts. Given
> that the least surprising behavior is sequential consistency, semaphore
> operations (lock/unlock/getvalue) should better be fully-fenced and
> provide the illusion of so-called "remote write atomicity". Note that
> mutexes (vs semas) are much less heavy in this respect.

From this point of view, what is a mutex but a binary semaphore with
ownership? You still need the barriers for a mutex.

Chris

Alexander Terekhov

Oct 18, 2005, 12:12:18 PM

From this point of view, you need fewer barriers for a (pure) mutex.

regards,
alexander.

Chris Friesen

Oct 18, 2005, 12:47:13 PM
Alexander Terekhov wrote:
> Chris Friesen wrote:

>> From this point of view, what is a mutex but a binary semaphore with
>> ownership? You still need the barriers for a mutex.
>
>
> From this point of view, you need fewer barriers for a (pure) mutex.

Why?

cpu A acquires a mutex, modifies data, releases the mutex.
cpu B acquires the mutex. At this point, the modifications need to be
visible, which implies that the barriers are needed.

How is this different from the semaphore case?

Chris

Alexander Terekhov

Oct 18, 2005, 1:02:57 PM

Chris Friesen wrote:
[...]

> cpu A acquires a mutex, modifies data, releases the mutex.
> cpu B acquires the mutex. At this point, the modifications need to be
> visible, which implies that the barriers are needed.
>
> How is this different from the semaphore case?

http://groups.google.com/group/comp.programming.threads/msg/4c01cdea7c30f338
(follow also msg08586.html link)

http://groups.google.com/group/comp.programming.threads/msg/63fec0c2b97c76ec
(sema_unlock is also meant to be fully-fenced, not release-only)

regards,
alexander.

Paul Pedriana

Oct 18, 2005, 1:38:47 PM
> It's possible that these processors don't need an explicit
> memory barrier, which is why one isn't being produced.

Thanks for the suggestion, but the machines I'm talking about are SMP
PowerPCs, so they definitely need memory barriers.

Paul Pedriana

Oct 18, 2005, 1:46:54 PM
> You should look at the disassemblies of EnterCriticalSection and
> LeaveCriticalSection. They're basically fast-pathed mutexes (actually
> Events IIRC) using interlocked instructions also. Make sure you
> get the correct library meant to run on a multiprocessor.

Yes, but those disassemblies are for x86, which doesn't need memory
barriers.

If I was the designer of InterlockedIncrement, I would not have memory
barriers in the implementation, and I would document that the only
memory synchronization that user can expect is that the increment of
that integer (and only that integer) is guaranteed to be visible to
other processors. As far as I understand it, you don't need memory
barriers to achieve this on platforms such as PowerPC and Itanium. So
why force heavy memory barriers on the user when all the user wants is
an atomic operation on an integer?

Thanks.

Joe Seigh

Oct 18, 2005, 2:18:19 PM
Paul Pedriana wrote:
>> You should look at the disassemblies of EnterCriticalSection and
>> LeaveCriticalSection. They're basically fast-pathed mutexes (actually
>> Events IIRC) using interlocked instructions also. Make sure you
>> get the correct library meant to run on a multiprocessor.
>
>
> Yes, but those disassemblies are for x86, which doesn't need memory
> barriers.

I'm not following you here. You're claiming that since one function
on one platform doesn't need memory barriers and a different function
on a different platform doesn't have memory barriers, there is a
problem?

>
> If I was the designer of InterlockedIncrement, I would not have memory
> barriers in the implementation, and I would document that the only
> memory synchronization that user can expect is that the increment of
> that integer (and only that integer) is guaranteed to be visible to
> other processors. As far as I understand it, you don't need memory
> barriers to achieve this on platforms such as PowerPC and Itanium. So
> why force heavy memory barriers on the user when all the user wants is
> an atomic operation on an integer?
>

Microsoft APIs started out on a platform that didn't need explicit memory
barriers. When they've gone to platforms with different memory models they've
added suffixed versions with the appropriate memory barriers. The unsuffixed
version is kept with full memory barriers for compatibility reasons. If you
want a non-memory-barrier version, the safe way to add that function is with
a suffixed version. But you don't commonly need the unsynchronized
version. Statistical or performance counters are about the only ones I
can think of.

Paul Pedriana

Oct 18, 2005, 3:23:31 PM
Let me answer with this, which is the disassembly of Microsoft's
InterlockedDecrement for SMP PowerPC:

82617D20 mfmsr r9
82617D24 mtmsree r13
82617D28 lwarx r10,r0,r31
82617D2C addi r10,r10,-1
82617D30 stwcx. r10,r0,r31
82617D34 mtmsree r9
82617D38 bne 82617d20h

There is no memory barrier executed here. This is from the latest
compiler and the latest hardware. Also, Microsoft confirmed to me
directly that there is intentionally no barrier.

I can see from reading Microsoft's documentation how somebody could
conclude that InterlockedIncrement always executes a memory barrier. The
problem is that x86 doesn't need memory barriers, and their initial
implementation for Itanium happened to execute a memory barrier; they
realized later that it was a mistake to do so, but users may have become
dependent on the behavior, and so they added the "acquire" and "release"
versions of InterlockedIncrement. But for other platforms, such as the
one with the asm above, they don't do memory barriers.

The result is that it is not clear at all what Microsoft's formal
specification for InterlockedIncrement is. My best guess is that it is
"intended" that its behavior can be different on any platform. This lack
of consistency is perhaps why nowhere in their documentation do they
unequivocally formally state that InterlockedIncrement includes memory
synchronization.

If Microsoft could start over, I'm pretty confident that they would
specify that InterlockedIncrement does not execute memory barriers. This
is the sensible behavior to have and is probably the thing that would be
most in the user's interest.

Peter Dimov

Oct 18, 2005, 3:24:30 PM

It is possible that these PowerPC cores do not reorder. :-)

Joe Seigh

Oct 18, 2005, 3:50:09 PM
Paul Pedriana wrote:
> Let me answer with this, which is the disassembly of Microsoft's
> InterlockedDecrement for SMP PowerPC:
>
> 82617D20 mfmsr r9
> 82617D24 mtmsree r13
> 82617D28 lwarx r10,r0,r31
> 82617D2C addi r10,r10,-1
> 82617D30 stwcx. r10,r0,r31
> 82617D34 mtmsree r9
> 82617D38 bne 82617d20h
>
> There is no memory barrier executed here. This is from the latest
> compiler and the latest hardware. Also, Microsoft confirmed to me
> directly that there is intentionally no barrier.
>
So how do you get a memory barrier on powerpc?

Chris Friesen

Oct 18, 2005, 5:20:07 PM
Peter Dimov wrote:

> It is possible that these PowerPC cores do not reorder. :-)

If that were the case, it would make no sense for Linux to implement
memory barriers on SMP PPC machines. (Which it does.)

Chris

Chris Friesen

Oct 18, 2005, 5:13:23 PM
Joe Seigh wrote:

> So how do you get a memory barrier on powerpc?

"sync" or "eieio" depending on desired semantics.

Chris

Chris Friesen

Oct 18, 2005, 5:27:07 PM
Alexander Terekhov wrote:
> Chris Friesen wrote:
> [...]
>
>>cpu A acquires a mutex, modifies data, releases the mutex.
>>cpu B acquires the mutex. At this point, the modifications need to be
>>visible, which implies that the barriers are needed.
>>
>>How is this different from the semaphore case?
>
>
> http://groups.google.com/group/comp.programming.threads/msg/4c01cdea7c30f338
> http://groups.google.com/group/comp.programming.threads/msg/63fec0c2b97c76ec

That doesn't actually answer my question.

Mutexes guarantee that B sees all of the changes made by A.

What additional guarantees does a semaphore give you that would require
additional barriers?

Chris

David Schwartz

Oct 18, 2005, 5:49:48 PM

"Paul Pedriana" <pedriana@remove_this.pacbell.net> wrote in message
news:43554B5F.3000705@remove_this.pacbell.net...

> If Microsoft could start over, I'm pretty confident that they would
> specify that InterlockedIncrement does not execute memory barriers. This
> is the sensible behavior to have and is probably the thing that would be
> most in the user's interest.

No. Microsoft would specify that InterlockedIncrement performs a full
barrier and that for higher performance, if you know what you need, use some
other function. For example, see:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedincrement.asp
which says:

"This function generates a full memory barrier (or fence) and performs the
increment operation. This ensures the strict memory access ordering that is
necessary, but it can decrease performance. For performance-critical
applications, consider using InterlockedIncrementAcquire or
InterlockedIncrementRelease."

What you are claiming conflicts with everything I have ever known about
the Interlocked* functions. My understanding is:

1) Microsoft initially developed them on the x86 platform without any
special thought about their memory semantics.

2) Microsoft later realized that a lot of programs depended on them
having full memory barrier semantics and that full barrier semantics are
expensive on some platforms. (Largely through the same attitude of not
understanding or thinking about memory visibility issues, not by educated
intent.)

3) Microsoft documented them as having these full barrier semantics and
added functions that didn't have full barrier semantics, such as
InterlockedIncrementAcquire and InterlockedIncrementRelease.

There is lots of direct and indirect evidence to support this view.

DS


Alexander Terekhov

Oct 18, 2005, 5:54:51 PM

Chris Friesen wrote:
>
> Alexander Terekhov wrote:
> > Chris Friesen wrote:
> > [...]
> >
> >>cpu A acquires a mutex, modifies data, releases the mutex.
> >>cpu B acquires the mutex. At this point, the modifications need to be
> >>visible, which implies that the barriers are needed.
> >>
> >>How is this different from the semaphore case?
> >
> >
> > http://groups.google.com/group/comp.programming.threads/msg/4c01cdea7c30f338
> > http://groups.google.com/group/comp.programming.threads/msg/63fec0c2b97c76ec
>
> That doesn't actually answer my question.
>
> Mutexes guarantee that B sees all of the changes made by A.

Yep.

>
> What additional guarantees does a semaphore give you that would require
> additional barriers?

Fully-fenced semantics (plus remote write atomicity WRT sem_getvalue):
RCsc, not RCpc (and not even RCtso along the lines of "semaphore" stuff
on Itanic).

regards,
alexander.

Joe Seigh

Oct 18, 2005, 10:33:19 PM
or lwsync. Yeah, I already know about those. I meant
what part of Microsoft's api for powerpc provides those
memory barriers since they're not part of the interlocked
functions.

Paul Pedriana

Oct 19, 2005, 12:46:58 AM

> which says:
>
> "This function generates a full memory barrier (or fence) and performs
> the increment operation. This ensures the strict memory access ordering
> that is necessary, but it can decrease performance. For performance-critical
> applications, consider using InterlockedIncrementAcquire or
> InterlockedIncrementRelease."

You slightly (but IMO significantly) misquoted the page and left out a
key detail. The page actually says:
"Intel IPF: This function..."

IPF means Itanium Processor Family. What they are saying is that
InterlockedIncrement on *Itanium* generates a full barrier (and even
they aren't clear about whether it is a barrier for just that value or
for all of system memory). I am claiming that Microsoft realized later
that the generation of a full barrier allowed InterlockedIncrement to
work properly on Itanium but was overkill. See below for more on this
Itanium thing.

I am fairly certain that Microsoft doesn't specify that
InterlockedIncrement portably generates a memory barrier. I know this
because I have the disassemblies on other SMP processors and -- most
fundamentally: *they told me so*. There is no more direct evidence than
these two reasons. See another subthread for the details, but if you
still don't believe me, here is the quote which I got two days ago:

"InterlockedIncrement guarantees that the DWORD in question
is incremented in a synchronized manner, but it says nothing
about the rest of memory. This makes InterlockedIncrement
slightly cheaper than a critical section, but also means that
it is not a replacement for a critical section."

> 3) Microsoft documented them as having these full barrier
> semantics and added functions that didn't have full barrier
> semantics, such as InterlockedIncrementAcquire and
> InterlockedIncrementRelease.

Yes, full barrier semantics *with respect to the atomic operation and
not with respect to all of memory*. They make this fairly clear in this
document (http://tinyurl.com/bpamt):

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/kmarch/hh/kmarch/Synchro_88127404-a394-403f-a289-d61c45ab81d5.xml.asp

My contention is that Microsoft has simply been a little sloppy in their
implementation, doesn't have an easy way to back out of it, and is doing
us a disservice by not taking a firm stand on the issue and sticking
with it.

Thanks.


Chris Friesen

Oct 19, 2005, 1:43:23 AM
Joe Seigh wrote:
> Chris Friesen wrote:

>> "sync" or "eieio" depending on desired semantics.
>>
> or lwsync. Yeah, I already know about those. I meant
> what part of Microsoft's api for powerpc provides those
> memory barriers since they're not part of the interlocked
> functions.

Ah, my bad.

Chris

David Schwartz

Oct 19, 2005, 2:14:06 AM

"Paul Pedriana" <pedriana@remove_this.pacbell.net> wrote in message
news:4355CF6D.6000009@remove_this.pacbell.net...

> > which says:
> >
> > "This function generates a full memory barrier (or fence) and performs
> > the increment operation. This ensures the strict memory access ordering that is
> > necessary, but it can decrease performance. For performance-critical
> > applications, consider using InterlockedIncrementAcquire or
> > InterlockedIncrementRelease."

> You slightly (but IMO significantly) misquoted the page and left out a key
> detail. The page actually says:
> "Intel IPF: This function..."

> IPF means Itanium Processor Family. What they are saying is that
> InterlockedIncrement on *Itanium* generates a full barrier (and even they
> aren't clear about whether it is a barrier for just that value or for all
> of system memory). I am claiming that Microsoft realized later that the
> generation of a full barrier allowed InterlockedIncrement to work properly
> on Itanium but was overkill. See below for more on this Itanium thing.

No, please read it. It says, "this ensures the strict memory access
ordering that is necessary". They certainly don't mean that the strict
memory access ordering *effect* is only necessary on Itanium, that would
make no sense at all. They mean that the barrier is necessary on Itanium to
get the effect that the semantics of InterlockedIncrement require.

No such barrier is needed to get this intended effect on x86.

> I am farily certain that Microsoft doesn't specify that
> InterlockedIncrement portably generates a memory barrier.

Right, because such a barrier is not required on x86. However, they did
clearly state that "strict memory access ordering is necessary".

> I know this because I have the disassemblies on other SMP processors
> and -- most fundamentally: *they told me so*. There is no more direct
> evidence than these two reasons. See another subthread for the details,
> but if you still don't believe me, here is the quote which I got two days
> ago:
>
> "InterlockedIncrement guarantees that the DWORD in question
> is incremented in a synchronized manner, but it says nothing
> about the rest of memory. This makes InterlockedIncrement
> slightly cheaper than a critical section, but also means that
> it is not a replacement for a critical section."

I don't know who you are quoting or the context, but we were never
talking about what InterlockedIncrement *guaranteed*. We were talking about
what its intended and actual semantics were. You can almost never find
memory visibility guarantees. I couldn't even find any at all for critical
sections.

> > 3) Microsoft documented them as having these full barrier
> > semantics and added functions that didn't have full barrier
> > semantics, such as InterlockedIncrementAcquire and
> > InterlockedIncrementRelease.

> Yes, full barrier semantics *with respect to the atomic operation and not
> with respect to all of memory*. They make this fairly clear in this
> document (http://tinyurl.com/bpamt):
>
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/kmarch/hh/kmarch/Synchro_88127404-a394-403f-a289-d61c45ab81d5.xml.asp

This says:

"An operation has acquire semantics if other processors will always see its
effect before any subsequent operation's effect. An operation has release
semantics if other processors will see every preceding operation's effect
before the effect of the operation itself."

and

"Atomic operations, such as those that the InterlockedXxx routines perform,
have both acquire and release semantics by default. However, Itanium-based
processors execute operations that have only acquire or only release
semantics faster than those that have both. Therefore, the system provides
InterlockedXxxAcquire and InterlockedXxxRelease versions of some of the
InterlockedXxx routines."

Which is exactly what I've been saying and the opposite of your claim
that it's only with respect to the atomic operation. Your cite 100% supports
my view.

> My contention is that Microsoft has simply been a little sloppy in their
> implementation and don't have a easy way to back out of it and are doing
> us a disservice by not taking a firm stand on the issue and sticking with
> it.

The two cites above are crystal clear. InterlockedIncrement has both
acquire and release semantics.

DS


Paul Pedriana

Oct 19, 2005, 3:51:04 AM
I think perhaps the confusion here is that you and I are talking about
two different things. I'm talking about the question of whether
InterlockedIncrement executes a system-level memory barrier, and I think
you're talking about whether InterlockedIncrement executes a memory
barrier with respect to the atomic operation alone (which in practice on
SMP systems that I'm aware of is usually related to the cache line
associated with the memory value).

Recall that the whole reason I started this thread was to question
whether the fast sem_post and sem_wait functions posted by another user
(and which are based on InterlockedIncrement) properly execute system
memory release and acquire operations. The author said he believed they
did and my contention is that they only do so on x86 (because no such
memory synchronization is needed there) and on Itanium (because it's
documented as such). But I know that InterlockedIncrement doesn't always
work that way because Microsoft told me so and their own compiler doesn't
generate the system-level barriers. I posted the PowerPC disassembly
earlier in this discussion to prove it.

If you are indeed talking about acquire and release semantics with
respect to the atomic operation alone, I fully agree with you and
apologize for anything I might have said to lead the discussion astray.
It is very easy to miscommunicate on topics such as this. The lack of
clarity in Microsoft's documentation doesn't help the situation.

Thanks.


David Schwartz

Oct 19, 2005, 4:51:34 AM

"Paul Pedriana" <pedriana@remove_this.pacbell.net> wrote in message
news:4355FA93.8010603@remove_this.pacbell.net...

>I think perhaps the confusion here is that you and I are talking about
>two different things. I'm talking about the question of whether
>InterlockedIncrement executes a system-level memory barrier, and I think
>you're talking about whether InterlockedIncrement executes a memory barrier
>with respect to the atomic operation alone (which in practice on SMP
>systems that I'm aware of is usually related to the cache line associated
>with the memory value).

Huh?

> Recall that the whole reason I started this thread was to question whether
> the fast sem_post and sem_wait functions posted by another user (and which
> are based on InterlockedIncrement) properly execute system memory release
> and acquire operations. The author said he believed they did and my
> contention is that they only do so on x86 (because no such memory
> synchronization is needed there) and on Itanium (because it's documented
> as such). But I know that InterlockedIncrement doesn't always work that
> way because Microsoft told me so and their own compiler doesn't generate
> the system-level barriers. I posted the PowerPC disassembly earlier in this
> discussion to prove it.

You are incorrect. Microsoft specifically states that the Interlocked*
functions have both acquire and release semantics.

> If you are indeed talking about acquire and release semantics with respect
> to the atomic operation alone, I fully agree with you and apologize for
> anything I might have said to lead the discussion astray.

What does "with respect to the atomic operation alone" mean?

> It is very easy to miscommunicate on topics such as this. The lack of
> clarity in Microsoft's documentation doesn't help the situation.

I still don't understand what you're talking about. So long as the code
contains an 'InterlockedIncrement' that must be called both by the thread
releasing and acquiring the semaphore, and 'InterlockedIncrement' has both
acquire and release semantics, there is no problem.

Please read the quote I excerpted that defines what "acquire" and
"release" actually mean:

"An operation has acquire semantics if other processors will always see its
effect before any subsequent operation's effect. An operation has release
semantics if other processors will see every preceding operation's effect
before the effect of the operation itself."

Note carefully the words "*ANY* subsequent operation" and "*EVERY*
preceding operation".

DS


Peter Dimov

Oct 19, 2005, 5:01:28 AM

I suspect that the hardware we are talking about uses three custom
in-order PowerPC cores and that Linux doesn't run on it... yet. But I
may be wrong.

Alexander Terekhov

Oct 19, 2005, 5:53:43 AM

Peter Dimov wrote:
[...]

> I suspect that the hardware we are talking about uses three custom
> in-order PowerPC cores and that Linux doesn't run on it... yet. But I
> may be wrong.

CELL's cores are also in-order, but the memory model is still relaxed.

I just wonder whether MS does naked Interlocked stuff on

http://www.theinquirer.net/?article=14407

as well.

regards,
alexander.

Peter Dimov

Oct 19, 2005, 6:22:12 AM
Alexander Terekhov wrote:
> Peter Dimov wrote:
> [...]
> > I suspect that the hardware we are talking about uses three custom
> > in-order PowerPC cores and that Linux doesn't run on it... yet. But I
> > may be wrong.
>
> CELL's cores are also in-order, but the memory model is still relaxed.

Absent reordering, I don't see how a lwarx/stwcx. combination could not
constitute the equivalent of a full fence. But I'm not an expert on
this stuff. :-) Of course, without Microsoft documenting the memory
model, we can't know for sure.

> I just wonder whether MS does naked Interlocked stuff on
>
> http://www.theinquirer.net/?article=14407
>
> as well.

A weak InterlockedDecrement would definitely make the platform fun to
program for. :-) Unless they add InterlockedDecrementBarrier as well.

Alexander Terekhov

Oct 19, 2005, 7:33:56 AM
Peter Dimov wrote:
[...]

> Absent reordering, I don't see how a lwarx/stwcx. combination could not
> constitute the equivalent of a full fence.

AFAIK, executing naked lwarx/stwcx in-order doesn't impose the order
with respect to preceding (already committed) stores reaching their
global visibility, for example. Think store buffers.

regards,
alexander.

Peter Dimov

Oct 19, 2005, 7:43:15 AM

I thought store buffers, but it seems to me that the . in stwcx. must
await all previous stores to reach visibility in order to return
whether the store succeeded. Although I suppose that it _could_ snoop
the local store buffer, in theory. Without an official MM spec this is
just idle speculation, of course.

Joe Seigh

Oct 19, 2005, 8:12:40 AM
No problem. I was just trying to figure out what nonexistent
x86 APIs the OP thought I should be using.

Alexander Terekhov

Oct 19, 2005, 9:13:59 AM

Peter Dimov wrote:
[...]

> I thought store buffers, but it seems to me that the . in stwcx. must
> await all previous stores to reach visibility in order to return
> whether the store succeeded. Although I suppose that it _could_ snoop
> the local store buffer, in theory. Without an official MM spec this is
> just idle speculation, of course.

Well, regarding CELLs, the official specs state that

"The storage model for the CBEA [Cell Broadband Engine Architecture]
is weakly consistent. This model incorporates the same weakly
consistent model as the PowerPC Architecture supported by the PPE.
This model provides an opportunity for improved performance over a
model that has stronger consistency rules, but places the
responsibility on the programmer or programming tools to ensure that
ordering or synchronization instructions, commands, or command
modifiers are properly placed when the storage is shared by multiple
units in the CBEA. The PPE Storage Access Ordering and instructions
are defined PowerPC Architecture, Book II. For DMA operations
initiated by the Memory Flow Controllers found in the SPEs, storage
ordering command modifiers, fence and barrier are provided as are
the synchronization commands mfcsync, mfceieio and barrier. For more
information on these facilities, see Section 7.9 MFC Synchronization
Commands beginning on page 62."

And as for stwcx. and preceding stores, see Book II and in particular
B.3 List Insertion.

regards,
alexander.

Paul Pedriana

Oct 19, 2005, 12:13:24 PM
> No problem. I was just trying to figure out what nonexistent
> x86 APIs the OP thought I should be using.

Perhaps Microsoft means for users to use the _ReadWriteBarrier function
for portability, though the documentation for that function suggests it
relates to compiler optimization. On the other hand, perhaps they expect
users to use processor-specific calls. Microsoft provides a compiler
intrinsic called __lwsync() which they expect you to use on PowerPC.
That's what I will be adding to your fast semaphore in order to get it
to work for me on PowerPC.

Thanks.


Joe Seigh

Oct 19, 2005, 12:00:26 PM

lwsync doesn't do store/load ordering which you may need in some cases,
e.g. acquire semantics. You may need sync as well.

Alexander Terekhov

Oct 19, 2005, 12:13:04 PM

Joe Seigh wrote:
[...]

> lwsync doesn't do store/load ordering which you may need in some cases,
> e.g. acquire semantics. You may need sync as well.

For acquire semantics (rather ccacq_true/cchlb_true for store
conditional), sync is probably overkill. The recommended way is
isync (after a branch). But in order to "fully-fence" sema_lock(),
an extra (leading) lwsync is also required.

regards,
alexander.

Chris Thomasson

Oct 20, 2005, 5:39:07 PM
"Paul Pedriana" <pedriana@remove_this.pacbell.net> wrote in message
news:43554B5F.3000705@remove_this.pacbell.net...

> Let me answer with this, which is the disassembly of Microsoft's
> InterlockedDecrement for SMP PowerPC:
>
> 82617D20 mfmsr r9
> 82617D24 mtmsree r13
> 82617D28 lwarx r10,r0,r31
> 82617D2C addi r10,r10,-1
> 82617D30 stwcx. r10,r0,r31
> 82617D34 mtmsree r9
> 82617D38 bne 82617d20h
>
> There is no memory barrier executed here. This is from the latest compiler
> and the latest hardware. Also, Microsoft confirmed to me directly that
> there is intentionally no barrier.

Humm... I wonder why they would alter existing behavior, when they could
have added:

InterlockedIncrementNaked!


Anyway, the following APIs:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedincrementrelease.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockeddecrementacquire.asp

Should save you the trouble of manually adding the barriers to his code.


David Schwartz

Nov 3, 2005, 7:04:12 PM

"Paul Pedriana" <pedriana@remove_this.pacbell.net> wrote in message
news:Hqa5f.2027$dO2....@newssvr29.news.prodigy.net...

> > It's possible that these processors don't need an explicit
> > memory barrier, which is why one isn't being produced.

> Thanks for the suggestion, but the machines I'm talking about are SMP
> PowerPCs, so they definitely need memory barriers.

That would be a bug then. Microsoft has clearly stated that the normal
Interlocked* functions have both acquire and release semantics. See:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/kmarch/hh/kmarch/Synchro_88127404-a394-403f-a289-d61c45ab81d5.xml.asp

Which says:

An operation has acquire semantics if other processors will always see its
effect before any subsequent operation's effect. An operation has release
semantics if other processors will see every preceding operation's effect
before the effect of the operation itself.

Atomic operations, such as those that the InterlockedXxx routines perform,
have both acquire and release semantics by default.

If the processor does not provide instructions that have only acquire or
only release semantics, the system will use the corresponding routine that
provides both types of semantics. For example, on x86 processors both
InterlockedIncrementAcquire and InterlockedIncrementRelease are equivalent
to InterlockedIncrement.

DS


Alexander Terekhov

Nov 9, 2005, 11:09:43 AM

Gack.

http://www.alphaworks.ibm.com/tech/cellsw/download

-----
%% @(#)03 1.4 src/lib/sync/README.txt, sw.lib, sdk_pub 10/11/05 15:46:27
%% --------------------------------------------------------------
%% (C) Copyright 2001,2005,
%% International Business Machines Corporation,
%% Sony Computer Entertainment Incorporated,
%% Toshiba Corporation.
%%
%% All Rights Reserved.
%% --------------------------------------------------------------
%% PROLOG END TAG zYx
BE SHARED MEMORY SYNCHRONIZATION PRIVIMITIVES

This directory contains a collection of interfaces that are patterned after
POSIX threads mutex and condition varialbes for the Cell Broadband Engine
Architecture.

ppu

spu
-----

Totally busted spin-waiting "condvars" aside for a moment, it seems that
the only msync stuff used by the sync library is branch+isync to acquire.

And that's it. Go figure.

regards,
alexander.
