lwarx/stwcx on PowerPC 970

Alexander Terekhov

Apr 9, 2005, 8:37:14 AM

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87256D4B007164C0/$file/970Programming_Note_larx_stcx.d20030618.pdf

Says that "Every larx executed should have an accompanying stcx to clear
the reservation." and that "Additional information regarding Larx/Stcx
instructions and usage can be found in the 'PowerPC Microprocessor
Family: Programming Environments Manual for 64 and 32-Bit Microprocessors'."

"Additional information" shows coding examples like this:

<quote>

loop: lwarx  r6,0,r3   #load and reserve
      cmpw   r4,r6     #first 2 operands equal ?
      bne-   exit      #skip if not
      stwcx. r5,0,r3   #store new value if still reserved
      bne-   loop      #loop if lost reservation
exit: mr     r4,r6     #return value from memory

</quote>
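
For readers more at home in C++ than in PowerPC assembly, the quoted sequence has the shape of a compare-and-swap. What follows is only a minimal sketch, assuming C++11 `std::atomic`; the name `compare_and_swap` is made up for illustration and does not come from the manual. Compilers targeting PowerPC typically lower `compare_exchange_strong` to a lwarx/stwcx. loop of this kind, and the failure path there also skips the store.

```cpp
#include <atomic>

// Hypothetical helper, not from the manual: returns the value found in
// memory and stores `desired` only if that value equals `expected`.
int compare_and_swap(std::atomic<int>& word, int expected, int desired) {
    int observed = expected;
    // On failure, compare_exchange_strong copies the value it saw into
    // `observed` and performs no store (the "bne- exit" path above).
    word.compare_exchange_strong(observed, desired);
    return observed;
}
```

The mismatch path is the one that leaves a reservation dangling in the assembly version, which is exactly the case the programming note seems to object to.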

So, who's right here?

regards,
alexander.

Joe Seigh

Apr 9, 2005, 9:18:34 AM

Just "strongly recommended", not required. The reservations obviously
aren't recursively counted since the OS has no way of knowing the count
in order to "clear" the reservations. Unless there's some performance
hit for maintaining the reservation, I can't see any reason to worry
about it.

--
Joe Seigh

Alexander Terekhov

Apr 9, 2005, 9:39:32 AM

Joe Seigh wrote:
[...]

> I can't see any reason to worry about it.

Well, see

regards,
alexander.

Joe Seigh

Apr 9, 2005, 10:56:52 AM

No, sorry. I still don't see any reason to worry.
It'd have to be a performance issue. And when it
comes to performance, Boost is penny wise and
pound foolish, so I'm not worrying about their
so-called issues either.

--
Joe Seigh

Alexander Terekhov

Apr 9, 2005, 12:18:33 PM

Joe Seigh wrote:
[...]

> No, sorry. I still don't see any reason to worry.

Well, an Apple fellow told maintainer at boost that "there are known
errata on some processors. For example, on some 970s a reservation
must never be left dangling. Your code is exposed to this bug, which
is very serious albeit rare." Another rather strange thing is that

http://www.google.de/groups?selm=rang-2010981430230001%40margaret.trillium.adaptec.com

it looks like they started doing that silly "cleanup" long before
PPC970 was put on paper, not even first silicon.

BTW, maintainer at boost wants to know the following:

----
Are you sure that keeping a reservation alive for extended periods
of time does not incur a performance penalty, BTW? Could this be
the reason for the preliminary technical note?
----

Perhaps someone here can answer it. TIA.

regards,
alexander.

Joe Seigh

Apr 9, 2005, 2:12:24 PM

On Sat, 09 Apr 2005 18:18:33 +0200, Alexander Terekhov <tere...@web.de> wrote:

>
> Joe Seigh wrote:
> [...]
>> No, sorry. I still don't see any reason to worry.
>
> Well, an Apple fellow told maintainer at boost that "there are known
> errata on some processors. For example, on some 970s a reservation
> must never be left dangling. Your code is exposed to this bug, which
> is very serious albeit rare." Another rather strange thing is that

You're exposed to undocumented errata anytime you do assembler programming.

>
> http://www.google.de/groups?selm=rang-2010981430230001%40margaret.trillium.adaptec.com
>
> it looks like they started doing that silly "cleanup" long before
> PPC970 was put on paper, not even first silicon.
>
> BTW, maintainer at boost wants to know the following:
>
> ----
> Are you sure that keeping a reservation alive for extended periods
> of time does not incur a performance penalty, BTW? Could this be
> the reason for the preliminary technical note?
> ----
>
> Perhaps someone here can answer it. TIA.

You can always time it on a multi-processor 970. If it did, I would
guess it has something to do with overhead incurred watching the
address bus if a reserve is held.

The only issue I heard about for lwarx/stwcx was it didn't scale
as well as compare and swap under extremely small timeslicing such
as you might get running multiple levels of VM since the reserve
would be "lost" across context switches whereas the compare value
would not be lost. Hence, recommendations to keep the reserve
interval as small as possible. This has been known since the early
90s.

--
Joe Seigh

Brian Inglis

Apr 10, 2005, 2:53:23 AM

On Sat, 09 Apr 2005 14:12:24 -0400 in comp.arch, "Joe Seigh"
<jsei...@xemaps.com> wrote:

>On Sat, 09 Apr 2005 18:18:33 +0200, Alexander Terekhov <tere...@web.de> wrote:

>> Joe Seigh wrote:

>>> No, sorry. I still don't see any reason to worry.
>>
>> Well, an Apple fellow told maintainer at boost that "there are known
>> errata on some processors. For example, on some 970s a reservation
>> must never be left dangling. Your code is exposed to this bug, which
>> is very serious albeit rare." Another rather strange thing is that
>
>You're exposed to undocumented errata anytime you do assembler programming.
>
>> http://www.google.de/groups?selm=rang-2010981430230001%40margaret.trillium.adaptec.com
>>
>> it looks like they started doing that silly "cleanup" long before
>> PPC970 was put on paper, not even first silicon.
>>
>> BTW, maintainer at boost wants to know the following:

>> Are you sure that keeping a reservation alive for extended periods
>> of time does not incur a performance penalty, BTW? Could this be
>> the reason for the preliminary technical note?
>>

>> Perhaps someone here can answer it. TIA.
>
>You can always time it on a multi-processor 970. If it did, I would
>guess it has something to do with overhead incurred watching the
>address bus if a reserve is held.
>
>The only issue I heard about for lwarx/stwcx was it didn't scale
>as well as compare and swap under extremely small timeslicing such
>as you might get running multiple levels of VM since the reserve
>would be "lost" across context switches whereas the compare value
>would not be lost. Hence, recommendations to keep the reserve
>interval as small as possible. This has been known since the early
>90s.

If there is no way to check whether the reserve is held or lost, or
whether a context switch occurred, why use such an unreliable approach?

--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada

Brian....@CSi.com (Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
fake address use address above to reply

Christian Bau

Apr 10, 2005, 3:43:42 AM

In article <q4jh511shtiip2rkl...@4ax.com>,
Brian Inglis <Brian....@SystematicSW.Invalid> wrote:

Why do you think lwarx/stwcx is not reliable? You get a reservation, and
when you check later, it will tell you absolutely reliably whether you
still hold the reservation or whether it is lost.

Eric Smith

Apr 10, 2005, 4:42:19 AM

"Joe Seigh" <jsei...@xemaps.com> writes:
> You're exposed to undocumented errata anytime you do assembler programming.

If that's true, you're also exposed to undocumented errata every time you
do programming, in C, C++, or any other language, unless the compiler was
written by people that knew of the errata and how to work around it. There's
nothing magic about C compilers that inherently protects you from such
things.

Dan Koren

Apr 10, 2005, 4:55:33 AM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:opsoyu88daqm36vk@grunion...

a) this is not an OS issue -- the lwarx/stwcx instructions
are executed directly by hardware without os involvement ;-)

b) the programming note is incorrect, the reservation is
cleared by any store following the lwarx -- otherwise
the lwarx/stwcx primitives would not work ;-)

dk

Joe Seigh

Apr 10, 2005, 6:45:07 AM

On Sun, 10 Apr 2005 04:55:33 -0400, Dan Koren <dank...@yahoo.com> wrote:

> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:opsoyu88daqm36vk@grunion...

>> Just "strongly recommended", not required. The reservations obviously
>> aren't recursively counted since the OS has no way of knowing the count
>> in order to "clear" the reservations. Unless there's some performance
>> hit for maintaining the reservation, I can't see any reason to worry
>> about it.
>
>
> a) this is not an OS issue -- the lwarx/stwcx instructions
> are executed directly by hardware without os involvement ;-)
>
> b) the programming note is incorrect, the reservation is
> cleared by any store following the lwarx -- otherwise
> the lwarx/stwcx primitives would not work ;-)
>
>

Any store? AFAIK, only stores into the reservation granule or by
a stwcx instruction.

--
Joe Seigh

Dan Koren

Apr 10, 2005, 7:34:32 AM

"Joe Seigh" <jsei...@xemaps.com> wrote in message

news:opso0ithrpqm36vk@grunion...

> On Sun, 10 Apr 2005 04:55:33 -0400, Dan Koren <dank...@yahoo.com> wrote:
>
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:opsoyu88daqm36vk@grunion...
>
>>> Just "strongly recommended", not required. The reservations obviously
>>> aren't recursively counted since the OS has no way of knowing the count
>>> in order to "clear" the reservations. Unless there's some performance
>>> hit for maintaining the reservation, I can't see any reason to worry
>>> about it.
>>
>>
>> a) this is not an OS issue -- the lwarx/stwcx instructions
>> are executed directly by hardware without os involvement ;-)
>>
>> b) the programming note is incorrect, the reservation is
>> cleared by any store following the lwarx -- otherwise
>> the lwarx/stwcx primitives would not work ;-)
>>
>>
> Any store?

Any store instruction; it does not have to be another stwcx.

> AFAIK, only stores into the reservation granule or by

The size of the "reservation granule" is implementation
dependent, and it could be a lot larger than many would
imagine. Typically, LL/SC pairs (since this is what
lwarx/stwcx is, I cannot imagine why IBM gave it a
different name) are implemented using cache invalidation,
so the most likely "granule" would be a cache line.

> a stwcx instruction.

Obviously not, since the whole point of LL/SC (or lwarx/
stwcx) is to guarantee atomicity of the entire sequence
with respect to the address of interest. Any modification
of the data at said address (or the granule containing it)
must break the reservation -- and that means *ANY* store.

dk

Joe Seigh

Apr 10, 2005, 8:02:12 AM

On Sun, 10 Apr 2005 07:34:32 -0400, Dan Koren <dank...@yahoo.com> wrote:

>
>
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:opso0ithrpqm36vk@grunion...
>> On Sun, 10 Apr 2005 04:55:33 -0400, Dan Koren <dank...@yahoo.com> wrote:
>>> b) the programming note is incorrect, the reservation is
>>> cleared by any store following the lwarx -- otherwise
>>> the lwarx/stwcx primitives would not work ;-)
>>>
>>>
>> Any store?
>
>
> Any store instruction; it does not have to be another stwcx.

The PPC architecture manuals say that only a store by any processor
*into* the reservation granule, or a stwcx. anywhere by the processor
holding the reserve, breaks the reserve. AFAIK, only Alpha
processors had the any-store problem. So on PPC a
plain store to anything but the address the reserve was
placed on wouldn't necessarily be guaranteed to break
the reservation.

>
>
>> AFAIK, only stores into the reservation granule or by
>
>
> The size of the "reservation granule" is implementation
> dependent, and it could be a lot larger than many would
> imagine. Typically, LL/SC pairs (since this is what
> lwarx/stwcx is, I cannot imagine why IBM gave it a
> different name) are implemented using cache invalidation,
> so the most likely "granule" would be a cache line.

I agree the cache line would be the most likely since
it would probably be hooked into the bus snooping logic
used for cache coherence. But the reserve size could be
as small as a word if somebody wanted to add extra compare
bits to the bus snooping logic along with extra logic on
top of cache coherence logic.

--
Joe Seigh

Alexander Terekhov

Apr 10, 2005, 9:18:31 AM

Joe Seigh wrote:
[...]

> You can always time it on a multi-processor 970.

I don't have it.

Here's a snippet from the official "Programming Environments Manual"
referenced by that puzzling preliminary notice itself. It explicitly
mentions dangling lwarxs, says that it is good for forward progress,
and shows yet another dangling lwarx in "better performance may be
obtained" illustration.

<quote>

E.1 General Information

The following points provide general information about the lwarx
and stwcx. instructions:

[...]

- It is acceptable to execute an lwarx instruction for which no
stwcx. instruction is executed. Such a dangling lwarx instruction
occurs in the example shown in Section E.2.5 , "Test and Set," if
the value loaded is not zero.

- To increase the likelihood that forward progress is made, it is
important that looping on lwarx/stwcx. pairs be minimized. For
example, in the sequence shown in Section E.2.5 , "Test and Set,"
this is achieved by testing the old value before attempting the
store -- were the order reversed, more stwcx. instructions might
be executed, and reservations might more often be lost between
the lwarx and the stwcx. instructions.

- The manner in which lwarx and stwcx. are communicated to other
processors and mechanisms, and between levels of the memory
subsystem within a given processor, is implementation-dependent.
In some implementations, performance may be improved by minimizing
looping on an lwarx instruction that fails to return a desired
value. For example, in the example provided in Section E.2.5 ,
"Test and Set," if the program stays in the loop until the word
loaded is zero, the programmer can change the "bne- $+12" to "bne-
loop."

In some implementations, better performance may be obtained by
using an ordinary load instruction to do the initial checking of
the value, as follows:

loop: lwz    r5,0(r3)  #load the word
      cmpwi  r5,0      #loop back if word
      bne-   loop      #not equal to 0
      lwarx  r5,0,r3   #try again, reserving
      cmpwi  r5,0      #(likely to succeed)
      bne    loop      #try to store nonzero
      stwcx. r4,0,r3   #
      bne-   loop      #loop if lost reservation

[...]

E.2.5 Test and Set

This version of the test and set primitive atomically loads a word
from memory, ensures that the word in memory is a nonzero value,
and sets CR0[EQ] according to whether the value loaded is zero.

In this example, it is assumed that the address of the word to be
tested is in GPR3, the new value (nonzero) is in GPR4, and the old
value is returned in GPR5.

loop: lwarx  r5,0,r3   #load and reserve
      cmpwi  r5,0      #done if word
      bne    $+12      #not equal to 0
      stwcx. r4,0,r3   #try to store non-zero
      bne-   loop      #loop if lost reservation

</quote>
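
The two shapes quoted from the manual can be sketched in portable C++ as well. This is only an illustration under stated assumptions: it uses C++11 `std::atomic`, the function names `test_and_set` and `acquire_when_zero` are hypothetical, and the memory ordering is left at the library default, which the manual's examples do not address.

```cpp
#include <atomic>

// Single attempt in the shape of E.2.5: returns the old value. A return of 0
// means the nonzero `new_value` was stored; otherwise nothing was stored.
int test_and_set(std::atomic<int>& word, int new_value) {
    int old = 0;
    word.compare_exchange_strong(old, new_value);
    return old;
}

// The "better performance may be obtained" variant: wait with an ordinary
// load, and only then make the reserved attempt.
void acquire_when_zero(std::atomic<int>& word, int new_value) {
    for (;;) {
        // Plain load loop (the "lwz" part): no reservation is taken here.
        while (word.load() != 0) { }
        int expected = 0;
        // Reserved load, test and conditional store; a weak exchange may
        // fail spuriously, which corresponds to a lost reservation.
        if (word.compare_exchange_weak(expected, new_value)) {
            return;
        }
    }
}
```

In both functions the path where the word is already nonzero returns or retries without storing, which is the dangling-lwarx case the manual explicitly calls acceptable.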

Given that the puzzling preliminary notice says "Verify with your IBM
field applications engineer that you have the latest version of
this document before finalizing a design" and that boost.org doesn't
have an IBM field applications engineer to verify with, perhaps someone
from IBM chips hanging around here can shed some light unofficially, so
to speak. The design in question is this:

asm void atomic_increment( register long * pw ) {

loop:
    <load-reserved>
    <add 1>
    <store-conditional>
    <branch if failed to loop>
}

asm long atomic_increment_if_not_zero( register long * pw ) {

    // Optional
    <load-UNreserved>
    <cmp 0>
    <branch if zero to done>

loop:
    <load-reserved>
    <cmp 0>
    <branch if zero to done>
    <add 1>
    <store-conditional>
    <branch if failed to loop>

done:
    <...>
}

asm long atomic_decrement_weak( register long * pw ) {

    <load-UNreserved>
    <add -1>
    <branch if zero to acquire>

    {lw}sync

loop:
    <load-reserved>
    <add -1>
    <branch if zero to acquire>
    <store-conditional>
    <branch if failed to loop else to done>

acquire:
    isync

done:
    <...>
}

asm long atomic_decrement_strong( register long * pw ) {

    // Peter's state machine

loop0:
    <load-reserved>
    <add -1>
    <branch if zero to loop0_acquire>

    {lw}sync

loop1:
    <store-conditional>
    <branch if !failed to done>

loop2:
    <load-reserved>
    <add -1>
    <branch if !zero to loop1>
    <store-conditional>
    <branch if failed to loop2 else to done>

loop0_acquire:
    <store-conditional>
    <branch if failed to loop0>

acquire:
    isync

done:
    <...>
}
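
For comparison, here is roughly what the outline above does, written as a portable C++ sketch. It assumes C++11 `std::atomic`, which postdates this thread, and it is not Boost's code. The function names only mirror the outline, the weak/strong split and Peter's state machine are not reproduced, and the decrement uses the usual release-then-acquire pattern rather than the exact {lw}sync/isync placement shown.

```cpp
#include <atomic>

// Unconditional increment; no ordering is requested here (an assumption).
void atomic_increment(std::atomic<long>& count) {
    count.fetch_add(1, std::memory_order_relaxed);
}

// Returns the new count, or 0 if the count was already zero.
long atomic_increment_if_not_zero(std::atomic<long>& count) {
    long old = count.load(std::memory_order_relaxed);
    while (old != 0) {
        // On failure `old` is refreshed with the current value and re-tested.
        if (count.compare_exchange_weak(old, old + 1,
                                        std::memory_order_relaxed)) {
            return old + 1;
        }
    }
    return 0;  // leaves without storing: the "dangling" path in LL/SC terms
}

// Returns the new count: release on every decrement, acquire on reaching zero.
long atomic_decrement(std::atomic<long>& count) {
    long now = count.fetch_sub(1, std::memory_order_release) - 1;
    if (now == 0) {
        std::atomic_thread_fence(std::memory_order_acquire);
    }
    return now;
}
```

The zero exit of `atomic_increment_if_not_zero` is the interesting one for this thread, since on an LL/SC machine it ends with a load-reserved that never gets its store-conditional.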

{boost::}shared_ptr/weak_ptr is now part of the C++ Library Technical
Report, so many folks need that stuff.

regards,
alexander.

Dan Koren

Apr 10, 2005, 6:52:29 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message

news:opso0mdyvuqm36vk@grunion...

> On Sun, 10 Apr 2005 07:34:32 -0400, Dan Koren <dank...@yahoo.com> wrote:
>
>>
>>
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:opso0ithrpqm36vk@grunion...
>>> On Sun, 10 Apr 2005 04:55:33 -0400, Dan Koren <dank...@yahoo.com>
>>> wrote:
>>>> b) the programming note is incorrect, the reservation is
>>>> cleared by any store following the lwarx -- otherwise
>>>> the lwarx/stwcx primitives would not work ;-)
>>>>
>>> Any store?
>>
>> Any store instruction; it does not have to be another stwcx.
>
> The PPC architecture manuals say that only a store by any processor
> *into* the reservation granule, or a stwcx. anywhere by the processor
> holding the reserve, breaks the reserve. AFAIK, only Alpha
> processors had the any-store problem. So on PPC a
> plain store to anything but the address the reserve was
> placed on wouldn't necessarily be guaranteed to break
> the reservation.

Please reread what I wrote. I did not write "any store to
any address". I wrote it didn't have to be another stwcx.
In other words, any kind of store instruction. I suppose
that can still lend itself to more than one interpretation.

>>
>>> AFAIK, only stores into the reservation granule or by
>>
>> The size of the "reservation granule" is implementation
>> dependent, and it could be a lot larger than many would
>> imagine. Typically, LL/SC pairs (since this is what
>> lwarx/stwcx is, I cannot imagine why IBM gave it a
>> different name) are implemented using cache invalidation,
>> so the most likely "granule" would be a cache line.
>
> I agree the cache line would be the most likely since
> it would probably be hooked into the bus snooping logic
> used for cache coherence. But the reserve size could be
> as small as a word if somebody wanted to add extra compare
> bits to the bus snooping logic along with extra logic on
> top of cache coherence logic.

These days people are cutting corners everywhere ;-)

With cache line sizes as large as 128 bytes, reserves
smaller than the cache line size would not work very well.

Since an entire cache line must be invalidated on an
update, what would be the benefit of reserves smaller than
a full cache line?

dk

Chris Thomasson

Apr 11, 2005, 1:36:53 AM

> The size of the "reservation granule" is implementation
> dependent, and it could be a lot larger than many would
> imagine. Typically, LL/SC pairs (since this is what
> lwarx/stwcx is, I cannot imagine why IBM gave it a
> different name) are implemented using cache invalidation,
> so the most likely "granule" would be a cache line.

One could theoretically open themselves up to live-lock conditions if they
did not take this into consideration when implementing certain lock-free
algorithms on PPC. For instance, if the anchor of a lock-free LIFO was not
the size of a cache-line, and was not aligned on a cache-line boundary you
could get "false-sharing" on the anchor. I don't have much experience with
PPC, but it seems that false-sharing could break reservations on the anchor
and adversely affect forward progress, and might cause live-lock like
conditions...
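
One way to keep unrelated data out of the anchor's reservation granule is to pad and align the anchor. The sketch below is only that, a sketch: it assumes C++11 `alignas`, and it assumes a 128-byte granule because that is the cache-line figure mentioned earlier in this thread, not because any manual guarantees it. The names `LifoAnchor`, `Node` and `kAssumedGranule` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

// Assumed granule size; the real value is implementation-dependent.
constexpr std::size_t kAssumedGranule = 128;

struct Node {
    Node* next;
};

// Hypothetical LIFO anchor, aligned and padded so that no other object
// can share the granule that holds `head`.
struct alignas(kAssumedGranule) LifoAnchor {
    std::atomic<Node*> head{nullptr};
};

static_assert(alignof(LifoAnchor) == kAssumedGranule,
              "anchor starts on a granule boundary");
static_assert(sizeof(LifoAnchor) % kAssumedGranule == 0,
              "anchor fills whole granules");
```

This only isolates the anchor. Nodes whose next pointers land in the anchor's granule, or in each other's, would need the same treatment.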

Dan Koren

Apr 11, 2005, 1:58:37 AM

"Chris Thomasson" <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote in
message news:6bednQvZZr-...@comcast.com...

Indeed ;-)

Please send me your resume to have on hand when the next
position req opens up... ;-)

Thx,

dk

Alexander Terekhov

Apr 11, 2005, 3:25:48 AM

Chris Thomasson wrote:
[...]

> > so the most likely "granule" would be a cache line.
>
> One could theoretically open themselves up to live-lock conditions

Sure, it's even documented.

---
In a multiprocessor, livelock (a state in which processors interact
in a way such that no processor makes progress) is possible if a loop
containing an lwarx/stwcx. pair also contains an ordinary store
instruction for which any byte of the affected memory area is in the
reservation granule of the reservation. For example, the first code
sequence shown in Section E.5 , “List Insertion,” can cause livelock
if two list elements have next element pointers in the same
reservation granule.
---

regards,
alexander.

Alexander Terekhov

Apr 11, 2005, 4:35:23 AM

Dan Koren wrote:

[... reserve size could be as small as a word ...]

> Since an entire cache line must be invalidated on an
> update, what would be benefit of reserves smaller than
> a full cache line?

Better forward progress in some cases and lack of simple livelocks.

regards,
alexander.

Chris Thomasson

Apr 11, 2005, 5:34:35 AM

>> > so the most likely "granule" would be a cache line.
>>
>> One could theoretically open themselves up to live-lock conditions
>
> Sure, it's even documented.

Yes it is:

http://groups.google.ca/groups?selm=Y72dnVoK3PbjxHHcRVn-qA%40comcast.com&rnum=1

( I posted as SenderX. )

Naresh Nayar

Apr 11, 2005, 10:57:40 AM

The processor holding the reservation can execute a store type of
instruction without losing the reservation.

Naresh.

"Dan Koren" <dank...@yahoo.com> wrote in message
news:42590f61$1...@news.meer.net...

Alexander Terekhov

Apr 11, 2005, 11:38:33 AM

Naresh Nayar wrote:
>
> The processor holding the reservation can execute a store type of
> instruction without losing the reservation.

Good news. ;-)

And what's the fuss with dangling larxs?

http://www-306.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970_and_970FX_Microprocessors

"Application Note", I mean. TIA.

regards,
alexander.

Joe Seigh

Apr 11, 2005, 12:27:43 PM

On Mon, 11 Apr 2005 09:57:40 -0500, Naresh Nayar <na...@us.ibm.com> top
posted (but I fixed it):

> "Dan Koren" <dank...@yahoo.com> wrote in message
> news:42590f61$1...@news.meer.net...
>>
>>
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:opso0ithrpqm36vk@grunion...
>> > On Sun, 10 Apr 2005 04:55:33 -0400, Dan Koren <dank...@yahoo.com>
> wrote:
>> >
>> >> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> >> news:opsoyu88daqm36vk@grunion...
>> >
>> >>> Just "strongly recommended", not required. The reservations obviously
>> >>> aren't recursively counted since the OS has no way of knowing the
> count
>> >>> in order to "clear" the reservations. Unless there's some performance
>> >>> hit for maintaining the reservation, I can't see any reason to worry
>> >>> about it.
>> >>
>> >>
>> >> a) this is not an OS issue -- the lwarx/stwcx instructions
>> >> are executed directly by hardware without os involvement ;-)
>> >>
>> >> b) the programming note is incorrect, the reservation is
>> >> cleared by any store following the lwarx -- otherwise
>> >> the lwarx/stwcx primitives would not work ;-)
>> >>
>> >>
>> > Any store?
>>
>>
>> Any store instruction; it does not have to be another stwcx.
>>

> The processor holding the reservation can execute a store type of
> instruction without losing the reservation.
>

"can" meaning it's possible but not guaranteed. As a practical matter,
not knowing the reservation granule size, a thread doing a reserve would
have to assume any store *could* clear the reservation and program as if
it always did.

But at least you didn't say "the reservation is cleared by any store
following the lwarx". That is clearly incorrect. :)

--
Joe Seigh

byb...@rocketmail.com

Apr 11, 2005, 10:20:30 PM

Joe Seigh wrote:

> > The processor holding the reservation can execute a store type of
> > instruction without losing the reservation.
> >
>
> "can" meaning it's possible but not guaranteed. As a practical matter,
> not knowing the reservation granule size, a thread doing a reserve would
> have to assume any store *could* clear the reservation and program as if
> it always did.
>
> But at least you didn't say "the reservation is cleared by any store
> following the lwarx". That is clearly incorrect. :)

Sorry, but I'd wager my money on the earlier comment by the person with
the Rochester IBM email address first over even you, Joe. =) That bit
of levity aside, here is what Amazon Book II reads (1.7.3.1
Reservations):

A processor has at most one reservation at any time. A reservation is
established by executing a lwarx or ldarx instruction, and is lost (or
may be lost in the case of the fourth bullet) if any of the following
occur.

(*) The processor holding the reservation executes another lwarx or
ldarx: this clears the first reservation and establishes a new one.

(*) The processor holding the reservation executes any stwcx. or
stdcx., regardless of whether the specified address matches the address
specified by the lwarx or ldarx that established the reservation.

(*) Some other processor executes a Store or dcbz to the same
reservation granule, or modifies a Reference or Change bit (see Book
III, PowerPC Operating Environment Architecture) in the same
reservation granule.

(*) Some other processor executes a dcbtst, dcbst, or dcbf to the same
reservation granule: whether the reservation is lost is undefined.

(*) Some other mechanism modifies a storage location in the same
reservation granule.
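
The rules quoted above can be written down as a toy model, which makes it easier to see what is and is not on the list. This is a hedged sketch, not a description of any real chip. The class `ReservationModel` and its method names are hypothetical, the 128-byte granule is an assumption, the "undefined" fourth bullet is left out, and reporting failure for a stwcx. to a mismatched address is a conservative choice of the model.

```cpp
#include <cstdint>

// Toy model: one processor ("self") holds at most one reservation;
// everything else in the system is lumped together as "other".
class ReservationModel {
public:
    explicit ReservationModel(std::uintptr_t granule_bytes = 128)
        : granule_(granule_bytes) {}

    // A new lwarx replaces any earlier reservation.
    void lwarx(std::uintptr_t addr) {
        held_ = true;
        granule_index_ = addr / granule_;
    }

    // Any stwcx. by the holder uses up the reservation, whatever the address.
    // Returns whether the model treats the store as performed.
    bool stwcx(std::uintptr_t addr) {
        bool performed = held_ && (addr / granule_ == granule_index_);
        held_ = false;
        return performed;
    }

    // An ordinary store by the holder is not on the quoted list: no effect.
    void store_by_self(std::uintptr_t /*addr*/) {}

    // A store by another processor or mechanism into the granule clears it.
    void store_by_other(std::uintptr_t addr) {
        if (held_ && addr / granule_ == granule_index_) {
            held_ = false;
        }
    }

    bool held() const { return held_; }

private:
    std::uintptr_t granule_;
    std::uintptr_t granule_index_ = 0;
    bool held_ = false;
};
```

In this model a `store_by_self` between `lwarx` and `stwcx` leaves the reservation intact, while a `store_by_other` into the same granule makes the following `stwcx` fail.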

...so the processor that owns a reservation can safely execute a store
to the same reservation granule without clearing the reservation. My
guess is the last bullet applies to something like a DMA unit issuing a
DCLAIM/RWITM or some other ownership grab before performing a write.

If a plain-vanilla store in a processor cleared its own reservation,
that would be mentioned explicitly. If it did, it seems like it'd be a
horrible waste of space. So how does that impact structs that are
smashed against each other in the same cache line and which need
simultaneous locking? It doesn't and they don't need to be spaced a
cache line apart: the larx/stcx pair is used to set the lock variable
to a specific value; any store can clear the lock. That is one case
why you'd spin on larx and not execute a stcx: you use larx/stcx solely
to ensure atomicity of updating a software lock variable to "set".
Obviously any store op can clear the lock variable with no need of
gaining a reservation first: you already own the lock variable so you
can clear it.

Earlier processors like the 601/603 might have had issues (*shrugs*)
with things like "dangling lwarx" and stores clearing reservations and
whatnot, but the current crop does not. It's possible the spec
you're reading from was written to be overly conservative in order to
cover *all* PPC implementations.

Really, if someone is paranoid, he can test this on his own:

lwarx A # aligned to a page boundary
stw A+4
stwcx. A

...if the stwcx fails, your processor can't do it, however keep in mind
that it's suggested that context switches do a dummy stwcx to clear a
reservation so it's possible it can pass 99 out of 100 times. =)

HTH,
-Tony

Naresh Nayar

Apr 12, 2005, 12:23:58 AM

"Joe Seigh" <jsei...@xemaps.com> wrote in message

news:opso2tchgyqm36vk@grunion...

I was not clear. The processor holding a reservation will not lose the
reservation if it executes a store type of instruction (other than a stwcx
or stdcx).

Alexander Terekhov

Apr 12, 2005, 5:23:01 AM

byb...@rocketmail.com wrote:
[...]

> Earlier processors like the 601/603 might have had issues (*shrugs*)
> with things like "dangling lwarx" and stores clearing reservations and
> whatnot, but the current crop does not.

You seem to know more about this than IBM, so to speak. I'm talking
about puzzling "PowerPC 970 RISC Microprocessor Programming Note
Version 0.1 Preliminary June 18, 2003". It says nothing about earlier
processors like the 601/603 and it says nothing about PPC970 crops
(earlier/current/future/whatnot).

So (conspiracy theories aside for a moment), I suppose it's
something architectural... and details are top secret, I gather.

Oder?

regards,
alexander.

Dan Koren

Apr 12, 2005, 5:59:59 AM

<byb...@rocketmail.com> wrote in message
news:1113272430.1...@f14g2000cwb.googlegroups.com...

>
> ...so the processor that owns a reservation
> can safely execute a store to the same
> reservation granule without clearing
> the reservation.

Perhaps. However, writing any code that relies
on such behavior would be highly unsafe, and
non-portable to boot -- so why bother?

dk

Joe Seigh

Apr 12, 2005, 6:53:43 AM

On 11 Apr 2005 19:20:30 -0700, <byb...@rocketmail.com> wrote:

> Joe Seigh wrote:
>
>>
>> But at least you didn't say "the reservation is cleared by any store
>> following the lwarx". That is clearly incorrect. :)
>
> Sorry, but I'd wager my money on the earlier comment by the person with
> the Rochester IBM email address first over even you, Joe. =) That bit
> of levity aside, here is what Amazon Book II reads (1.7.3.1
> Reservations):
>

Sorry, but you need to look the meaning of "is" and "any", especially when
it's an "any" not qualified with "into the reservation granule".

--
Joe Seigh

byb...@rocketmail.com

Apr 12, 2005, 12:18:12 PM

Alexander Terekhov wrote:

> You seem to know more about this than IBM, so to speak. I'm talking
> about puzzling "PowerPC 970 RISC Microprocessor Programming Note
> Version 0.1 Preliminary June 18, 2003". It says nothing about earlier
> processors like the 601/603 and it says nothing about PPC970 crops
> (earlier/current/future/whatnot).

I'm not posting from my work address as I don't need the spam there,
but before it even starts, there's no James Bond action going on. =)

I mentioned 601/603 as a potential "what if?" example. Those were too
early for me to do design/verification on so I don't know what the case
is with them and how they implemented lwarx/stwcx is pure conjecture.
From looking at the green book they appear to be fine also.

That aside, from all the waves I've ever looked at involving atomic
ops, as Naresh states, plain-vanilla stores on the processor that owns
the reservation (to that cache line) are legit. I'll add the caveat
that this is true on the projects I've seen personally. Other
implementations could/"might" differ, but this is at odds with the
specs I've personally seen. YMMV. I have no idea what Motorola parts
do for instance, but even looking at the green book's list insertion
example it seems ok. Here's why...

Looking at appendix E.1.3 in the green book "The PowerPC Architecture",
I see the section on list insertion...the stw from the *other* proc is
the one that would cause the livelock given the right store completion
timing. (i.e., both processors execute sync before they each fall into
stwcx.) You'll notice they mention this specifically for an *MP*
configuration, not a uni...depending on the implementation, it could
work as intended or even fail near 100% of the time on both processors
if the timing is just right. (task switching in the OS would probably
allow some crawling forward progress by breaking the loops from being
in lockstep.)

Keep this in mind: if the stw cleared the reservation, why does this
code sequence work fine on a uni?

> So (conspiracy theories aside for a moment), I suppose it's
> something architectural... and details are top secret, I gather.

Doubtful, but anyway, as I've said before, if unsure, test it out
experimentally to see what the chip really does. Thanks for pointing
out the list insertion example. That's an interesting sequence of
code.

Cheers,
-t

Alexander Terekhov

Apr 12, 2005, 1:23:05 PM

byb...@rocketmail.com wrote:
>
> Alexander Terekhov wrote:
>
> > You seem to know more about this than IBM, so to speak. I'm talking
> > about puzzling "PowerPC 970 RISC Microprocessor Programming Note
> > Version 0.1 Preliminary June 18, 2003". It says nothing about earlier
> > processors like the 601/603 and it says nothing about PPC970 crops
> > (earlier/current/future/whatnot).
>
> I'm not posting from my work address as I don't need the spam there,
> but before it even starts, there's no James Bond action going on. =)

Looks like Broken Ear action. Ok, can you see the new subject?

http://tinyurl.com/4gk8f

regards,
alexander.

byb...@rocketmail.com

Apr 12, 2005, 1:50:22 PM

Alexander Terekhov wrote:

> Looks like Broken Ear action. Ok, can you see the new subject?
>
> http://tinyurl.com/4gk8f

...gotta love tinyurl.com!

It says "should" rather than "must". Regardless, as they are user-mode
accessible instructions, there has to be no way they could mess with
the system state outside of their userland sandbox so it doesn't matter
if they're paired or not paired or sprayed about randomly. However, if
you want them to function as an atomic read-modify-write op, they need
to be paired.

Think of this case:

lwarx A
then... < OS kills process >

...the stwcx never executes. Yes, I know the OS code is going to do a
dummy stcx anyway to zap any potentially pending reservations on a
context switch, however if this were a simple context switch and not a
process kill, stwcx would execute first thing when the process wakes up
which means they'd be "unbalanced"...but they already are because of
the dummy stcx. Really, it doesn't matter. If you're writing
kernel-level code though, by all means follow the app note to the
letter as you can corrupt other things in the system!
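
A minimal sketch of that context-switch case in C, as a toy model only. The reservation_t struct, the model_* helpers and os_context_switch are hypothetical names, and treating the reservation as "R bit plus exact address" is an assumption, not what any particular chip does. It just shows why the user-mode loop doesn't care: the kernel's dummy stwcx. makes the resumed stwcx. fail, and the loop goes around once more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one CPU's reservation: the R bit plus the reserved address. */
typedef struct {
    bool valid;
    const uint32_t *addr;
} reservation_t;

/* lwarx: load the word and (re)establish the reservation. */
static uint32_t model_lwarx(reservation_t *r, const uint32_t *p) {
    r->valid = true;
    r->addr = p;
    return *p;
}

/* stwcx.: store only if a matching reservation is held; always clears R. */
static bool model_stwcx(reservation_t *r, uint32_t *p, uint32_t v) {
    bool ok = r->valid && r->addr == p;
    if (ok)
        *p = v;
    r->valid = false;
    return ok;
}

/* Hypothetical kernel hook: dummy store-conditional to a scratch word. */
static void os_context_switch(reservation_t *r) {
    static uint32_t scratch;
    (void)model_stwcx(r, &scratch, 0);
}

/* Atomic increment via the usual retry loop. If preempt_first is set, a
   context switch lands between lwarx and stwcx. on the first attempt.
   Returns the number of attempts it took. */
int atomic_inc(reservation_t *r, uint32_t *p, bool preempt_first) {
    int attempts = 0;
    assert(r != NULL && p != NULL);
    for (;;) {
        uint32_t old = model_lwarx(r, p);
        attempts++;
        if (preempt_first && attempts == 1)
            os_context_switch(r);
        if (model_stwcx(r, p, old + 1))
            return attempts;
    }
}
```

Without preemption the increment lands on the first attempt; with the switch in the middle it takes two, and the counter still goes up by exactly one.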

Take care,
-t

Alexander Terekhov

Apr 13, 2005, 8:32:02 AM

byb...@rocketmail.com wrote:
>
> Alexander Terekhov wrote:
>
> > Looks like Broken Ear action. Ok, can you see the new subject?
> >
> > http://tinyurl.com/4gk8f
>
> ...gotta love tinyurl.com!
>

> It says "should" rather than "must". ...

I'm missing "because"/"if not ...".

> If you're writing
> kernel-level code though, by all means follow the app note to the
> letter as you can corrupt other things in the system!

Interesting. How so? Please elaborate.

regards,
alexander.

byb...@rocketmail.com

Apr 13, 2005, 11:24:52 AM

Alexander Terekhov wrote:

> > It says "should" rather than "must". ...
>
> I'm missing "because"/"if not ...".

As I said before, don't read too deeply into that app note. Based on
potential execution streams, the "pair" might not execute in a strict
lwarx+something+stwcx order. Case in point being a page fault or some
other interrupt/exception could prevent that from happening and the VM
subsystem has to go through the motions of paging the data
in--especially the first time the code is run. The system doesn't do
anything crazy like rewind all the way back to the lwarx: this isn't
reversible computing.

> > If you're writing
> > kernel-level code though, by all means follow the app note to the
> > letter as you can corrupt other things in the system!
>
> Interesting. How so? Please elaborate.

Take point #3 in that app note about the OS clearing a reservation via
a dummy stwcx. You have to do that or the potential exists that
process B will accidentally get A's reservation if B has a stwcx
without a matching lwarx (e.g., because of a context switch), R=1, and
the rsrv address matches the real address of the stwcx.
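
That hazard can be sketched with a toy single-CPU model in C. Everything here is an assumption for illustration: cpu_t, the model_* helpers, context_switch and its kernel_clears flag are hypothetical, and both "processes" are taken to share one real address for the counter. With the dummy stwcx. left out of the switch, B's stale store-conditional rides on A's reservation and succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: one CPU, one reservation (R bit plus reserved real address). */
typedef struct {
    bool valid;
    const uint32_t *addr;
} cpu_t;

static uint32_t model_lwarx(cpu_t *cpu, const uint32_t *p) {
    cpu->valid = true;
    cpu->addr = p;
    return *p;
}

/* Store only if R=1 and the address matches; always clears R. */
static bool model_stwcx(cpu_t *cpu, uint32_t *p, uint32_t v) {
    bool ok = cpu->valid && cpu->addr == p;
    if (ok)
        *p = v;
    cpu->valid = false;
    return ok;
}

/* Hypothetical context switch; kernel_clears selects the dummy stwcx. */
static void context_switch(cpu_t *cpu, bool kernel_clears) {
    static uint32_t scratch;
    if (kernel_clears)
        (void)model_stwcx(cpu, &scratch, 0);
}

/* Returns true if B's stale store-conditional goes through. */
bool stale_store_succeeds(bool kernel_clears) {
    cpu_t cpu = { false, NULL };
    uint32_t counter = 0;

    /* Process B: lwarx, then preempted before its stwcx. */
    uint32_t b_old = model_lwarx(&cpu, &counter);
    context_switch(&cpu, kernel_clears);

    /* Process A: one complete increment, then a second lwarx, then preempted. */
    uint32_t a_old = model_lwarx(&cpu, &counter);
    bool a_ok = model_stwcx(&cpu, &counter, a_old + 1);   /* counter is now 1 */
    assert(a_ok);
    (void)a_ok;
    (void)model_lwarx(&cpu, &counter);   /* A holds a fresh reservation */
    context_switch(&cpu, kernel_clears);

    /* Process B resumes at its stwcx. with the value it read before A ran. */
    return model_stwcx(&cpu, &counter, b_old + 1);
}
```

In the model, stale_store_succeeds(false) is the bad case: B writes back a value computed from a read that predates A's increment, so an update is lost. With the dummy stwcx. in the switch, B's store fails and B has to go around its loop again.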

I've said enough on this and won't comment further. What you're doing
in the code you're writing sounds fine. I do have to congratulate you
on bringing a very interesting question to the attention of this
newsgroup.

Best regards and good luck!
Tony

Jan Vorbrüggen

Apr 13, 2005, 11:37:46 AM

> Take point #3 in that app note about the OS clearing a reservation via
> a dummy stwcx. You have to do that or the potential exists that
> process B will accidentally get A's reservation if B has a stwcx
> without a matching lwarx (e.g., because of a context switch), R=1, and
> the rsrv address matches the real address of the stwcx.

Interesting. I believe that for Alpha's LL/SC, the reservation is always
automatically cancelled when taking an exception (a subcategory of which
being changing modes for a context switch). Why not for PPC?

Jan

byb...@rocketmail.com

Apr 13, 2005, 11:55:32 AM

Probably because a trace or FP unimplemented exception would then kill
a reservation. *shrugs* I'm sure there are good reasons.

Tony

Alexander Terekhov

Apr 13, 2005, 11:59:22 AM

byb...@rocketmail.com wrote:
[...]

> > > If you're writing
> > > kernel-level code though, by all means follow the app note to the
> > > letter as you can corrupt other things in the system!
> >
> > Interesting. How so? Please elaborate.
>
> Take point #3 in that app note about the OS clearing a reservation via
> a dummy stwcx. You have to do that or the potential exists that
> process B will accidentally get A's reservation if B has a stwcx
> without a matching lwarx (e.g., because of a context switch), R=1, and
> the rsrv address matches the real address of the stwcx.

That's all explained on Pg 53 in Book III (I'm referring to Version
2.01 December 2003 that I have here).

>
> I've said enough on this and won't comment further.

Can you pull that idiotic preliminary programming note off the site,
then? Please.

regards,
alexander.

Christian Bau

Apr 13, 2005, 5:37:16 PM

In article <3c4sm4F...@individual.net>,
Jan Vorbrüggen <jvorbrue...@mediasec.de> wrote:

Why waste hardware to save a single instruction in a context switch?

Jan Vorbrüggen

Apr 14, 2005, 3:15:56 AM

>>Interesting. I believe that for Alpha's LL/SC, the reservation is always
>>automatically cancelled when taking an exception (a subcategory of which
>>being changing modes for a context switch). Why not for PPC?
> Why waste hardware to save a single instruction in a context switch?

Wrong attitude, IMNSHO. Should any code path, however rare, that switches
contexts miss out on that instruction, a bug results that will be hell
squared to find. In addition, it does require that instruction to execute,
when all you need in the processor are a few gates to clear one bit of state.
How expensive can that possibly be?

Jan

Jan Vorbrüggen

Apr 14, 2005, 3:22:05 AM

> Probably because a trace or FP unimplemented exception would then kill
> a reservation. *shrugs* I'm sure there are good reasons.

That shouldn't happen in any sane implementation of a critical section
anyway.

IIRC, Alpha had the problem that the architecture definition said _no_
stores - or is it even memory ops? - of any colour are allowed between
the LL and the SC. The C/C++ compiler had a "bug" in that it didn't
strictly follow that rule, and when later hardware (21264, again IIRC)
actually enforced it, old code broke (deadlock, as you can imagine).
DEC offered a tool to check suspect code for the incorrect code sequences.

The PPC definition relaxes this restriction in a sensible way, while
putting more onus on the implementor of the memory subsystem to get
the corner cases correct.

If I had to make a decision, I'd probably go with the Alpha version:
there seems to be no compelling reason to allow memory ops in the
critical section, and the simpler the definition of the semantics, the
higher the probability the implementors won't screw up.

Jan

Joe Seigh

Apr 14, 2005, 6:36:44 AM

Depending on how you define the context, it may not be clear to the hardware
that a context switch is taking place. However, there should be APIs for
context switches, and those should be used. If you're doing roll-your-own
context switching on the fly you pretty much deserve what you get. We
get these brain-damaged cases all the time in c.p.t. from people who think
you can use setjmp/longjmp to implement user threads because their idiot of
a comp sci prof told them you could.

--
Joe Seigh

Anton Rang

May 5, 2005, 11:33:11 PM

Eric Smith <er...@brouhaha.com> writes:
> "Joe Seigh" <jsei...@xemaps.com> writes:
> > You're exposed to undocumented errata anytime you do assembler programming.
>
> If that's true, you're also exposed to undocumented errata every time you
> do programming, in C, C++, or any other language, unless the compiler was
> written by people that knew of the errata and how to work around it.

(a) Sometimes, the compilers do know about it! :-)

(b) Most of the undocumented errata that I've become aware of in
processors affect only privileged instructions, instructions used for
synchronization, or, occasionally, non-privileged instructions
accessing memory mapped in ways that are non-standard and require
privileged instructions to set up. These won't affect ordinary
applications, no matter the language, but someone programming in
assembly code could possibly run into them (especially if doing OS
work).

Intuitively, this makes sense. Most of the testing effort (especially
once the machine gets to real users) will be running user mode code.
Many of the synchronization or exception cases may be so difficult to
construct that they aren't found until the chip is in production.
More important, perhaps, is that an erratum which only affects
privileged code which has a valid workaround is far cheaper to
document ("undocumented errata" really means "documented only under
NDA") than to respin the chip.

(On a side note, I did some work on embedded microcontrollers in the
past, and there were errata which weren't fixed and didn't have
workarounds, simply because the software being developed at the time
didn't need that functionality, it would cost real money to fix it,
and the risk that some future software release might want/need that
functionality was deemed low enough that the cost wasn't considered
worthwhile. I suspect the same may happen in microprocessors.)

-- Anton

Terje Mathisen

May 6, 2005, 5:07:13 AM

Anton Rang wrote:

> Eric Smith <er...@brouhaha.com> writes:
>
>>"Joe Seigh" <jsei...@xemaps.com> writes:
>>
>>>You're exposed to undocumented errata anytime you do assembler programming.
>>
>>If that's true, you're also exposed to undocumented errata every time you
>>do programming, in C, C++, or any other language, unless the compiler was
>>written by people that knew of the errata and how to work around it.
>
>
> (a) Sometimes, the compilers do know about it! :-)

Most all x86 compilers implemented some form of the FDIV sw workaround I
developed (with help from several others) during Dec 94/Jan 95, at least
for the next two-three years.

Afaik, they currently just check if the bug is present, at least by default.
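
For illustration, a presence check along those lines can be sketched in C. This is the widely circulated 4195835 / 3145727 test case, not the workaround itself and not any particular compiler's runtime code; has_fdiv_bug is a made-up name. On a correct FPU the residue is zero or a rounding-sized crumb, while the flawed Pentium famously gave 256.

```c
#include <assert.h>
#include <stdbool.h>

/* The check below assumes ordinary 64-bit IEEE doubles. */
static_assert(sizeof(double) == 8, "expects 64-bit double");

/* Returns true if the divider shows the FDIV flaw on the classic test case. */
bool has_fdiv_bug(void) {
    /* volatile keeps the compiler from folding the division at build time */
    volatile double x = 4195835.0;
    volatile double y = 3145727.0;
    double residue = x - (x / y) * y;
    /* A healthy divider leaves a residue far below 1; the flawed one is off by 256. */
    return residue > 1.0 || residue < -1.0;
}
```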

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"