Atomic ops on Intel: do they sync?


Attila Feher

May 11, 2003, 4:20:47 AM
Hi,

Intel (and MS OSes) provide atomic integer operations. My Q is: do those
operations "flush the cache" or, in other (more precise) words: provide
visibility of _all_ the changes made so far to other CPUs? IMHO they should,
otherwise their applicability is very limited - but I cannot be sure; the
last asm I did was on a 286. :-(

Attila


David Schwartz

May 11, 2003, 7:38:40 PM

"Attila Feher" <attila...@lmf.ericsson.se> wrote in message
news:b9l16p$bs9$1...@newstree.wise.edt.ericsson.se...

Atomic operations are *NOT* directly useful for cross-CPU atomicity.
They provide atomicity against interrupts, which is what they're intended
for. If you tell us what specifically you're trying to do, we can tell you
the fastest way to do it. For example, 'xchg' is safe on x86 across CPUs.
Some other operations work with a 'LOCK' prefix.

DS


SenderX

May 11, 2003, 9:54:56 PM
Take a look at the Interlocked Acquire / Release APIs:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedincrementacquire.asp

I believe these have read / write memory barriers.

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com


SenderX

May 11, 2003, 9:59:33 PM
> Atomic operations are *NOT* directly useful for cross-CPU atomicity.

You're saying the InterlockedIncrement / Decrement APIs are not useful?

They work on non x86 systems as well.

Alexander Terekhov

May 12, 2003, 12:57:32 AM

SenderX wrote:
>
> Take a look at the Interlocked Acquire / Release APIs:
>
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedincrementacquire.asp
>
> I believe these have read / write memory barriers.

AFAIK, acquire / release barriers are "asymmetric" beasts. An acquire
imposes a reordering restriction with respect to succeeding ordinary
loads and stores (in the program order); it's a kinda "hoist"
barrier. A release imposes a reordering restriction on preceding
ordinary loads and stores (again, in the program order); it's a
kinda "sink" barrier. Neither is really useful without an OPERATION
(load/store/cas) that is "combined" with the barrier. What people
call "read / write" barriers is kinda "less sophisticated" stuff,
so to speak.

regards,
alexander.
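
In later C++11 terms -- a minimal sketch, assuming std::atomic, which
postdates this thread -- the hoist/sink pairing combined with an
operation looks like this:

8<---
#include <atomic>

std::atomic<int*> shared(nullptr);

// Writer: release is a "sink" barrier combined with the store --
// loads/stores that precede it in program order cannot be reordered
// past the store, so the payload is written before the pointer.
void publish() {
    int* p = new int(42);
    shared.store(p, std::memory_order_release);
}

// Reader: acquire is a "hoist" barrier combined with the load --
// loads/stores that follow it in program order cannot be reordered
// before the load, so dereferencing sees the published payload.
int consume() {
    int* p = shared.load(std::memory_order_acquire);
    return p ? *p : -1;
}
--->8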

SenderX

May 12, 2003, 2:32:44 AM
I'm kinda new to the weak memory model ;)


Does this scenario look right at all?


shared mem:

C_Object *pSharedObj = NULL;


Processor A: Inits the object

C_Object *pMyObj = new C_Object;

pSharedObj = pMyObj;

< Release Barrier >

This should flush Processor A's local cache to main memory, right?


Processor B: Uses the object

< Acquire Barrier >

This should wait for concurrent cache flushes from Processor A, and reload
Processor B's cache from main memory?

pSharedObj->DoSomething();


So a release is a main memory write barrier and acquire is a main memory
read barrier, right?

Please don't flame me too bad if I'm totally wrong on this issue. ;)

Attila Feher

May 12, 2003, 2:35:19 AM

OK Alexander. Now try to say it in a way that my IQ can catch it. And
especially my English. The question is simple: will an atomic integer op
(exch, incr, decr, whatever) ensure that _all_ writes I have made so far in
the thread (making this op) become visible to all other threads (well,
CPUs)?

A


Attila Feher

May 12, 2003, 2:32:36 AM

I think you have no idea what I am talking about. I am talking about the
atomic integer operations - the ones atomic with regard to threads. I dunno
what you are talking about.

A


Alexander Terekhov

May 12, 2003, 2:50:32 AM

Attila Feher wrote:
[...]

> The question is simple: will an atomic integer
> (exch,incr,decr,whatever) ensure that _all_ writes I have made so far in the
> thread (making this op) will become visible for all other threads (well,
> CPUs)?

"In general", *NO*.

regards,
alexander.

Alexander Terekhov

May 12, 2003, 2:51:44 AM

SenderX wrote:
>
> I'm kinda new to the weak memory model ;)
>
> Does this scenario look right at all?
>
> shared mem:
>
> C_Object *pSharedObj = NULL;
>
> Processor A: Inits the object
>
> C_Object *pMyObj = new C_Object;
>
> pSharedObj = pMyObj;
>
> < Release Barrier >
>
> This should flush Processor A's local cache to main memory right?

No.

>
> Processor B: Uses the object
>
> < Acquire Barrier >
>
> This should wait for concurrent cache flush's from processor A, and reload
> Processors B's cache from main memory?

No.

>
> pSharedObj->DoSomething();
>
> So a release is a main memory write barrier and acquire is a main memory
> read barrier, right?
>
> Please don't flame me too bad if I'm totally wrong on this issue. ;)

Try reading the entire "The Inventor of Portable DCI-aka-DCL" thread. ;-)

atomic<stuff*> instance_ptr = ATOMIC_INITIALIZER(0); // static

stuff & instance() {
    stuff * ptr;
    if (0 == (ptr = instance_ptr.load_ddrmb())) {
        ptr = new stuff();
        if (!instance_ptr.attempt_update_wmb(ptr, 0)) { // too late
            delete ptr;
            if (0 == (ptr = instance_ptr.load_ddrmb()))
                abort();
        }
        else { // only one thread can reach here
            static deleter<stuff> cleanup(ptr);
        }
    }
    return *ptr;
}

Well,

http://google.com/groups?threadm=3E60CF71.9784884F%40web.de
(Subject: Re: Acquire/Release memory synchronization....)

regards,
alexander.
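
For comparison, a sketch of the same double-checked pattern in C++11
terms (an assumption on my part; load_ddrmb/attempt_update_wmb above
are from Alexander's hypothetical atomic<> interface in the linked
thread, not a shipped library):

8<---
#include <atomic>

struct stuff { /* ... */ };

stuff & instance() {
    static std::atomic<stuff*> instance_ptr(nullptr);
    stuff * ptr = instance_ptr.load(std::memory_order_acquire);
    if (0 == ptr) {
        ptr = new stuff();
        stuff * expected = 0;
        // Publish with release; if another thread got there first, the
        // acquire failure ordering lets us safely use its object.
        if (!instance_ptr.compare_exchange_strong(
                expected, ptr,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            delete ptr;     // too late
            ptr = expected; // the winner's fully constructed object
        }
    }
    return *ptr;
}
--->8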

SenderX

May 12, 2003, 4:20:47 AM
Where exactly would one put acquire / release barriers in the following
pseudo-code, that tries to follow the example you posted:


/* Shared pointer */
C_Object *pObjShared = NULL;


C_Object* InstanceSharedPointer()
{
    /* Acquire barrier here? */

    /* Read shared pointer */
    C_Object *pLocalPtr =
        InterlockedCompareExchangePointer( &pObjShared, NULL, NULL );

    /* Release barrier here? */

    if ( pLocalPtr == NULL )
    {
        C_Object *pReturnedObj;

        /* Create a new object */
        pLocalPtr = new C_Object();

        /* Acquire barrier here? */

        /* Try and update the shared pointer */
        pReturnedObj =
            InterlockedCompareExchangePointer( &pObjShared, pLocalPtr, NULL );

        /* Release barrier here? */

        /* Check if the shared pointer was updated */
        if ( pReturnedObj != NULL )
        {
            delete pLocalPtr;

            return pReturnedObj;
        }
    }

    return pLocalPtr;
}

Was that pseudo-code correct in any sense? Or am I way off in left field? ;)

I am trying to learn how to deal with weak memory models =)

Alexander Terekhov

May 12, 2003, 5:05:42 AM

SenderX wrote:
>
> Where exactly would one put acquire / release barriers in the following
> pseudo-code, that tries to follow the example you posted:
>
> /* Shared pointer */
> C_Object *pObjShared = NULL;
>
> C_Object* InstanceSharedPointer()
> {
>
> /* Acquire barrier here? */

No.

>
> /* Read shared pointer */
> C_Object *pLocalPtr
> = InterlockedCompareExchangePointer
> ( &pObjShared,
> NULL,
> NULL );

According to Microsoft docs (applies to "Server 2003 and above"),
non-acq/rel interlocked stuff imposes a bidirectional load+store
barrier (aka "fence" in IA64 terms); the only problem is that...

>
> /* Release barrier here? */
>
> if ( pLocalPtr == NULL )

...it's not clear to me whether you can rely on it (i.e. the presence
of a memory barrier) if that MS-cas "fails" (pLocalPtr != NULL) --
check out the MS specs yourself.

> {
> C_Object *pReturnedObj;
>
> /* Create a new object */
> pLocalPtr = new C_Object();
>
> /* Acquire barrier here? */

No.

>
> /* Try and update the shared pointer */
> pReturnedObj =
> InterlockedCompareExchangePointer
> ( &pObjShared,
> pLocalPtr,
> NULL );
>
> /* Release barrier here? */

No. You don't need a "release barrier" here (a "full-stop" barrier
is injected by a "succeeded" MS-cas), but what you "might need"...

>
> /* Check if the shared pointer was updated */
> if ( pReturnedObj != NULL )

!= pLocalPtr, I guess.

> {
> delete pLocalPtr;
>
> return pReturnedObj;
> }
> }

...is a LOAD (rather: "hoist load") barrier prior to "return
pReturnedObj" (in the case of MS-cas "failure"); see my recent
post to "The Inventor of Portable DCI-aka-DCL" thread.

>
> return pLocalPtr;
> }

regards,
alexander.

> The designer of the SMP and HyperThread friendly ...

Replace "SMP and HyperThread" with "SMP and SMT and CMT". ;-)

Attila Feher

May 12, 2003, 5:19:10 AM

OK. And as concrete as for the existing Intel stuff? I know (well, I might
be mistaking) that the one I have made for SparcV9plus works a OK from
visibility point of view. The Intel stuff I mean is (one example):

8<---
__declspec(naked) bool __fastcall CompareAndSwap (volatile int* dest,
                                                  int source, int comparend)
{
    __asm {
        mov EAX, [ESP+4]
        ;// if([ECX]==EAX){ZF=1;[ECX]=EDX;}else ZF=0;
        lock cmpxchg dword ptr [ECX], EDX
        setz AL  // return boolean based on Z flag.
                 // that makes more sense than always
                 // returning comparend, regardless!
        ret 4
    }
}
--->8

AFAIS this does not do anything for visibility... I am not even sure if
this ensures visibility of the old value... Do you know what kind of
"membar" could be placed there, for "reads" before and "writes" after?

Attila
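
For what it's worth, recent MSVC exposes the same lock cmpxchg as a
compiler intrinsic, which sidesteps the naked-asm boilerplate (a sketch
assuming <intrin.h> and its _InterlockedCompareExchange; the wrapper
name is mine). It answers the atomicity question, but not by itself the
ordering/visibility one being asked here:

8<---
#include <intrin.h>

// Atomic across CPUs by virtue of the LOCK prefix the compiler emits;
// returns true iff *dest was equal to comparend and was replaced.
bool CompareAndSwap(volatile long* dest, long source, long comparend)
{
    return _InterlockedCompareExchange(dest, source, comparend) == comparend;
}
--->8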

SenderX

May 12, 2003, 5:42:04 AM
> > /* Read shared pointer */
> > C_Object *pLocalPtr
> > = InterlockedCompareExchangePointer
> > ( &pObjShared,
> > NULL,
> > NULL );
>
> According to Microsoft docs (applies to "Server 2003 and above"),
> none-acq/rel interlocked stuff imposes a bidirectional load+store
> barrier (aka "fence" in IA64 terms), the only problem is that...

So, would this work to read the shared pointer?

C_Object *pLocalPtr =
    InterlockedCompareExchangeAcquire( &pObjShared, NULL, NULL );


And this to try and update it?

C_Object *pReturnedPtr =
    InterlockedCompareExchangeRelease( &pObjShared, pLocalPtr, NULL );

> > The designer of the SMP and HyperThread friendly ...
>
> Replace "SMP and HyperThread" with "SMP and SMT and CMT". ;-)

I claim my library is SMP friendly because it contains collections that
allow more than one thread at a time to use them. Do you think my library
is ok? Try not to be too harsh ;)

Thanks for taking the time to help me and everyone else who reads this group.

You're a big help! =)

Alexander Terekhov

May 12, 2003, 6:09:25 AM

Attila Feher wrote:

[... CompareAndSwap()/IA32 asm ...]

> AFAIS this does not do anything for visibility... I am not even sure if
> this ensures visibility of the old value... Do you know what kind of
> "membar" could be place there, for "reads" before and "writes" after?

Attila, whatever-I-can-tell-you aside, the only thing that really counts
(compiler induced reordering aside for a moment) is "7.2. MEMORY ORDERING"
in "24547204.pdf". Please try reading that part of "IA-32 Intel(R)
Architecture Software Developer's Manual; Volume 3: System Programming
Guide" first. We can then discuss some details, I guess. ;-)

regards,
alexander.

Alexander Terekhov

May 12, 2003, 6:20:11 AM

SenderX wrote:
>
> > > /* Read shared pointer */
> > > C_Object *pLocalPtr
> > > = InterlockedCompareExchangePointer
> > > ( &pObjShared,
> > > NULL,
> > > NULL );
> >
> > According to Microsoft docs (applies to "Server 2003 and above"),
> > none-acq/rel interlocked stuff imposes a bidirectional load+store
> > barrier (aka "fence" in IA64 terms), the only problem is that...
>
> So, would this would work to read the shared pointer?
>
> C_Object *pLocalPtr
> = InterlockedCompareExchangeAcquire
> ( &pObjShared,
> NULL,
> NULL );

That does read it, for sure. The question is whether it "injects" a
memory fence in the case of (pLocalPtr != NULL). I don't know; you
should really ask MS.

>
> And this to try and update it?
>
> C_Object *pReturnedPtr
> = InterlockedCompareExchangeRelease
> ( &pObjShared,
> pLocalPtr,
> NULL );

That's OK to replace a NULL pObjShared with the pLocalPtr value (the
MS-cas operation "succeeds") and return the pLocalPtr value. Same
"problem" as above in the (pReturnedPtr != NULL) case. Ask MS. Note that
even if it DOES... "Release" is the wrong barrier in that case (you'll
need an acquire [better: "hoist load" only] barrier there).

regards,
alexander.
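
Alexander's caveat -- that the *failed* CAS path may need its own
"hoist load" barrier -- is precisely what C++11 later made expressible
by giving compare-exchange separate success and failure orderings. A
sketch under that assumption (not Microsoft's API):

8<---
#include <atomic>

struct C_Object;  // the object type from the pseudo-code above

// On success, acq_rel publishes the new object; on failure, acquire
// guarantees the pointer some other thread installed (left behind in
// 'expected') can be dereferenced safely.
bool try_publish(std::atomic<C_Object*>& shared,
                 C_Object*& expected, C_Object* desired)
{
    return shared.compare_exchange_strong(
        expected, desired,
        std::memory_order_acq_rel,     // success ordering
        std::memory_order_acquire);    // failure ordering
}
--->8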

Attila Feher

May 12, 2003, 6:20:39 AM

Fair enough! Does this thing talk about the fences?

Attila


Alexander Terekhov

May 12, 2003, 7:03:14 AM

<quote>

8. Reads cannot pass LFENCE and MFENCE instructions.

9. Writes cannot pass SFENCE and MFENCE instructions.

</quote>

You got me. ;-)

regards,
alexander.

Joseph Seigh

May 12, 2003, 7:14:53 AM

I was about to suggest that also.


>
> Fair enough! Does this thing talk about the fences?
>

I was just scanning through it now. Yes, but it's not a real good explanation.
It looks like the fences actually force the memory accesses to complete rather
than just ordering relative memory accesses. If strong ordering is not in effect,
reads can appear out of order, so you'd need an LFENCE if you needed ordering.
Writes do appear in order, but you have to remember that this is Intel, which has
never given much thought to backwards compatibility w.r.t. multiprocessing. They
are perfectly capable of throwing out a processor that would do writes out of
order, which would break your code unless your libraries were smart enough not
to initialize if they did not recognise the processor model. Even that may not
be enough. It would be interesting to discuss whether it's possible to future
proof Intel code, but that's a little OT I'm afraid. So you may want to use an
SFENCE just to be on the safe side.

Besides ordering memory accesses, Intel fences force accesses to complete. You
don't normally care about that part unless you are writing code to do context
switching and you want to ensure that hardware errors get thrown in the right
context.

I believe in win32, you can assume that the memory barriers are there if they
are needed. In win64 they are adding acquire/release versions of interlocked
functions, so you should use those. See the Intel Itanium reference manual for
"definitions" of acquire/release.

I can see that I need to put a formal definition of memory barriers on my
to-do list.

Joe Seigh

Attila Feher

May 12, 2003, 7:45:52 AM
Joseph Seigh wrote:
[SNIP]

> accesses. If strong ordering is not in effect, reads can appear out
> of order, so you'd need an LFENCE if you needed ordering. Writes do
> appear in order, but you have to remember that this is Intel, which
> has never given much thought to backwards compatibility w.r.t.
> multiprocessing. They are perfectly capable of throwing out a
> processor that would do writes out of order, which would break your
> code unless your libraries were smart enough not to initialize if
> they did not recognise the processor model. Even that may not be
> enough. It would be interesting to discuss whether it's possible to
> future proof Intel code, but that's a little OT I'm afraid. So you
> may want to use an SFENCE just to be on the safe side.
[SNIP]

Let me clarify if I have got you. So before I do the locked access to the
integer, I need to do an LFENCE, which will ensure that I see its "latest
value". Then if I have changed it (in a locked compare and swap), I will
need to do an SFENCE to make sure it gets visible to others. If I
understand this correctly, it means that the SFENCE will also make sure that
_all_ changes made before I updated my atomic int get visible to others
_if_ they touch (like read with an LFENCE) the atomic int before they
try to read them? Is this right?

Attila


Attila Feher

May 12, 2003, 7:42:51 AM
Alexander Terekhov wrote:
[SNIP]

>> Fair enough! Does this thing talk about the fences?
>
> <quote>
>
> 8. Reads cannot pass LFENCE and MFENCE instructions.
>
> 9. Writes cannot pass SFENCE and MFENCE instructions.
>
> </quote>
>
> You got me. ;-)

One more stupid Q. I need these kinda things for understanding... What
do those L & M & S stand for? I know what M&M is. :-)

Attila


Alexander Terekhov

May 12, 2003, 7:58:51 AM

L & M is for smoking and M & S is for shopping (if you have
some spare pounds; you can probably buy there L & M). ;-)

Other than that, L is for "Loads", S is for "Stores" and
"M" is for both.

regards,
alexander.

Alexander Terekhov

May 12, 2003, 8:36:40 AM

Attila Feher wrote:
[...]

> Let me clarify if I have got you. So before I do the locked access to the
> integer I need to do an LFENCE, that will ensure that I see its "latest
> value". Then if I have changed it (in a locked compare and swap) I will
> need to do an SFENCE to make sure it gets visible to others. ....

Nope. Study this page [not that I think that it's "entirely correct",
but]:

http://gee.cs.oswego.edu/dl/jmm/cookbook.html

"....
On all processors discussed below, it turns out that instructions
that perform StoreLoad also obtain the other three barrier effects,
so StoreLoad can serve as a general-purpose (but usually expensive)
Fence. (This is an empirical fact, not a necessity.) The opposite
doesn't hold though. It is NOT usually the case that issuing any
combination of other barriers gives the equivalent of a StoreLoad.
....
The listed barrier instructions are those designed for use with
normal program memory, but not necessarily other special forms/
modes of caching and memory used for IO and system tasks. For
example, on x86-SPO, StoreStore barriers ("sfence") are needed
with WriteCombining (WC) caching mode, which is designed for use
in system-level bulk transfers etc. OSes use Writeback mode for
programs and data, which doesn't require StoreStore barriers.

On x86 (both PO and SPO), any lock-prefixed instruction can be
used as a StoreLoad barrier. (The form used in linux kernels is
the no-op lock; addl $0,0(%%esp).) Versions supporting the "SSE2"
extensions (Pentium4 and later) support the mfence instruction
which seems preferable unless a lock-prefixed instruction like
CAS is needed anyway. The cpuid instruction also works but is
slower.
....
The x86-PO processors supporting "streaming SIMD" SSE2 extensions
require LoadLoad "lfence" only in connection with these
streaming instructions.
....
On sparc and x86, CAS has implicit preceding and trailing full
StoreLoad barriers. "

Attila, you owe me a pack of L & M. ;-)

regards,
alexander.
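
The linux-kernel idiom the cookbook mentions looks like this as GCC
inline asm (a sketch; the mfence path assumes SSE2, i.e. Pentium 4 or
later):

8<---
// StoreLoad barrier on IA-32: any lock-prefixed instruction works,
// hence the kernel's no-op locked add to the top of the stack.
inline void store_load_barrier() {
#if defined(__SSE2__)
    __asm__ __volatile__ ("mfence" ::: "memory");
#else
    __asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory");
#endif
}
--->8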

Attila Feher

May 12, 2003, 8:45:49 AM

How about M&M, it is more healthy. :-)

But wait! "On sparc and x86, CAS has implicit preceding and trailing full
StoreLoad barriers." This does not seem to be right. All atomic int
operations I have ever seen required an stbar (xchg) or a membar #LoadLoad
| #LoadStore (inc/dec with cas). Now that one above says they are not
needed. But if I leave them out (IIRC) it won't work properly.
Attila


Joseph Seigh

May 12, 2003, 9:10:34 AM

No, you don't need to do an LFENCE to get the latest value. You get that
automatically by virtue of strongly coherent memory. And it can always change
just after you've read it, but there's nothing you can do about that.

Memory accesses behave as if they were queued, except the queue has no ordering
constraints unless you put in fences, which impose ordering between the
groups of accesses they were inserted between.

So you have something that looks like this


proc 1 >----- access "queue" ----> memory <------ access "queue" ------< proc 2


If the access queue has "f1, f2, f3, f4" fetch requests in it, they can be
processed in any order unless you insert an LFENCE to impose an ordering
between groups of fetch requests. Ditto for store requests (except Intel
32-bit processors probably don't need it).

Where you put these depends on your program logic, what you want to be able to infer.

Compare and swap does a store and explicitly or implicitly does a load, so
you may need one or both of SFENCE and LFENCE depending on its usage.

If you look at JSR 133 and some other memory models, you will see that they model memory
with a set of values for current memory state, and a set of values for currently pending,
queued, memory accesses.

But as I have mentioned, you can define this without a memory model and it's
a lot simpler and easier to use. While JSR 133 is simpler and easier than the
original Java memory model, proving program correctness with it is still a
non-trivial task.

Joe Seigh

Alexander Terekhov

May 12, 2003, 9:12:23 AM

Attila Feher wrote:
[...]

> "On sparc and x86, CAS has implicit preceding and trailing full
> StoreLoad barriers." This does not seem to be right. ....

They mean "sparc-TSO" -- see "Here's how these processors support
barriers and atomics" table (it doesn't seem to be correct w.r.t.
isync/ppc...).

regards,
alexander.

Attila Feher

May 12, 2003, 9:15:20 AM
Joseph Seigh wrote:
[SNIP]

> Compare and swap does a store and explicity or implicitly does a
> load, so you may need
> one or both of SFENCE or LFENCE depending on its usage.

OK. So if I do a cas like this:

8<---
__declspec(naked) bool __fastcall CompareAndSwap (volatile int* dest,
                                                  int source, int comparend)
{
    __asm {
        mov EAX, [ESP+4]
        ;// if([ECX]==EAX){ZF=1;[ECX]=EDX;}else ZF=0;
        lock cmpxchg dword ptr [ECX], EDX
        setz AL  // return boolean based on Z flag.
                 // that makes more sense than always
                 // returning comparend, regardless!
        ret 4
    }
}
--->8

Where do I need to put SFENCE? LFENCE I cannot use, since I must support
PIII...

Attila


Joseph Seigh

May 12, 2003, 9:52:40 AM

Depends on your usage. If all you are going to do is use compare and swap to
do a store, then all you need is an SFENCE before. That's your release
semantics. If you are going to use the implied load value, then you need
an LFENCE afterwards. That's your acquire semantics.

You use acquire semantics for things like implementing a lock acquire routine,
and for lock-free dequeueing. Release semantics are for things like implementing
a lock release (unlock) routine, and for lock-free queueing. Note, lock-free
dequeueing may require an acquire barrier when it accesses the queue prior to
the attempted dequeue, so things can get a little complicated.

You can always put both acquire and release in so your compare and swap will
handle both cases.

SFENCE may not be required on ia32 (somebody should confirm this) and you'll
have to figure out a code versioning scheme to handle LFENCE.

Joe Seigh

Casper H.S. Dik

May 12, 2003, 10:33:10 AM
"Attila Feher" <attila...@lmf.ericsson.se> writes:

>But wait! "On sparc and x86, CAS has implicit preceding and trailing full
>StoreLoad barriers." This does not seem to be right. All atomic int
>operations I have ever seen required an stbar (xchg) or a membar #LoadLoad
>| #LoadStore (inc/dec with cas). Now that one above says they are not
>needed. But if I leave them out (IIRC) it won't work properly.


It certainly isn't correct for SPARC except for the addresses directly
affected by the CAS instructions (if that weren't the case, the instruction
would be rather pointless).

So a typical lock sequence would be "CAS, MEMBAR #LoadStore|#LoadLoad"
and unlock would be "ST, MEMBAR #StoreStore".

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
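
Casper's two sequences are the acquire/release pairing in disguise;
restated as a spin lock sketch in C++11 terms (illustrative only, not
SPARC asm):

8<---
#include <atomic>

std::atomic<int> lockword(0);

void lock() {
    // CAS + MEMBAR #LoadStore|#LoadLoad == an acquire operation:
    // nothing in the critical section is hoisted above the CAS.
    int expected = 0;
    while (!lockword.compare_exchange_weak(expected, 1,
                                           std::memory_order_acquire))
        expected = 0;  // CAS left the observed value here; retry
}

void unlock() {
    // The #StoreStore membar plus the ST == a release store: the
    // critical section's stores are ordered before the lock is seen
    // free (the barrier belongs before the ST, as noted downthread).
    lockword.store(0, std::memory_order_release);
}
--->8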

Alexander Terekhov

May 12, 2003, 10:43:31 AM

Joseph Seigh wrote:

[... lock cmpxchg ...]

> > Where do I need to put SFENCE? LFENCE I cannot use, since I must support
> > PIII...
> >
>
> Depends on your usage. If all you are going to do is use compare and swap to
> do a store, then all you need is a SFENCE before. That's your release
> semantics. If you are going to use the implied load value, then you need
> an LFENCE afterwards. That's your acquire semantics.

This would be true for a "naked" CAS. My understanding is that neither
IA32 nor IA64 has a "naked" CAS. On IA32, the lock prefix seems to have
a bidirectional "full-stop" fence effect for both loads and stores
(too much software would break if they relaxed it, to begin
with). Do you agree/is this correct?

regards,
alexander.

Alexander Terekhov

May 12, 2003, 10:44:41 AM

"Casper H.S. Dik" wrote:
>
> "Attila Feher" <attila...@lmf.ericsson.se> writes:
>
> >But wait! "On sparc and x86, CAS has implicit preceding and trailing full
> >StoreLoad barriers." This does not seem to be right. All atomic int
> >operations I have ever seen required an stbar (xchg) or a membar #LoadLoad
> >| #LoadStore (inc/dec with cas). Now that one above says they are not
> >needed. But if I leave them out (IIRC) it won't work properly.
>
> It certainly isn't correct for SPARC except for the addresses directly
> affected by the CAS instructions (if that weren't the case, the instruction
> would be rather pointless).
>
> So a typical lock sequence would be "CAS, MEMBAR #LoadStore|#LoadLoad"
> and unlock would be "ST, MEMBAR #StoreStore".

But you're talking about SMARC-RMO, not SPARC-TSO (PSO aside for a
moment), right?

regards,
alexander.

Alexander Terekhov

May 12, 2003, 10:55:01 AM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
>
> [... lock cmpxchg ...]
>
> > > Where do I need to put SFENCE? LFENCE I cannot use, since I must support
> > > PIII...
> > >
> >
> > Depends on your usage. If all you are going to do is use compare and swap to
> > do a store, then all you need is a SFENCE before. That's your release
> > semantics. If you are going to use the implied load value, then you need
> > an LFENCE afterwards. That's your acquire semantics.
>
> This would be true for "naked" CAS.

Correction: SFENCE wouldn't be enough for release and LFENCE
wouldn't be enough for acquire. Acquire/Release *both* affect
load AND store {re}ordering.

Joseph Seigh

May 12, 2003, 12:21:30 PM

Alexander Terekhov wrote:
>
> Alexander Terekhov wrote:
> >
> > Joseph Seigh wrote:
> >
> > [... lock cmpxchg ...]
> >
> > > > Where do I need to put SFENCE? LFENCE I cannot use, since I must support
> > > > PIII...
> > > >
> > >
> > > Depends on your usage. If all you are going to do is use compare and swap to
> > > do a store, then all you need is a SFENCE before. That's your release
> > > semantics. If you are going to use the implied load value, then you need
> > > an LFENCE afterwards. That's your acquire semantics.
> >
> > This would be true for "naked" CAS.
>
> Correction: SFENCE wouldn't be enough for release and LFENCE
> wouldn't be enough for acquire. Acquire/Release *both* affect
> load AND store {re}ordering.

I don't believe so. Of course I really need to do the formal semantics, and then
look at particular implementations and verify that they do in fact correctly
implement the semantics. That's usually how you are supposed to do things. Of
course there could be different ideas of what acquire and release mean. That
has been known to happen when you don't have formal definitions. Makes discussing
issues about them fun, doesn't it. Yeah, we don't need formal definitions, it
will just confuse everyone.

>
> > My understanding is that neither
> > IA32 nor IA64 has a "naked" CAS. On IA32, the lock prefix seems to have
> > a bidirectional "full-stop" fence effect for both loads and stores
> > (too much software would break if they relaxed it, to begin
> > with). Do you agree/is this correct?
> >

IA64 has acquire/release modifiers, so I'm pretty sure a naked CAS doesn't have
them if it exists. For IA32, the manual only discusses locking cache, "address
locking". It doesn't say anything about visibility w.r.t. memory accesses by
other instructions. It's possible, given the undisciplined and completely naive
software that is out there. Also, Microsoft is probably running in TSO mode
to avoid these kinds of problems, which only guarantees that it will be harder
to switch to RSO memory mode.

Joe Seigh

Joseph Seigh

May 12, 2003, 12:43:14 PM

Attila Feher wrote:
>
> Where do I need to put SFENCE? LFENCE I cannot use, since I must support
> PIII...
>

On earlier Intel processors, there was a trick, mentioned in some of the Intel
assembler cookbooks but undocumented by Intel, for generating a memory barrier:
do a store into a location followed by a load from the same location.
That only worked because the architecture claimed that a load had to return a
value from storage. That won't work today because somebody realized that was
a completely stupid requirement to impose on hardware. They have "write back"
enabled now. It also depended on fetches maintaining relative order.
But if you need to support something that far back, you might look into whether
that works on a 386.

S370 did that, it actually hardware checkpointed. You could really slow things down
by doing that. I wonder if VM went and found all the code that set bits by doing
an NI and OI, as mandated by the old rulechek (coding standards) ... umm committee
(almost invoked Godwin's law there), rather than a STC.

Joe Seigh

Alexander Terekhov

May 12, 2003, 12:48:25 PM

Joseph Seigh wrote:
[...]

> > > > Depends on your usage. If all you are going to do is use compare and swap to
> > > > do a store, then all you need is a SFENCE before. That's your release
> > > > semantics. If you are going to use the implied load value, then you need
> > > > an LFENCE afterwards. That's your acquire semantics.
> > >
> > > This would be true for "naked" CAS.
> >
> > Correction: SFENCE wouldn't be enough for release and LFENCE
> > wouldn't be enough for acquire. Acquire/Release *both* affect
> > load AND store {re}ordering.
>
> I don't believe so. ....

David Butenhof surely can confirm/explain. ;-)

Pls take a look at (both msgs):

http://www.cs.umd.edu/~pugh/java/memoryModel/archive/1220.html
http://www.cs.umd.edu/~pugh/java/memoryModel/archive/1222.html
(Subject: JavaMemoryModel: Cookbook: barriers)

regards,
alexander.

Joseph Seigh

May 12, 2003, 1:10:53 PM

Java monitors are overly strict. Plus Java synchronization is defined by
a meta implemetation, the memory model. So basically you have to do whatever
it says if you are doing a JVM implementation. That has no bearing on
non Java lock definitions.

Joe Seigh

Joseph Seigh

May 12, 2003, 1:21:22 PM

"Casper H.S. Dik" wrote:
>
>
> So a typical lock sequence would be "CAS, MEMBAR #LoadStore|#LoadLoad"
> and unlock would be "ST, MEMBAR #StoreStore".

Unlock should be "MEMBAR #StoreStore, ST". Also the ST should be atomic.
Otherwise, a subsequent lock attempt could see the lock free and acquire
it before all the "locked" stores have completed since there would be
nothing to order the "locked" stores and the unlock store relative to
each other.

Also the membar following the lock CAS only needs to be a #LoadLoad
as far as I know.

Joe Seigh

Alexander Terekhov

May 12, 2003, 1:39:03 PM

Joseph Seigh wrote:
[...]

> > and unlock would be "ST, MEMBAR #StoreStore".
>
> Unlock should be "MEMBAR #StoreStore, ST". ....

Unlock should be "MEMBAR #StoreStore !RMO and PSO only
MEMBAR #LoadStore !RMO only
ST".

regards,
alexander.

Joseph Seigh

May 12, 2003, 2:38:28 PM

That's probably right. I'm not really familiar with, or used to working
with, Sparc memory barriers.

Joe Seigh

David Schwartz

May 12, 2003, 3:25:46 PM

"SenderX" <x...@xxx.com> wrote in message
news:9eDva.554441$OV.529742@rwcrnsc54...
> > Atomic operations are *NOT* directly useful for cross-CPU atomicity.

> You're saying the InterlockedIncrement / Decrement APIs are not useful?
>
> They work on non x86 systems as well.

I think I misread the original question. I (mis?)interpreted it as
referring to assembly instructions. Now I think he was talking about OS
atomic operations. These are generally defined as providing cross-CPU
synchronization, and they definitely do on Windows. Just check the
documentation for the particular feature you're going to use.

DS
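
Concretely, the Windows operation under discussion (a minimal sketch;
InterlockedIncrement is documented to be atomic across CPUs and to
return the incremented value):

8<---
#include <windows.h>

volatile LONG counter = 0;

// Safe from any thread on any CPU: one lock-prefixed read-modify-write.
// The return value is the result this caller produced -- testing it is
// race-free, unlike a separate re-read of 'counter'.
LONG bump(void)
{
    return InterlockedIncrement(&counter);
}
--->8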


David Schwartz

May 12, 2003, 3:31:09 PM

"Attila Feher" <attila...@lmf.ericsson.se> wrote in message
news:b9nf8h$7f5$1...@newstree.wise.edt.ericsson.se...

> I think you have no idea what I am talking about. I am talking about the
> atomic integer operations - the ones atomic with regard to threads. I dunno
> what you are talking about.

If you're talking about some particular operation on some particular
operating system, why not read the documentation for that operation on that
operating system rather than leaving people to guess what you're talking
about?

Some operating systems have atomic integer operations. Some of them are
probably useful on multi-CPU systems. Some of them might flush caches.
Generally, they provide visibility only for the particular integer affected.

DS


SenderX

May 12, 2003, 10:03:17 PM
> If you're talking about some particular operation on some particular
> operating system, why not read the documentation for that operation on that
> operating system rather than leaving people to guess what you're talking
> about?

The OS provides the APIs needed to access the hardware atomic increment /
decrement opcodes.

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com


Attila Feher

May 13, 2003, 12:19:05 AM


No OS. Today this OS, tomorrow that OS, after that maybe no OS. Whatever.
So what I need is an assembly routine which does it right: if it needs a
membar/fence, it uses the right one, etc.
Attila


Attila Feher

May 13, 2003, 12:11:55 AM


I have no idea what ST is. And I have absolutely no idea what your first
sentence means. What lock are you talking about? I am talking about an atomic
integer operation (with regard to SMP architecture and threads); I am not
making a lock.

A


Attila Feher

May 13, 2003, 12:27:10 AM
Joseph Seigh wrote:
> Attila Feher wrote:
[SNIP]

>> Where do I need to put SFENCE? LFENCE I cannot use, since I must
>> support PIII...
>>
>
> Depends on your usage. If all you are going to do is use compare and
> swap to
> do a store, then all you need is a SFENCE before. That's your release
> semantics. If you are going to use the implied load value, then you
> need
> an LFENCE afterwards. That's your acquire semantics.

OK. But SFENCE makes sure all writes happen from this CPU, so why do I want
to put it before something that makes sure that I can _read_ what others
wrote?

And LFENCE makes sure all loads happen. So why should I put it after the
CAS? No offence meant, I just don't get it. :-(

> You use acquire semantics for things like implementing a lock acquire
> routine.

Basically I'd rather do reference counted stuff (non-COW) in a
supplier-consumer model, meaning there is always only one thread touching
the "pointed-to" object.

> and for lock-free dequeueing.

And this is the one I am mostly interested in. Where can I find something
about this? I mean something which is understandable for people not
speaking YABA.

> Release semantics are for things like
> implementing an lock release (unlock) routine, and for lock-free
> queueing. Note, lock-free dequeueing may require acquire barrier
> when it access the queue prior to the attempted dequeue so things can
> get a little complicated.

That is what I would like to do in C++ (as a template). But I would need
(of course) to make sure it is portable to at least Intel and SparcV9Plus to
make any sense.

> You can always put both acquire and release in so your compare and
> swap will handle both cases.

Well... Won't it make things slower?

> SFENCE may not be required on ia32 (somebody should confirm this) and
> you'll have to figure out a code versioning scheme to handle LFENCE.

Trouble is that the last asm I used on Intel was for a 286... and
even there I only used 8086 stuff. :-( I have some fantastic books,
but they do not cover even the Pentium II. :-(

Attila


Alexander Terekhov

May 13, 2003, 6:56:10 AM

Attila Feher wrote:
[...]
> Basically I rather do reference counted stuff ...

http://terekhov.de/pthread_refcount_t/draft-edits.txt
http://terekhov.de/pthread_refcount_t/poor-man/beta2/prefcnt.h

Now,

#include <cerrno> // for EDOM below
#include <cassert>
#include <cstddef>

#define __STDC_LIMIT_MACROS // see C99 std
#include <stdint.h> // see C99 std

#define PTHREAD_REFCOUNT_MAX SIZE_MAX
#define PTHREAD_REFCOUNT_DROPPED_TO_ZERO EDOM // for now
#define PTHREAD_REFCOUNT_INITIALIZER(N) { N }

struct pthread_refcount_t_ {
    /*std::*/atomic<std::size_t> atomic;
};

typedef struct pthread_refcount_t_ pthread_refcount_t;

int pthread_refcount_getvalue(
    pthread_refcount_t * refcount
  , std::size_t * value_ptr
  )
{
    *value_ptr = refcount->atomic.load(); // Naked
    return 0;
}

int pthread_refcount_setvalue(
    pthread_refcount_t * refcount
  , std::size_t value
  )
{
    refcount->atomic.store(value); // Naked
    return 0;
}

int pthread_refcount_increment(
    pthread_refcount_t * refcount
  )
{
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        assert(PTHREAD_REFCOUNT_MAX > val);
    } while (!refcount->atomic.attempt_update(val, val+1)); // Naked
    return 0;
}

int pthread_refcount_add(
    pthread_refcount_t * refcount
  , std::size_t value
  )
{
    if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
    std::size_t val, max = PTHREAD_REFCOUNT_MAX - value;
    do {
        val = refcount->atomic.load(); // Naked
        if (max < val) return ERANGE;
    } while (!refcount->atomic.attempt_update(val, val+value)); // Naked
    return 0;
}

int pthread_refcount_increment_positive(
    pthread_refcount_t * refcount
  )
{
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        if (!val) return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        assert(PTHREAD_REFCOUNT_MAX > val);
    } while (!refcount->atomic.attempt_update(val, val+1)); // Naked
    return 0;
}

int pthread_refcount_add_to_positive(
    pthread_refcount_t * refcount
  , std::size_t value
  )
{
    if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
    std::size_t val, max = PTHREAD_REFCOUNT_MAX - value;
    do {
        val = refcount->atomic.load(); // Naked
        if (!val) return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        if (max < val) return ERANGE;
    } while (!refcount->atomic.attempt_update(val, val+value)); // Naked
    return 0;
}

int pthread_refcount_decrement_acqmsync(
    pthread_refcount_t * refcount
  )
{
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        assert(val);
        if (1 == val) {
            refcount->atomic.store_acq(0); // Acquire
            return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        }
    } while (!refcount->atomic.attempt_update(val, val-1)); // Naked
    return 0;
}

int pthread_refcount_decrement_relmsync(
    pthread_refcount_t * refcount
  )
{
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        assert(val);
        if (1 == val) {
            refcount->atomic.store(0); // Naked
            return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        }
    } while (!refcount->atomic.attempt_update_rel(val, val-1)); // Release
    return 0;
}

int pthread_refcount_decrement(
    pthread_refcount_t * refcount
  )
{
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        assert(val);
        if (1 == val) {
            refcount->atomic.store_acq(0); // Acquire
            return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        }
    } while (!refcount->atomic.attempt_update_rel(val, val-1)); // Release
    return 0;
}

int pthread_refcount_subtract_acqmsync(
    pthread_refcount_t * refcount
  , std::size_t value
  )
{
    if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        if (value > val) return ERANGE;
        if (value == val) {
            refcount->atomic.store_acq(0); // Acquire
            return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        }
    } while (!refcount->atomic.attempt_update(val, val-value)); // Naked
    return 0;
}

int pthread_refcount_subtract_relmsync(
    pthread_refcount_t * refcount
  , std::size_t value
  )
{
    if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        if (value > val) return ERANGE;
        if (value == val) {
            refcount->atomic.store(0); // Naked
            return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        }
    } while (!refcount->atomic.attempt_update_rel(val, val-value)); // Release
    return 0;
}

int pthread_refcount_subtract(
    pthread_refcount_t * refcount
  , std::size_t value
  )
{
    if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
    std::size_t val;
    do {
        val = refcount->atomic.load(); // Naked
        if (value > val) return ERANGE;
        if (value == val) {
            refcount->atomic.store_acq(0); // Acquire
            return PTHREAD_REFCOUNT_DROPPED_TO_ZERO;
        }
    } while (!refcount->atomic.attempt_update_rel(val, val-value)); // Release
    return 0;
}

Bug-reports/suggestions/objections/whatever are quite welcome. ;-)

regards,
alexander.

Joseph Seigh

May 13, 2003, 7:00:18 AM

I wrote:
>
> "Casper H.S. Dik" wrote:
> >
> >
> > So a typical lock sequence would be "CAS, MEMBAR #LoadStore|#LoadLoad"
> > and unlock would be "ST, MEMBAR #StoreStore".
>
>

> Also the membar following the lock CAS only needs to be a #LoadLoad
> as far as I know.
>

No, #LoadStore|#LoadLoad is probably right. They're doing out of
order execution now, aren't they?

Joe Seigh

Alexander Terekhov

May 13, 2003, 7:35:17 AM

Joseph Seigh wrote:
[...]

> > > So a typical lock sequence would be "CAS, MEMBAR #LoadStore|#LoadLoad"
> > > and unlock would be "ST, MEMBAR #StoreStore".
> >
> >
> > Also the membar following the lock CAS only needs to be a #LoadLoad
> > as far as I know.
> >
>
> No, #LoadStore|#LoadLoad is probably right.

Right.

http://www.sparc.com/standards/SPARCV9.pdf
(see "J.6 Spin Locks")

regards,
alexander.

Casper H.S. Dik

May 13, 2003, 8:23:50 AM
Alexander Terekhov <tere...@web.de> writes:


>int pthread_refcount_getvalue(
> pthread_refcount_t * refcount
> , std::size_t * value_ptr
> )
>{
> *value_ptr = refcount->atomic.load(); // Naked
> return 0;
>}

This seems rather pointless; what use is an "atomic load"? It
doesn't tell you anything about the current value of the object.

>int pthread_refcount_setvalue(
> pthread_refcount_t * refcount
> , std::size_t value
> )
>{
> refcount->atomic.store(value); // Naked
> return 0;
>}

What use is an atomic store? The value after calling this function is
indeterminate.

>int pthread_refcount_increment(
> pthread_refcount_t * refcount
> )
>{
> std::size_t val;
> do {
> val = refcount->atomic.load(); // Naked
> assert(PTHREAD_REFCOUNT_MAX > val);
> } while (!refcount->atomic.attempt_update(val, val+1)); // Naked
> return 0;
>}

Now, this is useful, but why not provide just:

>int pthread_refcount_add(
> pthread_refcount_t * refcount
> , std::size_t value
> )
>{
> if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
> std::size_t val, max = PTHREAD_REFCOUNT_MAX - value;
> do {
> val = refcount->atomic.load(); // Naked
> if (max < val) return ERANGE;
> } while (!refcount->atomic.attempt_update(val, val+value)); // Naked
> return 0;
>}


Also, these functions appear to lack some functionality; e.g., it appears to
be impossible to determine whether a certain value has been reached.

One typical use would be:

    pthread_refcount_add(&obj->refcnt, -1);

    if (obj->refcnt == 0)
        ..

except that this is unsafe (including when using an atomic way to
get the refcnt).

I think it's much more useful to provide fewer functions. Are
the overflow checks useful or should we just mimic C integral
behaviour?

pthread_refcount_t
pthread_refcount_add(pthread_refcount_t *valp, pthread_refcount_t delta)
{
    <do the atomic stuff>
    return new value
}

On some architectures, obtaining the new value is difficult or more
expensive, so perhaps there should be two kinds: a function returning the
new value and a void counterpart.

if (pthread_refcount_add_newvalue(&obj->refcnt, -1) == 0) {
    /* last to free, discard */
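
Casper's "function returning the new value" is essentially
fetch-and-add; a C++11 sketch of that shape (the names are mine, not
from his post):

8<---
#include <atomic>
#include <cstddef>

// Returns the post-decrement value so the caller can test for zero
// without a separate (racy) read of the counter.
std::size_t refcount_sub(std::atomic<std::size_t>& rc, std::size_t delta)
{
    return rc.fetch_sub(delta, std::memory_order_acq_rel) - delta;
}

// Usage, mirroring the example above:
//   if (refcount_sub(obj->refcnt, 1) == 0) { /* last to free, discard */ }
--->8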

Alexander Terekhov

May 13, 2003, 9:24:01 AM
http://terekhov.de/pthread_refcount_t/draft-edits.txt

"Casper H.S. Dik" wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> >int pthread_refcount_getvalue(
> > pthread_refcount_t * refcount
> > , std::size_t * value_ptr
> > )
> >{
> > *value_ptr = refcount->atomic.load(); // Naked
> > return 0;
> >}
>
> This seems rather pointless; what use is an "atomic load"? It
> doesn't tell you anything about the current value of the object.

The pthread_refcount_getvalue() function shall update the location
specified by the /value_ptr/ argument to have the value of the
reference count specified by /refcount/ without affecting the value
of the reference count. The updated value represents an actual
reference count value that occurred at some unspecified time during
the call, but it need not be the actual value of the reference count
when it is returned to the calling thread.

>
> >int pthread_refcount_setvalue(
> > pthread_refcount_t * refcount
> > , std::size_t value
> > )
> >{
> > refcount->atomic.store(value); // Naked
> > return 0;
> >}
>
> What use is an atomic store? The value after calling this function is
> indeterminate.

The pthread_refcount_setvalue() function shall update the value of the
reference count specified by /refcount/ to become equal to /value/.
The results are undefined if pthread_refcount_setvalue() is called
while another thread is operating on the reference count. The results
are undefined if these functions are called with an uninitialized
reference count.

<example source=http://google.com/groups?selm=3E4B6227.87DCBA45%40web.de>

size_t IntAtomicGet( pthread_refcount_t& refs ) {
    size_t result;
    pthread_refcount_getvalue( &refs, &result );
    return result;
}

.
.
.

inline String::~String() {
    if ( 2 > IntAtomicGet( data_->refs ) ||
         1 > IntAtomicDecrement( data_->refs ) )
        delete data_;
}

inline String::String( const String& other )
{
    switch ( IntAtomicGet( other.data_->refs ) ) {
        case 00: data_ = new StringBuf( *other.data_ ); break;
        default: IntAtomicIncrement( (data_ = other.data_)->refs );
    }
    ++nCopies;
}

</example>

>
> >int pthread_refcount_increment(
> > pthread_refcount_t * refcount
> > )
> >{
> > std::size_t val;
> > do {
> > val = refcount->atomic.load(); // Naked
> > assert(PTHREAD_REFCOUNT_MAX > val);
> > } while (!refcount->atomic.attempt_update(val, val+1)); // Naked
> > return 0;
> >}
>
> Now, this is useful, but why not provide just:
>
> >int pthread_refcount_add(
> > pthread_refcount_t * refcount
> > , std::size_t value
> > )
> >{
> > if (PTHREAD_REFCOUNT_MAX < value) return ERANGE;
> > std::size_t val, max = PTHREAD_REFCOUNT_MAX - value;
> > do {
> > val = refcount->atomic.load(); // Naked
> > if (max < val) return ERANGE;
> > } while (!refcount->atomic.attempt_update(val, val+value)); // Naked
> > return 0;
> >}

Because "add"/"subtract" operations have less "undefined behaivor"
(they provide full range checking) and hence they're more "expensive".
Increments and decrements are less expensive operations that cover
most of typical usage scenarious.

>
> Also, these functions appear to lack some functionality; e.g., it appears to
> be impossible to determine whether a certain value has been reached.
>
> One typical use would be:
>
> pthread_refcount_add(&obj->refcnt, -1);
>
> if (obj->refcnt == 0)

The interface is designed to allow additions and subtractions
of NON-NEGATIVE values. The "typical use" you're probably
talking about would be:

<example source=http://google.com/groups?selm=3E4CF069.6FD0481D%40web.de>

class refs {
public:

    refs() {
        pthread_refcount_init(&strong_count, 1);
        pthread_refcount_init(&weak_count, 1);
    }

    ~refs() {
        pthread_refcount_destroy(&weak_count);
        pthread_refcount_destroy(&strong_count);
    }

    //*** Called by existing "strong_ptr".

    void acquire_strong() {
        pthread_refcount_increment(&strong_count);
    }

    void release_strong() {
        int status = pthread_refcount_decrement(&strong_count);
        if (PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status) {
            destruct_object();
            status = pthread_refcount_decrement_rel(&weak_count);
            if (PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status)
                destruct_self();
        }
    }

    void acquire_weak_from_strong() {
        acquire_weak();
    }

    //*** Called by existing "weak_ref".

    void acquire_weak() {
        pthread_refcount_increment( &weak_count );
    }

    void release_weak() {
        int status = pthread_refcount_decrement_acq(&weak_count);
        if (PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status)
            destruct_self();
    }

    bool acquire_strong_from_weak() {
        int status = pthread_refcount_increment_positive(&strong_count); // _add_to_positive(&refcount, 1);
        if (PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status) // Ouch, did not work [too late].
            return false;
        return true;
    }

private:

    void destruct_object(); // "delete p_object".
    void destruct_self();   // "delete this".

    pthread_refcount_t strong_count;
    pthread_refcount_t weak_count;

    T* p_object;

}; //*** class refs

</example>

[...]


> pthread_refcount_t
> pthread_refcount_add(pthread_refcount_t *valp, pthread_refcount_t delta)
> {
>     <do the atomic stuff>
>     return new value
> }

pthread_refcount_t is a *synchronization object*. Just like with
a mutex... you simply can't copy/return it.

Thanks for your comments.

regards,
alexander.

Doug Lea

May 13, 2003, 10:38:20 AM
Thanks to David Holmes for telling me I should read and post to this
c.l.t thread. Here are a few follow-ups on points raised about my
http://gee.cs.oswego.edu/dl/jmm/cookbook.html

1. The cookbook page was written to try to help JVM providers conform
to the new JSR-133 spec, and also as a check by those of us making the
spec to determine the impact of the memory model across different
processors. We try to keep it accurate, but it is just an unofficial
guide. Also, some details are highly specific to Java, so might not
have much to do with use of barriers in other languages.

2. In most programming languages, it can be tricky to empirically
check the need for and results of using barriers. Beware of misleading
results when you omit a barrier that isn't really needed for the sake
of the processor, but the omission causes the compiler to more
aggressively optimize, leading to a reordering. (Compilers really must
understand memory models, which is part of the point of the JSR-133 spec.)

3. All sparcs I'm aware of run entirely in TSO mode, in which LoadLoad
and StoreStore barriers are no-ops. The newer Ultra-3's ONLY run in
TSO. Only Ultra-1/2 support non-TSO modes, but as far as I know, no
user or system code uses them.

4. Don't take this as the definitive word about x86, but it is
consistent with manuals, empirical checking, and mail I've received
from people very familiar with Intel processor internals: The x86
LFENCE, SFENCE, and MFENCE instructions were designed primarily for
use with streaming SIMD instructions operating under weak cache
policies; i.e., mostly for multimedia code. While the specs don't
rule out the possibility that they might be needed in other cases (and
imply that they will), empirically, none of them appear to be
necessary on current processors when you are using locked-prefixed
instructions for updates of variables visible across threads. In
Java, barriers via lock-prefixed instructions are generally needed for
CASes underlying locks and for writes to volatile variables. A few
useful Java tests for checking reorderings are at Bill Pugh's:
http://www.cs.umd.edu/%7Epugh/java/memoryModel/

5. The story on POWERPC is complicated by the fact that instructions
have changed and been added over the years. My cookbook entries for
these say that they are general guides. The code examples in the IBM
processor manuals are better guides.
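
Point 4 is the practical upshot for this thread: on current x86, the
LOCK prefix itself supplies the full (StoreLoad-inclusive) barrier, so
an atomic update needs no surrounding fence instructions. A sketch
(assuming GCC-style inline asm):

8<---
// lock xadd both updates the counter and acts as a full barrier on
// current x86 -- no lfence/sfence/mfence needed around it (point 4).
inline int atomic_increment(volatile int* p)
{
    int delta = 1;
    __asm__ __volatile__ ("lock; xaddl %0, %1"
                          : "+r"(delta), "+m"(*p)
                          :
                          : "memory");
    return delta + 1;  // xadd left the old value in delta
}
--->8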

Alexander Terekhov

May 13, 2003, 10:50:00 AM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> [...]
> > > > So a typical lock sequence would be "CAS, MEMBAR #LoadStore|#LoadLoad"
> > > > and unlock would be "ST, MEMBAR #StoreStore".
> > >
> > >
> > > Also the membar following the lock CAS only needs to be a #LoadLoad
> > > as far as I know.
> > >
> >
> > No, #LoadStore|#LoadLoad is probably right.
>
> Right.

Note that "asymmetrical" (ala Itanic) "cas.acq" (op.acq) is much better
for the lock acquisition (op.rel is just perfect for lock-release) and,
BTW, op.hoist_load_barrier/op.sink_store_barrier (that was recently sort
of "invented" by me ;-) ) is much better than conventional "symmetrical"
load (read) and store (write) barriers for stuff like producer/consumer
ala {currently} totally broken version of NPTL condvar meant to provide
binary compatibility for the old LinuxThreads based appls... unless I'm
just missing and/or misunderstanding something, of course.

regards,
alexander.

Joseph Seigh

May 13, 2003, 5:00:26 PM


What do you mean by asymmetrical? That the barrier is not separate but is
part of an instruction's execution, taking effect either before or after
the normal instruction execution? Or that the barrier is one-way, i.e.
memory accesses are prevented from passing the barrier in one direction
only, e.g. like #LoadStore|#LoadLoad?

I took a look at the Itanium definitions. They're a bit rambling.
I think acquire/release semantics can be stated a lot more concisely.
Joe Seigh

David Schwartz

May 13, 2003, 6:19:37 PM

"Attila Feher" <attila...@lmf.ericsson.se> wrote in message
news:b9prod$9c3$1...@newstree.wise.edt.ericsson.se...

> No OS. Today this OS, tomorrow that OS, after that maybe no OS. Whatever.
> So what I need is an assembly routine which does it right: if it needs a
> membar/fence, it uses the right one, etc.

Today this CPU, tomorrow that CPU, after that maybe no CPU. You are
looking at a lower level when the solution you want is probably already
available at a higher level.

DS


SenderX

May 13, 2003, 7:03:02 PM
> Today this CPU, tomorrow that CPU, after that maybe no CPU. You are
> looking at a lower level when the solution you want is probably already
> available at a higher level.

I think you are misunderstanding us. ;)

Mike Mowbray

May 13, 2003, 9:19:03 PM
SenderX wrote:

> [...]
> So a release is a main memory write barrier and acquire
> is a main memory read barrier, right?

IMHO it's important to understand that memory barriers
are not about syncing between cache and main memory,
but rather about interprocessor perception of the
*sequence* of memory operations.

Suppose the program flow on processor "A" stores to
address #1 and then to address #2. In relaxed (weak)
memory models it's possible for another processor "B"
to see the new value at address #2 before the new value
at address #1 - which can be a nasty surprise!

Inserting a memory barrier instruction fixes this because
then you're saying (e.g.) "make sure that all my memory
writes so far reach global visibility before any writes
I perform after this membar instruction." I.e., it
guarantees correct global perception of the intended
sequence of operations.

Other variants of membar (e.g.) enforce ordering
between loads and stores, and/or various combinations
thereof.


- MikeM.
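
A two-thread litmus test makes Mike's point concrete (a sketch in
C++11 terms; with both orderings relaxed, the "nasty surprise" outcome
is permitted):

8<---
#include <atomic>

std::atomic<int> addr1(0), addr2(0);

// Processor A: stores to address #1, then address #2.
void writer() {
    addr1.store(1, std::memory_order_relaxed);
    addr2.store(1, std::memory_order_release);  // "all my writes so far
                                                // reach global visibility
                                                // before this one"
}

// Processor B: if the store to addr2 is visible, the earlier store to
// addr1 must be too -- the intended *sequence* is perceived correctly.
int reader() {
    if (addr2.load(std::memory_order_acquire) == 1)
        return addr1.load(std::memory_order_relaxed);  // guaranteed 1
    return -1;  // second store not observed yet
}
--->8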


Attila Feher

May 14, 2003, 2:17:10 AM


Well, I did look at the higher level, and I know that what is there is no
solution for me.
A


Joseph Seigh

May 14, 2003, 5:33:28 AM

Doug Lea wrote:
>
> 4. Don't take this as the definitive word about x86, but it is
> consistent with manuals, empirical checking, and mail I've received
> from people very familiar with Intel processor internals: The x86
> LFENCE, SFENCE, and MFENCE instructions were designed primarily for
> use with streaming SIMD instructions operating under weak cache
> policies; i.e., mostly for multimedia code. While the specs don't
> rule out the possibility that they might be needed in other cases (and
> imply that they will), empirically, none of them appear to be
> necessary on current processors when you are using locked-prefixed
> instructions for updates of variables visible across threads. In
> Java, barriers via lock-prefixed instructions are generally needed for
> CASes underlying locks and for writes to volatile variables. A few
> useful Java tests for checking reorderings are at Bill Pugh's:
> http://www.cs.umd.edu/%7Epugh/java/memoryModel/
>

You'd have to know whether the implicit memory barriers come before,
after, or on both sides of the interlocked instruction's execution to be
able to infer anything useful. Otherwise you need to explicitly add them
to be sure.

I'm pretty sure they don't have them, and that what other people are
talking about is the internal interlocked logic, which involves a load
followed by a store, not the relative order of accesses by other
instructions.

Joe Seigh

Alexander Terekhov

May 14, 2003, 7:54:53 AM

Joseph Seigh wrote:

[... typical lock sequence ...]

> > > > No, #LoadStore|#LoadLoad is probably right.
> > >
> > > Right.
> >
> > Note that "asymmetrical" (ala Itanic) "cas.acq" (op.acq) is much better
> > for the lock acquisition (op.rel is just perfect for lock-release) and,
> > BTW, op.hoist_load_barrier/op.sink_store_barrier (that was recently sort
> > of "invented" by me ;-) ) is much better than conventional "symmetrical"
> > load (read) and store (write) barriers for stuff like producer/consumer
> > ala {currently} totally broken version of NPTL condvar meant to provide
> > binary compatibility for the old LinuxThreads based appls... unless I'm
> > just missing and/or misunderstanding something, of course.
> >
>
> What do you mean by asymmetrical. That the barrier is not separate but
> part of an instruction execution but either before or after the normal
> instruction execution, or that the barrier is one-way, i.e. the memory
> access do not pass the barrier in one direction only, e.g. like
> #LoadStore|#LoadLoad?

Well, (okay, I'll drop it again... now with copy&paste)

http://www.cs.umd.edu/~pugh/java/memoryModel/archive/1222.html
(Subject: JavaMemoryModel: Cookbook: barriers)

<copy&paste>

Eliot Moss wrote:
>
> First I should say that I have not gone through the latest
> memory model in detail, but I _am_ familiar with the
> properties of the ia64 instructions.
>
> Properties of _acquire_: An acquire prevents loads/stores
> issued AFTER the acquire from being performed before the
> access associated with the acquire. However, it does permit
> earlier issued loads/stores to be performed after the acquire.
>
> Properties of _release_: A release prevents loads/stores
> issued BEFORE the release from being performed after the
> access associated with the release. It does permit later
> issued loads/stores to be performed before the release.

Yeah. Well, the cookbook says:

"....
On ia64, LoadStore, LoadLoad and StoreStore barriers are
folded into special forms of load and store instructions --
there aren't separate instructions. ld.acq acts as (load;
LoadLoad+LoadStore) and st.rel acts as (LoadStore+StoreStore;
store). Neither of these provide a StoreLoad barrier -- you
need a separate mf barrier instruction for that."

I guess that it could a-sort-of-"clarify" that the reordering
restrictions mentioned above apply only to the following
"unordered"/"normal" memory accesses (in the program order;
on IA64; its "sequential pages" and "mf" aside):

- succeeding ones in the case of acquire

- preceding ones in the case of release.

</copy&paste>

Now,

OP.HOIST_LOAD_BARRIER is simply relaxed OP."ACQUIRE"
that doesn't affect reordering of STORES (that can be
done by C/C++ compiler and hardware; both can reorder).

OP.HOIST_STORE_BARRIER is simply relaxed OP."ACQUIRE"
that doesn't affect reordering of LOADS (that can be
done by C/C++ compiler and hardware; both can reorder).

OP.SINK_LOAD_BARRIER is simply relaxed OP."RELEASE"
that doesn't affect reordering of STORES (that can be
done by C/C++ compiler and hardware; both can reorder).

OP.SINK_STORE_BARRIER is simply relaxed OP."RELEASE"
that doesn't affect reordering of LOADS (that can be
done by C/C++ compiler and hardware; both can reorder).

The "real" OP.ACQUIRE can be described as:

OP."HOIST_LOAD_BARRIER|HOIST_STORE_BARRIER"

The "real" OP.RELEASE can be described as:

OP."SINK_LOAD_BARRIER|SINK_STORE_BARRIER"

DOES THIS MAKE SENSE? TIA.

regards,
alexander.

Joseph Seigh

May 14, 2003, 9:21:06 AM

Alexander Terekhov wrote:

(snip copy&paste "asymetrical" definitions)

Ok.

>
>
> Now,
>
> OP.HOIST_LOAD_BARRIER is simply relaxed OP."ACQUIRE"
> that doesn't affect reordering of STORES (that can be
> done by C/C++ compiler and hardware; both can reorder).
>
> OP.HOIST_STORE_BARRIER is simply relaxed OP."ACQUIRE"
> that doesn't affect reordering of LOADS (that can be
> done by C/C++ compiler and hardware; both can reorder).
>
> OP.SINK_LOAD_BARRIER is simply relaxed OP."RELEASE"
> that doesn't affect reordering of STORES (that can be
> done by C/C++ compiler and hardware; both can reorder).
>
> OP.SINK_STORE_BARRIER is simply relaxed OP."RELEASE"
> that doesn't affect reordering of LOADS (that can be
> done by C/C++ compiler and hardware; both can reorder).
>
> The "real" OP.ACQUIRE can be described as:
>
> OP."HOIST_LOAD_BARRIER|HOIST_STORE_BARRIER"
>
> The "real" OP.RELEASE can be described as:
>
> OP."SINK_LOAD_BARRIER|SINK_STORE_BARRIER"
>
> DOES THIS MAKE SENSE? TIA.
>

The problem here is that you are just restating the memory barriers
that exist on current architectures. If you had done these
definitions 10 years ago, they would look different and would
not work on today's architectures. It would be better, I think,
to work at a more abstract level with a proper semantic definition,
and deal with which membars actually get used as an implementation
issue.

I would stay away from using hardware definitions as an abstraction
level, simply because hardware architects are not programmers,
definitely not multi-threading programmers, and have no clue about
what they are breaking. Their programming world view is narrow
and extremely limited.

Joe Seigh

Ziv Caspi

May 19, 2003, 3:05:09 PM