
Windows Interlocked Acquire/Release functions


Joseph Seigh

Feb 21, 2003, 4:54:56 PM
I noticed on msdn there are a bunch of new win32 interlocked apis added
for windows server 2003. Stuff like

InterlockedIncrementAcquire()
InterlockedIncrementRelease()
...

The docs mention acquire/release memory semantics. Something processor
related. What is that all about?

Also noticed some 64 bit InterlockedCompareExchange functions. I know
where atomic smart pointers are being ported to next.

Joe Seigh

Joseph Seigh

Feb 21, 2003, 5:13:04 PM
Spotted some discussion on it in a linux mailing list archive. They're
memory barriers. I assume the old non-acquire/release forms in fact have
both acquire and release memory barriers. Unless somebody is incredibly
stupid.

Joe Seigh

Alexander Terekhov

Feb 22, 2003, 10:11:02 AM

Joseph Seigh wrote:
>
> I noticed on msdn there are a bunch of new win32 interlocked apis added
> for windows server 2003. Stuff like
>
> InterlockedIncrementAcquire()
> InterlockedIncrementRelease()
> ...
>
> The docs mentions acquire/release memory semantics. Something processor
> related. What is that all about?

http://rsim.cs.uiuc.edu/~sadve/Publications/thesis.ps
(2.2.3. Release Consistency And Related Models)

>
> Also noticed some 64 bit InterlockedCompareExchange functions. I know
> where atomic smart pointers are being ported to next.

The "crazy-man-2003" {server} edition of pthread_refcount_t will make
use of acquire and release decrements (and compare-exchanges) too. ;-)

Well, naked ("nomsync") interlocked-stuff plus itanic-MF (full-stop
memory fence; acq/rel barriers [or load/store calls] aside for a
moment) would be really useful too (to make the most efficient impl
of pthread_refcount_t for MS-WIN-API platforms) but I don't think
that MS will come up with such stuff before 2010-or-so.

regards,
alexander.

Joseph Seigh

Feb 22, 2003, 10:59:39 AM

Alexander Terekhov wrote:
>
> >
> Well, naked ("nomsync") interlocked-stuff plus itanic-MF (full-stop
> memory fence; acq/rel barriers [or load/store calls] aside for a
> moment) would be really useful too (to make the most efficient impl
> of pthread_refcount_t for MS-WIN-API platforms) but I don't think
> that MS will come up with such stuff before 2010-or-so.
>

There are rumors that Microsoft has its work cut out for it moving windows
to 64 bits. But all that interlocked stuff is certainly useful to play with
in the meantime. Kind of ironic that windows will probably be the platform
where innovation in synchronization will be taking place. I suppose we'll
be hearing harrumphing from the stuffy types that all this low level stuff
is too dangerous for mere mortals to use. But as they have not been able
to come up with an api that can't be screwed up by idiots, we know how much
their two cents is worth.

Joe Seigh

Alexander Terekhov

Feb 22, 2003, 11:31:13 AM

Joseph Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > >
> > Well, naked ("nomsync") interlocked-stuff plus itanic-MF (full-stop

^^^^^^^^^^^^^^^^^^^^

> > memory fence; acq/rel barriers [or load/store calls] aside for a
> > moment) would be really useful too (to make the most efficient impl
> > of pthread_refcount_t for MS-WIN-API platforms) but I don't think
> > that MS will come up with such stuff before 2010-or-so.

Heck! <http://tinyurl.com/68hv>

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/memorybarrier.asp

> >
>
> There are rumors that Microsoft has its work cut out for it moving windows
> to 64 bits. But all that interlocked stuff is certainly useful to play with
> in the meantime. Kind of ironic that windows will probably be the platform
> where innovation in synchronization will be taking place. I suppose we'll
> be hearing harrumphing from the stuffy types that all this low level stuff
> is too dangerous for mere mortals to use. But as they have not been able
> to come up with an api that can't be screwed up by idiots, we know how much
> their two cents is worth.
>
> Joe Seigh

regards,
alexander.

--
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winprog/winprog/windows_server_2003_family.asp

Ziv Caspi

Feb 22, 2003, 3:59:52 PM
On Fri, 21 Feb 2003 22:13:04 GMT, Joseph Seigh <jsei...@xemaps.com>
wrote:

>I assume the old non acquire/release forms in fact have
>both acquire and release memory barriers. Unless somebody is incredibly
>stupid.

Yes. (See also the archives of this group.)

Ziv

Jochen Wilhelmy

Feb 23, 2003, 9:31:03 AM
> Well, naked ("nomsync") interlocked-stuff plus itanic-MF (full-stop
> memory fence; acq/rel barriers [or load/store calls] aside for a
> moment) would be really useful too (to make the most efficient impl
> of pthread_refcount_t for MS-WIN-API platforms) but I don't think
> that MS will come up with such stuff before 2010-or-so.

is it correct to use InterlockedIncrementAcquire
and InterlockedDecrementRelease for a reference counter?
and how could this be made more efficient?

jochen

Joseph Seigh

Feb 23, 2003, 10:02:39 AM

For the decrement, to prevent late stores from impacting heap management.
There's no intrinsic need for a memory barrier on increment. It really
depends on the reference counted pointer semantics. You may also want it
on the decrement to get lock like semantics for the dtor as Alexander wants.
If you are implementing Boost like semantics, there is no additional need
for memory barriers since C++ is thread unaware and synchronization is
externally supplied. If you want memory visibility semantics associated with
the pointer class itself, then that would depend on the class's public methods.
Typically, you'd want it associated with pointer store and fetch: release
semantics with store, and acquire semantics with fetch. The store and fetch
won't be explicit but implicit in the methods. Assignment would usually be
a fetch followed by a store. Dereferencing would involve a fetch usually.
In an implementation, you might have to use explicit memory barriers if the
interlocked instructions didn't provide them in the right place.
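
To make that concrete, here is a rough sketch of the store/fetch side, using
only plain Win32 calls plus MemoryBarrier() as a conservative full-fence
stand-in for release/acquire (the smart pointer class itself and the node
type are made up; this is an illustration, not a recipe):

#include <windows.h>

struct node;                      // whatever the smart pointer refers to

static node* volatile g_ptr = 0;  // the shared, atomically updated slot

void publish(node* p)             // the "store" side of an assignment
{
    MemoryBarrier();              // release: make prior writes visible first
    InterlockedExchangePointer((PVOID volatile*)&g_ptr, p);
                                  // (dropping the old pointer's reference
                                  //  is omitted here)
}

node* fetch()                     // the "fetch" side of a copy/dereference
{
    node* p = g_ptr;              // aligned pointer load is atomic
    MemoryBarrier();              // acquire: later reads see p's contents
    return p;
}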

Kind of hard to explain. It's easy for me (to see, not explain), since I've been
playing with stuff at the semantic level for a while now. Most people ignore
semantics, and without them as a reference point it's hard to explain why you need any of this.
Semantics are specifications basically. I can't say whether a particular implementation
needs something or not without knowing what the specifications are, can I?

Joe Seigh

Joseph Seigh

Feb 23, 2003, 12:09:59 PM

Alexander Terekhov wrote:
>
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winprog/winprog/windows_server_2003_family.asp

Well, I have the Oct 2002 SDK (there's a Feb 2003 SDK also), so I can compile and link to the 64 bit
interfaces. Can't execute them w/o the DLLs. I thought they might have simulated the 64 bit interlocked
instructions w/ cmpxchg8b.

I did check out the Itanium reference and there is a 128 bit compare and swap, so atomic smart pointers
can work on 64 bit intel platforms, not just powerpc and z architecture.

Joe Seigh

Alexander Terekhov

Feb 24, 2003, 2:58:08 AM

Incrementing a reference count doesn't really need any memory
synchronization protocol ala Acquire/Release. The naked atomic
increment is sufficient here. If the associated object is
immutable, then the same goes for decrements too. The problems
arise when the associated object is MUTABLE. In this case, you do
need acquire/release protocol for decrements... unless you
really want to impose on your clients the obligation to acquire
the proper lock "on object destruction/cleanup". More details
on this can be found in the "threadsafe reference counting"
thread.
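
In code, the mutable-object pattern above looks roughly like this, sticking
to the Win32 names already mentioned in this thread; MemoryBarrier() stands
in (conservatively, as a full fence) for the acquire half on the zero path,
and the object type and cleanup function are made up:

#include <windows.h>

struct object;                    // made-up type
void destroy(object*);            // made-up cleanup/dtor call

void add_ref(LONG volatile* count)
{
    InterlockedIncrement(count);              // naked increment is enough
}

void release(LONG volatile* count, object* obj)
{
    if (InterlockedDecrementRelease(count) == 0)
    {
        MemoryBarrier();                      // acquire: see the other
                                              // threads' last writes to *obj
        destroy(obj);                         // now safe to clean up
    }
}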

regards,
alexander.

Alexander Terekhov

Feb 26, 2003, 5:58:18 AM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> >
> > I noticed on msdn there are a bunch of new win32 interlocked apis added
> > for windows server 2003. Stuff like
> >
> > InterlockedIncrementAcquire()
> > InterlockedIncrementRelease()
> > ...
> >
> > The docs mentions acquire/release memory semantics. Something processor
> > related. What is that all about?
>
> http://rsim.cs.uiuc.edu/~sadve/Publications/thesis.ps
> (2.2.3. Release Consistency And Related Models)

This is from my draft "pthread_refcount_t"-spec."edits":

---
< additions to the XBD 4.10 Memory Synchronization >

The informal semantics of acquire/release memory synchronization
operations can be defined(*) as: (1) before an ordinary load or store
access is allowed to perform with respect to any other processor, all
preceding acquire accesses must be performed, and (2) before a release
access is allowed to perform with respect to any other processor, all
preceding ordinary load and store accesses must be performed. An act
of acquiring mutex ownership can be viewed as performing an acquire
operation. An act of releasing mutex ownership can be viewed as
performing a release operation.

The pthread_refcount_decrement() and pthread_refcount_subtract()
functions shall have the effect of /release/ memory synchronization
operation if the resulting value of a reference count is not zero;
otherwise, these functions shall have the effect of /acquire/ memory
synchronization operation.

The pthread_refcount_decrement_relmsync() and
pthread_refcount_subtract_relmsync() functions shall have the effect
of /release/ memory synchronization operation if the resulting value
of a reference count is not zero; otherwise, the memory synchronization
effect of these functions is unspecified.

The pthread_refcount_decrement_acqmsync() and
pthread_refcount_subtract_acqmsync() functions shall have the effect
of /acquire/ memory synchronization operation if the resulting value
of a reference count is zero; otherwise, the memory synchronization
effect of these functions is unspecified.
---

Does this make sense to you? TIA.

regards,
alexander.

Mike Mowbray

Feb 26, 2003, 6:45:11 PM
Alexander Terekhov wrote:

> [...]


>
> ---
> < additions to the XBD 4.10 Memory Synchronization >
>

> The informal semantics of acquire/release memory synchronization
> operations can be defined(*) as: (1) before an ordinary load or store
> access is allowed to perform with respect to any other processor, all
> preceding acquire accesses must be performed, and (2) before a release
> access is allowed to perform with respect to any other processor, all
> preceding ordinary load and store accesses must be performed. An act
> of acquiring mutex ownership can be viewed as performing an acquire
> operation. An act of releasing mutex ownership can be viewed as
> performing a release operation.

Perhaps I'm just nitpicking, but other specifications of membar
semantics I've seen use wording like "results of load/store become
visible to other processors", rather than "load/store must be performed".
Presumably this is because the term "performed" can be ambiguous on
modern CPU architectures.


> The pthread_refcount_decrement() and pthread_refcount_subtract()
> functions shall have the effect of /release/ memory synchronization
> operation if the resulting value of a reference count is not zero;
> otherwise, these functions shall have the effect of /acquire/ memory
> synchronization operation.
>
> The pthread_refcount_decrement_relmsync() and
> pthread_refcount_subtract_relmsync() functions shall have the effect
> of /release/ memory synchronization operation if the resulting value
> of a reference count is not zero; otherwise, the memory synchronization
> effect of these functions is unspecified.
>
> The pthread_refcount_decrement_acqmsync() and
> pthread_refcount_subtract_acqmsync() functions shall have the effect
> of /acquire/ memory synchronization operation if the resulting value
> of a reference count is zero; otherwise, the memory synchronization
> effect of these functions is unspecified.

Even after all you've written on this subject, I still don't clearly
understand why you prefer a pthread_refcount_t and special functions
over a standardized API for atomic add/subtract/cas and (various)
membars. It seems to me that the latter is more general and therefore
applicable to a larger class of problems.


- MikeM.


Alexander Terekhov

Feb 27, 2003, 4:51:39 AM

Mike Mowbray wrote:
[...]

> > < additions to the XBD 4.10 Memory Synchronization >
> >
> > operations can be defined(*) as: (1) before an ordinary load or store
> > preceding acquire accesses must be performed, and (2) before a release
> > access is allowed to perform with respect to any other processor, all
> > preceding ordinary load and store accesses must be performed. An act
> > of acquiring mutex ownership can be viewed as performing an acquire
> > operation. An act of releasing mutex ownership can be viewed as
> > performing a release operation.
>
> Perhaps I'm just nitpicking, but other specifications of membar
> semantics I've seen use words like of "results of load/store become
> visible to other processors", rather than "load/store must be performed".
> Presumably this is because the term "performed" can be ambiguous on
> modern CPU architectures.

Well, to me, "results of load/store become visible to other processors"
doesn't quite cut it with respect to *reordering*:

lock();
delete p_object;
p_object = 0;
unlock();

here, the "p_object load/store" just can't escape the critical section
[in both directions, I mean]. Or do you think that "become visible to
other processors" would still make it clear?

[ ... pthread_refcount stuff ... ]

> Even after all you've written on this subject, I still don't clearly
> understand why you prefer a pthread_refcount_t and special functions
> over a standardized API for atomic add/subtract/cas and (various)
> membars.

Well, "(various) membars" is the key here, I think.

> It seems to me that the latter is more general and therefore
> applicable to a larger class of problems.

Yeah, std::atomic<T> with a "just a few" mandatory specializations for
a whole bunch of scalar types (bool, ptrs, enums, etc.) would really
be a ``nice to have feature.'' The question is whether you really
want to expose "(various) membars" stuff. AFAIK, Java [JSR-133] folks
have defined that (1) their "loads"/get() shall have "acquire"
semantics, "stores"/set() shall have "release" semantics, and (3)
"read-modify" stuff (increment/etc.) shall have "acquire/release"
semantics. I actually really like it(*)... but the problem is that an
implementation of "pthread_refcount_t" using something along the lines
of such std::atomic<size_t>, for example, would unfortunately inject
way too many *unnecessary* reordering constraints/barriers.

Basically, that's the main reason to have a dedicated synchronization
object "pthread_refcount_t" (a fully portable mutex-based/blocking
version aside for a moment), I think.

Do you agree?

regards,
alexander.

(*) http://groups.google.com/groups?threadm=3D8B257C.4D8D58F%40web.de
(Subject: Re: rwlock using pthread_cond (or Win32 events) [last
link retired; stuff is available here: <http://tinyurl.com/5mmj>])

--
http://www.terekhov.de/pthread_refcount_t/draft-edits.txt

Joseph Seigh

Feb 27, 2003, 7:13:13 AM

Alexander Terekhov wrote:
>
> Mike Mowbray wrote:
> [...]

I think standards architects really need to learn how to do formal semantics.
Defining standards in terms of an assumed memory model is problematic. You
don't necessarily know what kind of memory technologies will show up in the
future. I can think of some where membars would be meaningless.

BTW, you can cheat here. POSIX mutex semantics aren't formally defined but
are assumed to be well understood. You can say that reference decrement (!= 0)
has the same memory visibility semantics as POSIX mutex release, and reference
decrement (==0) the same semantics as POSIX mutex acquire. Which is probably
the same as "acquire" and "release" semantics but I'd have to check those
out a little more.

JSR-133 is in that mess partly because of that, partly because the reference
JVM implementation depends on that meta model, and partly because of ... umm...,
we'll leave it at that.

But somebody should standardize those "low level primitives" because if you
can think of a use for them, then other people can think of other uses. And
you don't want multiple implementations floating around, especially when there
could be subtle problems on some architectures.

And speaking of thread support for C++, it's too bad the C++ crowd is so antipathetic
towards threading. I came up with a good wait-free pattern ala Herlihy's
"Dynamic Lock-Free Data Structures". I actually had to figure out how he did his
implementation. Otherwise I was going to have to say that what I did may or may not
be what Herlihy did.

Joe Seigh

Mike Mowbray

Feb 27, 2003, 6:28:30 PM

I wrote:

>> Perhaps I'm just nitpicking, but other specifications of membar
>> semantics I've seen use words like of "results of load/store become
>> visible to other processors", rather than "load/store must be performed".
>> Presumably this is because the term "performed" can be ambiguous on
>> modern CPU architectures.

Alexander Terekhov wrote:

> Well, to me, "results of load/store become visible to other
> processors" doesn't quite cut it with respect to *reordering*:

> [...]

Hmmm. I see I've committed your "sin" in not saying what I meant
clearly enough. :-) But I don't really want to get into a long
debate over this relatively minor point.

BTW, I actually like the division Sun uses in <sys/atomic.h>.
The names of the membar functions are suggestive of the main
purpose they're used for, rather than the somewhat cryptic
assembler form like: membar #LoadStore|#StoreStore

<begin extract from <sys/atomic.h>
/*
* Generic memory barrier used during lock entry, placed after the
* memory operation that acquires the lock to guarantee that the lock
* protects its data. No stores from after the memory barrier will
* reach visibility, and no loads from after the barrier will be
* resolved, before the lock acquisition reaches global visibility.
*/
extern void membar_enter(void);

/*
* Generic memory barrier used during lock exit, placed before the
* memory operation that releases the lock to guarantee that the lock
* protects its data. All loads and stores issued before the barrier
* will be resolved before the subsequent lock update reaches visibility.
*/
extern void membar_exit(void);

/*
* Arrange that all stores issued before this point in the code reach
* global visibility before any stores that follow; useful in producer
* modules that update a data item, then set a flag that it is available.
* The memory barrier guarantees that the available flag is not visible
* earlier than the updated data, i.e. it imposes store ordering.
*/
extern void membar_producer(void);

/*
* Arrange that all loads issued before this point in the code are
* completed before any subsequent loads; useful in consumer modules
* that check to see if data is available and read the data.
* The memory barrier guarantees that the data is not sampled until
* after the available flag has been seen, i.e. it imposes load ordering.
*/
extern void membar_consumer(void);

<end extract from <sys/atomic.h>
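
For what it's worth, the producer/consumer idiom those comments describe
would look roughly like this (data_item and available are made-up names;
it assumes only the declarations quoted above):

static int data_item;                 // made-up payload
static volatile int available = 0;    // made-up "data is available" flag

void produce(int value)
{
    data_item = value;                // update the data item
    membar_producer();                // earlier stores reach visibility
                                      // before the flag store below
    available = 1;                    // then set the available flag
}

int consume(void)
{
    while (!available)                // wait until the flag has been seen
        ;                             // (busy wait, just for illustration)
    membar_consumer();                // flag load completes before the
                                      // data load below
    return data_item;                 // now safe to read the data
}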


>>> [ ... pthread_refcount stuff ... ]

>> Even after all you've written on this subject, I still don't
>> clearly understand why you prefer a pthread_refcount_t and
>> special functions over a standardized API for atomic
>> add/subtract/cas and (various) membars.

> Well, "(various) membars" is the key here, I think.

But the set of "various membars" is reasonably finite. E.g:
for sparc membars (which seem reasonably general) there are
4 (IIRC?) annotations to the membar command which can then
be bitwise-OR'ed together.


>> It seems to me that the latter is more general and
>> therefore applicable to a larger class of problems.

> Yeah, std::atomic<T> with a "just a few" mandatory specializations
> for a whole bunch of scalar types (bool, ptrs, enums, etc.)
> would really be a ``nice to have feature.''

Why a "whole bunch" of types? If there's a cas32 and cas64, that
covers most stuff on modern machines (though a widely available
cas128 would be nice too). Generalization to ptrs, etc, can
easily be done with templates and/or #ifdefs, etc.


> The question is whether you really want to expose
> "(various) membars" stuff.

> [...]
> Do you agree?

If they're exposed, then people can explore many different
design styles that haven't been thought of yet. Otherwise,
everyone is locked into whatever consensus the standards
committee perceived to be the "one right way" at the time.

IMHO, a more general API, accompanied by some expert advice
saying "we think this stuff is best used in such-and-such
a way" gives the best of both worlds. Novices get some
guidance, and more experienced people get more powerful
primitives to explore other design patterns that the original
standards experts might not have considered. Those other
people can insert membars only where needed, and thus
produce more optimal code. Efficiency was never a major
concern for Java, AFAICT, but in C++ it's a big issue.

So I guess I'm saying I won't be fully happy until there's
at least a standardization of cas32, cas64, and the various
membars - since all the other atomic primitives that get
mentioned semi-regularly here on c.p.t. can be built from
those easily enough. That's not to say there shouldn't also
be a higher level library of other atomic primitives as well.
That's fine and dandy. Just don't omit the lower level
essentials, I say.


- MikeM.


Joseph Seigh

Feb 27, 2003, 7:00:09 PM


I agree. And there's a more important issue. I was browsing
some of the recent hardware architectures to see what they have
in the way of interlocked primatives. It looks like they're
adding a lot of new stuff that isn't applicable at the normal
C programming level. These are application specific not normally
applicable to the regular programming domain. Stuff like parallel
programming support. Parallel programming isn't the same as
threaded programming exactly. You need specialized compilers that
know how to do the right thing. Highly pipelined processors can
get pretty pick how you do things. Case in point. Doing a spin
loop on a hyperthreaded Intel processor. Intel has special programming
notes on how to do this without trashing performance. And you thought
doing a spin loop was simple.

So there is a widening gap between Pthreads and where technology is
going. And Pthreads seems to have stagnated. If we want further
exploitation of multi-threading, it's not going to come from Pthreads.

Another point. Mutual exclusion doesn't scale well when you add lots
more processors. Some of that's from having too many threads but you
end up starving the processors because of mutex bottlenecks.

Wait-free and lock-free is going to become more important. And believe
me, you want doubleword compare and swap support.

Joe Seigh

Alexander Terekhov

Feb 28, 2003, 7:56:11 AM

Joseph Seigh wrote:
[...]

> I think standards architects really need to learn how to do formal semantics.
> Defining standards in terms of an assumed memory model is problematic.

Well, standards architects wrote this:

http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap04.html#tag_01_04_10

<quote>

Formal definitions of the memory model were rejected as unreadable by the vast
majority of programmers. In addition, most of the formal work in the literature
has concentrated on the memory as provided by the hardware as opposed to the
application programmer through the compiler and runtime system. It was believed
that a simple statement intuitive to most programmers would be most effective.

</quote>

regards,
alexander.

P.S. I'll also have to admit that one well known standards architect doesn't
seem to be very excited about the "pthread_refcount_t" thing. :-(

Joseph Seigh

Feb 28, 2003, 8:52:24 AM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> [...]
> > I think standards architects really need to learn how to do formal semantics.
> > Defining standards in terms of an assumed memory model is problematic.
>
> Well, standards architects wrote this:
>
> http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap04.html#tag_01_04_10
>
> <quote>
>
> Formal definitions of the memory model were rejected as unreadable by the vast
> majority of programmers. In addition, most of the formal work in the literature
> has concentrated on the memory as provided by the hardware as opposed to the
> application programmer through the compiler and runtime system. It was believed
> that a simple statement intuitive to most programmers would be most effective.
>
> </quote>

They're only saying that because they don't know any better. That's the whole
point. Avoid the memory model.

>
>
> P.S. I'll also have to admit that one well known standards architect doesn't
> seem to be very excited about the "pthread_refcount_t" thing. :-(

Well, I'm not excited about the mnemonics you've chosen. They're targeted towards
specific refcounting techniques used in some C++ shallow copy optimizations. I'd
much rather see application-neutral interlocked primitives w/ optional acquire/release
semantics. The Intel/MS Server 2003 stuff would be a start. You can then recast
your stuff in terms of those w/o too much argument since anybody who objected could
always use the lower level general stuff.

You also might want to consider a slightly higher level of abstraction. I don't
want to get too specific at this point since I'm doing something like that
myself. Not refcounting though.

Oh, and I'd forget about flogging POSIX pthreads (except for entertainment value maybe).
It's no longer a standard that's evolving at any meaningful pace. Not quite SCSI I,
but definitely a "mature" standard.

Joe Seigh

Joseph Seigh

Feb 28, 2003, 10:19:18 AM

I wrote:
>
> You also might want to consider a slightly higher level of abstraction. I don't
> want to get too specific at this point since I'm doing something like that my
> self. Not refcounting though.

Well, I could give an example in terms of the hypothetical C++ usage.

You have this smart pointer type. It could have various properties or not.
Thread-safe or not. Atomic or not. Partial GC (doesn't handle cyclic references)
or Full GC (does handle cyclic references). acquire/release semantics on
pointer ops. There may be others.

Given that, can you make copy on write (COW) objects? Yes, it's easy. Maybe too easy.
It's just that one extra level of indirection basically. Ok, I can see why you
might want to make it more interesting but still.

Look at it from another angle. Suppose you wanted to port a COW class to or from
Java. With smart pointers as one of your abstraction points, it's straightforward.
Heck, given that, we could port AWTEventMulticaster to C++ with smart pointers without the
correct acquire/release semantics so it could work just as incorrectly. :)

Joe Seigh

Joseph Seigh

Feb 28, 2003, 1:36:07 PM

I wrote:
>
> I did check out the Itanium reference and there is a 128 bit compare and swap, so atomic smart pointers
> can work on 64 bit intel platforms, not just powerpc and z architecture.
>

Well, not really. It's CMP8XCHG16. But I can probably make atomic smart pointers work on
it. Powerpc doesn't have 128 bit CAS as far as I can tell, but I can also probably make
atomic smart pointers work there as well. So it's just 64 bit sparc and AMD that would be
problematic.

Joe Seigh

Alexander Terekhov

Feb 28, 2003, 1:47:19 PM

I'm afraid that AS/400 (iSeries nowadays; newer "teraspace" __ptr64
pointers aside for a moment) would be problematic as well... because
the "default" is __ptr128. ;-)

regards,
alexander.

Joseph Seigh

Feb 28, 2003, 2:40:17 PM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> >
> >

> > Well, not really. It's CMP8XCHG16. But I can probably make atomic smart pointers work on
> > it. Powerpc doesn't have 128 bit CAS as far as I can tell, but I can also probably make
> > atomic smart pointers work there as well. So it's just 64 bit sparc and AMD that would be
> > problematic.
>
> I'm afraid that AS/400 (iSeries nowadays; newer "teraspace" __ptr64
> pointers aside for a moment) would be problematic as well... because
> the "default" is __ptr128. ;-)
>

If it's load reserved/store conditional like the powerpc, I can do it.
Isn't a large chunk of that address space i/o & filesystem space?

Joe Seigh

Alexander Terekhov

Feb 28, 2003, 3:26:57 PM

Joseph Seigh wrote:
[...]

> > I'm afraid that AS/400 (iSeries nowadays; newer "teraspace" __ptr64
> > pointers aside for a moment) would be problematic as well... because
> > the "default" is __ptr128. ;-)
> >
>
> If it's load reserved/store conditional like the powerpc, I can do it.
> Isn't a large chunk of that address space i/o & filesystem space?

<http://www.ibm.com/eserver/iseries/infocenter>...

http://publibfp.boulder.ibm.com/epubs/pdf/c4156066.pdf
(ILE Concepts)

http://publibfp.boulder.ibm.com/epubs/pdf/c0927123.pdf
(ILE C/C++ Programmer's Guide)

http://publibfp.boulder.ibm.com/epubs/pdf/c4156071.pdf
(ILE C/C++ for iSeries Run-Time Library Functions)

http://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzahw/rzahwrzahwcascasco.htm
(Compare and Swap ;-) )

regards,
alexander.

Ziv Caspi

Feb 28, 2003, 7:02:51 PM
On Fri, 28 Feb 2003 00:00:09 GMT, Joseph Seigh <jsei...@xemaps.com>
wrote:

>Another point. Mutual exclusion doesn't scale well when you add lots
>more processors. Some of that's from having too many threads but you
>end up starving the processors because of mutex bottlenecks.
>
>Wait-free and lock-free is going to become more important. And believe
>me, you want doubleword compare and swap support.

Any type of non-distributed synchronization (when you have N entities
synchronize at one location) doesn't really scale well.

Ziv

Alexander Terekhov

Mar 1, 2003, 10:19:13 AM

Mike Mowbray wrote:
[...]

> <begin extract from <sys/atomic.h>
> /*
> * Generic memory barrier used during lock entry, placed after the
> * memory operation that acquires the lock to guarantee that the lock
> * protects its data. No stores from after the memory barrier will
> * reach visibility,

Uhmm, "stores from after" THAT memory barrier ("membar_enter")
shall NOT be *performed* ("resolved") "before the lock acquisition
reaches global visibility". As for their visibility, it is
actually controlled by the "membar_exit" below, AFAICS.

> * and no loads from after the barrier will be
> * resolved, before the lock acquisition reaches global visibility.
> */
> extern void membar_enter(void);
>
> /*
> * Generic memory barrier used during lock exit, placed before the
> * memory operation that releases the lock to guarantee that the lock
> * protects its data. All loads and stores issued before the barrier
> * will be resolved before the subsequent lock update reaches visibility.
> */
> extern void membar_exit(void);

That's okay.

>
> /*
> * Arrange that all stores issued before this point in the code reach
> * global visibility before any stores that follow; useful in producer
> * modules that update a data item, then set a flag that it is available.
> * The memory barrier guarantees that the available flag is not visible
> * earlier than the updated data, i.e. it imposes store ordering.
> */
> extern void membar_producer(void);

This actually sounds quite reasonable, unless I'm just missing
and/or misunderstanding something.

>
> /*
> * Arrange that all loads issued before this point in the code are
> * completed before any subsequent loads; useful in consumer modules
> * that check to see if data is available and read the data.
> * The memory barrier guarantees that the data is not sampled until
> * after the available flag has been seen, i.e. it imposes load ordering.
> */
> extern void membar_consumer(void);

Likewise.

>
> <end extract from <sys/atomic.h>
>
> >>> [ ... pthread_refcount stuff ... ]
>
> >> Even after all you've written on this subject, I still don't
> >> clearly understand why you prefer a pthread_refcount_t and
> >> special functions over a standardized API for atomic
> >> add/subtract/cas and (various) membars.
>
> > Well, "(various) membars" is the key here, I think.
>
> But the set of "various membars" is reasonably finite. E.g:
> for sparc membars (which seem reasonably general) there's,
> 4 (iirc?) annotations to the membar command which can then
> be bitwise-OR'ed together.

Well,

A) http://gee.cs.oswego.edu/dl/jmm/cookbook.html

B) http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-group-l&id=4865

C) http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-group-l&id=4866

Now, I'd like to clarify two points:

1. CV signaling calls might use whatever internal barriers
their implementor needs or wants; the point is that calling
these functions SHALL NOT impose reordering constraints on
the COMPILER with respect to the USER data, I believe
strongly. David Butenhof is simply wrong in this case, I'm
sure.

2. Itanic folks can emulate separate barriers using .acq/.rel
on a stack variable, for example. But this would still be
"somewhat expensive", I think.

>
> >> It seems to me that the latter is more general and
> >> therefore applicable to a larger class of problems.
>
> > Yeah, std::atomic<T> with a "just a few" mandatory specializations
> > for a whole bunch of scalar types (bool, ptrs, enums, etc.)
> > would really be a ``nice to have feature.''
>
> Why a "whole bunch" of types? If there's a cas32 and cas64, that
> covers most stuff on modern machines (though a widely available
> cas128 would be nice too). Generalization to ptrs, etc, can
> easily be done with templates and/or #ifdefs, etc.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's what I meant. A hypothetical POSIX.C++ should better
define a single template atomic<T> [for the application use]
and require implementations to provide a "whole bunch" of
specializations for various scalar types -- atomic<bool>,
atomic<char>, atomic<short>, etc. (To Joe: I'm pretty sure
that IBM iSeries folks will most likely vote against "ptr"
ones and will simply suggest to mandate atomic<intptr_t>
and atomic<uintptr_t> instead ;-) ;-) )
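
Interface-wise, I mean something vaguely like this; just a sketch of the
shape (names and signatures are made up, with the JSR-133-style msync baked
into the operations):

// Hypothetical atomic<T>: acquire on load, release on store,
// acquire+release on read-modify-write. Declarations only; the real
// work would be in per-platform specializations (atomic<bool>,
// atomic<int>, atomic<intptr_t>, atomic<uintptr_t>, ...).
template <class T>
class atomic
{
public:
    T    load() const;                 // acquire
    void store(T value);               // release
    bool compare_and_swap(T expected,
                          T desired);  // acquire+release
    T    fetch_and_add(T delta);       // acquire+release
};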

regards,
alexander.

Joseph Seigh

Mar 1, 2003, 11:31:29 AM

Do you mean something like using Peterson's algorithm to implement mutexes
or using MPI instead of shared memory?

If it's the former, you are still going to have bottlenecks because of the
mutex/critical sections. That's why wait-free is important.

And you could probably give distributed algorithms a run for their money. I
proposed a new instruction over on comp.arch. A hardware condvar based on
LL/SC or load reserved/store conditional hardware. It would be a pause until
reserve broken - PUL for pause until unlocked in alpha mnemonics. So a
test and set type of mutex would be

loop:
    LL  rx, lock          ; load reserved: read the lock word, set reservation
    if (lock != 0)        ; lock is held by someone else
        PUL lock          ; pause until the reservation is broken
        goto loop
    else
        rx = 1
        SC  rx, lock      ; store conditional: try to take the lock
        if (lock not acquired)
            goto loop

It would be a lot more pipeline friendly. And a lot more cache friendly in the
sense that the cache wouldn't have to be more aggressive than the memory model
required it to be.

Joe Seigh

Joseph Seigh

Mar 1, 2003, 11:51:21 AM

Joseph Seigh wrote:
>
> And you could probably give distributed algorithms a run for their money. I
> proposed a new instruction over on comp.arch. A hardware condvar based on
> LL/SC or load reserved/store conditional hardware. It would be a pause until
> reserve broken - PUL for pause until unlocked in alpha mnemonics. So a
> test and set type of mutex would be

Actually the LL might have problems scaling up depending on how it was implemented.
But I was using that mechanism to get some of the condvar like properties. Assume
there is some mechanism that would work for waiting until a memory location changed.

Joe Seigh

Ziv Caspi

Mar 1, 2003, 3:38:46 PM
On Sat, 01 Mar 2003 16:31:29 GMT, Joseph Seigh <jsei...@xemaps.com>
wrote:

>Ziv Caspi wrote:
>> Any type of non-distributed synchronization (when you have N entities
>> synchronize at one location) doesn't really scale well.


>>
>
>Do you mean something like using Peterson's algorithm to implement mutexes
>or using MPI instead of shared memory?

I mean that lock-free algorithms *still* use a shared resource to
synchronize N concurrent operations, and so they are still subject to
performance drop when N scales. They perform better than non-lock-free
algorithms, but when you have enough processors contending for the
same synchronization area in memory, performance drops.

While few of us currently have machines with N large enough for this
to matter, really high-performance servers need other types of
algorithms to scale well, usually based on partitioning instead of
shared work queues.

Ziv

David Butenhof

Mar 4, 2003, 7:32:13 AM
Joseph Seigh wrote:

> Alexander Terekhov wrote:
>>
>> Joseph Seigh wrote:
>> [...]
>> > I think standards architects really need to learn how to do formal
>> > semantics. Defining standards in terms of an assumed memory model is
>> > problematic.
>>
>> Well, standards architects wrote this:
>>
>> http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap04.html#tag_01_04_10
>>
>> <quote>
>>
>> Formal definitions of the memory model were rejected as unreadable by the
>> vast majority of programmers. In addition, most of the formal work in the
>> literature has concentrated on the memory as provided by the hardware as
>> opposed to the application programmer through the compiler and runtime
>> system. It was believed that a simple statement intuitive to most
>> programmers would be most effective.
>>
>> </quote>
>
> They're only saying that because they don't know any better. That's the
> whole point. Avoid the memory model.

No, "they're" not saying that because they don't know any better. The
earliest drafts of the POSIX threads amendment contained a formal memory
model that was fairly simple but comprehensive, and a section on C usage
with a required type applications could use to size and align data
appropriately to ensure the machine would treat it as "a separate variable"
for memory operations.

Unfortunately, as the quote says, we found this to be incomprehensible to
the vast majority of the proposed audience for the standard (in
particular, most of the POSIX working group). We backed off and,
eventually, the simple "intuitive" memory "guideline" was constructed to
replace it.

Perhaps we should have had both.

>> P.S. I'll also have to admit that one well known standards architect
>> doesn't seem to be very excited about the "pthread_refcount_t" thing. :-(
>
> Well, I not excited about the mnemonics you've chosen. They're targeted
> towards specific refcounting techniques used in some C++ shallow copy
> optimizations. I'd much rather see application neutral interlocked
> primatives w/ optional acquire/release semantics.

That's a decent start on the low-level criticism appropriate to the
proposal.

> You also might want to consider a slightly higher level of abstraction. I
> don't want to get too specific at this point since I'm doing something
> like that my self. Not refcounting though.

Higher level of abstraction. Yeah. To say the least.

> Oh, and I'd forget about flogging POSIX pthreads (except for entertainment
> value maybe).
> It's no longer a standard that's evolving at any meaningful pace. Not
> quite SCSI I, but definitely a "mature" standard.

Oh, POSIX is very definitely a living and evolving specification. I wouldn't
expect anyone to even seriously consider (much less approve) radical and
inconsistent changes at this point (that would be irresponsible); but
extensions don't need to be radical.

While a simple and general mechanism that supports "things like" atomic
reference counting schemes might not be entirely out of scope for a future
revision (e.g., a POSIX 1003.1-2006), it does need to be SIMPLE,
CONSISTENT, GENERAL PURPOSE, and WIDELY IMPLEMENTABLE.

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

Alexander Terekhov

Mar 4, 2003, 10:33:20 AM

David Butenhof wrote:
[...]

> While a simple and general mechanism that supports
> "things like" atomic reference counting schemes might
> not be entirely out of scope for a future

As far as I can see, the most popular reference counting
"schemes" [that are re-implemented zillion+1 times every
year by zillions+1 programmers out there] are nothing but:

A) COW-optimized stuff ala std::string

-and-

B) Poor-man's garbage collection via "smart pointers"
that provide shared ownership and "weak" refs for
mutable and immutable stuff.

The pthread_refcount_t-thing is meant to provide a
portable solution to both these "problems"... NOT even
discouraging or precluding the "blocking" (mutex-based)
implementations in order to achieve the "highest"
degree of portability/implementability.

http://terekhov.de/pthread_refcount_t/draft-edits.txt

> revision (e.g., a POSIX 1003.1-2006), it does need to
> be SIMPLE,

pthread_refcount_t IS simple. Please explain if you
disagree.

> CONSISTENT,

pthread_refcount_t IS consistent. Please explain if
you disagree.

> GENERAL PURPOSE,

I'll admit that atomic<T>-thing would be more "general
purpose". However, having a bit less "general purpose"
but easier to use synchronization object for reference
counting wouldn't really hurt anyone, I'm sure. I'd say
that pthread_refcount_t is almost as "general purpose"
as the pthread_barrier_t. It's certainly MORE "general
purpose" than pthread_spinlock_t. ;-)

Please explain if you disagree.

> and WIDELY IMPLEMENTABLE.

pthread_refcount_t IS "widely implementable". Please
explain if you disagree.

TIA.

regards,
alexander.

Joseph Seigh

Mar 4, 2003, 12:04:12 PM

I'd agree that a memory model based formal specification is too complicated for
practical use if it were anything like the Java specification (JSR#133 notwithstanding).
But you cannot imply that because memory model based specifications are too
complicated that *any* specification would be too complicated. It doesn't follow.

Also there are problems with using an intuitive specification. One, a lot of threading
problems are counterintuitive, and two, a lot of arguments end up boiling down to
"my intuition is better than your intuition".

I tend to use the intuitive approach first, followed by some form of formal
sanity checking of any non-obvious parts. Intuition isn't always reliable.
But I wouldn't be able to do any of that without some form of formal
specification.

Joe Seigh
