
atomic reference counting


Joseph Seigh

Nov 5, 2002, 9:01:54 PM
It rules! I'd like to see shared_ptr do

q = new Obj();

100000 times in two different threads and not leak.
(q is shared by the two threads).

Maybe I should call it shared_ptr_r. :-)

Joe Seigh

Arnold Hendriks

Nov 6, 2002, 3:24:41 AM
Joseph Seigh <jsei...@xemaps.com> wrote:
> It rules! I'd like to see shared_ptr do
No, it quite sucks. Locking decisions should be made at high levels
in the code, not deep inside a library. The atomic increments in shared_ptr
reduced performance to about 40% in at least one of my applications (at a
piece of code running originally for about 50ms), because my own locks are
at a much larger granularity than _every_ shared_ptr operation. For the
same reasons that vector shouldn't lock on every push_back, shared_ptr
shouldn't try to second-guess locking policy.

Fortunately, setting a #define that disabled all threading support in
boost also disabled this 'feature'. Quite fortunate that I hadn't switched
to boost's threads classes, or a _lot_ of code would have required fixes..

(well, maybe the app was overusing shared_ptr, but it was never an issue
with the original implementation, which was quite convenient)

> q = new Obj();

> 100000 times in two different threads and not leak.
> (q is shared by the two threads).

> Maybe I should call it shared_ptr_r. :-)

That would have been a much better idea, to avoid unexpected effects on
existing code, and allow people to avoid interlocked atomic variable
access when it's not necessary.

--
Arnold Hendriks <a.hen...@b-lex.com>
B-Lex Information Technologies, http://www.b-lex.com/

ig...@notformail.paco.net

Nov 6, 2002, 4:34:44 AM
Arnold Hendriks <a.hen...@b-lex.com> wrote:
> Joseph Seigh <jsei...@xemaps.com> wrote:
>> It rules! I'd like to see shared_ptr do
> No, it quite sucks. Locking decisions should be made at high levels
> in the code, not deep inside a library. The atomic increments in shared_ptr
> reduced performance to about 40% in at least one of my applications (at a
> piece of code running originally for about 50ms), because my own locks are
> at a much larger granularity than _every_ shared_ptr operation. For the

How does this correlate with the statement that locking an unlocked mutex is a
very fast operation?

--
Igor Khasilev |
PACO Links, igor at paco dot net |

Arnold Hendriks

Nov 6, 2002, 8:10:25 AM

Not unlocking mutexes at all is even faster, especially if you don't need
them. The above described application was a multithreaded (purely for SMP
scalability) webserver, using a boss/worker model. Most of the shared_ptrs
were in the script interpreter, which didn't need to do any inter-thread
communication aside from sharing the interpreted script cache.

ig...@notformail.paco.net

Nov 6, 2002, 9:28:54 AM
Arnold Hendriks <a.hen...@b-lex.com> wrote:
> ig...@notformail.paco.net wrote:
>> Arnold Hendriks <a.hen...@b-lex.com> wrote:
>>> Joseph Seigh <jsei...@xemaps.com> wrote:
>>>> It rules! I'd like to see shared_ptr do
>>> No, it quite sucks. Locking decisions should be made at high levels
>>> in the code, not deep inside a library. The atomic increments in shared_ptr
>>> reduced performance to about 40% in at least one of my applications (at a
>>> piece of code running originally for about 50ms), because my own locks are
>>> at a much larger granularity than _every_ shared_ptr operation. For the
>
>> How does this correlate with the statement that locking an unlocked mutex is
>> a very fast operation?
>
> Not unlocking mutexes at all is even faster, especially if you don't need

It is clear, but 40% is too much. Does this mean that 40% of your script
processing consists of code like this:

lock(unlocked mutex);
counter++;
unlock(mutex);

lock(unlocked mutex);
counter--;
unlock(mutex);

or does it mean that you had high contention?

I'd like to understand your situation, because I also use pointers with
counted references (not from boost.org) in an application like yours, so
it is important for me to know where the problem can be.

> them. The above described application was a multithreaded (purely for SMP
> scalability) webserver, using a boss/worker model. Most of the shared_ptrs
> were in the script interpreter, which didn't need to do any inter-thread
> communication aside from sharing the interpreted script cache.
>

--

Alexander Terekhov

Nov 6, 2002, 11:48:16 AM

Arnold Hendriks wrote:
>
> Joseph Seigh <jsei...@xemaps.com> wrote:
> > It rules! I'd like to see shared_ptr do
> No, it quite sucks. Locking decisions should be made at high levels
> in the code, not deep inside a library. The atomic increments in shared_ptr
> reduced performance to about 40% in at least one of my applications (at a
> piece of code running originally for about 50ms), because my own locks are
> at a much larger granularity than _every_ shared_ptr operation. For the
> same reasons that vector shouldn't lock on every push_back, shared_ptr
> shouldn't try to second-guess locking policy.
>
> Fortunately, setting a #define that disabled all threading support in
> boost also disabled this 'feature'. Quite fortunate that I hadn't switched
> to boost's threads classes, or a _lot_ of code would have required fixes..

http://groups.google.com/groups?threadm=3d3ce5df%240%2413952%24edfadb0f%40dspool01.news.tele.dk
(Subject: boost shared_ptr - thread safe and non thread safe editions)

< Forward Inline >

-------- Original Message --------
Message-ID: <3D3D184F...@web.de>
Newsgroups: comp.lang.c++
Subject: Re: boost shared_ptr - thread safe and non thread safe editions

"René Dalby Carlsen" wrote:
[...]
> The above sample does not work - what should I do instead?

Try to convince Dimov that policy-based designs/interfaces aren't that bad;
and that, in general, the stuff managed by smart pointers isn't necessarily
process-wide/thread-shared stuff "by default"... like global heap, current
working directory, etc.

regards,
alexander.

--
"Yes, every user of shared_ptr that has BOOST_HAS_THREADS defined is burdened
with the overhead of having a shared_ptr that works. :-) shared_ptr follows
the standard library practice here; every user is burdened with a
synchronized heap allocator, too."

-- http://aspn.activestate.com/ASPN/Mail/Message/1263557

Joseph Seigh

Nov 6, 2002, 1:08:58 PM

What he means is that he is using them inside code that already has a lock, and so
is effectively single threaded and does not need an extra mutex. Apparently
he is using them quite heavily to get a 40% performance differential.

Joe Seigh

Alexander Terekhov

Nov 6, 2002, 1:28:47 PM

The statement that "locking unlocked mutex is very fast operation" is a MYTH.

regards,
alexander.

Joseph Seigh

Nov 6, 2002, 1:52:13 PM

There's enough difference between the thread-safe and non thread-safe versions that a
policy based api wouldn't be feasible. Different logic and different data structures.
You'd be better off just having a shared_ptr_nr for non thread-safe applications.

shared_ptr is not what I would call thread-safe, unless you consider different levels
of thread safety. I would call it level 1 thread-safe. It's thread-safe provided
you use or invoke it in a certain manner. The problem here is that proper usage
may not be documented, or it may be difficult to determine whether you are in fact
using it properly.

Level 2 thread-safe is thread-safe in the sense that you cannot screw it up no matter
how you use it, though you can still manage to shoot yourself in the foot. The stuff
I am working on is level 2 thread-safe. It will not leak memory or do double free/deletes
no matter what. The shoot-yourself part is more a lack of language support for threading
issues. For example

if (x != 0)
    x->f();

can be safe or unsafe depending on whether x is local or not. There is no way to enforce
locality in C++. Otherwise it would be level 3 thread-safe, which is that you can't shoot
yourself in the foot either.
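A sketch of that distinction, using a hypothetical thread-safe smart pointer type
(not any particular library's):

thread_safe_ptr<Obj> shared_x;        // visible to many threads

void unsafe() {
    if (shared_x != 0)                // another thread may store null here,
        shared_x->f();                // so this second access can see null
}

void safe() {
    thread_safe_ptr<Obj> x(shared_x); // one atomic copy; the count pins the object
    if (x != 0)
        x->f();                       // x is local, nobody else can change it
}

The level 2 guarantee covers the counting itself; the null test and the dereference
are still two separate operations unless x is local.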

Joe Seigh

Joseph Seigh

Nov 6, 2002, 8:59:31 PM
The current version uses pthread locks to simulate compare and swap. So
now that it works, I need to find some compare and swap routines. They
don't exactly pop out on Google. Does anybody know if these things exist?
The only clean examples I could find were for sparc, plus that's what I
have the documentation for, so if I have to write it, it's only going to
get done for the sparc.

I'm interested in the following compare and swap doubleword (64 bit) routines

cmpxchg8b for intel
casxa for sparc w/ memory barriers

other platforms as well, though I won't be able to test them out.

Also does anyone know if there is a 64 bit InterlockedCompareExchange
for vc++/windows?
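For what it's worth, here is a sketch of the intel one as GCC inline assembly
(the usual pattern for cmpxchg8b; treat it as untested):

// 64 bit compare and swap via lock cmpxchg8b; returns nonzero on success.
// The comparand travels in edx:eax ("A"), the new value in ecx:ebx.
static int cas64(volatile long long * dest, long long cmp, long long xchg)
{
    unsigned char ok;
    __asm__ __volatile__(
        "lock; cmpxchg8b %1\n\t"
        "sete %0"
        : "=q" (ok), "+m" (*dest), "+A" (cmp)
        : "b" ((unsigned long) xchg),
          "c" ((unsigned long) (xchg >> 32))
        : "cc", "memory");
    return ok;
}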

Joe Seigh

Hillel Y. Sims

Nov 7, 2002, 12:48:31 AM
"Joseph Seigh" <jsei...@xemaps.com> wrote in message
news:3DC96613...@xemaps.com...

>
> Alexander Terekhov wrote:
> >
> > Arnold Hendriks wrote:
> > >
> > > Joseph Seigh <jsei...@xemaps.com> wrote:
> > > > It rules! I'd like to see shared_ptr do
> > > No, it quite sucks. Locking decisions should be made at high levels
> > > in the code, not deep inside a library. The atomic increments in
> > > shared_ptr reduced performance to about 40% in at least one of my
> > > applications (at a piece of code running originally for about 50ms),
> > > because my own locks are at a much larger granularity than _every_
> > > shared_ptr operation. For the same reasons that vector shouldn't lock
> > > on every push_back, shared_ptr shouldn't try to second-guess locking
> > > policy.
[..]

> > "Yes, every user of shared_ptr that has BOOST_HAS_THREADS defined is
> > burdened with the overhead of having a shared_ptr that works. :-)
> > shared_ptr follows the standard library practice here; every user is
> > burdened with a synchronized heap allocator, too."
> >
> > -- http://aspn.activestate.com/ASPN/Mail/Message/1263557

First of all, it seems like this quote (from Peter Dimov?) is referring to a
different issue than being able to do "ptr.reset(new Obj)" simultaneously in
multiple threads on a single shared ptr object. If I am not mistaken, the
boost people are referring here to the thread-safety and performance
implications of sharing the internal reference count mechanism among
multiple ptr objects. This is the feature which caused the performance
degradation mentioned above.

Anyhow, the original issue as stated by the OP:

"Joseph Seigh" <jsei...@xemaps.com> wrote in message
news:3DC8793E...@xemaps.com...


> I'd like to see shared_ptr do
>

> q = new Obj();
>
> 100000 times in two different threads and not leak.
> (q is shared by the two threads).
>

It seems to me that this has little to do with stressing the thread-safety
of the reference counting mechanism itself (which is by nature intended to
be shared among _different_ ptr objects without explicit synchronization).
You are referring to the safety of writing a shared object simultaneously in
multiple threads, for which I think it is usually recommended to synchronize
this access externally to the object itself in almost all cases.

My sample (pseudo)code using boost::shared_ptr<> for the same scenario would
look something like:

Mutex mx;
boost::shared_ptr<Obj> ptr;

threadfunc()
{
    for (int i = 0; i < 100000; ++i) {
        Lock l(mx);
        ptr.reset(new Obj);
    }
}

If you can do this without the mutex lock using your shared_ptr_r type, then
that is a neat trick I suppose. I think I am beginning to understand the
details of the problem of how one might implement this feature safely using
atomic compare-and-swap operations instead of a mutex. But it seems like
this feature has little real-world applicability; this definitely seems to
be on the same order as making std::vector internally synchronized on every
operation.

I think the most important capability of a thread-safe reference-counted
pointer would be that the following type of access is safe without explicit
synchronization:

reentrant_safe_func(boost::shared_ptr<const Obj> thread_local_ptr_to_readonly_obj)
{
    ..read some data from obj..
}

main()
{
    boost::shared_ptr<const Obj> ptr(new Obj);
    thread_create(...threadfunc, &ptr);
    ..wait for ready condition..
    thread_detach(..)
    ..
    reentrant_safe_func(ptr);
}

threadfunc(boost::shared_ptr<const Obj>* pptr)
{
    boost::shared_ptr<const Obj> ptr(*pptr);
    ..signal ready condition..
    ..
    reentrant_safe_func(ptr);
}

thanks,
hys

--
Hillel Y. Sims
FactSet Research Systems
hsims AT factset.com


Silicon Graphics

Nov 7, 2002, 1:27:34 AM
Joseph Seigh <jsei...@xemaps.com> wrote:
: The current version uses pthread locks to simulate compare and swap. So
: now that it works, I need to find some compare and swap routines.
:
: cmpxchg8b for intel

I have implemented my own inline assembly routines for the usual atomic
operations (increment, decrement, cas, dcas, double checked locking) for
Linux/x86, DevStudio/x86, and SunCC Solaris.

Defining dcas for v8plusa and v9 is a little tricky, due to 32-bit vs.
64-bit registers, and gcc asm notation is painful. Explicit memory barriers
aren't needed for the SPARC CAS and CASX instructions, but are needed for the
dcl routines (ie relaxed memory models).

Most of the ideas/code can be gleaned from the Linux kernel distro and existing
projects with atomic primitives (hoard, Michael Bennett's atomic library, etc).

Unfortunately, I wasn't able to find any lock-free algorithms that a)
didn't need garbage collection, and b) were useful in 64-bits (ie A-B-A
versioning problem).

Adam
ze...@best.com

: Joe Seigh

Konrad Schwarz

Nov 7, 2002, 4:22:19 AM

Joseph Seigh wrote:


>
> The current version uses pthread locks to simulate compare and swap. So
> now that it works, I need to find some compare and swap routines. They
> don't exactly pop out on Google. Does anybody know if these things exists?

Check out the successor documents to the IBM 370 Principles of Operation.

Alexander Terekhov

Nov 7, 2002, 6:03:50 AM

"Hillel Y. Sims" wrote:
>
> "Joseph Seigh" <jsei...@xemaps.com> wrote in message
> news:3DC96613...@xemaps.com...
> >
> > Alexander Terekhov wrote:
> > >
> > > Arnold Hendriks wrote:
^^^^^^^^^^^^^^^^^^^^^

> http://aspn.activestate.com/ASPN/Mail/Message/1263557
>
> First of all, it seems like this quote (from Peter Dimov?)

Yes, from Peter Dimov.

> is referring to a
> different issue than being able to do "ptr.reset(new Obj)" simultaneously in
> multiple threads on a single shared ptr object. If I am not mistaken, the
> boost people are referring here to the thread-safety and performance
> implications of sharing the internal reference count mechanism among
> multiple ptr objects. This is the feature which caused the performance
> degradation mentioned above.

Hillel, I was replying [pointing to the lack of "null-sync" policy]
to *Arnold* and "performance degradation mentioned above."

FWIW, here is the quote relevant to "the original issue as stated by
the OP" (Joe).

http://lists.boost.org/MailArchives/boost/msg35700.php
(Re: [boost] Smart Pointers documentation/questions)

<Dimov>

shared_ptr is supposed to be as thread-safe as an int.

</Dimov>

regards,
alexander.

Arnold Hendriks

Nov 7, 2002, 6:08:43 AM
ig...@notformail.paco.net wrote:
>>> How does this correlate with the statement that locking an unlocked mutex is
>>> a very fast operation?
>>
>> Not unlocking mutexes at all is even faster, especially if you don't need

> It is clear, but 40% is too much. Does this mean that 40% of your script
> processing consists of code like this:

No, but it had a lot of shared_ptr manipulation - all expressions in
the scripting language were stored in RvaluePtrs, and the parser did a lot
of passing, cloning, etc. of these in function calls.

I did a 'quick' recompile with an atomic shared_ptr, and without. These
were the timings I got with the atomic shared_ptr:
Timer Environment::LoadScript: 289,015 uS, overhead 1 uS
Timer Environment::LoadScript: 94,171 uS, overhead 1 uS
Timer Environment::LoadScript: 204,489 uS, overhead 1 uS
and these are the timings without:
Timer Environment::LoadScript: 253,560 uS, overhead 1 uS
Timer Environment::LoadScript: 86,703 uS, overhead 1 uS
Timer Environment::LoadScript: 169,523 uS, overhead 1 uS

That's 588 ms vs 510 ms, or about a 15% difference. The larger, more
visible differences were found on older code, which likely had a bit more
shared_ptr copying.

It will take a little bit longer, but I'll try to check out the old code
and run tests on that, also because I'm interested in whether I'll see the same
results on a UP machine (I'd expect interlocked variables to be less costly
there, because the bus locks cannot hinder any other processor in any way).
Then I can also use a bigger load to get more accurate timings.

> I'd like to understand your situation, because I also use pointers with
> counted references (not from boost.org) in the application like yours, so
> this is important to me to know where the problem can be.

I know that before I disabled atomic references completely, I first tried
to eliminate as much copying as possible, restructuring the handling of
pointers where necessary, before removing it altogether. That must have
paid off at least a little, as the difference on current code is only 15%.

In our application, though, we found the biggest performance wins by separating
server processes (which need to deal with multiple connections and maximize
scalability on SMP machines) from worker processes (which do batch processing:
the tasks are scheduled by the servers, and they just do the work in FIFO order).
The worker processes were in reality singlethreaded, but were compiled against the
same multithreaded DLLs as the server for convenience. Supplying two sets of
DLLs so that workers could be completely singlethreaded gave quite a
noticeable speed improvement.

Alexander Terekhov

Nov 7, 2002, 6:25:23 AM

Arnold Hendriks wrote:
[...]

> I know that before I disabled atomic references completely, I first tried
> to eliminate as much copying as possible, restructuring the handling of
> pointers where necessary, before removing it all together. That must have
> paid off at least a little, as the difference on current code is only 15%.

Copying SUCKS. I'd recommend using auto_ptr-like/"MOJO"-<whatever-
version>-like MOVE-on-copy semantics "by default" with an explicit
a_smart_shared_ptr.clone() to create EXPLICIT copies. Well, you'll
also need an "auto-clone-on-copy" adapter class to make it work with
the Standard C++ Library containers and algorithms [and things
like Boost.Function functors], however.

regards,
alexander.

Joseph Seigh

Nov 7, 2002, 7:43:04 AM

It's downloadable for free in pdf format from IBM. Don't have a url offhand
as I downloaded my copy a while back. Just do a search. It's z390 now.
They've been busy I see. It's all microcode but still the architecture process
must have been interesting. I used to participate in that at one time.

Joe Seigh

Alexander Terekhov

Nov 7, 2002, 8:05:05 AM

Joseph Seigh wrote:
>
> Konrad Schwarz wrote:
> >
> > Joseph Seigh schrieb:
> > >
> > > The current version uses pthread locks to simulate compare and swap. So
> > > now that it works, I need to find some compare and swap routines. They
> > > don't exactly pop out on Google. Does anybody know if these things exists?
> >
> > Check out the successor documents to the IBM 370 Principles of Operation.
>
> It's downloadable for free in pdf format from IBM. Don't have a url offhand
> as I downloaded my copy a while back. Just do a search. It's z390 now.

Nah, it's z900 now. ;-)

> They've been busy I see. It's all microcode but still the architecture process
> must have been interesting. I used to participate in that at one time.

The following stuff is also downloadable for free in pdf [and
ASCII text] format from IBM.

http://www.research.ibm.com/journal/rd46-45.html
(IBM eServer z900, Vol. 46, Nos. 4/5, 2002)

"This double issue contains nineteen papers that
describe many aspects of the IBM eServer z900.
Although the z900 is built on the legacy of its
predecessors葉he S/390 Parallel Enterprise Servers
G3 through G6容very major subsystem was
completely redesigned with the express purpose of
delivering the first IBM e-business mainframe server.
The z900 features a full 64-bit architecture, a new
high-bandwidth I/O subsystem, and autonomic features
such as the ability to allocate computing and
I/O resources as the need arises, using customer-
defined priorities. The contribution of the z900 to
IBM's server business cannot be overstated.
click to enlarge ->"

regards,
alexander. < working for the IBM Server Group and Tivoli >

David Butenhof

Nov 7, 2002, 8:05:31 AM
Alexander Terekhov wrote:

No, not at all; but it's a relative metric applied to a source level
standard.

The specification of mutex operation is designed to allow an uncontended
mutex unlock without any overhead that's not absolutely essential. On some
machines, more is essential than on others. An implementation may choose to
focus on other design goals than pure speed -- for example, the ability to
easily view the real state for debugging. (E.g., the mutex may track
ownership even though that's not strictly required and might, for example,
double the cost of the uncontended unlock.)

In general, the intent is that the unlock of an uncontended mutex (no
waiters) can be substantially cheaper than unlocking a mutex with waiters.
But, as Arnold Hendriks points out, avoiding the unlock entirely is always
going to be even faster.

Life (and particularly engineering) is full of tradeoffs, and few of them
are strict boolean decisions. There's a continuum, and moving either
direction affects many design criteria, including portability, ease of
design and implementation, performance, usability, developer sanity, and so
forth.

If synchronization is important, use of a mutex is appropriate, portability
is more important than raw performance on a particular machine, and you
want to leverage the existing debug/analysis environment support for
mutexes (as opposed to some private atomic synchronization protocol), then
"an uncontended mutex unlock is fast" compared to the alternatives.

If you're willing and able to do machine-specific bit bumming, that level of
performance is critical, and it doesn't matter that nobody can see or
analyze the synchronization behavior, then custom atomic synchronization is
probably faster. Is it worth the cost? The developer of the package is the
first to get a chance to answer that question... but always be aware that
the users (and later maintainers) of the package may eventually come to
disagree. ;-)

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

Joseph Seigh

Nov 7, 2002, 8:19:25 AM

Silicon Graphics wrote:
...


> cmpxchg8b
>
> I have implemented my own inline assembly routines for the usual atomic
> operations (increment, decrement, cas, dcas, double checked locking) for
> Linux/x86, DevStudio/x86, and SunCC Solaris.
>
> Defining dcas for v8plusa and v9 is a little tricky, due to 32-bit vs.
> 64-bit registers, and gcc asm notation is painful. Explicit memory barriers
> aren't needed for the SPARC CAS and CASX instructions, but are needed for the
> dcl routines (ie relaxed memory models).

Memory barriers aren't required but if you don't use them you have a smart pointer
that is equivalent to Java volatile pointers. They'll work but they're totally
useless for anything.

> Most of the ideas/code can be gleaned from the Linux kernel distro and existing
> projects with atomic primitives (hoard, Michael Bennett's atomic library, etc).

Michael Bennett's web page doesn't seem to be around anymore.

>
> Unfortunately, I wasn't able to find any lock-free algorithms that a)
> didn't need garbage collection, and b) were useful in 64-bits (ie A-B-A
> versioning problem).

Well, you can just not ever delete anything. That's actually feasible if
you have a bounded number of threads. You can just use a circular queue
or a distributed algorithm.

I've only seen the ABA problem in reference to popping from a stack using
compare and swap. IBM has always published the solution to this in their
Principles of Operation using a doubleword compare and swap. Basically
you maintain a change count in the stack header. They actually got it
backwards though, and I pointed this out to them way back. You can push onto
the stack using a 32 bit compare and swap and just use the 64 bit compare and
swap to change the count on the pop operation as well as the pointer.
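A sketch of that scheme, with hypothetical cas32/cas64 primitives (the stack
header and its change count share one doubleword):

struct node { struct node * next; /* payload */ };
struct stack_head { struct node * top; unsigned long ver; }; // one doubleword

void push(struct stack_head * h, struct node * n)
{
    struct node * old;
    do {
        old = h->top;
        n->next = old;
    } while (!cas32(&h->top, old, n));   // single word CAS; no count needed
}

struct node * pop(struct stack_head * h)
{
    struct stack_head oldh, newh;
    do {
        oldh = *h;
        if (oldh.top == 0)
            return 0;
        newh.top = oldh.top->next;       // an A-B-A swap would go unnoticed...
        newh.ver = oldh.ver + 1;         // ...without the change count below
    } while (!cas64(h, oldh, newh));     // doubleword CAS on (top, ver)
    return oldh.top;
}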

You have to be careful using the term DCAS. Some of the papers I've seen
use DCAS to refer to a compare and swap that compares 2 discontiguous
words in memory rather than a contiguous doubleword. Speaking of those,
I could have sworn the MC68020 or MC68030 had one of these because I
remember thinking at the time that it would be useful for solving the ABA
problem.

Joe Seigh

Alexander Terekhov

Nov 7, 2002, 8:47:59 AM

David Butenhof wrote:
>
> Alexander Terekhov wrote:
[...]

> > The statement that "locking unlocked mutex is very fast operation" is a
> > MYTH.
>
> No, not at all; but it's a relative metric applied to a source level
> standard.

[snip the stuff I fully concur]

> If you're willing and able to do machine-specific bit bumming, that level of
> performance is critical, and it doesn't matter that nobody can see or
> analyze the synchronization behavior, then custom atomic synchronization is
> probably faster. Is it worth the cost? The developer of the package is the
> first to get a chance to answer that question... but always be aware that
> the users (and later maintainers) of the package may eventually come to
> disagree. ;-)

I am willing to do this: >>*PORTABLY*<<

http://groups.google.com/groups?threadm=3BC8463D.6E23002A%40web.de
(Newsgroup: c.p.t., Subject: Lockless/non-blocking ref.counting?
Note that unref() for mutable objects needs to inject "release"
membar too... since threads could unref() mutable objects INSIDE
synchronized regions for mutation/updates, or even use ref.counting
to manage lifetime of condition variables: <http://tinyurl.com/2i87>)

<Terekhov>

Well, I am also puzzled because if the above is
true (and so far, I really think so) then why
does PTHREADS not have a non-blocking/faster/less
expensive interface for ref.counting than general
mutexes/rw-locks and even spinlocks (for an appl
that is able to ensure correct usage of spinlock)?

</Terekhov>

regards,
alexander.

--
"....
> > What should we do?
>
> Well, I guess...
>
> Bombard-to-death ;-) TOG/Austin Group reflector[1] (MS
> IS the 'silver' TOG member[2] as well ;-)) with zillions
> of messages requesting new opaque type pthread_refcount_t
> (or something like that) plus 'count'-calls/macros...
> with perhaps even more 'optimized' [for immutable objects
> only] counterparts -- operations at least; not sure w.r.t
> the separate extra opaque type -- could be the same one,
> perhaps.
...."

-- http://lists.boost.org/MailArchives/boost/msg30364.php

Joseph Seigh

Nov 7, 2002, 10:54:15 AM
Here's what I have so far, using mutexes to simulate compare and swap atomic ops.
It's a little harsher on cache than just plain reference counting because there
are two different kinds of reference counts, but then just plain reference
counting isn't thread-safe.

I also haven't optimized handling of null yet. There's no need to reference count
the null object.

Also, you should create a smart pointer from an object only once. Use the
copy constructor or assignment to create new smart pointers; otherwise you will
get double deletes. This is because the reference count is separate from the
object and there is no way to tell that an object has already had a smart pointer
created for it. Putting the reference count in the object would help somewhat, but
that technique has other problems and side effects.
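For example:

Obj * raw = new Obj;
atomic_ptr<Obj> a(raw);  // ok: the one and only construction from raw
atomic_ptr<Obj> b(a);    // ok: copy shares a's reference count object
atomic_ptr<Obj> c(raw);  // wrong: second independent count -> double delete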

Joe Seigh

atomic_ptr.h --

#include <stdio.h>

#include <pthread.h> // temporary
#include <errno.h>   // temporary

// temporary mutex
class zmutex {
public:
    zmutex() {
        if (pthread_mutex_init(&xmutex, NULL) != 0)
            perror("mutex_init");
    }
    ~zmutex() {
        if (pthread_mutex_destroy(&xmutex) != 0)
            perror("mutex_destroy");
    }
    void lock() {
        if (pthread_mutex_lock(&xmutex) != 0)
            perror("mutex_lock");
    }
    void unlock() {
        if (pthread_mutex_unlock(&xmutex) != 0)
            perror("mutex_unlock");
    }

private:
    pthread_mutex_t xmutex;

};

template<typename T> class atomic_ptr;

//=============================================================================
// atomic_ptr_ref
//
//
//=============================================================================
template<typename T> class atomic_ptr_ref {
    friend class atomic_ptr<T>;

private:

    atomic_ptr_ref(T * p = 0) {
        ephemeralCount = 0;
        referenceCount = 1;
        ptr = p;
    }

    // Apply both deltas under the lock; returns 0 iff both counts have
    // dropped to zero, i.e. the object and this ref block are deletable.
    int adjust(long xephemeralCount, long xreferenceCount) {
        long ecount;
        long rcount;

        mutex.lock();
        ecount = (ephemeralCount += xephemeralCount);
        rcount = (referenceCount += xreferenceCount);
        mutex.unlock();
        return (ecount == 0 && rcount == 0) ? 0 : 1;
    }

    long ephemeralCount; // ephemeral reference count
    long referenceCount; // reference count
    T * ptr;             // ptr to actual object

    zmutex mutex;        // temporary

}; // class atomic_ptr_ref


//=============================================================================
// atomic_ptr
//
//
//=============================================================================
template<typename T> class atomic_ptr {

public:

    atomic_ptr(T * obj = 0) {
        ephemeralCount = 0;
        refptr = new atomic_ptr_ref<T>(obj);
        // membar
    }

    atomic_ptr(atomic_ptr & src) { // copy constructor

        refptr = src.getrefptr(); // atomic
        ephemeralCount = 0;

        // adjust link count: convert the ephemeral reference taken by
        // getrefptr() into a regular reference held by this copy
        if (refptr != 0)
            refptr->adjust(-1, +1); // atomic
        // membar
    }

    ~atomic_ptr() { // destructor
        if (refptr != 0 && refptr->adjust(ephemeralCount, -1) == 0) {
            if (refptr->ptr != 0)
                delete refptr->ptr;
            delete refptr;
        }
    }

    atomic_ptr & operator = (atomic_ptr & src) {
        atomic_ptr temp(src); // copy
        swap(temp);           // atomic
        return *this;
    }

    atomic_ptr & operator = (T * obj) {
        atomic_ptr temp(obj);
        swap(temp);
        return *this;
    }

    T * get() {
        atomic_ptr_ref<T> * temp = getrefptr();
        T * temp2 = temp->ptr;
        temp->adjust(-1, 0); // release the ephemeral reference
        return temp2;
    }

    T * operator -> () { return get(); }
    T & operator * () { return *get(); }
    bool operator == (atomic_ptr<T> & rhd) { return (get() == rhd.get()); }
    bool operator != (atomic_ptr<T> & rhd) { return (get() != rhd.get()); }
    bool operator == (T * rhd) { return (get() == rhd); }
    bool operator != (T * rhd) { return (get() != rhd); }

private:

    long ephemeralCount;        // ephemeral reference count
    atomic_ptr_ref<T> * refptr; // reference count object
    zmutex mutex;               // temporary

    // exchange (count, refptr) pairs atomically
    void swap(atomic_ptr & obj) {
        long tempcount;
        atomic_ptr_ref<T> *tempptr;

        mutex.lock();
        tempcount = ephemeralCount; tempptr = refptr;
        ephemeralCount = obj.ephemeralCount; refptr = obj.refptr;
        obj.ephemeralCount = tempcount; obj.refptr = tempptr;
        mutex.unlock();
    }

    // fetch refptr and bump the ephemeral count in one atomic step
    atomic_ptr_ref<T> * getrefptr() {
        atomic_ptr_ref<T> *p;

        mutex.lock();
        ephemeralCount++;
        p = refptr;
        mutex.unlock();

        return p;
    }

}; // class atomic_ptr


/*-*/

Torsten Robitzki

Nov 7, 2002, 3:15:07 PM

Here is my implementation of shared_ptr_r:

template <class T>
class shared_ptr_r {
    shared_ptr_r(const T* t)
    {
        delete t;
    }
    void operator=(const T* t)
    {
        delete t;
    }
};

absolutely no locking needed, and it will not even leak a single byte ;-)

That was easy because the requirements were very small. But seriously,
what do you want to achieve?

regards
Torsten

Joseph Seigh

Nov 7, 2002, 4:50:33 PM
How does this work? Simple: a reference count can be expressed as the difference
between a count of references started and a count of references ended.

Rc = Rs - Re

Furthermore, each side can be a sum of multiple counts.

Rc = Sum(Rs) - Sum(Re)

Since the count is now split up, you can keep the start reference count with
the pointer and fetch the pointer value and increment the start reference
count atomically, which solves the race condition that regular reference
counting has.

However, since the process of summing up the reference counts may be asynchronous
and distributed, you need an additional mechanism to determine whether an
apparent sum of interest (usually zero) is a partial sum or not. Various
mechanisms can be used.

These reference count pointers cannot be copied; more precisely, you
cannot make a pointer that is itself copyable, so it is a pointer of a
different type.

So to solve the asynchronous summing problem and the copy problem, you just
combine these types of reference counts, called ephemeral reference counts,
with regular reference counts.

A reference count object becomes deletable when the regular reference count
becomes zero and the ephemeral count becomes zero, which usually means no
copies are in progress.
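A sketch of the acquire path, assuming a hypothetical doubleword cas2() that
swings a packed (pointer, count) pair, analogous to the refptr/ephemeralCount
pair in the code posted earlier:

struct pair64 { void * ptr; long started; };  // packed (pointer, Sum(Rs))

void * acquire(volatile struct pair64 * p)
{
    struct pair64 oldv, newv;
    do {
        oldv = *p;                        // racy snapshot, validated by the CAS
        newv.ptr = oldv.ptr;
        newv.started = oldv.started + 1;  // references-started count
    } while (!cas2(p, oldv, newv));       // hypothetical doubleword CAS
    return oldv.ptr;                      // pointer fetched and Rs incremented
                                          // in one atomic step
}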

Joe Seigh

Silicon Graphics

Nov 7, 2002, 10:46:51 PM
Joseph Seigh <jsei...@xemaps.com> wrote:
:> Defining dcas for v8plusa and v9 is a little tricky, due to 32-bit vs.
:> 64-bit registers, and gcc asm notation is painful. Explicit memory barriers
:> aren't needed for the SPARC CAS and CASX instructions, but are needed for the
:> dcl routines (ie relaxed memory models).

: Memory barriers aren't required but if you don't use them you have a smart pointer
: that is equivalent to Java volatile pointers. They'll work but they're totally
: useless for anything.

Not quite sure I get this. For ref counting, the cas/inc/dec operations are atomic,
and aren't ordering dependent. For accessing the actual count as an integer, I
insert a read barrier after the fetch. What did I miss? Of course, the dcl ops
contain both read and write barriers.

: Well, you can just not ever delete anything. That's actually feasible if
: you have a bounded number of threads. You can just use a circular queue
: or a distributed algorithm.

Which is what the high performance Apache linked list does. It seems that
if the queue is empty/full, you would be back to a cv/poll solution anyways.

: I've only seen the ABA problem in reference to popping from a stack using
: compare and swap. IBM has always published the solution to this in their
: Principles of Operation using a doubleword compare and swap. Basically
: you maintain a change count in the stack header. They actually got it
: backwards though, and I pointed this out to them way back. You can push onto
: the stack using a 32 bit compare and swap and just use the 64 bit compare and
: swap to change the count on the pop operation as well as the pointer.

Which was my second point, in that these operations cease to be useful
when compiling native 64-bit (64 bit ptr + 32/64 bit version > dcas).

: You have to be careful using the term DCAS. Some of the papers I've seen
: use DCAS to refer to a compare and swap that compares 2 discontiguous
: words in memory rather than a contiguous doubleword. Speaking of those,
: I could have sworn the MC68020 or MC68030 had one of these because I
: remember thinking at the time that it would be useful for solving the ABA
: problem.

The David Detlefs/Motorola 68020 definition is an atomic instruction which
modifies two (possibly separate) memory locations. Unfortunately, outside
of the 68020, all that is available is 8 byte cas.

Adam
ze...@best.com

Joseph Seigh

Nov 8, 2002, 7:31:27 AM

Silicon Graphics wrote:
>
> Joseph Seigh <jsei...@xemaps.com> wrote:
> :> Defining dcas for v8plusa and v9 is a little tricky, due to 32-bit vs.
> :> 64-bit registers, and gcc asm notation is painful. Explicit memory barriers
> :> aren't needed for the SPARC CAS and CASX instructions, but are needed for the
> :> dcl routines (ie relaxed memory models).
>
> : Memory barriers aren't required but if you don't use them you have a smart pointer
> : that is equivalent to Java volatile pointers. They'll work but they're totally
> : useless for anything.
>
> Not quite sure I get this. For ref counting, the cas/inc/dec operations are atomic,
> and aren't ordering dependent. For accessing the actual count as an integer, I
> insert a read barrier after the fetch. What did I miss? Of course, the dcl ops
> contain both read and write barriers.
>

I was referring to the smart pointers themselves. Without memory barriers they
aren't really useful for a lot of interesting techniques like DCL, copy on write,
wait-free, etc. Which is why I mentioned Java volatile pointers. Try
doing DCL with Java pointers. You can't unless you throw in a spurious
synchronized (x) {} to act as a memory barrier, which makes DCL in Java rather moot.

> Which was my second point, in that these operations cease to be useful
> when compiling native 64-bit (64 bit ptr + 32/64 bit version > dcas).
>

I would assume that any architectures ported to 64 bits would port all the
instructions. Doubleword = 2 x 32 for 32 bit machines and 2 x 64 for 64 bit machines.
I suppose it could be dropped but why would disparate 32 bit architectures think
doubleword compare and swap was useful for 32 bits and not for 64 bits when they
go to that?

Joe Seigh

David Butenhof

Nov 8, 2002, 7:38:19 AM
Alexander Terekhov wrote:

>
> David Butenhof wrote:
>>
>> Alexander Terekhov wrote:
> [...]
>> > The statement that "locking unlocked mutex is very fast operation" is a
>> > MYTH.
>>
>> No, not at all; but it's a relative metric applied to a source level
>> standard.
>
> [snip the stuff I fully concur]
>
>> If you're willing and able to do machine-specific bit bumming, that level
>> of performance is critical, and it doesn't matter that nobody can see or
>> analyze the synchronization behavior, then custom atomic synchronization
>> is probably faster. Is it worth the cost? The developer of the package is
>> the first to get a chance to answer that question... but always be aware
>> that the users (and later maintainers) of the package may eventually come
>> to disagree. ;-)
>
> I am willing to do this: >>*PORTABLY*<<

What's that supposed to mean? No, you're NOT willing to do it because it
can't be done portably? Or that you believe you somehow CAN do it portably?

One must be extremely suspicious of anyone starting out with (or even
quoting) the advice "bombard-to-death". That strongly suggests incitement
to malicious attack, now doesn't it? A particularly bad way to try to get
anyone's attention.

If someone would like to politely and appropriately become involved in
discussions that might lead to development of additional features for a
later standard (or revision of the current standard), and cares to make
such a suggestion, it would certainly be considered.

However, keep in mind that reference counts rarely exist independently; and
if they're counting SOMETHING, which is shared, then that something must be
both synchronized and VISIBLE -- on its own and mutually with respect to
the reference count. The result will be either extremely restricted, or
will add overhead and complication to become general purpose. Which
inevitably leads to the argument that the standard reference count is OK if
you don't care about performance... but, if you do, you should "roll your
own".

None of this means it can't (or even shouldn't) be done... but if you think
such an extension will solve all problems, you're mistaken.

Alexander Terekhov

Nov 8, 2002, 10:28:16 AM

David Butenhof wrote:
[...]

> However, keep in mind that reference counts rarely exist independently; and
> if they're couting SOMETHING, which is shared, then that something must be
> both synchronized and VISIBLE -- on its own and mutually with respect to
> the reference count. The result will be either extremely restricted, or
> will add overhead and complication to become general purpose. Which
> inevitably leads to the argument that the standard reference count is OK if
> you don't care about performance... but, if you do, you should "roll your
> own".
>
> None of this means it can't (or even shouldn't) be done... but if you think
> such an extension will solve all problems, you're mistaken.

Really? Uhmm. What about the following "roll your own" stuff... <sketch>

Opaque "mutex-like" pthread_refcount_t [and pthread_refcount_attr_t] type
for reference counting meant to be used for non-const/mutable stuff that
CAN mutate 'during' reference counting 'period' -- synchronized using
conventional means; no change here. The only reference counting operation
that should "synchronize memory with respect to other threads" [4.10, but
"execution" aside -- <http://tinyurl.com/2jb9>] would be
pthread_refcount_decrement(). Note that pthread_refcount_increment() needs
*NOT* "synchronize memory with respect to other threads". The rest is as
follows:

PTHREAD_REFCOUNT_MAX              // upper bound
PTHREAD_REFCOUNT_INITIALIZER(N)   // macro for statics INIT
pthread_refcount_init()           // mutex like but with initial value
pthread_refcount_destroy()        // mutex like apart from EBUSY
pthread_refcount_getvalue()       // just in case someone will need it
pthread_refcount_setvalue()       // without any sync whatsoever
pthread_refcount_attr_*()         // PROCESS_SHARED and etc. attr-stuff
std::size_t as "value type"       // for get/set "value" parameter
PTHREAD_REFCOUNT_DROPPED_TO_ZERO  // for pthread_refcount_decrement() to
                                  // indicate "dropped to zero" condition
                                  // that MAY be used to safely destroy
                                  // associated ref.counted object or
                                  // simply continue -- increment/setval

Well, for objects that aren't changed [are immutable] "during" reference
counting 'period', DECREMENT operation needs >>NOT<< "synchronize memory
with respect to other threads". So, perhaps, yet another '_refcount_t'
but for "const"-only stuff (plus operations and INIT macro), or perhaps
simply one more decrement-call [for "const"-only stuff] would be really
useful as well.

Increment, decrement, get and set should be allowed to be macros (or
macros PLUS functions).

Well, pthread_refcount_add()/sub [to add/sub size_t values] with
increment/decrement "semantics" with respect to memory synchronization
(or lack thereof -- in the case of increment/add; oh, and 'sub' should
also have PTHREAD_REFCOUNT_DROPPED_TO_ZERO return value/status -- same
as pthread_refcount_decrement()) might be useful too, probably.
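A usage sketch (every name here is from the proposed interface above; none of
it actually exists in POSIX):

typedef struct object {
    pthread_refcount_t refs;  /* e.g. PTHREAD_REFCOUNT_INITIALIZER(1) */
    /* payload ... */
} object_t;

void object_unref(object_t * obj)
{
    if (pthread_refcount_decrement(&obj->refs)
            == PTHREAD_REFCOUNT_DROPPED_TO_ZERO) {
        /* the decrement synchronized memory; safe to tear down */
        object_destroy(obj);  /* hypothetical payload cleanup */
        free(obj);
    }
}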

Now, what's wrong with it? Where am I mistaken? It doesn't really seem
to be "rocket science" to me... unless I'm just missing and/or
misunderstanding something.

regards,
alexander.

Silicon Graphics

Nov 8, 2002, 10:05:20 PM
Joseph Seigh <jsei...@xemaps.com> wrote:

:> Not quite sure I get this. For ref counting, the cas/inc/dec operations are atomic,
:> and aren't ordering dependent. For accessing the actual count as an integer, I
:> insert a read barrier after the fetch. What did I miss? Of course, the dcl ops
:> contain both read and write barriers.
:>
: I was referring to the smart pointers themselves. Without memory barriers they
: aren't really useful for a lot of interesting techiques like DCL, copy on write,
: wait-free, etc ... Which is why I mentioned Java volatile pointers. Try
: doing DCL with Java pointers. You can't unless you throw in a spurious sync (x) {}
: to act as a memory barrier which makes DCL in java rather moot.
:

I was differentiating between the reference count itself, and the pointer. The count
shouldn't need memory barriers, unless it is being returned by value (even then,
probably only on DEC Alpha).

As for pointers and dcl, my code reflects Schmidt's pseudocode found in:

http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
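
Roughly this shape, as a sketch (in the spirit of that paper, not Schmidt's
actual code; the barrier and lock names are placeholders):

Singleton * instance()
{
    Singleton * tmp = g_instance;  // first, unlocked check
    read_barrier();                // order this load before loads from *tmp
    if (tmp == 0) {
        lock(&g_mutex);
        tmp = g_instance;          // second check, under the lock
        if (tmp == 0) {
            tmp = new Singleton;
            write_barrier();       // publish the object before the pointer
            g_instance = tmp;
        }
        unlock(&g_mutex);
    }
    return tmp;
}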

: I would assume that any architectures ported to 64 bits would port all the
: instructions. Doubleword = 2 x 32 for 32 bit machines and 2 x 64 for 64 bit machines.
: I suppose it could be dropped but why would disparate 32 bit architectures think
: doubleword compare and swap was useful for 32 bits and not for 64 bits when they
: go to that?

I don't use any processors that do. I can compile 64-bit code on SPARC V9, but
am still limited to CASX (64-bit cas). Most likely for the same reason no one (except
Motorola) has a "true" dcas: not enough need. That may be changing with the push
from Sun's Research group wrt Java garbage collection:

http://citeseer.nj.nec.com/detlefs00even.html

Adam
ze...@best.com

Hillel Y. Sims

Nov 9, 2002, 1:45:42 AM

"Alexander Terekhov" <tere...@web.de> wrote in message
news:3DCBD810...@web.de...


>
> Well, for objects that aren't changed [are immutable] "during" reference
> counting 'period', DECREMENT operation needs >>NOT<< "synchronize memory
> with respect to other threads".

"quasi-portable atomic refcount with immutable-object optimization"

#include "boost/type_traits.h"
#include "atomic_stuff_config.h" // bring in platform-specific impl for
atomic ops / mem-sync via #ifdefs
// ^-- includes "typedef <platform-specific> int32_atomic_t;"

template <bool read_only> struct AtomicRefCountHelper;
template <> struct AtomicRefCountHelper<true>
{
static void MemSync() { /* no-op */ }
};
template <> struct AtomicRefCountHelper<false>
{
static void MemSync() { platform_specific_membar(); }
};

template <typename Derived, bool read_only> class AtomicRefCount_base
{
int32_atomic_t count;

public:
AtomicRefCount_base() : count(1) {}
~AtomicRefCount_base() {}

Derived* AddRef()
{
platform_specific_atomic_increment32(&count);
return static_cast<Derived*>(this);
}

void Release()
{
if (platform_specific_atomic_decrement32(&count) == 1) { //
returns old value
AtomicRefCountHelper<read_only>::MemSync();
delete static_cast<Derived*>(this);
}
else
AtomicRefCountHelper<read_only>::MemSync();
}

};

----------

#include "atomic_refcount.h"

template <typename T> class SharedPtr
{
[...]
private:
class RefCounter
: public AtomicRefCount_base<RefCounter, boost::is_const<T>::value>
{
std::auto_ptr<T> ptr;

public:
RefCounter(auto_ptr<T> p) : ptr(p) {}
~RefCounter() {}
};
RefCounter* rc;
[...]
};

----------

#include "shared_ptr.h"

int main()
{
SharedPtr<Foo> p1(new Foo);
SharedPtr<const Foo> p2(new Foo);
...
}

hys

--
(c) 2002 Hillel Y. Sims

Joseph Seigh

Nov 9, 2002, 7:33:24 AM

Silicon Graphics wrote:
>
> Joseph Seigh <jsei...@xemaps.com> wrote:
>

...


> : I would assume that any architectures ported to 64 bits would port all the
> : instructions. Doubleword = 2 x 32 for 32 bit machines and 2 x 64 for 64 bit machines.
> : I suppose it could be dropped but why would disparate 32 bit architectures think
> : doubleword compare and swap was useful for 32 bits and not for 64 bits when they
> : go to that?
> :
> I don't use any processors that do. I can compile 64-bit code on SPARC V9, but
> am still limited to CASX (64-bit cas). Most likely for the same reason no one (except
> Motorola) has a "true" dcas: not enough need. That may be changing with the push
> from Sun's Research group wrt Java garbage collection:
>
> http://citeseer.nj.nec.com/detlefs00even.html
>

You mean this one? http://citeseer.nj.nec.com/481457.html "Lock-Free Reference Counting".

I mentioned to Detlefs that it could be done without DCAS but I don't think he believes it.
But I think if you do see a lock-free reference count implementation it will be done with
DCAS because they have the resources to do that.

Joe Seigh

Alexander Terekhov

Nov 9, 2002, 3:04:49 PM

"Hillel Y. Sims" wrote:
[...]
> template <bool read_only> struct AtomicRefCountHelper; ...

Nah, < ;-) >

template< bool for_immutable_stuff >
class PthreadRefCount {

    //*** unimplemented; let's make it non-copyable/non-copy-constructible
    PthreadRefCount( const PthreadRefCount& );
    PthreadRefCount& operator=( const PthreadRefCount& );

    pthread_refcount_t m_refcount;

public:

    /* ... */

    bool decrement();

    /* ... */

}; //*** class PthreadRefCount< bool for_immutable_stuff >

template <> inline bool PthreadRefCount< true >::decrement() {
    int status = pthread_refcount_decrement_immutable_case( &m_refcount );
    assert( PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status || 0 == status );
    return PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status;
}

template <> inline bool PthreadRefCount< false >::decrement() {
    int status = pthread_refcount_decrement_mutable_case( &m_refcount );
    assert( PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status || 0 == status );
    return PTHREAD_REFCOUNT_DROPPED_TO_ZERO == status;
}

regards,
alexander.

Silicon Graphics

Nov 9, 2002, 5:40:34 PM
Joseph Seigh <jsei...@xemaps.com> wrote:
:> http://citeseer.nj.nec.com/detlefs00even.html
:>
: You mean this one? http://citeseer.nj.nec.com/481457.html "Lock-Free Reference Counting".

: I mentioned to Detlefs that it could be done without DCAS but I don't think he believes it.
: But I think if you do see a lock-free reference count implementation it will be done with
: DCAS because they have the resources to do that.

Also look at:

http://research.sun.com/nova/cgi-bin/index.cgi?a=r&s=13385&e=13393&f=%2Fhtdocs%2Ftechnical-reports%2F2002%2Fabstract-110.html&i=

which purports to do exactly that by solving the Repeat Offender Problem. Couldn't find an
online version of the ROP paper, though.

Adam
ze...@best.com

Joseph Seigh

Nov 10, 2002, 9:56:51 PM

It's a little difficult to determine exactly what the paper is describing, but I believe it is
an asynchronous delay technique. These are nothing new. Basically you just delay deleting the
object until you know the object is no longer being referenced. There are various methods
of doing this. The asynchronous part comes in because you don't want to wait on other threads
completing some action, which can take an arbitrary amount of time, so you arrange for some
other agent to delete the object.

What are the other mechanisms? Well, obviously anything that takes place outside the interval
the reference is occurring in. The oldest one involved signaling on IBM mainframes. The objects
being reclaimed were real memory pages (frames?). They could not be reused until you were sure no
other processor had that memory address in its TLB, translation lookaside buffer. So the other
processors were signaled, upon which they performed purge-TLB instructions and the reclaimed pages
were marked reusable.

Another method, which is or was used in IBM's VM operating system, is to just monitor some convenient
point of execution at which references are not held by convention. Before deleting an object managed
by that convention, you (the system, actually) just waited for each processor to execute those monitor
points at least once. For VM it was rather convenient. It was a non-pre-emptive kernel and the
monitor points, loss of control or exit to dispatcher, were already monitored, so the whole scheme
had essentially zero overhead with a nominal delay of about 2 or 3 timeslices.

Another method, which I've seen in at least one place (probably a patent that referenced one of
mine), is to timestamp the delete requests. Each processor/thread timestamps somewhere when it is
no longer holding any possible references of interest. You can process any delete requests that
have timestamps older than the oldest processor/thread timestamp.

The one by Treiber, which was cited in the paper above, I remember reading, barely. I remember the
term "obligation passing" from it, though. I think it is a sort of share or reference count on the
whole queue. It was a generational or versioned count, so you had to remember which count you were
using. If you decremented the count to zero then you were obliged to process any pending deletes
that were associated with that version of the count. I think. It was a long time ago.

There are potentially lots of ways to do asynchronous delays. It just depends on how efficiently
you can do it, and what scales up and has the least delay and/or overhead.
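
The timestamp variant, as a sketch (all types and helpers hypothetical):

// Each thread stamps its slot whenever it is at a point where it holds
// no references of interest.
time_t stamp[NTHREADS];

void quiesce(int self) { stamp[self] = now(); }

void retire(void * p) { enqueue(&retired, p, now()); } // deferred delete

void reclaim(void)
{
    time_t oldest = min_of(stamp, NTHREADS);
    // a request is safe once every thread has quiesced since it was filed
    while (!empty(&retired) && head_time(&retired) < oldest)
        free(dequeue(&retired));
}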

Joe Seigh

David Butenhof

Nov 12, 2002, 9:20:55 AM
Silicon Graphics wrote:

> Joseph Seigh <jsei...@xemaps.com> wrote:
>
> :> Not quite sure I get this. For ref counting, the cas/inc/dec
> :> operations are atomic,
> :> and aren't ordering dependent. For accessing the actual count as an
> :> integer, I
> :> insert a read barrier after the fetch. What did I miss? Of course,
> :> the dcl ops contain both read and write barriers.
> :>
> : I was referring to the smart pointers themselves. Without memory
> : barriers they aren't really useful for a lot of interesting techiques
> : like DCL, copy on write,
> : wait-free, etc ... Which is why I mentioned Java volatile pointers.
> : Try
> : doing DCL with Java pointers. You can't unless you throw in a spurious
> : sync (x) {} to act as a memory barrier which makes DCL in java rather
> : moot.
> :
> I was differentiating between the reference count itself, and the pointer.
> The count shouldn't need memory barriers, unless it being returned by
> value (even then, probably only on DEC Alpha).

And IPF, by the way. But the issue isn't "return by value". The issue is
relationships between independent data.

If the reference count is used by other threads to imply additional state,
such as the validity of a pointer or other data, then you need a memory
barrier (on both sides of the transaction) to ensure memory coherency.

If the reference count is "just a reference count", then the atomic
operations (load-lock/store-conditional on Alpha, or compare-and-swap on
IPF) are sufficient to maintain consistency across processors without
memory barriers.
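
In code, the pairing looks something like this (hypothetical membar/atomic
primitives):

void unref(Obj * o)
{
    membar_release();     // make this thread's writes to *o visible first
    if (atomic_decrement(&o->refcount) == 0) {
        membar_acquire(); // observe every other thread's writes to *o
        delete o;
    }
}

If the count really is "just a reference count", both membars can be dropped.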
