A question about atomic_ptr


Peter Dimov

Apr 10, 2005, 5:49:22 AM
atomic_ptr (atomic-ptr-plus.sf.net), as most of you know, is Joe Seigh's
strongly thread safe reference-counted smart pointer.

My understanding (based on atomic_ptr.txt) is that it is intended to be used
in the GC phase of the reader/writer problem as follows:

atomic_ptr<X> px; // shared object

atomic_ptr<X> read()
{
    return px; // competing
}

void write()
{
    lock();

    atomic_ptr<X> tmp( new X( *px ) );

    // update *tmp

    px = tmp; // competing

    unlock();
}

boost::shared_ptr can't be used in the above manner because of the two
competing accesses that violate its "basic thread safety" requirement.

However (again, if my understanding is correct), all atomic_ptr read
accesses need to have acquire semantics, while shared_ptr read accesses need
not. (local_ptr aside for a moment.)

My question is:

Can I get the best of both worlds by providing the following two methods in
shared_ptr:

template<class T> class shared_ptr
{
    // ...

    shared_ptr<T> copy() const;
    void replace( shared_ptr<T> const & pt );
};

with the guarantee that competing accesses to 'copy' and 'replace' work (but
other competing accesses are still disallowed in order to avoid an acquire
except in 'copy')?

--
Peter Dimov
http://www.pdimov.com


Joe Seigh

Apr 10, 2005, 7:38:40 AM
On Sun, 10 Apr 2005 12:49:22 +0300, Peter Dimov <pdi...@gmail.com> wrote:

> atomic_ptr (atomic-ptr-plus.sf.net), as most of you know, is Joe Seigh's
> strongly thread safe reference-counted smart pointer.
>
> My understanding (based on atomic_ptr.txt) is that it is intended to be used
> in the GC phase of the reader/writer problem as follows:
>
> atomic_ptr<X> px; // shared object
>
> atomic_ptr<X> read()
> {
> return px; // competing
> }
>
> void write()
> {
> lock();
>
> atomic_ptr<X> tmp( new X( *px ) );
>
> // update *tmp
>
> px = tmp; // competing
>
> unlock();
> }
>
> boost::shared_ptr can't be used in the above manner because of the two
> competing accesses that violate its "basic thread safety" requirement.
>
> However (again, if my understanding is correct), all atomic_ptr read
> accesses need to have acquire semantics, while shared_ptr read accesses need
> not. (local_ptr aside for a moment.)

atomic_ptr wouldn't be very useful or convenient without acquire and
release semantics. shared_ptr doesn't need them since acquire and
release semantics are provided by the mechanism used to ensure
"ownership" of shared_ptr's.

>
> My question is:
>
> Can I get the best of both worlds by providing the following two methods in
> shared_ptr:
>
> template<class T> class shared_ptr
> {
> // ...
>
> shared_ptr<T> copy() const;
> void replace( shared_ptr<T> const & pt );
> };
>
> with the guarantee that competing accesses to 'copy' and 'replace' work (but
> other competing accesses are still disallowed in order to avoid an acquire
> except in 'copy')?
>

You could, though you've in effect merged atomic_ptr and local_ptr together.
As a practical matter, I'd think you'd want some mechanism to distinguish
between shared global pointers and non-shared local pointers.

--
Joe Seigh

Chris Thomasson

Apr 10, 2005, 8:30:32 AM
> Can I get the best of both worlds by providing the following two methods
> in
> shared_ptr:
>
> template<class T> class shared_ptr
> {
> // ...
>
> shared_ptr<T> copy() const;
> void replace( shared_ptr<T> const & pt );
> };
>
> with the guarantee that competing accesses to 'copy' and 'replace' work
> (but
> other competing accesses are still disallowed in order to avoid an acquire
> except in 'copy')?

You can. How are you going to allow for "competing access"? Are you using
mutexes, or going for a lock-free method such as SMR?


I am currently experimenting with using SMR to protect the reference counts
of a quick lock-free atomic pointer API I did in C. Surprisingly, it seems
to be more lightweight than atomic_ptr "on x86", despite the fact that it
relies on the expensive SMR API in order to provide "strong" thread safety.
DWCAS can be expensive on x86. I have not compared it against Joe's PPC
reference counting algorithm that uses LL/SC directly; it's very efficient.


--
http://appcore.home.comcast.net/
(portable lock-free data-structures)


Chris Thomasson

Apr 10, 2005, 8:51:07 AM
> As a practical matter, I'd think you'd want some mechanism to distinguish
> between shared global pointers and non-shared local pointers.

Agreed. For instance, I had to separate my atomic pointer into two
distinguishable classes to address a highly critical performance issue:

shared_smr_ptr<T>
local_smr_ptr<T>


shared_smr_ptr<T> alone is "way too expensive" to be used for a general
purpose smart pointer. I needed the ability to access the reference count
without using the expensive SMR API all the time; this is where local_smr_ptr
saves the day. Knowing that you don't have to use SMR in order to access
the reference count on local_smr_ptr access actually makes SMR seem a bit
less expensive than it actually is... ;)


Peter Dimov

Apr 10, 2005, 9:23:39 AM
Chris Thomasson wrote:
>> As a practical matter, I'd think you'd want some mechanism to
>> distinguish between shared global pointers and non-shared local
>> pointers.
>
> Agreed. For instance, I had to separate my atomic pointer into two
> distinguishable classes to address a highly critical performance
> issue:
> shared_smr_ptr<T>
> local_smr_ptr<T>
>
>
> shared_smr_ptr<T> alone is "way to expensive" to be used for a general
> purpose smart pointer.

That's what people tell me all the time, that shared_ptr is too expensive
and they need an "unsynchronized" variant.

But it's not quite the same as Joe's local_ptr, I think; it still atomically
updates its count. I think I'll reserve the local_ptr name for a pointer
that uses a "local" count and non-atomic updates (but is convertible from
and to a shared_ptr) so that people are free to shoot themselves in the
foot, if they so insist.


Peter Dimov

Apr 10, 2005, 9:16:46 AM
Chris Thomasson wrote:
>> Can I get the best of both worlds by providing the following two
>> methods in
>> shared_ptr:
>>
>> template<class T> class shared_ptr
>> {
>> // ...
>>
>> shared_ptr<T> copy() const;
>> void replace( shared_ptr<T> const & pt );
>> };
>>
>> with the guarantee that competing accesses to 'copy' and 'replace'
>> work (but
>> other competing accesses are still disallowed in order to avoid an
>> acquire except in 'copy')?
>
> You can. How are you going to allow for "competing access"? Are you
> using mutexs, or going for a lock-free method such as SMR?

I was thinking DWCAS or mutex pool... I'm new to these lock-free things.
Need to read about SMR, I guess.


Chris Thomasson

Apr 10, 2005, 10:05:27 AM
"Peter Dimov" <pdi...@gmail.com> wrote in message
news:d3b9b5$qea$1...@domitilla.aioe.org...

> Chris Thomasson wrote:
>>> As a practical matter, I'd think you'd want some mechanism to
>>> distinguish between shared global pointers and non-shared local
>>> pointers.
>>
>> Agreed. For instance, I had to separate my atomic pointer into two
>> distinguishable classes to address a highly critical performance
>> issue:
>> shared_smr_ptr<T>
>> local_smr_ptr<T>
>>
>>
>> shared_smr_ptr<T> alone is "way to expensive" to be used for a general
>> purpose smart pointer.
>
> That's what people tell me all the time, that shared_ptr is too expensive
> and they need an "unsynchronized" variant.

The overhead generated by the strict memory barrier loop required to access
the reference count wrt shared_smr_ptr's really does require a "local access"
class of sorts. Just something that avoids the SMR API for every reference
count increment...

:O


> I think I'll reserve the local_ptr name for a pointer
> that uses a "local" count and non-atomic updates (but is convertible from
> and to a shared_ptr) so that people are free to shoot themselves in the
> foot, if they so insist.

That would be a pseudo-distributed count. Each local_ptr would contain a
"separate" count. When a shared_ptr gets copied into a local_ptr, an atomic
increment with acquire semantics would be applied to the "shared" count, and
the "local" count would be set to 1. The acquire semantics here would be
used to sync with the "local_ptr drop to zero transition". When the "local"
count drops to zero, an atomic decrement with release semantics would be
applied to the shared count to ensure that all the local effects on shared
memory are applied before the decrement of the reference count.


local_ptr<X> = shared_ptr<X>
-------------------------------
atomic_inc_acquire( &shared_ptr<X>::refs );
local_ptr<X>::refs = 1;


local_ptr<X> = 0
-------------------------------
--local_ptr<X>::refs
if ( ! local_ptr<X>::refs )
{
    if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
    {
        // whatever
    }
}


I could see where this would be useful wrt performance. It actually might be
worth the trouble and the extra documentation... ;)
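For concreteness, the two-level count sketched above can be rendered in modern C++. This is only an illustrative sketch, not Joe's or Peter's actual implementation; the names `shared_count` and `local_group` are hypothetical, and the memory orderings follow Chris's description (acquire on the shared increment, release on the final shared decrement):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical two-level count: one shared atomic count, plus a per-group
// local count that is only ever touched by its owning thread.
struct shared_count {
    std::atomic<long> refs{1};          // 1 == the original owner
};

struct local_group {
    shared_count* sc;
    long local_refs;                    // single-threaded, no atomics needed

    explicit local_group(shared_count* s) : sc(s), local_refs(1) {
        // one acquire-flavored increment on the shared count per group
        sc->refs.fetch_add(1, std::memory_order_acquire);
    }
    void add_local_ref() { ++local_refs; }      // plain, cheap increment
    bool release() {
        if (--local_refs != 0) return false;    // group still holds a ref
        // drop the group's single shared reference with release semantics;
        // returns true only on the final (zero) transition of the shared count
        return sc->refs.fetch_sub(1, std::memory_order_release) == 1;
    }
};
```

Copies within the group cost only a non-atomic increment; the shared count is touched once per group, which is the whole point of the scheme.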


Joe Seigh

Apr 10, 2005, 10:22:18 AM

local_ptr atomically updates the reference counts. It has less overhead
than atomic_ptr since it doesn't have to maintain pointer atomicity, just
reference count atomicity. The main difference is you don't have to use
an interlocked primitive to swap pointer values on a pointer assignment,
which atomic_ptr does. So local_ptr looks somewhat like a thread-safe
version of shared_ptr.

The main difference between local_ptr and atomic_ptr is the swap method
and the function used to safely increment the reference count.
atomic_ptr::swap() is atomically thread-safe; local_ptr::swap() is not.
If I get around to merging the compare-and-swap and LL/SC (lwarx/stwcx)
versions of atomic_ptr it will become a little clearer, since I'll have to
clean up the abstraction boundaries a little better.

--
Joe Seigh

Chris Thomasson

Apr 10, 2005, 6:48:31 PM
DOH!


> local_ptr<X> = 0
> -------------------------------
> --local_ptr<X>::refs
> if ( ! local_ptr<X>::refs )
> {
> if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )

^^^^^^^^^^^^^^^^

That's:

if ( ! atomic_dec_release( &shared_ptr<X>::refs ) )


> {
> // whatever
> }
> }
>
>
>
>
> I could see where this would be useful wrt performance. It actually might
> be worth the trouble and the extra documentation... ;)


:)


Chris Thomasson

Apr 10, 2005, 10:52:06 PM
> I was thinking DWCAS or mutex pool... I'm new to these lock-free things.

I would go for a strict pthread solution "first" to quickly address the
portability factor. I guess you could simply hash the address of a
shared_ptr reference count into an index of an array of pthread_mutex_t's.
There is a simple solution for this:

http://msdn.microsoft.com/msdnmag/issues/01/08/Concur/

You can take advantage of the fact that the mutex table would never be
destroyed.
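The hashed mutex pool Chris describes can be sketched in a few lines. This is an assumed layout, not the code from the MSDN article; the table size, shift amount, and helper names are illustrative (the low-bit shift discards bits that are identical for all suitably aligned objects):

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A fixed pool of mutexes that is never destroyed; every reference count
// address hashes to one of them.
static const std::size_t kPoolSize = 41;   // small prime, assumed size
static pthread_mutex_t g_pool[kPoolSize];

static struct pool_init {
    pool_init() {
        for (std::size_t i = 0; i < kPoolSize; ++i)
            pthread_mutex_init(&g_pool[i], 0);
    }
} g_pool_init;                             // runs once before main()

inline pthread_mutex_t* mutex_for(const void* p) {
    // shift away low bits that alignment makes identical, then hash
    std::uintptr_t v = reinterpret_cast<std::uintptr_t>(p) >> 4;
    return &g_pool[v % kPoolSize];
}
```

Two different counts may hash to the same mutex; that only serializes them occasionally, it never deadlocks a single lock/unlock pair.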

DWCAS solutions will work fine; just remember that once you use DWCAS with
pointers, you fall into the portability trap. Most 64-bit processors do not
have DWCAS.


> Need to read about SMR, I guess.

SMR needs a rather extensive underlying support system in order to be
useful. You need to heavily optimize the allocation scheme that provides the
nodes you use for your collections and/or reference counts. Some key SMR
APIs need to be coded in assembly language; they are too sensitive to an
overly aggressive optimizing compiler. Also, you need to examine the method
in which SMR utilizes thread-local storage. My AppCore library has an
efficient hazard pointer scheme that you might want to take a quick look at.
Any questions are welcome...


Peter Dimov

Apr 11, 2005, 5:53:59 AM
Chris Thomasson wrote:
> local_ptr<X> = shared_ptr<X>
> -------------------------------
> atomic_inc_acquire( &shared_ptr<X>::refs );
> local_ptr<X>::refs = 1;
>
>
> local_ptr<X> = 0
> -------------------------------
> --local_ptr<X>::refs
> if ( ! local_ptr<X>::refs )
> {
> if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
> {
> // whatever
> }
> }

Yes, exactly, but why acquire in the increment? Won't the usual release on
nonzero/acquire on zero decrement do?
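The "release on nonzero / acquire on zero" decrement Peter refers to is the standard shared_ptr-style pattern. A hedged C++11 rendering (the 2005 discussion predates these primitives, and `dec_ref` is an illustrative name, not anyone's actual API):

```cpp
#include <atomic>
#include <cassert>

// Decrement with release semantics; only the final decrement (the one that
// observes the count going 1 -> 0) pays for an acquire fence, which orders
// all prior releases before the destructor runs.
template<class T>
bool dec_ref(std::atomic<long>& refs, T* obj) {
    if (refs.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete obj;                     // last owner reclaims the object
        return true;
    }
    return false;                       // other owners remain
}
```

Non-final decrements thus avoid the acquire entirely, which is exactly why the plain increment doesn't need one either.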


Alexander Terekhov

Apr 11, 2005, 7:00:59 AM

Chris Thomasson wrote:
>
> DOH!
>
> > local_ptr<X> = 0
> > -------------------------------
> > --local_ptr<X>::refs
> > if ( ! local_ptr<X>::refs )
> > {
> > if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
> ^^^^^^^^^^^^^^^^
>
> That's:
>
> if ( ! atomic_dec_release( &shared_ptr<X>::refs ) )

Depending on value, you'll need either acquire or release here.

regards,
alexander.

Alexander Terekhov

Apr 11, 2005, 6:50:11 AM

Alexander Terekhov wrote:

>
> Peter Dimov wrote:
> >
> > Chris Thomasson wrote:
> > > local_ptr<X> = shared_ptr<X>
> > > -------------------------------
> > > atomic_inc_acquire( &shared_ptr<X>::refs );
> > > local_ptr<X>::refs = 1;
> > >
> > >
> > > local_ptr<X> = 0
> > > -------------------------------
> > > --local_ptr<X>::refs
> > > if ( ! local_ptr<X>::refs )
> > > {
> > > if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )

That must be upcoming Apple's OSAtomicDecrementRelAcq32() or
OSAtomicDecrementSlbHsb32() (when X is immutable while managed
by shared_ptr<>). ;-)

> > > {
> > > // whatever
> > > }
> > > }
> >
> > Yes, exactly, but why acquire in the increment?
>

> In your vocabulary, he means copy(). Your copy() does need acquire
> semantics to ensure visibility of client object "published" by your
> replace().
>
> BTW, strong thread-safety aside for a moment, I think that you can
> simply provide a "move to/from" operation for shared_ptr<T, D,
> thread_safety::basic>, shared_ptr<T, D, thread_safety::unsafe>, and
> std::auto_ptr<T> (when your D deleter on the other side is
> empty/default). It can throw something if the source is "!empty()
> || !unique()". Or just make it a precondition. Frankly, it all
> boils down to release() call which you don't provide currently. ;-)

regards,
alexander.

Peter Dimov

Apr 11, 2005, 7:30:45 AM
Alexander Terekhov wrote:
>>>> if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
>
> That must be upcoming Apple's OSAtomicDecrementRelAcq32() or
> OSAtomicDecrementSlbHsb32() (when X is immutable while managed
> by shared_ptr<>). ;-)

Speaking of which, is it possible to implement the ...SlbHsb variant in a
more efficient manner on a PPC? On something else? (More efficient than the
...RelAcq, of course.)


Alexander Terekhov

Apr 11, 2005, 7:48:36 AM

Peter Dimov wrote:

[... local_shared_ptr ...]

> It's like intrusive_ptr_st< shared_ptr_mt<X> >.

I gather that it's like

shared_ptr<
T, idle_functor< shared_ptr< T, D, thread_safety::basic > >,
thread_safety::unsafe,
weak_ptr::no_thank_you
> local_shared_ptr(get_pointer(shared_ptr), shared_ptr);

right?

regards,
alexander.

Alexander Terekhov

Apr 11, 2005, 6:36:53 AM

Peter Dimov wrote:
>
> Chris Thomasson wrote:
> > local_ptr<X> = shared_ptr<X>
> > -------------------------------
> > atomic_inc_acquire( &shared_ptr<X>::refs );
> > local_ptr<X>::refs = 1;
> >
> >
> > local_ptr<X> = 0
> > -------------------------------
> > --local_ptr<X>::refs
> > if ( ! local_ptr<X>::refs )
> > {
> > if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
> > {
> > // whatever
> > }
> > }
>
> Yes, exactly, but why acquire in the increment?

In your vocabulary, he means copy(). Your copy() does need acquire

Peter Dimov

Apr 11, 2005, 7:26:44 AM
Alexander Terekhov wrote:
> Peter Dimov wrote:
>>
>> Chris Thomasson wrote:
>>> local_ptr<X> = shared_ptr<X>
>>> -------------------------------
>>> atomic_inc_acquire( &shared_ptr<X>::refs );
>>> local_ptr<X>::refs = 1;
>>>
>>>
>>> local_ptr<X> = 0
>>> -------------------------------
>>> --local_ptr<X>::refs
>>> if ( ! local_ptr<X>::refs )
>>> {
>>> if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
>>> {
>>> // whatever
>>> }
>>> }
>>
>> Yes, exactly, but why acquire in the increment?
>
> In your vocabulary, he means copy(). Your copy() does need acquire
> semantics to ensure visibility of client object "published" by your
> replace().

No, no.

This is about local_shared_ptr, not about copy/replace. The
local_shared_ptrs in an ownership group (which needs to be local to a
specific thread) use a local unsynchronized reference count to keep track of
their single shared_ptr-like reference into the shared_ptr group. ChrisX got
it right, but I still think that there's no need to acquire in the
increment; it's an ordinary 'add_ref_strong'.

> BTW, strong thread-safety aside for a moment, I think that you can
> simply provide a "move to/from" operation for shared_ptr<T, D,
> thread_safety::basic>, shared_ptr<T, D, thread_safety::unsafe>, and
> std::auto_ptr<T> (when your D deleter on the other side is
> empty/default). It can throw something if the source is "!empty()
> || !unique()". Or just make it a precondition. Frankly, it all
> boils down to release() call which you don't provide currently. ;-)

The above local_shared_ptr does not need to steal ownership away from the
source shared_ptr, nor does it need a unique source. And you can also get a
real shared_ptr back, again without the 'unique' requirement on the local
source.

Joe Seigh

Apr 11, 2005, 7:50:52 AM
On Mon, 11 Apr 2005 14:26:44 +0300, Peter Dimov <pdi...@gmail.com> wrote:

> Alexander Terekhov wrote:
>> Peter Dimov wrote:
>>>
>>> Yes, exactly, but why acquire in the increment?
>>
>> In your vocabulary, he means copy(). Your copy() does need acquire
>> semantics to ensure visibility of client object "published" by your
>> replace().
>
> No, no.
>
> This is about local_shared_ptr, not about copy/replace. The
> local_shared_ptrs in an ownership group (which needs to be local to a
> specific thread) use a local unsynchronized reference count to keep track of
> their single shared_ptr-like reference into the shared_ptr group. ChrisX got
> it right, but I still think that there's no need to acquire in the
> increment, it's an ordinary 'add_ref_strong'.

As a practical matter I don't think you're going to have more than
one local pointer copy of a global shared pointer often enough to justify
the extra overhead of maintaining a separate reference count and of the extra
indirection.


--
Joe Seigh

Chris Thomasson

Apr 11, 2005, 7:04:46 AM
>> if ( ! atomic_dec_release( &shared_ptr<X>::refs ) )
>
> Depending on value, you'll need either acquire or release here.

Ahh yes... Release on > 0 and acquire for < 1, correct?

;)


Peter Dimov

Apr 11, 2005, 8:04:48 AM

Right.


Alexander Terekhov

Apr 11, 2005, 7:22:06 AM

Well, given that in my (unwritten) book, mad std::string cows "eat"
("OS" comes from Apple) OSAtomicDecrementIfGreaterThanOneSlbHsb32()
with checking and no msync for zero ("unsharable" indicator), I'd
say "almost". Okay? ;-)

regards,
alexander.

Peter Dimov

Apr 11, 2005, 8:40:30 AM

You may well be right. It's just one idea for a possible extension of
shared_ptr that could pacify the "we want our unsynchronized pointers"
portion of the "C++ community". I'm not sure yet whether it's a _good_ idea.


Alexander Terekhov

Apr 11, 2005, 8:47:49 AM

Peter Dimov wrote:
>
> Alexander Terekhov wrote:
> >>>> if ( ! atomic_dec_acquire( &shared_ptr<X>::refs ) )
> >
> > That must be upcoming Apple's OSAtomicDecrementRelAcq32() or
> > OSAtomicDecrementSlbHsb32() (when X is immutable while managed
> > by shared_ptr<>). ;-)
>
> Speaking of which, is it possible to implement the ...SlbHsb variant in a
> more efficient manner on a PPC?

SlbHsb version doesn't seem to need trailing isync, and maybe can
achieve the effect of leading {lw}sync with respect to loads->store
ordering using "branch never taken" trick. Consider: (excerpt from
Book II)

----
B.2.3 Safe Fetch

If a load must be performed before a subsequent store (e.g., the
store that releases a lock protecting a shared data structure), a
technique similar to the following can be used.

In this example it is assumed that the address of the storage
operand to be loaded is in GPR 3, the contents of the storage
operand are returned in GPR 4, and the address of the storage
operand to be stored is in GPR 5.

lwz r4,0(r3) #load shared data
cmpw r4,r4 #set CR0 to "equal"
bne- $-8 #branch never taken
stw r7,0(r5) #store other shared data
----

In our case, we load and check the count and do take a conditional
branch which leads to stores which we want to order with respect to
load and check. That's trailing isync. As for leading {lw}sync, we
can add similar "branch never taken" trick to order it with respect
to preceding loads, but I'm not sure whether register dependency
(r4 for both load and compare above) can play some role here in
addition to "branch never taken".

regards,
alexander.

Joe Seigh

Apr 11, 2005, 1:10:24 PM
On Mon, 11 Apr 2005 15:40:30 +0300, Peter Dimov <pdi...@gmail.com> wrote:

> Joe Seigh wrote:
>>
> As a practical matter I don't think you're going to have more
> than one local pointer copy of a global shared pointer often enough to
>> justify the extra overhead of maintaining a separate reference count
>> and of the extra indirection.
>
> You may well be right. It's just one idea for a possible extension of
> shared_ptr that could pacify the "we want our unsynchronized pointers"
> portion of the "C++ community". I'm not sure yet whether it's a _good_ idea.
>
>


There's a bit of a learning curve here. I don't think the general
programming community has enough experience in multi-threaded
programming, compared to the experience we have here in c.p.t, to have
a good feel for all the issues yet.

For example, when you're doing reference counting in a single-threaded
program, you don't need to distinguish between read access and write
access. But when you start using multi-threading, the performance
implications of distinguishing between read and write access become
more important, which is why I've been emphasizing smart pointers
as a solution to the readers/writers problem rather than as a generic
resource management tool.

There seems to be some attraction in c.l.c++.m to smart pointers that
have almost no overhead, the deferred reference counting being
promoted by Andrei, but there's no such thing as a free lunch. Deferred
reference counting does Boehm-style GC against the thread stacks, and
that requires a "stop the world while we GC", which has huge performance
implications in multi-threading. We're well aware of the problem
of checking or monitoring threads' local references. That's what makes
RCU or SMR so tricky. If there were an easy way around it, we'd be
using it. There isn't. What you have is a bunch of different techniques
with different trade-offs. What you do is use the one that gives
you the most benefit depending on the situation. There won't be one
size fits all.

What you will want to do for Boost is allow different smart pointer
techniques for the time being, until multi-processor systems become
ubiquitous enough that everyone has a feel for the issues. Then
you can start winnowing the smart pointers down. But I don't think
you will have one grand unified smart pointer. I've been at this
for a while and I'm not willing to discount any of the current techniques
yet. Even the ones I didn't invent. :)


--
Joe Seigh

Peter Dimov

Apr 11, 2005, 6:45:10 PM
Joe Seigh wrote:
>
> You could, though you've in effect merged atomic_ptr and local_ptr
> together. As a practical matter, I'd think you'd want some mechanism
> to distinguish between shared global pointers and non-shared local
> pointers.

I'm not sure.

I can see how "thread safety is part of the type" can be useful.

But on the other hand "int" offers "basic" thread safety by default and
"strong" thread safety if I only touch it with atomic_load_acq and
atomic_store_rel.
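The atomic_load_acq / atomic_store_rel pair Peter assumes can be written today on top of std::atomic; in 2005 they would have been platform-specific. A sketch of the intended semantics (the function names are the thread's, the implementation is an assumed modern rendering):

```cpp
#include <atomic>
#include <cassert>

// "Strong" access to an int: acquire on every competing load,
// release on every competing store.
inline int atomic_load_acq(const std::atomic<int>* p) {
    return p->load(std::memory_order_acquire);
}

inline void atomic_store_rel(std::atomic<int>* p, int v) {
    p->store(v, std::memory_order_release);
}
```

Non-competing accesses can still use plain loads and stores; the "strong" guarantee exists only for accesses that go through this pair, which is exactly the "non-Java int" model Peter describes.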

Of course in Java "int" is basic and "volatile int" is strong.

For some reason I think that the "non-Java int" (and raw pointer) semantics
make more sense for shared_ptr; it seems sufficiently primitive to me. I may
be wrong, of course.

This is also why I'm not particularly fond of Alexander's atomic<> dedicated
wrapper type. Providing atomic access via ordinary functions seems more
natural.


Alexander Terekhov

Apr 11, 2005, 7:07:08 PM

Peter Dimov wrote:
[...]

> But on the other hand "int" offers "basic" thread safety by default and
> "strong" thread safety if I only touch it with atomic_load_acq and
> atomic_store_rel.

Only if you properly isolate (align and pad if necessary) it.

That's one reason for atomic<> wrapper.
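The isolation argument can be made concrete with alignment. A minimal sketch of a wrapper that forces a variable onto its own cache line (the name `padded_atomic` and the 64-byte line size are assumptions, not Alexander's actual atomic<>):

```cpp
#include <atomic>
#include <cassert>

// Forcing cache-line alignment means the wrapped value never shares a
// line with unrelated data (no false sharing) and is naturally isolated;
// a bare int gives you no such guarantee.
template<class T>
struct alignas(64) padded_atomic {      // 64: assumed cache-line size
    std::atomic<T> value;
};
```

alignas on the struct also rounds its size up to a multiple of the alignment, so arrays of these wrappers stay one-per-line as well.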

[...]


> This is also why I'm not particularly fond of Alexander's atomic<> dedicated
> wrapper type. Providing atomic access via ordinary functions seems more
> natural.

Another reason is to "enforce" proper labeling for all accesses by
making the type opaque. An exclusive access on atomic<int> doesn't
need to be isolated/atomic at all; it can be combined with other
accesses. How are you going to do all that with "extended-for-threads"
C/C++ volatiles (also mentioned by you in another message, not here),
for example?

What else?

regards,
alexander.

Joe Seigh

Apr 11, 2005, 7:15:37 PM

I meant it wouldn't be clear from the code context what was safe or not.
You'd have to know that a particular instance of shared_ptr is used
in a certain way to know whether the methods were safe or not. If sp1 and
sp2 are shared pointers, how do you know which of the following to use?

sp1 = sp2;
sp1 = sp2.copy();
sp1.replace(sp2);
sp1.replace(sp2.copy());


Whereas if you keep them as two separate types you can keep the methods
consistent for each type. So for example, given atomic_ptr ap and
local_ptr lp, which by the rules for local_ptr is always local or "owned":

ap = lp;
lp = ap;

or any combination of atomic_ptr and local_ptr is safe, always. You
don't have to know anything about ap or lp.

atomic<T> is the same idea applied to dumb (raw) pointers or other native
types.

--
Joe Seigh

Alexander Terekhov

Apr 11, 2005, 7:15:51 PM

Alexander Terekhov wrote:
>
> Peter Dimov wrote:
> [...]
> > But on the other hand "int" offers "basic" thread safety by default and
> > "strong" thread safety if I only touch it with atomic_load_acq and
> > atomic_store_rel.
>
> Only if you properly isolate (align and pad if nessesary) it.

And even a properly isolated built-in "int" can be accessed on a
per-bit basis (or per-byte/whatnot, non-atomically) by conforming
implementations. Just to be anal, so to speak.

Peter Dimov

Apr 11, 2005, 9:03:45 PM
Joe Seigh wrote:
> I meant it wouldn't be clear from the code context what was safe or
> not. You'd have to know that a particular instance of shared_ptr is
> used is a certain way to know that methods were safe or not.

Yes, I understood that.

> If sp1 and
> sp2 are shared pointers how do you know which of the following to use?
> sp1 = sp2;
> sp1 = sp2.copy();
> sp1.replace(sp2);
> sp1.replace(sp2.copy());
>
>
> Whereas if you keep them as two separate types you can keep the
> methods consistent for each type. So for example, given atomic_ptr
> ap and local_ptr lp which by the rules for local_ptr is always local
> or "owned".
>
> ap = lp;
> lp = ap;
>
> or any combination of atomic_ptr and local_ptr is safe, always. You
> don't have to know anything about ap or lp.

No; both lp = ap and ap = lp are unsafe if I don't know anything about lp
(if another thread can read or write lp), if I understand their semantics
correctly.

sp1.replace( sp2 );
sp2 = sp1.copy();

are equally safe (and more explicit) if I have the same information about
sp2 as I do for lp (i.e. it's not visible to other threads.) I admit that
sometimes encoding the intent that lp is not shared into its type may help
or appeal to some programmers.

But regardless...

My point was that (low level) concurrent accesses without proper
synchronization are always "unsafe", so it's a waste of time to try to make
them safe. "Unsafe" in the sense that it's very, very hard for a non-expert
to get a lock-free algorithm right. And the expert is unlikely to need his
pointers color-coded.

But I may be wrong. :-)


Peter Dimov

Apr 11, 2005, 9:11:50 PM
Alexander Terekhov wrote:
> Peter Dimov wrote:
> [...]
>> But on the other hand "int" offers "basic" thread safety by default
>> and "strong" thread safety if I only touch it with atomic_load_acq
>> and atomic_store_rel.
>
> Only if you properly isolate (align and pad if nessesary) it.
>
> That's one reason for atomic<> wrapper.

Do you mean that on some systems a 32 bit int aligned on a 32 bit boundary
may not allow atomic accesses? Or in more general terms, a type T's natural
alignment may not be sufficient to ensure atomicity?

>> This is also why I'm not particularly fond of Alexander's atomic<>
>> dedicated wrapper type. Providing atomic access via ordinary
>> functions seems more natural.
>
> Another reason is to "enforce" proper labeling for all accesses by
> making the type opaque. An exclusive access on atomic<int> doesn't
> need to be isolated/atomic at all, it can be combined with other
> accesses. How are you going to do it all with "extended-for-threads"
> C/C++ volatiles (also mentioned by you in another message not here),
> for example?

volatiles? Who said anything about volatiles?

T atomic_load_acq( T * pt );
void atomic_store_rel( T * pt, T v );
// et al

and ordinary non-volatile loads and stores for non-competing accesses...
what am I missing? "naked_competing" loads? Add atomic_load_nsync, then.


Chris Thomasson

Apr 11, 2005, 10:07:43 PM
> This is about local_shared_ptr, not about copy/replace. The
> local_shared_ptrs in an ownership group (which needs to be local to a
> specific thread) use a local unsynchronized reference count to keep track
> of
> their single shared_ptr-like reference into the shared_ptr group. ChrisX
> got
> it right, but I still think that there's no need to acquire in the
> increment, it's an ordinary 'add_ref_strong'.

If add_ref_strong already provides the necessary memory barriers, you would
be ok. However, explicit acquire/release semantics would be necessary wrt
copy/replace...


Chris Thomasson

Apr 11, 2005, 10:25:02 PM
> There's a bit of a learning curve here. I don't think the general
> programming community has enough experience in multi-threaded
> programming, compared to the experience we have here in c.p.t, to have
> a good feel for all the issues yet.

That's for sure.


> For example, when you're doing reference counting in a single threaded
> program, you don't need to distinguish between read access and write
> access. But when you start using multi-threading, the performance
> implications of distinquishing between read and write access become
> more important, which is why I've been emphasizing smart pointers
> as a solution to the readers/writers problem rather than a generic
> resource management tool.

I guess the smart pointers themselves could be considered "generic resource
management tools" ;)


> There seems to be some attraction in c.l.c++.m to smart pointers that
> have almost no overhead, the deferred reference counting being
> promoted by Andrei but there's no such thing a a free lunch. Deferred
> reference counting does Boehm style GC against the thread stacks and
> that requires a "stop the world while we GC" which has huge performance
> implications in multi-threading.

Actually, you mentioned a scheme that would allow for "deferred reference
counting" using per-thread lock-free single-producer/consumer queues.
Pushing a node ( a reference adjustment request ) onto a "local" per-thread
queue is certainly less overhead than performing an interlocked operation on
a "shared" variable. I don't think you would need to use DWCAS either. The
logic could be isolated in the "collection phase"...

There is just one problem: if the collection cycles cannot keep up with
user allocs/frees, the application's memory usage could climb out of control.


> We're well aware of the problem
> of checking or monitoring threads local references. That's what makes
> RCU or SMR so tricky.

Indeed. The "collection phase" of basically all distributed garbage
collection algorithms can be very tricky.


> If there was an easy way around it, we'd be
> using it. There isn't. What you have is a bunch of different techniques
> with different trade-offs. What you do is use the one that gives
> you the most benefit depending on the situation. There won't be one
> size fits all.

;)


Alexander Terekhov

unread,
Apr 12, 2005, 4:45:46 AM4/12/05
to

Chris Thomasson wrote:
>
> > This is about local_shared_ptr, not about copy/replace. The
> > local_shared_ptrs in an ownership group (which needs to be local to a
> > specific thread) use a local unsynchronized reference count to keep track
> > of their single shared_ptr-like reference into the shared_ptr group. ChrisX
> > got it right, but I still think that there's no need to acquire in the
> > increment, it's an ordinary 'add_ref_strong'.
>
> If add_ref_strong already provides the necessary memory barriers

What for? add_ref_strong() is naked atomic.

regards,
alexander.

Alexander Terekhov

unread,
Apr 12, 2005, 4:37:42 AM4/12/05
to

Peter Dimov wrote:
>
> Alexander Terekhov wrote:
> > Peter Dimov wrote:
> > [...]
> >> But on the other hand "int" offers "basic" thread safety by default
> >> and "strong" thread safety if I only touch it with atomic_load_acq
> >> and atomic_store_rel.
> >
> > Only if you properly isolate (align and pad if nessesary) it.
> >
> > That's one reason for atomic<> wrapper.
>
> Do you mean that on some systems a 32 bit int aligned on a 32 bit boundary
> may not allow atomic accesses? Or in more general terms, a type T's natural
> alignment may not be sufficient to ensure atomicity?

Show me chapter and verse for "natural alignment" in C/C++ or
even POSIX, please.

>
> >> This is also why I'm not particularly fond of Alexander's atomic<>
> >> dedicated wrapper type. Providing atomic access via ordinary
> >> functions seems more natural.
> >
> > Another reason is to "enforce" proper labeling for all accesses by
> > making the type opaque. An exclusive access on atomic<int> doesn't
> > need to be isolated/atomic at all, it can be combined with other
> > accesses. How are you going to do it all with "extended-for-threads"
> > C/C++ volatiles (also mentioned by you in another message not here),
> > for example?
>
> volatiles? Who said anything about volatiles?

Well, you said: "Why wrap the thing in its own atomic<> type (I'm
reminded of volatiles here) instead of just providing labeled
access functions?"

>
> T atomic_load_acq( T * pt );
> void atomic_store_rel( T * pt, T v );
> // et al
>
> and ordinary non-volatile loads and stores for non-competing accesses...
> what am I missing? "naked_competing" loads? Add atomic_load_nsync, then.

First off, an access can be either atomic or not atomic. Non-competing
load accesses need not be atomic. Non-competing store accesses must be
atomic (because it is non-competing with respect to stores, but it is
"competing" with respect to loads), exclusive store or load accesses
need not be atomic. Competing accesses may need msync to impose proper
ordering.

Now, as for your functions, consider:

char ab[2];

Thread A:

atomic_store_rel( &ab[0], 1 );
atomic_load_acq( &ab[1] );

Thread B:

atomic_load_acq( &ab[0] );
atomic_store_rel( &ab[1], 1 );

How is this supposed to work on a processor that doesn't have atomic
byte access instructions? Even if it has, consider

char ab[2];

Thread A:

ab[0] = 1;
.
.
.
atomic_load_acq( &ab[1] );

Thread B:

atomic_store_rel( &ab[1], 1 );

regards,
alexander.

Chris Thomasson

unread,
Apr 12, 2005, 4:57:23 AM4/12/05
to

>> If add_ref_strong already provides the necessary memory barriers
>
> What for? add_ref_strong() is naked atomic.

I need to study the boost::shared_ptr implementation a bit more. I only
skimmed through it a while ago.


Peter Dimov

unread,
Apr 12, 2005, 5:54:45 AM4/12/05
to
Alexander Terekhov wrote:
> Peter Dimov wrote:
>> Do you mean that on some systems a 32 bit int aligned on a 32 bit
>> boundary may not allow atomic accesses? Or in more general terms, a
>> type T's natural alignment may not be sufficient to ensure atomicity?
>
> Show me chapter and verse for "natural alignment" in C/C++ or
> even POSIX, please.

Search for "alignment" in C++03. It doesn't specifically say "natural"
because for the standard non-natural alignments do not exist and unaligned
accesses are undefined.

>> volatiles? Who said anything about volatiles?
>
> Well, you said: "Why wrap the thing in its own atomic<> type (I'm
> reminded of volatiles here) instead of just providing labeled
> access functions?"

Reminded of Java volatiles... where I need to change the type of the
variable in order to do atomic operations on it.

>> T atomic_load_acq( T * pt );
>> void atomic_store_rel( T * pt, T v );
>> // et al
>>
>> and ordinary non-volatile loads and stores for non-competing
>> accesses... what am I missing? "naked_competing" loads? Add
>> atomic_load_nsync, then.
>
> First off, an access can be either atomic or not atomic. Non-competing
> load accesses need not be atomic. Non-competing store accesses must be
> atomic (because it is non-competing with respect to stores, but it is
> "competing" with respect to loads), exclusive store or load accesses
> need not be atomic.

Hmm, I must be missing something. When I say "non-competing" I mean
"exclusive" in your terminology. The PLn models use the same definition for
"competing" - if there is a race.

> Competing accesses may need msync to impose proper ordering.
>
> Now, as for your functions, consider:
>
> char ab[2];
>
> Thread A:
>
> atomic_store_rel( &ab[0] );
> atomic_load_acq( &ab[1] );
>
> Thread B:
>
> atomic_load_acq( &ab[0] );
> atomic_store_rel( &ab[1] );
>
> How is this supposed to work on a processor that doesn't have atomic
> byte access instructions?

It isn't. It will fail to compile. The atomic_* family will work on int,
long, T* and PODs with sizeof(int), sizeof(long), sizeof(T*) and maybe with
2*sizeof(int) etc.

Why is atomic<char> useful? And how is atomic<p2> (where p2 is a struct of
two pointers) supposed to work without DWCAS? Contain a spinlock?


Peter Dimov

unread,
Apr 12, 2005, 8:00:36 AM4/12/05
to
Chris Thomasson wrote:
>> There seems to be some attraction in c.l.c++.m to smart pointers that
>> have almost no overhead, the deferred reference counting being
>> promoted by Andrei but there's no such thing as a free lunch. Deferred
>> reference counting does Boehm style GC against the thread
>> stacks and that requires a "stop the world while we GC" which has huge
>> performance implications in multi-threading.
>
> Actually, you mentioned a scheme that would allow for "deferred
> reference counting" using per-thread lock-free
> single-producer/consumer queues. Pushing a node ( reference
> adjustment request ) into a "local" per-thread queue is certainly
> less overhead than performing an interlocked-operation on a "shared"
> variable. I don't think you would need to use DWCAS either. The logic
> could be isolated in the "collection phase"...

Thread-local count update queues only work if you are willing to postpone
the destruction for an unspecified amount of time, IIUC.

p1 = 0; // thread 1

p2 = 0; // thread 2

Now if these were the only two references to *p1 and if thread 1 and thread
2 don't do anything smart pointer related in the next few hours, *p1 will be
kept alive.

And if you drop the deterministic destruction requirement, reference
counting is usually less efficient than other garbage collection
algorithms... not all of them need to stop the world (if my understanding is
correct).

I tried to come up with a deferred counting scheme where only non-essential
updates would be queued while the final transition from non-zero refs to zero
was always done "on time", but I failed. It might be possible; I haven't
studied the refcounting research much. But even if it is, it would probably
be less efficient on today's hardware for the typical shared_ptr uses in
C++, where most non-essential count updates are manually eliminated (or just
not introduced) by the programmer.

(Now that I think about it, it may be possible to use such a scheme for
local_ptr, which is supposed to never be shared between threads, but not for
shared_ptr, which can be shared if accesses are properly synchronized.)


Alexander Terekhov

unread,
Apr 12, 2005, 12:54:25 PM4/12/05
to

Peter Dimov wrote:
>
> Alexander Terekhov wrote:
> > Peter Dimov wrote:
> >> Do you mean that on some systems a 32 bit int aligned on a 32 bit
> >> boundary may not allow atomic accesses? Or in more general terms, a
> >> type T's natural alignment may not be sufficient to ensure atomicity?
> >
> > Show me chapter and verse for "natural alignment" in C/C++ or
> > even POSIX, please.
>
> Search for "alignment" in C++03. It doesn't specifically say "natural"
> because for the standard non-natural alignments do not exist and unaligned
> accesses are undefined.

Would you please quote the relevant passage(s)?

>
> >> volatiles? Who said anything about volatiles?
> >
> > Well, you said: "Why wrap the thing in its own atomic<> type (I'm
> > reminded of volatiles here) instead of just providing labeled
> > access functions?"
>
> Reminded of Java volatiles... where I need to change the type of the
> variable in order to do atomic operations on it.

Please elaborate.

>
> >> T atomic_load_acq( T * pt );
> >> void atomic_store_rel( T * pt, T v );
> >> // et al
> >>
> >> and ordinary non-volatile loads and stores for non-competing
> >> accesses... what am I missing? "naked_competing" loads? Add
> >> atomic_load_nsync, then.
> >
> > First off, an access can be either atomic or not atomic. Non-competing
> > load accesses need not be atomic. Non-competing store accesses must be
> > atomic (because it is non-competing with respect to stores, but it is
> > "competing" with respect to loads), exclusive store or load accesses
> > need not be atomic.
>
> Hmm, I must be missing something. When I say "non-competing" I mean
> "exclusive" in your terminology. The PLn models use the same definition for

PLn models say that "two operations conflict if they are to the same
location and at least one of them is a write" and that "given a pair
of conflicting operations u and v in an execution, operation u
competes with operation v if and only if there is no ordering chain
(as defined above) between u and v (either from u to v or vice versa).
An operation u is a competing operation if and only if there exists
at least one conflicting operation v in this execution that competes
with u. Otherwise, the operation is a non-competing operation."

> "competing" - if there is a race.

And what I meant was that for loads, "competing" means that it can
collide with other store. For stores, "competing" means that it can
collide with other store, "non-competing" means that it can collide
(and survive ;-) ) with other load (but not store), and "exclusive"
means no collision at all.

>
> > Competing accesses may need msync to impose proper ordering.
> >
> > Now, as for your functions, consider:
> >
> > char ab[2];
> >
> > Thread A:
> >
> > atomic_store_rel( &ab[0] );
> > atomic_load_acq( &ab[1] );
> >
> > Thread B:
> >
> > atomic_load_acq( &ab[0] );
> > atomic_store_rel( &ab[1] );
> >
> > How is this supposed to work on a processor that doesn't have atomic
> > byte access instructions?
>
> It isn't. It will fail to compile. The atomic_* family will work on int,
> long, T* and PODs with sizeof(int), sizeof(long), sizeof(T*) and maybe with
> 2*sizeof(int) etc.
>
> Why is atomic<char> useful?

Well, neighboring ints can also be manipulated by one "wide"
instruction (see the other case). Without atomic<>, that is. ;-)

> And how is atomic<p2> (where p2 is a struct of
> two pointers) supposed to work without DWCAS? Contain a spinlock?

It will fail to compile. atomic_pair<T> might work, though.

regards,
alexander.

Joe Seigh

unread,
Apr 12, 2005, 1:18:07 PM4/12/05
to
On Mon, 11 Apr 2005 19:25:02 -0700, Chris Thomasson <_no_damn_spam_cristom@_no_damn_comcast.net_spam> wrote:

>
>
>
>
>> There seems to be some attraction in c.l.c++.m to smart pointers that
>> have almost no overhead, the deferred reference counting being
>> promoted by Andrei but there's no such thing a a free lunch. Deferred
>> reference counting does Boehm style GC against the thread stacks and
>> that requires a "stop the world while we GC" which has huge performance
>> implications in multi-threading.
>
> Actually, you mentioned a scheme that would allow for "deferred reference
> counting" using per-thread lock-free single-producer/consumer queues.
> Pushing a node ( reference adjustment request ) into a "local" per-thread
> queue is certainly less overhead than performing an interlocked-operation on
> a "shared" variable. I don't think you would need to use DWCAS either. The
> logic could be isolated in the "collection phase"...
>
> There is just one problem, if the collection cycles could not keep up with
> user allocs/frees the applications memory usage could climb out of control.
>
>

The deferred reference counting mentioned by Andrei seems to be a different
scheme. I think there's a way to solve most of the problems here. I'll
just add it to my list of things I don't have time to do. Actually, I
have time, just not time to do pointless things. I mean, just how
many more GC schemes do *I* need?

--
Joe Seigh

David Hopwood

unread,
Apr 12, 2005, 2:15:58 PM4/12/05
to
Alexander Terekhov wrote:
> Peter Dimov wrote:
>>Alexander Terekhov wrote:
>>
>>Do you mean that on some systems a 32 bit int aligned on a 32 bit boundary
>>may not allow atomic accesses? Or in more general terms, a type T's natural
>>alignment may not be sufficient to ensure atomicity?
>
> Show me chapter and verse for "natural alignment" in C/C++ or
> even POSIX, please.

"Natural alignment" is defined by the ABI of each platform, not by the
language or by POSIX. But it *is* defined.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Peter Dimov

unread,
Apr 12, 2005, 3:36:44 PM4/12/05
to
Alexander Terekhov wrote:
> Peter Dimov wrote:
>> Search for "alignment" in C++03. It doesn't specifically say
>> "natural" because for the standard non-natural alignments do not
>> exist and unaligned accesses are undefined.
>
> Would you please quote the relevant passage(s)?

3.8/1

"The lifetime of an object of type T begins when:
- storage with the proper alignment and size for type T is obtained, and
- if T is a class type with a non-trivial constructor (12.1), the
constructor call has completed."

3.9/5

"Object types have alignment requirements (3.9.1, 3.9.2). The alignment of a
complete object type is an implementation-defined integer value representing
a number of bytes; an object is allocated at an address that meets the
alignment requirements of its object type."

>>>> volatiles? Who said anything about volatiles?
>>>
>>> Well, you said: "Why wrap the thing in its own atomic<> type (I'm
>>> reminded of volatiles here) instead of just providing labeled
>>> access functions?"
>>
>> Reminded of Java volatiles... where I need to change the type of the
>> variable in order to do atomic operations on it.
>
> Please elaborate.

I meant that on a specific platform, some types can be read/written/updated
atomically, and some cannot. But the requirement to "tag" objects that will
participate in such operations, whether with "volatile", or with atomic<>,
doesn't seem natural or necessary to me.

>> Hmm, I must be missing something. When I say "non-competing" I mean
>> "exclusive" in your terminology. The PLn models use the same
>> definition for
>
> PLn models say that "two operations conflict if they are to the same
> location and at least one of them is a write" and that "given a pair
> of conflicting operations u and v in an execution, operation u
> competes with operation v if and only if there is no ordering chain
> (as defined above) between u and v (either from u to v or vice versa).
> An operation u is a competing operation if and only if there exists
> at least one conflicting operation v in this execution that competes
> with u. Otherwise, the operation is a non-competing operation."
>
>> "competing" - if there is a race.
>
> And what I meant was that for loads, "competing" means that it can
> collide with other store. For stores, "competing" means that it can
> collide with other store, "non-competing" means that it can collide
> (and survive ;-) ) with other load (but not store), and "exclusive"
> means no collision at all.

Right. I was using PLn terminology, where stores that compete with loads are
"competing".

>> Why is atomic<char> useful?
>
> Well, neighboring ints can also be manipulated by one "wide"
> instruction (see the other case). Without atomic<>, that is. ;-)

I don't think that such platforms are in widespread use (unless IA64 or
x86-64 Windows compilers do 64 bit stores for ints... I doubt it), but if
they do, atomics for ints won't be defined there. Such an int runs counter
to the spirit of C, though. Stores on ints are not supposed to be carried
out in "double precision", so to speak. Chars may do that; this is well
known.

Anyway, I see your point. atomic<> can be useful.


Joe Seigh

unread,
Apr 13, 2005, 7:51:14 AM4/13/05
to
On Tue, 12 Apr 2005 13:18:07 -0400, Joe Seigh <jsei...@xemaps.com> wrote:

>>
> The deferred reference counting mentioned by Andrei seems to be a different
> scheme. I think there's a way to solve most of the problems here. I'll
> just add it to my list of things I don't have time to do. Actually, I
> have time, just not time to do pointless things. I mean, just how
> many more GC schemes do *I* need?
>

On further note, I think the major issue here is expectations. That is
where the learning curve comes in. Unless you know what the trade offs
are, it's difficult to keep expectations reasonable. There's probably
an optimal solution for all this, but there's no point in implementing
it if everyone believes there is some magical solution for all this.

--
Joe Seigh

Peter Dimov

unread,
Apr 13, 2005, 8:41:01 AM4/13/05
to
Joe Seigh wrote:

> The deferred reference counting mentioned by Andrei seems to be a
> different scheme. I think there's a way to solve most of the problems
> here. I'll just add it to my list of things I don't have time to do.

Deferred reference counting, as most of the research, is a "Java-centric"
scheme to deal with non-essential count updates.

"Java-centric" in the sense that its primary target is a fully collected
language.

The problem with counting collectors is that these languages generate lots
of count updates each time a reference is copied or "destroyed".

This is much, much less of a problem in C++, where most local references to
counted objects are ordinary C++ references or pointers and incur no
overhead.

I don't believe that C++ code can benefit from a traditional deferred
counting scheme (which avoids counting the stack and the registers and only
counts heap-allocated references.)

Your local_ptr can possibly benefit from deferring the counting somewhat,
but I'm not sure whether it will be worth doing.

Weighted reference counting is another interesting algorithm (on paper). I
think that it is also not worth doing. The idea there is to keep a local
count (weight) in each pointer. The pointer starts with 65536 (say) and on
each copy the weight is shared between the source and the copy. When a
pointer is destroyed its weight is subtracted from the reference count (its
initial value is also 65536).

The reason I believe that this scheme is not a gain is that it eliminates
count increments at the expense of an additional integer in every pointer.
But count increments are "naked" and in the uncontended case are almost as
fast as ordinary increments; and the additional size increase in every
pointer will likely decrease the overall performance of the application.
(The problem with the copy mutating the source aside for a moment.)

> Actually, I have time, just not time to do pointless things. I mean, just
> how many more GC schemes do *I* need?

From a C++ viewpoint, I don't believe that you need anything else beyond
(atomic|shared)_ptr-style counting and hazard pointers. But of course I may
be wrong. ;-)


Alexander Terekhov

unread,
Apr 13, 2005, 10:40:08 AM4/13/05
to

Peter Dimov wrote:

[... weighted reference counting ...]

> The reason I believe that this scheme is not a gain is that it eliminates
> count increments at the expense of an additional integer in every pointer.

It does bring contention down a bit (at expense of extra space),
but you still need (naked) atomic operation. Consider: (basic
thread-safety)

weighted_reference_counting_ptr w;

Thread A:

weighted_reference_counting_ptr a(w);

Thread B:

weighted_reference_counting_ptr b(w);

Both threads need to update w's weight, right?

regards,
alexander.

David Hopwood

unread,
Apr 13, 2005, 10:47:15 AM4/13/05
to
Peter Dimov wrote:
> Joe Seigh wrote:
>
>>The deferred reference counting mentioned by Andrei seems to be a
>>different scheme. I think there's a way to solve most of the problems
>>here. I'll just add it to my list of things I don't have time to do.
>
> Deferred reference counting, as most of the research, is a "Java-centric"
> scheme to deal with non-essential count updates.
>
> "Java-centric" in the sense that its primary target is a fully collected
> language.

ITYM "anything except C/C++".

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

unread,
Apr 13, 2005, 11:32:25 AM4/13/05
to
On Wed, 13 Apr 2005 15:41:01 +0300, Peter Dimov <pdi...@gmail.com> wrote:

> Joe Seigh wrote:
...


> I don't believe that C++ code can benefit from a traditional deferred
> counting scheme (which avoids counting the stack and the registers and only
> counts heap-allocated references.)
>
> Your local_ptr can possibly benefit from a deferring the counting somewhat,
> but I'm not sure whether it will be worth doing.
>
> Weighted reference counting is another interesting algorithm (on paper). I
> think that it is also not worth doing. The idea there is to keep a local
> count (weight) in each pointer. The pointer starts with 65536 (say) and on
> each copy the weight is shared between the source and the copy. When a
> pointer is destroyed its weight is subtracted from the reference count (its
> initial value is also 65536).

To make pointers copyable you have to divide up the weight somehow. If you
divide by two, after as few as 16 copies your resulting pointer becomes
uncopyable. There's another weighted reference counting scheme that uses
infinite precision fractions for the weight. It keeps extending the precision
at the expense of space boundedness.

atomic_ptr/local_ptr is a modified weighted reference counting scheme. local_ptr
looks a lot like pure weighted reference counting.

>
>> Actually, I have time, just not time to do pointless things. I mean, just
>> how many more GC schemes do *I* need?
>
> From a C++ viewpoint, I don't believe that you need anything else beyond
> (atomic|shared)_ptr-style counting and hazard pointers. But of course I may
> be wrong. ;-)
>

And RCU of course. :)

And it depends on what trade-offs you want to make. I have other unimplemented
schemes I can use if I get into a situation where I need a different set of
trade-offs.

--
Joe Seigh

Peter Dimov

unread,
Apr 13, 2005, 11:32:24 AM4/13/05
to

Yes (I posted something along these lines in clc++m).


Peter Dimov

unread,
Apr 13, 2005, 11:53:50 AM4/13/05
to
Joe Seigh wrote:
> On Wed, 13 Apr 2005 15:41:01 +0300, Peter Dimov <pdi...@gmail.com>
> wrote:
>> Joe Seigh wrote:
> ...
>> I don't believe that C++ code can benefit from a traditional deferred
>> counting scheme (which avoids counting the stack and the registers
>> and only counts heap-allocated references.)
>>
>> Your local_ptr can possibly benefit from a deferring the counting
>> somewhat, but I'm not sure whether it will be worth doing.
>>
>> Weighted reference counting is another interesting algorithm (on
>> paper). I think that it is also not worth doing. The idea there is
>> to keep a local count (weight) in each pointer. The pointer starts
>> with 65536 (say) and on each copy the weight is shared between the
>> source and the copy. When a pointer is destroyed its weight is
>> subtracted from the reference count (its initial value is also
>> 65536).
>
> To make pointers copyable you have to divide up the weight somehow. If you
> divide by two, after as little as 16 copies, you resulting
> pointer becomes uncopyable.

I think that when you need to copy a source with weight 1, you can just
increment the total weight by 65536 and create a weighted_ptr with a weight
of 65536, without modifying the source. Or you could also increase the source's
weight to 65536 and increment the total weight by 65536+65535.

> There's another weighted reference counting scheme that uses infinite
> precision fractions for the weight. It keeps extending the precision at
> the expense of space boundedness.

Seems like overkill to me. ;-)

> atomic_ptr/local_ptr is a modified weighted reference counting
> scheme. local_ptr looks a lot like pure weighted reference counting.

From looking at the source, local_ptr doesn't seem weighted to me. There is
no weight member and copies just increment a shared reference count (the
ephemeral count). But maybe I'm not looking at the latest version (the one
on atomic-ptr-plus.sf.net is 0.0.2).


Joe Seigh

unread,
Apr 13, 2005, 1:38:23 PM4/13/05
to

Copying an atomic_ptr to a local_ptr is a transfer of 1 weighted reference
count without updating the central reference count. local_ptrs originally
weren't copyable since they only have a weight of 1, but I added some logic
from atomic_ptr-to-atomic_ptr copying; it's just not as efficient. The
extra logic obfuscates the weighted-ptr logic somewhat, so it's hard to
see.

--
Joe Seigh

Peter Dimov

unread,
Apr 13, 2005, 2:33:52 PM4/13/05
to

I see it now, thank you; there is an ecount in atomic_ptr. But even though I
understand what the code does, I'm still puzzled as to what this scheme buys
you.


Joe Seigh

unread,
Apr 13, 2005, 3:20:43 PM4/13/05
to

Atomicity, i.e. strong thread safety.

--
Joe Seigh

Joe Seigh

unread,
Apr 13, 2005, 9:33:54 PM4/13/05
to

I was going to say it is probably the only working atomic ptr out there.
If you google for "Lock-free Reference Counting" you'll find a paper on
an implementation that requires an instruction no longer in existence.
However google dragged up an article in the Dec 2004 CUJ titled "Atomic
Reference Counting Pointers" by William K. Reinholtz which going by the
code examples appears to be the same as the PPC version of atomic_ptr.
I don't have access to the article itself so I don't know where he got
the algorithm. Anyone know?

--
Joe Seigh

Peter Dimov

unread,
Apr 14, 2005, 7:01:08 AM4/14/05
to
Joe Seigh wrote:
> I was going to say it is probably the only working atomic ptr out
> there. If you google for "Lock-free Reference Counting" you'll find a
> paper on an implementation that requires an instruction no longer in
> existence. However google dragged up an article in the Dec 2004 CUJ
> titled "Atomic Reference Counting Pointers" by William K. Reinholtz
> which going by the code examples appears to be the same as the PPC
> version of atomic_ptr. I don't have access to the article itself so I
> don't know where he got
> the algorithm. Anyone know?

Since it's in CUJ, you should probably ask in clc++m. William K. (Kirk)
Reinholtz has several RT-related papers, so if he doesn't give you credit in
the article, he probably invented it.

By looking at

ftp://ftp.cuj.com/pub/2004/cujdec2004.zip

I'd say that it doesn't have much in common with your atomic_ptr. Macros vs
templates and the definition of CAS aside, atomic_ptr is non-intrusive,
Reinholtz's 'rcp' is intrusive; it relies on a cntr_ member in the pointee
to keep the count. It's like boost::intrusive_ptr with an atomic exchange in
the 'swap' member, except with most of the code inlined and some
over-atomicity in the constructors and the destructor. No match for your
atomic_ptr, IMO.


Joe Seigh

unread,
Apr 14, 2005, 7:21:11 AM4/14/05
to

The ppc version of atomic_ptr uses different interlocked logic since ppc
doesn't have double wide CAS. The template vs. macros and intrusive
vs. non-intrusive are just details. It's the underlying algorithms that
matter. But I'll ask in clc++m.

--
Joe Seigh

Peter Dimov

unread,
Apr 14, 2005, 8:08:55 AM4/14/05
to
Joe Seigh wrote:
> The ppc version of atomic_ptr uses different interlocked logic since
> ppc doesn't have double wide CAS.

Is your PPC version available?

> The template vs. macros and intrusive vs. non-intrusive are just details.

I'm not sure. It's the details that make your atomic_ptr unique. A
straightforward smart pointer with an atomic swap() seems pretty trivial;
even I was able to "invent" it once (with the same lack of memory barriers
as in Reinholtz's code).


Joe Seigh

unread,
Apr 14, 2005, 8:48:32 AM4/14/05
to
On Thu, 14 Apr 2005 15:08:55 +0300, Peter Dimov <pdi...@gmail.com> wrote:

> Joe Seigh wrote:
>> The ppc version of atomic_ptr uses different interlocked logic since
>> ppc doesn't have double wide CAS.
>
> Is your PPC version available?

On http://sourceforge.net/projects/atomic-ptr-plus/ in the atomic_ptr
package there's a ppc32 version available. I didn't get a Mac until
recently, so I wasn't able to publish a working implementation of the
algorithm until now.

>
>> The template vs. macros and intrusive vs. non-intrusive are just details.
>
> I'm not sure. It's the details that make your atomic_ptr unique. A
> straightforward smart pointer with an atomic swap() seems pretty trivial,
> even I was able to "invent" it once (with the same lack of memory barriers
> as in Reinholz's code).
>

The atomic swap is one part. The other part is safely incrementing the
reference count without "owning" it first. There's the DWCAS solution,
the PPC solution, and various GC based solutions using RCU, SMR, etc...
And of course the 2CAS solution which only works on MC68020 and MC68030
processors.


--
Joe Seigh

Peter Dimov

unread,
Apr 14, 2005, 9:37:56 AM4/14/05
to
Joe Seigh wrote:

> The ppc version of atomic_ptr uses different interlocked logic since
> ppc doesn't have double wide CAS.

I thought it had ldarx/stdcx.?


Chris Thomasson

unread,
Apr 14, 2005, 9:45:25 AM4/14/05
to
>> The ppc version of atomic_ptr uses different interlocked logic since
>> ppc doesn't have double wide CAS.
>
> I thought it had ldarx/stdcx.?

You use ll/sc to create atomic primitives.


Peter Dimov

unread,
Apr 14, 2005, 9:53:13 AM4/14/05
to
Joe Seigh wrote:
> The atomic swap is one part. The other part is safely incrementing
> the reference count without "owning" it first. There's the DWCAS
> solution, the PPC solution, and various GC based solutions using RCU,
> SMR, etc... And of course the 2CAS solution which only works on
> MC68020 and MC68030 processors.

You mean the race between

atomic_ptr<T> p2( p1 );

and

p1 = 0;

?

I think that you have a bug in your PPC implementation, but I may be missing
something. In inc_ref, you reload the pointer and check for change; however,
the pointer is a local function argument (int* refcnt) and will never
change.

And you are right that Reinholtz uses the same algorithm, except he passes an
int** (the offset hack aside).


Chris Thomasson

unread,
Apr 14, 2005, 9:56:32 AM4/14/05
to

Arggrhrgh. I read lwarx/stwcx, did not notice the doubleword version. Sorry!


Anyway, I think that lwarx/stwcx and ldarx/stdcx "both" work on a 64-bit CPU,
on 64-bit values... I am not totally sure.


Joe Seigh

unread,
Apr 14, 2005, 10:13:47 AM4/14/05
to

In 64 bit mode they're not double wide.


--
Joe Seigh