
Cyclic Weighted Reference Counting


Alexander Terekhov

May 23, 2003, 11:37:17 AM
< Forward Quoted >

Newsgroups: gmane.comp.lib.boost.devel
Subject: Re: Re: shared_ptr/weak_ptr and thread-safety
Date: Fri, 23 May 2003 12:23:46 -0500
Message-ID: <3ECE5922...@prodigy.net>
References: ...<3ECDE001...@web.de>

Larry Evans wrote:
>
> Alexander Terekhov wrote:
> > Larry Evans wrote:
> > [...]
> >
> >>weighted reference counts eliminate around half the synchronization.
> >>This is because, AFAICT, the only time synchronization is needed is
> >>when a smart weighted pointer releases its pointee, but not
> >>when it acquires a new pointee (because the reference count is
> >>gotten from the other, or source, smart weighted pointer).
> >
> >
> > I don't understand. Please elaborate (explain with more details...
> > or simply drop a link ;-) ).
> http://www.cs.kent.ac.uk/people/staff/rej/cgi-bin/searchbib?pattern=weighted+reference
>
> Quoting from the Conclusions of "Cyclic Weighted Reference Counting":
>
> By associating weights with pointers, interprocessor communication
> need take place only when a pointer is deleted -- pointers may
> be copied without any communication.
>
> I'm assuming the "thread synchronization" is analogous to the
> "interprocessor communication" mentioned above, only it takes more
> time.
>
> The way it works is that each pointee begins with some maximum reference count
> (actually, log base 2 of the maximum reference count). The first smart pointer
> copies this reference count. Subsequent smart pointer copies "share" the reference
> count by dividing it by 2 so each participant gets half. When a smart pointer
> releases its pointee, it subtracts its weight from the count in the pointee
> (and thus requires "thread synchronization" [in case of threads] or
> "interprocessor communication" [in case of multiprocessing]).
> The invariant is that the count in the pointee = the sum of the counts in all
> the smart pointers referencing that pointee.
>

regards,
alexander.
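The splitting scheme quoted above can be sketched in a few lines of C++ (a minimal, single-threaded illustration with invented names; a real implementation would at least need an atomic subtraction in release()):

```cpp
#include <cassert>

// Sketch of weighted reference counting as described above (invented
// names, no thread-safety). The pointee starts with a fixed total
// weight; each smart-pointer copy takes half of the *source pointer's*
// weight, so copying never touches the shared count. Only release()
// communicates with the pointee (and would need an atomic subtract
// in a threaded setting).
struct WeightedControl {
    long total_weight;  // invariant: equals the sum of all pointer weights
    explicit WeightedControl(long w) : total_weight(w) {}
};

struct WeightedPtr {
    WeightedControl* ctl;
    long weight;

    // The first pointer takes the pointee's whole weight.
    explicit WeightedPtr(WeightedControl* c) : ctl(c), weight(c->total_weight) {}

    // Copying splits the source's weight in two; no shared access needed.
    WeightedPtr(WeightedPtr& src) : ctl(src.ctl), weight(src.weight / 2) {
        src.weight -= weight;
    }

    // Releasing subtracts this pointer's weight from the shared total;
    // returns true when the pointee is no longer referenced.
    bool release() {
        ctl->total_weight -= weight;
        return ctl->total_weight == 0;
    }
};
```

Note the failure mode Seigh raises in the next message: once a pointer's weight has dropped to 1, it cannot be split further without going back to the pointee to borrow more weight, which reintroduces synchronization.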

Joseph Seigh

May 23, 2003, 12:39:30 PM

Alexander Terekhov wrote:
>
> < Forward Quoted >


I had looked at using "weighted" reference counts when writing atomic_ptr to reduce
the overhead of copying pointers, but ran into the problem of what to do when the weight
dropped to one. But I guess you could boost the count in that case, if that is what
he is doing. That would require synchronization and a larger precision reference count
so you could guarantee some maximum overall reference count.

Joe Seigh

Alexander Terekhov

May 23, 2003, 3:23:17 PM

Joseph Seigh wrote:
> I had looked at using "weighted" reference counts when writing atomic_ptr to reduce
> the overhead of copying pointers, but ran into the problem of what to do when the weight
> dropped to one. But I guess you could boost the count in that case, if that is what
> he is doing. That would require synchronization and a larger precision reference count
> so you could guarantee some maximum overall reference count.

<Forward Inline>

Larry Evans wrote:

[... Weighted Reference Counting... ]

Thanks. Looks intriguing, but has some {probably severe}
drawbacks with respect to "general purpose" application,
I think. I'm now reading this paper:

http://cs.nyu.edu/goldberg/pubs/gold89.pdf
(Generational Reference Counting: A Reduced-Communication
Distributed Storage Reclamation Scheme)

regards,
alexander.

Joseph Seigh

May 23, 2003, 5:00:35 PM


It looks like atomic_ptr is equivalent to that if you take it that
an atomic_ptr copy is converted to a 0th generation copy each time.
The reference count would be the 0th generation count and the
ephemeral count the 1st generation count. So the ledger, which
would be atomic_ptr_ref, only needs to have two counts in it.
atomic_ptrs are 0th generation pointers. local_ptrs are 1st generation
pointer copies unless they were copied, in which case they get converted
to 0th generation. The conversion from 1st to 0th generation involves
a simultaneous decrement of the 1st generation ref count and increment
of the 0th generation ref count.

Or you could take generational reference counting as weighted reference
counting with really big Huffman encoded numbers. :)

Joe Seigh
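Goldberg's scheme, as it is being mapped onto atomic_ptr here, can be sketched roughly as follows (a simplified, single-threaded sketch with invented names; in the real scheme the "message" to the ledger is the only communication a deletion needs):

```cpp
#include <cassert>
#include <map>

// Rough sketch of generational reference counting (after the Goldberg
// paper cited below); names are invented. Copying a pointer touches
// only the source pointer -- it bumps a local copy counter -- and the
// copy belongs to the next generation. Only deletion communicates
// with the object's ledger.
struct Ledger {
    std::map<int, long> count;  // per-generation reference counts
    bool dead() const {
        for (const auto& g : count)
            if (g.second != 0) return false;
        return true;
    }
};

struct GenPtr {
    Ledger* ledger;
    int generation;   // this pointer's generation i
    long copies = 0;  // number of generation i+1 copies handed out

    GenPtr(Ledger* l, int g) : ledger(l), generation(g) {}

    // Copying requires no communication with the ledger.
    GenPtr copy() {
        ++copies;
        return GenPtr(ledger, generation + 1);
    }

    // Deletion sends one message: retire me in generation i and
    // account for my copies in generation i+1.
    void destroy() {
        ledger->count[generation] -= 1;
        ledger->count[generation + 1] += copies;
    }
};
```

In Seigh's reading, atomic_ptr collapses this to two generations: the link count plays the role of the 0th-generation count and the ephemeral count the 1st-generation count, with copies promoted back to generation 0.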

Alexander Terekhov

May 24, 2003, 9:12:04 AM
Joe, that's meant for your "consumption", I guess. ;-)

<Forward Quoted>

Larry Evans wrote:
>
> Alexander Terekhov wrote:

> > Larry Evans wrote:
> >
> > [... Weighted Reference Counting... ]
> >
> > Thanks. Looks intriguing, but has some {probably severe}
> > drawbacks with respect to "general purpose" application,
> > I think. I'm now reading this paper:
> >
> > http://cs.nyu.edu/goldberg/pubs/gold89.pdf
> > (Generational Reference Counting: A Reduced-Communication
> > Distributed Storage Reclamation Scheme)

> Thanks, I hadn't seen that. However, it is dated 1989 and was
> referenced in Rodrigues and Jones "Cyclic Distributed Garbage
> Collection with Group Merger". I'm also wondering whether
> Bacon's method ( http://citeseer.nj.nec.com/bacon01concurrent.html )
> might be better. He provides a comparison with Rodrigues'; however,
> he doesn't mention Goldberg. Either way, both require traversal
> of the pointer graph. In boost/files/shared_cyclic_ptr/stl_container.cpp
> there's a prototype of a method for doing this derived from Detlefs'
> method ( http://www.cs.kent.ac.uk/people/staff/rej/cgi-bin/searchbib?pattern=detlef+run+time+typing )
> I could use some feedback. The html file in that directory is out-of-date; however,
> the points it makes about the advantages of the Detlefs method versus
> others in boost are, IMHO, still valid.

Alexander Terekhov

May 24, 2003, 9:14:10 AM

Larry Evans wrote:
>
> Larry Evans wrote:
> > Alexander Terekhov wrote:

> [snip]

> is more directly available at http://www.research.ibm.com/people/d/dfb/papers/
> [snip]

Joseph Seigh

May 24, 2003, 12:33:17 PM

I wrote:
>
> Or you could take generational reference counting as weighted reference
> counting with really big Huffman encoded numbers. :)
>

Well, not Huffman encoding but "compressed".

Generational reference counting is equivalent to a really big precision number
given by

    x**0 + x**(-1) + .... + x**(-i) + ....

where i is the generation number and x is the max integer value.

The weight of a pointer is given by

    x**(-i) - k*x**(-i-1)

where k is the number of copies made of the pointer.

So if a copy of the above pointer was made the new
value would be

    x**(-i) - (k+1)*x**(-i-1)

and the value of the new pointer would be

    x**(-i-1) - 0*x**(-i-2)


which added together gives you the original copied pointer weight.

When the deleted pointer weights sum to the initial weight, x**0,
or 1, the object is deleted.

There's probably some modulo/no carry arithmetic involved but I
won't go into that here.

So basically, every scheme that allows copying of pointers without
synchronization with the object/refcount is equivalent to a weighted
refcount with a very big number which probably has to be of magnitude
of at least 2**n where n is the max number of existing pointers to
the object.

It's all just accounting anyway.


Joe Seigh
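The conservation identity above can be spot-checked with small numbers (a throwaway sketch; x = 16 is an arbitrary base chosen as a power of two so the floating-point arithmetic below is exact):

```cpp
#include <cassert>

// Weight of a generation-i pointer that has handed out k copies:
//   w(i, k) = x**(-i) - k*x**(-i-1), with x = 16 here.
// ipow_inv computes x**(-n) by repeated division; dividing by a
// power of two is exact in binary floating point.
double ipow_inv(double x, int n) {
    double r = 1.0;
    for (int j = 0; j < n; ++j) r /= x;
    return r;
}

double w(int i, int k) {
    const double x = 16.0;
    return ipow_inv(x, i) - k * ipow_inv(x, i + 1);
}
```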

Alexander Terekhov

May 24, 2003, 12:52:27 PM

Joseph Seigh wrote:

[... ``rocket science''-aka-"just accounting" ...]

> So basically, every scheme that allows copying of pointers without

> synchronization with the object/refcount ....

What if one shares a *copy* -- having multiple threads copying a shared
copy, not the original?

regards,
alexander.

Joseph Seigh

May 24, 2003, 1:25:28 PM

Each copy gets a weight which is subtracted from the weight of the copy of the
pointer it was copied from. The individual weights get smaller with each
successive copy.

Joe Seigh

Alexander Terekhov

May 24, 2003, 1:27:59 PM

Joseph Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > Joseph Seigh wrote:
> >
> > [... ``rocket science''-aka-"just accounting" ...]
> >
> > > So basically, every scheme that allows copying of pointers without
> > > synchronization with the object/refcount ....
> >
> > What if one shares a *copy* -- having multiple threads copying a shared
> > copy, not the original?
> >
>
> Each copy gets a weight which is subtracted from the weight of the copy of the
> pointer it was copied from.

Using pthread_refcount_subtract()? ;-)

The point is that you STILL need synchronization. Right?

regards,
alexander.

Joseph Seigh

May 24, 2003, 2:12:01 PM

For atomic pointers, yes. For C++-style thread-safety, no, because you own
the source and destination pointers. I'm not sure why the C++ crowd is so
interested in avoiding synchronization (CAS on the refcount) on pointer copy.
CAS is not that slow, so it doesn't take too much before any alternative would
end up being more expensive. Now if the refcount resided on a remote node
where synchronization would be expensive, that would be a different matter.

Also, I'm not sure why there's interest in cyclic garbage collection. Object
dtors are data structure aware and would be much more efficient.

Joe Seigh

SenderX

May 24, 2003, 8:25:50 PM
> So the ledger which
> would be atomic_ptr_ref only needs to have two counts in it.

What can be the realistic lowest size in bits of the counts on atomic_ptr?

Since the ephemeral count kept in atomic_ptr and the one kept in
atomic_ptr_ref roll over to zero together ( modulo ), they can even be as low
as __int8, correct?

I think I remember you saying something about the size of the ephemeral
count is related to how many local_ptr's you can have at the same time
across all threads?

How does the size in bits of the atomic_ptr_ref refcount ( non-ephemeral )
limit your algo?

I am wondering about lower counter sizes cause I was thinking of ways to get
atomic_ptr to work with cmp8xchg16b...

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com


Joseph Seigh

May 25, 2003, 1:31:36 PM

SenderX wrote:
>
> > So the ledger which
> > would be atomic_ptr_ref only needs to have two counts in it.
>
> What can be the realistic lowest size in bits of the counts on atomic_ptr
>
> Since the ephemeral count kept in atomic_ptr and the one kept in
> atomic_ptr_ref roll over to zero together ( modulo ) they can even be as low
> as __int8 correct?
>
> I think I remember you saying something about the size of the ephemeral
> count is related to how many local_ptr's you can have at the same time
> across all threads?

Roughly. Actually, it's the number of ephemeral references. Copying an
atomic_ptr temporarily creates an ephemeral reference until the link
count can be incremented.

>
> How does the size in bits of the atomic_ptr_ref refcount ( non-ephemeral )
> limit your algo?

It limits the number of links (atomic_ptr) that can reference an object.


>
> I am wondering about lower counter sizes cause I was thinking of ways to get
> atomic_ptr to work with cmp8xchg16b...

There won't be any problem porting atomic_ptr to use cmp8xchg16. It's just
accounting.

Joe Seigh

SenderX

May 25, 2003, 6:21:51 PM
> There won't be any problem porting atomic_ptr to use cmp8xchg16. It's just
> accounting.

Will we be able to port the version of atomic_ptr that has the 16-bit aba
count that I added? I think that it would be essential to keep that
functionality. ;)

Also, do you have to use memory barriers in atomic_ptr? Like one after the
constructor, and one before the destructor?

Do the internal atomic_ptr read, update, exchange CAS sections, or any lock
free CAS section for that matter, need acquire / release semantics upon
success?

The papers on the Mike and Scott fifo, and the IBM lifo freelist don't
mention the need for barriers at all. The only paper that says it will
probably need barriers on a weak order system is the SMR algo...

Joseph Seigh

May 26, 2003, 8:41:27 AM

SenderX wrote:
>
> > There won't be any problem porting atomic_ptr to use cmp8xchg16. It's just
> > accounting.
>
> Will we be able to port the version of atomic_ptr that has the 16-bit aba
> count that I added? I think that it would be essential to keep that
> functionality. ;)

Probably not. CAS on atomic_ptr in general using cmp8xchg16 is problematic
because of the live-lock problem. If you need it you will have to go with
the cmpxchg using 32 bit atomic_ptr_ref offset solution discussed earlier.

>
> Also, do you have to use memory barriers in atomic_ptr? Like one after the
> constructor, and one before the destructor?

ctors don't need memory barriers as far as I can tell except maybe for
atomic_ptr_ref. You need a release memory barrier *before* the decrement
of the refcount. Even the thread-safe version of boost shared_ptr needs that.
As to whether to have an acquire memory barrier after the refcount decrement
if it went to zero so that the dtor can have "exclusive" access, that would
be really useful but is not required.


>
> Do the internal atomic_ptr read, update, exchange CAS sections, or any lock
> free CAS section for that matter, need acquire / release semantics upon
> success?

None are required for atomic_ptr. However I plan to add acquire/release
semantics to atomic_ptr loads and stores to make them useful. Probably
the dtor one as well.

Whether lock-free code needs memory barriers depends on the algorithm.
Dequeueing from a stack requires an acquire memory barrier after loading
the stack head to guarantee that it is replaced by the next element and
not garbage.


>
> The papers on the Mike and Scott fifo, and the IBM lifo freelist don't
> mention the need for barriers at all. The only paper that says it will
> probably need barriers on a weak order system is the SMR algo...

For the purposes of presenting the algorithm logic, most papers assume
TSO memory. IBM Z architecture has PSO, so many of the memory barriers
needed under an RSO memory model are not required.

Joe Seigh
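The barrier placement described above (release before the decrement; acquire on the zero path so the dtor gets "exclusive" access) maps onto present-day C++ atomics roughly like this. This is a sketch of the general idiom with invented names, not of atomic_ptr or boost shared_ptr themselves:

```cpp
#include <atomic>
#include <cassert>

// Sketch of dropping a reference with the barrier placement described
// above. The release ordering on fetch_sub publishes this thread's
// writes to the object before the count drops; the acquire fence on
// the zero path gives the destroying thread visibility of every other
// thread's prior writes before it runs the destructor.
template <class T>
struct Counted {
    std::atomic<long> refs{1};
    T* object = nullptr;

    void add_ref() {
        // Copying a reference needs atomicity but no ordering.
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    void release() {
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            delete object;
            delete this;
        }
    }
};
```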

Alexander Terekhov

May 26, 2003, 9:09:22 AM

Joseph Seigh wrote:
[...]

> You need a release memory barrier *before* the decrement
> of the refcount. Even the thread-safe version of boost shared_ptr needs that.
> As to whether to have an acquire memory barrier after the refcount decrement
> if it went to zero so that the dtor can have "exclusive" access, that would
> be really useful but is not required.

Acquire is required for mutable managed objects with non-trivial
dtors. You don't need any barriers for immutable managed objects
(smart_ptr<const T>).

regards,
alexander.

Joseph Seigh

May 26, 2003, 9:58:46 AM

I meant required as in required to avoid touching freed memory, as opposed to
required to avoid the extra effort of managing an explicit mutex.
The former is a safety requirement. The latter is a convenience requirement.

Joe Seigh

Alexander Terekhov

May 26, 2003, 10:40:36 AM

You're again thinking in terms of "strong" thread-safety, where an object
has a mutex (or a read-write lock) to synchronize mutations. You also
presume that it acquires the lock in the destructor. In general, this is
a rather silly approach. Just like with exception-safety, you need a "strong"
guarantee only at the "top level" -- where the actual/observable transaction
begins/ends. It's sufficient, and much more efficient, to have "basic"
exception-safety and implement a "strong" one on a higher level. With
synchronization, it's basically the same. An object (unless it's really
meant to provide synchronization -- things like queues, schedulers, etc.)
had better provide BASIC thread-safety ala "memory locations" in the 4.10
rules. Objects don't need to lock anything [unless they share something
internally and that thing can mutate]. This is what people call "thread
neutral" or "as thread-safe as an int". "Unsafe" thread-safety can also
be VERY useful for things that have "internal sharing" [for example,
things like shared_ptr/weak_ptr and allocators] but don't know the "scope"
of actual use by the clients. Now, shared_ptr/weak_ptr is a somewhat
special case; you can emulate "garbage collected" volatile Java refs (and
the upcoming "atomic" beasts from JSR166, I guess) if they'd support
"strong" thread-safety with a couple of rather specific operations like
"attempt_update". But this has rather limited application, I think.

regards,
alexander.

SenderX

May 26, 2003, 12:42:48 PM
> Probably not. CAS on atomic_ptr in general using cmp8xchg16 is problematic
> because of the live-lock problem. If you need it you will have to go with
> the cmpxchg using 32 bit atomic_ptr_ref offset solution discussed earlier.

Why couldn't Intel just do a cmpxchg16b? Damn... ;)

Is it the fact that the ephemeral count is included in the atomic_ptr::CAS?

If so, a live-lock seems like it would have to happen between the time the
ephemeral count was read and the time it was compared inside the atomic_cas
method?

If not, I don't quite see where the comparands for atomic_ptr::cas would
change faster than an exchange ( live-lock ) would happen.

Since you can't fit a 32-bit atomic_ptr ephemeral count and a 64-bit pointer
in a cmp8xchg16b comparand, how do you reliably do the CAS inside of
atomic_ptr::getrefptr? Since it compares the pointer with an ephemeral
count... Do you construct some sort of unique CAS comparand tag made up of
part of the pointer value mixed with the ephemeral count?

> The
> dequeueing from a stack requires an acquire memory barrier after loading
> the stack head to guarantee that it is replaced by the next element and
> not garbage.

Would this be safe for a weak order memory system?


STACK LocalStack;

do
{
    /* Load */
    LocalStack = SharedStack;

    /* Acquire the previous load from shared memory? */
    __asm { lfence };
}
/* Store */
while( ! CAS( &SharedStack,
              LocalStack.pFront,
              LocalStack.lAba,
              LocalStack.pFront->pNext,
              LocalStack.lAba + 1 ) );

/* Release the last store to shared memory? */
__asm { sfence };

return LocalStack.pFront;


I would think the push operation would need to have a load-fence after it
reads the stack as well?

I am trying to better understand the weak memory model. ;)

SenderX

May 26, 2003, 2:08:03 PM
> If you need it you will have to go with
> the cmpxchg using 32 bit atomic_ptr_ref offset solution discussed earlier.

This is a great idea, cause it not only allows for an atomic_ptr port...


I have also implemented the IBM FreeList on a 32-bit system, without using a
64-bit CAS!

I will post the working test code on my site soon. ;)


I did this by using indexes into an array of nodes as pointers.


Basically, what I did is this...


typedef struct S_Node
{
    struct S_Node *pNext;
    LPVOID pObj;

} NODE, *LPNODE;


typedef union U_Stack
{
    __int32 Value32;

    struct
    {
        unsigned __int16 Front;
        unsigned __int16 Aba;
    };

} STACK, *LPSTACK;


/* Use sets of arrays, with a max limit of 65,534 nodes */

NODE NodeArray[65534];


To test for a NULL pointer:

    if ( Stack.Front == 65535 )
    {
        /* NULL Value */
    }

To store a node pointer in the front, you would:

    Stack.Front = (unsigned __int16)
        ( ( (DWORD)pNodePtr - (DWORD)NodeArray ) / sizeof( NODE ) );

To get a node pointer from the front, you would:

    pNodePtr = (LPNODE)
        ( (DWORD)NodeArray + ( Stack.Front * sizeof( NODE ) ) );


This should be safe enough to allow for 32-bit lock-free algos using 32-bit
CAS.

This should be documented if it works out, no crashes yet on my testing
code...


=)
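The packing arithmetic can be exercised in isolation (a sketch of the index/tag conversion only, with invented names; the lock-free CAS loop itself is omitted, and index 65535 plays the role of NULL as above):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the index-for-pointer scheme above: a 16-bit array index
// plus a 16-bit ABA tag fit into one 32-bit word, so an ordinary
// 32-bit CAS can swap the "pointer" and the tag together.
struct Node { Node* next; void* obj; };

const uint16_t kNullIndex = 65535;  // reserved NULL value
Node node_array[1024];              // up to 65534 nodes in the real scheme

uint16_t to_index(Node* p) {
    return p ? (uint16_t)(p - node_array) : kNullIndex;
}

Node* to_pointer(uint16_t i) {
    return i == kNullIndex ? nullptr : &node_array[i];
}

// The word a 32-bit CAS would compare and swap.
uint32_t pack(uint16_t front, uint16_t aba) {
    return (uint32_t)front | ((uint32_t)aba << 16);
}
```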

SenderX

May 26, 2003, 2:50:47 PM
> To store a node pointer in the front, you would:
>
> Stack.Front = (unsigned __int16)
> ( (DWORD)pNodePtr -
> (DWORD)NodeArray ) / sizeof( NODE );
>
>
>
> To get a node pointer from the front, you would:
>
> pNodePtr = (LPNODE)
> ( (DWORD)NodeArray +
> ( Stack.Front * sizeof( NODE ) ) );

I realize I could just use the index numbers to get at the nodes in the
arrays ( NodeArray[index] ), but I wanted to test Joe's idea of using a base
of memory and offsets into it for pointers.

I also think you will need to adjust the math in case an index number is
greater than the system page size?
