RCU+SMR w/ reference counting

Joe Seigh

unread,

Aug 1, 2006, 8:19:09 PM8/1/06

to

This is some RCU+SMR logic I implemented a little while back and
never released.

RCU+SMR can be used with reference counting the same way that
regular SMR uses it which is when a reference count goes to zero,
the deallocation of the object is delayed until no hazard pointers
point to it. If the object points to other refcounted objects, the
refcounts for those objects are decremented and the previously
mentioned logic is applied.

For RCU+SMR the logic has to be modified slightly. If the deferred
reclamation thread decrements the refcount to zero, it cannot immediately
compare the object against the hazard pointers since it has to wait
for the RCU grace period before doing so. I.e. it has to queue that
object for deferred reclamation.

For certain situations this can have a major performance impact. Take
a linked list where readers are traversing the list in the same order the
objects are being dequeued. E.g.

a -> b -> c -> d -> e -> 0
^ ^
x q

where x is a hazard pointer and q is the current queue head. After x
finishes with the list, all the deleted objects will be deallocated one
at a time with an RCU grace period between each one. This becomes
a major bottle neck.

The way to fix this is to add an additional reference counter, a deferred
count, to each refcounted object object. When an object is queued
for deferred deallocation, each object referenced by it has its reference count
decremented by one for each reference by that object, and its deferred
reference count incremented by one. So the refcount is only the count
of live references (reachable from live objects) and the deferred refcount is
the count of stale references (reachable from stale objects scheduled for
deferred deallocation). The hazard pointer scan logic is modified to only
delete objects in the defer queue which have deferred refcounts of zero.
Before an unreferenced object is deallocated, the deferred refcounts of
all the objects it points to are decremented with no additional action based
on the result.

So for the above example with no hazard pointer pointing to a you'd
have
a.refcount = 0, a.defcount = 0
b.refcount = 0, b.defcount = 1
c.refcount = 0, c.defcount = 1
d.refcount = 1, d.defcount = 0
...

If the deletes of a, b, and c had occurred in quick succession, they'd all
be in the defer queue in the same RCU grace period most likely which
means they could be deallocated immediately since they're already in
the queue rather than being queued one at a time with consequent
delays each time.

Testing shows this scheme preserves the speed of earlier versions of
RCU+SMR with fifo ordering restrictions and allows for lock-free writer
logic if desired. If you do the defer one at a time, throughput dramatically
tanks.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Joe Seigh

unread,

Aug 2, 2006, 6:16:57 AM8/2/06

to

...

> The way to fix this is to add an additional reference counter, a deferred
> count, to each refcounted object object. When an object is queued
> for deferred deallocation, each object referenced by it has its
> reference count
> decremented by one for each reference by that object, and its deferred

> reference count incremented by one. ...

One minor ommission. If the refcount on this operation goes to zero
then that object as well is scheduled for deferred deallocation.

Hopefully, I've remembered most of everything since I'm no longer
posting source. :)

Chris Thomasson

unread,

Aug 2, 2006, 5:22:43 PM8/2/06

to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:XP2dnU7Az-Omc1LZ...@comcast.com...

> This is some RCU+SMR logic I implemented a little while back and
> never released.
>
> RCU+SMR can be used with reference counting the same way that
> regular SMR uses it which is when a reference count goes to zero,
> the deallocation of the object is delayed until no hazard pointers
> point to it.

Humm... I "seem" to be using a different approach. My implementation, sorry
for the mess, can be found here:

http://appcore.home.comcast.net/appcore/include/cpu/i686/ac_i686_lfgc_refcount_h.html

In this case, a hazard pointer is not used to protect the actual object
itself... In my setup, I am using a hazard pointer to protect a
"reference-object" named ac_i686_lfgc_refcount_t. A ref-object simply refers
to the actual object... Therefore reference count increments and decrements
do not touch the actual object. I can take advantage of this level of
indirection... Instead of deferring the "actual" object's destruction until
no hazard pointers point to it, I can destroy it immediately after the count
for the reference-object drops to zero... Then I defer the reference-object,
not the actual object...

Here is how the decrement function looks in pseudo-code:

struct refobj_t {
int count;
void *obj; // Actual object pointer
void (*fp_dtor) (void*); // Actual object destructor
};

void refobj_decrement( refobj_t *_this ) {
if ( _this && ! atomic_dec(&_this->count) ) {

// Okay. we dropped to zero, call the destructor for the actual object.
_this->fp_dtor(_this->obj);

// Now we need to defer the destruction of the reference object
// until there are no hazard pointers pointing to it.
SMR_DEFER(_this);
}
}

My implementation of this is in the function ac_i686_lfgc_refcount_release
which is located near the top of the ac_i686_lfgc_refcount_h.html file...
The increment function ac_i686_lfgc_refcount_addref can also be found near
the top of the file...

[...]

vZOOM is based on distributed proxy count thing I came up with... Count
increments go something like this in pseudo-code:

void* vzoom_persistent_inc
(vzoom_t *_this /* thread local */,
void **pobj /* shared */) {

int *pref; // pointer to thread-local persistent counter
void *lobj, *dcobj; // local object pointers

lobj = *pobj; // initial load

if ( ! lobj ) { return 0; } // null check

pref = &_this->pref[*((int**)lobj)]; // hashing also works fine

++(*pref); // simple thread-local increment

dcobj = *pobj; // double-check load

while( lobj != *pobj ) {

lobj = dcobj;

--(*pref); // simple thread-local decrement

if ( ! lobj ) { return 0; } // null check

pref = &_this->pref[*((int**)lobj)]; // hashing also works fine

++(*pref); // simple thread-local increment

dcobj = *pobj; // double-check load
}

return lobj;
}

Instead of setting a hazard pointer, I increment a per-thread counter that
is associated with the object. I will post a paper on my distributed proxy
reference counting technique after the CoolThreads contest is over... I hope
I didn't say too much already!

:O

Nah, my deadline is August 7, there is not too much time left for my
competition to make any drastic changes...

;)

Chris Thomasson

unread,

Aug 2, 2006, 5:26:33 PM8/2/06

to

Yikes!

This line is suppost to be:

> pref = &_this->pref[*((int**)lobj)]; // hashing also works fine

pref = &_this->pref[*((int*)lobj)]; // hashing also works fine

[...]

And this line is suppost to be:

> pref = &_this->pref[*((int**)lobj)]; // hashing also works fine

pref = &_this->pref[*((int*)lobj)]; // hashing also works fine

[...]

I know this is pseudo-code, but I really hate simple pointer indirection
bugs a whole lot... Sorry for any confusion!!!

:O

Chris Thomasson

unread,

Aug 2, 2006, 6:26:49 PM8/2/06

to

Yikes: This should be:

> dcobj = *pobj; // double-check load
>
>
> while( lobj != *pobj ) {

dcobj = *pobj; // double-check load

while( lobj != dcobj ) {

OK... Here is the pseudo-code, free of errors:

void* vzoom_persistent_inc
(vzoom_t *_this /* thread local */,
void **pobj /* shared */) {

int *pref; // pointer to thread-local persistent counter
void *lobj, *dcobj; // local object pointers

lobj = *pobj; // initial load

if ( ! lobj ) { return 0; } // null check

pref = &_this->pref[*((int*)lobj)]; // hashing also works fine

++(*pref); // simple thread-local increment

dcobj = *pobj; // double-check load

while( lobj != dcobj ) {

lobj = dcobj;

--(*pref); // simple thread-local decrement

if ( ! lobj ) { return 0; } // null check

pref = &_this->pref[*((int*)lobj)]; // hashing also works fine

Joe Seigh

unread,

Aug 2, 2006, 7:38:56 PM8/2/06

to

Chris Thomasson wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:XP2dnU7Az-Omc1LZ...@comcast.com...
>
>>This is some RCU+SMR logic I implemented a little while back and
>>never released.
>>
>>RCU+SMR can be used with reference counting the same way that
>>regular SMR uses it which is when a reference count goes to zero,
>>the deallocation of the object is delayed until no hazard pointers
>>point to it.
>
>
> Humm... I "seem" to be using a different approach. My implementation, sorry
> for the mess, can be found here:
>
> http://appcore.home.comcast.net/appcore/include/cpu/i686/ac_i686_lfgc_refcount_h.html
>
> In this case, a hazard pointer is not used to protect the actual object
> itself... In my setup, I am using a hazard pointer to protect a
> "reference-object" named ac_i686_lfgc_refcount_t. A ref-object simply refers
> to the actual object... Therefore reference count increments and decrements
> do not touch the actual object. I can take advantage of this level of
> indirection... Instead of deferring the "actual" object's destruction until
> no hazard pointers point to it, I can destroy it immediately after the count
> for the reference-object drops to zero... Then I defer the reference-object,
> not the actual object...
>

...

The problem is unique to RCU+SMR AFAIK. Refcounting with plain SMR or
with RCU does not have the problem.

Also, note that the schemes I am refering to only use refcounted ptrs
in shared global memory. Thread local access is done through RCU or
though SMR hazard pointers, not through a refcounting, so deallocation
of the object has to be deferred for that reason.

Chris Thomasson

unread,

Aug 21, 2006, 4:57:32 AM8/21/06

to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:XP2dnU7Az-Omc1LZ...@comcast.com...

> This is some RCU+SMR logic I implemented a little while back and
> never released.

[...]

Take a look at this:

http://www.cs.toronto.edu/~tomhart/perflab/ipdps06.tgz

This seems to be work from late 2005... I see user-space RCU memory barriers
guarding quiescent-state transitions, and user-space SMR w/ membar...
However, I don't see any sort of user-space RCU+SMR... I guess they haven't
had time to "copy" your scheme yet...?

;)

Joe Seigh

unread,

Aug 21, 2006, 6:20:30 AM8/21/06

to

They don't have to copy it. There's already one patent application that uses
memory barriers as quiescent states for a different application. They just
have to file some more generalized follow on patents as blocking patents and
they pretty much will "own" RCU+SMR. Plus there's an SMR hazard pointer patent
application. No need to copy. They can just collect patent royalties.

Chris Thomasson

unread,

Aug 27, 2006, 5:26:58 PM8/27/06

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:CNqdnQLXWr4-uUzZ...@comcast.com...
>[...]

> OK... Here is the pseudo-code, free of errors:

[...]

> pref = &_this->pref[*((int*)lobj)]; // hashing also works fine

^^^^^^^^^^^

I should point out that the pseudo-code for vzoom_persistent_inc assumes
that the caller is in a PDR collected/critical-region; that's what makes the
pointer dereferencing safe... As you can probably assume, the index into
per-thread counters in the pseudo-code example was kept in object meta-data.
The index in the meta-data is hashed from a pointer to the containing
structure.

There is another version of vzoom_persistent_inc that can load a pointer to
a shared object without being inside a PDR collected/critical-region...
Therefore, the object meta-data cannot be touched so the counter index is
hashed from the pointer to the containing structure. The hash can be as
simple as this:

#define COUNTER_DEPTH 128
#define HASHPTR(p) ((int)(p) % COUNTER_DEPTH)

There is also another "special" version of vzoom_persistent_inc that just
increments a passed pointers indexed counter... This particular increment
function only assumes that the pointer that gets passed to it has been
loaded w/ dependant load ordering... This special increment functionality
allows me to code some crazy search stuff like this:

/* special increment */
extern void vzoom_persistent_inc_raw(vzoom_t*, void const*);

#define MY_SEARCH_DEPTH 16

struct node_t {
node_t *nx;
/* whatever */
};

struct node_dlist_t {
node_t *f;
node_t *b;
};

struct search_t {
node_t *results[MY_SEARCH_DEPTH];
};

int do_search(node_dlist_t *dlist, search_t *result, ... /* search criteria
*/ ... ) {

int idx = 0;
vzoom_t *vz = vzoom_self();

/* voluntary enter pdr collected-region */ {

node_t *nx, *c = vzoom_load_depend(&dlist->f);

while(c) {

nx = vzoom_load_depend(&c->nx);

if ( c == ... /* search criteria */... ) {

if (idx < SEARCH_DEPTH) {

result->results[idx++] = vzoom_persistent_inc_raw(vz, c);

} else { break; }

}

c = nx;
}

} /* voluntary exit pdr collected-region */

return idx;
}

In this particular ad-hoc example I can return a reference to 16 separate
nodes per-search without using CAS. I can also make SEARCH_DEPTH a dynamic
programmable adjustable variable. So, you can tailor your searches to fit
your needs... Pretty cool huh?

;)

Of course this type of highly flexible searching functionality is completely
out of the question and will segfault all over the place with "classic"
RCU... Also, IMHO, its not really compatible with SMR because of the number
of hazard pointers would have to be adjusted at runtime...

Chris Thomasson

unread,

Aug 29, 2006, 4:17:36 AM8/29/06

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:B-adneowGMpBlm_Z...@comcast.com...

> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:CNqdnQLXWr4-uUzZ...@comcast.com...
>>[...]

Of course, this:

> /* special increment */
> extern void vzoom_persistent_inc_raw(vzoom_t*, void const*);

has to be this:

extern void* vzoom_persistent_inc_raw(vzoom_t*, void const*);

!