struct proxy64 {
    node* hd;     /* head of the deferred (to-be-reclaimed) node list   */
    int32_t rc;   /* count of threads currently holding a proxy reference */
};

/* Push 'n' (if any) onto the deferred list; if the reference count is
   zero, detach the whole list instead and return it (with 'n' prepended)
   for reclamation. Returns 0 when nothing can be reclaimed yet. */
node* proxy_collect(proxy64* const _this, node* const n) {
    proxy64 cmp, xchg;
    do {
        cmp = *_this;
        if (! cmp.rc) {
            /* no references left: take the whole list */
            xchg.hd = 0;
        } else if (n) {
            /* references still active: defer 'n' by pushing it */
            n->nx = cmp.hd;
            xchg.hd = n;
        } else {
            /* references active and nothing to defer: no-op swap */
            xchg.hd = cmp.hd;
        }
        xchg.rc = cmp.rc;
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
    if (! cmp.rc) {
        if (n) {
            n->nx = cmp.hd;
            cmp.hd = n;
        }
        return cmp.hd;
    }
    return 0;
}

void proxy_acquire(proxy64* const _this) {
    ATOMIC_INC(&_this->rc);
}

node* proxy_release(proxy64* const _this) {
    /* the thread that drops the count to zero collects */
    return (! ATOMIC_DEC(&_this->rc))
        ? proxy_collect(_this, 0) : 0;
}
_____________________________________________________________
All the nodes returned by 'proxy_collect' or 'proxy_release' are in a
quiescent state and can be reclaimed. However, there is a MAJOR caveat
here... Can you spot it?
;^)
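For illustration, the intended usage is roughly the following; reader(),
writer() and node_destroy() are just placeholders for your own traversal
and reclamation code:

/* reader side: pin the proxy across the lock-free reads */
void reader(proxy64* const p) {
    proxy_acquire(p);
    /* ... perform the lock-free reads here ... */
    node* batch = proxy_release(p);
    while (batch) {            /* non-zero only if we dropped rc to zero */
        node* nx = batch->nx;
        node_destroy(batch);   /* placeholder reclamation function */
        batch = nx;
    }
}

/* writer side: 'dead' has already been unlinked from the shared
   structure, so defer it instead of freeing it right away */
void writer(proxy64* const p, node* const dead) {
    node* batch = proxy_collect(p, dead);
    while (batch) {            /* non-zero only if rc was already zero */
        node* nx = batch->nx;
        node_destroy(batch);
        batch = nx;
    }
}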
--
Chris M. Thomasson
http://appcore.home.comcast.net
Here is the reason... It can have periods during which the reference
counter never drops to zero; for example, if a new reader always calls
proxy_acquire before the previous one calls proxy_release, the count
stays above zero and every deferred node simply joins the list. This is
because it does not use a backlink list the way pc_sample.c and Joe's
appc algorithms do...
[...]
I think under heavy load it can collect a very big batch of nodes...
Dmitriy V'jukov
I think this will greatly increase speed:
> node* proxy_collect(proxy64* const _this, node* const n) {
>     proxy64 cmp, xchg;
>     do {
>         cmp = *_this;
          *if (!n && !cmp.hd) return 0;*
>         if (! cmp.rc) {
>             xchg.hd = 0;
>         } else if (n) {
>             n->nx = cmp.hd;
>             xchg.hd = n;
>         } else {
>             xchg.hd = cmp.hd;
>         }
>         xchg.rc = cmp.rc;
>     } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
>     if (! cmp.rc) {
>         if (n) {
>             n->nx = cmp.hd;
>             cmp.hd = n;
>         }
>         return cmp.hd;
>     }
>     return 0;
> }
Dmitriy V'jukov
And you can also add:
> I think this will greatly increase speed:
>
> > node* proxy_collect(proxy64* const _this, node* const n) {
> > proxy64 cmp, xchg;
> > do {
> > cmp = *_this;
>
> *if (!n && !cmp.hd) return 0;*
if (!n && cmp.rc) return 0;
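Folding both early exits into the function, the whole thing would look
something like this:

node* proxy_collect(proxy64* const _this, node* const n) {
    proxy64 cmp, xchg;
    do {
        cmp = *_this;
        /* release path (n == 0): bail out early when there is nothing
           to hand back, or when the count has not yet reached zero */
        if (! n && ! cmp.hd) return 0;
        if (! n && cmp.rc) return 0;
        if (! cmp.rc) {
            xchg.hd = 0;
        } else {
            /* cmp.rc != 0 here, so n is non-null */
            n->nx = cmp.hd;
            xchg.hd = n;
        }
        xchg.rc = cmp.rc;
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
    if (! cmp.rc) {
        if (n) {
            n->nx = cmp.hd;
            cmp.hd = n;
        }
        return cmp.hd;
    }
    return 0;
}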
And you can try to combine proxy_release() with proxy_collect().
Something like:
node* proxy_release(proxy64* const _this) {
begin:
    if (1 == _this->rc) {
        /* try to atomically decrease rc and grab the collector list
           with a single DWCAS; if it fails, goto begin */
        proxy64 cmp, xchg;
        cmp.hd = _this->hd;
        cmp.rc = 1;
        xchg.hd = 0;
        xchg.rc = 0;
        if (! ATOMIC_DWCAS(_this, &cmp, &xchg)) {
            goto begin;
        }
        return cmp.hd;
    } else {
        return (! ATOMIC_DEC(&_this->rc))
            ? proxy_collect(_this, 0) : 0;
    }
}
Thus you can sometimes replace an ATOMIC_DEC plus an ATOMIC_DWCAS with a
single ATOMIC_DWCAS.
Dmitriy V'jukov
Exactly right.
It's a space/time tradeoff. You have an n=1 proxy collector. McKenney's
realtime RCU has n=2 collectors. Classic proxy has n=unlimited. In
practice you put some limit on n to keep writers from depleting memory.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Afaik McKenney's realtime RCU doesn't specify a collector count at all.
It has n=2 per-thread/processor counters of "inside read critical
section", and the collector can be global or per thread/processor. The
"inside read critical section" counters are only responsible for
quiescent-period detection.
In Chris's solution n=1, but writers can deplete all memory. The n's are
all different: in "classical proxy",
n == unlimited == number_of_deferred_objects; in this solution,
n == 1 != number_of_deferred_objects; in realtime RCU, n is something
completely different.
Dmitriy V'jukov
Here is a tweaked version:
_____________________________________________________________
node* proxy_collect(proxy64* const _this, node* const n) {
    proxy64 cmp, xchg;
    do {
        cmp.rc = _this->rc;
        cmp.hd = _this->hd;
        xchg.rc = cmp.rc;
        if (! cmp.rc) {
            xchg.hd = 0;
        } else if (n) {
            n->nx = cmp.hd;
            xchg.hd = n;
        } else {
            return 0;
        }
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
    if (! cmp.rc) {
        if (n) {
            n->nx = cmp.hd;
            cmp.hd = n;
        }
        return cmp.hd;
    }
    return 0;
}

void proxy_acquire(proxy64* const _this) {
    ATOMIC_INC(&_this->rc);
}

node* proxy_release_version_A(proxy64* const _this) {
    proxy64 cmp;
    for (;;) {
        cmp.rc = _this->rc;
        if (cmp.rc == 1) {
            proxy64 xchg = {0, 0};
            if (! DWCAS(_this, &cmp, &xchg)) {
                continue;
            }
            return cmp.hd;
        }
        break;
    }
    return (! ATOMIC_DEC(&_this->rc))
        ? proxy_collect(_this, 0) : 0;
}

node* proxy_release_version_B(proxy64* const _this) {
    proxy64 cmp;
    do {
        cmp.rc = _this->rc;
        while (cmp.rc == 1) {
            proxy64 xchg = {0, 0};
            cmp.hd = _this->hd;
            if (ATOMIC_DWCAS(_this, &cmp, &xchg)) {
                return cmp.hd;
            }
            cmp.rc = _this->rc;
        }
    } while (! ATOMIC_CAS(&_this->rc, cmp.rc, cmp.rc - 1));
    return 0;
}
_____________________________________________________________
Which release version do you like the best, A or B?
> node* proxy_release_version_A(proxy64* const _this) {
> proxy64 cmp;
> for (;;) {
> cmp.rc = _this->rc;
> if (cmp.rc == 1) {
> proxy64 xchg = {0, 0};
You need to add the following line:
cmp.hd = _this->hd;
> if (! DWCAS(_this, &cmp, &xchg)) {
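With that line in place, version A would read:

node* proxy_release_version_A(proxy64* const _this) {
    proxy64 cmp;
    for (;;) {
        cmp.rc = _this->rc;
        if (cmp.rc == 1) {
            proxy64 xchg = {0, 0};
            cmp.hd = _this->hd;   /* the missing read */
            if (! DWCAS(_this, &cmp, &xchg)) {
                continue;
            }
            return cmp.hd;
        }
        break;
    }
    return (! ATOMIC_DEC(&_this->rc))
        ? proxy_collect(_this, 0) : 0;
}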
So are you saying that it may take a while for a counter to go down to
zero but that it will still eventually happen? I just want to make
sure I understand :)
You are right in the sense that the second counter eventually proceeds to
"decrement only" mode, and this guarantees forward progress wrt memory
freeing.
Dmitriy V'jukov
> Which release version do you like the best, A or B?
I think... hmmm... difficult question... I think A.
Because it has only an ATOMIC_DEC on the main path, and not an ATOMIC_CAS.
Dmitriy V'jukov
Btw what do you think about this:
Intel(R) 64 and IA-32 Architectures Software Developer's Manual, Volume
3A: System Programming Guide, Part 1
7.1.2.2 Software Controlled Bus Locking
Software should access semaphores (shared memory used for signalling
between multiple processors) using identical addresses *and operand
lengths*. For example, if one processor accesses a semaphore using a
word access, other processors should not access the semaphore using a
byte access.
Dmitriy V'jukov
Humm... Kind of sounds like they're saying you can't use CMPXCHG8B and CMPXCHG
at the same time on the same location. For instance, you can't use CAS to
modify the bottom half of a lock-free anchor and concurrently use DWCAS to
modify the whole thing. It would be trivial to convert the CMPXCHG to
CMPXCHG8B; however, there would be a performance hit, as CMPXCHG8B is a bit
more expensive.
What do you think? I can't see why it would not work.
I think it will work fine if you use CMPXCHG8B.
But you have to replace the XADD (not the CMPXCHG) with CMPXCHG8B. CMPXCHG8B
is more expensive and is not 'wait-free' like XADD, so there can be
some 'retries'.
And this is a hit at the reader, not the writer. Crap.
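For example, the acquire side would turn into something like this (only a
sketch; the wait-free XADD becomes a retry loop):

void proxy_acquire(proxy64* const _this) {
    proxy64 cmp, xchg;
    do {
        /* read and rewrite the whole 8-byte anchor so that every
           locked access uses the same address and operand length */
        cmp.hd = _this->hd;
        cmp.rc = _this->rc;
        xchg.hd = cmp.hd;
        xchg.rc = cmp.rc + 1;
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
}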
Dmitriy V'jukov
For some reason, I think it should still work... The locking is applied to
the L2 cache line, so all locked operations should respect other lockers wrt
a common cache line. I think I should post something over in the Intel Thread
Forum.
> CMPXCHG8B
> is more expensive and is not 'wait-free' like XADD, so there can be
> some 'retries'.
> And this is a hit at the reader, not the writer. Crap.
Total crap.
It sure seems like concurrent DWCAS will make a CAS fail, and vice versa, on
a monotonic counter. I wonder where the race condition actually is.
> > For some reason, I think it should still work... The locking is applied to
> > the L2 cache line, so all locked operations should respect other lockers
> > wrt a common cache line.
I also think that it should work, because the cache line must be locked
anyway. I can't understand why they would include such a restriction...
> > I think I should post something over in Intel
> > Thread Forum.
They will ignore your question, or recommend that you use OpenMP or
Intel TBB :)))
> It sure seems like concurrent DWCAS will make a CAS fail, and vice versa, on
> a monotonic counter. I wonder where the race condition actually is.
But the DWCAS in proxy_collect() doesn't change the rc field.
Dmitriy V'jukov