struct proxy64 {
    node* hd;     /* head of the deferred (to-be-reclaimed) node list   */
    int32_t rc;   /* count of threads currently holding a proxy reference */
};

/* Push 'n' (if any) onto the deferred list; if the reference count is
   zero, detach the whole list instead and return it (with 'n' prepended)
   for reclamation. Returns 0 when nothing can be reclaimed yet. */
node* proxy_collect(proxy64* const _this, node* const n) {
    proxy64 cmp, xchg;
    do {
        cmp = *_this;
        if (! cmp.rc) {
            /* no references left: take the whole list */
            xchg.hd = 0;
        } else if (n) {
            /* references still active: defer 'n' by pushing it */
            n->nx = cmp.hd;
            xchg.hd = n;
        } else {
            /* references active and nothing to defer: no-op swap */
            xchg.hd = cmp.hd;
        }
        xchg.rc = cmp.rc;
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
    if (! cmp.rc) {
        if (n) {
            n->nx = cmp.hd;
            cmp.hd = n;
        }
        return cmp.hd;
    }
    return 0;
}

void proxy_acquire(proxy64* const _this) {
    ATOMIC_INC(&_this->rc);
}

node* proxy_release(proxy64* const _this) {
    /* the thread that drops the count to zero collects */
    return (! ATOMIC_DEC(&_this->rc))
        ? proxy_collect(_this, 0) : 0;
}
_____________________________________________________________
All the nodes returned by 'proxy_collect' or 'proxy_release' are in a
quiescent state and can be reclaimed. However, there is a MAJOR caveat
here... Can you spot it?
;^)
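For illustration, the intended usage is roughly the following; reader(),
writer() and node_destroy() are just placeholders for your own traversal
and reclamation code:

/* reader side: pin the proxy across the lock-free reads */
void reader(proxy64* const p) {
    proxy_acquire(p);
    /* ... perform the lock-free reads here ... */
    node* batch = proxy_release(p);
    while (batch) {            /* non-zero only if we dropped rc to zero */
        node* nx = batch->nx;
        node_destroy(batch);   /* placeholder reclamation function */
        batch = nx;
    }
}

/* writer side: 'dead' has already been unlinked from the shared
   structure, so defer it instead of freeing it right away */
void writer(proxy64* const p, node* const dead) {
    node* batch = proxy_collect(p, dead);
    while (batch) {            /* non-zero only if rc was already zero */
        node* nx = batch->nx;
        node_destroy(batch);
        batch = nx;
    }
}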
--
Chris M. Thomasson
http://appcore.home.comcast.net
Here is the reason... It can have periods during which the reference
counter never drops to zero; for example, if a new reader always calls
proxy_acquire before the previous one calls proxy_release, the count
stays above zero and every deferred node simply joins the list. This is
because it does not use a backlink list the way pc_sample.c and Joe's
appc algorithms do...
[...]
I think under heavy load it can collect a very big batch of nodes...
Dmitriy V'jukov
I think this will greatly increase speed:
> node* proxy_collect(proxy64* const _this, node* const n) {
>     proxy64 cmp, xchg;
>     do {
>         cmp = *_this;
          *if (!n && !cmp.hd) return 0;*
>         if (! cmp.rc) {
>             xchg.hd = 0;
>         } else if (n) {
>             n->nx = cmp.hd;
>             xchg.hd = n;
>         } else {
>             xchg.hd = cmp.hd;
>         }
>         xchg.rc = cmp.rc;
>     } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
>     if (! cmp.rc) {
>         if (n) {
>             n->nx = cmp.hd;
>             cmp.hd = n;
>         }
>         return cmp.hd;
>     }
>     return 0;
> }
Dmitriy V'jukov
And you can also add:
> I think this will greatly increase speed:
>
> > node* proxy_collect(proxy64* const _this, node* const n) {
> > proxy64 cmp, xchg;
> > do {
> > cmp = *_this;
>
> *if (!n && !cmp.hd) return 0;*
if (!n && cmp.rc) return 0;
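Folding both early exits into the function, the whole thing would look
something like this:

node* proxy_collect(proxy64* const _this, node* const n) {
    proxy64 cmp, xchg;
    do {
        cmp = *_this;
        /* release path (n == 0): bail out early when there is nothing
           to hand back, or when the count has not yet reached zero */
        if (! n && ! cmp.hd) return 0;
        if (! n && cmp.rc) return 0;
        if (! cmp.rc) {
            xchg.hd = 0;
        } else {
            /* cmp.rc != 0 here, so n is non-null */
            n->nx = cmp.hd;
            xchg.hd = n;
        }
        xchg.rc = cmp.rc;
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
    if (! cmp.rc) {
        if (n) {
            n->nx = cmp.hd;
            cmp.hd = n;
        }
        return cmp.hd;
    }
    return 0;
}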
And you can try to combine proxy_release() with proxy_collect().
Something like:
node* proxy_release(proxy64* const _this) {
begin:
    if (1 == _this->rc) {
        /* try to atomically decrease rc and grab the collector list
           with a single DWCAS; if it fails, goto begin */
        proxy64 cmp, xchg;
        cmp.hd = _this->hd;
        cmp.rc = 1;
        xchg.hd = 0;
        xchg.rc = 0;
        if (! ATOMIC_DWCAS(_this, &cmp, &xchg)) {
            goto begin;
        }
        return cmp.hd;
    } else {
        return (! ATOMIC_DEC(&_this->rc))
            ? proxy_collect(_this, 0) : 0;
    }
}
Thus you can sometimes replace an ATOMIC_DEC plus an ATOMIC_DWCAS with a
single ATOMIC_DWCAS.
Dmitriy V'jukov
Exactly right.
It's a space/time tradeoff. You have an n=1 proxy collector. McKenney's
realtime RCU has n=2 collectors. Classic proxy has n=unlimited. In
practice you put some limit on n to keep writers from depleting memory.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Afaik McKenney's realtime RCU doesn't specify a collector count at all.
It has n=2 per-thread/processor counters of "inside read critical
section", and the collector can be global or per thread/processor. The
"inside read critical section" counters are only responsible for
quiescent-period detection.
In Chris's solution n=1, but writers can deplete all memory. The n's are
all different: in "classical proxy",
n == unlimited == number_of_deferred_objects; in this solution,
n == 1 != number_of_deferred_objects; in realtime RCU, n is something
completely different.
Dmitriy V'jukov
Here is a tweaked version:
_____________________________________________________________
node* proxy_collect(proxy64* const _this, node* const n) {
    proxy64 cmp, xchg;
    do {
        cmp.rc = _this->rc;
        cmp.hd = _this->hd;
        xchg.rc = cmp.rc;
        if (! cmp.rc) {
            xchg.hd = 0;
        } else if (n) {
            n->nx = cmp.hd;
            xchg.hd = n;
        } else {
            return 0;
        }
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
    if (! cmp.rc) {
        if (n) {
            n->nx = cmp.hd;
            cmp.hd = n;
        }
        return cmp.hd;
    }
    return 0;
}

void proxy_acquire(proxy64* const _this) {
    ATOMIC_INC(&_this->rc);
}

node* proxy_release_version_A(proxy64* const _this) {
    proxy64 cmp;
    for (;;) {
        cmp.rc = _this->rc;
        if (cmp.rc == 1) {
            proxy64 xchg = {0, 0};
            if (! DWCAS(_this, &cmp, &xchg)) {
                continue;
            }
            return cmp.hd;
        }
        break;
    }
    return (! ATOMIC_DEC(&_this->rc))
        ? proxy_collect(_this, 0) : 0;
}

node* proxy_release_version_B(proxy64* const _this) {
    proxy64 cmp;
    do {
        cmp.rc = _this->rc;
        while (cmp.rc == 1) {
            proxy64 xchg = {0, 0};
            cmp.hd = _this->hd;
            if (ATOMIC_DWCAS(_this, &cmp, &xchg)) {
                return cmp.hd;
            }
            cmp.rc = _this->rc;
        }
    } while (! ATOMIC_CAS(&_this->rc, cmp.rc, cmp.rc - 1));
    return 0;
}
_____________________________________________________________
Which release version do you like the best, A or B?
> node* proxy_release_version_A(proxy64* const _this) {
> proxy64 cmp;
> for (;;) {
> cmp.rc = _this->rc;
> if (cmp.rc == 1) {
> proxy64 xchg = {0, 0};
You need to add the following line:
cmp.hd = _this->hd;
> if (! DWCAS(_this, &cmp, &xchg)) {
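With that line in place, version A would read:

node* proxy_release_version_A(proxy64* const _this) {
    proxy64 cmp;
    for (;;) {
        cmp.rc = _this->rc;
        if (cmp.rc == 1) {
            proxy64 xchg = {0, 0};
            cmp.hd = _this->hd;   /* the missing read */
            if (! DWCAS(_this, &cmp, &xchg)) {
                continue;
            }
            return cmp.hd;
        }
        break;
    }
    return (! ATOMIC_DEC(&_this->rc))
        ? proxy_collect(_this, 0) : 0;
}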
So are you saying that it may take a while for a counter to go down to
zero but that it will still eventually happen? I just want to make
sure I understand :)
You are right in the sense that the second counter eventually proceeds to
"decrement only" mode, and this guarantees forward progress wrt memory
freeing.
Dmitriy V'jukov
> Which release version do you like the best, A or B?
I think... hmmm... difficult question... I think A.
Because it has only an ATOMIC_DEC on the main path, and not an ATOMIC_CAS.
Dmitriy V'jukov
Btw what do you think about this:
Intel(R) 64 and IA-32 Architectures Software Developer's Manual, Volume
3A: System Programming Guide, Part 1
7.1.2.2 Software Controlled Bus Locking
Software should access semaphores (shared memory used for signalling
between multiple processors) using identical addresses *and operand
lengths*. For example, if one processor accesses a semaphore using a
word access, other processors should not access the semaphore using a
byte access.
Dmitriy V'jukov
Humm... Kind of sounds like they're saying you can't use CMPXCHG8B and CMPXCHG
at the same time on the same location. For instance, you can't use CAS to
modify the bottom half of a lock-free anchor and concurrently use DWCAS to
modify the whole thing. It would be trivial to convert the CMPXCHG to
CMPXCHG8B; however, there would be a performance hit, as CMPXCHG8B is a bit
more expensive.
What do you think? I can't see why it would not work.
I think it will work fine if you use CMPXCHG8B.
But you have to replace the XADD (not the CMPXCHG) with CMPXCHG8B. CMPXCHG8B
is more expensive and is not 'wait-free' like XADD, so there can be
some 'retries'.
And this is a hit at the reader, not the writer. Crap.
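For example, the acquire side would turn into something like this (only a
sketch; the wait-free XADD becomes a retry loop):

void proxy_acquire(proxy64* const _this) {
    proxy64 cmp, xchg;
    do {
        /* read and rewrite the whole 8-byte anchor so that every
           locked access uses the same address and operand length */
        cmp.hd = _this->hd;
        cmp.rc = _this->rc;
        xchg.hd = cmp.hd;
        xchg.rc = cmp.rc + 1;
    } while (! ATOMIC_DWCAS(_this, &cmp, &xchg));
}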
Dmitriy V'jukov
For some reason, I think it should still work... The locking is applied to
the L2 cache line, so all locked operations should respect other lockers wrt
a common cache line. I think I should post something over in the Intel Thread
Forum.
> CMPXCHG8B
> is more expensive and is not 'wait-free' like XADD, so there can be
> some 'retries'.
> And this is a hit at the reader, not the writer. Crap.
Total crap.
It sure seems like concurrent DWCAS will make a CAS fail, and vice versa, on
a monotonic counter. I wonder where the race condition actually is.
> > For some reason, I think it should still work... The locking is applied to
> > the L2 cache line, so all locked operations should respect other lockers
> > wrt a common cache line.
I also think that it should work, because the cache line must be locked
anyway. I can't understand why they would include such a restriction...
> > I think I should post something over in Intel
> > Thread Forum.
They will ignore your question, or recommend that you use OpenMP or
Intel TBB :)))
> It sure seems like concurrent DWCAS will make a CAS fail, and vice versa, on
> a monotonic counter. I wonder where the race condition actually is.
But the DWCAS in proxy_collect() doesn't change the rc field.
Dmitriy V'jukov