
pc_sample.c revisited; n = 1, zero-garbage collector...


Chris Thomasson
Mar 11, 2008, 5:00:00 AM
I am very pleased to be working with a new proxy collector implementation I
created which boasts lock/wait-free forward-progress. This thing only ever
requires a __single__ proxy collector object in order to achieve the _ideal_
"zero-garbage" guarantee. I am not sure if I should post the fully coded
prototype to the public yet, however, the experimental algorithm has some
basic proofs, and it's been running under worst-case scenarios for the past
couple of months; the "trick" it uses is sound indeed, and the actual
implementation of the following API is very short and sweet:
______________________________________________________________
typedef struct pc_node_s pc_node;
typedef struct pc_region_s pc_region;
typedef struct pc_master_s pc_master;


extern void
pc_node_init(
pc_node* const,
pc_node* const,
void (*) (pc_node*)
);


extern pc_region*
pc_acquire(
pc_master* const
);


extern void
pc_release(
pc_master* const,
pc_region* const
);


extern void
pc_defer(
pc_region* const,
pc_node* const
);


extern void
pc_mutate(
pc_master* const,
pc_node* const
);
______________________________________________________________

One reason why this beats the original pc_sample.c:

http://home.comcast.net/~vzoom/demos/pc_sample.c

is simply because there is absolutely no allocation of any collector objects
whatsoever. I guess you can call it a 100% accurate garbage collector. IMO,
the "current" implementation is VERY promising indeed... I am hoping to be
able to write an article on it in the very near future...


Stay tuned!

:^)


--
Chris M. Thomasson
http://appcore.home.comcast.net

Dmitriy V'jukov
Mar 11, 2008, 7:38:59 AM


Can you provide a usage pattern for a "reader" and for a "writer" in order
to clarify the API?
What semantics does the pc_mutate() function have? Do I have to call it
"periodically or episodically"?


Dmitriy V'jukov

Dmitriy V'jukov
Mar 11, 2008, 10:16:30 AM
On Mar 11, 12:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> extern void
> pc_node_init(
> pc_node* const,
> pc_node* const,
> void (*) (pc_node*)
> );

Didn't you mean:

extern void
pc_node_init(
pc_node* const,
void* state,
void (*) (pc_node*)
);
?

Does the implementation use DWCAS?

Dmitriy V'jukov

Chris Thomasson
Mar 11, 2008, 7:42:49 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:dc032615-34b7-4ebd...@e6g2000prf.googlegroups.com...

> On Mar 11, 12:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
[...]

>
>
> Can you provide a usage pattern for a "reader" and for a "writer" in order
> to clarify the API?

here is a zero-garbage setup:
__________________________________________________________________
struct foo_node {
foo_node* next;
pc_node pcn;
};


struct foo_list {
foo_node* head;
pc_master pc;
pthread_mutex_t mtx;
};


static foo_list g_list = {
NULL, PC_MASTER_STATICINIT(), PTHREAD_MUTEX_INITIALIZER
};


void foo_node_dtor(pc_node* pcn) {
foo_node* const _this = container_of(pcn, foo_node, pcn);
free(_this);
}


void foo_reader() {
int i;
pc_region* pcr = pc_acquire(&g_list.pc);
for (i = 1 ;; ++i) {
foo_node* node = LOAD_MBDEPEND(&g_list.head);
while (node) {
foo_node* const next = LOAD_MBDEPEND(&node->next);
[...];
node = next;
}
if (! (i % 1000)) {
pc_release(&g_list.pc, pcr);
pcr = pc_acquire(&g_list.pc);
}
}
pc_release(&g_list.pc, pcr);
}


void foo_writer() {
int i;
foo_node* node;
for (i = 1 ;; ++i) {
if (i % 10) {
node = malloc(sizeof(*node));
if (node) {
pc_node_init(node, NULL, foo_node_dtor);
pthread_mutex_lock(&g_list.mtx);
node->next = g_list.head;
STORE_MBREL(&g_list.head, node);
pthread_mutex_unlock(&g_list.mtx);
}
} else {
pthread_mutex_lock(&g_list.mtx);
if ((node = g_list.head)) {
g_list.head = node->next;
}
pthread_mutex_unlock(&g_list.mtx);
if (node) {
pc_mutate(&g_list.pc, &node->pcn);
}
}
}
}
__________________________________________________________________

here is a defer-garbage setup:
__________________________________________________________________
struct foo_node {
foo_node* next;
pc_node pcn;
};


struct foo_list {
foo_node* head;
pc_master pc;
};


static foo_list g_list = {
NULL, PC_MASTER_STATICINIT()
};


void foo_node_dtor(pc_node* pcn) {
foo_node* const _this = container_of(pcn, foo_node, pcn);
free(_this);
}


void foo_reader() {
int i;
foo_node* node;
pc_region* pcr = pc_acquire(&g_list.pc);
for (i = 1 ;; ++i) {
node = LOAD_MBDEPEND(&g_list.head);
while (node) {
foo_node* const next = LOAD_MBDEPEND(&node->next);
[...];
node = next;
}
if (! (i % 1000)) {
pc_release(&g_list.pc, pcr);
pcr = pc_acquire(&g_list.pc);
}
}
pc_release(&g_list.pc, pcr);
}


void foo_writer() {
int i;
foo_node* node;
pc_region* pcr = pc_acquire(&g_list.pc);
for (i = 1 ;; ++i) {
if (i % 10) {
node = malloc(sizeof(*node));
if (node) {
foo_node* cmp;
pc_node_init(node, NULL, foo_node_dtor);
cmp = g_list.head;
do {
node->next = cmp;
} while (! CASIBM_MBREL(&g_list.head, &cmp, node));
}
} else {
node = g_list.head;
do {
if (! node) { break; }
} while (! CASIBM_MBACQ(&g_list.head, &node, node->next));
if (node) {
if (! (i % 20)) {
pc_mutate(&g_list.pc, &node->pcn);
} else {
pc_defer(pcr, &node->pcn);
}
}
}
if (! (i % 500)) {
pc_release(&g_list.pc, pcr);
pcr = pc_acquire(&g_list.pc);
}
}
pc_release(&g_list.pc, pcr);
}
__________________________________________________________________

> What semantics does the pc_mutate() function have? Do I have to call it
> "periodically or episodically"?

You can call it for every deferment to achieve zero-garbage. Otherwise you
can call pc_defer to attach the deferment to a pc_region. The pc_defer
function will trigger a mutation if its garbage level hits a high-watermark.
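[A minimal single-threaded sketch of the high-watermark behavior described above; the names `PC_HIGH_WATERMARK`, the `garbage` counter, and the mock types are illustrative assumptions, not the real implementation:]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mock types -- just enough to show the watermark idea. */
typedef struct mock_node { struct mock_node* defer; } mock_node;
typedef struct mock_region {
    mock_node* defer;   /* deferred-node list attached to this region */
    int garbage;        /* current garbage level */
} mock_region;

#define PC_HIGH_WATERMARK 4  /* assumed threshold */

static int g_mutations;  /* counts simulated pc_mutate() calls */

static void mock_mutate(mock_region* r) {
    /* stand-in for pc_mutate(): flush the deferred list */
    r->defer = NULL;
    r->garbage = 0;
    ++g_mutations;
}

/* pc_defer() as described: attach the node, then trigger a mutation
   once the garbage level hits the high-watermark. */
static void mock_defer(mock_region* r, mock_node* n) {
    n->defer = r->defer;
    r->defer = n;
    if (++r->garbage >= PC_HIGH_WATERMARK) mock_defer_hit: mock_mutate(r);
}
```

[Deferring 8 nodes with a watermark of 4 triggers exactly 2 mutations; calling pc_mutate directly for every deferment degenerates to the zero-garbage case.]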

Chris Thomasson
Mar 11, 2008, 7:52:25 PM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:14ec3e20-0b59-4ab4...@s12g2000prg.googlegroups.com...

> On Mar 11, 12:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> extern void
>> pc_node_init(
>> pc_node* const,
>> pc_node* const,
>> void (*) (pc_node*)
>> );
>
> Didn't you mean:
> extern void
> pc_node_init(
> pc_node* const,
> void* state,
> void (*) (pc_node*)
> );
> ?

No, I decided to use the same method Linux uses for its 'list_entry()' API
(e.g., 'container_of()' macro). The second parameter in there is for
chaining nodes together:
____________________________________________________________________
void foo_flush(anchor* const _this) {
foo_node* const head = SWAP_MBACQ(&_this->head);
if (head) {
foo_node* node = head;
while (node) {
foo_node* const next = node->next;
pc_node_init(&node->pcn, (next) ? &next->pcn : 0, foo_dtor);
node = next;
}
pc_mutate(&_this->pc, &head->pcn);
}
}
____________________________________________________________________


That way I can pass a batch of nodes through a mutation instead of one at a
time...
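[For reference, the 'container_of()' idiom mentioned above can be sketched portably with 'offsetof'; this is a simplified version of the Linux macro, with the embedded-node layout taken from the earlier foo_node examples:]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: recover the enclosing struct from a
   pointer to one of its members. */
#define CONTAINER_OF(ptr, type, member) \
    ((type*)((unsigned char*)(ptr) - offsetof(type, member)))

typedef struct pc_node_s { struct pc_node_s* next; } pc_node;

typedef struct foo_node_s {
    struct foo_node_s* next;
    pc_node pcn;   /* embedded collector node, as in the examples above */
} foo_node;
```

[Given a pc_node* handed to the dtor, CONTAINER_OF(pcn, foo_node, pcn) yields the foo_node that embeds it, so no extra state pointer is needed per node.]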


> Does implementation use DWCAS?

The one I am playing with right now uses DWCAS. However, I am currently
integrating this thing directly with the vZOOM memory allocator, which means
that I can do a wait-free implementation because I can use offsets into
internal memory pools as pointers...

Chris Thomasson
Mar 11, 2008, 8:05:30 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:BpmdnbrRcM2VhEra...@comcast.com...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:dc032615-34b7-4ebd...@e6g2000prf.googlegroups.com...
>> On Mar 11, 12:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
> [...]
>>
>>
>> Can you provide usage pattern for "reader" and for "writer" in order
>> to clarify API?
>
[...]

____________________________________________________________
void reader(pc_master* const _this) {
pc_region* const pcn = pc_acquire(_this);
/* perform read operation */
pc_release(_this, pcn);
}


void writer_zero_garbage(pc_master* const _this) {
node* n = /* remove node from collection */
if (n) {
pc_mutate(_this, &n->pcn);
}
}


void writer_defer_garbageA(pc_master* const _this) {
node* n = /* remove node from collection */
if (n) {
pc_region* const pcn = pc_acquire(_this);
pc_defer(pcn, &n->pcn);
pc_release(_this, pcn);
}
}


void writer_defer_garbageB(pc_region* const _this) {
node* n = /* remove node from collection */
if (n) {
pc_defer(_this, &n->pcn);
}
}
____________________________________________________________

Chris Thomasson
Mar 12, 2008, 4:40:35 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:8oadnVKrA53xhkra...@comcast.com...

>
> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:14ec3e20-0b59-4ab4...@s12g2000prg.googlegroups.com...
>> On Mar 11, 12:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
[...]

> No, I decided to use the same method Linux uses for its 'list_entry()' API
> (e.g., 'container_of()' macro). The second parameter in there is for
> chaining nodes together:
> ____________________________________________________________________
> void foo_flush(anchor* const _this) {
> foo_node* const head = SWAP_MBACQ(&_this->head);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

foo_node* const head = SWAP_MBACQ(&_this->head, NULL);

of course!

> if (head) {
> foo_node* node = head;
> while (node) {
> foo_node* const next = node->next;
> pc_node_init(&node->pcn, (next) ? &next->pcn : 0, foo_dtor);
> node = next;
> }
> pc_mutate(&_this->pc, &head->pcn);
> }
> }
> ____________________________________________________________________

[...]

Chris Thomasson
Mar 17, 2008, 12:43:00 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:zJSdnaU4CcW810va...@comcast.com...

>I am very pleased to be working with a new proxy collector implementation I
>created which boasts lock/wait-free forward-progress. This thing only ever
>requires a __single__ proxy collector object in order to achieve the
>_ideal_ "zero-garbage" guarantee.

[...]

This seems to solve a "common" problem with some forms of PDR. It's nice when
you can cap N off at 1 and still maintain zero garbage. N = 1 is nice
because you don't need to do a collector shift like you do when you're using a
proxy algorithm with N = [2..N]. Dmitriy V'jukov has a nice one he posted here
that is configured for N = 4; I have not tried it with N = 1. I think
this sets my algorithm apart from Dmitriy's distributed proxy collector, wrt
epoch tracking logic. Humm...

What do you think?

Dmitriy V'jukov
Mar 17, 2008, 5:22:10 AM
On Mar 12, 2:42 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

>
> news:dc032615-34b7-4ebd...@e6g2000prf.googlegroups.com...
>
> > On Mar 11, 12:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> [...]
>
> > Can you provide usage pattern for "reader" and for "writer" in order
> > to clarify API?
>
> here is a zero-garbage setup:
[...]

> > What semantics have pc_mutate() function? Am I have to call it
> > "periodically or episodically"?
>
> You can call it for every deferment to achieve zero-garbage. Otherwise you
> can call pc_defer to attach the deferment to a pc_region. The pc_defer
> function will trigger a mutation if its garbage level hits a high-watermark.

Thank you. I understand the API now.

Dmitriy V'jukov

Dmitriy V'jukov
Mar 17, 2008, 5:35:56 AM
On Mar 12, 2:52 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> >> extern void
> >> pc_node_init(
> >> pc_node* const,
> >> pc_node* const,
> >> void (*) (pc_node*)
> >> );
>
> > Didn't you mean:
> > extern void
> > pc_node_init(
> > pc_node* const,
> > void* state,
> > void (*) (pc_node*)
> > );
> > ?
>
> No, I decided to use the same method Linux uses for its 'list_entry()' API
> (e.g., 'container_of()' macro). The second parameter in there is for
> chaining nodes together:


I was thinking about support for per-node user state and dtor, and I have
concluded that a low-level PDR API must not provide support for per-node
user state and dtor. A low-level PDR API must only provide support for a
global dtor.

The user can easily add this support himself if he needs it:

struct my_pdr_node_t
{
pdr_node_t node;
void* state;
void(*dtor)(void*);
};

void global_pdr_dtor(pdr_node_t* n)
{
my_pdr_node_t* my_node = (my_pdr_node_t*)n;
my_node->dtor(my_node->state);
}

Or the user can use the container_of macro, or the offsetof macro, or use
only a per-node dtor, or only per-node state, or nothing if he doesn't need
it at all.
2 (or 1) additional mandatory words per node are not necessary in the
common case, IMHO.
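[The wrapper proposed above can be exercised in a self-contained way; the flag, the dtor, and the first-member cast below are purely illustrative, not part of any real PDR API:]

```c
#include <assert.h>
#include <stddef.h>

typedef struct pdr_node_s { struct pdr_node_s* next; } pdr_node_t;

/* User-side wrapper carrying per-node state and dtor, as proposed. */
typedef struct my_pdr_node_s {
    pdr_node_t node;   /* must be the first member so the cast below is valid */
    void* state;
    void (*dtor)(void*);
} my_pdr_node_t;

/* The single global dtor registered with the collector: it recovers the
   wrapper and dispatches to the user's per-node dtor. */
static void global_pdr_dtor(pdr_node_t* n) {
    my_pdr_node_t* my_node = (my_pdr_node_t*)n;
    my_node->dtor(my_node->state);
}

static int g_freed;  /* set by the per-node dtor below */
static void my_dtor(void* state) { *(int*)state = 1; }
```

[The cast is well-defined because `node` is the first member of the wrapper; users who dislike that constraint can use container_of instead.]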


Dmitriy V'jukov

Dmitriy V'jukov
Mar 17, 2008, 5:45:51 AM
On Mar 17, 7:43 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Chris Thomasson" <cris...@comcast.net> wrote in message

>
> news:zJSdnaU4CcW810va...@comcast.com...
>
> >I am very pleased to be working with a new proxy collector implementation I
> >created which boasts lock/wait-free forward-progress. This thing only ever
> >requires a __single__ proxy collector object in order to achieve the
> >_ideal_ "zero-garbage" guarantee.
>
> [...]
>
> This seems to solve a "common" problem with some forms of PDR. Its nice when
> you can cap N off at 1 and still maintain zero-garbage. N = 1 is nice
> because you don't need to do a collector shift when your using a proxy
> algorithm with N = [2..N]. Dmitriy V'jukov has a nice one he posted here
> that is configured for N = 4; I have not tried it when I set N = 1.

In my algorithm N must be at least 2.

> I think
> this sets my algorithm apart from Dmitriy's distributed proxy collector, wrt
> epoch tracking logic. Humm...
>
> What do you think?

As I understand it, this algorithm recalls your tweaked proxy-collector:
http://groups.google.com/group/comp.programming.threads/msg/b8182c7d65404e56
N=1 too. It uses DWCAS too. It's simple too.

But you somehow solve the problem of "garbage starvation" (under high
reader load), and add "garbage amortization". Right?

Dmitriy V'jukov

Dmitriy V'jukov
Mar 17, 2008, 5:49:00 AM
On Mar 12, 3:05 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> void writer_defer_garbageA(pc_master* const _this) {
> node* n = /* remove node from collection */
> if (n) {
> pc_region* const pcn = pc_acquire(_this);
> pc_defer(pcn, &n->pcn);
> pc_release(_this, pcn);
> }
>
> }
>
> void writer_defer_garbageB(pc_region* const _this) {
> node* n = /* remove node from collection */
> if (n) {
> pc_defer(_this, &n->pcn);
> }}

Are those 2 methods the same internally, or are they different?

Dmitriy V'jukov

Chris Thomasson
Mar 17, 2008, 8:20:51 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:abbafdd4-5702-4076...@d21g2000prf.googlegroups.com...

Very true! I think I am going to change my collector to provide this
behavior.


[...]


> 2 (or 1) additional mandatory words per node is not necessary in
> common case. Imho.

After one applies your rationale, it makes perfect sense.

Chris Thomasson
Mar 17, 2008, 8:24:24 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:638ffc39-52cb-4f49...@u10g2000prn.googlegroups.com...

'writer_defer_garbageA()' must acquire and release a collected region to
stick the node onto it. 'writer_defer_garbageB()' gets passed a region as a
parameter, so it can just attach the node to it without acquiring/releasing.
Therefore, 'writer_defer_garbageB()' is a lot more lightweight. I should
have just omitted the 'A' version altogether.

Chris Thomasson
Mar 17, 2008, 8:18:13 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:f66594aa-251c-4390...@i7g2000prf.googlegroups.com...

> On Mar 17, 7:43 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Chris Thomasson" <cris...@comcast.net> wrote in message
>>
>> news:zJSdnaU4CcW810va...@comcast.com...
>>
>> >I am very pleased to be working with a new proxy collector
>> >implementation I
>> >created which boasts lock/wait-free forward-progress. This thing only
>> >ever
>> >requires a __single__ proxy collector object in order to achieve the
>> >_ideal_ "zero-garbage" guarantee.
>>
>> [...]
>>
>> This seems to solve a "common" problem with some forms of PDR. Its nice
>> when
>> you can cap N off at 1 and still maintain zero-garbage. N = 1 is nice
>> because you don't need to do a collector shift when your using a proxy
>> algorithm with N = [2..N]. Dmitriy V'jukov has a nice one he posted here
>> that is configured for N = 4; I have not tried it when I set N = 1.
>
> In my algorithm N must be at least 2.

Yup. I just noticed that.


>> I think
>> this sets my algorithm apart from Dmitriy's distributed proxy collector,
>> wrt
>> epoch tracking logic. Humm...
>>
>> What do you think?
>
> As I understand this algorithm recalls your tweaked proxy-collector:
> http://groups.google.com/group/comp.programming.threads/msg/b8182c7d65404e56
> N=1 too. It uses DWCAS too. It's simple too.

It's kind of like that one. Well, humm, no, on second thought it's more like
the original pc_sample.c than anything, but with a fundamental trick in the
way a single collector object is handled.


> But you somehow solve problem with "garbage starvation" (under high
> reader load). And add "garbage amortization". Right?

Right. The algorithm behaves as if N were unbounded, just like the original
pc_sample. I am thinking about releasing the code. The trick was staring me
right in the face back when I was creating pc_sample, I just did not see it
at all. Once I found it, I felt like slapping myself because it's so
straightforward and simple.

Dmitriy V'jukov
Mar 18, 2008, 12:03:12 PM
On Mar 18, 3:18 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> > As I understand this algorithm recalls your tweaked proxy-collector:

> >http://groups.google.com/group/comp.programming.threads/msg/b8182c7d6...


> > N=1 too. It uses DWCAS too. It's simple too.
>
> It's kind of like that one. Well, humm, no on second thought its more like
> original pc_sample.c than anything, but with a fundamental trick in the way
> a single collector object is handled.
>
> > But you somehow solve problem with "garbage starvation" (under high
> > reader load). And add "garbage amortization". Right?
>
> Right. The algorithm behaves as if N were unbounded, just like origin
> pc_sample. I am thinking about releasing the code. The trick was staring me
> right in the face back when I was creating pc_sample, I just did not see it
> at all. Once I found it, I felt like slapping myself because it's so
> straight forward and simple.

At this point it looks exactly like pc_sample with DWCAS and an embedded
pc_node. What does the trick give? Does it simplify the implementation? Does
it reduce the number of atomic ops per operation?

Dmitriy V'jukov

Chris Thomasson
Mar 18, 2008, 7:39:56 PM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:26618098-d3ff-44e1...@i29g2000prf.googlegroups.com...

> On Mar 18, 3:18 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> > As I understand this algorithm recalls your tweaked proxy-collector:
>> >http://groups.google.com/group/comp.programming.threads/msg/b8182c7d6...
>> > N=1 too. It uses DWCAS too. It's simple too.
>>
>> It's kind of like that one. Well, humm, no on second thought its more
>> like
>> original pc_sample.c than anything, but with a fundamental trick in the
>> way
>> a single collector object is handled.
>>
>> > But you somehow solve problem with "garbage starvation" (under high
>> > reader load). And add "garbage amortization". Right?
>>
>> Right. The algorithm behaves as if N were unbounded, just like origin
>> pc_sample. I am thinking about releasing the code. The trick was staring
>> me
>> right in the face back when I was creating pc_sample, I just did not see
>> it
>> at all. Once I found it, I felt like slapping myself because it's so
>> straight forward and simple.
>
> At this point it looks exactly like pc_sample with DWCAS and embed
> pc_node.

Yup. IMHO, proxy collector node meta-data per-object makes sense. You could
even use it in a JVM, storing the node in the object header. I managed to get
it down to something like this:
______________________________________________________
typedef struct pc_node_s pc_node;

struct pc_node_s {
atomicword32 refcnt;
pc_node* next;
pc_node* defer;
};
______________________________________________________


Now, if I got rid of the 'pc_defer()' function, and only used 'pc_mutate()'
for every deferment, I could remove the 'pc_node::defer' member. That would
save 8 bytes.


> What does the trick give?

> Does it simplify the implementation?

Well, the implementation is a little bit simpler than pc_sample.c. The
backlink references are merged with the inner references. It basically
increases granularity because user objects can be thought of as actual proxy
collector objects. This has some advantages:

Maintains zero-garbage.

You could acquire/release references to specific 'pc_node' objects and
extend their life-time beyond successive epochs.

An application can use the extra space in the 'pc_node' data-structure for
its own purposes up until deferment. There is a major caveat. Reader threads
should not access it because when it will mutate as it gets cycled through
the collector. You have to stop using the space after you defer.

The implementation can use the space to provide per-object locking.


Drawback:

You have to integrate with a memory allocator to get a 100% wait-free
implementation. It would be a simple matter of storing offsets from the base
of a memory pool so that atomic swap / fetch-and-add can be used for
'pc_mutate()/pc_acquire()'.


> Does it
> reduce the number of atomic ops per operation?

No. I usually try to amortize the use of 'pc_acquire()/pc_release()' with a
very simple usage pattern; something like:
_____________________________________________________________________
void* reader(void* state) {
int i = 1;
pc_master* const pcm = state;
pc_region* pcr = pc_acquire(pcm);
for (;;++i) {
if (! (i % 128)) {
pc_release(pcm, pcr);
pcr = pc_acquire(pcm);
}
}
pc_release(pcm, pcr);
return 0;
}
_____________________________________________________________________


I think I should simply add a function which checks if the previously
acquired 'pc_region' is still in the master proxy object. If it still is,
the function simply returns. Otherwise, it does the reference decrement, and
acquires the most current 'pc_region' in the master. Okay, well I am going
to make some adjustments and probably post the code.
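[The "check if still current" helper suggested above might look like this single-threaded sketch; 'mock_refresh' and the mock types are hypothetical, and a real version would need the DWCAS/XADD paths from the posted code rather than plain loads and stores:]

```c
#include <assert.h>
#include <stddef.h>

/* Mock master/region pair -- atomicity is deliberately ignored here. */
typedef struct mock_region_s { int refcnt; } mock_region;
typedef struct mock_master_s { mock_region* current; } mock_master;

static mock_region* mock_acquire(mock_master* m) {
    ++m->current->refcnt;
    return m->current;
}

static void mock_release(mock_master* m, mock_region* r) {
    (void)m;
    --r->refcnt;
}

/* Hypothetical refresh helper: a no-op when the cached region is still
   current in the master, otherwise swap the reference over. */
static mock_region* mock_refresh(mock_master* m, mock_region* r) {
    if (m->current == r) return r;   /* fast path: nothing to do */
    mock_release(m, r);
    return mock_acquire(m);
}
```

[A reader polling with this helper avoids the decrement/increment pair on every amortization tick when no mutation happened in between.]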

Chris Thomasson
Mar 18, 2008, 9:01:01 PM
Here is a recent version of the revised pc_sample.c which uses inline x86
ASM and compiles under VC++ (I am planning on coding the entire thing in
pure assembly language):
____________________________________________________________________
#if ! defined(PC_SAMPLE_INCLUDE_H)
# define PC_SAMPLE_INCLUDE_H
# pragma warning(push)
# pragma warning (disable : 4100 4505 4706)
# if defined(__cplusplus)
extern "C" {
# endif
/*===========================================================*/


/* Very Simple x86 Atomic Operations API & Implementation
_____________________________________________________________*/
typedef __int32 atomicword;
typedef atomicword volatile* const atomicword_pthis;


static int
x86_DWCASPTR(
void volatile* const,
void* const,
void const* const
);


static atomicword
x86_XADDWORD(
atomicword_pthis,
atomicword const
);


static atomicword
x86_XCHGWORD(
atomicword_pthis,
atomicword const
);


__declspec(naked) int
x86_DWCASPTR(
void volatile* const _pthis,
void* const pcmp,
void const* const pxhcg
) {
_asm {
PUSH ESI
PUSH EBX
MOV ESI, [ESP + 16]
MOV EAX, [ESI]
MOV EDX, [ESI + 4]
MOV ESI, [ESP + 20]
MOV EBX, [ESI]
MOV ECX, [ESI + 4]
MOV ESI, [ESP + 12]
LOCK CMPXCHG8B QWORD PTR [ESI]
JNE x86_DWCASPTR_failed
MOV EAX, 1
POP EBX
POP ESI
RET

x86_DWCASPTR_failed:
MOV ESI, [ESP + 16]
MOV [ESI], EAX
MOV [ESI + 4], EDX
MOV EAX, 0
POP EBX
POP ESI
RET
}
}


__declspec(naked) atomicword
x86_XADDWORD(
atomicword_pthis _pthis,
atomicword const value
) {
_asm {
MOV EDX, [ESP + 4]
MOV EAX, [ESP + 8]
LOCK XADD [EDX], EAX
RET
}
}


__declspec(naked) atomicword
x86_XCHGWORD(
atomicword_pthis _pthis,
atomicword const value
) {
_asm {
MOV EDX, [ESP + 4]
MOV EAX, [ESP + 8]
XCHG [EDX], EAX
RET
}
}

#define x86_XCHGPTR(mp_pdest, mp_src) ( \
(void*)x86_XCHGWORD( \
((atomicword_pthis)(mp_pdest)), \
((atomicword const)(mp_src)) \
) \
)


#define XCHGWORD x86_XCHGWORD
#define XCHGPTR x86_XCHGPTR
#define XADDWORD x86_XADDWORD
#define DWCASPTR x86_DWCASPTR


/* Proxy-Collector API & Implementation (Revisited) ;^)
Inventor: Chris M. Thomasson
_____________________________________________________________*/
#include <stddef.h>
#include <assert.h>
#if ! defined(NDEBUG)
# include <stdio.h>
#endif


#define CONTAINER_OF(mp_this, mp_type, mp_member) ( \
(mp_type*)(((unsigned char*)(mp_this)) - \
offsetof(mp_type, mp_member)) \
)


typedef struct pc_region_s pc_region, pc_node;
typedef struct pc_master_s pc_master;
typedef void (pc_fp_dtor) (pc_node*);
typedef struct pc_sys_anchor_s pc_sys_anchor;


struct pc_sys_anchor_s {
atomicword refcnt;
pc_region* region;
};


struct pc_region_s {
pc_sys_anchor next;
pc_node* defer;
};


struct pc_master_s {
pc_sys_anchor head;
pc_region region;
pc_fp_dtor* fp_dtor;
};
#define PC_MASTER_STATICINIT(mp_this, mp_fp_dtor) { \
{ 0, &(mp_this)->region }, \
{ { 0, NULL }, NULL }, (mp_fp_dtor) \
}


static void
pc_sys_dtor(
pc_master* const,
pc_region* const
);


static void
pc_init(
pc_master* const,
pc_fp_dtor* const
);


static void
pc_node_init(
pc_node* const
);


static void
pc_node_link(
pc_node* const,
pc_node* const
);


static pc_region*
pc_acquire(
pc_master* const
);


static void
pc_release(
pc_master* const,
pc_region* const
);


static void
pc_defer(
pc_region* const,
pc_node* const
);


static void
pc_mutate(
pc_master* const,
pc_node* const
);


void
pc_init(
pc_master* const _this,
pc_fp_dtor* const fp_dtor
) {
pc_master src = { { 0 } };
*_this = src;
_this->head.region = &_this->region;
_this->fp_dtor = fp_dtor;
}


pc_region*
pc_acquire(
pc_master* const _this
) {
pc_sys_anchor cmp = _this->head, xchg;
do {
xchg.refcnt = cmp.refcnt + 2;
xchg.region = cmp.region;
} while (! DWCASPTR(&_this->head, &cmp, &xchg));
return cmp.region;
}


void
pc_release(
pc_master* const _this,
pc_region* const region
) {
if (XADDWORD(&region->next.refcnt, -2) == 3) {
pc_sys_dtor(_this, region);
}
}


void
pc_node_init(
pc_node* const _this
) {
pc_node src = { { 0 } };
*_this = src;
}


void
pc_node_link(
pc_node* const _this,
pc_node* const next
) {
_this->defer = next;
}


void
pc_defer(
pc_region* const _this,
pc_node* const node
) {
node->defer = XCHGPTR(&_this->defer, node);
}


void
pc_mutate(
pc_master* const _this,
pc_node* const node
) {
pc_sys_anchor cmp = _this->head, xchg = { 0 };
node->next.refcnt = 2;
node->next.region = NULL;
xchg.region = node;
while (! DWCASPTR(&_this->head, &cmp, &xchg));
cmp.region->next.region = node;
if (XADDWORD(&cmp.region->next.refcnt,
cmp.refcnt + 1) == -cmp.refcnt) {
pc_sys_dtor(_this, cmp.region);
}
}


void
pc_sys_dtor(
pc_master* const _this,
pc_region* const region
) {
int dtors = 0, reset = 0;
pc_region* head = region;
pc_region* tail = region;
pc_region* next = region->next.region;

while (next) {
if (XADDWORD(&next->next.refcnt, -2) != 3) {
break;
}
tail = next;
next = next->next.region;
}

tail->next.region = NULL;
while (head) {
pc_region* const next = head->next.region;
pc_node* defer = head->defer;
assert(head->next.refcnt == 1);
if (head != &_this->region) {
head->defer = defer;
defer = head;
} else {
reset = 1;
}
while (defer) {
pc_node* const next = defer->defer;
_this->fp_dtor(defer);
++dtors;
defer = next;
}
head = next;
}

if (reset) {
_this->region.defer = NULL;
pc_mutate(_this, &_this->region);
}

#if ! defined(NDEBUG)
{
static atomicword g_pc_sys_epoch = 0;
atomicword const epoch = XADDWORD(&g_pc_sys_epoch, 1);
if (dtors) {
printf("pc_sys_dtor::epoch/dtors(%d/%d)\n",
epoch, dtors);
}
}
#endif
}


/*===========================================================*/
# if defined(__cplusplus)
}
# endif
# pragma warning(pop)
#endif
____________________________________________________________________
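[As an aside, the inline-asm primitives in the listing above map onto C11 atomics on modern compilers. This is a portable sketch of the XADDWORD/XCHGWORD/DWCASPTR semantics only, not the author's code; the 64-bit CAS stands in for CMPXCHG8B:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef int32_t atomicword;

/* LOCK XADD: fetch-and-add, returns the previous value. */
static atomicword c11_xaddword(_Atomic atomicword* p, atomicword v) {
    return atomic_fetch_add(p, v);
}

/* XCHG: atomic swap, returns the previous value. */
static atomicword c11_xchgword(_Atomic atomicword* p, atomicword v) {
    return atomic_exchange(p, v);
}

/* CMPXCHG8B-style double-width CAS on a 64-bit cell:
   returns nonzero on success; on failure *cmp is updated to the value seen. */
static int c11_dwcas(_Atomic uint64_t* p, uint64_t* cmp, uint64_t xchg) {
    return atomic_compare_exchange_strong(p, cmp, xchg);
}
```

[The failure-path behavior of atomic_compare_exchange_strong (writing the observed value back into *cmp) matches what the x86_DWCASPTR asm above does by hand.]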


Well, what do you think? One question, do you think I should just get rid of
the 'pc_node_link() / pc_defer()' functions? I probably should... Humm...
Also, I am thinking about creating some iteration functions for
'pc_region/pc_node' objects so that I can pass multiple nodes to the dtor
function. In that case, the 'pc_fp_dtor' type and the 'pc_sys_dtor()'
function would look something like this:
____________________________________________________________________
typedef int (pc_fp_dtor) (pc_node*);


void
pc_sys_dtor(
pc_master* const _this,
pc_region* const region
) {
pc_region* head = region;
pc_region* tail = region;
pc_region* next = region->next.region;

while (next) {
if (XADDWORD(&next->next.refcnt, -2) != 3) {
break;
}
tail = next;
next = next->next.region;
}

tail->next.region = NULL;
if (_this->fp_dtor(head)) {
_this->region.defer = NULL;
pc_mutate(_this, &_this->region);
}
}
____________________________________________________________________

Humm... This could be useful. Think of a case where an application wanted to
process all the dtors in a dedicated dtor thread. Well, this way, an entire
epoch's worth of 'pc_node' objects can be handed off to that dtor thread.

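[A self-contained sketch of that batch-handoff idea; the chain layout mimics the 'defer' links from the listing, while the function name and the counting are assumptions about what a dedicated dtor thread would actually run:]

```c
#include <assert.h>
#include <stddef.h>

typedef struct batch_node_s { struct batch_node_s* defer; } batch_node;

static int g_dtors;  /* counts destroyed nodes */

/* A dedicated dtor thread could run something like this over a whole
   epoch's worth of nodes, instead of destroying them one at a time. */
static int process_epoch(batch_node* head) {
    int n = 0;
    while (head) {
        batch_node* const next = head->defer;  /* grab link before "free" */
        ++g_dtors;                             /* stand-in for the real dtor */
        head = next;
        ++n;
    }
    return n;
}
```

[With the int-returning pc_fp_dtor sketched above, the collector would pass the head of such a chain and let the application decide when to walk it.]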
ti_chris
Mar 18, 2008, 10:56:50 PM
Cool; I'll give it a more thorough check when I get the chance :).

Dmitriy V'jukov
Mar 20, 2008, 8:37:51 AM
On Mar 19, 4:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> Here is a recent version of the revised pc_sample.c


1. Region 1 is current
2. Thread 1 acquires region 1
3. Thread 2 executes pc_mutate()
4. Region 2 is current
5. Thread 3 acquires region 2
6. Thread 3 loads pointer to node 1
7. Thread 1 removes node 1 from data structure
8. Thread 1 executes pc_defer() and defers node 1 to region 1
9. Thread 1 releases region 1
10. Dtor executed for region 1, node 1 is deleted
11. Thread 3 accesses node 1
12. Bang!


Dmitriy V'jukov

Chris Thomasson
Mar 20, 2008, 6:55:31 PM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:0e9ca31c-d085-48fe...@e10g2000prf.googlegroups.com...

YIKES!!!!!!!!!! Bang is right!


Okay, well, all is not lost. All you have to do is remove the 'pc_defer()'
function.

:^)

Chris Thomasson
Mar 20, 2008, 7:42:12 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:2t6dnfvt_KXLdn_a...@comcast.com...

Humm, I think this could happen to any collector that defers directly into a
region. The race condition should affect this collector as well:

http://groups.google.com/group/comp.programming.threads/msg/3340c3ad7d9cb331

I can think of a couple of solutions:

1. Defer into per-thread list (like your distributed collector does)
2. Defer to global list


The problem with number 2 is that there could be a case where a node does
not get destroyed promptly. But this would be fine if an application does
not care about that; here is a sketch:
___________________________________________________________________


struct pc_master_s {
pc_sys_anchor head;
pc_region region;
pc_region* defer;
pc_fp_dtor* fp_dtor;
};
#define PC_MASTER_STATICINIT(mp_this, mp_fp_dtor) { \
{ 0, &(mp_this)->region }, \
{ { 0, NULL }, NULL }, NULL, (mp_fp_dtor) \
}


void
pc_defer(
pc_master* const _this,
pc_node* const node
) {
pc_node* cmp = _this->defer;
do {
node->defer = cmp;
} while (! CASPTR(&_this->defer, &cmp, node));
}


/* remove this function:

void
pc_node_link(
pc_node* const _this,
pc_node* const next
) {
_this->defer = next;
}

*/


void
pc_mutate(
pc_master* const _this,
pc_node* const node
) {
pc_sys_anchor cmp = _this->head, xchg = { 0 };
node->next.refcnt = 2;
node->next.region = NULL;
node->defer = XCHGPTR(&_this->defer, NULL);
xchg.region = node;
while (! DWCASPTR(&_this->head, &cmp, &xchg));
cmp.region->next.region = node;
if (XADDWORD(&cmp.region->next.refcnt,
cmp.refcnt + 1) == -cmp.refcnt) {
pc_sys_dtor(_this, cmp.region);
}
}

___________________________________________________________________

That would work.

ti_chris

unread,
Mar 21, 2008, 4:19:07 AM3/21/08
to
Yup. Good eye. My collector has the same issue as you pointed out.
It doesn't help that the OS virtually never swaps pages out, so you
never really get to see this bug until something like that happens or
someone is diligent enough to analyze the code =) Thanks Dmitriy.


Cool. You had the same idea as me to fix it. Aka: swap a master
deferred list to the newly created proxy collector. Both my LF and WF
PC are hit by this but easily fixable as you pointed out. Just really
sucks that we have to use a LF CAS vs using a WF xchg to defer the
nodes now and that contention is increased by having a global list.
Of course, a local per-thread list solves both issues.


On Mar 20, 3:55 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

ti_chris

unread,
Mar 21, 2008, 4:29:34 AM3/21/08
to
Correct me if I'm wrong (it's rather late), but I do believe that it's
correct to use an XCHGPTR instead of a CASPTR when deferring. At
first, I also thought this was wrong because you could try to free a
list of elements which is not yet complete. But I think that it does
work due to back-linking, which maintains the past property that
allowed us to use XCHGPTR in the erroneous version. This should work
provided that you defer before freeing the region.


On Mar 20, 4:42 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Chris Thomasson" <cris...@comcast.net> wrote in message
>
> news:2t6dnfvt_KXLdn_a...@comcast.com...
>
>
>
>
>
> > "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message


> >news:0e9ca31c-d085-48fe...@e10g2000prf.googlegroups.com...
> >> On Mar 19, 4:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> >>> Here is a recent version of the revised pc_sample.c
>
> >> 1. Region 1 is current
> >> 2. Thread 1 acquires region 1
> >> 3. Thread 2 executes pc_mutate()
> >> 4. Region 2 is current
> >> 5. Thread 3 acquires region 2
> >> 6. Thread 3 loads pointer to node 1
> >> 7. Thread 1 removes node 1 from data structure
> >> 8. Thread 1 executes pc_defer() and defers node 1 to region 1
> >> 9. Thread 1 releases region 1
> >> 10. Dtor executed for region 1, node 1 is deleted
> >> 11. Thread 3 accesses node 1
> >> 12. Bang!
>
> > YIKES!!!!!!!!!! Bang is right!
>
> > Okay, well all is not lost. Alls you have to do is remove the 'pc_defer()'
> > function.
>
> Humm, I think this could happen to any collector that defers directly into a
> region. The race-condition should effect this collector as well:
>

> http://groups.google.com/group/comp.programming.threads/msg/3340c3ad7...

Dmitriy V'jukov

unread,
Mar 21, 2008, 4:57:07 AM3/21/08
to
On Mar 21, 1:55 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

>
> news:0e9ca31c-d085-48fe...@e10g2000prf.googlegroups.com...
>
>
>
> > On Mar 19, 4:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> >> Here is a recent version of the revised pc_sample.c
>
> > 1. Region 1 is current
> > 2. Thread 1 acquires region 1
> > 3. Thread 2 executes pc_mutate()
> > 4. Region 2 is current
> > 5. Thread 3 acquires region 2
> > 6. Thread 3 loads pointer to node 1
> > 7. Thread 1 removes node 1 from data structure
> > 8. Thread 1 executes pc_defer() and defers node 1 to region 1
> > 9. Thread 1 releases region 1
> > 10. Dtor executed for region 1, node 1 is deleted
> > 11. Thread 3 accesses node 1
> > 12. Bang!
>
> YIKES!!!!!!!!!! Bang is right!

So you said that "its been running under worst case scenarios for the
past
couple of months". It seems that the scenarios were not so worst :)

> Okay, well all is not lost. Alls you have to do is remove the 'pc_defer()'
> function.
>
> :^)

Interesting approach. This definitely will fix the problem :)

Dmitriy V'jukov

Dmitriy V'jukov

unread,
Mar 21, 2008, 5:11:53 AM3/21/08
to
On Mar 21, 2:42 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> Humm, I think this could happen to any collector that defers directly into a
> region. The race-condition should effect this collector as well:
>

> http://groups.google.com/group/comp.programming.threads/msg/3340c3ad7...

Yes. This can even happen with my proxy collector:
http://groups.google.com/group/comp.programming.threads/msg/b00843f0799a320e

But I fix it already:
http://groups.google.com/group/comp.programming.threads/msg/44fa59522bff76cf
;)


> I can think of a couple of solutions:

Basic rule: you must always defer to the current collector.
More precisely: you must defer to a collector acquired *after* the node
was removed from the collection.

> 1. Defer into per-thread list (like your distributed collector does)

My collector doesn't defer to a per-thread list. My collector only
caches nodes in a per-thread list (solely for the purpose of
amortization), and then defers to the current collector:
http://groups.google.com/group/comp.programming.threads/msg/44fa59522bff76cf

> 2. Defer to global list
>
> The problem with number 2 is that there could be a case where a node does
> not get destroyed promptly. But, this would be fine if an application does
> not care about that;
>
> here is sketch:

> That would work.

For the first look, yes.


3. Defer to current collector, i.e. _this->head.region->defer


Dmitriy V'jukov

Dmitriy V'jukov

unread,
Mar 21, 2008, 5:16:35 AM3/21/08
to
On Mar 21, 11:29 am, ti_chris <tiChri...@gmail.com> wrote:
> Correct me if I'm wrong (It's rather late), but I do believe that it's
> correct to use an XCHGPTR instead of a CASPTR when deferring.

If the writer holds a reference to a previously acquired collector,
then yes, you can use XCHGPTR.
But not in Chris's code here:
http://groups.google.com/group/comp.programming.threads/msg/d04929a00587089e
because other thread can concurrently execute pc_mutate().

Dmitriy V'jukov

Dmitriy V'jukov

unread,
Mar 21, 2008, 5:34:01 AM3/21/08
to
On Mar 21, 11:19 am, ti_chris <tiChri...@gmail.com> wrote:
> Yup. Good eye. My collector has the same issue as you pointed out.
> It doesn't help that the OS virtually never swaps pages out, so you
> never really get to see this bug until something like that happens or
> someone is diligent enough to analyze the code =) Thanks Dmitriy.
>
> Cool. You had the same idea as me to fix it. Aka: swap a master
> deferred list to the newly created proxy collector. Both my LF and WF
> PC are hit by this but easily fixable as you pointed out. Just really
> sucks that we have to use a LF CAS vs using a WF xchg to defer the
> nodes now and that contention is increased by having a global list.
> Of course, a local per-thread list solves both issues.

You still can use XCHG. See my algorithm:
http://groups.google.com/group/comp.programming.threads/msg/b00843f0799a320e
with this patch:
http://groups.google.com/group/comp.programming.threads/msg/44fa59522bff76cf

It's completely wait-free, if you apply this patch:
http://groups.google.com/group/comp.programming.threads/msg/67d4d53b33e89f22
Here I am cheating somewhat, because actually I use a kind of "try-
lock" logic.

Dmitriy V'jukov

Chris Thomasson

unread,
Mar 21, 2008, 5:41:52 AM3/21/08
to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:a108cc5a-e18b-40b2...@e23g2000prf.googlegroups.com...

> On Mar 21, 1:55 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>>
>> news:0e9ca31c-d085-48fe...@e10g2000prf.googlegroups.com...

>> > On Mar 19, 4:01 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
>> >> Here is a recent version of the revised pc_sample.c
>>

[...]


>> > 12. Bang!
>>
>> YIKES!!!!!!!!!! Bang is right!
>
> So you sad that "its been running under worst case scenarios for the
> past
> couple of months". It seems that scenarios was not so worst :)

NO SHI@#$!


You would think that boatloads of threads and millions of nodes cycling in
and out would trip something... But guess what, oh and please try not to
laugh too hard Dmitriy... My tests were not making _any_ use of 'pc_defer()'
whatsoever! That could be a reason why a test harness did not catch this.
Wow; I feel so STUPID!! ACK!


OUCH! ;^o


>> Okay, well all is not lost. Alls you have to do is remove the
>> 'pc_defer()'
>> function.
>>
>> :^)
>
> Interesting approach. This definitely will fix the problem :)

Indeed! ;^)


Thanks again; I really do appreciate your assistance.

Dmitriy V'jukov

unread,
Mar 21, 2008, 5:44:34 AM3/21/08
to
On Mar 19, 2:39 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> Now, if I got rid of the 'pc_defer()' function, and only used 'pc_mutate()'
> for every defferment, I could remove the 'pc_node::defer' member. That would
> be 8 bytes.

Then it will actually be equal to the original pc_sample (with DWCAS,
embedded nodes, and merged counter and backlink). Right?


> You could acquire/release references to specific 'pc_node' objects and
> extend their life-time beyond successive epochs.

It's interesting.
Hummm... Stop. Any references to nodes will prevent any garbage
collection because of the back-links. So it will be the same as the
reader just releasing its reference to the collector later. Holding a
reference to the collector is easier and cheaper because it's
coarse-grained. And it's already implemented.


> I think I should simply add a function which checks if the previously
> acquired 'pc_region' is still in the master proxy object. If it still is,
> the function simply returns. Otherwise, it does the reference decrement, and
> acquires the most current 'pc_region' in the master. Okay, well I am going
> to make some adjustments and probably post the code.

Yeah. It's precisely what I did here:
http://groups.google.com/group/comp.programming.threads/msg/b00843f0799a320e
and here:
http://groups.google.com/group/comp.programming.threads/msg/91d01886ae690185

You can place this check in acquire() or in release() or in both.


Another possible way is to make the thing 'asymmetric'. Then it can be
called PC+RCU ;)


Dmitriy V'jukov

Chris Thomasson

unread,
Mar 21, 2008, 5:58:37 AM3/21/08
to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:be36934d-53ca-4697...@s19g2000prg.googlegroups.com...

Well, I guess I could change the API a little bit in order to hack this
functionality together:
____________________________________________________________________


void
pc_defer(
    pc_master* const _this,
    pc_region* const region,
    pc_node* const node
) {
    if (region) {
        node->defer = XCHGPTR(&_this->defer, node);
    } else {
        pc_node* cmp = _this->defer;
        do {
            node->defer = cmp;
        } while (! CASPTR(&_this->defer, &cmp, node));
    }
}


/* usage A */
{
    node* const n = remove_node(...);
    if (n) {
        pc_defer(..., NULL, &n->pcn);
    }
}


/* usage B */
{
    pc_region* const pcr = pc_acquire(...);
    [...];
    {
        node* const n = remove_node(...);
        if (n) {
            pc_defer(..., pcr, &n->pcn);
        }
    }
    [...];
    pc_release(..., pcr);
}
____________________________________________________________________


;^)

Chris Thomasson

unread,
Mar 21, 2008, 6:18:03 AM3/21/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:b8641dab-fbd6-4bd8...@i7g2000prf.googlegroups.com...

> On Mar 21, 2:42 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> Humm, I think this could happen to any collector that defers directly
>> into a
>> region. The race-condition should effect this collector as well:
>>
>> http://groups.google.com/group/comp.programming.threads/msg/3340c3ad7...
>
> Yes. This can even happen with my proxy collector:
> http://groups.google.com/group/comp.programming.threads/msg/b00843f0799a320e
>
> But I fix it already:
> http://groups.google.com/group/comp.programming.threads/msg/44fa59522bff76cf
> ;)

:^D


>> I can think of a couple of solutions:
>
> Basic rule: you must always defer to current collector.
> More precisely: you must defer to collector acquired *after* node was
> removed from collection.

Yup.


>> 1. Defer into per-thread list (like your distributed collector does)
>
> My collector doesn't defer to per-thread list. My collector only cache
> nodes in per-thread list (solely for the purpose of amortization), and
> then defer to current collector:
> http://groups.google.com/group/comp.programming.threads/msg/44fa59522bff76cf


Okay. I was just considering the caching of nodes in the per-thread list as
a part of the deferment process.


>> 2. Defer to global list
>>
>> The problem with number 2 is that there could be a case where a node does
>> not get destroyed promptly. But, this would be fine if an application
>> does
>> not care about that;
>>
>> here is sketch:
>> That would work.
>
> For the first look, yes.
>
>
> 3. Defer to current collector, i.e. _this->head.region->defer

Something like:
________________________________________________________________________


void
pc_defer(
    pc_master* const _this,
    pc_region* const region,
    pc_node* const node
) {
    pc_region* const current = LOADPTR(&_this->head.region);
    if (region) {
        node->defer = XCHGPTR(&current->defer, node);
    } else {
        pc_node* cmp = current->defer;
        do {
            node->defer = cmp;
        } while (! CASPTR(&current->defer, &cmp, node));
    }
}
________________________________________________________________________


Humm, I have to think about this, but it should work fine. I am wondering
about race-condition wrt loading '_this->head.region' and concurrent
mutations.

Dmitriy V'jukov

unread,
Mar 21, 2008, 6:22:08 AM3/21/08
to
On Mar 21, 12:58 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> /* usage B */
> {
> pc_region* const pcr = pc_acquire(...);
> [...];
> {
> node* const n = remove_node(...);
> if (n) {
> pc_defer(..., pcr, &n->pcn);
> }
> }
> [...];
> pc_release(..., pcr);}


Sh#t hit the fan one more time :)

The region must be acquired *after* the node is removed from the collection!

Dmitriy V'jukov

Chris Thomasson

unread,
Mar 21, 2008, 6:38:35 AM3/21/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:b631185d-b6f8-4448...@i12g2000prf.googlegroups.com...

Na, I would not say that you're cheating. One thing to keep in mind... The
wait-free property of atomic operations is highly architecture dependent.
For instance, as we know, Intel has wait-free XCHG/XADD. However, to get the
equivalent function on SPARC, you need a CAS-loop. Also, Joe Seigh kindly
informed me that some moron at SUN decided to deprecate the SWAP
instruction:

http://groups.google.com/group/comp.arch/msg/04cb5e2ca2a7e19a

That pissed me off real good!


http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/common/atomic/sparc/atomic.s
(notice the extensive use of CAS-loops for almost everything (e.g.,
'atomic_add_32/64()', 'atomic_swap_32/64()'...)


End result: Algorithms that are wait-free on Intel will get morphed into
lock-free when ported over to SPARC.


Isn't that total crap?!

Dmitriy V'jukov

unread,
Mar 21, 2008, 6:43:53 AM3/21/08
to
On Mar 21, 12:41 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> >> YIKES!!!!!!!!!! Bang is right!
>
> > So you sad that "its been running under worst case scenarios for the
> > past
> > couple of months". It seems that scenarios was not so worst :)
>
> NO SHI@#$!
>
> You would think that boatloads of threads and millions of nodes cycling in
> and out would trip something... But guess what, oh and please try not to
> laugh too hard Dmitriy... My tests were not making _any_ use of 'pc_defer()'
> whatsoever!

> That could be a reason why a test harness did not catch this.

Very likely.


I sometimes use the following approach:

#ifdef STRESS_DEBUG_MODE
# define PAUSE() pause()
#else
# define PAUSE() (void)0
#endif

void pause()
{
    if (rand() % 100)
    {
        int const x = rand() % 10;
        if (x < 5)
        {
            // just small delay
            int const y = rand() % 10;
            for (int i = 0; i != y; ++i)
                _mm_pause();
        }
        else if (x < 9)
        {
            // context switch
            SwitchToThread();
        }
        else
        {
            // long blocking delay
            int const y = rand() % 10;
            Sleep(y);
        }
    }
}

Then I insert PAUSE() between *ALL* global state modifications. This
dramatically increases the probability of bug detection.


Dmitriy V'jukov

Chris Thomasson

unread,
Mar 21, 2008, 6:53:02 AM3/21/08
to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:64ab3e3b-8681-4cc3...@d4g2000prg.googlegroups.com...

> On Mar 21, 12:58 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> /* usage B */
>> {
>> pc_region* const pcr = pc_acquire(...);
>> [...];
>> {
>> node* const n = remove_node(...);
>> if (n) {
>> pc_defer(..., pcr, &n->pcn);
>> }
>> }
>> [...];
>> pc_release(..., pcr);}
>
>
> Sh#t hit the fan one more time :)

After it hit the fan and splattered on the walls it actually made a pattern
which seems to spell out the word "RETARD".


> Region must be acquired *after* node is removed from collection!

I also made a subtle typo in the sample code sketch:


__________________________________________________________
void
pc_defer(
    pc_master* const _this,
    pc_region* const region,
    pc_node* const node
) {
    if (region) {
        node->defer = XCHGPTR(&_this->defer, node);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This line should be:

        node->defer = XCHGPTR(&region->defer, node);

    } else {
        pc_node* cmp = _this->defer;
        do {
            node->defer = cmp;
        } while (! CASPTR(&_this->defer, &cmp, node));
    }
}


/* usage B */

node* const n = remove_node(...);
if (n) {
    pc_region* const pcr = pc_acquire(...);
    pc_defer(..., pcr, &n->pcn);
    pc_release(..., pcr);
}
__________________________________________________________


I am getting ready to blast the 'pc_defer()' function with both barrels of a
12-gauge shotgun!

:^|

Dmitriy V'jukov

unread,
Mar 21, 2008, 7:00:56 AM3/21/08
to
On Mar 21, 1:18 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> >> 1. Defer into per-thread list (like your distributed collector does)
>
> > My collector doesn't defer to per-thread list. My collector only cache
> > nodes in per-thread list (solely for the purpose of amortization), and
> > then defer to current collector:

> >http://groups.google.com/group/comp.programming.threads/msg/44fa59522...


>
> Okay. I was just considering the caching of nodes in the per-thread list as
> a part of the deferment process.


It can be considered this way. But the most interesting moment is when
the thread-local cache is transferred to the global list.


> > 3. Defer to current collector, i.e. _this->head.region->defer
>
> Something like:
> ________________________________________________________________________
> void
> pc_defer(
> pc_master* const _this,
> pc_region* const region,
> pc_node* const node
> ) {

I think that a #StoreLoad is needed here, if the node removal operation
doesn't contain one (for example, if writers use a mutex for mutual
exclusion with each other, and naked stores for removing the node from
the data structure). Otherwise you can load a pointer to a 'not quite
current' region. And the same race condition will arise again.

> pc_region* const current = LOADPTR(&_this->head.region);
> if (region) {
> node->defer = XCHGPTR(&current->defer, node);
> } else {
> pc_node* cmp = current->defer;
> do {
> node->defer = cmp;
> } while (! CASPTR(&current->defer, &cmp, node));
> }}
>
> ________________________________________________________________________
>
> Humm, I have to think about this, but it should work fine. I am wondering
> about race-condition wrt loading '_this->head.region' and concurrent
> mutations.

If writer holds at least one reference to any region then I think all
must be fine.

Dmitriy V'jukov

Chris Thomasson

unread,
Mar 21, 2008, 7:10:10 AM3/21/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:65565a46-2aae-479d...@d21g2000prf.googlegroups.com...

> On Mar 21, 12:41 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> >> YIKES!!!!!!!!!! Bang is right!
>>
>> > So you sad that "its been running under worst case scenarios for the
>> > past
>> > couple of months". It seems that scenarios was not so worst :)
>>
>> NO SHI@#$!
>>
>> You would think that boatloads of threads and millions of nodes cycling
>> in
>> and out would trip something... But guess what, oh and please try not to
>> laugh too hard Dmitriy... My tests were not making _any_ use of
>> 'pc_defer()'
>> whatsoever!
>
>> That could be a reason why a test harness did not catch this.
>
> Very likely.
>
>
> I sometimes use following approach:
>
> #ifdef STRESS_DEBUG_MODE
> # define PAUSE() pause()
> #else
> # define PAUSE() (void)0
> #endif
>
> void pause()
> {
[...]

> }
>
> Then I insert PAUSE() between *ALL* global state modifications. This
> terrifically increases possibility of bag detection.

Yeah. That's good.

I usually do something along the lines of:

______________________________________________________________
#define TEST_THREAD_DBG_FP_DEPTH() 10

typedef struct test_thread_s test_thread;
typedef void (test_thread_dbg_fp) (test_thread*);


struct test_thread_s {
    int dbg_yield;
    pthread_t tid;

    /* yield function table */
    test_thread_dbg_fp* dbg_fp[TEST_THREAD_DBG_FP_DEPTH()];
};


void
test_thread_yield(
    test_thread* const _this,
    int const count
) {
    if (! (count % _this->dbg_yield)) {
        /* pick a yield function at random */
        test_thread_dbg_fp* const dbg_fp =
            _this->dbg_fp[rand() % TEST_THREAD_DBG_FP_DEPTH()];

        /* call it */
        dbg_fp(_this);
    }
}


void* test_thread_entry(void* state) {
    int i;
    test_thread* const _this = state;
    for (i = 1 ;; ++i) {
        [...]
        test_thread_yield(_this, rand());
        [...]
        test_thread_yield(_this, i);
    }
    return 0;
}


/* now I can have several debug yield functions */
void
test_thread_dbg_sched_yield(
    test_thread* const _this
) {
    sched_yield();
}


void
test_thread_dbg_sleep(
    test_thread* const _this
) {
    Sleep(rand() % 3);
}

[...]


/* And create threads like: */
test_thread*
test_thread_create(
    void* (*fp_entry) (void*)
) {
    test_thread* const _this = malloc(sizeof(*_this));
    if (_this) {

    }
    return _this;
}
______________________________________________________________

I think I will post some of my test-harness I am using for this thing. I
just need to strip out some in-house dependent stuff so everybody can
compile it.

Dmitriy V'jukov

unread,
Mar 21, 2008, 7:20:39 AM3/21/08
to
On Mar 21, 1:38 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> Na, I would not say that your cheating.


I meant that a thread can set the shift_ticket_ variable and then block
for a long time *before* making the actual collector shift.
Formally it's wait-free, but actually it's more like a mutex with
try_lock().


> One thing to keep in mind... The
> wait-free property of atomic operations is highly architecture dependant.
> For instance, as we know, Intel has wait-free XCHG/XADD. However, to get the
> equlivant function on SPARC, you need a CAS-loop. Also, Joe Seigh kindly
> informed me that some moron at SUN decided to depreciate the SWAP
> instruction:

> End result: Algorithms that are wait-free on Intel will get morphed into
> lock-free when ported over to SPARC.


Hmmmm... Maybe they want to represent HTM in a favourable light with
respect to atomic ops:
- Hey, all your HTM algorithms are only lock-free thus silly!
- All your wait-free algorithms based on atomic-ops are now lock-free
too! Ahahaha


> Isn't that total crap?!

At least one can rely on atomic-free spsc/private primitives and
multiplexing logic...


Dmitriy V'jukov

Chris Thomasson

unread,
Mar 21, 2008, 8:40:43 AM3/21/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:686952b9-af75-4534...@i7g2000prf.googlegroups.com...

> On Mar 21, 1:38 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> Na, I would not say that your cheating.
>
>
> I meant that thread can set shift_ticket_ variable and then block for
> a long time *before* making actual collector shift.
> Formally it's wait-free, but actually it's more like mutex with
> try_lock().

Ahhh. Okay, I see that now. Thanks for the clarification.


>> One thing to keep in mind... The
>> wait-free property of atomic operations is highly architecture dependant.
>> For instance, as we know, Intel has wait-free XCHG/XADD. However, to get
>> the
>> equlivant function on SPARC, you need a CAS-loop. Also, Joe Seigh kindly
>> informed me that some moron at SUN decided to depreciate the SWAP
>> instruction:
>> End result: Algorithms that are wait-free on Intel will get morphed into
>> lock-free when ported over to SPARC.
>
>
> Hmmmm... Maybe they want to represent HTM in a favourable light with
> respect to atomic ops:
> - Hey, all your HTM algorithms are only lock-free thus silly!
> - All your wait-free algorithms based on atomic-ops are now lock-free
> too! Ahahaha

ROTFLMAO! :^D


>> Isn't that total crap?!
>
> At least one can rely on atomic-free spsc/private primitives and
> multiplexing logic...

Well, at least that can scale extremely well. ;^)

Chris Thomasson

unread,
Mar 21, 2008, 9:18:04 AM3/21/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:d19ae549-ce14-407d...@c19g2000prf.googlegroups.com...

> On Mar 19, 2:39 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> Now, if I got rid of the 'pc_defer()' function, and only used
>> 'pc_mutate()'
>> for every defferment, I could remove the 'pc_node::defer' member. That
>> would
>> be 8 bytes.
>
> Then it will be actually equal to original pc_sample (with DWCAS,
> embed nodes and merged counter and backlink). Right?

Well, basically yeah; you're correct. Although, it's a little different wrt
the way it resets itself when the "static" collector object (e.g., the
'pc_master::region' member) is detected cycling through the dtor list:
______________________________________________________________________
void
pc_sys_dtor(
    pc_master* const _this,
    pc_region* const region
) {
    int dtors = 0, reset = 0;
    pc_region* head = region;

    [...];

    while (head) {
        [...];

        if (head != &_this->region) {
            [...];
        } else {
            reset = 1;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        }
        [...];
    }

    if (reset) {
        _this->region.defer = NULL;
        pc_mutate(_this, &_this->region);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    }

    [...];
}
______________________________________________________________________


This ensures that there is no user-object just sitting in the
'pc_master::head::region' member... Anyway, I think I am going to
"bite-the-bullet" and go ahead with fixing 'pc_defer()' and post a new
version. I guess that should give it some added "flexibility"...


>> You could acquire/release references to specific 'pc_node' objects and
>> extend their life-time beyond successive epochs.
>
> It's interesting.
> Hummm... Stop. Any references to nodes will prevent any garbage
> collection because of back-links. So it will be the same as reader
> just release reference to collector later. Holding reference to
> collector is easier and cheaper because it's coarse-grained. And it's
> already implemented.

Right. I am tinkering around with a very weird counting logic "thing" that
probably won't work at all, but I am not ready to present it just yet...


>> I think I should simply add a function which checks if the previously
>> acquired 'pc_region' is still in the master proxy object. If it still is,
>> the function simply returns. Otherwise, it does the reference decrement,
>> and
>> acquires the most current 'pc_region' in the master. Okay, well I am
>> going
>> to make some adjustments and probably post the code.
>
> Yeah. It's precisely what I made here:
> http://groups.google.com/group/comp.programming.threads/msg/b00843f0799a320e
> and here:
> http://groups.google.com/group/comp.programming.threads/msg/91d01886ae690185
>
> You can place this check in acquire() or in release() or in both.

Humm. On first thought, I think I would probably go with another function
because I want 'pc_release()' to actually release the reference on every
call. It could possibly be a little "confusing" to a user who wonders why no
garbage collections are occurring when a thread is blocking even after they
released a previously acquired region right before the wait:


pc_release(...);
wait_for_something_perhaps_socket_io(...);


Perhaps I could do something very simple like this:
_________________________________________________________
pc_region*
pc_epoch(
    pc_master* const _this,
    pc_region* const region
) {
    if (region != LOADPTR(&_this->head.region)) {
        pc_release(_this, region);
        return pc_acquire(_this);
    }
    return region;
}


/* and use it like */


void* reader(void* state) {
    int i = 1;
    pc_master* const pcm = state;
    pc_region* pcr = pc_acquire(pcm);
    for (;; ++i) {
        if (! (i % 256)) {
            pcr = pc_epoch(pcm, pcr);
        }
    }
    pc_release(pcm, pcr);
    return 0;
}


/* or something like this to get better response time */
void* reader(void* state) {
    pc_master* const pcm = state;
    pc_region* pcr;
    for (pcr = pc_acquire(pcm) ;;
        pcr = pc_epoch(pcm, pcr)) {
        [...];
    }
    pc_release(pcm, pcr);
    return 0;
}
_________________________________________________________


> Other possible way is to make thing 'asymmetric'. Then it can be
> called PC+RCU ;)

:^D

ti_chris

unread,
Mar 22, 2008, 10:57:27 PM3/22/08
to
Very true. This is not the case in mine, which makes it possible for me
to do this.


On Mar 21, 2:16 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Mar 21, 11:29 am, ti_chris <tiChri...@gmail.com> wrote:
>
> > Correct me if I'm wrong (It's rather late), but I do believe that it's
> > correct to use an XCHGPTR instead of a CASPTR when deferring.
>
> If writer holds reference to previously acquired collector, then yes,
> you can use XCHGPTR.

> But not in Chris's code here:http://groups.google.com/group/comp.programming.threads/msg/d04929a00...

Dmitriy V'jukov

unread,
Mar 24, 2008, 8:57:49 AM3/24/08
to
On Mar 21, 1:53 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> I am getting ready to blast the 'pc_defer()' function with both barrels of a
> 12-gauge shotgun!


In order to optimize writes, one can use the following scheme:

struct pc_node_t
{
    pc_node_t* next;
    unsigned count;
};

struct pc_master_t
{
    pc_node_t head;
};

struct pc_local_cache_t
{
    pc_node_t* head;
    pc_node_t* tail;
    size_t size;
};

void pc_local_cache_init(pc_local_cache_t* cache)
{
    cache->head = 0;
    cache->tail = 0;
    cache->size = 0;
}

void pc_local_cache_defer(pc_local_cache_t* cache, pc_node_t* node)
{
    node->count = 3;
    node->next = cache->head;
    cache->head = node;
    cache->size += 1;
    if (0 == cache->tail)
        cache->tail = node;
}

void pc_local_cache_flush(pc_master_t* master, pc_local_cache_t* cache)
{
    if (0 == cache->size)
        return;
    assert(cache->head && cache->tail);
    cache->tail->count = 2;
    pc_node_t cmp = master->head;
    pc_node_t xchg = {cache->tail, 2};
    while (! DWCASPTR(&master->head, &cmp, &xchg));
    cmp.next->next = cache->head;
    pc_local_cache_init(cache);
    if (XADDWORD(&cmp.next->count, cmp.count + 1) == -cmp.count)
        pc_sys_dtor(master, cmp.next);
}

Also add the following optimization in order to amortize recycling of
node batches:

void pc_sys_dtor(pc_master_t* master, pc_node_t* node)
{
    pc_node_t* head = node;
    pc_node_t* tail = node;
    pc_node_t* next = node->next;

    while (next)
    {
        /* \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ */
        if (next->count != 3
        /* /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ */
            || XADDWORD(&next->count, -2) != 3)
            break;
        tail = next;
        next = next->next;
    }

    //...
}


Writer usage pattern:

void writer()
{
    for (;;)
    {
        pc_local_cache_t cache;
        pc_local_cache_init(&cache);

        pc_region_t* region = pc_acquire(&pc_master);

        for (int i = 0; i != 10; ++i)
        {
            pc_node_t* node = remove_node_from_collection();
            pc_local_cache_defer(&cache, node);
        }

        pc_local_cache_flush(&pc_master, &cache);

        pc_release(region);
    }
}


This way the writer can effectively recycle batches of nodes, and
sizeof(pc_node_t) is still == 8.
What do you think?


Dmitriy V'jukov

Chris Thomasson

unread,
Mar 24, 2008, 3:40:07 PM3/24/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:6d416a49-446a-474c...@e10g2000prf.googlegroups.com...

> On Mar 21, 1:53 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> I am getting ready to blast the 'pc_defer()' function with both barrels
>> of a
>> 12-gauge shotgun!
>
[...]

> Writer usage pattern:
>
> void writer()
> {
[...]

> }
>
>
> This way writer can effectively recycle batches of nodes, and
> sizeof(pc_node_t) still == 8.
> What do you think?

It's better than using 'pc_node_link()' as-is simply because it requires an
extra pointer in 'pc_node':


void writer()
{
    pc_node* head = NULL;
    for (;;)
    {
        for (int i = 0; i != 10; ++i)
        {
            pc_node_t* node = remove_node_from_collection();
            if (node == NULL) { break; }
            pc_node_link(node, head);
            head = node;
        }
        if (head) {
            pc_mutate(&pc_master, &head);
            head = NULL;
        }
    }
}

> void pc_sys_dtor(pc_master_t* master, pc_node_t* node)
> {
>     pc_node_t* head = node;
>     pc_node_t* tail = node;
>     pc_node_t* next = node->next;
>
>     while (next)
>     {
>         /* \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ */
>         if (next->count != 3
>         /* /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ */
>             || XADDWORD(&next->count, -2) != 3)
>             break;

I don't think that is going to work. Humm, think of the following scenario:

pc_node(refcnt:1)-->pc_node(refcnt:5)-->NULL;

If the count of the next node is 5, the break statement is hit and no
decrement will have taken place, which means that it will never reach 1...

>         tail = next;
>         next = next->next;
>     }
>
>     //...
> }

This should work:

void pc_sys_dtor(pc_master_t* master, pc_node_t* node)
{
    pc_node_t* head = node;
    pc_node_t* tail = node;
    pc_node_t* next = node->next;

    while (next)
    {
        if (next->count != 3) {
            if (XADDWORD(&next->count, -2) != 3) {
                break;
            }
        }

        tail = next;
        next = next->next;
    }

    //...
}


The optimization should also work in 'pc_release()':


void
pc_release(
    pc_master* const _this,
    pc_region* const region
) {
    if (region->next.refcnt == 3 ||
        XADDWORD(&region->next.refcnt, -2) == 3) {
        pc_sys_dtor(_this, region);
    }
}

Humm.

Dmitriy V'jukov
Mar 24, 2008, 6:06:39 PM
On Mar 24, 22:40, "Chris Thomasson" <cris...@comcast.net> wrote:

>
> void pc_sys_dtor(pc_master_t* master, pc_node_t* node)
> {
> pc_node_t* head = node;
> pc_node_t* tail = node;
> pc_node_t* next = node->next;
>
> while (next)
> {
>
> /* \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ */
> if (next->count != 3
> /* /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ /\ */
>
> || XADDWORD(&next->count, -2) != 3)
> break;
>
> I don't think that is going to work. Humm, think of following scenario:
>
> pc_node(refcnt:1)-->pc_node(refcnt:5)-->NULL;
>
> If the count of the next node is 5 the break statement is hit and no
> decrement will have taken place which means that it will never reach 1...

Yes. Yes. Of course.


> The optmization should also work in 'pc_release()':

It should work in basically any place, provided 'basic thread safety'.

Dmitriy V'jukov

Chris Thomasson
Mar 29, 2008, 4:36:23 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:p8GdnbUV-8XgKH7a...@comcast.com...
[...]

> Perhaps I could do something very simple like this:
> _________________________________________________________
> pc_region*
> pc_epoch(
>     pc_master* const _this,
>     pc_region* const region
> ) {
>     if (region != LOADPTR(&_this->head.region)) {
>         pc_release(_this, region);
>         return pc_acquire(_this);
>     }
>     return region;
> }
[...]

Actually, I don't need to compare with the current region. All I need to do
is check if the reference count is an odd value:

pc_region*
pc_epoch(
    pc_master* const _this,
    pc_region* const region
) {
    if (region->next.refcnt % 2) {
        pc_release(_this, region);
        return pc_acquire(_this);
    }
    return region;
}


