
Wait Free Proxy Collector


ti_chris

Jan 25, 2008, 5:10:48 AM
Hi,

I've finally managed to build a wait-free proxy collector. It seems
to outperform the lock-free version by about 5% (not too shabby). I
used a technique I thought Chris was using in his implementation, but
as it turns out, it sounds like it wasn't the case, which basically
means that I came up with a new way to do this.

My technique is really simple and optimizes reference counting in
certain situations. There are times when you can pretty much
guarantee that there will never be more than "x" nodes used in a
structure. In such a case, you can use a static array of those nodes
and a simple 32-bit value to represent both the index of the node
and the reference count. As long as you never use more nodes
than you have in the array at any point in time, you are fine. This
also assumes that you free the nodes in FIFO order.

These assumptions work pretty well for a proxy collector. Generally
speaking, unless a thread is suspended for a long period of time
(which isn't the case in my implementation, since each thread is
assigned to one core), I can pretty much guarantee that I will get rid
of a collector node pretty fast. As for the reference counting of the
proxy node, I use the top bits to count. Doing so means that I can
easily wrap around without having to worry about overflowing into the
index.

My implementation uses an array of 256 proxy nodes; however, I have
never seen anywhere near 256 of them in use at a time. This
leaves 24 bits for reference counting, and I can guarantee that the
count won't overflow by forcing the system to create a new proxy once
it gets close to that value (although I don't do that right now). Having
a single word represent both the reference to a node and its
reference count means that I can use an atomic XADD instruction instead
of CAS instructions, and therefore I can write a wait-free proxy
collector.
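
To make the layout concrete, here's a minimal sketch of just the packed word
(C++0x-style std::atomic standing in for my XAdd wrapper; AddRefIndex is only
an illustration, not the real AddRef in the attached code):

#include <atomic>
#include <cstdint>

static const unsigned INDEX_BITS = 8;
static const std::uint32_t INDEX_MASK = (1u << INDEX_BITS) - 1;

// Low 8 bits = index of the current proxy node, high 24 bits = reference count.
static std::atomic<std::uint32_t> Master(0);

// Acquire a reference: one unconditional fetch_add bumps the count in the
// high bits and returns the packed word, so there is no CAS loop at all.
unsigned AddRefIndex(void)
{
    std::uint32_t Old = Master.fetch_add(1u << INDEX_BITS);
    return Old & INDEX_MASK; // index of the proxy node we now reference
}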


I've attached the source code that I have written for this purpose.
However I'm getting really *weird* results which I'd like to
understand a little better.

Stack: (mutex) does 1.5Mops/s
LFStack: lock-free stack using exception try/catch against page fault does 3.2Mops/s
LFPC: lock-free stack using proxy collector does 3.7Mops/s
LFPCWF: lock-free stack using a wait-free proxy collector does 3.9Mops/s


What seriously bugs me is the fact that LFStack does only 3.2Mops/s
(it used to do 3.8). What's even more concerning is that I have
instrumented the code to count how many CAS loops I do, and
I have realized that I do 1.7 CAS loops on average for the LFPC
versions while I do 2.7 CAS loops on average for LFStack. Contention
is thus significantly higher, but it seems to me like LFPC should
still outperform the other methods considering that it's written in
assembler and that it should do significantly less work than the
others (i.e., no memory management). I'd appreciate any insight. BTW,
I figured out that LFPC's free method actually didn't need a CAS; it
could use a Swap instead to make it *more wait-free*. Is a CAS
so expensive that doing it one more time than necessary kills ~20% of
the speed? It sounds a little far-fetched for my taste :/.

PS: Email me if you want the full source code to run on MSVC


/*
** Lock-Free LFProxy Collector
*/
struct LFProxyObject {
long Count; // Reference count
LFProxyObject *Next; // Next reference
void *List; // Deletion list

void Delete(void)
{
void *Next;
for (; List; List = Next) {
Next = *(void **)List; // First word is the 'next'

Free(List);
}
}

void Defer(void *Memory)
{
*(long *)Memory = Swap(&List, (long)Memory);
}
};

struct LFProxyCollector {
LFProxyObject *volatile Collector; // (this)
volatile long Count; // Reference count

static LFProxyCollector Master;

static void __fastcall Init(void)
{
Master.Collector = (LFProxyObject *)Malloc(sizeof(LFProxyObject));
Master.Count = 0;

Master.Collector->Count = -1;
Master.Collector->List = NULL;
Master.Collector->Next = NULL;
}

static LFProxyObject * __fastcall AddRef(void)
{
LFProxyCollector New, Old;

Read64((void *)&Master.Collector, (void *)&Old.Collector);

// Increment and get the pointer reference to the master collector
do {
New.Collector = Old.Collector;
New.Count = Old.Count + 1;
} while (!CAS2((void *)&Master.Collector, (void *)&New.Collector, (void *)&Old.Collector));

return Old.Collector;
}

static void __fastcall RemRef(LFProxyObject *Collect)
{
// If we're out of elements, free/pop the collector's stack
while (Dec(&(Collect->Count))) {
LFProxyObject *const Next = Collect->Next;

Collect->Delete();
Free(Collect);

Collect = Next;
}
}

static void Mutate(void)
{
LFProxyCollector New, Old;
LFProxyObject *NewObj = (LFProxyObject *)Malloc(sizeof(LFProxyObject));

NewObj->Count = 0; // It always starts with one back-link
NewObj->List = NULL;
NewObj->Next = NULL;

Read64((void *)&Master.Collector, (void *)&Old.Collector);

New.Collector = NewObj;
New.Count = 0;
Master.Collector->Next = NewObj;

// Insert the new collector
while (!CAS2((void *)&Master.Collector, (void *)&New.Collector, (void *)&Old.Collector));

// Combine the reads with the releases plus the backlink
if (XAdd((void *)&Old.Collector->Count, Old.Count + 1) == 0) {
LFProxyObject *const Next = Old.Collector->Next;

Old.Collector->Delete();
Free(Old.Collector);

RemRef(Next);
}
}
};

LFProxyCollector LFProxyCollector::Master;

/*
** Wait-Free LFProxy Collector
*/
#define INDEX_BIT_SIZE 8

struct WFProxyObject {
unsigned long Count; // Reference count
void *List; // Deletion list
unsigned long Crap; // Merely for testing

void Delete(void)
{
void *Next;
for (; List; List = Next) {
Next = *(void **)List; // First word is the 'next'

Free(List);
}
}

void Defer(void *Memory)
{
*(long *)Memory = Swap(&List, (long)Memory);
}
};

WFProxyObject WFPOArray[1 << INDEX_BIT_SIZE];

struct WFProxyCollector {
volatile unsigned long Collector;

static unsigned long Master;

static void __fastcall Init(void)
{
Master = 0;

WFPOArray[0].Count = ~((1 << INDEX_BIT_SIZE) - 1);
WFPOArray[0].List = NULL;
}

static WFProxyObject * __fastcall AddRef(void)
{
unsigned long Ref = XAdd(&Master, (1 << INDEX_BIT_SIZE));

return &WFPOArray[Ref & ((1 << INDEX_BIT_SIZE) - 1)];
}

static void __fastcall RemRef(WFProxyObject *Collect)
{
// If we're out of elements, free/pop the collector's stack
while ((XAdd(&(Collect->Count), -(1 << INDEX_BIT_SIZE)) & (~((1 << INDEX_BIT_SIZE) - 1))) == 0) {
const unsigned long Index = Collect->Count & ((1 << INDEX_BIT_SIZE) - 1);
const unsigned long Next = (Index + 1) & ((1 << INDEX_BIT_SIZE) - 1);

WFPOArray[Index].Delete();

Collect = &WFPOArray[Next];
}
}

static void Mutate(void)
{
const unsigned long Index = Master & ((1 << INDEX_BIT_SIZE) - 1);
unsigned long Next = (Index + 1) & ((1 << INDEX_BIT_SIZE) - 1);

// Initialize the next ProxyObject
WFPOArray[Next].Count = Next;
WFPOArray[Next].List = NULL;

// Increase the index
long OldCount = XAdd(&Master, Next - Index) & (~((1 << INDEX_BIT_SIZE) - 1));

// Reset the count for the next run
XAdd(&Master, -OldCount);

// Then adjust the old
if ((XAdd(&WFPOArray[Index].Count, OldCount + (1 << INDEX_BIT_SIZE)) & (~((1 << INDEX_BIT_SIZE) - 1))) == 0) {
WFPOArray[Index].Delete();

RemRef(&WFPOArray[Next]);
}
}
};

unsigned long WFProxyCollector::Master;

/*
** Lock-Free LFProxy Collected Stack
*/
struct Stack {
Stack() : Top(NULL) { }

struct StackNode {
StackNode *Next;
unsigned long Data;
};

void Push(register unsigned long Data)
{
// pthread_mutex_lock(&Mutex);
AllocSpinLock(&SpinLock);

StackNode *Node = (StackNode *)Malloc(sizeof(StackNode)); // new StackNode();

Node->Data = Data;
Node->Next = (StackNode *)Top;
Top = Node;

// pthread_mutex_unlock(&Mutex);
FreeMutex(&SpinLock);
}

unsigned long Pop(void)
{
// pthread_mutex_lock(&Mutex);
AllocSpinLock(&SpinLock);

if (!Top) {
// pthread_mutex_unlock(&Mutex);
FreeMutex(&SpinLock);
return -1;
}

StackNode *Old = Top;
const unsigned long Data = Top->Data;

Top = Top->Next;

// pthread_mutex_unlock(&Mutex);
FreeMutex(&SpinLock);
Free(Old);

return Data;
}


public:
StackNode *Top;
};

/*
** Lock-Free Stack
*/
struct LFStack {
LFStack() : Count(0), Top(NULL) { }

struct LFStackNode {
LFStackNode *Next;
unsigned long Data;
};

struct LFStackPtr {
LFStackNode *Top;
unsigned long Count;
};

void __fastcall Push(register unsigned long Data)
{
LFStackNode *Node = (LFStackNode *)Malloc(sizeof(LFStackNode));

Node->Data = Data;
/*
do {
Node->Next = (LFStackNode *)Top;
Inc(&Retry);

} while (!CAS((long *)&Top, (long)Node, (long)Node->Next));
*/
__asm {
mov edi, [this] ; edi = &this
mov esi, [Node] ; esi = &Node
};
LoopPush:
__asm {
mov eax, [edi] ; eax = Top
mov [esi], eax ; Node->Next = Top;
lock cmpxchg [edi], esi
jne LoopPush
};

}

unsigned long __fastcall Pop(void)
{
do {
__try {

register LFStackNode *OldNode;

__asm {
mov edi, [this] ; edi = &this
xor eax, eax
xor edx, edx
xor ebx, ebx
xor ecx, ecx

lock cmpxchg8b [edi] ; edx:eax = Top/Count (atomic 64-bit read)
}
LoopPop:
__asm {
test eax, eax
je PopOut

mov ebx, [eax] ; ebx = Top->Next
lea ecx, [edx + 1] ; ecx = Top->Count + 1
lock cmpxchg8b [edi] ; if Top unchanged, store ecx:ebx (Count+1, Next)
jne LoopPop

mov OldNode, eax ; OldNode = Top
}

const unsigned long Data = OldNode->Data;

Free(OldNode);

return Data;


PopOut:
return -1;
/*
LFStackPtr Head, Next;

do {
Read64((void *volatile)&Top, &Head.Top);

if (!Head.Top)
return -1;

Next.Count = Head.Count + 1;
Next.Top = Head.Top->Next;
Inc(&Retry);

} while (!CAS2((void *volatile)&Top, (long)Head.Count, (long)Head.Top, (long)Next.Count, (long)Next.Top));

const unsigned long Data = Head.Top->Data;

Free(Head.Top);

return Data;
*/
} __except ( EXCEPTION_EXECUTE_HANDLER ) {
printf("Page fault\n");
// if we got a page-fault, then we want to re-start the loop because the OS paged the memory out on us
}
} while (1);
}


public:
LFStackNode *volatile Top;
volatile unsigned long Count;
};


/*
** Lock-Free LFProxy Collected Stack
*/
struct LFPCStack {
LFPCStack() : Top(NULL) { }

struct LFPCStackNode {
LFPCStackNode *Next;
unsigned long Data;
};

void Push(register unsigned long Data)
{
LFPCStackNode *Node = (LFPCStackNode *)Malloc(sizeof(LFPCStackNode)); // new LFPCStackNode();

Node->Data = Data;

do {
Node->Next = (LFPCStackNode *)Top;
// Inc(&Retry);
} while (!CAS((long *)&Top, (long)Node, (long)Node->Next));
}

unsigned long Pop(void)
{
LFPCStackNode *Head, *Next;
LFProxyObject *Obj = LFProxyCollector::AddRef();

do {
Head = Top;

if (!Head) {
LFProxyCollector::RemRef(Obj);
return -1;
}

Next = Head->Next;
// Inc(&Retry);
} while (!CAS((void *volatile)&Top, (long)Next, (long)Head));

const unsigned long Data = Head->Data;

Obj->Defer(Head);
LFProxyCollector::RemRef(Obj);

return Data;
}


public:
LFPCStackNode *volatile Top;
};

/*
** Lock-Free WFProxy Collected Stack
*/
struct LFPCWFStack {
LFPCWFStack() : Top(NULL) { }

struct LFPCWFStackNode {
LFPCWFStackNode *Next;
unsigned long Data;
};

void Push(register unsigned long Data)
{
LFPCWFStackNode *Node = (LFPCWFStackNode *)Malloc(sizeof(LFPCWFStackNode)); // new LFPCWFStackNode();

Node->Data = Data;

do {
Node->Next = (LFPCWFStackNode *)Top;
// Inc(&Retry);
} while (!CAS((long *)&Top, (long)Node, (long)Node->Next));
}

unsigned long Pop(void)
{
LFPCWFStackNode *Head, *Next;
WFProxyObject *Obj = WFProxyCollector::AddRef();

do {
Head = Top;

if (!Head) {
WFProxyCollector::RemRef(Obj);
return -1;
}

Next = Head->Next;
// Inc(&Retry);
} while (!CAS((void *volatile)&Top, (long)Next, (long)Head));

const unsigned long Data = Head->Data;

Obj->Defer(Head);
WFProxyCollector::RemRef(Obj);

return Data;
}


public:
LFPCWFStackNode *volatile Top;
};

Stack Stack;
bool GO = true;

void *ThreadFunc(void *)
{
do {
Stack.Pop();
} while (GO);

while (Stack.Pop() != -1);

return NULL;
}

void *ThreadFunc2(void *)
{
do {
Stack.Pop();
} while (GO);

while (Stack.Pop() != -1);

return NULL;
}

void *ThreadFunc3(void *)
{
do {
Stack.Pop();
} while (GO);

while (Stack.Pop() != -1);

return NULL;
}

int main(int argc, char* argv[])
{
char c;
long long Freq;
long long Start, End;
double Frequency;

QueryPerformanceFrequency((LARGE_INTEGER *)&Freq);
Frequency = (double)Freq / 1000;

pthread_mutex_init(&Mutex, NULL);

LFProxyCollector::Init();
WFProxyCollector::Init();

printf("\n");

QueryPerformanceCounter((LARGE_INTEGER *)&Start);
pthread_mutex_init(&Mutex, NULL);
pthread_create (&Thread, NULL, ThreadFunc, NULL);
pthread_create (&Thread3, NULL, ThreadFunc2, NULL);
// pthread_create (&Thread2, NULL, ThreadFunc3, NULL);

unsigned long Int = 0;

do {
Stack.Push(rand() % 100);

if ((Int & 0xFF) == 0xFF)
WFProxyCollector::Mutate();

} while (++Int != MAX);

GO = false;

while (Stack.Pop() != -1);

WFProxyCollector::Mutate();

QueryPerformanceCounter((LARGE_INTEGER *)&End);

double Speed = (End - Start) / Frequency;
printf("Time: [%lf ms] Rate: [%fK] Retry:[%f]\n", Speed, Int*2 /
Speed, (float)Retry / (2 * MAX));

#ifdef VERIFY
// Verify sanity
for (char *At = (char *)Array2; (long)At < (long)&Array2[MAX]; At++)
if (*At != 0x2B)
__asm int 3h;
/*
for (char *At = (char *)Array3; (long)At < (long)&Array3[MAX/0x100]; At++)
if (*At != 0x3B)
__asm int 3h;
*/
#endif

getchar();
return 0;
}

Chris Thomasson

Jan 25, 2008, 2:55:46 PM
On Fri, 25 Jan 2008 02:10:48 -0800, ti_chris wrote:

> Hi,
>
> I've finally managed to build a wait-free proxy collector. It seems to
> outperform the lock-free version by about 5% (not too shabby). I used a
> technique I thought Chris was using in his implementation, but as it
> turns out, it sounds like it wasn't the case, which basically means that
> I came up with a new way to do this.

I use an alignment trick instead of indices because it puts no upper bound
on the number of deferred collector objects. This is important for me
because I created it to test against vZOOM PDR, which can handle a shi%load
of deferred objects.


[...]


>
> My implementation uses an array of 256 proxy nodes, however I have never
> seen a case where anywhere near using 256 at a time.

Humm... IMVHO, that's a rather "dangerous" assumption to make. Think of
general purpose usage where somebody will be able to create an application
which will experience periods of load that can, and will, break your
scheme...

[...]

I need to find time to read this! Anyway, I like the way you replaced CAS
with XCHG in the LFProxyObject::Defer function!

:^)

Chris Thomasson

Jan 25, 2008, 4:03:00 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:-OKdnfAkfdFfowfa...@comcast.com...

> On Fri, 25 Jan 2008 02:10:48 -0800, ti_chris wrote:
>
>> Hi,
>>
>> I've finally managed to build a wait-free proxy collector. It seems to
>> outperform the lock-free version by about 5% (not too shabby). I used a
>> technique I thought Chris was using in his implementation, but as it
>> turns out, it sounds like it wasn't the case, which basically means that
>> I came up with a new way to do this.
>
> I use alignment trick instead of indices's because it puts no upper bound
> on the number of deferred collector objects. This is important for me
> because I created it to test against vZOOM PDR which can handle a shi%load
> of deferred objects.
>
>
> [...]
>>
>> My implementation uses an array of 256 proxy nodes, however I have never
>> seen a case where anywhere near using 256 at a time.
>
> Humm... IMVHO, that's a rather "dangerous" assumption to make. Think of
> general purpose usage where somebody will be able to create an application
> which will experience periods of load that can, and will, break your
> scheme...
[...]

I believe that you can get around that by using 32-bit offsets into a block of
collector objects:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/dbb6d6fc3de1c5c3


Dmitriy Vyukov

Jan 25, 2008, 4:23:21 PM
On Jan 26, 00:03, "Chris Thomasson" <cris...@comcast.net> wrote:

> I believe that you can get around that by using 32-bit offsets into block of
> collector objects:


On Windows64 one can steal up to 25(!) bits from a pointer virtually legally:
http://www.alex-ionescu.com/?p=50

I think that's enough for a reference counter. 25 bits can represent 33M references.


With DWCAS on Windows64 one can pack 3 pointers and an 11-bit counter into a double word. Cool!


Dmitriy V'jukov

ti_chris

Jan 26, 2008, 1:39:32 AM
On Jan 25, 11:55 am, Chris Thomasson <cris...@comcast.net> wrote:
> On Fri, 25 Jan 2008 02:10:48 -0800, ti_chris wrote:
> > Hi,
>
> > I've finally managed to build a wait-free proxy collector. It seems to
> > outperform the lock-free version by about 5% (not too shabby). I used a
> > technique I thought Chris was using in his implementation, but as it
> > turns out, it sounds like it wasn't the case, which basically means that
> > I came up with a new way to do this.
>
> I use alignment trick instead of indices's because it puts no upper bound
> on the number of deferred collector objects. This is important for me
> because I created it to test against vZOOM PDR which can handle a shi%load
> of deferred objects.
>
> [...]
>
>
>
> > My implementation uses an array of 256 proxy nodes, however I have never
> > seen a case where anywhere near using 256 at a time.
>
> Humm... IMVHO, that's a rather "dangerous" assumption to make. Think of
> general purpose usage where somebody will be able to create an application
> which will experience periods of load that can, and will, break your
> scheme...
>

I think there is no difference between our algorithms (although I may
be wrong). If I understand what you're doing correctly, you align the
memory such that it lies on a multiple of x (e.g., 128 bytes). This
gives you 7 bits which you are free to use for a counter; in other words,
you have 7 bits of counter and 25 bits of unique addressing. In my
scheme, I could easily change the code such that it only has 7 bits for
the counter and use the rest to uniquely address the proxy objects.
Ultimately, it comes down to the same thing in the end, no?
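If it helps, here's a rough sketch of how I picture the alignment variant;
this is only my reading of it (not Chris's actual code), std::atomic stands
in for the XADD wrapper, and it simply assumes the in-flight count between
mutations stays below 128:

#include <atomic>
#include <cstdint>

// Proxy objects must live at 128-byte-aligned addresses (e.g., statics or an
// aligned allocator), so the low 7 bits of their address are always zero.
struct alignas(128) AlignedProxy {
    std::atomic<long> Refs; // per-proxy count, adjusted when a mutation occurs
    void *DeferList;        // deferred deletions
};

static const std::uintptr_t COUNT_MASK = 128 - 1; // low 7 bits hold the count

static std::atomic<std::uintptr_t> Anchor; // packed: proxy pointer | count

// Acquire: one fetch_add bumps the low-bit count and yields the proxy pointer.
AlignedProxy *AddRef(void)
{
    std::uintptr_t Old = Anchor.fetch_add(1);
    return reinterpret_cast<AlignedProxy *>(Old & ~COUNT_MASK);
}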

Where I do see a difference is that you are able to allocate those
blocks dynamically, whereas I need a static array that is pre-allocated.
In that sense, you have a better solution. I'm assuming
that I could convert my code to something more similar to what
you have and still have a wait-free algo. I'd have to look into it :).

>
> I need to find time to read this! Anyway, I like the way to replaced CAS
> with XCHG in the LFProxyObject::Defer function!
>
> :^)


It occurred to me yesterday that you can easily guarantee a wait-free
push if you can guarantee that no pops will occur :D.

ti_chris

Jan 26, 2008, 1:40:16 AM
On Jan 25, 1:23 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

Very sweet indeed. 64-bit is a whole different beast :)

Dmitriy Vyukov

Jan 27, 2008, 4:49:34 AM
On Jan 25, 13:10, ti_chris <tiChri...@gmail.com> wrote:


Am I right that you assume only one thread can execute Mutate()?
Otherwise I don't think this XAdd will survive more than one thread:


> // Increase the index
> long OldCount = XAdd(&Master, Next - Index) & (~((1 << INDEX_BIT_SIZE) - 1));


Dmitriy V'jukov

Dmitriy Vyukov

Jan 28, 2008, 4:26:48 AM
On Jan 27, 12:49 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:


If you assume that only one thread can execute Mutate(), then you can
combine the first and second XAdd in Mutate() into one XChg.


Dmitriy V'jukov

Dmitriy Vyukov

Jan 28, 2008, 4:51:14 AM
On Jan 25, 1:10 pm, ti_chris <tiChri...@gmail.com> wrote:

> I've finally managed to build a wait-free proxy collector. It seems
> to outperform the lock-free version by about 5% (not too shabby). I
> used a technique I thought Chris was using in his implementation, but
> as it turns out, it sounds like it wasn't the case, which basically
> means that I came up with a new way to do this.


I have read the code and it looks like it works :)
(I assume that only one thread can execute Mutate())

But I think it won't survive overflow of the "inner" 24-bit counter
WFProxyObject::Count. It's better to maintain a separate "not-current"
flag and reset it only when the WFProxyObject becomes not current, and
drain the WFProxyObject only when (Count == 0) and (NotCurrent == false).
This way you can handle overflow of the counter gracefully.

And it's better to check that the next WFProxyObject is already free in
the Mutate() function. Just mark the WFProxyObject somehow in
WFProxyObject::Delete() and check in Mutate():


static void Mutate(void)
{
const unsigned long Index = Master & ((1 << INDEX_BIT_SIZE) - 1);
unsigned long Next = (Index + 1) & ((1 << INDEX_BIT_SIZE) - 1);

if (NotFree(WFPOArray[Next])) return; // <----------------------
...

This way you will not destroy your collector totally.


Dmitriy V'jukov

Dmitriy Vyukov

Jan 28, 2008, 8:41:39 AM
On Jan 25, 1:10 pm, ti_chris <tiChri...@gmail.com> wrote:

> I've finally managed to build a wait-free proxy collector. It seems
> to outperform the lock-free version by about 5% (not too shabby). I
> used a technique I thought Chris was using in his implementation, but
> as it turns out, it sounds like it wasn't the case, which basically
> means that I came up with a new way to do this.


I was thinking about making something like:

struct WFProxyObject {
unsigned long Count; // Reference count

void* DeferredObjects [SOME_CONSTANT]; // Batch of deferred objects

//...
};

I.e. every proxy object contains a batch of deferred user objects. The batch
has a fixed size, so the user doesn't have to call Mutate() manually. When
the batch is full, the collector automatically allocates a new proxy and
"advances the epoch".

What do you think?
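Roughly, the Defer side could look like this (only a sketch: BATCH_SIZE
stands in for SOME_CONSTANT, Mutate() is assumed to be the usual epoch
advance, and overflow of the slot index is ignored):

#include <atomic>
#include <cstddef>

static const std::size_t BATCH_SIZE = 64; // stand-in for SOME_CONSTANT

struct BatchedProxy {
    std::atomic<unsigned long> Count;      // reference count
    std::atomic<std::size_t> Used;         // deferred slots claimed so far
    void *DeferredObjects[BATCH_SIZE];     // batch of deferred objects
};

void Mutate(void); // provided elsewhere by the collector: installs a new proxy

// Claim a slot with one fetch_add; whichever thread fills the last slot
// advances the epoch, so the user never has to call Mutate() manually.
void Defer(BatchedProxy *Proxy, void *Object)
{
    std::size_t Slot = Proxy->Used.fetch_add(1); // assumes Slot < BATCH_SIZE
    Proxy->DeferredObjects[Slot] = Object;
    if (Slot + 1 == BATCH_SIZE)
        Mutate();
}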


Dmitriy V'jukov

ti_chris

Jan 28, 2008, 6:39:32 PM

I'll post all of my replies in this post to simplify things. I may
be wrong, but I don't think that you can use XChg instead of XAdd. It's true
that only one thread can call Mutate, and that's how I get away with
some of the contention; however, another thread could still contend on
the count. While you are mutating, a thread could still be adding
counts to the master counter (Mutate isn't a locked state), and because
of that, I don't think that an exchange will work (unless I'm missing
something).

I could easily add some checks to make sure that the counter doesn't
overflow; however, it's somewhat unlikely to overflow, and here's why: I
reset my counter to 0 every time a Mutate happens. Provided that
Mutate calls are done "often enough", you won't have to worry about
this being an issue. According to my tests, it's also better to
mutate frequently rather than rarely; however, there's also a point after
which doing too many of them costs too much. I've found the practical
sweet spot for my specific case to be around every 256 adds. However,
I suspect that this number would change depending on the application.

This version wouldn't work too well if you use the proxy collector for
longer operations. For example, to walk a long linked list, you would
likely lock the PC for a long time, and you may have an issue with the
Index wrapping around to an already-used number.

I'm currently working on another type of hybrid SMR/proxy (most likely
patent-free) that would not only be wait-free but also possibly atomic-free
(except for the defer operation, which would use an xchg). It
would be better than SMR because there would be no need for that ugly
scan that SMR requires, and better than proxy because it would contain
*at most* 2 collectors (that's really all I need to clean everything
up nicely). Right now I'm struggling to make the Mutate function
correct.

As for your proposal in this particular post, it would work
well (especially with algorithms that can potentially write to memory
that they freed). In my specific case, I was thinking of having the
memory allocator (a lock-free version I'd write) allocate some extra
memory for the purpose of managing it (which would include a
free pointer that I would know I could write to safely). Correct me
if I'm wrong, but it looks like you are formally enforcing that the
deferred list can only contain "x" objects. This may complicate the
logic considerably due to the wait-free-ness of the algorithm. Yes,
you could easily do this (or even re-use the memory and keep a count
of it) to save yourself from having a list of pointers. If you look
at my source code, that's more or less what I do at the user level. I
force a Mutate after every 256 adds. It's not quite the same because
it's not done after every 256 pops, but it's close enough to that without
the formality of it.

Chris Thomasson

Jan 28, 2008, 7:19:13 PM
"ti_chris" <tiCh...@gmail.com> wrote in message
news:dfa878e5-5e82-4748...@d21g2000prf.googlegroups.com...

> On Jan 28, 5:41 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>> On Jan 25, 1:10 pm, ti_chris <tiChri...@gmail.com> wrote:
>>
>> > I've finally managed to build a wait-free proxy collector. It seems
>> > to outperform the lock-free version by about 5% (not too shabby). I
>> > used a technique I thought Chris was using in his implementation, but
>> > as it turns out, it sounds like it wasn't the case, which basically
>> > means that I came up with a new way to do this.
>>
>> I was thinking about making something like:
>>
>> struct WFProxyObject {
>> unsigned long Count; // Reference count
>> void* DeferredObjects [SOME_CONSTANT]; // Batch of deferred
>> objects
>>
>> //...
>>
>> };
>>
>> I.e. every proxy object contains batch of deferred user objects. Batch
>> has fixed size. So user doesn't have to call Mutate() manually. When
>> batch is full collector automatically allocate new proxy and "advance
>> epoch".
>>
>> What do you think?
>>
>> Dmitriy V'jukov
>
[...]


> This version wouldn't work too well if you use the proxy collector for
> longer operations. For example to parse a long linked list, you would
> likely lock the PC for a long time and you may have an issue with the
> Index revolving to an already used number.

Agreed.

> I'm currently working on another type of hybrid SMR/proxy (most likely
> patent-free) that would not only be wait-free but also possibly atomic-
> free (except for the deffer operation that would use an xchg).

What about memory-barrier free?

> It
> would be better than SMR because there would be no need for that ugly
> scan that SMR requires and better than proxy because it would contain
> *at most* 2 collectors (That's really all I need to clean everything
> nicely). Right now I'm struggling to make the Mutate function
> correct.

There was one proxy thing I did but never published which used an N=2
scheme; sounds like you're doing the same thing here. It did not use atomics,
but it did require a single #StoreLoad barrier before a thread could enter a
proxy-collected region. I was trying to come up with atomic-free, but not
membar-free, proxy designs to test against vZOOM PDR to see what kind of
impact the membar has. Turns out, the overhead can be quite significant
indeed.

[...]

Chris Thomasson

Jan 28, 2008, 11:53:08 PM
On Mon, 28 Jan 2008 15:39:32 -0800, ti_chris wrote:

> On Jan 28, 5:41 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

[...]

> I'll post all of my replies onto this post to simplify things. I may be
> wrong, but I don't think that you can use xchg vs xadd.

[...]

You can if you use a simple xchg and xadd (e.g., use xchg to reset the anchor
and obtain the count, and a subsequent xadd to reflect the count). Just like I
do it in pc_sample.c (e.g., pc_defer), or like Joe Seigh's Kung-Fu-Monkey-
through-the-trees logic he talked about. ;^)
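Something along these lines (just a sketch of the xchg-then-xadd pattern, not
pc_sample.c itself; std::atomic stands in for the raw instructions, and a
single mutator is assumed):

#include <atomic>
#include <cstdint>

static const unsigned INDEX_BITS = 8;
static const std::uint32_t INDEX_MASK = (1u << INDEX_BITS) - 1;

static std::atomic<std::uint32_t> Master;                     // count | index
static std::atomic<std::uint32_t> ProxyCount[1u << INDEX_BITS];

void Mutate(unsigned NextIndex) // single mutator assumed
{
    // xchg: install the new index with a zero count and grab the old packed
    // word in one shot, so no separate reset-the-count xadd is needed.
    std::uint32_t Old = Master.exchange(NextIndex);
    std::uint32_t Acquires = Old & ~INDEX_MASK; // acquires since last mutate
    unsigned OldIndex = Old & INDEX_MASK;
    // xadd: transfer those acquires (plus one back-link unit, mirroring the
    // posted code) onto the old proxy so they cancel its releases.
    ProxyCount[OldIndex].fetch_add(Acquires + (1u << INDEX_BITS));
}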

ti_chris

Jan 29, 2008, 2:36:40 AM
On Jan 28, 4:19 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "ti_chris" <tiChri...@gmail.com> wrote in message

I'm hoping it can be. I'm still finishing the details. I'll post
more if I ever get something worth it.

>
> > It
> > would be better than SMR because there would be no need for that ugly
> > scan that SMR requires and better than proxy because it would contain
> > *at most* 2 collectors (That's really all I need to clean everything
> > nicely). Right now I'm struggling to make the Mutate function
> > correct.
>
> There was one proxy thing I did but never published which used an N=2
> scheme, sounds like your doing same thing here. It did not use atomics, but
> it did require a single #StoreLoad barrier before a thread could enter a
> proxy collected region. I was trying to come up with atomic-free, but not
> membar free, proxy designs to test against vZOOM PDR to see what kind of
> impact the membar has. Turns out, the overhead can be quite significant
> indeed.
>
> [...]

Cool. Yah; I seem to be treading a lot of ground that has already
been paved. I seem to re-invent a lot of wheels only to find
out that they've already been done. It's fine, however; you end up
paving some new ground sometimes, and that alone makes it worth it :).

ti_chris

Jan 29, 2008, 3:01:00 AM

But of course; that was dumb. DUUHH. In fact, using XAdd is a
somewhat convoluted process when compared to the exchange function.
My bad.

Dmitriy Vyukov

Jan 29, 2008, 7:27:55 AM
On Jan 29, 2:39 am, ti_chris <tiChri...@gmail.com> wrote:

> I could easily add some checks to make sure that the counter doesn't
> overflow, however it's somewhat unlikely to overflow and here's why: I
> reset my counter to 0 every time a Mutate happens. Provided that
> mutate calls are done "often enough", you won't have to worry about
> this being an issue. According to my tests, it's also more optimal to
> do mutates frequently than less, however there's also a point after
> which doing too many of them costs too much. I've found a practical
> sweet spot for my specific case to be around every 256 adds. However,
> I suspect that this number would change depending on the application.


It's like "buffer big enough..." :)


> This version wouldn't work too well if you use the proxy collector for
> longer operations. For example to parse a long linked list, you would
> likely lock the PC for a long time and you may have an issue with the
> Index revolving to an already used number.


But a reader thread can be preempted for an arbitrary amount of time inside a
read section. This is why I say that it's better to add the check to Mutate().


> I'm currently working on another type of hybrid SMR/proxy (most likely
> patent-free) that would not only be wait-free but also possibly atomic-
> free (except for the deffer operation that would use an xchg). It
> would be better than SMR because there would be no need for that ugly
> scan that SMR requires and better than proxy because it would contain
> *at most* 2 collectors (That's really all I need to clean everything
> nicely). Right now I'm struggling to make the Mutate function
> correct.


And how much garbage does it allow at any given time?


> As for your proposal in this particular post, your proposal would work
> well (especially with algorithms that can potentially write to memory
> that they freed). In my specific case, I was thinking of having the
> memory allocator (a lock-free version I'd write) have some extra
> memory allocated for the purpose of managing it (which would include a
> free pointer that I would know I could write too safely). Correct me
> if I'm wrong, but it looks like you are formerly enforcing that the
> deferred list can only contain "x"objects. This may complicate the
> logic considerably due to the wait-free-ness of the algorithm. Yes,
> you could easily do this (or even re-use the memory and keep a count
> of it) to save yourself from having a list of pointers. If you look
> at my source code, that's more or less what I do at a user-level. I
> force a mutate after every 256 adds. It's not quite the same because
> it's not done at every 256 pops, but it's close enough to that without
> the formality of it.


I was thinking about a fixed buffer mainly so that I wouldn't have to write in
the comments: "YOU HAVE TO EXECUTE MUTATE() FREQUENTLY ENOUGH, OTHERWISE THE
PROGRAM WILL FAIL" :)


Dmitriy V'jukov

Dmitriy Vyukov

Jan 29, 2008, 9:01:42 AM
On Jan 29, 3:19 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> There was one proxy thing I did but never published which used an N=2
> scheme


Yeah, but Paul E. McKenney has published it. And he didn't forget to
file a patent too :)
I'm talking about his real-time RCU with 2 counters per processor.


> , sounds like your doing same thing here. It did not use atomics, but
> it did require a single #StoreLoad barrier before a thread could enter a
> proxy collected region. I was trying to come up with atomic-free, but not
> membar free, proxy designs to test against vZOOM PDR to see what kind of
> impact the membar has. Turns out, the overhead can be quite significant
> indeed.


I bet you:
1. store some per-thread data
2. allow an arbitrary amount of completely unused garbage hanging around
and "waiting for the next epoch"

If yes, then it's already more like RCU than PC.


Dmitriy V'jukov

Dmitriy Vyukov

Jan 29, 2008, 9:03:58 AM
On Jan 29, 3:19 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> There was one proxy thing I did but never published which used an N=2
> scheme, sounds like your doing same thing here. It did not use atomics, but
> it did require a single #StoreLoad barrier before a thread could enter a
> proxy collected region. I was trying to come up with atomic-free, but not
> membar free, proxy designs to test against vZOOM PDR to see what kind of
> impact the membar has. Turns out, the overhead can be quite significant
> indeed.


I think I can create a PC-like PDR with N=2 and w/o #StoreLoad on the fast
path for readers. I need to prototype. And I need TIME!


Dmitriy V'jukov

Chris Thomasson

Jan 29, 2008, 2:59:41 PM
On Tue, 29 Jan 2008 06:03:58 -0800, Dmitriy Vyukov wrote:

> On Jan 29, 3:19 am, "Chris Thomasson" <cris...@comcast.net> wrote:

[...]


> I think I can create PC-like PDR with N=2 and w/o #StoreLoad on fast-
> path for readers. I need to prototype. And I need TIME!

Time is of the essence...

Chris Thomasson

Jan 29, 2008, 3:01:14 PM
On Tue, 29 Jan 2008 04:27:55 -0800, Dmitriy Vyukov wrote:

> On Jan 29, 2:39 am, ti_chris <tiChri...@gmail.com> wrote:
>

[...]


> I was thinking about fixed buffer mainly to not write in comments "YOU
> HAVE TO EXECUTE MUTATE() FREQUENTLY ENOUGH, OTHERWISE PROGRAM WILL FAIL"
> :)

Indeed! :^)

Chris Thomasson

Jan 29, 2008, 3:40:45 PM
On Tue, 29 Jan 2008 06:01:42 -0800, Dmitriy Vyukov wrote:

> On Jan 29, 3:19 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> There was one proxy thing I did but never published which used an N=2
>> scheme
>
>
> Yeah, but Paul E. McKenney was published it. And he didn't forget to
> fill in patent too :)
> I'm talking about his real-time RCU with 2 counters per processor.

Well, it seems like that flavor of RCU still cannot reference-count an
arbitrary number of references outside of RCU-protected regions. Try to do
virtually zero-overhead distributed reference-counted message passing
with RCU as-is...

;^)

>> , sounds like your doing same thing here. It did not use atomics, but
>> it did require a single #StoreLoad barrier before a thread could enter
>> a proxy collected region. I was trying to come up with atomic-free, but
>> not membar free, proxy designs to test against vZOOM PDR to see what
>> kind of impact the membar has. Turns out, the overhead can be quite
>> significant indeed.
>
>
> I bet you:
> 1. store some per-thread data
> 2. allow arbitrary amount of completely unused garbage hanging around
> and "waiting for next epoch"
>
> If yes, then it's already more like RCU then PC.

Yeah; you're right. Humm.

Chris Thomasson

Jan 29, 2008, 4:53:05 PM
On Mon, 28 Jan 2008 23:36:40 -0800, ti_chris wrote:

> On Jan 28, 4:19 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "ti_chris" <tiChri...@gmail.com> wrote in message

[...]

>> There was one proxy thing I did but never published which used an N=2
>> scheme, sounds like your doing same thing here.
>>

>> [...]
>
> Cool. Yah; I seem to be stepping into a lot of steps that have already
> been paved. I seem to re-invent a lot of wheels only to find out that
> they've already been done.

At least you know you're on the right path, so to speak...

:^)


> It's fine however, you end up paving some
> new ground sometimes and that alone makes it worth it :).

Indeed!

Chris Thomasson

Jan 29, 2008, 4:56:22 PM
On Fri, 25 Jan 2008 22:39:32 -0800, ti_chris wrote:

> On Jan 25, 11:55 am, Chris Thomasson <cris...@comcast.net> wrote:
>> On Fri, 25 Jan 2008 02:10:48 -0800, ti_chris wrote:
>> > Hi,
>>
>> > I've finally managed to build a wait-free proxy collector. It seems
>> > to outperform the lock-free version by about 5% (not too shabby). I
>> > used a technique I thought Chris was using in his implementation, but
>> > as it turns out, it sounds like it wasn't the case, which basically
>> > means that I came up with a new way to do this.
>>
>> I use alignment trick instead of indices's because it puts no upper
>> bound on the number of deferred collector objects. This is important
>> for me because I created it to test against vZOOM PDR which can handle
>> a shi%load of deferred objects.
[...]

> Where I do see a difference is that you are able to allocate those


> blocks dynamically where as I need a static array that is pre-
> allocated. In that sense, you have a better solution. I'm assuming
> that I could convert my code to have something more similar to what you
> have and still have a wait-free algo. I'd have to check it up :).

You can. The pc_dec function (from pc_sample.c) uses CAS when
it does not really have to. It should just decrement the pc_deferred_t
reference count, and let the differential count do its thing when a
mutation occurs (e.g., pc_defer).


>
>
>> I need to find time to read this! Anyway, I like the way to replaced
>> CAS with XCHG in the LFProxyObject::Defer function!
>>
>> :^)
>
>
> It occurred to me yesterday that you can easily guarantee a wait-free
> push if you can guarantee that no pops will occur :D.

You're right.
