Solving the memory issue of lock-free algorithms


ti_chris

Jan 14, 2008, 1:53:01 AM1/14/08
to cri...@comcast.net
Hi Folks,

I read a lot about lock-free algorithms and tried to find an "elegant"
solution to the memory problem that these algorithms pose. The pop is
obviously the problem. The algorithm I'm using is the following:

SC0: loop
SC1: head = lf->top
SC2: oc = lf->ocount
SC3: if head == NULL
SC4: return NULL
SC5: next = head->next
SC6: if CAS2(&lf->top, head, oc, next, oc + 1)
SC7: break
SC8: endloop
SC9: return head
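To make the discussion concrete, here is a sketch of this loop in portable C++11 (my rendering, not code from the post): CAS2 is modeled as one 64-bit compare-exchange on a packed {top, ocount} pair, and 32-bit indices into a fixed node pool stand in for pointers so the pair fits in a single word. As in the pseudocode, only pop bumps the counter.

```cpp
#include <atomic>
#include <cstdint>

// Fixed node pool; index 0 is the NULL sentinel, so a "pointer" fits in 32 bits.
struct Node { uint32_t next; };
static Node pool[64];

// {ocount, top} packed into one word: SC1/SC2 become a single atomic load,
// and CAS2 becomes an ordinary 64-bit compare-exchange.
static std::atomic<uint64_t> anchor{0};

static uint64_t pack(uint32_t top, uint32_t oc) {
    return (uint64_t)oc << 32 | top;
}

void push(uint32_t n) {
    uint64_t old = anchor.load();
    do {
        pool[n].next = (uint32_t)old;        // node->next = top
    } while (!anchor.compare_exchange_weak(  // counter unchanged on push
        old, pack(n, (uint32_t)(old >> 32))));
}

uint32_t pop(void) {                          // SC0..SC9
    uint64_t old = anchor.load();             // SC1 + SC2, fused
    for (;;) {
        uint32_t head = (uint32_t)old;        // SC1: head
        uint32_t oc = (uint32_t)(old >> 32);  // SC2: ocount
        if (head == 0) return 0;              // SC3/SC4: empty
        uint32_t next = pool[head].next;      // SC5
        if (anchor.compare_exchange_weak(old, pack(next, oc + 1)))  // SC6
            return head;                      // SC7/SC9
    }
}
```

With a real heap, the SC5 load is exactly the racy read the rest of this thread argues about; the pool version sidesteps reclamation only because its nodes are never unmapped.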


I wanted a solution where I could allocate/free the nodes as part of
the function, so I started by putting alloc/free in my push/pop and
quickly ran into multiple memory crashes (as many papers warn).
However, I was never really a fan of the solutions I have seen (hazard
pointers and so on); they sounded like they add a lot more slowdown
than they are ultimately worth.

I know a lot of papers describe this as the lock-free memory
reclamation issue, but from what I've read, they all point out that
SC5 can generate a memory fault if a pop or a pop/push happens to
squeeze in between SC2 and SC5. However, while we may very well read
garbage at SC5, at least on the Intel/ARM architectures we should
still be able to read that address without generating a fault, since
head was a known-good location to read from at some point in time. As
long as the CAS fails in these cases, the correctness of the
algorithm should still hold; and it should fail, since the counter
will not match. Knowing this, I went about debugging my code to make
it work such that it can hold a stale head and still access the
memory region of "head->next", provided that it then fails the CAS2.

I had a simple test with 2 push threads and 2 pop threads on a quad-
core. As it turns out, the real reason it was crashing was that
preemption occurred between SC1 and SC2, and the program was fooled
into thinking that it had a match when in reality it didn't. This
false positive allowed the system to make the swap, which inserted
garbage into the list and crashed the program the next time the list
was traversed. As it turns out, the solution was to combine SC1 and
SC2 into one single atomic operation.

As a matter of fact, I don't see how you can fully guarantee that the
algorithm works without doing this. There are two ways to do this on
an Intel platform. The first is to use the movq instruction, which
guarantees atomic reads of 64-bit elements on 64-bit aligned
boundaries; the craftier solution is to use CMPXCHG8B to perform the
read.

As it turns out, once I changed my code to do 64-bit reads, I was able
to run my stress-test app, which pushes/pops on 4 threads, for one
entire night without any crashes. If I couldn't read from freed
memory, this wouldn't work (obviously), but since I can on a
multicore system, this seems like a very nice solution to the problem
(assuming there are no other holes in it), and that's pretty much why
I'm posting here. Since you guys are real experts on this (compared
to me at the very least), I'd like to know your thoughts on it.


I've attached my source code at the bottom (if you do use it, I merely
ask to be credited for it). I also have a C equivalent using a 64-bit
read instruction, but the assembler version is just cleaner, IMO,
since I can reuse the cmpxchg8b for the read.

#include <cstdlib> /* malloc/free */

struct LFStack {
    LFStack() : Top(NULL), Count(0) { }

    struct LFStackNode {
        LFStackNode *Next;
        unsigned long Data;
    };

    void Push(register unsigned long Data)
    {
        LFStackNode *Node = (LFStackNode *)malloc(sizeof(LFStackNode));

        Node->Data = Data;

        __asm {
            mov edi, [this]             ; edi = this
            mov esi, [Node]             ; esi = Node
        };
    LoopPush:
        __asm {
            mov eax, [edi]              ; eax = Top
            mov [esi], eax              ; Node->Next = Top
            lock cmpxchg [edi], esi     ; if (Top == eax) Top = Node
            jne LoopPush
        };
    }

    unsigned long Pop(void)
    {
        register LFStackNode *OldNode;

        __asm {
            mov edi, [this]             ; edi = this
            xor eax, eax
            xor edx, edx
            xor ebx, ebx
            xor ecx, ecx

            lock cmpxchg8b [edi]        ; atomic 64-bit read: edx:eax = Count:Top
        }
    LoopPop:
        __asm {
            test eax, eax
            je PopOut

            mov ebx, [eax]              ; ebx = Top->Next
            lea ecx, [edx + 1]          ; ecx = Count + 1
            lock cmpxchg8b [edi]        ; if unchanged, Count:Top = ecx:ebx
            jne LoopPop

            mov OldNode, eax            ; OldNode = old Top
        }

        const unsigned long Data = OldNode->Data;

        free(OldNode);

        return Data;

    PopOut:
        return (unsigned long)-1;
    }

public:
    LFStackNode *volatile Top;      /* offset 0 */
    volatile unsigned long Count;   /* offset 4 */
};

Regards,
Chris

Chris Thomasson

Jan 14, 2008, 3:32:36 AM1/14/08
to
"ti_chris" <tiCh...@gmail.com> wrote in message
news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...

> Hi Folks,
>
> I read a lot about lock-free algorithm and tried to find an "elegant"
> solution to the memory problem that the algorithms poses. The pop is
> obviously the problem. The algorithm I'm using is the following:
>
> SC0: loop
> SC1: head = lf->top
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There is a race-condition right here.

> SC2: oc = lf->ocount


You need to read the version count _before_ you read the head of the list.
Here is how the race can occur... Think of the current state as head =
pointerA and oc = 1. ThreadA loads pointerA, then gets preempted by threadB,
which pops pointerA, incrementing oc from 1 to 2. ThreadA resumes, loads
version 2, and gets to line SC6 having read NULL as the next pointer. It
gets preempted by threadB, which pushes pointerB and then pointerA onto the
list, mutating the state to head = pointerA and oc = 2, with pointerA's next
node being pointerB. ThreadA resumes and performs a successful DWCAS
operation when it should have failed, swapping the head of the list with
NULL, and all the nodes get lost!

There are three ways to solve this:

1 - Load the version count _before_ you load the head; use a #LoadLoad
barrier.

2 - Increment the version count on pops _and_ pushes.

3 - Use 64-bit atomic load
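Solution 1 can be sketched with C++11 atomics (my rendering, not Chris's actual code): the acquire on the ocount load serves as the #LoadLoad barrier, so the load of top cannot be reordered before it.

```cpp
#include <atomic>

struct Node { Node *next; };

struct Stack {
    std::atomic<unsigned long> ocount;  // ABA/version counter
    std::atomic<Node *> top;
};

// Read the version count _first_; memory_order_acquire forbids the later
// load of top from being hoisted above it (the #LoadLoad barrier).
void read_anchor(Stack *lf, unsigned long *oc, Node **head) {
    *oc = lf->ocount.load(std::memory_order_acquire);
    *head = lf->top.load(std::memory_order_acquire);
}
```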


> SC3: if head == NULL
> SC4: return NULL
> SC5: next = head->next
> SC6: if CAS2(&lf->top, head, oc, next, oc + 1)
> SC7: break
> SC8: endloop
> SC9: return head

[...]

> struct LFStack {

Humm, I do not think that you have addressed the memory reclamation issue.
Try this in your debugger:


Scenario: There are two suspended threads (A and B), and a stack with a
single nodeA.

> unsigned long Pop(void)
> {
> register LFStackNode *OldNode;
>
> __asm {
> mov edi, [this] ; edi = &this
> xor eax, eax
> xor edx, edx
> xor ebx, ebx
> xor ecx, ecx
>
> lock cmpxchg8b [edi] ; edx:eax = this
> }
> LoopPop:
> __asm {
> test eax, eax
> je PopOut

2. Resume threadA and step it to the instruction above and suspend it.
- eax is pointer to NodeA

>
X: > mov ebx, [eax] ; ebx = Top->Next


> lea ecx, [edx + 1] ; ecx = Top->Count + 1
> lock cmpxchg8b [edi] ; ebx:ecx = Next
> jne LoopPop
>
> mov OldNode, eax ; OldNode = Top
> }
>
> const unsigned long Data = OldNode->Data;
>
> free(OldNode);


>
> return Data;


3. Resume threadB and step it to the return statement above, then suspend it.
- this frees NodeA

4. Resume threadA: it executes line X, which reads from nodeA, which was
just freed. End result: seg-fault.
- this is the error

>
> PopOut:
> return -1;
> }
>
> };

Am I understanding your algorithm correctly?


Chris Thomasson

Jan 14, 2008, 3:37:11 AM1/14/08
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:ptednZnvbp2egRba...@comcast.com...

> "ti_chris" <tiCh...@gmail.com> wrote in message
> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...
>> Hi Folks,
>>
>> I read a lot about lock-free algorithm and tried to find an "elegant"
>> solution to the memory problem that the algorithms poses. The pop is
>> obviously the problem. The algorithm I'm using is the following:
>>
>> SC0: loop
>> SC1: head = lf->top
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> There is a race-condition right here.
>
>> SC2: oc = lf->ocount
>
>
> You need to read the version count _before_ you read the head of the list.
> Here is how the race can occur...

[...]

That's an easy error to make. I need to check the AppCore code and see if
it's in there.

Chris Thomasson

Jan 14, 2008, 3:56:33 AM1/14/08
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:ptednZnvbp2egRba...@comcast.com...
> "ti_chris" <tiCh...@gmail.com> wrote in message
> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...
>> Hi Folks,
>>
>> I read a lot about lock-free algorithm and tried to find an "elegant"
>> solution to the memory problem that the algorithms poses. The pop is
>> obviously the problem. The algorithm I'm using is the following:
>>
>> SC0: loop
>> SC1: head = lf->top
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> There is a race-condition right here.
>
>> SC2: oc = lf->ocount
>
>
> You need to read the version count _before_ you read the head of the list.
> Here is how the race can occur... Think of current state being head =
> pointerA and oc = 1 and threadA loads pointerA; get premepted by threadB
> which pops pointerA, thus increment the oc from 1 to 2; ThreadA resumes
> and loads version 2 and gets to line SC6 thus reading NULL as a next
> pointer; gets preempted by threadB which pushes pointerB, and then
> pointerA onto the list mutating the state to head = pointerA and oc = 2
> with pointerA with the next node pointerB; ThreadA resumes and performs a
> successful DWCAS operation when it should have failed thus swapping the
> head of the list with NULL and all the nodes get lost!
[...]


Let's see if I can make this easier to read:
____________________________________________________________________
Current state [NodeA, 0] : (NodeA-->NodeB-->NULL)

1. ThreadA reads NodeA
2. ThreadB pops NodeA
- Current state [NodeB, 1] : (NodeB-->NULL)


3. ThreadA reads version 1
4. ThreadA reads NodeA-->NodeB
5. ThreadB pushes NodeC
6. ThreadB pushes NodeA
- Current state [NodeA, 1] : (NodeA-->NodeC-->NodeB-->NULL)


7. ThreadA performs a successful DWCAS using [NodeA, 1] as the comparand
- Current state [NodeB, 2] : (NodeB-->NULL)


8. NodeC is now lost in space!
____________________________________________________________________


That is BAD! Now for the solutions and exactly how they work...

Here is how loading the version count first solves the problem:
____________________________________________________________________
Current state [NodeA, 0] : (NodeA-->NodeB-->NULL)

1. ThreadA reads version 0
2. ThreadB pops NodeA
- Current state [NodeB, 1] : (NodeB-->NULL)


3. ThreadA reads NodeB
4. ThreadA reads NodeB-->NULL
5. ThreadB pushes NodeC
6. ThreadB pushes NodeA
- Current state [NodeA, 1] : (NodeA-->NodeC-->NodeB-->NULL)


7. ThreadA fails its DWCAS using [NodeB, 0] as the comparand
- Current state [NodeA, 1] : (NodeA-->NodeC-->NodeB-->NULL)


8. The list is intact!
____________________________________________________________________

Here is how bumping the version count on push and pop operations solves the
problem:
____________________________________________________________________
Current state [NodeA, 0] : (NodeA-->NodeB-->NULL)

1. ThreadA reads NodeA
2. ThreadB pops NodeA
- Current state [NodeB, 1] : (NodeB-->NULL)


3. ThreadA reads version 1
4. ThreadA reads NodeA-->NodeB
5. ThreadB pushes NodeC
6. ThreadB pushes NodeA
- Current state [NodeA, 3] : (NodeA-->NodeC-->NodeB-->NULL)


7. ThreadA fails its DWCAS using [NodeA, 1] as the comparand
- Current state [NodeA, 3] : (NodeA-->NodeC-->NodeB-->NULL)


8. The list is intact!
____________________________________________________________________

Here is how an atomic 64-bit read solves the problem:
____________________________________________________________________
Current state [NodeA, 0] : (NodeA-->NodeB-->NULL)

1. ThreadA reads NodeA _and_ version 0
2. ThreadB pops NodeA
- Current state [NodeB, 1] : (NodeB-->NULL)


3. ThreadA reads NodeA-->NodeB
4. ThreadB pushes NodeC
5. ThreadB pushes NodeA
- Current state [NodeA, 1] : (NodeA-->NodeC-->NodeB-->NULL)


6. ThreadA fails its DWCAS using [NodeA, 0] as the comparand
- Current state [NodeA, 1] : (NodeA-->NodeC-->NodeB-->NULL)


7. The list is intact!
____________________________________________________________________


Is this making sense to everybody?

Chris Thomasson

Jan 14, 2008, 4:07:59 AM1/14/08
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:hI6dnaiMyKeLgBba...@comcast.com...


Hurray! I read the version count before the head pointer! Here is the code:
____________________________________________________


/* MUST be two adjacent words! */
typedef struct
AC_DECLSPEC_PACKED
ac_i686_stack_mpmc_
{
ac_i686_node_t *front; /* offset 0 */
ac_i686_uintword_t aba; /* offset 4 */

} ac_i686_stack_mpmc_t;

# AC_SYS_APIEXPORT ac_i686_node_t* AC_CDECL
# np_ac_i686_stack_mpmc_pop_dwcas
# ( ac_i686_stack_mpmc_t* const _this );
np_ac_i686_stack_mpmc_pop_dwcas:
pushl %esi
pushl %ebx
movl 12(%esp), %esi # esi = _this/offset 12
movl 4(%esi), %edx # read esi->version/offset 4
movl (%esi), %eax # read esi->head/offset 0

np_ac_i686_stack_mpmc_pop_dwcas_retry:
test %eax, %eax
je np_ac_i686_stack_mpmc_pop_dwcas_fail
movl (%eax), %ebx
leal 1(%edx), %ecx
lock cmpxchg8b (%esi)
jne np_ac_i686_stack_mpmc_pop_dwcas_retry

np_ac_i686_stack_mpmc_pop_dwcas_fail:
popl %ebx
popl %esi
ret
____________________________________________________

That was a close call... I have checked my vZOOM code-base, and yes, it
reads the version before it reads the head pointer. I checked the code for
the SPARC and it has (in pseudo-code):
____________________________________________________
node* pop(anchor* const _this) {
anchor cmp, xchg;
cmp.version = ATOMIC_LOAD(&_this->ver);
membar #LoadLoad;
cmp.hd = ATOMIC_LOAD(&_this->hd);
[...]
}
____________________________________________________


Thank GOD! Wow, that's an easy mistake to make!


SHi% Fuc%ing lock-free programming SUCKS!!!!!!!!!!!!!


:^)

ti_chris

Jan 14, 2008, 4:18:38 AM1/14/08
to
On Jan 14, 12:32 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "ti_chris" <tiChri...@gmail.com> wrote in message

>
> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...> Hi Folks,
>
> > I read a lot about lock-free algorithm and tried to find an "elegant"
> > solution to the memory problem that the algorithms poses. The pop is
> > obviously the problem. The algorithm I'm using is the following:
>
> > SC0: loop
> > SC1: head = lf->top
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> There is a race-condition right here.
>
> > SC2: oc = lf->ocount
>
> You need to read the version count _before_ you read the head of the list.
> Here is how the race can occur... Think of current state being head =
> pointerA and oc = 1 and threadA loads pointerA; get premepted by threadB
> which pops pointerA, thus increment the oc from 1 to 2; ThreadA resumes and
> loads version 2 and gets to line SC6 thus reading NULL as a next pointer;
> gets preempted by threadB which pushes pointerB, and then pointerA onto the
> list mutating the state to head = pointerA and oc = 2 with pointerA with the
> next node pointerB; ThreadA resumes and performs a successful DWCAS
> operation when it should have failed thus swapping the head of the list with
> NULL and all the nodes get lost!
>
> There are three ways to solve this:
>
> 1 - Load the version count _before_ you load the head; use a #LoadLoad
> barrier.
>
> 2 - Increment the version count on pops _and_ pushes.
>
> 3 - Use 64-bit atomic load

Yes; you are completely right: reversing the order would work, as would
fixing up the counter on push/pop. Thanks for pointing it out. I
personally prefer the 64-bit version due to its elegance with the
cmpxchg8b instruction.

Yes, I do believe you understand the algorithm perfectly. I tried
what you suggested and it worked perfectly (no crashes). Every step
is correct except that it does not segfault while trying to read the
freed nodeA. In my test, eax = 0x392650. It frees the node without
an issue; when the second thread comes up, eax is also set to 0x392650,
and reading it gave me the Visual C++ freed-memory marker
(0xFEEEFEEE). It's a garbage value, but it won't be kept since
the count is wrong. Give it a spin; you'll see that scenario works.

Chris Thomasson

Jan 14, 2008, 4:51:39 AM1/14/08
to
"ti_chris" <tiCh...@gmail.com> wrote in message
news:f472509c-ad88-4617...@21g2000hsj.googlegroups.com...

> On Jan 14, 12:32 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "ti_chris" <tiChri...@gmail.com> wrote in message
>>
>> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...>
>> Hi Folks,
>>
>> > I read a lot about lock-free algorithm and tried to find an "elegant"
>> > solution to the memory problem that the algorithms poses. The pop is
>> > obviously the problem. The algorithm I'm using is the following:
[...]

>> Am I understanding your algorithm correctly?
>
> Yes, I do believe you are understand the algorithm perfectly. I tried
> what you suggested and it worked perfectly (no crashers). Every step
> is correct except that it does not segfault while trying to read the
> freed nodeA. In my test, eax = 0x392650. It frees the node without
> an issue, when the second thread comes up, eax is also set to 0x392650
> and reading it gave me the Visual C++ freed memory marker
> (0xFEEEFEEE).

Humm, you must not be hitting the condition in release mode. The marker is
only put in there in debug mode. I am thinking that if the condition were
hit in release mode, then it would crash... Humm, okay, I think I have a way
to simulate this in release mode... Try something like:

_______________________________________________________________________
Scenario: Two threads, an unsignalled event and a stack with a single
node...

> unsigned long Pop(void)
> {
> register LFStackNode *OldNode;

Add the following logic right here:

if (TlsGetValue(...) == &ThreadB) {
    WaitForSingleObject(debug_gate, INFINITE);
}

>
> __asm {
> mov edi, [this] ; edi = &this
> xor eax, eax
> xor edx, edx
> xor ebx, ebx
> xor ecx, ecx
>
> lock cmpxchg8b [edi] ; edx:eax = this
> }
> LoopPop:
> __asm {
> test eax, eax
> je PopOut


Add the following logic right here:


if (TlsGetValue(...) == &ThreadA) {
puts("Thread A Waiting For The GO code!");
SetEvent(debug_gate);
getchar();
}


>
X: > mov ebx, [eax] ; ebx = Top->Next
> lea ecx, [edx + 1] ; ecx = Top->Count + 1
> lock cmpxchg8b [edi] ; ebx:ecx = Next
> jne LoopPop
>
> mov OldNode, eax ; OldNode = Top
> }
>
> const unsigned long Data = OldNode->Data;
>
> free(OldNode);


puts("Thread B Finished! Thread A can GO!");
>
> return Data;

>
> PopOut:
> return -1;
> }
>
> };

_______________________________________________________________________

Stick that logic in your code, run the program and wait for the phrase
"Thread B Finished! Thread A can GO!" to come up on the console... Then hit
the <ENTER> key, and you "should" seg-fault. Do this in release mode...

> It's a garbage value, however it won't be kept since
> the count is wrong. Give it a spin; you'll see that scenario works.

Okay, I will.

Dmitriy Vyukov

Jan 14, 2008, 4:49:16 AM1/14/08
to
On Jan 14, 12:18 pm, ti_chris <tiChri...@gmail.com> wrote:

> Yes, I do believe you are understand the algorithm perfectly. I tried
> what you suggested and it worked perfectly (no crashers). Every step
> is correct except that it does not segfault while trying to read the
> freed nodeA. In my test, eax = 0x392650. It frees the node without
> an issue, when the second thread comes up, eax is also set to 0x392650
> and reading it gave me the Visual C++ freed memory marker
> (0xFEEEFEEE). It's a garbage value, however it won't be kept since
> the count is wrong. Give it a spin; you'll see that scenario works.


No, this is still not working and can cause a paging fault.
You can check the implementation of the lock-free stack in the WinAPI:
disassemble the functions
InterlockedPopEntrySList()/InterlockedPushEntrySList().
It is exactly the same algorithm.
They use an interesting trick: they wrap the functions in an SEH frame and
just suppress the AV. This works because, as you noted, reading
"garbage" can't harm this algorithm.

The reason you don't catch a paging fault is that the memory is freed to the
*in-process* memory manager, not to the OS. If the memory is returned to
the OS and unmapped from the process's virtual address space, then you will
catch a paging fault for sure.

If you want to reproduce the paging fault, you can try the following.
First, create many more threads, because you need a thread to be
preempted for a long time.
If that doesn't help, insert:

SC0: loop
SC1: head = lf->top

SC2: oc = lf->ocount

SC3: if head == NULL
SC4: return NULL

*if (0 == (rand() % 100)) SwitchToThread();*


SC5: next = head->next
SC6: if CAS2(&lf->top, head, oc, next, oc + 1)
SC7: break
SC8: endloop
SC9: return head

to increase the probability of preemption at the right point.

If that doesn't help, allocate every node with VirtualAlloc() and free it
with VirtualFree().


I am sure all this will kill your program in a second :)
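Dmitriy's point can be demonstrated directly; here is a sketch (mine, not from the thread) using POSIX mmap()/munmap() in place of VirtualAlloc()/VirtualFree(): the stale read is harmless while the page is still mapped, and faults the moment the page is handed back to the OS.

```cpp
#include <csetjmp>
#include <csignal>
#include <sys/mman.h>

static sigjmp_buf jump_buffer;
static void on_segv(int) { siglongjmp(jump_buffer, 1); }

// Returns 1 if re-reading the node after it was unmapped faulted, i.e. the
// page really went back to the OS. With a plain free() to the in-process
// heap, the same read would quietly return garbage instead.
static int read_after_unmap(void) {
    volatile long *node = (volatile long *)mmap(
        NULL, 4096, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    *node = 42;          // pretend this is head->next
    long stale = *node;  // stale read while still mapped: harmless
    (void)stale;

    munmap((void *)node, 4096);  // VirtualFree(): page leaves the process

    struct sigaction sa = {};
    sa.sa_handler = on_segv;
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(jump_buffer, 1) == 0) {
        stale = *node;   // the same stale read now touches an unmapped page
        return 0;        // not reached on a working MMU
    }
    return 1;            // SIGSEGV was delivered, as predicted
}
```

Calling read_after_unmap() should return 1 on any POSIX system; the SEH trick discussed later in the thread is the Windows way of surviving exactly this fault.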


Dmitriy V'jukov

Dmitriy Vyukov

Jan 14, 2008, 4:54:22 AM1/14/08
to
On Jan 14, 12:51 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "ti_chris" <tiChri...@gmail.com> wrote in message
>
> news:f472509c-ad88-4617...@21g2000hsj.googlegroups.com...> On Jan 14, 12:32 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> >> "ti_chris" <tiChri...@gmail.com> wrote in message
>
> >>news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...>
> >> Hi Folks,
>
> >> > I read a lot about lock-free algorithm and tried to find an "elegant"
> >> > solution to the memory problem that the algorithms poses. The pop is
> >> > obviously the problem. The algorithm I'm using is the following:
>
> [...]
>
> >> Am I understanding your algorithm correctly?
>
> > Yes, I do believe you are understand the algorithm perfectly. I tried
> > what you suggested and it worked perfectly (no crashers). Every step
> > is correct except that it does not segfault while trying to read the
> > freed nodeA. In my test, eax = 0x392650. It frees the node without
> > an issue, when the second thread comes up, eax is also set to 0x392650
> > and reading it gave me the Visual C++ freed memory marker
> > (0xFEEEFEEE).
>
> Humm, you must not be hitting the condition in release mode. The marker is
> only put in there in debug mode. I am thinking that if the condition was hit
> in release mode, then it would crash... Humm, okay, I think I have a way to
> simulate this in release mode... Try something like:


'Release mode' won't help. You must force the node memory to be freed
to the OS and not to the in-process memory manager. While the node
memory remains in the process's virtual address space, no tricky
preemption can kill the algorithm.


Dmitriy V'jukov

Chris Thomasson

Jan 14, 2008, 6:14:36 AM1/14/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:d46a1baa-6ca3-43aa...@d70g2000hsb.googlegroups.com...

It's still a load from a freed memory location. That is not good in my book.

Chris Thomasson

Jan 14, 2008, 6:17:25 AM1/14/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:3c54a2c6-6c0b-4931...@e10g2000prf.googlegroups.com...

If the OP uses VirtualAlloc/Free() along with the debugging technique I laid
out, then there should definitely be a fault.


>
>
> I am sure all this will kill your program in a second :)

Yes.

Chris Thomasson

Jan 14, 2008, 6:35:28 AM1/14/08
to
"ti_chris" <tiCh...@gmail.com> wrote in message
news:f472509c-ad88-4617...@21g2000hsj.googlegroups.com...

> On Jan 14, 12:32 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "ti_chris" <tiChri...@gmail.com> wrote in message
>>
>> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...>
>> Hi Folks,
>>
>> > I read a lot about lock-free algorithm and tried to find an "elegant"
>> > solution to the memory problem that the algorithms poses. The pop is
>> > obviously the problem. The algorithm I'm using is the following:
>>
>> > SC0: loop
>> > SC1: head = lf->top
>>
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> There is a race-condition right here.
>>
>> > SC2: oc = lf->ocount
>>
>> You need to read the version count _before_ you read the head of the
>> list.
>> Here is how the race can occur...
[...]

>> There are three ways to solve this:
>>
>> 1 - Load the version count _before_ you load the head; use a #LoadLoad
>> barrier.
>>
>> 2 - Increment the version count on pops _and_ pushes.
>>
>> 3 - Use 64-bit atomic load
>
> Yes; You are completely right, reversing the order would work as would
> fixing up the number with the push/pop. Thanks for point it out. I
> personally prefer the 64bit version due to it's elegance with the
> cmpxchg8b instruction.

Humm... Think about this for a moment... How many interlocked RMW
instructions are in a typical uncontended fast-pathed mutex implementation?
Usually two: one for the lock, and another for the unlock. Okay.
Think of a stack protected by such a mutex. The pop operation will use two
expensive locked atomic operations. Fine, it's a lock, what did you expect?
Now think of a lock-free algorithm that uses two or more interlocked RMW
instructions on a _fast-path_. It's going to be just as, or more, expensive
than the overhead generated by an uncontended mutex access. DWCAS is an
expensive operation. A great deal of time and energy goes into specifically
avoiding having to execute interlocked RMW instructions and memory barriers:

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce

So you have not gained performance over a mutex in the uncontended case. You
might want to think about that for a moment.

[...]

Chris Thomasson

Jan 14, 2008, 6:39:45 AM1/14/08
to
"ti_chris" <tiCh...@gmail.com> wrote in message
news:f472509c-ad88-4617...@21g2000hsj.googlegroups.com...

> On Jan 14, 12:32 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "ti_chris" <tiChri...@gmail.com> wrote in message
>>
>> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...>
>> Hi Folks,
>>
[...]

> Yes; You are completely right, reversing the order would work as would
> fixing up the number with the push/pop. Thanks for point it out. I
> personally prefer the 64bit version due to it's elegance with the
> cmpxchg8b instruction.
[...]

Actually, IBM used an atomic 64-bit read of the anchor. You can read about
it here:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)

[...]

Chris Thomasson

Jan 14, 2008, 6:41:46 AM1/14/08
to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:nbOdnXHUg8t92hba...@comcast.com...

They use the LM (e.g., Load-Multiple) instruction.

ti_chris

Jan 14, 2008, 6:39:04 AM1/14/08
to
On Jan 14, 3:14 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

No, I agree. You're better off with your own memory allocator in the
end, because the default malloc/free are blocking primitives anyhow,
and it would be nice to have a completely lock-free function. However,
I am doing this in steps and really didn't want to deal with stuff
like hazard pointers and such, as I have a hunch that adding so
much overhead/complexity to the functions effectively renders them
slower than a locking counterpart OR a light spin-loop.

If I really wanted to use this technique, however, I do believe Dmitriy
is right: I can surround the operation with a try/catch block, catch
the page fault, and loop back to the top when it is caught.

Chris Thomasson

Jan 14, 2008, 7:00:32 AM1/14/08
to
"ti_chris" <tiCh...@gmail.com> wrote in message
news:fcf15f45-8018-41b3...@s13g2000prd.googlegroups.com...

> On Jan 14, 3:14 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
[...]

> If I really wanted to use this technique however, I do believe Dmitriy
> is right. I can surround the operation with a try/catch block and I
> am able to catch the page fault and have it loop back to the top when
> caught.

Sure, you can use SEH. It's, well, IMHO, sort of a hack...

Chris Thomasson

unread,
Jan 14, 2008, 7:07:20 AM1/14/08
to
"ti_chris" <tiCh...@gmail.com> wrote in message
news:fcf15f45-8018-41b3...@s13g2000prd.googlegroups.com...

> On Jan 14, 3:14 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
[...]

> If I really wanted to use this technique however, I do believe Dmitriy
> is right. I can surround the operation with a try/catch block and I
> am able to catch the page fault and have it loop back to the top when
> caught.

You can reuse the nodes in a cache. You don't have to free them right away,
unless your application architecture demands something like that.
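A minimal sketch of that idea (names are mine, not from the thread): Pop() calls release() where it used to call free(), so node memory stays mapped for the life of the process and the racy read of a "freed" node can never page-fault. This single-threaded sketch only shows the reuse; a real cache would itself need to be thread-safe (e.g. per-thread lists).

```cpp
#include <cstdlib>

struct CNode { CNode *next; unsigned long data; };

// Node cache: popped nodes are parked here instead of being handed back to
// free(), so their memory stays mapped for reuse.
struct NodeCache {
    CNode *free_list = nullptr;

    CNode *acquire() {              // used by Push() in place of malloc()
        if (CNode *n = free_list) { free_list = n->next; return n; }
        return (CNode *)malloc(sizeof(CNode));
    }
    void release(CNode *n) {        // used by Pop() in place of free()
        n->next = free_list;
        free_list = n;
    }
};
```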

Dmitriy Vyukov

Jan 14, 2008, 8:57:48 AM1/14/08
to
On Jan 14, 2:35 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> Humm... Think of this for a moment... How many interlocked RMW instructions
> are in a typical uncontended fast-pathed mutex implementation? Well, there
> are usually two of them. One for lock, and another for the unlock. Okay.
> Think of a stack protected by such a mutex. The pop operation will use two
> expensive locked atomic operations. Fine, its a lock, what did you expect?
> Now think of a lock-free algorithm that uses two or more interlocked RMW
> instructions on a _fast-path_. It's going to be just as, or more, expensive
> than the overhead generated by a uncontended mutex access. DWCAS is an
> expensive operation. A great deal of time and energy goes into specifically
> avoiding having to execute interlocked RMW instructions and memory barriers:


And think about a spin-mutex with active or passive spinning, which has
only 1 interlocked RMW. It's very fast.


Dmitriy V'jukov

Chris Thomasson

Jan 14, 2008, 1:00:32 PM1/14/08
to
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:7e903b3f-f25a-4a39...@j20g2000hsi.googlegroups.com...

You don't need interlocking in the unlock function either.

ti_chris

Jan 14, 2008, 5:40:48 PM1/14/08
to

This is the question I have been asking myself for a while now. Would
it be faster to use a spin-mutex with one RMW, or would a lock-free
LIFO run faster, in your opinion? I know that a spin-lock mutex
doesn't actually get rid of deadlocks and all the other good stuff
that lock-free algorithms guarantee, but I'm more interested in the
speed aspect at this point.

ti_chris

Jan 14, 2008, 7:06:54 PM1/14/08
to
Also,


Correct me if I'm wrong, but now I'm looking at the queue algorithms,
and there really doesn't seem to be a nice or "hacky" solution to the
problem. It really looks like the only solution is to keep the
memory around "long enough" for the dereference (E4 & E5) to run
successfully, through one of the many methods that does this.
Otherwise, it looks like you could write to the old memory location if
the value of tail->next happened to be ENDFIFO(ff).

fifo-push (ff: pointer to fifo, cl: pointer to cell)
E1:  cl->next = ENDFIFO(ff)  # set the cell's next pointer to the end marker
E2:  loop                    # try until the enqueue is done
E3:    icount = ff->icount   # read the tail modification count
E4:    tail = ff->tail       # read the tail cell
E5:    if CAS (&tail->next, ENDFIFO(ff), cl)  # try to link the cell to the tail cell
E6:      break               # enqueue is done, exit the loop
E7:    else                  # tail was not pointing to the last cell; try to advance tail
E8:      CAS2 (&ff->tail, tail, icount, tail->next, icount+1)
E9:    endif
E10: endloop
E11: CAS2 (&ff->tail, tail, icount, cl, icount+1)  # enqueue done; try to set tail to the enqueued cell

Zeljko Vrba

Jan 15, 2008, 1:36:27 AM1/15/08
to
On 2008-01-15, ti_chris <tiCh...@gmail.com> wrote:
>
> Correct me if I'm wrong, but now I'm looking at the Queue algorithms,
> and there really doesn't seem to be a nice or "hacky" solution to the
>
You could use an array-based lock-free queue. No problems with memory
management there.
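One hedged reading of that suggestion (my own sketch and names; Zeljko posted no code): a bounded single-producer/single-consumer ring whose cells are reused in place, so nothing is allocated or freed after construction and there is nothing to reclaim.

```cpp
#include <atomic>
#include <cstddef>

// Bounded SPSC ring buffer: cells live in the array for the queue's whole
// lifetime, so the reclamation problem never arises. N should be a power of
// two so the modulo stays correct across index wraparound.
template <typename T, size_t N>
struct SpscRing {
    T cells[N];
    std::atomic<size_t> head{0};  // consumer index, grows without bound
    std::atomic<size_t> tail{0};  // producer index, grows without bound

    bool push(const T &v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;                              // full
        cells[t % N] = v;
        tail.store(t + 1, std::memory_order_release);  // publish the cell
        return true;
    }

    bool pop(T *out) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                              // empty
        *out = cells[h % N];
        head.store(h + 1, std::memory_order_release);  // release the cell
        return true;
    }
};
```

The trade-off is a fixed capacity and, in this form, exactly one producer and one consumer thread.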

Chris Thomasson
Jan 15, 2008, 4:12:20 AM

"Zeljko Vrba" <zvrba....@ieee-sb1.cc.fer.hr> wrote in message
news:slrnfool3b...@ieee-sb1.cc.fer.hr...

Right:

http://groups.google.com/group/comp.arch/msg/a3ebfe80363a4399
(at end of msg...)

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e23342fac1820a76
(thread contains some info on simple index tricks...)

Chris Thomasson
Jan 15, 2008, 8:13:25 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:nbOdnXHUg8t92hba...@comcast.com...

Actually, it's in the last paragraph of the 'Free-Pool Manipulation' section.

Chris Thomasson
Jan 22, 2008, 11:58:40 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:ptednZnvbp2egRba...@comcast.com...

> "ti_chris" <tiCh...@gmail.com> wrote in message
> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...
>> Hi Folks,
>>
>> I read a lot about lock-free algorithm and tried to find an "elegant"
>> solution to the memory problem that the algorithms poses. The pop is
>> obviously the problem. The algorithm I'm using is the following:
>>
>> SC0: loop
>> SC1: head = lf->top
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> There is a race-condition right here.
>
>> SC2: oc = lf->ocount
>
>
> You need to read the version count _before_ you read the head of the list.
[...]

Here is another rule... Some lock-free stack API's include a so-called
"flush" function:

http://groups.google.com/group/comp.programming.threads/msg/a359bfb41c68b98b

Note that I implement the flush with an XCHG instruction which is fine _if_
that's the _only_ way there is to _remove_ nodes from the stack. In other
words, you CANNOT combine a flush function that is based on XCHG, with a pop
function that uses ABA counter _and_ CAS/DWCAS...

Think: ThreadA reads pointerA and version 1; it gets preempted by ThreadB,
which flushes the stack and pushes pointerA back onto the list; then ThreadA
runs and gets a successful DWCAS when it should have _failed_. Sounds like a
"bit" of a race-condition, eh?...

Sorry for not pointing this out earlier!

;^(...

Chris Thomasson
Jan 23, 2008, 12:04:22 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:FaCdnYBgoPDZWgva...@comcast.com...

> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:ptednZnvbp2egRba...@comcast.com...
>> "ti_chris" <tiCh...@gmail.com> wrote in message
>> news:21d128cd-7d19-4c88...@d4g2000prg.googlegroups.com...
>>> Hi Folks,
>>>
>>> I read a lot about lock-free algorithm and tried to find an "elegant"
>>> solution to the memory problem that the algorithms poses. The pop is
>>> obviously the problem. The algorithm I'm using is the following:
>>>
>>> SC0: loop
>>> SC1: head = lf->top
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> There is a race-condition right here.
>>
>>> SC2: oc = lf->ocount
>>
>>
>> You need to read the version count _before_ you read the head of the
>> list.
> [...]
>
> Here is another rule... Some lock-free stack API's include a so-called
> "flush" function:
[...]

> Sorry for not pointing this out earlier!

You can use XCHG for the flush if the push function bumps the version
counter...

Chae
Jan 23, 2008, 1:56:39 AM
On Jan 14, 4:00 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "ti_chris" <tiChri...@gmail.com> wrote in message

Using SEH to suppress page faults has a side effect: it can hide real
page-fault bugs. For example, what if the "next" pointer has been
overwritten by some function with a random memory-overwrite bug? If you
suppress that real page fault with SEH, you can easily end up in an
infinite loop in the pop function. Of course, this is one of the
worst-case scenarios.

And I don't think Windows' InterlockedPopEntrySList() uses SEH; rather,
the Windows kernel's exception handler checks the faulting IP
(instruction pointer), and if it is at the "next"-pointer access inside
InterlockedPopEntrySList, it resumes the function so it spins again.
This is an OS-level hack. Out of curiosity, I intentionally set the
next pointer to an invalid address. After about 2000 spins, it finally
threw the exception up to the application. So I guess the kernel keeps
a counter that is incremented when a fault occurs at the same address,
with a threshold of about 2000.

Lars Johannsen
Jan 24, 2008, 6:30:26 PM
On Mon, 14 Jan 2008 01:49:16 -0800 (PST), Dmitriy Vyukov
<dvy...@gmail.com> wrote:

>On Jan 14, 12:18 pm, ti_chris <tiChri...@gmail.com> wrote:
>
>
>No, this is still not working and can cause a paging fault.
>You can check the implementation of the lock-free stack in WinAPI:
>disassemble the functions
>InterlockedPopEntrySList()/InterlockedPushEntrySList().
>There is exactly the same algorithm.
>They use an interesting trick - they wrap the functions in an SEH frame
>and just suppress the AV. This works because, as you noted, reading
>"garbage" can't harm the operation of this algorithm.
>
>The reason why you don't catch the paging fault: memory is freed to the
>*in-process* memory manager, not to the OS. If memory is returned to the
>OS and unmapped from the process's virtual space, then you will catch a
>paging fault for sure.
>
>

>SC0: loop
>SC1: head = lf->top
>SC2: oc = lf->ocount
>SC3: if head == NULL
>SC4: return NULL
> *if (0 == (rand() % 100)) SwitchToThread();*
>SC5: next = head->next
>SC6: if CAS2(&lf->top, head, oc, next, oc + 1)
>SC7: break
>SC8: endloop
>SC9: return head
>

As the only way to avoid dereferencing a freed "next" in SC5-SC6 is to
wait "long enough", I was playing with what "long enough" means.
The fact is that the only period in which it is safe to free a node is
when no other thread is inside the loop (in pop).

void push(data) {
    node = new_node()
    node->data = data
    lf_push(list, node)
}
data pop() {
    node = lf_pop(list)
    tmp = node->data
    free_node(node)
    return tmp
}

One way to keep it lock-free (as in non-blocking), without
hazard pointers, reference counting, etc. (although not cheap), is to
use a gate counter (2 extra atomic operations per pop) and a local GC
by means of 2 extra LIFO lists, where the GC work is shared by the
threads using the pop function.
Increment the gate count on entry and decrement it on exit, so that if
the counter was 0 on entry, we know no other thread is inside the loop,
i.e. no other thread is still referencing anything, and we can do
cleanup. No new thread can reference any previously "freed" contents.
So, instead of doing a free(next), we push the node onto a separate
list (let's call it stage1_list); then during cleanup we can remove the
contents from stage1_list (atomically swap the top of the list with
empty). As only one thread can enter with oldcount == 0 (see below),
further cleanup work can be done as if single-threaded.
We can't free things yet, because while thread 1 is checking the
gate count etc., other threads could move new contents onto
stage1_list before thread 1 empties it.
So we store the contents on a second list (let's call that one
stage2_list) until the next time a thread comes in as number 1.
If there is already something on stage2_list, that can be freed before
adding the new contents.
So: a two-stage time-out before freeing nodes. That should be long enough.

The only problem I see: during high contention, no thread gets to
enter as number 1 with the right to free things
(i.e. if the loop is never empty).

Let's call the above SC1-9 lf_pop()
(pseudo-code with IA-32 mnemonics:)

SG1: oldcount = lock xadd(gate_cnt, +1)
SG2: if oldcount == 0
SG3:   do_cleanup()
SG4: head = lf_pop(lifo_list)
SG5: data = head->data
SG6: lf_push(stage1_list, head)
SG7: lock dec(gate_cnt)
SG8: return data

DC1: do_cleanup()
DC2:   if stage1_list->head != empty
DC3:     oldhead = atomic_swap(stage1_list->head, empty)
DC4:     while (stage2_list->head) {
DC5:       tmp = stage2_list->head
DC6:       stage2_list->head = stage2_list->head->next
DC7:       free(tmp)
DC8:     }
DC9:     stage2_list->head = oldhead

Now, to avoid having to spend that time every time a thread enters as
number 1, this can be scaled back based on the ABA counter.

One optimization might be to fold the gate count into the ABA counter,
i.e. for a 32-bit ABA variable, use 8 bits for the gate and 24 bits
for the ABA count, and thus reduce the overhead to one extra
"lock dec" on the gate part, by doing both increments during the CAS.
That would still allow for 255 threads.
i.e.:
SC0: loop
SC2:   oc = lf->ocount                # read the version/gate word first
SC1:   head = lf->top

       gate-cnt = (oc AND 0xff000000) + (1 << 24)
       new-oc   = (oc AND 0x00ffffff) + 1

SC3:   if head == NULL
SC4:     return NULL
SC5:   next = head->next
SC6:   if CAS2(&lf->top, head, oc, next, gate-cnt OR new-oc)
SC7:     break
SC8: endloop
     if gate-cnt == (1 << 24)         # i.e. the old gate byte was 0: we entered first
       do_cleanup()
     lock dec-byte(&(lf->ocount) + 3) # decrement just the gate byte
SC9: return head

Or am I overlooking something here?

Lars Johannsen
