[lock-free] Found a nice C++ lock-free stack for x86 and x86-64

Andrei Verevko

unread,

Apr 21, 2010, 5:50:37 AM4/21/10

to Scalable Synchronization Algorithms

The code is here:

http://www.mgix.com/snippets/?LockFree

Andrei

--
Subscription settings: http://groups.google.com/group/lock-free/subscribe?hl=en

Eduardo Juan

unread,

Apr 21, 2010, 8:09:22 PM4/21/10

to lock...@googlegroups.com

The code is here:

http://www.mgix.com/snippets/?LockFree

Andre

Interesting lifo implementation... Thanks for the link!

I wonder how well can perform the producerConsumer.cpp example with a unique recycle 'Msg' objects pool.
If many threads contend to this objects pool, maybe some threads could be out-of-objects during a period of time
and others not, which is a bad idea.

I think it would be better to have a per-thread local recycle object pool (lifo but no need to lock-free here) so cache locality
can increase performance, but how can we guarantee that thta thread will recycle same quantity of objects he created?

What about being this example a bounded lifo queue. Is that possible?

Besides that, unfortunately I'm having a "Illegal instruction" at PUSH_CORE (lifo.h:220) :(

Maybe my machine does not support the cmpxchg(8|16) instruction set. I have a AMD Athlon 64 X2 Dual Core Processor 4800+
and use a Ubuntu 9.04 64-bit OS.

Any help here?

Regards,
Eduardo Juan

Chae Lim

unread,

Apr 22, 2010, 1:22:59 AM4/22/10

to Scalable Synchronization Algorithms

On Apr 21, 2:50 am, Andrei Verevko <andrei.vere...@gmail.com> wrote:
> The code is here:
>
> http://www.mgix.com/snippets/?LockFree
>
> Andrei

This implementation does not seems handle the common memory
reclamation issue occurring in many lock-free algorithms. After a
poped node deleted (returned the memory to OS), other threads can try
to access the deleted node. (potentially pagefault)

Dmitriy Vyukov

unread,

Apr 22, 2010, 2:29:08 AM4/22/10

to Scalable Synchronization Algorithms

On Apr 22, 9:22 am, Chae Lim <chae...@gmail.com> wrote:

> This implementation does not seems handle the common memory
> reclamation issue occurring in many lock-free algorithms. After a
> poped node deleted (returned the memory to OS), other threads can try
> to access the deleted node. (potentially pagefault)

I skimmed through the code, and noticed the problem too. Not sure 100%
for now, though.
IMVHO, the most simple and fast solution is the one used in MS
InterlockedSList API. They use SEH handler to suppress potential page
faults. Very similar to the following queue:
http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300
AFAIK, that can be done in Unix environment too. AFAIR, I've seen
similar thing in TL2 transactional memory implementation - they use
signal handlers to suppress SIGSEGVs.

--
Dmitriy V'jukov

Chae Lim

unread,

Apr 24, 2010, 3:44:56 AM4/24/10

to Scalable Synchronization Algorithms

On Apr 21, 11:29 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 22, 9:22 am, Chae Lim <chae...@gmail.com> wrote:
>
> > This implementation does not seems handle the common memory
> > reclamation issue occurring in many lock-free algorithms. After a
> > poped node deleted (returned the memory to OS), other threads can try
> > to access the deleted node. (potentially pagefault)
>
> I skimmed through the code, and noticed the problem too. Not sure 100%
> for now, though.
> IMVHO, the most simple and fast solution is the one used in MS
> InterlockedSList API. They use SEH handler to suppress potential page

> faults. Very similar to the following queue:http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4...

> AFAIK, that can be done in Unix environment too. AFAIR, I've seen
> similar thing in TL2 transactional memory implementation - they use
> signal handlers to suppress SIGSEGVs.
>
> --
> Dmitriy V'jukov

Yes, on Windows, InterlockedSList API is way to go for lock-free LIFO
stack. It's exception handling is done at the Kernel level which is
not quite same as application level SEH. In other words, it's a Kernel
level hack but faster and simpler to the application developers.

Dmitriy Vyukov

unread,

Apr 25, 2010, 9:24:52 AM4/25/10

to Scalable Synchronization Algorithms

On Apr 24, 11:44 am, Chae Lim <chae...@gmail.com> wrote:

> Yes, on Windows, InterlockedSList API is way to go for lock-free LIFO
> stack. It's exception handling is done at the Kernel level which is
> not quite same as application level SEH. In other words, it's a Kernel
> level hack but faster and simpler to the application developers.

Hi Chae,

Can you please elaborate a bit more on "exception handling is done at
the Kernel level which is not quite same as application level SEH"?
When I last looked at InterlockedPopEntrySList() (I guess it was on 32-
bit Windows XP) the code looks like plain SEH __try/__except handler.
Now on 64-bit Windows 7 I see something different. In particular 32-
bit version does some manipulations with current processor number (?):
mov ecx,53h
lsl eax,ecx
AFAICT, that's a way to extract current processor number.
And 64-bit version does not do anything special on the first glance.
That's probably because 64-bit SEH is based on address tables.

I would really appreciate if you tell us more.

--
Dmitriy V'jukov

Chae Lim

unread,

Apr 26, 2010, 3:09:39 AM4/26/10

to Scalable Synchronization Algorithms

Dmitriy, what tried to say was Windows uses Kernel trap to see if the
access violation or pagefault happened in the InterlockedPopEntrySList
function. More specifically ntdll!ExpInterlockedPopEntrySListFault.

I copy pasted part of 64 bit Vista's InterlockedPopEntrySList API
below. (The label "ExpInterlockedPopEntrySListFault" is not being used
by any code inside InterlockedPopEntrySList API and I think Kernel
trap uses it though)

ntdll!ExpInterlockedPopEntrySListFault:
mov rcx,qword ptr [r8] <== Access hazard
inc cl
mov rbx,rax
dec bx
lock cmpxchg16b oword ptr [r10]

Neill Clift stated a side effect from using SEH at the following
discussion and hinted about kernel trap.

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0682f59ad7c4a489/278719d422d69d46?#278719d422d69d46

There is another interesting thing that I stumbled. I think, on
Windows 2000 SP3, InterlockedPopEntrySList can run infinite loop
(Deadlock) as shown following link. I guess this is because they
didn't handle real access violation case (the pointer is just invalid
itself and retrying would run for ever). On Win XP and later, it seems
bail out the about 1000 retries.

http://www.osronline.com/showThread.cfm?link=45244

Yes I noticed the new code (I think that added since Vista) in 32 bit
code, they added processor number manipulation but not sure what they
are intended for.

Thanks
Chae Lim

Dmitriy Vyukov

unread,

Apr 26, 2010, 4:30:21 AM4/26/10

to Scalable Synchronization Algorithms

On Apr 26, 11:09 am, Chae Lim <chae...@gmail.com> wrote:

> Dmitriy, what tried to say was Windows uses Kernel trap to see if the
> access violation or pagefault happened in the InterlockedPopEntrySList
> function. More specifically ntdll!ExpInterlockedPopEntrySListFault.
>
> I copy pasted part of 64 bit Vista's InterlockedPopEntrySList API
> below. (The label "ExpInterlockedPopEntrySListFault" is not being used
> by any code inside InterlockedPopEntrySList API and I think Kernel
> trap uses it though)

Yes, it's a kernel trap. But it's not special. It's a plain Win64 SEH.
Win64 SEH is based on external tables of handlers. You must find
ExpInterlockedPopEntrySListFault in .pdata section of the module.
That's exactly how it is for you own SEH handlers, you will also not
find any references to them in the code:

int main()
{
000000013FF31000 sub rsp,28h
__try
{
printf("%s\n", 1);
000000013FF31004 mov edx,1
000000013FF31009 lea rcx,[string "%s\n" (13FF321B0h)]
000000013FF31010 call qword ptr [__imp_printf (13FF32130h)]
}
000000013FF31016 jmp $LN6+15h (13FF3102Dh)
__except(EXCEPTION_EXECUTE_HANDLER)
{
printf("%s\n", "Hello, World!");
000000013FF31018 lea rdx,[string "Hello,
World!" (13FF321B8h)]
000000013FF3101F lea rcx,[string "%s\n" (13FF321B0h)]
000000013FF31026 call qword ptr [__imp_printf (13FF32130h)]
000000013FF3102C nop
}
}
000000013FF3102D xor eax,eax
000000013FF3102F add rsp,28h
000000013FF31033 ret

> Neill Clift stated a side effect from using SEH at the following
> discussion and hinted about kernel trap.
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

Good point. I didn't not think about it before. I guess it's indeed
possible to compromise SEH based implementation so that it will cause
bad things on otherwise correct code.

> There is another interesting thing that I stumbled. I think, on
> Windows 2000 SP3, InterlockedPopEntrySList can run infinite loop
> (Deadlock) as shown following link. I guess this is because they
> didn't handle real access violation case (the pointer is just invalid
> itself and retrying would run for ever). On Win XP and later, it seems
> bail out the about 1000 retries.
>
> http://www.osronline.com/showThread.cfm?link=45244
>
> Yes I noticed the new code (I think that added since Vista) in 32 bit
> code, they added processor number manipulation but not sure what they
> are intended for.

Strictly speaking, if a user passed incorrect pointer it's UB :)
For example, if one calls virtual function via incorrect object
pointer it can easily format a hard drive. Infinite loop is not the
worst outcome :)
I agree that it's better to handle that case, though.

--
Dmitriy V'jukov

Chae Lim

unread,

Apr 26, 2010, 5:09:10 PM4/26/10

to Scalable Synchronization Algorithms

Dmitriy, I played with Kernel debugger a little bit especially looked
PageFault interrupt handler. I'm using Vista x64. I remember I did it
long time ago with Win XP and seems Microsoft has changed the code
since then.

There is function called KiCheckForSListAddress for checking if the
pagefault occurred inside InterlockedPopEntrySList.

KiCheckForSListAddress is checking if the fault address is between
KiInterlockedPopEntrySListResumeEntryPoint and
KiInterlockedPopEntrySListEndEntryPoint. It that's true return address
of KeUserPopEntrySListResume. I couldn't find the code that limited
time of retries. (I'll look it further if I have some more time later)

The way they handling this pagefault is very low level (Kernel's
pagefault IDT function) I thought it's not quite same as SEH since
it's not a user mode try/catch thing but probably Dmitriy and me are
talking same thing differently.)

More important thing is, as Neill mentioned, user more exception
handler has an issue like if freed memory reused as stack guard page
which is pretty paranoid though.

0: kd> u
nt!KiPageFault+0x231:
fffff800`02478171 8945a0 mov dword ptr [rbp-60h],eax
fffff800`02478174 7509 jne nt!KiPageFault+0x23f
(fffff800`0247817f)
fffff800`02478176 b901000000 mov ecx,1
fffff800`0247817b 440f22c1 mov cr8,rcx
fffff800`0247817f 488d4d80 lea rcx,[rbp-80h]
fffff800`02478183 e878600000 call nt!KiCheckForSListAddress
(fffff800`0247e200) <==
fffff800`02478188 8b4da0 mov ecx,dword ptr [rbp-60h]
fffff800`0247818b 0bc9 or ecx,ecx

fffff800`0242a888 mov rax,qword ptr [nt!
KeUserPopEntrySListResumeWow64 (fffff800`02646460)]
fffff800`0242a88f jmp nt!KiCheckForSListAddress+0x3c
(fffff800`0247e23c)

0: kd> uf nt!KiCheckForSListAddress
fffff800`0247e200 movzx eax,word ptr [rcx+170h]
fffff800`0247e207 mov rdx,qword ptr [rcx+168h]
fffff800`0247e20e cmp ax,33h
fffff800`0247e212 je nt!KiCheckForSListAddress+0x44
(fffff800`0247e244)
nt!KiCheckForSListAddress+0x14:
fffff800`0247e214 cmp ax,23h
fffff800`0247e218 je nt!KiCheckForSListAddress+0x65
(fffff800`0247e265)
nt!KiCheckForSListAddress+0x1a:
fffff800`0247e21a cmp rdx,qword ptr [nt!
KiInterlockedPopEntrySListResumeEntryPoint (fffff800`02646458)]
fffff800`0247e221 ja nt!KiCheckForSListAddress+0x26
(fffff800`0247e226)
nt!KiCheckForSListAddress+0x23:
fffff800`0247e223 ret 0
nt!KiCheckForSListAddress+0x26:
fffff800`0247e226 cmp rdx,qword ptr [nt!
KiInterlockedPopEntrySListEndEntryPoint (fffff800`02646860)]
fffff800`0247e22d ja nt!KiCheckForSListAddress+0x23
(fffff800`0247e223)
nt!KiCheckForSListAddress+0x2f:
fffff800`0247e22f cmp ax,10h
fffff800`0247e233 jne nt!KiCheckForSListAddress+0x23
(fffff800`0247e223)
nt!KiCheckForSListAddress+0x35:
fffff800`0247e235 mov rax,qword ptr [nt!
KiInterlockedPopEntrySListResumeEntryPoint (fffff800`02646458)]

nt!KiCheckForSListAddress+0x3c:
fffff800`0247e23c mov qword ptr [rcx+168h],rax
fffff800`0247e243 ret

Dmitriy Vyukov

unread,

Apr 30, 2010, 5:36:36 AM4/30/10

to Scalable Synchronization Algorithms

On Apr 27, 1:09 am, Chae Lim <chae...@gmail.com> wrote:

> Dmitriy, I played with Kernel debugger a little bit especially looked
> PageFault interrupt handler. I'm using Vista x64. I remember I did it
> long time ago with Win XP and seems Microsoft has changed the code
> since then.
>
> There is function called KiCheckForSListAddress for checking if the
> pagefault occurred inside InterlockedPopEntrySList.
>
> KiCheckForSListAddress is checking if the fault address is between
> KiInterlockedPopEntrySListResumeEntryPoint and
> KiInterlockedPopEntrySListEndEntryPoint. It that's true return address
> of KeUserPopEntrySListResume. I couldn't find the code that limited
> time of retries. (I'll look it further if I have some more time later)
>
> The way they handling this pagefault is very low level (Kernel's
> pagefault IDT function) I thought it's not quite same as SEH since
> it's not a user mode try/catch thing but probably Dmitriy and me are
> talking same thing differently.)

Aha, that's definitely "special" kernel support.
I've especially corrupt SList, and it spun some amount of times, and
then give the exception out. So it seems that the counting logic is in
kernel too.

> More important thing is, as Neill mentioned, user more exception
> handler has an issue like if freed memory reused as stack guard page
> which is pretty paranoid though.

Perhaps they move it to kernel to handle such cases.

I am still trying to model such case (invalid access to stack guard
page) for my queue which is also SEH-based:
http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 30, 2010, 7:58:38 AM4/30/10

to Scalable Synchronization Algorithms

On Apr 30, 1:36 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> > More important thing is, as Neill mentioned, user more exception
> > handler has an issue like if freed memory reused as stack guard page
> > which is pretty paranoid though.

It seems that it's not a big problem.
If problematic access touches the last stack page than nothing
happens. It results in plain AV.
The problem arises IFF the thread (of other problematic accesses)
touch all but last stack pages *sequentially*, then problematic
accesses touches last page -> it blows off stack guard, but the
exception is correctly suppressed. So nothing horrible happened at
this time. If then the thread will consume all his stack, then Yes, it
can corrupt memory. But correct programs must not consume all stack
space anyway.
Note also since memory region is reused for a thread's stack, pointers
into it (into reserved part) must not appear in stack/queue anymore.
So it looks quite unrealistic that problematic accesses will touch
whole stack sequentially.

However, AFAIS, the problem is the following.
Some component setups vectored exception handler and monitors accesses
to some memory location. Such accesses considered as some events.
Problematic access from lock-free stack/queue can raise such an event
spuriously. However, note that problematic access occur only to
locations that just was a valid stack/queue node (they do not occur to
random locations).

--
Dmitriy V'jukov

Chae Lim

unread,

May 3, 2010, 1:06:55 AM5/3/10

to Scalable Synchronization Algorithms

Dmitriy, I remembered one more problem case that actually I
experienced from other reasons. Let's say there are two threads A and
B. If Thread B touches Thread A's stack guard page it'll remove the
guard page but does not reset the guard page. So next time Thread A's
stack grows more than the guard page it gets access violation since no
guard page fault occurs. I think the Windows stack growing with guard
page inherently has this weakness though.

Dmitriy Vyukov

unread,

May 3, 2010, 3:22:38 AM5/3/10

to Scalable Synchronization Algorithms

On May 3, 9:06 am, Chae Lim <chae...@gmail.com> wrote:
> Dmitriy, I remembered one more problem case that actually I
> experienced from other reasons. Let's say there are two threads A and
> B. If Thread B touches Thread A's stack guard page it'll remove the
> guard page but does not reset the guard page. So next time Thread A's
> stack grows more than the guard page it gets access violation since no
> guard page fault occurs. I think the Windows stack growing with guard
> page inherently has this weakness though.

Hummm... well, but doesn't OS move guard to the next page on the
access?
Or more precisely I would expect either (1) OS determines illegal
access and does nothing (do not remove guard, the thread gets AV), or
(2) OS handles the access as normal access to the guard page (removes
guard from the page, the access succeeds, guard is moved to the next
page).
And what you described (the guard is removed, but not moved to the
next page) looks quite senseless to me...

--
Dmitriy V'jukov

Chae Lim

unread,

May 3, 2010, 1:51:53 PM5/3/10

to Scalable Synchronization Algorithms

> Hummm... well, but doesn't OS move guard to the next page on the
> access?
> Or more precisely I would expect either (1) OS determines illegal
> access and does nothing (do not remove guard, the thread gets AV), or
> (2) OS handles the access as normal access to the guard page (removes
> guard from the page, the access succeeds, guard is moved to the next
> page).
> And what you described (the guard is removed, but not moved to the
> next page) looks quite senseless to me...
>
> --
> Dmitriy V'jukov

I just found the code that I wrote to demonstrate this issue long ago.
(Copy pasted below). One possible workaround is commit all stack
memory on thread creation so stack growing due to guard page fault
doesn't need commit new memory.

#include <stdio.h>
#include <process.h>
#include <windows.h>

//
===========================================================================
void Crash () {
printf("Crash\n");

// This function simply allocate a big local array and initializes
first element
char test[0x2000];
test[0] = 0;
}

//
===========================================================================
static unsigned __stdcall ThreadProc (LPVOID param) {
printf ("ThreadProc\n\tTry to access Stack Guard Page\n");

// param points localVar in AccessStackGuardFromOtherThread
function
char * ptr = (char *)param;
ptr -= 0x2000;
volatile char ch = *ptr;
return 0;
}

//
===========================================================================
static void AccessStackGuardFromOtherThread () {
printf("AccessStackGuardFromOtherThread\n");

unsigned localVar;
HANDLE exceptionThread = (HANDLE)_beginthreadex(
NULL,
0, // stack size
ThreadProc,
&localVar, // <== Pass the pointer of local variable
0, // suspended = false
NULL
);

// Wait for the thread finish
WaitForSingleObject(exceptionThread, INFINITE);
}

//
===========================================================================
void main () {

AccessStackGuardFromOtherThread();
Crash();

printf("End of Program.\n"); // <=== This will never be called
}

Reply all

Reply to author

Forward