Multiprocessor syncronization problem

John M.

unread,

Oct 8, 2003, 10:19:35 PM10/8/03

to

Hi all,

I need to implement a flag in a multiprocessor environment.
What non-privileged instruction should I use? Basicly this
flag is used to allow only to one CPU at a time the access
(read/write) a data structure.

Is the LOCK prefix supported by all motherboards/chipsets?
If I use the XCHG instruction it's even implied.

One idea might be:

MOV EAX,-1
Wait: XCHG [Flag],EAX
TEST EAX,EAX
JNZ Wait

; .. freely access the data structure, knowing no other CPU
; will begin any access to it till we are done ..

MOV [Flag],0

; .. now any other CPU can lock and access the data structure ..

Will it always work correctly?

Thanks!
John

Matt Taylor

unread,

Oct 9, 2003, 5:16:16 AM10/9/03

to

"John M." <Jo...@invalid.com> wrote in message
news:X30hb.250016$R32.8...@news2.tin.it...

> Hi all,
>
> I need to implement a flag in a multiprocessor environment.
> What non-privileged instruction should I use? Basicly this
> flag is used to allow only to one CPU at a time the access
> (read/write) a data structure.

A shadow lock comes to mind.

> Is the LOCK prefix supported by all motherboards/chipsets?
> If I use the XCHG instruction it's even implied.

Lock has been supported since the 8086.

> One idea might be:
>
> MOV EAX,-1
> Wait: XCHG [Flag],EAX
> TEST EAX,EAX
> JNZ Wait
>
> ; .. freely access the data structure, knowing no other CPU
> ; will begin any access to it till we are done ..
>
> MOV [Flag],0
>
> ; .. now any other CPU can lock and access the data structure ..
>
> Will it always work correctly?

Yes. Intel recommends a shadow lock which is slightly different:

shadow_loop:
wait_for_lock:
pause
cmp eax, [flag]
jnz wait_for_lock

; flag is probably 0, now try to lock it
mov eax, -1
xchg [flag], eax
test eax, eax
jz shadow_loop

It's the same loop, but it spins at the top reading until it sees that the
memory is 0. After that it tries to lock and continues if unsuccessful. This
is more efficient with the MOESI cache coherency protocol. Also, it uses the
pause instruction (a no-op on other CPUs) to try to reduce resource/power
consumption.

-Matt

Alex McDonald

unread,

Oct 9, 2003, 6:28:24 PM10/9/03

to

"John M." <Jo...@invalid.com> wrote in message
news:X30hb.250016$R32.8...@news2.tin.it...

Assuming that this is Windows, then read up on
InitializeCriticalSectionAndSpinCount, et al. These are Windows calls that
implement spin locks in MP environments. The advantage with these calls are

"Spin count for the critical section object. On single-processor systems,
the spin count is ignored and the critical section spin count is set to 0.
On multiprocessor systems, if the critical section is unavailable, the
calling thread will spin dwSpinCount times before performing a wait
operation on a semaphore associated with the critical section. If the
critical section becomes free during the spin operation, the calling thread
avoids the wait operation."

In other words, if one of the processors is going to spin for a long time,
you can set the number of attempts, and Windows will wait on a semaphore
instead; some other task will run and come back to you when the semaphore is
signalled. Saves spinning on locks held for extended periods. Search at
http://msdn.microsoft.com. Linux may have a similar set of primitives.

The other type of spin lock is a counted spin lock. Here you may have
multiple waiters for a resource (that is, more than 2). You are only
permitted to take the lock if it is zero, and you can increment it to one.
Otherwise, you must wait to be signalled. If you're interested in the
technique, let me know and I will post some pseudo-code assembler that
should be applicable to any OS that has a "wait-on-semaphore" type
primitive. This technqiue uses CMPXHCG8B to maintain a counter and a list of
waiters.

--
Regards
Alex McDonald

Alex McDonald

unread,

Oct 9, 2003, 8:38:35 PM10/9/03

to

"Alex McDonald" <alex...@btopenworld.com> wrote in message
news:bm49c8$66r$1...@sparta.btinternet.com...

(replying to myself!) You might also find the following useful;

http://perso.wanadoo.fr/gmem/evenements/jim2002/articles/L17_Fober.pdf. It
discusses techniques for lock-free serialisation to concurrent objects;
there are some nice ideas in here.

Another theoretical paper:
http://www.research.ibm.com/people/m/michael/podc-2002.pdf.

http://msdn.microsoft.com/msdnmag/issues/01/08/Concur/default.aspx; talks
about accessing large data collections.

If you're XP or higher, look at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedpushentryslist.asp
for a lock free list call provided by Windows -- there are a huge number of
interlock primitives that Windows provides.

For Linux see http://lse.sourceforge.net/locking/rcupdate.html; this is (I
think) one of the technologies that SCO reckons IBM ripped off.

Lock-free processing is the way to go; spin locks are wasteful in the
extreme, especially in MPs with >2 processors. You may also like to think
about the algorithm you are using that needs to access the same data
structure from 2 tasks. Can it be written to queue work from many tasks to a
single task that does the work on the object?

--
Regards
Alex McDonald

SenderX

unread,

Oct 13, 2003, 12:26:28 AM10/13/03

to

> If you're XP or higher, look at
>
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/interlockedpushentryslist.asp
> for a lock free list call provided by Windows -- there are a huge number
of
> interlock primitives that Windows provides.

The SList API uses 64-bit swaps on push and pop! Not needed at all...

My following lock-free stack code is a little bit faster than the windows
SList API:

; Lock-Free Stack Push
; extern void acLfStackPush( ac_LfStack* pStack, ac_LfNode* pNode );
acLfStackPush PROC

; Load the front
mov edx, [esp + 4]
mov eax, [edx]

; Load the new node
mov ecx, [esp + 8]

acLfStackPush_Retry:

; Prepare the Cas
mov [ecx], eax

; Try and link the new node
cmpxchg [edx], ecx

jne acLfStackPush_Retry

ret

acLfStackPush ENDP

; Lock-Free Stack Pop
; extern ac_LfNode* acLfStackPop( ac_LfStack* pStack );
acLfStackPop PROC

; Preserve
push esi
push ebx

; Load the front
mov esi, [esp + 12]
mov edx, [esi + 4]
mov eax, [esi]

acLfStackPop_Retry:

; Validate
test eax, eax

je acLfStackPop_Fail

; Prepare the Cas
mov ebx, [eax]
lea ecx, [edx + 1]

; Try and unlink the front
cmpxchg8b qword ptr [esi]

jne acLfStackPop_Retry

acLfStackPop_Fail:

; Restore
pop ebx
pop esi

ret

acLfStackPop ENDP

P.S.

I have also written stable lock-free: queues, read-write locks, garbage
collectors, semaphores, ect...

If you want to see some more just ask.

;)

Alex McDonald

unread,

Oct 13, 2003, 10:05:52 PM10/13/03

to

"SenderX" <x...@xxx.xxx> wrote in message
news:Uimib.744700$YN5.690725@sccrnsc01...

Of course; the more snippets the better. These are techniques that everyone
should understand, if not use. But reverse engineering? Tsk, tsk. Although
your last example is a 64 bit swap... How about a FIFO queue example to
anser the OPs question?

--
Regards
Alex McDonald

SenderX

unread,

Oct 13, 2003, 11:57:30 PM10/13/03

to

> Of course; the more snippets the better. These are techniques that
everyone
> should understand, if not use.

For sure! lock-free == happy SMP's...

> But reverse engineering? Tsk, tsk.

;)

I actually used the IBM FreeList algo for the previous stack code. Its been
around since the early 80's. sys 370 I believe.

MS is using the IBM algo, but with a 64-bit CAS in the push function, which
is not needed...

I can give you some references on the IBM stuff.

> How about a FIFO queue example to
> anser the OPs question?

Sure.

Here are the structures used by the algo:

/* User types */
typedef volatile struct ac_SLfNode_* ac_LfNode;
typedef volatile struct ac_SLfStack_* ac_LfStack;
typedef volatile struct ac_SLfQueue_* ac_LfQueue;
typedef volatile struct ac_SLfGarbage_* ac_LfGarbage;

/* Pointer to a lock-free node */
typedef __declspec( align( 8 ) ) struct ac_SLfNodePtr_
{
volatile struct ac_SLfNode_* pNode;
unsigned int iAba;

} volatile ac_SLfNodePtr;

/* Lock-Free node object */
typedef struct ac_SLfNode_
{
ac_SLfNodePtr Next;
volatile struct ac_SLfNode_* pGarbageNext;
const void* pObj;
int iDestroyable;

} volatile ac_SLfNode;

/* Lock-Free stack */
typedef struct ac_SLfStack_
{
ac_SLfNodePtr Front;

} volatile ac_SLfStack;

/* LockFree queue */
typedef struct ac_SLfQueue_
{
ac_SLfNodePtr Front;
ac_SLfNodePtr Back;

} volatile ac_SLfQueue;

Here is the x86-asm:

; Lock-Free Queue Push
; extern void acLfQueuePush( ac_LfQueue* pQueue, ac_LfNode* pNode );
acLfQueuePush PROC

; Preserve
push edi
push esi
push ebx

; Load the back
mov esi, [esp + 16]
mov edx, [esi + 12]
mov edi, [esi + 8]

acLfQueuePush_Retry:

; Prepare the Cas
mov eax, 0
mov ebx, [esp + 20]

lea ecx, [edx + 1]

; Try to link the new node
cmpxchg [edi], ebx

je acLfQueuePush_Success

; Prepare the Cas
mov ebx, eax

mov eax, edi

; Try to advance the back to the next
cmpxchg8b qword ptr [esi + 8]

mov edi, eax

jmp acLfQueuePush_Retry

acLfQueuePush_Success:

; Prepare the Cas
mov eax, edi

; Try to point the back to the new node
cmpxchg8b qword ptr [esi + 8]

; Restore
pop ebx
pop esi

pop edi

ret

acLfQueuePush ENDP

; Lock-Free Queue Pop
; extern ac_LfNode* acLfQueuePop( ac_LfQueue* pQueue );
acLfQueuePop PROC

; Preserve
push edi
push esi
push ebx

mov esi, [esp + 16]

acLfQueuePop_Retry:

; Load the front
mov edx, [esi + 4]
mov edi, [esi]

; Load the back
mov ecx, [esi + 12]
mov ebx, [esi + 8]

; Load the next
mov eax, [edi]

; Validate
test eax, eax

je acLfQueuePop_Fail

cmp edi, [esi]

jne acLfQueuePop_Retry

cmp edx, [esi + 4]

jne acLfQueuePop_Retry

cmp edi, [esi + 8]

jne acLfQueuePop_TryPop

; Prepare the Cas
mov eax, edi
mov edx, ebx

mov ebx, eax
lea ecx, [edx + 1]

; Try to advance the back to the next
cmpxchg8b qword ptr [esi + 8]

jmp acLfQueuePop_Retry

acLfQueuePop_TryPop:

; Load the object
mov ecx, [eax + 12]

; Prepare the Cas
mov ebx, eax

mov eax, edi
mov edi, ecx

lea ecx, [edx + 1]

; Try to unlink the next node
cmpxchg8b qword ptr [esi]

jne acLfQueuePop_Retry

; Set the object on the popped node
mov [eax + 12], edi

acLfQueuePop_Fail:

; Restore
pop ebx
pop esi

pop edi

ret

acLfQueuePop ENDP

This was designed for 32-bit x86. If you want this to run on 64-bit, you
will have to tweak it a bit...

Enjoy!

P.S.

I have also written a 100% lock-free linked-list. You can delete items from
the middle of the list and have threads traversing the list, all
concurrently. Without and locks! Works great. It uses yet another IBM algo,
with a few tweaks that I added...

Lock-Free is the way to go.

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net

Matt Taylor

unread,

Oct 14, 2003, 1:48:57 AM10/14/03

to

"SenderX" <x...@xxx.xxx> wrote in message
news:Uimib.744700$YN5.690725@sccrnsc01...

> The SList API uses 64-bit swaps on push and pop! Not needed at all...

Hmm, I was wondering about that. I reimplemented SList as well because I
wanted to have an equivalent library that was portable to all POSIX
platforms, and the single-linked list implementation of a stack is quite
straightforward. I didn't need any 64-bit swaps...

I am curious about your pop entry, though. Why do you need to work around
the ABA problem? The user can only push & pop, right? I don't see how ABA is
a problem for the pop function.

-Matt

SenderX

unread,

Oct 14, 2003, 7:07:42 AM10/14/03

to

> /* LockFree queue */
> typedef struct ac_SLfQueue_
> {
> ac_SLfNodePtr Front;
> ac_SLfNodePtr Back;
>
> } volatile ac_SLfQueue;

The queue struct will need a single "dummy node" in order to operate.

NULL won't work for queues...

SenderX

unread,

Oct 14, 2003, 7:05:41 AM10/14/03

to

> I am curious about your pop entry, though. Why do you need to work around
> the ABA problem? The user can only push & pop, right? I don't see how ABA
is
> a problem for the pop function.

ABA is caused by rapid node reuse...

If you "reuse" any nodes, your algo will suffer from ABA. For sure.

Terje Mathisen

unread,

Oct 14, 2003, 9:56:27 AM10/14/03

to

Matt Taylor wrote:

> "SenderX" <x...@xxx.xxx> wrote in message
> news:Uimib.744700$YN5.690725@sccrnsc01...
>
>>The SList API uses 64-bit swaps on push and pop! Not needed at all...
>
>
> Hmm, I was wondering about that. I reimplemented SList as well because I
> wanted to have an equivalent library that was portable to all POSIX
> platforms, and the single-linked list implementation of a stack is quite
> straightforward. I didn't need any 64-bit swaps...

That's what I used to believe as well, but a comp.arch discussion
enlightened me:

It is possible (even if very unlikely) for a second process to free a
node, and then reuse it (either by itself or from some third process),
reinsert, and thereby causing the compare part to succed, even though
the link isn't really valid anymore.

This scenario is quite contrived, but the reason for Intel's added
64-bit CAS version, probably because IBM asked them to.

Making really general lock-free data structures is actually _very_ hard,
a _lot_ harder than it would seem. :-)

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Alex McDonald

unread,

Oct 14, 2003, 7:33:39 PM10/14/03

to

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:bmgh8d$fjh$1...@osl016lin.hda.hydro.com...

>From experience; working on an IBM 3038 (MP) with ROSCOE (a single task
realtime terminal system that implemented its own work queues, separate task
for I/O) with 400 users and incorrect swapping of queue headers caused, on
average, a failure every day or so. The queue headers were 32 bit pointers
and CS (compare and swap) rather than a full 64 bit CDS (double) on a
pointer and counted key. With correction the failure rate had become one in
2^32-1 days. Or so I (mis?)calculated then...

I'm also sure SenderX had a long & heated discusssion under some other
handle on an Intel bulletin board some time back about the validity of the
assumptions on a 32bit swap. Is it the same person?

--
Regards
Alex McDonald