InterlockedExchange implementation

Oleh Derevenko

unread,

Apr 14, 2008, 4:13:10 AM4/14/08

to

Hi All,

Does anybody know why InterlockedExchange is implemented as

mov ecx,[esp+4]
mov edx,[esp+8]
mov eax,[ecx]
@retry:
lock cmpxchg [ecx],edx
jne @retry
ret 8

Instead of simply
mov ecx,[esp+4]
mov eax,[esp+8]
xchg [ecx], eax
ret 8

?

xchg automatically assets LOCK# signal and is atomic. What to make loops
for?

Oleh Derevenko

Message has been deleted

Oleh Derevenko

unread,

Apr 14, 2008, 9:34:57 AM4/14/08

to

"Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message
news:ftvkuu$353$1...@worf.visyn.net...

>> Does anybody know why InterlockedExchange is implemented as
>> mov ecx,[esp+4]
>> mov edx,[esp+8]
>> mov eax,[ecx]
>> @retry:
>> lock cmpxchg [ecx],edx
>> jne @retry
>> ret 8
>

> Where did you find this? On my WinXP/SP2 machine with all current
> hotfixes applied the disassembly of InterlockedCompareExchange is
> this:

Try to find 7 differences in the following function names, please:
MY: InterlockedExchange
YOUR: InterlockedCompareExchange

Message has been deleted

Oleh Derevenko

unread,

Apr 14, 2008, 9:56:57 AM4/14/08

to

I suppose I know the answer. It may be that simple xchg instruction does not
properly synchronize processor caches on some multiprocessor platforms,
while cmpxchg does.

--

Oleh Derevenko
-- ICQ: 36361783

"Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message

news:ftvn3q$4jn$1...@worf.visyn.net...
> Oleh Derevenko schrieb:

>
>> Try to find 7 differences in the following function names, please:
>> MY: InterlockedExchange
>> YOUR: InterlockedCompareExchange
>

> Ok, didn't know that there's a InterlockedExchange.

h...@40th.com

unread,

Apr 14, 2008, 12:06:49 PM4/14/08

to

Maybe a bug in a still-around intel cpu
or who knows why; maybe I did a long time
ago, or, just as maybe, not. I poked around
dejanews and these are the first poppers:

http://groups.google.com/group/microsoft.public.win32.programmer.kernel/
browse_thread/thread/522529961143de56/bcc566a1bab3dcdf

but then w2k didn't?

http://groups.google.com/group/comp.os.ms-windows.programmer.win32/
browse_thread/thread/36e2c3eda1cbdc0b/0ab4655dea711637

I give the benefit of any doubt to those
that do it this way. The two news threads
above are very old. It's not hard to _asm
your own if you'd rather.

--
40th Floor - Software @ http://40th.com/
iplay.40th.com - Advanced PPC audio player
phantasm.40th.com - The final destination

Chris Thomasson

unread,

Apr 14, 2008, 3:54:10 PM4/14/08

to

"Oleh Derevenko" <od...@eleks.lviv.ua> wrote in message
news:fu02nk$2ee2$1...@news.uar.net...

>I suppose I know the answer. It may be that simple xchg instruction does
>not properly synchronize processor caches on some multiprocessor platforms,
>while cmpxchg does.

Simple XCHG is perfectly fine and totally compatible with
InterlockedExchange. I have no idea why Microsoft would implement it without
XCHG. Its not a bug to implement it with LOCK CMPXCHG, its just a "retarded"
way of doing it. FWIW, I implement the 'ac_i686_atomic_xchg_fence' in my
AppCore library using XCHG:

http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_masm_asm.html
(its the fifth function from the bottom...)

Your correct that XCHG automatically asserts the LOCK prefix. For concrete
proof, take a look at this:

http://www.intel.com/products/processor/manuals/318147.pdf
(under the '1 Instructions and memory accesses' section...)

I have no idea why Microsoft would use CAS to implement a SWAP. Weird!

:^/

Message has been deleted

Chris Thomasson

unread,

Apr 16, 2008, 2:46:59 AM4/16/08

to

"Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message

news:fu2469$pu5$1...@worf.visyn.net...
> Oleh Derevenko schrieb:

>
>
>> Try to find 7 differences in the following function names, please:
>> MY: InterlockedExchange
>> YOUR: InterlockedCompareExchange
>

> I've seen that you wrote InterlockedExchange but as I didn't know
> that there's InterlockedExchange and there's little use for such
> a function, I thought you meant InterlockedCompareExchange.

WHAT? There is little use for InterlockedExchange!? Are you fuc%#ng
serious!? I can think of many different ways to use it. In fact I have used
it for many diverse non-blocking algorithms. I don't even want to list the
ways yet until I find out that your are tolling!!!?

;^/

Chris Thomasson

unread,

Apr 16, 2008, 2:48:13 AM4/16/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:neednbLH9LPqD5jV...@comcast.com...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I don't even want to list the ways yet until I find out that your are

__NOT__ tolling!!!?

>
> ;^/

Chris Thomasson

unread,

Apr 16, 2008, 2:49:47 AM4/16/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:lPudnchqsvtcD5jV...@comcast.com...

find and replace "tolling" with "trolling".

I need to calm down!

ACK! ;^(...

Sorry about all that non-sense.

Message has been deleted

Chris Thomasson

unread,

Apr 16, 2008, 1:25:50 PM4/16/08

to

"Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message

news:fu4m7s$nkq$2...@worf.visyn.net...
> Chris Thomasson schrieb:
>
>> WHAT? There is little use for InterlockedExchange!? ...
>
> I forgot that I'm talking to idiots.

It's not healthy to talk to yourself. Anyway, why do you think that there is
little use for InterlockedExchange?

Sebastian G.

unread,

Apr 16, 2008, 1:08:23 PM4/16/08

to

Chris Thomasson wrote:

Because data exchange normally works quite well without any need for
synchronization? And almost any synchronization primitive is much more
efficiently implemented with compare-and-exchange or fetch-and-add?

Chris Thomasson

unread,

Apr 16, 2008, 2:42:55 PM4/16/08

to

"Sebastian G." <se...@seppig.de> wrote in message
news:66mq4lF...@mid.dfncis.de...

> Chris Thomasson wrote:
>
>> "Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message
>> news:fu4m7s$nkq$2...@worf.visyn.net...
>>> Chris Thomasson schrieb:
>>>
>>>> WHAT? There is little use for InterlockedExchange!? ...
>>> I forgot that I'm talking to idiots.
>>
>> It's not healthy to talk to yourself. Anyway, why do you think that there
>> is
>> little use for InterlockedExchange?
>
>
> Because data exchange normally works quite well without any need for
> synchronization?

SWAP can be an essential tool for implementing wait-free algorithms. IMO,
Elcaro's assertion that there is little use for InterlockedExchange is
totally false.

> And almost any synchronization primitive is much more efficiently
> implemented with compare-and-exchange

Using CAS implies a loop; which is only lock-free. However, SWAP is usually
implemented as a wait-free operation; there is a _very_ big difference
between lock-free and wait-free. You generally cannot use CAS in wait-free
algorithms because of the loop that goes along with most usage cases.

SWAP is an important tool for me. One simple example, you can use it to get
rid of the ABA problem in lock-free stacks. The side-effect of using SWAP
transforms a lock-free consumer to a wait-free one; a huge benefit. I an
give an example if you want...

> or fetch-and-add?

FAA is great; Intel has a wait-free version of it (e.g., XADD).

Chris Thomasson

unread,

Apr 16, 2008, 3:52:46 PM4/16/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:s6SdnWC8M-7cp5vV...@comcast.com...

>
> "Sebastian G." <se...@seppig.de> wrote in message
> news:66mq4lF...@mid.dfncis.de...
>> Chris Thomasson wrote:
>>
>>> "Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message
>>> news:fu4m7s$nkq$2...@worf.visyn.net...
>>>> Chris Thomasson schrieb:
>>>>
>>>>> WHAT? There is little use for InterlockedExchange!? ...
>>>> I forgot that I'm talking to idiots.
>>>
>>> It's not healthy to talk to yourself. Anyway, why do you think that
>>> there is
>>> little use for InterlockedExchange?
>>
>>
>> Because data exchange normally works quite well without any need for
>> synchronization?
>
> SWAP can be an essential tool for implementing wait-free algorithms. IMO,
> Elcaro's assertion that there is little use for InterlockedExchange is
> totally false.
>
>> And almost any synchronization primitive is much more efficiently
>> implemented with compare-and-exchange

Your assertion is incorrect.

[...]

Try to get a "much more" efficient version of the following synchronization
algorithms using nothing but CAS:

Please note that the non-blocking lifo example uses the IBM-Style CAS which
automatically updates the comprand on failure...

<code-sketch of ABA-free non-blocking lifo w/ wait-free consumer>
__________________________________________________________________
struct node {
node* nx;
};

struct nblifo {
node* hd; /* initialize to NULL */
};

void
nblifo_produce(
nblifo* const _this,
node* const n
) {
n->nx = _this->hd;
while (! CAS(&_this->hd, &n->nx, n));
}

node*
nblifo_swap(
nblifo* const _this,
node* const n
) {
return SWAP(&_this->hd, n);
}

#define nblifo_consume(mp_this) nblifo_swap((mp_this), NULL)
__________________________________________________________________

<code-sketch of ABA-free non-blocking fifo w/ wait-free consumer>
__________________________________________________________________
typedef nblifo nbfifo;

node*
nbfifo_sys_transform(
node* _this
) {
if (_this) {
node* fifo = NULL;
while (_this) {
node* const n = _this;
_this = n->nx;
n->nx = fifo;
fifo = n;
}
return fifo;
}
return NULL;
}

#define nbfifo_produce nblifo_produce

#define nbfifo_swap(mp_this, mp_n) \
nbfifo_sys_transform(nblifo_swap((mp_this), (mp_n)))

#define nbfifo_consume(mp_this) nbfifo_swap((mp_this), NULL)
__________________________________________________________________

<code-sketch of mutex w/ wait-free fast-paths>
__________________________________________________________________
struct fpmutex {
word st; /* initialize to 0 */
osevent ws;
};

void fpmutex_lock(
fpmutex* const _this
) {
if (SWAP(&_this->st, 1)) {
while (SWAP(&_this->st, 2)) {
osevent_wait(&_this->ws);
}
}
}

void fpmutex_unlock(
fpmutex* const _this
) {
if (SWAP(&_this->st, 0) == 2) {
osevent_set(&_this->ws);
}
}
__________________________________________________________________

<code-sketch of event w/ wait-free fast-paths>
__________________________________________________________________
struct fpevent {
word st; /* initialize to 0 */
osevent ws;
};

void fpevent_wait(
fpevent* const _this
) {
while (SWAP(&_this->st, 0)) {
osevent_wait(&_this->ws);
}
}

void fpevent_set(
fpevent* const _this
) {
if (! SWAP(&_this->st, 1)) {
osevent_set(&_this->ws);
}
}
__________________________________________________________________

How could you make much more efficient versions of those algorithms using
nothing but CAS? Please, do tell... I am _very_ interested indeed...

Chris Thomasson

unread,

Apr 16, 2008, 4:00:02 PM4/16/08

to

[...]

> <code-sketch of event w/ wait-free fast-paths>
> __________________________________________________________________
> struct fpevent {
> word st; /* initialize to 0 */
> osevent ws;
> };
>

ARGHGHG!!!!

> void fpevent_wait(
> fpevent* const _this
> ) {
> while (SWAP(&_this->st, 0)) {
> osevent_wait(&_this->ws);
> }
> }

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above is incorrect! The while loop needs to check for zero, NOT
non-zero!!! Here, let me fix that:

void fpevent_wait(
fpevent* const _this
) {

while (! SWAP(&_this->st, 0)) {
osevent_wait(&_this->ws);
}
}

Sorry about that non-sense typo! ;^(...

Dmitriy V'jukov

unread,

Apr 17, 2008, 1:55:34 PM4/17/08

to

On Apr 16, 10:42 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Sebastian G." <se...@seppig.de> wrote in message

> Using CAS implies a loop; which is only lock-free. However, SWAP is usually

> implemented as a wait-free operation; there is a _very_ big difference
> between lock-free and wait-free. You generally cannot use CAS in wait-free
> algorithms because of the loop that goes along with most usage cases.

Here is another important moment. When using CAS you usually have to
fetch current value in the beginning. And only after that write new
value with CAS. This means that cache-line requested in shared/owned
state at first, and then cache-line requested one more time in
modified state.
I observed performance/scaling degradation by factor of 3 caused by
CAS reties, and by factor of 3 once again caused by additional cache-
line transfers. As compared with XCHG. On Intel Q6600.
So in my heavy benchmark XCHG was scaling as 0.30, and CAS was scaling
as 0.05 (linear scaling is 4.00).

Dmitriy V'jukov

Oleh Derevenko

unread,

Apr 17, 2008, 2:52:25 PM4/17/08

to

"Chris Thomasson" wrote in message ...

>>I suppose I know the answer. It may be that simple xchg instruction does
>>not properly synchronize processor caches on some multiprocessor
>>platforms, while cmpxchg does.
>
> Simple XCHG is perfectly fine and totally compatible with
> InterlockedExchange. I have no idea why Microsoft would implement it
> without XCHG. Its not a bug to implement it with LOCK CMPXCHG, its just a
> "retarded" way of doing it. FWIW, I implement the
> 'ac_i686_atomic_xchg_fence' in my AppCore library using XCHG:
>

I do not remember the link exatly, but once I read an atricle in MSDN where
it was told that interlocked functions serve for one important purpose if
compared to simple assignment. If there are two processors each with its own
cache, and if you do simple memory assignment, second processor can still
use old value from its cache even if it reads the memory after the first
processor has made the assignment. That is, the cache of second processor is
not synchronized with memory modifications made by first one. However, when
interlocked functions are used, it is guarranteed that all the processors'
caches will be in sync with the assignment and will return new value. So, my
assumption is, maybe simple xchg does not exhibit this feature? Maybe it is
necessary to use cmpxchg to achieve cache synchronization? Does anybody know
it?

Intel's manual on IA-32 architecture does not contain any remarks regarding
cache synchronization neither in description of xchg nor in cmpxchg. :(

Oleh Derevenko

Chris Thomasson

unread,

Apr 17, 2008, 3:24:16 PM4/17/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:97e4e402-25af-4527...@m44g2000hsc.googlegroups.com...

Right. Good point indeed. In some algorithms, you can use a fixed comprand
for CAS e.g.:

__________________________________________________________________
struct fpevent {
word st; /* initialize to 0 */
osevent ws;
};

void fpevent_wait(
fpevent* const _this
) {
while (! CAS(&_this->st, 1, 0)) {
osevent_wait(&_this->ws);
}
}

void fpevent_set(
fpevent* const _this
) {

if (CAS(&_this->st, 0, 1)) {
osevent_set(&_this->ws);
}
}
__________________________________________________________________

That helps, but it still more expensive than the SWAP version because its
more "complicated" to implement a CAS in hardware. There are three
parameters to work with, and there is the compare. SWAP is two parameters
and the operation is unconditional; SWAP is a winner. One thing that makes
me angry is that SUN depreciated the SWAP instruction which forces me to use
CAS to implement a simple SWAP; CAS is no substitute for a SWAP! IMHO, SUN
made a big mistake.

:^o

Chris Thomasson

unread,

Apr 17, 2008, 3:44:12 PM4/17/08

to

"Oleh Derevenko" <od...@eleks.lviv.ua> wrote in message

news:fu8h5p$2aec$1...@news.uar.net...

>
> "Chris Thomasson" wrote in message ...
>
>>>I suppose I know the answer. It may be that simple xchg instruction does
>>>not properly synchronize processor caches on some multiprocessor
>>>platforms, while cmpxchg does.
>>
>> Simple XCHG is perfectly fine and totally compatible with
>> InterlockedExchange. I have no idea why Microsoft would implement it
>> without XCHG. Its not a bug to implement it with LOCK CMPXCHG, its just a
>> "retarded" way of doing it. FWIW, I implement the
>> 'ac_i686_atomic_xchg_fence' in my AppCore library using XCHG:
>>
>
> I do not remember the link exatly, but once I read an atricle in MSDN
> where it was told that interlocked functions serve for one important
> purpose if compared to simple assignment. If there are two processors each
> with its own cache, and if you do simple memory assignment, second
> processor can still use old value from its cache even if it reads the
> memory after the first processor has made the assignment. That is, the
> cache of second processor is not synchronized with memory modifications
> made by first one. However, when interlocked functions are used, it is
> guarranteed that all the processors' caches will be in sync with the
> assignment and will return new value. So, my assumption is, maybe simple
> xchg does not exhibit this feature? Maybe it is necessary to use cmpxchg
> to achieve cache synchronization? Does anybody know it?

Microsoft documentation on memory models are notoriously bad. Intel finally
released a description of their memory-model in the following paper:

http://www.intel.com/products/processor/manuals/318147.pdf

Read all. You will find that on x86, a store followed by a load from another
location can indeed be reordered, e.g.:

1: MOV [EAX], EDX
2: MOV EDX, [ECX]

If EAX and ECX point to different locations, then the processor can reorder
the sequence such that 2 is executed before 1. If you want to ensure that 2
does not rise above 1, the you need to do something like:

1: MOV [EAX], EDX
2: MFENCE
3: MOV EDX, [ECX]

> Intel's manual on IA-32 architecture does not contain any remarks
> regarding cache synchronization neither in description of xchg nor in
> cmpxchg. :(

The white paper basically says that the only memory-barrier in x86 that is
not already implied is #StoreLoad | #StoreStore. In other words, it
describes a TSO memory model. Here is a sketch of how the current x86 memory
model works, according to the paper:

void* LOAD(void** p) {
void* v = *p;
membar #LoadStore | #LoadLoad;
return v;
}

void STORE(void** p, void* v) {
membar #LoadStore | #StoreStore;
*p = v;
}

That works find for a producer/consumer model, but does not work for others.
For instance, you need to execute a #StoreLoad | #StoreStore barrier in
Petersons algorithm. Here is an example of that:

http://groups.google.com/group/comp.programming.threads/msg/1e45b4b16bad9784

This requires either a MFENCE instruction, or an XCHG to a dummy location to
get be the #StoreLoad ordering.

Oleh Derevenko

unread,

Apr 18, 2008, 8:04:03 AM4/18/08

to

----- Original Message -----
From: "Chris Thomasson"

>> assignment and will return new value. So, my assumption is, maybe simple
>> xchg does not exhibit this feature? Maybe it is necessary to use cmpxchg
>> to achieve cache synchronization? Does anybody know it?
>
> Microsoft documentation on memory models are notoriously bad. Intel
> finally released a description of their memory-model in the following
> paper:
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> Read all. You will find that on x86, a store followed by a load from
> another location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can
> reorder the sequence such that 2 is executed before 1. If you want to
> ensure that 2 does not rise above 1, the you need to do something like:
>
> 1: MOV [EAX], EDX
> 2: MFENCE
> 3: MOV EDX, [ECX]
>

You do not read what I write. I told about cache synchronization in two
physically separate processors, not about read-write reordering in a single
processor.
Though, it could be the case that for true multi-processor system
implementation of InterlockedExchange could be even different, just like it
differs for single-core machine and multi-core machine.

Oleh Derevenko

Sebastian G.

unread,

Apr 18, 2008, 9:25:30 AM4/18/08

to

Chris Thomasson wrote:

> Read all. You will find that on x86, a store followed by a load from another
> location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can reorder
> the sequence such that 2 is executed before 1. If you want to ensure that 2
> does not rise above 1, the you need to do something like:
>
> 1: MOV [EAX], EDX
> 2: MFENCE
> 3: MOV EDX, [ECX]

Bullshit, since EDX is a dependency which is why these instructions can't be
reordered.

Sebastian G.

unread,

Apr 18, 2008, 9:29:26 AM4/18/08

to

Chris Thomasson wrote:

> Read all. You will find that on x86, a store followed by a load from another
> location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can reorder
> the sequence such that 2 is executed before 1.

Bullshit, since EDX is a WAR dependency.

Sebastian G.

unread,

Apr 18, 2008, 9:37:54 AM4/18/08

to

Chris Thomasson wrote:

> Read all. You will find that on x86, a store followed by a load from another
> location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can reorder
> the sequence such that 2 is executed before 1. If you want to ensure that 2
> does not rise above 1, the you need to do something like:
>
> 1: MOV [EAX], EDX
> 2: MFENCE
> 3: MOV EDX, [ECX]

May I call bullshit?

Chris Thomasson

unread,

Apr 18, 2008, 2:23:46 PM4/18/08

to

"Sebastian G." <se...@seppig.de> wrote in message

news:66rlqoF...@mid.dfncis.de...

x86 can and will reorder the following:

store(&loc1, 0);
load(&loc2);

Read:

http://groups.google.com/group/comp.programming.threads/msg/731cd06e8c04f744

and

http://www.intel.com/products/processor/manuals/318147.pdf

and you will learn that x86 WILL reorder load after store to another
location.

Chris Thomasson

unread,

Apr 18, 2008, 2:27:50 PM4/18/08

to

"Oleh Derevenko" <od...@eleks.lviv.ua> wrote in message

news:fuadk4$2tn5$1...@news.uar.net...

> ----- Original Message -----
> From: "Chris Thomasson"
>
>>> assignment and will return new value. So, my assumption is, maybe simple
>>> xchg does not exhibit this feature? Maybe it is necessary to use cmpxchg
>>> to achieve cache synchronization? Does anybody know it?
>>
>> Microsoft documentation on memory models are notoriously bad. Intel
>> finally released a description of their memory-model in the following
>> paper:
>>
>> http://www.intel.com/products/processor/manuals/318147.pdf
>>
>> Read all. You will find that on x86, a store followed by a load from
>> another location can indeed be reordered, e.g.:
>>
>> 1: MOV [EAX], EDX
>> 2: MOV EDX, [ECX]
>>
>> If EAX and ECX point to different locations, then the processor can
>> reorder the sequence such that 2 is executed before 1. If you want to
>> ensure that 2 does not rise above 1, the you need to do something like:
>>
>> 1: MOV [EAX], EDX
>> 2: MFENCE
>> 3: MOV EDX, [ECX]
>>
> You do not read what I write. I told about cache synchronization in two
> physically separate processors, not about read-write reordering in a
> single
> processor.

The only time you need a memory-barrier on x86 is to prevent
store-after-load reordering. The behavour that your interested in is the
memory model; the cache has nothing to do with it.

http://groups.google.com/group/comp.programming.threads/msg/17cd775d97150096
(read all)

> Though, it could be the case that for true multi-processor system
> implementation of InterlockedExchange could be even different, just like
> it
> differs for single-core machine and multi-core machine.

The XCHG instruction has implicit LOCK prefix which is analogous to full
memory-barrier.

Chris Thomasson

unread,

Apr 18, 2008, 2:30:24 PM4/18/08

to

"Sebastian G." <se...@seppig.de> wrote in message

news:66rmi1F...@mid.dfncis.de...

So, accoriding to you, one could implment SMR on x86 without using a
#StoreLoad? Here is code for a hazard pointer load:

ac_i686_lfgc_smr_activate PROC
mov edx, [esp + 4]
mov ecx, [esp + 8]

ac_i686_lfgc_smr_activate_reload:
mov eax, [ecx]
mov [edx], eax
mfence
cmp eax, [ecx]
jne ac_i686_lfgc_smr_activate_reload
ret
ac_i686_lfgc_smr_activate ENDP

Why do you think that the MFENCE does not need to be in there?

Chris Thomasson

unread,

Apr 18, 2008, 3:07:49 PM4/18/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:8rmdnQyLu4bDdZXV...@comcast.com...

> "Sebastian G." <se...@seppig.de> wrote in message
> news:66rmi1F...@mid.dfncis.de...
>> Chris Thomasson wrote:
>>
>>
>>> Read all. You will find that on x86, a store followed by a load from
>>> another location can indeed be reordered, e.g.:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MOV EDX, [ECX]
>>>
>>> If EAX and ECX point to different locations, then the processor can
>>> reorder the sequence such that 2 is executed before 1. If you want to
>>> ensure that 2 does not rise above 1, the you need to do something like:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MFENCE
>>> 3: MOV EDX, [ECX]
>>
>>
>> May I call bullshit?
>
> So, accoriding to you, one could implment SMR on x86 without using a
> #StoreLoad? Here is code for a hazard pointer load:

[...]

Here is reference to SMR:

http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf

BTW Sebastian, if you found a way to implement this algorithm on x86 without
using a MFENCE or synchronization epoch detection, well, PLEASE POST IT! I
would be VERY interested.

Chris Thomasson

unread,

Apr 18, 2008, 3:11:24 PM4/18/08

to

"Sebastian G." <se...@seppig.de> wrote in message

news:66rm24F...@mid.dfncis.de...

Let me add another moment in the example:

; ECX = Location1
; EAX = Location2

1: MOV EDX, [ECX] ; load from Location1
2: MOV [EAX], EDX ; store to Location2
3: MOV EDX, [ECX] ; load from Location1

Why do you think that steps 2 and 3 will not be reordered? According to this
paper:

http://www.intel.com/products/processor/manuals/318147.pdf

Your wrong.

Alexander Terekhov

unread,

Apr 18, 2008, 3:31:18 PM4/18/08

to

Chris Thomasson wrote:
[...]

> The XCHG instruction has implicit LOCK prefix which is analogous to full
> memory-barrier.

That's a rumor. Consider that

exclusive_lock(location);
r = load(location);
store(location,x);
exclusive_unlock(location);
return r;

isn't really analogous to a full memory-barrier under PC memory model at
all.

A fully memory-barrier'd LOCK XCHG must be

MFENCE();
exclusive_lock(location);
r = load(location);
store(location,x);
exclusive_unlock(location);
MFENCE();
return r;

I've been trying to figure out which of above MFENCEs is actually
guaranteed (from the docs) to be provided by XCHG to no avail.

regards,
alexander.

Sebastian G.

unread,

Apr 18, 2008, 4:06:21 PM4/18/08

to

Chris Thomasson wrote:

> "Sebastian G." <se...@seppig.de> wrote in message
> news:66rm24F...@mid.dfncis.de...
>> Chris Thomasson wrote:
>>
>>
>>> Read all. You will find that on x86, a store followed by a load from
>>> another location can indeed be reordered, e.g.:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MOV EDX, [ECX]
>>>
>>> If EAX and ECX point to different locations, then the processor can
>>> reorder the sequence such that 2 is executed before 1.
>>
>> Bullshit, since EDX is a WAR dependency.
>
> Let me add another moment in the example:
>
>
> ; ECX = Location1
> ; EAX = Location2
>
>
>
> 1: MOV EDX, [ECX] ; load from Location1
> 2: MOV [EAX], EDX ; store to Location2
> 3: MOV EDX, [ECX] ; load from Location1
>
>
>
> Why do you think that steps 2 and 3 will not be reordered?

Because it's a Write-After-Read dependency on EDX.

> According to this

> paper:
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> Your wrong.

According to this paper you're wrong, since it only deals with memory
ordering dependency, not register dependency. BTW, could you please put your
notions into multiprocessor parallel execution? Since for single process
stuff, you can't be any more obviously wrong.

Sebastian G.

unread,

Apr 18, 2008, 4:11:17 PM4/18/08

to

Chris Thomasson wrote:

> "Sebastian G." <se...@seppig.de> wrote in message
> news:66rmi1F...@mid.dfncis.de...
>> Chris Thomasson wrote:
>>
>>
>>> Read all. You will find that on x86, a store followed by a load from
>>> another location can indeed be reordered, e.g.:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MOV EDX, [ECX]
>>>
>>> If EAX and ECX point to different locations, then the processor can
>>> reorder the sequence such that 2 is executed before 1. If you want to
>>> ensure that 2 does not rise above 1, the you need to do something like:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MFENCE
>>> 3: MOV EDX, [ECX]
>>
>> May I call bullshit?
>
> So, accoriding to you, one could implment SMR on x86 without using a
> #StoreLoad?

No, just your example is broken. EDX is a dependency, so within one thread,
it can't be reordered. On multiple thread, it might be executed
sequentially, that is

1: MOV [EAX], EDX
2: MOV EDX, [ECX]
1: MOV [EAX], EDX
2: MOV EDX, [ECX]

And even then there's only a potential ordering problem when ECX#1 and EAX#1
are identical, and the MFence wouldn't help with that.

Chris Thomasson

unread,

Apr 18, 2008, 4:47:03 PM4/18/08

to

"Alexander Terekhov" <tere...@web.de> wrote in message
news:4808F706...@web.de...

>
> Chris Thomasson wrote:
> [...]
>> The XCHG instruction has implicit LOCK prefix which is analogous to full
>> memory-barrier.
>
> That's a rumor.

On page 6 in the following paper:

http://www.intel.com/products/processor/manuals/318147.pdf

"All locked instructions (the implicitly locked xchg instruction and other
read-modify-write
instructions with a lock prefix) are an indivisible and uninterruptible
sequence of load(s)
followed by store(s) regardless of memory type and alignment."

AFAICT, this means that XCHG implicitly asserts the LOCK prefix. What am I
missing here Alex?

> Consider that
>
> exclusive_lock(location);
> r = load(location);
> store(location,x);
> exclusive_unlock(location);
> return r;
>
> isn't really analogous to a full memory-barrier under PC memory model at
> all.

Can subsequent operations be hoisted up above an XCHG? Also, can preceding
operations sink below it?

> A fully memory-barrier'd LOCK XCHG must be
>
> MFENCE();
> exclusive_lock(location);
> r = load(location);
> store(location,x);
> exclusive_unlock(location);
> MFENCE();
> return r;
>
> I've been trying to figure out which of above MFENCEs is actually
> guaranteed (from the docs) to be provided by XCHG to no avail.

AFAICT, the trailing MFENCE seems to have to be there. If it was not, I
don't see how you could correctly implement a spinlock with XCHG.

Chris Thomasson

unread,

Apr 18, 2008, 4:50:20 PM4/18/08

to

"Sebastian G." <se...@seppig.de> wrote in message

news:66sdacF...@mid.dfncis.de...

I am sorry for not clarifying that I was writing about multiple CPU's
executing the sequence concurrently.

Chris Thomasson

unread,

Apr 18, 2008, 4:58:38 PM4/18/08

to

"Sebastian G." <se...@seppig.de> wrote in message

news:66sdjlF...@mid.dfncis.de...

> Chris Thomasson wrote:
>
>> "Sebastian G." <se...@seppig.de> wrote in message
>> news:66rmi1F...@mid.dfncis.de...
>>> Chris Thomasson wrote:
>>>
>>>
>>>> Read all. You will find that on x86, a store followed by a load from
>>>> another location can indeed be reordered, e.g.:
>>>>
>>>> 1: MOV [EAX], EDX
>>>> 2: MOV EDX, [ECX]
>>>>
>>>> If EAX and ECX point to different locations, then the processor can
>>>> reorder the sequence such that 2 is executed before 1. If you want to
>>>> ensure that 2 does not rise above 1, the you need to do something like:
>>>>
>>>> 1: MOV [EAX], EDX
>>>> 2: MFENCE
>>>> 3: MOV EDX, [ECX]
>>>
>>> May I call bullshit?
>>
>> So, accoriding to you, one could implment SMR on x86 without using a
>> #StoreLoad?
>
>
> No, just your example is broken.

Ahh. Okay, I should have just used the following example:

op(X).release // store
op(Y).acquire // load

doesn't prevent

op(Y).acquire // load
op(X).release // store

reordering.

You need to do:

op(X).release // store
mfence
op(Y).acquire // load

in order to prevent the reordering from occurring.

> EDX is a dependency, so within one thread, it can't be reordered. On
> multiple thread, it might be executed sequentially, that is
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> And even then there's only a potential ordering problem when ECX#1 and
> EAX#1 are identical, and the MFence wouldn't help with that.

/*
AC_SYS_APIEXPORT ac_i686_node_t* AC_CDECL
ac_i686_lfgc_smr_activate
( ac_i686_node_t** hazard_ptr,
ac_i686_node_t** shared_loc );
*/

ac_i686_lfgc_smr_activate PROC
mov edx, [esp + 4]
mov ecx, [esp + 8]

ac_i686_lfgc_smr_activate_reload:
mov eax, [ecx]
mov [edx], eax
mfence
cmp eax, [ecx]
jne ac_i686_lfgc_smr_activate_reload
ret
ac_i686_lfgc_smr_activate ENDP

I have to use the MFENCE because I need to ensure that load-after-store
reordering does not occur. Do you disagree?

Chris Thomasson

unread,

Apr 18, 2008, 5:03:22 PM4/18/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:9-CdndRmIJLCl5TV...@comcast.com...

I am also sorry for not clarifying that I was writing about memory ordering.

Chris Thomasson

unread,

Apr 18, 2008, 5:04:35 PM4/18/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:HIGdnUc6Icr5lZTV...@comcast.com...

What do you take away from page 13?

[...]

Sebastian G.

unread,

Apr 18, 2008, 5:10:59 PM4/18/08

to

Chris Thomasson wrote:

> I have to use the MFENCE because I need to ensure that load-after-store
> reordering does not occur. Do you disagree?

No, in this example it's clear.

Chris Thomasson

unread,

Apr 18, 2008, 5:19:50 PM4/18/08

to

"Sebastian G." <se...@seppig.de> wrote in message

news:66sh3iF...@mid.dfncis.de...

Sorry for not making myself clear. I did not mean to waste your time
Sebastian!

;^(

Alexander Terekhov

unread,

Apr 18, 2008, 5:35:31 PM4/18/08

to

Chris Thomasson wrote:
[...]

> AFAICT, the trailing MFENCE seems to have to be there. If it was not, I
> don't see how you could correctly implement a spinlock with XCHG.

Think of PC's naked load as

shared_lock(location); // rwlock_rdlock
r = load(location);
shared_unlock(location); // rwlock_unlock
return r;

and XGHG as

exclusive_lock(location); // rwlock_wrlock
r = load(location);
store(location,x);
exclusive_unlock(location); // rwlock_unlock
return r;

Where's MFENCE?

PC's naked store:

exclusive_lock(location); // rwlock_wrlock
store(location,x);
exclusive_unlock(location); // rwlock_unlock

regards,
alexander.

Chris Thomasson

unread,

Apr 18, 2008, 6:05:54 PM4/18/08

to

"Alexander Terekhov" <tere...@web.de> wrote in message

news:48091423...@web.de...

>
> Chris Thomasson wrote:
> [...]
>> AFAICT, the trailing MFENCE seems to have to be there. If it was not, I
>> don't see how you could correctly implement a spinlock with XCHG.
>
> Think of PC's naked load as
>
> shared_lock(location); // rwlock_rdlock
> r = load(location);
> shared_unlock(location); // rwlock_unlock
> return r;
>
> and XGHG as
>
> exclusive_lock(location); // rwlock_wrlock
> r = load(location);
> store(location,x);
> exclusive_unlock(location); // rwlock_unlock
> return r;
>
> Where's MFENCE?

Well, since page 6 in the following paper:

http://www.intel.com/products/processor/manuals/318147.pdf

states that XCHG implicitly asserts the LOCK prefix, well, I would assume
that the MFENCE would be right before the unlock:

exclusive_lock(location); // rwlock_wrlock
r = load(location);
store(location,x);

MFENCE;

exclusive_unlock(location); // rwlock_unlock
return r;

If its not there, well, then the Intel documentation is totally incorrect
and would mean that the following spinlock code would be 100% broken all to
hell:

/* void spinlock_acquire(atomicword* shared_location); */
spinlock_acquire PROC
MOV ECX, [ESP + 4]

spinlock_acquire_retry:
MOV EAX, 1
XCHG [ECX], EAX
TEST EAX, EAX
JE spinlock_acquire_failed
RET

spinlock_acquire_failed:
PAUSE
JMP spinlock_acquire_retry
spinlock_acquire ENDP

/* void spinlock_release(atomicword* shared_location); */
spinlock_release PROC
MOV ECX, [ESP + 4]
MOV EAX, 0
MOV [ECX], EAX
RET
spinlock_release ENDP

Chris Thomasson

unread,

Apr 18, 2008, 6:20:25 PM4/18/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:PvidnZRcxfxDh5TV...@comcast.com...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ARGRRHGH!

AND EAX, EAX
JNZ spinlock_acquire_failed

:^|

STUPID!!!!!!!!!!!!!

> RET

[...]

Here is correct code that fully compiles on VC:
______________________________________________________________________
typedef int atomicword;

__declspec(naked) void
spinlock_acquire(
atomicword* const _this
) {
_asm {

MOV ECX, [ESP + 4]

spinlock_acquire_retry:
MOV EAX, 1
XCHG [ECX], EAX

AND EAX, EAX
JNZ spinlock_acquire_failed
RET

spinlock_acquire_failed:
PAUSE
JMP spinlock_acquire_retry
}
}

__declspec(naked) void
spinlock_release(
atomicword* const _this
) {
_asm {

MOV ECX, [ESP + 4]
MOV EAX, 0
MOV [ECX], EAX
RET
}
}

static atomicword g_lock = 0;

int main() {
spinlock_acquire(&g_lock);
spinlock_release(&g_lock);
spinlock_acquire(&g_lock);
spinlock_release(&g_lock);
return 0;
}

______________________________________________________________________

Sorry about that crap! :^(

Chris Thomasson

unread,

Apr 18, 2008, 6:44:59 PM4/18/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:CMadnXG4BrTYg5TV...@comcast.com...

[...]
> Here is correct code that fully compiles on VC:
> ______________________________________________________________________
> typedef int atomicword;
>
>
> __declspec(naked) void
> spinlock_acquire(
> atomicword* const _this
> ) {
> _asm {
> MOV ECX, [ESP + 4]
>
> spinlock_acquire_retry:
> MOV EAX, 1
> XCHG [ECX], EAX
> AND EAX, EAX
> JNZ spinlock_acquire_failed
> RET
>
> spinlock_acquire_failed:
> PAUSE
> JMP spinlock_acquire_retry
> }
> }

There are no bugs in the code above... However, it can be optimized a little
bit. I don't have to use AND, and I don't have to constantly set EAX to 1.
Therefore, the procedure can be re-written as:

__declspec(naked) void
spinlock_acquire(
atomicword* const _this
) {
_asm {
MOV ECX, [ESP + 4]

MOV EAX, 1

spinlock_acquire_retry:

XCHG [ECX], EAX
TEST EAX, EAX

JNZ spinlock_acquire_failed
RET

spinlock_acquire_failed:
PAUSE
JMP spinlock_acquire_retry
}
}

[...]

Isn't programming assembly language fun!

:^/