Does anybody know why InterlockedExchange is implemented as
mov ecx,[esp+4]
mov edx,[esp+8]
mov eax,[ecx]
@retry:
lock cmpxchg [ecx],edx
jne @retry
ret 8
Instead of simply
mov ecx,[esp+4]
mov eax,[esp+8]
xchg [ecx], eax
ret 8
?
xchg automatically asserts the LOCK# signal and is atomic. What is the loop
for?
Oleh Derevenko
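For comparison, here is a minimal C11 sketch of the two approaches in the question. It assumes a compiler that provides <stdatomic.h>; the names swap_xchg and swap_cas_loop are made up for illustration and are not Windows APIs. Compilers targeting x86 typically turn atomic_exchange into a single XCHG, while the second function mirrors the LOCK CMPXCHG retry loop from the disassembly.

```c
#include <stdatomic.h>

/* Unconditional swap: returns the previous contents of *target. */
long swap_xchg(atomic_long *target, long value) {
    return atomic_exchange(target, value);
}

/* The same swap emulated with compare-and-swap in a retry loop. */
long swap_cas_loop(atomic_long *target, long value) {
    long seen = atomic_load(target);
    /* On failure, 'seen' is refreshed with the current contents of *target. */
    while (!atomic_compare_exchange_weak(target, &seen, value)) {
        /* another thread changed *target in between; try again */
    }
    return seen;
}
```

Both functions return the same result; they differ only in whether a retry is ever needed under contention.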
>> Does anybody know why InterlockedExchange is implemented as
>> mov ecx,[esp+4]
>> mov edx,[esp+8]
>> mov eax,[ecx]
>> @retry:
>> lock cmpxchg [ecx],edx
>> jne @retry
>> ret 8
>
> Where did you find this? On my WinXP/SP2 machine with all current
> hotfixes applied the disassembly of InterlockedCompareExchange is
> this:
Try to find 7 differences in the following function names, please:
MY: InterlockedExchange
YOUR: InterlockedCompareExchange
--
Oleh Derevenko
-- ICQ: 36361783
"Elcaro Nosille" <Elcaro....@googlemail.com> wrote in message
news:ftvn3q$4jn$1...@worf.visyn.net...
> Oleh Derevenko schrieb:
>
>> Try to find 7 differences in the following function names, please:
>> MY: InterlockedExchange
>> YOUR: InterlockedCompareExchange
>
> Ok, didn't know that there's a InterlockedExchange.
http://groups.google.com/group/microsoft.public.win32.programmer.kernel/browse_thread/thread/522529961143de56/bcc566a1bab3dcdf
but then w2k didn't?
http://groups.google.com/group/comp.os.ms-windows.programmer.win32/browse_thread/thread/36e2c3eda1cbdc0b/0ab4655dea711637
I give the benefit of any doubt to those
that do it this way. The two news threads
above are very old. It's not hard to _asm
your own if you'd rather.
--
40th Floor - Software @ http://40th.com/
Simple XCHG is perfectly fine and totally compatible with
InterlockedExchange. I have no idea why Microsoft would implement it without
XCHG. It's not a bug to implement it with LOCK CMPXCHG, it's just a "retarded"
way of doing it. FWIW, I implement the 'ac_i686_atomic_xchg_fence' in my
AppCore library using XCHG:
http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_masm_asm.html
(it's the fifth function from the bottom...)
You're correct that XCHG with a memory operand is locked automatically, with
or without a LOCK prefix. For concrete proof, take a look at this:
http://www.intel.com/products/processor/manuals/318147.pdf
(under the '1 Instructions and memory accesses' section...)
I have no idea why Microsoft would use CAS to implement a SWAP. Weird!
:^/
WHAT? There is little use for InterlockedExchange!? Are you fuc%#ng
serious!? I can think of many different ways to use it. In fact I have used
it for many diverse non-blocking algorithms. I don't even want to list the
ways yet until I find out that your are tolling!!!?
;^/
I don't even want to list the ways yet until I find out that your are
__NOT__ tolling!!!?
>
> ;^/
find and replace "tolling" with "trolling".
I need to calm down!
ACK! ;^(...
Sorry about all that non-sense.
It's not healthy to talk to yourself. Anyway, why do you think that there is
little use for InterlockedExchange?
Because data exchange normally works quite well without any need for
synchronization? And almost any synchronization primitive is much more
efficiently implemented with compare-and-exchange or fetch-and-add?
SWAP can be an essential tool for implementing wait-free algorithms. IMO,
Elcaro's assertion that there is little use for InterlockedExchange is
totally false.
> And almost any synchronization primitive is much more efficiently
> implemented with compare-and-exchange
Using CAS implies a loop; which is only lock-free. However, SWAP is usually
implemented as a wait-free operation; there is a _very_ big difference
between lock-free and wait-free. You generally cannot use CAS in wait-free
algorithms because of the loop that goes along with most usage cases.
SWAP is an important tool for me. One simple example: you can use it to get
rid of the ABA problem in lock-free stacks. As a side effect, using SWAP
transforms a lock-free consumer into a wait-free one; a huge benefit. I can
give an example if you want...
> or fetch-and-add?
FAA is great; Intel has a wait-free version of it (e.g., XADD).
Your assertion is incorrect.
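To make the lock-free versus wait-free distinction concrete, here is a small C11 sketch of fetch-and-add done both ways. The function names are hypothetical. The first version finishes in a bounded number of steps (x86 compilers typically emit LOCK XADD for it); the second can be forced to retry an unbounded number of times when other threads keep changing the counter.

```c
#include <stdatomic.h>

/* Wait-free on x86: one atomic read-modify-write, no retry. */
int faa_direct(atomic_int *counter, int delta) {
    return atomic_fetch_add(counter, delta);
}

/* Lock-free only: some thread always makes progress, but this one may loop. */
int faa_cas_loop(atomic_int *counter, int delta) {
    int seen = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &seen, seen + delta)) {
        /* 'seen' was reloaded by the failed CAS; seen + delta is recomputed */
    }
    return seen;
}
```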
[...]
Try to get a "much more" efficient version of the following synchronization
algorithms using nothing but CAS:
Please note that the non-blocking lifo example uses the IBM-Style CAS which
automatically updates the comprand on failure...
<code-sketch of ABA-free non-blocking lifo w/ wait-free consumer>
__________________________________________________________________
struct node {
node* nx;
};
struct nblifo {
node* hd; /* initialize to NULL */
};
void
nblifo_produce(
nblifo* const _this,
node* const n
) {
n->nx = _this->hd;
while (! CAS(&_this->hd, &n->nx, n));
}
node*
nblifo_swap(
nblifo* const _this,
node* const n
) {
return SWAP(&_this->hd, n);
}
#define nblifo_consume(mp_this) nblifo_swap((mp_this), NULL)
__________________________________________________________________
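The LIFO sketch above can be written against C11 atomics roughly as follows. This is only a sketch under assumptions: it relies on atomic_compare_exchange_weak writing the current value back into the expected argument on failure, which matches the IBM-style CAS mentioned above, and the memory orders chosen here are my own reading, not something the original code specifies.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *nx;
};

struct nblifo {
    _Atomic(struct node *) hd; /* initialize to NULL */
};

/* Lock-free push: retries until the head is swung from n->nx to n. */
void nblifo_produce(struct nblifo *self, struct node *n) {
    n->nx = atomic_load_explicit(&self->hd, memory_order_relaxed);
    /* On failure the current head is stored into n->nx, so the loop body is empty. */
    while (!atomic_compare_exchange_weak_explicit(
               &self->hd, &n->nx, n,
               memory_order_release, memory_order_relaxed)) {
    }
}

/* Wait-free consume: detach the whole chain with a single exchange. */
struct node *nblifo_consume(struct nblifo *self) {
    return atomic_exchange_explicit(&self->hd, NULL, memory_order_acquire);
}
```

Because the consumer never compares against a head pointer it read earlier, there is no window in which a recycled node could fool it, which is why ABA does not arise on that side.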
<code-sketch of ABA-free non-blocking fifo w/ wait-free consumer>
__________________________________________________________________
typedef nblifo nbfifo;
node*
nbfifo_sys_transform(
node* _this
) {
if (_this) {
node* fifo = NULL;
while (_this) {
node* const n = _this;
_this = n->nx;
n->nx = fifo;
fifo = n;
}
return fifo;
}
return NULL;
}
#define nbfifo_produce nblifo_produce
#define nbfifo_swap(mp_this, mp_n) \
nbfifo_sys_transform(nblifo_swap((mp_this), (mp_n)))
#define nbfifo_consume(mp_this) nbfifo_swap((mp_this), NULL)
__________________________________________________________________
<code-sketch of mutex w/ wait-free fast-paths>
__________________________________________________________________
struct fpmutex {
word st; /* initialize to 0 */
osevent ws;
};
void fpmutex_lock(
fpmutex* const _this
) {
if (SWAP(&_this->st, 1)) {
while (SWAP(&_this->st, 2)) {
osevent_wait(&_this->ws);
}
}
}
void fpmutex_unlock(
fpmutex* const _this
) {
if (SWAP(&_this->st, 0) == 2) {
osevent_set(&_this->ws);
}
}
__________________________________________________________________
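Here is a hedged C11 rendering of the mutex sketch above. The osevent type is a hypothetical stand-in for an OS auto-reset event: it spins instead of blocking, purely so the example is self-contained, and real code would use a kernel event or semaphore. The state values follow the original: 0 unlocked, 1 locked, 2 locked with possible waiters.

```c
#include <stdatomic.h>

/* Stand-in auto-reset event (assumption: spins rather than blocks). */
typedef struct { atomic_int signaled; } osevent;

static void osevent_set(osevent *e) {
    atomic_store(&e->signaled, 1);
}

static void osevent_wait(osevent *e) {
    /* Consume the signal; exactly one waiter gets through per set. */
    while (atomic_exchange(&e->signaled, 0) == 0) {
    }
}

typedef struct {
    atomic_int st; /* initialize to 0 */
    osevent ws;
} fpmutex;

void fpmutex_lock(fpmutex *m) {
    /* Fast path: a single exchange when the mutex is free. */
    if (atomic_exchange(&m->st, 1) != 0) {
        /* Slow path: mark the mutex contended and wait for a wake-up. */
        while (atomic_exchange(&m->st, 2) != 0) {
            osevent_wait(&m->ws);
        }
    }
}

void fpmutex_unlock(fpmutex *m) {
    /* Only signal the event if some thread marked the mutex contended. */
    if (atomic_exchange(&m->st, 0) == 2) {
        osevent_set(&m->ws);
    }
}
```

In the uncontended case both lock and unlock cost exactly one exchange and never touch the event.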
<code-sketch of event w/ wait-free fast-paths>
__________________________________________________________________
struct fpevent {
word st; /* initialize to 0 */
osevent ws;
};
void fpevent_wait(
fpevent* const _this
) {
while (SWAP(&_this->st, 0)) {
osevent_wait(&_this->ws);
}
}
void fpevent_set(
fpevent* const _this
) {
if (! SWAP(&_this->st, 1)) {
osevent_set(&_this->ws);
}
}
__________________________________________________________________
How could you make much more efficient versions of those algorithms using
nothing but CAS? Please, do tell... I am _very_ interested indeed...
> <code-sketch of event w/ wait-free fast-paths>
> __________________________________________________________________
> struct fpevent {
> word st; /* initialize to 0 */
> osevent ws;
> };
>
ARGHGHG!!!!
> void fpevent_wait(
> fpevent* const _this
> ) {
> while (SWAP(&_this->st, 0)) {
> osevent_wait(&_this->ws);
> }
> }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The above is incorrect! The while loop needs to check for zero, NOT
non-zero!!! Here, let me fix that:
void fpevent_wait(
fpevent* const _this
) {
while (! SWAP(&_this->st, 0)) {
osevent_wait(&_this->ws);
}
}
Sorry about that non-sense typo! ;^(...
> Using CAS implies a loop; which is only lock-free. However, SWAP is usually
> implemented as a wait-free operation; there is a _very_ big difference
> between lock-free and wait-free. You generally cannot use CAS in wait-free
> algorithms because of the loop that goes along with most usage cases.
Here is another important point. When using CAS you usually have to
fetch the current value first, and only after that write the new
value with CAS. This means that the cache line is first requested in
shared/owned state, and then requested one more time in modified
state.
I observed performance/scaling degradation by a factor of 3 caused by
CAS retries, and by a factor of 3 once again caused by the additional
cache-line transfers, as compared with XCHG, on an Intel Q6600.
So in my heavy benchmark XCHG was scaling as 0.30, and CAS was scaling
as 0.05 (linear scaling is 4.00).
Dmitriy V'jukov
>>I suppose I know the answer. It may be that simple xchg instruction does
>>not properly synchronize processor caches on some multiprocessor
>>platforms, while cmpxchg does.
>
> Simple XCHG is perfectly fine and totally compatible with
> InterlockedExchange. I have no idea why Microsoft would implement it
> without XCHG. Its not a bug to implement it with LOCK CMPXCHG, its just a
> "retarded" way of doing it. FWIW, I implement the
> 'ac_i686_atomic_xchg_fence' in my AppCore library using XCHG:
>
I do not remember the link exactly, but I once read an article in MSDN which
said that interlocked functions serve one important purpose compared to
simple assignment. If there are two processors, each with its own cache, and
you do a simple memory assignment, the second processor can still use the old
value from its cache even if it reads the memory after the first processor
has made the assignment. That is, the cache of the second processor is not
synchronized with memory modifications made by the first one. However, when
interlocked functions are used, it is guaranteed that all the processors'
caches will be in sync with the assignment and will return the new value. So
my assumption is: maybe simple xchg does not exhibit this feature? Maybe it
is necessary to use cmpxchg to achieve cache synchronization? Does anybody
know?
Intel's manual on the IA-32 architecture does not contain any remarks
regarding cache synchronization in the description of either xchg or
cmpxchg. :(
Oleh Derevenko
Right. Good point indeed. In some algorithms, you can use a fixed comprand
for CAS e.g.:
__________________________________________________________________
struct fpevent {
word st; /* initialize to 0 */
osevent ws;
};
void fpevent_wait(
fpevent* const _this
) {
while (! CAS(&_this->st, 1, 0)) {
osevent_wait(&_this->ws);
}
}
void fpevent_set(
fpevent* const _this
) {
if (CAS(&_this->st, 0, 1)) {
osevent_set(&_this->ws);
}
}
__________________________________________________________________
That helps, but it is still more expensive than the SWAP version because it's
more "complicated" to implement a CAS in hardware. There are three
parameters to work with, and there is the compare. SWAP is two parameters
and the operation is unconditional; SWAP is a winner. One thing that makes
me angry is that SUN deprecated the SWAP instruction, which forces me to use
CAS to implement a simple SWAP; CAS is no substitute for a SWAP! IMHO, SUN
made a big mistake.
:^o
Microsoft's documentation on memory models is notoriously bad. Intel finally
released a description of their memory model in the following paper:
http://www.intel.com/products/processor/manuals/318147.pdf
Read all. You will find that on x86, a store followed by a load from another
location can indeed be reordered, e.g.:
1: MOV [EAX], EDX
2: MOV EDX, [ECX]
If EAX and ECX point to different locations, then the processor can reorder
the sequence such that 2 is executed before 1. If you want to ensure that 2
does not rise above 1, then you need to do something like:
1: MOV [EAX], EDX
2: MFENCE
3: MOV EDX, [ECX]
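The same store-then-load pattern, seen from two CPUs, is the classic "store buffer" litmus test. The C11 sketch below is an illustration under assumptions: X, Y, cpu0 and cpu1 are hypothetical names, and each function is meant to run on its own thread. With the seq_cst fences in place, the outcome where both functions return 0 is forbidden; without them, x86 allows it, because each load may complete while the preceding store is still sitting in the store buffer.

```c
#include <stdatomic.h>

atomic_int X, Y;   /* both start at 0 (static storage) */

/* Intended to run on CPU 0: store to X, then load from Y. */
int cpu0(void) {
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* plays the role of MFENCE */
    return atomic_load_explicit(&Y, memory_order_relaxed);
}

/* Intended to run on CPU 1: store to Y, then load from X. */
int cpu1(void) {
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&X, memory_order_relaxed);
}
```

On x86 compilers usually implement the fence with MFENCE or a locked instruction on a dummy stack location.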
> Intel's manual on IA-32 architecture does not contain any remarks
> regarding cache synchronization neither in description of xchg nor in
> cmpxchg. :(
The white paper basically says that the only memory barrier in x86 that is
not already implied is #StoreLoad. In other words, it describes a TSO memory
model. Here is a sketch of how the current x86 memory model works, according
to the paper:
void* LOAD(void** p) {
void* v = *p;
membar #LoadStore | #LoadLoad;
return v;
}
void STORE(void** p, void* v) {
membar #LoadStore | #StoreStore;
*p = v;
}
That works fine for a producer/consumer model, but does not work for others.
For instance, you need to execute a #StoreLoad | #StoreStore barrier in
Peterson's algorithm. Here is an example of that:
http://groups.google.com/group/comp.programming.threads/msg/1e45b4b16bad9784
This requires either an MFENCE instruction, or an XCHG to a dummy location, to
get the #StoreLoad ordering.
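As a rough illustration (not the code behind the link above), here is a two-thread Peterson lock in C11. All names are hypothetical. It uses the default sequentially consistent atomics throughout, which is the simplest way to get the required store-to-load ordering; on x86, compilers typically implement such stores with XCHG or with MOV followed by MFENCE.

```c
#include <stdatomic.h>

static atomic_int flag[2];  /* flag[i] != 0 means thread i wants to enter */
static atomic_int turn;     /* which thread is allowed in when both want in */

/* 'self' must be 0 or 1. */
void peterson_lock(int self) {
    int other = 1 - self;
    atomic_store(&flag[self], 1);
    atomic_store(&turn, other);
    /* The loads below must not be satisfied before the stores above are
       globally visible; seq_cst ordering guarantees that. */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other) {
        /* spin */
    }
}

void peterson_unlock(int self) {
    atomic_store(&flag[self], 0);
}
```

If the stores were plain MOVs with no barrier, both threads could read the other's flag as 0 and enter the critical section together.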
>> assignment and will return new value. So, my assumption is, maybe simple
>> xchg does not exhibit this feature? Maybe it is necessary to use cmpxchg
>> to achieve cache synchronization? Does anybody know it?
>
> Microsoft documentation on memory models are notoriously bad. Intel
> finally released a description of their memory-model in the following
> paper:
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> Read all. You will find that on x86, a store followed by a load from
> another location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can
> reorder the sequence such that 2 is executed before 1. If you want to
> ensure that 2 does not rise above 1, the you need to do something like:
>
> 1: MOV [EAX], EDX
> 2: MFENCE
> 3: MOV EDX, [ECX]
>
You do not read what I write. I was talking about cache synchronization
between two physically separate processors, not about read-write reordering
in a single processor.
Though it could be that on a true multi-processor system the implementation
of InterlockedExchange is different again, just as it differs between
single-core and multi-core machines.
Oleh Derevenko
> Read all. You will find that on x86, a store followed by a load from another
> location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can reorder
> the sequence such that 2 is executed before 1. If you want to ensure that 2
> does not rise above 1, the you need to do something like:
>
> 1: MOV [EAX], EDX
> 2: MFENCE
> 3: MOV EDX, [ECX]
Bullshit, since EDX is a dependency which is why these instructions can't be
reordered.
> Read all. You will find that on x86, a store followed by a load from another
> location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can reorder
> the sequence such that 2 is executed before 1.
Bullshit, since EDX is a WAR dependency.
> Read all. You will find that on x86, a store followed by a load from another
> location can indeed be reordered, e.g.:
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> If EAX and ECX point to different locations, then the processor can reorder
> the sequence such that 2 is executed before 1. If you want to ensure that 2
> does not rise above 1, the you need to do something like:
>
> 1: MOV [EAX], EDX
> 2: MFENCE
> 3: MOV EDX, [ECX]
May I call bullshit?
x86 can and will reorder the following:
store(&loc1, 0);
load(&loc2);
Read:
http://groups.google.com/group/comp.programming.threads/msg/731cd06e8c04f744
and
http://www.intel.com/products/processor/manuals/318147.pdf
and you will learn that x86 WILL reorder load after store to another
location.
The only time you need a memory barrier on x86 is to prevent
load-after-store reordering. The behavior that you're interested in is the
memory model; the cache has nothing to do with it.
http://groups.google.com/group/comp.programming.threads/msg/17cd775d97150096
(read all)
> Though, it could be the case that for true multi-processor system
> implementation of InterlockedExchange could be even different, just like
> it
> differs for single-core machine and multi-core machine.
The XCHG instruction has an implicit LOCK prefix, which is analogous to a
full memory barrier.
So, according to you, one could implement SMR on x86 without using a
#StoreLoad? Here is code for a hazard pointer load:
ac_i686_lfgc_smr_activate PROC
mov edx, [esp + 4]
mov ecx, [esp + 8]
ac_i686_lfgc_smr_activate_reload:
mov eax, [ecx]
mov [edx], eax
mfence
cmp eax, [ecx]
jne ac_i686_lfgc_smr_activate_reload
ret
ac_i686_lfgc_smr_activate ENDP
Why do you think that the MFENCE does not need to be in there?
Here is reference to SMR:
http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf
BTW Sebastian, if you found a way to implement this algorithm on x86 without
using a MFENCE or synchronization epoch detection, well, PLEASE POST IT! I
would be VERY interested.
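For readers who prefer C to MASM, the hazard pointer load above corresponds roughly to the following C11 sketch. The type and function names are hypothetical, and the node layout is a placeholder. The seq_cst fence sits where the MFENCE does: between publishing the hazard pointer and re-reading the shared location. The reclaiming thread is assumed to issue a matching full barrier between unlinking a node and scanning the hazard pointers.

```c
#include <stdatomic.h>

typedef struct smr_node { int value; } smr_node;  /* placeholder layout */

smr_node *smr_activate(_Atomic(smr_node *) *hazard_ptr,
                       _Atomic(smr_node *) *shared_loc) {
    smr_node *seen;
    do {
        seen = atomic_load_explicit(shared_loc, memory_order_acquire);
        /* Publish the pointer we intend to dereference. */
        atomic_store_explicit(hazard_ptr, seen, memory_order_relaxed);
        /* #StoreLoad: the hazard must be visible before the re-check below. */
        atomic_thread_fence(memory_order_seq_cst);
        /* If the shared location changed meanwhile, the hazard may be stale. */
    } while (seen != atomic_load_explicit(shared_loc, memory_order_acquire));
    return seen;
}
```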
Let me add another step to the example:
; ECX = Location1
; EAX = Location2
1: MOV EDX, [ECX] ; load from Location1
2: MOV [EAX], EDX ; store to Location2
3: MOV EDX, [ECX] ; load from Location1
Why do you think that steps 2 and 3 will not be reordered? According to this
paper:
http://www.intel.com/products/processor/manuals/318147.pdf
You're wrong.
That's a rumor. Consider that
exclusive_lock(location);
r = load(location);
store(location,x);
exclusive_unlock(location);
return r;
isn't really analogous to a full memory-barrier under PC memory model at
all.
A fully memory-barrier'd LOCK XCHG must be
MFENCE();
exclusive_lock(location);
r = load(location);
store(location,x);
exclusive_unlock(location);
MFENCE();
return r;
I've been trying to figure out which of above MFENCEs is actually
guaranteed (from the docs) to be provided by XCHG to no avail.
regards,
alexander.
> "Sebastian G." <se...@seppig.de> wrote in message
> news:66rm24F...@mid.dfncis.de...
>> Chris Thomasson wrote:
>>
>>
>>> Read all. You will find that on x86, a store followed by a load from
>>> another location can indeed be reordered, e.g.:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MOV EDX, [ECX]
>>>
>>> If EAX and ECX point to different locations, then the processor can
>>> reorder the sequence such that 2 is executed before 1.
>>
>> Bullshit, since EDX is a WAR dependency.
>
> Let me add another moment in the example:
>
>
> ; ECX = Location1
> ; EAX = Location2
>
>
>
> 1: MOV EDX, [ECX] ; load from Location1
> 2: MOV [EAX], EDX ; store to Location2
> 3: MOV EDX, [ECX] ; load from Location1
>
>
>
> Why do you think that steps 2 and 3 will not be reordered?
Because it's a Write-After-Read dependency on EDX.
> According to this
> paper:
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> Your wrong.
According to this paper you're wrong, since it only deals with memory
ordering dependency, not register dependency. BTW, could you please put your
notions into multiprocessor parallel execution? Since for single process
stuff, you can't be any more obviously wrong.
> "Sebastian G." <se...@seppig.de> wrote in message
> news:66rmi1F...@mid.dfncis.de...
>> Chris Thomasson wrote:
>>
>>
>>> Read all. You will find that on x86, a store followed by a load from
>>> another location can indeed be reordered, e.g.:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MOV EDX, [ECX]
>>>
>>> If EAX and ECX point to different locations, then the processor can
>>> reorder the sequence such that 2 is executed before 1. If you want to
>>> ensure that 2 does not rise above 1, the you need to do something like:
>>>
>>> 1: MOV [EAX], EDX
>>> 2: MFENCE
>>> 3: MOV EDX, [ECX]
>>
>> May I call bullshit?
>
> So, accoriding to you, one could implment SMR on x86 without using a
> #StoreLoad?
No, just your example is broken. EDX is a dependency, so within one thread,
it can't be reordered. On multiple threads, it might be executed
sequentially, that is
1: MOV [EAX], EDX
2: MOV EDX, [ECX]
1: MOV [EAX], EDX
2: MOV EDX, [ECX]
And even then there's only a potential ordering problem when ECX#1 and EAX#1
are identical, and the MFence wouldn't help with that.
On page 6 in the following paper:
http://www.intel.com/products/processor/manuals/318147.pdf
"All locked instructions (the implicitly locked xchg instruction and other
read-modify-write instructions with a lock prefix) are an indivisible and
uninterruptible sequence of load(s) followed by store(s) regardless of memory
type and alignment."
AFAICT, this means that XCHG implicitly asserts the LOCK prefix. What am I
missing here Alex?
> Consider that
>
> exclusive_lock(location);
> r = load(location);
> store(location,x);
> exclusive_unlock(location);
> return r;
>
> isn't really analogous to a full memory-barrier under PC memory model at
> all.
Can subsequent operations be hoisted up above an XCHG? Also, can preceding
operations sink below it?
> A fully memory-barrier'd LOCK XCHG must be
>
> MFENCE();
> exclusive_lock(location);
> r = load(location);
> store(location,x);
> exclusive_unlock(location);
> MFENCE();
> return r;
>
> I've been trying to figure out which of above MFENCEs is actually
> guaranteed (from the docs) to be provided by XCHG to no avail.
AFAICT, the trailing MFENCE seems to have to be there. If it was not, I
don't see how you could correctly implement a spinlock with XCHG.
I am sorry for not clarifying that I was writing about multiple CPU's
executing the sequence concurrently.
Ahh. Okay, I should have just used the following example:
op(X).release // store
op(Y).acquire // load
doesn't prevent
op(Y).acquire // load
op(X).release // store
reordering.
You need to do:
op(X).release // store
mfence
op(Y).acquire // load
in order to prevent the reordering from occurring.
> EDX is a dependency, so within one thread, it can't be reordered. On
> multiple thread, it might be executed sequentially, that is
>
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
> 1: MOV [EAX], EDX
> 2: MOV EDX, [ECX]
>
> And even then there's only a potential ordering problem when ECX#1 and
> EAX#1 are identical, and the MFence wouldn't help with that.
/*
AC_SYS_APIEXPORT ac_i686_node_t* AC_CDECL
ac_i686_lfgc_smr_activate
( ac_i686_node_t** hazard_ptr,
ac_i686_node_t** shared_loc );
*/
ac_i686_lfgc_smr_activate PROC
mov edx, [esp + 4]
mov ecx, [esp + 8]
ac_i686_lfgc_smr_activate_reload:
mov eax, [ecx]
mov [edx], eax
mfence
cmp eax, [ecx]
jne ac_i686_lfgc_smr_activate_reload
ret
ac_i686_lfgc_smr_activate ENDP
I have to use the MFENCE because I need to ensure that load-after-store
reordering does not occur. Do you disagree?
I am also sorry for not clarifying that I was writing about memory ordering.
What do you take away from page 13?
[...]
> I have to use the MFENCE because I need to ensure that load-after-store
> reordering does not occur. Do you disagree?
No, in this example it's clear.
Sorry for not making myself clear. I did not mean to waste your time
Sebastian!
;^(
Think of PC's naked load as
shared_lock(location); // rwlock_rdlock
r = load(location);
shared_unlock(location); // rwlock_unlock
return r;
and XCHG as
exclusive_lock(location); // rwlock_wrlock
r = load(location);
store(location,x);
exclusive_unlock(location); // rwlock_unlock
return r;
Where's MFENCE?
PC's naked store:
exclusive_lock(location); // rwlock_wrlock
store(location,x);
exclusive_unlock(location); // rwlock_unlock
regards,
alexander.
Well, since page 6 in the following paper:
http://www.intel.com/products/processor/manuals/318147.pdf
states that XCHG implicitly asserts the LOCK prefix, well, I would assume
that the MFENCE would be right before the unlock:
exclusive_lock(location); // rwlock_wrlock
r = load(location);
store(location,x);
MFENCE;
exclusive_unlock(location); // rwlock_unlock
return r;
If it's not there, well, then the Intel documentation is totally incorrect
and would mean that the following spinlock code would be 100% broken all to
hell:
/* void spinlock_acquire(atomicword* shared_location); */
spinlock_acquire PROC
MOV ECX, [ESP + 4]
spinlock_acquire_retry:
MOV EAX, 1
XCHG [ECX], EAX
TEST EAX, EAX
JE spinlock_acquire_failed
RET
spinlock_acquire_failed:
PAUSE
JMP spinlock_acquire_retry
spinlock_acquire ENDP
/* void spinlock_release(atomicword* shared_location); */
spinlock_release PROC
MOV ECX, [ESP + 4]
MOV EAX, 0
MOV [ECX], EAX
RET
spinlock_release ENDP
ARGRRHGH!
AND EAX, EAX
JNZ spinlock_acquire_failed
:^|
STUPID!!!!!!!!!!!!!
> RET
[...]
Here is correct code that fully compiles on VC:
______________________________________________________________________
typedef int atomicword;
__declspec(naked) void
spinlock_acquire(
atomicword* const _this
) {
_asm {
MOV ECX, [ESP + 4]
spinlock_acquire_retry:
MOV EAX, 1
XCHG [ECX], EAX
AND EAX, EAX
JNZ spinlock_acquire_failed
RET
spinlock_acquire_failed:
PAUSE
JMP spinlock_acquire_retry
}
}
__declspec(naked) void
spinlock_release(
atomicword* const _this
) {
_asm {
MOV ECX, [ESP + 4]
MOV EAX, 0
MOV [ECX], EAX
RET
}
}
static atomicword g_lock = 0;
int main() {
spinlock_acquire(&g_lock);
spinlock_release(&g_lock);
spinlock_acquire(&g_lock);
spinlock_release(&g_lock);
return 0;
}
______________________________________________________________________
Sorry about that crap! :^(
There are no bugs in the code above... However, it can be optimized a little
bit. I don't have to use AND, and I don't have to constantly set EAX to 1.
Therefore, the procedure can be re-written as:
__declspec(naked) void
spinlock_acquire(
atomicword* const _this
) {
_asm {
MOV ECX, [ESP + 4]
MOV EAX, 1
spinlock_acquire_retry:
XCHG [ECX], EAX
TEST EAX, EAX
JNZ spinlock_acquire_failed
RET
spinlock_acquire_failed:
PAUSE
JMP spinlock_acquire_retry
}
}
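A portable counterpart to the inline assembly, as a sketch only: the C11 version below uses atomic_exchange for the acquire (x86 compilers typically emit XCHG) and a plain release store for the unlock (a simple MOV on x86). The type name spinlock is an assumption, and there is no PAUSE here because standard C has no portable spelling for it. The inner read-only loop is an addition of mine, not part of the assembly above; it keeps waiters from hammering the cache line with exchanges.

```c
#include <stdatomic.h>

typedef atomic_int spinlock;   /* 0 = free, 1 = held */

void spinlock_acquire(spinlock *lock) {
    /* Exchange returns the previous value; nonzero means the lock was held. */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0) {
        /* Wait with plain loads until the lock looks free, then retry. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0) {
        }
    }
}

void spinlock_release(spinlock *lock) {
    /* Release ordering keeps the critical section above this store. */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```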
[...]
Isn't programming assembly language fun!
:^/