I'm working on a well documented, cross-platform, portable, lock-free 
data structure library, in C, which I'm going to make public for general 
use.
I'm currently working on the x86/x64 Linux port, on gcc.
This means I need to write an __asm__ to call cmpxchg8b/cmpxchg16b.  (I 
use gcc internal atomics for the rest).
I've not touched Intel assembly before.  I've hunted around on the web 
and come up with something which compiles and which, well, even appears 
to work (the test program runs correctly!).  Which frankly I find 
utterly unbelieveable - nothing works first time, what are the odds of 
putting together your very first __asm__ and it working?
So I would appreciate it if you guys wouldn't mind casting your eyes 
over the code I've written and letting me know if you see anything 
wrong.
A comment or two first;
1. note the mov instructions are commented out; my thought it that since 
I specify in the __asm__ input the arguments, they would be moved into 
the correct registers for me by gcc - is this true, or do I need to do 
the mov manually?  (the register names are missing because I didn't need 
yet to fill them in)
2. the atom_t type is platform dependent - here, with cmpxchg8b, it is 
an unsigned long int
3. am I correct to note that the condition code register is clobbered?  
the ZF flag is set, which I believe is in that register?
INLINE int abstraction_dcas( volatile atom_t *destination, atom_t 
*exchange, atom_t *compare )
{
  unsigned char
    cas_result;
  __asm__ __volatile__
  (
    // load *compare into into edx:eax
    // "mov eax, %;"
    // "mov edx, %;"
    // load *exchange into ecx:ebx
    // "mov ebx, %;"
    // "mov ecx, %;"
    "lock;"          // make cmpxchg8b atomic
    "cmpxchg8b %0;"  // cmpxchg8b sets ZF on success
    "setz      %1;"  // if ZF set, set cas_result to 1
    // output
    : "=m" (*destination), "=q" (cas_result)
    // input
    : "m" (*destination), "a" (*compare), "d" (*(compare+1)), "b" 
(*exchange), "c" (*(exchange+1))
    // clobbered
    : "cc", "memory"
  );
  return( (int) cas_result );
}
> So I would appreciate it if you guys wouldn't mind casting your eyes 
> over the code I've written and letting me know if you see anything 
> wrong.
> 
> A comment or two first;
> 
> 1. note the mov instructions are commented out; my thought it that since 
> I specify in the __asm__ input the arguments, they would be moved into 
> the correct registers for me by gcc - is this true, or do I need to do 
> the mov manually?
No, that's fine.
> 2. the atom_t type is platform dependent - here, with cmpxchg8b, it is 
> an unsigned long int
> 
> 3. am I correct to note that the condition code register is clobbered?  
> the ZF flag is set, which I believe is in that register?
Yes.
> INLINE int abstraction_dcas( volatile atom_t *destination, atom_t 
> *exchange, atom_t *compare )
-> const atom_t *compare
And atom_t compare would be even better, because x64 could pass them in 
a register by value.
> {
>   unsigned char
>     cas_result;
>   __asm__ __volatile__
>   (
>     // load *compare into into edx:eax
>     // "mov eax, %;"
>     // "mov edx, %;"
> 
>     // load *exchange into ecx:ebx
>     // "mov ebx, %;"
>     // "mov ecx, %;"
> 
>     "lock;"          // make cmpxchg8b atomic
>     "cmpxchg8b %0;"  // cmpxchg8b sets ZF on success
>     "setz      %1;"  // if ZF set, set cas_result to 1
> 
>     // output
>     : "=m" (*destination), "=q" (cas_result)
I would recommend "+m" because the value of *destinations changes by the 
method. (However, it will not make much difference, since the volatile 
value can change at any time.)
>     // input
>     : "m" (*destination), "a" (*compare), "d" (*(compare+1)), "b" 
> (*exchange), "c" (*(exchange+1))
You should not declare two parameters for *destination. And if so, you 
you should restrict them to the /same/ register. This would result in 
the same as "+m", but you can bind the result to different vars. (Makes 
no sense here.)
You should declare EAX and EDX as output parameter too, because their 
content is modified. You could also add them to the clobber list, but I 
guess you need their results since your implementation lacks to return 
the old value in *exchange (or *compare), depending on what you API 
expects. Without the old value CAS loops are expensive. In contrast the 
returned flag could simply be replaced by a comparison of the old value 
to *compare.
Passing compare as pointer may reduce optimization. It is also common to 
use shift operations to split into 32 bit values. (Don't know if it is 
really better, maybe the opposite.)
>     // clobbered
>     : "cc", "memory"
"memory" is not needed because the function does not change anything 
that is declared non-volatile.
>   );
> 
>   return( (int) cas_result );
I would recommend to change the functions return value to char to save 
the movzx instruction. Since char is implicitly convertible to int it 
should not make much difference. However, if this is a constant part of 
your API you may have no chance. But if your function is never declared 
extern, it should work.
If your function returns the old value at some place it might also be an 
alternative to ignore the ZF flag and return *compare == oldvalue. In 
contrast to the inline assembly setz the compiler may omit the whole 
comparison if the result is not needed. Of course, in case you need the 
result you get an additional compare instruction. But this is only a 
disadvantage, if the result is assigned rather than used in a 
conditional expression. In the latter case a compare (or test) 
instruction is inserted anyway. As far as I know gcc is not able to pass 
results in the cc register and put a conditional branch immediately 
after the inline assembler.
> }
Marcel
[snip]
Thankyou so much!  I'm just to bed; I will digest your advice in detail 
in the morning.
This function - with cmpxchg8b is for the 32 bit CPUs.  The 64 bit CPUs 
have an identical function, except that cmpxchg16b is used and atom_t is 
an unsigned long long int.
On the 32 bit CPUs with a compiler which supports unsigned long long 
int, I could pass compare in as a value, but on the 64 bit CPUs I could 
not, since there is no supported 128 bit wide type, and I want the 
function prototype to be the same on both platforms (since it's part of 
an abstraction layer - I want it to be the same on all platforms).
 
> >     // output
> >     : "=m" (*destination), "=q" (cas_result)
> 
> I would recommend "+m" because the value of *destinations changes by the 
> method. (However, it will not make much difference, since the volatile 
> value can change at any time.)
 
> >     // input
> >     : "m" (*destination), "a" (*compare), "d" (*(compare+1)), "b" 
> > (*exchange), "c" (*(exchange+1))
> 
> You should not declare two parameters for *destination. And if so, you 
> you should restrict them to the /same/ register. This would result in 
> the same as "+m", but you can bind the result to different vars. (Makes 
> no sense here.)
There is an example in the GCC Inline Assembly HOWTO which shows the 
same variable being used in both input and output, thusly;
int main(void)
{
        int foo = 10, bar = 15;
        __asm__ __volatile__("addl  %%ebx,%%eax"
                             :"=a"(foo)
                             :"a"(foo), "b"(bar)
                             );
        printf("foo+bar=%d\n", foo);
        return 0;
}
I have then from what you've written the impression that
: "=a"
: "a"
is equivelent to
: "+a"
: /* no output variables */
Is this correct?
> You should declare EAX and EDX as output parameter too, because their 
> content is modified.
Yes. This was a serious oversight on my part.
> You could also add them to the clobber list, but I 
> guess you need their results since your implementation lacks to return 
> the old value in *exchange (or *compare), depending on what you API 
> expects.
As you may know, Windows offers a wrapper for cmpxchg16b, called 
_InterlockedCompareExchange128.  Furthermore, the Windows C compiler now 
no longer permits in-line assembley, so you have to use that wrapper.
That wrapper does not reliably return the original value of the 
destination.  I know - I found out the hard way.  As such, in my code, 
what I have to do is make a copy of the destination before the CAS and 
if the CAS is successful, I know the copy must be correct.
This is wasteful, but forced upon me.
This is why there is no provision for obtaining the original destination 
value.
> >   return( (int) cas_result );
> 
> I would recommend to change the functions return value to char to save 
> the movzx instruction. Since char is implicitly convertible to int it 
> should not make much difference. However, if this is a constant part of 
> your API you may have no chance. But if your function is never declared 
> extern, it should work.
The API is not yet publically released.
If I might take the libery, find below the fixed-up code; I've still 
*destination as input and output, until I understand that issue.
1. the return type is now unsigned char
2. the commented out mov instructions are removed
3. the constraint on the input destination is now "+"
4. "a" and "d" are added to the clobber list
5. "memory" is removed from the clobber list
unsigned char abstraction_dcas( volatile atom_t *destination, atom_t 
*exchange, atom_t *compare )
{
  unsigned char
    cas_result;
  __asm__ __volatile__
  (
    "lock;"          // make cmpxchg8b atomic
    "cmpxchg8b %0;"  // cmpxchg8b sets ZF on success
    "setz      %1;"  // if ZF set, set cas_result to 1
    // output
    : "+m" (*destination), "=q" (cas_result)
    // input
    : "m" (*destination), "a" (*compare), "d" (*(compare+1)), "b" 
(*exchange), "c" (*(exchange+1))
    // clobbered
    : "cc", "a", "d"
  );
  return( cas_result );
}
? - unsigned long long int is a 64 bit type with MSVC. It will be funny 
to use the 128 bit instruction cmpxchg16b on that.
> On the 32 bit CPUs with a compiler which supports unsigned long long 
> int, I could pass compare in as a value, but on the 64 bit CPUs I could 
> not, since there is no supported 128 bit wide type, and I want the 
> function prototype to be the same on both platforms (since it's part of 
> an abstraction layer - I want it to be the same on all platforms).
Maybe it makes no difference after the optimizer has done its work.
> I have then from what you've written the impression that
> 
> : "=a"
> : "a"
> 
> is equivelent to
> 
> : "+a"
> : /* no output variables */
> 
> Is this correct?
No. In the first one the input parameter /may/ be placed in another 
general purpose register.
   : "=a"
   : "0"
is like "+a" as long as the C expressions behind them are the same.
However, in case of memory it makes no difference, since the compiler 
usually will not move variables around in memory. But strictly speaking 
the compiler could assume that a assembler snippet with
   : "=m" xxx
   : "m" xxx
could be used to copy the content from one memory location to another.
In fact gcc seems to ignore most of the potential inconsistencies in the 
inline assembler parameter definitions. I guess that
   : "=m" xxx
   : "m" xxx
: "+m" xxx
   : "+m" xxx
   : "m" xxx
   : "=m" xxx
   : "0" xxx
   : "+m" xxx
   : "0" xxx
are effectively all the same. (Not really tested.)
>> You could also add them to the clobber list, but I 
>> guess you need their results since your implementation lacks to return 
>> the old value in *exchange (or *compare), depending on what you API 
>> expects.
> 
> As you may know, Windows offers a wrapper for cmpxchg16b, called 
> _InterlockedCompareExchange128.  Furthermore, the Windows C compiler now 
> no longer permits in-line assembley, so you have to use that wrapper.
> 
> That wrapper does not reliably return the original value of the 
> destination.  I know - I found out the hard way.
? - I would wonder if _InterlockedCompareExchange128 does not reflect 
the CPU instruction exactly.
But on Windows you will never get the same prototype for all platforms 
since _InterlockedCompareExchange128 has a completely different signature.
>  As such, in my code, 
> what I have to do is make a copy of the destination before the CAS and 
> if the CAS is successful, I know the copy must be correct.
You mean yon insert an additional load into the CAS loop?
> This is why there is no provision for obtaining the original destination 
> value.
I still wonder about that. This would be a really hard bug.
Marcel
Here is an implementation that works:
_____________________________________________________________
__attribute__((always_inline))
static __inline__
int atomic_cas64(void volatile* dest,
                 void* xcmp,
                 void const* xxchg)
{
   int rc;
   __asm__ __volatile__ (
    "mov %3, %%esi            ;\n" // exchange
    "mov 0(%%esi), %%ebx      ;\n" // exchange low
    "mov 4(%%esi), %%ecx      ;\n" // exchange high
    "mov %2, %%esi            ;\n" // comparand
    "mov 0(%%esi), %%eax      ;\n" // comparand low
    "mov 4(%%esi), %%edx      ;\n" // comparand high
    "mov %1, %%esi            ;\n" // destination
    "lock cmpxchg8b (%%esi)      ;\n"
    "jz   1f                   ;\n"
    "mov %2, %%esi            ;\n" // comparand
    "mov %%eax, 0(%%esi)      ;\n" // comparand low
    "mov %%edx, 4(%%esi)      ;\n" // comparand high
  "1:      mov $0, %0               ;\n"
    "setz %b0                  ;\n" // rc =
    : "=&a" (rc)
    : "m" (dest), "m" (xcmp), "m" (xxchg)
    : "cc", "memory", "edx", "ebx", "ecx", "esi"
    );
   return rc;
}
#define atomic_dwcas atomic_cas64
_____________________________________________________________
> So I would appreciate it if you guys wouldn't mind casting your eyes
> over the code I've written and letting me know if you see anything
> wrong.
I don't see where you're updating the comparand on CAS failure. You should 
be either returning the current value on failure like 
`InterlockedCompareExchange()', or setting the comparand to the current 
value and returning false...
[...] 
> > This function - with cmpxchg8b is for the 32 bit CPUs.  The 64 bit 
CPUs 
> > have an identical function, except that cmpxchg16b is used and atom_t is 
> > an unsigned long long int.
> 
> ? - unsigned long long int is a 64 bit type with MSVC. It will be funny 
> to use the 128 bit instruction cmpxchg16b on that.
I omitted to say I'm using paired pointer-counters to address ABA.  So 
it's an array of 2 pointer width variables, e.g. 64 bits on 32 bit, 128 
bits on 64 bits, which is why I'm using double-width CAS rather than 
normal CAS.  This is why the arguments in the prototype I gave are all 
pointers.
 
> However, in case of memory it makes no difference, since the compiler 
> usually will not move variables around in memory. But strictly speaking 
> the compiler could assume that a assembler snippet with
>    : "=m" xxx
>    : "m" xxx
> could be used to copy the content from one memory location to another.
> 
> In fact gcc seems to ignore most of the potential inconsistencies in the 
> inline assembler parameter definitions. I guess that
> 
>    : "=m" xxx
>    : "m" xxx
> 
>    : "+m" xxx
> 
>    : "+m" xxx
>    : "m" xxx
> 
>    : "=m" xxx
>    : "0" xxx
> 
>    : "+m" xxx
>    : "0" xxx
> 
> are effectively all the same. (Not really tested.)
I will read up some more on this.  I can't just right now - it's 4.30am 
and I need to get back to sleep :-)  (It's windy, there's something 
outside which is making too much noise, so I've moved into the front 
room; which means turning the PC off, which means I just had to check to 
see if there were any posts, which means... :-)
> > As you may know, Windows offers a wrapper for cmpxchg16b, called 
> > _InterlockedCompareExchange128.  Furthermore, the Windows C compiler now 
> > no longer permits in-line assembley, so you have to use that wrapper.
> > 
> > That wrapper does not reliably return the original value of the 
> > destination.  I know - I found out the hard way.
> 
> ? - I would wonder if _InterlockedCompareExchange128 does not reflect 
> the CPU instruction exactly.
I think this is so.
This is the prototype;
unsigned char _InterlockedCompareExchange128( __int64 volatile 
*Destination,
                                              __int64 ExchangeHigh,
                                              __int64 ExchangeLow,
                                              __int64 *ComparandResult 
);
This is the man page;
http://msdn.microsoft.com/en-us/library/bb514094.aspx
Note in particular this from the man page;
---start---
 Return Value
1 if the 128-bit comparand equals the original value of the destination. 
ExchangeHigh and ExchangeLow overwrite the 128-bit destination.
0 if the comparand does not equal the original value of the destination. 
The value of the destination is unchanged and the value of the comparand 
is overwritten with the value of the destination.
---end---
So you see *Destination does not change; rather, *ComparandResult is 
updated with the original value.
This is the problem;
---start---
Now, this intrinsic is a double word CAS. It compares the 128 bits in 
ComparandResult with Destination. If they match, ExchangeHigh and 
ExchangeLow are written into Destination and the original value of 
Destination is written into ComparandResult. The function returns 1 if a 
swap occurred and 0 if it did not.
MSDN however has this to say; "The value of ComparandResult is always 
overwritten. After the lock instruction, this intrinsic immediately 
copies the initial value of Destination to ComparandResult. For this 
reason, ComparandResult and Destination should point to separate memory 
locations to avoid unexpected behavior."
After spending the last eleven days intensively debugging (I mean, 
regularly going to bed between 2am and 4am most nights) I'm now prepared 
to call this statement criminally deceptive.
The problem is that the second sentence about seperate memory locations, 
immediately following a far more profound statement yet discussing an 
entirely different and minor issue, acts to obscure a vital issue. 
There's actually really the main problem of the documentation totally 
failing to mention that the intrinsic is offering functionality which, 
by offering, one would expect to operate in a way which means it is 
useful; for why offer it if the user cannot use it?
You see, what MSDN means by "after the lock instruction, this intrinsic 
immediately copies the initial value of Destination to ComparandResult" 
is that the write to ComparandResult of the original value of the 
Destination is not atomic.
The compare is atomic. The swap is atomic. The copy of the original 
value? not atomic at all. So you have to ignore the content of 
ComparandResult after a swap, because it can be anything. Your thread 
may have stopped and someone else modified the contents of whatever 
Destination points to; when your thread restarts, you'll be getting 
whatever they wrote.
This is particularly bad because in some data structures, it is very 
useful to get the original value of Destination - in fact, you use it 
immediately, writing it back into your data structure. You can't 
actually do this, of course, because sooner or later it corrupts your 
data structure. If you want the original value, you have before the CAS 
to make a copy of ComparandResult and after a successful swap, use that 
copy, which you know is correct, since the swap will only have occurred 
if ComparandResult and Destination are the same.
---end---
 
> But on Windows you will never get the same prototype for all platforms 
> since _InterlockedCompareExchange128 has a completely different signature.
There is a wrapper function for double-word CAS.  The prototype 
specifies the destination, exchange and compare - the user implements on 
his platform.  On 64 bit Windows, _InterlockedCompareExchange128 is 
used.
> >  As such, in my code, 
> > what I have to do is make a copy of the destination before the CAS and 
> > if the CAS is successful, I know the copy must be correct.
> 
> You mean yon insert an additional load into the CAS loop?
Yes.
> > This is why there is no provision for obtaining the original
> > destination value.
> 
> I still wonder about that. This would be a really hard bug.
In the machine code you see the two mov instructions, copying 
destination to comparandresult, immediately following the cmpxchg16b.  
So after the cmpxchg16b, successful or not, if another CAS occurs on 
*destination in another thread prior to the mov's, you will if their CAS 
succeeded get someone elses value of *destination in your 
*comparandresult, which was supposed to be your original value of 
*destination.
It must have been it seems to me a conscious implementation decision.
[load old value of destination into registers X:Y]
[perform cmpxchg16b, which stores the old destination into W:Z]
[write X:Y to result, ignoring the old value returned by cmpxchg16b]
instead of:
[perform cmpxchg16b, which stores the old destination into W:Z]
[write W:Z to result]
I should point out that a clobbers "memory" should be added in order to 
get/emulate "compiler barrier" semantics.
[...]
I think it must be so.  Of course, I may be mistaken - but having seen 
what the intrinsic does in practise, I am convinced that the 
comparandresult cannot be trusted.
Here is the C code for the abstraction function which calls 
_InterlockedCompareExchange128 (note the char *c is in there to easily 
generate a break at the right spot and the extern is required by the MS 
C compiler, otherwise it gives all inlined functions static linkage).
extern __forceinline int abstraction_dcas( volatile atom_t *destination, 
atom_t *exchange, atom_t *compare )
{
  unsigned char
    cas_result;
  cas_result = _InterlockedCompareExchange128( destination, *(exchange+
1), *exchange, compare );
{ char *c = 0; *c = 0; }
  return( (int) cas_result );
}
This is the full disassembley for that function (note I am not qualified 
to properly read this code, so it may prove me wrong!);
test!abstraction_dcas:
00000001`400586b0 4c89442418      mov     qword ptr [rsp+18h],r8
00000001`400586b5 4889542410      mov     qword ptr [rsp+10h],rdx
00000001`400586ba 48894c2408      mov     qword ptr [rsp+8],rcx
00000001`400586bf 53              push    rbx
00000001`400586c0 4883ec10        sub     rsp,10h
00000001`400586c4 488b4c2428      mov     rcx,qword ptr [rsp+28h]
00000001`400586c9 488b4908        mov     rcx,qword ptr [rcx+8]
00000001`400586cd 488b5c2428      mov     rbx,qword ptr [rsp+28h]
00000001`400586d2 488b1b          mov     rbx,qword ptr [rbx]
00000001`400586d5 4c8b4c2420      mov     r9,qword ptr [rsp+20h]
00000001`400586da 4c8b442430      mov     r8,qword ptr [rsp+30h]
00000001`400586df 498b00          mov     rax,qword ptr [r8]
00000001`400586e2 498b5008        mov     rdx,qword ptr [r8+8]
00000001`400586e6 f0490fc709      lock cmpxchg16b oword ptr [r9]
00000001`400586eb 498900          mov     qword ptr [r8],rax
00000001`400586ee 49895008        mov     qword ptr [r8+8],rdx
00000001`400586f2 0f94c0          sete    al
00000001`400586f5 880424          mov     byte ptr [rsp],al
00000001`400586f8 48c744240800000000 mov   qword ptr [rsp+8],0
00000001`40058701 488b442408      mov     rax,qword ptr [rsp+8]
00000001`40058706 c60000          mov     byte ptr [rax],0
00000001`40058709 0fb60424        movzx   eax,byte ptr [rsp]
00000001`4005870d 4883c410        add     rsp,10h
00000001`40058711 5b              pop     rbx
00000001`40058712 c3              ret
00000001`40058713 cc              int     3
00000001`40058714 cc              int     3
00000001`40058715 cc              int     3
00000001`40058716 cc              int     3
00000001`40058717 cc              int     3
00000001`40058718 cc              int     3
00000001`40058719 cc              int     3
00000001`4005871a cc              int     3
00000001`4005871b cc              int     3
00000001`4005871c cc              int     3
00000001`4005871d cc              int     3
00000001`4005871e cc              int     3
00000001`4005871f cc              int     3
test!ringbuffer_internal_enqueue_element:
[snip]
I include the beginning of the next function just to show this really is 
the full disassembley.
> news.5...@spamgourmet.com wrote:
>> Toby Douglass wrote:
>> > As you may know, Windows offers a wrapper for cmpxchg16b, called 
>> > _InterlockedCompareExchange128.  Furthermore, the Windows C compiler now 
>> > no longer permits in-line assembley, so you have to use that wrapper.
>> > 
>> > That wrapper does not reliably return the original value of the 
>> > destination.  I know - I found out the hard way.
>> 
>> ? - I would wonder if _InterlockedCompareExchange128 does not reflect 
>> the CPU instruction exactly.
>
> I think this is so.
>
> This is the prototype;
>
> unsigned char _InterlockedCompareExchange128( __int64 volatile 
> *Destination,
>                                               __int64 ExchangeHigh,
>                                               __int64 ExchangeLow,
>                                               __int64 *ComparandResult 
> );
>
> This is the man page;
>
> http://msdn.microsoft.com/en-us/library/bb514094.aspx
>
> MSDN however has this to say; "The value of ComparandResult is always 
> overwritten. After the lock instruction, this intrinsic immediately 
> copies the initial value of Destination to ComparandResult. For this 
> reason, ComparandResult and Destination should point to separate memory 
> locations to avoid unexpected behavior."
>
> After spending the last eleven days intensively debugging (I mean, 
> regularly going to bed between 2am and 4am most nights) I'm now prepared 
> to call this statement criminally deceptive.
>
> The problem is that the second sentence about seperate memory locations, 
> immediately following a far more profound statement yet discussing an 
> entirely different and minor issue, acts to obscure a vital issue. 
> There's actually really the main problem of the documentation totally 
> failing to mention that the intrinsic is offering functionality which, 
> by offering, one would expect to operate in a way which means it is 
> useful; for why offer it if the user cannot use it?
>
> You see, what MSDN means by "after the lock instruction, this intrinsic 
> immediately copies the initial value of Destination to ComparandResult" 
> is that the write to ComparandResult of the original value of the 
> Destination is not atomic.
No, it's not. The read of the original value into CPU registers WAS
atomic, but the write to *ComparandResult is not, and is performed as a
separate step. CPUs do not currently have the ability to atomically
update two distinct memory locations: the CMPXCHG16B instruction swaps
registers with memory, so it then has to store the old value (which was
atomically read into the registers) into *ComparandResult afterwards.
> The compare is atomic. The swap is atomic. The copy of the original 
> value? not atomic at all. So you have to ignore the content of 
> ComparandResult after a swap, because it can be anything. Your thread 
> may have stopped and someone else modified the contents of whatever 
> Destination points to; when your thread restarts, you'll be getting 
> whatever they wrote.
No. You are misunderstanding.
> In the machine code you see the two mov instructions, copying 
> destination to comparandresult, immediately following the cmpxchg16b.  
> So after the cmpxchg16b, successful or not, if another CAS occurs on 
> *destination in another thread prior to the mov's, you will if their CAS 
> succeeded get someone elses value of *destination in your 
> *comparandresult, which was supposed to be your original value of 
> *destination.
No. The result will be the old value of *destination, regardless what
other threads do. The disassembly you posted bears this out.
Anthony
-- 
Author of C++ Concurrency in Action | http://www.manning.com/williams
just::thread C++0x thread library   | http://www.stdthread.co.uk
Just Software Solutions Ltd         | http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
Toby Douglass wrote:
> I omitted to say I'm using paired pointer-counters to address ABA.  So 
> it's an array of 2 pointer width variables, e.g. 64 bits on 32 bit, 128 
> bits on 64 bits, which is why I'm using double-width CAS rather than 
> normal CAS.  This is why the arguments in the prototype I gave are all 
> pointers.
ah, are you using Joe Seigh algorithm to provide strong thread safety or 
something similar? (DWCAS is a bit uncommon.)
>> In fact gcc seems to ignore most of the potential inconsistencies in the 
>> inline assembler parameter definitions. I guess that
>>
>>    : "=m" xxx
>>    : "m" xxx
>>
>>    : "+m" xxx
>>
>>    : "+m" xxx
>>    : "m" xxx
>>
>>    : "=m" xxx
>>    : "0" xxx
>>
>>    : "+m" xxx
>>    : "0" xxx
>>
>> are effectively all the same. (Not really tested.)
> 
> I will read up some more on this.
I seek for a detailed documentation on this for a long time without success.
> Note in particular this from the man page;
> 
> ---start---
>  Return Value
> 
> 1 if the 128-bit comparand equals the original value of the destination. 
> ExchangeHigh and ExchangeLow overwrite the 128-bit destination.
> 
> 0 if the comparand does not equal the original value of the destination. 
> The value of the destination is unchanged and the value of the comparand 
> is overwritten with the value of the destination.
> ---end---
> 
> So you see *Destination does not change; rather, *ComparandResult is 
> updated with the original value.
Again ??? - This is exactly the way CAS always works. If you want to 
replace *destination in any case you must use InterlockedExchange. But 
there is no 128 bit version. However, if you have a spin loop you can do 
almost any operation with CAS.
   atom_t old = *destination;
   atom_t new;
   do
   { // Apply your change to old
     new = ~old; // Example
   } while (!CAS(destination, &new, &old));
This essentially relies on the fact that CAS returns the previous value 
in old when the comparison fails.
When the loop completes you have exactly the old and the new value in 
old/new.
With the old style CAS signature the loop looks like
   atom_t old = *destination;
   atom_t new;
   do
   { // Apply your change to old
     new = ~old; // Example
     atom_t old2 = old;
     old = CAS(destination, &new, &old2);
   } while (old2 != old);
> This is the problem;
I cannot see any problem.
That's correct. Only *destination is declared volatile. But you will 
hardly find a platform that supports atomic memory access to 
discontinuous storage. In that case you need transactional memory.
> The compare is atomic. The swap is atomic. The copy of the original 
> value? not atomic at all. So you have to ignore the content of 
> ComparandResult after a swap, because it can be anything.
No. Definitely not. The access to *ComperandResult might not be atomic 
but the value stored is well defined. It is still the "initial value of 
Destination".
> Your thread 
> may have stopped and someone else modified the contents of whatever 
> Destination points to; when your thread restarts, you'll be getting 
> whatever they wrote.
No. You will get exactly the value that was replaced in *destination by 
the read modify write cycle. Only the write access to *comperandresult 
is done later.
>>> This is why there is no provision for obtaining the original
>>> destination value.
>> I still wonder about that. This would be a really hard bug.
> 
> In the machine code you see the two mov instructions, copying 
> destination to comparandresult, immediately following the cmpxchg16b.
> So after the cmpxchg16b, successful or not, if another CAS occurs on 
> *destination in another thread prior to the mov's, you will if their CAS 
> succeeded get someone elses value of *destination in your 
> *comparandresult, which was supposed to be your original value of 
> *destination.
You mixed something up here. Your code is simply wrong if you overwrite 
*comperandresult before you use the value provided by the application.
But if you are emulating InterlockedExchange128 the CAS loop reduces 
simply to
   atom_t new = replacement;
   atom_t old = *destination;
   while (!CAS(destination, &new, &old));
In this case the assignment old = *destination (2*mov) is immediately 
followed by cmpxchg16b. And if someone changed *destination in between 
the CAS will fail and the loop spins until no one changed *destination 
in one loop cycle.
Marcel
> > You see, what MSDN means by "after the lock instruction, this 
intrinsic 
> > immediately copies the initial value of Destination to ComparandResult" 
> > is that the write to ComparandResult of the original value of the 
> > Destination is not atomic.
> 
> No, it's not. The read of the original value into CPU registers WAS
> atomic, but the write to *ComparandResult is not, and is performed as a
> separate step. CPUs do not currently have the ability to atomically
> update two distinct memory locations: the CMPXCHG16B instruction swaps
> registers with memory, so it then has to store the old value (which was
> atomically read into the registers) into *ComparandResult afterwards.
I think there is much potential for confusion here.
As I understand the CMPXCHG16B instruction; that single instruction 
loads the destination value into a register, compares with the value you 
earlier placed in RDX:RAX and if they match, swaps the value you earlier 
loaded into RCX:RBX.  There is then *also* a write back to memory - 
still atomic - of the value that is now in destination (whether or not 
it changed).
All of this happens in the single instruction, CMPXCHG16B.
Now, back to _InterlockedCompareExchange128;
unsigned char _InterlockedCompareExchange128( __int64 volatile 
*Destination,
                                              __int64 ExchangeHigh,
                                              __int64 ExchangeLow,
                                              __int64 *ComparandResult 
);
The docs state that *destination does not change; that the original 
value is placed into *comparandresult.
This seems to imply (as the docs state and as I think you have stated) 
that *after* CMPXCHG16B, the value of destination is moved into 
comparandresult.
Is it that move which seems to be to be non-atomic.
 
> > In the machine code you see the two mov instructions, copying 
> > destination to comparandresult, immediately following the cmpxchg16b.  
> > So after the cmpxchg16b, successful or not, if another CAS occurs on 
> > *destination in another thread prior to the mov's, you will if their CAS 
> > succeeded get someone elses value of *destination in your 
> > *comparandresult, which was supposed to be your original value of 
> > *destination.
> 
> No. The result will be the old value of *destination, regardless what
> other threads do. The disassembly you posted bears this out.
Are you taking into account that I'm talking about the behaviour of 
_InterlockedCompareExchange128, rather than only the behaviour of 
CMPXCHG16B?
No.  In fact, I've not heard of Joe Seigh - I'm rather a beginning at 
lock-free.  I've implemented *two* simple algorithms (Treiber's stack 
and Michael and Scott's queue) and built a ringbuffer out of the stack 
and queue.  They all run off freelists internally and using DWCAS for 
ABA.  I've just about bumped into work on solutions to using malloc/free 
(the repeat offender problem, hazard pointers, atomic smart pointers).  
Obviously, that's the correct solution and I'll be learning about that 
once I get this library released (the infrastructure work (bugzilla, 
forum, docs) is a lot more time consuming than the library itself, but 
it's a one off cost, you get it done and then people can start to 
benefit from whatever it is you've actually written so far).
 
> > I will read up some more on this.
> 
> I seek for a detailed documentation on this for a long time without success.
lol, my experience also!
 
> > So you see *Destination does not change; rather, *ComparandResult is 
> > updated with the original value.
> 
> Again ??? - This is exactly the way CAS always works.
There may be a basic misunderstanding on my part.
Originally, before going 64 bit, I was using the 
InterlockedCompareExchange64() function, which is a wrapper for a 
double-word CAS on 32 bit machines.  I am under the impression - perhaps 
mistaken - that for this function the write back to *destination is 
atomic.
Note the prototype;
http://msdn.microsoft.com/en-us/library/ms683562%28VS.85%29.aspx
LONGLONG __cdecl InterlockedCompareExchange64
(
  __inout  LONGLONG volatile *Destination,
  __in     LONGLONG Exchange,
  __in     LONGLONG Comparand
);
"The function compares the Destination value with the Comparand value. 
If the Destination value is equal to the Comparand value, the Exchange 
value is stored in the address specified by Destination. Otherwise, no 
operation is performed."
It seems to me these operations are all atomic - including the of act of 
updating *destination.
As such, I expected it to be so for the 128 bit version as well.
> > The compare is atomic. The swap is atomic. The copy of the original 
> > value? not atomic at all. So you have to ignore the content of 
> > ComparandResult after a swap, because it can be anything.
> 
> No. Definitely not. The access to *ComperandResult might not be atomic 
> but the value stored is well defined. It is still the "initial value of 
> Destination".
 
> > Your thread 
> > may have stopped and someone else modified the contents of whatever 
> > Destination points to; when your thread restarts, you'll be getting 
> > whatever they wrote.
> 
> No. You will get exactly the value that was replaced in *destination by 
> the read modify write cycle. Only the write access to *comperandresult 
> is done later.
 
Wait - when I say what I say, I'm not talking about *during* the atomic 
execution of cmpxchg8b/cmpxchg16b - I'm talking about inbetween the end 
of that instruction and the two mov instructions which immediately 
follow.  Are you refering also to this when you say the write access is 
done later?
 
> >>> This is why there is no provision for obtaining the original
> >>> destination value.
> >> I still wonder about that. This would be a really hard bug.
> > 
> > In the machine code you see the two mov instructions, copying 
> > destination to comparandresult, immediately following the cmpxchg16b.
> > So after the cmpxchg16b, successful or not, if another CAS occurs on 
> > *destination in another thread prior to the mov's, you will if their CAS 
> > succeeded get someone elses value of *destination in your 
> > *comparandresult, which was supposed to be your original value of 
> > *destination.
> 
> You mixed something up here. Your code is simply wrong if you overwrite 
> *comperandresult before you use the value provided by the application.
I understand what you mean, but don't worry - I don't do that!
What I mean is; cmpxchg8b/cmpxchg16b execute and execute atomically, of 
course.  That instruction completes.  There is now in *destination 
whatever value has ended up there.  Then the thread we are in is by the 
scheduler no longer executing.  Someone else comes along and modifies 
*destination.  Then our thread comes back - and we now finally perform 
the two mov instructions immediately following cmpxchg8b/cmpxchg16b - 
only of course now we moved the wrong value into *comparandresult.
 
> Here is an implementation that works:
Whoa!  I feel like Gandalf has come down from on high and shown me magic 
I can't hope to comprehend :-)
There's a lot here which I don't understand and of course I can't 
blindly use code I don't understand.  I think I have to do a fair bit of 
reading up on assembley to do this; I was hoping to get away without 
having to do that.
> > So I would appreciate it if you guys wouldn't mind casting your eyes
> > over the code I've written and letting me know if you see anything
> > wrong.
> 
> I don't see where you're updating the comparand on CAS failure. You should 
> be either returning the current value on failure like 
> `InterlockedCompareExchange()', or setting the comparand to the current 
> value and returning false...
In the CAS loop, I make a copy of the original value of the destination 
prior to attempting the CAS.  If the CAS succeeds, I know that copy must 
be correct.  I do this awful thing for reasons being discussed in other 
branches of this thread - I point you at them.
>
> extern __forceinline int abstraction_dcas( volatile atom_t *destination, 
> atom_t *exchange, atom_t *compare )
> {
>   unsigned char
>     cas_result;
>
>   cas_result = _InterlockedCompareExchange128( destination, *(exchange+
> 1), *exchange, compare );
>
>   { char *c = 0; *c = 0; }
>
There is a __debugbreak() intrinsic that does the same.
>
>   return( (int) cas_result );
> }
>
> This is the full disassembley for that function (note I am not qualified 
> to properly read this code, so it may prove me wrong!);
>
> test!abstraction_dcas:
> 00000001`400586b0 4c89442418      mov     qword ptr [rsp+18h],r8
> 00000001`400586b5 4889542410      mov     qword ptr [rsp+10h],rdx
> 00000001`400586ba 48894c2408      mov     qword ptr [rsp+8],rcx
> 00000001`400586bf 53              push    rbx
> 00000001`400586c0 4883ec10        sub     rsp,10h
> 00000001`400586c4 488b4c2428      mov     rcx,qword ptr [rsp+28h]
> 00000001`400586c9 488b4908        mov     rcx,qword ptr [rcx+8]
> 00000001`400586cd 488b5c2428      mov     rbx,qword ptr [rsp+28h]
> 00000001`400586d2 488b1b          mov     rbx,qword ptr [rbx]
> 00000001`400586d5 4c8b4c2420      mov     r9,qword ptr [rsp+20h]
> 00000001`400586da 4c8b442430      mov     r8,qword ptr [rsp+30h]
> 00000001`400586df 498b00          mov     rax,qword ptr [r8]
> 00000001`400586e2 498b5008        mov     rdx,qword ptr [r8+8]
>
This code copies parameters from your wrapper function to appropriate registers.
> 00000001`400586e6 f0490fc709      lock cmpxchg16b oword ptr [r9]
> 00000001`400586eb 498900          mov     qword ptr [r8],rax
> 00000001`400586ee 49895008        mov     qword ptr [r8+8],rdx
> 00000001`400586f2 0f94c0          sete    al
>
This performs the atomic operation.  cmpxchg16b reads the previous value of
destination into rdx:rax registers, and that value is unconditionally written
the 'compare' argument.  Then, the al register is set to 1 if the operation
succeeded.  Now, unconditional writing is not a problem: if the operation
succeeded, the previous value of destination will be the same as compare, so
the same value will be written to the compare argument.  If the operation
failed, the new value of destination is written into compare.
The MSDN note about the "ComparandResult being always overwritten" is very
straightforward and not at all misleading.
The previous value returned by cmpxchg16b is always dubious, it's not a problem
with non-atomic write.  Your program can be preempted right after the locked
instruction, some other thread may write other values into destination, and
when your program is rescheduled, the rdx:rax values (read by the locked insn
at the time the operation was performed) will be nonsense.  But it's not a
problem with MS's intrinsic, it's just how the instruction works.
>
> 00000001`400586f5 880424          mov     byte ptr [rsp],al
> 00000001`400586f8 48c744240800000000 mov   qword ptr [rsp+8],0
> 00000001`40058701 488b442408      mov     rax,qword ptr [rsp+8]
> 00000001`40058706 c60000          mov     byte ptr [rax],0
>
This is your GP-generating code.
>
> 00000001`40058709 0fb60424        movzx   eax,byte ptr [rsp]
> 00000001`4005870d 4883c410        add     rsp,10h
> 00000001`40058711 5b              pop     rbx
> 00000001`40058712 c3              ret
>
This just returns the value.
> >   { char *c = 0; *c = 0; }
> There is a __debugbreak() intrinsic that does the same.
Thanks :-)
> > 00000001`400586e6 f0490fc709      lock cmpxchg16b oword ptr [r9]
> > 00000001`400586eb 498900          mov     qword ptr [r8],rax
> > 00000001`400586ee 49895008        mov     qword ptr [r8+8],rdx
> > 00000001`400586f2 0f94c0          sete    al
> This performs the atomic operation.
But I would say this
> > 00000001`400586e6 f0490fc709 lock cmpxchg16b oword ptr [r9]
performs the atomic instruction; that the following three instructions 
are not atomic (and can be delayed by the scheduler).
> cmpxchg16b reads the previous value of
> destination into rdx:rax registers, and that value is unconditionally written
> the 'compare' argument.  Then, the al register is set to 1 if the operation
> succeeded.  Now, unconditional writing is not a problem: if the operation
> succeeded, the previous value of destination will be the same as compare, so
> the same value will be written to the compare argument.  If the operation
> failed, the new value of destination is written into compare.
Yes.
 
> The previous value returned by cmpxchg16b is always dubious, it's not 
> a problem with non-atomic write.
I think that might be a matter of how you phrase the problem.  Coming 
from how I had seen earlier CAS instructions work, I expected the 
previous value to be provided atomically.  It is not - so from my POV, 
when I say "there is a non-atomic write", I mean by that, that the 
previous value is dubious and I expected it to be reliable.  Why provide 
access to it in the instrinc prototype if it is useless because it is 
not reliable?
> Your program can be preempted right after the locked
> instruction, some other thread may write other values into destination, and
> when your program is rescheduled, the rdx:rax values (read by the locked insn
> at the time the operation was performed) will be nonsense.
Which is I believe what occurs.
> But it's not a
> problem with MS's intrinsic, it's just how the instruction works.
I would say it is *a type of problem* with the intrinsic.  What I mean 
is, how can I obtain the original value of destination, except by taking 
my own copy before the CAS attempt?
And why is the original value supposedly being provided, when in fact it 
is entirely unreliable?  why provide that in the prototype?  it can only 
mislead.
With other CAS instructions, I believe I automatically receive a 
reliable copy of the original value - I do not need to make my own copy 
earlier on.
I understand; and as such, the intrinsic really is reflecting and only 
reflecting what the instruction actually does.  The instruction itself 
does something which is not reliable in a multi-threaded environment.
Why does the instruction then bother to write the destination value back 
to the registers?  useful in a single threaded environment?
[snip]
Having read and replied your other post, the content of the previous 
post is pretty much obsolete.
>
> And why is the original value supposedly being provided, when in fact it 
> is entirely unreliable?  why provide that in the prototype?  it can only 
> mislead.
>
It misleads only the people who do not understand what it does.  The value is
*not* unreliable -- it is exactly the contents of the memory when CAS was
attempted; you're just trying to use the value in the wrong way.  As to why --
because there are cases where this value might be useful [OTOH, I can't find
any -- i've reverted to using ordinary mutexes instead of messing with
lock-free stuff.  You _could_ for example use it to try to take a semaphore,
and if the semaphore is taken, use its current value (returned by CAS) to
estimate how long to sleep before retrying.  Though, x86 has a simpler
primitive for that -- XADD -- which is not universal.]
>
> With other CAS instructions, I believe I automatically receive a 
> reliable copy of the original value - I do not need to make my own copy 
> earlier on.
>
what 'other' CAS instructions?  all CAS instructions have exactly the same
semantics on x86.
I think you are misunderstanding me. I am referring to the case in which the 
CAS fails. All of the x86 CAS instructions return the current value on CAS 
failure. You can use this information in order to skip loading from the 
shared data-structure again on every iteration of the loop. Think in terms 
of something like the following win32 fetch-and-add pseudo-code emulation:
__________________________________________________________________
LONG atomic_faa(LONG volatile* v)
{
    LONG cmp1, cmp2 = *v;
    do
    {
        cmp1 = cmp2;
        cmp2 = InterlockedCompareExchange(v, cmp1 + 1, cmp1);
    } while (cmp1 == cmp2);
    return cmp1;
}
__________________________________________________________________
Notice how the code only explicitly loads from `v' one time? It obtains the 
current value on CAs-failure therefore it does not need to reload the 
comparand from `v'.
The `atomic_dwcas()' code I pointed you to allows one to do something like:
__________________________________________________________________
struct anchor
{
    int32 word1;
    int32 word2;
};
void atomic_faa(struct anchor* o,
                struct anchor v)
{
    struct anchor cmp, xchg;
*cmp = *o;
    do
    {
       xchg.word1 = cmp.word1 + v.word1;
       xchg.word2 = cmp.word2 + v.word2;
    } while (! atomic_dwcas(o, &cmp, &xchg));
}
__________________________________________________________________
Notice there is no explicit reload of `o' on every iteration of the loop?
BTW, you can expect CAS failure under periods of contention! Why load from a 
shared location on every iteration if the CAs instruction itself already has 
all of the information for you on a silver platter? 
Well, if I write`fetch-and-add' I might as well show it properly:
__________________________________________________________________
LONG atomic_faa(LONG volatile* v, LONG a)
{
    LONG cmp1, cmp2 = *v;
    do
    {
        cmp1 = cmp2;
        cmp2 = InterlockedCompareExchange(v, cmp1 + a, cmp1);
    } while (cmp1 != cmp2);
    return cmp1;
}
__________________________________________________________________
For some reason I had `increment' on the mind when I scribbled out the first
version of CAS-based FAA. Also, I fu%ck up the `cmp1 != cmp2' clause on the
first busted to hell code.
Sorry about that non-sense.
The compiler intrinsic is a wrapper around the instruction. The
instruction atomically compares the contents of the comparand to the
destination, writes the new value to the destination if the values were
the same, and reads the old value from the destination if the value was
different.
This old value may immediately become stale if another thread is
updating the memory at the same time. There is nothing you can do to
prevent that happening. However, it tells you what the old value was, so
you know why the comparison failed. You can just use this as the
comparand for another loop round, if you're using CAS to do an
unconditional exchange, in order to avoid having to reload the value.
Alternatively, you can use the old value as the basis for further code
--- e.g. if you are trying to write an atomic increment (as in Chris's
example) then you will need to add one to this old value to get the new
value for the next attempt if the first CAS attempt failed.
> > Why does the instruction then bother to write the destination value 
back 
> > to the registers?  useful in a single threaded environment?
> This old value may immediately become stale if another thread is
> updating the memory at the same time. There is nothing you can do to
> prevent that happening. However, it tells you what the old value was,
But surely in effect it does not, because it does not tell you reliably.  
If you use that old value in a multi-threaded environment, it will 
sometimes be wrong, so you cannot use it.
> Alternatively, you can use the old value as the basis for further code
> --- e.g. if you are trying to write an atomic increment (as in Chris's
> example) then you will need to add one to this old value to get the new
> value for the next attempt if the first CAS attempt failed.
In this case it is useful, because even if someone else has updated the 
value, you're still interested in what it is and the value is still 
useful.
However, in other cases, where you're swapping pointers around, using 
the wrong value is not interesting in any way; it's fatal.
> > I would say it is *a type of problem* with the intrinsic.  What I 
mean 
> > is, how can I obtain the original value of destination, except by taking 
> > my own copy before the CAS attempt?
> Taking your own copy will be even less reliable because the value in 
memory
> can be changed between the time you took a copy and the time cmpxchg got to
> execute.
Yes, but if the CAS succeeds, you know your value was indeed the correct 
original value.
> It misleads only the people who do not understand what it does.
And that was me :-)
I feel I have grounds for my view, though; the earlier CAS instructions 
seem, as far as I can tell, to behave differently to the 128 bit CAS.
> The value is
> *not* unreliable -- it is exactly the contents of the memory when CAS was
> attempted; you're just trying to use the value in the wrong way.  As to why --
> because there are cases where this value might be useful [OTOH, I can't find
> any -- i've reverted to using ordinary mutexes instead of messing with
> lock-free stuff.  You _could_ for example use it to try to take a semaphore,
> and if the semaphore is taken, use its current value (returned by CAS) to
> estimate how long to sleep before retrying.  Though, x86 has a simpler
> primitive for that -- XADD -- which is not universal.]
Yes - and there's an example in another post where the original value, 
*even if later modified by another thread*, is still interesting and 
useful.
However, in the cases where you need the true original value, a 
potentially modified version is not useful.
 
> > With other CAS instructions, I believe I automatically receive a 
> > reliable copy of the original value - I do not need to make my own copy 
> > earlier on.
> what 'other' CAS instructions?  all CAS instructions have exactly the 
same
> semantics on x86.
I'd point you at my reply to Marcel Muller at 10:09, 2nd September; I 
dicuss this there.
> antho...@gmail.com wrote:
>> Toby Douglass <a...@b.com> writes:
>
>> > Why does the instruction then bother to write the destination value 
> back 
>> > to the registers?  useful in a single threaded environment?
>
>> This old value may immediately become stale if another thread is
>> updating the memory at the same time. There is nothing you can do to
>> prevent that happening. However, it tells you what the old value was,
>
> But surely in effect it does not, because it does not tell you reliably.  
> If you use that old value in a multi-threaded environment, it will 
> sometimes be wrong, so you cannot use it.
It does tell you reliably! It tells you exactly what the old value was
when the CAS executed. It does not tell you what the value is now (you'd
have to read it again to find out).
>> Alternatively, you can use the old value as the basis for further code
>> --- e.g. if you are trying to write an atomic increment (as in Chris's
>> example) then you will need to add one to this old value to get the new
>> value for the next attempt if the first CAS attempt failed.
>
> In this case it is useful, because even if someone else has updated the 
> value, you're still interested in what it is and the value is still 
> useful.
>
> However, in other cases, where you're swapping pointers around, using 
> the wrong value is not interesting in any way; it's fatal.
It depends what you do with the pointers. If you dereference them, and
don't have a means of ensuring that they are still valid then it's
fatal. If you're just using them for comparisons, or you do have a
lifetime-management scheme in place so they are safe for dereferencing
then it's fine.
Why can't you use the value returned by `CMPXCHG'? When you load the shared 
value on each CAS-loop iteration, you are in effect performing a completely 
redundant load operation because failed `CMPXCHG' instruction has already 
performed the load for you.
>> Alternatively, you can use the old value as the basis for further code
>> --- e.g. if you are trying to write an atomic increment (as in Chris's
>> example) then you will need to add one to this old value to get the new
>> value for the next attempt if the first CAS attempt failed.
>
> In this case it is useful, because even if someone else has updated the
> value, you're still interested in what it is and the value is still
> useful.
>
> However, in other cases, where you're swapping pointers around, using
> the wrong value is not interesting in any way; it's fatal.
The value returned from `cmpxchg' on failure can be fairly useful. For 
instance, you can implement state-machine based on logic which depends on 
CAS failure values. You can use it to skip loading from shared data in every 
iteration of a "CAS-loop". Ect...
Another example of using a pre-determined comparand value wrt CAS is 
acquiring a mutex. Instead of loading from the shared value, one can hard 
code a lock-state as a comparand. 
> It does tell you reliably! It tells you exactly what the old value was
> when the CAS executed. It does not tell you what the value is now (you'd
> have to read it again to find out).
Yes - from one point of view, single threaded assembley, it tells you 
reliably.  The value is in the register and the correct value was placed 
there, always.
From my point of view as a C programmer in a multi-threaded environment, 
it does not tell me reliably, because I cannot reliably get at that 
information.
> > However, in other cases, where you're swapping pointers around, 
using 
> > the wrong value is not interesting in any way; it's fatal.
> It depends what you do with the pointers. If you dereference them, and
> don't have a means of ensuring that they are still valid then it's
> fatal. If you're just using them for comparisons, or you do have a
> lifetime-management scheme in place so they are safe for dereferencing
> then it's fine.
Yes.
> antho...@gmail.com wrote:
>> Toby Douglass <a...@b.com> writes:
>
>> It does tell you reliably! It tells you exactly what the old value was
>> when the CAS executed. It does not tell you what the value is now (you'd
>> have to read it again to find out).
>
> Yes - from one point of view, single threaded assembley, it tells you 
> reliably.  The value is in the register and the correct value was placed 
> there, always.
>
> From my point of view as a C programmer in a multi-threaded environment, 
> it does not tell me reliably, because I cannot reliably get at that 
> information.
The compiler intrinsic places the values from the register in *Comparand
for you. It does this reliably. Yes, you need to ensure that this
location is not shared between threads, but that is easy --- just use a
local variable. 
This is the same way that CAS always works.
In an MT-environment, there is no sensible definition of the *current* value,
because the "current" value can be immediatelly changed just after having being
read; this is *not* specific to CAS, but to all operations over shared data.
If your algorithm depends on the "current" value of something, you have to 
ensure that other threads cannot change that value while you're using it.  In
other words, use mutexes (or software transactional memory, but it's the same
thing, though in different package) to guarantee consistency over a non-atomic
sequence of statements that must appear atomic.
I believe I understand now, for I re-read the gcc manual section on 
constraints and found the section on number constraints, which I had not 
seen before.  I have adopted the syntax quoted here.
Thankyou for your invaluable help, Marcel!
> > But surely in effect it does not, because it does not tell you 
reliably.
> > If you use that old value in a multi-threaded environment, it will
> > sometimes be wrong, so you cannot use it.
> 
> Why can't you use the value returned by `CMPXCHG'? When you load the shared 
> value on each CAS-loop iteration, you are in effect performing a completely 
> redundant load operation because failed `CMPXCHG' instruction has already 
> performed the load for you.
Apologies for the slow reply.  Your post here caused me the most thought 
of all, which of course has led to the slowest reply of all.
I see what you mean; you're going to do it anyway next time you 
come round, so why aren't you using the one which was just done?  it's 
crazy not to.
I'm in the process of examining my code to see if I can benefit from 
this, but I suspect it might be that with the very simple algorithms I 
have currently implemented, it's not something I can use; however, I'm 
not certain and it will take a bit of time for me to check.
Yes.  Originally, I was placing that original destination value directly 
back into the data structure - which of course was concurrently being 
manipulated by other threads.  I do now use a local variable.
No problem.
> Your post here caused me the most thought
> of all, which of course has led to the slowest reply of all.
:^)
> I see what you mean; you're going to do it anyway next time you
> come round, so why aren't you using the one which was just done?
> it's crazy not to.
If the CAS instruction that you are using allows you to automatically 
retrieve the value which caused the CAS to fail, then yes, it would be crazy 
not to take advantage of it when its appropriate.
> I'm in the process of examining my code to see if I can benefit from
> this, but I suspect it might be that with the very simple algorithms I
> have currently implemented, it's not something I can use; however, I'm
> not certain and it will take a bit of time for me to check.
You mentioned implementing Treiber's lock-free LIFO. That algorithm can 
benefit from the value returned by `CMHXCHG8B'. Here is some quick source 
code which shows how Treiber's algorithm can be implemented in MSVC IA-32:
http://cpt.pastebin.com/f55ceb456
(cpusync_ia32.h)
http://cpt.pastebin.com/f5e14a8b
(cpusync_ia32.c)
The Michael and Scott FIFO algorithm can also benefit from the value 
returned from `CMPXCHG8B'. BTW, their queue algorithm is way to expensive. 
There are much better alternatives (I will post pseudo-code for my FIFO 
queue algorithm at the end of the post.
BTW, if you decide to use normal CAS like I did in the `nblifo_IA32_push()' 
procedure, be SURE to read the aba counter BEFORE you load the pointer in 
the `pop()' function (e.g., `nblifo_IA32_pop()'). If you don't make sure to 
do that, well, you're going to be bitten by a very nasty and subtle 
race-condition:
http://groups.google.com/group/comp.programming.threads/msg/d3fe6c226f685d85
Ouch! ;^o
Also, here is detailed information on the lock-free stack in an IBM 
Principles of Operation:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendex A/Multi-Programming/Processing Examples/Conditional Swapping 
Instructions)
Enjoy!  :^)
mpmc_fifo pseudo-code:
________________________________________________________________
struct node {
  node* next;
  void* state;
};
struct dwcas_anchor {
  int32 aba;
  struct node* node;
};
struct mpmcq {
  struct dwcas_anchor tail;
  struct node* head;
};
void
mpmcq_init(
 struct mpmcq* const self,
 struct node* dummy
) {
  dummy->next = NULL;
  self->head = dummy;
  self->tail.node = dummy;
}
void
mpmcq_push(
 struct mpmcq* const self,
 struct node* node
) {
  struct node* prev;
  node->m_next = NULL;
  prev = ATOMIC_SWAP_RELAXED(&self->head, node);
  ATOMIC_STORE_RELEASE(&prev->next, node);
}
struct node*
mpmcq_pop(
 struct mpmcq* const self
) {
  void* state;
  struct dwcas_anchor cmp, xchg;
  cmp = ATOMIC_LOAD_ACQUIRE(&self->tail);
  do {
    struct node* next = ATOMIC_LOAD_ACQUIRE(&cmp.node->next);
    if (! next) return NULL;
    state = next->state;
    xchg.node = next;
    xchg.aba = cmp.aba + 1;
  } while (! ATOMIC_DWCAS_ACQ_REL(&self->tail, &cmp, &xchg));
  cmp.node->state = state;
  return cmp.node;
}
________________________________________________________________
Keep in mind that there are other very nice multi-producer/consumer (MPMC) 
FIFO algorithms as well:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c43d1197c8168b4e
http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890
It's also nice to keep in mind that one can make use of more specialized 
queues such as single-producer/multi-consumer, or 
multi-producer/single-consumer, single-producer/consumer. Each variation has 
a different algorithm. If you're creating a library, it might be a good to 
implement all of them...
;^)
> > I see what you mean; you're going to do it anyway next time you
> > come round, so why aren't you using the one which was just done?
> > it's crazy not to.
> 
> If the CAS instruction that you are using allows you to automatically 
> retrieve the value which caused the CAS to fail, then yes, it would be crazy 
> not to take advantage of it when its appropriate.
I finished thinking about it today.
I realised why I hadn't seen this (that you can re-use the original 
value).  It is because in my mind I was thinking that you need to obtain 
the current destination value as close as possible to the CAS, to 
minimize the window in which other threads can change the value.
However, in the algorithms I've done so far, taking the value from the 
CAS is either identical to taking the destination again (T's stack) or 
extremely close to it.
> > I'm in the process of examining my code to see if I can benefit from
> > this, but I suspect it might be that with the very simple algorithms I
> > have currently implemented, it's not something I can use; however, I'm
> > not certain and it will take a bit of time for me to check.
> 
> You mentioned implementing Treiber's lock-free LIFO. That algorithm can 
> benefit from the value returned by `CMHXCHG8B'. Here is some quick source 
> code which shows how Treiber's algorithm can be implemented in MSVC IA-32:
Yes - it this which I was thinking about today, where I realised it was 
useful.
 
> The Michael and Scott FIFO algorithm can also benefit from the value 
> returned from `CMPXCHG8B'. BTW, their queue algorithm is way to
> expensive.
Interesting.  I've not read any reviews of their code.  I implemented it 
because it's easy to understand.
> BTW, if you decide to use normal CAS like I did in the `nblifo_IA32
_push()' 
> procedure, be SURE to read the aba counter BEFORE you load the pointer in 
> the `pop()' function (e.g., `nblifo_IA32_pop()'). If you don't make sure to 
> do that, well, you're going to be bitten by a very nasty and subtle 
> race-condition:
> 
> http://groups.google.com/group/comp.programming.threads/msg/d3fe6c226f685d85
> 
> Ouch!   ;^o
I can see another slow reply coming :-)
 
> Also, here is detailed information on the lock-free stack in an IBM 
> Principles of Operation:
> 
> http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
> (Appendex A/Multi-Programming/Processing Examples/Conditional Swapping 
> Instructions)
> 
> Enjoy!  :^)
Thankyou!
[snip - I will examine the code and links and reply]
Assuming I understand the ATOMIC____ functions right, wouldnt that basically 
detach the two ends of the queue between lines #1 and #2? After #1 the head 
points at the new node, but the last node in the queue doesnt join up yet. 
So if a thread stalled between those two lines, consumers wouldnt be able to 
get past that break, new items could be added, but they'd be inaccessable to 
consumers until the stalled thread completed the join?
Yes, you are right. Strictly saying the queue is blocking (however
uses only wait-free primitives).
I described the caveats here:
http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300
http://groups.google.com/group/lock-free/browse_frm/thread/55df71b87acb8201
--
Dmitriy V'jukov
I personally don't worry about that scenario because the chance a thread 
will stall between those two instructions are few and far between. If it 
does happen, it won't break anything*. Now, if a thread actually dies within 
those instructions then the shi% has definitely hit the fan. However, thread 
should not be dieing at "random" places anyway. If they do, trust me, you 
have got MUCH bigger problems to deal with.
Now, this will be a problem if you are using this queue with processes 
instead of threads. I say this because IMHO, processes can die which is why 
there are such things as robust mutexs.
(*) - IMPORTANT!!!
Let's assume that you have three consumer threads [C1...C3] and two producer 
threads [P1...P2]. As soon as a consumer thread consumes an item it quits. 
Now, imagine that `P1' stalls between the two critical instructions. In the 
mean time `P2' produces two items, broadcasts because it produced more than 
one item, and quits. Now, all the consumers wake up, try to consume items, 
fail and go back to waiting. Then `P1' wakes back up, fixes the queue which 
makes all three items visible, and signals because it only produced a single 
item. Now, a single consumer, say `C1' wakes and consumes a single item and 
quits. Now what? Well, you got yourself a deadlock.
So, if you are going to use this queue, I would always issue a broadcast 
just to be on the safe side. 
[...]
Perhaps you should mention the following possible behavior:
http://groups.google.com/group/comp.programming.threads/msg/2556874a844a706e
Do you think it's a legitimate concern WRT the multi-consumer/producer case? 
I am going to try and model it in Relacy.
BTW, I don't think it would be a problem in your SEH-based queue:
http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300
AFAICT, it contains the same producer logic, however you seem to detect the 
inconsistency and do a spin-wait to correct it. Am I right? 
I believe one can get around this by doing something like...
membars aside for a moment:
________________________________________________________________
struct node*
mpmcq_pop(
 struct mpmcq* const self
) {
void* state;
struct dwcas_anchor cmp, xchg;
retry:
cmp = ATOMIC_DWLOAD(&self->tail);
do {
struct node* next = ATOMIC_LOAD(&cmp.node->next);
    if (! next)
    {
        struct node* head = ATOMIC_LOAD(&self->head);
        if (cmp.node == head)
        {
            return NULL;
        }
spin_wait_yield_whatever();
        goto retry;
    }
state = next->state;
xchg.node = next;
xchg.aba = cmp.aba + 1;
} while (! ATOMIC_DWCAS(&self->tail, &cmp, &xchg));
cmp.node->state = state;
  return cmp.node;
}
________________________________________________________________
That should detect a point in which a producer thread is in the middle of 
rejoining the queue, and wait for it to complete. 
Here is the problem and a workaround modeled in Relacy 2.0:
http://relacy.pastebin.com/fd4b6e1a
The program works as-is. However, if you un-define the 
`WORKAROUND_THE_PROBLEM' macro everything bites the dust rather quickly! 
Well, at least Relacy is telling me that a valid workaround is possible...
Relacy ROCKS!
:^)