
[x86] CMPXCHG timing


Michael Pryhodko

Mar 31, 2005, 1:45:46 AM
I understand that this question is quite platform-specific and can be
considered off-topic here, but...

Suppose the following situation:
1. x86 platform (i.e. P6, Xeon and so on)
2. there are two different memory locations M1 and M2
3. M1 is cached in processor P1's cache
4. M2 is cached in processor P2's cache (P2 is a second processor)
5. M1 = 0 and M2 = 0

CMPXCHG &M1, 0, 1
CMPXCHG &M2, 0, 1

Can it be safely assumed that the execution time of the two instructions above
will be the same? (considering that the first instruction will be executed on
P1 and the second on P2)


Also


Can it be safely assumed that execution time of:

// store buffers are empty
mov aligned_mem_addr, value_32bit
sfence // to flush store buffers

does NOT depend on the aligned_mem_addr value? Are there any 'memory bank' or
'DIMM module' issues?

Bye.
Sincerely yours, Michael.

Joe Seigh

Mar 31, 2005, 6:50:39 AM
On 30 Mar 2005 22:45:46 -0800, Michael Pryhodko <mpry...@westpac.com.au> wrote:

> I understand that this question is quite platform-specific and can be
> considered offtopic here, but...
>
> suppose situation:
> 1. x86 platform (i.e. P6, Xeon and so on)
> 2. there are two different memory locations M1 and M2
> 3. M1 cached in processor P1 cache
> 4. M2 cached in processor P2 cache (P2 is second processor)
> 5. M1 = 0 and M2 = 0
>
> CMPXCHG &M1, 0, 1
> CMPXCHG &M2, 0, 1
>
> Can it be safely assumed that execution time for instruction above will
> be the same? (considering that first instruction will be executed on
> P1, second -- on P2)

On average they'll probably be the same. Why would it matter? If
you're using timing-dependent logic, your logic is wrong.


>
>
> Also
>
>
> Can it be safely assumed that execution time of:
>
> // store buffers are empty
> mov aligned_mem_addr, value_32bit
> sfence // to flush store buffers
>
> does NOT depends on aligned_mem_addr value? Any 'memory bank', 'DIMM
> module' issues?

No. Cache line state will affect it. But again, why should you care?

--
Joe Seigh

Michael Pryhodko

Mar 31, 2005, 8:33:59 PM
> > I understand that this question is quite platform-specific and can be
> > considered offtopic here, but...
> >
> > suppose situation:
> > 1. x86 platform (i.e. P6, Xeon and so on)
> > 2. there are two different memory locations M1 and M2
> > 3. M1 cached in processor P1 cache
> > 4. M2 cached in processor P2 cache (P2 is second processor)
> > 5. M1 = 0 and M2 = 0
> >
> > CMPXCHG &M1, 0, 1
> > CMPXCHG &M2, 0, 1
> >
> > Can it be safely assumed that execution time for instruction above will
> > be the same? (considering that first instruction will be executed on
> > P1, second -- on P2)
>
> On average they'll probably be the same. Why would it matter? If
> you're using timing dependent logic, your logic is wrong.

Unfortunately I need a precise answer :). And not every piece of timing logic
is wrong.


> > // store buffers are empty
> > mov aligned_mem_addr, value_32bit
> > sfence // to flush store buffers
> >
> > does NOT depends on aligned_mem_addr value? Any 'memory bank', 'DIMM
> > module' issues?
>
> No. Cache line state will affect it. But again, why should you care?

I am trying to "delay" one processor until the others finish a "write and
flush store buffers" sequence, by performing a "write to another memory
location" (all memory locations are cached).
Could you be more specific about "cache line state"? I wonder how a "flush
store buffer" is performed if the buffer contains only one store -- is it
constant time "to generate and send a message to the cache, then wait for the
cache coherency mechanism to finish", or something else?

Basically I have built a sophisticated ultra-fast lock which works on x86,
but I need some guarantees from the hardware to prove its correctness. Maybe
I should post the idea here?

Bye.
Sincerely yours, Michael.

Michael Pryhodko

Mar 31, 2005, 8:38:46 PM
Well, I have decided to show my child :) It is somewhat raw but readable:


const P = 123;  // unique thread id

old_val CAS(addr, cmp_val, new_val)  // one processor instruction
                                     // (i.e. the OS can't interrupt in the middle)
{
    // instruction microcode

    // CAS.1 - range label, used in the speculations below
    FETCH CACHE_LINE(addr)  // unpredictable timing (could it take 0 time?)

    // CAS.2 - range label, used in the speculations below
    // timing of this part is predictable (fixed?)
    DEST := *addr
    IF eax = DEST
    THEN
        ZF <- 1
        DEST = new_val
    ELSE
        ZF <- 0
        EAX = DEST
    FI;
}

A = 0;  // lock flag, aligned for the CAS operation

lock()
{
    // drain the store buffers of every thread upon entry
    // could help with the estimations in the 'wait' step below
    // use mfence? is it possible for the 'read' part of CAS to pass this sfence?
    sfence

ENTRY:
    // if the OS interrupts here, store buffers will be empty upon return anyway

    // the OS can't interrupt in the middle of CAS, since it is one instruction
    old_val = CAS(&A, 0, P);

    // if the OS interrupts here, it will drain the store buffers
    // (i.e. it will do the same as the 'sfence' call below)

    // flush the store buffers to make the new value visible to other threads
    sfence

    // check if we could participate at all
    // put it before 'sfence'? or could the processor do that itself?
    if (old_val != 0)
    {
        pause  // could help with P6 branch prediction
        jmp ENTRY
    }

    // now we need to wait long enough for every thread (which entered its CAS.2
    // before we finished our 'sfence') to finish its [CAS.2, sfence]
    // note: a thread will 100% lose if it enters [CAS.2, sfence] AFTER this
    // thread finishes its 'sfence' instruction

    // how to wait this long?
    // 1. we could estimate the maximum number of clocks for [CAS.2, sfence]
    //    to finish and wait that long
    // 2. emulate this time with another [CAS.2, sfence] call to a dummy
    //    variable B? I wonder if writing to memory takes the same time
    //    regardless of address (considering the same memory type)
    // 3. use some kind of barrier?

    //!! how?

    // maybe like this:
    B = 0         // B is a thread-local variable (e.g. on the stack)
    CAS(&B, 0, 1) // CAS(&B, B, ~B)? -- in that case there is no need for 'B = 0'
                  // (if optimization won't kill us; hopefully the '&B' and '~B'
                  // instructions will compensate for it)
    // sfence     // with 'B = 0' this operation could take even longer than the
                  // other threads' CAS 'sfence'
    // this operation should cover the 'part of CAS' + 'sfence' timing of the
    // other processors
    // (considering that:
    // 1. 'write to memory' time is the same regardless of address and processor
    // 2. existing optimization techniques did not kill us)

    // prevent 'if (A != P)' from passing the 'wait' step above
    // lfence

    // hmmm, 'lfence' could pass 'sfence' -> 'read A' could pass sfence
    // -> bad; change 'sfence + lfence' -> 'mfence'
    mfence

    // check for the winner
    if (A != P)
    {
        pause  // could help with P6 branch prediction
        jmp ENTRY
    }
}


unlock()
{
    // sanity check
    ASSERT(A == P);

    // mark the flag 'free'
    A = 0;

    // if the OS interrupts here, it will drain the store buffers
    // (i.e. it will do the same as the 'sfence' call below)

    // make it visible to other threads
    // we could skip this instruction but it will speed up lock release,
    // thus increasing overall performance in the high-contention case
    sfence
}

Michael Pryhodko

Apr 1, 2005, 12:27:20 AM
Very strange... I've tried to test an implementation of the idea above and
it fails for reasons I do not understand! Here is the source of the test
application. Somehow I can get to the "we should not be here" line on
my P4 2.8GHz HT -- which is totally confusing to me: only the current
thread ever writes this specific unique id into the m_lock variable, and
since we are at the beginning of lock(), m_lock should be !=
curr_thread_id. It looks like I am blind.

----------------------------------- sources
-----------------------------------

#include <windows.h>
#include <tchar.h>
#include <process.h>
#include <iostream>


#define THREAD_COUNT (5)


//////////////////////////////////////////////////
// mtlock class
class mtlock
{
__declspec(align(32)) LONG m_lock;
public:
mtlock() throw() : m_lock(0) {}
~mtlock() throw() {}

void lock(LONG dwId) throw()
{
LONG dummy = 5;
if (m_lock == dwId)
{
// !! we should not be here! But it just happens :( !!
__asm { int 3 }
}

__asm {
mov edx, dwId
mov ebx, this

mfence

ENTER_LOCK:
// CAS(&m_lock, 0, dwId)
xor eax, eax
mov ecx, edx
cmpxchg [ebx]this.m_lock, ecx

sfence

// if (successful)
je PROCEED
pause
jmp ENTER_LOCK

PROCEED:
// CAS(&dummy, dummy, ~dummy)
mov ecx, dummy
mov eax, ecx
neg ecx
cmpxchg dummy, ecx

mfence

// if (m_lock == dwId)
cmp [ebx]this.m_lock, edx
je EXIT_LOCK
pause
jmp ENTER_LOCK
EXIT_LOCK:
}
}

void unlock(LONG dwId) throw()
{
if (m_lock != dwId)
{
__asm { int 3 }
}

__asm {
mov eax, this
mov [eax]this.m_lock, 0
sfence
}
}
};

//////////////////////////////////////////////////
// global variables
mtlock lock;

LONG g_stat[THREAD_COUNT] = {0}; // enter/leave stats


#define ASSERT(x) if (!(x)) { __asm { int 3 }; }

void __cdecl thread_proc(void* pData)
{
LONG dwThreadId = GetCurrentThreadId();
LONG idx = (LONG)pData;
for(;;)
{
lock.lock(dwThreadId);
++g_stat[idx];
lock.unlock(dwThreadId);

Sleep(0);
}
}


int _tmain(int argc, _TCHAR* argv[])
{
for(int i = 0; i < THREAD_COUNT; ++i)
{
_beginthread(&thread_proc, 0, (void*)i);
}

Sleep(10000);

lock.lock(GetCurrentThreadId());

for(int i = 0; i < THREAD_COUNT; ++i)
{
std::cout << i << " " << g_stat[i] << std::endl;
}

lock.unlock(GetCurrentThreadId());

return 0;
}

Michael Pryhodko

Apr 1, 2005, 1:04:25 AM
[implementation with problem skipped]

So far it looks like I am speaking to myself. :) Well, it seems I found why
the proposed code does not work. It seems that the internal implementation of
CMPXCHG does not conform to the microcode listed in the Intel Architecture
manual, i.e.:
the processor reads the value from memory, compares it with EAX and, if they
are different, the processor WRITES BACK the original value!! Shit!! Who
decided to do that, and why??? Until someone explains the reasons behind this
decision to me, I will consider it a very "bright" and "brilliant" move from
Intel.

Proof: if you replace unlock function with:

----------------------------------------------------------------------
void unlock(LONG dwId) throw()
{
LONG dummy = 5;


if (m_lock != dwId)
{
__asm { int 3 }
}

__asm {
mov ebx, this
mov edx, dwId
ENTER_UNLOCK:
mov [ebx]this.m_lock, 0
sfence

// CAS(&dummy, dummy, ~dummy)
mov ecx, dummy
mov eax, ecx
neg ecx
cmpxchg dummy, ecx

mfence

// if (m_lock == dwId)
cmp [ebx]this.m_lock, edx

jne EXIT_UNLOCK
pause
jmp ENTER_UNLOCK
EXIT_UNLOCK:
}
}
----------------------------------------------------------------------

the test app starts working. But as far as I understand this implementation
is wrong -- i.e. it could put 0 into m_lock more than once, possibly
causing simultaneous entries and thus violating the lock guarantee (if we
have more than 2 competing threads). :(


Bye.
Sincerely yours, Michael.

Joe Seigh

Apr 1, 2005, 8:06:01 AM
On 31 Mar 2005 17:38:46 -0800, Michael Pryhodko <mpry...@westpac.com.au> wrote:

> Well, I have decided to show my child :) It somewhat raw but readable:
>

(snip)

>
> unlock()
> {
> // sanity check
> ASSERT(A == P);
>

You need the sfence here for release semantics. Strictly speaking
it has to be mfence (store/store + store/load)

> // mark flag 'free'
> A = 0;
>
> // if OS will interrupt here this will drain store buffers
> // (i.e. it will do the same as 'sfence' call below)
>
> // make it visible to other threads
> // we could skip this instruction but it will speed up lock release
> // thus increasing overall performance in high-contention case
> sfence

Um, no. Memory barriers don't make memory visibility faster for coherent
cache memory. The opposite, actually.

> }
>
>

I'm not sure why you're concerned with timings. The membars don't do
what you think they do. And if you try to force specific timing
behavior you are going to slow things down considerably which is
probably not what you want.

What are you trying to do here?


--
Joe Seigh

Joe Seigh

Apr 1, 2005, 8:12:04 AM
On 31 Mar 2005 22:04:25 -0800, Michael Pryhodko <mpry...@westpac.com.au> wrote:

> [implementation with problem skipped]
>
> So far it looks like speaking to myself. :) Well, it seems I found why
> proposed code does not work. It seems that internal implementation of
> CMPXCHG does not conform microcode listed in Intel Architecture manual,
> i.e.:
> processor reads value from memory, compares it with EAX and if they are
> different processor WRITES BACK original value!! Shit!! Who and why
> decided to do that??? Until someone explains me reasons behind this
> decision I will consider it as very "bright" and "brilliant" move from
> Intel.
>
> Proof: if you replace unlock function with:
>

(snip)


>
> test app starts working. But as far as I understand this implementation
> is wrong -- i.e. it could put 0 into m_lock more than one time possibly
> causing simultaneous entries thus violating lock guarantee (if we have
> more than 2 competing threads). :(
>
>

The redundant store isn't observable. cmpxchg is atomic. If there's
a problem, it's with your code elsewhere. I'm not quite sure what
you are trying to do, so I don't know where to look or what suggestions
to make.

--
Joe Seigh

Alexander Terekhov

Apr 1, 2005, 8:17:15 AM

Joe Seigh wrote:
>
> On 31 Mar 2005 17:38:46 -0800, Michael Pryhodko <mpry...@westpac.com.au> wrote:
>
> > Well, I have decided to show my child :) It somewhat raw but readable:
> >
> (snip)
>
> >
> > unlock()
> > {
> > // sanity check
> > ASSERT(A == P);
> >
> You need the sfence here for release semantics. Strictly speaking
> it has to be mfence (store/store + store/load)
>
> > // mark flag 'free'
> > A = 0;

You need vacation, Joe. ;-)

http://groups.google.de/groups?selm=3EBFDC37.DDDE3230%40web.de
http://groups.google.de/groups?selm=42235BEC.C06F939B%40web.de

regards,
alexander.

Joe Seigh

Apr 1, 2005, 9:29:09 AM

Still morning here.


--
Joe Seigh

Michael Pryhodko

Apr 1, 2005, 9:32:25 AM
> The redundant store isn't observable. cmpxchg is atomic. If there's
> a problem, it's with your code elsewhere. I'm not quite sure what
> you are trying to do, so I don't know where to look or what suggestions
> to make.

Hmmm... try to run this small test app, and you'll see that:
- cmpxchg (without the LOCK prefix) is not atomic (it is impossible to do a
read+write operation atomically on x86 without using a bus LOCK)
- this stupid redundant store is quite observable, and basically it
prevents my LOCK'less lock from working! It makes the 'unlock' operation
impossible to implement

or, on the other hand, I made a VERY stupid mistake...


About what I am trying to do -- see my answer to another posting.

Bye.
Sincerely yours, Michael.

Joe Seigh

Apr 1, 2005, 9:57:36 AM
On 1 Apr 2005 06:32:25 -0800, Michael Pryhodko <mpry...@westpac.com.au> wrote:

>> The redundant store isn't observable. cmpxchg is atomic. If there's
>> a problem, it's with your code elsewhere. I'm not quite sure what
>> you are trying to do, so I don't know where to look or what suggestions
>> to make.
>
> Hmmm... try to run this small test app, you'll that:
> - cmpxchg (without LOCK prefix) is not atomic (it is impossible to do
> read+write operation atomic on x86 without using bus LOCK)
> - this stupid redundant store is quite observable, and basically it
> prevent my LOCK'less lock from working! It makes 'unlock' operation
> impossible to implement
>
> or on other hand I did a VERY stupid mistake...

My understanding is that the processor locks the cache line with the LOCK
prefix. There's no way for other processors to observe that store.


>
>
> About what I am trying to do -- see my answer to another posting.
>

You mean this?
++ I am trying to "delay" one processor until others will finish "write
++ and flush store buffers" with "write to another memory location" (all
++ memory locations are cached).

What you probably want is a barrier synchronization object. See the
pthread barrier documentation for more details.
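
For reference, a minimal sketch (not from the original posts) of such a
barrier object using the POSIX pthread_barrier API: every thread blocks in
pthread_barrier_wait() until the agreed number of threads has arrived, which
is the portable way to make threads "wait for each other" without relying on
instruction timing.

----------------------------------------------------------------------
#include <pthread.h>
#include <cstdio>

#define THREADS 4

static pthread_barrier_t g_barrier;    // shared by all worker threads

static void* worker(void* arg)
{
    long id = (long)arg;
    std::printf("thread %ld: before barrier\n", id);
    pthread_barrier_wait(&g_barrier);  // blocks until all THREADS have arrived
    std::printf("thread %ld: after barrier\n", id);
    return 0;
}

int main()
{
    pthread_t t[THREADS];
    pthread_barrier_init(&g_barrier, 0, THREADS);
    for (long i = 0; i < THREADS; ++i)
        pthread_create(&t[i], 0, worker, (void*)i);
    for (int i = 0; i < THREADS; ++i)
        pthread_join(t[i], 0);
    pthread_barrier_destroy(&g_barrier);
    return 0;
}
----------------------------------------------------------------------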

--
Joe Seigh

Michael Pryhodko

Apr 1, 2005, 9:59:03 AM

Yes-yes-yes... :) I was using that knowledge well; thank you, Alexander,
for the "likbez" (crash course).
What is your opinion? Can you suggest something to overcome the
'redundant store in cmpxchg' problem? This would allow creating a lock on
x86 that is extremely cheap in both low- and high-contention cases.

Bye.
Sincerely yours, Michael.

Michael Pryhodko

Apr 1, 2005, 10:02:35 AM
> > unlock()
> > {
> > // sanity check
> > ASSERT(A == P);
> >
> You need the sfence here for release semantics. Strictly speaking
> it has to be mfence (store/store + store/load)

Why? According to the IA manual, stores are observed in program order. Do you
disagree?


> > // mark flag 'free'
> > A = 0;
> >
> > // if OS will interrupt here this will drain store buffers
> > // (i.e. it will do the same as 'sfence' call below)
> >
> > // make it visible to other threads
> > // we could skip this instruction but it will speed up lock release
> > // thus increasing overall performance in high-contention case
> > sfence
> Um, no. Memory barriers don't make memory visibility faster for coherent
> cache memory. The opposite, actually.

That contradicts everything I learned so far. Could you prove it?


> I'm not sure why you're concerned with timings. The membars don't do
> what you think they do. And if you try to force specific timing
> behavior you are going to slow things down considerably which is
> probably not what you want.
>
> What are you trying to do here?

I am trying to create the cheapest possible "lock". The main idea (suppose
there are no store buffers and the OS cannot interrupt):
rule1. all processors run at the same speed (this is very important) and
cannot be stopped/slowed
rule2. suppose that a store is visible immediately
rule3. the memory model is SC (sequential consistency)
rule4. every thread has its own unique number Pi

now consider simple program:

// try to lock
enter:
if (lock == 0)
lock = Pi
else
goto enter

// now every thread that tries to lock will fail
// but we could have other threads that passed the 'if (lock == 0)' check
// so we will use rule1 and just wait until all those threads finish
// their 'lock = Pi' command
// In my case I do it by running a 'store' to a dummy variable

dummy = 5


// now we have a guarantee that every such thread finished its 'store'
// so we just check whether we were the "last" one to write to lock

if (lock != Pi)
goto enter


Unlock is simple:

lock = 0


Now you should understand... Everything else is there to fool the OS, the
store buffer and so on. Unfortunately cmpxchg has a redundant store which
kills this whole idea :((.
And I can not replace cmpxchg with
if (lock == 0)
lock = Pi

because the OS could interrupt between these two lines, and I have not found
a solution for how to:
- prevent it, or
- design a lock which is immune to that effect
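
For contrast, a minimal sketch (not part of the original post) of the
conventional x86 approach that sidesteps the interrupt problem entirely: the
Win32 InterlockedCompareExchange call compiles to a LOCK'ed cmpxchg, so the
compare and the store happen as one indivisible step -- at the price of the
locked transaction this algorithm is trying to avoid.

----------------------------------------------------------------------
#include <windows.h>

class interlocked_lock
{
    __declspec(align(32)) volatile LONG m_lock;   // 0 = free, otherwise owner id
public:
    interlocked_lock() throw() : m_lock(0) {}

    void lock(LONG id) throw()
    {
        // atomically replace 0 with id, or fail and spin
        while (InterlockedCompareExchange(&m_lock, id, 0) != 0)
            YieldProcessor();                     // 'pause' hint while spinning
    }

    void unlock(LONG /*id*/) throw()
    {
        m_lock = 0;                               // aligned 32-bit store is atomic
    }
};
----------------------------------------------------------------------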

Bye.
Sincerely yours, Michael.

Michael Pryhodko

unread,
Apr 1, 2005, 10:04:08 AM4/1/05
to
> My understanding is the processors locks the cach line with the LOCK
> prefix. There's no way for other processors to observe that store.

Who told you that I am using the LOCK prefix? Run this test app yourself!

> > About what I am trying to do -- see my answer to another posting.
>
> You mean this?
> ++ I am trying to "delay" one processor until others will finish "write
> ++ and flush store buffers" with "write to another memory location" (all
> ++ memory locations are cached).
>
> What you probably want is a barrier synchronization object. See the
> pthread barrier documentation for more details.

No, this is a lock oriented only toward x86. It could be easily adapted
to any other platform that can provide the necessary guarantees:
- a compare-and-exchange instruction
- the OS can interrupt only between instructions
- full store (i.e. store+sfence on x86) timing is predictable
- every processor runs at the same speed and can not be delayed/stopped
- a cache coherency mechanism is present

Well, read another posting for more detailed explanation.

Bye.
Sincerely yours, Michael.

David Schwartz

Apr 1, 2005, 12:22:29 PM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112366998.3...@l41g2000cwc.googlegroups.com...

> Now you should understand... Everything else is to fool OS, store
> buffer and so on. Unfortunately cmpxchg has redundant store which kills
> this whole idea :((.
> And I can not replace cmpxchg with
> if (lock == 0)
> lock = Pi
>
> because OS could interrupt in between these lines and I did not found a
> solution how to:
> - prevent it or
> - design a lock which will be immune to that effect

How would this work even if cmpxchg didn't have the redundant store?
It's still not atomic with respect to other processors, never was, never
will be. It is, however, atomic with respect to interrupts.

DS


David Hopwood

Apr 1, 2005, 4:49:56 PM
Michael Pryhodko wrote:
> [implementation with problem skipped]
>
> So far it looks like speaking to myself. :) Well, it seems I found why
> proposed code does not work. It seems that internal implementation of
> CMPXCHG does not conform microcode listed in Intel Architecture manual,
> i.e.:
> processor reads value from memory, compares it with EAX and if they are
> different processor WRITES BACK original value!!

What microcode? The Intel arch manuals have some pseudocode listings, but
no microcode AFAICS. The pseudocode says nothing about timing, and cycle
counts should be taken as approximations.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Apr 1, 2005, 4:53:08 PM
Michael Pryhodko wrote:
> No, this is a lock oriented only for x86. It could be easily adapted
> for any other platfrom if it could provide necessary guarantees:
[...]

> - full store (i.e. store+sfence on x86) timing is predictable
> - every processor runs at the same speed and can not be delayed/stopped

These two assumptions don't hold for x86, and shouldn't be guaranteed to
hold because they would prevent *many* important optimizations (really,
not just theoretically).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

Apr 1, 2005, 5:30:05 PM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112335465.2...@l41g2000cwc.googlegroups.com...

> So far it looks like speaking to myself. :) Well, it seems I found why
> proposed code does not work. It seems that internal implementation of
> CMPXCHG does not conform microcode listed in Intel Architecture manual,
> i.e.:
> processor reads value from memory, compares it with EAX and if they are
> different processor WRITES BACK original value!! Shit!! Who and why
> decided to do that??? Until someone explains me reasons behind this
> decision I will consider it as very "bright" and "brilliant" move from
> Intel.

This is well-documented behavior of the CMPXCHG instruction. The write is
unconditional. You should not be complaining about this because if you want
atomic behavior, you have to use the LOCK prefix anyway, in which case the
write is harmless. If you don't want atomic behavior, implement the function
yourself without the extra write.
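
A rough sketch (not from the original post) of what "implement the function
yourself without the extra write" could look like, in the same MSVC inline-asm
style used elsewhere in this thread; the function name is made up. Note that
without a LOCK prefix this sequence is NOT atomic with respect to other
processors -- it merely avoids dirtying the cache line when the comparison
fails.

----------------------------------------------------------------------
#include <windows.h>

// returns true if *addr was 0 and has been replaced by new_val
static bool try_swap_if_zero(volatile LONG* addr, LONG new_val)
{
    LONG swapped = 0;
    __asm {
        mov  edx, addr          // edx = target address
        mov  ecx, new_val
        xor  eax, eax
        cmp  [edx], eax         // is the current value 0?
        jne  DONE               // no -> touch nothing, fail
        mov  [edx], ecx         // yes -> store the new value (not atomic!)
        mov  swapped, 1
    DONE:
    }
    return swapped != 0;
}
----------------------------------------------------------------------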

By the way, looking at all of your posts on this issue, I think you're
totally barking up the wrong tree. Single instructions without the LOCK
prefix are not atomic with respect to other processors anyway -- if they
were, what would the LOCK prefix be for? And you cannot rely on instruction
timings for synchronization because of things like cache misses, speculative
reads, and so on.

I suggest you think carefully about what you're trying to do.

DS


Michael Pryhodko

Apr 1, 2005, 6:45:07 PM
> These two assumptions don't hold for x86, and shouldn't be guaranteed to
> hold because they would prevent *many* important optimizations (really,
> not just theoretically).

Ok. Situation:
- two processors have empty store buffers
- the two processors were executing similar code for some time (i.e. the
pipeline state should be similar)
- now they execute:
mov mem_addr, value
sfence

For me it is quite logical that they should spend the same number of
clocks on this operation (supposing that flushing the store buffer takes
the same time regardless of mem_addr). I understand that this is only
my assumption.

But on the other hand you did not provide any proof of your point. ;)

Bye.
Sincerely yours, Michael.

Michael Pryhodko

Apr 1, 2005, 6:51:03 PM
> What microcode? The Intel arch manuals have some pseudocode
> listings, but no microcode AFAICS. The pseudocode says nothing
> about timing, and cycle counts should be taken as approximations.

ok, PSEUDOCODE. You are right. I do not make any assumptions about cmpxchg
timing; I know that cycle counts are only approximations -- I have
spent much time reading the IA manuals. Here is my assumption:

under certain similar circumstances (all processors' store buffers are
drained) one processor could safely wait for other processors to finish
their "cmpxchg+sfence" code by executing the same instructions
(cmpxchg+sfence) on a different memory location. If you can prove that
this is wrong -- you are more than welcome; if not -- back off.

Bye.
Sincerely yours, Michael.

David Schwartz

Apr 1, 2005, 6:59:55 PM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112399107....@f14g2000cwb.googlegroups.com...

> Ok. Situation:
> - two processors have empy store buffers
> - two processor were executing similar code for some time (i.e.
> pipeline state should be similar)
> - now they execute:
> mov mem_addr, value
> sfence

> For me it is quite logical that they should spend the same amount of
> clocks for this operation (suppose that flushing store buffer takes
> the same time regardless of mem_addr). I understand that this is only
> my assumptions.

That's an absurd assumption and in general will not be true. This is
*especially* true if they're running similar code because they're contending
for resources and each time one CPU will win and one will lose (cachewise).

There are so many reasons why this won't be true in the real world and
new ones are appearing all the time. What about interrupts?

DS


Michael Pryhodko

Apr 1, 2005, 7:04:03 PM
> > processor reads value from memory, compares it with EAX and if they are
> > different processor WRITES BACK original value!! Shit!! Who and why
> > decided to do that??? Until someone explains me reasons behind this
> > decision I will consider it as very "bright" and "brilliant" move from
> > Intel.
>
> This is well-documented behavior of the CMPXCHG instruction. The write is
> unconditional. You should not be complaining about this because if you want
> atomic behavior, you have to use the LOCK prefix anyway, in which case the
> write is harmless. If you don't want atomic behavior, implement the function
> yourself without the extra write.

I DO NOT WANT ATOMIC BEHAVIOR! Please read carefully the more formal
description of the algorithm I posted. The whole idea of it was "we do not
need LOCK to build a lock".
The main "feature" of the algorithm was that ONLY the i-th processor could
ever write the unique value Pi into the lock variable, but with this stupid
implementation of CMPXCHG any processor that tries to enter a locked lock
writes the lock owner's Pi back on every CMPXCHG call. This spoils the whole
idea. I do not understand why Intel decided to implement CMPXCHG this way,
but from all I have learned so far I do not know any good "excuse" for that.


> By the way, looking at all of your posts on this issue, I think you're
> totally barking up the wrong tree.

I suggest you take more care next time -- this statement is quite an
insult in my culture.


> Single instructions without the LOCK
> prefix are not atomic with respect to other processors anyway -- if they
> were, what would the LOCK prefix be for? And you cannot rely on instruction
> timings for synchronization because of things like cache misses, speculative
> reads, and so on.

Please read the IA manual carefully, especially the section where it states
the necessary conditions for simple operations to be atomic. For example, a
store of a 32-bit value to a 4-byte-aligned memory address is atomic without
any LOCK (which is forbidden for 'mov' anyway).

And by the way, in this example I took 'cache misses' and 'speculative
reads' into account. Read the comments carefully and pay attention to the
sfences/mfences in the code.
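
A tiny sketch (not from the original post) of the guarantee being cited: per
the IA-32 manual, an aligned 32-bit load or store executes atomically even
without a LOCK prefix, in the sense that a torn value can never be observed;
this says nothing about ordering or about when other processors see the store.

----------------------------------------------------------------------
#include <windows.h>

__declspec(align(4)) volatile LONG g_flag = 0;   // name is illustrative only

void publish(LONG value)
{
    g_flag = value;   // one aligned 32-bit mov: never observed half-written
}

LONG observe()
{
    return g_flag;    // one aligned 32-bit load: an old or new value, never a mix
}
----------------------------------------------------------------------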


> I suggest you think carefully about what you're trying to do.

Read carefully before answering. ;)


Bye.
Sincerely yours, Michael.

Michael Pryhodko

Apr 1, 2005, 7:10:05 PM
> > Now you should understand... Everything else is to fool OS, store
> > buffer and so on. Unfortunately cmpxchg has redundant store which kills
> > this whole idea :((.
> > And I can not replace cmpxchg with
> > if (lock == 0)
> > lock = Pi
> >
> > because OS could interrupt in between these lines and I did not found a
> > solution how to:
> > - prevent it or
> > - design a lock which will be immune to that effect
>
> How would this work even if cmpxchg didn't have the redundant store?
> It's still not atomic with respect to other processors, never was, never
> will be. It is, however, atomic with respect to interrupts.

Read the pseudocode for x86 (my first post, "I decided to show my child")
carefully, then run the test app I posted afterwards. You will see that
this redundant store is ruining the whole idea, because after the lock is
unlocked, somebody else immediately writes its previous value back -- thus
locking everybody out, since the lock is not reentrant.
I know that cmpxchg without LOCK is not atomic; I do not rely on that.
Any single instruction is atomic with respect to interrupts.

If you are really interested I will gladly answer your specific
questions. :)

Bye.
Sincerely yours, Michael.

David Schwartz

Apr 1, 2005, 7:47:45 PM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112400242.9...@o13g2000cwo.googlegroups.com...

> I DO NOT WANT ATOMIC BEHAVIOR! Please, read carefully more formal
> description of algorithm I posted. The whole idea of it was "we do not
> need LOCK to do lock".
> Main "feature" of algorithm was that ONLY i-th processor could ever
> write unique value Pi into lock variable, but with this stupid
> implementation of CMPXCHG any processor that tries to enter locked lock
> writes Pi of lock owner every CMPXCHG call. This spoils whole idea.
> I do not understand why Intel decided to implement CMPXCHG this way,
> but from all I learned so far I do not know any good "excuse" for that.

I'm utterly baffled how your algorithm was supposed to work. What keeps
all the processors from reading the zero value and all deciding to exchange
it with the Pi value? Since the operation is not atomic, it's equivalent to
'read, compare, possibly store'. What stops all the processors from reading
at the same time, comparing at the same time, and then storing in
succession, each thinking it wrote the unique value? Each succeeding with
its compare/exchange.

If you're waiting for all threads to finish anyway, how does the extra
write hurt you? You wait for all threads to finish and read the value of the
lock. So what's the problem?

And why are you doing all this anyway when it's clear that a simple LOCK
prefix will 100% work.

DS


Michael Pryhodko

Apr 1, 2005, 9:43:10 PM
> > Ok. Situation:
> > - two processors have empy store buffers
> > - two processor were executing similar code for some time (i.e.
> > pipeline state should be similar)
> > - now they execute:
> > mov mem_addr, value
> > sfence
>
> > For me it is quite logical that they should spend the same amount of
> > clocks for this operation (suppose that flushing store buffer takes
> > the same time regardless of mem_addr). I understand that this is only
> > my assumptions.
>
> That's an absurd assumption and in general will not be true. This is
> *especially* true if they're running similar code because they're contending
> for resources and each time one CPU will win and one will lose (cachewise).

Prove that this is absurd! So far, to my understanding of x86 processor
internals (i.e. the pipeline, retirement unit and so on), this should be
true.


> There are so many reasons why this won't be true in the real world, and
> new ones are appearing all the time. What about interrupts?

if the interrupt happens before the mov -- everything is ok
if the interrupt happens between the mov and the sfence -- it will flush the
store buffer (see the IA manual) -- i.e. it will do the same thing as sfence

so an interrupt is ok here -- either the store operation never began, or it
began and finished in finite time

Bye.
Sincerely yours, Michael.

David Schwartz

Apr 2, 2005, 12:16:00 AM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112409790....@g14g2000cwa.googlegroups.com...

>> > For me it is quite logical that they should spend the same amount
>> > of
>> > clocks for this operation (suppose that flushing store buffer
>> > takes
>> > the same time regardless of mem_addr). I understand that this is
>> > only
>> > my assumptions.

>> There are so many reasons why this won't be true in the real world and
>> new ones are appearing all the time. What about interrupts?

> if interrupt happens before mov -- everything is ok
> if interrupt happens between mov and sfence -- it will flush store
> buffer (see IA Manual) -- i.e. will do the same thing as sfence

> so interrupt is ok here -- either store operation never begun, or it
> begun and finished in the finite time

So now it sounds like you don't care whether the operations take the
same amount of time or not. Could you try stating what you're trying to do
in simple terms? For example, is each thread trying to exchange the same
value, or does each thread have a unique ID that it's trying to write? If
each thread is writing a unique ID (unique to that thread), how does the
extra write hurt you? Just have the thread go back and read the value and
compare it a second time.

It sounds, though, that all of this is more effort than just using the
LOCK prefix.

DS


Michael Pryhodko

Apr 2, 2005, 12:17:14 AM
> I'm utterly baffled how your algorithm was supposed to work. What keeps
> all the processors from reading the zero value and all deciding to exchange
> it with the Pi value? Since the operation is not atomic, it's equivalent to
> 'read, compare, possibly store'. What stops all the processors from reading
> at the same time, comparing at the same time, and then storing in
> succession, each thinking it wrote the unique value? Each succeeding with
> its compare/exchange.

Nothing stops them. But there are some tricks:
1. the first processor that finishes its 'cmpxchg+sfence' will prevent any
processor which has not started its 'cmpxchg' (or, to be more precise, which
has not yet FETCH'ed the corresponding cache line; see label CAS.1 in the
pseudocode) from entering the lock (i.e. cmpxchg will always fail, since the
m_lock variable will have a non-zero value)
2. But since we could have a number of other processors that are currently in
the process of executing their 'cmpxchg+sfence' instructions, we need to wait
for them to finish. After we finish waiting (whether it is Sleep(100)
or a store to a dummy variable) we can simply check the value of lock to find
which processor "wins". After that point every processor is either still
trying to enter the lock or has already finished the 'store' part of cmpxchg
and made its change visible. The winner is the last processor that wrote
to the lock variable. The others will start the process from the beginning.


> If you're waiting for all threads to finish anyway, how does the extra
> write hurt you? You wait for all threads to finish and read the value of the
> lock. So what's the problem?

Run the test app and you'll see -- the redundant store makes it impossible to
implement the lock. Read my corresponding posting about that.


> And why are you doing all this anyway when it's clear that a simple LOCK
> prefix will 100% work.

Because LOCK is not so cheap on x86. Because this is a very interesting
task. Because I thought it was possible :(. Basically only two problems are
left:
1. implement the waiting step (or prove that my variant is ok).
2. get around the redundant store in cmpxchg on x86.


Bye.
Sincerely yours, Michael.

David Schwartz

Apr 2, 2005, 12:26:43 AM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112419034.1...@f14g2000cwb.googlegroups.com...

> 1. first processor who finish his 'cmxchg+sfence' will prevent any
> processor which does not started 'cmpxchg' (or to be more precise which
> does not yet FETCH'ed corresponding cache line, see label CS.1 in
> pseudocode) from entering lock (i.e. cmpxchg wil always fail since
> m_lock variable will have non-zero value)

> 2. But since could have a number of otherprocessors that currently in
> process of executing their 'cmxchg+sfence' instructions we need to wait
> for them to finish. After we finished waiting (whether it is Sleep(100)
> or store to dummy variable) we could simply check value of lock to find
> which processor "wins". Because after that point every processor either
> trying to enter the lock or already finished 'store' part of cmpxchg
> and made his changes visible. Winner is the last processor who writes
> to lock variable. Other will start process from the beginning.

I don't see how the redundant store hurts this. If you didn't get the
lock, you have to loop and try again anyway, right?

>> If you're waiting for all threads to finish anyway, how does the extra
>> write hurt you? You wait for all threads to finish and read the value of the
>> lock. So what's the problem?
>
> Run test app and you'll see -- redundand store makes it impossible to
> implement lock. Read my corresponding posting about that.

I have read it but I don't understand it. If you don't get the lock
(whether due to the redundant store or due to another processor grabbing
it), you just loop.

>> And why are you doing all this anyway when it's clear that a
>> simple LOCK
>> prefix will 100% work.

> Because LOCK is not so cheap on x86.

The fence is about as expensive as the lock.

> Because this is very interesting
> task.

Yes, just not useful because it relies upon so many fragile assumptions.

> Because I thought it is possible

Probably possible.

> :(. Basically only two problems
> left:
> 1. implement waiting process (or prove that my variant is ok).

That is a key problem. You must wait the pipeline depth, which should
cost as much as a LOCK.

> 2. get around redundant store in cmpxchg on x86.

I don't see how that hurts you. After all the threads get through,
either some thread owns the lock or no thread does, shouldn't they all agree
on which?

DS


Michael Pryhodko

Apr 2, 2005, 1:06:37 AM
> > 2. But since could have a number of otherprocessors that currently in
> > process of executing their 'cmxchg+sfence' instructions we need to wait
> > for them to finish. After we finished waiting (whether it is Sleep(100)
> > or store to dummy variable) we could simply check value of lock to find
> > which processor "wins". Because after that point every processor either
> > trying to enter the lock or already finished 'store' part of cmpxchg
> > and made his changes visible. Winner is the last processor who writes
> > to lock variable. Other will start process from the beginning.
>
> I don't see how the redundant store hurts this. If you didn't get the
> lock, you have to loop and try again anyway, right?

Right, it does not hurt here. It makes the 'unlock' operation impossible.


> > Run test app and you'll see -- redundand store makes it impossible to
> > implement lock. Read my corresponding posting about that.
>
> I have read it but I don't understand it. If you don't get the lock
> (whether due to the redundant store or due to another processor grabbing
> it), you just loop.

Read more carefully :)). It is impossible to UNLOCK. This algorithm is
based on the assumption that a given Pi could be written to 'm_lock' only by
processor 'i'. But the redundant store in cmpxchg violates that. And I can
not replace 'cmpxchg' with 'if (lock == 0) lock = Pi', because the OS could
interrupt between 'if (lock == 0)' and 'lock = Pi', spoiling the fun.


> > Because LOCK is not so cheap on x86.
>
> The fence is about as expensive as the lock.

LOCK could be much more expensive than a fence. LOCK could influence the
whole system by locking the system bus (if mem_addr is not in the cache, or
is split across cache lines, or you have a 486 PC), while a fence influences
only the given processor. Thus the fence becomes more and more desirable as
the number of processors increases. (Assuming the cache coherency
mechanism does not make 'sfence' execute longer when you have more
processors -- I do not know the details of cache coherency mechanism
implementations.)


> > Because this is very interesting task.
>
> Yes, just not useful because it relies upon so many fragile assumptions.

:) This algorithm is quite interesting by itself. Maybe the way it is
implemented for x86 is fragile, but there are other platforms that
could provide necessary guarantees and benefit from this algorithm.
Consider my work as research.


> > :(. Basically only two problems left:
> > 1. implement waiting process (or prove that my variant is ok).
>
> That is a key problem. You must wait the pipeline depth, which should
> cost as much as a LOCK.

No, I do not need to wait for the whole pipeline to retire.
I do not think this is the key problem. Something is telling me that it is

> > 2. get around redundant store in cmpxchg on x86.
>
> I don't see how that hurts you. After all the threads get through,
> either some thread owns the lock or no thread does, shouldn't they all agree
> on which?

It makes it impossible to implement the 'unlock' operation.


Bye.
Sincerely yours, Michael.

Michael Pryhodko

Apr 2, 2005, 1:17:47 AM
> Could you try stating what you're trying to do in simple terms?

I am trying to implement on x86 a "lock" object according to the formalized
algorithm I described in another post. I've described more than once what
I am trying to do. Just read my other posts carefully.

Bye.
Sincerely yours, Michael.

Joe Seigh

Apr 2, 2005, 5:11:43 AM
On 1 Apr 2005 22:17:47 -0800, Michael Pryhodko <mpry...@westpac.com.au> wrote:

>> Could you try stating what you're trying to do in simple terms?
>
> I am trying to implement on x86 "lock" object according to formalized
> algorithm I described in other post. I've described more than once what
> I am trying to do. Just read carefully my other posts.
>

It's hard to figure out sometimes what code is supposed to do even when
it does work.

On the basis that you're trying to implement a "lock" and seem to not
want to use the interlocked hardware primitives, I would guess at this
point you are trying to implement a lock using a distributed algorithm.
Distributed algorithm is a formal term and is where threads read other
threads' storage but do not write to it. Known distributed algorithms for
locking include Dekker's algorithm and Peterson's algorithm.

You should study those algorithms first and learn how they work. Note
that any write-up on them will assume a totally ordered memory model
and you'll have to use the proper memory barriers in any implementations
you do.
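
For reference, a minimal sketch of Peterson's algorithm for two threads,
written here with C++11 std::atomic (which post-dates this thread); the
sequentially consistent operations supply the store/load ordering that a bare
x86 implementation would otherwise need an explicit mfence for.

----------------------------------------------------------------------
#include <atomic>

struct peterson_lock                       // thread ids are 0 and 1
{
    std::atomic<bool> interested[2];
    std::atomic<int>  victim;

    peterson_lock() : victim(0)
    {
        interested[0] = false;
        interested[1] = false;
    }

    void lock(int id)
    {
        int other = 1 - id;
        interested[id].store(true);        // announce our intent
        victim.store(id);                  // let the other thread go first
        while (interested[other].load() && victim.load() == id)
            ;                              // spin until it is our turn
    }

    void unlock(int id)
    {
        interested[id].store(false);       // drop our claim
    }
};
----------------------------------------------------------------------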

Distributed algorithms aren't generally used as they don't perform as well
as the hardware primitives that you are trying to avoid. The only
place they're used is on very large systems with lots of processors
where the hardware primitives don't work, don't scale, or don't exist.


--
Joe Seigh

Michael Pryhodko

Apr 2, 2005, 5:46:44 AM
> It's hard to figure out sometimes what code is supposed to do even when
> it does work.

Agreed. Also it seems that I am a very bad teacher, since quite often I
have trouble making others understand something. :)


> On the basis that you're trying to implement a "lock" and seem to not
> want to use the interlocked hardware primitives, I would guess at this
> point you are trying to implement a lock using a distributed algorithm.

Yes, although I did not know the term "distributed algorithm"; I do not
know the algorithm classification. Specifically, I tried to create such an
algorithm for x86 using its features (acquire semantics of reads, release
semantics of writes, writes from one processor being observed in program
order by the others, and so on), but I also found a "formalization" of this
algorithm and the guarantees necessary to implement it on any platform that
can provide them.


> Distributed algorithm is a formal term and is where threads read other
> thread's storage but not write to it. Known distributed algorithms for
> locking include Dekker's algorithm and Peterson's algorithm.
>
> You should study those algorithms first and learn how they work.

Certainly -- remember, you advised me a while ago to look at them? That
was when I started working on this algorithm. I studied those
algorithms well, found their generalization to N threads pretty
ineffective, and started working on my own algorithm.


> Note
> that any write up on them will assume a totally ordered memory model
> and you'll have to use the proper memory barriers in any implementations
> you do.

Certainly. I understand the complications of different memory models well,
thanks to Alexander Terekhov and WRL-95-9.pdf :).


> Distributed algorithms aren't generally used as they don't perform as well
> as the hardware primitives that you are trying to avoid. The only
> place they're used is on very large systems with lots of processors
> where the hardware primitives don't work, don't scale, or don't exist.

Well, I think this algorithm will perform very well even on a
one-processor system. :)


About the algorithm description -- in one post I described it in a very
simple way (similar to the usual description of Dekker's algorithm). There is
no reason to repeat it. I could describe it in even simpler terms (e.g.
putting bananas in holes and so on :)) if you ask me. But what I
really want is:
- specific questions
- some advice in the areas where "I am not sure", such as the "wait" part of
the algorithm and the "cmpxchg redundant store" problem


Bye.
Sincerely yours, Michael.

David Schwartz

Apr 4, 2005, 2:40:27 PM

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112421997....@f14g2000cwb.googlegroups.com...


>> I don't see how the redundant store hurts this. If you didn't get
>> the
>> lock, you have to loop and try again anyway, right?

> Right, It does not hurt here. It makes 'unlock' operation impossible.

Not impossible. It just means that after you 'unlock', you have to go
back and make sure you don't still have the lock, just as you did with the
'lock' operation.

>> > Because LOCK is not so cheap on x86.

>> The fence is about as expensive as the lock.

> LOCK could be much more expensive than fence. LOCK could influence
> whole system locking system bus (if mem_addr not in the cache or split
> accross cache lines or you have 486 PC), while fence influences only
> given processor. Thus fence becomes more and more desirable if you have
> increasing number of processors. (Considering that cache coherency
> mechanism does not make 'sfence' to execute longer if you have more
> processors -- I do not know details about cache coherency mehanism
> implementations).

First of all, you would always put the lock in its own cache line. A
processor would have to own the cache line in order to perform the unlocked
compare and exchange anyway. Your argument about a 486PC isn't convincing
because the pipeline depth is so much shorter that the LOCK prefix hurts
each CPU less. Plus, there aren't massively parallel SMP 486 systems.
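
A small illustration (not from the original post, and assuming a 64-byte line
size, typical for P4/Xeon) of the "lock in its own cache line" point: pad the
lock word out to a full line so that processors bouncing this line between
their caches do not also evict unrelated data that happens to share it.

----------------------------------------------------------------------
#include <windows.h>

__declspec(align(64)) struct padded_lock
{
    volatile LONG m_lock;                  // the actual lock word
    char          pad[64 - sizeof(LONG)];  // fill out the rest of the line
};
----------------------------------------------------------------------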

>> Yes, just not useful because it relies upon so many fragile
>> assumptions.

> :) This algorithm is quite interesting by itself. Maybe the way it is
> implemented for x86 is fragile, but there are other platforms that
> could provide necessary guarantees and benefit from this algorithm.
> Consider my work as research.

That may be true, but isn't it obvious by now that x86 is not the
platform to test/develop this on?

DS


Chris Thomasson

Apr 4, 2005, 5:12:39 PM
> Somehow I could get to the "we should not be here line" on
> my P4 2.8GHz HT -- which is totally confusing to me, only current
> thread ever writes this specific unique id into m_lock variable and
> since we are at the beginning of lock() -- m_lock should be !=
> curr_thread_id. It look like I am blind.

<snip>

I haven't totally studied your idea, but from what I can immediately gather
it seems like you're going for a simple spinlock without using the LOCK
prefix. Is that about it?

Your lock function (x86) seems to have:

1. (store/load)

loop:
2. non-atomic cas to shared
3. (store/store)
4. if ( cas_failed )
{
// slow path 1
5. ( delay )
6. goto loop
}

// fast path 1
7. non-atomic cas to local
8. (store/load)
9. if ( we_do_not_own )
{
// slow path 2
5. ( delay )
6. goto loop
}

// fast path 2
7. done

And your unlock function has:

store 0 into shared
( store/store )


So, lock function has 3 barriers ( two of which are expensive store/load's )
and 2 non-atomic CAS ( one is to shared mem ) on the fast-path, and the
unlock function has one store to shared and 1 barrier on the fast-path. Are
you sure this is vastly more efficient than just using the simple LOCK
prefix? I have never tried messing around with cmpxchg on SMP/HT system
without using the lock prefix...


Michael Pryhodko

Apr 4, 2005, 9:12:58 PM
>>> I don't see how the redundant store hurts this. If you didn't get
>>> the lock, you have to loop and try again anyway, right?
>
>> Right, It does not hurt here. It makes 'unlock' operation impossible.
>
> Not impossible. It just means that after you 'unlock', you have to go
> back and make sure you don't still have the lock, just as you did with the
> 'lock' operation.

I did that (see the augmented 'unlock' code I posted immediately after the
original 'unlock' code). But it only works if you have fewer than 3
processors.
First -- it could loop forever, i.e. the unlocking thread makes its '0'
visible exactly at the time the locking thread is doing its redundant
write, and then the redundant change becomes visible ( jmp $ :) ).
Hmmm... I can't find a failure case for 3 processors right now; I
remember I found some problems. The main idea:
- you have 3 processors waiting in a loop (i.e. constantly
failing cmpxchg and making the redundant write)
- a 4th processor tries to unlock by writing 0
- the 1st processor makes a successful check and writes its P1
- the 2nd processor "picks it up" (i.e. will soon write it back)
- the 3rd processor just makes its redundant write of P4 visible

I do not remember the details of this case, but I think the 'infinite' loop
feature is enough to throw this 'unlock' implementation away.


>> LOCK could be much more expensive than fence. LOCK could influence
>> whole system locking system bus (if mem_addr not in the cache or split
>> accross cache lines or you have 486 PC), while fence influences only
>> given processor. Thus fence becomes more and more desirable if you have
>> increasing number of processors. (Considering that cache coherency
>> mechanism does not make 'sfence' to execute longer if you have more
>> processors -- I do not know details about cache coherency mehanism
>> implementations).
>
> First of all, you would always put the lock in its own cache line. A
> processor would have to own the cache line in order to perform the unlocked
> compare and exchange anyway. Your argument about a 486PC isn't convincing
> because the pipeline depth is so much shorter that the LOCK prefix hurts
> each CPU less. Plus, there aren't massively parallel SMP 486 systems.

1. If the m_lock variable is not cached, the processor could (or will?) lock
the system bus.
2. I do not see any connection with pipeline depth; AFAIK 'sfence' and
'LOCK' do not invalidate the pipeline.

What do you mean by "owning" the cache line? If you mean LOCK'ing it (in
order to avoid a bus lock) -- that is not true, because only XCHG
implicitly locks, not CMPXCHG.
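
As an aside, XCHG's implicit LOCK is exactly what the classic test-and-set
spinlock relies on; here is a sketch (not from the original post) in the same
MSVC inline-asm style as the code in this thread, with made-up class and
label names.

----------------------------------------------------------------------
#include <windows.h>

class xchg_lock
{
    __declspec(align(32)) volatile LONG m_lock;   // 0 = free, 1 = held
public:
    xchg_lock() throw() : m_lock(0) {}

    void lock() throw()
    {
        __asm {
            mov  ebx, this
        SPIN:
            mov  eax, 1
            xchg [ebx]this.m_lock, eax    // implicitly LOCKed swap
            test eax, eax                 // old value 0 -> we now own the lock
            je   DONE
            pause                         // spin-wait hint
            jmp  SPIN
        DONE:
        }
    }

    void unlock() throw()
    {
        m_lock = 0;                       // aligned 32-bit store releases the lock
    }
};
----------------------------------------------------------------------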


>> :) This algorithm is quite interesting by itself. Maybe the way it is
>> implemented for x86 is fragile, but there are other platforms that
>> could provide necessary guarantees and benefit from this algorithm.
>> Consider my work as research.
>
> That may be true, but isn't it obvious by now that x86 is not the
> platform to test/develop this on?

1. I have mostly theoretical knowledge of other platforms (well, once I
read the manual for SPARC assembly, but that was ~7 years ago :) )
2. I have access to x86 only

By the way, especially for Chris Thomasson I wrote a small test program
which measures the performance of this lock on x86. I suggest you have a
look -- it will appear in 5 minutes in the parallel branch of this
discussion.

Bye.
Sincerely yours, Michael.

Michael Pryhodko

Apr 4, 2005, 9:43:49 PM
> > Somehow I could get to the "we should not be here line" on
> > my P4 2.8GHz HT -- which is totally confusing to me, only current
> > thread ever writes this specific unique id into m_lock variable and
> > since we are at the beginning of lock() -- m_lock should be !=
> > curr_thread_id. It look like I am blind.

I have already found the source of my problems -- if CMPXCHG fails, it writes
the old value back.


> I haven't totally studied your idea, but from what I can immediately gather
> it seems like your going for a simple spinlock without using LOCK prefix. Is
> that about it?

<snip>

Yes, and your simplified version of lock operation is correct.


> And your unlock function has:
>
> store 0 into shared
> ( store/store )

right.


> So, lock function has 3 barriers ( two of which are expensive store/load's )
> and 2 non-atomic CAS ( one is to shared mem ) on the fast-path, and the
> unlock function has one store to shared and 1 barrier on the fast-path. Are
> you sure this is vastly more efficient than just using the simple LOCK
> prefix? I have never tried messing around with cmpxchg on SMP/HT system
> without using the lock prefix...

I have already expressed my thoughts on comparing this algorithm
with a simple LOCK-based lock:
1. LOCK could lock the system bus -- this could significantly affect the
whole system: the more devices (e.g. processors) are using the bus, the
worse the influence becomes. sfence does not have this disadvantage
(AFAIK).
2. AFAIK sfence performance depends on the size of the store buffer; that
means that in the case of:

sfence
mov mem_addr, smth
sfence

the second sfence will complete much faster than the first (see the rough
timing sketch after this list).
3. the sfence in unlock could be omitted
4. the first sfence in lock is used only to make the execution time of the
second one more predictable (this is the only "fragile" part of the x86
implementation of this algorithm -- the part of which I am not 100% sure)
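
A rough micro-benchmark sketch (not part of the posted test program) for
checking point 2 above, using the MSVC __rdtsc and _mm_sfence intrinsics;
rdtsc readings are noisy (out-of-order execution, interrupts, frequency
changes), so treat the numbers as indicative only.

----------------------------------------------------------------------
#include <intrin.h>       // __rdtsc
#include <xmmintrin.h>    // _mm_sfence
#include <iostream>

volatile long g_target = 0;   // illustrative target of the timed store

int main()
{
    unsigned __int64 t0 = __rdtsc();
    g_target = 1;             // leaves a pending store in the store buffer
    _mm_sfence();             // has to wait for that store to drain
    unsigned __int64 t1 = __rdtsc();
    _mm_sfence();             // store buffer should already be empty
    unsigned __int64 t2 = __rdtsc();

    std::cout << "store+sfence: " << (t1 - t0)
              << " cycles, back-to-back sfence: " << (t2 - t1) << " cycles\n";
    return 0;
}
----------------------------------------------------------------------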

Now about the vast improvement in speed:
I created a small test program (see below); it counts the number of
lock/unlock pairs completed in 3 million processor clocks.

NOTE: I replaced cmpxchg with 'if (lock == 0) lock = Pi', since I think
it is impossible to implement this algorithm on x86 due to the redundant
store in cmpxchg. The probability that the OS will interrupt right between
the 'compare' and the 'store' is very low (but it happened once or twice).

RESULTS:
I did 5 runs of each case on my single-proc P4 2.8GHz HT (i.e. a 2-processor
unit)

LOCK-based -- 1493 on average
my lock -- 2039 on average
my lock with cmpxchg, with 'unlock' augmented to work on a 2-processor PC
-- 1045 on average

CONCLUSION:
my lock is potentially 25-27% faster than the usual LOCK-based lock (but
unfortunately can't be implemented on x86 due to the cmpxchg behavior)
my augmented lock for a 2-proc PC performs 2 times slower. Well, as I
already told David, it could potentially execute infinitely :))

Here is the code:

--------------------------------------------------------------

#include <windows.h>
#include <tchar.h>
#include <process.h>
#include <iostream>
#include "tick_cnt.h"


#define THREAD_COUNT (4)


//////////////////////////////////////////////////
// mtlock class
class mtlock
{
__declspec(align(32)) LONG m_lock;
public:
mtlock() throw() : m_lock(0) {}
~mtlock() throw() {}

void lock(LONG dwId) throw()
{
LONG dummy = 5;
__asm {
mov edx, dwId
mov ebx, this

mfence

//////////////////////////////////////////////////////////
//
/*
// CAS(&m_lock, 0, dwId)
ENTER_LOCK:
xor eax, eax
cmpxchg [ebx]this.m_lock, edx

sfence

// if (successfull)
je PROCEED
pause
jmp ENTER_LOCK
PROCEED:
*/
xor eax, eax
ENTER_LOCK:
cmp eax, [ebx]this.m_lock
je PUT_VAL
pause
jmp ENTER_LOCK

PUT_VAL:
mov [ebx]this.m_lock, edx
sfence
//
//////////////////////////////////////////////////////////

// CAS(&dummy, dummy, ~dummy)
mov ecx, dummy
mov eax, ecx
neg ecx
cmpxchg dummy, ecx

mfence

// if (m_lock == dwId)
cmp [ebx]this.m_lock, edx
je EXIT_LOCK
pause
jmp ENTER_LOCK
EXIT_LOCK:
}
}

void unlock(LONG dwId) throw()
{
// LONG dummy = 5;
__asm {
mov ebx, this
mov [ebx]this.m_lock, 0
sfence

/*
mov ebx, this
mov edx, dwId
ENTER_UNLOCK:
mov [ebx]this.m_lock, 0
sfence

// CAS(&dummy, dummy, ~dummy)
mov ecx, dummy
mov eax, ecx
neg ecx
cmpxchg dummy, ecx

mfence

// if (m_lock == dwId)
cmp [ebx]this.m_lock, edx
jne EXIT_UNLOCK
pause
jmp ENTER_UNLOCK
EXIT_UNLOCK:
*/
}
}
};


//////////////////////////////////////////////////
// mtlock_old class -- LOCK-based version
class mtlock_old
{
__declspec(align(32)) LONG m_lock;
public:
mtlock_old() throw() : m_lock(0) {}
~mtlock_old() throw() {}

void lock(LONG dwId) throw()
{
__asm {
mov ebx, this
mov ecx, 1
ENTER_LOCK:
xor eax, eax
lock cmpxchg [ebx]this.m_lock, ecx

// if (successful)
je PROCEED
pause
jmp ENTER_LOCK
PROCEED:
}
}

void unlock(LONG dwId) throw()
{
__asm {
mov ebx, this
mov [ebx]this.m_lock, 0
sfence
}
}
};


//////////////////////////////////////////////////
// global variables
//mtlock_old lock;
mtlock lock;


LONG volatile flag = 0; // used to check lock validity
LONG volatile g_stat[THREAD_COUNT] = {0}; // enter/leave stats
HANDLE hStartEvent = NULL;


#define ASSERT(x) if (!(x)) { __asm { int 3 }; }


void __cdecl thread_proc(void* pData)
{
LONG dwThreadId = GetCurrentThreadId();
LONG idx = (LONG)pData;

ASSERT( WaitForSingleObject(hStartEvent, INFINITE) == WAIT_OBJECT_0 );

for(;;)
{
lock.lock(dwThreadId);

ASSERT(flag == 0);
ASSERT( InterlockedIncrement(&flag) == 1 );
ASSERT(flag == 1);
ASSERT( InterlockedDecrement(&flag) == 0 );
ASSERT(flag == 0);

++g_stat[idx];
lock.unlock(dwThreadId);

// Sleep(0);
}
}


int _tmain(int argc, _TCHAR* argv[])
{
hStartEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
ASSERT(hStartEvent != NULL);

for(int i = 0; i < THREAD_COUNT; ++i)
{
_beginthread(&thread_proc, 0, (void*)i);
}
Sleep(1000); // wait for all threads to enter WaitForSingleObject

CTickCounter timer; // start measurements

ASSERT( SetEvent(hStartEvent) ); // signal to start working
Sleep(1000); // working time
lock.lock(GetCurrentThreadId()); // stop work

ULONG time_used = timer.Stop(); // stop measurements

UINT64 unCount = 0;
for(int i = 0; i < THREAD_COUNT; ++i)
{
std::cout << i << " " << g_stat[i] << std::endl;
unCount += g_stat[i];
}

lock.unlock(GetCurrentThreadId());

std::cout << "Total: " << unCount << std::endl;
std::cout << "Performance: " << double(unCount)/time_used <<
std::endl;
return 0;
}


------------------- tick_cnt.h -------------------------------

#pragma once

#ifndef TICK_CNT_H__5_04_2005__9_53_50__AM
#define TICK_CNT_H__5_04_2005__9_53_50__AM


#pragma pack(push, 1)

#pragma warning(push)
#pragma warning(disable: 4537)

class CTickCounter
{
ULONG m_dwLow, m_dwHigh;
public:
CTickCounter() throw()
{
Start();
}

void Start() throw()
{
__asm
{
CPUID
RDTSC
mov EBX, [this].m_dwLow
mov dword ptr [EBX],EAX
add EBX,4
mov dword ptr [EBX],EDX
}
}

// returns elapsed ticks divided by dwMHZ: microseconds if dwMHZ is the
// CPU frequency in MHz; with the default divisor the unit is "3 million
// clocks", matching the measurements described above
ULONG Stop(ULONG dwMHZ = 3*1000*1000 /* = GetProcFrequency()*/) throw()
{
__asm
{
CPUID
RDTSC
mov EBX, [this].m_dwLow
mov ECX, EBX
add ECX, 4
sub EAX, dword ptr [EBX]
sbb EDX, dword ptr [ECX]
div dwMHZ
}
}
};

#pragma warning(pop)
#pragma pack(pop)


#endif //TICK_CNT_H__5_04_2005__9_53_50__AM

--------------------------------------------------------------------


Bye.
Sincerely yours, Michael.

David Schwartz

unread,
Apr 4, 2005, 9:47:19 PM4/4/05
to

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112663578.3...@g14g2000cwa.googlegroups.com...

>> First of all, you would always put the lock in its own cache
>> line. A
>> processor would have to own the cache line in order to perform the
>> unlocked
>> compare and exchange anyway. Your argument about a 486PC isn't
>> convincing
>> because the pipeline depth is so much shorter that the LOCK prefix
>> hurts
>> each CPU less. Plus, there aren't massively parallel SMP 486 systems.

> 1. If m_lock variable is not cached, processor could (or will?) lock
> system bus.

I'm not sure what you're talking about here. There are quite a few
cases. On a modern processor, the system bus is never locked. The cache line
is acquired and not released for the duration of the LOCKed operation.

> 2. I do not see any connection with pipeline depth, AFAIK 'sfence' and
> 'LOCK' does not invalidate pipeline.

They do. The cost of a fence or LOCK is controlled by the pipeline
depth. For example, a store fence requires stores to be classified as
either "before" or "after" the fence. This requires the fence to take
effect at one specific point in time, not at a different time in each
of the various pipelines.

> What do you mean by "owning" cache line? if you mean LOCK'ing it (in
> order to avoid bus lock) -- it is not true, because only XCHG
> implicitly locks, not CMPXCHG.

Whether or not you lock the compare/exchange, the processor must acquire
the cache line before it can do anything. And whether or not you lock it,
the bus will not be locked, only the cache line might be. Assuming the
locked variable is in its own cache line (which is the only sensible way to
do it), the cost of the LOCK prefix is due to pipeline issues, same as for the
fence.

DS


Michael Pryhodko

unread,
Apr 4, 2005, 10:11:07 PM4/4/05
to
> > 1. If m_lock variable is not cached, processor could (or will?) lock
> > system bus.
>
> I'm not sure what you're talking about here. There are quite a few
> cases. On a modern processor, the system bus is never locked. The cache
> line is acquired and not released for the duration of the LOCKed
> operation.

From IA-32 Manual Vol 3, 7.1.4:
"For the Intel486 and *!*Pentium*!* processors, the LOCK# signal is
always asserted on the bus during a LOCK operation, even if the area of
memory being locked is cached in the processor.
For the Pentium 4, Intel Xeon, and P6 family processors, if the area of
memory being locked during a LOCK operation is *!*cached*!* in the
processor that is performing the LOCK operation as write-back memory
and is *!*completely contained in a cache line*!*, the processor *!*may
not*!* assert the LOCK# signal on the bus. Instead, it will modify the
memory location internally and allow it's cache coherency mechanism
to insure that the operation is carried out atomically. This operation
is called "cache locking." The cache coherency mechanism
automatically prevents two or more processors that have cached the same
area of memory from simultaneously modifying data in that area."

i.e.:
1. it is not only the 486 that always locks the bus (the Pentium does
too) :)
2. the operand should be cached and should not be split across cache
lines (a padding sketch follows below)
3. and even if all these conditions are true, the processor only MAY
avoid locking the bus (i.e. Intel does not guarantee it)
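
As a side note on point 2: the usual way to guarantee that the operand
is completely contained in (and alone in) one cache line is to pad and
align it. A minimal sketch, assuming a 64-byte line -- that size is an
assumption, and the __declspec(align(32)) in my test program may well
be too small for it:

--------------------------------------------------------------
#include <atomic>

// Sketch: keep the lock variable aligned to, and alone in, a single
// cache line so that a LOCKed operation on it satisfies the
// "completely contained in a cache line" condition quoted above.
struct alignas(64) padded_lock
{
    std::atomic<long> value{0};
    char pad[64 - sizeof(std::atomic<long>)];    // keep neighbours out of the line
};

static_assert(sizeof(padded_lock) == 64,
              "padded_lock should occupy exactly one assumed 64-byte line");
--------------------------------------------------------------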


> > 2. I do not see any connection with pipeline depth, AFAIK 'sfence'
> > and 'LOCK' does not invalidate pipeline.
>
> They do. The cost of a fence or LOCK is controlled by the pipeline
> depth. For example, a store fence requires stores to be classified as
> either "before" or "after" the fence. This requires the fence to be a
> specific time, not a different time in each of various pipelines.

Hmm... Wait a second, I thought that sfence is placed in the pipeline
just like any other instruction and, when it is retired, it simply
flushes the store buffers (plus maybe something to do with the cache
coherency mechanism). In that case, whatever lies in the pipeline
behind the sfence stays there; nobody removes it. Or maybe I am wrong
and it is processed in a completely different way, for example:
whenever an sfence is fetched from memory, the pipeline is flushed, the
store buffers are flushed, the sfence is immediately retired, and
execution continues as usual?


> > What do you mean by "owning" cache line? if you mean LOCK'ing it (in
> > order to avoid bus lock) -- it is not true, because only XCHG
> > implicitly locks, not CMPXCHG.
>
> Whether or not you lock the compare/exchange, the processor must
> acquire the cache line before it can do anything. And whether or not you
> lock it, the bus will not be locked, only the cache line might be.
> Assuming the locked variable is in its own cache line (which is the only
> sensible way to do it), the cost of the LOCK prefix is due to pipeline
> issues, same as for the fence.

I agree that from the point of view of ONE GIVEN processor the cost of
LOCK could be similar to the cost of a fence, but for the whole system
I think the fence is cheaper. Run the test app I posted in response to
Chris. I was surprised by the results :).

Bye.
Sincerely yours, Michael.

David Schwartz

unread,
Apr 4, 2005, 10:24:13 PM4/4/05
to

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112665429.4...@o13g2000cwo.googlegroups.com...

> I have already expressed my thoughts about this algorithm comparison
> with simple LOCK-based lock:

> 1. LOCK could lock system bus -- this could create significant
> influence on whole system: more devices are using bus (e.g. processors)
> -- influence becomes worse. sfence does not have this disadvantage
> (AFAIK).

This is not true on any modern processor I know of.

> 2. AFAIK sfence performance depends on size of store buffer, that means
> in case of:
>
> sfence
> mov mem_addr, smth
> sfence
>
> second sfence will perform much faster than first.

No. The cost of 'sfence' has to do with sequencing. In order for an
'sfence' to work, some stores have to be before the fence and some have to
be after. This imposes a heavy cost on CPUs that get much of their
performance from out of order execution.

> Now about vast improvement of speed:
> I created small test program (see below), it calculates number of
> lock/unlock pairs in 3 million processor clocks.
>
> NOTE: I replaced cmpxchg with 'if (lock == 0) lock = Pi'. Since I think
> it is impossible to implement this algorithm on x86 due to redundant
> store in cmpxchg. Probability that OS will interrupt right between
> 'compare' and 'store' is very low (but happened one or two times).
>
> RESULTS:
> I did 5 runs of each case on my single proc P2.8 HT (i.e. 2 processor
> unit)
>
> LOCK-based -- 1493 on average
> my lock -- 2039 on average
> my lock with cmpxchg with 'unlock' augmented to work on 2-processor pc
> -- 1045 on average

Benchmarking these kinds of things in tight loops gives unrealistic
results.

DS


David Schwartz

unread,
Apr 4, 2005, 10:31:21 PM4/4/05
to

"Michael Pryhodko" <mpry...@westpac.com.au> wrote in message
news:1112667067.4...@f14g2000cwb.googlegroups.com...

> "For the Intel486 and *!*Pentium*!* processors, the LOCK# signal is
> always asserted on the bus during a LOCK operation, even if the area of
> memory being locked is cached in the processor.
> For the Pentium 4, Intel Xeon, and P6 family processors, if the area of
> memory being locked during a LOCK operation is *!*cached*!* in the
> processor that is performing the LOCK operation as write-back memory
> and is *!*completely contained in a cache line*!*, the processor *!*may
> not*!* assert the LOCK# signal on the bus. Instead, it will modify the
> memory location internally and allow it's cache coherency mechanism
> to insure that the operation is carried out atomically. This operation
> is called "cache locking." The cache coherency mechanism
> automatically prevents two or more processors that have cached the same
> area of memory from simultaneously modifying data in that area."

If you're going to talk about how your algorithm is efficient on
massively parallel systems, those aren't going to be based on Pentiums.
They'll be based on much newer processors.

> i.e.:
> 1. not only 486 PC always locks bus :)
> 2. operand should be cached and should not be split across cache lines
> 3. and even if all these conditions are true, processor MAY not lock
> bus (i.e. Intel does not give guarantee)

Against this, your algorithm results in more cache ping-ponging because
the cache line is never locked.

>> > 2. I do not see any connection with pipeline depth, AFAIK 'sfence'
>> > and
>> > 'LOCK' does not invalidate pipeline.

>> They do. The cost of a fence or LOCK is controlled by the
>> pipeline
>> depth. For example, a store fence requires stores to be classified as
>> either
>> "before" or "after" the fence. This requires the fence to be a
>> specific
>> time, not a different time in each of various pipelines.

> Hmm... Wait a second, I thought that sfence is placed on pipeline just
> like any other instruction and when it is retired -- it simply flushes
> store buffers (plus maybe something to do with cache coherency
> mechanism). In that case if anything lies on pipeline behind sfence --
> it will be there, nobody will remove it. Or maybe I am wrong and it is
> processed completely in another way, for example:
> whenever sfence fetched from memory, pipeline is flushed, store buffers
> flushed, sfence immediately retired and continue as usual.

Think about what you're saying. What happens if an instruction after the
sfence does a store and that store has already been put in the buffer?
Flushing the buffer will flush the wrong stores. (Modern x86 CPUs get much
of their speed from out-of-order execution.)

>> Whether or not you lock the compare/exchange, the processor must
>> acquire
>> the cache line before it can do anything. And whether or not you lock
>> it,
>> the bus will not be locked, only the cache line might be. Assuming
>> the
>> locked variable is in its own cache line (which is the only sensible
>> way to
>> do it), the cost the LOCK prefix is due to pipeline issues, same as
>> for the
>> fence.

> I agree that from the point of view of ONE GIVEN processor cost of LOCK
> could be similar to cost of a fence, But for whole system -- I think
> fence is cheaper. Run test app I posted in response to Chris. I was
> surprised by results :).

I don't consider your results meaningful because of issues with your
testing methodology. For one thing, testing in a tight loop is unrealistic.
Also, I don't know if you put the lock variable in its own cache line.

DS


Michael Pryhodko

unread,
Apr 4, 2005, 10:38:45 PM4/4/05
to
> > 1. LOCK could lock system bus -- this could create significant
> > influence on whole system: more devices are using bus (e.g. processors)
> > -- influence becomes worse. sfence does not have this disadvantage
> > (AFAIK).
>
> This is not true on any modern processor I know of.

LOCK *!*could*!* lock the system bus. That does not mean it will do so
every time, but according to the IA-32 Manual it *!*could*!*. And by
the way, even if LOCK only locks a cache line -- sfence does not lock
anything at all.


> > 2. AFAIK sfence performance depends on size of store buffer, that
> > means in case of:
> >
> > sfence
> > mov mem_addr, smth
> > sfence
> >
> > second sfence will perform much faster than first.
>
> No. The cost of 'sfence' has to do with sequencing. In order for an
> 'sfence' to work, some stores have to be before the fence and some have
> to be after. This imposes a heavy cost on CPUs that get much of their
> performance from out of order execution.

:)) On x86, every 'store' implemented with a plain 'mov' instruction
has 'release' semantics, i.e. every such store 'happens' only after
every preceding memory access has finished. This was a surprise for me
a month ago. That means that the main 'feature' of sfence is to make
changes visible to other processors by flushing the store buffers. See
the IA-32 Manual on the 'retirement unit'; Alexander Terekhov also
placed some very useful links somewhere in this discussion. That is why
I think that most of the price of 'sfence' comes from flushing the
store buffers. (A small sketch of what that ordering guarantee means is
below.)
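
To illustrate what that buys you, here is the classic 'data then flag'
pattern in portable C++ terms -- a sketch only. On x86 both stores
below compile to plain mov instructions; the release/acquire
annotations just express the same ordering portably:

--------------------------------------------------------------
#include <atomic>

std::atomic<int> data{0};
std::atomic<int> ready{0};

void producer()
{
    data.store(42, std::memory_order_relaxed);    // plain store
    ready.store(1, std::memory_order_release);    // also a plain store on x86,
                                                  // never reordered before 'data'
}

void consumer()
{
    while (ready.load(std::memory_order_acquire) == 0)
        ;                                         // spin until the flag becomes visible
    int d = data.load(std::memory_order_relaxed); // guaranteed to observe 42 here
    (void)d;
}
--------------------------------------------------------------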


> > LOCK-based -- 1493 on average
> > my lock -- 2039 on average
> > my lock with cmpxchg with 'unlock' augmented to work on 2-processor pc
> > -- 1045 on average
>
> Benchmarking these kinds of things in tight loops gives unrealistic
> results.

:) agreed, but they make you think about it. Try playing with this test
-- I will be glad to hear about anything interesting you find.

Bye.
Sincerely yours, Michael.

Michael Pryhodko

unread,
Apr 4, 2005, 11:04:20 PM4/4/05
to
> If you're going to talk about how your algorithm is efficient on
> massively parallel systems, those aren't going to be based on Pentiums.
> They'll be based on much newer processors.

Wait a second -- I found an interesting algorithm, formalized it, and
tried to implement it on x86. I found that:
- I need a guarantee about the 'wait' section of this algorithm
- it is impossible to implement it on x86 due to cmpxchg behavior

So I complained :) to the newsgroup and asked for help/advice on these
topics.

Because most of the conversation is about the x86 implementation of the
algorithm, it does not matter that it is not useful on future huge MP
systems. Maybe an architect is already reading these postings and
wondering whether his future system could provide the necessary
guarantees for this algorithm to work?


> > i.e.:
> > 1. not only 486 PC always locks bus :)
> > 2. operand should be cached and should not be split across cache lines
> > 3. and even if all these conditions are true, processor MAY not lock
> > bus (i.e. Intel does not give guarantee)
>
> Against this, your algorithm results in more cache ping-ponging because
> the cache line is never locked.

This poor cache line is ping-ponged because cmpxchg is designed the way
it is designed. If you removed the redundant store from cmpxchg, the
line would be "ping-ponged" only once on 'lock' and once on 'unlock'
(see the sketch below).


> > Hmm... Wait a second, I thought that sfence is placed on pipeline
> > just like any other instruction and when it is retired -- it simply
> > flushes store buffers (plus maybe something to do with cache coherency
> > mechanism). In that case if anything lies on pipeline behind sfence --
> > it will be there, nobody will remove it. Or maybe I am wrong and it is
> > processed completely in another way, for example:
> > whenever sfence fetched from memory, pipeline is flushed, store
> > buffers flushed, sfence immediately retired and continue as usual.
>
> Think about what you're saying. What happens if an instruction after
> the sfence does a store and that store has already been put in the
> buffer? Flushing the buffer will flush the wrong stores. (Modern x86
> CPUs get much of their speed from out-of-order execution.)

This is impossible. Stores on x86 have 'release' semantics; sfences and
stores are ordered.

from IA-32 Manual Vol 1, 2.2.2.3:
"The retirement unit receives the results of the executed µops from
the out-of-order execution core and processes the results so that the
architectural state updates according to the *!*original program
order*!*.
When a µop completes and writes its result, it is retired. Up to three
µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in
the processor which buffers completed µops, updates the architectural
state *!*in order*!*, and manages the ordering of exceptions."

So basically I do not see anything that would prevent me from
implementing an 'sfence' that does not flush the pipeline (on the other
hand, maybe I am wrong -- I am not an Intel hardware architect).
Anyway, arguing about this point is meaningless; I think my test app
has already shown that sfence is cheaper than LOCK. Why? I do not care
-- that is a problem for the hardware designers.


> > I agree that from the point of view of ONE GIVEN processor cost of
> > LOCK could be similar to cost of a fence, But for whole system -- I
> > think fence is cheaper. Run test app I posted in response to Chris.
> > I was surprised by results :).
>
> I don't consider your results meaningful because of issues with your
> testing methodology. For one thing, testing in a tight loop is
> unrealistic. Also, I don't know if you put the lock variable in its own
> cache line.

Check the test app -- it is small; modify it, make it realistic... It
is up to you. What specific part of this test do you not like?

Bye.
Sincerely yours, Michael.

Michael Pryhodko

unread,
Apr 4, 2005, 11:21:32 PM4/4/05
to
> > Hmm... Wait a second, I thought that sfence is placed on pipeline
> > just like any other instruction and when it is retired -- it simply
> > flushes store buffers (plus maybe something to do with cache coherency
> > mechanism). In that case if anything lies on pipeline behind sfence --
> > it will be there, nobody will remove it. Or maybe I am wrong and it is
> > processed completely in another way, for example:
> > whenever sfence fetched from memory, pipeline is flushed, store
> > buffers flushed, sfence immediately retired and continue as usual.
>
> Think about what you're saying. What happens if an instruction after
> the sfence does a store and that store has already been put in the
> buffer? Flushing the buffer will flush the wrong stores. (Modern x86
> CPUs get much of their speed from out-of-order execution.)

And by the way -- I don't think that an out-of-order executed 'store'
gets into the store buffer before the instruction is retired by the
ROB, because every interrupt flushes/drains the store buffers.
Out-of-order execution definitely does not work the way you think it
does.

Bye.
Sincerely yours, Michael.

Chris Thomasson

unread,
Apr 5, 2005, 1:44:24 AM4/5/05
to
> :)) On all x86 'store' operation implemented with 'mov' instruction has
> 'release' semantic, i.e. every such store 'happens' only after every
> preceding memory access finished

It must be understood by everybody reading this that ia32 can, and
will, reorder a store followed by a load from "another" location:

http://groups-beta.google.com/group/comp.programming.threads/msg/2f2ec4c60b6a5d7c
http://groups-beta.google.com/group/comp.std.c++/msg/e03d27981e6f8151

Explicit barriers are required in this case. You need them in Dekker's
and/or Peterson's algorithm, for instance (a sketch follows)...
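
A sketch of the litmus test behind those links (not copied from them):
with anything weaker than seq_cst both r1 and r2 can end up 0 on ia32,
because each processor's load is allowed to pass its own buffered
store. seq_cst operations (an mfence or a LOCKed instruction on x86)
forbid that outcome, which is what Dekker/Peterson need:

--------------------------------------------------------------
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void cpu0()
{
    x.store(1, std::memory_order_seq_cst);   // store to x ...
    r1 = y.load(std::memory_order_seq_cst);  // ... then load from "another" location
}

void cpu1()
{
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

// Run cpu0() and cpu1() on two threads: with seq_cst at least one of
// r1, r2 must be 1; with relaxed (or even release/acquire) ordering
// the r1 == 0 && r2 == 0 outcome is allowed and does occur on real
// hardware.
--------------------------------------------------------------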


Chris Thomasson

unread,
Apr 6, 2005, 3:49:11 AM4/6/05
to
> - some advise in areas where "I am not sure", such as "wait" part of
> algorithm and "cmpxchg redundant store" problem

Remember to check news://comp.lang.asm.x86

