
What is the real cost of LOCK on an x86 multiprocessor machine?


Mirek Fidler

unread,
Jul 29, 2005, 9:54:06 AM7/29/05
to
I am unable to find an answer to the above question. The only tidbit of
information I have found is that it is 100 cycles on a P4, but perhaps that
was just the worst case.

Measuring the same on my uniprocessor AMD64 machine, a LOCKed instruction
seems to be about 3 times slower than a regular one, but I guess that has
only little relevance to a real MP machine.
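
A minimal sketch of the kind of measurement meant here (not the actual
benchmark; GCC on x86, counting cycles with rdtsc -- results vary a lot
with the CPU and with whether the target line is cached):

    #include <cstdio>
    #include <stdint.h>

    static inline uint64_t rdtsc()
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main()
    {
        volatile int counter = 0;
        const int N = 1000000;

        uint64_t t0 = rdtsc();
        for(int i = 0; i < N; i++)                   // plain read-modify-write
            __asm__ __volatile__("addl $1, %0" : "+m"(counter));
        uint64_t t1 = rdtsc();

        for(int i = 0; i < N; i++)                   // same, with a LOCK prefix
            __asm__ __volatile__("lock; addl $1, %0" : "+m"(counter));
        uint64_t t2 = rdtsc();

        std::printf("plain add: %.1f cycles/op, locked add: %.1f cycles/op\n",
                    (double)(t1 - t0) / N, (double)(t2 - t1) / N);
        return 0;
    }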

Mirek

chris noonan

unread,
Jul 29, 2005, 11:48:46 AM7/29/05
to

Mirek Fidler wrote:

> Measuring the same on my uniprocessor AMD64 machine, a LOCKed instruction
> seems to be about 3 times slower than a regular one, but I guess that has
> only little relevance to a real MP machine.

Surprisingly, on Intel Pentium processors the overhead of
a locked instruction is about the same for a single processor
as for one in a multiple processor configuration.

The way Microsoft (for example) avoids the performance hit is
to replace the LOCK prefixes in uniprocessor builds of their
operating system binaries with NOP instructions.

Chris

Mirek Fidler

unread,
Jul 29, 2005, 12:36:31 PM7/29/05
to
>>Measuring the same on my uniprocessor AMD64 machine, a LOCKed instruction
>>seems to be about 3 times slower than a regular one, but I guess that has
>>only little relevance to a real MP machine.
>
>
> Surprisingly, on Intel Pentium processors the overhead of
> a locked instruction is about the same for a single processor
> as for one in a multiple processor configuration.

Hm, can you give me some numbers? Is it closer to my measurement (3
times slower than non-atomic) or to the 100 cycles I have seen somewhere?

>
> The way Microsoft (for example) avoid the performance hit is
> to replace the LOCK prefixes in uniprocessor builds of their
> operating system binaries with NOP instructions.

I stepped through InterlockedIncrement and there definitely IS a LOCK.

Mirek

Chris Thomasson

unread,
Jul 29, 2005, 5:31:28 PM7/29/05
to
>> Surprisingly, on Intel Pentium processors the overhead of
>> a locked instruction is about the same for a single processor
>> as for one in a multiple processor configuration.
>
> Hm, can you give me some numbers? Is it closer to my measurement (3 times
> slower than non-atomic) or to the 100 cycles I have seen somewhere?

There are too many variables to consider for any definitive answers. For
example, the negative effects of asserting a CPU's #LOCK signal could be
tremendous. It also tends to have different timing results on multi-CPU
systems because the #LOCK signal actually locks the entire system bus, so
subsequent requests from any other CPUs are blocked. You can't really know
how many operations were blocked during a #LOCK signal assertion. Sometimes
the #LOCK signal assertion can be avoided if the shared data being accessed
is already cached in the CPU.

This variable behavior can lead to inconsistent timing results.

--
http://appcore.home.comcast.net/
(portable lock-free data-structures)


Sean Kelly

unread,
Jul 29, 2005, 5:50:31 PM7/29/05
to
For what it's worth, the IA-32 Developer's Manual has this to say about
LOCK:

"Beginning with the P6 family processors, when the LOCK prefix is
prefixed to an instruction
and the memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted. Instead, only
the processor's cache is locked. Here, the processor's cache
coherency mechanism insures that the operation is carried out
atomically with regards to memory."

Message has been deleted

David Schwartz

unread,
Jul 30, 2005, 12:17:09 AM7/30/05
to

"Mirek Fidler" <c...@volny.cz> wrote in message
news:3kuqnuF...@individual.net...

On a P4, it's 100-200 clocks.

DS


Mirek Fidler

unread,
Jul 30, 2005, 5:18:34 AM7/30/05
to
Well, I guess that I should have been more specific:

If the accessed memory is contained exclusively in one CPU's cache, is it true
that the LOCK penalty is lower than 100-200 cycles?

My question is mostly about reference-counted shared objects (e.g. in
C++). What I want to find out is whether the penalty is present in all
cases, or whether it is lower when only a single thread "owns" the shared
data (the reference count is not accessed by other threads).

Unfortunately, at the moment I do not have hardware to test...
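
To make the scenario concrete, a rough sketch of the reference-count
operations in question (illustrative only; GCC inline asm, names invented):

    struct RefCounted {
        volatile int refs;      // shared count, touched with LOCKed ops
    };

    inline void AddRef(RefCounted *p)
    {
        __asm__ __volatile__("lock; incl %0" : "+m"(p->refs));
    }

    // Returns true when the count dropped to zero (last owner gone).
    inline bool Release(RefCounted *p)
    {
        unsigned char zero;
        __asm__ __volatile__("lock; decl %0; setz %1"
                             : "+m"(p->refs), "=q"(zero));
        return zero != 0;
    }

    int main()
    {
        RefCounted r = { 1 };          // one owner
        AddRef(&r);                    // a copy is made: second owner
        Release(&r);                   // copy destroyed
        return Release(&r) ? 0 : 1;    // last owner gone -> free the object
    }

Every copy and destruction of such an object's handle pays for one of these
LOCKed operations whether or not any other thread ever touches the object,
which is exactly why the single-owner cost matters.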

Mirek

David Schwartz

unread,
Jul 30, 2005, 6:21:57 AM7/30/05
to

"Mirek Fidler" <c...@volny.cz> wrote in message
news:3l0uv7F...@individual.net...

> Well, I guess that I should have been more specific:

> If the accessed memory is contained exclusively in one CPU's cache, is it true
> that the LOCK penalty is lower than 100-200 cycles?

No.

> My question is mostly about reference-counted shared objects (e.g. in
> C++). What I want to find out is whether the penalty is present in all cases,
> or whether it is lower when only a single thread "owns" the shared data (the
> reference count is not accessed by other threads).

It is.

> Unfortunately, at the moment I do not have hardware to test...

Sadly, the penalty comes from the fact that the lock must take place
during the execution of a single instruction. This means that no other
operations can overlap that instruction, so the pipelines all empty, the
instruction is processed, then the pipelines fill again. This has a huge
cost on the p4.

DS


Joe Seigh

unread,
Jul 30, 2005, 7:18:25 AM7/30/05
to
David Schwartz wrote:
> "Mirek Fidler" <c...@volny.cz> wrote in message
> news:3l0uv7F...@individual.net...
>
>>Unfortunately, at the moment I do not have hardware to test...
>
>
> Sadly, the penalty comes from the fact that the lock must take place
> during the execution of a single instruction. This means that no other
> operations can overlap that instruction, so the pipelines all empty, the
> instruction is processed, then the pipelines fill again. This has a huge
> cost on the p4.
>
It doesn't serialize the processor, there are instructions that do this.
It "serializes" the memory accesses, i.e. they complete rather than just
get ordered. Instruction prefetching can still occur so the pipeline
isn't completely flushed. Self modifying code needs to use additional
synchronization.

"Locked operations are atomic with respect to all other memory operations and all externally
visible events. Only instruction fetch and page table accesses can pass locked instructions.
Locked instructions can be used to synchronize data written by one processor and read by
another processor.
For the P6 family processors, locked operations serialize all outstanding load and store operations
(that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception: load operations that reference weakly ordered memory types
(such as the WC memory type) may not be serialized."

I'm not sure what that last bit means. Interlocked instructions aren't guaranteed
to order loads?


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

David Schwartz

unread,
Jul 30, 2005, 7:19:19 AM7/30/05
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:8vWdnYDkq-S...@comcast.com...

> David Schwartz wrote:

>> Sadly, the penalty comes from the fact that the lock must take place
>> during the execution of a single instruction. This means that no other
>> operations can overlap that instruction, so the pipelines all empty, the
>> instruction is processed, then the pipelines fill again. This has a huge
>> cost on the p4.

> It doesn't serialize the processor, there are instructions that do this.
> It "serializes" the memory accesses, i.e. they complete rather than just
> get ordered. Instruction prefetching can still occur so the pipeline
> isn't completely flushed. Self modifying code needs to use additional
> synchronization.

It is not guaranteed to serialize the processor; however, my experience
has been that it in fact does this. Locked operations could be a lot more
optimized than they are, it just doesn't seem to have been a priority.

DS


Joe Seigh

unread,
Jul 30, 2005, 7:28:38 AM7/30/05
to
Mirek Fidler wrote:
> Well, I guess that I should have been more specific:
>
> If the accessed memory is contained exclusively in one CPU's cache, is it true
> that the LOCK penalty is lower than 100-200 cycles?
>
> My question is mostly about reference-counted shared objects (e.g. in
> C++). What I want to find out is whether the penalty is present in all
> cases, or whether it is lower when only a single thread "owns" the shared
> data (the reference count is not accessed by other threads).
>
> Unfortunately, at the moment I do not have hardware to test...
>

You can measure cache hits. Just sequentially access more memory than can
fit in your cache. Brute force will work if you don't want to figure out
the n-way associative hashing and stuff.
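
A rough sketch of that brute-force approach (a sketch only; the buffer size
is a guess -- make it comfortably larger than your last-level cache):

    #include <cstdio>
    #include <cstring>
    #include <stdint.h>

    static char evict[8 * 1024 * 1024];    // assumed larger than L2/L3

    static inline uint64_t rdtsc()
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main()
    {
        volatile int counter = 0;
        uint64_t total = 0;
        const int N = 1000;

        for(int i = 0; i < N; i++)
        {
            std::memset(evict, i, sizeof evict);    // brute-force eviction of 'counter'
            uint64_t t0 = rdtsc();
            __asm__ __volatile__("lock; addl $1, %0" : "+m"(counter));
            total += rdtsc() - t0;
        }
        std::printf("locked add on a (probably) cold line: ~%u cycles\n",
                    (unsigned)(total / N));
        return 0;
    }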

You'll have to figure out the contention factor, which will depend on the
number of cpus. If it's a problem there are more cache-friendly algorithms
out there, though it would be much better if the cache coherency protocols
weren't stuck back in the 70's and 80's as far as threaded programming is concerned.

Joe Seigh

unread,
Jul 30, 2005, 7:52:52 AM7/30/05
to
David Schwartz wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:8vWdnYDkq-S...@comcast.com...
>
>
>>It doesn't serialize the processor, there are instructions that do this.
>>It "serializes" the memory accesses, i.e. they complete rather than just
>>get ordered. Instruction prefetching can still occur so the pipeline
>>isn't completely flushed. Self modifying code needs to use additional
>>synchronization.
>
>
> It is not guaranteed to serialize the processor; however, my experience
> has been that it in fact does this. Locked operations could be a lot more
> optimized than they are, it just doesn't seem to have been a priority.
>
Making the memory accesses complete degrades much of the pipeline even
if true serialization doesn't occur. Memory ordering would have been
much better though it still would have been a full memory barrier.
Serialization in my experience was used on mainframes to enforce hardware
error boundaries on things like context switching. In theory you could
use it for things like IPC without context switching (kernel calls) but
I've not seen any actual implementations myself. You need instructions
with cross address space support.

Peter Dimov

unread,
Jul 30, 2005, 8:23:54 AM7/30/05
to

In practice, the actual amortized cost appears to be much lower than
50-100x. I see a 4x difference on shared_ptr_timing_test (which
basically tests reference count increments and decrements in a somewhat
real-life worst case scenario.)

A P4 can stall for 50 cycles without an apparent reason. ;-)
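
This is not the actual shared_ptr_timing_test, but the shape of the workload
is roughly this: copy and release a shared_ptr in a tight loop, so every
iteration pays one atomic increment and one atomic decrement of the use count.

    #include <boost/shared_ptr.hpp>
    #include <cstdio>
    #include <ctime>

    int main()
    {
        boost::shared_ptr<int> p(new int(42));
        boost::shared_ptr<int> q;
        const int N = 10000000;

        std::clock_t t0 = std::clock();
        for(int i = 0; i < N; i++)
        {
            q = p;        // atomic increment of the shared count
            q.reset();    // atomic decrement
        }
        std::clock_t t1 = std::clock();

        std::printf("%.1f ns per copy/release pair\n",
                    (t1 - t0) * 1e9 / CLOCKS_PER_SEC / N);
        return 0;
    }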

chris noonan

unread,
Jul 30, 2005, 8:51:26 AM7/30/05
to

Oliver S. wrote:
> > I stepped through InterlockedIncrement and there definitely IS a LOCK.
>
> And there are also locks on Enter- and LeaveCriticalSection.
> MS dropped the two versions of these functions in kernel32.dll
> from NT4 to Win2K.

I vaguely remember encountering the LOCK prefix
in one or two places in Microsoft OS code disassembly
(uniprocessor installation). I assumed it was simply
a mistake, something like new routines being hacked
in quickly without them taking the trouble to add
the offsets of the LOCK prefixes to the list to be
NOP'ed out.

Most of my work is done with NT4 SP4. If subsequent
versions retain the LOCK prefix in uniprocessor
mode that is interesting. Obviously Microsoft got
a measurable performance boost by removing the LOCKs
(I do too, in my code) or they wouldn't have taken
the trouble. By the way, the routines I studied most
were heap functions: NtAllocateHeap etc.


Chris

Joe Seigh

unread,
Jul 30, 2005, 9:05:24 AM7/30/05
to

Statically or dynamically linked code? The only case I
think you can safely get away with making LOCK optional
is in system dynamically linked libraries that are hardware
configuration specific. I use LOCK in inlined code since it's
in effect statically linked and you don't know where the code
will eventually execute.

Mirek Fidler

unread,
Jul 30, 2005, 10:51:14 AM7/30/05
to
> In practice, the actual amortized cost appears to be much lower than
> 50-100x. I see a 4x difference on shared_ptr_timing_test (which
> basically tests reference count increments and decrements in a somewhat
> real-life worst case scenario.)

Have you tested on a real MP machine? (4x is close to what I have measured
on my SP).

Mirek

Sean Kelly

unread,
Jul 30, 2005, 12:05:14 PM7/30/05
to
Joe Seigh wrote:
>
> Statically or dynamically linked code? The only case I
> think you can safely get away with making LOCK optional
> is in system dynamically linked libraries that are hardware
> configuration specific. I use LOCK in inlined code since it's
> in effect statically linked and you don't know where the code
> will eventually execute.

What about testing the hardware configuration on startup and setting a
global state flag? Would it be worthwhile to do this and deal with the
duplicated asm and branching it would require?
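
One possible shape of that, sketched under the assumption of Linux/glibc for
the processor-count query (on Windows, GetSystemInfo() would play the same role):

    #include <unistd.h>

    // Set once at startup and never changed, so the branch below is
    // predicted correctly after the first few executions.
    static bool g_mp;

    void InitAtomics()
    {
        g_mp = sysconf(_SC_NPROCESSORS_ONLN) > 1;
    }

    inline void AtomicIncrement(volatile int *p)
    {
        if(g_mp)
            __asm__ __volatile__("lock; incl %0" : "+m"(*p));
        else
            __asm__ __volatile__("incl %0" : "+m"(*p));  // a single instruction
                                                         // is still atomic on UP
    }

    int main()
    {
        InitAtomics();
        volatile int n = 0;
        AtomicIncrement(&n);
        return n == 1 ? 0 : 1;
    }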


Sean

Joe Seigh

unread,
Jul 30, 2005, 12:43:05 PM7/30/05
to
Depends on whether the global state flag is in cache when you test it
and whether ia32-64 handles short conditional jumps effectively. The
condition means the pipeline stalls until the flag can be tested. It
should be easy to measure the overhead though.

Mirek Fidler

unread,
Jul 30, 2005, 12:47:03 PM7/30/05
to
Joe Seigh wrote:
> Sean Kelly wrote:
>
>> Joe Seigh wrote:
>>
>>> Statically or dynamically linked code? The only case I
>>> think you can safely get away with making LOCK optional
>>> is in system dynamically linked libraries that are hardware
>>> configuration specific. I use LOCK in inlined code since it's
>>> in effect statically linked and you don't know where the code
>>> will eventually execute.
>>
>>
>>
>> What about testing the hardware configuration on startup and setting a
>> global state flag? Would it be worthwhile to do this and deal with the
>> duplicated asm and branching it would require?
>>
> Depends on whether the global state flag is in cache when you test it
> and whether ia32-64 handles short conditional jumps effectively. The
> condition means the pipeline stalls until the flag can be tested.

AFAIK not true. That is what branch prediction is for, isn't it?

I believe that in the case of testing a single flag that never changes, the
actual amortized branch cost will be lower than one cycle.

Mirek

Joe Seigh

unread,
Jul 30, 2005, 2:52:32 PM7/30/05
to

That's for prefetching the instructions. The actual instruction
can't be executed until the actual flag value is known. If the
flag test can be moved forward of the branch you can get some benefit
that way.

David Schwartz

unread,
Jul 30, 2005, 4:19:55 PM7/30/05
to

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1122727886.8...@g14g2000cwa.googlegroups.com...

> I vaguely remember encountering the LOCK prefix
> in one or two places in Microsoft OS code disassembly
> (uniprocessor installation). I assumed it was simply
> a mistake, something like new routines being hacked
> in quickly without them taking the trouble to add
> the offsets of the LOCK prefixes to the list to be
> NOP'ed out.

By the way, the way I do this in applications is by using an illegal
instruction. I catch the fault and replace the illegal instruction with
either a LOCK or NOP at run time.

static void sigill(int, struct siginfo *info, void *extra)
{   // Dynamically patch a 'ud2' to a 'lock nop' or a 'nop nop'
    // By: David J. Schwartz
    // Copyright (C) WebMaster Incorporated, 2003-4
    //
    // Assumes <signal.h>, <sys/mman.h> and <ucontext.h> are included, and
    // that 'sigspin' (a spinlock word), 'AtomicExchange', 'IsUP' (set at
    // startup) and 'PAGESIZE' are defined elsewhere.

    // Begin with a full CPU sync (cpuid serializes; purge unmodified code)
    __asm__ __volatile__("cpuid" : : : "ax", "bx", "cx", "dx");

    while(AtomicExchange(&sigspin, 1)!=0)
    {   // relax the CPU while spinning for the patch lock
        __asm__ __volatile__("rep; nop");
    }

    ucontext_t *uc=(ucontext_t *) extra;
    if(uc==NULL) {char *j=NULL; *j++;}          // no context: crash deliberately
    else
    {
        unsigned char *instr=(unsigned char *) uc->uc_mcontext.gregs[14]; // EIP
        if(instr==NULL) {char *j=NULL; *j++;}   // bad fault address: crash
        else
        {
            if(instr[0]!=0x90)
            {   // instruction wasn't already patched
                if( (instr[0]!=0x0f) ||                     // must be a ud2
                    ((instr[1]!=0x0b)&&(instr[1]!=0xb9)) )
                { char *j=NULL; *j++; }                     // unexpected opcode: crash
                else
                {
                    // make the page writable, rewrite the two ud2 bytes,
                    // then restore the original protection
                    mprotect( (void *) ( ((int) instr) & ~(PAGESIZE-1) ),
                              PAGESIZE, PROT_WRITE | PROT_EXEC | PROT_READ);
                    instr[0]=0x90;                          // first byte is always a nop
                    instr[1]=(IsUP) ? 0x90 : 0xf0;          // nop on UP, LOCK prefix on MP
                    mprotect( (void *) ( ((int) instr) & ~(PAGESIZE-1) ),
                              PAGESIZE, PROT_EXEC | PROT_READ);
                }
            }
        }

        // release the patch lock
        AtomicExchange(&sigspin, 0);

        // Reflush the CPU (another serializing cpuid)
        __asm__ __volatile__("cpuid" : : : "ax", "bx", "cx", "dx");
    }
}
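
For context, the pieces the handler above assumes might look roughly like
this (a sketch, not part of the original code): the globals, the SIGILL
registration, and a call site that starts life as a ud2 placed immediately
in front of the instruction to be locked.

    #include <signal.h>
    #include <string.h>

    static volatile int sigspin;       // serializes the patching
    static int IsUP;                   // set at startup: nonzero on uniprocessor

    static int AtomicExchange(volatile int *p, int v)
    {
        __asm__ __volatile__("xchgl %0, %1" : "+r"(v), "+m"(*p));
        return v;
    }

    static void install_patcher()
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        // cast matches the 'struct siginfo' spelling used by the handler
        sa.sa_sigaction = (void (*)(int, siginfo_t *, void *)) sigill;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGILL, &sa, 0);
    }

    // The first execution faults on the ud2 (0x0f 0x0b); the handler rewrites
    // those two bytes to nop+nop (UP) or nop+LOCK (MP), so the incl that
    // follows is locked only on multiprocessor machines.
    inline void PatchedAtomicIncrement(volatile int *p)
    {
        __asm__ __volatile__(".byte 0x0f, 0x0b; incl %0" : "+m"(*p));
    }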

DS

Gianni Mariani

unread,
Jul 30, 2005, 4:50:09 PM7/30/05
to


If you have some code you would like me to try, I'll run it on an MP
machine for you (AMD or AMD64).

The other thing to note is that some more modern P4s have the
hyperthreading feature that makes them look like an SMP machine.

Gianni Mariani

unread,
Jul 30, 2005, 5:00:45 PM7/30/05
to


OH - and also, the cost of CPU-to-CPU communication through shared memory
is something else you might want to consider.

I wrote some code many moons ago that I posted on the net...

http://groups-beta.google.com/group/comp.os.linux.development.system/msg/df42233b14e5f962?dmode=source&hl=en

In general, I found that latencies in absolute time stayed about the
same until the AMD Opteron came out, which shaved a few more cycles
off. I don't have the data handy.

Peter Dimov

unread,
Jul 31, 2005, 5:43:19 AM7/31/05
to

No, it was a single P4/2GHz without HT.

Joe Seigh

unread,
Jul 31, 2005, 7:43:35 AM7/31/05
to

Now that a lot of people have weighed in on this, why does the cost of
atomic increment/decrement matter?

Mirek Fidler

unread,
Jul 31, 2005, 7:51:11 AM7/31/05
to
Joe Seigh wrote:
> Peter Dimov wrote:
>
>> Mirek Fidler wrote:
>>
>>>> In practice, the actual amortized cost appears to be much lower than
>>>> 50-100x. I see a 4x difference on shared_ptr_timing_test (which
>>>> basically tests reference count increments and decrements in a somewhat
>>>> real-life worst case scenario.)
>>>
>>>
>>> Have you tested on a real MP machine? (4x is close to what I have measured
>>> on my SP).
>>
>>
>>
>> No, it was a single P4/2GHz without HT.
>>
>
> Now that a lot of people have weighed in on this, why does the cost of
> atomic increment/decrement matter?
>

A lot. It is the decisive factor in choosing a COW structure over a full copy.

Mirek

Joe Seigh

unread,
Jul 31, 2005, 8:19:14 AM7/31/05
to
Mirek Fidler wrote:
> Joe Seigh wrote:
>>
>> Now that a lot of people have weighed in on this, why does the cost of
>> atomic increment/decrement matter?
>>
>
> A lot. It is the decisive factor in choosing a COW structure over a full copy.
>

How many data structures get copied that much besides String (allegedly)?

chris noonan

unread,
Jul 31, 2005, 8:24:12 AM7/31/05
to

Joe Seigh wrote:

> chris noonan wrote:
> > I vaguely remember encountering the LOCK prefix
> > in one or two places in Microsoft OS code disassembly
> > (uniprocessor installation). I assumed it was simply
> > a mistake, something like new routines being hacked
> > in quickly without them taking the trouble to add
> > the offsets of the LOCK prefixes to the list to be
> > NOP'ed out.
> >[snip]

>
> Statically or dynamically linked code? The only case I
> think you can safely get away with making LOCK optional
> is in system dynamically linked libraries that are hardware
> configuration specific. I use LOCK in inlined code since it's
> in effect statically linked and you don't know where the code
> will eventually execute.

The LOCK prefixes are always in the source code,
unconditionally. So the software ships with LOCK in
the binaries. The software installer determines
whether the machine is uniprocessor or multiprocessor.
If the former, it alters the binary file accordingly,
overwriting the LOCK prefixes with NOP, working from
a list of offsets that is potentially different for
each compilation.
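
The patching step itself is small; a toy illustration (the offset-list
format and file handling are invented for the example -- a real installer
would map the offsets onto the PE image):

    #include <cstdio>

    int main(int argc, char **argv)
    {
        if(argc < 3)
        { std::fprintf(stderr, "usage: unlock <binary> <offset-list>\n"); return 1; }

        std::FILE *bin  = std::fopen(argv[1], "r+b");
        std::FILE *list = std::fopen(argv[2], "r");
        if(!bin || !list) { std::perror("open"); return 1; }

        unsigned long offset;
        while(std::fscanf(list, "%lx", &offset) == 1)
        {
            std::fseek(bin, (long)offset, SEEK_SET);
            if(std::fgetc(bin) != 0xF0) continue;   // sanity check: expect a LOCK prefix
            std::fseek(bin, (long)offset, SEEK_SET);
            std::fputc(0x90, bin);                  // overwrite with NOP
        }
        std::fclose(list);
        std::fclose(bin);
        return 0;
    }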

Is that what you were asking?

Chris

chris noonan

unread,
Jul 31, 2005, 8:26:40 AM7/31/05
to

David Schwartz wrote:
> "chris noonan" wrote

> > I vaguely remember encountering the LOCK prefix
> > in one or two places in Microsoft OS code disassembly
> > (uniprocessor installation). I assumed it was simply
> > a mistake, something like new routines being hacked
> > in quickly without them taking the trouble to add
> > the offsets of the LOCK prefixes to the list to be
> > NOP'ed out.
>
> By the way, the way I do this in applications is by using an illegal
> instruction. I catch the fault and replace the illegal instruction with
> either a LOCK or NOP at run time.
>

>[code snipped]

Ingenious. I wonder if there might be a way of doing
this for Microsoft Windows?

Chris

chris noonan

unread,
Jul 31, 2005, 8:31:11 AM7/31/05
to

Joe Seigh wrote:

> You'll have to figure out the contention factor which would be dependent on
> the number of cpus. If it's a problem there are more cache friendly algorithms
> out there though it would be much better if the cache coherency protocols weren't
> stuck back in the 70's and 80's as far as threaded programming was concerned.

In light of the disappointing performance of
multiprocessor PCs, perhaps the MESI cache coherency
approach is not the best one.

The physical point of interaction between threads
on different processors communicating via shared
memory should be as far away from the processors
as possible, at the memory (RAM) itself.

Now that memory chips have millions of transistors,
a few could be spared for a primitive ALU. Then add some
extra transaction types to the memory bus. One
such transaction would implement waiting on a
semaphore. The memory controller performs a read
cycle to get the value of the specified memory word,
decrements it in its ALU (unless already zero),
performs a write cycle to put the new value back in
memory, then returns the old value of the word
across the bus to the requesting processor. This
sequence would be atomic with respect to other
processors, trivially.
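
Spelled out as ordinary code, the operation the memory controller would
perform is roughly the following; the point of the proposal is that the
controller executes it atomically, so no bus locking or cache-line
ownership is needed:

    // Semantics of the proposed memory-side "wait" transaction, written as
    // plain code for clarity; in the proposal the memory controller itself
    // performs this read-modify-write atomically.
    unsigned wait_on_semaphore(unsigned *word)
    {
        unsigned old = *word;          // read cycle
        if(old != 0)
            *word = old - 1;           // write cycle with the decremented value
        return old;                    // old value goes back to the requesting CPU
    }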

From the programmer's or compiler writer's perspective,
a critical region would be achieved by waiting on a
binary semaphore (using the machine instruction described
in the previous paragraph), accessing the protected
data freely, then signalling the semaphore via the
memory controller with another special bus transaction.
The processor data cache (or parts of it) would have to
be invalidated after waiting on the semaphore and flushed
back to RAM before signalling it.

As an elaboration, extra logic at the memory controller
could interrupt a processor when it needed to "sleep"
(i.e. reschedule its threads when waiting on a semaphore
already zero) or "wake up" (upon signalling of a waited-for
semaphore), queueing the processors to ensure fairness.

It is likely that such a scheme would require considerably
less silicon than used by the Pentium processor, with
its MESI, snooping, bus-locking etc. logic. Does
anyone know if it has been attempted?

Chris

Joe Seigh

unread,
Jul 31, 2005, 9:39:00 AM7/31/05
to
chris noonan wrote:
> Joe Seigh wrote:

>>
>>Statically or dynamically linked code? The only case I
>>think you can safely get away with making LOCK optional
>>is in system dynamically linked libraries that are hardware
>>configuration specific. I use LOCK in inlined code since it's
>>in effect statically linked and you don't know where the code
>>will eventually execute.
>
>
> The LOCK prefixes are always in the source code,
> unconditionally. So the software ships with LOCK in
> the binaries. The software installer determines
> whether the machine is uniprocessor or multiprocessor.
> If the former, it alters the binary file accordingly,
> overwriting the LOCK prefixes with NOP, working from
> a list of offsets that is potentially different for
> each compilation.
>
> Is that what you were asking?
>

Static libraries vs. dynamic ones. You have no control over
where statically linked code will execute, so you should assume
the most general case. Self-modifying code is not always an
option.

Joe Seigh

unread,
Jul 31, 2005, 9:56:11 AM7/31/05
to
chris noonan wrote:
> Joe Seigh wrote:
>
> Now that memory chips have millions of transistors,
> a few could be spared for a primitive ALU. Then add some
> extra transaction types to the memory bus. One
> such transaction would implement waiting on a
> semaphore. The memory controller performs a read
> cycle to get the value of the specified memory word,
> decrements it in its ALU (unless already zero),
> performs a write cycle to put the new value back in
> memory, then returns the old value of the word
> across the bus to the requesting processor. This
> sequence would be atomic with respect to other
> processors, trivially.

This is called fetch-and-op, op being inc(rement), dec(rement),
etc... It's an atomic read, modify, write instruction.
Itanium has it with the fetchadd instruction though it
only works that way with special uncached memory so you
can't really use it with normal shared memory.


>
> From the programmer's or compiler writer's perspective,
> a critical region would be achieved by waiting on a
> binary semaphore (using the machine instruction described
> in the previous paragraph), accessing the protected
> data freely, then signalling the semaphore via the
> memory controller with another special bus transaction.
> The processor data cache (or parts of it) would have to
> be invalidated after waiting on the semaphore and flushed
> back to RAM before signalling it.
>
> As an elaboration, extra logic at the memory controller
> could interrupt a processor when it needed to "sleep"
> (i.e. reschedule its threads when waiting on a semaphore
> already zero) or "wake up" (upon signalling of a waited-for
> semaphore), queueing the processors to ensure fairness.
>
> It is likely that such a scheme would require considerably
> less silicon than used by the Pentium processor, with
> its MESI, snooping, bus-locking etc. logic. Does
> anyone know if it has been attempted?

Hardware scheduling of threads? Not unless you count hyperthreading
with MONITOR/MWAIT for waiting on a lock.

I'm thinking of more cache friendly algorithms like RCU which doesn't
care if stale data is being accessed (subject to load dependent ordering).
So if you relaxed the MESI protocol somewhat you could reduce the cache
traffic. This could be a *major* performance factor in something
like ccNUMA where accessing global memory is expensive and not updating
stale (invalid) cache lines if you don't have to would help a lot.

David Schwartz

unread,
Jul 31, 2005, 4:47:57 PM7/31/05
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:6NGdnY-F0us...@comcast.com...

> Now that a lot of people have weighed in on this, why does the cost of
> atomic increment/decrement matter?

A lot if you're deciding whether or not keeping count of things is worth
the overhead in a release build.

DS


David Schwartz

unread,
Jul 31, 2005, 4:48:44 PM7/31/05
to

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1122812800....@g43g2000cwa.googlegroups.com...

>> By the way, the way I do this in applications is by using an illegal
>> instruction. I catch the fault and replace the illegal instruction with
>> either a LOCK or NOP at run time.
>>
>>[code snipped]

> Ingenious. I wonder if there might be a way of doing
> this for Microsoft Windows?

I never tried. It comes down to whether you can catch the illegal
instruction fault with enough information to fix it. My understanding is
that you *can* catch faults with something like SetUnhandledExceptionFilter
(or some function like that).
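
A sketch of how that might look on 32-bit Windows (untested; the API names
are the standard Win32 ones, not anything from the original post):

    #include <windows.h>

    extern "C" int IsUP;    // assumed set at startup, e.g. from GetSystemInfo()

    static LONG WINAPI PatchOnIllegalInstruction(EXCEPTION_POINTERS *ep)
    {
        if(ep->ExceptionRecord->ExceptionCode != EXCEPTION_ILLEGAL_INSTRUCTION)
            return EXCEPTION_CONTINUE_SEARCH;

        unsigned char *instr = (unsigned char *)ep->ContextRecord->Eip;
        DWORD old;
        VirtualProtect(instr, 2, PAGE_EXECUTE_READWRITE, &old);
        instr[0] = 0x90;                      // nop
        instr[1] = IsUP ? 0x90 : 0xF0;        // nop or LOCK prefix
        VirtualProtect(instr, 2, old, &old);
        FlushInstructionCache(GetCurrentProcess(), instr, 2);
        return EXCEPTION_CONTINUE_EXECUTION;  // re-execute the patched bytes
    }

    // Installed with SetUnhandledExceptionFilter(PatchOnIllegalInstruction),
    // or with AddVectoredExceptionHandler on XP and later.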

DS


chris noonan

unread,
Aug 1, 2005, 1:37:14 PM8/1/05
to

Joe Seigh wrote:

> chris noonan wrote:
> >
> > Now that memory chips have millions of transistors,
> > a few could be spared for a primitive ALU. Then add some
> > extra transaction types to the memory bus. One
> > such transaction would implement waiting on a
> > semaphore. The memory controller performs a read
> > cycle to get the value of the specified memory word,
> > decrements it in its ALU (unless already zero),
> > performs a write cycle to put the new value back in
> > memory, then returns the old value of the word
> > across the bus to the requesting processor. This
> > sequence would be atomic with respect to other
> > processors, trivially.
>
> This is called fetch-and-op, op being inc(rement), dec(rement),
> etc... It's an atomic read, modify, write instruction.
> Itanium has it with the fetchadd instruction though it
> only works that way with special uncached memory so you
> can't really use it with normal shared memory.

The way I meant it is already in the literature:
"Highly Efficient Synchronization Based on Active Memory
Operations"
Zhang, Fang, Carter
http://www.cs.utah.edu/~retrac/papers/ipdps04.pdf

Chris

Joe Seigh

unread,
Aug 1, 2005, 2:22:11 PM8/1/05
to

It seems to be meant for parallel computation where the
cpus are dedicated to the computation and it's ok for a
cpu to just wait on the result of the atomic operation.
These would be supercomputers with individual computers
connected with high speed links and they're interested in
reducing network traffic over the links.

I suppose this might be applicable to regular multi-threaded
programming once you get 50 to 100 core processors widely available
and you can dispense with time slicing.

chris noonan

unread,
Aug 4, 2005, 6:01:36 AM8/4/05
to

I suspect it would also be valuable for PC-type systems
having (say) four processors. These are currently architected
with a MESI crossbar equalising the processor data caches.
The purpose of this is presumably to provide *implicit*
synchronisation between threads running on different
processors.

However, because of the Pentium's out-of-order execution
and write buffering, *explicit* synchronisation (memory
barriers, bus-locking instructions) is required in any
case. So what does cache coherency achieve?

Why not synchronise processors at the physical point
where they are already connected, at main memory?

Chris

Joe Seigh

unread,
Aug 4, 2005, 7:04:01 AM8/4/05
to
MESI is required to maintain coherency so that only one
processor has a cache line exclusive at a time and thus
keeping updates from getting lost. If you didn't have
false sharing, you wouldn't need this.

The big problem is with the strong part of strongly
coherent. When you have more cache traffic than you
need due to repeated refreshes of invalidated cache
lines, that has to slow things down. Hardware designers
don't get it.

Mayan Moudgill

unread,
Aug 4, 2005, 7:17:33 AM8/4/05
to
Joe Seigh wrote:

>>
> MESI is required to maintain coherency so that only one
> processor has a cache line exclusive at a time and thus
> keeping updates from getting lost. If you didn't have
> false sharing, you wouldn't need this.
>
> The big problem is with the strong part of strongly
> coherent. When you have more cache traffic than you
> need due to repeated refreshes of invalidated cache
> lines, that has to slow things down. Hardware designers
> don't get it.
>

I'm not quite sure I get what you mean. In the MESI protocol, a shared
line will get invalidated when some other processor wants to write to
the line (E or S->I). This means that if a processor is writing to an I
line, then it has to fetch the new value. That is unavoidable (other
than the case of false sharing). [Caveat: if the line in the other core
is E but not M, then the data does not have to be fetched, but you still
have to wait for the ack]

Now, where do you see an opportunity to optimize the transaction?
(again, excluding false sharing)

One optimization which has been talked about (though I don't know
whether any processor actually implements it) is to allow loads (and
dependent ops) to execute speculatively against an I line. If it turns
out that the line was E then the speculative ops are committed, if it
turns out it was M, then they are restarted.

Joe Seigh

unread,
Aug 4, 2005, 7:41:38 AM8/4/05
to
Mayan Moudgill wrote:
> Joe Seigh wrote:
>
>>>
>> MESI is required to maintain coherency so that only one
>> processor has a cache line exclusive at a time and thus
>> keeping updates from getting lost. If you didn't have
>> false sharing, you wouldn't need this.
>>
>> The big problem is with the strong part of strongly
>> coherent. When you have more cache traffic than you
>> need due to repeated refreshes of invalidated cache
>> lines, that has to slow things down. Hardware designers
>> don't get it.
>>
>
> I'm not quite sure I get what you mean. In the MESI protocol, a shared
> line will get invalidated when some other processor wants to write to
> the line (E or S->I). This means that if a processor is writing to an I
> line, then it has to fetch the new value. That is unavoidable (other
> than the case of false sharing). [Caveat: if the line in the other core
> is E but not M, then the data does not have to be fetched, but you still
> have to wait for the ack]
>
> Now, where do you see an opportunity to optimize the transaction?
> (again, excluding false sharing)

There is no reason not to allow stale reads, since most memory models
allow them. A stale read is much faster than a read stalled by a
cache line refresh. Though since RCU is now a de facto standard, you
need new, as-yet-unspecified memory barriers to support it efficiently.

>
> One optimization which has been talked about (though I don't know
> whether any processor actually implements it) is to allow loads (and
> dependent ops) to execute speculatively against an I line. If it turns
> out that the line was E then the speculative ops are committed, if it
> turns out it was M, then they are restarted.

Transactional memory? Sun is probably going to do it. Somebody leaked
that in the solaris newsgroups.

Michael Pryhodko

unread,
Aug 4, 2005, 8:52:41 PM8/4/05
to
> "Locked operations are atomic with respect to all other memory operations and all externally
> visible events. Only instruction fetch and page table accesses can pass locked instructions.
> Locked instructions can be used to synchronize data written by one processor and read by
> another processor.
> For the P6 family processors, locked operations serialize all outstanding load and store operations
> (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon
> processors, with one exception: load operations that reference weakly ordered memory types
> (such as the WC memory type) may not be serialized."
>
> I'm not sure what that last bit means. Interlocked instructions aren't guaranteed to order
> loads?

Yes, loads are not serialized in this case if they reference weakly ordered
memory types -- read about memory types in the IA-32 manual.

Bye.
Sincerely yours, Michael.

chris noonan

unread,
Aug 5, 2005, 1:10:13 PM8/5/05
to

Joe Seigh wrote:
> MESI is required to maintain coherency so that only one
> processor has a cache line exclusive at a time and thus
> keeping updates from getting lost. If you didn't have
> false sharing, you wouldn't need this.

OK. I hadn't considered that. It makes a difference to
the "Active Memory Operations" way of doing things.

Consider there to be three types of variable:
thread-private data, shared data and semaphores. Without MESI,
it might be thought possible for a thread that only *reads*
some shared data not to see the writes of other processors,
as it might keep reading the stale data in its cache. But
normally such an access would be part of a readers/writers
solution, hence protected by semaphore and an explicit cache
fill (in the setup I described earlier).

If thread-private data, shared data and a semaphore happen
to lie in the same cache line, there is no problem (except
for some redundant refreshes of the thread-private data).

There is a problem if two different sorts of shared data
(such as the data protected by two unrelated critical
regions) lie in the same cache line; writes could be
lost. That would have to be solved by attention to the
memory layout.

Chris
