x86 Sequential Consistency

Adam Warner

Nov 16, 2006, 5:10:48 AM

Hi all,

In my prior posts to comp.programming.threads [note new crosspost to
comp.arch] I described an algorithm where CPU A makes a number of memory
writes before writing to a memory location that acts as a flag. When CPU B
reads the changed flag (it's theoretically irrelevant whether the change
takes a nanosecond or a day to propagate to CPU B) all the writes that
were made by CPU A prior to CPU A writing to the memory flag must be
correctly readable by CPU B. A sequentially consistent architecture
satisfies this property.
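
The handoff described above can be sketched with C++11 atomics, which postdate this thread. This is only an illustration under assumptions: `Mailbox`, `publish`, `consume_sum` and `run_handoff` are made-up names. The sketch uses a release store and an acquire load on the flag, which is enough for this particular property; sequential consistency is sufficient for it but stronger than needed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; none of them come from the thread.
struct Mailbox {
    int payload[4] = {0, 0, 0, 0};   // ordinary data written by "CPU A"
    std::atomic<bool> ready{false};  // the memory location acting as a flag
};

// "CPU A": make the ordinary writes first, then set the flag (release).
inline void publish(Mailbox& m, int base) {
    for (int i = 0; i < 4; ++i) m.payload[i] = base + i;
    m.ready.store(true, std::memory_order_release);
}

// "CPU B": wait until the flag is seen (acquire), then read the data.
inline int consume_sum(Mailbox& m) {
    while (!m.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // propagation delay does not matter
    }
    int sum = 0;
    for (int i = 0; i < 4; ++i) sum += m.payload[i];
    return sum;
}

// Runs one writer and one reader and returns the sum the reader saw.
inline int run_handoff(int base) {
    Mailbox m;
    int seen = 0;
    std::thread reader([&] { seen = consume_sum(m); });
    std::thread writer([&] { publish(m, base); });
    writer.join();
    reader.join();
    assert(m.ready.load(std::memory_order_relaxed));
    return seen;
}
```

With `base = 10` the reader sums 10 + 11 + 12 + 13, so `run_handoff(10)` returns 46 no matter how late the flag becomes visible.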

Historically the shared memory IA-32 multiprocessor architecture has been
sequentially consistent:

Gharachorloo, K. Memory consistency models for shared-memory
multiprocessors. Tech. Rep. CSL-TR-95-685, Computer Systems Laboratory,
Departments of Electrical Engineering and Computer Science, Stanford
University, Stanford, CA, December 1995. Available:
<www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf>

Page 298:

Other companies such as Hewlett-Packard (HP), Intel, and Silicon
Graphics (SGI) have chosen to support sequential consistency in their
current architectures. The next generation processors from these
companies exploit aggressive implementation techniques in conjunction
with dynamically scheduled processors to efficiently support sequential
consistency (mentioned in the previous paragraph). However, there is
still a dependence on relaxed models for enabling compiler
optimizations for explicitly parallel programs. Furthermore, companies
such as Intel still maintain the option to move to relaxed models by
recommending that programmers use special serializing and locking
operations for future compatibility.

Is there any counterexample over the last decade of x86 CPU innovation that
proves the x86 shared memory multiprocessor architecture is no longer
sequentially consistent? Could it render unsafe the intended processor
(native thread) communication outlined in the first paragraph above?

The fear initiated by Intel and co. that programmers must use special
serializing and locking operations to ensure future x86 compatibility does
not have to be amplified by the wider community. Let's assume future
changes to the x86 memory model do break existing programs that rely upon
a sequentially consistent architecture. To ensure old programs don't fail
in potentially mysterious ways an operating system could check all
executables for the export of a symbol indicating awareness of the new
memory model (and otherwise defaulting to program termination). Whatever
the marketing department called the new architecture they wouldn't be able
to weasel around such a stark reminder of binary incompatibility.

I am happy to conform to the memory model of the architecture I am
compiling for. I just need to confirm what it realistically is.

Regards,
Adam

Alexander Terekhov

Nov 16, 2006, 5:55:28 AM

Buy Itanic. It has a defined memory model. x86 under Itanic is TSO (for
WB). Any other x86 has basically undefined memory model.

regards,
alexander.

Joe Seigh

Nov 17, 2006, 6:12:19 AM

Adam Warner wrote:
> Hi all,
>
> In my prior posts to comp.programming.threads [note new crosspost to
> comp.arch] I described an algorithm where CPU A makes a number of memory
> writes before writing to a memory location that acts as a flag. When CPU B
> reads the changed flag (it's theoretically irrelevant whether the change
> takes a nanosecond or a day to propagate to CPU B) all the writes that
> were made by CPU A prior to CPU A writing to the memory flag must be
> correctly readable by CPU B. A sequentially consistent architecture
> satisfies this property.
>
> Historically the shared memory IA-32 multiprocessor architecture has been
> sequentially consistent:

[...]

>
> The fear initiated by Intel and co. that programmers must use special
> serializing and locking operations to ensure future x86 compatibility does
> not have to be amplified by the wider community. Let's assume future
> changes to the x86 memory model do break existing programs that rely upon
> a sequentially consistent architecture. To ensure old programs don't fail
> in potentially mysterious ways an operating system could check all
> executables for the export of a symbol indicating awareness of the new
> memory model (and otherwise defaulting to program termination). Whatever
> the marketing department called the new architecture they wouldn't be able
> to weasel around such a stark reminder of binary incompatibility.
>
> I am happy to conform to the memory model of the architecture I am
> compiling for. I just need to confirm what it realistically is.
>

I think Intel is in "pretend the problem does not exist" mode until
they can figure out what to do about it. About the only thing you can
do is put an abstraction layer in place to insulate your programs from
any memory model changes. Usually this is just a bunch of memory barrier
and atomic access macros to give you the guarantees that you need. You
can take a look at how the Linux kernel does it or look at atomic_ops
in http://www.hpl.hp.com/research/linux/qprof/
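
As a rough idea of what such an insulating layer can look like, here is a minimal sketch built on C++11 `std::atomic` (which did not exist when this was posted). The names in the `portable` namespace are hypothetical and are not the API of atomic_ops or of the Linux kernel; the point is only that application code calls the wrappers and never names a fence instruction directly.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper names; NOT the atomic_ops or Linux kernel API.
namespace portable {

// Store that keeps all earlier writes ordered before it.
template <typename T>
inline void store_release(std::atomic<T>& a, T v) {
    a.store(v, std::memory_order_release);
}

// Load that keeps all later reads and writes ordered after it.
template <typename T>
inline T load_acquire(const std::atomic<T>& a) {
    return a.load(std::memory_order_acquire);
}

// Full two-way barrier for the rare places that need one.
inline void full_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Compare-and-swap; on failure `expected` receives the current value.
template <typename T>
inline bool cas(std::atomic<T>& a, T& expected, T desired) {
    return a.compare_exchange_strong(expected, desired,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire);
}

}  // namespace portable

// Small usage example of the wrappers.
inline void portable_demo() {
    std::atomic<int> x{0};
    portable::store_release(x, 5);
    assert(portable::load_acquire(x) == 5);
}
```

If the memory model of a target changes, only the bodies of these wrappers would need to change, which is the insulation being recommended.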

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Adam Warner

Nov 17, 2006, 7:38:07 AM

This is superb advice, thank you! I was wondering if a set of atomic
operations for userspace Linux programs was available.

The qprof README_atomic_ops.txt contains this comment:

Note that the implementation reflects our understanding of real
processor behavior. This occasionally diverges from the documented
behavior. (E.g. the documented X86 behavior seems to be weak enough that
it is impractical to use. Current real implementations appear to be
much better behaved.) We of course are in no position to guarantee that
future processors (even HPs) will continue to behave this way, though we
hope they will.

Regards,
Adam

Sivaprasath Murugeshan

Nov 17, 2006, 8:18:51 AM

I think x86 allows certain reorderings, as described in
http://www.linuxjournal.com/article/8212

On the other hand, x86 CPUs give no ordering guarantees for loads, so the
smp_mb() and smp_rmb() primitives expand to lock;addl. This atomic
instruction acts as a barrier to both loads and stores. Some SSE
instructions are ordered weakly; for example, clflush and nontemporal move
instructions. CPUs that have SSE can use mfence for smp_mb(), lfence for
smp_rmb() and sfence for smp_wmb(). A few versions of the x86 CPU have a
mode bit that enables out-of-order stores, and for these CPUs, smp_wmb()
also must be defined to be lock;addl.
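
To make the instruction mapping above concrete, here is a hedged sketch for GCC or Clang on x86-64. The `sketch_*` names are invented for this example and are not the kernel's definitions; on any other compiler or architecture the sketch falls back to standard C++ fences so that it still builds.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper names; these are NOT the Linux kernel's definitions.
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
// The fence instructions named in the text above.
inline void sketch_mb()  { __asm__ __volatile__("mfence" ::: "memory"); }
inline void sketch_rmb() { __asm__ __volatile__("lfence" ::: "memory"); }
inline void sketch_wmb() { __asm__ __volatile__("sfence" ::: "memory"); }
// The older alternative: a LOCKed add of zero to the top of the stack.
// It changes no data but acts as a barrier to both loads and stores.
inline void sketch_locked_mb() {
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
}
#else
// Portable stand-ins so the sketch still compiles elsewhere.
inline void sketch_mb()        { std::atomic_thread_fence(std::memory_order_seq_cst); }
inline void sketch_rmb()       { std::atomic_thread_fence(std::memory_order_acquire); }
inline void sketch_wmb()       { std::atomic_thread_fence(std::memory_order_release); }
inline void sketch_locked_mb() { std::atomic_thread_fence(std::memory_order_seq_cst); }
#endif

// A store followed by a load of a different location, separated by a
// full barrier so the load cannot be satisfied ahead of the store.
inline int store_then_load(std::atomic<int>& mine, const std::atomic<int>& theirs) {
    mine.store(1, std::memory_order_relaxed);
    sketch_mb();
    return theirs.load(std::memory_order_relaxed);
}
```

The `"memory"` clobber matters as much as the instruction itself: it stops the compiler from moving memory accesses across the barrier.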

- Siva.

Adam Warner wrote:
> Hi all,
>
> In my prior posts to comp.programming.threads [note new crosspost to
> comp.arch] I described an algorithm where CPU A makes a number of memory
> writes before writing to a memory location that acts as a flag. When CPU B
> reads the changed flag (it's theoretically irrelevant whether the change
> takes a nanosecond or a day to propagate to CPU B) all the writes that
> were made by CPU A prior to CPU A writing to the memory flag must be
> correctly readable by CPU B. A sequentially consistent architecture
> satisfies this property.
>
> Historically the shared memory IA-32 multiprocessor architecture has been
> sequentially consistent:
>

> Gharachorloo, K. Memory consistency models for shared-memory
> multiprocessors. Tech. Rep. CSL-TR-95-685, Computer Systems Laboratory,
> Departments of Electrical Engineering and Computer Science, Stanford
> University, Stanford, CA, December 1995. Available:
> <www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf>
>
> Page 298:
>
>    Other companies such as Hewlett-Packard (HP), Intel, and Silicon
>    Graphics (SGI) have chosen to support sequential consistency in their
>    current architectures. The next generation processors from these
>    companies exploit aggressive implementation techniques in conjunction
>    with dynamically scheduled processors to efficiently support sequential
>    consistency (mentioned in the previous paragraph). However, there is
>    still a dependence on relaxed models for enabling compiler
>    optimizations for explicitly parallel programs. Furthermore, companies
>    such as Intel still maintain the option to move to relaxed models by
>    recommending that programmers use special serializing and locking
>    operations for future compatibility.
>
> Is there any counterexample over the last decade of x86 CPU innovation that
> proves the x86 shared memory multiprocessor architecture is no longer
> sequentially consistent? Could it render unsafe the intended processor
> (native thread) communication outlined in the first paragraph above?
>

> The fear initiated by Intel and co. that programmers must use special
> serializing and locking operations to ensure future x86 compatibility does
> not have to be amplified by the wider community. Let's assume future
> changes to the x86 memory model do break existing programs that rely upon
> a sequentially consistent architecture. To ensure old programs don't fail
> in potentially mysterious ways an operating system could check all
> executables for the export of a symbol indicating awareness of the new
> memory model (and otherwise defaulting to program termination). Whatever
> the marketing department called the new architecture they wouldn't be able
> to weasel around such a stark reminder of binary incompatibility.
>
> I am happy to conform to the memory model of the architecture I am
> compiling for. I just need to confirm what it realistically is.
>

> Regards,
> Adam

Message has been deleted

Elcaro Nosille

Nov 17, 2006, 6:16:03 PM

When you use monitors or critical sections, these use LOCKed CMPXCHGs,
which have acquire and release behaviour. So use them and you won't
see any side effects. When the lock/unlock functions for the monitor
are embedded in code that can't be observed by the compiler, the
compiler doesn't get the opportunity to do any reads before the lock
or to defer any writes until after the unlock; so the usual C/C++ problem
that these languages don't support multithreading doesn't exist.
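
A minimal sketch of the kind of lock being described, written with C++11 atomics (newer than this thread). `SpinLock` and `count_with_lock` are hypothetical names; a real monitor or critical section would also block in the OS instead of spinning. On x86, compilers typically implement the compare-exchange below with LOCK CMPXCHG.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical minimal lock; real monitors also sleep in the kernel.
class SpinLock {
    std::atomic<int> state_{0};  // 0 = free, 1 = held
public:
    void lock() {
        int expected = 0;
        // Acquire on success: nothing inside the critical section may be
        // moved above this point.
        while (!state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;               // a failed CAS overwrote it
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: nothing inside the critical section may sink below this.
        state_.store(0, std::memory_order_release);
    }
};

// Two threads bump a plain (non-atomic) counter under the lock.
inline long count_with_lock(int iterations) {
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    assert(counter >= 0);
    return counter;
}
```

Because every increment happens between an acquire and a release on the same lock, `count_with_lock(n)` returns exactly `2 * n`.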

Steve Watt

Nov 19, 2006, 3:29:45 AM

In article <455e42d9$0$30328$9b4e...@newsspool1.arcor-online.net>,
Elcaro Nosille <Elcaro....@mailinator.com> wrote:
> Alexander Terekhov wrote:

>
>> Buy Itanic. It has a defined memory model. x86 under Itanic is
>> TSO (for WB). Any other x86 has basically undefined memory model.
>

>What an idiotic advice.

Care to elaborate a bit on your opinion of that advice?

Itani(um|c) is arguably one of the best-defined modern processors, in terms
of semantics of memory accesses, available. Intel has carefully documented
the memory model, to a degree not generally seen.
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.5" / 37N 20' 15.3"
Internet: steve @ Watt.COM Whois: SW32-ARIN
Free time? There's no such thing. It just comes in varying prices...

Chris Thomasson

Nov 19, 2006, 3:46:12 AM

http://groups.google.com/group/comp.programming.threads/msg/b0ab2c4405d1c2c6

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6715c3e5a73c4016

http://groups.google.com/group/comp.programming.threads/msg/ca2f1af4552233df
(this method works on current x86!)

enjoy.

Chris Thomasson

Nov 19, 2006, 3:47:32 AM

"Steve Watt" <steve.re...@Watt.COM> wrote in message
news:ejp4lp$2dfk$1...@wattres.Watt.COM...

> In article <455e42d9$0$30328$9b4e...@newsspool1.arcor-online.net>,
> Elcaro Nosille <Elcaro....@mailinator.com> wrote:
>> Alexander Terekhov wrote:
>>
>>> Buy Itanic. It has a defined memory model. x86 under Itanic is
>>> TSO (for WB). Any other x86 has basically undefined memory model.
>>
>>What an idiotic advice.
>
> Care to elaborate a bit on your opinion of that advice?
>
> Itani(um|c) is arguably one of the best-defined modern processors, in
> terms
> of semantics of memory accesses, available. Intel has carefully
> documented
> the memory model, to a degree not generally seen.

It does indeed show how I emulates x86 behavior.. So, I agree with Alex and
you in that it is a good reference for x86 memodel.

Chris Thomasson

Nov 19, 2006, 3:53:23 AM

> It does indeed show how I emulates x86 behavior..

^

It does indeed show how it emulates x86 behavior..

Message has been deleted

Joe Seigh

Nov 19, 2006, 10:50:11 AM

Elcaro Nosille wrote:
> Steve Watt wrote:

>
>>> What an idiotic advice.
>
>
>> Care to elaborate a bit on your opinion of that advice?
>
>

> When do you have the opportunity to choose a CPU in a sw-project? Almost never!
> And for learning-purposes? The manual should be sufficient.
> And there's no need for this exact memory-model because there are constraints
> under which synchronization works perfectly on x86s: either through monitors
> or critical sections which use LOCKed operations. Even if this behaviour isn't
> defined exactly: Intel couldn't change this because most x86-OSes rely on the
> current behaviour. So there's no need to worry about the missing definition
> of the x86 memory-model.

We're talking about knowing what the memory model is so the synchronization
primitives can be implemented correctly. Right now it's mostly guessing and
erring on the safe side.

Note that discussions like this never occur about the other architectures
like Z arch, powerpc, sparc, etc...

What's worse is the system manufacturers who build Intel based SMP systems
with more processors than Intel supports. There is no telling what they
may have as a memory model.

> Alexander Terehov is a geek which sees problems that don't exist.

Could be worse. He could be a non geek who doesn't see problems that
do exist.

Elcaro Nosille

Nov 19, 2006, 10:54:33 AM

Joe Seigh wrote:

> We're talking about knowing what the memory model is so the
> synchronization primitives can be implemented correctly. Right now it's
> mostly guessing and erring on the safe side.

It's not guessing: there are things you can rely on (e.g. the acquire/release
behaviour of LOCKed CMPXCHGs) and neither Intel nor AMD could change
that without breaking almost all x86-OSs. So where's the problem?

> Note that discussions like this never occur about the other architectures
> like Z arch, powerpc, sparc, etc...

Should this be a justification for the necessity to consider the missing
x86 memory-model-specification as a real problem? I don't see any problem.

> What's worse is the system manufacturers who build Intel based SMP
> systems with more processors than Intel supports.

So where's the qualitative difference to a 2-CPU SMP-system here?

>> Alexander Terehov is a geek which sees problems that don't exist.

> Could be worse.
> He could be a non geek who doesn't see problems that do exist.

ROFL! You made my day!

Steve Watt

Nov 19, 2006, 12:26:27 PM

In article <456046bc$0$18836$9b4e...@newsspool4.arcor-online.net>,
Elcaro Nosille <Elcaro....@mailinator.com> wrote:
> Steve Watt wrote:

>
>>> What an idiotic advice.
>
>> Care to elaborate a bit on your opinion of that advice?
>

>When do you have the opportunity to choose a CPU in a sw-project? Almost never!

Actually, quite frequently. But I must be in a rather different market
segment than you. The capabilities of the processor, from power efficiency
to algorithmic features, are the determining factor in boxes in my industry.
And Intel isn't seen particularly frequently.

>And for learning-purposes? The manual should be sufficient.
>And there's no need for this exact memory-model because there are constraints
>under which synchronization works perfectly on x86s: either through monitors
>or critical sections which use LOCKed operations. Even if this behaviour isn't
>defined exactly: Intel couldn't change this because most x86-OSes rely on the
>current behaviour. So there's no need to worry about the missing definition
>of the x86 memory-model.

I suspect you're missing a rather important piece of the puzzle: There is
a not-so-recent development in the concurrent programming community
known as "lock free". It is possible, with only a modicum of analysis and
understanding, to build data structures without locks that far outperform,
in a system-wide sense at times, the equivalent data structure with simple
mutex or read/write lock protection.

By "far outperform", I mean an order of magnitude or more. The kind of
optimizations that are important, as opposed to the "I can squeeze out
3% if I compile with -O9" sort.

>Alexander Terehov is a geek which sees problems that don't exist.

Actually, I suspect he understands the problem better than most. Sometimes
if you don't understand what the fuss is about, it's because there's nothing
to fuss about. Sometimes, it's just because you don't understand. This
is the latter.

Message has been deleted

Steve Watt

Nov 19, 2006, 5:14:07 PM

In article <45609f47$0$30320$9b4e...@newsspool1.arcor-online.net>,

Elcaro Nosille <Elcaro....@mailinator.com> wrote:
> Steve Watt wrote:
>

>>> When do you have the opportunity to choose a CPU in a sw-project? Almost never!
>

>> Actually, quite frequently. ...
>
>That's of no matter because there are ways to do safe SMP-MT-code with
>current x86s: use synchronization-primitives that do LOCKed CMPXCHGs.

Stop changing your assertion. There are industries where not everything
is an Intel processor.

>> I suspect you're missing a rather important piece of the puzzle: There
>> is a not-so-recent development in the concurrent programming community
>> known as "lock free".
>

>This lock-free techniques also use LOCKed CMPXCHGs ...

Not always. However, since you've already decided I'm an idiot, I
won't bother pointing out the alternatives.

>> It is possible, with only a modicum of analysis and understanding, to
>> build data structures without locks that far outperform, in a system-wide
>> sense at times, the equivalent data structure with simple mutex or read
>> /write lock protection.
>

>In theory, but in practice MT-code isn't synchronizing at such a high frequency
>(or contention-cases arise that often) that there are rare cases where lock-free
>code is noteworthy advantageous. I've used lock-free code only once where it made
>really sense: Inside a heap-allocator that might be used at a frequency that this
>lock-free code could make sense.

That is one place that lock-free algorithms make sense. There are a number
of other places. I am not one of the shills that will say every application
will benefit from application of lock-free principles; far from it, but
there are places where important performance gains are to be had.

>If you're using lock-free code, the usual locking code could use a locking
>interval which is extremely short. So the lock-free code makes only sense where
>the run-time relation between the synchronization-operation and the rest of the
>code between this operations is very low, i.e. you're locking extremely often:
>that's very rare. If it wouldn't be like that there would be benchmarks of
>real-world code proving the advantage of lock-free programming but not only
>synthetic benchmarks.

Large transaction rate updates on certain kinds of data structures benefit
hugely from lock-free techniques. Certain ways of storing routing tables, for
example.

>In Germany there's a adage saying that someone isn't seeing the whole forest
>because of so much trees; I think its like that with lock-free programming: the
>people engaging with lf programming see the details, overestimate the advantage
>because they have the usual affinity to overestimate these things what they are
>familiar with and think lf programming is very important.

There is a similar adage across most of the English-speaking world, not
surprisingly. There are places where spending the energy required to get
lock-free code working is utterly nonsensical. Chris T. has finally moderated
his views in that area, after a fair amount of poking. But you don't follow
comp.programming.threads, or at least you don't post there, so there's not
a lot more I can say on the subject.

>> Actually, I suspect he understands the problem better than most.
>

>He understands the problem in theory; but it doesn't exist in practice.
>There are ways to write safe SMP-MT-code on x86s with LOCKed CMPXCHGs.

>So where's the problem?

The problem is that LOCK CMPXCHG is slow, and getting slower. As
pipelines and memory systems get more aggressive, and more interesting
execution units appear allowing more and more temporal dislocality
between adjacent instructions, these issues will become ever more
important.

You may choose to remain in your "everything useful can be done under
a spinlock" world, but I would advise you to at least consider other
possibilities.

>> Sometimes if you don't understand what the fuss is about, ...

>> Sometimes, it's just because you don't understand. This is the latter.
>

>You're an idiot. I understand what memory-models are and I understand that
>the missing memory-model-spec for x86s isn't a problem - prove the opposite.

I can't make any proofs either way, and nor can you. There is no spec,
therefore it's not obvious what operations the architecture guarantees
are safe, and always will be.

And now, I will do what I usually do when I see someone calling other
people idiots, but I don't usually say it aloud. *plonk*

Elcaro Nosille

Nov 19, 2006, 5:49:50 PM

Steve Watt wrote:

>> That's of no matter because there are ways to do safe SMP-MT-code with
>> current x86s: use synchronization-primitives that do LOCKed CMPXCHGs.

> Stop changing your assertion. There are industries where not everything

> is an Intel processor. ...

We're talking about the "problem" of missing memmodel-specs for x86-CPUs
here!

> Large transaction rate updates on certain kinds of data structures benefit
> hugely from lock-free techniques. Certain ways of storing routing tables,
> for example.

In this case as in most others there are smart workarounds: Copy the routing
table only on every Nth packet (if it changed) or on a timeout (if we go away
with the monitor to the kernel).

> There are places where spending the energy required to get lock-free code
> working is utterly nonsensical. Chris T. has finally moderated his views
> in that area, after a fair amount of poking. But you don't follow
> comp.programming.threads, ...

I've been following Chris's silly postings in c.p.t. He's simply not in practice
and doesn't see that lock-free programming very seldom really makes sense.

> The problem is that LOCK CMPXCHG is slow, and getting slower.

From what did you conclude that? The P4 was very slow compared to the P3 but
the C2-architecture is about as fast as the P3.

> As pipelines and memory systems get more aggressive, and more interesting
> execution units appear allowing more and more temporal dislocality between
> adjacent instructions, these issues will become ever more important.

Eh, I think that only load/store-units are involved in this difficulty.

> You may choose to remain in your "everything useful can be done under

> a spinlock" world, ...

Where did I say that?

>> You're an idiot. I understand what memory-models are and I understand that
>> the missing memory-model-spec for x86s isn't a problem - prove the opposite.

> I can't make any proofs either way, and nor can you. There is no spec,
> therefore it's not obvious what operations the architecture guarantees
> are safe, and always will be.

But this doesn't give you an unsafe environment for MT-programming x86 in
general.

Chris Thomasson

Nov 19, 2006, 8:05:11 PM

"Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message
news:456044a6$0$5710$9b4e...@newsspool3.arcor-online.net...
> Chris Thomasson wrote:
>
>> http://groups.google.com/group/comp.programming.threads/msg/b0ab2c4405d1c2c6
>> http://groups.google.com/...
>> http://groups.google.com/...
>
> Why do you think fences are necessary. When we're synchronizing threads,
> we're at least using an atomic op which does the fencing for us.

Fences are necessary on x86 when your algorithm can't handle the fact that a
store followed by a load to another location can, and will, be reordered...
For instance, Petersons Algorithm needs to use an mfence in order to prevent
the latter load from the other state variable from migrating above the lock
acquisition logic:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317
(if you study the code, you should know exactly why a mfence is needed...)

;^)
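
For readers who do not want to dig through the linked code, here is a hedged sketch of Peterson's algorithm for two threads in C++11 (not the code from the linked post; `PetersonLock` and `peterson_count` are invented names). It uses sequentially consistent atomics throughout, which is stronger than strictly necessary but easy to reason about. On x86, compilers typically turn a sequentially consistent store into XCHG or MOV followed by MFENCE, and that is exactly what stops the later load of the other thread's flag from being satisfied before the store is visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical two-thread Peterson lock; `id` must be 0 or 1.
class PetersonLock {
    std::atomic<bool> interested_[2];
    std::atomic<int> turn_;
public:
    PetersonLock() : turn_(0) {
        interested_[0].store(false);
        interested_[1].store(false);
    }
    void lock(int id) {
        const int other = 1 - id;
        interested_[id].store(true);  // seq_cst store: announce intent
        turn_.store(other);           // seq_cst store: give way to the peer
        // These loads must not be satisfied before the two stores above
        // are globally visible; a plain x86 MOV store would allow that.
        while (interested_[other].load() && turn_.load() == other) {
            std::this_thread::yield();
        }
    }
    void unlock(int id) { interested_[id].store(false); }
};

// Two threads bump a plain counter under the Peterson lock.
inline long peterson_count(int iterations) {
    PetersonLock lock;
    long counter = 0;
    auto work = [&](int id) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(id);
            ++counter;
            lock.unlock(id);
        }
    };
    std::thread a(work, 0);
    std::thread b(work, 1);
    a.join();
    b.join();
    assert(counter >= 0);
    return counter;
}
```

If the two stores in `lock()` were weakened to plain stores with no fence before the loop, both threads could read the other's flag as false and enter the critical section together.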

Chris Thomasson

Nov 19, 2006, 8:15:00 PM

> This lock-free techniques also use LOCKed CMPXCHGs ...

Wrong. Not all of them use atomic operations.

Joe Seigh and I have invented several lock-free algorithms that don't use
any atomic operations. We have also invented others that don't use expensive
memory barriers either...

Chris Thomasson

Nov 19, 2006, 8:21:22 PM

"Steve Watt" <steve.re...@Watt.COM> wrote in message

news:ejqkvf$29b$1...@wattres.Watt.COM...

> In article <45609f47$0$30320$9b4e...@newsspool1.arcor-online.net>,
> Elcaro Nosille <Elcaro....@mailinator.com> wrote:
>> Steve Watt wrote:

[...]

>>In Germany there's a adage saying that someone isn't seeing the whole
>>forest because of so much trees; I think its like that with lock-free
>>programming: the people [...]

> [...] There are places where spending the energy required to get
> lock-free code working is utterly nonsensical. Chris T. has finally
> moderated his views in that area, after a fair amount of poking.

Yes.. I thank all of you who helped me finally see the LIGHT!
--

http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8329a6ddcb95b8ac

Ahh, that sounds better.

:^)

Alexander Terekhov

Nov 20, 2006, 3:08:08 AM

Elcaro Nosille wrote:

[... LOCKed CMPXCHGs ...]

I've been telling all along that the only way to achieve Sequential
Consistency (subject of this thread) on x86-native is to replace loads
with LOCKed CMPXCHGs 42. But then an Intel/AMD architect jumped in and
told me that that would be relying on implementation details (albeit
documented in Intel specs), not semantics.

http://groups.google.com/group/comp.arch/msg/b6c686b7d827022c
http://groups.google.com/group/comp.arch/msg/d532441a875bc4fc
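
One reading of "replace loads with LOCKed CMPXCHGs 42" is a compare-and-swap that compares against 42 and writes 42, so it never changes the stored value but always reports what it observed. A hedged sketch in C++11 follows; `load_via_cas` is a made-up name, and whether this matches the exact instruction sequence meant above is an assumption.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: read `a` with a locked read-modify-write instead of
// a plain load. On x86 compilers typically emit LOCK CMPXCHG for this.
inline int load_via_cas(std::atomic<int>& a) {
    int expected = 42;
    // If a == 42 this writes 42 back, so nothing visibly changes.
    // Otherwise the exchange fails and the current value is copied
    // into `expected`. Either way `expected` ends up holding the value read.
    a.compare_exchange_strong(expected, 42, std::memory_order_seq_cst);
    return expected;
}

// Small usage example.
inline void load_via_cas_demo() {
    std::atomic<int> v{7};
    assert(load_via_cas(v) == 7);
    assert(v.load() == 7);  // the value is left untouched
}
```

The cost is that every such "load" needs exclusive ownership of the cache line, which is why nobody wants this as the only route to sequential consistency.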

So here's a question for you Nosille: how do you do Sequential
Consistency on x86-native?

regards,
alexander.

already...@yahoo.com

Nov 20, 2006, 4:42:01 AM

Steve Watt wrote:

> In article <45609f47$0$30320$9b4e...@newsspool1.arcor-online.net>,
> Elcaro Nosille <Elcaro....@mailinator.com> wrote:
>
> >
> >This lock-free techniques also use LOCKed CMPXCHGs ...
>
> Not always. However, since you've already decided I'm an idiot, I
> won't bother pointing out the alternatives.
>

You have more than one reader, you know. If you don't want to point out
the alternatives for Elcaro (quite understandable) do it for me.

First, I assume that we all agree that a write barrier is not needed
(except for WC regions and non-temporal stores) and that despite the
absence of a guarantee in Intel's official manual we can safely rely
on that behavior being preserved in future Intel/AMD processors.

On the read side I can think about following alternatives to LOCKed
CMPXCHGs:
1. LFENCE
Don't see why LFENCE is more attractive than LOCKed CMPXCHGs

2. Implied read dependency barrier
Sometimes it comes naturally, but more often it looks artificial. In the
latter case the resulting code is very hard to explain to an occasional reader.
From a performance perspective, read dependency is probably a good
alternative on P4-derived cores, but K8 and M/C/W cores have no SMT, so
no useful work can be done while waiting for the dependency. Generally,
IMHO, converting load-related control dependencies into data
dependencies is almost always a loss from a performance perspective.

3. Do nothing. Rely on the hope that the read barrier is not needed at
all and HW will end up "doing the right thing".
Maybe the most practical alternative among the three. But official
confirmation from Intel/AMD would help.

Anton Ertl

Nov 20, 2006, 5:15:59 AM

already...@yahoo.com writes:
>2. Implied read dependency barrier

You mean, that, if there is a data (flow) dependence chain between two
loads, the second load will not see an earlier memory state than the
first load?

Does the IA-32 and AMD64 architecture guarantee that? I know that
IA-64 does and Alpha does not. Load-address prediction (present in
the Pentium 4 IIRC) can subvert it.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Joe Seigh

Nov 20, 2006, 5:51:31 AM

Anton Ertl wrote:
> already...@yahoo.com writes:
>
>>2. Implied read dependency barrier
>
>
> You mean, that, if there is a data (flow) dependence chain between two
> loads, the second load will not see an earlier memory state than the
> first load?
>
> Does the IA-32 and AMD64 architecture guarantee that? I know that
> IA-64 does and Alpha does not. Load-address prediction (present in
> the Pentium 4 IIRC) can subvert it.
>

Load dependency isn't usually part of the officially architected
memory models I've seen except for Java which added it for volatile
references as of JSR-133 specifically to support lock-free AFAICT.

So if ia-32 and amd64 did have memory models it's not likely they
would forbid it. But Linux uses load dependency for scalability
so if Intel or AMD broke it they would take a hit there. Plus
every naive implementation of DCL (double checked locking) would
break.
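
For reference, a sketch of double-checked locking that does not lean on an undocumented load-dependency guarantee, using C++11 atomics (which postdate this thread). `Config`, `LazyConfig` and `dcl_builds_once` are hypothetical names. The first check is an acquire load, which on x86 compiles to an ordinary MOV, so the explicit ordering costs nothing there.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

struct Config { int value; };  // hypothetical lazily built object

class LazyConfig {
    std::atomic<Config*> ptr_{nullptr};
    std::mutex mu_;
    int builds_ = 0;  // only touched while holding mu_
public:
    Config* get() {
        // First check, without the lock. Acquire pairs with the release
        // store below, so the object's fields are visible once the
        // pointer is.
        Config* p = ptr_.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> guard(mu_);
            // Second check, under the lock; the mutex already orders this.
            p = ptr_.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Config{123};
                ++builds_;
                ptr_.store(p, std::memory_order_release);  // publish last
            }
        }
        return p;
    }
    int builds() const { return builds_; }
    ~LazyConfig() { delete ptr_.load(std::memory_order_relaxed); }
};

// Two threads race to initialise; exactly one construction should happen.
inline bool dcl_builds_once() {
    LazyConfig lazy;
    Config* seen[2] = {nullptr, nullptr};
    std::thread a([&] { seen[0] = lazy.get(); });
    std::thread b([&] { seen[1] = lazy.get(); });
    a.join();
    b.join();
    assert(seen[0] != nullptr);
    return lazy.builds() == 1 && seen[0] == seen[1] && seen[0]->value == 123;
}
```

The naive version replaces the atomic pointer with a plain one and hopes that a reader who sees the pointer also sees the fields behind it, which is the dependent-load assumption under discussion.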

already...@yahoo.com

Nov 20, 2006, 6:06:56 AM

Anton Ertl wrote:

> already...@yahoo.com writes:
> >2. Implied read dependency barrier
>
> You mean, that, if there is a data (flow) dependence chain between two
> loads, the second load will not see an earlier memory state than the
> first load?

Yes

>
> Does the IA-32 and AMD64 architecture guarantee that?

The manual says nothing.
Since not having such a barrier is really weird I prefer to treat the
silence as a guarantee.

> I know that
> IA-64 does and Alpha does not. Load-address prediction (present in
> the Pentium 4 IIRC) can subvert it.

I don't believe it's possible.
BTW, that's the first time I hear about load address prediction in
Pentium 4 that can potentially bypass true dependency. Can't even
imagine how it could work. The only form of "load address prediction" I
have heard about (which appears to be present both in Prescott P4 and in the
new M/C/W core) is the prediction of whether the address of a subsequent
load matches the address of a preceding store.

Message has been deleted

Joe Seigh

Nov 20, 2006, 7:00:46 AM

Elcaro Nosille wrote:
> Alexander Terekhov wrote:

>
>> But then an Intel/AMD architect jumped in and told me that that
>> would be relying on implementation details (albeit documented

>> in Intel specs), not semantics. ...
>
>
> Maybe he said that; but that's irrelevant because there are established
> ways to cope with this and neither Intel nor AMD could change these with-
> out breaking how most x86-OSs are dealing with this.

The OSes would break but the breakage would be temporary as the OSes would
just change all the synchronization implementations. Worst case the
OS could just disable all the extra cores and run on a single processor.

If you're arguing that Intel wouldn't ever do something stupid ... well,
you're in one of the right newsgroups for that.

Elcaro Nosille

Nov 20, 2006, 7:04:18 AM

Joe Seigh schrieb:

> The OSes would break but the breakage would be temporary as the
> OSes would just change all the synchronization implementations.

These synchronization primitives aren't only implemented in the OSes
but also in the applications. For example, Win32 doesn't provide monitors,
so some developers chose monitors from third-party libraries, even
statically linked ones. So Intel couldn't break this behaviour.

> Worst case the OS could just disable all the extra cores and run
> on a single processor.

That would be a huge marketing-advantage for AMD: "look, half your
apps might not be running". Do you really think Intel would do that?

> If you're arguing that Intel wouldn't ever do something stupid ...
> well, you're in one of the right newsgroups for that.

Ok, then tell me why Intel could break the LOCK-CMPXCHG-behaviour
or tell me any mistake comparable to that Intel made in the past.
Your assumption is more than unlikely.

Casper H.S. Dik

Nov 20, 2006, 7:13:39 AM

Joe Seigh <jsei...@xemaps.com> writes:

>The OSes would break but the breakage would be temporary as the OSes would
>just change all the synchronization implementations. Worst case the
>OS could just disable all the extra cores and run on a single processor.

Indeed; and all the applications which use OS-supplied synchronization
primitives would just continue to work (kernel locking primitives,
pthreads, system-defined atomics), at least as long as the latter
are all in shared libraries.

>If you're arguing that Intel wouldn't ever do something stupid ... well,
>you're in one of the right newsgroups for that.

It would be stupid if it gained only a small percentage in performance,
but it would be fine if it gave a huge performance boost. Odds are, though,
that there would be a processor status bit to enable the new behaviour
and that it would be up to the BIOS vendors to screw up.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Anton Ertl

Nov 20, 2006, 6:42:50 AM

already...@yahoo.com writes:
>
>Anton Ertl wrote:
>
>> already...@yahoo.com writes:
>> >2. Implied read dependency barrier

...

>> Load-address prediction (present in
>> the Pentium 4 IIRC) can subvert it.
>
>I don't believe it's possible.
>BTW, that's the first time I hear about load address prediction in
>Pentium 4 that can potentially bypass true dependency.

Probably not. I just remembered that there was some address
prediction going on in the Pentium 4, but did not remember details.
You are probably right that it does not predict the address, but just
whether there is an alias.

> Can't even
>imagine how it could work.

Just like any other prediction: Some lookup table that is indexed by a
bitstring constructed out of some history bits. Or, alternatively,
instead of predicting the address of the second load, a
value-predictor for the value of the first load could lead to the
second load being executed first.

An example:

thread 1:          thread 2:
p->x = 0;          t = *q;    /* A */
...                u = t->x;  /* B */
p->x = 1;
*q = p;

Let's assume that thread1 is just executed in program order (maybe
with the help of write barriers), and just consider thread 2. The CPU
could look at statement B, use some predictor that happens to predict
that t==p, execute the load B first (resulting in, say, u=0), then
perform the load A with the result that t==p (confirming the
prediction, so that load B won't be retried); voila, no read
dependency barrier.

Note that, even if the "*q=p" changes *q, the predictor might still
predict the load address of B correctly, e.g., due to constructive
interference, or because *q used to be equal to p before it became
unequal.
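
In later C++11 terms (which postdate this thread), the publication pattern in the example can be sketched as below. This is only an illustration under assumptions: `Node`, `node`, `writer`, `reader` and `run_once` are made-up names, and an acquire load stands in for the read dependency barrier under discussion. It says nothing about what any particular x86 implementation guarantees for plain loads.

```cpp
#include <atomic>
#include <thread>

struct Node { int x; };

Node node{0};                    // the object p points to
std::atomic<Node*> q{nullptr};   // the shared pointer cell (*q in the example)

// Thread 1: write the payload, then publish the pointer.
void writer() {
    node.x = 1;                                  // p->x = 1;
    q.store(&node, std::memory_order_release);   // *q = p;
}

// Thread 2: load A fetches the pointer, load B dereferences it.
int reader() {
    Node* t;
    while ((t = q.load(std::memory_order_acquire)) == nullptr) {  // load A
        std::this_thread::yield();               // wait until published
    }
    return t->x;                                 // load B, ordered after A
}

// Runs one writer and one reader and returns what the reader saw in x.
int run_once() {
    node.x = 0;
    q.store(nullptr);
    int u = -1;
    std::thread t2([&] { u = reader(); });
    std::thread t1(writer);
    t1.join();
    t2.join();
    return u;
}
```

With the acquire on load A, the reader can never return the stale 0. The speculative reordering described above is exactly what the ordering constraint rules out.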

Joe Seigh

Nov 20, 2006, 7:29:15 AM

Elcaro Nosille wrote:
> Joe Seigh schrieb:
>
>> If you're arguing that Intel wouldn't ever do something stupid ...
>> well, you're in one of the right newsgroups for that.
>
> Ok, then tell me why Intel could break the LOCK-CMPXCHG-behaviour
> or tell me any mistake comparable to that Intel made in the past.
> Your assumption is more than unlikely.

You're assuming that Intel knows what they're doing. Do you have
any evidence to support that? Note the use of the present tense.
Obviously, Intel knows after the fact when they've screwed up.

Eric P.

Nov 20, 2006, 9:14:08 AM

Alexander Terekhov wrote:
>
> Elcaro Nosille wrote:
>
> [... LOCKed CMPXCHGs ...]
>
> I've been telling all along that the only way to achieve Sequential
> Consistency (subject of this thread) on x86-native is to replace loads
> with LOCKed CMPXCHGs 42. But then an Intel/AMD architect jumped in and
> told me that that would be relying on implementation details (albeit
> documented in Intel specs), not semantics.
>
> http://groups.google.com/group/comp.arch/msg/b6c686b7d827022c
> http://groups.google.com/group/comp.arch/msg/d532441a875bc4fc

Your perceived need for this trick is based on the assumption that
Gharachorloo (a DEC employee at that time) speaks for Intel and
that his insane inconsistent definition of 'Processor Consistency'
somehow obligates Intel to implement it because they use the same
name. Gharachorloo himself borrowed the P.C. name from Goodman
and changed the functional description.
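
For reference, the "replace loads with LOCKed CMPXCHGs" trick quoted above can be sketched roughly as follows, again in C++11 terms that postdate this thread. `locked_load` is a hypothetical helper, and the expectation that a compiler emits LOCK CMPXCHG for it is an assumption about typical x86 code generation, not a guarantee.

```cpp
#include <atomic>

// Reads `a` by way of a compare-and-swap instead of a plain load.
// If the current value is 0, the CAS succeeds and writes 0 back (no visible
// change). Otherwise it fails and copies the current value into `expected`.
// Either way `expected` ends up holding the value that was observed.
inline int locked_load(std::atomic<int>& a) {
    int expected = 0;
    a.compare_exchange_strong(expected, 0, std::memory_order_seq_cst);
    return expected;
}
```

The value returned is the same as a plain load would give. The difference is that the read is performed as an atomic read-modify-write, which is considerably more expensive.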

Just treat Processor Consistency as a marketing term.

Intel would be fools to implement a broken consistency model, and
designing things on the assumption they might is a waste of time.
And if they did, don't waste your time griping - buy AMD stock.
Can you imagine the press conference on TV where a company
spokesman tries to explain how allowing memory to read
as different values on different cpus is 'consistent'?

Paranoia has its place but not in this case.

Eric

Alexander Terekhov

Nov 20, 2006, 10:19:40 AM

"Eric P." wrote:
[...]

> Just treat Processor Consistency as a marketing term.

A marketing term? How fascinating. See for example "consistency of
write visibility" on page 6 in

http://www.intel.com/design/itanium/downloads/25142901.pdf

The above isn't a marketing flyer.

No?

regards,
alexander.

already...@yahoo.com

Nov 20, 2006, 10:25:49 AM

already...@yahoo.com wrote:

> but K8 and M/C/W cores have no SMP

Of course, here I meant "but K8 and M/C/W cores have no SMT"

Chris Thomasson

Nov 20, 2006, 8:26:32 PM

"Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message

news:4560df8d$0$5712$9b4e...@newsspool3.arcor-online.net...
> Steve Watt schrieb:

[...]

>> There are places where spending the energy required to get lock-free code
>> working is utterly nonsensical. Chris T. has finally moderated his views
>> in that area, after a fair amount of poking. But you don't follow comp
>> .programming.threads, ...
>
> I've been following Chris silly postings in c.p.t. He's simply not in
> practice

What the heck do you mean I am not in practice? And, please show me a couple
of silly posts of mine so I can try to clear up that massive infection of
ignorance that has apparently attached itself directly to your brain.

You call other smart people idiots, huh? Wow... Hmm, perhaps I should
killfile you, and you should killfile me... I think we would get along
better that way!

:^)

> and doesn't see that lock-free programming makes really sense very seldom.

They do tend to make a lot of sense in applications that are expected to
scale.

Chris Thomasson

Nov 20, 2006, 8:30:41 PM

"Alexander Terekhov" <tere...@web.de> wrote in message
news:45616268...@web.de...

This kind of trickery currently works:

http://groups.google.com/group/comp.programming.threads/msg/ca2f1af4552233df

I use a store to a common dummy location for the barriers. I think Nosille
might be a little confused on this issue...

Any thoughts?

;^)
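
The linked post is not reproduced here, so the following is only a rough sketch of the general idea of using a locked operation on a shared dummy word as a barrier, written with C++11 atomics that postdate this thread. The names `dummy`, `full_barrier`, `Result` and `store_buffer_trial` are hypothetical. The sketch assumes the barrier is an atomic exchange; on x86 compilers typically emit XCHG for that, which is implicitly locked.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
std::atomic<int> dummy{0};   // hypothetical shared dummy location

// Atomic exchange on the dummy word, used purely for its ordering effect.
inline void full_barrier() {
    dummy.exchange(0, std::memory_order_seq_cst);
}

struct Result { int r1, r2; };

// Classic store-buffering test: each thread stores to its own variable,
// executes the barrier, then loads the other thread's variable.
Result store_buffer_trial() {
    x.store(0);
    y.store(0);
    Result r{-1, -1};
    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        full_barrier();                               // store -> load barrier
        r.r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        full_barrier();
        r.r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
    return r;
}
```

With the barrier in both threads, the outcome `r1 == 0 && r2 == 0` should not occur: whichever exchange comes second in the dummy word's modification order synchronizes with the first, so that thread's load sees the other thread's store. Without the barrier, that outcome is permitted.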

Chris Thomasson

Nov 21, 2006, 4:15:37 AM

"Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message

news:45604796$0$27617$9b4e...@newsspool2.arcor-online.net...
> Chris Thomasson schrieb:
>
>> It does indeed show how I emulates x86 behavior.. So, I agree
>> with Alex and you in that it is a good reference for x86 memodel.
>
> Sure that the Itanic-manuals don't show memmodel-constraints with
> are tighter and not equal to current implementations of x86-CPUs?

At least it shows how it emulates its behavior; if it has tighter
constraints, so be it. It's a lot better than the Intel x86 documentation,
that's for sure...

Anyway, current x86 memmodel honors loadstore dependencies, and doesn't
honor storeload dependencies... That's the way I see it..

> And consider that Alexander recommended to buy an Itanic: for what?

Didn't you know? Man... Alexander has expensive warehouses that are filled
to the brim with various Itanium processors... I guess he wants to unload
them on you...

lol...

;^)

Sean Kelly

Nov 21, 2006, 9:22:31 AM

Chris Thomasson wrote:
>
> Anyway, current x86 memmodel honors loadstore dependencies, and doesn't
> honor storeload dependencies... That's the way I see it..

Isn't this consistent with the idea that x86 uses PC?

Sean

Chris Thomasson

Nov 22, 2006, 9:45:52 PM

"Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message

news:4564307a$0$18832$9b4e...@newsspool4.arcor-online.net...
> Chris Thomasson schrieb:
>
>> What the heck do you mean I am not in practice?
>
> What practical projects are you involved in?
> Your scope always seems very narrow.

Okay... Here is one:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/205dcaed77941352

I guess you should send your canned spam complaints to:

SUN,
Intel,
Professor Arthur Goldberg (NYU),
David Buksbaum SVP Development Manager of Systematic Trading, Citadel
Investment,
David Freireich [dav...@coresearchinc.com] www.coresearchinc.com

Here is another:

https://coolthreads.dev.java.net/

I've got more if you're interested...

Tell them how horrible I am, and how they are completely wrong about me...

Chris Thomasson

Nov 22, 2006, 9:49:19 PM

> I guess you should send your canned spam complaints to:

^^^^^^^^^^^^^^^^^^^

Sorry about that... I got a little irritated. I should not post when I am in
that state...

Piotr Wyderski

Nov 25, 2006, 5:00:35 AM

Elcaro Nosille wrote:

> I've been following Chris silly postings in c.p.t. He's simply not in
> practice
> and doesn't see that lock-free programming makes really sense very seldom.

Well, Chris is a bit biased, but he is right about the general view. :-)
I have to say I have very positive experience with lock-free
programming. Lock-free algorithms are quite fast and scale up
very well. They are, however, much more complex and hard
to maintain (i.e. write-only code), so the choice must be based
on a very careful analysis. I use lock-free techniques in practice
and I can't say it makes sense "very seldom", but I would say
that sticking to one paradigm is wrong -- in many cases the
lock-free approach doesn't make sense, but in other cases
there is no better choice, so feel free to pick up the best solution.
And don't forget that the lock-free and lock-based worlds
are not isolated, there is a wide gray zone between them, populated
by a very rich set of interesting almost lock-free algorithms.
In fact I like hybrid solutions very much, so the rule "be as lock-free
as possible, but not more" fits me best. :-)

Best regards
Piotr Wyderski