
[.NET] Volatile Fields


Cool Guy

Sep 16, 2005, 3:56:49 PM
See the example at

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/csspec/html/vclrfcsharpspec_10_4_3.asp
- or -
http://tinyurl.com/7gml4

Shouldn't Test.result be declared as volatile as well? My understanding is
that it should because otherwise the write to it in Thread2 might not be
visible to other threads, since the value may be written to a cache instead
of to main memory.
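For readers for whom the links have rotted: the spec example under discussion pairs a plain field (`result`) with a volatile flag (`finished`). Below is a rough Java analogue (an editor's sketch, not the original C# code; Java's `volatile` provides comparable acquire/release semantics, and the names `Test`, `result`, and `finished` mirror the original):

```java
// Java analogue of the C# spec example (a sketch; names mirror the original).
public class Test {
    static int result;                 // plain, NOT volatile
    static volatile boolean finished;  // volatile flag

    public static void main(String[] args) {
        new Thread(() -> {
            result = 143;      // ordinary write
            finished = true;   // volatile write: "release"
        }).start();

        while (!finished) { }  // volatile read: "acquire"; spin until set

        // The release/acquire pair makes the write to result visible here.
        System.out.println("result = " + result);
    }
}
```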

David Schwartz

Sep 16, 2005, 4:40:35 PM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:q6n7xt2m...@cool.guy.abc.xyz...

> Shouldn't Test.result be declared as volatile as well? My understanding is
> that it should because otherwise the write to it in Thread2 might not be
> visible to other threads, since the value may be written to a cache instead
> of to main memory.

It does not have to be on this platform because, as the page says:

A read of a volatile field is called a volatile read. A volatile read has
"acquire semantics"; that is, it is guaranteed to occur prior to any
references to memory that occur after it in the instruction sequence.

and:

A write of a volatile field is called a volatile write. A volatile write has
"release semantics"; that is, it is guaranteed to happen after any memory
references prior to the write instruction in the instruction sequence.

Note that this is .NET specific.

DS


David Hopwood

Sep 16, 2005, 5:46:34 PM
> Shouldn't Test.result be declared as volatile as well?

It doesn't need to be. Because 'finished' is volatile, the assignment
"finished = true;" causes previous writes by Thread2 (whether or not they
are to volatile variables) to be "performed" before the write to 'finished'.
This is what the page means by 'A volatile write has "release semantics"'.

In addition, the access to 'finished' in "if (finished) ..." guarantees
that writes performed before 'finished' became true are visible to the
Main thread. This is what the page means by 'A volatile read has "acquire
semantics"'.

The combination of these two barriers guarantees that the result is 143
(either on its own would not be enough).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 16, 2005, 8:28:11 PM

Oh, and reading between the lines:

# A conforming implementation is not required to provide a single total ordering of
# volatile writes as seen from all threads of execution.

=> "We want to be able to implement volatile accesses as ordinary accesses on x86."

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Cool Guy

Sep 17, 2005, 4:19:10 AM
David Schwartz <dav...@webmaster.com> wrote:

>> Shouldn't Test.result be declared as volatile as well? My understanding is
>> that it should because otherwise the write to it in Thread2 might not be
>> visible to other threads, since the value may be written to a cache instead
>> of to main memory.
>
> It does not have to be on this platform because, as the page says:

<snip>

See, this is where my understanding goes out of the window.

From reading that page it seems there's only one issue here: that
reads/writes can be re-ordered and thus just making *one* of those fields
volatile is sufficient.

But my understanding is that there are in fact *two* (separate) issues
here:

a. the above; and
b. that reads may come from a cache instead of from main memory (and
that writes may go to a cache instead of to main memory).

Cool Guy

Sep 17, 2005, 4:28:38 AM
Cool Guy <coo...@abc.xyz> wrote:

> b. that reads may come from a cache instead of from main memory (and
> that writes may go to a cache instead of to main memory).

...for each field that isn't volatile.

Cool Guy

Sep 17, 2005, 7:31:42 AM
David Hopwood <david.nosp...@blueyonder.co.uk> wrote:

> It doesn't need to be. Because 'finished' is volatile, the assignment
> "finished = true;" causes previous writes by Thread2 (whether or not they
> are to volatile variables) to be "performed" before the write to 'finished'.
> This is what the page means by 'A volatile write has "release semantics"'.

But couldn't *result* be written to a cache instead of to main memory?

> In addition, the access to 'finished' in "if (finished) ..." guarantees
> that writes performed before 'finished' became true are visible to the
> Main thread. This is what the page means by 'A volatile read has "acquire
> semantics"'.

But couldn't *result* be read from a cache instead of from main memory?

Joe Seigh

Sep 17, 2005, 7:52:58 AM


Cache has no effect on read/write ordering. It's the memory model that does.
You could have a system without cache and it would still have the same issues
with read/write ordering. So trying to imagine what is happening with a
hardware entity, which by definition you can't see, only serves to confuse
yourself.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Cool Guy

Sep 17, 2005, 7:56:14 AM
Joe Seigh <jsei...@xemaps.com> wrote:

> Cache has no effect on read/write ordering. It's the memory model that does.
> You could have a system without cache and it would still have the same issues
> with read/write ordering. So trying to imagine what is happening with a
> hardware entity, which by definition you can't see, only serves to confuse
> yourself.

My confusion arises from the fact that only *one* of those fields is
volatile in the example linked from the OP, and not *both* of them, which
leads me to believe that the one field that isn't volatile could be
read/written from/to the cache instead of from/to main memory and thus the
code isn't thread-safe.

Joe Seigh

Sep 17, 2005, 8:24:26 AM

If that was true then you could detect the presence of cache. But you
can't detect the presence of cache by definition so it must not be
true.

David Hopwood

Sep 17, 2005, 9:44:11 AM

The .NET memory model requires cache coherency (actually it doesn't mention
caches at all, which is the same thing as requiring them to be a transparent
optimization).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 17, 2005, 9:59:21 AM

The release and acquire barriers referred to on the example page are global
barriers that apply to all locations, regardless of whether they are cached
or accessed through volatile variables. (You could argue that this is
potentially inefficient, but current SMP systems are designed to support it,
at some cost in hardware complexity and scalability.)

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Alexander Terekhov

Sep 17, 2005, 10:18:15 AM

David Hopwood wrote:
[...]

> The release and acquire barriers referred to on the example page are global
> barriers that apply to all locations, regardless of whether they are cached
> or accessed through volatile variables. (You could argue that this is
> potentially inefficient, but current SMP systems are designed to support it,
> at some cost in hardware complexity and scalability.)

CELLs have tagged barriers (in addition to untagged Power memory ordering).

regards,
alexander.

David Hopwood

Sep 17, 2005, 10:21:39 AM
Joe Seigh wrote:
> Cool Guy wrote:
>> Joe Seigh <jsei...@xemaps.com> wrote:
>>
>>> Cache has no effect on read/write ordering. It's the memory model
>>> that does. You could have a system without cache and it would still
>>> have the same issues with read/write ordering. So trying to imagine
>>> what is happening with a hardware entity, which by definition you can't
>>> see, only serves to confuse yourself.
>>
>> My confusion arises from the fact that only *one* of those fields is
>> volatile in the example linked from the OP, and not *both* of them, which
>> leads me to believe that the one field that isn't volatile could be
>> read/written from/to the cache instead of from/to main memory and thus
>> the code isn't thread-safe.
>
> If that was true then you could detect the presence of cache. But you
> can't detect the presence of cache by definition so it must not be
> true.

You *can* detect the presence of cache (even though it is a transparent
optimization). Its presence affects which of the behaviours that are
allowed by the nondeterministic memory model can actually occur on a given
implementation. This is usually not an issue if you depend only on the
memory model. But note that you cannot force other software to depend only
on the memory model, and that can have security implications, e.g.
<http://www.daemonology.net/hyperthreading-considered-harmful/>.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 17, 2005, 10:23:51 AM
Cool Guy wrote:
> David Hopwood <david.nosp...@blueyonder.co.uk> wrote:
>
>>It doesn't need to be. Because 'finished' is volatile, the assignment
>>"finished = true;" causes previous writes by Thread2 (whether or not they
>>are to volatile variables) to be "performed" before the write to 'finished'.
>>This is what the page means by 'A volatile write has "release semantics"'.
>
> But couldn't *result* be written to a cache instead of to main memory?

In this case cache is a transparent optimization; it plays no part in the
memory model. At the implementation level, 'result' may well be written to
a cache, but typically the cache will be "snooped" by other processors so
that they can see the write.

>>In addition, the access to 'finished' in "if (finished) ..." guarantees
>>that writes performed before 'finished' became true are visible to the
>>Main thread. This is what the page means by 'A volatile read has "acquire
>>semantics"'.
>
> But couldn't *result* be read from a cache instead of from main memory?

Ditto.

Note that in principle, cache does not *necessarily* have to be transparent;
a memory model could include some abstraction that corresponds to a per-thread
cache (the original Java memory model did something like this). But .NET's
model doesn't.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 17, 2005, 10:31:51 AM

Doesn't affect my point. Barriers implied by volatile accesses in .NET are
global, regardless of platform. I was careful not to say that all memory
barriers are necessarily global.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 17, 2005, 10:38:12 AM

Well, there is a performance effect which would have to be there otherwise
there would be no reason to have cache then. I deliberately left that out
to simplify things so as to not confuse the OP any further. He seems confused
enough. But now that you've brought it up, you are now officially responsible
for unconfusing the OP. Have fun. :)

David Hopwood

Sep 17, 2005, 10:55:46 AM
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>> Cool Guy wrote:
>>>
>>>> My confusion arises from the fact that only *one* of those fields is
>>>> volatile in the example linked from the OP, and not *both* of them,
>>>> which
>>>> leads me to believe that the one field that isn't volatile could be
>>>> read/written from/to the cache instead of from/to main memory and thus
>>>> the code isn't thread-safe.
>>>
>>> If that was true then you could detect the presence of cache. But you
>>> can't detect the presence of cache by definition so it must not be
>>> true.
>>
>> You *can* detect the presence of cache (even though it is a transparent
>> optimization). Its presence affects which of the behaviours that are
>> allowed by the nondeterministic memory model can actually occur on a
>> given implementation. This is usually not an issue if you depend only on
>> the memory model. But note that you cannot force other software to depend
>> only on the memory model, and that can have security implications, e.g.
>> <http://www.daemonology.net/hyperthreading-considered-harmful/>.
>
> Well, there is a performance effect which would have to be there otherwise
> there would be no reason to have cache then.

It's misleading to call timing side channels just a performance effect.
I think most PGP users or SSL server operators would raise an eyebrow if you
told them that compromise of their private key was just a performance effect,
for instance.

> I deliberately left that out to simplify things so as to not confuse the
> OP any further. He seems confused enough. But now that you've brought it
> up, you are now officially responsible for unconfusing the OP. Have fun. :)

The issue is that some optimizations can be harmless from the point of view
of correctness of a program that only relies on a particular (nondeterministic)
specification, but can still leak information that results in a security flaw.
That isn't a particularly confusing point IMHO (although it deserves to be
more widely appreciated).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

Sep 18, 2005, 2:27:21 AM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:1snaf8b7...@cool.guy.abc.xyz...

> But my understanding is that there are in fact *two* (separate) issues
> here:
>
> a. the above; and
> b. that reads may come from a cache instead of from main memory (and
> that writes may go to a cache instead of to main memory).

You are programming in .NET; there is no such thing as a cache. You are
programming for a virtual machine, not a physical machine. The documentation
says that volatile reads and writes have those semantics, and so the
compiler will do whatever magic it takes to make them work. In this
case, the magic is actually done by the processors, which have cache
coherency implemented in hardware. Google for 'MESI' and 'cache coherency'
if you want to know how it works.

DS


David Schwartz

Sep 18, 2005, 2:28:53 AM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:9bux5adfnj7d$.dlg@cool.guy.abc.xyz...

What do caches have to do with thread-safety? It's read/write *ordering*
that creates thread safety problems, not caches.

DS


David Schwartz

Sep 18, 2005, 2:36:49 AM

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:S7WWe.74800$2n6....@fe3.news.blueyonder.co.uk...

> It's misleading to call timing side channels just a performance effect.
> I think most PGP users or SSL server operators would raise an eyebrow if you
> told them that compromise of their private key was just a performance effect,
> for instance.

If they rely upon isolation that's not guaranteed, they get what they
deserve.

DS


Cool Guy

Sep 18, 2005, 5:41:26 AM
David Schwartz <dav...@webmaster.com> wrote:

> What do caches have to do with thread-safety? It's read/write *ordering*
> that creates thread safety problems, not caches.

I think I understand it more now -- when you talk of read/write re-ordering
you're talking about when reads/writes are effectively made to/from **main
memory**, right?

Before I was thinking it was about *any* and *all* reads/writes being moved
around.

David Schwartz

Sep 18, 2005, 7:38:59 AM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:xq3fbvfy...@cool.guy.abc.xyz...

> David Schwartz <dav...@webmaster.com> wrote:

>> What do caches have to do with thread-safety? It's read/write *ordering*
>> that creates thread safety problems, not caches.

> I think I understand it more now -- when you talk of read/write re-ordering
> you're talking about when reads/writes are effectively made to/from **main
> memory**, right?

No. I'm talking about when they're made by the CPU.

> Before I was thinking it was about *any* and *all* reads/writes being moved
> around.

There are three things that cause problems for multi-threaded programs
on typical modern SMP machines:

1) Instruction re-ordering. This is where the compiler or the processor
internally performs instructions in a different order from that in the
high-level code.

2) Posted writes. This is where the CPU does not output a write to its
cache immediately but instead puts it in a temporary holding place. (Not the
L2 cache, but a posted write buffer.)

3) Speculative fetches. This is where the CPU reads data before it needs
it and keeps the data for a later instruction.

The L1 and L2 caches don't even enter into the picture because the cache
coherency hardware makes them invisible to threads. They are, however, an
issue for DMA from peripherals because the cache coherency is only between
the processors.
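The three effects above can be made concrete with the classic store-buffer litmus test. The sketch below is an editor's illustration in Java, not part of the original post; it relies on the fact that Java's `volatile` forbids the `r1 == r2 == 0` outcome, whereas with plain fields the posted-write buffers described above can and do produce it on real hardware.

```java
import java.util.concurrent.CyclicBarrier;

public class StoreBufferLitmus {
    // Declared volatile: the Java memory model then forbids the
    // r1 == r2 == 0 outcome. Drop 'volatile' and posted writes can
    // let both threads read the other's stale 0.
    static volatile int x, y;
    static int r1, r2;   // read only after join(), so plain fields are fine

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 5_000; i++) {
            x = 0; y = 0;
            CyclicBarrier start = new CyclicBarrier(2);
            Thread a = new Thread(() -> { wait(start); x = 1; r1 = y; });
            Thread b = new Thread(() -> { wait(start); y = 1; r2 = x; });
            a.start(); b.start();
            a.join(); b.join();
            if (r1 == 0 && r2 == 0)
                throw new AssertionError("store-buffer reordering observed");
        }
        System.out.println("r1 == r2 == 0 never observed");
    }

    private static void wait(CyclicBarrier b) {
        try { b.await(); } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```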

DS


David Hopwood

Sep 18, 2005, 8:42:00 AM
Cool Guy wrote:
> David Schwartz <dav...@webmaster.com> wrote:
>
>> What do caches have to do with thread-safety? It's read/write *ordering*
>>that creates thread safety problems, not caches.
>
> I think I understand it more now -- when you talk of read/write re-ordering
> you're talking about when reads/writes are effectively made to/from **main
> memory**, right?

No. We're talking about when reads/writes are "performed" according to the
memory model. This may or may not correspond to when they are read/written
to "main memory" in any particular implementation. You can't see when the
latter happens (except in some cases when dealing with I/O devices).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Cool Guy

Sep 18, 2005, 10:43:40 AM
David Schwartz <dav...@webmaster.com> wrote:

> There are three things that cause problems for multi-threaded programs
> on typical modern SMP machines:
>
> 1) Instruction re-ordering. This is where the compiler or the processor
> internally performs instructions in a different order from that in the
> high-level code.

So, with this issue, the reads/writes - when they're made - *will* be made
in a way that other threads can see them -- but the only problem is that
they might happen too late or too early, right?

> 2) Posted writes. This is where the CPU does not output a write to its
> cache immediately but instead puts it in a temporary holding place. (Not the
> L2 cache, but a posted write buffer.)
>
> 3) Speculative fetches. This is where the CPU reads data before it needs
> it and keeps the data for a later instruction.

So I assume memory barriers solve all three problems?

David Schwartz

Sep 18, 2005, 7:14:33 PM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:vqwnoi2s...@cool.guy.abc.xyz...

> David Schwartz <dav...@webmaster.com> wrote:
>
>> There are three things that cause problems for multi-threaded programs
>> on typical modern SMP machines:
>>
>> 1) Instruction re-ordering. This is where the compiler or the processor
>> internally performs instructions in a different order from that in the
>> high-level code.

> So, with this issue, the reads/writes - when they're made - *will* be made
> in a way that other threads can see them -- but the only problem is that
> they might happen too late or too early, right?

Right.

>> 2) Posted writes. This is where the CPU does not output a write to its
>> cache immediately but instead puts it in a temporary holding place. (Not
>> the L2 cache, but a posted write buffer.)
>>
>> 3) Speculative fetches. This is where the CPU reads data before it needs
>> it and keeps the data for a later instruction.
>
> So I assume memory barriers solve all three problems?

Yes.

DS


David Hopwood

Sep 18, 2005, 8:33:54 PM
Cool Guy wrote:
> David Schwartz <dav...@webmaster.com> wrote:
>
>> There are three things that cause problems for multi-threaded programs
>>on typical modern SMP machines:
>>
>> 1) Instruction re-ordering. This is where the compiler or the processor
>>internally performs instructions in a different order from that in the
>>high-level code.
>
> So, with this issue, the reads/writes - when they're made - *will* be made
> in a way that other threads can see them -- but the only problem is that
> they might happen too late or too early, right?

In practice, yes. Documentation or formal descriptions of memory models sometimes
don't make it clear whether, in the absence of memory barriers, writes are still
guaranteed to be performed with respect to all other threads after a finite delay.
But AFAIK, all implemented architectures do have this finite delay property.

>> 2) Posted writes. This is where the CPU does not output a write to its
>>cache immediately but instead puts it in a temporary holding place. (Not the
>>L2 cache, but a posted write buffer.)
>>
>> 3) Speculative fetches. This is where the CPU reads data before it needs
>>it and keeps the data for a later instruction.
>
> So I assume memory barriers solve all three problems?

For almost all purposes you can assume that they do.

Strictly speaking, though, adding memory barriers is not always sufficient to
simulate sequentially consistent behaviour. For instance, you might think that in
this example:

Start with X == Y == 0.

Processor 1: X = 1; Y = 1;
Processor 2: if (Y == 1) { Z = 1; }
Processor 3: if (Z == 1) { assert(X == 1); }

the assert cannot fail (because Z = 1 can only happen after Y = 1, which can only
happen after X = 1). But in some memory models, including the model used by x86
(as far as it's possible to tell from the documentation), and probably also the
one used by .NET, it *is* possible for the assert to fail. This can happen no
matter what memory barriers are added, and even if X and Y are volatile in the
case of .NET.
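The shape of this example, rendered in Java (an editor's sketch, not from the thread): Java's post-JSR-133 model, unlike the .NET model quoted earlier, does promise a single total order over volatile writes, so with volatile fields the analogous assert can never fire there.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CausalityTest {
    // In Java, volatile accesses are sequentially consistent with each
    // other, so P3 can never see z == 1 while still seeing y == 0.
    static volatile int y, z;

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean assertFailed = new AtomicBoolean(false);
        for (int i = 0; i < 5_000; i++) {
            y = 0; z = 0;
            Thread p1 = new Thread(() -> y = 1);
            Thread p2 = new Thread(() -> { if (y == 1) z = 1; });
            Thread p3 = new Thread(() -> {
                if (z == 1 && y != 1) assertFailed.set(true);
            });
            p1.start(); p2.start(); p3.start();
            p1.join(); p2.join(); p3.join();
        }
        System.out.println(assertFailed.get() ? "assert fired"
                                              : "assert never fired");
    }
}
```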

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 19, 2005, 10:35:17 AM
Oops, wrong example. This is the example I meant:

Strictly speaking, though, adding memory barriers is not always sufficient to
simulate sequentially consistent behaviour. For instance, you might think that in
this example:

Start with Y == Z == 0.

Processor 1: Y = 1;
Processor 2: if (Y == 1) { Z = 1; }
Processor 3: if (Z == 1) { assert(Y == 1); }

the assert cannot fail (because Z = 1 can only happen after Y = 1). But in some
memory models, including the model used by x86 (as far as it's possible to tell
from the documentation), and probably also the one used by .NET, it *is* possible
for the assert to fail. This can happen no matter what memory barriers are added,
and even if Y and Z are volatile in the case of .NET.

Cool Guy

Sep 19, 2005, 10:38:38 AM
David Hopwood <david.nosp...@blueyonder.co.uk> wrote:

> For instance, you might think that in
> this example:
>
> Start with Y == Z == 0.
>
> Processor 1: Y = 1;
> Processor 2: if (Y == 1) { Z = 1; }
> Processor 3: if (Z == 1) { assert(Y == 1); }
>
> the assert cannot fail (because Z = 1 can only happen after Y = 1).
> But in some memory models, including the model used by x86
> (as far as it's possible to tell from the documentation), and probably also the
> one used by .NET, it *is* possible for the assert to fail.

How come?

David Hopwood

Sep 19, 2005, 11:16:24 AM

The write to Y is not performed as a single atomic event; it has to be modelled
as two events: "Y = 1 is performed with respect to processor 2" and "Y = 1 is
performed with respect to processor 3". So it can happen that processor 2 sees
Y == 1 before writing Z = 1, but processor 3 does not see Y == 1 until after
it sees Z == 1.

If it helps, think of each write as causing separate asynchronous messages to
be sent from the writing processor to each other processor. The memory model
puts some constraints on the ordering of these messages; for example, in
"processor consistency" models, if processor A makes several writes, they will
be seen in the same order by some other processor B. This corresponds to
(one-to-one) FIFO ordering in a message passing system [CMT96]. However, in
the example above this constraint doesn't help, because no processor makes
more than one write.

Here's a space-time diagram showing an execution in which the assert in the
above example fails. ('*' shows message reception. Time goes from left to
right. View in a fixed-width font.)


P1 --(Y = 1)-----------------------------------------------
      \\___________________________________
        \                                  \
P2 -------*--(Y == 1)--(Z = 1)-------------- \ ------------
                          \                   \
                           \                   \
P3 -------------------------*-(Z == 1)-(Y == 0)-*-(Y == 1)-

[CMT96] Bernadette Charron-Bost, Friedemann Mattern, Gerard Tel,
"Synchronous, asynchronous, and causally ordered communication",
<http://www.vs.inf.ethz.ch/publ/papers/mattern-dc-1996.pdf>

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Cool Guy

Sep 19, 2005, 12:55:39 PM
David Hopwood <david.nosp...@blueyonder.co.uk> wrote:

> The write to Y is not performed as a single atomic event; it has to be modelled
> as two events: "Y = 1 is performed with respect to processor 2" and "Y = 1 is
> performed with respect to processor 3". So it can happen that processor 2 sees
> Y == 1 before writing Z = 1, but processor 3 does not see Y == 1 until after
> it sees Z == 1.

<snip>

So it's impossible to make such code thread-safe, then?

Chris Thomasson

Sep 19, 2005, 11:33:36 AM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:1cgq3c5c...@cool.guy.abc.xyz...

This should do it for x86:

Processor 1: Y = 1; L = 1;
Processor 2: if (L == 1) { Z = 1; L = 2; }
Processor 3: if (L == 2) { assert(Y == 1); }


Chris Thomasson

Sep 19, 2005, 11:35:53 AM
> This should do it for x86:
>
> Processor 1: Y = 1; L = 1;
> Processor 2: if (L == 1) { Z = 1; L = 2; }
> Processor 3: if (L == 2) { assert(Y == 1); }

actually the following should never assert:

assert( Y == 1 && Z == 1 );

Any thoughts?


Sean Kelly

Sep 19, 2005, 7:42:36 PM
David Hopwood wrote:
> Oops, wrong example. This is the example I meant

For what it's worth, your previous example showed the same problem
though the extra variable may have been confusing.


Sean

Cool Guy

Sep 19, 2005, 1:47:49 PM
Chris Thomasson <_no_spam_cristom@no_spam_comcast._net> wrote:

>> So it's impossible to make such code thread-safe, then?
>
> This should do it for x86:
>
> Processor 1: Y = 1; L = 1;
> Processor 2: if (L == 1) { Z = 1; L = 2; }
> Processor 3: if (L == 2) { assert(Y == 1); }

So this is only when three processors are involved, right?

Sean Kelly

Sep 19, 2005, 7:49:52 PM

Simply by following the (sparsely) documented x86 memory model? Yes.
Though Alexander suggested a method that should work on existing CPUs:
replace load instructions with CMPXCHG. This relies on a comment in
the docs for CMPXCHG that says a write is always performed when this
instruction is executed, even if the value is not changed as a result
of the operation. Hardware folks may argue that this is an
implementation detail however (as Andy Glew did), so future x86 CPUs
may not guarantee the same behavior.


Sean

Sean Kelly

Sep 19, 2005, 7:51:16 PM

More than two processors, yes.


Sean

David Schwartz

Sep 19, 2005, 2:36:04 PM

"Cool Guy" <coo...@abc.xyz> wrote in message
news:1cgq3c5c...@cool.guy.abc.xyz...

> David Hopwood <david.nosp...@blueyonder.co.uk> wrote:

With memory barriers, yeah. Just use mutexes or other *appropriate*
synchronization mechanisms. ;)

DS


Alexander Terekhov

Sep 19, 2005, 2:40:27 PM

Illusion of remote write atomicity can be achieved by turning all reads
into dummy RMWs. The problem is that even Intel's own architects don't
seem to know how to do a dummy RMW that would be guaranteed not to
optimize out a write part (so to speak) of a dummy RMW on their
existing and future PC (x86/WB) processors.

regards,
alexander.

Alexander Terekhov

Sep 19, 2005, 2:48:40 PM

David Hopwood

Sep 19, 2005, 3:08:55 PM

Application code should typically be written to use higher-level abstractions
such as locks, message passing, or transactional memory. Examples like this
one usually only arise when trying to implement such abstractions. It is quite
possible to work around problems like the one in this example, if needed. It just
may not be sufficient to use memory barriers; sometimes you need interlocked
instructions or equivalent.


(Recently, some people have been promoting the use of lock-free data structures
for application programming. That's a bad idea, because it requires an
understanding of memory models that most application programmers don't have.
Nor would it be an efficient use of those programmers' time to gain sufficient
understanding to be able to maintain or fix libraries of lock-free structures.

If abstractions that need to be implemented in terms of a memory model are
provided by an OS, thread library, or language implementation, OTOH, then
application programmers are not expected to need to understand, maintain, or
fix them; that is explicitly someone else's problem.)
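As a concrete rendering of this advice (an editor's sketch in Java, not from the post): handing the value over through a higher-level primitive such as a blocking queue removes any need to reason about volatile fields or barriers at the application level.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandOff {
    public static void main(String[] args) throws InterruptedException {
        // The queue's internal synchronization publishes the value safely;
        // no field here needs to be volatile.
        BlockingQueue<Integer> results = new ArrayBlockingQueue<>(1);

        new Thread(() -> {
            int result = 143;    // compute the result...
            results.add(result); // ...and hand it off
        }).start();

        // take() blocks until the producer's add(); everything the producer
        // wrote before add() is visible here (happens-before via the queue).
        System.out.println("result = " + results.take());
    }
}
```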

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 19, 2005, 3:17:06 PM

Right.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Sep 19, 2005, 2:20:21 PM
> (Recently, some people have been promoting the use of lock-free data
> structures for application programming. That's a bad idea,

I am not really promoting the direct use of raw lock-free structures, but
rather an extremely simple high-level API wrapper around them. I figured out
how to provide a highly portable lock-free method for tracking references to
objects. It uses POSIX compiler reorder rules wrt mutexs. Basically, if you
have a basic working knowledge of POSIX mutexs, you can use the VZOOM
algorithm on any OS that has them.


> because it requires an understanding of memory models that most application
> programmers don't have. Nor would it be an efficient use of those
> programmers' time to gain sufficient understanding to be able to maintain
> or fix libraries of lock-free structures.

They need to study up on the reader/writer problem. IMHO, that would not be a
waste of time.


Joe Seigh

Sep 19, 2005, 9:22:04 PM

AFAICT, Alexander seems to think the x86 memory model is just PC and
*all* of the other stuff in the x86 docs doesn't count. Seems a bit
arbitrary and selective to me.

If you want to do global memory barriers just using PC and nothing else,
you can. Just use a common memory location to "sync" everything with. E.g.

processor 1:
store fence; // "fence" is common global location
store x; // store some value

processor 2:
load x; // load some value
load fence; //


If processor 2 sees the store into x by processor 1 it will see the store
into fence. Since "fence" is used as the common memory barrier, you get
transitivity of memory visibility. No CMPXCHG required.

This only works as long as PC is part of x86 memory model.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

David Hopwood

Sep 19, 2005, 9:29:35 PM
Sean Kelly wrote:
> David Hopwood wrote:
>
>>Oops, wrong example. This is the example I meant
>
> For what it's worth, your previous example showed the same problem

Actually, it doesn't.

Start with X == Y == Z == 0.

Processor 1: X = 1; Y = 1;
Processor 2: if (Y == 1) { Z = 1; }
Processor 3: if (Z == 1) { assert(X == 1); }

If it were possible, the space-time diagram for the execution that
causes an assertion failure would be:


P1 --(X = 1)--(Y = 1)------------------------------------------------
      \\________\ ___________________________________
       \         \                                   \
P2 -------*---------*-(Y == 1)--(Z = 1)-------------- \ ------------
          (a)                      \                   \
                                    \                   \
P3 ----------------------------------*-(Z == 1)-(X == 0)-*-(X == 1)-
                                                          (b)

But this execution can't occur in a processor consistent model:

<http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>

# Definition 1: Performing a Memory Request
#
# A load by Pi is considered performed with respect to Pk at a point in
# time when the issuing of a store to the same address by Pk cannot affect
# the value returned by the load. A store by Pi is considered performed
# with respect to Pk at a point in time when an issued load to the same
# address by Pk returns the value defined by this store (or a subsequent
# store to the same location). An access is performed when it is performed
# with respect to all processors.
[...]
# Condition 1: Conditions for Processor Consistency
#
# (A) before a load is allowed to perform with respect to any other
# processor, all previous load accesses must be performed, and
# (B) before a store is allowed to perform with respect to any other
# processor, all previous accesses (loads and stores) must be performed.

In particular, (B) implies that before Y = 1 is allowed to perform with
respect to P2, X = 1 must have performed with respect to P3.

(I.e. there would have to be a way to draw the space-time diagram so that
(b) is to the left of (a) without changing the causal relationships, which
can't be done.)


If we go back to the second example and try to use similar reasoning:

P1 --(Y = 1)-----------------------------------------------
       \\___________________________________
         \                                  \
P2 -------*--(Y == 1)--(Z = 1)-------------- \ ------------
         (a')             \                   \
                           \                   \
P3 -------------------------*-(Z == 1)-(Y == 0)-*-(Y == 1)-
                                               (b')

in this case there is no rule that requires (b') to be before (a').

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 19, 2005, 9:57:43 PM9/19/05
to
Joe Seigh wrote:
> Sean Kelly wrote:
>> Cool Guy wrote:
[...]

>>> So it's impossible to make such code thread-safe, then?
>>
>> Simply by following the (sparsely) documented x86 memory model? Yes.
>> Though Alexander suggested a method that should work on existing CPUs:
>> replace load instructions with CMPXCHG. This relies on a comment in
>> the docs for CMPXCHG that says a write is always performed when this
>> instruction is executed, even if the value is not changed as a result
>> of the operation. Hardware folks may argue that this is an
>> implementation detail however (as Andy Glew did), so future x86 CPUs
>> may not guarantee the same behavior.
>
> AFAICT, Alexander seems to think the x86 memory model is just PC and
> *all* of the other stuff in the x86 docs doesn't count.

*Most* of the other stuff in the x86 docs is about non-WB memory types,
or about what is visible on the system bus, as opposed to what is visible
to processors.

> Seems a bit arbitrary and selective to me.

<http://groups.google.de/group/comp.arch/msg/7200ec152c8cca0c>

Andy Glew wrote:
# [...] the above can be briefly stated: WB memory is processor consistent,
# type II.

> If you want to do global memory barriers just using PC and nothing else,
> you can. Just use a common memory location to "sync" everything with.

If you don't care about lousy performance due to cache line pinging.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 19, 2005, 10:17:19 PM9/19/05
to
David Hopwood wrote:

> Joe Seigh wrote:
>>
>>AFAICT, Alexander seems to think the x86 memory model is just PC and
>>*all* of the other stuff in the x86 docs doesn't count.
>
>
> *Most* of the other stuff in the x86 docs is about non-WB memory types,
> or about what is visible on the system bus, as opposed to what is visible
> to processors.
>
"In a single-processor system for memory regions defined as write-back
cacheable, the following ordering rules apply:"


>
>>Seems a bit arbitrary and selective to me.
>

Alexander has company here.

>
>
>>If you want to do global memory barriers just using PC and nothing else,
>>you can. Just use a common memory location to "sync" everything with.
>
>
> If you don't care about lousy performance due to cache line pinging.
>

Lousy performance compared to what?

Chris Thomasson

Sep 19, 2005, 9:06:03 PM9/19/05
to
> If you want to do global memory barriers just using PC and nothing else,
> you can. Just use a common memory location to "sync" everything with.
> E.g.
[...]

Yup. This works in x86:

Processor 1: Y = 1; L = 1;
Processor 2: if (L == 1) { Z = 1; L = 2; }

Processor 3: if (L == 2) { assert( Z == 1 && Y == 1 ); }

http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en

L is a lightweight memory barrier.


Joe Seigh

Sep 20, 2005, 1:03:35 AM9/20/05
to
Joe Seigh wrote:
>
> If you want to do global memory barriers just using PC and nothing else,
> you can. Just use a common memory location to "sync" everything with.
> E.g.
>
> processor 1:
> store fence; // "fence" is common global location
> store x; // store some value
>
> processor 2:
> load x; // load some value
> load fence; //
>
>
> If processor 2 sees the store into x by processor 1 it will see the store
> into fence. Since "fence" is used as the common memory barrier, you get
> transitivity of memory visibility. No CMPXCHG required.

Actually, you may not need that "load fence", just the "store fence",
though you may need an LFENCE depending on what you think LFENCE
actually does.

The memory visibility from interlocked instructions is slightly
different so you need to be careful mixing them with the above.
By themselves (i.e. locks) you don't need a global memory barrier.

The more general pattern is that you need to store into the same
"release store location" used by another processor to make
its stores transferable (transitive) by your store. You
can have multiple release store locations, but you need to
chain them together correctly to get proper transitivity.

Something like that anyway.

Alexander Terekhov

Sep 20, 2005, 4:05:22 AM9/20/05
to

Sean Kelly wrote:
[...]

> of the operation. Hardware folks may argue that this is an
> implementation detail however (as Andy Glew did), so future x86 CPUs
> may not guarantee the same behavior.

Some processor implementation details may change at the mercy of things
like a BIOS update or even an OS install or patching. Hence, according
to Glew's stance, you can't even count on it with respect to existing
(not only future) processors.

regards,
alexander.

Joe Seigh

Sep 20, 2005, 7:13:23 AM9/20/05
to

While you're at it, somebody needs to explain what Intel means by this
especially with regard to the fence instructions being described as
serializing instructions.


"It is recommended that software written to run on Pentium 4, Intel Xeon, and P6 family processors
assume the processor-ordering model or a weaker memory-ordering model. The Pentium 4,
Intel Xeon, and P6 family processors do not implement a strong memory-ordering model, except
when using the UC memory type. Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee that future processors will
support this model. To make software portable to future processors, it is recommended that operating
systems provide critical region and resource control constructs and API’s (application
program interfaces) based on I/O, locking, and/or serializing instructions be used to synchronize
access to shared areas of memory in multiple-processor systems. Also, software should not
depend on processor ordering in situations where the system hardware does not support this
memory-ordering model."

Joe Seigh

Sep 20, 2005, 7:30:49 AM9/20/05
to
You don't need the load on the "memory barrier".

W = X = Y = Z = 0;

Processor 1: W = 1; L = 1; X = 1
Processor 2: if (X == 1) { Y = 1; L = 1; Z = 1; }
Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }

W and Y are stored by two different processors and are seen in order.

David Hopwood

Sep 20, 2005, 8:59:12 AM9/20/05
to
Joe Seigh wrote:
> Chris Thomasson wrote:
>
>> [...] This works in x86:

>>
>> Processor 1: Y = 1; L = 1;
>> Processor 2: if (L == 1) { Z = 1; L = 2; }
>> Processor 3: if (L == 2) { assert( Z == 1 && Y == 1 ); }
>>
>> http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en
>>
>> L is a lightweight memory barrier.
>
> You don't need the load on the "memory barrier".
>
> W = X = Y = Z = 0;
>
> Processor 1: W = 1; L = 1; X = 1
> Processor 2: if (X == 1) { Y = 1; L = 1; Z = 1; }
> Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }
>
> W and Y are stored by two different processors and are seen in order.

What are the stores to L for?

W = X = Y = Z = 0;

Processor 1: W = 1; X = 1;
Processor 2: if (X == 1) { Y = 1; Z = 1; }
Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }

cannot fail the assert (for processor consistency / x86 WB / .NET with volatiles).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 20, 2005, 9:54:49 AM9/20/05
to
David Hopwood wrote:
> Joe Seigh wrote:
>>
>>You don't need the load on the "memory barrier".
>>
>>W = X = Y = Z = 0;
>>
>>Processor 1: W = 1; L = 1; X = 1
>>Processor 2: if (X == 1) { Y = 1; L = 1; Z = 1; }
>>Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }
>>
>>W and Y are stored by two different processors and are seen in order.
>
>
> What are the stores to L for?
>
> W = X = Y = Z = 0;
>
> Processor 1: W = 1; X = 1;
> Processor 2: if (X == 1) { Y = 1; Z = 1; }
> Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }
>
> cannot fail the assert (for processor consistency / x86 WB / .NET with volatiles).
>

W and Y are stores by two different processors. Under processor consistency you
aren't guaranteed to see them in order. And I'm talking about processor
consistency w/o .NET volatiles.

David Hopwood

Sep 20, 2005, 10:31:22 AM9/20/05
to
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> You don't need the load on the "memory barrier".
>>>
>>> W = X = Y = Z = 0;
>>>
>>> Processor 1: W = 1; L = 1; X = 1
>>> Processor 2: if (X == 1) { Y = 1; L = 1; Z = 1; }
>>> Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }
>>>
>>> W and Y are stored by two different processors and are seen in order.
>>
>> What are the stores to L for?
>>
>> W = X = Y = Z = 0;
>>
>> Processor 1: W = 1; X = 1;
>> Processor 2: if (X == 1) { Y = 1; Z = 1; }
>> Processor 3: if (Z == 1) { assert (W == 1 && Y == 1); }
>>
>> cannot fail the assert (for processor consistency / x86 WB / .NET with
>> volatiles).
>
> W and Y are stores by two different processors. Under processor
> consistency you aren't guaranteed to see them in order.

<sigh>.

<http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>:

# Condition 1: Conditions for Processor Consistency
#
# (A) before a load is allowed to perform with respect to any other
# processor, all previous load accesses must be performed, and
# (B) before a store is allowed to perform with respect to any other
# processor, all previous accesses (loads and stores) must be performed.

Before X = 1 performs wrt P2, W = 1 must have performed wrt P3.
Before Z = 1 performs wrt P3, Y = 1 must have performed wrt P3.

> And I'm talking about processor consistency w/o .NET volatiles.

As far as I can tell, the rules for .NET volatiles are just processor
consistency; nothing more, nothing less. Makes sense given Microsoft's
emphasis on x86[-64].

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Sean Kelly

Sep 20, 2005, 10:49:18 AM9/20/05
to
David Hopwood wrote:
>
> # (B) before a store is allowed to perform with respect to any other
> # processor, all previous accesses (loads and stores) must be performed.

You're right of course :p I'd forgotten about this bit.

Sean

Joe Seigh

Sep 20, 2005, 12:06:01 PM9/20/05
to
David Hopwood wrote:
> Joe Seigh wrote:
>>W and Y are stores by two different processors. Under processor
>>consistency you aren't guaranteed to see them in order.
>
>
> <sigh>.
>
> <http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>:

That's not the official x86 memory model. Andy Glew didn't make
clear whether he was talking about the x86 memory model as (unofficially)
specified or as implemented. And he was not speaking *for* Intel.
You'd look very foolish trying to file a bug report to Intel for
behavior that was allowed by the documented memory model claiming
Andy Glew told you this was how it was supposed to work in comp.arch.

David Hopwood

Sep 20, 2005, 1:05:48 PM9/20/05
to
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> W and Y are stores by two different processors. Under processor
>>> consistency you aren't guaranteed to see them in order.
>>
>> <sigh>.
>>
>> <http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>:
>
> That's not the official x86 memory model.

Please pay attention to what *you* wrote:

>>> W and Y are stores by two different processors. Under
>>> processor consistency you aren't guaranteed to see them in order.
    ^^^^^^^^^^^^^^^^^^^^^

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 20, 2005, 1:23:20 PM9/20/05
to
Leaving aside for the moment that you had made an incorrect statement
specifically about processor consistency...

Joe Seigh wrote:
> David Hopwood wrote:
>>

>> <http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>:
>
> That's not the official x86 memory model. Andy Glew didn't make
> clear whether he was talking about the x86 memory model as (unofficially)
> specified or as implemented. And he was not speaking *for* Intel.

We all seem to agree that the documentation is inadequate. So it's
quite reasonable to ask for clarification from someone who should know.

> You'd look very foolish trying to file a bug report to Intel for
> behavior that was allowed by the documented memory model claiming
> Andy Glew told you this was how it was supposed to work in comp.arch.

Since it's not clear what the documentation does allow, I wouldn't feel
foolish at all.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 20, 2005, 2:39:39 PM9/20/05
to
David Hopwood wrote:
> Joe Seigh wrote:
>
>>David Hopwood wrote:
>>
>>>Joe Seigh wrote:
>>>
>>>
>>>>W and Y are stores by two different processors. Under processor
>>>>consistency you aren't guaranteed to see them in order.
>>>
>>><sigh>.
>>>
>>><http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>:
>>
>>That's not the official x86 memory model.
>
>
> Please pay attention to what *you* wrote:
>
>
>>>>W and Y are stores by two different processors. Under
>>>>processor consistency you aren't guaranteed to see them in order.
>
> ^^^^^^^^^^^^^^^^^^^^^
>

Well, I was going by what I thought was the earlier definition for
processor consistency from the Intel docs and from an earlier Andy Glew
posting http://groups.google.de/group/comp.arch/msg/96ec4a9fb75389a2
It's not that clear and concise and probably not worth arguing about
at this point. I know what the processor consistency definition in
the above paper is. I guess I'll just wait for a clearer statement
from Intel on what their memory model actually is rather than argue
about it any further.

David Hopwood

Sep 21, 2005, 9:22:05 AM9/21/05
to
David Schwartz wrote:
> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>
>>It's misleading to call timing side channels just a performance effect.
>>I think most PGP users or SSL server operators would raise an eyebrow if
>>you told them that compromise of their private key was just a performance
>>effect, for instance.
>
> If they rely upon isolation that's not guaranteed, they get what they
> deserve.

Not really:

- at the time the software was written, no implementation of x86 (or
other platforms) used caches shared between processes executing in
parallel.

So it is just as reasonable to attribute the cause of the problem to
{HT with shared caches} breaking an implicit security requirement, as
to the software relying on isolation that was not guaranteed.

- the software in question cannot do anything about caches being
shared between mutually untrusting processes. Only the operating
system can fix that. The software can try, to some extent, to
reduce the amount of information leaked via the cache, but it
is impossible to eliminate this leakage entirely.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Schwartz

Sep 21, 2005, 2:57:51 PM9/21/05
to

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:18dYe.4052$lB4...@fe3.news.blueyonder.co.uk...

>> If they rely upon isolation that's not guaranteed, they get what they
>> deserve.

> Not really:

> - at the time the software was written, no implementation of x86 (or
> other platforms) used caches shared between processes executing in
> parallel.

Right, but such isolation was not *guaranteed*.

> So it is just as reasonable to attribute the cause of the problem to
> {HT with shared caches} breaking an implicit security requirement, as
> to the software relying on isolation that was not guaranteed.

No. It should be well understood that future hardware and future
operating systems may not be able to make things that "just happened to
work" in the past continue to work. That's why you should only rely on
things that are guaranteed. You can blame the hardware or the OS if and only
if it breaks a guarantee.

> - the software in question cannot do anything about caches being
> shared between mutually untrusting processes. Only the operating
> system can fix that. The software can try, to some extent, to
> reduce the amount of information leaked via the cache, but it
> is impossible to eliminate this leakage entirely.

Then the software is insecure by design.

DS


David Hopwood

Sep 21, 2005, 4:02:14 PM9/21/05
to
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> AFAICT, Alexander seems to think the x86 memory model is just PC and
>>> *all* of the other stuff in the x86 docs doesn't count.
>>
>> *Most* of the other stuff in the x86 docs is about non-WB memory types,
>> or about what is visible on the system bus, as opposed to what is visible
>> to processors.
>
> "In a single-processor system for memory regions defined as write-back
> cacheable, the following ordering rules apply:"

Followed by a list of rules, none of which contradict the model for WB memory
on x86 being exactly processor consistency.

(This list wouldn't be expected to be sufficient to imply processor consistency,
since it only considers a single-processor system.)

>>> If you want to do global memory barriers just using PC and nothing else,
>>> you can. Just use a common memory location to "sync" everything with.
>>
>> If you don't care about lousy performance due to cache line pinging.
>
> Lousy performance compared to what?

Now that I look at your suggestion more closely, it doesn't appear to work,
so its performance is moot.

Start with F (fence) == Y == Z == 0.

P1: F = 1; Y = 1;
P2: if (Y == 1) { (void) F; F = 1; Z = 1; }
P3: if (Z == 1) { (void) F; assert(Y == 1); }

can fail, e.g. in this execution:

P1 -(F=1)-(Y=1)----------------------------------------------------------
      \ \  \\_______________________________________________
       \ \   \                                              \
P2 -----*-\---*-(Y==1)-(F==1)-(F=1)-(Z=1)------------------- \ ----------
           \                     \     \                      \
            \                     \     \                      \
P3 ----------*---------------------*-----*-(Z==1)-(F==1)-(Y==0)-*-(Y==1)-

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Sep 22, 2005, 3:25:14 AM9/22/05
to
> Start with F (fence) == Y == Z == 0.
>
> P1: F = 1; Y = 1;
> P2: if (Y == 1) { (void) F; F = 1; Z = 1; }
> P3: if (Z == 1) { (void) F; assert(Y == 1); }
>
> can fail, e.g. in this execution:

Right. The fence is in the wrong place. You store to the fence "after"
storing to Y or Z. This acts as a release barrier. You would then load from
the fence to get an acquire barrier:

Processor 1: Y = 1; Fence.rel = 1;
Processor 2: if (Fence.acq == 1 ) { Z = 1; Fence.rel = 2; }
Processor 3: if (Fence.acq == 2) { assert( Y == 1 && Z == 1 ); }


Simple...

;)


Joe Seigh

Sep 22, 2005, 6:26:12 PM9/22/05
to
David Hopwood wrote:
> Joe Seigh wrote:
>>Lousy performance compared to what?
>
>
> Now that I look at your suggestion more closely, it doesn't appear to work,
> so its performance is moot.
>
> Start with F (fence) == Y == Z == 0.
>
> P1: F = 1; Y = 1;
> P2: if (Y == 1) { (void) F; F = 1; Z = 1; }
> P3: if (Z == 1) { (void) F; assert(Y == 1); }
>
> can fail, e.g. in this execution:
>
It's a store/store or release memory barrier so it has to be after the
stores that you want to be made visible. PC type 2 has global visibility
guarantees so you don't need it for that.

David Hopwood

Sep 23, 2005, 11:41:49 AM9/23/05
to

Well, yes, but that's completely changing the logic of the original example.
The point is that you effectively have to design lock-free stuff to work in
a processor consistent model, if you want it to be efficient on x86 etc.
The lack of "remote write atomicity" can't be hidden just by redefining
fence operations.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 23, 2005, 11:44:37 AM9/23/05
to
David Hopwood wrote:
> Chris Thomasson wrote:
>
>>>Start with F (fence) == Y == Z == 0.
>>>
>>>P1: F = 1; Y = 1;
>>>P2: if (Y == 1) { (void) F; F = 1; Z = 1; }
>>>P3: if (Z == 1) { (void) F; assert(Y == 1); }
>>>
>>>can fail, e.g. in this execution:
>>
>>Right. The fence is in the wrong place. You store to the fence "after"
>>storing to Y or Z. This acts as a release barrier. You would then load from
>>the fence to get an acquire barrier:
>>
>>Processor 1: Y = 1; Fence.rel = 1;
>>Processor 2: if (Fence.acq == 1 ) { Z = 1; Fence.rel = 2; }
>>Processor 3: if (Fence.acq == 2) { assert( Y == 1 && Z == 1 ); }
>
> Well, yes, but that's completely changing the logic of the original example.
> The point is that you effectively have to design lock-free stuff to work in
> a processor consistent model, if you want it to be efficient on x86 etc.

or a weaker model, of course.

> The lack of "remote write atomicity" can't be hidden just by redefining
> fence operations.

This applies also if you design for RCpc.

--
David Hopwood <david.nosp...@blueyonder.co.uk>
