Memory barriers

Dmitriy V'jukov

unread,

Apr 23, 2008, 7:16:03 AM4/23/08

to

My understanding is that memory barriers are *solely* about order of
visibility, and not about visibility as such, reactivity of visibility
etc.

With spreading of 'lock-free' algorithms I more and more frequently
hear statements that memory barriers are about visibility as such. In
my opinion such statements are false. Cache-coherency is about
visibility as such. Am I right?

Are there any widespread hardware platforms where following code C/C++
is broken?

volatile word_t g_flag = 0; // aligned machine word

void thread1()
{
g_flag = 1;
for (;;) {} // no more activity in this thread
}

void thread2()
{
while (0 == g_flag) {} // active spin wait
printf("signaled");
}

No data is associated with g_flag. I only want to signal thread2. And
I'm assuming that thread2 will eventually see g_flag == 1.
Are there any widespread hardware platforms where I have to make
something special in thread1 and/or thread2 in order to ensure
visibility of g_flag change?

Dmitriy V'jukov

robert...@yahoo.com

unread,

Apr 23, 2008, 2:32:23 PM4/23/08

to

For machines that are cache coherent, your initial statement is
correct, and your code will work as intended (modulo bugs, of course).

Non-CC machines do exist, and if g_flag is stored in a non-CC region
(non-CC machines can have a mix of cache coherent, non-CC, and local
memory), you may have to do something to make the update visible to
the other processors.

Dmitriy V'jukov

unread,

Apr 23, 2008, 3:28:14 PM4/23/08

to

On Apr 23, 10:32 pm, "robertwess...@yahoo.com"
<robertwess...@yahoo.com> wrote:

> > No data is associated with g_flag. I only want to signal thread2. And
> > I'm assuming that thread2 will eventually see g_flag == 1.
> > Are there any widespread hardware platforms where I have to make
> > something special in thread1 and/or thread2 in order to ensure
> > visibility of g_flag change?
>
> For machines that are cache coherent, your initial statement is
> correct, and your code will work as intended (modulo bugs, of course).
>
> Non-CC machines do exist, and if g_flag is stored in a non-CC region
> (non-CC machines can have a mix of cache coherent, non-CC, and local
> memory), you may have to do something to make the update visible to
> the other processors.

Thank you very much. You get this off my conscience :)

Non-CC machines are quite uncommon. Can you provide some names or
links to information about non-CC machines?

What kind of special thing I have to do to make the update visible to
the other processors? Is it fine-grained or coarse-grained? Is it
require cooperation of writer and reader or only writer? Is it similar
to clflush instruction on x86? I understand that such things are
different on different architectures, but I will appreciate if you
provide information for any architecture. I just very curious.

Dmitriy V'jukov

Pertti Kellomäki

unread,

Apr 23, 2008, 3:39:15 PM4/23/08

to

Dmitriy V'jukov wrote:
> Non-CC machines are quite uncommon.

It seems to me that this will need to change if the talk
about hundreds or thousands of cores is going to become
reality.
--
Pertti

robert...@yahoo.com

unread,

Apr 23, 2008, 5:03:26 PM4/23/08

to

On Apr 23, 2:28 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Thank you very much. You get this off my conscience :)
>
> Non-CC machines are quite uncommon. Can you provide some names or
> links to information about non-CC machines?
>
> What kind of special thing I have to do to make the update visible to
> the other processors? Is it fine-grained or coarse-grained? Is it
> require cooperation of writer and reader or only writer? Is it similar
> to clflush instruction on x86? I understand that such things are
> different on different architectures, but I will appreciate if you
> provide information for any architecture. I just very curious.

For example, many of the big Cray's (the TXs, for example) are non-CC
between compute nodes (which are themselves multi-processor Opteron
nodes which are internally CC).

What you have to do to deal with that varies greatly between systems,
but most systems provide functions that implement the more traditional
pthreads, SHMEM, or MPI functions. So for example, SHMEM on a TX5,
for example, implements an atomic swap (among other things), which you
can use as you would anywhere. It's just slow (since the hardware
doesn't actually support the required functions).

You *can* implement this stuff on your own, but as I said, the details
vary significantly. And usually you want to be going in a more
message passing direction on the big non-CC machines anyway.

But it's usually fairly unpleasant stuff. Yes, there is usually a
flush type of operation, but the real problem you have to deal with is
ownership. Like I said, there's a reason that message passing is
popular on that type of box.

Message has been deleted

robert...@yahoo.com

unread,

Apr 23, 2008, 5:57:07 PM4/23/08

to

On Apr 23, 4:39 pm, Elcaro Nosille <Elcaro.Nosi...@googlemail.com>
wrote:
> robertwess...@yahoo.com schrieb:

>
> > For example, many of the big Cray's (the TXs, for example) are
> > non-CC between compute nodes (which are themselves multi-processor
> > Opteron nodes which are internally CC).
>

> Do these machines have a shared address-space but no implicit i.e.
> manual coherency?

That's correct.

levita...@yahoo.com

unread,

Apr 23, 2008, 8:50:22 PM4/23/08

to

the phrase "cache-coherency" was thrown around with an apparent common
understanding of what that is. Can someone state what it means? A
confusing case for me is if there are no caches, just pipelines
throughout a system. How can cache coherency mean anything in that
case. And then if you add caches to that case, does the meaning
change? or not? Is there such a thing as xxx-coherency where xxx
refers to some other common hardware structure like a cache? for
instance a fifo, registers, bus interfaces or ??? just wondering.

Chris Thomasson

unread,

Apr 23, 2008, 11:08:52 PM4/23/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:ae5b3a1b-78b8-456c...@u36g2000prf.googlegroups.com...

Will thread2 have its execution affinity bound to a processor that's on the
same node and uses the same local memory as the processor which thread1 is
bound to? Humm, perhaps the follow response I got from somebody who is
involved with the sicortex super-computers could be of some interest to you:

http://groups.google.com/group/comp.arch/msg/2e5eeaecd0e69aed

Those systems happen to have cache-coherent nodes. Therefore, you can use
pthreads and/or non-blocking algorithms for intra-node shared memory
communication, but not for inter-node comms which requires message-passing
(e.g., MPI)...

Chris Thomasson

unread,

Apr 23, 2008, 11:09:37 PM4/23/08

to

"Pertti Kellomäki" <pertti.k...@tut.fi> wrote in message
news:fuo2qp$c10$1...@news.cc.tut.fi...

I think I have to agree here...

robert...@yahoo.com

unread,

Apr 24, 2008, 11:30:57 PM4/24/08

to

If there are only pipelines - IOW, all memory accesses are direct to
the "real" memory, there cannot, by definition, be any coherency
problems. The problem arises if the system allows multiple copies of
an object to exist at one time, and then allows one of those to be
modified. For example, if two CPUs each have a copy of a chunk (a
cache line) of memory in their caches, and one modifies it, with
coherent caches the other processor will see the change. This
typically requires that the cache keep certain additional information
about the cache line (for example, does this cache have the *only*
copy of the memory line in the system - in which case it can actually
do the update), and that the cache can communicate with the other
caches to propagate changes and reduce the number of copies (for
example, if the memory line in question had multiple copies in
different caches, the modifying CPU must invalidate the copies that
other CPUs have of that line, so that it has exclusive access while
making the update). Getting that to happen is non-trivial, and
becomes very difficult to do efficiently for very large machines with
large numbers of CPUs. Non-CC machines punt keeping track of that
stuff back to the application.

That's not usually an issue for things like registers or busses.
There have been a few cases where a CPU has had a register file that
was replicated to reduce the number of ports needed - and in that case
there has to be a mechanism for keeping the two copies of the register
in sync, which presents a similar problem conceptually, except that
this is within a single CPU core and relative to a single instruction
stream so it's somewhat easier to deal with.

already...@yahoo.com

unread,

Apr 25, 2008, 9:28:57 AM4/25/08

to

On 23 апр, 23:57, "robertwess...@yahoo.com" <robertwess...@yahoo.com>
wrote:

Are you sure?
According to my understanding Cray XT3/4/5 are pretty standard MPP
machines, i.e. address space is not shared between the nodes.