my understanding of acquire/release/etc

frege

unread,

Nov 20, 2009, 1:25:34 AM11/20/09

to

I find myself increasingly explaining to others my view of the memory
model, CPU architecture, etc relating to lock free programming. I'm
starting to think I better do a 'sanity check' before more people come
asking.
So here it is in (hopefully) a nutshell:

I tend to give people a picture revolving around the CPU requesting
reads/writes from the memory bus by adding read-requests to a read
queue and write-requests to a separate write queue..
Then I explain the need for acquire/release semantics as preventing
read (or write) requests being reordered by the CPU and/or memory bus
*while in the queue*.

So consider:

a = 1; b = 2; c = 3; d = 4;

this would queue up write requests for each of a,b,c,d - initially
queued in that order say.

The 'trouble' (but worthwhile optimization) comes in when all of the
requests are in the queue, and the (CPU?/bus?) decides it should
reorder the queue because would be more efficient to write d first for
some reason (ie maybe d and a are on the same cache line). Similarly
for read-requests, on the separate read queue.

Then, for example, a write_release(c) works by first flushing the
write-queue (ie wait for all pending writes (ie of a and b) to
complete), and only then add write-c to the queue. And then add write-
d. So write-d may actually get re-ordered to happen before write-c
(they are in the queue together), but a and b are long gone already.

And for reads:

int x = a * b + c * d;

would queue up read requests for each of a,b,c,d.

A read_acquire(c) works in the opposite order as write_release. ie
acquire first adds a read-c request to the read queue, then 'flushes'
the queue (ie waits for all reads to complete). So earlier reads
(a,b) might be reordered after read-c, but later reads (d) really will
be later.

And then from there you can explain how what is really important is
how atomically reading/writing this flag affects that other data you
are about to read/write. (ie without the other_important_data,
protected by the flag, these barriers don't have much use)

So here's the question(s)

- how 'true' is that in the sense that CPUs really work that way
- how 'correct' is it as a model - ie even if current/future CPUs work
differently, as a model this is consistent with reality such that
using this as a model will not introduce bugs in your code. In this
case, I suppose I'm really trying to be consistent with the propose C+
+0x memory model (since who knows what future architectures might
really do, but the C++ memory model will probably stick around for a
while).

Basically, the level I'm trying to explain at is:
- not as 'intense' as StoreLoad/StoreStore/etc
- not as naive as "you might have an old value in the cache/
register" (which seems to be the prevalent answer)

So I'd appreciate comments. I'm just trying to clear up some of the
misinformation out there without adding more of it!

Tony

Chris Friesen

unread,

Nov 20, 2009, 1:59:27 AM11/20/09

to frege

On 11/20/2009 12:25 AM, frege wrote:

> I tend to give people a picture revolving around the CPU requesting
> reads/writes from the memory bus by adding read-requests to a read
> queue and write-requests to a separate write queue..
> Then I explain the need for acquire/release semantics as preventing
> read (or write) requests being reordered by the CPU and/or memory bus
> *while in the queue*.
>
> So consider:
>
> a = 1; b = 2; c = 3; d = 4;
>
> this would queue up write requests for each of a,b,c,d - initially
> queued in that order say.
>
> The 'trouble' (but worthwhile optimization) comes in when all of the
> requests are in the queue, and the (CPU?/bus?) decides it should
> reorder the queue because would be more efficient to write d first for
> some reason (ie maybe d and a are on the same cache line). Similarly
> for read-requests, on the separate read queue.
>
> Then, for example, a write_release(c) works by first flushing the
> write-queue (ie wait for all pending writes (ie of a and b) to
> complete), and only then add write-c to the queue. And then add write-
> d. So write-d may actually get re-ordered to happen before write-c
> (they are in the queue together), but a and b are long gone already.

This will give the right results, but the flush isn't necessary.

Suppose a and b are in the queue. You then do a write barrier aka
"release" operation, then write c and d.

Conceptually you can view this as putting a blockage (hence the term
"barrier") in the queue between b and c. This barrier tells both the
compiler and the cpu that writes cannot be reordered across the barrier.
In our example a and b can be reordered, and c and d can be reordered,
but b and c cannot be reordered. Note that reads can be reordered
across this barrier.

> And for reads:
>
> int x = a * b + c * d;
>
> would queue up read requests for each of a,b,c,d.
>
> A read_acquire(c) works in the opposite order as write_release. ie
> acquire first adds a read-c request to the read queue, then 'flushes'
> the queue (ie waits for all reads to complete). So earlier reads
> (a,b) might be reordered after read-c, but later reads (d) really will
> be later.

Not quite. You could visualize the acquire as waiting for all reads to
complete and then adding the read-c request to the queue.

Technically however the hardware doesn't need to wait for them to
complete. If we read a and b then do a read barrier aka "acquire"
operation, conceptually it puts a barrier in the queue to prevent reads
being reordered across the barrier. They can still be reordered
separately on either side of the barrier, and writes can be reordered
across the barrier.

Finally, there is the full barrier which prevents both reads and writes
from being reordered across the barrier.

(There's also the barrier which prevents accesses to memory being
reordered with respect to accesses to the I/O address space, but that's
a separate issue and most userspace apps don't need to worry about it.)

> Basically, the level I'm trying to explain at is:
> - not as 'intense' as StoreLoad/StoreStore/etc
> - not as naive as "you might have an old value in the cache/
> register" (which seems to be the prevalent answer)

An alternate way to view it is to imagine an SMP system where
1) cpu caches are infinitely large
2) writes to memory only write to the local cache
3) reads from memory read from the local cache if the address is cached,
otherwise read from main memory

In this model

1) a write barrier flushes the local cache to main memory
2) a read barrier invalidates the local cache
3) a full barrier both flushes the local cache and then invalidates it.

This model is basically equivalent to your queue, but maybe easier to
picture. As you can imagine, in this model the proper barriers are
required otherwise data can be horribly stale.

Chris

David Schwartz

unread,

Nov 20, 2009, 2:03:34 AM11/20/09

to

On Nov 19, 10:25 pm, frege <gottlobfr...@gmail.com> wrote:

> So I'd appreciate comments. I'm just trying to clear up some of the
> misinformation out there without adding more of it!

Your explanation is basically okay. There is one detail you should
make clear though: these queues have nothing whatsoever to do with
caches. The need for barriers has nothing whatsoever to do with a CPU
having an old value "stuck in a cache" or anything like that. Cache
coherency is enforced by hardware.

The problem is two places:

1) On today's CPUs, speculative reads are issued by the CPU. So reads
from cache to the CPU may not occur in the order you coded them. Also,
on today's CPUs, writes are posted in a buffer. So writes may not be
visible to other CPUs (because they're not in the L1 cache yet) in the
order they are placed. The intersection of these two things may
reorder reads around writes.

2) Regardless of what today's CPUs do, the ordering of reads and
writes is not guaranteed unless you force it. A person should be able
to upgrade their CPU without software breaking. So you should not rely
on behavior that is not guaranteed.

DS

Chris Friesen

unread,

Nov 20, 2009, 9:15:59 AM11/20/09

to

On 11/20/2009 01:03 AM, David Schwartz wrote:
> On Nov 19, 10:25 pm, frege <gottlobfr...@gmail.com> wrote:
>
>> So I'd appreciate comments. I'm just trying to clear up some of the
>> misinformation out there without adding more of it!
>
> Your explanation is basically okay. There is one detail you should
> make clear though: these queues have nothing whatsoever to do with
> caches. The need for barriers has nothing whatsoever to do with a CPU
> having an old value "stuck in a cache" or anything like that. Cache
> coherency is enforced by hardware.

It's enforced by hardware on the vast majority of current hardware.
It's certainly not required or guaranteed though. One could envision a
very weak memory architecture where a barrier was literally required to
trigger reads from and writes to system-wide main memory.

> The problem is two places:
>
> 1) On today's CPUs, speculative reads are issued by the CPU. So reads
> from cache to the CPU may not occur in the order you coded them. Also,
> on today's CPUs, writes are posted in a buffer. So writes may not be
> visible to other CPUs (because they're not in the L1 cache yet) in the
> order they are placed. The intersection of these two things may
> reorder reads around writes.

The compiler may also reorder unless told otherwise, and the cpu barrier
instruction also acts as a compiler barrier.

> 2) Regardless of what today's CPUs do, the ordering of reads and
> writes is not guaranteed unless you force it. A person should be able
> to upgrade their CPU without software breaking. So you should not rely
> on behavior that is not guaranteed.

Agreed.

Chris

David Schwartz

unread,

Nov 20, 2009, 10:55:39 AM11/20/09

to

On Nov 20, 6:15 am, Chris Friesen <cbf...@mail.usask.ca> wrote:

First, let me say I agree with everything you said. But I'm replying
mainly to tie these two things together:

> It's enforced by hardware on the vast majority of current hardware.
> It's certainly not required or guaranteed though. One could envision a
> very weak memory architecture where a barrier was literally required to
> trigger reads from and writes to system-wide main memory.

> > 2) Regardless of what today's CPUs do, the ordering of reads and

> > writes is not guaranteed unless you force it. A person should be able
> > to upgrade their CPU without software breaking. So you should not rely
> > on behavior that is not guaranteed.

What Chris suggests, though it is not actually how current hardware
works, is a good way to think about it. In principle, the CPU can
"hang onto" a write as long as it wants to. It has no obligation to
ever make it visible to other CPUs. Similarly, it may make reads
arbitrarily far in advance of where you asked for them.

You can use barriers to provide all these guarantees you don't
otherwise have. Alternatively, you can rely on the "barriers"
implicitly provided by the guarantees of your threading library. For
example, POSIX guarantees that a thread that acquires a mutex can see
all writes coded prior to previous releases of the mutex. POSIX
guarantees that a newly-created thread can see all writes made by the
creating thread coded prior to calling pthread_create. And so on.

You can certainly make do with just these guarantees. But sometimes,
all you really need a memory barrier.

DS

frege

unread,

Nov 20, 2009, 1:00:53 PM11/20/09

to

On Nov 20, 2:03 am, David Schwartz <dav...@webmaster.com> wrote:
> On Nov 19, 10:25 pm, frege <gottlobfr...@gmail.com> wrote:
>
> > So I'd appreciate comments. I'm just trying to clear up some of the
> > misinformation out there without adding more of it!
>
> Your explanation is basically okay. There is one detail you should
> make clear though: these queues have nothing whatsoever to do with
> caches. The need for barriers has nothing whatsoever to do with a CPU
> having an old value "stuck in a cache" or anything like that. Cache
> coherency is enforced by hardware.
>

Yes, this is exactly what I'm fighting against. But to explain that
it is not the cache requires me to put something else in its place.

> The problem is two places:
>
> 1) On today's CPUs, speculative reads are issued by the CPU. So reads
> from cache to the CPU may not occur in the order you coded them. Also,
> on today's CPUs, writes are posted in a buffer. So writes may not be
> visible to other CPUs (because they're not in the L1 cache yet) in the
> order they are placed. The intersection of these two things may
> reorder reads around writes.
>
> 2) Regardless of what today's CPUs do, the ordering of reads and
> writes is not guaranteed unless you force it. A person should be able
> to upgrade their CPU without software breaking. So you should not rely
> on behavior that is not guaranteed.
>

Yeah, in the end, my explanation is just another model. That's
hopefully consistent with future architecture.

> DS

frege

unread,

Nov 20, 2009, 1:03:11 PM11/20/09

to

On Nov 20, 10:55 am, David Schwartz <dav...@webmaster.com> wrote:
> On Nov 20, 6:15 am, Chris Friesen <cbf...@mail.usask.ca> wrote:
>
> First, let me say I agree with everything you said. But I'm replying
> mainly to tie these two things together:
>
> > It's enforced by hardware on the vast majority of current hardware.
> > It's certainly not required or guaranteed though. One could envision a
> > very weak memory architecture where a barrier was literally required to
> > trigger reads from and writes to system-wide main memory.
> > > 2) Regardless of what today's CPUs do, the ordering of reads and
> > > writes is not guaranteed unless you force it. A person should be able
> > > to upgrade their CPU without software breaking. So you should not rely
> > > on behavior that is not guaranteed.
>
> What Chris suggests, though it is not actually how current hardware
> works, is a good way to think about it. In principle, the CPU can
> "hang onto" a write as long as it wants to. It has no obligation to
> ever make it visible to other CPUs. Similarly, it may make reads
> arbitrarily far in advance of where you asked for them.
>

Yep, it is tempting to keep it that simple, and not talk about how it
might work underneath. But sometimes people want a picture of what is
(maybe) going on.

> You can use barriers to provide all these guarantees you don't
> otherwise have. Alternatively, you can rely on the "barriers"
> implicitly provided by the guarantees of your threading library. For
> example, POSIX guarantees that a thread that acquires a mutex can see
> all writes coded prior to previous releases of the mutex. POSIX
> guarantees that a newly-created thread can see all writes made by the
> creating thread coded prior to calling pthread_create. And so on.
>

I *definitely* first tell people not to do any lockfree programming at
all. Stick to mutexes, etc. But they keep pestering me...

> You can certainly make do with just these guarantees. But sometimes,
> all you really need a memory barrier.
>
> DS

Thanks for all of your input.