#include <atomic>
using namespace std;

struct X {
    int data;
    atomic<int> refcount;
};

void fn_1(X* x)
{
    x->data = 42;
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
        delete x;
}

void fn_2(X* x)
{
    x->data = 42;
    if (x->refcount.fetch_sub(1, memory_order_acq_rel) == 1)
        delete x;
}

void fn_3(X* x)
{
    x->data = 42;
    if (x->refcount.fetch_sub(1, memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        delete x;
    }
}
The "correct" variant would allow two threads to call fn_X (with
refcount==2 initially) and guarantee that 'x->data=42' in thread 1
"happens before" 'delete x' in thread 2. For this to be true, thread 1
must decrease the refcount to 1 with "release" semantics, while thread 2
must decrease the refcount to 0 with "acquire" semantics. This condition is
obviously true for fn_2 and fn_3, but it is also obviously false for
fn_1.
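For concreteness, a minimal harness for this scenario might look like the
following sketch (hypothetical driver code, here using the fn_2 variant):

#include <thread>

int main()
{
    X* x = new X;
    x->data = 0;
    x->refcount.store(2, memory_order_relaxed); // one reference per thread
    thread t1(fn_2, x), t2(fn_2, x);
    t1.join();
    t2.join(); // exactly one of the two decrements observes 1 and deletes x
}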
I consider this problematic in several ways.
I believe that fn_1 is sufficient on any machine that implements
"release" with a "LoadStore|StoreStore" memory fence just before (or
implicit with) the atomic decrement -- there is no need to "order" the
decrement to zero and the delete operation from the POV of other
threads, as they cannot legally access the memory locations. While this
might not be a necessity, I have great difficulties imagining a machine
for which this would not be true. What *might* be necessary is a barrier
preventing the compiler from speculatively deleting the object.
fn_2 is needlessly expensive on machines like ppc that require an
_explicit_ memory fence after an atomic operation as it pays this
penalty even if the reference counter does not drop to zero.
fn_3 might look better than fn_2, but it could be more expensive than
fn_2 on machines like x86 where the memory fence is _implicit_ in the
atomic operation (heavily depends on the compiler recognizing this).
What am I supposed to do? :(
I believe that `atomic_thread_fence(memory_order_acquire)' on an x86 is a
nop. However, I don't think fn_3 will work. You are probably going to need
to do something like this:
void fn_4(X* x)
{
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        x->refcount.load(memory_order_acquire);
        delete x;
    }
}
The dummy load with acquire is a nop on x86.
As far as I read the standard, atomic_thread_fence(memory_order_acquire)
plus the fact that thread 2 previously reads the value written by thread
1 is already sufficient to establish a "happens before" relationship.
> You are probably going to need to do something like this:
>
> void fn_4(X* x)
> {
>     if (x->refcount.fetch_sub(1, memory_order_release) == 1)
>     {
>         x->refcount.load(memory_order_acquire);
>         delete x;
>     }
> }
would work as well, unless the compiler were allowed to optimize away
spurious loads (I don't know).
All of this still evades the underlying question of why I should do any of
that if it does not address a real race on any physically possible
machine.
> You would think that with the standardization of std::atomic, writing
> something as mundane as a portable reference counter would be easy, but
> apparently this is far from being true. Consider
If you stick to memory_order_seq_cst (the default) then it's easy. If
you opt for low-level ordering constraints then you have to specify
exactly what you need.
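For reference, the "easy" version would be just the following (a sketch;
fn_seq is an illustrative name, and the decrement defaults to
memory_order_seq_cst, which includes the acquire semantics needed before
the delete):

void fn_seq(X* x)
{
    x->data = 42;
    if (x->refcount.fetch_sub(1) == 1) // memory_order_seq_cst by default
        delete x;
}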
> struct X {
>     int data;
>     atomic<int> refcount;
> };
>
> void fn_1(X* x)
> {
>     x->data = 42;
>     if (x->refcount.fetch_sub(1, memory_order_release) == 1)
>         delete x;
> }
>
> void fn_2(X* x)
> {
>     x->data = 42;
>     if (x->refcount.fetch_sub(1, memory_order_acq_rel) == 1)
>         delete x;
> }
>
> void fn_3(X* x)
> {
>     x->data = 42;
>     if (x->refcount.fetch_sub(1, memory_order_release) == 1) {
>         atomic_thread_fence(memory_order_acquire);
>         delete x;
>     }
> }
>
> The "correct" variant would allow two threads to call fn_X (with
> refcount==2 initially) and guarantee that 'x->data=42' in thread 1
> "happens before" 'delete x' in thread 2. For this to be true, thread 1
> must decrease the refcount to 1 with "release" semantics, while thread 2
> must decrease the refcount to 0 with "acquire" semantics. This condition is
> obviously true for fn_2 and fn_3, but it is also obviously false for
> fn_1.
Agreed.
> I consider this problematic in several ways.
>
> I believe that fn_1 is sufficient on any machine that implements
> "release" with a "LoadStore|StoreStore" memory fence just before (or
> implicit with) the atomic decrement -- there is no need to "order" the
> decrement to zero and the delete operation from the POV of other
> threads, as they cannot legally access the memory locations. While this
> might not be a necessity, I have great difficulties imagining a machine
> for which this would not be true. What *might* be necessary is a barrier
> preventing the compiler from speculatively deleting the object.
On SPARC in RMO mode, the CPU could hoist any loads needed for the
delete above the atomic decrement if you only had a LoadStore|StoreStore
barrier. I think that IA64 and Alpha can do similarly.
> fn_2 is needlessly expensive on machines like ppc that require an
> _explicit_ memory fence after an atomic operation as it pays this
> penalty even if the reference counter does not drop to zero.
Agreed.
> fn_3 might look better than fn_2, but it could be more expensive than
> fn_2 on machines like x86 where the memory fence is _implicit_ in the
> atomic operation (heavily depends on the compiler recognizing this).
On x86, atomic_thread_fence(memory_order_acquire) is a NOP.
> What am I supposed to do? :(
Use fn_3.
Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
> If you stick to memory_order_seq_cst (the default) then it's easy. If
> you opt for low-level ordering constraints then you have to specify
> exactly what you need.
I know, but seq_cst is just insanely expensive. The release/acquire
model is quite intuitive I think, but I guess I have a little bit of a
"reverse mapping" problem from the memory models of physically exsting
CPUs that are much stricter than the C++0x one.
> Helge Bahmann <h...@chaoticmind.net> writes:
>> I believe that fn_1 is sufficient on any machine that implements
>> "release" with a "LoadStore|StoreStore" memory fence just before (or
>> implicit with) the atomic decrement -- there is no need to "order" the
>> decrement to zero and the delete operation from the POV of other
>> threads, as they cannot legally access the memory locations. While this
>> might not be a necessity, I have great difficulties imagining a machine
>> for which this would not be true. What *might* be necessary is a barrier
>> preventing the compiler from speculatively deleting the object.
>
> On SPARC in RMO mode, the CPU could hoist any loads needed for the
> delete above the atomic decrement if you only had a LoadStore|StoreStore
> barrier. I think that IA64 and Alpha can do similarly.
Does it matter if the destructor is empty as in the example provided?
- the memory allocator has no business reading the object contents (and
even if it did, there is nothing it could do with the data); the lack of
StoreLoad fencing is therefore not problematic
- the allocator must also protect its own internal data structures
independently, so these can't be at issue either
The memory allocator may store new data into the now reclaimed memory,
but that interfering with the operations of thread 1 poses little
problem -- in its own "eigen time", thread 1 (or rather, its CPU cache)
observes:
t1a: its own access to 'x->data'
t1b: its own decrement
t1c: thread2's decrement
t1d: thread2's stores due to delete
with t1a<t1b (program order+fence) and t1b<t1c (total order to single
memory location). We know that t1c<>t1d will be undecidable, and we
cannot yet say anything about t1d with respect to t1a and t1b. Thread 2
in its own "eigen time" (or rather, its CPU cache) observes:
t2a: thread1's access to 'x->data'
t2b: thread1's decrement
t2c: its own decrement
t2d: its own stores due to delete
with t2a<t2b (fence in thread 1), t2b<t2c (total order to single memory
location), t2c<t2d (forbidden speculated stores). Therefore, t2d cannot
affect t2a. This is *not* sufficient to imply that t1a<t1d! But the idea
of thread 2 being able to affect thread 1 *without* being able to affect
its observable behavior is disquieting.
Besides a CPU that can send messages into its own past (which I have
difficulties reconciling with physical reality), this only leaves the
options of a CPU that...
- either uses something else than fences to implement the memory model
- or can safely make speculated stores visible to other CPUs (and must
consequently also be able to inform others of failed speculations and
rewind their execution)
I don't think any physically existing CPU falls in either of these
categories, and I am also unsure if this lenience is of any help for
future CPU designs.
Now, please point out where I am mistaken, so I have a justification to
put the memory_order_acquire fence where it belongs and can find sleep
again :)
> Anthony Williams <antho...@gmail.com> wrote:
>
>> If you stick to memory_order_seq_cst (the default) then it's easy. If
>> you opt for low-level ordering constraints then you have to specify
>> exactly what you need.
>
> I know, but seq_cst is just insanely expensive. The release/acquire
> model is quite intuitive I think, but I guess I have a little bit of a
> "reverse mapping" problem from the memory models of physically exsting
> CPUs that are much stricter than the C++0x one.
Yes, I've known others to have that problem too. It's particularly hard if
you have working code for a particular architecture and wish to replace
it with C++0x atomics. The "equivalent" C++ doesn't always compile back
to the same instructions.
>> Helge Bahmann <h...@chaoticmind.net> writes:
>>> I believe that fn_1 is sufficient on any machine that implements
>>> "release" with a "LoadStore|StoreStore" memory fence just before (or
>>> implicit with) the atomic decrement -- there is no need to "order" the
>>> decrement to zero and the delete operation from the POV of other
>>> threads, as they cannot legally access the memory locations. While this
>>> might not be a necessity, I have great difficulties imagining a machine
>>> for which this would not be true. What *might* be necessary is a barrier
>>> preventing the compiler from speculatively deleting the object.
>>
>> On SPARC in RMO mode, the CPU could hoist any loads needed for the
>> delete above the atomic decrement if you only had a LoadStore|StoreStore
>> barrier. I think that IA64 and Alpha can do similarly.
>
> Does it matter if the destructor is empty as in the example provided?
Yes. No. Maybe. :-)
See below.
> - the memory allocator has no business reading the object contents (and
> even if it did, there is nothing it could do with the data); the lack of
> StoreLoad fencing is therefore not problematic
True. If the object itself is never accessed in the "delete" branch
(e.g. because the destructor is trivial) then you don't need an acquire
fence for that.
If the destructor is not trivial then presumably it does read the data
in some fashion, and thus needs the acquire fence.
> - the allocator must also protect its own internal data structures
> independently, so these can't be at issue either
Ooh hang on there. Maybe the allocator stored info about the memory
block (such as its size) alongside the object when it was
constructed. It must read this data in order to free the memory
block. Unless there is a happens-before edge between the new and delete
then it won't necessarily read valid data. However, there must be such
an edge already in order to ensure that the initialization of the
reference count happens-before the decrement of that reference count.
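A toy illustration of that point (the layout and names are hypothetical,
but typical of malloc-style allocators):

#include <cstdlib>

void* toy_alloc(size_t n)
{
    size_t* p = (size_t*)malloc(sizeof(size_t) + n);
    *p = n;       // block size written at allocation time
    return p + 1; // user object lives here
}

void toy_free(void* q)
{
    size_t* p = (size_t*)q - 1;
    size_t n = *p; // header read at deallocation time -- needs a
    (void)n;       // happens-before edge with the write in toy_alloc
    free(p);
}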
So, if the destructor really is a no-op, you're OK without the acquire
fence. If it is NOT a no-op (i.e. in the general case) then you need the
fence.
First of all, if there are no acquire fences, then release fences are
actually useless and may be omitted too; they have no effect anyway.
Then, I believe an acquire fence is required in either case. The object
passed to delete must be in a quiescent state. For example, the memory
manager may store a freelist 'next' pointer into the memory block; if there
is no acquire before delete, then that store races with the user's stores/
reads to/from the object. Moreover, if there is no acquire, user
stores/reads before delete race with user stores/reads after the next new
(memory reused for another object).
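Concretely, a toy freelist of the kind I mean (illustrative only):

struct free_block { free_block* next; };

void toy_freelist_free(void* p, free_block** head)
{
    free_block* b = (free_block*)p;
    b->next = *head; // store of the 'next' pointer into the dead object --
                     // without an acquire before delete, this store races
                     // with the user's earlier accesses to the object
    *head = b;
}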
Well, if there are "pending" stores to the object that may "catch up"
with the object any time later, there is nothing the memory manager can do
to preserve correctness; it's the responsibility of the user to pass the
object in a quiescent state.
--
Dmitriy V'jukov
> On Dec 12, 00:18, Anthony Williams <anthony....@gmail.com> wrote:
>> Helge Bahmann <bahm...@f1mac21.infpool.tu-freiberg.de> writes:
>> > Anthony Williams <anthony....@gmail.com> wrote:
>> > - the allocator must also protect its own internal data structures
>> > independently, so these can't be at issue either
>>
>> Ooh hang on there. Maybe the allocator stored info about the memory
>> block (such as its size) alongside the object when it was
>> constructed. It must read this data in order to free the memory
>> block. Unless there is a happens-before edge between the new and delete
>> then it won't necessarily read valid data. However, there must be such
>> an edge already in order to ensure that the initialization of the
>> reference count happens-before the decrement of that reference count.
>>
>> So, if the destructor really is a no-op, you're OK without the acquire
>> fence. If it is NOT a no-op (i.e. in the general case) then you need the
>> fence.
>
> First of all, if there are no acquire fences, then release fences are
> actually useless and may be omitted too; they have no effect anyway.
Agreed.
> Then, I believe an acquire fence is required in either case. The object
> passed to delete must be in a quiescent state. For example, the memory
> manager may store a freelist 'next' pointer into the memory block; if there
> is no acquire before delete, then that store races with the user's stores/
> reads to/from the object. Moreover, if there is no acquire, user
> stores/reads before delete race with user stores/reads after the next new
> (memory reused for another object).
Yes, I was forgetting that there may be other fields in the object than
just the reference count, even in the case of the trivial
destructor. You must ensure that all writes into the object are complete
before the memory block is reclaimed, which requires a store-release
that maps to a load-acquire in or before the destructor.
So, it's not just non-trivial destructors you need to watch for, but
data members other than the reference count that are modified during the
object's lifetime. You need a store-release after the object is created
via new to map to a load-acquire in any other thread that references it
to avoid races on the ref count anyway, as I tried to highlight already
above.
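Something like this, for the record (a sketch; 'shared' is a hypothetical
publication point, not part of the original example):

atomic<X*> shared(nullptr);

void creator()
{
    X* x = new X;
    x->data = 0;
    x->refcount.store(2, memory_order_relaxed); // one ref per thread
    shared.store(x, memory_order_release);      // store-release after new
    fn_3(x);                                    // drop the creator's reference
}

void user()
{
    X* x;
    while (!(x = shared.load(memory_order_acquire)))
        ; // the acquire pairs with the release store above
    fn_3(x);
}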
The set of cases in which it's OK to omit the acquire gets smaller the
more you think about it.
void fn_4(X* x)
{
    x->data = 42;
    int refs = x->refcount.load(memory_order_relaxed);
    do {
        if (refs == 1) {
            atomic_thread_fence(memory_order_acquire);
            delete x;
            break;
        }
    } while (!x->refcount.compare_exchange_weak(refs, refs - 1,
                                                memory_order_release,
                                                memory_order_relaxed));
}
I suppose.
regards,
alexander.
This is true for the abstract C++0x memory model, but not true for real
hardware. Free-standing release fences *do* have an observable effect
there.
> Then, I believe an acquire fence is required in either case. The object
> passed to delete must be in a quiescent state. For example, the memory
> manager may store a freelist 'next' pointer into the memory block; if there
> is no acquire before delete, then that store races with the user's stores/
> reads to/from the object.
Again, this is true for the C++0x memory model, but on any real hardware
that uses "ordered memory access fences" to implement the abstract C++0x
fence, no such race is possible, short of a CPU that either
- violates causality (in the physical sense)
- can make unverified speculated stores visible to other CPUs
(See the more detailed causality analysis in a previous post). I know of
no CPU that can do either, but the C++0x memory model would allow the
construction of a CPU that could do the latter (not that I think that
this lenience is useful).
> Moreover, if there is no acquire, user
> stores/reads before delete race with user stores/reads after the next new
> (memory reused for another object).
> Well, if there are "pending" stores to the object that may "catch up"
> with the object any time later, there is nothing the memory manager can do
> to preserve correctness; it's the responsibility of the user to pass the
> object in a quiescent state.
The object will be in a quiescent state after the "LoadStore|StoreStore"
fence used to implement "release". It may not be in an "up-to-date" state
(i.e. the cache of the "delete"ing CPU might not have been invalidated
and may contain stale data), but this does not matter, as the data *must* not
be read. It will eventually be overwritten, in which case coherency
kicks in and will ensure correctly ordered writes.
I understand the requirement of the C++0x memory model that there is no
way around the "acquire" fence here. But I would like to point out that
- this is one of the situations where the C++0x memory model requires
one more fence than physically feasible CPUs do
- this additional abstract fence will be translated into a fencing
instruction on a physical CPU that does not need it
- "back-translating" a given correct implementation for one
(weakly-ordered) CPU to the C++0x memory model is not at all
straightforward
Helge
IIRC C++0x doesn't allow making unverified speculated stores visible
because that would mean data races (and undefined behaviour) for
programs like
Initially: x = y = 0;
T1:                T2:
if (x == 42)       if (y == 42)
    y = x;             x = y;
which is not the case under C++0x.
regards,
alexander.
> Yes, I was forgetting that there may be other fields in the object than
> just the reference count, even in the case of the trivial
> destructor. You must ensure that all writes into the object are complete
> before the memory block is reclaimed, which requires a store-release
> that maps to a load-acquire in or before the destructor.
CPU#1 does the following:
1. x->data=42;
2. fence #StoreStore|LoadStore
3. decrement refcount
This means that when CPU#2 discovers x->refcount==1, the cache block
containing "x->data" exists in CPU#1 in the "modified" state (this is
enforced by the store ordering in CPU#1). The corresponding cache block
of CPU#2 may or may not be invalid (it may contain "stale" data that
predates the assignment from step 1). Reading x->data from CPU#2
might therefore not see the value assigned by CPU#1. So far there is no
disagreement I think.
Now the crucial point is that, with a trivial destructor, there is no
legal way that the data *can* be read; it can only be written.
CPU#2 may and will overwrite x->data (or any other data member, for that
matter), and I assert that there is no ordering problem with the store
from CPU#1: CPU#2 cannot make this store before it "knows" that
refcount==1, and as soon as it knows this fact, the store by CPU#1 is
already completed in the sense that CPU#1 already has the corresponding
cacheline in "modified" state. Cache coherency now requires that CPU#2
cannot create a "second" modified version of the same cacheline, but
must first import the modified state from CPU#1.
CPU#2 may *speculate* that it will see refcount==1 and precompute
everything, but I don't think any current CPU can "publish" speculated
stores (since the speculation may turn out wrong but other CPUs might
already have acted on this). Therefore CPU#1 cannot "see" this, and
its access is unconditionally safe.
CPU#2 cannot verify its speculation until it sees the refcount==1, in
which case CPU#1 has x->data already in modified state. CPU#2 cannot
just "discard" the writes it successfully speculated, but must execute
them as if they happened at the point in time where it could verify its
speculation (or any time later), but no sooner.
Therefore the writes of CPU#2 will always override those of CPU#1, and
it does not matter whether loads of CPU#2 see the stores made by CPU#1.
Again I must repeat that I agree that this depends on CPU#2 only ever
writing to the memory area; if it performs any reads, the result is
clearly undefined on basically all weakly-ordered CPUs.
> You need a store-release after the object is created via new to map to
> a load-acquire in any other thread that references it to avoid races
> on the ref count anyway, as I tried to highlight already above.
Yes, this is correct -- passing a "new" reference from thread to thread
already requires a release/acquire or release/consume sequence. I am
assuming this has happened in the example. This provides any
required 'new happened-before delete' and 'new happened-before access'
guarantee.
Assuming this has happened, the above comments apply.
> The set of cases in which it's OK to omit the acquire gets smaller the
> more you think about it.
The "trivial destructor" requirement is already harsh, but not something
that is completely unthinkable.
Again I completely agree with you that the "acquire" simply cannot be
omitted to satisfy the C++0x memory model, but the instruction that it
translates to on "physical" CPUs can very well be.
Whether this hurts is another question; probably it's a good idea to
just trade in a few clock cycles for the preservation of mental sanity
-- otherwise I would have to remember to add an "atomic_thread_fence" to
"particular" destructors on inheritance, which is certainly an unpleasant
perspective.
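Something like this, I mean (a hypothetical sketch; 'Y' and its member
are illustrative):

#include <string>

struct Y : X {
    string name; // non-trivial member that the destructor must read

    ~Y()
    {
        // required if the reference-count decrement omits the acquire
        atomic_thread_fence(memory_order_acquire);
        // ... member destructors run after this body and may read freely
    }
};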
Helge
> Anthony Williams <antho...@gmail.com> wrote:
>
>> Yes, I was forgetting that there may be other fields in the object than
>> just the reference count, even in the case of the trivial
>> destructor. You must ensure that all writes into the object are complete
>> before the memory block is reclaimed, which requires a store-release
>> that maps to a load-acquire in or before the destructor.
>
> CPU#1 does the following:
>
> 1. x->data=42;
> 2. fence #StoreStore|LoadStore
> 3. decrement refcount
>
> This means that when CPU#2 discovers x->refcount==1, the cache block
> containing "x->data" exists in CPU#1 in the "modified" state (this is
> enforced by the store ordering in CPU#1). The corresponding cache block
> of CPU#2 may or may not be invalid (it may contain "stale" data that
> predates the assignment from step 1). Reading x->data from CPU#2
> might therefore not see the value assigned by CPU#1. So far there is no
> disagreement I think.
True.
> Now the crucial point is that, with a trivial destructor, there is no
> legal way that the data *can* be read; it can only be written.
>
> CPU#2 may and will overwrite x->data (or any other data member, for that
> matter), and I assert that there is no ordering problem with the store
> from CPU#1: CPU#2 cannot make this store before it "knows" that
> refcount==1, and as soon as it knows this fact, the store by CPU#1 is
> already completed in the sense that CPU#1 already has the corresponding
> cacheline in "modified" state. Cache coherency now requires that CPU#2
> cannot create a "second" modified version of the same cacheline, but
> must first import the modified state from CPU#1.
Here I disagree. Unless CPU#2 uses a #LoadStore barrier after the read
of the refcount (i.e. a load acquire), then a write to the memory
occupied by x->data after the read of the refcount can be reordered
before it, and thus physically occur before the write from CPU#1.
> CPU#2 may *speculate* that it will see refcount==1 and precompute
> everything, but I don't think any current CPU can "publish" speculated
> stores (since the speculation may turn out wrong but other CPUs might
> already have acted on this). Therefore CPU#1 cannot "see" this, and
> its access is unconditionally safe.
If CPU#1 flushes the write buffer for the cache line including refcount
before it flushes the buffer for the cache line including data then
CPU#2 can "know" that refcount==1 before it sees the write to x->data
from CPU#1. CPU#2 is therefore not "publishing" a speculated store, but
publishing a real non-speculative store.
> CPU#2 cannot verify its speculation until it sees the refcount==1, in
> which case CPU#1 has x->data already in modified state. CPU#2 cannot
> just "discard" the writes it successfully speculated, but must execute
> them as if they happened at the point in time where it could verify its
> speculation (or any time later), but no sooner.
Only if the necessary memory barriers are in place.
> Therefore the writes of CPU#2 will always override those of CPU#1, and
> it does not matter whether loads of CPU#2 see the stores made by CPU#1.
I disagree. This is certainly not the case under the C++0x memory model,
and I believe that it is not necessarily the case for real CPUs
either.
>> The set of cases in which it's OK to omit the acquire gets smaller the
>> more you think about it.
>
> The "trivial destructor" requirement is already harsh, but not something
> that is completely unthinkable.
>
> Again I completely agree with you that the "acquire" simply cannot be
> omitted to satisfy the C++0x memory model, but the instruction that it
> translates to on "physical" CPUs can very well be.
Possibly. By using the C++0x atomics you relinquish the precise level of
control you get from using assembler.
> Whether this hurts is another question; probably it's a good idea to
> just trade in a few clock cycles for the preservation of mental sanity
> -- otherwise I would have to remember to add an "atomic_thread_fence" to
> "particular" destructors on inheritance, which is certainly an unpleasant
> perspective.
If you're going to worry about those few clock cycles then you probably
ought to just use platform-specific assembler and be done with it. At
least that way you won't get caught out when porting to a new platform
as you'll have to recode the assembler anyway.
But the write by CPU#2 to x->data may physically occur only after the
read of refcount turns out to be "1", as it is control-dependent on this
fact (speculated writes are forbidden from becoming visible!). CPU#1 is,
due to its barrier, forced to physically complete its own access to
x->data before physically writing "1" to refcount.
>> CPU#2 may *speculate* that it will see refcount==1 and precompute
>> everything, but I don't think any current CPU can "publish" speculated
>> stores (since the speculation may turn out wrong but other CPUs might
>> already have acted on this). Therefore CPU#1 cannot "see" this, and
>> its access is unconditionally safe.
>
> If CPU#1 flushes the write buffer for the cache line including refcount
> before it flushes the buffer for the cache line including data then
The "LoadStore|StoreStore" barrier in between prevents exactly that! If
that were not true, the whole C++0x memory model would be
unimplementable:
thread #1
1. stores data
2. LoadStore|StoreStore fence
3. store atomic variable
thread #2
4. load atomic variable
5. LoadLoad|LoadStore fence
6. read data (but writes from step 1 still not flushed???)
Yes, I realize there may be strange corner cases like ppc if one of the
stores involves non-cacheable memory... but let's just assume
coherent, cacheable memory, as anything else is far out of scope.
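In C++0x terms, the same pairing looks like this (a compilable sketch;
'a' and 'payload' are illustrative names):

#include <cassert>

atomic<int> a(0);
int payload;

void writer() // thread #1
{
    payload = 42;                      // 1. stores data
    a.store(1, memory_order_release);  // 2+3. fence + store atomic variable
}

void reader() // thread #2
{
    if (a.load(memory_order_acquire) == 1) // 4+5. load + fence
        assert(payload == 42);             // 6. read data -- must be visible
}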
>> Again I completely agree with you that the "acquire" simply cannot be
>> omitted to satisfy the C++0x memory model, but the instruction that
>> it translates to on "physical" CPUs can very well be.
>
> Possibly. By using the C++0x atomics you relinquish the precise level
> of control you get from using assembler.
The piece that I consider sort of "missing" here is a way for the
compiler to discover that it need not issue a barrier instruction for
the memory_order_acquire fence -- it could do that if it knew that the
fence is only supposed to affect the "to be deleted" object (and could
infer from that that the object is not going to be read). There is no
way to tell it; the fence is "global".
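What I would want is something along these lines (purely hypothetical;
no such interface exists in C++0x):

// Hypothetical object-scoped acquire fence. Today it can only be
// implemented as the global fence below; a compiler that understood
// the scope could elide the barrier instruction when *p is provably
// never read afterwards.
template <typename T>
inline void acquire_fence_for(T* /*p*/)
{
    atomic_thread_fence(memory_order_acquire);
}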
Helge
no, it can never discard. Noting that relaxed operations can be
implemented without fences on any CPU I know of, this code:
atomic<int> x(0);

thread#1:
    int expected = 0;
    x.compare_exchange_strong(expected, 0, memory_order_relaxed);

thread#2:
    if (x.load(memory_order_relaxed) == 0)
        x.store(42, memory_order_relaxed);
would otherwise be allowed to yield "0" (if a "speculated" store of 42
were discarded due to an intervening write of 0), violating either
"atomicity" or "happens before" between load and store.
Helge
> Anthony Williams <antho...@gmail.com> wrote:
>> Helge Bahmann <bah...@f1mac21.infpool.tu-freiberg.de> writes:
>>> CPU#2 cannot verify its speculation until it sees the refcount==1, in
>>> which case CPU#1 has x->data already in modified state. CPU#2 cannot
>>> just "discard" the writes it successfully speculated, but must execute
>>> them as if they happened at the point in time where it could verify its
>>> speculation (or any time later), but no sooner.
>>
>> Only if the necessary memory barriers are in place.
>
> no, it can never discard.
The CPU cannot just discard writes. That wasn't what I was trying to
say. The "speculative stores" can happen as soon as the CPU knows that
the condition was true. If the condition was based on a relaxed load,
and the stores are themselves relaxed stores then they can appear out of
order to other CPUs.
> Noting that relaxed operations can be
> implemented without fences on any CPU I know of, this code:
>
> atomic<int> x(0);
>
> thread#1:
>     int expected = 0;
>     x.compare_exchange_strong(expected, 0, memory_order_relaxed);
>
> thread#2:
>     if (x.load(memory_order_relaxed) == 0)
>         x.store(42, memory_order_relaxed);
>
> would otherwise be allowed to yield "0" (if a "speculated" store of 42
> were discarded due to an intervening write of 0), violating either
> "atomicity" or "happens before" between load and store.
The CAS is atomic and thus indivisible. Since the expected value and the
value to be stored are the same it cannot modify x. The if in thread#2
thus cannot fail, and the store must happen.
okay sorry, misread that one
> The "speculative stores" can happen as soon as the CPU knows that
> the condition was true.
Correct, and the entire point is that for CPU#2 to be able to verify its
speculation, CPU#1 must first have performed its "atomic store". This
atomic store on the other hand is observed by CPU#2 strictly after any
other memory access by CPU#1 to the object due to the
StoreStore|LoadStore fence in between.
Or, put the other way around, CPU#1 cannot see any speculated writes by
CPU#2 until it has written the condition necessary for CPU#2 to verify
its speculation to memory. A StoreStore|LoadStore fence before writing
this condition is therefore sufficient to be shielded against whatever
CPU#2 wants to do with the memory.
> If the condition was based on a relaxed load, and the stores are
> themselves relaxed stores then they can appear out of order to other
> CPUs.
Out of order -- yes. In this particular case, any reordering of accesses
to the object by CPU#1 and overwriting of the object by CPU#2 would
however be a violation of causality and is thus not physically possible.
Interestingly, I cannot find a mention of "causality" in the standards
draft, so "prescient" CPUs are clearly permissible.
Helge
______________________________________________________________
void
refcount_destroy(X* x)
{
    atomic_thread_fence(memory_order_acquire);
    delete x;
}

void
fn_5(X* x)
{
    x->data = 42;
    if (x->refcount.load(memory_order_relaxed) == 1)
    {
        refcount_destroy(x);
    }
    else if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        refcount_destroy(x);
    }
}
______________________________________________________________
Eliminate the loop?
nice! this is smart indeed, didn't think of it
Thanks,
Helge
Makes sense for x86. For Power and the like there will be a loop anyway,
and the unnecessary release barrier is not free.
regards,
alexander.
Agreed.
> For Power and alike there will be a loop anyway and
Yes. There will be a loop for any architecture that forces one to implement
the wait-free fetch-and-add instruction with a damn CAS or LL/SC loop. IMHO,
it's a shame. AFAICT, fetch-and-add should not have to be manually
implemented in software with a loop!
> unnecessary release barrier is not free.
It seems that the window of opportunity to "skip" the release barrier is
fairly narrow after the initial load of the reference count was found to not
equal 1. AFAICT, you handle this by using a `memory_order_relaxed' barrier
in the failure case of the `compare_exchange_weak()'. I am not "totally"
convinced that it's worth further optimization after a failed comparison to
1. Humm...
I would say that speculating on a step-contention free execution is a
valid assumption -- if there _is_ interference, the cost of bouncing the
cacheline between CPUs should easily outweigh the superfluous barrier
(something like 400 cycles vs. 50 cycles on ppc AFAICT).
Helge
I would be OK with making speculative stores visible, as long as it
was somehow communicated as "speculative".
Then once "reality" kicks in CPU1 might need to tell CPU2 to rollback
the speculations.
Maybe CPUs are not doing this now, but considering the number of CPUs
keeps increasing, and if "speculation" is worthwhile in general, then
it could be worthwhile in the future...
Tony
Agreed. This is why I assume the possibility of the reference counter
dropping to zero _after_ the initial load and failed compare to 1 to be a
fairly rare occurrence.
> if there _is_ interference, the cost of bouncing the
> cacheline between CPUs should easily outweigh the superfluous barrier
> (something like 400 cycles vs. 50 cycles on ppc AFAICT).
Indeed. I always like to compare non-blocking algorithms to the overhead
generated by contention-less access to a "traditional" spinlock. Why not use
locks that are more lightweight, especially when you can explicitly design
your application to reduce contention?
;^)
The thing is that, absent support for weak pointers/references, storing
zero in the reference counter (a trip to memory to store zero) and the
associated release barrier are absolutely unneeded, with or without
contention. So why do it?
regards,
alexander.