
atomic counter


Marc

Oct 9, 2011, 8:27:42 AM
Hello,

I am trying to make our reference counting implementation thread-safe
(for a very basic definition of thread-safe), which essentially means
making the counter atomic. However, there are many options to atomic
operations, in particular concerning the memory model, and I want to
make sure I get it right.

The operations I use are increment (no return), decrement (and check
if the return value is 0), store (to initialize) and load (to check if
the object is not shared and thus safe to write into).
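In code, a minimal sketch of the interface I have in mind (the names
are made up for this post, not our real class; memory orders omitted,
since that is the question below):

#include <atomic>

struct refcount {
    std::atomic<int> c;

    void init()    { c.store(1); }                 // store, to initialize
    void add_ref() { ++c; }                        // increment, no return value used
    bool release() { return c.fetch_sub(1) == 1; } // decrement; true on last reference
    bool unique()  { return c.load() == 1; }       // load: not shared, safe to write
};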

It looks to me like memory_order_relaxed should be good enough for
this purpose, as I don't see what synchronization would be needed with
the rest of memory, but I may be missing something fundamental there.

For things to work properly, I have to check the return value of the
decrement function, like:

if(--count==0) cleanup();

If instead I do:

--count;
if(count==0) cleanup();

I risk a double free. And if I do:

if(count>1) --count;
else cleanup();

all I risk is a leak (not as bad, and I may have to live with it until
we can change the API of this class).

Note that I wrote -- for simplicity, but in order to specify the
memory model I would have to use fetch_sub. I am a bit surprised that
the default memory model is not a template argument of std::atomic<>.
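Spelled out, the decrement-and-check would look something like this (a
sketch; note that fetch_sub returns the previous value, so the thread
that drops the last reference sees 1):

if (count.fetch_sub(1, std::memory_order_relaxed) == 1)
    cleanup();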

Also note that I am only after basic thread-safety. If an object has a
single reference, I can safely modify it as other threads have no
business even reading it.

Any correction, confirmation or advice would be very welcome.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Agents Marlow

Oct 11, 2011, 1:18:38 PM
On Oct 9, 1:27 pm, Marc <marc.gli...@gmail.com> wrote:
> Hello,
>
> I am trying to make our reference counting implementation thread-safe
> (for a very basic definition of thread-safe), which essentially means
> making the counter atomic.

I would take a look at the implementation of the atomic counter in the
Poco Foundation class library.

Regards,

Andrew Marlow

Pete Becker

Oct 11, 2011, 11:18:56 PM
On 2011-10-09 12:27:42 +0000, Marc said:

> I am trying to make our reference counting implementation thread-safe
> (for a very basic definition of thread-safe), which essentially means
> making the counter atomic. However, there are many options to atomic
> operations, in particular concerning the memory model, and I want to
> make sure I get it right.
>
> The operations I use are increment (no return), decrement (and check
> if the return value is 0), store (to initialize) and load (to check if
> the object is not shared and thus safe to write into).
>
> It looks to me like memory_order_relaxed should be good enough for
> this purpose, as I don't see what synchronization would be needed with
> the rest of memory, but I may be missing something fundamental there.
>

Suppose there are two references to the object, in two different
threads. One thread decrements the reference count, then the other
does. If the decrement from the first thread isn't seen by the second
thread, the second thread won't see that the count has become zero, and
the object won't get destroyed. So memory_order_relaxed won't work: you
need to ensure that the result of a decrement is visible to another
thread that also needs to decrement the count.

It's easier to get the code right when it's sequentially consistent. In
general, unless you can demonstrate that synchronization is a
bottleneck, don't mess with it.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)

Marc

Oct 11, 2011, 11:19:38 PM
Agents Marlow wrote:

> On Oct 9, 1:27 pm, Marc <marc.gli...@gmail.com> wrote:
>> Hello,
>>
>> I am trying to make our reference counting implementation thread-safe
>> (for a very basic definition of thread-safe), which essentially means
>> making the counter atomic.
>
> I would take a look at the implementation of the atomic counter in the
> Poco Foundation class library.

Thanks for the pointer.
They only implement it lock-free for Windows and Mac, and the
interesting part is that on Mac they use the atomic operations without
a memory barrier (and since Mac used to run on PPC, it makes a
difference). They perform a load by simply reading the variable (an
int), which doesn't offer any protection against compiler
optimizations; I am trying to think of the worst that could happen
because of that...

Andreas Wallner

Oct 12, 2011, 4:38:34 AM
{ Please do not: (1) top-post, (2) quote the banner.
Manually corrected. -mod }


On Oct 12, 5:19 am, Marc <marc.gli...@gmail.com> wrote:
> Agents Marlow wrote:
> > On Oct 9, 1:27 pm, Marc <marc.gli...@gmail.com> wrote:
> >> Hello,
>
> >> I am trying to make our reference counting implementation thread-safe
> >> (for a very basic definition of thread-safe), which essentially means
> >> making the counter atomic.
>
> > I would take a look at the implementation of the atomic counter in the
> > Poco Foundation class library.
>
> Thanks for the pointer.
> They only implement it lock-free for Windows and Mac, and the
> interesting part is that on Mac they use the atomic operations without
> a memory barrier (and since Mac used to run on PPC, it makes a
> difference). They perform a load by simply reading the variable (an
> int), which doesn't offer any protection against compiler
> optimizations; I am trying to think of the worst that could happen
> because of that...

Hi,

I think it's pretty hard (or impossible) to get an atomic counter
right using the interface you described.
What is to prevent a thread from interleaving like this, e.g.:

thread A calls load(), sees writing is safe <context switch>
thread B calls load(), sees writing is safe
thread B increments your atomic
thread B does some operations, does not release the object <context switch>
thread A increments your atomic << shouldn't happen
thread A does some operations << shouldn't happen

If I understood your expected usage wrong, please correct me.

Classically you would have a different API for your atomics, like:
int load();
bool store(int expected, int desired); // false if expected != current value

This way you are safe from the scenario above, since the increment in
thread A would fail, and it would not get write access to the
object.

Your above example would look like:

int count = load();
if (count == 0 && store(0, 1)) cleanup();

How this is implemented depends on your processor architecture. E.g.
on x86 you can use cmpxchg with some barriers etc.; for ARM it's a
little bit easier because of its RISC architecture.
In general, though, your compiler should have some intrinsic functions
that do that (GCC: __sync_bool_compare_and_swap; Microsoft:
InterlockedCompareExchange).

For those compilers you can have a look at:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html and
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/win/compiler_c/intref_cls/common/intref_itanium_lock_atomic.htm.

The advantage: the compilers already see those intrinsics as memory
barriers, and they should be portable across platforms.
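As a sketch, the load()/store(expected, desired) pair from above could
sit on top of the GCC builtins like this (the wrapper is illustrative,
not from a real library):

static int counter;

int load() {
    return __sync_fetch_and_add(&counter, 0);  // atomic read via a no-op RMW
}

// Returns false when 'expected' no longer matches the current value.
bool store(int expected, int desired) {
    return __sync_bool_compare_and_swap(&counter, expected, desired);
}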

Regards,
Andreas

Alexander Terekhov

Oct 12, 2011, 4:55:48 PM
Marc wrote:
>
> Hello,
>
> I am trying to make our reference counting implementation thread-safe
> (for a very basic definition of thread-safe), which essentially means

if (c.atomic_fetch_sub(1, memory_order_relaxed) == 1)
    atomic_thread_fence(memory_order_acquire),
    cleanup();
else
    atomic_thread_fence(memory_order_release);
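
As a self-contained sketch (the wrapping is illustrative; the fence
placement is the point):

#include <atomic>

std::atomic<int> c;  // the reference count
void cleanup();      // dtor of the reference-counted object

void release() {
    if (c.fetch_sub(1, std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        cleanup();
    } else {
        std::atomic_thread_fence(std::memory_order_release);
    }
}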

regards,
alexander.

Marc

Oct 12, 2011, 4:58:06 PM
Pete Becker wrote:

> On 2011-10-09 12:27:42 +0000, Marc said:
>
>> I am trying to make our reference counting implementation thread-safe
>> (for a very basic definition of thread-safe), which essentially means
>> making the counter atomic. However, there are many options to atomic
>> operations, in particular concerning the memory model, and I want to
>> make sure I get it right.
>>
>> The operations I use are increment (no return), decrement (and check
>> if the return value is 0), store (to initialize) and load (to check if
>> the object is not shared and thus safe to write into).
>>
>> It looks to me like memory_order_relaxed should be good enough for
>> this purpose, as I don't see what synchronization would be needed with
>> the rest of memory, but I may be missing something fundamental there.
>
> Suppose there are two references to the object, in two different
> threads. One thread decrements the reference count, then the other
> does. If the decrement from the first thread isn't seen by the second
> thread, the second thread won't see that the count has become zero, and
> the object won't get destroyed. So memory_order_relaxed won't work: you
> need to ensure that the result of a decrement is visible to another
> thread that also needs to decrement the count.

Uh? I am still calling an atomic decrement function. The standard says:

"Note: Atomic operations specifying memory_order_relaxed are relaxed
with respect to memory ordering. Implementations must still guarantee
that any given atomic access to a particular atomic object be
indivisible with respect to all other atomic accesses to that object."

I thought the memory order was mostly concerned with what happened to
the rest of the memory.

Assuming your interpretation is correct, what is memory_order_relaxed
good for?

> It's easier to get the code right when it's sequentially consistent. In
> general, unless you can demonstrate that synchronization is a
> bottleneck, don't mess with it.

Well yes, of course. I did some experiments (using boost::shared_ptr
or the libstdc++ std::atomic (just a place-holder implementation, I
know)), and the slow-down was unacceptable, which led me to
reimplement it. The performance hit is now acceptable, but still
noticeable enough that I am not sure about enabling it by default for
MT programs (some programs have several threads that don't share any
ref-counted objects and would pay the price for nothing).

Our main target is x86/x86_64 where as far as I understand (please
correct me if I am wrong) the memory barrier is unavoidable (implied
by any atomic operation), but I am still interested in not penalizing
our users on other platforms if I don't have to.

And it is intellectually satisfying to understand things ;-)

Thank you for your answer.



Marc

Oct 12, 2011, 5:02:02 PM
Andreas Wallner wrote:

> I think it's pretty hard (or impossible) to get an atomic counter
> right using the interface you described.
> What is hindering a thread from, e.g.:
> thread A calls load(), sees writing is safe <context switch>
> thread B calls load(), sees writing is safe
> thread B increments your atomic
> thread B does some operations, does not release the object <context
> switch>
> thread A increments your atomic << shouldn't happen
> thread A does some operations << shouldn't happen
>
> If I understood your expected usage wrong, please correct me.
>
> Classically you would have a different API for your atomics, like:
> int load();
> bool store(int expected, int desired); // false if expected != current value
>
> This way you are safe from the scenario above, since the increment in
> thread A would fail, and it would not get write access to the
> object.
>
> Your above example would look like:
>
> int count = load();
> if (count == 0 && store(0, 1)) cleanup();

Thank you for answering.

From your description, it sounds like you are considering the use of
this counter as a lock, whereas I only want to use it for reference
counting, something similar to a shared_ptr.

Ideally, the only two functions I would have are ++ (add a reference,
when I copy the object) and -- (when I destruct the object; it returns
something telling me whether that was the last reference). I add a
store function only for initialization, but a default constructor that
sets the counter to 0 would be enough. And I could do without a load;
I only add it to optimize an "unshare" function (if there is a single
reference, don't do anything, but if it is shared, make the object
point to a fresh copy of the data that it now owns), as sketched below.
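
With made-up names (data_, clone, release), unshare could look like:

#include <atomic>

struct data {
    std::atomic<int> count{1};
    data* clone() const;   // stands in for a deep copy; the copy starts with count == 1
};

struct handle {
    data* data_;
    void release(data* d); // decrement d->count, destroy on last reference

    void unshare() {
        if (data_->count.load(std::memory_order_relaxed) == 1)
            return;                    // sole owner: safe to write in place
        data* fresh = data_->clone();  // private copy, count == 1
        release(data_);                // drop the reference to the shared data
        data_ = fresh;
    }
};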

> For those compilers you can have a look at:

Yes, I have a list:
__sync_add_and_fetch, atomic_inc_uint, OSAtomicIncrement32,
_InterlockedIncrement, plus some inline asm implementations. But
hopefully std::atomic will soon make all of them useless.
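
For illustration, the kind of per-platform dispatch that list implies
(the shim is hypothetical; with C++11 a single std::atomic<int> and
++count replace all of it):

#if defined(_MSC_VER)
#  include <intrin.h>
#elif defined(__APPLE__)
#  include <libkern/OSAtomic.h>
#endif

inline int ref_inc(volatile int* p) {
#if defined(_MSC_VER)
    return _InterlockedIncrement(reinterpret_cast<volatile long*>(p));
#elif defined(__APPLE__)
    return OSAtomicIncrement32(reinterpret_cast<volatile int32_t*>(p));
#else  // GCC and compatible compilers
    return __sync_add_and_fetch(p, 1);
#endif
}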

> The advantage: the compilers already see those intrinsics as memory
> barriers, and they should be portable across platforms.

I am precisely trying to do away with as many memory barriers as I
can...

Pete Becker

Oct 12, 2011, 8:19:10 PM

Advanced threading. See Alexander Terekhov's message.

Atomic operations get completed without interruption. That ensures that
a different thread doesn't see a value that's not valid. For example,
suppose that storing a pointer takes two bus operations. If the pointer
starts out null, storing a value into it has two steps: store one half
of the pointer, then store the other half. If a context switch occurs
between those two steps, the thread that's switched to might see half
the pointer. Atomic operations ensure that that sort of tearing doesn't
occur.

The other aspect of threaded programming is visibility of changes.
Here's where you have to abandon the single-processor analogies; think
multiple processors. For example, suppose the system has two
processors, and each processor has its own data cache. Each processor
is noodling around with the same variable, so each cache has a copy of
the value of that variable. Writing a new value, even when done
atomically, only directly affects the value in the cache that belongs
to the processor that wrote the value. Unless the new value is copied
to the other processor's cache, the other processor will still see the
old value. memory_order_relaxed says "don't worry, be happy". It's okay
that the values are inconsistent.

>
>> It's easier to get the code right when it's sequentially consistent. In
>> general, unless you can demonstrate that synchronization is a
>> bottleneck, don't mess with it.
>
> Well yes, of course. I did some experiments (using boost::shared_ptr
> or the libstdc++ std::atomic (just a place-holder implementation, I
> know)), and the slow-down was unacceptable, which led me to
> reimplement it. The performance hit is now acceptable, but still
> noticeable enough that I am not sure about enabling it by default for
> MT programs (some programs have several threads that don't share any
> ref-counted objects and would pay the price for nothing).
>
> Our main target is x86/x86_64 where as far as I understand (please
> correct me if I am wrong) the memory barrier is unavoidable (implied
> by any atomic operation), but I am still interested in not penalizing
> our users on other platforms if I don't have to.

On the x86 architecture, pretty much everything is sequentially
consistent. So there's no difference in the generated code between
sequentially consistent visibility and any of the others. Which, in
turn, means that code that uses less than sequentially consistent
visibility and works just fine on x86 systems may fail miserably if it's
just ported to other systems.

>
> And it is intellectually satisfying to understand things ;-)
>

<g> This is tricky stuff. There's not much out there that describes it
in an approachable manner unless you're an expert on hardware
architecture. One good source is Anthony Williams' book, "C++
Concurrency in Action", from Manning Publications. www.manning.com.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)

Marc

Oct 13, 2011, 5:09:33 PM
Alexander Terekhov wrote:

> Marc wrote:
>> I am trying to make our reference counting implementation thread-safe
>> (for a very basic definition of thread-safe), which essentially means
>
> if (c.atomic_fetch_sub(1, memory_order_relaxed) == 1)
> atomic_thread_fence(memory_order_acquire),
> cleanup();
> else
> atomic_thread_fence(memory_order_release);

(trying to understand the code, not saying there is anything wrong
with it)

The standard warns that "Fences cannot, in general, be used to restore
sequential consistency for atomic operations with weaker ordering
specifications."
Good thing this is not the case here :-)

The release and acquire look like they are there so that cleanup() is
up-to-date on what other copies (sharing the same data) did to the
data (although when there are several copies, no one is supposed to
touch the data). Or maybe to make sure that cleanup happens at the
right time? And should there be something (an acquire? nothing?) on
the increment?

I'd be very grateful if you could explain the rationale for the use of
acquire and release fences in your code above. It looks like you put
just the minimal synchronization in just the right places, so it would
certainly help me a lot to understand this example.

Sometimes I envy people who program in a language with a garbage
collector: no need for a counter, hence no need for atomics here...

Marc

Oct 13, 2011, 5:11:04 PM
Pete Becker wrote:

> On 2011-10-12 20:58:06 +0000, Marc said:
>
>> Pete Becker wrote:
>>> Suppose there are two references to the object, in two different
>>> threads. One thread decrements the reference count, then the other
>>> does. If the decrement from the first thread isn't seen by the second
>>> thread, the second thread won't see that the count has become zero, and
>>> the object won't get destroyed. So memory_order_relaxed won't work: you
>>> need to ensure that the result of a decrement is visible to another
>>> thread that also needs to decrement the count.
>>
>> Uh? I am still calling an atomic decrement function. The standard says:
>>
>> "Note: Atomic operations specifying memory_order_relaxed are relaxed
>> with respect to memory ordering. Implementations must still guarantee
>> that any given atomic access to a particular atomic object be
>> indivisible with respect to all other atomic accesses to that object."
>>
>> I thought the memory order was mostly concerned with what happened to
>> the rest of the memory.

I stumbled over several other sources, all pointing in the same
direction (that relaxed atomics suffice for the counter itself), which
confuses me.

In the standard:
"All modifications to a particular atomic object M occur in some
particular total order, called the modification order of M."

In N2145:
"For example, an atomic increment operation may be used simply to count
the number of times a function is called, as in a profiler. This
requires atomicity, but no ordering constraints.
[...]
An event counter is not part of the communication between threads, and
so it can use faster primitives.
[...]
void inc() { atomic_fetch_add_relaxed( &au, 1 ); }
"

http://labs.qt.nokia.com/2009/10/06/memory-ordering-semantics/
"For example, QAtomicInt offers ref() and unref(), which are just a
wrapper around fetchAndAddRelaxed(1) and fetchAndAddRelaxed(-1). This
means the reference count is atomic, but nothing else."
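
Both quotes describe the same pattern; as a sketch, a pure event
counter where only the final total matters:

#include <atomic>

std::atomic<long> calls(0);

// Profiler-style event counter: atomicity keeps increments from being
// lost, but no ordering with the rest of memory is required.
void profiled_function() {
    calls.fetch_add(1, std::memory_order_relaxed);
    // ... actual work ...
}
// The total is only read after all threads are joined.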

> Atomic operations get completed without interruption. That ensures that
> a different thread doesn't see a value that's not valid. For example,
> suppose that storing a pointer takes two bus operations. If the pointer
> starts out null, storing a value into it has two steps: store one half
> of the pointer, then store the other half. If a context switch occurs
> between those two steps, the thread that's switched to might see half
> the pointer. Atomic operations ensure that that sort of tearing doesn't
> occur.
>
> The other aspect of threaded programming is visibility of changes.
> Here's where you have to abandon the single-processor analogies; think
> multiple processors. For example, suppose the system has two
> processors, and each processor has its own data cache. Each processor
> is noodling around with the same variable, so each cache has a copy of
> the value of that variable. Writing a new value, even when done
> atomically, only directly affects the value in the cache that belongs
> to the processor that wrote the value. Unless the new value is copied
> to the other processor's cache, the other processor will still see the
> old value. memory_order_relaxed says "don't worry, be happy". It's okay
> that the values are inconsistent.

Makes sense, thanks.

> <g> This is tricky stuff. There's not much out there that describes it
> in an approachable manner unless you're an expert on hardware
> architecture. One good source is Anthony Williams' book, "C++
> Concurrency in Action", from Manning Publications. www.manning.com.

I wonder how the library is going to react when I ask them to buy a
book that's not written yet :-)



Luis A. Donoso

Oct 13, 2011, 5:12:26 PM

That's true: for this concrete code, on x86 fences are only necessary
to prevent compiler reordering, because the x86 memory model guarantees
that writes are not reordered with respect to reads that come earlier
in program order.

Maybe there is no penalty because the compiler doesn't emit fence
instructions in the x86 assembly, as they are not necessary.



Alexander Terekhov

Oct 14, 2011, 7:16:16 AM
"Luis A. Donoso" wrote:
[...]

> Maybe there is no penalty because the compiler doesn't emit fence
> instructions in the x86 assembly, as they are not necessary.

See http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

At the assembly level, the x86 mappings for the C++ fences are:

Consume Fence: <ignore>
Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: MFENCE
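
In other words (a sketch; "no instruction" means the fence only
constrains the compiler):

#include <atomic>

void fences() {
    std::atomic_thread_fence(std::memory_order_acquire);  // x86: nothing emitted
    std::atomic_thread_fence(std::memory_order_release);  // x86: nothing emitted
    std::atomic_thread_fence(std::memory_order_seq_cst);  // x86: MFENCE
}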

regards,
alexander.

Alexander Terekhov

Oct 15, 2011, 8:16:45 AM
Marc wrote:
>
> Alexander Terekhov wrote:
> > Marc wrote:
> >> I am trying to make our reference counting implementation thread-safe
> >> (for a very basic definition of thread-safe), which essentially means
> >
> > if (c.atomic_fetch_sub(1, memory_order_relaxed) == 1)
> > atomic_thread_fence(memory_order_acquire),
> > cleanup();
> > else
> > atomic_thread_fence(memory_order_release);
>
> (trying to understand the code, not saying there is anything wrong
> with it)
>
> The standard warns that "Fences cannot, in general, be used to restore
> sequential consistency for atomic operations with weaker ordering
> specifications."

That's because under memory models like Itanium fences do not ensure
'remote write atomicity' (cases with 3+ processors/threads).

(On Itanium remote write atomicity for stores is done via st.rel
instruction.)

Think of sequential consistency as 'remote write atomicity' + 'no
reordering'.

For more details on relaxation, see:

http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
(Figure 8: Simple categorization of relaxed models. ...)

> Good thing this is not the case here :-)
>
> The release and acquire look like they are there so that cleanup() is

cleanup() is basically a call to the dtor of the reference-counted object.

> up-to-date on what other copies (sharing the same data) did to the
> data (although when there are several copies, noone is supposed to
> touch the data). Or maybe to make sure that cleanup happens at the
> right time? And should there be something (an acquire? nothing?) on
> the increment?

Increments for reference counting under basic thread-safety can be
memory_order_relaxed, without any fences.

Decrements for reference counting under basic thread-safety can be
memory_order_relaxed, without the memory_order_acquire fence, for the
case when the resulting value of the decrement == zero, if the
reference-counted object is immutable.

Decrements for reference counting under basic thread-safety can be
memory_order_relaxed, with a more relaxed fence than
memory_order_release, for the case when the resulting value of the
decrement != zero, if the reference-counted object is immutable. I call
such a fence a "sink load barrier" (release = "sink load barrier" +
"sink store barrier").

At some point C/C++ will support a more efficient memory model for
things like read-write locking, under which you obviously don't
want/need to restrict reordering of writes in the case of a READ
lock...

regards,
alexander.

Keith H Duggar

Oct 15, 2011, 8:20:21 AM
On Oct 14, 7:16 am, Alexander Terekhov <terek...@web.de> wrote:
> "Luis A. Donoso" wrote:
>
> [...]
>
> > Maybe there is no penalty because the compiler doesn't emit fence
> > instructions in the x86 assembly, as they are not necessary.
>
> See http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>
> At the assembly level, the x86 mappings for the C++ fences are:
>
> Consume Fence: <ignore>
> Acquire Fence: <ignore>
> Release Fence: <ignore>
> Acq_Rel Fence: <ignore>
> Seq_Cst Fence: MFENCE

All true at the processor level. However, there can be compiler
level penalties. For example the compiler may only expose overly
strong compiler fencing that spills more registers than needed
or imposes stronger ordering than needed.

KHD

Pete Becker

Oct 15, 2011, 9:21:34 AM

On 2011-10-13 21:11:04 +0000, Marc said:

> Pete Becker wrote:
>
>> <g> This is tricky stuff. There's not much out there that describes it
>> in an approachable manner unless you're an expert on hardware
>> architecture. One good source is Anthony Williams' book, "C++
>> Concurrency in Action", from Manning Publications. www.manning.com.
>
> I wonder how the library is going to react when I ask them to buy a
> book that's not written yet :-)

It's available as a work in progress, and it's pretty near complete. But you
might have to buy it for yourself. It's available as a PDF.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference" (
www.petebecker.com/tr1book)



Alexander Terekhov

Oct 15, 2011, 11:55:53 AM
Keith H Duggar wrote:
>
> On Oct 14, 7:16 am, Alexander Terekhov <terek...@web.de> wrote:
>> "Luis A. Donoso" wrote:
>>
>> [...]
>>
>>> Maybe there is no penalty because the compiler doesn't emit fence
>>> instructions in the x86 assembly, as they are not necessary.
>>
>> See http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>>
>> At the assembly level, the x86 mappings for the C++ fences are:
>>
>> Consume Fence: <ignore>
>> Acquire Fence: <ignore>
>> Release Fence: <ignore>
>> Acq_Rel Fence: <ignore>
>> Seq_Cst Fence: MFENCE
>
> All true at the processor level. However, there can be compiler
> level penalties. For example the compiler may only expose overly
> strong compiler fencing that spills more registers than needed
> or imposes stronger ordering than needed.

For C/C++ you can check the assembly code and report a problem such as
"overly strong compiler fencing that spills more registers than needed
or imposes stronger ordering than needed" to compiler's vendor.

regards,
alexander.

Pete Becker

Oct 15, 2011, 11:56:22 AM
On 2011-10-15 12:16:45 +0000, Alexander Terekhov said:

> Increments for reference counting under basic thread-safety can be
> memory_order_relaxed, without any fences.
>
> Decrements for reference counting under basic thread-safety can be
> memory_order_relaxed, without the memory_order_acquire fence, for the
> case when the resulting value of the decrement == zero, if the
> reference-counted object is immutable.
>
> Decrements for reference counting under basic thread-safety can be
> memory_order_relaxed, with a more relaxed fence than
> memory_order_release, for the case when the resulting value of the
> decrement != zero, if the reference-counted object is immutable. I
> call such a fence a "sink load barrier" (release = "sink load barrier"
> + "sink store barrier").

All this recalls my original, somewhat tongue-in-cheek, objection to putting thread support into C++: it encourages programmers to try to do things that they don't understand.

Not that Alexander Terekhov doesn't understand this, but most people reading this thread won't understand it (including me).

That's why most programmers should stick to sequentially consistent program models. They're easier to analyze and easier to get right.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The Standard C++ Library Extensions: a Tutorial and Reference" (www.petebecker.com/tr1book)


Alexander Terekhov

Oct 15, 2011, 2:34:00 PM
Pete Becker wrote:
[...]
> That's why most programmers should stick to sequentially consistent
> program models. They're easier to analyze and easier to get right.

The real world runs on multiprocessors that are NOT sequentially
consistent and programmers should understand memory models in the most
relaxed form as a basic programming skill, I think.

regards,
alexander.



Pete Becker

Oct 16, 2011, 9:27:27 AM
On 2011-10-15 18:34:00 +0000, Alexander Terekhov said:

> Pete Becker wrote:
> [...]
>> That's why most programmers should stick to sequentially consistent
>> program models. They're easier to analyze and easier to get right.
> The real world runs on multiprocessors that are NOT sequentially
> consistent and programmers should understand memory models in the most
> relaxed form as a basic programming skill, I think.

Maybe next year. <g>

More generally, most programmers should not be writing multi-threaded programs. And many of the multi-threaded programs that they will write won't benefit from having multiple threads. Threads themselves are an advanced language feature, and relaxed memory models are an advanced feature of threads. Programmers should be cautious about venturing into these areas. Most simply don't know enough to do it well.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The Standard C++ Library Extensions: a Tutorial and Reference" (www.petebecker.com/tr1book)


Marc

Oct 16, 2011, 9:28:09 AM
Thank you very much for this more detailed explanation. Clearly I have
a lot to learn still, but it feels like now I am walking in the night
instead of being blind :-)

Alexander Terekhov wrote:
> Marc wrote:
>> Alexander Terekhov wrote:
>>> Marc wrote:
>>>> I am trying to make our reference counting implementation thread-safe
>>>> (for a very basic definition of thread-safe), which essentially means
>>>
>>> if (c.atomic_fetch_sub(1, memory_order_relaxed) == 1)
>>> atomic_thread_fence(memory_order_acquire),
>>> cleanup();
>>> else
>>> atomic_thread_fence(memory_order_release);

By the way, I hope compilers will (eventually) be clever enough to
omit the fences on architectures where the fetch_sub already implies a
memory barrier.

>> (trying to understand the code, not saying there is anything wrong
>> with it)
>>
>> The standard warns that "Fences cannot, in general, be used to restore
>> sequential consistency for atomic operations with weaker ordering
>> specifications."
>
> That's because under memory models like Itanium fences do not ensure
> 'remote write atomicity' (cases with 3+ processors/threads).

I would have thought that what the standard calls "atomic operations"
already had the 'remote write atomicity' guarantee, but it couldn't be
that simple :-)

> Think of sequential consistency as 'remote write atomicity' + 'no
> reordering'.

Cool, I had vaguely identified those 2 points, but it helps to have it
clarified this way.

> http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

Nice (I haven't finished it yet).

> cleanup() is basically a call to dtor of referenced counted object.

Yes.

> Increments for reference counting under basic thread-safety can be
> memory_order_relaxed, without any fences.
>
> Decrements for reference counting under basic thread-safety can be
> memory_order_relaxed, without the memory_order_acquire fence, for the
> case when the resulting value of the decrement == zero, if the
> reference-counted object is immutable.

Ah, yes, we sometimes use refcounting for immutable objects, but
sometimes we use it for objects that, when refcount==1, are mutable.
Since I am trying to make the refcounting not safer than a regular
object, I vaguely hoped that it wouldn't hurt.

I am trying to see what the acquire does exactly in your code. It
looks like it says: don't cleanup before setting the counter to 0, but
then I don't see how that's related to mutability.

> Decrements for reference counting under basic thread-safety can be
> memory_order_relaxed, with a more relaxed fence than
> memory_order_release, for the case when the resulting value of the
> decrement != zero, if the reference-counted object is immutable. I
> call such a fence a "sink load barrier" (release = "sink load barrier"
> + "sink store barrier").

(trying to understand why this is the right answer and justifying it a
posteriori)

Hmm, I guess this barrier means: after decrementing the counter, it is
too late to access the shared data (and since it is immutable,
"access" can only be a load), because you never know, the other thread
might destruct it in the meantime. But then isn't there still a window
between the atomic op and the fence?

The fence is not needed in the cleanup case because there is no other
thread to observe a different order (?). And we don't need a symmetric
hoist load barrier for increment because reading the data before
incrementing the counter is safe.

> At some point C/C++ will support more efficient memory model for things
> like read-write locking under which you obviously don't want/need to
> restrict reordering of writes in the case a READ lock...

That's a very good point.


I still have that strange impression that synchronization is an all or
nothing operation. I mostly need to order operations on the counter
with respect to those on the shared data and somehow end up
synchronizing a whole lot more...

SG

Oct 16, 2011, 9:28:52 AM
On 15 Okt., 20:34, Alexander Terekhov wrote:
> Pete Becker wrote:
>
> [...]
>
>> That's why most programmers should stick to sequentially consistent
>> program models. They're easier to analyze and easier to get right.
>
> The real world runs on multiprocessors that are NOT sequentially
> consistent and programmers should understand memory models in the most
> relaxed form as a basic programming skill, I think.

Why should they? It's not like programmers have to implement highly
performant concurrent data structures like queues or hashtables every
day. That's a job for some super smart expert. These low level memory
model aspects are for experts to build nice and easy-to-use libraries
on. As long as programmers have a vague idea of what a data race is,
what they can do to avoid them (amotics, mutexes and locks) and as
long as they can read a library's documentation everything should be
fine. There is no need for average Joe to get dirty with things beyond
sequential consistency. I actually consider myself to be average Joe
w.r.t. this part of the language. But hey, if you want to write an
article about it (like "c++ memory model for dummies") I'm all ears...

Cheers!
SG

Francis Glassborow

Oct 16, 2011, 7:09:47 PM
On 16/10/2011 14:28, SG wrote:
> On 15 Okt., 20:34, Alexander Terekhov wrote:
>> Pete Becker wrote:
>>
>> [...]
>>
>>> That's why most programmers should stick to sequentially consistent
>>> program models. They're easier to analyze and easier to get right.
>>
>> The real world runs on multiprocessors that are NOT sequentially
>> consistent and programmers should understand memory models in the most
>> relaxed form as a basic programming skill, I think.
>
> Why should they? It's not like programmers have to implement highly
> performant concurrent data structures like queues or hashtables every
> day. That's a job for some super smart expert. These low level memory
> model aspects are for experts to build nice and easy-to-use libraries
> on. As long as programmers have a vague idea of what a data race is,
> what they can do to avoid them (amotics, mutexes and locks) and as
> long as they can read a library's documentation everything should be
> fine. There is no need for average Joe to get dirty with things beyond
> sequential consistency. I actually consider myself to be average Joe
> w.r.t. this part of the language. But hey, if you want to write an
> article about it (like "c++ memory model for dummies") I'm all ears...
>

However programmers should not be writing multi-threaded code unless
they understand the significance of multi-processors/cores. Too many do
not yet grasp the basic fact that multi-threaded code that works ok on a
single processor core can fail unpredictably when the hardware is
upgraded and the program runs using more than one core.

BTW is there a way to mark a program so that the OS will not assign
multiple cores to it even if it is multi-threaded?
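
One OS-specific mechanism, as a sketch (assuming Linux and
sched_setaffinity; not portable C++):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Restrict the calling thread to CPU 0; threads created afterwards
// inherit the mask, so the whole program stays on one core if this
// runs first in main().
int pin_to_one_core(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                   // allow CPU 0 only
    if (sched_setaffinity(0, sizeof(set), &set) != 0) { // 0 = calling thread
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}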

Francis

Alexander Terekhov

Oct 17, 2011, 5:50:10 PM
SG wrote:
[...]
> on. As long as programmers have a vague idea of what a data race is,
> what they can do to avoid them (amotics, . . .

Yeah, right, 'amotics' ;-)

Lock-free atomics are inherently racy, and the default memory model for
lock-free atomics shall be the most relaxed one, not sequential
consistency (SC).

On PPC/POWER/PowerPC, for SC, you would need the heaviest fence,
'hwsync', between each pair of SC lock-free atomic operations; there
are examples showing this for each pair: RR, RW, WR, and WW.

Lock-based atomics do not need the 'hwsync' instruction, but nobody
needs lock-based atomics because locking can be done more efficiently
at a higher level.

regards,
alexander.

Alexander Terekhov

Oct 17, 2011, 5:51:22 PM
Marc wrote:
[...]
> > That's because under memory models like Itanium fences do not ensure
> > 'remote write atomicity' (cases with 3+ processors/threads).
>
> I would have thought that what the standard calls "atomic operations"
> already had the 'remote write atomicity' guarantee, couldn't be that
> simple :-)

It does if you use atomic_thread_fence(memory_order_seq_cst) fences or
use seq_cst labelled atomics.

[...]
> I am trying to see what the acquire does exactly in your code. It
> looks like it says: don't cleanup before setting the counter to 0, but
> then I don't see how that's related to mutability.

It says: don't read (for cleanup) before the counter is 0.

>
> > Decrements for reference counting under basic thread-safety can be
> > memory_order_relaxed with more relaxed fence than memory_order_release
> > for the case when resulting value of decrement != zero if referenced
> > counted object is immutable. I call such a fence "sink load barrier"
> > (release = "sink load barrier" + "sink store barrier").
>
> (trying to understand why this is the right answer and justifying it a
> posteriori)
>
> Hmm, I guess this barrier means: after decrementing the counter, it is
> too late to access the shared data . . .

Yes, it says: don't read after decrementing the counter.

regards,
alexander.

Joshua Maurice

Oct 18, 2011, 2:47:21 AM
On Oct 15, 8:56 am, Pete Becker <p...@versatilecoding.com> wrote:
> On 2011-10-15 12:16:45 +0000, Alexander Terekhov said:
>
> > Increments for reference counting under basic thread-safety can be
> > memory_order_relaxed, without any fences.
> > Decrements for reference counting under basic thread-safety can be
> > memory_order_relaxed, without the memory_order_acquire fence, for the
> > case when the resulting value of the decrement == zero, if the
> > reference-counted object is immutable.
> > Decrements for reference counting under basic thread-safety can be
> > memory_order_relaxed, with a more relaxed fence than
> > memory_order_release, for the case when the resulting value of the
> > decrement != zero, if the reference-counted object is immutable. I
> > call such a fence a "sink load barrier" (release = "sink load
> > barrier" + "sink store barrier").
>
> All this recalls my original, somewhat tongue in cheek, objection to putting thread support into C++: it encourages programmers to try to do things that they don't understand.
>
> Not that Alexander Terekhov doesn't understand this, but most people reading this thread won't understand it (including me).
>
> That's why most programmers should stick to sequentially consistent program models. They're easier to analyze and easier to get right.

That's probably true, and that is why a lot of the official and
semi-official documentation I've seen contains a warning "For experts
only".
do things that they don't understand" is almost antithetical to the
design goals of C++. Yes, we'll try to make it difficult to shoot
yourself in the foot, but we will expose the power when needed which
might sometimes result in shooting oneself in the foot. In other
words, we're trying to be cost competitive with assembly, or
reasonably close, and thereby allow writing good, fast, portable
code.

I will be very thankful when the experts can start writing their
code in portable C++ instead of assembly. The assembly is a
maintenance nightmare. Not putting it in C++ doesn't mean it won't be
done. It just means that someone dedicated enough is going to go to
assembly to do it, and once we go there it requires a true expert and
a stack of chip-specific documentation books to determine correctness,
at the very least. However, with the single, sane, well documented
(more or less) model in C++, it'll be a lot easier to prove its
correctness. (Yes, something which is a lot easier than "omg hard" can
still be "omg hard". For anything but the trivial example, and even
some trivial examples, even the experts cannot just look at it and
determine correctness. IMHO, you need to use a programmatic proof
solver like Relacy Race Detector or the one I wrote myself.)

So, in the future when C++11 becomes widespread, when I see someone
doing something "interesting" with volatile, I can tell them that
their code is wrong, and that (1) they first probably ought to just use
sequentially consistent semantics, or better yet mutexes and condition
variables, and (2) they ought to be using the C++11 weak atomics
instead of non-portable volatile hackery.

