On Oct 12, 5:19 am, Marc <marc.gli...@gmail.com> wrote:
> Agents Marlow wrote:
> > On Oct 9, 1:27 pm, Marc <marc.gli...@gmail.com> wrote:
> >> Hello,
>
> >> I am trying to make our reference counting implementation thread-safe
> >> (for a very basic definition of thread-safe), which essentially means
> >> making the counter atomic.
>
> > I would take a look at the implementation of the atomic counter in the
> > Poco Foundation class library.
>
> Thanks for the pointer.
> They only implement it lock-free for windows and mac, and the
> interesting part is that on mac they use the atomic operations without
> a memory barrier (and since mac used to run on PPC, it makes a
> difference). They perform a load by simply reading the variable (an
> int) which doesn't offer any protection against compiler
> optimizations, I am trying to think of the worst that could happen
> because of that...
Hi,
I think it's pretty hard (or impossible) to get an atomic counter
right using the interface you described.
What prevents an interleaving like this, e.g.:
thread A calls load(), sees writing is safe <context switch>
thread B calls load(), sees writing is safe
thread B increments your atomic
thread B does some operations, does not release the object <context
switch>
thread A increments your atomic << shouldn't happen
thread A does some operations << shouldn't happen
If I understood your expected usage wrong, please correct me.
Classically you would have a different API for your atomics, like:
int load();
bool store( int expected, int desired ); // returns false when expected
!= current value of the atomic
This way you are safe from the scenario above, since the increment of
thread B would fail, and it would not get the write access to the
object.
Your above example would look like:
int count = load();
if( count == 0 && store( 0, 1)) cleanup();
How this is implemented depends on your processor architecture. E.g.
on x86 you can use the cmpxchg instruction with some barriers etc.;
for ARM it's a little bit easier because of its RISC load-linked/
store-conditional instructions.
In general though: your compiler should have some intrinsic functions
that do that. (GCC: __sync_bool_compare_and_swap Microsoft:
InterlockedCompareExchange)
For those compilers you can have a look at:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html and
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/win/compiler_c/intref_cls/common/intref_itanium_lock_atomic.htm.
The advantage: the compilers already see those intrinsics as memory
barriers, and they should be portable across platforms.
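As a sketch of the load/compare-and-swap interface above, here is what it might look like built on C++11 std::atomic rather than the raw intrinsics (the class and method names are illustrative, not from any real library):

```cpp
#include <atomic>

// Illustrative atomic counter exposing the load()/store(expected, desired)
// style of interface described above, on top of C++11 std::atomic.
class RefCount {
    std::atomic<int> count{1};
public:
    void increment() {
        // Retry until the compare-and-swap succeeds; on failure,
        // compare_exchange_weak reloads the current value into 'expected'.
        int expected = count.load();
        while (!count.compare_exchange_weak(expected, expected + 1)) {
            // loop again with the refreshed 'expected'
        }
    }
    bool decrement_and_test() {
        // fetch_sub returns the previous value, so the caller that saw 1
        // is the unique thread that dropped the count to zero.
        return count.fetch_sub(1) == 1;
    }
    int value() const { return count.load(); }
};
```

These default to sequentially consistent ordering; the rest of the thread discusses when weaker orderings suffice.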
Regards,
Andreas
Advanced threading. See Alexander Terekhov's message.
Atomic operations get completed without interruption. That ensures that
a different thread doesn't see a value that's not valid. For example,
suppose that storing a pointer takes two bus operations. If the pointer
starts out null, storing a value into it has two steps: store one half
of the pointer, then store the other half. If a context switch occurs
between those two steps, the thread that's switched to might see half
the pointer. Atomic operations ensure that that sort of tearing doesn't
occur.
The other aspect of threaded programming is visibility of changes.
Here's where you have to abandon the single-processor analogies; think
multiple processors. For example, suppose the system has two
processors, and each processor has its own data cache. Each processor
is noodling around with the same variable, so each cache has a copy of
the value of that variable. Writing a new value, even when done
atomically, only directly affects the value in the cache that belongs
to the processor that wrote the value. Unless the new value is copied
to the other processor's cache, the other processor will still see the
old value. memory_order_relaxed says "don't worry, be happy". It's okay
that the values are inconsistent.
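That visibility point can be sketched in C++11 terms (the variable names here are illustrative): publishing a pointer with a release store, matched by an acquire load, guarantees the consumer also sees the payload written before the store.

```cpp
#include <atomic>
#include <thread>

// Publishing data through an atomic pointer. The release store makes the
// plain write to 'payload' visible to any thread whose acquire load sees
// the pointer; with memory_order_relaxed there would be no such guarantee.
std::atomic<int*> ptr{nullptr};
int payload = 0;

void producer() {
    payload = 42;                                   // plain (non-atomic) write
    ptr.store(&payload, std::memory_order_release); // publish the pointer
}

void consumer(int& out) {
    int* p;
    while ((p = ptr.load(std::memory_order_acquire)) == nullptr) {
        // spin until the pointer is published
    }
    out = *p; // sees 42: the acquire load synchronizes with the release store
}
```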
>
>> It's easier to get the code right when it's sequentially consistent. In
>> general, unless you can demonstrate that synchronization is a
>> bottleneck, don't mess with it.
>
> Well yes, of course, I did some experiments (using boost::shared_ptr
> or the libstdc++ std::atomic (just a place-holder implementation, I
> know)), and the slow-down was unacceptable, which led me to
> reimplement it, and the performance hit is acceptable but still
> noticeable enough that I am not sure about enabling it by default for
> MT programs (some programs have several threads that don't share any
> ref-counted objects and would pay the price for nothing).
>
> Our main target is x86/x86_64 where as far as I understand (please
> correct me if I am wrong) the memory barrier is unavoidable (implied
> by any atomic operation), but I am still interested in not penalizing
> our users on other platforms if I don't have to.
On the x86 architecture, pretty much everything is sequentially
consistent. So there's no difference in the generated code between
sequentially consistent visibility and any of the others. Which, in
turn, means that code that uses less than sequentially consistent
visibility and works just fine on x86 systems may fail miserably if it's
just ported to other systems.
>
> And it is intellectually satisfying to understand things ;-)
>
<g> This is tricky stuff. There's not much out there that describes it
in an approachable manner unless you're an expert on hardware
architecture. One good source is Anthony Williams' book, "C++
Concurrency in Action", from Manning Publications. www.manning.com.
--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)
(trying to understand the code, not saying there is anything wrong
with it)
The standard warns that "Fences cannot, in general, be used to restore
sequential consistency for atomic operations with weaker ordering
specifications."
Good thing this is not the case here :-)
The release and acquire look like they are there so that cleanup() is
up-to-date on what other copies (sharing the same data) did to the
data (although when there are several copies, no one is supposed to
data (although when there are several copies, noone is supposed to
touch the data). Or maybe to make sure that cleanup happens at the
right time? And should there be something (an acquire? nothing?) on
the increment?
I'd be very grateful if you could explain the rationale for the use of
acquire and release fences in your code above. It looks like you put
just the minimal synchronization in just the right places, so it would
certainly help me a lot to understand this example.
Sometimes I envy people who program in a language with a garbage
collector: no need for a counter, hence no need for atomics here...
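(For reference, the usual rationale for that placement can be sketched like this. It mirrors the well-known pattern from boost's shared_ptr counter and is only a guess at what the code under discussion does:)

```cpp
#include <atomic>

// Sketch of the classic acquire/release reference-counting pattern,
// as used by boost::shared_ptr's counter (illustrative, not the code
// from the thread).
struct Counted {
    std::atomic<int> refs{1};
    bool cleaned = false;

    void add_ref() {
        // Increments need no ordering: they only keep the object alive,
        // and don't publish or consume any data.
        refs.fetch_add(1, std::memory_order_relaxed);
    }
    void release() {
        // The release decrement orders this thread's prior writes to the
        // object before the count drops; the acquire fence then makes
        // every other thread's prior writes visible to the one thread
        // that runs cleanup().
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            cleanup();
        }
    }
    void cleanup() { cleaned = true; } // placeholder for delete/destructor
};
```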
> On 2011-10-12 20:58:06 +0000, Marc said:
>
>> Pete Becker wrote:
>>> Suppose there are two references to the object, in two different
>>> threads. One thread decrements the reference count, then the other
>>> does. If the decrement from the first thread isn't seen by the second
>>> thread, the second thread won't see that the count has become zero, and
>>> the object won't get destroyed. So memory_order_relaxed won't work: you
>>> need to ensure that the result of a decrement is visible to another
>>> thread that also needs to decrement the count.
>>
>> Uh? I am still calling an atomic decrement function. The standard says:
>>
>> "Note: Atomic operations specifying memory_order_relaxed are relaxed
>> with respect to memory ordering. Implementations must still guarantee
>> that any given atomic access to a particular atomic object be
>> indivisible with respect to all other atomic accesses to that object."
>>
>> I thought the memory order was mostly concerned with what happened to
>> the rest of the memory.
I stumbled over several other sources all pointing in the same
direction, which confuses me.
In the standard:
"All modifications to a particular atomic object M occur in some
particular total order, called the modification order of M."
In N2145:
"For example, an atomic increment operation may be used simply to count
the number of times a function is called, as in a profiler. This
requires atomicity, but no ordering constraints.
[...]
An event counter is not part of the communication between threads, and
so it can use faster primitives.
[...]
void inc() { atomic_fetch_add_relaxed( &au, 1 ); }
"
http://labs.qt.nokia.com/2009/10/06/memory-ordering-semantics/
"For example, QAtomicInt offers ref() and unref(), which are just a
wrapper around fetchAndAddRelaxed(1) and fetchAndAddRelaxed(-1). This
means the reference count is atomic, but nothing else."
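The N2145 event-counter idea can be sketched in C++11 syntax (the thread and iteration counts are arbitrary): the count itself stays atomic, so no increments are lost, but it imposes no ordering on any other memory.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A profiler-style event counter: atomicity without ordering, as in N2145.
std::atomic<unsigned> calls{0};

void profiled_function() {
    // Relaxed is enough: the counter is not used to communicate
    // any other data between threads.
    calls.fetch_add(1, std::memory_order_relaxed);
}

unsigned run_threads(int nthreads, int iters) {
    std::vector<std::thread> ts;
    for (int i = 0; i < nthreads; ++i)
        ts.emplace_back([iters] {
            for (int j = 0; j < iters; ++j) profiled_function();
        });
    for (auto& t : ts) t.join();
    // After the joins, all increments are visible; no increment was lost.
    return calls.load(std::memory_order_relaxed);
}
```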
> Atomic operations get completed without interruption. That ensures that
> a different thread doesn't see a value that's not valid. For example,
> suppose that storing a pointer takes two bus operations. If the pointer
> starts out null, storing a value into it has two steps: store one half
> of the pointer, then store the other half. If a context switch occurs
> between those two steps, the thread that's switched to might see half
> the pointer. Atomic operations ensure that that sort of tearing doesn't
> occur.
>
> The other aspect of threaded programming is visibility of changes.
> Here's where you have to abandon the single-processor analogies; think
> multiple processors. For example, suppose the system has two
> processors, and each processor has its own data cache. Each processor
> is noodling around with the same variable, so each cache has a copy of
> the value of that variable. Writing a new value, even when done
> atomically, only directly affects the value in the cache that belongs
> to the processor that wrote the value. Unless the new value is copied
> to the other processor's cache, the other processor will still see the
> old value. memory_order_relaxed says "don't worry, be happy". It's okay
> that the values are inconsistent.
Makes sense, thanks.
> <g> This is tricky stuff. There's not much out there that describes it
> in an approachable manner unless you're an expert on hardware
> architecture. One good source is Anthony Williams' book, "C++
> Concurrency in Action", from Manning Publications. www.manning.com.
I wonder how the library is going to react when I ask them to buy a
book that's not written yet :-)
--
That's true: for this concrete code, on x86 the fences are only needed
to prevent compiler reordering, because the x86 memory model already
guarantees that writes are not reordered with respect to reads that
come earlier in program order.
So there may be no runtime penalty at all: the compiler simply emits
no fence instructions in the x86 assembly, since they are not necessary.
--
See http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
On assembly level, x86 mappings for C++ fences:
Consume Fence: <ignore>
Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: MFENCE
regards,
alexander.