Threads: incompatibilities between C and C++?

M J

May 14, 2012, 2:13:37 PM

Recently I watched a talk by Hans Boehm on Channel 9 in which he
mentioned an incompatibility between C11 and C++11. I think he was
talking about a difference between the thread libraries. Does anyone know
what these differences are?

MJ

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Pete Becker

May 14, 2012, 3:14:41 PM

On 2012-05-14 18:13:37 +0000, M J said:

> Recently I watched a speech made by Hans Boehm on Channel #9 where he
> mentioned an incompatibility between C11 and C++11. I think he was
> talking about a difference between thread libraries. Does anyone know
> what these differences consist of?
>

Thread support in C11 is based on function calls, pretty much like
pthreads. C++11 uses classes and templates. Here's a trivial (and
incomplete) example:

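// in C:
#include <stdio.h>
#include <threads.h>

int f(void *arg) {
    (void)arg;  /* unused */
    printf("Hello, world! From a thread!\n");
    return 0;   /* result code, retrievable through thrd_join */
}

int main(void) {
    thrd_t thr;
    int res = thrd_create(&thr, f, NULL);
    if (res == thrd_success) {
        thrd_join(thr, NULL);
    }
    return 0;
}

// in C++:
#include <cstdio>
#include <thread>

void f() {
    std::printf("Hello, world! From a thread!\n");
}

int main() {
    std::thread thr(f);
    thr.join();
    return 0;
}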

The difference in the signatures of the called functions comes about
because thrd_create takes a function pointer of type int(*)(void*); in
C++, the std::thread constructor takes an arbitrary callable type and
an appropriate argument list.
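
To make the contrast concrete, here is a minimal C++ sketch of the same job started both ways: once through a C-style entry point of type int(*)(void*), where arguments and the result are packed into a struct behind a void*, and once with std::thread forwarding a typed argument list. The names AddArgs, add_entry, add, run_c_style and run_cpp_style are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <thread>

// C-style shape: arguments and result travel through one void* parameter.
struct AddArgs {
    int a;
    int b;
    int result;
};

int add_entry(void* p) {
    AddArgs* args = static_cast<AddArgs*>(p);
    args->result = args->a + args->b;
    return 0;  // std::thread ignores this value; in C, thrd_join could retrieve it
}

// C++ shape: an ordinary typed function; std::thread forwards the arguments.
void add(int a, int b, int& out) {
    out = a + b;
}

int run_c_style() {
    AddArgs args = {2, 3, 0};
    std::thread t(add_entry, static_cast<void*>(&args));
    t.join();  // after join() the worker's write to args.result is visible here
    return args.result;
}

int run_cpp_style() {
    int out = 0;
    // std::ref is needed because the thread stores decayed copies of its arguments.
    std::thread t(add, 2, 3, std::ref(out));
    t.join();
    assert(!t.joinable());
    return out;
}
```

Both helpers compute 2 + 3; the only difference is how the data reaches the thread function.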

This isn't properly called "an incompatibility", however. The
underlying design models are quite different.

There is some bridging from C++ to C, maybe. Most of the C++ thread
support types can have a member function named native_handle() that
returns an object whose type is a nested type named native_handle_type.
That might give you a hook into the C threading support; it's
implementation defined whether native_handle() exists, and if it does,
it's implementation-defined what you can do with it.
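
As an illustration of that hook, the sketch below assumes a POSIX-based implementation in which std::thread::native_handle_type is pthread_t (true of common Linux toolchains, but not guaranteed by the standard). The function name native_handle_matches_pthread_self is hypothetical. It checks that the handle returned by native_handle() identifies the same thread that pthread_self() reports from inside the worker.

```cpp
#include <cassert>
#include <future>
#include <pthread.h>
#include <thread>
#include <type_traits>

// Assumption: POSIX threads underneath. This fails to compile elsewhere.
static_assert(std::is_same<std::thread::native_handle_type, pthread_t>::value,
              "this sketch assumes native_handle_type is pthread_t");

bool native_handle_matches_pthread_self() {
    std::promise<pthread_t> id_from_worker;
    std::promise<void> release;
    std::future<void> released = release.get_future();

    std::thread worker([&id_from_worker, &released] {
        id_from_worker.set_value(pthread_self());
        released.wait();  // stay alive so the thread ID remains valid for comparison
    });

    pthread_t seen = id_from_worker.get_future().get();
    bool same = pthread_equal(worker.native_handle(), seen) != 0;

    release.set_value();
    worker.join();
    assert(!worker.joinable());
    return same;
}
```

Anything done with the handle beyond this kind of query (priorities, affinity, cancellation) is outside what the C++ standard specifies.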

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Rani Sharoni

May 15, 2012, 1:20:30 PM

On May 14, 10:14 pm, Pete Becker <p...@versatilecoding.com> wrote:
> The difference in the signatures of the called functions comes about
> because thrd_create takes a function pointer of type int(*)(void*); in
> C++, the std::thread constructor takes an arbitrary callable type and
> an appropriate argument list.
>
> This isn't properly called "an incompatibility", however. The
> underlying design models are quite different.

I also noticed a potential abstraction penalty associated with
std::thread.
Per the C++ standard, 30.3.1.2/4 (thread construction):
The *new thread of execution executes*
INVOKE(DECAY_COPY(std::forward<F>(f)), DECAY_COPY(std::forward<Args>(args))...)
with the calls to DECAY_COPY being evaluated in the constructing
thread.

This means that, in general, the calling thread has to wait for the new
thread to finish copying: a serialization penalty that doesn't exist for
the raw C interface. I guess that the requirement (feature) is meant to
allow TLS-sensitive copying or some sort of transfer of ownership, though
I can't think of any realistic use case relying on such a feature.

I already noticed that the VC implementation takes the penalty on
every call, including when passing raw pointers or lambdas.
Thread pools with a similar interface might have the same problem (which
is more severe for pools).

Example of why the wait is needed:
struct A {};
void func(A&) {}

void f()
{
    A a = {};

    // copy of 'a' is done by the new thread
    // hence f() has to wait to keep the source alive
    std::thread(func, a);
}

Rani

Ulrich Eckhardt

May 16, 2012, 1:35:34 PM

Am 15.05.2012 19:20, schrieb Rani Sharoni:
> I also noticed a potential abstraction penalty associated with
> std::thread. Per the C++ standard 30.3.1.2/4 (thread
> construction): The *new thread of execution executes* INVOKE(DECAY_-
> COPY( std::forward<F>(f)), DECAY_COPY (std::forward<Args>(args))...)
> with the calls to DECAY_COPY being evaluated in the constructing
> thread.
>
> This means that in general the caller thread has to wait for the new
> thread copying - serialization penalty that doesn't exist for the raw
> C interface. I guess that the requirement (feature) is meant to allow
> TLS sensitive copying or some sort of transfer-of-ownership though I
> can't think about any realistic use case relaying on such feature.

I don't see any use related to thread-local storage, but transfer of
ownership seems like a reasonable goal to me. Think of a job queue and a
thread pool. The managing thread pops a job off the queue and passes it
to a worker thread. If the thread pool is empty, i.e. no worker thread
available, you don't want the job object to be discarded.
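
One way to picture the transfer-of-ownership case is the sketch below, where a job owned by a std::unique_ptr is handed to a new thread. Job, process and hand_off_job are hypothetical names, and a real pool would hand the job to an existing worker rather than start a thread. The point is only that the move into the thread's stored arguments happens in the calling thread, so the caller's pointer is already empty when the constructor returns.

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

struct Job {
    int input;
};

// The worker receives sole ownership of the job.
void process(std::unique_ptr<Job> job, int* out) {
    *out = job->input * 2;
}

int hand_off_job() {
    std::unique_ptr<Job> job(new Job{21});
    int result = 0;

    // The decayed copy (here: a move) is made in this thread, inside the constructor.
    std::thread worker(process, std::move(job), &result);
    assert(job == nullptr);  // ownership has already left the caller

    worker.join();
    return result;
}
```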

> I already noticed that the VC implementation is taking the penalty for
> every call including when passing raw-pointers or lambda.
> Thread pools with similar interface might have the same problem (which
> is more severe for pools).

For the thread pool, I'd say it's less severe, assuming you already have
a few threads running. The reason is that the startup time for the new
thread is avoided, you only need to notify a waiting thread that there
is a function for it to run.

That said, I agree that this is far from optimal for performance and I
also wonder if this is a mere oversight or if there is a good
explanation. Maybe that explanation is simply that if you really care
for startup latency, you should start your thread before you really need
it, i.e. take the thread-pool approach above. OTOH if you need complete
startup feedback including for initialisations that are done by the
started thread and which might fail, you can't live without this.

Uli

Rani Sharoni

May 18, 2012, 5:17:37 PM

On May 16, 8:35 pm, Ulrich Eckhardt <ulrich.eckha...@dominolaser.com>
wrote:

> Am 15.05.2012 19:20, schrieb Rani Sharoni:
>
> > I also noticed a potential abstraction penalty associated with
> > std::thread. Per the C++ standard 30.3.1.2/4 (thread
> > construction): The *new thread of execution executes* INVOKE(DECAY_-
> > COPY( std::forward<F>(f)), DECAY_COPY (std::forward<Args>(args))...)
> > with the calls to DECAY_COPY being evaluated in the constructing
> > thread.

> That said, I agree that this is far from optimal for performance and I
> also wonder if this is a mere oversight or if there is a good
> explanation.

As I said, I think that the implementation can avoid the wait in
important special cases, such as when passing raw pointers and lambdas.
I don't know if any existing implementation does this, and I
personally doubt it.

Can you come up with code in which transfer of ownership would break
without the deferred-copying feature?

> Maybe that explanation is simply that if you really care
> for startup latency, you should start your thread before you really need
> it, i.e. take the thread-pool approach above.

I hope that thread-pool interfaces don't have such a deferred-copying
feature/interface; otherwise they will also have this penalty.

> OTOH if you need complete
> startup feedback including for initialisations that are done by the
> started thread and which might fail, you can't live without this.

You also don't have this with the native C interface (e.g. POSIX threads),
but you can still build such a handshake manually.
Note that the native C interfaces always have a powerful guarantee: the
thread will start if create-thread succeeded. This allows avoiding a
penalizing wait handshake like the one C++ has.
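
A manually built startup handshake of the kind mentioned here could look like the following sketch. It uses std::promise and std::future, which is one possible choice rather than anything the thread prescribes; start_worker and its simulate_init_failure parameter are hypothetical. The caller blocks until the new thread reports whether its own initialization succeeded.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <stdexcept>
#include <thread>

// Returns true if the worker reported successful initialization.
bool start_worker(bool simulate_init_failure) {
    std::promise<void> ready;
    std::future<void> started = ready.get_future();

    std::thread worker([&ready, simulate_init_failure] {
        try {
            if (simulate_init_failure) {
                throw std::runtime_error("per-thread initialization failed");
            }
            ready.set_value();  // tell the caller that startup succeeded
        } catch (...) {
            ready.set_exception(std::current_exception());
            return;
        }
        // Real work would go here, after the handshake.
    });

    bool ok = true;
    try {
        started.get();  // the caller pays for the wait only because it asked for it
    } catch (const std::runtime_error&) {
        ok = false;
    }
    worker.join();
    assert(!worker.joinable());
    return ok;
}
```

The wait is opt-in: code that does not need startup feedback simply constructs the thread and continues.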

FWIW, I once noticed this penalty in the MS .NET runtime, in which the
caller thread waits since the new thread can indeed fail to call the
callback after it has started (i.e. due to some per-thread
pre-allocation).

Rani

Volker Lukas

May 20, 2012, 11:07:04 AM

Rani Sharoni wrote:

> On May 14, 10:14 pm, Pete Becker <p...@versatilecoding.com> wrote:
>> The difference in the signatures of the called functions comes
>> about because thrd_create takes a function pointer of type
>> int(*)(void*); in C++, the std::thread constructor takes an
>> arbitrary callable type and an appropriate argument list.
>>
>> This isn't properly called "an incompatibility", however. The
>> underlying design models are quite different.
>
> I also noticed a potential abstraction penalty associated with
> std::thread.
> Per the C++ standard 30.3.1.2/4 (thread construction): The *new
> thread of execution executes* INVOKE(DECAY_- COPY(
> std::forward<F>(f)), DECAY_COPY (std::forward<Args>(args))...) with
> the calls to DECAY_COPY being evaluated in the constructing thread.
>
> This means that in general the caller thread has to wait for the new
> thread copying - serialization penalty that doesn't exist for the
> raw C interface.

Can you explain a bit more what exactly is the origin of the
performance penalty? You write that the caller has to wait for the new
thread to copy something - but I can not see why that is the case. As
is stated in the quote from the C++ standard above, the copies are
made in the calling thread.

Or is your point that copies are made at all?

[...]

> Example for why the wait is needed:
> struct A {};
> void func(A&) {}

I believe this is not a valid signature for a thread routine:
DECAY_COPY returns an rvalue, which does not bind to A&. A quick test
with GCC 4.7.0 seems to agree, with A& a compile error is reported,
A&&, A const& and plain A all compile.
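
The binding rule can be checked at compile time. The sketch below uses the C++17 trait std::is_invocable (newer than the C++11 library discussed in this thread) to ask whether each signature accepts an rvalue of type A, which is what DECAY_COPY hands to the thread function. It also shows std::ref as the usual way to pass a genuine reference. All function names are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <type_traits>

struct A {
    int value;
};

void takes_lvalue_ref(A& a) { ++a.value; }
void takes_rvalue_ref(A&&) {}
void takes_const_ref(const A&) {}
void takes_value(A) {}

// An argument type of plain A models an rvalue, as produced by DECAY_COPY.
static_assert(!std::is_invocable<decltype(&takes_lvalue_ref), A>::value,
              "an rvalue does not bind to A&");
static_assert(std::is_invocable<decltype(&takes_rvalue_ref), A>::value,
              "an rvalue binds to A&&");
static_assert(std::is_invocable<decltype(&takes_const_ref), A>::value,
              "an rvalue binds to A const&");
static_assert(std::is_invocable<decltype(&takes_value), A>::value,
              "an rvalue initializes a by-value parameter");

int increment_through_ref() {
    A a = {1};
    // std::reference_wrapper<A> converts to A&, so this form is accepted.
    std::thread t(takes_lvalue_ref, std::ref(a));
    t.join();
    assert(!t.joinable());
    return a.value;
}
```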

> void f()
> {
> A a = {};
>
> // copy of 'a' is done by the new thread
> // hence f() has to wait to keep the source alive
> std::thread(func, a);
> }

As written above, I do not see any requirement that the *new* thread
makes any copy. A test with GCC 4.7.0 also shows that all copies are
made in the calling thread.

Rani Sharoni

May 20, 2012, 3:09:45 PM

On May 20, 6:07 pm, Volker Lukas <vlu...@gmx.de> wrote:

> Rani Sharoni wrote:
> > I also noticed a potential abstraction penalty associated with
> > std::thread.
> > Per the C++ standard 30.3.1.2/4 (thread construction): The *new
> > thread of execution executes* INVOKE(DECAY_- COPY(
> > std::forward<F>(f)), DECAY_COPY (std::forward<Args>(args))...)
> > with the calls to DECAY_COPY being evaluated in the constructing
> > thread.
>
> > This means that in general the caller thread has to wait for the
> > new thread copying - serialization penalty that doesn't exist for
> > the raw C interface.
>
> Can you explain a bit more what exactly is the origin of the
> performance penalty? You write that the caller has to wait for the
> new thread to copy something - but I can not see why that is the
> case. As is stated in the quote from the C++ standard above, the
> copies are made in the calling thread.

One might argue that additional copying is allowed, but then why
mandate the copying in the new thread?
But there is more - 30.3.1.2/4:
"This implies that any exceptions not thrown from the invocation of
the copy of f will be thrown in the constructing thread, not the new
thread."

So the caller has to wait for copying by the new thread to complete
due to a potential exception.
This probably also requires that the common "passing a lambda" case
should wait...

BTW - can you debug into the GCC implementation to see if it waits? VC
for sure does.

Rani

Volker Lukas

May 21, 2012, 9:22:39 AM

Rani Sharoni wrote:

> On May 20, 6:07 pm, Volker Lukas <vlu...@gmx.de> wrote:
>> Rani Sharoni wrote:
>> > I also noticed a potential abstraction penalty associated with
>> > std::thread.
>> > Per the C++ standard 30.3.1.2/4 (thread construction): The *new
>> > thread of execution executes* INVOKE(DECAY_- COPY(
>> > std::forward<F>(f)), DECAY_COPY (std::forward<Args>(args))...)
>> > with the calls to DECAY_COPY being evaluated in the constructing
>> > thread.
>>
>> > This means that in general the caller thread has to wait for the
>> > new thread copying - serialization penalty that doesn't exist for
>> > the raw C interface.
>>
>> Can you explain a bit more what exactly is the origin of the
>> performance penalty? You write that the caller has to wait for the
>> new thread to copy something - but I can not see why that is the
>> case. As is stated in the quote from the C++ standard above, the
>> copies are made in the calling thread.
>
> One might argue to additional copying is allowed but then why
> mandating the copying in the new thread?

Again, I only see a requirement that the calling thread makes a copy
of the function object and the arguments: "[...] with the calls to
DECAY_COPY being evaluated in the constructing thread." .

Question: How do you interpret "constructing thread"? Does this mean
to you the new thread of execution, that which was just started,
or does it mean to you the old thread, that which constructed the
std::thread object?

To me, it means the latter.

> BTW - can you debug into the GCC implementation to see if it waits?
> VC for sure does.

I can not see anything which I would call "waiting", i.e. I do not see
any waiting on condition variables, mutexes, semaphores etc...

For this program,
-----------------------------------------------------------
#include <iostream>
#include <thread>

void tid(char const* loc) {
    std::cout << loc << ", thread id = "
              << std::this_thread::get_id() << std::endl;
}

struct A {
    A() { }
    A(A const&) { tid("A(A const&"); }
    A(A&&) { tid("A(A&&"); }
};

void f(A&&) { tid("f"); }

int main() {
    try {
        tid("main");
        A a;
        std::thread t(f, a);

        t.join();
    }
    catch(...) { std::cout << "\nDuh!\n"; }
}
-----------------------------------------------------------

I get this output (GCC 4.7.0 + supplied library implementation):
main, thread id = 139890527434560
A(A const&, thread id = 139890527434560
A(A&&, thread id = 139890527434560
f, thread id = 139890510874368

So any copies of the function object and its argument are made in the
old thread. This is in line with my reading of the C++ standard (N3376
draft).
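
A consequence of copying in the constructing thread is that a throwing copy constructor surfaces in the caller, before any thread function runs. The sketch below (hypothetical names ThrowsOnCopy, worker, worker_ran and copy_failure_reaches_caller) illustrates that reading of the standard.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <thread>

std::atomic<bool> worker_ran(false);

struct ThrowsOnCopy {
    ThrowsOnCopy() {}
    ThrowsOnCopy(const ThrowsOnCopy&) {
        throw std::runtime_error("copy failed");
    }
};

void worker(const ThrowsOnCopy&) {
    worker_ran.store(true);
}

bool copy_failure_reaches_caller() {
    ThrowsOnCopy arg;
    try {
        // The decayed copy of 'arg' is made here, in the calling thread.
        std::thread t(worker, arg);
        t.join();  // not reached: the constructor throws
    } catch (const std::runtime_error&) {
        return true;  // the exception arrived in the constructing thread
    }
    return false;
}
```

Because the constructor never completes, no std::thread object exists afterwards and there is nothing to join.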

This is how std::thread is implemented for GCC:
http://gcc.gnu.org/viewcvs/trunk/libstdc++-v3/include/std/thread?revision=184997&view=markup

and

http://gcc.gnu.org/viewcvs/trunk/libstdc++-v3/src/c++11/thread.cc?revision=184997&view=markup

There is also an alternate implementation of std::thread, as part of
the LLVM project:
http://llvm.org/svn/llvm-project/libcxx/trunk/include/thread
http://llvm.org/svn/llvm-project/libcxx/trunk/src/thread.cpp

That implementation also makes the copies in the calling thread, like
with GCC.

Rani Sharoni

May 21, 2012, 3:12:08 PM

On May 21, 4:22 pm, Volker Lukas <vlu...@gmx.de> wrote:
> Rani Sharoni wrote:
> > One might argue to additional copying is allowed but then why
> > mandating the copying in the new thread?
>
> Again, I only see a requirement that the calling thread makes a copy
> of the function object and the arguments: "[...] with the calls to
> DECAY_COPY being evaluated in the constructing thread." .
>
> Question: How do you interpret "constructing thread"? Does this mean
> to you the the new thread of execution, that which was just started,
> or does it mean to you the old thread, that which constructed the
> std::thread object?
>
> To me, it means the latter.

OK. I think that you read this text better than I did.
I looked at the boost::thread doc and it explicitly says the same.
The implementations you mentioned conform without the wait penalty.

30.3.1.2/5 might add a bit more to my confusion:
"Synchronization: The completion of the invocation of the constructor
synchronizes with the beginning of the invocation of the copy of f."

I'm not sure about the meaning of this paragraph or why it's actually
required.
I guess that the intention is that the caller waits in order to report
potential failures in the new thread before the invocation of 'f'.
The text might be confusing, though, but the second implementation you
mentioned seems to have a related bug (see below).

> > BTW - can you debug into the GCC implementation to see if it
> > waits? VC for sure does.

> I can not see anything which I would call "waiting", i.e. I do not
> see any waiting on condition variables, mutexes, semaphores etc...

> This is how std::thread is implemented for GCC:
>
http://gcc.gnu.org/viewcvs/trunk/libstdc++-v3/include/std/thread?revision=184997&view=markup

Thanks for checking this. I also looked at the libstdc++ code you
mentioned and there is no wait in sight there:
thread(_Callable&& __f, _Args&&... __args)
{
    _M_start_thread(_M_make_routine(std::__bind_simple(
        std::forward<_Callable>(__f),
        std::forward<_Args>(__args)...)));
}

I see an extra allocation for the args holder type (via shared_ptr),
but this is probably negligible compared with the thread creation
(the native C interface doesn't require it).

> For this program,

> I get this output (GCC 4.7.0 + supplied library implementation):
> main, thread id = 139890527434560
> A(A const&, thread id = 139890527434560
> A(A&&, thread id = 139890527434560
> f, thread id = 139890510874368
>
> So any copies of the function object and its argument are made in
> the old thread. This is in line with my reading of the C++ standard
> (N3376 draft).

Thanks. That is definite proof that the copying is done by the caller thread.

> There is also an alternate implementation of std::thread, as part of
> the LLVM project:
> http://llvm.org/svn/llvm-project/libcxx/trunk/include/thread

> That implementation also makes the copies in the calling thread,
> like with GCC.

Indeed:
thread::thread(_Fp&& __f, _Args&&... __args)
{
    typedef tuple<typename decay<_Fp>::type,
                  typename decay<_Args>::type...> _Gp;
    _VSTD::unique_ptr<_Gp> __p(
        new _Gp(__decay_copy(_VSTD::forward<_Fp>(__f)),
                __decay_copy(_VSTD::forward<_Args>(__args))...));
    int __ec = pthread_create(&__t_, 0, &__thread_proxy<_Gp>, __p.get());

But there seems to be a bug in the underlying callback:
__thread_proxy(void* __vp)
{
    // throwing new will terminate
    __thread_local_data().reset(new __thread_struct);
    std::unique_ptr<_Fp> __p(static_cast<_Fp*>(__vp));
    typedef typename __make_tuple_indices<tuple_size<_Fp>::value, 1>::type _Index;
    __thread_execute(*__p, _Index());

In this case the implementation should propagate the exception to the
caller thread, which would then have to wait.
Better to simply pre-allocate this __thread_struct in the caller
thread.

Thanks,
Rani

Volker Lukas

May 22, 2012, 2:53:02 PM

Rani Sharoni wrote:
[...]

> 30.3.1.2/5 might add a bit more to my confusion:
> "Synchronization: The completion of the invocation of the constructor
> synchronizes with the beginning of the invocation of the copy of f."
>
> I'm not sure about the meaning of this paragraph and why it's actually
> required.

I think according to section 1.10 in the standard it means that the new
thread of execution can assume that the std::thread object is fully
constructed, i.e., the following program has defined semantics and the
asserts succeed:
-----------------------------------------------------------
#include <assert.h>
#include <thread>

int x;

void f(std::thread const* t)
{
    assert(x == 1);
    assert(std::this_thread::get_id() == t->get_id());
}

int main() {
    x = 1;
    std::thread t(f, &t);
    t.join();
}
-----------------------------------------------------------
"synchronize with" is defined in 1.10/8.

Rani Sharoni

May 23, 2012, 4:25:27 AM

On May 22, 9:53 pm, Volker Lukas <vlu...@gmx.de> wrote:
> Rani Sharoni wrote:
>
> [...]> 30.3.1.2/5 might add a bit more to my confusion:
> > "Synchronization: The completion of the invocation of the constructor
> > synchronizes with the beginning of the invocation of the copy of f."
>
> > I'm not sure about the meaning of this paragraph and why it's actually
> > required.
>
> I think according to section 1.10 in the standard it means that the new
> thread of execution can assume that the std::thread object is fully
> constructed

I guess you are right though the "with the beginning of the
invocation" text is a bit confusing compared with traditional
threading for which threads can be created in a suspended state (i.e.
there is a separation between the thread object creation and the
invocation so no such synchronization specification is required).
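
std::thread itself has no "create suspended" option, but the separation described here can be approximated with a gate that the new thread waits on. The following is only a sketch of that idea using std::promise and std::shared_future; run_gated_worker is a hypothetical name, and the thread is running (blocked in wait), not truly suspended.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <thread>

int run_gated_worker() {
    std::promise<void> go;
    std::shared_future<void> gate = go.get_future().share();
    std::atomic<int> stage(0);

    // The thread exists from here on, but its real work is parked behind the gate.
    std::thread worker([gate, &stage] {
        gate.wait();
        stage.store(1);
    });

    assert(stage.load() == 0);  // the worker cannot have passed the gate yet

    go.set_value();  // "resume": release the worker
    worker.join();
    return stage.load();
}
```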

Thanks,
Rani

Pete Becker

May 23, 2012, 3:27:01 PM

On 2012-05-23 08:25:27 +0000, Rani Sharoni said:

> On May 22, 9:53 pm, Volker Lukas <vlu...@gmx.de> wrote:
>> Rani Sharoni wrote:
>>
>> [...]> 30.3.1.2/5 might add a bit more to my confusion:
>>> "Synchronization: The completion of the invocation of the
>>> constructor synchronizes with the beginning of the invocation of
>>> the copy of f."
>>
>>> I'm not sure about the meaning of this paragraph and why it's
>>> actually required.
>>
>> I think according to section 1.10 in the standard it means that the
>> new thread of execution can assume that the std::thread object is
>> fully constructed
>
> I guess you are right though the "with the beginning of the
> invocation" text is a bit confusing compared with traditional
> threading for which threads can be created in a suspended state
> (i.e. there is a separation between the thread object creation and
> the invocation so no such synchronization specification is
> required).

The specification of the semantics of multi-threading in C++ is based
entirely on the memory model (1.10). The memory model, in turn,
defines the visibility of data modifications in terms of library
function calls that make visibility guarantees. For example,

int i = 0;
std::atomic_int ai(0);

// thread 1:
i = 3;
ai.store(4);

// thread 2:
while (ai.load() != 4)
; /* busy wait */
assert(i == 3);

The store to ai in thread 1 guarantees sequential consistency, as does
the load from ai in thread 2. That, in turn, requires that the value
stored into i in thread 1 (which precedes the store to ai) must be
visible in thread 2, if thread 2 has seen the value stored in ai by
thread 1 (that's the point of the while loop).

In memory-model terms, the return from the call to store() in thread 1
*synchronizes with* the call to load() in thread 2.
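
For readers who want to run the two-thread fragment above, here is a self-contained sketch of it. The function name observe_after_flag and the use of lambdas and std::this_thread::yield() are additions for illustration; the atomic operations use the default sequentially consistent ordering, as in the fragment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int observe_after_flag() {
    int i = 0;              // ordinary, non-atomic data
    std::atomic_int ai(0);  // the flag that carries the synchronization
    int seen = -1;

    std::thread writer([&i, &ai] {
        i = 3;
        ai.store(4);
    });

    std::thread reader([&i, &ai, &seen] {
        while (ai.load() != 4) {
            std::this_thread::yield();  // busy wait
        }
        assert(i == 3);  // guaranteed: the store synchronizes with this load
        seen = i;
    });

    writer.join();
    reader.join();
    return seen;
}
```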

That's the fundamental vocabulary here. The description of the
relation between the thread constructor and the thread function uses
the same vocabulary: the return from the constructor synchronizes with
the start of the thread function. That is, this has to work:

int i = 0;

void f() {
assert(i == 3);
}

int main() {
i = 3;

std::thread thr(f);
thr.join();
return 0;
}

By abstracting the notion of *synchronizes with*, the memory model
lets you talk about less stringent visibility requirements
(release/acquire, release/consume, and relaxed) without having to
repeat their definitions everywhere they apply.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Rani Sharoni

May 24, 2012, 5:11:58 AM

On May 23, 10:27 pm, Pete Becker <p...@versatilecoding.com> wrote:
> On 2012-05-23 08:25:27 +0000, Rani Sharoni said:
> >> Rani Sharoni wrote:
>
> >> [...]> 30.3.1.2/5 might add a bit more to my confusion:
> >>> "Synchronization: The completion of the invocation of the
> >>> constructor synchronizes with the beginning of the invocation of
> >>> the copy of f."
>
> >>> I'm not sure about the meaning of this paragraph and why it's
> >>> actually required.
>

> int i = 0;
>
> void f() {
> assert(i == 3);
>
> }
>
> int main() {
> i = 3;
> std::thread thr(f);
> thr.join();
> return 0;
>
> }
>
> By abstracting the notion of *synchronizes with*, the memory model
> lets you talk about less stringent visiblity requirements
> (release/acquire, release/consume, and relaxed) without having to
> repeat their definitions everywhere they apply.

Thanks for the explanation. I see your point. std::create-thread is a
full memory barrier so for example there is no need for additional
barriers in order to access (from the new thread) memory that was
initialized before the create-thread call (i.e. no re-ordering is
allowed by the caller). I guess that every threading platform
(including thread pools) provides such a guarantee.

Rani

Pete Becker

May 24, 2012, 7:28:57 PM

On 2012-05-24 09:11:58 +0000, Rani Sharoni said:

>
> Thanks for the explanation. I see your point. std::create-thread is a
> full memory barrier so for example there is no need for additional
> barriers in order to access (from the new thread) memory that was
> initialized before the create-thread call (i.e. no re-ordering is
> allowed by the caller).

There are two things that are enforced by barriers: no re-ordering, and
cache coherence. For those who aren't up on this stuff, no re-ordering
means the compiler can't rearrange the code in ways that invalidate the
read/write rules. Cache coherence means that the hardware can't
rearrange reads and writes. The underlying hardware issue is that,
typically, with multiple cpus you have multiple caches; a write by one
cpu writes data to its cache; a read by another cpu may read from its
own cache or from main memory; if the contents of the first cache
haven't been written to main memory, the read won't see the result of
that write. Memory barriers (which underlie the implementation of the
"synchronizes with" relationship) do the appropriate flushes. A release
operation flushes the cpu's cache, storing newly written values to main
memory; an acquire operation invalidates the cpu's cache, forcing data
to be read from main memory instead of the cache. The combination of
the two produces "synchronizes with": a write operation with release
semantics on one cpu synchronizes with a read operation with acquire
semantics on another cpu if the read sees the value that was written.
Again, that means that writes prior to the release operation are
visible to readers after the acquire operation.
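
In C++11 source code the release/acquire pairing described above is spelled with explicit memory orders. The sketch below (hypothetical name receive_payload) publishes an ordinary int through an atomic flag: the store uses std::memory_order_release, the load uses std::memory_order_acquire, and once the consumer sees the flag it is guaranteed to see the payload as well.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int receive_payload() {
    int payload = 0;                 // plain data written before the release
    std::atomic<bool> ready(false);  // flag carrying the release/acquire pair
    int received = -1;

    std::thread producer([&payload, &ready] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&payload, &ready, &received] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        assert(payload == 42);  // visible because the acquire load saw the release store
        received = payload;
    });

    producer.join();
    consumer.join();
    return received;
}
```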

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Rani Sharoni

May 26, 2012, 4:46:18 PM

On May 25, 2:28 am, Pete Becker <p...@versatilecoding.com> wrote:
> On 2012-05-24 09:11:58 +0000, Rani Sharoni said:
> > Thanks for the explanation. I see your point. std::create-thread is a
> > full memory barrier so for example there is no need for additional
> > barriers in order to access (from the new thread) memory that was
> > initialized before the create-thread call (i.e. no re-ordering is
> > allowed by the caller).
>
> There are two things that are enforced by barriers: no re-ordering, and
> cache coherence. For those who aren't up on this stuff, no re-ordering
> means the compiler can't rearrange the code in ways that invalidate the
> read/write rules. Cache coherence means that the hardware can't
> rearrange reads and writes. The underlying hardware issue is that,
> typically, with multiple cpus you have multiple caches; a write by one
> cpu writes data to its cache; a read by another cpu may read from its
> own cache or from main memory; if the contents of the first cache
> haven't been written to main memory, the read won't see the result of
> that write.

I used to think so too, and hence greatly feared the overhead and poor
scalability of barriers if cache flushes were required for
each use. After some reading I realized that the hardware goes to
great lengths to avoid this overhead by ensuring that caches are always
coherent (at least data caches; instruction-cache coherency is less
interesting since they are mostly read-only). Cache controllers
use heavy-duty machinery like MESI to ensure coherency.

Therefore barriers are needed to prevent software/hardware re-ordering of
independent memory accesses and also to flush the special, very small
intra-CPU store buffers that are used to ease the cache-coherency
overhead (i.e. avoid immediate cache writes followed by a read). x86/x64,
for example, only has re-ordering related to store buffers (AKA "write
ordered with store-buffer forwarding"); hence, for example, a
non-interlocked write on x86 has release semantics (spinlocks utilize
this on x86/x64).
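
The store-buffer effect is commonly illustrated with the "store buffering" litmus test: two threads each write one variable and then read the other. The sketch below (hypothetical name count_both_zero) uses the default sequentially consistent atomics, for which the C++ memory model forbids both reads returning 0. With std::memory_order_relaxed, or with a release store followed by an acquire load, that outcome would be permitted, and it is the re-ordering that store buffers can produce on x86/x64.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Counts how often both threads read 0. With seq_cst this must stay at 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int n = 0; n < iterations; ++n) {
        std::atomic<int> x(0);
        std::atomic<int> y(0);
        int r1 = -1;
        int r2 = -1;

        std::thread a([&x, &y, &r1] {
            x.store(1);     // seq_cst store
            r1 = y.load();  // seq_cst load
        });
        std::thread b([&x, &y, &r2] {
            y.store(1);
            r2 = x.load();
        });

        a.join();
        b.join();
        assert(r1 != -1 && r2 != -1);  // both threads ran
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```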

Interesting reading about memory barriers on several architectures:
http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

Rani

Pete Becker

May 27, 2012, 2:37:30 AM

Well, yes, some hardware architectures do more than others to ensure
coherence. But when explaining the issues involved in ensuring that
data changes are visible across threads, the explanations are often
messy, even when they're MESI. (sorry, couldn't resist). Invoking magic
doesn't help describe the problem.

> ...

>
> Interesting reading about memory barriers on several architectures:
> http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
>

Yup. Paul is one of the folks who helped design the C++ memory model.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)