static char string[] = "XXXXX XXXXX";
static bool volatile done = false;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void thread_1() {
    strcpy(string, "Hello World");
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    done = true;
}

void other_threads() {
    do {
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
        sched_yield();
    } while (! done);
    assert(! strcmp(string, "Hello World"));
}
I assume that on at least one platform, the lock/unlock pair should execute
the memory barriers I need. I also assume that the compiler will not reorder
the store to done above the lock/unlock pair in thread_1, and will not
reorder the load of done above the lock/unlock pair in other_threads.
Where am I going wrong? I MUST be missing something!
In fact, this actually does work on my x86 Linux desktop. But it's not
guaranteed by the spec.
First, the behaviour of sched_yield() is not defined by the spec for
non-realtime tasks.
Second, I'm not sure your assumption about the compiler not reordering
stores is correct. Both the write to "string" and the write to "done"
happen outside any locks, so they could theoretically be reordered with
respect to each other, resulting in "done" being true on other CPUs
before the write to "string".
Third, I'm not sure your assumption about the compiler not reordering
loads is correct. The read of "string" is outside any locks, so
nothing prevents the compiler from hoisting it earlier in the code, to
before the loop.
In practice though, even with optimization enabled gcc generates pretty
much the code that you would expect.
Chris
Your code is guaranteed valid. This is perfectly normal mutex use,
just with some code removed. It's similar to this paradigm:
A1) Construct an object.
A2) Acquire a mutex.
A3) Put a pointer to that object where other threads can find it.
A4) Release the mutex.
and
B1) Acquire a mutex.
B2) Get a pointer to an object that was initialized by another thread
but whose contents are known stable once the pointer is made visible.
B3) Release the mutex.
B4) Access the pointer.
You just omitted steps A3 and B2. But with those steps included this is
typical mutex use, and you know the pointer you would have gotten in B2
would be the one you stored in A3.
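For concreteness, a rough sketch of that paradigm (construct_object() and
access_object() are hypothetical helpers, not anything from your code):
______________________________________________________________
struct object;
struct object *construct_object(void);
void access_object(struct object *);

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static struct object *shared = NULL;

void producer(void)
{
    struct object *obj = construct_object(); /* A1 */
    pthread_mutex_lock(&mutex);              /* A2 */
    shared = obj;                            /* A3 */
    pthread_mutex_unlock(&mutex);            /* A4 */
}

void consumer(void)
{
    struct object *obj;
    pthread_mutex_lock(&mutex);              /* B1 */
    obj = shared;                            /* B2 */
    pthread_mutex_unlock(&mutex);            /* B3 */
    if (obj) access_object(obj);             /* B4 */
}
______________________________________________________________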
DS
You have a race condition here. You need to observe `done' as `true'
__BEFORE__ you perform the lock/unlock pair. Try something like this:
______________________________________________________________
// compile this separately, without optimizations
extern void membar()
{
    static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&g_mutex);
    pthread_mutex_unlock(&g_mutex);
}

// link with the object file that contains `membar()';
// turn link-time optimizations OFF!
// `atomic_word'/ATOMIC_STORE/ATOMIC_LOAD are placeholders for
// platform word-sized atomic operations.
static char string[] = "XXXXX XXXXX";
static atomic_word volatile done = 0;

static void thread_1() {
    strcpy(string, "Hello World");
    membar();
    ATOMIC_STORE(&done, 1);
}

static void other_threads() {
    while (! ATOMIC_LOAD(&done)) sched_yield();
    membar();
    assert(! strcmp(string, "Hello World"));
}
______________________________________________________________
In your example as-is, `other_threads()' can lock/unlock the mutex; then
`thread_1()' stores `true' into `done'; then `other_threads()' loads
`done'... BOOM!!!
The fix is to observe `done' first, then lock/unlock the mutex. This is
because if a consumer processor observes `done' as being true, it must mean
that the producer thread has already produced the data and locked/unlocked
the mutex. The only thing left to do is synchronize on that mutex.
> I assume that on at least one platform, the lock/unlock pair should
> execute the memory barriers I need.
Agreed.
;^D
> I also assume that the compiler will not reorder the store to done above
> the lock/unlock pair in thread_1, and will not reorder the load of done
> above the lock/unlock pair in other_threads.
Make sure to turn link-time optimizations off! Basically, a compiler usually
treats a call to an externally compiled unknown function as a so-called
compiler barrier. FWIW, MSVC has compiler barrier intrinsics to get the job
done.
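For illustration, a minimal sketch of such compiler-only barriers (the GCC
inline-asm form and the MSVC `_ReadWriteBarrier()' intrinsic; neither emits
a hardware fence, they only stop the compiler from reordering across the
point):
______________________________________________________________
#if defined(_MSC_VER)
#include <intrin.h>
/* MSVC: compiler barrier intrinsic */
#define COMPILER_BARRIER() _ReadWriteBarrier()
#elif defined(__GNUC__)
/* GCC: empty asm with a memory clobber */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif
______________________________________________________________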
> Where am I going wrong? I MUST be missing something!
You got the placement of the "artificial" membar wrong.
Read this entire thread:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/29ea516c5581240e
http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6
> Second, I'm not sure your assumption about the compiler not reordering
> stores is correct. Both the write to "string" and the write to "done"
> happen outside any locks, so they could theoretically be reordered with
> respect to each other, resulting in "done" being true on other CPUs
> before the write to "string".
>
> Third, I'm not sure your assumption about the compiler not reordering
> loads is correct. The read of "string" is outside any locks, so
> nothing prevents the compiler from hoisting it earlier in the code, to
> before the loop.
On further reflection, I think that these are not valid concerns and
that the loads/stores could not be moved across the pthread_mutex
calls.
However, I think that the other Chris has a valid point that the
lock/unlock pair needs to come after the read of "done" but before the
read of "string".
Chris
If pthread_mutex_lock() is considered only as acquire and
pthread_mutex_unlock() is considered only as release, then
pthread_mutex_lock() can hoist above and pthread_mutex_unlock() can
sink below:
pthread_mutex_lock(&mutex);
strcpy(string, "Hello World");
done = true;
pthread_mutex_unlock(&mutex);
And then there are 2 normal memory accesses that can be reordered:
pthread_mutex_lock(&mutex);
done = true;
strcpy(string, "Hello World");
pthread_mutex_unlock(&mutex);
--
Dmitriy V'jukov
Okay. So, assuming that I changed the code to what Chris M. Thomasson
suggested:
http://groups.google.com/group/comp.programming.threads/msg/d52c83e1d755091d
void thread_1() {
    strcpy(string, "Hello World");
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    done = true;
}

void other_threads() {
    while (! done) sched_yield();
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    assert(! strcmp(string, "Hello World"));
}
And assuming all the hoist/sink reorderings that can occur, the "worst"
possible outcome could look like:
void thread_1() {
    pthread_mutex_lock(&mutex);
    done = true;
    strcpy(string, "Hello World");
    pthread_mutex_unlock(&mutex);
}

void other_threads() {
    pthread_mutex_lock(&mutex);
    assert(! strcmp(string, "Hello World"));
    while (! done) sched_yield();
    pthread_mutex_unlock(&mutex);
}
AFAICT, that will still work. Dmitriy, can a particular reordering break the
code? It seems like if the lock can hoist above and the unlock can sink
below, all that does is "widen" the critical section, and it does not seem
like it would be able to break the code in the sense of the assertion
failing.
Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in a
single function and said to externally compile it. I think he was trying to
reduce compiler-level reordering. It seems like you are talking about
hardware reordering? Also, if the pthread_mutex_lock has a storeload, then
is it still considered to have acquire semantics? Or is it something
stronger?
Sorry for all the questions.
Thanks!
:^)
Ummm, hopefully the compiler would never do this, because that could cause a
deadlock. Forget about the actual while loop; I was basically referring to
the load from done in the predicate of the loop.
Humm... I think you are right, I do not see any reordering that breaks
the code. I just mixed up your variant with the following variant
(that I tried to use myself):

// thread 1
g_data[index] = produce_data();
// here we are using a "mutex barrier"
pthread_mutex_lock();
pthread_mutex_unlock();
g_index = index;

// thread 2
while (*(int volatile*)&g_index == 0) { /* spin */ }
// here we are relying on the fact that compiler and hardware
// respect control dependencies
consume_data(g_data[g_index]);
And this code can be broken by the reordering.
So it seems that "mutex barrier" can be used iff it is used on both
sides.
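For example, a sketch of the two-sided variant (same names as the fragment
above, with a shared g_mutex assumed):
______________________________________________________________
// thread 1
g_data[index] = produce_data();
pthread_mutex_lock(&g_mutex);   // "mutex barrier" on the producer side
pthread_mutex_unlock(&g_mutex);
g_index = index;

// thread 2
while (*(int volatile*)&g_index == 0) { /* spin */ }
pthread_mutex_lock(&g_mutex);   // matching barrier on the consumer side
pthread_mutex_unlock(&g_mutex);
consume_data(g_data[g_index]);
______________________________________________________________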
> Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in a
> single function and said to externally compile it. I think he was trying
> to reduce compiler-level reordering. It seems like you are talking about
> hardware reordering? Also, if the pthread_mutex_lock has a storeload, then
> is it still considered to have acquire semantics? Or is it something
> stronger?
Yes, many implementations use a full barrier in lock(), and sometimes in
unlock() too.
--
Dmitriy V'jukov
> Humm... I think you are right, I do not see any reordering that breaks
> the code.
> I just mixed up your variant with the following variant
> (that I tried to use myself):
[...]
> And this code can be broken by the reordering.
> So it seems that "mutex barrier" can be used iff it is used on both
> sides.
Assuming that the mutex lock can rise and the unlock can sink, can you
"force" a certain ordering by using more than one mutex? Think of this:
struct thread
{
    pthread_mutex_t m_mutex1;
    pthread_mutex_t m_mutex2;

    thread()
    {
        pthread_mutex_init(&m_mutex1, NULL);
        pthread_mutex_init(&m_mutex2, NULL);
        lock1();
    }

    ~thread()
    {
        unlock1();
        pthread_mutex_destroy(&m_mutex2);
        pthread_mutex_destroy(&m_mutex1);
    }

    void lock1() { pthread_mutex_lock(&m_mutex1); }
    void unlock1() { pthread_mutex_unlock(&m_mutex1); }
    void lock2() { pthread_mutex_lock(&m_mutex2); }
    void unlock2() { pthread_mutex_unlock(&m_mutex2); }

    void membar()
    {
        lock2();
        unlock1();
        lock1();
        unlock2();
    }
};
static int g_state = 0;
static bool volatile g_done = 0;

void thread_1(thread& t)
{
    g_state = 666;
    t.membar();
    g_done = true;
}

void other_threads(thread& t)
{
    while (! g_done) spin();
    t.membar();
    assert(g_state == 666);
}
Would that work? AFAICT, the store to g_state should be performed before the
store to g_done in thread_1 because of the funky mutex usage. Also, I don't
see how anything can be reordered that would break the assertion.
What am I missing?
>> Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in a
>> single function and said to externally compile it. I think he was trying
>> to reduce compiler-level reordering. It seems like you are talking about
>> hardware reordering? Also, if the pthread_mutex_lock has a storeload,
>> then is it still considered to have acquire semantics? Or is it something
>> stronger?
>
> Yes, many implementations use a full barrier in lock(), and sometimes in
> unlock() too.
So, it seems possible to emulate memory barriers using per-thread
mutexes...?
> static int g_state = 0;
> static bool volatile g_done = 0;
>
> void thread_1(thread& t)
> {
>     g_state = 666;
>     t.membar();
>     g_done = true;
> }
Let's examine that, assuming mutex lock is acquire and unlock is release:
__________________________________________________________
void thread_1(thread& t)
{
S1: g_state = 666;
    t.membar()
    {
M1:     lock2();
M2:     unlock1();
M3:     lock1();
M4:     unlock2();
    };
S2: g_done = true;
}
__________________________________________________________
Okay. I see something important here:
1. M2 and M3 should never hoist above M1.
2. M2 and M3 should never sink below M4.
3. M3 should never hoist above M2
4. M2 should never sink below M3
This allows for another important observation:
1. S1 should never sink below M2
2. S2 should never hoist above M3
AFAICT, that gives you something that is analogous to a
`std::memory_order_acq_rel' barrier.
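For comparison, a minimal sketch of the same pattern expressed with C++0x
stand-alone fences instead of the double-mutex membar() (assuming g_done is
changed to a std::atomic<bool>; not the original code):
__________________________________________________________
#include <atomic>
#include <cassert>

static int g_state = 0;
static std::atomic<bool> g_done(false);

void thread_1()
{
    g_state = 666;
    std::atomic_thread_fence(std::memory_order_acq_rel); // ~ t.membar()
    g_done.store(true, std::memory_order_relaxed);
}

void other_threads()
{
    while (! g_done.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acq_rel); // ~ t.membar()
    assert(g_state == 666);
}
__________________________________________________________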
> Would that work?
Yes. BTW, I use the same method for the lock-based version of vZOOM:
http://groups.google.ru/group/comp.programming.threads/msg/59e9b6e427b4a144
except that I allow the polling thread to synchronize with the reader
threads by actually acquiring the per-thread owner_locks.
One more thing, don't forget about asymmetric synchronization. If a mutex
uses that type of sync, then you basically need to have a "foreign" thread
try to acquire the mutex in order to force a memory barrier on the
"preferred" thread. This is one of the reasons why I allow the polling logic
to sync on the owner_locks.
[...]
[...]
Actually, AFAICT you don't need the extra mutex for this example. The
reason why I use it is to be in conformance with the POSIX standard. You
can code `thread_1' like:
____________________________________________________________
static int g_state = 0;
static bool volatile g_done = 0;

void thread_1()
{
    // per-thread mutex
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&mutex);
S1: g_state = 666;
M1: pthread_mutex_unlock(&mutex);
M2: pthread_mutex_lock(&mutex);
S2: g_done = 1;
    pthread_mutex_unlock(&mutex);
}
____________________________________________________________
- M1 should never hoist above S1
- M2 should never sink below S2
- M2 should never hoist above M1
- M1 should never sink below M2
- S1 should never sink below M1 and M2
- S2 should never hoist above M1 and M2
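And presumably the consuming side would use the same trick with its own
per-thread mutex; a sketch (not tested, and like the above it leans on
implementation behavior rather than on what POSIX guarantees):
____________________________________________________________
void other_threads()
{
    // per-thread mutex
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&mutex);
    while (! g_done) sched_yield(); // observe the flag first
M1: pthread_mutex_unlock(&mutex);   // "membar" pair:
M2: pthread_mutex_lock(&mutex);     // unlock, then lock
    assert(g_state == 666);
    pthread_mutex_unlock(&mutex);
}
____________________________________________________________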
>> Would that work?
>
> Yes. BTW, I use the same method for the lock-based version of vZOOM:
[...]
The synchronization aspect of the particular embodiment in pseudo-code looks
like:
___________________________________________________________
struct thread
{
    mutex m_memory_mutex;
    mutex m_owner_mutex;
    unsigned long m_version;
};

#define VZSYNC 512

void vzthread(thread& t)
{
    t.m_memory_mutex.lock();
    for (size_t i = 1 ;; ++i)
    {
        // read shared data-structure
        if (! (i % VZSYNC))
        {
            t.m_owner_mutex.lock();
            t.m_memory_mutex.unlock();
            ++t.m_version;
            t.m_memory_mutex.lock();
            t.m_owner_mutex.unlock();
        }
    }
    t.m_memory_mutex.unlock();
}

void vzpoll()
{
    for (;;)
    {
        // snapshot
        for each thread as t
        {
            t.m_owner_mutex.lock();
            unsigned long version = t.m_version;
            t.m_owner_mutex.unlock();
        }
        // blah
    }
}
___________________________________________________________
When the version count in a snapshot differs, it means that the reader
thread's actions prior to the version update are fully visible to the
polling thread. This is due to the fact that the memory mutex is unlocked
and locked within the critical section guarded by the owner mutex.
> Humm... I think you are right, I do not see any reordering that breaks
> the code.
I modeled this scenario in Relacy:
http://relacy.pastebin.com/f4b36dd98
It cannot find a way to break it either.
;^)
> I just mixed up your variant with the following variant
> (that I tried to use myself):
> // thread 1
> g_data[index] = produce_data();
> // here we are using a "mutex barrier"
> pthread_mutex_lock();
> pthread_mutex_unlock();
> g_index = index;
>
> // thread 2
> while (*(int volatile*)&g_index == 0) { /* spin */ }
> // here we are relying on the fact that compiler and hardware
> // respect control dependencies
> consume_data(g_data[g_index]);
> And this code can be broken by the reordering.
> So it seems that "mutex barrier" can be used iff it is used on both
> sides.
Yes. I was wondering how to model this in Relacy:
http://groups.google.com/group/comp.programming.threads/msg/3f8ef89bff9bff02
Humm... Would Relacy treat the per-thread mutex actions like:
_______________________________________________________________
void thread_1()
{
    std::mutex m;
    m.lock($);
T1-M1: m.unlock($);
T1-M2: m.lock($);
    m.unlock($);
}

void thread_2()
{
    std::mutex m;
    m.lock($);
T2-M1: m.unlock($);
T2-M2: m.lock($);
    m.unlock($);
}
_______________________________________________________________
as being equivalent to:
_______________________________________________________________
void thread_1()
{
T1-M1: std::atomic_thread_fence(std::memory_order_release);
T1-M2: std::atomic_thread_fence(std::memory_order_acquire);
}

void thread_2()
{
T2-M1: std::atomic_thread_fence(std::memory_order_release);
T2-M2: std::atomic_thread_fence(std::memory_order_acquire);
}
_______________________________________________________________
?
I don't think it will, even though it would be equivalent in practice;
asymmetric synchronization schemes aside for the moment, of course...
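FWIW, a hypothetical model of the per-thread "mutex barrier", written from
memory of the Relacy API (rl::test_suite, $-decorated accesses, VAR_T/VAR,
RL_ASSERT), with a conditional standing in for the spin loop; if Relacy
indeed treats each mutex as synchronizing only with itself, it should
report a race on g_state here:
_______________________________________________________________
#include <relacy/relacy_std.hpp>

struct mutex_barrier_test : rl::test_suite<mutex_barrier_test, 2>
{
    std::mutex m1; // thread 0's per-thread mutex
    std::mutex m2; // thread 1's per-thread mutex
    VAR_T(int) g_state;
    std::atomic<int> g_done;

    void before()
    {
        VAR(g_state) = 0;
        g_done($).store(0, std::memory_order_relaxed);
    }

    void thread(unsigned index)
    {
        if (0 == index)
        {
            VAR(g_state) = 666;
            m1.lock($); m1.unlock($); // "mutex barrier"
            g_done($).store(1, std::memory_order_relaxed);
        }
        else if (g_done($).load(std::memory_order_relaxed))
        {
            m2.lock($); m2.unlock($); // "mutex barrier"
            RL_ASSERT(666 == VAR(g_state));
        }
    }
};

int main()
{
    rl::simulate<mutex_barrier_test>();
    return 0;
}
_______________________________________________________________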
[...]
> I modeled this scenario in Relacy:
>
> http://relacy.pastebin.com/f4b36dd98
>
> It cannot find a way to break it either.
>
> ;^)
Note that one still has to use std::atomic for g_flag, even if actual
"synchronization" is provided by mutexes. And if one has std::atomic,
then there is little sense in using mutexes for synchronization in
this example.
However, I think there is still benefit in testing with Relacy something
that is not exactly what you have in production. If Relacy finds some
issue in a test, and you then manually verify that the same issue
applies to the original production code, well, it's a plus. Some
issues may be masked, though.
> > I just mixed up your variant with the following variant
> > (that I tried to use myself):
> > // thread 1
> > g_data[index] = produce_data();
> > // here we are using a "mutex barrier"
> > pthread_mutex_lock();
> > pthread_mutex_unlock();
> > g_index = index;
> >
> > // thread 2
> > while (*(int volatile*)&g_index == 0) { /* spin */ }
> > // here we are relying on the fact that compiler and hardware
> > // respect control dependencies
> > consume_data(g_data[g_index]);
> > And this code can be broken by the reordering.
> > So it seems that "mutex barrier" can be used iff it is used on both
> > sides.
>
> Yes. I was wondering how to model this in Relacy:
>
> http://groups.google.com/group/comp.programming.threads/msg/3f8ef89bf...
Relacy will treat them differently.
Mutex lock/unlock is basically the same as an acquire/release operation
on a variable. Stand-alone fences are more powerful in some sense,
because they do not bind to a particular variable. And at the same
time, stand-alone fences (except seq_cst) are weaker, because they do
not establish a total order over operations (as a store/RMW on a
particular atomic does).
--
Dmitriy V'jukov
> > I modeled this scenario in Relacy:
> >
> > http://relacy.pastebin.com/f4b36dd98
> >
> > It cannot find a way to break it either.
> >
> > ;^)
> Note that one still has to use std::atomic for g_flag, even if actual
> "synchronization" is provided by mutexes. And if one has std::atomic,
> then there is little sense in using mutexes for synchronization in
> this example.
> However, I think there is still benefit in testing with Relacy something
> that is not exactly what you have in production. If Relacy finds some
> issue in a test, and you then manually verify that the same issue
> applies to the original production code, well, it's a plus. Some
> issues may be masked, though.
Agreed.
> > I just mixed up your variant with the following variant
> > (that I tried to use myself):
[...]
>
> Yes. I was wondering how to model this in Relacy:
>
> http://groups.google.com/group/comp.programming.threads/msg/3f8ef89bf...
>
> Humm... Would Relacy treat the per-thread mutex actions like:
> _______________________________________________________________
[...]
> _______________________________________________________________
>
> as being equivalent to:
> _______________________________________________________________
[...]
> _______________________________________________________________
>
> ?
>
> I don't think it will, even though it would be equivalent in practice;
> asymmetric synchronization schemes aside for the moment, of course...
> Relacy will treat them differently.
> Mutex lock/unlock is basically the same as an acquire/release operation
> on a variable.
I think using mutexes this way would be undefined behavior wrt C++0x.
> Stand-alone fences are more powerful in some sense,
> because they do not bind to a particular variable. And at the same
> time, stand-alone fences (except seq_cst) are weaker, because they do
> not establish a total order over operations (as a store/RMW on a
> particular atomic does).
I personally like stand-alone fences because IMHO they are more flexible. I
can put them exactly where I need them; e.g.:
________________________________________________________________
#define N 10

static int g_values[N] = { 0 };
static std::atomic<bool> g_flags[N] = {};

void thread_1()
{
    for (size_t i = 0; i < N; ++i)
    {
        g_values[i] = 666;
    }

    std::atomic_thread_fence(std::memory_order_release);

    for (size_t i = 0; i < N; ++i)
    {
        g_flags[i].store(true, std::memory_order_relaxed);
    }
}

void other_threads()
{
retry:
    for (size_t i = 0; i < N; ++i)
    {
        if (! g_flags[i].load(std::memory_order_relaxed))
        {
            spin();
            goto retry;
        }
    }

    std::atomic_thread_fence(std::memory_order_acquire);

    for (size_t i = 0; i < N; ++i)
    {
        assert(g_values[i] == 666);
    }
}
________________________________________________________________
As far as a total order over operations goes, I am not sure I understand
what you mean. I expect the following to be out of order:
________________________________________________________________
std::atomic<int> g_value1;
std::atomic<int> g_value2;

void foo()
{
    g_value1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    g_value2.load(std::memory_order_relaxed);
}
________________________________________________________________
but not this:
________________________________________________________________
std::atomic<int> g_value1;
std::atomic<int> g_value2;

void foo()
{
    g_value1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    g_value2.load(std::memory_order_relaxed);
}
________________________________________________________________
AFAIR, Paul McKenney presented something along the lines of your example
as a motivating example for stand-alone fences.
> As far as a total order over operations goes, I am not sure I understand
> what you mean. I expect the following to be out of order:
> ________________________________________________________________
> std::atomic<int> g_value1;
> std::atomic<int> g_value2;
>
> void foo()
> {
> g_value1.store(1, std::memory_order_relaxed);
> std::atomic_thread_fence(std::memory_order_release);
> g_value2.load(std::memory_order_relaxed);
>
> ________________________________________________________________
Regarding total order, I mean that your above code won't work (even if
release is replaced with acq_rel).
However, if we replace the stand-alone acq_rel fence with an atomic RMW
with acq_rel ordering:

g_value1.store(1, std::memory_order_relaxed);
g_sync.exchange(0, std::memory_order_acq_rel);
g_value2.load(std::memory_order_relaxed);

it must work, because an RMW on a particular variable establishes a total
order (over all RMWs on that variable), while a stand-alone fence does not.
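For instance, a self-contained sketch of that claim (whichever exchange
comes second in g_sync's modification order must acquire the other
thread's store, so r1 == 0 && r2 == 0 should be impossible):
___________________________________________________
#include <atomic>
#include <cassert>
#include <thread>

static std::atomic<int> g_value1(0), g_value2(0), g_sync(0);
static int r1, r2;

void t1()
{
    g_value1.store(1, std::memory_order_relaxed);
    g_sync.exchange(0, std::memory_order_acq_rel); // totally ordered RMW
    r1 = g_value2.load(std::memory_order_relaxed);
}

void t2()
{
    g_value2.store(1, std::memory_order_relaxed);
    g_sync.exchange(0, std::memory_order_acq_rel); // totally ordered RMW
    r2 = g_value1.load(std::memory_order_relaxed);
}

int main()
{
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    assert(r1 == 1 || r2 == 1);
    return 0;
}
___________________________________________________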
--
Dmitriy V'jukov
> Regarding total order, I mean that your above code won't work (even if
> release is replaced with acq_rel).
Right. When I wrote:
"I expect the following to be out of order"
I meant that I expect the load from `g_value2' to be able to rise above the
store to `g_value1'.
> However, if we replace the stand-alone acq_rel fence with an atomic RMW
> with acq_rel ordering:
> g_value1.store(1, std::memory_order_relaxed);
> g_sync.exchange(0, std::memory_order_acq_rel);
> g_value2.load(std::memory_order_relaxed);
> It must work, because an RMW on a particular variable establishes a total
> order (over all RMWs on that variable), while a stand-alone fence does not.
I would also expect this to work:
___________________________________________________
g_value1.store(1, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
g_value2.load(std::memory_order_relaxed);
___________________________________________________