
Double checked locking pattern article on aristeia


nospam

Aug 17, 2011, 3:51:38 PM

There is something in this article that puzzles me
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

The article says the following code may not work correctly in a
multi-threaded environment for two reasons. The code has volatile all
over the place to prevent the compiler from re-ordering code. The
idea is to avoid acquiring the (expensive) Lock every time you need to
access the singleton.


class Singleton {
public:
    static volatile Singleton* volatile instance();
    //...
private:
    //...
    static volatile Singleton* volatile pInstance;
};

// from the implementation file
volatile Singleton* volatile Singleton::pInstance = 0;

volatile Singleton* volatile Singleton::instance() {
    if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
            volatile Singleton* volatile temp = new volatile Singleton;
            pInstance = temp;
        }
    }
    return pInstance;
}


The first reason why this code may fail in a multi-threaded
environment is given on page 10:
<quote>
First, the Standard’s constraints on observable behavior are only for
an abstract machine defined by the Standard, and that abstract machine
has no notion of multiple threads of execution. As a result, though
the Standard prevents compilers from reordering reads and writes to
volatile data within a thread, it imposes no constraints at all on
such reorderings across threads. At least that’s how most compiler
implementers interpret things. As a result, in practice, many
compilers may generate thread-unsafe code from the source above.
<end quote>

I can't figure out what the above quoted text is getting at. Can
anyone explain? What does "re-ordering across threads" mean?


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Michael

Aug 24, 2011, 7:44:20 PM
On Aug 17, 2:51 pm, nospam <nos...@nospam.nospam> wrote:
> There is something in this article that puzzles me
> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

There are two issues here:

1. In accordance with the C++03 standard, the compiler optimizer is
completely free to move both pInstance reads to *before* the Lock
construction -- essentially negating the value of the mutex.
Practically speaking, many compilers don't do this because compilers
have customers and customers expect code like this to work. However,
the standard doesn't preclude it, so any dependence on this behavior
is by definition non-portable.
2. While 'volatile' qualified variables mean that reads and writes
cannot be elided or re-ordered, there is no guarantee that (a) writes
are atomic or (b) that writes by one thread are immediately visible to
any other thread. I.e. CPU cores have data caches and they must be
occasionally invalidated and refreshed. C++03 makes no guarantees
about how these caches must be maintained. As a result, in this code,
more than one Singleton could be constructed as each thread entering
the function might find a stale value for pInstance in the CPU cache.

C++11 (ratified just last week) addresses both of these issues.
Compilers are no longer allowed to reorder read/write instructions
across std::mutex acquire/release boundaries (issue #1).
std::atomic<T> makes guarantees about atomicity and visibility of
write instructions (issue #2).

In fact, with C++11, all of the problems go away if you change the
declaration of pInstance to:
static std::atomic<Singleton*> pInstance;

*voilà*


Mike

Andy Venikov

Aug 24, 2011, 7:44:20 PM

It means that within the bounds of C++98/03 you can't reliably use
threads. The reason we do, and have been using them for a long time, is
that the threading libraries (like POSIX threads) make additional
guarantees. However, when you use the DCLP you forgo those guarantees,
since you access a shared entity outside of the protected scope.
Generally, there was no way to use the DCLP portably and reliably.
One could still make it work by writing very platform-specific code
(for a specific compiler/OS/hardware), making sure that volatile does
what you expect, and using the right memory barriers. But as a general
rule, you couldn't do it, not least because even the use of volatile
plus memory barriers for threading is not guaranteed to give you the
results you want.

In C++0x (or should we start calling it C++11?) you're perfectly
able to do this portably with the use of atomics and the correct
ordering directives.

HTH,
Andy.

Anthony Williams

Aug 24, 2011, 7:44:21 PM
On Aug 17, 8:51 pm, nospam <nos...@nospam.nospam> wrote:
> There is something in this article that puzzles me
> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

Different threads may see writes to separate variables in different
orders, unless there is appropriate synchronization used.

e.g. thread A writes x=1, y=2. Threads B and C both read y and then x.
Thread B sees y==2, x==1. Thread C sees y==2, x==0.

Unless you have explicit synchronization, such as a mutex lock, this
is a valid outcome. Technically, this is undefined behaviour by the
C++11 standard, unless x and y are atomic variables, and even then if
the operations are memory_order_relaxed then this outcome is still
permitted.
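To make that scenario concrete, here is a sketch of it with C++11 relaxed atomics (the function and variable names are mine). No interleaving of the source lines can produce ry==2, rx==0, yet the memory model permits it:

```cpp
#include <atomic>
#include <thread>
#include <utility>

// One trial of the scenario above: thread A writes x then y,
// thread B reads y then x, all with relaxed ordering.
// Returns {ry, rx} as seen by thread B.
std::pair<int, int> trial() {
    std::atomic<int> x{0}, y{0};
    int ry = -1, rx = -1;
    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        y.store(2, std::memory_order_relaxed);
    });
    std::thread b([&] {
        ry = y.load(std::memory_order_relaxed);
        rx = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
    // ry==2 && rx==0 is a permitted outcome under relaxed ordering,
    // though on strongly-ordered hardware (e.g. x86) you may never see it.
    return std::make_pair(ry, rx);
}
```

Each observed value is either the initial 0 or the written value; which combination thread B sees is not constrained beyond that.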

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No.
5478976

Dave Abrahams

Aug 24, 2011, 7:44:21 PM
[Apologies if this turns out to be a duplicate]

on Wed Aug 17 2011, nospam <nospam-AT-nospam.nospam> wrote:

> There is something in this article that puzzles me
> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
>
> The article says the following code may not work correctly in a
> multi-threaded environment for two reasons. The code has volatile all
> over the place to prevent the compiler from re-ordering code.

Even if volatile has that effect, the CPU is still allowed to "re-order
code" as long as the effects *observable in a single thread* aren't
changed.

> The idea is to avoid acquiring the (expensive) Lock every time you
> need to access the singleton.

Yes, an idiom which is well known to be very popular and very broken.

One way to look at this is that each thread may run on a separate
physical CPU or core, each one of which has its own cache. The caches
are only synchronized to one another by chance (when cache lines are
flushed to main memory and read back in elsewhere) and by explicit CPU
instructions (e.g. atomics and memory barriers) that are part of
threading primitives like mutex locks but are not generated in ordinary
C++ code. The result is that, without these explicit instructions, one
thread may not observe writes to a given memory location happening in
the same order as they are observed by another thread. That's what he
means by "re-ordering across threads."

Note: don't assume that just because you have only a single CPU or core
you are safe from these effects: compiler writers generally assume that
your code deserves no more protection from cross-thread confusion just
because your threads are running on a single core, and they don't go out
of their way to make sure you'll observe sensible effects unless you
correctly use the special CPU instructions to ensure that your threads
have a consistent view of the world.

Hope that helps.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Alexander Terekhov

Aug 26, 2011, 10:20:04 PM
Michael wrote:
[...]

> In fact, with C++11, all of the problems go away if you change the
> declaration of pInstance to:
> static std::atomic<Singleton*> pInstance;
>
> *voilà*

A sequentially consistent atomic (the C++11 default for atomics) is
overkill for a typical "double-checked locking pattern" using a pointer
(dynamically allocated object).

C++11 has the better-suited
std::memory_order_consume/memory_order_release pair (albeit, frankly, I
just can't grok the associated std::kill_dependency() thing).

Elsewhere in this thread...

Andy Venikov wrote:
[...]
> In C++0x (or should we start calling it C++11?) ...

C++0xB sounds much uglier than C++11. :-)

regards,
alexander.

Alexander Terekhov

Aug 26, 2011, 10:20:54 PM
Dave Abrahams wrote:
[...]

> Note: don't assume that just because you have only a single CPU or core
> you are safe from these effects: compiler writers generally assume that
> your code deserves no more protection from cross-thread confusion just
> because your threads are running on a single core, and they don't go out
> of their way to make sure you'll observe sensible effects unless you
> correctly use the special CPU instructions to ensure that your threads
> have a consistent view of the world.

No, compiler reordering aside, no special CPU instructions regarding
memory ordering/barriers are needed if a multithreaded program is
restricted to run on a single core/uniprocessor (not have more than one
thread running at the same time). For a uniprocessor, all fences/barrier
instructions can be just ignored (they are not needed). If you have
contrary evidence, please point to it.

regards,
alexander.



Dave Abrahams

Aug 27, 2011, 8:54:47 AM
on Fri Aug 26 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:

> Dave Abrahams wrote:
> [...]
>> Note: don't assume that just because you have only a single CPU or core
>> you are safe from these effects: compiler writers generally assume that
>> your code deserves no more protection from cross-thread confusion just
>> because your threads are running on a single core, and they don't go out
>> of their way to make sure you'll observe sensible effects unless you
>> correctly use the special CPU instructions to ensure that your threads
>> have a consistent view of the world.
>
> No, compiler reordering aside, no special CPU instructions regarding
> memory ordering/barriers are needed if a multithreaded program is
> restricted to run on a single core/uniprocessor (not have more than one
> thread running at the same time). For a uniprocessor, all fences/barrier
> instructions can be just ignored (they are not needed). If you have
> contrary evidence, please point to it.

I don't; I think I misspoke. What I meant to say was "unless you
correctly use synchronization to ensure..." My point is that in
general, even on a uniprocessor you shouldn't consider it safe to modify
shared state without synchronization.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

darkkn...@gmail.com

Sep 4, 2011, 5:01:15 AM
On Wed, 24 Aug 2011 16:44:21 -0700 (PDT), Anthony Williams
<antho...@gmail.com> wrote:

>On Aug 17, 8:51 pm, nospam <nos...@nospam.nospam> wrote:
>> There is something in this article that puzzles me
>> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
>>
>>

>> The first reason given for why this code may fail in a multi-threaded
>> environment is given on page 10
>> <quote>
>> First, the Standard's constraints on observable behavior are only for
>> an abstract machine defined by the Standard, and that abstract machine
>> has no notion of multiple threads of execution. As a result, though
>> the Standard prevents compilers from reordering reads and writes to
>> volatile data within a thread, it imposes no constraints at all on
>> such reorderings across threads. At least that's how most compiler
>> implementers interpret things. As a result, in practice, many
>> compilers may generate thread-unsafe code from the source above.
>> <end quote>
>>
>> I can't figure out what the above quoted text is getting at. Can
>> anyone explain? What does "re-ordering across threads" mean?
>
>Different threads may see writes to separate variables in different
>orders, unless there is appropriate synchronization used.
>
>e.g. thread A writes x=1, y=2. Threads B and C both read y and then x.
>Thread B sees y==2, x==1. Thread C sees y==2, x==0.
>
>Unless you have explicit synchronization, such as a mutex lock, this
>is a valid outcome. Technically, this is undefined behaviour by the
>C++11 standard, unless x and y are atomic variables, and even then if
>the operations are memory_order_relaxed then this outcome is still
>permitted.

OK, thanks. It seems to me that, based on what you're saying, the text
I'm quoting (above) from the article is at least confusing when it
says "many compilers may generate thread-unsafe code". I took this to
mean that the compiler might "accidentally" generate thread-safe code.
It's hard to imagine any compiler, even a thread-aware compiler,
automatically inserting the synchronization needed to make it safe.

bob bob

Oct 21, 2011, 7:54:50 PM
I'm sorry to bring up an old thread, but much like the OP I read Scott
Meyers and Andrei Alexandrescu's paper on double-checked locking and
came away a bit confused. Scouring Google led me to this post, which
brings me to this response:

On Aug 24, 7:44 pm, Michael <mcmcca...@gmail.com> wrote:
>
> There are two issues here:
>
> 1. In accordance with the C++03 standard, the compiler optimizer is
> completely free to move both pInstance reads to *before* the Lock
> construction -- essentially negating the value of the mutex.

There's clearly something I'm not understanding here. Wouldn't a
mutex lock/unlock imply a memory barrier of some sort? If the
compiler can move operations across mutex lock/unlock boundaries,
that would basically render them useless, would it not? Say, under
VC++, doesn't boost::mutex call the MemoryBarrier() macro (or some
equivalent) deep within its bowels?

Along the same lines, why does the code presented by Meyers and
Alexandrescu require two barriers (one inside the lock, and one
outside)? The one inside the lock is clearly to ensure a 'hard
sequence point' (as they call it), so that the object is fully
constructed prior to being assigned to the pointer. But I'm not sure
what the second barrier prevents. I guess this leads back to my
original assumption that mutex lock/unlocks are barriers themselves.

I thank you for your time in advance, and as this is one of my first
times posting here, I apologize if I broke any etiquette rules.

Mathias Gaunard

Oct 22, 2011, 5:18:13 PM
On Oct 22, 1:54 am, bob bob <r_bogh...@hotmail.com> wrote:

> There's clearly something I'm not understanding here. Wouldn't a
> mutex lock/unlock imply a memory barrier of some sort?

Yes, but C++03 didn't guarantee that; it didn't even know about
threads.

> Along the same lines, why does the code presented by Meyers and
> Alexandrescu require two barriers (one inside the lock, and one
> outside)? The one inside the lock is clearly to ensure a 'hard
> sequence point' (as they call it), so that the object is fully
> constructed prior to being assigned to the pointer. But I'm not sure
> what the second barrier prevents. I guess this leads back to my
> original assumption that mutex lock/unlocks are barriers themselves.

The whole point of double-checked locking is to accelerate locking.
Doing the inner lock directly would work just as well; it would just
force a hard barrier every time.

bob bob

Oct 23, 2011, 5:14:30 PM
On Oct 22, 5:18 pm, Mathias Gaunard <loufo...@gmail.com> wrote:
> On Oct 22, 1:54 am, bob bob <r_bogh...@hotmail.com> wrote:
>
>> There's clearly something I'm not understanding here. Wouldn't a
>> mutex lock/unlock imply a memory barrier of some sort?
>
> Yes, but C++03 didn't guarantee that; it didn't even know about
> threads.

But wouldn't the 'guarantee' be a requirement of the mutex? You can't
write a mutex without compiler/OS support as is, so I would just
assume that any mutex implementation (whether it be a critical section
from the Win32 API, pthreads, or Boost threads) would force a memory
barrier whenever lock/unlock was called (using whatever asm/compiler
pragmas/MemoryBarrier macros/'what have yous' were necessary). I
understand that C++03 makes no guarantee, but pretty much the entire
mutex concept exists in the 'undefined' netherworld of C++; I just
assumed that barriers were part and parcel of whatever magic mutexes
were made of under the hood. I'm sorry, I'm not trying to belabor
the point; it's just that from what I've been reading, it seems to be
implied that mutexes don't necessarily imply a barrier.

Edek

Oct 23, 2011, 5:16:06 PM
It means, for example, that if thread A creates a Singleton
and thread B sees pInstance != 0, it might not see the insides
of the Singleton initialised. Why? Because thread B does not call
anything that would prevent reordering, so it might see pInstance
first and Singleton insides initialised later.

Volatile does not help here: it is only for the compiler, not for
the CPU, and modern CPUs do reorderings just like the compiler,
if not worse. It depends on the CPU; on some CPUs this code would
work correctly 100% of the time. For a CPU, 'volatile' is when
a page of 'memory' has special access flags, meaning it is not
RAM but actually some device register, e.g. in a PCI slot.
The 'volatile' in code _and_ the memory mapping make things right,
but for threads volatile (almost) never helps in anything
except the good feeling of the programmer.

Since the CPUs do not prevent that, and the standard says nothing,
compilers assume they don't have to. They can't reorder
across any opaque call, like locking the lock, but in thread B's
case this does not happen. They won't reorder a volatile
with another volatile, but that's it. So:

Take a singleton that has:

static int i; // defined somewhere, initialisation sets it
int doSomething() {
    return i;
}

Assuming it inlines the following code (instance() not being inline
does not change anything in this regard, same for doSomething):

int x = instance()->doSomething();

It might become, in pseudo-assembly:

put Singleton::i in x
put pInstance in r1
if r1 == 0:
    do the lock, init singleton,
    set pInstance, unlock, put i in x again
else:
    # that's it, the compiler is happy...
    # ...but the programmer
    # is not so happy, because apparently x got a value
    # from the uninitialised Singleton::i
    # once a week in tests

It is correct: the result is 'as-if' the instructions
were in order, for a single thread.

Also, initialisation inside the Singleton might happen in a
different order, and might happen after the assignment to
pInstance. And this is just with a single int;
imagine a much bigger Singleton - you'll never be
able to debug it, you will just know it is incorrect.

Later in the article they address it by making everything in Singleton
volatile, and sum up all the 'volatile' crap at the end.

Last but not least, volatile does not prevent thread B from
seeing just one half of the pointer pInstance filled by A and the
second half still zero.

Edek

Dave Abrahams

Oct 23, 2011, 5:17:02 PM
on Fri Oct 21 2011, bob bob <r_boghean-AT-hotmail.com> wrote:

>> 1. In accordance with the C++03 standard, the compiler optimizer is
>> completely free to move both pInstance reads to *before* the Lock
>> construction -- essentially negating the value of the mutex.

(FYI, I don't see that text anywhere in
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf... oh... the
text you quoted was from a poster here... well...)

> There's clearly something I'm not understanding here. Wouldn't a
> mutex lock/unlock imply a memory barrier of some sort?

Only in C++11 where "memory barrier" is a meaningful concept. Within
the memory model of C++03 there was basically no way to express the
kinds of ordering restrictions suitable for multithreading.

> If the compiler can move operations across mutex lock/unlock
> boundaries that would basically render them useless would it not?

Yes, basically. There are some (expert-only) uses for "half-barriers"
that allow movement of some operations, but mutex locks are basically
always associated with strict, sequentially-consistent barriers.

> Say under VC++ doesn't boost::mutex call the MemoryBarrier() macro (or
> some equivalent) deep within its bowels?

Something like that. It's just that *according to the C++03 standard*
the compiler has no obligation to respect such a request. Presumably if
they ship working threading primitives, MS has, explicitly or
implicitly, been making stronger guarantees than the standard requires
in that area for years.

> Along the same lines, why does the code presented by Meyers and
> Alexandrescu require two barriers (one inside the lock, and one
> outside)? The one inside the lock is clearly to ensure a 'hard
> sequence point' (as they call it), so that the object is fully
> constructed prior to being assigned to the pointer. But I'm not sure
> what the second barrier prevents. I guess this leads back to my
> original assumption that mutex lock/unlocks are barriers themselves.

I think you'd better quote the specific code you're asking about... oh,
do you mean this?

--8<---------------cut here---------------start------------->8---
Singleton* Singleton::instance()
{
    Singleton* tmp = pInstance;
    ...                       // insert memory barrier
    if (tmp == 0) {
        Lock lock;
        tmp = pInstance;
        if (tmp == 0) {
            tmp = new Singleton;
            ...               // insert memory barrier
            pInstance = tmp;
        }
    }
    return tmp;
}
--8<---------------cut here---------------end--------------->8---

Well, it's a little bit hard to say what they're assuming, because under
C++03 you don't have any standard-portable right to expect something
called a "memory barrier" to work, even if you could lay your hands on
one.

That said, I *think* the point here is that without the barriers, no
effects that are formally "visible" according to the standard force the
boundaries of the lock to exclude the read of pInstance or to include
the write... but I would ask the authors to explain this in more detail
if I were you.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Alexander Terekhov

Oct 24, 2011, 12:46:33 PM
Dave Abrahams wrote:
[...]
> Yes, basically. There are some (expert-only) uses for "half-barriers"
> that allow movement of some operations, but mutex locks are basically
> always associated with strict, sequentially-consistent barriers.

I always thought that C++11 mutex locks can be expressed in terms of
relaxed C++11 atomic<> operations and acquire/release fences (not the
same as memory_order_seq_cst fence).

(pseudo-code, swap = exchange)

class swap_based_mutex_for_windows { // noncopyable

    atomic<int> m_lock_status;      // 0: free, 1/-1: locked/contention
    auto_reset_event m_retry_event; // slow binary semaphore (windows)

public:

    // ctor/dtor [w/o lazy event init]

    void lock() throw() {
        if (m_lock_status.swap(1, memory_order_relaxed))
            while (m_lock_status.swap(-1, memory_order_relaxed))
                m_retry_event.wait();
        atomic_thread_fence(memory_order_acquire);
    }

    void unlock() throw() {
        atomic_thread_fence(memory_order_release);
        if (m_lock_status.swap(0, memory_order_relaxed) < 0)
            m_retry_event.set();
    }

};

regards,
alexander.



Alexander Terekhov

Oct 24, 2011, 12:46:49 PM
Dave Abrahams wrote:
[...]
> --8<---------------cut here---------------start------------->8---
> Singleton* Singleton::instance()
> {
>     Singleton* tmp = pInstance;
>     ...                       // insert memory barrier
>     if (tmp == 0) {
>         Lock lock;
>         tmp = pInstance;
>         if (tmp == 0) {
>             tmp = new Singleton;
>             ...               // insert memory barrier
>             pInstance = tmp;
>         }
>     }
>     return tmp;
> }
> --8<---------------cut here---------------end--------------->8---
>
> Well, it's a little bit hard to say what they're assuming, because under
> C++03 you don't have any standard-portable right to expect something
> called a "memory barrier" to work, even if you could lay your hands on
> one.
>
> That said, I *think* the point here is that without the barriers, no
> effects that are formally "visible" according to the standard force the
> boundaries of the lock to exclude the read of pInstance or to include
> the write... but I would ask the authors to explain this in more detail
> if I were you.

With C++11 fences:

Singleton* Singleton::instance() {
    Singleton* tmp = pInstance; // atomic relaxed (competing) load
    atomic_thread_fence(memory_order_acquire);
    if (tmp == 0) {
        Lock lock;
        tmp = pInstance; // non-competing relaxed load (need not be atomic)
        if (tmp == 0) {
            tmp = new Singleton;
            atomic_thread_fence(memory_order_release);
            pInstance = tmp; // atomic relaxed store
        }
    }
    return tmp;
}

regards,
alexander.



Dave Abrahams

Oct 24, 2011, 5:29:56 PM
on Mon Oct 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:

> Dave Abrahams wrote:
> [...]
>> Yes, basically. There are some (expert-only) uses for "half-barriers"
>> that allow movement of some operations, but mutex locks are basically
>> always associated with strict, sequentially-consistent barriers.
>
> I always thought that C++11 mutex locks can be expressed in terms of
> relaxed C++11 atomic<> operations and acquire/release fences (not the
> same as memory_order_seq_cst fence).

You know, you're probably right. I guess you acquire on lock and
release on unlock or something?

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Dave Abrahams

Oct 24, 2011, 5:30:40 PM
on Mon Oct 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:

Sure... the question isn't how to translate that into C++11, but more,
what's the rationale for placing each of those fences where they are?
My first instinct would have been that you need a fence before the read
(to ensure visibility of anything that's been written to pInstance from
another thread) and after the write (to ensure the write to pInstance
*becomes* visible in other threads). The only rationale I can guess for
the placement above is... well, what I guessed above.

But it's always possible I still don't understand fences correctly ;-)

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Alexander Terekhov

Oct 25, 2011, 5:48:39 PM
Dave Abrahams wrote:
>
> on Mon Oct 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
>
> > Dave Abrahams wrote:
> > [...]
> >> Yes, basically. There are some (expert-only) uses for "half-barriers"
> >> that allow movement of some operations, but mutex locks are basically
> >> always associated with strict, sequentially-consistent barriers.
> >
> > I always thought that C++11 mutex locks can be expressed in terms of
> > relaxed C++11 atomic<> operations and acquire/release fences (not the
> > same as memory_order_seq_cst fence).
>
> You know, you're probably right. I guess you acquire on lock and
> release on unlock or something?

C++11 acquire/release fences mimic Power import/export barriers, see:

http://www.cs.ox.ac.uk/people/jade.alglave/stability/stability-long.pdf
("Stability in Weak Memory Models With Proofs")

regards,
alexander.



Dave Abrahams

Oct 25, 2011, 6:35:27 PM
on Tue Oct 25 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:

> Dave Abrahams wrote:
>>
>
>> on Mon Oct 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
>>
>>> Dave Abrahams wrote:
>>> [...]
>>>> Yes, basically. There are some (expert-only) uses for "half-barriers"
>>>> that allow movement of some operations, but mutex locks are basically
>>>> always associated with strict, sequentially-consistent barriers.
>>>
>>> I always thought that C++11 mutex locks can be expressed in terms of
>>> relaxed C++11 atomic<> operations and acquire/release fences (not the
>>> same as memory_order_seq_cst fence).
>>
>> You know, you're probably right. I guess you acquire on lock and
>> release on unlock or something?
>
> C++11 acquire/release fences mimic Power import/export barriers, see:
>
> http://www.cs.ox.ac.uk/people/jade.alglave/stability/stability-long.pdf
> ("Stability in Weak Memory Models With Proofs")

Fascinating paper I'm sure, but it doesn't seem to be an answer to my
question.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Mathias Gaunard

Oct 26, 2011, 4:30:00 AM
On Oct 24, 6:46 pm, Alexander Terekhov <terek...@web.de> wrote:

> I always thought that C++11 mutex locks can be expressed in terms of
> relaxed C++11 atomic<> operations and acquire/release fences (not the
> same as memory_order_seq_cst fence).

AFAIK, locking a mutex is a full barrier, not just an acquire barrier.
This is not the case for spinlocks though.

Sure, you can do a full barrier by doing an acquire followed by a
release, but it's possibly more efficient to just do the full barrier.


> class swap_based_mutex_for_windows { // noncopyable
>
> atomic<int> m_lock_status; // 0: free, 1/-1: locked/contention
> auto_reset_event m_retry_event; // slow binary semaphore (windows)

How is that pseudo-code written only using built-in atomic operations
when there is a reference to a heavyweight kernel-level
synchronization mechanism?

Alexander Terekhov

Oct 26, 2011, 1:16:24 PM
Mathias Gaunard wrote:
>
> On Oct 24, 6:46 pm, Alexander Terekhov <terek...@web.de> wrote:
>
> > I always thought that C++11 mutex locks can be expressed in terms of
> > relaxed C++11 atomic<> operations and acquire/release fences (not the
> > same as memory_order_seq_cst fence).
>
> AFAIK, locking a mutex is a full barrier, not just an acquire barrier.
> This is not the case for spinlocks though.

Who told you that? Spinlocks and mutexes are the same barrier-wise.

>
> Sure, you can do a full barrier by doing an acquire followed by a
> release, but it's possibly more efficient to just do the full barrier.
>
> > class swap_based_mutex_for_windows { // noncopyable
> >
> > atomic<int> m_lock_status; // 0: free, 1/-1: locked/contention
> > auto_reset_event m_retry_event; // slow binary semaphore (windows)
>
> How is that pseudo-code written only using built-in atomic operations
> when there is a reference to a heavyweight kernel-level
> synchronization mechanism?

The code also works with usleep() or somesuch blocking mechanism instead
of auto_reset_event.

regards,
alexander.

Alexander Terekhov

Oct 26, 2011, 1:16:41 PM
Mathias Gaunard wrote:
[...]
> Sure, you can do a full barrier by doing an acquire followed by a
> release, but it's possibly more efficient to just do the full barrier.

No, a full barrier in C++11 speak is
atomic_thread_fence(memory_order_seq_cst).

Acquire/release is cheaper, see:

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

"x86 (including x86-64)

Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: MFENCE

...

PowerPC

Acquire Fence: lwsync
Release Fence: lwsync
AcqRel Fence: lwsync
SeqCst Fence: hwsync

...

Itanium

Acquire Fence: <ignore>
Release Fence: <ignore>
Acq_Rel Fence: <ignore>
Seq_Cst Fence: mf"

regards,
alexander.

Mathias Gaunard

Oct 26, 2011, 9:22:10 PM
On Oct 26, 7:16 pm, Alexander Terekhov <terek...@web.de> wrote:

> No, a full barrier in C++11 speak is
> atomic_thread_fence(memory_order_seq_cst).
>
> Acquire/release is cheaper, see:
>
> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

I meant load acquire and store release.
But, thinking about it, it is indeed not enough for full sequential
consistency.

Chris M. Thomasson

Oct 26, 2011, 9:26:46 PM
"Mathias Gaunard" <louf...@gmail.com> wrote in message
news:0f960bb1-ebad-4dc6...@h5g2000vbf.googlegroups.com...
> On Oct 24, 6:46 pm, Alexander Terekhov <terek...@web.de> wrote:
>
>> I always thought that C++11 mutex locks can be expressed in terms of
>> relaxed C++11 atomic<> operations and acquire/release fences (not the
>> same as memory_order_seq_cst fence).
>
> AFAIK, locking a mutex is a full barrier, not just an acquire barrier.
> This is not the case for spinlocks though.
>
> Sure, you can do a full barrier by doing an acquire followed by a
> release, but it's possibly more efficient to just do the full barrier.
>
>
>> class swap_based_mutex_for_windows { // noncopyable
>>
>> atomic<int> m_lock_status; // 0: free, 1/-1: locked/contention
>> auto_reset_event m_retry_event; // slow binary semaphore (windows)
>
> How is that pseudo-code written only using built-in atomic operations
> when there is a reference to a heavyweight kernel-level
> synchronization mechanism?

Perhaps something like:


<sorry for any typos!>
_____________________________________________
struct auto_reset_event
{
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_state;

    auto_reset_event() : m_state(false) {}
    auto_reset_event(bool state) : m_state(state) {}

    void set()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_state = true;
        lock.unlock();
        m_cond.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (! m_state) m_cond.wait(lock);
        m_state = false;
    }
};
_____________________________________________


;^)