
Re: Microsoft chooses to leave C++ compiler broken


Paul Bibbings

Mar 20, 2010, 3:18:41 AM
Hyman Rosen <hyr...@mail.com> writes:

<snip />
> Two years after having an error in Visual C++ reported
> whereby automatic objects of class type are not destructed
> in certain cases, Microsoft has chosen to close the report
> as "won't fix". I don't think anything has heralded the
> death of C++ quite as much as this.
<snip />

I am completely failing to see how anyone can get from the statement in
your first sentence here to the opinion in the second. (Is this your
own thought, or are you quoting your correspondent here? I couldn't be
sure.) Under any other circumstances I would rather think that it would
herald the death of the implementation in question. The `Law' (to use a
stretched example) is not compromised by the people that break it.

Regards

Paul Bibbings

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Tony Jorgenson

Mar 20, 2010, 4:22:11 AM
> >You are incorrect to claim that volatile as defined by the current C++
> >standard has no use in multi-threaded programming. Whilst volatile does not
> >guarantee atomicity nor memory ordering across multiple threads the fact
> >that it prevents the compiler from caching values in registers is useful and
> >perhaps essential.

You seem to be saying that volatile can be useful for multi-threaded
code?
(See questions below)

> Yes, volatile does that. Unfortunately, that is necessary but not
> sufficient for inter-thread communication to work correctly. Volatile
> is for hardware access; std::atomic<T> is for multithreaded code
> synchronized without mutexes.

I understand that volatile does not guarantee that the order of memory
writes performed by one thread is seen in the same order by another
thread doing memory reads of the same locations. I do understand the
need for memory barriers (mutexes, atomic variables, etc.) to guarantee
order, but there are still two questions that have never been completely
answered, at least to my satisfaction, in all of the discussion I have
read on this group (and the non-moderated group) on these issues.

First of all, I believe that volatile is supposed to guarantee the
following:

Volatile forces the compiler to generate code that performs actual
memory reads and writes rather than caching values in processor
registers. In other words, I believe that there is a one-to-one
correspondence between volatile variable reads and writes in the
source code and actual memory read and write instructions executed by
the generated code. Is this correct?
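
As a minimal sketch of the property I am asking about (the variable name
is hypothetical): if the one-to-one correspondence holds, the loop below
must re-load the flag from memory on every iteration, whereas without
volatile an optimizer would be free to hoist the load and spin on a stale
register copy.

    volatile int flag = 0;   // assumed to be written by another thread

    void wait_for_flag()
    {
        while (flag == 0) {  // each test must be an actual memory read
            /* spin */
        }
    }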

Question 1:
My first question is with regard to using volatile instead of memory
barriers in some restricted multi-threaded cases. If my above
statements are correct, is it possible to use _only_ volatile with no
memory barriers to signal between threads in a reliable way if only a
single word (perhaps a single byte) is written by one thread and read
by another?

Question 1a:
First of all, please correct me if I am wrong, but I believe volatile
_must_always_ work as described above on any single core CPU. One CPU
means one cache (or one hierarchy of caches) meaning one view of
actual memory through the cache(s) that the CPU sees, regardless of
which thread is running. Is this much correct for any CPU in
existence? If not please mention a situation where this is not true
(for single core).

Question 1b:
Secondly, the only way I could see this not working on a multi-core
CPU, with individual caches for each core, is if a memory write
performed by one CPU is allowed to never be updated in the caches of
other CPU cores. Is this possible? Are there any multi-core CPUs that
allow this? Doesn’t the MESI protocol guarantee that eventually memory
cached in one CPU core is seen by all others? I know that there may be
delays in the propagation from one CPU cache to the others, but
doesn’t it eventually have to be propagated? Can it be delayed
indefinitely due to activity in the cores involved?

Question 2:
My second question is with regard to whether volatile is necessary for
multi-threaded code in addition to memory barriers. I know that it has
been stated that volatile is not necessary in this case, and I do
believe this, but I don’t completely understand why. The issue as I
see it is that using memory barriers, perhaps through use of mutex OS
calls, does not in itself prevent the compiler from generating code
that caches non-volatile variable writes in registers. I have heard it
written in this group that posix, for example, supports additional
guarantees that make mutex lock/unlock (for example) sufficient for
correct inter-thread communication through memory without the use of
volatile. I believe I read here once (from James Kanze I believe) that
“volatile is neither sufficient nor necessary for proper multi-
threaded code” (quote from memory). This seems to imply that posix is
in cahoots with the compiler to make sure that this works. If you add
mutex locks and unlocks (I know RAII, so please don’t derail my
question) around some variable reads and writes, how do the mutex
calls force the compiler to generate actual memory reads and writes in
the generated code rather than register reads and writes?

I understand that compilation optimization affects these issues, but
if I optimize the hell out of my code, how do posix calls (or any
other OS threading calls) force the compiler to do the right thing? My
only conjecture is that this is just an accident of the fact that the
compiler can’t really know what the mutex calls do and therefore the
compiler must make sure that all globally accessible variables are
pushed to memory (if they are in registers) in case _any_ called
function might access them. Is this what makes it work? If not, then
how do mutex calls guarantee the compiler doesn’t cache data in
registers, because this would surely make the mutexes worthless
without volatile (which I know from experience that they are not).
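
To make the conjecture concrete, here is a minimal sketch (the function
and variable names are hypothetical) of the "opaque call" effect described
above. Because the compiler cannot see into the external functions, it
must assume they may read or write any globally reachable object, so the
shared variable has to be written back to memory before each call even
though it is not volatile.

    extern void mutex_lock();    // opaque: defined in another translation unit
    extern void mutex_unlock();  // the compiler cannot see what these touch

    int shared_counter;          // deliberately not volatile

    void producer()
    {
        mutex_lock();
        ++shared_counter;        // cannot be kept in a register across the
        mutex_unlock();          // call: the callee might read or write it
    }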

Andy Venikov

Mar 20, 2010, 4:25:37 AM
Andrei Alexandrescu wrote:

<snip>
> But by and large that's not sufficient to make sure things do work, and
> they will never work portably. Here's a good article on the topic:
>
> http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/
>
>
> entitled eloquently "Volatile: Almost Useless for Multi-Threaded
> Programming". And here's another entitled even stronger 'Why the
> "volatile" type class should not be used':
>
> http://kernel.org/doc/Documentation/volatile-considered-harmful.txt
>
> The presence of the volatile qualifier in Loki is at best helpful but
> never a guarantee of correctness. I recommend Scott and my article on
> the topic, which was mentioned earlier in this thread:
>
> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
>
> Bottom line: using volatile with threads is almost always a red herring.
>
>
> Andrei
>

Not in my wildest dreams would I think that I'd ever disagree with you,
but here goes....

While it's true that there's a widespread misconception that volatile
is a panacea for multi-threading issues, and it's true that by itself it
won't do anything to make multi-threaded programs safe, it's not correct
to say that it's totally useless for threading issues, as the
"volatile-considered-harmful.txt" article tries to imply.

In short, volatile is never sufficient, but it is often necessary to solve
certain multi-threading problems. These problems (such as writing lock-free
algorithms) require preventing re-ordering of execution statements.
Re-ordering can happen in two places: in the hardware, which is mitigated
with memory fences; and in the compiler, which is mitigated with volatile.
It's true that, depending on the memory fence library that you use, the
compiler won't move the code residing inside the fences to the outside,
but that is not always the case. If you use raw asm statements, for example
(even if you add "volatile" to the asm keyword), your non-volatile
variable is not guaranteed to stay inside the fenced region unless you
declare it volatile.
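
As a hedged illustration of the asm case (GCC-style inline assembly
assumed): the asm statement below stays put, but without a "memory"
clobber nothing stops the compiler from caching the non-volatile flag in
a register across it.

    int ready;                        // not volatile

    void spin()
    {
        while (!ready) {              // load may be hoisted out of the loop
            asm volatile("mfence");   // hardware fence, not a compiler barrier
        }
    }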

The advent of C++0x may well render it useless for multi-threading, but
up until now it has been necessary.


Thanks,
Andy.

Joshua Maurice

Mar 20, 2010, 7:32:19 AM
On Mar 20, 12:17 am, "Balog Pal" <p...@lib.hu> wrote:
> "Edward Diener" <eldie...@tropicsoft.invalid>>
>
> > Oh, please ! Microsoft has closed with "won't fix" or "can't fix for the
> > current release but will re-consider in the future" every valid C++ and
> > C++/CLI bug I have reported to them over the last five years. For purely
> > C++ bugs, as opposed to C++/CLI, see:
>
> >https://connect.microsoft.com/VisualStudio/feedback/details/522094/wa...
>
> > and
>
> >https://connect.microsoft.com/VisualStudio/feedback/details/344776/vc...
>
> These errors are in the "nuisance" category. The compile fails, so you're
> forced into some workaround. The one the OP mentioned is in the "danger"
> category -- the code compiles silently with incorrect code generation.
>
> > I won't regale anyone with all the C++/CLI bugs Microsoft has refused to
> > fix, but not fixing C++ bugs is the norm, not the exception, in my
> > experience with reporting bugs to Microsoft. Getting upset about it is a
> > waste of time as no amount of pressure will get Microsoft to fix a bug
> > they just do not want to fix.
>
> Too bad. :(
>
> Here's what I wrote yesterday to the ACCU list on the issue.
>
> Just became aware of this thing, and am quite beside myself. :-(
>
> Guess everyone around here agrees the perfect symmetry of ctor/dtor calls is
> fundamental in C++ and any code we write just takes it for granted.
>
> Seems folks at M$ have different thoughts and just plant a breaking bug in
> the optimizer -- and then refuse to correct it for 2 years, then decide not
> to fix it at all. To me that sounds outrageous and is grounds to drop using
> MS compilers. Oh, Murphy can't be ignored, so bugs may be in the release, but
> I expect them gone after discovery. Suggesting a "workaround" for a few weeks
> is okay, but thinking actual programmers shall stop using locals in a
> loop -- or start to count returns in functions? Hell no.
>
> Herb Sutter already picked up the issue and promised some fix in the
> followups -- but his words suggest a change of heart only for the particular
> issue, reconsidered because James' example suggests more likelihood of
> suspect code. The implication is that the systemic failure will remain; MS
> is glad to wager with our products' correctness.
>
> I can live with QoI issues like some code ending up in "internal compiler
> error", or the optimizer bailing out and emitting some slow, naive assembly.
> But correct source, if no error is signaled and an .EXE is created, shall
> have the proper behavior. No matter what optimization is used.
>
> I ask the community's help to find some measure to pressure vendors toward
> correct and fair behavior, which includes not even thinking of leaving such
> defects around. Even in freebie versions, let alone paid ones.
>
> Maybe a petition signed by many ACCU members (and other C++ users) could
> create some force.
>
> In the last decade, I wrote several times in different forums addressing the
> usual MS-bashing and claims about poor quality, saying that it mostly roots
> in the 90's and the company IMO changed to the right track around 2000 --
> certainly the cleanup takes time but the will is there, and resources are
> used for good.
>
> The handling of this issue makes me say I appear to have been wrong, and
> the others were right. And I have to apologize too for misleading people...

I fully agree. As an example, Microsoft's C++ compiler does not
support covariant return types with multiple inheritance and/or
virtual inheritance. (I forget which.) This is in the nuisance
category, especially considering that it can be worked around
relatively easily. Don't get me wrong, it's annoying, but it's
manageable. However, it is completely unacceptable to have a bug which
results in the creation of an executable with semantics different from the
source, without error (or warning), specifically contrary to the agreed-upon
standard, the C++ standard.

As a reply to Herb Sutter, even if the code required to trigger the
bug was some obscure little thing (which it is not), people can still
hit it in the real world. My company generates C++ code from a very
limited modeling language to allow object serialization between C++,
xml, and Java. Imagine what wonderful little code paths in the
optimizer could be exercised by code generation on top of code generation.

Short version: bogus errors and crashes can be tolerable. Non-standard
behavior from compiler extensions can be tolerable. Signaling an error
instead of compiling standard-compliant source can be tolerable.
Producing a broken executable without error or warning from standard-
compliant source is not acceptable.

PS: Balog Pal, where is this petition? I'll sign it. In the meantime
I'll continue my current work, which will easily allow swapping in gcc
in place of Visual Studio to run some timing numbers on our unit /
acceptance / integration tests.

PPS: Why is it that Google Groups shows this as apparently two random
threads merged into one? Did the OP reply instead of starting a new
topic? Oh well. Only a minor inconvenience.

Andy Venikov

Mar 21, 2010, 4:09:37 AM
Joshua Maurice wrote:

>Leigh Johnston wrote:
>> Obviously the volatile keyword may not cause a memory barrier instruction to
>> be emitted but this is a side issue. The combination of a memory barrier
>> and volatile makes multi-threaded code work.
>

> No. Memory barriers when properly used (without the volatile keyword)
> are sufficient.

Sorry Joshua, but I think it's a wrong, or at least an incomplete,
statement.

It all depends on how memory barriers/fences are implemented. In the
same way that the C++ standard doesn't talk about threads, it doesn't talk
about memory fences. If a memfence call is implemented as a library
call, then yes, you will in essence get a compiler-level fence directive,
as none of the compilers I know of are allowed to move code across a
call to a library. But oftentimes memfences are implemented as macros
that expand to inline assembly. If you don't use volatile, then nothing
will tell the compiler that it can't optimize the code and move the
read/write across the macroized memfence. This is especially true on
platforms that don't actually need hardware memfences (like x86), since
in those cases calls to macro memfences will expand to nothing at all,
and then you will have nothing in your code that says anything about a
code-migration barrier.
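
As a sketch of that last point (the macro and configuration names are
hypothetical): a portable fence macro might expand to a real instruction
on a weakly ordered platform but to nothing at all on x86, leaving no
trace for the compiler to honor.

    #if defined(WEAKLY_ORDERED_CPU)
    #  define LOAD_FENCE() asm volatile("membar #LoadLoad")  // e.g. Sparc
    #else
    #  define LOAD_FENCE() ((void)0)  // x86: loads already ordered in hardware
    #endif

On x86 the macro contributes nothing to the translation unit, so only
declaring the variable volatile tells the compiler not to move or cache
the access.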

So is volatile sufficient - absolutely not. Portable? - hardly.
Is it necessary in certain cases - absolutely.


Thanks,
Andy.

Joshua Maurice

Mar 21, 2010, 9:15:45 AM

Perhaps I was a bit too strong in my statement. I did intend to say
"for portable uses".

However, for future reference, what compilers of what versions on what
platforms implement volatile with these semantics? I would like to
research these claims to be better prepared in this discussion and
future ones. Are any of these implementations not x86? These
implementations really don't provide a sane threading interface and
force the use of the volatile keyword for threading? Weird.

I still must ask, really? That would mean that all shared state must
be volatile qualified, including internal class members for shared
data. Wouldn't that be a huge performance hit when the compiler can't
optimize any of that? Could you even use prebuilt classes (which
usually don't have volatile overloads) in the shared data, like say
std::string, std::vector, std::map, etc.?

Can you wrap this horrible threading interface into sane functions
which provide the usual semantics instead of doing this horrible
volatile hackery? That's what I would strongly suggest if possible.

Finally, on x86, not all memfences are no-ops, though perhaps we're
using different definitions of memfence. Referencing the JSR-133 Cookbook
(http://g.oswego.edu/dl/jmm/cookbook.html), the "StoreLoad" memory
barrier is not a no-op on x86.

Mathias Gaunard

Mar 21, 2010, 5:28:26 PM
On 21 mar, 08:09, Andy Venikov <swojchelo...@gmail.com> wrote:

> It all depends on how memory barriers/fences are implemented. In the
> same way that C++ standard doesn't talk about threads it doesn't talk
> about memory fences. If a memfence call is implemented as a library
> call, then yes, you will in essence get a compiler-level fence directive
> as none of the compilers I know of are allowed to move the code across a
> call to a library. But oftentimes memfences are implemented as macros
> that expand to inline assembly. If you don't use volatile then nothing
> will tell the compiler that it can't optimize the code and move the
> read/write across the macroized memfence.

Likewise, compilers do not move code across inline
assembly, especially if that assembly contains memory fences...

Andy Venikov

Mar 21, 2010, 5:32:24 PM
Joshua Maurice wrote:

>> So is volatile sufficient - absolutely not. Portable? - hardly.
>> Is it necessary in certain cases - absolutely.
>
> Perhaps I was a bit too strong in my statement. I did intend to say
> "for portable uses".
>
> However, for future reference, what compilers of what versions on what
> platforms implement volatile with these semantics? I would like to
> research these claims to be better prepared in this discussion and
> future ones. Are any of these implementations not x86? These
> implementations really don't provide a sane threading interface and
> force the use of the volatile keyword for threading? Weird.
>

I'm sorry if I wasn't clear in my previous post, but I was talking about
standard volatile behavior.

The standard places a requirement on conforming implementations that:


1.9.6
The observable behavior of the abstract machine is its sequence of reads
and writes to volatile data and calls to library I/O functions

1.9.7
Accessing an object designated by a volatile lvalue (3.10), modifying an
object, calling a library I/O function, or calling a function that does
any of those operations are all side effects, which are changes in the
state of the execution environment. Evaluation of an expression might
produce side effects. At certain specified points in the execution
sequence called sequence points, all side effects of previous
evaluations shall be complete and no side effects of subsequent
evaluations shall have taken place

1.9.11
The least requirements on a conforming implementation are:
— At sequence points, volatile objects are stable in the sense that
previous evaluations are complete and
subsequent evaluations have not yet occurred.

That to me sounds like a complete enough requirement that compilers
don't perform optimizations that produce "surprising" results insofar
as observable behavior of the abstract (single-threaded) machine is
concerned. This requirement happens to be very useful for multi-threaded
programs, which can augment volatile with hardware fences to produce
meaningful results.
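
A minimal sketch of what "augmenting" means here (x86 and GCC-style asm
assumed): volatile keeps the compiler from reordering or eliding the two
stores relative to each other, while the explicit fence orders them in
hardware.

    volatile int payload;
    volatile int published;

    void publish(int value)
    {
        payload = value;
        asm volatile("mfence");  // hardware barrier between the two stores
        published = 1;           // a reader polls this flag
    }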


> I still must ask, really? That would mean that all shared state must
> be volatile qualified, including internal class members for shared
> data. Wouldn't that be a huge performance hit when the compiler can't
> optimize any of that? Could you even use prebuilt classes (which
> usually don't have volatile overloads) in the shared data, like say
> std::string, std::vector, std::map, etc.?

Not at all!
Most multi-threading issues are solved with mutexes, semaphores,
condition variables and such. All of these are library calls. That
means that using volatile in those cases is not necessary. It's only
when you get into more esoteric parallel computing problems, where you'd
like to avoid the heavy-handed approach of mutexes, that you enter the
realm of volatile. In normal multi-threading solved with regular means
there is really no reason to use volatile.


<snip>

Thanks,
Andy.

Leigh Johnston

Mar 22, 2010, 2:03:38 AM

"Andy Venikov" <swojch...@gmail.com> wrote in message
news:ho5s8u$52u$1...@news.eternal-september.org...


>> I still must ask, really? That would mean that all shared state must
>> be volatile qualified, including internal class members for shared
>> data. Wouldn't that be a huge performance hit when the compiler can't
>> optimize any of that? Could you even use prebuilt classes (which
>> usually don't have volatile overloads) in the shared data, like say
>> std::string, std::vector, std::map, etc.?
>
> Not at all!
> Most multi-threading issues are solved with mutexes, semaphores,
> condition variables and such. All of these are library calls. That
> means that using volatile in those cases is not necessary. It's only
> when you get into more esoteric parallel computing problems, where you'd
> like to avoid the heavy-handed approach of mutexes, that you enter the
> realm of volatile. In normal multi-threading solved with regular means
> there is really no reason to use volatile.

Esoteric? I would have thought that independent, correctly aligned (and
therefore atomic) x86 variable reads (of fundamental types) without the use
of a mutex are not uncommon, making volatile not uncommon also on that
platform (on VC++) at least. I have exactly one volatile in my entire
codebase and that is such a variable. From the MSDN (VC++) docs:

"The volatile keyword is a type qualifier used to declare that an object can
be modified in the program by something such as the operating system, the
hardware, or a concurrently executing thread."

That doesn't seem esoteric to me! :)

/Leigh

Bo Persson

Mar 22, 2010, 7:22:16 PM

The esoteric thing is that this is a compiler-specific extension, not
something guaranteed by the language. Currently there are no threads at all
in C++.

Note that the largest part of the MSDN document is clearly marked "Microsoft
Specific". It is in that part that the release and acquire semantics are
defined.


Bo Persson

Joshua Maurice

Mar 22, 2010, 7:23:57 PM

That is one interpretation. Unfortunately / fortunately (?), that
interpretation is not the prevailing interpretation. Thus far in this
thread, we have members of the C++ standards committee or its
affiliates explicitly disagreeing on the committee's website with that
interpretation (linked else-thread). The POSIX standard explicitly
disagrees with your interpretation (see google). The
comp.programming.threads FAQ explicitly disagrees with you several
times (linked else-thread). We have gcc docs and implementation
disagreeing with your interpretation (see google). We have an official
blog from intel, the biggest maker of chips in the world, and a major
compiler writer, explicitly disagreeing with your interpretation
(linked else-thread). We have experts in the C++ community explicitly
disagreeing with your interpretation. (Thanks Andrei, and his paper "C++
And The Perils Of Double-Checked Locking". Andy, have you even read it?)

Basically, everyone in a position of authority is against you: the
experts, the standards writers, and the implementation coders.
(Except for Microsoft Visual Studio, which actually makes volatile reads
and writes act like a mutex acquire and a mutex release.) I don't know
what else I can do to dissuade you from this statement of fact concerning
the real world. As a practical matter, in the real world on real
implementations, volatile has no use as a correct, portable
synchronization primitive.

Michael Doubez

unread,
Mar 23, 2010, 9:42:33 AM3/23/10
to
On 23 mar, 00:22, "Bo Persson" <b...@gmb.dk> wrote:
> Leigh Johnston wrote:
> > "Andy Venikov" <swojchelo...@gmail.com> wrote in message

Still, it does say something about the semantics of the memory location. In
practice the compiler will cut the optimizations regarding the
volatile location; I don't see a compiler ignoring this kind of
notification.

Which means that the memory value will eventually (after an
undetermined amount of time) be flushed to the location and not kept
around in the stack or somewhere else for optimization reasons.

My understanding is that it is the amount of optimization cutting that
is implementation dependent.

Still, I have not understood how this can be useful for multithreading
with the next-to-be standard; AFAIS atomic types give better
guarantees and better optimization possibilities.

[snip]

--
Michael

Chris Vine

Mar 23, 2010, 10:05:28 AM
On Mon, 22 Mar 2010 17:23:57 CST
Joshua Maurice <joshua...@gmail.com> wrote:
[snip]

> Basically, everyone in positions of authority are against you, from
> the experts, the standards writers, and the implementation coders.
> (Except for Microsoft Visual Studios, who actually make volatile reads
> and writes like a mutex acquire and mutex release.) I don't know what
> else I can do to dissuade you from this statement of fact concerning
> the real world. As a practical matter, in the real world on real
> implementations, volatile has no use as a correct, portable
> synchronization primitive.

I think you and Andy Venikov are at cross purposes. If you want to
write portable threaded code conforming to a particular standard (such
as POSIX) which has defined synchronisation objects such as mutexes,
semaphores or condition variables, then the volatile keyword is
useless. It achieves nothing in terms of thread safety (accessing these
synchronisation objects in code comprises, so far as relevant, both a
compiler barrier and a memory barrier on platforms which conform to the
standard), and it inhibits optimisations which may still be available to
the compiler in multi-threaded code but not in single-threaded
asynchronous (interrupt-driven) code.

If however you want to write non-portable code working with only one
particular processor type on one particular compiler, then you might
achieve some efficiency improvements by using processor-dependent
memory barriers or store/load instructions combined with use of the
volatile keyword. It is easy to get it wrong doing this (as witness
the bogus double-checked-locking pattern which has been around so long).
And the code may cease to work reliably whenever you upgrade your
compiler or operating system version (unless it happens to be supported
by a compiler/platform writer's specific extension, such as
Microsoft's).

When c++0x atomic variables are available with their variety of
different synchronisation options, then the need for such non-portable
code (so far as it exists at all) will be largely eliminated.
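
For what it is worth, a sketch of what that will look like under the
C++0x draft (the default sequentially consistent ordering is shown;
relaxed orderings are available where profiling justifies them):

    #include <atomic>

    std::atomic<bool> finished(false);

    void worker()
    {
        /* do work */
        finished.store(true);        // sequentially consistent by default
    }

    void waiter()
    {
        while (!finished.load()) {   // guaranteed to observe the store
            /* do work */
        }
    }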

Chris

Leigh Johnston

Mar 23, 2010, 10:05:28 AM
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:b897c547-0237-4d21...@u15g2000prd.googlegroups.com...

Sometimes you have to use common sense:

thread A:

    finished = false;
    spawn_thread_B();
    while (!finished)
    {
        /* do work */
    }

thread B:

    /* do work */
    finished = true;

If finished is not volatile and compiler optimizations are enabled thread A
may loop forever.

The behaviour of optimizing compilers in the real world can make volatile
necessary to get correct behaviour in multi-threaded designs. You don't
always have to use a memory barrier or a mutex when performing an atomic
read of some state shared by more than one thread.

/Leigh

Chris Vine

Mar 23, 2010, 2:22:56 PM
On Tue, 23 Mar 2010 08:05:28 CST
"Leigh Johnston" <le...@i42.co.uk> wrote:
[snip]

> Sometimes you have to use common sense:
>
> thread A:
> finished = false;
> spawn_thread_B();
> while(!finished)
> {
> /* do work */
> }
>
> thread B:
> /* do work */
> finished = true;
>
> If finished is not volatile and compiler optimizations are enabled
> thread A may loop forever.
>
> The behaviour of optimizing compilers in the real world can make
> volatile necessary to get correct behaviour in multi-threaded
> designs. You don't always have to use a memory barriers or a mutexes
> when performing an atomic read of some state shared by more than one
> thread.

It is never "necessary" to use the volatile keyword "in the real world"
to get correct behaviour because of "the behaviour of optimising
compilers". If it is, then the compiler does not conform to the
particular standard you are writing to. For example, all compilers
intended for POSIX platforms which support pthreads have a
configuration flag (usually "-pthread") which causes the locking
primitives to act also as compiler barriers, and the compiler would be
non-conforming if it did not both provide this facility and honour it.

Of course, there are circumstances when you can get away with the
volatile keyword, such as the rather contrived example you have given,
but in that case it is pretty well pointless because making the
variable volatile as opposed to using normal synchronisation objects
will not improve efficiency. In fact, it will hinder efficiency if
Thread A has run work before thread B, because thread A will depend on a
random future event on multi-processor systems, namely when the caches
happen to synchronise to achieve memory visibility, in order to proceed.

Chris

James Kanze

Mar 23, 2010, 7:08:37 PM
On Mar 18, 10:32 pm, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> On Mar 17, 8:16 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:

[...]
> I can't recall for the life of me where I read it, but I seem
> to recall Andrei admitting that he misunderstand volatile, and
> learned of the error of his ways, possibly in conjunction with
> "C++ And The Perils Of Double-Checked Locking".

It was in a discussion in this group, although I don't remember
exactly when. The curious thing is that Andrei's techniques
actually work, not because of any particular semantics of
volatile, but because of the way it works in the type system;
its use caused type errors (much like the one the original
poster saw) if you attempted to circumvent the locking.

The misunderstanding of volatile is apparently widespread. To
the point that Microsoft actually proposed giving it the
required semantics to the standards committee. That didn't go
over very well, since it caused problems with the intended use
of volatile. The Microsoft representative (Herb Sutter, as it
happens) immediately withdrew the proposal, but I think they
intend to implement these semantics in some future compiler, or
perhaps have already implemented them in VC10. In defense of
the Microsoft proposal: the proposed semantics do make sense if
you restrict yourself to the world of application programs under
general purpose OS's, like Windows or Unix. And the semantics
actually implemented by volatile in most other compilers, like
g++ or Sun CC, are totally useless, even in the contexts for
which volatile was designed. At present, it's probably best to
class volatile in the same category as export: none of the
widely used compilers implement it to do anything useful.

[...]
> B- repeat my (perhaps unfounded) second hand information that
> volatile in fact on most current implementations does not make
> a global ordering of reads and writes.

Independently of what the standard says (and it does imply
certain guarantees, such as would be necessary, for example, to
use it for memory mapped IO), volatile has no practical
semantics in most current compilers (Sun CC, g++, VC++, at least
up through VC8.0).

--
James Kanze

Leigh Johnston

Mar 23, 2010, 7:04:39 PM

"Chris Vine" <chris@cvine--nospam--.freeserve.co.uk> wrote in message
news:ceum77-...@cvinex--nospam--x.freeserve.co.uk...

It is not a contrived example, I have the following code in my codebase
which is similar:
....
    lock();
    while (iSockets.empty() && is_running())
    {
        unlock();
        Sleep(100);
        if (!is_running())
            return;
        lock();
    }
....

is_running() is an inline member function which returns the value of a
volatile member variable and shouldn't require a lock to query, as it is
atomic on the platform I target (x86). It makes sense for this platform and
compiler (VC++) that I use volatile. Admittedly I could use an event/wait
primitive instead, but that doesn't make the above code wrong for the
particular use-case in question. I agree that for other platforms and
compilers this might be different. From what I understand, and I agree, the
advent of C++0x should see such volatiles disappear in favour of
std::atomic<>. Not everyone in the real world is using C++0x, as the
standard has not even been published yet.

/Leigh

James Kanze

Mar 23, 2010, 10:07:03 PM
On Mar 20, 7:12 am, Ulrich Eckhardt <eckha...@satorlaser.com> wrote:
> Leigh Johnston wrote:
> > "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message
> >news:900580c6-c55c-46ec...@w9g2000prb.googlegroups.com...

> >>> Obviously the volatile keyword may not cause a memory
> >>> barrier instruction to be emitted but this is a side
> >>> issue. The combination of a memory barrier and volatile
> >>> makes multi-threaded code work.

> >> No. Memory barriers when properly used (without the
> >> volatile keyword) are sufficient.

> > No. Memory barriers are not sufficient if your optimizing
> > compiler is caching the value in a register: the CPU is not
> > aware that the register is referring to data being revealed
> > by the memory barrier.

> Actually, memory barriers in my understanding go both ways.
> One is to tell the CPU that it must not cache/optimise/reorder
> memory accesses. The other is to tell the compiler that it
> must not do so either.

Actually, as far as standard C++ is concerned, memory barriers
don't exist, so it's difficult to talk about them. In practice,
there are three ways to obtain them:

-- Inline assembler. See your compiler manual with regards to
what it guarantees; the standard makes no guarantees here.
A conforming implementation can presumably do anything it
wants with the inline assembler, including move it over an
access to a volatile variable. From a QoI point of view,
either 1) the compiler assumes nothing about the assembler,
considers that it might access any accessible variable, and
ensures that the actual semantics of the abstract machine
correspond to those specified in the standard, 2) reads and
interprets the inline assembler, and so recognizes a fence
or a memory barrier, and behaves appropriately, or 3)
provides some means of annotating the inline assembler to
tell the compiler what it can or cannot do.

-- Call a function written in assembler. This really comes
down to exactly the same as inline assembler, except that
it's a lot more difficult for the compiler to implement the
alternatives 2 or 3. (All compilers I know implement 1.)

-- Call some predefined system API. In this case, the
requirements are defined by the system API. (This is the
solution used by Posix, Windows and C++0x.)

--
James Kanze

Joshua Maurice

Mar 23, 2010, 10:08:19 PM
On Mar 23, 7:05 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
> Sometimes you have to use common sense:
>
> thread A:
> finished = false;
> spawn_thread_B();
> while(!finished)
> {
> /* do work */
>
> }
>
> thread B:
> /* do work */
> finished = true;
>
> If finished is not volatile and compiler optimizations are enabled thread A
> may loop forever.
>
> The behaviour of optimizing compilers in the real world can make volatile
> necessary to get correct behaviour in multi-threaded designs. You don't
> always have to use a memory barriers or a mutexes when performing an atomic
> read of some state shared by more than one thread.

No. You must use proper synchronization to guarantee a "happens-
before" relationship, and volatile does not do that portably. Without
the proper synchronization, the write to a variable in one thread,
even a volatile write, may never become visible to another thread,
even by a volatile read, on some real world systems.

"Common sense" would be to listen to the people who wrote the
compilers, such as Intel and gcc, to listen to the writers of the
standard who influence the compiler writers, such as the C++ standards
committee and their website, to listen to well respected experts who
have studied these things in far greater detail than you and I, to
read old papers and correspondence to understand the intention of
volatile (which does not include threading), etc. It is not "common
sense" to blithely ignore all of this and read into an ambiguous
definition in an unrelated standard to get your desired properties (the
C++03 standard does not mention threads, so it's not the relevant
standard to look at); it's actually quite unreasonable to do so.

Let me put it like this. Either you're writing on a thread-aware
compiler or you are not. On a thread-aware compiler, you can use the
standardized threading library, which will probably look a lot like
POSIX, WIN32, Java, and C++0x. It will include mutexes and condition
variables (or some rough equivalent, stupid WIN32), and possibly
atomic increments, atomic test and swap, etc. It will define a memory
model roughly compatible with the rest and include a strong equivalent
of Java's "happens-before" relationship. In which case, volatile has
no use (for threading) because the compiler is aware of the
abstractions and will honor them, including the optimizer. In the
other case, when you're using threads on a not-threads-aware compiler,
you're FUBAR. There are so many little things to get right to produce
correct assembly for threads that if the compiler is not aware of it,
even the most innocuous optimization, or even register allocation, may
entirely break your code. volatile may produce the desired result, and
it may not. This is entirely system dependent as you are not coding to
any standard, and thus not portable by any reasonable definition of
portable.

Also note that your (incorrect) reading of the C and C++ standards
makes no mention of a guarantee about reorderings between non-volatile
and volatile accesses. So if thread B in your example changed shared
state, those writes may be moved after the write to "finished"; thread A
could then see the write to "finished" but not see the changes to the
shared state, or see only a random portion of the writes to the shared
state: an inconsistent shared state, which is begging for a crash. So,
you could fully volatile-qualify all of the shared state, leading to a
huge performance hit, or you could just use the standardized
abstractions, which are guaranteed to work, which will actually work,
which will run much faster, and which are portable.
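
To make that hazard concrete, a hedged sketch (the names are
hypothetical): nothing in C++03 orders the plain store relative to the
volatile store, so the compiler or the hardware may commit them in either
order.

    int shared_data;                 // plain shared state
    volatile bool finished = false;  // the "signal" variable

    int compute();                   // hypothetical helper

    void thread_B()
    {
        shared_data = compute();     // may become visible after...
        finished = true;             // ...this volatile write does
    }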

There seems to persist this "romanticized" ideal of "volatile" as
somehow telling the compiler to "shut up" and "just do it", a
sentiment noted by Andrei and Scott in "C++ And The Perils Of Double-
Checked Locking". Please, go read the paper and its cited sources.
They explain it so much better than I could. I'll link to it again
here:
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

James Kanze

Mar 23, 2010, 10:16:13 PM
On Mar 22, 11:22 pm, "Bo Persson" <b...@gmb.dk> wrote:
> Leigh Johnston wrote:
> > "Andy Venikov" <swojchelo...@gmail.com> wrote in message

Note too that at least through VC8.0, regardless of the
documentation, VC++ didn't implement volatile in a way that
would allow it to be used effectively for synchronization on a
multithreaded Windows platform. For some of the higher-performance
machines, you need a fence, or at least some use of the lock
prefix, and VC++ didn't generate these.

Microsoft has expressed its intent to implement these extended
semantics for volatile, however.

--
James Kanze

James Kanze

Mar 23, 2010, 10:15:50 PM
On Mar 21, 9:32 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
> Joshua Maurice wrote:
> >> So is volatile sufficient - absolutely not. Portable? - hardly.
> >> Is it necessary in certain cases - absolutely.

> > Perhaps I was a bit too strong in my statement. I did intend to say
> > "for portable uses".

> > However, for future reference, what compilers of what
> > versions on what platforms implement volatile with these
> > semantics? I would like to research these claims to be
> > better prepared in this discussion and future ones. Are any
> > of these implementations not x86? These implementations
> > really don't provide a sane threading interface and force
> > the use of the volatile keyword for threading? Weird.

> I'm sorry if I wasn't clear in my previous post, but I was
> talking about standard volatile behavior.

Which is practically entirely implementation defined.

> The standard places a requirement on conforming
> implementations that:

> 1.9.6
> The observable behavior of the abstract machine is its
> sequence of reads and writes to volatile data and calls to
> library I/O functions

> 1.9.7
> Accessing an object designated by a volatile lvalue (3.10),
> modifying an object, calling a library I/O function, or
> calling a function that does any of those operations are all
> side effects, which are changes in the state of the execution
> environment. Evaluation of an expression might produce side
> effects. At certain specified points in the execution sequence
> called sequence points, all side effects of previous
> evaluations shall be complete and no side effects of
> subsequent evaluations shall have taken place

> 1.9.11
> The least requirements on a conforming implementation are:
> At sequence points, volatile objects are stable in the sense
> that previous evaluations are complete and subsequent
> evaluations have not yet occurred.

It also fails to define what it means by an access (to a
volatile object) and what it means to be "stable". The C
standard addresses the first by saying that it's implementation
defined.

The C standard also mentions one of the motivations behind
volatile: memory mapped I/O. Independently of the standard,
from a QoI point of view, we can expect that it would at least
provide the necessary semantics to support this. Tough luck: it
isn't the case with g++ nor Sun CC (nor, as far as I can tell,
VC++, but I'm less familiar with the semantics of modern Intel
than I am with those of Sparc).

> That to me sounds like a complete enough requirement that
> compilers don't perform optimizations that produce
> "surprising" results in so far as observable behavior in an
> abstract (single-threaded) machine are concerned.

For single-threaded machines, volatile does have some defined
semantics. For example, in a signal handler, you can assign to
a variable of type volatile sig_atomic_t (and that's about it, as
far as the C++ standard is concerned---Posix guarantees
considerably more, but still a lot less than is often assumed).
Volatile is significant with regards to communication between
signal handlers and the rest of the program.

> This requirement happens to be very useful for multi-threaded
> programs that can augment volatile with hardware fences to
> produce meaningful results.

How do you specify a hardware fence in C++? Anything in this
respect is implementation defined. And all of the
implementations do the right thing, in one way or another:
whatever you do to specify a fence also inhibits (or can be made
to inhibit) code movement across the fence.

--
James Kanze

Chris Vine

Mar 24, 2010, 7:37:56 AM
On Tue, 23 Mar 2010 17:04:39 CST

"Leigh Johnston" <le...@i42.co.uk> wrote:
[snip]
> It is not a contrived example, I have the following code in my
> codebase which is similar:
> ....
> lock();
> while (iSockets.empty() && is_running())
> {
> unlock();
> Sleep(100);
> if (!is_running())
> return;
> lock();
> }
> ....
>
> is_running() is an inline member function which returns the value of a
> volatile member variable and shouldn't require a lock to query as it
> is atomic on the platform I target (x86). It makes sense for this
> platform and compiler (VC++) that I use volatile. Admittedly I could
> use an event/wait primitive instead but that doesn't make the above
> code wrong for the particular use-case in question. I agree that for
> other platforms and compilers this might be different. From what I
> understand and I agree with the advent of C++0x should see such
> volatiles disappear in favour of std::atomic<>. Not everyone in the
> real world is using C++0x as the standard has not even been published
> yet.

This will work on windows (which I think is your platform) because of
the Microsoft extension. Whether it works on other platforms depends
on whether anything critical depends on the accuracy of is_running(),
because if it is an inline member function returning the value of a
volatile variable it might/will be reporting out-of-date state which
could be inconsistent with other relevant variables, as there is no
cache synchronisation. As it happens, C++0x atomic variables in ordinary
usage will provide memory synchronisation, unless you deliberately
choose relaxed memory ordering.

Since you already have a lock in operation it seems a bit perverse not
to use it: but this is just a code snippet so I imagine you have your
reasons in the larger picture.

Chris

James Kanze

Mar 24, 2010, 7:34:54 AM
On Mar 20, 7:13 am, red floyd <redfl...@gmail.com> wrote:
> On Mar 19, 2:06 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
> > That was my point, volatile whilst not a solution in itself
> > is a "part" of a solution for multi-threaded programming
> > when using a C++ (current standard) optimizing compiler:

> > thread A:
> > finished = false;
> > spawn_thread_B();
> > while(!finished)
> > {
> > /* do work */
> > }

> > thread B:
> > /* do work */
> > finished = true;

> > If finished is not volatile and compiler optimizations are
> > enabled thread A may loop forever.

> Agreed. I've seen this in non-threaded code with
> memory-mapped I/O.

Which is a different issue. That's what volatile was designed
for: I think it still works for that on Intel architecture. (It
doesn't on Sparc, at least with g++ or Sun CC:-(.) Threading is
a different issue.

Note that volatile is still relevant for communications between
a signal handler and the main (single threaded) application. At
least according to the standard.
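
For reference, a sketch of the sort of memory mapped I/O that volatile
was designed for (the addresses and bit layout are invented):

    volatile unsigned* const uart_status =
        reinterpret_cast<volatile unsigned*>(0xFFFF0004);
    volatile unsigned* const uart_data =
        reinterpret_cast<volatile unsigned*>(0xFFFF0000);

    void put_char(unsigned c)
    {
        while ((*uart_status & 0x01) == 0)  // each poll is a real load
            ;                               // device sets the ready bit
        *uart_data = c;                     // each write is a real store
    }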

--
James Kanze

James Kanze

Mar 24, 2010, 7:33:46 AM
On Mar 23, 1:42 pm, Michael Doubez <michael.dou...@free.fr> wrote:
> On 23 mar, 00:22, "Bo Persson" <b...@gmb.dk> wrote:

[...]


> Still it does say something of the semantic of the memory
> location. In practice the compiler will cut the optimizations
> regarding the volatile location; I don't see a compiler
> ignoring this kind of notification.

Not really. It makes some vague statements concerning "access",
while not defining what it really means by access. And "memory
location", without further qualifiers, has no real meaning on
modern processors, with their five or six levels of memory---is
the memory the core specific cache, the memory shared by all the
cores, or the virtual backup store (which maintains its values
even after the machine has been shut down)?

And of course, what really counts is what the compilers
implement: neither g++, nor Sun CC, nor VC++ (at least through
8.0) give volatile any more semantics that issuing a load or
store instruction---which the hardware will execute when it gets
around to it. Maybe.

> Which means that the memory value will eventually (after an
> undetermined amount of time) be flushed to the location and
> not kept around in the stack or somewhere else for
> optimization reasons.

Sorry, but executing a store instruction (or a mov with a
destination in memory) does NOT guarantee that there will be a
write cycle in main memory, ever. At least not on modern Sparc
and Intel architectures. (I'm less familiar with others, but
from what I've heard, Sparc and Intel are among the most strict
in this regard.)

> My understanding is that it is the amount of optimization
> cutting that is implementation dependent.

The issue isn't compiler optimization. If that were the
problem, just turn off the optimizer: all of the compilers that
I know will then follow the rules of the abstract machine very
rigorously.

> Still, I have not understodd how this can be useful for
> multithreading with the next-to-be standard, AFAIS atomic
> types gives better guarantees and better optimization
> possibilities.

C++0x addresses threading. And the people who worked on that
part of it have learned from earlier attempts. (Java had to
rewrite their memory model at least once in order to provide
adequate guarantees without totally breaking performance.)

--
James Kanze

James Kanze

Mar 24, 2010, 7:34:25 AM
On Mar 23, 2:05 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message

> news:b897c547-0237-4d21...@u15g2000prd.googlegroups.com...

[...]


> Sometimes you have to use common sense:

Modern memory models don't respect common sense very much.

> thread A:
> finished = false;
> spawn_thread_B();
> while(!finished)
> {
> /* do work */
> }

> thread B:
> /* do work */
> finished = true;

> If finished is not volatile and compiler optimizations are
> enabled thread A may loop forever.

And making finished volatile doesn't change anything in this
regard. At least not with Sun CC or g++ under Solaris, g++
under Linux on PC, and VC++8.0 under Windows on a 64 bit PC.

> The behaviour of optimizing compilers in the real world can
> make volatile necessary to get correct behaviour in
> multi-threaded designs.

As has been pointed out: volatile is never sufficient, and when
you use whatever is sufficient, volatile ceases to be necessary.

> You don't always have to use a memory barriers or a mutexes
> when performing an atomic read of some state shared by more
> than one thread.

Only if you want it to work.

--
James Kanze

James Kanze

Mar 24, 2010, 7:40:48 AM
On Mar 20, 8:22 am, Tony Jorgenson <tonytinker2...@yahoo.com> wrote:

[...]


> I understand that volatile does not guarantee that the order
> of memory writes performed by one thread are seen in the same
> order by another thread doing memory reads of the same
> locations. I do understand the need for memory barriers
> (mutexes, atomic variables, etc) to guarantee order, but there
> are still 2 questions that have never been completely
> answered, at least to my satisfaction, in all of the
> discussion I have read on this group (and the non moderated
> group) on these issues.

> First of all, I believe that volatile is supposed to guarantee the
> following:

> Volatile forces the compiler to generate code that performs
> actual memory reads and writes rather than caching values in
> processor registers. In other words, I believe that there is a
> one-to-one correspondence between volatile variable reads and
> writes in the source code and actual memory read and write
> instructions executed by the generated code. Is this correct?

Sort of. The standard uses a lot of weasel words (for good
reasons) with regards to volatile, and in particular, leaves it
up to the implementation to define exactly what it means by
"access". Still, it's hard to imagine an interpretation that
doesn't imply a machine instruction which loads or stores.

Of course, on modern machines, a store instruction doesn't
necessarily result in a write to physical memory; you typically
need additional instructions to ensure that. And on the
compilers I know (g++, Sun CC and VC++), volatile doesn't cause
them to be generated. (My most concrete experience is with Sun
CC on a Sparc, where volatile doesn't ensure that memory mapped
I/O works correctly.)

> Question 1:
> My first question is with regard to using volatile instead of
> memory barriers in some restricted multi-threaded cases. If my
> above statements are correct, is it possible to use _only_
> volatile with no memory barriers to signal between threads in
> a reliable way if only a single word (perhaps a single byte)
> is written by one thread and read by another?

No. Storing a byte (at the machine code level) on one processor
or core doesn't mean that the results of the store will be seen
on another processor. Modern processors reorder memory writes
in hardware, so given the sequence:

    volatile int a = 0, b = 0; // suppose int atomic

    void f()
    {
        a = 1;
        b = 1;
    }

another thread may still see b == 1 and a == 0.

> Question 1a:
> First of all, please correct me if I am wrong, but I believe
> volatile _must_always_ work as described above on any single
> core CPU. One CPU means one cache (or one hierarchy of caches)
> meaning one view of actual memory through the cache(s) that
> the CPU sees, regardless of which thread is running. Is this
> much correct for any CPU in existence? If not please mention a
> situation where this is not true (for single core).

The standard doesn't make any guarantees, but all of the
processor architectures I know do guarantee coherence within a
single core.

The real question here is rather: who has a single core machine
anymore? The last Sparc I worked on had 32 cores, and I got it
because it was deemed too slow for production work (where we had
128 cores). And even my small laptop is a dual core.

> Question 1b:
> Secondly, the only way I could see this not working on a
> multi-core CPU, with individual caches for each core, is if a
> memory write performed by one CPU is allowed to never be
> updated in the caches of other CPU cores. Is this possible?
> Are there any multi-core CPUs that allow this? Doesn’t the
> MESI protocol guarantee that eventually memory cached in one
> CPU core is seen by all others? I know that there may be
> delays in the propagation from one CPU cache to the others,
> but doesn’t it eventually have to be propagated? Can it be
> delayed indefinitely due to activity in the cores involved?

The problem occurs upstream of the cache. Modern processors
access memory through a pipeline. And optimize the accesses in
hardware. Reading and writing a cache line at a time. So if
you read a, then b, but the hardware finds that b is already in
the read pipeline (because you've recently accessed something
near it), then the hardware won't issue a new bus access for b;
it will simply use the value already in the pipeline. Which may
be older than the value of a, if the hardware does have to go to
memory for a.

All processors have instructions to force ordering: fence on an
Intel (and IIRC, a lock prefix creates an implicit fence),
membar on a Sparc. But the compilers I know don't issue these
instructions in the case of volatile access. So the hardware still
remains free to do the optimizations that volatile has forbidden
the compiler.

> Question 2:
> My second question is with regard to if volatile is necessary
> for multi-threaded code in addition to memory barriers. I know
> that it has been stated that volatile is not necessary in this
> case, and I do believe this, but I don’t completely understand
> why. The issue as I see it is that using memory barriers,
> perhaps through use of mutex OS calls, does not in itself
> prevent the compiler from generating code that caches
> non-volatile variable writes in registers.

Whether it prevents it or not is implementation defined. As
soon as you start doing this, you're formally in undefined
behavior as far as C or C++ are concerned. Posix and Windows,
however, make additional guarantees, and if the compiler is
Posix compliant or Windows compliant, you're safe with regards
to code movement across any of the APIs which forbid it.

If you're using things like inline assembler, or functions
written in assembler, you'll have to check your compiler
documentation, but in practice, the compiler will assume that
the inline code modifies all visible variables (and so ensure
that they are correctly written and read with regards to it)
unless it has some means to know better, and those means will
also allow it to take a possible fence or membar instruction
into account.

> I have heard it written in this group that posix, for example,
> supports additional guarantees that make mutex lock/unlock
> (for example) sufficient for correct inter-thread
> communication through memory without the use of volatile. I
> believe I read here once (from James Kanze I believe) that
> “volatile is neither sufficient nor necessary for proper
> multi- threaded code” (quote from memory). This seems to imply
> that posix is in cahoots with the compiler to make sure that
> this works.

Posix imposes additional constraints on C compilers, in addition
to what the C standard does. Technically, Posix doesn't know
that C++ exists (and vice versa); practically, C++ compilers do
claim Posix compliance, and extrapolate the C guarantees in a
logical fashion. (Given that they generally concern basic types
like int, this really isn't too difficult.)

I've seen less formal specification with regards to Windows (and
heaven knows, I'm looking, now that I'm working in an almost
exclusively Windows environment). But practically speaking,
VC++ behaves under Windows like Posix compliant compilers under
Posix, and you won't find any other compiler breaking things
that work with VC++.

> If you add mutex locks and unlocks (I know RAII, so please
> don’t derail my question) around some variable reads and
> writes, how do the mutex calls force the compiler to generate
> actual memory reads and writes in the generated code rather
> than register reads and writes?

That's the problem of the compiler implementor. Posix
(explicitly) and Windows (implicitly, at least) say that it has
to work, so it's up to the compiler implementor to make it work.
(In practice, most won't look into a function for which they
don't have the source code, and won't move code across a
function whose semantics they don't know.)
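
A minimal sketch of that in practice, assuming Posix threads (the
names are invented): because the compiler cannot see into
pthread_mutex_lock, it must assume the call may read or write
counter, and so cannot keep the value cached in a register across
the call.

#include <pthread.h>

int counter = 0;    /* deliberately not volatile */
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void increment()
{
    pthread_mutex_lock(&mtx);   /* opaque call: registers holding
                                   globals must be assumed stale */
    ++counter;
    pthread_mutex_unlock(&mtx); /* opaque call: the store to counter
                                   must be emitted before it */
}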

> I understand that compilation optimization affects these
> issues, but if I optimize the hell out of my code, how do
> posix calls (or any other OS threading calls) force the
> compiler to do the right thing? My only conjecture is that
> this is just an accident of the fact that the compiler can’t
> really know what the mutex calls do and therefore the compiler
> must make sure that all globally accessible variables are
> pushed to memory (if they are in registers) in case _any_
> called function might access them. Is this what makes it work?

In practice, in a lot of cases, yes:-). It's an easy and safe
solution for the implementor, and it really doesn't affect
optimization that much---critical zones which include system
calls or other functions for which the compiler doesn't have the
source code aren't that common. In theory, however, a compiler
could know the list of system requests which guarantee memory
synchronization, and disassemble the object files of any
functions for which it didn't have the sources, to see if they
made any such requests. I just don't know of any compilers
which do this.

> If not, then how do mutex call guarantee the compiler doesn’t
> cache data in registers, because this would surely make the
> mutexes worthless without volatile (which I know from
> experience that they are not).

The system API says that they have to work. It's up to the
compiler implementor to ensure that they do. Most adopt the
simple solution: I don't know what this function does, so I'll
assume the worst. But at least in theory, more elaborate
strategies are possible.

--
James Kanze

Herb Sutter

Mar 24, 2010, 7:40:49 AM
On Sat, 20 Mar 2010 05:32:19 CST, Joshua Maurice
<joshua...@gmail.com> wrote:
>As a reply to Herb Sutter, even if the code required to trigger the
>bug was some obscure little thing (which it is not), people can still
>hit it in the real world. My company generates C++ code from a very
>limited modeling language to allow object serialization between C++,
>xml, and Java. Imagine what wonderful little code paths in the
>optimizer could result from code generation on code generation?

I (and we) agree that this is fundamental, and sorry that this bug
report did slip through the cracks.

FYI, the team has continued to investigate this over the weekend and
we've decided to develop a QFE (Quick Fix Engineering = hot patch) for
this because we believe it's a high priority bug. The patch will be
developed and made available for both VS 2010 (about to ship) and VS
2008. The patch should be available in the coming weeks, no specific
ETA yet.

For status on progress and other information, please watch the Connect
bug report here:

https://connect.microsoft.com/VisualStudio/feedback/details/336316/

Thanks,

Herb


---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)

James Kanze

Mar 24, 2010, 7:43:48 AM
On Mar 20, 7:17 am, "Balog Pal" <p...@lib.hu> wrote:
> "Edward Diener" <eldie...@tropicsoft.invalid>>

> These errors are in the "nuisance" category. Compile fails,


> so you're forced some workaround. The one OP mentioned is in
> "danger" category -- the code compiles silently with incorrect
> code generation.

All compilers I know of have some bugs in that category as well.
It's true that they generally agree to fix them as soon as
possible when they are pointed out (for some definition of "as
soon as possible").

Since I'm apparently at the source of this, I'd like to
relativize it somewhat:

-- The bug takes a very particular combination of conditions to
appear; VC++ doesn't just drop destructors at random.

-- IMHO, in most application domains, that combination of
conditions simply won't appear, unless you're coding so
badly that nothing is going to work anyway.

On the other hand:

-- There are a few domains, particularly where complex
numerical analysis is involved (but possibly others as well)
where it can reasonably appear; it won't ever appear in code
I write, because my style (usually SESE) will never create
the particular combination of conditions, but the code where
I did find it wasn't unreasonable, and

-- It cost me three weeks to track down, three weeks of my
time, paid for by my employer (which means that a new
feature which we should currently be testing hasn't been
implemented yet).

It is, obviously, that last point which made me bring the issue
up to begin with. I don't care if it's one in a million, if I
happen to be that one. (Similarly, I don't care if it's 999999
in a million, if I'm the one exception.)

Anyway, I can understand Microsoft's original decision, even if
I don't agree with it, and I'm thankful for Herb's intervention
now. (Now how do I get him to go about intervening on the two
other bugs I've encountered. Particularly since one involves
something that I don't think is even officially supported.)

[...]


> Guess everyone around here agrees the perfect symmetry of
> ctor/dtor calls is fundamental in C++ and any code we write
> just takes it for granted.

Only the younger generation:-). I can remember a time when
almost all compilers I used had some problems in this regard.
(G++ 1.49 generated the destructor calls at the end of the
block. Immediately behind the ret instruction it generated if
there was a return statement in the block:-).)

> Seems folks at M$ have different thoughts and just plant a
> breaking bug in the optimizer -- and then refuse to correct
> it. In 2 years, then decide to not fix it at all. To me that
> sounds outrageous and is grounds to drop using MS compilers.

If you refuse to use a compiler with any bugs, then you won't be
doing much C++.

> O, Murphy can't be ignored, so bugs may be in the release, but
> I expect them gone after discovery. Suggesting "workaround"
> for the few weeks is okay, but thinking actual programmers
> shall stop using locals in a loop -- or start to count returns
> in functions? Hell no.

First, it only occurs if you return the local. Conditionally.
And there is no other return statement in the function. I don't
think that that's a common case. It does occur, obviously
(since I encountered it), but I don't think it would ever occur
naturally in code implementing, say, a compiler. Which
doubtlessly explains why the compiler team at Microsoft thought
it was rare enough to be ignored.
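
For concreteness, a hypothetical sketch of the shape being
described -- the actual code isn't shown in this thread, and the
names and types here are invented: a class-type local returned
conditionally, with no other return statement in the function.

#include <stdexcept>
#include <string>
#include <vector>

std::string find_nonempty(const std::vector<std::string>& v)
{
    for (std::size_t i = 0; i != v.size(); ++i) {
        std::string candidate = v[i];   // local of class type
        if (!candidate.empty())
            return candidate;           // the only return, conditional
    }
    throw std::runtime_error("none");   // no other return statement
}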

[...]


> Maybe a petition signed by many ACCU members (and other C++
> users) could create some force.

In the case of large companies, like Microsoft and Sun, all it
takes is a big enough customer. Presumably, no large customer
had complained. If Herb hadn't known me, and intervened
personally in the problem, they probably wouldn't have reacted.
(On the other hand, if the work-around hadn't been so simple, my
bosses might have had purchasing intervene, and we're probably
a big enough customer to get some reaction.) This isn't
particular to Microsoft---it's just the way things are (and I've
had similar experiences with Sun: when my one-man firm posted an
error, it was noted; when my customer, who was Sun's largest
account in Europe at the time, complained, one week later
there were three engineers from California on site to find out
what the problem was.)

> In the last decade, I several times wrote in different forums
> addressing the usual MS-bashing and claims about poor quality,
> that it mostly roots in the 90's and the company IMO changed
> for the right track around 2000 -- certainly the cleanup takes
> time but the will is there, and resources are used for good.

Like most companies, they're mixed. If you use them, you take
the bad with the good.

--
James Kanze

Michael Doubez

Mar 24, 2010, 3:12:59 PM

I am surprised. I would have expected cache lines to be flushed after
a given amount of time in order to avoid coherency issues.
'volatile' making it worse by *forcing* a flush per modification
(although without guaranteeing ordering with other non-volatile memory
access).

[snip]

--
Michael

James Kanze

Mar 25, 2010, 2:06:13 AM
On Mar 24, 7:12 pm, Michael Doubez <michael.dou...@free.fr> wrote:
> On 24 mar, 12:33, James Kanze <james.ka...@gmail.com> wrote:

[...]


> > Sorry, but executing a store instruction (or a mov with a
> > destination in memory) does NOT guarantee that there will be
> > a write cycle in main memory, ever. At least not on modern
> > Sparc and Intel architectures. (I'm less familiar with
> > others, but from what I've heard, Sparc and Intel are among
> > the most strict in this regard.)

> I am surprised. I would have expected cache lines to be
> flushed after a given amount of time in order to avoid
> coherency issues. 'volatile' making it worse by *forcing* a
> flush per modification (although without guaranteeing ordering
> with other non-volatile memory access).

Cache lines are only part of the picture, but similar concerns
apply to them. All of the coherency issues are addressed by
considering values, not store instructions. So if you modify
the same value several times before it makes it out of the
processor, some of those "writes" are lost. (This is generally
not an issue for threading, but it definitely affects things
like memory mapped I/O.) And for better or for worse, volatile
doesn't force any flushing on any of the compilers I know; all
it does is ensure that a store instruction is executed. So that
given something like:
int volatile a;
int volatile b;

// ...
a = 1;
b = 2;
, the compiler will ensure that the store instruction to a is
executed before the store instruction to b, but the hardware
(write pipeline, typically) may reorder the modifications to
main memory, or even in some extreme cases suppress one of them.

--
James Kanze

Balog Pal

Mar 25, 2010, 2:13:08 AM
"Herb Sutter" <herb....@gmail.com>

>>As a reply to Herb Sutter, even if the code required to trigger the
>>bug was some obscure little thing (which it is not), people can still
>>hit it in the real world. My company generates C++ code from a very
>>limited modeling language to allow object serialization between C++,
>>xml, and Java. Imagine what wonderful little code paths in the
>>optimizer could result from code generation on code generation?
>
> I (and we) agree that this is fundamental, and sorry that this bug
> report did slip through the cracks.
>
> FYI, the team has continued to investigate this over the weekend and
> we've decided to develop a QFE (Quick Fix Engineering = hot patch) for
> this because we believe it's a high priority bug. The patch will be
> developed and made available for both VS 2010 (about to ship) and VS
> 2008. The patch should be available in the coming weeks, no specific
> ETA yet.

I am glad to hear that.
Will there be a process-level fix too, from the "lessons learned"?

When I find a bug that went through the QA process and needs such revision
later, I check the cases for possible "kind of" issues. If I were there in
this case I'd run a query on the bug database for (closure = no-fix &&
impact = incorrect code generation).

If there are more, I guess the cost of issuing a hotfix for multiple is not
much more than that of a single issue -- and is balanced by the trust of
customers.

I read the link, quoting

"As previously noted, this was a mistake and the recent discussions and
feedback about this decision on this and other channels have been
illuminating, to say the least. I will certainly not forget the lessons
learned here and will apply them in the future. "

I hope the conclusion that "estimating probability on certain classes of
issues is just a bad idea" is among them.

A year ago I read this:

http://news.bbc.co.uk/2/hi/technology/7824939.stm

announcing that major players agreed that certain kinds of errors are not
supposed to exist in released products. Microsoft is among them. Good
stuff if actually meant. At least my interpretation of it suggests that if
a company finds a bug falling into one of the listed categories, the rest
of the evaluation can be scrapped and it can go directly to fix (not
passing Go, not collecting $200...).

The errors on the list are of a generic nature; for specific products I'd
expect more items with similar treatment. And experience should have
proven that better quality pays off, while keeping bugs in the codebase
untreated is actually expensive, as soon as all the impact is correctly
summed.

Balog Pal

Mar 25, 2010, 2:17:14 AM
"James Kanze" <james...@gmail.com>

>> These errors are in the "nuisance" category. Compile fails,
>> so you're forced some workaround. The one OP mentioned is in
>> "danger" category -- the code compiles silently with incorrect
>> code generation.
>
> All compilers I know of have some bugs in that category as well.
> It's true that they generally agree to fix them as soon as
> possible when they are pointed out (for some definition of "as
> soon as possible").

Which is quite important approach-wise. We know very well that writing
non-trivial applications leaves too many opportunities for issues. Even
with all the best effort and massive resources, some shit can still
happen. If a QA process is honestly followed I do not really blame
issuers for residual defects that were not discovered until reported
from the field.

But once in the open, I do expect them taken seriously, and getting fixed.
In how much time is a gray area, but what we are discussing is a simple
refusal, based on a guesstimate of likelihood.

> Since I'm apparently at the source of this, I'd like to
> relativize it somewhat:
>
> -- The bug takes a very particular combination of conditions to
> appear; VC++ doesn't just drop destructors at random.
>
> -- IMHO, in most application domains, that combination of
> conditions simply won't appear, unless you're coding so
> badly that nothing is going to work anyway.

Interesting; the code example I saw didn't look too special, or even
unaesthetic. I could have similar code in my programs somewhere, doing a
reach into a collection and returning some processed result, having no
other return, just a throw or assert. And how it uses temporaries is not
a review point if the result is correct.

> On the other hand:
>
> -- There are a few domains, particularly where complex
> numerical analysis is involved (but possibly others as well)
> where it can reasonably appear; it won't ever appear in code
> I write, because my style (usually SESE) will never create
> the particular combination of conditions, but the code where
> I did find it wasn't unreasonable, and
>
> -- It cost me three weeks to track down, three weeks of my
> time, paid for by my employer (which means that a new
> feature which we should currently be testing hasn't been
> implemented yet).

My take on "rare" bugs is pretty simple. My software possibly has them,
but if no one is ever hit by one, then it is really as good as
nonexistent. If it is found, that one occurrence indicates it is not rare
enough. I may speculate on the chances of getting hit (more as part of
thinking about why it was not discovered in-house), but it is already
over the threshold.

> It is, obviously, that last point which made me bring the issue
> up to begin with. I don't care if it's one in a million, if I
> happen to be that one. (Similarly, I don't care if its 999999
> in a million, if I'm the one exception.)

Yeah. As stated in one of the Discworld books -- one-in-a-million
chances tend to hit you nine times out of ten. ;-)))

But seriously, I just recently found a race condition in my current
project. My estimate of the chance is 1 in 16 million (that is pretty
close to winning the top lottery prize). I learned about it from a field
report, after the code had been executed just a few hundred times.

> Anyway, I can understand Microsoft's original decision, even if
> I don't agree with it, and I'm thankful for Herb's intervention
> now.

Well, unfortunately it is too easy to understand all kinds of "no-fix"
decisions. And speculating or making up likelihood figures is a good
wildcard in the rationale to file away the work. While pressing on for
resources to issue a fix is hard -- unless you are the actual boss and want
it.

It does not make it either right or even necessarily the more economic
solution, even considering just the company -- let alone the real pain
of anyone suffering the consequences.

>> Guess everyone around here agrees the perfect symmetry of
>> ctor/dtor calls is fundamental in C++ and any code we write
>> just takes it as granted.
>
> Only the younger generation:-). I can remember a time when
> almost all compilers I used had some problems in this regard.

Well, I remember them, but I thought we were past the issue by the late 90s.

> (G++ 1.49 generated the destructor calls at the end of the
> block. Immediately behind the ret instruction it generated if
> there was a return statement in the block:-).)

Yeah, we had many kinds of issues, but I can't recall a single case where
the vendor denied a clear bug or that it should be fixed. So that problem
belongs to the "we know what we want, it's just hard to actually get it"
category.

1.49? Was it 15-20 years ago?

>> Seems folks at M$ have different thoughts and just plant a
>> breaking bug in the optimizer -- and then refuse to correct
>> it. In 2 years, then decide to not fix it at all. To me that
>> sounds outrageous and is grounds to drop using MS compilers.
>
> If you refuse to use a compiler with any bugs, then you won't be
> doing much C++.

The point of the statement is not the presence of accidental bugs, but
the refusal to fix them.

>> O, Murphy can't be ignored, so bugs may be in the release, but
>> I expect them gone after discovery. Suggesting "workaround"
>> for the few weeks is okay, but thinking actual programmers
>> shall stop using locals in a loop -- or start to count returns
>> in functions? Hell no.
>
> First, it only occurs if you return the local. Conditionally.
> And there is no other return statement in the function. I don't
> think that that's a common case. It does occur, obviously
> (since I encountered it), but I don't think it would ever occur
> naturally in code implementing, say, a compiler. Which
> doubtlessly explains why the compiler team at Microsoft thought
> it was rare enough to be ignored.

I think it is pretty clear what the thinking process was. My point is that
that thinking process is fundamentally flawed (IMNSHO), and should be
eradicated.

(On a side track it would also be interesting to see what would be
considered "common enough" to warrant a fix and how the thing is measured.
My experience is damn sour in this area, with the evaluator really
thinking the other way around: making a decision to not fix based on
resource availability or mood, then making up some verbal explanation to
support it.)


> [...]
>> Maybe a petition signed by many ACCU members (and other C++
>> users) could create some force.
>
> In the case of large companies, like Microsoft and Sun, all it
> takes is a big enough customer. Presumably, no large customer
> had complained.

> If Herb hadn't known me, and intervened
> personally in the problem, they probably wouldn't have reacted.

My idea of asking ACCU was along a similar line -- if the high gurus of
C++ think it is a Bad Thing (TM), the force is supposedly similar to a
"big vendor". After all, who makes the decision to use some language and
which compiler?

Unfortunately I didn't see much activity -- the petition to save
Bletchley Park had way more traffic. Too bad.

I still believe it is worth some pushing, and Herb -- being already on
the issue -- may have some ammunition to press for improvements. When I
was fighting the "machine", being able to show "demand" was a help.

> (On the other hand, if the work-around hadn't been so simple, my
> bosses might have had purchasing intervene, and we're probably
> a big enough customer to get some reaction.)

Gosh, the work-around looks simple once you've spent your 3 weeks
locating the ill spot in your code. At that one spot. What about the
other places, and the future? Make it a bullet in code reviews to count
exits, conditionals and temporaries in functions, to spot another
possible candidate? To me even the idea sounds ridiculous.

What other people figured: turning off optimization makes more sense,
along with starting to look for a different compiler.

It is a TRUST issue. If I am not sure my tools are up to the task, they
are not fit. It is hard enough to write correct source; now add to it
that it may be mis-translated?

> This isn't
> particular to Microsoft---it's just the way things are (and I've
> had similar experiences with Sun: when my one-man firm posted an
> error, it was noted; when my customer, who was Sun's largest
> account in Europe that the time, complained, one week later
> there were three engineers from California on site to find out
> what the problem was.)

Sure, the last part -- big power induces a fast and big effect -- is
there. That doesn't make it necessary for less spectacular requests to be
discarded, especially in an uber-blatant way: on the ACCU list people
said the discussed issue is present in VS2005. A fix did not make it into
the 05 service packs, into the next major release 08, into its SP1, or
into the next release, 10.

That is the real problem. Not issuing a specific hotfix tomorrow can be
rationalized somehow by that rare argument. But the issue should still
wear the "bug -- to be fixed" status, and not survive any forthcoming
milestone.

>> In the last decade, I several times wrote in different forums
>> addressing the usual MS-bashing and claims about poor quality,
>> that it mostly roots in the 90's and the company IMO changed
>> for the right track around 2000 -- certainly the cleanup takes
>> time but the will is there, and resources are used for good.
>
> Like most companies, they're mixed. If you use them, you take
> the bad with the good.

Well, that "accepting bad" was a major force in early 90s. i guess MS made
insane amount of money on it, but also picked up all the bad reputation that
stick a decade after change of heart and will linger for much longer. (I do
remember the pre-win3x era when Microsoft was a brand of very good wality,
be it a C compiler, MS-Word, MS-dos 3.3. To be replaced by rush, crashes,
instability and no fixes only in next releases, that were full of different
bugs. Not even counting the later internet era with the impact of
onmipresent buffer overruns and related attacks.

Accepting the bad costs the world a few billion dollars by the lowest
estimate.

Our options are certainly limited, but IMO it is important to move in the
other direction. And fight the companies from reverting to the ill tactics.

Andy Venikov

Mar 25, 2010, 2:20:43 AM
Joshua Maurice wrote:
> On Mar 21, 2:32 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
<snip>


All the sources that you listed were saying that volatile isn't
sufficient. And some went on as far as to say that it's "mostly"
useless. That "mostly", however, covers an area that is real and I was
talking about that area. None of them disagreed with what I said.

Here's a brief example that I hope will put this issue to rest:


volatile int n;

n = 5;
n = 6;


volatile guarantees (note: no interpretation here, it's just what it
says) that the compiler will issue two store instructions in the correct
order (5 then 6). And that is a very useful quality for multi-threaded
programs that choose not to use synchronization primitives like mutexes
and such. Of course it doesn't mean that the processor executes them in
that order, that's why we'd use memory fences. But to stop the
compiler from messing around with these sequences, the volatile is
necessary.

>(Thanks Andrei, and his paper "C++ And The Perils Of Double
> Checked Locking".
>
>Andy, have you even read it?


Of course I have. It's no secret that I admire the works of both of the
authors. I have read a lot of other papers as well. Maged Michael (who
co-authored an article on lock-free algorithms with Andrei) and Tim
Harris in particular are my favorites. But it wasn't the point of the
discussion, was it?
It's a great article. Among other things, it talks about the
non-portability of a solution that relies solely on volatile. How is it
different from what I have said in my earlier post? Quoting:

"Is volatile sufficient - absolutely not.
Portable - hardly.
Necessary in certain conditions - absolutely."


<snip>


Thanks,
Andy.

George Neuner

Mar 25, 2010, 3:10:07 PM
On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
<swojch...@gmail.com> wrote:

>
>All the sources that [Joshua Maurice] listed were saying that volatile
>isn't sufficient. And some went on as far as to say that it's "mostly"
>useless. That "mostly", however, covers an area that is real and I was
>talking about that area. None of them disagreed with what I said.
>
>Here's a brief example that I hope will put this issue to rest:
>
>
>volatile int n;
>
>n = 5;
>n = 6;
>
>
>volatile guarantees (note: no interpretation here, it's just what it
>says) that the compiler will issue two store instructions in the correct
>order (5 then 6). And that is a very useful quality for multi-threaded
>programs that choose not to use synchronization primitives like mutexes
>and such. Of course it doesn't mean that the processor executes them in
>that order, that's why we'd use memory fences. But to stop the
>compiler from messing around with these sequences, the volatile is
>necessary.

Not exactly. 'volatile' is necessary to force the compiler to
actually emit store instructions, else optimization would elide the
useless first assignment and simply set n = 6. Beyond that constant
propagation and/or value tracking might also eliminate the remaining
assignment and the variable altogether.
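
A small sketch of the difference (an invented example): under
optimization a compiler may collapse the plain stores, but it must
emit both volatile stores, in source order.

int plain;
volatile int vol;

void f()
{
    plain = 5;   // dead store: may be elided, leaving only plain = 6
    plain = 6;
    vol = 5;     // must be emitted
    vol = 6;     // must be emitted, after the store of 5
}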

As you noted, 'volatile' does not guarantee that an OoO CPU will
execute the stores in program order ... for that you need to add a
write fence between them. However, neither 'volatile' nor write fence
guarantees that any written value will be flushed all the way to
memory - depending on other factors - cache snooping by another
CPU/core, cache write back policies and/or delays, the span to the
next use of the variable, etc. - the value may only reach to some
level of cache before the variable is referenced again. The value may
never reach memory at all.

OoO execution and cache behavior are the reasons 'volatile' doesn't
work as intended for many systems even in single-threaded use with
memory-mapped peripherals. A shared (atomically writable)
communication channel in the case of interrupts or concurrent threads
is actually a safer, more predictable use of 'volatile' because, in
general, it does not require values to be written all the way to main
memory.


>It's a great article. Among other things, it talks about the
>non-portability of a solution that relies solely on volatile. How is it
>different from what I have said in my earlier post? Quoting:
>
>"Is volatile sufficient - absolutely not.
>Portable - hardly.
>Necessary in certain conditions - absolutely."

I haven't seen the whole thread and I'm not sure of the post to which
you are referring. I think you might not be giving enough thought to
the way cache behavior can complicate the standard's simple memory
model. But it's possible that you have considered this and simply
have not explained yourself thoroughly enough for [me and others] to
see it.

'volatile' is necessary for certain uses but is not sufficient for
(al)most (all) uses. I would say that for expert uses, some are
portable and some are not. For non-expert uses ... I would say that
most uses contemplated by non-experts will be neither portable nor
sound.


> Andy.

George

Leigh Johnston

Mar 25, 2010, 7:25:19 PM

"George Neuner" <gneu...@comcast.net> wrote in message
news:rq1nq5tskd51cmnf5...@4ax.com...
<snip>

Whether or not the store that is guaranteed to be emitted by the compiler
due to the presence of volatile propagates to L1 cache, L2 cache or main
memory is irrelevant as far as volatile and multi-threading is concerned as
long as CPU caches remain coherent. You could argue that because of this
volatile is actually more useful for multi-threading than for its more
traditional use of performing memory mapped I/O with modern CPU
architectures. I will reiterate though that the advent of C++0x should
consign this use of volatile to history.

/Leigh

Joshua Maurice

Mar 25, 2010, 7:26:44 PM

No, that is your interpretation, an overreaching interpretation.
Neither the C nor the C++ standard mentions "store instruction". The C++
standard talks about "accesses" of "stored values". It never talks
about a processor, assembly, or "store instructions" in the context of
volatile. In fact, the C standard, which the C++ standard incorporates
in large part (both technically and in spirit), specifically says that
volatile accesses and "visible aspects of the abstract machine" are
inherently implementation specific and implementation defined.

And your argument is still irrelevant. (Nearly) all compiler writers
and (nearly?) all compiler implementations disagree with this
interpretation, and what the compilers actually do is all that matters
at the end of the day. volatile has no place as a synchronization
construct in portable code. None.

James Kanze

Mar 25, 2010, 7:31:25 PM
On Mar 25, 7:10 pm, George Neuner <gneun...@comcast.net> wrote:
> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov

[...]


> As you noted, 'volatile' does not guarantee that an OoO CPU will
> execute the stores in program order ...

Arguably, the original intent was that it should. But it
doesn't, and of course, the ordering guarantee only applies to
variables actually declared volatile.

> for that you need to add a write fence between them. However,
> neither 'volatile' nor write fence guarantees that any written
> value will be flushed all the way to memory - depending on
> other factors - cache snooping by another CPU/core, cache
> write back policies and/or delays, the span to the next use of
> the variable, etc. - the value may only reach to some level of
> cache before the variable is referenced again. The value may
> never reach memory at all.

If that's the case, then the fence instruction is seriously
broken. The whole purpose of a fence instruction is to
guarantee that another CPU (with another thread) can see the
changes. (Of course, the other thread also needs a fence.)

> OoO execution and cache behavior are the reasons 'volatile'
> doesn't work as intended for many systems even in
> single-threaded use with memory-mapped peripherals.

The reason volatile doesn't work with memory-mapped peripherals
is because the compilers don't issue the necessary fence or
membar instruction, even if a variable is volatile.

--
James Kanze

Herb Sutter

Mar 25, 2010, 7:33:51 PM

Please remember this: Standard ISO C/C++ volatile is useless for
multithreaded programming. No argument otherwise holds water; at best
the code may appear to work on some compilers/platforms, including all
attempted counterexamples I've seen on this thread.

On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov

<swojch...@gmail.com> wrote:
>All the sources that you listed were saying that volatile isn't
>sufficient. And some went on as far as to say that it's "mostly"
>useless. That "mostly", however, covers an area that is real and I was
>talking about that area. None of them disagreed with what I said.
>
>Here's a brief example that I hope will put this issue to rest:
>
>volatile int n;
>
>n = 5;

Insert:
x = 42; // update some non-volatile variable

>n = 6;
>
>volatile guarantees (note: no interpretation here, it's just what it
>says) that the compiler will issue two store instructions in the correct
>order (5 then 6). And that is a very useful quality for multi-threaded
>programs that choose not to use synchronization primitives like mutexes
>and such.

No. The reason you can't use volatiles for synchronization is that
they aren't synchronized (QED). There are several issues, and you
immediately go on to state one of them (again, not the only one):

>Of course it doesn't mean that the processor executes them in
>that order, that's why we'd use memory fences.

It's not just processor execution, it's also propagation and
visibility at other threads/cores.

There are other reasons why volatile n is insufficient for
inter-thread communication even in this example. Consider this
question:

- What if another thread does "if( n == 6 ) assert( x == 42 );"?
Will the assertion always be true? (Hint: It must for a volatile write
to be usable to even just publish data from one thread to another.)

- What values of n could another thread see? (Hint: What values of n
could another thread _not_ see?)

Standard volatile is useless for multithreaded programming, and that's
okay because that's not what it's for. It is intended only for things
like hardware access -- and even for those purposes is deliberately
underspecified in the standard(s), and the C and C++ committees are
not going to "fix" volatile even for that use. On some
implementations, volatile may happen to have some nonstandard
semantics that happen to be useful for multithreaded programming
(notably on Visual C++ on x86/x64 where volatiles can be used for most
uses of atomic<> including DCL but _not_ including Dekker's), but
that's not what volatile is for (and it was a mistake to try to add those
guarantees to volatile in VC++).
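
For contrast, a sketch of the same publication pattern using the
draft C++0x atomics (invented names): with the default sequentially
consistent store and load, the assertion above holds.

#include <atomic>
#include <cassert>

int x = 0;
std::atomic<int> n(0);

void writer()
{
    x = 42;        // plain data
    n.store(6);    // seq_cst store: publishes x along with n
}

void reader()
{
    if (n.load() == 6)    // seq_cst load
        assert(x == 42);  // guaranteed; with a volatile int n it is not
}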

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


James Kanze

Mar 25, 2010, 7:30:44 PM
On Mar 25, 6:20 am, Andy Venikov <swojchelo...@gmail.com> wrote:
> Joshua Maurice wrote:
> > On Mar 21, 2:32 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
> <snip>
> Here's a brief example that I hope will put this issue to rest:

It does, but not in the way you seem to think:-).

> volatile int n;

> n = 5;
> n = 6;

> volatile guarantees (note: no interpretation here, it's just
> what it says) that the compiler will issue two store
> instructions in the correct order (5 then 6).

Since we're splitting hairs, technically, that's not what the
standard says. What the standard says is that the accesses
implied by the assignment statements will occur in the given
order. For some implementation defined meaning of "access".
(I'll also add my usual complaint. In the standard,
"implementation defined" means that is must be documented. I've
yet to find such documentation, however, for any of the
compilers I've used.)

> And that is a very useful quality for multi-threaded programs
> that choose not to use synchronization primitives like mutexes
> and such.

No it's not. It's quite frequent, in fact, that in the above
scenario, the 5 never makes it to main memory, or even out of
the write pipeline.

> Of course it doesn't mean that the processor executes them in
> that order, that's why we'd use memory fences. But to stop the
> compiler from messing around with these sequences, the
> volatile is necessary.

Show me how you use the memory fences, and I'll show you why the
compiler can't move the accesses accross them. There's no way
of getting a memory fence in standard C++, so you've moved the
problem to additional guarantees by the implementation. An
implementation either understands all of the code between the
two assignments (including inline assembler, calls to functions
written in assembler, etc.), and will see the fence and behave
accordingly, or it doesn't (and most don't try), and so it
cannot move the assignments, since it must assume that the code
it doesn't see or understand accesses n.
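
As a sketch of the "doesn't understand" case, assuming GCC-style
inline assembly (the macro name is invented): even an empty asm
with a "memory" clobber is code the compiler cannot see into, so it
must assume the asm may access n, and can neither elide the first
store nor move the second one above it.

#define COMPILER_BARRIER() asm volatile ("" ::: "memory")

int n;

void g()
{
    n = 5;               // cannot be elided: the barrier may "read" n
    COMPILER_BARRIER();  // emits no instruction, but no code motion
                         // across it either
    n = 6;
}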

[...]

Concerning volatile...


> Necessary in certain conditions - absolutely."

It is certainly necessary in certain conditions. Both the C
standard and Posix, for example, require it when a variable is
accessed from both the main program and a signal handler. And
if it worked as intended (not the case with the compilers I
know), it would be necessary for memory mapped IO (which on the
compilers I know requires assembler in order to guarantee
correct behavior). It's just never, never necessary for
communications between threads.
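
A sketch of the signal handler case, the one use the C standard
actually blesses (the handler and flag names are invented):

#include <csignal>

volatile std::sig_atomic_t got_signal = 0;

extern "C" void on_sigint(int)
{
    got_signal = 1;   // sig_atomic_t: the only object a handler may
}                     // portably write to

int main()
{
    std::signal(SIGINT, on_sigint);
    while (!got_signal) {
        // do work; volatile forces a genuine reload each iteration
    }
}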

--
James Kanze

Andy Venikov

Mar 26, 2010, 7:05:30 AM
James Kanze wrote:
> On Mar 25, 7:10 pm, George Neuner <gneun...@comcast.net> wrote:
>> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
> [...]
>> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> execute the stores in program order ...
>
> Arguably, the original intent was that it should. But it
> doesn't, and of course, the ordering guarantee only applies to
> variables actually declared volatile.
>
>> for that you need to add a write fence between them. However,
>> neither 'volatile' nor write fence guarantees that any written
>> value will be flushed all the way to memory - depending on
>> other factors - cache snooping by another CPU/core, cache
>> write back policies and/or delays, the span to the next use of
>> the variable, etc. - the value may only reach to some level of
>> cache before the variable is referenced again. The value may
>> never reach memory at all.
>
> If that's the case, then the fence instruction is seriously
> broken. The whole purpose of a fence instruction is to
> guarantee that another CPU (with another thread) can see the
> changes. (Of course, the other thread also needs a fence.)

Hmm, the way I understand fences is that they introduce ordering and not
necessarily guarantee visibility. For example:

1. Store to location 1
2. StoreStore fence
3. Store to location 2

will guarantee only that if store to location 2 is visible to some
thread, then the store to location 1 is guaranteed to be visible to the
same thread as well. But it doesn't necessarily guarantee that the
stores will be ever visible to some other thread. Yes, on certain CPUs
fences are implemented as "flushes", but they don't need to be.


Thanks,
Andy.

Joshua Maurice

Mar 27, 2010, 5:13:05 PM

Well yes. Volatile does not change that though. Most of my
understanding comes from
http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt
and
The JSR-133 Cookbook for Compiler Writers
http://g.oswego.edu/dl/jmm/cookbook.html
(Note that the discussion of volatile in the above link is for Java
volatile 1.5+, not C and C++ volatile.)

I'm not the most versed on this, so please correct me if I'm wrong. As
an example:

main thread:
a = 0
b = 0
start thread 2
a = 1
write barrier
b = 2

thread 2:
print b
read barrier
print a

Without the read and write memory barriers, this will print any of the
4 possible combinations:
0 0, 2 0, 0 1, 2 1

With the barriers, it removes one possible:
0 0, 0 1, 2 1

As I understand "read" and "write" barriers (which are a subset of
"store/store, store/load, load/store, load/load", the semantics are:
"If a read before the read barrier sees a write after the write
barrier, then all reads after the read barrier will see all writes
before the write barrier." Yes, the semantics are conditional. It does
not guarantee that a write will ever become visible. However, volatile
will not change that. If thread 2 prints b == 2, then thread 2 will
print a == 1, volatile or no volatile. If thread 2 prints b == 0, then
thread 2 can print a == 0 or a == 1, volatile or no volatile. For some
lock free algorithms, these guarantees are very useful, such as making
double checked locking correct. Ex:

singleton_t* get_singleton()
{
    // all static storage is zero-initialized before runtime
    static singleton_t* p;

    if (0 != p)    // check 1
    {
        READ_BARRIER();
        return p;
    }
    Lock lock;
    if (0 != p)    // check 2
        return p;
    singleton_t* tmp = new singleton_t;
    WRITE_BARRIER();
    p = tmp;
    return p;
}

If a thread reads p != 0 at check 1 which is before the read barrier,
then it sees the write after the write barrier "p = tmp", and it is
thus guaranteed that all subsequent reads after the read barrier (in
the caller code) will see all writes before the write barrier (from
the singleton_t constructor). This conditional visibility is exactly
what we need in this case, what DCLP really wants. If the read at
check 1 gives us 0, then we do have to use a mutex to force
visibility, but most of the time it will read p as nonzero at check 1,
and the barriers will guarantee correct semantics. Also, from what I
remember, the read barrier is quite cheap on most systems, possibly
free on the x86 (?). (See the JSR-133 Cookbook linked above.) I don't
quite grasp the nuances enough yet to say anything more concrete than
this at this time.

Again, I'm coding this up from memory, so please correct if any
mistakes.

George Neuner

Mar 28, 2010, 5:05:33 PM
On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james...@gmail.com>
wrote:

>On Mar 25, 7:10 pm, George Neuner <gneun...@comcast.net> wrote:
>> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
> [...]
>> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> execute the stores in program order ...
>
>Arguably, the original intent was that it should. But it
>doesn't, and of course, the ordering guarantee only applies to
>variables actually declared volatile.

"volatile" is quite old ... I'm pretty sure the "intent" was defined
before there were OoO CPUs (in de facto use if not in standard
document). Regardless, "volatile" only constrains the behavior of the
*compiler*.


>> for that you need to add a write fence between them. However,
>> neither 'volatile' nor write fence guarantees that any written
>> value will be flushed all the way to memory - depending on
>> other factors - cache snooping by another CPU/core, cache
>> write back policies and/or delays, the span to the next use of
>> the variable, etc. - the value may only reach to some level of
>> cache before the variable is referenced again. The value may
>> never reach memory at all.
>
>If that's the case, then the fence instruction is seriously
>broken. The whole purpose of a fence instruction is to
>guarantee that another CPU (with another thread) can see the
>changes.

The purpose of the fence is to sequence memory accesses. All the
fence does is create a checkpoint in the instruction sequence at which
relevant load or store instructions dispatched prior to dispatch of
the fence instruction will have completed execution. There may be
separate load and store fence instructions and/or they may be combined
in a so-called "full fence" instruction.

However, in a memory hierarchy with caching, a store instruction does
not guarantee a write to memory but only that one or more write cycles
is executed on the core's memory connection bus. Where that write
goes is up to the cache/memory controller and the policies of the
particular cache levels involved. For example, many CPUs have
write-thru primary caches while higher levels are write-back with
delay (an arrangement that allows snooping of either the primary or
secondary cache with identical results).

For another thread (or core or CPU) to perceive a change, a value must
be propagated into shared memory. For all multi-core processors I am
aware of, the first shared level of memory is cache - not main memory.
Cores on the same die snoop each other's primary caches and share
higher level caches. Cores on separate dies in the same package share
cache at the secondary or tertiary level.

The same holds true for all separate CPU shared memory multiprocessors
I am aware of ... they are connected so that they can snoop other's
caches at some level, or an additional level of shared cache is placed
between the CPUs and memory, or both.


>>(Of course, the other thread also needs a fence.)

Not necessarily.


>> OoO execution and cache behavior are the reasons 'volatile'
>> doesn't work as intended for many systems even in
>> single-threaded use with memory-mapped peripherals.
>
>The reason volatile doesn't work with memory-mapped peripherals
>is because the compilers don't issue the necessary fence or
>membar instruction, even if a variable is volatile.

It still wouldn't matter if they did. Let's take a simple case of one
thread and two memory mapped registers:

volatile unsigned *regA = 0x...;
volatile unsigned *regB = 0x...;
unsigned oldval, retval;

*regA = SOME_OP;
*regA = SOME_OP;

oldval = *regB;
do {
    retval = *regB;
} while ( retval == oldval );

Let's suppose that writing a value twice to regA initiates some
operation that returns a value in regB. Will the above code work?

No. The processor will execute both writes, but the cache will
combine them so the device will see only a single write. The cache
needs to be flushed between writes to regA.

Ok, let's assume there is a flush API and add some flushes:

*regA = SOME_OP;
FLUSH *regA;
*regA = SOME_OP;
FLUSH *regA;

oldval = *regB;
do {
    retval = *regB;
} while ( retval == oldval );

Does this now work?

Maybe. It will work if the flush operation includes a fence,
otherwise you can't know whether the write has occurred before the
cache line is flushed.

Ok, let's assume there is a fence API and add fences:

*regA = SOME_OP;
SFENCE;
FLUSH *regA;
*regA = SOME_OP;
SFENCE;
FLUSH *regA;

oldval = *regB;
do {
    retval = *regB;
} while ( retval == oldval );

Does this now work?

Yes. Now I am guaranteed that the first value will be written all the
way to memory (and to my device) before the second value is written.


Now the question is whether a cache flush includes a fence operation
(or vice versa)? The answer is "it depends". On many architectures,
the ISA has no cache control instructions - the cache controller is
mapped to reserved memory addresses or I/O ports. Some cache
controllers permit only programming replacement policy and do not
allow programs to manipulate the entries. Some controllers flush
everything rather than allowing individual lines to be flushed. It
depends.

If there is a language level API for cache control or for fencing, it
may or may not include the other operation depending on the whim of
the developer.


The upshot is this:
- "volatile" is required for any CPU.
- fences are required for an OoO CPU.
- cache control is required for a write-back cache between
  CPU and main memory.


>James Kanze

George

James Kanze

Mar 28, 2010, 5:25:46 PM
On Mar 26, 12:33 am, Herb Sutter <herb.sut...@gmail.com> wrote:
> Please remember this: Standard ISO C/C++ volatile is useless
> for multithreaded programming. No argument otherwise holds
> water; at best the code may appear to work on some
> compilers/platforms, including all attempted counterexamples
> I've seen on this thread.

I agree with you in principle, but do be careful as to how you
formulate this. Standard ISO C/C++ is useless for multithreaded
programming, at least today. With or without volatile. And in
Standard ISO C/C++, volatile is useless for just about anything;
it was always intended to be mainly a hook for implementation
defined behavior, i.e. to allow things like memory-mapped IO
while not imposing excessive loss of optimizing possibilities
everywhere.

In theory, an implementation could define volatile in a way that
would make it useful in multithreading---I think Microsoft once
proposed doing so in the standard. In my opinion, this sort of
violates the original intention behind volatile, which was that
volatile is applied to a single object, and doesn't affect other
objects in the code. But it's certainly something you could
argue both ways.

[...]


> No. The reason you can't use volatiles for synchronization is that
> they aren't synchronized (QED).

:-). And the reason they're not synchronized is that
synchronization involves more than one variable, and that it was
never the intent of volatile to involve more than one variable.
(On a lot of modern processors, however, it would be impossible
to fully implement the original intent of volatile without
synchronization. The only instructions available on a Sparc,
for example, to ensure that a store instruction actually results
in a write to an external device is a membar. And that
synchronizes *all* accesses of the given type.)

[...]


> (and it was a mistake to try to add those
> guarantees to volatile in VC++).

Just curious: is that Microsoft talking, or Herb Sutter (or
both)?

--
James Kanze

James Kanze

Mar 28, 2010, 5:23:49 PM
On Mar 26, 12:05 pm, Andy Venikov <swojchelo...@gmail.com> wrote:

> James Kanze wrote:
>> If that's the case, then the fence instruction is seriously
>> broken. The whole purpose of a fence instruction is to
>> guarantee that another CPU (with another thread) can see the
>> changes. (Of course, the other thread also needs a fence.)

> Hmm, the way I understand fences is that they introduce
> ordering and not necessarily guarantee visibility. For
> example:

> 1. Store to location 1
> 2. StoreStore fence
> 3. Store to location 2

> will guarantee only that if store to location 2 is visible to
> some thread, then the store to location 1 is guaranteed to be
> visible to the same thread as well.

A StoreStore fence guarantees that all stores issued before the
fence are visible in main memory, and that none issued after the
fence are visible (at the time the StoreStore fence is
executed).

Of course, for another thread to be guaranteed to see the results
of any store, it has to use a load fence, to ensure that the
values it sees are those after the load fence, and not some
value that it happened to pick up earlier.

> But it doesn't necessarily guarantee that the stores will be
> ever visible to some other thread. Yes, on certain CPUs fences
> are implemented as "flushes", but they don't need to be.

If you redefine fence to mean something different than it
normally means, then who knows. The normal definition requires
all writes to have propagated to main memory (supposing it is a
store fence) before the instruction procedes. This is why they
can be so slow. (And all of the processors I know guaranteed
coherence within a single core; you never need a fence if you're
single threaded.)

--
James Kanze

James Kanze

Mar 28, 2010, 5:31:06 PM
On Mar 26, 12:25 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "George Neuner" <gneun...@comcast.net> wrote in message

> news:rq1nq5tskd51cmnf5...@4ax.com...
> <snip>


>> 'volatile' is necessary for certain uses but is not sufficient for
>> (al)most (all) uses. I would say that for expert uses, some are
>> portable and some are not. For non-expert uses ... I would say that
>> most uses contemplated by non-experts will be neither portable nor
>> sound.

> Whether or not the store that is guaranteed to be emitted by
> the compiler due to the presence of volatile propagates to L1
> cache, L2 cache or main memory is irrelevant as far as
> volatile and multi-threading is concerned as long as CPU
> caches remain coherent.

That depends on the architecture and what the compiler actually
does in the case of volatile. Some of the more recent
processors have a separate cache for each core, at least at the
lowest level, and most access memory through a pipeline which is
unique to the core.

> You could argue that because of this volatile is actually more
> useful for multi-threading than for its more traditional use
> of performing memory mapped I/O with modern CPU architectures.

You'll have to explain that, since none of the compilers I use
generate any sort of fence or membar when volatile is used, and
the processors definitely require it.

--
James Kanze

Leigh Johnston

Mar 29, 2010, 2:45:14 AM
"James Kanze" <james...@gmail.com> wrote in message news:ddf75ee4-b26b-46a0...@k19g2000yqn.googlegroups.com...

{ quoted signature removed; please remove such extra stuff yourself.
-mod }

I would expect the following property of the volatile keyword on VC++ to be
a common interpretation of the semantics of volatile for most C++ compilers:

"Objects declared as volatile are not used in certain optimizations because
their values can change at any time. The system always reads the current
value of a volatile object at the point it is requested, even if a previous
instruction asked for a value from the same object. Also, the value of the
object is written immediately on assignment. "

It should be obvious how this property can be useful when writing
multi-threaded code, not always useful in isolation perhaps but certainly
useful when used in conjunction with other threading constructs such as
mutexes and fences. Depending on the compiler/platform and on the actual
use-case volatile on its own might not be enough: from what I can tell VC++
volatile does not emit fence instructions for x86 yet the above property
still stands (and there are rare cases when memory barriers are needed on
x86, see
http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fences-on-an-x86/).
I agree that this is mostly an implementation-specific issue and the
current C++ standard is threading agnostic; however, saying volatile has
absolutely no use in multi-threaded programming is incorrect.

Performance is often cited as another reason to not use volatile;
however, the use of volatile can actually help with multi-threading
performance, as you can perform a safe lock-free check before performing
a more expensive lock.
This all depends on the use-case in question and the only volatile I have in
my entire codebase is for just such a check (for this use-case ordering does
not matter so no fences are needed).
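
A sketch of that kind of check, assuming Posix threads (the flag
and function names are invented): a cheap unsynchronized test of a
flag, falling back to the mutex only when the flag suggests there
is work to do; the state is re-checked under the lock, so ordering
doesn't matter.

#include <pthread.h>

volatile bool work_pending = false;
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void poll()
{
    if (!work_pending)          // lock-free fast path: volatile forces
        return;                 // a real load on every call
    pthread_mutex_lock(&mtx);   // slow path: re-check under the lock
    if (work_pending) {
        /* ... consume the work ... */
        work_pending = false;
    }
    pthread_mutex_unlock(&mtx);
}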

I agree with what Andy said elsewhere in this thread:

"Is volatile sufficient - absolutely not.
Portable - hardly.
Necessary in certain conditions - absolutely."

/Leigh

Herb Sutter

Mar 29, 2010, 2:46:22 AM
On Sun, 28 Mar 2010 15:25:46 CST, James Kanze <james...@gmail.com>
wrote:

>On Mar 26, 12:33 am, Herb Sutter <herb.sut...@gmail.com> wrote:
>> Please remember this: Standard ISO C/C++ volatile is useless
>> for multithreaded programming. No argument otherwise holds
>> water; at best the code may appear to work on some
>> compilers/platforms, including all attempted counterexamples
>> I've seen on this thread.
>
>I agree with you in principle, but do be careful as to how you
>formulate this. Standard ISO C/C++ is useless for multithreaded
>programming, at least today. With or without volatile. And in
>Standard ISO C/C++, volatile is useless for just about anything;

All of the above is still true in draft C++0x and C1x, both of which
have concurrency memory models, threads, and mutexes.

>it was always intended to be mainly a hook for implementation
>defined behavior, i.e. to allow things like memory-mapped IO
>while not imposing excessive loss of optimizing possibilities
>everywhere.

Right. And is therefore (still) deliberately underspecified.

>In theory, an implementation could define volatile in a way that
>would make it useful in multithreading---I think Microsoft once
>proposed doing so in the standard.

Yes, back in 2006 I briefly agreed with that before realizing why it
was wrong (earlier in this thread you correctly said I supported it
and then stopped doing so).

>In my opinion, this sort of
>violates the original intention behind volatile, which was that
>volatile is applied to a single object, and doesn't affect other
>objects in the code. But it's certainly something you could
>argue both ways.

No, it's definitely wrong. Briefly, volatile and atomic<> have two
very different purposes, and they impose similar (but different)
constraints. The pitfall in making them both be the same thing (e.g.,
extending volatile to make it strong enough to serve the needs of
atomic<>s as well) is that you end up with a single thing that can be
used for both purposes but is necessarily suboptimal for either one.
That is, you'll nearly always only be using a given variable for one
of the two uses at a time, and if you're using it for hardware access
it'll be a slower volatile because it also has the optimization
restrictions of an atomic<>, and if you're using it for inter-thread
communication it'll be a slower atomic<> because it also has the
optimization restrictions of a volatile. If I write this up I'll
include some examples.

> [...]
>> No. The reason that can't use volatiles for synchronization is that
>> they aren't synchronized (QED).
>
>:-). And the reason they're not synchronized is that
>synchronization involves more than one variable, and that it was
>never the intent of volatile to involve more than one variable.

That's part of it, yes.

>(On a lot of modern processors, however, it would be impossible
>to fully implement the original intent of volatile without
>synchronization. The only instructions available on a Sparc,
>for example, to ensure that a store instruction actually results
>in a write to an external device is a membar. And that
>synchronizes *all* accesses of the given type.)
>
> [...]
>> (and it was a mistake to try to add those
>> guarantees to volatile in VC++).
>
>Just curious: is that Microsoft talking, or Herb Sutter (or
>both)?

Both. It was well-intentioned and seemed like a good idea until the
"ugh it pessimizes both uses" problem was understood. FWIW, a number
of people in WG21 suggested combining the two, hence Hans' and Nick's
paper on why volatile shouldn't be strengthened. (That paper is
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html and
I think there are additional reasons against it in addition to the
good ones they give.)

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)

--

James Kanze

Mar 29, 2010, 5:22:35 PM
to
On Mar 29, 7:45 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in
> message news:ddf75ee4-b26b-46a0...@k19g2000yqn.googlegroups.com...

> I would expect the following property of the volatile keyword
> on VC++ to be a common interpretation of the semantics of
> volatile for most C++ compilers:

> "Objects declared as volatile are not used in certain
> optimizations because their values can change at any time. The
> system always reads the current value of a volatile object at
> the point it is requested, even if a previous instruction
> asked for a value from the same object. Also, the value of the
> object is written immediately on assignment. "

That's generally the case. For some very imprecise meaning of
"reads" and "writes". On the compilers I have access to, the
meaning is no more than "executes a machine level load or store
instruction". Which is practically meaningless for anything
useful on a modern processor.

> It should be obvious how this property can be useful when
> writing multi-threaded code, not always useful in isolation
> perhaps but certainly useful when used in conjunction with
> other threading constructs such as mutexes and fences.

It isn't, at least not to me. Perhaps if you could come up with
a small example of where it might be useful.

> Depending on the compiler/platform and on the actual use-case
> volatile on its own might not be enough: from what I can tell
> VC++ volatile does not emit fence instructions for x86 yet the
> above property still stands (and there are rare cases when
> memory barriers are needed on x86,

> see http://bartoszmilewski.wordpress.com/2008/11/05/who-ordered-memory-fe...).


> I agree that this is mostly an implementation specific issue
> and the current C++ standard is threading agnostic however
> saying volatile has absolutely no use in multi-threading
> programming is incorrect.

Given that the exact semantics of volatile and threading are not
really covered by the standard, it's certain that one cannot
make blanket claims: an implementation could define volatile in
a way that would make it useful with its implementation of
threading, say by giving volatile the same meaning that it has
in Java, for example. In practice, however, Posix doesn't, and
I don't know of a compiler under Unix which goes beyond the
Posix guarantees (except when assembler is involved, and then
they give enough guarantees that you don't need volatile). And
while I've yet to find an exact specification for Windows, the
implementation of volatile in VC++ 8.0 doesn't do enough to make
it useful in threading, and Microsoft (in the voice of Herb
Sutter) has said here that it isn't useful (although I don't
know if Herb is speaking for Microsoft here, or simply
expressing his personal opinion).

Anyhow, for the moment, all I can really claim is that it is
useless under the Unix I know (Solaris, HP/UX, AIX and Linux)
and under Windows.

> Performance is often cited as another reason to not use
> volatile however the use of volatile can actually help with
> multi-threading performance as you can perform a safe
> lock-free check before performing a more expensive lock.

Again, I'd like to see how. This sounds like the double-checked
locking idiom, and that's been proven not to work.

> I agree with what Andy said elsewhere in this thread:

> "Is volatile sufficient - absolutely not.
> Portable - hardly.
> Necessary in certain conditions - absolutely."

Yes, but Andy didn't present any facts to back up his statement.

The simplest solution would be to just post a bit of code
showing where or how it might be useful. A good counter example
trumps every argument.

--
James Kanze

Leigh Johnston

Mar 29, 2010, 6:55:44 PM
to

"James Kanze" <james...@gmail.com> wrote in message

news:36f7e40e-4584-430d...@z3g2000yqz.googlegroups.com...
<snip>


>> Performance is often cited as another reason to not use
>> volatile however the use of volatile can actually help with
>> multi-threading performance as you can perform a safe
>> lock-free check before performing a more expensive lock.
>
> Again, I'd like to see how. This sounds like the double-checked
> locking idiom, and that's been proven not to work.
>

IMO for an OoO CPU the double checked locking pattern can be made to work
with volatile if fences are also used or the lock also acts as a fence (as
is the case with VC++/x86). This is also the counter-example you are
looking for, it should work on some implementations. FWIW VC++ is clever
enough to make the volatile redundant for this example however adding
volatile makes no difference to the generated code (read: no performance
penalty) and I like making such things explicit similar to how one uses
const (doesn't effect the generated output but documents the programmer's
intentions). Which is better: use volatile if there is no noticeable
performance penalty or constantly check your compiler's generated assembler
to check the optimizer is not breaking things? The only volatile in my
entire codebase is for the "status" of my "threadable" base class and I
don't always acquire a lock before checking this status and I don't fully
trust that the optimizer won't cache it for all cases that might crop up as
I develop code. BTW I try and avoid singletons too so I haven't found the
need to use the double checked locking pattern AFAICR.
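
To be concrete about the DCL case anyway, something like the following
sketch is what I have in mind (a hypothetical singleton, since that's the
classic example; std::atomic_thread_fence is just the C++0x spelling of the
platform fence macros/intrinsics one would actually use, and on VC++/x86
entering the critical section already provides the fence):

#include <atomic> // std::atomic_thread_fence (the C++0x spelling)
#include <mutex>

class Singleton
{
public:
    static Singleton& instance()
    {
        Singleton* p = instance_;            // lock-free first check
        if (p == 0)
        {
            std::lock_guard<std::mutex> lock(mutex_);
            p = instance_;                   // second check, under the lock
            if (p == 0)
            {
                p = new Singleton;
                // Publish only after construction has completed.
                std::atomic_thread_fence(std::memory_order_release);
                instance_ = p;
            }
        }
        else
        {
            // Pair with the release fence before touching *p.
            std::atomic_thread_fence(std::memory_order_acquire);
        }
        return *p;
    }

private:
    Singleton() {}
    static Singleton* volatile instance_;    // volatile: force the re-reads
    static std::mutex mutex_;
};

Singleton* volatile Singleton::instance_ = 0;
std::mutex Singleton::mutex_;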

/Leigh

James Kanze

Mar 29, 2010, 6:54:39 PM
to
On Mar 29, 7:46 am, Herb Sutter <herb.sut...@gmail.com> wrote:
> On Sun, 28 Mar 2010 15:25:46 CST, James Kanze <james.ka...@gmail.com>
> wrote:

> >On Mar 26, 12:33 am, Herb Sutter <herb.sut...@gmail.com> wrote:
> >> Please remember this: Standard ISO C/C++ volatile is useless
> >> for multithreaded programming. No argument otherwise holds
> >> water; at best the code may appear to work on some
> >> compilers/platforms, including all attempted counterexamples
> >> I've seen on this thread.

> >I agree with you in principle, but do be careful as to how
> >you formulate this. Standard ISO C/C++ is useless for
> >multithreaded programming, at least today. With or without
> >volatile. And in Standard ISO C/C++, volatile is useless for
> >just about anything;

> All of the above is still true in draft C++0x and C1x, both of
> which have concurrency memory models, threads, and mutexes.

Huh? "Standard ISO C/C++" is useless for multithreaded
programming today because as far as the standard is concerned,
as soon as there's more than one thread, you have undefined
behavior. Unless things changed a lot while I wasn't looking
(I've not been able to follow things too closely lately), C++0x
will define threading, and offer some very useful primitives for
multithreaded code. Considerably more than boost::thread, which
was already very, very useful.

> >it was always intended to be mainly a hook for
> >implementation-defined behavior, i.e. to allow things like memory-mapped
> >IO while not imposing excessive loss of optimizing possibilities
> >everywhere.

> Right. And is therefore (still) deliberately underspecified.

As it should be.

> >In theory, an implementation could define volatile in a way that
> >would make it useful in multithreading---I think Microsoft once
> >proposed doing so in the standard.

> Yes, back in 2006 I briefly agreed with that before realizing
> why it was wrong (earlier in this thread you correctly said I
> supported it and then stopped doing so).

> >In my opinion, this sort of violates the original intention
> >behind volatile, which was that volatile is applied to a
> >single object, and doesn't affect other objects in the code.
> >But it's certainly something you could argue both ways.

> No, it's definitely wrong.

Well, I basically agree with you there. But there are degrees
of wrong: it's not wrong in the same sense as claiming that an
mfence instruction doesn't affect cache synchronization on an
Intel is wrong.

--
James Kanze

Herb Sutter

Mar 29, 2010, 6:57:18 PM
to
On Mon, 29 Mar 2010 15:22:35 CST, James Kanze <james...@gmail.com>
wrote:

>while I've yet to find an exact specification for Windows, the
>implementation of volatile in VC++ 8.0 doesn't do enough to make
>it useful in threading, and Microsoft (in the voice of Herb
>Sutter) has said here that it isn't useful (although I don't
>know if Herb is speaking for Microsoft here, or simply
>expressing his personal opinion).

Not exactly; actually, what I said was that in VC++ targeting x86/x64,
volatile was strengthened to add most (not all) ordering guarantees of
an atomic<>. It is enough to make most patterns safe including
Double-Checked Locking and reference counting, but not enough to make
examples like Dekker's safe.

But in retrospect strengthening volatile in this way to make it
useful for some/most inter-thread communication was a mistake, and has
several drawbacks: a) it does so at the cost of making volatile writes
slower which means some pessimization of 'normal' uses of volatile for
hardware access; b) it doesn't work for some thread communication
techniques that rely on a global ordering of independent reads of
independent writes to different objects; c) it doesn't make the
variables atomic so to make them useful they have to be aligned
variables of a type and size that happens to be naturally atomic on
the target processor; and of course d) it's not portable.

The right solution is to leave volatile alone and add std::atomic<>.
That's what C++0x does. Longer-term, that's what Visual C++ will go to
and recommend as well, with two caveats: first, now that we've shipped
volatile this way we'll probably have to keep supporting the
strengthened semantics for a long time (possibly forever?) to preserve
code that relies on those semantics (alas, the price of shipping
something is trying hard to not break customers that use it); and
second, atomic<> didn't make it into VS 2010 and so will have to await
a later release.
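
For the common publish/consume case, the (draft) C++0x version looks like
this sketch (variable names invented):

#include <atomic>

std::atomic<bool> ready(false);
int payload = 0;

void producer()
{
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publish
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // spin until published
        ;
    // payload is now guaranteed to be visible as 42 - and no volatile
    // anywhere.
}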

>On Mar 29, 7:45 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
>> Performance is often cited as another reason to not use
>> volatile however

No "however" needed, that cited reason is correct. Volatile disables
optimizations that would be legal for an atomic<>. A quick example is
combining/eliding writes (e.g., v = 1; v = 2; can't be transformed to
v = 2;, but a = 1; a = 2; can be transformed to a = 2;). Another is
combining/eliding reads.
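
In code, a sketch of what I mean (using the draft C++0x std::atomic
spelling; the function name is invented):

#include <atomic>

volatile int v = 0;     // hardware access: every access is a side effect
std::atomic<int> a(0);  // inter-thread communication

void f()
{
    v = 1;              // must be emitted: eliding it would drop a side effect
    v = 2;

    a = 1;              // a conforming optimizer may fuse these two stores
    a = 2;              // into a = 2; no thread is guaranteed to see the 1
}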

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)

--

James Kanze

Mar 29, 2010, 6:53:44 PM
to
On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:
> On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james.ka...@gmail.com>
> wrote:

> >On Mar 25, 7:10 pm, George Neuner <gneun...@comcast.net> wrote:
> >> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov

> > [...]
> >> As you noted, 'volatile' does not guarantee that an OoO CPU will
> >> execute the stores in program order ...

> >Arguably, the original intent was that it should. But it
> >doesn't, and of course, the ordering guarantee only applies to
> >variables actually declared volatile.

> "volatile" is quite old ... I'm pretty sure the "intent" was defined
> before there were OoO CPUs (in de facto use if not in standard
> document). Regardless, "volatile" only constrains the behavior of the
> *compiler*.

More or less. Volatile requires the compiler to issue code
which conforms to what the documentation says it does. It
requires all accesses to take place after the preceding sequence
point, and the results of those accesses to be stable before the
following sequence point. But it leaves it up to the
implementation to define what is meant by "access", and most
take a very, very liberal view of it.

> >> for that you need to add a write fence between them. However,
> >> neither 'volatile' nor write fence guarantees that any written
> >> value will be flushed all the way to memory - depending on
> >> other factors - cache snooping by another CPU/core, cache
> >> write back policies and/or delays, the span to the next use of
> >> the variable, etc. - the value may only reach to some level of
> >> cache before the variable is referenced again. The value may
> >> never reach memory at all.

> >If that's the case, then the fence instruction is seriously
> >broken. The whole purpose of a fence instruction is to
> >guarantee that another CPU (with another thread) can see the
> >changes.

> The purpose of the fence is to sequence memory accesses.

For a much more rigorous definition of "access" than that used
by the C++ standard.

> All the fence does is create a checkpoint in the instruction
> sequence at which relevant load or store instructions
> dispatched prior to dispatch of the fence instruction will
> have completed execution.

That's not true for the two architectures whose documentation
I've studied, Intel and Sparc. To quote the Intel documentation
of MFENCE:

Performs a serializing operation on all load and store
instructions that were issued prior the MFENCE
instruction. This serializing operation guarantees that
every load and store instruction that precedes in
program order the MFENCE instruction is globally visible
before any load or store instruction that follows the
MFENCE instruction is globally visible.

Note the "globally visible". Both Intel and Sparc guarantee
strong ordering within a single core (i.e. a single thread);
mfence or membar (Sparc) are only necessary if the memory will
also be "accessed" from a separate unit: a thread running on a
different core, or memory mapped IO.

> There may be separate load and store fence instructions and/or
> they may be combined in a so-called "full fence" instruction.

> However, in a memory hierarchy with caching, a store
> instruction does not guarantee a write to memory but only that
> one or more write cycles is executed on the core's memory
> connection bus.

On Intel and Sparc architectures, a store instruction doesn't
even guarantee that. All it guarantees is that the necessary
information is somehow passed to the write pipeline. What
happens after that is anybody's guess.

> Where that write goes is up to the cache/memory controller and
> the policies of the particular cache levels involved. For
> example, many CPUs have write-thru primary caches while higher
> levels are write-back with delay (an arrangement that allows
> snooping of either the primary or secondary cache with
> identical results).

> For another thread (or core or CPU) to perceive a change a
> value must be propagated into shared memory. For all
> multi-core processors I am aware of, the first shared level of
> memory is cache - not main memory. Cores on the same die
> snoop each other's primary caches and share higher level
> caches. Cores on separate dies in the same package share
> cache at the secondary or tertiary level.

And on more advanced architectures, there are cores which don't
share any cache. All of which is irrelevant, since simply
issuing a store instruction doesn't even guarantee a write to
the highest level cache, and a membar or a fence instruction
guarantees access all the way down to the main, shared memory.

[...]


> >The reason volatile doesn't work with memory-mapped
> >peripherals is because the compilers don't issue the
> >necessary fence or membar instruction, even if a variable is
> >volatile.

> It still wouldn't matter if they did. Lets take a simple case of one
> thread and two memory mapped registers:

> volatile unsigned *regA = 0x...;
> volatile unsigned *regB = 0x...;
> unsigned oldval, retval;

> *regA = SOME_OP;
> *regA = SOME_OP;

> oldval = *regB;
> do {
>     retval = *regB;
> } while ( retval == oldval );

> Let's suppose that writing a value twice to regA initiates
> some operation that returns a value in regB. Will the above
> code work?

Not on a Sparc. Probably not on an Intel, but I'm less sure.
It wouldn't surprise me if Intel did allow certain segments to
be configured with an implicit fence around each access, and if
the memory mapped IO were in such a segment, it would work.

> No. The processor will execute both writes, but the cache
> will combine them so the device will see only a single write.
> The cache needs to be flushed between writes to regA.

Again, the cache is really irrelevant here. The combining will
already occur in the write pipeline.

[...]


> The upshot is this:
> - "volatile" is required for any CPU.

I'm afraid that doesn't follow from anything you've said.
Particularly because the volatile is largely a no-op on most
current compilers---it inhibits compiler optimizations, but the
generated code does nothing to prevent the reordering that
occurs at the hardware level.

> - fences are required for an OoO CPU.

By OoO, I presume you mean "out of order". That's not the only
source of the problems.

> - cache control is required for a write-back cache between
> CPU and main memory.

The cache is largely irrelevant on Sparc or Intel. The
processor architectures are designed in a way to make it
irrelevant. All of the problems would be there even in the
absence of caching. They're determined by the implementation of
the write and read pipelines.

--
James Kanze

Leigh Johnston

Mar 29, 2010, 10:22:54 PM
to

"Herb Sutter" <herb....@gmail.com> wrote in message news:j462r55lr984u1vg0...@4ax.com...
<snip>


>>On Mar 29, 7:45 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>> Performance is often cited as another reason to not use
>>> volatile however
>
> No "however" needed, that cited reason is correct. Volatile disables
> optimizations that would be legal for an atomic<>. A quick example is
> combining/eliding writes (e.g., v = 1; v = 2; can't be transformed to
> v = 2;, but a = 1; a = 2; can be transformed to a = 2;). Another is
> combining/eliding reads.
>

If atomic reads can be elided, won't that be problematic for using atomics with the double-checked locking pattern (so we are back to using volatile atomics)?

/Leigh

Andy Venikov

Mar 30, 2010, 7:03:50 AM
to


I just did, in my reply to Herb Sutter.
Sorry - if I had read your post earlier, I would've put my example here; it
makes more contextual sense.

>
> --
> James Kanze
>

Andy.

Andy Venikov

Mar 30, 2010, 7:03:11 AM
to
Herb Sutter wrote:
> Please remember this: Standard ISO C/C++ volatile is useless for
> multithreaded programming. No argument otherwise holds water; at best
> the code may appear to work on some compilers/platforms, including all
> attempted counterexamples I've seen on this thread.

You have enormous clout with C++ professionals, including myself, so
before permanently agreeing to such an all-encompassing statement allow
me to maybe step back a little and see what it is that's at the core of
this argument. Maybe we're arguing the same point. Or maybe I'm missing
something big in which case I'll be doubly glad to have been shown my
wrong assumptions.

I understand that volatile never was supposed to be of any help for
multithreaded programming. I don't expect it to issue any memory fences
nor make any guarantees whatsoever about anything thread-related...
Yet, on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it
produces just the code I need for my multithreaded programs. And I
really don't see how it wouldn't, given common-sense understanding of
what it should do in single-threaded programs. And I'm pretty sure that
it's not going to change in the foreseeable future.

So my use of volatile may not be standard-portable, but it sure is
real-life portable.

Here's the point of view I'm coming from.
Imagine that someone needs to implement a library that provides certain
multithreading (multiprogramming) tools like atomic access,
synchronization primitives and some lock-free algorithms that will be
used by other developers so that they wouldn't have to worry about
things like volatile. (Now that boost.atomic is almost out, I'll happily
use it. But Helge Bahmann (the author of the library) didn't have such a
luxury, so to make his higher-level APIs work he had to internally
resort to low-level tools like volatiles where appropriate.)

So, with the above said, here's a concrete example of how I'd use
volatile without access to a ready-made library. Let's take Maged
Michael's lock-free queue ("Simple, Fast, and Practical Non-Blocking and
Blocking Concurrent Queue Algorithms", Maged Michael & Michael Scott, 1996).
It uses a technique similar to DCL to verify the validity of a read. Look
into its dequeue() method.
I'll provide the pseudocode here:

dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1:  loop                          # Keep trying until Dequeue is done
D2:    head = Q->Head              # Read Head
D3:    tail = Q->Tail              # Read Tail
D4:    next = head->next           # Read Head.ptr->next
D5:    if head == Q->Head          # Are head, tail, and next consistent?
D6:      if head.ptr == tail.ptr   # Is queue empty or Tail falling behind?
D7:        if next.ptr == NULL     # Is queue empty?
D8:          return FALSE          # Queue is empty, couldn't dequeue
D9:        endif
           # Tail is falling behind. Try to advance it
D10:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11:     else                      # No need to deal with Tail
           # Read value before CAS; otherwise another dequeue might free the next node
D12:       *pvalue = next.ptr->value
           # Try to swing Head to the next node
D13:       if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14:         break                 # Dequeue is done. Exit loop
D15:       endif
D16:     endif
D17:   endif
D18: endloop
D19: free(head.ptr)                # It is safe now to free the old dummy node
D20: return TRUE                   # Queue was not empty, dequeue succeeded

Look at line D5: it needs to check if Q->Head is still the same as what
we read from it before. Otherwise two possibilities for breaking
correctness arise: 1) it would be possible for the element pointed to by
Q->Head to be re-inserted back into the queue with NULL in its "next",
and then dequeue would return "empty" when in reality the queue was
never empty at any given moment; or 2) the first element was removed
after we've read Q->Head and before we've read next, so there could be
garbage in head->next by the time we read it and we'd try to access
garbage on line D12.

This piece of pseudocode could be naively translated to the following C++
code:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;
    Node * localNext = localHead->next;
    if (localHead == head_)
    {
        ...
    }

But it wouldn't work for the obvious reasons.
One needs to insert MemoryFences in the right places.
Memory fences are something highly platform-specific, so one
would define macros for them that would expand to different instructions
on different platforms.
Here's the code with memory fences inserted:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;
    DataDependencyBarrier(); // All the systems that I know of will do
                             // this sort of barrier automatically, so
                             // this macro will expand to nothing
    Node * localNext = localHead->next;
    LoadLoadBarrier();       // on x86 this will expand to nothing
    if (localHead == head_)
    {
        ...
    }

This is much better, but it still has problems: first, on x86, the
LoadLoadBarrier() will expand to nothing and there will be no indication
to the compiler not to re-order different loads; and second (and I think
it's the crux of my argument), an optimizing compiler will dispose
of the "if" statement even in the face of memory barriers. No matter how
many or what type of memory barriers you insert, the compiler will be
allowed to omit the if statement. The ONLY way to force the compiler
(any compiler, for that matter) to generate it is to declare head_ as
volatile.

Here's the final code:
struct Node
{
    <unspecified> data;
    Node volatile * next;
};
Node volatile * volatile head_;
Node volatile * volatile tail_;

dequeue()
{
    while (true)
    {
        Node volatile * localHead = head_;
        Node volatile * localTail = tail_;
        DataDependencyBarrier();
        Node volatile * localNext = localHead->next;

        if (localHead == head_)
        {
            ...
        }
        ....
    }
}


Now this code will produce the intended correct object code on all the
compilers I've listed above and on at least these CPUs: x86, Itanium,
MIPS, PowerPC (assuming that all the memory barrier macros have been
defined for all the platforms). And without any modifications to the above code.
How's that for portability?

I think my fault was that in my previous posts I was pushing more
heavily on volatile's ability to tell the compiler not to reorder the
instructions it generates (which is still useful) rather than to
emphasize the fact that I want volatile to tell the compiler not to
optimize away certain instructions. The reordering problem could be
circumvented by using inline asm statements (and then again, on x86,
LoadLoadBarrier would expand to nothing, so we'd be forced to use a
bogus inline asm statement - I'd rather choose to use volatile), but I
don't see how the optimizing away problem could be circumvented without
the use of volatile.

Now, after writing all this, I realize that I could've used a simpler
example - Peterson's algorithm for two threads wouldn't work
without the use of volatile: the "turn" variable is assigned the same
value as it's being compared to later, so the compiler will omit the "if
turn == x" part of the if statement.
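
For reference, here's roughly what I mean (a sketch for two threads, 0 and
1, with invented function names; and, as discussed, a real OoO CPU would
additionally need a StoreLoad barrier between the writes and the reads in
the loop):

volatile bool interested[2] = { false, false };
volatile int turn = 0;

void lock(int self)                // self is 0 or 1
{
    int other = 1 - self;
    interested[self] = true;       // announce intent
    turn = other;                  // yield to the other thread
    while (interested[other] && turn == other)
        ;                          // spin: volatile forces both re-reads
}

void unlock(int self)
{
    interested[self] = false;
}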

<snip>

I hope this clears matters up - I'm sorry if I wasn't clear before.

> ---
> Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)
>
> Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
> Architect, Visual C++ (www.gotw.ca/microsoft)
>


Andy.

Andy Venikov

Mar 30, 2010, 7:04:18 AM
to
James Kanze wrote:
> On Mar 26, 12:05 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
>> James Kanze wrote:
>>> If that's the case, then the fence instruction is seriously
>>> broken. The whole purpose of a fence instruction is to
>>> guarantee that another CPU (with another thread) can see the
>>> changes. (Of course, the other thread also needs a fence.)
>
>> Hmm, the way I understand fences is that they introduce
>> ordering and not necessarily guarantee visibility. For
>> example:
>
>> 1. Store to location 1
>> 2. StoreStore fence
>> 3. Store to location 2
>
>> will guarantee only that if store to location 2 is visible to
>> some thread, then the store to location 1 is guaranteed to be
>> visible to the same thread as well.
>
> A StoreStore fence guarantees that all stores issued before the
> fence are visible in main memory, and that none issued after the
> fence are visible (at the time the StoreStore fence is
> executed).
>
> Of course, for another thread to be guaranteed to see the results
> of any store, it has to use a load fence, to ensure that the
> values it sees are those after the load fence, and not some
> value that it happened to pick up earlier.

What I meant was that a memory fence doesn't mean that the effects of a
write will be immediately flushed to main memory or the effects of a
read immediately read from main memory. Generally, a memory fence is
merely a checkpoint to tell the processor not to reorder instructions
around the fence. I don't remember which processor docs I've read (I
believe it was Itanium) but here's, for example, what the docs said about
a store fence: a store barrier would make sure that all the stores
appearing before a fence would be stored in the write queue before any
of the stores that follow the fence. In no way are you guaranteed that
any of the stores are in main memory after the fence instruction has
executed. For that you'd have to use a flush instruction.

Andy.

Andy Venikov

Mar 30, 2010, 7:04:48 AM
to
James Kanze wrote:
> On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:
<snip>

>> All the fence does is create a checkpoint in the instruction
>> sequence at which relevant load or store instructions
>> dispatched prior to dispatch of the fence instruction will
>> have completed execution.
>
> That's not true for the two architectures whose documentation
> I've studied, Intel and Sparc. To quote the Intel documentation
> of MFENCE:
>
> Performs a serializing operation on all load and store
> instructions that were issued prior the MFENCE
> instruction. This serializing operation guarantees that
> every load and store instruction that precedes in
> program order the MFENCE instruction is globally visible
> before any load or store instruction that follows the
> MFENCE instruction is globally visible.
>
> Note the "globally visible".

If you read the whole sentence, you have:
<1> is globally visible before <2> is globally visible. That doesn't
sound to me like it's saying that <1> is globally visible.

I don't think that the above says that instructions that precede MFENCE
are guaranteed to be visible after the MFENCE instruction completes. It
does guarantee, however, that the earlier instructions are visible
before ANY of the following instructions are visible.


<snip>

Andy.

George Neuner

Mar 30, 2010, 7:05:24 AM
to
On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james...@gmail.com>
wrote:

>On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:


>> On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james.ka...@gmail.com>
>> wrote:
>
>> >On Mar 25, 7:10 pm, George Neuner <gneun...@comcast.net> wrote:
>>
>> >> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> >> execute the stores in program order ...
>
>> >Arguably, the original intent was that it should. But it
>> >doesn't, and of course, the ordering guarantee only applies to
>> >variables actually declared volatile.
>
>> "volatile" is quite old ... I'm pretty sure the "intent" was defined
>> before there were OoO CPUs (in de facto use if not in standard
>> document). Regardless, "volatile" only constrains the behavior of the
>> *compiler*.
>
>More or less. Volatile requires the compiler to issue code
>which conforms to what the documentation says it does. It
>requires all accesses to take place after the preceding sequence
>point, and the results of those accesses to be stable before the
>following sequence point. But it leaves it up to the
>implementation to define what is meant by "access", and most
>take a very, very liberal view of it.

Agreed, with the caveat that some ISAs do not give the compiler the
tools to achieve this.


>> The purpose of the fence is to sequence memory accesses.
>
>For a much more rigorous definition of "access" than that used
>by the C++ standard.

Not exactly. I agree that the processor's memory model guarantees
stronger ordering than that of the C++ standard (or almost any
language standard, for that matter), but you are attributing semantics
to "fence" that aren't there.


>> All the fence does is create a checkpoint in the instruction
>> sequence at which relevant load or store instructions
>> dispatched prior to dispatch of the fence instruction will
>> have completed execution.
>
>That's not true for the two architectures whose documentation
>I've studied, Intel and Sparc.

Then you'd better go back and study 8)


>To quote the Intel documentation of MFENCE:
>
> Performs a serializing operation on all load and store
> instructions that were issued prior the MFENCE
> instruction. This serializing operation guarantees that
> every load and store instruction that precedes in
> program order the MFENCE instruction is globally visible
> before any load or store instruction that follows the
> MFENCE instruction is globally visible.

Now look at LFENCE, SFENCE and CLFLUSH and think about why they are
provided separately. Also look at PREFETCH and see what it says about
fences.


Intel provides MFENCE as a heavyweight combination of LFENCE, SFENCE
and CLFLUSH. MFENCE does propagate to memory *because* it flushes the
cache. However the primitive, SFENCE, ensures propagation of writes
only to L2 cache.

Sparc has no single instruction that both fences and flushes the
cache. MEMBAR ensures propagation only to L2 cache. A separate FLUSH
instruction is necessary to ensure propagation to memory.

Sparc also does not have separate load and store fences, but it offers
two variants of MEMBAR which provide differing consistency guarantees.


>Note the "globally visible". Both Intel and Sparc guarantee
>strong ordering within a single core (i.e. a single thread);
>mfence or membar (Sparc) are only necessary if the memory will
>also be "accessed" from a separate unit: a thread running on a
>different core, or memory mapped IO.

Again, you're attributing semantics that aren't there.

For a store to be "globally visible" means that the value must be
visible from outside the core. This requires the value be in *some*
externally visible memory - not *main* memory in particular.
For both x86 and Sparc, this means L2 cache - the first level that can
be snooped off-chip.

For a load "globally visible" means that the value is present at all
levels of the memory hierarchy and cannot be seen differently by an
external observer. This simply follows from the normal operation of
the read pipeline - the value is written into all levels of cache
(more or less) at the same time it is loaded into the core register.

Note also that some CPUs can prefetch data in ways that bypass
externally visible levels of cache. Sparc and x86 (at least since
Pentium III) do not permit this.


>> However, in a memory hierarchy with caching, a store
>> instruction does not guarantee a write to memory but only that
>> one or more write cycles is executed on the core's memory
>> connection bus.
>
>On Intel and Sparc architectures, a store instruction doesn't
>even guarantee that. All it guarantees is that the necessary
>information is somehow passed to the write pipeline. What
>happens after that is anybody's guess.

No. On both of those architectures a store instruction will
eventually cause the value to be written out of the core (except maybe
if a hardware exception occurs). Additionally the source register may be
renamed or the stored value may be forwarded within the core to
rendezvous with a subsequent read of the same location already in the
pipeline ... but these internal flow optimizations don't affect the
externally visible operation of the store instruction.


>> For another thread (or core or CPU) to perceive a change a
>> value must be propagated into shared memory. For all
>> multi-core processors I am aware of, the first shared level of
>> memory is cache - not main memory. Cores on the same die
>> snoop each other's primary caches and share higher level
>> caches. Cores on separate dies in the same package share
>> cache at the secondary or tertiary level.
>
>And on more advanced architectures, there are cores which don't
>share any cache. All of which is irrelevant, since simply
>issuing a store instruction doesn't even guarantee a write to
>the highest level cache, and a membar or a fence instruction
>guarantees access all the way down to the main, shared memory.

Sorry, but no. Even the architectures we've discussed here, x86 and
Sparc, do not satisfy your statement.


There might be architectures I'm unaware of which can elide an
off-core write entirely by rendezvous forwarding and register
renaming, but you haven't named one. I would consider eliding the
store to be a dangerous interpretation of memory semantics and I
suspect I would not be alone.

I'm not familiar with any cached architecture for which fencing alone
guarantees that a store writes all the way to main memory - I know
some that don't even have/need fencing because their on-chip caches
are write-through.


>> The upshot is this:
>> - "volatile" is required for any CPU.
>
>I'm afraid that doesn't follow from anything you've said.
>Particularly because the volatile is largely a no-op on most
>current compilers---it inhibits compiler optimizations, but the
>generated code does nothing to prevent the reordering that
>occurs at the hardware level.

"volatile" is required because the compiler must not reorder or
optimize away the loads or stores.


>> - fences are required for an OoO CPU.
>
>By OoO, I presume you mean "out of order". That's not the only
>source of the problems.

OoO is not the *only* source of the problem. The compiler has little
control over hardware reordering ... fences are blunt instruments that
impact all loads or stores ... not just those to language level
"volatiles".


>> - cache control is required for a write-back cache between
>> CPU and main memory.
>
>The cache is largely irrelevant on Sparc or Intel. The
>processor architectures are designed in a way to make it
>irrelevant. All of the problems would be there even in the
>absence of caching. They're determined by the implementation of
>the write and read pipelines.

That's a naive point of view. For a cached processor, the operation
of the cache and its impact on real programs is *never* "irrelevant".


>James Kanze

George

Herb Sutter

Mar 30, 2010, 7:28:58 AM
to
On Mon, 29 Mar 2010 16:55:44 CST, "Leigh Johnston" <le...@i42.co.uk>
wrote:

>"James Kanze" <james...@gmail.com> wrote in message
>news:36f7e40e-4584-430d...@z3g2000yqz.googlegroups.com...
>>> Performance is often cited as another reason to not use
>>> volatile however the use of volatile can actually help with
>>> multi-threading performance as you can perform a safe
>>> lock-free check before performing a more expensive lock.
>>
>> Again, I'd like to see how. This sounds like the double-checked
>> locking idiom, and that's been proven not to work.
>
>IMO for an OoO CPU the double checked locking pattern can be made to work
>with volatile if fences are also used or the lock also acts as a fence (as
>is the case with VC++/x86). This is also the counter-example you are
>looking for, it should work on some implementations. FWIW VC++ is clever
>enough to make the volatile redundant for this example however adding
>volatile makes no difference to the generated code (read: no performance
>penalty)

Are you sure? On x86 a VC++ volatile write is supposed to be emitted
as xchg, whereas an ordinary write is usually emitted as mov. If the
DCL control variable write is emitted as mov on x86 then DCL won't
work correctly (well, it'll appear to work...).

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)

--

Andy Venikov

Mar 30, 2010, 11:48:34 AM
to
Andy Venikov wrote:
<snip>

> Here's the final code:
>
> struct Node
> {
>     <unspecified> data;
>     Node volatile * next;
> };
> Node volatile * volatile head_;
> Node volatile * volatile tail_;
>
> dequeue()
> {
>     while (true)
>     {
>         Node volatile * localHead = head_;
>         Node volatile * localTail = tail_;
>         DataDependencyBarrier();
>         Node volatile * localNext = localHead->next;
>
>         if (localHead == head_)
>         {
>             ...
>         }
>         ....
>     }
> }
>

Of course I missed the LoadLoad barrier before the if statement...

Leigh Johnston

Mar 30, 2010, 11:47:55 AM
to

"Herb Sutter" <herb....@gmail.com> wrote in message

news:fli2r59v8lf47gav2...@4ax.com...


> On Mon, 29 Mar 2010 16:55:44 CST, "Leigh Johnston" <le...@i42.co.uk>
> wrote:
>>"James Kanze" <james...@gmail.com> wrote in message
>>news:36f7e40e-4584-430d...@z3g2000yqz.googlegroups.com...
>>>> Performance is often cited as another reason to not use
>>>> volatile however the use of volatile can actually help with
>>>> multi-threading performance as you can perform a safe
>>>> lock-free check before performing a more expensive lock.
>>>
>>> Again, I'd like to see how. This sounds like the double-checked
>>> locking idiom, and that's been proven not to work.
>>
>>IMO for an OoO CPU the double checked locking pattern can be made to work
>>with volatile if fences are also used or the lock also acts as a fence (as
>>is the case with VC++/x86). This is also the counter-example you are
>>looking for, it should work on some implementations. FWIW VC++ is clever
>>enough to make the volatile redundant for this example however adding
>>volatile makes no difference to the generated code (read: no performance
>>penalty)
>
> Are you sure? On x86 a VC++ volatile write is supposed to be emitted
> as xchg, whereas an ordinary write is usually emitted as mov. If the
> DCL control variable write is emitted as mov on x86 then DCL won't
> work correctly (well, it'll appear to work...).
>

Yes, on x86 VC++ (VC9) emits a MOV for a volatile write; however, entering
the critical section in the DCL should act as a fence, so it should work. I
asked this question (about VC++ volatile not emitting fences) in
microsoft.public.vc.language but didn't get a satisfactory reply.

/Leigh

Leigh Johnston

Mar 30, 2010, 1:29:19 PM
to

"Leigh Johnston" <le...@i42.co.uk> wrote in message
news:MqidnVXvpZ-ngSzW...@giganews.com...
<snip>


> IMO for an OoO CPU the double checked locking pattern can be made to work
> with volatile if fences are also used or the lock also acts as a fence (as
> is the case with VC++/x86). This is also the counter-example you are
> looking for, it should work on some implementations. FWIW VC++ is clever
> enough to make the volatile redundant for this example however adding
> volatile makes no difference to the generated code (read: no performance
> penalty) and I like making such things explicit similar to how one uses
> const (doesn't affect the generated output but documents the programmer's
> intentions). Which is better: use volatile if there is no noticeable
> performance penalty or constantly check your compiler's generated
> assembler
> to check the optimizer is not breaking things? The only volatile in my
> entire codebase is for the "status" of my "threadable" base class and I
> don't always acquire a lock before checking this status and I don't fully
> trust that the optimizer won't cache it for all cases that might crop up
> as
> I develop code. BTW I try and avoid singletons too so I haven't found the
> need to use the double checked locking pattern AFAICR.
>

In case I was unclear: obviously using volatile can affect generated output
and performance in general but in the specific example of DCL that I tried
on VC++ (VC9) for x86 there was no difference.

George Neuner

Mar 30, 2010, 1:30:33 PM
to
On Tue, 30 Mar 2010 05:05:24 CST, George Neuner <gneu...@comcast.net>
wrote:

>Sparc also does not have separate load and store fences,

Whoops! Brain freeze. I forgot that Sparc does have separate load
and store fences, specified by parameters to MEMBAR.

That'll teach me to post when I'm tired.

James Kanze

Mar 30, 2010, 6:37:59 PM
to
On Mar 30, 12:04 pm, Andy Venikov <swojchelo...@gmail.com> wrote:

[...]


>> Of course, for another thread to be guaranteed to see the results
>> of any store, it has to use a load fence, to ensure that the
>> values it sees are those after the load fence, and not some
>> value that it happened to pick up earlier.

> What I meant was that memory fence doesn't mean that the
> effects of a write will be immediately flushed to the main
> memory or effects of a read immediately read from the main
> memory.

Not meaning to be impolite or anything, but what you meant or
mean isn't really that relevant. The Intel specification says
that mfence guarantees global visibility. And if you're
programming on an Intel, that's the only definition that is
relevant.

> Generally, memory fence is merely a checkpoint to tell the
> processor not to reorder instructions around the fence.

Again, a fence will prevent reordering, but only as a
consequence of its fundamental requirements.

(I keep seeing mention here of instruction reordering. In the
end, instruction reordering is irrelevant. It's only one thing
that may lead to reads and writes being reordered. And what
mfence guarantees is strong memory---not just
instruction---ordering around it.)

> I don't remember what processor docs I've read (I believe it
> was Itanium) but here's for example what the docs said about a
> store fence: a store barrier would make sure that all the
> stores appearing before a fence would be stored in the
> write-queue before any of the stores that follow the fence.

That would be a very strange definition, since it would mean
that a store barrier would be useless, and that there would
never be a reason for using one. The Intel IA-32 documentation
says very clearly that all preceding writes will be globally
visible; the Sparc architecture specifications say basically the
same thing for a membar.

> In no way you're guaranteed that any of the stores are in main
> memory after the fence instruction was executed.

That's not the case for IA-32, nor for Sparc.

> For that you'd have to use a flush instruction.

I suppose a machine could require two instructions to achieve a
true fence, but it seems like a very awkward way of doing
things.

James Kanze

Mar 30, 2010, 6:38:26 PM
to
On Mar 30, 12:04 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
> James Kanze wrote:
>> On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:
> <snip>
>>> All the fence does is create a checkpoint in the
>>> instruction sequence at which relevant load or store
>>> instructions dispatched prior to dispatch of the fence
>>> instruction will have completed execution.

>> That's not true for the two architectures whose
>> documentation I've studied, Intel and Sparc. To quote the
>> Intel documentation of MFENCE:

>> Performs a serializing operation on all load and store
>> instructions that were issued prior the MFENCE
>> instruction. This serializing operation guarantees that
>> every load and store instruction that precedes in
>> program order the MFENCE instruction is globally visible
>> before any load or store instruction that follows the
>> MFENCE instruction is globally visible.

>> Note the "globally visible".

> If you read the whole sentence, you have: <1> is globally
> visible before <2> is globally visible. That doesn't sound to
> me as saying that <1> is globally visible.

It guarantees that if <1> is not globally visible, then neither
will <2> be. I don't know what more you might want.

> I don't think that the above says that instructions that
> precede MFENCE are guaranteed to be visible after the MFENCE
> instruction completes.

I don't think that "instruction completes" has any real meaning
in a modern processor. Arguably, a store instruction doesn't
complete until the data is stable in main memory. But that
doesn't stop the processor from going on and doing other things.
In practice, the ordering is all that is relevant anyway.

> It does guarantee, however, that the earlier instructions are
> visible before ANY of the following instructions are visible.

Which is all that is required.

--
James Kanze

James Kanze

Mar 30, 2010, 6:39:33 PM
to
On Mar 30, 12:05 pm, George Neuner <gneun...@comcast.net> wrote:
> On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james.ka...@gmail.com>

> wrote:
>> On Mar 28, 10:05 pm, George Neuner <gneun...@comcast.net> wrote:

[...]


>>> The purpose of the fence is to sequence memory accesses.

>> For a much more rigorous definition of "access" than that used
>> by the C++ standard.

> Not exactly. I agree that the processor's memory model
> guarantees stronger ordering than that of the C++ standard (or
> almost any language standard, for that matter), but you are
> attributing semantics to "fence" that aren't there.

I'm not attributing anything. I'm just quoting the
documentation.

>>> All the fence does is create a checkpoint in the instruction
>>> sequence at which relevant load or store instructions
>>> dispatched prior to dispatch of the fence instruction will
>>> have completed execution.

>> That's not true for the two architectures whose documentation
>> I've studied, Intel and Sparc.

> Then you'd better go back and study 8)

I've quoted Intel. Regretfully, the Sparc site was down when I
tried to access it, but I've studied their documentation fairly
intensely, and it basically guarantees the same thing.

>> To quote the Intel documentation of MFENCE:

>> Performs a serializing operation on all load and store
>> instructions that were issued prior the MFENCE
>> instruction. This serializing operation guarantees that
>> every load and store instruction that precedes in
>> program order the MFENCE instruction is globally visible
>> before any load or store instruction that follows the
>> MFENCE instruction is globally visible.

> Now look at LFENCE, SFENCE and CLFUSH and think about why they
> are provided separately. Also look at PREFETCH and see what
> it says about fences.

There are many types of fences. Obviously, in any given case,
you should use the one which provides the guarantees you need.

> Intel provides MFENCE as a heavyweight combination of LFENCE,
> SFENCE and CLFLUSH. MFENCE does propagate to memory *because*
> it flushes the cache. However the primitive, SFENCE, ensures
> propagation of writes only to L2 cache.

So what use is it, then?

> Sparc has no single instruction that both fences and flushes
> the cache. MEMBAR ensures propagation only to L2 cache. A
> separate FLUSH instruction is necessary to ensure propagation
> to memory.

That's not what it says in the Sparc Architecture Specification.
(Levels of cache are never mentionned; the architecture allows
an implementation with any number of levels.)

> Sparc also does not have separate load and store fences, but
> it offers two variants of MEMBAR which provide differing
> consistency guarantees.

There is only one Membar instruction, with a 4-bit mask to
control the barriers: LOADLOAD, LOADSTORE, STORELOAD and
STORESTORE. (There are some other bits to control other
functionality, but they are irrelevant with regards to memory
synchronization in a multithreaded environment.)

>> Note the "globally visible". Both Intel and Sparc guarantee
>> strong ordering within a single core (i.e. a single thread);
>> mfence or membar (Sparc) are only necessary if the memory will
>> also be "accessed" from a separate unit: a thread running on a
>> different core, or memory mapped IO.

> Again, you're attributing semantics that aren't there.

I just quoted the documentation. What part of "globally
visible" don't you understand?

> For a store to be "globally visible" means that the value must
> be visible from outside the core. This requires the value be
> in *some* externally visible memory - not *main* memory in
> particular. For both x86 and Sparc, this means L2 cache - the
> first level that can be snooped off-chip.

That's an original definition of "global".

> For a load "globally visible" means that the value is present
> at all levels of the memory hierarchy and cannot be seen
> differently by an external observer. This simply follows from
> the normal operation of the read pipeline - the value is
> written into all levels of cache (more or less) at the same
> time it is loaded into the core register.

> Note also that some CPUs can prefetch data in ways that bypass
> externally visible levels of cache. Sparc and x86 (at least
> since Pentium III) do not permit this.

Sparc certainly does allow it (at least according to the Sparc
Architecture Specification), and I believe some of the newer
Intel do as well.

>>> However, in a memory hierarchy with caching, a store
>>> instruction does not guarantee a write to memory but only that
>>> one or more write cycles is executed on the core's memory
>>> connection bus.

>> On Intel and Sparc architectures, a store instruction doesn't
>> even guarantee that. All it guarantees is that the necessary
>> information is somehow passed to the write pipeline. What
>> happens after that is anybody's guess.

> No. On both of those architectures a store instruction will
> eventually cause the value to be written out of the core
> (except maybe if a hardware exception occurs).

Not on a Sparc. At least not according to the Sparc
Architecture Specification. Practically speaking I doubt that
this is guaranteed for any modern architecture, given the
performance implications.

> Additionally the source register may renamed or the stored
> value may be forwarded within the core to rendezvous with a
> subsequent read of the same location already in the pipeline
> ... but these internal flow optimizations don't affect the
> externally visible operation of the store instruction.

As long as there is only a single store instruction to a given
location, that store will eventually percolate out to the main
memory. If there are several, it's quite possible that some of
them will never appear outside the processor.

>>> For another thread (or core or CPU) to perceive a change a
>>> value must be propagated into shared memory. For all
>>> multi-core processors I am aware of, the first shared level of
>>> memory is cache - not main memory. Cores on the same die
>>> snoop each other's primary caches and share higher level
>>> caches. Cores on separate dies in the same package share
>>> cache at the secondary or tertiary level.

>> And on more advanced architectures, there are cores which
>> don't share any cache. All of which is irrelevant, since
>> simply issuing a store instruction doesn't even guarantee a
>> write to the highest level cache, and a membar or a fence
>> instruction guarantees access all the way down to the main,
>> shared memory.

> Sorry, but no. Even the architectures we've discussed here, x86 and
> Sparc, do not satisfy your statement.

I quoted the specification from Intel for the x86. The Sparc
site was down, and my copy of the Sparc Architecture
Specification is on a machine in France, so I'm sorry, I can't
quote it here. But I do know what it says. And a membar
instruction does guarantee strong ordering.

> There might be architectures I'm unaware of which can elide an
> off-core write entirely by rendezvous forwarding and register
> renaming, but you haven't named one. I would consider eliding
> the store to be a dangerous interpretation of memory semantics
> and I suspect I would not be alone.

Dangerous or not, no processor can afford to neglect this
important optimization opportunity. And it causes no problems
in single threaded programs, nor in multithreaded programs which
use proper synchronization methods.

> I'm not familiar with any cached architecture for which
> fencing alone guarantees that a store writes all the way to
> main memory - I know some that don't even have/need fencing
> because their on-chip caches are write-through.

I just pointed one out. By quoting the manufacturer's
specifications for the mfence instruction. If I were on my Unix
machine in Paris, I could equally quote similar text for the
Sparc.

>>> The upshot is this:
>>> - "volatile" is required for any CPU.

>> I'm afraid that doesn't follow from anything you've said.
>> Particularly because the volatile is largely a no-op on most
>> current compilers---it inhibits compiler optimizations, but the
>> generated code does nothing to prevent the reordering that
>> occurs at the hardware level.

> "volatile" is required because the compiler must not reorder
> or optimize away the loads or stores.

Which loads and stores? The presence of a fence (or inline
assembler, or specific system or library calls) guarantee that
the compiler cannot reorder around it. And whether the compiler
reorders or suppresses elsewhere is irrelevant, since the
hardware can do it regardless of the code the compiler
generates.
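For what it's worth, with g++-style inline assembler the usual idioms
look something like this (a sketch; the macro names are invented):

    // pure compiler barrier: emits no instruction, but the "memory"
    // clobber forbids the compiler from caching values in registers
    // across this point or moving loads/stores over it
    #define COMPILER_BARRIER() asm volatile ("" : : : "memory")

    // hardware fence which is also a compiler barrier, on x86
    #define HARDWARE_FENCE() asm volatile ("mfence" : : : "memory")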

>>> - fences are required for an OoO CPU.

>> By OoO, I presume you mean "out of order". That's not the only
>> source of the problems.

> OoO is not the *only* source of the problem. The compiler has
> little control over hardware reordering ... fences are blunt
> instruments that impact all loads or stores ... not just those
> to language level "volatiles".

Agreed. Volatile has different semantics (at least that was the
intent). See Herb Sutter's comments else thread.

>>> - cache control is required for a write-back cache between
>>> CPU and main memory.

>> The cache is largely irrelevant on Sparc or Intel. The
>> processor architectures are designed in a way to make it
>> irrelevant. All of the problems would be there even in the
>> absence of caching. They're determined by the implementation of
>> the write and read pipelines.

> That's a naive point of view. For a cached processor, the
> operation of the cache and its impact on real programs is
> *never* "irrelevant".

I was speaking uniquely in the context of threading. The
operation of the cache is very relevant with regards to
performance, for example.

--
James Kanze

James Kanze
Mar 30, 2010, 6:36:53 PM
On Mar 29, 11:55 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in message

> news:36f7e40e-4584-430d...@z3g2000yqz.googlegroups.com...
> <snip>

>>> Performance is often cited as another reason to not use
>>> volatile however the use of volatile can actually help with
>>> multi-threading performance as you can perform a safe
>>> lock-free check before performing a more expensive lock.

>> Again, I'd like to see how. This sounds like the
>> double-checked locking idiom, and that's been proven not to
>> work.

> IMO for an OoO CPU the double checked locking pattern can be
> made to work with volatile if fences are also used or the lock
> also acts as a fence (as is the case with VC++/x86).

Double checked locking can be made to work if you introduce
inline assembler or use some other technique to insert a fence
or a membar instruction in the appropriate places. But of
course, then, the volatile becomes superfluous.
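Schematically, something like this (a sketch only; Widget, Mutex, Lock
and the fence macros are placeholders for whatever the platform
actually provides):

    Widget* instance_ = 0;    // note: no volatile
    Mutex mutex_;

    Widget* getInstance()
    {
        Widget* p = instance_;
        if (p == 0) {
            Lock lock(mutex_);
            p = instance_;
            if (p == 0) {
                p = new Widget;
                StoreStoreFence();  // construction must be globally
                instance_ = p;      // visible before the pointer is
            }                       // published
        } else {
            LoadLoadFence();        // reads through p must not be
        }                           // satisfied before the read of
        return p;                   // instance_ above
    }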

> This is also the counter-example you are looking for, it
> should work on some implementations.

It's certainly not an example of a sensible use of volatile,
since without the membar/fence, the algorithm doesn't work (at
least on most modern processors, which are multicore). And with
the membar/fence, the volatile is superfluous.

> FWIW VC++ is clever enough to make the volatile redundant for
> this example however adding volatile makes no difference to
> the generated code (read: no performance penalty) and I like
> making such things explicit similar to how one uses const
> (doesn't affect the generated output but documents the
> programmer's intentions).

The use of a fence or membar (or some system specific "atomic"
access) would make the intent explicit. The use of volatile
suggests something completely different (memory mapped IO, or
some such).

> Which is better: use volatile if there is no noticeable
> performance penalty or constantly check your compiler's
> generated assembler to check the optimizer is not breaking
> things?

The reason there is no performance penalty is because volatile
doesn't do the job. And you don't have to check the generated
assembler for anything (unless you suspect a compiler error);
you check the guarantees given by the compiler.

OK: I'll admit that finding such specifications is very, very
difficult. But they should exist, and they'll give you guarantees
with regards to future releases as well. And there are some
guarantees that are expressed indirectly: if a compiler claims
Posix conformance, and supports multithreading, then you get the
guarantees from the Posix standard; the issue is a bit less
clear under Windows, but if a compiler claims to support
multithreading, then it should conform to the Windows
conventions about this.

> The only volatile in my entire codebase is for the "status" of
> my "threadable" base class and I don't always acquire a lock
> before checking this status and I don't fully trust that the
> optimizer won't cache it for all cases that might crop up as I
> develop code.

I'd have to see the exact code to be sure, but I'd guess that
without an mfence somewhere in there, the code won't work on a
multicore machine (which is just about everything today), and
with the mfence, the volatile isn't necessary.

Also, at least under Solaris, if there is no contention, the
execution time of pthread_mutex_lock is practically the same as
that of membar. Although I've never actually measured it, I
suspect that the same is true if you use CriticalSection (and
not Mutex) under Windows.

> BTW I try and avoid singletons too so I haven't found the need
> to use the double checked locking pattern AFAICR.

Double checked locking is a pattern which can be applied to many
things, not just to singletons.

--
James Kanze

George Neuner
Mar 30, 2010, 6:36:18 PM
On Tue, 30 Mar 2010 05:04:48 CST, Andy Venikov
<swojch...@gmail.com> wrote:

> James Kanze wrote:
>
>> To quote the Intel documentation of MFENCE:
>>
>> Performs a serializing operation on all load and store
>> instructions that were issued prior the MFENCE
>> instruction. This serializing operation guarantees that
>> every load and store instruction that precedes in
>> program order the MFENCE instruction is globally visible
>> before any load or store instruction that follows the
>> MFENCE instruction is globally visible.
>>
>> Note the "globally visible".
>
> If you read the whole sentence, you have:
> <1> is globally visible before <2> is globally visible. That doesn't
> sound to me as saying that <1> is globally visible.
>
> I don't think that the above says that instructions that precede MFENCE
> are guaranteed to be visible after the MFENCE instruction completes. It
> does guarantee, however, that the earlier instructions are visible
> before ANY of the following instructions are visible.

It reads a bit funny, but it is correct.

MFENCE divides the total set of loads and stores in process into the
group dispatched before the fence instruction and the group dispatched
after. At completion of the fence instruction, it is guaranteed that
all of the before group will have been completed AND none of the after
group will be completed.

See LFENCE and SFENCE for more details. MFENCE combines their effects
and adds a cache flush as well.
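For comparison, the C++0x way of expressing the same before/after
split (a sketch, assuming a <atomic> implementation is available):

    #include <atomic>

    int payload;                     // ordinary data
    std::atomic<bool> ready(false);

    void producer()
    {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer()
    {
        while (!ready.load(std::memory_order_relaxed))
            ;                        // spin
        std::atomic_thread_fence(std::memory_order_acquire);
        // payload is guaranteed to be 42 here
    }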

George

James Kanze
Mar 30, 2010, 6:42:28 PM
On Mar 30, 12:03 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
> Herb Sutter wrote:

[...]


> So, with the above said, here's a concrete example of how I'd
> use volatile without an access to a ready-made library. Let's
> take Maged Michael's lock-free queue ("Simple, Fast, and
> Practical Non-Blocking and Blocking Concurrent Queue
> Algorithms", Maged Michael & Michael Scott; 1996). It uses a
> technique similar to DCL to verify the validity of a read.
> Look into its dequeue() method.
> I'll provide the pseudo code here:

> dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
> D1:  loop                             # Keep trying until Dequeue is done
> D2:      head = Q->Head               # Read Head
> D3:      tail = Q->Tail               # Read Tail
> D4:      next = head->next            # Read Head.ptr->next
> D5:      if head == Q->Head           # Are head, tail, and next consistent?
> D6:          if head.ptr == tail.ptr  # Is queue empty or Tail falling behind?
> D7:              if next.ptr == NULL  # Is queue empty?
> D8:                  return FALSE     # Queue is empty, couldn't dequeue
> D9:              endif
>                  # Tail is falling behind. Try to advance it
> D10:             CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
> D11:         else                     # No need to deal with Tail
>                  # Read value before CAS, otherwise another dequeue
>                  # might free the next node
> D12:             *pvalue = next.ptr->value
>                  # Try to swing Head to the next node
> D13:             if CAS(&Q->Head, head, <next.ptr, head.count+1>)
> D14:                 break            # Dequeue is done. Exit loop
> D15:             endif
> D16:         endif
> D17:     endif
> D18: endloop
> D19: free(head.ptr)                   # It is safe now to free the old dummy node
> D20: return TRUE                      # Queue was not empty, dequeue succeeded

> Look at line D5: it needs to check if Q->Head is still the
> same as what we read from it before. Otherwise two
> possibilities for breaking the correctness arise: 1) it would
> be possible for the element pointed to by Q->Head to be
> re-inserted back into the queue with NULL in the "next" and
> [...] was never empty in any given moment; or 2) The first element
> [...]

First, I very much doubt that the LoadLoad barrier can expand to
nothing if the code is to work. It certainly cannot on a Sparc,
and I rather doubt that it can on an Intel; I'd have to see the
guarantees in writing to believe otherwise. And second, if a
compiler moves code across a barrier, it is broken, and there's
not much you can do about it.

> No matter how many or what type of memory barriers you insert,
> the compiler will be allowed to omit the if statement.

Really? According to whom? Obviously, the ISO standard says
nothing about this case; the presence of the barrier introduces
undefined behavior, and the compiler might do anything, as far
as ISO C++ is concerned. But you're obviously depending on
other standards as well. Standards which specify what the
barrier guarantees, for example. And these standards apply to
the compiler as well.

> The ONLY way to force the compiler (any compiler for that
> matter) to generate it is to declare head_ as volatile.

No. The way to force the compiler to generate the necessary
accesses is to use the same implementation specific guarantees
you count on for the barrier to work.

> Here's the final code:
> struct Node
> {
>     <unspecified> data;
>     Node volatile * pNext;
> };

> Node volatile * volatile head_;
> Node volatile * volatile tail_;

> dequeue()
> {
>     while (true)
>     {
>         Node volatile * localHead = head_;
>         Node volatile * localTail = tail_;
>         DataDependencyBarrier();
>         Node volatile * localNext = localHead->pNext;
>
>         if (localHead == head_)
>         {
>             ...
>         }
>         ...
>     }
> }

> Now this code will produce the intended correct object code on
> all the compilers I've listed above and on at least these
> CPUs: x86, itanium, mips, PowerPC (assuming that all the
> MemoryBarriers have been defined for all the platforms). And
> without any modifications to the above code. How's that for
> portability?

Yes, but it's just as portable without the volatile. If the
barriers are defined correctly, the compiler will not move code
over them.
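And with C++0x atomics the whole question disappears; the head check
might look like this (a sketch; the payload type is invented):

    #include <atomic>

    struct Node
    {
        int   value;   // stand-in for the real payload
        Node* next;
    };

    std::atomic<Node*> head_;

    bool headStillValid(Node*& localHead, Node*& localNext)
    {
        localHead = head_.load(std::memory_order_acquire);
        localNext = localHead->next;
        // the second load cannot be elided or moved by the compiler:
        // it is an atomic operation, not an ordinary memory access
        return localHead == head_.load(std::memory_order_acquire);
    }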

--
James Kanze

Leigh Johnston
Mar 30, 2010, 11:14:01 PM

"James Kanze" <james...@gmail.com> wrote in message news:da63ca83-4d6e-416a...@10g2000yqq.googlegroups.com...

<snip>

> Double checked locking can be made to work if you introduce
> inline assembler or use some other technique to insert a fence
> or a membar instruction in the appropriate places. But of
> course, then, the volatile becomes superfluous.

It is only superfluous if there is a compiler guarantee that a load/store for a non-volatile variable is emitted in the presence of a fence, which sounds like a dubious guarantee to me. Which compilers stop performing optimizations in the presence of a
fence, and/or how does the compiler know which variable accesses can be optimized in the presence of a fence?

>
>> This is also the counter-example you are looking for, it
>> should work on some implementations.
>
> It's certainly not an example of a sensible use of volatile,
> since without the membar/fence, the algorithm doesn't work (at
> least on most modern processors, which are multicore). And with
> the membar/fence, the volatile is superfluous, and not needed.

Read what I said above.

>> FWIW VC++ is clever enough to make the volatile redundant for
>> this example however adding volatile makes no difference to
>> the generated code (read: no performance penalty) and I like
>> making such things explicit similar to how one uses const
>> (doesn't affect the generated output but documents the
>> programmer's intentions).
>
> The use of a fence or membar (or some system specific "atomic"
> access) would make the intent explicit. The use of volatile
> suggests something completely different (memory mapped IO, or
> some such).

Obviously we disagree on this point hence the reason for the existence of this argument we are having.

<snip>

>> The only volatile in my entire codebase is for the "status" of
>> my "threadable" base class and I don't always acquire a lock
>> before checking this status and I don't fully trust that the
>> optimizer won't cache it for all cases that might crop up as I
>> develop code.
>
> I'd have to see the exact code to be sure, but I'd guess that
> without an mfence somewhere in there, the code won't work on a
> multicore machine (which is just about everything today), and
> with the mfence, the volatile isn't necessary.

The code does work on a multi-core machine and I am confident it will continue to work when I write new code, precisely because I am using volatile and am therefore guaranteed that a load will be emitted, not optimized away.

> Also, at least under Solaris, if there is no contention, the
> execution time of pthread_mutex_lock is practically the same as
> that of membar. Although I've never actually measured it, I
> suspect that the same is true if you use CriticalSection (and
> not Mutex) under Windows.

Critical sections are expensive compared to a simple load, which is guaranteed by using volatile. It is not always necessary to use a fence, as all a fence does is guarantee order, so it all depends on the use-case.

>
>> BTW I try and avoid singletons too so I haven't found the need
>> to use the double checked locking pattern AFAICR.
>
> Double checked locking is a pattern which can be applied to many
> things, not just to singletons.
>

I never said otherwise, singletons are an obvious example use though and this thread was originally about singletons.

/Leigh

Herb Sutter
Mar 30, 2010, 11:15:33 PM
On Tue, 30 Mar 2010 05:03:11 CST, Andy Venikov
<swojch...@gmail.com> wrote:
>Herb Sutter wrote:
>> Please remember this: Standard ISO C/C++ volatile is useless for
>> multithreaded programming. No argument otherwise holds water; at best
>> the code may appear to work on some compilers/platforms, including all
>> attempted counterexamples I've seen on this thread.
>
>You have an enormous clout on C++ professionals, including myself, so
>before permanently agreeing to such an all-encompassing statement allow
>me to maybe step back a little and see what it is that's at the core of
>this argument. Maybe we're arguing the same point. Or maybe I'm missing
>something big in which case I'll be doubly glad to have been shown my
>wrong assumptions.

Short answer: Note I deliberately said "Standard" above -- the above
statement is completely true for portable usage. You may get away with
it on some platforms today, but it's nonportable and even the
getting-away won't last.

Slightly longer answer follows:

>I understand that volatile never was supposed to be of any help for
>multithreaded programming. I don't expect it to issue any memory fences
> nor make any guarantees whatsoever about anything thread-related...

Yes, and that's why it can't reliably be used for inter-thread
communication == synchronization.

>Yet, on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it
>produces just the code I need for my multithreaded programs. And I
>really don't see how it wouldn't, given common-sense understanding of
>what it should do in single-threaded programs. And I'm pretty sure that
>it's not going to change in a foreseeable future.
>
>So my use of volatile maybe not standard-portable, but it sure is
>real-life portable.

It's like relying on undefined behavior. UB may happen to do what you
expected, most of the time, on your current compiler and platform.
That doesn't mean it's correct or portable, and it will be less and
less real-life portable on multi-core systems.

Because there was no better hook, volatile was strengthened (in
non-standard ways) on various systems. For example, on MS VC++ prior
to VC++ 2005 (I think), volatile had no ordering semantics at all, but
people thought it was used for inter-thread communications because the
Windows InterlockedXxxx APIs happened to take a volatile variable. But
that was just using volatile as a type system tag to help you not
accidentally pass a plain variable, and a little bit to leverage the
lack of optimizations on volatile -- the real reason it worked was
because you were calling the InterlockedXxx APIs because *those* are
correctly synchronized for lock-free coding.

Even now in VC++ 2005 and later, when volatile was strengthened so
that reads and writes are (almost) SC, to get fully SC lock-free code
in all cases you still have to use the InterlockedXxx APIs rather than
direct reads and writes of the volatile variable. The strengthened
volatile semantics makes that, on that compiler and when targeting
x86/x64, using direct reads and writes is enough to make most examples
like DCL work, but it isn't enough to make examples like Dekker's work
-- for Dekker's to work correctly you still have to use the
InterlockedXxx APIs.


>Here's the point of view I'm coming from.
>Imagine that someone needs to implement a library that provides certain
>multithreading (multiprogramming) tools like atomic access,
>synchronization primitives and some lock-free algorithms that will be
>used by other developers so that they wouldn't have to worry about
>things like volatile. (Now that boost.atomic is almost out, I'll happily
>use it.

Important note: Using std::atomic<> is exactly the correct answer!

The only caveat is that it's not yet widely available, but this year
we're getting over the hump of wide availability thanks to Boost and
others.

>But Helge Bahmann (the author of the library) didn't have such a

Isn't it Anthony Williams who's doing Boost's atomic<> implementation?
Hmm.

>luxury, so to make his higher-level APIs work he had to internally
>resort to low-level tools like volatiles where appropriate.)

Of course, sure. The implementation of std::atomic<> on any given
platform needs to use platform-specific tools, including things like
explicit fences/membars (e.g., mf+st.rel on IA64), ordered APIs (e.g,.
InterlockedIncrement on Windows), and/or other nonstandard and
nonportable goo (e.g., platform-specific variants of volatile).

The implementation of any standard feature typically will internally
use nonstandard system-specific features. That's the standard
feature's purpose, to shield users from those details and make this
particular system do the right particular thing.

[...]


>Look at line D5: it needs to check if Q->Head is still the same as what
>we read from it before. Otherwise two possibilities for breaking the
>correctness arise: 1) it would be possible for the element pointed to by

[...]


>This piece of pseudo code could be naively translated to a following c++
>code:
>
>while (true)
>{
>    Node * localHead = head_;
>    Node * localTail = tail_;
>    Node * localNext = localHead->next;
>    if (localHead == head_)
>    {
>        ...
>    }
>}
>
>But it wouldn't work for the obvious reasons.
>One needs to insert MemoryFences in the right places.

[...]

Fences are evil. Nearly nobody can use them consistently correctly,
including people who have years of experience with them. Those people
(write once, and from then on) use the Linux atomics package or C++0x
std::atomic.

Every mutable shared object should be protected by a mutex (99.9%
case) or be atomic (0.1% case).

If you're going to write lock-free code, it's really, really, really
important to just make the shared variables be C++0x std::atomic<> (or
equivalently Java or .NET volatile, which isn't the same thing as ISO
C and C++ volatile). If you do, you won't have to reason about where
the fences need to go. Reasoning about where the fences need to go is
such a futile and error-prone job that most lock-free papers don't
even try to say where to put them and just assume SC execution.
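(E.g., a sketch of the head/tail reads from the earlier fragment, with
the defaults doing all the work:)

    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;

    // inside dequeue():
    while (true)
    {
        Node* localHead = head_.load();  // memory_order_seq_cst by default
        Node* localTail = tail_.load();
        Node* localNext = localHead->next;
        if (localHead == head_.load())
        {
            // ... no volatile, and no hand-placed fences to reason about
        }
        // ...
    }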


>Here's the final code:

I apologize for not having time to read your transformations of
Maged's code closely, but in all of the following, why is the volatile
on the Node, not on the pointer? Even if volatile did all the magic
you want it to do (like Java/.NET volatile), that's broken because
it's in the wrong place, isn't it? Of course, the usual manifestation
of the problem is that the code will compile, run, and appear to
work...

>struct Node
>{
> <unspecified> data;
> Node volatile * pNext;
>};
>Node volatile * volatile head_;
>Node volatile * volatile tail_;
>
>dequeue()
>{
> while (true)
> {
> Node volatile * localHead = head_;
> Node volatile * localTail = tail_;
> DataDependencyBarrier();
> Node volatile * localNext = localHead->pNext;
>
> if (localHead == head_)
> {
> ...
> }
>....
>}
>
>
>Now this code will produce the intended correct object code on all the
>compilers I've listed above and on at least these CPUs: x86, itanium,
>mips, PowerPC (assuming that all the MemoryBarriers have been defined
>for all the platforms). And without any modifications to the above code.
>How's that for portability?

Without even read the code logic and looking for races, I doubt it.

For a detailed analysis of multiple lock-free implementations of a
similar queue example, including an exceedingly rare race that even
under sustained heavy stress on a 24-core system only manifested once
every tens of millions of insertions, see:

Measuring Parallel Performance: Optimizing a Concurrent Queue
http://www.drdobbs.com/high-performance-computing/212201163


>Now, after writing all this, I realize that I could've used a simpler
>example - a simple Peterson's algorithm for two threads wouldn't work
>without a use of a volatile: the "turn" variable is assigned the same
>value as it's being compared to later, so the compiler will omit the "if
>turn == x" part in the if statement.

Actually, Dekker's/Peterson's is broken even with VC++ 2008
heavily-strengthened volatile. (Sorry.) To make it correct you have to
store to the flag variable using InterlockedExchange() or similar, not
using a simple write to the flag variable.

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


Herb Sutter
Mar 30, 2010, 11:14:10 PM
On Mon, 29 Mar 2010 20:22:54 CST, "Leigh Johnston" <le...@i42.co.uk>
wrote:

>"Herb Sutter" <herb....@gmail.com> wrote in message news:j462r55lr984u1vg0...@4ax.com...
><snip>
>>>On Mar 29, 7:45 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>>> Performance is often cited as another reason to not use
>>>> volatile however
>>
>> No "however" needed, that cited reason is correct. Volatile disables
>> optimizations that would be legal for an atomic<>. A quick example is
>> combining/eliding writes (e.g., v = 1; v = 2; can't be transformed to
>> v = 2;, but a = 1; a = 2; can be transformed to a = 2;). Another is
>> combining/eliding reads.
>
>If atomic reads can be elided won't that be problematic for using atomics with the double checked locking pattern (so we are back to using volatile atomics)?

No, because it's only adjacent atomic reads (i.e., separated only by
ordinary memory operations) that can be combined. In DCL there's a
lock acquire operation in between, and the atomic reads can't be
reordered across that to make them adjacent, so they can't be
combined.
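I.e. (a sketch; the names are invented):

    std::atomic<int> flag;   // shared
    std::mutex m;

    int r1 = flag.load();    // atomic read
    int r2 = flag.load();    // adjacent atomic read: may legally be
                             // combined with the one above
    m.lock();                // acquire operation
    int r3 = flag.load();    // cannot be merged with r1/r2: the read
                             // cannot move backwards across the acquire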

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


Herb Sutter
Mar 30, 2010, 11:16:30 PM
On Tue, 30 Mar 2010 09:47:55 CST, "Leigh Johnston" <le...@i42.co.uk>
wrote:

>>>IMO for an OoO CPU the double checked locking pattern can be made to work
>>>with volatile if fences are also used or the lock also acts as a fence (as
>>>is the case with VC++/x86). This is also the counter-example you are
>>>looking for, it should work on some implementations. FWIW VC++ is clever
>>>enough to make the volatile redundant for this example however adding
>>>volatile makes no difference to the generated code (read: no performance
>>>penalty)
>>
>> Are you sure? On x86 a VC++ volatile write is supposed to be emitted
>> as xchg, whereas an ordinary write is usually emitted as mov. If the
>> DCL control variable write is emitted as mov on x86 then DCL won't
>> work correctly (well, it'll appear to work...).
>
>Yes on x86 VC++ (VC9) emits a MOV for a volatile write however entering the
>critical section in the DCL should act as a fence so it should work. I
>asked this question (about VC++ volatile not emitting fences) in
>microsoft.public.vc.language but didn't get a satisfactory reply.

Ah yes, my thinko. I was momentarily rusty on the outcome of internal
design discussions a year or three ago.

Let me try again:

Yes, we (VC++ since VC++ 2005 I think) do emit a plain MOV for a
volatile write. DCL works with a plain MOV on x86/x64 because
x86/x64's strong memory model combined with VC++ 2005+'s restrictions
on reorderings around volatile reads/writes is just enough to do the
trick.

However, that's not enough to make Dekker's work. To make Dekker's
work correctly today on our platform, you have to perform the write to
the flag variable using an InterlockedXxx() call of some sort, not as
a direct write to a volatile flag variable.

If we wanted native volatile to be suitable for general lock-free
coding including Dekker's using only direct reads and writes of the
volatile variable (i.e., without InterlockedXxx API calls), however,
we would have to emit writes as XCHG, not MOV (and some other things).
We don't do that today, we discussed doing it, and the current
decision is that we do not intend to do it -- it would penalize any
existing uses of volatile, and the right tool for lock-free coding is
atomic<>. (Which we also need to ship and isn't in VC 2010, sorry, but
we're working on it next.)

FWIW, the issue is that VC++ volatile is load/acquire and
store/release, but that still allows one reordering -- store-load
reordering, which kills Dekker's. Emitting the store as xchg would
prevent store-load reordering, hence would fix Dekker's. But the right
way to fix Dekker's on our platform, again, is: for now, use
InterlockedXxx APIs for lock-free code; and when it's available, use
atomic<> (Boost's or eventually ours) for lock-free code.
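(For the record, the fragment in question, sketched; enterCriticalWork
is a stand-in:)

    volatile LONG flag0 = 0, flag1 = 0;   // shared, initially zero

    // thread 0, wanting to enter (thread 1 is symmetric):
    flag0 = 1;                 // emitted as a plain MOV...
    if (flag1 == 0)            // ...so this load can be satisfied before
        enterCriticalWork();   // the store is globally visible, and both
                               // threads can get in at once

    // the fix, today, on our platform: make the store Interlocked
    InterlockedExchange(&flag0, 1);
    if (flag1 == 0)
        enterCriticalWork();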

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


Herb Sutter
Mar 30, 2010, 11:14:19 PM
On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james...@gmail.com>
wrote:

>(I keep seeing mention here of instruction reordering. In the
>end, instruction reordering is irrelevant. It's only one thing
>that may lead to reads and writes being reordered.

Yes, but: Any reordering at any level can be treated as an instruction
reordering -- actually, as a source code reordering. That's why all
language-level MM discussions only bother to talk about source-level
reorderings, because any CPU or cache transformations end up having
the same effect as some corresponding source-level reordering.

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


Joshua Maurice
Mar 31, 2010, 4:56:56 AM
On Mar 30, 8:14 pm, Herb Sutter <herb.sut...@gmail.com> wrote:
> On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.ka...@gmail.com>

> wrote:
>
> >(I keep seeing mention here of instruction reordering. In the
> >end, instruction reordering is irrelevant. It's only one thing
> >that may lead to reads and writes being reordered.
>
> Yes, but: Any reordering at any level can be treated as an instruction
> reordering -- actually, as a source code reordering. That's why all
> language-level MM discussions only bother to talk about source-level
> reorderings, because any CPU or cache transformations end up having
> the same effect as some corresponding source-level reordering.

Not quite, no. On "weaker guarantee" processors, let's take the
following example:

/*
start pseudo code example. Forgive me for any "typos". This is off the
top of my head and I haven't really used lambda functions.
*/
int main()
{
    int a = 0;
    int b = 0;
    int c[4];
    int d[4];
    start_thread([&]() -> void { c[0] = a; d[0] = b; });
    start_thread([&]() -> void { c[1] = a; d[1] = b; });
    start_thread([&]() -> void { c[2] = a; d[2] = b; });
    start_thread([&]() -> void { c[3] = a; d[3] = b; });
    a = 1;
    b = 2;
    cout << c[0] << " " << d[0] << '\n'
         << c[1] << " " << d[1] << '\n'
         << c[2] << " " << d[2] << '\n'
         << c[3] << " " << d[3] << endl;
}
//end pseudo code example

On some modern processors, most (in)famously the DEC Alpha with its
awesome split cache, this program in the real world (or something very
much like it) can print:
0 0
0 2
1 0
1 2

Specifically, this is a single execution of the program. In this
single execution, the writes "a = 1; b = 2;" are seen to happen in two
different orders, the exact same "store instructions" become visible
to other cores in different orders. There is no (sane) source code
level reordering that can achieve this. I tried to emphasize this else-
thread: you cannot think about threading in terms of "possible
interleavings of instructions". It does not portably work. Absent
synchronization, on some processors, there is no global order of
instructions.
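(For contrast, a sketch: make a and b C++0x std::atomic<int> and the
behaviour above is excluded, because the default memory_order_seq_cst
imposes a single total order on the stores that every thread agrees
on:)

    std::atomic<int> a(0);
    std::atomic<int> b(0);

    // writer:           a.store(1); b.store(2);
    // reader thread i:  c[i] = a.load(); d[i] = b.load();
    //
    // seq_cst gives one global order of the two stores, so the same
    // stores can no longer become visible to different cores in
    // different orders (and the data race, which is undefined
    // behaviour in C++0x, is gone)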

James Kanze
Mar 31, 2010, 3:40:48 PM
On 31 Mar, 04:14, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in
> messagenews:da63ca83-4d6e-416a...@10g2000yqq.googlegroups.com...

> <snip>

> > Double checked locking can be made to work if you introduce
> > inline assembler or use some other technique to insert a
> > fence or a membar instruction in the appropriate places.
> > But of course, then, the volatile becomes superfluous.

> It is only superfluous if there is a compiler guarantee that a
> load/store for a non-volatile variable is emitted in the
> presence of a fence, which sounds like a dubious guarantee to
> me. Which compilers stop performing optimizations in the
> presence of a fence, and/or how does the compiler know which
> variable accesses can be optimized in the presence of a
> fence?

All of the compilers I know either treat inline assembler or an
external function call to a function written in assembler as a
worst case with regards to optimizing, and do not move code
across it, or they provide a means of specifying to the
compiler which variables, etc. are affected by the assembler.

> >> This is also the counter-example you are looking for, it
> >> should work on some implementations.

> > It's certainly not an example of a sensible use of volatile,
> > since without the membar/fence, the algorithm doesn't work
> > (at least on most modern processors, which are multicore).
> > And with the membar/fence, the volatile is superfluous, and
> > not needed.

> Read what I said above.

I have. But it doesn't hold water.

> >> FWIW VC++ is clever enough to make the volatile redundant
> >> for this example however adding volatile makes no
> >> difference to the generated code (read: no performance
> >> penalty) and I like making such things explicit similar to
> >> how one uses const (doesn't effect the generated output but
> >> documents the programmer's intentions).

> > The use of a fence or membar (or some system specific
> > "atomic" access) would make the intent explicit. The use of
> > volatile suggests something completely different (memory
> > mapped IO, or some such).

> Obviously we disagree on this point hence the reason for the
> existence of this argument we are having.

Yes. Theoretically, I suppose, you could find a compiler which
documented that it would move code across a fence or a membar
instruction. In practice: either the compiler treats assembler
as a black box, and supposes that it might do anything, or it
analyses the assembler, and takes the assembler into account
when optimizing. In the first case, the compiler must
synchronize its view of the memory, because it must suppose
that the assembler reads and writes arbitrary values from
memory. And in the second (which is fairly rare), it recognizes
the fence, and adjusts its optimization accordingly.

Your argument is basically that the compiler writers are either
completely incompetent, or that they are intentionally out to
make your life difficult. In either case, there are a lot more
things that they can do to make your life difficult. I wouldn't
use such a compiler, because it would be, in effect, unusable.

> <snip>
> >> The only volatile in my entire codebase is for the "status" of
> >> my "threadable" base class and I don't always acquire a lock
> >> before checking this status and I don't fully trust that the
> >> optimizer won't cache it for all cases that might crop up as I
> >> develop code.

> > I'd have to see the exact code to be sure, but I'd guess that
> > without an mfence somewhere in there, the code won't work on a
> > multicore machine (which is just about everything today), and
> > with the mfence, the the volatile isn't necessary.

> The code does work on a multi-core machine and I am confident
> it will continue to work when I write new code precisely
> because I am using volatile and therefore guaranteed a load
> will be emitted not optimized away.

If you have the fence in the proper place, you're guaranteed
that it will work, even without volatile. If you don't, you're
not guaranteed anything.

> > Also, at least under Solaris, if there is no contention, the
> > execution time of pthread_mutex_lock is practically the same
> > as that of membar. Although I've never actually measured
> > it, I suspect that the same is true if you use
> > CriticalSection (and not Mutex) under Windows.

> Critical sections are expensive when compared to a simple load
> that is guaranteed by using volatile. It is not always
> necessary to use a fence as all a fence is doing is
> guaranteeing order so it all depends on the use-case.

I'm not sure I follow. Basically, the fence guarantees that the
hardware can't do specific optimizations. The same
optimizations that the software can't do in the case of
volatile. If you think you need volatile, then you certainly
need a fence. (And if you have the fence, you no longer need
the volatile.)

--
James Kanze

Anthony Williams
Mar 31, 2010, 6:17:24 PM
Herb Sutter <herb....@gmail.com> writes:

>>But Helge Bahmann (the author of the library) didn't have such a
>
> Isn't it Anthony Williams who's doing Boost's atomic<> implementation?
> Hmm.

No. Helge's implementation covers more platforms than I have access to
or know how to write atomics for.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

Leigh Johnston
Mar 31, 2010, 6:21:41 PM
"James Kanze" <james...@gmail.com> wrote in message
news:bbd4bca1-2c16-489b...@z4g2000yqa.googlegroups.com...

> On 31 Mar, 04:14, "Leigh Johnston" <le...@i42.co.uk> wrote:
>> "James Kanze" <james.ka...@gmail.com> wrote in
>> messagenews:da63ca83-4d6e-416a...@10g2000yqq.googlegroups.com...
>
>> <snip>
>
>> > Double checked locking can be made to work if you introduce
>> > inline assembler or use some other technique to insert a
>> > fence or a membar instruction in the appropriate places.
>> > But of course, then, the volatile becomes superfluous.
>
>> It is only superfluous if there is a compiler guarantee that a
>> load/store for a non-volatile variable is emitted in the
>> presence of a fence, which sounds like a dubious guarantee to
>> me. Which compilers stop performing optimizations in the
>> presence of a fence, and/or how does the compiler know which
>> variable accesses can be optimized in the presence of a
>> fence?
>
> All of the compilers I know either treat inline assembler or an
> external function call to a function written in assembler as a
> worse case with regards to optimizing, and do not move code
> accross it, or they provide a means of specifying to the
> compiler which variables, etc. are affected by the assembler.
>

Yes, I realized that after posting, but as this newsgroup is moderated,
posting an immediate retraction reply is not possible. :)

{ An immediate retraction may be possible. Just write to the moderators (see the
link in the banner at the end of this article) including the article's tracking
number. If not yet approved the article is then rejected per request. -mod }


<snip>

>> The code does work on a multi-core machine and I am confident
>> it will continue to work when I write new code precisely
>> because I am using volatile and therefore guaranteed a load
>> will be emitted not optimized away.
>
> If you have the fence in the proper place, you're guaranteed
> that it will work, even without volatile. If you don't, you're
> not guaranteed anything.

It is guaranteed to work on the platform for which I am implementing, and
I find it hard to believe that it wouldn't work on other platforms/compilers
which have similar semantics for volatile (which you already agreed was a
fair assumption).

>
>> > Also, at least under Solaris, if there is no contention, the
>> > execution time of pthread_mutex_lock is practically the same
>> > as that of membar. Although I've never actually measured
>> > it, I suspect that the same is true if you use
>> > CriticalSection (and not Mutex) under Windows.
>
>> Critical sections are expensive when compared to a simple load
>> that is guaranteed by using volatile. It is not always
>> necessary to use a fence as all a fence is doing is
>> guaranteeing order so it all depends on the use-case.
>
> I'm not sure I follow. Basically, the fence guarantees that the
> hardware can't do specific optimizations. The same
> optimizations that the software can't do in the case of
> volatile. If you think you need volatile, then you certainly
> need a fence. (And if you have the fence, you no longer need
> the volatile.)
>

My point is that it is possible to write a piece of multi-threaded code
which does not use a fence or a mutex/critical section and just reads a
single shared variable in isolation (ordering not important and the read
atomic on the platform in question), and for this *particular* case
volatile can be useful. I find it hard to believe that there are no
cases at all where this applies.
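The kind of thing I have in mind is a simple status flag, something
like this sketch (assuming an aligned word read is atomic on the
target platform):

    class threadable
    {
    public:
        void stop() { status_ = stopping; }   // called from another thread

        // polled by the worker thread; no ordering is needed, only
        // eventual visibility, and volatile guarantees that a real
        // load is emitted each time round the loop
        bool is_stopping() const { return status_ == stopping; }

    private:
        enum status { running, stopping };
        volatile status status_;
    };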

/Leigh



Andy Venikov
Mar 31, 2010, 6:35:53 PM
James Kanze wrote:
<snip>

> I'm not sure I follow. Basically, the fence guarantees that the
> hardware can't do specific optimizations. The same
> optimizations that the software can't do in the case of
> volatile. If you think you need volatile, then you certainly
> need a fence. (And if you have the fence, you no longer need
> the volatile.)
>

Ah, finally I think I see where you are coming from. You think that if
you have the fence you no longer need a volatile.

I think you assume too much about how a fence is really implemented.
Since the standard says nothing about fences, you have to rely on a
library that provides them, and if you don't have such a library,
you'll have to implement one yourself. A reasonable way to implement a
barrier would be to use macros that, depending on the platform you run
on, expand to inline assembly containing the right instruction. In
this case the inline asm will make sure that the compiler won't
reorder the emitted instructions, but it won't make sure that the
optimizer will not throw away some needed instructions.

For example, following my post where I described Magued Michael's
algorithm, here's how relevant excerpt without volatiles would look like:

//x86-related defines:
#define LoadLoadBarrier() asm volatile ("mfence")


#include <cstdio>   // for printf

//Common code
struct Node
{
    Node * pNext;
};
Node * head_;

void f()
{
    Node * pLocalHead = head_;
    Node * pLocalNext = pLocalHead->pNext;

    LoadLoadBarrier();

    if (pLocalHead == head_)
    {
        printf("pNext = %p\n", pLocalNext);
    }
}

Just to make you happy I defined LoadLoadBarrier as a full mfence
instruction, even though on x86 there is no need for a barrier here,
even on a multicore/multiprocessor.

And here's how gcc 4.3.2 on Linux/x86-64 generated object code:

0000000000400630 <_Z1fv>:
400630: 0f ae f0 mfence
400633: 48 8b 05 fe 09 20 00 mov 0x2009fe(%rip),%rax # 601038 <head_>
40063a: bf 5c 07 40 00 mov $0x40075c,%edi
40063f: 48 8b 30 mov (%rax),%rsi
400642: 31 c0 xor %eax,%eax
400644: e9 bf fe ff ff jmpq 400508 <printf@plt>
400649: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)

As you can see, it uselessly put mfence right at the beginning of
function f() and threw away the second read of head_ and the whole if
statement altogether.

Naively, you could say that we could put "memory" clobber in the inline
assembly clobber list like this:
#define LoadLoadBarrier() asm volatile ("mfence" : : : "memory")

This will work, but it will be a huge overkill, because after this the
compiler will need to re-read all variables, even unrelated ones. And
when f() gets inlined, you get a huge performance hit.

Volatile saves the day nicely and beautifully, albeit not "standards"
portably. But as I said elsewhere, this will work on most compilers and
hardware. Of course I'd need to test it on the compiler/hardware
combination that the client is going to run it on, but such is the peril
of trying to provide a portable interface with a non-portable
implementation. But so far I haven't found a single combination that
wouldn't correctly compile the code with volatiles. And of course I'll
gladly embrace C++0x atomic<>... when it becomes available. Right now
though, I'm slowly migrating to boost::atomic (which again, internally
HAS TO use and IS using volatiles).


Thanks,
Andy.


Herb Sutter
Mar 31, 2010, 7:49:49 PM
On Wed, 31 Mar 2010 02:56:56 CST, Joshua Maurice
<joshua...@gmail.com> wrote:
>On Mar 30, 8:14 pm, Herb Sutter <herb.sut...@gmail.com> wrote:
>> On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.ka...@gmail.com>
>> wrote:
>> >(I keep seeing mention here of instruction reordering. In the
>> >end, instruction reordering is irrelevant. It's only one thing
>> >that may lead to reads and writes being reordered.
>>
>> Yes, but: Any reordering at any level can be treated as an instruction
>> reordering -- actually, as a source code reordering. That's why all
>> language-level MM discussions only bother to talk about source-level
>> reorderings, because any CPU or cache transformations end up having
>> the same effect as some corresponding source-level reordering.
>
>Not quite, no.

You mistake what we mean by "reordering" -- see note at bottom.

> There is no (sane) source code level reordering that can achieve this.

Sure there is; in fact, a single reordering of two statements will do:
Reorder *only* the second thread (to make it be "d[1] = b; c[1] =
a;"), then the above output occurs in several possible interleavings.

>I tried to emphasize this else-
>thread: you cannot think about threading in terms of "possible
>interleavings of instructions".

^^^^^^^^^^^^^

Aha, I see our disconnect. My point wasn't about "interleavings" but
rather "reorderings." Here, "reorderings" doesn't mean possible
interleavings of the source code as written, it's about actually
permuting the source code to execute operations (notably, memory
operations) out of order (as compilers, processors, and caches are all
wont to do).

Herb

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


Anthony Williams
Apr 1, 2010, 10:05:12 AM
Andy Venikov <swojch...@gmail.com> writes:

> And of course I'll gladly embrace C++0x
> atomic<>... when it becomes available.

std::atomic<> is available now for some platforms: my just::thread C++0x
thread library provides std::atomic<> for MSVC 2008, MSVC2010 on Windows
and g++4.3, g++4.4 on Ubuntu linux.

http://www.stdthread.co.uk

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976


Bart van Ingen Schenau
Apr 1, 2010, 10:52:43 AM
On Mar 31, 5:14 am, "Leigh Johnston" <le...@i42.co.uk> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in messagenews:da63ca83-4d6e-416a...@10g2000yqq.googlegroups.com...

>
> <snip>
>
> > Double checked locking can be made to work if you introduce
> > inline assembler or use some other technique to insert a fence
> > or a membar instruction in the appropriate places. But of
> > course, then, the volatile becomes superfluous.
>
> It is only superfluous if there is a compiler guarantee that a load/store for a non-volatile variable is emitted in the presence
> of a fence, which sounds like a dubious guarantee to me. Which compilers stop performing optimizations in the presence of a
> fence, and/or how does the compiler know which variable accesses can be optimized in the presence of a fence?

To extend that question: what guarantees do you have that the compiler
will not optimise the entire program to a single NOOP?

Either the compiler understands the fence instruction and it
understands the implications for possible optimisations on objects
that are (potentially) accessible from outside the current function.
Or the compiler does not understand the fence instruction and sees
just a bit of black magic, which must be assumed to change all
(potentially) externally accessible objects.
Or the compiler is broken beyond recognition.

I don't see any options there where adding volatile qualifications
would make any difference.

>
> /Leigh
>
Bart v Ingen Schenau

Joshua Maurice
Apr 1, 2010, 10:05:06 PM
On Mar 31, 4:49 pm, Herb Sutter <herb.sut...@gmail.com> wrote:
> On Wed, 31 Mar 2010 02:56:56 CST, Joshua Maurice
> >On Mar 30, 8:14 pm, Herb Sutter <herb.sut...@gmail.com> wrote:
> >> Yes, but: Any reordering at any level can be treated as an instruction
> >> reordering -- actually, as a source code reordering. That's why all
> >> language-level MM discussions only bother to talk about source-level
> >> reorderings, because any CPU or cache transformations end up having
> >> the same effect as some corresponding source-level reordering.
>
> >Not quite, no.
>
> You mistake what we mean by "reordering" -- see note at bottom.
[snip]

> >I tried to emphasize this else-
> >thread: you cannot think about threading in terms of "possible
> >interleavings of instructions".
>
> Aha, I see our disconnect. My point wasn't about "interleavings" but
> rather "reorderings." Here, "reorderings" doesn't mean possible
> interleavings of the source code as written, it's about actually
> permuting the source code to execute operations (notably, memory
> operations) out of order (as compilers, processors, and caches are all
> wont to do).

Yes; it is just an argument over semantics. You said "Any reordering at
any level can be treated as an instruction reordering -- actually, as a
source code reordering". I interpreted "reordering" and "source code"
as actual compilable source code for a sequentially consistent single
core processor with preemptive threading. Take my example from my
previous post, except with each thread running the same function, but
passed different "c" and "d" arguments. There is no (not malicious)
source code transformation from that example to compilable source code
for that processor which could produce the result "00 10 02 12".
Instead, you are using a model of a processor which has cores which
each take "source code", possibly reorder the "source code" in
isolation from other cores, and execute the "source code". Under this
model, I agree. However, I think this is a misnomer of "source code",
especially in the context of your post and this thread. Your post
suggests to some (including myself) that hardware "reordering" is
equivalent to compiler reordering, which it is not. Hardware reordering
can result in the same writes becoming visible to other threads in
different orders. Optimizing compilers on sequentially consistent
hardware cannot replicate all possible hardware reorderings of
non-sequentially consistent processors, such as my example, without
being malicious.

Joshua Maurice
Apr 5, 2010, 8:18:26 PM

I was hoping someone more knowledgeable would reply, but I guess it's
up to me. My experience has been mostly limited to POSIX, WIN32, and
Java threading, so what I'm about to say I say without the highest
level of confidence.

I understand your desire to use volatile in this way, and I think it's
a reasonable use case and desire. You are assuming that all relevant
shared state between the two threads is accessible only through head_.
You issue the hardware only fence instruction with "asm volatile" to
make sure the compiler does not remove it. You do not put the "memory"
clobber, and thus the compiler does not understand what it does.
Finally, you use "volatile" to make sure that the compiler will put
another load instruction after the fence in the compiled machine
code.

My replies are thus:

First, you don't want the "memory" clobber because you know that the
(relevant) shared state is only accessible through head_. If the
compiler will load in data dependency order and the volatile read will
do a read after the hardware fence, then everything should work out.
I'm not sure if you could find such guarantee published for any
compiler, though. However, I have a hard time thinking of a compiler
which would not do this, but I am not sure, and I would not rely upon
it without checking the compiled output.

This is partly an argument over the definition of fence. One might say
that when one talks about a portable fence, it applies to all memory,
not just a single load specified by the coder and all data dependent
loads. The "general" definition of fence demands the "memory" clobber
(absent the existence of a less draconian clobber). However, this is
irrelevant to the discussion of volatile and threading.

Finally, it comes down to whether volatile will force a load
instruction to be emitted from the compiler. At the very least, this
seems to be included in the intent and spirit of volatile, and I would
hazard a guess that all compilers would emit a load (but not
necessarily anything more [not that anything more would be needed in
this case], and ignoring volatile bugs, which greatly abound across
compilers). However, you're already at the assembly level. Would it be
that much harder to use an "asm volatile" to do a load instead of a
volatile qualified load? You're already doing assembly hackery, so at
least use something which is "more guaranteed" to work like an "asm
volatile" load and not volatile which was never intended to be a
useful threading primitive. Perhaps supply a primitive like
DataDependencyFenceForSingleLoad? I don't know enough about hardware
to even hazard a guess if such a thing is portably efficient. I do
know enough that what you're doing is not portable as is.
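Something along these lines for gcc on x86-64 (a sketch only;
load_single is a made-up name, and Node is whatever the algorithm
uses):

    // the asm volatile cannot be removed by the optimizer, and the "m"
    // constraint ties the load to the actual object in memory
    static inline Node* load_single(Node* const* addr)
    {
        Node* value;
        asm volatile ("movq %1, %0" : "=r" (value) : "m" (*addr));
        return value;
    }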

All in all though, this does not change the fact that volatile is not
a useful, correct, portable threading primitive. All you've
demonstrated is that volatile in conjunction with assembly (not
portable) can be a useful, correct, non-portable threading primitive,
though I would argue that the code has poor "style" and should not be
using volatile.

Chris M. Thomasson
Apr 6, 2010, 4:39:09 AM
"Andy Venikov" <swojch...@gmail.com> wrote in message news:hp0avi$nlt$1...@news.eternal-september.org...

> James Kanze wrote:
> <snip>
>> I'm not sure I follow. Basically, the fence guarantees that the
>> hardware can't do specific optimizations. The same
>> optimizations that the software can't do in the case of
>> volatile. If you think you need volatile, then you certainly
>> need a fence. (And if you have the fence, you no longer need
>> the volatile.)
>>
>
> Ah, finally I think I see where you are coming from. You think that if
> you have the fence you no longer need a volatile.
>
> I think you assume too much about how fence is really implemented. Since
> the standard says nothing about fences you have to rely on a library
> that provides them and if you don't have such a library, you'll have to
> implement one yourself. A reasonable way to implement a barrier would be
> to use macros that, depending on a platform you run, expand to inline
> assembly containing the right instruction. In this case the inline asm
> will make sure that the compiler won't reorder the emitted instructions,
> but it won't make sure that the optimizer will not throw away some
> needed instructions.
>
> For example, following my post where I described Maged Michael's
> algorithm, here's how the relevant excerpt without volatiles would look:

I am just starting to read some of the posts in this thread, so please try to bear with me here... I am wondering which one of Maged Michael's algorithms you are referring to; is it SMR? If so, you have a race condition.

> //x86-related defines:
> #define LoadLoadBarrier() asm volatile ("mfence")

I would probably define `MFENCE' as a `StoreLoad' barrier.

[...]

> Just to make you happy I defined LoadLoadBarrier as a full mfence
> instruction, even though on x86 there is no need for a barrier here,
> even on a multicore/multiprocessor.

Well, if you are indeed referring to SMR... Then even x86 requires an explicit `StoreLoad' barrier, unless you are using a more "exotic" form of memory synchronization...

;^)

Andy Venikov
Apr 28, 2010, 3:19:15 PM
Anthony Williams wrote:
> Andy Venikov <swojch...@gmail.com> writes:
>
>> And of course I'll gladly embrace C++0x
>> atomic<>... when it becomes available.
>
> std::atomic<> is available now for some platforms: my just::thread C++0x
> thread library provides std::atomic<> for MSVC 2008, MSVC2010 on Windows
> and g++4.3, g++4.4 on Ubuntu linux.
>
> http://www.stdthread.co.uk
>
> Anthony

Actually, I just checked, and it looks like 4.3 still doesn't have atomics... Which kinda makes sense since gcc added C++0x thread support only in 4.4.

Andy.


Andy Venikov
Apr 28, 2010, 3:19:02 PM
Anthony Williams wrote:
> Andy Venikov <swojch...@gmail.com> writes:
>
>> And of course I'll gladly embrace C++0x
>> atomic<>... when it becomes available.
>
> std::atomic<> is available now for some platforms: my just::thread C++0x
> thread library provides std::atomic<> for MSVC 2008, MSVC2010 on Windows
> and g++4.3, g++4.4 on Ubuntu linux.
>
> http://www.stdthread.co.uk
>
> Anthony

I wasn't aware that atomic<> was available in g++4.3
Scott Meyers' c++0x page lists 4.4 as the first version to support it.
I'll post a message in his thread about it.


Thanks,
Andy.

P.S. gcc's page lists atomics as not supported in either of the versions.


Andy Venikov
Apr 28, 2010, 3:18:46 PM
Chris M. Thomasson wrote:
> "Andy Venikov" <swojch...@gmail.com> wrote in message
<snip>

> I am just starting to read some of the posts in this thread, so please
> tyr to bear with me here... I am wondering which one of Maged Michael's
> algorithms you are referring to; is it SMR? If so, you have a race
> condition.

No, we're talking about "Simple, fast and practical queues", not SMR.
We already talked about this issue when we asked your help in reviewing
Tim Blechmann's boost::lockfree implementation back in November '09.
Your recommendation was to use Dmitry Vyukov's xchg-based queue. But
the problem with that algorithm was that 1) it was blocking, and Tim
wanted to have lock-free guarantees, and 2) it relies on the
Intel-specific "lock xchg" which, as far as I understand, no other
vendor implements directly.


>
>
>
>> //x86-related defines:
>> #define LoadLoadBarrier() asm volatile ("mfence")
>
> I would probably define `MFENCE' as a `StoreLoad' barrier.

Yeah, I explained the reason for LoadLoad being defined as mfence below. I just wanted to make the point that even if an mfence instruction is plugged in, there's no guarantee that the optimizer won't remove some important code.

>
> [...]
>
>> Just to make you happy I defined LoadLoadBarrier as a full mfence
>> instruction, even though on x86 there is no need for a barrier here,
>> even on a multicore/multiprocessor.
>
> Well, if you are indeed referring to SMR... Then even x86 requires an
> explicit `StoreLoad' barrier, unless you are using a more "exotic" form
> of memory synchronization...
>
> ;^)

Nope, not SMR.

Andy.

Anthony Williams
Apr 29, 2010, 6:00:49 AM
Andy Venikov <swojch...@gmail.com> writes:

> Anthony Williams wrote:
>> Andy Venikov <swojch...@gmail.com> writes:
>>
>>> And of course I'll gladly embrace C++0x
>>> atomic<>... when it becomes available.
>>
>> std::atomic<> is available now for some platforms: my just::thread C++0x
>> thread library provides std::atomic<> for MSVC 2008, MSVC2010 on Windows
>> and g++4.3, g++4.4 on Ubuntu linux.
>>
>> http://www.stdthread.co.uk

> I wasn't aware that atomic<> was available in g++4.3


> Scott Meyers' c++0x page lists 4.4 as the first version to support it.
> I'll post a message in his thread about it.

> P.S. gcc's page lists atomics as not supported in either of the versions.

std::atomic<> is not shipped with either g++ 4.3 or g++ 4.4. As I stated
above, just::thread provides an implementation for the listed
compilers. This is a commercial library, available from
http://www.stdthread.co.uk

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Andy Venikov

unread,
Apr 29, 2010, 3:02:23 PM4/29/10
to
Anthony Williams wrote:
> Andy Venikov <swojch...@gmail.com> writes:
>
>> Anthony Williams wrote:
>>> Andy Venikov <swojch...@gmail.com> writes:
>>>
>>>> And of course I'll gladly embrace C++0x
>>>> atomic<>... when it becomes available.
>>> std::atomic<> is available now for some platforms: my just::thread C++0x
>>> thread library provides std::atomic<> for MSVC 2008, MSVC2010 on Windows
>>> and g++4.3, g++4.4 on Ubuntu linux.
>>>
>>> http://www.stdthread.co.uk
>
>> I wasn't aware that atomic<> was available in g++4.3
>> Scott Meyers' c++0x page lists 4.4 as the first version to support it.
>> I'll post a message in his thread about it.
>
>> P.S. gcc's page lists atomics as not supported in either of the versions.
>
> std::atomic<> is not shipped with either g++ 4.3 or g++ 4.4. As I stated
> above, just::thread provides an implementation for the listed
> compilers. This is a commercial library, available from
> http://www.stdthread.co.uk
>
> Anthony

Sorry, I misread your statement as saying you had contributed your implementation to gcc.

BTW, gcc does have c++0x atomics support starting with 4.4
But I guess, since it's not documented, it's really very experimental.

Andy.

--

Anthony Williams

unread,
May 4, 2010, 9:34:32 PM5/4/10
to
Andy Venikov <swojch...@gmail.com> writes:

> BTW, gcc does have c++0x atomics support starting with 4.4
> But I guess, since it's not documented, it's really very experimental.

Wow, I missed that. Thanks for pointing it out to me.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

DeMarcus

unread,
Mar 16, 2010, 12:28:04 AM3/16/10
to
Hi,

I'm trying to implement a simplified version of Alexandrescu's
Loki::SingletonHolder. See
http://loki-lib.sourceforge.net/html/a00670.html
row 717.

My code looks like this.

template<typename T>
class Singleton
{
public:
    static T& getInstance()
    {
        return *instance_;
    }

private:
    typedef volatile T* SPtr;
    static SPtr instance_;
};

template<typename T>
typename Singleton<T>::SPtr Singleton<T>::instance_;

int main()
{
    typedef Singleton<int> S;
    S::getInstance() = 4711;
}

But when I compile it with gcc 4.4.1 I get the following error message
at 'return *instance_;'.
"error: invalid initialization of reference of type 'int&' from
expression of type 'volatile int' "

What am I doing wrong?


Thanks,
Daniel


--

Joshua Maurice

unread,
Mar 16, 2010, 6:42:38 AM3/16/10
to
On Mar 15, 9:28 pm, DeMarcus <use_my_alias_h...@hotmail.com> wrote:
> Hi,
>
> I'm trying to implement a simplified version of Alexandrescu's
> Loki::SingletonHolder. See http://loki-lib.sourceforge.net/html/a00670.html
> row 717.
>
> My code looks like this.
>
> [...]

>
> What am I doing wrong?

Well, for starters, why are you using the volatile keyword? Do you
think it's a portable threading construct? It's not. volatile in C and
C++ has nothing to do with threading. The C and C++ standards do not
talk about threads, so anything they say about volatile is irrelevant.
Under POSIX, volatile means nothing special for threading. (Certain
versions of the Microsoft compiler do claim to give volatile accesses
acquire and release semantics, but let's just ignore that bad form for now.
Use boost or ACE or some portable library if you need atomic operations, or
wait for the new C++ standard. At worst, wrap volatile yourself so you
don't litter your code with a non-portable construct.)

To answer your specific question, why that error message, let's look
at your code:

> template<typename T>
> class Singleton
> {
> public:
>     static T& getInstance()
>     {
>         return *instance_;
>     }
> private:
>     typedef volatile T* SPtr;
>     static SPtr instance_;
> };

*instance_ is an lvalue of type "volatile T". A "T&" cannot bind to an
lvalue of type "volatile T"; a "volatile T&" can. It's basic const
correctness, or more generally CV (const volatile) correctness. The
compiler won't let you bind a non-volatile reference to a volatile lvalue,
just like it won't let you bind a non-const reference to a const lvalue.
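
In compressed form (my own illustrative snippet, not from your code):

volatile int v = 0;

int&          r1 = v;                   // error: binding discards volatile
volatile int& r2 = v;                   // OK
int&          r3 = const_cast<int&>(v); // compiles, but any access through
                                        // r3 to the volatile object is
                                        // undefined behavior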

Ulrich Eckhardt

unread,
Mar 16, 2010, 6:42:41 AM3/16/10
to
DeMarcus wrote:
> template<typename T>
> class Singleton
> {
> public:
>     static T& getInstance()
>     {
>         return *instance_;
>     }
>
> private:
>     typedef volatile T* SPtr;
>     static SPtr instance_;
> };
[...]
> S::getInstance() = 4711;
[...]

> But when I compile it with gcc 4.4.1 I get the following error message
> at 'return *instance_;'.
> "error: invalid initialization of reference of type 'int&' from
> expression of type 'volatile int' "

The compiler is telling you exactly what you are doing wrong. Imagine you
swapped the "volatile" for a "const": what would you expect? Similarly, a
simple "const_cast<int&>(*instance_)" will silence the compiler for you.

Further:
1. You don't need the template machinery for this example; it just makes
getting things right more complicated.
2. Your instance_ pointer is never initialized.
3. The exact semantics of volatile differ by compiler; the standard only
requires that it behave like const in the above respect (as a CV-qualifier).
4. If you had written "T volatile*" instead of "volatile T*", the same
scheme (CV-qualifier binding to the left) would extend to declaring a
volatile pointer ("T* volatile") instead of a pointer to volatile, which I
suspect is what you actually wanted. (See the declarations below.)
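
In code:

volatile int* p1;  // pointer to volatile int: *p1 is volatile, p1 is not
int volatile* p2;  // same type as p1, qualifier written to the left of *
int* volatile p3;  // volatile pointer to int: p3 is volatile, *p3 is not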

Uli

eca

unread,
Mar 16, 2010, 6:45:30 AM3/16/10
to
On Mar 16, 5:28 am, DeMarcus <use_my_alias_h...@hotmail.com> wrote:

> I'm trying to implement a simplified version of Alexandrescu's
> Loki::SingletonHolder. See http://loki-lib.sourceforge.net/html/a00670.html
> row 717.

I would suggest:

static volatile T& getInstance()
{
    return *instance_;
}

"volatile", as well as "const", cannot be neglected.
BTW, remember to initialize instance_ somewhere.

HTH,
eca

Johannes Schaub (litb)

unread,
Mar 16, 2010, 7:05:20 AM3/16/10
to
DeMarcus wrote:

"T&" designates the type "int&" , but "*instance" is an expression of type
"volatile int". You cannot refer to a volatile object by a non-volatile
expression. If you do nontheless by casting away volatile, behavior is
undefined. The compiler guards you from that by not allowing the non-
volatile reference to bind to expressions of volatile type.

I dunno what Alexandrescu's code is doing, but surely there are more levels
of indirections in his code that care for health :)

DeMarcus

unread,
Mar 16, 2010, 9:42:18 PM3/16/10
to
Joshua Maurice wrote:
> On Mar 15, 9:28 pm, DeMarcus <use_my_alias_h...@hotmail.com> wrote:
>> Hi,
>>
>> I'm trying to implement a simplified version of Alexandrescu's
>> Loki::SingletonHolder. See http://loki-lib.sourceforge.net/html/a00670.html
>> row 717.
>>
>> My code looks like this.
>>
>> [...]
>>
>> What am I doing wrong?
>
> Well, for starters, why are you using the volatile keyword? Do you
> think it's a portable threading construct? It's not. volatile in C and
> C++ has nothing to do with threading. The C and C++ standards do not
> talk about threads, so anything they say about volatile is irrelevant.
> By POSIX, volatile means nothing special for threading. (The Microsoft
> compiler under certain versions does claim to make it like a mutex
> acquire and release, but let's just ignore this bad form for now. Use
> boost or ACE or some portable library if you need atomic functions, or
> wait for the new C++ standard. At worst, wrap volatile yourself to not
> litter your code with a not portable construct + usage.)
>

This makes me very confused. I've always been taught to use the volatile
keyword in front of variables that can be accessed from several threads.

DeMarcus

unread,
Mar 16, 2010, 9:43:10 PM3/16/10
to

That's the thing! I do the same as him. This is what he does.

Here's Singleton.h

00717 template
00718 <
00719 typename T,
00720 template <class> class CreationPolicy = CreateUsingNew,
00721 template <class> class LifetimePolicy = DefaultLifetime,
00722 template <class, class> class ThreadingModel =
00722b LOKI_DEFAULT_THREADING_NO_OBJ_LEVEL,
00723 class MutexPolicy = LOKI_DEFAULT_MUTEX
00724 >
00725 class SingletonHolder
00726 {
00727 public:
00728
00730 typedef T ObjectType;
00731
00733 static T& Instance();
00734
00735 private:
00736 // Helpers
00737 static void MakeInstance();
00738 static void LOKI_C_CALLING_CONVENTION_QUALIFIER
00738b DestroySingleton();
00739
00740 // Protection
00741 SingletonHolder();
00742
00743 // Data
00744 typedef typename
00744b ThreadingModel<T*,MutexPolicy>::VolatileType
00744c PtrInstanceType;
00745 static PtrInstanceType pInstance_;
00746 static bool destroyed_;
00747 };

[...]

00775 // SingletonHolder::Instance
00777
00778 template
00779 <
00780 class T,
00781 template <class> class CreationPolicy,
00782 template <class> class LifetimePolicy,
00783 template <class, class> class ThreadingModel,
00784 class MutexPolicy
00785 >
00786 inline T& SingletonHolder<T, CreationPolicy,
00787 LifetimePolicy, ThreadingModel, MutexPolicy>::Instance()
00788 {
00789 if (!pInstance_)
00790 {
00791 MakeInstance();
00792 }
00793 return *pInstance_;
00794 }

Here's Threads.h containing ThreadingModel.


00252 template < class Host, class MutexPolicy =
00252b LOKI_DEFAULT_MUTEX >
00253 class ObjectLevelLockable
00254 {
00255 mutable MutexPolicy mtx_;
00256
00257 public:
00258 ObjectLevelLockable() : mtx_() {}
00259
00260 ObjectLevelLockable(const ObjectLevelLockable&) :
00260b mtx_() {}
00261
00262 ~ObjectLevelLockable() {}
00263
00264 class Lock;
00265 friend class Lock;
00266
00269 class Lock
00270 {
00271 public:
00272
00274 explicit Lock(const ObjectLevelLockable& host) :
00274b host_(host)
00275 {
00276 host_.mtx_.Lock();
00277 }
00278
00280 explicit Lock(const ObjectLevelLockable* host) :
00280b host_(*host)
00281 {
00282 host_.mtx_.Lock();
00283 }
00284
00286 ~Lock()
00287 {
00288 host_.mtx_.Unlock();
00289 }
00290
00291 private:
00293 Lock();
00294 Lock(const Lock&);
00295 Lock& operator=(const Lock&);
00296 const ObjectLevelLockable& host_;
00297 };
00298
00299 typedef volatile Host VolatileType;
00300
00301 typedef LOKI_THREADS_LONG IntType;
00302
00303 LOKI_THREADS_ATOMIC_FUNCTIONS
00304
00305 };


If you look at row 744b you see that he passes T* to ThreadingModel. If
you then look at row 299 you see that his VolatileType becomes a
volatile T*. He and I then do the exact same thing on row 793,
initializing a T& with a volatile T.

I tried to implement my own singleton following Modern C++ Design by
Alexandrescu (an excellent book, by the way), but I got stuck when my
compiler started to complain. In fact, in the book, in Section 6.10.3,
Assembling SingletonHolder, p. 151, he writes: "The
ThreadingModel<T>::VolatileType type definition expands either to T or
volatile T, depending on the actual threading model."

Could it be that compilers at the time he wrote SingletonHolder let that
volatile conversion through, but now they don't?

Joshua Maurice

unread,
Mar 17, 2010, 1:56:58 AM3/17/10
to

I'm sorry that you were taught incorrectly. I suggest reading the
paper:
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
It's a much more thorough description of modern threading in C++, and
specifically the volatile keyword. Any programmer who uses threads
should read this paper and understand its implications.

This is a very common misconception. I would guess that it started
from poor teaching in schools and elsewhere. In the programming
courses I took in college, I was taught a very simple and naive
threading model, much as most physicists are first taught Newtonian
mechanics. But unlike the physicists, who go on to general relativity
and quantum mechanics, I was never taught that this naive threading
model is actually incorrect, that it's not how POSIX and WIN32 threads
actually work. I believe most people who are taught threading are
taught this naive model first and never taught any better, and it has
simply perpetuated itself due to lack of understanding. This naive
model is part of the naive model of computers in general: that they
execute source code instructions faithfully, in order, without any
changes. In the real world, however, C++ (and other industry
programming languages) have the "as-if" rule, which allows (or is
interpreted as allowing) the reordering and rewriting of the
instructions of a single thread as long as the result of that thread
*in isolation* is the same before and after the modifications. This is
the allowance behind basically all optimizations; to do otherwise in
the presence of multiple threads would be *way* too great a
performance hit.

Specifically, if you do not understand the following example, then you
do not understand threading according to POSIX, WIN32, Java, and, I
assume, the new C++0x standard. (I assume that C++0x essentially
copied the Java, POSIX, and WIN32 thread models; I haven't actually
read the new standard draft yet.)

// -- start pseudo code
#include "threading_library.hpp" // hypothetical portable thread wrapper
#include <iostream>
using namespace std;

int a = 0;
int b = 0;

void* foo(void*)
{
    cout << a << " " << b << endl;
    return 0;
}

void* bar(void*)
{
    a = 1;
    b = 2;
    return 0;
}

int main()
{
    start_thread(foo, 0);
    start_thread(bar, 0);
}
// -- end code
// -- end code

Under POSIX and WIN32 guarantees (and Java guarantees if this were Java
code), this can print any of the following:


0 0
0 2
1 0
1 2

Yes. The same compiled executable, when run multiple times, can print
any of the four, perhaps changing at random between executions, under
a conforming compiler. Without a "happens-before" relationship (to
borrow the term from Java), the guarantees you have are really quite
small, if any. If you want a proper "happens-before" relationship,
then use proper synchronization. Simply put, there is no global
ordering of instructions when there are multiple threads, and
different threads might see different views of main memory. (Google
"cache coherency".) This implies that you cannot correctly reason
about threading by simply examining the possible interleavings of
instructions, as is so commonly done, because no such global ordering
is guaranteed.
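
For instance, here is one way to add the missing "happens-before" edge to
the example above using C++0x atomics. (A sketch only; as discussed
elsewhere in this thread, compiler support for <atomic> and <thread> is
still spotty, so treat it as illustrative.)

#include <atomic>
#include <thread>
#include <iostream>

int a = 0;
std::atomic<int> b(0);

void foo()
{
    while (b.load(std::memory_order_acquire) != 2)
        ; // spin until bar's release store becomes visible
    // the acquire load synchronizes-with the release store, so the
    // plain store to `a' in bar() happens-before this read:
    std::cout << a << " " << b.load() << std::endl; // prints "1 2"
}

void bar()
{
    a = 1;                                 // ordinary store...
    b.store(2, std::memory_order_release); // ...published by this store
}

int main()
{
    std::thread t1(foo);
    std::thread t2(bar);
    t1.join();
    t2.join();
}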

Coming full circle: volatile in C and C++ was not intended to guarantee a
global order across threads, no other standard such as POSIX defines it as
such (though Visual Studio's compiler may), and most compilers implement
volatile without any such guarantee. Thus, volatile is not a portable
threading construct in C or C++.

Paul Bibbings

unread,
Mar 17, 2010, 1:57:16 AM3/17/10
to
DeMarcus <use_my_a...@hotmail.com> writes:

<snip></snip>

> I do the same as him. This is what he does.

<snip>Following code, at breaks in line numbering</snip>

> Here's Singleton.h
>
> 00717 template
> 00718 <
> 00719 typename T,
> 00720 template <class> class CreationPolicy = CreateUsingNew,
> 00721 template <class> class LifetimePolicy = DefaultLifetime,
> 00722 template <class, class> class ThreadingModel =
> 00722b LOKI_DEFAULT_THREADING_NO_OBJ_LEVEL,
> 00723 class MutexPolicy = LOKI_DEFAULT_MUTEX
> 00724 >
> 00725 class SingletonHolder
> 00726 {
> 00727 public:

> 00733 static T& Instance();

> 00744 typedef typename
> 00744b ThreadingModel<T*,MutexPolicy>::VolatileType
> 00744c PtrInstanceType;
> 00745 static PtrInstanceType pInstance_;

> 00747 };

> 00299 typedef volatile Host VolatileType;

> 00305 };


>
>
> If you look at row 744b you see that he passes T* to ThreadingModel. If
> you then look at row 299 you see that his VolatileType becomes a
> volatile T*.

Since Host at line 299 is a T*, VolatileType is not volatile T* but
rather T* volatile; that is, a volatile pointer to T.

> He and I then do the exact same thing on row 793,
> initializing a T& with a volatile T.

Rather, Alexandrescu is initializing a T& with an l-value of type T. It
is his T* that is volatile, not what it points to.
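
You can see this by doing the substitution by hand (my own expansion, not
code from Loki itself):

typedef int* Host;                  // Host = T*, with T = int
typedef volatile Host VolatileType; // this is `int* volatile'...
// ...not `volatile int*'. Dereferencing a VolatileType therefore
// yields a plain int lvalue, which binds to int& without complaint.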

Regards

Paul Bibbings
