
Double-checked locking


Scott Meyers

May 3, 2003, 8:52:04 PM
[I just posted this to comp.std.c++ as a followup to a February thread (I'm
way behind...) , but it's of more general interest, I think, so I'm posting
here, too. -- Scott]

On Mon, 17 Feb 2003 17:09:35 +0000 (UTC), Christoph Rabel wrote (to
comp.std.c++):
> MySingleton *MySingleton::Instance(void)
> {
>     if (!pInstance)
>     {
>         LOCK();                       // Do some MT-locking here
>         if (!pInstance)
>             pInstance = new MySingleton;
>         UNLOCK();
>     }
>     return pInstance;
> }

This is the double-checked locking pattern. I recently drafted an article on
this topic for CUJ. As I sit here in a pool of my own blood based on the
feedback I got from pre-pub reviewers, I feel compelled to offer the following
observation: there is, as far as I know, no way to make this work on a reliable
and portable basis.

The best treatment of this topic that I know of is "The 'Double-Checked Locking
is Broken' Declaration"
(http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html). I
suggest you not fall into the trap I did in assuming that its focus on Java
implies that it doesn't really apply to C++. It does. My favorite paragraph
from that document is this:

There are lots of reasons it doesn't work. The first couple of reasons we'll
describe are more obvious. After understanding those, you may be tempted to
try to devise a way to "fix" the double-checked locking idiom. Your fixes will
not work: there are more subtle reasons why your fix won't work. Understand
those reasons, come up with a better fix, and it still won't work, because
there are even more subtle reasons.

Lots of very smart people have spent lots of time looking at this. There is no
way to make it work without requiring each thread that accesses the helper
object to perform synchronization.

As an example of one of the "more obvious" reasons why it doesn't work, consider
this line from the above code:

pInstance = new MySingleton;

Three things must happen here:
1. Allocate enough memory to hold a MySingleton object.
2. Construct a MySingleton in the memory.
3. Make pInstance point to the object.

In general, they don't have to happen in this order. Consider the following
translation. This isn't code a human is likely to write, but it is a valid
translation on the part of the compiler under certain circumstances (e.g., when
static analysis reveals that the MySingleton constructor cannot throw):

pInstance = static_cast<MySingleton*>(           // 3
    operator new(sizeof(MySingleton)));          // 1
new (pInstance) MySingleton;                     // 2

If we plop this into the original function, we get this:

> MySingleton *MySingleton::Instance(void)
> {
>     if (!pInstance)                                  // Line 1
>     {
>         LOCK();                                      // Do some MT-locking here
>         if (!pInstance)
>         {
>             pInstance = static_cast<MySingleton*>(
>                 operator new(sizeof(MySingleton)));  // Line 2
>             new (pInstance) MySingleton;
>         }
>         UNLOCK();
>     }
>     return pInstance;
> }

So consider this sequence of events:
- Thread A enters MySingleton::Instance, executes through Line 2, and is
suspended.
- Thread B enters MySingleton::Instance, executes Line 1, sees that pInstance
is non-null, and returns. It then merrily dereferences the pointer, thus
referring to memory that does not yet hold an object.

If there's a portable way to avoid this problem in the presence of aggressive
optimizing compilers, I'd love to know about it.
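For readers of this archive: the thread-aware semantics asked for here were eventually standardized in C++11. With std::atomic, a correct double-checked lock becomes expressible. The following is a sketch under C++11 assumptions (the names MySingleton and mtx are illustrative, not from the thread), not something available to the 2003 participants:

```cpp
#include <atomic>
#include <mutex>

class MySingleton {
public:
    static MySingleton* Instance();
private:
    MySingleton() = default;
    static std::atomic<MySingleton*> pInstance;  // atomic pointer supplies the ordering DCLP needs
    static std::mutex mtx;
};

std::atomic<MySingleton*> MySingleton::pInstance{nullptr};
std::mutex MySingleton::mtx;

MySingleton* MySingleton::Instance() {
    // Acquire load: if we observe a non-null pointer, we also observe
    // all the constructor's writes that happened before the release store.
    MySingleton* p = pInstance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard<std::mutex> lock(mtx);
        p = pInstance.load(std::memory_order_relaxed);  // re-check under the lock
        if (!p) {
            p = new MySingleton;
            // Release store: publishes the fully constructed object before
            // the pointer becomes visible to other threads.
            pInstance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The acquire/release pair is exactly the "temporal ordering stricter than the as-if rule" that the thread concludes cannot be expressed in C++98.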

Scott

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]

Phil Carmody

May 4, 2003, 10:55:24 AM
On Sat, 03 May 2003 20:52:04 +0000, Scott Meyers wrote:
> pInstance = static_cast<MySingleton*>(           // 3
>     operator new(sizeof(MySingleton)));          // 1
> new (pInstance) MySingleton;                     // 2
>
> If we plop this into the original function, we get this:
>
>> MySingleton *MySingleton::Instance(void)
>> {
>>     if (!pInstance)                                  // Line 1
>>     {
>>         LOCK();                                      // Do some MT-locking here
>>         if (!pInstance)
>>         {
>>             pInstance = static_cast<MySingleton*>(
>>                 operator new(sizeof(MySingleton)));  // Line 2
>>             new (pInstance) MySingleton;
>>         }
>>         UNLOCK();
>>     }
>>     return pInstance;
>> }
>
> So consider this sequence of events:
> - Thread A enters MySingleton::Instance, executes through Line 2, and is
> suspended.
> - Thread B enters MySingleton::Instance, executes Line 1, sees that pInstance
> is non-null, and returns. It then merrily dereferences the pointer, thus
> referring to memory that does not yet hold an object.
>
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.

Wouldn't a volatile pInstance prevent the assigning to pInstance before
the right hand side of the = (i.e. the new) has completed?

Phil

Raoul Gough

May 4, 2003, 10:56:56 AM
"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.191d9257d...@news.hevanet.com...

[snip explanation of problems with double checked locking]

> If there's a portable way to avoid this problem in the presence of
> aggressive optimizing compilers, I'd love to know about it.

I suppose you've already looked at William Kempf's work on
boost::once? It looks fairly easy with Posix threads, slightly harder
with Windows threads. I guess other platforms with named mutexes could
use a similar approach to the Windows code. The relevant
implementation file is:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/boost/boost/libs/thread
/src/once.cpp?rev=HEAD&content-type=text/vnd.viewcvs-markup

(this URL will probably require manual reassembly).

--
Raoul Gough
see http://home.clara.net/raoulgough/ for my work availability

Joseph Seigh

May 5, 2003, 6:41:11 AM

....


>
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.
>

But each thread only has to perform that synchronization once. Obviously a
simple global pointer isn't a good way for each thread to remember that.
But using a copy of the pointer owned by the thread would work, i.e.

if (pLocalInstance == 0) {
    LOCK();
    if (pInstance == 0) {
        pInstance = new MySingleton();
    }
    UNLOCK();
    pLocalInstance = pInstance;
}

There are two ways to implement thread ownership of a copy. One is by allocating
the local pointer in thread local storage (TLS). The other is by incorporating
the local pointer into an object that already requires synchronization. Accessing the
global data is safe because you've either safely set the local pointer yourself
or you've acquired ownership of the object via proper synchronization that
guarantees proper memory visibility. Typical usage would be a class that dynamically
initializes global storage (static initialization isn't necessarily thread-safe).

global_t *pInstance;

class MyClass {

private:
    global_t *pLocalInstance;

    global_t *Instance() {
        LOCK();
        if (pInstance == 0) {
            ...;                 // initialize static data
        }
        UNLOCK();
        return pInstance;
    }

public:
    MyClass() {
        pLocalInstance = Instance();
        ...;                     // rest of object initialization
    }

    // some method that requires safely accessing global static data
    int getX() {
        return pLocalInstance->x;
    }

    ....
};
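Seigh's first option, a per-thread cached copy of the pointer, can be sketched with the C++11 thread_local keyword (which postdates this thread; the names MySingleton, mtx, and Instance are illustrative). Each thread pays for the lock at most once; afterwards it reads only its own cached copy, which no other thread ever touches:

```cpp
#include <mutex>

struct MySingleton { int x = 42; };   // hypothetical singleton payload

MySingleton* pInstance = nullptr;
std::mutex mtx;

MySingleton* Instance() {
    // One copy per thread: the unsynchronized fast-path check reads only
    // thread-private state, so no data race is possible on it.
    thread_local MySingleton* pLocalInstance = nullptr;
    if (!pLocalInstance) {
        std::lock_guard<std::mutex> lock(mtx);  // the lock supplies the memory barrier
        if (!pInstance)
            pInstance = new MySingleton;
        pLocalInstance = pInstance;             // remember it; no lock needed next time
    }
    return pLocalInstance;
}
```

Note that thread_local sidesteps the key-initialization chicken-and-egg problem Seigh later raises about explicit TLS APIs, since the compiler and runtime manage the per-thread slot.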


Joe Seigh

Pete Becker

May 5, 2003, 6:41:36 AM
Scott Meyers wrote:
>
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.
>

Store to a local pointer and then copy the result into pInstance, so
that there's a sequence point between the call to new and the change to
pInstance. But that doesn't solve the real problem, which is at the
hardware level: in the absence of memory barriers (typically obtained by
locking and unlocking a mutex) there's no guarantee that the changes
made by the constructor will be visible to other threads before the
change made to the pointer value becomes visible. The threads can run on
different processors, each with its own memory cache. The 'other'
thread's cache may pick up the change to pInstance (because it accessed
another value in the same cache line, for example) without picking up
the changes to the memory that pInstance points to.

For a good discussion of these issues, see David Butenhof's book,
_Programming with POSIX Threads_, sec. 3.4, "Memory Visibility Between
Threads".

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

Scott Meyers

May 5, 2003, 6:43:43 AM
On 4 May 2003 10:55:24 -0400, Phil Carmody wrote:
> Wouldn't a volatile pInstance prevent the assigning to pInstance before
> the right hand side of the = (i.e. the new) has completed?

No. Declaring pInstance volatile will force reads of that variable to come
from memory and writes to that variable to go to memory, but what we need
here is a way to say that pInstance should not be written until the
Singleton has been constructed. That is, we need to tell the compiler to
respect a temporal ordering that is stricter than the as-if rule. As far
as I know, there is no way to do that. Certainly volatile doesn't do it.

Scott

Scott Meyers

May 5, 2003, 6:45:45 AM
On 4 May 2003 10:56:56 -0400, Raoul Gough wrote:
> I suppose you've already looked at William Kempf's work on
> boost::once? It looks fairly easy with Posix threads, slightly harder
> with Windows threads. I guess other platforms with named mutexes could
> use a similar approach to the Windows code. The relevant
> implementation file is:
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/boost/boost/libs/thread
> /src/once.cpp?rev=HEAD&content-type=text/vnd.viewcvs-markup

For pthreads, it appears that double-checked locking is not employed here.
That's fine, but it has nothing to say about the validity of double-checked
locking. Please remember that my contention here is that there is no
portable and reliable way to make double-checked locking work. I'm not
claiming that singletons can't be initialized in a thread-safe manner. I'm
claiming that double-checked locking can't do it.

For WinThreads, the code above looks like this, in part:

if (compare_exchange(&flag, 1, 1) == 0)  // 2nd check
{
    func();                              // invoke "once" func
    InterlockedExchange(&flag, 1);       // set "called" bit
}

Again, assuming an aggressive optimizing compiler that can see through
function pointers and across function call boundaries (e.g., via
full-program optimization, which is available on at least two compilers I
know -- Intel's (in general) and Microsoft's (when generating managed
code)) what prevents a compiler from using the as-if rule to reorder the
block so that func is called after the call to InterlockedExchange?
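The "once" machinery discussed here (pthread_once, boost::once) was later standardized as std::call_once in C++11. A sketch of that approach, with hypothetical names (MySingleton, initFlag), which delegates all the ordering guarantees to the library instead of hand-rolled double-checked locking:

```cpp
#include <mutex>

struct MySingleton { int x = 42; };   // hypothetical singleton payload

MySingleton* pInstance = nullptr;
std::once_flag initFlag;

MySingleton* Instance() {
    // std::call_once runs the initializer exactly once across all threads;
    // every caller gets the necessary synchronization from the library,
    // so the compiler cannot reorder the initialization past the "done" flag.
    std::call_once(initFlag, [] { pInstance = new MySingleton; });
    return pInstance;
}
```

This matches the position taken throughout the thread: the singleton can be initialized thread-safely, just not via double-checked locking expressed in plain C++98.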

Scott

Shmulik Flint

May 5, 2003, 6:46:11 AM
Scott Meyers <Use...@aristeia.com> wrote in message news:<MPG.191d9257d...@news.hevanet.com>...
> [I just posted this to comp.std.c++ as a followup to a February thread (I'm
> way behind...) , but it's of more general interest, I think, so I'm posting
> here, too. -- Scott]
>
> On Mon, 17 Feb 2003 17:09:35 +0000 (UTC), Christoph Rabel wrote (to
> comp.std.c++):
> > MySingleton *MySingleton::Instance(void)
> > {
> >     if (!pInstance)
> >     {
> >         LOCK();                       // Do some MT-locking here
> >         if (!pInstance)
> >             pInstance = new MySingleton;
> >         UNLOCK();
> >     }
> >     return pInstance;
> > }
>
<snip>

> pInstance = static_cast<MySingleton*>(           // 3
>     operator new(sizeof(MySingleton)));          // 1
> new (pInstance) MySingleton;                     // 2
>
> If we plop this into the original function, we get this:
>
> > MySingleton *MySingleton::Instance(void)
> > {
> >     if (!pInstance)                                  // Line 1
> >     {
> >         LOCK();                                      // Do some MT-locking here
> >         if (!pInstance)
> >         {
> >             pInstance = static_cast<MySingleton*>(
> >                 operator new(sizeof(MySingleton)));  // Line 2
> >             new (pInstance) MySingleton;
> >         }
> >         UNLOCK();
> >     }
> >     return pInstance;
> > }
>
> So consider this sequence of events:
> - Thread A enters MySingleton::Instance, executes through Line 2, and is
> suspended.
> - Thread B enters MySingleton::Instance, executes Line 1, sees that pInstance
> is non-null, and returns. It then merrily dereferences the pointer, thus
> referring to memory that does not yet hold an object.
>
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.
>
> Scott
>

How about this:

class MySingleton
{
    static MySingleton* pInstance;
    static bool bInstanceInitialized;
    ...
};

MySingleton* MySingleton::pInstance = 0;
bool MySingleton::bInstanceInitialized = false;

MySingleton* MySingleton::Instance(void)
{
    if (!bInstanceInitialized)
    {
        LOCK();   // Do some MT-locking here

        if (!bInstanceInitialized)
        {
            pInstance = new MySingleton;
            bInstanceInitialized = true;
        }
        UNLOCK();
    }

    return pInstance;
}

Isn't the assignment to bInstanceInitialized guaranteed to occur after
pInstance is assigned and fully initialized?

Kevin Cline

May 5, 2003, 6:48:14 AM
Scott Meyers <Use...@aristeia.com> wrote in message news:<MPG.191d9257d...@news.hevanet.com>...
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.

There is no portable way to avoid this problem. Portable
synchronization of multiple threads cannot be done without making a
call to the thread support library. All other attempts rest on
assumptions about the compiler and/or the hardware and hence are not
portable.

OTOH, there's nothing that requires calls to the thread support
library to be inefficient. They could even be inlined.

Siemel Naran

May 5, 2003, 9:24:32 AM
"Raoul Gough" <Raoul...@yahoo.co.uk> wrote in message
news:b92mr9$emq2u$1@ID-

> I suppose you've already looked at William Kempf's work on
> boost::once? It looks fairly easy with Posix threads, slightly harder
> with Windows threads. I guess other platforms with named mutexes could
> use a similar approach to the Windows code. The relevant
> implementation file is:
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/boost/boost/libs/thread
> /src/once.cpp?rev=HEAD&content-type=text/vnd.viewcvs-markup

I've read it but don't understand it. Can you please explain what is going
on? And where can one find a book on boost (you can email me at
Sieme...@att.net)? Thanks.

--
+++++++++++
Siemel Naran

Siemel Naran

May 5, 2003, 9:27:07 AM
"Phil Carmody" <thefatphi...@yahoo.co.uk> wrote in message
news:pan.2003.05.04....@yahoo.co.uk...

Is the use of asm considered portable? I think if we can make successive
statements atomic (so that each thread must execute all the statements
before it can be suspended and another thread can take charge) then I think
we'd be all set.


> Wouldn't a volatile pInstance prevent the assigning to pInstance before
> the right hand side of the = (i.e. the new) has completed?

We also want to prevent assigning to pInstance until after the
initialization has finished. Can we say

pTempInstance = new MySingleton;
pInstance = pTempInstance;

But by the original quote there must be a more subtle reason why this won't
work.

--
+++++++++++
Siemel Naran

Joseph Seigh

May 5, 2003, 4:13:47 PM

I wrote:
>
> There are two ways to implement thread ownership of a copy. One is by allocating

> the local pointer in thread local storage (TLS). ...

Scratch the TLS stuff. The problem with that route is that you need a key to index into
your local storage, and that key needs to be initialized -- initialization being what you
are trying to accomplish in the first place.

You can use pthread_once to do this but then you might as well use it instead of DCL
since it's functionally equivalent.

The problem with POSIX threads is that it is performance neutral in the same sense that
C++ is threads neutral. pthread_once may be, and probably is on most platforms, implemented
via DCL, but you don't really know for sure. So there is this recurring tendency to use
techniques (flawed as they may be) that visibly promise performance over techniques that
may or may not give you the desired performance.

Thread local storage (or thread specific data) also suffers from this. The extra overhead
from using thread local storage should be, at most, one extra load instruction, but one
doesn't know this for sure either, so it's not uncommon to see attempts to do this explicitly.

Philippe Mori

May 5, 2003, 4:17:46 PM
> > if (!pInstance)                                  // Line 1
> > {
> >     LOCK();                                      // Do some MT-locking here
> >     if (!pInstance)
> >     {
> >         pInstance = static_cast<MySingleton*>(
> >             operator new(sizeof(MySingleton)));  // Line 2
> >         new (pInstance) MySingleton;
> >     }
> >     UNLOCK();
> > }
> > return pInstance;
>

Using a temporary (possibly static) should help to fix the problem, and to
ensure that the compiler does not optimize too much, the assignment may be
done in a function defined externally (and possibly compiled without "global"
optimizations).

Something like:

if (!pInstance)
{
    LOCK();
    static MySingleton* tmp = new MySingleton();
    pInstance = do_assign(tmp);   // do_assign performs the assignment
                                  // but is compiled unoptimized
    UNLOCK();
}


Or maybe passing the constructor the address at which to store the result...

new MySingleton(&pInstance);

And making sure that assignment is executed last... Probably not much better...

Or calling a virtual member to set pInstance:

tmp = new MySingleton;
tmp->Assign(&pInstance);

Jeff Kohn

May 5, 2003, 5:03:42 PM
It seems to me that one purpose of the double-checked locking pattern (aside
from avoiding a mutex or similar lock) is to achieve lazy instantiation. My
question is: what if you don't need lazy instantiation, because you know
your singleton will always be used, or the cost of instantiating it isn't
all that great? In that case, is there a way to implement the singleton
without the locking overhead?
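One answer to this question, sketched with a hypothetical MySingleton: if lazy instantiation isn't needed, the instance can be created eagerly during static initialization, before main() runs and (in most programs) before any additional threads exist, so Instance() needs no locking at all:

```cpp
// Eager (non-lazy) singleton: the instance is constructed during static
// initialization, so calls to Instance() after main() begins involve no
// locking and no lazy-initialization race.
class MySingleton {
public:
    static MySingleton& Instance() { return instance; }
    int value() const { return 42; }   // illustrative payload
private:
    MySingleton() = default;
    static MySingleton instance;       // constructed before main()
};

MySingleton MySingleton::instance;     // the one eager instance
```

The trade-off is the classic static-initialization-order problem: if another static object's constructor calls Instance() from a different translation unit, the instance may not be constructed yet, which is precisely the problem lazy initialization exists to solve.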

Jeff

Scott Meyers

May 5, 2003, 5:04:00 PM
On 5 May 2003 06:41:36 -0400, Pete Becker wrote:
> Store to a local pointer and then copy the result into pInstance, so
> that there's a sequence point between the call to new and the change to
> pInstance.

This doesn't help, for two reasons. First, the local pointer can be
optimized out of existence. Second, sequence points don't help. See my
recent reply to Shmulik Flint for more on sequence points.

> But that doesn't solve the real problem, which is at the
> hardware level: in the absence of memory barriers (typically obtained by
> locking and unlocking a mutex) there's no guarantee that the changes
> made by the constructor will be visible to other threads before the
> change made to the pointer value becomes visible.

This is a separate problem, because, if my understanding is correct, it
applies only to multiprocessor machines. In that case, memory barriers are
definitely necessary, but, again if my understanding is correct, memory
barriers do nothing to solve the uniprocessor case I posted here.

Scott

Scott Meyers

May 5, 2003, 5:04:35 PM
On 5 May 2003 06:46:11 -0400, Shmulik Flint wrote:
> Isn't the assignment to bInstanceInitialized guaranteed to occur after
> pInstance is assigned and fully initialized?

The standard says that it must look that way from the point of view of the
abstract machine defined by the standard. But the as-if rule says that
things can be arbitrarily reordered as long as nobody can tell. Since the
standard says nothing about threading, it does not concern itself with
statement reorderings that are as-if-safe under one thread of control but
as-if-unsafe under multiple threads of control.

This is why sequence points are, in general, useless for solving this
problem. They apply only to as-if-serial semantics.

Scott

Raoul Gough

May 5, 2003, 5:05:29 PM
"Siemel Naran" <Sieme...@KILL.att.net> wrote in message
news:hunta.66936$cO3.4...@bgtnsc04-news.ops.worldnet.att.net...

> "Raoul Gough" <Raoul...@yahoo.co.uk> wrote in message
> news:b92mr9$emq2u$1@ID-
>
> > I suppose you've already looked at William Kempf's work on
> > boost::once? It looks fairly easy with Posix threads, slightly harder
> > with Windows threads. I guess other platforms with named mutexes could
> > use a similar approach to the Windows code. The relevant
> > implementation file is:
> >
> > http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/boost/boost/libs/thread
> > /src/once.cpp?rev=HEAD&content-type=text/vnd.viewcvs-markup
>
> I've read it but don't understand it. Can you please explain what is going
> on? And where can one find a book on boost (you can email me at
> Sieme...@att.net)? Thanks.

Well, I never really claimed to understand it myself :-) I guess the
Windows version is the only relevant one here. As far as I know, the
trick with the named mutex is to avoid problems with having to
initialize a static mutex object somehow (which would become the
original problem all over again).

It uses a simple atomic type as an initialization flag, and uses a
platform-specific memory-interlock function to access and update it
(avoiding caching problems). Maybe you can get more info from the
boost mailing list archives:

http://groups.yahoo.com/group/boost/message/15433
and http://lists.boost.org/MailArchives/boost/msg15423.php

BTW, my posting a link to the raw implementation probably didn't
explain very much. The documentation is online at
http://www.boost.org/

--
Raoul Gough
see http://home.clara.net/raoulgough/ for my work availability

Thomas Mang

May 6, 2003, 5:41:20 AM

Scott Meyers wrote:

> On 4 May 2003 10:55:24 -0400, Phil Carmody wrote:
> > Wouldn't a volatile pInstance prevent the assigning to pInstance before
> > the right hand side of the = (i.e. the new) has completed?
>
> No. Declaring pInstance volatile will force reads of that variable to come
> from memory and writes to that variable to go to memory, but what we need
> here is a way to say that pInstance should not be written until the
> Singleton has been constructed. That is, we need to tell the compiler to
> respect a temporal ordering that is stricter than the as-if rule. As far
> as I know, there is no way to do that. Certainly volatile doesn't do it.

Doesn't this show a fundamental problem of the Standard?
As-if seems not to be as-if.
What is as-if? How can as-if apply when the observable behavior of the program
might be different under certain circumstances?


Thomas

Micke

May 6, 2003, 5:42:48 AM
Scott Meyers wrote:
>
> On 4 May 2003 10:55:24 -0400, Phil Carmody wrote:
> > Wouldn't a volatile pInstance prevent the assigning to pInstance before
> > the right hand side of the = (i.e. the new) has completed?
>
> No. Declaring pInstance volatile will force reads of that variable to come
> from memory and writes to that variable to go to memory, but what we need
> here is a way to say that pInstance should not be written until the
> Singleton has been constructed. That is, we need to tell the compiler to
> respect a temporal ordering that is stricter than the as-if rule. As far
> as I know, there is no way to do that. Certainly volatile doesn't do it.
>

Well, yes and no. If one takes into account exceptions during
construction, and 1.9:6 ("The observable behavior of the abstract machine
is its sequence of reads and writes to volatile data and
calls to library I/O functions."), then the compiled assignment can't write
something into the volatile pInstance and then take two steps back and
erase it if it stumbles upon an exception during the construction of the
Singleton.

/Micke

Andrei Alexandrescu

May 6, 2003, 5:48:59 AM
"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.191d9257d...@news.hevanet.com...
[snip]

> If there's a portable way to avoid this problem in the presence of
> aggressive optimizing compilers, I'd love to know about it.

Your article is pretty much a manifesto for the necessity of adding threads
semantics to the standard. Any portable trick you'd ever think of, one can
name a more or less standard and usual compiler optimization that would blow
it away. At the end of the day, you do need to communicate your exact true
intents to your compiler - achieving the desired effect by outsmarting it is
hardly a long-lasting solution.

Speaking of DCLP, I reached the conclusion that the semi-portable technique
that is of the lightest weight for the "initialize once read many" problem
is to use atomic integers like this:

* The Singleton has an integer associated with it
* That integer is statically initialized to -1
* Each thread does an atomic_increment upon entrance
* The first thread that gets the transition -1 -> 0 will initialize the
Singleton
* After initialization, the thread that did it sets the integer to INT_MIN.
* If the value after increment is > 0, then the Singleton is being
initialized right now, so the code will decrement the integer, take a nap,
and repeat.
* If the value after increment is < 0, then the Singleton is up and running,
so decrement the integer and return the pointer.

By adding an extra Boolean the need to decrement during normal access can be
obviated, so the grand total cost per access is one atomic increment (and
test) - which is not really bad. But again, atomic integers do imply thread
awareness of the compiler.
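The counter protocol described in the bullets above can be sketched with C++11 std::atomic, which provides exactly the thread-aware atomic integers the post says are required (names such as MySingleton, guard, and Instance are illustrative, not from the post):

```cpp
#include <atomic>
#include <climits>
#include <thread>

struct MySingleton { int x = 42; };   // hypothetical singleton payload

// States of the guard: -1 = not started; >= 0 = initialization in
// progress; large negative (near INT_MIN) = initialized and ready.
std::atomic<int> guard{-1};
MySingleton* pInstance = nullptr;

MySingleton* Instance() {
    for (;;) {
        int v = guard.fetch_add(1) + 1;   // atomic increment; v is the value after it
        if (v == 0) {                     // we won the -1 -> 0 transition: we initialize
            pInstance = new MySingleton;
            guard.store(INT_MIN);         // publish "up and running"
            return pInstance;
        }
        if (v < 0) {                      // singleton is up and running
            guard.fetch_sub(1);
            return pInstance;
        }
        guard.fetch_sub(1);               // someone else is initializing:
        std::this_thread::yield();        // back off ("take a nap") and retry
    }
}
```

With default sequentially consistent atomics, the store of INT_MIN happens after the constructor's writes, and any thread whose increment observes a negative value synchronizes with that store, so it also sees the fully constructed object.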


Andrei

Scott Meyers

May 6, 2003, 5:51:02 AM
On 5 May 2003 06:41:11 -0400, Joseph Seigh wrote:
> But each thread only has to perform that synchronization once. Obviously a
> simple global pointer isn't a good way for each thread to remember that.
> But using a copy of the pointer owned by the thread would work

I agree that this approach seems safe. Or, more precisely (because I am
not a threading expert), this approach appears to avoid the problems
introduced by double-checked locking. As I've written earlier in this
thread (or perhaps in the corresponding thread in comp.std.c++), my claim is not
that there is no thread-safe way to initialize a singleton. My claim is
only that double-checked locking is not one of those ways.

Scott

Scott Meyers

May 6, 2003, 7:35:24 AM
On 5 May 2003 09:27:07 -0400, Siemel Naran wrote:
> Is the use of asm considered portable? I think if we can make successive
> statements atomic (so that each thread must execute all the statements
> before it can be suspended and another thread can take charge) then I think
> we'd be all set.

I don't see any reason why a thread can't be suspended in the middle of a
sequence of statements, or even in the middle of a single statement, just
because they are written in asm.

> We also want to prevent assigning to pInstance until after the
> initialization has finished. Can we say
>
> pTempInstance = new MySingleton;
> pInstance = pTempInstance;

We can say it, but there is no guarantee that it will do any good. Again,
not only can good optimizing compilers optimize pTempInstance out of
existence, they can also reorder statements so that pInstance is assigned
before the singleton is constructed.

In general, it is very difficult to figure out from source code (in any
language) the order in which the underlying machine will actually carry out
the operations. Programming language semantics dictate what you will
observe, not what will actually happen.

Scott

Shmulik Flint

May 6, 2003, 7:35:42 AM
Scott Meyers <Use...@aristeia.com> wrote in message news:<MPG.192031bea...@news.hevanet.com>...

> On 5 May 2003 06:46:11 -0400, Shmulik Flint wrote:
> > Isn't the assignment to bInstanceInitialized guaranteed to occur after
> > pInstance is assigned and fully initialized?
>
> The standard says that it must look that way from the point of view of the
> abstract machine defined by the standard. But the as-if rule says that
> things can be arbitrarily reordered as long as nobody can tell. Since the
> standard says nothing about threading, it does not concern itself with
> statement reorderings that are as-if-safe under one thread of control but
> as-if-unsafe under multiple threads of control.
>
> This is why sequence points are, in general, useless for solving this
> problem. They apply only to as-if-serial semantics.

What if we change it so we can tell:


if (!bInstanceInitialized)
{
    LOCK();   // Do some MT-locking here

    std::cout << bInstanceInitialized << std::endl;
    if (!bInstanceInitialized)
    {
        std::cout << bInstanceInitialized << std::endl;
        pInstance = new MySingleton;
        std::cout << bInstanceInitialized << std::endl;
        bInstanceInitialized = true;
        std::cout << bInstanceInitialized << std::endl;
    }
    UNLOCK();
}

Avoiding the problem of unwanted output to cout, will this solve the
problem? In such a case, it seems the compiler is forced to set
bInstanceInitialized only after setting pInstance, right?
If so, the calls to cout might be replaced with some other calls
that make the changes to bInstanceInitialized "visible" without
actually writing to cout.

James Kanze

May 6, 2003, 2:38:54 PM
Scott Meyers <Use...@aristeia.com> wrote in message
news:<MPG.191ef349d...@news.hevanet.com>...

> On 4 May 2003 10:55:24 -0400, Phil Carmody wrote:

> > Wouldn't a volatile pInstance prevent the assigning to pInstance
> > before the right hand side of the = (i.e. the new) has completed?

> No. Declaring pInstance volatile will force reads of that variable to
> come from memory and writes to that variable to go to memory,

It's more vague than that. Volatile will force the variable to be
"accessed", but the definition of what an access consists of is up to
the implementation.

On my platform (Sparc under Solaris 2.8), both g++ and Sun CC consider
that generating the machine instruction to write to memory is
sufficient. That instruction, however, is finished once the order to
write has been issued within the CPU hardware: the CPU guarantees that a
read by the same processor will see the new value, but there is no
guarantee that a write bus cycle has left the processor (or that writes
will leave in order), and of course, even a write bus cycle may only
modify the processor-specific cache. There is still no guarantee that
other processors (and thus other threads) can see the modification.

Whether this conforms with the intent of the standard, I'm not sure, but
it is what the compilers really do. At least on Sparc under Solaris, but
I suspect that the situation is general.

The rest of your comments are right on the mark. In this particular
case, even if volatile did what people seem to think it does, the code
wouldn't work. My comments are only to point out that in principle,
volatile CANNOT help here; it just doesn't provide the necessary
guarantees.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16

Hyman Rosen

May 6, 2003, 3:33:42 PM
Thomas Mang wrote:
> Doesn't this show a fundamental problem of the Standard?

No, except that it's a problem that the Standard doesn't
address multiprogramming.

The basic problem is the fundamental wrongness of attempting
to synchronize access to a common resource without using the
proper synchronization mechanisms defined for such use. Just
because a subset of programmers wish there was a way to do
this using other means, and whine when told there isn't,
doesn't make this a problem of the standard.

Pete Becker

May 6, 2003, 3:34:25 PM
Scott Meyers wrote:
>
> This is a separate problem, because, if my understanding is correct, it
> applies only to multiprocessor machines.

There are fewer complications on single processor systems than on
multi-processor systems, but they don't go away. You still need to
synchronize access unless you can guarantee that a pointer can be
written and read atomically.

> In that case, memory barriers are
> definitely necessary, but, again if my understanding is correct, memory
> barriers do nothing to solve the uniprocessor case I posted here.
>

Whether or not you can persuade the compiler to stick to the rules about
sequence points, you still have to deal with synchronization. Once
you've properly dealt with synchronization, optimizations around
sequence points don't affect behavior.

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

unread,
May 6, 2003, 3:39:28 PM5/6/03
to
"Siemel Naran" <Sieme...@KILL.att.net> wrote in message
news:<nunta.66939$cO3.4...@bgtnsc04-news.ops.worldnet.att.net>...

You tell me. I've got a line
asm( "ta 3" ) ;
in one of my programs for Sparc; do you really think that it will do
anything reasonable on an Intel? (FWIW: this instruction ensures the
flush of the register stack to memory -- I don't even think that there
is an equivalent operation on Intel.)

> I think if we can make successive statements atomic (so that each
> thread must execute all the statements before it can be suspended
> and another thread can take charge) then I think we'd be all set.

If C++ statements were atomic, multithreading would be trivial (or at
least a lot easier). They aren't, and I don't think that there is any
way they could be.

On some platforms, it is possible to make some small set of fairly
primitive operations (on very primitive types) atomic. Most platforms
have an atomic register with memory swap. Sparc v9 (but not earlier)
has a means of creating an apparently atomic increment and decrement.
Intel has a prefix which will make any single instruction atomic,
including increment or decrement of memory (but you can't do that and
also read the value in an atomic fashion, although you can test if the
results of the operation were zero). Other processors have other
possibilities. But this is all about as far from portable as you can
imagine.

> > Wouldn't a volatile pInstance prevent the assigning to pInstance
> > before the right hand side of the = (i.e. the new) has completed?

> We also want to prevent assigning to pInstance until after the
> initialization has finished. Can we say

> pTempInstance = new MySingleton;
> pInstance = pTempInstance;

> But by the original quote there must be a more subtle reason why this
> won't work.

Nothing subtle. There are two major problems here:

- although you're guaranteed that the "write" to pInstance will not
take place before the beginning of the statement, nor after the end,
you've introduced no ordering, since all of the other writes,
including those in the constructor, can take place whenever the
compiler feels like it, and

- the implementation gets to choose what "write" means -- most
consider just passing the value to some sort of pipeline, for some
future write to a processor local cache, is good enough.

You could enforce write ordering by means of a second lock, something
along the lines of:

    if ( pInstance == NULL ) {
        Locker l1( lock1 ) ;
        if ( pInstance == NULL ) {
            pInstance = createSingleton() ;
        }
    }

    MySingleton*
    createSingleton()
    {
        Locker l2( lock2 ) ;
        return new MySingleton ;
    }

Freeing the lock2 is a memory barrier (at least according to Posix, but
I suppose that Windows has similar rules), so all of the writes in the
constructor must actually be in main memory before returning from the
function.

Regretfully, this isn't enough, because you have only ensured cache
consistency on the processor where the locks were executed. A second
thread could very well come along and find pInstance non-null, because
the pointer wasn't in its cache and it had to read the value from main
memory, but find that what the pointer refers to *was* in its cache, and
use those stale values rather than what was written by the constructor.

You could probably avoid this by inserting explicit memory barriers
around the whole function, in a very processor-dependent manner (using
asm on a Sparc, for example). But is it worth the bother? I recently
did some benchmarks (on my Sparc): acquiring a lock, and freeing it
later, costs less than 90 nanoseconds; the function call itself costs
around 20 nanoseconds. And if it is still too slow, you can always call
the function once, before the critical area, and keep a pointer to the
instance locally.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

johnchx

unread,
May 7, 2003, 5:58:21 AM5/7/03
to
"Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote

> My
> question is, what if you don't need lazy instantiation, because you know
> your singleton will always be used or the cost isn't all that great to
> instantiate it? In that case is there a way to implement the singleton
> without the locking overhead?

I'm going to go out on a limb and say...yes, I think so, if (and only
if) you can guarantee that initialization takes place while the
program is executing only a single thread. In that case, something as
simple as

Singleton* Singleton::Instance() {
    static Singleton* p_instance;
    if (p_instance) return p_instance;
    p_instance = new Singleton;
    // insert memory barrier here, just in case
    return p_instance;
}

should work. You still need a memory barrier to ensure that the
initialized state of the Singleton instance is written out to main
memory before the function returns (just in case the program spawns a
thread after Instance() returns and the OS schedules the thread on
another CPU and that thread tries to call Instance() at an instant
after p_instance has been updated in memory but before the Singleton's
instance state has been written out to memory). But you only need it
once.

However.... (There has to be a however!)

Guaranteeing that Instance() gets called while the program is only
executing a single thread is harder than one might guess. For
example, it's possible to define a class whose ctor spawns a thread,
then declare an object of that type at file scope...so that the
program "goes multithreaded" at some unspecified point before entering
main(). There's no obvious way to ensure that Instance() gets called
before that happens.

So, I'll answer your question with an unqualified "Yes, but no."

(Of course, there is an alternative that *is* guaranteed to work:
don't write multithreaded programs. ;-) )

Joseph Seigh

unread,
May 7, 2003, 9:16:14 AM5/7/03
to

Andrei Alexandrescu wrote:
...


>
> Speaking of DCLP, I reached the conclusion that the semi-portable technique
> that is of the lightest weight for the "initialize once read many" problem
> is to use atomic integers like this:
>
> * The Singleton has an integer associated with it
> * That integer is statically initialized to -1
> * Each thread does an atomic_increment upon entrance
> * The first thread that gets the transition -1 -> 0 will initialize the
> Singleton
> * After initialization, the thread that did it sets the integer to INT_MIN.
> * If the value after increment is > 0, then the Singleton is being
> initialized right now, so the code will decrement the integer, take a nap,
> and repeat.
> * If the value after increment is < 0, then the Singleton is up and running,
> so decrement the integer and return the pointer.
>
> By adding an extra Boolean the need to decrement during normal access can be
> obviated, so the grand total cost per access is one atomic increment (and
> test) - which is not really bad. But again, atomic integers do imply thread
> awareness of the compiler.

Atomic integers wouldn't work any better. They would need to have memory
ordering characteristics in order for them to work. Integers and pointers
in Java are atomic, and DCL (and some lock-free stuff) is broken in Java
as well. It's the lack of a memory ordering intrinsic that is the reason
why DCL is so problematic. All you need is a fetch barrier after you've
loaded the pointer to the initialized object (per the original example).

Joe Seigh

Joseph Seigh

unread,
May 7, 2003, 2:06:15 PM5/7/03
to

Scratch that last part in my last reply. Not only is DCL subtle,
it's hard to remember all the details when you have figured it out
previously.

Basically an intuitive approach doesn't work very well here. You can't
know when you've gotten it right. You have to use formal logic. In
threading, it's mostly just straight logical inference fortunately.

One of the basic patterns used in DCL and other lock-free algorithms is

1) store data
2) store_barrier
3) store data2

on the write side, and on the read side

4) fetch data2
5) fetch_barrier
6) fetch data

Seeing at point (4) the value stored by (3) implies that
(6) will see all the values stored at (1).

You will sometimes see (2) and (3) together as store w/
release semantics, and (4) and (5) together as fetch w/
acquire semantics.

If data2 is a pointer then it also must be atomic since you
are communicating the pointer value and you don't want an
invalid value obviously. If pointers are not atomic then
you can use a boolean which has been statically initialized.
It doesn't matter if the boolean is not atomic. Testing for
not the initial value is sufficient. So for atomic pointer
types, a correct DCL implementation would look like.


    if (pInstance == 0) {
        LOCK();
        if (pInstance == 0) {
            temp = new MySingleton();
            store_barrier();
            pInstance = temp;
        }
        UNLOCK();
    }
    else
        fetch_barrier();

For non-atomic ptr, implementing with a boolean would look like:

    if (flag != true) {
        LOCK();
        if (flag != true) {
            pInstance = new MySingleton();
            store_barrier();
            flag = true;
        }
        UNLOCK();
    }
    else
        fetch_barrier();


If you have an atomic pointer class with proper acquire and release
semantics, then the DCL implementation in the OP would in fact be
correct, assuming pInstance is an instance of that type. So this thread
does have some relevance to C++. Just don't call that class atomic_ptr
(or warn me if you do), since I'm doing an atomic refcounted pointer
class with that name.

Joe Seigh

witoldk

unread,
May 7, 2003, 2:24:46 PM5/7/03
to
ka...@gabi-soft.de (James Kanze) wrote in message
news:<d6651fb6.03050...@posting.google.com>...

[snip]

>
> If C++ statements were atomic, multithreading would be trivial (or at
> least a lot easier). They aren't, and I don't think that there is any
> way they could be.
>
[snip]

>
> You could enforce write ordering by means of a second lock, something
> along the lines of:
>
>     if ( pInstance == NULL ) {
>         Locker l1( lock1 ) ;
>         if ( pInstance == NULL ) {
>             pInstance = createSingleton() ;
>         }
>     }
>
>     MySingleton*
>     createSingleton()
>     {
>         Locker l2( lock2 ) ;
>         return new MySingleton ;
>     }
>
> Freeing the lock2 is a memory barrier (at least according to Posix, but
> I suppose that Windows has similar rules), so all of the writes in the
> constructor must actually be in main memory before returning from the
> function.

Since you mention Posix, I'm really confused now. I always thought
there was absolutely nothing wrong with double-checked locking on a
Posix OS, that is, as long as Locker internally uses pthread_mutex_t.
If we were talking Posix, there is no need for Locker l2.
That is because a Posix-compliant OS gives certain guarantees about
memory visibility between threads. Specifically, it guarantees that when
the second thread acquires lock l1, it sees all the memory in the state
it was _no_earlier_ than when the first thread released lock l1 (again,
lock l1 somehow involves pthread_mutex_t). And that is regardless (or so
I thought) of the number of processors in the system, or the specific
memory architecture.

>
> Regretfully, this isn't enough. Because you have only ensured cache
> consistency on the processor where the locks were executed. A second
> thread could very well come along, find pInstance non-null, because it
> wasn't in his cache, and he had to read the value from main memory, but
> find that what it pointed to was in his cache, and use those values,
> rather than what was written by the constructor.

I'm in no way a hardware person, so the above is a little confusing to
me; just trying to make sure I understand :)
What seems confusing is where you say that the memory to which pInstance
points was in the second processor's cache, and thus would be used by
the second processor _instead_ of what is in main memory. I thought the
processor cache location(s) would be marked "dirty" in a case like that,
and thus force a cache refresh. Would the cache on the second processor
be "dirty" or not in this case?

witoldk

unread,
May 7, 2003, 2:25:38 PM5/7/03
to
Scott Meyers <Use...@aristeia.com> wrote in message
news:<MPG.191efc0a3...@news.hevanet.com>...

> Please remember that my contention here is that there is no
> portable and reliable way to make double-checked locking work.

Please note that double-checked locking works reliably, and is
portable among POSIX-compliant OSs (environments). There is no
way to make it work otherwise, other than using thread private
storage (the idea is credited to A. Terekhov), as mentioned in
the doc referred in this thread (and many other threads here and
on other NGs :) about double-checked locking being broken.

witoldk

unread,
May 7, 2003, 2:27:04 PM5/7/03
to
john...@yahoo.com (johnchx) wrote in message
news:<4fb4137d.03050...@posting.google.com>...

Just a technicality, alas I believe an important one.
There seems to be way too much confusion generated by the memory barrier
concept as presented here. Please note that what you say above is a
misconception. A memory barrier is not a write-to-memory operation, a
cache flush, or whatever else it is taken for mistakenly. I believe the
memory barrier is best thought of as a "concept". As a concept, it is a
way to ensure order between memory operations. I believe it is very well
described in David Butenhof's book "Programming with POSIX Threads".
Here is what David Butenhof says (page 93): "If each memory access is an
item in a queue, you can think of a memory barrier as a special queue
token. Unlike other memory accesses, however, the memory controller
cannot remove the barrier, or look past it, until it has completed all
previous accesses."

I believe it is worth mentioning since what you say can create a lot of
confusion and makes it more difficult to understand all sorts of m-t
issues (double-checked locking included).

Thomas Mang

unread,
May 7, 2003, 2:30:53 PM5/7/03
to

Hyman Rosen schrieb:

> Thomas Mang wrote:
> > Doesn't this show a fundamental problem of the Standard?
>
> No, except that it's a problem that the Standard doesn't
> address multiprogramming.
>
> The basic problem is the fundamental wrongness of attempting
> to synchronize access to a common resource without using the
> proper synchronization mechanisms defined for such use. Just
> because a subset of programmers wish there was a way to do
> this using other means, and whine when told there isn't,
> doesn't make this a problem of the standard.

Well, I didn't mean the concrete problem shown by the OP, but rather
that there seems to be much (in this case, too much) freedom for
compiler writers to rearrange code - a rearrangement which seems to be
allowed using as-if, but which isn't as-if - and which can't be portably
suppressed.

Wouldn't it be nice to be able to tell the compiler "hey, I need this
exactly the way I wrote it, don't optimize, rearrange or anything"?
Sort of like the 'inline' directive, with the difference that it should
clearly not only be a hint for the compiler, but compulsory.

regards,

Thomas

Kevin Cline

unread,
May 7, 2003, 2:54:58 PM5/7/03
to
shmuli...@bmc.com (Shmulik Flint) wrote in message
news:<8eaa7326.0305...@posting.google.com>...

On a multi-processor machine with multiple levels of memory, memory
writes in one thread may appear out of order in another thread.
Processor 1 stores a 1 in address 10, then a 1 in address 20. Now
processor 2 reads from address 20, and gets 1. Then processor 2 reads
from address 10, finds the address in on-board cache, and gets 0. The
thread synchronization functions (e.g. locking a mutex) also force
cache synchronization.

Balog Pal

unread,
May 7, 2003, 6:53:47 PM5/7/03
to
"Joseph Seigh" <jsei...@xemaps.com> wrote in message news:3EB83C05...@xemaps.com...

> Atomic integers wouldn't work any better. They would need to have memory ordering
> characteristics in order for them to work. Integers and pointer in Java are atomic
> and DCL (and some lock-free stuff) is broken in Java as well. It's the lack of
> a memory ordering intrinsic that is the reason why DCL is so problematic. All you
> need is a fetch barrier after you've loaded the pointer to the initialized object
> (per the original example).

Not only a memory-ordering intrinsic, but a _strong_ sequence point too.
The MEMBAR command can be issued using asm{}, or a function with an OS
call, all right. The problem is how to make the compiler respect that
point as a logical barrier, a real sequence point.

As I read the standard, access to volatiles is considered observable
behavior, and the compiler is not allowed to rearrange it. The part the
standard is silent about is the reordering of regular object accesses
compared to volatiles:

    volatile int v1; v1 = 1;
    int i1;          i1 = 2;
    volatile int v2; v2 = 3;

v1 and v2 are assigned in that order. But the assignment to i1 can
happen anytime: before v1, after v2, or between them.

Paul

witoldk

unread,
May 7, 2003, 6:55:42 PM5/7/03
to
Scott Meyers <Use...@aristeia.com> wrote in message news:<MPG.191d9257d...@news.hevanet.com>...

[snip]

> As an example of one of the "more obvious" reasons why it doesn't work, consider
> this line from the above code:
>
> pInstance = new MySingleton;
>
> Three things must happen here:
> 1. Allocate enough memory to hold a MySingleton object.
> 2. Construct a MySingleton in the memory.
> 3. Make pInstance point to the object.
>
> In general, they don't have to happen in this order. Consider the following
> translation. This isn't code a human is likely to write, but it is a valid
> translation on the part of the compiler under certain circumstances (e.g., when
> static analysis reveals that the MySingleton constructor cannot throw):


>
> pInstance = // 3
> operator new(sizeof(MySingleton)); // 1
> new (pInstance) MySingleton; // 2
>
> If we plop this into the original function, we get this:
>
> > MySingleton *MySingleton::Instance(void)
> > {
> > if(!pInstance) // Line 1
> > {

> > LOCK(); // Do some MT-locking here

> > if (!pInstance)
> pInstance =
> operator new(sizeof(MySingleton)); // Line 2
> new (pInstance) MySingleton;
> > UNLOCK();
> > return sp;
> > }
>
> So consider this sequence of events:
> - Thread A enters MySingleton::Instance, executes through Line 2, and is
> suspended.
> - Thread B enters MySingleton::Instance, executes Line 1, sees that pInstance
> is non-null, and returns. It then merrily dereferences the pointer, thus
> referring to memory that does not yet hold an object.
>
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.

I'm reading the thread sort of backwards, so after posting three
messages, here I am, replying to the OP :)
I am having a hard time understanding why DCL has received so much
attention in the NGs related to the C++ _language_. I believe there is
no such thing as a DCLP problem in C++, nor DCLP being broken in C++,
nor anything of the sort. That is because the C++ language has no
notion of m-t programming, nor does it provide any sync mechanisms
needed inherently by m-t programs.
So, in "plain" C++ there is no need for special treatment of singleton
startup with any locking of any kind.
Now, when threads come into the picture, the situation changes from
plain C++ to C++ with some sort of mt support. This mt support (lib) is
not part of C++, so in order to start a "new" thread of execution the
otherwise standard C++ code must call a non-standard C++ mt library.
In light of the above I see little logic in worrying about how to
portably and reliably make DCL work using _only_standard_C++ mechanisms.
This makes little sense, as in order to start a thread (other than the
"main" thread :) the code has to use _non_standard_C++ to begin with.
I believe most (likely all) mt support libs provide a mechanism to
make DCLP work without a sweat. I know that the environment which I use
to write mt code (POSIX) provides an easy way to solve all the issues
you mention in the last couple of lines of your post. This mechanism
is as trivial as a mutex. Other environments likely provide similar
mechanisms, probably as simple.
I believe there is a consensus as to the fact that there is no way,
nor there is a need to make DCLP work in standard C++.
In an mt environment, if the OS is POSIX compliant, the issues you mention:


"
> - Thread A enters MySingleton::Instance, executes through Line 2, and is
> suspended.
> - Thread B enters MySingleton::Instance, executes Line 1, sees that
> pInstance is non-null, and returns. It then merrily dereferences
> the pointer, thus referring to memory that does not yet hold an object.
"

do not exist, as thread B can never get into the trouble you describe
if only the LOCK() acquires (locks :) the mutex, and UNLOCK() unlocks the
mutex. This is guaranteed by the POSIX standard.

So, after reading so many discussions about DCLP not working (reading
on C++ NGs that is) I still fail to understand why people would expect
it to work, nor I understand why would people need it to work in
standard C++.

Joaquín Mª López Muñoz

unread,
May 7, 2003, 6:56:52 PM5/7/03
to
Scott Meyers <Use...@aristeia.com> wrote in message news:<MPG.191ef349d...@news.hevanet.com>...

> On 4 May 2003 10:55:24 -0400, Phil Carmody wrote:
> > Wouldn't a volatile pInstance prevent the assigning to pInstance before
> > the right hand side of the = (i.e. the new) has completed?
>
> No. Declaring pInstance volatile will force reads of that variable to come
> from memory and writes to that variable to go to memory, but what we need
> here is a way to say that pInstance should not be written until the
> Singleton has been constructed. That is, we need to tell the compiler to
> respect a temporal ordering that is stricter than the as-if rule. As far
> as I know, there is no way to do that. Certainly volatile doesn't do it.
>
> Scott
>

Thinking about DCL some time ago, I came up with a candidate technique
for ensuring that a write to some volatile a1 always precedes a write
to volatile a2 (without memory barriers):

volatile int a1;
volatile int a2;
typedef int (*a2_calculator_t)();
a2_calculator_t a2_calculator;

int an_a2_calculator()
{
    a1 = 1;
    return 1;
}

...

a2_calculator = an_a2_calculator; // somewhere in the program

...

a2 = a2_calculator();

The key point in this technique is that, as a2_calculator is a pointer
to function, the last statement cannot be inlined without resorting to
a global analysis of the behavior of the program, which is not only
unlikely, but ultimately impossible in a theoretical sense (it'd require
an optimizer with unattainable deductive power; this is related to the
Turing halting problem).
So, assuming the call through a2_calculator is not inlined, the compiler
cannot hoist the assignment to a2 before the call, thus an_a2_calculator
really is called before the assignment. As per standard 1.9.17 there is
a sequence point between these two actions, so we're done: a1 is
committed to memory before a2.

If I were right, this could be applied to a resolution of DCL,
but I'm no fool and I know lots of people have given thought
to the DCL and dismissed it. Comments are welcome, though.

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

Jeff Kohn

unread,
May 7, 2003, 7:01:41 PM5/7/03
to

"johnchx" <john...@yahoo.com> wrote in message
news:4fb4137d.03050...@posting.google.com...

> "Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote
>
> > My
> > question is, what if you don't need lazy instantiation, because you
know
> > your singleton will always be used or the cost isn't all that great to
> > instantiate it? In that case is there a way to implement the singleton
> > without the locking overhead?
>
> I'm going to go out on a limb and say...yes, I think so, if (and only
> if) you can guarantee that initialization takes place while the
> program is executing only a single thread. In that case, something as
> simple as
>
> Singleton* Singleton::Instance() {
> static Singleton* p_instance;
> if (p_instance) return p_instance;
> p_instance = new Singleton;
> // insert memory barrier here, just in case
> return p_instance;
> }
>

I'm not sure what this accomplishes; having to guarantee that Instance()
is called by a single thread before any other threads would be pretty
hard to guarantee IMHO. Besides, if I'm understanding the other posts in
this thread, if you have a way of inserting a memory barrier it should
be possible to make the DCL pattern work. (I don't know how to implement
a memory barrier for WinTel yet anyway; never got around to learning x86
assembler.)

It seems to me the difficulty that DCL attempts to overcome is the fact
that the singleton instance isn't created until the first time
Instance() is called.

Would the following work?

// in MySingleton.h
//
class MySingleton
{
public:
    static MySingleton& Instance();
    ...

private:
    MySingleton(){}
    MySingleton& operator=(const MySingleton&);

    static MySingleton instance_;
};


// in MySingleton.cpp
//
MySingleton MySingleton::instance_;

MySingleton& MySingleton::Instance()
{
    return instance_;
}


Now, I know there's one potential problem with this, namely that there's no
guarantee about the order of initialization for statics in different files.
But that's only an issue if the singleton is used from the constructor of
another singleton, right? It seems to me that in many cases that's not such
a serious limitation, since it's not hard to be sure that won't happen. I
know, in large projects with multiple developers and lots of singletons, it
might be a little risky to have such a limitation without any easy way to
enforce it, but in some projects it might not be such an issue. Am I missing
something?

Jeff

Chris Carton

unread,
May 7, 2003, 7:06:04 PM5/7/03
to
>
> If there's a portable way to avoid this problem in the presence of aggressive
> optimizing compilers, I'd love to know about it.
>
> Scott
>

What if you modify the MySingleton class so that it stores a copy of its
this pointer (initialized by its constructor) and use that to assign to
pInstance. So it looks like:


MySingleton *MySingleton::Instance(void)
{
    if(!pInstance)    // Line 1
    {
        LOCK();       // Do some MT-locking here

        if (!pInstance) {
            MySingleton *tmp = new MySingleton;
            pInstance = tmp->this_pointer;
        }
        UNLOCK();
    }
    return pInstance;
}

-Chris

WitoldK

unread,
May 7, 2003, 9:28:46 PM5/7/03
to

"Thomas Mang" <a980...@unet.univie.ac.at> wrote in message
news:3EB94247...@unet.univie.ac.at...

>
>
> Hyman Rosen schrieb:
>
> > Thomas Mang wrote:
> > > Doesn't this show a fundamental problem of the Standard?
> >
> > No, except that it's a problem that the Standard doesn't
> > address multiprogramming.
> >
> > The basic problem is the fundamental wrongness of attempting
> > to synchronize access to a common resource without using the
> > proper synchronization mechanisms defined for such use. Just
> > because a subset of programmers wish there was a way to do
> > this using other means, and whine when told there isn't,
> > doesn't make this a problem of the standard.
>
> Well, I didn't mean the concrete problem shown by the OP, but rather
> that there seems to be much (in this case, too much) freedom for
> compiler writers to rearrange code - a rearrangement which seems to be
> allowed using as-if, but which isn't as-if - and which can't be
> portably suppressed.
>
> Wouldn't it be nice to be able to tell the compiler "hey, I need this
> exactly the way I wrote it, don't optimize, rearrange or anything"?
> Sort of the 'inline' directive, with the difference it should clearly
> not only be a hint for the compiler, but compulsory.

Since C++ has no notion of multiple threads there is no need to do it.

johnchx

unread,
May 7, 2003, 10:45:32 PM5/7/03
to
wit...@optonline.net (witoldk) wrote
> john...@yahoo.com (johnchx) wrote
> > You still need a memory barrier to ensure that the
> > initialized state of the Singleton instance is written out to main
> > memory before the function returns (just in case the program spawns a
> > thread after Instance() returns and the OS schedules the thread on
> > another CPU and that thread tries to call Instance() at an instant
> > after p_instance has been updated in memory but before the
> > Singleton's instance state has been written out to memory).
>
> Just a technicality, alas I believe an important one.
> There seems to be way too much confusion generated by the memory
> barrier concept as presented here. Please note, that what you say
> above is a misconception. Memory barrier is not a write to memory
> operation, cache flush or whatever else it is taken for mistakenly.

Yes, that's correct. I should have said something like "is made
globally visible" rather than "has been written out to memory." (The
former is the language Intel uses to talk about MFENCE, SFENCE and
LFENCE, which is the particular platform/model I have in the back of
my mind...other platforms may differ.) In practice, the write
operations can be made globally visible without a cache flush.

Sorry if I'm adding to the confusion! Feel free to correct me further
if it sounds like I'm still not getting it. :-)

Drew Hall

unread,
May 8, 2003, 7:16:06 AM5/8/03
to
"Thomas Mang" <a980...@unet.univie.ac.at> wrote:
>
> Well, I didn't mean the concrete problem shown by the OP, but rather
> that there seems to be much (in this case, too much) freedom for
> compiler writers to rearrange code - a rearrangement which seems to be
> allowed using as-if, but which isn't as-if - and which can't be
> portably suppressed.
>
> Wouldn't it be nice to be able to tell the compiler "hey, I need this
> exactly the way I wrote it, don't optimize, rearrange or anything"?
> Sort of the 'inline' directive, with the difference it should clearly
> not only be a hint for the compiler, but compulsory.

This would be nice, but even if we cracked that nut, modern processors
do code rearrangement of their own. It's called "Out-of-order
Execution" (OOE) and Intel & others have been hyping it as a feature
for a few years now. In the face of this, I think we really need
cooperation between the processor, OS, and the compiler to make any
guarantees.
Drew

Joshua Lehrer

unread,
May 8, 2003, 7:16:44 AM5/8/03
to
"Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote in message news:<UZdua.9882$PD6.2...@twister.austin.rr.com>...

> Am I missing something?
>
> Jeff
>
>

Yes, your solution constructs the object every time, even if it is
never used. The DCL pattern is attempting to hold off on the
construction of the object until the first time it is needed.

joshua lehrer
factset research systems
NYSE:FDS

Rob

unread,
May 8, 2003, 7:55:16 AM5/8/03
to
"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.191d9257d...@news.hevanet.com...
> [I just posted this to comp.std.c++ as a followup to a February thread
(I'm
> way behind...) , but it's of more general interest, I think, so I'm
posting
> here, too. -- Scott]
>
> On Mon, 17 Feb 2003 17:09:35 +0000 (UTC), Christoph Rabel wrote (to
> comp.std.c++):
> > MySingleton *MySingleton::Instance(void)
> > {
> > if(!pInstance)

> > {
> > LOCK(); // Do some MT-locking here
> > if (!pInstance)
> > pInstance = new MySingleton;
> > UNLOCK();
> > return sp;
> > }
>
> This is the double-checked locking pattern. I recently drafted an article
on
> this topic for CUJ. As I sit here in a pool of my own blood based on the
> feedback I got from pre-pub reviewers, I feel compelled to offer the
following
> observation: there is, as far as I know, no way to make this work on a
reliable
> and portable basis.

I would agree. The basic problem is that pInstance can be changed on one
thread while it is being read on another. Therefore, both the reading and
modification need to be synchronised (eg by acquiring the lock before
checking the value of pInstance). The algorithm here basically gains some
performance (not always having to acquire the lock) at the potential expense
of correctness.
[Snip]


>
> As an example of one of the "more obvious" reasons why it doesn't work,
> consider this line from the above code:
>
> pInstance = new MySingleton;
>
> Three things must happen here:
> 1. Allocate enough memory to hold a MySingleton object.
> 2. Construct a MySingleton in the memory.
> 3. Make pInstance point to the object.
>
> In general, they don't have to happen in this order.

And they don't need to happen in a different order to cause problems.
On many systems, the simple act of assignment pInstance = something;
is not guaranteed to be an atomic operation. i.e. the assignment
can be preempted by another thread before completion. That results
in a sequence similar to your example, even if we assume the steps of
allocating memory and constructing the objects are atomic.

if (!pInstance)
{
    LOCK();
    if (!pInstance)
    {
        // create new object
        pInstance = address_of_new_object;
    }
    UNLOCK();
}

Consider the sequence in which

Thread A commences the assignment (eg sets the leftmost 8 bits of pInstance)
Thread B preempts it, and checks pInstance. It is non-NULL, so...

>
> If there's a portable way to avoid this problem in the presence of
> aggressive optimizing compilers, I'd love to know about it.
>

Or on any system in which assignment to a pointer is not atomic (i.e. the
operation can be preempted when partly complete).

johnchx

unread,
May 8, 2003, 8:03:41 AM5/8/03
to
"Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote

> I'm not sure what this accomplishes, having to guarantee that Instance() is
> called by a single thread before any other threads would be pretty hard to
> guarantee IMHO.

Agreed. You might be able to accomplish this by decree...if you
happen to be the chief architect (or only programmer) on your project.
(Personally, I try to outlaw spawning threads before entering main(),
but that's just me.)

> Besides, if I'm understanding the other posts in this
> thread, if you have a way of inserting a memory barrier you it should be
> possible to make the DCL pattern work.

That's part of the solution. I think that one confusing thing about
this topic is that there are (at least) two "layers" of re-ordering
going on: the compiler's and the CPU's. That is, the compiler is free
to re-order (and otherwise helpfully modify) the operations in
your code (constrained by the as-if rule) *and* the CPU is free to
execute those instructions in some order other than what the compiler
requests. By themselves, memory barriers really address only the
second kind of reordering. (Though they may affect the first, if the
compiler recognizes the embedded instruction as a barrier to its own
code rearrangement efforts.)

The bigger problem with DCL, as far as I can tell, is that it is based
on lying to the compiler. We're trying to access a
synchronization-sensitive variable without the cost of using ordinary
synchronization mechanisms. That more or less entitles the compiler
to assume that the variable isn't synchronization sensitive and
re-arrange our use of it accordingly.

> (I don't know how to implement a
> memory barrier for WinTel yet anyway, never got around to learning x86
> assembler).

Strangely enough, I don't think the issue usually arises on IA-32 cpus
because they don't re-order writes (err...with a couple of
exceptions). In other words, I'd expect "ordinary" DCL, (hand-coded
in assembly to prevent compiler cleverness), to work properly even on
multi-cpu systems without explicit memory barrier instructions on
IA-32. But I'm not certain about this...others should feel free to
correct me if I'm off base here. My impression, though, is that this
particular problem with DCL arises mainly on other platforms.

>
> It seems to me the difficulty that DCL attempts to overcome is that fact
> that the singleton instance isn't created until the first time Instance() is
> called.
>

That's not a difficulty -- it's a feature! :-)

But seriously, yes, the point of DCL is that we want

(a) lazy initialization (because initialization is costly and might
be unnecessary) or

(b) initialization-on-demand (to overcome order of initialization
issues)

or both, and

(c) we want to avoid paying the cost of synchronizing the "am i
initialized" check on every single call.

So, yes, if you can live without (a) and (b), by all means, avoid this
entire swamp! And if you need (a) or (b), but can live with the cost
of synchronized access to the instance pointer, then do that
("single-checked locking").
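The "single-checked locking" fallback johnchx mentions can be sketched as follows. This uses C++11's std::mutex for brevity (the era's equivalent would be a pthread_mutex_t), and the class/member names are illustrative. Every call pays for the lock, but both the read and the write of the pointer are synchronized:

```cpp
#include <mutex>

// Single-checked locking: lazy initialization, but the pointer is only
// ever touched while holding the mutex.
class Singleton {
public:
    static Singleton* instance() {
        std::lock_guard<std::mutex> guard(mutex_);  // taken on every call
        if (!instance_)
            instance_ = new Singleton;              // lazy, but protected
        return instance_;
    }
private:
    Singleton() {}
    static Singleton* instance_;
    static std::mutex mutex_;
};
Singleton* Singleton::instance_ = 0;
std::mutex Singleton::mutex_;
```

This trades away DCL's goal (c) — you pay the synchronization cost on each call — to keep correctness.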

> Now, I know there's one potential problem with this, namely that there's no
> guarantee about the order of initialization for statics in different files.
> But that's only an issue if the singleton is used from the constructor of
> another singleton, right?

Basically, yes, though it's not necessarily limited to other
singletons. Anything constructed during static initialization could
bring up this issue.

> It seems to me that in many cases that's not such
> a serious limitation, since it's not hard to be sure that won't happen. I
> know, in large projects with multiple developers and lots of singletons, it
> might be a little risky to have such a limitation without any easy way to
> enforce it, but in some projects it might not be such an issue. Am I missing
> something?

No, I agree, at least in principle. In practice, I'd probably be more
skeptical, but there are probably plenty of real world projects where
this would work just fine.

Alexander Terekhov

unread,
May 8, 2003, 8:07:41 AM5/8/03
to

witoldk wrote:
>
> Scott Meyers <Use...@aristeia.com> wrote in message
> news:<MPG.191efc0a3...@news.hevanet.com>...
>
> > Please remember that my contention here is that there is no
> > portable and reliable way to make double-checked locking work.
>
> Please note that double-checked locking works reliably, and is
> portable among POSIX-compliant OSs (environments). There is no
> way to make it work otherwise, other than using thread private
> storage (the idea is credited to A. Terekhov), as mentioned in
> the doc referred in this thread (and many other threads here and
> on other NGs :) about double-checked locking being broken.

Well,

http://google.com/groups?threadm=3AB949A3.45B2371A%40willden.org
(Subject: DCL -- is there a FAQ?)

regards,
alexander.

P.S. Hitler hitler hitler.

johnchx

unread,
May 8, 2003, 8:26:52 AM5/8/03
to
Sigh...I think I'm still not saying this right. One more try.

In the code I originally posted, I used the term "memory barrier" when
I had in mind a more drastic synchronization instruction (perhaps
along the lines of CPUID in Intel-speak).

To re-write the code using a memory barrier (properly speaking) would
require a number of changes -- (a) the addition of an intermediate
variable to hold the result of the new expression, (b) the insertion
of the memory barrier between the initialization of that variable and
the assignment of its value to the static instance pointer variable,
and finally (c) the addition of mutex acquisition and release code and
a second check of the instance pointer (i.e. the whole DCL shebang).
Which would defeat the purpose (such as it was).

So I'll just fix the comment instead. ;-)

Singleton* Singleton::Instance() {
    static Singleton* p_instance;
    if (p_instance) return p_instance;
    p_instance = new Singleton;

    // insert synchronization instruction
    // to block until p_instance actually updated
    return p_instance;
}

Caveat: as originally discussed, the above is meant to be safe only if
it's guaranteed to be called before the program spawns multiple
threads.

Once again, my apologies if my misuse of the term "memory barrier"
confused anyone other than myself. :-)

Kurt Stege

unread,
May 8, 2003, 8:27:14 AM5/8/03
to
I am not sure that I understand exactly what you (witoldk) are
writing about.

On 7 May 2003 14:24:46 -0400, wit...@optonline.net (witoldk) wrote:

>ka...@gabi-soft.de (James Kanze) wrote in message
>news:<d6651fb6.03050...@posting.google.com>...

>> if ( pInstance == NULL ) {
>>     Locker l1( lock1 ) ;
>>     if ( pInstance == NULL ) {
>>         pInstance = createSingleton() ;
>>     }
>> }
>>
>> MySingleton*
>> createSingleton()
>> {
>>     Locker l2( lock2 ) ;

       ^^^^^^^ This line marked

>>     return new MySingleton ;
>> }
>>
>> Freeing the lock2 is a memory barrier (at least according to Posix,

...

>Since you mention Posix, I'm really confused now. I always thought
>there was absolutely nothing wrong with double-checked locking on
>Posix OS, that is as long as Locker internally uses pthread_mutex_t.
>If we were talking Posix, there is no need for Locker l2.

So you are talking about the marked line. Is it necessary, or not?
In my opinion, you (witoldk and James) are both wrong. This kind of
double-checked locking does not work.

>That is because a Posix compliant OS gives certain guarantees about memory
>visibility between threads. Specifically, it guarantees that when the
>second thread acquires lock l1, it sees all the memory in the state it
>was _no_earlier_ than when the first thread released lock l1 (again,
>lock l1 somehow involves pthread_mutex_t). And that is regardless (or so I
>thought) of the number of processors in the system, or the specific
>memory architecture.

I understand the guarantees given by Posix. But the problem is:
the second thread does not acquire or free the lock l1!
It checks, without any locking, without using any memory barriers:


if ( pInstance == NULL )

Posix does not guarantee that the second thread sees pInstance != NULL
only when it also sees the memory pointed to by pInstance as fully
initialized.

And for exactly this reason, each thread has to lock any mutex each time
it accesses the global data pInstance. The introduction of l2 doesn't
help anything in this case.


>> Regretfully, this isn't enough. ...

And James already had seen, that l2 does not solve all problems.

Frankly, I don't see at the moment what problem l2 is supposed
to solve.


>I thought processor cache
>location(s) would be marked "dirty" in case like that, and thus force
>the
>cache refresh. Would the cache on second processor be "dirty" or not in
>this case?

I am not sure. But I suppose the second processor has its own cache,
which is _not_ marked as dirty. It would be too expensive to check
which of the cache lines are hit by the change in memory.

And I suppose, there is a CPU command, a message, that tells all
other CPUs in the system: "Hello, it's urgent, I have written something
important into memory. Mark _all_ your cache lines as dirty!"

Further I suppose, that the posix lock- and unlock-functions will
send this message to coordinate thread synchronization. Alas, clearing
all caches in the other CPUs is quite an expensive operation;
not the clearing itself, but the loading of data from the slow memory.

Sorry that this posting may be off topic.

Regards,
Kurt.

Balog Pal

unread,
May 8, 2003, 8:47:17 AM5/8/03
to
"witoldk" <wit...@optonline.net> wrote in message
news:9bed99bb.03050...@posting.google.com...

> So, after reading so many discussions about DCLP not working (reading
> on C++ NGs that is) I still fail to understand why people would
expect
> it to work,

Uh, if you actually read those discussions you should know the answer.
The real issues here are subtle, the standard is not clear enough,
and the nonexistence of some guarantees is easy to overlook,
especially if you find your "solution" working on all your
compilers/environments. After all, you need a really aggressive compiler
to break the better non-solutions.

> nor I understand why would people need it to work in
> standard C++.

Because the underlying issue is a pain in the @ss. Volatile _would_
work if it would actually turn off the as-if rule for bystanding
nonvolatiles. It would work if sequence points were sequence points. It
would work with a really little patch.
It is like a big unguarded, unsigned, uncovered hole on a highway.
People will keep falling into it until something is done. And note,
DCLP is just a particular case of a theoretical problem. You can fall in
other scenarios that will break code on your next compiler/multiprocessor
environment, and are not so well discussed as this one.

Paul

Raoul Gough

unread,
May 8, 2003, 11:42:15 AM5/8/03
to
"witoldk" <wit...@optonline.net> wrote in message
news:9bed99bb.0305...@posting.google.com...

> Scott Meyers <Use...@aristeia.com> wrote in message
> news:<MPG.191efc0a3...@news.hevanet.com>...
>
> > Please remember that my contention here is that there is no
> > portable and reliable way to make double-checked locking work.
>
> Please note that double-checked locking works reliably, and is
> portable among POSIX-compliant OSs (environments).

I suppose you're only talking about the uniprocessor case? I thought
there were fundamental problems with multiple processors, which means
you can't perform a cheap check (i.e. without synchronization) once
initialization has already taken place.

--
Raoul Gough
see http://home.clara.net/raoulgough/ for my work availability

Chris Carton

unread,
May 8, 2003, 3:11:59 PM5/8/03
to
On Wed, 07 May 2003 18:55:42 +0000, witoldk wrote:

>>
>> If we plop this into the original function, we get this:
>>
>> > MySingleton *MySingleton::Instance(void)
>> > {
>> >     if(!pInstance)                              // Line 1
>> >     {
>> >         LOCK(); // Do some MT-locking here
>> >         if (!pInstance)
>>               pInstance =
>>                   operator new(sizeof(MySingleton)); // Line 2
>>               new (pInstance) MySingleton;
>> >         UNLOCK();
>> >         return sp;
>> >     }
>>

<snip>

> In mt environment, if the OS is POSIX compliant, the issues you mention:
> "
>> - Thread A enters MySingleton::Instance, executes through Line 2, and is
>>   suspended.
>> - Thread B enters MySingleton::Instance, executes Line 1, sees that
>>   pInstance is non-null, and returns. It then merrily dereferences
>>   the pointer, thus referring to memory that does not yet hold an object.
> "
>
> do not exist, as thread B can never get into the trouble you describe
> if only the LOCK() acquires (locks :) the mutex, and UNLOCK() unlocks the
> mutex. This is guaranteed by the POSIX standard.
>

I've seen this point mentioned twice in this thread, but I still fail to
understand it. Thread B never attempts to lock the mutex at all, so I
don't see what guarantees POSIX could specify that would make a
difference. Could you (or someone else) please explain this a little bit
more?

-Chris

witoldk

unread,
May 8, 2003, 5:59:11 PM5/8/03
to
john...@yahoo.com (johnchx) wrote in message news:<4fb4137d.03050...@posting.google.com>...
> wit...@optonline.net (witoldk) wrote
> > john...@yahoo.com (johnchx) wrote
> > > You still need a memory barrier to ensure that the
> > > initialized state of the Singleton instance is written out to main
> > > memory before the function returns (just in case the program spawns a
> > > thread after Instance() returns and the OS schedules the thread on
> > > another CPU and that thread tries to call Instance() at an instant
> > > after p_instance has been updated in memory but before the Singleton's
> > > instance state has been written out to memory).
[snip]

> > Memory barrier is not a write to memory operation,
> > cache flush or whatever else it is taken for mistakenly.
>
> Yes, that's correct. I should have said something like "is made
> globally visible" rather than "has been written out to memory."
[snip]

>
> Sorry if I'm adding to the confusion! Feel free to correct me further
> if it sounds like I'm still not getting it. :-)

I did not mean to suggest you were not getting it. In fact I think that
you do. It is just that "shortcuts" like yours leave a lot of room for
people who do not get it (or are just "new" to the problem) to interpret
it in a wrong way. Anyway, just a technicality :)
I believe looking at the memory barrier as a way to make things "globally
visible" _and_ a way to provide _ordering_ is just the right level of
abstraction for discussions at the programming language level. I do not
mean to suggest it is in any way a simple issue at the OS/hardware level.
What the memory barrier involves on a particular piece of hardware, I do
not know. But just knowing the concept (as above) is enough to know why
DCLP is not at all a problem when the proper sync mechanism (POSIX mutex
or something else in case of a non-POSIX env) is used.

Joseph Seigh

unread,
May 8, 2003, 6:00:09 PM5/8/03
to

Alexander Terekhov wrote:
>
> witoldk wrote:
> >
> > Scott Meyers <Use...@aristeia.com> wrote in message
> > news:<MPG.191efc0a3...@news.hevanet.com>...
> >
> > > Please remember that my contention here is that there is no
> > > portable and reliable way to make double-checked locking work.
> >
> > Please note that double-checked locking works reliably, and is
> > portable among POSIX-compliant OSs (environments). There is no
> > way to make it work otherwise, other than using thread private
> > storage (the idea is credited to A. Terekhov), as mentioned in
> > the doc referred in this thread (and many other threads here and
> > on other NGs :) about double-checked locking being broken.
>
> Well,
>
> http://google.com/groups?threadm=3AB949A3.45B2371A%40willden.org
> (Subject: DCL -- is there a FAQ?)
>

The earliest reference I could find to the TSD/TLS stuff was here
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=397C34EE.401DC7A9%40genuity.com
where I refer to some earlier discussion on it which I haven't been
able to find. Lots of discussion of DCL/lazy instantiation/etc... but no
TLS.

Joe Seigh

Joseph Seigh

unread,
May 8, 2003, 6:08:45 PM5/8/03
to

Alexander Terekhov wrote:
>
> witoldk wrote:
> >
> > Scott Meyers <Use...@aristeia.com> wrote in message
> > news:<MPG.191efc0a3...@news.hevanet.com>...
> >
> > > Please remember that my contention here is that there is no
> > > portable and reliable way to make double-checked locking work.
> >
> > Please note that double-checked locking works reliably, and is
> > portable among POSIX-compliant OSs (environments). There is no
> > way to make it work otherwise, other than using thread private
> > storage (the idea is credited to A. Terekhov), as mentioned in
> > the doc referred in this thread (and many other threads here and
> > on other NGs :) about double-checked locking being broken.
>
> Well,
>
> http://google.com/groups?threadm=3AB949A3.45B2371A%40willden.org
> (Subject: DCL -- is there a FAQ?)

Found it here in a posting by John Hickin
http://www.google.com/groups?threadm=6kuldj$4sk%40bmtlh10.bnr.ca
Subject: Re: MP safe Singleton using double-checked locking

Sorry for the redundant posting. There seems to be a law that
no matter how long you spend looking for something, you will find
it just after you post to a newsgroup about it.

>
> regards,
> alexander.

Ok, now you can invoke Godwin's Law.

Hyman Rosen

unread,
May 8, 2003, 6:09:41 PM5/8/03
to
Thomas Mang wrote:
> Wouldn't it be nice to be able to tell the compiler "hey, I need this
> exactly the way I wrote it, don't optimize, rearrange or anything"?

No. Since you don't have control over the exact form of the generated code,
"exactly the way I wrote it" is not an especially meaningful request. If
you wish to execute a sequence of machine instructions exactly as you want,
why not write that module in assembly language?

Furthermore, many of the rearrangements we've been talking about can take
place at the hardware level regardless of what you write, even in assembly
language, unless you take care to use the proper defined mechanisms for the
platform, be it memory barriers or whatever. Which is the point. If you use
the proper mechanisms, the program will behave properly. But some people
apparently refuse to use the proper mechanisms yet demand that their program
behaves properly regardless. That makes no sense.

Jeff Kohn

unread,
May 8, 2003, 6:33:13 PM5/8/03
to

"Kurt Stege" <kst...@innovative-systems.de> wrote in message
news:b9d7s8$i5fto$1...@ID-54586.news.dfncis.de...

>
> On 7 May 2003 14:24:46 -0400, wit...@optonline.net (witoldk) wrote:
>
> >I thought processor cache
> >location(s) would be marked "dirty" in case like that, and thus force
> >the
> >cache refresh. Would the cache on second processor be "dirty" or not in
> >this case?
>
> I am not sure. But I suppose, the second processor has an own cache,
> which is _not_ marked as dirty. It would be too expensive to check,
> which of the cache lines are hit by the change in memory.

Do we really need to worry about nitty-gritty hardware-level issues? If the
hardware in an SMP system really couldn't tell when one of the CPUs' caches
was stale and needed refreshing, it seems to me that it would be pretty much
impossible to write correct user-level code, not just for DCL but in many
other cases as well. It seems to me that any SMP system that doesn't manage
to keep its caches updated would be fatally flawed.

I'm not trying to say that there aren't any issues with DCL, there clearly
are because of the as-if rule and C++/compiler issues. But I can't help but
wonder if some of the posts in this thread are over-analyzing things.

Jeff

Jeff Kohn

unread,
May 8, 2003, 6:34:06 PM5/8/03
to

"johnchx" <john...@yahoo.com> wrote in message
news:4fb4137d.0305...@posting.google.com...

>
> Singleton* Singleton::Instance() {
>     static Singleton* p_instance;
>     if (p_instance) return p_instance;
>     p_instance = new Singleton;
>     // insert synchronization instruction
>     // to block until p_instance actually updated
>     return p_instance;
> }
>
> Caveat: as originally discussed, the above is meant to be safe only if
> it's guaranteed to be called before the program spawns multiple
> threads.

If you can make that guarantee, why would you need DCL at all? Seems to me
if you can be sure that the singleton will be initialized by a single thread
before all other threads are created, you could just use a local static
instance and be OK. The whole point of DCL is to provide the benefits of a
local static without the risk of threading issues.
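Jeff's local-static alternative can be sketched as below ("Logger" is an illustrative name). Note the caveat: under the C++98/03 rules in force at the time, nothing guarantees this construction is thread-safe, so it is only OK if the first call demonstrably happens before any other threads are spawned. (C++11 later made exactly this initialization thread-safe.)

```cpp
// Function-local static singleton: constructed on the first call,
// no heap allocation, destroyed automatically at program exit.
class Logger {
public:
    static Logger& instance() {
        static Logger theInstance;   // built the first time control
        return theInstance;          // passes through this declaration
    }
private:
    Logger() {}
    Logger(const Logger&);           // non-copyable (declared, not defined)
};
```

Every caller gets a reference to the same object, with no pointer to check.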

Jeff

Jeff Kohn

unread,
May 8, 2003, 7:52:28 PM5/8/03
to

"Joshua Lehrer" <usene...@lehrerfamily.com> wrote in message
news:31c49f0d.0305...@posting.google.com...

> "Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote in message
news:<UZdua.9882$PD6.2...@twister.austin.rr.com>...
> > Am I missing something?
> >
> > Jeff
> >
> >
>
> Yes, your solution constructs the object every time, even if it is
> never used. The DCL pattern is attempting to hold off on the
> construction of the object until the first time it is needed.
>

I realize that, but as I said in my earlier post it's not really an issue
for most of the singletons I've written because they're either always used
or inexpensive to create (often both).

Jeff

witoldk

unread,
May 9, 2003, 5:48:23 AM5/9/03
to
"Balog Pal" <pa...@lib.hu> wrote in message news:<3eba...@andromeda.datanet.hu>...

> "witoldk" <wit...@optonline.net> wrote in message
> news:9bed99bb.03050...@posting.google.com...
>
> > So, after reading so many discussions about DCLP not working (reading
> > on C++ NGs that is) I still fail to understand why people would
> expect
> > it to work,
>
> Uh, if you actually read those discussions you should know the answer.
> The real issues on the stuff are subtle, the standard is not enough
> clear, and the nonexistance of some guarantees is easy to slip-by,
> especially if you find your "solution" working on all your
> compilers/environments. After all you need a really agressive compiler
> to break the better non-solutions.

I think you are missing the point. The standard is _very_ clear in this
respect. There is no such thing as multiple execution threads in C++.
In other words: you should pick your "solution" from the domain of the
problem.

>
> > nor I understand why would people need it to work in
> > standard C++.
>
> Because the underlying issue is a pain in the @ss. Volatile _would_
> work if it would actually turn off the as-if rule for bystanding
> nonvolatiles. It would work if sequence points were sequence points. It
> would work with a really little patch.

Again, you are missing the point. What is the issue in question?
Is it synchronization between threads? If it is, then you have created an
issue (threads) "outside" of standard C++, and want standard C++ to
give you the mechanism to deal with it.
Is that reasonable?

witoldk

unread,
May 9, 2003, 5:55:24 AM5/9/03
to
Kurt Stege <kst...@innovative-systems.de> wrote in message news:<b9d7s8$i5fto$1...@ID-54586.news.dfncis.de>...

> >visibility between threads. Specifically, it guarantees that when the


> >second thread acquires lock l1, it sees all the memory in the state it
> >was _no_earlier_ than when the first thread released lock l1 (again,
> >lock
> >l1 somehow involves pthread_mutex_t). And that is regardless (or so I
> >thought) of the number processors in the system, or the specific
> >memory architecture.
>
> I understand the guarantees given by Posix. But the problem is:
> The second thread does not acquire or free the lock l1!

Yes, you are absolutely right.

> It checks, without any lockings, without using any memory barriers:
> if ( pInstance == NULL )

My mistake.

>
> Posix does not guarantee, that the second thread sees pInstance != NULL
> only, when the second thread sees the memory pointed to by pInstance
> as fully initialized.
>
> And for exactly this reason, each thread has to lock any mutex each time
> it accesses the global data pInstance.

Agreed.

Scott Meyers

unread,
May 9, 2003, 11:45:42 AM5/9/03
to
On 8 May 2003 15:11:59 -0400, Chris Carton wrote:
> On Wed, 07 May 2003 18:55:42 +0000, witoldk wrote:
> > In mt environment, if the OS is POSIX compliant, the issues you mention
> > do not exist, as thread B can never get into the trouble you describe
> > if only the LOCK() acquires (locks :) the mutex, and UNLOCK() unlocks
> > the mutex. This is guaranteed by the POSIX standard.
>
> I've seen this point mentioned twice in this thread, but I still fail to
> understand it. Thread B never attempts to lock the mutex at all, so I
> don't see what guarantees POSIX could specify that would make a
> difference. Could you (or someone else) please explain this a little bit
> more?

This is my question, too. How does posix prevent the second thread from
seeing a non-null pInstance after pInstance has been assigned but before
the Singleton has been constructed in the memory it points to? As Chris
notes, the second thread never tries to acquire any lock.

Scott

Scott Meyers

unread,
May 9, 2003, 11:51:51 AM5/9/03
to
On 8 May 2003 07:55:16 -0400, Rob wrote:
> On many systems, the simple act of assignment pInstance = something;
> is not guaranteed to be an atomic operation.

Can you give me examples of such systems? I ask, because this was one of
the issues I briefly addressed in my paper, and one extremely knowledgeable
reviewer wrote this:

  The reference is written atomically in Java, as guaranteed by the JMM.
  Not necessarily so in C++, but in reality so on all(?) machines these
  days.

Your claim is that this is not so in reality for C++. I'm not challenging
that claim, I'd just like some examples.

Thanks,

Scott

James Kanze

unread,
May 9, 2003, 6:04:24 PM5/9/03
to
"Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote in message
news:<rDvua.54199$8e7.2...@twister.austin.rr.com>...

> "Kurt Stege" <kst...@innovative-systems.de> wrote in message
> news:b9d7s8$i5fto$1...@ID-54586.news.dfncis.de...

> > On 7 May 2003 14:24:46 -0400, wit...@optonline.net (witoldk) wrote:

> > >I thought processor cache location(s) would be marked "dirty" in
> > >case like that, and thus force the cache refresh. Would the cache
> > >on second processor be "dirty" or not in this case?

> > I am not sure. But I suppose, the second processor has an own cache,
> > which is _not_ marked as dirty. It would be too expensive to check,
> > which of the cache lines are hit by the change in memory.

> Do we really need to worry about nitty gritty hardware-level issues?

Yes. In the end, it is the hardware that executes the code.

> If the hardware in an SMP system really couldn't tell when one of the
> CPU's caches was stale and needed refreshing, it seems to me that it
> would be pretty much impossible to write correct user-level code, not
> just for DCL but in many other cases is well.

People seem to be able to do it. The trick is to inform the processor
that this might be the case, by means of a memory barrier, for example.

> It seems to me that any SMP system that doesn't manage to keep its
> caches updated would be fatally flawed.

Are Alphas fatally flawed? Or Itanium? Neither imposes constraints
other than those necessary for single processor self consistency.
Formally, the specifications of the Sparc architecture don't make any
guarantees either; I don't know if this freedom is actually exploited in
the current Sun Sparcs, however.

> I'm not trying to say that there aren't any issues with DCL, there
> clearly are because of the as-if rule and c++/compiler issues. But I
> can't help but wonder if some of the posts in this thread are
> over-analyzing things.

Or perhaps by having actually encountered the problem on real machines.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16

James Kanze

unread,
May 9, 2003, 8:23:36 PM5/9/03
to
Kurt Stege <kst...@innovative-systems.de> wrote in message
news:<b9d7s8$i5fto$1...@ID-54586.news.dfncis.de>...
> I am not sure, if I am understanding, what you (witoldk) are
> exactly writing about.

> On 7 May 2003 14:24:46 -0400, wit...@optonline.net (witoldk) wrote:

> >ka...@gabi-soft.de (James Kanze) wrote in message
> >news:<d6651fb6.03050...@posting.google.com>...

> >> if ( pInstance == NULL ) {
> >>     Locker l1( lock1 ) ;
> >>     if ( pInstance == NULL ) {
> >>         pInstance = createSingleton() ;
> >>     }
> >> }

> >> MySingleton*
> >> createSingleton()
> >> {
> >>     Locker l2( lock2 ) ;
>        ^^^^^^^ This line marked

> >>     return new MySingleton ;
> >> }

> >> Freeing the lock2 is a memory barrier (at least according to Posix,
> ...

> >Since you mention Posix, I'm really confused now. I always thought
> >there was absolutely nothing wrong with double-checked locking on
> >Posix OS, that is as long as Locker internally uses pthread_mutex_t.
> >If we were talking Posix, there is no need for Locker l2.

> So you are talking about the marked line. Is it necessary, or not? In
> my opinion, you (witoldk and James) are both wrong.

Taken out of context, what appears above is wrong. When I wrote it, I
explicitly qualified it by "with regards to write ordering", and
mentioned afterwards that it wasn't sufficient, because it still didn't
address the read ordering issues.

> This kind of double-checked locking does not work.

As far as I know, the only way to make double-checked locking work is to
enclose the outer check in a protected region as well (protected by a
mutex lock). And of course, once you've done that, you don't need
double checked locking anyway.

There may be specific solutions which work on specific platforms, but
there is no general solution.
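For the modern record: the portable tools James says were missing arrived only with the C++11 memory model, years after this thread. A sketch of the now-standard acquire/release formulation of DCL (class and member names illustrative):

```cpp
#include <atomic>
#include <mutex>

// Double-checked locking, made correct by C++11 atomics: the release
// store guarantees the object's construction is visible to any thread
// whose acquire load observes the non-null pointer.
class Service {
public:
    static Service* instance() {
        Service* p = instance_.load(std::memory_order_acquire);
        if (!p) {                                    // first (cheap) check
            std::lock_guard<std::mutex> guard(mutex_);
            p = instance_.load(std::memory_order_relaxed);
            if (!p) {                                // second (locked) check
                p = new Service;
                instance_.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    Service() {}
    static std::atomic<Service*> instance_;
    static std::mutex mutex_;
};
std::atomic<Service*> Service::instance_(0);
std::mutex Service::mutex_;
```

None of this was available to the 2003 posters, which is precisely why the thread concludes there was no portable solution at the time.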

> >That is because Posix compliant OS gives certain guarantees about
> >memory visibility between threads. Specifically, it guarantees that
> >when the second thread acquires lock l1, it sees all the memory in
> >the state it was _no_earlier_ than when the first thread released
> >lock l1 (again, lock l1 somehow involves pthread_mutex_t). And that
> >is regardless (or so I thought) of the number processors in the
> >system, or the specific memory architecture.

> I understand the guarantees given by Posix. But the problem is: The
> second thread does not acquire or free the lock l1! It checks,
> without any lockings, without using any memory barriers:
> if ( pInstance == NULL )

> Posix does not guarantee, that the second thread sees pInstance !=
> NULL only, when the second thread sees the memory pointed to by
> pInstance as fully initialized.

> And for exactly this reason, each thread has to lock any mutex each
> time it accesses the global data pInstance. The introduction of l2
> doesn't help anything in this case.

> >> Regretfully, this isn't enough. ...

> And James already had seen, that l2 does not solve all problems.

> Frankly, I don't see at the moment, what problem l2 is supposed to
> help.

Nothing really. Historically, it was MY first attempt to solve the DCL
problem; after submitting it to the experts, I understood why it doesn't
work. All it does is guarantee the ordering of the writes. My main
reason for posting it was to shoot it down immediately, rather than wait
for it to be suggested after the write ordering problems in the first
version had been exposed.

> >I thought processor cache location(s) would be marked "dirty" in case
> >like that, and thus force the cache refresh. Would the cache on
> >second processor be "dirty" or not in this case?

> I am not sure. But I suppose the second processor has its own cache,
> which is _not_ marked as dirty. It would be too expensive to check
> which of the cache lines are hit by the change in memory.

Generally, any single processor will see its own cache, and nothing
else, only going to main memory when it can't find the values in its
cache. Special instructions are needed to ensure that the main memory
is updated.

> And I suppose, there is a CPU command, a message, that tells all other
> CPUs in the system: "Hello, it's urgent, I have written something
> important into memory. Mark _all_ your cache lines as dirty!"

I've never seen that. The fact that one processor has forced its data
to be written to main memory only means that that data is visible to the
other processors which want to see it. Unless the other processor takes
concrete steps to synchronize its cache, it won't necessarily see it.

> Further I suppose, that the posix lock- and unlock-functions will send
> this message to coordinate thread synchronization. Alas, clearing all
> caches in the other CPUs is quite an expensive operation; not the
> clearing itself, but the loading of data from the slow memory.

If one processor could signal the others to reload their cache, then all
that would be needed is memory synchronization from the processor which
writes. In fact, Posix doesn't require this, and some Posix based
systems, at least, don't implement it. Even the threads which only read
need a lock. (What Posix does guarantee is that acquiring or releasing
the lock will synchronize.)

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

May 9, 2003, 8:24:46 PM
wit...@optonline.net (witoldk) wrote in message
news:<9bed99bb.0305...@posting.google.com>...

> Scott Meyers <Use...@aristeia.com> wrote in message
> news:<MPG.191efc0a3...@news.hevanet.com>...

> > Please remember that my contention here is that there is no portable
> > and reliable way to make double-checked locking work.

> Please note that double-checked locking works reliably, and is
> portable among POSIX-compliant OSs (environments).

Please note that there is nothing in Posix which will make double
checked locking work, and that it doesn't work on Posix compliant
Alpha's, Posix compliant Itanium based systems, and probably some Posix
compliant Sparc based systems.

witoldk

May 9, 2003, 8:30:26 PM
"Chris Carton" <car...@NOq1labs.comSPAM> wrote in message news:<pan.2003.05.08....@NOq1labs.comSPAM>...

I was wrong. Now, as Kurt pointed out earlier in this thread, I believe
the way to make DCLP work is to remove the first if. That would make DCLP
a SCLP and everything would work just fine.
Seriously: I'm really embarrassed at having added to the confusion. I myself
was confused with respect to DCLP, mostly due to my own stupidity. DCLP
is advertised in print in a networking framework called ACE. I have never
employed ACE in what I do (so I never really cared much about it), but
having seen DCLP mentioned there, out of my own stupidity I assumed I
must have been missing something about it (about DCLP), and that it must
work after all.
At a glance (my stupidity again) it did not contradict the guarantees given
by POSIX, if only every thread actually got to block on LOCK(). In other
words: since a POSIX mutex lock is a lock followed by an mb, and an unlock
is an mb followed by an unlock, everything is OK... if only every thread
gets to LOCK().
The fact that the first if is checked without any locks/mbs seems obvious
at a glance. Unfortunately, in my case someone had to spell it out for me
in order for it to sink in :(. Well, never believe what you see in
print without checking it first.
Of course, all I've said is no excuse for adding to the confusion, so I must
take back all I've said about DCLP, with the exception of fixing it by making
it SCLP. And with the exception of being surprised why anyone would want to
make it work with the mechanisms available in a language that says
nothing about threads.

James Kanze

May 9, 2003, 8:34:25 PM
"Balog Pal" <pa...@lib.hu> wrote in message
news:<3eba...@andromeda.datanet.hu>...

> Volatile _would_ work if it would actually turn off the as-if rule for
> bystanding nonvolatiles. It would work if sequence points were
> sequence points. It would work with a really little patch.

Volatile might be able to work IF it had any standard semantics.
However, I defy you to define some portable standard semantics for
volatile that don't get in the way for its first use: memory mapped IO.

Volatile was designed to solve a particular problem. It solves it
reasonably well.

If the standard adopts multi-threading (which I think is likely in the
future), it will have to define what things like the "as if" rule mean
in a multi-threaded context. It could, I suppose, make requirements on
volatile in this context, but I really don't see that happening.
Because in the end, volatile will still be insufficient; I don't see it
being modified to guarantee atomicity, for example.

> It is like a big unguarded, unsigned, uncovered hole on a highway.

Actually, in this case, at least on Unix machines, it is like a hole
with a lot of big warning signs around it. The Posix standard is very
clear as to what is, and is not, guaranteed, and there are very good
books explaining it.

> People will keep falling into it, until something is done.

For some reason, people don't read the signs.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

Balog Pal

May 9, 2003, 8:36:59 PM
"witoldk" <wit...@optonline.net> wrote in message news:9bed99bb.03050...@posting.google.com...

> > Uh, if you actually read those discussions you should know the answer.
> > The real issues on this stuff are subtle, the standard is not
> > clear enough, and the nonexistence of some guarantees is easy to slip by,
> > especially if you find your "solution" working on all your
> > compilers/environments. After all, you need a really aggressive compiler
> > to break the better non-solutions.
>
> I think you are missing the point. The standard is _very_ clear in this
> respect. There is no such thing as multiple execution threads in C++.

I'm quite aware of what C++ has and what it has not.

> In other words: you should pick your "solution" from the domain of the
> problem.

Well, most of the environments I work with have threads AND C++. And I
have to use them. So I face the problem and must deal with it. The parts
of the solution are indeed spread across different locations, like the C++
standard, the extra definitions of the implementations I use, the OS, the
processor architecture, etc.
I must think about all of those things, and the possible issues. Should
anything change, I shall re-evaluate. If someone just grabs a piece of the
(well-working) code lying around and implants it in another system, with
different attributes, and skips the re-evaluation, a great deal of danger
is introduced.

> > > nor I understand why would people need it to work in
> > > standard C++.
> >
> > Because the underlying issue is a pain in the @ss. Volatile _would_
> > work if it would actually turn off the as-if rule for bystanding
> > nonvolatiles. It would work if sequence points were sequence points. It
> > would work with a really little patch.
>
> Again, you are missing the point.

We are possibly speaking of different things. If you mean you don't
understand why people believe it IS covered in the _current_ standard, I
just agree. The other meaning of that sentence is that you don't understand
why people think it SHOULD be covered in _a_ C++ standard they'd like:
not necessarily full-fledged thread support, just a few small guarantees
that give enough ground to write less-fragile solutions.

> What is the issue in question?
> Is it synchronization between threads? If it was, then you create an
> issue (threads) "outside" of standard C++, and want standard C++ to
> give you the mechanism to deal with it.
> Is that reasonable?

Actually the issue was (IMHO) the details of the 'as-if' rule, and what
code motion is allowed in C++. Stripped to the bones, if I write:
int i;
i=1;
MEMBAR();
i=2;

I want to know whether the compiler is allowed to perform i=2 before MEMBAR(), or stuff like that.
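A sketch of what such a MEMBAR() might look like as a pure compiler barrier (GCC-style inline asm; this is an illustration, not the original poster's code). The "memory" clobber forbids the compiler from moving loads and stores of globally visible objects across the barrier; on a multiprocessor a real barrier would additionally need a hardware fence instruction (SPARC membar, Alpha mb, etc.):

```cpp
static int i;  // file scope, so stores to it are observable across the barrier

// Compiler-only barrier: the empty asm with a "memory" clobber tells
// GCC-compatible compilers not to reorder or merge memory accesses
// across this point. It emits no machine instruction.
static inline void MEMBAR() {
    __asm__ __volatile__("" ::: "memory");
}

int demo() {
    i = 1;     // may not be merged with the store below or moved past MEMBAR()
    MEMBAR();
    i = 2;     // may not be hoisted above MEMBAR()
    return i;
}
```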

Paul

Scott Meyers

May 9, 2003, 8:41:24 PM
On 9 May 2003 05:55:24 -0400, witoldk wrote:
> Kurt Stege <kst...@innovative-systems.de> wrote in message news:<b9d7s8$i5fto$1...@ID-54586.news.dfncis.de>...
> > And for exactly this reason, each thread has to lock any mutex each time
> > it accesses the global data pInstance.
>
> Agreed.

Which means that posix has nothing to do with DCL. The primary motivation
for DCL is to avoid grabbing a lock on each access to pInstance.

Scott

witoldk

May 9, 2003, 10:32:33 PM
"Raoul Gough" <Raoul...@yahoo.co.uk> wrote in message news:<b9dcqq$hi72j$1...@ID-136218.news.dfncis.de>...

> "witoldk" <wit...@optonline.net> wrote in message
> news:9bed99bb.0305...@posting.google.com...
> > Scott Meyers <Use...@aristeia.com> wrote in message
> > news:<MPG.191efc0a3...@news.hevanet.com>...
> >
> > > Please remember that my contention here is that there is no
> > > portable and reliable way to make double-checked locking work.
> >
> > Please note that double-checked locking works reliably, and is
> > portable among POSIX-compliant OSs (environments).
>
> I suppose you're only talking about the uniprocessor case?

To be honest: no.

> I thought
> there were fundamental problems with multiple processors, which means
> you can't perform a cheap check (i.e. without synchronization) once
> initialization has already taken place.

Obviously, I was wrong. You are right.

Raoul Gough

May 9, 2003, 10:37:45 PM
"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.1924b674b...@news.hevanet.com...

> On 8 May 2003 07:55:16 -0400, Rob wrote:
> > On many systems, the simple act of assignment pInstance = something;
> > is not guaranteed to be an atomic operation.
>
> Can you give me examples of such systems? I ask, because this was one
> of the issues I briefly addressed in my paper, and one extremely
> knowledgeable reviewer wrote this:
>
> The reference is written atomically in Java, as guaranteed by the JMM.
> Not necessarily so in C++, but in reality so on all(?) machines these
> days.
>
> Your claim is that this is not so in reality for C++. I'm not
> challenging that claim, I'd just like some examples.

8088 with "FAR" pointers made up of a 16-bit segment register and
16-bit offset. I think a compiler would have to go out of its way to
make this atomic enough for an unsynchronized NULL test to work.

--
Raoul Gough
see http://home.clara.net/raoulgough/ for my work availability

johnchx

May 9, 2003, 10:39:02 PM
"Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote

> "johnchx" <john...@yahoo.com> wrote

> >
> > Singleton* Singleton::Instance() {
> > static Singleton* p_instance;
> > if (p_instance) return p_instance;
> > p_instance = new Singleton;
> > // insert synchronization instruction
> > // to block until p_instance actually updated
> > return p_instance;
> > }
> >
> > Caveat: as originally discussed, the above is meant to be safe only if
> > it's guaranteed to be called before the program spawns multiple
> > threads.
>
> If you can make that guarantee, why would you need DCL at all?

That's right, you don't. But the above isn't DCL -- it performs
neither double checking nor locking. ;-)

> Seems to me,
> if you can be sure that the singleton will be initialized by a single thread
> before all other threads are created, you could just use a local static
> instance and be OK.

Maybe, but remember that static initialization isn't guaranteed by the
standard to be thread-safe and that we're interested in the case of a
program that hasn't spawned multiple threads *yet* but might do so
very soon.

So let's say that the Instance() function gets called in Thread-1 on
CPU-1. Then we spawn Thread-2, that the OS, for nefarious reasons of
its own, binds to CPU-2. Thread-2 calls Instance(). If we're using
static local initialization, can we be certain that Thread-2 will see
the static local in the "already initialized" state? Probably not.
Thread-2 might perceive the variable as uninitialized and run the ctor
again, or it might perceive it as initialized but not see its
"correct" state.

I have read that there might exist implementations which -- as an
extension -- guarantee that static initialization is safe in this
regard, but I can't name one.

Maybe the key point is that ensuring that Instance() is called when
the program is single-threaded isn't the same as ensuring that the
results of calling Instance() are globally visible to all threads
running on all cpus while the program is still single-threaded. The
point of the code example is turning the first guarantee into the
second.
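A sketch of that idea with invented names: initialize while single-threaded, and let pthread_create supply the missing publication step, since it is one of the calls for which Posix guarantees memory synchronization:

```cpp
#include <pthread.h>
#include <cstddef>

// Illustrative type; 'value' stands in for real singleton state.
struct Singleton { int value; Singleton() : value(42) {} };

static Singleton* p_instance = NULL;

// Must first be called from the main thread, before any other thread
// exists; that is why no lock is taken here.
Singleton* Instance() {
    if (p_instance == NULL)
        p_instance = new Singleton;
    return p_instance;
}

static void* worker(void*) {
    // Safe only because Instance() completed before pthread_create():
    // pthread_create is a synchronizing call, so the fully constructed
    // object is visible to this new thread.
    return Instance();
}

int run() {
    Instance();                              // initialize while single-threaded
    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return 1;
    void* r = NULL;
    pthread_join(t, &r);
    return (r == p_instance) ? 0 : 1;        // 0 on success
}
```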

> The whole point of DCL is to provide the benefits of a local
> static without the risk of threading issues.
>

It's more than that...it also aims to avoid the cost of synchronizing
access to the one-time flag, and that turns out to be the hard part.

James Dennett

May 9, 2003, 10:42:06 PM
witoldk wrote:

> What is the issue in question?
> Is it synchronization between threads? If it was, then you create an
> issue (threads) "outside" of standard C++, and want standard C++ to
> give you the mechanism to deal with it.
> Is that reasonable?

Yes, it's reasonable; that's how standards evolve.

-- James.

Balog Pal

May 9, 2003, 10:49:37 PM
"Scott Meyers" <Use...@aristeia.com> wrote in message news:MPG.1924b674b...@news.hevanet.com...

> > On many systems, the simple act of assignment pInstance = something;
> > is not guaranteed to be an atomic operation.
>
> Can you give me examples of such systems?

Anywhere a pointer is bigger than a single register, or you must use a segment register/ASI.

Intel 80286, large model.
I386+ non-flat mode. (like drivers for WIN32 system )
Xenix 286 and 386 unless my memory seriously fail.

Paul

Michael

May 9, 2003, 10:50:47 PM

"Hyman Rosen" <hyr...@mail.com> wrote in message
news:10522258...@master.nyc.kbcfp.com...
> Thomas Mang wrote:
> > Doesn't this show a fundamental problem of the Standard?
>
> No, except that it's a problem that the Standard doesn't
> address multiprogramming.
>
> The basic problem is the fundamental wrongness of attempting
> to synchronize access to a common resource without using the
> proper synchronization mechanisms defined for such use. Just
> because a subset of programmers wish there was a way to do
> this using other means, and whine when told there isn't,
> doesn't make this a problem of the standard.

*EXACTLY* (what a frustrating thread prior to Hyman's post) !

If you want to do something that is fundamentally wrong, then you are
foolish for complaining that it doesn't work.

The door was open all day, but I carefully closed it when I got home this
evening - how did the cat get outside, it's not fair!!

Balog Pal

May 9, 2003, 10:51:24 PM
"James Kanze" <ka...@gabi-soft.de> wrote in message news:d6651fb6.03050...@posting.google.com...

> > It seems to me that any SMP system that doesn't manage to keep its
> > caches updated would be fatally flawed.
>
> Are Alpha's fatally flawed? Or Itanium? Neither impose constraints
> other than those necessary for single processor self consistency.
> Formally, the specifications of the Sparc architecture don't make any
> guarantees either; I don't know if this freedom is actually exploited in
> the current Sun Sparcs, however.

Not true. The architecture is pretty clear on that. The data cache is
treated as just part of the memory, with coherence maintained by the
hardware. The code cache has more relaxed rules. See 8.4.2 in the manual.

The Intel x86 series says pretty much the same.

So for data (not instructions) the cache is simply ignored from the
processor's point of view; it just interfaces to "memory", and anything in
between is transparent.

> > I'm not trying to say that there aren't any issues with DCL, there
> > clearly are because of the as-if rule and c++/compiler issues. But I
> > can help but wonder if some of the posts in this thread are
> > over-analyzing things.
>
> Or perhaps by having actually encountered the problem on real machines.

IMHO if a program is theoretically broken, it must be clearly stated. I guess quite many people are not familiar with even the concept of memory models (other than TSO), and with the possibility that things do not happen in the sequence in which they see the instructions written one under another. So I consider this DCL issue a good chance for education, regardless of that pattern's actual use. One can easily run into something similar, and then just notice mystic crashes without a clue as to what could possibly have gone wrong.

Paul

LLeweLLyn

May 10, 2003, 7:03:15 PM
"Raoul Gough" <Raoul...@yahoo.co.uk> writes:

> "Scott Meyers" <Use...@aristeia.com> wrote in message
> news:MPG.1924b674b...@news.hevanet.com...
> > On 8 May 2003 07:55:16 -0400, Rob wrote:
> > > On many systems, the simple act of assignment pInstance =
> something;
> > > is not guaranteed to be an atomic operation.
> >
> > Can you give me examples of such systems? I ask, becaues this was
> one
> > of
> > the issues I briefly addressed in my paper, and one extremely
> > knowledgable
> > reviewer wrote this:
> >
> > The reference is written atomically in Java, as guaranteed by JMM.
> Not
> > necessarily so in C++, but in reality so on all(?) machines these
> > days.
> >
> > Your claim is that this is not so in reality for C++. I'm not
> > challenging
> > that claim, I'd just like some examples.
>
> 8088 with "FAR" pointers made up of a 16-bit segment register and
> 16-bit offset. I think a compiler would have to go out of its way to
> make this atomic enough for an unsynchronized NULL test to work.

[snip]

A similar issue applies to the 80386 and up: a 16-bit segment register
and a separate 32-bit offset register. The caveat is that few
implementations use segment registers.

Raoul Gough

May 10, 2003, 7:10:25 PM
"James Kanze" <ka...@gabi-soft.de> wrote in message
news:d6651fb6.03050...@posting.google.com...
> Kurt Stege <kst...@innovative-systems.de> wrote in message
[snip]

This is getting way off topic, but can someone explain what costs are
involved in the kind of memory barrier/cache reload instructions we're
talking about? It seems to me that a modern cache containing ~MBytes
of memory would take a huge number of memory bus cycles to reload or
even verify against main memory. Or does the memory barrier only
affect a limited range of memory addresses or subsequent instructions?
How does it work - is there an easy description of this somewhere?

--
Raoul Gough
see http://home.clara.net/raoulgough/ for my work availability

Rob

May 11, 2003, 7:08:27 AM
"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.1924b674b...@news.hevanet.com...
> On 8 May 2003 07:55:16 -0400, Rob wrote:
> > On many systems, the simple act of assignment pInstance = something;
> > is not guaranteed to be an atomic operation.
>
> Can you give me examples of such systems? I ask, because this was one
> of the issues I briefly addressed in my paper, and one extremely
> knowledgeable reviewer wrote this:
>
> The reference is written atomically in Java, as guaranteed by the JMM.
> Not necessarily so in C++, but in reality so on all(?) machines these
> days.
>
> Your claim is that this is not so in reality for C++. I'm not
> challenging that claim, I'd just like some examples.

Sure. Pointer assignment with "far" pointers on 80286 machines.
The "pointer" is actually 32 bits, on a machine with 16 bit registers.

Jeff Kohn

May 11, 2003, 7:09:42 AM

"johnchx" <john...@yahoo.com> wrote in message
news:4fb4137d.03050...@posting.google.com...
> "Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote

>
> > >
> > > Caveat: as originally discussed, the above is meant to be safe only if
> > > it's guaranteed to be called before the program spawns multiple
> > > threads.
> >
> > Seems to me,
> > if you can be sure that the singleton will be initialized by a single
> > thread before all other threads are created, you could just use a local
> > static instance and be OK.
>
> Maybe, but remember that static initialization isn't guaranteed by the
> standard to be thread-safe and that we're interested in the case of a
> program that hasn't spawned multiple threads *yet* but might do so
> very soon.
>
> So let's say that the Instance() function gets called in Thread-1 on
> CPU-1. Then we spawn Thread-2, that the OS, for nefarious reasons of
> its own, binds to CPU-2. Thread-2 calls Instance(). If we're using
> static local initialization, can we be certain that Thread-2 will see
> the static local in the "already initialized" state? Probably not.
> Thread-2 might perceive the variable as uninitialized and run the ctor
> again, or it might perceive it as initialized but not see its
> "correct" state.

I don't think that's correct, but then again I'm not an expert at this
stuff. If you make the initial call (and therefore initialize the local
static) before other threads are running, then I really don't see how a
different thread could come along at some later point and not see the
constructed object.

Jeff

Michael

May 11, 2003, 2:12:40 PM
> > > On many systems, the simple act of assignment pInstance = something;
> > > is not guaranteed to be an atomic operation.
> >
> > Can you give me examples of such systems? I ask, because this was one
> > of the issues I briefly addressed in my paper, and one extremely
> > knowledgeable reviewer wrote this:
> >
> > The reference is written atomically in Java, as guaranteed by the JMM.
> > Not necessarily so in C++, but in reality so on all(?) machines these
> > days.
> >

Pretty much any machine with more than 1 CPU (Sun, HP, etc).

In this common case, the issue has got *nothing* to do with the language
being used, and everything to do with multiple CPUs and their caches trying
to share a common RAM store - as per other branches in this news thread, you
*must* use the provided synchronisation mechanisms. Even the simple act of
setting an integer is *not* atomic on a multiple-CPU system (regardless of
what your extremely knowledgeable reviewer may write).

James Kanze

May 11, 2003, 2:23:11 PM
"Balog Pal" <pa...@lib.hu> writes:

|> "James Kanze" <ka...@gabi-soft.de> wrote in message
|> news:d6651fb6.03050...@posting.google.com...

|> > > It seems to me that any SMP system that doesn't manage to keep
|> > > its caches updated would be fatally flawed.

|> > Are Alpha's fatally flawed? Or Itanium? Neither impose
|> > constraints other than those necessary for single processor self
|> > consistency. Formally, the specifications of the Sparc
|> > architecture don't make any guarantees either; I don't know if
|> > this freedom is actually exploited in the current Sun Sparcs,
|> > however.

|> Not true. The architecture is pretty clear on that. Data cache is
|> looked as just part of the memory, coherency amintained by
|> hardware. Code cache have more relaxed rules. See 8.4.2 in the
|> manual.

I actually looked it up before posting. In the relaxed memory model,
all guarantees hold for a single processor only. See 8.4.4.1 in the
Sparc V9 specification: "Relaxed Memory Order places no ordering
constraints on memory references beyond those required for PROCESSOR
self-consistency." (Emphisis added.) And according to the document
"Implementation Characteristics of Current SPARC-V9-based Products":
"UltraSPARC-II supports the Partial Store Order and Relaxed Memory
Order models."

The problem thus is potentially present in the hardware. I've not the
time to search for the necessary software documentation to see if
Solaris actually activates (or can activate) this mode.

|> The intel x86 series say pretty similar.

I seem to recall reading somewhere that Intel implicitly generates the
equivalent of hardware memory barriers around any instruction preceded
by the lock prefix.

|> So for data (not instructions) the cache is simply ignored from
|> the processor's point of view, it just interface "memory",
|> anything between is transparent.

In a single processor system, for Sparcs. As long as the lock prefix
is used, for Intel. (My g++ compiler for Linux does NOT generate lock
prefixes for volatile accesses.)

|> > > I'm not trying to say that there aren't any issues with DCL,
|> > > there clearly are because of the as-if rule and c++/compiler
|> > > issues. But I can help but wonder if some of the posts in this
|> > > thread are over-analyzing things.

|> > Or perhaps by having actually encountered the problem on real
|> > machines.

|> IMHO if a program is theoretically broken, it must be clearly
|> stated. I guess quite many people are not familiar with even the
|> concept of memory models (other than TSO), and the possibility
|> that things do not happen in the sequence as they see the
|> instructions one under another. So I consider this DCL issue a
|> good chance for education regardless of that pattern's actual use.
|> One can easily run into something similar, and then just notice
|> mystic crashes without a clue what could possibly have gone wrong.

The problem is that an awful lot of people make a lot of assumptions
based on what they think they know, or what was roughly true twenty
years ago.

--
James Kanze mailto:ka...@gabi-soft.fr


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

11 rue de Rambouillet, 78460 Chevreuse, France Tel. +33 1 41 89 80 93

James Kanze

May 11, 2003, 2:30:08 PM
Scott Meyers <Use...@aristeia.com> writes:

|> On 9 May 2003 05:55:24 -0400, witoldk wrote:
|> > Kurt Stege <kst...@innovative-systems.de> wrote in message
|> > news:<b9d7s8$i5fto$1...@ID-54586.news.dfncis.de>...
|> > > And for exactly this reason, each thread has to lock any
|> > > mutex each time it accesses the global data pInstance.

|> > Agreed.

|> Which means that posix has nothing to do with DCL.

Yes and no. For DCL to work, you need some guarantees. Obviously,
the C++ standard can't give them, because it doesn't even recognize
multithreading. About the only other relevant standard for this sort
of thing is Posix. Posix specifies in fact that behavior is undefined
if more than one thread may access a variable, and at least one thread
may modify it. This is the case in DCL, since the thread with the
lock may modify the pointer, and other threads may access it (since
they don't acquire the lock). Posix also states that writes are only
ordered with respect to a certain number of system calls:
pthread_mutex_lock and pthread_mutex_unlock are in the list, so the
lock also guarantees that the thread not constructing the singleton
cannot see a partially constructed singleton (supposing it has
acquired the lock, of course) -- without this guarantee, even the
straightforward single-lock algorithm would not work.

One might (very reasonably, IMHO) ask: why all this emphasis on Posix?
C++ is available on a number of non-Posix based systems. In fact, the
most widespread platform is not Posix. (Formally, Linux isn't Posix
either, but its threading model conforms to the Posix specifications.)
About the only answer I can give is that Posix is the only widespread
specification I am familiar with. I think it is also the only
widespread implementation for which some sort of "standard" is openly
and easily available. (But I'd love to be proven wrong about this,
and if someone could post a URL where I could find the equivalent
information for Windows, I would appreciate it.)

|> The primary motivation for DCL is to avoid grabbing a lock on each
|> access to pInstance.

And the Posix specification says that since pInstance may be modified,
any access to it when another process may be accessing it is undefined
behavior. And the only way to prevent this is for all processes which
access it to use locks. Always.

Presumably, using a second pointer in thread specific memory will
work. On the other hand, I seriously doubt that this will be any
faster than just using single locking or, on Posix systems,
pthread_once (which is presumably optimized for this sort of thing).
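For comparison, the pthread_once approach mentioned above looks roughly like this (illustrative names; it provides exactly the guarantee DCL tries and fails to provide):

```cpp
#include <pthread.h>
#include <cstddef>

// Illustrative type; 'value' stands in for real singleton state.
struct Singleton { int value; Singleton() : value(1) {} };

static Singleton* p_instance = NULL;
static pthread_once_t once_control = PTHREAD_ONCE_INIT;

static void init_instance() {   // executed exactly once, in exactly one thread
    p_instance = new Singleton;
}

Singleton* Instance() {
    // Every caller is properly synchronized with the initializing
    // thread, on every platform that implements Posix threads.
    pthread_once(&once_control, init_instance);
    return p_instance;
}
```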

And finally, I still wonder what all the fuss is about. On my rather
ordinary Sparc, acquiring a mutex lock that is not held by another
process, then freeing it, takes around 90 nanoseconds. Not normally a
real bottleneck, and if it is, there is nothing to prevent the thread
from calling the singleton once, before the critical loop, and keeping
a raw pointer or a reference to it. (I'm not sure about others, but I
generally do this anyway. It just seems more natural to acquire all
necessary resources locally, independently of time constraints.)
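The hoisting idea in the last sentence can be sketched like this (the singleton, its work() member, and the loop are all invented for illustration):

```cpp
#include <pthread.h>
#include <cstddef>

class MySingleton {
public:
    static MySingleton* Instance() {         // safe version: locks every call
        pthread_mutex_lock(&mutex_);
        if (pInstance_ == NULL)
            pInstance_ = new MySingleton;
        MySingleton* p = pInstance_;
        pthread_mutex_unlock(&mutex_);
        return p;
    }
    int work(int i) const { return i; }      // stand-in for real per-item work
private:
    static MySingleton* pInstance_;
    static pthread_mutex_t mutex_;
};
MySingleton* MySingleton::pInstance_ = NULL;
pthread_mutex_t MySingleton::mutex_ = PTHREAD_MUTEX_INITIALIZER;

long hot_loop(int n) {
    MySingleton& s = *MySingleton::Instance();  // pay for the lock once
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += s.work(i);                       // no locking inside the loop
    return sum;
}
```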

--
James Kanze mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France Tel. +33 1 41 89 80 93

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]

James Kanze

May 11, 2003, 2:31:49 PM
"Raoul Gough" <Raoul...@yahoo.co.uk> writes:

I don't think that there is any easy answer concerning cost. First,
there are the effects on the pipeline; these can probably be roughly
estimated. The effects on the cache are, as you say, more
complicated. Normally, I don't think that the processor automatically
reloads the cache -- it just "forgets" whatever is in it, so that
future accesses will go to main memory. (But a single access will
load a complete cache line, so that if you have good locality, it
might not be that bad. And most processors distinguish between
instructions and data in a cache -- I'm not sure that such
instructions will always invalidate the instruction cache.)

To give you some idea of concrete values on a real machine, the Posix
requests pthread_mutex_lock and pthread_mutex_unlock both require such
synchronization. On the machine I normally use at work, the sequence
pthread_mutex_lock, ++i (where i is an int), pthread_mutex_unlock takes about 90
nanoseconds, including its two barriers (but there may be some tricky
optimizations in the call I'm not aware of). For comparison, a
function call + return takes about 20 nanoseconds, or 50 nanoseconds
if the call is virtual.

--
James Kanze mailto:ka...@gabi-soft.fr


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

11 rue de Rambouillet, 78460 Chevreuse, France Tel. +33 1 41 89 80 93


James Kanze

unread,
May 11, 2003, 2:33:21 PM5/11/03
to
Joseph Seigh <jsei...@xemaps.com> writes:

|> I wrote:

|> > There are two ways to implement thread ownership of a copy. One
|> > is by allocating the local pointer in thread local storage
|> > (TLS). ...

|> Scratch the TLS stuff. The problem with that route is you need a
|> key to index into your local storage and that needs to be
|> initialized, initialization being what you are trying to
|> accomplish in the first place.

This obviously depends on the system. Posix systems are pretty weak
here, probably because of portability concerns, and do require a key
and a special lookup. There is nothing to prevent a system from
reserving a certain address range for thread specific memory, and
mapping it to a different physical memory for each thread. (This
would slow down context switches between threads; nothing is free.)
Given this, an obvious compiler extension would be to support such
memory, say with a new keyword __thread_static, or some such.

|> You can use pthread_once to do this but then you might as well use
|> it instead of DCL since it's functionally equivalent.

It's not completely functionally equivalent, since it is guaranteed to
work. In all cases.

I suspect that most implementations of pthread_once actually use a
statically initialized mutex. But if a faster solution exists for a
specific architecture, I would expect it to be used.

|> The problem with POSIX threads is that it is performance neutral
|> in the same sense that C++ is threads neutral. pthread_once may
|> be, and probably is on most platforms, implemented via DCL, but
|> you don't really know for sure.

On many platforms, it can't be implemented via DCL, since DCL doesn't
work. On many platforms, a simple mutex just isn't that expensive.
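For comparison, the pthread_once idiom under discussion looks roughly like this (a sketch; error handling omitted):

```cpp
#include <pthread.h>

class MySingleton {
public:
    static MySingleton* Instance();
    int value;
private:
    MySingleton() : value(42) {}
    static void init();
    static pthread_once_t once_control;
    static MySingleton* pInstance;
};

pthread_once_t MySingleton::once_control = PTHREAD_ONCE_INIT;
MySingleton* MySingleton::pInstance = 0;

void MySingleton::init() { pInstance = new MySingleton; }

MySingleton* MySingleton::Instance()
{
    // pthread_once guarantees init runs exactly once, with the necessary
    // memory synchronization, no matter how many threads race here.
    pthread_once(&once_control, init);
    return pInstance;
}
```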

Peter Koch Larsen

unread,
May 11, 2003, 2:42:16 PM5/11/03
to
Apart from those "exotic" segmented answers, I believe that any access to a
pointer that is not properly aligned will not be atomic on an x86 system.

/Peter.


"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.1924b674b...@news.hevanet.com...


> On 8 May 2003 07:55:16 -0400, Rob wrote:
> > On many systems, the simple act of assignment pInstance = something;
> > is not guaranteed to be an atomic operation.
>
> Can you give me examples of such systems? I ask, because this was one
> of the issues I briefly addressed in my paper, and one extremely
> knowledgeable reviewer wrote this:
>
> The reference is written atomically in Java, as guaranteed by JMM. Not
> necessarily so in C++, but in reality so on all(?) machines these
> days.
>
> Your claim is that this is not so in reality for C++. I'm not
> challenging that claim, I'd just like some examples.


Anton Tykhyy

unread,
May 12, 2003, 9:43:24 AM5/12/03
to
Scott Meyers <Use...@aristeia.com> wrote:
> On 4 May 2003 10:55:24 -0400, Phil Carmody wrote:
> > Wouldn't a volatile pInstance prevent the assigning to pInstance before
> > the right hand side of the = (i.e. the new) has completed?
>
> No. Declaring pInstance volatile will force reads of that variable to come
> from memory and writes to that variable to go to memory, but what we need
> here is a way to say that pInstance should not be written until the
> Singleton has been constructed. That is, we need to tell the compiler to
> respect a temporal ordering that is stricter than the as-if rule. As far
> as I know, there is no way to do that. Certainly volatile doesn't do it.
>
What about

    volatile Singleton* temp = new Singleton;
    pInstance = temp;

Hyman Rosen

unread,
May 12, 2003, 9:47:44 AM5/12/03
to
Scott Meyers wrote:
> Which means that posix has nothing to do with DCL. The primary motivation
> for DCL is to avoid grabbing a lock on each access to pInstance.

It seems to me that if you're going to be getting the singleton so many
times that the lock is a problem, you could just squirrel away a local
copy of the pointer the first time you got it, and then just use the
local copy.

Hyman Rosen

unread,
May 12, 2003, 9:48:19 AM5/12/03
to
Balog Pal wrote:
> Stripped to the bones, if I write:
> int i; i=1; MEMBAR(); i=2;
> and know whether the compiler is allowed to make i=2 before MEMBAR or
> stuff like that.

That's up to your compiler, and its documentation.
If i is automatic in the code above, it would be
perfectly reasonable for the compiler to set it to
2 immediately, or in fact elide it altogether, unless
it has been told that MEMBAR is something special.
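On gcc, for instance, one way to make MEMBAR "something special" is an asm statement with a memory clobber (a compiler extension; this is a sketch, and it constrains only the compiler, not the hardware):

```cpp
// gcc-specific sketch: a compiler-level barrier only. It prevents the
// compiler from reordering or eliding memory accesses across it, but
// emits no hardware fence instruction.
#define MEMBAR() __asm__ __volatile__ ("" ::: "memory")

volatile int observed;  // volatile, so the two stores cannot be merged

void demo()
{
    int i;
    i = 1;
    observed = i;   // the compiler must perform this store first...
    MEMBAR();
    i = 2;
    observed = i;   // ...and this one after the barrier
}
```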

Balog Pal

unread,
May 12, 2003, 9:50:38 AM5/12/03
to
"James Kanze" <ka...@alex.gabi-soft.fr> wrote in message
news:861xz5w...@alex.gabi-soft.fr...

> And finally, I still wonder what all the fuss is about. On my rather
> ordinary Sparc, acquiring a mutex lock that is not held by another
> process, then freeing it, takes around 90 nanoseconds.

BAH. This is a psychological issue. I feel like I've been in countless
discussions on 'efficiency', all observed from a purely "theoretical"
view. Meaning: eyeing the _C source_ and assuming some behavior, by people
not knowing even what assembly output to expect from the compiler,
let alone different processors, core/memory speed multipliers, cache
line issues, and so forth.

"coders" (who don't really reach to code) tend to concentrate on that
level -- on how to shape the code in 1-2 lines distance, instead of
minding the function, unit, or design level issues.

> Not normally a real bottleneck,

I read everywhere that programmers' guesses about where the bottleneck will
be are wrong in almost all cases. IOW they expect the bottleneck where it
isn't. Then we get to the fuss.

If anyone know a recipe to generally reduce it, please tell me. ;)

Paul

Balog Pal

unread,
May 12, 2003, 9:50:56 AM5/12/03
to
"James Kanze" <ka...@alex.gabi-soft.fr> wrote in message
news:8665ohw...@alex.gabi-soft.fr...

> |> > > It seems to me that any SMP system that doesn't manage to keep
> |> > > its caches updated would be fatally flawed.
>
> |> > Are Alpha's fatally flawed? Or Itanium? Neither impose
> |> > constraints other than those necessary for single processor self
> |> > consistency. Formally, the specifications of the Sparc
> |> > architecture don't make any guarantees either; I don't know if
> |> > this freedom is actually exploited in the current Sun Sparcs,
> |> > however.
>
|> Not true. The architecture is pretty clear on that. The data cache is
|> treated as just part of the memory, with coherency maintained by
|> hardware. The code cache has more relaxed rules. See 8.4.2 in the
> |> manual.
>
> I actually looked it up before posting.

Me too. :) It seems we're talking about two distinct issues here.
(Possibly my OOPS, misunderstanding the actual statement, where I still
read about cache update.) One is the proc<->memory interface and the
other is the implementation of memory (including caches). I was
talking about the latter: as I read the Sparc manual, the cache is a
transparent thing. The processor accesses the memory system and will get
some value. It may come from the "main" memory or some cache; there's no
way a processor can tell the difference. The memory ordering issues
remain regardless of the existence of a cache.

> In the relaxed memory model,
> all guarantees hold for a single processor only. See 8.4.4.1 in the
> Sparc V9 specification: "Relaxed Memory Order places no ordering
> constraints on memory references beyond those required for PROCESSOR
> self-consistency." (Emphisis added.) And according to the document
> "Implementation Characteristics of Current SPARC-V9-based Products":
> "UltraSPARC-II supports the Partial Store Order and Relaxed Memory
> Order models."

Sure. That's the whole purpose of the relaxed models -- to avoid blocks
and allow efficient reordering of accesses to memory. 'Processor
self-consistency' merely covers load-store and store-load ordering to the
_same_ memory location. The processor can avoid some memory accesses
completely, e.g. if it sees a load from a location it recently scheduled a
store to. And it is NOT THE CACHE that is at work here.

Also, the ordering MEMBAR instructions you must use in PSO/RMO models
does NOT flush or invalidate any cache. They order the memory access of
the processor.
The cache is (AFAIK, on the few systems I saw) really a dumb independent
thing -- it has mapped lines, and if external access happens to the
memory, that line gets invalidated. On its own.

Explicit instructions flushing cache (in x86 and sparc) are rarely used,
in extreme situations. Definitely not in conjunction with mutexes.

> The problem thus is potentially present in the hardware.

Yes, the processor. (Am I nitpicky? To me the distinction seems
important.)

> |> The intel x86 series say pretty similar.
>
> I seem to recall reading somewhere that Intel implicitly generates the
> equivalent of hardware memory barriers around any instruction preceded
> by the lock prefix.

Uh, the last Intel manual I read thoroughly was for the 486. (The Pentiums
only superficially -- however, I don't think there should be any new
incompatible features.) Lock asserts the processor's lock line, which is
practically the DMA line. It prevents anything else from accessing the
memory bus, thus gaining exclusive access. Lock is important for the CISC
instructions like inc, dec, xchg, add, and so on. (Explicit) memory
barriers are not needed, as all you have is a memory model identical to
Sparc's TSO.
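As a concrete sketch of the lock prefix in use, gcc's __sync builtins (a compiler extension, assumed available) emit lock-prefixed read-modify-write instructions on x86:

```cpp
// Sketch using gcc's __sync builtins: on x86 this compiles to a
// lock-prefixed add, which both makes the read-modify-write atomic
// and acts as a full memory barrier.
int counter = 0;

int atomicIncrement()
{
    return __sync_add_and_fetch(&counter, 1);   // returns the new value
}
```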

> |> So for data (not instructions) the cache is simply ignored from
> |> the processor's point of view, it just interface "memory",
> |> anything between is transparent.
>
> In a single processor system, for Sparcs.

And multiple processor ones similarly. As I earlier referenced, 8.4.2 in
the manual says that, right as it starts, and there's a diagram too (fig
43.) Certainly you still must ensure the 'memory order' is the
correct one.

> The problem is that an awful lot of people make a lot of assumptions
> based on what they think they know

That's quite true, and it hits home. (However, it is pretty natural -- what
you think you know _is_ your set of knowledge. What else could you build
your assumptions upon? When you grow old, and fall on your face now and
again, you learn to better RTFM. Some manuals, being incomplete, only add
to this problem. When you learn new stuff, how can you tell there are
essential issues not even mentioned?)

Paul

johnchx

unread,
May 12, 2003, 9:51:49 AM5/12/03
to
"Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote
> "johnchx" <john...@yahoo.com> wrote

> > Maybe, but remember that static initialization isn't guaranteed by the
> > standard to be thread-safe and that we're interested in the case of a
> > program that hasn't spawned multiple threads *yet* but might do so
> > very soon.
> >
> > So let's say that the Instance() function gets called in Thread-1 on
> > CPU-1. Then we spawn Thread-2, that the OS, for nefarious reasons of
> > its own, binds to CPU-2. Thread-2 calls Instance(). If we're using
> > static local initialization, can we be certain that Thread-2 will see
> > the static local in the "already initialized" state? Probably not.
> > Thread-2 might perceive the variable as uninitialized and run the ctor
> > again, or it might perceive it as initialized but not see its
> > "correct" state.
>
> I don't think that's correct, but then again I'm not an expert at this
> stuff. If you make the initial call (and therefore initialize the local
> static) before other threads are running, then I really don't see how a
> different thread could come along at some later point and not see the
> constructed object.

Well, it *is* tricky because, in the C++ world, we're used to thinking
that functions actually complete all of their work before they return.
But this is often an illusion.

Returning from a function call does *not* automatically guarantee that
the effects of that function call are (yet) observable on other CPUs.
In particular, store instructions may still be buffered (or queued if
you prefer) pending being written to the L1 cache. If we spawn a new
thread, and it begins executing on a second CPU before the buffered
stores on the first CPU "drain to memory," our new thread won't see
those stores (yet).

Once the buffered stores reach the L1 cache, the cache coherency
mechanism will ensure that all CPUs become "aware" of the updated
values. But there's no explicit guarantee about how long this will
take.

On IA-32, we do at least have a guarantee that the stores will become
visible to other CPUs in the same order the program issued
them...other architectures do not even guarantee that.

James Kanze

unread,
May 12, 2003, 10:05:38 AM5/12/03
to
"Balog Pal" <pa...@lib.hu> writes:

That's true for a lot of things: threads, but also sockets,
sub-processes, GUI, etc. That's part of the reason why Posix has been
mentionned so much in this discussion; Posix does publish a publicly
available standard for threading issues. Obviously, a multithreaded
program written using Posix won't necessarily work correctly under
Windows. But then, it almost certainly won't compile.

With regards to the basic Posix rules (as opposed to the interface), I
suspect that code written in accordance with them will be pretty
portable. It's hard to do anything useful with much less. Some
implementations may offer more, but if your code counts on it, it
won't be portable.

|> > > > nor I understand why would people need it to work in
|> > > > standard C++.

|> > > Because the underlying issue is a pain in the @ss. Volatile
|> > > _would_ work if it would actually turn off the as-if rule for
|> > > bystanding nonvolatiles. It would work if sequence points
|> > > were sequence points. It would work with a really little
|> > > patch.

|> > Again, you are missing the point.

|> We possibly speak of different things. If you mean you don't
|> understand why people believe it IS covered in the _current_
|> standard, I just agree. Other meaning of that sentence that you
|> don't understand why people think it SHOULD be covered in _a_ C++
|> standard they'd like. Not necessarily a full-fledged thread
|> support, just little guarantees that give enough ground to write
|> less-fragile solutions.

Like what, for example?

I fully expect threading to find its way into the next C++ standard.
It is something that can't be simply handled by a third party library.
In the mean time, you have Posix, and I suspect that, modulo the
actual interface, Windows will give you the same guarantees. Lacking
anything else, it's a starting point.

|> > What is the issue in question? Is it synchronization between
|> > threads? If it was, then you create an issue (threads) "outside"
|> > of standard C++, and want standard C++ to give you the mechanism
|> > to deal with it. Is that reasonable?

|> Actually the issue was (IMHO) the details of the 'as-if' rule, and
|> what code-moving is allowed in C++. Stripped to the bones, if I
|> write:

|> int i;
|> i=1;
|> MEMBAR();
|> i=2;

|> and know whether the compiler is allowed to make i=2 before MEMBAR
|> or stuff like that.

The C++ standard is silent about this, because it doesn't define a
MEMBAR. Any implementation which does define it would have to specify
this -- given the supposed semantics, I presume that the specification
would also forbid such code movement. Any compiler supporting MEMBAR
on such a platform would have to respect these semantics.

That's in theory, of course. In practice, Posix threads appear like
just another library to the compiler, and g++ pre- 3.0, for example,
takes no special precautions when unwinding the stack during exception
handling. All that means, of course, is that g++ pre 3.0 cannot be
used in multithreaded applications. (G++ post 3.0 also has problems;
operator[] in std::string is not thread-safe, for example. But as far
as I know, these problems are considered bugs, where as in pre 3.0,
that was just the way it was.)

--
James Kanze mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France Tel. +33 1 41 89 80 93


James Kanze

unread,
May 12, 2003, 10:07:47 AM5/12/03
to
"Balog Pal" <pa...@lib.hu> writes:

|> "Scott Meyers" <Use...@aristeia.com> wrote in message
|> news:MPG.1924b674b...@news.hevanet.com...

|> > > On many systems, the simple act of assignment pInstance =
|> > > something; is not guaranteed to be an atomic operation.

|> > Can you give me examples of such systems?

|> Anywhere a pointer is bigger than a single register, or you must
|> use segment register/ASI.

You don't even need that. Anywhere a write to a pointer might be
broken up into several bus cycles. Like, for example, an unaligned
32 bit pointer on a modern Intel.
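One hypothetical way to manufacture such an unaligned pointer is gcc's packed attribute (a compiler extension; an ordinary statically declared pointer would be naturally aligned):

```cpp
#include <cstddef>

// gcc-specific sketch: packing forces p onto an odd offset, so a store
// to such a pointer field may be split across bus cycles and is not
// guaranteed to be atomic.
struct Packed {
    char c;
    void* p;
} __attribute__((packed));

std::size_t pointerOffset()
{
    return offsetof(Packed, p);   // 1, instead of the natural 4 or 8
}
```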

--
James Kanze mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France Tel. +33 1 41 89 80 93


James Kanze

unread,
May 12, 2003, 1:38:53 PM5/12/03
to
Michael <z...@nospam.com> wrote in message
news:<0lrva.2106$DP4....@news-server.bigpond.net.au>...

> > > > On many systems, the simple act of assignment pInstance =
> > > > something; is not guaranteed to be an atomic operation.

> > > Can you give me examples of such systems? I ask, because this was
> > > one of the issues I briefly addressed in my paper, and one
> > > extremely knowledgeable reviewer wrote this:

> > > The reference is written atomically in Java, as guaranteed by
> > > JMM. Not necessarily so in C++, but in reality so on all(?)
> > > machines these days.

> Pretty much any machine with more than 1 CPU (Sun, HP, etc).

> In this common case, the issue has got *nothing* to do with the
> language being used, and everything to do with multiple
> CPUs+their_caches trying to share a common RAM store - as per other
> branches in this news thread you *must* use the provided
> synchronisation mechanisms. Even the simple act of setting an integer
> is *not* atomic on a multiple CPU system (regardless of what your
> extremely knowledgeable reviewer may write).

I think you missed the issue. Writing a pointer is definitely an atomic
operation on a Sparc. Whether and when a second processor sees the
write or not is another question, but it will either see none of it, or
all of it -- it cannot see half of the pointer with the new value, and
the other half with the old.

To get back to Scott's question -- it depends on the set of systems
being considered. There are still 8 bit embedded processors out there,
I believe, and on such systems, writing a pointer will definitely not be
atomic. (On the other hand, I don't know if there are C++ compilers for
such systems.) If you limit consideration to the important systems,
with 32 bits or more: Windows, Unix, and the like, about the only case I
can think of is either a "far" pointer or an unaligned pointer on an
Intel. In the context of a singleton, presumably the issue of far
pointer doesn't come up (on a 32 bit Intel system). And I don't know of
a compiler offhand which doesn't align a statically declared pointer --
you generally have to do some fancy (non-portable) casting to get
anything non-aligned. So for practical purposes, for a large number of
applications, you can fairly safely suppose that the pointer write will
be atomic.

I presume that this is also the case for 64 bit machines -- that writing
an aligned 64 bit value is atomic.

--
James Kanze GABI Software
mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16

James Kanze

unread,
May 12, 2003, 1:39:28 PM5/12/03
to
john...@yahoo.com (johnchx) wrote in message
news:<4fb4137d.03050...@posting.google.com>...
> "Jeff Kohn" <jk...@houston.no.spam.please.rr.com> wrote

> > "johnchx" <john...@yahoo.com> wrote

> > > Singleton* Singleton::Instance() {
> > >     static Singleton* p_instance;
> > >     if (p_instance) return p_instance;
> > >     p_instance = new Singleton;
> > >     // insert synchronization instruction
> > >     // to block until p_instance actually updated
> > >     return p_instance;
> > > }

> > > Caveat: as originally discussed, the above is meant to be safe
> > > only if it's guaranteed to be called before the program spawns
> > > multiple threads.

> > If you can make that guarantee, why would you need DCL at all?

> That's right, you don't. But the above isn't DCL -- it performs
> neither double checking nor locking. ;-)

> > Seems to me if you can be sure that the singleton will be
> > initialized by a single thread before all other threads are created,
> > you could just a local static instance and be OK.

> Maybe, but remember that static initialization isn't guaranteed by the
> standard to be thread-safe and that we're interested in the case of a
> program that hasn't spawned multiple threads *yet* but might do so
> very soon.

Nothing is guaranteed thread-safe by the standard. The standard doesn't
consider threads. Whether static initialization is thread-safe or not
depends on implementation specific guarantees.

> So let's say that the Instance() function gets called in Thread-1 on
> CPU-1. Then we spawn Thread-2, that the OS, for nefarious reasons of
> its own, binds to CPU-2. Thread-2 calls Instance(). If we're using
> static local initialization, can we be certain that Thread-2 will see
> the static local in the "already initialized state?"

According to Posix, you can. The Posix standard requires that
pthread_create synchronize memory. For other threading systems, you'll
have to verify their specifications, but I would imagine that this would
usually be the case.
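A sketch of the pattern this guarantee enables, assuming POSIX: fully construct the object before spawning, and let pthread_create's memory synchronization publish it to the new thread:

```cpp
#include <pthread.h>

struct MySingleton {
    int value;
    MySingleton() : value(42) {}
};

static MySingleton* pInstance = 0;

static void* worker(void*)
{
    // Because pthread_create synchronizes memory, this thread sees the
    // fully constructed object without any further locking.
    return pInstance;
}

int runDemo()
{
    pInstance = new MySingleton;        // initialize before spawning
    pthread_t t;
    pthread_create(&t, 0, worker, 0);
    void* seen = 0;
    pthread_join(t, &seen);
    return static_cast<MySingleton*>(seen)->value;
}
```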

> Probably not. Thread-2 might perceive the variable as uninitialized
> and run the ctor again, or it might perceive it as initialized but not
> see its "correct" state.

> I have read that there might exist implementations which -- as an
> extension -- guarantee that static initialization is safe in this
> regard, but I can't name one.

First, it's not an extension -- it is (or is not) part of the threading
standard. Second, such implementations aren't rare, by any means, since
it is a requirement of Posix, and, I suspect, of most other threading
systems. (If anyone knows where there are online specifications for
Windows threads, a pointer would be much appreciated. I find that all
answers that I can give here are really far too Posix specific for this
group, but I don't have any information whatsoever on other threading
implementations.)

--
James Kanze GABI Software
mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, Tél. : +33 (0)1 30 23 45 16


Michael Price

unread,
May 12, 2003, 4:37:54 PM5/12/03
to
In article <MPG.191d9257d...@news.hevanet.com>, Scott Meyers
wrote:
>
> If there's a portable way to avoid this problem in the presence of
> aggressive optimizing compilers, I'd love to know about it.
>
> Scott

What about assigning the instance variable inside another lock and
checking on a boolean assigned after this second lock? Example:

namespace
{
    volatile bool finished = false;
    volatile MySingleton* pInstance = 0;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t mutex2 = PTHREAD_MUTEX_INITIALIZER;
}

MySingleton*
MySingleton::instance()
{
    if (!finished) {
        pthread_mutex_lock(&mutex);
        if (!finished) {
            pthread_mutex_lock(&mutex2);
            pInstance = new MySingleton();
            pthread_mutex_unlock(&mutex2);
            finished = true;
        }
        pthread_mutex_unlock(&mutex);
    }
    return const_cast<MySingleton*>(pInstance);
}

To quote "Programming with Posix Threads" (David Butenhof):

"A mutex lock, for example, begins by locking the mutex, and completes
by issuing a memory barrier. The result is that any memory accesses
issued while the mutex is locked cannot complete before other threads
can see that the mutex was locked. Similarly, a mutex unlock begins by
issuing a memory barrier and completes by unlocking the mutex,
ensuring that memory accesses issued while the mutex is locked cannot
complete after other threads can see that the mutex is unlocked."

Given this information, I'm having a hard time seeing how the code
segment above can fail. However, I've been wrong before so feel free
to point it out if I am :)

Michael

Joseph Seigh

unread,
May 12, 2003, 7:25:12 PM5/12/03
to

James Kanze wrote:
>
> First, it's not an extension -- it is (or is not) part of the
threading
> standard. Second, such implementations aren't rare, by any means,
since
> it is a requirement of Posix, and, I suspect, of most other threading
> systems. (If anyone knows where there are online specifications for
> Windows threads, a pointer would be much appreciated. I find that all
> answers that I can give here are really far too Posix specific for
this
> group, but I don't have any information whatsoever on other threading
> implementations.)
>

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/synchronization_reference.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/processes_and_threads.asp

and thereabouts. It seems to be about as well defined as Posix, though it
has the problem of mutating a bit across releases. It doesn't have a
formal definition, and neither does Posix AFAIK. The Posix spec contains
the official definition, but official is not the same as formal. The only
formal definition out there that I am aware of is Java's thread
definition, but that's a bit too much for most people's taste. That's
partly because it was defined in the form of a meta implementation rather
than a formal semantic specification. JSR 133 will simplify that somewhat,
but not entirely.

BTW, when JSR 133 and JSR 166 do get into Java, it will be possible to
implement DCL correctly in Java. JSR 166 in particular supports wait-free
and lock-free programming. That may put pressure on Posix to support it.
Windows has some support with their interlocked functions, but not
complete support.

Joe Seigh

johnchx

unread,
May 13, 2003, 1:36:29 PM5/13/03
to
ka...@gabi-soft.de (James Kanze) wrote
> john...@yahoo.com (johnchx) wrote
> > Maybe, but remember that static initialization isn't guaranteed by the
> > standard to be thread-safe and that we're interested in the case of a
> > program that hasn't spawned multiple threads *yet* but might do so
> > very soon.
>
> Nothing is guaranteed thread-safe by the standard. The standard doesn't
> consider threads. Whether static initialization is thread-safe or not
> depends on implementation specific guarantees.

Yes, I think we all know this.


> > So let's say that the Instance() function gets called in Thread-1 on
> > CPU-1. Then we spawn Thread-2, that the OS, for nefarious reasons of
> > its own, binds to CPU-2. Thread-2 calls Instance(). If we're using
> > static local initialization, can we be certain that Thread-2 will see
> > the static local in the "already initialized state?"
>
> According to Posix, you can. The Posix standard requires that
> pthread_create synchronize memory.

That would be handy! :-)

Do you have a reference for this? I've looked through the online
documents a bit, but I can't seem to find this.


> > I have read that there might exist implementations which -- as an
> > extension -- guarantee that static initialization is safe in this
> > regard, but I can't name one.
>
> First, it's not an extension -- it is (or is not) part of the threading
> standard.

Well, a C++ implementation certainly could provide a guarantee not
provided by the standard (such as "automatically" thread safe
initialization of static locals). I'd call that an "extension".

I think I take your point, though: that this particular issue might be
resolved by the guarantees provided by the operating environment.
(And, if it isn't, you can do it "by hand," which was the point of the
original example.) But when I said "implementation" I only meant
"implementation of the C++ language."

> (If anyone knows where there are online specifications for
> Windows threads, a pointer would be much appreciated. I find that all
> answers that I can give here are really far too Posix specific for this
> group, but I don't have any information whatsoever on other threading
> implementations.)

Well, it's not exactly a specification, but there's documentation of
the Windows threading API at:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/process_and_thread_functions.asp

Perhaps because of Windows' "IA-32-centric" history, the documentation
seems almost entirely silent on issues of memory synchronization.
More recent additions do specify where memory barriers are implicitly
generated on IA-64 (where, I gather, it matters more). I suspect
there's some Alpha-specific documentation out there somewhere, and
that would have to be more explicit...but I can't point you to it.
