
DCL -- is there a FAQ?


Shawn Willden

Mar 21, 2001, 7:38:59 PM
I've run into a number of conversations about Double Checked Locking
recently... and I wonder what the consensus is here about its
appropriateness and safety. Under what conditions is it useful, when is
it problematic, are there any well-known modifications that make it safe
under all conditions, etc.

For the sake of clarity, I'm talking about code that's intended to make
lazy initialization work well in multi-threaded environments. The typical
non-thread-safe version looks like:

public class AClass {
    private SomeResource instance = null;

    public SomeResource getSomeResource() {
        if (instance == null)
            instance = new SomeResource();
        return instance;
    }
}

Of course, this breaks in multi-threaded apps because two threads may
call getSomeResource() at the same time. The obvious solution is to use a
mutex to control access to getSomeResource(), which in Java means making
the method synchronized. The problem with that, however, is that it
incurs the synchronization overhead on every call, even after "instance" is
safely initialized and the "danger" is past. DCL attempts to maintain
thread safety without synchronizing the whole method, like so:

public class AClass {
    private SomeResource instance = null;

    public SomeResource getSomeResource() {
        if (instance == null) {
            synchronized (this) {
                if (instance == null)
                    instance = new SomeResource();
            }
        }
        return instance;
    }
}

However, under some circumstances aggressive optimization by the
compiler or even the processor can break this, by setting the value of
"instance" before the resource is fully constructed, thus causing other
threads to potentially use an incomplete object.
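Concretely, the hazard is that the single assignment can be decomposed and reordered (an illustrative breakdown, not actual generated code):

```
instance = new SomeResource();

// may effectively execute as:
tmp = allocate(SomeResource);   // 1. obtain raw memory
instance = tmp;                 // 2. publish the pointer early
construct(tmp);                 // 3. run the constructor last
// another thread testing "instance == null" between steps 2 and 3
// sees a non-null pointer to a not-yet-constructed object
```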

Shawn.


Bil Lewis

Mar 22, 2001, 3:36:54 PM
Shawn,

DCL will work 100% of the time correctly on some machines. And fail
on others an unknown amount of the time. Despite all the wonderful
things I and others said about it, I (and some of the others) now
distance ourselves from it. Nice idea, but...

In short, if your initialization is quick, then do it at load time.
If the initialization is very expensive and rarely used, then the
cost of locking a mutex is negligible, so use the mutex. (My opinion)
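Both of those alternatives can be sketched in Java like so (a minimal illustration of mine, not code from the thread; SomeResource is a stand-in for whatever expensive resource is being built):

```java
// Stand-in for an expensive-to-construct resource (illustrative only).
class SomeResource { }

class AClass {
    // Alternative 1: initialization is quick, so do it at load time.
    // Class initialization guarantees this runs exactly once, before use.
    private static final SomeResource EAGER = new SomeResource();

    // Alternative 2: initialization is expensive and rarely used, so
    // just pay for the mutex on every call.
    private SomeResource instance = null;

    public synchronized SomeResource getSomeResource() {
        if (instance == null)
            instance = new SomeResource();
        return instance;
    }

    public static SomeResource getEager() {
        return EAGER;
    }
}
```

Either way, every thread that calls the accessor sees a fully constructed object, because the synchronized method (or class initialization) supplies the memory barriers that DCL tries to skip.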

Some folks, e.g., Doug Schmidt, do tell me that it has its uses and
shouldn't be totally discounted. I haven't seen the numbers to convince
me yet, but Doug assures me they exist. So...

Take your pick.

-Bil
--
================
B...@LambdaCS.com

http://www.LambdaCS.com
Lambda Computer Science
555 Bryant St. #194
Palo Alto, CA, 94301

Phone/FAX: (650) 328-8952 St Petersburg: +7 812 966 5359

Shawn Willden

Mar 22, 2001, 6:50:18 PM
Bil Lewis wrote:

> DCL will work 100% of the time correctly on some machines. And fail
> on others an unknown amount of the time. Despite all the wonderful
> things I and others said about it, I (and some of the others) now
> distance ourselves from it. Nice idea, but...

So there aren't any well-known "fixes" that make it reliable. Hmm, I have
what I think is a fix, but I would have expected that it would be widely
known, since it's pretty simple. Most likely my "fix" is broken. Can you
see any situation in which the following would fail?

public class AClass {
    private SomeResource instance = null;
    private boolean instanceInitialized = false;

    public SomeResource getSomeResource() {
        if (!instanceInitialized) {
            synchronized (this) {
                if (instance == null)
                    instance = new SomeResource();
                else if (!instanceInitialized)
                    instanceInitialized = true;
            }
        }
        return instance;
    }
}

It seems like the root of the problem with DCL is that the "instance"
pointer does double duty, working both as a pointer and as a flag. It
seems like separating those responsibilities, and ensuring that the
instance creation is complete and a memory barrier has been executed before
setting the flag should cover all of the failure modes that I'm aware of.
The above implementation is even more conservative than necessary, since it
won't set the flag until the second time the mutex is acquired, and it also
makes an arguably unnecessary test before setting "instanceInitialized" (I
do have some reasoning behind that test, but I won't go into it now).

Comments? Am I missing some reason why the above won't work? Or is adding
an additional boolean variable more trouble than it's worth?

Shawn.

Alexander Terekhov

Mar 23, 2001, 5:43:51 AM

Shawn Willden wrote:

what are the reasons? don't you think that the:

static Singleton* pSingleton = 0;
static int fSingletonCreated = 0; // "atomic" flag

Singleton*
Singleton::get_instance()
{
    if ( 0 == fSingletonCreated ) {
        {
            Guard guard( somelock );
            if ( !pSingleton ) {
                pSingleton = new Singleton;
            }
        }
        fSingletonCreated |= 1; // "atomic" update
    }
    return pSingleton;
}

would have the same results?

BTW in POSA2 you can find "mb" version of DCL:
(if memory serves)

Singleton*
Singleton::get_instance()
{
    Singleton* pSingleton_ = pSingleton;
    asm ("mb");
    if ( 0 == pSingleton_ ) {
        {
            Guard guard( somelock );
            if ( 0 == pSingleton_ ) {
                pSingleton_ = new Singleton;
            }
            asm ("mb");
            pSingleton = pSingleton_;
        }
    }
    return pSingleton_;
}


>
> Comments? Am I missing some reason why the above won't work? Or is adding
> an additional boolean variable more trouble than it's worth?

one more question.. assuming that the first mb is _really_ needed even
for solution with a separate flag (e.g. in order to force processor
cache synchronisation <?>), wouldn’t the following make it a little
bit slower (tsd overhead) but 100% “portable” (not counting tsd),
correct and still better than non-DCL solution:

static Singleton* pSingleton = 0;
_thread_specific_ int fSingletonCreated = 0; // thread-specific flag (pseudo-code)

Singleton*
Singleton::get_instance()
{
    if ( 0 == fSingletonCreated ) {
        {
            Guard guard( somelock );
            if ( !pSingleton ) {
                pSingleton = new Singleton();
            }
        }
        fSingletonCreated = 1;
    }
    return pSingleton;
}

regards,
alexander.

Dave Butenhof

Mar 23, 2001, 7:55:44 AM
Shawn Willden wrote:

> It seems like the root of the problem with DCL is that the "instance"
> pointer does double duty, working both as a pointer and as a flag. It
> seems like separating those responsibilities, and ensuring that the
> instance creation is complete and a memory barrier has been executed before
> setting the flag should cover all of the failure modes that I'm aware of.
> The above implementation is even more conservative than necessary, since it
> won't set the flag until the second time the mutex is acquired, and it also
> makes an arguably unnecessary test before setting "instanceInitialized" (I
> do have some reasoning behind that test, but I won't go into it now).

No, the root problem is that it assumes a single global ordering of memory
access across all processors in the system. Whether there's a separate flag is
irrelevant, because that doesn't affect the order of memory operations on any
processor.

It will fail if the second processor sees your instanceInitialized SET before
it sees the data that was written during initialization. This is exactly the
same failure mode that occurs without the separate flag.

The solution is hardware specific. On some primitive machines (like Intel),
all of this is irrelevant because they never reorder memory operations. On
others (such as Sparc in the common mode that reorders writes but not reads),
a single barrier is needed when you create the singleton to ensure that the
initialized object is written before the pointer (or flag) that would lead
another thread to the data. On more advanced machines like MIPS and Alpha, you
need TWO barriers: one to ensure that the initializing processor WRITES the
data in order, and another to ensure that other processors READ the data in
(reverse) order. (So that a thread that sees the "instance" pointer non-NULL
can be assured of reading the data that had been previously written.)

Since you used Java in your example, there's an additional wrinkle in that the
Java memory model is intended to (but doesn't, quite) support this idiom
without any explicit barriers. There are people working on fixing the memory
model so that it really does work portably. But that's just Java, and won't
help C or C++. I'm unconvinced that the language model can really be changed
such that this type of idiom is always guaranteed to work when you want it
without burning many useless clock cycles doing memory barriers that you DON'T
want or need. This may be less of an issue for Java, where (a) portability of
the compiled (byte stream) code is of paramount importance and (b) the focus
on an interpretive model makes the inefficiency of extra barriers less
noticeable.

/------------------[ David.B...@compaq.com ]------------------\
| Compaq Computer Corporation POSIX Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/

Shawn Willden

Mar 23, 2001, 10:44:33 AM
Alexander Terekhov wrote:

> Shawn Willden wrote:
>
> > The above implementation is even more conservative than necessary, since it
> > won't set the flag until the second time the mutex is acquired, and it also
> > makes an arguably unnecessary test before setting "instanceInitialized" (I
> > do have some reasoning behind that test, but I won't go into it now).
> >
>
> what are the reasons? don't you think that the:

[...]

> fSingletonCreated |= 1; // "atomic" update

[...]

> would have the same results?

Very likely. I was concerned that compilers might generate code that cleared the
flag before setting it, thus causing its value to "flicker". Using OR to set the
flag is a better solution.

> BTW in POSA2 you can find "mb" version of DCL:

I'm sure I'm betraying extreme ignorance here, but what is POSA2? I'm guessing
it's some standard text on multithreaded programming.

> (if memory serves)
>
> Singleton*
> Singleton::get_instance()
> {
> Singleton* pSingleton_ = pSingleton;
> asm ("mb");
> if ( 0 == pSingleton_ ) {
> {
> Guard guard( somelock );
> if ( 0 == pSingleton_ ) {
> pSingleton_ = new Singleton;
> }
> asm ("mb");
> pSingleton = pSingleton_;
> }
> }
> return pSingleton_;
> }

Hmm, maybe it's because I'm not sure what precisely you mean by 'asm("mb")', but
that doesn't seem to work at all. It seems both inefficient because it incurs
the memory barrier overhead on every call and incorrect because it allows
multiple copies of Singleton to be created.

> one more question.. assuming that the first mb is _really_ needed even
> for solution with a separate flag (e.g. in order to force processor
> cache synchronisation <?>), wouldn’t the following make it a little
> bit slower (tsd overhead) but 100% “portable” (not counting tsd),
> correct and still better than non-DCL solution:

Interesting. Certainly it's better that each thread only passes the Guard once,
rather than every time, assuming it's not possible to portably use a global flag
that tells all threads when the resource is available.
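In Java, Alexander's thread-specific flag maps naturally onto java.lang.ThreadLocal. Here is a sketch of the idea (my illustration, reusing the names from the earlier example; ThreadLocal.withInitial is the modern Java API, not something available to the original posters):

```java
// Stand-in for the lazily built resource (illustrative only).
class SomeResource { }

class AClass {
    private SomeResource instance = null;

    // Per-thread flag: has THIS thread observed the initialized instance?
    private final ThreadLocal<Boolean> seenInit =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    public SomeResource getSomeResource() {
        if (!seenInit.get()) {
            // Until a thread has passed through the lock once, it must
            // take it; the lock's acquire/release barriers give this
            // thread a consistent view of the constructed object.
            synchronized (this) {
                if (instance == null)
                    instance = new SomeResource();
            }
            seenInit.set(Boolean.TRUE); // private to this thread; no race
        }
        return instance;
    }
}
```

Each thread pays for the lock exactly once; subsequent calls from the same thread read only its private flag, which is safe because it never sets the flag until after it has synchronized.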

Shawn.

Shawn Willden

Mar 23, 2001, 11:28:03 AM
Dave Butenhof wrote:

> Shawn Willden wrote:
>
> > It seems like the root of the problem with DCL is that the "instance"
> > pointer does double duty, working both as a pointer and as a flag.
>

> No, the root problem is that it assumes a single global ordering of memory
> access across all processors in the system. Whether there's a separate flag is
> irrelevant, because that doesn't affect the order of memory operations on any
> processor.
>
> It will fail if the second processor sees your instanceInitialized SET before
> it sees the data what was written during initialization. This is exactly the
> same failure mode that occurs without the separate flag.

I knew there had to be some reason why my simple modification wouldn't work. I'm
sure you're correct, but I'm not sure I understand what you're saying.

So, is it possible then for the following sequence of events to occur:

1. Processor A enters the method, finds instanceInitialized CLEAR, acquires the
lock, creates the object, sets instance and releases the lock. By the end of the
method, this thread has a consistent view of a created object.
2. Processor B enters the method, finds instanceInitialized CLEAR, acquires the
lock, notices that instance is already set, sets instanceInitialized and releases
the lock.
3. Processor C enters the method, finds instanceInitialized SET and proceeds to
use instance, BUT when it reads the value of instance finds it to still be NULL,
or finds that it has a value but the memory it's pointing to hasn't been
initialized.

Okay, assuming I'm thinking correctly, I can see that. And it appears to make any
variation of DCL unworkable, unless you have a way to ensure that each processor
enters the guard and reads the values at least once, hence Alexander's solution
using a thread-specific flag (or, better yet, a processor-specific flag).

> Since you used Java in your example

Only because I thought it convenient. I'm glad I did, though, because I hadn't
thought about how things might work differently.

> I'm unconvinced that the language model can really be changed
> such that this type of idiom is always guaranteed to work when you want it
> without burning many useless clock cycles doing memory barriers that you DON'T
> want or need.

That's very interesting. Would you care to elaborate? With my (simplistic)
understanding, I would think that any thread synchronization model (in any
language) would have to provide a consistent view of memory, meaning memory
barriers are required upon every entry and exit.

Then again, this brief discussion has made me realize that I don't really know
what memory barriers are.

Thanks,

Shawn.

Alexander Terekhov

Mar 23, 2001, 1:23:20 PM
Shawn Willden wrote:

> Alexander Terekhov wrote:
>
> > Shawn Willden wrote:
> >
> > > The above implementation is even more conservative than necessary, since it
> > > won't set the flag until the second time the mutex is acquired, and it also
> > > makes an arguably unnecessary test before setting "instanceInitialized" (I
> > > do have some reasoning behind that test, but I won't go into it now).
> > >
> >
> > what are the reasons? don't you think that the:
>
> [...]
>
> > fSingletonCreated |= 1; // "atomic" update
>
> [...]
>
> > would have the same results?
>
> Very likely. I was concerned that compilers might generate code that cleared the
> flag before setting it, thus causing its value to "flicker". Using OR to set the
> flag is a better solution.
>
> > BTW in POSA2 you can find "mb" version of DCL:
>
> I'm sure I'm betraying extreme ignorance here, but what is POSA2? I'm guessing
> it's some standard text on multithreaded programming.

http://www.cs.wustl.edu/~schmidt/POSA/

>
> > (if memory serves)
> >
> > Singleton*
> > Singleton::get_instance()
> > {
> > Singleton* pSingleton_ = pSingleton;
> > asm ("mb");
> > if ( 0 == pSingleton_ ) {
> > {
> > Guard guard( somelock );
> > if ( 0 == pSingleton_ ) {
> > pSingleton_ = new Singleton;
> > }
> > asm ("mb");
> > pSingleton = pSingleton_;
> > }
> > }
> > return pSingleton_;
> > }
>
> Hmm, maybe it's because I'm not sure what precisely you mean by 'asm("mb")', but
> that doesn't seem to work at all. It seems both inefficient because it incurs
> the memory barrier overhead on every call and incorrect because it allows
> multiple copies of Singleton to be created.

you are right... a) i forgot double fetch:

!> > Guard guard( somelock );
! pSingleton_ = pSingleton; //!!!!
!> > if ( 0 == pSingleton_ ) {
!> > pSingleton_ = new Singleton;

and b) i certainly agree that it is "inefficient because it incurs the
memory barrier overhead on every call" - the non-portable thing i've
tried to avoid via tsd. however, the question is what is more expensive?

btw, here you can find more info:

http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

e.g.:

// C++ implementation with explicit memory barriers
// Should work on any platform, including DEC Alphas
// From "Patterns for Concurrent and Distributed Objects",
// by Doug Schmidt
template <class TYPE, class LOCK> TYPE *
Singleton<TYPE, LOCK>::instance (void) {
    // First check
    TYPE* tmp = instance_;
    // Insert the CPU-specific memory barrier instruction
    // to synchronize the cache lines on multi-processor.
    asm ("memoryBarrier");
    if (tmp == 0) {
        // Ensure serialization (guard
        // constructor acquires lock_).
        Guard<LOCK> guard (lock_);
        // Double check.
        tmp = instance_;
        if (tmp == 0) {
            tmp = new TYPE;
            // Insert the CPU-specific memory barrier instruction
            // to synchronize the cache lines on multi-processor.
            asm ("memoryBarrier");
            instance_ = tmp;
        }
    }
    return tmp;
}

regards,
alexander.

Dave Butenhof

Mar 26, 2001, 9:02:57 AM
Shawn Willden wrote:

> Dave Butenhof wrote:
>
> > Shawn Willden wrote:
> >
> > > It seems like the root of the problem with DCL is that the "instance"
> > > pointer does double duty, working both as a pointer and as a flag.
> >
> > No, the root problem is that it assumes a single global ordering of memory
> > access across all processors in the system. Whether there's a separate flag is
> > irrelevant, because that doesn't affect the order of memory operations on any
> > processor.
> >
> > It will fail if the second processor sees your instanceInitialized SET before
> > it sees the data what was written during initialization. This is exactly the
> > same failure mode that occurs without the separate flag.
>
> I knew there had to be some reason why my simple modification wouldn't work. I'm
> sure you're correct, but I'm not sure I understand what you're saying.
>
> So, is it possible then for the following sequence of events to occur:
>
> 1. Processor A enters the method, finds instanceInitialized CLEAR, acquires the
> lock, creates the object, sets instance and releases the lock. By the end of the
> method, this thread has a consistent view of a created object.

Right; any given processor always sees a consistent view of the memory it has
written.

> 2. Processor B enters the method, find instanceInitialized CLEAR, acquires the
> lock, notices that instance is already set, sets instanceInitialized and releases
> the lock.

It acquired the lock, with the appropriate memory barriers provided by the
implementation; so yes, it, too, is OK.

> 3. Processor C enters the method, find instanceInitialized SET and proceeds to
> use instance, BUT when it reads the value of instance finds it to still be NULL,
> or finds that it has a value but the memory that it's pointing to hasn't been
> initialized.

Yup. That's the problem.

> Okay, assuming I'm thinking correctly, I can see that. And it appears to make any
> variation of DCL unworkable, unless you have a way to ensure that each processor
> enters the guard and reads the values at least once, hence Alexander's solution
> using a thread-specific flag (or, better yet, a processor-specific flag).

Yes, but there are very few perfect engineering solutions. There are tradeoffs and
risks. And don't expect anything like "processor-specific flags", because they're
good for very few things, and are complicated to implement in user-mode (you don't
exactly want a kernel call here).

> > Since you used Java in your example
>
> Only because I thought it convenient. I'm glad I did, though, because I hadn't
> thought about how things might work differently.
>
> > I'm unconvinced that the language model can really be changed
> > such that this type of idiom is always guaranteed to work when you want it
> > without burning many useless clock cycles doing memory barriers that you DON'T
> > want or need.
>
> That's very interesting. Would you care to elaborate? With my (simplistic)
> understanding, I would think that any thread synchronization model (in any
> language) would have to provide a consistent view of memory, meaning memory
> barriers are required upon every entry and exit.

Absolutely. The question is, "entry and exit to WHAT"? POSIX requires, and thread
libraries implement, memory barriers on "entry" (lock) and "exit" (unlock) of
mutexes. Memory barriers also occur on other important operations, such as waiting or
signalling a condition variable, creating a thread, and so forth.

But the "trivial threaded extension of single-thread DCL" doesn't include any of
these except when a thread sees the instance variable uninitialized. The challenge is
to get a consistent memory view for threads that see the instance variable already set. You can
do that non-portably by issuing memory barriers where necessary. You can also do it
with TSD to ensure that every thread locks the mutex until it can see an initialized
instance; thereafter it can presume that the instance pointer and data are
consistent. The problem is that TSD isn't free either, and there may be a limited
quantity (PTHREAD_KEYS_MAX, which need be no higher than 128). To make this work at
all in general, you'd need some standard way (e.g., a language feature) to
consolidate these "application keys" into a smaller number of actual POSIX keys.
Compiler support for syntax such as the Microsoft "TLS" __declspec(thread), layered
on top of POSIX TSD, can do that, but needn't. (And even reducing the TLS key usage
to one per shared library isn't necessarily enough to ensure all possible [or even
reasonable] applications will be able to run.)

If "entry and exit" means entry and exit of every routine, or block, or statement, or
"memory sequence point", forget it. It'd solve the problem, but the code would run
too slow to be of much use.

> Then again, this brief discussion has made me realize that I don't really know
> what memory barriers are.

Hey; if you KNOW that you don't understand, you're already in the top 10% of the
class. ;-)

Shawn Willden

Mar 26, 2001, 1:17:52 PM
Dave Butenhof wrote:

> Shawn Willden wrote:


> > Dave Butenhof wrote:
> > > I'm unconvinced that the language model can really be changed
> > > such that this type of idiom is always guaranteed to work when you want it
> > > without burning many useless clock cycles doing memory barriers that you DON'T
> > > want or need.
> >
> > That's very interesting. Would you care to elaborate?
>

> Absolutely. The question is, "entry and exit to WHAT"?

Ah, thanks for the explanation. I actually understood your point, but I wasn't
thinking. Having just discussed why DCL can't work, in general, it should be obvious
that tweaking Java's memory model so that DCL does work on all platforms while
maintaining a reasonable level of performance is likely to be a futile endeavor.

> > Then again, this brief discussion has made me realize that I don't really know
> > what memory barriers are.
>
> Hey; if you KNOW that you don't understand, you're already in the top 10% of the class.
> ;-)

<grin>

Stages of knowledge:
1. I don't know anything about this, but it can't be too hard.
2. I know a little about this, but I can guess the rest.
3. I know a lot about this, but there are a few things I don't understand.
4. I know as much as anyone about this, which is almost nothing at all.

Shawn.

Alexander Terekhov

Mar 27, 2001, 6:38:35 AM
Dave Butenhof wrote:

[...]

> The problem is that TSD isn't free either, and there may be a limited
> quantity (PTHREAD_KEYS_MAX, which need be no higher than 128). To make this work at
> all in general, you'd need some standard way (e.g., a language feature) to
> consolidate these "application keys" into a smaller number of actual POSIX keys.
> Compiler support for syntax such as the Microsoft "TLS" __declspec(thread), layered
> on top of POSIX TSD, can do that, but needn't. (And even reducing the TLS key usage
> to one per shared library isn't necessarily enough to ensure all possible [or even
> reasonable] applications will be able to run.)

umm.. why not try to reduce TLS key usage to just one per _all_ shared
libraries, so that each shared library would get one sub-key assigned
to it (part of library init) and would itself manage multiple
sub-sub-keys for each singleton (just one more level of indirection) ??

as for language feature, i would prefer:

Singleton*
Singleton::get_instance()
{
    synch static Singleton theSingleton( _..._ );
    return &theSingleton;
}

which is similar to PTHREAD's:

static Singleton* theSingleton;
static pthread_once_t singleton_init_once_control = PTHREAD_ONCE_INIT;

static void
singleton_init_once()
{
    static Singleton _theSingleton( _..._ );
    theSingleton = &_theSingleton;
}

Singleton*
Singleton::get_instance()
{
    pthread_once( &singleton_init_once_control, singleton_init_once );
    return theSingleton;
}


but that a) "synch" would be simply ignored (by compiler) for
ST programs and b) it would _not_ internally use any TSD on
UPs/MPs w/o memory reads reordering and c) it would _conditionally_
use TSD if running on processors with memory reads reordering
(uniprocessor - no need for TSD, multiprocessor - use TSD -> just
one more "if" -- could be even controlled by some compiler option
to make "server" code always use TSD - version w/o "if")

btw, linux pthreads (GLIBC):

int __pthread_once(pthread_once_t * once_control, void (*init_routine)(void))
{
    /* flag for doing the condition broadcast outside of mutex */
    int state_changed;

    /* Test without locking first for speed */
    if (*once_control == DONE) return 0;
    ...

seems to be incorrect (for MPs with memory reads reordering). <?>
(because it uses "broken" DCL and hence violates "On return from
pthread_once(), it is guaranteed that init_routine() has completed. "
with respect to memory visibility on MPs with memory reads reordering)

or am i missing something why __pthread_once above should work
anyway?

regards,
alexander.


Dave Butenhof

Mar 27, 2001, 7:21:42 AM
Alexander Terekhov wrote:

> umm.. why not try to reduce TLS key usage to just one per _all_ shared
> libraries, so that each shared library would get one sub-key assigned
> to it (part of library init) and would itself manage multiple
> sub-sub-keys for each singleton (just one more level of indirection) ??

The compiler and linker can fairly easily "conspire" to make TLS variables for a shared
library ("linkage unit") share a single TSD key. Doing it across shared libraries
(especially in the presence of dlopen()!) is substantially more complicated. Not
impossible, of course, but very likely not worth the extra time and effort. Even on a
system with the minimal POSIX quota of TSD keys, you'd need a lot of shared libraries to
run into problems.

> but that a) "synch" would be simply ignored (by compiler) for
> ST programs and b) it would _not_ internally use any TSD on
> UPs/MPs w/o memory reads reordering and c) it would _conditionally_
> use TSD if running on processors with memory reads reordering
> (uniprocessor - no need for TSD, multiprocessor - use TSD -> just
> one more "if" -- could be even controlled by some compiler option
> to make "server" code always use TSD - version w/o "if")

How do you define "ST programs" at compile time? If you're suggesting that the language
standard should specify something like a "-pthread" that is required for every compilation
unit that might possibly ever run in a threaded process, sure, you could do that. Adding
extra compile-time options for whether to generate SMP or uniprocessor code, though, is
getting absurd. Too many options, too many possible errors. A runtime library MIGHT
dynamically optimize for the execution environment; the generated application code
shouldn't.

> btw, linux pthreads (GLIBC):
>
> int __pthread_once(pthread_once_t * once_control, void (*init_routine)(void))
> {
> /* flag for doing the condition broadcast outside of mutex */
> int state_changed;
>
> /* Test without locking first for speed */
> if (*once_control == DONE) return 0;
>

> seems to be incorrect (for MPs with memory reads reordering). <?>
> (because it uses "broken" DCL and hence violates "On return from
> pthread_once(), it is guaranteed that init_routine() has completed. "
> with respect to memory visibility on MPs with memory reads reordering)
>
> or am i missing something why __pthread_once above should work
> anyway?

It'll work fine on a system without memory reordering. Are you sure that exact code
applies to all systems? If so, yeah, it's wrong... at least on Alpha. Might be OK on
Sparc, as long as there's a store barrier in the initialization side. And it doesn't
matter on primitive architectures like X86.

Alexander Terekhov

Mar 27, 2001, 9:52:45 AM

Dave Butenhof wrote:

> Alexander Terekhov wrote:

[...]


> > but that a) "synch" would be simply ignored (by compiler) for
> > ST programs and b) it would _not_ internally use any TSD on
> > UPs/MPs w/o memory reads reordering and c) it would _conditionally_
> > use TSD if running on processors with memory reads reordering
> > (uniprocessor - no need for TSD, multiprocessor - use TSD -> just
> > one more "if" -- could be even controlled by some compiler option
> > to make "server" code always use TSD - version w/o "if")
>
> How do you define "ST programs" at compile time? If you're suggesting
> that the language standard should specify something like a "-pthread"
> that is required for every compilation unit that might possibly ever
> run in a threaded process,

yup.

> sure, you could do that. Adding
> extra compile-time options for whether to generate SMP or uniprocessor
> code, though, is getting absurd. Too many options, too many possible
> errors. A runtime library MIGHT dynamically optimize for the execution
> environment; the generated application code shouldn't.

well, if a library can do it, why should not some application code also be allowed
to do it (dynamic optimization)? what i mean is that a compiler option would
control the generation of dynamically optimized code, so that if i anticipate
that my application would most likely run on SMP server(s) only, i would disable
dynamic optimization (in order to avoid the overhead of dynamic optimization
for uniprocessor) and would produce the executable code optimized for SMP
(sure it would also run on uniprocessor but not as fast as it could have been
done with dynamic optimization turned on). By default, compiler would generate
dynamically optimized code which would have better performance on uniprocessor
(better than SMP optimized code).

>
> > btw, linux pthreads (GLIBC):
> >
> > int __pthread_once(pthread_once_t * once_control, void (*init_routine)(void))
> > {
> > /* flag for doing the condition broadcast outside of mutex */
> > int state_changed;
> >
> > /* Test without locking first for speed */
> > if (*once_control == DONE) return 0;
> >
> > seems to be incorrect (for MPs with memory reads reordering). <?>
> > (because it uses "broken" DCL and hence violates "On return from
> > pthread_once(), it is guaranteed that init_routine() has completed. "
> > with respect to memory visibility on MPs with memory reads reordering)
> >
> > or am i missing something why __pthread_once above should work
> > anyway?
>
> It'll work fine on a system without memory reordering.
> Are you sure that exact code applies to all systems?

i am not. however, looking at the code

http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/linuxthreads/mutex.c?rev=1.25&content-type=text/x-cvsweb-markup&cvsroot=glibc

i do not see any ifdefs or something like that.. so i would assume
that the code is supposed to be 100% portable...

> If so, yeah, it's wrong... at least on Alpha.

umm, at least on Alpha _multiprocessor_ <?>
(i think that it should not have any problems on Alpha
uniprocessor - ??)

> Might be OK on Sparc, as long as there's a store barrier in
> the initialization side.

well, there is a mutex_lock between the call to init_routine and the
once_control update:

/* Here *once_control is stable and either NEVER or DONE. */
if (*once_control == NEVER) {
    *once_control = IN_PROGRESS | fork_generation;
    pthread_mutex_unlock(&once_masterlock);
    pthread_cleanup_push(pthread_once_cancelhandler, once_control);
    init_routine();
    pthread_cleanup_pop(0);
    pthread_mutex_lock(&once_masterlock);
    *once_control = DONE;
    state_changed = 1;
}

so, i am not sure here as well... i think that on Sparc the only mutex
related MB needed is in the mutex_unlock (since Sparc has no memory reads
reordering).

> And it doesn't matter on primitive architectures like X86.

;-)

regards,
alexander.

