
question re. usage of "static" within static member functions of a class


ssb

Sep 7, 2009, 1:14:01 AM
Hi All,
During a code review, I found the following lines of code:

class Data : public MyBaseClass
{
    Q_OBJECT

public:
    static Data* instance();
    ~Data();
    // ...

    // NOTE: no static member variables

private:
    Data();
};


The "instance" method was implemented as follows:

Data* Data::instance()
{
    static Data* model = new Data();
    return model;
}


I have never come across a situation where a pointer was set to static
in such a case. Is this valid? What are the potential pitfalls in such
programming practices?

Thanks and Best regards
~ssb

Chris M. Thomasson

Sep 7, 2009, 1:24:46 AM
"ssb" <s.sha...@gmail.com> wrote in message
news:97dc452a-f6a5-4a77...@e4g2000prn.googlegroups.com...

> Hi All,
> During a code review, I found the following lines of code:

[...]

> The "instance" method was implemented as follows:

> Data* Data::instance()
> {
> static Data* model = new Data();
> return model;
> }


> I have never come across a situation where a pointer was set to static
> in such a case. Is this valid?

It's a singleton.


> What are the potential pitfalls in such programming practices?

The storage that `model' points to will never be destroyed; also, it's
not thread-safe.

Shrikumar

Sep 7, 2009, 1:50:18 AM
Thanks for the quick reply, Chris.
I was wondering about the static pointer part - I have always seen
static variables (that are not pointers) in use, but never a static
pointer (even if it is to guarantee that the singleton always returns
the *same* instance of the Class). Is a static pointer (as in the
instance function) a perfectly valid use of the "static" keyword?
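[For what it's worth, a function-local static pointer obeys the same once-only initialization rule as any other function-local static: the pointer is initialized on the first call and keeps its value afterwards. A minimal sketch - the `Counter` class and names here are hypothetical, not from the code under review:

```cpp
#include <cassert>
#include <cstdio>

struct Counter {
    int calls;
    Counter() : calls(0) {}
};

// The pointer itself is the static object: it is initialized exactly
// once, on the first call, so every caller gets the same Counter.
Counter* instance()
{
    static Counter* model = new Counter();
    return model;
}

int main()
{
    instance()->calls++;
    instance()->calls++;
    assert(instance()->calls == 2);    // both calls hit the same object
    assert(instance() == instance());  // always the same pointer
    std::printf("ok\n");
    return 0;
}
```
]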

On Sep 7, 10:24 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "ssb" <s.sharm...@gmail.com> wrote in message

Francesco

Sep 7, 2009, 3:02:44 PM
On 7 Set, 07:50, Shrikumar <s.sharm...@gmail.com> wrote:
> On Sep 7, 10:24 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
> > "ssb" <s.sharm...@gmail.com> wrote in message
>
> >news:97dc452a-f6a5-4a77...@e4g2000prn.googlegroups.com...
>
> > > Hi All,
> > > During a code review, I found the following lines of code:
>
> > [...]
>
> > > The "instance" method was implemented as follows:
> > > Data* Data::instance()
> > > {
> > > static Data* model = new Data();
> > > return model;
> > > }
> > > I have never come across a situation where a pointer was set to static
> > > in such a case. Is this valid?
>
> > It's a singleton.
>
> > > What are the potential pitfalls in such programming practices?
>
> > The storage that `model' points to will never be destroyed, also it's not
> > thread-safe.
>
> Thanks for the quick reply, Chris.
> I was wondering about the static pointer part - I have always seen
> static variables (that are not pointers) in use, but never a static
> pointer (even if it is to guarantee that the singleton always returns
> the *same* instance of the Class). Is a static pointer (as in the
> instance function) a perfectly valid use of the "static" keyword?

I think it is. After all, that pointer should be initialized only once,
and operator new should be called only once - the first time the
method is invoked - but this confirmation is already implied in
Chris' sentence: "It's a singleton".

As usual, there are many ways to achieve the same target while
programming.

But did that coder really mean to implement it that way?

The implementation you reviewed:
-------


Data* Data::instance()
{
    static Data* model = new Data();
    return model;
}

-------

Gives the same result[1] as:
-------
Data* Data::instance() {
    static Data model;
    return &model;
}
-------

[1] Well, more or less "the same result". My mod could be preferred
because it doesn't use dynamic memory, but in order to keep clients
from seeing the pointer as something that could/should be deleted
sooner or later, the code could return the object as a reference.

Changing the method declaration, it could be implemented this
way:
-------
Data& Data::instance() {
    static Data model;
    return model;
}
-------

or also as:

-------
// data.cpp
#include "data.h"

namespace {
    Data model;
}

Data& Data::instance() {
    return &model;
}
-------

Which doesn't use the "static" keyword at all.

But of course, being a code-reviewer, you should already know all the
above.

Let's say I'm taking the chance to see if _I_ got it right, showing my
code for review here in clc++ ;-)

Cheers,
Francesco

PS: please do not top-post next time, especially because it messes
up subsequent replies to your top-posting.

Francesco

Sep 7, 2009, 3:13:24 PM

> -------
> // data.cpp
> #include "data.h"
>
> namespace {
>     Data model;
>
> }
>
> Data& Data::instance() {
>      return &model;}
>
> -------

Oops, kill that & before model!

Pavel

Sep 7, 2009, 4:15:04 PM
Shrikumar wrote:
> Thanks for the quick reply, Chris.
> I was wondering about the static pointer part - I have always seen
> static variables (that are not pointers) in use, but never a static
> pointer (even if it is to guarantee that the singleton always returns
> the *same* instance of the Class). Is a static pointer (as in the
> instance function) a perfectly valid use of the "static" keyword?
It is valid to declare pointers static, if that's what you mean. On a
side note, I think they could avoid both the pointer and the memory
leak (which may be harmless in this case, though) as follows:

{
    static Data model;
    return &model;
}


Hope this will help,
Pavel

Francesco

Sep 7, 2009, 5:27:41 PM

Heck, no, if that's a singleton the ctor is private, hence the module
cannot instantiate it. Sorry, kill all the above.

Francesco

Paavo Helde

Sep 7, 2009, 5:25:27 PM
Pavel <dot_com_yahoo@paultolk_reverse.yourself> kirjutas:

> Shrikumar wrote:
>> Thanks for the quick reply, Chris.
>> I was wondering about the static pointer part - I have always seen
>> static variables (that are not pointers) in use, but never a static
>> pointer (even if it is to guarantee that the singleton always returns
>> the *same* instance of the Class). Is a static pointer (as in the
>> instance function) a perfectly valid use of the "static" keyword?
> It is valid to declare pointers static if that's what you mean. On a
> side note, I think they could avoid both using pointer and the memory
> leak (which may be harmless in this case though) as follows:
>
> {
> static Data model;
> return &model;
> }

This brings along destruction problems at the end of the program. The
singleton might be destroyed too early this way, when some code still
might need access to it. When created dynamically, this problem does
not occur, and the memory is reclaimed by the OS upon process exit, so
there is no real memory leak. The singleton destructor is not run in
this case, so one should not put anything essential there.

The downside of dynamic allocation is that memory leak checkers flag
it as a real memory leak. A further complication is that it is not
easy to decide whether static allocation would cause a problem,
especially in general library code. Dynamic creation just circumvents
such problems, at the expense of creating noise in memory leak
checkers.
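[The tradeoff described here can be seen in a small sketch - class names are hypothetical. The static-storage singleton's destructor runs during static destruction at exit, while the dynamically created one is deliberately leaked and never destructed:

```cpp
#include <cstdio>

struct StaticOne {
    ~StaticOne() { std::printf("StaticOne destroyed\n"); }
};

struct LeakedOne {
    // Never printed: the object below is deliberately never deleted.
    ~LeakedOne() { std::printf("LeakedOne destroyed\n"); }
};

StaticOne& static_instance()
{
    static StaticOne s;  // destructed during static destruction at exit
    return s;
}

LeakedOne& leaked_instance()
{
    static LeakedOne* p = new LeakedOne();  // reclaimed by the OS, never destructed
    return *p;
}

int main()
{
    static_instance();
    leaked_instance();
    std::printf("main done\n");
    return 0;  // "StaticOne destroyed" prints after this; LeakedOne's dtor never runs
}
```
]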

Paavo


Francesco

Sep 7, 2009, 5:57:31 PM
On 7 Set, 23:25, Paavo Helde <pa...@nospam.please.ee> wrote:
> Pavel <dot_com_yahoo@paultolk_reverse.yourself> kirjutas:
>
> > Shrikumar wrote:
> >> Thanks for the quick reply, Chris.
> >> I was wondering about the static pointer part - I have always seen
> >> static variables (that are not pointers) in use, but never a static
> >> pointer (even if it is to guarantee that the singleton always returns
> >> the *same* instance of the Class). Is a static pointer (as in the
> >> instance function) a perfectly valid use of the "static" keyword?
> > It is valid to declare pointers static if that's what you mean. On a
> > side note, I think they could avoid both using pointer and the memory
> > leak (which may be harmless in this case though) as follows:
>
> > {
> >      static Data model;
> >      return &model;
> > }
>
> This brings along the destruction problems at the end of the program. The
> singleton might be destroyed too early this way, when some code still
> might need access to it. When created dynamically, this problem does not
> occur, and the memory is reclaimed by the OS upon process exit anyway, so
> there is no memory leak anyway. The singleton destructor is not run in
> this case, so one should not put something essential there.

Ahhhrgh! Thanks a lot for pointing this out - I was just stomping on
the same problem with my suggestion.

So then, if I got your post right, Paavo: in order to circumvent the
destruction-order problem I should create the singleton instance as a
dynamic object _and_ never delete it - otherwise I would be throwing
in the destruction-order problem again.

Side question - once I'm there - is the following fine?

-------
Data& Data::instance() {
    static Data* model = new Data();
    return *model;
}
-------

I hope so =/

Best regards,
Francesco

Paavo Helde

Sep 8, 2009, 2:12:03 AM
Francesco <entu...@gmail.com> kirjutas:

Yes, I think this is fine; my singletons typically look like this.

In addition, in case of multithreaded applications I usually call all
such instance() methods in the beginning of the program (or in the init
routine of a dynamically loaded library), in order to avoid potential
thread clash in the singleton creation. One could attempt to protect the
singleton creation by a static mutex, but then one would be back at the
statics destruction order problems.

Paavo

Francesco

Sep 8, 2009, 5:41:43 AM
On Sep 8, 8:12 am, Paavo Helde <pa...@nospam.please.ee> wrote:
> Francesco <entul...@gmail.com> kirjutas:

Good to know, I suppose the following should be fine too:

-------
Data& Data::instance() {
    static Data& model = *(new Data());
    // or also, simply:
    // static Data& model = *new Data;
    return model;
}
-------

I recall someone who wrote "programming _IS_ decision making" ;-)

> In addition, in case of multithreaded applications I usually call all
> such instance() methods in the beginning of the program (or in the init
> routine of a dynamically loaded library), in order to avoid potential
> thread clash in the singleton creation. One could attempt to protect the
> singleton creation by a static mutex, but then one would be back at the
> statics destruction order problems.

Thank you for the further details about multi-threading, I'm adding
them to my "toolbox", along with all the good new things I'm learning
here on clc++.

Cheers,
Francesco

James Kanze

Sep 8, 2009, 5:26:22 PM
On Sep 7, 7:24 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "ssb" <s.sharm...@gmail.com> wrote in message

> news:97dc452a-f6a5-4a77...@e4g2000prn.googlegroups.com...

> > During a code review, I found the following lines of code:

> [...]

> > The "instance" method was implemented as follows:
> > Data* Data::instance()
> > {
> > static Data* model = new Data();
> > return model;
> > }
> > I have never come across a situation where a pointer was set
> > to static in such a case. Is this valid?

> It's a singleton.

And to answer the question, it's perfectly valid. A pointer is
an object, just like any other variable, and obeys the same
rules as other variables.

> > What are the potential pitfalls in such programming practices?

> The storage that `model' points to will never be destroyed,
> also it's not thread-safe.

Not being destroyed is presumably the reason the code is written
this way. Most of the time, you don't want a singleton to be
destructed. In other words, it's a feature designed to avoid
pitfalls. As for thread-safety, it depends on the
implementation; it is thread-safe---modulo bugs---in g++. (But
you're probably right for most implementations.)

--
James Kanze

James Kanze

Sep 8, 2009, 5:39:39 PM
On Sep 7, 9:02 pm, Francesco <entul...@gmail.com> wrote:
> On 7 Set, 07:50, Shrikumar <s.sharm...@gmail.com> wrote:
> > On Sep 7, 10:24 am, "Chris M. Thomasson" <n...@spam.invalid>
> > wrote:
> > > "ssb" <s.sharm...@gmail.com> wrote in message
> > >news:97dc452a-f6a5-4a77...@e4g2000prn.googlegroups.com...

> > > > During a code review, I found the following lines of
> > > > code:

> > > [...]

> > > > The "instance" method was implemented as follows:
> > > > Data* Data::instance()
> > > > {
> > > > static Data* model = new Data();
> > > > return model;
> > > > }
> > > > I have never come across a situation where a pointer was
> > > > set to static in such a case. Is this valid?

> > > It's a singleton.

> > > > What are the potential pitfalls in such programming
> > > > practices?

> > > The storage that `model' points to will never be
> > > destroyed, also it's not thread-safe.

> > I was wondering about the static pointer part - I have


> > always seen static variables (that are not pointers) in use,
> > but never a static pointer (even if it is to guarantee that
> > the singleton always returns the *same* instance of the
> > Class). Is a static pointer (as in the instance function) a
> > perfectly valid use of the "static" keyword?

A pointer is an object, just like any other type, and a variable
of pointer type follows exactly the same rules as a variable of
any other type.

> I think it is. After all, that pointer should be initialized
> only once and operator new should be called only once - the
> first time the method is invoked - but this confirmation
> should be implied into Chris' sentence: "It's a singleton".

Maybe. The fundamental definition of a singleton is that the
class can only have one instance. The example just happens to
be one of the more frequent implementation techniques.

> As usual, there are many ways to achieve the same target while
> programming.

And many ways to achieve something that is almost the same
thing, but subtly different.

> But was that coder really meant to implement it that way?

> The implementation you reviewed:
> -------
> Data* Data::instance()
> {
> static Data* model = new Data();
> return model;
> }

> -------

> Gives the same result[1] of:
> -------
> Data* Data::instance() {
> static Data model;
> return &model;
> }
> -------

Not at all. In your version, the singleton will be destructed,
possibly too early. In his version, the singleton will never be
destructed. In most cases, his version is to be preferred.

> [1] Well, more or less "the same result". My mod could be
> preferred because it doesn't use dynamic memory, but in order
> to avoid some client to see the pointer as something that
> could/should be deleted sooner or later, the code could return
> the object as a reference.

That's an orthogonal issue, and typically, the instance function
of a singleton does return a reference, regardless of the
implementation behind it. That, of course, doesn't prevent the
client from deleting it---often, the destructor will be made
private as well. But typically, a reference will be used because
the client code isn't expected to delete it, and the function can
never return a null pointer.
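[A sketch of that combination (reference return, private destructor, never-destructed instance); the class body is a guess at the shape, not the reviewed code:

```cpp
#include <cassert>
#include <cstdio>

class Data {
public:
    static Data& instance()
    {
        static Data* model = new Data();  // deliberately never deleted
        return *model;
    }
private:
    Data() {}
    ~Data() {}  // private: `delete &Data::instance();' will not compile
};

int main()
{
    Data& d = Data::instance();
    assert(&d == &Data::instance());  // always the same object
    std::printf("ok\n");
    return 0;
}
```
]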

> Changing the method declaration, it could be implemented it in this
> way:
> -------
> Data& Data::instance() {
> static Data model;
> return model;
> }

> -------

> or also as:

> -------
> // data.cpp
> #include "data.h"

> namespace {
> Data model;
> }

> Data& Data::instance() {
> return &model;
> }
> -------

> Which doesn't use the "static" keyword at all.

And doesn't avoid the order of initialization issues the other
versions were designed to avoid.

> But of course, being a code-reviewer, you should already know
> all the above.

> Let's say I'm taking the chance to see if _I_ got it right,
> showing my code for review here in clc++ ;-)

Not knowing the requirements, it's hard to say. My usual
implementation is:

class Data
{
private:
    static Data*    ourInstance ;
    Data() {}
    ~Data() {}

public:
    static Data&    instance() ;
    //  ...
} ;

Data* Data::ourInstance = &instance() ;

Data&
Data::instance()
{
    if ( ourInstance == NULL ) {
        ourInstance = new Data ;
    }
    return *ourInstance ;
}

This solves the threading issues (for the most part), and avoids
any order of destruction issues, by not destructing the object.

--
James Kanze

Joshua Maurice

Sep 8, 2009, 6:30:31 PM

Does anyone know offhand the rules for extending the life of a
temporary which is bound to a static reference? Would the temporary
here die as soon as the first function invocation ends? Will it be
destroyed during static de-initialization? I'd instead suggest:

Joshua Maurice

Sep 8, 2009, 6:35:53 PM

Ack, hit submit early. I'd suggest that \with\ the caveat "Insert
extra code to make concurrent calls thread-safe if needed", generally
by doing

    namespace { bool force_init = (Data::instance(), true); }

which works in most cases (i.e., where there are no non-trivial
threads before main, or before dlopen or equivalent if this is in the
static init of a DLL).
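[Put together, the eager-initialization idiom described here would look something like this sketch - `Data` is reduced to a stub, and the names are illustrative:

```cpp
#include <cassert>
#include <cstdio>

class Data {
public:
    static Data& instance()
    {
        static Data* model = new Data();  // leaked on purpose, never destructed
        return *model;
    }
private:
    Data() { std::printf("Data constructed\n"); }
};

// Forces the singleton to be created during static initialization,
// before main() runs (assuming no non-trivial threads exist that early).
namespace {
    bool force_init = (Data::instance(), true);
}

int main()
{
    // The singleton already exists by the time main() starts.
    assert(&Data::instance() == &Data::instance());
    std::printf("ok\n");
    return 0;
}
```
]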

Chris M. Thomasson

Sep 8, 2009, 8:04:30 PM
"Paavo Helde" <pa...@nospam.please.ee> wrote in message
news:Xns9C805D9923...@216.196.109.131...
[...]

> In addition, in case of multithreaded applications I usually call all
> such instance() methods in the beginning of the program (or in the init
> routine of a dynamically loaded library), in order to avoid potential
> thread clash in the singleton creation. One could attempt to protect the
> singleton creation by a static mutex, but then one would be back at the
> statics destruction order problems.

Perhaps I am misunderstanding you, but FWIW in POSIX, it's perfectly legal
to create a `pthread_mutex_t' in static storage. There are absolutely no
static destruction-order problems wrt the mutex itself:
______________________________________________________________________
#include <pthread.h>
#include <exception>
#include <cassert>
#include <cstdio>


#if ! defined (PTHREAD_UNEXPECTED)
#  define PTHREAD_UNEXPECTED(s) \
     assert(! (s)), std::unexpected()
#endif


class lock_guard
{
    pthread_mutex_t& m_mutex;

public:
    lock_guard(pthread_mutex_t& mutex) throw()
        : m_mutex(mutex)
    {
        int status = pthread_mutex_lock(&m_mutex);
        if (status)
        {
            PTHREAD_UNEXPECTED(status);
        }
    }

    ~lock_guard() throw()
    {
        int status = pthread_mutex_unlock(&m_mutex);
        if (status)
        {
            PTHREAD_UNEXPECTED(status);
        }
    }
};


template<typename T>
struct non_destroying_singleton
{
    static T& instance()
    {
        static T* g_instance = NULL;
        static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

        lock_guard guard(g_mutex);

        if (! g_instance)
        {
            g_instance = new T();
        }

        return *g_instance;
    }
};


struct foo
{
    foo()
    {
        std::printf("(%p)->foo::foo()\n", (void*)this);
    }

    ~foo()
    {
        std::printf("(%p)->foo::~foo()\n", (void*)this);
    }
};


int main()
{
    {
        foo& f1 = non_destroying_singleton<foo>::instance();
        foo& f2 = non_destroying_singleton<foo>::instance();
        foo& f3 = non_destroying_singleton<foo>::instance();
        foo& f4 = non_destroying_singleton<foo>::instance();
        foo& f5 = non_destroying_singleton<foo>::instance();
    }

    return 0;
}
______________________________________________________________________


The singleton template code above `non_destroying_singleton<T>' is 100%
thread-safe.

Francesco

Sep 9, 2009, 3:27:10 AM

Well, to be precise, the last version above isn't even applicable,
because the ctor is private. I've already straightened out all the
points above, partly by myself, mostly with Paavo, as you can see from
all the posts between the one you are quoting from me and the one I'm
replying to. I suppose you didn't receive them. Freakin' NNTP.

> > But of course, being a code-reviewer, you should already know
> > all the above.
> > Let's say I'm taking the chance to see if _I_ got it right,
> > showing my code for review here in clc++ ;-)
>
> Not knowing the requirements, it's hard to say.  My usual
> implementation is:
>
>     class Data
>     {
>     private:
>         static Data*    ourInstance ;
>         Data() {}
>         ~Data() {}
>
>     public:
>         static Data&    instance() ;
>         //  ...
>     } ;
>
>      Data* Data::   ourInstance = &instance() ;
>
>      Data&
>      Data::instance()
>      {
>         if ( ourInstance == NULL ) {
>             ourInstance = new Data ;
>         }
>         return *ourInstance ;
>     }
>
> This solves the threading issues (for the most part), and avoids
> any order of destruction issues, by not destructing the object.

Thank you for showing an example of your implementation, it helps me
confronting the different approaches and getting a better grip on the
subject.

Best regards,
Francesco

Francesco

Sep 9, 2009, 3:34:35 AM

What temporary, exactly?

> > I'd instead suggest:
>
> > Data& Data::instance()
> > {   static Data* model = new Data();
> >     return *model;
>
> > }

The code you posted above has already been mentioned.

> Ack, hit submit early. I'd suggest that \with\ the caveat "Insert
> extra code to make concurrent calls thread-safe if needed", generally
> by doing
>     namespace { bool force_init = (Data::instance(), true); }
> which works in most cases (aka where there are no non-trivial threads
> before main, or before dlopen or equivalent if this is in the static
> init of a dll).

The above isn't clear to me. What's the difference between this:


namespace { bool force_init = (Data::instance(), true); }

and this:
namespace { Data::instance(); }
assuming the important part is calling Data::instance()? Where should
force_init be used?

In any case, thanks for pointing it out, calling Data::instance()
directly in the module should ensure that the singleton gets
instantiated before main() starts, I suppose.

Best regards,
Francesco

Francesco

Sep 9, 2009, 3:50:46 AM
On Sep 9, 9:34 am, Francesco <entul...@gmail.com> wrote:
> The above isn't clear to me. What's the difference between this:
>     namespace { bool force_init = (Data::instance(), true); }
> and this:
>     namespace { Data::instance(); }
> assuming the important part is calling Data::instance()? Where should
> force_init be used?

Heck, I got it. Freakin' trick, I should have tried my mod before.
Feel free to crunch the above silly error ignoring this post, I'm
getting accustomed.

Cheers,
Francesco

Paavo Helde

Sep 9, 2009, 1:04:17 PM
"Chris M. Thomasson" <n...@spam.invalid> kirjutas:

> "Paavo Helde" <pa...@nospam.please.ee> wrote in message
> news:Xns9C805D9923...@216.196.109.131...
> [...]
>> In addition, in case of multithreaded applications I usually call all
>> such instance() methods in the beginning of the program (or in the
>> init routine of a dynamically loaded library), in order to avoid
>> potential thread clash in the singleton creation. One could attempt
>> to protect the singleton creation by a static mutex, but then one
>> would be back at the statics destruction order problems.
>
> Perhaps I am misunderstanding you, but FWIW in POSIX, it's perfectly
> legal to create a `pthread_mutex_t' in static storage. There is
> absolutely no static's destruction order problems wrt to the mutex
> itself:

Yes, I suppose you are right. Probably the errors I remembered were
related to not joining all the background threads properly before app
shutdown. Maybe some mutex was held locked by a thread (we are using
boost::mutex, which asserts in this case in its destructor), but I
don't recall the details.

Paavo

REH

Sep 9, 2009, 1:22:05 PM
On Sep 8, 5:39 pm, James Kanze <james.ka...@gmail.com> wrote:
> Not knowing the requirements, it's hard to say.  My usual
> implementation is:
>
>     class Data
>     {
>     private:
>         static Data*    ourInstance ;
>         Data() {}
>         ~Data() {}
>
>     public:
>         static Data&    instance() ;
>         //  ...
>     } ;
>
>      Data* Data::   ourInstance = &instance() ;
>
>      Data&
>      Data::instance()
>      {
>         if ( ourInstance == NULL ) {
>             ourInstance = new Data ;
>         }
>         return *ourInstance ;
>     }
>
> This solves the threading issues (for the most part), and avoids
> any order of destruction issues, by not destructing the object.
>

This isn't any more thread-safe than any of the previous versions. In
C++0x, the static version will be thread-safe, but this version still
won't be.
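[Under the C++0x rules referred to here ("magic statics", standardized in C++11), the plain function-local static form needs no hand-written guard; the compiler emits a once-only, concurrency-safe initialization. A sketch:

```cpp
#include <cassert>
#include <cstdio>

class Data {
public:
    static Data& instance()
    {
        // C++11 guarantees this initialization runs exactly once,
        // even if several threads call instance() concurrently.
        static Data model;
        return model;
    }
private:
    Data() {}
};

int main()
{
    assert(&Data::instance() == &Data::instance());  // one shared object
    std::printf("ok\n");
    return 0;
}
```
]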

REH

Joshua Maurice

Sep 9, 2009, 3:35:25 PM
On Sep 9, 12:34 am, Francesco <entul...@gmail.com> wrote:
> On Sep 9, 12:35 am, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> > > > -------
> > > > Data& Data::instance() {
> > > >      static Data& model = *(new Data());
> > > >      // or also, simply:
> > > >      // static Data& model = *new Data;
> > > >      return model;}
> > > >
> > > > -------
> > >
> > > Someone know offhand the rules for extending the life of a temporary
> > > which is bound to a static reference? Would the temporary here die as
> > > soon as the first function invocation ends? Will it be destroyed
> > > during static de-initialization?
>
> What temporary, exactly?

> > > I'd instead suggest:
> > >
> > > Data& Data::instance()
> > > {   static Data* model = new Data();
> > >     return *model;
> > > }
>
> The code you posted above has already been mentioned.

Uhhh... Nevermind. My mistake. Not sure what I was thinking.

Joshua Maurice

Sep 9, 2009, 3:42:31 PM

Really? This is pretty thread-safe. Specifically, there will be
exactly 1 object creation of type "Data" for the entire program
(assuming the program doesn't die during static init). That is, there
will be exactly 1 object creation as long as there are no concurrent
calls to Data::instance before main if this is in the same static lib
as main, or as long as there are no concurrent calls to Data::instance
in the static init of the dll. These are generally safe assumptions as
non-trivial threads are usually not made in static init, but it's
still good to note such assumptions / requirements explicitly.

So, exactly what isn't thread-safe about this approach?

Also, what about C++0x will make this thread-safe? C++0x guarantees
that the initializer of a function-local static is called exactly once
on the first pass, even when called concurrently. In this case, the
relevant portion we care about:

> ourInstance = new Data ;

is not an initializer of a function-local static, so the new rules of
C++0x play no part in guaranteeing its safety.

REH

Sep 9, 2009, 5:04:37 PM
On Sep 9, 3:42 pm, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> Really? This is pretty thread-safe. Specifically, there will be
> exactly 1 object creation of type "Data" for the entire program
> (assuming the program doesn't die during static init). That is, there
> will be exactly 1 object creation as long as there are no concurrent
> calls to Data::instance before main if this is in the same static lib
> as main, or as long as there are no concurrent calls to Data::instance
> in the static init of the dll. These are generally safe assumptions as
> non-trivial threads are usually not made in static init, but it's
> still good to note such assumptions / requirements explicitly.

I didn't notice that James was initializing the static object with
instance(). I read it wrong. Although, I think that since the static
object is now accessed outside the function, there are potential
initialization-ordering issues if another static object utilizes
ourInstance.


>
> So, exactly what isn't thread-safe about this approach?
>
> Also, what about C++0x will make this thread-safe? C++0x guarantees
> that the initializer of a function local static is called exactly once
> on the first pass, even when called concurrently. In this case, the
> relevant portion we care about:> ourInstance = new Data ;
>
> is not an initializer of a function local static, so the new rules of C
> ++0x play no part in guaranteeing its safety.

You're arguing the wrong code. I said the example given by the first
poster would be thread-safe in C++0x.

REH


Chris M. Thomasson

Sep 9, 2009, 8:15:42 PM
"James Kanze" <james...@gmail.com> wrote in message
news:f0ed97f9-ea04-4952...@38g2000yqr.googlegroups.com...

You can get around static initialization and destruction ordering issues
by using a strongly thread-safe smart pointer to manage the singleton.
The pseudo-code would be something like the following:
_____________________________________________________________________
template<typename T>
static local_ptr<T>
once()
{
    static global_ptr<T> g_instance;

    local_ptr<T> lptr(g_instance);

    if (! lptr)
    {
        static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

        pthread_mutex_lock(&g_mutex);

        lptr = g_instance;

        if (! lptr)
        {
            try
            {
                lptr.reset(new T());
            }

            catch (...)
            {
                pthread_mutex_unlock(&g_mutex);

                throw;
            }

            g_instance = lptr;
        }

        pthread_mutex_unlock(&g_mutex);
    }

    return lptr;
}
_____________________________________________________________________


This is strongly thread-safe and will always work no matter how the static
ctor/dtor ordering comes out. The caveat, well, it's definitely not as
efficient as using a raw pointer and explicitly leaking the singleton.

Joshua Maurice

Sep 9, 2009, 9:12:00 PM
On Sep 9, 5:15 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in message

There are so many things wrong with that code sample, I don't even
know where to start. (Exaggeration. I do know where to start.)

Firstly and most importantly, you're using double checked locking,
which is broken in effectively all C++ implementations. Don't do that.
Please read, continue to re-read if you don't get it, this excellent
paper:
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

Next, you also have a race condition on the construction of the mutex
itself, in addition to the double checked locking of the singleton
instance construction. PTHREAD_MUTEX_INITIALIZER is entirely
equivalent to calling pthread_mutex_init, and thus is not thread-safe.
Repeat:
> lptr.reset(new T());
is not properly guarded (double checked locking) and


> static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

is not properly guarded (simple race condition). Both are race
conditions when the first call of this function may be concurrent.

Also, users may want multiple singletons of the same type. One unit of
code makes a singleton of type T, and another piece of code entirely
separate will make another singleton of type T. Your code does not
support that. Instead, you will get exactly 1 instance of T across the
entire program. This is a significant limitation. You could
potentially get around it with something like:

// header code
template <typename singleton_t, typename once_type>
singleton_t& getSingleton();

// code using the header to make a singleton in a hpp
class singleton_x_type { /* ... */ };
class once_type_for_singleton_X {};
singleton_x_type& getSingletonX()
{
    return getSingleton<singleton_x_type, once_type_for_singleton_X>();
}
If another piece of code wanted their own singleton of type X, then
they could define their own once_type and instantiate getSingleton on
that new once_type.

Lastly, RAII is your friend. Don't ever explicitly call
pthread_mutex_lock or pthread_mutex_unlock except in your portability
layer mutex wrapper class. This is not a correctness issue per se, but
vastly improves the chances that the code will be correct.

Your solution could be made correct by eager initializing it by adding
namespace { bool force_init = (::once<your_type>(), true); }
However, at that point, your fixed code is pretty much equivalent to
several examples already posted, aka:
T& getSingleton()
{ static T* x = new T;
return *x;
}
namespace { bool force_init = (getSingleton(), true); }
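Spelled out as a complete unit, that eager-init pattern looks like the sketch below. Config is an illustrative type, not from the thread, and the pattern assumes no threads are started during static initialization (an assumption Chris attacks later in this thread):

```cpp
#include <cassert>

struct Config
{
    int value;
    Config() : value(42) {}
};

// Meyers-style leaked singleton: the pointed-to object is never
// destroyed, which sidesteps static-destruction-order problems.
Config& getSingleton()
{
    static Config* x = new Config;  // constructed on first call
    return *x;
}

// Namespace-scope initializer forces that first call during static
// initialization, i.e. before main() begins.
namespace { bool force_init = (getSingleton(), true); }
```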

Chris M. Thomasson

Sep 10, 2009, 4:50:14 AM
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:aac0ea5e-c259-4177...@j9g2000prh.googlegroups.com...

On Sep 9, 5:15 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
[...]

> > You can get around static initialization and destruction ordering issues
> > by
> > using a strongly thread-safe smart pointer to manage the singleton. The
> > pseudo-code would be something like the following pseudo-code:
> > _____________________________________________________________________

[...]


> > _____________________________________________________________________
> >
> > This is strongly thread-safe and will always work no matter how the
> > static
> > ctor/dtor ordering comes out. The caveat, well, it's definitely not as
> > efficient as using a raw pointer and explicitly leaking the singleton.

> There are so many things wrong with that code sample, I don't even
> know where to start. (Exaggeration. I do know where to start.)

Actually, well... How much experience do you have wrt multi-threading
issues?


> Firstly and most importantly, you're using double checked locking,
> which is broken in effectively all C++ implementations.

I explicitly stated that one can:

"get around static initialization and destruction ordering issues by using a
strongly thread-safe smart pointer to manage the singleton"

Do you have any idea what I am writing about here? Go ahead and put it on
the self because this discussion can get advanced rather quickly!


> Don't do that.

;^)

Na. Anyway, double-checked locking IS 100% workable, period. You just don't
know it yet. That is NOT my problem. Oh well, shi% happens.


> Please read, continue to re-read if you don't get it, this excellent
> paper:
> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

What about it? Anyway, that paper is WELL known to me. Been there, done
that. Yawn.


> Next, you also have a race condition on the construction of the mutex
> itself,

;^/

Learn about POSIX:

http://www.opengroup.org/onlinepubs/7990989775/xsh/pthread_mutex_init.html


> in addition to the double checked locking of the singleton
> instance construction.

This is PERFECTLY fine. You just need to learn how to do it properly.


> PTHREAD_MUTEX_INITIALIZER is entirely
> equivalent to calling pthread_mutex_init, and thus is not thread-safe.

:^O

Learn about POSIX:

http://www.opengroup.org/onlinepubs/7990989775/xsh/pthread_mutex_init.html


> Repeat:
> > lptr.reset(new T());
> is not properly guarded (double checked locking) and

Yawn.


> > static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
> is not properly guarded (simple race condition). Both are race
> conditions when the first call of this function may be concurrent.

I am getting tired.


> Also, users may want multiple singletons of the same type. One unit of
> code makes a singleton of type T, and another piece of code entirely
> separate will make another singleton of type T. Your code does not
> support that. Instead, you will get exactly 1 instance of T across the
> entire program. This is a significant limitation.

FINALLY! You actually make sense; congratulations!

:^D

> You could
> potentially get around it with something like:
> //header code
> template <typename singleton_t, typename once_type>
> singleton_t & getSingleton();
>
> //code using the header to make a singleton in a hpp
> class singleton_x_type { /* ... */ };
> class once_type_for_singleton_X {};
> singleton_x_type & getSingletonX()
> {
> return getSingleton<singleton_x_type, once_type_for_singleton_X>
> ();
> }
> If another piece of code wanted their own singleton of type X, then
> they could define their own once_type and instantiate getSingleton on
> that new once_type.

Sure.


> Lastly, RAII is your friend. Don't ever explicitly call
> pthread_mutex_lock or pthread_mutex_unlock except in your portability
> layer mutex wrapper class.

Sure. Although, using try/catch is 100% perfectly fine. Whatever, this was
pseudo-code; lol. Anyway, here is example of RAII:

http://groups.google.com/group/comp.lang.c++/msg/63c3aecb9d0fbaa6

Blah, blah.


> This is not a correctness issue per se, but
> vastly improves the chances that the code will be correct.

Yes; RAII is very convenient.


> Your solution could be made correct by eager initializing it by adding
> namespace { bool force_init = (::once<your_type>(), true); }

My pseudo-code IS correct. IMVHO, you obviously need some education, that's
all; there is absolutely NOTHING to be ashamed about. I personally LOVE to
learn __all__ about new things!!!


> However, at that point, your fixed code is pretty much equivalent to
> several examples already posted, aka:
> T& getSingleton()
> { static T* x = new T;
> return *x;
> }
> namespace { bool force_init = (getSingleton(), true); }

NO!

;^/

Chris M. Thomasson

Sep 10, 2009, 4:53:39 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h8aei5$2ecq$1...@news.ett.com.ua...

Ummm, that last sentence should read as:

Go ahead and put it on the SHELF because this discussion can get advanced
rather quickly!


[...]


Sorry about that.

Chris M. Thomasson

Sep 10, 2009, 4:57:41 AM
Sorry for being so darn harsh!

;^(....

James Kanze

Sep 10, 2009, 7:04:21 AM
On Sep 9, 9:27 am, Francesco <entul...@gmail.com> wrote:
> On Sep 8, 11:39 pm, James Kanze <james.ka...@gmail.com> wrote:

[...]


> > And doesn't avoid the order of initialization issues the other
> > versions were designed to avoid.

> Well, to be precise, the last version above isn't even
> applicable, because the ctor is private. I've already
> straightened all the points above, partly by myself, mostly
> with Paavo, as you can see from all the posts between the one
> you are quoting from me and the one I'm replying to. I suppose
> you didn't receive them. Freakin' NNTP.

I saw that later. The problem isn't that I don't see articles;
the problem is that I don't see them in the same order. So I
respond to one article before seeing the next.

[...]


> > Not knowing the requirements, it's hard to say. My usual
> > implementation is:

> > class Data
> > {
> > private:
> > static Data* ourInstance ;
> > Data() {}
> > ~Data() {}
>
> > public:
> > static Data& instance() ;
> > // ...
> > } ;

> > Data* Data:: ourInstance = &instance() ;

> > Data&
> > Data::instance()
> > {
> > if ( ourInstance == NULL ) {
> > ourInstance = new Data ;
> > }
> > return *ourInstance ;
> > }

> > This solves the threading issues (for the most part), and
> > avoids any order of destruction issues, by not destructing
> > the object.

> Thank you for showing an example of your implementation, it
> helps me confronting the different approaches and getting a
> better grip on the subject.

Actually, I posted it because I thought it might intrigue you.
It does depend on a somewhat subtle feature of C++: the fact
that Data::ourInstance is initialized twice, first to NULL, and
then with the initialization expression.

--
James Kanze

James Kanze

Sep 10, 2009, 7:24:39 AM
On Sep 10, 3:12 am, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> On Sep 9, 5:15 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "James Kanze" <james.ka...@gmail.com> wrote in message
> >news:f0ed97f9-ea04-4952...@38g2000yqr.googlegroups.com...

> > You can get around static initialization and destruction

> > local_ptr<T> lptr(g_instance);

> > pthread_mutex_lock(&g_mutex);

> > _____________________________________________________________________

I doubt that Chris would have missed the issues addressed in
that paper. He doesn't tell us what local_ptr is, so it's hard
to say, but presumably, it takes steps to ensure the necessary
synchronization. (Double checked locking is possible, at least
on most platforms, provided you're not adverse to a little
assembler. It's not possible with just C++ and mutexes,
however, but I think it can be done under Posix using thread
local storage. Whether it's worth the bother is a different
question, given that much simpler solutions exist.)

Until we know what's behind local_ptr and global_ptr, however,
there's no way to be sure.

> Next, you also have a race condition on the construction of
> the mutex itself, in addition to the double checked locking of
> the singleton instance construction. PTHREAD_MUTEX_INITIALIZER
> is entirely equivalent to calling pthread_mutex_init, and thus
> is not thread-safe.

No. PTHREAD_MUTEX_INITIALIZER is static initialization,
guaranteed to occur before any user written code is executed.

--
James Kanze

Joshua Maurice

Sep 10, 2009, 7:33:10 AM
On Sep 10, 1:50 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message

Ok, so apparently your "strong thread-safe" pointer class has barrier
semantics on "operator!" and "reset". If that's the case, then yes, it
does make this double checked locking safe. I'm sorry here.

However, next time please emphasize that "my pointer class has barrier
semantics on operator! and reset". You should see the recent post
about what "thread-safe" means in terms of boost shared pointer. The
general result of that thread was several people 'knew' the definition
of thread-safe in the context of boost shared pointer, except that
each definition was different.

> > Next, you also have a race condition on the construction of the mutex
> > itself,
>
> ;^/
>
> Learn about POSIX:
>

> http://www.opengroup.org/onlinepubs/7990989775/xsh/pthread_mutex_init...

Quoting Open Group
> In cases where default mutex attributes are appropriate, the macro PTHREAD_MUTEX_INITIALIZER can be used to initialise mutexes that are statically allocated. The effect is equivalent to dynamic initialisation by a call to pthread_mutex_init() with parameter attr specified as NULL, except that no error checks are performed.

Now, I admit this has always been vague to me. Let's see. It says that
"the effect is equivalent to dynamic initialization by a call to
pthread_mutex_init". Unfortunately, I'm used to the C++ world where
"dynamic initialization" means something very specific. Under that
definition, using it as the initializer of a function-local static
would be a race condition. However, it seems that C does not allow
function-local statics to be initialized with anything but a constant
expression (as determined by Comeau online), making the code what C++
calls "static initialization". I am thus entirely wrong on this point.
I'm sorry.

And yes, I rarely use posix directly. I'm in a world where my code has
to work on windows and posix, and all windows mutexes require runtime
init, so I have to program to the lowest common denominator. I'm still
wrong, just explaining my background.

--
That leaves us with your code, which while being obfuscated, is also
correct, given a certain latitude in your favor aka the exact
definition of a "strongly thread-safe pointer class". If your pointer
class just uses pthread mutexes, then your example, once simplified of
redundant mutexes, is equivalent to

struct pthread_mutex_guard
{
    pthread_mutex_guard(pthread_mutex_t & y) : x(y) { pthread_mutex_lock(&x); }
    ~pthread_mutex_guard() { pthread_mutex_unlock(&x); }
    pthread_mutex_t & x;
};
template <typename T>
T& once()
{
static T* t = 0;
{
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_guard g(m);
if (0 == t)
t = new T;
}
return *t;
}

If your pointer class makes use of read and write memory barriers,
then it's equivalent to

struct pthread_mutex_guard
{
    pthread_mutex_guard(pthread_mutex_t & y) : x(y) { pthread_mutex_lock(&x); }
    ~pthread_mutex_guard() { pthread_mutex_unlock(&x); }
    pthread_mutex_t & x;
};
template <typename T>
T& once()
{
static T* t = 0;
if (0 == t)
{
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_guard g(m);
if (0 == t)
{
T* tmp = new T;
write_barrier(); /* platform specific */
t = tmp;
}
}
read_barrier(); /* platform specific */
return *t;
}

Either way, in this context of trying to demonstrate the proper way of
writing a singleton, one should not hide away crucial implementation
details as you did. It is not useful to other readers to say "here's
what it looks like, except for the important implementation details".
You used something that looked very much like the "common" double
checked locking anti-pattern, but with "hidden magic" which made it
correct. That will only help perpetuate the myth that the "common"
double checked locking anti-pattern is correct.

PS: Note that both of my solutions in this post are relatively devoid
of error checking. You should probably at least assert on the return
code of pthread functions.

PPS: All windows standard mutexes have runtime init, so you cannot do
either of these approaches without significant modifications, or
without rolling your own mutex built on some of the atomic primitives
like test and swap. See:
http://www.ddj.com/cpp/199203083?pgno=7
for a good discussion on how to do this.

Alternatively, use a namespace scope variable to force construction
before main (or before dlopen returns for static init of a dll) to
"guarantee" correctness as demonstrated in countless posts in this
thread.

Joshua Maurice

Sep 10, 2009, 7:40:08 AM

Indeed. I should probably think a little more before I speak, err
type. I was having a bad day. Thank you James and I'm sorry again
Chris.

Francesco

Sep 10, 2009, 7:46:40 AM
On 10 Set, 13:04, James Kanze <james.ka...@gmail.com> wrote:
> On Sep 9, 9:27 am, Francesco <entul...@gmail.com> wrote:
>
> > On Sep 8, 11:39 pm, James Kanze <james.ka...@gmail.com> wrote:
>
> [...]
>
> > > And doesn't avoid the order of initialization issues the other
> > > versions were designed to avoid.
> > Well, to be precise, the last version above isn't even
> > applicable, because the ctor is private. I've already
> > straightened all the points above, partly by myself, mostly
> > with Paavo, as you can see from all the posts between the one
> > you are quoting from me and the one I'm replying to. I suppose
> > you didn't receive them. Freakin' NNTP.
>
> I saw that latter. The problem isn't that I don't see articles;
> the problem is that I don't see them in the same order. So I
> respond to one article before seeing the next.

Fine, no problem.

Uh, well, I've read your code but I didn't think about the fact that
ourInstance had to be initialized before being compared to NULL - in
other words, I failed to catch the subtlety.

Can I rely on this default, hidden initialization for static members
of any type?

Something like this:
-------
class A {
static int data;
public:
static int get_data();
};

int A::data = get_data();

int A::get_data() {
if (data == 0) data = 42;
return data;
}
-------

Is the above guaranteed to work for any type?
(using the appropriate value on the rhs of ==, I mean)

Best regards,
Francesco

Chris M. Thomasson

Sep 10, 2009, 12:29:04 PM
"James Kanze" <james...@gmail.com> wrote in message
news:f708f2f5-72d3-434b...@q7g2000yqi.googlegroups.com...

> On Sep 10, 3:12 am, Joshua Maurice <joshuamaur...@gmail.com> wrote:
>> On Sep 9, 5:15 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>> > "James Kanze" <james.ka...@gmail.com> wrote in message
>> >news:f0ed97f9-ea04-4952...@38g2000yqr.googlegroups.com...
>
>> > You can get around static initialization and destruction
>> > ordering issues by using a strongly thread-safe smart
>> > pointer to manage the singleton. The pseudo-code would be
>> > something like the following pseudo-code:
>> > _____________________________________________________________________
[...]

>> > _____________________________________________________________________
>
>> > This is strongly thread-safe and will always work no matter
>> > how the static ctor/dtor ordering comes out. The caveat,
>> > well, it's definitely not as efficient as using a raw
>> > pointer and explicitly leaking the singleton.
>
>> There are so many things wrong with that code sample, I don't
>> even know where to start. (Exaggeration. I do know where to
>> start.)
>
>> Firstly and most importantly, you're using double checked
>> locking, which is broken in effectively all C++
>> implementations. Don't do that. Please read, continue to
>> re-read if you don't get it, this excellent
>> paper:http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
>
> I doubt that Chris would have missed the issues addressed in
> that paper. He doesn't tell us what local_ptr is, so it's hard
> to say, but presumably, it takes steps to ensure the necessary
> synchronization. (Double checked locking is possible, at least
> on most platforms, provided you're not adverse to a little
> assembler. It's not possible with just C++ and mutexes,
> however, but I think it can be done under Posix using thread
> local storage. Whether it's worth the bother is a different
> question, given that much simpler solutions exist.)
>
> Until we know what's behind local_ptr and global_ptr, however,
> there's no way to be sure.

global/local_ptr was meant to represent a smart pointer class that can
achieve strong thread-safety. Here is a full blown implementation of such a
smart pointer:


http://atomic-ptr-plus.sourceforge.net

http://sourceforge.net/projects/atomic-ptr-plus/files
(download the atomic-ptr-plus package)


in this case, global_ptr == atomic_ptr.


BTW, any general purpose multi-threaded smart pointer that did not have
release semantics for storing into, and acquire semantics for loading from,
is basically busted by default.

;^)


You can also take a look at my experimental smart pointer here:

http://webpages.charter.net/appcore/vzoom/refcount


[...]

Chris M. Thomasson

Sep 10, 2009, 12:56:41 PM
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:11bae170-d413-482a...@s21g2000prm.googlegroups.com...

Humm... Why would it need any memory barrier on `operator ! ()'? The load
from the global pointer to the local pointer already has to have implied
acquire semantics, so putting membar in `operator !' would be adding
needless overhead. As for `reset()', it has to have implied release
semantics.


> If that's the case, then yes, it
> does make this double checked locking safe. I'm sorry here.


> However, next time please emphasize that "my pointer class has barrier
> semantics on operator! and reset".

I thought it was a given. How could any general purpose multi-threaded smart
pointer class work without implied acquire on load and release on store
memory barrier semantics?
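For readers following this exchange today: the implied acquire-on-load and release-on-store semantics being discussed map directly onto what later became std::atomic memory orderings in C++11. A minimal double-checked sketch in those terms (Widget and instance are illustrative names, not the thread's local_ptr/global_ptr classes):

```cpp
#include <atomic>
#include <mutex>
#include <cassert>

struct Widget { int value = 7; };

std::atomic<Widget*> g_instance{nullptr};
std::mutex g_mutex;

Widget& instance()
{
    // Acquire load: if we observe a non-null pointer, we also observe
    // the fully constructed object it points to.
    Widget* p = g_instance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard<std::mutex> lock(g_mutex);
        p = g_instance.load(std::memory_order_relaxed);  // re-check under lock
        if (!p) {
            p = new Widget;
            // Release store: publishes the construction above to
            // subsequent acquire loads in other threads.
            g_instance.store(p, std::memory_order_release);
        }
    }
    return *p;
}
```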


> You should see the recent post
> about what "thread-safe" means in terms of boost shared pointer.

You mean this one:

http://groups.google.com/group/comp.lang.c++/browse_frm/thread/f363f8a8010f9323

?


> The
> general result of that thread was several people 'knew' the definition
> of thread-safe in the context of boost shared pointer, except that
> each definition was different.

I just know that shared_ptr only provides the basic/normal thread-safety
level. I gave several examples in that discussion.

That's fine.


> That leaves us with your code, which while being obfuscated, is also
> correct, given a certain latitude in your favor aka the exact
> definition of a "strongly thread-safe pointer class". If your pointer
> class just uses pthread mutexes, then your example, once simplified of
> redundant mutexes, is equivalent to

[...]

One point, a smart pointer class that is 100% lock-based is fine. However,
it can become a fairly significant bottleneck.


> If your pointer class makes use of read and write memory barriers,
> then it's equivalent to

[...]

FWIW, most systems, DEC Alpha aside for a moment, would have
`read_barrier()' defined to be a no-op because of the implicit data-dependent
load barrier (e.g., Linux `read_barrier_depends()').


> Either way, in this context of trying to demonstrate the proper way of
> writing a singleton, one should not hide away crucial implementation
> details as you did.

I did not want to bother showing an implementation of the smart pointer just
to get across that it has acquire/release on load/store respectively. Humm,
perhaps that was a mistake!

;^(


> It is not useful to other readers to say "here's
> what it looks like, except for the important implementation details".
> You used something that looked very much like the "common" double
> checked locking anti-pattern, but with "hidden magic" which made it
> correct. That will only help perpetuate the myth that the "common"
> double checked locking anti-pattern is correct.

Fair enough.


> PS: Note that both of my solutions in this post are relatively devoid
> of error checking. You should probably at least assert on the return
> code of pthread functions.

Yes. Refer to one of my other posts in this thread:

http://groups.google.com/group/comp.lang.c++/msg/63c3aecb9d0fbaa6

Take note of the `PTHREAD_UNEXPECTED' macro.

;^)


> PPS: All windows standard mutexes have runtime init, so you cannot do
> either of these approaches without significant modifications, or
> without rolling your own mutex built on some of the atomic primitives
> like test and swap. See:
> http://www.ddj.com/cpp/199203083?pgno=7
> for a good discussion on how to do this.

Humm... One quote from that page has me a bit "uneasy":


"If no wait is necessary, locking a QLock requires just one atomic
instruction and no system calls"


The author fails to mention that along with the atomic instruction comes a
very nasty and performance destroying `#StoreLoad | #StoreStore' memory
barrier. This will be executed regardless of any contention.


> Alternatively, use a namespace scope variable to force construction
> before main (or before dlopen returns for static init of a dll) to
> "guarantee" correctness as demonstrated in countless posts in this
> thread.

If you are fine with leaking the singleton and are a disciplined programmer,
then it will work.

Chris M. Thomasson

Sep 10, 2009, 1:28:15 PM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h8bb27$2qla$1...@news.ett.com.ua...

> "Joshua Maurice" <joshua...@gmail.com> wrote in message
[...]

>> Alternatively, use a namespace scope variable to force construction
>> before main (or before dlopen returns for static init of a dll) to
>> "guarantee" correctness as demonstrated in countless posts in this
>> thread.

Humm...

There is really no "guarantee" of correctness wrt this technique in the
presence of threads because it is not thread-safe in any way, shape or form.
It's not a general purpose construct unless there is explicit documentation
covering all the caveats.

I can easily break the heck out of it. For instance, check this crap out:
_____________________________________________________________________
#include <pthread.h>
#include <sched.h>
#include <cstdio>


void yield()
{
for (unsigned i = 0; i < 10; ++i)
{
sched_yield();
}
}


template<typename T, unsigned T_id = 0>
class non_destroying_singleton
{
static T* g_instance;

public:
static T& instance();
};


template<typename T, unsigned T_id>
T& non_destroying_singleton<T, T_id>::instance()
{
yield();

if (! g_instance)
{
yield();

g_instance = new T();

yield();
}

yield();

return *g_instance;
}


template<typename T, unsigned T_id>
T* non_destroying_singleton<T, T_id>::g_instance =
&non_destroying_singleton<T, T_id>::instance();


struct foo
{
foo()
{
std::printf("(%p)->foo::foo()\n", (void*)this);
}


~foo()
{
std::printf("(%p)->foo::~foo()\n", (void*)this);
}
};


extern "C"
void* killer_thread(void*)
{
std::puts("killer_thread");

non_destroying_singleton<foo>::instance();

return NULL;
}


struct killer
{
pthread_t m_tid[10];

killer()
{
std::puts("killer::killer()");

for (unsigned i = 0; i < 10; ++i)
{
pthread_create(&m_tid[i], NULL, killer_thread, NULL);
}
}


~killer()
{
std::puts("killer::~killer()");

for (unsigned i = 0; i < 10; ++i)
{
pthread_join(m_tid[i], NULL);
}
}
};


static killer g_killer;


int
main()
{
std::puts("main");

return 0;
}
_____________________________________________________________________


Oh shi%! Massive race-condition here... `foo' can be created multiple times
and leaked all over the damn place. Here is the output I happened to get:
_____________________________________________________________________
killer::killer()
killer_thread
killer_thread
killer_thread
killer_thread
killer_thread
killer_thread
(0024B7B8)->foo::foo()
killer_thread
killer_thread
(0024B808)->foo::foo()
(0024B818)->foo::foo()
(0024B828)->foo::foo()
killer_thread
killer_thread
main
killer::~killer()
_____________________________________________________________________


KA-BOOOOOOM!!!!!!!!!!!

:^o


> If you are fine with leaking the singleton and are a disciplined
> programmer, then it will work.

You need to be very disciplined.

;^)

Chris M. Thomasson

Sep 10, 2009, 2:43:05 PM
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:11bae170-d413-482a...@s21g2000prm.googlegroups.com...
[...]

> PPS: All windows standard mutexes have runtime init, so you cannot do
> either of these approaches without significant modifications, or
> without rolling your own mutex built on some of the atomic primitives
> like test and swap. See:
> http://www.ddj.com/cpp/199203083?pgno=7
> for a good discussion on how to do this.

> Alternatively, use a namespace scope variable to force construction
> before main (or before dlopen returns for static init of a dll) to
> "guarantee" correctness as demonstrated in countless posts in this
> thread.


Windows has everything one needs in order to create a 100% correct DCL
pattern. The following code will work with Windows, modulo bugs of course
because I just hacked it together:
______________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <sstream>
#include <string>
#include <cassert>
#include <exception>
#include <cstdio>


class win_dcl_mutex
{
HANDLE m_mutex;


private:
static std::string prv_get_name()
{
std::ostringstream name;

name << "DCL_MUTEX_" << GetCurrentProcessId();

return name.str();
}


public:
win_dcl_mutex() throw()
: m_mutex(CreateMutex(NULL, TRUE, prv_get_name().c_str()))
{
if (! m_mutex)
{
assert(m_mutex);
std::unexpected();
}

else if (GetLastError() == ERROR_ALREADY_EXISTS)
{
if (WaitForSingleObject(m_mutex, INFINITE) !=
WAIT_OBJECT_0)
{
assert(m_mutex);
CloseHandle(m_mutex);
std::unexpected();
}
}
}


~win_dcl_mutex() throw()
{
if (! ReleaseMutex(m_mutex))
{
assert(m_mutex);
CloseHandle(m_mutex);
std::unexpected();
}

if (! CloseHandle(m_mutex))
{
assert(m_mutex);
std::unexpected();
}
}
};


template<typename T, unsigned T_id = 0>

class win_non_destroying_singleton
{
static T* prv_load(T* volatile* pptr)
{
T* ptr1 = *pptr;

if (! ptr1)
{
ptr1 = (T*)InterlockedCompareExchangePointer(
(void* volatile*)pptr, NULL, NULL);
}

if (ptr1)
{
// does `InterlockedCompareExchange()' perform a
// membar on failure? here is a funny work-around:

T* ptr2 = (T*)InterlockedExchangePointer(
(void* volatile*)pptr, ptr1);

if (ptr1 != ptr2)
{
assert(ptr1 == ptr2);
std::unexpected();
}
}

return ptr1;
}


static T* prv_store(T* volatile* pptr, T* ptr1)
{
assert(ptr1);

T* ptr2 = (T*)InterlockedExchangePointer(
(void* volatile*)pptr, ptr1);

if (ptr2)
{
assert(! ptr2);
std::unexpected();
}

return ptr1;
}


public:


static T& instance()
{
static T* g_instance = NULL;

T* local = prv_load(&g_instance);

if (! local)
{
win_dcl_mutex lock;

if (! (local = g_instance))
{
local = prv_store(&g_instance, new T());
}
}

return *local;
}
};


int
main()
{
{
int& x1 = win_non_destroying_singleton<int>::instance();
int& x2 = win_non_destroying_singleton<int>::instance();
}

return 0;
}

______________________________________________________________________


BTW, I had to code the `win_non_destroying_singleton<T>::prv_load()'
function that way because Microsoft does not document whether or not
`InterlockedCompareExchange()' performs a memory barrier on the failure
case. If it did, then it could simply look like this:
______________________________________________________________________
static T* prv_load(T* volatile* pptr)
{
return (T*)InterlockedCompareExchangePointer(
(void* volatile*)pptr, NULL, NULL);
}
______________________________________________________________________


Also, if you are using a recent version of MSVC (e.g., I think it's 8 or
higher), then the class can look like:
______________________________________________________________________


template<typename T, unsigned T_id = 0>

struct win_non_destroying_singleton
{
static T& instance()
{
static T* volatile g_instance = NULL;

T* local = g_instance;

if (! local)
{
win_dcl_mutex lock;

if (! (local = g_instance))
{
local = g_instance = new T();
}
}

return *local;
}
};
______________________________________________________________________


This is because those versions of MSVC have acquire/release semantics for
loading/storing from/to volatile variables on certain platforms (e.g.,
PowerPC)...

Joshua Maurice

Sep 10, 2009, 3:59:22 PM
On Sep 10, 11:43 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message
> > PPS: All windows standard mutexes have runtime init, so you cannot do
> > either of these approaches without significant modifications, or
> > without rolling your own mutex built on some of the atomic primitives
> > like test and swap. See:
> >http://www.ddj.com/cpp/199203083?pgno=7
> > for a good discussion on how to do this.
> > Alternatively, use a namespace scope variable to force construction
> > before main (or before dlopen returns for static init of a dll) to
> > "guarantee" correctness as demonstrated in countless posts in this
> > thread.
>
> Windows has everything one needs in order to create a 100% correct DCL
> pattern. The following code will work with Windows, modulo bugs of course
> because I just hacked it together:

[snip probably correct impl]

I should probably just shut up and take my losses. I am still right
that there is no mutex on windows without runtime init, but your
workaround, which Boost also uses to do boost_once IIRC, is quite
good.

I forget which versions, but Windows also has its own native
version of pthread_once, so you could also use that. Also, as you
mentioned, volatile would work, for whichever VS versions and
platforms make it into barriers.
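The pthread_once route avoids both hand-rolled barriers and any question about mutex initialization, since PTHREAD_ONCE_INIT is static initialization. A minimal sketch (Config is an illustrative type; this assumes a POSIX platform, and leaks the instance in the same non-destroying style as the rest of the thread):

```cpp
#include <pthread.h>

struct Config
{
    int value;
    Config() : value(42) {}
};

static Config* g_instance = 0;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;  // static init: no init race

// pthread_once takes a plain function, so give it C linkage.
extern "C" void init_instance()
{
    g_instance = new Config;  // leaked on purpose
}

Config& instance()
{
    // pthread_once runs init_instance exactly once and provides the
    // necessary memory synchronization to every caller.
    pthread_once(&g_once, init_instance);
    return *g_instance;
}
```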

James Kanze

Sep 10, 2009, 5:35:07 PM
On Sep 10, 1:46 pm, Francesco <entul...@gmail.com> wrote:
> On 10 Set, 13:04, James Kanze <james.ka...@gmail.com> wrote:

[...]


> Uh, well, I've read your code but I didn't think about the
> fact that ourInstance had to be initialized before being
> compared to NULL - in other words, I missed to catch the
> subtlety.

You missed the fact that the function used to initialize the
variable uses the previous value of the variable.

In fact, it's a clever trick, with both the positive and
negative implications of "clever". The suggestion else thread
of using a separate boolean variable to accomplish this achieves
exactly the same results, and is doubtlessly more readable.

> Can I rely on this default, hidden initialization for static
> members of any type?

For any variable. The standard describes the initialization of
objects with static lifetime as occuring in three phases: zero
initialization, static initialization and dynamic
initialization. In practice, there's no way to distinguish the
first two, since both occur before any code you write is
executed. (In practice, on most systems, the first two are done
by the system, when the program is loaded, either by copying an
image from disk or by zapping the block containing the variables
with 0's. This only works, of course, on systems where null
pointers and 0.0 are represented by all 0 bits, but that's the
vast majority of systems today.)

> Something like this:
> -------
> class A {
> static int data;
> public:
> static int get_data();
> };

> int A::data = get_data();

> int A::get_data() {
> if (data == 0) data = 42;
> return data;
> }
> -------

> Is the above guaranteed to work for any type?
> (using the appropriate value on the rhs of ==, I mean)

Or using 0. Yes. For class types, it applies recursively,
until you reach a basic type which can be "zero initialized".
The one exception is references, since by definition, a
reference can't be zero initialized.
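Francesco's fragment, made self-contained, demonstrates the guarantee: `data` is zero-initialized before the dynamic initialization of `A::data` runs, so `get_data()` can safely test it against 0 (the sentinel value 42 is just the one from his post):

```cpp
#include <cassert>
#include <iostream>

class A {
    static int data;   // zero-initialized before any dynamic initialization
public:
    static int get_data();
};

// Dynamic initialization: get_data() already sees data == 0 here.
int A::data = A::get_data();

int A::get_data() {
    if (data == 0) data = 42;  // relies on the guaranteed zero initialization
    return data;
}

int main() {
    assert(A::get_data() == 42);
    std::cout << A::get_data() << "\n";
    return 0;
}
```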

--
James Kanze

Jerry Coffin

unread,
Sep 10, 2009, 7:07:27 PM9/10/09
to
In article <aac0ea5e-c259-4177-9781-d94931593069
@j9g2000prh.googlegroups.com>, joshua...@gmail.com says...

[ ... ]

> Firstly and most importantly, you're using double checked locking,
> which is broken in effectively all C++ implementations. Don't do that.
> Please read, continue to re-read if you don't get it, this excellent
> paper:
> http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

This is a decent paper, but it should be kept in context. Up until
shortly before he (helped) write that paper, Andrei seems to have
thought that the 'volatile' keyword was sufficient to give assurances
necessary for multithreading (in essence that reading or writing a
volatile variable acted as a memory barrier).

That paper followed shortly after he realized that this just wasn't
so. The tone of the paper is unfortunate though -- it comes off as
basically saying there's a problem with double-checked locking, which
really isn't the case at all. The problem is that C++ (up through the
2003 standard) simply lacks memory barriers. Double-checked locking
is one example of code that _needs_ a memory barrier to work
correctly -- but it's only one example of many.

Chris, OTOH, has been playing with multi-threaded programming for a
while. He quite blithely takes for granted a fair number of things
that the current standard doesn't provide at all -- mostly because
without those, there's basically no way to discuss the areas that
interest him. A couple of those are objects with acquire and release
semantics -- i.e. with memory barriers in the right places. The
minute you add a few multithreading primitives like that, the
problems with DCLP disappear.

The real problem was never with the DCLP itself, but with attempting
to do multi-threaded programming without the tools necessary for the
job.

--
Later,
Jerry.

Joshua Maurice

unread,
Sep 10, 2009, 7:35:02 PM9/10/09
to
On Sep 10, 4:07 pm, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
> In article <aac0ea5e-c259-4177-9781-d94931593069
> @j9g2000prh.googlegroups.com>, joshuamaur...@gmail.com says...

>
> [ ... ]
>
> > Firstly and most importantly, you're using double checked locking,
> > which is broken in effectively all C++ implementations. Don't do that.
> > Please read, continue to re-read if you don't get it, this excellent
> > paper:
> >http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
>
> This is a decent paper, but it should be kept in context. Up until
> shortly before he (helped) write that paper, Andrei seems to have
> thought that the 'volatile' keyword was sufficient to give assurances
> necessary for multithreading (in essence that reading or writing a
> volatile variable acted as a memory barrier).
>
> That paper followed shortly after he realized that this just wasn't
> so. The tone of the paper is unfortunate though -- it comes off as
> basically saying there's a problem with double-checked locking, which
> really isn't the case at all. The problem is that C++ (up through the
> 2003 standard) simply lacks memory barriers.

Sorry, what? There's something resembling threading guarantees in
C++03?

> Double-checked locking
> is one example of code that _needs_ a memory barrier to work
> correctly -- but it's only one example of many.
>
> Chris, OTOH, has been playing with multi-threaded programming for a
> while. He quite blithely takes for granted a fair number of things
> that the current standard doesn't provide at all -- mostly because
> without those, there's basically no way to discuss the areas that
> interest him. A couple of those are objects with acquire and release
> semantics -- i.e. with memory barriers in the right places. The
> minute you add a few multithreading primitives like that, the
> problems with DCLP disappear.
>
> The real problem was never with the DCLP itself, but with attempting
> to do multi-threaded programming without the tools necessary for the
> job.

The problem is that most people I talk to at work have certain
notions of how threading works. One of these notions is that
one can reason about threading by considering all possible
interleavings of instructions (which, of course, is incorrect). Sadly,
most of my colleagues still don't "get it". The prototypical example
I use: "Suppose a thread does a write A and a write B in source code.
Even after that processor core has executed both store instructions,
another thread may see write A and not write B, another thread may see
write B and not see write A, another thread may see both, and another
thread may see neither." If one was incorrectly taught threading,
which I think is the case for most university graduates nowadays, it
is very hard to break old habits and understand how threading
really works.

Ex: One reply I just got was "What if it was:
x = a;
y = x;
Surely another thread cannot see the write to y and not see the write
to x?" (BTW, no, a thread may still see the write to y without seeing
the write to x.)
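(With hindsight: in C++11 and later, the visibility ordering described here can be expressed portably. A minimal sketch, assuming a C++11 compiler; the release store of `flag` is what makes the earlier plain write to `data` visible to the acquiring reader, and without that pairing the assertion could fail:)

```cpp
#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>

int data = 0;                // plain, non-atomic payload
std::atomic<int> flag(0);    // publication flag

void writer() {
    data = 42;                                 // write A
    flag.store(1, std::memory_order_release);  // write B, publishes write A
}

void reader() {
    // Spin until the flag is set; acquire pairs with the release above.
    while (flag.load(std::memory_order_acquire) != 1) {}
    assert(data == 42);  // guaranteed only because of release/acquire
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
    std::cout << "ok\n";
    return 0;
}
```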

DCL is the most common application I see of this flawed reasoning. The
heart of DCL is that one relies on investigating the possible
interleavings of source code "instructions" to see if it will work
out. IMHO, as soon as you modify DCL to be correct in C or C++, it is
no longer DCL. You have fundamentally changed the basis of your belief
of its correctness when you add in proper synchronization. No longer
are you doing a single check outside of all synchronization, the
hallmark of DCL. It only superficially resembles DCL.

Now, arguably, your post Jerry and mine is merely a definition over
terms. I hope I've generated more light than heat, but I don't think
it will be useful to get into a discussion of whether DCL is by
definition broken or merely usually incorrectly implemented. Define it
however you want. I still think that DCL by definition is incorrect in
C and C++, and modifications to make it correct render it no longer
DCL.

Chris M. Thomasson

unread,
Sep 10, 2009, 8:48:03 PM9/10/09
to
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:2736a5f6-59eb-434e...@u16g2000pru.googlegroups.com...
[...]

> DCL is the most common application I see of this flawed reasoning. The
> heart of DCL is that one relies on investigating the possible
> interleavings of source code "instructions" to see if it will work
> out. IMHO, as soon as you modify DCL to be correct in C or C++, it is
> no longer DCL.

What modifications of the DCL algorithm itself are needed in order to get it
to work in C/C++? I don't have to modify the actual algorithm in order to
get it to work in C/C++. I only have to implement the algorithm using highly
non-portable methods.


> You have fundamentally changed the basis of your belief
> of its correctness when you add in proper synchronization. No longer
> are you doing a single check outside of all synchronization,
> the hallmark of DCL. It only superficially resembles DCL.

Unless I am misunderstanding you, I have to disagree here. The algorithm
needs proper synchronization on the fast-path and on the slow-path. On the
fast-path it needs a data-dependant load memory barrier, and on the slow
path it needs the mutex along with release semantics. There is no "single
check outside of all synchronization". AFAICT, that's a misunderstanding on
how the algorithm actually works. Skipping the mutex acquisition/release is
the only thing that DCL actually optimizes. Just because you can skip a call
to the mutex (e.g., fast-path) does not mean that you're somehow "outside of
all synchronization".


> Now, arguably, your post Jerry and mine is merely a definition over
> terms. I hope I've generated more light than heat, but I don't think
> it will be useful to get into a discussion of whether DCL is by
> definition broken or merely usually incorrectly implemented. Define it
> however you want. I still think that DCL by definition is incorrect in
> C and C++, and modifications to make it correct render it no longer
> DCL.

DCL is DCL. Even if I implement it in pure ASM, I still cannot get away from
using proper synchronization. If I implement DCL in C/C++ using
platform/compiler specific guarantees, well, it's still DCL, not something
different.

Chris M. Thomasson

unread,
Sep 10, 2009, 9:24:19 PM9/10/09
to
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:a6221b91-add3-4629...@a37g2000prf.googlegroups.com...

On Sep 10, 11:43 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message
> > > PPS: All windows standard mutexes have runtime init, so you cannot do
> > > either of these approaches without significant modifications, or
> > > without rolling your own mutex built on some of the atomic primitives
> > > like test and swap. See:
> > >http://www.ddj.com/cpp/199203083?pgno=7
> > > for a good discussion on how to do this.
> > > Alternatively, use a namespace scope variable to force construction
> > > before main (or before dlopen returns for static init of a dll) to
> > > "guarantee" correctness as demonstrated in countless posts in this
> > > thread.
> >
> > Windows has everything one needs in order to create a 100% correct DCL
> > pattern. The following code will work with Windows, modulo bugs of
> > course
> > because I just hacked it together:

> [snip probably correct impl]

> I should probably just shut up and take my losses. I am still right
> that there is no mutex on windows without runtime init, but your
> workaround, which Boost also uses to do boost_once IIRC, is quite
> good.

Ahhh. I forgot that Boost does something like that. It works, but it's kind
of "hackish"...

;^)


> I forget which versions, but windows also have their own native
> version of pthread_once, so you could also use that.

It's on Windows Vista and higher.


> Also, as you
> mentioned, volatile would work, for whichever VS versions and
> platforms make it into barriers.

Yes.

Francesco

unread,
Sep 11, 2009, 8:16:50 AM9/11/09
to

Good, now seems that I got it. Thanks for the further explanation
James.

Cheers,
Francesco

Jerry Coffin

unread,
Sep 11, 2009, 12:45:52 PM9/11/09
to
In article <2736a5f6-59eb-434e-b21b-968b735a8e91
@u16g2000pru.googlegroups.com>, joshua...@gmail.com says...

>
> On Sep 10, 4:07 pm, Jerry Coffin <jerryvcof...@yahoo.com> wrote:

[ ... ]

> > That paper followed shortly after he realized that this just
> > wasn't so. The tone of the paper is unfortunate though -- it
> > comes off as basically saying there's a problem with double-
> > checked locking, which really isn't the case at all. The problem
> > is that C++ (up through the 2003 standard) simply lacks memory
> > barriers.
>
> Sorry, what? There's something resembling threading guarantees in
> C++03?

No, "lacks" would mean "does not have", and almost any kind of
threading guarantee requires some kind of memory barrier.

[ ... ]

> The problem is that most people I talk to at my work, they have
> certain notions of how threading works. One of these notions is that
> one can reason about threading by considering all possible
> interleavings of instructions (which, of course, is incorrect). Sadly,
> most of my colleagues still don't "get it". The proto-typical example
> I use: "Suppose a thread does a write A and a write B in source code.
> Physically after that processor core does both load instructions,
> another thread may see write A and not write B, another thread may see
> write B and not see write A, another thread may see both, and another
> thread may see neither." If one was incorrectly taught threading,
> which is the case I think for most university graduates nowadays, it
> is very hard to break your old habits and understand how threading
> really works.

I think the problem is really somewhat more fundamental: with
threading, it's almost pointless to try to figure out every scenario
that _could_ happen. Instead, you have to think strictly in terms of
what you've _made_ happen, and assume something wrong will happen
unless you've prevented it.

[ ... ]

> IMHO, as soon as you modify DCL to be correct in C or C++, it is
> no longer DCL. You have fundamentally changed the basis of your belief
> of its correctness when you add in proper synchronization.

I can't agree. Just for an obvious example, see page 12 of the paper
you cited, which gives (most of) an implementation of the DCLP using
memory barriers to ensure correctness. Note, in particular, that it's
essentially identical to the original (incorrect) code other than the
addition of two memory barriers (added only as comments, since the
current version of C++ doesn't include them).

With C++ 0x, we'll have acquire and release memory barriers, and
we'll be able to implement exactly that code -- and I can't see any
way you can reasonably call it anything other than a DCL -- it still
has the double-check and locking, just with memory barriers inserted
to ensure that it actually works.
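A sketch of what that code looks like once acquire/release barriers become available (assuming a C++11 compiler; the names are illustrative, not taken from the paper). It is still recognizably double-checked locking: a first check outside the mutex, a second check under it, with the barriers making the publication safe:

```cpp
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

class Singleton {
public:
    static Singleton* instance() {
        Singleton* p = s_instance.load(std::memory_order_acquire);  // first check
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(s_mutex);
            p = s_instance.load(std::memory_order_relaxed);         // second check
            if (p == nullptr) {
                p = new Singleton;
                s_instance.store(p, std::memory_order_release);     // publish
            }
        }
        return p;
    }
private:
    Singleton() {}
    static std::atomic<Singleton*> s_instance;
    static std::mutex s_mutex;
};

std::atomic<Singleton*> Singleton::s_instance(nullptr);
std::mutex Singleton::s_mutex;

int main() {
    // All threads must observe the same instance.
    Singleton* seen[4] = {};
    std::thread t[4];
    for (int i = 0; i < 4; ++i)
        t[i] = std::thread([&seen, i] { seen[i] = Singleton::instance(); });
    for (auto& th : t) th.join();
    bool same = seen[0] != nullptr && seen[1] == seen[0]
             && seen[2] == seen[0] && seen[3] == seen[0];
    std::cout << (same ? "same" : "different") << "\n";
    return 0;
}
```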

> No longer
> are you doing a single check outside of all synchronization, the
> hallmark of DCL. It only superficially resembles DCL.

I'd say the hallmark of DCL is obvious from the name: double-checked
locking. As long as you still do double-checked locking, insertion of
memory barriers doesn't change the fact that you're still doing
double-checked locking, and therefore implementing the double-checked
locking pattern.

--
Later,
Jerry.

James Kanze

unread,
Sep 12, 2009, 4:39:09 AM9/12/09
to
On Sep 11, 1:07 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
> In article <aac0ea5e-c259-4177-9781-d94931593069
> @j9g2000prh.googlegroups.com>, joshuamaur...@gmail.com says...

> [ ... ]
> > Firstly and most importantly, you're using double checked
> > locking, which is broken in effectively all C++
> > implementations. Don't do that. Please read, continue to
> > re-read if you don't get it, this excellent paper:
> >http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

> This is a decent paper, but it should be kept in context. Up
> until shortly before he (helped) write that paper, Andrei
> seems to have thought that the 'volatile' keyword was
> sufficient to give assurances necessary for multithreading (in
> essence that reading or writing a volatile variable acted as a
> memory barrier).

I seem to recall that it was a couple of years before the paper
in question. Andrei did (naively) propose a means of achieving
thread safety using volatile. In fact, his solution worked, but
not for the reasons he thought---his solution actually only used
the fact that volatile is part of the type system. In the
following discussions, however, he quickly realized (and openly
admitted) that his understanding wasn't complete; since then
(and before writing the paper in question), he completed it.
The paper was also thoroughly reviewed before publication, to
ensure accuracy.

> That paper followed shortly after he realized that this just
> wasn't so. The tone of the paper is unfortunate though -- it
> comes off as basically saying there's a problem with
> double-checked locking, which really isn't the case at all.

This depends on how you defined double checked locking. There
is a definite problem in the code presented by Vlissides in his
original article, and that is what most people understand by
double checked locking. IIRC, the paper in question does make
it clear that double checked locking can be made to work using
assembler (at least on most platforms) or perhaps some
additional, system specific requests (other than just mutexes),
and while I don't think the paper mentions it, it can also be
made to work using thread local storage. In practice, it's
generally not worth it, since the additional assembler generally
does more or less what the outer mutex (which you're trying to
avoid) does, and costs about the same in run time.

> The problem is that C++ (up through the 2003 standard) simply
> lacks memory barriers. Double-checked locking is one example
> of code that _needs_ a memory barrier to work correctly -- but
> it's only one example of many.

It can be made to work with thread local storage as well,
without memory barriers.

[...]


> The real problem was never with the DCLP itself, but with
> attempting to do multi-threaded programming without the tools
> necessary for the job.

Yes. The "problem" with DCLP is in fact just a symptom of a
larger problem, of people not understanding what is and is not
guaranteed (and to a lesser degree, of people not really
understanding the costs---acquiring a non-contested mutex is
really very, very cheap, and usually not worth trying to avoid).

--
James Kanze

Chris M. Thomasson

unread,
Sep 12, 2009, 5:29:35 AM9/12/09
to
"James Kanze" <james...@gmail.com> wrote in message
news:edee09a7-fbc2-41fd...@a21g2000yqc.googlegroups.com...
[...]

>> The problem is that C++ (up through the 2003 standard) simply
>> lacks memory barriers. Double-checked locking is one example
>> of code that _needs_ a memory barrier to work correctly -- but
>> it's only one example of many.
>
> It can be made to work with thread local storage as well,
> without memory barriers.

Indeed; something along the lines of:

<pseudo-code typed in newsreader>


___________________________________________________________________
template<typename T, unsigned T_id = 0>
struct non_destroying_singleton
{
    static T& instance()
    {
        static __thread T* l_instance = NULL;

        if (! l_instance)
        {
            static T* g_instance = NULL;

            static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

            lock_guard lock(g_mutex);

            if (! (l_instance = g_instance))
            {
                l_instance = g_instance = new T();
            }
        }

        return *l_instance;
    }
};
___________________________________________________________________


No memory barriers required! Yeah!

;^)


[...]

Jerry Coffin

unread,
Sep 12, 2009, 7:01:33 PM9/12/09
to
In article <edee09a7-fbc2-41fd-84b4-
dcdae8...@a21g2000yqc.googlegroups.com>, james...@gmail.com
says...

[ ... using a memory barrier ]

> In practice, it's
> generally not worth it, since the additional assembler generally
> does more or less what the outer mutex (which you're trying to
> avoid) does, and costs about the same in run time.

I have to disagree with both of these. First, a memory barrier is
quite a bit different from a mutex. Consider (for example) a store
fence. It simply says that stores from all previous instructions must
complete before any stores from subsequent instructions (and a read
barrier does the same, but for reads). It's basically equivalent to a
sequence point, but for real hardware instead of a conceptual model.

As far as cost goes: a mutex normally uses kernel data, so virtually
every operation requires a switch from user mode to kernel mode and
back. The cost for that will (of course) vary between systems, but is
almost always fairly high (figure a few thousand CPU cycles as a
reasonable minimum).

A memory barrier will typically just prevent combining a subsequent
write with a previous one. As long as there's room in the write queue
for both pieces of data, there's no cost at all. In the (normally
rare) case that the CPU's write queue is full, a subsequent write has
to wait for a previous write to complete to create an empty spot in
the write queue. Even in this worst case, it's generally going to be
around an order of magnitude faster than a switch to kernel mode and
back.



> > The problem is that C++ (up through the 2003 standard) simply
> > lacks memory barriers. Double-checked locking is one example
> > of code that _needs_ a memory barrier to work correctly -- but
> > it's only one example of many.
>
> It can be made to work with thread local storage as well,
> without memory barriers.

Well, yes -- poorly stated on my part. It requires _some_ sort of
explicit support for threading that's missing from the current and
previous versions of C++, but memory barriers aren't the only
possible one.

[ ... ]



> Yes. The "problem" with DCLP is in fact just a symptom of a
> larger problem, of people not understanding what is and is not
> guaranteed (and to a lesser degree, of people not really
> understanding the costs---acquiring a non-contested mutex is
> really very, very cheap, and usually not worth trying to avoid).

At least under Windows, this does not fit my experience. Of course,
Windows has its own cure (sort of) for the problem -- rather than
using a mutex (with its switch to/from kernel mode) you'd usually use
a critical section instead. Entering a critical section that's not in
use really is very fast.

Then again, a critical section basically is itself just a double-
checked lock (including the necessary memory barriers). They have two
big limitations: first, unlike a normal mutex, they only work between
threads in a single process. Second, they can be quite slow when/if
there's a great deal of contention for the critical section.

--
Later,
Jerry.

Keith H Duggar

unread,
Sep 12, 2009, 10:36:16 PM9/12/09
to
On Sep 8, 5:39 pm, James Kanze <james.ka...@gmail.com> wrote:
> Not knowing the requirements, it's hard to say. My usual
> implementation is:
>
> class Data
> {
> private:
> static Data* ourInstance ;
> Data() {}
> ~Data() {}
>
> public:
> static Data& instance() ;
> // ...
> } ;
>
> Data* Data:: ourInstance = &instance() ;
>
> Data&
> Data::instance()
> {
> if ( ourInstance == NULL ) {
> ourInstance = new Data ;
> }
> return *ourInstance ;
> }
>
> This solves the threading issues (for the most part), and avoids
> any order of destruction issues, by not destructing the object.

I think it's worth noting that the implementation above always
instantiates Data even if it is never used by another part of
the program, i.e., the above is not the "lazy" instantiation of
the original implementation.

KHD

Joshua Maurice

unread,
Sep 12, 2009, 10:42:04 PM9/12/09
to
On Sep 12, 4:01 pm, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
> In article <edee09a7-fbc2-41fd-84b4-
> dcdae859b...@a21g2000yqc.googlegroups.com>, james.ka...@gmail.com
> says...

> > Yes.  The "problem" with DCLP is in fact just a symptom of a
> > larger problem, of people not understanding what is and is not
> > guaranteed (and to a lesser degree, of people not really
> > understanding the costs---acquiring a non-contested mutex is
> > really very, very cheap, and usually not worth trying to avoid).
>
> At least under Windows, this does not fit my experience. Of course,
> Windows has its own cure (sort of) for the problem -- rather than
> using a mutex (with its switch to/from kernel mode) you'd usually use
> a critical section instead. Entering a critical section that's not in
> use really is very fast.
>
> Then again, a critical section basically is itself just a double-
> checked lock (including the necessary memory barriers). They have two
> big limitations: first, unlike a normal mutex, they only work between
> threads in a single process. Second, they can be quite slow when/if
> there's a great deal of contention for the critical section.

Well, with contention, how slow would a CriticalSection be compared to
a WIN32 Mutex object? I would presume about the same, and with that
assumption, I would say one should use CriticalSections as your mutex
implementation on windows unless you actually need the extra
functionality, like locking the same object across processes, which in
my experience is quite rare.

Jerry Coffin

unread,
Sep 12, 2009, 10:57:35 PM9/12/09
to
In article <cb280907-a612-4a13-baa8-01688c959868
@z3g2000prd.googlegroups.com>, joshua...@gmail.com says...

[ ... ]

> Well, with contention, how slow would a CriticalSection be compared
> to a WIN32 Mutex object? I would presume about the same, and with
> that assumption, I would say one should use CriticalSections as
> your mutex implementation on windows unless you actually need the
> extra functionality, like locking the same object across processes,
> which in my experience is quite rare.

We're getting off topic, so I won't try to get into a lot of detail,
but with sufficient contention, a critical section can become
_substantially_ slower than direct use of a mutex.

A couple of years ago (or so) there was a thread about this on
comp.os.ms-windows.programmer.win32. IIRC, somebody eventually posted
an improved version that didn't suffer nearly as bad of a slowdown.
Unfortunately, given the degree to which Google has broken searching
on the newsgroup archive, I can't tell you exactly where to get that.

--
Later,
Jerry.

Chris M. Thomasson

unread,
Sep 12, 2009, 11:37:23 PM9/12/09
to
"James Kanze" <james...@gmail.com> wrote in message
news:ecb6784b-9cc7-49d7...@r9g2000yqa.googlegroups.com...

> On Sep 7, 9:02 pm, Francesco <entul...@gmail.com> wrote:
[...]

> And doesn't avoid the order of initialization issues the other
> versions were designed to avoid.
>

>> But of course, being a code-reviewer, you should already know
>> all the above.
>
>> Let's say I'm taking the chance to see if _I_ got it right,
>> showing my code for review here in clc++ ;-)


>
> Not knowing the requirements, it's hard to say. My usual
> implementation is:
>
> class Data
> {
> private:
> static Data* ourInstance ;
> Data() {}
> ~Data() {}
>
> public:
> static Data& instance() ;
> // ...
> } ;
>
> Data* Data:: ourInstance = &instance() ;
>
> Data&
> Data::instance()
> {
> if ( ourInstance == NULL ) {
> ourInstance = new Data ;
> }
> return *ourInstance ;
> }
>
> This solves the threading issues (for the most part), and avoids
> any order of destruction issues, by not destructing the object.

FWIW, this is how to break it using threads:

http://groups.google.com/group/comp.lang.c++/msg/769cda6552764337

Chris M. Thomasson

unread,
Sep 13, 2009, 12:14:20 AM9/13/09
to
"Jerry Coffin" <jerryv...@yahoo.com> wrote in message
news:MPG.2515d484b...@news.sunsite.dk...

> In article <edee09a7-fbc2-41fd-84b4-
> dcdae8...@a21g2000yqc.googlegroups.com>, james...@gmail.com
> says...
>
> [ ... using a memory barrier ]
>
>> In practice, it's
>> generally not worth it, since the additional assembler generally
>> does more or less what the outer mutex (which you're trying to
>> avoid) does, and costs about the same in run time.
>
> I have to disagree with both of these. First, a memory barrier is
> quite a bit different from a mutex. Consider (for example) a store
> fence. It simply says that stores from all previous instructions must
> complete before any stores from subsequent instructions (and a read
> barrier does the same, but for reads). It's basically equivalent to a
> sequence point, but for real hardware instead of a conceptual model.
>
>
> As far as cost goes: a mutex normally uses kernel data, so virtually
> every operation requires a switch from user mode to kernel mode and
> back.

Well, a `CRITICAL_SECTION' is a mutex. A HANDLE returned from
`CreateMutex()' is also a mutex. Therefore, a mutex implementation does not
need to always enter the kernel. IMVHO, Windows made a poor choice of naming
their intra-process mutex. IMO, a "critical section" does not refer to the
synchronization primitive, but to the actual locked portion of the caller's
code.


> The cost for that will (of course) vary between systems, but is
> almost always fairly high (figure a few thousand CPU cycles as a
> reasonable minimum).


> A memory barrier will typically just prevent combining a subsequent
> write with a previous one.

A #StoreLoad memory barrier can cripple performance by forcing the previous
stores to be performed before any subsequent load can be committed. This
defeats the purpose of caching and pipelining:

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce


> As long as there's room in the write queue
> for both pieces of data, there's no cost at all.

There is a cost when you start getting into load-to-store and store-to-load
ordering constraints. For instance, the following store membar:


MEMBAR #StoreStore

is MUCH less expensive when compared to a version which has a store-to-load
ordering constraint:

MEMBAR #StoreLoad | #StoreStore

You need a store-to-load ordering in the acquisition portion of a
traditional mutex. This is why even user-space fast-pathed mutexes can have
some nasty overheads even in the non-contended case.


> In the (normally
> rare) case that the CPU's write queue is full, a subsequent write has
> to wait for a previous write to complete to create an empty spot in
> the write queue. Even in this worst case, it's generally going to be
> around an order of magnitude faster than a switch to kernel mode and
> back.


>> > The problem is that C++ (up through the 2003 standard) simply
>> > lacks memory barriers. Double-checked locking is one example
>> > of code that _needs_ a memory barrier to work correctly -- but
>> > it's only one example of many.
>>
>> It can be made to work with thread local storage as well,
>> without memory barriers.


> Well, yes -- poorly stated on my part. It requires _some_ sort of
> explicit support for threading that's missing from the current and
> previous versions of C++, but memory barriers aren't the only
> possible one.

>> Yes. The "problem" with DCLP is in fact just a symptom of a
>> larger problem, of people not understanding what is and is not
>> guaranteed (and to a lesser degree, of people not really
>> understanding the costs---acquiring a non-contested mutex is
>> really very, very cheap, and usually not worth trying to avoid).
>
> At least under Windows, this does not fit my experience. Of course,
> Windows has its own cure (sort of) for the problem -- rather than
> using a mutex (with its switch to/from kernel mode) you'd usually use
> a critical section instead.


> Entering a critical section that's not in
> use really is very fast.

Not in all cases... Try a test with something like the following crude
pseudo-code:
________________________________________________________________
struct data
{
int array[128];
};


static struct data g_data = { { 0 } };


void locked_reader_threads()
{
unsigned iterations;
CRITICAL_SECTION mutex;

InitializeCriticalSection(&mutex);

for (iterations = 0 ;; ++iterations)
{
EnterCriticalSection(&mutex);

LeaveCriticalSection(&mutex);

for (int i = 0; i < 128; ++i)
{
int x = g_data.array[i];

if (x) abort();
}
}

DeleteCriticalSection(&mutex);
}


void naked_reader_threads()
{
unsigned iterations;

for (iterations = 0 ;; ++iterations)
{
for (int i = 0; i < 128; ++i)
{
int x = g_data.array[i];

if (x) abort();
}
}
}
________________________________________________________________


See how many iterations each reader thread can perform per-second under
worst case load for prolonged periods of time (e.g., like sustained high
intensity bursts of traffic on a database server or something). The
`locked_reader_threads()' should perform less reads per-second-per-thread.


> Then again, a critical section basically is itself just a double-
> checked lock (including the necessary memory barriers).

AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive mutexes.
Could the inter-process mutexes share the same properties? I think so.


> They have two
> big limitations: first, unlike a normal mutex, they only work between
> threads in a single process. Second, they can be quite slow when/if
> there's a great deal of contention for the critical section.

WRT slow, are you referring to the old implementation of
`CRITICAL_SECTION's which used to hand off ownership? IIRC, MS changed
the implementation to allow a thread to sneak in and acquire the
mutex, which increases performance considerably. However, unlike
handing off ownership, it's not fair and can be subject to
"starvation".

Chris M. Thomasson
Sep 13, 2009, 12:38:22 AM
"Jerry Coffin" <jerryv...@yahoo.com> wrote in message
news:MPG.25160d8c...@news.sunsite.dk...

> In article <cb280907-a612-4a13-baa8-01688c959868
> @z3g2000prd.googlegroups.com>, joshua...@gmail.com says...
>
> [ ... ]
>
>> Well, with contention, how slow would a CriticalSection be compared
>> to a WIN32 Mutex object? I would presume about the same, and with
>> that assumption, I would say one should use CriticalSections as
>> your mutex implementation on windows unless you actually need the
>> extra functionality, like locking the same object across processes,
>> which in my experience is quite rare.
>
> We're getting off topic, so I won't try to get into a lot of detail,
> but with sufficient contention, a critical section can become
> _substantially_ slower than direct use of a mutex.

Are you writing about a `CRITICAL_SECTION' implementation that hands off
ownership?

James Kanze
Sep 13, 2009, 5:33:12 AM
On Sep 13, 5:37 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in message

> news:ecb6784b-9cc7-49d7...@r9g2000yqa.googlegroups.com...>
> On Sep 7, 9:02 pm, Francesco <entul...@gmail.com> wrote:

> [...]


> > Not knowing the requirements, it's hard to say. My usual
> > implementation is:

> > class Data
> > {
> > private:
> > static Data* ourInstance ;
> > Data() {}
> > ~Data() {}

> > public:
> > static Data& instance() ;
> > // ...
> > } ;

> > Data* Data:: ourInstance = &instance() ;

> > Data&
> > Data::instance()
> > {
> > if ( ourInstance == NULL ) {
> > ourInstance = new Data ;
> > }
> > return *ourInstance ;
> > }

> > This solves the threading issues (for the most part), and
> > avoids any order of destruction issues, by not destructing
> > the object.

> FWIW, this is how to break it using threads:

> http://groups.google.com/group/comp.lang.c++/msg/769cda6552764337

Yes. If you start a thread which uses it from the constructor,
you will have problems. That's what the "(for the most part)"
refers to. (In my actual library, this restriction is
documented.)

In practice, I've never found this to be an issue. Nor have I found
Keith Duggar's point, that the object will always be constructed even
if it is never used, an issue---just the opposite: in most of my
applications, start-up time is more or less free, but I have time
constraints when responding to a request, so anything that can be
moved out into the program initialization phase is good. Still, it's
a point that should be documented (and that, IIRC, I forgot to
document in my generic implementation).

--
James Kanze

James Kanze
Sep 13, 2009, 5:58:59 AM
On Sep 13, 1:01 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
> In article <edee09a7-fbc2-41fd-84b4-
> dcdae859b...@a21g2000yqc.googlegroups.com>,
> james.ka...@gmail.com says...

> [ ... using a memory barrier ]

> > In practice, it's
> > generally not worth it, since the additional assembler generally
> > does more or less what the outer mutex (which you're trying to
> > avoid) does, and costs about the same in run time.

> I have to disagree with both of these.

You're arguing against actual measurements made on a Sun Sparc,
under Solaris.

> First, a memory barrier is quite a bit different from a mutex.
> Consider (for example) a store fence. It simply says that
> stores from all previous instructions must complete before any
> stores from subsequent instructions (and a read barrier does
> the same, but for reads). It's basically equivalent to a
> sequence point, but for real hardware instead of a conceptual
> model.

Certainly.

> As far as cost goes: a mutex normally uses kernel data,

Since when? This isn't the case on any of the systems I'm familiar
with (Solaris, Linux and Windows). In all cases, the mutex
(CriticalSection, under Windows) only goes into kernel mode if there
is a conflict. (Note that unlike the Windows implementation, under
Solaris or Linux, this is true even if the mutex is shared with
another process, or if there is a time out on the wait.)

> so virtually every operation requires a switch from user mode
> to kernel mode and back. The cost for that will (of course)
> vary between systems, but is almost always fairly high (figure
> a few thousand CPU cycles as a reasonable minimum).

A few thousand CPU cycles seems quite high to me, given the
timings I've made (all under Solaris on a Sparc, some time ago),
but it is expensive, yes (a couple of hundred). That's why
Solaris and Linux avoid it, and why Windows offers a
"simplified" mutex (which they misleadingly call
CriticalSection), which avoids it.

> A memory barrier will typically just prevent combining a
> subsequent write with a previous one. As long as there's room
> in the write queue for both pieces of data, there's no cost at
> all.

A memory barrier ensures that no following operations become
visible until the previous operations are guaranteed visible.
At least on a Sparc (again, the only machine on which I've made
measurements), this can be very expensive---easily ten times the
cost of a "normal" instruction.
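The store/load fence pairing under discussion can be sketched with the
later C++11 primitives (not available when this thread was posted; the
names below are my own, not from the thread). The release fence orders
the payload store before the flag store, and the matching acquire
fence on the reader side orders the flag load before the payload load:

```cpp
#include <atomic>

// Hedged publisher/consumer sketch of a "store fence" / "load fence"
// pairing, using C++11 std::atomic_thread_fence.
static int payload = 0;
static std::atomic<bool> ready(false);

void publish(int value)
{
    payload = value;                                      // plain store
    std::atomic_thread_fence(std::memory_order_release);  // "store fence"
    ready.store(true, std::memory_order_relaxed);
}

int consume()
{
    while (!ready.load(std::memory_order_relaxed)) { }    // spin on flag
    std::atomic_thread_fence(std::memory_order_acquire);  // "load fence"
    return payload;  // ordered after the flag load by the acquire fence
}
```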

[...]


> > > The problem is that C++ (up through the 2003 standard) simply
> > > lacks memory barriers. Double-checked locking is one example
> > > of code that _needs_ a memory barrier to work correctly -- but
> > > it's only one example of many.

> > It can be made to work with thread local storage as well,
> > without memory barriers.

> Well, yes -- poorly stated on my part. It requires _some_ sort
> of explicit support for threading that's missing from the
> current and previous versions of C++, but memory barriers
> aren't the only possible one.

If you're talking about the current and previous versions of
C++, something like pthread_create formally invokes undefined
behavior, so something more is certainly necessary for
multithreaded applications. If you're talking about C++ plus
Posix or Windows, then at least under Posix (and I think
Windows), there is support for thread local storage. Given the
interface under Posix, I suspect that it can be rather expensive
to use, however (but I've not measured it), which would account
for the fact that it's not often suggested. If accessing a
thread local variable is no more expensive than accessing a
normal static variable, and each thread makes a lot of requests
to the instance function, using the thread local variable
solution could be a definite winner.
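The thread-local variant described above can be sketched as follows
(a hedged sketch; the class and member names are my own, and C++11's
thread_local is used in place of the Posix TLS interface). Each thread
caches the singleton's address in its own thread-local pointer, so
after the first call a thread performs only a plain read, with no lock
and no memory barrier:

```cpp
#include <mutex>

// Hypothetical thread-local double-checked singleton sketch.
class Data {
public:
    static Data& instance();
private:
    Data() {}
    ~Data() {}
};

Data& Data::instance()
{
    static thread_local Data* cached = nullptr;  // per-thread cache
    if (cached == nullptr) {                     // first call in this thread
        static std::mutex mtx;
        std::lock_guard<std::mutex> lock(mtx);
        static Data* shared = nullptr;           // guarded by mtx
        if (shared == nullptr) {
            shared = new Data;                   // never destroyed, as in the thread
        }
        cached = shared;
    }
    return *cached;
}
```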

> [ ... ]

> > Yes. The "problem" with DCLP is in fact just a symptom of a
> > larger problem, of people not understanding what is and is not
> > guaranteed (and to a lesser degree, of people not really
> > understanding the costs---acquiring a non-contested mutex is
> > really very, very cheap, and usually not worth trying to avoid).

> At least under Windows, this does not fit my experience. Of
> course, Windows has its own cure (sort of) for the problem --
> rather than using a mutex (with its switch to/from kernel
> mode) you'd usually use a critical section instead. Entering a
> critical section that's not in use really is very fast.

What Windows calls a CriticalSection is, in fact, a mutex, and
is what I use under Windows when I need a mutex to protect
between threads (as opposed to between processes).

Note that the Windows implementation of boost::mutex uses
CriticalSection.

> Then again, a critical section basically is itself just a
> double-checked lock (including the necessary memory barriers). They
> have two big limitations: first, unlike a normal mutex, they only
> work between threads in a single process. Second, they can be quite
> slow when/if there's a great deal of contention for the critical
> section.

If there's a lot of contention, any locking mechanism will be
expensive. Between processes... The Posix mutex works between
processes, with no kernel code if there is no contention. On
the other hand (compared to Windows), it doesn't use an
identifier; the mutex object itself (pthread_mutex_t) must be in
memory mapped to both processes.
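The process-shared Posix mutex described above can be sketched like
this (a hedged sketch; the helper name is mine). The pthread_mutex_t
itself lives in memory mapped into both processes -- here anonymous
shared memory, inherited across fork() -- and is marked
PTHREAD_PROCESS_SHARED; there is no kernel-visible identifier:

```cpp
#include <pthread.h>
#include <sys/mman.h>

// Place a process-shared pthread mutex in shared, mapped memory.
pthread_mutex_t* make_shared_mutex()
{
    void* mem = mmap(0, sizeof(pthread_mutex_t),
                     PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 0;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);

    pthread_mutex_t* m = static_cast<pthread_mutex_t*>(mem);
    pthread_mutex_init(m, &attr);  // uncontended lock/unlock stays in user space
    pthread_mutexattr_destroy(&attr);
    return m;
}
```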

--
James Kanze

Jerry Coffin
Sep 13, 2009, 1:08:27 PM
In article <8c8edcc3-d7f4-4890-9f43-c05db50bb41b@
37g2000yqm.googlegroups.com>, james...@gmail.com says...

>
> On Sep 13, 1:01 am, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
> > In article <edee09a7-fbc2-41fd-84b4-
> > dcdae859b...@a21g2000yqc.googlegroups.com>,
> > james.ka...@gmail.com says...
>
> > [ ... using a memory barrier ]
>
> > > In practice, it's
> > > generally not worth it, since the additional assembler generally
> > > does more or less what the outer mutex (which you're trying to
> > > avoid) does, and costs about the same in run time.
>
> > I have to disagree with both of these.
>
> You're arguing against actual measurements made on a Sun Sparc,
> under Solaris.

The degree of similarity (or lack thereof) between a memory barrier
and a mutex 1) is entirely platform independent, and 2) isn't really
open to measurement.

Based on what you've said, it comes down to this: the platforms with
which you're familiar include a double-checked lock in their
implementation of a mutex (as long as under Windows you treat
"mutex" as meaning "critical section").

Going back to your original statement, that there's little point in
using double-checked locking, I'd certainly agree that when the
system builds a double-checked lock into its mutex (or what you use
as a mutex anyway), then there's little or no gain from duplicating
that in your own code.

[ ... ]

> If there's a lot of contention, any locking mechanism will be
> expensive.

Yes, but in Windows if there's a lot of contention, a critical
section is a lot _more_ expensive than a mutex.

> Between processes... The Posix mutex works between
> processes, with no kernel code if there is no contention. On
> the other hand (compared to Windows), it doesn't use an
> identifier; the mutex object itself (pthread_mutex_t) must be in
> memory mapped to both processes.

In other words, as I pointed out above, it's doing a double-checked
lock. It manipulates some data directly (with appropriate memory
barriers), and iff that fails, uses the real mutex that involves a
switch to kernel mode (and allows the kernel to schedule another
thread).

--
Later,
Jerry.

Chris M. Thomasson
Sep 14, 2009, 7:10:25 AM
"Jerry Coffin" <jerryv...@yahoo.com> wrote in message
news:MPG.2516c74b8...@news.sunsite.dk...

FWIW, there is a fairly big difference between the fast-path of
double-checked locking and the fast-path of a mutex. The check of a
DCL algorithm does not involve any atomic RMW operations. However,
Peterson's algorithm aside for a moment, the check of a mutex does use
atomic RMW.
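The distinction can be sketched with C++11 atomics (a hedged sketch,
with names of my own): the DCL fast path below is a single
load-acquire, with no atomic read-modify-write, while even an
uncontended mutex lock performs something like a compare-and-swap
internally:

```cpp
#include <atomic>
#include <mutex>

// Double-checked lazy initialization: plain load on the fast path.
static std::atomic<int*> g_instance(nullptr);
static std::mutex g_guard;

int* get_instance()
{
    int* p = g_instance.load(std::memory_order_acquire);  // fast path: no RMW
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_guard);        // slow path: RMW inside
        p = g_instance.load(std::memory_order_relaxed);   // re-check under the lock
        if (p == nullptr) {
            p = new int(42);
            g_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```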


> [ ... ]
>
>> It there's a lot of contention, any locking mechanism will be
>> expensive.
>
> Yes, but in Windows if there's a lot of contention, a critical
> section is a lot _more_ expensive than a mutex.

I am not exactly sure why it would be a "lot _more_ expensive". I can
see a certain advantage, but it should not make that much of a
difference. I will post (*) a quick and dirty example program at the
end of this message which attempts to show the performance difference
between a `CRITICAL_SECTION', a `HANDLE' mutex, and no lock at all...


>> Between processes... The Posix mutex works between
>> processes, with no kernel code if there is no contention. On
>> the other hand (compared to Windows), it doesn't use an
>> identifier; the mutex object itself (pthread_mutex_t) must be in
>> memory mapped to both processes.
>
> In other words, as I pointed out above, it's doing a double-checked
> lock. It manipulates some data directly (with appropriate memory
> barriers), and iff that fails, uses the real mutex that involves a
> switch to kernel mode (and allows the kernel to schedule another
> thread).

The difference is that DCL does not need to manipulate/mutate data in
order to skip mutex acquisition. It only needs to perform a
data-dependent load, which is a plain naked load on basically every
system out there except DEC Alpha.

(*)
________________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <time.h>


#define READERS 4
#define ITERS 999999
#define L2DEPTH 64
#define L2CACHE 64
/* #define USE_CRITICAL_SECTION */
/* #define NO_LOCKS */


#define ALIGN(p, a) \
  ((((DWORD_PTR)(p)) + ((DWORD_PTR)(a)) - 1U) & \
   ((DWORD_PTR)(-((LONG_PTR)(a)))))


#define ALIGN_PTR(p, a) ((void*)ALIGN(p, a))


#define ALIGN_BUFSIZE(s, a) \
  (((DWORD_PTR)(s)) + ((DWORD_PTR)(a)) - 1U)


#define ALIGN_CHECK(p, a) \
  (! (((DWORD_PTR)(p)) % ((DWORD_PTR)(a))))


#if ! defined (NO_LOCKS)
# if defined (USE_CRITICAL_SECTION)
    typedef CRITICAL_SECTION mutex_type;

#   define mutex_create(s) \
      InitializeCriticalSection((s))

#   define mutex_destroy(s) \
      DeleteCriticalSection((s))

#   define mutex_lock(s) \
      EnterCriticalSection((s))

#   define mutex_unlock(s) \
      LeaveCriticalSection((s))

# else
    typedef HANDLE mutex_type;

#   define mutex_create(s) \
      (*(s) = CreateMutex(NULL, FALSE, NULL))

#   define mutex_destroy(s) \
      CloseHandle(*(s))

#   define mutex_lock(s) \
      WaitForSingleObject(*(s), INFINITE)

#   define mutex_unlock(s) \
      ReleaseMutex(*(s))
# endif

#else
  typedef int mutex_type;

# define mutex_create(s) ((void)0)
# define mutex_destroy(s) ((void)0)
# define mutex_lock(s) ((void)0)
# define mutex_unlock(s) ((void)0)

#endif


struct l2cache
{
  char buffer[L2CACHE];
};


struct data
{
  struct l2cache array[L2DEPTH];
};


struct global
{
  mutex_type mutex;
  char l2pad2_1[L2CACHE - sizeof(mutex_type)];
  struct data data;
  LONG finish_ref;
  HANDLE finished;
  char l2pad2_2[L2CACHE - (sizeof(LONG) + sizeof(HANDLE))];
};


typedef char static_assert
[
  ! (sizeof(struct global) % L2CACHE) &&
  ! (sizeof(struct data) % L2CACHE) &&
  sizeof(struct data) / L2CACHE == L2DEPTH
  ? 1 : -1
];


static char g_raw_buffer
[
  ALIGN_BUFSIZE(sizeof(struct global), L2CACHE)
] = { '\0' };


static struct global* g_global = NULL;


unsigned WINAPI
reader_thread(void* state)
{
  unsigned i;
  struct l2cache cmp = { { '\0' } };

  for (i = 0; i < ITERS; ++i)
  {
    unsigned d;

    mutex_lock(&g_global->mutex);

    for (d = 0; d < L2DEPTH; ++d)
    {
      if (memcmp(g_global->data.array + d,
                 &cmp,
                 sizeof(cmp)))
      {
        abort();
      }
    }

    mutex_unlock(&g_global->mutex);
  }

  if (! InterlockedDecrement(&g_global->finish_ref))
  {
    SetEvent(g_global->finished);
  }

  return 0;
}


int main(void)
{
  size_t i;
  unsigned id;
  double end;
  clock_t start;
  HANDLE tid[READERS];
  unsigned long int iter_avg_per_thread, iter_avg_total;

  g_global = ALIGN_PTR(g_raw_buffer, L2CACHE);

  assert(ALIGN_CHECK(g_global, L2CACHE));

  g_global->finished = CreateEvent(NULL, FALSE, FALSE, NULL);
  g_global->finish_ref = READERS;

  mutex_create(&g_global->mutex);

  for (i = 0; i < READERS; ++i)
  {
    tid[i] = (HANDLE)_beginthreadex(NULL,
                                    0,
                                    reader_thread,
                                    NULL,
                                    CREATE_SUSPENDED,
                                    &id);
  }

  start = clock();

  for (i = 0; i < READERS; ++i)
  {
    ResumeThread(tid[i]);
  }

  WaitForSingleObject(g_global->finished, INFINITE);

  end = ((double)(clock() - start)) / CLOCKS_PER_SEC;

  if (end)
  {
    iter_avg_per_thread =
      (unsigned long int)(ITERS / end);

    iter_avg_total =
      (unsigned long int)((ITERS * READERS) / end);
  }

  else
  {
    iter_avg_per_thread = ITERS;
    iter_avg_total = ITERS * READERS;
  }

  for (i = 0; i < READERS; ++i)
  {
    WaitForSingleObject(tid[i], INFINITE);
    CloseHandle(tid[i]);
  }

  mutex_destroy(&g_global->mutex);

  CloseHandle(g_global->finished);

  printf("Threads: %u\n"
         "Time: %.3f ms\n"
         "Total Iterations Per-Second: %lu\n"
         "Iterations Per-Second/Per-Thread: %lu\n",
         READERS,
         end * 1000.0,
         iter_avg_total,
         iter_avg_per_thread);

  return 0;
}
________________________________________________________________________


You define `NO_LOCKS' for a lock-free version, `USE_CRITICAL_SECTION'
for a version using critical sections, and leave both undefined for
the kernel mutex version. Here is the output I get for the
`CRITICAL_SECTION' version:

Threads: 4
Time: 28297.000 ms
Total Iterations Per-Second: 141357
Iterations Per-Second/Per-Thread: 35339


Here is the Kernel mutex:

Threads: 4
Time: 28078.000 ms
Total Iterations Per-Second: 142460
Iterations Per-Second/Per-Thread: 35615


Here is lock-free:

Threads: 4
Time: 11515.000 ms
Total Iterations Per-Second: 347372
Iterations Per-Second/Per-Thread: 86843


This is on XP with an old P4 3.06GHz HyperThreaded processor. Anyway,
I cannot see much of a difference between the two mutex-based
versions. However, I can see a major difference in the lock-free one!
This does not really prove anything, but it's a little bit
interesting.

;^)

Jerry Coffin
Sep 14, 2009, 9:23:38 AM
In article <h8hrgk$29h9$1...@news.ett.com.ua>, n...@spam.invalid says...

[ ... ]

> Well, a `CRITICAL_SECTION' is a mutex. A HANDLE returned from
> `CreateMutex()' is also a mutex. Therefore, a mutex implementation
> does not need to always enter the kernel. IMVHO, Windows made a poor
> choice of naming their intra-process mutex.

Oddly, it's a name that was inherited from OS/2. One of the real
strengths of OS/2's APIs (in general) was a rather nice naming
convention.

> IMO, a "critical section" does not refer to the
> synchronization primitive, but to the actual locked portion of the caller's
> code.

I'd agree with that -- and EnterCriticalSection and
LeaveCriticalSection seem to reflect that intent as well.

[ ... ]

> A #StoreLoad memory barrier can cripple performance by forcing the
> previous stores to be performed before any subsequent load can be
> committed. This defeats the purpose of caching and pipeline:

That's not (usually) entirely true. A typical processor maintains
coherency between the cache and the memory -- i.e. anything that's
been written to the cache is considered visible throughout the
system.

The memory barrier normally affects the CPU's write queue between the
CPU proper and the cache (specifically, the first level of cache that
isn't write-through). If the writes involved happened to be to areas
of memory that are uncachable, then you'd see a really serious
performance hit. Otherwise, while there will be some performance hit,
it'll typically be _fairly_ small (the write queue is rarely more than
about 10 deep, and can typically write one entry something like every
clock or two).

Of course, if you have a cache that doesn't do snooping to ensure
that everything written to the cache is immediately visible to the
rest of the system, things get uglier in a hurry...

[ ... ]



> > Then again, a critical section basically is itself just a double-
> > checked lock (including the necessary memory barriers).
>
> AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive mutexs.
> Could the inter-process mutexs share the same properties? I think so.

Yes, it could. It would require putting the data used for the
fast-path lock into memory that's visible to all processes involved.

[ ... ]

> WRT slow, are you referring to the old implementation of
> `CRITICAL_SECTION's which used to hand off ownership? IIRC, MS
> changed the implementation to allow for a thread to sneak in and
> acquire the mutex which increases performance considerably.
> However, unlike handing off ownership, it's not fair and can be
> subjected to "starvation".

Well, I haven't looked at it again recently to check, so things could
have changed since then. What I'm talking about, however, was where
EnterCriticalSection would basically do a spin-lock, repeatedly
trying to acquire the mutex, and only after several attempts would it
give up and let another thread run.
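The spin-then-yield behavior described above can be sketched like this
(a hedged sketch; the class name and the spin threshold are arbitrary
assumptions of mine, not the actual EnterCriticalSection internals):

```cpp
#include <atomic>
#include <thread>

// Try the user-space fast path a bounded number of times, then give
// up the time slice so another thread (hopefully the owner) can run.
class adaptive_lock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
            if (++spins > 4000)             // assumed threshold
                std::this_thread::yield();  // stop burning CPU
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};
```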

--
Later,
Jerry.

Chris M. Thomasson
Sep 14, 2009, 11:09:17 AM
"Jerry Coffin" <jerryv...@yahoo.com> wrote in message
news:MPG.2517f1748...@news.sunsite.dk...
> In article "Chris M. Thomasson" <h8hrgk$29h9$1...@news.ett.com.ua>,
> n...@spam.invalid says...
[...]


>> A #StoreLoad memory barrier can cripple performance by forcing the
>> previous stores to be performed before any subsequent load can be
>> committed. This defeats the purpose of caching and pipeline:
>
> That's not (usually) entirely true. A typical processor maintains
> coherency between the cache and the memory -- i.e. anything that's
> been written to the cache is considered visible throughout the
> system.
>
> The memory barrier normally affects the CPUs write queue between the
> CPU proper, and the cache (specifically, the first level of cache
> that isn't write-through). If the writes involved happened to be to
> areas of memory that are uncachable, then you'd see a really serious
> performance hit. Otherwise, while there will be some performance hit,
> it'll typically be _fairly_ small (the write queue is rarely more
> than about 10 deep, and can typically write one entry something like
> every clock or two.
>
> Of course, if you have a cache that doesn't do snooping to ensure
> that everything written to the cache is immediately visible to the
> rest of the system, things get uglier in a hurry...

FWIW, I know that a `#StoreLoad' membar tanks performance in certain
algorithms. Take plain SMR (Safe Memory Reclamation) hazard pointers.
The algorithm requires a `#StoreLoad' on every store into a hazard
pointer. This really shows up when one is iterating through a linked
data-structure. I have included a crude little program at the end of
this message which attempts to demonstrate the performance difference
between a membar-based SMR and a naked version.
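The SMR step being measured can be sketched as follows (a hedged
sketch with names of my own, using a C++11 seq_cst fence in place of
the raw MFENCE): publish the pointer into a hazard slot, issue the
full #StoreLoad fence, then re-validate that the shared pointer has
not changed in the meantime:

```cpp
#include <atomic>

// One shared pointer and one hazard slot; real SMR has one slot per
// thread, but the fence-per-publish cost is the same.
static std::atomic<void*> g_shared(nullptr);
static std::atomic<void*> g_hazard(nullptr);

void* acquire_hazard()
{
    void* p;
    do {
        p = g_shared.load(std::memory_order_relaxed);
        g_hazard.store(p, std::memory_order_relaxed);         // publish
        std::atomic_thread_fence(std::memory_order_seq_cst);  // #StoreLoad
    } while (p != g_shared.load(std::memory_order_relaxed));  // re-validate
    return p;  // reclaimers now see the hazard before freeing p
}
```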


>> > Then again, a critical section basically is itself just a double-
>> > checked lock (including the necessary memory barriers).
>>
>> AFAICT, `CRITICAL_SECTION's are intra-process fast-pathed adaptive
>> mutexs.
>> Could the inter-process mutexs share the same properties? I think so.
>
> Yes, it could. It would require putting the data used for the fast-
> path lock into memory that's visible to all processes involved.

Indeed.

[...]


>> WRT slow, are you referring to the old implementation of
>> `CRITICAL_SECTION's which used to hand off ownership? IIRC, MS
>> changed the implementation to allow for a thread to sneak in and
>> acquire the mutex which increases performance considerably.
>> However, unlike handing off ownership, it's not fair and can be
>> subjected to "starvation".
>
> Well, I haven't looked at it again recently to check, so things could
> have changed since then. What I'm talking about, however, was where
> EnterCriticalSection would basically do a spin-lock, repeatedly
> trying to acquire the mutex, and only after several attempts would it
> give up and let another thread run.

Yeah. `CRITICAL_SECTION's are adaptive. However, what about:

SetCriticalSectionSpinCount(&mutex, 0);

?


lol... ;^)


Here is a crappy little test program for GCC and ia-32:
______________________________________________________________________
#include <time.h>
#include <stdio.h>
#include <pthread.h>


#define MFENCE() __asm__ __volatile__ ("MFENCE" : : : "memory")


#define RUNS 2
#define DEPTH 256
#define ITERS 1000000
#define READERS 4


struct node
{
  struct node* volatile next;
};


static struct node g_nodes[DEPTH];
static struct node* volatile g_head = NULL;


static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;


void
node_init(void)
{
  unsigned i;

  for (i = 0; i < DEPTH; ++i)
  {
    g_nodes[i].next = g_head;
    g_head = g_nodes + i;
  }
}


void*
reader_naked(void* state)
{
  unsigned i;

  for (i = 0; i < ITERS; ++i)
  {
    struct node* n = g_head;

    while (n)
    {
      n = n->next;
    }
  }

  return 0;
}


void*
reader_plain_smr(void* state)
{
  unsigned i;

  for (i = 0; i < ITERS; ++i)
  {
    struct node* n = g_head;
    MFENCE();

    while (n)
    {
      n = n->next;
      MFENCE();
    }
  }

  return 0;
}


void*
reader_locked(void* state)
{
  unsigned i;

  for (i = 0; i < ITERS; ++i)
  {
    struct node* n;

    pthread_mutex_lock(&g_mutex);

    n = g_head;

    while (n)
    {
      n = n->next;
    }

    pthread_mutex_unlock(&g_mutex);
  }

  return 0;
}


int
main(void)
{
  unsigned i, r;
  pthread_t tid[READERS];
  clock_t start;
  double end;

  node_init();

  for (r = 0; r < RUNS; ++r)
  {
    printf("executing test run %d of %d...\n",
           r + 1,
           RUNS);

    /* Naked */
    start = clock();

    for (i = 0; i < READERS; ++i)
    {
      pthread_create(tid + i, NULL, reader_naked, NULL);
    }

    for (i = 0; i < READERS; ++i)
    {
      pthread_join(tid[i], NULL);
    }

    end = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("reader_naked - %.3f ms / %lu reads per-second/per-thread\n",
           end * 1000.0,
           (unsigned long int)(ITERS / ((end) ? end : 1)));

    /* Plain SMR */
    start = clock();

    for (i = 0; i < READERS; ++i)
    {
      pthread_create(tid + i, NULL, reader_plain_smr, NULL);
    }

    for (i = 0; i < READERS; ++i)
    {
      pthread_join(tid[i], NULL);
    }

    end = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("reader_plain_smr - %.3f ms / %lu reads per-second/per-thread\n",
           end * 1000.0,
           (unsigned long int)(ITERS / ((end) ? end : 1)));

    /* Locked */
    start = clock();

    for (i = 0; i < READERS; ++i)
    {
      pthread_create(tid + i, NULL, reader_locked, NULL);
    }

    for (i = 0; i < READERS; ++i)
    {
      pthread_join(tid[i], NULL);
    }

    end = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("reader_locked - %.3f ms / %lu reads per-second/per-thread\n\n",
           end * 1000.0,
           (unsigned long int)(ITERS / ((end) ? end : 1)));
  }

  puts("\n\ncompleted!");

  return 0;
}
______________________________________________________________________


Here is the output I get on my old P4 3.06GHz HyperThreaded processor:
______________________________________________________________________
executing test run 1 of 2...
reader_naked - 546.000 ms / 1831501 reads per-second/per-thread
reader_plain_smr - 18313.000 ms / 54606 reads per-second/per-thread
reader_locked - 3484.000 ms / 287026 reads per-second/per-thread

executing test run 2 of 2...
reader_naked - 563.000 ms / 1776198 reads per-second/per-thread
reader_plain_smr - 18312.000 ms / 54608 reads per-second/per-thread
reader_locked - 3547.000 ms / 281928 reads per-second/per-thread

completed!
______________________________________________________________________


The only difference between `reader_naked()' and `reader_plain_smr()' is
that the latter has the same memory barrier overhead that a plain SMR
implementation would have. I can say that the `MFENCE' instruction on IA-32
can really devastate performance. I have seen too many examples and
measurements. BTW, you will get similar performance numbers on SPARC or
PowerPC. The locked version beats the plain SMR version because, blocking
aside for a moment, it amortizes memory barrier usage across the depth of
the iteration...
