[Boost-users] [thread] boost::call_once() optimization idea on Win32

Thomas Jarosch

unread,

Nov 17, 2010, 11:15:38 AM11/17/10

to boost...@lists.boost.org

Hello,

here's a small optimization idea for boost::call_once() on Win32.
The code currently looks like this:
----------------------------------------------
template<typename Function>
void call_once(once_flag& flag,Function f)
{
// Try for a quick win: if the procedure has already been called
// just skip through:
long const function_complete_flag_value=0xc15730e2;

if(::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_value)
{
void* const mutex_handle(::boost::detail::create_once_mutex(&flag));
BOOST_ASSERT(mutex_handle);
detail::win32::handle_manager const closer(mutex_handle);
detail::win32_mutex_scoped_lock const lock(mutex_handle);

if(flag!=function_complete_flag_value)
{
f();
BOOST_INTERLOCKED_EXCHANGE(&flag,function_complete_flag_value);
}
}
}
----------------------------------------------

The initial flag value of zero (=BOOST_ONCE_INIT)
is set at load time via static initialization.

It should be safe to do a non-interlocked read of the flag like this:
----------------------------------------------
if(flag!=function_complete_flag_value &&
::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_value)
----------------------------------------------

If our non-interlocked read would see something different than
"function_complete_flag_value", we still do the interlocked read, too.
For normal operation inside a thread safe singleton,
this saves us from always issuing a memory barrier.

Any flaws with this approach?

Also one more (silly?) question: "flag" is not a volatile variable.
Does boost::detail::interlocked_read_acquire() make sure the value
doesn't get cached inside a register? IMHO we lock the mutex
and we might still hold a cached, old value of "flag" in a register.
-> Do we need an interlocked read here, too?
Or mark the flag type "volatile"?

(http://msdn.microsoft.com/en-us/magazine/cc163405.aspx#S3
look at figure 7 and the text below)

Best regards,
Thomas Jarosch

boost-call-once-optimize-read.patch

Gottlob Frege

unread,

Nov 17, 2010, 11:13:03 PM11/17/10

to boost...@lists.boost.org

On Wed, Nov 17, 2010 at 11:15 AM, Thomas Jarosch
<thomas....@intra2net.com> wrote:
> Hello,
>
> here's a small optimization idea for boost::call_once() on Win32.
>

> It should be safe to do a non-interlocked read of the flag like this:
> ----------------------------------------------
> if(flag!=function_complete_flag_value &&
> ::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_value)
> ----------------------------------------------
>
> If our non-interlocked read would see something different than
> "function_complete_flag_value", we still do the interlocked read, too.
> For normal operation inside a thread safe singleton,
> this saves us from always issuing a memory barrier.
>
> Any flaws with this approach?
>

Yes.

Your thread will see flag correctly, but the "important data" being
protected/initted by the once, might not bee seen correctly.

Typically the code essentially boils down to (if you imagine the
call_once being "inlned", since to the CPU, it is all "inline"):

First Thread in ("wins", thus does the init)
important_data = new ...; // the init call, protected by call_once()
flag = function_complete_flag_value;

Second Thread:
if (flag == function_complete_flag_value) // inside call_once()
use_important_data(important_data); // outside call_once()

The CPU (not just the compiler, but the CPU, memory subsystem, etc)
can reorder that as:

First Thread:
flag = function_complete_flag_value;
important_data = new ...;

Second Thread:
register temp = important_data; // start memory read early, so
that we don't wait for reads to complete
if (flag == function_complete_flag_value)
use_important_data(temp);

See the problem? important_data may be read before it is ready.

You need a memory barrier *in each thread* to ensure neither
reordering happens. In particular, the memory barrier in the
interlocked_read_acquire(&flag) prevents the reordering in the Second
Thread.

>
> Also one more (silly?) question: "flag" is not a volatile variable.
> Does boost::detail::interlocked_read_acquire() make sure the value
> doesn't get cached inside a register? IMHO we lock the mutex
> and we might still hold a cached, old value of "flag" in a register.
> -> Do we need an interlocked read here, too?
> Or mark the flag type "volatile"?
>

volatile is almost useless is threaded programming. It is typically
both insufficient for threads (ie no memory barrier) and at the same
time superfluous - when inside a mutex - as the mutex handles the
barrier for you.

volatile only helps with the compiler - it doesn't control what the
CPU might then do to your instructions (like reorder them, do
speculative execution, etc), so it doesn't help much with threads
running on separate CPUs. And MS's version of volatile (which _does_
enforce memory barriers) is non-standard. So don't use it in portable
code. ie don't use it at all.

> (http://msdn.microsoft.com/en-us/magazine/cc163405.aspx#S3
> look at figure 7 and the text below)
>
>
> Best regards,
> Thomas Jarosch
>

Tony

P.S. I will hopefully be doing another talk on this stuff at BoostCon
in May - you should go!
_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Thomas Jarosch

unread,

Nov 18, 2010, 5:50:55 AM11/18/10

to boost...@lists.boost.org

Hello Tony,

On Thursday, 18. November 2010 05:13:03 Gottlob Frege wrote:
> The CPU (not just the compiler, but the CPU, memory subsystem, etc)
> can reorder that as:
>
> First Thread:
> flag = function_complete_flag_value;
> important_data = new ...;
>
> Second Thread:
> register temp = important_data; // start memory read early, so
> that we don't wait for reads to complete
> if (flag == function_complete_flag_value)
> use_important_data(temp);
>
> See the problem? important_data may be read before it is ready.

Thanks for the detailed explanation! As you pointed out, the second thread
will mess up without the barrier. Time to fix my code :)

One more small thing: IMHO the write reordering in the first
thread can't happen because of the memory barrier created
by interlockedExchange(). The first thread translates to this:

important_data = new ...; // the init call

lock // create memory barrier:
// - Don't allow reordering
// from below the barrier
// - Finish all outstanding writes

flag = function_complete_flag_value;

> > Also one more (silly?) question: "flag" is not a volatile variable.
> > Does boost::detail::interlocked_read_acquire() make sure the value
> > doesn't get cached inside a register? IMHO we lock the mutex
> > and we might still hold a cached, old value of "flag" in a register.
> > -> Do we need an interlocked read here, too?
> > Or mark the flag type "volatile"?
>
> volatile is almost useless is threaded programming. It is typically
> both insufficient for threads (ie no memory barrier) and at the same
> time superfluous - when inside a mutex - as the mutex handles the
> barrier for you.

True that.

> volatile only helps with the compiler - it doesn't control what the
> CPU might then do to your instructions (like reorder them, do
> speculative execution, etc), so it doesn't help much with threads
> running on separate CPUs. And MS's version of volatile (which _does_
> enforce memory barriers) is non-standard. So don't use it in portable
> code. ie don't use it at all.

Let me illustrate it a bit more:

- register temp = interlocked_read(&flag) // fetch from mem location xyz
-> Init flag not set, so execute init code:

- create mutex (also creates memory barrier)

- Another thread 2 already entered the mutex,
executes the init code and does an interlocked
write of "flag". Then it leaves the mutex
so thread 1 can continue.

- Thread 1 re-reads the flag without interlocked read or volatile:
The compiler recognizes it's the same memory location xyz
and uses the cached value from "register temp".
So we would redo the initialization.

Don't we need an interlocked read here, too?

Or does the mutex/memory barrier ensure the compiler
isn't allowed to do register caching?

> P.S. I will hopefully be doing another talk on this stuff at BoostCon
> in May - you should go!

Nice! Too bad "www.boostcon.com" currently issues
a "500 - Internal Server Error" :o)

Best regards,
Thomas Jarosch

Thomas Jarosch

unread,

Nov 18, 2010, 6:34:20 AM11/18/10

to boost...@lists.boost.org

On Thursday, 18. November 2010 11:50:55 you wrote:
> Or does the mutex/memory barrier ensure the compiler
> isn't allowed to do register caching?

Ok, quickly checked this with gcc 4.4.4 on linux:
-test.c------------------
int *foobar = 0x1234;
int a;

int main(void)
{
__sync_synchronize();
a = *foobar;
__sync_synchronize();
return *foobar;
}
-------------------------

gcc -O4 -S test.c:
main:
.LFB0:
.cfi_startproc
mfence
movq foobar(%rip), %rax
movl (%rax), %eax
movl %eax, a(%rip)
mfence
movq foobar(%rip), %rax
movl (%rax), %eax
ret
.cfi_endproc

-> The memory barrier invalidates the register cache.

Without the second memory barrier in the C source,
the compiler uses the register cached value of *foobar.

Thanks for your help on this again, Tony.

Cheers,
Thomas

Dave Abrahams

unread,

Nov 18, 2010, 12:40:38 PM11/18/10

to boost...@lists.boost.org

On Wed, Nov 17, 2010 at 11:13 PM, Gottlob Frege <gottlo...@gmail.com> wrote:
>> Any flaws with this approach?
>>
>
> Yes.
>
> Your thread will see flag correctly, but the "important data" being
> protected/initted by the once, might not bee seen correctly.

Another way to say it is that you are essentially trying to do the
classic "double-checked locking" (DCL) pattern, which is well-known
not to work and you can find plenty of explanations for. In fact,
the BOOST_THREAD_ONCE* stuff is there to solve exactly that problem.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Gottlob Frege

unread,

Nov 19, 2010, 12:33:32 AM11/19/10

to boost...@lists.boost.org

On Thu, Nov 18, 2010 at 5:50 AM, Thomas Jarosch
<thomas....@intra2net.com> wrote:
> Hello Tony,

>
>
> Thanks for the detailed explanation! As you pointed out, the second thread
> will mess up without the barrier. Time to fix my code :)
>
> One more small thing: IMHO the write reordering in the first
> thread can't happen because of the memory barrier created
> by interlockedExchange(). The first thread translates to this:
>

Yes, of course. Sorry, I removed that interlock as well in my
example, since that question also often comes up, so I was basically
just giving my standard explanation. I should have been more clear.

>
> Let me illustrate it a bit more:
>

..

> Don't we need an interlocked read here, too?
>
> Or does the mutex/memory barrier ensure the compiler
> isn't allowed to do register caching?
>

Technically, since C++ still doesn't know anything about threads,
there could be some register caching. But there isn't, because the
compiler writers and thread packages work this all out for us, using
whatever non-standard calls or extensions that are required.

There are some papers out there such as "Threads cannot be done in a
library" by Boehm
http://www.stanford.edu/class/cs240/readings/p261-boehm.pdf going into
more detail about this. In particular, exotic cases such as:

if (some_condition)
lock(mutex);

for (i = 0; i < N; i++)
{
do_stuff();
if (some_condition)
x++;
}

if (some_condition)
unlock(mutex);

So the programmer's intent is that x is only updated sometimes, and
that happens to be the only time that the mutex is needed.
This is a case that, without being careful, compilers can optimize
incorrectly. C++0x will take care of these cases.

But beyond some exotic cases with aggresive optimizations, normal
mutexes with non-volatiles just works. And I don't think volatiles
fixes the cases listed in the paper.

>
> Nice! Too bad "www.boostcon.com" currently issues
> a "500 - Internal Server Error" :o)

!

Dave Abrahams

unread,

Nov 19, 2010, 12:45:14 PM11/19/10

to boost...@lists.boost.org

On Thu, Nov 18, 2010 at 5:50 AM, Thomas Jarosch
<thomas....@intra2net.com> wrote:
> Nice! Too bad "www.boostcon.com" currently issues
> a "500 - Internal Server Error" :o)
>

Fixed, thanks. Next time please make a louder stink on this list,
though (like in a subject line); I could have easily missed this.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Reply all

Reply to author

Forward