here's a small optimization idea for boost::call_once() on Win32. The code currently looks like this: ---------------------------------------------- template<typename Function> void call_once(once_flag& flag,Function f) { // Try for a quick win: if the procedure has already been called // just skip through: long const function_complete_flag_value=0xc15730e2;
The initial flag value of zero (=BOOST_ONCE_INIT) is set at load time via static initialization.
It should be safe to do a non-interlocked read of the flag like this: ---------------------------------------------- if(flag!=function_complete_flag_value && ::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_va lue) ----------------------------------------------
If our non-interlocked read would see something different than "function_complete_flag_value", we still do the interlocked read, too. For normal operation inside a thread safe singleton, this saves us from always issuing a memory barrier.
Any flaws with this approach?
Also one more (silly?) question: "flag" is not a volatile variable. Does boost::detail::interlocked_read_acquire() make sure the value doesn't get cached inside a register? IMHO we lock the mutex and we might still hold a cached, old value of "flag" in a register. -> Do we need an interlocked read here, too? Or mark the flag type "volatile"?
> here's a small optimization idea for boost::call_once() on Win32.
> It should be safe to do a non-interlocked read of the flag like this: > ---------------------------------------------- > if(flag!=function_complete_flag_value && > ::boost::detail::interlocked_read_acquire(&flag)!=function_complete_flag_va lue) > ----------------------------------------------
> If our non-interlocked read would see something different than > "function_complete_flag_value", we still do the interlocked read, too. > For normal operation inside a thread safe singleton, > this saves us from always issuing a memory barrier.
> Any flaws with this approach?
Yes.
Your thread will see flag correctly, but the "important data" being protected/initted by the once, might not bee seen correctly.
Typically the code essentially boils down to (if you imagine the call_once being "inlned", since to the CPU, it is all "inline"):
First Thread in ("wins", thus does the init) important_data = new ...; // the init call, protected by call_once() flag = function_complete_flag_value;
Second Thread: if (flag == function_complete_flag_value) // inside call_once() use_important_data(important_data); // outside call_once()
The CPU (not just the compiler, but the CPU, memory subsystem, etc) can reorder that as:
First Thread: flag = function_complete_flag_value; important_data = new ...;
Second Thread: register temp = important_data; // start memory read early, so that we don't wait for reads to complete if (flag == function_complete_flag_value) use_important_data(temp);
See the problem? important_data may be read before it is ready.
You need a memory barrier *in each thread* to ensure neither reordering happens. In particular, the memory barrier in the interlocked_read_acquire(&flag) prevents the reordering in the Second Thread.
> Also one more (silly?) question: "flag" is not a volatile variable. > Does boost::detail::interlocked_read_acquire() make sure the value > doesn't get cached inside a register? IMHO we lock the mutex > and we might still hold a cached, old value of "flag" in a register. > -> Do we need an interlocked read here, too? > Or mark the flag type "volatile"?
volatile is almost useless is threaded programming. It is typically both insufficient for threads (ie no memory barrier) and at the same time superfluous - when inside a mutex - as the mutex handles the barrier for you.
volatile only helps with the compiler - it doesn't control what the CPU might then do to your instructions (like reorder them, do speculative execution, etc), so it doesn't help much with threads running on separate CPUs. And MS's version of volatile (which _does_ enforce memory barriers) is non-standard. So don't use it in portable code. ie don't use it at all.
P.S. I will hopefully be doing another talk on this stuff at BoostCon in May - you should go! _______________________________________________ Boost-users mailing list Boost-us...@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
On Thursday, 18. November 2010 05:13:03 Gottlob Frege wrote:
> The CPU (not just the compiler, but the CPU, memory subsystem, etc) > can reorder that as:
> First Thread: > flag = function_complete_flag_value; > important_data = new ...;
> Second Thread: > register temp = important_data; // start memory read early, so > that we don't wait for reads to complete > if (flag == function_complete_flag_value) > use_important_data(temp);
> See the problem? important_data may be read before it is ready.
Thanks for the detailed explanation! As you pointed out, the second thread will mess up without the barrier. Time to fix my code :)
One more small thing: IMHO the write reordering in the first thread can't happen because of the memory barrier created by interlockedExchange(). The first thread translates to this:
important_data = new ...; // the init call
lock // create memory barrier: // - Don't allow reordering // from below the barrier // - Finish all outstanding writes
flag = function_complete_flag_value;
> > Also one more (silly?) question: "flag" is not a volatile variable. > > Does boost::detail::interlocked_read_acquire() make sure the value > > doesn't get cached inside a register? IMHO we lock the mutex > > and we might still hold a cached, old value of "flag" in a register. > > -> Do we need an interlocked read here, too? > > Or mark the flag type "volatile"?
> volatile is almost useless is threaded programming. It is typically > both insufficient for threads (ie no memory barrier) and at the same > time superfluous - when inside a mutex - as the mutex handles the > barrier for you.
True that.
> volatile only helps with the compiler - it doesn't control what the > CPU might then do to your instructions (like reorder them, do > speculative execution, etc), so it doesn't help much with threads > running on separate CPUs. And MS's version of volatile (which _does_ > enforce memory barriers) is non-standard. So don't use it in portable > code. ie don't use it at all.
Let me illustrate it a bit more:
- register temp = interlocked_read(&flag) // fetch from mem location xyz -> Init flag not set, so execute init code:
- create mutex (also creates memory barrier)
- Another thread 2 already entered the mutex, executes the init code and does an interlocked write of "flag". Then it leaves the mutex so thread 1 can continue.
- Thread 1 re-reads the flag without interlocked read or volatile: The compiler recognizes it's the same memory location xyz and uses the cached value from "register temp". So we would redo the initialization.
Don't we need an interlocked read here, too?
Or does the mutex/memory barrier ensure the compiler isn't allowed to do register caching?
> P.S. I will hopefully be doing another talk on this stuff at BoostCon > in May - you should go!
Nice! Too bad "www.boostcon.com" currently issues a "500 - Internal Server Error" :o)
On Wed, Nov 17, 2010 at 11:13 PM, Gottlob Frege <gottlobfr...@gmail.com> wrote: >> Any flaws with this approach?
> Yes.
> Your thread will see flag correctly, but the "important data" being > protected/initted by the once, might not bee seen correctly.
<snip nice long explanation>
Another way to say it is that you are essentially trying to do the classic "double-checked locking" (DCL) pattern, which is well-known not to work and you can find plenty of explanations for. In fact, the BOOST_THREAD_ONCE* stuff is there to solve exactly that problem.
> Thanks for the detailed explanation! As you pointed out, the second thread > will mess up without the barrier. Time to fix my code :)
> One more small thing: IMHO the write reordering in the first > thread can't happen because of the memory barrier created > by interlockedExchange(). The first thread translates to this:
Yes, of course. Sorry, I removed that interlock as well in my example, since that question also often comes up, so I was basically just giving my standard explanation. I should have been more clear.
> Let me illustrate it a bit more:
.. > Don't we need an interlocked read here, too?
> Or does the mutex/memory barrier ensure the compiler > isn't allowed to do register caching?
Technically, since C++ still doesn't know anything about threads, there could be some register caching. But there isn't, because the compiler writers and thread packages work this all out for us, using whatever non-standard calls or extensions that are required.
for (i = 0; i < N; i++) { do_stuff(); if (some_condition) x++;
}
if (some_condition) unlock(mutex);
So the programmer's intent is that x is only updated sometimes, and that happens to be the only time that the mutex is needed. This is a case that, without being careful, compilers can optimize incorrectly. C++0x will take care of these cases.
But beyond some exotic cases with aggresive optimizations, normal mutexes with non-volatiles just works. And I don't think volatiles fixes the cases listed in the paper.
> Nice! Too bad "www.boostcon.com" currently issues > a "500 - Internal Server Error" :o)